A vector space search involves converting documents into vectors. Each dimension within the vectors represents a term. If a document contains that term then the value within the vector is greater than zero.

Here is an implementation of Vector space searching using python (2.4+).

1 Stemming & Stop words

Fetch all terms within documents and clean - use a stemmer to reduce. A stemmer takes words and tries to reduce them to there base or root. Words which have a common stem often have similar meanings.
Example:
CONNECTED
CONNECTING
CONNECTION
CONNECTIONS

all map to CONNECT

We also remove any stopwords from the documents. [a,am,an,also,any,and] are all examples of stopwords in English. Stop words have little value in search so we strip them. The stoplist used was taken from: ftp://ftp.cs.cornell.edu/pub/smart/english.stop

 self.stemmer = PorterStemmer()
  1.  
  2. def removeStopWords(self,list):
  3. """ Remove common words which have no search value """
  4. return [word for word in list if word not in self.stopwords ]
  5.  
  6.  
  7. def tokenise(self, string):
  8. """ break string up into tokens and stem words """
  9. string = self.clean(string)
  10. words = string.split(" ")
  11.  
  12. return [self.stemmer.stem(word,0,len(word)-1) for word in words]
  13.  

2 Map Keywords to Vector Dimensions

Map the vector dimensions to keywords using a dictionary: keyword=>position

  1.  
  2. def getVectorKeywordIndex(self, documentList):
  3. """ create the keyword associated to the position of the elements within the document vectors """
  4.  
  5. #Mapped documents into a single word string
  6. vocabularyString = " ".join(documentList)
  7.  
  8. vocabularyList = self.parser.tokenise(vocabularyString)
  9. #Remove common words which have no search value
  10. vocabularyList = self.parser.removeStopWords(vocabularyList)
  11. uniqueVocabularyList = util.removeDuplicates(vocabularyList)
  12.  
  13. vectorIndex={}
  14. offset=0
  15. #Associate a position with the keywords which maps to the dimension on the vector used to represent this word
  16. for word in uniqueVocabularyList:
  17. vectorIndex[word]=offset
  18. offset+=1
  19. return vectorIndex  #(keyword:position)
  20.  
  21.  

3 Map Document Strings to Vectors.

We use the simple Term Count Model. A more accurate model would be to use tf-idf (termFrequency-inverseDocumentFrequnecy).

  1.  
  2. def makeVector(self, wordString):
  3. """ @pre: unique(vectorIndex) """
  4.  
  5. #Initialise vector with 0’s
  6. vector = [0] * len(self.vectorKeywordIndex)
  7. wordList = self.parser.tokenise(wordString)
  8. wordList = self.parser.removeStopWords(wordList)
  9. for word in wordList:
  10. vector[self.vectorKeywordIndex[word]] += 1; #Use simple Term Count Model
  11. return vector
  12.  

4 Find Related Documents

We now have the ability to find related documents. We can test if two documents are in the concept space by looking at the the cosine of the angle between the document vectors. We use the cosine of the angle as a metric for comparison. If the cosine is 1 then the angle is 0° and hence the vectors are parallel (and the document terms are related). If the cosine is 0 then the angle is 90° and the vectors are perpendicular (and the document terms are not related).

We calculate the cosine between the document vectors in python using scipy.

  1.  
  2. def cosine(vector1, vector2):
  3. """ related documents j and q are in the concept space by comparing the vectors :
  4. cosine  = ( V1 * V2 ) / ||V1|| x ||V2|| """
  5. return float(dot(vector1,vector2) / (norm(vector1) * norm(vector2)))
  6.  

5 Search the Vector Space!

In order to perform a search across keywords we need to map the keywords to the vector space. We create a temporary document which represents the search terms and then we compare it against the document vectors using the same cosine measurement mentioned for relatedness.

  1.  
  2. def search(self,searchList):
  3. """ search for documents that match based on a list of terms """
  4. queryVector = self.buildQueryVector(searchList)
  5.  
  6. ratings = [util.cosine(queryVector, documentVector) for documentVector in self.documentVectors]
  7. ratings.sort(reverse=True)
  8. return ratings
  9.  

Further Extensions

  1. Use tf-idf rather than the Term count model for term weightings
  2. Instead of linear processing of all document vectors when searching for related content use: Lanczos methods OR a neural network-like approach.
  3. Moving towards Latent Semantic analysis, Probabilistic latent semantic analysis or Latent Dirichlet allocation.

Third Party tools

The stemmer used comes from:
http://tartarus.org/~martin/PorterStemmer/python.txt

And the library for performing cosine calculations comes from NumPy:
http://www.scipy.org/

Downloads

Download all files

VectorSpace.py
util.py
Parser.py

PorterStemmer.py

VectorSpace.py - Core vector space search logic

  1. from pprint import pprint
  2. from Parser import Parser
  3. import util
  4.  
  5. class VectorSpace:
  6. """ A algebraic model for representing text documents as vectors of identifiers.
  7. A document is represented as a vector. Each dimension of the vector corresponds to a
  8. separate term. If a term occurs in the document, then the value in the vector is non-zero.
  9. """
  10.  
  11. #Collection of document term vectors
  12. documentVectors = []
  13.  
  14. #Mapping of vector index to keyword
  15. vectorKeywordIndex=[]
  16.  
  17. #Tidies terms
  18. parser=None
  19.  
  20.  
  21. def __init__(self, documents=[]):
  22. self.documentVectors=[]
  23. self.parser = Parser()
  24. if(len(documents)>0):
  25. self.build(documents)
  26.  
  27.  
  28. def build(self,documents):
  29. """ Create the vector space for the passed document strings """
  30. self.vectorKeywordIndex = self.getVectorKeywordIndex(documents)
  31.  
  32. self.documentVectors = [self.makeVector(document) for document in documents]
  33.  
  34.  
  35. def getVectorKeywordIndex(self, documentList):
  36. """ create the keyword associated to the position of the elements within the document vectors """
  37.  
  38. #Mapped documents into a single word string
  39. vocabularyString = " ".join(documentList)
  40.  
  41. vocabularyList = self.parser.tokenise(vocabularyString)
  42. #Remove common words which have no search value
  43. vocabularyList = self.parser.removeStopWords(vocabularyList)
  44. uniqueVocabularyList = util.removeDuplicates(vocabularyList)
  45.  
  46. vectorIndex={}
  47. offset=0
  48. #Associate a position with the keywords which maps to the dimension on the vector used to represent this word
  49. for word in uniqueVocabularyList:
  50. vectorIndex[word]=offset
  51. offset+=1
  52. return vectorIndex  #(keyword:position)
  53.  
  54.  
  55. def makeVector(self, wordString):
  56. """ @pre: unique(vectorIndex) """
  57.  
  58. #Initialise vector with 0’s
  59. vector = [0] * len(self.vectorKeywordIndex)
  60. wordList = self.parser.tokenise(wordString)
  61. wordList = self.parser.removeStopWords(wordList)
  62. for word in wordList:
  63. vector[self.vectorKeywordIndex[word]] += 1; #Use simple Term Count Model
  64. return vector
  65.  
  66.  
  67. def buildQueryVector(self, termList):
  68. """ convert query string into a term vector """
  69. query = self.makeVector(" ".join(termList))
  70. return query
  71.  
  72.  
  73. def related(self,documentId):
  74. """ find documents that are related to the document indexed by passed Id within the document Vectors"""
  75. ratings = [util.cosine(self.documentVectors[documentId], documentVector) for documentVector in self.documentVectors]
  76. ratings.sort(reverse=True)
  77. return ratings
  78.  
  79.  
  80. def search(self,searchList):
  81. """ search for documents that match based on a list of terms """
  82. queryVector = self.buildQueryVector(searchList)
  83.  
  84. ratings = [util.cosine(queryVector, documentVector) for documentVector in self.documentVectors]
  85. ratings.sort(reverse=True)
  86. return ratings
  87.  
  88.  
  89. if __name__ == ‘__main__’:
  90. #test data
  91. documents = ["The cat in the hat disabled", "A cat is a fine pet ponies.", "Dogs and cats make good pets.","I haven’t got a hat."]
  92.  
  93. vectorSpace= VectorSpace(documents)
  94. pprint(vectorSpace.related(0))
  95. pprint(vectorSpace.search(["cat"]))
  96.  

5 Responses to “Building a Vector Space Search Engine in Python”

  1. babajan Says:

    Hi, where I can donwloads the source code?

    VectorSpace.py
    util.py
    Parser.py

    PorterStemmer.py
    All are 404 response.

    Thank’s

  2. Joseph Wilk Says:

    All code downloads should now work + the tar.gzip download of all files.

  3. dineshv Says:

    In parser.py, the tokenise function:

    def tokenise(self, string):
    “”" break string up into tokens and stem words “”"
    string = self.clean(string)
    words = string.split(” “)
    return [self.stemmer.stem(word,0,len(word)-1) for word in words]

    causes a ‘no instance’ error because it cannot find the ‘.stem’ in stemmer.stem.

    Dinesh

  4. dineshv Says:

    Wrt my previous note the download zip file only included a portion of the PorterStemmer.py file. Anyway, I was just interested in a Python stemming solution which I got from http://tartarus.org/~martin/PorterStemmer/python.txt. Cheers!

    Dinesh

  5. links for 2008-07-03 « Breyten’s Dev Blog Says:

    [...] Building a Vector Space Search Engine in Python (tags: code python search) [...]

Leave a Reply