JoeSniff

Joseph Wilk - JoeSniff

Joseph Wilk’s Blog - Monkey amongst men

Latent Semantic Analysis in Python

Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document isolated from the others it looks at all the documents as a whole and the terms within them to identify relationships. An example of LSA: Using a search engine search for "sand". Documents are returned which do not contain the search term "sand" but contains ...

Building a Vector Space Search Engine in Python

A vector space search involves converting documents into vectors. Each dimension within the vectors represents a term. If a document contains that term then the value within the vector is greater than zero. Here is an implementation of Vector space searching using python (2.4+). 1 Stemming & Stop words Fetch all terms within documents and clean - use a stemmer to reduce. A stemmer takes words and tries to reduce them ...

Automatic Tag Generation

This project looked at dynamically generating suggestion tags for content. To simplify the task some constraints where introduced. The content which will be tagged is news articles with HTML markup. Only English content. I used the following HTML page to experiment on with suggestion tags: http://news.bbc.co.uk/1/hi/entertainment/6624223.stm To help evaluate the tagging methods I asked a sample of people to suggest what they thought the best tags would be. They came up with: paris, ...

Continue