Dec 19, 2007
Latent Semantic Analysis (LSA) is a mathematical method that tries to bring out latent relationships within a collection of documents. Rather than looking at each document isolated from the others it looks at all the documents as a whole and the terms within them to identify relationships.
An example of LSA:
Using a search engine search for "sand".
Documents are returned which do not contain the search term "sand" but contains ...
Nov 27, 2007
A vector space search involves converting documents into vectors. Each dimension within the vectors represents a term. If a document contains that term then the value within the vector is greater than zero.
Here is an implementation of Vector space searching using python (2.4+).
1 Stemming & Stop words
Fetch all terms within documents and clean - use a stemmer to reduce. A stemmer takes words and tries to reduce them ...
Oct 22, 2007
This project looked at dynamically generating suggestion tags for content. To simplify the task some constraints where introduced.
The content which will be tagged is news articles with HTML markup.
Only English content.
I used the following HTML page to experiment on with suggestion tags: http://news.bbc.co.uk/1/hi/entertainment/6624223.stm
To help evaluate the tagging methods I asked a sample of people to suggest what they thought the best tags would be. They came up with:
paris, ...