Holy Trinity School, Richmond Hill, Ontario, CANADA
Microformats such as Twitter tweets, Facebook status updates, and news headlines represent the fastest-growing information medium; in fact, the number of tweets published each day is growing by a factor of over 100 each year. However, microsearch, or search over very short documents such as tweets, is particularly difficult because length constraints limit the number of words, and hence the number of possible search terms, in each document, causing a problem called lexical mismatch. In this project, a novel language modelling information retrieval algorithm for microsearch, called "Apodora", was developed. It uses the limiting distributions of Markov chain-like stochastic processes as a means of semantic smoothing and weighted document expansion. Apodora identifies contextual statistical relationships in the semantics of words and exploits them to reduce the impact of the choice of query terms on the search results. A theoretical framework motivating the algorithm was formulated, and proofs of convergence were established. Apodora was implemented in a scalable information retrieval system based on MapReduce techniques. When compared to the popular vector space model on the CACM corpus, the system yielded results with approximately twice the precision. On the new TREC 2011 Microblog corpus, Apodora exceeded the overall TREC median precisions and achieved the highest mean average precision of any published result that did not use a combination of algorithms or Twitter-specific metadata analysis. Apodora was shown to be a powerful information retrieval algorithm for general microsearch, with applications in the socially and economically important microblog, news feed, and medical search industries.
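The abstract does not give Apodora's actual update rule, so the following is only a minimal sketch of the general idea of document expansion through the limiting distribution of a Markov chain-like process. It assumes that terms are linked by document-level co-occurrence and that the walk restarts at the document's own terms with probability `alpha`; the function names `build_transitions` and `expand`, the restart weight, and the toy corpus are all hypothetical and are not taken from the project.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_transitions(docs):
    """Row-normalised term-to-term co-occurrence probabilities.

    Assumption: two terms are related if they appear in the same document.
    """
    counts = defaultdict(Counter)
    for doc in docs:
        for t, u in permutations(set(doc), 2):
            counts[t][u] += 1
    transitions = {}
    for t, row in counts.items():
        total = sum(row.values())
        transitions[t] = {u: c / total for u, c in row.items()}
    return transitions


def expand(doc, transitions, alpha=0.5, tol=1e-10, max_iter=1000):
    """Expanded term distribution for one document.

    Iterates p <- alpha * p0 + (1 - alpha) * p P, where p0 is the document's
    term-frequency distribution and P is the transition matrix. Because
    (1 - alpha) < 1, the iteration is a contraction and converges to a
    unique limiting distribution.
    """
    tf = Counter(doc)
    n = sum(tf.values())
    p0 = {t: c / n for t, c in tf.items()}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = defaultdict(float)
        # Restart step: return to the document's original terms.
        for t, mass in p0.items():
            nxt[t] += alpha * mass
        # Walk step: spread the current mass along co-occurrence links.
        for t, mass in p.items():
            row = transitions.get(t) or {t: 1.0}  # isolated term: self-loop
            for u, prob in row.items():
                nxt[u] += (1 - alpha) * mass * prob
        keys = set(nxt) | set(p)
        delta = sum(abs(nxt.get(t, 0.0) - p.get(t, 0.0)) for t in keys)
        p = dict(nxt)
        if delta < tol:
            break
    return p


if __name__ == "__main__":
    docs = [
        ["car", "engine", "repair"],
        ["automobile", "engine", "service"],
        ["banana", "bread", "recipe"],
    ]
    transitions = build_transitions(docs)
    for doc in docs:
        model = expand(doc, transitions)
        print(doc, "P(automobile) =", model.get("automobile", 0.0))
```

In this toy corpus the first document never mentions "automobile", yet its expanded distribution assigns that term some probability through the shared term "engine", while the unrelated third document assigns it none. This is the kind of effect that would reduce lexical mismatch between a query and a very short document.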