Workshop: Week 7: Language Models
===============================================================
Question 1: KL divergence
=========================
In lecture 12, slide 12, we calculated the log probability of
Shakespeare's play, "All's Well that Ends Well", for each of
five different source texts. Redo this calculation, but
using the KL divergence between the play, $p$, and each of the
source texts, $T$:
KL_D(M_p || M_T)
Some words will occur in the play that don't occur in the
source texts. We will therefore have to smooth the text
model with a background model, $M_B$:
M_T' = (1 - \lambda) M_T + \lambda M_B
(You don't need to do this for the play model (why?)).
A reasonable way to form $M_B$ is to aggregate the counts of
term occurrences in all six texts (the five source
texts and the play), and then form an MLE estimator from
these counts. (In principle, \lambda should be smaller where the
source text $T$ is larger, since we have a larger sample
of the desired language, but we won't go into this detail here).
The frequency data for each of the source texts can be found
at:
http://www.williamwebber.com/comp90042/wksh/wk07/dat/txt-freq.zip
Calculate the KL_D between the play and each of the source
texts, for \lambda = 0.4. Does the ranking agree with that
given in the lecture? Try \lambda = 0.1 and \lambda=0.9. Does
the ranking change? Why or why not?
IMPLEMENTATION NOTE:
You may find it useful to define the following functions:
load_freq(fname):
Load a frequency file into a {term: freq} dictionary
freq_to_prop(freq_dict):
Take a {term : freq} dictionary and produce a
{term: prop} dictionary, where prop is the proportion
of times the term appears (all props in one dictionary
should sum to 1).
sum_freq_dicts(freq_dict_list):
Sum the term frequencies in a list of {term: freq}
dictionaries, and return.
smooth_models(fg_mdl, bg_mdl, lmbda):
Smooth a foreground model (a {term: prop} dictionary)
with a background model, using the formula
M_S = (1 - \lambda) M_F + \lambda M_B
and return
kl_div(p, q):
Calculate KL divergence between two distributions
(expressed as {term : prop} dictionaries)
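A minimal sketch of these helpers, assuming each frequency file is
plain text with whitespace-separated `term count` lines (check this
against the actual files in txt-freq.zip):

```python
import math

def load_freq(fname):
    """Load a frequency file into a {term: freq} dictionary.
    Assumes whitespace-separated 'term count' lines."""
    freqs = {}
    with open(fname) as f:
        for line in f:
            term, count = line.split()
            freqs[term] = freqs.get(term, 0) + int(count)
    return freqs

def freq_to_prop(freq_dict):
    """Normalise counts to proportions that sum to 1."""
    total = float(sum(freq_dict.values()))
    return {t: f / total for t, f in freq_dict.items()}

def sum_freq_dicts(freq_dict_list):
    """Sum term frequencies across a list of {term: freq} dictionaries."""
    summed = {}
    for fd in freq_dict_list:
        for t, f in fd.items():
            summed[t] = summed.get(t, 0) + f
    return summed

def smooth_models(fg_mdl, bg_mdl, lmbda):
    """M_S = (1 - lambda) M_F + lambda M_B.  Iterates over the background
    vocabulary, which (by construction) is a superset of the foreground's."""
    return {t: (1 - lmbda) * fg_mdl.get(t, 0.0) + lmbda * p_bg
            for t, p_bg in bg_mdl.items()}

def kl_div(p, q):
    """KL divergence between two {term: prop} distributions; assumes q
    has been smoothed so that q[t] > 0 wherever p[t] > 0."""
    return sum(p_t * math.log(p_t / q[t]) for t, p_t in p.items() if p_t > 0)
```

The ranking for a given \lambda is then just kl_div(play_mdl,
smooth_models(text_mdl, bg_mdl, lmbda)) for each source text, sorted
ascending (lower divergence means a closer match).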
Question 2: KL divergence and LM
================================
In lecture 13, slide 6, we said that calculating document--query
match using the KL divergence between the MLE query model
and the document model:
R(q, D) = - KL_D (M_q || M_D)
is rank equivalent to regular LM:
S(q, D) = \prod_i P(q_i | M_D)
Prove the rank equivalence.
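One way to sketch the argument, writing $H(M_q)$ for the entropy of
the query model:

```latex
-\mathrm{KL}_D(M_q \| M_D)
  = -\sum_w P(w \mid M_q) \log \frac{P(w \mid M_q)}{P(w \mid M_D)}
  = \sum_w P(w \mid M_q) \log P(w \mid M_D) + H(M_q)
```

$H(M_q)$ does not depend on $D$, so dropping it does not change the
ranking. Under the MLE query model, $P(w \mid M_q) = f_{w,q}/|q|$, so
the remaining sum is $\frac{1}{|q|} \sum_i \log P(q_i \mid M_D)$; since
$1/|q|$ is a positive constant and $\log$ is monotone, this ranks
documents identically to the query likelihood $\prod_i P(q_i \mid M_D)$.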
Question 3: Language model querying
===================================
Implement an LM search engine. Use the LYRL30k dataset
as the corpus. For the document model, use:
M_D' = (1 - \lambda) M_D + \lambda M_C
Set \lambda to 0.4. Calculate the score of a document $D$ for query $q$
as:
R(D, q; C) = \sum_i \log P(q_i | M_D')
(Why do we do this rather than directly multiply the probabilities:
S(D, q; C) = \prod_i P(q_i | M_D')
? What is the relationship between R() and S()?)
Give the depth-10 ranking returned to the query
"oil price iraq tension".
IMPLEMENTATION NOTE:
The file:
http://www.williamwebber.com/comp90042/wksh/wk07/code/coll.py
contains parse_lyrl_coll(fname), which parses the LYRL30k dataset
into a dictionary of {docid: docbow}, where each docbow is a
{term: freq} dictionary. Use this to load up the dataset.
Iterate through the documents, and reuse freq_to_prop() to create
a foreground model for each document. Then invert this to
create a nested { term : {doc : P(t | D) } } dictionary. Write
this out to a shelve file (or other index).
Reuse sum_freq_dicts() and freq_to_prop() from question 1 to
create your background model. Then write this out to a file
(use pickle()).
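The indexing steps above might be sketched as follows; it assumes
`parse_lyrl_coll` from coll.py supplies the `{docid: {term: freq}}`
input, and `freq_to_prop` is the helper from Question 1:

```python
import pickle
import shelve

def freq_to_prop(freq_dict):
    """As in Question 1: normalise counts to proportions."""
    total = float(sum(freq_dict.values()))
    return {t: f / total for t, f in freq_dict.items()}

def invert_to_postings(docs):
    """{docid: {term: freq}} -> {term: {docid: P(t | M_D)}}."""
    postings = {}
    for docid, bow in docs.items():
        for term, prop in freq_to_prop(bow).items():
            postings.setdefault(term, {})[docid] = prop
    return postings

def collection_model(docs):
    """MLE background model over the whole collection's term counts."""
    coll = {}
    for bow in docs.values():
        for t, f in bow.items():
            coll[t] = coll.get(t, 0) + f
    return freq_to_prop(coll)

def write_index(postings, idx_fname, bg_mdl, bg_fname):
    """Persist postings to a shelve file and the background model
    to a pickle file."""
    with shelve.open(idx_fname, "n") as idx:
        for term, plist in postings.items():
            idx[term] = plist
    with open(bg_fname, "wb") as f:
        pickle.dump(bg_mdl, f)

# e.g.: docs = parse_lyrl_coll(coll_fname)   # from coll.py
#       write_index(invert_to_postings(docs), "lm.idx",
#                   collection_model(docs), "bg.pkl")
```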
Note that you should _not_ smooth the document models first
and then store them in the index, as this would create a
dictionary for each document as large as the collection vocabulary.
At query processing time, process terms one at a time, and
keep an accumulator dictionary of documents, adding the
term's contribution of
\log P(t | M_D')
to that document, as with standard inverted index processing.
Note, though, that you have to take some care. If a
query term does not appear in a document, it will still get
a score from the background model. We obviously don't want
to score documents that contain _no_ query terms. But if
a document contains _any_ query term, then for each query
term $w$ it does not contain, it should still receive the
background contribution of:
\log \lambda P(w | M_C)
An easy (though slightly inefficient) way to do this is
to first figure out the list of documents that have at
least one query term (that is, create an accumulator
dictionary where every candidate document has a score
of 0.0 at the outset). Then, when processing a query
term, iterate through the list of candidate documents,
_not_ the list of documents that the term appears in. If
the term does not occur in the candidate document, then
give that document the background score.
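A sketch of this query-processing loop, assuming `index` behaves like
the `{term: {docid: P(t | M_D)}}` mapping described above (e.g. a
shelve), and that query terms absent from the whole collection are
simply skipped:

```python
import math

def score_query(query_terms, index, bg_mdl, lmbda=0.4):
    """Rank candidate documents by R(D, q) = sum_i log P(q_i | M_D')."""
    # Candidates: every document containing at least one query term,
    # with an initial accumulator score of 0.0.
    accum = {}
    for t in query_terms:
        for docid in index.get(t, {}):
            accum[docid] = 0.0
    # Score term by term, iterating over the candidate list rather than
    # the postings list, so candidates lacking a term still pick up its
    # background contribution.
    for t in query_terms:
        postings = index.get(t, {})
        bg = lmbda * bg_mdl.get(t, 0.0)
        if not postings and bg == 0.0:
            continue  # term absent from the entire collection
        for docid in accum:
            # Because the background model aggregates the whole
            # collection, bg > 0 for any term with a postings list,
            # so this log is always defined.
            p = (1 - lmbda) * postings.get(docid, 0.0) + bg
            accum[docid] += math.log(p)
    return sorted(accum.items(), key=lambda kv: kv[1], reverse=True)
```

The depth-10 ranking is then score_query("oil price iraq
tension".split(), index, bg_mdl)[:10].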
Question 4: Relevance model
===========================
In lecture 13, we saw how one could build a "model of relevance"
by using pseudo-relevance feedback. We used this model to
improve search results, and also to perform CLIR. We can
also directly interrogate the model to see what terms have a
high weight in it; this helps interpret the model.
Write code to calculate a relevance model using the formula
given by Lavrenko and Croft (2001) (lecture 13, slide 15).
Set their parameter $\lambda$ to 0.6. Instantiate the model
(that is, calculate actual
P(w | q; {\cal F})
) for the words in the feedback vocabulary (w \in V_{\cal F}).
Report the top 20 highest-weighted terms in the relevance model
for "oil price iraq tension", built from the top 10 results
returned in Question 3.
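A sketch of one possible instantiation: an RM1-style estimate with
\lambda-smoothed document models, in which each feedback document is
weighted by its query likelihood. Check the details against the
formula on slide 15, since the exact form given there may differ:

```python
def relevance_model(query_terms, fb_docs, bg_mdl, lmbda=0.6):
    """fb_docs: {docid: {term: P(t | M_D)}} for the top-k feedback
    documents; bg_mdl: {term: P(t | M_C)}.  Returns {term: P(w | q, F)}
    over the feedback vocabulary V_F."""
    def p(t, mdl):
        # lambda-smoothed document-model probability
        return (1 - lmbda) * mdl.get(t, 0.0) + lmbda * bg_mdl.get(t, 0.0)
    # Weight each feedback document by its query likelihood.
    weights = {}
    for d, mdl in fb_docs.items():
        w = 1.0
        for t in query_terms:
            w *= p(t, mdl)
        weights[d] = w
    total = sum(weights.values())
    # Feedback vocabulary: terms appearing in any feedback document.
    vocab = set()
    for mdl in fb_docs.values():
        vocab.update(mdl)
    return {t: sum(weights[d] * p(t, mdl)
                   for d, mdl in fb_docs.items()) / total
            for t in vocab}

def top_terms(rel_mdl, k=20):
    """The k highest-weighted terms in the relevance model."""
    return sorted(rel_mdl, key=rel_mdl.get, reverse=True)[:k]
```

Feeding in the top-10 document models from Question 3 and calling
top_terms(rel_mdl, 20) gives the report required here.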