Workshop: Week 8: Probabilistic Classification
===============================================================
Question 1: Naive Bayes predictions
===================================
Implement a binary multinomial Naive Bayes classifier with
Laplace smoothing. Run it on the C15 topic, with 1000 training
and 1000 test documents. What accuracy and F1 scores do you get?
WRAPPER CODE:
-------------
In:
http://www.williamwebber.com/comp90042/wksh/wk08/q1.code.tar.gz
I provide wrapper code to help (mostly taken from week 6). You'll
need to add the data files:
lyrl30k_tpcs.txt
lyrl_tokens_30k.dat
Then run:
python mk_bow_idx.py lyrl_tokens_30k.dat bow.idx
to create a shelf index "bow.idx" (you'll need to modify this if
shelf doesn't work on your system). The bow index maps from
docids to docvecs; each docvec has { term : f_dt }.
Then look at:
nb_bin.py
Its main function loads the index, creates the test and
train sets, and then creates an nb.MultinomialNaiveBayesClassifier
object. This object has two methods: train() and classify().
You have to implement these methods. Skeleton code is provided
in the nb.py file.
Save the output of nb_bin.py to a file (say, "C15.run"). You
can then calculate effectiveness by:
python score_bin_run.py C15.run lyrl30k_tpcs.txt run.txt
IMPLEMENTATION NOTE:
--------------------
Note the statistics you need to calculate (from Slides 6 and 8
of the lecture notes):
- $N_c$
- $N$
- $F_c$
- $F_{ci}$
- $|V|$
One way of implementing this (though not the only way) is to collect
and store all of these statistics (for each class for $F_c$ and $N_c$,
and for each term of each class for $F_{ci}$) in the train() method,
then evaluate the formula in the classify() method. Note that you
need to evaluate the formula for each class, then return the class
with the highest score. Be careful to avoid floating-point
underflow! (How does one do this?)
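One possible shape for this (a minimal sketch, not the required solution; the class and attribute names here are illustrative, and your skeleton in nb.py may differ) collects the statistics above in train() and scores each class in log space in classify(), since summing log-probabilities avoids the underflow that multiplying many small probabilities would cause:

```python
import math
from collections import defaultdict

class MultinomialNBSketch:
    """Illustrative multinomial Naive Bayes with Laplace smoothing.
    Scores are computed in log space to avoid floating-point underflow."""

    def train(self, train_set):
        # train_set: list of (docvec, cls) pairs; docvec maps term -> f_dt
        self.N = len(train_set)                   # N: total training docs
        self.N_c = defaultdict(int)               # N_c: docs per class
        self.F_c = defaultdict(int)               # F_c: total term occurrences per class
        self.F_ci = defaultdict(lambda: defaultdict(int))  # F_ci: per-class term counts
        vocab = set()
        for docvec, cls in train_set:
            self.N_c[cls] += 1
            for term, f_dt in docvec.items():
                self.F_c[cls] += f_dt
                self.F_ci[cls][term] += f_dt
                vocab.add(term)
        self.V = len(vocab)                       # |V|: vocabulary size

    def classify(self, docvec):
        best_cls, best_score = None, float("-inf")
        for cls in self.N_c:
            # log prior + sum of term log-likelihoods (Laplace-smoothed)
            score = math.log(self.N_c[cls] / self.N)
            for term, f_dt in docvec.items():
                p_tc = (self.F_ci[cls][term] + 1) / (self.F_c[cls] + self.V)
                score += f_dt * math.log(p_tc)
            if score > best_score:
                best_cls, best_score = cls, score
        return best_cls
```

The log-space trick is the answer to the underflow question: the product of per-term probabilities becomes a sum of logs, and the argmax over classes is unchanged because log is monotonic.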
Question 2: Naive Bayes probability estimates
=============================================
Modify your code for Question 1 so that instead of emitting class
predictions, it gives the probability of membership in the
positive class. Check your answer by also calculating probability
of membership in the negative class, and make sure that
$P(!c | d) + P(c | d) = 1$ (allowing for rounding errors).
Print out the probability of positive class membership, and
whether the document is actually in the positive class, for
all 1000 test documents on topic C15.
To do this, you need to estimate $P(d)$ from Equation 4 in the
notes. How does one do this?
Hint: if we have a set of events $E_1$, ..., $E_N$ that partitions
the probability space, such that:
$P(E_1) + P(E_2) + ... + P(E_N) = 1$
then, by the law of total probability, we can calculate $P(X)$ as:
$P(X) = P(X | E_1) P(E_1) + P(X | E_2) P(E_2) + ... + P(X | E_N) P(E_N)$
Note: to do this in practice, you'll again have to be careful
about floating-point underflow. This involves some lateral thinking!
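One way the lateral thinking can play out (a sketch under the assumption that your classifier already produces a log-space score for each class, i.e. $\log P(d|c)P(c)$): rather than exponentiating the raw log scores, which would underflow to zero, subtract their maximum first. This is the standard log-sum-exp trick, and the common factor cancels in the normalisation:

```python
import math

def posterior_from_log_scores(log_pos, log_neg):
    """Turn two log-space class scores, log P(d|c)P(c) and
    log P(d|!c)P(!c), into normalised posteriors P(c|d), P(!c|d).
    Subtracting the max before exponentiating avoids underflow."""
    m = max(log_pos, log_neg)
    e_pos = math.exp(log_pos - m)   # safe: exponent is <= 0 and near 0 for the max
    e_neg = math.exp(log_neg - m)
    denom = e_pos + e_neg           # plays the role of P(d), up to the shared factor e^m
    return e_pos / denom, e_neg / denom
```

With scores like -1000 and -1001, a naive `math.exp` would return 0.0 for both; the shifted version still recovers sensible posteriors, and you can use the pair to verify that $P(c|d) + P(!c|d) = 1$ up to rounding.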
Question 3: Logistic and logit functions
========================================
On slide 15 of the lecture notes (lecture 15), we assert that if:
logit(P) = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n
then:
P = 1 / (1 + e^(-(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)))
Prove this.
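Before (or after) writing the proof, you can sanity-check numerically that the logistic function inverts the logit; this is a spot check, not a substitute for the algebraic argument:

```python
import math

def logit(p):
    """log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def logistic(z):
    """Logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))

# logistic(logit(p)) should recover p for any p in (0, 1)
for p in (0.1, 0.25, 0.5, 0.9):
    assert abs(logistic(logit(p)) - p) < 1e-12
```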
Question 4: $\beta_0$
=====================
What does the learnt parameter $\beta_0$ mean? If $\beta_0 = 0$, and
the document has no terms in it, what is $P(c | d)$?