Workshop: Week 8: Probabilistic Classification
===============================================================
Question 1: Naive Bayes predictions
===================================
Implement a binary multinomial Naive Bayes classifier with
Laplace smoothing. Run it on the C15 topic, with 1000 training
and 1000 test documents. What accuracy and F1 scores do you get?
WRAPPER CODE:
-------------
In:
http://www.williamwebber.com/comp90042/wksh/wk08/q1.code.tar.gz
I provide wrapper code to help (mostly taken from week 6). You'll
need to add the data files:
lyrl30k_tpcs.txt
lyrl_tokens_30k.dat
Then run:
python mk_bow_idx.py lyrl_tokens_30k.dat bow.idx
to create a shelf index "bow.idx" (you'll need to modify this if
shelf doesn't work on your system). The bow index maps from
docids to docvecs; each docvec has { term : f_dt }.
Then look at:
nb_bin.py
Its main function loads the index, creates the test and
train sets, and then creates an nb.MultinomialNaiveBayesClassifier
object. This object has two methods: train() and classify().
You have to implement these methods. Skeleton code is provided
in the nb.py file.
Save the output of nb_bin.py to a file (say, "C15.run"). You
can then calculate effectiveness by:
python score_bin_run.py C15.run lyrl30k_tpcs.txt run.txt
IMPLEMENTATION NOTE:
--------------------
Note the statistics you need to calculate (from Slides 6 and 8
of the lecture notes):
- $N_c$
- $N$
- $F_c$
- $F_{ci}$
- $|V|$
One way of implementing this (though not the only way) is to collect
and store all of these statistics (for each class for $F_c$ and $N_c$,
and for each term of each class for $F_{ci}$) in the train() method,
then evaluate the formula in the classify() method. Note that you
need to evaluate the formula for each class, then return the class
with the highest score. Be careful to avoid floating-point
underflow! (How does one do this?)
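One possible shape for this (a minimal sketch, not the required solution; the class and attribute names here are illustrative, and your skeleton in nb.py may differ) collects the statistics above in train() and scores each class in log space in classify(), since summing log-probabilities avoids the underflow that multiplying many small probabilities would cause:

```python
import math
from collections import defaultdict

class MultinomialNBSketch:
    """Illustrative multinomial Naive Bayes with Laplace smoothing.
    Scores are computed in log space to avoid floating-point underflow."""

    def train(self, train_set):
        # train_set: list of (docvec, cls) pairs; docvec maps term -> f_dt
        self.N = len(train_set)                   # N: total training docs
        self.N_c = defaultdict(int)               # N_c: docs per class
        self.F_c = defaultdict(int)               # F_c: total term occurrences per class
        self.F_ci = defaultdict(lambda: defaultdict(int))  # F_ci: per-class term counts
        vocab = set()
        for docvec, cls in train_set:
            self.N_c[cls] += 1
            for term, f_dt in docvec.items():
                self.F_c[cls] += f_dt
                self.F_ci[cls][term] += f_dt
                vocab.add(term)
        self.V = len(vocab)                       # |V|: vocabulary size

    def classify(self, docvec):
        best_cls, best_score = None, float("-inf")
        for cls in self.N_c:
            # log prior + sum of term log-likelihoods (Laplace-smoothed)
            score = math.log(self.N_c[cls] / self.N)
            for term, f_dt in docvec.items():
                p_tc = (self.F_ci[cls][term] + 1) / (self.F_c[cls] + self.V)
                score += f_dt * math.log(p_tc)
            if score > best_score:
                best_cls, best_score = cls, score
        return best_cls
```

The log-space trick is the answer to the underflow question: the product of per-term probabilities becomes a sum of logs, and the argmax over classes is unchanged because log is monotonic.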
Question 2: Naive Bayes probability estimates
=============================================
Modify your code for Question 1 so that instead of emitting class
predictions, it gives the probability of membership in the
positive class. Check your answer by also calculating probability
of membership in the negative class, and make sure that
$P(!c | d) + P(c | d) = 1$ (allowing for rounding errors).
Print out the probability of positive class membership, and
whether the document is actually in the positive class, for
all 1000 test documents on topic C15.
To do this, you need to estimate $P(d)$ from Equation 4 in the
notes. How does one do this?
Hint: if we have a set of events $E_1$, ..., $E_N$ that partitions
the probability space, such that:
$P(E_1) + P(E_2) + ... + P(E_N) = 1$
then, by the law of total probability, we can calculate $P(X)$ as:
$P(X) = P(X | E_1) P(E_1) + P(X | E_2) P(E_2) + ... + P(X | E_N) P(E_N)$
Note: to do this in practice, you'll again have to be careful
about floating-point underflow. This involves some lateral thinking!
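One way the lateral thinking can play out (a sketch under the assumption that your classifier already produces a log-space score for each class, i.e. $\log P(d|c)P(c)$): rather than exponentiating the raw log scores, which would underflow to zero, subtract their maximum first. This is the standard log-sum-exp trick, and the common factor cancels in the normalisation:

```python
import math

def posterior_from_log_scores(log_pos, log_neg):
    """Turn two log-space class scores, log P(d|c)P(c) and
    log P(d|!c)P(!c), into normalised posteriors P(c|d), P(!c|d).
    Subtracting the max before exponentiating avoids underflow."""
    m = max(log_pos, log_neg)
    e_pos = math.exp(log_pos - m)   # safe: exponent is <= 0 and near 0 for the max
    e_neg = math.exp(log_neg - m)
    denom = e_pos + e_neg           # plays the role of P(d), up to the shared factor e^m
    return e_pos / denom, e_neg / denom
```

With scores like -1000 and -1001, a naive `math.exp` would return 0.0 for both; the shifted version still recovers sensible posteriors, and you can use the pair to verify that $P(c|d) + P(!c|d) = 1$ up to rounding.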
Question 3: Logistic and logit functions
========================================
On slide 15 of the lecture notes (lecture 15), we assert that if:
logit(P) = \beta_0 + \beta_1 x_1 + ... + \beta_n x_n
then:
P = 1 / (1 + e^(-(\beta_0 + \beta_1 x_1 + ... + \beta_n x_n)))
Prove this.
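Before (or after) writing the proof, you can sanity-check numerically that the logistic function inverts the logit; this is a spot check, not a substitute for the algebraic argument:

```python
import math

def logit(p):
    """log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def logistic(z):
    """Logistic (sigmoid) function: 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))

# logistic(logit(p)) should recover p for any p in (0, 1)
for p in (0.1, 0.25, 0.5, 0.9):
    assert abs(logistic(logit(p)) - p) < 1e-12
```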
Question 4: $\beta_0$
=====================
What does the learnt parameter $\beta_0$ mean? If $\beta_0 = 0$, and
the document has no terms in it, what is $P(c | d)$?