IN4080 2018, Mandatory assignment 2, part A (= Exercise set 5)

Mandatory assignment 2 consists of two parts:

- Part A (= Exercise set 5) on text classification
- Part B (= Exercise set 6) on sequence classification

You should answer both parts. It is possible to get 50 points on each part, 100 points altogether. You are required to get at least 60 points to pass. It is more important that you try to answer each question than that you get it correct.

======================

- Your answer should be delivered in Devilry no later than Tuesday, 16 October at 23:59.
- We assume that you have read and are familiar with IFI's requirements and guidelines for mandatory assignments:
  - /english/studies/examinations/compulsory-activities/mn-ifi-mandatory.html
  - /english/studies/examinations/compulsory-activities/mn-ifi-guidelines.html
- This is an individual assignment. You should not deliver joint submissions.
- You may redeliver in Devilry before the deadline, but include all files in the last delivery. Only the last delivery will be read!
- You should make only one submission, including both part A and part B. If you deliver more than one file, put them into a zip archive.
- Name your submission _in4080_submission_2
- You may choose your format of presentation. One possibility is to deliver a text document, preferably in PDF, with all the answers asked for, together with separate code files. Another possibility is to use Jupyter notebooks, if you are familiar with the format.

=======================

The original plan was that this exercise set should assume, but not include, exercise set 4; it should start where exercise set 4 ends. Our impression is, however, that not everybody has done exercise set 4. Hence, we make it simpler: we back off, reuse most of exercise set 4 as part of this mandatory assignment, and only add a few points. So do not be surprised if you feel you have read this before.

Exercise 1 (0 points)

a) We will work interactively in python/ipython/notebook. We start by importing the toolboxes we will be using:

import nltk
import random
import numpy as np
import scipy as sp
import sklearn

We will use the Movie Reviews Corpus that comes with NLTK:

from nltk.corpus import movie_reviews

b) We can import the documents similarly to how it is done in the NLTK book for the Bernoulli naive Bayes, with one change. The NLTK book uses the tokenized texts, obtained with the command

movie_reviews.words(fileid)

Following the recipe from the scikit-learn "Working with text data" page, we can instead use the raw documents, which we can get from NLTK by

movie_reviews.raw(fileid)

scikit-learn will then do the tokenization for us as part of count_vect.fit_transform.

c) We will shuffle the data and split it into 200 documents for final testing (which we will not use for a while) and 1800 documents for development. See exercise set 2, exercise 2. If you did that exercise, you are recommended to use the same split, so that you can compare results.

d) Then split the development data into 1600 documents for training and 200 for a development test set; call them train_data and dev_test_data. The train_data should now be a list of 1600 items, where each item is a pair of a text (represented as a string) and a label. You should then split this train_data into two lists, each of 1600 elements: train_texts, containing the text (as a string) of each document, and train_target, containing the corresponding 1600 labels. Do the same to dev_test_data. One possible way to construct these splits is sketched below.
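For concreteness, here is one possible way to build the splits in c) and d). This is only a sketch: the seed value is an arbitrary choice that makes the shuffle reproducible, and any fixed split will do.

# Collect (raw text, label) pairs for all 2000 reviews
raw_data = [(movie_reviews.raw(fileid), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.seed(2018)                 # arbitrary seed, only for reproducibility
random.shuffle(raw_data)
test_data = raw_data[:200]        # final test set, tucked away until exercise 6
dev_data = raw_data[200:]         # 1800 documents for development
train_data = dev_data[:1600]
dev_test_data = dev_data[1600:]
# Separate the texts from the labels
train_texts = [text for (text, label) in train_data]
train_target = [label for (text, label) in train_data]
dev_test_texts = [text for (text, label) in dev_test_data]
dev_test_target = [label for (text, label) in dev_test_data]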
e) It is then time to extract features from the texts. We import

from sklearn.feature_extraction.text import CountVectorizer

We then make a CountVectorizer v. This first considers the whole set of training data, to determine which features to extract:

v = CountVectorizer()
v.fit(train_texts)

Then we use this vectorizer to extract features from the training data and the test data:

train_vectors = v.transform(train_texts)
dev_test_vectors = v.transform(dev_test_texts)

To understand what is going on, you may inspect train_vectors a little more.

f) We are now ready to train a classifier:

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(train_vectors, train_target)

g) We can proceed and see how the classifier classifies one test object, e.g. dev_test_texts[14]:

clf.predict(dev_test_vectors[14])

We can use the same procedure to predict the results for all the test data:

clf.predict(dev_test_vectors)

We can use this for further evaluation (accuracy, recall, precision, etc.) by comparing the predictions to dev_test_target. Alternatively, we can get the accuracy directly by

clf.score(dev_test_vectors, dev_test_target)

Congratulations! You have now made and tested a multinomial naive Bayes text classifier.

Deliveries: No deliveries for this exercise.

Exercise 2 (10 points)

a) To make it easier to rerun the experiment and proceed to cross-validation, put most of exercise 1 into a procedure

multi_nb_exp(train_data, test_data):
    """train_data is a list of pairs, where the
           first element is a string representing a text and the
           second element is a string representing a label;
       test_data has the same form.
       Train a MultinomialNB on train_data, test on test_data and return the result.
    """

Beware: when you run this experiment to reconstruct exercise 1, the call should be

multi_nb_exp(train_data, dev_test_data)

Check that you get the same result as in exercise 1.

b) Make a procedure for n-fold cross-validation

n_fold(experiment, dev_data, n=10):
    """experiment is an experiment like multi_nb_exp;
       dev_data is a set of pairs, where the
           first element is a string representing a text and the
           second element is a string representing a label.
       Run an n-fold cross-validation of experiment on dev_data and return the results.
    """

Then run

n_fold(multi_nb_exp, dev_data, n=9)

and calculate the accuracies for each of the 9 runs, the mean accuracy and the standard deviation of the accuracies. (One possible implementation is sketched after exercise 3 below.)

Deliveries: Code and results of running the code as described.

Exercise 3 (10 points)

We have so far used the default parameters of the procedures from scikit-learn. These procedures have, however, many parameters, and to get optimal results we should adjust them. We can go back to the split in exercise 1, and use train_data for training various models and dev_test_data for testing and comparing them.

a) To see the parameters for CountVectorizer, we may use

help(CountVectorizer)

In ipython we may alternatively use

CountVectorizer?

One thing we observe is that CountVectorizer lowercases by default. For a different corpus, it could be interesting to check the effect of this feature, but movie_reviews.raw() is already lowercased, so it has no effect here. (You may check!)

Another interesting parameter is binary. Setting this to True implies only recording whether a word occurs in a document, not how many times it occurs. It could be interesting to see the effect of this parameter.

The parameter ngram_range=[1, 1] means that we use tokens (= unigrams) only, [2, 2] means that we use bigrams only, while [1, 2] means both unigrams and bigrams.

Run experiments where you let binary vary over [False, True] and ngram_range vary over [[1, 1], [1, 2], [1, 3]], and report the results. Which combination of parameters yields the best result? (One possible experiment loop is sketched below.)

Deliveries: Code and results of running the code as described.
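As promised, here is one possible implementation of the exercise 2 procedures. It is only a sketch, assuming the imports and the list-of-pairs data format from exercise 1; your own solution may be organized differently.

def multi_nb_exp(train_data, test_data):
    """Train a MultinomialNB on train_data and return its accuracy on
    test_data. Both arguments are lists of (text, label) pairs."""
    train_texts = [text for (text, label) in train_data]
    train_target = [label for (text, label) in train_data]
    test_texts = [text for (text, label) in test_data]
    test_target = [label for (text, label) in test_data]
    v = CountVectorizer()
    v.fit(train_texts)
    clf = MultinomialNB()
    clf.fit(v.transform(train_texts), train_target)
    return clf.score(v.transform(test_texts), test_target)

def n_fold(experiment, dev_data, n=10):
    """Run an n-fold cross-validation of experiment on dev_data and
    return the list of accuracies, one per fold."""
    size = len(dev_data) // n
    scores = []
    for i in range(n):
        test_fold = dev_data[i * size:(i + 1) * size]    # fold i for testing
        train_folds = dev_data[:i * size] + dev_data[(i + 1) * size:]  # the rest for training
        scores.append(experiment(train_folds, test_fold))
    return scores

scores = n_fold(multi_nb_exp, dev_data, n=9)
print(scores, np.mean(scores), np.std(scores))

For exercise 3, the parameter search can be a simple pair of loops; again only a sketch (note that scikit-learn also accepts tuples such as (1, 2) for ngram_range):

for binary in [False, True]:
    for ngram_range in [(1, 1), (1, 2), (1, 3)]:
        v = CountVectorizer(binary=binary, ngram_range=ngram_range)
        train_vectors = v.fit_transform(train_texts)
        clf = MultinomialNB().fit(train_vectors, train_target)
        accuracy = clf.score(v.transform(dev_test_texts), dev_test_target)
        print(binary, ngram_range, accuracy)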
Exercise 4 (10 points)

We will explore how the Bernoulli naive Bayes, which is used in the NLTK book for the movie data set, is implemented in scikit-learn. We can follow the NLTK book and extract features similarly, i.e., use document_features(document) to extract the features of a document into a dictionary. Remember that we here have to use movie_reviews.words(fileid), where we in exercise 1 used movie_reviews.raw(fileid). We can use the same splits as in exercise 1. The train_data will now be a list of 1600 items, where each item is a pair whose first element is a dictionary.

It remains to transform these dictionaries to numpy arrays of the form scikit-learn accepts. For this, we can use scikit-learn's DictVectorizer; see section 4.2.1 in http://scikit-learn.org/stable/modules/feature_extraction.html. (Alternatively, you can extract the features directly as arrays without using dictionaries. It is a little more work, but the experiments will run faster.)

a) Make a procedure for the experiment (one possible shape is sketched at the end of this part)

bernoulli_nb_exp(train_data, test_data):
    """train_data is a list of pairs, where the
           first element is a feature dictionary and the
           second element is a string representing a label;
       test_data has the same form.
       Train a BernoulliNB on train_data, test on test_data and return the result.
    """

b) Combining this with exercise 2, run

n_fold(bernoulli_nb_exp, dev_data, n=9)

and report the results.

Deliveries: Code and results of running the code as described.

Exercise 5 (10 points)

a) In the framework of exercise 4, we may also try other classifiers, in particular logistic regression. Exchange BernoulliNB with LogisticRegression and repeat the 9-fold cross-validation experiment from exercise 4. You import LogisticRegression by

from sklearn.linear_model import LogisticRegression

b) The default from the NLTK book is to use the 2000 most frequent words as features. We will explore the effect of the size of the feature set. Repeat the 9-fold cross-validation experiment with the 1000, 2000, 5000, 10000 and 20000 most frequent words as features, both with BernoulliNB and with LogisticRegression.

Deliveries: Code. A 5x2 table showing the mean accuracies of the 9-fold cross-validation with 1000, 2000, 5000, 10000 and 20000 features for the two classifiers.

Exercise 6 (10 points)

From the different classifiers with which you have experimented in this exercise set, choose the one with the best performance on the dev_data. Find the 200-item test set we tucked away in exercise 1. Run the best classifier on this final test set. Calculate accuracy, recall, precision and F-score for both classes.

Deliveries: Which classifier did you choose? The numbers asked for. Code.
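For exercise 4, the experiment might take the following shape. This is only a sketch: the feature extraction follows the NLTK book, and for exercise 5 you would vary the number 2000 and exchange BernoulliNB with LogisticRegression.

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# The 2000 most frequent words in the corpus, as in the NLTK book
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    """Map a document (a list of words) to a feature dictionary."""
    words = set(document)
    return {'contains({})'.format(w): (w in words) for w in word_features}

def bernoulli_nb_exp(train_data, test_data):
    """Train a BernoulliNB on train_data and return its accuracy on
    test_data. Both arguments are lists of (feature dict, label) pairs."""
    train_dicts = [d for (d, label) in train_data]
    train_target = [label for (d, label) in train_data]
    test_dicts = [d for (d, label) in test_data]
    test_target = [label for (d, label) in test_data]
    dv = DictVectorizer()
    clf = BernoulliNB()
    clf.fit(dv.fit_transform(train_dicts), train_target)
    return clf.score(dv.transform(test_dicts), test_target)

For exercise 6, the metrics module of scikit-learn computes the requested numbers. In this sketch, best_clf and best_vectorizer are placeholder names for whichever fitted classifier and vectorizer performed best on the development data:

from sklearn.metrics import accuracy_score, classification_report

test_texts = [text for (text, label) in test_data]
test_target = [label for (text, label) in test_data]
predicted = best_clf.predict(best_vectorizer.transform(test_texts))
print(accuracy_score(test_target, predicted))
# precision, recall and F-score for both classes ('neg' and 'pos')
print(classification_report(test_target, predicted))

END OF PART A – DO NOT FORGET PART B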