IN4080 2018, Mandatory assignment 2, part A (= Exercise set 5)

Mandatory assignment 2 consists of two parts:

- Part A (= Exercise set 5) on text classification
- Part B (= Exercise set 6) on sequence classification

You should answer both parts. It is possible to get 50 points on each part, 100 points altogether. You are required to get at least 60 points to pass. It is more important that you try to answer each question than that you get it correct.

======================

- Your answer should be delivered in Devilry no later than Tuesday, 16 October at 23:59.
- We assume that you have read and are familiar with IFI's requirements and guidelines for mandatory assignments:
  - /english/studies/examinations/compulsory-activities/mn-ifi-mandatory.html
  - /english/studies/examinations/compulsory-activities/mn-ifi-guidelines.html
- This is an individual assignment. You should not deliver joint submissions.
- You may redeliver in Devilry before the deadline, but include all files in the last delivery. Only the last delivery will be read!
- You should make only one submission, including both part A and part B. If you deliver more than one file, put them into a zip archive.
- Name your submission _in4080_submission_2
- You may choose your format of presentation. One possibility is to deliver a text document, preferably in PDF, with all the answers asked for, together with separate code files. Another possibility is to use Jupyter notebooks, if you are familiar with the format.

=======================

The original plan was that this exercise set should assume, but not include, exercise set 4; it should start where exercise set 4 ends. Our impression is, however, that not everybody has done exercise set 4. Hence, we make it simpler: we back off, reuse most of exercise set 4 as part of this mandatory assignment, and only add a few points. So do not be surprised if you feel you have read this before.

Exercise 1 (0 points)

a) We will work interactively in python/ipython/notebook. We start by importing the toolboxes we will be using:

import nltk
import random
import numpy as np
import scipy as sp
import sklearn

We will use the Movie Reviews Corpus that comes with NLTK:

from nltk.corpus import movie_reviews

b) We can import the documents similarly to how it is done in the NLTK book for the Bernoulli naive Bayes, with one change. The NLTK book uses the tokenized texts, obtained with the command

movie_reviews.words(fileid)

Following the recipe from the scikit-learn "Working with text data" page, we can instead use the raw documents, which we can get from NLTK by

movie_reviews.raw(fileid)

scikit-learn will then do the tokenization for us as part of count_vect.fit_transform.

c) We will shuffle the data and split it into 200 documents for final testing (which we will not use for a while) and 1800 documents for development. See exercise set 2, exercise 2. If you did that exercise, you are recommended to use the same split, so that you can compare results.

d) Then split the development data into 1600 documents for training and 200 for a development test set; call them train_data and dev_test_data. The train_data should now be a list of 1600 items, where each item is a pair of a text (represented as a string) and a label. You should then split this train_data into two lists, each of 1600 elements: train_texts, containing the text (as a string) of each document, and train_target, containing the corresponding 1600 labels. Do the same to dev_test_data. One possible way to construct these splits is sketched below.
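For concreteness, here is one possible way to build the splits in c) and d). This is only a sketch: the seed value is an arbitrary choice that makes the shuffle reproducible, and any fixed split will do.

# Collect (raw text, label) pairs for all 2000 reviews
raw_data = [(movie_reviews.raw(fileid), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.seed(2018)                 # arbitrary seed, only for reproducibility
random.shuffle(raw_data)
test_data = raw_data[:200]        # final test set, tucked away until exercise 6
dev_data = raw_data[200:]         # 1800 documents for development
train_data = dev_data[:1600]
dev_test_data = dev_data[1600:]
# Separate the texts from the labels
train_texts = [text for (text, label) in train_data]
train_target = [label for (text, label) in train_data]
dev_test_texts = [text for (text, label) in dev_test_data]
dev_test_target = [label for (text, label) in dev_test_data]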
e) It is then time to extract features from the texts. We import

from sklearn.feature_extraction.text import CountVectorizer

We then make a CountVectorizer v. This first considers the whole set of training data, to determine which features to extract:

v = CountVectorizer()
v.fit(train_texts)

Then we use this vectorizer to extract features from the training data and the test data:

train_vectors = v.transform(train_texts)
dev_test_vectors = v.transform(dev_test_texts)

To understand what is going on, you may inspect train_vectors a little more.

f) We are now ready to train a classifier:

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(train_vectors, train_target)

g) We can proceed and see how the classifier classifies one test object, e.g. dev_test_texts[14]:

clf.predict(dev_test_vectors[14])

We can use the same procedure to predict the results for all the test data:

clf.predict(dev_test_vectors)

We can use this for further evaluation (accuracy, recall, precision, etc.) by comparing the predictions to dev_test_target. Alternatively, we can get the accuracy directly by

clf.score(dev_test_vectors, dev_test_target)

Congratulations! You have now made and tested a multinomial naive Bayes text classifier.

Deliveries: No deliveries for this exercise.

Exercise 2 (10 points)

a) To make it easier to rerun the experiment and proceed to cross-validation, put most of exercise 1 into a procedure

multi_nb_exp(train_data, test_data):
    """train_data is a list of pairs, where the
           first element is a string representing a text and the
           second element is a string representing a label;
       test_data has the same form.
       Train a MultinomialNB on train_data, test on test_data and return the result.
    """

Beware: when you run this experiment to reconstruct exercise 1, the call should be

multi_nb_exp(train_data, dev_test_data)

Check that you get the same result as in exercise 1.

b) Make a procedure for n-fold cross-validation

n_fold(experiment, dev_data, n=10):
    """experiment is an experiment like multi_nb_exp;
       dev_data is a set of pairs, where the
           first element is a string representing a text and the
           second element is a string representing a label.
       Run an n-fold cross-validation of experiment on dev_data and return the results.
    """

Then run

n_fold(multi_nb_exp, dev_data, n=9)

and calculate the accuracies for each of the 9 runs, the mean accuracy and the standard deviation of the accuracies. (One possible implementation is sketched after exercise 3 below.)

Deliveries: Code and results of running the code as described.

Exercise 3 (10 points)

We have so far used the default parameters of the procedures from scikit-learn. These procedures have, however, many parameters, and to get optimal results we should adjust them. We can go back to the split in exercise 1, and use train_data for training various models and dev_test_data for testing and comparing them.

a) To see the parameters for CountVectorizer, we may use

help(CountVectorizer)

In ipython we may alternatively use

CountVectorizer?

One thing we observe is that CountVectorizer lowercases by default. For a different corpus, it could be interesting to check the effect of this feature, but movie_reviews.raw() is already lowercased, so it has no effect here. (You may check!)

Another interesting parameter is binary. Setting this to True implies only recording whether a word occurs in a document, not how many times it occurs. It could be interesting to see the effect of this parameter.

The parameter ngram_range=[1, 1] means that we use tokens (= unigrams) only, [2, 2] means that we use bigrams only, while [1, 2] means both unigrams and bigrams.

Run experiments where you let binary vary over [False, True] and ngram_range vary over [[1, 1], [1, 2], [1, 3]], and report the results. Which combination of parameters yields the best result? (One possible experiment loop is sketched below.)

Deliveries: Code and results of running the code as described.
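As promised, here is one possible implementation of the exercise 2 procedures. It is only a sketch, assuming the imports and the list-of-pairs data format from exercise 1; your own solution may be organized differently.

def multi_nb_exp(train_data, test_data):
    """Train a MultinomialNB on train_data and return its accuracy on
    test_data. Both arguments are lists of (text, label) pairs."""
    train_texts = [text for (text, label) in train_data]
    train_target = [label for (text, label) in train_data]
    test_texts = [text for (text, label) in test_data]
    test_target = [label for (text, label) in test_data]
    v = CountVectorizer()
    v.fit(train_texts)
    clf = MultinomialNB()
    clf.fit(v.transform(train_texts), train_target)
    return clf.score(v.transform(test_texts), test_target)

def n_fold(experiment, dev_data, n=10):
    """Run an n-fold cross-validation of experiment on dev_data and
    return the list of accuracies, one per fold."""
    size = len(dev_data) // n
    scores = []
    for i in range(n):
        test_fold = dev_data[i * size:(i + 1) * size]    # fold i for testing
        train_folds = dev_data[:i * size] + dev_data[(i + 1) * size:]  # the rest for training
        scores.append(experiment(train_folds, test_fold))
    return scores

scores = n_fold(multi_nb_exp, dev_data, n=9)
print(scores, np.mean(scores), np.std(scores))

For exercise 3, the parameter search can be a simple pair of loops; again only a sketch (note that scikit-learn also accepts tuples such as (1, 2) for ngram_range):

for binary in [False, True]:
    for ngram_range in [(1, 1), (1, 2), (1, 3)]:
        v = CountVectorizer(binary=binary, ngram_range=ngram_range)
        train_vectors = v.fit_transform(train_texts)
        clf = MultinomialNB().fit(train_vectors, train_target)
        accuracy = clf.score(v.transform(dev_test_texts), dev_test_target)
        print(binary, ngram_range, accuracy)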
Exercise 4 (10 points)

We will explore how the Bernoulli naive Bayes, which is used in the NLTK book for the movie data set, is implemented in scikit-learn. We can follow the NLTK book and extract features similarly, i.e., use document_features(document) to extract the features of a document into a dictionary. Remember that we here have to use movie_reviews.words(fileid), where we in exercise 1 used movie_reviews.raw(fileid). We can use the same splits as in exercise 1. The train_data will now be a list of 1600 items, where each item is a pair whose first element is a dictionary.

It remains to transform these dictionaries to numpy arrays of the form scikit-learn accepts. For this, we can use scikit-learn's DictVectorizer; see section 4.2.1 in http://scikit-learn.org/stable/modules/feature_extraction.html. (Alternatively, you can extract the features directly as arrays without using dictionaries. It is a little more work, but the experiments will run faster.)

a) Make a procedure for the experiment (one possible shape is sketched at the end of this part)

bernoulli_nb_exp(train_data, test_data):
    """train_data is a list of pairs, where the
           first element is a feature dictionary and the
           second element is a string representing a label;
       test_data has the same form.
       Train a BernoulliNB on train_data, test on test_data and return the result.
    """

b) Combining this with exercise 2, run

n_fold(bernoulli_nb_exp, dev_data, n=9)

and report the results.

Deliveries: Code and results of running the code as described.

Exercise 5 (10 points)

a) In the framework of exercise 4, we may also try other classifiers, in particular logistic regression. Exchange BernoulliNB with LogisticRegression and repeat the 9-fold cross-validation experiment from exercise 4. You import LogisticRegression by

from sklearn.linear_model import LogisticRegression

b) The default from the NLTK book is to use the 2000 most frequent words as features. We will explore the effect of the size of the feature set. Repeat the 9-fold cross-validation experiment with the 1000, 2000, 5000, 10000 and 20000 most frequent words as features, both with BernoulliNB and with LogisticRegression.

Deliveries: Code. A 5x2 table showing the mean accuracies of the 9-fold cross-validation with 1000, 2000, 5000, 10000 and 20000 features for the two classifiers.

Exercise 6 (10 points)

From the different classifiers with which you have experimented in this exercise set, choose the one with the best performance on the dev_data. Find the 200-item test set we tucked away in exercise 1. Run the best classifier on this final test set. Calculate accuracy, recall, precision and F-score for both classes.

Deliveries: Which classifier did you choose? The numbers asked for. Code.
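For exercise 4, the experiment might take the following shape. This is only a sketch: the feature extraction follows the NLTK book, and for exercise 5 you would vary the number 2000 and exchange BernoulliNB with LogisticRegression.

from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# The 2000 most frequent words in the corpus, as in the NLTK book
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for (w, _) in all_words.most_common(2000)]

def document_features(document):
    """Map a document (a list of words) to a feature dictionary."""
    words = set(document)
    return {'contains({})'.format(w): (w in words) for w in word_features}

def bernoulli_nb_exp(train_data, test_data):
    """Train a BernoulliNB on train_data and return its accuracy on
    test_data. Both arguments are lists of (feature dict, label) pairs."""
    train_dicts = [d for (d, label) in train_data]
    train_target = [label for (d, label) in train_data]
    test_dicts = [d for (d, label) in test_data]
    test_target = [label for (d, label) in test_data]
    dv = DictVectorizer()
    clf = BernoulliNB()
    clf.fit(dv.fit_transform(train_dicts), train_target)
    return clf.score(dv.transform(test_dicts), test_target)

For exercise 6, the metrics module of scikit-learn computes the requested numbers. In this sketch, best_clf and best_vectorizer are placeholder names for whichever fitted classifier and vectorizer performed best on the development data:

from sklearn.metrics import accuracy_score, classification_report

test_texts = [text for (text, label) in test_data]
test_target = [label for (text, label) in test_data]
predicted = best_clf.predict(best_vectorizer.transform(test_texts))
print(accuracy_score(test_target, predicted))
# precision, recall and F-score for both classes ('neg' and 'pos')
print(classification_report(test_target, predicted))

END OF PART A – DO NOT FORGET PART B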