Welcome!

Welcome to IN4080 - Natural Language Processing 2018!

What?

The content and the learning goals of the course are described in broad terms on the main course page. More specifically, we will consider the various steps involved in NLP systems, from sentence splitting, tokenization and tagging to named-entity recognition, dependency parsing, role labeling, information extraction and more. Central to the course will be the use of experiments in NLP, as many of you will carry out such experiments as part of your master's thesis project. In particular, we will consider how experiments should be set up and evaluated, various machine learning algorithms, and what makes linguistic material special when it comes to machine learning experiments.
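To give a first impression of the earliest steps mentioned above, here is a minimal sketch of sentence splitting and tokenization using only Python's standard library. The function names and the regular expressions are our own illustrative assumptions, not the tools we will use in the course, and the rules are deliberately naive.

```python
import re

def split_sentences(text):
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naively separate runs of word characters from single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "NLP is fun. Tokenization looks easy, but is it?"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

Rules this simple break quickly: an abbreviation such as "Dr." followed by a name is treated as the end of a sentence. Cases like this are one reason why even the first steps of an NLP system deserve careful evaluation.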

The MN faculty is reforming its study programs, and this year all master's programs are "new". The earlier master's program in Informatics: Language and Computation has been replaced by the program in Informatics: Language Technology. This is a golden opportunity to also revamp the courses. This course will partly replace INF5830 Natural Language Processing, but will also include some parts that were earlier taught in INF4820 Algorithms for artificial intelligence and natural language processing (other parts of INF4820 will be included in IN2110 – Språkteknologiske metoder), together with some new items. It will be followed by a more advanced course in Topics in NLP in the spring semesters.

Background

Natural Language Processing is an interdisciplinary subject building on insights from various fields including

  • Language and Linguistics
  • Computer Science in general and programming in particular
  • Probability theory and statistics (and mathematics)
  • Machine Learning and "Data Science"

Students who come to this class have different backgrounds. Some are familiar with some of the fields, others with other fields. This can be a challenge, and we will try to address it as follows. We will give more lectures than is usual for a course at this level; some weeks we will give two two-hour lectures in addition to the weekly lab sessions. Some of these lectures will be purely background lectures, e.g., on probabilities, for those who are not familiar with the concepts. Other lectures will cover rather basic themes in NLP which are already familiar to students who have taken courses in computational linguistics. The primary audience for this course is first-semester master's students in the Language Technology program, but by organizing the course this way, we hope to make it accessible also to other students with no background in NLP/computational linguistics.

Here is some more on assumed background and recommendations on what to read.

Language and linguistics

You should be familiar with some core concepts of linguistics, like "parts of speech" and "sentence structure". We plan two lectures on this background: one on words, lexicon and text processing in the second week of the semester, and one on sentence structure and context-free grammars a few weeks later. If you have not taken any courses in linguistics or NLP/computational linguistics, you should consult some of the following.

  • Chapter 3, "Linguistic Essentials", pp. 81-115, in Manning and Schütze: Foundations of Statistical Natural Language Processing. This is the best overview of what will be assumed in the course. Unfortunately, the book is not online, but you can find it in the library.
  • You are recommended to acquire Jurafsky and Martin, Speech and Language Processing, in any case. Sections 3.1 and 5.1 (= 10.1 in the 3rd ed.) cover some of the background on words, while sections 12.1-12.3 introduce some of the key concepts of sentence syntax.
  • Related to sentence syntax, you are also recommended to read sections 8.1-8.3 in the NLTK book: Natural Language Processing with Python, by Bird, Klein and Loper
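If you have never seen a context-free grammar, the following small sketch may help before you start reading. It is a toy example of our own, not taken from any of the books above: the grammar, its vocabulary and the function name expand are all made up for illustration. The program lists every sentence the grammar can derive.

```python
from itertools import product

# Toy grammar (an illustrative assumption): each nonterminal maps to a
# list of possible right-hand sides; anything not listed is a word.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"]],
    "N": [["dog"], ["cat"]],
    "V": [["sees"], ["sleeps"]],
}

def expand(symbol):
    """Return all word sequences derivable from symbol.

    Only terminates for grammars without recursion, like the one above.
    """
    if symbol not in GRAMMAR:  # a terminal, i.e. a word
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        parts = [expand(s) for s in rhs]
        # Combine every expansion of each right-hand-side symbol.
        for combo in product(*parts):
            results.append([word for part in combo for word in part])
    return results

sentences = [" ".join(words) for words in expand("S")]
for sentence in sentences:
    print(sentence)
```

Note that the grammar also derives "the dog sleeps the cat". A grammar that accepts more than it should is a typical problem with small hand-written grammars.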

Programming in Python

The course will not be a heavy programming course, but you have to be able to write programs to solve simple tasks. Moreover, many of the tools we will use are Python modules. We assume that you know how to program and that you are able to teach yourself Python if you are not already familiar with it. Sources for learning Python include

The Natural Language Toolkit (NLTK)

This toolkit is used in several bachelor courses. We will also use parts of it in this course. You are advised to familiarize yourself with the first three chapters of the NLTK book as soon as possible, in particular chapter 1 and chapter 2, sec. 2.1-2.2.

Probability theory and Statistics

Since we don't presuppose any background in Probability theory and Statistics, we will give some lectures on the basic concepts. We will address probabilities early (second week of the semester) and return to more statistics during the semester when needed.
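As a small taste of how probabilities show up in NLP, here is a sketch that estimates word probabilities from a text as relative frequencies. The function name and the example sentence are our own assumptions for illustration, and relative frequency is only the simplest possible estimate.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Estimate P(word) as count(word) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:  # no tokens, nothing to estimate
        return {}
    return {word: count / total for word, count in counts.items()}

tokens = "the cat sat on the mat".split()
probs = relative_frequencies(tokens)
print(probs["the"])  # "the" is 2 of the 6 tokens
print(sum(probs.values()))  # the estimates sum to 1, up to rounding
```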

Do you need a book on statistics? We will cover all the concepts on the slides, so a book is not strictly required. But if you would like more to read, most books on statistics will do. To keep it cheap, we will use some parts of a book that is freely available on the web, OpenIntro Statistics.

Other useful sources

  • If you already own a book on statistics, that will probably cover what we will consider, e.g. the STK1000 book, Moore and McCabe, Introduction to the Practice of Statistics.
  • If you want to invest in a paper book, Statistics in a Nutshell by Sarah Boslaugh covers what we need in not too many pages, and in roughly the same order as we will present the material.
  • I like Gonick and Smith's The Cartoon Guide to Statistics. It is mostly drawings, with not too many words, but it covers the essentials.
  • In earlier semesters, some students recommended Khan Academy.

What first?

Question: If I lack some of this background, in which order should I attack it?

  1. If you lack experience with Python and NLTK, that is most urgent. We are going to use it from the first week.
  2. Then, if you don't have knowledge of linguistics, that's next on your agenda.
  3. If you already know Python, NLTK and some linguistics, say you have taken INF2820 Computational Linguistics, it is time for probabilities and statistics. As mentioned, we will have some tutorials, but it is wise to get a head start and use the first weeks of the semester.
Published Aug. 13, 2018 9:22 PM - Last modified Aug. 13, 2018 9:22 PM