Welcome!

Welcome to IN4080 - Natural Language Processing 2019!

What?

The content - and the learning goals - for the course are described in broad terms at the main course page. More specifically, we will consider various steps involved in NLP systems from sentence splitting, tokenization and tagging to named-entity recognition, dependency parsing, role labeling, information extraction and more. Central to the course will be the use of experiments in NLP, as many of you will carry out such experiments as part of your master's thesis project. We will in particular consider how experiments should be set up and evaluated, as well as various machine learning algorithms, and what makes linguistic material special when it comes to machine learning experiments.

The MN faculty has reformed its study programs over the last few years. The bachelor programs were "new" for the students starting in 2017, and all master's programs, including Informatics: Language Technology, were "new" last year, 2018. All the courses involving language technology have been revamped as part of these reforms. IN4080 Natural Language Processing was taught for the first time in its current form in 2018. It replaced INF5830 Natural Language Processing, but also includes some parts that were earlier taught in INF4820 Algorithms for artificial intelligence and natural language processing, together with some new items. Other parts of INF4820 have been included in IN2110 – Språkteknologiske metoder. IN4080 provides a background for the more advanced course IN5550 – Advanced Topics in Natural Language Processing, which currently puts its main emphasis on deep learning methods in NLP and is taught in the spring semesters.

IN4080 2019 will mainly follow the same structure as in 2018. In particular, as the students who completed their bachelor's degrees this summer followed the old bachelor's programs, they are not supposed to have taken IN2110. Hence IN4080 this semester will assume you have not taken IN2110, and some of the content from IN2110 will also be part of IN4080 this semester. Students who have already taken IN2110 will inevitably experience some overlap between the two courses. (There is no credit reduction, however.) On the other hand, we will not include content from INF2820 Computational Linguistics.

Background

Natural Language Processing is an interdisciplinary subject building on insights from various fields including

  • Language and Linguistics
  • Computer Science in general and programming in particular
  • Probability theory and statistics (and mathematics)
  • Machine Learning and "Data Science"

Students who come to this class have different backgrounds. Some are familiar with some of the fields, others with other fields. This can be a challenge. Last year we tried to solve this by giving more lectures than is usual for a course at this level. Most lectures were aimed at all students, while some were mainly aimed at those with no prior background in NLP and others at those who lacked a sufficient background in mathematics and statistics. That did not work as well as expected. This year, we will return to a model we practiced in INF5830: we will give only one set of regular lectures, but offer additional tutorials on specific topics.

Here is some more on assumed background and recommendations on what to read.

Language and linguistics

You should be familiar with some core concepts of linguistics, like "parts of speech" and "sentence structure". If you have not taken any courses in linguistics or NLP/Computational Linguistics you should consult some of the following.

  • Chapter 3, "Linguistic Essentials", pp. 81-115, in Manning and Schütze: Foundations of Statistical Natural Language Processing. This is the best overview of what will be assumed in the course. Unfortunately, the book is not online, but you can find it in the library.
  • You are recommended to acquire Jurafsky and Martin, Speech and Language Processing, anyhow. Sections 3.1 and 5.1 (= 8.1 in the 3rd ed.) cover some of the background on words, while chapter 10, Formal Grammars of English, sections 10.1-10.3, introduces some of the key concepts of sentence syntax.
  • Related to sentence syntax, you are also recommended to read sections 8.1-8.3 in the NLTK book: Natural Language Processing with Python, by Bird, Klein and Loper.

Programming in Python

The course will not be a heavy programming course, but you have to be able to write programs to solve simple tasks. Moreover, many of the tools we will use are Python modules. We assume that you know how to program and that you are able to teach yourself Python if you are not already familiar with it. Sources for learning Python include

The Natural Language Toolkit (NLTK)

This toolkit is used in several bachelor courses. We will also use parts of it in this course. You are advised to familiarize yourself with the first three chapters of the NLTK book as soon as possible, in particular chapter 1 and chapter 2, sec. 2.1-2.2.

Probability theory and Statistics

Since we don't presuppose any background in probability theory and statistics, we will offer a tutorial on the basic concepts for those without such a background. We will address probabilities early (second week of the semester) and return to more statistics during the semester when needed.

Do you need a book on statistics? We will cover all the concepts on the slides, so a book is not strictly required. But if you would like some more to read, most books on statistics will do. To keep it cheap, we will use some parts of a book that is freely available on the web, OpenIntro Statistics (3rd ed.).

Other useful sources

  • If you already own a book on statistics, that will probably cover what we will consider, e.g. the STK1000 book, Moore and McCabe, Introduction to the Practice of Statistics.
  • If you want to invest in a paper book, Statistics in a Nutshell by Sarah Boslaugh covers what we need in not too many pages.
  • I like Gonick and Smith's The Cartoon Guide to Statistics. It is mostly drawings, with not too many words, but it covers the essentials.
  • In earlier semesters, some students recommended Khan Academy.

What first?

Question: If I lack some of this background, in which order should I attack it?

  1. If you lack experience with Python and NLTK, that is most urgent. We are going to use it from the first week.
  2. Then, if you don't have knowledge of linguistics, that's next on your agenda.
  3. If you already know Python, NLTK and some linguistics, say you have taken INF2820 Computational Linguistics, it is time for probabilities and statistics. As said, we will have some tutorials, but it is wise to get a head start and use the first weeks of the semester.
Published Aug. 12, 2019 3:37 PM - Last modified Aug. 21, 2019 9:05 PM