Week 2

Scroll down to look at the part on probabilities!

Lecture 2, Thursday Sept. 1:
Words, tokenization, tagged text

This lecture will consider

  • some basic linguistics concepts related to words
  • the processes of tokenization and normalization
  • tagged text
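As a rough preview of these three topics, here is a minimal sketch using only Python's standard library. The function names `tokenize` and `normalize` are hypothetical helpers made up for this illustration, not NLTK functions, and the part-of-speech tags in the last example are assigned by hand. Real tokenizers and taggers, such as the ones covered in the readings, handle many more cases.

```python
import re

def tokenize(text):
    # Hypothetical helper: a token is either a word (optionally with
    # internal apostrophes or hyphens) or a single punctuation mark.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

def normalize(tokens):
    # Hypothetical helper: a very simple normalization that lowercases
    # every token and drops tokens that are pure punctuation.
    return [t.lower() for t in tokens if re.match(r"\w", t)]

tokens = tokenize("Mr. Smith doesn't like state-of-the-art taggers!")
normalized = normalize(tokens)

# Tagged text is commonly represented as (token, tag) pairs.
# These tags are assigned by hand for illustration only.
tagged = [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB")]

print(tokens)
print(normalized)
print(tagged)
```

Note that this sketch splits "Mr." into two tokens, which is one of the problems a proper tokenizer has to deal with.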

Presentation

Recordings

Mandatory readings

Jurafsky and Martin, Speech and Language Processing, 3rd ed. (edition of 12 Jan. 2022!)

  • Ch. 2 Regular expressions, etc.
    • Sec. 2.0
    • Sec. 2.2 Words
    • Sec. 2.3 Corpora
    • Sec. 2.4 Normalization, except 2.4.3 and the technical details of 2.4.1
  • Ch. 8 Sequence Labelling ...
    • Sec. 8.1 and 8.2

NLTK Book

  • Ch. 3, sec. 6 Normalizing Text
  • Ch. 3, sec. 8 Segmentation
  • Ch. 5, sec. 1 Using a tagger
  • Ch. 5, sec. 2 Tagged corpora

Wikipedia

Recommended reading

Wikipedia

Probabilities - background and tutorial

Last year's slides and the readings below indicate what we expect you to know about probabilities. Many of you have a background in probabilities, but some of you may lack it. If anybody is interested, we will arrange a tutorial on probabilities sometime between Fri. Sept. 2 and Wed. Sept. 7. We can decide on a time in the lecture on Sept. 1. (Sept. 1 at 14:00 turned out not to be an option.) If you are interested, you may send me (jtl) an email indicating possible times.

Presentation

Readings

OpenIntro (3rd ed.) (In the 4th ed., add one to the chapter numbers.)

  • Ch. 2, "Probability", sec. 2.1-2.4
  • Ch. 3, "Distributions of random variables":
    • Sec. 3.3.1 Bernoulli distribution
    • Sec. 3.4.1 Binomial distribution
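For orientation, the two distributions named above can be sketched in a few lines of Python. The function names `bernoulli_pmf` and `binomial_pmf` are hypothetical, chosen for this illustration only. The formulas are the standard ones: a Bernoulli variable takes the value 1 with probability p and 0 with probability 1 - p, and the binomial distribution gives the probability of exactly k successes in n independent Bernoulli trials.

```python
from math import comb

def bernoulli_pmf(x, p):
    # P(X = x) for a single trial with success probability p, x in {0, 1}.
    return p if x == 1 else 1 - p

def binomial_pmf(k, n, p):
    # P(exactly k successes in n independent trials), each with
    # success probability p: C(n, k) * p^k * (1 - p)^(n - k).
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 2 heads in 4 tosses of a fair coin.
print(binomial_pmf(2, 4, 0.5))  # 0.375

# The probabilities over all possible outcomes sum to 1.
print(sum(binomial_pmf(k, 4, 0.5) for k in range(5)))
```

With n = 1 the binomial distribution reduces to the Bernoulli distribution.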

Group session, Tuesday Sept. 6, 12:15 in Sed

Program

Solutions

Published Aug. 26, 2022 11:44 AM - Last modified Nov. 9, 2022 5:37 PM