Lab04: Count-Based Models

In this lab, we will look at how to process natural language text to build two types of count-based matrices: one for characterising words (the word co-occurrence matrix) and one for characterising documents (the document-term matrix).
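To make the two matrices concrete, here is a minimal sketch in plain Python. The two-sentence corpus, the whitespace tokenisation, and the symmetric window of one word on each side are all assumptions chosen for illustration, not the lab's actual data or settings.

```python
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Naive whitespace tokenisation (an assumption; real text needs more care).
tokenised = [doc.split() for doc in corpus]

# Sorted vocabulary and a word -> column index lookup.
vocab = sorted({tok for doc in tokenised for tok in doc})
index = {word: i for i, word in enumerate(vocab)}

# Document-term matrix: one row per document, one column per vocabulary word,
# each cell holding how often the word occurs in that document.
doc_term = [[0] * len(vocab) for _ in tokenised]
for d, doc in enumerate(tokenised):
    for word, count in Counter(doc).items():
        doc_term[d][index[word]] = count

# Word co-occurrence matrix: one row and one column per vocabulary word,
# each cell counting how often the two words appear within `window`
# positions of each other in the same document.
window = 1  # assumed context size
cooc = [[0] * len(vocab) for _ in vocab]
for doc in tokenised:
    for i, word in enumerate(doc):
        lo = max(0, i - window)
        hi = min(len(doc), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[word]][index[doc[j]]] += 1

print(vocab)
print(doc_term)
print(cooc[index["sat"]][index["on"]])
```

Each row of `doc_term` describes a document by the words it contains, while each row of `cooc` describes a word by the words that appear near it.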

While putting the theory into practice, we will also introduce two more packages that excel at count-based methods, namely scikit-learn and gensim. After introducing the basics using a small toy corpus, we demonstrate how tf-idf can be used for document classification in scikit-learn.
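As a rough illustration of the tf-idf weighting mentioned above, the sketch below computes the textbook form, raw term count multiplied by log(N / document frequency), in plain Python. The three tiny documents and the names `docs` and `tf_idf` are hypothetical. This is not scikit-learn's exact computation: its `TfidfVectorizer` uses a smoothed idf and L2-normalises each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical pre-tokenised documents, invented for illustration only.
docs = {
    "d1": "good movie good plot".split(),
    "d2": "bad movie bad acting".split(),
    "d3": "good acting".split(),
}

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in docs.values() for term in set(tokens))


def tf_idf(tokens):
    """Return {term: raw count * log(N / df)} for one document."""
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}


weights = {name: tf_idf(tokens) for name, tokens in docs.items()}

for name, w in weights.items():
    print(name, {t: round(v, 3) for t, v in w.items()})
```

Terms that occur in few documents (such as "plot" here) receive a higher weight than terms spread across many documents, which is what makes the resulting vectors useful as features for a classifier.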

Work on your project after you finish this lab.

CITS4012 Natural Language Processing
