Lab04: Count-Based Models
In this lab, we will look at how to process natural language text to build two different types of count-based matrices: one for word characterisation (i.e. a word co-occurrence matrix) and one for document characterisation (i.e. a document-term matrix).
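To make the two matrices concrete, here is a minimal pure-Python sketch that builds both from a tiny corpus. The corpus, the whitespace tokenisation, and the window size of 1 are assumptions chosen for illustration only; they are not the data or settings used later in the lab.

```python
from collections import Counter

# Hypothetical toy corpus: one document per string (an assumption for illustration).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Naive whitespace tokenisation; real text usually needs more careful processing.
tokenised = [doc.split() for doc in corpus]

# Fixed, sorted vocabulary so that row and column order is reproducible.
vocab = sorted({tok for doc in tokenised for tok in doc})
index = {word: i for i, word in enumerate(vocab)}

# Document-term matrix: one row per document, one column per vocabulary word.
doc_term = [[0] * len(vocab) for _ in tokenised]
for row, doc in zip(doc_term, tokenised):
    for tok, count in Counter(doc).items():
        row[index[tok]] = count

# Word co-occurrence matrix: one row and one column per vocabulary word.
# Two words co-occur when they are at most `window` positions apart.
window = 1  # assumed window size
cooc = [[0] * len(vocab) for _ in vocab]
for doc in tokenised:
    for i, tok in enumerate(doc):
        lo = max(0, i - window)
        hi = min(len(doc), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[index[tok]][index[doc[j]]] += 1

print(vocab)
for row in doc_term:
    print(row)
```

Each row of doc_term describes a document by the words it contains, while each row of cooc describes a word by its neighbours. Because the window is symmetric, cooc is a symmetric matrix.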
While putting the theory into practice, we will also introduce two more packages that excel at count-based methods, namely scikit-learn and gensim. After introducing the basics using a small toy corpus, we demonstrate how tf-idf can be used for document classification in scikit-learn.
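As a rough preview of what tf-idf computes, the sketch below weights each term by its count in a document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. This is the plain textbook form, and the three short documents are made up for illustration. scikit-learn's TfidfVectorizer applies smoothing and normalisation by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy documents (an assumption for illustration).
docs = [
    "good film good cast",
    "bad film bad plot",
    "good plot",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: the number of documents in which each term appears.
df = Counter(tok for doc in tokenised for tok in set(doc))


def tfidf(doc):
    """Return {term: tf * log(N / df)} for one tokenised document."""
    counts = Counter(doc)
    return {tok: count * math.log(n_docs / df[tok]) for tok, count in counts.items()}


weights = [tfidf(doc) for doc in tokenised]

for doc, w in zip(docs, weights):
    print(doc, {tok: round(val, 3) for tok, val in w.items()})
```

Terms that appear in few documents, such as "cast", receive a higher weight per occurrence than terms spread across many documents, such as "film". This is what makes tf-idf vectors useful as features for a document classifier.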
Work on your project after you finish this lab.