9) Text Mining 1

📖 Lecture

Our next topic covers the basic principles of Text Mining, “the process of distilling actionable insights from text”.

Broadly speaking, Text Mining analysis can be divided into two broad categories: 1) Bag-of-Words (bow) analysis, and 2) Syntactic Parsing. In this course, we will focus exclusively on topics related to bag-of-words analysis.

Preliminary Operations: Because text data is rarely ready to be analyzed as is, we first need to discuss some of the common preparation steps that put text data into a form that can be analyzed mathematically and statistically.
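
As a rough illustration of what such preparation steps can look like, here is a minimal sketch in plain Python (chosen only to keep the example self-contained; it is not the course's own tooling). The function name `preprocess` and the tiny `STOPWORDS` set are assumptions made for this example, and real stopword lists are much longer.

```python
import re

# Hypothetical, deliberately tiny stopword list used only for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stopwords."""
    text = text.lower()                     # 1) normalize case
    text = re.sub(r"[^a-z\s]", " ", text)   # 2) replace non-letters with spaces
    tokens = text.split()                   # 3) tokenize on whitespace
    return [t for t in tokens if t not in STOPWORDS]  # 4) remove stopwords

print(preprocess("The process of distilling actionable insights from text, in 2 steps!"))
# ['process', 'distilling', 'actionable', 'insights', 'from', 'text', 'steps']
```

The four numbered steps are one possible ordering; which operations are applied, and in what order, depends on the data and the analysis.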

Frequency Analysis: The most basic type of bag-of-words text mining analysis consists of counting the number of occurrences (the frequency) of each token.
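
Counting token frequencies can be sketched in a few lines. The example below uses plain Python with the standard-library `collections.Counter`; the helper name `token_frequencies` and the sample token list are made up for illustration, and the tokens are assumed to be already cleaned.

```python
from collections import Counter

def token_frequencies(tokens):
    """Return a Counter mapping each token to its number of occurrences."""
    return Counter(tokens)

# Hypothetical, already-tokenized sample text.
tokens = ["text", "mining", "turns", "text", "into", "insights", "text", "mining"]

freq = token_frequencies(tokens)
top = freq.most_common(2)   # the two most frequent tokens, highest count first
print(top)                  # [('text', 3), ('mining', 2)]
```

A table of counts like `freq` is the usual input for a bar chart or word cloud of the most frequent tokens.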

📚 Reading

Read chapters 1, 2, and 4 of “Text Mining with R” (by Julia Silge and David Robinson).

🔬 Lab

Publish a Shiny app to shinyapps.io.
Keep practicing regular expressions.

🎯 Objectives

List at least four operations that are commonly applied to a text data set before it can be analyzed with text mining techniques.
Describe the notion of stopwords.
Explain what a corpus (plural: corpora) is.
Explain what a token is.
Explain the concept of tokenization.
Compute frequencies of tokens.
Visualize frequencies with bar charts and word clouds.

🔔 Assignments

Midterm on Friday 03/15
Keep working on your Shiny App1, due 03/22