9) Text Mining 1
UC Berkeley, STAT 133, Fall 2024
Lecture
We move on to our next topic: the basic principles behind Text Mining, "the process of distilling actionable insights from text".
Broadly speaking, Text Mining analysis can be divided into two broad categories: 1) Bag-of-Words (BoW) analysis, and 2) Syntactic Parsing. In this course, we will focus exclusively on topics related to bag-of-words analysis.
Preliminary Operations: Because text data is rarely ready to be analyzed as is, we first discuss some of the common preparation steps that put text data into a form that can be analyzed mathematically and statistically.
Frequency Analysis: The most basic type of bag-of-words text mining analysis consists of counting the number of occurrences (the frequency) of each token.
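The two steps above can be sketched in a few lines. The following is only a minimal illustration in plain Python, not the course's own code: the two-sentence corpus, the tiny stopword list, and the helper names `tokenize`, `preprocess`, and `token_frequencies` are all made up for this example, and real analyses typically use much longer stopword lists and more careful tokenization rules.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, invented for illustration only.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat!",
]

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "on", "a", "an", "and", "of"}


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def preprocess(text, stopwords=STOPWORDS):
    """Tokenize the text and remove stopwords."""
    return [tok for tok in tokenize(text) if tok not in stopwords]


def token_frequencies(docs):
    """Count how often each remaining token occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(preprocess(doc))
    return counts


freqs = token_frequencies(corpus)

# Print tokens from most to least frequent.
for token, n in freqs.most_common():
    print(f"{token:<8}{n}")
```

Here the preliminary operations are lowercasing, punctuation removal, tokenization, and stopword removal; the frequency analysis is the final count. In this toy corpus, "cat" is the only token that appears twice once stopwords are removed.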
Reading
Read chapters 1, 2, and 4 of "Text Mining with R" (by Julia Silge and David Robinson).
Lab
- Publish a Shiny app to shinyapps.io.
- Keep practicing regular expressions.
Objectives
- List at least four operations that are commonly applied to a text data set before it can be analyzed with text mining techniques.
- Describe the notion of stopwords.
- Explain what a corpus (plural: corpora) is.
- Explain what a token is.
- Explain the concept of tokenization.
- Compute frequencies of tokens.
- Visualize frequencies with bar charts and word clouds.
Assignments
- Midterm on Friday 10/25
- Keep working on your Shiny App1, due 11/01