9) Text Mining 1

UC Berkeley, STAT 133, Fall 2024

📖 Lecture

We move on to our next topic, which covers the basic principles behind Text Mining, “the process of distilling actionable insights from text”.

Broadly speaking, Text Mining analysis can be divided into two broad categories: 1) Bag-of-Words (BoW) analysis, and 2) Syntactic Parsing. In this course, we will focus exclusively on topics related to bag-of-words analysis.

Preliminary Operations: Because text data is rarely ready to be analyzed as is, we first need to discuss some of the common preparation steps that put text data into a form that can be analyzed mathematically and statistically.
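
As a rough illustration of what such preparation steps can look like, here is a minimal sketch in base R. The two example sentences and the tiny stopword list are invented for this illustration and are not a standard lexicon; the assigned reading works with the tidytext package instead.

```r
# Toy documents (invented for illustration)
docs <- c("The cat sat on the mat.",
          "A dog chased the cat!")

# 1) Convert to lowercase so that "The" and "the" are treated alike
clean <- tolower(docs)

# 2) Remove punctuation characters
clean <- gsub("[[:punct:]]", "", clean)

# 3) Tokenize: split each document into words on whitespace
tokens <- strsplit(clean, "[[:space:]]+")

# 4) Remove stopwords (a tiny hand-made list, assumed for this example)
stopwords <- c("the", "a", "on")
tokens <- lapply(tokens, function(x) x[!x %in% stopwords])

# One character vector of remaining tokens per document
tokens
```

The order of the steps matters here: lowercasing happens before stopword removal so that "The" is matched by the lowercase entry "the".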

Frequency Analysis: The most basic type of bag-of-words text mining analysis consists of counting the number of occurrences (the frequency) of each token.
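
To make the idea of counting concrete, the following is a small sketch in base R. The token vector is made up for illustration; in practice the tokens would come from a tokenized and cleaned corpus.

```r
# Toy vector of tokens (invented for illustration)
tokens <- c("cat", "dog", "cat", "mat", "dog", "cat")

# Count how many times each distinct token occurs
freqs <- table(tokens)

# Sort from most frequent to least frequent
freqs <- sort(freqs, decreasing = TRUE)

# Print the frequency table
freqs

# Draw a basic bar chart of the token frequencies
barplot(freqs, las = 2, ylab = "count")
```

Because a bag-of-words analysis ignores word order, the frequency table is all that is kept from the original sequence of tokens.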

📚 Reading

Read chapters 1, 2 and 4 of “Text Mining with R” (by Julia Silge and David Robinson).

🔬 Lab

  • Publish a Shiny app to shinyapps.io.
  • Keep practicing regular expressions.

🎯 Objectives

  • List at least four operations that are commonly applied to a text data set before it can be analyzed with text mining techniques.
  • Describe the notion of stopwords.
  • Explain what a corpus (plural: corpora) is.
  • Explain what a token is.
  • Explain the concept of tokenization.
  • Compute frequencies of tokens.
  • Visualize frequencies with barcharts and wordclouds.

🔔 Assignments

  • Midterm on Friday 10/25
  • Keep working on your Shiny App1, due 11/01