Jamie’s Text Mining with R

Tutorials

These tutorials reflect course content from Introduction to Text Mining (CSCD 5000, Fall 2020) at Temple University.

  1. accessing documents (janeaustenr, gutenbergr, harrypotter)

    • Finding corpora and importing them into R is not always easy. Some of these packages provide access to huge amounts of text, but there’s always a catch!

  2. Characters and string basic operations: Part I

    • An introduction to working with characters and strings

  3. cleaning, stripping, and prepping text: our custom cleaning genie

    • Half the challenge with R is getting the data read in and cleaned up into a form where you can actually do the analyses you hope to do. This document reflects our lab’s efforts at developing a text stripping and cleaning function. We call it the cleaning genie. I hope to one day implement the genie as a Shiny app but for now here it is in all its raw regex glory.

  4. Code snippets specialized for text-based analysis

    • Working with text (e.g., characters, strings) in R has its own syntax relative to more “generic” code for statistical modeling. This document collects code snippets that I have found helpful for dealing with thorny recurring issues in text mining.

  5. document classification

    • Building simple classifiers (e.g., naive Bayes) using Quanteda that “learn” a particular distinction during supervised training and then apply the model’s predictions to new documents.

  6. lexical diversity metrics applied to Harry Potter novels

    • Type-token ratios (TTR) come in many flavors and are the bread and butter of many narrative analyses. The Quanteda package offers numerous off-the-shelf TTR variants. This document applies them to a personal curiosity: did J.K. Rowling’s vocabulary repertoire increase as she crafted the series? To answer this, we use moving averages to examine lexical diversity both within and between novels, including the Sorcerer’s Stone, Prisoner of Azkaban, and the Deathly Hallows.

  7. Ngrams

    • An introduction to ngrams applied to Edgar Allan Poe’s The Raven

  8. project gutenberg search

    • Thanks to Ann Marie Finley for her summary of search strategies for Library of Congress identifiers within Project Gutenberg using the gutenbergr package.

  9. quanteda and working with multi-document corpora

    • How to read multiple text files, convert them to corpus objects, build document-feature matrices (DFMs), compute cosine distances between documents and features, and run simple topic modeling.

  10. regular expressions

    • Grep your way through an introduction to regular expressions.

  11. topic models

    • Using the sotu package, we will analyze topics within the State of the Union addresses by one president and then by many presidents at once.

  12. Word Cloud

    • Generate a simple word cloud based on lexical frequency within a document.

  13. plot arousal ratings for every word of Goodfellas

    • Read in the script, remove stopwords, then split and unlist the script into a long vector of single content words. Yoke arousal values from the Warriner et al. database to each word of Goodfellas in the order it appeared in the movie. Then plot the arousal values, coloring bars by whether the word is a curse or not. FUN STUFF.

  14. auto-segmenting verbal fluency data using semantic distance

    • This script takes a time series of category fluency data and breaks it into clusters using semantic distance.
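
For a taste of what the lexical diversity tutorial (item 6) involves, here is a minimal base R sketch of a moving-window type-token ratio. It is not the code from the tutorial, which uses Quanteda. The function name `moving_ttr`, the `window` argument, and the toy sentence are all made up for illustration, and the tokenizer is deliberately crude.

```r
# Moving-window type-token ratio in base R.
# `moving_ttr` and `window` are hypothetical names for this sketch.
moving_ttr <- function(text, window = 50) {
  # Lowercase, then split on anything that is not a letter or apostrophe
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[nzchar(tokens)]  # drop empty strings left by the split
  n <- length(tokens)
  if (n < window) stop("text has fewer tokens than the window size")
  # One TTR per window of `window` consecutive tokens:
  # number of unique words (types) divided by number of words (tokens)
  starts <- seq_len(n - window + 1)
  vapply(starts, function(i) {
    length(unique(tokens[i:(i + window - 1)])) / window
  }, numeric(1))
}

# Toy input, invented for illustration
toy <- "the cat sat on the mat and the dog sat on the log"
ttr <- moving_ttr(toy, window = 5)
mean(ttr)  # averaging the window-level TTRs gives a single summary score
```

Averaging over fixed-size windows keeps the score from shrinking simply because a text is longer, which is the usual complaint about a raw TTR. On a full novel you would want a much larger window than the toy value used here.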


Sample text mining analyses

The following documents represent samples of work we have done in the lab. Note — this work was not peer reviewed. There are likely many errors, and we make no claims regarding validity of the findings. These documents represent stops/starts and lessons learned. We are grateful for feedback or suggestions for improving anything you encounter here.

  1. Unabomber Manifesto

    • This document represents a simple sentiment analysis, word cloud, and frequency output using a bag-of-words approach on Theodore Kaczynski’s essay, Industrial Society and its Future.

  2. Are these lyrics from Katy Perry or Metallica?

    • A case exercise using Quanteda to evaluate cosine distance and produce cluster dendrograms plotting distances between the lyrics of Katy Perry vs. Metallica, with some topic modeling and text analytics such as type-token ratio swirled in for good measure.
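
The cosine distance and dendrogram idea behind the lyrics exercise can be sketched in base R as follows. This is an illustration under stated assumptions, not the lab's analysis: the three toy documents are invented (they are not real lyrics), the names `docs`, `tokenize`, and `dfm_toy` are hypothetical, and the actual exercise uses Quanteda rather than hand-rolled matrices.

```r
# Toy documents, invented for illustration (not real lyrics)
docs <- c(doc_a = "fire in the night fire in the sky",
          doc_b = "night sky full of fire",
          doc_c = "sugar and candy under a summer sun")

# Crude tokenizer: lowercase, split on non-letters, drop empty strings
tokenize <- function(x) {
  tok <- unlist(strsplit(tolower(x), "[^a-z']+"))
  tok[nzchar(tok)]
}
tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tokens)))

# Document-feature matrix: one row per document, one column per word
counts <- vapply(tokens, function(tok) {
  tabulate(match(tok, vocab), nbins = length(vocab))
}, integer(length(vocab)))
dfm_toy <- t(counts)
colnames(dfm_toy) <- vocab

# Cosine similarity = dot product divided by the product of vector lengths
norms <- sqrt(rowSums(dfm_toy^2))
cos_sim <- (dfm_toy %*% t(dfm_toy)) / (norms %o% norms)
cos_dist <- 1 - cos_sim

# Hierarchical clustering on the distances; plot(hc) draws the dendrogram
hc <- hclust(as.dist(cos_dist), method = "average")
```

Documents that share no words end up at a cosine distance of 1, so `doc_c` sits apart from the other two in the resulting tree. Average linkage is just one reasonable choice here; other `hclust` methods would work the same way.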