A Quick Start for Text-Based Machine Learning Projects with Text-Specific Exploratory Data Analysis

Xiangqian Hu
6 min read · Mar 5, 2021


Talk about a quick start. Photo by SpaceX on Unsplash

Editor’s Note: This article includes analysis of Reddit posts with language that some may find offensive.

As I mentioned in my previous blog, document data volume, diversity, and complexity make document analysis challenging. In this blog, let's focus on a specific problem that arises after we get raw text data: developing a module that helps data scientists work with it efficiently.

Ideally, data scientists can quickly get an understanding of text by creating exploratory data analysis (EDA) reports. After all, we don’t want our data scientists reading through thousands of texts. But text, as one type of unstructured data, often requires different techniques to be explored and summarized. We don’t want our data scientists to keep switching from one package to another just to do basic EDA on raw text.

At Infinia ML, we’ve created a new Python module, TextExplorer, to provide basic EDA tools with a consistent and friendly interface. This module helps our data scientists quickly run text-specific analysis, gain some insights into data, and create more advanced machine learning pipelines. Below are the technical details on how we use this module to perform text-specific EDA.

Data

Data science starts with data. For our example, we’ll use this dataset from Kaggle, which contains Reddit WallStreetBets (WSB) posts from January to February 2021. These posts infamously created some turbulence in the stock market; GameStop’s stock price shot through the roof in late January because of traders on the Reddit WallStreetBets discussion board.

After loading the downloaded CSV file into a pandas DataFrame, we can display a few random samples:
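A minimal sketch of this loading step is shown below. The file name and column names are assumptions based on the public Kaggle dataset, not part of TextExplorer; a tiny stand-in frame is constructed so the snippet runs without the file present.

```python
import pandas as pd

# With the Kaggle file downloaded, loading would look like:
# df = pd.read_csv("reddit_wsb.csv")  # file/column names assumed from the dataset

# Tiny stand-in frame with a similar shape, so the snippet runs as-is:
df = pd.DataFrame({
    "title": ["GME to the moon 🚀🚀", "Hold the line", "AMC diamond hands 💎"],
    "score": [120, 45, 87],
})

# Inspect a couple of random samples, as in Figure 1
print(df.sample(n=2, random_state=42))
```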

Figure 1. A simple snapshot of the raw text.

A glimpse of the raw data suggests that the text column “title” contains a lot of rich information about these Reddit posts. Next, we will focus on getting more insights using this text column with the help of our TextExplorer EDA tool.

Simple Exploration

We designed TextExplorer's Python methods to require minimal parameters: an instance is created from the input DataFrame and the name of the column that contains the target text. During initialization, the module scans all the raw texts and computes basic word statistics. At each stage, the summary method lists the explorations performed so far and a summary of each.

For instance, Figure 2-a shows how many characters, words, and non-stop words (using a stop word list from spaCy) the entire dataset contains, computed on the WSB data's "title" column.
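The corpus-level counts in Figure 2-a can be sketched roughly as follows. This is not TextExplorer's code; the real module uses spaCy's stop word list, for which a tiny hard-coded set stands in here.

```python
# Tiny stand-in for spaCy's stop word list, for illustration only
STOP_WORDS = {"the", "to", "a", "is", "and", "of"}

def corpus_stats(texts):
    """Corpus-level counts like those reported in Figure 2-a."""
    words = [w for t in texts for w in t.split()]
    return {
        "n_chars": sum(len(t) for t in texts),
        "n_words": len(words),
        "n_words_no_stop": sum(w.lower() not in STOP_WORDS for w in words),
    }

stats = corpus_stats(["GME to the moon", "Hold the line"])
print(stats)  # {'n_chars': 28, 'n_words': 7, 'n_words_no_stop': 4}
```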

Text is usually messy, with assorted tags, mixed casing, newlines, redundant spaces, and so on. At Infinia ML, we had already built an internal text transformation pipeline that readily normalizes text by removing HTML tags, Markdown, newlines, and redundant spaces, as well as converting all characters to lowercase. We therefore added this default process to the clean method (Figure 2-a). The cleaned text is stored as a new column in the given DataFrame and is used later to facilitate text data mining.
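A minimal normalization function in this spirit could look like the sketch below; the regexes are a simplification of the internal pipeline, not its actual implementation.

```python
import re

def clean(text):
    """Default normalization: strip HTML tags, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse newlines and redundant spaces
    return text.strip().lower()

raw = "<p>GME  to the MOON!\n🚀</p>"
print(clean(raw))  # gme to the moon! 🚀

# In practice, the result would be stored as a new DataFrame column:
# df["title_clean"] = df["title"].map(clean)
```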

Words are central to any text mining task, so our tool provides further functions for inspecting them, including biased word reports (Figure 2-b), top word analysis (Figure 2-c), and word clouds (Figure 2-d). This particular dataset does not contain many biased words related to gender, names, or religion.
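Top-word analysis in the spirit of Figure 2-c can be sketched with the standard library; the sample titles here are made up for illustration.

```python
from collections import Counter

# Count word frequencies over (already cleaned) titles and report
# the most common ones, as in Figure 2-c.
titles = ["gme to the moon", "hold gme", "gme and amc"]
counts = Counter(w for t in titles for w in t.split())
print(counts.most_common(3))  # [('gme', 3), ('to', 1), ('the', 1)]
```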

It is interesting to see some numbers (such as 128640) among the most frequent words (Figure 2-c). Checking the raw texts that contain these numbers shows that they are actually HTML numeric character references for emoji: 128640 stands for 🚀, while 128142 stands for 💎. Note that this information could be translated into actual words or features when building machine learning models.
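The decoding step can be verified directly with the standard library: the numbers are Unicode code points, and in raw HTML they appear as numeric character references.

```python
import html

# The "numbers" in the top-word list are Unicode code points from
# HTML numeric character references; decoding them recovers the emoji.
print(chr(128640))                 # 🚀
print(chr(128142))                 # 💎
print(html.unescape("&#128640;"))  # 🚀 — as the reference appears in raw HTML
```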

Figure 2-d illustrates a word cloud built from the WSB posts, which vividly shows that people were constantly talking about the stocks GME (GameStop) and AMC (AMC Entertainment), the Robinhood trading app, buying, holding, and so on.

These simple statistics and visualizations can help data scientists make sense of what these posts are and how they might transform data further for more advanced analysis.

Figure 2-a. Text data cleaning
Figure 2-b. Biased word analysis
Figure 2-c. Top 10 words
Figure 2-d. Word cloud

Simple Data Mining

Can we gain more insights from raw texts besides word summary statistics? Here, some common machine learning algorithms can be helpful and handy. We provide three methods to extract named entities, perform sentiment analysis, and perform topic modeling.

Figures 3-a and 3-b illustrate that the named_entities method not only generates a summary of the extracted entities but also lets data scientists explore the data row by row using the extracted information (e.g., inspecting the texts that contain the DATE entity shown in Figure 3-b). In this particular dataset, many posts refer to dates with words such as "today", "next week", and "weeks".

Figure 3-a. Summary of named entities extracted from text.
Figure 3-b. Explore the data samples using named entities.
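The row-level exploration idea can be illustrated with a deliberately simplified stand-in: the real named_entities method relies on a statistical NER model (spaCy), whereas the regex below only catches the handful of date expressions mentioned above. Titles and pattern are made up for this sketch.

```python
import re

# Toy stand-in for DATE entity filtering. TextExplorer uses a real NER
# model; this regex only matches the date words seen in the WSB posts.
DATE_PATTERN = re.compile(r"\b(today|tomorrow|next week|weeks?|days?)\b", re.I)

titles = ["GME squeeze next week?", "Bought more AMC today", "Diamond hands"]
with_dates = [t for t in titles if DATE_PATTERN.search(t)]
print(with_dates)  # the first two titles contain a date expression
```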

Could we tell whether these posts are positive or negative? A simple method, shown in Figure 4-a, combines all the texts and analyzes the overall sentiment. The total polarity score, which ranges from -1 (negative) to 1 (positive), suggests that these posts were slightly positive but close to neutral.

We also provide row-level sentiment information. As shown in Figure 4-b, the summary statistics report the polarity, subjectivity, and number of sentences for each post. For instance, the post with 12 sentences could be an interesting data point to investigate; further analysis shows that it uses many periods, with only two or three words per sentence. And what about the posts with a polarity score of -1? (Answer: they contain many insulting words.) All these explorations help with understanding and cleaning the raw texts.

Figure 4-a. Overview of sentiment analysis on post texts.
Figure 4-b. Summary statistics about the sentiment scores.
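To make the polarity scale concrete, here is a toy lexicon-averaging scorer. Our module builds on an established sentiment library (textblob is among the packages it combines); this hand-rolled lexicon is purely an illustration of a score in [-1, 1], not the module's method.

```python
# Toy polarity scoring: average the scores of known sentiment words.
# The lexicon and its values are invented for this illustration.
LEXICON = {"love": 0.8, "great": 0.9, "moon": 0.5, "hate": -0.9, "crash": -0.7}

def polarity(text):
    """Return an average word-sentiment score in [-1, 1]; 0.0 if no hits."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("gme to the moon"))   # 0.5
print(polarity("i hate this crash")) # -0.8
print(polarity("hold the line"))     # 0.0 (no lexicon hits -> neutral)
```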

Another way to discover hidden semantic structures in texts is topic modeling. We create two methods shown in Figure 5 to train a topic model and visualize the modeling results. From the top ten words for five topics, it seems that some people talked about buying and holding GME and AMC stocks while some mentioned short selling them (possibly in relation to hedge funds). Visualizing topic clusters in Figure 5-b, we can see that some topics are conceptually closer than others. The chart on the right shows the specific words associated with a chosen sphere (shown in red on the left).

Figure 5-a. Five topics about the Reddit post dataset
Figure 5-b. A snapshot of topic visualizations using pyLDAvis.
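To show what LDA topic modeling does under the hood, here is a tiny collapsed Gibbs sampler written from scratch. This is a toy, not the module's implementation: in practice the module trains with tomotopy and visualizes with pyLDAvis, and the sample documents below are invented.

```python
import random
from collections import defaultdict

def toy_lda(docs, k=2, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns top words per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]                # per-doc topic counts
    topic_word = [defaultdict(int) for _ in range(k)]  # per-topic word counts
    topic_total = [0] * k
    assignments = []
    for d, doc in enumerate(docs):                     # random initialization
        zs = [rng.randrange(k) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1; topic_word[z][w] += 1; topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):                             # Gibbs sweeps
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]                  # remove current assignment
                doc_topic[d][z] -= 1; topic_word[z][w] -= 1; topic_total[z] -= 1
                weights = [(doc_topic[d][t] + alpha)
                           * (topic_word[t][w] + beta)
                           / (topic_total[t] + beta * vocab_size)
                           for t in range(k)]
                z = rng.choices(range(k), weights=weights)[0]  # resample topic
                assignments[d][i] = z
                doc_topic[d][z] += 1; topic_word[z][w] += 1; topic_total[z] += 1
    return [sorted(tw, key=tw.get, reverse=True)[:3] for tw in topic_word]

docs = [t.split() for t in
        ["buy gme hold gme", "hold amc buy amc",
         "short squeeze hedge fund", "hedge fund short gme"]]
print(toy_lda(docs, k=2))  # top words for each of the two topics
```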

Here, our methods focus on training a small model to quickly generate some initial insights. Then, data scientists could further fine-tune the data and models based on specific requirements for machine learning projects.

Conclusion

Analyzing documents is hard, and analyzing raw text is a difficult but essential sub-task. With the aid of our TextExplorer module, which provides simple interfaces by combining popular Python packages such as spaCy, pyLDAvis, textblob, yellowbrick, and tomotopy, our data scientists can readily explore raw text data without juggling the interfaces of multiple packages. They can gain useful insights for further processing these texts and for building more sophisticated machine learning models to solve real-world problems.

I hope the illustrations using this module with Reddit WSB posts can help you work on your own machine learning problems for document analysis.

The author would like to thank James Kotecki for his valuable feedback on this blog.
