Talk about a quick start. Photo by SpaceX on Unsplash

Editor’s Note: This article includes analysis of Reddit posts with language that some may find offensive.

As I mentioned in my previous blog post, the volume, diversity, and complexity of document data make document analysis challenging. In this post, let's focus on a specific problem that arises once we have raw text data: developing a module that helps data scientists work with it efficiently.

Ideally, data scientists can quickly get an understanding of text by creating exploratory data analysis (EDA) reports. After all, we don’t want our data scientists reading through thousands of texts. But text, as one type of unstructured data…


Photo by Wesley Tingey on Unsplash

It’s not hard to understand why businesses want to use technology to deal with their documents. Given the massive and growing number of documents to process, machine help is inevitable. And machine analysis has improved efficiency in everything from processing medical records and insurance claims to detecting fraud in emails.

The success of any given document processing project, however, is far from preordained. Those who think of their documents simply as text may be caught off guard by a project’s difficulty and complexity.

For clarity, let’s define document analysis as analyzing and extracting information from digital documents that contain…

Xiangqian Hu

Director of Infinia ML Engineering. Machine Learning Lover.
