Document Analysis Is More Than Processing Text

Xiangqian Hu
Oct 16, 2020 · 5 min read


Photo by Wesley Tingey on Unsplash

It’s not hard to understand why businesses want to use technology to handle their documents. Given the massive and growing volume of documents to process, machine help is inevitable. And machine analysis has brought greater efficiency to everything from processing medical records and insurance claims to detecting fraud in emails.

The success of any given document processing project, however, is far from preordained. Those who think of their documents simply as text may be caught off guard by a project’s difficulty and complexity.

For clarity, let’s define document analysis as analyzing and extracting information from digital documents that contain rich components such as text and graphs. The daunting challenge of building machines for this task spans many disciplines, including database systems, image processing, natural language processing, pattern recognition, and machine learning.

Why Is Document Analysis So Hard?

To analyze documents, even humans need years of training to understand words, forms, tables, and graphs. Machines, which are designed to accomplish repetitive tasks with limited generalization, face their own challenges in becoming useful. Here are three:

Big Volume: Supervised Learning Requires Human Labeling

It’s often better to train machines on documents that humans have already read and labeled. This requires people who can read, write, and understand documents for a given domain — an especially intensive and expensive process when there are many documents and/or the documents require specialized knowledge. Labeling thousands or millions of medical records, which could contain hand-written notes, for example, can be an exhausting process.

Big Diversity: Extreme Data Diversity Goes Way Beyond Text

Documents often contain rich content like digital and handwritten text, pictures, graphs, and tables. They can be derived from various sources, including text, image, video, and audio, and they can be stored in a variety of formats such as simple text strings, scanned text, rich web pages, emails, images, logs, and more.

Overall, documents are heterogeneous and unstructured. While this variety of content and format is usually suitable for human comprehension and analysis, it makes documents difficult for machines to organize, analyze, and extract information from.

Take PDF text extraction as one example. A PDF file could contain digital or scanned text, off-page or tiny characters, and unusual fonts. A single PDF file can be extremely long, with different layouts and languages. Moreover, a PDF is often not composed of text alone.
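
To make this fragility concrete, here is a minimal sketch of digital text extraction, assuming the open-source pdfminer.six package and a hypothetical file name; the point is that a scanned PDF simply has no text layer to pull out:

```python
# A minimal sketch of why PDF text extraction is fragile. It assumes the
# open-source pdfminer.six package (pip install pdfminer.six); this post
# does not prescribe any particular tool.
from pdfminer.high_level import extract_text

def extract_pdf_text(path: str) -> str:
    """Return the embedded (digital) text of a PDF, if any."""
    text = extract_text(path)
    if not text.strip():
        # A scanned PDF stores pages as images, so there is no text
        # layer to extract; OCR would be needed instead.
        print(f"{path}: no digital text layer found; OCR likely required")
    return text

if __name__ == "__main__":
    print(extract_pdf_text("sample.pdf")[:500])  # hypothetical file name
```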

Big Complexity: Words, Formats, and Models

Words, as building blocks for documents, are not easy to process. “A picture is worth a thousand words” implies that it’s much easier for humans to understand images. In a sense, this is also true of machines — grasping words poses fundamental challenges to AI systems when it comes to natural language processing (NLP), understanding, and reasoning.

Document format diversity makes the analysis pipeline even more complicated. For instance, computer vision is necessary for optical character recognition (OCR) to convert scanned documents into digital ones, which are later processed with NLP. The pipeline therefore often requires multiple machine learning models to analyze documents.
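
As a toy illustration of such a multi-model pipeline, the sketch below chains an OCR stage with an NLP stage. The choice of pytesseract and spaCy here is an assumption for illustration, not the tooling of any particular production pipeline:

```python
# A minimal sketch of a two-stage pipeline: computer vision (OCR) first,
# NLP second. pytesseract and spaCy are illustrative open-source choices.
from PIL import Image      # pip install pillow
import pytesseract         # pip install pytesseract (plus the Tesseract binary)
import spacy               # pip install spacy; python -m spacy download en_core_web_sm

def scanned_page_to_entities(image_path: str):
    # Stage 1: OCR converts the scanned page image into digital text.
    text = pytesseract.image_to_string(Image.open(image_path))
    # Stage 2: NLP extracts named entities from the recognized text.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

if __name__ == "__main__":
    print(scanned_page_to_entities("scanned_page.png"))  # hypothetical file
```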

This pipeline complexity further complicates data preprocessing and labeling as well as model development and management. Processing and labeling documents with high quality requires the ability to read and understand a given language. Data bias can be unintentionally introduced during this process, and that bias can be amplified further when multiple models are developed. These models are typically of different types, spanning different ML disciplines. Model auditing therefore becomes another necessary component of a mature document analysis pipeline.
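
One simple way to picture such an auditing component is a tracker that compares each model’s predictions against human-verified labels and flags accuracy drift. The sketch below is purely illustrative, not our actual Auditor:

```python
# An illustrative (not Infinia ML's actual) auditor: it tracks each model's
# recent accuracy against human-verified labels and flags drift, which is
# one simple way to audit a multi-model pipeline.
from collections import defaultdict, deque

class SimpleAuditor:
    def __init__(self, window: int = 500, min_accuracy: float = 0.9):
        self.min_accuracy = min_accuracy
        # Keep only the most recent `window` outcomes per model.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, model_name: str, prediction, human_label):
        """Log one prediction together with its human-verified label."""
        self.history[model_name].append(prediction == human_label)

    def flagged_models(self):
        """Return models whose recent accuracy fell below the threshold."""
        flagged = []
        for name, results in self.history.items():
            accuracy = sum(results) / len(results)
            if accuracy < self.min_accuracy:
                flagged.append((name, accuracy))
        return flagged

# Usage: auditor.record("ocr", "$1,200", "$1,200"); auditor.flagged_models()
```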

How We Do Document Analysis At Infinia ML

At Infinia ML, we are enthusiastic about applying machine learning to help companies and organizations move faster with their documents. With a group of talented data scientists and software engineers, we have developed a flexible internal toolbox, the “Infinia ML Cloud Layer”, to tackle the above challenges. Figure 1 illustrates our overall architecture.

Figure 1. Infinia ML Cloud Layer. Learn more about Infinia’s Approach to Machine Learning [VIDEO].

The Infinia ML Cloud Layer in the middle contains our core technologies, with four building blocks designed to be cloud-native and cloud-aware. All of these blocks are seamlessly interconnected and can be readily customized to different customers’ needs.

  1. The Cloud Infrastructure block handles data input/output, software development, deployment, system maintenance, security, and scalability. It powers our entire development cycle for coding, modeling, UI, and middle-tier business logic.
  2. The Library block is a mixture of open-source packages (such as scikit-learn and PyTorch) and our in-house ML technologies. We have distilled our data science experience and new ML ideas and methods into this reusable package, which speeds up model development for customers.
  3. AI/ML systems without auditing cannot be trusted. We have built our Auditor with user-friendly UIs to monitor model performance and audit machine learning pipelines.
  4. The Document Analysis block is our specialized ML application with UIs, designed to analyze documents, extract data, and display document information for review.

Analysis results are domain-specific and depend on customer needs. They could be the documents retrieved by a search query, or the information extracted from scanned documents, such as addresses, phone numbers, company names, invoice amounts, and so on.
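
As a toy example of the extraction side, the sketch below pulls phone numbers and dollar amounts out of OCR’d text with regular expressions. The patterns are illustrative and far from exhaustive; real systems typically combine such rules with learned models:

```python
# A toy sketch of field extraction from OCR'd text using regular
# expressions; the patterns here are illustrative, not production-grade.
import re

PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")
AMOUNT_RE = re.compile(r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")

def extract_fields(text: str) -> dict:
    return {
        "phone_numbers": PHONE_RE.findall(text),
        "invoice_amounts": AMOUNT_RE.findall(text),
    }

sample = "Invoice from Acme Corp. Total due: $1,250.00. Call (919) 555-0142."
print(extract_fields(sample))
# {'phone_numbers': ['(919) 555-0142'], 'invoice_amounts': ['$1,250.00']}
```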

We are also strong believers in keeping humans in the loop. The entire ML pipeline needs to be supervised by our domain experts, and their feedback can be fed back into our document analysis process.

In conclusion, machine-driven document analysis is not easy in practice, and pure text analysis is not sufficient for machines to analyze documents. We hope sharing our own experience can help inspire new ideas and speed up your document analysis processes. After all, we may never know when machines will learn by themselves, but we do know that people always learn.

The author would like to thank James Kotecki for his valuable feedback on this blog post.
