EPI-USE MENDIX | Articles

RAD AI – Ignite | Episode 3: Extract and Organize Data

Written by Matthew Daniels | Jun 4, 2024 2:00:00 PM

This is the third installment of RAD AI – Ignite, a three-part series on use cases to begin realizing the value of AI/ML technologies in your organization. These use cases illustrate the speed and power of combining Amazon Web Services (AWS) managed AI/ML services and rapid application development with Mendix.

 



Documents and forms are large components of business processes, even today. These documents range from being standardized, like receipts and government IDs, to unstandardized, like government and company-created forms. These documents contain a lot of information, but that data is difficult to access, search, and understand. For that reason, extracting information into a digital format (e.g. storing it in a database) has been a high-value prize.

Optical character recognition (OCR) has been around for decades but has generally suffered from limitations due to its character-by-character nature. This leads to large text dumps that often lack accuracy and are very challenging to work with. OCR services generally do not utilize sequential modeling and contextual dependencies in characters – that is, they do not typically “see the bigger picture” when analyzing documents. That can make it difficult to extract form (key / value) and table (column / row) data, both of which are very common in documents.

In this installment, we’ll tackle this digitization problem with Amazon Textract integrated into a Mendix-built application. Amazon Textract accurately breaks documents into related chunks that are passed along via JSON – this allows you to easily identify form data, table data, and more. This information is easily integrated into Mendix, which makes building workflows and logic around the extracted data much simpler. For example, this allows you to pre-populate forms, connect extracted data to external systems (like an ERP, timekeeping system, or CRM), and build review and approval workflows with ease.

AWS Service: Amazon Textract

Amazon Textract is a fully managed offering from Amazon that leverages an affordable pay-as-you-go model. Textract is a highly scalable solution used to analyze and extract data from documents. The Textract service goes beyond standard OCR – it is enhanced with machine learning capabilities that help it understand document layouts and concepts like tabular data. Amazon invests in research to improve OCR and document understanding technology and publishes research on these topics. Those advancements are then incorporated into the underlying models that Textract uses to understand documents. This alleviates the need for organizations to hire ML specialists to constantly update and improve the model as technology improves, which is valuable in the rapidly evolving field of document processing.

In addition to returning the extracted data, Textract also includes a confidence score, ranging from 0 to 100, for each extracted piece of data.

Standardized Documents

Textract has specific implementations for standardized documents like receipts and identification documents. These implementations apply a normalized data model that contains the standard information for these types of documents. For example, if different vendors use “Subtotal” or “Net sales” on their receipts, Textract will recognize that these are equivalent fields and return that data including the “SUBTOTAL” normalized field name. This can be very helpful for building applications that deal with expense reimbursement and tracking.

Similarly, the identification service can pull all standard Driver’s License / Passport fields from an image or scan and return the normalized data. These documents do not always have field titles. For example, your name is likely just written on your ID without a label, but the Textract service will correctly pick up this field and return it with the “First name” field tag. This can be very helpful in adding age and identity verification components into an application.

Non-Standardized Documents

Most forms, however, are non-standardized. These are forms that are created by governments, organizations, and companies to collect information. These forms can be in any format with any number of fields and/or tables. Due to their highly customized nature, these types of forms can be difficult to automate, and off-the-shelf solutions for them are hard to come by.

Textract includes a service called “Analyze Document” that is built specifically for these use cases. This service takes in a document (PDF, JPEG, or PNG), analyzes the structure as well as the contents of the document, then returns the extracted information. This information includes the text from the form as well as the relationships between those text fields (e.g. Key/Value for a form field).
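As a rough illustration of what those relationships look like in practice, the Python sketch below walks a response shaped like the Analyze Document output for form data and pairs each key with its value. The sample response is hand-written and heavily trimmed, and the function names and the "Employee ID" field are assumptions made for this example; the demo application handles this step with Mendix logic rather than Python.

```python
from typing import Dict, List


def text_of(block: dict, by_id: Dict[str, dict]) -> str:
    """Join the text of a block's WORD children into one string."""
    words: List[str] = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = by_id[child_id]
            if child["BlockType"] == "WORD":
                words.append(child["Text"])
    return " ".join(words)


def extract_form_fields(response: dict) -> Dict[str, str]:
    """Map each form label (KEY block) to the text of its linked VALUE block."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    fields: Dict[str, str] = {}
    for block in response["Blocks"]:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(text_of(by_id[v], by_id) for v in rel["Ids"])
        fields[text_of(block, by_id)] = value_text
    return fields


# Hand-written, trimmed sample (hypothetical IDs and text, not real service output).
sample_response = {"Blocks": [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Employee"},
    {"Id": "w2", "BlockType": "WORD", "Text": "ID:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "E1042"},
]}

print(extract_form_fields(sample_response))
```

The key and value are separate blocks linked by ID, which is why the sketch first indexes every block by its Id before following the relationships.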

Since non-standardized documents are the more common and more challenging case, they are what we’ll highlight in this demo.

Bringing the App to Life – Rapid Application Development with Mendix

Amazon Textract can identify and break out the different pieces of data in a form, but then what is done with this data? How do we take the information that has been extracted and make it valuable to our organization?

The Mendix platform is an excellent application building tool that supports visual modeling and is perfect for complex logic and workflows. This simplifies building applications that incorporate the extracted data from Textract. For example, we can extract the information from a document, store that in a database, and create validations and workflows from this data. You may be familiar with a similar process from using tax filing software. In those applications, income documents are generally uploaded as a starting point for a guided tax return process. You can quickly build a similar experience leveraging out-of-the-box functionality with Textract and a Mendix-built application.

Setting the Stage

Our demo application is centered around a timesheet form use case. These timesheets are particular to our company and are filled out daily. The form contains a few key identifiers – the employee ID of the individual that the timesheet belongs to and task IDs for how they’ve spent their time. These IDs are foreign keys that correspond to data stored elsewhere in our IT ecosystem.

 
Part 1: Extracting Data from PDF

The landing page of our application consists of an upload dialog for the filled out timesheet document. Upon upload, this form is passed to Amazon Textract which responds with a block representation of all data that is available on the form. This includes important contextual information like the relationship between form labels and data entered and the column and row number for each cell in our tabular data.

That allows us to build logical flows to extract this data and populate normalized data tables with it. We can easily extract the form data, like employee ID, as well as the table data that stores each timesheet line entry. This can then be stored in our database.
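To make the table side of this concrete, here is a minimal Python sketch that rebuilds a table from cell blocks carrying row and column indexes. The sample response is hand-written and trimmed, and the function name and the timesheet values are assumptions for illustration only; in the demo this logic lives in the Mendix application.

```python
from typing import Dict, List, Tuple


def table_to_rows(response: dict) -> List[List[str]]:
    """Rebuild the first TABLE block in a response as a list of rows."""
    by_id = {b["Id"]: b for b in response["Blocks"]}
    table = next(b for b in response["Blocks"] if b["BlockType"] == "TABLE")

    cells: Dict[Tuple[int, int], str] = {}
    for rel in table.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for cell_id in rel["Ids"]:
            cell = by_id[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            # Collect the words that belong to this cell.
            words = [
                by_id[word_id]["Text"]
                for cell_rel in cell.get("Relationships", [])
                if cell_rel["Type"] == "CHILD"
                for word_id in cell_rel["Ids"]
                if by_id[word_id]["BlockType"] == "WORD"
            ]
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

    # Row and column indexes are 1-based; empty cells become empty strings.
    n_rows = max(row for row, _ in cells)
    n_cols = max(col for _, col in cells)
    return [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]


# Hand-written, trimmed sample (hypothetical IDs and values).
sample_response = {"Blocks": [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w5"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Task"},
    {"Id": "w2", "BlockType": "WORD", "Text": "ID"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Hours"},
    {"Id": "w4", "BlockType": "WORD", "Text": "T-7"},
    {"Id": "w5", "BlockType": "WORD", "Text": "3.5"},
]}

rows = table_to_rows(sample_response)
print(rows)
```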

The data can also be easily enriched. By extracting the employee ID and task IDs from the document, we can then look up these foreign keys and connect them to source data from external systems. In our example, this data could be sourced from integrated ERP systems (e.g. SAP, ADP) or Timekeeping systems. Mendix provides many out-of-the-box connectors to enterprise software solutions and provides robust support for custom integrations.
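The lookup step can be sketched in a few lines of Python. Everything here is hypothetical: the in-memory dictionaries stand in for master data that would really come from an integrated ERP or timekeeping system, and the names, IDs, and function are invented for the example.

```python
from typing import Dict, List, Optional

# Hypothetical master data; in practice this would be sourced from external systems.
EMPLOYEES: Dict[str, str] = {"E1042": "Dana Smith"}
TASKS: Dict[str, str] = {"T-7": "Site inspection", "T-9": "Safety training"}


def enrich_lines(employee_id: str,
                 lines: List[Dict[str, str]]) -> List[Dict[str, Optional[str]]]:
    """Attach employee and task names to extracted timesheet lines.

    Unknown IDs resolve to None so they can be flagged rather than dropped.
    """
    employee_name = EMPLOYEES.get(employee_id)
    enriched: List[Dict[str, Optional[str]]] = []
    for line in lines:
        enriched.append({
            "employee_id": employee_id,
            "employee_name": employee_name,
            "task_id": line["task_id"],
            "task_name": TASKS.get(line["task_id"]),
            "hours": line["hours"],
        })
    return enriched


result = enrich_lines("E1042", [
    {"task_id": "T-7", "hours": "3.5"},
    {"task_id": "T-404", "hours": "1.0"},  # ID with no match in the master data
])
for line in result:
    print(line)
```

Returning None for an unmatched ID, instead of raising an error, keeps the whole timesheet available for a later review step.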

Here, we can see how our application easily takes a timesheet upload and parses out all information including associating the correct employee and timesheet task for each line that the employee has recorded:



Part 2: Flexibility for Additional Formats

Being able to extract the data and connect it to master data from our other systems is great, but what if our forms take on different formats?

Since Textract breaks down each form into connected components (“blocks”), this becomes a simple exercise for our custom-built application. We can extract the data even when the form design, layout and fields have variation.

Here, our form has a different style and the fields are in different areas, but we can still extract the information with the same level of accuracy. This is thanks to Textract and the ease of coding logic in a Mendix application. We can easily create logic to identify and parse the different columns to add flexibility so that we can capture time and tasks accurately even if they are in different locations for each form variation.
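One way to get this kind of flexibility is to locate columns by their header text instead of their position. The Python sketch below assumes the table has already been extracted into rows of strings; the alias list, column names, and sample layouts are all hypothetical, and the demo implements the equivalent logic in Mendix.

```python
from typing import Dict, List

# Hypothetical header variants mapped to canonical field names.
HEADER_ALIASES: Dict[str, str] = {
    "task id": "task_id",
    "task": "task_id",
    "hours": "hours",
    "hrs": "hours",
}


def rows_to_records(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Turn table rows into records keyed by canonical field name.

    The first row is treated as the header. Columns with unrecognized
    headers are ignored, so extra columns do not break the parsing.
    """
    header = [HEADER_ALIASES.get(cell.strip().lower()) for cell in rows[0]]
    records: List[Dict[str, str]] = []
    for row in rows[1:]:
        records.append({name: cell
                        for name, cell in zip(header, row)
                        if name is not None})
    return records


# Two hypothetical form variations with different column order and labels.
layout_a = [["Task ID", "Hours"], ["T-7", "3.5"]]
layout_b = [["Hrs", "Notes", "Task"], ["3.5", "on site", "T-7"]]

records_a = rows_to_records(layout_a)
records_b = rows_to_records(layout_b)
print(records_a)
print(records_b)
```

Both layouts yield the same task and hours values even though the columns are labeled and ordered differently.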



Part 3: Handling Images of Handwritten Forms

The Textract service provides out-of-the-box support for handwritten forms as well. This allows you to capture images (phone, scan, etc.) of filled-out forms and easily parse them in the same way as the digital forms in the previous examples.

Here, we can see that a picture of a handwritten form is also picked up with similar accuracy. This is all handled by the Textract service and does not require any additional coding or configuration.

Wrapping Up / Final Thoughts

Amazon Textract is a powerful tool to augment and/or replace form or paper-based processes. It allows you to extract the data in a meaningful way since the logical connections between the datapoints are maintained. For example, form fields are returned as key-value pairs and tables are returned with row and column indexes. This allows application logic to extract the data more easily, even when the format of a document changes or exists in multiple versions. This, in turn, makes your applications more robust in supporting future updates.

Amazon has researchers who dig into OCR and document understanding technology – they publish this research publicly and incorporate these learnings and techniques into the Textract service. Some examples are recent improvements to form field recognition in low-quality images and foreign currency support. This allows organizations to take advantage of cutting-edge capabilities without requiring heavy investment, making it a compelling option for many.

Mendix and the out-of-the-box Textract connector allow you to take advantage of the speed and power of low-code in addition to Amazon Textract. This allows you to build applications that extract the data and turn it into valuable insights quickly.

Further Extensibility

Amazon Textract has a myriad of use cases: paper or form-based processes, expense reports, and identity verification are high-level categories where this service is valuable. The service can also pull plain text, which is useful if there are documents or notes (e.g. meeting minutes) that need to be digitized. These are all possibilities with the Textract service.

In addition to the extracted data, the confidence score that is included with each piece of data can be used to automate human-approval processes in your application. This step, where we determine what is done with the extracted data, is where Mendix shines. Mendix provides flexibility to rapidly iterate the solution and visually model complex logic and workflows with its integrated full-stack IDE. It also simplifies the integration of other systems. ERPs, in our example, can be integrated with platform-supported connectors and robust integration support as needed.
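A simple version of confidence-based routing might look like the Python sketch below. The threshold of 90, the field names, and the scores are assumptions chosen for illustration; a real cut-off would depend on the document type and how costly an error is.

```python
from typing import Dict, Tuple

REVIEW_THRESHOLD = 90.0  # hypothetical cut-off on the 0 to 100 confidence scale


def split_by_confidence(
    fields: Dict[str, Tuple[str, float]],
    threshold: float = REVIEW_THRESHOLD,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Separate extracted fields into auto-accepted and needs-review groups.

    `fields` maps a field name to a (value, confidence) pair.
    """
    accepted: Dict[str, str] = {}
    needs_review: Dict[str, str] = {}
    for name, (value, confidence) in fields.items():
        target = accepted if confidence >= threshold else needs_review
        target[name] = value
    return accepted, needs_review


# Hypothetical extracted values with their confidence scores.
extracted = {
    "employee_id": ("E1042", 99.2),
    "hours": ("3.5", 71.4),
}

accepted, needs_review = split_by_confidence(extracted)
print("accepted:", accepted)
print("needs review:", needs_review)
```

Fields in the second group could then be sent to a human approval step, while the rest are stored automatically.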

The iterative nature of Mendix projects vastly improves project success when dealing with emerging technologies and novel use cases, which we are highlighting in our RAD AI – Ignite series.


Be sure to check out additional installments of our RAD AI – Ignite series here:

RAD AI – Ignite | Episode 1: Unlock Your Organization’s Knowledge

Your organization sits on a wealth of knowledge and experience. This information is often locked away inside disparate software and file systems which can take significant effort to parse through. How do you unlock this knowledge and provide it to employees at critical points in their workflow?


 

RAD AI – Ignite | Episode 2: Identify and Address Anomalies

Proactively catching mistakes and potential fraud is critical within organizations. Complex rules engines are costly, time-consuming, and generally require modification as new threats emerge. Instead, what if we could leverage past and future data to better identify anomalies in a dynamic environment?
