Intelligent Document Processing with Tesseract, Google Vision API and elDoc: a comparative analysis of OCR solutions

Intelligent Document Processing (IDP) is now part of the strategic automation roadmap of many high-performing organizations around the globe. Interest in IDP technology has grown drastically in the post-Covid-19 period, when enterprises were forced to seek automated solutions that could help them further reduce operational costs, increase operational efficiency and accelerate document processing, while ensuring business continuity from anywhere.

Alongside the growing interest in this technology there is, sadly, also a growing misunderstanding within the business community about what intelligent document processing actually is. Many bloggers and internet sources equate OCR solutions with IDP, which leads end-users and their managers / decision-makers to a false understanding of what OCR and IDP technology can really deliver. In this blog we will therefore try to provide more insight into these technologies with illustrative examples (demos) of how document processing works with different types of automated solutions: Tesseract, Google Vision API and elDoc.

So, let's start by formalizing the most common business requirements for document processing, to better understand which automation capabilities need to be in place to cover them.

When it comes to document processing (whatever the nature of the documents in scope), it is commonly required:

  • to classify documents by type (the automated solution must be able to distinguish different document forms: invoices, packing lists, payment advices, facility bills, reports, bank statements, claims, product specifications, drawings, enquiries, etc.);
  • to process files that may contain different types of document forms spanning hundreds of pages;
  • to locate and capture the required (target) fields from a particular document form;
  • to enhance the document (image) before processing in order to improve its quality and achieve better recognition results;
  • to locate and recognize printed text, optical marks, handwriting, logos, stamps, signatures, etc.;
  • to verify and cross-validate the data according to pre-defined scenarios, or whenever recognized data comes with a low confidence level.
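The last requirement, confidence-based validation, can be sketched in a few lines of Python. The threshold value and field names below are our own illustrative assumptions, not taken from any particular product:

```python
# Route each recognized field either to auto-acceptance or to manual review,
# depending on its recognition confidence (hypothetical threshold).
REVIEW_THRESHOLD = 0.85

def route_fields(fields):
    """fields: {name: (value, confidence)} -> (accepted, needs_review) dicts."""
    accepted, needs_review = {}, {}
    for name, (value, confidence) in fields.items():
        target = accepted if confidence >= REVIEW_THRESHOLD else needs_review
        target[name] = value
    return accepted, needs_review

accepted, review = route_fields({
    "candidate_name": ("John Smith", 0.97),
    "candidate_id": ("ST-1042", 0.62),   # low confidence: send to a human
})
```

In a real pipeline the `needs_review` bucket would feed a verification queue for human operators.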

Based on the most common business requirements listed above, let's define the scope of our verification (test task):

Scope of test task:

  • Goal: to capture and recognize specified data from a Student Transcript (candidate name, candidate ID, list of subjects with grades, issuance date) and from a Service Report (company name, company ID, list of equipment, equipment ID);
  • Document in scope: a file containing 2 (two) document forms (a one-page Student Transcript and a multi-page Service Report);
  • Automated solutions used: Tesseract, Google Vision API and elDoc.

So, let's start with Tesseract.

Tesseract is an optical character recognition engine for various operating systems. It is free software: originally developed by Hewlett-Packard as proprietary software in the 1980s, it was released as open source in 2005, and its development has been sponsored by Google since 2006.

Today Tesseract is recognized as one of the most powerful open-source OCR solutions. It supports more than 100 languages, which makes it universal and widely used around the globe. Many technology companies use Tesseract as a base for building complex automated platforms for intelligent document processing. However, it is important to mention that Tesseract is positioned as a plain OCR engine: it recognizes the whole text, but it cannot locate and capture specified data from a document (unless it is augmented with additional cognitive capabilities).

Within the scope of our test task, let's send Tesseract a multi-page PDF file that contains two types of documents, "Student Transcript" and "Service Report", as well as a one-page JPG file that contains only the "Service Report", and review the results:

Tesseract - illustrative demo (as per above scenario):



As we may see from our test task, we tried to process the multi-page PDF file with Tesseract; however, Tesseract returned the response "PDF reading is not supported". As a second attempt, we sent Tesseract the one-page document in JPG format. This time Tesseract was able to process it; however, as you may see from the results, not all data were captured correctly. Moreover, Tesseract captured all the data rather than only the target data (as per our business requirements).
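A common workaround for Tesseract's lack of PDF support is to rasterize each PDF page to an image first (for example with Poppler's pdftoppm) and then OCR each page image with the tesseract CLI. A minimal Python sketch that only builds the command lines; the file names are hypothetical:

```python
from pathlib import Path

def build_ocr_commands(pdf_path, out_dir, dpi=300):
    """Build the two-step command lines: rasterize the PDF with pdftoppm,
    then OCR each resulting page image with the tesseract CLI."""
    stem = Path(pdf_path).stem
    # pdftoppm names the pages <out_dir>/<stem>-1.jpg, <stem>-2.jpg, ...
    rasterize = ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path,
                 str(Path(out_dir) / stem)]

    def ocr(page_image):
        out_base = str(Path(page_image).with_suffix(""))  # tesseract appends .txt
        return ["tesseract", page_image, out_base]

    return rasterize, ocr

rasterize_cmd, ocr_cmd = build_ocr_commands("transcript_and_report.pdf", "pages")
# Each command could then be executed with subprocess.run(cmd, check=True).
```

This keeps the OCR step itself unchanged: Tesseract still sees only one image per invocation.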

Let's simplify our test task and send Tesseract the one-page "Student Transcript" file in JPG format to review the results.

Tesseract - illustrative demo (as per above scenario):



As we may see from the video, Tesseract recognized the data quite well. However, if we analyze the results in detail (the captured and recognized data), we may see that Tesseract also captured data from the document background (if you zoom in, you will see that the background consists of text). As a result, we received a lot of noise (unnecessary information) which is very difficult to process further.
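Faint background text like this is often suppressed with a simple pre-processing step such as global thresholding (binarization): pixels lighter than a threshold are pushed to white so that only the dark foreground glyphs survive. A toy sketch on a plain grayscale pixel grid; the threshold value is an assumption, and real pipelines typically use adaptive methods such as Otsu's:

```python
def binarize(pixels, threshold=160):
    """Global threshold on a grayscale image given as rows of 0-255 values:
    anything lighter than `threshold` becomes white (255), the rest black (0)."""
    return [[255 if value >= threshold else 0 for value in row]
            for row in pixels]

# Faint background text (values around 200) disappears; dark glyphs (around 30) stay.
page = [[30, 200, 30],
        [200, 30, 200]]
clean = binarize(page)
```

Feeding the binarized image to the OCR engine would remove most of the background noise seen in the demo.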

Key Tesseract advantages:

  • Tesseract is a very powerful OCR engine and, if used correctly (with the required fine-tuning), may deliver quite high recognition results;
  • Tesseract is an open-source solution that doesn't require any investment / license cost.

Key Tesseract limitations, i.e. what is still required in order to process documents comprehensively:

  • it does not support processing of documents in PDF format;
  • the captured data are returned as raw, unstructured text, together with captured noise;
  • there is no possibility to locate and capture the required (target) data / fields;
  • there is no possibility to enhance the image before processing (properly rotate and scale the image; remove unnecessary noise or artefacts from the image, etc.);
  • there is no possibility to process a file that contains different document forms with subsequent document form classification and target field (data) capture;
  • there is no possibility to process multi-page files where data, for example in table format, span dozens of pages, in order to capture required data / fields that may be placed dynamically on any page of the submitted file;
  • there is no possibility to cross-validate the data automatically (per defined criteria) or to validate data whose recognition confidence is below a defined threshold;
  • there is no possibility to track and monitor the recognition queue during mass document processing;
  • there is no possibility to review the recognition results in a user-friendly format or to perform the required audit and monitoring.
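When only raw text is available, target fields are usually pulled out in a separate post-processing step, for example with regular expressions. A sketch for the Student Transcript fields from our test task; the label strings ("Name:", "Candidate ID:", "Issued:") are hypothetical and would have to match the actual document layout:

```python
import re

# Hypothetical label patterns; one capture group per target field.
FIELD_PATTERNS = {
    "candidate_name": r"Name:\s*(.+)",
    "candidate_id":   r"Candidate ID:\s*([A-Z0-9-]+)",
    "issuance_date":  r"Issued:\s*([\d./-]+)",
}

def extract_fields(raw_text):
    """Scan raw OCR output and return whichever target fields were found."""
    found = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, raw_text)
        if match:
            found[field] = match.group(1).strip()
    return found

sample = "Student Transcript\nName: John Smith\nCandidate ID: ST-1042\nIssued: 12/05/2021"
fields = extract_fields(sample)
```

This is exactly the "additional cognitive capability" layer that a plain OCR engine leaves to the integrator, and it is brittle: any layout change breaks the patterns.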

Let's now perform the same test with Google Vision API.

Google Vision API is a cloud service designed to derive in-depth analytical information from images. The service uses pre-trained Vision API models to detect emotions, understand text, and more.

We will replicate the actions we performed for Tesseract: namely, let's send Google Vision API the multi-page PDF file that contains the two document types "Student Transcript" and "Service Report", as well as the one-page JPG file that contains only the "Service Report", and review the results:

Google Vision API - illustrative demo (as per above scenario):



As we may see from the test demo, Google Vision API (important note: the available trial version) doesn't support processing documents in PDF format, so we had to split the multi-page file into pages and convert them to JPG format. The recognition results we received are quite good; however, we may also see that the captured data are far from a structured format, which makes further processing difficult.
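Vision API does return word-level bounding boxes, so one common way to impose structure on its output is to group words into lines by their vertical position. A simplified sketch on hypothetical (text, x, y) tuples standing in for the API's vertex data:

```python
def group_into_lines(words, y_tolerance=6):
    """Group (text, x, y) word tuples into reading-order lines:
    words whose y coordinates are within `y_tolerance` share a line."""
    lines = []  # each entry: {"y": anchor_y, "words": [(x, text), ...]}
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(lines[-1]["y"] - y) <= y_tolerance:
            lines[-1]["words"].append((x, text))
        else:
            lines.append({"y": y, "words": [(x, text)]})
    # Within each line, order words left to right and join them.
    return [" ".join(t for _, t in sorted(line["words"])) for line in lines]

demo = [("Transcript", 90, 101), ("Student", 10, 100),
        ("Smith", 70, 131), ("John", 10, 130)]
structured = group_into_lines(demo)
```

Real pages need more care (skew, columns, tables), which is precisely where a plain recognition API stops and a document-processing platform begins.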

As with Tesseract, let's simplify the task and send Google Vision API the one-page "Student Transcript" file in JPG format to review the results.

Google Vision API - illustrative demo (as per above scenario):


As we may see from the video, Google Vision API tried to structure the data and the recognition results are quite good; however, like Tesseract, Google Vision API also recognized the background as text, which makes further processing difficult as there is a lot of noise in the retrieved data set.

Important note: the above test (analysis) was performed using the public (trial) version of Google Vision API.

Key Google Vision API advantages:

  • a powerful tool built with machine learning and in-depth analysis capabilities for processing images;
  • the solution delivers very high recognition results.

Key Google Vision API limitations, i.e. what is still required in order to process documents comprehensively:

  • the captured data are presented together with captured noise;
  • there is no possibility to enhance the image before processing (properly rotate and scale the image; remove unnecessary noise or artefacts from the image, etc.);
  • there is no possibility to process a file that contains different document forms with subsequent document form classification and target field (data) capture;
  • there is no possibility to process multi-page files where data, for example in table format, span dozens of pages, in order to capture required data / fields that may be placed dynamically on any page of the submitted file;
  • there is no possibility to cross-validate the data automatically (per defined criteria) or to validate data whose recognition confidence is below a defined threshold;
  • there is no possibility to track and monitor the recognition queue during mass document processing;
  • there is no possibility to review the recognition results in a user-friendly format or to perform the required audit and monitoring.

As the third solution within the scope of our test task, let's take the Intelligent Automated Platform elDoc and review how it handles the same test task.

elDoc is an Intelligent Integrated Platform for Document Understanding and end-to-end Document Processing. elDoc is powered with cognitive capabilities, including Computer Vision and advanced AI-based mathematical models, designed to process images of varying complexity. For the recognition part, elDoc uses the most recent version of Tesseract, based on neural networks and machine learning.

elDoc - illustrative demo:


As we may see from the demo, we performed the same task as for Tesseract and Google Vision API: we uploaded into elDoc the multi-page PDF file that contains the two document types "Student Transcript" and "Service Report". The processing results demonstrate that the elDoc system performed the following operations:

  • automatically enhanced the image before processing (removed noise from the background, improved image quality, normalized the image);
  • automatically classified the documents by type (Student Transcript, Service Report), processing the multi-page PDF file correctly;
  • captured and recognized the required (target) data, converting them to a structured format and providing a confidence level for each recognized field.
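To give a feel for the classification step, here is a deliberately naive keyword-scoring sketch (this is our own illustration, not elDoc's actual method; the cue words are assumptions):

```python
# Hypothetical cue words per document type in our test task.
DOC_KEYWORDS = {
    "student_transcript": {"transcript", "grade", "subject", "candidate"},
    "service_report": {"service", "equipment", "technician", "company"},
}

def classify_page(page_text):
    """Pick the document type whose cue words overlap the page text the most."""
    tokens = set(page_text.lower().split())
    scores = {doc_type: len(keywords & tokens)
              for doc_type, keywords in DOC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

doc_type = classify_page("Student Transcript - candidate grades per subject")
```

Production systems replace keyword sets with trained classifiers, but the contract is the same: one document type per page (or page range), which then drives which target fields are captured.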

Consequently, we may draw the following conclusions from the performed analysis (test):

  • if it is only required to extract data as raw text (the whole text), with the aim of converting the captured data to an editable format or a format suitable for further processing by other automated capabilities, powerful engines such as Google Vision API, Tesseract or similar ones can be leveraged. These solutions can also be considered a great base for building a complex intelligent automated solution for document processing and document understanding;
  • if the goal of the project is to process documents from an end-to-end perspective, capturing the target (specified) data from scanned documents with further data handling (cross data validation, verification, review, maintenance, etc.), it makes sense to consider complex automated solutions that are powered not only with OCR but also with other cognitive capabilities for processing images intelligently (image enhancement, multi-page and multi-format handling, target data capture), as well as with further capabilities for end-to-end document processing: data verification, cross data validation, document review, document archiving, document access control, simultaneous document editing, monitoring, audit, analytics & reporting, etc.

To learn more about end-to-end Intelligent Document Processing, please visit the following link - elDoc

About «elDoc»
«elDoc» is an enterprise-level solution for end-to-end document workflow automation and intelligent document processing. «elDoc» is powered with cognitive technologies to intelligently process documents (both scanned and digitally generated), capturing the required data by converting semi-structured and unstructured data into a structured format for further processing and end-to-end automation.

About «DMS Solutions»
«DMS Solutions» is a technology company delivering Intelligent Automation Solutions. «DMS Solutions» is the vendor of «elDoc», an Intelligent Integrated Platform for Document Processing (IDP - Intelligent Document Processing) and Document Workflow Automation (BPM - Business Process Management).
«DMS Solutions» is your professional implementation service partner in the field of Intelligent Automation and Advanced Robotic Process Automation. We leverage Computer Vision, Machine Learning and Artificial Intelligence to build a powerful digital workforce that helps your business win in the market. «DMS Solutions» is a Gold Implementation Partner of UiPath, an official re-seller of UiPath products in Hong Kong and APAC, and an officially certified Advanced Technology UiPath Partner in the field of intelligent document understanding (Intelligent OCR).