A Quick Introduction to AWS Textract
AWS Textract is a service that does what its name suggests: it extracts text from documents. The official description from the AWS site is “Textract is a managed machine learning service that automatically extracts text and structured data from virtually any document.” That’s a mouthful of buzzwords, but at its core, the service examines a graphical document, finds the text in it, tries to make sense of that text where possible, and returns the results to the user.
Textract is a fairly new service. Introduced in beta at re:Invent 2018, it reached general availability for most users in May of 2019. Because it’s so new, some aspects still feel unpolished or in need of improvement. One example is a head-scratching 10-page limit for PDF documents. This limit is a hurdle for customers, such as government agencies, that are considering Textract as an input mechanism for forms like the SF-182 (a US Government training form). Like other US Government forms, the SF-182 has 14 total pages, but only 2-3 of them hold fillable information; the rest are instructions. The pages with form data could be separated from the instructions, but that would require another step. Unfortunately, the document in its unaltered state can’t be used with Textract.
The service works with JPG, PNG, and PDF files, and documents can be read from an S3 bucket. Textract also has a GUI in the AWS console that’s nice for trying out examples and for getting a sense of what the service identifies as text in a document.
Once Textract has processed a document, the results can be downloaded. By default, you get a zip file that includes a JSON file, a CSV of key-value pairs, a text file of the raw text, and a CSV file for each table that could be identified.
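To give a feel for what’s inside that JSON file, here is a minimal sketch that pulls out just the lines of text. It assumes the file has the same shape as a Textract API response: a top-level "Blocks" list whose entries carry a "BlockType" (PAGE, LINE, WORD, and so on) and, for text blocks, a "Text" field. The function name and the sample data are made up for illustration; the sample is hand-written, not real Textract output.

```python
import json
from typing import List


def lines_from_textract_json(raw_json: str) -> List[str]:
    """Return the text of every LINE block, in the order the file lists them."""
    response = json.loads(raw_json)
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


# Hand-written sample in the assumed response shape (not real output).
sample = json.dumps({
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Name of Applicant"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Name"},
        {"BlockType": "LINE", "Id": "l2", "Text": "Jane Doe"},
    ]
})

print(lines_from_textract_json(sample))
```

WORD blocks repeat the same text one word at a time, which is why the sketch keeps only LINE blocks.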
Realistically, the console GUI isn’t how people will process large collections of files. The Textract API exists so that the service can be a component in a larger workflow application. AWS’s power comes from the fact that tools such as Textract can be combined with other AWS capabilities in an almost low-code manner.
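As a rough idea of what an API call looks like, here is a small sketch using boto3, the AWS SDK for Python. It calls the synchronous DetectDocumentText operation on an image that already sits in S3 and returns the detected lines. The helper name detect_lines, the bucket, and the key are placeholders of my own, and the sketch assumes AWS credentials and a region are already configured. As I understand it, multi-page PDFs go through the asynchronous StartDocumentTextDetection operation instead, which isn’t shown here.

```python
def detect_lines(bucket, key, client=None):
    """Run Textract text detection on an S3 object and return its lines.

    `client` may be any object with a detect_document_text method; when it is
    omitted, a real boto3 Textract client is created (requires boto3 and
    configured AWS credentials).
    """
    if client is None:
        import boto3  # imported lazily so the helper can be exercised without AWS
        client = boto3.client("textract")

    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # Keep only LINE blocks; WORD blocks repeat the same text word by word.
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Example call (placeholder bucket and key, needs real AWS access):
# for line in detect_lines("my-example-bucket", "scans/form-page-1.png"):
#     print(line)
```

Taking the client as an optional argument is just a convenience: the same function could run inside a Lambda triggered by an S3 upload, or be exercised locally with a stand-in client.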
Textract has shown itself to be a tool worth watching. If you can live within its limits, it’s worth a deeper look. AWS may have more announcements at this year’s re:Invent. If so, I may revisit this and share my thoughts on the updates.