AWS AI Services Programming Series - Part1 (Polly, Translate & Textract)

5: Programming with Amazon Textract

Amazon Textract automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) because it can also recognize structured content such as forms and tables. Textract therefore helps not only with digitization but also with taking action based on the data in a document.

In this task, you write two programs: one to extract text from a scanned document, and another to read and parse form fields and their values from a scanned form.

  1. Before you start coding, upload two scanned documents to an S3 bucket; you will use them later when programming with Textract. Go to the AWS S3 Management console and create a bucket named dojo-textract-bucket. If this bucket name is not available, create a bucket with a name that is available. Then upload the two scanned documents textractsample1.png and textractsample2.png. You can download the scanned documents from the links below.

    textractsample1.png

    textractsample2.png
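
    If you prefer to script the upload instead of using the console, the step above can be sketched with boto3. This is only a sketch under assumptions: the bucket already exists, both PNG files are in the current directory, and AWS credentials are configured. The helper name upload_samples is hypothetical and is not part of the workshop code.

```python
def upload_samples(s3_client, bucket, filenames):
    """Upload each local file to the bucket, using the file name as the object key."""
    uploaded = []
    for filename in filenames:
        # upload_file(Filename, Bucket, Key) is the boto3 S3 client upload call.
        s3_client.upload_file(filename, bucket, filename)
        uploaded.append(filename)
    return uploaded


def main():
    # boto3 is imported here so the helper above can be exercised without AWS access.
    import boto3

    keys = upload_samples(
        boto3.client("s3"),
        "dojo-textract-bucket",  # replace with your own bucket name if it differs
        ["textractsample1.png", "textractsample2.png"],
    )
    print("Uploaded:", keys)
```

    Calling main() performs the upload against your account. Passing the client in as a parameter keeps the helper easy to try out with a stand-in object.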

  2. The scanned documents are ready. In the first task, you will write code to extract the text from the textractsample1.png document, which contains a paragraph from the AWS Security Whitepaper.

  3. Go to the AWS Cloud9 console and click on the New File option under the File menu.

  4. A new file named Untitled1 is created. Write the following code in the Untitled1 file.

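    import boto3

    # Create a client for the Textract service.
    textract_client = boto3.client('textract')

    # Detect text in the scanned document stored in the S3 bucket.
    response = textract_client.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': "dojo-textract-bucket",
                'Name': "textractsample1.png"
            }
        })

    # Print the text of every block that represents a line.
    for item in response["Blocks"]:
        if item["BlockType"] == "LINE":
            print(item["Text"])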

  5. In the code above, you first create a Textract client. You then call the detect_document_text method, passing the location of the scanned document in the S3 bucket. If you created the bucket with a different name, use that name here. Instead of an S3 location, you can also pass the scanned document as bytes. The detect_document_text method detects the text in the scanned document and returns it in the response as a list of blocks. The code then loops over the blocks and prints the text of each block of type LINE.
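
    The option of passing the document as bytes can be sketched as follows. The sketch assumes a local copy of textractsample1.png in the current directory; the helper names document_from_file and extract_lines are hypothetical and only illustrate the idea.

```python
def document_from_file(path):
    """Build the Document argument from a local file instead of an S3 object."""
    with open(path, "rb") as f:
        return {"Bytes": f.read()}


def extract_lines(response):
    """Return the text of every LINE block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block["BlockType"] == "LINE"
    ]


def main():
    # boto3 is imported here so the helpers above can be tried without AWS access.
    import boto3

    client = boto3.client("textract")
    response = client.detect_document_text(
        Document=document_from_file("textractsample1.png")  # assumed local file
    )
    for line in extract_lines(response):
        print(line)
```

    Calling main() sends the image bytes directly to Textract, so no S3 bucket is involved in this variant.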

  6. Click on the Save option under the File menu. On the Save As popup, type in dojotextract1.py as the file name and click on the Save button. The code is saved in the file.

  7. In the console window, run the python dojotextract1.py command to execute the code. The code finishes quickly, and the text detected in the scanned document is printed in the console window.

  8. That was an example of using Textract to detect text in a scanned document. The next task is more interesting: reading the fields and their values from a scanned form. This workshop uses a UK Border Control landing form as the example.

  9. To analyze a structured document such as a form or a table, you use a different method, analyze_document, which takes an additional parameter, FeatureTypes. To work with the result, you can use the Amazon Textract Results Parser published on GitHub. For this workshop, the code is provided at the link. Download the code file trp.py and use the Upload Local Files… option under the File menu to upload trp.py to the Cloud9 environment. In short, trp.py is a parser for the Textract analysis result.
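
    To give a feel for what such a parser does, here is a minimal sketch that pairs form keys with their values directly from the raw analyze_document response. It is not the trp.py implementation. It relies on the documented response layout, in which KEY_VALUE_SET blocks link a key to its value through a VALUE relationship and to their words through CHILD relationships, and it ignores selection elements such as checkboxes. The function names are hypothetical.

```python
def _child_text(block, blocks_by_id):
    """Join the text of the WORD blocks a block points to via CHILD relationships."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def extract_key_values(response):
    """Map each form key to its value text (empty string when the field is blank)."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    result = {}
    for block in response["Blocks"]:
        is_key = (
            block["BlockType"] == "KEY_VALUE_SET"
            and "KEY" in block.get("EntityTypes", [])
        )
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # Follow the link from the key block to its value block(s).
                value_text = " ".join(
                    _child_text(blocks_by_id[value_id], blocks_by_id)
                    for value_id in rel["Ids"]
                )
        result[_child_text(block, blocks_by_id)] = value_text
    return result
```

    The parser library wraps this kind of traversal in objects such as pages, forms and fields, which is why the workshop code can stay short.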

  10. The parser is in place. It is time to move to the second task and analyze a form. Create another file dojotextract2.py with the code shown below.

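    import boto3
    import trp  # parser for the Textract analysis result (trp.py uploaded earlier)

    # Create a client for the Textract service.
    textract_client = boto3.client('textract')

    # Analyze the scanned form stored in the S3 bucket, asking for form data.
    response = textract_client.analyze_document(
        Document={
            'S3Object': {
                'Bucket': "dojo-textract-bucket",
                'Name': "textractsample2.png"
            }
        },
        FeatureTypes=["FORMS"])

    # Wrap the raw response in the parser's document object.
    doc = trp.Document(response)

    # Print every form field with its value, page by page.
    for page in doc.pages:
        print("The Fields in the Form are:")
        for field in page.form.fields:
            print("Field Name: {}, Field Value: {}".format(field.key, field.value))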
    
  11. The code above first imports the trp library, which parses the Textract analysis result. It then creates a Textract client and calls the analyze_document method, passing the location of the scanned document in the S3 bucket and FeatureTypes as parameters. If you created the bucket with a different name, use that name in the parameter. Instead of an S3 location, you can also pass the scanned document as bytes. The FeatureTypes parameter determines the type of analysis to perform; its values include TABLES and FORMS. Since you are analyzing a form, you pass FORMS. The analyze_document method analyzes the form and extracts the fields and their values. The code then uses the trp library to parse the analysis response and print the form field names and values.
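
    For comparison, the TABLES feature type can be sketched in the same style. This assumes the parser exposes tables as page.tables, with rows, cells and cell.text, as the published Amazon Textract Results Parser does. The document name your-table-document.png is a placeholder, since the workshop samples do not include a table, and tables_as_lists is a hypothetical helper.

```python
def tables_as_lists(doc):
    """Convert every table in a parsed document into a list of rows of cell text."""
    tables = []
    for page in doc.pages:
        for table in page.tables:
            rows = [[cell.text.strip() for cell in row.cells] for row in table.rows]
            tables.append(rows)
    return tables


def main():
    # Imported here so the helper above can be tried without AWS access or trp.py.
    import boto3
    import trp

    client = boto3.client("textract")
    response = client.analyze_document(
        Document={
            "S3Object": {
                "Bucket": "dojo-textract-bucket",
                "Name": "your-table-document.png",  # placeholder, not a workshop file
            }
        },
        FeatureTypes=["TABLES"],
    )
    for rows in tables_as_lists(trp.Document(response)):
        for row in rows:
            print(" | ".join(row))
```

    Calling main() prints each table row on its own line. To extract both forms and tables in one call, FeatureTypes accepts a list containing both values.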

  12. In the console window, run the python dojotextract2.py command to execute the code. The code finishes quickly, and the form details (fields and their values) are printed in the console window.

  13. The output shows that the code has extracted each of the form fields and their values. The values are empty because the form was not filled in. As an experiment, you can try another form that has been filled in; we leave that to you.
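
    When a form is blank, as here, it can help to treat missing values explicitly. The sketch below collects the fields of a parsed page into a dictionary and lists the ones without a value. It assumes the parser's field objects expose key.text and value.text, and that value may be None for an empty field. Both helper names are hypothetical.

```python
def fields_to_dict(page):
    """Map each field name to its value text, using '' when no value was found."""
    result = {}
    for field in page.form.fields:
        key = field.key.text.strip() if field.key is not None else ""
        value = field.value.text.strip() if field.value is not None else ""
        result[key] = value
    return result


def unfilled_fields(page):
    """Return the names of fields whose value is empty."""
    return [name for name, value in fields_to_dict(page).items() if not value]
```

    With the document object from dojotextract2.py, unfilled_fields(doc.pages[0]) would list the fields of the first page that still need to be filled in.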

  14. These two examples show how Amazon Textract works. Part 1 of the workshop series finishes here. Follow the next task to clean up the resources so that you don’t incur costs after the workshop.