I am able to extract text from my multi-page PDF using Amazon Textract. Now I want to start a human loop review. I have already created a workflow and specified the condition there to trigger the human loop. Below is my code:
import os
import json
import time
import uuid
from urllib.parse import unquote_plus
import boto3

def lambda_handler(event, context):
    textract = boto3.client("textract")
    a2i = boto3.client("sagemaker-a2i-runtime")
    FLOW_ARN = os.environ["FLOW_ARN"]
    if event:
        file_obj = event["Records"][0]
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
        # Start document analysis for the whole document
        response = textract.start_document_analysis(
            DocumentLocation={
                "S3Object": {
                    "Bucket": bucketname,
                    "Name": filename,
                }
            },
            FeatureTypes=["FORMS"],  # Specify the feature types to analyze
            ClientRequestToken=str(uuid.uuid4()),  # Generate a unique client request token
        )
        # Retrieve the job ID from the response
        job_id = response["JobId"]
        # Poll for the completion of the job
        while True:
            job_status = textract.get_document_analysis(JobId=job_id)['JobStatus']
            if job_status in ['SUCCEEDED', 'FAILED']:
                break
            time.sleep(5)  # Wait for 5 seconds before checking again
        # Get the results of the analysis
        response = textract.get_document_analysis(JobId=job_id)
        # Process the results
        print(json.dumps(response))
        a2i.start_human_loop(
            HumanLoopName=uuid.uuid4().hex,
            FlowDefinitionArn=FLOW_ARN,
            HumanLoopInput={
                'InputContent': json.dumps({
                    "InitialValue": {
                        "Bucket": bucketname,
                        "DocumentPath": filename,
                    }
                })
            },
            DataAttributes={
                'ContentClassifiers': [
                    'FreeOfAdultContent',
                ]
            }
        )
        return {
            "statusCode": 200,
            "body": json.dumps("Document processed successfully!"),
        }
    return {"statusCode": 500, "body": json.dumps("Issue processing file!")}
I was expecting it to start the human loop review, but it returns the following error:
[ERROR] ValidationException: An error occurred (ValidationException) when calling the StartHumanLoop operation: Provided InputContent is not valid. Please use valid InputContent JSON and try your request again.
Could someone please point out what I am doing wrong? I need to pass the PDF in my S3 bucket to HumanLoopInput.
--------------------EDIT------------------------------
I am using the default worker template. Here it is:
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
{% capture s3_uri %}s3://{{ task.input.aiServiceRequest.document.s3Object.bucket }}/{{ task.input.aiServiceRequest.document.s3Object.name }}{% endcapture %}
<crowd-form>
<crowd-textract-analyze-document src="{{ s3_uri | grant_read_access }}" initial-value="{{ task.input.selectedAiServiceResponse.blocks }}" header="Review the key-value pairs listed on the right and correct them if they don't match the following document." no-key-edit="" no-geometry-edit="" keys="{{ task.input.humanLoopContext.importantFormKeys }}" block-types="['KEY_VALUE_SET']">
<short-instructions header="Instructions"><p>Click on a key-value block to highlight the corresponding key-value pair in the document.</p><p><br></p><p>If it is a valid key-value pair, review the content for the value. If the content is incorrect, correct it.</p><p><br></p><p>If the text of the value is incorrect, correct it.</p><p><img src="https://assets.crowd.aws/images/a2i-console/correct-value-text.png" width="100%"></p><p><br></p><p>If a wrong value is identified, correct it.</p><p><img src="https://assets.crowd.aws/images/a2i-console/correct-value.png" width="100%"></p><p><br></p><p>If it is not a valid key-value relationship, choose <strong>No</strong>.</p><p><img src="https://assets.crowd.aws/images/a2i-console/not-a-key-value-pair.png" width="100%"></p><p><br></p><p>If you can’t find the key in the document, choose <strong>Key not found</strong>.</p><p><img src="https://assets.crowd.aws/images/a2i-console/key-is-not-found.png" width="100%"></p><p><br></p><p>If the content of a field is empty, choose <strong>Value is blank</strong>.</p><p><img src="https://assets.crowd.aws/images/a2i-console/value-is-blank.png" width="100%"></p><p><br></p><p><strong>Examples</strong></p><p>The key and value are often displayed next or below to each other.</p><p><br></p><p>For example, key and value displayed in one line.</p><p><img src="https://assets.crowd.aws/images/a2i-console/sample-key-value-pair-1.png" width="100%"></p><p><br></p><p>For example, key and value displayed in two lines.</p><p><img src="https://assets.crowd.aws/images/a2i-console/sample-key-value-pair-2.png" width="100%"></p><p><br></p><p>If the content of the value has multiple lines, enter all the text without a line break. Include all value text, even if it extends beyond the highlighted box.</p><p><img src="https://assets.crowd.aws/images/a2i-console/multiple-lines.png" width="100%"></p></short-instructions>
<full-instructions header="Instructions"></full-instructions>
</crowd-textract-analyze-document>
</crowd-form>
I can see the following keys in this snippet:
> task.input.aiServiceRequest.document.s3Object.bucket
> task.input.aiServiceRequest.document.s3Object.name
> task.input.selectedAiServiceResponse.blocks
> task.input.humanLoopContext.importantFormKeys
But it looks like it relies on some internal AWS libraries, because it is the default template.
The keys in your InputContent need to be aligned with the human task UI template, which you have not shared.
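For the default template shown in the question's edit, the Liquid expressions read four paths under task.input. A minimal sketch of an InputContent payload that mirrors those paths could look like the following. The function name build_input_content and its parameters are hypothetical. I have not verified that a flow definition created for the built-in Textract task type accepts this payload through StartHumanLoop, or that the widget accepts blocks with the capitalized keys boto3 returns, so treat both as assumptions.

```python
import json


def build_input_content(bucket, key, blocks, important_form_keys=None):
    """Return a JSON string whose keys mirror the task.input.* paths in the template.

    Hypothetical helper: the nesting is copied from the Liquid expressions in the
    default worker template, not from any documented A2I schema.
    """
    payload = {
        # -> task.input.aiServiceRequest.document.s3Object.bucket / .name
        "aiServiceRequest": {
            "document": {"s3Object": {"bucket": bucket, "name": key}}
        },
        # -> task.input.selectedAiServiceResponse.blocks
        "selectedAiServiceResponse": {"blocks": blocks},
        # -> task.input.humanLoopContext.importantFormKeys
        "humanLoopContext": {"importantFormKeys": important_form_keys or []},
    }
    return json.dumps(payload)
```

With the code from the question, this would be passed as HumanLoopInput={"InputContent": build_input_content(bucketname, filename, response["Blocks"])} in place of the "InitialValue" payload.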
Check out this example: https://github.com/aws-samples/amazon-textract-a2i-dynamodb-handwritten-tabular/blob/main/textract-hand-written-a2i-forms.ipynb
Notice the keys in InputContent are also found in the human task UI template. Are your keys aligned with your UI template?
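One way to answer that question mechanically is to pull every task.input path out of the template and check whether it resolves in the payload you plan to send. The sketch below is only an illustration: template_paths and missing_paths are hypothetical helper names, and the regex handles plain dotted paths only, not array indexing or other Liquid syntax.

```python
import re

# Matches Liquid references such as {{ task.input.a.b.c }} and captures "a.b.c".
_TASK_INPUT = re.compile(r"task\.input\.([A-Za-z_][\w.]*)")


def template_paths(template):
    """Return the sorted, de-duplicated dotted paths referenced under task.input."""
    return sorted(set(_TASK_INPUT.findall(template)))


def missing_paths(template, input_content):
    """Return the template paths that cannot be resolved in input_content (a dict)."""
    missing = []
    for path in template_paths(template):
        node = input_content
        for part in path.split("."):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                # Stop at the first key that is absent from the payload.
                missing.append(path)
                break
    return missing
```

Run against a template that references the aiServiceRequest and selectedAiServiceResponse paths, a payload shaped like {"InitialValue": {...}} reports every path as missing, which is consistent with the keys not lining up.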