As a quick continuation of the previous post about PDF manuals, we've decided to dive deeper into how to organise such a task.
Now, the objective is to convert PDF manuals into an API service. If this works, we can charge a small fee for access to the API so that other parties can build automation and innovate with the new data source. This can bring us extra revenue.
In general, the project might look like this:

Step 1: PDF processing pipeline
It is straightforward - we create an Amazon S3 bucket and an AWS Lambda function that triggers PDF processing every time a PDF is uploaded to our folder. To process the PDFs we will use Amazon Textract. Once the labelling project is created, we can start designing the human-in-the-loop task.
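The trigger described above could be sketched roughly as follows. This is a minimal sketch rather than the actual setup: it assumes the Lambda function is subscribed to the bucket's object-created notifications, that the manuals are multi-page PDFs (hence Textract's asynchronous `start_document_analysis` call), and that only key-value (`FORMS`) extraction is needed. The helper names `pdf_objects_from_event` and `build_textract_request` are made up for illustration.

```python
from urllib.parse import unquote_plus


def pdf_objects_from_event(event):
    """Return (bucket, key) pairs for every PDF named in an S3 event."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(".pdf"):
            pairs.append((bucket, key))
    return pairs


def build_textract_request(bucket, key):
    """Build keyword arguments for Textract's asynchronous analysis call."""
    return {
        "DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}},
        "FeatureTypes": ["FORMS"],  # assumption: we only want key-value pairs
    }


def lambda_handler(event, context):
    """Start one Textract job per uploaded PDF and return the job ids."""
    import boto3  # imported lazily so the helpers above run without AWS

    textract = boto3.client("textract")
    job_ids = []
    for bucket, key in pdf_objects_from_event(event):
        request = build_textract_request(bucket, key)
        response = textract.start_document_analysis(**request)
        job_ids.append(response["JobId"])
    return {"job_ids": job_ids}
```

Keeping the event parsing and request building in plain functions makes them easy to check locally before anything is deployed.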
S3 bucket:

Labelling workforce team:

Step 2: Human in the loop
We need to design a task in which people verify the extracted data. Amazon Augmented AI (Amazon A2I) helps here: it sends the content that Textract derived from the scan to human reviewers. You will have to create a work team, an IAM role and a human review workflow.

When you create a human review task, you define the keys and a confidence threshold. Basically, you are describing what you want to detect in the document. In the beginning, it makes sense to focus on foundational data, let's say maintenance schedules for parts, and extract only this information. Using it, you can build a beta version of the API and test it with your partners.
Then you can run the task again for more details (e.g. prepare procedures for Augmented Reality).
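To make the idea of keys plus a confidence threshold concrete, here is a small plain-Python sketch of the decision that such a configuration expresses. It does not use the A2I API; the key names, the 90% threshold and the function `fields_needing_review` are all assumptions chosen for illustration.

```python
def fields_needing_review(extracted, required_keys, threshold=90.0):
    """Return {key: reason} for fields that should go to a human reviewer.

    `extracted` maps a key name to a (value, confidence in percent) tuple.
    """
    review = {}
    for key in required_keys:
        if key not in extracted:
            # The key we care about was not detected at all.
            review[key] = "missing"
            continue
        _value, confidence = extracted[key]
        if confidence < threshold:
            # Detected, but the model is not sure enough.
            review[key] = "low confidence"
    return review


# Hypothetical keys for a maintenance schedule.
REQUIRED_KEYS = ["Part number", "Service interval", "Procedure"]

# Hypothetical extraction result for one page.
extracted = {
    "Part number": ("AB-1042", 99.1),
    "Service interval": ("500 h", 71.4),
}

print(fields_needing_review(extracted, REQUIRED_KEYS))
```

With these sample values, "Service interval" is flagged for low confidence and "Procedure" is flagged as missing, while "Part number" passes without review. Starting with a short key list like this keeps the review workload small for the beta.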
First we select the task type - we will use Key-Value pair extraction.

Then the next step is to configure the keys we want to extract:

Then we need to design what the worker will see and provide instructions. This is pretty important - you need to provide very clear instructions to avoid errors.

Step 3: Scale the task
As you've noticed before, you can run the human review with Amazon Mechanical Turk, through Amazon's trusted partners, or do it yourself.

If you have no concerns about IP, confidentiality or competition, then you can run it through Amazon. Bitskout offers an alternative for cases where you cannot use external providers but still need the scale. You can run this task as a Bitskout campaign for your employees, with a reward for each review. This way you can get what you need fairly quickly, raise employee engagement and save costs by avoiding external spend.
Worker's view
After the task is launched the worker will see the following screen.

Once the task is submitted, and if you have chosen Bitskout as the validation system, the platform will validate the task execution and release the reward. The result is usually a JSON file with bounding boxes and key-value descriptions.
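As an illustration of working with such a result, the sketch below pulls key-value pairs and their bounding boxes out of a list of blocks. It assumes the JSON follows Textract's block structure (`KEY_VALUE_SET` blocks linked to `WORD` blocks through `Relationships`); the exact layout of the reviewed output may differ, and the sample data is invented.

```python
def text_of(block, blocks_by_id):
    """Join the WORD children of a block into one string."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def key_value_pairs(blocks):
    """Return a list of {key, value, box} dicts from Textract-style blocks."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    results = []
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                # A KEY block points at its VALUE block(s) by id.
                value_text = " ".join(
                    text_of(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        results.append({
            "key": text_of(block, blocks_by_id),
            "value": value_text,
            "box": block["Geometry"]["BoundingBox"],
        })
    return results


# Invented sample in the assumed format.
sample_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.3, "Height": 0.05}},
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Geometry": {"BoundingBox": {"Left": 0.45, "Top": 0.2, "Width": 0.1, "Height": 0.05}},
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Service"},
    {"Id": "w2", "BlockType": "WORD", "Text": "interval"},
    {"Id": "w3", "BlockType": "WORD", "Text": "500"},
    {"Id": "w4", "BlockType": "WORD", "Text": "h"},
]

for pair in key_value_pairs(sample_blocks):
    print(pair["key"], "=", pair["value"], pair["box"])
```

A flat list of key, value and box like this is a convenient shape to store and later expose through the API.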

We used the Amazon blog posts about the Textract and A2I services to compile this overview.
Cheers,
Ilia Zelenkin