If we receive a file with a name like "1.pdf" or "document1.pdf", we need to open it to understand what it is. Then we usually rename it to something like "a big document from a customer that I will forget about in the next hour.pdf".
It is an amazingly inefficient and cumbersome way to control the flow of information. A colleague usually names files differently; multiply that by the number of employees you have, then add contractors, customers, and so on, and the naming becomes impossible to control.
In this article, we want to show how you can improve productivity around files and documents in a project. We will create a workflow that detects the file type and assigns a corresponding label or custom field value.
In our example, we will detect shipping documentation: several documents are required to send something internationally. We will try to detect the following types: bill of lading, insurance, certificate of origin, commercial invoice, and dangerous goods certificate.
Additionally, because there are many shipping companies and they all use different templates, as step 2 we will detect the company name. Then, knowing the company, we will extract data from the document using the corresponding template workflow. Our logic is the following:
- Detect Document Type (Bitskout workflow - Shipping Documents Detector)
- Detect Vendor - in this example we will use a bill of lading document to detect the vendor (Bitskout workflow - Bill of Lading Vendor)
- Extract Data from the bill of lading depending on the vendor.
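The three steps above can be read as a small dispatch pipeline. The sketch below is only an illustration in Python and is not the Bitskout API: `detect_document_type`, `detect_vendor`, the two extractor functions, and the vendor names "Acme Lines" and "Globex Shipping" are all hypothetical stand-ins for the workflows.

```python
from typing import Callable, Dict, List, Optional

Extractor = Callable[[str], Dict[str, str]]


def detect_document_type(text: str) -> Optional[str]:
    """Step 1 (hypothetical): a crude check for a single document type."""
    return "bill of lading" if "bill of lading" in text.lower() else None


def detect_vendor(text: str, vendors: List[str]) -> Optional[str]:
    """Step 2 (hypothetical): return the first known vendor named in the text."""
    lowered = text.lower()
    return next((v for v in vendors if v.lower() in lowered), None)


# Step 3 (hypothetical): one extractor per vendor template.
def extract_acme(text: str) -> Dict[str, str]:
    return {"vendor": "Acme Lines", "template": "acme"}


def extract_globex(text: str) -> Dict[str, str]:
    return {"vendor": "Globex Shipping", "template": "globex"}


EXTRACTORS: Dict[str, Extractor] = {
    "Acme Lines": extract_acme,
    "Globex Shipping": extract_globex,
}


def process(text: str) -> Optional[Dict[str, str]]:
    """Run the three steps in order; give up as soon as one step fails."""
    doc_type = detect_document_type(text)
    if doc_type != "bill of lading":
        return None
    vendor = detect_vendor(text, list(EXTRACTORS))
    if vendor is None:
        return None
    result = EXTRACTORS[vendor](text)
    result["document_type"] = doc_type
    return result
```

Keeping the extractors in a dictionary keyed by vendor means a new shipping company only needs a new entry, which mirrors the idea of one template workflow per vendor.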
First, let's prepare our model for Document Type Detection - we will use Data Extraction for that. Type the model name and add a description. Then select the data extraction model type.
The next step is to choose the type of data extraction. We will use standard data extraction, which allows us to select a region of the file.
Then we need to load a template. Here we will use a trick: because all the documents have a similar structure (A4 pages), they all have a title in the top part of the page. So we will capture all the text in the top part of the document, find the keyword that identifies the document type, and use it as the detector.
Let's load a sample.
After the sample has been loaded, select a region that is neither too big nor too small and name it Document Type (the string type). Then move the selection so that it reaches the top of the page (the region name will disappear). We need to grab the whole top part of the page to be sure that we've captured all the text.
You can leave the rest of the options as they are. Press Apply to save the model.
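To make the region trick concrete, here is a minimal sketch of what "capture all text in the top part of the page" amounts to. It assumes a hypothetical OCR output where every word comes with its position on the page; the `Word` type, the `text_in_top_region` function, and the 25% default are illustrative assumptions, not how the Bitskout model works internally.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word
    y: float  # top edge of the word, measured from the top of the page


def text_in_top_region(words: List[Word], page_height: float,
                       fraction: float = 0.25) -> str:
    """Join the words whose top edge lies in the top `fraction` of the page."""
    limit = page_height * fraction
    selected = [w for w in words if w.y <= limit]
    # Restore reading order: top to bottom, then left to right.
    selected.sort(key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in selected)


# Usage with made-up coordinates on a page that is 842 units tall.
words = [
    Word("LADING", 200, 40),
    Word("BILL", 100, 40),
    Word("OF", 150, 40),
    Word("Total", 100, 700),
]
title_text = text_in_top_region(words, page_height=842)
```

The title words are kept and the "Total" near the bottom of the page is dropped, which is why a generous top region is enough to catch the document title.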
The next step is to create a Label Output. Click on Outputs in the left main menu and choose Label Mappings:
Label mappings allow you to map A.I. model output to labels or dropdown lists. Let's add a new mapping by pressing Add. As per the instructions, first select the A.I. model from the list, then the project management service and the project/board where you want to map the output. The next step is the actual configuration.
As you can see, we've added keywords to look for in the scanned text that allow us to detect the document type. Press Apply and the output is saved. Now we need to create a workflow:
Once the workflow is saved, let's try and use it. If you've configured the output for monday.com, then you will need to use the recipe:
And once you run the recipe, the Bitskout workflow will set the labels automatically.
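Conceptually, the label mapping configured earlier behaves like a keyword lookup over the scanned text. The sketch below illustrates that idea only; the `LABEL_KEYWORDS` table and the `label_for` function are hypothetical, and the keyword lists are assumptions rather than the ones used in the actual mapping.

```python
from typing import Dict, List, Optional

# Hypothetical keyword lists, one entry per label from the example.
LABEL_KEYWORDS: Dict[str, List[str]] = {
    "Bill of Lading": ["bill of lading"],
    "Insurance": ["insurance"],
    "Certificate of Origin": ["certificate of origin"],
    "Commercial Invoice": ["commercial invoice"],
    "Dangerous Goods Certificate": ["dangerous goods"],
}


def label_for(scanned_text: str) -> Optional[str]:
    """Return the first label whose keyword appears in the scanned text."""
    # Lowercase and collapse whitespace so line breaks do not split a keyword.
    normalized = " ".join(scanned_text.lower().split())
    for label, keywords in LABEL_KEYWORDS.items():
        if any(keyword in normalized for keyword in keywords):
            return label
    return None  # no keyword matched, so no label is set
```

Note that the first matching label wins, so in a sketch like this the order of the table matters when the top of a page mentions more than one keyword.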
Now that we know the document type, we can extract information from it. But before we do, we need to understand which vendor the document comes from. Continued in part 2.
Such functionality is very useful if you want to get back control of your files or documents. Obviously, it has its limits, and we don't recommend using this technique to detect every possible document you have. Some filtering should already be done beforehand - in our case the client allowed only shipping documentation to be uploaded via the form, hence the workflow was quite efficient.
Feel free to contact us if you have a use case or have any questions.