AutoML NLP: Convert PDFs to JSONLs!

Learn to use the script provided by the AutoML docs to convert your text files to JSONL format.

Arpana Mehta
Google Cloud - Community


Screenshot from GCP AutoML docs [Photo by author]

If you have started with anything related to AutoML NLP on GCP, you already know the first step: convert your documents to .jsonl format before passing them in as training data. The AutoML documentation already provides a Python script, but if you found the script a little tough to understand or use, you are not alone!

When I was going through the docs, I wished there was a step-by-step guide to using the Python script, so here it is.

What is AutoML?

AutoML is one of the building blocks of AI on Google Cloud Platform. It is especially useful when you want to create and train custom, high-quality models but have limited machine learning expertise. It provides an interface that helps you feed your data to the algorithm that builds your model. It automatically selects the best neural network architecture and tunes the hyperparameters for you based on your goal, saving you the manual effort of retraining and fine-tuning.

The script provided in the docs is written in Python 2. [Photo by Hitesh Choudhary on Unsplash]

So, step 1: get your data in!

I will not be writing about all the steps in the process of creating and training a model on AutoML, as you can find them very well documented here. Our focus is on the one task causing a bit of friction.

Make sure your PDFs are smaller than 2 MB each. If they contain pictures or other graphical data, you might want to read this article explaining how you can extract text from images in PDFs using the Cloud Vision API and, of course, Python.
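If your PDFs already live in a GCS bucket, a quick way to check their sizes is gsutil du (the bucket path below is a placeholder):

gsutil du -h gs://<path_to_src_pdfs>/*.pdf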

Converting PDFs to JSONL in Cloud Shell

You can run this script in Cloud Shell, or, if you have installed the gcloud SDK, directly from your machine, and convert the PDFs stored in GCS buckets into JSONL files. The script is written in Python 2, and the command takes the script file name followed by two arguments: the path to the PDF files and the path to the destination bucket.

python2 <script> gs://<path_to_src_pdf> gs://<dest_bucket>/
Upload the script and your source PDF files to a GCS bucket [Photo by author]
  1. Upload the script (input_helper_v2.py) and your source PDF file (referred to as src.pdf later) to a GCS bucket. Note the paths of these two files.
  2. Install Python 2 in Cloud Shell: sudo apt install python2
  3. Create a GCS bucket in the us-central1 region with the storage class set to Standard. The region must be us-central1 for AutoML processing. (Data as of Sep’21)
  4. Copy the script from your GCS bucket into your Cloud Shell home directory: gsutil -m cp gs://<path_to_script> .
  5. Step 4 loads the script into the Cloud Shell VM’s home directory. Run ls to confirm that the script was copied successfully.
  6. In Cloud Shell, run python2 input_helper_v2.py gs://<path_to_src_pdf> gs://<path_to_gcs_bucket>/ (the full command sequence is recapped just below).
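Putting it all together, the whole Cloud Shell session looks roughly like this. The paths are placeholders, and the gsutil mb line is just one way to create the us-central1 bucket from step 3 if you prefer the command line over the console:

sudo apt install python2
gsutil mb -l us-central1 -c standard gs://<path_to_gcs_bucket>/
gsutil -m cp gs://<path_to_script> .
ls
python2 input_helper_v2.py gs://<path_to_src_pdf> gs://<path_to_gcs_bucket>/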

If you want to convert multiple files at once, you can use *.extension instead of a single file name, for example *.pdf.
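For instance, assuming a (hypothetical) source bucket named my-pdf-bucket and a destination bucket named my-jsonl-bucket, converting every PDF at once would look like:

python2 input_helper_v2.py gs://my-pdf-bucket/*.pdf gs://my-jsonl-bucket/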

JSONL file created after running script [Photo by author — Screenshot from GCP console > cloud storage]

A CSV file with the JSONL file URIs (which you can use directly when importing the training data) and the JSONL file itself have been created and stored in the destination GCS bucket you specified. Head to the docs (importing your training data in the AutoML console) and continue building your custom model!
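Before importing, you can sanity-check the output by listing the destination bucket and peeking into the generated CSV with gsutil (the file name below is a placeholder):

gsutil ls gs://<path_to_gcs_bucket>/
gsutil cat gs://<path_to_gcs_bucket>/<generated_file>.csv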

Hope this guide is helpful to you! Do comment if you face any issues. [Photo by author, inspired by Priyanka Vergadia 🤗]
