DVR TIG Week: Unlocking Data: Extracting Data Tables from PDFs with Power Query by Zoraya Cruz-Bonilla

Zoraya Cruz-Bonilla

Hello AEA365 community! I am Zoraya Cruz-Bonilla. I work as a Data Research Analyst in the Division of Student Affairs at Binghamton University.

If you have ever had the privilege of performing text analytics with open-source PDF files, then you understand the innate challenges – they are not standardized for easy extraction. In fact, a simple copy/paste can lead to more tedious manual steps in order to maintain structural integrity.

Perhaps extracting data from one or two tables is not a bad proposition. What about a PDF report with over 13 tables? I don’t know about you, but my wrists have already paid the price for over 12 years of experience as an analyst. I prefer a kinder approach. That’s exactly what transpired during this the 2023 SAS Hackathon.

Hot Tip: You got this! Dig into your Existing Toolbox.

For those of you who fancy programming languages such as Python – sure it’s possible to extract text from a PDF file using powerful script packages (e.g., PDFminer, PyMuPDF, PyPDF2, etc.) But when you’re in a time crunch and need a no- or low- code solution then you take a closer look at what’s in your analytics toolbox; you carefully consider what will get the job done relatively quickly and will help the team with allocating more time toward tasks that demand greater heavy lifting.

Rad Resource: Power Query

Fortunately, having watched one of my favorite YouTube videos by Leila Gharani, I recalled that Power Query is a liable tool for extracting data tables due to its data transformation and data preparation capabilities. Check out her channel for a complete walkthrough and additional use cases for Power Query.

Load Power BI, select get data, choose more, all, PDF, then connect, select table, and transform data.

Link to the report.

Screenshots of the process to query the data.

As soon as you load the data onto Power Query, you’ll notice that the header spills over to the next few lines – because there is no delimiter to define the boundaries. The solution to this issue requires transposing the text string via the ‘Transform’ tab {Ex 1}.

  • Select the columns containing header text {Ex 2}
  • Clicked on ‘Merge Columns’ from the tools menu bar {Ex 3}, and
  • Indicate the desired separator {Ex 4}. In my case, I chose ‘Space’ and did not input a new column name because it would get written over in the next step. 

As you perform these steps, you’ll notice that Power Query will keep track of all applied steps {Ex 5}. At any point, you can undo them by clicking on the “X” icon or reorder them.

Once the text string for the header is consolidated onto one column, the data can be transposed a second time {Ex 1}. Finally, back on the ‘Home’ tab, there is an option to assign the first row as the header.

Depending on how the data table was displayed, that may be the last step required. In other instances, however, it maybe be necessary to apply additional steps. Whichever the case might be, it’s noteworthy to know that the steps will repeat automatically whenever the data from the original PDF file changes. Thus, saving time and energy.

Split columns by delimiter, then replace null values, then promote headers, and finally delete duplicate columns.

Lessons Learned: Be Curious to Power Through!

There is lots more to unpack with Power Query and the capabilities are endless if you are versed in the M language that facilitates a range of data mash-up experiences (see the Microsoft documentation for more details).

Upon extraction, the Team End Forced Labor, was able to use the structured data and join it with other sources for natural language processing. Success!

The American Evaluation Association is hosting Data Visualization and Reporting (DVR) Week with our colleagues in the DVR Topical Interest Group. The contributions all this week to AEA365 come from DVR TIG members. Do you have questions, concerns, kudos, or content to extend this AEA365 contribution? Please add them in the comments section for this post on the AEA365 webpage so that we may enrich our community of practice. Would you like to submit an AEA365 Tip? Please send a note of interest to AEA365 is sponsored by the American Evaluation Association and provides a Tip-a-Day by and for evaluators. The views and opinions expressed on the AEA365 blog are solely those of the original authors and other contributors. These views and opinions do not necessarily represent those of the American Evaluation Association, and/or any/all contributors to this site.

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.