There's tons of data out there, locked away in tables embedded in PDF essays and reports. PDFs are ubiquitous, yet it's harder to use the data stored inside them than it would be in a spreadsheet. Even if you can extract the data, it can be hard to get consistent results. One table might use ruled lines to mark its cell boundaries, while another relies only on white space to suggest a grid. Tables also vary in whether they contain spanning rows and columns, and often the only constant is that they all hold data you want to extract.
While you could manually extract data from a table in a PDF, it's far from simple, and there's no reason to spend that much time when there are tools to make the job easier. In this tutorial, I'll show you how to extract tables from a PDF easily with Tabula, and then clean the messy data into a more usable form with OpenRefine.
Everyone working with data knows the problem: you've found some interesting data for your journalistic project, or statistics for preparing a nice map, but the data arrives messy and trapped inside a PDF file, unreadable by your programs. So you either retype it by hand or slog through copying and pasting rows of data out of the PDF into Excel. This sort of manual work creates friction, and you lose a lot of time.
As a graduate researcher, I went through this pain of manually copying and pasting tables for my thesis. I always dreamed of an application that could extract a table from a PDF with a single click, and that's exactly what Tabula is. It's a free, Java-powered app with a web interface that runs on your computer and can extract data from almost any table in a PDF. In seconds, it can extract data that would take you hours to re-type by hand.
How to Use Tabula?
First off, download the zip file for your operating system from the Tabula website and unzip it to a folder of your choice. Make sure you have Java v6 or 7 installed as well; if not, you can download it for free from the Oracle website. That link's to the ad/toolbar-free version, so you won't have to worry about anything else being installed.
When you launch Tabula, you will initially see a Command window, and within a few seconds your browser should automatically open to http://127.0.0.1:8080. If not, open your web browser and type this URL in the address field. Submit your PDF file and Tabula will process it and show you a nice list of page thumbnails. Look for the table you want to extract, or if you have a small PDF, you can check the option to “Auto-Detect Tables.”
Click and drag to select the area of the table. Once you release the mouse, Tabula will show you the extracted table in a friendly format. If the data looks fuzzy, try reselecting the area to exclude the headers or the footnotes. Finally, you can either download the extracted table as comma-separated values (CSV) or tab-separated values (TSV), or copy the data to the clipboard. Once the data is in a spreadsheet, you may need to do a bit of editing, such as correcting the headers or footnotes.
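If you'd like to sanity-check the export before opening it in a spreadsheet, Python's standard csv module can read Tabula's output directly. A minimal sketch, assuming you exported to CSV (the file path is a placeholder for your own export):

```python
import csv

def load_table(path):
    """Read a Tabula CSV export and split it into a header row and data rows."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]
    return header, data

# header, data = load_table("tabula-export.csv")  # path is hypothetical
```

From here you can count rows, spot ragged lines, or check that the header came through before loading the file anywhere else.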
When you need to shut down Tabula on Windows, switch to that Command window and press Control-C. Wait until a message says the shutdown is complete; the window might close automatically, or you can close it yourself. On a Mac, switch to the app (right-click the icon in the Dock) and press Cmd-Q. If you're running Mac OS X 10.8 or later, be sure to adjust the Gatekeeper settings to avoid error messages during installation.
Tabula won’t be perfect all the time, and there are still some issues to sort out. As of now, it only works on text-based PDFs, so you’re stuck with manual labour if you have scanned PDFs, though if you have a copy of Acrobat, you might be able to convert image PDFs into editable PDFs that will work with Tabula. Also, if you have PDFs with multi-line rows, Tabula will make frequent errors in tables without graphic row separators, so watch out for that.
Occasional misspellings, extra spaces, random punctuation, weird capitalization, duplicate records and more are par for the course when you're extracting data from PDFs. No matter how nice the extract looks at first glance, your data will be a mess if there are any errors, creating problems in data retrieval, interoperability and indexing.
That's where OpenRefine (formerly Google Refine) comes in. It's another free tool that helps you easily identify these errors and inaccuracies. OpenRefine not only lets you quickly diagnose the accuracy of your data, but also act on certain errors in an automated way. It can transform data across many cells in bulk to clean it up, extend it with more data from other sources, and convert it to other formats so you can use it in the apps you want.
How to Use OpenRefine?
OpenRefine once again has a web interface, runs on your computer, and requires Java. The only difference is that this time it'll open in your browser at a different address (http://127.0.0.1:3333 by default).
When you open OpenRefine, you'll see three options on the left side: Create Project, which lets you import data from your computer, a web address, the clipboard or a Google Spreadsheet; Open Project, which takes you back to an existing project created during a former session; and Import Project, which lets you directly import an existing OpenRefine project archive. OpenRefine understands a variety of data file formats, including TSV, CSV, Excel documents, and even XML and JSON, a favorite of web and application developers.
Browse to your exported CSV file and click Next. On the next screen, you get a preview of your dataset as it will appear in OpenRefine. By default, the first line will be parsed as the column names. Another option is the “Quotation marks are used to enclose cells containing column separators” checkbox. If you leave it selected, be sure to verify that cell values containing separators are indeed enclosed in quotes in the original file; otherwise, deselect it so that quotation marks are not misinterpreted by OpenRefine.
The other options may come in handy in some cases; try selecting and deselecting them to see how they affect your data, and notice how the preview updates to reflect the changes. Also, be sure to select the right encoding to avoid special characters getting mixed up. When everything looks right, give your project a name and click Create Project.
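That quotation-mark option matters because a comma inside a quoted cell is data, not a column separator. A short Python sketch shows what goes wrong when quoting is ignored (the university and endowment figure are made up):

```python
import csv
import io

line = '"Stanford University","$27,700,000,000"\n'

# With quote handling (the default), embedded commas stay inside the cell.
with_quotes = next(csv.reader(io.StringIO(line)))

# With quoting disabled, every comma becomes a column break.
without_quotes = next(csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE))

# with_quotes keeps 2 fields; without_quotes splinters into 5.
```

This is exactly the misinterpretation the import screen's checkbox is there to prevent.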
To try the various data manipulation techniques, you can either use your own dataset or download the demo dataset from the GitHub repository. Once your dataset has been loaded in OpenRefine, take a moment to explore the interface. From top to bottom, you can observe four zones.
In the first zone, you can see the total number of rows in the file. In the second zone, you can alternate between rows and records by clicking on either word; this zone also lets you choose whether to display 5, 10 or 50 rows/records per page, and provides controls to navigate from page to page. In the third zone, you will find the first row that was parsed as the column headers when the project was created. In my dataset, the columns read university, endowment, numFaculty and so on. The leftmost column is always called All and is divided into three sub-columns containing stars, flags and IDs. Every column has a menu that can be accessed by clicking the small drop-down triangle. Finally, the fourth zone is the main area, displaying the actual values of the cells. If you hover your mouse over a cell, an edit option appears that lets you change the value and its data type.
Columns are an essential part of OpenRefine, and by default all columns are expanded, which can be cumbersome if there are many columns in the project. If you want to temporarily hide one or more columns to make it easier to work on the others, click the small drop-down triangle in any column to show the menu and select View. You'll see four options here: Collapse this column, Collapse all other columns, Collapse columns to left and Collapse columns to right.
Play with the individual column options until you are satisfied with the outcome. To expand a column again, just click on it. In some cases, it might be useful to change the order of the columns, for instance, to bring together columns that need to be compared. To achieve this, enter the menu of any column and click Edit column. Again, four options are available: Move column to beginning, Move column to end, Move column to left and Move column to right. If you want to re-order the columns completely, use the column called All.
The View option here offers a quick way to expand or collapse all columns, while Edit column offers a quick way to re-arrange columns by dragging them around, or to remove them by dropping them on the right, as shown in the following screenshot.
When your project is large, the first thing you should learn is how to sort data, since sorting can make your dataset easier to explore and manipulate. To sort your data by university name, choose Sort in the column menu to access the following screen. As the screenshot shows, you can sort cell values according to their type: text, numbers, dates and booleans, and each type can be sorted in a different order.
For example, text can be sorted alphabetically (a to z) or in reverse (z to a), booleans false then true or vice versa, and so on. Moreover, you can specify where errors and blanks will appear in relation to valid values by dragging them into the desired order. For instance, errors could be sorted first to spot them more easily, followed by valid values in the middle and blank values at the end.
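Under the hood, this ordering amounts to a composite sort key: a bucket for errors, valid values and blanks, then the value itself. A toy sketch in Python, with invented sample values (here, anything that isn't text stands in for an "error"):

```python
def sort_key(value):
    # Bucket 0: "errors" (non-text values, as a stand-in);
    # bucket 1: valid text values; bucket 2: blanks last.
    if value is None or value == "":
        return (2, "")
    if not isinstance(value, str):
        return (0, str(value))
    return (1, value.lower())

data = ["Yale", "", "harvard", None, 404, "Brown"]
ordered = sorted(data, key=sort_key)
# Errors first, text in alphabetical order in the middle, blanks at the end.
```

Dragging the error/blank boxes in OpenRefine's sort dialog effectively changes those bucket numbers.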
One of the tools in OpenRefine that you will use most often is faceting. Typically, you create a facet on a particular column. The facet summarizes the cells in that column to give you a big picture of your data in the sidebar and simultaneously allows you to filter the data down to some subset of rows that you want to change in bulk. There are various ways to facet data, depending on the values in your dataset and your needs: text facets for strings, numeric facets for numbers and dates, and customized facets for defining your own text and numeric facets.
A text facet is a very useful tool, similar to a filter in a spreadsheet. It groups identical text values together, which helps you spot values that are spelled in many different ways and merge them. Now, we will create a text facet on the country column by clicking on that column’s drop-down menu and selecting Facet → Text facet.
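Under the hood, a text facet is essentially a frequency count of each distinct value. You can reproduce the same view in plain Python with collections.Counter (the country values here are invented for illustration):

```python
from collections import Counter

countries = ["United States", "U.S", "U.S.A", "United States", "India"]
facet = Counter(countries)

# Each distinct spelling becomes a facet choice with its row count.
for value, count in facet.most_common():
    print(value, count)
```

Seeing "United States", "U.S" and "U.S.A" as separate entries with separate counts is precisely how the facet exposes inconsistent spellings.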
The result of this facet appears in the Facet/Filter tab on the left of the screen. In the following screenshot, we can see the column country grouped according to the names, but I found some strange entries such as “U.S,” “United States,” and “U.S.A”, along with the odd "United States )". Aren’t they all the same?
You could manually edit these discrepancies, but you don’t have to. The Cluster option will help you find this kind of dirty data easily and fix it. Click the country column’s drop-down and navigate to Edit cells → Cluster and edit. OpenRefine will then present a dialog box where you can choose different clustering Methods, each of which can use various Keying Functions.
In Values in Cluster, you can see the different spellings and how many rows contain each one. The Merge? column contains a checkbox; if you check it, all values in that cluster will be changed to the value in the New Cell Value column. Here I can manually enter the new cell value as “United States of America” and then click the Merge Selected & Close button to merge all the selected clusters.
You may find some more discrepancies in the dataset; if so, select a different Method from the drop-down menu and try different Keying Functions. Play with the options until you've fixed all the inconsistencies. If you want to dig into how OpenRefine's clustering works, be sure to check the documentation on its GitHub wiki.
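To give a flavour of what the clustering is doing, here is a simplified sketch of the default fingerprint keying idea: normalize each value down to a key, then group values that share a key. This is my own stripped-down version, omitting details like Unicode normalization that the real implementation handles:

```python
import re
from collections import defaultdict

def fingerprint(value):
    # Lowercase, strip punctuation, then sort the unique tokens.
    value = value.strip().lower()
    value = re.sub(r"[^\w\s]", "", value)
    tokens = sorted(set(value.split()))
    return " ".join(tokens)

def cluster(values):
    groups = defaultdict(list)
    for v in values:
        groups[fingerprint(v)].append(v)
    # Only keys shared by more than one distinct spelling form a cluster.
    return [g for g in groups.values() if len(set(g)) > 1]
```

For example, "United States" and "united states." collapse to the same key, so they end up in one cluster, while "U.S" keys differently and stays out, which is why you sometimes need other methods or keying functions to catch the stragglers.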
Where a text facet groups unique text values, a numeric facet groups numbers into numeric range bins. We need to clean the number-of-students data in the numStudents column: not all of the values are numeric, and many of them contain bits of text in addition to the actual number of students. To figure out which entries need to be fixed, I will use a numeric facet.
In the numStudents column menu, navigate to Facet → Numeric facet and look at what appears in the Facet/Filter tab on the left. This shows a histogram of the values and also lists the number of entries per type (numeric, non-numeric, blank, error). Keep the non-numeric box checked and uncheck the other types to isolate the problem rows. We can see some problems already: some cells have text alongside the numerals, and there are various other discrepancies. In short, the column is not purely numeric and contains some dirty data.
With the non-numeric rows isolated, hover your mouse over any messy cell and you’ll see an option called edit. Click it, and a dialog box will open where you can remove the stray text with Backspace and click Apply. In my dataset, I removed the “Great Valley” text from 560 rows with a single click.
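The same cleanup can be expressed as a transform: keep the first number in each cell and drop everything else. A sketch under that assumption (the helper name is mine, and the sample cells echo the "Great Valley" case):

```python
import re

def extract_number(cell):
    # Grab the first run of digits (commas allowed) and drop stray text.
    match = re.search(r"\d[\d,]*", cell)
    return int(match.group().replace(",", "")) if match else None

cells = ["2,606", "560 Great Valley", "no data"]
cleaned = [extract_number(c) for c in cells]
```

A rule like this fixes every affected row in one pass, which is the same economy of effort the facet-plus-edit workflow gives you inside OpenRefine.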
Facets are definitely powerful tools, but sometimes you'll see inconsistencies throughout a column, and the quickest fix is a common transform: from the column’s drop-down menu, pick Edit cells → Common transforms. There's a range of useful functions here to clean up your data and make it consistent. Here’s why that matters: sometimes, in the process of entering data, people put a space before or after a name. You won’t be able to see it, but when it comes to combining two sets of data, the software will interpret “ Rahul” and “Rahul” as two different values.
Go to the column where you want to remove whitespace, then navigate to Edit cells → Common transforms → Trim leading and trailing whitespace. Another common problem with data is inconsistent formatting; occasionally someone leaves the caps lock on or forgets to capitalize a name. The three options To titlecase, To uppercase and To lowercase convert all cells in that column to one consistent format.
A very useful feature of OpenRefine is that it keeps a history of every modification made to the data since the project was created. In practice, this means you should never be afraid to try things out with your dataset. Feel free to fiddle with your data and apply any number of facets or transformations, because you can always undo a step if you realize it was a mistake.
To access the project history, click the Undo/Redo tab in the top left of the screen, just next to Facet/Filter, as shown in the screenshot. To turn back the clock, click on the last step that you want to keep; the rest will be greyed out. To cancel all changes and restore the data to its state before any transformation was made, click 0. Create Project. To redo, click on the step up to which you want to restore the history.
Once you have finished your project, you can export the data from OpenRefine. The Export menu in the top-right of the screen allows you to do just that. The supported formats are TSV, CSV, Excel, ODF spreadsheet, JSON and HTML table; the last option comes in handy if you want to publish your cleaned data online.
With this tutorial, you've got the basics of how to use Tabula and OpenRefine. At some point you may feel overwhelmed by the complexity behind OpenRefine, but don't worry. Take your time, and remember that working with data is not just code or clicks; follow your intuition to get data into great shape, and you can always undo anything if you make a mistake.
If you have any trouble getting started with Tabula or OpenRefine, or have any unique methods you want to share, be sure to let us know in the comments below.