This post was written more than four years ago. The world changes fast, and the information, conclusions, or attributions may or may not still be accurate. Check the sources and links, and email me if you have any questions.

automated-human-ocr-with-mechanical-turk

There was recently a campaign finance deadline in my city, and the jurisdiction that handles campaign finance disclosure reporting doesn’t require candidates to file electronically. In fact, most candidates fill out the form and print it off with a bunch of printed contribution and expenditure schedules. The county scans them back in, doing the electronic-to-paper-to-electronic shuffle. Along the way, however, the data loses its machine-readability. If you’re a journalist, researcher, or just a mindful citizen, that’s no fun.

Optical Character Recognition (OCR) has advanced by leaps and bounds, but when you’re dealing with tabular data that may or may not have lines between rows, handwritten notes and corrections, or a low quality scan, OCR is going to frustrate you more than it will help you. Certainly, try it out first, but if it’s not meeting your expectations, move along. There might be some middle ground of combining OCR data with Mechanical Turk correction or validation, but I won’t address that here.

Here’s how I use Mechanical Turk to turn paper records into electronic machine-readable data. This is an overview, using some software that many technology generalists would likely have installed. So, if you’re pretty skilled on the command line you might be shrieking ‘but there’s an easier way!’ Indeed — and I’ll reference those technical options as I go.

Overview

The best way to use Mechanical Turk to convert paper data to electronic, machine-readable data is to slice the data on the page up into individual images for each row of data. That image is shown to a Mechanical Turk worker, who receives a micropayment to transcribe the data shown in the image into predetermined fields. The output from a Mechanical Turk batch is a CSV file that contains the data that the workers entered. There’s a data quality issue with anyone transcribing an image into text, and there are a number of ways to reduce errors.

Step 1 — Scan it in and deskew the pages

Scanning: If you’ve been given a file that was sourced from paper, it’s likely to already be in a PDF format.  If you are the holder of the paper, scan it in yourself to ensure that you’ve got a high enough resolution. I recommend 300dpi.

Deskewing: Any paper document that’s been run through a scanner is going to be skewed (rotated or tilted) a bit. With big corporate high-end sheet-fed scanners, my experience is that the page is rotated on average anywhere from –0.5% to +0.5%. You can deskew your pages using the built-in feature in Adobe Acrobat, under the Tools » Document Processing » Optimize Scanned PDF » Filters (Edit) menu, by selecting Deskew: On. If you don’t have Acrobat, there’s another option that I describe in Step 3.

Technical users, ImageMagick has a built-in deskew option, and here are two fabulous links on how to clean up scanned images in general:

Step 2 — Get all of the pages into single 300dpi PNG files.

From a single PDF, separate and convert the file into one 300dpi PNG file per page. If you’re on a Mac, my favorite utility for doing this is PDF Toolkit+, $1.99 in the App Store. Technical users, you can use ImageMagick on the command line. Simply run: convert -density 300 -depth 8 -quality 100 source.pdf destination.png (for a multi-page PDF, ImageMagick numbers the output files: destination-0.png, destination-1.png, and so on).

Step 3 — Deskew (if you haven’t already)

If you didn’t have Acrobat in Step 1, you can manually adjust the skewing and rotation of an image using desktop photo editing software like Adobe Fireworks (my preference), Adobe Photoshop, Adobe Photoshop Elements, or Preview.  I like to draw a rectangle on the page for a reference of where zero is, and then you can manually adjust the rotation.  I suggest using the numeric transform feature, and adjusting the skew (rotation) in 0.1 or -0.1 increments until you reach the target. Save the image. If all the images are equally skewed, you can batch deskew them — Adobe products have built-in batch processing or macro functionality.

Step 4 — Make a template and slice each page image

Now you’re going to make a template that slices each record into separate images. For this, you’ll need software that can handle ‘slicing.’  Both Adobe Fireworks and Adobe Photoshop support this.  Open up the first page and draw ‘slices’ around each row of data on the page, doing your best to ensure that all of the data is contained within the slice. If this is a one-off small job, don’t worry about being precise with ensuring each box is the exact same height and width. If it’s a big job or one you’ll be repeating often, you might want to be more precise. Duplicating slices rather than drawing new ones is the best way to ensure your outputs are the same widths and heights.

Here’s how this looks in Fireworks — the red circle shows where the ‘Create Slice’ tool is, and each red line represents the edges of the slice (colored in green) over the image.

[Screenshot: creating slices in Adobe Fireworks]

Once you have a slicing box around each row of data, this is now your template. Since all of the files are deskewed, as long as they match this format, you should have no problem applying this template to each page image.

Load each page into your image editor and export the slices. In Fireworks, choose the export option ‘Images Only’ and choose ‘Export Slices.’ Do not check the box ‘Include Areas without Slices’ or you’ll get a bunch of extraneous exported files.

[Screenshot: exporting slices in Adobe Fireworks]

This gives you a single image file for each row of data. It looks something like this:

[Image: ma21_r1_c1, a single sliced row of data]

Technical users, I think the best way to handle slicing is to have a command-line utility repeatedly crop specific dimensions into separate files. I turn to ImageMagick once again, which can crop an image to specified dimensions, starting from a given x,y coordinate. Of course, finding those coordinates is a bit complicated. I think the best way would be to figure out the dimensions of one data row and the x,y coordinate of the first row, and then to calculate the position of each row going down the page image. If rows are varied heights or you don’t want to write a script, another way to do it would be to use a GUI image map generator that spits out x,y coordinates for every box you draw. Simply perform some sed and awk magic to pass those numbers back to ImageMagick for cropping.
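The row arithmetic described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the pixel measurements and file names are hypothetical, every row is assumed to be the same height, and the script only builds the ImageMagick argument lists (using its `-crop` geometry and `+repage`) rather than running them.

```python
def row_boxes(first_x, first_y, row_width, row_height, row_count):
    """Return (x, y, width, height) for each equally sized row, top to bottom."""
    return [
        (first_x, first_y + i * row_height, row_width, row_height)
        for i in range(row_count)
    ]


def crop_command(page_png, box, out_png):
    """Build an ImageMagick argument list that crops one box out of a page."""
    x, y, width, height = box
    geometry = f"{width}x{height}+{x}+{y}"
    # +repage discards the original canvas offset so the slice is a plain image.
    return ["convert", page_png, "-crop", geometry, "+repage", out_png]


# Hypothetical measurements, in pixels, taken from one 300dpi page.
boxes = row_boxes(first_x=150, first_y=600, row_width=2250, row_height=48, row_count=3)
for number, box in enumerate(boxes, start=1):
    command = crop_command("page-01.png", box, f"page-01_row{number:02d}.png")
    print(" ".join(command))
```

Each printed line is a command you could paste into a terminal, or hand to `subprocess.run` as a list once you trust the measurements.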

Step 5 — Copy the images to a public web server

When the Mechanical Turk workers need to view the image, they’ll be loading it from a URL you specify. So, load up Transmit and S/FTP the images over to a public-facing web server. If you don’t have a web server, you can get one from companies like BlueHost — just make sure it includes SFTP or FTP access.

Step 6 — Prepare your batch

Now you’ll need to create a CSV file that contains each filename, one per line, under a header row that reads ‘image_url.’ You can do this by selecting all of the images in Finder (on a Mac), copying those images using Command+C, and pasting them using Command+V into a plain-text editor (like TextMate or Sublime Text). By pasting them into an editor that does not support images, only the filenames will appear, one per line. If you have spaces in the filenames, use Find/Replace to replace spaces with: %20

Now, add one row to the top that says: image_url — this serves as a column header.

It should look like this:

[Screenshot: the CSV with the image_url header and one filename per line]

Now we need to turn these filenames into URLs, based on the URL structure to the web server that you put these images on. You can use find and replace:

[Screenshot: find and replace turning the filenames into URLs]

Your final product will look like this:

[Screenshot: the final CSV of image URLs (diane-hofstede)]

Technical users, you can accomplish all of this by using `ls`, adding in the header line, and echoing out to a .csv file.
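If you’d rather script it than use `ls` and Find/Replace, here is a minimal Python sketch of the same idea. The base URL and file names are placeholders I made up; swap in wherever you actually uploaded the images.

```python
import csv
import io
from urllib.parse import quote


def image_url_csv(filenames, base_url):
    """Return CSV text with an image_url header and one URL per filename."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["image_url"])  # column header Mechanical Turk will read
    for name in sorted(filenames):
        # quote() percent-encodes spaces as %20, like the Find/Replace step.
        writer.writerow([base_url.rstrip("/") + "/" + quote(name)])
    return buffer.getvalue()


# Hypothetical file names and server location.
print(image_url_csv(["ma21_r1_c1.png", "ma21 r2 c1.png"], "https://example.com/turk"))
```

To run it against a real folder, pass `os.listdir("slices")` (or similar) as the list of filenames and write the returned text to a .csv file.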

Step 7 — Set up your HIT and determine your HIT pricing

Now we’re going to begin to create your Human Intelligence Task (“HIT”) in Mechanical Turk.  Each HIT will be one single data transcription process for one single image.

Once you’re logged into Mechanical Turk as a requester, visit the ‘Create’ section and choose ‘New Project.’ Select the type of project as ‘Transcription from an image’ and continue:

[Screenshot: choosing ‘Transcription from an image’ in Mechanical Turk]

Title, description, and keywords: On the next page, you’ll be prompted for a title, description, and keywords to describe your HIT to the workers. The workers search for words like type, transcribe, transcription, etc., so make sure to use those words in your generic description text and keywords.

Pricing: The most important thing here is deciding on what a proper price is. The largest Mechanical Turk requesters use APIs and automated systems to know the best time and the most appropriate pricing, a luxury we do not have. While you can get work done for a very low cost, I try to aim a little higher, at the equivalent of about $8 per hour, both for the benefit to the quality of the output and because I don’t want to feel like I’m taking advantage of people. The tradeoffs are that lower-cost work doesn’t get as much attention, and in turn gets completed at a much more leisurely pace, while higher-priced work gets snatched up right away, giving you less time to review the results coming in and take corrective measures if necessary. Also, workers are incentivized to do a good job if they’re getting paid reasonable rates.

The problem is that you don’t enter an hourly target rate in Mechanical Turk; you enter an amount per completed HIT. That is, how much will the worker earn for completing data entry of one single row? I pay somewhere between $0.06 and $0.16 depending on the complexity, and I encourage you to explore and try out different rates in small batches at different times of the day. Remember, if you have 2,000 rows, you’re going to end up paying $320 if you pay the workers $0.16 per HIT. But if a task takes 60 seconds and you’ve got 2,000 of them, you just saved 33 hours of your time. Totally worth $320, but it adds up fast.
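To turn an hourly target into a per-HIT amount, the arithmetic is simple enough to sketch in Python. The function names here are my own, and the totals cover worker pay only; Amazon’s cut is not included.

```python
def reward_per_hit(hourly_target, seconds_per_hit):
    """Per-HIT reward, in dollars, that works out to the hourly target."""
    return round(hourly_target * seconds_per_hit / 3600, 2)


def batch_cost(hit_count, reward):
    """Total paid to workers for the batch, before Amazon's cut."""
    return round(hit_count * reward, 2)


def hours_saved(hit_count, seconds_per_hit):
    """Hours of your own transcription time that the batch replaces."""
    return hit_count * seconds_per_hit / 3600


print(reward_per_hit(8.00, 60))          # about $8/hour at 60 seconds per row
print(batch_cost(2000, 0.16))            # 2,000 rows at $0.16 each
print(round(hours_saved(2000, 60), 1))   # hours of typing you skipped
```

With the numbers from this post, an $8 hourly target at 60 seconds per row comes out to about $0.13 per HIT, and 2,000 rows at $0.16 cost $320 for roughly 33 hours of work.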

Mechanical Turk lets you go as low as a penny, which I only do for simple yes/no data validation tasks — the workers complete those so quickly that the hourly rate at $0.01/HIT ends up averaging above $14/hour.

Multiple workers: you can have Mechanical Turk assign multiple workers to each single HIT, which is done to ensure accuracy by having more sets of eyes. I don’t recommend doing this, for reasons I explain later on.

Time allotted: Give the workers enough time to complete the task — figure out how long it would take you to transcribe the text and then triple it. The reason you limit the time is that some workers might accept a HIT and not complete it, locking that HIT from being completed by another worker. This essentially sets a timeout that allows the HIT to be worked on by someone else after the expiration. Additionally, some workers look to the ‘time allotted’ information to clue them in on how long it might take them. So don’t put three hours for something super small.

HIT expiration: it depends on how large your task is. If you’re converting a few hundred to a few thousand records, just keep it at the default of 7 days. If you have a reasonable rate, you’ll get fast results. I’ve had about 500 records transcribed within 20 minutes since so many workers do the work simultaneously.

Automatic approval: completing a HIT doesn’t guarantee a worker money. You have recourse if a worker enters totally fake information, in that you can reject a HIT. In reality, it’s unlikely that you’re going to have time to review every single entry in a meaningful way, so you must set up automatic approval: if you don’t actively reject a HIT within a certain time period, it’s automatically accepted. I cover approvals and rejections later on, but my suggestion here is to keep it around 8 to 24 hours. The topic of rejections is a big deal, so be sure to read that section.

Step 8 — Create the HIT data-entry form

This is the interface where the workers will view your instructions, see the image, and enter in the appropriate fields.

Provide a description: The goal is to describe in sufficient detail what the worker needs to do, provide them the image that needs to be transcribed, and the fields that the transcription goes into.  I always like to tell the workers what it is that they’re transcribing so that they have a better idea and can be more intuitive in the event that they have trouble reading a scanned item. So in the headline, put “Transcribe this political campaign contribution record.”

You must be very detailed in describing exactly what you want out of the workers, remembering that they come from all walks of life, from a variety of cultures and geographical areas. I describe each data element and what I want the workers to do, and I’ve learned to provide several examples from what I’ve seen in the data. When I had a field labeled ‘employer’ on a campaign finance report, most everyone understood that the field that looked like it had company names in it was probably the employer. But I got a message from someone confused because the field listed “unemployed,” and well, that’s the opposite of employer! Be very specific about each potential thing they might see, and how they should deal with it.

Here’s an example of one of my descriptions:

[Screenshot: one of my HIT descriptions]

Step 9 — Create the HIT form

Below your description, you’ll need to show the image that’s being transcribed, and give the workers fields to enter the information. I put a red box around the image so it’s easy to see where the source data is when the worker is scanning up and down as they enter the data. The image gets displayed by pulling the ‘image_url’ column data from the CSV file you’ve already created, so the way to get the image to display on the page is with this markup (adjust the sizing as necessary):

<img width="1000" height="16" src="${image_url}" style="border: 5px solid #c00; padding: 2px;" />

 

Protip: that CSV file you created feeds this form with variables, which is what’s happening with ${image_url}. If you had another element of data, or perhaps more than one image, simply feed in more variables in the same format.

For data entry fields, this is as simple as using <input> fields and giving each field a unique `name` attribute. That is, <input name="address" /> is all you’d need to give workers a place to input an address. When you get back the CSV after the workers have completed all of your HITs, there will be an ‘address’ column.

Since you’re quite limited on how much customization you can do to how the form looks, it might be a good option (sadly) to use tables to organize and lay out all of the fields on the page. I also like to include a checkbox so workers can alert me to situations where a record might be cut off, jumbled, or otherwise unreadable to the point that accuracy is questionable. Notice above that I clearly outlined my expectations surrounding when I want that box to be checked.
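If you have a lot of fields, you could generate the table markup rather than type it by hand. The following Python sketch is just one way to do that, not the form I actually used: the field names, labels, and the ‘unreadable’ checkbox name are hypothetical, and the `${image_url}` placeholder is left as literal text for Mechanical Turk to fill in.

```python
from html import escape

# Literal template variable; Mechanical Turk substitutes it from the CSV.
IMAGE_TAG = '<img src="${image_url}" style="border: 5px solid #c00; padding: 2px;" />'


def form_markup(fields):
    """Return table markup with one labeled text input per (name, label) pair."""
    rows = []
    for name, label in fields:
        rows.append(
            f'<tr><td><label for="{escape(name)}">{escape(label)}</label></td>'
            f'<td><input type="text" id="{escape(name)}" name="{escape(name)}" /></td></tr>'
        )
    # Checkbox so a worker can flag a cut-off or unreadable record.
    rows.append(
        '<tr><td colspan="2"><input type="checkbox" name="unreadable" value="yes" />'
        " Record is cut off or unreadable</td></tr>"
    )
    return IMAGE_TAG + "\n<table>\n" + "\n".join(rows) + "\n</table>"


# Hypothetical fields for a campaign contribution record.
print(form_markup([("name", "Contributor name"), ("address", "Address"), ("amount", "Amount")]))
```

Paste the printed markup below your description in the project editor, then adjust the image sizing and labels to suit your records.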

Here’s what my form looks like, which is far from ideal, but it’s tough to justify making it much of a better experience for something that will only exist for a couple hours. Perhaps at some point I’ll make something resembling a boilerplate for this:

[Screenshot: my HIT form]

After that, you’re all set. Preview and save it.

Step 10 — Run your batch

From the ‘Create’ tab in Mechanical Turk, click the bright orange ‘New Batch’ button and upload that CSV file you generated with all of the image URLs. It’ll take a few seconds to process, and then it will show you an example of your HIT form loaded with an actual image in it. Browse through them and make sure everything’s looking okay. If not, go back to the ‘Create’ tab and make the adjustments you need by editing the project.

Then move along to the next page where you’ll pre-pay for the work, with a cut going to Amazon. As soon as you hit submit, the HITs get sent to the workers and you’ll see this screen:

[Screenshot: the Mechanical Turk batch status screen]

Step 11 — Make sure the results are coming in as expected

Once the results start coming in, click the ‘Results’ button on the right to take a look at what’s coming in from the workers.  I suggest doing this early to make sure the right data is coming in so you can quickly cancel if not. You’ll see:

[Screenshot: results data coming in from the workers]

Perfect!

Step 12 — Download your data!

Once all the results are in, make sure all of the filters on the results page are disabled, and simply download your CSV file and start using your data in a more meaningful way.


About the approval and rejection process: Rejections are taken very seriously by workers, as the work that’s available to them is determined by their HIT approval rates to ensure that quality work is being performed. You can certainly reject poor-quality work, but you must provide a useful reason for rejecting the work. If you’re not fair in this process, there’s a Chrome and Firefox plugin that Mechanical Turk workers frequently use called Turkopticon, which serves as a reviewing system by workers, on requesters — you’ll be rated poorly, and some of the most dedicated workers won’t take up your HITs.

Validating your data: For campaign finance reports, I’m focused on accuracy. As I mentioned, there’s a way to get two workers to perform your HITs, but it doesn’t work well when you’re dealing with complex data like names and addresses. If a campaign finance report lists a $300 contribution made on January 1 by “Mr. John Smith” of “123 Main Street Northeast,” I’m okay if a worker enters in “John Smith” instead of “Mr. John Smith,” and I’m even more okay with workers entering in “123 Main St NE” or “123 Main Street NE” — you can see how an automated review process on this sort of data would prove impossible. However, the date and amount fields are incredibly important. Depending on how your data is formatted, I suggest resubmitting your data back as a new project just for verification purposes. In that event, you’d export a CSV with fields for ‘image_url’ and perhaps ‘amount’ and make a form that displays the image and asks the question “Is the amount shown in the image $X?” That’s a task you could certainly have multiple workers work on.
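Here is a rough Python sketch of that verification round trip. It makes several assumptions: rows are dictionaries as you’d get from `csv.DictReader`, the results columns are named `Input.image_url` and `Answer.amount`, and the yes/no form field is called `matches` (a name I made up). Check the header row of your own downloaded CSV before relying on any of these.

```python
from collections import Counter


def verification_rows(results):
    """Turn transcription results into input rows for a yes/no verification batch."""
    return [
        {"image_url": row["Input.image_url"], "amount": row["Answer.amount"]}
        for row in results
    ]


def majority_answers(verifications):
    """Map each image_url to the answer most workers gave, or None on a tie."""
    votes = {}
    for row in verifications:
        votes.setdefault(row["Input.image_url"], Counter())[row["Answer.matches"]] += 1
    decided = {}
    for url, counter in votes.items():
        ranked = counter.most_common(2)
        tied = len(ranked) == 2 and ranked[0][1] == ranked[1][1]
        decided[url] = None if tied else ranked[0][0]
    return decided
```

Anything that comes back as "no" or as a tie is a row worth checking against the image yourself.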