
PDF files

Topics on this page:
Help from Adrian Smith on extraction of text from a pdf file
Steve Brook's question & answer from Adrian
Siphoning Just the Text From a PDF Document, February 13, 2003, J. D. Biersdorfer

Help from Adrian Smith on extraction of text from a pdf file

Once, when I was trying to read a PDF file, Adrian gave me the instruction below, and it worked. I then worked out how to repeat his steps and produced my own text extraction from a PDF file.

Hmmmmm - choosing the 'T' tool in the acroread toolbar and selecting a page of text (use right-mouse Copy on the page NOT Edit copy from the menubar!!) gives

(actual text I wanted as Adrian extracted it is omitted for brevity).

Which shows that they assembled the PDF in columns (a bit like NewLeaf does in fact!!).

This ?could? be dug out with APL!! You will need to do the pages one at a time!!

AS

Steve Brook's question & answer from Adrian

I am looking for a way to read text from .pdf files under program control. Carl showed me your trick to copy and paste text to achieve the effect manually and that works fine. But I would like a process to scan a large number of .pdfs under (APL) program control.

This is a tricky one to answer without knowing a little more about the problem.

I am guessing that all the PDFs you need to 'screen-scrape' come from the same source, so you can rely on the format not changing too much. In the general case, it is definitely not possible to 'read' a PDF except with the Adobe Reader, as the text stream may well be compressed (LZW or gzip) and I am not aware of any tools to unpick the LZW compression. Gzip compression (Deflate/Inflate) is reversible with a DLL call which could be done in APL.
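As a rough illustration of the Deflate point, here is a minimal sketch in Python rather than APL. It assumes the compressed bytes have already been cut out of the PDF, and the sample data is made up for the example. Python's standard zlib module performs the same Inflate step that a DLL call would.

```python
import zlib


def inflate(data: bytes) -> bytes:
    """Undo Deflate (zlib) compression on one block of bytes."""
    return zlib.decompress(data)


# Hypothetical stand-in for a compressed page content stream.
plain = b"BT 1 0 0 1 72 700 Tm (Hello) Tj ET"
packed = zlib.compress(plain)

print(inflate(packed))
```

This does nothing for LZW-compressed streams, which would need a separate decoder.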

If the PDF is in 'plain text' or 64-bit MIME only, then you have a chance. You can fairly easily walk the object structure to find the page content for each page, and then the exact arrangement of this will depend on the way the original software created its print file. Probably it will consist of pairs of (1 0 0 1 x y Tm) commands, which set the text position, followed by (text) Tj commands to write the text. The ordering of these within each page is accidental and you might find (for example) all the bold words followed by all the plain words.
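A minimal sketch of how such Tm/Tj pairs might be pulled out and put back into reading order, written in Python rather than APL. It assumes an uncompressed content stream in which every Tj is immediately preceded by its own Tm. The names PAIR, text_fragments and reading_order, and the sample stream, are invented for the example. Only the last two numbers before Tm (the x and y position) are used.

```python
import re

# "... x y Tm (text) Tj": capture x, y and the literal string, allowing
# backslash escapes such as \( and \) inside the parentheses.
PAIR = re.compile(
    rb"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Tm\s*"
    rb"\(((?:\\.|[^\\()])*)\)\s*Tj"
)


def text_fragments(stream: bytes):
    """Return (x, y, text) for each Tm/Tj pair found in the stream."""
    frags = []
    for m in PAIR.finditer(stream):
        x = float(m.group(1).decode("ascii"))
        y = float(m.group(2).decode("ascii"))
        # Unescape \( \) and \\ only; other escapes are left untouched.
        raw = re.sub(rb"\\([()\\])", rb"\1", m.group(3))
        frags.append((x, y, raw.decode("latin-1")))
    return frags


def reading_order(frags):
    """Sort top to bottom (larger y is higher on the page), then left to right."""
    return [t for x, y, t in sorted(frags, key=lambda f: (-f[1], f[0]))]


# Made-up stream in which the bold word is written before the plain ones.
sample = (b"BT /F2 10 Tf 1 0 0 1 120 700 Tm (bold) Tj "
          b"1 0 0 1 72 700 Tm (Some) Tj "
          b"1 0 0 1 72 688 Tm (next line) Tj ET")

print(reading_order(text_fragments(sample)))  # ['Some', 'bold', 'next line']
```

Sorting on position is what undoes the accidental ordering. Text laid out in columns would need an extra grouping step on x before the sort.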

Have a look at a typical PDF in a text editor (PFE is my favourite, or TextPad) which tolerates line-feed delimited files. Or just []NREAD and )edit it for that matter! Can you identify bits that look like the text in what you see?? My notes from Naples last year should help you navigate the outer structure - attached in case you missed them!
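For a quick look at the outer structure without a text editor, something like the following Python sketch could list each object and say whether it carries a stream and which compression filter it names. The function name outline and the sample bytes are assumptions made for the example. The pattern is deliberately naive: it would be fooled by binary stream data that happens to contain the word endobj.

```python
import re

# "n g obj ... endobj", taken non-greedily so each object is matched separately.
OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)endobj", re.DOTALL)


def outline(pdf_bytes: bytes):
    """Return (object number, has_stream, filter names) for each object found."""
    rows = []
    for m in OBJ.finditer(pdf_bytes):
        body = m.group(3)
        # Look for filter names only in the dictionary part before any stream data.
        head = body.split(b"stream", 1)[0]
        filters = [name.decode("ascii")
                   for name in (b"FlateDecode", b"LZWDecode")
                   if b"/" + name in head]
        rows.append((int(m.group(1).decode("ascii")), b"stream" in body, filters))
    return rows


# Made-up fragment standing in for the bytes of a real file.
sample_pdf = (b"%PDF-1.3\n"
              b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
              b"2 0 obj\n<< /Length 5 /Filter /FlateDecode >>\n"
              b"stream\nxxxxx\nendstream\nendobj\n")

for row in outline(sample_pdf):
    print(row)
```

Objects that show a stream but no filter are the ones most likely to contain readable text commands.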

If you can see something in the PDF that looks like your numbers, then we can probably go ahead and do some more digging!

Adrian Smith, Causeway Graphical Systems Ltd, Tel: +44 (0) 1653 696760, http://www.grapl.com

Siphoning Just the Text From a PDF Document

February 13, 2003, J. D. Biersdorfer

Q. Is there a way to copy the text out of a PDF file so I can reformat it and use it in a word-processing program?

A. A PDF file (the initials stand for Portable Document Format) can be displayed and printed in full color with all of the fonts, pictures and graphics embedded within the document. Recipients do not need the original layout program or typefaces to open the file, just free Adobe Acrobat Reader software.

There are several ways to extract the text from a PDF file for use in another program. You may encounter problems trying to copy the text, however, if the file has been copy-protected by its creator or if the text you want is actually an image file instead of embedded text.

If you have the file open in the Acrobat Reader program and have clicked on the Text Select Tool from the menu bar, you can usually drag the mouse over text and copy it with the standard keyboard or menu commands, and then paste it into your word-processing program. Text arranged in columns within the PDF file may come in mixed together because the Text Select Tool reads horizontally across the page, but you can drag the mouse around each column to select it in order by holding down the Control key in Windows or the Option key on the Macintosh.

If you have a large PDF file to copy, text-extraction software can help automate the process. Web sites like PDF Zone (http://www.pdfzone.com) and the PDF Store (http://www.pdfstore.com) sell professional software to harvest the text from PDF files, and shareware sites like Tucows (www.tucows.com) have smaller, less expensive programs that can also handle the job. For those who own the full Adobe Acrobat program, Adobe Systems has a page on the topic in the technical support area of its Web site at http://www.adobe.com/support/techdocs/1c356.htm.

J. D. BIERSDORFER


Copyright 2003 The New York Times Company
