Frequently Asked Questions
How can I extract text from a PDF file ?
Could you show me a simple example of how to extract text from a PDF file whose name is supplied ?
The example below is probably the simplest way to do this :
    <?php
    include ( 'path/to/PdfToText.phpclass' ) ;

    $file = 'sample.pdf' ;              // Replace this value with the path to your PDF file
    $pdf = new PdfToText ( $file ) ;    // Instantiate a PdfToText object
    echo $pdf -> Text ;                 // Output the text contents
 
How should I report a problem ?
The PdfToText class does not work as I expected. How should I report a problem ?
There are two ways to do this :
  • Go to the Demo page, upload your file, then click on the Report button once the text extraction process is finished.
    After filling in your email address, you will be able to enter a short description of your issue. Then click on the Ok button to send me an email directly, which will include your PDF file and the generated output as attachments.
  • Contact me directly at : providing me with your sample PDF document and a description of your problem.

In any case, I will be grateful for your contribution, which is the best way to help me make the PdfToText class even better !
 
What to do if the PdfToText class gives unexpected results ?
Sometimes, the PdfToText class can give unexpected results, such as empty contents, accented characters replaced by other characters, or even garbage output. What should I do in this case ?
This problem can have several origins :
  • Your PDF file contains only bitmaps. To be sure, open it with Adobe Acrobat Reader, then try to select some text in your document using the mouse. If you are unable to select text, or if you are unable to copy and paste the selection into a basic text editor such as Notepad, then your PDF file contains no text, only bitmaps.
  • Your PDF file is password-protected. The Adobe PDF standard makes the distinction between two kinds of password protection :
    • User password protection : this protection scheme will make tools such as Acrobat Reader prompt you for a password whenever you want to open the document.
    • Owner password protection : you will be prompted for a password whenever you want to modify the document. You will still be able to open the document for reading without entering a password.

    Unfortunately, in both cases, the PDF contents are encrypted. In its current version, the PdfToText class is unable to handle encrypted PDF files, even if you have the right password. This feature will be implemented in a future release.
  • Occasionally, some accented characters may be replaced by something that looks like spurious data. This happens, for example, with some documents containing German umlauts.
    This is a known issue, which will be fixed in a future release.
  • The output of the PdfToText class produces mostly garbage characters.
    Contact me ! (see below)

If you think that none of the above cases describes your situation, please feel free to contact me, even if in doubt !
See the How should I report a problem ? question for an explanation on how to do this.
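
The password-protection case can be checked from PHP before reporting a problem. The sketch below is a heuristic only and is not part of the PdfToText class : the pdf_looks_encrypted() name is hypothetical, and the check assumes that an encrypted file declares an /Encrypt entry in plain text, which is how the PDF format normally signals encryption.

```php
<?php
// Hypothetical helper, not part of PdfToText : returns true when the raw
// file contents contain an /Encrypt entry, which usually means that the
// document is password-protected (user or owner password).
function pdf_looks_encrypted($contents)
{
    return strpos($contents, '/Encrypt') !== false;
}
```

You would call it as pdf_looks_encrypted ( file_get_contents ( 'sample.pdf' ) ). A positive result is only a hint, since the same byte sequence could also appear in uncompressed page data.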
 
My data table is not displayed correctly
I have a PDF file containing the following data, displayed as a table :
    CustomerId  Firstname  Lastname  Country
    CI01928     Jane       Doe       US
    CI17888     John       Smith     UK
    ...
    CI89012     Christian  Vigh      FR
However, when processed with the PdfToText class, the result is the following :
    CustomerId Country Firstname Lastname
    CI01928JaneDoeUS
    CI17888
    JohnSmithUK
    ...
    CI89012ChristianVighFR
The above example highlights several different problems :
  • The title line makes Country appear before Firstname
  • Cell contents appear concatenated ("CI01928JaneDoeUS")
  • Sometimes, an extra line break appears, such as the one after the customer id "CI17888"
First of all, you have to know that the PDF standard, unlike HTML, has absolutely NO notion of what a table is. A table is simply a set of rectangles and text strings drawn at various x/y positions, and there is no way to recognize that this set of drawing instructions represents a table.

The problems listed above have several origins :

  • Title line columns appearing in different order : this is due to the way the page layout is currently handled. See the Page layout not rendered correctly FAQ for more explanations on why this happens.
  • Cell contents catenated together : here again, this issue is related to how the page layout is handled.
  • Extra line breaks : this may be due to the fact that some relative/absolute page positions are used inside the PostScript-like PDF language. In its current version, the PdfToText class handles only a few combinations of relative/absolute coordinates.
There is a solution, however, for cell contents being concatenated : you can set the BlockSeparator property to a string value that will be used to separate strings, as in the following example :
    <?php
    require ( 'path/to/PdfToText.phpclass' ) ;

    $file = 'sample.pdf' ;
    $pdf = new PdfToText ( ) ;          // Instantiate a PdfToText object, without specifying a file name
    $pdf -> BlockSeparator = ';' ;      // Sets the BlockSeparator property to a semicolon
    $pdf -> Load ( $file ) ;            // Explicitly load file contents
    echo $pdf -> Text ;
This will produce the following output, using our example file :
    CustomerId; Country; Firstname; Lastname
    CI01928;Jane;Doe;US
    CI17888
    John;Smith;UK
    ...
    CI89012;Christian;Vigh;FR
Note that this will not solve the issue of the unexpected line break, however.
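
Once a separator is set, each output line can be split back into cells with standard PHP string functions. The sketch below is only an illustration : the split_rows() name is hypothetical, and it works on the extracted text as a plain string, so it cannot repair the stray line break after "CI17888".

```php
<?php
// Hypothetical helper, not part of PdfToText : splits separated text into
// an array of rows, each row being an array of trimmed cell values.
function split_rows($text, $separator = ';')
{
    $rows = [];

    // \R matches any kind of line break (LF, CR or CRLF)
    foreach (preg_split('/\R/', $text) as $line) {
        if (trim($line) === '') {
            continue;   // skip empty lines
        }
        $rows[] = array_map('trim', explode($separator, $line));
    }

    return $rows;
}
```

You would pass it the extracted text, for example split_rows ( $pdf -> Text, ';' ), using the same separator that was assigned to the BlockSeparator property.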
 
Page layout not rendered correctly
When running the PdfToText class on my sample PDF file, the page layout is not rendered correctly. For example :
  • Some text appears on the page at a position that is different from the original one
  • Entire paragraphs can be moved inside the page : for example, a paragraph appearing at the bottom of the page appears at the top of the page after extraction
  • Columns in a table can appear in an order different from the original one
All those issues are due to the way the PdfToText class handles the page layout. Initially, the class was intended for pure text extraction only. As new PDF samples showing such issues were submitted to me, I put in place some basic layout handling.

It is clear, however, that this kind of page layout handling has reached its limits ; this is why a future version of the PdfToText class will provide output that is as close as possible to the original document, as far as text designed for graphical rendering can be matched with pure ASCII output.

 
Changes to options and/or properties are ignored
I tried to change the BlockSeparator property, but it does not appear in the output. Example code below :
    <?php
    require ( 'path/to/PdfToText.phpclass' ) ;

    $file = 'sample.pdf' ;
    $pdf = new PdfToText ( $file ) ;
    $pdf -> BlockSeparator = ';' ;      // Sets the BlockSeparator property to a semicolon
    echo $pdf -> Text ;
This is perfectly normal behavior. When you supply a filename to the class constructor, it calls the Load() method to extract the file contents using the default property settings. Modifying any property after that has no effect, unless you call the Load() method yourself !

To be able to customize PdfToText properties before extracting text, you need to call the constructor first without specifying any filename, then modify the properties you want to customize and, finally, call the Load() method, as in the following example :

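    <?php
    require ( 'path/to/PdfToText.phpclass' ) ;

    $file = 'sample.pdf' ;
    $pdf = new PdfToText ( ) ;          // No file name : nothing is loaded yet
    $pdf -> BlockSeparator = ';' ;      // Customize properties before loading
    $pdf -> Load ( $file ) ;            // Extract text using the customized properties
    echo $pdf -> Text ;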
 
I'm receiving exceptions or warnings when running the class
When running the PdfToText class on one of my sample PDF files, I'm receiving exceptions and/or warnings. What should I do ?
Exceptions and warnings can denote abnormal conditions, such as bugs or unexpected PDF constructs that the class does not handle yet.

In any case, please feel free to contact me (see the How should I report a problem ? question on how to do that) and send me your sample PDF file together with a description of your problem.
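
To gather the details for a problem report, exceptions and warnings can be collected instead of being left to interrupt the script. The following is a minimal sketch using only standard PHP error handling ; the run_and_collect() name and the layout of the returned array are hypothetical and not part of the PdfToText class.

```php
<?php
// Hypothetical helper : runs an extraction callback and returns its result
// together with any warnings and the exception message, if one was thrown.
function run_and_collect(callable $extract)
{
    $report = ['text' => null, 'warnings' => [], 'exception' => null];

    // Record warnings and notices instead of displaying them
    set_error_handler(function ($severity, $message) use (&$report) {
        $report['warnings'][] = $message;
        return true;    // do not run PHP's internal error handler
    });

    try {
        $report['text'] = $extract();
    } catch (Exception $e) {
        $report['exception'] = get_class($e) . ': ' . $e->getMessage();
    } finally {
        restore_error_handler();
    }

    return $report;
}
```

The callback would typically create the PdfToText object and return its Text property, for example function ( ) { $pdf = new PdfToText ( 'sample.pdf' ) ; return $pdf -> Text ; }. The collected messages can then be pasted into the description of the problem.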

 
PHP regular expression limit reached
I got the following error message when running the PdfToText class on one of my sample PDF files :
PHP regular expression limit reached (pcre.backtrack_limit)
or :
PHP regular expression limit reached (pcre.recursion_limit)
What should I do ?
The PdfToText class relies heavily on regular expression matching for analyzing PDF file contents. Sometimes, the file contains data that is complex enough to make the PHP pcre package reach its default limits.

To solve this problem, simply modify the following settings in your php.ini file :

  • pcre.backtrack_limit (the default value is 1000000)
  • pcre.recursion_limit (the default value is 100000)

Normally, you should only have to modify the setting that is named in the PHP error message. Try doubling it first and, if that works, lower the limit step by step until you find an acceptable value for your case.
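
If editing php.ini is not an option, both settings can usually be changed at run time with the standard ini_set() function, before the PdfToText object is created. The following is a minimal sketch, assuming your host allows ini_set() ; the double_pcre_limit() helper name is hypothetical.

```php
<?php
// Hypothetical helper : doubles a PCRE limit for the current script only
// and returns the value that is now in effect.
// $setting is either 'pcre.backtrack_limit' or 'pcre.recursion_limit'.
function double_pcre_limit($setting)
{
    $doubled = (int) ini_get($setting) * 2;
    ini_set($setting, (string) $doubled);

    return (int) ini_get($setting);
}
```

Call double_pcre_limit ( 'pcre.backtrack_limit' ) or double_pcre_limit ( 'pcre.recursion_limit' ), depending on the error message, before loading the PDF file.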
 
Running out of memory when extracting text from a PDF file
When running the PdfToText class on one of my PDF files, I get the following error message :
Fatal error: Allowed memory size of xxx bytes exhausted (tried to allocate yyy)
The amount of memory consumed by the PdfToText class depends on the PDF file that has been supplied. It can sometimes require more than the maximum memory allowed by the memory_limit setting of your php.ini file.

You will need to assign a higher value to this setting. Normally, a value of 128M should handle all cases.
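
The same setting can usually be raised for a single script with the standard ini_set() function, assuming your host does not forbid it. A minimal sketch, to be placed before the PdfToText object is created :

```php
<?php
// Raise the memory limit for the current script only ; 128M is the value
// suggested above and may need adjusting for very large PDF files.
ini_set('memory_limit', '128M');
```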

 
Execution time limit reached
When running the PdfToText class on one of my PDF files, I get the following error message :
Fatal error: Maximum execution time of 30 seconds exceeded
The class needs more than 30 seconds to process the PDF file you supplied to it ; this can happen if your PDF file is very large, or if your server is very slow. In both cases, you will need to change the max_execution_time setting of your php.ini file to a higher value, in seconds.
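
When php.ini cannot be edited, the limit can usually be changed for one script with the standard ini_set() function, assuming your host permits it. In the minimal sketch below, the value of 120 seconds is an arbitrary example :

```php
<?php
// Allow the current script to run for up to 120 seconds instead of the
// default 30. Place this before the PdfToText object is created.
ini_set('max_execution_time', '120');
```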

Since I am regularly trying to optimize the PdfToText class, if you encounter such an issue, I would be happy to have a look at the PDF file that generated this error just to check why it took so long to process it. If you like, have a look at the How should I report a problem ? question for an explanation on how to send me a problem report.

 
Garbage characters appear in the output (I)
When displaying the output of the PdfToText class in a browser window, I see a lot of garbage characters.
The character set used by the PdfToText class is UTF-8. If the default character set of your Web server is different, you may need to add the following tag to the <head> section of the page generated by your script :
<meta charset="UTF-8">
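
As an illustration of where the tag goes, the sketch below wraps extracted text in a minimal HTML page. The render_utf8_page() name is hypothetical and not part of the PdfToText class ; the text is escaped with the standard htmlspecialchars() function so that characters such as < and & are displayed literally.

```php
<?php
// Hypothetical helper : returns a minimal HTML page that declares UTF-8 in
// its <head> section and shows the supplied text in a <pre> block.
function render_utf8_page($text)
{
    // Escape the text so that it cannot be interpreted as HTML markup
    $body = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

    return "<!DOCTYPE html>\n"
         . "<html>\n<head>\n<meta charset=\"UTF-8\">\n</head>\n"
         . "<body>\n<pre>" . $body . "</pre>\n</body>\n</html>\n";
}
```

You would then output the page with echo render_utf8_page ( $pdf -> Text ).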
 
Garbage characters appear in the output (II)
My web server uses the UTF-8 character set, but the output of the PdfToText class contains garbage characters.
Sometimes, you may notice that a character is replaced by another one, which generally looks more like an ASCII control character than a printable one. This is especially true for characters such as German umlauts.

This is a known issue : a future version will fix this problem.

However, if you get entire blocks of garbage characters, please report the problem to me ! (see the How should I report a problem ? question)

 
My Apache server fails on Windows
My Windows Apache server crashes when processing a PDF file and I get a "Connection reset" error.
The PdfToText class heavily relies on regular expressions (PCRE extension) to decode PDF contents. Sometimes such decoding may consume more stack space than expected.

On Unix platforms, the default stack size allocated by Apache to worker threads is 8 MB. On Windows platforms, this limit is set to 1 MB, which may be insufficient when processing certain PDF files.

To overcome this limitation, you have to include the Apache MPM configuration file in your httpd.conf file, and set a new stack size for threads (in bytes), as in the following example :

    Include conf/extra/httpd-mpm.conf
    ThreadStackSize 8388608