PDF to Text Conversion

Post by Brian Farre » Thu, 17 Jul 2003 06:09:56



I am looking for a way to convert PDF documents into a Unicode text
file. The process needs to support having a separate text file for
each page. Also I need to write an application around the process so
PDF documents can be converted in bulk, driven from a database.

Methods I have considered for processing these documents include the
following.

1. Straight PDF to Text conversion/extraction. All of these tools
either don't support Unicode fonts or corrupt the text output on
certain fonts.
2. Using a text printer driver to simply print a PDF file and convert
into a text file. I am unable to find a print driver that supports
Unicode. I may have missed an obvious setting in some of these.
3. Converting a PDF to RTF and then to TXT. While this seems to work
it isn't as efficient as I would like. Also I have to break the PDF
into smaller PDFs, one per page, before I convert to RTF using the
Save As function in Adobe. Although I am sure I could optimize this
method I would prefer to find something more efficient.
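
A minimal sketch of the "one text file per page, converted in bulk" requirement, assuming the xpdf command-line tools `pdfinfo` and `pdftotext` are installed and that your build accepts `-enc UTF-8`. The function names, the output file naming, and the idea that `pdf_paths` comes from a database query are illustrative assumptions, not anything prescribed by the tools. It will not help with fonts whose encodings are already broken inside the PDF.

```python
import re
import subprocess
from pathlib import Path


def parse_page_count(pdfinfo_output: str) -> int:
    """Read the page count from the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def page_command(pdf: Path, page: int, out_dir: Path) -> list:
    """Build the pdftotext call that writes one page to its own UTF-8 file."""
    out_file = out_dir / f"{pdf.stem}_p{page:04d}.txt"  # hypothetical naming
    return ["pdftotext", "-enc", "UTF-8",
            "-f", str(page), "-l", str(page),
            str(pdf), str(out_file)]


def convert_pdf(pdf: Path, out_dir: Path) -> int:
    """Convert every page of one PDF; returns the number of pages written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    info = subprocess.run(["pdfinfo", str(pdf)], capture_output=True,
                          text=True, errors="replace", check=True)
    pages = parse_page_count(info.stdout)
    for page in range(1, pages + 1):
        subprocess.run(page_command(pdf, page, out_dir), check=True)
    return pages


def convert_batch(pdf_paths, out_dir: Path) -> dict:
    """pdf_paths would come from the database query (not shown here)."""
    results = {}
    for path in pdf_paths:
        pdf = Path(path)
        try:
            results[str(pdf)] = convert_pdf(pdf, out_dir / pdf.stem)
        except (subprocess.CalledProcessError, ValueError, OSError) as exc:
            results[str(pdf)] = exc  # record the failure and keep going
    return results
```

Calling `convert_batch(["a.pdf", "b.pdf"], Path("out"))` would leave one `.txt` file per page under `out/a/` and `out/b/`. Running pdftotext once per page is simple rather than fast; it avoids splitting the PDF into per-page PDFs first.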

 
 
 

PDF to Text Conversion

Post by John V-Tracke » Thu, 17 Jul 2003 06:43:13


Hi Brian,

Try PDF-Tools. It offers PDF to Unicode or ASCII conversion, among many other
features, with full font support. The only current problem is that complex
tables can cause some issues, but we are working on this now and hope to
have a new parser out in the near future.

Free eval version :
http://docu-track.com/index.php?page=28&dwld=PDF-X25TEval.EXE

Details:  http://docu-track.com/index.php?page=36

--
Best Regards

John Verbeeten
Tracker Software Products
PDF-XChange & SDK, Image-XChange SDK,
PDF-Tools & SDK, TIFF-XChange & SDK, DocuTrack.

www.docu-track.com
User Updates (serial # required)
http://www.docu-track.com/download.php
Trials:
http://www.docu-track.com/index.php?page=28
Pricing:
http://www.docu-track.com/index.php?page=30



 
 
 

PDF to Text Conversion

Post by George N. White II » Thu, 17 Jul 2003 07:18:19



> I am looking for a way to convert PDF documents into a Unicode text
> file. The process needs to support having a separate text file for
> each page. Also I need to write an application around the process so
> PDF documents can be converted in bulk driven by from a database.

> Methods I have considered for processing these documents include the
> following.

> 1. Straight PDF to Text conversion/extraction. All of these tools
> either don't support Unicode fonts or corrupt the text output on
> certain fonts.

$ man pdftotext
[...]
-enc encoding-name
     Sets  the  encoding  to  use for text output.  The encoding-name
     must be defined with the  unicodeMap  command  (see  xpdfrc(5)).
     This defaults to "Latin1" (which is a built-in encoding).  [config
     file: textEncoding]
[...]
BUGS
    Some  PDF  files contain fonts whose encodings have been mangled beyond
    recognition.  There is no way (short of OCR) to extract text from these
    files.
 ------------------------------------------

"Mangled encodings" sounds like what you have.

> 2. Using a text printer driver to simply print a PDF file and convert
> into a text file. I am unable to find a print driver that supports
> Unicode. I may have missed an obvious setting in some of these.

I don't see how this will overcome mangled encodings.

> 3. Converting a PDF to RTF and then to TXT. While this seems to work
> it isn't as efficient as I would like. Also I have to break the PDF
> into smaller PDF's for each page before I convert to RTF using the
> save as function in Adobe. Although I am sure I could optimize this
> method I would prefer to find something more efficient.

This suggests that RTF somehow deals with the mangled encodings.  Maybe
you can figure out the mangling algorithm and devise a map file for
use with pdftotext.
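
As a rough illustration of "figuring out the mangling", here is a sketch that assumes the mangling is a plain one-to-one character substitution, which may not be true for your files. It learns the substitution from one garbled sample and the matching correct text (for example, the same passage taken from the RTF route), then applies it to further output. This is post-processing in Python, not an xpdfrc map file, and the function names are made up for the example.

```python
def learn_mapping(garbled: str, expected: str) -> dict:
    """Derive a character substitution table from two aligned samples."""
    if len(garbled) != len(expected):
        raise ValueError("samples must align character for character")
    mapping = {}
    for bad, good in zip(garbled, expected):
        known = mapping.setdefault(ord(bad), good)
        if known != good:
            # The same garbled character stood for two different real ones.
            raise ValueError(
                f"{bad!r} maps to both {known!r} and {good!r}; "
                "the mangling is not a simple one-to-one substitution"
            )
    return mapping


def repair(text: str, mapping: dict) -> str:
    """Apply the learned table; characters not in the table pass through."""
    return text.translate(mapping)


if __name__ == "__main__":
    table = learn_mapping("Khoor", "Hello")
    print(repair("Khoor", table))
```

If `learn_mapping` raises on a real sample, that is a sign the encoding damage is per-font or otherwise more complicated than a single table can describe.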

--

 
 
 

PDF to Text Conversion

Post by Nanc » Sat, 19 Jul 2003 13:18:48


Try downloading pdftools from http://www.paqtool.com. It does everything you want.




 
 
 

1. pdf to text conversion

All,

I am having some problems with conversion of pdf to text and wonder if
someone could help.

I have a program which successfully manages to get the text out of the
majority of pdfs that it encounters.

Sadly it doesn't work on a new document: the difference I can see is that
it includes encrypted fonts.

The basic flow of my program is...
* PDDocCreateWordFinder
* for each page
   * PDDocAcquirePage
   * PDWordFinderAcquireWordList
   * for each word
        * PDWordGetString
        * print out the word

However for this document the words that I am getting back are garbled.

I can see how to get the font encoding (PDFontGetEncodingIndex [the answer
is 7]) and also the encoding array (PDFontAcquireEncodingArray [the answer
seems to be an array s.t. element 70 is the string "c70", element 50 is the
string "c50" etc])

However I can't see what to do with these values.  Of course I could be on
totally the wrong track and need to do something else: I am really
confused.
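
One way to read the encoding array described above: if the name at element N is just "cN", the name restates the character code and says nothing about which character the glyph represents, which would be consistent with garbled words. The sketch below only checks for that pattern. It assumes the array has already been copied out of the SDK into a plain Python list indexed by character code, with `None` for unused slots, and that the number in the name is decimal; it does not call the Acrobat API, and the function name is invented for the example.

```python
import re

# Matches names of the form reported in the post, e.g. "c50" or "c70".
CODE_NAME = re.compile(r"^c(\d+)$")


def self_named_codes(encoding_array):
    """Return the codes whose glyph name merely repeats the code itself."""
    hits = []
    for code, name in enumerate(encoding_array):
        if not name:
            continue  # unused slot
        match = CODE_NAME.match(name)
        if match and int(match.group(1)) == code:
            hits.append(code)
    return hits


if __name__ == "__main__":
    array = [None] * 256
    array[50] = "c50"
    array[70] = "c70"
    array[65] = "A"  # a conventional glyph name, for contrast
    print(self_named_codes(array))
```

If most of the used slots come back from `self_named_codes`, the font probably carries no usable character identities, and the text would have to be recovered some other way (a hand-built substitution table, or OCR).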

As alluded to below, the documentation is not the most forthcoming on
this matter.

The document in question was (I think) automatically created by Acrobat
Distiller 3.01 for Windows from a MSWord document.

Does anyone have any idea what my problem is and how I could possibly go
about solving it?

Many thanks in advance

Nik Cunniffe



[cut]

[cut]

2. Enter doesn't 'click' button on UserControl

3. PDF to text conversion ???

4. Lotus Notes R4 ID files

5. PDF to Text conversion

6. 68020 Assembly

7. pdf to text conversion?

8. MANAGEMENT DASHBOARD CHECKLIST

9. Pdf-to-text conversion: need help!

10. Gymnast - text to PDF conversion?

11. text to pdf conversion via COM/.Net Assembly

12. software for batch TIF to Image + text PDF conversion

13. problems with conversion of pdf to text