Text not found in some PDF documents, Found in others - Why?

Post by james s shoenfe » Wed, 14 May 2003 07:56:11



I have used the text select and column select buttons before, but they
don't work anymore.  Below is my real problem:

I received data in a TIF document (three columns: first is text,
second is a number, the third is a date) that I want to put into Excel
without having to manually re-key everything.  So, I thought I could
convert it to PDF.  I tried this two ways:

1) Printing the TIF file on paper, then scanning the printed copy with
HP scanning software, which lets me save it as Adobe PDF ("editable
text").
2) In Adobe, opening the TIF file using "File" > "Open as Adobe PDF".

I figured I could then use the column select tool or the text select
tool to get the data into Excel.  Or I might be able to save it as RTF
from there.  But these tools don't work in this PDF document.  And it
won't save it as RTF, either.  In fact, even the CTRL-F, Edit Find,
and Binocular buttons don't work.  (By "don't work", I mean that I get
prompted for the text I'd like to search for, but Adobe fails to find
the letter "a", even though it's in the document many times.)

I found a PDF document on the web
(http://casact.org/library/studynotes/clark6.pdf) and saved it on my
hard drive.  When I open it with the same software, the search
buttons, etc. all work on that document.

Any advice, tips, or information would be appreciated!

Jim

 
 
 

Text not found in some PDF documents, Found in others - Why?

Post by Wald » Wed, 14 May 2003 16:03:28


A TIF file is a bitmap, which means it contains only pixels, not text that
the computer can "see" (although the text is visible to humans). Converting
a bitmap to PDF just wraps the same image in a PDF page, so there is still
nothing to search or select. You need OCR software (ABBYY FineReader,
Caere OmniPage, Xerox TextBridge, Adobe Capture, ...) to convert the image
to text before the text becomes searchable and selectable.
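
If you want a quick way to tell the two kinds of PDF apart, here is a rough
Python sketch. It is only a heuristic and rests on an assumption: a PDF
with real text normally declares font resources ("/Font"), while a
scan-only PDF normally contains image objects and no fonts. It reads the
raw bytes, so it can be fooled by PDFs that store those dictionaries in
compressed object streams. The function name is made up for this example.

```python
import re


def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw PDF bytes mention images but no fonts.

    Assumption: searchable PDFs declare /Font resources; scan-only PDFs
    contain /Subtype /Image objects and no fonts. Not a reliable test for
    PDFs that compress their object dictionaries.
    """
    has_font = b"/Font" in pdf_bytes
    has_image = re.search(rb"/Subtype\s*/Image", pdf_bytes) is not None
    return has_image and not has_font
```

You would call it as looks_image_only(open("scan.pdf", "rb").read()),
where "scan.pdf" stands for your own file. If it returns True, the file
most likely needs OCR before Find or the text select tool can work.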

If you print it and scan it afterwards, you will probably lose some quality,
which makes OCR less accurate. If your scanning application does the OCR
itself, you can give it a go.
Personally, I prefer option 2 combined with OCR software.
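
Once the OCR software has given you plain text, getting it into Excel is
mostly a matter of splitting each line into its three columns. A minimal
Python sketch follows, assuming each OCR line looks like
"text  number  date" and that the number and the date contain no spaces
(the text column may). The function name and that line layout are my
assumptions, not something your OCR package guarantees.

```python
import csv
import io


def ocr_text_to_csv(ocr_text: str) -> str:
    """Turn OCR'd lines of 'text number date' into CSV that Excel can open."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split from the right: the last two fields are the number and the
        # date, and everything before them is the text column.
        fields = line.rsplit(None, 2)
        if len(fields) == 3:
            writer.writerow(fields)
        # Lines that do not have three fields are skipped; check them by hand.
    return out.getvalue()
```

Write the returned string to a .csv file and open it in Excel. OCR output
usually contains some misread characters, so spot-check the number and date
columns against the original image.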

Waldo





 
 
 

1. Why are "requested pages found" and "not found in cache" both zero?

Hi all,

  According to the output of db_stat -m, both "Requested pages found in
the cache" and "Requested pages not found in the cache" are zero, even
though I have already added 500 records and done some searching. Why is
that?

  What I intend to do is make the program use less memory overall, or at
least less memory at startup, because it runs on a low-memory device. How
can I set a limit on the memory usage, or otherwise lower the memory usage
of my app (an address book)?

  Many thanks.

Regards,
Geiger

/proc/*/status at start up:
VmSize:     4228 kB
VmLck:         0 kB
VmRSS:       996 kB
VmData:     1800 kB
VmStk:        20 kB
VmExe:        16 kB
VmLib:      1992 kB


82KB 176B       Total cache size.
1       Number of caches.
88KB    Pool individual cache size.
0       Requested pages mapped into the process' address space.
0       Requested pages found in the cache.
0       Requested pages not found in the cache.
0       Pages created in the cache.
0       Pages read into the cache.
0       Pages written from the cache to the backing file.
0       Clean pages forced from the cache.
0       Dirty pages forced from the cache.
0       Dirty pages written by trickle-sync thread.
0       Current total page count.
0       Current clean page count.
0       Current dirty page count.
37      Number of hash buckets used for page location.
0       Total number of times hash chains searched for a page.
0       The longest hash chain searched for a page.
0       Total number of hash buckets examined for page location.
0       The number of hash bucket locks granted without waiting.
0       The number of hash bucket locks granted after waiting.
0       The maximum number of times any hash bucket lock was waited for.
12      The number of region locks granted without waiting.
0       The number of region locks granted after waiting.
0       The number of page allocations.
0       The number of hash buckets examined during allocations
0       The max number of hash buckets examined for an allocation
0       The number of pages examined during allocations
0       The max number of pages examined for an allocation


2. Finding matching files, and getting the answer as a list.

3. Finding Excel 5.0 add-ins

4. Easy Memory Upgrade Question

5. Acrobat Plug-ins SDK, pdf->text problem

6. Web server editing problem

7. DDK can not find macwin32.h (not sure why it wants it though)

8. MediaMail- Notify Audio

9. Why doesn't FIND find some things?

10. Acrobat document security and plug-ins in general

11. "Find" does not find?

12. "Find" not finding some words

13. Program Not Found, Windows cannot find Netscape.exe