 
BIF
Minister of Gerbil Affairs
Topic Author
Posts: 2458
Joined: Tue May 25, 2004 7:41 pm

How can I convert old scanned documents to OCR?

Sun May 22, 2016 2:08 pm

I have some 20+ year old documents (originally typed on a typewriter) that were copied with an old photocopier or low-DPI scanner (mid to late 1990s?) and eventually output as PDFs. Each PDF is basically "pictures" of the text, with quality that varies depending on the scanner used back then. I would like to use an OCR tool to read these PDF files and produce character-based output, such as an MS Publisher or Word document. I highly doubt that I'll have access to the original paper documents.

I've got about 250 pages to replicate, along with some graphic pages and signature pages which will just be screen-captured and saved as JPEGs.

I could retype the whole thing, and I'm actually prepared to do that. But if there is an OCR way that could ensure accurate verbatim reproduction, that would be my preference; even if it required a bit of post-conversion reformatting.

I've also thought about using one of the Dragon products to dictate the text verbally. That would help relieve the wear-and-tear on hands and fingers, but I don't know if ultimately that would result in more work rather than less.

I use Windows 10 on two computers. If it's a software solution, it needs to have licensing for at least two computers. My budget for this project is a couple hundred dollars, and my desired time-to-completion is leisurely; a couple months or more would be fine.

Suggestions welcome!
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 2:28 pm

You could give Tesseract a try. I've used the Linux version, but it appears that Windows is also supported.
Nostalgia isn't what it used to be.
 
ozymandias
Gerbil XP
Posts: 468
Joined: Mon Nov 22, 2004 9:50 am

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 2:57 pm

Did you try it with a current-gen OCR program? Mine (OmniPage) reads image PDFs without much trouble, since my document scanner outputs everything as PDF.
Quality can be an issue for certain pages (as can layout, which in my experience takes the most time to get right), but if you OCR it first, you can always dictate or type the missing pages.

Also note, speaking as a regular Dragon NaturallySpeaking user, that the program needs around 24 hours of training/dictating before it becomes usable enough to save time.
 
BIF
Minister of Gerbil Affairs
Topic Author
Posts: 2458
Joined: Tue May 25, 2004 7:41 pm

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 3:15 pm

Thanks JBI, but the no-info GitHub site makes me think this is a little bit "too free" for me. :P It's not even obvious how to download the software. :-?

Ozy; no I haven't tried other software...I came here first. :)
 
Chrispy_
Maximum Gerbil
Posts: 4670
Joined: Fri Apr 09, 2004 3:49 pm
Location: Europe, most frequently London.

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 3:20 pm

If you have Acrobat (Standard or Pro) that'll do it for you.

Acrobat's pretty good at handling fax-quality text for OCR.
Congratulations, you've noticed that this year's signature is based on outdated internet memes; CLICK HERE NOW to experience this unforgettable phenomenon. This sentence is just filler and as irrelevant as my signature.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 3:34 pm

BIF wrote:
Thanks JBI, but the no-info GitHub site makes me think this is a little bit "too free" for me. :P It's not even obvious how to download the software. :-?

Heh... yeah, I guess I'm just used to the Linux "read the man page, then Google it if you're still confused" drill. Not to mention spoiled by the fact that the majority of Open Source packages worth installing (and quite a few that aren't too... :lol:) are in the Ubuntu repositories already, so installation in this case is as simple as "sudo apt-get install tesseract-ocr"... no need to hunt down or manually download installation packages.

Oh well.
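
For anyone who would rather drive it from a script than the shell, here is a minimal sketch of running Tesseract on a single page image once the package is installed. It assumes the `tesseract` binary is on the PATH and that the input is an image file (Tesseract does not read PDFs directly). The function names and the `page-001.png` style of file name are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="eng"):
    """Build the argument list for one Tesseract run.

    Tesseract writes its result to <output_base>.txt.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def ocr_page(image, lang="eng"):
    """OCR one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on the PATH")
    output_base = Path(image).with_suffix("")  # page-001.png -> page-001
    subprocess.run(tesseract_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_page("page-001.png")` would leave `page-001.txt` next to the image and return its contents as a string.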
 
chuckula
Minister of Gerbil Affairs
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 3:45 pm

I once had to OCR around 8,000 documents. A pipeline of ImageMagick for a little pre-processing, coupled with Tesseract and GNU parallel (since Tesseract was single-threaded at the time, and might still be), did a pretty good job on documents that actually had legibly printed text. It was pretty useless on anything handwritten, though, and large images would cause it to churn for a long time while nothing actually happened.
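
A rough sketch of what a pipeline of that shape could look like, not the actual scripts used. It swaps GNU parallel for a Python thread pool, and it assumes ImageMagick's `convert` and the `tesseract` binary are both on the PATH. The particular clean-up options (grayscale, despeckle, deskew) and all function names are illustrative guesses, not the original settings.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def preprocess_command(src, dst):
    """ImageMagick step: grayscale, remove speckles, straighten the page."""
    return ["convert", str(src), "-colorspace", "Gray",
            "-despeckle", "-deskew", "40%", str(dst)]


def ocr_command(image, output_base):
    """Tesseract step: writes <output_base>.txt."""
    return ["tesseract", str(image), str(output_base)]


def process_page(src, work_dir):
    """Clean up one scan, OCR it, and return the path of the text file."""
    work_dir = Path(work_dir)
    cleaned = work_dir / (Path(src).stem + "-clean.tif")
    output_base = work_dir / Path(src).stem
    subprocess.run(preprocess_command(src, cleaned), check=True)
    subprocess.run(ocr_command(cleaned, output_base), check=True)
    return Path(str(output_base) + ".txt")


def process_all(pages, work_dir, workers=4):
    """OCR several pages at once; each thread just waits on external processes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: process_page(p, work_dir), pages))
```

Threads are enough here because the heavy lifting happens in the external programs, one process per page. On Windows the name `convert` can collide with a built-in system utility, and ImageMagick 7 installs a `magick` command instead, so the first element of `preprocess_command` may need adjusting.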
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
BIF
Minister of Gerbil Affairs
Topic Author
Posts: 2458
Joined: Tue May 25, 2004 7:41 pm

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 4:09 pm

Chuckula, was your solution done on Windows or Linux? I'm on Windows 10.

To all: OmniPage 18 by Nuance (the same people who make Dragon) looks like it might do what I need, and at $80 it's under my budget. If it works well, it'll pay for itself many times over at my hourly rate (how long would it take me to type 250 pages?). That's even better for me since this is a voluntary effort.

Has anybody got experience with Omnipage?
 
chuckula
Minister of Gerbil Affairs
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 5:40 pm

It was in Linux. I can't say that I've used Omnipage or other commercial OCR products beyond some occasional OCR in Acrobat Pro.
 
DancinJack
Maximum Gerbil
Posts: 4494
Joined: Sat Nov 25, 2006 3:21 pm
Location: Kansas

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 6:40 pm

I've used some serious OCR programs in the past (document processing for litigation), and to be honest, if it is not mission critical I'd go with Acrobat Pro. It's 25 bucks for one month, and you get all the other features of Acrobat Pro if you wanna take advantage of them while you've got the sub. Acrobat OCR "just works" and you don't have to fiddle around with other crap. I know Adobe Reader and its iterations have been crap (IMO) for a long while now, but Acrobat Pro is and was pretty damn good.

https://acrobat.adobe.com/us/en/acrobat/pricing.html

edit: Oh, and the beefier the CPU+RAM combo you can give the OCR process, the better.
i7 6700K - Z170 - 16GiB DDR4 - GTX 1080 - 512GB SSD - 256GB SSD - 500GB SSD - 3TB HDD- 27" IPS G-sync - Win10 Pro x64 - Ubuntu/Mint x64 :: 2015 13" rMBP Sierra :: Canon EOS 80D/Sony RX100
 
blitzy
Gerbil Jedi
Posts: 1844
Joined: Thu Jan 01, 2004 6:27 pm
Location: New Zealand

Re: How can I convert old scanned documents to OCR?

Sun May 22, 2016 7:31 pm

Nuance OmniPage (payware) is pretty decent for recognition, or at least it was when I used it previously. I haven't used it recently.

Also there might be some mobile apps for tablets / phones that would work well for this. It seems odd, but the reasoning is that OCR libraries are included as part of the platform tools for iOS / Android / Windows, and they are often quite good at OCR and ICR (handwriting). Since those platforms take a % of each sale via their storefronts (the Apple store, Play, etc.), that's how they get the royalties to include a good-quality free OCR solution in the platform.

On desktop, by contrast, there are things like Tesseract, which IMO is not really as accessible.

Acrobat Pro will also do it, although I suspect Nuance might be more accurate. I haven't compared them in a long time, so things could have changed.
 
BIF
Minister of Gerbil Affairs
Topic Author
Posts: 2458
Joined: Tue May 25, 2004 7:41 pm

Re: How can I convert old scanned documents to OCR?

Mon May 23, 2016 1:23 am

Trying out OmniPage, which I lucked out on at $59 for the entry-level(?) version. I have an aversion to anything from Adobe.

In the first document the photocopy quality was very poor, so there were a lot of hits requiring correction and just as many hits not requiring correction but needing visual review and confirmation. Took several hours to go through it in my first pass using the proofreading function, which also constituted a training session. The software flagged a lot of stuff that would have otherwise been printed as a swearword or racial epithet. Needless to say, I'll have to spend LOTS of time proofreading and re-proofreading to keep the output from having embarrassing results.

Certain pages will have to just be carried forward without conversion, because the OCR engine wants to translate things that should remain image-based. Signatures, crests, logos, and notary imprints, for example. And some of the speckles on the pages have been translated into characters that really aren't there. Sort of like the software sees commas, semicolons, periods, and hyphens in poor photocopies just like we see puppy dogs and snowmen in a sky of puffy clouds. The software did quite well considering the poor quality of the copies.
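
One way to speed up the hunt for those phantom punctuation marks, offered as a sketch rather than anything OmniPage itself does: export the recognized text and flag lines that consist mostly of bare punctuation, so they can be checked against the scan. The character set, the 0.6 threshold, and the function names are all assumptions chosen for illustration.

```python
import re

# Tokens made up only of the small marks that speckles tend to be read as.
SPECKLE_TOKEN = re.compile(r"^[.,;:'`\-]+$")


def is_speckle_line(line, threshold=0.6):
    """Return True if most whitespace-separated tokens are bare punctuation."""
    tokens = line.split()
    if not tokens:
        return False
    noise = sum(1 for t in tokens if SPECKLE_TOKEN.match(t))
    return noise / len(tokens) >= threshold


def flag_speckle_lines(text):
    """Return (line_number, line) pairs that deserve a closer look."""
    return [(n, line) for n, line in enumerate(text.splitlines(), start=1)
            if is_speckle_line(line)]
```

The sketch only flags lines and never deletes them, since a legitimate row of leader dots or a dash on its own line would look the same to this heuristic.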

I've already put nearly an entire workday into this project, and probably have another 20-30 hours to go, but I can see now that even with the imperfect source material, this software is going to save me a lot of labor hours; maybe three weeks or more's worth of time. I also see now that using Dragon to dictate this project would not have been a good use of my time. I have made the better choice, and now I just need to settle in and work it to conclusion.

Thanks!
 
alrey
Gerbil
Posts: 29
Joined: Fri Mar 11, 2011 3:45 am

Re: How can I convert old scanned documents to OCR?

Mon May 23, 2016 5:55 am

Try uploading a few pages here
http://www.onlineocr.net/
so that you can preview the result. In my experience, the site offers some of the best OCR around.
 
meerkt
Gerbil Jedi
Posts: 1754
Joined: Sun Aug 25, 2013 2:55 am

Re: How can I convert old scanned documents to OCR?

Mon May 23, 2016 8:32 am

What are the options if you want to add an invisible OCRed text layer to a PDF, one that retains the coordinates of the image-based text?
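
One option that fits the tools already mentioned in this thread is Tesseract's PDF output mode, which writes a PDF containing the page image with the recognized text placed invisibly on top of it. A minimal sketch follows, assuming a Tesseract build recent enough to include the `pdf` output config and that each page has already been extracted from the original PDF as an image. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def searchable_pdf_command(image, output_base, lang="eng"):
    """Build a Tesseract call that writes <output_base>.pdf.

    The trailing "pdf" selects Tesseract's PDF output config, which places
    the recognized text invisibly over the page image.
    """
    return ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]


def make_searchable_pdf(image, lang="eng"):
    """Create a one-page searchable PDF next to the input image."""
    output_base = Path(image).with_suffix("")
    subprocess.run(searchable_pdf_command(image, output_base, lang), check=True)
    return Path(str(output_base) + ".pdf")
```

This produces one PDF per page image, so a multi-page document would still need its pages merged afterwards with a separate PDF tool.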
