OCR letter/edge detection...

Hi, sorry if this seems a little odd for this forum but here goes:

I would like to write a piece of OCR software for myself ;) but currently I know very little about the subject. My question is, what mathematical formulas or algorithms are used (or how would you do it) to detect the individual letters within the image as a whole? Let's say I pass in an image containing the word "help." What formula is used to detect the four individual letters 'h' 'e' 'l' 'p' so that the image can be subdivided and each letter passed over for identification?

If anybody has any ideas, or pointers for articles, that would be great.

Thanks, Ron

[655 byte] By [cakea] at [2007-11-27 2:13:45]
# 1

http://en.wikipedia.org/wiki/Edge_detection

http://www.javaocr.com/

http://asprise.com/product/ocr/index.php?lang=java

Why not just use a solution that is already written, such as the two lower links above? (Both are written in Java.)

The top link is an extremely brief overview of edge detection.

kmangolda at 2007-7-12 2:09:23 > top of Java-index,Other Topics,Algorithms...
# 2

Well, and please correct me if I am wrong, the asprise one (I know for sure) is a 30 day trial and I want to use it in perpetuity, and I do not have or wish to part with 1900 bucks for a full version. And from reading, neither piece of software will work for all the characters or scenarios I envisage encountering - and besides, I am a glutton for punishment and thought I would have a go at writing my own!

Many thanks Ron :D

cakea at 2007-7-12 2:09:23
# 3
What type of characters do you plan on detecting? CAPTCHA letters?
kmangolda at 2007-7-12 2:09:23
# 4
No, nothing like that.
cakea at 2007-7-12 2:09:23
# 5

No, this is simply an experiment and I was wondering what algorithms are used to identify letters in image files. I found an example OCR which uses a neural net, something else I am interested in writing, and it works fine for single characters. But what if you input an image which contains the words 'hello world'? How do you determine that there is a total of 10 letters and one space in that string?

Thanks, Ron

cakea at 2007-7-12 2:09:23
# 6

A classical and simple OCR algorithm is to do the following:

Detect connected clumps of black ink. This is done with a flood-fill type algorithm: if a black pixel is adjacent to another black pixel, then those two pixels are in the same connected clump. Each connected clump infects any neighbors.
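
As a rough sketch of that flood fill (not taken from any particular OCR package), the Java below assumes the image has already been reduced to a boolean[][] where true means ink, and treats pixels as adjacent only when they touch horizontally or vertically. The names ClumpFinder and findClumps are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class ClumpFinder {
    /**
     * Returns one bounding box {minX, minY, maxX, maxY} per connected clump.
     * Assumption: ink[y][x] == true means an ink pixel; 4-connectivity is used.
     */
    static List<int[]> findClumps(boolean[][] ink) {
        int h = ink.length;
        int w = (h == 0) ? 0 : ink[0].length;
        boolean[][] seen = new boolean[h][w];
        List<int[]> boxes = new ArrayList<>();
        int[] dx = {1, -1, 0, 0};
        int[] dy = {0, 0, 1, -1};

        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!ink[y][x] || seen[y][x]) {
                    continue;
                }
                // New clump: flood outwards from this seed pixel.
                int[] box = {x, y, x, y};
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[] {x, y});
                seen[y][x] = true;
                while (!stack.isEmpty()) {
                    int[] p = stack.pop();
                    box[0] = Math.min(box[0], p[0]);
                    box[1] = Math.min(box[1], p[1]);
                    box[2] = Math.max(box[2], p[0]);
                    box[3] = Math.max(box[3], p[1]);
                    // The clump "infects" each unvisited ink neighbor.
                    for (int d = 0; d < 4; d++) {
                        int nx = p[0] + dx[d];
                        int ny = p[1] + dy[d];
                        if (nx >= 0 && nx < w && ny >= 0 && ny < h
                                && ink[ny][nx] && !seen[ny][nx]) {
                            seen[ny][nx] = true;
                            stack.push(new int[] {nx, ny});
                        }
                    }
                }
                boxes.add(box);
            }
        }
        return boxes;
    }
}
```

Note that with this sketch the dot of an 'i' comes out as its own clump, so some grouping step would still be needed before classification.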

Once you have a connected clump you do a comparison with a dictionary of letter shapes. This comparison is often done with the Hamming distance: you XOR the source image with the one in the dictionary and then count the number of bits left. If the images are identical the count is zero; the greater the count, the greater the number of pixels that did not match and the more likely it is that the characters did not match.
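
A minimal sketch of that XOR-and-count step, assuming both bitmaps have already been scaled to the same width and height (the description above does not say how that is done). GlyphDistance, pack and hamming are illustrative names; BitSet.xor and BitSet.cardinality are standard java.util calls.

```java
import java.util.BitSet;

class GlyphDistance {
    /** Packs a bitmap (true = ink) row by row into a BitSet. */
    static BitSet pack(boolean[][] glyph) {
        int h = glyph.length;
        int w = (h == 0) ? 0 : glyph[0].length;
        BitSet bits = new BitSet(w * h);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (glyph[y][x]) {
                    bits.set(y * w + x);
                }
            }
        }
        return bits;
    }

    /**
     * Hamming distance: XOR the two bitmaps and count the bits left over.
     * Assumption: both BitSets were packed from bitmaps of identical size.
     */
    static int hamming(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone(); // copy so the inputs are not modified
        diff.xor(b);
        return diff.cardinality();
    }
}
```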

You do a nearest neighbor match to your dictionary to ID your characters.
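
The nearest neighbor match could look something like the sketch below. It assumes the dictionary is a Map from character to a template bitmap packed into a BitSet, with every template the same size as the clump being identified; NearestGlyph and classify are made-up names, and returning '?' for an empty dictionary is just a convention chosen here.

```java
import java.util.BitSet;
import java.util.Map;

class NearestGlyph {
    /** XOR the two bitmaps and count the differing bits. */
    static int hamming(BitSet a, BitSet b) {
        BitSet diff = (BitSet) a.clone();
        diff.xor(b);
        return diff.cardinality();
    }

    /**
     * Returns the dictionary character whose template differs from the
     * clump in the fewest pixels, or '?' if the dictionary is empty.
     */
    static char classify(BitSet clump, Map<Character, BitSet> dictionary) {
        char best = '?';
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<Character, BitSet> entry : dictionary.entrySet()) {
            int distance = hamming(clump, entry.getValue());
            if (distance < bestDistance) {
                bestDistance = distance;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

In practice you would probably also look at the winning distance itself and reject the match when it is too large, since a clump that matches nothing well still has a "nearest" template.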

So - that is the good news. The algorithms are simple, robust. Tastes great - less filling!!

Then when you find that this system does not work as well as you want, you start fixing the problems.

1) Was there ink bleed in the document, so that you got huge bunches of characters all connected together? Or alternatively, was the copy so light that things that should have been connected became disconnected?

These are known as segmentation errors. You divided the entire image into segments, the connected regions, and you made mistakes in that process that make things impossible to unravel later on.

2) Was there a font that you have not yet seen and loaded and coded into your dictionary?

These kinds of problems are known as classification errors.

3) Were there lines on the drawing (perhaps a background image) that cut across the image, screwing up your segmentation? Or was the image perhaps scanned from a book, and due to the fold at the binding the end of each line bent up out of sight, with the characters being slightly skewed, rotated and out of focus? Or perhaps you had to read handwritten addresses done in crayon from the front of an envelope, like the Post Office?

These kinds of problems are modeling problems, your images did not match the model you thought you were using.

And of course, there may be noise problems, speckles white or black, coffee stains on the image etc.

The good news is that if your problems are simple, your code can be simple. If your problems are hard the code can be hard.

Once you have solutions for hundreds of slightly different modeling problems, classification problems, and segmentation problems, and furthermore you have built code that will try to detect when you are having problem number 19 a lot so maybe you should up the independent parameter on your little problem 19 solver, you start thinking that you have put a whole lot of work into your OCR system and that you ought to charge people a LOT of money for it because it would be a whole lot of work for them to replicate it all.

It is easy to start on OCR code, it is easy to make progress, there is always another problem to solve, and it is easy to stop when you aren't having fun any more.

If you value your time at anything more than about a nickel an hour, a thousand dollar OCR package is quite a bargain.

On the other hand, you pay to go to college, you pay to learn, and you pay to play games. The compiler is FREE, your spare time is your own, so whip out yer compiler and start having some FUN!

marlin314a at 2007-7-12 2:09:23
# 7

Marlin, I am not sure whether to say sorry, or laugh!

You're right, I have made some progress but I have now hit a brick wall, but I do feel better for the process.

My issue now is that the .png files that I am reading in only (usually) contain one word, but that word is so low res that the edging code I am using isn't really doing it! I need to find another way to subdivide the image into each individual letter. Also, the text is usually yellow on a black background or yellow on a green background, which is not helping any.
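
One possible way around edge detection for this particular case, offered only as a sketch: yellow has a high red component while both black and green have a low one, so thresholding the red channel alone should separate the text from either background. The word can then be cut wherever a column contains no text pixels (a column projection, which is a different technique from the edge code mentioned above). WordSplitter, binarize, splitColumns and the threshold of 128 are all assumptions made for illustration, and blank-column splitting will not separate letters that touch or blur into each other at low resolution.

```java
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.List;

class WordSplitter {
    /**
     * Marks a pixel as text when its red channel is at least redThreshold.
     * Assumption: text is yellow, background is black or green.
     */
    static boolean[][] binarize(BufferedImage img, int redThreshold) {
        int w = img.getWidth();
        int h = img.getHeight();
        boolean[][] ink = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int red = (img.getRGB(x, y) >> 16) & 0xFF; // getRGB returns ARGB
                ink[y][x] = red >= redThreshold;
            }
        }
        return ink;
    }

    /** Returns {startColumn, endColumn} (inclusive) for each run of columns that contain text. */
    static List<int[]> splitColumns(boolean[][] ink) {
        int h = ink.length;
        int w = (h == 0) ? 0 : ink[0].length;
        List<int[]> runs = new ArrayList<>();
        int start = -1; // -1 means we are currently in a gap
        for (int x = 0; x < w; x++) {
            boolean hasInk = false;
            for (int y = 0; y < h && !hasInk; y++) {
                hasInk = ink[y][x];
            }
            if (hasInk && start < 0) {
                start = x;                          // a letter begins
            } else if (!hasInk && start >= 0) {
                runs.add(new int[] {start, x - 1}); // a letter ends
                start = -1;
            }
        }
        if (start >= 0) {
            runs.add(new int[] {start, w - 1});     // letter touching the right edge
        }
        return runs;
    }
}
```

The width of the gaps between runs could also be compared to tell an ordinary letter gap from a space between words.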

Ron

cakea at 2007-7-12 2:09:23
# 8
This one's fun to play with! http://homepages.inf.ed.ac.uk/rbf/HIPR2/hough.htm
ArneWeisea at 2007-7-12 2:09:23