My Algorithm
Hello,
I wrote an algorithm to split a text image into line images using the image histogram of the text image. I need your comments and suggestions to make it work better.
Here are the steps:
1. Read a text image.
http://nomatterwhat.jeeran.com/image.jpg
2. Get the image horizontal histogram.
http://nomatterwhat.jeeran.com/hhist.jpg
3. Store the image pixels into an array.
4. Extract one column from the array and check for blank areas between the lines by looking for zero values; get the centers of these blank areas and store them.
http://nomatterwhat.jeeran.com/extracted%20column.JPG
5. Use the stored values to split the image into sub-images.
6. Save sub-images.
In step (4):
From the extracted column, each sequence of zeros is marked, and the average position of each marked area is computed; that average is where the image will be split, as shown here:
http://nomatterwhat.jeeran.com/idea.JPG
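A minimal sketch of steps 4 and 5, assuming the image is held as a list of rows and that a value of 0 in the extracted column means blank. The function names `blank_run_centers` and `split_rows` are made up for illustration, and leading or trailing margins get no special handling here:

```python
def blank_run_centers(column):
    """Return the center index of each run of zeros in a column of values.

    Assumption: 0 means blank (background), non-zero means ink.
    """
    centers = []
    start = None  # index where the current run of zeros began
    for i, value in enumerate(column):
        if value == 0:
            if start is None:
                start = i
        elif start is not None:
            centers.append((start + i - 1) // 2)
            start = None
    if start is not None:  # the column ended inside a blank run
        centers.append((start + len(column) - 1) // 2)
    return centers


def split_rows(image, centers):
    """Cut a list-of-rows image into pieces at the given row indices."""
    pieces = []
    top = 0
    for c in centers:
        if c > top:  # skip cuts that would produce an empty piece
            pieces.append(image[top:c])
            top = c
    if top < len(image):
        pieces.append(image[top:])
    return pieces
```

The same two functions would work unchanged if `column` were the per-row ink counts from the horizontal histogram instead of a single pixel column.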
Choosing the correct column is the only problem left, and the most important step to be done. Any ideas?
[1106 byte] By [Salamaa] at [2007-10-2 17:07:08]

You wanted comments
You have chosen a single fairly simple model for carving up an image into lines of text. By selecting a single column for your histogram you are essentially choosing a single threshold, below which a line of pixels is considered to be blank and above which it is considered to be a dark line of text.
Whether or not this particular model is robust enough to do what you want depends both upon the nature of your source material and upon what you want to do with the results. Your question about how to optimally set the one threshold value in your model similarly depends on what you want to do. The proper way to do this is to quantify your pain (how bad is it for your system to miss a split point vs. how bad is it to introduce one that should not be there) and then look at sample data and choose the threshold that minimizes your pain. But all of this presupposes that you do in fact have the proper model and that you only need to tune the free parameters.
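To make "quantify your pain" concrete, here is a rough sketch. It assumes you have hand-labeled some sample rows as blank or not, and that a row is called blank when its ink count is at or below the threshold. The function name, the two cost parameters and the sample format are all invented for this illustration:

```python
def choose_threshold(samples, cost_missed_split, cost_false_split):
    """Pick the ink-count threshold with the lowest total cost.

    samples: non-empty list of (ink_count, is_blank) pairs labeled by hand.
    A row is classified as blank when ink_count <= threshold.
    Returns (threshold, total_cost); ties go to the lowest threshold.
    """
    candidates = sorted({ink for ink, _ in samples})
    # Also try a threshold below every sample, meaning "nothing is blank".
    candidates = [candidates[0] - 1] + candidates
    best_threshold, best_cost = None, None
    for t in candidates:
        cost = 0
        for ink, is_blank in samples:
            predicted_blank = ink <= t
            if is_blank and not predicted_blank:
                cost += cost_missed_split   # a real gap was not found
            elif predicted_blank and not is_blank:
                cost += cost_false_split    # a split was put through text
        if best_cost is None or cost < best_cost:
            best_threshold, best_cost = t, cost
    return best_threshold, best_cost
```

Changing the ratio of the two costs moves the chosen threshold, which is the whole point: the "best" value is only best relative to how much each kind of mistake hurts you.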
I would question you as to whether you are really at that point. Suppose the text image was not scanned in straight and was slightly tilted. In that case, looking for a river of white to divide the black text lines will not work. You would need to first detect and correct for small rotations. (And of course this is why I emphasized that your model depends upon what you want to do. If you are looking at scanned in text then slant correction is essential. If however your images are all screen images of computer generated text it is already perfectly aligned.)
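One common way to detect a small tilt, sketched here only as an illustration and not as what we actually used, is to try several candidate slopes, shear the image by each one, and keep the slope that gives the most sharply peaked row histogram. A shear only approximates a rotation, which is reasonable for small angles. The sketch assumes a non-empty image stored as a list of rows where non-zero means ink; `row_profile` and `estimate_slope` are made-up names:

```python
def row_profile(image, slope):
    """Count ink per row after shifting each column vertically by slope * x."""
    height = len(image)
    width = len(image[0])
    max_shift = int(abs(slope) * width) + 1  # padding so indices stay in range
    profile = [0] * (height + 2 * max_shift)
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel:
                profile[y - round(slope * x) + max_shift] += 1
    return profile


def estimate_slope(image, candidates):
    """Return the candidate slope whose sheared row profile is sharpest.

    Sharpness is the sum of squared row counts: it is largest when the ink
    is concentrated in few rows, i.e. when text lines are horizontal.
    """
    def sharpness(slope):
        return sum(count * count for count in row_profile(image, slope))
    return max(candidates, key=sharpness)
```

The slope is the tangent of the tilt angle, so once it is known the image can be sheared or rotated back before looking for rivers of white.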
At one point in my life we were working with images scanned in from books and had to deal with (detect and correct) something worse than rotation. There was local bending: the binding of the book made the text near the spine curve, even if the bulk of the page was properly oriented.
Your model makes no assumptions at all about the size of the splits (and hence the size of a line of text, or the size of the river that leads to a split). This is not necessarily a good decision. Suppose you see a bunch of essentially black lines (which you suppose to be text) followed by a few blank lines, followed by a single black line, followed by 2 blank lines followed by a bunch of black lines (again presumably text). What was the one single black line?
Was that a line of text exactly one pixel high? (I doubt it.) Was it a graphic line separator that just happened to be perfectly aligned on the scan? (I don't know, does your data have anything like that?) Was it a smudge of dirt that at its thickest point just happened to cross the threshold where you decided that it was a black line rather than blank? Or did you have a second line of text that just happened to be all lower case characters, a few of which happened to be lower case i's, and that pixel-high line is the dots on the i's?
In any situation with real data, all those possibilities, and more than are dreamt of in your philosophy, will occur.
What you do about it is to hack out your model, try it on some data, see if it is good enough (setting your free parameters, your thresholds, by hand). When you see problems that you can't cure with your thresholds you modify your model and make it a little more complicated and usually you introduce new thresholds that must now be adjusted. Eventually you will get a model that you feel has enough free parameters that in theory you could adjust it to do anything right, but that in practice you couldn't possibly collect enough data to do the statistical analysis to determine the proper values of the parameters. Such is life.
In general, you will be better off if you do not build decision-tree types of structures, where you look at some value, make a decision, and then branch to logic that depends on that particular case. The problem is that if you make a mistake in that decision you are off in the wrong branch and everything else you do is predicated on a false assumption.
Instead you would rather do something a little more fuzzy, where you award points for behavior that you believe to be good and penalize behaviors that you believe to be bad. So for example, a single pixel high black line is a bad model for a line of text. The thicker it gets the happier you are that you have a line of text, up to some point, and then as it gets even thicker you get less happy that you are looking at a line of text. Similarly, a bunch of lines of black, split by rivers of white, with all the splits being the same basic height, sure fits the model of a nice paragraph of uniform text. Good! A bunch of lines of very different heights doesn't smell much like a nice paragraph of uniform text. Bad! Do you need to model paragraphs for your application? I certainly don't know.
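As a rough sketch of that kind of scoring: the linear fall-off, the `ideal` and `tolerance` parameters and the way the two scores are multiplied are all arbitrary choices made for illustration, and each is one more free parameter you would have to tune.

```python
def height_score(height, ideal, tolerance):
    """Score in [0, 1]: 1.0 at the ideal line height, falling linearly to 0.

    Lines that are too thin or too thick both lose points.
    """
    return max(0.0, 1.0 - abs(height - ideal) / tolerance)


def score_split(line_heights, ideal, tolerance):
    """Say how happy we are with a proposed split, as a number in [0, 1].

    Combines how plausible each line height is on its own with how
    uniform the heights are relative to each other.
    """
    if not line_heights:
        return 0.0
    plausibility = sum(
        height_score(h, ideal, tolerance) for h in line_heights
    ) / len(line_heights)
    spread = max(line_heights) - min(line_heights)
    uniformity = max(0.0, 1.0 - spread / tolerance)
    return plausibility * uniformity
```

With something like this you can compare several candidate splits of the same image and keep the one that scores best, instead of committing to the first decision you make.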
What you are generally aiming for is something that does not just split up the image but something that both splits up the image and also tells you how happy it is to have split up the image that way.