After OCR -> Put the text into the correct reading order

I'm working on a program that has to read PDFs to extract some data.

So far I've extracted the text objects with their size, font, text and position (x1,y1,x2,y2).

After that, the program recognizes text blocks and computes the position of each block.
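
One way to picture the block-position step is to take a block's position as the smallest rectangle enclosing all of its text objects. The sketch below assumes that interpretation; the `Box` class and its `union` method are hypothetical names, not part of the program described here, and it assumes x1 <= x2 and y1 <= y2 for every rectangle.

```java
import java.util.List;

// Hypothetical rectangle type; assumes x1 <= x2 and y1 <= y2.
final class Box {
    final double x1, y1, x2, y2;

    Box(double x1, double y1, double x2, double y2) {
        this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
    }

    // Smallest rectangle enclosing every box in the (non-empty) list.
    static Box union(List<Box> parts) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("a block needs at least one text object");
        }
        Box first = parts.get(0);
        double x1 = first.x1, y1 = first.y1, x2 = first.x2, y2 = first.y2;
        for (Box b : parts) {
            x1 = Math.min(x1, b.x1);
            y1 = Math.min(y1, b.y1);
            x2 = Math.max(x2, b.x2);
            y2 = Math.max(y2, b.y2);
        }
        return new Box(x1, y1, x2, y2);
    }
}
```

Calling `Box.union` on the rectangles of the lines that make up one block would give that block's (x1,y1,x2,y2).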

The document can have 1, 2 or 3 columns, and I need to put the text into the correct reading order.

Let me try to explain with an example:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

&&&& &&&& && &&& &&&& &&&&& &&&& &&&& &&&&&&&

&&&&& &&&&& &&&& &&&&&& &&&& &&&&& &&&&&&&

&&&&&&& && &&&&&&& &&& &&&&& &&&&& &&&&&& &&&&

***********

#### ### ###### ######   AAAA AAAAA AAAA AA

###### #### #### ###

### ### ### # ### ## #   AA AAAA AAAA AAAA A

Zone   x1    y1    x2    y2
%      10    10    600   20
@      10    30    600   40
&      10    50    600   80
*      240   90    270   100
#      10    260   240   140
A      250   260   600   140

How can I write an algorithm that sorts these blocks into the correct reading order?

Can anyone guide me?

Thanks in advance,

McRunner

[1548 byte] By [mcrunnera] at [2007-9-29 10:57:37]
# 1

In English we read from top to bottom and from left to right.

So you would first need to sort your text blocks by their vertical position. Then, for any blocks that have a similar vertical position, you would need to sort them by their horizontal position.

Then, to obtain the correct reading order, you would iterate through the blocks from the top of the document to the bottom. Wherever there are two or more blocks with the same vertical position, you would iterate through them from left to right. You would then continue down the document.
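
The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: `TextBlock`, `ReadingOrderSketch` and the `tolerance` parameter are hypothetical names, "similar vertical position" is taken to mean that a block's top edge (y1) lies within `tolerance` units of the first block in the current row, and y is assumed to grow downwards. Only the top-left corner of each block is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical block type: only the top-left corner matters for this sketch.
final class TextBlock {
    final String label;
    final double x1, y1;

    TextBlock(String label, double x1, double y1) {
        this.label = label; this.x1 = x1; this.y1 = y1;
    }
}

final class ReadingOrderSketch {
    // Returns the blocks top to bottom; blocks whose y1 lies within
    // `tolerance` of the first block of a row are read left to right.
    static List<TextBlock> sort(List<TextBlock> blocks, double tolerance) {
        List<TextBlock> byTop = new ArrayList<>(blocks);
        byTop.sort(Comparator.comparingDouble((TextBlock b) -> b.y1));

        List<TextBlock> ordered = new ArrayList<>();
        List<TextBlock> row = new ArrayList<>();
        for (TextBlock b : byTop) {
            // A block that starts clearly lower than the current row begins a new row.
            if (!row.isEmpty() && b.y1 - row.get(0).y1 > tolerance) {
                flush(row, ordered);
            }
            row.add(b);
        }
        flush(row, ordered);
        return ordered;
    }

    // Appends the finished row, left to right, and clears it.
    private static void flush(List<TextBlock> row, List<TextBlock> out) {
        row.sort(Comparator.comparingDouble((TextBlock b) -> b.x1));
        out.addAll(row);
        row.clear();
    }

    public static void main(String[] args) {
        // Top-left corners taken from the zone table in the question, shuffled.
        List<TextBlock> blocks = Arrays.asList(
            new TextBlock("A", 250, 260), new TextBlock("*", 240, 90),
            new TextBlock("#", 10, 260), new TextBlock("&", 10, 50),
            new TextBlock("@", 10, 30), new TextBlock("%", 10, 10));
        StringBuilder sb = new StringBuilder();
        for (TextBlock b : sort(blocks, 5.0)) {
            sb.append(b.label).append(' ');
        }
        System.out.println(sb.toString().trim()); // prints: % @ & * # A
    }
}
```

Grouping into rows first, rather than using a single comparator with a tolerance, keeps the ordering consistent, since "within tolerance" is not a transitive relation. This works for the example because each column is a single block; if a column were split into several blocks at the same heights as its neighbour, a row-by-row pass like this would interleave the columns.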

matfud

matfuda at 2007-7-15 0:22:08