abstract
Combining Linguistic and Spatial Information for Document Analysis

Marco Aiello, Christof Monz, and Leon Todoran

In: J. Mariani and D. Harman (Eds.) Proceedings of RIAO'2000 Content-Based Multimedia Information Access, CID, 2000. pp. 266-275.

We present a framework for analyzing color documents with complex layouts that makes no assumptions about the layout. The framework follows a content-driven, bottom-up approach that combines two sources of information: textual and spatial. To analyze the text, we use shallow natural language processing tools such as taggers and partial parsers. To infer relations of the logical layout, we resort to a qualitative spatial calculus closely related to Allen's calculus. We evaluate the system on documents from a color journal and present the results of extracting the reading order from the journal's pages. In this case, our analysis is successful: it extracts the intended reading order from the document.


links
co-author(s): Marco Aiello, Leon Todoran
research project(s): Document Layout Analysis and Reading Order Detection


BibTeX entry
@InProceedings{aiel:00comb,
  author = 	 {Aiello, M. and Monz, C. and Todoran, L.},
  title = 	 {Combining Linguistic and Spatial Information 
                  for Document Analysis},
  booktitle = 	 {Proceedings of {RIAO'2000} Content-Based 
                  Multimedia Information Access},
  pages =	 {266--275},
  year =	 2000,
  editor =	 {Mariani, J. and Harman, D.}
}