SegmOnto: A Controlled Vocabulary to Describe the Layout of Pages
SegmOnto offers a controlled vocabulary to describe the content of book or manuscript pages, in order to homogenise the data required by layout analysers. The project has two objectives:
- Pool data to train stronger models on various layouts.
- Design a standardised pipeline for text extraction, from page scans to structured documents.
SegmOnto is conceived as a generalist description scheme covering written documents produced since the appearance of the codex, but it has been designed mainly using Western and Middle Eastern documents.
How to cite
Simon Gabay, Ariane Pinche, Kelly Christensen, Jean-Baptiste Camps, Nicola Carboni, SegmOnto, A Controlled Vocabulary to Describe the Layout of Pages, version 0.9, Genève/Lyon/Paris, 2023, https://segmonto.github.io/.