Vocabulary

SegmOnto is not the first attempt to offer a description of a page. Because of the nature of the task, two different academic traditions have offered vocabularies that serve as a base:

  • Computer vision proposes a limited system adapted to the needs of computer scientists (cf. PAGE XML).
  • Codicology, whose very purpose is the description of old documents, offers a much more precise vocabulary that answers the needs of philologists (cf. Codicologia).

On the one hand, codicological vocabularies do not take into account the requirements of computer-based solutions (such as a system of private zones or the articulation of types and subtypes) and usually do not address contemporary cases. On the other hand, the solution offered by computer scientists is too limited to offer a satisfactory description of a page taken from a historical document. For these reasons, SegmOnto offers a middle way between the two.

Page

PAGE XML (cf. infra) offers a simple, short and general-purpose schema to describe the page, with only fourteen categories (link):

  • TextRegion type. Pure text is represented as a text region. This includes drop capitals, but practically ornate text may be considered as a graphic.
  • ImageRegion type. An image is considered to be more intricate and complex than a graphic. These can be photos or drawings.
  • LineDrawingRegion type. A line drawing is a single colour illustration without solid areas.
  • GraphicRegion type. Regions containing simple graphics, such as a company logo, should be marked as graphic regions.
  • TableRegion type. Tabular data in any form is represented with a table region. Rows and columns may or may not have separator lines; these lines are not separator regions.
  • ChartRegion type. Regions containing charts or graphs of any type should be marked as chart regions.
  • SeparatorRegion type. Separators are lines that lie between columns and paragraphs and can be used to logically separate different articles from each other.
  • MathsRegion type. Regions containing equations and mathematical symbols should be marked as maths regions.
  • ChemRegion type. Regions containing chemical formulas.
  • MusicRegion type. Regions containing musical notations.
  • AdvertRegion type. Regions containing advertisements.
  • NoiseRegion type. Noise regions are regions where no real data lies, only false data created by artifacts on the document or scanner noise.
  • UnknownRegion type. To be used if the region type cannot be ascertained.
  • CustomRegion type. Regions containing content that is not covered by the default types (text, graphic, image, line drawing, chart, table, separator, maths, map, music, chem, advert, noise, unknown).
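
As a minimal sketch, the fourteen categories listed above can be written down as a Python enumeration. The class name `PageRegionType` and the member names are assumptions made for this illustration. Only the string values come from the list above, and nothing here is part of an official PAGE XML library.

```python
from enum import Enum


class PageRegionType(Enum):
    """The fourteen PAGE XML region categories listed above.

    The class and member names are hypothetical; only the string
    values are taken from the list.
    """

    TEXT = "TextRegion"
    IMAGE = "ImageRegion"
    LINE_DRAWING = "LineDrawingRegion"
    GRAPHIC = "GraphicRegion"
    TABLE = "TableRegion"
    CHART = "ChartRegion"
    SEPARATOR = "SeparatorRegion"
    MATHS = "MathsRegion"
    CHEM = "ChemRegion"
    MUSIC = "MusicRegion"
    ADVERT = "AdvertRegion"
    NOISE = "NoiseRegion"
    UNKNOWN = "UnknownRegion"
    CUSTOM = "CustomRegion"


def is_known_region(name: str) -> bool:
    """Return True if `name` is one of the fourteen categories above."""
    return name in {member.value for member in PageRegionType}
```

None of these categories distinguishes, for example, a running title from a marginal note, which illustrates why such a scheme is too general for historical documents.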

Codicologia

Other vocabularies, designed by philologists specialised in codicology, offer an interesting alternative to those designed by computer scientists. Under the supervision of the IRHT in Paris, several glossaries have been gathered in an online application called Codicologia. It offers an extensive vocabulary in French, most of the time with English, German, Italian, Spanish and Arabic translations, with precise definitions adapted to the needs of academics specialised in the humanities. Thanks to Georg Vogeler, a digital version is also available as a SKOS model, perfectly suited for digital purposes.

Towards a middle way

Because PAGE offers too general a description, it has been decided not to expand it but rather to reduce the Codicologia glossary as far as possible. The reduction process follows a simple criterion: unlike other vocabularies (such as the TEI), SegmOnto focuses on form rather than content or meaning. For instance, it uses GraphicZone without distinguishing between an illustration (which carries a semantic load) and an ornamentation (which is purely decorative).

Because of this reduction process, elements and definitions may differ:

  • GraphicZone does not exist in Codicologia: it aggregates Illustration and Ornamentation of Codicologia. On top of these two, it aggregates a third type: figures (for schemas found in scientific works, for instance).
  • QuireMarksZone includes the Signature (quire numbers) and the Catchword, which are two different elements in Codicologia.
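
The two aggregations described above can be sketched as a simple lookup table. This is only an illustration, not an official SegmOnto resource. The names `CODICOLOGIA_TO_SEGMONTO` and `to_segmonto` are hypothetical, and the Codicologia terms are written with the English labels used in this section rather than with identifiers from the actual glossary.

```python
from typing import Optional

# Hypothetical lookup table: Codicologia terms (English labels as used in
# this section) mapped to the SegmOnto zone that aggregates them.
CODICOLOGIA_TO_SEGMONTO = {
    "Illustration": "GraphicZone",
    "Ornamentation": "GraphicZone",
    "Signature": "QuireMarksZone",
    "Catchword": "QuireMarksZone",
}

# GraphicZone also covers figures (schemas in scientific works). The text
# does not present this third type as a Codicologia term, so it is not
# listed in the table above.


def to_segmonto(term: str) -> Optional[str]:
    """Return the SegmOnto zone for a Codicologia term, or None if the
    term is not covered by this partial table."""
    return CODICOLOGIA_TO_SEGMONTO.get(term)
```

Because the mapping is many-to-one, the original Codicologia distinction cannot be recovered from the SegmOnto zone alone, which is consistent with the focus on form rather than meaning.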