 
 
 

METS / ALTO general information

 
 


 

Introduction

Digitization projects rely on standard XML schemas. The Library of Congress has hosted METS (Metadata Encoding and Transmission Standard) for some time. METS can describe the structure of a wide variety of objects, such as audio and video files as well as printed material. ALTO (Analyzed Layout and Text Object) is an extension schema to METS that describes the layout and content of, for example, individual pages.



History

More than any other standard, METS makes it possible to represent complex structures. For this reason, the METAe project group chose METS for its challenging task of digitizing historic books and journals (1850-1920).

While METS is well suited to describing the structure of objects, a schema for the content and layout information of each part of an object was missing. The METAe project group therefore introduced the ALTO schema, which holds not only all the text of a page but also the coordinates of every word, paragraph, text block, and illustration on that page. During the METAe project, ALTO became a valuable extension schema for METS, at least for printed materials.



METS/ALTO XML Objects in Real Life

CCS developed docWORKS/METAe as content conversion software. Scanned images are processed (pre-processing, layout analysis, OCR, structure analysis) and exported as standard XML objects based on the METS/ALTO XML schemas. From the rich METS/ALTO XML object, derivatives (PDF, METS/TEI, METS/TXT) can easily be built using XSL style sheets. docWORKS is in use at:

Harvard University Library
Library of Congress
Royal Danish Library
University of Texas at Austin



ALTO in NDNP


For the NDNP (National Digital Newspaper Program), the Library of Congress was looking for a METS extension schema describing the layout and content of printed pages. ALTO was a perfect fit, as it had been proven in the digitization of books and journals over the previous five years. In response to NDNP-related requests, the ALTO schema was extended to cover all of the program's needs.

ALTO 1.1 was recently released and published by the Library of Congress in the technical requirements for NDNP.



ALTO Description

ALTO is a standardized XML format that stores the layout information and OCR-recognized text of pages from any kind of printed document, such as books, journals, and newspapers. It is designed to be used as an extension schema to METS (Metadata Encoding and Transmission Standard): METS provides the metadata and structural information, while ALTO contains the content and physical layout information.

Each ALTO file contains a style section that lists the different styles (for paragraphs and fonts). The layout section describes what is on the page. A page is divided into several regions (print space, left margin, right margin, top margin, and bottom margin). For each region, all objects detected inside it are listed.
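The style section and the page regions described above can be explored with a few lines of Python. This is a minimal sketch, not a reference implementation: the XML fragment is hand-written for illustration, it omits the XML namespace and most attributes that real ALTO files carry, and the helper names `summarize` and `page_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified fragment for illustration only. Real ALTO files
# declare an XML namespace and contain many more attributes and elements.
SAMPLE = """<alto>
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times" FONTSIZE="10"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1" HEIGHT="2970" WIDTH="2100">
      <TopMargin ID="TM1"/>
      <LeftMargin ID="LM1"/>
      <RightMargin ID="RM1"/>
      <BottomMargin ID="BM1"/>
      <PrintSpace ID="PS1">
        <TextBlock ID="TB1" HPOS="200" VPOS="300" WIDTH="1700" HEIGHT="120">
          <TextLine>
            <String CONTENT="Hello" HPOS="200" VPOS="300" WIDTH="400" HEIGHT="100"/>
            <SP/>
            <String CONTENT="world" HPOS="650" VPOS="300" WIDTH="420" HEIGHT="100"/>
          </TextLine>
        </TextBlock>
        <Illustration ID="IL1" HPOS="200" VPOS="500" WIDTH="800" HEIGHT="600"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def summarize(alto_xml):
    """Return the style IDs and, per page region, the objects found inside it."""
    root = ET.fromstring(alto_xml)
    styles = [style.get("ID") for style in root.find("Styles")]
    regions = {}
    for region in root.find("Layout/Page"):
        regions[region.tag] = [obj.tag for obj in region]
    return styles, regions


def page_text(alto_xml):
    """Return the recognized text, one entry per text line."""
    root = ET.fromstring(alto_xml)
    return [
        " ".join(s.get("CONTENT") for s in line.iter("String"))
        for line in root.iter("TextLine")
    ]


styles, regions = summarize(SAMPLE)
lines = page_text(SAMPLE)
```

With a namespaced production file, the tag lookups would need the namespace prefix; the sketch leaves that out to keep the structure visible.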

Measurements in ALTO XML files are given in units of 1/10 mm or 1/1200 inch. For presentation purposes, one might want to create low-resolution images. To use the coordinates from the ALTO file at any image resolution, they need to be converted into pixels.

Converting inch1200 values to pixels depends on the image resolution (in dpi). Convert the values as follows:

pixel = value * resolution / 1200

For values in 1/10 mm, note that 1 inch = 25.4 mm = 254 tenths of a millimetre, so convert as follows:

pixel = value * resolution / 254
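The two conversions can be sketched in Python as follows. The function names and the unit label `mm10` are assumptions made for this illustration (only `inch1200` appears in the text); the divisors are the number of units per inch, 1200 for 1/1200 inch and 254 for 1/10 mm, since one inch is 25.4 mm.

```python
# Units per inch for each measurement unit. The label "mm10" is an assumed
# name for the 1/10 mm unit; "inch1200" is the name used in the text.
UNITS_PER_INCH = {
    "inch1200": 1200,  # 1/1200 inch
    "mm10": 254,       # 1/10 mm: 1 inch = 25.4 mm = 254 tenths of a millimetre
}


def to_pixels(value, unit, resolution_dpi):
    """Convert one ALTO measurement to pixels at the given resolution (dpi)."""
    return value * resolution_dpi / UNITS_PER_INCH[unit]


def scale_box(hpos, vpos, width, height, unit, resolution_dpi):
    """Convert a bounding box to whole-pixel coordinates (hypothetical helper)."""
    return tuple(
        round(to_pixels(v, unit, resolution_dpi))
        for v in (hpos, vpos, width, height)
    )


# A box measured in 1/1200 inch, mapped onto a 150 dpi preview image.
preview_box = scale_box(1200, 600, 2400, 480, "inch1200", 150)
```

Rounding to whole pixels is a design choice for drawing overlays on a preview image; callers that need sub-pixel accuracy could use `to_pixels` directly.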
