 
 
 

METS / ALTO technical information

 
 


 

References from METS to ALTO Files

The reference starts from an entry of the type



within the structMap.

The FILEID refers to the following structure within the file group



The BEGIN attribute then points into the ALTO file itself.
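
The lookup described above can be sketched in Python. This is only an illustration under stated assumptions: the sample fragment is hypothetical and heavily simplified (no namespaces, a plain href attribute instead of xlink:href, invented IDs and paths), and resolve_areas is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified METS fragment. Real files use XML namespaces
# and xlink:href; the IDs and the path below are invented for illustration.
METS_SAMPLE = """
<mets>
  <fileSec>
    <fileGrp ID="ALTOGRP">
      <file ID="ALTO00001">
        <FLocat LOCTYPE="URL" href="alto/00001.xml"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div TYPE="ARTICLE">
      <fptr>
        <area FILEID="ALTO00001" BEGIN="TB00003" BETYPE="IDREF"/>
      </fptr>
    </div>
  </structMap>
</mets>
"""

def resolve_areas(mets_xml):
    """Return (file location, BEGIN value) for every area in the structMap."""
    root = ET.fromstring(mets_xml)
    # Index the entries of the file group by their ID attribute.
    locations = {}
    for file_el in root.iter("file"):
        flocat = file_el.find("FLocat")
        if flocat is not None:
            locations[file_el.get("ID")] = flocat.get("href")
    # Follow each FILEID to its file entry; BEGIN is an ID inside that file.
    result = []
    for area in root.find("structMap").iter("area"):
        result.append((locations.get(area.get("FILEID")), area.get("BEGIN")))
    return result

targets = resolve_areas(METS_SAMPLE)
print(targets)
```

Each returned pair names the ALTO file to open and the ID of the element inside it that the structMap entry refers to.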



Structure of ALTO Files


The ALTO file consists of three major sections:

Description
Styles
Layout

The Description section contains metadata about the ALTO file itself and processing information describing how the file was created.
The Styles section contains the text and paragraph styles with their individual descriptions:

TextStyle has font descriptions
ParagraphStyle has paragraph descriptions, e.g. alignment information

The Layout section contains the content information. It is subdivided into Pages.
A page consists of margins and a print space, all of which are non-intersecting rectangular areas within the page area. Each of these can contain any number of objects such as lines, images, text blocks and more. A text block is divided into text lines, and those are further divided into strings and spaces.

The global structure of the ALTO file is as follows:

alto
    Description
        MeasurementUnit
        sourceImageInformation
        Processing
    Styles
        TextStyle
        TextStyle
        ParagraphStyle
        ParagraphStyle
    Layout
        Page
            TopMargin
            LeftMargin
            RightMargin
            BottomMargin
            PrintSpace
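
As a rough illustration of this structure, the following Python sketch parses a tiny ALTO-like document and prints its element tree. The sample is hypothetical: it omits namespaces, and the ID values and attribute values are invented. The outline helper is a name introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal ALTO-like document (no namespace, invented values).
ALTO_SAMPLE = """
<alto>
  <Description>
    <MeasurementUnit>mm10</MeasurementUnit>
  </Description>
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times" FONTSIZE="10"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1" HEIGHT="2970" WIDTH="2100">
      <TopMargin/>
      <LeftMargin/>
      <RightMargin/>
      <BottomMargin/>
      <PrintSpace>
        <TextBlock ID="TB1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""

def outline(element, depth=0, max_depth=3):
    """Return the element names as indented lines, down to max_depth."""
    lines = ["  " * depth + element.tag]
    if depth < max_depth:
        for child in element:
            lines.extend(outline(child, depth + 1, max_depth))
    return lines

root = ET.fromstring(ALTO_SAMPLE)
sections = [child.tag for child in root]  # the three major sections
print("\n".join(outline(root)))
```

Files that declare an XML namespace would need the namespace included in each tag name, which this sketch leaves out for brevity.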



TextStyles


Text styles have no content. The attributes are:

  • FONTFAMILY
  • FONTSIZE
  • FONTCOLOR
  • FONTWEIGHT
  • FONTSTYLE
  • FONTPITCH
  • FONTCHARSET
  • UNDERLINED

Only FONTFAMILY and FONTSIZE are required.
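
A consumer of ALTO files might check the two required attributes before using a style. The following is a minimal sketch; the ID attributes and font values in the samples are assumptions made for illustration, and missing_required is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The two attributes that must be present on a TextStyle.
REQUIRED = ("FONTFAMILY", "FONTSIZE")

def missing_required(text_style):
    """Return the names of required attributes absent from a TextStyle."""
    return [name for name in REQUIRED if text_style.get(name) is None]

# Invented sample elements; the ID attribute is an assumption.
complete = ET.fromstring(
    '<TextStyle ID="TXT_0" FONTFAMILY="Times" FONTSIZE="10" FONTSTYLE="bold"/>'
)
incomplete = ET.fromstring('<TextStyle ID="TXT_1" FONTFAMILY="Arial"/>')

print(missing_required(complete))
print(missing_required(incomplete))
```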



ParagraphStyles


Paragraph styles have no content. The attributes and their possible values are:

  • ALIGN: Left, Right, Center or Block
  • LEFT: Numeric
  • RIGHT: Numeric
  • LINESPACE: Numeric
  • FIRSTLINE: Numeric



Attributes of a Page Element

  • PAGECLASS
  • STYLEREFS
  • HEIGHT
  • WIDTH
  • PHYSICAL_IMG_NR
  • PRINTED_IMG_NR
  • QUALITY (OK, Damaged, Missing)
  • POSITION (Left, Right, Foldout, Single)
  • PROCESSING (a link to processing information)



Page Areas

Each page is divided into different areas (TopMargin, LeftMargin, RightMargin, BottomMargin and PrintSpace). The margins may contain text or other objects that are not part of the main body.

The positions are given as HPOS, VPOS, WIDTH and HEIGHT.

TopMargin

The area between the top line of print and the upper edge of the leaf. It may contain page number, running title or a complete page header.

LeftMargin

The left margin of a page. May contain margin notes.

RightMargin

The right margin of a page. May contain margin notes.

BottomMargin

The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word.

PrintSpace

Rectangle surrounding the printed area of a page. Page number and running title are not part of the print space.

 

The position of the margins on a page is illustrated in this picture.



The Structure of Each of the Page Area (PageSpace) Elements

The page area elements have the attributes:

  • HPOS: Horizontal position upper/left corner (1/10 mm)
  • VPOS: Vertical position upper/left corner (1/10 mm)
  • WIDTH: Width (1/10 mm)
  • HEIGHT: Height (1/10 mm)
  • ROTATION: In degrees, as a floating point number (optional)
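
Because positions and sizes are given in 1/10 mm, mapping an area onto the scanned image requires the scan resolution. The sketch below shows the arithmetic, assuming the resolution in dots per inch is known from elsewhere (it is not one of the attributes listed here) and using 1 inch = 25.4 mm = 254 tenths of a millimetre. Both function names are hypothetical.

```python
def tenth_mm_to_pixels(value, dpi):
    """Convert a length in 1/10 mm to pixels at the given resolution."""
    # 254 tenths of a millimetre make one inch.
    return value * dpi / 254.0

def area_to_pixel_box(hpos, vpos, width, height, dpi):
    """Return (left, top, right, bottom) in whole pixels for an area."""
    left = round(tenth_mm_to_pixels(hpos, dpi))
    top = round(tenth_mm_to_pixels(vpos, dpi))
    right = round(tenth_mm_to_pixels(hpos + width, dpi))
    bottom = round(tenth_mm_to_pixels(vpos + height, dpi))
    return (left, top, right, bottom)

# An area 25.4 mm from the left edge and 50.8 mm from the top,
# 127 mm wide and 25.4 mm high, on an assumed 300 dpi scan.
box = area_to_pixel_box(254, 508, 1270, 254, 300)
print(box)
```

Rounding the right and bottom edges from the summed position and size, rather than rounding width and height separately, keeps adjacent areas from drifting apart by a pixel.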

 

Each page area may contain any number of elements. Those elements are one of the following:

  • TextBlock: A block of text
  • ComposedBlock: A block that consists of other blocks
  • Illustration: A picture or image
  • GraphicalElement: A graphic used to separate blocks, mostly a line or a rectangle

 

Each of them may have the following attributes:

  • ID: Unique ID
  • STYLEREFS: Reference for text or paragraph styles
  • HPOS: Horizontal position upper/left corner (1/10 mm)
  • VPOS: Vertical position upper/left corner (1/10 mm)
  • WIDTH: Width (1/10 mm)
  • HEIGHT: Height (1/10 mm)
  • ROTATION: In degrees, as a floating point number (optional)
  • IDNEXT: Reference to the next element relating to the reading order
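
The IDNEXT attribute makes it possible to walk blocks in reading order even when their document order differs. A minimal sketch, assuming a hypothetical un-namespaced sample with invented IDs; reading_order is a helper name chosen for this example, and the starting block must be known in advance.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: document order differs from reading order.
SAMPLE = """
<PrintSpace>
  <TextBlock ID="TB3"/>
  <TextBlock ID="TB1" IDNEXT="TB2"/>
  <TextBlock ID="TB2" IDNEXT="TB3"/>
</PrintSpace>
"""

def reading_order(container, start_id):
    """Follow IDNEXT links from start_id and return the visited IDs."""
    by_id = {el.get("ID"): el for el in container.iter() if el.get("ID")}
    order, seen = [], set()
    current = start_id
    # Stop at a missing link, an unknown ID, or a cycle.
    while current in by_id and current not in seen:
        seen.add(current)
        order.append(current)
        current = by_id[current].get("IDNEXT")
    return order

space = ET.fromstring(SAMPLE)
print(reading_order(space, "TB1"))
```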

 

If the shape of the element is not rectangular, a SHAPE element may be added.

Polygons are coded as x,y x,y …, with the coordinate pairs separated by spaces.

Circles and ellipses are, although allowed in principle, not supported by docWORKS. Instead, such shapes are represented as polygons with sufficient accuracy.
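
The polygon notation can be read with a few lines of Python. This sketch only handles the coordinate string itself; how the string is attached to the shape in the XML is not shown here, and the sample coordinates and function names are invented for illustration.

```python
def parse_polygon(points):
    """Parse 'x,y x,y ...' into a list of (x, y) float pairs."""
    pairs = []
    for token in points.split():
        x, y = token.split(",")
        pairs.append((float(x), float(y)))
    return pairs

def bounding_box(pairs):
    """Return (hpos, vpos, width, height) of the enclosing rectangle."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))

corners = parse_polygon("0,0 100,0 100,50 0,50")
print(corners)
print(bounding_box(corners))
```

The bounding box gives a rectangular fallback in the same HPOS, VPOS, WIDTH, HEIGHT terms used for rectangular elements.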

A TextBlock is divided into lines and those are divided into strings, spaces and hyphens:

TextBlock
    TextLine
        String
        SP
        String
        SP
        ...
    TextLine
        ...

 

Meaning of those tags

  • TextLine: Line of text
  • String: A single word
  • SP: White space
  • HYP: Hyphenation
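
To show how these tags fit together, the following sketch rebuilds the visible text of each line from its String, SP and HYP children. The sample is hypothetical and un-namespaced; the CONTENT attribute on HYP and the fallback to "-" are assumptions made for this example, and line_text is an invented helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical text block with a word hyphenated across two lines.
SAMPLE = """
<TextBlock ID="TB1">
  <TextLine>
    <String CONTENT="Digital"/>
    <SP/>
    <String CONTENT="conver"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="sion"/>
    <SP/>
    <String CONTENT="works"/>
  </TextLine>
</TextBlock>
"""

def line_text(text_line):
    """Concatenate the words, spaces and hyphens of one TextLine."""
    parts = []
    for child in text_line:
        if child.tag == "String":
            parts.append(child.get("CONTENT", ""))
        elif child.tag == "SP":
            parts.append(" ")
        elif child.tag == "HYP":
            # Assumption: fall back to a plain hyphen if none is given.
            parts.append(child.get("CONTENT", "-"))
    return "".join(parts)

block = ET.fromstring(SAMPLE)
lines = [line_text(tl) for tl in block.iter("TextLine")]
print(lines)
```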



Additional Attributes of the Tags

TextBlock

  • language

String

  • CONTENT: String content (word)
  • SUBS_TYPE: One of the following values
      HypPart1: The content is the first part of a hyphenated word. Applies only to the last word of a line, if it is hyphenated
      HypPart2: The content is the second part of a hyphenated word. Applies only to the first word of a line, if it is hyphenated
  • SUBS_CONTENT: Complete content of a hyphenated word
  • WC: Word confidence. Confidence level of the OCR result for this string, a float value between 0 (unsure) and 1 (confident)
  • CC: Confidence level of each character in the string. A list of numbers, one number between 0 (confident) and 9 (unsure) for each character
  • STYLEREFS: Text style used for this string, if it is different from the parent text block style
  • STYLE: Any combination of font styles (italics, bold, …)
  • ALTERNATIVE: (element) Any number of alternative strings to be used instead

Illustration

  • TYPE: A user-defined description of the type of the illustration
  • FILEID: A link to a separate file that contains just the illustration

ComposedBlock

  • TYPE: A user-defined description of the type of the composed block
  • FILEID: A link to a separate file that contains just the composed block
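
The hyphenation and confidence attributes of String can be combined as in the following sketch. It is an illustration only: the sample values are invented, namespaces are omitted, and the CC value is assumed to be a run of digits (any separators are ignored). The helper names words and uncertain_characters are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a hyphenated word and per-character confidence.
SAMPLE = """
<TextBlock ID="TB1">
  <TextLine>
    <String CONTENT="automated" WC="0.98" CC="000000000"/>
    <SP/>
    <String CONTENT="conver" SUBS_TYPE="HypPart1"
            SUBS_CONTENT="conversion" WC="0.71" CC="000130"/>
    <HYP/>
  </TextLine>
  <TextLine>
    <String CONTENT="sion" SUBS_TYPE="HypPart2"
            SUBS_CONTENT="conversion" WC="0.90" CC="0000"/>
  </TextLine>
</TextBlock>
"""

def words(block):
    """Return the words of a block with hyphenated words rejoined."""
    result = []
    for string_el in block.iter("String"):
        subs_type = string_el.get("SUBS_TYPE")
        if subs_type == "HypPart1":
            # Use the complete word once, at the position of its first part.
            result.append(string_el.get("SUBS_CONTENT", string_el.get("CONTENT", "")))
        elif subs_type == "HypPart2":
            continue  # already covered by the first part
        else:
            result.append(string_el.get("CONTENT", ""))
    return result

def uncertain_characters(string_el, threshold=1):
    """Return (character, CC value) pairs at or above the threshold."""
    content = string_el.get("CONTENT", "")
    digits = [int(ch) for ch in string_el.get("CC", "") if ch.isdigit()]
    # In CC, 0 means confident and 9 means unsure.
    return [(c, d) for c, d in zip(content, digits) if d >= threshold]

block = ET.fromstring(SAMPLE)
strings = list(block.iter("String"))
print(words(block))
print(uncertain_characters(strings[1]))
```

Note that the two scales run in opposite directions: a high WC means a reliable word, while a high CC digit marks an unreliable character.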

