References from METS to ALTO Files
The reference starts from an entry of the type

within the structMap.
The FILEID refers to the following structure within the file group:
The BEGIN attribute then points into the ALTO file itself.
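The lookup described above can be sketched in Python. This is only an illustration: it assumes the referencing entry is a METS `area` element carrying `FILEID` and `BEGIN`, and that the file group entry locates the ALTO file through an `FLocat` element with an `xlink:href`. The sample document, its IDs and the file name are invented for the example.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical METS fragment; IDs and the file location are made up.
SAMPLE_METS = f"""
<mets xmlns="{METS_NS}" xmlns:xlink="{XLINK_NS}">
  <fileSec>
    <fileGrp ID="ALTOGRP">
      <file ID="ALTO00001">
        <FLocat LOCTYPE="URL" xlink:href="file://./alto/00001.xml"/>
      </file>
    </fileGrp>
  </fileSec>
  <structMap>
    <div TYPE="PAGE">
      <fptr>
        <area FILEID="ALTO00001" BEGIN="P1_TB00001" BETYPE="IDREF"/>
      </fptr>
    </div>
  </structMap>
</mets>
"""


def resolve_area_references(mets_xml):
    """Return (ALTO file location, ID inside the ALTO file) for each area."""
    root = ET.fromstring(mets_xml)

    # Map each file ID in the file section to its location.
    locations = {}
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        flocat = file_el.find(f"{{{METS_NS}}}FLocat")
        if flocat is not None:
            locations[file_el.get("ID")] = flocat.get(f"{{{XLINK_NS}}}href")

    # Follow FILEID to the file entry; BEGIN names the target in the ALTO file.
    return [
        (locations.get(area.get("FILEID")), area.get("BEGIN"))
        for area in root.iter(f"{{{METS_NS}}}area")
    ]


if __name__ == "__main__":
    for location, alto_id in resolve_area_references(SAMPLE_METS):
        print(location, alto_id)
```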
Structure of ALTO Files
The ALTO file consists of three major sections:
- Description
- Styles
- Layout
The Description section contains metadata about the ALTO file itself and processing information on how the file was created.
The Styles section contains the text and paragraph styles with their individual descriptions:
- TextStyle has font descriptions
- ParagraphStyle has paragraph descriptions, e.g. alignment information
The Layout section contains the content information. It is subdivided into Pages.
A page consists of margins and a print space, all of which are non-intersecting rectangular areas within the page area. Each of these can contain any number of objects such as lines, images or text blocks. A text block is divided into text lines, and those are further divided into strings and spaces.
The global structure of the ALTO file is as follows:
alto
    Description
        MeasurementUnit
        sourceImageInformation
        Processing
    Styles
        TextStyle
        TextStyle
        …
        ParagraphStyle
        ParagraphStyle
        …
    Layout
        Page
            TopMargin
            LeftMargin
            RightMargin
            BottomMargin
            PrintSpace
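As a rough illustration of this structure, the following Python sketch parses a tiny ALTO-like document and prints its element tree. The sample is hypothetical and simplified: it omits the XML namespace that real ALTO files declare, and the attribute values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample that mirrors the structure above.
SAMPLE_ALTO = """
<alto>
  <Description>
    <MeasurementUnit>mm10</MeasurementUnit>
  </Description>
  <Styles>
    <TextStyle ID="TXT_0" FONTFAMILY="Times" FONTSIZE="10"/>
    <ParagraphStyle ID="PAR_0" ALIGN="Left"/>
  </Styles>
  <Layout>
    <Page ID="P1" HEIGHT="2970" WIDTH="2100">
      <TopMargin/>
      <LeftMargin/>
      <RightMargin/>
      <BottomMargin/>
      <PrintSpace/>
    </Page>
  </Layout>
</alto>
"""


def outline(element, depth=0):
    """Return the element tree as a list of indented tag names."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(outline(ET.fromstring(SAMPLE_ALTO))))
```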
TextStyles
Text styles have no content. The attributes are:
- FONTFAMILY
- FONTSIZE
- FONTCOLOR
- FONTWEIGHT
- FONTSTYLE
- FONTPITCH
- FONTCHARSET
- UNDERLINED
Only FONTFAMILY and FONTSIZE are required.
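A minimal sketch of how the required attributes could be checked, assuming a namespace-free Styles fragment and an `ID` attribute on each TextStyle (the `ID` attribute and all sample values are assumptions made for the example):

```python
import xml.etree.ElementTree as ET

# Required attributes as stated in the text above.
REQUIRED_TEXTSTYLE_ATTRIBUTES = ("FONTFAMILY", "FONTSIZE")

# Hypothetical sample; the second style deliberately lacks FONTSIZE.
SAMPLE_STYLES = """
<Styles>
  <TextStyle ID="TXT_0" FONTFAMILY="Times New Roman" FONTSIZE="10" FONTSTYLE="bold"/>
  <TextStyle ID="TXT_1" FONTFAMILY="Arial"/>
</Styles>
"""


def missing_textstyle_attributes(text_style):
    """List the required attributes that a TextStyle element does not carry."""
    return [
        name
        for name in REQUIRED_TEXTSTYLE_ATTRIBUTES
        if text_style.get(name) is None
    ]


def check_styles(styles_xml):
    """Map the ID of each incomplete TextStyle to its missing attributes."""
    problems = {}
    for style in ET.fromstring(styles_xml).iter("TextStyle"):
        missing = missing_textstyle_attributes(style)
        if missing:
            problems[style.get("ID")] = missing
    return problems


if __name__ == "__main__":
    print(check_styles(SAMPLE_STYLES))
```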
ParagraphStyles
Paragraph styles have no content. The attributes are:
Name | One of the values
ALIGN | Left, Right, Center, Block
LEFT | Numeric
RIGHT | Numeric
LINESPACE | Numeric
FIRSTLINE | Numeric
Attributes of a Page Element
- PAGECLASS
- STYLEREFS
- HEIGHT
- WIDTH
- PHYSICAL_IMG_NR
- PRINTED_IMG_NR
- QUALITY (OK, Damaged, Missing)
- POSITION (Left, Right, Foldout, Single)
- PROCESSING (A link to processing information)
Page Areas
Each page is divided into different areas (TopMargin, LeftMargin, RightMargin, BottomMargin and PrintSpace). The margins may contain text or other objects that are not part of the main body.
The positions are given as HPOS, VPOS, WIDTH and HEIGHT.
TopMargin | The area between the top line of print and the upper edge of the leaf. It may contain page number, running title or a complete page header.
LeftMargin | The left margin of a page. May contain margin notes.
RightMargin | The right margin of a page. May contain margin notes.
BottomMargin | The area between the bottom line of letterpress or writing and the bottom edge of the leaf. It may contain a page number, a signature number or a catch word.
PrintSpace | Rectangle surrounding the printed area of a page. Page number and running title are not part of the print space.
The position of the margins on a page is illustrated in this picture.
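Because the page areas are rectangles that must not intersect, a consistency check can be written directly from HPOS, VPOS, WIDTH and HEIGHT. The sketch below is one possible way to do this; the `Area` class, the function names and the sample coordinates are hypothetical, and areas that merely share an edge are treated as non-intersecting.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Area:
    name: str
    hpos: float
    vpos: float
    width: float
    height: float


def intersects(a, b):
    """True if the two rectangles overlap in more than a shared edge."""
    return (
        a.hpos < b.hpos + b.width
        and b.hpos < a.hpos + a.width
        and a.vpos < b.vpos + b.height
        and b.vpos < a.vpos + a.height
    )


def overlapping_pairs(areas):
    """Return the names of all pairs of areas that intersect."""
    return [(a.name, b.name) for a, b in combinations(areas, 2) if intersects(a, b)]


# Invented layout for a 2100 x 2970 page (units of 1/10 mm).
SAMPLE_AREAS = [
    Area("TopMargin", 0, 0, 2100, 200),
    Area("LeftMargin", 0, 200, 200, 2570),
    Area("RightMargin", 1900, 200, 200, 2570),
    Area("BottomMargin", 0, 2770, 2100, 200),
    Area("PrintSpace", 200, 200, 1700, 2570),
]

if __name__ == "__main__":
    print(overlapping_pairs(SAMPLE_AREAS))
```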
The Structure of Each of the Page Area (PageSpace) Elements
The page area elements have the attributes:
HPOS | Horizontal position upper/left corner (1/10 mm)
VPOS | Vertical position upper/left corner (1/10 mm)
WIDTH | Width (1/10 mm)
HEIGHT | Height (1/10 mm)
ROTATION | In deg. as floating point number (optional)
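Since positions and sizes are given in 1/10 mm, mapping an area onto a scanned image requires a unit conversion. The following sketch assumes the image resolution (DPI) is known from elsewhere; the function names and the sample element are hypothetical.

```python
import xml.etree.ElementTree as ET

TENTHS_OF_MM_PER_INCH = 254.0  # 1 inch = 25.4 mm = 254 tenths of a millimetre


def to_pixels(value, dpi):
    """Convert a length in 1/10 mm to pixels at the given resolution."""
    return value / TENTHS_OF_MM_PER_INCH * dpi


def pixel_box(area, dpi):
    """Return (left, top, right, bottom) in whole pixels for a page area."""
    left = to_pixels(float(area.get("HPOS")), dpi)
    top = to_pixels(float(area.get("VPOS")), dpi)
    width = to_pixels(float(area.get("WIDTH")), dpi)
    height = to_pixels(float(area.get("HEIGHT")), dpi)
    return (round(left), round(top), round(left + width), round(top + height))


# Invented example values; 300 DPI is an assumption about the scan.
SAMPLE_AREA = ET.fromstring(
    '<PrintSpace HPOS="254" VPOS="508" WIDTH="1270" HEIGHT="2540"/>'
)

if __name__ == "__main__":
    print(pixel_box(SAMPLE_AREA, dpi=300))
```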
Each page area may contain any number of elements. Those elements are one of the following:
TextBlock | A block of text
ComposedBlock | A block that consists of other blocks
Illustration | A picture or image
GraphicalElement | A graphic used to separate blocks, mostly a line or a rectangle
Each of them may have the following attributes:
ID | Unique ID
STYLEREFS | Reference for text or paragraph styles
HPOS | Horizontal position upper/left corner (1/10 mm)
VPOS | Vertical position upper/left corner (1/10 mm)
WIDTH | Width (1/10 mm)
HEIGHT | Height (1/10 mm)
ROTATION | In deg. as floating point number (optional)
IDNEXT | Reference to the next element relating to the reading order
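The IDNEXT attribute forms a chain that can be followed to recover the reading order. A minimal sketch, assuming a namespace-free fragment and that the ID of the first block is already known (the sample and its IDs are invented):

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: document order differs from reading order.
SAMPLE_PRINTSPACE = """
<PrintSpace>
  <TextBlock ID="TB2" IDNEXT="TB3"/>
  <TextBlock ID="TB1" IDNEXT="TB2"/>
  <TextBlock ID="TB3"/>
</PrintSpace>
"""


def reading_order(page_area, start_id):
    """Follow IDNEXT from start_id and return the IDs in reading order."""
    by_id = {el.get("ID"): el for el in page_area if el.get("ID")}
    order = []
    seen = set()
    current = start_id
    # Stop at a missing target, at the end of the chain, or on a cycle.
    while current in by_id and current not in seen:
        seen.add(current)
        order.append(current)
        current = by_id[current].get("IDNEXT")
    return order


if __name__ == "__main__":
    print(reading_order(ET.fromstring(SAMPLE_PRINTSPACE), "TB1"))
```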
If the shape of the element is not rectangular, a SHAPE element may be added.
Polygons are coded as x,y x,y …, with the coordinate pairs separated by spaces.
Circles and ellipses, although allowed in principle, are not supported by docWORKS. Instead, such shapes are represented as polygons with sufficient accuracy.
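The polygon coding described above is straightforward to parse. The sketch below turns such a string into a list of vertices and derives the enclosing rectangle; the function names and the sample coordinates are hypothetical.

```python
def parse_polygon(points):
    """Parse 'x,y x,y ...' into a list of (x, y) tuples."""
    vertices = []
    for pair in points.split():
        x, y = pair.split(",")
        vertices.append((float(x), float(y)))
    return vertices


def bounding_box(vertices):
    """Return (hpos, vpos, width, height) of the rectangle enclosing the polygon."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))


# Invented five-sided shape.
SAMPLE_POINTS = "100,200 400,200 400,350 250,500 100,350"

if __name__ == "__main__":
    polygon = parse_polygon(SAMPLE_POINTS)
    print(polygon)
    print(bounding_box(polygon))
```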
A TextBlock is divided into lines and those are divided into strings, spaces and hyphens:
TextBlock
    TextLine
        String
        SP
        String
        SP
        ...
    TextLine
        ...
Meaning of those tags
Tag | Description
TextLine | Line of text
String | A single word
SP | White space
HYP | Hyphenation
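To show how these tags combine, the following sketch rebuilds the plain text of a text block line by line. It assumes a namespace-free fragment, reads each word from the String attribute CONTENT, and renders HYP as a literal hyphen, which is a simplification. The sample text is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample block with two lines.
SAMPLE_BLOCK = """
<TextBlock ID="TB1">
  <TextLine>
    <String CONTENT="The"/><SP/><String CONTENT="quick"/><SP/><String CONTENT="brown"/>
  </TextLine>
  <TextLine>
    <String CONTENT="fox"/><SP/><String CONTENT="jumps"/>
  </TextLine>
</TextBlock>
"""


def line_text(text_line):
    """Concatenate the strings, spaces and hyphens of one TextLine."""
    parts = []
    for child in text_line:
        if child.tag == "String":
            parts.append(child.get("CONTENT", ""))
        elif child.tag == "SP":
            parts.append(" ")
        elif child.tag == "HYP":
            parts.append("-")  # assumption: render hyphenation as "-"
    return "".join(parts)


def block_text(text_block):
    """Return the text of a TextBlock, one output line per TextLine."""
    return "\n".join(line_text(line) for line in text_block.iter("TextLine"))


if __name__ == "__main__":
    print(block_text(ET.fromstring(SAMPLE_BLOCK)))
```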
Additional Attributes of the Tags
TextBlock | language |
String | CONTENT | String content (word)
String | SUBS_TYPE | HypPart1: the content is the first part of a hyphenated word. Applies only to the last word of a line, if it is hyphenated.
String | SUBS_TYPE | HypPart2: the content is the second part of a hyphenated word. Applies only to the first word of a line, if it is hyphenated.
String | SUBS_CONTENT | Complete content of a hyphenated word
String | WC | Word Confidence: confidence level of the OCR results for this string. A float value between 0 (unsure) and 1 (confident).
String | CC | Confidence level of each character in that string. A list of numbers, one number between 0 (confident) and 9 (unsure) for each character.
String | STYLEREFS | Text style used for this string, if it is different from the parent text block style
String | STYLE | Any combination of font styles (italics, bold, …)
String | ALTERNATIVE | (element) Any number of alternative strings to be used instead
Illustration | TYPE | A user-defined description of the type of the illustration
Illustration | FILEID | A link to a separate file that contains just the illustration
ComposedBlock | TYPE | A user-defined description of the type of the composed block
ComposedBlock | FILEID | A link to a separate file that contains just the composed block
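The hyphenation and confidence attributes can be used together when extracting words for indexing. The sketch below is one possible reading of the table above: it emits SUBS_CONTENT once for a hyphenated word, skips the second part, and reports strings whose WC falls below a threshold. The sample values and the threshold are invented, and the CC parser accepts digits with or without separating spaces because the exact list format is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with one word hyphenated across two lines.
SAMPLE_BLOCK = """
<TextBlock ID="TB1">
  <TextLine>
    <String CONTENT="digital" WC="0.98" CC="0000000"/><SP/>
    <String CONTENT="news" SUBS_TYPE="HypPart1" SUBS_CONTENT="newspaper" WC="0.91" CC="0010"/>
    <HYP/>
  </TextLine>
  <TextLine>
    <String CONTENT="paper" SUBS_TYPE="HypPart2" SUBS_CONTENT="newspaper" WC="0.42" CC="07301"/><SP/>
    <String CONTENT="archive" WC="0.95" CC="0 0 0 0 1 0 0"/>
  </TextLine>
</TextBlock>
"""


def words(text_block):
    """Return the words of a block, with hyphenated words joined."""
    result = []
    for string in text_block.iter("String"):
        subs_type = string.get("SUBS_TYPE")
        if subs_type == "HypPart1":
            # Use the complete word; fall back to the fragment if it is absent.
            result.append(string.get("SUBS_CONTENT", string.get("CONTENT")))
        elif subs_type == "HypPart2":
            continue  # already covered by the first part
        else:
            result.append(string.get("CONTENT"))
    return result


def low_confidence(text_block, threshold):
    """Return the CONTENT of strings whose WC is below the threshold."""
    return [
        string.get("CONTENT")
        for string in text_block.iter("String")
        if string.get("WC") is not None and float(string.get("WC")) < threshold
    ]


def char_confidences(string):
    """Parse CC into one integer (0 = confident, 9 = unsure) per character."""
    return [int(ch) for ch in string.get("CC", "") if ch.isdigit()]


if __name__ == "__main__":
    block = ET.fromstring(SAMPLE_BLOCK)
    print(words(block))
    print(low_confidence(block, threshold=0.5))
```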