Multi-Dimensional Connectionist Classification: 
Segmentation-Free Handwriting Recognition 
 
Dissertation submitted for the
academic degree of
Doctor of Natural Sciences (Dr. rer. nat.)

submitted by
Schall, Martin

at the



Section of Mathematics and Natural Sciences (Mathematisch-Naturwissenschaftliche Sektion)
Department of Computer and Information Science (Fachbereich Informatik und Informationswissenschaft)
 
 
 
Konstanz, 2022 
Konstanzer Online-Publikations-System (KOPS)
 URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-2-1jwkb729u4hib7
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Date of the oral examination: July 1, 2022
First referee: Prof. Dr. Daniel A. Keim
Second referee: Prof. Dr. Matthias O. Franz
Chair: Prof. Dr. Bastian Goldlücke
 
Abstract
Offline handwriting recognition is one area of research in document analysis. It is the
task of automatic transcription of natural handwritten text from images. As such it finds
applications in scientific systems, as well as industrial and consumer products.
This thesis deals with the segmentation-free offline handwriting recognition of hand-
written paragraphs, which is the transcription of paragraphs from images without prior
segmentation of the image into individual lines, words or characters. Removing the need
for prior segmentation also removes a potential error source from the overall pipeline
while transcribing text from images.
The beginning chapters of this thesis outline and discuss the state of the art in offline
handwriting recognition and general methods from machine learning and deep learning
which form the basis for this research. The discussed state of the art methods include
connectionist temporal classification, a method for line-wise offline handwriting recogni-
tion, and a paragraph-wise transcription method based on attention networks. Both are
used for empirical evaluation and comparison in the later chapters.
Following this overview of the state of the art is a discussion of the research ques-
tion and difficulty of segmentation-free paragraph-wise offline handwriting recognition.
The main method and theory of this thesis is discussed and detailed by proposing multi-
dimensional connectionist classification, which addresses segmentation-free paragraph
transcription. Multi-dimensional connectionist classification is a novel training method for
deep neural networks that builds an expectation-maximization loop in combination with a
conditional random field in order to predict glyph probabilities from an image of handwrit-
ten text. These glyph probabilities are transcribed to a computer-processable string by
applying a novel multi-line decoding algorithm. Multi-dimensional connectionist classifica-
tion and its decoding algorithm are empirically evaluated by applying them to handwritten
paragraphs.
A novel heatmap-based visual analytics technique and workflow are proposed for
human inspection of the multi-dimensional connectionist classification training loop and
transcription results. This technique is designed to preserve the contextual information
given by the handwritten text while enabling the model engineer to identify potential error
sources and improve hyper-parameters.
This thesis also discusses methods for combining line-wise and paragraph-wise tran-
scription in offline handwriting recognition. The goal of such an approach is to reduce the
overall error rate by combining the strengths of both methods. Several methods for com-
bining transcription methods in different steps of their respective pipelines are proposed
and empirically evaluated.
Zusammenfassung (Summary)
Offline handwriting recognition is a research area within document analysis and is the
automated transcription of natural text from images. Handwriting recognition is applied
in scientific systems as well as in industrial solutions and consumer devices.
This dissertation deals with the segmentation-free offline handwriting recognition of
paragraphs, that is, the transcription of paragraphs from images without prior
decomposition into lines, words or characters. Eliminating the need for prior
segmentation reduces the possible error sources of the overall transcription system.
The opening chapters of this thesis present and discuss the current state of research in
offline handwriting recognition. General methods of machine learning and deep learning
are also discussed, since they form the basis of this work. As current methods of offline
handwriting recognition, connectionist temporal classification for line-wise
transcription and a paragraph-wise transcription method based on attention networks are
described in detail. Both methods are used for empirical evaluation and comparison in
later chapters.
Following this overview, the scientific research question and problem statement of
segmentation-free, paragraph-wise offline handwriting recognition are discussed. The main
method and theory of this thesis is multi-dimensional connectionist classification, a
novel method for the segmentation-free transcription of paragraphs. Multi-dimensional
connectionist classification is a training method for deep neural networks that applies
expectation-maximization in combination with conditional random fields in order to
predict character probabilities from paragraph images. These character probabilities are
then converted into machine-readable strings by a novel decoding algorithm.
Multi-dimensional connectionist classification and its decoding algorithm are evaluated
empirically by applying them to handwritten paragraphs.
Furthermore, a novel heatmap-based method for the visual analysis and inspection of
training runs and results of multi-dimensional connectionist classification is proposed.
This method integrates the context given by the original handwriting image and thereby
allows the model engineer to identify possible error sources and to improve
hyper-parameters.
Subsequently, this dissertation presents methods for combining line-wise and
paragraph-wise transcription in offline handwriting recognition. The goal of these
methods is to combine the strengths of both approaches. Several methods for combining
line-wise and paragraph-wise transcription, applied at different points of the overall
system, are discussed and evaluated empirically.
Acknowledgments
My thanks go to Prof. Dr. Matthias Franz, who supervised this research. I have known
Matthias Franz since my studies and as the supervisor of my master's thesis, and I owe
him a great deal of expertise as well as experience and a feel for scientific questions.
I greatly value the friendly and relaxed, yet professional and goal-oriented
collaboration with him.
I would also like to express my thanks to Prof. Dr. Daniel Keim. He supervised this
research and extended his trust to me in advance, since he had not known me before. In
the course of our collaboration I came to know him as a very competent, helpful and
friendly person. From him I learned much expertise and scientific working practice.
Furthermore, I would like to thank Dr. Marc-Peter Schambach. He initially made this
research possible for me at Siemens Parcel Logistics GmbH and subsequently supervised it
informally as a mentor. In him I have gained a good friend.
Thanks to Dr. Pascal Laube, who contributed to this work through many interesting and
insightful discussions, but also as a good friend.
I would like to thank everyone at the Institute for Optical Systems (Institut für
Optische Systeme) and the Data Analysis and Visualization Group (Arbeitsgruppe
Datenanalyse und Visualisierung). You supported me in this work as friends and with
discussions, tips and as co-authors. In particular I would like to thank
Prof. Dr. Georg Umlauf, Prof. Dr. Oliver Dürr, Dr. Dominik Sacha, Dr. Manuel Stein,
Dr. Michael Behrisch, Dr. Michael Blumenschein, Dr. Johannes Fuchs, Dr. Dominik Jäckle,
Dr. Dirk Streeb, Michael Grunwald, Matthias Hermann, Mennatallah El-Assady, Tobias
Birkle, Dennis Griesser, Daniel Dold, Rita Sevastjanova, Thilo Spinner, Fabian Sperrle,
Udo Schlegel, Robin Mattes, Felix Peter, Daniel Seebacher, Juri Buchmüller, Nico Brügel,
Henning Krause and Haiyan Bührig.
My thanks also go to my colleagues at Siemens Parcel Logistics GmbH. In the context of
this work I would especially like to thank Dr. Jörg Rottland, Stephan v.d. Nüll,
Michael Zettler and Insa Sigl.
No professional work is possible without the support of family and friends. Many thanks
to all of you whom I call my friends with good reason. I am very grateful to my parents
Elfriede and Werner Schall, my brother Sebastian Schall and his partner Stefanie Eckardt.
Many thanks to Simone and Joachim Breyer, Stefan Lang and Andreas Bolz.
This work was funded by Siemens Parcel Logistics GmbH (my working time, computer
hardware and conference attendance) and was thereby made possible in the first place.
This list is certainly not complete. Thanks to everyone I know as family, friends,
colleagues and scientists!
Contents
1 Introduction
1.1 Overview and Motivation
1.2 Scientific Contributions
1.3 Publications
1.4 Organization of this Thesis
2 Background
2.1 Machine Learning
2.2 Conditional Random Fields
2.3 Deep Neural Networks
2.4 Expectation-Maximization
3 Related Work
3.1 Connectionist Temporal Classification
3.2 Paragraph Transcription using Attention Networks
3.3 Paragraph Transcription by Reshaping CNNs
4 The Problem with Multi-Line Handwriting Recognition
4.1 Overview
4.2 Segmentation of Handwritten Paragraphs
4.3 Computational Considerations
5 Decoding Algorithms for Multi-Line Text Recognition
5.1 Overview
5.2 Structure of the Model Output
5.3 Multi-Line Decoding
5.4 Finding Lines
5.5 Decoding Lines
6 Multi-Dimensional Connectionist Classification (MDCC)
6.1 Overview
6.2 Structure of Paragraphs
6.3 Basics of Multi-Line Training
6.4 Maximum Likelihood Training
6.5 Expectation-Maximization Training
6.6 Construction of and Inference in the CRF
6.7 Emphasizing Segmentation
7 Text Recognition for Paragraphs
7.1 Overview
7.2 Data Augmentation
7.3 Forced Alignment
7.4 Neural Network Model
7.5 Experiments and Results
7.6 Discussion
8 Hyper-Parameter Search using Visual Analytics
8.1 Problem Description and Idea
8.2 Error Sources in MDCC
8.3 Workflow for Identification of Error Sources
8.4 Heatmap-Based Visualization for MDCC
8.5 Discussion
9 Combined Models for Text Recognition
9.1 Idea and Overview
9.2 Classifier on Paragraph Images
9.3 Classifier on Transcribed Texts
9.4 Classifier on Segmentation Information
9.5 Discussion
10 Dictionary-Based Decoding Algorithms
10.1 Overview and Relation to This Work
10.2 Decoding using a Large Lexicon and Fuzzy Search
10.3 Decoding using LSTM Networks and Metric Learning
11 Discussion and Conclusion
11.1 Achieved Goals
11.2 Ideas for Future Research
11.3 Discussion
Bibliography
Nomenclature
BP Belief Propagation
BPTT Backpropagation Through Time
CNN Convolutional Neural Network
CRF Conditional Random Field
CTC Connectionist Temporal Classification
DGM Directed Graphical Model
DNN Deep Neural Network
EM Expectation-Maximization
GPGPU General Purpose Graphics Processing Unit
LBP Loopy Belief Propagation
LSTM Long Short-Term Memory
MAP Maximum A Posteriori
MDLSTM Multi-Dimensional Long Short-Term Memory
ML Machine Learning
MLP Multi-Layer Perceptron
MRF Markov Random Field
NLP Natural Language Processing
RNN Recurrent Neural Network
SVM Support-Vector Machine
VA Visual Analytics
XAI eXplainable AI
Mathematical Notation
Throughout this thesis we will discuss multiple mathematical concepts such as graphical
models, artificial neural networks and deep neural networks from mathematical fields
such as statistics, information theory, linear algebra and analysis. This thesis applies a
common mathematical notation to all of these concepts. This notation is as follows:
Scalar variables are typeset in italic font, e.g. i, j or n, m. Italic symbols with
an index, e.g. y_i, refer to a specific scalar within a tensor, matrix, vector or set.
Functions are typeset in Roman font, e.g. exp(). Constants, e.g. k, are also in Roman
font.
Tensor, matrix, vector and set symbols are typeset in bold font, e.g. W, x or y. As
stated before, specific elements of these are typeset in italic font with an index, e.g.
W_i.
Chapter 1
Introduction
1.1 Overview and Motivation
The main topic of this thesis is offline handwriting recognition, that is, the automatic,
computerized transcription of handwritten text from an image containing that text. The
image is typically produced by scanning a sheet of paper or is captured by an optical
camera system. The offline handwriting recognition task is then to run an algorithm on a
computer that transcribes the text contained in this image into a form that can be
processed further by software, e.g. a UTF-8 encoded string.
Offline handwriting recognition as a research field is part of document analysis and
supports research in e.g. historical document analysis. The motivation, application and
context for the research and development of the methods discussed in this thesis are
provided by the products and solutions of Siemens Parcel Logistics. Offline handwriting
recognition is part of the pipeline for processing and sorting mail by automatically
reading the sender and receiver addresses on mail and parcel items. The goal of the
overall pipeline is to optically capture one or multiple images of mail items while they
physically travel through the sorting system, to read and encode the required information
from these images, and finally to decide how to proceed with the processing of each mail
piece, yielding the corresponding control commands for the physical sorting machine. In
this context, offline handwriting recognition is part of the reading and coding of
addresses on mail and parcels. Figure 1.1.1 shows a belt system that transports parcels
through a tunnel in which images of each parcel are optically captured from six sides.
Figure 1.1.2 shows the Siemens cross-belt sorter VarioSort EXB, which is used for the
automatic sorting of parcels. Offline handwriting recognition of the addresses on the
parcels is an intermediate step towards issuing commands to the belts of the sorting
system in order to physically transport the parcels to their intended destinations.
Figure 1.1.3 shows a Siemens Integrated Reading and Video Coding Machine (IRV) for the
automatic sorting of letter mail. Similarly, offline handwriting recognition is required
to read the addresses on these mail items.
At the time of writing, most of the addresses on mail items and parcels are not
handwritten by humans but printed by machines. The layout of such labels and the fonts
used for the addresses are much easier to recognize, read and encode correctly. However,
mail and parcels with handwritten addresses are still in circulation, and as such the
need for reliable offline handwriting recognition remains.
By the nature of how handwritten text is produced, that is by a human using a pen on a
sheet of paper, some glyphs in the final handwritten text may overlap. Overlaps may occur
between adjacent glyphs within a text line, but also between glyphs of overlapping text
lines. Modern methods in offline handwriting recognition typically address the problem of
overlapping glyphs by applying so-called segmentation-free methods, that is methods that
are able to transcribe text without prior separation of its components. Such a separation
would, for example, segment paragraphs into lines, or lines into words and characters.
Segmentation-free methods avoid this, since every segmentation step introduces a
potential source of errors into the overall offline handwriting recognition system.
Figure 1.1.1: Tunnel for capturing images of parcels in preparation for the automatic reading and
coding step.
Image retrieved from https://www.siemens-logistics.com/en/parcel-logistics/reading-and-coding
on September 23rd, 2021.
Figure 1.1.2: Siemens VarioSort EXB for automatic sorting of parcels.
Image retrieved from https://www.siemens-logistics.com/en/parcel-logistics/sorting
on September 23rd, 2021.
Figure 1.1.3: Siemens Integrated Reading and Video Coding Machine (IRV) for mail sorting.
Image retrieved from https://www.siemens-logistics.com/en/mail-sorting/letter-sorting-and-sequencing
on September 23rd, 2021.
Connectionist temporal classification (CTC) [46], see also Section 3.1, is one such
segmentation-free method. CTC was developed for the automatic transcription of text
lines. It removes the need to segment text lines into words or characters. Figure
1.1.4 shows an example postal address for which transcription with CTC is applicable.
However, overlaps may also occur between text lines, as Figure 1.1.5 shows in an example
from the IAM offline handwriting database [88]. Overlapping lines may also occur in
handwritten addresses on mail items or parcels.
Figure 1.1.4: Postal address image from a Siemens Parcel Logistics project in New Zealand.
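To make the idea of segmentation-free line transcription more concrete, the following
minimal sketch shows greedy (best-path) CTC decoding of a single text line: the most
probable label is chosen per frame, consecutive repeats are merged and the blank label is
dropped. The blank index, the toy alphabet and the probability matrix are illustrative
assumptions and are not taken from this thesis; the decoding used later in this work may
differ.

```python
from typing import List, Sequence

BLANK = 0  # assumption: label index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_probs: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding of one text line.

    frame_probs[t][k] is the predicted probability of label k at frame t.
    Label k > 0 corresponds to alphabet[k - 1].
    """
    # Step 1: choose the most probable label for every frame.
    best_path: List[int] = [
        max(range(len(probs)), key=probs.__getitem__) for probs in frame_probs
    ]

    # Step 2: merge consecutive repeats, then drop blanks.
    decoded: List[str] = []
    previous = None
    for label in best_path:
        if label != previous and label != BLANK:
            decoded.append(alphabet[label - 1])
        previous = label
    return "".join(decoded)


# Toy example: six frames over the labels (blank, 'a', 'b').
toy_probs = [
    [0.10, 0.80, 0.10],  # 'a'
    [0.20, 0.70, 0.10],  # 'a' (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates the two 'a's)
    [0.10, 0.60, 0.30],  # 'a'
    [0.10, 0.20, 0.70],  # 'b'
    [0.20, 0.10, 0.70],  # 'b' (repeat, merged)
]
print(ctc_greedy_decode(toy_probs, "ab"))
```

Note that no character positions are given to the decoder: the blank label is what allows
a repeated character to be distinguished from a single character spanning several frames.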
Multi-dimensional connectionist classification (MDCC), proposed in Chapters 5 and 6 of
this thesis, is designed as a segmentation-free offline handwriting recognition method
for transcribing whole paragraphs without prior segmentation into individual lines, words
or characters. It is specifically designed to handle overlapping text lines. It thus removes
the error source of additional segmentation steps within the overall transcription system.
Figure 1.1.5: Paragraph from the IAM offline handwriting database that shows overlapping text
lines.
The technical motivation for the research into paragraph-wise segmentation-free offline
handwriting recognition is thus twofold: handwritten text lines do sometimes overlap, and
mail items and parcels with handwritten addresses are still in circulation. Modern
offline handwriting recognition methods rely on deep neural networks (DNNs), a specific
type of machine learning model. DNNs have proved successful in solving complex
recognition tasks based on large amounts of data, but they are also difficult for human
experts to understand and optimize. As such, this research lies at the intersection of
machine learning, document analysis and visual analytics.
1.2 Scientific Contributions
This section details the novel scientific contributions contained within this doctoral the-
sis and it relates these contributions to their respective research fields. The research
contributions of this thesis are as follows:
• Multi-dimensional connectionist classification as a whole, with its training method
and decoding algorithm, is a novel method for paragraph-wise segmentation-free
offline handwriting recognition. It is capable of transcribing handwritten paragraphs
with overlapping text lines, writing of varying size and slanted or angled text (up
to an angle of 45 degrees). MDCC as detailed in this thesis is combined with a deep
neural network as the actual model for the recognition of handwritten text. However,
MDCC itself is a training method and decoding algorithm that is not restricted to a
specific deep neural network architecture or machine learning model. The training
method of MDCC is discussed in Chapter 6 and its decoding algorithm in Chapter
5.
• Multi-dimensional connectionist classification as a training method is a novel
contribution to machine learning and computer science. It interprets paragraph-wise
segmentation-free offline handwriting recognition as an inference task over a space
of two spatial dimensions, given only incomplete information. The information
provided in this task is the image of handwritten text and, only during training, the
label sequence of the correctly transcribed text. No geometric information, e.g. the
position or extent of characters, is provided. This missing information needs to be
inferred. To this end, MDCC sets up an expectation-maximization loop between a
conditional random field (CRF) and a deep neural network (DNN) in order to infer
the missing information and optimize the model parameters at the same time.
Chapter 6 discusses this approach.
• Training deep neural networks using MDCC highlighted the need for understand-
ing its workings in the context of offline handwriting recognition. Understanding the
deep learning model in use in combination with MDCC allows the expert user to im-
prove its hyper-parameters and to correct potential errors in the ground truth data
or software implementation. Chapter 8 of this thesis proposes a novel visual an-
alytics technique for inspecting the predictions of the DNN and CRF models while
preserving the contextual information provided by the handwritten text. It also pro-
poses techniques for identifying interesting cases in MDCC and a novel workflow
for identification and improvement of error sources in MDCC.
• Multi-dimensional connectionist classification is designed for the paragraph-wise
transcription of handwritten texts. There are methods, e.g. connectionist temporal
classification (CTC), that allow for line-wise transcription of handwritten texts. Since
paragraph-wise transcription only unfolds its full benefit on paragraphs that are
difficult to segment, the question arises whether the decision to transcribe line- or
paragraph-wise can be made on a case-by-case basis. Chapter 9 proposes novel methods for
combining line- and paragraph-wise transcription by classifying each example in order to
predict which transcription method yields a lower error rate.
• Section 10.2 discusses a novel method for decoding predictions of a DNN trained
with CTC by extracting character n-grams and fuzzy search within a large dictionary
of possible strings. This method can be used to speed up the decoding process in
combination with CTC.
• Section 10.3 proposes a novel training method for optimizing DNNs towards esti-
mating the Edit-distance between a query string and a reference dictionary, keep-
ing both the query string and dictionary exchangeable. The DNN in this case only
learns to approximate the algorithm for computing the Edit-distance.
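The last two contributions both revolve around matching a recognized string against a
dictionary. As a rough orientation only, the following sketch combines the two
ingredients named above in their textbook form: character n-grams are used to shortlist
dictionary entries, and the exact edit distance (Levenshtein distance) then ranks the
shortlist. This is not the method of Section 10.2 or 10.3; the function names, the
padding symbol '#', the bigram size and the toy lexicon are all assumptions made for
illustration.

```python
from typing import List, Set, Tuple


def char_ngrams(text: str, n: int = 2) -> Set[str]:
    """Return the set of character n-grams of `text`, padded with '#'."""
    padded = "#" * (n - 1) + text + "#" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed by dynamic programming with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def fuzzy_lookup(query: str, lexicon: List[str],
                 shortlist_size: int = 3) -> List[Tuple[str, int]]:
    """Shortlist lexicon entries by n-gram overlap, then rank by edit distance."""
    query_grams = char_ngrams(query)
    # Cheap filter: entries sharing many n-grams with the query come first.
    by_overlap = sorted(
        lexicon,
        key=lambda word: len(query_grams & char_ngrams(word)),
        reverse=True,
    )
    shortlist = by_overlap[:shortlist_size]
    # Exact ranking on the shortlist only; ties are broken alphabetically.
    scored = [(word, edit_distance(query, word)) for word in shortlist]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


toy_lexicon = ["Konstanz", "Koblenz", "Hamburg", "Konstantin"]
print(fuzzy_lookup("Konstnaz", toy_lexicon))
```

The design point of such a two-stage lookup is that the expensive exact distance is only
computed for a few candidates, which matters when the dictionary is large.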
1.3 Publications
This doctoral thesis is based on the following publications of mine. They are listed in
reverse chronological order, starting with the most recent publication. All publications
in this list have gone through a peer-review process. The individual authors have been
asked for permission to use these publications in this thesis, and their individual
contributions are outlined in the following listing. The attribution of the contributions
of each author is written from the perspective of the author of this thesis.
There are also, of course, the general contributions of Daniel A. Keim and Matthias O.
Franz as my doctoral advisers. Both advisers contributed by teaching good research
practice, teaching machine learning and visual analytics, and proof-reading
publications. I would also like to point out that Marc-Peter Schambach, with his
experience in offline handwriting recognition and as a colleague at Siemens Parcel
Logistics, acted as an informal adviser throughout my doctoral research.
Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Dissecting Multi-Line
Handwriting for Multi-Dimensional Connectionist Classification.” In: 2019 15th IAPR
International Conference on Document Analysis and Recognition (ICDAR). Sept. 2019.
DOI: 10.1109/ICDAR.2019.00015
This was the second publication on multi-dimensional connectionist classification.
The largest part of this research and paper is my work. Marc-Peter Schambach engaged
with me in discussions on difficult cases in both multi-line alignment and decoding.
Matthias O. Franz continued with discussions on conditional random fields and how to
formulate MDCC in an expectation-maximization framework. My contribution was research
into multi-line text alignment using conditional random fields, multi-line decoding
algorithms and deep neural networks for offline handwriting recognition. I formulated
the CRF topology and decoding algorithm as proposed in this work. The implementation of
the algorithms, the experiments and the subsequent evaluation were also done by me. The
paper was written by me, incorporating the feedback of my co-authors. Both co-authors
proof-read the paper before publication.
Marc-Peter Schambach, Stephan von der Nüll, and Martin Schall. “Fast and Reliable
Acquisition of Truth Data for Document Analysis using Cyclic Suggest Algorithms.” In:
2019 International Conference on Document Analysis and Recognition Workshops
(ICDARW). vol. 2. Sept. 2019, pp. 7–12. DOI: 10.1109/ICDARW.2019.10030
The research into this topic is mainly the work of Marc-Peter Schambach, who also
conducted the corresponding implementation and experiments. He wrote this paper,
incorporating feedback after proof-reading by both co-authors. Stephan v.d. Nüll was the
team lead at Siemens Logistics of both Marc-Peter Schambach and me during the time
of this research, and he discussed the use cases for this work as well as the data format
and storage. My contributions were discussions on the requirements for capturing ground
truth data for multi-line text recognition. I also discussed with Marc-Peter Schambach
how to resolve the cyclic dependencies that occur during semi-automatic annotation of
ground truth data.
Martin Schall, Dominik Sacha, Manuel Stein, Matthias O. Franz, and Daniel A. Keim.
“Visualization-Assisted Development of Deep Learning Models in Offline Handwriting
Recognition.” In: Symposium on Visualization in Data Science (VDS) at IEEE VIS 2018.
Oct. 2018
This publication is the result of heatmap-based visualizations that I had created for
debugging multi-dimensional connectionist classification and subsequently formalized as a
visual analytics method. As such, the research into this visualization technique and the
matching workflow was mainly my work. Dominik Sacha contributed by discussing the
proposed workflow in the context of his Vis4ML [114] research. Manuel Stein provided
feedback on the visualization and the presentation of the workflow. Daniel A. Keim
engaged in discussions on the presentation of and argumentation for the heatmap-based
visualization. Matthias O. Franz discussed the interpretation of the heatmap technique in
the context of deep neural networks. My contribution was the development of the
heatmap-based visualization technique and the related workflow. I wrote the main body of
the paper and implemented this method for experimentation. All co-authors provided
feedback after proof-reading the paper, which I incorporated for the final publication.
Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Multi-Dimensional
Connectionist Classification: Reading Text in One Step.” In: 2018 13th IAPR
International Workshop on Document Analysis Systems (DAS). Apr. 2018, pp. 405–410.
DOI: 10.1109/DAS.2018.36
This was the first publication on multi-dimensional connectionist classification. The
main part of the research was done by me. Implementation, experimentation, evaluation
and writing of the paper were my work. Marc-Peter Schambach engaged with me in
discussions on offline handwriting recognition and multi-line text alignment. Matthias O.
Franz contributed by discussing conditional random fields and their applications with me.
My contribution was the research into applying a conditional random field and loopy
belief propagation to the problem of multi-line text alignment, including the
specification of the CRF topology for MDCC. I also formulated the multi-line decoding
algorithm proposed in this work. The implementation of the algorithms for both training
and decoding, the corresponding experimental setup and the evaluation were also done by
me. I wrote this paper and both co-authors provided feedback before publication.
Martin Schall, Haiyan P. Buehrig, Marc-Peter Schambach, and Matthias O. Franz.
“LSTM Networks for Edit Distance Calculation with Exchangeable Dictionaries.” In: 2018
13th IAPR International Workshop on Document Analysis Systems (DAS). Apr. 2018
The idea for this work came to me based on my previous working experience and
the question if calculating the Edit-distance can easily be accelerated using GPU hard-
ware. Haiyan P. Buehrig worked on this topic as his bachelor thesis and he performed
the implementation of the method and he performed the experiments and evaluation.
Marc-Peter Schambach contributed by proposing to directly learn the Edit-distance as
a metric. Matthias O. Franz was the supervising professor of this bachelor thesis. He
engaged in discussions on the deep neural network architecture, the experimental setup
and the interpretation of these results. My contribution was the research idea for this
work and I co-supervised this work together with Matthias O. Franz. I contributed ideas
on how to encode the dictionary and query strings for them to be suitable for deep neural
networks and I discussed the application of long short-term memory layers towards this
research goal. I, together with Matthias O. Franz, guided the experiments and how to
build on their results. The content of this paper was written by me, while incorporating
feedback provided by Marc-Peter Schambach and Matthias O. Franz.
Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Improving
gradient-based LSTM training for offline handwriting recognition by careful selection of
the optimization method.” In: BW-CAR| SINCOM (2016), p. 11
This paper is based on general observations while evaluating modern optimization
methods for offline handwriting recognition. I conducted the main part of the research and
experiments described in this paper and also wrote the paper itself. Marc-Peter Scham-
bach engaged with me in discussions on which properties of an optimization method are
useful for handwriting recognition. Matthias O. Franz guided the experimental setup for
this work and discussed with me the general properties of these modern optimization
methods. My contribution was the application of the modern optimization methods to
offline handwriting recognition based on their mathematical properties. I conducted the
implementation of methods and experiments for this paper. Both co-authors provided me
with feedback on the paper.
Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Increasing Robustness
of Handwriting Recognition Using Character N-Gram Decoding on Large Lexica.” In:
2016 12th IAPR Workshop on Document Analysis Systems (DAS). Apr. 2016,
pp. 156–161. DOI: 10.1109/DAS.2016.43
The largest part of the research into this topic and writing of the paper was conducted
by me. Marc-Peter Schambach provided the idea for this work as an introduction for
me into both offline handwriting recognition and the OCR software at Siemens Logistics.
He engaged with me in discussions on offline handwriting recognition and decoding al-
gorithms for it. Matthias O. Franz gave his feedback on the experimental results and
their discussion. My contribution was research into how to use and structure an n-gram
index for decoding in offline handwriting recognition. I also conducted the implementation
and evaluation of this method. The paper was written by me with feedback given by my
co-authors.
1.4 Organization of this Thesis
The section at hand outlines the structure and organization of this thesis. We will also
discuss decisions that led to this specific organization, in the hope of improving the
reading flow.
Chapter 2 discusses works and methods by other researchers and authors that serve
as a basis for the research in this doctoral thesis. This does not mean that the work
of this thesis is derived from the methods detailed in Chapter 2, but that it builds upon
them. The purpose of this chapter is to introduce the reader to the concepts necessary
for understanding the methods proposed in this thesis. Chapter 2 discusses machine
learning, conditional random fields, deep neural networks and expectation-maximization.
I expect that many readers are already familiar with these methods and may want to
skip this chapter, partially or in full. Section 2.2 discusses conditional random fields from
the perspective of defining the model and its parameters and using it for inference based on
given knowledge and assumptions. This is in contrast to using a CRF as a machine
learning model while automatically learning its parameters, which is not how CRFs are
used in this thesis.
Chapter 3 details methods that can be directly related to the methods proposed in this
thesis. These works are either compared to those in this thesis or their influence on this
thesis is shown. Section 3.1 discusses connectionist temporal classification, which is a
method for segmentation-free line-wise offline handwriting recognition. This is in contrast
to this work, which addresses paragraph-wise offline handwriting recognition. Sections
3.2 and 3.3 outline methods for paragraph-wise offline handwriting recognition that can
be directly compared to multi-dimensional connectionist classification.
Chapter 4 discusses the problem of multi-line offline handwriting recognition from
both a perspective of document analysis and computational complexity. It answers the
questions ‘Why is it hard to go from line-wise transcription to paragraph-wise transcrip-
tion?’ and ‘Why is it hard to go from one-dimensional sequence labeling to two- or multi-
dimensional sequence labeling?’.
Chapters 5, 6, 7, 8 and 9 represent the main body of work conducted in this doctoral
thesis.
Chapter 5 discusses the decoding algorithms for transcribing multi-line texts as pro-
posed in multi-dimensional connectionist classification. Chapter 6 details the training
method for optimization of a DNN in multi-dimensional connectionist classification. The
order of these two chapters is flipped in comparison to how MDCC is applied: one first
needs to train the artificial neural network and perform inference using this optimized
model before its prediction can be decoded. The reason for switching the order of these
two chapters is that MDCC relies on a latent variable, called the ‘soft-assignment’ or
‘alignment’ in this thesis, which needs explanation. Explaining the MDCC training al-
gorithm requires an explanation of how to predict this soft-assignment given an image
of handwritten text. The discussion of the decoding algorithm, on the other hand, relies
on how to retrieve a computer-processable string from this soft-assignment. The difference
here is that an image of handwritten text consists of a large number of tokens
(its pixels), with each token carrying only a small amount of information, and that it can be
counter-intuitive to discuss the properties of handwritten text from a visual perspective.
A computer string of natural language, on the other hand, consists of only a small number
of tokens (its characters), with each token carrying a large amount of information, and an
intuitive understanding of the structure of this string is given by our everyday reading
and writing on computers. It seems easier for the reader to first discuss the decoding
algorithms, followed by the training algorithms. This way the discussion of the decoding
algorithms provides a basis for detailing the soft-assignment.
Chapter 7 details the implementation of multi-dimensional connectionist classification,
its application to the IAM offline handwriting database[88] and the empirical evaluation
based on these experiments. This chapter contains an evaluation of MDCC itself, as well
as a direct comparison to the method of applying attention-networks for paragraph-wise
offline handwriting recognition. This chapter also discusses MDCC in relation to existing
works in publications of other authors.
Chapter 8 proposes a visualization technique for multi-dimensional connectionist clas-
sification. This visualization technique is embedded into a workflow for guiding an expert
user to optimize the hyper-parameters of a DNN for MDCC, as well as allow the identi-
fication of further error sources in MDCC. This chapter is based on the observation that
automatic identification of error sources and automatic hyper-parameter optimization in
MDCC is difficult, but much easier if the human expert user is taken into account. This
chapter proposes a workflow for improving the DNN model trained with MDCC by putting
the human expert user into the loop.
Chapter 9 discusses methods for combining line- and paragraph-wise offline hand-
writing recognition. The proposed methods do this by classifying each handwritten text in
order to infer which of the two methods should be applied. The goal of these methods is
to improve the combined transcription error rate in line- and paragraph-wise transcription.
Chapter 10 presents novel research conducted in the context of this doctoral thesis,
but which is not part of the ‘main story line’ of this thesis. Section 10.2 details an ap-
proach to single-line decoding in the context of connectionist temporal classification by
applying a fuzzy search within a large database of valid strings. Section 10.3 discusses
a method for the application of long short-term memory networks in order to estimate
the Edit-distance between a query string and the entries of a dictionary. Both methods
are ideas for improving the line-decoding in connectionist temporal classification or multi-
dimensional connectionist classification.
Chapter 2
Background
2.1 Machine Learning
In this section we will discuss some general machine learning (ML) concepts, practical
approaches as well as terminology used in this thesis. I would like to start by quoting
Kevin P. Murphy[93]:
With the ever increasing amounts of data in electronic form, the need for au-
tomated methods for data analysis continues to grow. The goal of machine
learning is to develop methods that can automatically detect patterns in data,
and then to use the uncovered patterns to predict future data or other out-
comes of interest.
This raises the question of what type of data is processed in ML. In general, a
wide variety of data can be and is processed by ML systems. Most common in scientific and
industrial applications are probably images, videos, sound recordings, financial data and
abstract measurements, e.g. temperature, noise level, electrical current or GPS positions
for a single point in time or for a whole time series. Data in ML can either be labeled or
unlabeled, where labeled means that semantic information is attached to the data. This might be
information about the content of the data, e.g. whether the image shows a dog or a cat.
As quoted above, the goal of ML is to find patterns within the data and exploit these
patterns to predict useful but unobserved values. This involves a training process in which
a ML model is tailored towards the specific task and data at hand. Depending on how the
data is collected, possibly labeled, or generated, different training paradigms like super-
vised learning, unsupervised learning or reinforcement learning are applied. Supervised
learning uses the observed data, but also has information about the true outcome of the
prediction step. This makes it possible to check the predictions made by the ML model and correct
them towards the true predictions that are known beforehand. As both the observed data
and the true prediction outcomes are known beforehand, this is the most controlled training
environment in machine learning. Unsupervised learning applies only the observed
data to the training process, but has no information about the prediction outcome beforehand.
The ML model in unsupervised learning is to uncover patterns in the data
and make useful predictions on its own. Clustering of data is one case of unsupervised
learning. Reinforcement learning[142] is not based on data known beforehand at all,
but instead generates data during training by executing actions within a simulated environment,
for example a simulation of a driving car. The training paradigm in
reinforcement learning is to observe the state of the environment generated by the simulation,
execute an action as suggested by the ML model and then observe a scalar
reward that is to be maximized during learning. We see that these are three very different
training schemes with different semantic information about the data and task at hand,
which nevertheless all fall into the general machine learning category of uncovering and exploiting
patterns in the observed data.
Next we will discuss different types of tasks that models in machine learning may
solve. First, a model in ML refers to an abstract representation of the patterns that have
been detected or knowledge that has been learned during training. This representation
is in modern ML methods often in the form of statistical correlations in the data or e.g. in
semantic representations like decision trees. A model in ML allows inference from known
to unseen data by applying the patterns uncovered during training. Typical models in
ML are e.g. artificial neural networks (ANNs)[126][7, ch. 5] or support-vector machines
(SVMs)[10, 30, 127]. A task in ML describes the goal that inference on the model should
solve. The two typical tasks are regression and classification.
Figure 2.1.1: Regression task of predicting a scalar value based on observed features.
Figure 2.1.1 shows a set of example data and example model for a regression task.
Regression interpolates or extrapolates scalar values from observed features and is a general
method in statistics. The black crosses in Figure 2.1.1 mark the data points, whereas the
blue dotted line is a partial plot of a linear regression model.
Figure 2.1.2: Classification task of predicting a class assignment based on observed features.
Figure 2.1.2 exemplifies a classification task in which data points are assigned to one
of two or more classes. Classes in this case relate to different semantic concepts within
the data, e.g. the cats or dogs in images for object classification. The figure shows data
points as crosses, with the two classes in red and green. The model is again visualized
as a blue dotted line and is in this case a linear classifier that describes the separating
plane between the two classes.
The task that we will discuss in this thesis is called sequence labeling[43], which is
the assignment of a specific sequence of labels, e.g. a character string, over one or more
discrete spatial or temporal dimensions. The length of the label sequence is smaller than
or equal to the size of these temporal or spatial dimensions. It follows that not every
point in the temporal or spatial dimensions has a unique label and labels may span over
several points with their exact location and extent being unknown. This is a variant of
a classification task, but applied to temporal or spatial problems. Prominent examples
of this type of task are the transcription of text from audio, e.g. voice recognition in smart
assistant devices, and the transcription of text from images.
We have briefly discussed different training paradigms, as well as what a model is
in general and what types of tasks there are in machine learning. How is learning in
machine learning facilitated? Learning is done by parameterization of models in the form
of automatically adapted parameters and user-defined hyper-parameters. Parameters
in the context of this thesis refer to all variables within a model that are automatically
optimized using a training method such as gradient descent. Gradient descent sets
up the learning as an optimization problem over the parameters of the model, in which a
specific optimization criterion needs to be satisfied, e.g. a loss function that is to
be minimized. We will discuss gradient descent for deep neural networks in Section 2.3.
Parameters in deep neural networks, also discussed later, are the weights and biases
that define the linear combination of artificial neurons. Hyper-parameters are variables in
the model or optimization algorithm that are user-defined and not automatically chosen.
Hyper-parameters are for example the number of layers in a multi-layer perceptron, the
learning rate in gradient descent or even the choice of the modeling method itself.
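The distinction can be illustrated with a minimal sketch, assuming a one-dimensional linear model and a mean squared error loss. Both are chosen here purely for illustration and are not the models used in this thesis. The weight `w` and bias `b` are parameters adapted by gradient descent, whereas `learning_rate` and `num_steps` are hyper-parameters set by the user; all names are hypothetical.

```python
def fit_linear(xs, ys, learning_rate=0.5, num_steps=500):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # parameters: adapted automatically during training
    n = len(xs)
    for _ in range(num_steps):
        residuals = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of mean(residual ** 2) with respect to w and b.
        grad_w = 2.0 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2.0 * sum(residuals) / n
        # learning_rate is a hyper-parameter: chosen by the user, not learned.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Noise-free toy data generated from w = 3 and b = 1.
xs = [i / 19 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = fit_linear(xs, ys)
```

On this toy data the fitted parameters approach the generating values. A learning rate that is too large would make the same loop diverge, which is why this hyper-parameter has to be chosen with care.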
The distinction between parameters and hyper-parameters in the context of an opti-
mization problem requires us to think about how to evaluate specific solutions in the form
of parameter sets for models. In a supervised setting, a set of input data and true pre-
dictions are provided and can be applied to both optimization and evaluation. In order
to conduct this in a statistically fair and comparable fashion, this data set should be split
into several disjoint parts. Splitting should be done in such a way that general patterns of
the problem at hand will occur in all parts, but patterns not correlated to the task will be
restricted to individual splits. For example in the transcription of handwritten text, all char-
acters should occur in all splits and all splits should contain many different texts that are
valid in the language at hand. On the other hand, individual writers should be restricted
to their specific split. This is done to prevent the machine learning model from becoming
sensitive to patterns that are unrelated to the task at hand but still help to improve the optimization
criterion. This way we will be able to detect whether the model is sensitive to patterns
unrelated to the task by evaluating it on the other data set splits. Into how many disjoint
parts do we need to split the available data for a fair set-up of supervised training? Su-
pervised training requires data for automatic optimization of the parameters of the model.
After that the model is potentially overly sensitive to the specific patterns that occur in this
training set. This is known as overfitting, where the error on the training set for automatic
optimization is smaller than the error on previously unseen data. To combat this, we need
to split the available data into at least two parts: One for automatic optimization and one
for evaluation in order to detect overfitting. This covers the automatic optimization, but
models that include hyper-parameters are also manually tuned by the model engineer.
Of course this also introduces its own overfitting since the model engineer will choose
hyper-parameters that lead to a low error rate on the available data sets. This means the
overall optimization process (automatic and manual) is now prone to overfitting on the
available data. We can combat this with the same approach as with automatic optimization,
by splitting the unseen data into yet another part. We now have three disjoint parts of
the available data in supervised learning: the training set, which is used for automatic optimization
of the parameters; the validation set, which is used to detect overfitting in the
automatic optimization and also for manual optimization of hyper-parameters; and the
evaluation set or test set, which facilitates detection of overfitting caused by
ill-chosen hyper-parameters. Automatic optimization of models is stateless in the sense that
models and optimization algorithms do not carry over information from different models
or optimization runs. This is not the case for humans and, by extension, not the case for the
model engineer. Therefore, in a perfect world the evaluation set should be hidden from the
model engineer and only used for evaluation after the seemingly best hyper-parameters
have been identified.
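The writer-disjoint three-way split described above can be sketched as follows. This is a minimal illustration under assumptions: samples are given as hypothetical `(writer_id, item)` pairs, and the split fractions and the function name `split_by_writer` are made up for this example rather than taken from the experiments in this thesis.

```python
import random


def split_by_writer(samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (writer_id, item) pairs so that each writer is in one split only."""
    writers = sorted({writer for writer, _ in samples})
    random.Random(seed).shuffle(writers)
    n_train = int(fractions[0] * len(writers))
    n_val = int(fractions[1] * len(writers))
    groups = {
        "train": set(writers[:n_train]),
        "validation": set(writers[n_train:n_train + n_val]),
        "test": set(writers[n_train + n_val:]),  # remaining writers
    }
    return {
        name: [sample for sample in samples if sample[0] in ids]
        for name, ids in groups.items()
    }


# Toy data: ten hypothetical writers with three text lines each.
samples = [(f"writer{w}", f"line{w}_{k}") for w in range(10) for k in range(3)]
splits = split_by_writer(samples)
```

Splitting by writer instead of by sample keeps writer-specific patterns out of the other splits, so a model that merely memorizes writers shows up as a gap between training and validation error.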
2.2 Conditional Random Fields
A conditional random field (CRF) is a graphical model of a multivariate probability dis-
tribution. It defines the joint distribution of multiple random variables in dependency of
observed variables. CRFs are a specialization of Markov random fields (MRFs) in
that CRFs allow their random variables to be conditioned on observed variables. MRFs are
in turn a variant of graphical models that includes undirected dependencies between random
variables and allows cycles in the graph structure. We will start by introducing graphical
models in general and building up to CRFs from there.
Figure 2.2.1: Example of a graphical model. (Nodes y0, y1 and y2, each with the states A to Z; the figure spells out the states T, H and E.)
Figure 2.2.1 shows a small example graphical model that we will use to point out the
basic terminology. It shows three random variables or nodes Y0 to Y2 that are dependent
on each other in an ascending sequential order. Each random variable has the set of
characters A through Z as its discrete states. We could now use this graphical model to
define the joint distribution of words of three characters in length. In this case we would
expect ‘THE’ to have a high probability, whereas for example ‘AXZ’ should have a low
probability. We will further discuss the definition of joint probabilities for graphical models
in this section.
Equations and algorithms in this section are based on and similar to, but not neces-
sarily identical to, the ones stated by Kevin P. Murphy[93].
Bayesian Networks
A directed graphical model (DGM)[93, ch. 10] uses a graph structure to define the joint
distribution of multiple random variables by encoding variables as nodes and their depen-
dencies as edges. As the name suggests, their dependencies are unidirectional with the
conditioning of two random variables yi and yj in the form of P (yi|yj) instead of P (yi, yj).
If the DGM does not contain any cyclic dependencies - meaning a random variable can-
not be indirectly conditioned on itself - it is also called a Bayesian network, which will be
the first type of DGM that we will discuss in detail.
Figure 2.2.2: Bayesian networks are one type of a directed graphical model. (Nodes y0 to y4.)
A Bayesian network is directed and acyclic, which means its nodes can be topolog-
ically ordered. The joint distribution that it defines is then the probability distribution of
each random variable, conditioned on its predecessors. The joint distribution
P (y) = P (y0)P (y1|y0)P (y2|y0, y1)P (y3|y0, y1, y2)P (y4|y0, y1, y2, y3) (2.2.1)
is generally applicable to Bayesian networks with five nodes y0 through y4.
The Markov assumption states that the distribution of one random variable is condi-
tionally independent of the remaining graphical model given its direct neighbors. This is
also called the Markov blanket in graphical models. Applying the Markov assumption to
Equation 2.2.1 allows us to simplify the joint distribution to
P (y) = P (y0)P (y1|y0)P (y2|y0)P (y3)P (y4|y1, y2, y3) (2.2.2)
for the model in Figure 2.2.2.
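As a small check of this factorization, the following sketch evaluates Equation 2.2.2 for binary random variables. The conditional probability tables are hypothetical values invented for this example; only the structure of the product follows the equation.

```python
from itertools import product

# Hypothetical conditional probability tables for binary variables.
# Each value is the probability of the variable taking the state 1.
P_Y0 = 0.3
P_Y1_GIVEN_Y0 = {0: 0.2, 1: 0.7}
P_Y2_GIVEN_Y0 = {0: 0.5, 1: 0.1}
P_Y3 = 0.6


def p_y4_given(y1, y2, y3):
    """Hypothetical P(y4 = 1 | y1, y2, y3); stays within [0.1, 0.85]."""
    return 0.1 + 0.25 * (y1 + y2 + y3)


def bernoulli(p_one, value):
    """Probability of a binary value given the probability of state 1."""
    return p_one if value == 1 else 1.0 - p_one


def joint(y0, y1, y2, y3, y4):
    """Joint probability following the factorization of Equation 2.2.2."""
    return (bernoulli(P_Y0, y0)
            * bernoulli(P_Y1_GIVEN_Y0[y0], y1)
            * bernoulli(P_Y2_GIVEN_Y0[y0], y2)
            * bernoulli(P_Y3, y3)
            * bernoulli(p_y4_given(y1, y2, y3), y4))


# Summing over all 2 ** 5 assignments must give 1 for a valid distribution.
total = sum(joint(*config) for config in product((0, 1), repeat=5))
```

Because every factor is a normalized conditional distribution, the product needs no additional normalization constant. This is the main practical difference to the undirected models discussed next.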
Markov Random Fields
We will now move on from Bayesian networks to Markov random fields (MRFs)[66, 82][93,
ch. 19], which generalize the concept of graphical models to undirected and cyclic graph
structures.
Figure 2.2.3 shows one example MRF. Please note that the graphical models of Fig-
ures 2.2.2 and 2.2.3 do not define the same probability distribution since converting from
unidirectional to bidirectional dependencies introduces additional conditionals for the
random variables.
Topological ordering of the nodes in the graph is not possible in a MRF since the
graph includes cyclic structures. We will introduce the concept of potential functions to
define the joint distribution.
Figure 2.2.3: Markov random fields allow undirected cyclic graph topologies. (Nodes y0 to y4.)
At this point it is prudent to differentiate between continuous and discrete MRFs. In
a continuous MRF each random variable has a continuous scalar as its state. Discrete
MRFs assign discrete states to their random variables. This difference gives rise to different
definitions of the joint distribution and inference. Continuous values require integrals
for the joint distribution and inference, whereas discrete values simplify to the sum over
the finite number of states. Since this thesis deals with discrete states, we will only discuss
the case for discrete MRFs.
Assigning one specific discrete state to each of the random variables of a MRF is
called a configuration. As such, each configuration is one overall discrete state in which
the MRF can be. Since the number of states per random variable is finite, the number of
configurations of the MRF is also finite. Using the MRF from Figure 2.2.3 as an example
and assuming two discrete states for each of the five random variables, the MRF in total
has 2^5 = 32 different configurations. The concept of a configuration comes into play when
defining the joint distribution of a MRF.
The Hammersley-Clifford theorem[50, 70, 77] states the conditions that are necessary
for the joint distribution of a probabilistic graphical model to be defined by the product of
its maximal clique potentials. Luckily, any non-negative function that is dependent on the
random variables within the clique can be used as a potential function. A clique in graph
theory is defined as a subset of nodes of a graph in such a way that every node is a
neighbor of every other node in the clique. A maximal clique is a clique where no more
nodes from the graph can be added while still retaining the clique property.
A MRF is parameterized by assigning a potential function to each maximal clique
of the graph. Please note that each node of the MRF can be part of multiple maximal
cliques. The potential function defines the ‘compatibility’ of the nodes within the clique.
The value of the potential function should be non-negative and higher if the states of
the random variables within the clique are more ‘compatible’ to each other. The joint
distribution of the MRF is then proportional to the product of its maximal clique potentials.
This gives rise to the definition of the joint distribution
\[ P(y) \propto \prod_{C} \psi_C(y_C) \tag{2.2.3} \]
of a MRF with ψC(yC) being the potential function of the maximal clique C. Normalization
leads to the equation for the joint distribution
\[ P(y) = \frac{1}{Z} \prod_{C} \psi_C(y_C) \tag{2.2.4} \]
of a MRF with the Zustandssumme or partition function
\[ Z = \sum_{\pi} \Big[ \prod_{C} \psi_C(\pi_C) \Big] \tag{2.2.5} \]
being the sum over all possible configurations π of the random field.
Assigning potentials to the maximal cliques of a MRF is only one way to parameterize
the MRF. Another way is to assign potentials to each edge of the MRF and the two
nodes it connects[93, ch. 19.3.1], which then is called a pairwise MRF. This type of
parameterization is used for the remainder of this thesis.
A pairwise MRF is parameterized by two potential functions. Node potential function
ψs(ys) defines the potential for the state ys of node s. Similarly, the edge potential function
ψs,t(ys, yt) defines the potential for states ys, yt of two neighboring nodes s and t. This
then leads to the joint distribution
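\[ P(y) = \frac{1}{Z} \prod_{s} \psi_s(y_s) \prod_{s \sim t} \psi_{s,t}(y_s, y_t) \tag{2.2.6} \]
of a discrete MRF, which is applicable to a wide variety of problems. Relation ∼ defines
edges within the graphical model and
\[ Z = \sum_{\pi} \Big[ \prod_{s} \psi_s(\pi_s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \Big] \tag{2.2.7} \]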
again is the partition function.
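For a very small model, Equations 2.2.6 and 2.2.7 can be evaluated directly by enumerating all configurations. The sketch below assumes five binary nodes, an edge set containing one cycle, uniform node potentials and an edge potential that favors equal states. All of these choices are assumptions made for illustration and not the parameterization used later in this thesis.

```python
from itertools import product

STATES = (0, 1)
NUM_NODES = 5
# Assumed edge set for illustration; it contains the cycle 0-1-4-2-0.
EDGES = [(0, 1), (0, 2), (1, 4), (2, 4), (3, 4)]


def node_potential(s, ys):
    """Uniform node potential: no state is preferred."""
    return 1.0


def edge_potential(ys, yt):
    """Equal neighboring states are twice as compatible as unequal ones."""
    return 2.0 if ys == yt else 1.0


def unnormalized(config):
    """Product of node and edge potentials, the numerator of Equation 2.2.6."""
    value = 1.0
    for s in range(NUM_NODES):
        value *= node_potential(s, config[s])
    for s, t in EDGES:
        value *= edge_potential(config[s], config[t])
    return value


# Partition function of Equation 2.2.7: a sum over all 2 ** 5 configurations.
configs = list(product(STATES, repeat=NUM_NODES))
Z = sum(unnormalized(config) for config in configs)


def joint(config):
    """Joint probability of one configuration following Equation 2.2.6."""
    return unnormalized(config) / Z
```

The enumeration makes the role of Z explicit: it is the only term that couples all configurations, and it is also the term that becomes intractable for larger models.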
The Ising model [19, 61] is a basic example for a Markov random field. It is named after
Ernst Ising and describes the spin of atoms in ferromagnets and anti-ferromagnets. Each
node of the MRF in an Ising model represents one atom with two states yi ∈ {−1,+1} for
a negative or positive spin respectively. The Ising model is a pairwise MRF with its edge
potential function
\[ \psi_{s,t}(y_s, y_t) = \begin{pmatrix} e^{w_{st}} & e^{-w_{st}} \\ e^{-w_{st}} & e^{w_{st}} \end{pmatrix} \tag{2.2.8} \]
encoding the ‘compatibility’ of equal spin of neighboring atoms on the diagonal and of
non-equal spin on the off-diagonal. wst is zero for non-neighboring nodes (atoms that
do not influence each other’s spin). A simplification of this model is to define wst = J
for all nodes s and t. The node potential function ψs(ys) in an Ising model is defined as
ψs(ys) = e^0 = 1 for all node-state combinations, since no prior for an atom's spin is assumed.
There are now three different possibilities for the behavior of an Ising model. First,
we can define J > 0, modeling a material in which the spins of adjacent atoms tend
to be identical. This is the case for ferromagnets. Second, a value of J < 0 favors
configurations of the material in which the spins of neighboring atoms are different, which
is the case for anti-ferromagnets. Such a MRF is also called a ‘frustrated system’ since
nodes in an Ising model are ordered in a grid structure and choosing J < 0 means
that there exists no configuration of the MRF in which there are no edges with a low
compatibility. At least one neighboring relation encoded in the edge potential function will
always be contradicted by such a frustrated system. The third possibility in an Ising model
is to define J = 0, modeling a material in which atoms do not influence each other’s spin.
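The three cases can be reproduced with a minimal sketch. As an assumption for illustration it uses three mutually neighboring atoms (a triangle) instead of a full grid, since this is the smallest graph on which the frustration for J < 0 is visible. The edge potential follows Equation 2.2.8 with w_st = J, written as exp(J · y_s · y_t).

```python
import math
from itertools import product

# Assumption for illustration: three mutually neighboring atoms (a triangle).
EDGES = [(0, 1), (1, 2), (0, 2)]


def ising_distribution(J):
    """Joint distribution of a pairwise MRF with w_st = J on every edge."""
    def unnormalized(spins):
        # Edge potential of Equation 2.2.8: exp(J) for equal spins and
        # exp(-J) for unequal spins; all node potentials are 1.
        return math.prod(math.exp(J * spins[s] * spins[t]) for s, t in EDGES)

    configs = list(product((-1, +1), repeat=3))
    Z = sum(unnormalized(config) for config in configs)
    return {config: unnormalized(config) / Z for config in configs}


ferro = ising_distribution(J=1.0)        # equal neighboring spins favored
anti = ising_distribution(J=-1.0)        # unequal neighboring spins favored
independent = ising_distribution(J=0.0)  # spins do not influence each other
```

With J > 0 the two configurations of identical spins are the most probable ones, and with J = 0 all eight configurations are equally likely. With J < 0 every configuration of the triangle contains at least one pair of neighbors with equal spin, so no configuration satisfies all edge potentials at once.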
Conditional Random Fields
Transitioning from Markov random fields to conditional random fields (CRFs)[77][93, ch.
19.6] is a straightforward step, since CRFs basically are only MRFs that are conditioned
on observed variables. Figure 2.2.4 shows our familiar graphical model, this time as a
CRF with conditioning on five observed variables x0, ..., x4.
The joint distribution of a CRF is defined in a similar fashion to the joint distribution of a
MRF, that is, either by potential functions over maximal cliques or, as in the equations
given in this section, as a pairwise CRF with node potential function ψs(ys|xs) and edge
potential function ψs,t(ys, yt). The difference from a MRF is that the node potential function
is conditioned on the observed variables x. Each variable y of a conditional random field is
conditioned on one observed variable x.
Figure 2.2.4: Conditional random fields are MRFs with conditioning on observed variables. (Nodes y0 to y4, each paired with an observed variable x0 to x4.)
The parameterization of a CRF is dependent on observed variables, with the node
potential function being ψs(ys|xs) and edge potential function ψs,t(ys, yt). This leads to
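the joint probability
\[ P(y \mid x) = \frac{1}{Z} \prod_{s} \psi_s(y_s \mid x_s) \prod_{s \sim t} \psi_{s,t}(y_s, y_t) \tag{2.2.9} \]
with the partition function Z being the normalization over all configurations π:
\[ Z = \sum_{\pi} \Big[ \prod_{s} \psi_s(\pi_s \mid x_s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \Big] \tag{2.2.10} \]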
An example application of CRFs very much worth mentioning in the context of this the-
sis is that of handwriting recognition. In such a CRF the label space yi ∈ A is the alphabet
A of the language at hand. The edge potential function ψs,t(ys, yt) encodes a probabilistic
language model, that is the likelihood of observing specific character 2-grams in this lan-
guage. For example the 2-gram ‘th’ has a higher likelihood and thus higher compatibility
than ‘xq’ in the English language. The node potential function ψs(ys|x) encodes a discriminative
classifier that produces a high node-state compatibility if the node of the CRF
(a pixel in offline handwriting recognition or a time step in online handwriting recognition)
corresponds to the specific glyph from the alphabet A.
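The interplay of the two potential functions can be sketched for a chain of three nodes, similar to the model of Figure 2.2.1. The alphabet is reduced to four characters, and both the classifier scores and the 2-gram compatibilities are hypothetical numbers chosen for this example, not values learned from data.

```python
from itertools import product

ALPHABET = "EHTX"
# Hypothetical classifier outputs psi_s(y_s | x_s) for three positions.
NODE_SCORES = [
    {"T": 0.6, "X": 0.3, "E": 0.05, "H": 0.05},
    {"H": 0.4, "X": 0.5, "E": 0.05, "T": 0.05},
    {"E": 0.7, "X": 0.1, "H": 0.1, "T": 0.1},
]
# Hypothetical 2-gram compatibilities; unlisted pairs get a low default.
BIGRAM = {("T", "H"): 5.0, ("H", "E"): 5.0}


def edge_potential(a, b):
    """Compatibility of two neighboring characters under the 2-gram model."""
    return BIGRAM.get((a, b), 0.5)


def unnormalized(word):
    """Product of node and edge potentials, the numerator of Equation 2.2.9."""
    value = 1.0
    for s, char in enumerate(word):
        value *= NODE_SCORES[s][char]
    for a, b in zip(word, word[1:]):
        value *= edge_potential(a, b)
    return value


words = ["".join(chars) for chars in product(ALPHABET, repeat=3)]
# Most compatible word when node and edge potentials are combined.
best_word = max(words, key=unnormalized)
# Per-position choice that ignores the edge potentials entirely.
greedy_word = "".join(max(scores, key=scores.get) for scores in NODE_SCORES)
```

In this toy setting the per-position choice prefers ‘X’ in the middle position, whereas the combined model selects ‘THE’, because the edge potentials outweigh the slightly higher classifier score.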
The model parameters of both edge and node potential functions in such a CRF for
handwriting recognition are learned from a data set of examples of the language at hand.
This model optimization for CRFs is not discussed here since it is not part of this thesis,
which deals with the optimization of deep neural networks and applies conditional random
fields for inference. Training the model parameters of a CRF is discussed in literature[93,
ch. 19.6.3].
There are three differences between the CRFs described above and the method presented
in this thesis: first, the node potential function of the CRFs in this thesis does not
itself encode a discriminative classifier, but instead relies on a deep neural network as the
best available classifier, which is then iteratively improved. Second, the edge potential
function in multi-dimensional connectionist classification is not a probabilistic description
of a whole language model, but instead a discrete description of one specific example
from a data set. Third, the model parameters of the CRFs in this thesis are defined by a
fixed set of rules as discussed in Section 6.2 and not learned from a data set. Overall,
the CRF of this example is much more similar to the deep neural network in this thesis, in
that both serve as discriminative classifiers for glyphs.
A concrete example for the application of CRFs is stereo vision[139] in which the
depth dimension in physical space is estimated from two corresponding images that show
a disparity in the horizontal axis between each other. Human vision can be seen as
an example of this, where the vision from both eyes is combined for depth estimation.
Observation x in stereo vision is a pair of images xL from the left camera and xR from
the right camera. The labels yi encode the horizontal disparity between the two images,
which in this case is discrete since the disparity is measured in whole pixels. Let is be
the horizontal and js the vertical position of the pixel s in question.
The node potential function
$$\psi_s(y_s \mid x) = \exp\left[-\frac{1}{2\sigma^2}\left(x_L(i_s, j_s) - x_R(i_s + y_s, j_s)\right)^2\right] \qquad (2.2.11)$$
encodes a Gaussian prior with the assumption that corresponding pixels from the two
camera images will show a similar pixel intensity or color.
The edge potential function
$$\psi_{s,t}(y_s, y_t) = \exp\left[-\frac{1}{2\gamma^2}\left(y_s - y_t\right)^2\right] \qquad (2.2.12)$$
encodes a Gaussian prior of neighboring pixels having similar disparities between the
two camera images.
Inferring the node states $y_s$ of such a CRF for stereo vision means inferring
the disparity between the two camera images for each pixel. This disparity is inversely
related to the distance between the camera system and the physical object that was
captured: a high disparity indicates a small distance to the object and a low disparity a
large distance.
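The two potential functions of Equations 2.2.11 and 2.2.12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the row-major image layout `image[j][i]`, the default values of σ and γ and the toy one-row images are hypothetical and not taken from the cited work.

```python
import math

def node_potential(x_left, x_right, i, j, d, sigma=1.0):
    """Eq. 2.2.11: close to 1 when left pixel (i, j) matches right pixel (i + d, j)."""
    diff = x_left[j][i] - x_right[j][i + d]
    return math.exp(-diff ** 2 / (2.0 * sigma ** 2))

def edge_potential(d_s, d_t, gamma=1.0):
    """Eq. 2.2.12: close to 1 when neighboring pixels have similar disparities."""
    return math.exp(-(d_s - d_t) ** 2 / (2.0 * gamma ** 2))

# Toy one-row images (assumption): the right image is the left one shifted by 2 pixels.
x_left = [[0.0, 5.0, 9.0, 0.0, 0.0, 0.0]]
x_right = [[0.0, 0.0, 0.0, 5.0, 9.0, 0.0]]

# Score every candidate disparity for the pixel at i = 1, j = 0.
scores = {d: node_potential(x_left, x_right, i=1, j=0, d=d) for d in range(4)}
best_disparity = max(scores, key=scores.get)
```

Here the node potential alone already prefers the disparity that aligns the two intensity values; in a full CRF the edge potential would additionally favor smooth disparity maps.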
Belief Propagation
The last sections discussed the concepts of graphical models and how to define their
joint distribution. Some use cases require inference in such graphical models, for
example estimating the local posterior marginals of individual random variables. This is
the case in this thesis.
In general there are two different types of inference on graphical models: computing
the local posterior marginals or computing the local maximum a posteriori (MAP).
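The local posterior marginals in an MRF are defined as follows:
$$P(y_i) = \frac{1}{Z} \sum_{\pi : \pi_i = y_i} \left[ \prod_s \psi_s(\pi_s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \right] \qquad (2.2.13)$$
The local MAP is defined as follows:
$$y^\star = \operatorname*{argmax}_{\pi} \frac{1}{Z} \left[ \prod_s \psi_s(\pi_s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \right] \qquad (2.2.14)$$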
Evaluating the local posterior marginals or the local MAP directly from the definitions
in Equations 2.2.13 and 2.2.14 is computationally prohibitive since they require the
enumeration of all configurations π of the graphical model. The number of possible
configurations π for a graphical model of n random variables with m discrete states is $m^n$,
which is too large even for simple models. The complexity of inference in general graphical
models is the topic of works[2, 15, 23, 80] by other authors.
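To make the cost of the direct definition concrete, the following sketch computes the local posterior marginals of Equation 2.2.13 by brute-force enumeration. The function name, the shared edge potential and the three-node chain are assumptions chosen for illustration; the loop visits all $m^n$ configurations, which is only feasible for toy models.

```python
import itertools

def brute_force_marginals(node_pot, edge_pot, edges, n_states):
    """Local posterior marginals by enumerating every configuration (Eq. 2.2.13)."""
    n = len(node_pot)
    marginals = [[0.0] * n_states for _ in range(n)]
    z = 0.0  # partition function Z
    for pi in itertools.product(range(n_states), repeat=n):
        score = 1.0
        for s in range(n):
            score *= node_pot[s][pi[s]]
        for s, t in edges:
            score *= edge_pot[pi[s]][pi[t]]
        z += score
        for s in range(n):
            marginals[s][pi[s]] += score
    return [[v / z for v in row] for row in marginals]

# Toy chain y0 - y1 - y2 with two states per node (assumed values).
node_pot = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]
edge_pot = [[2.0, 1.0], [1.0, 2.0]]  # neighbors prefer equal states
edges = [(0, 1), (1, 2)]
n_states = 2

n_configs = n_states ** len(node_pot)
marginals = brute_force_marginals(node_pot, edge_pot, edges, n_states)
```

Already for a 10 x 10 grid with 50 states per node the same loop would have to visit $50^{100}$ configurations, which motivates the message passing algorithms discussed next.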
Figure 2.2.5: Graphical models in the form of a polytree allow efficient exact inference.
Efficient exact inference in a graphical model is possible if the graphical model is a
polytree[24]. A polytree is a directed acyclic graph of which the underlying structure is
an undirected tree. Figure 2.2.5 shows an example polytree. One algorithm for exact
inference in graphical models is called belief propagation (BP)[100][93, ch. 20] and is a
message passing algorithm.
When discussing BP, the terms belief and message occur frequently. A belief is an
estimation of the probability distribution of one random variable based on the currently
available information. A message is the assumption about the probability distribution of
a neighboring random variable based on the own belief. Beliefs and messages influence
each other since beliefs are formed by collecting evidence from neighboring random vari-
ables via messages. Messages towards neighboring random variables are based on the
belief about the source random variable.
BP can be broken down into three individual steps. First, choose one of the nodes of
the graphical model as the root node. Second, collect evidence for the probability
distribution of the root node by message passing from the leaf nodes to the root node. Equations
2.2.15 and 2.2.16 below model this step. Third, update the beliefs of the remaining nodes
by message passing from the root node towards the leaf nodes. Equations 2.2.17 and
2.2.18 define this third step. The beliefs of the root node will be correct after the second
step, which collects all the evidence for the probability distribution of the root node. From
there the third step will correct the beliefs of the remaining nodes.
In the second step, incomplete beliefs
$$\mathrm{bel}^-_t(y_t) \propto \psi_t(y_t) \prod_{c \in \mathrm{child}(t)} m^-_{c \to t}(y_t) \qquad (2.2.15)$$
are built by collecting evidence from child nodes by propagating messages
$$m^-_{s \to t}(y_t) = \sum_{y_s} \psi_{s,t}(y_s, y_t)\, \mathrm{bel}^-_s(y_s) \qquad (2.2.16)$$
towards the root node.
Passing evidence from the root node towards the leaf nodes yields the correct, but
unnormalized, beliefs
$$\mathrm{bel}_s(y_s) \propto \mathrm{bel}^-_s(y_s) \prod_{t \in \mathrm{parent}(s)} m^+_{t \to s}(y_s) \qquad (2.2.17)$$
of the nodes of the graphical model with the messages
$$m^+_{t \to s}(y_s) = \sum_{y_t} \psi_{t,s}(y_t, y_s) \frac{\mathrm{bel}_t(y_t)}{m^-_{s \to t}(y_t)} \qquad (2.2.18)$$
being propagated from parent to child nodes in this step. The messages $m^+_{t \to s}(y_s)$
carry the evidence collected about node t and yield an incomplete state distribution for node s. It
is important that the evidence about the state distribution of node t does not incorporate
the node s, to which the message is propagated. Multiplying the incomplete
state distributions collected by the messages $m^+$ yields the true state distribution of
$y_s$.
Equation 2.2.18 can be further generalized by recognizing that ignoring the evidence
from node s about node t is not only possible by dividing by the corresponding
message $m^-_{s \to t}(y_t)$, but also by simply leaving out node s while collecting evidence about t.
This produces another variant of Equation 2.2.18:
$$m^+_{t \to s}(y_s) = \sum_{y_t} \psi_{t,s}(y_t, y_s)\, \psi_t(y_t) \prod_{c \in \mathrm{child}(t) \setminus s} m^-_{c \to t}(y_t) \prod_{p \in \mathrm{parent}(t)} m^+_{p \to t}(y_t) \qquad (2.2.19)$$
The above Equations 2.2.15, 2.2.16, 2.2.17, 2.2.18 and its variant 2.2.19 describe the
sum-product algorithm based on published works[93, ch. 20.2.1]. Normalizing the beliefs
$\mathrm{bel}_s(y_s)$ of Equation 2.2.17 such that the beliefs of one random variable sum up to one
will yield the local posterior marginals as defined in Equation 2.2.13.
Replacing the $\sum$ operator with the max operator in Equations 2.2.16 and 2.2.19 will
yield the max-product algorithm which computes the local MAP of Equation 2.2.14. See
[93, ch. 20.2] for an in-depth discussion.
Computing the exact marginals is possible in a polytree since it is possible to pass
messages in an order that collects all evidence for one node, which fixes the correct
beliefs for this node. Evidence is then distributed from this node and the correct local
marginals are computed for all remaining nodes based on their fully available evidence.
Applying the sum-product algorithm to a chain-structured graphical model follows the same
principle as the forward-backward algorithm[105], whereas applying the max-product
algorithm follows the principle of the Viterbi algorithm[33, 149] with backtracking.
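The collect and distribute passes can be sketched for a small rooted tree as follows. This is a minimal illustration, not the implementation used in this thesis: it assumes an undirected pairwise model with one symmetric, strictly positive edge potential shared by all edges, and the names `sum_product_tree`, `children` and the toy values are hypothetical. A brute-force enumeration is included only to check the result.

```python
import itertools

def normalize(v):
    total = sum(v)
    return [a / total for a in v]

def sum_product_tree(node_pot, edge_pot, children, root):
    n_states = len(node_pot[0])
    partial, up, beliefs = {}, {}, {}

    def collect(t):
        # Eq. 2.2.15: incomplete belief from the node potential and child messages.
        bel = list(node_pot[t])
        for c in children[t]:
            collect(c)
            bel = [b * m for b, m in zip(bel, up[c])]
        partial[t] = bel
        # Eq. 2.2.16: message from t towards its parent.
        up[t] = [sum(edge_pot[ys][yp] * bel[ys] for ys in range(n_states))
                 for yp in range(n_states)]

    def distribute(t, down):
        # Eq. 2.2.17: combine the incomplete belief with the parent message.
        bel = [b * m for b, m in zip(partial[t], down)]
        beliefs[t] = bel
        for c in children[t]:
            # Eq. 2.2.18: divide out the evidence that came from child c.
            rest = [bel[yt] / up[c][yt] for yt in range(n_states)]
            msg = [sum(edge_pot[yt][yc] * rest[yt] for yt in range(n_states))
                   for yc in range(n_states)]
            distribute(c, msg)

    collect(root)
    distribute(root, [1.0] * n_states)
    return [normalize(beliefs[s]) for s in range(len(node_pot))]

def brute_force(node_pot, edge_pot, edges):
    n, m = len(node_pot), len(node_pot[0])
    marg = [[0.0] * m for _ in range(n)]
    for pi in itertools.product(range(m), repeat=n):
        score = 1.0
        for s in range(n):
            score *= node_pot[s][pi[s]]
        for s, t in edges:
            score *= edge_pot[pi[s]][pi[t]]
        for s in range(n):
            marg[s][pi[s]] += score
    return [normalize(row) for row in marg]

# Toy tree with root 0 (assumed structure and values).
children = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
edges = [(0, 1), (0, 2), (1, 3), (1, 4)]
node_pot = [[0.6, 0.4], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
edge_pot = [[2.0, 1.0], [1.0, 2.0]]

bp = sum_product_tree(node_pot, edge_pot, children, root=0)
exact = brute_force(node_pot, edge_pot, edges)
max_diff = max(abs(a - b) for p, q in zip(bp, exact) for a, b in zip(p, q))
```

On a tree the two passes reproduce the enumerated marginals up to floating point error while touching every edge only twice.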
Loopy Belief Propagation
We have discussed belief propagation and how it yields exact marginals in the case of a
polytree structure of the graphical model. This thesis deals with CRFs in which the nodes
are structured in an 8-neighborhood grid. Grid-structured undirected graphical models
contain cycles and cannot be represented as a polytree. It follows that BP is not directly
applicable to them.
We will now discuss loopy belief propagation (LBP)[34, 94][93, ch. 22], which is
an inference algorithm to approximate the local marginals. LBP is based on a simple
approach: repeatedly apply BP to the cyclic graphical model until the beliefs converge
or another stopping criterion is met. This approach is neither guaranteed to converge
to a stable point within polynomial time, as even approximating the posterior marginals
is generally NP-hard[23], nor is a stable point guaranteed to be near the exact solution for the local
marginals[100]. However, LBP has been successfully applied to a variety of practical
problems[94]. In computer vision, CRFs and LBP have been successfully applied to
image segmentation, object tracking and stereo depth estimation.
The BP algorithm as described before implements the so-called serial protocol for
message scheduling. In it, messages are sent one after another from node to neighboring
node. The order in which messages are processed is sequential. LBP employs the
parallel protocol, see known works[93, ch. 20.2.2], in which all messages are sent
simultaneously. This means that for LBP we will initialize all messages, then iteratively update
all messages simultaneously. This process of parallel message updates is repeated until
the stopping criteria are satisfied. Message updates
$$m_{s \to t}(y_t) = \sum_{y_s} \left[ \psi_s(y_s)\, \psi_{s,t}(y_s, y_t) \prod_{u \in \mathrm{nbr}(s) \setminus t} m_{u \to s}(y_s) \right] \qquad (2.2.20)$$
as stated by Kevin Murphy[93, ch. 20.2.2] are dependent only on the potential functions
ψ and the current message values m. Message updates are computed by collecting
evidence for the source node s by receiving messages from all neighboring nodes u,
except the target node t. This evidence is used to form a belief about the probability
distribution of the target node t and send the corresponding message update to the target
node t. Approximated beliefs
$$\mathrm{bel}_s(y_s) \propto \psi_s(y_s) \prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(y_s) \qquad (2.2.21)$$
are formed by collecting the evidence via message updates from all neighboring nodes.
The beliefs bels(ys) are proportional to the approximation of the local marginals and thus
can be normalized to retrieve these approximated local marginals.
Algorithm 2.2.1 Loopy Belief Propagation in Sum-Product Mode
Initialize messages $m_{s \to t}(y_t) = m_{t \to s}(y_s) = 1$ for all edges $s \sim t$.
Initialize beliefs $\mathrm{bel}_s(y_s) = 1$ for all nodes s.
Choose a random but fixed order for message updates.
repeat
    Send messages along each edge according to the chosen ordering:
        $m_{s \to t}(y_t) = \sum_{y_s} [\psi_s(y_s)\, \psi_{s,t}(y_s, y_t) \prod_{u \in \mathrm{nbr}(s) \setminus t} m_{u \to s}(y_s)]$
    Update beliefs for each node:
        $\mathrm{bel}_s(y_s) \propto \psi_s(y_s) \prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(y_s)$
until stopping criteria are satisfied.
Return marginal beliefs $\mathrm{bel}_s(y_s)$.
Algorithm 2.2.1 outlines LBP in sum-product mode as pseudo-code based on the
version stated by Kevin Murphy[93, ch. 22.2.2] with the stopping criteria generalized.
LBP is typically stopped and the computed beliefs returned once the beliefs no longer change
significantly[93, ch. 22.2.2]. While this is surely the most prominent stopping
criterion, other more problem-specific ones are possible. We will explore this later in this
thesis. Replacing the $\sum$ operator with the max operator will again yield the max-product
mode. Please note that the max-product algorithm in LBP can yield inconsistent results,
e.g. nodes being in contradicting states.
In theory, the parallel protocol updates all messages at the same time. However, im-
plementing LBP on a physical computer will most likely lead to some serialization of the
message order with the hardware parallelization of modern computing hardware being
used for a speed-up in terms of wall clock time. It is worth noting that there are different
strategies for computing the message updates in the parallel protocol. One way is the
synchronous update in which the new messages are computed based on the message
values from the last iteration. Asynchronous update holds and uses only the most recent
value per message. In practice this leads to inconsistencies in the ‘versioning’ of mes-
sages since currently held message values are originating from two different iterations at
times n and n− 1. It has been observed[93, ch. 22.2.4.3] that this does not pose a prob-
lem, but can be used to increase the convergence rate of the messages within LBP. This
is achieved by choosing a random order for message updates at the beginning and then
using this fixed order in each LBP iteration. See figure[93, Fig. 22.5] for a comparison of
the convergence rate in synchronous and asynchronous updates.
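A compact sketch of LBP in sum-product mode with synchronous updates is given below. It is meant as an illustration under assumptions, not as the implementation used in this thesis: the function name `loopy_bp`, the shared symmetric edge potential, the convergence tolerance and the four-node cycle are hypothetical, and messages are normalized after every update for numerical stability, which Algorithm 2.2.1 does not require.

```python
def loopy_bp(node_pot, edge_pot, edges, max_iters=100, tol=1e-6):
    n_states = len(node_pot[0])
    nbr = {s: [] for s in range(len(node_pot))}
    for s, t in edges:
        nbr[s].append(t)
        nbr[t].append(s)
    # One message per directed edge, initialized uniformly.
    msgs = {(s, t): [1.0 / n_states] * n_states for s in nbr for t in nbr[s]}
    for _ in range(max_iters):
        new_msgs = {}
        for (s, t) in msgs:
            out = []
            for yt in range(n_states):
                total = 0.0
                for ys in range(n_states):
                    # Eq. 2.2.20: evidence about s from all neighbors except t.
                    prod = node_pot[s][ys] * edge_pot[ys][yt]
                    for u in nbr[s]:
                        if u != t:
                            prod *= msgs[(u, s)][ys]
                    total += prod
                out.append(total)
            norm = sum(out)
            new_msgs[(s, t)] = [v / norm for v in out]
        # Synchronous protocol: all new messages are based on the previous iteration.
        delta = max(abs(a - b) for k in msgs for a, b in zip(msgs[k], new_msgs[k]))
        msgs = new_msgs
        if delta < tol:
            break
    beliefs = []
    for s in sorted(nbr):
        # Eq. 2.2.21: node potential times all incoming messages, then normalized.
        bel = list(node_pot[s])
        for t in nbr[s]:
            bel = [b * m for b, m in zip(bel, msgs[(t, s)])]
        total = sum(bel)
        beliefs.append([b / total for b in bel])
    return beliefs

# Toy cycle 0 - 1 - 2 - 3 - 0; only node 0 carries local evidence (assumed values).
node_pot = [[0.8, 0.2], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
edge_pot = [[2.0, 1.0], [1.0, 2.0]]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
beliefs = loopy_bp(node_pot, edge_pot, edges)
```

The returned beliefs are approximations of the local marginals; on graphs with cycles they generally differ from the exact values.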
2.3 Deep Neural Networks
A deep neural network (DNN)[41, 126] is a (rather loosely defined) type of artificial
neural network[126][7, ch. 5]. An artificial neural network is a machine learning (ML)[7,
93] method that employs a large number of neuron-like structures and their connections
to approximate decision functions. Each individual neuron in an artificial neural network
is modeled after a highly abstracted model of a biological neuron[31, 52, 68, 89, 110]. An
artificial neural network is typically organized in layers with each layer consisting of a set
of neurons that receive signals from the previous layer. Each layer adds more parameters
and non-linear mappings to the learning machine. The input signals of the first layer are
the observed data as defined by the ML task. The output signals of the last layer are
the values of the decision function and depend on the ML task, e.g. whether the task
is classification or regression. This type of artificial neural network is also called a
multi-layer perceptron (MLP)[7, ch. 5.1]. A DNN is often defined as an MLP or MLP-like artificial
neural network with more than one hidden layer[5], that is, a layer that is not directly visible to the
user (neither input nor output). We will discuss the function of individual artificial neurons
and layers in this section.
A Single Neuron
Figure 2.3.1: A single neuron of an artificial neural network is a linear combination followed by a non-linear function.
A single neuron within an artificial neural network is a weighted linear combination of
its input signals x, followed by a non-linear activation function. This is described by the
function
$$f(x, W, b) = \sigma\left(b + \sum_i x_i W_i\right) \qquad (2.3.1)$$
with input signals x and learnable parameters W and b. To learn parameters here means
to optimize them regarding a task-specific target function. Optimization of an artificial
neural network is most often done using gradient descent [11, 65, 107] and backpropa-
gation[112, 113], both of which will be discussed later.
Function σ is a non-linear function and often called the ‘activation function’ of the
neuron. In the case of an MLP or DNN, most likely the same activation function is used for
all neurons in the layer, an approach which offers computational benefits. The activation
function allows the neuron to transition from a ‘non-activated state’ to an ‘activated state’,
which following neurons use to compute their own activations. Figure 2.3.2 shows the
step function, which transitions the neuron between two discrete states. Please note that
this activation function is not used anymore, for various reasons. We will discuss this
when detailing backpropagation.
Figure 2.3.2: Step function as an activation function σ.
Figure 2.3.3: Standard logistic sigmoid as an activation function σ.
Figure 2.3.3 shows the standard logistic sigmoid $\sigma(x) = \frac{1}{1 + \exp(-x)}$, which is in common
use and a much better choice than the step function. The standard logistic sigmoid
function is differentiable at any point and monotonically increasing, two properties which
allow for gradient descent.
The formulation of a single neuron of an artificial neural network is also the formulation
of the decision function of the perceptron algorithm[109, 110]. In the case of a perceptron
the activation function is also called the ‘threshold function’. As with a perceptron, a single
neuron is not able to learn classification problems that are not linearly separable[92].
Please note that we can easily combine the weights W and biases b of an artificial
neural network into one parameter set Ŵ by expanding the input vector x by a dimension
that has a constant coefficient of 1. This constant feature allows us to move the bias b
into the weights W as one additional weight coefficient. Using x̂ as the expanded feature
vector, Equation 2.3.1 reduces to
$$f(\hat{x}, \hat{W}) = \sigma\left(\sum_i \hat{x}_i \hat{W}_i\right) \qquad (2.3.2)$$
with x̂ and Ŵ having one dimension more than x and the corresponding W. From now on we
will use the formulation with the expanded feature vector without explicitly denoting this
fact.
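The equivalence of the two formulations of a single neuron can be checked with a short sketch. The function names and the numeric values are hypothetical, and the standard logistic sigmoid is assumed as the activation function.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def neuron(x, w, b):
    """Eq. 2.3.1: weighted sum of the inputs plus bias, followed by the activation."""
    return sigmoid(b + sum(xi * wi for xi, wi in zip(x, w)))

def neuron_folded(x_hat, w_hat):
    """Eq. 2.3.2: the bias is one more weight on a constant input of 1."""
    return sigmoid(sum(xi * wi for xi, wi in zip(x_hat, w_hat)))

x, w, b = [0.5, -1.0, 2.0], [0.1, 0.4, -0.3], 0.25
x_hat = x + [1.0]   # expanded feature vector with the constant feature
w_hat = w + [b]     # bias moved into the weights

difference = abs(neuron(x, w, b) - neuron_folded(x_hat, w_hat))
```

Both functions return the same activation up to floating point rounding, which is why the bias can be dropped from the notation.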
Multi-Layer Perceptron
We will now discuss how to construct an MLP from the above definition of a single neuron.
An MLP is an artificial neural network that comprises a set of neurons organized in ‘layers’.
Each layer is a set of neurons that, in the case of non-recurrent layers, use the activations
of the previous layer as inputs and then compute their own activations, which are either
the output of the MLP or fed into the subsequent layer. The neurons within one layer are
organized in parallel. The first layer takes the observed features as input, provided by the
user. The activations of the last layer are the value of the decision function.
Figure 2.3.4: A multi-layer perceptron featuring a single hidden layer.
Figure 2.3.4 shows a simple MLP with one hidden layer. Nodes x0 through xn represent
the input features as provided by the user. Nodes h0 through hn and y0 through
yn are artificial neurons. Each one is of the type detailed in Figure 2.3.1. It is important
in an MLP to choose a non-linear activation function σ. Without this non-linearity, the
decision function of the MLP reduces to a single linear layer. This can be shown
by substituting the equations for a single neuron into each other and applying the distributive
property to combine the weight and bias coefficients of the linear combinations.
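The collapse of stacked linear layers can be illustrated numerically. The sketch below assumes two layers without activation function and with the bias folded into the weights; the helper names and weight values are hypothetical.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix-matrix product for matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]  # 2 inputs -> 3 hidden units
W2 = [[1.0, 0.5, -2.0]]                      # 3 hidden units -> 1 output
x = [0.5, -1.5]

two_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x)
one_layer = matvec(matmul(W2, W1), x)    # (W2 W1) x
```

Both expressions give the same output, so without a non-linearity between the layers the hidden layer adds no expressive power.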
Organizing the artificial neural network as an MLP in layers, with each neuron within a
layer modeling the same function apart from its weights and biases, allows us to simplify
Equation 2.3.1 using matrix linear algebra. This results in the function
$$f(x, W) = \sigma(Wx) \qquad (2.3.3)$$
for a layer with multiple neurons. The above function $f : \mathbb{R}^n \to \mathbb{R}^m$ describes the
activations of a layer of m neurons given n input signals, either observed features or activations
from the previous layer. In this case the vector x is of dimension n and matrix W of
dimension m × n. Activation function σ is a coefficient-wise non-linear function as described
above for a single neuron. This formulation using matrix linear algebra has the benefit
of allowing the use of computationally efficient matrix multiplication algorithms and general
purpose graphics processing unit (GPGPU)[16, 37] computation.
Figure 2.3.5: Logical XOR is a classic example for a classification problem that is not linearly separable, as it requires two separating planes in linear space.
MLPs can solve classification problems that are not linearly separable. This can be
exemplified with the logical XOR problem. The logical XOR is a boolean function that is
true if its inputs are different. Figure 2.3.5 shows this for two boolean variables. It also
shows that two decision boundaries (dashed lines) would be necessary to directly solve
the problem, which is not possible with a linear classifier. We will now introduce one
hidden layer with two neurons into the MLP.
Figure 2.3.6: Separating one ‘true’ case of the logical XOR in its own hidden neuron.
Figure 2.3.6 shows the decision boundary of the first neuron of the hidden layer.
Figure 2.3.7 shows the decision boundary of the second neuron of the hidden layer. Now
we have separated the two cases where the logical XOR is true from the remaining cases.
Figure 2.3.8 shows the decision boundary of the output neuron. Whenever at least
one of the two hidden neurons is activated (1), the logical XOR will be true. Please note
that both neurons with decision boundaries from Figures 2.3.6 and 2.3.7 cannot be active
at the same time. We have now seen that introducing one hidden layer allows an MLP to
solve classification problems that are not linearly separable.
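The construction with two hidden neurons can be written down directly. The following sketch uses hand-picked weights and the step function as activation, both of which are assumptions made for readability; they are one of many possible choices and not values taken from the figures.

```python
def step(a):
    return 1 if a > 0 else 0

def layer(x, weights, biases):
    """One layer: step(W x + b), written out without a matrix library."""
    return [step(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

# Hidden neuron 0 fires only for (1, 0), hidden neuron 1 only for (0, 1).
W_HIDDEN = [[1.0, -1.0], [-1.0, 1.0]]
B_HIDDEN = [-0.5, -0.5]
# The output neuron fires if at least one hidden neuron is active.
W_OUT = [[1.0, 1.0]]
B_OUT = [-0.5]

def xor_mlp(x0, x1):
    hidden = layer([x0, x1], W_HIDDEN, B_HIDDEN)
    return layer(hidden, W_OUT, B_OUT)[0]

truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

Each hidden neuron implements one linear decision boundary, and the output neuron combines the two half-planes into the XOR decision.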
Figure 2.3.7: Separating the other ‘true’ case of the logical XOR in a second hidden neuron.
Figure 2.3.8: Logical XOR is true if at least one of the two hidden neurons is activated. Both hidden neurons cannot be active at the same time.
Softmax and Cross-Entropy for Classification Tasks
This work uses DNNs to transcribe multi-line texts from images. For this we will need to
produce probabilities for individual glyphs at specific spatial positions within the image.
We will now discuss a suitable non-linear activation function and loss function for this
task.
The equation
$$\sigma(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \qquad (2.3.4)$$
describes the softmax[13][7, p. 198][41, ch. 6.2.2.3] activation function, which produces
activated values between zero and one that sum up to exactly one. This allows the
activated values to be interpreted as probabilities. Since the vector x can be of arbitrary
dimensionality, softmax can model multi-class problems where classes are mutually
exclusive. This is not the case for e.g. the standard logistic sigmoid, whose values
are also in the range zero to one, but which does not normalize the sum of activations. This makes
softmax suited for multi-class problems where classes are mutually exclusive and the
standard logistic sigmoid suited for non-exclusive classes.
The softmax function produces class probabilities for a multi-class problem. Now a
suitable loss function is necessary for optimization of the parameters of an MLP for solving
a classification problem. Let us use y for the class probability distribution estimated by
the MLP and z for the true class distribution. Please note that we are discussing the case
of discrete classes, since this work deals with discrete classes in the form of glyphs from
an alphabet.
The cross-entropy[7, ch. 4.3.2][41, ch. 3.13] loss function is described by
$$L(y, z) = -\mathbb{E}_z[\log(y)] = -\sum_i \left[ z_i \log(y_i) \right] \qquad (2.3.5)$$
for discrete probability distributions y and z over the same event set. Minimizing the
cross-entropy loss minimizes the uncertainty regarding the true distribution z given the
estimated distribution y. It can also be interpreted as minimizing the coding length when
using a coding scheme based on the estimated distribution y for coding the true data
from distribution z. This makes the cross-entropy loss a suitable target function for a
multi-class problem since in this case we can only estimate the probability distribution for
unknown samples, but we can minimize the uncertainty beforehand on a known training
data set.
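Equations 2.3.4 and 2.3.5 translate into a few lines of code. The sketch below makes two assumptions that are not part of the equations: the maximum is subtracted before exponentiation to avoid overflow (which leaves the result unchanged), and a small constant `eps` guards against log(0). The example scores are hypothetical.

```python
import math

def softmax(x):
    """Eq. 2.3.4, computed in a numerically stable way."""
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, z, eps=1e-12):
    """Eq. 2.3.5 for an estimated distribution y and a true distribution z."""
    return -sum(zi * math.log(yi + eps) for yi, zi in zip(y, z))

z = [1.0, 0.0, 0.0]                 # true class is class 0 (one-hot)
y_good = softmax([5.0, 0.0, 0.0])   # confident and correct
y_bad = softmax([0.0, 5.0, 0.0])    # confident and wrong

loss_good = cross_entropy(y_good, z)
loss_bad = cross_entropy(y_bad, z)
```

The loss is small when the estimated distribution puts most of its mass on the true class and grows as the mass moves to other classes.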
Optimization by Gradient Descent
Up until now we have discussed how an MLP is structured and how it works in general. We
have also described cross-entropy as a target or loss function, that is, a function that describes
the error that the MLP produced in its estimation. It follows that we can use
this target or loss function to derive how to optimize the parameters of the MLP, that is, to
find a set of weights $W^\star \in \mathbb{R}^n$, with n being the number of parameters in the MLP, that
minimizes the value of the loss function as
$$W^\star = \operatorname*{argmin}_W L(S, W) \qquad (2.3.6)$$
with S being the training data set, and thus minimizes the value of the error on this data.
We will do this using gradient descent[11, 65, 107].
Applying gradient descent means trying to minimize the value of the loss function
over a fixed, finite set of examples. As discussed in Section 2.1, this data set is called
the training set and is the largest part, commonly 80 to 90 percent, of the overall
available examples regarding the problem at hand. The remaining data examples will later be
used to validate and evaluate a possible solution. Having a data set of examples for training
with both the input (observed features) and output (true target values) is also called
supervised learning. This means that the ML model is fully under supervision by the
optimization algorithm with errors being detected immediately and corrected if possible.
Popular other training paradigms for DNNs are unsupervised learning and reinforcement
learning[142]. This thesis deals with supervised learning problems and we will concentrate
on this training paradigm.
Gradient descent itself is an iterative algorithm for supervised training of ML models.
It requires the loss function and the ML model to be differentiable. It assumes the set of
training data to be a representative sample of the problem at hand and the loss function
to be a function over weight space, that is, over the weight parameters of the DNN. Gradient
descent uses the gradient
$$\frac{\partial L}{\partial W} \qquad (2.3.7)$$
of the loss function L in the weight space defined by W. Since the gradient in weight
space gives the direction of steepest increase of the function, the negative gradient gives
the direction of steepest descent. This is why gradient descent follows the negative gradient of
the loss function for minimizing it. The new parameter set $W_{t+1}$
$$W_{t+1} = W_t - \mu \frac{\partial L}{\partial W_t} \qquad (2.3.8)$$
is the old parameter set plus the negative gradient of the loss function, scaled by µ.
Hyperparameter µ is called the learning rate and is a coefficient added to control the speed of
gradient descent. This is necessary since we may otherwise not reach a stationary point.
This process of calculating the first order derivative of the loss function in weight space
and following the negative gradient is repeated until convergence. If gradient descent
converges to a stationary point, this point is guaranteed (assuming the loss function is
bounded from below) to be a local minimum or saddle point of the loss function.
Algorithm 2.3.1 Gradient Descent
Initialize weight set W.
Choose a learning rate µ.
repeat
    Evaluate the loss function L with current weights W.
    Calculate gradient $\frac{\partial L}{\partial W}$ of the loss function.
    Update weights $W = W - \mu \frac{\partial L}{\partial W}$.
until convergence to a stable point.
Return final weight set W as $W^\star$.
Algorithm 2.3.1 describes gradient descent in its most basic case.
Figure 2.3.9 shows an example of gradient descent for finding the minimum of the
function $f(w) = w^2$ starting at w = −10 with a learning rate µ = 0.1. Please note that this
is a well-behaved case since the function $f(w) = w^2$ is convex and thus only has one
minimum. In such a case gradient descent will reliably converge to a stable point near or
at the global minimum, given that the learning rate µ is not set too high.
Figure 2.3.10 exemplifies the case for a learning rate of µ = 0.9, which is too high:
the parameter w will oscillate around the minimum of the function instead of approaching it
from one side. If we choose the learning rate even higher (µ > 1 for this function), then the
parameter w in gradient descent will diverge from the minimum. In the case of optimizing the
weights W of a DNN regarding a given loss function, the loss value would actually increase over time.
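The behavior for different learning rates can be reproduced with a minimal sketch of Algorithm 2.3.1 for $f(w) = w^2$. The function names, the fixed number of steps (instead of a convergence test) and the third learning rate of 1.1 are assumptions for illustration. For this function each update multiplies w by (1 − 2µ).

```python
def gradient_descent(grad, w0, lr, steps):
    """Repeat w <- w - lr * grad(w) and return every visited value of w."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - lr * grad(w)
        path.append(w)
    return path

def grad_f(w):
    return 2.0 * w  # derivative of f(w) = w**2

smooth = gradient_descent(grad_f, -10.0, 0.1, 50)       # factor 0.8 per step
oscillating = gradient_descent(grad_f, -10.0, 0.9, 50)  # factor -0.8 per step
diverging = gradient_descent(grad_f, -10.0, 1.1, 50)    # factor -1.2 per step
```

With µ = 0.1 the parameter approaches zero from the negative side, with µ = 0.9 it changes sign in every iteration, and with µ = 1.1 its magnitude grows in every iteration.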
The above examples show the simple case of gradient descent in a one-dimensional
space over parameter w while minimizing the convex function $f(w) = w^2$. We cannot
expect the loss function L to be convex in the case of training a DNN[42]. As such,
gradient descent may converge to a stable point that is not a global minimum, which
would be a sub-optimal solution. We need to keep this in mind when optimizing a DNN
and validate each possible solution. However, it seems that from a practical point of view,
most local minima are a good-enough solution in the case of DNNs[18, 25, 116].
Figure 2.3.9: Gradient descent in function $f(w) = w^2$ (blue) with µ = 0.1.
Figure 2.3.10: Gradient descent in function $f(w) = w^2$ (blue) with µ = 0.9.
So far we have discussed the theory of gradient descent with some examples, but for
a practical implementation we need to decide on a schedule for gradient descent. ‘Schedule’
in the context of gradient descent describes the order in which the training set is
processed and at which intervals an update of the parameters is executed. We need to
specify the terms epoch and iteration beforehand. An epoch in supervised training of
DNNs is the update of the network parameters with each example from the training set
exactly once. An iteration is one individual update step of the network parameters, whether
this is with one example from the training set or multiple examples at once. The number
of examples per iteration is the difference between online, mini-batch and batch
training of DNNs. Let us use S as the training set in supervised training with $(x, z) \in S$,
x being the network input (extracted features or e.g. an image) and z being the correct
output. y will be the prediction by the DNN. Let us also use the mean squared error over
the N output values as our loss function in the following examples. Online training
$$L_o(S, i) = \frac{1}{N} \sum_{n}^{N} \left( y_{i,n} - z_{i,n} \right)^2 \qquad (2.3.9)$$
with $y_i = f(x_i, W)$ and $(x_i, z_i) \in S$ is then the loss for exactly one training set element
$(x_i, z_i)$ per iteration i. Batch training
$$L_b(S) = \frac{1}{|S|} \sum_{i}^{|S|} \frac{1}{N} \sum_{n}^{N} \left( y_{i,n} - z_{i,n} \right)^2 \qquad (2.3.10)$$
is the other extreme to online training. In batch training, the full training set will be used
per iteration whereas online training uses exactly one example per iteration. A middle way
between online and batch training is mini-batch training where only a part of the training
set, but more than one example, is used per iteration in gradient descent. This mini-batch
training is formulated as the loss
$$L_m(S, b, i) = \frac{1}{b} \sum_{j = i \times b}^{(i+1) \times b} \frac{1}{N} \sum_{n}^{N} \left( y_{j,n} - z_{j,n} \right)^2 \qquad (2.3.11)$$
where 1 < b < |S| is the mini-batch size and i indicates the current iteration. When
applying training using gradient descent to supervised learning problems, a choice between
these three schedules has to be made. This is mainly a trade-off between computation
time and convergence rate of the optimization process[6, 155]. Using online training
or mini-batch training with a low mini-batch size allows gradient descent to better follow
the curvature of the loss function during optimization, but also yields more iterations per
epoch. This in combination leads to an overall faster convergence rate of the loss value
during optimization. On the other hand, batch training or mini-batch training with
a larger mini-batch size allows the efficient utilization of GPGPU hardware by enabling
multiple independent computation threads. In general, mini-batch training is currently the
schedule of choice for practical applications. Mini-batch sizes typically range somewhere
between 8 and 64, depending on how much training data and computer memory is
available.
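The three schedules differ only in the number of examples per iteration, which the following sketch makes explicit. The toy model y = w · x with a single parameter, the data set, the learning rate and the function names are assumptions chosen so that the example stays short; the batch index ranges here are half-open, so every example is used exactly once per epoch.

```python
def batches(n_examples, batch_size):
    """Yield index ranges that cover the training set exactly once (one epoch)."""
    for start in range(0, n_examples, batch_size):
        yield range(start, min(start + batch_size, n_examples))

def train_epoch(w, data, batch_size, lr):
    """One epoch of gradient descent on the squared error of the model y = w * x."""
    iterations = 0
    for idx in batches(len(data), batch_size):
        # Mean gradient of (w * x - z)**2 with respect to w over the current batch.
        grad = sum(2.0 * (w * data[i][0] - data[i][1]) * data[i][0] for i in idx) / len(idx)
        w -= lr * grad
        iterations += 1
    return w, iterations

# Toy training set of 8 examples generated by the target weight 3.0.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 9)]

w_online, it_online = train_epoch(0.0, data, batch_size=1, lr=0.1)
w_mini, it_mini = train_epoch(0.0, data, batch_size=4, lr=0.1)
w_batch, it_batch = train_epoch(0.0, data, batch_size=len(data), lr=0.1)
```

In this toy setting online training performs eight parameter updates per epoch and ends closest to the target weight, whereas batch training performs a single update; the per-example computations within one batch are independent and could be executed in parallel.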
Backpropagation
So far we have discussed how MLPs are structured and how their parameters can be
optimized towards a given task using gradient descent. However, gradient descent
requires the partial derivatives $\frac{\partial L}{\partial W}$ of the loss function L with respect to the MLP parameters W.
Backpropagation[112, 113] is an algorithm for calculating the partial derivatives $\frac{\partial L}{\partial W}$ for
MLPs and DNNs.
MLPs are organized in layers of neurons, each layer being a matrix multiplication of
the input features and the weights of the layer. This linear combination of each layer is
followed by a non-linear activation function. We can see that the linear combination is a
function f(x,W) of its input x and the weight set W of the layer. The non-linear activation
function σ(x) is a function that has the result of the linear combination as its input x. In
the same way, the output of the non-linear activation function of one layer in an MLP is the
input of the next layer in the MLP. This means that an MLP is a series of composite (or
nested) functions. The equation
$$m(x, W) = \sigma(l_2(\sigma(l_1(x, W_1)), W_2)) \qquad (2.3.12)$$
exemplifies this for an MLP m with two layers consisting of linear combinations $l_1$, $l_2$ and
their non-linear activation functions σ.
Backpropagation takes advantage of the fact that the derivative of a composite
function is the product of the derivatives of the outer and inner functions:
$$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} \qquad (2.3.13)$$
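This is known as the ‘chain rule’ and means we can now specify the partial derivatives
for the above MLP:
$$\frac{\partial m}{\partial W_2} = \frac{\partial m}{\partial l_2} \frac{\partial l_2}{\partial W_2} \qquad (2.3.14)$$
$$\frac{\partial m}{\partial W_1} = \frac{\partial m}{\partial l_2} \frac{\partial l_2}{\partial l_1} \frac{\partial l_1}{\partial W_1} \qquad (2.3.15)$$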
The chain rule, and by extension backpropagation, also applies to the loss function for
optimization itself. It allows calculation of any necessary derivative for gradient descent,
e.g. for $W_1$ of the above MLP m(x,W):
$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial m} \frac{\partial m}{\partial W_1} \qquad (2.3.16)$$
Backpropagation is the application of the chain rule for calculating partial derivatives
of the MLP, beginning from the loss function and ‘working towards’ the input of the MLP.
The first step is to calculate the derivative of the loss function L, then to apply the chain
rule to obtain the derivative of the output layer, then of the hidden layer(s) and so on. The
derivatives of the individual layers of the MLP are often simple, as for the example above:
$$l(x, W) = \sum_i x_i W_i \qquad (2.3.17)$$
means
$$\frac{\partial l}{\partial x_i} = W_i \qquad (2.3.18)$$
and
$$\frac{\partial l}{\partial W_i} = x_i \qquad (2.3.19)$$
both of which are known if $W_i$ and $x_i$ are stored, which is the case for $W_i$ anyway since it
is an MLP parameter that is optimized by gradient descent.
The derivative of the standard logistic sigmoid
$$\sigma(x) = \frac{1}{1 + \exp(-x)} \qquad (2.3.20)$$
is
$$\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x)) \qquad (2.3.21)$$
which means it is also easy to compute when storing σ(x) from the forward pass through
the MLP.
The backpropagation algorithm can be summarized as follows. It is used to calculate
the partial derivatives $\frac{\partial L}{\partial W}$, which are then used to optimize W towards minimizing the
loss function L using gradient descent. Please note that loss function L most likely has
more parameters than just $x_N$, depending on the ML task at hand.
Algorithm 2.3.2 Backpropagation
Define current weight set W.
Define current input x0.
repeat ▷ Forward pass
Compute linear combination li = Wixi−1.
Apply non-linear activation function xi = σ(li).
Store both li and xi.
until output layer i = N .
Compute loss function L(xN ).
Compute derivative ∂L∂x .N
repeat ▷ Backward pass
Compute derivative ∂L∂l =
∂L ∂xi
i ∂xi ∂l
.
i
Compute derivative ∂L ∂L ∂li∂W = ∂l ∂W .i i i
Compute derivative ∂L = ∂L ∂li∂xi−1 ∂l .i ∂xi−1
Store ∂L∂W .i
until input layer i = 0.
Return set of derivatives ∂L∂W .
Algorithm 2.3.2 describes backpropagation for a standard MLP. Backpropagation can
be adapted for other DNN topologies as well, as long as the individual modules of the
network are differentiable. Modules refer to building blocks of the network, e.g. a non-
linear layer in a MLP, that feed forward and process the data. These modules need to be
differentiable regarding their parameters that should be optimized, if any, and their input
in order to back propagate the loss derivative to the previous module(s). The forward
and backward passes in backpropagation need to be changed to match the modules in
use accordingly. We will later discuss convolutional neural networks and recurrent neural
networks, which both use different module types than fully connected non-linear layers.
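The forward and backward passes of backpropagation for a standard MLP can be sketched in a few lines of plain Python. This is a minimal illustration and not the implementation used in this work: it assumes layers without bias terms, the logistic sigmoid as activation function, and half the squared error as loss function. The helper names (`forward`, `backprop`, `loss`) are chosen for this example only.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def forward(weights, x0):
    """Forward pass: returns the activations x_0..x_N with x_i = sigma(W_i x_{i-1})."""
    xs = [x0]
    for W in weights:
        xs.append([sigmoid(l) for l in matvec(W, xs[-1])])
    return xs

def loss(xN, y):
    # Assumed loss for illustration: half the squared error.
    return 0.5 * sum((a - b) ** 2 for a, b in zip(xN, y))

def backprop(weights, x0, y):
    """Returns dL/dW_i for every layer, working from the loss towards the input."""
    xs = forward(weights, x0)
    dL_dx = [a - b for a, b in zip(xs[-1], y)]  # dL/dx_N for the assumed loss
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        x_out, x_in, W = xs[i + 1], xs[i], weights[i]
        # dL/dl = dL/dx * sigma'(l), using sigma'(l) = sigma(l) * (1 - sigma(l))
        dL_dl = [g * a * (1.0 - a) for g, a in zip(dL_dx, x_out)]
        # dL/dW: outer product of dL/dl with the layer input
        grads[i] = [[d * xj for xj in x_in] for d in dL_dl]
        # dL/dx of the previous layer: W^T dL/dl
        dL_dx = [sum(W[r][c] * dL_dl[r] for r in range(len(W)))
                 for c in range(len(x_in))]
    return grads
```

Because the derivative of the sigmoid can be written in terms of its output, this sketch only stores the activations and not the linear combinations.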
Convolutional Neural Networks
Convolutional neural networks (CNNs)[36, 78, 79] are a variant of MLPs that are especially
suited to processing spatial data. As with MLPs, CNNs are organized in layers, with the
first layer processing the user-defined input and each subsequent layer receiving the
forwarded activations from the previous layer. In contrast to MLPs, CNNs often consist of
several types of layers: convolutional layers, which give them their name, and often pooling
layers to reduce the spatial resolution.
Convolutional layers in a CNN, as with layers of artificial neurons, consist of a linear
combination with learnable weights followed by a non-linear activation function. In contrast
to a MLP, convolutional layers in a CNN are not ‘fully connected’, meaning no artificial
neuron receives the full data from the previous layer as input. Input data is organized in
‘feature maps’ with one or multiple spatial dimensions (e.g. two in the case of image data)
and one or more channels, which are themselves also referred to as feature maps. Each of
these contains either one input feature or the activations of one artificial neuron from
the previous layer, organized
in a spatial map. Convolutional layers now consist of neurons that receive only a part of
the feature map, from a sliding rectangular window or kernel, as input. Each window may
be processed by multiple neurons with different weight sets, resulting in multiple output
feature maps, but neurons in different windows share their weights. Referring to Equation
2.3.2, weights W are shared from window to window, but inputs x are individual per win-
dow. This behavior is similar to kernels in Computer Vision methods, but the coefficients
of the kernel are learned by gradient descent.
The equation
$$f(x, W, k)_i = \sigma\left(W x_{[i - \lfloor k/2 \rfloor,\; i + \lfloor k/2 \rfloor]}\right) \tag{2.3.22}$$
describes a convolutional layer with input x, weight set W and a window/kernel size k.
It is a one-dimensional convolution in this example and as such the interval for slicing
x is only along one dimension. Convolutional layers can easily be extended to multiple
dimensions by extending the window to those dimensions and slicing x accordingly. In
the equation above, weight matrix W would be m × (kn) in size, with m neurons in the
convolutional layer and n input channels.
Figure 2.3.11: Convolutional layers apply learnable spatial kernels to the feature maps and are
used to extract general features from the data. Colors indicate different feature maps or
channels. (Figure labels: same W, different x; each window position computes σ(Wx); axes:
spatial dimension and channels.)
Figure 2.3.11 shows a basic example of a convolutional layer along one spatial dimension.
The input feature map is in this case 6 in size along the spatial dimension with
2 channels. The convolutional layer has a kernel size of 3 with 3 artificial neurons,
resulting in an output feature map with a spatial dimension of 4 with 3 channels. The
spatial size is reduced by 2 in this case because only those spatial positions are processed
where the window lies fully within the input feature map. This reduction can be prevented
by adding padding, e.g. constant zeros or a mirror of the feature map, of size ⌊k/2⌋ on each
side of the input feature map.
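The one-dimensional convolutional layer described above can be sketched as follows. This is an illustrative sketch under a few assumptions: the logistic sigmoid is used as activation function, the feature map is stored as a list of positions with one list of channel values each, and windows are indexed by their start position instead of their center. The function names `conv1d` and `zero_pad` are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def conv1d(x, W, k):
    """x: list of positions, each a list of n channel values.
    W: m rows of k*n weights, shared across all window positions.
    Returns one m-channel output per window that lies fully inside x."""
    out = []
    for start in range(len(x) - k + 1):
        # Flatten the k positions of the window into one vector of length k*n.
        window = [v for pos in x[start:start + k] for v in pos]
        out.append([sigmoid(sum(w * v for w, v in zip(row, window))) for row in W])
    return out

def zero_pad(x, k):
    """Adds floor(k/2) all-zero positions on both sides of the feature map."""
    n = len(x[0])
    pad = [[0.0] * n for _ in range(k // 2)]
    return pad + x + pad

# Sizes as in the example above: 6 positions, 2 channels, kernel size 3, 3 neurons.
x = [[0.1 * i, -0.1 * i] for i in range(6)]
W = [[0.5] * 6, [-0.5] * 6, [0.0] * 6]   # m x (k*n) = 3 x 6
print(len(conv1d(x, W, 3)), len(conv1d(zero_pad(x, 3), W, 3)))
```

Without padding the output has 4 positions, with zero padding it keeps the 6 positions of the input, in both cases with 3 channels.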
CNNs have a similar architecture to MLPs, but with shared weights in the convolutional
layers. This greatly reduces the number of learnable parameters, as compared to
fully connected layers. This not only makes a CNN computationally more favorable by
reducing both memory usage and CPU load but also reduces overfitting of the learned
solution. This is because on average there are more data points per learnable parameter
in a CNN than in a MLP, resulting in fewer minima of the loss function in weight space.
Also the same convolutional kernels are applied to different spatial positions within the
input feature map, which prevents neurons from becoming receptive to specific features
that only occur in a single or very few spatial positions. Instead, neuron weight sets
which are receptive to general features in many different spatial positions are favored
during gradient descent.
CNNs typically consist of convolutional layers, as discussed above, and pooling
operations. Pooling operations in a CNN are ‘layers’ or functions without learnable
parameters that are used to reduce the spatial resolution of the feature map. Pooling in
a CNN is done by moving a non-overlapping window over the feature map and reducing
the contained part of the feature map to only one position, e.g. one ‘pixel’ in image data.
For example, pooling with a window size of two halves the spatial resolution. The operation
to reduce the spatial resolution is applied channel-wise, preserving the number of channels,
and is often a simple combination of the input data, most commonly the maximum
or the average value. It is necessary to keep this operation differentiable in order to apply
backpropagation to the CNN. Backpropagation for maximum pooling is done by propagating
the loss derivative only towards the coefficient that was forwarded during pooling,
ignoring other data.
Equation
$$f(x, w)_{i,c} = \max\left(x_{[iw,\,(i+1)w],\,c}\right) \tag{2.3.23}$$
describes a maximum pooling operation with window size w.
Figure 2.3.12: Pooling operations, here in the example of maximum pooling, are a way to reduce
the spatial resolution and introduce translation invariance to the deep neural network.
Colors indicate different feature maps or channels. (Figure values: the two-channel input
positions (2, 1), (1, 2), (5, 4), (6, 3), (2, 7), (8, 0) are pooled pairwise to (2, 2), (6, 4),
(8, 7).)
Figure 2.3.12 shows an example of a one-dimensional feature map with two channels
to which a maximum pooling operation with a window size of two is applied.
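Channel-wise maximum pooling over non-overlapping windows can be sketched as below. The input values are those that appear in the extracted content of Figure 2.3.12 and should be read as an assumption about the figure; the function name `max_pool1d` is hypothetical.

```python
def max_pool1d(x, w):
    """x: list of positions, each a list of channel values; w: window size.
    Windows do not overlap and the maximum is taken separately per channel."""
    pooled = []
    for start in range(0, len(x) - w + 1, w):
        window = x[start:start + w]
        pooled.append([max(pos[c] for pos in window) for c in range(len(x[0]))])
    return pooled

feature_map = [[2, 1], [1, 2], [5, 4], [6, 3], [2, 7], [8, 0]]
print(max_pool1d(feature_map, 2))  # [[2, 2], [6, 4], [8, 7]]
```

The number of channels is preserved while the spatial size is halved from six to three positions.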
Pooling layers are used in CNNs not only to reduce memory consumption and computation
time, but also to introduce translation invariance to the CNN. Convolutional layers
are receptive to certain features within the feature map, e.g. edges or Gabor filters[161].
Reducing the spatial resolution of the output feature map of a convolution by pooling
makes the CNN sensitive to the feature occurring somewhere within the pooling
window, and removes the requirement that it is detected at every spatial position in order
to be propagated as a strong activation to the next layer.
Figure 2.3.13: Schematic showing the relation between the spatial resolution and number of
features of the feature maps in common CNN architectures. (Figure labels: input with high
spatial resolution, output with many channels.)
Now the question arises on how to choose suitable hyper-parameters, namely the
number of neurons per convolutional layer and the window sizes for pooling, in a CNN. A
common approach is to start out with a high spatial resolution and low number of features,
where individual features add little information and only the accumulation over the whole
spatial extent carries meaningful information about the input data. Take an RGB image
as an example, where each individual pixel carries only a tiny amount of information
and the three channels by themselves are not highly abstract, but viewing the whole
image (as a human) easily reveals the content in terms of objects, their classes, and so
on. This input feature map of high spatial resolution and low-abstraction features will then
gradually be converted to a feature map with low spatial resolution but high abstraction
per feature by reducing the spatial resolution via pooling, but at the same time increasing
the number of neurons per convolutional layer. In the best case, this relation between
spatial resolution and number of channels in the feature map will be chosen in such a
way that the relevant information can always be contained in it, but random noise or
irrelevant information will be dropped. Figure 2.3.13 visualizes this relation between the
spatial resolution and number of channels in the feature maps of CNNs.
A term commonly used in the context of CNNs is the ‘receptive field’. It describes
the fact that in a CNN, data is processed by sliding fixed-size windows or kernels over
the feature map and applying convolutions to it, which means that a specific instance of
a convolutional neuron only receives a finite, fixed-size part of the feature map as input.
Convolutions and pooling operations in a CNN are stacked in multiple layers, accumulating
the total window size. For example, two convolutional layers with a kernel size of three
result in an accumulated kernel size of five for the second convolutional layer. This
accumulated window in a stack of convolutions and pooling layers is called the receptive
field. This is a useful concept when deciding on window or kernel sizes for a CNN, since
it is necessary to feed the relevant information into the convolutions for them to correctly
solve the task at hand, e.g. a CNN for object classification from image input should have
a receptive field large enough such that the output neurons actually receive the full object
as input.
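The accumulated window size can be computed layer by layer. The sketch below assumes a one-dimensional stack in which every layer is described by a kernel size and a stride, with a non-overlapping pooling window of size w treated as kernel size w and stride w. The function name `receptive_field` is hypothetical.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Returns the number of input positions that influence one output position."""
    size, jump = 1, 1
    for kernel, stride in layers:
        size += (kernel - 1) * jump  # each layer widens the field by (k - 1) steps
        jump *= stride               # strides and pooling spread later kernels apart
    return size

print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

The first call reproduces the example of two convolutional layers with a kernel size of three. The second call inserts a pooling operation with a window size of two between them, which enlarges the receptive field of the second convolution.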
Convolutional neural networks include many state of the art methods[73, 108, 143,
144, 153] for computer vision problems, e.g. the ImageNet[27] object classification
problem. Convolutional neural networks were even applied, in combination with
reinforcement learning and Monte Carlo tree search, to learn to play the game of Go at a
human-level performance[132, 133].
Recurrent Neural Networks
Recurrent neural networks (RNNs)[55, 112] are another common topology of artificial
neural networks, besides CNNs. A RNN is organized in layers as in a MLP, but recurrent
layers not only receive the activations from the previous layer as inputs, but also their
own activations from the previous time step or spatial position. This means that a RNN is
used to process a time series or feature map, position by position with the activations in
each step being dependent on the previous ones. On the topic of terminology, ‘neurons’
in recurrent topologies are commonly also referred to as ‘cells’.
The basic RNN layer is, as usual, a weighted linear combination followed by a non-
linear activation function. It possesses two weight matrices, one for the feed forward
activation from the previous layer and one for its own activations from the previous time
step. Equation
$$f(x_i, r_i, W_x, W_r) = \sigma(W_x x_i + W_r r_i) \tag{2.3.24}$$
with
$$r_i = \begin{cases} f(x_{i-1}, r_{i-1}, W_x, W_r), & \text{if } i > 0 \\ 0, & \text{if } i = 0 \end{cases} \tag{2.3.25}$$
describes a simple RNN with Wx being of size m×n and Wr of size m×m for a recurrent
layer with m cells and n inputs from the previous layer, and with i being the index variable
that indicates positions within a one-dimensional time series or feature map. Please note
that the weights Wx and Wr are, similar to convolutional layers, shared between different
time steps. This is important when optimizing the parameters.
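The recurrence of Equations 2.3.24 and 2.3.25 can be written as a simple loop over the sequence. This is a minimal sketch that assumes the logistic sigmoid as activation function and represents matrices as lists of rows; the function name `rnn_forward` is hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, v):
    return [sum(w * vj for w, vj in zip(row, v)) for row in W]

def rnn_forward(xs, Wx, Wr):
    """xs: sequence of input vectors with n values each.
    Wx: m x n input weights, Wr: m x m recurrent weights, shared by all steps."""
    r = [0.0] * len(Wr)  # recurrent input of the first step is zero
    outputs = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(Wx, x), matvec(Wr, r))]
        r = [sigmoid(p) for p in pre]  # becomes the recurrent input of the next step
        outputs.append(r)
    return outputs

# One cell, two inputs per step; the sequence length is not fixed in advance.
print(rnn_forward([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [[0.5, -0.5]], [[1.0]]))
```

The same two weight matrices are reused in every step, so the number of parameters does not depend on the sequence length.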
Figure 2.3.14: Recurrent neural network processing a feature map with one spatial dimension.
The RNN activations of the previous step are fed into the same RNN in the next
step. (Figure labels: same W, different x and r; the first step computes σ(Wxx), every
later step σ(Wxx+Wrr).)
In contrast to CNNs, RNNs do not have receptive fields of fixed size. Instead they can
be applied to variable size sequences with the receptive field for the last RNN step being
the whole sequence. The dependency on the previous steps and the ability for variable
size receptive fields make RNNs suitable for ML tasks in natural language processing
(NLP)[63, 141, 157], as is the case in this thesis.
Figure 2.3.15: Schematic representation of a recurrent neural network. Backpropagation is not
possible since the recurrence is infinite.
Figure 2.3.16: Recurrent neural network from Figure 2.3.15 unrolled for a finite sequence of three
steps. Recurrence is eliminated and backpropagation is applicable.
RNNs cannot directly be optimized using gradient descent and backpropagation, since
the recurrence in the formulation disallows backpropagation by repeated application of
the chain rule to the composite function at hand. However, RNNs are for obvious reasons
trained on a data set with sequences of finite length. This leads to the observation
that the actual depth of the composite formulation is also finite. Backpropagation through
time (BPTT)[152] is based on the fact that even though the RNN formulation is infinite
in theory (Figure 2.3.15), when applied to finite sequences it can be rewritten as a feed
forward network without recurrence (Figure 2.3.16). This is called ‘unrolling a RNN’ and
standard backpropagation and gradient descent are applied to an unrolled RNN in order
to optimize its parameters.
Commonly observed behaviors when training RNNs are the so-called ‘vanishing gradient’
and ‘exploding gradient’ problems[53, 54, 55, 99]. They are based on the fact that
backpropagation is the repeated multiplication of partial derivatives, based on the chain
rule for composite functions. If the partial derivative of one module is smaller than one
in magnitude, the overall gradient will be reduced in magnitude. Conversely, a partial
derivative greater than one in magnitude will increase the magnitude of the gradient. This
can pose problems during training since a decreasing magnitude of the gradient (vanishing
gradient) may end up with a gradient close to zero and thus only very small changes to the
weights during gradient descent, inhibiting effective training. On the other hand, an
exploding gradient may exaggerate individual training examples and lead to oscillations
during gradient descent, resulting in seemingly random results.
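A short numeric illustration shows how quickly the repeated multiplication changes the magnitude of the gradient. It makes the simplifying assumption that every step contributes the same partial derivative, which is not the case in a real network; the function name `gradient_scale` is hypothetical.

```python
def gradient_scale(factor, steps):
    """Magnitude of a gradient after `steps` multiplications by the same factor."""
    return factor ** steps

for steps in (10, 100, 1000):
    # A factor slightly below one vanishes, a factor slightly above one explodes.
    print(steps, gradient_scale(0.9, steps), gradient_scale(1.1, steps))
```

Even factors close to one lead to differences of many orders of magnitude once the number of multiplications reaches typical sequence lengths.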
In MLPs and CNNs, the number of multiplications of the derivatives is linear in the
number of layers of the network. This means that the factor by which the magnitude
of the gradient is increased or decreased is relatively constant and can thus easily be
countered by e.g. choosing different learning rates for the individual layers. This is not
applicable to RNNs since the sequence length may be variable and rather lengthy. As-
signing different learning rates would mean modification of the learning rate for different
steps within the sequence, resulting in a RNN that is artificially (in)sensitive to parts of
the sequence. Mitigation of the vanishing or exploding gradient in RNNs is commonly
done by introducing parameterized gates to the recurrence, keeping the gradient along
the recurrent connections in a predictable magnitude.
Long short-term memory (LSTM)[39, 55] increases the complexity of the cells of
RNNs by giving them a parameterized internal structure. This internal structure con-
sists of gated connections that control information flow within the cell, as well as their in-
and outputs. Gated connections feed forward information, but are modulated by gates. A
modulated connection
f(s,x,Wg) = s⊙ σ(Wgx) (2.3.26)
forwards signal s with σ(Wgx) being the gate in form of a standard logistic sigmoid of
the linear combination of gate weights Wg and cell input vector x. Operator ⊙ is the
Hadamard product [59, ch. 5] which is a coefficient-wise multiplication of two matrices.
Gate σ(Wgx) near one will forward the signal s (nearly) unchanged, while a value near
zero will drop the signal. Matrix Wg is of size n ×m for n neuron activations in signal s
that are to be modulated and m inputs into the cell. Please note that in the case of
RNNs, m counts both the inputs from the previous layer and the recurrent connections
from the layer itself at a previous time or space step.
Figure 2.3.17 shows the inner structure of an original LSTM cell with constant error
carousel. As usual in a RNN the cell input is a weighted linear combination over the
activations from the previous layer, as well as the own layer in the previous time step. In
contrast to a plain RNN, the cell state in LSTM is gated by an input and an output gate.
Equations for the LSTM topology in Figure 2.3.17 are
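$$\hat{c}_t = \sigma_s(W_c x_t + R_c a_{t-1}) \tag{2.3.27}$$
$$i_t = \sigma_g(W_i x_t + R_i a_{t-1}) \tag{2.3.28}$$
$$c_t = \hat{c}_t \odot i_t + c_{t-1} \tag{2.3.29}$$
$$o_t = \sigma_g(W_o x_t + R_o a_{t-1}) \tag{2.3.30}$$
$$a_t = \sigma_s(c_t) \odot o_t \tag{2.3.31}$$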
with σs and σg being the activation functions for the signal and gates respectively. σg is
the standard logistic sigmoid in most cases. xt denotes the activations from the previous
layer, at the activations of the LSTM layer and ct the internal cell state of the LSTM layer.
W and R denote learnable weights to the previous layer and for the recurrent connection.
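One time step of Equations 2.3.27 to 2.3.31 can be sketched as follows. The sketch assumes the hyperbolic tangent as signal activation σs, which is a common choice but is not fixed by the equations, and the logistic sigmoid as gate activation σg. The function names and the dictionary of weight matrices are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W, x, R, a):
    """Computes W x + R a for matrices given as lists of rows."""
    return [sum(w * xj for w, xj in zip(w_row, x)) +
            sum(r * aj for r, aj in zip(r_row, a))
            for w_row, r_row in zip(W, R)]

def lstm_step(x_t, a_prev, c_prev, p):
    """One step of the original LSTM with input and output gate, no forget gate.
    p: dict with the weight matrices Wc, Rc, Wi, Ri, Wo, Ro."""
    c_hat = [math.tanh(v) for v in affine(p["Wc"], x_t, p["Rc"], a_prev)]  # (2.3.27)
    i_t = [sigmoid(v) for v in affine(p["Wi"], x_t, p["Ri"], a_prev)]      # (2.3.28)
    c_t = [ch * i + c for ch, i, c in zip(c_hat, i_t, c_prev)]             # (2.3.29)
    o_t = [sigmoid(v) for v in affine(p["Wo"], x_t, p["Ro"], a_prev)]      # (2.3.30)
    a_t = [math.tanh(c) * o for c, o in zip(c_t, o_t)]                     # (2.3.31)
    return a_t, c_t
```

The previous cell state enters the new cell state by plain addition, without any weight in between.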
The above LSTM formulation uses the constant error carousel, which means that the
cell state from the previous time step is added to the current cell state as-is, that is without
multiplication with weights. Through this constant error carousel mechanism, the gradient
of the cell state will always be exactly one along the time dimension. This constant prop-
agation of the cell state is one part for preventing the vanishing and exploding gradient
effects during training. The second part to this end are the gated connections along the
input and output gates. These are learnable gates that adaptively prevent modification of
the cell state (input gate) or modification of the LSTM activation (output gate), effectively
reducing the LSTM block to only the constant error carousel if both gates are closed.
Since both gates are optimized using backpropagation and gradient descent, they learn
to only propagate signals in or out of the LSTM cell when necessary for solving the task at
hand. This cuts off the gradient in and out of the LSTM cell when closed, preventing the
vanishing and exploding gradient effects. LSTM blocks are thus only expected
to show the vanishing or exploding gradient effect if one of the input or output gates
is open for a long period of time.
Figure 2.3.17: Original long short-term memory formulation using two gated connections and a
constant error carousel.
Variants[48] of the original LSTM formulation include the forget gate[39], which con-
trols the constant error carousel by preventing information flow along the time dimension.
This effectively allows the LSTM cell to reset its memory. Variations also include ‘peep-
hole connections’[40], which add weighted connections from the LSTM cell state c (before
activation σ) to the gates.
So far we have discussed RNNs and LSTMs that are unidirectional in one dimension
and are in theory processing infinite sequences. The one exception so far is BPTT which
requires finite sequences in order to ‘unroll’ the RNN structure and apply standard
backpropagation to it. Many ML problems, however, deal with sequences that are finite by
nature and where the whole sequence is observed from the beginning. This allows
bi-directional RNNs[128] to be applied to the data. Bi-directional RNNs consist of two RNN
layers that process the same input sequence independently of each other. One of the two RNN
layers processes the sequence from start to end, the other one in reverse from end to
start. Typical implementations of bi-directional RNNs feed the same input sequence into
both RNN layers, then process the sequence in both directions and finally concatenate
the two output sequences along the feature-/neuron-dimension. This results in a ‘black
box’ bi-directional RNN layer that can be plugged into a DNN.
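The wiring of a bi-directional layer can be sketched independently of the concrete recurrent layer. In the sketch below, `forward_layer` and `backward_layer` are hypothetical stand-ins for two RNN layers, and `running_sum` is a toy recurrence used only to make the data flow visible.

```python
def bidirectional(xs, forward_layer, backward_layer):
    """Both layers map a sequence of vectors to an equally long sequence of vectors."""
    fwd = forward_layer(xs)
    bwd = backward_layer(xs[::-1])[::-1]      # process end to start, restore the order
    return [f + b for f, b in zip(fwd, bwd)]  # concatenate along the feature dimension

def running_sum(xs):
    """Toy recurrent layer: each output is the sum of all inputs seen so far."""
    total, out = 0.0, []
    for x in xs:
        total += x[0]
        out.append([total])
    return out

print(bidirectional([[1.0], [2.0], [3.0]], running_sum, running_sum))
# [[1.0, 6.0], [3.0, 5.0], [6.0, 3.0]]
```

Every output position combines information from the start of the sequence up to that position with information from the end of the sequence down to that position.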
This work deals with image data and two-dimensional feature maps in DNNs. As
such we need to discuss a method for applying LSTMs to two-dimensional feature
maps. Multi-dimensional long short-term memory (MDLSTM)[44] is a type of layer that
extends the bi-directional RNN to arbitrary numbers of dimensions in LSTMs. MDLSTM
processes the input in $2^n$ different orderings with n being the number of temporal and
spatial dimensions in the feature map. Bi-directional LSTM processes a one-dimensional
sequence in two orderings, whereas MDLSTM for two-dimensional feature maps, e.g.
images, processes the input in four different orderings. For this, the LSTM formulation is
extended to multiple recurrent dimensions:
• The cell input, the input gate and the output gate receive recurrent input along all
dimensions.
• The forget gate, if used in the topology, only has recurrent connections along the
dimension it modulates. This means that there needs to be one forget gate for each
dimension.
Figure 2.3.18 shows the four orders of processing for MDLSTM over a two-dimensional
feature map. The LSTM activations from the four passes are concatenated at the end
to obtain the overall result feature map. Please note that at each position in the feature
map, MDLSTM actually has two predecessor states. This makes MDLSTM a powerful
model since it allows context information from the preceding rectangle (in 2D)
within the feature map to be utilized. All processing orders of MDLSTM in combination allow
the full feature map to be used as input at any point of the following layer. This
characteristic of MDLSTM enabled several state of the art models in document analysis and
natural language processing (NLP), but also adds the drawback of being hard to implement on
GPGPU systems. This is because each MDLSTM state in the feature map is dependent on one
neighboring state along each of the n dimensions of the feature map. As a consequence,
no two or more parts of one feature map can be computed independently of each
other, which is the property that would allow GPGPU hardware to be fully utilized.
Figure 2.3.18: The four orderings of processing in MDLSTM over a two-dimensional feature map.
Separable multi-dimensional long short-term memory [156] is a simplification of the
MDLSTM concept that allows the application of LSTM networks to multi-dimensional
problems but reduces the recurrent connection in the RNN to only one dimension.
Separable MDLSTM processes a feature map in 2n different orderings with n again being
the number of temporal and spatial dimensions in the feature map. It processes each
dimension in a bi-directional fashion independently from other dimensions. For example
in a 2D image, each row and each column is treated as an independent sequence and
processed on its own with a bi-directional LSTM.
Figure 2.3.19 shows the four orderings of processing in a separable MDLSTM for a
two-dimensional feature map. As in MDLSTM, the activations from the individual LSTM
runs are concatenated along the feature dimension to acquire the final output feature
map. Compared to MDLSTM, separable MDLSTM has less context information ‘to
work with’ for predictions and is thus overall less powerful. This is because
each prediction is based only on a one-dimensional slice of the feature map
and not on a rectangle (in two-dimensional feature maps). On the other hand,
separable MDLSTM reduces the number of processing orderings for the LSTM cells, thus
reducing the overall computational runtime. Also, the fact that each one-dimensional slice
of the feature map is processed independently again allows for parallelization using
GPGPU hardware.
Figure 2.3.19: The four orderings of processing in separable MDLSTM over a two-dimensional
feature map.
An overview of many LSTM variants can be found in published literature[39, 44, 48,
55, 64, 156].
Overfitting and Regularization
As we have discussed before, deep neural networks are trained by optimizing their
parameters towards minimizing a specified loss function using backpropagation and gradient
descent. This optimization is done over a finite data set of examples from the ML
problem at hand, which means that there will be only a finite number of observations
from the feature space. This poses two problems: One, there will always be unknown
examples that the neural network (or ML model in general) has not seen before and we
want it to generalize correctly from the finite training set to unseen examples. Two, the
learning capacity of the ML model might be large enough to allow recognizing specific
examples from the training set without actually learning the discriminating features that
lead to a correct classification, regression or prediction in general. These two problems
will lead to cases where the error rate of the model predictions over the training set is lower
than over an independent validation or evaluation data set. This is called overfitting or
generalization error. Reducing overfitting and thus the generalization error of a model is
called regularization and is commonly done by introducing additional constraints to the
optimization procedure. The goal while training a ML model is to reach a satisfying error
rate on a data set with examples unseen during training in order to gain the confidence
that the model will make correct predictions on further unseen examples. We will now
discuss three regularization techniques used while training deep neural networks that
facilitate the reduction of overfitting: early stopping, dropout and L1/L2 regularization.
Early stopping[14, 103] is a rather simple regularization technique that can be imple-
mented with a variety of ML models. For early stopping we need to split the available data
set into three disjoint parts, the training set, validation set and evaluation set. The training
set is commonly the largest one and is used for automatic optimization of the ML model
using, in the case of neural networks, backpropagation and gradient descent. Calculation
of the error rate or loss value is done with the current ML model on both the training set
and validation set at regular intervals. What is commonly observed in the case of overfitting
in deep learning is that in the beginning of the training, both the training and validation
sets see a reduction of the error rate. However, at some point the error rate
on the training set will continue to decrease but start to increase on the validation set.
This is the point where overfitting sets in. Early stopping is a simple mechanism where the
training process is stopped after the error rate on the validation set has not decreased
for a certain amount of time. If this is the case, training is stopped and the ML model
with the minimal achieved error rate on the validation set is used. This carries the risk
that the ML model is now overfitting on the validation set because we have chosen the
parameters that produce the minimal error rate on the validation set. This is why there is
need for a third data set, the evaluation set, which is evaluated once after early stopping
and which gives us the error rate that we can expect on further unseen data. In practice,
the error rates on the validation and evaluation sets will often be similar to each other,
but it is still necessary to keep this new source of overfitting in mind and evaluate the ML
model accordingly.
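The early stopping procedure can be sketched as a generic training loop. The callables `train_step` and `validation_error` as well as the `patience` parameter are hypothetical placeholders for the actual training and evaluation code, and the ‘certain amount of time’ is assumed here to be measured in epochs.

```python
def early_stopping(train_step, validation_error, patience, max_epochs):
    """train_step(): runs one training epoch and returns the current model parameters.
    validation_error(params): error rate of these parameters on the validation set.
    Stops once the validation error has not improved for `patience` epochs."""
    best_error, best_params, stale_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        params = train_step()
        error = validation_error(params)
        if error < best_error:
            best_error, best_params, stale_epochs = error, params, 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break
    # The returned model is the one with the minimal validation error.
    return best_params, best_error
```

The evaluation set is deliberately not part of this loop. It is only used once on the returned parameters to estimate the error rate on further unseen data.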
Dropout [38, 102, 137] is a regularization technique specific to MLPs and derived
models, such as CNNs and RNNs. The trick in dropout learning is to reduce the model
capacity by randomly removing (‘dropping’) parts of the model during training, thus re-
ducing the sensitivity to specific examples by enforcing redundancy of general features
within the data. Dropout is implemented as additional layer(s) in a MLP and is said to be
applied to the layer previous to the dropout layer. Dropout is applied during training by
randomly setting activations of the previous layer to zero according to
$$f_T(x, p) = x \odot B(n = |x|, 1 - p) \tag{2.3.32}$$
where x are the activations of the previous neural layer and B is a binary mask of n
independent random draws, each being one with probability 1 − p, so that p is the
probability of dropping an individual neuron. n is equal to
the number of neurons in the previous layer. p is also called the ‘dropout rate’ since it
defines how frequently individual neurons are removed from the model. A dropout rate
of p = 0.5 means that in every forward pass, on average only half of the neurons are
actually used in the model. A lower dropout rate removes fewer neurons from the forward
passes and p = 0 deactivates dropout completely. Outside of training, no neuron activations
are dropped and all learned redundancies are used. Not dropping any neurons
during inference generally leads to a much higher activation in inference than in training
since no inputs of the next layers’ linear combination are set to zero. To counter this,
dropout during inference scales the activations down to their expected value during training:
$$f_I(x, p) = (1 - p)\,x \tag{2.3.33}$$
Dropout thus is implemented differently during training, see function fT, and inference,
see function fI. fT randomly removes parts of the neural network, whereas fI scales the
activations to make sure that the linear combination of the following neuron layer stays in
the same value range.
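The two modes of dropout can be sketched as follows. The sketch assumes that no rescaling takes place during training, so that the expected value of an activation during training is (1 − p) times its original value and inference multiplies by the keep probability 1 − p to stay in the same value range. Many libraries instead rescale during training and leave inference untouched. The function names are hypothetical.

```python
import random

def dropout_train(x, p, rng=random):
    """Sets each activation to zero with probability p (the dropout rate)."""
    return [0.0 if rng.random() < p else v for v in x]

def dropout_inference(x, p):
    """Keeps all activations but scales them by the keep probability 1 - p,
    which matches the expected value of dropout_train."""
    return [(1.0 - p) * v for v in x]

rng = random.Random(0)
activations = [1.0] * 10000
mean_train = sum(dropout_train(activations, 0.3, rng)) / len(activations)
mean_inference = sum(dropout_inference(activations, 0.3)) / len(activations)
print(mean_train, mean_inference)  # both close to 0.7
```

On average both modes pass the same total activation to the following layer, which is the property the inference scaling is meant to ensure.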
The last techniques for regularization in deep neural networks that we will discuss
here are L1 regularization[115, 145] and L2 regularization[56]. Both are based on the
observation that overfitting in artificial neural networks often means that parts of the net-
work get more and more sensitive to specific patterns that occur within the training data.
In practice this means that the absolute values of the weights in the neural network
increase more and more. Intuitively, as the weights of the neural network increase in
magnitude, the activations of the linear combination will be in saturation on either side
of the activation function (if the activation function in use saturates). At some point in
training, activations will always be either in saturation on one side or the other of the
activation function or always tend to positive or negative infinity. This will lower the training
error significantly if the network capacity is large enough since parts of the network will
be sensitive to those patterns and activations will tend towards one saturation if a specific
pattern is shown to the network. One could say that the DNN in this case actually has a
‘grandmother neuron’. To counter this effect, L1 and L2 regularization impose a penalty
on network weights with large magnitude. Implementations either add a penalty term to
the loss function or, in the case of L2 regularization in ANNs, apply weight decay [74]. Weight
decay is a functionally equivalent formulation of L2 regularization. The loss function
L = Ltask + λ||W||2 (2.3.34)
with penalty term λ||W||2 adds L2 regularization to the loss function as defined by the
ML task at hand. Factor λ controls how strong the regularization effect is during training.
Setting λ = 0 will disable regularization, whereas a large λ will force all network weights
in weight set W to be at or near zero, thus inhibiting any learning at all. In a similar way,
L1 regularization is modeled as loss
L = Ltask + λ||W||1 (2.3.35)
with penalty term λ||W||1. The practical difference between L1 and L2 regularization is
that L1 in DNNs tends to eliminate connections from the network completely by driving
their weights to zero. It thus acts as a sort of automatic feature selection. L2 tends
towards keeping the network weights at a low magnitude without elimination, thus reducing
the sensitivity to specific patterns without eliminating features.
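The two penalty terms and the relation between L2 regularization and weight decay can be sketched as follows. The sketch assumes that the L2 penalty is the squared L2 norm and that parameters are updated by plain gradient descent; under these assumptions both update rules produce the same weights. All function names are illustrative.

```python
def l1_penalty(weights, lam):
    # lambda * ||W||_1: sum of absolute weight values.
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    # lambda * ||W||^2: squared L2 norm (an assumption of this sketch).
    return lam * sum(w * w for w in weights)

def step_with_l2_penalty(weights, task_grads, lam, lr):
    # Gradient step on L_task + lam * ||W||^2; the penalty adds 2 * lam * w.
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, task_grads)]

def step_with_weight_decay(weights, task_grads, lam, lr):
    # The same step written as weight decay: shrink each weight, then follow
    # the gradient of the task loss alone.
    decay = 1.0 - 2.0 * lr * lam
    return [decay * w - lr * g for w, g in zip(weights, task_grads)]

weights = [0.5, -2.0, 0.0]
task_grads = [0.1, -0.3, 0.2]
via_penalty = step_with_l2_penalty(weights, task_grads, lam=0.01, lr=0.1)
via_decay = step_with_weight_decay(weights, task_grads, lam=0.01, lr=0.1)
```

For optimizers with adaptive step sizes the two formulations are generally no longer identical, which is why the equivalence is stated here for plain gradient descent only.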
2.4 Expectation-Maximization
[Figure 2.4.1 (cycle diagram): the E-step estimates the latent variables z while the model
parameters W are held constant; the M-step updates the model parameters W while the latent
variables z are held constant.]
Figure 2.4.1: Overview over expectation-maximization.
We will now discuss the expectation-maximization (EM) algorithm, first proposed by
Dempster et al.[26] as an iterative algorithm for maximum likelihood optimization under
incomplete data. Maximum likelihood optimization in machine learning scenarios refers
to the optimization of a parameter set W with regard to maximizing the probability of
observing specific data x. Since the logarithm is monotonic, it does not change the
location of the maximum, and since many optimization problems are convex only after
logarithmic transformation, it is prudent in these cases to maximize the log-likelihood.
This corresponds to
\[ W^\star = \arg\max_W \sum_i \log P(x_i \mid W) \tag{2.4.1} \]
as the solution for W. Some tasks include latent variables z that may be marginalized in
order to implement maximum likelihood training, leading to
\[ W^\star = \arg\max_W \sum_i \log \Big[ \sum_{z_i} P(x_i, z_i \mid W) \Big] \tag{2.4.2} \]
for the modified solution with latent variables zi. This marginalization is suitable in cases
where variables zi can be observed. In some cases the latent variables zi are either
unobserved or are themselves restricted by constraints that need to be observed.
Expectation-maximization optimizes toward the maximum likelihood solution by iter-
atively finding the expectation value for the latent variables zi and then optimizing the
parameters W given the current expectation values for zi. This approach is iteratively
repeated until the latent variables do not change anymore or some other suitable conver-
gence criteria are met. This basic loop is visualized in Figure 2.4.1.
Setting up an expectation-maximization optimization requires an objective function or
auxiliary function. Ideally we would like to maximize the complete data log-likelihood
\[ P(W) = \sum_i \log P(x_i, z_i \mid W) \tag{2.4.3} \]
of observing the data examples xi and corresponding latent variables zi, given our model
parameters W. This cannot be done for tasks where expectation-maximization is utilized,
since those tasks are exactly those that do not allow observing the latent variables zi.
We define the objective function as the expected complete data log-likelihood
\[ J(W, W^{old}) = E[P(W) \mid X, W^{old}] \tag{2.4.4} \]
which will be maximized instead. In the E-step we will now compute the latent variables
Z based on model parameters Wold and the observed data X. The M-step will maximize
the objective function by choosing new model parameters W. This iterative optimization
is repeated until convergence.
Please note that in this section we will discuss both maximizing and minimizing the
objective function J. Which of the two applies is task dependent, and the
expectation-maximization approach in general is unchanged by this.
The equations and examples in this section on expectation-maximization, including
the above ones, are based on, but not necessarily identical to, the ones given in the
corresponding chapters by Christopher Bishop[7, ch. 9] and Kevin Murphy[93, ch. 11.4].
Gaussian Mixture Model
We will discuss expectation-maximization by using Gaussian mixture models (GMMs)
or mixture of Gaussians[90] as a basic example on how to apply EM to probabilistic
optimization problems with latent variables. Mixture models in general describe multi-
variate distributions by linear combination of base distributions. In case of GMMs, the
base distribution is a multi-variate Gaussian distribution and a total of K distributions are
mixed to model the K data clusters of the mixture model. Expectation-maximization is
applied to fit the mixture model to a finite set x of observed data.
The likelihood of observing a specific data point xi in a Gaussian mixture model is
defined by the linear combination
\[ P(x_i \mid W) = \sum_k^K \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \tag{2.4.5} \]
where π are the coefficients for the linear combination of the individual Gaussian distri-
butions and W := {(πk, µk,Σk)|k ∈ [1,K]} are the model parameters for the K clusters
of the GMM.
The expected complete data log-likelihood of a GMM is
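\[ \begin{aligned}
J(W, W^{old}) &:= E\Big[ \sum_i^N \log P(x_i, z_i \mid W) \Big] \\
&= \sum_i^N \sum_k^K \big[ r_{i,k} \log \pi_k \big] + \sum_i^N \sum_k^K \big[ r_{i,k} \log P(x_i \mid \mu_k, \Sigma_k) \big]
\end{aligned} \tag{2.4.6} \]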
with zi := {ri,k|k ∈ [1,K]} being the latent variables in form of the responsibility of cluster
k for generating data point xi.
The E-step in expectation-maximization for Gaussian mixture models is the calcula-
tion of these cluster responsibilities:
\[ r_{i,k} = \frac{\pi_k P(x_i \mid W_k^{old})}{\sum_{k'} \pi_{k'} P(x_i \mid W_{k'}^{old})} \tag{2.4.7} \]
The M-step is to maximize the log-likelihood of Equation 2.4.6 by choosing new model
parameters W while keeping the latent variables z constant. The mixture coefficient πk
of cluster k is simply its mean responsibility for generating the observed data:
\[ \pi_k = \frac{1}{N} \sum_i^N r_{i,k} \tag{2.4.8} \]
Maximizing J(W,Wold) with respect to µk and Σk, that is, maximizing the expression
$\sum_i \sum_k r_{i,k} \log P(x_i \mid \mu_k, \Sigma_k)$ of Equation 2.4.6, completes the
M-step in GMMs. This maximization yields
\[ \mu_k = \frac{\sum_i r_{i,k} x_i}{\sum_i r_{i,k}} \tag{2.4.9} \]
and
\[ \Sigma_k = \frac{\sum_i r_{i,k} x_i x_i^T}{\sum_i r_{i,k}} - \mu_k \mu_k^T \tag{2.4.10} \]
for the cluster center and covariance.
This process of iteratively maximizing J by applying the E-step and M-step is repeated
until the cluster responsibilities ri,k do not change significantly anymore, in which case
the M-step will also stabilize at the final cluster model parameters πk, µk and Σk.
Convergence is guaranteed since both the E-step and M-step do in fact maximize the expected
complete data log-likelihood J of Equation 2.4.6 in each iteration of the
expectation-maximization process.
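The E-step and M-step above can be sketched for the scalar case, where µk is a number and Σk reduces to a variance. This is a minimal sketch under stated assumptions: uniform initial mixture coefficients, unit initial variances, a fixed number of iterations and a small variance floor `min_var` are choices made here for illustration and are not part of the derivation.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def e_step(data, pis, means, variances):
    # Responsibilities r[i][k] as in Equation 2.4.7.
    resp = []
    for x in data:
        weighted = [p * gaussian_pdf(x, m, v) for p, m, v in zip(pis, means, variances)]
        total = sum(weighted)
        resp.append([w / total for w in weighted])
    return resp

def m_step(data, resp, min_var=1e-6):
    # Closed-form updates of Equations 2.4.8 to 2.4.10 for scalar data.
    n, k_count = len(data), len(resp[0])
    pis, means, variances = [], [], []
    for k in range(k_count):
        r_sum = sum(resp[i][k] for i in range(n))
        mean = sum(resp[i][k] * data[i] for i in range(n)) / r_sum
        second_moment = sum(resp[i][k] * data[i] ** 2 for i in range(n)) / r_sum
        pis.append(r_sum / n)
        means.append(mean)
        # The floor guards against collapsing variances (added safeguard).
        variances.append(max(second_moment - mean * mean, min_var))
    return pis, means, variances

def fit_gmm(data, means, iterations=50):
    k_count = len(means)
    pis = [1.0 / k_count] * k_count
    variances = [1.0] * k_count
    for _ in range(iterations):
        resp = e_step(data, pis, means, variances)
        pis, means, variances = m_step(data, resp)
    return pis, means, variances

data = [-0.2, 0.0, 0.1, 0.3, 9.8, 10.0, 10.1, 10.3]
pis, means, variances = fit_gmm(data, means=[min(data), max(data)])
```

On this toy data the two groups are far apart, so the responsibilities become nearly binary and the means settle close to the group averages 0.05 and 10.05.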
Generalized Expectation-Maximization
The above case of applying expectation-maximization to Gaussian mixture models
could be seen as true expectation-maximization since there exist closed-form solutions
for both the E-step and M-step, which in turn allows J to be fully optimized at each
iteration given the current assignment of latent variables. This may not always be the case
for the E-step and/or the M-step, and thus fully optimizing J at each iteration will not be
possible. Generalized expectation-maximization aims to still provide a framework for
finding maximum likelihood solutions with incomplete data by applying the EM algorithm to
these cases as well, but instead of minimizing J at each step we only try to decrease the
value of J at each step.
In this work the M-step will optimize the parameters of a deep neural network with the
goal of minimizing J . To this end, backpropagation and gradient descent will be applied
to the deep neural network while choosing a suitable surrogate loss function for the DNN.
Since gradient descent has only local information about the loss function, is only applied
to a small batch of data examples at a time and performs only a small step in weight
space, minimizing J in each M-step is intractable. Instead, gradient descent applied to
the deep neural network will likely reduce the value of objective function J by reducing
the value of a suitable surrogate loss function. This is not guaranteed, however, since
gradient descent for artificial neural networks may diverge or oscillate depending on the
‘loss landscape’ and step size in use.
Convergence to a local optimum still occurs on average in generalized
expectation-maximization even if no closed-form solutions to the E-step and M-step are
applied[93, ch. 11.4.5.2].
k-Means Clustering
The following paragraphs discuss k-means clustering as an example of generalized ex-
pectation-maximization. The generalization in k-means clustering is that it is not a prob-
abilistic model, but instead uses discrete cluster assignments between observed data
points and cluster centers. In this respect, k-means clustering can be seen as a
discretized variant of Gaussian mixture models. Let us define x as a set of points xi of our
data set. This data should be separated into K different clusters in such a way that dis-
tances between the points in the same cluster are smaller than between points in different
clusters. For this we will need to assign points to clusters by defining cluster assignments
ri,k, which is of value 1 for exactly one cluster k per point xi and 0 for all other clusters.
Clusters are defined by their centers µk.
In this case we want to optimize cluster centers µk in order to reflect the clusters
present in the observed data points xi by minimizing the intra-cluster distance between
points xi of the same cluster k. The unobserved variables are the cluster assignments
ri,k.
Expectation-maximization requires an objective function that will be minimized
iteratively. In k-means clustering we may choose to minimize the squared Euclidean distance
between the data points of a cluster and its center. This leads to
\[ J = \sum_i \sum_k r_{i,k} \lVert x_i - \mu_k \rVert^2 \tag{2.4.11} \]
as our objective function.
We start out by choosing initial cluster centers µk, for example a random selection
of k data points from x. Afterwards we will iteratively minimize objective function J by
first choosing new cluster assignments r that minimize J while keeping cluster centers µ
constant (E-step) and then minimizing J by choosing new cluster centers while keeping
assignments constant (M-step).
The E-step is easily performed by simply assigning each data point xi to the nearest
cluster center µk:
\[ r_{i,k} = \begin{cases} 1, & \text{if } \arg\min_n \lVert x_i - \mu_n \rVert^2 = k \\ 0, & \text{else} \end{cases} \tag{2.4.12} \]
In the M-step we want to minimize J given our current assignments ri,k of data points
xi to clusters k. We keep in mind that function J is the sum of the squared Euclidean
distances between data points and their assigned cluster centers, which is a convex
function of the cluster centers. Any stationary point is thus a minimum of function J. To
this end we differentiate J with respect to the cluster centers µk
\[ \frac{\partial J}{\partial \mu_k} = -2 \sum_i r_{i,k} (x_i - \mu_k) \tag{2.4.13} \]
and find centers µk where it is zero
\[ \frac{\partial J}{\partial \mu_k} = 0 \tag{2.4.14} \]
and thus a minimal point of the convex function J. This leads to
\[ \mu_k = \frac{\sum_i r_{i,k} x_i}{\sum_i r_{i,k}} \tag{2.4.15} \]
which simply is the average over the coordinates of data points xi assigned to cluster k.
Again this iterative expectation-maximization approach is repeated until the cluster
assignments ri,k do not change anymore.
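The alternation of E-step and M-step in k-means clustering can be sketched as follows. The function names, the tuple representation of points and the handling of empty clusters are assumptions of this sketch.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centers):
    # E-step (Equation 2.4.12): index of the nearest center for each point.
    return [min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
            for p in points]

def update(points, assignment, centers):
    # M-step (Equation 2.4.15): mean of the points assigned to each cluster.
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, a in zip(points, assignment) if a == k]
        if not members:
            new_centers.append(old_center)  # keep empty clusters unchanged (added safeguard)
            continue
        dims = len(old_center)
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(dims)))
    return new_centers

def k_means(points, centers, max_iterations=100):
    assignment = None
    for _ in range(max_iterations):
        new_assignment = assign(points, centers)
        if new_assignment == assignment:
            break  # assignments are stable, so the centers are as well
        assignment = new_assignment
        centers = update(points, assignment, centers)
    return centers, assignment

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0), (9.0, 10.0)]
centers, assignment = k_means(points, centers=[points[0], points[3]])
```

In this toy example the first three points end up in one cluster and the last three in the other, and each center is the coordinate-wise mean of its three points.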
Chapter 3
Related Work
3.1 Connectionist Temporal Classification
We will now discuss connectionist temporal classification (CTC)[46][43, ch. 7], a method
consisting of a loss function and a decoding function that solves the sequence labeling
problem for one-dimensional sequences. We will discuss the CTC loss function in detail
since it is crucial to understanding this solution to the one-dimensional sequence labeling
problem, but also since it provides the context for this thesis. One decoding function for
CTC will also be discussed in this section, with another decoding function being detailed
in the original publication[43, ch. 7.5.2].
The mathematical symbols and equations in this section closely follow the ones in the
original publications[46][43, ch. 7] by Alex Graves.
Collapse Layer
Before we begin to discuss the CTC loss function, we need to take a short look at the
deep neural networks typically employed together with CTC. The task for which the CTC
loss is designed is to predict a sequence of labels given an image (offline handwriting
recognition) or audio (speech recognition) as input. Both cases typically use deep neu-
ral networks based on LSTM[55] or MDLSTM[44] layers, although other topologies or
combinations with convolutional neural networks[78] are possible. Images of handwriting
or speech recordings are often multi-dimensional in their nature. Images consist of two
spatial dimensions, audio of one temporal dimension and the frequency domain. This
type of data can be processed by a DNN that consists of multiple layers of LSTM, MDL-
STM, convolutions or pooling and does not intrinsically pose a problem to modern DNN
architectures. However, the output of a forward pass of such a neural network will also
be multi-dimensional. The need to eliminate all but one dimension within the data arises
out of the fact that CTC processes one-dimensional sequences.
One possibility for eliminating the additional dimensions is to apply a so-called
collapse layer. A collapse layer sums up the predicted values along all but one dimension
in order to marginalize those dimensions, and as such effectively eliminates them while
maintaining differentiability of the overall DNN. The collapse function
\[ c_x = \sum_y i_{x,y} \tag{3.1.1} \]
with i being an image input and c being the collapse output, for example, marginalizes the
y-dimension. Marginalization by summation is a common approach for array dimensions
that are of a constant size. Variable array dimensions may be reduced by averaging or by
finding the maximum value. A collapse layer can be seen as a sort of ‘dynamic pooling’
with the pooling window always being the total extent of the dimension in question.
A collapse layer is then followed by a softmax layer, see Section 2.3, in order to
compute label probabilities. The overall network prediction y consists of one temporal
or spatial dimension, along which label probabilities are estimated for characters from an
alphabet.
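A collapse layer followed by a softmax layer can be sketched on nested lists. The layout `activations[x][y][k]` (position x, position y, label k) and the function names are assumptions of this sketch; in a real network the same operations act on tensors inside the computation graph.

```python
import math

def collapse(activations):
    # Equation 3.1.1: sum over the y-dimension, keeping x and the label axis.
    collapsed = []
    for along_y in activations:  # one entry per x position
        n_labels = len(along_y[0])
        collapsed.append([sum(cell[k] for cell in along_y) for k in range(n_labels)])
    return collapsed

def softmax(scores):
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

activations = [
    [[1.0, 0.0], [2.0, 0.5]],  # x = 0: two y positions, two labels
    [[0.0, 1.0], [0.0, 3.0]],  # x = 1
    [[0.5, 0.5], [0.5, 0.5]],  # x = 2
]
collapsed = collapse(activations)
prediction = [softmax(scores) for scores in collapsed]
```

The result `prediction` has one row per remaining position x, and each row is a probability distribution over the labels.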
Fundamental Probabilities
Let us start by defining the sequence labeling task. Let A be the set of glyphs of the
script that will be transcribed using CTC. In Latin or roman script this would be the Latin
alphabet, plus language- or region-specific symbols. For CTC to work we need to add
one more label to this set, which is the blank or glyph separator. This is a special artificial
label, the meaning of which we will discuss soon. From now on in this section, A will
contain both the visible glyphs of the script as well as the artificial blank. Connectionist
temporal classification employs a deep neural network (DNN) in order to transcribe texts.
The DNNs used in the original publications were LSTM and MDLSTM networks, but other
variants have been proposed[12, 104, 156] for use with CTC since then. Let x be the input
into the DNN and y = f(x,W) the prediction of the DNN using input x and parameter set W.
The DNN f has one output neuron per element of label set A and produces an output sequence
of length T. The output neurons of the DNN estimate the probabilities of each sequence
position in T belonging to one of the labels from A. As such, y is a real-valued array
$y \in \mathbb{R}^{T \times |A|}$ holding one probability distribution over A per time
step. Sequence labeling is to assign a sequence $l \in A^{|l|}$ to the time steps of
network prediction y in such a way that the probability P(l|x,W) is maximized.
The first step towards this goal is to define the probability for observing a specific path
or configuration π given the network output y. A configuration π is a label sequence of
length T over alphabet A with $\pi \in A^T$. We can easily define the probability
\[ P(\pi \mid y) = \prod_t^T y^t_{\pi_t} \tag{3.1.2} \]
for observing a specific path π. $y^t_s$ refers to the estimated probability of symbol s
occurring at time step t.
Next we will define a function F that maps configurations π to a label sequence l. This
mapping should allow for repetitions of the same glyph in adjacent characters while also
allowing the same character to stretch over multiple time steps. This mapping
is done by first collapsing multiple adjacent occurrences of the same symbol to exactly
one occurrence of the same symbol, e.g. F(aaabbaa) = aba. Next, artificial blank symbols
ϵ will be removed, e.g. F(aaaϵϵaa) = aa. This already shows the usefulness of
the blank symbol ϵ since it allows us to distinguish actual repetitions of the same symbol
from repetitions out of necessity to ‘fill up’ the T time steps. Overall an example would be
F(aaϵabbaa) = aaba, and of course many different configurations π will map to the same
label sequence l = F(π).
Since there are many different paths or configurations π that map to the same label
sequence l, and these paths are mutually exclusive, we can now also define the probability
\[ P(l \mid y) = \sum_{\pi : F(\pi) = l} P(\pi \mid y) \tag{3.1.3} \]
for observing a specific label sequence l given the network prediction y.
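The mapping F and the probabilities of Equations 3.1.2 and 3.1.3 can be illustrated by brute-force enumeration on a tiny example. In this sketch the character `-` stands in for the blank ϵ, the alphabet is passed as a string whose character order matches the columns of y, and all function names are illustrative. The enumeration visits all |A|^T paths and is therefore only usable for very small T.

```python
from itertools import groupby, product

BLANK = "-"  # stands in for the blank symbol ϵ

def collapse_path(path, blank=BLANK):
    # F: merge adjacent repetitions, then remove blanks.
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != blank)

def path_probability(path, y, alphabet):
    # Equation 3.1.2: product of the per-time-step symbol probabilities.
    prob = 1.0
    for t, symbol in enumerate(path):
        prob *= y[t][alphabet.index(symbol)]
    return prob

def label_probability(label, y, alphabet):
    # Equation 3.1.3: sum over all paths that F maps to the label sequence.
    return sum(path_probability(path, y, alphabet)
               for path in product(alphabet, repeat=len(y))
               if collapse_path(path) == label)

alphabet = "-a"
y = [[0.6, 0.4], [0.6, 0.4]]  # rows: time steps; columns: blank, 'a'
p_empty = label_probability("", y, alphabet)
p_a = label_probability("a", y, alphabet)
```

With two time steps the paths `-a`, `a-` and `aa` all map to the label sequence a, so its probability is 0.24 + 0.24 + 0.16, whereas only the path `--` maps to the empty sequence.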
Prototypical Decoding
This leads us to the prototypical formulation of what a decoding algorithm in the context
of CTC does. Decoding is an algorithm that, given network prediction y, finds the most
likely label sequence l⋆ or at least a label sequence with reasonably high probability. This
decoded label sequence is the overall result of the sequence labeling task, e.g. the most
likely transcription given a recording of spoken language, as is the case in voice
assistant devices. The label sequence
\[ l^\star = \arg\max_l P(l \mid y) \tag{3.1.4} \]
is thus the transcription of the network prediction y = f(x,W). This transcription method
is not computationally feasible since it would require enumerating all paths
$\pi \in A^T$, which can easily be a prohibitively large set. We will later in this section
discuss computationally feasible decoding algorithms for CTC.
Prototypical Loss
Training of a deep neural network (DNN) using CTC is done by gradient-based parameter
optimization for maximum likelihood of the true label sequence. Let us use S as a training
data set with (x, z) ∈ S being the input x and true label sequence z. y = f(x,W) is again
the DNN prediction using the parameters W that will be optimized in the process. The
loss function
\[ L = -\ln \prod_{(x,z) \in S} P(z \mid y) = -\sum_{(x,z) \in S} \ln P(z \mid y) = -\sum_{(x,z) \in S} \ln P(z \mid x, W) \tag{3.1.5} \]
is then minimal when the likelihood for predicting z is maximized. The question remains
how to evaluate the likelihood P (z|y) according to Equations 3.1.2 and 3.1.3 if doing
so would require the enumeration of all paths or configurations π that relate to label
sequence z. We can reduce the computational requirements for evaluation of Equation
3.1.3 by employing a dynamic programming approach similar to the forward-backward
algorithm for hidden Markov models[105].
Forward-Backward Algorithm
The forward-backward algorithm is based on the idea that both the temporal or spatial
dimension of the (in our case) deep neural network prediction and the target label sequence
can each be split into two disjoint parts. We can
then evaluate the probabilities of observing a specific label prefix at the beginning of the
sequence and of observing the corresponding label suffix at the end of it. Multiplication
of the probability for the prefix with the probability for the suffix yields the probability of
observing the overall label sequence l, restricted to configurations π that indicate the label
lu at time step t. Let U = |l| be the length of label sequence l and formalize this
observation as
\[ \begin{aligned}
P(l : l_u = \pi_t \mid y) &= \sum_{\pi' : F(\pi') = l \wedge \pi'_t = l_u} P(\pi' \mid y) \\
&= \Big( \sum_{\pi' : F(\pi') = l_{1:u}} \prod_{i=1}^{t} y^i_{\pi'_i} \Big) \Big( \sum_{\pi' : F(\pi') = l_{u+1:U}} \prod_{i=t+1}^{T} y^i_{\pi'_i} \Big) \\
&= \alpha(t, u) \beta(t, u)
\end{aligned} \tag{3.1.6} \]
with π′ being paths related to prefixes or suffixes of l and t, u being the points in the DNN
prediction and label sequence where the prefix or suffix starts or ends accordingly. In
reference to Figure 3.1.1 this is the probability of picking one node at position (t, u) and
calculating the probability of passing through that node.

[Figure 3.1.1 (lattice): horizontal axis: time steps of the DNN prediction, 1 to T; vertical
axis: label sequence ε H ε E ε L ε L ε O ε, 1 to U.]
Figure 3.1.1: Each connected path from top left to bottom right represents one path π where
F(π) is the label sequence ‘HELLO’. Calculating the probability for each path and
summing up all these yields the total probability of observing the label sequence
‘HELLO’.

The forward variable
\[ \alpha(t, u) = \sum_{\pi' : F(\pi') = l_{1:u}} \prod_{i=1}^{t} y^i_{\pi'_i} \tag{3.1.7} \]
and the backward variable
\[ \beta(t, u) = \sum_{\pi' : F(\pi') = l_{u+1:U}} \prod_{i=t+1}^{T} y^i_{\pi'_i} \tag{3.1.8} \]
will be evaluated in the following paragraphs using a dynamic programming approach.
Equation 3.1.3 can then be rewritten as
\[ P(l \mid y) = \sum_u^U P(l : l_u = \pi_t \mid y) = \sum_u^U \alpha(t, u) \beta(t, u), \quad \forall t \in [1, T] \tag{3.1.9} \]
by observing that the total label probability for l is the sum of the paths passing through
any position u in the label sequence at one time step. This is picking one vertical slice
at point t out of the graph in Figure 3.1.1 and summing up the probabilities of all paths
passing through this vertical slice.
Let l′ be the label sequence l with the glyph separator ϵ added in between every label
and also at the front and rear. This ϵ label is used to separate adjacent occurrences of
the same glyph, but also in order to fill up the time dimension up to T in case the label
sequence is shorter. As such the ϵ label is mandatory only between adjacent occurrences
of the same glyph and is optional otherwise. In other words, we could omit the ϵ glyph
in l′ whenever function F would still produce the correct true label sequence.
Applying a dynamic programming approach to calculating α first requires us to set
the initial probability for a prefix of one time step in size, then incrementally increase
from there until we reach the end of the label sequence l′ and the last time step T.
The initial probabilities for one time step can only be the first ϵ or the first visible glyph,
since otherwise we would have skipped the first glyph in the sequence. This means that
$\alpha(1, 1) = y^1_{l'_1} = y^1_\epsilon$, $\alpha(1, 2) = y^1_{l'_2} = y^1_{l_1}$ and
$\alpha(1, u) = 0, \forall u > 2$. The recursive formulation for α is then
\[ \alpha(t, u) = y^t_{l'_u} \sum_{i = \mathrm{head}(u)}^{u} \alpha(t - 1, i) \tag{3.1.10} \]
with head(u) being the first valid predecessor of position u in l′. This function is
\[ \mathrm{head}(u) = \begin{cases} u - 1, & \text{if } l'_u = \epsilon \text{ or } l'_{u-2} = l'_u \\ u - 2, & \text{else} \end{cases} \tag{3.1.11} \]
and allows jumps over ϵ labels or no jumps if the ϵ is mandatory because of repetitions.
We can apply the same dynamic programming approach to the backward variable β
by initializing β(T,U′) = 1, β(T,U′ − 1) = 1 and β(T, u) = 0, ∀u < U′ − 1 with U′ being
the length of augmented label sequence l′. Similar to α, but in reverse traversing order, is
the recursive formulation of
\[ \beta(t, u) = \sum_{i = u}^{\mathrm{tail}(u)} \beta(t + 1, i) \, y^{t+1}_{l'_i} \tag{3.1.12} \]
with tail(u) being the last valid successor of position u according to
\[ \mathrm{tail}(u) = \begin{cases} u + 1, & \text{if } l'_u = \epsilon \text{ or } l'_{u+2} = l'_u \\ u + 2, & \text{else} \end{cases} \tag{3.1.13} \]
which again models the mandatory ϵ labels.
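The two recursions can be sketched in a few lines. This is a minimal sketch under stated assumptions: labels are integer indices into the columns of y, index 0 is the blank, all indices are 0-based, positions outside of l′ are treated as having probability zero, and no rescaling or log-space arithmetic is applied, so it is only suitable for short sequences.

```python
BLANK = 0  # column index of the blank label (an assumption of this sketch)

def augment(label):
    # Build l' by inserting blanks between all labels and at both ends.
    augmented = [BLANK]
    for g in label:
        augmented += [g, BLANK]
    return augmented

def forward(y, label):
    # alpha[t][u] following Equations 3.1.10 and 3.1.11.
    lp = augment(label)
    T, U = len(y), len(lp)
    alpha = [[0.0] * U for _ in range(T)]
    alpha[0][0] = y[0][lp[0]]
    if U > 1:
        alpha[0][1] = y[0][lp[1]]
    for t in range(1, T):
        for u in range(U):
            skip_allowed = lp[u] != BLANK and u >= 2 and lp[u - 2] != lp[u]
            head = u - 2 if skip_allowed else u - 1
            total = sum(alpha[t - 1][i] for i in range(max(head, 0), u + 1))
            alpha[t][u] = y[t][lp[u]] * total
    return alpha

def backward(y, label):
    # beta[t][u] following Equations 3.1.12 and 3.1.13.
    lp = augment(label)
    T, U = len(y), len(lp)
    beta = [[0.0] * U for _ in range(T)]
    beta[T - 1][U - 1] = 1.0
    if U > 1:
        beta[T - 1][U - 2] = 1.0
    for t in range(T - 2, -1, -1):
        for u in range(U):
            skip_allowed = lp[u] != BLANK and u + 2 < U and lp[u + 2] != lp[u]
            tail = u + 2 if skip_allowed else u + 1
            beta[t][u] = sum(beta[t + 1][i] * y[t + 1][lp[i]]
                             for i in range(u, min(tail, U - 1) + 1))
    return beta

def label_probability(y, label, t):
    # Equation 3.1.9 evaluated at one (0-based) time step t.
    alpha, beta = forward(y, label), backward(y, label)
    return sum(a * b for a, b in zip(alpha[t], beta[t]))

y = [[0.2, 0.8], [0.5, 0.5], [0.1, 0.9]]  # rows: time steps; columns: blank, 'a'
label = [1, 1]                            # the label sequence "aa"
per_step = [label_probability(y, label, t) for t in range(len(y))]
```

For the label sequence aa and three time steps only the path a ϵ a is valid, so every entry of `per_step` equals 0.8 · 0.5 · 0.9. That the value is the same for every t is exactly the statement of Equation 3.1.9.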
Loss based on the Forward-Backward Algorithm
Equation 3.1.5 represents the loss function for CTC for full batch training on a training
data set S. The sample loss is thus
\[ L(x, z) = -\ln P(z \mid y) \tag{3.1.14} \]
with x being the example input, e.g. an image, and z being the true label sequence. By
assuming l = z during training and substitution of Equation 3.1.9 we obtain
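\[ L(x, z) = -\ln \sum_u^U \alpha(t, u) \beta(t, u) \tag{3.1.15} \]
and can begin computing the partial derivative
\[ \frac{\partial L(x, z)}{\partial y^t_g} \tag{3.1.16} \]
for glyph g at time step t in order to apply backpropagation and gradient descent for
optimizing the DNN parameters. We observe that
$\frac{\partial \ln x}{\partial x} = \frac{1}{x}$ and thus
\[ \frac{\partial L(x, z)}{\partial y^t_g} = -\frac{\partial \ln P(z \mid y)}{\partial y^t_g} = -\frac{1}{P(z \mid y)} \frac{\partial P(z \mid y)}{\partial y^t_g} \tag{3.1.17} \]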
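leads to the question of how to compute the partial derivative
$\frac{\partial P(z \mid y)}{\partial y^t_g}$. We further observe from Equation 3.1.6 that
\[ \frac{\partial \alpha(t, u) \beta(t, u)}{\partial y^t_g} = \begin{cases} \frac{\alpha(t, u) \beta(t, u)}{y^t_g}, & \text{if } g \in z \\ 0, & \text{else} \end{cases} \tag{3.1.18} \]
since glyphs g that are not contained in z do not influence the derivative. We are thus now
ready to complete the partial derivative $\frac{\partial L(x, z)}{\partial y^t_g}$. We will
sum the derivatives for individual label positions u which are the same g in order to
adhere to the above observation. This gives us
\[ \frac{\partial P(z \mid y)}{\partial y^t_g} = \frac{1}{y^t_g} \sum_{u : z_u = g} \alpha(t, u) \beta(t, u) \tag{3.1.19} \]
which can be substituted to obtain
\[ \frac{\partial L(x, z)}{\partial y^t_g} = -\frac{1}{P(z \mid y)} \frac{\partial P(z \mid y)}{\partial y^t_g} = -\frac{1}{P(z \mid y) \, y^t_g} \sum_{u : z_u = g} \alpha(t, u) \beta(t, u) \tag{3.1.20} \]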
and thus we have a partial derivative of the loss function at hand for applying to parameter
optimization.
Decoding Algorithms
So far we have discussed how to train a deep neural network for one-dimensional tran-
scription using CTC and how to derive the loss function for its training. The question re-
mains how to decode the network prediction in order to obtain the most likely (or a good
choice) label sequence after the training. As discussed above the prototypical decoding
algorithm should solve
\[ l^\star = \arg\max_l P(l \mid y) \tag{3.1.21} \]
which is computationally infeasible since it would require enumerating all possible paths
π through y. We will now discuss two approximations of this. The first is best path
decoding[43, ch. 7.5.1], which finds the path π⋆ with the highest probability and assumes
that this corresponds to the most likely label sequence. As such
\[ l^\star = F(\pi^\star) \tag{3.1.22} \]
with
\[ \pi^\star_t = \arg\max_g y^t_g \tag{3.1.23} \]
which means that π⋆ is simply the sequence of labels g with the highest probability
$y^t_g$ at their respective time step t. While this is simple and fast, it is prone to
errors in case that some correct glyphs are only weakly predicted.
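Best path decoding amounts to an argmax per time step followed by the mapping F. In this sketch the alphabet is a string whose character order matches the columns of y, `-` stands in for the blank ϵ, and the function name is illustrative.

```python
from itertools import groupby

def best_path_decode(y, alphabet, blank="-"):
    # Equation 3.1.23: most probable symbol at every time step.
    best_path = [alphabet[max(range(len(row)), key=row.__getitem__)] for row in y]
    # Equation 3.1.22: apply F (merge repetitions, then drop blanks).
    merged = [symbol for symbol, _ in groupby(best_path)]
    return "".join(s for s in merged if s != blank)

y_clear = [[0.1, 0.8, 0.1], [0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.2, 0.1, 0.7]]
decoded_clear = best_path_decode(y_clear, "-ab")

y_weak = [[0.6, 0.4], [0.6, 0.4]]  # columns: blank, 'a'
decoded_weak = best_path_decode(y_weak, "-a")
```

The second example illustrates the weakness described above: the best path consists of two blanks and decodes to the empty sequence with probability 0.36, although the label sequence a has the higher total probability 0.64 once all of its paths are summed.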
We will now briefly discuss the beam search decoding or prefix search decoding[43,
ch. 7.5.2] algorithm, which can avoid this shortcoming of weak predictions and find
the most probable label sequence given enough time. Beam search decoding builds a
prefix trie of known label sequences and updates the actual probabilities for them during
decoding. Decoding starts out at t = 1 with an empty trie (only the root node
representing the empty label sequence) and then incrementally processes paths through the
deep neural network prediction up until t = T while also incrementally updating the label
sequences in the prefix trie. At each time step t, all glyphs g ∈ A of the alphabet are
iterated, their estimated probabilities $y^t_g$ retrieved from the DNN and then appended to
all the label sequences in the prefix trie. If at this point two or more prefixes in the
trie collapse to one
when applying function F to them, they are actually collapsed to one label sequence and
their probabilities summed. This corresponds to multiple paths π with F(π) resulting in
the same label sequence. After each iteration the prefix trie is pruned to the top-n
(usually with 10 ≤ n ≤ 100) most probable prefixes in order to keep the runtime
requirements low. At t = T, the most probable label sequence in the prefix trie is the
solution l⋆ of the decoding process. Disabling the pruning after each iteration and always
keeping all label sequences and incrementally appending to them would yield the true most
probable label sequence as defined by Equation 3.1.4, but would also require enumeration
of all possible paths π, which number $|A|^T$ in total and thus can quickly grow to
computationally intractable numbers.
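A pruned prefix search of this kind can be sketched as follows. Note that this is one common formulation and not necessarily the exact procedure of the cited work: instead of an explicit trie it keeps a dictionary of prefixes and tracks, for each prefix, the probability of its paths ending in a blank and ending in a non-blank symbol, which is what allows merged repetitions and true repetitions to be told apart. The names and the default beam width are assumptions; `-` stands in for the blank ϵ.

```python
from collections import defaultdict

def prefix_beam_search(y, alphabet, beam_width=10, blank="-"):
    # prefix -> [probability of paths ending in blank,
    #            probability of paths ending in a non-blank symbol]
    beams = {"": [1.0, 0.0]}
    for row in y:
        candidates = defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_blank, p_non_blank) in beams.items():
            for symbol, p in zip(alphabet, row):
                if symbol == blank:
                    # A blank never changes the label sequence.
                    candidates[prefix][0] += (p_blank + p_non_blank) * p
                elif prefix and prefix[-1] == symbol:
                    # Directly repeated symbol: merged by F into the same prefix.
                    candidates[prefix][1] += p_non_blank * p
                    # Repetition after a blank: a genuinely new symbol.
                    candidates[prefix + symbol][1] += p_blank * p
                else:
                    candidates[prefix + symbol][1] += (p_blank + p_non_blank) * p
        ranked = sorted(candidates.items(), key=lambda item: sum(item[1]), reverse=True)
        beams = dict(ranked[:beam_width])  # prune to the most probable prefixes
    best_prefix, probs = max(beams.items(), key=lambda item: sum(item[1]))
    return best_prefix, sum(probs)

y = [[0.6, 0.4], [0.6, 0.4]]  # rows: time steps; columns: blank, 'a'
decoded, probability = prefix_beam_search(y, "-a")
```

On this small prediction nothing is pruned, so the search is exact: it returns the label sequence a with its total probability 0.64, whereas picking the single most probable path would yield the empty sequence.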
Relation to this work
Connectionist temporal classification is based on the idea that given a label sequence l
and a deep neural network prediction of length T , one can compute the probabilities of
each possible path π with F (π) = l of length T and thus compute the alignment of the
label sequence l. CTC solves this task by implementing a forward-backward algorithm to
efficiently compute this alignment. In turn this allows to set up a loss function for gradient-
based optimization of the deep neural network or any other machine learning model that
is optimized by gradient descent. The transcription algorithm of CTC is completed by
following up the DNN prediction with a decoding algorithm that produces the most likely
label sequence from the DNN prediction.
CTC during training takes into account all possible paths π with F(π) = l and in this
manner solves the alignment task in an exact and optimal way. Probabilities for the
individual paths π, as well as for the overall label sequence l and, on the other side,
specific glyph probabilities at specific time steps P(l : lu = πt|y), see Equation 3.1.6,
are exact under the assumptions laid out before. The drawback is that the forward-backward
algorithm can only be applied to one-dimensional sequences since at each time step t it
requires the exact probabilities for all prefixes from [1, t − 1] and all suffixes from
[t + 1, T]. This is only possible if the prefix and suffix are conditionally independent
given time step t. Looking at Equation 3.1.6 this is obviously the case in one-dimensional
sequences. It does however not apply to multi-dimensional sequences.
Sections 4.3 and 6.4 further detail this problem. Multi-dimensional connectionist clas-
sification improves on this point by providing an approximate solution to the sequence
labeling task in multi-dimensional spaces.
3.2 Paragraph Transcription using Attention Networks
So far we have discussed connectionist temporal classification, a loss function and de-
coding algorithms which together solve the sequence labeling task for one-dimensional
sequences. Since the forward-backward algorithm for computing the alignment between
the target label sequence and the deep neural network prediction is based on the fact
that both need to be one-dimensional, it cannot be easily transferred to multi-dimensional
problems, e.g. labeling paragraphs of multiple text lines. We will now discuss one method
for the application of CTC to multi-dimensional problems by implicitly converting it to a
one-dimensional sequence.
As we have discussed before, see Section 3.1, deep neural networks for CTC typ-
ically use a collapse layer followed by a softmax layer, see Section 2.3, to marginalize
all but one dimension of the input data and produce a one-dimensional sequence of
label probabilities as prediction. We recall that the collapse layer marginalizes dimen-
sions by summing up along them, effectively removing them from the prediction. The
works[8][9] of T. Bluche et al. replace this non-parameterized collapse layer by a collapse
function based on attention networks in order to transform a multi-dimensional input to a
one-dimensional prediction while allowing for complex relationships between the output
sequence order and the spatial locations within the input. This modified collapse function
based on attention networks is then applied to multi-line paragraph transcription based
on the CTC loss.
Attention networks[35] are a class of recurrent neural networks that try to mimic cog-
nitive focus or attention by depending only on a small subset of the data available at each
time step, but moving this focus to another subset in the input data at each time step. Se-
lecting the subset is done based on the attention of the previous time step, as well as the
input data itself and possibly the previous prediction. Applied to image data, an attention
network selects a set of pixels (spatial positions) at each time step and computes its
prediction based on these spatial positions. The selection of pixels is then moved to other
positions and the whole process is repeated.
Attention networks are regularly applied to e.g. images[158] or language transla-
tion[3].
Attention Networks on Images
We will now discuss this type of attention networks applied to image data. Figure 3.2.1
serves as an overview of this type of attention in deep neural networks. Let us begin with
the input x, which is an image with two spatial dimensions. This input x may be encoded
by using a convolutional neural network or a recurrent neural network in order to obtain
encoded features that are meaningful to the task at hand. The encoder artificial neural
network
\[ x' = \mathrm{Encoder}(x, W_e) \tag{3.2.1} \]
produces the encoded feature maps x′ based on the image input x and the encoder
network parameters We. If no encoder network is employed, we can assume x′ = x.
The next step in an attention network is the modeling of attention $a^t$ on a subset of the
encoded data at each time step t. This again is modeled as an artificial neural network. It
is important to note that the spatial dimension and the size of the attention must be equal
to those of the encoded data. The attention network
\[ a^t = \mathrm{Attention}(x', a^{t-1}, W_a) \tag{3.2.2} \]
models the attention based on the encoded data x′, its own attention $a^{t-1}$ of the
previous time step and parameter set Wa. Attention $a^t$ is of the same spatial resolution
as the encoded data x′, but has only one feature map. This feature is bound to the value
range [0, 1] to model a two-class classification problem. As such, the standard logistic
sigmoid function would be a suitable choice for the final activation function in the
attention network. Attention near or at 1 models focused attention to this point, whereas
attention near or at 0 ignores this position.
We now perform a coefficient-wise multiplication of the attention at and the encoded
features x′
s_t = \sum_{i}^{I} \sum_{j}^{J} a^{t}_{i,j} \, x'_{i,j} (3.2.3)
while collapsing the two spatial dimensions I and J. Feature vector st is now a selection
of the features from the encoded data x′, but with dependency on the current attention
focus.
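The collapse of Equation 3.2.3 can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not part of the original text: the encoded data x′ is stored as an array of shape (I, J, C) with C feature maps, the attention is produced by a logistic sigmoid on random values, and the names collapse_with_attention and sigmoid are hypothetical.

```python
import numpy as np

def collapse_with_attention(encoded, attention):
    """Equation 3.2.3: s_t = sum_i sum_j a_t[i, j] * x'[i, j].

    encoded:   array of shape (I, J, C), the encoded feature maps x'
    attention: array of shape (I, J) with values in [0, 1]
    returns:   feature vector s_t of shape (C,)
    """
    if encoded.shape[:2] != attention.shape:
        raise ValueError("attention must match the spatial size of the encoded data")
    # Weight every spatial position by its attention, then sum over I and J.
    return np.einsum("ij,ijc->c", attention, encoded)

def sigmoid(z):
    """Standard logistic sigmoid, bounding the attention to [0, 1]."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
encoded = rng.normal(size=(4, 6, 3))          # hypothetical x' with 3 feature maps
attention = sigmoid(rng.normal(size=(4, 6)))  # hypothetical a_t
s_t = collapse_with_attention(encoded, attention)
```

If the attention is exactly 1 at a single position and 0 everywhere else, the collapsed vector is simply the feature vector of that position, which matches the reading of attention as a soft selection.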
Figure 3.2.1: Attention network applied to offline handwriting recognition. In this network each
attention step processes one character. Processing a paragraph line-by-line is also
possible. [Diagram: the image x of handwritten text is passed through the encoder network
to obtain the encoded image x′. The attention network computes the attention at of the
current step from x′ and the stored attention at−1 of the last step. The weighted features
are summed to the collapsed features st, st+1, st+2, ... and the decoder network produces
the label probabilities yt, yt+1, yt+2, ..., to which the CTC loss is applied.]
The final step in the attention network is to decode the selected feature vector st in order
to produce the prediction related to the task at hand. The decoder network
y = Decoder(st,Wd) (3.2.4)
with its parameter set Wd is typically a multi-layer perceptron, see Section 2.3, since
those networks are well suited to predictions on feature vectors. Another possibility would
be to use a recurrent neural network or LSTM network as the decoder and treat each time
step of the attention network as one time step of the RNN.
Attention networks consist of the three - or two, without encoder - networks Encoder,
Attention and Decoder, each differentiable and combined in a way that allows the full
attention network to be differentiable. This allows parameter optimization of We, Wa and
Wd using backpropagation, see Section 2.3, and gradient descent, see Section 2.3.
Paragraph Transcription
We have discussed attention networks on image data and seen that at each time step
t, the attention mechanism selects a subset of pixels from the image and collapses the
two spatial dimensions I and J. The attention network effectively transforms the
two-dimensional image into a one-dimensional sequence of characters. We will now briefly
discuss the details of a variant of this attention network used for line-wise paragraph
transcription[8]. This network reads one text line, instead of one character, per attention
step and is trainable by CTC. In the work[8], the attention mechanism is applied for a
constant number of steps, each transcribing one text line, and all these text lines are
concatenated to one sequence in order to apply CTC to the full text at once. The mechanism
could also be modified to predict the end of the paragraph and to stop once it is reached.
The first step is to apply a hybrid MDLSTM+CNN encoder
x′ = Encoder(x,We) (3.2.5)
to the input image. This is done once in order to extract meaningful features from the
image.
The attention network is an MDLSTM network and
at = Attention(x′, lt−1,Wa) (3.2.6)
estimates the position and extent of each text line. The last layer of the attention network
is a linear layer and as such at is a feature map with one unbounded scalar per pixel.
Feature map lt is at activated with a softmax activation function that is applied to each
pixel column, thus giving the probability for each pixel that it belongs to the current text line:
l^{t}_{i,j} = \frac{\exp a^{t}_{i,j}}{\sum_{k}^{J} \exp a^{t}_{i,k}} (3.2.7)
Indices i and j denote the column and row within the two-dimensional feature map. This
type of softmax function allows reading one text line per attention step, but on the other
hand limits the curvature of the text lines to a maximum of 45 degrees. This is the same
limitation as in the work of this thesis. The softmax activated attention map is then fed
back into the attention network for the next step.
The collapsing layer
s^{t}_{i} = \sum_{j}^{J} l^{t}_{i,j} \, x'_{i,j} (3.2.8)
reduces the two-dimensional feature map x′ to a one-dimensional sequence st which
denotes one text line. This process is repeated for a constant number of times and the
resulting collapsed text lines s1 to sn are concatenated to one sequence s.
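The column-wise softmax of Equation 3.2.7 and the collapse of Equation 3.2.8 can be illustrated as follows. This is a sketch under assumptions and not the implementation of the cited work[8]: arrays are indexed as [column i, row j], random values stand in for the output of the MDLSTM attention network, and all function and variable names are hypothetical.

```python
import numpy as np

def column_softmax(a):
    """Equation 3.2.7: softmax over the rows j of every pixel column i.

    a: array of shape (I, J), indexed as a[column, row].
    """
    shifted = a - a.max(axis=1, keepdims=True)  # subtract the maximum for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def collapse_line(encoded, l):
    """Equation 3.2.8: s_t[i] = sum_j l[i, j] * x'[i, j]; the result has shape (I, C)."""
    return np.einsum("ij,ijc->ic", l, encoded)

rng = np.random.default_rng(0)
num_columns, num_rows, num_features, num_lines = 8, 5, 3, 2
encoded = rng.normal(size=(num_columns, num_rows, num_features))  # hypothetical x'

lines = []
for t in range(num_lines):
    a_t = rng.normal(size=(num_columns, num_rows))  # stand-in for the attention network output
    l_t = column_softmax(a_t)
    lines.append(collapse_line(encoded, l_t))

# Concatenate s_1 ... s_n to one sequence s of length n * I.
s = np.concatenate(lines, axis=0)
```

Because the softmax normalizes over the rows of each column, every column contributes exactly one weighted feature vector per attention step, which is what restricts each step to a single text line.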
The decoder network
y = Decoder(s,Wd) (3.2.9)
is a bidirectional LSTM network that predicts the one-dimensional character sequence.
CTC is applied to this sequence y.
Image to Sequence Techniques
Another method[135] for paragraph-wise offline handwriting recognition using attention
networks was presented at the International Conference on Document Analysis and
Recognition (ICDAR) 2021. This method [135] applies a ResNet[51] encoder followed
by a self-attention decoder network to transform an image input to a one-dimensional se-
quence of labels. This inferred sequence may be of variable length and its end is signaled
by a special token reserved for this task. This method was specifically designed for the
transcription of tables and mathematical formulas.
This method shows good results when applied to paragraph-wise offline handwriting
recognition. On the other hand, the authors report high runtime requirements for tran-
scription. From the publication[135, p. 12]:
Inferencing (sic) takes an average of 4.6 seconds on a single CPU thread for a
set of images averaging 2500x2200 pixels, 456 chars and 11.65 lines without
model compression i.e., model pruning, distillation or quantization.
Relation to this work
The work using attention-based paragraph transcription[8, 9] directly addresses the same
problem as this work, that is multi-line offline handwriting recognition without explicit line
segmentation. We will directly compare the resulting transcriptions in Chapter 7. Newer
works[135] will also be included, although in a shorter form, in this comparison of Chapter
7.
The main difference between the approaches using attention networks and the work of
this thesis is the modeling of line transitions. In this work, the alignment for labeling
multi-line text is interpreted as an inference problem over a two-dimensional pixel space.
This requires modeling line transitions as one label class and probabilistically assigning
the line transition class to a connected path of pixels from left to right (in English
handwriting) in order to separate two lines in pixel space. More on this in Chapters 5 and
6. This translates to an extension of the CTC approach to multi-dimensional sequences
and spaces. It also makes a robust inference of the line separators necessary in order to
correctly separate and transcribe individual text lines.
The attention-based approach to multi-line transcription does not require modeling the
line separators. Instead, individual characters or lines are iterated one-by-one by moving
the attention of the network at each time step. The attention only needs to be at its spatial
position to transcribe the text covered by the current attention. A hard separation between
text lines does not seem to be strictly required and overlaps between the attention focus
of adjacent text lines seem to be possible.
Another difference lies in how the task is modeled. As we will see later (Chapters 5
and 6), the method proposed in this thesis is based on the idea of setting up an
expectation-maximization loop in order to solve the sequence labeling problem for
multi-dimensional spaces and label sequences. This allows modeling multi-line transcription
in the loss function and decoding of a deep neural network. On the other hand, the
attention-based multi-line transcription explicitly employs a specific DNN topology to solve
this task. This means that the attention-based approach is confined to a very specific
topology of deep neural networks. The work of this thesis on the other hand only has some
general requirements on the DNN topology and even on the machine learning model employed. It
can be used in combination with e.g. recurrent or convolutional DNNs, a combination
thereof or possibly a ML model that is not a neural network at all.
Concluding from comparing the approaches to the same task, the attention-based
solution should be more robust when transcribing difficult lines since it does not rely on an
explicit encoding of line separators. On the other hand, the work of this thesis is applicable
to a variety of machine learning models as it is implemented in the loss/target function
and not in the model itself.
3.3 Paragraph Transcription by Reshaping CNNs
Overview and Method
In this section we will briefly discuss a work[21] on paragraph-level offline handwriting
recognition presented at the International Conference on Document Analysis and Recog-
nition (ICDAR) 2021. The use case for this method is again to transcribe multi-line text
from an image of a paragraph of handwritten text without prior segmentation into lines,
words or characters.
This method achieves this by applying a convolutional neural network (CNN), see
Section 2.3, to the presented image. Both the input image and estimated output of the
CNN consist of two spatial dimensions, their height and width. This method proposes
to reshape the CNN output, ordering ‘pixel’ rows in a single one-dimensional sequence,
starting with the topmost row. As this reshaped output is now one-dimensional, connec-
tionist temporal classification (CTC) can be applied to it for both training and decoding.
Figure 3.3.1 illustrates this approach.
Figure 3.3.1: Convolutional Neural Network applied to paragraph-wise transcription by reshaping
the CNN prediction. The CNN prediction of shape (batch-size × num. features ×
height × width) is reshaped to concatenate all pixel rows to form one sequence in
a left-to-right and top-to-bottom fashion. [Diagram: image of handwritten text, CNN,
tensor of shape (B x N x H x W), reshape, CTC training and decoding.]
The last layer of the CNN (before reshaping) estimates a soft-assignment that does a
probabilistic assignment between ‘pixels’ of the CNN output and the alphabet in use, e.g.
Latin glyphs for English handwritten texts. Glyphs are exclusive to each other per pixel,
which is achieved by applying a pixel-wise softmax function to this soft-assignment. The
alphabet in use contains an additional glyph for distinguishing repetitions of the same
glyph in adjacent characters. As such the output of this CNN is, in meaning, identical to
the output of deep neural networks for CTC, see Section 3.1, and attention networks for
paragraph-wise transcription, see Section 3.2, except that it contains two spatial dimen-
sions instead of one.
This soft-assignment is reduced from two spatial dimensions to one spatial dimension
by reordering the ‘pixel’ rows in a one-dimensional sequence. Connectionist temporal
classification is then applied to this sequence.
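The reordering of the pixel rows can be made concrete with a short NumPy sketch. It assumes the tensor layout (B x N x H x W) named in Figure 3.3.1 and a target layout of (batch, sequence position, glyph) for CTC; the function name rows_to_sequence and the toy dimensions are hypothetical, and the cited publication[21] may organize its tensors differently.

```python
import numpy as np

def rows_to_sequence(prediction):
    """Turn a (B, N, H, W) soft-assignment into a (B, H*W, N) sequence.

    Position h * W + w of the sequence holds the glyph distribution of
    pixel (h, w), so rows are read left to right, starting with the top row.
    """
    b, n, h, w = prediction.shape
    # Move the glyph axis last, then flatten the two spatial axes row by row.
    return prediction.transpose(0, 2, 3, 1).reshape(b, h * w, n)

batch, glyphs, height, width = 2, 5, 3, 4
prediction = np.arange(batch * glyphs * height * width, dtype=float)
prediction = prediction.reshape(batch, glyphs, height, width)
sequence = rows_to_sequence(prediction)
```

The transpose before the reshape matters: reshaping the (B, N, H, W) tensor directly would interleave glyph and spatial axes instead of producing one glyph distribution per sequence position.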
The original publication[21] contains details on the deep neural network topology and
training method applied.
Relation to this work
The advantage of this method of reshaping the CNN output for multi-line offline handwriting
recognition is that it is very easy to implement and employs connectionist temporal
classification as both the loss function for training and the decoding algorithm during
inference. Applying a CNN, not an RNN or LSTM network, makes inference very fast. The
published paper[21] does not report the time required for inference, but the paper’s author
reported a low number of milliseconds in discussions on site at the ICDAR conference.
In comparison to multi-dimensional connectionist classification, proposed in this the-
sis, the CNN reshaping approach suffers from a disadvantage: the convolutional neural
network transforms the presented input image, which is a pixel space with two spatial di-
mensions, to a lower-resolution soft-assignment again with two spatial dimensions. Attri-
bution between spatial positions in the predicted soft-assignment and input image is fixed
according to the receptive field of the CNN. This yields a 1:1 assignment from rectangular
areas in the presented image to characters in the transcription. The next step reshapes
the predicted soft-assignment from two to one spatial dimension by concatenation of the
pixel rows in a top-down fashion. Altogether this means that this method encodes the
assumption that text lines in the paragraph presented to the CNN are oriented roughly
horizontally and are of roughly the same height.
The paper[21, p. 11] addresses this problem by oversampling the input image, meaning
that the two-dimensional soft-assignment predicted by the CNN contains more pixel
rows than the presented image contains text lines. This introduces flexibility for transcribing
paragraphs with text lines of different heights and a variable number of text lines. The
examples given in the publication also show transcription of slanted text lines. However,
text lines can only be successfully transcribed using this method if they do not overlap in
the estimated soft-assignment, i.e. each pixel row of the soft-assignment must be part of
zero or exactly one text line, but not multiple. This limitation is introduced by the design
of this method in reshaping the CNN output.
Multi-dimensional connectionist classification does not have such strict limitations on
the size and orientation of text lines since it introduces a special token that separates
text lines within the two-dimensional soft-assignment. This allows one pixel row of the
soft-assignment to be part of multiple text lines.
Chapter 4
The Problem with Multi-Line
Handwriting Recognition
4.1 Overview
This chapter is devoted to a discussion of the problems that arise when automatically
transcribing multi-line paragraphs of handwritten texts. Connectionist temporal
classification (CTC)[46], see also Section 3.1, addresses the transcription of natural texts
from one-dimensional inputs. CTC is a method for training and decoding a deep neural
network in a way that estimates a one-dimensional sequence of labels, e.g. glyphs of an
alphabet, from an image of a single text line (offline handwriting recognition), a sequence
of pen strokes (online handwriting recognition) or audio (voice recognition). All three in-
put types can be represented as a one-dimensional sequence (left-to-right or beginning-
to-end) and accordingly their respective transcribed output is always a one-dimensional
sequence of labels. In this thesis, only offline handwriting recognition is of interest to us.
However, the question of why CTC cannot be directly applied to multi-line paragraphs of
text and why solving the same problem for multi-line texts and multi-dimensional input is
harder arises out of this transition from one- to two-dimensional inputs.
The following Section 4.2 discusses these questions from a practical perspective based
on examples of actual handwritten paragraphs. Section 4.3 touches on the computational
difficulties in transcribing multi-line texts.
4.2 Segmentation of Handwritten Paragraphs
Overview
As mentioned in the previous Section 4.1, this section discusses the multi-line offline
handwriting recognition problem from a practical perspective. To this end it employs
examples from the IAM offline handwriting database[88] which are problematic for a
‘classical’ transcription pipeline. Figure 4.2.1 shows one example paragraph from the IAM
database. It consists of several lines of handwritten text in English. The lines
are aligned in a neat horizontal fashion with similar spacing between and heights of lines.
The cursive writing is uniform and was done using a high-contrast pen in comparison to
the background sheet of paper. No overlaps exist between adjacent words or adjacent
text lines.
As such, Figure 4.2.1 is a prototypical example of a very well written paragraph of
cursive writing that is, presumably, easy to transcribe. Transcription of this paragraph
using CTC would entail a line-level segmentation, cutting the overall paragraph image
into multiple smaller images, each containing exactly one complete text line. A deep
neural network trained with connectionist temporal classification can then be applied to
each individual text line image.
Figure 4.2.1: Example paragraph from the IAM offline handwriting database with
non-overlapping, nearly straight horizontal text lines without character or word
corrections and in high contrast.
Unfortunately not all handwritten paragraphs are in such a favorable layout. The
following paragraphs will discuss problematic cases that favor multi-line transcription without
explicit segmentation.
Problematic Segmentation
The main case for the multi-line transcription method proposed in this thesis is the tran-
scription of overlapping text lines without explicit segmentation. Such a paragraph is
shown in Figure 4.2.2. Overlaps between adjacent text lines and near-overlaps are
marked in red and orange respectively. Near-overlaps are also marked since they may,
depending on the line-segmentation algorithm in use, also be problematic.
True overlaps between adjacent text lines render the line-level segmentation espe-
cially hard since there is no clear corridor of background pixels between the two text lines.
When e.g. applying a connected components algorithm, the overlapping glyphs of the two
text lines would appear as one continuous glyph. Correctly separating these overlapping
glyphs into multiple individual glyphs would require knowledge about the contained text
in order to infer the prototypical shape of these glyphs. This is known as Sayre’s para-
dox [117] or Sayre’s knot which in summary states that:
Transcription of cursive text requires segmentation of it.
Segmentation of cursive text requires transcription of it.
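The effect of a true overlap on a connected components analysis can be reproduced on a toy image. The sketch below is a plain breadth-first flood fill with 4-connectivity, written only for illustration; the two tiny images and the function name count_components are assumptions and are not taken from the IAM database or from any specific line-segmentation algorithm.

```python
from collections import deque

def count_components(image):
    """Count 4-connected components of ink pixels ('#') in a list of strings."""
    height, width = len(image), len(image[0])
    seen = set()
    components = 0
    for r in range(height):
        for c in range(width):
            if image[r][c] != "#" or (r, c) in seen:
                continue
            # A new, unvisited ink pixel starts a new component.
            components += 1
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == "#" and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return components

# Two text lines with two glyphs each, separated by a clear background corridor.
separated = [
    "###.###",
    ".......",
    "###.###",
]
# The same layout, but one stroke of the upper line reaches into the lower line.
overlapping = [
    "###.###",
    "..#....",
    "###.###",
]
```

In the separated image the four glyphs are found as four components. In the overlapping image the single connecting pixel merges a glyph of the upper line with a glyph of the lower line, so only three components remain and the two lines can no longer be separated by connectivity alone.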
Figure 4.2.3 illustrates Sayre’s knot with three overlapping characters. Each square
of the figure represents a pixel of an image and the coloring indicates the assignments of
pixels to characters. As we can see most pixels are part of the background and not part
of any character. Many other pixels are part of exactly one character. However, some
pixels are part of two characters at the same time. This example poses two segmentation
problems at the same time: first, assigning the pixels to characters and second also
deciding if pixels are part of multiple characters.
Figure 4.2.2: Example paragraph from the IAM offline handwriting database that shows multiple
overlaps, marked in red, or near-overlaps, marked in orange, between adjacent text
lines.
Figure 4.2.3: Sayre’s knot in an example of three overlapping characters ‘A’, ‘B’ and ‘C’. Red,
blue and yellow pixels are part of exactly one of the three characters. Purple and
orange pixels are part of two characters at the same time. Correctly assigning pixels
to characters requires knowledge about the characters and their prototypical glyph
shape. On the other hand, the pixel assignment is necessary for correctly identifying
the glyphs.
Solving this segmentation problem shown in Figure 4.2.3 would require knowledge
about the content of the pixel image, that is knowledge which glyphs are contained and in
which order. Knowledge of the contained glyphs could be incorporated in a segmentation
method in the form of prototypical shapes of glyphs. Unfortunately the contained glyphs
are not known before transcription and segmentation happens before transcription, which
creates Sayre’s paradox.
This effect may occur at any level of segmentation, that is while separating lines,
words or characters. Figure 4.2.4 shows possible variants for a ‘classic’ transcription
pipeline for handwritten paragraphs. Applying connectionist temporal classification en-
tails line- or word-level segmentation, followed by transcription using CTC. At any stage
of segmentation, overlaps may occur and thus reduce the quality of the segmentation re-
sult. Degraded segmentation will negatively influence the following transcription and thus
increase the overall transcription error. Since segmentation is done prior to transcription
without feedback, this increase in error will prevail.
Figure 4.2.4: Transcription pipeline for handwritten paragraphs based on prior segmentation on
a line-, word- or character-level. Segmentation does not have to be sequential from
one level to the next one. However, each segmentation step introduces a chance for
errors, influencing the final transcription result. Possible sources of segmentation
errors are indicated by the lightning symbols. The shown paragraph image is from
the IAM offline handwriting database. [Diagram: three pipeline variants with line-level,
word-level or character-level segmentation, each followed by transcription to the text
"A MOVE ...".]
This thesis proposes a transcription method for handwritten multi-line paragraphs that
does not depend on prior line-, word- or character-level segmentation. Figure 4.2.5
illustrates this idea in contrast to Figure 4.2.4. The idea of multi-dimensional connectionist
classification (MDCC) is that prior segmentation before transcription is not necessary and
instead, segmentation and transcription are two products of the same process. MDCC
emphasizes transcription, but Section 6.7 will briefly discuss how MDCC could be modi-
fied to emphasize segmentation. Assuming that segmentation and transcription are two
independent products of the same process and not processes that are dependent on
each other effectively solves Sayre’s paradox on a paragraph-level.
We state that MDCC solves Sayre’s paradox on a paragraph-level since MDCC ap-
plies only to transcription within a paragraph. A scanned document may still, and often
does, consist of multiple paragraphs with occasional figures and tables. Analyzing such
document structure and extracting individual paragraphs poses its own problems, which
are not in the scope of this thesis. Figure 11.2.2 of Section 11.2 does however show one
such example of a complex document layout.
Figure 4.2.5: Transcription pipeline proposed in this thesis. No explicit segmentation of the
paragraph image is performed. The paragraph image is again from the IAM offline
handwriting database. [Diagram: the paragraph image is passed directly to transcription,
yielding the text "A MOVE ...".]
Other Considerations
The previous section discussed problems with overlapping text lines while applying line-,
word- or character-level segmentation followed by a transcription method. Addressing
problems of overlapping text lines is the main reason for MDCC. However, the following
paragraphs will briefly touch on other, more general, interesting cases that occur while
transcribing handwritten texts. The following examples are based on knowledge of seg-
mentation and transcription methods, as well as inspection of the IAM offline handwriting
database. They serve to illustrate some of the problems that typically occur during offline
handwriting recognition.
Figure 4.2.6 shows an example text where segmentation on a word-level is ambigu-
ous. The bracketed numbers can be seen as stand-alone words or be assigned to the
leading or trailing word. The choice between these three possibilities may well influence
the following transcription since transcription methods in the form of recurrent neural net-
works contain implicit language models and some transcription methods even explicitly
apply a language model. Word-level segmentation thus should prefer the segmentation
which closely matches the language model at hand. This effect, at least in the example
of Figure 4.2.6, does not apply to paragraph- or line-level transcription since all words are
contained in one single text line anyway and thus presented to the transcription method.
It also does not apply to character-level transcription since it does not contain a language
model, implicit or explicit, anyway.
Figure 4.2.6: Example paragraph from the IAM offline handwriting database where correct word-
level segmentation is ambiguous without knowledge of the transcribed text.
The example of Figure 4.2.7 contains one to three characters, marked in red, which
are not clearly separable. Paragraph-, line- or word-level transcription should be applied
since in these methods no character-level segmentation will be necessary and an (im-
plicit) language model may be capable of distinguishing between the characters.
Figure 4.2.8 shows an example where the writer made a mistake and corrected it by
striking through the wrong word and writing down the correct text. A character- or word-
level transcription applied to this example may be erroneous, especially if the corrected
word is treated as an individual word segment. Paragraph- or line-level transcription
seems to be more applicable here, especially if such cases occur within the training data,
since the segmentation or transcription method may ignore the corrected part of the line.
Presenting only the stricken word to the transcription method may on the other hand
generate erroneous results since a transcription method is designed to produce natural
language text, even if the input image does not contain text.
An abstraction of offline handwriting recognition is the so-called identification of writer
intention, which is transcription plus the assumption that a writer will occasionally make
mistakes and write down a different text in comparison to what was the intended informa-
tion. Figure 4.2.9 shows an example of this. The writer of this paragraph made a spelling
mistake. The correct transcription of the marked word in terms of offline handwriting
recognition is ‘effektive’, but in terms of identification of writer intention it is ‘effective’.
Figure 4.2.7: Example paragraph from the IAM offline handwriting database with an ambiguous
or corrected character marked in red.
Figure 4.2.8: Example paragraph from the IAM offline handwriting database with a word, marked
in red, corrected by the writer.
Figure 4.2.9: Example paragraph from the IAM offline handwriting database where the writer
made a spelling mistake. This is an example of ‘identification of writer intention’.
Conclusion
This section detailed some exception cases that can occur in handwritten paragraphs of
natural language. Of interest to this thesis are mainly potential errors that occur when
applying line-level segmentation to paragraphs that contain overlapping text lines. Multi-
dimensional connectionist classification is designed to transcribe whole paragraphs with-
out prior segmentation in order to mitigate these problems. We show that MDCC is
capable of solving Sayre’s knot on a paragraph-level by treating segmentation and tran-
scription as products of the same process. MDCC as proposed in this thesis emphasizes
transcription. Section 6.7 does however briefly discuss ways to put emphasis on seg-
mentation.
Other examples shown in this section concern more general difficulties in offline
handwriting recognition. These are not directly addressed by MDCC. On the other hand,
paragraph- or line-level transcription is suitable for these examples.
4.3 Computational Considerations
Forward-Backward in Connectionist Temporal Classification
So far this chapter has detailed potential problems in line-, word- or character-level
segmentation when applied to handwritten paragraphs. This section will discuss why the
methodology of connectionist temporal classification (CTC)[46], namely forward-backward
alignment, cannot be directly transferred to two-dimensional, and by extension to
multi-dimensional, tasks. We will again use an example from the IAM offline handwriting
database[88] to illustrate the considerations of the following paragraphs.
Section 3.1 discussed the application of the forward-backward algorithm in CTC. CTC
solves the sequence labeling task for one-dimensional sequences. This task is to tran-
scribe a sequence of discrete labels from a one-dimensional input, or at least input that
can be treated as one-dimensional. For example in offline handwriting recognition, an im-
age of one text line is the one-dimensional input (processed left to right with the height col-
lapsed) and the transcribed sequence is the sequence of characters contained in this line
image. It is important to note that in sequence labeling, the transcribed label sequence
is shorter than the input sequence. As such the assignment between transcribed labels
and input positions is not known a priori. CTC solves this by applying forward-backward to
infer the alignment between the transcribed sequence and input sequence during training
of the deep neural network.
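To make the alignment inference more tangible, the forward half of the forward-backward recursion can be written down compactly. The sketch below computes only the probability of a label sequence, not the gradients needed for training, and rests on assumptions: y is a NumPy array of shape (T, K) holding normalized label probabilities per time step, label index 0 is the glyph separator, and the function name ctc_label_probability is hypothetical.

```python
import numpy as np

def ctc_label_probability(y, labels, blank=0):
    """Forward pass of CTC: probability that y decodes to `labels`.

    y:      array of shape (T, K), per-time-step label probabilities
    labels: list of label indices without separators
    """
    T = y.shape[0]
    # Extended sequence: separator, l1, separator, l2, ..., separator.
    extended = [blank]
    for label in labels:
        extended += [label, blank]
    U = len(extended)

    alpha = np.zeros((T, U))
    alpha[0, 0] = y[0, extended[0]]
    if U > 1:
        alpha[0, 1] = y[0, extended[1]]
    for t in range(1, T):
        for u in range(U):
            total = alpha[t - 1, u]
            if u >= 1:
                total += alpha[t - 1, u - 1]
            # Skipping a separator is allowed unless it would merge equal labels.
            if u >= 2 and extended[u] != blank and extended[u] != extended[u - 2]:
                total += alpha[t - 1, u - 2]
            alpha[t, u] = total * y[t, extended[u]]
    # Valid paths end in the last label or in the trailing separator.
    return alpha[T - 1, U - 1] + (alpha[T - 1, U - 2] if U > 1 else 0.0)

# 'HELLO' over 12 time steps with a hypothetical alphabet (separator, H, E, L, O).
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 5))
y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
p_hello = ctc_label_probability(y, [1, 2, 3, 3, 4])
```

For 'HELLO' the extended sequence has 2 · 5 + 1 = 11 entries, so the recursion fills a table of 12 time steps by 11 states instead of enumerating every alignment explicitly.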
Connectionist Temporal Classification as a Conditional Random Field
Connectionist temporal classification successfully applies the forward-backward algo-
rithm to infer the character alignment. This is possible since the underlying graph struc-
ture, when interpreting CTC as a graphical model, is an undirected chain. Figure 3.1.1
shows all paths while aligning the label sequence ‘HELLO’ over an observation of 12 time
steps. In terms of a graphical model this translates to a chain of 12 nodes with 11 labels
each. The transitions in Figure 3.1.1 indicate compatible node-label combinations within
one node and in neighboring nodes. Each path thus represents one configuration of the
graphical model which correctly decodes to the label sequence in question. Please see
Section 2.2 for a discussion of graphical models in this context.
Figure 4.3.1: Interpretation of CTC as a chain-structured graphical model. The example is
identical to that of Figure 3.1.1 but with the time steps of the DNN prediction as 12 nodes
of the graphical model and the label sequence ‘HELLO’ as 11 discrete states of
each node. [Diagram: grid with the time steps t=1 to t=12 of the DNN prediction (1 to T) on
the horizontal axis and the 11 states ε, H, ε, E, ε, L, ε, L, ε, O, ε of the label sequence
(1 to U) on the vertical axis.]
Figure 4.3.1 shows the interpretation of CTC as a chain-structured graphical model
in the same example as of Figure 3.1.1. We can show that this interpretation as a chain-
structured graphical model is equivalent to the CTC formulation by recovering Equation
3.1.3, which defines the probability of observing a specific label sequence, from Equation
2.2.9, which defines the joint probability of a conditional random field (CRF). Equation
3.1.3, with substitution of Equation 3.1.2, is as follows, with l being the label sequence in
question, y the observed DNN prediction and π being one configuration:
P(l \mid y) = \sum_{\pi : F(\pi) = l} \prod_{t}^{T} y^{t}_{\pi_t} (4.3.1)
Function F (π) collapses a configuration π to a label string by first converting repeti-
tions of the same glyph to a single instance of the glyph, followed by removing all glyph
separators. This is discussed in Section 3.1.
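The collapse function F can be written in a few lines. This is a minimal sketch that assumes configurations are given as Python strings and that the glyph separator is written as ‘ε’; the constant BLANK and the function name collapse are hypothetical.

```python
from itertools import groupby

BLANK = "ε"  # glyph separator; the concrete symbol is an assumption of this sketch

def collapse(path, blank=BLANK):
    """F(π): merge runs of equal glyphs, then drop all glyph separators."""
    merged = [glyph for glyph, _ in groupby(path)]
    return "".join(glyph for glyph in merged if glyph != blank)

# One configuration over 12 time steps that decodes to 'HELLO'.
decoded = collapse("HHεEELLεLLOO")
```

The order of the two steps matters: the separator between the two ‘L’ runs keeps them apart during merging, whereas a path without it, such as ‘HELLO’ itself, collapses to ‘HELO’.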
Equation 2.2.9 defines the joint probability of a CRF as follows, using π as one
configuration and y as the observed DNN prediction:
P(\pi \mid y) = \frac{1}{Z} \prod_{s} \psi_s(\pi_s \mid y^{s}) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) (4.3.2)
Since the goal is to construct a chain-structured CRF, the neighborhood relation s ∼ t
is defined as nodes s and t being two consecutive nodes within this chain. Each node of
the chain is part of exactly two such neighborhood relations, one with its leading and one
with its trailing neighbor. The exception are the very first and very last nodes of the chain,
both only being part of one neighborhood.
Z is a normalization factor, also called the partition function (Zustandssumme), defining
the accumulated joint probability over all possible configurations π. This normalization
factor is responsible for the joint probabilities of all configurations actually summing up to 1:
Z = \sum_{\pi} \Big[ \prod_{s} \psi_s(\pi_s \mid y^{s}) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \Big] (4.3.3)
Marginalization over all CRF configurations π that represent the label string l yields
the probability of observing this label string:
    P(l|y) = \sum_{\pi : F(\pi) = l} P(\pi|y) = \frac{1}{Z} \sum_{\pi : F(\pi) = l} [ \prod_{s} \psi_s(\pi_s | y^s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) ]    (4.3.4)
We define the node potential function ψ_s(π_s|y^s) of the CRF as the probability of
observing character π_s in time step s according to the deep neural network prediction y:
    \psi_s(\pi_s | y^s) = y^s_{\pi_s}    (4.3.5)
The edge potential function ψ_{s,t}(π_s, π_t) is defined as a constant, giving equal
compatibility to all node-label combinations:
    \psi_{s,t}(\pi_s, \pi_t) = 1    (4.3.6)
At this point the question arises whether this is a valid CRF representation of the CTC
model, since the edge potential of the CRF does not restrict the configurations to the
specific label string l. Connectionist temporal classification is a loss function for
optimizing a deep neural network towards predicting the label string l that matches the true
transcription of the DNN input. As such there is the possibility that the DNN predicts many
different label strings for different inputs. The goal of CTC is to maximize the probability
of predicting the true label string given its corresponding input from the training data set.
The original CTC formulation accounts for this fact by recognizing in Equations 3.1.2 and
3.1.3 that there are potential paths π that do not correspond to the correct label string l.
The CRF at hand is built on the same assumption.
Substituting ψ_s(π_s|y^s) = y^s_{π_s} and ψ_{s,t}(π_s, π_t) = 1 yields the marginalization
    P(l|y) = \frac{1}{Z} \sum_{\pi : F(\pi) = l} [ \prod_{s} y^s_{\pi_s} ]    (4.3.7)
for the probability of the DNN prediction y encoding the truth label string l with
    Z = \sum_{\pi} \prod_{s} y^s_{\pi_s}    (4.3.8)
being the normalization factor.
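Section 3.1 briefly discusses deep neural network topologies as used for CTC. These
topologies end with a collapse layer, followed by a softmax function that normalizes the
glyph probabilities within each time step. This observation leads to the conclusion that
the normalization factor Z over all configurations π always sums up to exactly one given
that y is predicted by such a DNN:
    Z = \sum_{\pi} \prod_{s} y^s_{\pi_s} = 1    (4.3.9)
This holds because the sum over all configurations factorizes into a product of sums per
time step, \sum_{\pi} \prod_{s} y^s_{\pi_s} = \prod_{s} \sum_{g} y^s_g with g ranging over all
glyphs, and the softmax function ensures that each inner sum equals one.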
Substituting Z = 1 recovers the CTC formulation for P(l|y) from the conditional random
field joint probability of Equation 4.3.2. This formulation is identical to Equation 4.3.1
with the exception that the time dimension is indexed by symbol s instead of t:
    P(l|y) = \sum_{\pi : F(\pi) = l} \prod_{s} y^s_{\pi_s}    (4.3.10)
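The equivalence derived in this section can be checked numerically on a toy problem. The
following Python sketch is not part of the original work. It enumerates all configurations of
a short chain by brute force, which is only feasible for very small T. The names `collapse`
and `crf_label_probability`, as well as the convention that index 0 is the glyph separator,
are assumptions made for this illustration.

```python
import itertools
import numpy as np

BLANK = 0  # assumed index of the glyph separator

def collapse(path):
    """F(pi): merge repeated glyphs, then drop glyph separators."""
    merged = [g for g, _ in itertools.groupby(path)]
    return tuple(g for g in merged if g != BLANK)

def crf_label_probability(y, label):
    """Brute-force P(l|y) and Z for a chain CRF with unit edge potentials.

    y has shape (T, K): one softmax-normalized row per time step.
    """
    T, K = y.shape
    z = 0.0
    unnormalized = 0.0
    for path in itertools.product(range(K), repeat=T):
        # Product of node potentials; the edge potentials are all 1.
        weight = float(np.prod(y[np.arange(T), list(path)]))
        z += weight
        if collapse(path) == tuple(label):
            unnormalized += weight
    return unnormalized / z, z

# Toy prediction: 4 time steps, 3 glyphs, normalized per time step.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
y = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
p, z = crf_label_probability(y, label=(1, 2))
```

Under these assumptions `z` equals one up to floating point error, so the CRF marginal `p`
coincides with the unnormalized sum of Equation 4.3.10.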
Computational Complexity of Inference in Graphical Models
So far this section discussed how connectionist temporal classification applies the for-
ward-backward algorithm for alignment of the truth label string and thus solves the se-
quence labeling task for one-dimensional sequences. This section also detailed the in-
terpretation of CTC as a chain-structured conditional random field. This explains why
forward-backward can be applied for exact inference while keeping computational com-
plexity polynomial.
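To illustrate why the chain structure keeps exact inference polynomial, the following minimal
sketch implements the forward pass of CTC as a dynamic program. It needs on the order of
T times the length of the label string operations instead of enumerating all configurations.
The sketch follows the commonly used forward recursion for CTC. The function name
`ctc_forward` and the separator index 0 are assumptions for this example, and probabilities
are kept in linear space for brevity, whereas a practical implementation would likely work in
log space.

```python
import numpy as np

def ctc_forward(y, label, blank=0):
    """Return P(label | y) for y of shape (T, K) using the CTC forward recursion."""
    T = y.shape[0]
    # Extended label: separators around and between all glyphs.
    ext = [blank]
    for g in label:
        ext += [g, blank]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, ext[0]]
    if S > 1:
        alpha[0, 1] = y[0, ext[1]]

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1, s]
            if s >= 1:
                total += alpha[t - 1, s - 1]
            # Skipping a separator is only allowed between two different glyphs.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1, s - 2]
            alpha[t, s] = total * y[t, ext[s]]

    result = alpha[T - 1, S - 1]
    if S > 1:
        result += alpha[T - 1, S - 2]
    return float(result)

# Two time steps, alphabet {separator, 'a'}, uniform prediction.
y_uniform = np.full((2, 2), 0.5)
p_a = ctc_forward(y_uniform, label=(1,))
```

In this toy case three of the four configurations collapse to the label ‘a’, so `p_a` is
0.75.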
Belief propagation has been discussed in Section 2.2 and the forward-backward algorithm
is a special case of BP. Forward-backward applies belief propagation in sum-product mode
to chain-structured graphical models. Section 2.2 also refers to published
literature[2, 15, 23, 80] on the computational complexity of inference in graphical models.
Kevin Murphy[93, ch. 20.5] gives an overview of inference algorithms for graphical
models and their respective restrictions. He states that forward-backward is applicable
for exact inference in chain-structured models and belief propagation in trees. Please
note that belief propagation can in general be applied for inference in polytrees since the
factor-graph[34, 75] of a polytree is again a tree[7, ch. 8.4.3]. A polytree is a directed
graph whose underlying undirected topology is acyclic. Colloquially speaking, if the directed
edges of a polytree are replaced by undirected edges and duplicate edges are removed, the
resulting graph does not contain any cycles. As such, a polytree is the least restrictive
topology out of a chain, tree or polytree. Inference in general graphical
models, that is directed or undirected and with cycles, is NP-hard[23] with exact inference
being even #P-hard[111].
Loopy belief propagation (LBP)[34, 94][93, ch. 22] allows for approximate inference
in general graphical models within polynomial time, given that a suitable convergence
criterion is applied. The remainder of this section discusses why the sequence labeling
task with multi-dimensional inputs and label sequences falls within these general graphical
models and thus renders application of the forward-backward or belief propagation
algorithms infeasible. Chapter 6 discusses how to apply a grid-structured CRF and LBP
to this multi-dimensional sequence labeling task by proposing multi-dimensional
connectionist classification (MDCC).
On Chain- and Grid-Structured Models
Figure 4.3.2: Chain-structured graphical model with 5 nodes.
The previous paragraphs of this section detailed the interpretation of connectionist
temporal classification as a chain-structured conditional random field, the application of
the forward-backward algorithm to chain-structured graphical models and the general
computational limitations of inference in graphical models. Figure 4.3.2 shows a chain-
structured graphical model with 5 nodes. A discussion on the modeling of multi-line text
in a graphical model follows in the next paragraphs.
Figure 4.3.3: Example of multi-line handwritten text. The matrix, with cells numbered 1 to 9
row by row in a 3 by 3 layout, serves as a partial pixel grid.
Cell 5 could be the beginning of e.g. an ‘a’, ‘g’ or ‘o’ glyph. Cell 6 e.g. an ‘o’, ‘c’, ‘g’.
Cell 9 e.g. an ‘o’, ‘b’, ‘s’ with or without a new line on the bottom. Cell 8 e.g. an ‘o’,
‘c’, ‘u’ with or without a new line.
Figure 4.3.3 contains an extract from an IAMDB example with two text lines of three
words each. A partial pixel grid was added as an overlay to show how the contents of
neighboring cells influence each other in a cyclical fashion. In this example, the cells
5, 6, 8 and 9 influence each other and inferring the content of each cell is not reliably
possible without looking at the other cells as well. Please note the caption of Figure 4.3.3
for example possibilities of the cell contents.
This observation leads to the following two reasons why multi-line text necessitates
a grid-structured model with 4- or 8-neighborhoods around each node instead of a
chain-structured model as is the case for one-dimensional transcription using CTC:
1. Even without difficult-to-recognize examples, as in Figure 4.2.1, the modeling of
multi-line text is inherently multi-dimensional. The horizontal spatial dimension
roughly translates to the reading direction within one text line, the vertical dimension
to the ordering of multiple text lines. This holds for simple cases as well as for more
interesting cases such as slanted, curved or rotated text lines.
2. The example of Figure 4.3.3 shows that cyclical dependencies in the neighborhood
around nodes of the graphical models do exist. We argue that correct inference
of the alignment of multi-line text is not possible with a chain-structured graphical
model, but instead necessitates a grid-structured model that allows for cyclical de-
pendencies.
This reasoning leads us to multi-dimensional connectionist classification as discussed
in Chapter 6. MDCC applies a grid-structured conditional random field to infer the align-
ment of multi-line text. An example of such a grid-structured graphical model is shown in
Figure 4.3.4.
Figure 4.3.4: Grid-structured graphical model with 25 nodes in a 5 by 5 grid.
We can see from the example in Figure 4.3.4 that such a grid-structured graphical
model is not a chain or tree. It also is not a polytree since the underlying graph structure
is cyclic. As such neither the forward-backward algorithm nor belief propagation can be
applied for inference in such a model. We propose to apply loopy belief propagation as
an approximate inference method.
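The structural argument can be made concrete with a small check. The following sketch, which
is an illustration and not part of the original work, builds the edge list of a grid with
4-neighborhoods and tests for cycles with a union-find structure. The helper names
`grid_edges` and `is_acyclic` are hypothetical.

```python
def grid_edges(width, height):
    """Undirected edges of a 4-neighborhood grid; nodes are numbered row by row."""
    edges = []
    for j in range(height):
        for i in range(width):
            node = j * width + i
            if i + 1 < width:
                edges.append((node, node + 1))       # right neighbor
            if j + 1 < height:
                edges.append((node, node + width))   # bottom neighbor
    return edges

def is_acyclic(num_nodes, edges):
    """Return True iff the undirected graph contains no cycle (union-find)."""
    parent = list(range(num_nodes))

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return False  # a and b were already connected: this edge closes a cycle
        parent[root_a] = root_b
    return True

chain = grid_edges(5, 1)   # chain with 5 nodes, as in Figure 4.3.2
grid = grid_edges(5, 5)    # 5 by 5 grid, as in Figure 4.3.4

chain_is_acyclic = is_acyclic(5, chain)
grid_is_acyclic = is_acyclic(25, grid)
```

The chain has 4 edges and no cycle, whereas the 5 by 5 grid has 40 edges on 25 nodes and
therefore must contain cycles, since a tree on 25 nodes has exactly 24 edges.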
It would still be possible to apply variable elimination[162][93, ch. 20.3] or the junc-
tion tree algorithm[86, 131][93, ch. 20.4] as an inference method to a grid-structured
model. Both methods are also discussed in other published literature[70]. Unfortunately
the computational complexity of both algorithms grows exponentially with the tree-width
of the graphical model. The tree-width of a grid with 4-neighborhoods and N ×N nodes
is N . The worst case for the tree-width in graphical models is that it is identical to the
number of nodes in the model. Thus the choice of the variable elimination or junction tree
algorithm for inference in grid-structured graphical models of multi-line text would quickly
become unfeasible since the size of the grid-structure depends on the number of pixels
in the input image of the text. Loopy belief propagation on the other hand is an iterative
algorithm that can be stopped as soon as sufficient convergence criteria are met.
Chapter 5
Decoding Algorithms for Multi-Line Text Recognition
Figure 5.0.1: Part of the pipeline discussed in this chapter. Left is the input, middle the estimated
probabilities and right the decoded text.
The content of this chapter is based on the two publications on multi-dimensional
connectionist classification:
Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Dissecting Multi-Line
Handwriting for Multi-Dimensional Connectionist Classification.” In: 2019 15th IAPR
International Conference on Document Analysis and Recognition (ICDAR). Sept. 2019.
DOI: 10.1109/ICDAR.2019.00015
Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Multi-Dimensional
Connectionist Classification: Reading Text in One Step.” In: 2018 13th IAPR
International Workshop on Document Analysis Systems (DAS). Apr. 2018, pp. 405–410.
DOI: 10.1109/DAS.2018.36
Please see Section 1.3 for detailed information on the author's contribution.
5.1 Overview
The overall method and system for offline handwriting recognition proposed in this work
can be split into two basic parts: A first stage, consisting of a deep neural network, that
takes the image of the handwritten text as input and estimates a probability distribution
that soft-assigns each pixel¹ to one of the glyphs from the given alphabet. The second
stage is a decoding algorithm that produces the most likely or at least a likely sequence
of glyphs given the probability distribution from the first stage. See Figure 5.0.1 for an
overview. Algorithmically these two stages are executed in this order. However, discussing
the decoder stage first is more easily approachable since its output (a sequence
of glyphs) is not abstract and intuitively understood. This is why we will discuss the
decoding algorithm of this work in the current chapter with the discussion of the deep neural
network and training algorithm in the following Chapter 6.
¹ We will use the term ‘pixel’ in this context colloquially for a specific spatial position in
the probability distribution and not strictly for a ‘picture element’ of an image.
Decoding is a problem well known in information and signal theory with applications
in e.g. telecommunication, speech recognition and of course handwriting recognition.
We have previously discussed decoding algorithms for connectionist temporal
classification, see Section 3.1. Decoding describes the problem of observing a time series of
continuous signals, e.g. voltages on copper telecommunication lines or probability
estimates from a deep neural network, and uncovering the sequence of discrete events that
most likely led to this observation. A well known decoding algorithm for one-dimensional
sequences is the Viterbi algorithm[33, 149]. In the case of offline handwriting recognition,
the sequence of events is the sequence of glyphs actually written on the sheet of paper
and captured by the camera.
Offline handwriting recognition systems that employ line-wise transcription of multi-line
text require a line segmentation algorithm to be run beforehand in order to extract individual
text lines and facilitate correct text transcription. These text line segmentation algorithms
are based on features extracted from the image of handwritten text before applying the
transcription method to the extracted lines. This chapter proposes a multi-line decoding
algorithm for multi-dimensional connectionist classification. It employs a similar overall
approach by first identifying and extracting text lines from the deep neural network pre-
diction, converting these extracted lines to one-dimensional sequences of probability es-
timates and then decoding these using established decoding algorithms. The difference
is that the proposed system does not extract lines from the original image of handwritten
text but from the two-dimensional probability distribution estimated by the deep neural
network. This probability distribution gives probabilities for both visible glyphs and the
artificial line separator glyph. This allows the proposed decoding algorithm to use infor-
mation about the transcribed text and extract lines in such a way as to facilitate text line
transcription with fewer errors.
At this point it is prudent to specify our terminology as there may be conflicting defini-
tions in literature. In this chapter and Chapter 6 we will use the term glyph for one element
from the alphabet in use, e.g. ‘l’ is a glyph. A character denotes a specific instance of a
glyph within a text, e.g. ‘hello’ contains the glyph ‘l’ twice in two characters. The definition
of a label in this context is identical to that of a glyph but also is a general term used in
the sequence labeling task in machine learning.
In the following sections of this chapter we will first discuss the structure and proper-
ties of the two-dimensional probability distribution estimated by the deep neural network
and how it encodes information about the included text and its spatial structure. We will
then continue by outlining and discussing the proposed decoding algorithm, starting with
the overall algorithm and then detailing the algorithmic parts for finding and extracting text
lines, as well as decoding text lines to label sequences.
We will use images from the IAM offline handwriting database[88] as examples in this
chapter. Please note that although the examples show handwritten text, this work is
applicable to multi-line text in general.
5.2 Structure of the Model Output
In the following section we will discuss the structure, properties and meaning of the
two-dimensional probability distribution estimated by the deep neural network in multi-
dimensional connectionist classification. As illustrated in Figure 5.0.1, this probability
distribution is generated by the DNN using an image of handwritten text as input. Opti-
mization of this DNN will be discussed in Chapter 6.
Mathematical Properties
Let A be the alphabet in use. It is a set consisting of the glyphs of the writing system,
as well as an artificial glyph separator ϵg and an artificial line separator ϵl. The glyph
separator ϵg is used to differentiate between multiple adjacent occurrences of the same
glyph in one text line and a single occurrence that stretches over multiple pixels. For
example a sequence a ϵg a encodes ‘aa’ whereas aaa encodes ‘a’ stretched over three
pixels. The line separator ϵl encodes information about line breaks, specifically it indicates
that the two pixels directly above and below belong to two different text lines.
Let us use x ∈ [0, 255]^{d_x} as the gray scale input image of d_x = Width(x) × Height(x)
pixels. Prediction y ∈ [0, 1]^{d_y × |A|} with d_y pixels is then the soft-assignment
of each pixel to one of the glyphs or separators from A. This soft-assignment
is estimated by the deep neural network as y = DNN(x, W) in the case of transcribing
text from an image, with W being the parameters of the DNN. The number of pixels d_y is
likely smaller than d_x because of subsampling, pooling or padding effects in the DNN or
model in general. However from a theoretical viewpoint it is sufficient to assume that each
pixel in x corresponds to exactly one pixel in y. When training the DNN, see Chapter 6,
this soft-assignment will be estimated by a conditional random field (CRF) and will be
assumed to be the true alignment and as such be used for supervised optimization of the
DNN parameters.
The prediction y is a probability distribution with two spatial dimensions (width and
height) as well as one dimension for the glyph space of the alphabet. It gives the
probability y^s_g of a specific pixel s being part of a specific glyph g for every pixel and
glyph. Since the glyphs are mutually exclusive (assuming the writer intended to only write
exactly one glyph per spatial position), the probabilities per pixel s sum up to one. This
constraint
    \sum_{g \in A} y^s_g = 1, \quad \forall s \in [1, d_y]    (5.2.1)
is enforced by applying a pixel-wise softmax function to the last layer of the DNN.
We will later also use the notation y^{(i,j)}_g in addition to y^s_g in order to indicate a
pixel at a specific position (i, j). Here i ∈ [1, I] indicates the horizontal position with
I = Width(y) and j ∈ [1, J] the vertical position with J = Height(y). y^s_g is the short-hand
notation for one spatial position s.
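The properties described above can be sketched in a few lines of NumPy. This is an
illustration under stated assumptions, not the implementation used in this work: the array
layout (height, width, glyphs), the row-major mapping between the flat index s and the
position (i, j), and the names `pixelwise_softmax` and `flat_index` are all choices made for
this example.

```python
import numpy as np

def pixelwise_softmax(logits):
    """Softmax over the last (glyph) axis of an array of shape (J, I, |A|)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def flat_index(i, j, width):
    """Map the 1-based position (i, j) to a 1-based flat index s (row-major, assumed)."""
    return (j - 1) * width + i

J, I, K = 3, 4, 5  # height, width and alphabet size of a toy prediction
rng = np.random.default_rng(0)
y = pixelwise_softmax(rng.normal(size=(J, I, K)))

# Constraint of Equation 5.2.1: the glyph probabilities of every pixel sum to one.
sums = y.sum(axis=-1)

# The same prediction addressed by flat index s instead of position (i, j).
y_flat = y.reshape(J * I, K)
s = flat_index(i=2, j=3, width=I)
same_pixel = np.array_equal(y_flat[s - 1], y[3 - 1, 2 - 1])
```

The subtraction of the maximum before exponentiation does not change the result of the
softmax and is only included to avoid overflow for large activations.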
Semantic Meaning
In the preceding subsection we have discussed the mathematical properties of the
soft-assignment y. We will now discuss the meaning of the probabilities contained in it and
how to interpret them in a way that facilitates transcription of text by decoding these
probabilities. In the following discussion, and in fact for the remainder of this thesis, we
will assume that the text is written in a writing system that places characters from left to
right within a line and lines from top to bottom on the page or paragraph. Adjusting the
decoding algorithm described in this chapter and the training algorithm in Chapter 6 to
other writing systems is a matter of adjusting these neighborhood relations.
This top-to-bottom and left-to-right writing system leads us to the first interpretation
of the probabilities in y since this defines also the ‘reading direction’ of the probability
distribution. The first text line starts in the pixel in the top-left corner, text lines generally
start on the left pixel and the last text line ends in the bottom-right corner. Text lines are
presented in the pixel space from top to bottom and characters within a text line from left
to right. This directly reflects the writing system.
We will now discuss the meaning of the different glyphs from alphabet A, starting with
the artificial glyphs ϵl and ϵg followed by the glyphs purposefully intended by the writer.
The line separator ϵl encodes transitions between text lines since this work aims at
transcribing multi-line text without prior segmentation. The line separator ϵl is an artificial
glyph of the alphabet, in the sense that the writer did not intend to write it on paper as part
of the text, but it is necessary in order to correctly encode multi-line text in the probability
distribution. Let y^{(i,j)}_{ϵl} be the probability of pixel (i, j) being part of a line
separator ϵl. A line separator ϵl encodes the semantic meaning that the pixel column
(i, j′), j′ ∈ [1, j − 1] to the top belongs to a different text line or multiple text lines
than the pixel column (i, j′), j′ ∈ [j + 1, J] to the bottom. Thus y^{(i,j)}_{ϵl} contains the
probability that this line separation is in effect for pixel (i, j) and the pixels above and
below belong to different text lines. Two vertically adjacent text lines are separated by a
continuous chain of line separators ϵl ranging from the left border at i = 1 to the right
border at i = I. Each line separator ranges from the left to the right border and does not
‘merge’ with another line separator above or below, even if the text lines are not using up
the full width. Instead the line separators range the full width, separated by at least one
pixel in vertical direction, and text lines that are too short are filled up with spaces on the
left or right side, according to their alignment.
Figure 5.2.1: DNN prediction of the line separator for one IAMDB example. Yellow color encodes
high probabilities. This example shows perfectly horizontal lines, which may not
always be the case as MDCC allows for line curvature or rotation of up to 45 degrees.
Figure 5.2.1 illustrates this concept with an example DNN prediction for the line sep-
arator. High probability is encoded in yellow color and thus the image parts above and
below each yellow line have a high probability of belonging to two different text lines.
Similar to the artificial line separator ϵl there exists an artificial glyph separator ϵg.
This glyph separator becomes necessary whenever two adjacent characters in the same
text line are the same glyph from the alphabet. In this case the decoding algorithm needs
a semantic pointer to differentiate between the two characters since it is allowed that
one character spans over several adjacent pixels. Differentiating between two adjacent
characters with different glyphs is intrinsically modeled since there is an actual transition
between different glyphs, but this is not the case for the same glyph in adjacent charac-
ters. Hence the need for a separator ϵg between these characters.
The glyph separator ϵg is placed whenever two adjacent characters are the same
glyph. This allows differentiation between e.g. decoding a sequence aaa to the string
‘a’ or a sequence a ϵg a a to the string ‘aa’. The probability y^{(i,j)}_{ϵg} in pixel (i, j)
gives the probability that the pixels (i′, j), i′ ∈ [1, i − 1] left of the glyph separator
belong to one or more characters different from the pixels (i′, j), i′ ∈ [i + 1, I] to the
right. Two characters of the same glyph are thus separated by a continuous chain of glyph
separators ϵg ranging from the upper text line boundary, either the top of the pixel space or
the above line separator ϵl, to the lower text line boundary. Figure 5.2.2 illustrates this by
encoding high probabilities for the glyph separator in yellow.
Figure 5.2.2: DNN prediction of the glyph separator for one IAMDB example. Yellow color
encodes high probabilities.
Glyph separators are only placed when necessary, in contrast to how they are han-
dled in connectionist temporal classification (CTC). This is computationally useful when
using a conditional random field (CRF) for estimating the alignment during training since
it directly reduces the runtime required for inference in a CRF.
We will now continue the discussion with the glyphs intentionally placed by the writer.
Most of them are visible, but the space is not. Similarly to the glyph separator ϵg these
are ‘writer-intended glyphs’ placed from left to right within a text line, specifically between
the upper and lower text line borders. These borders may be the start/end of the vertical
dimension in pixel space or the next occurrence of the line separator ϵl. The probability
for pixel (i, j) being part of glyph g is given by y^{(i,j)}_g with glyph g ∈ A \ {ϵl, ϵg}.
Figure 5.2.3: DNN prediction of the space for one IAMDB example. Yellow color encodes high
probabilities.
The first glyph in this part of the discussion is the space glyph that in Latin writing
systems is used to separate two adjacent words. In this work it is also required to fill up
the alignment whenever a text line in pixel space is not using up the full horizontal extent
of the pixel space. This can occur since the pixel space is a Euclidean space, which is
rectangular when visualized, but not every text line necessarily has exactly this horizontal
size in pixel space. This can also be used to restrict the alignment during the training
process since we can define the text to be left- or right-aligned and thus only add the
space to one of the two sides of the text lines. In both cases the space is an optional
character in the text line and only used to fill remaining pixel space if necessary. During
decoding it is suitable to remove leading or trailing spaces in each text line. Figure 5.2.3
shows the probabilities for space glyphs in an example where the text is left-aligned.
Figure 5.2.4: DNN prediction of the glyph ‘a’ for one IAMDB example. Yellow color encodes high
probabilities.
Figure 5.2.5: DNN prediction of the glyph ‘e’ for one IAMDB example. Yellow color encodes high
probabilities.
Visible glyphs are encoded in a similar fashion as the space glyph. The difference is
that these are strictly only present in the alignment or prediction when intentionally written
and not used to fill up remaining space. Again these characters are boxed in by the upper
and lower boundary of their respective text line in pixel space and to the left and right by
their neighboring characters. Figure 5.2.4 shows an example for the glyph ‘a’ and Figure
5.2.5 for the glyph ‘e’.
5.3 Multi-Line Decoding
In this section we will begin discussing the actual multi-line decoding algorithm pro-
posed and used in this work. The function of this decoding algorithm is to take the soft-
assignment y as estimated by the deep neural network (DNN), as well as the alphabet in
use A as input and produce the most likely, or at least a high probability, readable string
from it.
Likelihood of a Specific String
Let us start out by defining the likelihood of a specific string l given the soft-assignment
y. This in turn requires us to define the likelihood for one specific configuration C first.
We use the word configuration in the same way as in graphical models of probability
distributions, e.g. a Markov random field or conditional random field. It describes a
hard-assignment of pixels to glyphs from the alphabet, meaning each pixel is assigned
to exactly one glyph in a one-hot coding. Given a soft-assignment y of spatial size d_y =
Width(y) × Height(y), configurations C ∈ A^{d_y} are accordingly hard-assignments.
Assuming that spatial positions s ∈ [1, d_y] in predicting the soft-assignment y are
conditionally independent events, we can define the likelihood
    P(C|y) = \prod_{s=1}^{d_y} y^s_{C^s}    (5.3.1)
of a specific configuration C being the hard-assignment generated out of the observed
soft-assignment y. The assumption that the spatial positions s in the soft-assignment y
are conditionally independent holds true if y is predicted by a deep neural network where
the last layer is not fed back to itself or the layers before[46, p.2]. If using a different
machine learning model from a deep neural network for predicting y, we need to make
sure that this assumption still holds true.
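As an illustration of Equation 5.3.1, the following sketch evaluates the likelihood of a
configuration for a small soft-assignment. It is a minimal example under assumptions: the
array layout (height, width, glyphs) and the function name `configuration_log_likelihood` are
chosen for this example, and the logarithm of the product is computed because the product of
many probabilities would underflow for realistic image sizes.

```python
import numpy as np

def configuration_log_likelihood(y, config):
    """log P(C|y) = sum over pixels s of log y^s_{C^s}.

    y:      soft-assignment of shape (J, I, K), normalized over the last axis
    config: hard-assignment of shape (J, I) holding one glyph index per pixel
    """
    picked = np.take_along_axis(y, config[..., None], axis=-1)[..., 0]
    return float(np.log(picked).sum())

# Uniform toy prediction: 2 by 2 pixels, 4 glyphs, every glyph has probability 0.25.
y_uniform = np.full((2, 2, 4), 0.25)
any_config = np.zeros((2, 2), dtype=int)
ll_uniform = configuration_log_likelihood(y_uniform, any_config)

# Random toy prediction: the pixel-wise argmax is the most likely single configuration.
rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 2, 4))
y_random = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
best_config = y_random.argmax(axis=-1)
ll_best = configuration_log_likelihood(y_random, best_config)
ll_other = configuration_log_likelihood(y_random, any_config)
```

Because the pixels are treated as conditionally independent, the most likely single
configuration is obtained by choosing the most probable glyph in every pixel. This
configuration does not necessarily correspond to a valid label sequence, which is why the
neighborhood relations discussed next are required.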
From here we can define the likelihood of observing a specific label sequence l given
the soft-assignment y. There are many different ways of writing the same label sequence
in a pixel space, e.g. coloring the paper using a pen, and these different ways of writ-
ing are different events that lead to the same outcome. Thus we can marginalize over
different configurations C in order to find the likelihood for a specific label string. The
likelihood
    P(l|y) = \sum_{C} \prod_{s=1}^{d_y} y^s_{C^s} \prod_{t \in nbr(s)} \alpha_{s,t}(C^s, C^t, l)    (5.3.2)
is dependent on the indicator function α, which defines valid glyph neighborhood relations
in pixel space. Function nbr(s) defines the 8 neighbors in pixel space around spatial
position s. Function α is defined as:
    \alpha_{s,t}(C^s, C^t, l) = \begin{cases} 1 & \text{iff } C^s, C^t \text{ are valid neighbors in } s, t \text{ according to } l \\ 0 & \text{else} \end{cases}    (5.3.3)
Configuration positions C^s and C^t are valid neighbors if their indicated glyphs match the
relations in the label sequence l, e.g. left-right, top-down relations are preserved. We will
discuss these neighborhood relations in detail in Section 6.2, but for now it is sufficient to
keep the discussed properties of Section 5.2 in mind.
Approach to Decoding
The likelihood P(l|y) gives us the opportunity to define the prototypical decoding method
for finding the most likely label sequence l⋆ given a soft-assignment y. This decoder
    l^\star = Decoder(y) = \operatorname{argmax}_{l} P(l|y)    (5.3.4)
simply selects the most likely label sequence. Sadly, Equation 5.3.2 prohibits such an
approach since enumerating all configurations C is computationally far too expensive.
Assuming alphabet A being a one-byte character set of 256 characters and a spatial
resolution of 320 by 240, we arrive at |A|^{320×240} = 2^{8×320×240} = 2^{614400} ≈ 6.7 × 10^{184952}
different configurations C.
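The order of magnitude stated above can be reproduced with a short calculation. The following
sketch works with base-10 logarithms, since the number itself is far too large for floating
point arithmetic. The variable names are chosen for this example only.

```python
import math

alphabet_size = 256       # one-byte character set
width, height = 320, 240  # spatial resolution of the soft-assignment

# log10 of |A|^(width * height)
log10_count = width * height * math.log10(alphabet_size)

exponent = math.floor(log10_count)
mantissa = 10 ** (log10_count - exponent)
bits = width * height * int(math.log2(alphabet_size))
```

The result is a number with a decimal exponent of 184952 and a mantissa between 6.7 and 6.8,
in line with the figure given in the text.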
This number of configurations C tells us that a simple maximum likelihood approach to
decoding will not be computationally feasible, a pattern that we will see repeat in Section
6.4. Thus the need arises for a decoding algorithm that uses the semantic structure
of multi-line text in order to quickly decode this soft-assignment y to a likely string. As
discussed above, multi-line text as approached in this work is organized in multiple distinct
text lines that span roughly horizontally from left to right and each text line can be a
variable number of pixels in height. These text lines are separated by line separators ϵl
in pixel space. Glyphs within a text line are organized from left to right, typically span the
height of their text line in pixel space and likely span over several pixels horizontally.
Armed with this knowledge we will derive a two-stage decoding algorithm that first
identifies distinct text lines in pixel space and then proceeds to decode each text line. First
identifying and extracting text lines makes this computationally feasible since it allows each
text line to be dynamically collapsed to a one-dimensional sequence and then decoded
with tried and tested algorithms such as Viterbi decoding. This two-stage approach
of finding and extracting text lines and then decoding each individually does resemble the
‘classic pipeline’ for offline handwriting recognition. The benefit of the approach of this
work is that this two-stage process is done on the soft-assignment y, which is semantically
closely related and relevant to the problem at hand, offline handwriting recognition, in
contrast to the classical approach, which starts from semantically unrelated grayscale,
often black-and-white, or color images.
Main Algorithm
The following paragraphs will discuss the algorithmic entry point of the multi-line decoding
algorithm, as stated in Algorithm 5.3.1, proposed in this work. The basic idea of this
decoding algorithm is that text lines are organized from top to bottom, each spanning
from left to right and each one should be dynamically collapsed to a one-dimensional
sequence in order to decode each text line.
To this end, the algorithm employs a scan-line approach with the scan-line starting
at the very top of the pixel space and spanning from the left to the right border. The
scan-line will then be moved from top to bottom, alternating between processing line
separators ϵl and readable text lines. Text lines may span multiple pixels in height and
thus each text line will be dynamically collapsed by summing its glyph probabilities per
pixel column. This works since, as we will see later, for decoding each text line only the
relative difference between the probabilities of any two glyphs is of importance, not their
absolute probability. The scan-line is allowed to move downwards with varying speeds
per pixel column in order to account for slanted or curved text lines. Figure 5.3.1 shows
an example of this scan-line approach.
The limitation of this proposed multi-line decoding algorithm is that it is not able to
correctly decode text lines that are curved by 45 degrees or more at one or more positions.
In other words, each text line has to be exactly one interval per pixel column. A single
text line may not occur in two or more distinct intervals in any given pixel column.
Algorithm 5.3.1 outlines the main function of the proposed multi-line decoding algorithm.
Inputs are both the soft-assignment y and the used alphabet A. The desired output
is a (highly) likely sequence of glyphs that led to the observed soft-assignment, or by
extension to the observed image of multi-line text.
[Figure 5.3.1: diagram of a 5 × 7 pixel grid with the markers 'Start Line 1' to 'End Line 3', line separators ϵl and scan-line movement operations numbered 1 to 35.]
Figure 5.3.1: Scan-line moving through a pixel space of width 5 and height 7 which contains 3
text lines. The ‘horizontal’ lines indicate the scan-line in different states. Dashed
lines signify states at the beginning of a text line and solid lines after a text line,
either hitting a line separator or reaching the end of the pixel space. The downward
arrows indicate the advancement of the scan-line with the numbers counting the
individual movement operations in order.
Algorithm 5.3.1 Proposed Multi-Line Decoding Algorithm
Input soft-assignment y and alphabet A.
I = Width(y)
J = Height(y)
Find line separators l = FindLineSeps(y).
Initialize the scan line s:
s ∈ ℕ^I and set elements to 1.
Initialize the resulting glyph sequence r to empty:
r = {}
while min(s) ≤ J do
    Skip over the preceding line separator:
    for i ∈ [1, I] do
        while s_i ≤ J ∧ l^{(i,s_i)} do
            s_i = s_i + 1
        end while
    end for
    Initialize accumulated glyph probabilities a:
    a ∈ ℝ^{I×|A|} and set elements to 0.
    Accumulate the glyph probabilities of the current text line:
    for i ∈ [1, I] do
        while s_i ≤ J ∧ ¬l^{(i,s_i)} do
            a^i_g = a^i_g + y^{(i,s_i)}_g, ∀g ∈ A
            s_i = s_i + 1
        end while
    end for
    Set line separator probabilities to zero:
    a^i_{ϵ_l} = 0, ∀i ∈ [1, I].
    Decode and add the current text line to the sequence:
    r = r + {ϵ_l} + DecodeLine(a)
end while
Transform the glyph sequence to a readable string:
Return GlyphsToString(r, A).
The decoding algorithm starts by applying function FindLineSeps to the soft-assign-
ment y in order to identify line separators in pixel space. The result is a two-dimensional
matrix of the same spatial size as the soft-assignment, where each pixel that is part of
a line separator is marked as a logical true. Whenever a position in l is logical true, it
means that the pixels above and below it in the same pixel column belong to two
distinct text lines. We will discuss two variants of the FindLineSeps function in Section
5.4.
The scan-line s is a vector of integer numbers, giving the vertical position in each pixel
column, with its number of components equal to the width of the soft-assignment y. The
scan-line is initialized to start at the very top of the pixel space. During decoding it will
move incrementally downwards until it reaches the lower end of the pixel space, in which
case the decoding algorithm stops. Accumulator a is reset for each new text line and is
used to dynamically collapse the text line. It is a matrix of the size of the width of the
soft-assignment by the number of glyphs in alphabet A and is used to sum up the glyph
probabilities while moving the scan-line through the text line in pixel space. Summing
up the glyph probabilities in this way may seem to
give different weights to different pixel columns as the text line may be of varying height.
Strictly speaking this is the case, but it does not influence the result since the line decoding
algorithms of Section 5.5 compare the glyph probabilities relative to each other and all
glyphs within one pixel column have the same weight as they are all the sum of the same
pixels.
Individual text lines are decoded using the function DecodeLine, which we will discuss
in two variants in Section 5.5. Their function is to take accumulated probabilities a and
produce the likely glyph sequence that matches this observation. Both variants of this
function are indeed identical to the ones proposed by connectionist temporal classifica-
tion[43, 46] since they too decode a one-dimensional sequence of glyphs.
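The scan-line procedure of Algorithm 5.3.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the soft-assignment is stored as nested lists of dictionaries, indices are 0-based, the name `LINE_SEP` and the helpers `argmax_seps`, `best_path` and `pixel` are hypothetical stand-ins, and the final GlyphsToString step is left out.

```python
LINE_SEP = "<l>"  # stands in for the line separator glyph; the name is an assumption


def decode_multi_line(y, alphabet, find_line_seps, decode_line):
    """Collapse and decode text lines top to bottom, following Algorithm 5.3.1.

    y[i][j] maps each glyph to its probability at column i, row j (0-based).
    Returns the raw glyph sequence r, before any GlyphsToString step.
    """
    width, height = len(y), len(y[0])
    seps = find_line_seps(y)
    scan = [0] * width  # scan-line: current row per pixel column
    result = []
    while min(scan) < height:
        # Skip over the line separator preceding the next text line.
        for i in range(width):
            while scan[i] < height and seps[i][scan[i]]:
                scan[i] += 1
        # Sum the glyph probabilities of the current text line per column.
        acc = [{g: 0.0 for g in alphabet} for _ in range(width)]
        for i in range(width):
            while scan[i] < height and not seps[i][scan[i]]:
                for g in alphabet:
                    acc[i][g] += y[i][scan[i]][g]
                scan[i] += 1
        # The separator itself must never be decoded as part of a line.
        for column in acc:
            column[LINE_SEP] = 0.0
        result += [LINE_SEP] + decode_line(acc)
    return result


# Stand-ins for FindLineSeps and DecodeLine (simple per-pixel argmax versions).
def argmax_seps(y):
    return [[max(px, key=px.get) == LINE_SEP for px in col] for col in y]


def best_path(a):
    return [max(col, key=col.get) for col in a]


def pixel(glyph):
    """Hypothetical helper: a pixel that strongly predicts one glyph."""
    return {g: (0.8 if g == glyph else 0.1) for g in ALPHABET}


ALPHABET = ["a", "b", LINE_SEP]
# Two columns, three rows: text row "ab", a separator row, text row "ba".
y = [
    [pixel("a"), pixel(LINE_SEP), pixel("b")],
    [pixel("b"), pixel(LINE_SEP), pixel("a")],
]
decoded = decode_multi_line(y, ALPHABET, argmax_seps, best_path)
```

In this toy example each text line is one pixel high, so the accumulator simply copies one row. With taller lines the inner loop would add several rows per column before the line is decoded.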
Parallelization using multi-threading can be implemented in this multi-line decoding
algorithm in two different ways: first, by decoding multiple paragraphs in parallel. This
decoding algorithm does not depend on shared resources and thus can easily be applied
to a batch of examples at the same time. Second, even within one example the decoding
of individual lines can be done in parallel. This approach entails first applying the Find-
LineSeps function, followed by running the horizontal scan-line to accumulate the glyph
probabilities per text line. In contrast to a non-parallel implementation the individual text
lines are not directly decoded, but their accumulated glyph probabilities stored for later
use. All text lines can then be decoded in parallel by applying the DecodeLine function in
multiple threads. The decoded individual text lines are then combined to the final result
of the multi-line decoding algorithm.
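The second parallelization scheme, storing the accumulated glyph probabilities per text line and decoding them concurrently, might look as follows. This is only a sketch: the thread pool from the Python standard library, the name `decode_lines_in_parallel` and the stand-in decoder `best_path` are assumptions made for illustration. In CPython, threads will rarely speed up pure-Python decoding, so a process pool or a native decoder would be the likelier choice in practice.

```python
from concurrent.futures import ThreadPoolExecutor

LINE_SEP = "<l>"  # placeholder name for the line separator glyph (assumption)


def best_path(a):
    """Stand-in DecodeLine: most probable glyph per pixel column."""
    return [max(column, key=column.get) for column in a]


def decode_lines_in_parallel(accumulated_lines, decode_line, max_workers=4):
    """Decode stored per-line accumulators concurrently, then join in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        decoded = list(pool.map(decode_line, accumulated_lines))
    result = []
    for line in decoded:
        result += [LINE_SEP] + line
    return result


# Two already-collapsed text lines of two pixel columns each.
lines = [
    [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}],
    [{"a": 0.3, "b": 0.7}, {"a": 0.6, "b": 0.4}],
]
parallel = decode_lines_in_parallel(lines, best_path)
```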
We will discuss the two variants of the function FindLineSeps in the following Section
5.4 and the two variants of the function DecodeLine in Section 5.5. The variants of both
functions are usable in all combinations, yielding four different variants of this proposed
decoding algorithm.
Converting Glyph Sequences to Strings
So far we have discussed the overall algorithm for decoding multi-line texts. Next we will
discuss the function GlyphsToString, which serves three purposes:
• Allow individual glyphs to stretch over multiple adjacent pixels.
• Allow the same glyph to occur in two or more adjacent characters within the same
text line.
• Map the decoded glyph sequence to a computer-processable string, e.g. encoded
in UTF-8.
Each position in the decoded glyph sequence r roughly corresponds to one pixel column
of one text line, that is a one-pixel wide column of a small vertical interval, which has
been dynamically collapsed. This means that if a glyph stretches over multiple pixels in width,
it may result in repetitions of this glyph in the resulting sequence. It only may, because one
of the two variants of the line decoding function, see Section 5.5, already deduplicates
these repetitions. To make sure that this case is correctly handled, function GlyphsToString
will first deduplicate adjacent repetitions of the same glyph, e.g. the sequence
‘aaaaϵgϵgaabbbbbbbb’ is mapped to the new sequence ‘aϵgab’.
The next step is to allow for the same glyph in adjacent characters within the same text
line. As stated before, see Section 5.2, this is encoded by including the glyph separator
ϵg in the sequence to distinguish between the same glyph in multiple adjacent pixels
and the same glyph in multiple adjacent characters. The above example sequence
‘aϵgab’ encodes the same glyph ‘a’ in two adjacent characters. This is signaled by the
intermediate ϵg glyph. Keeping in mind that repetitions of the same glyph within adjacent
pixels are already handled, we can now safely omit the glyph separator ϵg. The
example ‘aϵgab’ thus reduces to ‘aab’, which is the final glyph sequence.
The last step is to map the glyph sequence to a computer-processable string encoded
in the character encoding chosen by the user. This is a mapping from glyphs in alphabet
A to the computer’s character encoding, which is applied to the decoded glyph sequence.
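The three steps of GlyphsToString can be sketched as below. The placeholder `GLYPH_SEP` for the glyph separator ϵg and the dictionary `glyph_to_char` are assumptions for illustration. How the line separator ϵl is mapped to a character is not specified in this section, so it is left out of the example.

```python
from itertools import groupby

GLYPH_SEP = "<g>"  # placeholder name for the glyph separator (assumption)


def glyphs_to_string(glyphs, glyph_to_char):
    """Fold a decoded glyph sequence into a string in three steps."""
    # 1. Collapse adjacent repetitions of the same glyph.
    deduplicated = [glyph for glyph, _ in groupby(glyphs)]
    # 2. Drop the glyph separator, which has done its job after step 1.
    kept = [glyph for glyph in deduplicated if glyph != GLYPH_SEP]
    # 3. Map each glyph to its character in the target encoding.
    return "".join(glyph_to_char[glyph] for glyph in kept)


# The example sequence from the text: aaaa, two separators, aa, bbbbbbbb.
sequence = ["a"] * 4 + [GLYPH_SEP] * 2 + ["a"] * 2 + ["b"] * 8
text = glyphs_to_string(sequence, {"a": "a", "b": "b"})
```

The order of the first two steps matters: removing the separator first would merge the two runs of ‘a’ into one and lose a character.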
On the Runtime
At this point it is also prudent to discuss the computational runtime of the proposed
Algorithm 5.3.1. We will use Bachmann-Landau notation[69] to this end. Specifically, we
will employ the O notation (‘Big-O notation’), which defines an upper limit on the growth of a
function.
Algorithm 5.3.1 employs a scan-line that spans horizontally and moves downwards,
visiting each pixel exactly once and summing up the glyph probabilities of pixel columns
within the same text line. This leads to an upper limit for the runtime of O(I × J × |A|). As
before, I is the width and J the height of the pixel space. |A| is the number of glyphs in
the alphabet.
The referenced algorithm also calls function FindLineSeps once and function DecodeLine
once per text line, of which there are at most half as many as pixel rows. Function
GlyphsToString is called once, reducing and mapping the decoded glyph sequence. Since this
sequence has an upper limit for its length of one glyph per pixel, it also has a computational
limit of O(I × J × |A|).
In total this leads us to an upper limit for the runtime of Algorithm 5.3.1 in O(I × J ×
|A|) +O(FindLineSeps) + J ×O(DecodeLine).
5.4 Finding Lines
Maximum Individual Probability Variant
In Section 5.3 we have discussed the overall multi-line text decoding algorithm proposed
in this work. It employs a two-stage approach in which first line separators are identified
and text lines extracted, followed by decoding each text line. We will now discuss the
first of two variants of the FindLineSeps function for finding line separators, and thus
extracting lines.
This algorithm for finding line separators is based on the assumption that the predictor
that generated the soft-assignment y does not make mistakes, or at least that it will still predict
the line separator glyph ϵl with the highest probability wherever it is required in the pixel in question
in order to satisfy the neighborhood relations for the correct glyph sequence. Whenever
the pixels above and below the current pixel in the same pixel column belong to two different
text lines, the predictor is expected to predict the line separator in this position.
Algorithm 5.4.1 FindLineSeps: Max. Individual Probability Variant
Input soft-assignment y.
I = Width(y)
J = Height(y)
Initialize the resulting matrix with pixels marked as line separators:
l ∈ {true, false}^{I×J} and set elements to false.
Mark pixels where the line separator has the highest probability:
for i ∈ [1, I] do
    for j ∈ [1, J] do
        if argmax_g y^{(i,j)}_g ≡ ϵ_l then
            l^{(i,j)} = true
        end if
    end for
end for
Finished, return the marked line separators:
Return l.
Algorithm 5.4.1 outlines the algorithm for finding text lines based on this assumption.
It initializes a matrix with one position per pixel in soft-assignment y, which is the marker
if this position is a line separator, and initializes all to a logical false value. After that the
algorithm simply iterates all pixels, identifies the highest-probability glyph for this pixel and
marks it as a line separator if the glyph is the line separator ϵl. In terms of computational
runtime, this leads to an upper limit of O(I × J × |A|) for this algorithm.
The benefit of this algorithm is that it is very fast. On the other hand, it is not robust
in case of noise in the soft-assignment y or whenever the correct line separators are not
predicted with a high probability, at least not as a continuous line in the full width from the
left to right borders.
The main decoding algorithm as outlined in Section 5.3 is based on a scan-line that
is spanning from the left to the right borders and moving from top to bottom. It alternates
between skipping over the line separator preceding a text line and then decoding the text
line while moving the scan-line through it. Gaps in the predicted line separators, as can
happen with this line-finding algorithm, will then result in the merge of two or more text
lines.
Figure 5.4.1 illustrates this problem. The example contains a paragraph of three text
lines, but the line separator between the first two lines has a gap in it. Step A shows
how the scan-line moves through the first decoded text line and merges parts of the first
two true text lines. Step B continues this merging of adjacent text lines since now the
scan-line is offset by one in comparison to the true text lines. Step C then misses part of
the last text line in the area where the original gap exists.
Noise in the soft-assignment will result in similar effects. A single pixel where the
line separator is randomly predicted with the highest probability will lead to the scan-line
switching between lines and thus introducing an offset in comparison to the true text lines.
We will discuss a second line-finding algorithm that solves these problems in the next
few pages.
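The effect of a gap can be made concrete with a small sketch of the maximum individual probability variant. It assumes a soft-assignment stored as nested lists of dictionaries with 0-based indices. The names `LINE_SEP`, `separator_runs_per_column` and `pixel` are hypothetical and only serve the illustration.

```python
LINE_SEP = "<l>"  # placeholder name for the line separator glyph (assumption)


def find_line_seps_argmax(y):
    """Mark every pixel whose most probable glyph is the line separator."""
    return [[max(px, key=px.get) == LINE_SEP for px in column] for column in y]


def separator_runs_per_column(seps):
    """Count the maximal vertical runs of separator pixels in each column."""
    counts = []
    for column in seps:
        runs, previous = 0, False
        for is_sep in column:
            if is_sep and not previous:
                runs += 1
            previous = is_sep
        counts.append(runs)
    return counts


def pixel(sep_prob):
    """Hypothetical helper: a pixel with the given separator probability."""
    return {LINE_SEP: sep_prob, "a": 1.0 - sep_prob}


# Three columns, three rows. The middle row is a line separator, but it is
# only weakly predicted in the middle column, which leaves a gap.
y = [
    [pixel(0.1), pixel(0.9), pixel(0.1)],
    [pixel(0.1), pixel(0.4), pixel(0.1)],
    [pixel(0.1), pixel(0.9), pixel(0.1)],
]
seps = find_line_seps_argmax(y)
runs = separator_runs_per_column(seps)
```

Here the outer columns contain one separator run and the middle column none. In the middle column the scan-line therefore runs straight through both text lines while the other columns stop at the separator, which is the merging behaviour described above.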
Figure 5.4.1 (panels A, B, C): Merging of text lines while moving the scan-line downwards in case of gaps in the
predicted line separators.
Continuous Separators Variant
As we have seen before, Algorithm 5.4.1 poses problems when identifying lines in the face
of random noise or gaps in the prediction of the true line separator. This phenomenon
occurs because only the individual pixel and no context information is used for deciding
if the pixel is a line separator or not. We have discussed the semantic structure of the
soft-assignment in Section 5.2. Line separators are expected to form a continuous line
from the left to right border. There also should be at least a one pixel vertical space
between two line separators in order to actually fit a text line in between. The first property
disallows gaps in the line separator and prevents larger influence due to noise. The
second property again is important in face of the scan-line behavior.
Figure 5.4.2 (panels A, B, C): Merged line separators introduce an offset into the scan-line.
Figure 5.4.2 illustrates the particular problem of merged line separators. The middle
text line is shorter than the full width and thus the line separator between the first and
second text lines merges with the one between the second and third text lines. We can
see that this correctly decodes the first text line in step A, but then partially merges the
second and third text lines in step B because there is an offset in the scan-line. Step C
then again misses parts of the last text line.
We can identify three properties of a line-finding algorithm that works well with the
overall decoding algorithm as outlined in Algorithm 5.3.1:
1. Random noise should be ignored as best as possible.
2. Line separators should be continuous lines from the left to right borders.
3. Two line separators should be at least one vertical pixel apart to allow for a text line
in between.
The second and third properties can be expressed in a consistent manner: Each pixel
column of the soft-assignment should contain exactly the same number of line separators
as the other pixel columns in the same soft-assignment. This prevents the introduction of
offsets into the scan-line.
With this in mind we can start to derive an algorithm for finding continuous line sepa-
rators from the left to right border that do not have gaps, have at least one vertical pixel
in space between each other and are robust against noise. The basic idea is to find the
probability for drawing a continuous line separator from a specific vertical position on the
left border to a specific vertical position on the right border, as well as marking the path in
between that leads to this probability. For this we define a continuous line as one in which
each included pixel is offset from its neighbors exactly by one pixel in horizontal and by
at most one pixel in vertical direction. This allows for curved line separator strokes with a
curvature between minus 45 and plus 45 degrees. The algorithm should identify the best
line separator stroke for all left-bound starting points to right-bound ends and their paths
in between. It then picks the highest-probability ones and adds them to the result as long
as the conditions for line separators are not violated.
We observe that the line separator candidates starting in the same vertical position
on the left can be calculated using a dynamic programming approach in a tableau. The
line candidate probabilities follow the recursive formulation
\[
P(c^{(i,j)} \mid y, c^{(i-1,j\pm1)}) = y^{(i,j)}_{\epsilon_l} \times \max\bigl(P(c^{(i-1,j-1)} \mid \ldots),\; P(c^{(i-1,j)} \mid \ldots),\; P(c^{(i-1,j+1)} \mid \ldots)\bigr) \tag{5.4.1}
\]
where the candidate probability P(c^{(i,j)} | ...) is dependent on its predecessors to the left,
namely c^{(i−1,j−1)}, c^{(i−1,j)} and c^{(i−1,j+1)}. Index i − 1 refers to the pixel column to the left,
j ± 1 to the pixel rows above and below.
Figure 5.4.3: Tableau of line separator probabilities starting from the third position on the left
(vertical positions numbered 1 to 7). Higher saturation encodes higher probability. Blank areas
are not reachable.
Figure 5.4.3 shows one such tableau that gives the line separator candidates that
start at the third vertical position on the left and end on the right border. The line sep-
arators with a higher saturation have a higher probability. We can later follow the line
separator candidate in reverse from right to left in order to enter it into the result.
The overall algorithm for finding continuous line separators is shown in Algorithm
5.4.2. It starts again by initializing the result structure, followed by computing the tableaus
of line separator probabilities and their line separator candidates. These line separator
candidates are then processed in descending probability and entered into the result struc-
ture if viable.
Algorithm 5.4.2 heavily relies on two functions, one for computing the tableaus of line
separator probabilities in the first place and the second for tracing each line separator
candidate in a backwards fashion and inserting them into the result structure. Algorithm
5.4.3 outlines the first one. It generates one tableau for every starting position s on the
left border by using the line separator probability y^{(1,s)}_{ϵ_l} from the soft-assignment as the
line separator probability in that single pixel. It then extends the tableau to the right
by applying a dynamic programming approach based on the recursive formulation from
Equation 5.4.1.
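The tableau construction can be sketched as follows. The sketch assumes that the line separator probabilities have already been extracted from the soft-assignment into a nested list `sep_prob` with 0-based indices, and that predecessors outside the pixel space simply do not exist. Both are choices made for this illustration.

```python
def cand_tableaus(sep_prob):
    """Build one dynamic-programming tableau per starting row on the left.

    sep_prob[i][j] is the line separator probability at column i, row j.
    Returns candidates as tuples (probability, tableau, start_row, end_row).
    """
    width, height = len(sep_prob), len(sep_prob[0])
    candidates = []
    for start in range(height):
        t = [[0.0] * height for _ in range(width)]
        t[0][start] = sep_prob[0][start]
        # Extend the separator one column at a time (Equation 5.4.1).
        for i in range(1, width):
            for j in range(height):
                rows = range(max(0, j - 1), min(height, j + 2))
                best_predecessor = max(t[i - 1][k] for k in rows)
                t[i][j] = best_predecessor * sep_prob[i][j]
        # Every reachable end row on the right border is one candidate.
        for end in range(height):
            if t[width - 1][end] > 0:
                candidates.append((t[width - 1][end], t, start, end))
    return candidates


# Three columns, three rows, with a strong separator in the middle row.
sep_prob = [[0.1, 0.9, 0.1], [0.1, 0.9, 0.1], [0.1, 0.9, 0.1]]
candidates = cand_tableaus(sep_prob)
best = max(candidates, key=lambda c: c[0])
```

In this example all nine combinations of start and end row are reachable, and the best candidate is the straight separator through the middle row with probability 0.9 × 0.9 × 0.9.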
Algorithm 5.4.2 FindLineSeps: Continuous Separators Variant
Input soft-assignment y.
I = Width(y)
J = Height(y)
Initialize the resulting matrix with pixels marked as line separators:
l ∈ {true, false}^{I×J} and set elements to false.
Produce tableaus of line separator candidates:
C = CandTableaus(y)
Sort by descending probability:
C = SortByProb(C)
Add as many candidates as viable:
while C ≠ ∅ do
    Retrieve the highest-probability separator candidate:
    (p, t, s, e) = Pop(C)
    Trace it backwards and test if it is viable:
    c = TraceSeparator(l, p, t, s, e)
    if c ≠ ∅ then
        Accept the candidate into the result:
        for i ∈ [1, I] do
            j = c_i
            l^{(i,j)} = true
        end for
    end if
end while
Finished, return the marked line separators:
Return l.
The second of the two missing functions from Algorithm 5.4.2 is shown in Algorithm
5.4.4, which takes a separator candidate and traces it backwards from right to left, enter-
ing it into the result if viable. For this, each line separator candidate consists of the tableau
with the separator probabilities, as exemplified in Figure 5.4.3, as well as its starting and
ending positions on the left and right border, respectively. It will then follow the maximum
probability within the tableau from the right border to the left one while testing each pixel
if it is still valid as a line separator according to Algorithm 5.4.5. If the line separator can-
didate is not viable anymore, it is skipped and the next highest probability line separator
candidate is processed by Algorithm 5.4.2.
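The backward trace and the greedy acceptance loop can be sketched together. This is an illustration under assumptions, not the implementation of this work: indices are 0-based, rows outside the pixel space are ignored, a rejected candidate is signalled by `None`, and the function name `accept_candidates` is hypothetical.

```python
def is_separator_okay(seps, i, j):
    """A pixel is viable if neither it nor a vertical neighbour is marked."""
    height = len(seps[i])
    return not any(seps[i][k] for k in (j - 1, j, j + 1) if 0 <= k < height)


def trace_separator(seps, candidate):
    """Follow a candidate from right to left; return its rows or None."""
    _, t, start, end = candidate
    width, height = len(seps), len(seps[0])
    if not (is_separator_okay(seps, 0, start)
            and is_separator_okay(seps, width - 1, end)):
        return None
    trace, last = [end], end
    for i in range(width - 2, -1, -1):
        neighbours = [k for k in (last - 1, last, last + 1) if 0 <= k < height]
        nxt = max(neighbours, key=lambda k: t[i][k])
        if not is_separator_okay(seps, i, nxt):
            return None
        trace.append(nxt)
        last = nxt
    trace.reverse()
    return trace


def accept_candidates(candidates, width, height):
    """Greedily enter candidates in order of descending probability."""
    seps = [[False] * height for _ in range(width)]
    for candidate in sorted(candidates, key=lambda c: c[0], reverse=True):
        trace = trace_separator(seps, candidate)
        if trace is not None:
            for i, j in enumerate(trace):
                seps[i][j] = True
    return seps


# Hand-written tableau for start row 1 in a 3 x 3 pixel space.
tableau = [[0.0, 0.9, 0.0], [0.09, 0.81, 0.09], [0.081, 0.729, 0.081]]
strong = (0.729, tableau, 1, 1)
weak = (0.081, tableau, 1, 0)
first_trace = trace_separator([[False] * 3 for _ in range(3)], strong)
seps = accept_candidates([weak, strong], 3, 3)
```

The strong candidate is accepted first and marks the middle row in every column. The weak candidate then fails the viability test at its starting pixel and is skipped.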
This process as outlined by Algorithms 5.4.2, 5.4.4 and 5.4.5 is greedy, entering as
many line separators into the result as viable. In some cases these may be more or
less than the actual true number of line separators in the example to be decoded. It is
to be expected that a well-trained deep neural network, or model in general, generating
the soft-assignment y will produce the correct line separators, although possibly with
low-probability gaps in them. This is the case for which this algorithm was designed.
There are limitations to the line separators that the algorithm discussed in this section
can detect, most importantly that they cannot touch and must be separated by at least one
pixel in vertical direction. These constraints strongly limit false line separators generated
by noise: such candidates are unlikely to be accepted, since the high-probability
true line separators have already been entered into the result. Optionally, a lower limit on the line
separator probability could be enforced in order to discard line separator candidates with
large gaps or high noise. Still, there is a chance that this algorithm will find false line
separators as a result of flukes in the model or random noise. The same is true for the
line detection outlined in Algorithm 5.4.1.
Algorithm 5.4.3 CandTableaus: Produce tableaus of line separator candidates
Input soft-assignment y.
I = Width(y)
J = Height(y)
Initialize the set of line separator candidates:
C = {}
Iterate starts on the left border:
for s ∈ [1, J] do
    Initialize a new empty tableau:
    t ∈ ℝ^{I×J} and set elements to 0.
    Set beginning line separator probability:
    t^{(1,s)} = y^{(1,s)}_{ϵ_l}
    Increment to form continuous separators to the opposing border:
    for i ∈ [2, I] do
        for j ∈ [1, J] do
            Preceding line separator probability:
            p = max(t^{(i−1,j−1)}, t^{(i−1,j)}, t^{(i−1,j+1)})
            Probability for a line separator at the current position:
            t^{(i,j)} = p × y^{(i,j)}_{ϵ_l}
        end for
    end for
    Iterate ends on the right border and store candidates:
    for e ∈ [1, J] do
        p = t^{(I,e)}
        if p > 0 then
            C = C ∪ {(p, t, s, e)}
        end if
    end for
end for
Finished, return the line separator candidates:
Return C.
Algorithm 5.4.4 TraceSeparator: Trace a separator candidate backwards
Input separator matrix l.
Input separator candidate (p, t, s, e).
I = Width(l)
J = Height(l)
Initialize list of resulting vertical coordinates:
c = {}
Is the start and end still viable?
if IsSeparatorOkay(l, 1, s) ∧ IsSeparatorOkay(l, I, e) then
    Add the right end to the separator trace:
    c = c + e
    Last seen vertical position in the separator:
    j^- = e
    Iterate backwards from right to left border:
    for a ∈ [1, I − 1] do
        i = I − a
        Find next vertical position j^+ with the highest probability:
        j^+ = argmax_j (t^{(i,j^- − 1)}, t^{(i,j^-)}, t^{(i,j^- + 1)})
        Test if this pixel is viable as a separator:
        if IsSeparatorOkay(l, i, j^+) then
            Use this position as the next step in the line separator:
            c = c + j^+
            j^- = j^+
        else
            Separator candidate is not viable anymore:
            Return ∅.
        end if
    end for
    Reverse the order of the trace and finish:
    Return ReverseOrder(c).
end if
This candidate is not viable anymore:
Return ∅.
Algorithm 5.4.5 IsSeparatorOkay: Can this pixel be a separator?
Input separator matrix l.
Input coordinates i, j.
if l^{(i,j)} ≡ true ∨ l^{(i,j−1)} ≡ true ∨ l^{(i,j+1)} ≡ true then
    Return logical false.
end if
Return logical true.
On the Runtime
In case of Algorithm 5.4.2 it is prudent to start the runtime analysis backwards. Function
IsSeparatorOkay simply has a runtime of O(1). Function TraceSeparator in Algorithm
5.4.4 follows a line separator candidate backwards, visiting each pixel column once and
identifying the highest-probability predecessor in each column. Since at most three
predecessors are tested in each column, this reduces to a total runtime of O(I) for width I.
Function CandTableaus generates one tableau of line separator candidates for each of
the J pixels on the left border and each tableau requires visiting every pixel once. This
results in a total runtime of O(I × J²) for this function with I being the width and J the
height of the pixel space.
The overall function FindLineSeps as described in Algorithm 5.4.2 calls function
CandTableaus once and then sorts the resulting line separator candidates, of which there are at most
J². Comparison-based sorting has a computational complexity of O(n × log(n)), which in
this case is O(J² × log(J²)) and reduces to O(J² × 2 × log(J)) for J > 0 and thus
to O(J² × log(J)). The upper limit for the number of line separators actually entered
in the result is half the height J, respecting the condition that two line separators must
be separated by at least one pixel in vertical direction. Function TraceSeparator, however,
is called once per candidate and thus at most J² times, each call costing O(I). In total
the runtime of Algorithm 5.4.2 is thus
O(I × J² + J² × log(J) + J² × I), which reduces to O(I × J² + J² × log(J)).
5.5 Decoding Lines
Preface
The two Sections 5.3 and 5.4 discuss and outline the algorithm for decoding multi-line
texts and for identifying and extracting individual lines within a paragraph as one step of
the decoding. The missing piece to complete this multi-line decoding algorithm is to de-
code the individual lines by means of reading a high likelihood glyph sequence from the
probabilistic model output. Algorithm 5.3.1 employs a scan-line spanning from the left to
right borders in order to sequentially process each line and dynamically collapse each line
to a one-dimensional sequence of glyph probabilities. This in turn means that decoding
each text line is the same decoding problem as in connectionist temporal classification
(CTC)[43, 46]. The work discussed in this thesis thus employs the one-dimensional de-
coding algorithms from CTC in order to decode individual text lines.
The Algorithms 5.5.1, 5.5.2, 5.5.3 and 5.5.4 as discussed in this section are the au-
thor’s specific implementation of the algorithms proposed in the CTC[43, 46] publications.
Further ideas on how to improve decoding algorithms for one-dimensional sequences are
discussed in Sections 10.2 and 10.3.
Best Path Variant
The first of the two algorithms for decoding one-dimensional text lines is named best path
decoding and outlined in Algorithm 5.5.1. Recalling Figure 3.1.1, we observe that there
are multiple different ways to align a glyph sequence over a time series in case the time
series is longer than the glyph sequence. This allows for variability, e.g. for translation
of glyphs in pixel space, or time steps in one-dimensional problems², or for glyphs spanning
multiple pixels. A ‘path’ in this context, or a ‘configuration’ in terms of this work and
graphical models in general, refers to one single chain of glyphs with one glyph per time
step, or visually one path from left to right in Figure 3.1.1.
² In this context, the terms ‘time step’ in one-dimensional decoding and ‘pixel’ in the dynamically
collapsed, accumulated glyph probabilities are synonyms.
Best path decoding as shown in Algorithm 5.5.1 is based on the assumption that
the highest-probability configuration C also represents the true glyph sequence. This
holds true for perfect estimators of the soft-assignment y, which would produce a one-
hot encoding at each time step with the true glyph having a probability of one and all
others zero. The overall probability for one configuration is given by
\[ P(C \mid a) = \prod_i a^i_{C_i} \tag{5.5.1} \]
with C_i being the glyph in configuration C at time step i and a being the one-dimensional
sequence of glyph probabilities as accumulated by Algorithm 5.3.1.
This assumption does not hold true anymore in case the accumulated glyph proba-
bilities a contain time steps where the true glyphs are predicted with a low probability or
where there is a general ambiguity of glyphs.
Algorithm 5.5.1 DecodeLine: Best Path Variant
Input accumulated glyph probabilities a.
I = Width(a)
Initialize the result sequence:
s = {}
Collect the maximum probability glyph per pixel:
for i ∈ [1, I] do
    Find the glyph in this pixel:
    g⋆ = argmax_g a^i_g
    Append the glyph to the sequence:
    s = s + g⋆
end for
Finished, return the glyph sequence:
Return s.
Algorithm 5.5.1 then outlines the first variant of the DecodeLine function in Algorithm
5.3.1. It decodes a single text line by identifying the highest probability glyph per time
step and appending it to the decoded glyph sequence.
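Best path decoding and the configuration probability of Equation 5.5.1 can be sketched as follows. The representation of a as a list of dictionaries, one per time step, is an assumption made for this illustration.

```python
from math import prod


def configuration_probability(config, a):
    """Equation 5.5.1: product of the chosen glyph's value at each time step."""
    return prod(a[i][glyph] for i, glyph in enumerate(config))


def decode_line_best_path(a):
    """Pick the highest-valued glyph independently at every time step."""
    return [max(column, key=column.get) for column in a]


# Three time steps over the two glyphs 'a' and 'b'.
a = [
    {"a": 0.6, "b": 0.4},
    {"a": 0.55, "b": 0.45},
    {"a": 0.55, "b": 0.45},
]
best = decode_line_best_path(a)
best_probability = configuration_probability(best, a)
```

Because the product in Equation 5.5.1 factorizes over time steps, maximizing each factor separately yields the single most probable configuration, here ‘aaa’ with 0.6 × 0.55 × 0.55.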
Beam Search Variant
Again recalling Figure 3.1.1 and the discussion in Section 5.3, we observe that there are
multiple configurations that represent the same glyph sequence when accounting for e.g.
repetitions of the same glyph. We can use Beam Search[106] as a heuristic to uncover
the most likely glyph sequence, accounting for its different configurations.
Since different configurations that fold to the same glyph sequence are mutually
exclusive events, we can write the likelihood for a specific glyph sequence t
\[ P(t \mid a) = \sum_{C \,:\, \mathrm{GlyphsToString}(C) \equiv t} \; \prod_i a^i_{C_i} \tag{5.5.2} \]
as the sum of the likelihoods for observing the configurations C that map to it using function
GlyphsToString. We can use this fact to heuristically decode a high-likelihood glyph
sequence by building a trie of already known glyph sequences, the prefixes of the full
decoded sequence, and incrementally appending further glyphs to those prefixes with the
highest likelihood. Low-likelihood prefixes will be discarded along the way in order to reduce
the computational effort, introducing a heuristic property to this decoding algorithm.
The decoding algorithm outlined by Algorithm 5.5.2 is based on the idea to build a
prefix trie, initialized with the empty sequence, containing the top-n most likely sequences
and then incrementally appending to those until all time steps of the input soft-assignment
are processed. Prefixes that do not fall in the top-n most likely ones per time step are
discarded. Finding the best prefixes within the current trie is implemented using a heap,
organized by the likelihood of each prefix. In the context of beam search, the number n of
most likely prefixes that are processed in each time step is also called the ‘beam width’.
Processing the top-n prefixes at each time step requires that the likelihood for each
is kept up to date and consistent in order to compare the individual likelihood of multiple
prefixes. This requires that we respect the rules for folding glyph sequences, implemented
by function GlyphsToString, at every time step. This in turn requires the application of
Equation 5.5.2 and the rules for folding given by function GlyphsToString to the prefixes
in the trie. This leads to two rules for incrementally adding glyphs to the prefixes in the
trie:
1. Increment a prefix with a glyph by multiplication of the glyph probability with the
prefix likelihood. Append the glyph to the sequence if it is different from the last
glyph in the sequence.
2. Fold two identical prefixes by summation of their likelihood and unification of their
trie nodes.
Algorithm 5.5.2 implements this beam search decoding algorithm for one-dimensional
sequences. It builds a prefix trie of known glyph sequences, incrementally appending
glyphs to the top-n most likely ones according to a heap structure that organizes trie
nodes. It keeps track of two likelihoods per prefix sequence, that is per trie node: the
likelihood in the last time step and the interim likelihood in the current time step. The
current likelihood is used for accumulation of the likelihoods if multiple prefixes fold to the
same sequence and is initialized to zero at each time step. The last likelihood is used for
calculation of the new likelihood when appending a new glyph to this prefix.
The algorithm further depends on two helper functions, Increment for appending to a
prefix and FlipProbs for saving the likelihoods from the last time step and zeroing for the
next time step.
Algorithm 5.5.3 describes the function for incrementing an existing prefix sequence
by one additional glyph and at the same time folding identical glyph sequences. This is
done by following the prefix trie structure, modifying the likelihood if the current sequence
is already known or creating a new leaf node if it is a previously unknown sequence.
Folding is implemented by summation of the likelihoods of two identical glyph sequences.
Algorithm 5.5.4 is a helper function that enumerates the prefix sequences within the
trie and flips the current for the last likelihoods, re-initializing the current likelihoods to
zero again.
This Beam Search decoding algorithm is capable of decoding the correct glyph se-
quence even if it is partially weakly predicted by the model that generated the soft-
assignment y. This is because it takes into account the different configurations, that
is ways to align the glyph sequence over the pixel space, of each individual glyph
sequence. If there are parts where the correct glyphs are weakly predicted, there are still
multiple configurations that use this weak prediction and sum up to a high likelihood for
this sequence. This is likely not the case for random noise in the prediction. Still, as with both
decoding algorithms discussed here, it is not robust in the face of incorrectly predicted
glyphs.
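The two rules for incrementing and folding prefixes can be illustrated with a compact sketch. It deliberately simplifies the data structures: a dictionary keyed by the prefix replaces the trie, a sort replaces the heap, and the empty prefix is always extended. Like the Increment function, it folds only adjacent repetitions of the same glyph. The function name and the list-of-dictionaries input are assumptions.

```python
def decode_line_beam_search(a, beam_width):
    """Beam search over deduplicated prefixes, folding equal prefixes by sum."""
    beams = {(): 1.0}  # maps a prefix (tuple of glyphs) to its likelihood
    for column in a:
        # Only the beam_width most likely prefixes are extended further.
        ranked = sorted(beams.items(), key=lambda kv: kv[1], reverse=True)
        new_beams = {}
        for prefix, p_last in ranked[:beam_width]:
            for glyph, p_glyph in column.items():
                p = p_last * p_glyph
                if p <= 0:
                    continue
                # Rule 1: append only if the glyph differs from the last one.
                if prefix and prefix[-1] == glyph:
                    extended = prefix
                else:
                    extended = prefix + (glyph,)
                # Rule 2: identical prefixes fold by summing their likelihoods.
                new_beams[extended] = new_beams.get(extended, 0.0) + p
        beams = new_beams
    if not beams:
        return []
    return list(max(beams, key=beams.get))


# Three time steps where 'a' is always slightly more probable than 'b'.
a = [
    {"a": 0.6, "b": 0.4},
    {"a": 0.55, "b": 0.45},
    {"a": 0.55, "b": 0.45},
]
decoded = decode_line_beam_search(a, beam_width=3)
```

In this example the per-step maximum is always ‘a’, so the single best configuration is ‘aaa’ with likelihood 0.1815. The prefix ‘ab’ nevertheless wins, because its two configurations ‘aab’ and ‘abb’ sum to 0.1485 + 0.1215 = 0.27.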
Algorithm 5.5.2 DecodeLine: Beam Search Variant
Input accumulated glyph probabilities a, beam-width w and alphabet A.
I = Width(a)
Initialize the trie to only the empty sequence. Elements are 4-tuples of the own glyph,
set of suffixes, last probability and current probability:
T = {(ϵ_g, ∅, 1, 0)}
Initialize the heap of trie nodes, sorted by descending last probability:
H = Heapify(T)
Iterate pixels and append to the trie:
for i ∈ [1, I] do
    Follow the top w current prefixes:
    for n ∈ [1, w] do
        Retrieve the best sequence/trie node from the heap:
        s = Pop(H)
        Trie node s is a 4-tuple as described above:
        s ≡ (g′, C, p_last, p_cur)
        Are there any more prefixes?
        if s ≠ ϵ then
            Increment the sequence by one glyph:
            for g ∈ A do
                Probability for the sequence with the glyph added:
                p = p_last × a^i_g
                if p > 0 then
                    Increment with this glyph:
                    Increment(s, g, p)
                end if
            end for
        end if
    end for
    Flip the current and last probabilities:
    T = FlipProbs(T)
    Ensure the heap is sorted and all nodes are in it:
    H = Heapify(T)
end for
Finished, return the best sequence:
s = Pop(H)
Return ToSequence(s).
Algorithm 5.5.3 Increment: Append glyph to a trie node
Input: reference to trie node s, glyph g and sequence probability p.
Trie node s is a 4-tuple of its own glyph g′, child nodes C, last probability plast and
current probability pcur:
s ≡ (g′, C, plast, pcur)
Is the new glyph the own glyph?
if g′ ≡ g then
    Deduplicate identical adjacent glyphs in this case:
    pcur = pcur + p
else
    Actually increment by one glyph:
    c = InsertChild(C, g)
    Increment(c, g, p)
end if

Algorithm 5.5.4 FlipProbs: Flip last and current probabilities in the trie
Input: trie T.
Iterate all nodes per reference s:
for s ∈ T do
    Trie node s is a 4-tuple of its own glyph g′, child nodes C, last probability plast and
    current probability pcur:
    s ≡ (g′, C, plast, pcur)
    plast = pcur
    pcur = 0
end for
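Algorithms 5.5.2 to 5.5.4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the implementation used in this work: the names Node, decode_line, increment, flip_probs and to_sequence are hypothetical, the accumulated probabilities a are assumed to be a list holding one glyph-to-probability dictionary per pixel column, the empty string stands in for the glyph separator ϵg, and the heap is replaced by a plain sort. ToSequence is not specified above; the sketch assumes that it reads the glyphs along the path from the root and drops ϵg.

```python
from dataclasses import dataclass, field
from typing import Optional

SEP = ""  # assumption: the empty string stands in for the glyph separator


@dataclass(eq=False)
class Node:
    """Trie node: own glyph, children, last and current probability."""
    glyph: str
    parent: Optional["Node"] = None
    children: dict = field(default_factory=dict)
    p_last: float = 0.0
    p_cur: float = 0.0


def increment(node, glyph, p, nodes):
    """Algorithm 5.5.3: append a glyph or fold a repeated glyph."""
    if node.glyph == glyph:
        node.p_cur += p  # identical adjacent glyph: stay on this node
        return
    child = node.children.get(glyph)
    if child is None:  # previously unknown sequence: create a new leaf
        child = Node(glyph, parent=node)
        node.children[glyph] = child
        nodes.append(child)
    child.p_cur += p  # known sequence: accumulate the likelihood


def flip_probs(nodes):
    """Algorithm 5.5.4: current becomes last, current is reset to zero."""
    for node in nodes:
        node.p_last, node.p_cur = node.p_cur, 0.0


def to_sequence(node):
    """Assumed ToSequence: glyphs on the path from the root, without SEP."""
    glyphs = []
    while node is not None:
        if node.glyph != SEP:
            glyphs.append(node.glyph)
        node = node.parent
    return "".join(reversed(glyphs))


def decode_line(a, beam_width):
    """Algorithm 5.5.2: a is a list of {glyph: probability} per column."""
    root = Node(SEP, p_last=1.0)
    nodes = [root]
    for column in a:
        # Sorting replaces the heap: follow the top-w prefixes by p_last.
        best = sorted(nodes, key=lambda n: n.p_last, reverse=True)[:beam_width]
        for s in best:
            for glyph, prob in column.items():
                p = s.p_last * prob
                if p > 0:
                    increment(s, glyph, p, nodes)
        flip_probs(nodes)
    return to_sequence(max(nodes, key=lambda n: n.p_last))
```

For three pixel columns that favor ‘A’, ‘A’ and ‘B’, the repeated ‘A’ is folded into a single character and the call decode_line(columns, beam_width=3) yields ‘AB’.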
Chapter 6
Multi-Dimensional Connectionist
Classification (MDCC)
Figure 6.0.1: Part of the pipeline discussed in this chapter. Left is the input, middle the estimated
probabilities and right the decoded text.
As with Chapter 5, this chapter is based on the following two publications on
multi-dimensional connectionist classification:
Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Dissecting Multi-Line
Handwriting for Multi-Dimensional Connectionist Classification.” In: 2019 15th IAPR
International Conference on Document Analysis and Recognition (ICDAR). Sept. 2019.
DOI: 10.1109/ICDAR.2019.00015
Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Multi-Dimensional
Connectionist Classification: Reading Text in One Step.” In: 2018 13th IAPR
International Workshop on Document Analysis Systems (DAS). Apr. 2018, pp. 405–410.
DOI: 10.1109/DAS.2018.36
Section 1.3 discusses the individual contributions to these publications.
6.1 Overview
As discussed in Section 5.1, the full offline handwriting system proposed in this work can be
split into two parts. One is a multi-line decoding algorithm as detailed in
Section 5.1 that uncovers a high-likelihood string given a probabilistic soft-assignment of
pixels to glyphs from an alphabet. This covers the latter part of the pipeline shown in Fig-
ure 6.0.1. The other part of this pipeline is predicting the probabilistic soft-assignment,
given an image of handwritten multi-line text, in the first place. This prediction is gen-
erated by a deep neural network in this work and training this DNN is the topic of this
chapter. The left-side part of Figure 6.0.1 visualizes this part of the pipeline.
The training system that we will discuss in this chapter is, as is the decoding algorithm
presented before, suitable for multi-line text in general and not just handwritten text.
A large part of this chapter will be a discussion of the ideas and function of the training
algorithm proposed in this work. Still, a few words about the deep neural networks used
for multi-line offline handwriting recognition are in order. Figure 6.0.1 shows the overall
pipeline for transcribing multi-line text as proposed in this work. Its function is to predict,
from an image of multi-line text, a computer-processable string that contains the text shown
in the image.
Processing image data with deep neural networks is a well-studied problem and can
be tackled by both convolutional neural networks[78], e.g. for ImageNet classification[73],
and recurrent neural networks[55], e.g. for offline handwriting recognition[45]. Both these
topologies of DNNs are detailed and discussed in Section 2.3. We can thus build on a
large corpus of knowledge regarding the processing of image data in deep neural net-
works.
Deep neural networks are typically optimized using the combination of the backpropagation
algorithm[112, 113] and gradient descent [11, 65, 107], see Section 2.3.
Gradient descent requires that the model, in our case a deep neural network, is differ-
entiable regarding its parameters at (theoretically) all points. In practice, a model that is
differentiable at most of the points and provides heuristics for non-differentiable points is
still suitable when applying gradient descent. However, the output of the overall pipeline
in Figure 6.0.1 is a string, which is a sequence of discrete symbols. Target functions for
optimizing models that directly predict discrete symbols tend to be piecewise constant in
large parts, providing no gradient at all, and discontinuous in the remaining parts. One
example of such an optimization target would be to count the number of wrong discrete
symbols in the sequence and to minimize this number to zero. These properties, being
piecewise constant at most points and discontinuous at the remaining points, disqualify such
problem statements for optimization using gradient descent.
The way this work, and many other works in fact, address this problem is by for-
mulating it in a probabilistic framework by not directly predicting discrete symbols, but
instead predicting probabilities for the occurrences of those. Basically the problem is re-
formulated as a probabilistic multi-class classification problem. Assuming prediction y for
observation x is generated by a deep neural network with parameters W according to
y = argmax(DNN(x,W)) (6.1.1)
this is a change to
y = Softmax(DNN(x,W)) (6.1.2)
with y now being a vector of class probabilities instead of a specific class. The softmax
function[13] is discussed in Section 2.3.
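The difference between Equations 6.1.1 and 6.1.2 can be illustrated with a small Python sketch. It is an illustration only: the score vectors stand in for the output of DNN(x, W) and are made up, and the function names are hypothetical. A small change of the scores leaves the argmax prediction untouched, which is why a loss on the discrete prediction is piecewise constant, whereas the softmax probabilities respond smoothly.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(scores):
    """Index of the largest score, i.e. the hard class decision."""
    return max(range(len(scores)), key=lambda k: scores[k])


# Made-up raw scores for three classes and a slightly nudged copy.
scores = [2.0, 1.0, 0.5]
nudged = [2.0, 1.1, 0.5]

# The hard decision does not change, so it carries no gradient signal.
same_class = argmax(scores) == argmax(nudged)

# The probability of class 1 increases smoothly with its score.
prob_shift = softmax(nudged)[1] - softmax(scores)[1]
```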
Application of a multi-line decoder function as proposed in Chapter 5 to the soft-
assignment estimated by the deep neural network allows the formulation of the multi-line
offline handwriting recognition problem in exactly such a probabilistic framework. The
soft-assignment as predicted by the deep neural network, see the middle part of Figure
6.0.1, now gives probabilities for pixels being part of discrete glyphs. This is in contrast
to a hard assignment of a pixel to one specific glyph. In turn, the proposed multi-line
decoding then produces a high-likelihood glyph sequence from this soft-assignment.
This formulation with the deep neural network estimating a soft-assignment from pix-
els to glyphs from the alphabet effectively transforms this into a semantic segmentation
or image segmentation task. Image segmentation is a task that can be solved using deep
neural networks, see e.g. U-Nets[108]. In case of this thesis, the image segmentation
task is a supervised learning task. This means we have a data set for training the deep
neural network and this data set contains input images of multi-line text together with the
matching transcribed text. The problem is that the data set contains the correctly tran-
scribed text only, but no spatial information about the text. It does not contain information
about the position, size, orientation or shape of the characters in the text. Since the deep
neural network should estimate this information, we need to infer this missing spatial in-
formation during training. This procedure of inferring the missing information is called,
similar to connectionist temporal classification, an alignment of the truth text. This topic
will be the main task during training of the deep neural network and of this chapter.
In this chapter we will first discuss the structure of multi-line text in general, which is
a necessary precondition for further discussions on inferring the missing spatial informa-
tion within the training data. We will again use the IAM offline handwriting database[88]
for examples in this chapter. This is followed by discussions of the alignment problem of
multi-line text in general. The latter part of this chapter details the solution proposed
in this work: using conditional random fields[77][93, ch. 19.6], see Section 2.2, for inference
of the alignment. This will allow us to implement and train a deep neural network
that fulfills the role needed for the proposed pipeline, see Figure 6.0.1, and thus allows
us to set up the overall multi-line text recognition system as proposed in this thesis.
A practical implementation, application and experimental evaluation of both this chapter
and Chapter 5 are detailed in Chapter 7.
6.2 Structure of Paragraphs
Patterns of Multi-Line Text
A problem that is stated in the opening of this chapter is that for supervised training of
the deep neural network in this work, the training data only contains the input image and
corresponding true text, but no spatial information about the text. This missing spatial
information needs to be inferred in order to treat this as an image segmentation task
and to optimize the deep neural network accordingly. In addition, Sections 5.2 and 5.3
discussed the structure of multi-line text in the context of multi-line decoding, applying
an indicator function α but not detailing it. Both of these problems will be addressed in
this section. We need to keep both applications, alignment during training and decoding
during transcription, in mind when identifying and deriving the rules for the structure of
paragraphs since both alignment and decoding need to be symmetrical in this sense.
This requires us to keep the discussions of Section 5.2 in mind.
Similar to Chapter 5, we will use the term glyph to refer to one element of the alphabet
A at hand. A character denotes a specific instance of a glyph within a text or sequence
and multiple characters within a text can be of the same glyph. The line separator glyph
ϵl indicates that the pixels above and below it belong to two different text lines.
The glyph separator ϵg indicates that the pixels to its left and right belong to two different
characters, even if they are of the same glyph.
We will start by discussing the shapes that multi-line text forms when it is written on a
piece of paper. In computer vision terms, we will discuss the patterns of multi-line text in
pixel space. This is a necessary step to inferring the missing spatial information since we
know from the annotated data and the writing system at hand what the general geometric
relations between characters are. In Latin writing systems, characters in the same line
are ordered from left to right in pixel space and text lines are ordered from top to bottom.
Missing is the translation of this writing system to the pixel space. Identifying the patterns
that occur in pixel space will allow us to deduce rules for describing and inferring these
patterns.
Let us have a look at the patterns that the transitions between two text lines produce.
Text lines in this work are assumed to be roughly horizontal or slanted and curved up to
45 degrees. Further curvature will lead to decoding errors since the decoding algorithm
proposed in Chapter 5 requires each text line to be exactly one interval per column in pixel
Pattern A            Pattern B            Pattern C
L1 L1 L1 L1 L1 L1    L1 L1 L1 εl εl εl    L1 L1 εl εl L1 L1
εl εl εl εl εl εl    L1 L1 εl L2 L2 L2    L1 εl L2 L2 εl L1
L2 L2 L2 L2 L2 L2    εl εl L2 L2 L2 L2    εl L2 L2 L2 L2 εl

                     εl εl L1 L1 L1 L1    εl L1 L1 L1 L1 εl
                     L2 L2 εl L1 L1 L1    L2 εl L1 L1 εl L2
                     L2 L2 L2 εl εl εl    L2 L2 εl εl L2 L2
Figure 6.2.1: Patterns in pixel space for transitioning between two text lines. ϵl denotes the line
separator.
space. The valid patterns for line transitions resulting from these restrictions are shown
in Figure 6.2.1 with each block being one possible pattern. These example patterns
are within a limited pixel space and can be increased in size, e.g. for creating longer
diagonals. Extending the size of and combining these patterns will then produce the
shapes of text lines that this work covers. Pattern A describes the default line transition,
which is perfectly horizontal. Pattern B is a line transition over the diagonal. This may
occur either because the text line is actually slanted or because a glyph above or below
the line separator is extending down- or upwards. Pattern C is the shape of a curved line
transition. Again this may be because the line itself is actually curved or because of the
shape of the glyphs directly above and below.
Pattern A            Pattern B            Pattern C
C1 C1 C1 C2 C2 C2    C1 C1 C2 C2 C2 C2    C1 C1 C2 C2 C2 C2
C1 C1 C1 C2 C2 C2    C1 C1 C1 C2 C2 C2    C1 C1 C1 C2 C2 C2
C1 C1 C1 C2 C2 C2    C1 C1 C2 C2 C2 C2    C1 C1 C1 C1 C2 C2

                     C1 C1 C1 C1 C2 C2    C1 C1 C1 C1 C2 C2
                     C1 C1 C1 C2 C2 C2    C1 C1 C1 C2 C2 C2
                     C1 C1 C1 C1 C2 C2    C1 C1 C2 C2 C2 C2
Figure 6.2.2: Patterns in pixel space for transitioning between two characters. One of the two
characters may be of the glyph separator ϵg.
Similar to Figure 6.2.1 for lines, Figure 6.2.2 shows the patterns for transitions
between characters within the same text line. Pattern A shows the default transition of
a straight vertical border between the two adjacent characters. This is also the optimal
case for the decoding algorithm of Chapter 5, since this algorithm dynamically collapses
each text line to a one-dimensional sequence by summation of the probabilities in each
pixel column. A perfectly vertical transition between two characters thus introduces the
least ambiguity after collapsing. Pattern B shows transitions between two characters
with ragged borders. Pattern C shows transitions over the diagonal. These patterns of
transitions between characters occur because their glyphs have specific shapes and thus
intersections between each other. As with line separators the patterns can be combined
and repeated to produce the overall patterns of placing characters within a text line.
Indicator Function α
Armed with this knowledge, we can now derive abstract rules that describe these patterns.
These rules in turn will allow us to derive the indicator function α that we have used
before, or even to generate these patterns, which will be used for computing the alignment
during training of the deep neural network.
This is necessary since ‘computing the alignment’ only means, as we will discuss
in Section 6.3, to marginalize over all possible configurations for placing a text in a
pixel space. The term configuration here carries the same meaning as it does in the
context of graphical models, see Section 2.2, namely a configuration is exactly one hard
assignment of labels to nodes of the graphical model. In our case, a configuration is one
way of assigning a character to every pixel. The character may be a different one per
pixel, but each pixel must be assigned exactly one character. We say a configuration is
valid if it always follows all the rules for the patterns of text in pixel space discussed in this
section. Marginalization over all valid configurations for placing the text at hand in pixel
space then yields the alignment of this text. We will discuss this in-depth in Section 6.3.
These rules for describing the valid configurations establish the connection between
the text and pixels and thus we need to be clear on how to operate in both. The term
label space refers to the topological space that contains the text at hand, e.g. the true
label text during supervised training. Movements to the ‘left’ or ‘right’ in this space refer
to the character before or after respectively, within the same text line. Directions ‘up’ and
‘down’ indicate the text line before or after the current one. Please note that, similar to the
structure in Chapter 5, each visible text line is separated from its neighbors above and
below by a line separator ϵl. Figure 6.2.3 shows the label space for the two-line sequence
‘CAT DOG’.
C A T
εl
D O G
Figure 6.2.3: Label space for the two-line sequence ‘CAT DOG’.
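The label space of Figure 6.2.3 can be represented as a list of rows, as in the following minimal Python sketch. The function name build_label_space and the list-of-rows representation are assumptions made for illustration. The sketch places a line separator only between text lines, as drawn in Figure 6.2.3; whether additional separators are needed above the first and below the last text line is not shown there and is left out here.

```python
LINE_SEP = "εl"  # stand-in symbol for the line separator


def build_label_space(text_lines):
    """Return the label space as rows: text lines split by separator rows."""
    rows = []
    for index, line in enumerate(text_lines):
        if index > 0:
            rows.append([LINE_SEP])  # one separator between two text lines
        rows.append(list(line))      # one node per character of the line
    return rows


label_space = build_label_space(["CAT", "DOG"])
# 'left'/'right' move within a row, 'up'/'down' move between rows.
```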
We will use the term pixel space as done before in this thesis, which is to refer to
the grid structure of pixels in an image. In this work we will use a pixel space of 8-
neighborhoods, that is each pixel is seen as connected to 8 neighbors, including 4 over
the geometric diagonal. Figure 6.2.4 shows part of the pixel space around a specific
pixel.
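As a small illustration of the 8-neighborhood, the following Python sketch lists the neighbors of a pixel (i, j) in a grid of given height and width. The function name and the zero-based (row, column) indexing are assumptions made for this example.

```python
def neighbors8(i, j, height, width):
    """Return the pixels connected to (i, j) in an 8-neighborhood."""
    result = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue  # a pixel is not its own neighbor
            ni, nj = i + di, j + dj
            if 0 <= ni < height and 0 <= nj < width:
                result.append((ni, nj))  # keep only pixels inside the grid
    return result
```

An inner pixel has eight neighbors, four of them over the diagonal, while a corner pixel has only three.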
We now have the tools necessary to derive the rules that we will use in this thesis to
describe the patterns of multi-line text in images. That is, these rules create the relation
  ···        ···         ···         ···        ···
  ···     i-1, j-1     i-1, j     i-1, j+1     ···
  ···      i, j-1       i, j       i, j+1      ···
  ···     i+1, j-1     i+1, j     i+1, j+1     ···
  ···        ···         ···         ···        ···
Figure 6.2.4: Pixel space around a pixel (i, j). Solid lines indicate direct neighbors, dashed lines
are neighborhoods of the other pixels. Outer dots indicate the extension of the pixel
space in all directions.
between the label space as given by a text and the pixel space as given by e.g. the output
of a deep neural network. In the following paragraphs we will first discuss the rules
governing the line separator ϵl, followed by a discussion of the rules for the remaining
glyphs.
Constructing the label space and pixel space for alignment in this work assumes that
the text on the input image, and thus encoded in the pixel space by the deep neural
network, and the truth label string match. That is, e.g. spelling mistakes or missing
characters in a handwritten paragraph are reflected in the truth label string and thus in the
constructed label space. This also assumes that the full paragraph is visible in the input
image and, if it is not, that the truth label string also contains only the visible part.
The following figures and paragraphs will focus on a specific character within the label
space and then show its direct neighbors in both label and pixel space. In order to derive
the overall rule for mapping a label space to pixel space and testing if a configuration
is valid or not, the following rules will have to be repeated for all characters in the label
space and all pixels in the pixel space. There is one exception to this: combinations of
labels and pixels are valid only whenever there is enough space left before and after to
map the remaining characters from the label space. For example, the character ‘B’ of the
label ‘ABC’ cannot be mapped to any of the left- or right-most pixels since there would be no
pixel left in which to place the character ‘A’ or ‘C’, respectively. Violating this automatically
leads to invalid patterns and thus configurations.
Figure 6.2.5 visualizes the rules for connecting the label space around a line sepa-
rator ϵl to the pixel space. These are directly derived from the patterns on lines that we
have observed in Figure 6.2.1. Each node in Figure 6.2.5 denotes one character in the
label space and the edges give the relationship in label space, e.g. rightwards for the next
character within the same line. The markers ‘R’ (Right), ‘DR’ (Down-Right), ‘D’ (Down)
and ‘DL’ (Down-Left) on the edges refer to the corresponding relationship in pixel space. We
only show four directions of movement in the pixel space in Figures 6.2.5 and 6.2.6 since
the other four are defined from the viewpoint of the neighbors to the left to top-right direc-
[Diagram: node ‘Cur. εl’ connected to several nodes ‘Char. of next Line’. Edge labels as
extracted: ‘R, DR, DL’; ‘R, DR, D, DL’.]
Figure 6.2.5: Rules for transitioning from one line separator ϵl to its neighbors. Arrows indicate
relations in label space. ‘R’, ‘DR’, ‘D’ and ‘DL’ indicate ‘Right’, ‘Down-Right’, ‘Down’
and ‘Down-Left’ directions in pixel space. The dotted transitions follow the same
rules as the middle transition. The same rules need to be repeated for all line
separators ϵl in the text.
tions in pixel space. The remaining four directions (left, up-left, up and up-right) will later
be constructed within the indicator function by transforming the edges from directed to
undirected ones. For now the directed transitions should be seen as a simplification that
will be dropped by the indicator function in favor of symmetric relations. Self-cycles are
allowed in these rules to accommodate the fact that there are at most as many characters
in the label space as pixels in pixel space and each pixel needs to be assigned
one character. Undefined pixel assignments are not allowed.
It is notable here that Figure 6.2.1 contains a relationship between neighbors that is
not modeled in the rule set of Figure 6.2.5: in actual examples, two adjacent text lines
may be neighbors over the diagonal, effectively skipping the line separator ϵl in between.
However, the line separator is not really skipped since the decoding algorithm presented
in Chapter 5 identifies individual text lines by testing for the line separator ϵl over the
vertical in pixel space, not the horizontal or diagonal. The correct order and neighborhood
relations of Figure 6.2.1 are still preserved in Figure 6.2.5. Omitting this neighborhood
between two text lines over the diagonal serves a computational purpose: all other
described rules are either 1:1 (neighbors within the same text line) or 1:n (transition
from ϵl to characters of adjacent text lines) relations, but the missing rule would be an
n:m relation. Including it would increase the computational runtime of the algorithms built
on this rule set, so omitting it yields a practical benefit without drawback.
Figure 6.2.6 illustrates the rules for connecting the label space around non-line-separator
glyphs to the pixel space. The notation is identical to the one used in Figure 6.2.5. It is directly
derived from the character patterns of Figure 6.2.2 for transitions within a text line and
Figure 6.2.1 for those between text lines.
The rules shown in both Figures 6.2.5 and 6.2.6 illustrate only four of the eight direc-
tions in pixel space as seen from a character. This is because in order to apply these
rules to a specific label and pixel space, they need to be repeated for each and every
character anyway. This leads to a full rule set for the specific instances of label and pixel
space, thus including all neighbors in all eight directions. The missing relations in pixel
space are then observed by reversing the direction, e.g. ‘Down-Right’ becomes ‘Up-Left’.
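The reversal of directions described above can be sketched in a few lines of Python. The offset table, the (row, column) convention with rows growing downwards and the names of the reversed directions are assumptions made for this illustration.

```python
# The four directions used in Figures 6.2.5 and 6.2.6 as (row, column) offsets.
OFFSETS = {"R": (0, 1), "DR": (1, 1), "D": (1, 0), "DL": (1, -1)}

# Hypothetical names for the reversed directions.
REVERSED_NAME = {"R": "L", "DR": "UL", "D": "U", "DL": "UR"}


def all_eight_directions():
    """Extend the four given directions by their reversed counterparts."""
    full = dict(OFFSETS)
    for name, (di, dj) in OFFSETS.items():
        full[REVERSED_NAME[name]] = (-di, -dj)  # e.g. 'DR' becomes 'UL'
    return full


directions = all_eight_directions()
```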
The missing indicator function α of Chapter 5 can now be implemented according to
the following overall rule: a configuration is valid if and only if it places every character
[Diagram: central node ‘Cur. Char. ≠ εl’ connected to the nodes ‘Prev. εl’, ‘Prev. Char.’,
‘Next Char.’ and ‘Next εl’. Edge labels as extracted: ‘R’; ‘R, DR, D, DL’; ‘D, DL’;
‘R, DR, D’; ‘R, DR, D, DL’.]
Figure 6.2.6: Rules for transitioning from one character to its neighbors. Arrows indicate relations
in label space. ‘R’, ‘DR’, ‘D’ and ‘DL’ indicate ‘Right’, ‘Down-Right’, ‘Down’ and
‘Down-Left’ directions in pixel space. The same rules need to be repeated for all
characters, excluding the line separator ϵl, in the text.
from label space at least once, if every pixel is assigned exactly one character and if
the previous rules from Figures 6.2.5 and 6.2.6 for neighborhoods are respected without
contradiction. We will apply the same indicator function throughout the remainder of this chapter.
Formalizing the Indicator Function α
We will now formalize the definition of the indicator function α as given above. We refer
to this indicator function as αs,t(u, v, l) with s and t being pixels in the pixel space as
defined in Figure 6.2.4. In the same way, u and v refer to positions within the label
space of Figure 6.2.3. Variable l is the truth label string itself, necessary to construct
the label space correctly. The indicator function αs,t(u, v, l) assumes a value of 1 if the
neighborhood of u and v in pixels s and t is valid according to label l, else it is 0. The
following paragraphs give the formal definition of α.
The graph tensor product [154] provides the mechanism for formalizing allowed neigh-
borhoods. The nodes of the product graph are the Cartesian product of the nodes of the
pixel space, see Figure 6.2.4, and the nodes of the label space of Figure 6.2.3. A node
(s, u) of the product graph exists iff all the following statements are true:
• s is a node of the pixel space.
• u is a node of the label space.
• There are at least as many pixels to the left of s as there are characters before u in the
same text line.
• The same holds for the pixels to the right of s and the characters after u.
• There are at least as many pixels above s as the sum of the number of text lines and line
separators ϵl before the text line of which u is a part.
• The same holds for the pixels below s and the text lines and line separators after u.
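The conditions listed above can be checked with a short Python sketch. It is an illustration under assumptions: the function name node_exists is hypothetical, pixels s = (i, j) and label positions u = (row, col) are zero-based, and the label space is given as a list of rows in which line separators occupy their own rows, so that the number of rows before u equals the number of text lines and line separators before it.

```python
def node_exists(s, u, grid_shape, label_space):
    """Check whether the pixel-label combination (s, u) is a product node."""
    i, j = s
    height, width = grid_shape
    row, col = u
    # s must be a pixel of the grid and u a position of the label space.
    if not (0 <= i < height and 0 <= j < width):
        return False
    if not (0 <= row < len(label_space) and 0 <= col < len(label_space[row])):
        return False
    chars_before = col
    chars_after = len(label_space[row]) - col - 1
    rows_before = row
    rows_after = len(label_space) - row - 1
    # Enough pixels must remain on every side for the remaining labels.
    return (
        j >= chars_before
        and width - j - 1 >= chars_after
        and i >= rows_before
        and height - i - 1 >= rows_after
    )
```

For the label ‘ABC’ on a single row of five pixels, the character ‘B’ is rejected at the left-most and right-most pixel and accepted in between.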
The edges of the product graph are defined according to the graph tensor product. An
undirected edge (s, u) ∼ (t, v) exists in the product graph if s and t are neighbors in
the pixel space in some direction D (one of ‘Right’, ‘Down-Right’, ‘Down’ and ‘Down-Left’)
and u and v are neighbors in the label space in the same direction D according to the rules
of either Figure 6.2.5 or Figure 6.2.6. This formulation necessitates ordering the pixel-label
combinations (s, u) and (t, v) such that pixel s is always to the top, left, top-left or
top-right of pixel t.
The indicator function αs,t(u, v, l) has a value of 1 iff both (s, u) and (t, v) are nodes of
the product graph and there exists an edge (s, u) ∼ (t, v) or (t, v) ∼ (s, u) in the product
graph. αs,t(u, v, l) has a value of 0 in all other cases.
Please notice that the product graph is an undirected graph in which an edge (s, u) ∼
(t, v) defines a neighborhood relation, but no direction. This is in contrast to the directed
edges of the neighborhood rules specified above and exemplified by Figures 6.2.5 and
6.2.6. This approach simplifies the construction of the pixel and label spaces since only
four instead of eight neighborhood relations need to be encoded. The final product graph
still encodes the required spatial information since the indices s and t still address specific
pixels, while u and v address label positions accordingly. This means it is not possible to
e.g. flip a text line in pixel space and still have a configuration that is valid according to
this indicator function.
An Example Configuration
C C C C A A T T T T T T T
C C C C A A T T T T T T T
C C C A A A A T T T T T T
C C C A A A A A T T T εl εl
C C C C εl εl A A T T εl G G
εl εl εl εl O O εl εl εl εl G G G
D D O O O O O O O G G G G
D D D O O O O O G G G G G
D D D D O O O O G G G G G
D D D D O O O O G G G G G
D D D D O O O O O G G G G
D D D O O O O O G G G G G
D D O O O O O G G G G G G
Figure 6.2.7: One example configuration for placing the two-line sequence ‘CAT DOG’ in a pixel
grid. There are many more possible configurations for the same sequence and pixel
grid.
Figure 6.2.7 shows one possible configuration of placing the text from the label space
of Figure 6.2.3 in a pixel space like the one in Figure 6.2.4, but of 13 by 13 pixels in
size. We can see that this is only one of many valid configurations. Other valid configura-
tions can be generated by either iteratively modifying one root configuration pixel-by-pixel
while respecting the above rule set at each change, or enumerating all configurations and
testing for their validity afterwards. The question of how to work with the many different
configurations for one text and pixel space will be part of the topic of the remainder of this
chapter.
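The second strategy, enumerating all configurations and testing their validity afterwards, can be sketched for a deliberately reduced case: a single text line placed in a single row of pixels. In this one-dimensional setting only the ‘Right’ relation and self-cycles remain, so a configuration is valid if it starts with the first character, ends with the last one and never skips or reorders a character. The function names are hypothetical and the sketch does not cover the full two-dimensional rule set of Figures 6.2.5 and 6.2.6.

```python
from itertools import product
from math import comb


def is_valid_1d(config, n_chars):
    """Validity test for one row: ordered, gap-free, every character used."""
    if config[0] != 0 or config[-1] != n_chars - 1:
        return False
    # Each step either repeats the character (self-cycle) or moves right.
    return all(b - a in (0, 1) for a, b in zip(config, config[1:]))


def enumerate_valid_1d(label, width):
    """Brute force: enumerate all hard assignments, keep the valid ones."""
    n_chars = len(label)
    return [
        "".join(label[k] for k in config)
        for config in product(range(n_chars), repeat=width)
        if is_valid_1d(config, n_chars)
    ]


configs = enumerate_valid_1d("CAT", 5)
```

Out of 3^5 = 243 hard assignments only comb(4, 2) = 6 are valid in this example, which hints at why exhaustive enumeration does not scale to full images.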
On 8- versus 4-Neighborhoods
The paragraphs before discussed the patterns that occur when placing multi-line text in
a pixel space, e.g. by writing it on paper, and the abstract rules for these patterns. The
rules represent the connection between label and pixel space. The pixel space in use is
designed with 8- instead of 4-neighborhoods between pixels. That is, it includes the four
diagonal neighbors and not only the four over the vertical and horizontal lines.
A  εl
εl B  εl
   εl A  εl
      εl B
Figure 6.2.8: Invalid configuration of the sequence ‘AB’ over the diagonal when using 4- instead
of 8-neighborhood relations. Rule sets based on 4-neighborhoods (without diagonal)
cannot model the correct order of characters in a ‘staircase’ pattern without
increase in model complexity.
This was a deliberate choice in order to address a specific problem during alignment.
If a text line is placed at a 45 degree angle, then its borders to the neighboring line
separators ϵl form a ‘staircase pattern’. If in addition the text line is placed with exactly one
pixel in height, then the pixels assigned to this text line never touch over the vertical or
horizontal neighborhoods. In a pixel space with 4-neighborhoods, the characters of the
text line thus never touch in pixel space. This is illustrated in Figure 6.2.8. Line separators
ϵl only carry the meaning that the pixels above and below belong to different text lines,
but do not indicate which specific characters they neighbor. This means
that if the correct order of characters is not enforced within the text line, which is not possible
in this case, then these invalid configurations can incorrectly be recognized as valid. This
is shown in Figure 6.2.8 where the sequence ‘AB’ is placed in a diagonal as ‘ABAB’.
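The geometric core of this problem can be checked with a short Python sketch. The function name and the (row, column) pixel coordinates are assumptions made for illustration. The pixels of a one-pixel-high text line on a 45 degree diagonal, as in Figure 6.2.8, never touch under 4-neighborhoods but always touch under 8-neighborhoods.

```python
def are_neighbors(p, q, connectivity):
    """Test whether pixels p and q touch under 4- or 8-connectivity."""
    di, dj = abs(p[0] - q[0]), abs(p[1] - q[1])
    if connectivity == 4:
        return di + dj == 1      # vertical or horizontal only
    if connectivity == 8:
        return max(di, dj) == 1  # diagonals included
    raise ValueError("connectivity must be 4 or 8")


# Pixels carrying 'A', 'B', 'A', 'B' on the diagonal of Figure 6.2.8.
staircase = [(0, 0), (1, 1), (2, 2), (3, 3)]
pairs = list(zip(staircase, staircase[1:]))

touch4 = [are_neighbors(p, q, 4) for p, q in pairs]
touch8 = [are_neighbors(p, q, 8) for p, q in pairs]
```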
One solution to this problem would be to encode the closest characters to each line
separator ϵl. This approach modifies the label space, e.g. of Figure 6.2.3, in such a way
that there are multiple line separators in place of one, each one encoding one possible
placement in pixel space. For example there would be a line separator ‘ϵl nearest to
C of CAT’, ‘ϵl nearest to A of CAT’, ‘ϵl nearest to D of DOG’ and so on. While this
is a remedy to the problem of incorrect alignment over the diagonal, it also increases
the model complexity. In the model with 8-neighborhoods each character of a text line
has a constant number of possible neighbors and as such is a 1:c relation with c being
constant. There is also exactly one line separator ϵl between any two text lines. These
line separators have 1:n relations in the described rule set with n being the summed
number of characters in the text lines directly above and below. In the model with 4-
neighborhoods with extended info added to the line separators, there is now one line
separator character per character in the text line above and one per character in the text
line below, all standing in relations to each other. This constitutes an n:m relation in the
rule set with n being the number of characters in the text line above and m in the text line
below. In total, this leads to an increase in the model complexity by a factor of the
number of characters in the label string, which should be avoided if possible.
The method proposed in this thesis addresses this problem by using 8- instead of
4-neighborhoods in pixel space. In this case, characters within the same text line can
be neighbors in pixel space even if they are placed in a diagonal. This in turn allows us to
correctly identify such configurations as shown in Figure 6.2.8 as invalid since they violate
the rules for the indicator function as described above. The rule set is still increased in
complexity since the characters now have more neighbors, but this is only an increase
by a constant factor of two and not an increase in complexity linear to the number of
characters in the label string.
6.3 Basics of Multi-Line Training
Idea and Definitions
The core goal of this chapter about multi-dimensional connectionist classification is to
derive a training algorithm that can be applied to a given deep neural network and training
data set in order to maximize the likelihood that the DNN predictions decode to the correct
label strings if the decoding algorithm of Chapter 5 is applied. We will discuss multiple
approaches to such a training function in the following sections, but first we need to define
the common basics for all these approaches.
Symbol W is the parameter set of a deep neural network that is suitable for the
functionality as discussed in Section 6.1 and specifically for the pipeline of Figure 6.0.1.
That is, the DNN should take an image of multi-line text as input and estimate a
soft-assignment, that is, a probabilistic assignment, between pixels and characters from the
alphabet in such a way that the decoding algorithm of Chapter 5 will produce the correct
label string. The correct label string is defined by the training data set S, which consists of
2-tuples (x, l) ∈ S with x being the input image of multi-line text and l being the true label
sequence for the text as seen in the input image. The symbol y = DNN(x,W) defines
the soft-assignment between pixels in x and glyphs from alphabet A in the same way as
used beforehand in this thesis. The goal of the training of the DNN and optimization of its
parameter set W is to maximize the likelihood of Decoder(y) ≡ l.
At this point we need to discuss the difference in the configuration C between the
decoding and training algorithms. In decoding, each configuration C is a hard-assignment
between pixels and glyphs from the alphabet A in the form of $C \in A^{d_y}$ with $d_y$ being the
spatial resolution of the DNN prediction y. A big part of the training method proposed in
this thesis is to infer the missing information about the alignment of the truth label string l
over the DNN prediction y, which is necessary since the DNN predicts a soft-assignment
between pixels and glyphs and not directly a label string. As such, the network prediction
$y \in [0, 1]^{d_y \times |A|}$ is a probabilistic assignment where glyphs g ∈ A are exclusive per pixel s:
$$\sum_{g \in A} y^s_g = 1, \quad \forall s \in [1, d_y] \tag{6.3.1}$$
In contrast to this, the alignment z is a soft-assignment between characters - specific
instances of a glyph from alphabet A - of the truth label string l and pixels of the DNN
prediction y. Thus $z \in [0, 1]^{d_y \times |l|}$, and it follows that configurations C in training are
hard-assignments between characters of the label string and pixels. This is necessary since
multiple characters of the label string may refer to the same glyph in the alphabet. In the
same way as with the soft-assignment y, character assignments within the same pixel
are mutually exclusive to each other while pixels are independent of each other. This
differentiation between the soft-assignment estimated by the DNN and the alignment of
the label string leads to formulations such as $y^s_{l_{C^s}}$, where the lower index refers to a glyph
from the alphabet, which itself is defined by a position $C^s$ within the truth label string l.
The equivalent of this statement for the alignment would be $z^s_{C^s}$.
This described mechanism is a fundamental difference between the DNN prediction
y and the alignment z of the truth label string. To resolve this we define a marginalization
zΣ over characters that refer to the same glyph:
$$z^s_{\Sigma g} = \sum_{i=1}^{|l|} \beta(l_i, g) \times z^s_i, \quad \forall s \in [1, d_y] \tag{6.3.2}$$
with
$$\beta(l_i, g) = \begin{cases} 1 & \text{iff } l_i = g \\ 0 & \text{else} \end{cases} \tag{6.3.3}$$
as an indicator function that ensures marginalization over characters of the same glyph.
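The marginalization of Equations 6.3.2 and 6.3.3 can be sketched as a single matrix product. This is a minimal illustration, assuming that the pixel space is flattened to one axis of length d_y and that the label and alphabet are plain Python strings; the function name `marginalize_alignment` is hypothetical.

```python
import numpy as np

def marginalize_alignment(z, label, alphabet):
    """Sum character-wise alignments z (d_y x |l|) into glyph-wise z_sigma (d_y x |A|)."""
    # beta[i, g] = 1 iff character i of the label is glyph g (Equation 6.3.3).
    beta = np.array([[1.0 if ch == g else 0.0 for g in alphabet] for ch in label])
    # The matrix product carries out the sum over label positions i (Equation 6.3.2).
    return z @ beta

# Two pixels, label "aba": both 'a' characters are accumulated into the glyph 'a'.
z = np.array([[0.2, 0.5, 0.3],
              [0.0, 1.0, 0.0]])
z_sigma = marginalize_alignment(z, label="aba", alphabet="ab")
```

Since every row of z sums to one and every character maps to exactly one glyph, every row of z_sigma sums to one as well.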
As stated before, the spatial resolution of the input image x and soft-assignments
y, z and zΣ may differ because of resizing operations beforehand or subsampling and
padding effects in the deep neural network. However, from a theoretical viewpoint it is
sufficient to assume that the spatial resolution of the DNN input and output are identical.
Naive Algorithm
We can now derive the first, although impractical, training algorithm for optimizing the
parameter set W in such a way that the DNN, after decoding, likely predicts the correct
string as seen in the input image x. For this naive approach we directly employ the decoding
algorithm from Chapter 5 in the training. The likelihood
$$P(S|W) = \prod_{(x,l) \in S} P(\mathrm{Decoder}(\mathrm{DNN}(x, W)) \equiv l) \tag{6.3.4}$$
defines the likelihood of observing the training data S given the DNN model with param-
eter set W when directly applying the decoding algorithm. The optimal parameter set
$$W^\star = \operatorname*{argmax}_{W} P(S|W) \tag{6.3.5}$$
is in this case the one that maximizes this likelihood of observing the training data. Since
we are dealing with one-dimensional sequences in l, we can use the Edit-distance[81,
151] as a surrogate for the probability of decoding to the true label sequence l:
$$W^\star = \operatorname*{argmin}_{W} \sum_{(x,l) \in S} \mathrm{Edit}(\mathrm{Decoder}(\mathrm{DNN}(x, W)), l) \tag{6.3.6}$$
Unfortunately, this direct approach is not practical. Gradient descent[11, 65, 107] and
backpropagation[112, 113], see Section 2.3, cannot be applied since the decoding function
is not differentiable. The weight space from which the parameter set $W \in \mathbb{R}^{\star}$ is
drawn is high-dimensional and each dimension has infinitely many elements. This makes
exhaustive search, grid search or random search infeasible. Heuristic optimization methods
would be a possibility, but we will discuss a more direct approach in the remainder of
this chapter.
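The surrogate in Equation 6.3.6 can be illustrated with a short sketch of the Edit-distance. It assumes the standard Levenshtein distance with unit costs for insertion, deletion and substitution; the function name `edit_distance` is chosen for illustration only. The result is an integer that changes in jumps, which is one way to see why no useful gradient with respect to W is available.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b with unit costs."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]
```

A decoded string that matches the truth label yields a distance of zero, which is the minimum of the summand in Equation 6.3.6.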
6.4 Maximum Likelihood Training
Loss Function
We will now look at the maximum likelihood approach to training the deep neural network.
This leaves out the ‘middle man’ of the decoding function and directly maximizes the
likelihood P (l|y) as defined by Equation 5.3.2 of observing the true label string l. Again
we will use (x, l) ∈ S as our training data set consisting of tuples of the image input and
true label string. The parameters of the deep neural network to be trained are the set W.
We can then define the likelihood of observing the training data set S given our model parameters W:
$$P(S|W) = \prod_{(x,l) \in S} P(l|\mathrm{DNN}(x, W)) \tag{6.4.1}$$
This assumes that the examples in the training data set S are independent and identically
distributed (i.i.d.), which allows sampling of data for the training set without considering
dependencies between examples. This reduces the likelihood for the whole data set to
the product of the likelihoods of its individual examples.
The optimal parameter set
$$W^\star = \operatorname*{argmax}_{W} P(S|W) \tag{6.4.2}$$
is the one that maximizes the likelihood of observing the training data. Deep neural
networks are typically optimized using gradient descent for finding a minimum of the loss
function and thus we rewrite this in a log-likelihood formulation:
$$W^\star = \operatorname*{argmin}_{W} \left[ -\log P(S|W) \right] \tag{6.4.3}$$
Using a log-likelihood formulation is a practical choice for optimizing deep neural networks
using gradient descent towards a maximum likelihood solution. It improves
numerical stability by replacing products with summations, and this in turn facilitates efficient
batch training by simply accumulating the gradients of the examples within each batch.
This directly leads us to the loss function
$$\begin{aligned}
L &= -\log P(S|W) \\
  &= -\log \prod_{(x,l) \in S} P(l|\mathrm{DNN}(x, W)) \\
  &= -\sum_{(x,l) \in S} \log P(l|\mathrm{DNN}(x, W))
\end{aligned} \tag{6.4.4}$$
which is suitable for gradient-based batch optimization of the parameter set W of the
DNN.
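The numerical argument for the log-likelihood formulation can be seen in a small sketch. The per-example likelihood values below are made up for illustration, and the function name `negative_log_likelihood` is hypothetical; the point is only that the direct product of Equation 6.4.1 underflows in double precision while the sum of Equation 6.4.4 stays finite.

```python
import numpy as np

def negative_log_likelihood(example_likelihoods):
    """Loss of Equation 6.4.4: negative sum of the per-example log-likelihoods."""
    return -np.sum(np.log(example_likelihoods))

# Hypothetical per-example likelihoods P(l | DNN(x, W)) for 500 training examples.
likelihoods = np.full(500, 1e-3)

direct_product = np.prod(likelihoods)        # 1e-1500 is not representable in float64
loss = negative_log_likelihood(likelihoods)  # finite: 500 * log(1000)
```

Because the loss is a sum over examples, the gradient of a batch is the sum of the gradients of its examples, which is what makes batch accumulation straightforward.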
Before substituting P (l|DNN(x,W)) we need to define P (l|y) in a way that retrieves
the likelihood of observing the truth label string l given a soft-assignment y. We can
achieve this by treating valid configurations of the truth label string in the pixel space as in-
dependent events and thus define the likelihood for observing the truth label string as the
marginalization over all its configurations. Defining configuration C as a hard-assignment
between pixels and characters of the truth label string l then gives the likelihood of ob-
serving the correct string as the marginalization
$$P(l|y) = \sum_{C} \prod_{s=1}^{d_y} y^s_{l_{C^s}} \prod_{t \in \mathrm{nbr}(s)} \alpha_{s,t}(C^s, C^t, l) \tag{6.4.5}$$
over all possible configurations C. Applying the indicator function
$$\alpha_{s,t}(C^s, C^t, l) = \begin{cases} 1 & \text{iff } C^s, C^t \text{ are valid neighbors in } s, t \text{ according to } l \\ 0 & \text{else} \end{cases} \tag{6.4.6}$$
ensures that the marginalized probabilities are those of configurations C leading to the
observation of the correct label string l. See Section 6.3 for a discussion of the hard-
assigned configurations C and Section 6.2 for the formal definition of this indicator func-
tion.
Substituting Equation 6.4.5 in Equation 6.4.4 yields the following loss:
$$L = -\sum_{(x,l) \in S} \log \sum_{C} \prod_{s=1}^{d_y} \mathrm{DNN}(x, W)^s_{l_{C^s}} \prod_{t \in \mathrm{nbr}(s)} \alpha_{s,t}(C^s, C^t, l) \tag{6.4.7}$$
The calculation of the likelihood P (l|y) as given by Equation 6.4.5 is fully differentiable
and thus allows us to employ backpropagation and gradient descent for parameter
optimization. The derivative $\partial L / \partial W$ of Equation 6.4.7 gives the gradient necessary for this.
The drawback of this approach lies in its computational complexity. Equation
6.4.5, and thus the loss of Equation 6.4.7, requires the enumeration of all possible
configurations C for placing a label string l in the pixel space defined by x. As we will see next,
this number is intractably large.
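To make the cost of this enumeration concrete, the following minimal sketch evaluates Equation 6.4.5 by brute force for a single pixel row. The indicator `chain_alpha`, the boundary rule and all function names are simplifying assumptions for illustration and are not the rule set of Section 6.2. Each pair of neighboring pixels is checked once, from left to right.

```python
import itertools
import numpy as np

def chain_alpha(c_s, c_t):
    """Hypothetical 1-D indicator: stay on a character or advance to the next one."""
    return 1.0 if c_t in (c_s, c_s + 1) else 0.0

def brute_force_likelihood(y, label, alphabet):
    """Sum over all configurations C of one pixel row, in the spirit of Equation 6.4.5."""
    d_y, n = y.shape[0], len(label)
    glyph = [alphabet.index(ch) for ch in label]  # label position -> glyph index
    total = 0.0
    for C in itertools.product(range(n), repeat=d_y):  # n ** d_y configurations
        # Assumed boundary rule: the row starts with the first and ends with the last character.
        if C[0] != 0 or C[-1] != n - 1:
            continue
        p = np.prod([y[s, glyph[C[s]]] for s in range(d_y)])
        p *= np.prod([chain_alpha(C[s], C[s + 1]) for s in range(d_y - 1)])
        total += p
    return total
```

The loop visits n ** d_y configurations, so even this one-dimensional toy case becomes unusable for realistic image widths and label lengths.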
Number of Enumerated Configurations
Estimating the number of these configurations as valid ways of writing text is the topic of
the next few paragraphs. Let us assume a very simple indicator function α which only
accepts neighborhood relations as valid if text lines are always aligned horizontally with
a constant height per text line. In this case each text line is a perfect rectangle in pixel
space. Similarly, it accepts characters in pixel space only if they themselves are perfect
rectangles.
Let W be the width and H the height of the pixel space. Let L be the number of text
lines. Each pair of text lines is separated by a horizontal line of line separators ϵl of one
pixel in height, leaving H − L + 1 pixel rows for alignment. There are then Tl = L − 1
transitions between text lines and Tv = H − L vertical transitions between pixel rows for
free use in configurations C.
Interpreting this as a combinatorial problem[138] of choosing Tl elements out of a
base set of Tv elements without replacement and without counting permutations
twice yields the following function for the number of possible configurations Nl
for placing text lines, where $T_v^{\underline{T_l}}$ denotes the falling factorial:
$$N_l = \frac{T_v^{\underline{T_l}}}{T_l!} = \frac{T_v \times (T_v - 1) \times \cdots \times (T_v - T_l + 1)}{T_l \times (T_l - 1) \times (T_l - 2) \times \cdots \times 1} \tag{6.4.8}$$
The same approach can be applied for placing G glyphs in a horizontal pixel row.
Counting occurrences of the glyph separator ϵg as part of the sequence of G glyphs,
there are then Tg = G − 1 transitions between glyphs and Th = W − 1 horizontal transitions
between pixel columns. This gives the number of possible configurations for placing the
text line in a pixel row:
$$N_g = \frac{T_h^{\underline{T_g}}}{T_g!} = \frac{T_h \times (T_h - 1) \times \cdots \times (T_h - T_g + 1)}{T_g \times (T_g - 1) \times (T_g - 2) \times \cdots \times 1} \tag{6.4.9}$$
Number Nl gives the number of ways to place the text lines in the vertical dimension
and Ng the ways for placing the glyphs of a text line in horizontal direction. In total the
number of configurations C is then
$$N_c = N_l \times N_g \tag{6.4.10}$$
assuming that each text line has an equal number of glyphs and the alignment of their
characters is uniform for all lines or
$$N_c = N_l \times N_g^{L} \tag{6.4.11}$$
assuming that the alignment of characters is independent for each text line.
In an example with a pixel space of W = 320 by H = 240, aligning a paragraph
of L = 5 text lines of G = 50 characters each, the number of possible configurations
Nc is approximately $10^{66}$ with uniform text lines as in Equation 6.4.10 and approximately
$10^{299}$ with independent character alignments in each text line as assumed in Equation
6.4.11. These numbers are vastly smaller than the number obtained by enumerating all
configurations independent of their encoded text, but they are still computationally prohibitive
even for this simple indicator function α, which only allows perfectly rectangular and
axis-parallel lines and characters.
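The counts of this example can be reproduced with exact integer arithmetic. The sketch below assumes that the quotients of Equations 6.4.8 and 6.4.9 are the binomial coefficients "Tv choose Tl" and "Th choose Tg", and that the independent case multiplies Ng once per text line; the function name and its arguments are chosen for illustration.

```python
import math

def count_configurations(W, H, L, G, independent_lines):
    """Number of configurations N_c for the simplified rectangular indicator function."""
    T_l, T_v = L - 1, H - L    # transitions between text lines, free vertical transitions
    T_g, T_h = G - 1, W - 1    # transitions between glyphs, free horizontal transitions
    N_l = math.comb(T_v, T_l)  # Equation 6.4.8
    N_g = math.comb(T_h, T_g)  # Equation 6.4.9
    # Uniform lines share one glyph alignment; independent lines contribute N_g each.
    return N_l * N_g ** L if independent_lines else N_l * N_g

uniform = count_configurations(320, 240, 5, 50, independent_lines=False)
independent = count_configurations(320, 240, 5, 50, independent_lines=True)

# Decimal exponents of both counts.
print(len(str(uniform)) - 1, len(str(independent)) - 1)
```

For the values of the example, the printed exponents are 66 and 299.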
Using the indicator function α as discussed in Section 6.2 will only introduce more
degrees of freedom, e.g. that text lines and glyphs do not have to be rectangular anymore,
and thus increase the number of different configurations Nc. This further increases the
computational complexity of applying the maximum likelihood approach to this problem.
Please also note the discussion in Section 4.3 on the computational limitations of
inference in graphical models.
6.5 Expectation-Maximization Training
Idea and Training Algorithm
In the previous sections we have so far discussed methods for training deep neural net-
works for multi-line text transcription based on directly minimizing the Edit-distance[81,
151] between the truth label string and the decoded predicted string and another method
for maximizing the likelihood of observing the truth label string in the prediction by
enumerating all configurations for placing this truth label in pixel space. Both are
computationally inefficient and intractable in their own way. We will now discuss an
approach based on expectation-maximization[26][7, ch. 9], which we have discussed in
Section 2.4. This approach will employ a conditional random field and loopy belief
propagation in order to infer an approximation of the soft-assignment between pixels and glyphs
that will maximize the likelihood of decoding to the correct label string. After that all that
is missing is to optimize the deep neural network for reproducing this soft-assignment
during prediction. In this section we will discuss this approach in detail, which is also the
training algorithm used in MDCC.
Expectation-maximization is an iterative optimization algorithm used in machine learn-
ing for selecting model parameters in the face of latent variables. Each iteration is a
two-step process as follows:
1. Expectation step: Keep the model parameters constant while inferring the latent
variables.
2. Maximization step: Keep the latent variables constant while updating the model
parameters.
In our case, the latent variable is the soft-assignment zΣ between pixels and glyphs
in such a way that it decodes to the truth label string. This is in contrast to the soft-
assignment y, which is estimated by the deep neural network without knowledge of the
correct label string. We need to keep this distinction between zΣ for a soft-assignment
matching a specific truth label string and y for the network prediction in mind. The E-step
in our case will be to use a conditional random field for inferring z, a soft-assignment
between characters and pixels based on the truth label string, and thus zΣ by accumu-
lating characters to glyphs, see Equation 6.3.2 for this distinction, while keeping the DNN
parameters constant. The M-step then is to keep the soft-assignment zΣ constant while
updating the DNN parameters towards reproducing it without knowledge of the truth
label string. Figure 6.5.1 gives an overview of this loop, which will be applied iteratively
during the training of the deep neural network.
[Figure 6.5.1 (diagram). Recoverable labels: Start here; Image of Multi-Line Text x; DNN; Prediction; Estimated Soft-Assignment y (Prior); CRF Topology; Truth Label String l; Alignment; Corrected Soft-Assignment z and zΣ; Update Weights W; Decoding; Transcribed String. Regions: Training Only; Transcription Only.]
Figure 6.5.1: Loop between the deep neural network and conditional random field to optimize the
network parameters using expectation-maximization. The EM loop only exists dur-
ing training, as does the CRF. Only the DNN and decoding algorithm are required
for transcription.
The next few paragraphs will discuss the optimization target for the expectation-max-
imization training in this thesis. As before, (x, l) ∈ S will denote the training data set S
consisting of examples of an image x of multi-line text and the matching truth label string
l. Expectation-maximization requires the definition of an optimization function, in the
context of EM also called the distortion function, that will be minimized during training. In the
case of MDCC, this function consists of two distinct parts. The first regards the E-step, in
which we will minimize the Edit-distance between the decoded soft-assignment zΣ and
the truth label string in order to infer this latent variable zΣ. The second part accordingly
aims at the M-step, in which the cross-entropy loss will be employed to optimize the DNN
parameters towards reproducing the soft-assignment zΣ in its own prediction y. The
distortion function J for EM training in MDCC is thus
$$J(W, W_{old}) = \sum_{(x,l) \in S} \left[ \mathrm{Edit}(\mathrm{Decoder}(z_\Sigma), l) + \mathrm{CE}(z_\Sigma, y) \right] \tag{6.5.1}$$
with W being the parameters of the deep neural network. The soft-assignment
y = DNN(x,W) (6.5.2)
is estimated by the deep neural network based on the current network parameters. Esti-
mation of the latent variable in form of the soft-assignment
z = Alignment(DNN(x,Wold), l) (6.5.3)
and its marginalization zΣ according to Equation 6.3.2 is a function of aligning the truth
label string l on the DNN prediction based on the network parameters Wold of the last
EM iteration, which are now being kept constant. The function Alignment() for finding
the soft-assignment z will be the topic of further discussion in this chapter and especially
Section 6.6.
Equation 6.5.1 details the optimization function for training in MDCC. One part of it is
to find a soft-assignment zΣ that minimizes the Edit-distance with the truth label string l
when decoded with the decoding algorithm of Chapter 5. Since it was possible to place
the truth label string l in the input image x in the first place, assuming the truth label and
image match and are correctly annotated, then it is also possible to find a soft-assignment
zΣ that correctly encodes this truth label string. This means that the lower bound for the
term Edit(Decoder(zΣ), l) of the proposed EM distortion function is actually zero, which
is the Edit-distance if both strings match without difference. However, the second term
CE(zΣ,y) measures the cross-entropy between the aligned soft-assignment zΣ, which
serves as the truth in this case, and the DNN-predicted soft-assignment y. The goal is that
y decodes to the truth label string l without prior knowledge of l. This cross-entropy term
means that the EM distortion function will increase in value if the aligned soft-assignment
zΣ varies between iterations of the EM algorithm or deviates too much from the
DNN prediction y. This effect is counteracted by including the estimated soft-assignment y,
based on the last network parameters Wold, as a prior in the aligned soft-assignment z. In
conclusion, the overall target of the alignment function Alignment() should be to find the
alignment zΣ whose decoding Decoder(zΣ) has an Edit-distance of zero to the truth label
string l and which, as a side condition, is as similar
as possible to the soft-assignment y estimated beforehand for this example. This
also makes practical sense since the deep neural network with parameters Wold is also
the best estimator for the true soft-assignment without prior knowledge of the label string.
We can now also see why different formulations of P (l|y) of Equation 5.3.2 in de-
coding and of Equation 6.4.5 in training are necessary. On a basic level, the decoding
variant is identical to the training variant but with the marginalization of Equation 6.3.2
from label positions to glyphs already built-in. This simpler formulation during decoding
is sufficient since in decoding the only goal is to see how well the assumed label string
matches the observation of the glyphs. During training we need to infer the spatial posi-
tion for each character of the label string and thus need to avoid confusions between the
identical glyph in different characters and thus different spatial positions.
The actual loss function for optimizing the DNN parameters W is only dependent on
the second term in the distortion function of Equation 6.5.1. Only the term CE(zΣ,y)
is dependent on W, while the remaining distortion function is dependent on Wold. This
reduction leads us to the following loss function
$$L = \sum_{(x,l) \in S} \mathrm{CE}(z_\Sigma, \mathrm{DNN}(x, W)) \tag{6.5.4}$$
for training the DNN. This loss function is differentiable regarding W and thus can be
used for gradient-based optimization of this parameter set.
Armed with this knowledge we can derive the algorithm for training the deep neural
network using an expectation-maximization loop in combination with backpropagation
and gradient descent as detailed in Algorithm 6.5.1.
Algorithm 6.5.1 MDCC training based on Expectation-Maximization
Input training data set S.
Choose initial parameter set W.
Outer loop for epoch-based training:
while convergence criteria are not met do
    Inner loop for processing training examples:
    for (x, l) ∈ S do
        Store the current parameter set:
            Wold = W
        Calculate the aligned soft-assignment:
            z = Alignment(DNN(x,Wold), l)
        Marginalization zΣ according to Equation 6.3.2:
            $z^s_{\Sigma g} = \sum_{i=1}^{|l|} \beta(l_i, g) \times z^s_i, \quad \forall s \in [1, d_y]$
        Predict the soft-assignment using the DNN:
            y = DNN(x,W)
        Calculate the DNN loss:
            L = CE(zΣ,y)
        Update the parameter set:
            $W = W - \mu \times \frac{\partial L}{\partial W}$
    end for
end while
Return final parameter set W.
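The structure of Algorithm 6.5.1 can be sketched in a few lines of NumPy. This is an illustration under strong assumptions and not the MDCC implementation: a linear model followed by a softmax per pixel stands in for the DNN, the `alignment` argument is a placeholder for the CRF-based alignment of Section 6.6, a fixed number of epochs replaces the convergence criteria, and all function names are hypothetical.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def marginalize(z, label, alphabet):
    """Equation 6.3.2: accumulate characters of the label into glyphs of the alphabet."""
    beta = np.array([[float(ch == g) for g in alphabet] for ch in label])
    return z @ beta

def cross_entropy(target, y):
    return -np.sum(target * np.log(y))

def em_train(S, W, alignment, alphabet, lr=0.5, epochs=200):
    """Toy EM loop: x has shape (d_y, F) and W has shape (F, |A|)."""
    for _ in range(epochs):                    # outer loop, fixed length as a stand-in
        for x, l in S:                         # inner loop over training examples
            y_old = softmax(x @ W)             # prediction with the parameters held constant
            z = alignment(y_old, l)            # E-step: infer the latent alignment
            z_sigma = marginalize(z, l, alphabet)
            y = softmax(x @ W)                 # prediction used for the loss
            grad_W = x.T @ (y - z_sigma)       # gradient of CE(z_sigma, y) for softmax outputs
            W = W - lr * grad_W                # M-step: one gradient descent update
    return W
```

In this sketch the prediction is computed twice per example to mirror the explicit distinction between the stored and the current parameter set; both calls return the same values.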
In this formulation of Algorithm 6.5.1, the parameter set Wold is explicitly copied and
stored from W. The reasoning is to make clear that the parameter set is kept constant
during inference of the alignment zΣ and only modified when updating the parameter set
using backpropagation and gradient descent. In a practical implementation, the parameter
set would not be copied in each loop and the DNN estimation would only be computed
once per loop.
Updating the parameter set is in this case performed by simply applying stochastic
gradient descent in the form of $W = W - \mu \times \frac{\partial L}{\partial W}$ with µ being the learning rate. Other
optimization algorithms such as Adam[67, 84] can be applied as well. Mini-batch or batch
training can be applied instead of stochastic optimization to the DNN by modification of
the inner loop of Algorithm 6.5.1 in order to process multiple training examples at once.
The parameter set is in this case updated by the accumulation of the gradients of the
individual examples
$$W = W - \mu \times \sum_{(x,l) \in B} \frac{\partial L}{\partial W} \tag{6.5.5}$$
with B ⊆ S being a mini-batch or batch of examples.
Finding the Soft-Assignment z
So far in this section we have discussed the MDCC training algorithm for deep neural net-
work parameter optimization based on expectation-maximization and gradient descent.
The paragraphs above give the methodology, rationale and equations for this EM training.
What is missing is a formulation on how to find the soft-assignment z based on the DNN
estimate y = DNN(x,Wold) and the true label string l. That is to define the function
z = Alignment(y, l). We will discuss this in the following paragraphs.
Let C again be a configuration, that is a hard-assignment of label space positions
i ∈ [1, |l|] of the truth label string l to pixel space positions s. That is each pixel hard-
assignment Cs ∈ [1, |l|] gives the label position to pixel s. Each pixel is assigned exactly
one label position, but the same label position may be assigned to different pixels. This
is consistent with the use of the term configuration in the context of graphical models.
We need to refer back to the definition of the likelihood P (l|y) of Equation 6.4.5, which
gives the likelihood of observing label l given the soft-assignment y in order to define the
aligned soft-assignment z.
The alignment z as required for the expectation-maximization training described in the
paragraphs beforehand is a soft-assignment between label space positions and pixels.
That is, each pixel is assigned a vector of probabilities, each giving the likelihood that one
specific label position, that is, a character of the truth label string, occurs in this pixel. Since
the characters are mutually exclusive but one character has to be assigned to each pixel,
the probability vector per pixel sums to exactly one. To achieve
this we can choose the same approach as for P (l|y) of Equation 6.4.5, but to define $z^s_i$
only, we marginalize pixel-wise over configurations that assign the label position $C^s = i$
to pixel s. We thus derive a basic formulation of the alignment z by first computing the
unnormalized alignment z′ with
$$z'^s_i = \sum_{C} \gamma(C^s, i) \prod_{s'=1}^{d_y} y^{s'}_{l_{C^{s'}}} \prod_{t \in \mathrm{nbr}(s')} \alpha_{s',t}(C^{s'}, C^t, l) \tag{6.5.6}$$
for all spatial positions s ∈ [1, dy] and all label positions i ∈ [1, |l|]. In this formulation, γ is
an indicator function
$$\gamma(C^s, i) = \begin{cases} 1 & \text{iff } C^s = i \\ 0 & \text{else} \end{cases} \tag{6.5.7}$$
that ensures we only marginalize over configurations that assign the correct label space
position i to pixel space position s. The alignment z is the normalization
$$z^s_i = \frac{{z'}^s_i}{\sum_{j \in [1,|l|]} {z'}^s_j} \tag{6.5.8}$$
of the unnormalized soft-assignment z′. As in Equation 6.4.5, the indicator function α
enforces correct neighborhood relations according to label string l, dy gives the spatial di-
mensionality of the soft-assignment y and function nbr gives the 8-neighborhood around
position s. In this equation, the variable s gives the spatial position in question and s′ is
an iterator variable over all pixel positions in order to derive the likelihood for the current
configuration C in question.
In this case we treat each configuration C as an independent event that favors the
assignment of label position i to pixel s or not. Accumulating the likelihoods of configura-
tions that favor a specific assignment derives the likelihood of observing this label position
in this pixel. Normalization of each pixel in such a way that its likelihoods of assigned la-
bels sums up to 100 percent is required since each pixel must be assigned a label. There
cannot be any configuration C that correctly encodes l but has one or more pixels with as-
signed labels of probability zero. This normalization also ensures that P (l|z) = 1, which
is necessary since the alignment z purposefully encodes l.
Computing the alignment z according to Equation 6.5.8 again requires the enumera-
tion of all configurations C that encode the correct label string l. This means the same
reasoning on the runtime as in Section 6.4 with maximum likelihood training holds true
here. Applying this equation for the alignment is prohibitively expensive in computational
runtime. In Section 6.6 we will discuss how to interpret this two-dimensional alignment
problem as an inference problem in general graphical models, namely a conditional ran-
dom field. This will allow us to approximate the alignment of Equation 6.5.8 in reasonable
runtime using loopy belief propagation.
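The exact alignment of Equations 6.5.6 to 6.5.8 can be written down directly for a toy case, which also shows why it does not scale. The sketch below assumes a single pixel row with at least as many pixels as label characters, a hypothetical left-to-right validity rule in place of the indicator function of Section 6.2, and illustrative function names.

```python
import itertools
import numpy as np

def brute_force_alignment(y, label, alphabet):
    """Exact alignment z (d_y x |l|) by enumerating all configurations of one pixel row."""
    d_y, n = y.shape[0], len(label)
    glyph = [alphabet.index(ch) for ch in label]  # label position -> glyph index
    z_unnorm = np.zeros((d_y, n))
    for C in itertools.product(range(n), repeat=d_y):
        # Hypothetical validity rule: start at the first character, end at the last,
        # and either stay on a character or advance by one between neighboring pixels.
        valid = C[0] == 0 and C[-1] == n - 1 and all(
            C[s + 1] in (C[s], C[s] + 1) for s in range(d_y - 1))
        if not valid:
            continue
        weight = np.prod([y[s, glyph[C[s]]] for s in range(d_y)])
        for s in range(d_y):
            z_unnorm[s, C[s]] += weight  # gamma(C^s, i) selects i = C^s (Equation 6.5.7)
    # Pixel-wise normalization (Equation 6.5.8).
    return z_unnorm / z_unnorm.sum(axis=1, keepdims=True)
```

For three pixels with a uniform prediction and the label "ab", the first pixel is assigned entirely to the first character, the last pixel entirely to the second, and the middle pixel is split evenly between both.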
Considerations on EM-training
As we have discussed before in Section 2.4, we need to distinguish between
expectation-maximization and generalized expectation-maximization. Expectation-maximization
refers to the specific use case where a maximum likelihood solution is heuristically ap-
proximated. In this case the E-step finds the expectation value for the latent variables
given the current model parameters and according to the distortion function. The M-step
maximizes the likelihood of observing the training data given these latent variables. Iter-
atively repeating this process will lead to convergence towards the maximum likelihood
solution. The M-step can also be changed towards a maximum a posteriori (MAP)
solution, which is one generalization of the EM algorithm.
Another generalization of EM is to approximate either or both the E-step and the M-
step. In the original formulation of EM, the E-step finds the expectation value for the latent
variables given the joint distribution of the observed data set and the latent variables. This
is reflected in the distortion function. In a symmetrical fashion, the M-step minimizes the
distortion function by choosing the optimal model parameters. Finding the expectation
value of the latent variables and optimizing the model parameters is done in every iter-
ation of the EM algorithm. One example for this ‘exact’ EM is k-means clustering which
we have discussed in Section 2.4. The EM training proposed in this thesis falls under the
term generalized expectation-maximization since both the E-step and M-step are only
approximations or iterations towards the true expectation value of the soft-assignment zΣ
or the optimal model parameters W. The DNN training discussed beforehand utilizes
gradient descent for optimizing the model parameters. Gradient descent is an iterative
algorithm, which only modifies the model parameters in small steps each iteration. It does
not find the optimal model parameters in a single step. This alone leads us to the
conclusion that we deal with generalized expectation-maximization in this thesis. The E-
step in MDCC approximates the soft-assignment z using a conditional random field and
loopy belief propagation. This is the topic of Section 6.6. Again, this E-step is only an
approximation of the true expectation value for the soft-assignment zΣ and thus supports
the idea that this is generalized expectation-maximization.
We need to keep in mind that this is generalized expectation-maximization with step-
wise updates of the model parameters instead of finding one-step optimal solutions for
the model. However, there are theoretical discussions[93, ch. 11.4.8] and examples
that show that an incremental EM algorithm will still converge to a (local) optimum of the
maximum likelihood solution.
6.6 Construction of and Inference in the CRF
Overview and Structure
Section 6.5 discussed the expectation-maximization approach to training the deep neural
network in the method proposed in this thesis. We have so far discussed the optimiza-
tion function, or distortion function, of the EM training in multi-dimensional connectionist
classification and how to derive the loss for gradient-based optimization of the DNN pa-
rameters. We also have shown the prototypical alignment z = Alignment(y, l) between
the truth label string l and the DNN prediction y. This aligned soft-assignment z be-
tween label positions and pixels represents the latent variable in the EM training at hand.
This prototypical alignment made clear that exact inference of the alignment z opens up
the same computational considerations as with maximum likelihood optimization of the
DNN parameters and is computationally intractable. Formulating the deep neural network
training as an expectation-maximization approach did however change the nature of the
alignment problem from an optimization task on the DNN parameters to an inference task
of the latent variable. This allows the application of well known approximate inference
algorithms. In this section we will discuss the formulation of the alignment in MDCC as
approximate inference using loopy belief propagation[34, 94][93, ch. 22] on a conditional
random field [77][93, ch. 19.6]. Both concepts have been discussed in Section 2.2.
It is necessary to mention that there are two ways of using conditional random fields,
or most graphical models in general: One being as a machine learning model in which
the model parameters are automatically learned in a supervised schema using a training
data set. The other being as a model of a multi-variate probability distribution in which
the model parameters are predefined according to expert knowledge of the problem at
hand. In this work we will apply the CRF in the latter way, by choosing the graph topology
and parameters according to the alignment problem that we have discussed beforehand.
We then use the graphical model for inference of the aligned soft-assignment.
We define the conditional random field in MDCC as a pairwise undirected model with
discrete states in which each pixel of the alignment problem relates to a node in the CRF
and each label position to a state of the CRF. Otherwise said, each pixel of the pixel
space becomes one random variable of the graphical model and each label position of
the truth label becomes one possible discrete state of these random variables. The joint
distribution of such a discrete pairwise CRF for the alignment problem at hand is defined
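as follows:
$$P(C|y, l) = \frac{1}{Z} \prod_{s=1}^{d_y} \psi_s(C^s, l, y) \prod_{t \in \mathrm{nbr}(s)} \psi_{s,t}(C^s, C^t, l) \tag{6.6.1}$$
with Z being the partition function or Zustandssumme, that is the normalizer
$$Z = \sum_{C} \left[ \prod_{s=1}^{d_y} \psi_s(C^s, l, y) \prod_{t \in \mathrm{nbr}(s)} \psi_{s,t}(C^s, C^t, l) \right] \tag{6.6.2}$$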
that ensures that the accumulated likelihood over all possible configurations is 100 per-
cent.
Functions ψs(Cs, l,y) and ψs,t(Cs, Ct, l) are the potential functions of the CRF. As
discussed before, the Hammersley-Clifford Theorem[50, 70, 77] defines the properties
that these potential functions need to fulfill. They need to be non-negative functions
in dependency of the nodes in their clique, which in our case as a pairwise model are
cliques of two neighbors. The potential functions define the ‘compatibility’ between the
observations, in our case the DNN prediction y, and the states of the random variables as
well as between the states of neighboring random variables. Higher values of the potential
functions mean higher compatibility. In line with these properties and the discussions of
Section 6.5 we define the node potential function
\[
\psi_s(C^s, l, y) = e^{y^s_{l_{C^s}}} \tag{6.6.3}
\]
as proportional to the estimated soft-assignment y, which serves as a prior. Please note
that the soft-assignment y is indexed by a spatial position and a glyph from alphabet A,
not a specific instance of it. Hence it has to be indexed by $l_{C^s}$, the glyph at label
position $C^s$, in the case applied here. The edge potential function
\[
\psi_{s,t}(C^s, C^t, l) = \alpha_{s,t}(C^s, C^t, l) = \alpha_{s,t}(C^s, C^t, l) \times e^0 \tag{6.6.4}
\]
models the topology of the conditional random field according to the indicator function α
which we have discussed in Section 6.2. Indicator function α has a value of 1 whenever
the hard assignments Cs and Ct in pixels s and t are valid according to the truth label
string l and 0 otherwise. The edge potential function thus also assumes a value of one or
zero depending on whether the neighborhood relation is valid or not. Rewriting this as
α × e^0 makes it easier to fit this into the framework of graphical models and practical
use, as we will see now.
The edge potential function ψs,t of a graphical model is essentially an n × n matrix
with n being the number of states in each random variable, in our case the size of the
label space, and the coefficients of this matrix define the edge between the states in
these two nodes of the graphical model. This matrix may contain structural zeros, that is
zero coefficients which denote states that are not valid neighbors. As we can see from
Equation 6.6.1, configurations C which contain neighboring states with such a structural
zero have, correctly, a likelihood of zero according to the joint distribution. In our case
the structural zeros depend on the structure of the label space and pixel space as
defined by the indicator function α.
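The interplay of Equations 6.6.1 to 6.6.4 can be illustrated by brute force on a toy problem. The following is only a sketch under stated assumptions: the three-pixel chain, the two-character label string, the scores in `y` and the stand-in indicator `alpha` are all hypothetical (the actual indicator is the one of Section 6.2), and each undirected edge is counted once.

```python
import itertools
import math

# Hypothetical toy problem: 3 pixels in a row, truth label string of length 2.
label = "ab"
alphabet = {"a": 0, "b": 1}
# y[s][g]: DNN score of glyph g at pixel s (made-up values).
y = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
edges = [(0, 1), (1, 2)]  # left-to-right neighbors, each edge counted once


def node_potential(s, c):
    """psi_s(C^s, l, y) = exp(y^s_{l_{C^s}}), see Equation 6.6.3."""
    return math.exp(y[s][alphabet[label[c]]])


def alpha(c_s, c_t):
    """Stand-in indicator: the right neighbor keeps the label position or advances by one."""
    return 1.0 if c_t in (c_s, c_s + 1) else 0.0


def unnormalized(config):
    """Product of all node and edge potentials for one configuration C."""
    p = 1.0
    for s, c in enumerate(config):
        p *= node_potential(s, c)
    for s, t in edges:
        p *= alpha(config[s], config[t])  # a structural zero removes the configuration
    return p


# Enumerate every configuration; this is only feasible for toy sizes.
configs = list(itertools.product(range(len(label)), repeat=len(y)))
Z = sum(unnormalized(c) for c in configs)
joint = {c: unnormalized(c) / Z for c in configs}
```

Configurations that violate the stand-in indicator, such as `(1, 0, 0)`, receive a probability of exactly zero, while the remaining probabilities sum to one. The exponential growth of `configs` is the reason approximate inference is needed for real pixel spaces.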
Undirected graphical models with exponential potential functions, specifically where
low-energy configurations have a high probability, are called energy based models, see
e.g. Murphy[93, ch. 19.3.1] for further information, and are common in modeling
physical systems. In practice, choosing exponential potential functions offers benefits
for the implementation of loopy belief propagation on computers. Applying LBP in
sum-product mode, as we will do in this section, requires the repeated application of
summation and multiplication operations to the values of the potential functions. Numerical
stability is increased in such cases if these operations are done in logarithmic scale.
The work on connectionist temporal classification contains[43, ch. 7.3.1] the required
equation for addition in logarithmic scale
\[
\ln(a + b) = \ln a + \ln\left(1 + e^{\ln b - \ln a}\right) \tag{6.6.5}
\]
with multiplication
\[
\ln(a \times b) = \ln a + \ln b \tag{6.6.6}
\]
and the identity
\[
\ln(e^a) = a \tag{6.6.7}
\]
being in the general corpus of knowledge. These equations allow a numerically stable
implementation of the sum-product algorithm in loopy belief propagation and thus favor
exponential potential functions.
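Equations 6.6.5 and 6.6.6 translate into a few lines of code. This is a minimal sketch, not the implementation used in this work: the function names are hypothetical, and two choices go beyond the equations as stated, namely always factoring out the larger operand (so the exponent is never positive) and representing a structural zero as negative infinity.

```python
import math

NEG_INF = float("-inf")  # ln(0), used here for structural zeros


def log_add(ln_a, ln_b):
    """Return ln(a + b) given ln a and ln b, following Equation 6.6.5."""
    if ln_a == NEG_INF:
        return ln_b
    if ln_b == NEG_INF:
        return ln_a
    # Factor out the larger term so that exp() never overflows.
    hi, lo = max(ln_a, ln_b), min(ln_a, ln_b)
    return hi + math.log1p(math.exp(lo - hi))


def log_mul(ln_a, ln_b):
    """Return ln(a * b) given ln a and ln b, following Equation 6.6.6."""
    return ln_a + ln_b
```

With these helpers, values such as `log_add(-1000.0, -1000.0)` remain finite, whereas computing `math.log(math.exp(-1000.0) + math.exp(-1000.0))` directly would fail because both exponentials underflow to zero.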
Restating the edge potential function ψs,t as a product of two terms, the indicator
function α defining the structural zeros and the constant e0, allows for an efficient imple-
mentation of loopy belief propagation.
The above discussions and equations define the conditional random field as used for
the alignment z to complete the EM training of Section 6.5. Approximate inference of the
node marginals to retrieve the aligned soft-assignment z will be the next topic.
Inferring the Aligned Soft-Assignment
We will now discuss how to apply loopy belief propagation in sum-product mode to the
conditional random field described above. As discussed before in Section 2.2, belief
propagation[100] is a message passing algorithm that computes the marginalized beliefs,
in sum-product mode, or maximum posterior states, in max-product mode, of graphical
models. If the graphical model at hand is a polytree, belief propagation will yield the
exact marginals. Cyclic graphs, as in our case, are not polytrees, but still belief propaga-
tion can be applied iteratively in order to retrieve approximated marginals[100, p. 195].
This iterative variant of belief propagation applied to cyclic graphs is called loopy belief
propagation.
The term beliefs in this context refers to the inferred likelihoods of the unobserved
variables, in our case the alignment z, on the basis of the observed variables, here the
DNN prediction y, and the given graphical model topology. In case of the alignment prob-
lem in this chapter, the beliefs $\mathrm{bel}^s_i$ are proportional to the marginals of Equation 6.5.8:
$\mathrm{bel}^s_i \propto z^s_i$. Referring back to the discussions of Section 2.2 and especially Algorithm 2.2.1,
we will now define the message passing for loopy belief propagation for approximating
the aligned soft-assignment z.
In the following equations we will use the variables xs, xt as random variables ex-
pressing the state of the pixels, that is nodes of the CRF, at spatial position s and t. This
is in contrast to the usage of x as input into the deep neural network. The message
update
\[
m_{s \to t}(x_t) = \sum_{x_s} \Big[ \psi_s(x_s, l, y) \, \psi_{s,t}(x_s, x_t, l) \prod_{u \in \mathrm{nbr}(s) \setminus t} m_{u \to s}(x_s) \Big] \tag{6.6.8}
\]
contains the local belief, based only on pixel s, about the likelihoods of discrete states
in pixel t. Each message is built in a multi step process. First the evidence for node
s is collected from its neighbors, except node t. This evidence is adjusted by its prior,
namely the node potential ψs and finally transformed to a belief about node t via the
edge potential ψs,t. Thus each message is the belief about the probability distribution in
a specific node, given one of its neighbors and the prior information available.
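The three steps of a message update can be written down directly. The sketch below works in the linear domain for readability; the function name and argument layout are hypothetical, and the final normalization of the message is an added assumption that is common in practice but not part of Equation 6.6.8.

```python
def send_message(node_pot_s, edge_pot_st, incoming):
    """Compute m_{s->t}(x_t) as in Equation 6.6.8.

    node_pot_s[x_s]       node potential psi_s for every state of node s
    edge_pot_st[x_s][x_t] edge potential psi_{s,t} as a matrix
    incoming              messages m_{u->s} from all neighbors u of s except t
    """
    n = len(node_pot_s)
    # Steps 1 and 2: collect the evidence from the neighbors and weight it by the prior.
    evidence = []
    for x_s in range(n):
        e = node_pot_s[x_s]
        for m in incoming:
            e *= m[x_s]
        evidence.append(e)
    # Step 3: transform the evidence into a belief about node t via the edge potential.
    msg = [sum(evidence[x_s] * edge_pot_st[x_s][x_t] for x_s in range(n))
           for x_t in range(n)]
    # Assumption: rescale the message to sum to one to avoid under- and overflow.
    total = sum(msg)
    return [v / total for v in msg] if total > 0 else msg
```

Rescaling does not change the resulting beliefs, since these are only defined up to proportionality and are normalized afterwards.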
In loopy belief propagation, the message values ms→t(xt) are updated and stored at
each iteration in order to use them for updating of the other messages in the message
passing process. Updating the messages is iteratively repeated until the predefined
convergence criteria are met. See Algorithm 2.2.1 or the literature[93, ch. 22] for the full
algorithm.
Given all the messages within the CRF, we define the beliefs
\[
\mathrm{bel}^s_i \propto \psi_s(i, l, y) \prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(i) \tag{6.6.9}
\]
where s is a spatial position in pixel space and i a position within the label space. As
stated before, in sum-product mode these beliefs are proportional to the marginals of
Equation 6.5.8 and as such we can use the same normalization to approximate
\[
z^s_i \approx \frac{\mathrm{bel}^s_i}{\sum_{j \in [1,|l|]} \mathrm{bel}^s_j} \tag{6.6.10}
\]
with i and j both being positions in the label space defined by the truth label string l. Ap-
proximating the beliefs for all spatial positions s and all label positions i will approximate
the aligned soft-assignment z and thus complete the expectation-maximization training
described in this chapter.
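Equations 6.6.9 and 6.6.10 amount to an element-wise product followed by a normalization. A minimal sketch for a single node, with a hypothetical function name and argument layout:

```python
def node_marginal(node_pot_s, incoming):
    """Approximate z^s for one node s from its potential and all incoming messages.

    node_pot_s[i]  node potential psi_s(i, l, y) for every label position i
    incoming       messages m_{t->s} from all neighbors t of s
    """
    # Equation 6.6.9: belief is the node potential times all incoming messages.
    bel = list(node_pot_s)
    for m in incoming:
        bel = [b * v for b, v in zip(bel, m)]
    # Equation 6.6.10: normalize over all label positions.
    total = sum(bel)
    return [b / total for b in bel]
```

For example, `node_marginal([1.0, 2.0], [[0.5, 0.5], [1.0, 3.0]])` yields the unnormalized beliefs 0.5 and 3.0 and therefore the marginals 1/7 and 6/7.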
Convergence Criteria
So far we have discussed how the aligned soft-assignment z can be inferred using loopy
belief propagation. As stated, LBP is an iterative algorithm and requires convergence
criteria in order to stop the iteration and use the best available beliefs. There are some
theoretical considerations: At the time of writing it is not clear[94] under which conditions
convergence of the beliefs in loopy belief propagation does occur or if the beliefs are
near the exact solution if convergence does occur. However, practical application of loopy
belief propagation on Markov random fields and conditional random fields does show that
LBP can be successfully employed. It seems that if the beliefs do converge to a stable
point, they are a reasonable approximation of the true posteriors. In some cases, LBP
does oscillate without convergence to a stable point.
We will now discuss how these problematic behaviors of LBP can be defused for the
proposed method of this thesis. For this we will choose appropriate convergence criteria
and we will test whether the approximated marginals are near the exact marginals. The
first convergence criterion is derived from the expectation-maximization distortion function
as presented in Equation 6.5.1. One term of the distortion function is to minimize the
Edit-distance between the truth label string l and the aligned soft-assignment zΣ. When
approximating the soft-assignment z in LBP, we can at the same time approximate zΣ by
applying Equation 6.3.2. This in turn means that at every iteration of loopy belief propa-
gation, we can approximate the soft-assignment zΣ, decode it with the decoder algorithm
presented in Chapter 5 and compute the Edit-distance between the decoded alignment
and the truth label string l. This is the term Edit(Decoder(zΣ), l) of the EM distortion func-
tion of Section 6.5. If this Edit-distance is at its lower limit of zero, meaning no difference
at all between the decoded and truth string, then we can stop the iterations of LBP and
use the current marginals as the approximated alignment. While this does not give any
indication of how large the difference between the approximated soft-assignment z and
its true counterpart is, it at least means that there is no other string o ≠ l different from
the truth label string l for which P (o|zΣ) > P (l|zΣ) based on Equation 5.3.2 holds true.
This means that z is sufficiently close to the exact solution in order to apply the discussed
expectation-maximization training for optimizing the deep neural network parameters to-
wards predicting the soft-assignment y which decodes to the correct label string during
transcription. This convergence criterion thus defuses the problem that LBP sometimes
converges to a stable point that is not close to the exact solution.
The second consideration is to prevent loopy belief propagation from running infinitely
when not converging towards a stable point. Armed with the knowledge that we can
test the current beliefs for sufficient closeness to the true marginals, we can simply stop
LBP after a fixed number of iterations. If the beliefs converge towards a point where
they decode to a string with an Edit-distance of zero to the true label string, LBP will
be stopped early and these sufficient beliefs will be used as the aligned soft-assignment
z. If no such beliefs were discovered after a fixed number of iterations, LBP can be
stopped and the example ignored for the current expectation-maximization iteration.
This effectively removes the example from the training data set for the current epoch of
deep neural network training. It may be that a sufficient solution will be found for the same
training example in later iterations of EM after the DNN parameters have been optimized
towards correct transcription of multi-line text.
These two ideas in combination result in the following formulation of loopy belief prop-
agation, based on Algorithm 2.2.1, in the context of expectation-maximization as dis-
cussed in Section 6.5:
Algorithm 6.6.1 Loopy Belief Propagation in MDCC
  Input predicted soft-assignment y.
  Input truth label string l.
  Initialize messages $m_{s \to t}(x_t) = m_{t \to s}(x_s) = 1$ for all edges s ∼ t.
  Initialize beliefs $\mathrm{bel}^s_i = 1$ for all nodes s.
  Choose a random but fixed order for message updates.
  Propagate beliefs until convergence criteria are met:
  repeat
    Send messages along each edge:
      $m_{s \to t}(x_t) = \sum_{x_s} [\psi_s(x_s, l, y) \, \psi_{s,t}(x_s, x_t, l) \prod_{u \in \mathrm{nbr}(s) \setminus t} m_{u \to s}(x_s)]$
    Update beliefs for each node:
      $\mathrm{bel}^s_i \propto \psi_s(i, l, y) \prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(i)$
    Compute approximate z and zΣ based on these beliefs.
    Test if these beliefs do decode correctly:
    if Edit(Decoder(zΣ), l) = 0 then
      Return soft-assignment z and use it in EM training.
    end if
  until limit on the number of iterations is reached.
  Return and discard the training example for the current EM iteration.

Loopy belief propagation as outlined in Algorithm 6.6.1 prevents infinitely running LBP
iterations and training the deep neural network towards invalid predictions. It does so by
introducing a limit on the number of iterations and by using only training examples for
which the CRF approximation decodes to the truth label string. On the other hand, it
potentially removes some training examples from the training data set and adds them
back again later. This is not a problem from an algorithmic viewpoint since the expec-
tation-maximization training presented in Section 6.5 is already a batch version and the
optimization of the DNN parameters already has a non-stationary loss in the expectation-
maximization loop of MDCC. Non-stationary objectives in the optimization of deep neural
networks can be addressed by the Adam[67, 84] method.
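The control flow of Algorithm 6.6.1 can be sketched as follows. This is an illustration under assumptions rather than the implementation used for the experiments: the function and parameter names are hypothetical, computation is done in the linear domain with normalized messages, the update order is simply the insertion order of the edges, and the test Edit(Decoder(zΣ), l) = 0 is abstracted into a caller-supplied predicate `decodes_correctly`, since the decoder of Chapter 5 is not reproduced here.

```python
def lbp_align(node_pot, neighbors, edge_pot, decodes_correctly, max_iters=50):
    """Loopy belief propagation with early stopping.

    node_pot[s][i]          node potentials for node s and label position i
    neighbors[s]            list of nodes adjacent to s
    edge_pot(s, t)          returns the matrix psi_{s,t}[x_s][x_t]
    decodes_correctly(z)    stand-in for the Edit-distance test
    Returns the marginals z, or None if the example is to be discarded.
    """
    n_nodes, n_states = len(node_pot), len(node_pot[0])
    # Initialize all messages to 1 and fix the update order.
    msgs = {(s, t): [1.0] * n_states for s in range(n_nodes) for t in neighbors[s]}
    order = list(msgs)
    for _ in range(max_iters):
        # Send messages along each edge (Equation 6.6.8).
        for s, t in order:
            psi = edge_pot(s, t)
            new = []
            for x_t in range(n_states):
                acc = 0.0
                for x_s in range(n_states):
                    term = node_pot[s][x_s] * psi[x_s][x_t]
                    for u in neighbors[s]:
                        if u != t:
                            term *= msgs[(u, s)][x_s]
                    acc += term
                new.append(acc)
            total = sum(new)
            msgs[(s, t)] = [v / total for v in new] if total > 0 else new
        # Update and normalize beliefs for each node (Equations 6.6.9 and 6.6.10).
        z = []
        for s in range(n_nodes):
            bel = list(node_pot[s])
            for t in neighbors[s]:
                bel = [b * m for b, m in zip(bel, msgs[(t, s)])]
            total = sum(bel)
            z.append([b / total for b in bel] if total > 0 else bel)
        # Stop as soon as the current beliefs decode to the truth label string.
        if decodes_correctly(z):
            return z
    # Iteration limit reached: discard the example for this EM iteration.
    return None
```

A return value of `None` corresponds to the last line of the algorithm, so the caller can skip the training example for the current expectation-maximization iteration.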
This concludes the training of the deep neural network towards transcription of multi-
line text, which was the goal of this chapter. The following chapters will focus on exper-
imentation and application of both the discussed decoding algorithm and training algo-
rithm.
6.7 Emphasizing Segmentation
Section 4.2 stated that one possible solution to Sayre’s knot is to treat segmentation and
transcription of handwritten text as two products of the same process, not two different
processes. Treating these as two different processes will introduce a circular dependency
between both, which constitutes Sayre’s knot. So far this chapter and Chapter 5 have dis-
cussed transcription of multi-line paragraphs using multi-dimensional connectionist clas-
sification. The transcription task is encoded in the conditional random field, defined by
its structure as discussed in Section 6.2, followed by a suitable decoding algorithm as
proposed in this thesis.
Segmentation, that is a correct assignment between pixels in the soft-assignment
predicted by the DNN and the presented input image, was so far not discussed in the
context of MDCC. Emphasizing segmentation instead of or in addition to transcription is
the topic of this section. It is worth noting that this is an idea on how to emphasize
segmentation in multi-dimensional connectionist classification, but implementation and
evaluation of this approach is not in the scope of this thesis.
Part of the training algorithm of MDCC is to estimate the true soft-assignment of the
ground truth label sequence over the two-dimensional soft-assignment estimated by the
deep neural network. This alignment process is implemented by constructing a suitable
conditional random field, followed by approximate inference using loopy belief propaga-
tion. A CRF is defined by its node and edge potentials. Edge potentials encode the
‘compatibility’ of labels in neighboring pixels of the CRF. In the case of MDCC these
edge potentials encode neighborhood relations between characters in the ground truth
label sequence. As such, edge potentials can be interpreted as facilitating the correct
transcription of the text.
Node potentials of a CRF on the other hand encode the ‘compatibility’ between pixels
of the CRF and their labels without taking neighboring pixels into account. Node poten-
tials are thus only dependent on the spatial position of the pixel and the assigned label.
In MDCC the node potential function is given by Equation 6.6.3. It defines the node po-
tential in dependency of the glyph probabilities estimated by the deep neural network.
This is because the idea behind MDCC is to modify the soft-assignment as estimated
by the DNN as little as possible, but still correct it to facilitate correct glyph neighbors
and thus correct transcription. Equation 6.5.1 necessitates this approach since it de-
fines the distortion function for expectation-maximization in MDCC in such a way that
the soft-assignment estimated by LBP needs to show a low cross-entropy towards the
soft-assignment estimated by the DNN.
In this section we will discuss the corresponding modification to the node potential function.
The original node potential function
\[
\psi_s(C^s, l, y) = e^{y^s_{l_{C^s}}} \tag{6.7.1}
\]
is augmented by introducing an additional dependency on a static soft-assignment k as
\[
\psi_s(C^s, l, y, k) = e^{y^s_{l_{C^s}} + \theta \times k^s_{l_{C^s}}} \tag{6.7.2}
\]
where 0 < θ < 1 is a constant coefficient to weigh the DNN-estimated soft-assignment y
with the static soft-assignment k. The static soft-assignment k introduces a prior to the
node potential function. A similar approach of augmenting the node potential function is
detailed in Section 7.3 for implementation of a two-dimensional forced alignment.
Emphasis is put on segmentation by choosing the static soft-assignment k in such
a way that it preserves and represents the assignment between spatial positions in the
presented input image and glyphs of the alphabet in use. One way to produce this static
soft-assignment would be to move a sliding window over the input image and to do sin-
gle character recognition using e.g. a convolutional neural network or support-vector
machine. Since this soft-assignment is dependent only on the input image it can be com-
puted once per data set and then reused, reducing the impact of this approach on the
overall training time in MDCC.
The question is how good the error rate of this single character recognition must be
since its task stands in competition with MDCC and it seems that this approach of aug-
menting the node potential function just moves the paragraph-level transcription problem
to another algorithmic abstraction level. However, the CRF in MDCC encodes the truth
label sequence in its edge potentials and as such the soft-assignment as approximated
by the CRF and LBP always decodes to the correct string. The single character recogni-
tion that produces the soft-assignment k thus does not need to be 100 percent correct. It
just needs to preserve the spatial relationship between pixels in the input image and glyphs
of the alphabet while being correct in enough cases. ‘Enough cases’ means that each true
prediction of the single character recognition will fixate the related node in the CRF to
one single labeling or at least a small number of labels. This in turn reduces the overall
possible configurations for placing the ground truth label string to the subset which respects
this constraint. This means that each true prediction by the single character recognition,
incorporated in soft-assignment k, will improve the quality of the segmentation provided
by the soft-assignment in MDCC, both estimated by the DNN and approximated by the
CRF.
Emphasis should be placed on only incorporating reliable, or highly likely, predictions
by the single character recognizer into the static soft-assignment k. Applying a threshold,
if possible, to the predictions of the single character recognizer facilitates this precaution.
This threshold could e.g. be a lower limit on the probabilities given by a softmax function
in a convolutional neural network or a lower limit on the separation margin in a support-
vector machine. While following this approach, adding a few high probability predictions
to the soft-assignment k should yield better results than many low probability predictions.
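Since this approach was not implemented in this thesis, the following is purely an illustrative sketch of how Equation 6.7.2 and the thresholding precaution could fit together. The function names, the threshold of 0.9 and the value of θ are hypothetical choices, and `char_probs` stands for the per-pixel glyph probabilities of an assumed single character recognizer.

```python
import math


def static_prior(char_probs, threshold=0.9):
    """Build the static soft-assignment k from single character predictions.

    char_probs[s][g] is the recognizer's probability of glyph g at position s.
    Predictions below the threshold are dropped (set to 0) so that only
    reliable predictions influence the node potential.
    """
    return [[p if p >= threshold else 0.0 for p in row] for row in char_probs]


def augmented_node_potential(y_s, k_s, glyph_index, theta=0.5):
    """Equation 6.7.2 for one node: exp(y + theta * k) at the glyph l_{C^s}."""
    return math.exp(y_s[glyph_index] + theta * k_s[glyph_index])
```

Wherever k is zero, the augmented potential reduces to the original node potential of Equation 6.7.1, so unreliable predictions leave the CRF unchanged.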
As stated, this approach of emphasizing segmentation was not implemented or eval-
uated in the scope of this thesis. This section still serves as a reminder that MDCC can
easily be modified to support variants of the paragraph-level transcription task.
Chapter 7
Text Recognition for Paragraphs
7.1 Overview
In Chapters 5 and 6 we have discussed the methodology and theory for using a deep neu-
ral network (DNN) for segmentation-free multi-line offline text transcription. The pipeline
first employs a deep neural network and the training algorithm of Chapter 6 to estimate
a probabilistic soft-assignment between pixels of the input image and glyphs from the al-
phabet. Chapter 5 finalizes the transcription pipeline by providing the multi-line decoding
algorithm to produce a highly likely string from the estimated soft-assignment. In total the
pipeline discussed in these two chapters of the thesis is capable of transcribing multi-line
text from an image. So far the discussion has been on theory and the resulting meth-
ods, but this chapter will detail the experiments and results that have been done with this
methodology.
This chapter will discuss the practical application of the proposed method and the modi-
fications necessary for this. These practical changes include data augmentation on the
training data set, a well-known approach in deep learning where the training data is
slightly altered in a random fashion. These variations lead to a better generalization of the
trained model. We will also discuss a two-dimensional forced alignment [124] approach
that serves as an initializer for the conditional random fields used in multi-dimensional
connectionist classification (MDCC). Both data augmentation and forced alignment were
applied to improve the method’s error rate on the data set used. The specific problems
addressed by these approaches are discussed in Sections 7.2 and 7.3.
Section 7.4 will discuss the specific topology of the deep neural network used for the
experiments. The network used for the experiments of this chapter is a combination of a
convolutional neural network (CNN) and recurrent neural network (RNN) in the form of
long short-term memory (LSTM) cells. The corresponding section refers back to Chapter
2 for the neural network topologies.
The IAM offline handwriting database[88] served as the basis for all experiments in
this chapter. It is the quasi-standard data set for evaluating and comparing segmentation
and transcription algorithms on handwritten English text. The experiments and results on
this data set using the previously discussed method and practical implementation are
detailed in Section 7.5. This section also includes comparisons with the methods discussed
in Chapter 3 that either address the same problem as MDCC or are current state-of-the-art
methods on the aforementioned IAM database.
7.2 Data Augmentation
The IAM offline handwriting database[88] was used for all the experiments described in
this chapter. The IAM database is a set of scanned pages of English handwritten text.
Each page of the IAM database was generated by selecting a text from the Lancaster-
Oslo/Bergen (LOB) corpus[62], printing it at the top of a blank physical sheet of paper and
then letting a human writer copy the text to the free area below the machine-printed text. Finally,
each physical page of handwritten text was scanned in 300 dpi resolution and stored as
a grayscale digital image. The truth label strings provided with the IAM database match
the handwritten text and reflect the line breaks as in the handwritten text, not the machine
printed text, and also include potential spelling errors. The IAM database was designed
primarily for training and evaluation of line- and word-wise handwritten text transcription
and segmentation methods. The handwritten text lines typically have spacing between
them. From the IAM offline handwriting database description[88, sect. 2]:
As the main focus of the research that led to the acquisition of the database
described in this paper is on high-level recognition using language models, we
wanted to make the image processing part as easy as possible. Therefore, it
was decided that the writers had to use rulers. These guiding lines, with 1.5
cm space between them, were printed on a separate sheet of paper which
was put under the form.
This poses the first reason for applying data augmentation to the IAM database in the
context of this thesis. The method proposed in this thesis is designed for transcribing
multi-line handwritten text without prior segmentation, even in the face of overlapping text
lines. As such, a certain amount of overlaps between text lines is expected in the training
data. Line overlaps will be artificially created by applying data augmentation.
The second reason for data augmentation is the amount of examples available in the
IAM database. All experiments in this chapter use the official split for the large writer
independent text line recognition task, which splits the IAM database into four sets (train-
ing, validation 1, validation 2 and test) without any overlaps of writers between the splits.
In the experiments, the training set was used for training the deep neural network using
multi-dimensional connectionist classification. The validation set 1 was used for hyper-
parameter tuning and model selection. The test set was used for evaluation and compar-
ison with existing works. Validation set 2 was not used in the experiments of this chapter,
but in those of Chapter 9. The number of examples in these splits is listed in Table 7.1.
The table shows that the number of examples, in the case of this work the number of
paragraphs, in the training and validation sets are on the lower end for robust training of
a deep neural network and the associated hyper-parameter optimization.
Table 7.1: Characteristic sizes of the large writer independent text line recognition task on the
IAM offline handwriting database.

                   Training   Validation 1   Validation 2   Test
Num. Paragraphs         747            105            115    232
Num. Lines             6161            900            940   1861
Num. Writers            283             46             43    128
To counter these two properties of the IAM database, data augmentation was applied
to the training and validation sets. Data augmentation is a technique in machine learning
by which the number of examples in a data set is artificially increased by modification
or perturbation of the original examples in a systematic way, although sometimes with
a random component. The test or evaluation data is typically not augmented since that
would result in distorted and not directly comparable results and error rates. The main
goal of data augmentation is to increase the number of examples used for automatic
parameter optimization or manual hyper-parameter optimization of the machine learning
model in order to reduce the likelihood of overfitting the data.
One data augmentation method applied to the training and validation sets in the experi-
ments in this chapter is to artificially reduce the line spacing between the text lines. This
can be done on the IAM database since the annotated data contains the segmentation
info on line level. This segmentation data was manually corrected by the authors of the
IAM database and is thus considered to be the ‘perfect’ segmentation without error. Using
this segmentation info, individual text lines were extracted and then combined again into a
whole paragraph image, but with each line moved upwards vertically by a fixed number of
pixels. The number of pixels by which the text lines were moved was based on a chosen
distance in millimeters and the known image resolution of 300 dpi. Since this reduced line
spacing results in overlaps in the text lines, as required for data augmentation purposes,
a pixel-wise logical OR-operation was applied to the text line images while merging. If
the pixel in question was dark from ink in either of the overlapping text lines, the resulting
merged pixel is dark in the augmented example. In typical grayscale image encoding
schemes, this means the minimal numerical pixel value was used. This way the IAM data
set was augmented by reducing the line spacing of the paragraphs in all examples of
the training and validation set by 3 mm, 5 mm and 10 mm. The truth label string in the
examples was not modified by this augmentation.
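The merging step can be sketched in a few lines. The sketch makes several assumptions that are not stated in the text: images are plain nested lists of grayscale values with 255 as white background, the millimeter distance is rounded to whole pixels, and the n-th line (counting from zero) is moved up by n times the shift so that every gap shrinks by the same amount. All names are hypothetical.

```python
def mm_to_pixels(mm, dpi=300):
    """Convert a distance in millimeters to pixels (25.4 mm per inch), rounded."""
    return round(mm / 25.4 * dpi)


def reduce_line_spacing(line_images, tops, lefts, canvas_h, canvas_w,
                        reduce_mm, dpi=300, background=255):
    """Re-assemble a paragraph from line crops with reduced line spacing.

    line_images[n]     2D list of grayscale values for text line n
    tops[n], lefts[n]  original position of the crop in the paragraph image
    """
    shift = mm_to_pixels(reduce_mm, dpi)
    canvas = [[background] * canvas_w for _ in range(canvas_h)]
    for n, (img, top, left) in enumerate(zip(line_images, tops, lefts)):
        new_top = top - n * shift  # assumption: shifts accumulate line by line
        for r, row in enumerate(img):
            for c, value in enumerate(row):
                rr, cc = new_top + r, left + c
                if 0 <= rr < canvas_h and 0 <= cc < canvas_w:
                    # Darker pixel wins: the minimum implements the logical OR of ink.
                    canvas[rr][cc] = min(canvas[rr][cc], value)
    return canvas
```

At 300 dpi, the three distances used here correspond to roughly 35, 59 and 118 pixels, respectively.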
Figure 7.2.1 shows one example from the IAM database. It is the example image as it
is in the database, but cropped to the minimal axis-parallel rectangle around the handwrit-
ten paragraph. This crop was necessary since the full paragraph images delivered with
the IAM database are digital scans of the whole page, including meta data and the truth
text in machine print. In order to retrieve the handwritten paragraphs, image cropping
was applied to all data examples within the training, validation and test sets. Figure 7.2.1
is thus an example of the images that were used in the experiments of this chapter
when no data augmentation is applied. These original paragraphs were cropped to
contain only the handwritten paragraph, but otherwise not modified.
Figure 7.2.1: Example from the IAM offline handwriting database. Cropped to the minimal axis-
parallel rectangle around the handwritten paragraph. The same crop was applied
to all training, validation and test data.
Figures 7.2.2, 7.2.3 and 7.2.4 show the same example paragraph from Figure 7.2.1
but with line spacing reduced by 3 mm, 5 mm or 10 mm respectively.
Figure 7.2.2: Example of Figure 7.2.1 with line spacing reduced by 3 mm.
Figure 7.2.3: Example of Figure 7.2.1 with line spacing reduced by 5 mm.
Figure 7.2.4: Example of Figure 7.2.1 with line spacing reduced by 10 mm.
A second form of data augmentation used in this work is to artificially increase the
number of paragraphs by splitting them into smaller parts of at least two text lines. This
again can be done since the annotated data of the IAM database contains the true line
segmentation information. Data augmentation was applied by generating sub-paragraphs
of at least two text lines by cropping the minimal axis-parallel rectangle around the se-
lected text lines according to the annotated segmentation information. This extraction
of sub-paragraphs was only done if the resulting axis-parallel rectangle around the text
lines did not intersect with the neighboring lines, meaning that no sliver of text from adjacent
text lines was included. Only if this was the case for the cropped image region, the
sub-paragraph was included in the augmented data set. This process of cropping sub-
paragraphs was done for all examples and for all possible combinations of starting and
ending text lines, given that the resulting crop would not overlap with non-included text
lines. At least two text lines were cropped per sub-paragraph in order to ensure that there
is at least one line separator remaining and as such the example still is a multi-line para-
graph. The annotated label string needed to be modified accordingly, by only using the
annotated text related to the cropped text lines.
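The enumeration of valid sub-paragraphs can be sketched as below. It assumes that each annotated text line is represented by an axis-parallel bounding box `(left, top, right, bottom)` and that the intersection test is done on these boxes; the function names are hypothetical. The text does not state whether the range covering all lines counts as a sub-paragraph, so the sketch simply includes it.

```python
def intersects(a, b):
    """True if two axis-parallel boxes (left, top, right, bottom) overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def sub_paragraph_crops(line_boxes):
    """Enumerate crops of at least two adjacent text lines.

    line_boxes is given in reading order. Returns a list of
    (first_line, last_line, crop_box) with 0-based, inclusive line indices.
    """
    crops = []
    n = len(line_boxes)
    for first in range(n):
        for last in range(first + 1, n):  # at least two lines per crop
            selected = line_boxes[first:last + 1]
            # Minimal axis-parallel rectangle around the selected lines.
            crop = (min(b[0] for b in selected), min(b[1] for b in selected),
                    max(b[2] for b in selected), max(b[3] for b in selected))
            # Reject the crop if it would contain a sliver of any other line.
            others = line_boxes[:first] + line_boxes[last + 1:]
            if not any(intersects(crop, o) for o in others):
                crops.append((first, last, crop))
    return crops
```

The label string of each crop is then obtained by joining the annotated text of lines `first` through `last`.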
Figure 7.2.5 shows the crop of the first two text lines for the example given in Figure
7.2.1. Figure 7.2.6 shows another sub-paragraph cropped from this example. In total
there were 15 sub-paragraphs extracted by data augmentation for this example at hand.
Figure 7.2.5: Example of a valid sub-paragraph from Figure 7.2.1 cropped to lines 1 and 2.
Figure 7.2.6: Example of a valid sub-paragraph from Figure 7.2.1 cropped to lines 2 through 4.
These two data augmentation methods, reducing the line spacing between text lines
and cropping sub-paragraphs of adjacent text lines, were applied to all examples of the
training set and validation set 1 of the IAM database. The test set was not augmented at
all. This drastically increased the number of examples in these sets. Table 7.2 shows the
number of examples after data augmentation. The number of training examples for the
experiments went up from 747 to 20698 and the number of validation examples from 105
to 3163.
Table 7.2: Number of examples in the large writer independent text line recognition task on the
IAM database after data augmentation.

                             Training   Validation 1   Test
Num. Original Paragraphs            -              -    232
Num. Reduced Line Spacing        2241            315      -
Num. Sub-Paragraph Crops        18457           2848      -
Sum total                       20698           3163    232
The data augmentation as described in the above paragraphs and the data splits of
Table 7.2 were used for all experiments in the remainder of this chapter.
7.3 Forced Alignment
Idea and One-Dimensional Forced Alignment
In Chapter 6 we have discussed multi-dimensional connectionist classification (MDCC),
the training algorithm proposed in this thesis. MDCC employs a conditional random field
to approximate the alignment of the truth label string over the two-dimensional pixel space
of the DNN estimation. We will now discuss a modification of this alignment method that
was used in the experiments of this chapter to improve the convergence speed of the
DNN training in the initial phase. This modification is based on the idea of forced align-
ment [124] for deep neural networks trained with connectionist temporal classification[46,
47].
The loss function of connectionist temporal classification implicitly computes the one-
dimensional alignment of the truth label string over the DNN estimation. This implicit
alignment can also be done explicitly. The loss function has to be changed to cross-
entropy in this case. This is where forced alignment comes in. In the initial phase of
training, the parameters of the deep neural network are random and then iteratively up-
dated, which leads to more or less random estimations from this DNN in the beginning.
In the case of random DNN estimations, connectionist temporal classification will still
produce a valid alignment of the truth label string. However, the localization of the char-
acters of this label string in the one-dimensional DNN estimate will be random, too. As
shown in the forced alignment paper[124], this hinders the DNN optimization in the initial
phase. Forced alignment combats this effect by explicitly generating an alignment based
on assumptions on how handwritten text is structured, but without taking the DNN esti-
mate into account, and uses this alignment to optimize the deep neural network via the
cross-entropy loss. These assumptions for the forced alignment are that each character
produces only a ‘spike’ (a narrow peak in probability for this character), that this spike
is either in the beginning, middle or end of the character and that each character is of
roughly the same width. After some epochs of training, forced alignment is replaced by
connectionist temporal classification for further optimization.
[Figure 7.3.1 content: the characters G, a, i, t, s, k, e, l, l, each framed by glyph separators εg]
Figure 7.3.1: One-dimensional forced alignment on an example word from the IAM offline hand-
writing database. The characters in the forced alignment are uniformly spaced.
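The uniform spacing of the one-dimensional forced alignment can be sketched as follows. This is only an illustration under assumptions: the function name is hypothetical, the spike is placed in the middle of each character cell (the text also allows the beginning or end), and the string "-" stands in for the glyph separator εg.

```python
def forced_alignment_1d(label, num_frames, blank="-"):
    """Return one symbol per frame: a single spike per character,
    the blank (glyph separator) everywhere else."""
    if not label or num_frames < len(label):
        raise ValueError("need at least one frame per character")
    cell_width = num_frames / len(label)      # every character gets the same width
    frames = [blank] * num_frames
    for index, char in enumerate(label):
        spike = int((index + 0.5) * cell_width)   # middle of the character cell
        frames[spike] = char
    return frames


if __name__ == "__main__":
    print("".join(forced_alignment_1d("Gaitskell", 27)))
```

The resulting frame-wise targets can be used directly with a cross-entropy loss, since every frame has exactly one target symbol.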
Two-Dimensional Forced Alignment
The same idea of using the truth label string and some assumptions about how multi-line
text is structured can be used to compute a two-dimensional forced alignment. In the
case of this work, the forced alignment is a two-step process: First, placing the text lines
within the two-dimensional pixel space. Second, placing the characters within each text
line in pixel space.
Placing the text lines is based on the assumption that prototypical text lines are ori-
ented horizontally and roughly of the same height. Forced alignment in 2d thus places
each text line as a perfectly horizontal, axis-parallel rectangle. All text lines are of the
same size or with the smallest height difference that is possible. Two adjacent text lines
are separated by a horizontal pixel row of line separators ϵl of exactly one pixel in height.
These line separators have a probability of 100 percent in their respective pixels of the
forced alignment. It is worth noting that these assumptions for forced alignment are de-
signed for stabilizing the MDCC training on the IAM database with its roughly horizontal
text lines. Figure 7.3.2 shows the line separators ϵl of the forced alignment of the example
in Figure 7.2.1.
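The placement of the text lines can be sketched in Python as shown here. The sketch assumes that the pixel rows left over after inserting the one-pixel line separators are distributed so that line heights differ by at most one pixel, with the extra rows given to the topmost lines; this tie-breaking rule and the function name are assumptions.

```python
def place_text_lines(height, num_lines):
    """Return the (top, bottom) pixel rows, inclusive, of each text line.
    Adjacent lines are separated by exactly one row of line separators."""
    usable = height - (num_lines - 1)         # rows left after the separators
    if num_lines < 1 or usable < num_lines:
        raise ValueError("pixel space too small for this many text lines")
    base, extra = divmod(usable, num_lines)
    bands, top = [], 0
    for index in range(num_lines):
        band_height = base + (1 if index < extra else 0)  # differ by at most 1
        bands.append((top, top + band_height - 1))
        top += band_height + 1                # skip the separator row
    return bands


if __name__ == "__main__":
    print(place_text_lines(12, 3))
```

The separator rows are the rows between two consecutive bands; in the forced alignment they carry a probability of 100 percent for the line separator.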
Figure 7.3.2: Line separator ϵl from two-dimensional forced alignment on the example from
Figure 7.2.1. Red encodes a high probability, blue a low one.
Placing characters such as the glyph separator ϵg or visible glyphs from the alphabet
requires another set of assumptions on the structure of text. They are assumed to be
placed left to right, occupying the full vertical range within their respective text line, being
of roughly the same width and having an unsharp transition between two adjacent charac-
ters. Forced alignment of the characters starts by calculating the width of each character
by dividing the width of the pixel space by the number of characters in the longest text line
of the truth label string. This character width in pixels is then applied to all text lines. We
then place the mid-points of each character in their text lines, beginning from the left with
a margin of half a character width to the left border and one character width between each
two adjacent characters. Converting these mid-points of the characters to probabilities
is done by placing a normal distribution over each mid-point, with the pixel coordinate of
the mid-point being the mean of the normal distribution. Character probabilities are then
drawn from these normal distributions. Normalizing these drawn probabilities to sum up
to exactly 100 percent per pixel yields the final probabilities for characters in 2d forced
alignment. Figure 7.3.3 shows this for the glyph separator ϵg. Figure 7.3.4 shows the
forced alignment of the glyph ‘e’ in the example from Figure 7.2.1.
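The character placement within one text line can be sketched as follows. The width of the normal distributions is not given in the text, so the parameter sigma_factor is an assumption, as are the function name and the representation of a text line as a plain sequence of symbols (glyph separators would have to be interleaved by the caller). The normalization is done relative to the largest exponent to avoid numerical underflow far away from all mid-points.

```python
import math


def character_probabilities(symbols, longest_len, width, sigma_factor=0.5):
    """Return, for each pixel column of one text line, a dict that maps
    each symbol to its normalized forced-alignment probability."""
    if not symbols or len(symbols) > longest_len:
        raise ValueError("line must be non-empty and not longer than the longest line")
    char_width = width / longest_len           # shared by all text lines
    sigma = sigma_factor * char_width          # assumed spread of each character
    # Half a character width of margin on the left, one width between mid-points.
    mids = [(i + 0.5) * char_width for i in range(len(symbols))]
    columns = []
    for x in range(width):
        exponents = [-0.5 * ((x - mid) / sigma) ** 2 for mid in mids]
        peak = max(exponents)
        weights = {}
        for symbol, exponent in zip(symbols, exponents):
            weights[symbol] = weights.get(symbol, 0.0) + math.exp(exponent - peak)
        total = sum(weights.values())
        columns.append({s: w / total for s, w in weights.items()})
    return columns


if __name__ == "__main__":
    for column in character_probabilities("ab", 2, 4):
        print(column)
```

Because of the normalization, the last character of a short line dominates all columns to its right, which is the effect addressed by the trailing space discussed in the text.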
Figure 7.3.3: Glyph separator ϵg from two-dimensional forced alignment on the example from
Figure 7.2.1. Red encodes a high probability, blue a low one.
Not all text lines have the same amount of characters in length and we have already
made the assumption to place the characters aligned to the left border and from there left-
to-right. This means there is potential unused space to the right of the individual text lines
in pixel space. Normalizing the probability vector in each pixel to a sum of 100 percent
results in the last character of each text line filling up the space up to the right border
of the pixel space. This is not how handwritten text is typically structured; instead, there
is white space to the right of each line. This is modeled by placing a space character
at the end of each text line, exactly at the rightmost pixel column. The probabilities for
this trailing space are then also drawn from a normal distribution with its mean at this
right border. Figure 7.3.5 illustrates the probabilities for the space glyph in the example
of Figure 7.2.1.
Figure 7.3.4: Glyph ‘e’ from two-dimensional forced alignment on the example from Figure 7.2.1.
Red encodes a high probability, blue a low one.
Figure 7.3.5: Space glyph from two-dimensional forced alignment on the example from Figure
7.2.1. Red encodes a high probability, blue a low one.
Placing the white space on the right border, as shown in Figure 7.3.5, is only appropriate
if it matches the truth label string and the text is expected to be left-aligned. Depending
on the data, this filler space can also be placed on the left border or on both borders. The experiments in this
chapter were done by aligning the text lines of the 2d forced alignment to the left border
since this is how the examples in the IAM offline handwriting database are written.
Using this technique, 2d forced alignment produces a soft-assignment with the same
properties, but different likelihoods, as the soft-assignment z approximated by loopy be-
lief propagation on a conditional random field as described in Section 6.6. This allows
replacement of the soft-assignment as estimated by the conditional random field by this
2d forced alignment. However, with the use of conditional random fields there is a better
way of integrating 2d forced alignment, which we will discuss in the next paragraphs.
Forced Alignment for Conditional Random Fields
We recall our discussion of Section 6.6 for the definition of a conditional random field
via its node and edge potential functions. In the case of multi-dimensional connectionist
classification, the node potential is defined by Equation 6.6.3. The node potential function
ψs gives the ‘compatibility’ between the character Cs of the label string l and a spatial
position s, which in MDCC is proportional to the estimated likelihood as given by the
soft-assignment y from the deep neural network.
For including forced alignment into the CRF, we simply include it in such a way that
the node potential function ψs is also proportional to it instead of only being dependent
on the DNN prediction y. We can control the influence of FA by weighting it in this new
node potential function
\[ \psi_s(C_s, l, y) = e^{\,k_{DNN} \times y^s_{l_{C_s}} + k_{FA} \times FA(s, C_s, l) + k_b} \tag{7.3.1} \]
where constant coefficients kDNN and kFA define the relative weighting of the DNN pre-
diction and the forced alignment. Constant kb is a bias that is identical for all spatial
positions and all characters of the label string.
The bias kb = 1 was kept constant for all pixels and glyphs. This bias was introduced
because it should be possible, in principle, for any glyph to occur in any pixel. This is
not always reflected in the node potential function with forced alignment because the pix-
els of the same character as estimated by the deep neural network and forced alignment
may be non-overlapping, leaving a ‘hole’ between. The conditional random field is then in
a self-contradicting state, sometimes called a ‘frustrated CRF’ in the literature, introduc-
ing an unwanted random element to MDCC. Allowing any glyph in any pixel combats this
phenomenon in MDCC by favoring a low-energy state over the whole CRF.
Both the estimation $y^s_{l_{C_s}}$ of the deep neural network and FA(s, Cs, l) are probabilities
in the value range of [0, 1], which makes choosing the constants kDNN and kFA easier.
When beginning to train the deep neural network, with model parameters initialized ran-
domly, the estimate $y^s_{l_{C_s}}$ will most likely also be a random low value for all pixels and all
glyphs. This changes as the training of the deep neural network progresses, with the estimated
likelihoods increasing for glyph-pixel combinations that the DNN deems correct and decreasing
otherwise. This means that the maximum value in the soft-assignment y increases
over training time. Weighting the DNN and forced alignment with constants kDNN and kFA
thus decreases the influence of the forced alignment over time. In this work, values of
kDNN = 3 and kFA = 1 were chosen by experimentation with different values.
Using the node potential function of Equation 7.3.1 instead of 6.6.3 introduces forced
alignment into multi-dimensional connectionist classification while decreasing the influ-
ence of the forced alignment over time. All experiments in this chapter were done using
a two-dimensional forced alignment in this style.
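The effect of the weighting in Equation 7.3.1 can be made concrete with a small numerical sketch. The constants are the values reported above; the function name and the example probabilities are hypothetical and only serve to illustrate how the balance between the forced alignment and the DNN prediction shifts.

```python
import math

# Values reported in the text for the weighting constants and the bias.
K_DNN, K_FA, K_B = 3.0, 1.0, 1.0


def node_potential(y, fa, k_dnn=K_DNN, k_fa=K_FA, k_b=K_B):
    """Node potential of Equation 7.3.1 for one spatial position and one
    character, given the DNN probability y and the forced-alignment
    probability fa (both in [0, 1])."""
    return math.exp(k_dnn * y + k_fa * fa + k_b)


if __name__ == "__main__":
    # Early training: the DNN is uninformative, so forced alignment decides.
    print(node_potential(0.05, 1.0) / node_potential(0.05, 0.0))
    # Later training: a confident DNN prediction outweighs the forced alignment.
    print(node_potential(0.9, 0.0) / node_potential(0.0, 1.0))
```

With kDNN larger than kFA, a DNN probability above roughly one third at a position already outweighs a forced-alignment probability of one at a competing position.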
7.4 Neural Network Model
Network Topology
In this section we will discuss the topology of the deep neural network used for the experi-
ments in this chapter. We will discuss the topology itself as well as ideas and observations
that led to this choice for the neural network model.
The overall type of deep neural network is a mixture of a convolutional neural network
and a recurrent neural network with the mixture being layer-wise, that is each layer is
either convolutional or recurrent. This choice is based on two trains of thought: First,
RNNs are well established and proven in the field of handwriting recognition. This trend
can be seen, starting with the publications[43, 45, 47] of Alex Graves on using multi-
dimensional long short-term memory (MDLSTM)[45] for handwriting recognition. Later
work[71, 102, 150] follows up on this trend. The second idea underlying this topology
comes from recent publications[12, 104] that discuss the possibility of using CNNs for
handwriting recognition.
A hybrid RNN-CNN model was thus chosen for this work to keep the benefits of the
implicit language modeling capabilities of LSTM networks, while gaining speed benefits
from using convolutional layers, which are well supported in GPGPU computing.
Better use of GPGPU capabilities is also the reason for using separable
MDLSTM[156] layers instead of ‘classic’ multi-dimensional LSTM[45]. Separable MDLSTM
only adds recurrent connections along a single dimension and not along all di-
mensions. Separable MDLSTM has been discussed in Section 2.3 of this thesis and a
visualization of a separable MDLSTM is shown in Figure 2.3.19. MDLSTM layers with
recurrent connections along all dimensions introduce dependencies into the computation
of the neural activation at each pixel in a way that only a small set of pixels can be
computed in parallel. One way of applying GPGPU processing to MDLSTM layers is to
compute the pixels on a common diagonal at the same time. This is implemented in
the RETURNN[29] library. However, separable MDLSTM is easy to implement in com-
mon deep learning frameworks and parallelizes computation by treating the columns and
rows of an image as mini-batches of one-dimensional sequences.
We will now continue with discussing the actual deep neural network topology used
in the following experiments. Similar to the schema of Figure 2.3.19, Figure 7.4.1 shows
the sequence of operations for a convolutional block in this work. This exact sequence
repeats in every convolutional block of the overall DNN topology.
[Figure 7.4.1 diagram: Input Feature Map → Convolution 2D → Batch Norm 2D → Non-Linear Activation σ(x) → Dropout → Output Feature Map]
Figure 7.4.1: Convolutional block as used in this work. It consists of a two-dimensional convolu-
tion, followed by batch normalization, a non-linear activation function and a layer-
wise dropout.
The convolutional block shown in Figure 7.4.1 consists of a two-dimensional
convolutional layer followed by Batch Normalization[60] and a non-linear activation func-
tion. Batch Normalization was added because of the practical observation that it improves
the convergence rate of the model error during training. Batch Normalization normalizes
the same feature map for all examples within a batch, or over a larger history of exam-
ples, to a mean value of approximately zero and a standard deviation of approximately one.
The non-linear activation function applied to all convolutional blocks was Leaky ReLU
(Rectified Linear Unit)[85], which is the piecewise linear function σ(x) = max(x, αx) with
0 < α < 1. For a positive value of x, Leaky ReLU is simply the identity function. Negative
values of x still result in a linear activation, but with a lower slope of α. Leaky ReLU in
practice has a high speed because of a low computational complexity and partly mitigates
the vanishing gradient effect by having a constant derivative of 1 for positive values and
of α (with a typical value of α = 0.01) for negative values.
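As a brief illustration, the Leaky ReLU activation and its derivative can be written in Python as follows. The function names are hypothetical, and the derivative at exactly x = 0 is set to α by convention in this sketch.

```python
def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive x, slope alpha for negative x."""
    return max(x, alpha * x)


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU: 1 for positive x, alpha otherwise."""
    return 1.0 if x > 0 else alpha


if __name__ == "__main__":
    for value in (-2.0, 0.0, 3.0):
        print(value, leaky_relu(value), leaky_relu_grad(value))
```

Because the derivative never becomes zero, a gradient signal is passed backwards even through units with a negative pre-activation.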
Dropout[102, 137] was added to the last three of the convolutional blocks of the DNN.
Dropout improves generalization and reduces overfitting of the deep neural network by
randomly removing feature maps during the training. This results in a certain degree of
redundancy in the neural network since no feature map is reliable by itself alone. During
inference, all feature maps are used without removal. Dropout was added only to the
last three convolutional blocks based on the idea that the convolutional blocks closer
to the input image learn to recognize geometrical features of handwritten glyphs, while
convolutional blocks higher up in the network learn more abstract linguistic features. This
means that dropout in the later layers reduces overfitting to specific text, whereas dropout
in the earlier layers affects generalization from specific writers. Reducing overfitting on
the higher-level layers seems more prudent in this case, especially since (because of the
pooling operations after the first convolutional blocks) there are effectively more
training data available for the first layers of the neural network. However, this choice of
dropout probably has to be adapted for different data sets.
Figure 7.4.4 shows the overall deep neural network topology as used in the experi-
ments in this chapter. The convolutional blocks are as described above and the recurrent
blocks are separable MDLSTM as shown in Figure 2.3.19.
The input into the DNN as shown in Figure 7.4.4 is a grayscale image of the IAM
offline handwriting database with augmentations as discussed in Section 7.2. The input
image is presented to the network in multiple variants, two of which apply binarization
methods in order to obtain a bimodal image.
Figure 7.4.2: Example paragraph image from the IAM database with the Otsu threshold applied.
The Otsu threshold [98, 130] computes a single scalar value as a threshold for sep-
arating the pixels of the digital image into two disjoint classes. The pixel assignment to
these two classes, lower and higher intensity than the threshold, is the resulting bimodal
image. The Otsu threshold is selected in such a way that the weighted sum of the variances
within the two classes is minimized. The method computes this threshold by first generating
the intensity histogram of the image, followed by iteratively testing each threshold for its
resulting intra-class variance based on this histogram. The threshold that minimizes the
intra-class variance is selected. Figure 7.4.2 shows an example paragraph with the Otsu
threshold applied.
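The histogram-based search for the Otsu threshold can be sketched in plain Python. The sketch assumes integer intensities in the range 0 to 255 given as a flat list, and the function name is hypothetical. It uses the well-known equivalent formulation of maximizing the between-class variance, which selects the same threshold as minimizing the intra-class variance.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that minimizes the intra-class variance.
    Pixels with intensity <= t form one class, pixels > t the other."""
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1
    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    count_low, sum_low = 0, 0
    for t in range(levels - 1):
        count_low += hist[t]
        sum_low += t * hist[t]
        count_high = total - count_low
        if count_low == 0 or count_high == 0:
            continue                           # one class would be empty
        mean_low = sum_low / count_low
        mean_high = (total_sum - sum_low) / count_high
        # Proportional to the between-class variance; maximizing it is
        # equivalent to minimizing the weighted intra-class variance.
        between = count_low * count_high * (mean_low - mean_high) ** 2
        if between > best_between:
            best_t, best_between = t, between
    return best_t


if __name__ == "__main__":
    image = [10] * 50 + [200] * 50
    t = otsu_threshold(image)
    print(t, [value > t for value in image][:3])
```

Applying the comparison value > t to every pixel then yields the bimodal image.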
Figure 7.4.3: Example paragraph image from the IAM database with Yen’s method applied.
Similar to the Otsu threshold, Yen’s method [130, 159] is a histogram-based approach
to image thresholding. The first step in this method is to compute the histogram of in-
tensity values of the digital image. The threshold itself is calculated by a weighted com-
bination of multiple threshold candidates in order to maximize the entropy within the
two classes. Again, assigning pixels to the two classes according to this threshold value
produces the binary image. Figure 7.4.3 shows an example for Yen’s method.
In the experiments in this chapter, paragraph images are presented to the DNN in four
different variants:
1. The grayscale image with the values normalized such that the mean value is 0 and
the standard deviation 1 within the current image.
2. The grayscale image with normalization to mean 0 and standard deviation 1 over
the whole training data set. The same normalization was also applied to samples
outside the training data set.
3. A bimodal image produced by applying the Otsu threshold to the grayscale image.
4. The bimodal image from application of Yen’s method.
All four variants of the example image were provided to the DNN so that it can
learn to make good use of these variants during training.
The bimodal images are based on the idea that grayscale images of handwritten text
basically consist of two different types of material: untouched paper and paper colored
by ink. The difference between the two is the color of the pixel. Producing a bimodal
image thus separates the actual writing from the background and reduces the low-level
image operations that the DNN has to learn during training. This frees up filters in the
first convolutional and recurrent blocks for other operations.
[Figure 7.4.4 diagram, layers in order from input to output:]
1. Input Feature Map: 1) image-wise normalization, 2) dataset-wide normalization, 3) local binarization, 4) global binarization
2. Convolutional Block: 5x5 kernel, 16 filters, Leaky ReLU activation, no dropout
3. Average pooling by 3x3
4. Separable MDLSTM: 32 filters, Tanh activation
5. Convolutional Block: 5x5 kernel, 48 filters, Leaky ReLU activation, no dropout
6. Average pooling by 3x3
7. Separable MDLSTM: 64 filters, Tanh activation
8. Convolutional Block: 5x5 kernel, 80 filters, Leaky ReLU activation, 25% dropout
9. Average pooling by 2x2
10. Separable MDLSTM: 96 filters, Tanh activation
11. Convolutional Block: 5x5 kernel, 112 filters, Leaky ReLU activation, 25% dropout
12. Separable MDLSTM: 128 filters, Tanh activation
13. Convolutional Block: 5x5 kernel, 144 filters, Leaky ReLU activation, 25% dropout
14. Separable MDLSTM: 160 filters, Tanh activation
15. Estimated Soft-Assignment: convolution with 1x1 kernel and one filter per glyph of the alphabet, followed by a pixel-wise softmax function
Figure 7.4.4: Deep neural network as used in this work and following experiments. It is a combi-
nation of a CNN and RNN with alternating layers of two-dimensional convolutions
and separable MDLSTM. When used for one-dimensional transcription, a collapse
layer is added between the final 1x1 convolution and the softmax function.
Figure 7.4.4 shows the overall topology of the deep neural network used in this thesis.
It consists of ten alternating convolutional and recurrent blocks, five of each type.
The first three convolutional blocks are followed by average pooling operations. In the
course of the experiments, both maximum pooling and average pooling operations were
used. Average pooling showed a slightly lower error rate than maximum pooling. The
final layer of the DNN is a pixel-wise feed forward layer, which is equal in function to a
convolution with a kernel size of one by one pixels, with as many neurons as there are
glyphs in the alphabet. This layer provides the estimate for assigning the specific glyph to
the specific pixel in question. Normalization to probabilities is implemented by applying a
pixel-wise softmax function to this estimate.
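The pixel-wise softmax can be sketched as follows. The nested-list layout scores[row][column][glyph] and the function name are assumptions made for this illustration; subtracting the maximum score before exponentiation is a common measure for numerical stability and does not change the result.

```python
import math


def pixelwise_softmax(scores):
    """Normalize the raw glyph scores of every pixel to probabilities.
    scores[row][col] is a list with one raw score per glyph."""
    result = []
    for row in scores:
        out_row = []
        for logits in row:
            peak = max(logits)                         # for numerical stability
            exps = [math.exp(value - peak) for value in logits]
            total = sum(exps)
            out_row.append([e / total for e in exps])
        result.append(out_row)
    return result


if __name__ == "__main__":
    soft_assignment = pixelwise_softmax([[[2.0, 0.0, -1.0], [0.0, 0.0, 0.0]]])
    print(soft_assignment)
```

Each pixel is normalized independently of all other pixels, so the result is a probability distribution over the alphabet at every spatial position.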
This same DNN topology can also be used for one-dimensional or line-wise handwrit-
ing recognition in combination with connectionist temporal classification. In this case a
collapse layer is added in between the very last convolution, which predicts the glyph to
pixel assignments, and the softmax function. The collapse layer is described in Section
3.1.
Optimization Method
Optimization of the deep neural network parameters was done automatically by applying
multi-dimensional connectionist classification as discussed in Chapter 6 in combination
with backpropagation[112, 113] and gradient descent[11, 65, 107]. See Section 2.3 for
information on these two techniques.
Gradient descent uses the first-order derivative of the loss function or optimization
function in general in order to incrementally change the model parameters towards a
better solution. There are other optimization methods, based on gradient descent, that
include estimated second-order information to improve on shortcomings in gradient de-
scent. All experiments in this thesis were done using Adam[67] with a base learning rate
of 0.001. The reasoning for using Adam was twofold: First, it is a method that estimates
a suitable learning rate per parameter of the model and adapts these parameter-wise
learning rates over the course of the training. In theory, this should show a better behav-
ior in cases with a gradient of low magnitude, e.g. near local optima or saddle points. In
practice Adam showed a fast and stable convergence rate. The second reason for using
Adam was that it can be used for non-stationary loss functions. This is the case in MDCC
since the aligned soft-assignment changes at every iteration, even for training examples
already processed before. This means that even though MDCC falls into the category of
supervised learning, the targets in the training data set actually change at every epoch.
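For reference, a single Adam update step can be sketched in plain Python. Only the base learning rate of 0.001 is taken from the text; the decay rates beta1 = 0.9 and beta2 = 0.999 and the constant eps = 1e-8 are the defaults suggested in the Adam publication and are assumptions here, as is the function name.

```python
import math


def adam_step(params, grads, m, v, step, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place. m and v hold the running first and
    second moment estimates per parameter; step counts from 1."""
    for i, grad in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * grad
        v[i] = beta2 * v[i] + (1 - beta2) * grad * grad
        m_hat = m[i] / (1 - beta1 ** step)             # bias correction
        v_hat = v[i] / (1 - beta2 ** step)
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params


if __name__ == "__main__":
    params, m, v = [1.0, -1.0], [0.0, 0.0], [0.0, 0.0]
    # Gradients of very different magnitude lead to steps of similar size.
    print(adam_step(params, [0.5, -100.0], m, v, step=1))
```

The division by the square root of the second moment estimate is what yields the parameter-wise effective learning rates mentioned above.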
In the course of the experiments, the deep neural network was first initialized with
random parameters. Afterwards, pre-training was performed by adding a collapse
layer as described above and training the DNN with connectionist temporal classification.
This training with CTC was done using the single line examples provided by the IAM
offline handwriting database. Pre-training was done for a fixed number of 25 epochs
using a mini-batch size of 16. After pre-training, the collapse layer was removed and the
DNN trained using multi-dimensional connectionist classification with a mini-batch size
of 8. Pre-training was done to allow the filters in the convolutional layers to adapt to
recognizing the geometric shapes inherent in handwritten text.
7.5 Experiments and Results
General Approach
Chapters 5 and 6 detailed the decoding algorithm and training mechanism as proposed
in this work for multi-dimensional connectionist classification. Section 7.2 explained the
use of the IAM offline handwriting database and Sections 7.3 and 7.4 the two-dimensional
forced alignment method and the deep neural network model as used for experimentation
and evaluation. The experiments in this section used the continuous separators variant
of Algorithm 5.4.2 for finding lines within the DNN prediction. The experiments further
applied the beam search variant of Algorithm 5.5.2 for decoding each line to uncover its
label sequence.
No explicit language model was employed during the course of the described exper-
iments. The only language model in use was the one implicitly learned by the recurrent
neural networks during training via CTC or MDCC. The alphabet used for transcription
consisted of all characters occurring in the ground truth texts of the IAM offline handwrit-
ing database. No mapping between upper and lower case characters was done, which
means wrong capitalization in the transcription negatively impacts the error rate. The
same alphabet was used for all methods and experiments in this chapter.
The first step towards evaluation of experiments is to decide on a measurement of the
error that the transcription method in question produced. A commonly used error metric
in offline handwriting recognition is the character error rate (CER)
\[ \mathrm{CER}(y, l) = 100 \times \frac{\mathrm{Edit}(\mathrm{Decoder}(y), l)}{|l|} \tag{7.5.1} \]
which measures the ratio of the Edit-distance[81, 151] between the transcribed string and
truth label string l and the length |l| of the truth label string. Symbol y in Equation 7.5.1 is
again the soft-assignment as estimated by the deep neural network. Roughly speaking,
the CER measures the ratio between the number of wrongly transcribed characters and
the total number of characters. Note that the CER has a lower limit of zero, which is
reached if the transcribed string and the truth label string are identical, but no upper
limit: values over 100 are possible, for example if the transcription is much longer
than the truth label string. We will use the CER as the measure for
comparison of different transcription methods in the following paragraphs.
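The character error rate of Equation 7.5.1 can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used for the experiments: the function names are hypothetical, the edit distance is the standard Levenshtein distance with unit costs, and the decoder output is assumed to be available as a plain string.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b with unit costs."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(transcription, truth):
    """CER in percent; the truth label string must not be empty."""
    if not truth:
        raise ValueError("truth label string must not be empty")
    return 100.0 * edit_distance(transcription, truth) / len(truth)


if __name__ == "__main__":
    print(character_error_rate("Hello", "hello"))   # wrong capitalization counts
    print(character_error_rate("aaaa", "a"))        # values over 100 are possible
```

The comparison is case-sensitive, which matches the statement above that wrong capitalization negatively impacts the error rate.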
Figure 7.5.1 shows the character error rate of Equation 7.5.1 as it was measured
during training of the deep neural network for multi-line handwriting recognition. The
DNN was first pre-trained using CTC on the provided ground truth line segmentation for
25 epochs and then switched to MDCC for the remainder of the training until convergence.
This switch is indicated by the vertical line at epoch 25 of the diagram.
The model parameters used for the multi-dimensional connectionist classification ex-
periments in this chapter and Chapter 9 were reached after 59 epochs of training, 25
epochs of line-wise pre-training with CTC and 34 epochs of paragraph-wise training with
MDCC. These model parameters produced a CER of 2.53 on the training data, 8.41 on
the validation data and 10.22 on the evaluation data when applied to paragraph-wise
transcription.
The computing hardware used for these experiments was an Intel Core i5 6500 and
an Nvidia GeForce GTX 1080 Ti. A single epoch of training on the training set of the
IAM offline handwriting database, augmented as detailed in Section 7.2, with subse-
quent calculation of the CER on all three data sets (training, validation and evaluation)
took between 8 and 12 hours. The earlier epochs of training with MDCC took longer
since loopy belief propagation takes longer to converge to a stable point while the DNN-
predicted soft-assignment is noisy or contains many errors. Loopy belief propagation in
multi-dimensional connectionist classification converges faster if the soft-assignments as
predicted by the DNN and CRF are already similar. The training of Figure 7.5.1 ran for a
total of 14 days.
Comparison with Line-Level Transcription
Figure 7.5.1: Convergence of the character error rate while training the deep neural network with
multi-dimensional connectionist classification. Red indicates the CER on the train-
ing set, blue on the validation set. The vertical line at epoch 25 marks the transition
from line-wise pre-training using CTC to paragraph-wise training using MDCC.
The first experiments done in this work were aimed at the question of whether multi-dimensional
connectionist classification, that is paragraph-wise transcription, performs better than
connectionist temporal classification, that is line-wise transcription, for offline handwrit-
ing recognition and, if so, in which cases. The deep neural network was trained on the
IAM offline handwriting database as described above. The same DNN topology, but with
different parameters, was trained on the ground truth line segmentation as provided in the
IAM offline handwriting database. For this, the collapse layer was added in between the
final convolutional layer and the softmax function. The paragraph-level network was pre-
trained with CTC as discussed. The line-level network was directly trained with CTC until
convergence. The result was two deep neural networks with the same topology, except for
the collapse layer, and trained on the same data set, but with parameters optimized
for either paragraph-level transcription or line-level transcription.
Evaluation of these two deep neural networks included the transcription of the test
data set from the Large Writer Independent Text Line Recognition Task. The offset be-
tween each two adjacent text lines was reduced by a constant number of millimeters
in order to evaluate the sensitivity of the methods to overlapping text lines. This can
easily be done on the IAM offline handwriting database since the ground truth line seg-
mentation, which is considered ‘perfect’, is provided and the resulting line images can be
artificially moved closer together. A new segmentation into lines was then necessary since
this artificial line offset invalidated the provided line segmentation.
Three publicly available line segmentation algorithms were applied to obtain new line
images from the modified database. Two were based on open-source optical charac-
ter recognition (OCR) software in the form of Tesseract and GNU Ocrad. Tesseract¹
is an end-to-end OCR system that has layout analysis and text segmentation included.
GNU Ocrad² is also an end-to-end OCR system in the style of Tesseract. The line-
segmentation algorithms implemented and provided by these two OCR systems were
used for obtaining segmented line images from the paragraph images of the IAM of-
fline handwriting database. An A*-based line segmentation method[140] was applied in
addition to these two open-source systems. A publicly available implementation³ of this
method was used. This method of using A* path planning for line segmentation is referred
to as A* Paths in Tables 7.3, 9.2, 9.4, 9.6 and 9.4.1.
¹ https://github.com/tesseract-ocr/tesseract/, version 4.1.1
² https://www.gnu.org/software/ocrad/, version 0.27
Table 7.3 contains the character error rates that were measured using the two deep
neural networks as stated, in combination with varying amounts of artificially reduced
line offset and while applying Tesseract, GNU Ocrad or A* Paths as line segmentation algorithms.
Connectionist temporal classification on the ground truth line segmentation outperforms
every other described method by a margin. This is not really surprising since the ground
truth line segmentation in the IAM offline handwriting database has been manually veri-
fied and is considered ‘perfect’. However, even using automatic line segmentation on the
unmodified data adds roughly 9 points of character error rate to the line-wise transcription
using CTC. Artificially decreasing the line spacing increases the error rate for both CTC
and MDCC, but with a slower increase for MDCC.
Table 7.3: Average character error rates (CER) for connectionist temporal classification (CTC)
and multi-dimensional connectionist classification (MDCC) on full paragraphs of the
test set of the IAM offline handwriting database while using different line offsets and
line segmentation methods. The last column gives the percentage of examples where
MDCC produces a lower CER than CTC.
Line Offs.   Line Segm.     CER (CTC)   CER (MDCC)   Examples where MDCC better than CTC
0 mm         Ground Truth        7.94        10.22   15.09%
0 mm         Tesseract          16.74        10.22   63.79%
0 mm         Ocrad              18.48        10.22   67.24%
0 mm         A* Paths           16.31        10.22   71.55%
3 mm         Tesseract          20.53        10.80   68.53%
3 mm         Ocrad              16.87        10.80   68.10%
3 mm         A* Paths           19.09        10.80   75.86%
5 mm         Tesseract          27.58        12.76   75.86%
5 mm         Ocrad              20.19        12.76   64.22%
5 mm         A* Paths           24.83        12.76   81.47%
10 mm        Tesseract          74.77        31.20   95.26%
10 mm        Ocrad              56.87        31.20   90.52%
10 mm        A* Paths           63.77        31.20   95.26%
Average      Tesseract          34.90        16.24   -
Average      Ocrad              28.10        16.24   -
Average      A* Paths           31.00        16.24   -
Comparison with Attention Networks
Section 3.2 discussed the application of attention networks to offline handwriting
recognition, specifically the work of Bluche[8] on line-wise paragraph transcription. Since
both the attention networks and multi-dimensional connectionist classification tackle the
same problem of segmentation-free multi-line offline handwriting recognition, a comparison
was in order. The attention network method was reimplemented to facilitate a direct
comparison to MDCC. As in the original, the reimplemented attention network consisted
of three deep neural networks:
1. The encoder network extracts high-level features from the input image once per
example. The encoder network used in the experiments was identical to the DNN
described in Figure 7.4.4 except that the final convolutional layer and softmax
function were replaced by a convolutional block with 256 filters. This way the model
3https://github.com/smeucci/LineSegm, version of August 6th, 2020
capacities of the attention network encoder and the MDCC network are similar.
The difference between the two is that the encoder produces 256 feature maps for
further processing while the MDCC network directly estimates glyph probabilities.
2. The attention mechanism is a neural network that performs line-by-line extraction
of meaningful features from the encoder in order to transform the two-dimensional
image into a one-dimensional sequence of feature vectors. The attention mecha-
nism in the reimplemented network consisted of two layers of separable MDLSTM,
see Section 2.3, with 128 filters each. This was followed by a convolution with a
1x1 kernel, 1 filter and a column-wise softmax over the whole encoded image. This
yielded the probability that a specific pixel should be extracted from the encoder in
order to be part of the representation of the current text line. The original attention
mechanism consisted of two MDLSTM layers with 32 filters each. The increase in
the number of filters is to compensate for the switch from MDLSTM to separable
MDLSTM.
3. The decoder network transcribes the one-dimensional sequence of feature vectors
provided by the encoder network and attention mechanism. Its output is the
transcription as trained with connectionist temporal classification. It consisted of
one layer of bi-directional LSTM with 256 filters and a convolution with a 1x1 ker-
nel and as many filters as glyphs in the alphabet. The activation function was the
softmax function in order to obtain glyph probabilities as required by CTC. This is
identical to the original work.
The attention mechanism in this deep neural network was executed a constant
number of times, each time receiving its last output as additional input. As discussed
in Section 3.2 this can be seen as a sort of recursion over the whole DNN. Each time
the output of the attention mechanism was used to extract features from the encoder
and collapse them to a one-dimensional sequence. These one-dimensional sequences
represent the text lines within the overall paragraph and were concatenated to one large
sequence for the full paragraph. This attention loop was performed 12 times since the
IAM offline handwriting database has a maximum of 12 text lines per paragraph.
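The attention loop described above can be sketched in a few lines. This is a simplified illustration and not the reimplemented network: the MDLSTM-based scoring network is replaced by a hypothetical placeholder `random_scores`, and plain Python lists stand in for tensors. It only shows the column-wise softmax, the collapse of the vertical dimension and the concatenation over a fixed number of attention steps.

```python
import math
import random


def column_softmax(scores):
    """Normalize a height x width score map so that every column sums to one."""
    height, width = len(scores), len(scores[0])
    weights = [[0.0] * width for _ in range(height)]
    for x in range(width):
        column = [scores[y][x] for y in range(height)]
        peak = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(value - peak) for value in column]
        total = sum(exps)
        for y in range(height):
            weights[y][x] = exps[y] / total
    return weights


def collapse(features, weights):
    """Weighted sum over the vertical axis: channels x height x width -> width x channels."""
    channels, height, width = len(features), len(features[0]), len(features[0][0])
    return [
        [sum(weights[y][x] * features[c][y][x] for y in range(height)) for c in range(channels)]
        for x in range(width)
    ]


def attention_loop(features, score_fn, steps=12):
    """Run the attention step a fixed number of times and concatenate the line sequences."""
    sequence, previous = [], None
    for _ in range(steps):
        # The previous attention map is passed back in as additional input.
        previous = column_softmax(score_fn(features, previous))
        sequence.extend(collapse(features, previous))
    return sequence


def random_scores(features, previous):
    """Hypothetical placeholder for the scoring network; ignores its inputs."""
    height, width = len(features[0]), len(features[0][0])
    return [[random.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
```

With an encoded image of width W, every attention step contributes W feature vectors, so 12 steps yield a sequence of length 12 W that is then handed to the decoder.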
Using the deep neural network from MDCC as the encoder in the attention network
results, in principle, in the same capacity for extracting meaningful high-level features in
both methods. The attention network gained some additional model capacity in compar-
ison to MDCC by adding multiple LSTM layers in the attention mechanism and decoder
network, which are not present in the MDCC neural network. Both the MDCC neural net-
work and the attention network were trained on the same augmented data from the IAM
offline handwriting database with identical mini-batch sizes and parameters for the Adam
optimizer. The training of the MDCC network was done as discussed above.
Table 7.4 contains the error rates for these experiments, comparing the attention
networks method for paragraph-wise transcription with MDCC. It shows that the attention
networks produce a consistently lower error rate than MDCC. When evaluated per example,
MDCC still resulted in a lower error in roughly a quarter of the examples for line offsets
of up to 5 mm, but only in 7.76% of the examples at 10 mm.
Table 7.4: Comparison of the character error rates (CER) between MDCC and attention networks
with CTC in a layout similar to Table 7.3. Transcription was done on full paragraphs
in 300 dpi resolution.

Line Offs.  CER Attn.  CER MDCC  Examples where MDCC better than Attn.
0 mm             9.01     10.22  28.88%
3 mm             9.27     10.80  26.72%
5 mm             9.86     12.76  21.12%
10 mm           22.44     31.20  7.76%
Average         12.64     16.24  -

Table 7.5 shows the time measurements for transcription using attention networks and
MDCC. Both methods were implemented using the PyTorch4 deep learning library in
version 1.7.0 with Intel MKL5 and Nvidia CUDA6 for hardware acceleration. The hardware
in use was an Intel Core i5 6500 with 4 cores at 3.2 GHz for execution on a CPU and an
Nvidia GeForce GTX 1080 Ti for execution on a GPU. Linux was used as the operating
system. The decoding algorithm was multi-threaded and ran in parallel for the examples
within the same mini-batch. The timings show that MDCC uses roughly half the execution
time of the attention network approach, except in the case of a mini-batch size of one
example while using GPU acceleration. This result is unsurprising since the attention
networks use the same encoder DNN as the MDCC method, with additional neural networks
for the attention mechanism and decoder. Still, it highlights that MDCC can be
implemented and executed with only a single pass through a deep neural network.
4https://pytorch.org/
5https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html
6https://developer.nvidia.com/cuda-zone
The exception that MDCC is slower with a mini-batch size of one example and using
the GPU for hardware acceleration can be explained by the complexity of the decoding
algorithms for CTC and MDCC. In both cases the decoding algorithm was always executed
on the CPU, even when using GPU acceleration. The decoding algorithm for CTC has a
lower runtime since it decodes one-dimensional sequences and does not need to identify
likely line separators. This means that even though the deep neural network in the
attention network approach takes longer to execute, the longer runtime of the MDCC
decoding algorithm outweighs this difference. This effect is mitigated for larger
mini-batches where the decoding algorithm is executed in parallel for the examples
within the batch.
Table 7.5: Runtime comparison between MDCC and attention networks for full paragraphs and
different batch sizes. Decoding for batch sizes greater than one used multiple threads.
Time measurements include the total of 1539 paragraphs from the IAMDB.

Batch Size  Hardware  Wall Clock Attn.  Wall Clock MDCC  Avg. per Example Attn.  Avg. per Example MDCC
1           CPU       51 m 0 s          22 m 32 s        1988 ms                 878 ms
1           GPU       8 m 45 s          9 m 15 s         341 ms                  360 ms
4           GPU       7 m 15 s          4 m 11 s         282 ms                  163 ms
8           GPU       7 m 6 s           3 m 18 s         276 ms                  128 ms
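The per-example averages in Table 7.5 follow directly from the wall clock times and the 1539 paragraphs. The short calculation below is a sketch of that arithmetic. It assumes that fractions of a millisecond are truncated, which reproduces the per-example columns of the table. The dictionary layout and function name are hypothetical.

```python
NUM_PARAGRAPHS = 1539  # number of paragraphs stated in the caption of Table 7.5

# (batch size, hardware) -> wall clock seconds for (attention network, MDCC)
WALL_CLOCK = {
    (1, "CPU"): (51 * 60 + 0, 22 * 60 + 32),
    (1, "GPU"): (8 * 60 + 45, 9 * 60 + 15),
    (4, "GPU"): (7 * 60 + 15, 4 * 60 + 11),
    (8, "GPU"): (7 * 60 + 6, 3 * 60 + 18),
}


def per_example_ms(seconds: int, examples: int = NUM_PARAGRAPHS) -> int:
    """Average runtime per example in whole milliseconds (fractions truncated)."""
    return seconds * 1000 // examples


for (batch, hardware), (attn, mdcc) in WALL_CLOCK.items():
    ratio = mdcc / attn  # below 1.0 means MDCC is faster
    print(batch, hardware, per_example_ms(attn), per_example_ms(mdcc), round(ratio, 2))
```

The ratio of MDCC to attention network runtime lies between roughly 0.44 and 0.58 for three of the four configurations and is about 1.06 for a batch size of one on the GPU.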
Comparison with Published Work
Table 7.6 contains a comparison of multi-dimensional connectionist classification with
other methods for offline handwriting recognition as published in the literature. All char-
acter error rates are as reported on the test set of the IAM large writer independent text
line recognition task and use an image resolution of 300 dpi if not stated otherwise. Some
methods employ data augmentation or normalization techniques and some use explicit
language models. This is stated in the corresponding entry. Overall these results show
a clear tendency to favor line-wise transcription methods using connectionist temporal
classification if a reliable line segmentation is available, which is the case with the ground
truth line segmentation from the IAM offline handwriting database.
Table 7.6: Comparison of the character error rates (CER) between MDCC and methods proposed
in other publications. The error rates are on the test set of the Large Writer Independent
Text Line Recognition Task on the IAM offline handwriting database. Line-wise
transcriptions use the ground truth segmentation as provided in the database.

Type            Method                                                  CER
Paragraph-wise  Multi-dimensional connectionist classification          10.22
Paragraph-wise  Attention networks (as reproduced)                      9.01
Paragraph-wise  Bluche[8] no language model                             7.9
Paragraph-wise  Bluche[8] with language model                           5.5
Paragraph-wise  Bluche et al.[9] at 150 dpi                             16.2
Paragraph-wise  Singh et al.[135] at ∼145 dpi                           6.7
Paragraph-wise  Singh et al.[135] at ∼145 dpi with augmentation         6.3
Paragraph-wise  Coquenet et al.[21]7                                    5.45
Paragraph-wise  Coquenet et al.[20]7                                    4.32
Line-wise       Connectionist temporal classification (as reproduced)   7.94
Line-wise       Voigtlaender et al.[150]                                3.5
Line-wise       Doetsch et al.[28]                                      4.7
Line-wise       Pham et al.[102] no language model                      10.8
Line-wise       Pham et al.[102] with language model                    5.1
Line-wise       Kozielski et al.[71] with deslanting                    10.9
Line-wise       Kozielski et al.[71] with deslanting and moments        5.5
Line-wise       Puigcerver[104] no distortions                          8.3
Line-wise       Puigcerver[104] with distortions                        6.2
Table 7.6 compares MDCC to other paragraph-level transcription methods. The work by
Bluche et al.[8, 9] is discussed in Section 3.2. It results in a lower CER, but is limited
to a specific DNN topology in the form of attention networks. The time measurements in
Table 7.5 indicate that MDCC is faster than this type of attention network for paragraph
transcription. Singh et al.[135] also apply an attention-based DNN, discussed in
Section 3.2, with the same drawbacks. Coquenet et al.[20, 21] use an unofficial data
split for evaluation. SPAN[21] is based on reshaping a CNN, as detailed in Section 3.3,
which works well if the text lines are oriented in a roughly horizontal fashion with a
similar and uniform height per text line. Coquenet et al.[20] also apply an attention
network for paragraph transcription. Its attention mechanism collapses the horizontal
dimension, which introduces the further restriction that the attention windows are
perfectly horizontal.
Comparison with Forced Alignment Only
Section 7.3 discussed the idea of forced alignment in both the one-dimensional and
two-dimensional cases and how to integrate forced alignment with MDCC. Forced alignment
on its own already produces a soft-assignment that encodes the truth text, and this forced
alignment soft-assignment is applied to the conditional random field (CRF) in MDCC as
a prior. This raises the question of how much the soft-assignment as estimated by loopy
belief propagation (LBP) adds to the transcription method. An experiment was performed
to compare the transcription of a deep neural network trained with forced alignment only
to that of one trained with MDCC. The setup of data sets, deep neural network architecture
and decoding algorithm was identical to the transcription with MDCC. The only difference
to the DNN trained with MDCC was that no inference using LBP was done. Instead, the
soft-assignment provided by forced alignment was directly used as the target
soft-assignment in the cross-entropy loss while optimizing the DNN parameters.
7Using an unofficial split of the training, validation and evaluation data.
The deep neural network parameters after 25 epochs of line-wise pre-training were
used as the starting parameters for this experiment. The following paragraph-wise train-
ing was done using the soft-assignment of forced alignment as the target values. The
best model parameters yielded character error rates of 3.55 on the training data, 11.24
on the validation data and 12.85 on the evaluation data after an additional 187 epochs of
training with forced alignment only.
Training the DNN for transcription using MDCC yielded an evaluation set CER of 10.22
after only 34 epochs of paragraph-wise training. Thus MDCC resulted in a decrease
of the character error rate of 2.63 points while also decreasing the number of training
epochs required by a total of 153. MDCC both reduced the error rate and increased the
convergence rate during training when compared with forced alignment only.
7.6 Discussion
Section 7.5 detailed the experiments with multi-dimensional connectionist
classification, its evaluation and its comparison to established methods. Of special
interest was the comparison to line-wise transcription using connectionist temporal
classification and to the paragraph-wise transcription using attention networks. We will
now discuss the implications of these results and how these methods compare beyond
their respective error rates.
The first set of experiments addressed the question of how multi-dimensional
connectionist classification on the paragraph level compares to connectionist temporal
classification on the line level. Examples from the IAM offline handwriting database
were modified by artificially reducing the spacing between text lines by a fixed number of
millimeters. This was followed by a transcription with the method in question. In the case
of CTC, two publicly available OCR implementations and an A*-based method were applied
for line segmentation. The results are detailed in Table 7.3. This experiment shows that
line-wise transcription becomes increasingly difficult with more overlap between text
lines. Whenever an automatic line segmentation was used, MDCC performed better than CTC
with an identical deep neural network topology and training paradigm. It stands to reason
that this is a general benefit of paragraph-wise transcription over line-wise transcription
in the face of overlapping, and thus hard to segment, text lines. MDCC provides an
advantage over CTC when applied to difficult to segment paragraphs, as can be deduced
from the lower error rates of MDCC when artificially reducing the line spacing.
The second set of experiments evaluated the attention networks method by Bluche[8]
and compared it against MDCC in both error rate and runtime. Again this was done using
artificially decreased spacing between text lines. Table 7.4 compares the character error
rates of both methods and shows that these results favor the attention networks. On the
unmodified paragraphs, the CER of the attention networks was roughly 1.2 points lower
than that of MDCC (9.01 versus 10.22). However, MDCC still produced lower error rates in
roughly one quarter of the individual examples, except for examples where the line
spacing was artificially reduced by 10 millimeters.
The opposite is observed when measuring the runtime for transcription of examples
from the IAM offline handwriting database. Table 7.5 shows the measured total wall
clock times for transcribing all 1539 examples, as well as the average runtime per
example. It shows that MDCC roughly halves the runtime of the attention network method
in most configurations. This favors multi-dimensional connectionist classification in
use cases where a low runtime is required or beneficial. Newer attention-based methods
seem to follow this trend[135, p. 12].
It is the opinion of the author that there are general properties of multi-dimensional
connectionist classification that make the case for further application and research in this
direction. These general properties of MDCC are discussed in the next three paragraphs.
Multi-dimensional connectionist classification is implemented outside of the deep
neural network in the form of a special loss function based on an expectation-maximization
approach and a matching decoding algorithm. This puts relatively few constraints on the
type of deep neural network in use for MDCC. The DNN must be capable of estimating
the soft-assignment from image input as described in Chapter 6. MDCC further requires
only one pass through the DNN per example. These constraints are in contrast to the
attention network method, which implements paragraph-level transcription by applying a
very specific DNN topology. This means that MDCC can be applied in combination with
a range of deep neural network types that are specifically designed for the use case at
hand. This could be topologies for execution on special hardware or topologies optimized
for runtime or memory usage.
This idea can be taken one step further by removing the assumption that the machine
learning model has to be a deep neural network at all. As stated, MDCC requires the
model to estimate the soft-assignment from an input image of handwritten text. It then
applies supervised training in the form of expectation-maximization for parameter
optimization. This is possible with any model that supports supervised training using
gradient descent. MDCC can further be applied to machine learning models whose
optimization is not gradient-based but still compatible with the M-step of Section 6.5.
As such, multi-dimensional connectionist classification could be applied to
paragraph-level handwriting recognition with models other than deep neural networks.
The indicator function α of multi-dimensional connectionist classification as discussed
in Section 6.2 and the decoding algorithm as detailed in Chapter 5 currently both assume
that text lines are oriented roughly horizontally at an angle of at most 45 degrees and do
not include ‘U-turns’. That is, each text line is assumed to occupy exactly one vertical
interval per pixel column in the input image. This matches the idea of a collapse layer in
connectionist temporal classification. It stands to reason that both parts of MDCC could
be generalized in order to remove this constraint and e.g. allow for spiral-shaped text.
The line-wise paragraph transcription based on attention networks[8] is likewise limited
to a maximum curvature or rotation of the text lines of up to 45 degrees. This limitation
emerges from the way the softmax function is applied as a column-wise normalization in
the attention step. Explicitly modeling the line separators, as is the case in MDCC,
instead of relying on the softmax function should make generalization of the text
orientation more feasible.
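The assumption of one vertical interval per text line and pixel column can be illustrated with a small check. The sketch below is not the indicator function α or the decoding algorithm. It operates on a hypothetical label map in which every cell holds the index of the text line it belongs to, ignores line separators and background, and only tests whether the lines appear exactly once and in reading order in every column.

```python
def satisfies_interval_assumption(line_map, num_lines):
    """Check that every text line forms exactly one contiguous vertical run per column.

    line_map[y][x] holds the index (0 .. num_lines - 1) of the text line of a cell.
    """
    height, width = len(line_map), len(line_map[0])
    for x in range(width):
        runs = []  # line indices of the consecutive runs in this column, top to bottom
        for y in range(height):
            label = line_map[y][x]
            if not runs or runs[-1] != label:
                runs.append(label)
        # One run per line, in reading order: no line is missing, split or reordered.
        if runs != list(range(num_lines)):
            return False
    return True


# Two roughly horizontal lines with a slanted boundary satisfy the assumption.
slanted = [[0, 0, 0],
           [0, 1, 1],
           [1, 1, 1]]
# A line that bends back on itself ('U-turn') appears twice in the left columns.
u_turn = [[0, 0, 0],
          [1, 1, 0],
          [0, 0, 0]]
print(satisfies_interval_assumption(slanted, 2), satisfies_interval_assumption(u_turn, 2))
```

A generalization towards spiral-shaped text would have to drop exactly this per-column condition.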
In summary, MDCC is faster in execution time than the method[8] based on attention
networks and has general properties that favor its application. Attention network
approaches do however seem superior for roughly horizontal text lines when there are
no tight restrictions on the execution time.
The last part of the experimentation had the goal of establishing a reliable compar-
ison with published methods. This is shown in Table 7.6. Tables 7.6 and 7.3 in combi-
nation make clear that connectionist temporal classification yields a lower error rate than
paragraph-level transcription as long as a reliable line segmentation is available. This is
certainly the case for the ground truth line segmentation provided in the IAM offline hand-
writing database. Difficult cases of automatic line segmentation then favor a paragraph-
level approach to handwriting recognition. As observed before, MDCC performs worse
than the attention network method for paragraph-level transcription, at least in terms of
plain error rates.
Table 7.6 also shows that the application of an explicit language model or image
distortions and deslanting generally improves the error rates in handwriting recognition
tasks.
Overall the experiments show that multi-dimensional connectionist classification is a
competitive method in the case of paragraph-level handwriting recognition and handwrit-
ing recognition in general. It also seems to be the case that the deep neural network in
use for MDCC should be tuned towards the specific requirements of error rates, hardware
employed for execution, as well as runtime and memory limits for the use case at hand. It
is however also clear that line-wise transcription should be preferred in cases where robust
line segmentation is feasible.
Chapter 8
Hyper-Parameter Search using
Visual Analytics
Figure 8.0.1: The visualization technique proposed in this chapter targets the soft-assignment as
estimated by the deep neural network or conditional random field, while utilizing the
input image as contextual support.
This chapter is based on the following publication with Section 1.3 discussing the
individual contributions of its authors:
Martin Schall, Dominik Sacha, Manuel Stein, Matthias O. Franz, and Daniel A. Keim.
“Visualization-Assisted Development of Deep Learning Models in Offline Handwriting
Recognition.” In: Symposium on Visualization in Data Science (VDS) at IEEE VIS 2018.
Oct. 2018
8.1 Problem Description and Idea
In Chapters 5 and 6 we have discussed the theoretical idea and method of multi-dimen-
sional connectionist classification (MDCC) in terms of training a deep neural network
(DNN) for multi-line handwriting recognition and decoding the DNN estimate. Chapter
7 details the practical application of MDCC and provides an experimental evaluation. It
discusses the deep neural network topology, hyper-parameters in the DNN as well as
the optimization method and data augmentation. It did however not discuss how to derive
these hyper-parameters. This will be the topic of this chapter. The term ‘hyper-parameter’
in this context refers to all parameters of a machine learning model that are not covered by
automatic optimization, e.g. gradient descent in our case. These hyper-parameters are
typically selected by a model engineer. Examples of hyper-parameters are the number
of neurons per layer, the learning rate or the resolution of the input image.
The general ‘black-box’ behavior of deep neural networks is one topic that needs to
be respected when choosing and optimizing hyper-parameters in deep neural networks.
Since supervised learning in DNNs is a form of curve-fitting of a non-linear function to
a finite set of (noisy) data points in a high-dimensional space, general methods to e.g.
detect and reduce overfitting or adjust the learning rate apply. We will touch on these
in this chapter.
There is however the problem of breaking open the black-box of a deep neural network
in order to inspect its specific functionality regarding the task at hand. Doing so
allows specific actions to be derived for data preparation, model building and model
training. This problem of breaking open the black-box is part of explainable AI (XAI), a
research field that emerged in the last few years. Common techniques for XAI in DNNs
include visualizing the sensitivity of the neural network output towards its input, e.g.
in convolutional neural networks (CNNs)[134, 161]. explAIner[136] is an approach to track
the evolution of a deep neural network’s topology and hyper-parameters, combined with
visualizing both the DNN topology and metrics that provide insight into its performance. A
visual analytics (VA) loop is then set up between the automatic training and the human
expert in order to gain insight into the influence of each hyper-parameter and topology
change and to derive meaningful actions that improve the model performance. Combining
visualizations and verbalization for explaining the inner workings of deep neural networks
is another method[129] to provide insight to the expert user.
Vis4ML[114] is an ontology and guide that provides a map of visual analytics
techniques for machine learning and explainable AI. It both shows existing methods and
serves to identify under-explored areas of XAI for ML. It is, among other purposes, an
entry point for discovering existing XAI methods in ML. The workflow and visualizations
of this chapter are included as one example on the Vis4ML website1 and also as an
example in the talk2 given by Dominik Sacha.
One method for finding hyper-parameters in machine learning models is an automatic
search, e.g. grid search or random search. Both methods are based on the idea that
training and evaluating a single model is fast, so that a large number of
hyper-parameter combinations can be tried in a reasonable amount of time. This assumption
does not hold for multi-dimensional connectionist classification, which is expensive to
train since the deep neural network is relatively large and the approximation of the
correct alignment using conditional random fields (CRFs) and loopy belief propagation
(LBP) takes time. In the experiments discussed in Chapter 7, a single epoch of training on
the augmented training data set took between 8 and 12 hours of wall clock time. This
amounts to several days or weeks per training run. Grid search or random search for
optimizing the hyper-parameters is thus not feasible.
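A rough estimate makes the cost of an exhaustive search concrete. The calculation below is a sketch under stated assumptions: the 8 to 12 hours per epoch are taken from the text, the 34 epochs per run are borrowed from the paragraph-wise MDCC training reported in Chapter 7, and the search grid with its three hyper-parameters and values is purely hypothetical.

```python
from itertools import product

HOURS_PER_EPOCH = (8, 12)  # range stated in the text
EPOCHS_PER_RUN = 34        # assumption: paragraph-wise epochs of the MDCC run in Chapter 7

# Hypothetical search grid, chosen only for illustration.
GRID = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "input_dpi": [150, 200, 300],
    "filters_per_layer": [64, 128, 256],
}


def grid_search_days(grid, epochs=EPOCHS_PER_RUN, hours_per_epoch=HOURS_PER_EPOCH):
    """Lower and upper bound in days for training every grid combination sequentially."""
    runs = len(list(product(*grid.values())))
    low, high = (runs * epochs * hours / 24 for hours in hours_per_epoch)
    return runs, low, high


runs, low, high = grid_search_days(GRID)
print(f"{runs} runs: {low:.0f} to {high:.0f} days")  # prints "27 runs: 306 to 459 days"
```

Even this small grid of 27 combinations would occupy a single training machine for roughly a year, while one run alone takes about 11 to 17 days under the same assumptions.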
This chapter proposes both a workflow and a heatmap-based visualization technique
that together form a visual analytics loop that allows the model engineer to optimize the
hyper-parameters for a model while training with multi-dimensional connectionist classi-
fication. The workflow proposed in this chapter is designed to steer the model engineer
through an optimization process for the hyper-parameters by posing questions that, when
answered, provide insight into possible sources for errors in MDCC and guide the model
engineer towards meaningful changes in the hyper-parameters. Answering these questions
is, if possible, supported by the proposed visualization technique. This process of guiding
the model engineer through this workflow is repeatedly applied at different stages of the
training, each time improving the hyper-parameters towards a higher model accuracy.
The proposed heatmap-based visualization is designed with the properties of MDCC
in mind. It allows the model engineer to visualize both the soft-assignment as estimated
by the deep neural network and the aligned soft-assignment produced by the conditional
random field in order to inspect both and spot relevant differences. See Figure 8.0.1.
1https://vis4ml.dbvis.de/
2https://vimeo.com/303202734
The visualization is based on a heatmap, indicating the glyph probabilities, partially
superimposed over the input image of multi-line text. In combination, this visualization
technique and the proposed workflow allow the model engineer to employ expert knowledge
to derive meaningful changes to the hyper-parameters in MDCC and to arrive at a
reasonable set of hyper-parameters without the excessive runtime requirements of grid
search or random search.
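The idea of superimposing glyph probabilities over the input image can be sketched as a simple alpha blend. This is an illustration under assumptions and not the visualization developed in this chapter: the soft-assignment for one glyph is assumed to be a coarse probability map with values between 0 and 1, it is stretched to the image size by nearest-neighbor lookup, and the red color, the opacity and the function names are arbitrary choices.

```python
def upsample_nearest(grid, height, width):
    """Stretch a coarse probability map to the image size by nearest-neighbor lookup."""
    rows, cols = len(grid), len(grid[0])
    return [
        [grid[y * rows // height][x * cols // width] for x in range(width)]
        for y in range(height)
    ]


def overlay_heatmap(image, probabilities, opacity=0.6):
    """Blend a glyph probability map in red over a grayscale image.

    image holds gray values in [0, 1]; the result is an RGB image as nested
    lists of (red, green, blue) tuples with values in [0, 1].
    """
    height, width = len(image), len(image[0])
    heat = upsample_nearest(probabilities, height, width)
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            alpha = opacity * heat[y][x]  # confident cells cover more of the image
            gray = image[y][x]
            row.append((
                (1 - alpha) * gray + alpha,  # red channel moves towards full intensity
                (1 - alpha) * gray,          # green channel is attenuated
                (1 - alpha) * gray,          # blue channel is attenuated
            ))
        result.append(row)
    return result
```

Because the opacity stays below one, the handwriting remains visible underneath the heatmap, which is what allows the estimate to be compared against the input image.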
In terms of positioning of this work, we refer to two papers by other authors. Choo
and Liu[17] identified understanding, debugging and refinement or steering as the three
tasks in XAI for deep learning. The proposed workflow addresses both debugging and
refinement/steering. Debugging means identifying defective or faulty parts of the model
or training system in order to improve on those points. Refinement and steering
incorporates expert knowledge into the training process of a machine learning model to
quickly derive meaningful changes and to improve the model accuracy or speed up the
training process.
A survey[58] by Hohman et al. categorizes VA techniques in deep learning by posing
appropriate questions. In the case of the method of this chapter, the answers to these
categorizations are as follows: Why? is ‘Debugging & Improving Models’. Our work aims at
improving the model accuracy. Comparisons of different models are also possible with
this technique. What? would be ‘Individual Computational Units’ in the output layer of
the deep neural network. This work specifically visualizes the soft-assignment in order
to gain insight by comparing the output of the DNN and CRF. When? is after the current
training run and before starting a new one. However, stopping the current training run
and resuming it can also be done. This is e.g. the case when fixing errors in the ground
truth data. Who? are clearly ‘Model Developers & Builders’. How? are ‘Line Charts
for Temporal Metrics’, see Figure 7.5.1. We also present a heatmap-based visualization
technique that falls in the ‘Instance-based Analysis & Exploration’ category. Where? is
‘Application Domains & Models’.
8.2 Error Sources in MDCC
A first step in researching and applying the proposed workflow and heatmap-based visu-
alization of this chapter is to understand the sources of error that can occur while training
a deep neural network with multi-dimensional connectionist classification. For this we
recall the discussions of Chapters 5, 6 and 7. The following list gives an overview
of the possible reasons for an unsatisfactory accuracy in MDCC. The numbering of the
error sources will later be used in Figure 8.3.1 and Sections 8.3 and 8.4. The paragraphs
following this overview discuss the details of these sources of error.
1. Data
(a) Too few data in general.
(b) Too few data for outlier examples.
(c) Truth data has systematic fault.
(d) Truth data has individual examples wrong.
2. Transformation
(a) Resolution of input image too low.
(b) Resolution of input image too high.
3. DNN Topology
(a) DNN topology not suitable for task.
(b) DNN capacity too small.
(c) Subsampling within the DNN too large.
(d) Subsampling within the DNN too small.
4. CRF Alignment
(a) LBP has not converged to a stable point.
5. Training Process
(a) Training is not finished yet.
(b) Overfitting to training data.
(c) Optimizer hyper-parameters are sub-optimal.
6. General
(a) General configuration error.
(b) Implementation bug.
Data: The first type of error sources identifies problems with the training data and data
in general that is available for DNN optimization. A common problem in deep learning is
that too few annotated data examples are available, which prevents effective optimization
in high-dimensional parameter spaces. Increasing the amount of annotated data available
is often useful in this case (‘data is king’). We also need to keep in mind that
multi-dimensional connectionist classification deals with the transcription of handwritten
text, which is often natural language. The frequency of glyphs in natural text follows
Zipf’s law[87], which states that the frequency of an item is roughly inversely
proportional to its frequency rank, so that the most frequent glyph occurs roughly twice
as often as the second most frequent one. This means that training a DNN for
transcription of natural texts inherently deals with imbalanced data sets in which the
glyphs do not occur with roughly equal frequency. Even a few annotation errors in
examples with infrequent glyphs can be harmful to the overall accuracy. In general there are
other potential errors in annotation, e.g. systematic faults such as incorrect capitalization
or missing punctuation marks. Of course there may be more general types of errors in
the data, such as for example input images that are geometrically rotated or flipped.
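The imbalance implied by Zipf's law can be made concrete with a short calculation. The sketch below assumes the idealized form of the law, in which the frequency of the glyph at rank r is proportional to 1/r. The alphabet size of 80 glyphs and the corpus size are hypothetical values chosen for illustration, and real glyph statistics will deviate from this idealization.

```python
def zipf_frequencies(num_glyphs: int, exponent: float = 1.0):
    """Relative frequencies proportional to 1 / rank**exponent, most frequent glyph first."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_glyphs + 1)]
    total = sum(weights)
    return [weight / total for weight in weights]


def expected_counts(num_glyphs: int, corpus_size: int):
    """Expected number of occurrences per glyph rank in a corpus of the given size."""
    return [corpus_size * share for share in zipf_frequencies(num_glyphs)]


# Hypothetical alphabet of 80 glyphs and a corpus of 100,000 characters.
counts = expected_counts(num_glyphs=80, corpus_size=100_000)
print(f"most frequent: {counts[0]:.0f}, least frequent: {counts[-1]:.0f}")
```

Under this idealization the most frequent glyph is seen 80 times as often as the rarest one, so a handful of wrong labels can make up a noticeable share of all training examples for a rare glyph.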
Transformation: This type of error source concerns the image transformations.
These errors occur if the available data is correct, but the images are incorrectly
transformed while loading them into the deep neural network. The two error types of most
concern for MDCC are loading the input image of handwritten text in a too high or too low
resolution. Choosing a wrong input resolution may lead to errors in recognizing
characters or text lines. An example here would be the glyph ‘i’, which looks like a
glyph ‘l’ or ‘I’ in a too low resolution. On the other hand, a too high resolution
separates the dot from the main body of the glyph by many time steps in a recurrent neural
network, and the glyph will then likely be transcribed incorrectly. There are
works[48, 54, 55] that test the ability of long short-term memory networks to recognize
long-distance dependencies in sequential data.
DNN Topology: There are two rather general types of error in this category. One
is that the topology of the deep neural network is not suitable for the task of multi-line
handwritten text recognition. This could be the case if e.g. the non-linear activation
functions were of an unsuitable type. The other is choosing a subsampling, e.g. maximum or
average pooling operations, that is too fine or too coarse, which leads to phenomena
similar to an input image with too low or too high a resolution. With too little or too
much subsampling, the DNN may be incapable of correctly identifying geometric features of
handwritten text and their relations with each other in natural language. Extreme cases
occur if the spatial
size of the soft-assignment is smaller than the size of the truth label, which is the case if
the number of pixel rows is smaller than the number of text lines or the number of pixel
columns is smaller than the number of characters in the longest text line. In such a case,
no correct soft-assignment can be estimated or computed at all, leading to the failure of
the alignment using the conditional random field.
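The extreme case described above can be tested before training starts. The following is a minimal sketch, assuming that the label is given as a list of text lines and that the DNN reduces both spatial dimensions by the same integer subsampling factor with floor division; both function names and the floor division are assumptions for illustration.

```python
def soft_assignment_shape(image_height, image_width, subsampling):
    """Spatial size of the soft-assignment, assuming uniform integer subsampling."""
    return image_height // subsampling, image_width // subsampling


def soft_assignment_fits_label(sa_height, sa_width, label_lines):
    """Check that the soft-assignment has at least one row per text line and
    at least one column per character of the longest text line."""
    if not label_lines:
        return True
    enough_rows = sa_height >= len(label_lines)
    enough_columns = sa_width >= max(len(line) for line in label_lines)
    return enough_rows and enough_columns
```

Examples for which the check returns `False` cannot be aligned at all and would be worth reporting before the training run begins.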
CRF Alignment: The main task when looking for error sources in the alignment
by the conditional random field is to check whether the beliefs in loopy belief
propagation have converged to a stable point and whether this point is a valid solution.
Please see Section 2.2 and Chapter 6 for information on aligning the soft-assignment in
MDCC. The convergence criteria proposed in Algorithm 6.6.1 test for both of these cases.
Still, the possibility remains that a large number of examples did not converge to a valid
stable point and thus a large amount of training data was implicitly discarded, hindering
the training process.
Training Process: Problems found in the wider field of machine learning also
apply to multi-dimensional connectionist classification. Gradient descent, which is an
iterative algorithm, is applied for parameter optimization in MDCC. If an unsatisfying
error rate is observed, it could therefore simply be the case that the training process
needs to run for a longer time. Overfitting also applies here, which means that too little
training data is used for the model capacity at hand. In this case the model starts
fitting noise or outliers in the data instead of generally occurring concepts. A further
problem is a learning rate in the optimizer that is too high, which leads to sporadic
divergence of the model parameters away from a low-error solution in gradient descent.
General: This category covers the most general of problems such as for example
choosing the wrong alphabet for transcription or an implementation bug somewhere in
the overall system.
8.3 Workflow for Identification of Error Sources
So far in Section 8.2 we have discussed the potential error sources that lead to an in-
crease in character error rate (CER), a lower accuracy, in multi-dimensional connection-
ist classification. This section discusses a workflow that is designed to guide the model
engineer to meaningful changes in hyper-parameters of the model, the training process
or general data acquisition in order to effectively reduce the error rate in MDCC. Figure
8.3.1 outlines this workflow, which is a decision tree encoding expert knowledge regard-
ing training in multi-dimensional connectionist classification. Part of this workflow involves
answering questions about the current state of the model in MDCC and some of these
questions are supported by a heatmap-based visualization. This visualization technique
will be discussed in Section 8.4.
We will discuss the workflow of Figure 8.3.1 and its suggested actions for error mitiga-
tion in the following paragraphs. The nodes in Figure 8.3.1 are encoded in the following
way:
• Red signals questions to be answered or tests to be done on the current state of
the model.
• Blue suggests the application of the heatmap-based visualization technique dis-
cussed in Section 8.4.
• Green indicates suggestions for actions to improve the accuracy of the model
trained with MDCC.
• The gray trapezoid suggests multiple options to proceed at this point. It is at the
discretion of the model engineer to choose one action or to apply all of the suggested
actions.
Figure 8.3.1: Proposed workflow to identify meaningful actions given the current state of the train-
ing process.
The workflow begins by posing question A, which is to measure the character error
rate on the validation data with the current model parameters. As Figure 7.5.1 shows,
the error rate on the validation set converges towards the minimum during gradient
descent, provided that no overfitting of the model happens, the model capacity and
training set are large enough and no other error occurs during training. In most cases
the error rate does indeed converge towards the minimum, but starts to diverge or
oscillate at some point. Question A of the workflow asks whether the current CER on
the validation data set is satisfying. If it is, the model engineer may choose to stop
the training run and use the model parameters which produced the lowest error rate.
Question B of the workflow assesses whether the character error rate on the validation
data set is still improving with each epoch of the training. If this is the case, waiting
for a better parameter set that produces a lower error rate on the validation set is still
an option. Another option in this case is to check the hyper-parameters of the optimizer
for potential improvements, e.g. increasing the learning rate if the error rates on both
the training and validation set are decreasing only slowly. If the error rate on the
validation set is no longer improving, question C tests whether the error rate on the
training set is still improving. If this is the case, we are likely dealing with
overfitting, which is a common problem in machine learning. One action to mitigate
overfitting in multi-dimensional connectionist classification is to collect more annotated
data for this task (‘data is king’), thus improving the ratio between the amount of
training data and the model capacity. Instead of increasing the amount of data, reducing
the model capacity or introducing further constraints on the model and its parameters is
also a suitable action when overfitting is observed. Actions in this direction are to
reduce the number of layers in the deep neural network, reduce the number of neurons per
layer, add dropout to the layers or add a regularization term to the loss function.
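Questions A to C can be approximated from the recorded error rate histories. The following is a minimal sketch, not the procedure used in this work: the function name, the `window` and `min_delta` parameters and the rule for deciding that a history is "still improving" are assumptions, as is the fall-through to question D when neither error rate improves. Both histories are assumed to be non-empty lists with one CER value per epoch.

```python
def triage(train_cer, val_cer, target_cer, window=3, min_delta=1e-3):
    """Map CER histories (one value per epoch) to the first matching workflow question."""

    def improving(history):
        # Too little history to call a plateau: treat it as still improving.
        if len(history) <= window:
            return True
        # Improving if the recent best beats the earlier best by at least min_delta.
        return min(history[-window:]) < min(history[:-window]) - min_delta

    if min(val_cer) <= target_cer:
        return "A: stop and keep the parameters with the lowest validation CER"
    if improving(val_cer):
        return "B: keep training or tune the optimizer hyper-parameters"
    if improving(train_cer):
        return "C: overfitting suspected, add data or reduce model capacity"
    return "D: inspect the character error rate per example"
```

Such a rule can only pre-sort training runs; the decision itself stays with the model engineer.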
Question D results in a significant split in the workflow, since the outcome of this
experiment determines whether the problem at hand lies in the data, model or training
system in general, or with specific individual examples or glyphs. Question D can be
answered by calculating the character error rate per example from the annotated data and
then plotting a histogram or computing the variance over these per-example error rates.
If the CER is unsatisfying across all examples, question E follows up to determine
probable causes. Question E requires that the annotated data is also available line-wise,
that is, with each paragraph correctly segmented into its lines. The experiment of
question E is to use the line-wise data set with the same model and hyper-parameters, as
far as possible, and to train the model using connectionist temporal classification for
line-wise transcription. Training the model with CTC is a reliable way of testing the deep
neural network architecture in general, since CTC is a robust and well understood
transcription method for handwritten lines of text. If this CTC training run achieves a
reasonable character error rate, further inspection of individual examples when
transcribing using MDCC is in order. If training with connectionist temporal
classification does not lead to a reasonable error rate, a general error is to be expected
in the data or model. The model could be unsuitable for the task at hand, or the model
capacity could be too small to solve the task of handwriting recognition. Repeating the
training with a different deep neural network architecture or with more neurons, and thus
more parameters, in the neural network is advised. Choosing a suitable DNN architecture
partly depends on the experience of the model developer. The DNN architecture used in this
work is detailed in Section 7.4. On the other hand, there could be too little data or the
data could have a systematic fault. Systematic faults in the annotated truth data in
handwriting recognition include e.g. incorrect capitalization of words or missing
punctuation marks.
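The per-example statistics needed for question D can be computed as follows. This is a minimal sketch, assuming that the CER is the Levenshtein distance between truth and prediction divided by the length of the truth; the function names are hypothetical.

```python
from statistics import mean, pvariance


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b)))  # substitution
        previous = current
    return previous[-1]


def cer(truth, prediction):
    """Character error rate of one example, relative to the truth length."""
    return edit_distance(truth, prediction) / max(len(truth), 1)


def cer_summary(pairs):
    """Per-example CER values with their mean and variance for (truth, prediction) pairs."""
    rates = [cer(truth, prediction) for truth, prediction in pairs]
    return rates, mean(rates), pvariance(rates)
```

A high mean with a low variance hints at a general problem, whereas a high variance hints at individual problematic examples. The list of rates can also be passed to any plotting library to draw the histogram.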
Questions D and E in combination establish if the employed data set and deep neural
network architecture are suitable for solving the task of multi-line handwriting recognition.
Once it is established that the DNN model is capable of solving the task of offline hand-
writing recognition, the workflow directs the model engineer towards error sources within
individual data examples.
Question F of the workflow is the first one that requires visual inspection of specific
examples. We will discuss the heatmap-based visualization and ways to filter for inter-
esting glyphs in Section 8.4. This proposed visualization technique is capable of answer-
ing questions about the localization of characters predicted by the deep neural network,
about the resolution of the soft-assignment as estimated by the DNN or conditional ran-
dom field, about the correctness of the CRF alignment, about glyphs which are affected
by a high error rate and about general inspections on the difference between the deep
neural network and conditional random field soft-assignments.
Filtering the data set for interesting data examples is again done by calculating the
character error rate per example and selecting a few examples with a high error rate. Those
examples are then visualized with the technique proposed in this chapter. If all glyphs
show roughly the same error rate and error type, this points to a more general problem with
the hyper-parameters of the model, the data or the software implementation. If only a few
glyphs show a high error rate, this points to more localized and specific problems.
Question I picks up this trail by testing whether frequent or infrequent glyphs of the
alphabet are impaired by a high error rate. For this, the heatmap-based visualization is
employed in the missing mode, which highlights glyphs that the deep neural network falsely
missed, or in the ghosting mode, which targets falsely predicted glyphs. Typical problems
leading to a higher than expected error rate in infrequent glyphs are too low a model
capacity or a poor choice of the optimizer hyper-parameters. Both of these error sources
can disproportionately affect infrequent glyphs since, in terms of the MDCC loss function,
errors in infrequent glyphs have only a small impact on the overall transcription error of
the model. Other types of errors that occur in combination with infrequent glyphs are a
data collection bias that further reduces the number of examples with outliers, or
systematic errors in the ground truth data annotation that only affect infrequent glyphs.
The remaining options for error sources in multi-dimensional connectionist classification
are covered by a multiple choice path in the proposed workflow of Figure 8.3.1. The
model engineer may choose to follow only one of the two options proposed in the workflow,
although it is recommended to perform both tests. The heatmap-based visualization
is employed to inspect one example from the data set to support the decision on how to
proceed. Question G compares the resolution of the input image with the resolution
of the soft-assignments estimated by the deep neural network and the conditional random
field. The source of error may be that the alignment resolution is too coarse or too
fine, in which case the subsampling factor inherent in the DNN architecture or the
resolution of the input image should be adapted. In the course of this work, a
soft-assignment resolution of 3 to 5 pixels in both height and width per character has
worked well. Characters in the DNN and CRF soft-assignments should, however, not consist
of only a single pixel. Choosing too low an input resolution or too high a subsampling
factor may render certain glyphs indistinguishable from each other, e.g. the glyphs ‘i’,
‘l’ and ‘I’. Too high an input resolution or too low a subsampling factor, on the other
hand, may create too large a distance between disconnected features of the same glyph and
thus introduce additional long-range dependencies into the long short-term memory (LSTM)
layers.
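The rule of thumb for question G can be turned into a quick check. The following is a minimal sketch, assuming that the model engineer has measured a typical character height and width in input pixels and that the DNN subsamples both dimensions by the same factor; the function names and the way the 3 to 5 pixel range is applied are assumptions for illustration.

```python
def cells_per_character(char_height_px, char_width_px, subsampling):
    """Extent of a typical character in the soft-assignment, in height and width."""
    return char_height_px / subsampling, char_width_px / subsampling


def resolution_verdict(char_height_px, char_width_px, subsampling, low=3.0, high=5.0):
    """Compare the soft-assignment extent per character with the 3 to 5 pixel range."""
    cells = cells_per_character(char_height_px, char_width_px, subsampling)
    if min(cells) < low:
        return "too coarse"  # lower the subsampling or raise the input resolution
    if max(cells) > high:
        return "too fine"    # raise the subsampling or lower the input resolution
    return "ok"
```

For characters of about 32 by 24 input pixels and a subsampling factor of 8, the soft-assignment covers 4 by 3 pixels per character, which lies within the range reported above.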
Question H is an experiment to test for general errors in the data and software im-
plementation. It may again be that there is a systematic fault in the ground truth data or
that individual examples from the data set are wrong. There may also be a general con-
figuration error, e.g. flipping the width and height of the input image at some point, or an
implementation bug. Since loopy belief propagation is an iterative inference method that
requires repeated multiplication and summation of probabilities, numerical instabilities in
the implementation may propagate to impact the overall error rate of the model. Also
loopy belief propagation may simply not have converged to a stable point while estimat-
ing the soft-assignment in multi-dimensional connectionist classification. Increasing the
limit on the number of iterations in LBP is advised in this case.
We will discuss the proposed heatmap-based visualization technique in the remainder
of this chapter.
8.4 Heatmap-Based Visualization for MDCC
Sections 8.2 and 8.3 have detailed the sources of error that may occur while applying
multi-dimensional connectionist classification to the training of a deep neural network and
proposed a workflow for deriving meaningful actions to modify the annotated data, hyper-
parameters of the model or optimizer or the DNN architecture itself to mitigate these
errors. What is missing is the heatmap-based visualization technique that is employed in
the workflow of Figure 8.3.1 to inspect data examples for specific errors. Discussing this
visualization technique is the topic of this section.
This heatmap-based visualization targets the intelligible visualization of the soft-as-
signments as estimated by the deep neural network and conditional random field. The
structure of these soft-assignments has been discussed in Sections 5.2 and 6.2. Specif-
ically the soft-assignment zΣ of Equation 6.3.2 in case of the conditional random field is
targeted by this visualization. This allows direct comparison of the deep neural network
and conditional random field soft-assignments since both are probabilities per glyph of
the alphabet, not probabilities per character of the label string. These two
soft-assignments, the DNN and CRF estimates, are probability distributions with two spatial
dimensions for the two dimensions of the input image of handwritten text, and a third
dimension over the glyphs of the alphabet in use. Understanding the spatial relationship
between glyphs
in a text is, so we assume, intuitive for speakers of that language as long as enough
context is given. This required context is provided by the proposed heatmap-based vi-
sualization by using the original image of handwritten text as a background and partially
superimposing the glyph probabilities as a heatmap. Due to using the input image as
background, only one example of the data set can be visualized at a time.
Figure 8.4.1 shows this heatmap-based visualization for the example text of the IAM
offline handwriting database often used in this thesis and specifically the glyph ‘a’ of the
alphabet. The left part of the visualization shows the soft-assignment of the glyph ‘a’
as estimated by the deep neural network and the right part for the conditional random
field. The background in both cases is the input image of handwritten text as a grayscale
image. Superimposed on this background is a heatmap of the probabilities from the soft-
assignments for this specific glyph. The heatmap is only partially superimposed to leave
large parts of the handwritten text visible for reference to the user. This raises the
question of how to decide which parts of the image should be overlaid with the heatmap,
that is, at which spatial positions the soft-assignments should be visualized.
The value range of the soft-assignments is in [0, 1] since the values are probabilities.
Summing the probabilities over all glyphs for one pixel will always yield 100 percent
probability since each pixel has to be assigned to one glyph. Noisy predictions will thus
yield probabilities near 1/|A|, with |A| being the number of glyphs in the alphabet in use. High
confidence predictions will yield a few predictions with probabilities near one, that is for
assignments of pixels to glyphs that the DNN or CRF expect, and many probabilities near
zero which indicate assignments of low confidence. This effect is used to decide which
pixels of the image of handwritten text to superimpose with the heatmap. Interesting ar-
eas are where the DNN or CRF predict a high probability for the glyph in question and
those areas should be superimposed. In the proposed visualization, these interesting
areas are defined as the pixels where the probability of assigning the pixel to the glyph
is higher than the mean probability for this glyph over all pixels, plus one standard
deviation. Formalized, the heatmap of the soft-assignment y is superimposed for pixels s
where $y^s_g > \mu_{y_g} + \sigma_{y_g}$ for the glyph g at hand. If the probabilities were
normally distributed, this would mean that roughly 16 percent of the pixels are
superimposed by the heatmap.
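The selection rule above can be sketched in a few lines. The following is a minimal sketch, assuming that the probabilities of one glyph are given as a nested list with one row per pixel row of the soft-assignment; the function name is hypothetical and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def heatmap_mask(glyph_probs):
    """Mark the pixels whose probability exceeds the mean plus one standard deviation."""
    flat = [p for row in glyph_probs for p in row]
    threshold = mean(flat) + pstdev(flat)
    return [[p > threshold for p in row] for row in glyph_probs]
```

Only pixels marked `True` would be drawn over the input image. A perfectly uniform soft-assignment yields an empty mask, since no probability exceeds the mean.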
The heatmap further encodes the absolute probability of the pixel being assigned to
the specific glyph in color, with high probabilities being encoded in yellow. This color
coding of the probabilities further allows the user to distinguish between a superimposed
heatmap that shows noise (seemingly random pixels are superimposed) and one that shows
confident predictions by the DNN or CRF (contextually correct pixels are superimposed).
This partially superimposed heatmap allows the user to distinguish between
soft-assignments from the deep neural network or conditional random field that are seemingly
noise and those that represent high confidence predictions. If the user decides that the
soft-assignments at hand are not noise, further inspection of the soft-assignments is pos-
sible by incorporating the context provided by the corresponding handwritten text in the
background.
Figure 8.4.1: Heatmap visualization technique for inspecting and comparing the deep neural net-
work prediction and conditional random field alignment for a single glyph from the
alphabet. The background shows the input image. The partially superimposed
heatmap details the probabilities from the according soft-assignments. High proba-
bilities are encoded in yellow, low ones in blue.
Figure 8.4.1 shows a higher resolution of the pixels in the input image than in the
superimposed heatmap. This is because the spatial resolution of the soft-assignments
estimated by the deep neural network and conditional random field is actually lower
than that of the input image. This effect is caused by subsampling within the DNN
architecture, which reduces the spatial resolution while forward propagating through the
neural network.
This proposed visualization technique is capable of only displaying a single glyph of a
single example from the data set at a time. Section 8.3 discussed the workflow proposed
in this chapter and suggests that interesting examples from the data set can be identified
by calculating the character error rate (CER) per example and selecting examples that
have the highest or at least higher than expected error rate. This is a quick way to identify
problematic examples which may indicate errors in the model, data or hyper-parameters
but of course any example from the data set can be visualized using this technique.
Identifying interesting glyphs from the alphabet at hand will be the topic of the discussion
in the next few paragraphs.
We propose three metrics for ordering the glyphs by ‘interestingness’ within one ex-
ample of the data set: difference, which employs the cross-entropy between the CRF
alignment and DNN estimate, ghosting as a measure of false-positives and missing for
false-negatives. The user may choose the metric that is most useful for the task at hand
and select one or more glyphs to visualize in order to gain insight into the DNN prediction
and CRF alignment. As in the discussion of expectation-maximization in Section
6.5, the following equations use zΣ to denote the alignment as estimated by the
conditional random field and y for the soft-assignment predicted by the deep neural network.
The difference metric
\[ \mathrm{Difference}(y, z_\Sigma, g) = -\sum_{s} z^{s}_{\Sigma g} \times \log(y^{s}_{g}) \tag{8.4.1} \]
measures the cross-entropy between the soft-assignments zΣ and y for the example at
hand. That is, it measures the information gained about zΣ when y is observed. A high
value of cross-entropy indicates a low information gain, which shows that the difference
between the DNN prediction and CRF alignment is high.
The ghosting metric
\[ \mathrm{Ghosting}(y, z_\Sigma, g) = \sum_{s} \begin{cases} y^{s}_{g} & \text{if } z^{s}_{\Sigma g} < \epsilon \\ 0 & \text{else} \end{cases} \tag{8.4.2} \]
has a high value for glyphs that are predicted by the deep neural network in pixels where
the conditional random field does not indicate their assignment. In a standard machine
learning classification task, this would be false-positives.
The inverse to the ghosting metric is the missing metric
\[ \mathrm{Missing}(y, z_\Sigma, g) = \sum_{s} \begin{cases} z^{s}_{\Sigma g} & \text{if } y^{s}_{g} < \epsilon \\ 0 & \text{else} \end{cases} \tag{8.4.3} \]
which flips the soft-assignments of the DNN and CRF and thus has a high value for
pixels where the CRF assigns the glyph, but not the DNN. These are false-negatives
in classification tasks. ϵ in both the Ghosting and Missing metrics is a small positive
probability as a threshold.
Selecting interesting glyphs for visualization is done by choosing one of these three
metrics and computing its value for all glyphs in the alphabet for the specific data set
example at hand. Plotting the values of the metric for all glyphs in a histogram
then allows the user to quickly filter for interesting glyphs. Figure 8.4.2 shows this
histogram for the data set example already known from Figure 8.4.1 with the difference
metric applied. The histogram shows a high difference in the line separator glyph.
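The three metrics of Equations 8.4.1 to 8.4.3 and the glyph selection can be sketched as follows. This is a minimal sketch in plain Python, assuming that both soft-assignments are nested lists indexed as `[row][column][glyph]`; the function names, the default threshold `eps`, the lower bound `floor` inside the logarithm and the helper `rank_glyphs` are assumptions for illustration. An actual implementation would more likely operate on arrays.

```python
import math


def _cells(y, z, g):
    """Yield the pair (y_g^s, z_g^s) for every spatial position s."""
    for y_row, z_row in zip(y, z):
        for y_cell, z_cell in zip(y_row, z_row):
            yield y_cell[g], z_cell[g]


def difference(y, z, g, floor=1e-12):
    """Equation 8.4.1; floor avoids log(0) and is not part of the equation."""
    return -sum(zp * math.log(max(yp, floor)) for yp, zp in _cells(y, z, g))


def ghosting(y, z, g, eps=0.01):
    """Equation 8.4.2: DNN probability mass where the CRF does not assign the glyph."""
    return sum(yp for yp, zp in _cells(y, z, g) if zp < eps)


def missing(y, z, g, eps=0.01):
    """Equation 8.4.3: CRF probability mass where the DNN does not predict the glyph."""
    return sum(zp for yp, zp in _cells(y, z, g) if yp < eps)


def rank_glyphs(metric, y, z, alphabet, top_k=3):
    """Return the top_k (glyph, value) pairs for the chosen metric, highest first."""
    scores = [(glyph, metric(y, z, g)) for g, glyph in enumerate(alphabet)]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]
```

Calling `rank_glyphs(ghosting, y, z, alphabet)` then yields the glyphs to inspect with the heatmap-based visualization, and the full list of scores provides the values for the histogram.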
Figure 8.4.2: Histogram for one specific example and over all glyphs of the alphabet, indicating
the difference between the deep neural network prediction and the conditional ran-
dom field alignment. The histogram indicates a high difference in the line separator
ϵl.
Figure 8.4.3 applies the ghosting metric to the same example as in Figure 8.4.2.
There are false-positives in the space-glyph and ‘o’, ‘e’, ‘a’ glyphs. This can partially be
explained by the fact that these glyphs are frequent in the English language. Normalizing
the ghosting and missing metrics by the frequency of the glyph may however be
contraindicated. Both metrics are sums over all pixels in the soft-assignments and,
as such, glyphs that occupy a large space tend to have higher values in these metrics.
Normalization would thus require knowledge about the spatial extent of each glyph,
information that is explicitly unavailable when applying multi-dimensional connectionist
classification.
Figure 8.4.3: Histogram similar to Figure 8.4.2 but with the ghosting metric. The space-glyph as
well as the glyphs ‘o’, ‘e’ and ‘a’ are the most ghosted glyphs in this example.
Plotting and viewing these histograms as in Figures 8.4.2 and 8.4.3 allows the user to
filter for interesting glyphs and to apply the heatmap-based visualization of Figure 8.4.1.
Figure 8.4.4 shows the heatmap-based visualization applied to the example at hand for the
top-3 glyphs according to the ghosting metric. We can see that the structure of the
soft-assignments created by the deep neural network and conditional random field is
similar, but there are differences in the exact location of characters and the exact value
of the probabilities. This is the case because a forward pass through a deep neural network
and loopy belief propagation are very different inference algorithms and are expected
to produce non-identical estimates. However, due to expectation-maximization the two
soft-assignments should converge to a stable point in which they are similar to each
other.
Figure 8.4.4 shows the top-3 glyphs according to the ghosting metric, which would
be false-positives in typical classification tasks. The glyph ‘o’ indeed does show a false-
positive in the bottom-left area where the word ‘and’ is seen in the background image, but
the character ‘a’ is mistaken for an ‘o’.
This section proposed and discussed a heatmap-based visualization for multi-dimen-
sional connectionist classification and ways of filtering for interesting data set examples
and glyphs from the alphabet. This completes the application of the workflow proposed
in Section 8.3.
Figure 8.4.4: Heatmap visualizations for the top-3 glyphs according to the ghosting metric. Much
seems to be minor perturbations in the two prediction methods (DNN and CRF),
but the glyph ‘o’ is truly ghosted in the bottom-left area of the example.
8.5 Discussion
This chapter proposed a heatmap-based visualization technique and workflow for iden-
tification of error sources while training a deep neural network with multi-dimensional
connectionist classification as well as for proposing meaningful changes to the model, its
hyper-parameters and the annotated ground truth data in order to mitigate the identified
errors. The workflow is designed to be executed by the model engineer after a training run
with MDCC or during a current training run in order to decide on actions to improve the
error rate when applying MDCC to the specific data set. The proposed workflow covers
both typical machine learning problems such as overfitting or errors in the annotation of
the ground truth data, but also extends to problems that are specific to offline handwriting
recognition and multi-dimensional connectionist classification. The workflow proposes
actions for improvements in both cases.
The heatmap-based visualization is designed to inspect and compare the soft-assign-
ments as estimated by the deep neural network and conditional random field. It provides
contextual information to the user by incorporating the input image of handwritten text as
a background to the heatmap. The heatmap itself is partially superimposed on this
background and visualizes probabilistic assignments between pixels and glyphs in the
soft-assignments. The heatmap is superimposed in a way that allows the user to identify
and compare spatial positions with high probabilities, but also to identify whether the
soft-assignment is largely a product of noise.
Although no user study was performed to evaluate the visualization technique and
workflow proposed in this chapter, they were of immense use for the author of this thesis
while training deep neural networks using multi-dimensional connectionist classification.
The experience collected while performing the experiments earlier showed that this tech-
nique is of help while loading and transforming data and choosing a suitable DNN archi-
tecture for the task at hand. It also proved useful during hyper-parameter selection for this
model. It was also employed during training runs with MDCC to determine whether they were
worth continuing or whether it was better to stop the training, adjust the hyper-parameters
and start a new training run.
One example of this was a case where a mistake was made in the architecture of
the deep neural network: the final neural layer that produced estimates for the glyphs
was followed by a non-linear activation function, which in turn was followed by the
softmax activation function to produce probability estimates. Applying the proposed
workflow and heatmap-based visualization showed that infrequent glyphs were predicted
poorly and that the predicted soft-assignment for these infrequent glyphs showed random
noise. This observation made it clear that the mistake must be of a type that inhibits the
recognition of infrequent glyphs. Applying two non-linear activation functions
consecutively has this effect since the gradient for each glyph is then scaled by the
product of the two function derivatives. In this practical example the workflow and
visualization technique pointed towards an error source in the DNN architecture.
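The effect of the additional activation function on the gradient can be illustrated numerically. The following is a minimal sketch that assumes the additional non-linearity is a rectified linear unit (ReLU) and that the loss is a cross-entropy on a single pixel; the text above does not state which activation function was involved, so both are assumptions, as are the function names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_wrt_preactivation(pre, target, extra_relu):
    """Gradient of the cross-entropy with respect to the final layer's pre-activations."""
    logits = [max(a, 0.0) for a in pre] if extra_relu else list(pre)
    probs = softmax(logits)
    # Gradient of softmax with cross-entropy with respect to the logits: p - t.
    grad_logits = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    if not extra_relu:
        return grad_logits
    # Chain rule: multiply by the ReLU derivative, which is zero for negative inputs.
    return [g * (1.0 if a > 0 else 0.0) for g, a in zip(grad_logits, pre)]
```

With pre-activations `[2.0, -1.0]` and the second glyph as target, the gradient for the target glyph is strongly negative without the extra activation but exactly zero with it. Under this assumption, a glyph whose pre-activation has become negative receives no further training signal, which would be consistent with the poor recognition of infrequent glyphs observed above.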
Visualization-guided changes to the model, hyper-parameters and potentially annotated
data are crucial in this context since training deep neural networks for offline
handwriting recognition is time-intensive, even when using GPGPU acceleration. As stated
before, a single epoch of training with MDCC on the IAM offline handwriting database has a
duration of up to 12 hours. This amounts to multiple weeks for a full training run until
the error rate has converged to a low value. It is therefore in the interest of the model
engineer to detect potential problems early and mitigate them in a directed and
meaningful way.
One possible next step in further researching and developing this proposed visualiza-
tion technique and workflow is to tightly couple it with the training process. This could
for example involve automatic identification and visualization of difficult examples and
glyphs while training the DNN, potentially tracking changes over time and between dif-
ferent models. To this end the visualization technique could be incorporated in already
existing software for tracking DNN training runs, such as TensorBoard[1]. Integrating
these visualizations with semi-automatic annotation software[125] is another path for
potential further research. Integrating these heatmap-based visualizations in software
for annotating ground truth data could provide further insight into which data examples
prove to be consistently erroneous and provide context to support improved annotation to
mitigate some problems. Sections 8.2 and 8.3 discussed error sources that occur within
the ground truth annotation and tight integration of visualizations of models with the data
annotation process could limit the occurrence of such problems in the first place.
Chapter 9
Combined Models for Text
Recognition
9.1 Idea and Overview
This thesis mainly proposes and discusses multi-dimensional connectionist classification
(MDCC) as a novel method for paragraph-wise handwriting recognition in Chapters 5 and
6 with empirical experimentation in Chapter 7. The experiments of Chapter 7 focused on
a comparison of the character error rate (CER) of paragraphs transcribed with MDCC
and with connectionist temporal classification (CTC), which was discussed in Section
3.1. While MDCC is a method for paragraph-wise transcription, CTC is one for line-wise
transcription. This chapter proposes methods for combining both paragraph- and line-wise
transcription. Each of the proposed methods constitutes a classifier that estimates
which of the two transcription methods yields the lower error rate on the example at hand.
The idea behind this approach is that not all handwritten texts are equally difficult to
transcribe. Many paragraphs consist of text lines that are organized in a neatly
horizontal fashion without overlaps between lines. These are typically handled well by
line-segmentation algorithms and produce good results when applying connectionist temporal
classification. On the other hand, some adjacent text lines overlap because
of warped base lines of one or multiple text lines or because individual characters
in the text lines are overly extended in the vertical direction. Noise in the image is
also a factor during line segmentation and transcription. These factors may reduce the
quality of line segmentation results and thus favor the application of multi-dimensional
connectionist classification as a paragraph-wise transcription method.
The IAM offline handwriting database[88] and its large writer independent text line
recognition task again serve, as in Chapter 7, as the data basis for the experiments of
this chapter. As discussed before, the large writer independent text line recognition
task splits the IAM database into four disjoint sets: training, validation 1, validation 2
and test. Chapter 7 used the training set for automatic parameter optimization, validation
1 for validation and optimization of the hyper-parameters and test for final evaluation and
comparison with other methods. This leaves the validation 2 split unused so far. For the
purpose of the current chapter, both training and validation 1 will be used for automatic
parameter optimization if a machine learning classifier is used. Validation 2 will
be used for hyper-parameter optimization and test again for evaluation and comparison.
This approach ensures that no data examples that were used for optimization of the
transcription models will again be used for optimization of the classifiers for combining
line- and paragraph-wise transcription. Similar to Table 7.1, the data split used for the
classifiers in this chapter is outlined in Table 9.1.
Table 9.1: Data split of the IAM offline handwriting database as used in this chapter.
                   Training             Validation   Evaluation
                   Train. + Val. 1      Val. 2       Test
Num. Paragraphs    747 + 105 = 852      115          232
Num. Lines         6161 + 900 = 7061    940          1861
Num. Writers       283 + 46 = 329       43           128
Similar to the experiments of Chapter 7, the experiments of this chapter are executed
on both the original paragraph images of the IAM offline handwriting database and
paragraph images in which the lines have been artificially moved closer together by
offsets of 3, 5 and 10 millimeters. Tesseract¹, GNU Ocrad² and A* path planning[140] have
been applied as methods to achieve line segmentation apart from the provided ground
truth. Applying these line segmentation algorithms was necessary for experiments with
artificial line offsets since no ground truth data was available for these paragraphs.
Before proposing methods for application in real scientific or industrial settings, an
overview of the potential gain from combining paragraph- and line-wise transcription is
in order. The experiments of Chapter 7 and the error rates detailed in Table 7.3 serve as
the basis for the experiments in this chapter. Figure 9.1.1 outlines an ‘oracle’ classifier
that decides on an example-by-example basis whether the transcription generated by
multi-dimensional connectionist classification or the one generated by connectionist
temporal classification yields the lower character error rate. To this end, the ‘oracle’
classifier receives the truth texts as an additional input and chooses the transcribed text
that actually has the lower error rate. This means that this ‘oracle’ classifier yields the
theoretically optimal decision, but is also impossible to implement in a scenario where the
truth texts are not known beforehand. It will serve as a baseline for estimating the gain
that a combined transcription method can achieve.
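The decision rule of the ‘oracle’ classifier can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used for the experiments: the function names are hypothetical, the CER is assumed to be the Levenshtein distance divided by the length of the truth text, and ties are assumed to fall back to the line-wise (CTC) transcription.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(transcription: str, truth: str) -> float:
    """Character error rate; assumes a non-empty truth text."""
    return levenshtein(transcription, truth) / len(truth)


def oracle_select(mdcc_text: str, ctc_text: str, truth: str) -> str:
    """Return the transcription with the lower CER (assumption: ties go to CTC)."""
    return mdcc_text if cer(mdcc_text, truth) < cer(ctc_text, truth) else ctc_text


if __name__ == "__main__":
    truth = "hello world"
    print(oracle_select("hello world", "hallo wurld", truth))
```

Because the truth text is an input of `oracle_select`, the sketch makes explicit why this classifier cannot be used outside of an evaluation setting.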
Table 9.2 details the character error rates when applying the ‘oracle’ classifier of
Figure 9.1.1 to the line- and paragraph-wise transcription models of Chapter 7 and the
evaluation data of Table 9.1. It shows that there is indeed a small improvement in character
error rate when choosing between a line- and paragraph-wise transcription on an
example-by-example basis.
The following pages of this chapter will discuss three approaches for building clas-
sifiers that decide between line- and paragraph-wise transcription to reduce the overall
CER of the system. These three classifiers do not, in contrast to the ‘oracle’ classifier,
rely on knowledge of the truth data and thus will be applicable to real scientific and indus-
trial systems. They continue to use the data split of Table 9.1 and an overview similar to
Figure 9.1.1 will be provided for each classifier.
One note on the approach that was chosen for the classifiers of this chapter: all three
classifiers are based on the idea that the line- and the paragraph-wise transcription each
produce the transcribed text of the full paragraph on their own. The classifiers only choose
which of the two transcriptions is expected to yield the lower error rate. Another approach
to this problem would be to merge the two transcribed texts, producing a new transcription
that differs from each of the two individual ones but yields an overall lower error
rate. However, this is not the approach of this chapter since useful merging of natural
texts exceeds the scope of this thesis.
¹ https://github.com/tesseract-ocr/tesseract/, version 4.1.1
² https://www.gnu.org/software/ocrad/, version 0.27
[Diagram; node labels: Line Segm., MDCC, CTC, Transcr. String (twice), Truth Text, Lower CER, XOR, Transcr. (twice)]
Figure 9.1.1: Diagram showing the data flow and application of methods in an ‘oracle’ classifier
that knows the truth text and always makes the correct decision that leads to the
lowest possible character error rate (CER). Green boxes are data, blue ones meth-
ods. Yellow signifies the classifier and dashed lines are exclusive to each other,
depending on the classifier decision.
Table 9.2: Average character error rates (CER) for connectionist temporal classification (CTC)
and multi-dimensional connectionist classification (MDCC) on full paragraphs of the
test set of the IAM offline handwriting database while using different line offsets and
line segmentation methods. The last two columns combine both line- and paragraph-
wise transcription by applying an ‘oracle’ classifier which always makes the correct
decision. Thus the last two columns give the percentage of examples in which MDCC
yields a lower transcription CER than CTC does and the resulting combined average
CER if always making the optimal choice. This serves as the lower bound in the error
rate when applying a non-all-knowing classifier. This table is based on Table 7.3.
                            CER               ‘Oracle’
Line Offs.   Line Segm.     CTC      MDCC     MDCC selected in   CER
0 mm         Ground Truth    7.94    10.22    15.09%              7.78
0 mm         Tesseract      16.74    10.22    63.79%              9.48
0 mm         Ocrad          18.48    10.22    67.24%              9.38
0 mm         A* Paths       16.31    10.22    71.55%              9.37
3 mm         Tesseract      20.53    10.80    68.53%             10.10
3 mm         Ocrad          16.87    10.80    68.10%              9.90
3 mm         A* Paths       19.09    10.80    75.86%             10.01
5 mm         Tesseract      27.58    12.76    75.86%             11.72
5 mm         Ocrad          20.19    12.76    64.22%             11.50
5 mm         A* Paths       24.83    12.76    81.47%             11.91
10 mm        Tesseract      74.77    31.20    95.26%             30.75
10 mm        Ocrad          56.87    31.20    90.52%             30.33
10 mm        A* Paths       63.77    31.20    95.26%             30.93
Average      Tesseract      34.90    16.24    -                  15.51
Average      Ocrad          28.10    16.24    -                  15.27
Average      A* Paths       31.00    16.24    -                  15.55
9.2 Classifier on Paragraph Images
Classifier
The first classifier type that we will discuss in this chapter is the application of a
support-vector machine (SVM)[10, 30, 49, 147] on grayscale paragraph images. In its basic
formulation, an SVM is a two-class classifier that finds the separating hyperplane with the
optimal margin. ‘Optimal margin’ in this case means that the minimal distance between any
training data point and the separating hyperplane is maximized within the feature space. As
such an SVM is very robust to overfitting, given the general limitations regarding the
number of features and number of training samples available. The support-vector machine
method can be applied to non-linear classification problems by implementing the kernel
trick [57] in which the feature space is implicitly mapped to a high-dimensional space via a
non-linear mapping. The SVM implementation of Scikit-Learn³ was applied in the course
of the experiments of this section.
In the work detailed in the following paragraphs, the two classes of the
support-vector machine are an expected lower error rate during line- or paragraph-wise
transcription, respectively. The input feature space consists of handcrafted features
extracted from the raw grayscale image of a paragraph of handwritten text. A radial basis
function (RBF)[148] kernel was applied to this feature space in order to allow for a
non-linear separating surface in the SVM. Support-vector machines include a hyper-parameter
C that weights the slack variables and thereby controls the regularization during hyperplane
fitting. It is referred to as the ‘slack variable’ here and was chosen as C = 2 in the
course of the work of this chapter. Increasing the slack variable decreases the
regularization and fits the training data more closely, even at the cost of overfitting.
This hyper-parameter and the kernel function were chosen by an exhaustive search over a
predefined set of hyper-parameters. The grid-search implementation of Scikit-Learn was
utilized to this end. A different kernel function and a different set of hyper-parameters
may yield better results when the proposed method is applied to a different data set of
handwritten paragraphs.
³ https://scikit-learn.org, version 0.23.2
Figure 9.2.1 details this approach to selecting line- or paragraph-wise transcription
per paragraph image using a support-vector machine. During transcription, the first step
is to extract features from the grayscale paragraph image. This will be the topic of the
next section. The extracted features are then fed into the SVM, which infers an estimate
of which of the two methods for transcription will yield the lower character error rate. Only
the selected method will then be applied, reducing the overall runtime to the minimum
necessary in this approach.
[Diagram; node labels: Feature Extraction, Model Parameters, SVM, XOR, Line Segm., MDCC, CTC, Transcr. (twice)]
Figure 9.2.1: Diagram similar to Figure 9.1.1, but detailing the application of a support-vector
machine on features extracted from the raw grayscale images. Dashed lines are
only executed for one of the two alternatives.
Feature Space
The support-vector machine was provided with a handcrafted feature space in which
the features were extracted from the grayscale image of a handwritten paragraph. The
number of dimensions in this feature space needs to be constant for all examples within
the data set, which stands in contrast to the variable width and height of the paragraph
images, which are of a constant resolution of 300 dpi. Features of constant dimension
were extracted from the variable-sized paragraph images by first resizing the paragraphs
to a standard height, leaving only a variable width in order to keep the original aspect
ratio of the image. The standard height, to which the images were resized, was set to
the mean height plus one standard deviation over the images within the training data
set. Assuming a normal distribution of the height of paragraph images, this leads to an
increase of the height in roughly 84 percent of images, since that share of a normal
distribution lies below the mean plus one standard deviation.
The remaining variable width of each paragraph image was eliminated by applying a
windowing approach. Each paragraph image was divided into five equally sized windows
along the horizontal dimension. The overall image was treated as the sixth window.
Separation of the overall paragraph image into five windows along the horizontal axis
allows for slightly slanted or curved text lines while reducing the impact of these curved
base lines on the stability of the extracted features. The feature extraction treats each
pixel row of each window as one spatial position. The total number of dimensions in the
feature space is thereby reduced to a constant: six windows times the standard height in
pixels times the number of different features extracted. To this end all features were
extracted by marginalization over the horizontal axis of the respective window, which
eliminates biases introduced by variable width windows. Figure 9.2.2 details this approach
for an example of the IAM offline handwriting database.
All grayscale images were normalized to a mean pixel value of 0 and a standard
deviation of pixel values of 1 over the training data set. This same normalization was
also applied to samples outside of the training set. Features were extracted from these
normalized brightness values. The features were designed in such a way that they allow
identification of overlapping text lines within the same pixel row. The types of features
extracted from each of the windows were as follows, with each feature being extracted per
pixel row:
• The mean pixel intensity and the corresponding standard deviation.
• The value span in brightness between the darkest and brightest pixel.
• The number of transitions from a dark to a bright pixel, that is the transition from
< 128 to ≥ 128 in unnormalized grayscale images. This count was divided by the
width of the window in pixels.
• As above, the normalized number of transitions from bright to dark pixels within the
row.
• The sum of the two above pixel transition quantities, that is the sum of transitions
between dark and bright pixels, independent of their direction of transition.
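The listed per-row features can be sketched with NumPy as follows. This is an illustrative sketch under several assumptions rather than the code used in the experiments: the image is assumed to be already resized to the standard height, the function names and the normalization arguments `norm_mean` and `norm_std` are hypothetical, and mean and standard deviation are counted as two separate features, giving six values per pixel row and window.

```python
import numpy as np


def row_features(window, norm_mean, norm_std, threshold=128):
    """Per-row features of one window (2-D uint8 array, rows x columns)."""
    norm = (window.astype(np.float64) - norm_mean) / norm_std
    bright = window >= threshold                      # thresholded on unnormalized values
    width = window.shape[1]
    dark_to_bright = np.sum(~bright[:, :-1] & bright[:, 1:], axis=1) / width
    bright_to_dark = np.sum(bright[:, :-1] & ~bright[:, 1:], axis=1) / width
    return np.stack([
        norm.mean(axis=1),                            # mean intensity
        norm.std(axis=1),                             # standard deviation
        norm.max(axis=1) - norm.min(axis=1),          # value span
        dark_to_bright,
        bright_to_dark,
        dark_to_bright + bright_to_dark,              # transitions in either direction
    ], axis=1)


def paragraph_features(image, norm_mean, norm_std, n_windows=5):
    """Concatenate row features of the horizontal windows and the overall image."""
    windows = np.array_split(image, n_windows, axis=1) + [image]
    return np.concatenate(
        [row_features(w, norm_mean, norm_std).ravel() for w in windows])


if __name__ == "__main__":
    # Toy image: 4 pixel rows of alternating dark and bright pixels.
    image = np.tile(np.array([0, 255] * 5, dtype=np.uint8), (4, 1))
    features = paragraph_features(image, norm_mean=127.5, norm_std=127.5)
    print(features.shape)  # (n_windows + 1) * height * 6 values
```

The length of the resulting vector depends only on the image height and the number of windows, which is what makes the feature space constant across paragraphs of different width.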
The basic idea behind these features is that pixel rows that strike through a text line
end-to-end will be of average brightness and contain many transitions between dark
and bright pixels. This is because these pixel rows will contain many pen strokes that
are dark, but also many bright spots between and within the glyphs. Pixel rows that
were intended by the writer as separators between two adjacent text lines, but contain
overlapping glyphs of these text lines, will display different characteristics. These pixel
rows are assumed to be mostly of bright intensity since they contain only few pen strokes.
They also show only few transitions between dark and bright pixels. However, they
still show a large value span between the brightest and darkest pixel. Pixel rows that
are separators between text lines and do not contain overlaps of glyphs will on average
be very bright with no transitions between dark and bright pixels and also a low value
span within the pixel intensities.
[Figure 9.2.2 image; panel labels: Resize to standard height, Overall window, Win. 1 to Win. 5]
Figure 9.2.2: Resizing of the paragraph image to a standard height and application of five
vertical windows and one overall window to obtain a standard number of features per
paragraph. The lower part shows the features as they are extracted as a marginalization
of the horizontal axis of each window. Each feature can thus be interpreted
as a distribution of values over the vertical axis of each window, here indicated as a
histogram along the vertical axis. Please note that the exemplified features do not
correspond to the paragraph image above.
The proposed feature space contains a high number of dimensions which together
with a non-linear kernel function increases the risk of overfitting to the training data.
Analysis of variance (ANOVA)[32] was applied as a univariate feature selection strategy to
reduce the number of features to one quarter of the number of training examples, in this
case a total of 213 features. ANOVA as a feature selection strategy is based on the
assumption that an observed variable, in this case the classification in two classes, is
actually a mixture of a set of variables, in this case the features. ANOVA selects the top-n
features that best explain the variance within the classification target.
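The combination of univariate ANOVA feature selection, an SVM and an exhaustive hyper-parameter search can be sketched with Scikit-Learn as shown below. This is a hedged illustration and not the configuration of the experiments: the data is a synthetic stand-in generated with `make_classification`, the parameter grid is an assumed example, and the three-fold cross-validation is an assumption made for brevity, whereas this chapter uses the separate validation 2 split for hyper-parameter optimization.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the image features and the two target classes.
X, y = make_classification(n_samples=200, n_features=400,
                           n_informative=20, random_state=0)

# Keep one quarter as many features as there are training examples.
k = len(X) // 4

pipeline = Pipeline([
    ("anova", SelectKBest(score_func=f_classif, k=k)),  # ANOVA F-test per feature
    ("svm", SVC()),
])

# Assumed example grid; the values searched in the thesis are not stated.
param_grid = {
    "svm__kernel": ["linear", "rbf", "poly"],
    "svm__C": [0.5, 1.0, 2.0, 4.0],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)

if __name__ == "__main__":
    print(search.best_params_)
```

Placing the feature selection inside the pipeline means that it is refitted on the training portion of every fold, so the selected features never depend on the held-out examples.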
Empirical Evaluation
This feature space in combination with the SVM hyper-parameters described above was
applied to the task at hand. The evaluation is discussed in the next paragraphs. The
split between training, validation and evaluation data set as used in these experiments is
detailed in Table 9.1.
Table 9.3 shows the error rates of the support-vector machine in percent of wrongly
classified examples for both the training and validation data. It shows a tendency
towards a lower error rate with a higher line offset, which is to be expected based on the
observation that the character error rate in line-wise transcription increases faster than
in paragraph-wise transcription. As Table 9.2 shows, the number of examples that
truly should be transcribed using MDCC increases with an increasing line offset. In most
rows of Table 9.3 the training error is clearly below the validation error, so overfitting
is to be expected in this approach.
Table 9.3: Percentages of wrongly classified examples from the training and validation data while
using the image-based SVM classifier of Figure 9.2.1.
                            Error Rate
Line Offs.   Line Segm.     Training   Validation
0 mm         Ground Truth    2.44%     10.38%
0 mm         Tesseract      22.87%     41.96%
0 mm         Ocrad          24.26%     43.64%
0 mm         A* Paths       27.14%     40.35%
3 mm         Tesseract      24.51%     28.57%
3 mm         Ocrad          31.77%     37.27%
3 mm         A* Paths       29.37%     36.61%
5 mm         Tesseract      21.19%     29.46%
5 mm         Ocrad          31.17%     44.64%
5 mm         A* Paths       19.93%     30.36%
10 mm        Tesseract       2.59%      7.83%
10 mm        Ocrad           6.00%      9.73%
10 mm        A* Paths        3.29%      1.74%
Table 9.4 includes the character error rates when applying this image-based
support-vector machine classifier to the IAM offline handwriting database. The reported CER
is the average over all examples within the evaluation data set. These results show a slight
improvement in the CER when combining multi-dimensional connectionist classification
with CTC and the GNU Ocrad line segmentation on paragraphs in which the line spacing
was artificially reduced by 10 millimeters. However, in most cases MDCC alone results in a
lower error rate.
Table 9.4: Average CER when combining MDCC and CTC by applying a support-vector machine
(SVM) on features extracted from the grayscale image. Table layout is identical to Table
9.2 with the last two columns changed. The highlighted CER marks the instance where
the combined error rate is lower than any of the line- or paragraph-wise transcription
on its own.
                            CER               SVM on Images
Line Offs.   Line Segm.     CTC      MDCC     MDCC selected in   CER
0 mm         Ground Truth    7.94    10.22      0.0%              7.94
0 mm         Tesseract      16.74    10.22     99.57%            10.22
0 mm         Ocrad          18.48    10.22     87.07%            10.71
0 mm         A* Paths       16.31    10.22     99.14%            10.31
3 mm         Tesseract      20.53    10.80     88.79%            11.17
3 mm         Ocrad          16.87    10.80     87.50%            10.95
3 mm         A* Paths       19.09    10.80     94.83%            11.36
5 mm         Tesseract      27.58    12.76     94.83%            12.80
5 mm         Ocrad          20.19    12.76     96.55%            12.85
5 mm         A* Paths       24.83    12.76     98.71%            12.89
10 mm        Tesseract      74.77    31.20    100.0%             31.20
10 mm        Ocrad          56.87    31.20     97.41%            31.08
10 mm        A* Paths       63.77    31.20    100.0%             31.20
Average      Tesseract      34.90    16.24    -                  16.34
Average      Ocrad          28.10    16.24    -                  16.39
Average      A* Paths       31.00    16.24    -                  16.44
9.3 Classifier on Transcribed Texts
Classifier
The approach detailed in this section is to move the classifier from the beginning of the
pipeline (where only a grayscale image is available) to the end of the pipeline (where the
transcribed text is available). The classifier of this section is hence based on extracting
character n-grams from the transcribed text and comparing the n-gram frequencies of
both the line- and paragraph-wise transcription with a reference corpus to decide which of
the two transcriptions is closer to the expected n-gram frequency distribution.
Figure 9.3.1 outlines this classifier method. Both a line- and a paragraph-wise
transcription of the paragraph image are performed to obtain the two transcription variants.
All character n-grams of a constant size are then extracted from these transcribed texts.
Character n-grams were beforehand extracted from a reference corpus of natural texts
of the same language. Comparing the n-gram frequencies of the two transcribed texts
with those of the reference corpus by means of the Jensen-Shannon divergence[87, 97] yields
the classification of which of the two texts is closer to the reference corpus. The one with
the lower Jensen-Shannon divergence between the n-gram frequencies of the transcribed
text and the reference corpus is assumed to contain fewer transcription errors.
N-Grams and Reference Corpus
A character n-gram as applied in the context of this classifier is a contiguous sub-sequence
of a constant number of characters from a natural text or transcription thereof. The
proposed classifier only extracts full n-grams from the texts, that is, it uses no partial
n-grams of shorter length that occur at the beginning or end of the text. Figure 9.3.2 shows
the extraction of character 3-grams from the text ‘Hello World’.
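The extraction of full character n-grams can be sketched in Python as follows. The function names are hypothetical and chosen for illustration only; the sketch slides a window of constant size over the text and therefore never produces partial n-grams.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list:
    """All contiguous character n-grams of exactly length n, in reading order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(text: str, n: int = 3) -> Counter:
    """Number of occurrences of each character n-gram within the text."""
    return Counter(char_ngrams(text, n))


if __name__ == "__main__":
    print(char_ngrams("Hello World"))
```

A text that is shorter than n characters yields an empty list, which is consistent with using full n-grams only.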
The first step in the proposed n-gram based classifier is to extract character n-grams
in this fashion from the reference corpus of natural texts of the same language. In the
case of this work, the truth texts of the training data set were used as the reference
corpus. All truth texts of the IAM offline handwriting database are taken from the
Lancaster-Oslo/Bergen (LOB) corpus[62] and as such are of the same language. The sample
split of the large writer independent text line recognition task on the IAM database is
furthermore designed in such a way that the splits are mutually exclusive regarding their
truth texts. That is, no truth text and no writer occurs in more than one sample split. The
splits being mutually exclusive makes the choice of the training data set as the reference
corpus a safe one without risk of leaking information from the evaluation into the training
data. Character n-grams were hence extracted from the training data and kept with their
respective counts for further use.
[Diagram; node labels: Line Segm., MDCC, CTC, Transcr. (twice), N-Grams (twice), N-Grams of Training Set, Lower Jensen-Shannon Divergence, XOR, Transcr. (twice)]
Figure 9.3.1: Diagram showing the application of n-gram frequencies and the Jensen-Shannon
divergence to decide between line- and paragraph-wise transcription. Dashed lines
are only executed for one of the alternatives.
[Diagram; labels: Hello World, Hel ... o W ... rld, ell ... orl]
Figure 9.3.2: Example of extracting character 3-grams from the text ‘Hello World’. Only a subset
of contained 3-grams is shown to outline the process in the beginning, mid and end
part of the text. The full list of 3-grams in this text is ‘Hel’, ‘ell’, ‘llo’, ‘lo ’, ‘o W’, ‘ Wo’,
‘Wor’, ‘orl’ and ‘rld’.
The classifier proposed in this section uses character 3-grams for measuring the
divergence between two texts. Natural language processing methods that operate on the
basis of n-grams typically allow the configuration of the n-gram size. Shorter n-grams are
less prone to transcription errors since there are more n-grams in total extracted from
the same text and there is less of a chance of a transcription error corrupting a specific
individual n-gram. Longer n-grams carry more information and are thus better suited
for the detection of transcription errors. However, since there are fewer n-grams in total,
a single transcription error may overly impact the result. The choice of the n-gram size is
therefore a trade-off between the robustness of the classifier and the significance of each
transcription error. The choice of character 3-grams in this work was based on preliminary
experiments with lengths of two, three and four characters per n-gram.
We will use the symbol $c_{i,t}$ throughout this section to denote the number of
occurrences of the character 3-gram $i$ in text or corpus $t$. The three corpora and texts
in use are $R$ for the reference corpus and $M$, $N$ for the two transcribed texts. A count
of $c_{i,R} = 1$ is assumed whenever the n-gram $i$ does not occur in corpus $R$, but does
so in either transcription $M$ or $N$. The same holds for n-grams that are missing from one
transcribed text but occur in the other text or in the reference corpus. Assuming a default
count whenever an n-gram does not exist within a specific text, comparable to an
out-of-vocabulary word, is necessary to avoid numerical instabilities and the omission of
information from the metrics in use.
Frequencies and Jensen-Shannon Divergence
We will now discuss how to apply the Jensen-Shannon divergence in a classifier that
estimates whether the line- or paragraph-wise transcribed text is closer to the natural
language of the training corpus. The Jensen-Shannon divergence is a symmetric variant of the
Kullback-Leibler divergence (KL)[76], which we need to define first. The KL divergence
measures the uncertainty about a reference (unobserved) probability distribution given
an observed probability distribution. That is, it measures the mean number of additional
bits required to encode the symbols of the reference distribution given the encoding scheme
of the observed distribution. The KL divergence is asymmetric in its two arguments. The
Jensen-Shannon divergence removes this asymmetry by averaging the Kullback-Leibler
divergences of the reference and the observed probability distribution, each measured in
relation to the mean of both distributions. The next few paragraphs will detail the
Jensen-Shannon divergence applied to character n-grams.
The frequency
\[
  f_{i,t} = \frac{c_{i,t}}{\sum_{j \in t} c_{j,t}} \tag{9.3.1}
\]
of an n-gram $i$ within a text or corpus $t$ is its number of occurrences $c_{i,t}$
normalized by the total number of n-gram occurrences within the text or corpus.
Based on these n-gram frequencies, the Kullback-Leibler divergence
\[
  \mathrm{KL}(U,V) = - \sum_{i \in U \cup V} \left[ f_{i,U} \log\left( \frac{f_{i,V}}{f_{i,U}} \right) \right] \tag{9.3.2}
\]
measures the average number of additional bits required to encode the n-grams of text or
corpus $U$ given a coding scheme based on the n-gram frequencies of text or corpus $V$.
As discussed above, the Jensen-Shannon divergence is based on the KL divergence
when measured in relation to the mean of distributions $U$ and $V$. This averaged
distribution is a finite discrete set that contains the union of n-grams
\[
  Q = U \cup V \tag{9.3.3}
\]
with their frequencies
\[
  f_{i,Q} = \frac{1}{2} \left[ f_{i,U} + f_{i,V} \right] \tag{9.3.4}
\]
being the average of frequencies given by sets $U$ and $V$.
The Jensen-Shannon divergence of text $t$, in our case either transcription $M$ or $N$, to
the reference corpus is then the average of the two KL divergences towards the mean $Q$
of $t$ and $R$. Equation 9.3.5 details the Jensen-Shannon divergence.
\[
  \mathrm{JSD}(t,R) = \frac{1}{2} \left[ \mathrm{KL}(t,Q) + \mathrm{KL}(R,Q) \right], \quad Q = t \cup R \tag{9.3.5}
\]
The classifier of this section computes the Jensen-Shannon divergence JSD(M,R) of
the transcribed text M and the reference corpus R. The same computation is performed
for the transcribed text N and the reference corpus. Whichever transcription produces the
lower Jensen-Shannon divergence is assumed to be the one with fewer character errors
since its n-gram frequencies are closer to the expectation set by the reference corpus.
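Equations 9.3.1 to 9.3.5 and the resulting decision rule can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation behind the reported results: the function names are hypothetical, the logarithm is taken to base 2 so that the divergence is measured in bits, the default count of 1 is applied over the union of the n-grams of M, N and R, and ties are resolved in favour of the first transcription.

```python
from collections import Counter
from math import log2


def char_ngrams(text, n=3):
    """Full character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def frequencies(counts, vocabulary, default_count=1):
    """Equation 9.3.1, with a default count for n-grams missing from the text."""
    filled = {g: counts.get(g, 0) or default_count for g in vocabulary}
    total = sum(filled.values())
    return {g: c / total for g, c in filled.items()}


def kl(f_u, f_v):
    """Equation 9.3.2: -sum f_U * log(f_V / f_U) = sum f_U * log(f_U / f_V)."""
    return sum(p * log2(p / f_v[g]) for g, p in f_u.items())


def jsd(f_t, f_r):
    """Equations 9.3.3 to 9.3.5: mean of both KL divergences towards the average Q."""
    f_q = {g: 0.5 * (f_t[g] + f_r[g]) for g in f_t}
    return 0.5 * (kl(f_t, f_q) + kl(f_r, f_q))


def select_transcription(text_m, text_n, reference, n=3):
    """Return the transcription whose n-gram frequencies are closer to the reference."""
    c_m, c_n, c_r = (Counter(char_ngrams(t, n)) for t in (text_m, text_n, reference))
    vocabulary = set(c_m) | set(c_n) | set(c_r)
    f_m, f_n, f_r = (frequencies(c, vocabulary) for c in (c_m, c_n, c_r))
    return text_m if jsd(f_m, f_r) <= jsd(f_n, f_r) else text_n


if __name__ == "__main__":
    reference = "the cat and the hat"
    print(select_transcription(reference, "xyzuvw qrs", reference))
```

Because every n-gram of the shared vocabulary receives a count of at least 1, all frequencies are strictly positive and the logarithms in `kl` are always defined.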
Empirical Evaluation
As with the support-vector machine of Section 9.2, this character n-gram based
model classifies each paragraph of the IAM database into two categories. Either the
line-wise or the paragraph-wise transcription is assumed to yield the lower character error
rate. An evaluation of the raw classification errors is provided in Table 9.5, where the
percentage of wrongly classified examples is listed for the training and validation sets.
The training set was used as the reference corpus.
Table 9.6 includes the resulting average CER when applying this 3-gram classifier to
the IAM database. There are several cases where the character error rate is lowered
in comparison to either line-wise transcription using connectionist temporal classification
or paragraph-wise transcription using multi-dimensional connectionist classification. On
average, an improved CER can be expected when applying this classifier in combination
with MDCC, CTC and either the line-segmentation of Tesseract or based on A* path
planning[140].
Table 9.5: Percentages of wrongly classified examples from the training and validation data while
using the 3-gram classifier of Figure 9.3.1.
                            Error Rate
Line Offs.   Line Segm.     Training   Validation
0 mm         Ground Truth   48.41%     52.83%
0 mm         Tesseract      26.89%     37.50%
0 mm         Ocrad          44.09%     49.09%
0 mm         A* Paths       26.39%     28.95%
3 mm         Tesseract      23.77%     27.68%
3 mm         Ocrad          36.93%     37.27%
3 mm         A* Paths       21.00%     24.11%
5 mm         Tesseract      17.07%     21.43%
5 mm         Ocrad          31.41%     32.14%
5 mm         A* Paths       14.44%     14.29%
10 mm        Tesseract       2.35%      2.61%
10 mm        Ocrad           6.82%      3.54%
10 mm        A* Paths        2.23%      0.87%
Table 9.6: Average CER when combining MDCC and CTC by applying a classifier based on the
3-gram character frequencies of the transcribed strings compared with the 3-gram
character frequencies of the training truth texts as the reference corpus. Table layout
is identical to Table 9.2 with the last two columns changed. Highlighted error rates
mark the instances where the combined error rate is lower than any of the line- or
paragraph-wise transcription on its own.
                            CER               3-Gram Freq.
Line Offs.   Line Segm.     CTC      MDCC     MDCC selected in   CER
0 mm         Ground Truth    7.94    10.22    60.78%              9.28
0 mm         Tesseract      16.74    10.22    75.86%             10.22
0 mm         Ocrad          18.48    10.22    64.22%             12.73
0 mm         A* Paths       16.31    10.22    74.57%             10.35
3 mm         Tesseract      20.53    10.80    76.72%             10.89
3 mm         Ocrad          16.87    10.80    64.66%             12.80
3 mm         A* Paths       19.09    10.80    82.76%             10.60
5 mm         Tesseract      27.58    12.76    80.17%             12.54
5 mm         Ocrad          20.19    12.76    69.83%             13.63
5 mm         A* Paths       24.83    12.76    85.34%             12.38
10 mm        Tesseract      74.77    31.20    96.55%             30.99
10 mm        Ocrad          56.87    31.20    91.81%             30.69
10 mm        A* Paths       63.77    31.20    98.71%             31.07
Average      Tesseract      34.90    16.24    -                  16.16
Average      Ocrad          28.10    16.24    -                  17.46
Average      A* Paths       31.00    16.24    -                  16.10
9.4 Classifier on Segmentation Information
Classifier
The last classifier that this chapter proposes in order to decide between paragraph- and
line-wise transcription utilizes geometric information provided by the line-segmentation
algorithms. Both Tesseract and GNU Ocrad allow retrieving additional information
about the extracted lines while applying the line segmentation to paragraph images.
Tesseract distinguishes between different ‘segmentation levels’: for example, level 4
denotes whole lines and level 5 denotes words within lines. The same is true for GNU Ocrad,
but with different wording for the levels. The segmentation output of both GNU Ocrad and
Tesseract consists of axis-parallel rectangles around the respective segments. In this
chapter the classifier uses the provided coordinates of the four corner points of each line
and word to extract features from paragraph images while using the segment numbering
provided by the algorithm for topological ordering of the features. These coordinates are
then transformed to create a feature space suitable for a support-vector machine as the
classifier. Figure 9.4.1 outlines this approach.
[Diagram; node labels: Tesseract Segm., GNU Ocrad Segm., Geometric Info (twice), Model Parameters, SVM, XOR, Line Segm., MDCC, CTC, Transcr. (twice)]
Figure 9.4.1: Flow when classifying each example according to the geometric information, that
is corner points of lines, provided by the segmentation algorithms of Tesseract and
GNU Ocrad. Dashed lines are only executed for one of the alternatives.
As with the SVM of Section 9.2, the two classes of the task are designed to indicate whether
a line- or paragraph-wise transcription can be expected to yield a lower error rate.
Scikit-Learn was again employed for an exhaustive search of hyper-parameters and kernel
functions. In the SVM of this section, a polynomial kernel of degree 5 was selected. A
slack variable of C = 0.5 was used, which increases the strength of the regularization and
thus reduces overfitting of the SVM parameters to the training data set. The data split as
indicated in Table 9.1 was used for training and validation. Applying the trained SVM
to previously unseen paragraph images yields the classification of whether a
line- or paragraph-wise transcription is expected to yield a lower character error rate.
Feature Space
Features in this method are extracted by applying the line-segmentation algorithms of
Tesseract and GNU Ocrad to the paragraph image, retrieving line- and word-level
segmentation information. The coordinates of each segment are transformed to a minimal
axis-parallel rectangle around this segment. Figure 9.4.2 shows an example of this with
a paragraph image from the IAM database. Line-level segmentation is shown in green,
word-level segmentation in red. The included segmentation is not exhaustive for this
example and there are more lines and words contained than marked in the figure. It serves
as an example to illustrate this approach.
Figure 9.4.2: Example paragraph image from the IAM offline handwriting database showing the
different segmentation levels employed for this classifier. Green rectangles mark
the line-level segmentation, red ones the word-level segmentation. Please note that
the included segmentation is not exhaustive and serves as an example only. The
example shows that one true text line may decompose into multiple line segments.
The number of segments extracted from the paragraph image is variable since the number
of text lines differs between paragraphs and each text line may contain
a different number of words. As discussed in Section 7.5, there is a maximum of 12 text
lines per paragraph in the IAM offline handwriting database, according to the ground truth
data included with the data set. In order to obtain a feature space with a fixed number of
dimensions, the number of line segments included in the feature transformation was set
to 20. Paragraphs with more line segments were truncated to the first 20 and the ones
with fewer were filled up with features that cannot occur naturally. The ground truth data
of the IAM database indicates an average of 7 words per text line. For feature extraction
the number of word segments was set to 20 per text line.
The features for use in this support-vector machine are transformed per line or word
segment and are designed to distinguish between valid and invalid segmentation results.
The first step in the proposed feature transformation is to normalize the coordinates of
the four corner points of each segment to an interval of [0, 1], with the coordinate (0, 0)
being the top-left corner and the coordinate (1, 1) always being the bottom-right corner
of the paragraph image. This prevents the variable size of the paragraph images from
influencing the feature space.
Line and word segments are processed during feature transformation in such an order
that a top-to-bottom order for lines and a left-to-right order for words can be assumed.
This order of processing is based on the topological order of segments as provided by the
line segmentation algorithm. It is not based on the pixel coordinates of the segments.
This is an important distinction in this feature transformation since a valid segmentation
result iterated in topological order should also produce a top-to-bottom and left-to-right
ordering in pixel coordinates. Invalid orderings will result in backward jumps in pixel
coordinates if text lines or words were extracted in the wrong order during segmentation.
Figure 9.4.3 shows such a backward jump, in which the topological ordering given by the
segmentation algorithm is written in the rectangles and arrows indicate movements in
pixel space when following the top-left corner of each line segment. A valid movement can
be observed from line one to two, but not from two to three. The features included in the
feature space are designed to recognize such cases.
Figure 9.4.3: Example of invalid line segments. The line numberings are given as the topological
order by the segmentation algorithm. However, an invalid movement in pixel space
is observed going from line two to three.
Another potential problem in line- and word-level segments is overly large jumps in
pixel space. Large gaps in pixel space while moving through segments in topological
order can be interpreted as words or text lines that exist within the paragraph image
but were missed by the line segmentation algorithm. Such gaps are also visible in the
proposed feature space.
The features included in the proposed feature space are thus as follows, with all
feature types being transformed from both the Tesseract and GNU Ocrad segmentation:
• The four corner points of the first 20 line segments with pixel coordinates normalized
to a [0, 1] interval. If there are fewer than 20 line segments, the remaining ones are
filled up with values of −1, which cannot occur in true line segments.
• In the same way, the four corner points of the first 20 word segments per text line.
• The contained area per word segment in normalized coordinates.
• The vector from the top-left corner of the previous word segment to the top-left corner
of the current word segment.
• In the same way, the vector from the previous bottom-right corner to the current
top-left corner.
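The feature types listed above can be illustrated with a minimal sketch. It assumes that each segment is given as an axis-parallel rectangle `(left, top, right, bottom)` in pixels and already sorted in the topological order of the segmentation algorithm. All function names and the corner ordering are hypothetical choices for illustration and are not taken from the original implementation.

```python
# Illustrative sketch only: names and data layout are assumptions.
PAD = -1.0          # filler value that cannot occur in normalized coordinates
MAX_SEGMENTS = 20   # number of segments kept, as stated in the text


def normalize(box, width, height):
    """Scale a pixel rectangle (left, top, right, bottom) to the [0, 1] interval."""
    left, top, right, bottom = box
    return (left / width, top / height, right / width, bottom / height)


def corner_features(boxes, width, height, max_segments=MAX_SEGMENTS):
    """Four corner points (8 values) per segment, truncated or padded with -1."""
    features = []
    for box in boxes[:max_segments]:
        l, t, r, b = normalize(box, width, height)
        features += [l, t, r, t, r, b, l, b]  # top-left, top-right, bottom-right, bottom-left
    features += [PAD] * (8 * max_segments - len(features))
    return features


def area(box, width, height):
    """Contained area of one segment in normalized coordinates."""
    l, t, r, b = normalize(box, width, height)
    return (r - l) * (b - t)


def jump_vectors(boxes, width, height):
    """Movement between consecutive segments when iterating in topological order."""
    norm = [normalize(b, width, height) for b in boxes]
    jumps = []
    for prev, cur in zip(norm, norm[1:]):
        top_left_to_top_left = (cur[0] - prev[0], cur[1] - prev[1])
        bottom_right_to_top_left = (cur[0] - prev[2], cur[1] - prev[3])
        jumps.append((top_left_to_top_left, bottom_right_to_top_left))
    return jumps
```

For line segments, a negative vertical component in `jump_vectors` corresponds to the backward jump illustrated in Figure 9.4.3, and an unusually large component to a potentially missed line or word.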
Like the one of Section 9.2, this feature space has a large number of dimensions, which
slows down optimization of the SVM parameters and increases the risk of overfitting to
the training data set. ANOVA was again applied to reduce the number of dimensions of the
feature space to one quarter of the number of training examples.
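The following sketch shows one way such an ANOVA-based reduction could look. It assumes that features are ranked by the one-way ANOVA F statistic between the two classes and that the best-ranked quarter (relative to the number of training examples) is kept. The text does not state which ANOVA variant or library routine was used, so both functions below are hypothetical.

```python
# Illustrative sketch only: ranking by the two-class one-way ANOVA F statistic
# is an assumption about how "ANOVA" was applied.
def anova_f(values_a, values_b):
    """One-way ANOVA F statistic of a single feature for two classes."""
    n_a, n_b = len(values_a), len(values_b)
    mean_a = sum(values_a) / n_a
    mean_b = sum(values_b) / n_b
    mean_all = (sum(values_a) + sum(values_b)) / (n_a + n_b)
    between = n_a * (mean_a - mean_all) ** 2 + n_b * (mean_b - mean_all) ** 2
    within = (sum((v - mean_a) ** 2 for v in values_a)
              + sum((v - mean_b) ** 2 for v in values_b))
    if within == 0.0:
        return float("inf") if between > 0.0 else 0.0
    # Degrees of freedom: 1 between the two classes, n - 2 within them.
    return between / (within / (n_a + n_b - 2))


def select_features(rows, labels, keep):
    """Indices of the `keep` features with the highest F statistic."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        a = [row[j] for row, y in zip(rows, labels) if y == 0]
        b = [row[j] for row, y in zip(rows, labels) if y == 1]
        scores.append(anova_f(a, b))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])
```

With `keep = len(rows) // 4` the number of retained dimensions equals one quarter of the number of training examples, as described above.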
Empirical Evaluation
The proposed SVM classifier and feature extraction were experimentally applied to the
IAM offline handwriting database with a data split as detailed in Table 9.1. The resulting
raw error rate, in terms of the percentage of wrongly classified paragraphs, is shown in
Table 9.7. These observations are similar to those for the SVM of Section 9.2 in that
there is some overfitting to the training data, but the overall error rate decreases
slightly for more difficult paragraphs, that is, ones with the offset between lines
artificially reduced.
Table 9.7: Percentages of wrongly classified examples from the training and validation data while
using the segmentation-based SVM classifier of Figure 9.4.1.
Line Offs.   Line Segm.      Error Rate
                             Training    Validation
0 mm         Ground Truth     2.80%      10.38%
0 mm         Tesseract       19.95%      40.18%
0 mm         Ocrad           21.43%      43.64%
0 mm         A* Paths        25.15%      36.84%
3 mm         Tesseract       22.92%      28.57%
3 mm         Ocrad           28.18%      37.27%
3 mm         A* Paths        26.94%      33.04%
5 mm         Tesseract       19.98%      33.04%
5 mm         Ocrad           27.80%      35.71%
5 mm         A* Paths        15.75%      26.79%
10 mm        Tesseract        1.88%       6.96%
10 mm        Ocrad            4.59%       6.19%
10 mm        A* Paths         1.88%       1.74%
Table 9.8 shows the character error rate when applying this classifier to the evaluation
set of the IAM database, computed as the average CER of the transcribed texts. The
transcriptions were retrieved either by applying multi-dimensional connectionist
classification to paragraph images or connectionist temporal classification to line
images, depending on the outcome of the SVM prediction. Combined transcription methods
that show a lower average CER than either MDCC or CTC alone are highlighted.
Marginalization over all artificial line offsets shows that in general a lower CER can be
expected when applying this method in combination with the GNU Ocrad line segmentation
algorithm.
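The per-paragraph selection described above can be sketched as follows. The sketch assumes that the average CER is the mean over per-paragraph error rates and that the classifier outputs one boolean per paragraph. The function name and the toy numbers in the usage note are hypothetical and are not results from this thesis.

```python
# Illustrative sketch only: names and the averaging scheme are assumptions.
def combined_cer(use_mdcc, cer_ctc, cer_mdcc):
    """Average CER when choosing MDCC or CTC per paragraph.

    use_mdcc[i] is True if the classifier selects the paragraph-wise
    transcription (MDCC) for paragraph i, otherwise the line-wise one (CTC).
    Returns the average CER and the share of paragraphs routed to MDCC.
    """
    chosen = [m if flag else c for flag, c, m in zip(use_mdcc, cer_ctc, cer_mdcc)]
    mdcc_share = sum(1 for flag in use_mdcc if flag) / len(use_mdcc)
    return sum(chosen) / len(chosen), mdcc_share
```

For example, `combined_cer([True, False, True, True], [5.0, 8.0, 30.0, 12.0], [10.0, 12.0, 9.0, 11.0])` returns an average CER of 9.5 with 75% of the paragraphs transcribed by MDCC. The two returned values correspond to the kind of quantities reported in the last two columns of Table 9.8.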
Table 9.8: Average CER when combining MDCC and CTC by applying an SVM on the coordinates
provided by the Tesseract and GNU Ocrad line segmentation algorithms. Table layout
is identical to Table 9.2 with the last two columns changed. Highlighted error rates
(marked with *) are the instances where the combined error rate is lower than both
the line-wise and the paragraph-wise transcription on its own.
                             CER               SVM on Segm.
Line Offs.   Line Segm.      CTC      MDCC     MDCC selected in   CER
0 mm         Ground Truth     7.94    10.22      0.0%              7.94
0 mm         Tesseract       16.74    10.22     98.28%            10.24
0 mm         Ocrad           18.48    10.22     88.36%            10.10 *
0 mm         A* Paths        16.31    10.22     99.14%            10.25
3 mm         Tesseract       20.53    10.80     96.12%            10.76 *
3 mm         Ocrad           16.87    10.80     80.60%            10.79 *
3 mm         A* Paths        19.09    10.80     95.69%            11.11
5 mm         Tesseract       27.58    12.76     98.71%            12.80
5 mm         Ocrad           20.19    12.76     81.03%            12.47 *
5 mm         A* Paths        24.83    12.76     96.12%            13.21
10 mm        Tesseract       74.77    31.20     98.28%            31.25
10 mm        Ocrad           56.87    31.20     92.24%            31.54
10 mm        A* Paths        63.77    31.20     99.57%            31.17 *
Average      Tesseract       34.90    16.24     -                 16.26
Average      Ocrad           28.10    16.24     -                 16.22 *
Average      A* Paths        31.00    16.24     -                 16.43
9.5 Discussion
This chapter discussed and detailed three methods for deciding whether line- or
paragraph-wise transcription should be used for a specific paragraph. The goal of this
decision is to reduce the overall character error rate of the transcription process.
Connectionist temporal classification was employed for line-wise transcription and
multi-dimensional connectionist classification for paragraph-wise transcription. As
before, the line segmentation algorithms of Tesseract and GNU Ocrad as well as one based
on A* path planning[140] were used to obtain segmented line images. Results were
presented as empirical evaluations on the IAM offline handwriting database, evaluating
both the raw classification error of the underlying two-class classifier and the average
CER of the transcribed texts.
Section 9.2 proposed the application of support-vector machines to features extracted
directly from the grayscale paragraph image. Section 9.4 extracted features from the
coordinates of line and word segments provided by Tesseract and GNU Ocrad and set
up an SVM classifier on these features. Section 9.3 discussed an approach for comparing
the character n-grams of the transcribed texts to a reference corpus and choosing the
transcription that is closer to the expected n-gram distribution.
The evaluations of this chapter show that deciding whether line- or paragraph-wise
transcription should be applied while only observing the grayscale paragraph image is
difficult. The results in this case show only a slight improvement in CER for a single
evaluation case. Classifying the geometric information provided by the corner points of
line and word segments from two line segmentation algorithms yielded better results and a
lower error rate. However, the overall lowest transcription error was achieved by
comparing the transcribed texts to a reference corpus using the n-gram distribution. In
combination with the A* path planning segmentation, this approach reduced the CER from
31.00 for line-wise transcription and 16.24 for paragraph-wise transcription to 16.10 for
the combined transcription.
It is worth noting that all three proposed methods add further steps to the transcription
pipeline that produces transcribed texts from images of handwritten paragraphs. As
such, these methods should only be employed if the character error rate of the transcribed
texts is the main concern of the task at hand.
The methods of this chapter show that combining line- and paragraph-wise transcription
yields an overall lower error rate than either of the two methods alone, given that the
deep neural networks in use for both CTC and MDCC are similar in structure and capacity.
This shows that paragraph-wise transcription using MDCC can be a worthwhile
addition to handwriting recognition systems. It also reinforces the conclusion of Chapter
7 that MDCC is preferable to CTC for hard-to-segment paragraphs, as seen again in the
experiments with an artificially reduced line offset. Of course, the symmetry in the
DNN topologies between the MDCC and CTC methods can be broken and the neural networks
tuned towards either CTC or MDCC to achieve lower error rates. However, one of the goals
of this chapter is to show that MDCC as a method, in terms of its training system, loss
function and decoding algorithms, is a good addition to the transcription method proposed
with CTC.
Chapter 10
Dictionary-Based Decoding Algorithms
10.1 Overview and Relation to This Work
This chapter proposes and details two approaches to single-line decoding in
offline handwriting recognition. The two ideas of Sections 10.2 and 10.3 can potentially
be applied to both connectionist temporal classification (CTC), see Section 3.1, and the
line decoding in multi-dimensional connectionist classification (MDCC), which is detailed
in Chapter 5.
Improvement of single-line decoding methods is not the main focus of this thesis,
which deals with paragraph-wise transcription of handwritten text. We chose to focus
on paragraph-wise transcription in the context of this thesis instead of further following
the ideas discussed in the two following Sections 10.2 and 10.3. Please see the original
publications[119, 122] for more details on these two approaches.
10.2 Decoding using a Large Lexicon and Fuzzy Search
This section is based on the following paper:
Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Increasing Robustness
of Handwriting Recognition Using Character N-Gram Decoding on Large Lexica.” In:
2016 12th IAPR Workshop on Document Analysis Systems (DAS). Apr. 2016,
pp. 156–161. DOI: 10.1109/DAS.2016.43
Section 1.3 outlines the author's individual contributions.
Overview
Section 3.1 of this work discussed connectionist temporal classification (CTC)[46], which
is a method for transcribing one-dimensional sequences, e.g. single lines of handwritten
text or audio recordings. CTC introduced both a loss function for training deep neural
networks towards this task as well as decoding algorithms to retrieve a high-likelihood
label sequence from the DNN prediction. This decoded label sequence, e.g. a sequence
of glyphs from an alphabet, constitutes the transcription result based on the given input.
The deep neural networks employed for CTC end with a softmax function as their last
layer and as such produce the likelihood of each time step of the sequence being part of
a specific glyph of the alphabet, with the glyphs being mutually exclusive per time step.
Decoding the DNN output means uncovering a high-likelihood label sequence, a sequence of
glyphs, that explains this prediction. There are different methods to approach this task.
We have discussed best path decoding and beam search decoding before, both proposed by
Alex Graves for CTC[43, ch. 7.5.2]. Best path decoding identifies the single path, that
is one path through e.g. Figure 3.1.1, with the highest overall probability by greedily
collecting the glyph with the highest probability per time step. This decoding algorithm
is fast and low on memory usage, but transcription quality suffers if parts of the DNN
output are weakly predicted. Beam search decoding builds on best path decoding by
observing that there may be multiple paths that decode to the same label sequence. It
marginalizes over all paths that decode to the same sequence and in this way introduces
robustness against weakly predicted time steps. However, beam search decoding is
computationally expensive since it keeps and manages a trie with all prefixes of label
sequences known so far.
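As a minimal sketch of best path decoding as described above, the following code picks the most probable glyph per time step, merges adjacent repetitions and drops the separator. The function name and the use of `"-"` to stand for the glyph separator are assumptions made for illustration. The usage note applies it to the example prediction of Table 10.1.

```python
# Illustrative sketch only: "-" stands for the CTC glyph separator (epsilon).
BLANK = "-"


def best_path_decode(probs, alphabet):
    """Greedy CTC decoding; probs[t][k] is the probability of alphabet[k] at step t."""
    path = []
    for step in probs:
        best = max(range(len(alphabet)), key=lambda k: step[k])
        path.append(alphabet[best])
    decoded = []
    previous = None
    for glyph in path:
        # Adjacent repetitions collapse into one glyph, separators are removed.
        if glyph != previous and glyph != BLANK:
            decoded.append(glyph)
        previous = glyph
    return "".join(decoded)
```

Applied to the ten time steps of Table 10.1 with the alphabet `["A", "B", "C", "-"]`, the greedy path is `A A - B - B B - C C`, which collapses to the label sequence `ABBC`.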
The publication[119] outlined in this section proposes a decoding algorithm for
CTC based on matching character n-grams between the deep neural network prediction
and an index of possible strings. It proposes to build an index from a dictionary of
strings beforehand, then to extract character n-grams from the DNN prediction at
transcription time, weigh them with probabilities and identify dictionary strings with a
high probability of explaining the DNN output. This decoding algorithm can be applied
both as a stand-alone method and as a pre-filter followed by beam search decoding. The
proposed decoding algorithm is both robust in weakly predicted parts and capable of
improving decoding speeds in beam search decoding when used as a pre-filter.
Index Generation
The first step in applying n-gram decoding to CTC is to build an n-gram index for later
use. This can be done offline, before transcription using CTC, and the resulting index
can be stored on disk. The overall process is outlined in Figure 10.2.1. The input to the
index generation process is a dictionary of strings to which the decoding algorithm
should be able to decode. It is thus assumed to be in the same language as the later
transcribed texts and to use the same alphabet.
[Figure: a dictionary of strings (ID 1 "abc", ID 2 "abd", ...) is passed through the
index generation process. The resulting index structure maps each bi-gram ("AB", "BC",
"BD", ...) to a hit list whose entries each consist of a dictionary ID and a position.]
Figure 10.2.1: Structure of the generated n-gram index. Hit lists are sorted by ascending ID first
and ascending position second.
The index structure consists of a map that relates character n-grams, in this example
bi-grams, to hit lists. Upper and lower case characters are both mapped to their upper
case variants to add robustness to the decoding method. The data structure
for this map must contain each n-gram (key) exactly once and allow fast exact lookup.
Tries and hash maps are suitable examples. When choosing a data structure for this map,
emphasis should be placed on a low runtime for lookup, that is mapping n-grams to hit
lists, since this is critical for a low runtime during transcription of handwritten texts
using this decoding method. The index further contains exactly one hit list per character
n-gram of the dictionary. Each hit list marks the occurrences of this specific n-gram.
The hit lists are simple sorted lists of 2-tuples containing an identifier for the
dictionary entry and the position within the string, in characters from the beginning of
the string. Sorting these hit lists by ascending identifiers and positions is beneficial
for applying fast intersection algorithms to these lists.
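A minimal sketch of such an index is given below, assuming a hash map (a Python `dict`) as the map structure and 1-based character positions as suggested by Figure 10.2.1. The function name `build_index` and the representation of the dictionary as a mapping from ID to string are hypothetical.

```python
# Illustrative sketch only: a plain dict serves as the n-gram map.
from collections import defaultdict


def build_index(dictionary, n=2):
    """Map each upper-cased character n-gram to a sorted hit list of (ID, position)."""
    index = defaultdict(list)
    # Iterating IDs and positions in ascending order keeps every hit list sorted.
    for entry_id in sorted(dictionary):
        text = dictionary[entry_id].upper()
        for start in range(len(text) - n + 1):
            index[text[start:start + n]].append((entry_id, start + 1))
    return dict(index)
```

For the two example entries of Figure 10.2.1, `build_index({1: "abc", 2: "abd"})` yields the keys `"AB"`, `"BC"` and `"BD"`, with `"BC"` pointing to `[(1, 2)]` and `"BD"` to `[(2, 2)]`.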
Decoding using Fuzzy Search
After the index described above has been built, it can be applied to filter the dictionary
for strings that have a high probability given the prediction of a deep neural network
trained by CTC. As mentioned before, the last layer of a DNN for CTC is a softmax function
that produces likelihoods for assigning each time step to one of the glyphs from the given
alphabet. Assignments of the glyphs are mutually exclusive per time step. Table 10.1
illustrates a DNN prediction for a simple case.
Table 10.1: Example output of a deep neural network trained with connectionist temporal
classification. The sequence consists of ten time steps and predicts labels from an
alphabet of three glyphs plus the glyph separator ϵ of CTC. The probabilities of each
time step (column) are mutually exclusive and thus sum up to exactly one.
Time step 1 2 3 4 5 6 7 8 9 10
A 0.5 0.9 0.1 0.0 0.3 0.0 0.1 0.0 0.3 0.1
B 0.0 0.0 0.1 0.7 0.3 0.9 0.7 0.0 0.2 0.0
C 0.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.4 0.7
ϵ 0.2 0.1 0.8 0.3 0.4 0.1 0.2 1.0 0.1 0.2
The first step in decoding the soft-assignment of Table 10.1 is to extract character
n-grams from it and weigh them with probabilities. The order of the n-grams (bi-gram,
tri-gram, ...) must be identical to the order used for index generation. Upper and lower
case glyphs must also be mapped to all upper case if this was done in index generation.
The glyph separator ϵ is contained in the soft-assignment as predicted by the DNN, but not
in the index structure of Figure 10.2.1. Glyph separators ϵ are thus used for computing
the weight of each n-gram, but are not included in the lookup in the map structure that
retrieves the matching hit list. N-grams are extracted from the soft-assignment by
application of a backtracking[22] algorithm. Backtracking is started once at every time
step and extracts all n-grams starting at it. Only n-grams starting and ending with a
printable glyph, that is not a separator ϵ, are extracted. Since the CTC label sequence
alternates between printable glyphs and the separator, each n-gram consists of n time
steps of printable glyphs and up to n−1 separator glyphs. There may be fewer than n−1
separator glyphs since a separator is not required for decoding if the two adjacent
glyphs are different; it is only needed to distinguish two consecutive occurrences of a
character from a single character that repeats over multiple time steps. The backtracking
algorithm thus extracts character n-grams spanning between n and (2 × n) − 1 time steps,
beginning at each time step. All n-grams start and end with printable glyphs.
The proposed backtracking algorithm extracts character n-grams from the DNN pre-
diction while at the same time calculating the mean probability over all labels included
in this n-gram. Adjacent repetitions of the same glyph are resolved by using only the
maximum probability for each character. This behavior is in line with the observation
that DNNs trained with connectionist temporal classification tend to predict ‘spikes’, each
one being a very localized high-probability prediction for a character. The weight of each
extracted n-gram G is then the mean probability over all ‘spikes’ used during extraction
\[ P(G \mid y) = \frac{1}{(2 \times n) - 1} \sum_{(t, g) \in G} y^t_g \tag{10.2.1} \]
with y being the soft-assignment produced by CTC, e.g. Table 10.1, and t, g being the
time step and glyph used for each position in the n-gram G.
Some example probabilities of bi-grams contained in Table 10.1 starting at the first
time step are:
• P(AϵA|y) = 1/3 (0.5 + 0.1 + 0.1) ≈ 0.233
• P(AϵB|y) = 1/3 (0.5 + 0.1 + 0.1) ≈ 0.233
• P(AAB|y) = 1/3 (0.5 + 0.9 + 0.1) = 0.5
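The three example values can be reproduced with a short sketch of Equation 10.2.1. It assumes that an extracted n-gram is represented as a list of (time step, glyph) pairs with time steps counted from 1, and it uses `"-"` for the glyph separator. The name `ngram_weight` and this representation are hypothetical.

```python
# Illustrative sketch only: evaluates Equation 10.2.1 on the values of Table 10.1.
TABLE_10_1 = {
    "A": [0.5, 0.9, 0.1, 0.0, 0.3, 0.0, 0.1, 0.0, 0.3, 0.1],
    "B": [0.0, 0.0, 0.1, 0.7, 0.3, 0.9, 0.7, 0.0, 0.2, 0.0],
    "C": [0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.7],
    "-": [0.2, 0.1, 0.8, 0.3, 0.4, 0.1, 0.2, 1.0, 0.1, 0.2],  # glyph separator
}


def ngram_weight(y, steps, n):
    """Sum of the probabilities along `steps`, divided by (2 * n) - 1.

    y maps each glyph to its per-time-step probabilities, steps is a list of
    (time step, glyph) pairs with time steps starting at 1, n is the n-gram order.
    """
    return sum(y[glyph][t - 1] for t, glyph in steps) / (2 * n - 1)


# The three bi-gram examples starting at the first time step.
weight_a_sep_a = ngram_weight(TABLE_10_1, [(1, "A"), (2, "-"), (3, "A")], n=2)
weight_a_sep_b = ngram_weight(TABLE_10_1, [(1, "A"), (2, "-"), (3, "B")], n=2)
weight_a_a_b = ngram_weight(TABLE_10_1, [(1, "A"), (2, "A"), (3, "B")], n=2)
```

The sketch only evaluates the weight of a given path; enumerating all candidate paths per starting time step is the task of the backtracking algorithm and is not shown here.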
The second step of the decoding algorithm is to identify the index hit lists, see Figure
10.2.1, matching the extracted n-grams. This is simply a series of exact searches within
the map data structure. N-grams that were extracted from the CTC prediction but are
not contained in the index structure are skipped. Each hit list is weighed by the
probability of the matching extracted n-gram as defined by Equation 10.2.1.
The last step during decoding is to intersect the n-gram hit lists, weighed by their
respective probability, and to identify the highest probability dictionary strings.
Probabilities of the dictionary strings are again computed by averaging the probabilities
of the extracted n-grams contained within the string. This makes it possible to simply
weigh unmatched n-grams with zero probability in order to perform index queries with
incomplete information and still obtain meaningful dictionary entries. Intersecting the
hit lists is related to the multiple search and t-intersection problems. In these, ordered
lists are intersected with the additional constraint that each element contained in the
intersection must also be contained in at least t of the individual hit lists. Several
algorithms have been proposed[4, 72, 146] to address these problems. The dictionary
entries contained in the intersection are then weighed and ordered by their mean n-gram
probability.
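The ranking step can be sketched in a strongly simplified form. The code below does not implement any of the cited t-intersection algorithms. Instead it accumulates the n-gram weights per dictionary entry over all hit lists, ignores the positions, and divides by the number of n-grams in the entry so that unmatched n-grams contribute zero. All names are hypothetical.

```python
# Illustrative sketch only: plain accumulation over hit lists, used here in place
# of the t-intersection algorithms referenced in the text.
def rank_entries(index, weights, ngram_counts, top=5):
    """Rank dictionary entries by their mean n-gram probability.

    index:        n-gram -> sorted hit list of (entry ID, position)
    weights:      extracted n-gram -> probability (e.g. from Equation 10.2.1)
    ngram_counts: entry ID -> number of n-grams contained in that dictionary string
    """
    totals = {}
    for ngram, weight in weights.items():
        # N-grams that are not part of the index are skipped.
        for entry_id, _position in index.get(ngram, []):
            totals[entry_id] = totals.get(entry_id, 0.0) + weight
    scores = {entry_id: total / ngram_counts[entry_id] for entry_id, total in totals.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top]
```

With the index of Figure 10.2.1 and the hypothetical weights `{"AB": 0.8, "BC": 0.6}`, the entry "abc" receives a mean probability of about 0.7 and "abd" about 0.4, so "abc" is ranked first. Returning several entries instead of one corresponds to the pre-filter use case.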
The dictionary entry with the highest probability according to the decoding algorithm
outlined above is the one label sequence that explains the DNN prediction best, at least
under the assumptions incorporated in this decoding method. This decoding algorithm
can also be used as a pre-filter for beam search decoding. In this case not only the
highest probability dictionary entry is used, but the top-n entries. These are then used to
restrict beam search decoding to process only label sequences, and their prefixes, that
are contained within this pre-filtered list of dictionary entries.
Evaluation
The proposed decoding algorithm using character n-grams was tested on a postal data
set at Siemens Parcel Logistics. The data set consisted of address lines from the United
States and Canada. 135000 line images were used for training and 3000 each for
validation and evaluation. These three data splits were mutually disjoint. The dictionary
in use consisted of 423170 strings, containing the correct transcription texts and variants
of them. The trained deep neural network showed a character error rate (CER) of 5.50
on the training, 7.01 on the validation and 6.86 on the evaluation set when decoded with
unconstrained beam search decoding.
Table 10.2 shows the average CER and runtime per example of the evaluation split
when decoded with beam search decoding constrained to either the full dictionary or a
pre-filter obtained by the proposed n-gram index. As expected, beam search with an
unlimited beam width produces the lowest error rate, but at the highest cost. Pre-filtering
the dictionary to the best 500 entries and constraining to those produces a slightly higher
error rate (0.65 instead of 0.58) but for a fraction of the required runtime (18.9 instead
of 7444.5 milliseconds). Choosing the single best matched dictionary entry results in an
error rate of 3.05, which is high compared to the other decoding methods, with only
little benefit in decoding speed.
Table 10.2: Character error rates on the evaluation split when constraining beam search decod-
ing to the full dictionary or to a pre-filtered dictionary using n-gram decoding. Time
measurements are the average wall-clock time to decode a single DNN output.
Decoder                  Configuration                          CER    Runtime
Constr. on full dict.    beam-w. 10000                          1.05     81.9 ms
Constr. on full dict.    beam-w. 100000                         0.76   2297.7 ms
Constr. on full dict.    beam-w. unlimited                      0.58   7444.5 ms
Constr. by n-gram idx.   beam-w. 10000, 3-grams, filter 100     0.85     15.0 ms
Constr. by n-gram idx.   beam-w. 10000, 3-grams, filter 500     0.65     18.9 ms
Single best match        tri-grams                              3.05     13.3 ms
10.3 Decoding using LSTM Networks and Metric Learning
This section is based on the following publication:
Martin Schall, Haiyan P. Buehrig, Marc-Peter Schambach, and Matthias O. Franz.
“LSTM Networks for Edit Distance Calculation with Exchangeable Dictionaries.” In: 2018
13th IAPR International Workshop on Document Analysis Systems (DAS). Apr. 2018
Section 1.3 outlines the author's individual contributions.
Overview
The method discussed in this section is based on the observation that recurrent neural
networks (RNNs) are Turing complete[101] and able to evaluate computer code[160]. In
this section we will explore if a long short-term memory (LSTM)[48, 55] network is capable
of learning the algorithm for computing the Edit-distance[81, 151] between a query string
and a dictionary of strings. Inference in the LSTM network yields the Edit-distances
between the query string and every string of the dictionary. Both the query string and
the dictionary of strings should be exchangeable for previously unseen variants. In other
words, the LSTM network is required to learn to approximate the algorithm for computing
the Edit-distance, independent of the data presented. Figure 10.3.1 shows an overview
of this approach.
This method is related to offline handwriting recognition in terms of its potential
application as a decoding algorithm. Incorporating this method would be similar to the one
discussed in the previous Section 10.2. The method discussed in this section takes a
specific query string as input and estimates the Edit-distances towards a given dictionary
of strings. Replacing the query string with the output of a connectionist temporal
classification (CTC)[46], see Section 3.1, would yield a decoding algorithm for CTC.
Decoding in this case means finding the dictionary string with the minimal Edit-distance
towards the CTC output. It could be employed in the paragraph-wise decoding algorithm of
Chapter 5 for decoding individual lines.
Figure 10.3.1: General idea for the metric learning of this section. The Deep Neural Network
estimates the Edit-distances between an unseen query string and an unseen
dictionary. The DNN should only learn the metric, that is the algorithm for
computation of the Edit-distance, and not memorize a specific dictionary.
Network Topology and Training Method
The first step in designing a deep neural network for estimating the Edit-distance between
strings is to encode these strings in a way that is suitable for the task. In this case, each
string is a sequence of characters from the Latin alphabet and the glyphs are mutually
exclusive per position in the string. That is, each string position holds exactly
one character from the alphabet, not zero or multiple ones. As such the strings were
encoded in a one-hot scheme where each string position is encoded as a 26-dimensional
vector, one dimension per Latin glyph, with exactly one position set to the value 1 and
the others to 0. The glyph ‘A’ is encoded as a vector with the first coefficient set to 1
and the others to 0, ‘B’ with the second coefficient set to 1 and so on. In the case of
these experiments, the strings were limited to a maximum length of ten characters. If the
encoded string was shorter, the remaining positions were encoded as vectors with all
zeros. The tensor for encoding the query string was thus of shape B × 10 × 26, with B
being the mini-batch size, 10 the maximum string length and 26 the number of characters in
the Latin alphabet.
Dictionaries of strings for comparison with the query string were encoded in a similar
fashion. Each dictionary presented to the deep neural network contained 100 strings,
stacked along the ‘encoding-dimension’. The tensor containing an encoded dictionary
was thus of shape B × 10 × 2600 in the experiments of this section. Both the one-hot
encoded query string and dictionary were presented to the LSTM network for estimation
of the Edit-distance. Figure 10.3.2 illustrates this approach and topological details.
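The encoding scheme can be sketched with nested lists, omitting the mini-batch dimension B. Mapping lower case input to upper case is an assumption of this sketch, as are the function names; the text only specifies the one-hot layout and the zero padding.

```python
# Illustrative sketch only: one-hot encoding without the mini-batch dimension.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
MAX_LEN = 10


def encode_string(text):
    """10 x 26 one-hot matrix; positions beyond the string length stay all-zero."""
    rows = [[0] * len(ALPHABET) for _ in range(MAX_LEN)]
    for position, char in enumerate(text.upper()[:MAX_LEN]):
        rows[position][ALPHABET.index(char)] = 1
    return rows


def encode_dictionary(strings):
    """Stack the encoded strings along the encoding dimension: 10 x (26 * len(strings))."""
    encoded = [encode_string(s) for s in strings]
    return [
        [value for matrix in encoded for value in matrix[position]]
        for position in range(MAX_LEN)
    ]
```

Encoding a dictionary of 100 strings yields a 10 × 2600 matrix per example, which matches the shape B × 10 × 2600 once a mini-batch dimension is added.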
Figure 10.3.2 details the deep neural network topology employed in this method. The
core idea is to stack bidirectional LSTM layers, each time presenting the encoded
dictionary to the layer again. This allows the network to optimize towards computing
the Edit-distance, not towards forwarding the 100 dictionary strings through its layers.
In addition to this desired behavior, this approach makes it possible to encode the
dictionary outside of the DNN, making it exchangeable for previously unseen ones.
Bidirectional LSTM was chosen since it has proven to work well for sequence labeling and
sequence transformation tasks in the past, CTC being a prominent example. There is also
work[39] showing that LSTM networks with forget gates are capable of recognizing basic
syntax.
The last layer of the deep neural network is a fully connected feed-forward layer with
a ReLU[85] non-linear activation function. The purpose of this fully connected layer is
to estimate the Edit-distances as one scalar per dictionary entry. This way the overall
problem is formulated as a regression task. ReLU is suitable here since the Edit-distance
has a minimum of zero but is unbounded in the positive range. The same holds for the
output of the ReLU function.
[Figure: data flow of the network.
Encoded Dictionary (|Batch| x 10 x 2600) and Encoded Queries (|Batch| x 10 x 26)
→ Concat along last dimension (|Batch| x 10 x 2626)
→ Bidirectional LSTM (|Batch| x 10 x #Neurons)
→ Concat along last dimension → Bidirectional LSTM (repeated according to topology)
→ Fully Connected with ReLU (|Batch| x 100)
→ Estimated Edit-Distances]
Figure 10.3.2: The Deep Neural Network used for this method consists of a stack of bidirectional
long short-term memory layers, ending with a fully connected layer that estimates
the Edit-distances. The string length is limited to a maximum of 10 characters in
the experiments of this section. The alphabet consists of 26 Latin characters. The
dictionary passed into the DNN contains exactly 100 strings.
Data for training and evaluation was derived from the 20k most frequent English
words[91]. The word list used for this work was retrieved from https://github.com/
first20hours/google-10000-english/blob/master/20k.txt on the 29th of September
2017. These 20k most frequent English words were filtered to ones between 3 and 10
characters in length, leaving 16968 strings. 1000 random ones were used to build 10 dic-
tionaries of 100 strings each. 9 of these 10 dictionaries were used for training and the last
one reserved for validation and evaluation. Of the remaining 15968 strings, 80 percent
were used as query strings for training of the DNN and 10 percent each for validation and
evaluation.
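The data preparation described above can be sketched as follows. The sketch assumes a simple random shuffle followed by slicing; how the random selection was actually performed is not stated in the text. The function name and the keys of the returned mapping are hypothetical.

```python
# Illustrative sketch only: the sampling procedure is an assumption.
import random


def split_word_list(words, rng, n_dicts=10, dict_size=100):
    """Filter by length, build the dictionaries and split the remaining query strings."""
    usable = [w for w in words if 3 <= len(w) <= 10]
    rng.shuffle(usable)
    n_dict_words = n_dicts * dict_size
    dictionaries = [usable[i:i + dict_size] for i in range(0, n_dict_words, dict_size)]
    queries = usable[n_dict_words:]
    n_train = len(queries) * 8 // 10   # 80 percent for training
    n_val = len(queries) // 10         # 10 percent for validation, the rest for evaluation
    return {
        "train_dicts": dictionaries[:-1],   # 9 dictionaries used for training
        "heldout_dict": dictionaries[-1],   # reserved for validation and evaluation
        "train_queries": queries[:n_train],
        "val_queries": queries[n_train:n_train + n_val],
        "eval_queries": queries[n_train + n_val:],
    }
```

Calling `split_word_list(words, random.Random(0))` with a fixed seed makes the split reproducible. Dictionary strings and query strings are disjoint by construction, since the queries are taken from the words remaining after the dictionaries were built.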
Training was conducted by choosing a random dictionary from the 9 reserved for
training. A random set of query strings was also chosen from those set aside for training.
The true Edit-distances were computed for this combination of dictionary and query
strings, facilitating optimization of the DNN in a supervised fashion. The optimization
criterion was to minimize the mean squared error (MSE) between the true and estimated
Edit-distances. The DNN parameters were optimized using backpropagation, gradient
descent and Adam[67] in mini-batch mode.
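The supervised targets and the optimization criterion can be sketched as follows, assuming that the Edit-distance is the Levenshtein distance with unit costs for insertion, deletion and substitution. The function names are hypothetical, and the network itself is not part of this sketch.

```python
# Illustrative sketch only: unit-cost Levenshtein distance is an assumption.
def edit_distance(a, b):
    """Edit-distance between two strings via dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def targets(query, dictionary):
    """True Edit-distances between one query string and every dictionary string."""
    return [edit_distance(query, entry) for entry in dictionary]


def mse(true, estimated):
    """Mean squared error between true and estimated Edit-distances."""
    return sum((t - e) ** 2 for t, e in zip(true, estimated)) / len(true)
```

For a dictionary of 100 strings, `targets` returns the 100 regression targets for one query string, in the same order as the dictionary entries.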
Evaluation
Table 10.3 shows the average root mean squared error (RMSE) between the true and
estimated Edit-distances after optimizing the deep neural network in the detailed way. The
RMSE is a measure of the absolute difference in the same unit as the estimated quantity,
in this case the Edit-distance. The evaluation table is split into two parts: the upper one
while using the dictionary withheld from training and the lower part using the 9 dictionaries
applied for training. The columns detail the RMSE for different DNN topologies with 2 to 5
LSTM layers with 30 to 200 neurons each. Figure 10.3.2 illustrates this network topology.
The dictionaries were not shuffled and kept in the same order for these experiments.
Table 10.3: Average RMSE when estimating the Edit-distances for known and unknown
dictionaries. The order of strings within the dictionaries was kept constant for training
and evaluation.

                             Num. layers and neurons
Dictionary    Data set       2×30    2×60    3×60    5×200
Unkn. Dict.   Eval. set      1.78    1.78    1.56    2.13
Unkn. Dict.   Val. set       1.78    1.80    1.57    2.14
Unkn. Dict.   Train. set     1.78    1.79    1.57    2.12
Known Dict.   Eval. set      0.37    0.30    0.29    0.36
Known Dict.   Val. set       0.37    0.29    0.29    0.36
Known Dict.   Train. set     0.37    0.29    0.28    0.34
Table 10.3 clearly shows a lower prediction error of ∼0.3 to ∼0.4 on the dictionaries
seen during training in comparison to an error of ∼1.6 to ∼2.2 on the remaining unseen
dictionary. This indicates that the DNN is learning to either recognize or memorize the
dictionaries presented during training. This is contrary to the goal laid out in the
beginning of this section, namely that the DNN should learn to approximate the algorithm
for computing the Edit-distance while allowing exchangeable dictionaries. To this end, a
second set of experiments was conducted. This time the words contained in the 9 training
dictionaries were shuffled randomly before each optimization iteration. Each dictionary
contained 100 strings, yielding 100! ≈ 10^158 different permutations per dictionary.
Training and evaluation were repeated with these randomized dictionaries as the only
change in comparison to the set of experiments detailed in Table 10.3.
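The shuffling step and the size of the permutation space can be checked with a few lines of Python. This is a sketch only; the helper `shuffled_copy` is a hypothetical name and not taken from the implementation used for the experiments.

```python
import math
import random


def shuffled_copy(dictionary, rng):
    """Return a randomly permuted copy, leaving the original order intact."""
    permuted = list(dictionary)
    rng.shuffle(permuted)
    return permuted


# Number of distinct orderings of a dictionary with 100 strings.
n_permutations = math.factorial(100)
# Base-10 logarithm of 100!, just below 158.
order_of_magnitude = math.log10(n_permutations)
```

Calling `shuffled_copy` before every optimization iteration presents the same 100 strings in a new order each time, so the network cannot rely on a fixed position of a string within the dictionary.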
Table 10.4 shows the results of the empirical evaluation using the shuffled
dictionaries. This time the RMSE of the predicted Edit-distance is between ∼0.8 and ∼0.85 on
the dictionaries seen during training and ∼0.85 on the unseen dictionary. Memorizing or
recognizing the dictionaries presented during training was no longer possible. Consistent
with this observation, the prediction errors on the training, validation and evaluation
data sets are close together. In conclusion, this training paradigm allows exchanging both
the dictionary and the query strings, while the DNN itself only learns to approximate the
Edit-distance rather than to recognize specific inputs presented to it.
Table 10.4: Evaluation as in Table 10.3 but with the order of strings shuffled randomly for
each training iteration and evaluation step. Overfitting is reduced by shuffling the
dictionaries randomly.

                             Num. layers and neurons
Dictionary    Data set       2×30    2×60    3×60    5×200
Unkn. Dict.   Eval. set      0.86    0.84    0.84    0.84
Unkn. Dict.   Val. set       0.86    0.84    0.84    0.84
Unkn. Dict.   Train. set     0.86    0.84    0.84    0.84
Known Dict.   Eval. set      0.85    0.82    0.82    0.81
Known Dict.   Val. set       0.85    0.82    0.82    0.81
Known Dict.   Train. set     0.85    0.82    0.82    0.81
The achieved RMSE of ∼0.8 for predicting the Edit-distance between a query string
and a dictionary of words is not small enough to retrieve the correct Edit-distance by
rounding. However, this method shows that LSTM networks are capable of approximating
the algorithm for computing the Edit-distance while keeping both the dictionary and
query string exchangeable. This opens the door for further scientific research into
applications of this type of DNN, for example for fuzzy search or decoding of CTC or MDCC
predictions in offline handwriting recognition.
Chapter 11
Discussion and Conclusion
11.1 Achieved Goals
At the beginning of this thesis, Chapter 1 discussed its motivation and scientific
contributions. Chapter 3 then detailed the works which can be directly compared to
multi-dimensional connectionist classification as proposed in this thesis, followed by
Chapter 4, which contains the problem statement for segmentation-free multi-line offline
handwriting recognition. The aim of this section is to reflect on these and to discuss the
goals which were set and achieved.
Chapters 5 and 6 proposed and detailed multi-dimensional connectionist classification
(MDCC), which tackles the problem of segmentation-free multi-line offline handwriting
recognition. The goal for this research was to identify a method which is able to
apply deep neural networks to the transcription of handwritten multi-line paragraphs with
a low error rate, without prior segmentation of the input and while providing
only the truth texts as labeling during training of the neural network. This goal has been
achieved by MDCC and demonstrated in Chapter 7 by applying it to the IAM offline
handwriting database[88], including an empirical evaluation and comparison. The empirical
evaluation of MDCC on the IAM offline handwriting database showed competitive error rates
in comparison to published methods and results.
The results of Chapter 7 require discussion in the context of the comparison to the
related works of Chapter 3. The time frame in which the research into the methods of
this thesis has been conducted also saw the publication of other methods that address
the same problem. The most prominent of these is the application of attention networks to
multi-line offline handwriting recognition[8, 9, 135]. These have been discussed in Section
3.2 of this thesis. Table 7.6 shows the comparison of MDCC to these methods in terms of the
character error rate achieved on the IAM offline handwriting database. This table shows
a higher character error rate for MDCC than the other methods.
However, multi-dimensional connectionist classification does offer some benefits in
comparison to both applying attention networks to offline handwriting recognition, see
Section 3.2, and reshaping the prediction of a convolutional neural network, see Section
3.3:
• MDCC is designed and implemented as a training algorithm for deep neural networks,
based on expectation-maximization and conditional random fields. MDCC thus
sets only broad requirements on the DNN topology and could in theory even be
used in combination with machine learning models other than artificial neural
networks. The conditions on the machine learning model are that it produces the
probabilistic soft-assignment required for MDCC and that it can be optimized in an
expectation-maximization loop. This allows adapting MDCC to different use-case
requirements such as memory or general hardware and runtime limitations. This
property of MDCC also enables its potential application to future, as yet unknown,
machine learning models.
• MDCC is fast in transcribing paragraphs. This is achieved by moving a large part
of the computational effort into the training method, not into inference during tran-
scription. MDCC proposes a multi-line decoding algorithm, which is fast to execute
and easy to parallelize.
• MDCC can handle a variety of cursive writing styles in handwritten text. It makes
no assumptions about the shape or size of glyphs; these are learned by the deep
neural network. It also does not assume text lines to be of equal height: text lines
may differ in height from each other, and an individual text line may vary internally
in height. The label space for encoding multi-line text, discussed in Section 6.2, allows
slants and angles of text lines of up to 45 degrees. The text lines in MDCC also do not
have to be aligned to the left or right border of the image.
Chapter 8 proposed a visual analytics method for identifying and inspecting interesting
examples during the training of a deep neural network with MDCC. The express goal of
this method is to allow a human expert user to identify potential error sources during the
training process and to address them. Specifically, this method targets the optimization of
hyper-parameters in MDCC and the improvement of the available ground truth data. As such,
this method is an important part of applying MDCC to specific tasks and data sets.
Chapter 9 finally proposed a method for combining line- and paragraph-wise
transcription on a case-by-case basis. This is designed to utilize both methods in order to
achieve a lower error rate. This goal is interesting from a purely scientific viewpoint, but
such a combination is also a common approach in industry and thus supports the deployment
of MDCC in such use-cases.
11.2 Ideas for Future Research
The achieved goals and direct outcomes of the research discussed in this doctoral thesis
have been detailed in the preceding section. This section now offers a few ideas for
research that builds upon the work of this thesis. Please keep in mind that these
suggestions entail some speculation. However, this speculation is informed by the author's
experience while conducting the research into multi-line offline handwriting recognition.
It is also worth pointing out that the author of this thesis does not claim these proposed
methods as his own. Instead, the following paragraphs are only ideas for continuing the
research on MDCC and the application of expectation-maximization to deep neural
networks and conditional random fields where this thesis ends.
Generalized Writing Direction
This thesis discussed the structure of multi-line text in Section 6.2. The conditional
random field (CRF) used for inferring the alignment between the truth text given in
the labeled data and the pixel space of the image is built on these observations.
One of the limitations on the structure of paragraphs that can be transcribed
with multi-dimensional connectionist classification (MDCC) is that text lines need to be
aligned from top to bottom and characters from left to right, and that text lines may be
rotated by a maximum of 45 degrees. This limitation is also reflected in the proposed
decoding algorithm of Chapter 5.
Switching the reading order from left-to-right to right-to-left is not a big change
in MDCC. On the other hand, it would likely be an interesting research topic to generalize
MDCC in order to transcribe other writing directions. Figure 11.2.1 shows one such
example where the text is ordered in a spiral, starting in the outer ring and progressing
towards the center of the spiral.
Figure 11.2.1: Text typeset as a spiral. The reading direction starts at the outer end and
proceeds towards the middle.
Text generated with https://www.loremipsum.de/ on September 24th, 2021.
Transcribing generalized writing directions such as the example of Figure 11.2.1 seems
to be possible with an improved version of MDCC since MDCC models the separation of
two lines with an explicit symbol. The orientation and extent of individual text lines are
thus not implicitly defined by geometry and assumptions about handwritten text, but by
the prediction of the line separator glyph. Transcribing text ordered in a spiral would thus
entail modeling the line separator glyph as a spiral in between adjacent loops of the same
text line.
This concept could further be generalized by removing the knowledge about the writ-
ing order, e.g. that the text is written in a spiral, from the labels of the annotated training
data. It may be possible to structure the CRF and decoder of multi-dimensional connec-
tionist classification in such a way that it allows general structures of multi-line text. For
example one could introduce two additional tokens into MDCC that signal the start and
end of a text line. The CRF could then model text lines in arbitrary shapes as long as no
two text lines are crossing. Decoding would entail starting at a predicted line start and
following the corridor between the predicted line separator glyphs, even if this corridor is
shaped in e.g. spiral or wave forms.
Document Layout Analysis
The next step up from transcribing generalized writing orders would be to analyze and
transcribe general document structures. Figure 11.2.2 is an example from the PubLayNet
database[163], which shows a complex layout. Adapting multi-dimensional connectionist
classification for the analysis of such documents would entail generalizing the geometric
neighborhoods between entities. These have been discussed in Section 6.2 of this thesis.
In the case of this example, the neighborhoods would reflect that there are two columns,
left and right of each other. The structure would also entail that there is a table on top
of both of these columns. Hierarchical information also needs to be encoded since the two
text columns and the table contain text, which in itself is a label space with relationships
between characters. Similarly, the right column contains two figures, which
also represents a hierarchical relationship.
Encoding hierarchical structures in the label space of MDCC would not necessarily
entail a hierarchical conditional random field. It could also be possible to encode this
hierarchy by introducing even more symbols with special meaning into the alphabet of
MDCC and thus the label space. In the case of Figure 11.2.2, these symbols could e.g.
be ‘upper border of a table’, ‘right border of a table’ and so on for tables, figures and the
left and multi-column layouts. This way the labels within e.g. a table could be modeled
independently of the outside label space, connected only by the indicated table border.
Changes to the decoding algorithm proposed in Chapter 5 would be necessary to
reflect these hierarchical relationships in the document and thus label space.
Scene Text Recognition
Figure 11.2.3 shows an example from the Natural Environment OCR (NEOCR) data
set[95, 96], which addresses the scene text recognition task. Scene text recognition
is the task of locating and transcribing text within natural images, e.g. scenes of streets
with their street signs and advertisements in shop windows.
One future research direction could be to apply multi-dimensional connectionist clas-
sification to this task by, as before, introducing some special symbols. In the case of the
example in Figure 11.2.3, these symbols could be generic such as ‘sky’, ‘floor’, ‘pole’,
‘tree’. It would then be possible to model this scene with its text embedded into the natu-
ral environment. The label space, see Section 6.2, of MDCC would in this case indicate
a tree left of the text, sky to the top-right, a floor below and a pole to the right. The text
itself could be modeled, as in this thesis, as a multi-line text.
Object Segmentation with Incomplete Information
The last suggestion for further research into multi-dimensional connectionist classification
is to apply it to object segmentation, but with incomplete information. The task of object
segmentation is to take an image of a natural scene as input and produce a pixel-wise
mask that assigns each pixel to an object within this scene image. Object segmentation
thus allows both geometric location and identification of objects. In other words,
object segmentation is the task of separating individual objects within an image. Figure
11.2.4 shows a scene image of a small airport with a general aviation airplane. The
objects in this case would be e.g. the airplane, landing strip, grassland, trees, buildings
in the background and the sky.
In contrast to the scene text recognition task, the object segmentation task would
require an ‘alphabet’ of very specific symbols, one for each object type within the data set.
For example, the COCO data set[83] contains 80 different object classes. The COCO
data set is labeled with the truth pixel-wise segmentation masks and object types and
thus provides full information about the scene image. MDCC could be applied to object
segmentation with incomplete information by reducing the labeled truth data to object
types and their geometric relation to each other, but omitting the pixel-wise masks.
Figure 11.2.2: Example from the PubLayNet database[163]. This example shows a two-column
page layout with a spanning table on top. The right column contains two separate
figures.
Figure 11.2.3: Scene text image contained in the Natural Environment OCR (NEOCR) data
set[95, 96], provided by https://www.cs6.tf.fau.de/neocr.
Figure 11.2.4: Scene image captured at a small airport and showing a general aviation
airplane.
Image from https://commons.wikimedia.org/wiki/File:Glenrothes_Airport.JPG and released
into public domain by author Michael Westwater.
The geometric relationships in Figure 11.2.4 would be the following:
• A tree on the left border.
• A grassland to the right of the tree.
• A small airplane surrounded by the grassland on the top, left and bottom.
• A landing strip surrounded by the grassland on the top, left and bottom.
• Buildings to the right of the tree and on top of the grassland.
• Sky on top of the tree, buildings and grassland. Spanning up to the image border.
• And so on...
The label space of MDCC would then reflect these geometric relationships between
objects, enabling the conditional random field to infer the pixel-wise mask from the current
DNN estimate and the given label space. The DNN trained in this fashion would thus learn
a pixel-wise object segmentation mask for each example by optimization on a large data
set with only incomplete segmentation information provided.
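To make the notion of incomplete information more concrete, the relations listed above could be stored as plain records without any pixel-wise mask. The following is a purely hypothetical sketch: the class `Relation`, the relation vocabulary and the object names are assumptions made for illustration and do not correspond to the label space of Section 6.2.

```python
from dataclasses import dataclass

# Hypothetical names for image borders, which are not object types.
IMAGE_BORDERS = {"left_border", "right_border", "top_border", "bottom_border"}


@dataclass(frozen=True)
class Relation:
    """One geometric relation between an object type and a reference."""
    subject: str
    relation: str  # hypothetical vocabulary: "at", "right_of", "on_top_of", ...
    reference: str


# Weak label for the airport scene: object types and relations, no masks.
airport_scene = [
    Relation("tree", "at", "left_border"),
    Relation("grassland", "right_of", "tree"),
    Relation("airplane", "surrounded_by", "grassland"),
    Relation("landing_strip", "surrounded_by", "grassland"),
    Relation("buildings", "right_of", "tree"),
    Relation("buildings", "on_top_of", "grassland"),
    Relation("sky", "on_top_of", "tree"),
    Relation("sky", "on_top_of", "buildings"),
    Relation("sky", "on_top_of", "grassland"),
]


def object_types(relations):
    """Collect the object types mentioned in a weak label."""
    names = set()
    for rel in relations:
        names.add(rel.subject)
        if rel.reference not in IMAGE_BORDERS:
            names.add(rel.reference)
    return names
```

A label of this kind names the objects and their relative placement but leaves the assignment of pixels to objects entirely to the inference step.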
11.3 Discussion
This section represents the conclusion of this thesis and as such is the place for some
final thoughts. The following paragraphs are thus the personal opinion of the author.
Chapters 7 and 9 showed that paragraph-wise transcription becomes increasingly
preferable to line-wise transcription of handwritten text as the text lines become harder
to segment. This has been shown with experiments in which the offset between adjacent
text lines has been artificially reduced. Multi-dimensional connectionist classification
and attention-network based methods[8] can be applied for paragraph-wise transcription.
Both are robust methods for transcription of overlapping text lines. MDCC is faster than
the compared attention-network method, which can be important in time-sensitive
applications. MDCC also shows general properties that allow its application to machine
learning models other than deep neural networks. Section 11.2 discusses some ideas
on how to generalize MDCC to more complex document layouts than single paragraphs.
One observation to point out is that at the beginning of this research, the state of the
art in offline handwriting recognition was to apply connectionist temporal classification
(CTC)[46] in combination with deep neural networks. CTC still is, with good reason, part of
the state of the art in this research field. However, multiple novel methods for
paragraph-wise offline handwriting recognition have emerged independently of each other
while the work on MDCC was under way. This seems to indicate some sort of ‘convergent
evolution’ in document analysis and that there is indeed interest in and need for
multi-line offline handwriting recognition. Multi-dimensional connectionist classification
fits into this.
Another point is an observation on the papers published at recent document analysis
conferences, especially ICDAR 2021. The number of published papers on different topics
seems to signal an increasing interest in extracting information from complex documents,
e.g. invoices, with little or no prior explicit layout analysis. On the other hand, there
are still numerous publications on historical document analysis. I think this makes sense
since the number of contemporary, modern handwritten documents is likely decreasing.
This trend does indicate that further research into multi-dimensional connectionist
classification for document layout analysis, see the ideas of Section 11.2, may be useful
and fruitful. MDCC could potentially be applied to both complex document layouts and
historical documents.
Manually debugging and optimizing the MDCC training method and deep neural network
hyper-parameters also made it obvious that some sort of visual analytics for identifying
error sources in MDCC is necessary. This was discussed in Chapter 8 in the context
of this thesis. I do hope that this trend extends to more artificial intelligence methods in
what is called explainable AI. I think involving human experts in the general evaluation
of artificial intelligence systems and in the identification of their error sources is a
component for a widespread use of AI while maintaining the trust of users and others who
are affected by these methods.
These observations conclude this thesis, hopefully with a positive outlook on the re-
search fields touched by this work and with the contributions of this thesis being added
to them.
Bibliography
[1] Martin Abadi et al. “TensorFlow: A system for large-scale machine learning.” In:
12th USENIX Symposium on Operating Systems Design and Implementation
(OSDI 16). 2016, pp. 265–283.
[2] Stefan Arnborg, Derek G. Corneil, and Andrzej Proskurowski. “Complexity of find-
ing embeddings in a k-tree.” In: SIAM Journal on Algebraic Discrete Methods 8.2
(1987), pp. 277–284.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Trans-
lation by Jointly Learning to Align and Translate.” In: 3rd International Conference
on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015.
[4] Jérémy Barbay and Claire Kenyon. “Adaptive Intersection and t-Threshold Prob-
lems.” In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Dis-
crete Algorithms. SODA ’02. San Francisco, California: Society for Industrial and
Applied Mathematics, 2002, pp. 390–399. ISBN: 089871513X.
[5] Yoshua Bengio. Learning deep architectures for AI. Now Publishers Inc, 2009.
[6] Yoshua Bengio. “Practical recommendations for gradient-based training of deep
architectures.” In: Neural networks: Tricks of the trade. Springer, 2012, pp. 437–
478.
[7] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer,
2006. ISBN: 978-0-387-31073-2.
[8] Théodore Bluche. “Joint Line Segmentation and Transcription for End-to-End
Handwritten Paragraph Recognition.” In: Advances in Neural Information Process-
ing Systems. 2016, pp. 838–846.
[9] Théodore Bluche, Jérôme Louradour, and Ronaldo Messina. “Scan, Attend and
Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention.”
In: 2017 14th IAPR International Conference on Document Analysis and Recog-
nition (ICDAR). Vol. 1. IEEE. 2017, pp. 1050–1055.
[10] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. “A training algo-
rithm for optimal margin classifiers.” In: Proceedings of the fifth annual workshop
on Computational learning theory. 1992, pp. 144–152.
[11] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. “Optimization methods for
large-scale machine learning.” In: SIAM Review 60.2 (2018), pp. 223–311.
[12] Thomas M. Breuel. “High performance text recognition using a hybrid
convolutional-LSTM implementation.” In: 2017 14th IAPR International Conference
on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE. 2017, pp. 11–16.
[13] John S. Bridle. “Probabilistic interpretation of feedforward classification network
outputs, with relationships to statistical pattern recognition.” In: Neurocomputing.
Springer, 1990, pp. 227–236.
[14] Rich Caruana, Steve Lawrence, and C. Lee Giles. “Overfitting in neural nets: Back-
propagation, conjugate gradient, and early stopping.” In: Advances in neural infor-
mation processing systems. 2001, pp. 402–408.
[15] Venkat Chandrasekaran, Nathan Srebro, and Prahladh Harsha. “Complexity of
Inference in Graphical Models.” In: CoRR abs/1206.3240 (2012). arXiv: 1206.3240.
[16] Darren M. Chitty. “A data parallel approach to genetic programming using pro-
grammable graphics hardware.” In: Proceedings of the 9th annual conference on
Genetic and evolutionary computation. 2007, pp. 1566–1573.
[17] Jaegul Choo and Shixia Liu. “Visual analytics for explainable deep learning.” In:
IEEE computer graphics and applications 38.4 (2018), pp. 84–92.
[18] Anna Choromanska et al. “The loss surfaces of multilayer networks.” In: Artificial
intelligence and statistics. 2015, pp. 192–204.
[19] Barry A. Cipra. “An introduction to the Ising model.” In: The American Mathemati-
cal Monthly 94.10 (1987), pp. 937–959.
[20] Denis Coquenet, Clément Chatelain, and Thierry Paquet. “End-to-end Handwrit-
ten Paragraph Text Recognition Using a Vertical Attention Network.” In: CoRR
abs/2012.03868 (2020). arXiv: 2012.03868.
[21] Denis Coquenet, Clément Chatelain, and Thierry Paquet. “SPAN: A Simple Pre-
dict & Align Network for Handwritten Paragraph Recognition.” In: Document Anal-
ysis and Recognition – ICDAR 2021. Ed. by Josep Lladós, Daniel Lopresti, and
Seiichi Uchida. Cham: Springer International Publishing, 2021, pp. 70–84. ISBN:
978-3-030-86334-0.
[22] Thomas H. Cormen et al. Introduction to algorithms. 2009.
[23] Paul Dagum and Michael Luby. “Approximating probabilistic inference in Bayesian
belief networks is NP-hard.” In: Artificial intelligence 60.1 (1993), pp. 141–153.
[24] Sanjoy Dasgupta. “Learning Polytrees.” In: CoRR abs/1301.6688 (2013). arXiv:
1301.6688.
[25] Yann N. Dauphin et al. “Identifying and attacking the saddle point problem in high-
dimensional non-convex optimization.” In: Advances in neural information pro-
cessing systems. 2014, pp. 2933–2941.
[26] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. “Maximum likelihood from
incomplete data via the EM algorithm.” In: Journal of the Royal Statistical Society:
Series B (Methodological) 39.1 (1977), pp. 1–22.
[27] Jia Deng et al. “Imagenet: A large-scale hierarchical image database.” In: 2009
IEEE conference on computer vision and pattern recognition. IEEE. 2009,
pp. 248–255.
[28] Patrick Doetsch, Michal Kozielski, and Hermann Ney. “Fast and robust training
of recurrent neural networks for offline handwriting recognition.” In: 2014 14th
International Conference on Frontiers in Handwriting Recognition. IEEE. 2014,
pp. 279–284.
[29] Patrick Doetsch et al. “RETURNN: The RWTH extensible training framework for
universal recurrent neural networks.” In: 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 5345–5349.
[30] Harris Drucker et al. “Support Vector Regression Machines.” In: Advances in Neu-
ral Information Processing Systems 9, NIPS, Denver, CO, USA, December 2-5,
1996. Ed. by Michael Mozer, Michael I. Jordan, and Thomas Petsche. MIT Press,
1996, pp. 155–161.
[31] B. G. Farley and W. A. Clark. “Simulation of self-organizing systems by digital
computer.” In: Transactions of the IRE Professional Group on Information Theory
4.4 (1954), pp. 76–84.
[32] Ronald Aylmer Fisher. “Statistical methods for research workers.” In: Break-
throughs in statistics. Springer, 1992, pp. 66–70.
[33] G. David Forney. “The viterbi algorithm.” In: Proceedings of the IEEE 61.3 (1973),
pp. 268–278.
[34] Brendan J. Frey and David J. C. MacKay. “A revolution: Belief propagation in
graphs with cycles.” In: Advances in neural information processing systems. 1998,
pp. 479–485.
[35] Kunihiko Fukushima. “Neural network model for selective attention in visual pat-
tern recognition and associative recall.” In: Applied Optics 26.23 (1987), pp. 4985–
4992.
[36] Kunihiko Fukushima. “Neocognitron: A hierarchical neural network capable of vi-
sual pattern recognition.” In: Neural networks 1.2 (1988), pp. 119–130.
[37] James Fung and Steve Mann. “Computer vision signal processing on graphics
processing units.” In: 2004 IEEE International Conference on Acoustics, Speech,
and Signal Processing. Vol. 5. IEEE. 2004, pp. V–93.
[38] Yarin Gal and Zoubin Ghahramani. “A theoretically grounded application of
dropout in recurrent neural networks.” In: Advances in neural information process-
ing systems. 2016, pp. 1019–1027.
[39] Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. “Learning to Forget:
Continual Prediction with LSTM.” In: Neural Comput. 12.10 (2000), pp. 2451–
2471. DOI: 10.1162/089976600300015015.
[40] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. “Learning precise
timing with LSTM recurrent networks.” In: Journal of machine learning research
3.Aug (2002), pp. 115–143.
[41] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press,
2016. ISBN: 978-0-262-03561-3.
[42] Ian J. Goodfellow and Oriol Vinyals. “Qualitatively characterizing neural network
optimization problems.” In: 3rd International Conference on Learning
Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track
Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015.
URL: http://arxiv.org/abs/1412.6544.
[43] Alex Graves. Supervised sequence labelling with recurrent neural networks.
Springer, 2012.
[44] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. “Multi-Dimensional
Recurrent Neural Networks.” In: International Conference on Artificial Neural Net-
works. Springer. 2007, pp. 549–558.
[45] Alex Graves and Jürgen Schmidhuber. “Offline Handwriting Recognition with Mul-
tidimensional Recurrent Neural Networks.” In: Advances in Neural Information
Processing Systems. 2009, pp. 545–552.
[46] Alex Graves et al. “Connectionist Temporal Classification: Labelling Unsegmented
Sequence Data with Recurrent Neural Networks.” In: Proceedings of the 23rd In-
ternational Conference on Machine learning. ACM. 2006, pp. 369–376.
[47] Alex Graves et al. “A Novel Connectionist System for Unconstrained Handwriting
Recognition.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence
31.5 (2008), pp. 855–868.
[48] Klaus Greff et al. “LSTM: A Search Space Odyssey.” In: IEEE transactions on
neural networks and learning systems 28.10 (2016), pp. 2222–2232.
[49] Isabelle Guyon, B. Boser, and Vladimir Vapnik. “Automatic capacity tuning of very
large VC-dimension classifiers.” In: Advances in neural information processing
systems. 1993, pp. 147–155.
[50] John M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices
(unpublished). 1971.
[51] Kaiming He et al. “Deep residual learning for image recognition.” In: Proceed-
ings of the IEEE conference on computer vision and pattern recognition. 2016,
pp. 770–778.
[52] Donald Olding Hebb. The organization of behavior: a neuropsychological theory.
J. Wiley; Chapman & Hall, 1949.
[53] Sepp Hochreiter. “Untersuchungen zu dynamischen neuronalen Netzen.” In:
Diploma, Technische Universität München 91.1 (1991).
[54] Sepp Hochreiter. “Gradient flow in recurrent nets: the difficulty of learning long-
term dependencies.” In: A Field Guide to Dynamical Recurrent Neural Networks
(2001), pp. 237–244.
[55] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory.” In: Neural
computation 9.8 (1997), pp. 1735–1780.
[56] Arthur E. Hoerl and Robert W. Kennard. “Ridge regression: Biased estimation for
nonorthogonal problems.” In: Technometrics 12.1 (1970), pp. 55–67.
[57] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. “Kernel methods
in machine learning.” In: The annals of statistics 36.3 (2008), pp. 1171–1220.
[58] Fred Hohman et al. “Visual analytics in deep learning: An interrogative survey for
the next frontiers.” In: IEEE transactions on visualization and computer graphics
25.8 (2018), pp. 2674–2693.
[59] Roger A. Horn and Charles R. Johnson. Matrix analysis. Cambridge University
Press, 2012.
[60] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep net-
work training by reducing internal covariate shift.” In: International conference on
machine learning. PMLR. 2015, pp. 448–456.
[61] Ernst Ising. “Beitrag zur Theorie des Ferromagnetismus.” In: Zeitschrift für Physik
31.1 (1925), pp. 253–258.
[62] Stig Johansson, Geoffrey N. Leech, and Helen Goodluck. Lancaster/Oslo-Bergen
Corpus Manual. 1978. URL: http://korpus.uib.no/icame/manuals/LOB/INDEX.HTM.
[63] Melvin Johnson et al. “Google’s multilingual neural machine translation system:
Enabling zero-shot translation.” In: Transactions of the Association for Computa-
tional Linguistics 5 (2017), pp. 339–351.
[64] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. “Grid Long Short-Term Mem-
ory.” In: 4th International Conference on Learning Representations, ICLR 2016,
San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. Ed. by
Yoshua Bengio and Yann LeCun. 2016.
[65] Jack Kiefer, Jacob Wolfowitz, et al. “Stochastic estimation of the maximum of a re-
gression function.” In: The Annals of Mathematical Statistics 23.3 (1952), pp. 462–
466.
[66] Ross Kindermann and J. Laurie Snell. Markov Random Fields and Their Appli-
cations. Vol. 1. American Mathematical Society, 1980. ISBN: 978-0-8218-5001-5.
DOI: 10.1090/conm/001.
[67] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization.”
In: 3rd International Conference on Learning Representations, ICLR 2015, San
Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua
Bengio and Yann LeCun. 2015.
[68] Stephen Cole Kleene. Representation of events in nerve nets and finite automata.
Tech. rep. RAND PROJECT AIR FORCE SANTA MONICA CA, 1951.
[69] Donald E. Knuth. The Art of Computer Programming, Vol. 1: Fundamental
Algorithms. Third. Reading, Mass.: Addison-Wesley, 1997. ISBN: 0201896834
9780201896831.
[70] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and
techniques. MIT press, 2009.
[71] Michał Kozielski, Patrick Doetsch, Hermann Ney, et al. “Improvements in RWTH’s
system for offline handwriting recognition.” In: 2013 12th International Conference
on Document Analysis and Recognition. IEEE. 2013, pp. 935–939.
[72] Robert Krauthgamer et al. “Greedy list intersection.” In: 2008 IEEE 24th Interna-
tional Conference on Data Engineering. IEEE. 2008, pp. 1033–1042.
[73] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification
with deep convolutional neural networks.” In: Advances in neural information pro-
cessing systems. 2012, pp. 1097–1105.
[74] Anders Krogh and John A. Hertz. “A simple weight decay can improve general-
ization.” In: Advances in neural information processing systems. 1992, pp. 950–
957.
[75] Frank R. Kschischang, Brendan J. Frey, and H.-A. Loeliger. “Factor graphs and the
sum-product algorithm.” In: IEEE Transactions on information theory 47.2 (2001),
pp. 498–519.
[76] Solomon Kullback and Richard A. Leibler. “On information and sufficiency.” In: The
annals of mathematical statistics 22.1 (1951), pp. 79–86.
[77] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. “Conditional ran-
dom fields: Probabilistic models for segmenting and labeling sequence data.” In:
(2001).
[78] Yann LeCun, Yoshua Bengio, et al. “Convolutional networks for images, speech,
and time series.” In: The handbook of brain theory and neural networks 3361.10
(1995), p. 1995.
[79] Yann LeCun et al. “Backpropagation applied to handwritten zip code recognition.”
In: Neural computation 1.4 (1989), pp. 541–551.
[80] Vasilica Lepar and Prakash P. Shenoy. “A Comparison of Lauritzen-Spiegelhalter,
Hugin, and Shenoy-Shafer Architectures for Computing Marginals of Probability
Distributions.” In: CoRR abs/1301.7394 (2013). arXiv: 1301.7394.
[81] Vladimir I. Levenshtein. “Binary codes capable of correcting deletions, insertions,
and reversals.” In: Soviet physics doklady. Vol. 10. 8. 1966, pp. 707–710.
[82] Stan Z. Li. Markov Random Field Modeling in Image Analysis. Advances in Pattern
Recognition. Springer, 2009. ISBN: 978-1-84800-278-4. DOI: 10.1007/978-1-84800-279-1.
[83] Tsung-Yi Lin et al. “Microsoft COCO: Common objects in context.” In: European
conference on computer vision. Springer. 2014, pp. 740–755.
[84] Ilya Loshchilov and Frank Hutter. “Fixing Weight Decay Regularization in Adam.”
In: CoRR abs/1711.05101 (2017). arXiv: 1711.05101.
[85] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. “Rectifier nonlinearities
improve neural network acoustic models.” In: Proc. ICML. Vol. 30. 1. Citeseer. 2013,
p. 3.
[86] Anders L. Madsen et al. “The Hugin tool for probabilistic graphical models.” In:
International Journal on Artificial Intelligence Tools 14.03 (2005), pp. 507–543.
[87] Christopher Manning and Hinrich Schütze. Foundations of statistical natural
language processing. MIT press, 1999.
[88] U.-V. Marti and Horst Bunke. “The IAM-database: an English sentence database
for offline handwriting recognition.” In: International Journal on Document Analysis
and Recognition 5.1 (2002), pp. 39–46.
[89] Warren S. McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in
nervous activity.” In: The bulletin of mathematical biophysics 5.4 (1943), pp. 115–
133.
[90] Geoffrey J. McLachlan and Kaye E. Basford. Mixture models: Inference and appli-
cations to clustering. Vol. 38. M. Dekker New York, 1988.
[91] Jean-Baptiste Michel et al. “Quantitative analysis of culture using millions of digi-
tized books.” In: science 331.6014 (2011), pp. 176–182.
[92] Marvin Minsky and Seymour Papert. Perceptrons - an introduction to computa-
tional geometry. MIT Press, 1987. ISBN: 978-0-262-63111-2.
[93] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT press, 2012.
ISBN: 978-0-262-01802-9.
[94] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. “Loopy Belief Propagation
for Approximate Inference: An Empirical Study.” In: UAI ’99: Proceedings of the
Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden,
July 30 - August 1, 1999. Ed. by Kathryn B. Laskey and Henri Prade. Morgan
Kaufmann, 1999, pp. 467–475.
[95] Robert Nagy, Anders Dicker, and Klaus Meyer-Wegener. “Definition and Evalua-
tion of the NEOCR Dataset for Natural-Image Text Recognition.” In: (2011).
[96] Robert Nagy, Anders Dicker, and Klaus Meyer-Wegener. “NEOCR: A configurable
dataset for natural image text recognition.” In: International Workshop on Camera-
Based Document Analysis and Recognition. Springer. 2011, pp. 150–163.
[97] Frank Nielsen. “A family of statistical symmetric divergences based on Jensen’s
inequality.” In: CoRR abs/1009.4004 (2010). arXiv: 1009.4004.
[98] Nobuyuki Otsu. “A threshold selection method from gray-level histograms.” In:
IEEE transactions on systems, man, and cybernetics 9.1 (1979), pp. 62–66.
[99] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. “On the difficulty of train-
ing recurrent neural networks.” In: International conference on machine learning.
2013, pp. 1310–1318.
[100] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, 1988.
[101] Jorge Pérez, Javier Marinkovic, and Pablo Barceló. “On the Turing Complete-
ness of Modern Neural Network Architectures.” In: 7th International Conference
on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
OpenReview.net, 2019.
[102] Vu Pham et al. “Dropout improves recurrent neural networks for handwriting
recognition.” In: 2014 14th international conference on frontiers in handwriting
recognition. IEEE. 2014, pp. 285–290.
[103] Lutz Prechelt. “Early stopping - but when?” In: Neural Networks: Tricks of the trade.
Springer, 1998, pp. 55–69.
[104] Joan Puigcerver. “Are multidimensional recurrent layers really necessary for hand-
written text recognition?” In: 2017 14th IAPR International Conference on Docu-
ment Analysis and Recognition (ICDAR). Vol. 1. IEEE. 2017, pp. 67–72.
[105] Lawrence R. Rabiner. “A tutorial on hidden Markov models and selected applica-
tions in speech recognition.” In: Proceedings of the IEEE 77.2 (1989), pp. 257–
286.
[106] D. Raj Reddy et al. “Speech understanding systems: A summary of results of
the five-year research effort.” In: Department of Computer Science. Carnegie-Mellon
University, Pittsburgh, PA 17 (1977), p. 138.
[107] Herbert Robbins and Sutton Monro. “A stochastic approximation method.” In: The
annals of mathematical statistics (1951), pp. 400–407.
[108] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional
networks for biomedical image segmentation.” In: International Conference on
Medical image computing and computer-assisted intervention. Springer. 2015,
pp. 234–241.
[109] Frank Rosenblatt. The perceptron, a perceiving and recognizing automaton
Project Para. Cornell Aeronautical Laboratory, 1957.
[110] Frank Rosenblatt. “The perceptron: a probabilistic model for information storage
and organization in the brain.” In: Psychological review 65.6 (1958), p. 386.
[111] Dan Roth. “On the hardness of approximate reasoning.” In: Artificial Intelligence
82.1-2 (1996), pp. 273–302.
[112] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning rep-
resentations by back-propagating errors.” In: nature 323.6088 (1986), pp. 533–
536.
[113] David E. Rumelhart et al. “Backpropagation: The basic theory.” In: Backpropaga-
tion: Theory, architectures and applications (1995), pp. 1–34.
[114] Dominik Sacha et al. “Vis4ML: An ontology for visual analytics assisted ma-
chine learning.” In: IEEE transactions on visualization and computer graphics 25.1
(2018), pp. 385–395.
[115] Fadil Santosa and William W. Symes. “Linear inversion of band-limited reflec-
tion seismograms.” In: SIAM Journal on Scientific and Statistical Computing 7.4
(1986), pp. 1307–1330.
[116] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. “Exact solutions to the
nonlinear dynamics of learning in deep linear neural networks.” In: 2nd Interna-
tional Conference on Learning Representations, ICLR 2014, Banff, AB, Canada,
April 14-16, 2014, Conference Track Proceedings. Ed. by Yoshua Bengio and
Yann LeCun. 2014. URL: http://arxiv.org/abs/1312.6120.
[117] Kenneth M. Sayre. “Machine Recognition of Handwritten Words: A Project Re-
port.” In: Pattern Recognition 5.3 (1973), pp. 213–228.
[118] Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Improving
gradient-based LSTM training for offline handwriting recognition by careful se-
lection of the optimization method.” In: BW-CAR| SINCOM (2016), p. 11.
[119] Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Increasing Ro-
bustness of Handwriting Recognition Using Character N-Gram Decoding on Large
Lexica.” In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS).
Apr. 2016, pp. 156–161. DOI: 10.1109/DAS.2016.43.
[120] Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Multi-Dimensional
Connectionist Classification: Reading Text in One Step.” In: 2018 13th IAPR In-
ternational Workshop on Document Analysis Systems (DAS). Apr. 2018, pp. 405–
410. DOI: 10.1109/DAS.2018.36.
[121] Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Dissecting Multi-
Line Handwriting for Multi-Dimensional Connectionist Classification.” In: 2019 15th
IAPR International Conference on Document Analysis and Recognition (ICDAR).
Sept. 2019. DOI: 10.1109/ICDAR.2019.00015.
[122] Martin Schall et al. “LSTM Networks for Edit Distance Calculation with Exchange-
able Dictionaries.” In: 2018 13th IAPR International Workshop on Document Anal-
ysis Systems (DAS). Apr. 2018.
[123] Martin Schall et al. “Visualization-Assisted Development of Deep Learning Mod-
els in Offline Handwriting Recognition.” In: Symposium on Visualization in Data
Science (VDS) at IEEE VIS 2018. Oct. 2018.
[124] Marc-Peter Schambach and Sheikh Faisal Rashid. “Stabilize Sequence Learning
with Recurrent Neural Networks by Forced Alignment.” In: 2013 12th International
Conference on Document Analysis and Recognition. IEEE. 2013, pp. 1270–1274.
[125] Marc-Peter Schambach, Stephan von der Nüll, and Martin Schall. “Fast and Reli-
able Acquisition of Truth Data for Document Analysis using Cyclic Suggest Algo-
rithms.” In: 2019 International Conference on Document Analysis and Recognition
Workshops (ICDARW). Vol. 2. Sept. 2019, pp. 7–12. DOI: 10.1109/ICDARW.2019.10030.
[126] Jürgen Schmidhuber. “Deep learning in neural networks: An overview.” In: Neural
networks 61 (2015), pp. 85–117.
[127] Bernhard Schölkopf, Alexander J. Smola, and Francis Bach. Learning with ker-
nels: support vector machines, regularization, optimization, and beyond. the MIT
Press, 2018.
[128] Mike Schuster and Kuldip K. Paliwal. “Bidirectional Recurrent Neural
Networks.” In: IEEE Transactions on Signal Processing 45.11 (1997), pp. 2673–
2681.
[129] Rita Sevastjanova et al. “Going beyond visualization: Verbalization as complemen-
tary medium to explain machine learning models.” In: Workshop on Visualization
for AI Explainability at IEEE VIS. 2018.
[130] Mehmet Sezgin and Bülent Sankur. “Survey over image thresholding techniques
and quantitative performance evaluation.” In: Journal of Electronic imaging 13.1
(2004), pp. 146–165.
[131] Prakash P. Shenoy. “Binary join trees for computing marginals in the Shenoy-
Shafer architecture.” In: International Journal of approximate reasoning 17.2-3
(1997), pp. 239–263.
[132] David Silver et al. “Mastering the game of Go with deep neural networks and tree
search.” In: nature 529.7587 (2016), pp. 484–489.
[133] David Silver et al. “Mastering the game of go without human knowledge.” In: nature
550.7676 (2017), pp. 354–359.
[134] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. “Deep Inside Convo-
lutional Networks: Visualising Image Classification Models and Saliency Maps.”
In: 2nd International Conference on Learning Representations, ICLR 2014, Banff,
AB, Canada, April 14-16, 2014, Workshop Track Proceedings. Ed. by Yoshua Ben-
gio and Yann LeCun. 2014.
[135] Sumeet S. Singh and Sergey Karayev. “Full Page Handwriting Recognition via
Image to Sequence Extraction.” In: Document Analysis and Recognition – ICDAR
2021. Ed. by Josep Lladós, Daniel Lopresti, and Seiichi Uchida. Cham: Springer
International Publishing, 2021, pp. 55–69. ISBN: 978-3-030-86334-0.
[136] Thilo Spinner et al. “explAIner: A visual analytics framework for interactive and ex-
plainable machine learning.” In: IEEE transactions on visualization and computer
graphics 26.1 (2019), pp. 1064–1074.
[137] Nitish Srivastava et al. “Dropout: a simple way to prevent neural networks from
overfitting.” In: The journal of machine learning research 15.1 (2014), pp. 1929–
1958.
[138] Richard P. Stanley. “Enumerative Combinatorics Volume 1 second edition.” In:
Cambridge studies in advanced mathematics (2011).
[139] Erik B. Sudderth and William T. Freeman. “Signal and image processing with belief
propagation.” In: IEEE Signal Processing Magazine 25.2 (2008), pp. 114–141.
[140] Olarik Surinta et al. “A* path planning for line segmentation of handwritten docu-
ments.” In: 2014 14th International Conference on Frontiers in Handwriting Recog-
nition. IEEE. 2014, pp. 175–180.
[141] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning
with neural networks.” In: Advances in neural information processing systems.
2014, pp. 3104–3112.
[142] Richard S. Sutton, Andrew G. Barto, et al. Introduction to reinforcement learning.
Vol. 135. MIT press Cambridge, 1998.
[143] Christian Szegedy et al. “Scalable, High-Quality Object Detection.” In: CoRR
abs/1412.1441 (2014). arXiv: 1412.1441.
[144] Christian Szegedy et al. “Going deeper with convolutions.” In: Proceedings of the
IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.
[145] Robert Tibshirani. “Regression shrinkage and selection via the lasso.” In: Journal
of the Royal Statistical Society: Series B (Methodological) 58.1 (1996), pp. 267–
288.
[146] Dimitris Tsirogiannis, Sudipto Guha, and Nick Koudas. “Improving the perfor-
mance of list intersection.” In: Proceedings of the VLDB Endowment 2.1 (2009),
pp. 838–849.
[147] Vladimir Vapnik, Steven E. Golowich, Alex Smola, et al. “Support vector method
for function approximation, regression estimation, and signal processing.” In: Ad-
vances in neural information processing systems (1997), pp. 281–287.
[148] Jean-Philippe Vert, Koji Tsuda, and Bernhard Schölkopf. “A primer on kernel
methods.” In: Kernel methods in computational biology 47 (2004), pp. 35–70.
[149] Andrew J. Viterbi. “A personal history of the Viterbi algorithm.” In: IEEE Signal
Processing Magazine 23.4 (2006), pp. 120–142.
[150] Paul Voigtlaender, Patrick Doetsch, and Hermann Ney. “Handwriting recognition
with large multidimensional long short-term memory recurrent neural networks.”
In: 2016 15th International Conference on Frontiers in Handwriting Recognition
(ICFHR). IEEE. 2016, pp. 228–233.
[151] Robert A. Wagner and Michael J. Fischer. “The string-to-string correction prob-
lem.” In: Journal of the ACM (JACM) 21.1 (1974), pp. 168–173.
[152] Paul J. Werbos. “Backpropagation through time: what it does and how to do it.” In:
Proceedings of the IEEE 78.10 (1990), pp. 1550–1560.
[153] Tobias Weyand, Ilya Kostrikov, and James Philbin. “PlaNet - photo geolocation with
convolutional neural networks.” In: European Conference on Computer Vision.
Springer. 2016, pp. 37–55.
[154] Alfred Whitehead and Bertrand Russell. Principia mathematica. Cambridge, 1910.
[155] D. Randall Wilson and Tony R. Martinez. “The general inefficiency of batch training
for gradient descent learning.” In: Neural networks 16.10 (2003), pp. 1429–1451.
[156] Yi-Chao Wu et al. “Handwritten Chinese text recognition using separable
multi-dimensional recurrent neural network.” In: 2017 14th IAPR International
Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE. 2017,
pp. 79–84.
[157] Yonghui Wu et al. “Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation.” In: CoRR abs/1609.08144 (2016).
arXiv: 1609.08144.
[158] Kelvin Xu et al. “Show, attend and tell: Neural image caption generation with visual
attention.” In: International conference on machine learning. 2015, pp. 2048–2057.
[159] Jui-Cheng Yen, Fu-Juay Chang, and Shyang Chang. “A new criterion for automatic
multilevel thresholding.” In: IEEE Transactions on Image Processing 4.3 (1995),
pp. 370–378.
[160] Wojciech Zaremba and Ilya Sutskever. “Learning to Execute.” In: CoRR
abs/1410.4615 (2014). arXiv: 1410.4615.
[161] Matthew D. Zeiler and Rob Fergus. “Visualizing and understanding convolutional
networks.” In: European conference on computer vision. Springer. 2014, pp. 818–
833.
[162] Nevin L. Zhang and David Poole. “A simple approach to Bayesian network com-
putations.” In: Proc. of the Tenth Canadian Conference on Artificial Intelligence.
1994.
[163] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. “PubLayNet: largest dataset
ever for document layout analysis.” In: 2019 International Conference on Docu-
ment Analysis and Recognition (ICDAR). IEEE. 2019, pp. 1015–1022.