Multi-Dimensional Connectionist Classification: Segmentation-Free Handwriting Recognition

Dissertation submitted for the academic degree of Doctor of Natural Sciences (Dr. rer. nat.)

submitted by Martin Schall

at the Mathematisch-Naturwissenschaftliche Sektion (Faculty of Sciences), Fachbereich Informatik und Informationswissenschaft (Department of Computer and Information Science)

Konstanz, 2022

Konstanzer Online-Publikations-System (KOPS)
URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-2-1jwkb729u4hib7

Date of the oral examination: 1 July 2022
First referee: Prof. Dr. Daniel A. Keim
Second referee: Prof. Dr. Matthias O. Franz
Chair: Prof. Dr. Bastian Goldlücke

Abstract

Offline handwriting recognition is an area of research within document analysis. It is the task of automatically transcribing natural handwritten text from images. As such it finds applications in scientific systems as well as in industrial and consumer products.

This thesis deals with the segmentation-free offline handwriting recognition of handwritten paragraphs, which is the transcription of paragraphs from images without prior segmentation of the image into individual lines, words or characters. Removing the need for prior segmentation also removes a potential error source from the overall pipeline for transcribing text from images.

The opening chapters of this thesis outline and discuss the state of the art in offline handwriting recognition and the general methods from machine learning and deep learning which form the basis for this research. The state-of-the-art methods discussed include connectionist temporal classification, a method for line-wise offline handwriting recognition, and a paragraph-wise transcription method based on attention networks. Both are used for empirical evaluation and comparison in the later chapters.

This overview of the state of the art is followed by a discussion of the research question and the difficulty of segmentation-free paragraph-wise offline handwriting recognition. The main method and theory of this thesis is then detailed by proposing multi-dimensional connectionist classification, which addresses segmentation-free paragraph transcription.

Multi-dimensional connectionist classification is a novel training method for deep neural networks that builds an expectation-maximization loop in combination with a conditional random field in order to predict glyph probabilities from an image of handwritten text. These glyph probabilities are transcribed to a computer-processable string by applying a novel multi-line decoding algorithm. Multi-dimensional connectionist classification and its decoding algorithm are empirically evaluated by applying them to handwritten paragraphs.

A novel heatmap-based visual analytics technique and workflow are proposed for human inspection of the multi-dimensional connectionist classification training loop and transcription results. This technique is designed to preserve the contextual information given by the handwritten text while enabling the model engineer to identify potential error sources and improve hyper-parameters.

This thesis also discusses methods for combining line-wise and paragraph-wise transcription in offline handwriting recognition. The goal of such an approach is to reduce the overall error rate by combining the strengths of both methods. Several methods for combining transcription methods at different steps of their respective pipelines are proposed and empirically evaluated.

Summary

Offline handwriting recognition is a research area within document analysis: the automated transcription of natural text from images. Handwriting recognition is applied in scientific systems as well as in industrial solutions and consumer devices.

This dissertation deals with the segmentation-free offline handwriting recognition of paragraphs, that is, the transcription of paragraphs from images without prior decomposition into lines, words or characters. Removing the need for prior segmentation reduces the possible error sources of the overall transcription system.

The opening chapters of this work present and discuss the current state of the art in offline handwriting recognition. General methods of machine learning and deep learning are also discussed, since they form the basis of the present work. As current methods of offline handwriting recognition, connectionist temporal classification for line-wise transcription and a paragraph-wise transcription method based on attention networks are detailed. Both methods are used for empirical evaluation and comparison in later chapters.

Following this overview, the research question and problem statement of segmentation-free, paragraph-wise offline handwriting recognition are discussed. The main method and theory of this work is multi-dimensional connectionist classification, a novel method for the segmentation-free transcription of paragraphs. Multi-dimensional connectionist classification is a training method for deep neural networks that applies expectation-maximization in combination with conditional random fields in order to predict character probabilities from paragraph images. These character probabilities are then converted into machine-readable strings by a novel decoding algorithm. Multi-dimensional connectionist classification and its associated decoding algorithm are evaluated empirically by applying them to handwritten paragraphs.
Furthermore, a novel heatmap-based technique is proposed for the visual analysis and inspection of training runs and results of multi-dimensional connectionist classification. This method integrates the context given by the original handwriting image and thereby allows the model engineer to identify possible error sources and improve hyper-parameters.

Subsequently, this dissertation presents methods for combining line-wise and paragraph-wise transcription in offline handwriting recognition. The goal of these methods is to combine the strengths of both approaches. Several methods for combining line-wise and paragraph-wise transcription, acting at different points of the overall system, are discussed and evaluated empirically.

Acknowledgements

My thanks go to Prof. Dr. Matthias Franz, who supervised this research. I have known Matthias Franz since my studies and as the supervisor of my master's thesis, and I owe him a great deal of expertise as well as experience and intuition for scientific questions. I greatly value the friendly and relaxed, yet professional and goal-oriented collaboration with him.

I would also like to thank Prof. Dr. Daniel Keim. He supervised this research and extended his trust to me in advance, since he had not known me beforehand. In the course of our collaboration I came to know him as a very competent, helpful and kind person. From him I learned a great deal of expertise and scientific working practice.

I would further like to thank Dr. Marc-Peter Schambach. He initially made this research possible for me at Siemens Parcel Logistics GmbH and subsequently supervised it informally as a mentor. In him I have gained a good friend.

Thanks to Dr. Pascal Laube, who contributed to this work through many interesting and insightful discussions, and also as a good friend.

I would like to thank everyone at the Institute for Optical Systems (Institut für Optische Systeme) and the Data Analysis and Visualization group (Arbeitsgruppe Datenanalyse und Visualisierung). You supported me in this work as friends, with discussions and tips, and as co-authors. In particular I would like to thank Prof. Dr. Georg Umlauf, Prof. Dr. Oliver Dürr, Dr. Dominik Sacha, Dr. Manuel Stein, Dr. Michael Behrisch, Dr. Michael Blumenschein, Dr. Johannes Fuchs, Dr. Dominik Jäckle, Dr. Dirk Streeb, Michael Grunwald, Matthias Hermann, Mennatallah El-Assady, Tobias Birkle, Dennis Griesser, Daniel Dold, Rita Sevastjanova, Thilo Spinner, Fabian Sperrle, Udo Schlegel, Robin Mattes, Felix Peter, Daniel Seebacher, Juri Buchmüller, Nico Brügel, Henning Krause and Haiyan Bührig.

My thanks also go to my colleagues at Siemens Parcel Logistics GmbH. In the context of this work I would especially like to thank Dr. Jörg Rottland, Stephan v.d. Nüll, Michael Zettler and Insa Sigl.

No professional work is possible without the support of family and friends. Many thanks to all of you whom I call my friends with good reason. I feel deep gratitude towards my parents Elfriede and Werner Schall, my brother Sebastian Schall and his partner Stefanie Eckardt. Many thanks to Simone and Joachim Breyer, Stefan Lang and Andreas Bolz.

This work was funded by Siemens Parcel Logistics GmbH (my working time, computer hardware, and conference attendance), which made it possible in the first place.

This list is certainly not complete. Thank you to everyone I know as family, friends, colleagues and scientists!

Contents

1 Introduction
  1.1 Overview and Motivation
  1.2 Scientific Contributions
  1.3 Publications
  1.4 Organization of this Thesis
2 Background
  2.1 Machine Learning
  2.2 Conditional Random Fields
  2.3 Deep Neural Networks
  2.4 Expectation-Maximization
3 Related Work
  3.1 Connectionist Temporal Classification
  3.2 Paragraph Transcription using Attention Networks
  3.3 Paragraph Transcription by Reshaping CNNs
4 The Problem with Multi-Line Handwriting Recognition
  4.1 Overview
  4.2 Segmentation of Handwritten Paragraphs
  4.3 Computational Considerations
5 Decoding Algorithms for Multi-Line Text Recognition
  5.1 Overview
  5.2 Structure of the Model Output
  5.3 Multi-Line Decoding
  5.4 Finding Lines
  5.5 Decoding Lines
6 Multi-Dimensional Connectionist Classification (MDCC)
  6.1 Overview
  6.2 Structure of Paragraphs
  6.3 Basic of Multi-Line Training
  6.4 Maximum Likelihood Training
  6.5 Expectation-Maximization Training
  6.6 Construction of and Inference in the CRF
  6.7 Emphasizing Segmentation
7 Text Recognition for Paragraphs
  7.1 Overview
  7.2 Data Augmentation
  7.3 Forced Alignment
  7.4 Neural Network Model
  7.5 Experiments and Results
  7.6 Discussion
8 Hyper-Parameter Search using Visual Analytics
  8.1 Problem Description and Idea
  8.2 Error Sources in MDCC
  8.3 Workflow for Identification of Error Sources
  8.4 Heatmap-Based Visualization for MDCC
  8.5 Discussion
9 Combined Models for Text Recognition
  9.1 Idea and Overview
  9.2 Classifier on Paragraph Images
  9.3 Classifier on Transcribed Texts
  9.4 Classifier on Segmentation Information
  9.5 Discussion
10 Dictionary-Based Decoding Algorithms
  10.1 Overview and Relation to This Work
  10.2 Decoding using a Large Lexicon and Fuzzy Search
  10.3 Decoding using LSTM Networks and Metric Learning
11 Discussion and Conclusion
  11.1 Achieved Goals
  11.2 Ideas for Future Research
  11.3 Discussion
Bibliography

Nomenclature

BP      Belief Propagation
BPTT    Backpropagation Through Time
CNN     Convolutional Neural Network
CRF     Conditional Random Field
CTC     Connectionist Temporal Classification
DGM     Directed Graphical Model
DNN     Deep Neural Network
EM      Expectation-Maximization
GPGPU   General Purpose Graphics Processing Unit
LBP     Loopy Belief Propagation
LSTM    Long Short-Term Memory
MAP     Maximum A Posteriori
MDLSTM  Multi-Dimensional Long Short-Term Memory
ML      Machine Learning
MLP     Multi-Layer Perceptron
MRF     Markov Random Field
NLP     Natural Language Processing
RNN     Recurrent Neural Network
SVM     Support-Vector Machine
VA      Visual Analytics
XAI     eXplainable AI

Mathematical Notation

Throughout this thesis we will discuss multiple mathematical concepts such as graphical models, artificial neural networks and deep neural networks, drawing on mathematical fields such as statistics, information theory, linear algebra and analysis. This thesis applies a common mathematical notation to all of these concepts. This notation is as follows:

• Scalar variables are typeset in italic font, e.g. i, j or n, m.
• Italic symbols with an index, e.g. y_i, refer to a specific scalar within a tensor, matrix, vector or set.
• Functions are typeset in Roman font, e.g. exp(). Constants, e.g. k, are also in Roman font.
• Tensor, matrix, vector and set symbols are typeset in bold font, e.g. W, x or y. As stated before, specific elements of these are typeset in italic font but with an index, e.g. W_i.
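As a small illustration of how these conventions combine, the following LaTeX fragment typesets a softmax-style expression. The expression is an assumption chosen only to show the fonts side by side; it is not taken from the thesis.

```latex
% Scalars and indexed elements in italic, vectors in bold,
% functions such as exp() in upright (Roman) font.
\[
  y_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)},
  \qquad
  \mathbf{x} = (x_1, \dots, x_n), \quad
  \mathbf{y} = (y_1, \dots, y_n)
\]
```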
Chapter 1

Introduction

1.1 Overview and Motivation

The topic of this thesis is mainly offline handwriting recognition, that is, the automatic, computerized transcription of handwritten text from an image containing that text. The image is typically produced by scanning a sheet of paper or captured by an optical camera system. The offline handwriting recognition task is then to run an algorithm that transcribes the text contained in this image into a form that software can process further, e.g. a UTF-8 encoded string.

Offline handwriting recognition as a research field is part of document analysis and supports research in, for example, historical document analysis. The motivation, application and context for the research and development of the methods discussed in this thesis are provided by the products and solutions of Siemens Parcel Logistics. Offline handwriting recognition is part of the pipeline for processing and sorting mail by automatically reading the sender and receiver addresses on mail and parcel items. The overall pipeline optically captures one or multiple images of a mail item while it physically travels through the sorting system, reads and encodes the required information from these images, and finally decides how to proceed with the processing of this mail piece, yielding the corresponding control commands to the physical sorting machine. In this context offline handwriting recognition is part of the reading and coding of addresses on mail and parcels.

Figure 1.1.1 shows a belt system that transports parcels through a tunnel for optical capture of images of the parcel from six sides.

Figure 1.1.2 shows the Siemens cross-belt sorter VarioSort EXB, which is used for automatic sorting of parcels. Offline handwriting recognition of the addresses on the parcels is an intermediate step towards issuing commands to the belts of the sorting system in order to physically transport the parcels to their intended destinations.
Figure 1.1.3 shows a Siemens Integrated Reading and Video Coding Machine (IRV) for automatic sorting of letter mail. Similarly, offline handwriting recognition is required to read addresses on these mail items.

Figure 1.1.1: Tunnel for capturing images of parcels in preparation for the automatic reading and coding step. Image retrieved from https://www.siemens-logistics.com/en/parcel-logistics/reading-and-coding on September 23rd, 2021.

Figure 1.1.2: Siemens VarioSort EXB for automatic sorting of parcels. Image retrieved from https://www.siemens-logistics.com/en/parcel-logistics/sorting on September 23rd, 2021.

Figure 1.1.3: Siemens Integrated Reading and Video Coding Machine (IRV) for mail sorting. Image retrieved from https://www.siemens-logistics.com/en/mail-sorting/letter-sorting-and-sequencing on September 23rd, 2021.

At the time of writing, most of the addresses on mail items and parcels are not handwritten by humans but printed by machines. This means that the layout of the labels on mail items and the font in use for addresses are much easier to recognize, read and encode correctly. However, there are still mail items and parcels in circulation with handwritten addresses, and as such the need for reliable offline handwriting recognition remains.

Because handwritten text is produced by a human using a pen on a sheet of paper, some glyphs in the final handwritten text may overlap. Overlapping glyphs may occur within a text line between adjacent glyphs, but also between overlapping text lines. Modern methods in offline handwriting recognition typically address the problem of overlapping glyphs by applying so-called segmentation-free methods, that is, methods that are able to transcribe text without prior separation of its components. Such a separation would, for example, segment paragraphs into lines, or lines into words and characters.
Segmentation-free methods avoid this since every segmentation step introduces a potential source of errors into the overall offline handwriting recognition system.

Connectionist temporal classification (CTC)[46], see also Section 3.1, is one such segmentation-free method. CTC was developed for the automatic transcription of text lines. It removes the need for segmentation of text lines into words or characters. Figure 1.1.4 shows an example postal address for which transcription with CTC is applicable. However, overlaps may occur between text lines, as Figure 1.1.5 shows in an example from the IAM offline handwriting database[88]. Overlapping lines may also occur in handwritten addresses on mail items or parcels.

Figure 1.1.4: Postal address image from a Siemens Parcel Logistics project in New Zealand.

Multi-dimensional connectionist classification (MDCC), proposed in this thesis in Chapters 5 and 6, is designed as a segmentation-free offline handwriting recognition method for transcribing whole paragraphs without prior segmentation into individual lines, words or characters. It is specifically designed to handle overlapping text lines. It thus removes the error source of additional segmentation steps within the overall transcription system.

Figure 1.1.5: Paragraph from the IAM offline handwriting database that shows overlapping text lines.

The technical motivation for the research into paragraph-wise segmentation-free offline handwriting recognition is thus given by two facts: handwritten text lines do sometimes overlap, and there are mail items and parcels with handwritten addresses in circulation. Modern offline handwriting recognition methods rely on deep neural networks (DNNs), a specific type of machine learning model. DNNs have proved successful in solving complex recognition tasks based on large amounts of data, but they are also difficult for human experts to understand and optimize.
As such, this research lies at the intersection of machine learning, document analysis and visual analytics.

1.2 Scientific Contributions

This section details the novel scientific contributions contained in this doctoral thesis and relates them to their respective research fields. The research contributions of this thesis are as follows:

• Multi-dimensional connectionist classification as a whole, with its training method and decoding algorithm, is a novel method for paragraph-wise segmentation-free offline handwriting recognition. It is capable of transcribing handwritten paragraphs with overlapping text lines, writing of varying size, and slanted or angled text (up to an angle of 45 degrees). MDCC as detailed in this thesis is combined with a deep neural network as the actual model for the recognition of handwritten text. However, MDCC itself is a training method and decoding algorithm that is not restricted to a specific deep neural network architecture or machine learning model. The training method of MDCC is discussed in Chapter 6 and its decoding algorithm in Chapter 5.

• Multi-dimensional connectionist classification as a training method is a novel contribution to machine learning and computer science. It interprets paragraph-wise segmentation-free offline handwriting recognition as an inference task over a space of two spatial dimensions given only incomplete information. The information provided in this task is the image of handwritten text and, only during training, the label sequence of the correctly transcribed text. No geometric information, e.g. the position or extent of characters, is provided. This missing information needs to be inferred. To this end, MDCC sets up an expectation-maximization loop between a conditional random field (CRF) and a deep neural network (DNN) in order to infer the missing information and optimize the model parameters at the same time. Chapter 6 discusses this approach.
• Training deep neural networks using MDCC highlighted the need for understanding its workings in the context of offline handwriting recognition. Understanding the deep learning model used in combination with MDCC allows the expert user to improve its hyper-parameters and to correct potential errors in the ground truth data or software implementation. Chapter 8 of this thesis proposes a novel visual analytics technique for inspecting the predictions of the DNN and CRF models while preserving the contextual information provided by the handwritten text. It also proposes techniques for identifying interesting cases in MDCC and a novel workflow for identifying and correcting error sources in MDCC.

• Multi-dimensional connectionist classification is designed for the paragraph-wise transcription of handwritten texts. There are methods, e.g. connectionist temporal classification (CTC), that allow for line-wise transcription of handwritten texts. Since paragraph-wise transcription only unfolds its full benefit on difficult-to-segment paragraphs, the question arises whether the decision to transcribe line-wise or paragraph-wise can be made on a case-by-case basis. Chapter 9 proposes novel methods for combining line-wise and paragraph-wise transcription by classifying each example in order to predict which transcription method yields a lower error rate.

• Section 10.2 discusses a novel method for decoding the predictions of a DNN trained with CTC by extracting character n-grams and performing a fuzzy search within a large dictionary of possible strings. This method can be used to speed up the decoding process in combination with CTC.

• Section 10.3 proposes a novel training method for optimizing DNNs towards estimating the Edit-distance between a query string and a reference dictionary, keeping both the query string and the dictionary exchangeable. The DNN in this case only learns to approximate the algorithm for computing the Edit-distance.
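For readers unfamiliar with the Edit-distance mentioned in the last contribution, the following minimal Python sketch shows the classical dynamic-programming computation that such a DNN would learn to approximate. It assumes the Levenshtein variant with unit costs for insertion, deletion and substitution; the function names `edit_distance` and `rank_dictionary` are hypothetical and are not taken from the thesis or its publications.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumed variant).

    Returns the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`.
    """
    # previous[j] is the distance between the processed prefix of `a`
    # and the first j characters of `b`; start with the empty prefix.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def rank_dictionary(query: str, dictionary: List[str]) -> List[Tuple[int, str]]:
    """Sort dictionary entries by their distance to the query (hypothetical helper)."""
    return sorted((edit_distance(query, entry), entry) for entry in dictionary)
```

Both the query and the dictionary are plain arguments here, which mirrors the requirement that they stay exchangeable. The exact computation costs time proportional to the product of the two string lengths for every dictionary entry.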
1.3 Publications

This doctoral thesis is based on the following publications of mine. They are listed in reverse order of publication, starting with the newest. All publications in this list went through a peer-review process. The individual authors have been asked for permission to use these publications in this thesis, and their individual contributions are outlined in the following listing. The attribution of each author's contributions is written from the perspective of the author of this thesis.

There are also, of course, the general contributions of Daniel A. Keim and Matthias O. Franz as my doctoral advisers. Both advisers contributed by teaching good research practice, machine learning and visual analytics, as well as by proof-reading publications. I would also like to point out that Marc-Peter Schambach, with his experience in offline handwriting recognition and as a colleague at Siemens Parcel Logistics, acted as an informal adviser throughout my doctoral research.

Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Dissecting Multi-Line Handwriting for Multi-Dimensional Connectionist Classification.” In: 2019 15th IAPR International Conference on Document Analysis and Recognition (ICDAR). Sept. 2019. DOI: 10.1109/ICDAR.2019.00015

This was the second publication on multi-dimensional connectionist classification. The largest part of this research and paper is my work. Marc-Peter Schambach engaged with me in discussions on difficult cases in both multi-line alignment and decoding. Matthias O. Franz continued with discussions on conditional random fields and how to formulate MDCC in an expectation-maximization framework. My contribution was research into multi-line text alignment using conditional random fields, multi-line decoding algorithms and deep neural networks for offline handwriting recognition.
I formulated the CRF topology and decoding algorithm as proposed in this work. I also implemented the algorithms and carried out the experiments and their evaluation. The paper was written by me, incorporating the feedback of my co-authors. Both co-authors proof-read the paper before publication.

Marc-Peter Schambach, Stephan von der Nüll, and Martin Schall. “Fast and Reliable Acquisition of Truth Data for Document Analysis using Cyclic Suggest Algorithms.” In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). Vol. 2. Sept. 2019, pp. 7–12. DOI: 10.1109/ICDARW.2019.10030

The research into this topic is mainly the work of Marc-Peter Schambach, who also carried out the corresponding implementation and experiments. He wrote this paper, incorporating feedback after proof-reading by both co-authors. Stephan v.d. Nüll was the team lead at Siemens Logistics of both Marc-Peter Schambach and me during the time of this research, and he discussed the use cases for this work as well as data format and storage. My contributions were discussions on the requirements for capturing ground truth data for multi-line text recognition. I also discussed with Marc-Peter Schambach how to resolve the cyclic dependencies that occur during semi-automatic annotation of ground truth data.

Martin Schall, Dominik Sacha, Manuel Stein, Matthias O. Franz, and Daniel A. Keim. “Visualization-Assisted Development of Deep Learning Models in Offline Handwriting Recognition.” In: Symposium on Visualization in Data Science (VDS) at IEEE VIS 2018. Oct. 2018

This publication is a result of heatmap-based visualizations that I had created for debugging multi-dimensional connectionist classification and subsequently formalized as a visual analytics method. As such, the research into this visualization technique and the matching workflow was mainly my work. The body of the paper was written by me.
Dominik Sacha contributed by discussing the proposed workflow in context of his Vis4ML[114] research. Manuel Stein provided feedback on the visualization and presentation of the workflow. Daniel A. Keim engaged in discussions on the presentation of and argumenta- tion for the heatmap-based visualization. Matthias O. Franz discussed the interpretation of the heatmap technique in the context of deep neural networks. My contribution was the development of the heatmap-based visualization technique and related workflow. I wrote the main body of the paper. I also implemented this method for experimentation. All co-authors provided feedback after proof-reading the paper, which I incorporated for the final publication. 18 Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Multi-Dimensional Connectionist Classification: Reading Text in One Step.” In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). Apr. 2018, pp. 405–410. DOI: 10.1109/DAS.2018.36 This was the first publication on multi-dimensional connectionist classification. The main part of the research was done by me. Implementation, experimentation, evaluation and writing of the paper was my work. Marc-Peter Schambach engaged with me in discussions on offline handwriting recognition and multi-line text alignment. Matthias O. Franz contributed by discussing conditional random fields and their applications with me. My contribution was the research of applying a conditional random field and loopy belief propagation to the problem of multi-line text alignment, including the specification of the CRF topology for MDCC. I also formulated the multi-line decoding algorithm proposed in this work. The implementation of the algorithms for both training and decoding, the according experimental setup and evaluation was also done by me. I wrote this paper and both co-authors provided feedback before publication. Martin Schall, Haiyan P. Buehrig, Marc-Peter Schambach, and Matthias O. Franz. 
“LSTM Networks for Edit Distance Calculation with Exchangeable Dictionaries.” In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). Apr. 2018 The idea for this work came to me based on my previous working experience and the question if calculating the Edit-distance can easily be accelerated using GPU hard- ware. Haiyan P. Buehrig worked on this topic as his bachelor thesis and he performed the implementation of the method and he performed the experiments and evaluation. Marc-Peter Schambach contributed by proposing to directly learn the Edit-distance as a metric. Matthias O. Franz was the supervising professor of this bachelor thesis. He engaged in discussions on the deep neural network architecture, the experimental setup and the interpretation of these result. My contribution was the research idea for this work and I co-supervised this work together with Matthias O. Franz. I contributed ideas on how to encode the dictionary and query strings for them to be suitable for deep neural networks and I discussed the application of long short-term memory layers towards this research goal. I, together with Matthias O. Franz, guided the experiments and how to build on their results. The content of this paper was written by me, while incorporating feedback provided by Marc-Peter Schambach and Matthias O. Franz. Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Improving gradient-based LSTM training for offline handwriting recognition by careful selection of the optimization method.” In: BW-CAR| SINCOM (2016), p. 11 This paper is based on general observations while evaluating modern optimization methods for offline handwriting recognition. I conducted the main part of the research and experiments described in this paper and also wrote the paper itself. Marc-Peter Scham- bach engaged with me in discussions on which properties of an optimization method are useful for handwriting recognition. Matthias O. 
Franz guided the experimental setup for this work and discussed with me the general properties of these modern optimization methods. My contribution was the application of the modern optimization methods to offline handwriting recognition based on their mathematical properties. I conducted the implementation of methods and experiments for this paper. Both co-authors provided me with feedback on the paper.

Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Increasing Robustness of Handwriting Recognition Using Character N-Gram Decoding on Large Lexica.” In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS). Apr. 2016, pp. 156–161. DOI: 10.1109/DAS.2016.43

The largest part of the research into this topic and writing of the paper was conducted by me. Marc-Peter Schambach provided the idea for this work as an introduction for me into both offline handwriting recognition and the OCR software at Siemens Logistics. He engaged with me in discussions on offline handwriting recognition and decoding algorithms for it. Matthias O. Franz gave his feedback on the experimental results and their discussion. My contribution was research into how to use and structure an n-gram index for decoding in offline handwriting recognition. I also conducted the implementation and evaluation of this method. The paper was written by me with feedback given by my co-authors.

1.4 Organization of this Thesis

The section at hand outlines the structure and organization of this thesis. We will also discuss decisions that led to this specific organization, in the hope of improving the reading flow.

Chapter 2 discusses works and methods by other researchers and authors that serve as a basis for the research in this doctoral thesis. This does not mean that the work of this thesis is derived from the methods detailed in Chapter 2, but that it builds upon them.
The purpose of this chapter is to introduce the reader to the concepts necessary for understanding the methods proposed in this thesis. Chapter 2 discusses machine learning, conditional random fields, deep neural networks and expectation-maximization. I expect that many readers are already familiar with these methods and may want to skip this chapter, partially or in full. Section 2.2 discusses conditional random fields from the perspective of defining the model and its parameters and using it for inference based on given knowledge and assumptions. This is in contrast to using a CRF as a machine learning model while automatically learning its parameters, which is not how CRFs are used in this thesis.

Chapter 3 details methods that can be directly related to the methods proposed in this thesis. These works are either compared to those in this thesis or their influence on this thesis is shown. Section 3.1 discusses connectionist temporal classification, which is a method for segmentation-free line-wise offline handwriting recognition. This is in contrast to this work, which addresses paragraph-wise offline handwriting recognition. Sections 3.2 and 3.3 outline methods for paragraph-wise offline handwriting recognition that can be directly compared to multi-dimensional connectionist classification.

Chapter 4 discusses the problem of multi-line offline handwriting recognition from the perspectives of both document analysis and computational complexity. It answers the questions ‘Why is it hard to go from line-wise transcription to paragraph-wise transcription?’ and ‘Why is it hard to go from one-dimensional sequence labeling to two- or multi-dimensional sequence labeling?’.

Chapters 5, 6, 7, 8 and 9 represent the main body of work conducted in this doctoral thesis.

Chapter 5 discusses the decoding algorithms for transcribing multi-line texts as proposed in multi-dimensional connectionist classification.
Chapter 6 details the training method for optimization of a DNN in multi-dimensional connectionist classification. The order of these two chapters is flipped in comparison to how MDCC is applied: one first needs to train the artificial neural network and perform inference using this optimized model before its prediction can be decoded. The reason for switching the order of these two chapters is that MDCC relies on a latent variable, called the ‘soft-assignment’ or ‘alignment’ in this thesis, which needs explanation. Explaining the MDCC training algorithm requires an explanation of how to predict this soft-assignment given an image of handwritten text. The discussion of the decoding algorithm, on the other hand, relies on how to retrieve a computer-processable string from this soft-assignment. The difference here is that an image of handwritten text consists of a large number of tokens (its pixels), with each token carrying only a small amount of information, and that it can be counter-intuitive to discuss the properties of handwritten text from a visual perspective. A computer string of natural language, on the other hand, consists of only a small number of tokens (its characters), with each token carrying a large amount of information, and an intuitive understanding of the structure of this string is given by our everyday reading and writing on computers. It seems easier for the reader to first discuss the decoding algorithms, followed by the training algorithms. This way the discussion of the decoding algorithms provides a basis for detailing the soft-assignment.

Chapter 7 details the implementation of multi-dimensional connectionist classification, its application to the IAM offline handwriting database[88] and the empirical evaluation based on these experiments. This chapter contains an evaluation of MDCC itself, as well as a direct comparison to the method of applying attention-networks for paragraph-wise offline handwriting recognition.
This chapter also discusses MDCC in relation to existing works in publications of other authors.

Chapter 8 proposes a visualization technique for multi-dimensional connectionist classification. This visualization technique is embedded into a workflow for guiding an expert user to optimize the hyper-parameters of a DNN for MDCC, as well as to allow the identification of further error sources in MDCC. This chapter is based on the observation that automatic identification of error sources and automatic hyper-parameter optimization in MDCC is difficult, but much easier if the human expert user is taken into account. This chapter proposes a workflow for improving the DNN model trained with MDCC by putting the human expert user into the loop.

Chapter 9 discusses methods for combining line- and paragraph-wise offline handwriting recognition. The proposed methods do this by classifying each handwritten text in order to infer which of the two methods should be applied. The goal of these methods is to improve the combined transcription error rate in line- and paragraph-wise transcription.

Chapter 10 presents novel research conducted in the context of this doctoral thesis, but which is not part of the ‘main story line’ of this thesis. Section 10.2 details an approach to single-line decoding in the context of connectionist temporal classification by applying a fuzzy search within a large database of valid strings. Section 10.3 discusses a method for the application of long short-term memory networks in order to estimate the Edit-distance between a query string and the entries of a dictionary. Both methods are ideas for improving the line-decoding in connectionist temporal classification or multi-dimensional connectionist classification.

Chapter 2 Background

2.1 Machine Learning

In this chapter we will discuss some general machine learning (ML) concepts, practical approaches as well as terminology used in this thesis. I would like to start this by quoting Kevin P.
Murphy[93]:

“With the ever increasing amounts of data in electronic form, the need for automated methods for data analysis continues to grow. The goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest.”

This raises the question of what type of data is processed in ML. In general, a wide variety of data can be and is processed by ML systems. Most common in scientific and industrial applications are probably images, videos, sound recordings, financial data and abstract measurements, e.g. temperature, noise level, electrical current or GPS positions for a singular point in time or for a whole time series. Data in ML can either be labeled or unlabeled, where labeled means that semantic information is attached to the data. This might be information about the content of the data, e.g. if the image shows a dog or a cat.

As quoted above, the goal of ML is to find patterns within the data and exploit these patterns to predict useful but unobserved values. This involves a training process in which a ML model is tailored towards the specific task and data at hand. Depending on how the data is collected, possibly labeled, or generated, different training paradigms like supervised learning, unsupervised learning or reinforcement learning are applied. Supervised learning uses the observed data, but also has information about the true outcome of the prediction step. This allows controlling the predictions made by the ML model and correcting them towards the true predictions that are known beforehand. As both the observed data and true prediction outcomes are known beforehand, this is the most controlled training environment in machine learning. Unsupervised learning applies only the observed data to the training process, but has no information about the prediction outcome beforehand.
The ML model in unsupervised learning is to uncover patterns in the data and make useful predictions on its own. Clustering of data is one case of unsupervised learning. Reinforcement learning[142] is not based on data known beforehand at all, but instead generates data during training by executing actions within a simulated environment, for example a simulation of a driving car. The training paradigm in reinforcement learning is to observe the state of the environment generated by the simulation, execute an action as suggested by the ML model and then to observe a scalar reward that is to be maximized during learning. We see that these are three very different training schemes with different semantic information about the data and task at hand, but they still fall into the general machine learning category of uncovering and exploiting patterns in the observed data.

Next we will discuss different types of tasks that models in machine learning may solve. First, a model in ML refers to an abstract representation of the patterns that have been detected or knowledge that has been learned during training. In modern ML methods this representation is often in the form of statistical correlations in the data or, e.g., in semantic representations like decision trees. A model in ML allows inference from known to unseen data by applying the patterns uncovered during training. Typical models in ML are e.g. artificial neural networks (ANNs)[126][7, ch. 5] or support-vector machines (SVMs)[10, 30, 127]. A task in ML describes the goal that inference on the model should solve. The two typical tasks are regression and classification.

Figure 2.1.1: Regression task of predicting a scalar value based on observed features.

Figure 2.1.1 shows a set of example data and an example model for a regression task. Regression is to interpolate or extrapolate scalar values from seen features and is a general method in statistics.
The black crosses in Figure 2.1.1 mark the data points, whereas the blue dotted line is a partial plot of a linear regression model.

Figure 2.1.2: Classification task of predicting a class assignment based on observed features.

Figure 2.1.2 exemplifies a classification task in which data points are assigned to one of two or more classes. Classes in this case relate to different semantic concepts within the data, e.g. the cats or dogs in images for object classification. The figure shows data points as crosses, with the two classes in red and green. The model is again visualized as a blue dotted line and is in this case a linear classifier that describes the separating plane between the two classes.

The task that we will discuss in this thesis is called sequence labeling[43], which is the assignment of a specific sequence of labels, e.g. a character string, over one or more discrete spatial or temporal dimensions. The length of the label sequence is smaller than or equal to the size of these temporal or spatial dimensions. It follows that not every point in the temporal or spatial dimensions has a unique label and labels may span over several points with their exact location and extent being unknown. This is a variant of a classification task, but applied to temporal or spatial problems. Prominent examples of this type of task are transcription of text from audio, e.g. voice recognition in smart assistant devices, or transcription of text from images.

We have briefly discussed different training paradigms, as well as what a model is in general and what types of tasks there are in machine learning. How is learning in machine learning facilitated? Learning is done by parameterization of models in the form of automatically adapted parameters and user-defined hyper-parameters. Parameters in the context of this thesis refer to all variables within a model that are automatically optimized using a training method, such as gradient descent.
Gradient descent sets up the learning as an optimization problem over the parameters of the model, in which a specific optimization criterion needs to be satisfied, e.g. the error of a loss function that is to be minimized. We will discuss gradient descent for deep neural networks in Section 2.3. Parameters in deep neural networks, also discussed later, are the weights and biases that define the linear combination of artificial neurons. Hyper-parameters are variables in the model or optimization algorithm that are user-defined and not automatically chosen. Hyper-parameters are for example the number of layers in a multi-layer perceptron, the learning rate in gradient descent or even the choice of the modeling method itself.

The distinction between parameters and hyper-parameters in the context of an optimization problem requires us to think about how to evaluate specific solutions in the form of parameter sets for models. In a supervised setting, a set of input data and true predictions are provided and can be applied to both optimization and evaluation. In order to conduct this in a statistically fair and comparable fashion, this data set should be split into several disjoint parts. Splitting should be done in such a way that general patterns of the problem at hand will occur in all parts, but patterns not correlated to the task will be restricted to individual splits. For example in the transcription of handwritten text, all characters should occur in all splits and all splits should contain many different texts that are valid in the language at hand. On the other hand, individual writers should be restricted to their specific split. This is done to prevent the machine learning model from becoming sensitive to patterns that are unrelated to the task at hand but still help to improve the optimization criterion. This way we will be able to detect if the model is sensitive to patterns unrelated to the task by evaluation of the other data set splits.
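The writer-restricted splitting described above can be sketched in a few lines. This is only a minimal illustration, not the procedure used for the experiments in this thesis: the function name `split_by_writer`, the split fractions, the use of three parts and the synthetic sample list are all assumptions made for the example, and the sketch does not check that every character occurs in every split.

```python
import random
from collections import defaultdict


def split_by_writer(samples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split (writer_id, item) pairs into three parts with disjoint writers.

    Hypothetical helper: the fractions refer to writers, not to samples.
    """
    by_writer = defaultdict(list)
    for writer_id, item in samples:
        by_writer[writer_id].append((writer_id, item))

    # Shuffle the writers, not the samples, so that a writer never
    # appears in more than one part.
    writers = sorted(by_writer)
    random.Random(seed).shuffle(writers)

    n_first = int(fractions[0] * len(writers))
    n_second = int(fractions[1] * len(writers))
    groups = (
        writers[:n_first],
        writers[n_first:n_first + n_second],
        writers[n_first + n_second:],
    )
    return [[s for w in group for s in by_writer[w]] for group in groups]


# Synthetic example data: 10 writers with 3 text lines each.
samples = [(f"writer{w}", f"text{w}_{k}") for w in range(10) for k in range(3)]
part_a, part_b, part_c = split_by_writer(samples)
```

Shuffling at the level of writers is the design choice that matters here: shuffling individual samples would leak the handwriting style of a writer into several parts.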
Into how many disjoint parts do we need to split the available data for a fair set-up of supervised training? Supervised training requires data for automatic optimization of the parameters of the model. After that the model is potentially overly sensitive to the specific patterns that occur in this training set. This is known as overfitting, where the error on the training set for automatic optimization is smaller than the error on previously unseen data. To combat this, we need to split the available data into at least two parts: one for automatic optimization and one for evaluation in order to detect overfitting. This covers the automatic optimization, but models that include hyper-parameters are also manually tuned by the model engineer. Of course this also introduces its own overfitting since the model engineer will choose hyper-parameters that lead to a low error rate on the available data sets. This means the overall optimization process (automatic and manual) is now prone to overfitting on the available data. We can combat this with the same approach as with automatic optimization, by splitting the unseen data into yet another part. We now have three disjoint parts of the available data in supervised learning: the training set, which is used for automatic optimization of the parameters; the validation set, which is used to detect overfitting in the automatic optimization and also for manual optimization of hyper-parameters; and the evaluation set or test set, which facilitates detection of overfitting caused by ill-chosen hyper-parameters.

Automatic optimization on models is stateless in the sense that models and optimization algorithms do not carry over information from other models or optimization runs. This is not the case for humans and by extension not the case for the model engineer.
Therefore in a perfect world the evaluation set should be hidden from the model engineer and only used for evaluation after the seemingly best hyper-parameters have been identified.

2.2 Conditional Random Fields

A conditional random field (CRF) is a graphical model of a multivariate probability distribution. It defines the joint distribution of multiple random variables in dependency of observed variables. CRFs are a specialization of Markov random fields (MRFs) in that they allow conditioning their random variables on observed variables. MRFs are in turn a variant of graphical models that includes undirected dependencies between random variables and allows cycles in the graph structure. We will start by introducing graphical models in general and building up to CRFs from there.

Figure 2.2.1: Example of a graphical model.

Figure 2.2.1 shows a small example graphical model that we will use to point out the basic terminology. It shows three random variables or nodes $y_0$ to $y_2$ that are dependent on each other in an ascending sequential order. Each random variable has the set of characters A through Z as its discrete states. We could now use this graphical model to define the joint distribution of words of three characters in length. In this case we would expect ‘THE’ to have a high probability, whereas for example ‘AXZ’ should have a low probability. We will further discuss the definition of joint probabilities for graphical models in this section.

Equations and algorithms in this section are based on and similar to, but not necessarily identical to, the ones stated by Kevin P. Murphy[93].

Bayesian Networks

A directed graphical model (DGM)[93, ch. 10] uses a graph structure to define the joint distribution of multiple random variables by encoding variables as nodes and their dependencies as edges.
As the name suggests, their dependencies are unidirectional with the conditioning of two random variables $y_i$ and $y_j$ in the form of $P(y_i|y_j)$ instead of $P(y_i, y_j)$. If the DGM does not contain any cyclic dependencies, meaning that a random variable cannot be indirectly conditioned on itself, it is also called a Bayesian network, which will be the first type of DGM that we will discuss in detail.

Figure 2.2.2: Bayesian networks are one type of a directed graphical model.

A Bayesian network is directed and acyclic, which means its nodes can be topologically ordered. The joint distribution that it defines is then the product of the probability distributions of each random variable, conditioned on its predecessors. The joint distribution

$$P(y) = P(y_0)\,P(y_1|y_0)\,P(y_2|y_0, y_1)\,P(y_3|y_0, y_1, y_2)\,P(y_4|y_0, y_1, y_2, y_3) \tag{2.2.1}$$

is generally applicable to Bayesian networks with five nodes $y_0$ through $y_4$.

The Markov assumption states that the distribution of one random variable is conditionally independent of the remaining graphical model given its direct neighbors. This is also called the Markov blanket in graphical models. Applying the Markov assumption to Equation 2.2.1 allows us to simplify the joint distribution to

$$P(y) = P(y_0)\,P(y_1|y_0)\,P(y_2|y_0)\,P(y_3)\,P(y_4|y_1, y_2, y_3) \tag{2.2.2}$$

for the model in Figure 2.2.2.

Markov Random Fields

We will now move on from Bayesian networks to Markov random fields (MRFs)[66, 82][93, ch. 19], which generalize the concept of graphical models to undirected and cyclic graph structures.

Figure 2.2.3 shows one example MRF. Please note that the graphical models of Figures 2.2.2 and 2.2.3 do not define the same probability distribution since converting from unidirectional to bidirectional dependencies introduced additional conditionals for the random variables. Topological ordering of the nodes in the graph is not possible in a MRF since the graph includes cyclic structures.
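The factorization of the Bayesian network in Equation 2.2.2 can be checked numerically with a short sketch. The parent sets below are read off Equation 2.2.2; everything else is an assumption made for illustration: each variable is restricted to two states and the conditional probability tables are filled with random numbers, since the thesis does not specify them.

```python
import itertools
import random

rng = random.Random(0)


def random_cpt(n_parents):
    """Hypothetical table: parent states -> P(child = 1 | parents)."""
    return {
        parent_states: rng.random()
        for parent_states in itertools.product((0, 1), repeat=n_parents)
    }


# Parent sets taken from the factors of Equation 2.2.2.
parents = {0: (), 1: (0,), 2: (0,), 3: (), 4: (1, 2, 3)}
cpts = {node: random_cpt(len(ps)) for node, ps in parents.items()}


def joint(y):
    """P(y) as the product of each node's distribution given its parents."""
    p = 1.0
    for node, ps in parents.items():
        p_one = cpts[node][tuple(y[q] for q in ps)]
        p *= p_one if y[node] == 1 else 1.0 - p_one
    return p


# A valid joint distribution sums to one over all 2^5 assignments.
total = sum(joint(y) for y in itertools.product((0, 1), repeat=5))
```

Because every factor is itself a normalized conditional distribution, the product needs no additional normalization constant, which is a property that the undirected models discussed next do not share.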
We will introduce the concept of potential functions to define the joint distribution.

At this point it is prudent to differentiate between continuous and discrete MRFs. In a continuous MRF each random variable has a continuous scalar as its state. Discrete MRFs assign discrete states to their random variables. This difference gives rise to different definitions of the joint distribution and inference. Continuous values require integrals for the joint distribution and inference, whereas discrete values simplify to the sum over the finite number of states. Since this thesis deals with discrete states, we will only discuss the case for discrete MRFs.

Figure 2.2.3: Markov random fields allow undirected cyclic graph topologies.

Assigning one specific discrete state to each of the random variables of a MRF is called a configuration. As such, each configuration is one overall discrete state in which the MRF can be. Since the number of states per random variable is finite, the number of configurations for the MRF is also finite. Using the MRF from Figure 2.2.3 as an example and assuming two discrete states for each of the five random variables, the MRF in total has $2^5 = 32$ different configurations. The concept of a configuration comes into play when defining the joint distribution of a MRF.

The Hammersley-Clifford theorem[50, 70, 77] states the conditions that are necessary for the joint distribution of a probabilistic graphical model to be defined by the product of its maximal clique potentials. Luckily, any non-negative function that is dependent on the random variables within the clique can be used as a potential function. A clique in graph theory is defined as a subset of nodes of a graph in such a way that every node is a neighbor of every other node in the clique. A maximal clique is a clique where no more nodes from the graph can be added while still retaining the clique property.
A MRF is parameterized by assigning a potential function to each maximal clique of the graph. Please note that each node of the MRF can be part of multiple maximal cliques. The potential function defines the ‘compatibility’ of the nodes within the clique. The value of the potential function should be non-negative and higher if the states of the random variables within the clique are more ‘compatible’ to each other. The joint distribution of the MRF is then proportional to the product of its maximal clique potentials. This gives rise to the definition of the joint distribution

$$P(y) \propto \prod_{C} \psi_C(y_C) \tag{2.2.3}$$

of a MRF with $\psi_C(y_C)$ being the potential function of the maximal clique $C$. Normalization leads to the equation for the joint distribution

$$P(y) = \frac{1}{Z} \prod_{C} \psi_C(y_C) \tag{2.2.4}$$

of a MRF with the Zustandssumme or partition function

$$Z = \sum_{\pi} \Big[ \prod_{C} \psi_C(\pi_C) \Big] \tag{2.2.5}$$

being the sum over all possible configurations $\pi$ of the random field.

Assigning potentials to the maximal cliques of a MRF is only one way to parameterize the MRF. Another way is to assign potentials to each edge of the MRF and the two nodes it connects[93, ch. 19.3.1], which then is called a pairwise MRF. This type of parameterization is used for the remainder of this thesis. A pairwise MRF is parameterized by two potential functions. The node potential function $\psi_s(y_s)$ defines the potential for the state $y_s$ of node $s$. Similarly, the edge potential function $\psi_{s,t}(y_s, y_t)$ defines the potential for states $y_s, y_t$ of two neighboring nodes $s$ and $t$. This then leads to the joint distribution

$$P(y) = \frac{1}{Z} \prod_{s} \psi_s(y_s) \prod_{s \sim t} \psi_{s,t}(y_s, y_t) \tag{2.2.6}$$

of a discrete MRF, which is applicable to a wide variety of problems. Relation $\sim$ defines edges within the graphical model and

$$Z = \sum_{\pi} \Big[ \prod_{s} \psi_s(\pi_s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \Big] \tag{2.2.7}$$

again is the partition function.

The Ising model [19, 61] is a basic example for a Markov random field. It is named after Ernst Ising and describes the spin of atoms in ferromagnets and anti-ferromagnets.
Each node of the MRF in an Ising model represents one atom with two states $y_i \in \{-1, +1\}$ for a negative or positive spin respectively. The Ising model is a pairwise MRF with its edge potential function

$$\psi_{s,t}(y_s, y_t) = \begin{pmatrix} e^{w_{st}} & e^{-w_{st}} \\ e^{-w_{st}} & e^{w_{st}} \end{pmatrix} \tag{2.2.8}$$

encoding the ‘compatibility’ of equal spin of neighboring atoms on the diagonal and of non-equal spin on the off-diagonal. $w_{st}$ is zero for non-neighboring nodes (atoms that do not influence each other’s spin). A simplification of this model is to define $w_{st} = J$ for all neighboring nodes $s$ and $t$. The node potential function $\psi_s(y_s)$ in an Ising model is defined as $\psi_s(y_s) = e^0 = 1$ for all node-state combinations since no prior for an atom’s spin is assumed.

There are now three different possibilities for the behavior of an Ising model. First, we can define $J > 0$, modeling a material in which the spin of adjacent atoms tends to be identical. This is the case for ferromagnets. Second, a value of $J < 0$ favors configurations of the material in which the spins of neighboring atoms are different, which is the case for anti-ferromagnets. Such a MRF is also called a ‘frustrated system’ since nodes in an Ising model are ordered in a grid structure and choosing $J < 0$ means that there exists no configuration of the MRF in which there are no edges with a low compatibility. At least one neighboring relation encoded in the edge potential function will always be contradicted by such a frustrated system. The third possibility in an Ising model is to define $J = 0$, modeling a material in which atoms do not influence each other’s spin.

Conditional Random Fields

Transitioning from Markov random fields to conditional random fields (CRFs)[77][93, ch. 19.6] is a straightforward step, since CRFs basically are only MRFs that are conditioned on observed variables. Figure 2.2.4 shows our familiar graphical model, this time as a CRF with conditioning on five observed variables $x_0, ..., x_4$.
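Returning briefly to the Ising model described above, its behavior for different signs of $J$ can be explored by brute force on a very small grid. The following is only a sketch under stated assumptions: the 2×2 grid, the helper names and the values $J = \pm 1$ are chosen for illustration, and enumerating all configurations is feasible only because the grid has four nodes.

```python
import itertools
import math


def grid_edges(rows, cols):
    """Edges between horizontally and vertically adjacent grid nodes."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                edges.append(((r, c), (r, c + 1)))
            if r + 1 < rows:
                edges.append(((r, c), (r + 1, c)))
    return edges


def ising_distribution(rows, cols, J):
    """Joint distribution over all spin configurations, by enumeration."""
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    edges = grid_edges(rows, cols)
    weights = {}
    for spins in itertools.product((-1, +1), repeat=len(nodes)):
        y = dict(zip(nodes, spins))
        # Each edge contributes exp(J) for equal spins and exp(-J) for
        # unequal spins, i.e. Equation 2.2.8 with w_st = J. The node
        # potentials are all e^0 = 1 and therefore omitted.
        weights[spins] = math.exp(sum(J * y[s] * y[t] for s, t in edges))
    Z = sum(weights.values())
    return {spins: w / Z for spins, w in weights.items()}


def most_probable(dist):
    """All configurations that share the highest probability."""
    best = max(dist.values())
    return sorted(spins for spins, p in dist.items() if math.isclose(p, best))


ferro = ising_distribution(2, 2, J=1.0)   # equal neighboring spins favored
anti = ising_distribution(2, 2, J=-1.0)   # unequal neighboring spins favored
free = ising_distribution(2, 2, J=0.0)    # spins do not interact
```

Spins are listed in row-major order, so `most_probable(ferro)` contains the two configurations with all spins equal, while for `J = 0` every configuration receives the same probability.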
The joint distribution of a CRF is defined in a similar fashion to the joint distribution of a MRF, that is, by potential functions over maximal cliques or, as in the equations given in this section, by the node potential function $\psi_s(y_s|x_s)$ and edge potential function $\psi_{s,t}(y_s, y_t)$ of a pairwise CRF. The difference to a MRF is that the node potential function is conditioned on the observed variables $x$.

Figure 2.2.4: Conditional random fields are MRFs with conditioning on observed variables.

Each variable $y$ of a conditional random field is conditioned on one observed variable $x$. The parameterization of a CRF is dependent on observed variables, with the node potential function being $\psi_s(y_s|x_s)$ and edge potential function $\psi_{s,t}(y_s, y_t)$. This leads to the joint probability

$$P(y|x) = \frac{1}{Z} \prod_{s} \psi_s(y_s|x_s) \prod_{s \sim t} \psi_{s,t}(y_s, y_t) \tag{2.2.9}$$

with the partition function $Z$ being the normalization over all configurations $\pi$:

$$Z = \sum_{\pi} \Big[ \prod_{s} \psi_s(\pi_s|x_s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \Big] \tag{2.2.10}$$

An example application of CRFs very much worth mentioning in the context of this thesis is that of handwriting recognition. In such a CRF the label space $y_i \in A$ is the alphabet $A$ of the language at hand. The edge potential function $\psi_{s,t}(y_s, y_t)$ encodes a probabilistic language model, that is the likelihood of observing specific character 2-grams in this language. For example the 2-gram ‘th’ has a higher likelihood and thus higher compatibility than ‘xq’ in the English language. The node potential function $\psi_s(y_s|x)$ encodes a discriminative classifier that produces a high node-state compatibility if the node of the CRF (a pixel in offline handwriting recognition or a time step in online handwriting recognition) corresponds to the specific glyph from the alphabet $A$. The model parameters of both edge and node potential functions in such a CRF for handwriting recognition are learned from a data set of examples of the language at hand.
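The interplay of node and edge potentials in such a handwriting CRF can be illustrated with a toy chain of three positions. This is a sketch only: the four-letter alphabet, the classifier scores in `NODE_POT` and the 2-gram compatibilities in `BIGRAM_POT` are invented numbers, not learned parameters, and the posterior of Equation 2.2.9 is computed by enumerating all strings.

```python
import itertools

ALPHABET = "ehtx"

# Hypothetical classifier scores psi_s(y_s | x_s) for three observed positions.
NODE_POT = [
    {"t": 0.6, "x": 0.3, "h": 0.05, "e": 0.05},
    {"h": 0.4, "x": 0.5, "t": 0.05, "e": 0.05},
    {"e": 0.7, "t": 0.1, "h": 0.1, "x": 0.1},
]

# Hypothetical 2-gram compatibilities psi_{s,t}(y_s, y_t);
# character pairs that are not listed receive a neutral potential of 1.0.
BIGRAM_POT = {("t", "h"): 5.0, ("h", "e"): 5.0}


def score(word):
    """Unnormalized product of node and edge potentials for one string."""
    value = 1.0
    for s, ch in enumerate(word):
        value *= NODE_POT[s][ch]
    for a, b in zip(word, word[1:]):
        value *= BIGRAM_POT.get((a, b), 1.0)
    return value


words = ["".join(w) for w in itertools.product(ALPHABET, repeat=len(NODE_POT))]
Z = sum(score(w) for w in words)                      # partition function
posterior = {w: score(w) / Z for w in words}          # P(y | x)

best = max(posterior, key=posterior.get)              # uses both potentials
node_only = "".join(max(p, key=p.get) for p in NODE_POT)  # ignores 2-grams
```

With these numbers the per-position classifier alone prefers the string `txe`, whereas the joint distribution prefers `the`, because the 2-gram potentials outweigh the slightly higher node score of `x` at the second position.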
This model optimization for CRFs is not discussed here since it is not part of this thesis, which deals with the optimization of deep neural networks and applies conditional random fields for inference. Training the model parameters of a CRF is discussed in literature[93, ch. 19.6.3].

There are three differences between the CRFs described above and the method presented in this thesis: first, the node potential function of the CRFs in this thesis does not itself encode a discriminative classifier, but instead relies on a deep neural network as the best available classifier, which is then iteratively improved. Second, the edge potential function in multi-dimensional connectionist classification is not probabilistic and does not describe a whole language model, but instead is a discrete description of one specific example from a data set. Third, the model parameters of the CRFs in this thesis are defined by a fixed set of rules as discussed in Section 6.2 and not learned from a data set. Overall, the CRF of this example is much more similar to the deep neural network in this thesis in that both serve as discriminative classifiers for glyphs.

A concrete example for the application of CRFs is stereo vision[139], in which the depth dimension in physical space is estimated from two corresponding images that show a disparity in the horizontal axis between each other. Human vision can be seen as an example of this, where the vision from both eyes is combined for depth estimation. Observation $x$ in stereo vision is a pair of images $x_L$ from the left camera and $x_R$ from the right camera. The labels $y_i$ encode the horizontal disparity between the two images, which in this case is discrete since the disparity is measured in whole pixels. Let $i_s$ be the horizontal and $j_s$ the vertical position of the pixel $s$ in question.
The node potential function

$$\psi_s(y_s|x) = \exp\Big[ -\frac{1}{2\sigma^2} \big( x_L(i_s, j_s) - x_R(i_s + y_s, j_s) \big)^2 \Big] \tag{2.2.11}$$

encodes a Gaussian prior with the assumption that corresponding pixels from the two camera images will show a similar pixel intensity or color. The edge potential function

$$\psi_{s,t}(y_s, y_t) = \exp\Big[ -\frac{1}{2\gamma^2} (y_s - y_t)^2 \Big] \tag{2.2.12}$$

encodes a Gaussian prior of neighboring pixels having similar disparities between the two camera images. Inferring the node-state combinations $y_s$ of such a CRF for stereo vision means to infer the disparity between the two camera images for each pixel. This disparity directly correlates to the distance between the camera system and the physical object that it captured. A high disparity indicates a small distance to the object and a low disparity a large distance.

Belief Propagation

The last sections discussed the concepts of graphical models and how to define their joint distribution. Some use-cases require inference in such graphical models, for example estimating the local posterior marginals of individual random variables. This is the case in this thesis. In general there are two different types of inference on graphical models: computing the local posterior marginals or computing the local maximum a posteriori (MAP).

The local posterior marginals in a MRF are defined as follows:

$$P(y_i) = \frac{1}{Z} \sum_{\pi : \pi_i = y_i} \Big[ \prod_{s} \psi_s(\pi_s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \Big] \tag{2.2.13}$$

The local MAP is defined as follows:

$$y^\star = \operatorname*{argmax}_{\pi} \Big[ \frac{1}{Z} \prod_{s} \psi_s(\pi_s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \Big] \tag{2.2.14}$$

Evaluation of the local posterior marginals or local MAP using the above definitions of Equations 2.2.13 and 2.2.14 is obviously computationally prohibitive since they require enumeration of all configurations $\pi$ of the graphical model. The number of possible configurations $\pi$ for a graphical model of $n$ random variables with $m$ discrete states is $m^n$ and too large even for simple models. The complexity of inference in general graphical models is the topic of works[2, 15, 23, 80] by other authors.
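To make the cost of Equations 2.2.13 and 2.2.14 tangible, both definitions can be evaluated literally by enumerating all $m^n$ configurations. The sketch below does this for a hypothetical three-node chain with two states; the function name, the graph and the potential values are assumptions chosen so that the result can be checked by hand.

```python
import itertools


def brute_force_inference(n_nodes, n_states, edges, node_pot, edge_pot):
    """Marginals (Eq. 2.2.13) and MAP (Eq. 2.2.14) by full enumeration."""
    configs = list(itertools.product(range(n_states), repeat=n_nodes))
    scores = []
    for pi in configs:
        score = 1.0
        for s in range(n_nodes):
            score *= node_pot(s, pi[s])
        for s, t in edges:
            score *= edge_pot(pi[s], pi[t])
        scores.append(score)

    Z = sum(scores)  # partition function over all m^n configurations
    marginals = [[0.0] * n_states for _ in range(n_nodes)]
    for pi, score in zip(configs, scores):
        for i, state in enumerate(pi):
            marginals[i][state] += score / Z

    map_config = configs[scores.index(max(scores))]
    return marginals, map_config, len(configs)


# Hypothetical chain 0 - 1 - 2: node 0 prefers state 1,
# neighboring nodes prefer to share the same state.
marginals, map_config, n_configs = brute_force_inference(
    n_nodes=3,
    n_states=2,
    edges=[(0, 1), (1, 2)],
    node_pot=lambda s, ys: 4.0 if (s == 0 and ys == 1) else 1.0,
    edge_pot=lambda ys, yt: 2.0 if ys == yt else 1.0,
)
```

For this toy model the enumeration visits only $2^3 = 8$ configurations, but the same loop applied to the 26-letter alphabet and a line of, say, 40 characters would have to visit $26^{40}$ of them, which motivates the message passing algorithm discussed next.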
Figure 2.2.5: Graphical models in the form of a polytree allow efficient exact inference.

Efficient exact inference in a graphical model is possible if the graphical model is a polytree[24]. A polytree is a directed acyclic graph whose underlying undirected structure is a tree. Figure 2.2.5 shows an example polytree. One algorithm for exact inference in graphical models is called belief propagation (BP)[100][93, ch. 20] and is a message passing algorithm.

When discussing BP, the terms belief and message occur frequently. A belief is an estimation of the probability distribution of one random variable based on the currently available information. A message is the assumption about the probability distribution of a neighboring random variable based on the sender's own belief. Beliefs and messages influence each other since beliefs are formed by collecting evidence from neighboring random variables via messages. Messages towards neighboring random variables are based on the belief about the source random variable.

BP can be broken down into three individual steps. First, choose one of the nodes of the graphical model as the root node. Second, collect evidence for the probability distribution of the root node by message passing from the leaf nodes to the root node. Equations 2.2.15 and 2.2.16 below model this step. Third, update the beliefs of the remaining nodes by message passing from the root node towards the leaf nodes. Equations 2.2.17 and 2.2.18 define this third step. The beliefs of the root node will be correct after the second step, which collects all the evidence for the probability distribution of the root node. From there the third step will correct the beliefs of the remaining nodes.

In the second step, incomplete beliefs

$$\mathrm{bel}^-_t(y_t) \propto \psi_t(y_t)\prod_{c \in \mathrm{child}(t)} m^-_{c \to t}(y_t) \tag{2.2.15}$$

are built by collecting evidence from child nodes by propagating messages

$$m^-_{s \to t}(y_t) = \sum_{y_s}\psi_{s,t}(y_s, y_t)\,\mathrm{bel}^-_s(y_s) \tag{2.2.16}$$

towards the root node.
Passing evidence from the root node towards the leaf nodes yields the correct, but unnormalized, beliefs

$$\mathrm{bel}_s(y_s) \propto \mathrm{bel}^-_s(y_s)\prod_{t \in \mathrm{parent}(s)} m^+_{t \to s}(y_s) \tag{2.2.17}$$

of the nodes of the graphical model, with the messages

$$m^+_{t \to s}(y_s) = \sum_{y_t}\psi_{t,s}(y_t, y_s)\frac{\mathrm{bel}_t(y_t)}{m^-_{s \to t}(y_t)} \tag{2.2.18}$$

being propagated from parent to child nodes in this step. The messages $m^+_{t \to s}(y_s)$ collect evidence about node $t$ and infer an incomplete state distribution about node $s$. It is important that the evidence about the state distribution of node $t$ does not incorporate the node $s$ to which the message is propagated. Multiplying the incomplete state distributions collected by the messages $m^+$ with the incomplete belief $\mathrm{bel}^-_s$, as in Equation 2.2.17, yields the true state distribution of $y_s$.

Equation 2.2.18 can be further generalized by recognizing that ignoring the evidence from node $s$ about node $t$ is not only possible by dividing by the corresponding message $m^-_{s \to t}(y_t)$, but also by simply leaving out node $s$ while collecting evidence about $t$. This produces another variant of Equation 2.2.18:

$$m^+_{t \to s}(y_s) = \sum_{y_t}\psi_{t,s}(y_t, y_s)\,\psi_t(y_t)\prod_{c \in \mathrm{child}(t) \setminus s} m^-_{c \to t}(y_t)\prod_{p \in \mathrm{parent}(t)} m^+_{p \to t}(y_t) \tag{2.2.19}$$

Equations 2.2.15, 2.2.16, 2.2.17, 2.2.18 and the variant 2.2.19 describe the sum-product algorithm based on published works[93, ch. 20.2.1]. Normalizing the beliefs $\mathrm{bel}_s(y_s)$ of Equation 2.2.17 such that the beliefs of one random variable sum up to one will yield the local posterior marginals as defined in Equation 2.2.13. Replacing the $\sum$ operator with the $\max$ operator in Equations 2.2.16 and 2.2.19 will yield the max-product algorithm, which computes the local MAP of Equation 2.2.14. See [93, ch. 20.2] for an in-depth discussion.

Computing the exact marginals is possible in a polytree since messages can be passed in an order that collects all evidence for one node, correctly fixing the beliefs for this node.
Evidence is then distributed from this node and the correct local marginals are computed for all remaining nodes based on their fully available evidence.

Applying the sum-product algorithm to a chain-structured graphical model follows the same principle as the forward-backward algorithm[105], whereas applying the max-product algorithm is the principle of the Viterbi algorithm[33, 149] with backtracking.

Loopy Belief Propagation

We have discussed belief propagation and how it yields exact marginals in the case of a polytree structure of the graphical model. This thesis deals with CRFs in which the nodes are structured in an 8-neighborhood grid. Grid-structured undirected graphical models contain cycles and cannot be represented as a polytree. It follows that BP is not directly applicable to them. We will now discuss loopy belief propagation (LBP)[34, 94][93, ch. 22], which is an inference algorithm to approximate the local marginals.

LBP is based on a simple approach: repeatedly apply BP to the cyclic graphical model until the beliefs converge or other stopping criteria are met. This approach is not guaranteed to converge to a stable point within polynomial time, as even approximating the posterior marginals is NP-hard in general[23]. Nor is such a stable point guaranteed to be near the exact solution for the local marginals[100]. However, LBP has been successfully applied to a variety of practical problems[94]. In computer vision, CRFs and LBP have been successfully applied to image segmentation, object tracking and stereo depth estimation.

The BP algorithm as described before implements the so-called serial protocol for message scheduling. In it, messages are sent one after another from node to neighboring node. The order in which messages are processed is sequential. LBP employs the parallel protocol, see known works[93, ch. 20.2.2], in which all messages are sent simultaneously.
This means that for LBP we will initialize all messages, then iteratively update all messages simultaneously. This process of parallel message updates is repeated until the stopping criteria are satisfied. Message updates

$$m_{s \to t}(y_t) = \sum_{y_s}\left[\psi_s(y_s)\,\psi_{s,t}(y_s, y_t)\prod_{u \in \mathrm{nbr}(s) \setminus t} m_{u \to s}(y_s)\right] \tag{2.2.20}$$

as stated by Kevin Murphy[93, ch. 20.2.2] depend only on the potential functions $\psi$ and the current message values $m$. Message updates are computed by collecting evidence for the source node $s$ by receiving messages from all neighboring nodes $u$, except the target node $t$. This evidence is used to form a belief about the probability distribution of the target node $t$ and to send the corresponding message update to the target node $t$. Approximated beliefs

$$\mathrm{bel}_s(y_s) \propto \psi_s(y_s)\prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(y_s) \tag{2.2.21}$$

are formed by collecting the evidence via message updates from all neighboring nodes. The beliefs $\mathrm{bel}_s(y_s)$ are proportional to the approximation of the local marginals and thus can be normalized to retrieve these approximated local marginals.

Algorithm 2.2.1 Loopy Belief Propagation in Sum-Product Mode
    Initialize messages $m_{s \to t}(y_t) = m_{t \to s}(y_s) = 1$ for all edges $s \sim t$.
    Initialize beliefs $\mathrm{bel}_s(y_s) = 1$ for all nodes $s$.
    Choose a random but fixed order for message updates.
    repeat
        Send messages along each edge according to the chosen ordering:
            $m_{s \to t}(y_t) = \sum_{y_s}\left[\psi_s(y_s)\,\psi_{s,t}(y_s, y_t)\prod_{u \in \mathrm{nbr}(s) \setminus t} m_{u \to s}(y_s)\right]$
        Update beliefs for each node:
            $\mathrm{bel}_s(y_s) \propto \psi_s(y_s)\prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(y_s)$
    until stopping criteria are satisfied.
    Return marginal beliefs $\mathrm{bel}_s(y_s)$.

Algorithm 2.2.1 outlines LBP in sum-product mode as pseudo-code based on the version stated by Kevin Murphy[93, ch. 22.2.2], with the stopping criteria generalized. LBP is typically stopped and the computed beliefs returned once the beliefs no longer change significantly[93, ch. 22.2.2]. While this is surely the most prominent stopping criterion, other more problem-specific ones are possible. We will explore this later in this thesis.
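Algorithm 2.2.1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this thesis. The function name `loopy_bp`, the symmetric edge potential `agree` and the four-node cycle are hypothetical. Messages are additionally normalized after each update for numerical stability, a sorted order stands in for the random but fixed order, and the loop stops once the beliefs change by less than `tol`.

```python
import math

def loopy_bp(node_pot, edge_pot, edges, n_states, max_iters=50, tol=1e-6):
    """Approximate local marginals by sum-product LBP (sketch of Algorithm 2.2.1)."""
    nodes = range(len(node_pot))
    nbr = {s: [] for s in nodes}
    for s, t in edges:
        nbr[s].append(t)
        nbr[t].append(s)
    # One message per directed edge, initialized to 1.
    msgs = {(s, t): [1.0] * n_states for s in nodes for t in nbr[s]}
    order = sorted(msgs)  # assumption: a sorted order replaces the random but fixed order
    beliefs = [[1.0 / n_states] * n_states for _ in nodes]
    for _ in range(max_iters):
        for s, t in order:
            new_msg = []
            for yt in range(n_states):
                total = 0.0
                for ys in range(n_states):
                    # Evidence about s from all neighbors except the target t (Eq. 2.2.20).
                    incoming = math.prod(msgs[(u, s)][ys] for u in nbr[s] if u != t)
                    total += node_pot[s][ys] * edge_pot(ys, yt) * incoming
                new_msg.append(total)
            norm = sum(new_msg)
            msgs[(s, t)] = [v / norm for v in new_msg]  # normalized, most recent value is kept
        updated = []
        for s in nodes:
            # Belief update (Eq. 2.2.21), normalized to sum to one.
            raw = [node_pot[s][ys] * math.prod(msgs[(t, s)][ys] for t in nbr[s])
                   for ys in range(n_states)]
            norm = sum(raw)
            updated.append([v / norm for v in raw])
        change = max(abs(a - c) for old, new in zip(beliefs, updated)
                     for a, c in zip(old, new))
        beliefs = updated
        if change < tol:  # stop once the beliefs no longer change significantly
            break
    return beliefs

# Hypothetical toy model: four binary nodes forming one cycle; node 0 prefers state 0.
def agree(a, b):
    return 2.0 if a == b else 1.0

node_pot = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
edges = [(0, 1), (1, 3), (3, 2), (2, 0)]
beliefs = loopy_bp(node_pot, agree, edges, n_states=2)
```

Because each message is overwritten as soon as it is recomputed, later updates in the same sweep already use the new values. The evidence for state 0 at node 0 therefore spreads around the cycle within a single pass.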
Replacing the $\sum$ operator in the message update with the $\max$ operator will again yield the max-product mode. Please note that the max-product algorithm in LBP can yield inconsistent results, e.g. nodes being in contradicting states.

In theory, the parallel protocol updates all messages at the same time. However, implementing LBP on a physical computer will most likely lead to some serialization of the message order, with the hardware parallelization of modern computing hardware being used for a speed-up in terms of wall clock time. It is worth noting that there are different strategies for computing the message updates in the parallel protocol. One way is the synchronous update, in which the new messages are computed based on the message values from the last iteration. The asynchronous update holds and uses only the most recent value per message. In practice this leads to inconsistencies in the 'versioning' of messages, since currently held message values originate from two different iterations at times $n$ and $n - 1$. It has been observed[93, ch. 22.2.4.3] that this does not pose a problem, but can instead be used to increase the convergence rate of the messages within LBP. This is achieved by choosing a random order for message updates at the beginning and then using this fixed order in each LBP iteration. See [93, Fig. 22.5] for a comparison of the convergence rate in synchronous and asynchronous updates.

2.3 Deep Neural Networks

A deep neural network (DNN)[41, 126] is a (rather loosely) defined type of artificial neural network[126][7, ch. 5]. An artificial neural network is a machine learning (ML)[7, 93] method that employs a large number of neuron-like structures and their connections to approximate decision functions. Each individual neuron in an artificial neural network is a highly abstracted model of a biological neuron[31, 52, 68, 89, 110].
An artificial neural network is typically organized in layers, with each layer consisting of a set of neurons that receive signals from the previous layer. Each layer adds more parameters and non-linear mappings to the learning machine. The input signals of the first layer are the observed data as defined by the ML task. The output signals of the last layer are the values of the decision function and depend on the ML task, e.g. whether the task is classification or regression. This type of artificial neural network is also called a multi-layer perceptron (MLP)[7, ch. 5.1]. A DNN is often defined as a MLP or MLP-like artificial neural network with more than one hidden layer[5], that is, layers that are not directly visible to the user (neither input nor output). We will discuss the function of individual artificial neurons and layers in this section.

A Single Neuron

Figure 2.3.1: A single neuron of an artificial neural network is a linear combination followed by a non-linear function.

A single neuron within an artificial neural network is a weighted linear combination of its input signals $x$, followed by a non-linear activation function. This is described by the function

$$f(x, W, b) = \sigma\left(b + \sum_i x_i W_i\right) \tag{2.3.1}$$

with input signals $x$ and learnable parameters $W$ and $b$. To learn parameters here means to optimize them regarding a task-specific target function. Optimization of an artificial neural network is most often done using gradient descent[11, 65, 107] and backpropagation[112, 113], both of which will be discussed later.

Function $\sigma$ is a non-linear function and often called the 'activation function' of the neuron. In the case of a MLP or DNN, the same activation function is most likely used for all neurons in a layer, an approach which offers computational benefits. The activation function allows the neuron to transition from a 'non-activated state' to an 'activated state', which subsequent neurons use to compute their own activations.
Figure 2.3.2 shows the step function, which transitions the neuron between two discrete states. Please note that this activation function is not used anymore, for various reasons. We will discuss this when detailing backpropagation.

Figure 2.3.2: Step function as an activation function $\sigma$.

Figure 2.3.3: Standard logistic sigmoid as an activation function $\sigma$.

Figure 2.3.3 shows the standard logistic sigmoid $\sigma(x) = \frac{1}{1 + \exp(-x)}$, which is in common use and a much better choice than the step function. The standard logistic sigmoid function is differentiable at any point and monotonically increasing, two properties which allow for gradient descent.

The formulation of a single neuron of an artificial neural network is also the formulation of the decision function of the perceptron algorithm[109, 110]. In the case of a perceptron the activation function is also called the 'threshold function'. As with a perceptron, a single neuron is not able to learn classification problems that are not linearly separable[92].

Please note that we can easily combine the weights $W$ and biases $b$ of an artificial neural network into one parameter set $\hat{W}$ by expanding the input vector $x$ by a dimension that has a constant value of 1. This constant feature allows us to move the bias $b$ into the weights $W$ as one additional weight coefficient. Using $\hat{x}$ as the expanded feature vector, Equation 2.3.1 reduces to

$$f(\hat{x}, \hat{W}) = \sigma\left(\sum_i \hat{x}_i \hat{W}_i\right) \tag{2.3.2}$$

with $\hat{x}$ and $\hat{W}$ having one dimension more than $x$ and the corresponding $W$. From now on we will use the formulation with the expanded feature vector without explicitly denoting this fact.

Multi-Layer Perceptron

We will now discuss how to construct a MLP from the above definition of a single neuron. A MLP is an artificial neural network that comprises a set of neurons organized in 'layers'.
Each layer is a set of neurons that, in the case of non-recurrent layers, use the activations of the previous layer as inputs and then compute their own activations, which are either the output of the MLP or fed into the subsequent layer. The neurons within one layer are organized in parallel. The first layer takes the observed features as input, provided by the user. The activations of the last layer are the value of the decision function.

Figure 2.3.4: A multi-layer perceptron featuring a single hidden layer.

Figure 2.3.4 shows a simple MLP with one hidden layer. Nodes $x_0$ through $x_n$ represent the input features as provided by the user. Nodes $h_0$ through $h_n$ and $y_0$ through $y_n$ are artificial neurons. Each one is of the type detailed in Figure 2.3.1.

It is important in a MLP to choose a non-linear activation function $\sigma$. Without this non-linearity, the decision function of the MLP reduces to a single linear layer. This effect can be shown by substituting the equations for a single neuron into each other and applying the distributive property to combine the weight and bias coefficients of the linear combinations.

Organizing the artificial neural network as a MLP in layers, with each neuron within a layer modeling the same function apart from its weights and biases, allows us to simplify Equation 2.3.1 using matrix linear algebra. This results in the function

$$f(x, W) = \sigma(Wx) \tag{2.3.3}$$

for a layer with multiple neurons. The above function $f : \mathbb{R}^n \to \mathbb{R}^m$ describes the activations of a layer of $m$ neurons given $n$ input signals, either observed features or activations from the previous layer. In this case the vector $x$ is of dimension $n$ and the matrix $W$ of dimension $m \times n$. Activation function $\sigma$ is a coefficient-wise non-linear function as described above for a single neuron.
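The following minimal Python sketch illustrates Equation 2.3.3 and the remark on non-linearity: two stacked layers without an activation function give the same output as one linear layer with the product of both weight matrices. The helper names (`matvec`, `matmul`, `layer`) and the weight values are assumptions chosen for illustration. Biases are omitted, following the expanded feature vector convention.

```python
import math

def matvec(W, x):
    """Matrix-vector product W x for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Matrix product A B for list-of-rows matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def sigmoid(v):
    """Coefficient-wise standard logistic sigmoid."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def identity(v):
    """'Activation' that leaves the linear combination unchanged."""
    return v

def layer(x, W, activation=sigmoid):
    """Equation 2.3.3: f(x, W) = sigma(W x) for an m x n weight matrix W."""
    return activation(matvec(W, x))

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 2 outputs.
W1 = [[0.5, -1.0], [1.5, 2.0], [-0.5, 0.25]]
W2 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
x = [1.0, 2.0]

two_linear = layer(layer(x, W1, identity), W2, identity)  # two layers, no non-linearity
one_linear = layer(x, matmul(W2, W1), identity)           # one layer with W2 W1
nonlinear = layer(layer(x, W1), W2)                       # MLP with sigmoid activations
```

Here `two_linear` and `one_linear` agree, whereas `nonlinear` cannot in general be reproduced by a single weight matrix.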
The formulation of Equation 2.3.3 using matrix linear algebra has the benefit of allowing the use of computationally efficient matrix multiplication algorithms and general purpose graphics processing unit (GPGPU)[16, 37] computation.

Figure 2.3.5: Logical XOR is a classic example of a classification problem that is not linearly separable, as it requires two separating planes in linear space.

MLPs make it possible to solve classification problems that are not linearly separable. This can be exemplified with the logical XOR problem. The logical XOR is a boolean function that is true if its inputs are different. Figure 2.3.5 shows this for two boolean variables. It also shows that two decision boundaries (dashed lines) would be necessary to directly solve the problem, which is not possible with a linear classifier. We will now introduce one hidden layer with two neurons into the MLP.

Figure 2.3.6: Separating one 'true' case of the logical XOR in its own hidden neuron.

Figure 2.3.6 shows the decision boundary of the first neuron of the hidden layer. Figure 2.3.7 shows the decision boundary of the second neuron of the hidden layer. Now we have separated the two cases where the logical XOR is true from the remaining cases. Figure 2.3.8 shows the decision boundary of the output neuron. Whenever at least one of the two hidden neurons is activated (1), the logical XOR will be true. Please note that both neurons with decision boundaries from Figures 2.3.6 and 2.3.7 cannot be active at the same time. We have now seen that introducing one hidden layer allows a MLP to solve classification problems that are not linearly separable.

Softmax and Cross-Entropy for Classification Tasks

This work uses DNNs to transcribe multi-line texts from images. For this we will need to produce probabilities for individual glyphs at specific spatial positions within the image.

Figure 2.3.7: Separating the other 'true' case of the logical XOR in a second hidden neuron.
Figure 2.3.8: Logical XOR is true if at least one of the two hidden neurons is activated. Both hidden neurons cannot be active at the same time.

We will now discuss a suitable non-linear activation function and loss function for this task. The equation

$$\sigma(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \tag{2.3.4}$$

describes the softmax[13][7, p. 198][41, ch. 6.2.2.3] activation function, which produces activated values between zero and one that sum up to exactly one. This allows the activated values to be interpreted as probabilities. Since the vector $x$ can be of arbitrary dimensionality, softmax makes it possible to model multi-class problems where classes are mutually exclusive. This is not the case for e.g. the standard logistic sigmoid, which also lies in the range zero to one, but does not normalize the sum of activations. This makes softmax suited for multi-class problems where classes are mutually exclusive and the standard logistic sigmoid suitable for non-exclusive classes.

The softmax function produces class probabilities for a multi-class problem. Now a suitable loss function is necessary for optimizing the parameters of a MLP for solving a classification problem. Let us use $y$ for the class probability distribution estimated by the MLP and $z$ for the true class distribution. Please note that we are discussing the case of discrete classes, since this work deals with discrete classes in the form of glyphs from an alphabet. The cross-entropy[7, ch. 4.3.2][41, ch. 3.13] loss function is described by

$$L(y, z) = -\mathbb{E}_z[\log(y)] = -\sum_i \left[z_i \log(y_i)\right] \tag{2.3.5}$$

for discrete probability distributions $y$ and $z$ over the same event set. Minimizing the cross-entropy loss minimizes the uncertainty regarding the true distribution $z$ given the estimated distribution $y$. It can also be interpreted as minimizing the coding length when using a coding scheme based on the estimated distribution $y$ for coding the true data from distribution $z$.
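Equations 2.3.4 and 2.3.5 can be sketched in a few lines of Python. Two details are assumptions added for numerical robustness and are not part of the equations: the maximum is subtracted before exponentiation, which does not change the result, and probabilities are clamped at a small `eps` before taking the logarithm. The example values are hypothetical.

```python
import math

def softmax(x):
    """Equation 2.3.4; subtracting max(x) avoids overflow and leaves the result unchanged."""
    shift = max(x)
    exps = [math.exp(v - shift) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, z, eps=1e-12):
    """Equation 2.3.5 for discrete distributions y (estimated) and z (true)."""
    # Assumption: clamp y at eps so that log(0) cannot occur.
    return -sum(zi * math.log(max(yi, eps)) for yi, zi in zip(y, z))

# Hypothetical example: three mutually exclusive classes, the first one is correct.
activations = [2.0, 1.0, 0.1]
y = softmax(activations)
z = [1.0, 0.0, 0.0]
loss = cross_entropy(y, z)
```

With a one-hot true distribution `z`, the sum collapses to the negative log-probability that the network assigns to the correct class.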
The cross-entropy loss is thus a suitable target function for a multi-class problem: we can only estimate the probability distribution for unknown samples, but we can minimize the uncertainty beforehand on a known training data set.

Optimization by Gradient Descent

Up until now we have discussed how a MLP is structured and how it works in general. We have also described cross-entropy as a target or loss function, that is, a function that describes the error that the MLP produced in its estimation. We can now use this loss function to derive how to optimize the parameters of the MLP, that is, to find a set of weights $W^\star \in \mathbb{R}^n$, with $n$ being the number of parameters in the MLP, that minimizes the value of the loss function as

$$W^\star = \operatorname*{argmin}_{W} L(S, W) \tag{2.3.6}$$

with $S$ being the training data set, and thus minimizes the error on this data. We will do this using gradient descent[11, 65, 107].

Applying gradient descent means trying to minimize the value of the loss function over a fixed, finite set of examples. As discussed in Section 2.1, this data set is called the training set and is the largest part, commonly 80-90 percent, of the overall available examples regarding the problem at hand. The remaining examples will later be used to validate and evaluate a possible solution. Having a data set of examples for training with both the input (observed features) and output (true target values) is also called supervised learning. This means that the ML model is fully under supervision by the optimization algorithm, with errors being detected immediately and corrected if possible. Other popular training paradigms for DNNs are unsupervised learning and reinforcement learning[142]. This thesis deals with supervised learning problems and we will concentrate on this training paradigm.

Gradient descent itself is an iterative algorithm for supervised training of ML models.
It requires the loss function and the ML model to be differentiable. It assumes the set of training data to be a representative sample of the problem at hand and the loss function to be a function over weight space, that is, over the weight parameters of the DNN. Gradient descent uses the gradient

$$\frac{\partial L}{\partial W} \tag{2.3.7}$$

of the loss function $L$ in the weight space defined by $W$. Since the gradient in weight space gives the direction of steepest increase of the function, the negative gradient gives the direction of steepest descent. This is why gradient descent follows the negative gradient of the loss function to minimize it. The new parameter set $W_{t+1}$

$$W_{t+1} = W_t - \mu\frac{\partial L}{\partial W_t} \tag{2.3.8}$$

is the old parameter set with the negative gradient of the loss function, scaled by $\mu$, added. Hyper-parameter $\mu$ is called the learning rate and is a coefficient added to control the speed of gradient descent. This is necessary since we may otherwise not reach a stationary point. This process of calculating the first-order derivative of the loss function in weight space and following the negative gradient is repeated until convergence. If gradient descent converges to a stationary point, it is guaranteed (assuming the values of the loss function are bounded from below) to be a local minimum or saddle point of the loss function.

Algorithm 2.3.1 Gradient Descent
    Initialize weight set $W$.
    Choose a learning rate $\mu$.
    repeat
        Evaluate the loss function $L$ with current weights $W$.
        Calculate gradient $\frac{\partial L}{\partial W}$ of the loss function.
        Update weights $W = W - \mu\frac{\partial L}{\partial W}$.
    until convergence to a stable point.
    Return final weight set $W$ as $W^\star$.

Algorithm 2.3.1 describes gradient descent in its most basic case. Figure 2.3.9 shows an example of gradient descent for finding the minimum of the function $f(w) = w^2$ starting at $w = -10$ with a learning rate $\mu = 0.1$. Please note that this is a well-behaved case since the function $f(w) = w^2$ is convex and thus only has one minimum.
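Algorithm 2.3.1 applied to $f(w) = w^2$ can be sketched in Python as follows. The function name `gradient_descent` is an assumption, and a fixed number of steps replaces the convergence test of the algorithm to keep the sketch short.

```python
def gradient_descent(grad, w0, mu, steps):
    """Basic gradient descent; returns all visited parameter values."""
    w = w0
    path = [w]
    for _ in range(steps):
        w = w - mu * grad(w)  # update rule of Equation 2.3.8
        path.append(w)
    return path

def grad_f(w):
    """Derivative of f(w) = w**2."""
    return 2.0 * w

# Start at w = -10 with learning rate 0.1.
path = gradient_descent(grad_f, w0=-10.0, mu=0.1, steps=50)
```

For this particular function each update multiplies $w$ by $(1 - 2\mu)$. With $\mu = 0.1$ the parameter therefore shrinks by a factor of 0.8 per step and approaches the minimum at $w = 0$ from one side.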
For a convex function such as $f(w) = w^2$, gradient descent will reliably converge to a stable point near or at the global minimum, given that the learning rate $\mu$ is not set too high. Figure 2.3.10 exemplifies the case of a learning rate of $\mu = 0.9$, which is too high: the parameter $w$ oscillates and only slowly converges to the minimum of the function. If we choose the learning rate even higher, then the parameter $w$ in gradient descent will diverge from the minimum. In the case of optimizing the weights $W$ of a DNN regarding a given loss function, the loss value would actually increase over time.

Figure 2.3.9: Gradient descent in function $f(w) = w^2$ (blue) with $\mu = 0.1$.

Figure 2.3.10: Gradient descent in function $f(w) = w^2$ (blue) with $\mu = 0.9$.

The above examples show the simple case of gradient descent in a one-dimensional space over parameter $w$ while minimizing the convex function $f(w) = w^2$. We cannot expect the loss function $L$ to be convex in the case of training a DNN[42]. As such, gradient descent may converge to a stable point that is not a global minimum, which would be a sub-optimal solution. We need to keep this in mind when optimizing a DNN and validate each possible solution. However, it seems that from a practical point of view, most local minima are good-enough solutions in the case of DNNs[18, 25, 116].

So far we have discussed the theory of gradient descent with some examples, but for a practical implementation we need to decide on a schedule for gradient descent. 'Schedule' in the context of gradient descent describes the order in which the training set is processed and at which intervals an update of the parameters is executed. We first need to define the terms epoch and iteration. An epoch in supervised training of DNNs is the update of the network parameters with each example from the training set exactly once.
An iteration is one individual update step of the network parameters, whether this is with one example from the training set or with multiple examples at once. The number of examples per iteration is the difference between online, mini-batch and batch training of DNNs.

Let $S$ be the training set in supervised training with pairs $(x, z) \in S$, $x$ being the network input (extracted features or e.g. an image) and $z$ being the correct output. $y$ will be the prediction by the DNN, and $n$ runs over the $N$ components of the output. Let us also use the Mean Squared Error as our loss function in the following examples. Online training

$$L_o(S, i) = \frac{1}{N}\sum_{n}^{N}(y_{i,n} - z_{i,n})^2 \tag{2.3.9}$$

with $y_i = f(x_i, W)$ and $(x_i, z_i) \in S$ is then the loss for exactly one training set element $(x_i, z_i)$ per iteration $i$. Batch training

$$L_b(S) = \frac{1}{|S|}\sum_{i}^{|S|}\frac{1}{N}\sum_{n}^{N}(y_{i,n} - z_{i,n})^2 \tag{2.3.10}$$

is the other extreme. In batch training the full training set is used per iteration, whereas online training uses exactly one example per iteration. A middle way between online and batch training is mini-batch training, where only a part of the training set, but more than one example, is used per iteration in gradient descent. This mini-batch training is formulated as the loss

$$L_m(S, b, i) = \frac{1}{b}\sum_{j = i \times b}^{(i+1) \times b}\frac{1}{N}\sum_{n}^{N}(y_{j,n} - z_{j,n})^2 \tag{2.3.11}$$

where $1 < b < |S|$ is the mini-batch size and $i$ indicates the current iteration. When applying training using gradient descent to supervised learning problems, a choice between these three schedules has to be made. This is mainly a trade-off between computation time and convergence rate of the optimization process[6, 155]. Using online training or mini-batch training with a small mini-batch size allows gradient descent to better follow the curvature of the loss function during optimization and also yields more iterations per epoch. In combination this leads to an overall faster convergence rate of the loss value during optimization.
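The three schedules differ only in how many examples enter each loss evaluation, which the following minimal Python sketch illustrates. The toy data set `S`, the fixed model `predict` and the helper names are hypothetical. As a stated assumption, each slice contains exactly $b$ consecutive examples, i.e. the indices $i \times b$ up to but excluding $(i+1) \times b$.

```python
def mse(y, z):
    """Mean squared error over the N components of one prediction (Eq. 2.3.9)."""
    return sum((yn - zn) ** 2 for yn, zn in zip(y, z)) / len(y)

def batch_loss(predict, batch):
    """Average loss over all (x, z) pairs of one iteration (Eqs. 2.3.10 and 2.3.11)."""
    return sum(mse(predict(x), z) for x, z in batch) / len(batch)

def iterate_batches(S, b):
    """Yield consecutive slices of S with b examples each; one slice per iteration."""
    for start in range(0, len(S), b):
        yield S[start:start + b]

def predict(x):
    """Hypothetical fixed model y = 1.5 x; no training takes place in this sketch."""
    return [1.5 * x[0]]

# Hypothetical training set: 8 examples of the target function z = 2 x.
S = [([float(i)], [2.0 * i]) for i in range(8)]

online = [batch_loss(predict, bt) for bt in iterate_batches(S, 1)]     # 8 iterations per epoch
mini = [batch_loss(predict, bt) for bt in iterate_batches(S, 4)]       # 2 iterations per epoch
full = [batch_loss(predict, bt) for bt in iterate_batches(S, len(S))]  # 1 iteration per epoch
```

In actual training the parameters would be updated after every loss evaluation, so the number of list entries corresponds to the number of parameter updates per epoch.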
On the other hand, batch training or mini-batch training with a larger mini-batch size allows the efficient utilization of GPGPU hardware by enabling multiple independent computation threads. In general, mini-batch training is currently the schedule of choice for practical applications. Mini-batch sizes typically range between 8 and 64, depending on how much training data and computer memory is available.

Backpropagation

So far we have discussed how MLPs are structured and how their parameters can be optimized towards a given task using gradient descent. However, gradient descent requires the partial derivatives $\frac{\partial L}{\partial W}$ of the loss function $L$ with respect to the MLP parameters $W$. Backpropagation[112, 113] is an algorithm for calculating the partial derivatives $\frac{\partial L}{\partial W}$ for MLPs and DNNs.

MLPs are organized in layers of neurons, each layer being a matrix multiplication of the input features and the weights of the layer. This linear combination of each layer is followed by a non-linear activation function. We can see that the linear combination is a function $f(x, W)$ of its input $x$ and the weight set $W$ of the layer. The non-linear activation function $\sigma(x)$ is a function that has the result of the linear combination as its input $x$. In the same way, the output of the non-linear activation function of one layer in a MLP is the input of the next layer in the MLP. This means that a MLP is a series of composite (or nested) functions. The equation

$$m(x, W) = \sigma(l_2(\sigma(l_1(x, W_1)), W_2)) \tag{2.3.12}$$

exemplifies this for a MLP $m$ with two layers consisting of linear combinations $l_1, l_2$ and their non-linear activation functions $\sigma$.
Backpropagation takes advantage of the fact that the derivative of a composite function is the product of the derivatives of the outer and inner functions:

$$\frac{\partial f(g(x))}{\partial x} = \frac{\partial f}{\partial g}\frac{\partial g}{\partial x} \tag{2.3.13}$$

This is known as the 'chain rule' and means we can now specify the partial derivatives for the above MLP:

$$\frac{\partial m}{\partial W_2} = \frac{\partial m}{\partial l_2}\frac{\partial l_2}{\partial W_2} \tag{2.3.14}$$

$$\frac{\partial m}{\partial W_1} = \frac{\partial m}{\partial l_2}\frac{\partial l_2}{\partial l_1}\frac{\partial l_1}{\partial W_1} \tag{2.3.15}$$

The chain rule, and by extension backpropagation, also applies to the loss function for optimization itself. It allows calculation of any necessary derivative for gradient descent, e.g. for $W_1$ of the above MLP $m(x, W)$:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial m}\frac{\partial m}{\partial W_1} \tag{2.3.16}$$

Backpropagation is the application of the chain rule for calculating partial derivatives of the MLP, beginning from the loss function and 'working towards' the input of the MLP. The first step is to calculate the derivative of the loss function $L$, then to apply the chain rule to obtain the derivative of the output layer, then the hidden layer(s) and so on. The derivatives of the individual layers of the MLP are often simple, as for the example above:

$$l(x, W) = \sum_i x_i W_i \tag{2.3.17}$$

means

$$\frac{\partial l}{\partial x_i} = W_i \tag{2.3.18}$$

and

$$\frac{\partial l}{\partial W_i} = x_i \tag{2.3.19}$$

both of which are known if $W_i$ and $x_i$ are stored, which is the case for $W_i$ anyway since it is a MLP parameter that is optimized by gradient descent. The derivative of the standard logistic sigmoid

$$\sigma(x) = \frac{1}{1 + \exp(-x)} \tag{2.3.20}$$

is

$$\frac{\partial \sigma(x)}{\partial x} = \sigma(x)(1 - \sigma(x)) \tag{2.3.21}$$

which means it is also easy to compute when storing $\sigma(x)$ from the forward pass through the MLP.

The backpropagation algorithm can be summarized as follows. It is used to calculate the partial derivatives $\frac{\partial L}{\partial W}$, which are then used to optimize $W$ towards minimizing the loss function $L$ using gradient descent. Please note that the loss function $L$ most likely has more parameters than just $x_N$, depending on the ML task at hand.

Algorithm 2.3.2 Backpropagation
    Define current weight set $W$.
    Define current input $x_0$.
    repeat ▷ Forward pass
        Compute linear combination $l_i = W_i x_{i-1}$.
        Apply non-linear activation function $x_i = \sigma(l_i)$.
        Store both $l_i$ and $x_i$.
    until output layer $i = N$.
    Compute loss function $L(x_N)$.
    Compute derivative $\frac{\partial L}{\partial x_N}$.
    repeat ▷ Backward pass
        Compute derivative $\frac{\partial L}{\partial l_i} = \frac{\partial L}{\partial x_i}\frac{\partial x_i}{\partial l_i}$.
        Compute derivative $\frac{\partial L}{\partial W_i} = \frac{\partial L}{\partial l_i}\frac{\partial l_i}{\partial W_i}$.
        Compute derivative $\frac{\partial L}{\partial x_{i-1}} = \frac{\partial L}{\partial l_i}\frac{\partial l_i}{\partial x_{i-1}}$.
        Store $\frac{\partial L}{\partial W_i}$.
    until input layer $i = 0$.
    Return set of derivatives $\frac{\partial L}{\partial W}$.

Algorithm 2.3.2 describes backpropagation for a standard MLP. Backpropagation can be adapted for other DNN topologies as well, as long as the individual modules of the network are differentiable. Modules refer to building blocks of the network, e.g. a non-linear layer in a MLP, that feed forward and process the data. These modules need to be differentiable with respect to their parameters that should be optimized, if any, and with respect to their input, in order to backpropagate the loss derivative to the previous module(s). The forward and backward passes in backpropagation need to be changed accordingly to match the modules in use. We will later discuss convolutional neural networks and recurrent neural networks, which both use different module types than fully connected non-linear layers.

Convolutional Neural Networks

Convolutional neural networks (CNNs)[36, 78, 79] are a variant of MLPs that are especially suited to processing spatial data. As with MLPs, CNNs are organized in layers, with the first layer processing the user-defined input and each subsequent layer receiving the forwarded activations from the previous layer. In contrast to MLPs, CNNs often consist of several types of layers: convolutional layers, which give them their name, and often pooling layers to reduce the spatial resolution.

Convolutional layers in a CNN, as with layers of artificial neurons in a MLP, consist of a linear combination with learnable weights followed by a non-linear activation function.
In con- trast to a MLP are convolutional layers in a CNN not ‘fully connected’, meaning no artificial neuron receives the full data from the previous layer as input. Input data is organized in ‘feature maps’ with one or multiple spatial dimensions (e.g. two in the case of image data) and one or more channels or feature maps. Each feature map contains either the one in- put feature or the activations from one artificial neuron from the previous layer, organized in a spatial map. Convolutional layers now consist of neurons that receive only a part of the feature map, from a sliding rectangular window or kernel, as input. Each window may be processed by multiple neurons with different weight sets, resulting in multiple output feature maps, but neurons in different windows share their weights. Referring to Equation 2.3.2, weights W are shared from window to window, but inputs x are individual per win- dow. This behavior is similar to kernels in Computer Vision methods, but the coefficients of the kernel are learned by gradient descent. The equation f(x,W, k)i = σ(Wx[i−⌊ k ⌋,i+⌊ k ⌋]) (2.3.22) 2 2 describes a convolutional layer with input x, weight set W and a window/kernel size k. It is a one-dimensional convolution in this example and as such the interval for slicing x is only along one dimension. Convolutional layers can easily be extended to multiple dimensions by extending the window to those dimensions and slicing x accordingly. In the equation above, weight matrix W would be m × (kn) in size, with m neurons in the convolutional layer and n input channels. same W, different x σ(Wx) σ(Wx) σ(Wx) σ(Wx) channels channels Figure 2.3.11: Convolutional layers apply learnable spatial kernels to the feature maps and used to extract general features from the data. Colors indicate different feature maps or channels. Figure 2.3.11 shows a basic example of a convolutional layer along one spatial di- mension. 
The input feature map is in this case 6 in size along the spatial dimension with 2 channels. The convolutional layer has a kernel size of 3 with 3 artificial neurons, re- sulting in an output feature map with a spatial dimension of 4 with 3 channels. We have reduced the spatial size in this case by 2 because only spatial positions were processed were the window lies fully within the input feature map. This behavior can be prohibited adding padding, e.g. constant zeros or a mirror of the feature map, of size ⌊k2⌋ to the input feature map. 46 spatial dimension spatial dimension CNNs have a similar architecture as MLPs, but with shared weights in the convolu- tional layers. This greatly reduces the number of learnable parameters, as compared to fully connected layers. This not only makes a CNN computationally more favorable by reducing both memory usage and CPU load but also reduces overfitting of the learned solution. This is because on average there are more data points per learnable parameter in a CNN than in a MLP, resulting in fewer minima of the loss function in weight space. Also the same convolutional kernels are applied to different spatial positions within the input feature map, which prevents neurons from becoming receptive to specific features that only occur in a single or very few spatial positions. Instead, neuron weight sets which are receptive to general features in many different spatial positions are favored during gradient descent. CNNs typically consist of convolutional layers, see our discussion before, and pool- ing operations. Pooling operations in a CNN are ‘layers’ or functions without learnable parameters that are used to reduce the spatial resolution of the feature map. Pooling in a CNN is done by moving a non-overlapping window over the feature map and reducing the contained feature map to only one position, e.g. ‘pixel’ in image data. For example in pooling with a windows size of two, the spatial resolution will be halved. 
The operation to reduce the spatial resolution is applied channel-wise, preserving the number of channels, and is often a simple combination of the input data, most commonly the maximal value or the average value. It is necessary to keep this operation differentiable in order to apply backpropagation to the CNN. Backpropagation for maximum pooling is done by propagating the loss derivative only towards the coefficient that was forwarded during pooling, ignoring other data. Equation

$$f(x, w)_{i,c} = \max\left(x_{[iw,\,(i+1)w],\,c}\right) \tag{2.3.23}$$

describes a maximum pooling operation with window size $w$.

Figure 2.3.12: Pooling operations, here in the example of maximum pooling, are a way to reduce the spatial resolution and introduce translation invariance to the deep neural network. Colors indicate different feature maps or channels.

Figure 2.3.12 shows an example of a one-dimensional feature map with two channels to which a maximum pooling operation with a window size of two is applied.

Pooling layers are used in CNNs both to reduce memory consumption and computation time and to introduce translation invariance to the CNN. Convolutional layers are receptive to certain features within the feature map, e.g. edges or Gabor filters[161]. Reducing the spatial resolution of the output feature map from convolutions by pooling keeps the CNN sensitive to the feature occurring somewhere within the pooling window, but removes the requirement that it is detected in all spatial positions in order to be propagated as a strong activation to the next layer.

Figure 2.3.13: Schematic showing the relation between the spatial resolution and number of features of the feature maps in common CNN architectures. The input has a high spatial resolution and few channels; the output has a low spatial resolution and many channels.
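To make the shapes in the discussion of convolution and pooling concrete, the following is a minimal sketch in plain Python, not an implementation from this thesis. The function names `conv1d_valid` and `max_pool1d`, the example weights and the nested-list data layout (a list of spatial positions, each holding one value per channel) are assumptions made for illustration. The convolution processes only windows that lie fully within the input (no padding), and the pooling follows the non-overlapping, channel-wise maximum of Equation 2.3.23.

```python
import math


def sigmoid(v):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))


def conv1d_valid(x, W, k):
    """One-dimensional convolution without padding.

    x: list of spatial positions, each a list of n channel values.
    W: list of m weight rows, each of length k * n (shared across windows).
    Returns len(x) - k + 1 positions with m channels each.
    """
    out = []
    for i in range(len(x) - k + 1):
        # Flatten the window of k positions and n channels into one vector.
        window = [value for position in x[i:i + k] for value in position]
        out.append([sigmoid(sum(w * v for w, v in zip(row, window))) for row in W])
    return out


def max_pool1d(x, w):
    """Non-overlapping maximum pooling with window size w, applied per channel."""
    n_channels = len(x[0])
    out = []
    for i in range(0, len(x) - w + 1, w):
        out.append([max(pos[c] for pos in x[i:i + w]) for c in range(n_channels)])
    return out


# Shapes as in Figure 2.3.11: 6 positions, 2 channels, kernel size 3, 3 neurons.
x = [[0.1 * (i + c) for c in range(2)] for i in range(6)]
W = [[0.5] * 6, [-0.5] * 6, [0.0] * 6]  # illustrative weights, not learned
features = conv1d_valid(x, W, k=3)
pooled = max_pool1d(features, w=2)
print(len(features), len(features[0]))  # spatial size and channels after convolution
print(len(pooled), len(pooled[0]))      # spatial size and channels after pooling
```

With these inputs the convolution reduces the spatial size from 6 to 4 while producing 3 channels, and pooling with a window size of two halves the spatial size again while preserving the number of channels.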
Now the question arises of how to choose suitable hyper-parameters in a CNN, namely the number of neurons per convolutional layer and the window sizes for pooling. A common approach is to start out with a high spatial resolution and a low number of features, where individual features add little information and only the accumulation over the whole spatial extent carries meaningful information about the input data. Take an RGB image as an example, where each individual pixel carries only a tiny amount of information and the three channels by themselves are of low abstraction, but viewing the whole image (as a human) easily reveals the content in terms of objects, their classes, and so on. This input feature map of high spatial resolution and low-abstraction features is then gradually converted to a feature map with low spatial resolution but high abstraction per feature by reducing the spatial resolution via pooling while at the same time increasing the number of neurons per convolutional layer. In the best case, this relation between spatial resolution and number of channels in the feature map is chosen in such a way that the relevant information can always be contained in it, but random noise or irrelevant information is dropped. Figure 2.3.13 visualizes this relation between the spatial resolution and the number of channels in the feature maps of CNNs.

A term commonly used in the context of CNNs is the ‘receptive field’. It describes the fact that in a CNN, data is processed by sliding fixed-size windows or kernels over the feature map and applying convolutions to them, which means that a specific instance of a convolutional neuron only receives a finite, fixed-size part of the feature map as input. Convolutions and pooling operations in a CNN are stacked in multiple layers, accumulating the total window size.
For example, two convolutional layers with a kernel size of three will result in an accumulated kernel size of five for the second convolutional layer. This accumulated window in a stack of convolutions and pooling layers is called the receptive field. This is a useful concept when deciding on window or kernel sizes for a CNN since it is necessary to feed the relevant information into the convolutions for them to correctly solve the task at hand, e.g. a CNN for object classification from image input should have a receptive field large enough that the output neurons actually receive the full object as input.

Convolutional neural networks include many state-of-the-art methods[73, 108, 143, 144, 153] for computer vision problems, such as the ImageNet[27] object classification problem. Convolutional neural networks were even applied, in combination with reinforcement learning and Monte Carlo tree search, to learn to play the game of Go at a human-level performance[132, 133].

Recurrent Neural Networks

Recurrent neural networks (RNNs)[55, 112] are another common topology of artificial neural networks, besides CNNs. An RNN is organized in layers as in an MLP, but recurrent layers receive not only the activations from the previous layer as inputs, but also their own activations from the previous time step or spatial position. This means that an RNN is used to process a time series or feature map position by position, with the activations in each step being dependent on the previous ones. On the topic of terminology, ‘neurons’ in recurrent topologies are commonly also referred to as ‘cells’.

The basic RNN layer is, as usual, a weighted linear combination followed by a non-linear activation function. It possesses two weight matrices, one for the feed-forward activation from the previous layer and one for its own activations from the previous time step.
Equation

$$f(x_i, r_i, W_x, W_r) = \sigma(W_x x_i + W_r r_i) \tag{2.3.24}$$

with

$$r_i = \begin{cases} f(x_{i-1}, r_{i-1}, W_x, W_r), & \text{if } i > 0 \\ 0, & \text{if } i = 0 \end{cases} \tag{2.3.25}$$

describes a simple RNN with $W_x$ being of size $m \times n$ and $W_r$ of size $m \times m$ for a recurrent layer with $m$ cells and $n$ inputs from the previous layer. The index variable $i$ indicates positions within a one-dimensional time series or feature map. Please note that the weights $W_x$ and $W_r$ are, similar to convolutional layers, shared between different time steps. This is important when optimizing the parameters.

Figure 2.3.14: Recurrent neural network processing a feature map with one spatial dimension. The RNN activations of the previous step are fed into the same RNN in the next step. The same weights $W$ are applied to different inputs $x$ and $r$ at each step.

In contrast to CNNs, RNNs do not have receptive fields of fixed size. Instead they can be applied to variable-size sequences, with the receptive field for the last RNN step being the whole sequence. The dependency on the previous steps and the ability to handle variable-size receptive fields make RNNs suitable for ML tasks in natural language processing (NLP)[63, 141, 157], as is the case in this thesis.

Figure 2.3.15: Schematic representation of a recurrent neural network. Backpropagation is not possible since the recurrence is infinite.

Figure 2.3.16: Recurrent neural network from Figure 2.3.15 unrolled for a finite sequence of three steps. Recurrence is eliminated and backpropagation is applicable.

RNNs cannot directly be optimized using gradient descent and backpropagation, since the recurrence in the formulation prevents backpropagation by repeated application of the chain rule to the composite function at hand. However, RNNs are necessarily trained on a data set with sequences of finite length.
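Applied to a finite sequence, the recurrence of Equations 2.3.24 and 2.3.25 reduces to an ordinary loop over the positions. The following plain-Python sketch illustrates this under stated assumptions: the names `matvec` and `rnn_forward`, the nested-list representation of the weight matrices and the example values are hypothetical and not taken from this thesis.

```python
import math


def sigmoid(v):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_forward(xs, Wx, Wr):
    """Apply a simple recurrent layer to a finite sequence xs.

    Wx has size m x n and Wr has size m x m. The same weights are used
    at every step; only the inputs x and r differ between steps.
    """
    m = len(Wx)
    r = [0.0] * m  # r_0 = 0, as in Equation 2.3.25
    activations = []
    for x in xs:
        a = [sigmoid(p + q) for p, q in zip(matvec(Wx, x), matvec(Wr, r))]
        activations.append(a)
        r = a  # the activation becomes the recurrent input of the next step
    return activations


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three steps, n = 2 inputs
Wx = [[0.5, -0.5], [0.25, 0.75]]           # m = 2 cells
Wr = [[0.1, 0.0], [0.0, 0.1]]
acts = rnn_forward(xs, Wx, Wr)
print(len(acts), len(acts[0]))
```

The loop body is executed once per sequence element, so the depth of the composite function equals the sequence length.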
This leads to the observation that the actual depth of the composite formulation is also finite. Backpropagation through time (BPTT)[152] is based on the fact that even though the RNN formulation is infinite in theory (Figure 2.3.15), it can be rewritten as a feed-forward network without recurrence when applied to finite sequences (Figure 2.3.16). This is called ‘unrolling an RNN’, and standard backpropagation and gradient descent are applied to the unrolled RNN in order to optimize its parameters.

Commonly observed problems when training RNNs are the so-called ‘vanishing gradient’ and ‘exploding gradient’ problems[53, 54, 55, 99]. They arise because backpropagation is the repeated multiplication of partial derivatives, based on the chain rule for composite functions. If the partial derivative of one module is smaller than one, the overall gradient will be reduced in magnitude. Correspondingly, a partial derivative greater than one will increase the magnitude of the gradient. This can pose problems during training since a decreasing magnitude of the gradient (vanishing gradient) may end up with a gradient close to zero and thus only very small changes to the weights during gradient descent, inhibiting effective training. On the other hand, an exploding gradient may exaggerate individual training examples and lead to oscillations during gradient descent, resulting in seemingly random results.

In MLPs and CNNs, the number of multiplications of the derivatives is linear in the number of layers of the network. This means that the factor by which the magnitude of the gradient is increased or decreased is relatively constant and can thus easily be countered by e.g. choosing different learning rates for the individual layers. This is not applicable to RNNs since the sequence length may be variable and rather lengthy. Assigning different learning rates would mean modifying the learning rate for different steps within the sequence, resulting in an RNN that is artificially (in)sensitive to parts of the sequence. Mitigation of the vanishing or exploding gradient in RNNs is commonly done by introducing parameterized gates to the recurrence, keeping the gradient along the recurrent connections at a predictable magnitude.

Long short-term memory (LSTM)[39, 55] increases the complexity of the cells of RNNs by giving them a parameterized internal structure. This internal structure consists of gated connections that control information flow within the cell, as well as its inputs and outputs. Gated connections feed information forward, but are modulated by gates. A modulated connection

$$f(s, x, W_g) = s \odot \sigma(W_g x) \tag{2.3.26}$$

forwards signal $s$, with $\sigma(W_g x)$ being the gate in the form of a standard logistic sigmoid of the linear combination of gate weights $W_g$ and cell input vector $x$. Operator $\odot$ is the Hadamard product [59, ch. 5], which is a coefficient-wise multiplication of two matrices. A gate value $\sigma(W_g x)$ near one will forward the signal $s$ (nearly) unchanged, while a value near zero will drop the signal. Matrix $W_g$ is of size $n \times m$ for $n$ neuron activations in signal $s$ that are to be modulated and $m$ inputs into the cell. Please note that the $m$ inputs in the case of RNNs comprise both inputs from the previous layer and recurrent connections from the same layer at a previous time or space step.

Figure 2.3.17 shows the inner structure of an original LSTM cell with constant error carousel. As usual in an RNN, the cell input is a weighted linear combination over the activations from the previous layer as well as from the same layer in the previous time step. In contrast to a plain RNN, the cell state in LSTM is gated by an input and an output gate.
Equations for the LSTM topology in Figure 2.3.17 are

$$\hat{c}_t = \sigma_s(W_c x_t + R_c a_{t-1}) \tag{2.3.27}$$

$$i_t = \sigma_g(W_i x_t + R_i a_{t-1}) \tag{2.3.28}$$

$$c_t = \hat{c}_t \odot i_t + c_{t-1} \tag{2.3.29}$$

$$o_t = \sigma_g(W_o x_t + R_o a_{t-1}) \tag{2.3.30}$$

$$a_t = \sigma_s(c_t) \odot o_t \tag{2.3.31}$$

with $\sigma_s$ and $\sigma_g$ being the activation functions for the signal and the gates respectively. $\sigma_g$ is the standard logistic sigmoid in most cases. $x_t$ denotes the activations from the previous layer, $a_t$ the activations of the LSTM layer and $c_t$ the internal cell state of the LSTM layer. $W$ and $R$ denote learnable weights for the connections to the previous layer and for the recurrent connections, respectively.

The above LSTM formulation uses the constant error carousel, which means that the cell state from the previous time step is added to the current cell state as-is, that is without multiplication with weights. Through this constant error carousel mechanism, the gradient of the cell state will always be exactly one along the time dimension. This constant propagation of the cell state is one part of preventing the vanishing and exploding gradient effects during training. The second part is the gated connections along the input and output gates. These are learnable gates that adaptively prevent modification of the cell state (input gate) or modification of the LSTM activation (output gate), effectively reducing the LSTM block to only the constant error carousel if both gates are closed. Since both gates are optimized using backpropagation and gradient descent, they learn to only propagate signals into or out of the LSTM cell when necessary for solving the task at hand. A closed gate cuts off the gradient into and out of the LSTM cell, preventing the vanishing and exploding gradient effects. LSTM blocks are thus only expected to show the vanishing or exploding gradient effect if one of the input or output gates is open for a long period of time.

Figure 2.3.17: Original long short-term memory formulation using two gated connections and a constant error carousel.

Variants[48] of the original LSTM formulation include the forget gate[39], which controls the constant error carousel by preventing information flow along the time dimension. This effectively allows the LSTM cell to reset its memory. Variations also include ‘peephole connections’[40], which add weighted connections from the LSTM cell state $c$ (before activation $\sigma$) to the gates.

So far we have discussed RNNs and LSTMs that are unidirectional in one dimension and that in theory process infinite sequences. The one exception so far is BPTT, which requires finite sequences in order to ‘unroll’ the RNN structure and apply standard backpropagation to it. Many ML problems, however, deal with sequences that are finite by nature and where the whole sequence is observed from the beginning. This allows bi-directional RNNs[128] to be applied to the data. Bi-directional RNNs consist of two RNN layers that process the same input sequence independently of each other. One of the two RNN layers processes the sequence from start to end, the other one in reverse from end to start. Typical implementations of bi-directional RNNs feed the same input sequence into both RNN layers, then process the sequence in both directions and finally concatenate the two output sequences along the feature/neuron dimension. This results in a ‘black box’ bi-directional RNN layer that can be plugged into a DNN.

This work deals with image data and two-dimensional feature maps in DNNs. As such we need to discuss a method for applying LSTMs to two-dimensional feature maps. Multi-dimensional long short-term memory (MDLSTM)[44] is a type of layer that extends the bi-directional RNN to arbitrary numbers of dimensions in LSTMs. MDLSTM processes the input in $2^n$ different orderings with $n$ being the number of temporal and spatial dimensions in the feature map.
Bi-directional LSTM processes a one-dimensional sequence in two orderings, whereas MDLSTM for two-dimensional feature maps, e.g. images, processes the input in four different orderings. For this, the LSTM formulation is extended to multiple recurrent dimensions:

• The cell input, the input gate and the output gate receive recurrent input along all dimensions.
• The forget gate, if used in the topology, only has recurrent connections along the dimension it modulates. This means that there needs to be one forget gate for each dimension.

Figure 2.3.18 shows the four orders of processing for MDLSTM over a two-dimensional feature map. The LSTM activations from the four passes are concatenated at the end to obtain the overall result feature map. Please note that at each position in the feature map, MDLSTM actually has two predecessor states. This makes MDLSTM a powerful model since it allows the use of context information from the preceding rectangle (in 2D) within the feature map. All processing orders of MDLSTM in combination allow the full feature map to be used as input at any point of the following layer. This characteristic of MDLSTM enabled several state-of-the-art models in document analysis and natural language processing (NLP), but also adds the drawback of being hard to implement on GPGPU systems. This is because each MDLSTM state in the feature map is dependent on one neighboring state along each of the $n$ dimensions of the feature map. As a consequence, no two parts of one feature map can be computed independently of each other, which is the property that would allow full utilization of GPGPU hardware.

Figure 2.3.18: The four orderings of processing in MDLSTM over a two-dimensional feature map.

Separable multi-dimensional long short-term memory [156] is a simplification of the MDLSTM concept that allows the application of LSTM networks to multi-dimensional problems but reduces the recurrent connection in the RNN to only one dimension. Separable MDLSTM processes a feature map in $2n$ different orderings with $n$ again being the number of temporal and spatial dimensions in the feature map. It processes each dimension in a bi-directional fashion independently from the other dimensions. For example, in a 2D image, each row and each column is treated as an independent sequence and processed on its own with a bi-directional LSTM. Figure 2.3.19 shows the four orderings of processing in a separable MDLSTM for a two-dimensional feature map. As in MDLSTM, the activations from the individual LSTM runs are concatenated along the feature dimension to obtain the final output feature map.

Compared to MDLSTM, separable MDLSTM has less context information ‘to work with’ for predictions and is thus overall less powerful. This is because each prediction is based only on a one-dimensional slice of the feature map and not on a rectangle (in two-dimensional feature maps). On the other hand, separable MDLSTM reduces the number of processing orderings for the LSTM cells, thus reducing the overall computational runtime. In addition, the fact that each one-dimensional slice of the feature map is processed independently again allows for parallelization using GPGPU hardware.

An overview of many LSTM variants can be found in the published literature[39, 44, 48, 55, 64, 156].

Overfitting and Regularization

As we have discussed before, deep neural networks are trained by optimizing their parameters towards minimizing a specified loss function using backpropagation and gradient descent. This optimization is done over a finite data set of examples from the ML problem at hand, which means that there will be only a finite number of observations from the feature space.
This poses two problems: One, there will always be unknown examples that the neural network (or ML model in general) has not seen before, and we want it to generalize correctly from the finite training set to unseen examples. Two, the learning capacity of the ML model might be large enough to allow recognizing specific examples from the training set without actually learning the discriminating features that lead to a correct classification, regression or prediction in general. These two problems lead to cases where the error rate of the model predictions over the training set is lower than over an independent validation or evaluation data set. This is called overfitting or generalization error. Reducing overfitting, and thus the generalization error of a model, is called regularization and is commonly done by introducing additional constraints to the optimization procedure. The goal while training an ML model is to reach a satisfactory error rate on a data set with examples unseen during training in order to gain confidence that the model will make correct predictions on further unseen examples. We will now discuss three regularization techniques used while training deep neural networks that help reduce overfitting: early stopping, dropout and L2 regularization.

Figure 2.3.19: The four orderings of processing in separable MDLSTM over a two-dimensional feature map.

Early stopping[14, 103] is a rather simple regularization technique that can be implemented with a variety of ML models. For early stopping we need to split the available data set into three disjoint parts: the training set, the validation set and the evaluation set. The training set is commonly the largest one and is used for automatic optimization of the ML model using, in the case of neural networks, backpropagation and gradient descent.
The error rate or loss value of the current ML model is calculated on both the training set and the validation set at regular intervals. What is observed in the case of overfitting in deep learning is that at the beginning of training, both the training and validation sets see a reduction of the error rate. However, it is common that at some point the error rate on the training set will continue to decrease but start to increase on the validation set. This is the point where overfitting sets in. Early stopping is a simple mechanism where the training process is stopped after the error rate on the validation set has not decreased for a certain amount of time. In that case the ML model with the minimal achieved error rate on the validation set is used. This carries the risk that the ML model is now overfitting on the validation set, because we have chosen the parameters that produce the minimal error rate on the validation set. This is why there is a need for a third data set, the evaluation set, which is evaluated once after early stopping and which gives us the error rate that we can expect on further unseen data. In practice, the error rates on the validation and evaluation sets will often be similar to each other, but it is still necessary to keep this new source of overfitting in mind and evaluate the ML model accordingly.

Dropout [38, 102, 137] is a regularization technique specific to MLPs and derived models, such as CNNs and RNNs. The trick in dropout learning is to reduce the model capacity by randomly removing (‘dropping’) parts of the model during training, thus reducing the sensitivity to specific examples by enforcing redundancy of general features within the data. Dropout is implemented as additional layer(s) in an MLP and is said to be applied to the layer preceding the dropout layer.
Dropout is applied during training by randomly setting activations of the previous layer to zero according to

$$f_T(x, p) = x \odot B(n = |x|, 1 - p) \tag{2.3.32}$$

where $x$ are the activations of the previous neural layer and $B(n, 1 - p)$ is a vector of $n$ independent binary draws, each being one with probability $1 - p$ and zero otherwise. $p$ is the probability of dropping an individual neuron and $n$ is equal to the number of neurons in the previous layer. $p$ is also called the ‘dropout rate’ since it defines how frequently individual neurons are removed from the model. A dropout rate of $p = 0.5$ means that in every forward pass, on average only half of the neurons are actually used in the model. A lower dropout rate removes fewer neurons from the forward passes and $p = 0$ deactivates dropout completely.

Outside of training, no neuron activations are dropped and all learned redundancies are used. Not dropping any neurons during inference leads, on average, to a higher input to the next layer's linear combination than in training, since none of its inputs are set to zero. To counter this, dropout during inference scales the activations by the fraction of neurons that were kept during training:

$$f_I(x, p) = (1 - p)\, x \tag{2.3.33}$$

Dropout is thus implemented differently during training, see function $f_T$, and inference, see function $f_I$. $f_T$ randomly removes parts of the neural network, whereas $f_I$ scales the activations to make sure that the linear combination of the following neuron layer stays in the same value range.

The last techniques for regularization in deep neural networks that we will discuss here are L1 regularization[115, 145] and L2 regularization[56]. Both are based on the observation that overfitting in artificial neural networks often means that parts of the network get more and more sensitive to specific patterns that occur within the training data. In practice this means that the absolute values of the weights in the neural network increase more and more.
Intuitively, as the weights of the neural network increase in magnitude, the activations of the linear combination will lie in the saturation region on either side of the activation function (if the activation function in use saturates). At some point in training, activations will always be in saturation on one side or the other of the activation function, or will always tend to positive or negative infinity. This will lower the training error significantly if the network capacity is large enough, since parts of the network will be sensitive to those patterns and activations will tend towards one saturation region if a specific pattern is shown to the network. One could say that the DNN in this case actually has a ‘grandmother neuron’.

To counter this effect, L1 and L2 regularization impose a penalty on network weights with large magnitude. Implementations either add a penalty term to the loss function or, in the case of L2 regularization in ANNs, apply weight decay [74]. Weight decay is a functionally equivalent formulation of L2 regularization. The loss function

$$L = L_{task} + \lambda \lVert W \rVert_2 \tag{2.3.34}$$

with penalty term $\lambda \lVert W \rVert_2$ adds L2 regularization to the loss function as defined by the ML task at hand. Factor $\lambda$ controls how strong the regularization effect is during training. Setting $\lambda = 0$ will disable regularization, whereas a large $\lambda$ will force all network weights in weight set $W$ to be at or near zero, thus inhibiting any learning at all. In a similar way, L1 regularization is modeled as loss

$$L = L_{task} + \lambda \lVert W \rVert_1 \tag{2.3.35}$$

with penalty term $\lambda \lVert W \rVert_1$. The practical difference between L1 and L2 regularization is that L1 in DNNs tends to eliminate connections from the network completely by bringing their weights to zero. It thus acts as a sort of automatic feature selection. L2 tends towards keeping the network weights at a low magnitude without elimination, thus reducing the sensitivity to specific patterns without eliminating features.
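The relation between an L2 penalty and weight decay, and the different behavior of the L1 penalty, can be illustrated with a single gradient-descent step. The sketch below is an illustration under stated assumptions, not the formulation used in this thesis: it assumes plain gradient descent and the commonly used squared penalty $\lambda \sum_j w_j^2$, whose gradient is $2 \lambda w_j$. All function names and numeric values are hypothetical.

```python
def sgd_step_l2(W, grad_task, lr, lam):
    """One gradient-descent step on L_task + lam * sum(w ** 2).

    Assumption: squared L2 penalty, so its gradient is 2 * lam * w.
    """
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(W, grad_task)]


def sgd_step_weight_decay(W, grad_task, lr, lam):
    """Shrink every weight by a constant factor, then apply the task gradient."""
    decay = 1.0 - 2.0 * lr * lam
    return [decay * w - lr * g for w, g in zip(W, grad_task)]


def l1_penalty_grad(W, lam):
    """Subgradient of lam * sum(|w|): its magnitude does not depend on |w|."""
    return [lam * ((w > 0) - (w < 0)) for w in W]


W = [1.0, -2.0, 0.5]
grad_task = [0.1, 0.2, -0.3]  # illustrative gradient of the task loss
a = sgd_step_l2(W, grad_task, lr=0.1, lam=0.01)
b = sgd_step_weight_decay(W, grad_task, lr=0.1, lam=0.01)
print(a)
print(b)
print(l1_penalty_grad(W, lam=0.01))
```

Under these assumptions both update rules yield the same weights up to floating-point rounding. The L1 subgradient pushes every non-zero weight towards zero with the same strength regardless of its magnitude, which is consistent with the tendency of L1 regularization to bring small weights exactly to zero, whereas the L2 pull shrinks in proportion to the weight.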
2.4 Expectation-Maximization

Figure 2.4.1: Overview over expectation-maximization. The E-step estimates the latent variables z while the model parameters W are held constant. The M-step updates the model parameters W while the latent variables z are held constant.

We will now discuss the expectation-maximization (EM) algorithm, first proposed by Dempster et al.[26] as an iterative algorithm for maximum likelihood optimization under incomplete data. Maximum likelihood optimization in machine learning scenarios refers to the optimization of a parameter set $W$ with regard to maximizing the probability of observing specific data $x$. Since the logarithmic function is monotonic but many optimization problems are convex only after logarithmic transformation, it is prudent in these cases to maximize the log-likelihood. This corresponds to

$$W^\star = \operatorname*{argmax}_W \sum_i \log P(x_i \mid W) \tag{2.4.1}$$

as the solution for $W$. Some tasks include latent variables $z$ that may be marginalized in order to implement maximum likelihood training, leading to

$$W^\star = \operatorname*{argmax}_W \sum_i \log \Big[ \sum_{z_i} P(x_i, z_i \mid W) \Big] \tag{2.4.2}$$

for the modified solution with latent variables $z_i$. This marginalization is suitable in cases where variables $z_i$ can be observed. In some cases the latent variables $z_i$ are either unobserved or are themselves restricted by constraints that need to be observed.

Expectation-maximization optimizes toward the maximum likelihood solution by iteratively finding the expectation value for the latent variables $z_i$ and then optimizing the parameters $W$ given the current expectation values for $z_i$. This approach is repeated until the latent variables do not change anymore or some other suitable convergence criteria are met. This basic loop is visualized in Figure 2.4.1.

Setting up an expectation-maximization optimization requires an objective function or auxiliary function.
Ideally we would like to maximize the complete data log-likelihood

$$P(W) = \sum_i \log P(x_i, z_i|W) \tag{2.4.3}$$

of observing the data examples $x_i$ and corresponding latent variables $z_i$, given our model parameters $W$. This cannot be done for tasks where expectation-maximization is utilized, since those are exactly the tasks in which the latent variables $z_i$ cannot be observed. We define the objective function as the expected complete data log-likelihood

$$J(W, W^{old}) = E[P(W)\,|\,X, W^{old}] \tag{2.4.4}$$

which will be maximized instead. In the E-step we compute the latent variables $Z$ based on model parameters $W^{old}$ and the observed data $X$. The M-step maximizes the objective function by choosing new model parameters $W$. This iterative optimization is repeated until convergence.

Please note that in this section we will discuss both maximizing and minimizing the objective function $J$. This is task dependent, and the expectation-maximization approach in general is unchanged by it. The equations and examples in this section on expectation-maximization, including the above ones, are based on, but not necessarily identical to, the ones given in the corresponding chapters by Christopher Bishop[7, ch. 9] and Kevin Murphy[93, ch. 11.4].

Gaussian Mixture Model

We will discuss expectation-maximization by using Gaussian mixture models (GMMs) or mixtures of Gaussians[90] as a basic example of how to apply EM to probabilistic optimization problems with latent variables. Mixture models in general describe multi-variate distributions by a linear combination of base distributions. In the case of GMMs, the base distribution is a multi-variate Gaussian distribution and a total of $K$ distributions are mixed to model the $K$ data clusters of the mixture model. Expectation-maximization is applied to fit the mixture model to a finite set $x$ of observed data.
The likelihood of observing a specific data point $x_i$ in a Gaussian mixture model is defined by the linear combination

$$P(x_i|W) = \sum_k^K \pi_k \mathcal{N}(x_i|\mu_k, \Sigma_k) \tag{2.4.5}$$

where $\pi$ are the coefficients for the linear combination of the individual Gaussian distributions and $W := \{(\pi_k, \mu_k, \Sigma_k)\,|\,k \in [1,K]\}$ are the model parameters for the $K$ clusters of the GMM. The expected complete data log-likelihood of a GMM is

$$J(W, W^{old}) := E\Big[\sum_i^N \log P(x_i, z_i|W)\Big] = \Big[\sum_i^N \sum_k^K r_{i,k} \log \pi_k\Big] + \Big[\sum_i^N \sum_k^K r_{i,k} \log P(x_i|\mu_k, \Sigma_k)\Big] \tag{2.4.6}$$

with $z_i := \{r_{i,k}\,|\,k \in [1,K]\}$ being the latent variables in the form of the responsibility of cluster $k$ for generating data point $x_i$.

The E-step in expectation-maximization for Gaussian mixture models is the calculation of these cluster responsibilities:

$$r_{i,k} = \frac{\pi_k P(x_i|W_k^{old})}{\sum_{k'} \pi_{k'} P(x_i|W_{k'}^{old})} \tag{2.4.7}$$

The M-step is to maximize the expected complete data log-likelihood of Equation 2.4.6 by choosing new model parameters $W$ while keeping the latent variables $z$ constant. The mixture coefficient $\pi_k$ of cluster $k$ is simply its mean responsibility for generating the observed data:

$$\pi_k = \frac{1}{N} \sum_i^N r_{i,k} \tag{2.4.8}$$

Maximizing $J(W, W^{old})$ with respect to $\mu_k$ and $\Sigma_k$, which is to maximize the expression $\sum_i \sum_k r_{i,k} \log P(x_i|\mu_k, \Sigma_k)$ of Equation 2.4.6, completes the M-step in GMMs. This maximization yields

$$\mu_k = \frac{\sum_i r_{i,k} x_i}{\sum_i r_{i,k}} \tag{2.4.9}$$

and

$$\Sigma_k = \frac{\sum_i r_{i,k} x_i x_i^T}{\sum_i r_{i,k}} - \mu_k \mu_k^T \tag{2.4.10}$$

for the cluster center and covariance.

This process of iteratively maximizing $J$ by applying the E-step and M-step is repeated until the cluster responsibilities $r_{i,k}$ do not change significantly anymore, in which case the M-step will also stabilize at the final cluster model parameters $\pi_k$, $\mu_k$ and $\Sigma_k$. Convergence is guaranteed since both the E-step and M-step do in fact maximize the expected complete data log-likelihood $J$ of Equation 2.4.6 in each iteration of the expectation-maximization process.
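The E-step and M-step for Gaussian mixture models described above can be summarized in a short sketch. It is not the implementation used in this work; the function names `gaussian_pdf` and `em_gmm`, the random initialization from data points, the fixed number of iterations and the small diagonal term `eps` (added to keep the covariance matrices invertible) are assumptions made for illustration only.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density N(x | mu, sigma), evaluated for every row of x."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(sigma)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(sigma))
    expo = -0.5 * np.einsum("ni,ij,nj->n", diff, inv, diff)
    return np.exp(expo) / norm

def em_gmm(x, k, iterations=50, seed=0, eps=1e-6):
    """Fit a K-component GMM to the rows of x with plain EM."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    pi = np.full(k, 1.0 / k)                      # mixture coefficients
    mu = x[rng.choice(n, size=k, replace=False)]  # centers: random data points
    sigma = np.stack([np.atleast_2d(np.cov(x.T)) + eps * np.eye(d)] * k)
    for _ in range(iterations):
        # E-step: responsibilities r[i, k] as in Eq. 2.4.7.
        weighted = np.stack(
            [pi[j] * gaussian_pdf(x, mu[j], sigma[j]) for j in range(k)], axis=1
        )
        r = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: Eq. 2.4.8 (pi), Eq. 2.4.9 (mu) and Eq. 2.4.10 (sigma).
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ x) / nk[:, None]
        for j in range(k):
            second_moment = (r[:, j, None] * x).T @ x / nk[j]
            sigma[j] = second_moment - np.outer(mu[j], mu[j]) + eps * np.eye(d)
    return pi, mu, sigma, r

# Usage on synthetic data: two well-separated two-dimensional clusters.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 1.0, (60, 2)), rng.normal(10.0, 1.0, (60, 2))])
pi, mu, sigma, r = em_gmm(data, k=2, iterations=20)
print(pi, mu, sep="\n")
```

A practical implementation would typically work with log-densities to avoid numerical underflow and would stop once the responsibilities no longer change significantly, as described in the text, instead of running a fixed number of iterations.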
Generalized Expectation-Maximization

The above case of applying expectation-maximization to Gaussian mixture models can be seen as true expectation-maximization, since closed-form solutions exist for both the E-step and the M-step, which in turn allows $J$ to be fully optimized at each iteration given the current assignment of latent variables. This may not always be the case for the E-step and/or the M-step, and fully optimizing $J$ at each iteration will then not be possible. Generalized expectation-maximization still provides a framework for finding maximum likelihood solutions with incomplete data by applying the EM algorithm to these cases, but instead of fully optimizing $J$ at each step we only try to improve the value of $J$ at each step.

In this work the M-step will optimize the parameters of a deep neural network with the goal of minimizing $J$. To this end, backpropagation and gradient descent will be applied to the deep neural network while choosing a suitable surrogate loss function for the DNN. Since gradient descent has only local information about the loss function, is only applied to a small batch of data examples at a time and performs only a small step in weight space, minimizing $J$ in each M-step is intractable. Instead, gradient descent applied to the deep neural network will likely reduce the value of objective function $J$ by reducing the value of a suitable surrogate loss function, although this is not guaranteed, since gradient descent for artificial neural networks may diverge or oscillate depending on the ‘loss landscape’ and the step size in use.

Convergence to a local optimum still occurs on average in generalized expectation-maximization even if no closed-form solutions to the E-step and M-step are applied[93, ch. 11.4.5.2].

k-Means Clustering

The following paragraphs discuss k-means clustering as an example of generalized expectation-maximization.
The generalization in k-means clustering is that it is not a probabilistic model, but instead uses discrete cluster assignments between observed data points and cluster centers. In this respect k-means clustering can be seen as a discretized variant of Gaussian mixture models. Let us define $x$ as the set of points $x_i$ of our data set. This data should be separated into $K$ different clusters in such a way that distances between points in the same cluster are smaller than distances between points in different clusters. For this we need to assign points to clusters by defining cluster assignments $r_{i,k}$, which take the value 1 for exactly one cluster $k$ per point $x_i$ and 0 for all other clusters. Clusters are defined by their centers $\mu_k$. In this case we want to optimize the cluster centers $\mu_k$ in order to reflect the clusters present in the observed data points $x_i$ by minimizing the intra-cluster distance between points $x_i$ of the same cluster $k$. The unobserved variables are the cluster assignments $r_{i,k}$.

Expectation-maximization requires an objective function that will be minimized iteratively. In k-means clustering we may choose to minimize the squared Euclidean distance between the data points and the center of the cluster they are assigned to. This leads to

$$J = \sum_i \sum_k r_{i,k} \lVert x_i - \mu_k \rVert^2 \tag{2.4.11}$$

as our objective function. We start out by choosing initial cluster centers $\mu_k$, for example a random selection of $K$ data points from $x$. Afterwards we iteratively minimize objective function $J$ by first choosing new cluster assignments $r$ that minimize $J$ while keeping cluster centers $\mu$ constant (E-step) and then minimizing $J$ by choosing new cluster centers while keeping the assignments constant (M-step).

The E-step is easily performed by simply assigning each data point $x_i$ to the nearest cluster center $\mu_k$:

$$r_{i,k} = \begin{cases} 1, & \text{if } \arg\min_n \lVert x_i - \mu_n \rVert^2 = k \\ 0, & \text{else} \end{cases} \tag{2.4.12}$$

In the M-step we want to minimize $J$ given our current assignments $r_{i,k}$ of data points $x_i$ to clusters $k$.
We keep in mind that function $J$ is the sum of the squared Euclidean distances between data points and their assigned cluster centers, which is a convex function of the cluster centers. Any stationary point will thus be a minimal point of function $J$. To this end we differentiate $J$ with respect to the cluster centers $\mu_k$

$$\frac{\partial J}{\partial \mu_k} = -2 \sum_i r_{i,k}(x_i - \mu_k) \tag{2.4.13}$$

and find the centers $\mu_k$ where the derivative is zero

$$\frac{\partial J}{\partial \mu_k} = 0 \tag{2.4.14}$$

and which are thus a minimal point of the convex function $J$. This leads to

$$\mu_k = \frac{\sum_i r_{i,k} x_i}{\sum_i r_{i,k}} \tag{2.4.15}$$

which is simply the average over the coordinates of the data points $x_i$ assigned to cluster $k$. Again, this iterative expectation-maximization approach is repeated until the cluster assignments $r_{i,k}$ do not change anymore.

Chapter 3

Related Work

3.1 Connectionist Temporal Classification

We will now discuss connectionist temporal classification (CTC)[46][43, ch. 7], a method consisting of a loss function and a decoding function that solves the sequence labeling problem for one-dimensional sequences. We will discuss the CTC loss function in detail since it is crucial to understanding this solution to the one-dimensional sequence labeling problem, but also since it provides the context for this thesis. One decoding function for CTC will also be discussed in this section, with another decoding function being detailed in the original publication[43, ch. 7.5.2].

The mathematical symbols and equations in this section closely follow the ones in the original publications[46][43, ch. 7] by Alex Graves.

Collapse Layer

Before we begin to discuss the CTC loss function, we need to take a short look at the deep neural networks typically employed together with CTC. The task for which the CTC loss is designed is to predict a sequence of labels given an image (offline handwriting recognition) or audio (speech recognition) as input. Both cases typically use deep neural networks based on LSTM[55] or MDLSTM[44] layers, although other topologies or combinations with convolutional neural networks[78] are possible.
Images of handwriting or speech recordings are often multi-dimensional in nature. Images consist of two spatial dimensions. Audio consists of one temporal dimension and the frequency domain. This type of data can be processed by a DNN that consists of multiple layers of LSTM, MDLSTM, convolutions or pooling and does not intrinsically pose a problem to modern DNN architectures. However, the output of a forward pass of such a neural network will also be multi-dimensional. The need to eliminate all but one dimension within the data arises from the fact that CTC processes one-dimensional sequences.

One possibility for eliminating the additional dimensions is to apply a so-called collapse layer. A collapse layer sums up the predicted values along all but one dimension in order to marginalize those dimensions, and as such effectively eliminates them while maintaining differentiability of the overall DNN. The collapse function

$$c_x = \sum_y i_{x,y} \tag{3.1.1}$$

with $i$ being an image input and $c$ being the collapse output, for example, marginalizes the y-dimension. Marginalization by summation is a common approach for array dimensions that are of a constant size. Variable array dimensions may be reduced by averaging or by finding the maximum value. A collapse layer can be seen as a sort of ‘dynamic pooling’ with the pooling window always being the total extent of the dimension in question.

A collapse layer is then followed by a softmax layer, see Section 2.3, in order to compute label probabilities. The overall network prediction $y$ consists of one temporal or spatial dimension, along which label probabilities are estimated for characters from an alphabet.

Fundamental Probabilities

Let us start by defining the sequence labeling task. Let $A$ be the set of glyphs of the script that will be transcribed using CTC. In Latin or Roman script this would be the Latin alphabet, plus language- or region-specific symbols.
For CTC to work we need to add one more label to this set, which is the blank or glyph separator. This is a special artificial label, the meaning of which we will discuss soon. From now on in this section, $A$ will contain both the visible glyphs of the script and the artificial blank.

Connectionist temporal classification employs a deep neural network (DNN) in order to transcribe texts. The DNN used in the original publication was an LSTM and MDLSTM, but other variants have been proposed[12, 104, 156] for use with CTC since then. Let $x$ be the input into the DNN and $y = f(x, W)$ the prediction of the DNN using input $x$ and parameter set $W$. The DNN $f$ has one output neuron per element of label set $A$ and produces an output sequence of length $T$. The output neurons of the DNN estimate the probabilities of each sequence position in $T$ belonging to one of the labels from $A$. As such, $y$ is a real-valued probability distribution $y \in \mathbb{R}^{T \times |A|}$. Sequence labeling is to assign a sequence $l \in A^{|l|}$ to the time steps of network prediction $y$ in such a way that the probability $P(l|x, W)$ is maximized.

The first step towards this goal is to define the probability for observing a specific path or configuration $\pi$ given the network output $y$. A configuration $\pi$ is a label sequence of length $T$ over alphabet $A$ with $\pi \in A^T$. We can easily define the probability

$$P(\pi|y) = \prod_t^T y^t_{\pi_t} \tag{3.1.2}$$

for observing a specific path $\pi$, where $y^t_s$ refers to the estimated probability of symbol $s$ occurring at time step $t$.

Next we will define a function $F$ that maps configurations $\pi$ to a label sequence $l$. This mapping should allow for repetitions of the same glyph in adjacent characters while also allowing the same character to stretch over multiple time steps. This mapping is done by first collapsing multiple adjacent occurrences of the same symbol to exactly one occurrence of that symbol, e.g. $F(aaabbaa) = aba$. Next, the artificial blank symbols $\epsilon$ are removed, e.g. $F(aaa\epsilon\epsilon aa) = aa$.
This already shows the usefulness of the blank symbol $\epsilon$, since it allows us to distinguish actual repetitions of the same symbol from repetitions that are only necessary to ‘fill up’ the $T$ time steps. An overall example would be $F(aa\epsilon abbaa) = aaba$, and of course many different configurations $\pi$ will map to the same label sequence $l = F(\pi)$.

Since there are many different paths or configurations $\pi$ that map to the same label sequence $l$, and these paths are mutually exclusive, we can now also define the probability

$$P(l|y) = \sum_{\pi : F(\pi) = l} P(\pi|y) \tag{3.1.3}$$

for observing a specific label sequence $l$ given the network prediction $y$.

Prototypical Decoding

This leads us to the prototypical formulation of what a decoding algorithm in the context of CTC does. Decoding is an algorithm that, given network prediction $y$, finds the most likely label sequence $l^\star$ or at least a label sequence with reasonably high probability. This decoded label sequence is the overall result of the sequence labeling task, that is to find e.g. the most likely transcription given a recording of spoken language, as is the case in voice assistant devices. The label sequence

$$l^\star = \arg\max_l P(l|y) \tag{3.1.4}$$

is thus the transcription of the network prediction $y = f(x, W)$. This transcription method is not computationally feasible since it would require enumerating all paths $\pi \in A^T$, which can easily be a prohibitively large set. We will discuss computationally feasible decoding algorithms for CTC later in this section.

Prototypical Loss

Training of a deep neural network (DNN) using CTC is done by gradient-based parameter optimization for maximum likelihood of the true label sequence. Let $S$ be a training data set with $(x, z) \in S$ being the input $x$ and true label sequence $z$. $y = f(x, W)$ is again the DNN prediction using the parameters $W$ that will be optimized in the process.
The loss function

$$L = -\ln \prod_{(x,z) \in S} P(z|y) = -\sum_{(x,z) \in S} \ln P(z|y) = -\sum_{(x,z) \in S} \ln P(z|x, W) \tag{3.1.5}$$

is then minimal when the likelihood for predicting $z$ is maximized. The question remains how to evaluate the likelihood $P(z|y)$ according to Equations 3.1.2 and 3.1.3 if doing so would require the enumeration of all paths or configurations $\pi$ that relate to label sequence $z$. We can reduce the computational requirements for the evaluation of Equation 3.1.3 by employing a dynamic programming approach similar to the forward-backward algorithm for hidden Markov models[105].

Forward-Backward Algorithm

The forward-backward algorithm is based on the idea that both the temporal or spatial dimension of the (in our case) deep neural network prediction and the target label sequence can each be split into two disjoint parts. We can then evaluate the probabilities of observing a specific label prefix at the beginning of the sequence and of observing the corresponding label suffix at the end of it. Multiplication of the probability for the prefix with the probability for the suffix yields the probability of observing the overall label sequence $l$, restricted to configurations $\pi$ that indicate the label $l_u$ at time step $t$. Let $U = |l|$ be the length of label sequence $l$. This observation is formalized as

$$\begin{aligned} P(l : l_u = \pi_t|y) &= \sum_{\pi' : F(\pi') = l \,\wedge\, \pi'_t = l_u} P(\pi'|y) \\ &= \Big(\sum_{\pi' : F(\pi') = l_{1:u}} \prod_{i=1}^{t} y^i_{\pi'_i}\Big)\Big(\sum_{\pi' : F(\pi') = l_{u+1:U}} \prod_{i=t+1}^{T} y^i_{\pi'_i}\Big) \\ &= \alpha(t, u)\beta(t, u) \end{aligned} \tag{3.1.6}$$

with $\pi'$ being paths related to prefixes or suffixes of $l$, and $t, u$ being the points in the DNN prediction and label sequence where the prefix or suffix starts or ends accordingly.
In reference to Figure 3.1.1, this is the probability of picking one node at position $(t, u)$ and calculating the probability of passing through that node.

[Figure 3.1.1: Lattice of the augmented label sequence ε H ε E ε L ε L ε O ε (vertical axis: label sequence, positions 1 to U) over the time steps of the DNN prediction (horizontal axis: 1 to T). Each connected path from top left to bottom right represents one path π where F(π) is the label sequence ‘HELLO’. Calculating the probability for each path and summing up all these yields the total probability of observing the label sequence ‘HELLO’.]

The forward variable

$$\alpha(t, u) = \sum_{\pi' : F(\pi') = l_{1:u}} \prod_{i=1}^{t} y^i_{\pi'_i} \tag{3.1.7}$$

and the backward variable

$$\beta(t, u) = \sum_{\pi' : F(\pi') = l_{u+1:U}} \prod_{i=t+1}^{T} y^i_{\pi'_i} \tag{3.1.8}$$

will be evaluated in the following paragraphs using a dynamic programming approach. Equation 3.1.3 can then be rewritten as

$$P(l|y) = \sum_u^U P(l : l_u = \pi_t|y) = \sum_u^U \alpha(t, u)\beta(t, u), \quad \forall t \in [1, T] \tag{3.1.9}$$

by observing that the total label probability for $l$ is the sum over the paths passing through any position $u$ in the label sequence at one time step. This corresponds to picking one vertical slice at point $t$ out of the graph in Figure 3.1.1 and summing up the probabilities of all paths passing through this vertical slice.

Let $l'$ be the label sequence $l$ with the glyph separator $\epsilon$ added in between every label and also at the front and rear. This $\epsilon$ label is used to separate adjacent occurrences of the same glyph, but also to fill up the time dimension up to $T$ in case the label sequence is shorter. As such, the $\epsilon$ label is mandatory only between adjacent occurrences of the same glyph and is optional otherwise. In other words, we could omit the $\epsilon$ glyph in $l'$ whenever the mapping $F$ would still produce the correct true label sequence.
Applying a dynamic programming approach to calculating $\alpha$ first requires us to set the initial probability for a prefix of one time step in size, and then to incrementally extend the prefix from there until we reach the end of the label sequence $l'$ and the last time step $T$. The initial probabilities for one time step can only belong to the first $\epsilon$ or the first visible glyph, since otherwise we would have skipped the first glyph in the sequence. This means that $\alpha(1, 1) = y^1_{l'_1} = y^1_\epsilon$, $\alpha(1, 2) = y^1_{l'_2} = y^1_{l_1}$ and $\alpha(1, u) = 0, \forall u > 2$. The recursive formulation for $\alpha$ is then

$$\alpha(t, u) = y^t_{l'_u} \sum_{i=\text{head}(u)}^{u} \alpha(t-1, i) \tag{3.1.10}$$

with $\text{head}(u)$ being the first valid predecessor of position $u$ in $l'$. This function is

$$\text{head}(u) = \begin{cases} u - 1, & \text{if } l'_u = \epsilon \text{ or } l'_{u-2} = l'_u \\ u - 2, & \text{else} \end{cases} \tag{3.1.11}$$

and allows jumps over $\epsilon$ labels, or no jumps if the $\epsilon$ is mandatory because of repetitions.

We can apply the same dynamic programming approach to the backward variable $\beta$ by initializing $\beta(T, U') = 1$, $\beta(T, U' - 1) = 1$ and $\beta(T, u) = 0, \forall u < U' - 1$ with $U'$ being the length of the augmented label sequence $l'$. Similar to $\alpha$, but in reverse traversal order, the recursive formulation is

$$\beta(t, u) = \sum_{i=u}^{\text{tail}(u)} \beta(t+1, i)\, y^{t+1}_{l'_i} \tag{3.1.12}$$

with $\text{tail}(u)$ being the last valid successor of position $u$ according to

$$\text{tail}(u) = \begin{cases} u + 1, & \text{if } l'_u = \epsilon \text{ or } l'_{u+2} = l'_u \\ u + 2, & \text{else} \end{cases} \tag{3.1.13}$$

which again models the mandatory $\epsilon$ labels.

Loss based on the Forward-Backward Algorithm

Equation 3.1.5 represents the loss function for CTC for full batch training on a training data set $S$. The sample loss is thus

$$L(x, z) = -\ln P(z|y) \tag{3.1.14}$$

with $x$ being the example input, e.g. an image, and $z$ being the true label sequence. By assuming $l = z$ during training and substituting Equation 3.1.9 we obtain

$$L(x, z) = -\ln \sum_u^U \alpha(t, u)\beta(t, u) \tag{3.1.15}$$

and can begin computing the partial derivative

$$\frac{\partial L(x, z)}{\partial y^t_g} \tag{3.1.16}$$

for glyph $g$ at time step $t$ in order to apply backpropagation and gradient descent for optimizing the DNN parameters.
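The forward recursion of Equations 3.1.10 and 3.1.11 can be checked against the exhaustive sum of Equation 3.1.3 on a small example. The following is a minimal sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, labels are integer indices, index 0 is assumed to be the blank ε, indices are 0-based (the text counts from 1), and probabilities are multiplied directly instead of being handled in log-space.

```python
import itertools
import numpy as np

BLANK = 0  # assumption: label index 0 is the blank / glyph separator

def collapse(path):
    """Mapping F: merge adjacent repetitions, then remove blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return tuple(s for s in merged if s != BLANK)

def brute_force_probability(y, label):
    """P(l|y) as in Eq. 3.1.3 by enumerating all |A|^T paths."""
    T, A = y.shape
    total = 0.0
    for path in itertools.product(range(A), repeat=T):
        if collapse(path) == tuple(label):
            total += np.prod([y[t, s] for t, s in enumerate(path)])
    return total

def forward_probability(y, label):
    """P(l|y) via the forward variable alpha over the augmented sequence l'."""
    T = y.shape[0]
    ext = [BLANK]
    for s in label:
        ext += [s, BLANK]          # blank before, between and after labels
    U = len(ext)
    alpha = np.zeros((T, U))
    alpha[0, 0] = y[0, ext[0]]     # start with the first blank ...
    if U > 1:
        alpha[0, 1] = y[0, ext[1]] # ... or with the first visible glyph
    for t in range(1, T):
        for u in range(U):
            # head(u) of Eq. 3.1.11 in 0-based indexing, clipped at 0.
            if ext[u] == BLANK or (u >= 2 and ext[u - 2] == ext[u]):
                head = u - 1
            else:
                head = u - 2
            head = max(head, 0)
            alpha[t, u] = y[t, ext[u]] * alpha[t - 1, head:u + 1].sum()
    # Valid paths end in the last blank or in the last visible glyph.
    return alpha[T - 1, U - 1] + (alpha[T - 1, U - 2] if U > 1 else 0.0)

# Usage: random prediction with T = 5 time steps and |A| = 3 labels.
rng = np.random.default_rng(0)
y = rng.random((5, 3))
y /= y.sum(axis=1, keepdims=True)
label = [1, 1, 2]
p = forward_probability(y, label)
print(p, brute_force_probability(y, label))
print("sample loss:", -np.log(p))  # Eq. 3.1.14
```

Both functions return the same probability up to rounding, but the brute-force variant visits |A|^T paths, whereas the recursion only fills a table of size T × |l'|.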
We observe that $\frac{\partial \ln x}{\partial x} = \frac{1}{x}$ and thus

$$\frac{\partial L(x, z)}{\partial y^t_g} = \frac{-\partial \ln P(z|y)}{\partial y^t_g} = -\frac{1}{P(z|y)} \frac{\partial P(z|y)}{\partial y^t_g} \tag{3.1.17}$$

leads to the question of how to compute the partial derivative $\frac{\partial P(z|y)}{\partial y^t_g}$. We further observe from Equation 3.1.6 that

$$\frac{\partial \alpha(t, u)\beta(t, u)}{\partial y^t_g} = \begin{cases} \frac{\alpha(t, u)\beta(t, u)}{y^t_g}, & \text{if } g \in z \\ 0, & \text{else} \end{cases} \tag{3.1.18}$$

since glyphs $g$ that are not contained in $z$ do not influence the derivative. We are thus ready to complete the partial derivative $\frac{\partial L(x, z)}{\partial y^t_g}$. We sum the derivatives for the individual label positions $u$ that carry the same glyph $g$ in order to adhere to the above observation. This gives us

$$\frac{\partial P(z|y)}{\partial y^t_g} = \frac{1}{y^t_g} \sum_{u : z_u = g} \alpha(t, u)\beta(t, u) \tag{3.1.19}$$

which can be substituted to obtain

$$\frac{\partial L(x, z)}{\partial y^t_g} = -\frac{1}{P(z|y)} \frac{\partial P(z|y)}{\partial y^t_g} = -\frac{1}{P(z|y)\, y^t_g} \sum_{u : z_u = g} \alpha(t, u)\beta(t, u) \tag{3.1.20}$$

and thus we have a partial derivative of the loss function at hand that can be applied to parameter optimization.

Decoding Algorithms

So far we have discussed how to train a deep neural network for one-dimensional transcription using CTC and how to derive the loss function for its training. The question remains how to decode the network prediction in order to obtain the most likely (or at least a good) label sequence after training. As discussed above, the prototypical decoding algorithm should solve

$$l^\star = \arg\max_l P(l|y) \tag{3.1.21}$$

which is computationally infeasible since it would require enumerating all possible paths $\pi$ through $y$. We will now discuss two approximations of this.

The first is best path decoding[43, ch. 7.5.1], which finds the path $\pi^\star$ with the highest probability and assumes that this path corresponds to the most likely label sequence. As such

$$l^\star = F(\pi^\star) \tag{3.1.22}$$

with

$$\pi^\star_t = \arg\max_g y^t_g \tag{3.1.23}$$

which means that $\pi^\star$ is simply the sequence of labels $g$ with the highest probability $y^t_g$ at their respective time step $t$. While this is simple and fast, it is prone to errors if some correct glyphs are only weakly predicted.
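Best path decoding as given by Equations 3.1.22 and 3.1.23 takes only a few lines. The sketch below is illustrative; the function name `best_path_decode` and the convention that label index 0 is the blank ε are assumptions made here, not part of the original formulation.

```python
import numpy as np

BLANK = 0  # assumption: label index 0 is the blank / glyph separator

def best_path_decode(y, blank=BLANK):
    """Take the most probable label per time step, then apply mapping F."""
    best = np.argmax(y, axis=1).tolist()  # pi*_t = argmax_g y^t_g
    decoded = []
    previous = None
    for s in best:
        # Keep a label only if it starts a new run and is not the blank.
        if s != previous and s != blank:
            decoded.append(s)
        previous = s
    return decoded

# Two time steps, alphabet {blank, a}: the blank wins at each step.
y = np.array([[0.6, 0.4],
              [0.6, 0.4]])
print(best_path_decode(y))  # → []
```

This small example also shows the weakness mentioned above. The best path (ε, ε) has probability 0.6 · 0.6 = 0.36 and decodes to the empty sequence, whereas the label sequence 'a' collects the three paths (a, ε), (ε, a) and (a, a) with a total probability of 0.24 + 0.24 + 0.16 = 0.64 and is therefore the more likely transcription.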
We will now briefly discuss the beam search decoding or prefix search decoding[43, ch. 7.5.2] algorithm, which avoids this shortcoming with weak predictions and finds the most probable label sequence given enough time. Beam search decoding builds a prefix trie of known label sequences and updates their probabilities during decoding. Decoding starts out at $t = 1$ with an empty trie (only the root node representing the empty label sequence) and then incrementally processes paths through the deep neural network prediction up until $t = T$ while also incrementally updating the label sequences in the prefix trie. At each time step $t$, all glyphs $g \in A$ are iterated, their estimated probabilities $y^t_g$ are retrieved from the DNN prediction and the glyphs are appended to all the label sequences in the prefix trie. If at this point two or more prefixes in the trie collapse to one when applying function $F$ to them, they are actually collapsed to one label sequence and their probabilities are summed. This corresponds to multiple paths $\pi$ with $F(\pi)$ resulting in the same label sequence. After each iteration the prefix trie is pruned to the top-$n$ (usually with $10 \leq n \leq 100$) most probable prefixes in order to keep the runtime requirements low. At $t = T$, the most probable label sequence in the prefix trie is the solution $l^\star$ of the decoding process. Disabling the pruning after each iteration, always keeping all label sequences and incrementally appending to them would yield the true most probable label sequence as defined by Equation 3.1.4, but would also require enumeration of all possible paths $\pi$, which number $|A|^T$ in total and thus can quickly grow to computationally intractable numbers.
Relation to this work

Connectionist temporal classification is based on the idea that given a label sequence $l$ and a deep neural network prediction of length $T$, one can compute the probabilities of each possible path $\pi$ with $F(\pi) = l$ of length $T$ and thus compute the alignment of the label sequence $l$. CTC solves this task by implementing a forward-backward algorithm to efficiently compute this alignment. In turn, this allows setting up a loss function for gradient-based optimization of the deep neural network or any other machine learning model that is optimized by gradient descent. The transcription algorithm of CTC is completed by following up the DNN prediction with a decoding algorithm that produces the most likely label sequence from the DNN prediction.

During training, CTC takes into account all possible paths $\pi$ with $F(\pi) = l$ and in this manner solves the alignment task in an exact and optimal way. The probabilities for the individual paths $\pi$, for the overall label sequence $l$ and for specific glyphs at specific time steps, $P(l : l_u = \pi_t|y)$ (see Equation 3.1.6), are exact under the assumptions laid out before. The drawback is that the forward-backward algorithm can only be applied to one-dimensional sequences, since at each time step $t$ it requires the exact probabilities for all prefixes from $[1, t-1]$ and all suffixes from $[t+1, T]$. This is only possible if the prefix and suffix are conditionally independent given time step $t$. Looking at Equation 3.1.6, this is obviously the case in one-dimensional sequences. It does, however, not apply to multi-dimensional sequences.

Sections 4.3 and 6.4 further detail this problem. Multi-dimensional connectionist classification improves on this point by providing an approximate solution to the sequence labeling task in multi-dimensional spaces.
3.2 Paragraph Transcription using Attention Networks

So far we have discussed connectionist temporal classification, a loss function and decoding algorithms which together solve the sequence labeling task for one-dimensional sequences. Since the forward-backward algorithm for computing the alignment between the target label sequence and the deep neural network prediction is based on the fact that both need to be one-dimensional, it cannot be easily transferred to multi-dimensional problems, e.g. labeling paragraphs of multiple text lines. We will now discuss one method for the application of CTC to multi-dimensional problems by implicitly converting them to a one-dimensional sequence.

As we have discussed before, see Section 3.1, deep neural networks for CTC typically use a collapse layer followed by a softmax layer, see Section 2.3, to marginalize all but one dimension of the input data and produce a one-dimensional sequence of label probabilities as prediction. We recall that the collapse layer marginalizes dimensions by summing up along them, effectively removing them from the prediction. The works[8][9] of T. Bluche et al. replace this non-parameterized collapse layer by a collapse function based on attention networks in order to transform a multi-dimensional input to a one-dimensional prediction while allowing for complex relationships between the output sequence order and the spatial locations within the input. This modified collapse function based on attention networks is then applied to multi-line paragraph transcription based on the CTC loss.

Attention networks[35] are a class of recurrent neural networks that try to mimic cognitive focus or attention by depending only on a small subset of the available data at each time step, and by moving this focus to another subset of the input data from one time step to the next.
Selecting the subset is done based on the attention of the previous time step, as well as the input data itself and possibly the previous prediction. Applied to image data, an attention network selects a set of pixels (spatial positions) at each time step and computes its prediction based on these spatial positions. The selection of pixels is then moved to other positions and the whole process is repeated.

Attention networks are regularly applied to e.g. images[158] or language translation[3].

Attention Networks on Images

We will now discuss this type of attention network applied to image data. Figure 3.2.1 serves as an overview of this type of attention in deep neural networks. Let us begin with the input $x$, which is an image with two spatial dimensions. This input $x$ may be encoded by using a convolutional neural network or a recurrent neural network in order to obtain encoded features that are meaningful to the task at hand. The encoder artificial neural network

$$x' = \text{Encoder}(x, W_e) \tag{3.2.1}$$

produces the encoded feature maps $x'$ based on the image input $x$ and the encoder network parameters $W_e$. If no encoder network is employed, we can assume $x' = x$.

The next step in an attention network is the modeling of attention $a^t$ on a subset of the encoded data at each time step $t$. This again is modeled as an artificial neural network. It is important to note that the spatial dimensions and the size of the attention must be equal to those of the encoded data. The attention network

$$a^t = \text{Attention}(x', a^{t-1}, W_a) \tag{3.2.2}$$

models the attention based on the encoded data $x'$, its own attention $a^{t-1}$ of the previous time step and parameter set $W_a$. Attention $a^t$ is of the same spatial resolution as the encoded data $x'$, but has only one feature map. This feature is bound to the value range $[0, 1]$ to model a two-class classification problem. As such, the standard logistic sigmoid function would be a suitable choice for the final activation function in the attention networks.
Attention near or at 1 models focused attention on this position, whereas attention near or at 0 ignores this position.

We now perform a coefficient-wise multiplication of the attention $a^t$ and the encoded features $x'$

$$s^t = \sum_i^I \sum_j^J a^t_{i,j}\, x'_{i,j} \tag{3.2.3}$$

while collapsing the two spatial dimensions $I$ and $J$. Feature vector $s^t$ is now a selection of the features from the encoded data $x'$, but with a dependency on the current attention focus.

[Figure 3.2.1: Attention network applied to offline handwriting recognition. The image x of handwritten text is passed through the encoder network to obtain the encoded image x'. The attention network computes the attention a^t of the current step from x' and the stored attention a^{t-1} of the last step. The attention-weighted features are summed up to the collapsed features s^t, s^{t+1}, s^{t+2}, ..., which the decoder network maps to the label probabilities y^t, y^{t+1}, y^{t+2}, ... used in the CTC loss. In this network each attention step processes one character. Processing a paragraph line-by-line is also possible.]

The final step in the attention network is to decode the selected feature vector $s^t$ in order to produce the prediction related to the task at hand. The decoder network

$$y = \text{Decoder}(s^t, W_d) \tag{3.2.4}$$

with its parameter set $W_d$ is typically a multi-layer perceptron, see Section 2.3, since those networks are well suited to predictions on feature vectors. Another possibility would be to use a recurrent neural network or LSTM network as the decoder and treat each time step of the attention network as one time step of the RNN.

Attention networks consist of the three networks Encoder, Attention and Decoder (or two, without the encoder), each differentiable and combined in a way that allows the full attention network to be differentiable. This allows parameter optimization of $W_e$, $W_a$ and $W_d$ using backpropagation and gradient descent, see Section 2.3.

Paragraph Transcription

We have discussed attention networks on image data and seen that at each time step $t$, the attention mechanism selects a subset of pixels from the image and collapses the two spatial dimensions $I$ and $J$.
The attention network effectively transforms the two-dimensional image into a one-dimensional sequence of characters. We will now briefly discuss the details of a variant of this attention network used for line-wise paragraph transcription[8]. This network reads one text line, instead of one character, per attention step and is trainable by CTC. In the work[8], the attention mechanism is applied for a constant number of steps, each transcribing one text line, and all these text lines are concatenated to one sequence in order to apply CTC to the full text at once. The mechanism could also be modified to predict the end of the paragraph and to stop if this is the case.

The first step is to apply a hybrid MDLSTM+CNN encoder

\[ x' = \mathrm{Encoder}(x, W_e) \tag{3.2.5} \]

to the input image. This is done once in order to extract meaningful features from the image. The attention network is an MDLSTM network and

\[ a^t = \mathrm{Attention}(x', l^{t-1}, W_a) \tag{3.2.6} \]

estimates the position and extent of each text line. The last layer of the attention network is a linear layer and as such $a^t$ is a feature map with one unbounded scalar per pixel. Feature map $l^t$ is $a^t$ activated with a softmax activation function, applied to each pixel column, thus giving the probability for each pixel that it belongs to the current text line:

\[ l^t_{i,j} = \frac{\exp a^t_{i,j}}{\sum_{k}^{J} \exp a^t_{i,k}} \tag{3.2.7} \]

Indices $i$ and $j$ denote the column and row within the two-dimensional feature map. This type of softmax function allows reading one text line per attention step, but on the other hand limits the curvature of the text lines to a maximum of 45 degrees. This is the same limitation as in the work of this thesis. The softmax-activated attention map is then fed back into the attention network for the next step. The collapsing layer

\[ s^t_i = \sum_{j}^{J} l^t_{i,j} \, x'_{i,j} \tag{3.2.8} \]

reduces the two-dimensional feature map $x'$ to a one-dimensional sequence $s^t$ which denotes one text line.
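The column-wise softmax of Equation 3.2.7 and the collapsing layer of Equation 3.2.8 can be illustrated with a small sketch. This is not the implementation of the work[8]; it is a minimal pure-Python illustration in which the function names `column_softmax` and `collapse`, the nested-list layout (column index first, then row index) and the toy values are all assumptions made for this example.

```python
import math


def column_softmax(a):
    """Softmax over the rows j of each pixel column i (cf. Equation 3.2.7).

    a[i][j] is the unbounded attention score at column i and row j.
    """
    l = []
    for column in a:
        m = max(column)  # subtract the maximum for numerical stability
        exps = [math.exp(v - m) for v in column]
        total = sum(exps)
        l.append([e / total for e in exps])
    return l


def collapse(l, x_enc):
    """Attention-weighted sum over the rows of each column (cf. Equation 3.2.8).

    x_enc[i][j] is the encoded feature vector at column i and row j.
    Returns one feature vector per column, i.e. a one-dimensional sequence.
    """
    s = []
    for l_col, x_col in zip(l, x_enc):
        n_features = len(x_col[0])
        s.append([sum(w * feat[f] for w, feat in zip(l_col, x_col))
                  for f in range(n_features)])
    return s


# Toy example (hypothetical values): 2 columns, 3 rows, 2 features per pixel.
scores = [[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
features = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
            [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]]
weights = column_softmax(scores)
line_sequence = collapse(weights, features)
```

Because the softmax normalizes over the rows of each column, every column distributes a total weight of one over its rows, so each attention step selects roughly one vertical position per column.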
This process is repeated a constant number of times and the resulting collapsed text lines $s^1$ to $s^n$ are concatenated to one sequence $s$. The decoder network

\[ y = \mathrm{Decoder}(s, W_d) \tag{3.2.9} \]

is a bidirectional LSTM network that predicts the one-dimensional character sequence. CTC is applied to this sequence $y$.

Image to Sequence Techniques

Another method[135] for paragraph-wise offline handwriting recognition using attention networks was presented at the International Conference on Document Analysis and Recognition (ICDAR) 2021. This method[135] applies a ResNet[51] encoder followed by a self-attention decoder network to transform an image input to a one-dimensional sequence of labels. This inferred sequence may be of variable length and its end is signaled by a special token reserved for this task. This method was specifically designed for the transcription of tables and mathematical formulas.

This method shows good results when applied to paragraph-wise offline handwriting recognition. On the other hand, the authors report high runtime requirements for transcription. From the publication[135, p. 12]:

"Inferencing (sic) takes an average of 4.6 seconds on a single CPU thread for a set of images averaging 2500x2200 pixels, 456 chars and 11.65 lines without model compression i.e., model pruning, distillation or quantization."

Relation to this work

The work using attention-based paragraph transcription[8, 9] directly addresses the same problem as this work, that is, multi-line offline handwriting recognition without explicit line segmentation. We will directly compare the resulting transcriptions in Chapter 7. Newer works[135] will also be included, although in a shorter form, in this comparison of Chapter 7.

The main difference between the approaches using attention networks and the work of this thesis is the modeling of line transitions. In this work, the alignment for labeling multi-line text is interpreted as an inference problem over a two-dimensional pixel space.
This requires modeling line transitions as one label class and probabilistically assigning the line transition class to a connected path of pixels from left to right (in English handwriting) in order to separate two lines in pixel space. More on this in Chapters 5 and 6. This translates to an extension of the CTC approach to multi-dimensional sequences and spaces. It also makes a robust inference of the line separators necessary in order to correctly separate and transcribe individual text lines.

The attention-based approach to multi-line transcription does not require modeling the line separators. Instead, individual characters or lines are iterated one-by-one by moving the attention of the network at each time step. The attention only needs to be at the correct spatial position to transcribe the text covered by the current attention. A hard separation between text lines does not seem to be strictly required and overlaps between the attention focus of adjacent text lines seem to be possible.

Another difference lies in how the task is modeled. As we will see later (Chapters 5 and 6), the method proposed in this thesis is based on the idea of setting up an expectation-maximization loop in order to solve the sequence labeling problem for multi-dimensional spaces and label sequences. This allows modeling multi-line transcription in the loss function and decoding of a deep neural network. On the other hand, the attention-based multi-line transcription explicitly employs a specific DNN topology to solve this task. This means that the attention-based approach is confined to a very specific topology of deep neural networks. The work of this thesis on the other hand only has some general requirements on the DNN topology and even on the machine learning model employed. It can be used in combination with e.g. recurrent or convolutional DNNs, a combination thereof or possibly an ML model that is not a neural network at all.
Comparing the two approaches to the same task, the attention-based solution should be more robust when transcribing difficult lines since it does not rely on an explicit encoding of line separators. On the other hand, the work of this thesis is applicable to a variety of machine learning models as it is implemented in the loss/target function and not in the model itself.

3.3 Paragraph Transcription by Reshaping CNNs

Overview and Method

In this section we will briefly discuss a work[21] on paragraph-level offline handwriting recognition presented at the International Conference on Document Analysis and Recognition (ICDAR) 2021. The use case for this method is again to transcribe multi-line text from an image of a paragraph of handwritten text without prior segmentation into lines, words or characters.

This method achieves this by applying a convolutional neural network (CNN), see Section 2.3, to the presented image. Both the input image and estimated output of the CNN consist of two spatial dimensions, their height and width. This method proposes to reshape the CNN output, ordering 'pixel' rows in a single one-dimensional sequence, starting with the topmost row. As this reshaped output is now one-dimensional, connectionist temporal classification (CTC) can be applied to it for both training and decoding. Figure 3.3.1 illustrates this approach.

Figure 3.3.1: Convolutional Neural Network applied to paragraph-wise transcription by reshaping the CNN prediction. The CNN prediction of shape (batch-size × num. features × height × width) is reshaped to concatenate all pixel rows to form one sequence in a left-to-right and top-to-bottom fashion. [Diagram labels: image of handwritten text; CNN; tensor of shape (B x N x H x W); reshape; CTC training and decoding.]

The last layer of the CNN (before reshaping) estimates a soft-assignment, that is, a probabilistic assignment between 'pixels' of the CNN output and the alphabet in use, e.g. Latin glyphs for English handwritten texts.
Glyphs are exclusive to each other per pixel, which is achieved by applying a pixel-wise softmax function to this soft-assignment. The alphabet in use contains an additional glyph for distinguishing repetitions of the same glyph in adjacent characters. As such the output of this CNN is, in meaning, identical to the output of deep neural networks for CTC, see Section 3.1, and attention networks for paragraph-wise transcription, see Section 3.2, except that it contains two spatial dimensions instead of one.

This soft-assignment is reduced from two spatial dimensions to one spatial dimension by reordering the 'pixel' rows in a one-dimensional sequence. Connectionist temporal classification is then applied to this sequence. The original publication[21] contains details on the deep neural network topology and training method applied.

Relation to this work

The advantage of this method of reshaping the CNN output for multi-line offline handwriting recognition is that it is very easy to implement and employs connectionist temporal classification as both the loss function for training and the decoding algorithm during inference. Applying a CNN, not an RNN or LSTM network, makes inference very fast. The published paper[21] does not report the time required for inference, but the paper's author reported a low number of milliseconds in discussions on site at the ICDAR conference.

In comparison to multi-dimensional connectionist classification, proposed in this thesis, the CNN reshaping approach suffers from a disadvantage: the convolutional neural network transforms the presented input image, which is a pixel space with two spatial dimensions, to a lower-resolution soft-assignment, again with two spatial dimensions. Attribution between spatial positions in the predicted soft-assignment and input image is fixed according to the receptive field of the CNN. This yields a 1:1 assignment from rectangular areas in the presented image to characters in the transcription.
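The reshaping of the two-dimensional soft-assignment into a single sequence, as described in this section, can be sketched as follows. This is a simplified illustration and not the code of the publication[21]: the batch dimension is omitted, the prediction is a plain nested list of shape (features × height × width), and the function name `rows_to_sequence` is an assumption made for this example.

```python
def rows_to_sequence(prediction):
    """Concatenate the pixel rows of a (features x height x width) prediction.

    Returns a sequence of length height * width, topmost row first and
    left-to-right within each row. Each element is the feature vector
    (one value per glyph class) at that spatial position.
    """
    n_features = len(prediction)
    height = len(prediction[0])
    width = len(prediction[0][0])
    sequence = []
    for row in range(height):
        for col in range(width):
            sequence.append([prediction[f][row][col] for f in range(n_features)])
    return sequence


# Toy example (hypothetical values): 2 features, 2 rows, 3 columns.
# The values encode their own position to make the ordering visible.
toy_prediction = [
    [[0, 1, 2], [10, 11, 12]],            # feature map 0
    [[100, 101, 102], [110, 111, 112]],   # feature map 1
]
toy_sequence = rows_to_sequence(toy_prediction)
```

In this sketch the first three sequence elements stem from the top pixel row and the next three from the second row, so a one-dimensional method such as CTC sees the rows one after another.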
The next step reshapes the predicted soft-assignment from two to one spatial dimension by concatenation of the pixel rows in a top-down fashion. Altogether this means that this method encodes the assumption that text lines in the paragraph presented to the CNN are oriented roughly horizontally and are of roughly the same height.

The paper[21, p. 11] addresses this problem by oversampling the input image, meaning that the two-dimensional soft-assignment predicted by the CNN contains more pixel rows than the presented image contains text lines. This introduces flexibility for transcribing paragraphs with text lines of different heights and a variable number of text lines. The examples given in the publication also show transcription of slanted text lines. However, text lines can only be successfully transcribed using this method if they do not overlap in the estimated soft-assignment, i.e. each pixel row of the soft-assignment must be part of at most one text line. This limitation is introduced by the design of this method in reshaping the CNN output.

Multi-dimensional connectionist classification does not have such strict limitations on the size and orientation of text lines since it introduces a special token that separates text lines within the two-dimensional soft-assignment. This allows one pixel row of the soft-assignment to be part of multiple text lines.

Chapter 4

The Problem with Multi-Line Handwriting Recognition

4.1 Overview

This chapter is devoted to a discussion of the problems that arise when automatically transcribing multi-line paragraphs of handwritten text. Connectionist temporal classification (CTC)[46], see also Section 3.1, addresses the transcription of natural texts from one-dimensional inputs. CTC is a method for training and decoding a deep neural network in a way that estimates a one-dimensional sequence of labels, e.g.
glyphs of an alphabet, from an image of a single text line (offline handwriting recognition), a sequence of pen strokes (online handwriting recognition) or audio (voice recognition). All three input types can be represented as a one-dimensional sequence (left-to-right or beginning-to-end) and accordingly their respective transcribed output is always a one-dimensional sequence of labels. In this thesis, only offline handwriting recognition is of interest to us.

However, the transition from one- to two-dimensional inputs raises the question of why CTC cannot be directly applied to multi-line paragraphs of text and why solving the same problem for multi-line texts and multi-dimensional inputs is harder. The following Section 4.2 discusses these questions from a practical perspective based on examples of actual handwritten paragraphs. Section 4.3 touches on the computational difficulties in transcribing multi-line texts.

4.2 Segmentation of Handwritten Paragraphs

Overview

As mentioned in the previous Section 4.1, this section discusses the multi-line offline handwriting recognition problem from a practical perspective. To this end it employs examples from the IAM offline handwriting database[88] which are problematic for a 'classical' transcription pipeline.

Figure 4.2.1 shows one example paragraph from the IAM database. It consists of several lines of handwritten text in the English language. The lines are aligned in a neat horizontal fashion with similar line heights and similar spacing between lines. The cursive writing is uniform and was done using a pen that contrasts strongly with the background sheet of paper. No overlaps exist between adjacent words or adjacent text lines. As such, Figure 4.2.1 is a prototypical example of a very well written paragraph of cursive writing that is, presumably, easy to transcribe.
Transcription of this paragraph using CTC would entail a line-level segmentation, cutting the overall paragraph image into multiple smaller images, each containing exactly one complete text line. A deep neural network trained with connectionist temporal classification can then be applied to each individual text line image.

Figure 4.2.1: Example paragraph from the IAM offline handwriting database with non-overlapping, nearly straight horizontal text lines without character or word corrections and in high contrast.

Unfortunately not all handwritten paragraphs are in such an enabling layout. The following paragraphs will discuss problematic cases that favor multi-line transcription without explicit segmentation.

Problematic Segmentation

The main case for the multi-line transcription method proposed in this thesis is the transcription of overlapping text lines without explicit segmentation. Such a paragraph is shown in Figure 4.2.2. Overlaps between adjacent text lines and near-overlaps are marked in red and orange respectively. Near-overlaps are also marked since they may, depending on the line-segmentation algorithm in use, also be problematic.

True overlaps between adjacent text lines render the line-level segmentation especially hard since there is no clear corridor of background pixels between the two text lines. When e.g. applying a connected components algorithm, the overlapping glyphs of the two text lines would appear as one continuous glyph. Correctly separating these overlapping glyphs into multiple individual glyphs would require knowledge about the contained text in order to infer the prototypical shape of these glyphs. This is known as Sayre's paradox[117] or Sayre's knot which in summary states that:

Transcription of cursive text requires segmentation of it. Segmentation of cursive text requires transcription of it.

Figure 4.2.3 illustrates Sayre's knot with three overlapping characters.
Each square of the figure represents a pixel of an image and the coloring indicates the assignments of pixels to characters. As we can see, most pixels are part of the background and not part of any character. Many other pixels are part of exactly one character. However, some pixels are part of two characters at the same time. This example poses two segmentation problems at the same time: first, assigning the pixels to characters and second, deciding if pixels are part of multiple characters.

Figure 4.2.2: Example paragraph from the IAM offline handwriting database that shows multiple overlaps, marked in red, or near-overlaps, marked in orange, between adjacent text lines.

Figure 4.2.3: Sayre's knot in an example of three overlapping characters 'A', 'B' and 'C'. Red, blue and yellow pixels are part of exactly one of the three characters. Purple and orange pixels are part of two characters at the same time. Correctly assigning pixels to characters requires knowledge about the characters and their prototypical glyph shape. On the other hand, the pixel assignment is necessary for correctly identifying the glyphs.

Solving the segmentation problem shown in Figure 4.2.3 would require knowledge about the content of the pixel image, that is, knowledge of which glyphs are contained and in which order. Knowledge of the contained glyphs could be incorporated in a segmentation method in the form of prototypical shapes of glyphs. Unfortunately the contained glyphs are not known before transcription and segmentation happens before transcription, which creates Sayre's paradox.

This effect may occur at any level of segmentation, that is, while separating lines, words or characters. Figure 4.2.4 shows possible variants for a 'classic' transcription pipeline for handwritten paragraphs. Applying connectionist temporal classification entails line- or word-level segmentation, followed by transcription using CTC.
At any stage of segmentation, overlaps may occur and thus reduce the quality of the segmentation result. Degraded segmentation will negatively influence the following transcription and thus increase the overall transcription error. Since segmentation is done prior to transcription without feedback, this increase in error will persist.

Figure 4.2.4: Transcription pipeline for handwritten paragraphs based on prior segmentation on a line-, word- or character-level. Segmentation does not have to be sequential from one level to the next one. However, each segmentation step introduces a chance for errors, influencing the final transcription result. Possible sources of segmentation errors are indicated by the lightning symbols. The shown paragraph image is from the IAM offline handwriting database. [Diagram labels: line-level segmentation, word-level segmentation and character-level segmentation, each followed by transcription "A MOVE ...".]

This thesis proposes a transcription method for handwritten multi-line paragraphs that does not depend on prior line-, word- or character-level segmentation. Figure 4.2.5 illustrates this idea, in contrast to Figure 4.2.4. The idea of multi-dimensional connectionist classification (MDCC) is that segmentation prior to transcription is not necessary and instead, segmentation and transcription are two products of the same process. MDCC emphasizes transcription, but Section 6.7 will briefly discuss how MDCC could be modified to emphasize segmentation. Assuming that segmentation and transcription are two independent products of the same process and not processes that are dependent on each other effectively solves Sayre's paradox on a paragraph-level.

We state that MDCC solves Sayre's paradox on a paragraph-level since MDCC applies only to transcription within a paragraph. A scanned document may still, and often does, consist of multiple paragraphs with occasional figures and tables.
Analyzing such document structure and extracting individual paragraphs poses its own problems, which are not in the scope of this thesis. Figure 11.2.2 of Section 11.2 does however show one such example of a complex document layout.

Figure 4.2.5: Transcription pipeline proposed in this thesis. No explicit segmentation of the paragraph image is performed. The paragraph image is again from the IAM offline handwriting database. [Diagram label: transcription "A MOVE ...".]

Other Considerations

The previous section discussed problems with overlapping text lines while applying line-, word- or character-level segmentation followed by a transcription method. Addressing problems of overlapping text lines is the main reason for MDCC. However, the following paragraphs will briefly touch on other, more general, interesting cases that occur while transcribing handwritten texts. The following examples are based on knowledge of segmentation and transcription methods, as well as inspection of the IAM offline handwriting database. They serve to detail some of the problems that typically occur during offline handwriting recognition.

Figure 4.2.6 shows an example text where segmentation on a word-level is ambiguous. The bracketed numbers can be seen as stand-alone words or be assigned to the leading or trailing word. The choice between these three possibilities may well influence the following transcription since transcription methods in the form of recurrent neural networks contain implicit language models and some transcription methods even explicitly apply a language model. Word-level segmentation thus should prefer the segmentation which most closely matches the language model at hand. This effect, at least in the example of Figure 4.2.6, does not apply to paragraph- or line-level transcription since all words are contained in one single text line anyway and thus presented to the transcription method.
It also does not apply to character-level transcription since it does not contain a language model, implicit or explicit, anyway.

Figure 4.2.6: Example paragraph from the IAM offline handwriting database where correct word-level segmentation is ambiguous without knowledge of the transcribed text.

The example of Figure 4.2.7 contains one to three characters, marked in red, which are not clearly separable. Paragraph-, line- or word-level transcription should be applied since in these methods no character-level segmentation will be necessary and an (implicit) language model may be capable of distinguishing between the characters.

Figure 4.2.8 shows an example where the writer made a mistake and corrected it by striking through the wrong word and writing down the correct text. A character- or word-level transcription applied to this example may be erroneous, especially if the corrected word is treated as an individual word segment. Paragraph- or line-level transcription seems to be more applicable here, especially if such cases occur within the training data, since the segmentation or transcription method may ignore the corrected part of the line. Presenting only the stricken word to the transcription method may on the other hand generate erroneous results since a transcription method is designed to produce natural language text, even if the input image does not contain text.

An abstraction of offline handwriting recognition is the so-called identification of writer intention, which is transcription plus the assumption that a writer will occasionally make mistakes and write down a text that differs from the intended information. Figure 4.2.9 shows an example of this. The writer of this paragraph made a spelling mistake. The correct transcription of the marked word in terms of offline handwriting recognition is 'effektive', but in terms of identification of writer intention it is 'effective'.
Figure 4.2.7: Example paragraph from the IAM offline handwriting database with an ambiguous or corrected character marked in red.

Figure 4.2.8: Example paragraph from the IAM offline handwriting database with a word, marked in red, corrected by the writer.

Figure 4.2.9: Example paragraph from the IAM offline handwriting database where the writer made a spelling mistake. This is an example of 'identification of writer intention'.

Conclusion

This section detailed some exceptional cases that can occur in handwritten paragraphs of natural language. Of interest to this thesis are mainly potential errors that occur when applying line-level segmentation to paragraphs that contain overlapping text lines. Multi-dimensional connectionist classification is designed to transcribe whole paragraphs without prior segmentation in order to mitigate these problems. We show that MDCC is capable of solving Sayre's knot on a paragraph-level by treating segmentation and transcription as products of the same process. MDCC as proposed in this thesis emphasizes transcription. Section 6.7 does however briefly discuss ways to put emphasis on segmentation.

Other examples shown in this section concern more general difficulties in offline handwriting recognition. These are not directly addressed by MDCC. However, a paragraph- or line-level transcription is suitable for these examples.

4.3 Computational Considerations

Forward-Backward in Connectionist Temporal Classification

So far this chapter has detailed potential problems in line-, word- or character-level segmentation when applied to handwritten paragraphs. This section will discuss why the methodology of connectionist temporal classification (CTC)[46], namely forward-backward alignment, cannot be directly transferred to two-dimensional, and by extension to multi-dimensional, tasks.
We will again use an example from the IAM offline handwriting database[88] to illustrate the considerations of the following paragraphs.

Section 3.1 discussed the application of the forward-backward algorithm in CTC. CTC solves the sequence labeling task for one-dimensional sequences. This task is to transcribe a sequence of discrete labels from a one-dimensional input, or at least input that can be treated as one-dimensional. For example in offline handwriting recognition, an image of one text line is the one-dimensional input (processed left to right with the height collapsed) and the transcribed sequence is the sequence of characters contained in this line image. It is important to note that in sequence labeling, the transcribed label sequence is shorter than the input sequence. As such the assignment between transcribed labels and input positions is not known a priori. CTC solves this by applying forward-backward to infer the alignment between the transcribed sequence and input sequence during training of the deep neural network.

Connectionist Temporal Classification as a Conditional Random Field

Connectionist temporal classification successfully applies the forward-backward algorithm to infer the character alignment. This is possible since the underlying graph structure, when interpreting CTC as a graphical model, is an undirected chain. Figure 3.1.1 shows all paths while aligning the label sequence 'HELLO' over an observation of 12 time steps. In terms of a graphical model this translates to a chain of 12 nodes with 11 labels each. The transitions in Figure 3.1.1 indicate compatible node-label combinations within one node and in neighboring nodes. Each path thus represents one configuration of the graphical model which correctly decodes to the label sequence in question. Please see Section 2.2 for a discussion of graphical models in this context.
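The forward half of the forward-backward alignment described above can be sketched in a few lines. The following is a minimal illustration of the standard CTC forward recursion, not the implementation used in this thesis; the function name `ctc_forward`, the choice of index 0 for the glyph separator and the uniform toy prediction are assumptions made for this example. The extended label sequence for 'HELLO' has 2 · 5 + 1 = 11 entries, matching the 11 labels per node mentioned above.

```python
def ctc_forward(y, labels, blank=0):
    """Probability P(l|y) of a label sequence via the CTC forward recursion.

    y[t][k] is the predicted probability of label k at time step t.
    labels is the target sequence as a list of label indices (no separators).
    """
    # Extended sequence: separator, l1, separator, l2, ..., separator.
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    n_states, n_steps = len(ext), len(y)

    # Initialization: a path may start in the first separator or first label.
    alpha = [0.0] * n_states
    alpha[0] = y[0][ext[0]]
    if n_states > 1:
        alpha[1] = y[0][ext[1]]

    for t in range(1, n_steps):
        new = [0.0] * n_states
        for s in range(n_states):
            total = alpha[s]                 # stay in the same state
            if s >= 1:
                total += alpha[s - 1]        # advance by one state
            # Skipping a separator is only allowed between different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * y[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or in the trailing separator.
    return alpha[-1] + (alpha[-2] if n_states > 1 else 0.0)


# Toy example (hypothetical): uniform prediction over 12 time steps.
alphabet = ["ε", "H", "E", "L", "O"]
target = [alphabet.index(c) for c in "HELLO"]
uniform = [[1.0 / len(alphabet)] * len(alphabet) for _ in range(12)]
p_hello = ctc_forward(uniform, target)
```

The recursion visits each of the 11 states once per time step, so the cost grows linearly in the number of time steps and in the length of the label sequence, instead of enumerating every path.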
Figure 4.3.1: Interpretation of CTC as a chain-structured graphical model. The example is identical to that of Figure 3.1.1 but with the time steps of the DNN prediction as 12 nodes of the graphical model and the label sequence 'HELLO' as 11 discrete states of each node. [Axes: time steps of DNN prediction, t=1 to t=12; label sequence ε H ε E ε L ε L ε O ε.]

Figure 4.3.1 shows the interpretation of CTC as a chain-structured graphical model in the same example as that of Figure 3.1.1. We can show that this interpretation as a chain-structured graphical model is equivalent to the CTC formulation by recovering Equation 3.1.3, which defines the probability of observing a specific label sequence, from Equation 2.2.9, which defines the joint probability of a conditional random field (CRF).

Equation 3.1.3, with substitution of Equation 3.1.2, is as follows, with $l$ being the label sequence in question, $y$ the observed DNN prediction and $\pi$ being one configuration:

\[ P(l|y) = \sum_{\pi : F(\pi) = l} \prod_{t}^{T} y^t_{\pi_t} \tag{4.3.1} \]

Function $F(\pi)$ collapses a configuration $\pi$ to a label string by first converting repetitions of the same glyph to a single instance of the glyph, followed by removing all glyph separators. This is discussed in Section 3.1.

Equation 2.2.9 defines the joint probability of a CRF as follows, using $\pi$ as one configuration and $y$ as the observed DNN prediction:

\[ P(\pi|y) = \frac{1}{Z} \prod_{s} \psi_s(\pi_s|y^s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \tag{4.3.2} \]

Since the goal is to construct a chain-structured CRF, the neighborhood relation $s \sim t$ is defined as nodes $s$ and $t$ being two consecutive nodes within this chain. Each node of the chain is part of exactly two such neighborhood relations, one with its leading and one with its trailing neighbor.
The exceptions are the very first and very last nodes of the chain, each being part of only one neighborhood relation. $Z$ is a normalization factor, also called the partition function (German: Zustandssumme), which accumulates the unnormalized joint potentials over all possible configurations $\pi$. This normalization factor ensures that the joint probabilities of all configurations actually sum up to 1:

\[ Z = \sum_{\pi} \Big[ \prod_{s} \psi_s(\pi_s|y^s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \Big] \tag{4.3.3} \]

Marginalization over all CRF configurations $\pi$ that represent the label string $l$ yields the probability of observing this label string:

\[ P(l|y) = \sum_{\pi : F(\pi) = l} P(\pi|y) = \frac{1}{Z} \sum_{\pi : F(\pi) = l} \Big[ \prod_{s} \psi_s(\pi_s|y^s) \prod_{s \sim t} \psi_{s,t}(\pi_s, \pi_t) \Big] \tag{4.3.4} \]

We define the node potential function $\psi_s(\pi_s|y^s)$ of the CRF as the probability of observing character $\pi_s$ in time step $s$ according to the deep neural network prediction $y$:

\[ \psi_s(\pi_s|y^s) = y^s_{\pi_s} \tag{4.3.5} \]

The edge potential function $\psi_{s,t}(\pi_s, \pi_t)$ is defined as a constant, giving equal compatibility to all node-label combinations:

\[ \psi_{s,t}(\pi_s, \pi_t) = 1 \tag{4.3.6} \]

At this point the question arises whether this is a valid CRF representation of the CTC model since the edge potential of the CRF does not restrict the configurations to the specific label string $l$. Connectionist temporal classification is a loss function for optimizing a deep neural network towards predicting the label string $l$ matching the true transcription of the DNN input. As such there is the possibility that the DNN predicts many different label strings for different inputs. The goal of CTC is to maximize the probability of predicting the true label string given its corresponding input from the training data set. The original CTC formulation accounts for this fact by recognizing in Equations 3.1.2 and 3.1.3 that there are potential paths $\pi$ that do not correspond to the correct label string $l$. The CRF at hand is built on the same assumption.
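The node and edge potentials of Equations 4.3.5 and 4.3.6 can be checked on a toy example by brute force. The sketch below enumerates every configuration of a short chain, accumulates the normalization factor of Equation 4.3.3 and the marginal of Equation 4.3.4, and applies the collapsing function $F(\pi)$. It is only feasible for very small chains and is meant as an illustration; the function names, the choice of index 0 for the glyph separator and the toy prediction are assumptions made for this example.

```python
import itertools
import math


def collapse_path(path, blank=0):
    """F(pi): merge repeated labels, then remove glyph separators."""
    merged = [k for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
    return [k for k in merged if k != blank]


def crf_label_probability(y, labels, blank=0):
    """Brute-force P(l|y) and Z for a chain CRF.

    Node potentials are y[s][k]; all edge potentials are the constant 1,
    so they do not appear in the product.
    """
    n_labels = len(y[0])
    z = 0.0          # normalization factor over all configurations
    matching = 0.0   # unnormalized mass of configurations with F(pi) = l
    for path in itertools.product(range(n_labels), repeat=len(y)):
        weight = math.prod(y[s][k] for s, k in enumerate(path))
        z += weight
        if collapse_path(path, blank) == list(labels):
            matching += weight
    return matching / z, z


# Toy prediction (hypothetical): 3 time steps, 3 labels, each row sums to one
# as it would after a softmax layer.
toy_y = [[0.6, 0.3, 0.1],
         [0.2, 0.5, 0.3],
         [0.1, 0.2, 0.7]]
p_label, z_value = crf_label_probability(toy_y, [1, 2])
```

With every row of the toy prediction summing to one, the computed normalization factor comes out as one up to floating-point rounding, so the normalized and unnormalized marginals coincide in this example.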
Substituting $\psi_s(\pi_s|y^s) = y^s_{\pi_s}$ and $\psi_{s,t}(\pi_s, \pi_t) = 1$ yields the marginalization

\[ P(l|y) = \frac{1}{Z} \sum_{\pi : F(\pi) = l} \Big[ \prod_{s} y^s_{\pi_s} \Big] \tag{4.3.7} \]

for the probability of the DNN prediction $y$ encoding the truth label string $l$ with

\[ Z = \sum_{\pi} \prod_{s} y^s_{\pi_s} \tag{4.3.8} \]

being the normalization factor.

Section 3.1 briefly discusses deep neural network topologies as used for CTC. These topologies end with a collapse layer, followed by a softmax function that normalizes the glyph probabilities within each time step. The sum over all configurations $\pi$ therefore factorizes into a product of per-time-step sums, each of which equals one. This observation leads to the conclusion that the normalization factor $Z$ over all configurations $\pi$ always sums up to exactly one given that $y$ is predicted by such a DNN:

\[ Z = \sum_{\pi} \prod_{s} y^s_{\pi_s} = 1 \tag{4.3.9} \]

Substituting $Z = 1$ recovers the CTC formulation for $P(l|y)$ from the conditional random field joint probability of Equation 4.3.2. This formulation is identical to Equation 4.3.1 with the exception that the time dimension is indexed by symbol $s$ instead of $t$:

\[ P(l|y) = \sum_{\pi : F(\pi) = l} \prod_{s} y^s_{\pi_s} \tag{4.3.10} \]

Computational Complexity of Inference in Graphical Models

So far this section discussed how connectionist temporal classification applies the forward-backward algorithm for alignment of the truth label string and thus solves the sequence labeling task for one-dimensional sequences. This section also detailed the interpretation of CTC as a chain-structured conditional random field. This explains why forward-backward can be applied for exact inference while keeping computational complexity polynomial.

Belief propagation has been discussed in Section 2.2 and the forward-backward algorithm is a special case of BP. Forward-backward applies belief propagation in sum-product mode to chain-structured graphical models. Section 2.2 also refers to published literature[2, 15, 23, 80] on the computational complexity of inference in graphical models. Kevin Murphy[93, ch. 20.5] gives an overview of inference algorithms for graphical models and their respective restrictions.
He states that forward-backward is applicable for exact inference in chain-structured models and belief propagation in trees. Please note that belief propagation can in general be applied for inference in polytrees since the factor graph[34, 75] of a polytree is again a tree[7, ch. 8.4.3]. A polytree is a directed graph whose underlying undirected graph is acyclic. Colloquially speaking, if the directed edges of a polytree are replaced by undirected edges and duplicate edges are removed, the resulting graph does not contain any cycles. As such, a polytree is the least restrictive topology out of a chain, tree or polytree. Inference in general graphical models, that is directed or undirected and with cycles, is NP-hard[23] with exact inference being even #P-hard[111].

Loopy belief propagation (LBP)[34, 94][93, ch. 22] allows for approximate inference in general graphical models within polynomial time, given that a suitable convergence criterion is applied. The remainder of this section discusses why the sequence labeling task with multi-dimensional inputs and label sequences falls within these general graphical models and thus renders application of the forward-backward or belief propagation algorithms infeasible. Chapter 6 discusses how to apply a grid-structured CRF and LBP to this multi-dimensional sequence labeling task by proposing multi-dimensional connectionist classification (MDCC).

On Chain- and Grid-Structured Models

Figure 4.3.2: Chain-structured graphical model with 5 nodes.

The previous paragraphs of this section detailed the interpretation of connectionist temporal classification as a chain-structured conditional random field, the application of the forward-backward algorithm to chain-structured graphical models and the general computational limitations of inference in graphical models. Figure 4.3.2 shows a chain-structured graphical model with 5 nodes.
A discussion on the modeling of multi-line text in a graphical model follows in the next paragraphs.

Figure 4.3.3: Example of multi-line handwritten text. The matrix serves as a partial pixel grid with cells numbered 1 to 9. Cell 5 could be the beginning of e.g. an ‘a’, ‘g’ or ‘o’ glyph. Cell 6 e.g. an ‘o’, ‘c’, ‘g’. Cell 9 e.g. an ‘o’, ‘b’, ‘s’ with or without a new line on the bottom. Cell 8 e.g. an ‘o’, ‘c’, ‘u’ with or without a new line.

Figure 4.3.3 contains an extract from an IAMDB example with two text lines of three words each. A partial pixel grid was added as an overlay to show how the contents of neighboring cells influence each other in a cyclical fashion. In this example, the cells 5, 6, 8 and 9 influence each other and inferring the content of each cell is not reliably possible without looking at the other cells as well. Please see the caption of Figure 4.3.3 for possible cell contents.

This observation leads to the following two reasons why multi-line text necessitates a grid-structured model with 4- or 8-neighborhoods around each node instead of a chain-structured model as is the case for one-dimensional transcription using CTC:

1. Even without difficult-to-recognize examples, as in Figure 4.2.1, the modeling of multi-line text is inherently multi-dimensional. The horizontal spatial dimension roughly translates to the reading direction within one text line. The vertical dimension roughly translates to the ordering of multiple text lines. This is the case even for simple cases, but also holds true for more interesting cases such as slanted, curved or rotated text lines.

2. The example of Figure 4.3.3 shows that cyclical dependencies in the neighborhood around nodes of the graphical models do exist. We argue that correct inference of the alignment of multi-line text is not possible with a chain-structured graphical model, but instead necessitates a grid-structured model that allows for cyclical dependencies.
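The structural difference between a chain and a grid with 4-neighborhoods can be made concrete by counting edges. A connected undirected graph with $V$ nodes and $E$ edges is a tree exactly when $E = V - 1$, and $E - V + 1$ gives its number of independent cycles. The sketch below is illustrative only; the function names are assumptions introduced here and do not appear in the original work.

```python
def chain_edges(n):
    """Edges of a chain with n nodes: each node is linked to its successor."""
    return [(k, k + 1) for k in range(n - 1)]


def grid_edges(rows, cols):
    """Edges of a rows x cols grid in which every node has a 4-neighborhood."""
    edges = []
    for r in range(rows):
        for c in range(cols):
            node = r * cols + c
            if c + 1 < cols:
                edges.append((node, node + 1))     # right neighbor
            if r + 1 < rows:
                edges.append((node, node + cols))  # lower neighbor
    return edges


def independent_cycles(num_nodes, edges):
    """Cyclomatic number E - V + 1 of a connected undirected graph.

    A value of 0 means the graph is a tree, so exact belief propagation
    applies. Any positive value means the graph contains cycles.
    """
    return len(edges) - num_nodes + 1


print(independent_cycles(5, chain_edges(5)))       # chain with 5 nodes
print(independent_cycles(25, grid_edges(5, 5)))    # 5 by 5 grid
```

For the chain the count is zero, whereas the 5 by 5 grid has one independent cycle per unit square of the grid, which is why exact tree-based inference no longer applies.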
This reasoning leads us to multi-dimensional connectionist classification as discussed in Chapter 6. MDCC applies a grid-structured conditional random field to infer the alignment of multi-line text. An example of such a grid-structured graphical model is shown in Figure 4.3.4.

Figure 4.3.4: Grid-structured graphical model with 25 nodes in a 5 by 5 grid.

We can see from the example in Figure 4.3.4 that such a grid-structured graphical model is not a chain or tree. It also is not a polytree since the underlying graph structure is cyclic. As such neither the forward-backward algorithm nor belief propagation can be applied for inference in such a model. We propose to apply loopy belief propagation as an approximate inference method.

It would still be possible to apply variable elimination[162][93, ch. 20.3] or the junction tree algorithm[86, 131][93, ch. 20.4] as an inference method to a grid-structured model. Both methods are also discussed in other published literature[70]. Unfortunately the computational complexity of both algorithms grows exponentially with the tree-width of the graphical model. The tree-width of a grid with 4-neighborhoods and $N \times N$ nodes is $N$. The worst case for the tree-width in graphical models is that it is identical to the number of nodes in the model. Thus the choice of the variable elimination or junction tree algorithm for inference in grid-structured graphical models of multi-line text would quickly become infeasible, since the size of the grid structure depends on the number of pixels in the input image of the text. Loopy belief propagation on the other hand is an iterative algorithm that can be stopped as soon as sufficient convergence criteria are met.

Chapter 5

Decoding Algorithms for Multi-Line Text Recognition

Figure 5.0.1: Part of the pipeline discussed in this chapter. Left is the input, middle the estimated probabilities and right the decoded text.
The content of this chapter is based on the two publications on multi-dimensional connectionist classification:

Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Dissecting Multi-Line Handwriting for Multi-Dimensional Connectionist Classification.” In: 2019 15th IAPR International Conference on Document Analysis and Recognition (ICDAR). Sept. 2019. DOI: 10.1109/ICDAR.2019.00015

Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Multi-Dimensional Connectionist Classification: Reading Text in One Step.” In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). Apr. 2018, pp. 405–410. DOI: 10.1109/DAS.2018.36

Please see Section 1.3 for detailed information on the author's contribution.

5.1 Overview

The overall method and system for offline handwriting recognition proposed in this work can be split into two basic parts. The first stage consists of a deep neural network that takes the image of the handwritten text as input and estimates a probability distribution that soft-assigns each pixel¹ to one of the glyphs from the given alphabet. The second stage is a decoding algorithm that produces the most likely, or at least a likely, sequence of glyphs given the probability distribution from the first stage. See Figure 5.0.1 for an overview. Algorithmically these two stages are executed in this order. However, discussing the decoder stage first is more easily approachable since its output (a sequence of glyphs) is not abstract and is intuitively understood. This is why we will discuss the decoding algorithm of this work in the current chapter, with the discussion of the deep neural network and training algorithm following in Chapter 6.

¹ We will use the term ‘pixel’ in this context colloquially for a specific spatial position in the probability distribution and not strictly for a ‘picture element’ of an image.

Decoding is a problem well known in information and signal theory with applications in e.g.
telecommunication, speech recognition and of course handwriting recognition. We have previously discussed decoding algorithms for connectionist temporal classification, see Section 3.1. Decoding describes the problem of observing a time series of continuous signals, e.g. voltages on copper telecommunication lines or probability estimates from a deep neural network, and uncovering the sequence of discrete events that most likely led to this observation. A well known decoding algorithm for one-dimensional sequences is the Viterbi algorithm[33, 149]. In the case of offline handwriting recognition, the sequence of events is the sequence of glyphs actually written on the sheet of paper and captured by the camera.

Offline handwriting recognition systems that employ line-wise transcription of multi-line text require a line segmentation algorithm to be run beforehand in order to extract individual text lines and facilitate correct text transcription. These text line segmentation algorithms are based on features extracted from the image of handwritten text before applying the transcription method to the extracted lines. This chapter proposes a multi-line decoding algorithm for multi-dimensional connectionist classification. It employs a similar overall approach by first identifying and extracting text lines from the deep neural network prediction, converting these extracted lines to one-dimensional sequences of probability estimates and then decoding these using established decoding algorithms. The difference is that the proposed system does not extract lines from the original image of handwritten text but from the two-dimensional probability distribution estimated by the deep neural network. This probability distribution gives probabilities for both visible glyphs and the artificial line separator glyph.
This allows the proposed decoding algorithm to use information about the transcribed text and extract lines in such a way as to facilitate text line transcription with fewer errors.

At this point it is prudent to specify our terminology as there may be conflicting definitions in literature. In this chapter and Chapter 6 we will use the term glyph for one element from the alphabet in use, e.g. ‘l’ is a glyph. A character denotes a specific instance of a glyph within a text, e.g. ‘hello’ contains the glyph ‘l’ twice in two characters. The definition of a label in this context is identical to that of a glyph but also is a general term used in the sequence labeling task in machine learning.

In the following sections of this chapter we will first discuss the structure and properties of the two-dimensional probability distribution estimated by the deep neural network and how it encodes information about the included text and its spatial structure. We will then continue by outlining and discussing the proposed decoding algorithm, starting with the overall algorithm and then detailing the algorithmic parts for finding and extracting text lines, as well as decoding text lines to label sequences.

We will use images from the IAM offline handwriting database[88] as examples in this chapter. Please note that we will use handwritten text as examples in this section, but this work is applicable to multi-line text in general.

5.2 Structure of the Model Output

In the following section we will discuss the structure, properties and meaning of the two-dimensional probability distribution estimated by the deep neural network in multi-dimensional connectionist classification. As illustrated in Figure 5.0.1, this probability distribution is generated by the DNN using an image of handwritten text as input. Optimization of this DNN will be discussed in Chapter 6.

Mathematical Properties

Let $A$ be the alphabet in use.
It is a set consisting of the glyphs of the writing system, as well as an artificial glyph separator $\epsilon_g$ and an artificial line separator $\epsilon_l$. The glyph separator $\epsilon_g$ is used to differentiate between multiple adjacent occurrences of the same glyph in one text line and a single occurrence that stretches over multiple pixels. For example a sequence $a\,\epsilon_g\,a$ encodes ‘aa’ whereas $a\,a\,a$ encodes ‘a’ stretched over three pixels. The line separator $\epsilon_l$ encodes information about line breaks, specifically it indicates that the two pixels directly above and below belong to two different text lines.

Let us use $x \in [0, 255]^{d_x}$ as the grayscale input image with $d_x = \mathrm{Width}(x) \times \mathrm{Height}(x)$ pixels. The prediction $y \in [0, 1]^{d_y \times |A|}$ with $d_y$ pixels is then the soft-assignment of each pixel to one of the glyphs or separators from $A$. This soft-assignment is estimated by the deep neural network as $y = \mathrm{DNN}(x, W)$ in the case of transcribing text from an image, with $W$ being the parameters of the DNN. The number of pixels $d_y$ is likely smaller than $d_x$ because of subsampling, pooling or padding effects in the DNN or model in general. However, from a theoretical viewpoint it is sufficient to assume that each pixel in $x$ corresponds to exactly one pixel in $y$. When training the DNN, see Chapter 6, this soft-assignment will be estimated by a conditional random field (CRF), assumed to be the true alignment and as such used for supervised optimization of the DNN parameters.

The prediction $y$ is a probability distribution with two spatial dimensions (width and height) as well as one dimension for the glyph space of the alphabet. It gives the probability $y^s_g$ of a specific pixel $s$ being part of a specific glyph $g$ for every pixel and glyph. Since the glyphs are mutually exclusive (assuming the writer intended to write exactly one glyph per spatial position), the probabilities per pixel $s$ sum up to one.
This constraint

\[ \sum_{g \in A} y^s_g = 1, \quad \forall s \in [1, d_y] \tag{5.2.1} \]

is enforced by applying a pixel-wise softmax function to the last layer of the DNN.

We will later also use the notation $y^{(i,j)}_g$ in addition to $y^s_g$ in order to indicate a pixel at a specific position $(i, j)$. Here $i \in [1, I]$ indicates the horizontal position with $I = \mathrm{Width}(y)$ and $j \in [1, J]$ the vertical position with $J = \mathrm{Height}(y)$. $y^s_g$ is the short-hand notation for one spatial position $s$.

Semantic Meaning

In the section beforehand we have discussed the mathematical properties of the soft-assignment $y$. We will now discuss the meaning of the probabilities contained in it and how to interpret them in a way that facilitates transcription of text by decoding these probabilities.

In the following discussion, and in fact for the remainder of this thesis, we will assume that the text is written in a writing system that places characters from left to right within a line and lines from top to bottom on the page or paragraph. Adjusting the decoding algorithm described in this chapter and the training algorithm in Chapter 6 to other writing systems is a matter of adjusting these neighborhood relations.

This top-to-bottom and left-to-right writing system leads us to the first interpretation of the probabilities in $y$ since it also defines the ‘reading direction’ of the probability distribution. The first text line starts in the pixel in the top-left corner, text lines generally start at the leftmost pixel and the last text line ends in the bottom-right corner. Text lines are presented in the pixel space from top to bottom and characters within a text line from left to right. This directly reflects the writing system.

We will now discuss the meaning of the different glyphs from alphabet $A$, starting with the artificial glyphs $\epsilon_l$ and $\epsilon_g$ followed by the glyphs purposefully intended by the writer.
The line separator $\epsilon_l$ encodes transitions between text lines since this work aims at transcribing multi-line text without prior segmentation. The line separator $\epsilon_l$ is an artificial glyph of the alphabet, in the sense that the writer did not intend to write it on paper as part of the text, but it is necessary in order to correctly encode multi-line text in the probability distribution.

Let $y^{(i,j)}_{\epsilon_l}$ be the probability of pixel $(i, j)$ being part of a line separator $\epsilon_l$. A line separator $\epsilon_l$ encodes the semantic meaning that the pixel column $(i, j'), j' \in [1, j - 1]$ to the top belongs to a different text line or multiple text lines than the pixel column $(i, j'), j' \in [j + 1, J]$ to the bottom. Thus $y^{(i,j)}_{\epsilon_l}$ contains the probability that this line separation is in effect for pixel $(i, j)$ and the pixels above and below belong to different text lines. Two adjacent horizontal text lines are separated by a continuous chain of line separators $\epsilon_l$ ranging from the left border at $i = 1$ to the right border at $i = I$. Each line separator also ranges from the left to the right border and does not ‘merge’ with another line separator above or below, even if the text lines are not using up the full width. Instead the line separators range the full width, separated by at least one pixel in vertical direction, and text lines that are too short are filled up with spaces on the left or right side, according to their alignment.

Figure 5.2.1: DNN prediction of the line separator for one IAMDB example. Yellow color encodes high probabilities. This example shows perfectly horizontal lines, which may not always be the case as MDCC allows for line curvature or rotation up to 45 degrees.

Figure 5.2.1 illustrates this concept with an example DNN prediction for the line separator. High probability is encoded in yellow color and thus the image parts above and below each yellow line have a high probability of belonging to two different text lines.
Similar to the artificial line separator $\epsilon_l$ there exists an artificial glyph separator $\epsilon_g$. This glyph separator becomes necessary whenever two adjacent characters in the same text line are the same glyph from the alphabet. In this case the decoding algorithm needs a semantic pointer to differentiate between the two characters since one character is allowed to span several adjacent pixels. Differentiating between two adjacent characters with different glyphs is intrinsically modeled since there is an actual transition between different glyphs, but this is not the case for the same glyph in adjacent characters. Hence the need for a separator $\epsilon_g$ between these characters.

The glyph separator $\epsilon_g$ is placed whenever two adjacent characters are the same glyph. This allows differentiation between e.g. decoding a sequence $a\,a\,a$ to the string ‘a’ or a sequence $a\,\epsilon_g\,a\,a$ to the string ‘aa’. The probability $y^{(i,j)}_{\epsilon_g}$ in pixel $(i, j)$ gives the probability that the pixels $(i', j), i' \in [1, i - 1]$ left of the glyph separator belong to one or more characters different from the pixels $(i', j), i' \in [i + 1, I]$ to the right. Two characters of the same glyph are thus separated by a continuous chain of glyph separators $\epsilon_g$ ranging from the upper text line boundary, either the top of the pixel space or the line separator $\epsilon_l$ above, to the lower text line boundary. Figure 5.2.2 illustrates this by encoding high probabilities for the glyph separator in yellow.

Figure 5.2.2: DNN prediction of the glyph separator for one IAMDB example. Yellow color encodes high probabilities.

Glyph separators are only placed when necessary, in contrast to how they are handled in connectionist temporal classification (CTC). This is computationally useful when using a conditional random field (CRF) for estimating the alignment during training since it directly reduces the runtime required for inference in a CRF.

We will now continue the discussion with the glyphs intentionally placed by the writer.
Most of them are visible, but the space is not. Similarly to the glyph separator $\epsilon_g$, these ‘writer-intended glyphs’ are placed from left to right within a text line, specifically between the upper and lower text line borders. These borders may be the start or end of the vertical dimension in pixel space or the next occurrence of the line separator $\epsilon_l$. The probability for pixel $(i, j)$ being part of glyph $g$ is given by $y^{(i,j)}_g$ with glyph $g \in A \setminus \{\epsilon_l, \epsilon_g\}$.

Figure 5.2.3: DNN prediction of the space for one IAMDB example. Yellow color encodes high probabilities.

The first glyph in this part of the discussion is the space glyph, which in Latin writing systems is used to separate two adjacent words. In this work it is also required to fill up the alignment whenever a text line in pixel space is not using up the full horizontal extent of the pixel space. This can occur since the pixel space is a Euclidean space, which is rectangular when visualized, but not every text line necessarily has exactly this horizontal size in pixel space. This can also be used to restrict the alignment during the training process since we can define the text to be left- or right-aligned and thus only add the space to one of the two sides of the text lines. In both cases the space is an optional character in the text line and only used to fill remaining pixel space if necessary. During decoding it is suitable to remove leading or trailing spaces in each text line. Figure 5.2.3 shows the probabilities for space glyphs in an example where the text is left-aligned.

Figure 5.2.4: DNN prediction of the glyph ‘a’ for one IAMDB example. Yellow color encodes high probabilities.

Figure 5.2.5: DNN prediction of the glyph ‘e’ for one IAMDB example. Yellow color encodes high probabilities.

Visible glyphs are encoded in a similar fashion as the space glyph. The difference is that these are strictly only present in the alignment or prediction when intentionally written and not used to fill up remaining space.
Again these characters are boxed in by the upper and lower boundary of their respective text line in pixel space and to the left and right by their neighboring characters. Figure 5.2.4 shows an example for the glyph ‘a’ and Figure 5.2.5 for the glyph ‘e’.

5.3 Multi-Line Decoding

In this section we will begin discussing the actual multi-line decoding algorithm proposed and used in this work. The function of this decoding algorithm is to take the soft-assignment $y$ as estimated by the deep neural network (DNN), as well as the alphabet in use $A$, as input and produce the most likely, or at least a high-probability, readable string from it.

Likelihood of a Specific String

Let us start out by defining the likelihood of a specific string $l$ given the soft-assignment $y$. This in turn requires us to define the likelihood for one specific configuration $C$ first.

We use the word configuration in the same way as in graphical models of probability distributions, e.g. a Markov random field or conditional random field. It describes a hard-assignment of pixels to glyphs from the alphabet, meaning each pixel is assigned to exactly one glyph in a one-hot coding. Given a soft-assignment $y$ of spatial size $d_y = \mathrm{Width}(y) \times \mathrm{Height}(y)$, configurations $C \in A^{d_y}$ are accordingly hard-assignments.

Assuming that spatial positions $s \in [1, d_y]$ in predicting the soft-assignment $y$ are conditionally independent events, we can define the likelihood

\[ P(C \mid y) = \prod_{s}^{d_y} y^s_{C^s} \tag{5.3.1} \]

of a specific configuration $C$ being the hard-assignment generated out of the observed soft-assignment $y$. The assumption that the spatial positions $s$ in the soft-assignment $y$ are conditionally independent holds true if $y$ is predicted by a deep neural network where the last layer is not fed back to itself or the layers before[46, p.2]. If using a machine learning model other than a deep neural network for predicting $y$, we need to make sure that this assumption still holds true.
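Equation 5.3.1 is a plain product over all pixels, which is best evaluated in log-space to avoid numerical underflow. The sketch below is illustrative only. It assumes that $y$ is stored as an array of shape (height, width, number of glyphs) with 0-based indices and that a configuration is an integer array of glyph indices; the function names are assumptions introduced here.

```python
import numpy as np


def config_log_likelihood(y, config):
    """log P(C|y) = sum over pixels s of log y[s, C^s] (Equation 5.3.1)."""
    rows, cols = np.indices(config.shape)
    return float(np.log(y[rows, cols, config]).sum())


def most_likely_config(y):
    """Under conditional independence the per-pixel argmax maximizes P(C|y)."""
    return y.argmax(axis=2)


rng = np.random.default_rng(1)
logits = rng.normal(size=(3, 4, 5))                 # height 3, width 4, 5 glyphs
y = np.exp(logits) / np.exp(logits).sum(axis=2, keepdims=True)

best = most_likely_config(y)
other = rng.integers(0, 5, size=(3, 4))             # an arbitrary configuration
print(config_log_likelihood(y, best), config_log_likelihood(y, other))
```

Note that the per-pixel argmax only maximizes the likelihood of a single configuration. It does not by itself guarantee that the resulting hard-assignment forms valid text lines.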
From here we can define the likelihood of observing a specific label sequence $l$ given the soft-assignment $y$. There are many different ways of writing the same label sequence in a pixel space, e.g. coloring the paper using a pen, and these different ways of writing are different events that lead to the same outcome. Thus we can marginalize over different configurations $C$ in order to find the likelihood for a specific label string. The likelihood

\[ P(l \mid y) = \sum_{C} \prod_{s}^{d_y} y^s_{C^s} \prod_{t \in \mathrm{nbr}(s)} \alpha_{s,t}(C^s, C^t, l) \tag{5.3.2} \]

is dependent on the indicator function $\alpha$, which defines valid glyph neighborhood relations in pixel space. Function $\mathrm{nbr}(s)$ defines the 8 neighbors in pixel space around spatial position $s$. Function $\alpha$ is defined as:

\[ \alpha_{s,t}(C^s, C^t, l) = \begin{cases} 1 & \text{iff } C^s, C^t \text{ are valid neighbors in } s, t \text{ according to } l \\ 0 & \text{else} \end{cases} \tag{5.3.3} \]

Configuration positions $C^s$ and $C^t$ are valid neighbors if their indicated glyphs match the relations in the label sequence $l$, e.g. left-right and top-down relations are preserved. We will discuss these neighborhood relations in detail in Section 6.2, but for now it is sufficient to keep the discussed properties of Section 5.2 in mind.

Approach to Decoding

The likelihood $P(l \mid y)$ gives us the opportunity to define the prototypical decoding method for finding the most likely label sequence $l^\star$ given a soft-assignment $y$. This decoder

\[ l^\star = \mathrm{Decoder}(y) = \operatorname*{argmax}_{l} P(l \mid y) \tag{5.3.4} \]

simply selects the most likely label sequence. Sadly, Equation 5.3.2 prohibits such an approach since enumerating all configurations $C$ is computationally far too expensive. Assuming alphabet $A$ being a one-byte character set of 256 characters and a spatial resolution of 320 by 240, we arrive at $|A|^{320 \times 240} = 2^{8 \times 320 \times 240} = 2^{614400} \approx 6.7 \times 10^{184952}$ different configurations $C$. This number of configurations $C$ tells us that a simple maximum likelihood approach to decoding will not be computationally feasible, a pattern that we will see repeat in Section 6.4.
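The size of this configuration space can be reproduced with a few lines of arithmetic. The sketch below works with the base-10 logarithm instead of printing the full number; the function name is an assumption introduced for illustration.

```python
import math


def log10_num_configurations(alphabet_size, width, height):
    """log10 of |A|^(width * height), the number of hard-assignments C."""
    return width * height * math.log10(alphabet_size)


log10_n = log10_num_configurations(256, 320, 240)
exponent = math.floor(log10_n)
mantissa = 10 ** (log10_n - exponent)
print(f"about {mantissa:.1f}e{exponent} configurations")

# Cross-check with exact integer arithmetic: 256^76800 equals 2^614400,
# a number with 614401 binary digits.
bits = (256 ** (320 * 240)).bit_length()
print(bits)
```

Working with logarithms keeps the computation in ordinary floating-point range, while the exact integer is only used to confirm the power of two.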
Thus the need arises for a decoding algorithm that uses the semantic structure of multi-line text in order to quickly decode this soft-assignment $y$ to a likely string. As discussed above, multi-line text as approached in this work is organized in multiple distinct text lines that span roughly horizontally from left to right, and each text line can be a variable number of pixels in height. These text lines are separated by line separators $\epsilon_l$ in pixel space. Glyphs within a text line are organized from left to right, typically span the height of their text line in pixel space and likely span several pixels horizontally.

Armed with this knowledge we will derive a two-stage decoding algorithm that first identifies distinct text lines in pixel space and then proceeds to decode each text line. First identifying and extracting text lines makes this computationally feasible since it allows each text line to be dynamically collapsed to a one-dimensional sequence and then decoded with tried and tested algorithms such as Viterbi decoding.

This two-stage approach of finding and extracting text lines and then decoding each individually does resemble the ‘classic pipeline’ for offline handwriting recognition. The benefit of the approach of this work is that this two-stage process is done on the soft-assignment $y$, which is semantically closely related and relevant to the problem at hand, offline handwriting recognition. This is in contrast to the classical approach, which starts from semantically unrelated grayscale, often black-and-white, or color images.

Main Algorithm

The following paragraphs will discuss the algorithmic entry point of the multi-line decoding algorithm proposed in this work, as stated in Algorithm 5.3.1. The basic idea of this decoding algorithm is that text lines are organized from top to bottom, each spanning from left to right, and each one should be dynamically collapsed to a one-dimensional sequence in order to decode each text line.
To this end, the algorithm employs a scan-line approach with the scan-line starting at the very top of the pixel space and spanning from the left to the right border. The scan-line will then be moved from top to bottom, alternating between processing line separators $\epsilon_l$ and readable text lines. Text lines may span multiple pixels in height and thus each text line will be dynamically collapsed by summing its glyph probabilities per pixel column. This works since, as we will see later, for decoding each text line only the relative difference between the probabilities of any two glyphs is of importance, not their absolute probability. The scan-line is allowed to move downwards with varying speeds per pixel column in order to account for slanted or curved text lines. Figure 5.3.1 shows an example of this scan-line approach.

The limitation of this proposed multi-line decoding algorithm is that it is not able to correctly decode text lines that are curved by 45 degrees or more at one or more positions. In other words, each text line has to occupy exactly one interval per pixel column. A single text line may not occur in two or more distinct intervals in any given pixel column.

Algorithm 5.3.1 outlines the main function of the proposed multi-line decoding algorithm. Inputs are both the soft-assignment $y$ and the used alphabet $A$. The desired output is a (highly) likely sequence of glyphs that led to the observed soft-assignment, or by extension to the observed image of multi-line text.

Figure 5.3.1: Scan-line moving through a pixel space of width 5 and height 7 which contains 3 text lines. The ‘horizontal’ lines indicate the scan-line in different states.
Dashed lines signify states at the beginning of a text line and solid lines after a text line, either hitting a line separator or reaching the end of the pixel space. The downward arrows indicate the advancement of the scan-line with the numbers counting the individual movement operations in order.

Algorithm 5.3.1 Proposed Multi-Line Decoding Algorithm

    Input soft-assignment $y$ and alphabet $A$.
    $I = \mathrm{Width}(y)$
    $J = \mathrm{Height}(y)$
    Find line separators: $l = \mathrm{FindLineSeps}(y)$.
    Initialize the scan-line $s$: $s \in \mathbb{N}^{I}$ and set elements to 1.
    Initialize the resulting glyph sequence $r$ to empty: $r = \{\}$
    while $\min(s) \le J$ do
        Skip over the preceding line separator:
        for $i \in [1, I]$ do
            while $s_i \le J \wedge l^{(i,s_i)}$ do
                $s_i = s_i + 1$
            end while
        end for
        Initialize accumulated glyph probabilities $a$: $a \in \mathbb{R}^{I \times |A|}$ and set elements to 0.
        Accumulate the glyph probabilities of the current text line:
        for $i \in [1, I]$ do
            while $s_i \le J \wedge \neg l^{(i,s_i)}$ do
                $a^i_g = a^i_g + y^{(i,s_i)}_g, \forall g \in A$
                $s_i = s_i + 1$
            end while
        end for
        Set line separator probabilities to zero: $a^i_{\epsilon_l} = 0, \forall i \in [1, I]$.
        Decode and add the current text line to the sequence: $r = r + \{\epsilon_l\} + \mathrm{DecodeLine}(a)$
    end while
    Transform the glyph sequence to a readable string: Return $\mathrm{GlyphsToString}(r, A)$.

The decoding algorithm starts by applying function FindLineSeps to the soft-assignment $y$ in order to identify line separators in pixel space. The result is a two-dimensional matrix of the same spatial size as the soft-assignment, where each pixel that is part of a line separator is marked as a logical true. Whenever a position in $l$ is logical true, it means that the pixels in the pixel column above and the pixel column below belong to two distinct text lines. We will discuss two variants of the FindLineSeps function in Section 5.4.

The scan-line $s$ is a vector of integer numbers, giving the vertical position in each pixel column, with its number of components equal to the width of the soft-assignment $y$. The scan-line is initialized to start at the very top of the pixel space.
During decoding it will move incrementally downwards until it reaches the lower end of the pixel space, in which case the decoding algorithm stops.

Accumulator $a$ is reset for each new text line and is used to dynamically collapse the text line. It is a matrix of the size of the width of the soft-assignment by the number of glyphs in alphabet $A$ and is used to sum up the glyph probabilities while moving the scan-line through the text line in pixel space. Summing up the glyph probabilities while moving the scan-line through a text line may seem to give different weights to different pixel columns as the text line may be of varying height. Strictly speaking this is the case, but it does not influence the result since the line decoding algorithms of Section 5.5 compare the glyph probabilities relative to each other and all glyphs within one pixel column have the same weight as they are all the sum of the same pixels.

Individual text lines are decoded using the function DecodeLine, which we will discuss in two variants in Section 5.5. Its function is to take the accumulated probabilities $a$ and produce the likely glyph sequence that matches this observation. Both variants of this function are indeed identical to the ones proposed by connectionist temporal classification[43, 46] since they too decode a one-dimensional sequence of glyphs.

Parallelization using multi-threading can be implemented in this multi-line decoding algorithm in two different ways. First, multiple paragraphs can be decoded in parallel. This decoding algorithm does not depend on shared resources and thus can easily be applied to a batch of examples at the same time. Second, even within one example the decoding of individual lines can be done in parallel. This approach entails first applying the FindLineSeps function, followed by running the horizontal scan-line to accumulate the glyph probabilities per text line.
In contrast to a non-parallel implementation, the individual text lines are not decoded directly; instead their accumulated glyph probabilities are stored for later use. All text lines can then be decoded in parallel by applying the DecodeLine function in multiple threads. The decoded individual text lines are then combined into the final result of the multi-line decoding algorithm.

We will discuss the two variants of the function FindLineSeps in the following Section 5.4 and the two variants of the function DecodeLine in Section 5.5. The variants of both functions are usable in all combinations, yielding four different variants of the proposed decoding algorithm.

Converting Glyph Sequences to Strings

So far we have discussed the overall algorithm for decoding multi-line texts. Next we will discuss the function GlyphsToString, which serves three purposes:
• Allow individual glyphs to stretch over multiple adjacent pixels.
• Allow the same glyph to occur in two or more adjacent characters within the same text line.
• Map the decoded glyph sequence to a computer-processable string, e.g. encoded in UTF-8.

Each position in the decoded glyph sequence r roughly corresponds to one pixel column of one text line, that is a one-pixel-wide column over a small vertical interval, which has been dynamically collapsed. This means that a glyph that stretches over multiple pixels in width may produce repetitions of itself in the resulting sequence. Whether it does depends on the line decoding function, since one of its two variants, see Section 5.5, already deduplicates these repetitions. To make sure that this case is handled correctly, function GlyphsToString first deduplicates adjacent repetitions of the same glyph, e.g. the sequence ‘aaaaϵgϵgaabbbbbbbb’ is mapped to the new sequence ‘aϵgab’.

The next step is to allow for the same glyph in adjacent characters within the same text line.
As stated in Section 5.2, this is encoded by including the glyph separator ϵg in the sequence, which distinguishes the same glyph in multiple adjacent pixels from the same glyph in multiple adjacent characters. The above example sequence ‘aϵgab’ encodes the same glyph ‘a’ in two adjacent characters. This is signaled by the intermediate ϵg glyph. Since repetitions of the same glyph within adjacent pixels have already been handled, we can now safely omit the glyph separator ϵg. The example ‘aϵgab’ thus reduces to ‘aab’, which is the final glyph sequence.

The last step is to map the glyph sequence to a computer-processable string in the character encoding chosen by the user. This is a mapping from glyphs in alphabet A to the computer’s character encoding, which is applied to the decoded glyph sequence.

On the Runtime

At this point it is also prudent to discuss the computational runtime of the proposed Algorithm 5.3.1. We will use Bachmann-Landau notation[69] to this end. Specifically, we will employ the O notation (‘Big-O notation’), which defines an upper limit on the growth of a function.

Algorithm 5.3.1 employs a scan-line that spans horizontally and moves downwards, visiting each pixel exactly once and summing up the glyph probabilities of pixel columns within the same text line. This leads to an upper limit for the runtime of O(I × J × |A|). As before, I is the width and J the height of the pixel space, and |A| is the number of glyphs in the alphabet.

The referenced algorithm also calls function FindLineSeps once and function DecodeLine once per text line, of which there are at most half as many as pixel rows. Function GlyphsToString is called once, reducing and mapping the decoded glyph sequence. Since the length of this sequence is bounded by one glyph per pixel, this function also has a computational limit of O(I × J × |A|).
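The three steps of GlyphsToString described in this section can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in this work: the tokens GLYPH_SEP and LINE_SEP, the dictionary charmap and the choice to render the line separator as a newline (with leading and trailing newlines stripped) are hypothetical.

```python
from itertools import groupby

GLYPH_SEP = "<eg>"  # hypothetical token standing in for the glyph separator
LINE_SEP = "<el>"   # hypothetical token standing in for the line separator


def glyphs_to_string(glyphs, charmap):
    """Fold a decoded glyph sequence into a plain string.

    glyphs  -- list of glyph tokens, roughly one per collapsed pixel column
    charmap -- assumed mapping from glyph tokens to output characters
    """
    # Step 1: collapse adjacent repetitions of the same glyph.
    collapsed = [glyph for glyph, _ in groupby(glyphs)]
    # Step 2: drop glyph separators; they only marked character boundaries.
    kept = [glyph for glyph in collapsed if glyph != GLYPH_SEP]
    # Step 3: map glyphs to characters (assumption: line separator -> newline).
    text = "".join("\n" if g == LINE_SEP else charmap[g] for g in kept)
    return text.strip("\n")
```

Applied to the example sequence from the text, four ‘a’, two glyph separators, two ‘a’ and eight ‘b’, the sketch first yields ‘a’, separator, ‘a’, ‘b’ and then the string ‘aab’. Each step is a single pass over the sequence, in line with the linear bound in the sequence length given above.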
In total this leads us to an upper limit for the runtime of Algorithm 5.3.1 of O(I × J × |A|) + O(FindLineSeps) + J × O(DecodeLine).

5.4 Finding Lines

Maximum Individual Probability Variant

In Section 5.3 we have discussed the overall multi-line text decoding algorithm proposed in this work. It employs a two-stage approach in which line separators are identified and text lines extracted first, followed by decoding each text line. We will now discuss the first of two variants of the FindLineSeps function for finding line separators, and thus extracting lines.

This algorithm for finding line separators is based on the assumption that the predictor that generated the soft-assignment y does not make mistakes, or at least still predicts the line separator glyph ϵl with the highest probability wherever it is required in order to satisfy the neighborhood relations for the correct glyph sequence. Whenever the pixels above and below the current pixel in the same pixel column belong to two different text lines, the predictor is expected to predict the line separator in this position.

Algorithm 5.4.1 FindLineSeps: Max. Individual Probability Variant
  Input soft-assignment y.
  I = Width(y)
  J = Height(y)
  Initialize the resulting matrix with pixels marked as line separators: l ∈ {true, false}^{I×J} and set elements to false.
  Mark pixels where the line separator has the highest probability:
  for i ∈ [1, I] do
    for j ∈ [1, J] do
      if argmax_g y^{(i,j)}_g ≡ ϵ_l then
        l^{(i,j)} = true
      end if
    end for
  end for
  Finished, return the marked line separators: Return l.

Algorithm 5.4.1 outlines the algorithm for finding text lines based on this assumption. It initializes a matrix with one position per pixel in soft-assignment y, each position marking whether that pixel is a line separator, and sets all positions to logical false.
After that the algorithm simply iterates over all pixels, identifies the highest-probability glyph for each pixel and marks the pixel as a line separator if that glyph is the line separator ϵl. In terms of computational runtime, this leads to an upper limit of O(I × J × |A|) for this algorithm.

The benefit of this algorithm is that it is very fast. On the other hand, it is not robust against noise in the soft-assignment y, or whenever the correct line separators are not predicted with a high probability, at least not as a continuous line across the full width from the left to the right border. The main decoding algorithm as outlined in Section 5.3 is based on a scan-line that spans from the left to the right border and moves from top to bottom. It alternates between skipping over the line separator preceding a text line and decoding the text line while moving the scan-line through it. Gaps in the predicted line separators, as can happen with this line-finding algorithm, will then result in two or more text lines being merged.

Figure 5.4.1 illustrates this problem. The example contains a paragraph of three text lines, but the line separator between the first two lines has a gap in it. Step A shows how the scan-line moves through the first decoded text line and merges parts of the first two true text lines. Step B continues this merging of adjacent text lines, since the scan-line is now offset by one in comparison to the true text lines. Step C then misses part of the last text line in the area where the original gap exists.

Noise in the soft-assignment will result in similar effects. A single pixel where the line separator is randomly predicted with the highest probability will lead to the scan-line switching between lines and thus introduce an offset in comparison to the true text lines. We will discuss a second line-finding algorithm that solves these problems in the next few pages.
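As an illustration of the maximum individual probability variant, a minimal Python sketch could look as follows. The function name, the nested-list layout y[i][j][g] and the parameter line_sep (the assumed index of the line separator glyph) are hypothetical choices made for this example.

```python
def find_line_seps_argmax(y, line_sep=0):
    """Mark every pixel whose most probable glyph is the line separator.

    y[i][j][g] -- probability of glyph g at column i, row j (assumed layout)
    line_sep   -- assumed index of the line separator glyph
    Returns a mask of the same spatial size with True at separator pixels.
    """
    mask = []
    for column in y:
        mask_column = []
        for probs in column:
            # Index of the highest-probability glyph; ties go to the lowest index.
            best = max(range(len(probs)), key=probs.__getitem__)
            mask_column.append(best == line_sep)
        mask.append(mask_column)
    return mask
```

Because each pixel is judged in isolation, one pixel column in which the separator is narrowly outvoted already produces the kind of gap described above.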
Figure 5.4.1 (panels A, B, C): Merging of text lines while moving the scan-line downwards in case of gaps in the predicted line separators.

Continuous Separators Variant

As we have seen before, Algorithm 5.4.1 runs into problems when identifying lines in the face of random noise or gaps in the prediction of the true line separator. This happens because only the individual pixel, and no context information, is used to decide whether the pixel is a line separator.

We have discussed the semantic structure of the soft-assignment in Section 5.2. Line separators are expected to form a continuous line from the left to the right border. There should also be at least one pixel of vertical space between two line separators so that a text line actually fits in between. The first property disallows gaps in the line separator and limits the influence of noise. The second property is important because of the scan-line behavior.

Figure 5.4.2 (panels A, B, C): Merged line separators introduce an offset into the scan-line.

Figure 5.4.2 illustrates the particular problem of merged line separators. The middle text line is shorter than the full width, and thus the line separator between the first and second text lines merges with the one between the second and third text lines. We can see that the first text line is decoded correctly in step A, but the second and third text lines are partially merged in step B because there is an offset in the scan-line. Step C then again misses parts of the last text line.

We can identify three properties of a line-finding algorithm that works well with the overall decoding algorithm as outlined in Algorithm 5.3.1:
1. Random noise should be ignored as far as possible.
2. Line separators should be a continuous line from the left to the right border.
3. Two line separators should be at least one vertical pixel apart to allow for a text line in between.
The second and third properties can be expressed in a consistent manner: each pixel column of the soft-assignment should contain exactly the same number of line separators as the other pixel columns in the same soft-assignment. This prevents the introduction of offsets into the scan-line.

With this in mind we can start to derive an algorithm for finding continuous line separators from the left to the right border that have no gaps, are at least one vertical pixel apart from each other and are robust against noise. The basic idea is to find the probability for drawing a continuous line separator from a specific vertical position on the left border to a specific vertical position on the right border, as well as marking the path in between that leads to this probability. For this we define a continuous line as one in which each included pixel is offset from its neighbors by exactly one pixel in horizontal direction and by at most one pixel in vertical direction. This allows for curved line separator strokes with a local slope between minus 45 and plus 45 degrees.

The algorithm should identify the best line separator stroke from every starting point on the left border to every end point on the right border, together with the path in between. It then picks the highest-probability ones and adds them to the result as long as the conditions for line separators are not violated.

We observe that the line separator candidates starting in the same vertical position on the left can be calculated using a dynamic programming approach in a tableau. The line candidate probabilities follow the recursive formulation

  P(c^{(i,j)} \mid y, c^{(i-1,j\pm 1)}) = y^{(i,j)}_{\epsilon_l} \times \max\big( P(c^{(i-1,j-1)} \mid \ldots),\ P(c^{(i-1,j)} \mid \ldots),\ P(c^{(i-1,j+1)} \mid \ldots) \big)    (5.4.1)

where the candidate probability P(c^{(i,j)} | ...) depends on its predecessors to the left, namely c^{(i−1,j−1)}, c^{(i−1,j)} and c^{(i−1,j+1)}. Index i − 1 refers to the pixel column to the left, j ± 1 to the pixel rows above and below.
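The recursion of Equation 5.4.1 for a single starting position can be sketched as a small dynamic program. This is a sketch only: the function name separator_tableau, the layout sep_prob[i][j] (line separator probability per column and row, 0-based) and the treatment of predecessors outside the pixel space, which are simply skipped, are assumptions of this example.

```python
def separator_tableau(sep_prob, start):
    """Fill one tableau of line separator probabilities (Equation 5.4.1).

    sep_prob[i][j] -- line separator probability at column i, row j (assumed)
    start          -- row on the left border where the separator starts
    Returns t with t[i][j] = probability of the best continuous separator
    from (0, start) to (i, j); unreachable positions stay at 0.
    """
    width, height = len(sep_prob), len(sep_prob[0])
    t = [[0.0] * height for _ in range(width)]
    t[0][start] = sep_prob[0][start]
    for i in range(1, width):
        for j in range(height):
            # Best predecessor in the column to the left, at most one row away.
            # Assumption: predecessors outside the pixel space are ignored.
            best = max(t[i - 1][k] for k in (j - 1, j, j + 1) if 0 <= k < height)
            t[i][j] = best * sep_prob[i][j]
    return t
```

Positions that cannot be reached from the starting point under the one-pixel vertical step limit keep the value zero.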
Figure 5.4.3: Tableau of line separator probabilities starting from the third position on the left. Higher saturation encodes higher probability. Blank areas are not reachable.

Figure 5.4.3 shows one such tableau, giving the line separator candidates that start at the third vertical position on the left and end on the right border. The line separators with a higher saturation have a higher probability. We can later follow a line separator candidate in reverse from right to left in order to enter it into the result.

The overall algorithm for finding continuous line separators is shown in Algorithm 5.4.2. It again starts by initializing the result structure, followed by computing the tableaus of line separator probabilities and their line separator candidates. These candidates are then processed in order of descending probability and entered into the result structure if viable.

Algorithm 5.4.2 relies heavily on two functions, one for computing the tableaus of line separator probabilities in the first place and a second for tracing each line separator candidate backwards and inserting it into the result structure. Algorithm 5.4.3 outlines the first one. It generates one tableau for every starting position s on the left border by using the line separator probability y^{(1,s)}_{ϵ_l} from the soft-assignment as the line separator probability in that single pixel. It then extends the tableau to the right by applying a dynamic programming approach based on the recursive formulation from Equation 5.4.1.

Algorithm 5.4.2 FindLineSeps: Continuous Separators Variant
  Input soft-assignment y.
  I = Width(y)
  J = Height(y)
  Initialize the resulting matrix with pixels marked as line separators: l ∈ {true, false}^{I×J} and set elements to false.
  Produce tableaus of line separator candidates: C = CandTableaus(y)
  Sort by descending probability: C = SortByProb(C)
  Add as many candidates as viable:
  while C ≠ ∅ do
    Retrieve the highest-probability separator candidate: (p, t, s, e) = Pop(C)
    Trace it backwards and test if it is viable: c = TraceSeparator(l, p, t, s, e)
    if c ≠ ∅ then
      Accept the candidate into the result:
      for i ∈ [1, I] do
        j = c_i
        l^{(i,j)} = true
      end for
    end if
  end while

The second of the two missing functions from Algorithm 5.4.2 is shown in Algorithm 5.4.4, which takes a separator candidate and traces it backwards from right to left so that it can be entered into the result if viable. For this, each line separator candidate consists of the tableau with the separator probabilities, as exemplified in Figure 5.4.3, as well as its starting and ending positions on the left and right border, respectively. The function follows the maximum probability within the tableau from the right border to the left one while testing for each pixel whether it is still valid as a line separator according to Algorithm 5.4.5. If the line separator candidate is no longer viable, it is skipped and the next-highest-probability candidate is processed by Algorithm 5.4.2.

The process outlined by Algorithms 5.4.2, 5.4.4 and 5.4.5 is greedy, entering as many line separators into the result as are viable. In some cases this may be more or fewer than the true number of line separators in the example to be decoded. It is to be expected that a well-trained deep neural network, or model in general, generating the soft-assignment y will produce the correct line separators, although possibly with low-probability gaps in them. This is the case for which this algorithm was designed.

There are limitations to the line separators that the algorithm discussed in this section can detect, most importantly that they cannot touch and must be separated by at least one pixel in vertical direction.
These constraints also limit false line separators generated by noise: such candidates are unlikely to be accepted, since the high-probability true line separators have already been entered into the result. Optionally, a lower limit on the line separator probability could be enforced in order to discard line separator candidates with large gaps or high noise.

Still, there is a chance that this algorithm will find false line separators as a result of flukes in the model or random noise. The same is true for the line detection outlined in Algorithm 5.4.1.

Algorithm 5.4.3 CandTableaus: Produce tableaus of line separator candidates
  Input soft-assignment y.
  I = Width(y)
  J = Height(y)
  Initialize the set of line separator candidates: C = {}
  Iterate starts on the left border:
  for s ∈ [1, J] do
    Initialize a new empty tableau: t ∈ ℝ^{I×J} and set elements to 0.
    Set beginning line separator probability: t^{(1,s)} = y^{(1,s)}_{ϵ_l}
    Increment to form continuous separators to the opposing border:
    for i ∈ [2, I] do
      for j ∈ [1, J] do
        Preceding line separator probability: p = max(t^{(i−1,j−1)}, t^{(i−1,j)}, t^{(i−1,j+1)})
        Probability for a line separator at the current position: t^{(i,j)} = p × y^{(i,j)}_{ϵ_l}
      end for
    end for
    Iterate ends on the right border and store candidates:
    for e ∈ [1, J] do
      p = t^{(I,e)}
      if p > 0 then
        C = C ∪ (p, t, s, e)
      end if
    end for
  end for
  Finished, return the line separator candidates: Return C.

Algorithm 5.4.4 TraceSeparator: Trace a separator candidate backwards
  Input separator matrix l.
  Input separator candidate (p, t, s, e).
  I = Width(l)
  J = Height(l)
  Initialize list of resulting vertical coordinates: c = {}
  Are the start and end still viable?
  if IsSeparatorOkay(l, 1, s) ∧ IsSeparatorOkay(l, I, e) then
    Add the right end to the separator trace: c = c + e
    Last seen vertical position in the separator: j^− = e
    Iterate backwards from right to left border:
    for a ∈ [1, I − 1] do
      i = I − a
      Find next vertical position j^+ with the highest probability: j^+ = argmax_j (t^{(i,j^−−1)}, t^{(i,j^−)}, t^{(i,j^−+1)})
      Test if this pixel is viable as a separator:
      if IsSeparatorOkay(l, i, j^+) then
        Use this position as the next step in the line separator: c = c + j^+
        j^− = j^+
      else
        Separator candidate is not viable anymore: Return ∅.
      end if
    end for
    Reverse the order of the trace and finish: Return ReverseOrder(c).
  end if
  This candidate is not viable anymore: Return ∅.

Algorithm 5.4.5 IsSeparatorOkay: Can this pixel be a separator?
  Input separator matrix l.
  Input coordinates i, j.
  if l^{(i,j)} ≡ true ∨ l^{(i,j−1)} ≡ true ∨ l^{(i,j+1)} ≡ true then
    Return logical false.
  end if
  Return logical true.

On the Runtime

In the case of Algorithm 5.4.2 it is prudent to carry out the runtime analysis backwards. Function IsSeparatorOkay simply has a runtime of O(1). Function TraceSeparator in Algorithm 5.4.4 follows a line separator candidate backwards, visiting each pixel column once and identifying the highest-probability predecessor in each column. Since at most three predecessors are tested in each column, this reduces to a total runtime of O(I) for width I.

Function CandTableaus generates one tableau of line separator candidates for each of the J pixels on the left border, and each tableau requires visiting every pixel once. This results in a total runtime of O(I × J²) for this function, with I being the width and J the height of the pixel space.

The overall function FindLineSeps as described in Algorithm 5.4.2 calls function CandTableaus once and then sorts the resulting line separator candidates, of which there are at most J².
Comparison-based sorting has a computational complexity of O(n × log(n)), which in this case is O(J² × log(J²)); this reduces to O(J² × 2 × log(J)) for J > 0 and thus to O(J² × log(J)). The upper limit for the number of line separators actually entered into the result is half the height J, respecting the condition that two line separators must be separated by at least one pixel in vertical direction. This means that the function TraceSeparator is called at most J/2 times. In total the runtime of Algorithm 5.4.2 is thus O(I × J² + J² × log(J) + J × I), which reduces to O(I × J² + J² × log(J)).

5.5 Decoding Lines

Preface

Sections 5.3 and 5.4 discuss and outline the algorithm for decoding multi-line texts and for identifying and extracting individual lines within a paragraph as one step of the decoding. The missing piece needed to complete this multi-line decoding algorithm is decoding the individual lines, that is, reading a high-likelihood glyph sequence from the probabilistic model output.

Algorithm 5.3.1 employs a scan-line spanning from the left to the right border in order to process each line sequentially and dynamically collapse it to a one-dimensional sequence of glyph probabilities. This in turn means that decoding each text line is the same decoding problem as in connectionist temporal classification (CTC)[43, 46]. The work discussed in this thesis thus employs the one-dimensional decoding algorithms from CTC in order to decode individual text lines.

Algorithms 5.5.1, 5.5.2, 5.5.3 and 5.5.4 as discussed in this section are the author’s specific implementation of the algorithms proposed in the CTC[43, 46] publications. Further ideas on how to improve decoding algorithms for one-dimensional sequences are discussed in Sections 10.2 and 10.3.

Best Path Variant

The first of the two algorithms for decoding one-dimensional text lines is named best path decoding and is outlined in Algorithm 5.5.1.
Recalling Figure 3.1.1, we observe that there are multiple different ways to align a glyph sequence over a time series if the time series is longer than the glyph sequence. This allows for variability, e.g. for the translation of glyphs in pixel space (or in time steps in one-dimensional problems, see footnote 2) or for glyphs that span multiple pixels. A ‘path’ in this context, or a ‘configuration’ in terms of this work and graphical models in general, refers to one single chain of glyphs with one glyph per time step, or visually to one path from left to right in Figure 3.1.1.

Footnote 2: In this context, the terms ‘time step’ in one-dimensional decoding and ‘pixel’ in the dynamically collapsed, accumulated glyph probabilities are synonyms.

Best path decoding as shown in Algorithm 5.5.1 is based on the assumption that the highest-probability configuration C also represents the true glyph sequence. This holds true for perfect estimators of the soft-assignment y, which would produce a one-hot encoding at each time step with the true glyph having a probability of one and all others zero. The overall probability for one configuration is given by

  P(C \mid a) = \prod_i a^i_{C_i}    (5.5.1)

with C_i being the glyph in configuration C at time step i and a being the one-dimensional sequence of glyph probabilities as accumulated by Algorithm 5.3.1.

This assumption no longer holds if the accumulated glyph probabilities a contain time steps where the true glyphs are predicted with a low probability or where there is a general ambiguity of glyphs.

Algorithm 5.5.1 DecodeLine: Best Path Variant
  Input accumulated glyph probabilities a.
  I = Width(a)
  Initialize the result sequence: s = {}
  Collect the maximum probability glyph per pixel:
  for i ∈ [1, I] do
    Find the glyph in this pixel: g⋆ = argmax_g a^i_g
    Append the glyph to the sequence: s = s + g⋆
  end for
  Finished, return the glyph sequence: Return s.

Algorithm 5.5.1 then outlines the first variant of the DecodeLine function in Algorithm 5.3.1.
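Best path decoding and the configuration probability of Equation 5.5.1 are short enough to sketch directly. The function names and the list-of-lists layout a[i][g] with integer glyph indices are assumptions of this example.

```python
def decode_line_best_path(a):
    """Pick the highest-valued glyph index in every time step.

    a[i][g] -- accumulated value of glyph g at time step i (assumed layout)
    """
    return [max(range(len(col)), key=col.__getitem__) for col in a]


def configuration_probability(a, config):
    """Product over time steps of the values of the chosen glyphs (Eq. 5.5.1)."""
    p = 1.0
    for col, glyph in zip(a, config):
        p *= col[glyph]
    return p
```

For instance, with a = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.5, 0.25]] the best path is [0, 1, 1] and its configuration probability is 0.5 × 0.5 × 0.5 = 0.125. The repeated glyph 1 is left in place here; in this sketch deduplication is left to GlyphsToString.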
Best path decoding handles a single text line by identifying the highest-probability glyph per time step and appending it to the decoded glyph sequence.

Beam Search Variant

Again recalling Figure 3.1.1 and the discussion in Section 5.3, we observe that there are multiple configurations that represent the same glyph sequence when accounting for e.g. repetitions of the same glyph. We can use Beam Search[106] as a heuristic to uncover the most likely glyph sequence, accounting for its different configurations.

Since different configurations that fold to the same glyph sequence are mutually exclusive events, we can write the likelihood for a specific glyph sequence t

  P(t \mid a) = \sum_{C:\ \mathrm{GlyphsToString}(C) \equiv t} \prod_i a^i_{C_i}    (5.5.2)

as the sum of the likelihoods of the configurations C that map to it using function GlyphsToString. We can use this fact to heuristically decode a high-likelihood glyph sequence by building a trie of already known glyph sequences, the prefixes of the full decoded sequence, and incrementally appending further glyphs to the prefixes with the highest likelihood. Low-likelihood prefixes are discarded along the way in order to reduce the computational effort, which is what makes this decoding algorithm a heuristic.

The decoding algorithm outlined by Algorithm 5.5.2 is based on the idea of building a prefix trie, initialized with the empty sequence, that contains the top-n most likely sequences, and then incrementally appending to those until all time steps of the input soft-assignment are processed. Prefixes that do not fall within the top-n most likely ones per time step are discarded. Finding the best prefixes within the current trie is implemented using a heap, organized by the likelihood of each prefix. In the context of beam search, the number n of most likely prefixes that are processed in each time step is also called the ‘beam width’.
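The pruning idea can be illustrated with a simplified sketch that replaces the prefix trie and heap by a plain dictionary from deduplicated prefixes to likelihoods. This substitution, the function name decode_line_beam and the use of integer glyph indices are assumptions of the example; it is not the trie-based Algorithm 5.5.2 itself, and it treats the glyph separator like any other glyph.

```python
def decode_line_beam(a, beam_width):
    """Simplified beam search over deduplicated glyph prefixes.

    a[i][g]    -- accumulated value of glyph g at time step i (assumed layout)
    beam_width -- number of prefixes that are extended per time step
    """
    beams = {(): 1.0}  # start with only the empty sequence
    for col in a:
        # Keep only the most likely prefixes of the previous time step.
        top = sorted(beams.items(), key=lambda kv: kv[1], reverse=True)[:beam_width]
        nxt = {}
        for prefix, p_last in top:
            for glyph, p_glyph in enumerate(col):
                p = p_last * p_glyph
                if p <= 0:
                    continue
                # A repeated glyph folds into the prefix instead of extending it.
                if prefix and prefix[-1] == glyph:
                    seq = prefix
                else:
                    seq = prefix + (glyph,)
                # Identical sequences are merged by summing their likelihoods.
                nxt[seq] = nxt.get(seq, 0.0) + p
        beams = nxt
    if not beams:
        return []
    return list(max(beams.items(), key=lambda kv: kv[1])[0])
```

For a = [[0.4, 0.6, 0.0], [0.4, 0.3, 0.3], [0.4, 0.0, 0.6]], a beam width of 3 yields [1, 2], because the two configurations (1, 1, 2) and (1, 2, 2) sum to 0.216 and outweigh the single best path (1, 0, 2) with 0.144. A beam width of 1 prunes those configurations early and returns [1, 0, 2], which shows the heuristic nature of the search.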
Processing the top-n prefixes at each time step requires that the likelihood of each prefix is kept up to date and consistent, so that the likelihoods of multiple prefixes can be compared. This requires that we respect the rules for folding glyph sequences, implemented by function GlyphsToString, at every time step, which in turn requires applying Equation 5.5.2 and these folding rules to the prefixes in the trie. This leads to two rules for incrementally adding glyphs to the prefixes in the trie:
1. Increment a prefix with a glyph by multiplying the glyph probability with the prefix likelihood. Append the glyph to the sequence if it is different from the last glyph in the sequence.
2. Fold two identical prefixes by summing their likelihoods and unifying their trie nodes.

Algorithm 5.5.2 implements this beam search decoding algorithm for one-dimensional sequences. It builds a prefix trie of known glyph sequences, incrementally appending glyphs to the top-n most likely ones according to a heap structure that organizes the trie nodes. It keeps track of two likelihoods per prefix sequence, that is per trie node: the likelihood in the last time step and the interim likelihood in the current time step. The current likelihood is used to accumulate the likelihoods if multiple prefixes fold to the same sequence and is initialized to zero at each time step. The last likelihood is used to calculate the new likelihood when appending a new glyph to this prefix.

The algorithm further depends on two helper functions, Increment for appending to a prefix and FlipProbs for saving the likelihoods from the last time step and zeroing them for the next time step. Algorithm 5.5.3 describes the function for incrementing an existing prefix sequence by one additional glyph and at the same time folding identical glyph sequences.
Incrementing follows the prefix trie structure, modifying the likelihood if the current sequence is already known or creating a new leaf node if it is a previously unknown sequence. Folding is implemented by summing the likelihoods of two identical glyph sequences. Algorithm 5.5.4 is a helper function that enumerates the prefix sequences within the trie and replaces the last likelihoods with the current ones, re-initializing the current likelihoods to zero.

This Beam Search decoding algorithm is capable of decoding the correct glyph sequence even if it is partially weakly predicted by the model that generated the soft-assignment y. This is because it takes into account the different configurations, that is the ways to align the glyph sequence over the pixel space, of each individual glyph sequence. If there are parts where the correct glyphs are weakly predicted, there are still multiple configurations that use this weak prediction and sum up to a high likelihood for this sequence. This is likely not the case for random noise in the prediction. Still, like both decoding algorithms discussed here, it is not robust in the face of incorrectly predicted glyphs.

Algorithm 5.5.2 DecodeLine: Beam Search Variant
  Input accumulated glyph probabilities a, beam-width w and alphabet A.
  I = Width(a)
  Initialize the trie to only the empty sequence. Elements are 4-tuples of the own glyph, set of suffixes, last probability and current probability: T = {(ϵ_g, ∅, 1, 0)}
  Initialize the heap of trie nodes, sorted by descending last probability: H = Heapify(T)
  Iterate pixels and append to the trie:
  for i ∈ [1, I] do
    Follow the top w current prefixes:
    for n ∈ [1, w] do
      Retrieve the best sequence/trie node from the heap: s = Pop(H)
      Trie node s is a 4-tuple as described above: s ≡ (g′, C, p_last, p_cur)
      Are there any more prefixes?
      if s ≠ ϵ then
        Increment the sequence by one glyph:
        for g ∈ A do
          Probability for the sequence with the glyph added: p = p_last × a^i_g
          if p > 0 then
            Increment with this glyph: Increment(s, g, p)
          end if
        end for
      end if
    end for
    Flip the current and last probabilities: T = FlipProbs(T)
    Ensure the heap is sorted and all nodes are in it: H = Heapify(T)
  end for
  Finished, return the best sequence: s = Pop(H)
  Return ToSequence(s).

Algorithm 5.5.3 Increment: Append glyph to a trie node
  Input reference to trie node s, glyph g and sequence probability p.
  Trie node s is a 4-tuple of its own glyph g′, child nodes C, last probability p_last and current probability p_cur: s ≡ (g′, C, p_last, p_cur)
  Is the new glyph the own glyph?
  if g′ ≡ g then
    Deduplicate identical adjacent glyphs in this case: p_cur = p_cur + p
  else
    Actually increment by one glyph: c = InsertChild(C, g)
    Increment(c, g, p)
  end if

Algorithm 5.5.4 FlipProbs: Flip last and current probabilities in the trie
  Input trie T.
  Iterate all nodes per reference s:
  for s ∈ T do
    Trie node s is a 4-tuple of its own glyph g′, child nodes C, last probability p_last and current probability p_cur: s ≡ (g′, C, p_last, p_cur)
    p_last = p_cur
    p_cur = 0
  end for

Chapter 6
Multi-Dimensional Connectionist Classification (MDCC)

Figure 6.0.1: Part of the pipeline discussed in this chapter. Left is the input, middle the estimated probabilities and right the decoded text.

As with Chapter 5, this chapter is based on the following two publications on multi-dimensional connectionist classification:

Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Dissecting Multi-Line Handwriting for Multi-Dimensional Connectionist Classification.” In: 2019 15th IAPR International Conference on Document Analysis and Recognition (ICDAR). Sept. 2019. DOI: 10.1109/ICDAR.2019.00015

Martin Schall, Marc-Peter Schambach, and Matthias O. Franz.
“Multi-Dimensional Connectionist Classification: Reading Text in One Step.” In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). Apr. 2018, pp. 405–410. DOI: 10.1109/DAS.2018.36

Section 1.3 discusses the individual contributions to these publications.

6.1 Overview

As we have discussed in Section 5.1, the full offline handwriting system proposed in this work can be split into two parts. One is a multi-line decoding algorithm as detailed in Section 5.1 that uncovers a high-likelihood string given a probabilistic soft-assignment of pixels to glyphs from an alphabet. This covers the latter part of the pipeline shown in Figure 6.0.1. The other part of this pipeline is predicting the probabilistic soft-assignment, given an image of handwritten multi-line text, in the first place. In this work the prediction is generated by a deep neural network (DNN), and training this DNN is the topic of this chapter. The left-side part of Figure 6.0.1 visualizes this part of the pipeline. The training system that we will discuss in this chapter is, like the decoding algorithm presented before, suitable for multi-line text in general and not just handwritten text.

A large part of this chapter is a discussion of the ideas behind and the function of the training algorithm proposed in this work. Still, a few words about the deep neural networks used for multi-line offline handwriting recognition are in order. Figure 6.0.1 shows the overall pipeline for transcribing multi-line text as proposed in this work. Its function is to transcribe multi-line text by predicting, from an image, a computer-processable string that contains the text shown in the image.

Processing image data with deep neural networks is a well-studied problem and can be tackled by both convolutional neural networks[78], e.g. for ImageNet classification[73], and recurrent neural networks[55], e.g. for offline handwriting recognition[45].
Both of these DNN topologies are detailed and discussed in Section 2.3. We can thus build on a large corpus of knowledge regarding the processing of image data in deep neural networks.

Deep neural networks are typically optimized using the combination of the backpropagation algorithm[112, 113] and gradient descent[11, 65, 107], see Section 2.3. Gradient descent requires that the model, in our case a deep neural network, is differentiable with respect to its parameters at (theoretically) all points. In practice, a model that is differentiable at most points and provides heuristics for the non-differentiable points is still suitable for gradient descent. However, the output of the overall pipeline in Figure 6.0.1 is a string, which is a sequence of discrete symbols. Target functions for optimizing models that directly predict discrete symbols tend to be piecewise constant in large parts, providing no gradient at all, and non-continuous in the other parts. One example of such an optimization target would be to count the number of wrong discrete symbols in the sequence and to minimize this number to zero. These properties, being piecewise constant at most points and non-continuous at the remaining points, disqualify such problem statements for optimization using gradient descent.

The way this work, and in fact many other works, address this problem is to formulate it in a probabilistic framework: instead of directly predicting discrete symbols, the model predicts probabilities for their occurrence. In essence, the problem is reformulated as a probabilistic multi-class classification problem. Assuming that prediction $y$ for observation $x$ is generated by a deep neural network with parameters $W$ according to

\[ y = \operatorname{argmax}(\mathrm{DNN}(x, W)) \tag{6.1.1} \]

this is a change to

\[ y = \operatorname{Softmax}(\mathrm{DNN}(x, W)) \tag{6.1.2} \]

with $y$ now being a vector of class probabilities instead of a specific class. The softmax function[13] is discussed in Section 2.3.
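The difference between Equations 6.1.1 and 6.1.2 can be illustrated with a minimal sketch. The logits below are made-up values for a single pixel and three classes, and the helper names `softmax` and `argmax` are chosen for this example only; they are not taken from the implementation of this work.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value (hard class decision)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical network outputs for one pixel and three classes.
logits = [2.0, 1.0, 0.1]
# The same outputs after a small change of the second logit.
nudged = [2.0, 1.05, 0.1]

# Hard prediction (Equation 6.1.1): identical before and after the nudge,
# so a target built on it is locally constant and provides no gradient.
hard_before = argmax(logits)
hard_after = argmax(nudged)

# Soft prediction (Equation 6.1.2): reacts smoothly to the nudge.
soft_before = softmax(logits)
soft_after = softmax(nudged)
```

The hard decision does not move under the small perturbation, while the class probability of the nudged class increases slightly. This smooth response is what makes the probabilistic formulation usable with gradient descent.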
Applying a multi-line decoder function as proposed in Chapter 5 to the soft-assignment estimated by the deep neural network allows us to formulate the multi-line offline handwriting recognition problem in exactly such a probabilistic framework. The soft-assignment predicted by the deep neural network, see the middle part of Figure 6.0.1, now gives probabilities for pixels being part of discrete glyphs. This is in contrast to a hard assignment of a pixel to one specific glyph. In turn, the proposed multi-line decoding then produces a high-likelihood glyph sequence from this soft-assignment.

This formulation, with the deep neural network estimating a soft-assignment from pixels to glyphs of the alphabet, effectively transforms the problem into a semantic segmentation or image segmentation task. Image segmentation is a task that can be solved using deep neural networks, see e.g. U-Nets[108]. In the case of this thesis, the image segmentation task is a supervised learning task. This means we have a data set for training the deep neural network, and this data set contains input images of multi-line text together with the matching transcribed text. The problem is that the data set contains the correctly transcribed text only, but no spatial information about the text. It does not contain information about the position, size, orientation or shape of the characters in the text. Since the deep neural network should estimate this information, we need to infer the missing spatial information during training. This procedure of inferring the missing information is called, similar to connectionist temporal classification, an alignment of the truth text. This topic will be the main task during training of the deep neural network and the main topic of this chapter.

In this chapter we will first discuss the structure of multi-line text in general, which is a necessary precondition for further discussions on inferring the missing spatial information within the training data.
We will again use the IAM offline handwriting database[88] for examples in this chapter. This is followed by a discussion of the alignment problem of multi-line text in general. The latter part of this chapter details the solution proposed in this work: using conditional random fields[77][93, ch. 19.6], see Section 2.2, for inference of the alignment. This will allow us to implement and train a deep neural network that fulfills the role needed for the proposed pipeline, see Figure 6.0.1, and thus to set up the overall multi-line text recognition system as proposed in this thesis.

A practical implementation, application and experimental evaluation of both this chapter and Chapter 5 is detailed in Chapter 7.

6.2 Structure of Paragraphs

Patterns of Multi-Line Text

A problem stated in the opening of this chapter is that for supervised training of the deep neural network in this work, the training data only contains the input image and the corresponding true text, but no spatial information about the text. This missing spatial information needs to be inferred in order to treat this as an image segmentation task and to optimize the deep neural network accordingly. In addition, Sections 5.2 and 5.3 discussed the structure of multi-line text in the context of multi-line decoding, applying an indicator function $\alpha$ but not detailing it. Both of these problems will be addressed in this section. We need to keep both applications in mind, alignment during training and decoding during transcription, when identifying and deriving the rules for the structure of paragraphs, since alignment and decoding need to be symmetrical in this sense. This requires us to keep the discussions of Section 5.2 in mind.

As in Chapter 5, we will use the term glyph to refer to one element of the alphabet $A$ at hand. A character denotes a specific instance of a glyph within a text or sequence, and multiple characters within a text can be of the same glyph.
The line separator glyph $\epsilon_l$ indicates that the pixels above and below it in the same pixel column belong to two different text lines. The glyph separator $\epsilon_g$ indicates that the pixels to its left and right in the same pixel row belong to two different characters, even if they are of the same glyph.

We will start by discussing the shapes that multi-line text takes when it is written on a piece of paper. In computer vision terms, we will discuss the patterns of multi-line text in pixel space. This is a necessary step towards inferring the missing spatial information, since we know from the annotated data and the writing system at hand what the general geometric relations between characters are. In Latin writing systems, characters in the same line are ordered from left to right in pixel space and text lines are ordered from top to bottom. What is missing is the translation of this writing system to the pixel space. Identifying the patterns that occur in pixel space will allow us to deduce rules for describing and inferring these patterns.

Let us have a look at the patterns that the transitions between two text lines produce. Text lines in this work are assumed to be roughly horizontal, or slanted and curved up to 45 degrees. Further curvature will lead to decoding errors, since the decoding algorithm proposed in Chapter 5 requires each text line to be exactly one interval per column in pixel space.

Figure 6.2.1: Patterns in pixel space for transitioning between two text lines (panels A, B and C; cells are labeled L1, L2 or εl). $\epsilon_l$ denotes the line separator.

The valid patterns for line transitions resulting from these restrictions are shown in Figure 6.2.1, with each block being one possible pattern. These example patterns are within a limited pixel space and can be increased in size, e.g.
for creating longer diagonals. Extending the size of these patterns and combining them will then produce the shapes of text lines that this work covers. Pattern A describes the default line transition, which is perfectly horizontal. Pattern B is a line transition over the diagonal. This may occur either because the text line is actually slanted or because a glyph above or below the line separator extends downwards or upwards. Pattern C is the shape of a curved line transition. Again, this may be because the line itself is actually curved or because of the shape of the glyphs directly above and below.

Figure 6.2.2: Patterns in pixel space for transitioning between two characters (panels A, B and C; cells are labeled C1 or C2). One of the two characters may be of the glyph separator $\epsilon_g$.

Similar to Figure 6.2.1 for lines, Figure 6.2.2 shows the patterns for transitions between characters within the same text line. Pattern A shows the default transition of a straight vertical border between the two adjacent characters. This is also the optimal case for the decoding algorithm of Chapter 5, since this algorithm dynamically collapses each text line to a one-dimensional sequence by summation of the probabilities in each pixel column. A perfectly vertical transition between two characters thus introduces the least ambiguity after collapsing. Pattern B shows transitions between two characters with ragged borders. Pattern C shows transitions over the diagonal. These patterns of transitions between characters occur because their glyphs have specific shapes and thus intersections between each other. As with line separators, the patterns can be combined and repeated to produce the overall patterns of placing characters within a text line.
Indicator Function α

Armed with this knowledge, we can now derive abstract rules that describe these patterns. These rules in turn will allow us to derive the indicator function $\alpha$ that we have used before, or even to generate these patterns, which will be used for computing the alignment during training of the deep neural network. This is necessary since ‘computing the alignment’ only means, as we will discuss in Section 6.3, to marginalize over all possible configurations for placing a text in a pixel space. The term configuration here carries the same meaning as it does in the context of graphical models, see Section 2.2: a configuration is exactly one hard assignment of labels to nodes of the graphical model. In our case, a configuration is one way of assigning a character to every pixel. The character may be a different one per pixel, but each pixel must be assigned exactly one character. We say a configuration is valid if it follows all the rules for the patterns of text in pixel space discussed in this section. Marginalization over all valid configurations for placing the text at hand in pixel space then yields the alignment of this text. We will discuss this in depth in Section 6.3.

These rules for describing the valid configurations establish the connection between the text and the pixels, and thus we need to be clear on how to operate in both. The term label space refers to the topological space that contains the text at hand, e.g. the true label text during supervised training. Movements to the ‘left’ or ‘right’ in this space refer to the character before or after, respectively, within the same text line. The directions ‘up’ and ‘down’ indicate the text line before or after the current one. Please note that, similar to the structure in Chapter 5, each visible text line is separated from its neighbors above and below by a line separator $\epsilon_l$. Figure 6.2.3 shows the label space for the two-line sequence ‘CAT DOG’.
    C   A   T
        εl
    D   O   G

Figure 6.2.3: Label space for the two-line sequence ‘CAT DOG’.

We will use the term pixel space as before in this thesis, that is, to refer to the grid structure of pixels in an image. In this work we will use a pixel space of 8-neighborhoods: each pixel is seen as connected to 8 neighbors, including the 4 over the geometric diagonal. Figure 6.2.4 shows part of the pixel space around a specific pixel. We now have the tools necessary to derive the rules that we will use in this thesis to describe the patterns of multi-line text in images. These rules create the relation between the label space as given by a text and the pixel space as given by e.g. the output of a deep neural network.

    (i-1, j-1)   (i-1, j)   (i-1, j+1)
    (i,   j-1)   (i,   j)   (i,   j+1)
    (i+1, j-1)   (i+1, j)   (i+1, j+1)

Figure 6.2.4: Pixel space around a pixel (i, j). Solid lines indicate direct neighbors, dashed lines are neighborhoods of the other pixels. Outer dots indicate the extension of the pixel space in all directions.

In the following paragraphs we will first discuss the rules governing the line separator $\epsilon_l$, followed by a discussion of the rules for the remaining glyphs.

Constructing the label space and pixel space for alignment in this work assumes that the text on the input image, and thus the text encoded in the pixel space by the deep neural network, and the truth label string match. That is, e.g. spelling mistakes or missing characters in a handwritten paragraph are reflected in the truth label string and thus in the constructed label space. This also assumes that the full paragraph is visible in the input image and, if it is not, that the truth label string also only contains the visible part.

The following figures and paragraphs will focus on a specific character within the label space and then show its direct neighbors in both label and pixel space.
In order to derive the overall rule for mapping a label space to pixel space and for testing whether a configuration is valid or not, the following rules have to be repeated for all characters in the label space and all pixels in the pixel space. There is one exception to this: combinations of labels and pixels are valid only if there is enough space left before and after to map the remaining characters from the label space. For example, the character ‘B’ of the label ‘ABC’ cannot be mapped to any of the left- or right-most pixels, since no pixel would be left in which to place the character ‘A’ or ‘C’, respectively. Violating this automatically leads to invalid patterns and thus invalid configurations.

Figure 6.2.5 visualizes the rules for connecting the label space around a line separator $\epsilon_l$ to the pixel space. These are directly derived from the patterns on lines that we have observed in Figure 6.2.1. Each node in Figure 6.2.5 denotes one character in the label space and the edges give the relationship in label space, e.g. rightwards for the next character within the same line. The markers ‘R’ (Right), ‘DR’ (Down-Right), ‘D’ (Down) and ‘DL’ (Down-Left) on the edges refer to the corresponding relationship in pixel space. We only show four directions of movement in the pixel space in Figures 6.2.5 and 6.2.6, since the other four are defined from the viewpoint of the neighbors in the left to top-right directions in pixel space.

Figure 6.2.5: Rules for transitioning from one line separator $\epsilon_l$ to its neighbors. Arrows indicate relations in label space. ‘R’, ‘DR’, ‘D’ and ‘DL’ indicate ‘Right’, ‘Down-Right’, ‘Down’ and ‘Down-Left’ directions in pixel space. The dotted transitions follow the same rules as the middle transition. The same rules need to be repeated for all line separators $\epsilon_l$ in the text.
The remaining four directions (left, up-left, up and up-right) will later be constructed within the indicator function by transforming the edges from directed to undirected ones. For now, the directed transitions should be seen as a simplification that will be dropped by the indicator function in favor of symmetric relations. Self-cycles are allowed in these rules to accommodate the fact that there are at most as many characters in the label space as pixels in pixel space and that each pixel needs to be assigned one character. Undefined pixel assignments are not allowed.

It is notable here that Figure 6.2.1 contains a relationship between neighbors that is not modeled in the rule set of Figure 6.2.5: in actual examples, two adjacent text lines may be neighbors over the diagonal, effectively skipping the line separator $\epsilon_l$ in between. However, the line separator is not really skipped, since the decoding algorithm presented in Chapter 5 identifies individual text lines by testing for the line separator $\epsilon_l$ over the vertical in pixel space, not the horizontal or diagonal. The correct order and neighborhood relations of Figure 6.2.1 are still preserved in Figure 6.2.5. Omitting this neighborhood between two text lines over the diagonal serves a computational purpose: all other described rules are either 1:1 relations (neighbors within the same text line) or 1:n relations (transition from $\epsilon_l$ to characters of adjacent text lines), but the missing rule would be an n:m relation. Including it would increase the runtime of the algorithms built on this rule set, and thus omitting it gains practical benefits without drawback.

Figure 6.2.6 illustrates the rules for connecting the label space around non-line-separator glyphs to the pixel space. The notation is identical to the one used in Figure 6.2.5. It is directly derived from the character patterns of Figure 6.2.2 for transitions within a text line and Figure 6.2.1 for those between text lines.
The rules shown in both Figures 6.2.5 and 6.2.6 illustrate only four of the eight directions in pixel space as seen from a character. This is because, in order to apply these rules to a specific label and pixel space, they need to be repeated for each and every character anyway. This leads to a full rule set for the specific instances of label and pixel space, thus including all neighbors in all eight directions. The missing relations in pixel space are then obtained by reversing the direction, e.g. ‘Down-Right’ becomes ‘Up-Left’.

The missing indicator function $\alpha$ of Chapter 5 can now be implemented according to the following overall rule: a configuration is valid if and only if it places every character from label space at least once, if every pixel is assigned exactly one character and if the previous rules from Figures 6.2.5 and 6.2.6 for neighborhoods are respected without contradiction. We will apply the same indicator function in the remainder of this chapter.

Figure 6.2.6: Rules for transitioning from one character to its neighbors. Arrows indicate relations in label space. ‘R’, ‘DR’, ‘D’ and ‘DL’ indicate ‘Right’, ‘Down-Right’, ‘Down’ and ‘Down-Left’ directions in pixel space. The same rules need to be repeated for all characters, excluding the line separator $\epsilon_l$, in the text.

Formalizing the Indicator Function α

We will now formalize the definition of the indicator function $\alpha$ as given above. We refer to this indicator function as $\alpha_{s,t}(u, v, l)$, with $s$ and $t$ being pixels in the pixel space as defined in Figure 6.2.4. In the same way, $u$ and $v$ refer to positions within the label space of Figure 6.2.3. Variable $l$ is the truth label string itself, necessary to construct the label space correctly. The indicator function $\alpha_{s,t}(u, v, l)$ assumes a value of 1 if the neighborhood of $u$ and $v$ in pixels $s$ and $t$ is valid according to label $l$, else it is 0.
The following paragraphs give the formal definition of $\alpha$.

The graph tensor product[154] provides the mechanism for formalizing allowed neighborhoods. The nodes of the product graph are the Cartesian product of the nodes of the pixel space, see Figure 6.2.4, and the nodes of the label space of Figure 6.2.3. A node $(s, u)$ of the product graph exists iff all of the following statements are true:

• $s$ is a node of the pixel space.
• $u$ is a node of the label space.
• There are at least as many pixels to the left of $s$ as there are characters before $u$ in the same text line.
• The same holds for the pixels to the right of $s$ and the characters after $u$.
• There are at least as many pixels above $s$ as the sum of the number of text lines and line separators $\epsilon_l$ before the text line of which $u$ is a part.
• The same holds for the pixels below $s$ and the text lines and line separators after $u$.

The edges of the product graph are defined according to the graph tensor product. An undirected edge $(s, u) \sim (t, v)$ exists in the product graph if both $s$ and $t$ are neighbors in the pixel space in any direction $D$ (one of ‘Right’, ‘Down-Right’, ‘Down’ and ‘Down-Left’) and $u$ and $v$ are neighbors in the label space according to either the rules of Figure 6.2.5 or Figure 6.2.6 in the same direction $D$. This formulation necessitates flipping the pixel-label combinations $(s, u)$ and $(t, v)$ such that pixel $s$ is always to the top, left or top-left of pixel $t$.

The indicator function $\alpha_{s,t}(u, v, l)$ has a value of 1 iff both $(s, u)$ and $(t, v)$ are nodes of the product graph and there exists an edge $(s, u) \sim (t, v)$ or $(t, v) \sim (s, u)$ in the product graph. $\alpha_{s,t}(u, v, l)$ has a value of 0 in all other cases.

Please note that the product graph is an undirected graph in which an edge $(s, u) \sim (t, v)$ defines a neighborhood relation, but no direction. This is in contrast to the directed edges of the neighborhood rules specified above and illustrated by Figures 6.2.5 and 6.2.6.
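The node conditions of the product graph listed above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the text: the label space is stored as a list of rows that alternate between text lines and single-node line separators, pixels and label positions are zero-based `(row, column)` pairs, and the horizontal conditions are not applied to line separators. The names `build_label_space` and `node_exists` are hypothetical.

```python
LINE_SEP = "εl"  # stand-in symbol for the line separator

def build_label_space(text_lines):
    """Rows of the label space: text lines interleaved with separators."""
    rows = []
    for k, line in enumerate(text_lines):
        if k > 0:
            rows.append([LINE_SEP])
        rows.append(list(line))
    return rows

def node_exists(pixel, label_pos, label_rows, height, width):
    """Check whether the product-graph node (pixel, label_pos) exists."""
    i, j = pixel      # pixel row and column
    r, c = label_pos  # label-space row and position within that row
    # s must be a pixel, u must be a position of the label space.
    if not (0 <= i < height and 0 <= j < width):
        return False
    if not (0 <= r < len(label_rows) and 0 <= c < len(label_rows[r])):
        return False
    # Odd rows are line separators by construction (assumption of this sketch).
    is_separator = (r % 2 == 1)
    if not is_separator:
        # Enough pixels to the left for the characters before u ...
        if j < c:
            return False
        # ... and enough pixels to the right for the characters after u.
        if (width - 1 - j) < (len(label_rows[r]) - 1 - c):
            return False
    # Enough pixels above for all text lines and separators before u ...
    if i < r:
        return False
    # ... and enough pixels below for those after u.
    if (height - 1 - i) < (len(label_rows) - 1 - r):
        return False
    return True

rows = build_label_space(["CAT", "DOG"])
```

For the label space of ‘CAT DOG’ on a 13 by 13 grid, the character ‘A’ cannot be placed in the left-most pixel column and ‘D’ cannot be placed in the first two pixel rows, which matches the conditions listed above. The edges of the product graph are not covered by this sketch.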
This approach simplifies the construction of the pixel and label spaces since only four instead of eight neighborhood relations need to be encoded. The final product graph still encodes the required spatial information since the indices $s$ and $t$ still address specific pixels, while $u$ and $v$ address label positions accordingly. This means it is not possible to e.g. flip a text line in pixel space and still have a configuration that is valid according to this indicator function.

An Example Configuration

    C  C  C  C  A  A  T  T  T  T  T  T  T
    C  C  C  C  A  A  T  T  T  T  T  T  T
    C  C  C  A  A  A  A  T  T  T  T  T  T
    C  C  C  A  A  A  A  A  T  T  T  εl εl
    C  C  C  C  εl εl A  A  T  T  εl G  G
    εl εl εl εl O  O  εl εl εl εl G  G  G
    D  D  O  O  O  O  O  O  O  G  G  G  G
    D  D  D  O  O  O  O  O  G  G  G  G  G
    D  D  D  D  O  O  O  O  G  G  G  G  G
    D  D  D  D  O  O  O  O  G  G  G  G  G
    D  D  D  D  O  O  O  O  O  G  G  G  G
    D  D  D  O  O  O  O  O  G  G  G  G  G
    D  D  O  O  O  O  O  G  G  G  G  G  G

Figure 6.2.7: One example configuration for placing the two-line sequence ‘CAT DOG’ in a pixel grid. There are many more possible configurations for the same sequence and pixel grid.

Figure 6.2.7 shows one possible configuration of placing the text from the label space of Figure 6.2.3 in a pixel space like the one in Figure 6.2.4, but of 13 by 13 pixels in size. We can see that this is only one of many valid configurations. Other valid configurations can be generated by either iteratively modifying one root configuration pixel-by-pixel while respecting the above rule set at each change, or enumerating all configurations and testing for their validity afterwards. The question of how to work with the many different configurations for one text and pixel space will be part of the topic of the remainder of this chapter.

On 8- versus 4-Neighborhoods

The preceding paragraphs discussed the patterns that occur when placing multi-line text in a pixel space, e.g. by writing it on paper, and the abstract rules for these patterns. The rules represent the connection between label and pixel space.
The pixel space in use is designed with 8- instead of 4-neighborhoods between pixels. That is, it includes the four diagonal neighbors and not only the four over the vertical and horizontal lines.

Figure 6.2.8: Invalid configuration of the sequence ‘AB’ over the diagonal when using 4- instead of 8-neighborhood relations. Rule sets based on 4-neighborhoods (without diagonal) cannot model the correct order of characters in a ‘staircase’ pattern without an increase in model complexity.

This was a deliberate choice in order to address a specific problem during alignment. If a text line is placed at a 45 degree angle, then its borders to the neighboring line separators $\epsilon_l$ form a ‘staircase pattern’. If in addition the text line is placed with exactly one pixel in height, then the pixels assigned to this text line never touch over the vertical or horizontal neighborhoods. In a pixel space with 4-neighborhoods, the characters of the text line thus never touch in pixel space. This is illustrated in Figure 6.2.8. Line separators $\epsilon_l$ only carry the meaning that the pixels above and below belong to different text lines, but do not indicate which specific characters they are neighboring. This means that if the correct order of characters is not enforced within the text line, which it cannot be in this case, then these invalid configurations can incorrectly be recognized as valid. This is shown in Figure 6.2.8, where the sequence ‘AB’ is placed in a diagonal as ‘ABAB’.

One solution to this problem would be to encode the closest characters in each line separator $\epsilon_l$. This approach modifies the label space, e.g. of Figure 6.2.3, in such a way that there are multiple line separators in place of one, each one encoding one possible placement in pixel space. For example there would be a line separator ‘$\epsilon_l$ nearest to C of CAT’, ‘$\epsilon_l$ nearest to A of CAT’, ‘$\epsilon_l$ nearest to D of DOG’ and so on.
While this is a remedy to the problem of incorrect alignment over the diagonal, it also increases the model complexity. In the model with 8-neighborhoods, each character of a text line has a constant number of possible neighbors and as such is a 1:c relation with c being constant. There is also exactly one line separator $\epsilon_l$ between any two text lines. These line separators have 1:n relations in the described rule set, with n being the summed number of characters in the text lines directly above and below. In the model with 4-neighborhoods and extended information added to the line separators, there is now one line separator character per character in the text line above and one per character in the text line below, all standing in relations to each other. This constitutes an n:m relation in the rule set, with n being the number of characters in the text line above and m the number in the text line below. In total, this leads to an increase in the model complexity by a factor of the number of characters in the label string, which should be avoided if possible.

The method proposed in this thesis addresses this problem by using 8- instead of 4-neighborhoods in pixel space. In this case, characters within the same text line can be neighbors in pixel space even if they are placed in a diagonal. This in turn allows configurations such as the one shown in Figure 6.2.8 to be correctly identified as invalid, since they violate the rules for the indicator function as described above. The rule set is still increased in complexity since the characters now have more neighbors, but this is only an increase by a constant factor of two and not an increase in complexity linear in the number of characters in the label string.
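The staircase argument can be made concrete with a small sketch. The function `are_neighbors` and the pixel coordinates are assumptions made for this illustration only; pixels are `(row, column)` pairs.

```python
def are_neighbors(p, q, connectivity=8):
    """Test whether two distinct pixels are neighbors in a pixel grid."""
    di = abs(p[0] - q[0])
    dj = abs(p[1] - q[1])
    if di == 0 and dj == 0:
        return False  # a pixel is not its own neighbor here
    if connectivity == 4:
        return di + dj == 1       # vertical or horizontal only
    if connectivity == 8:
        return max(di, dj) == 1   # additionally the four diagonals
    raise ValueError("connectivity must be 4 or 8")

# A text line of one pixel height placed at a 45 degree angle:
# each pixel is one step up and one step right of the previous one.
staircase = [(3, 0), (2, 1), (1, 2), (0, 3)]
pairs = list(zip(staircase, staircase[1:]))

touch_4 = all(are_neighbors(a, b, connectivity=4) for a, b in pairs)
touch_8 = all(are_neighbors(a, b, connectivity=8) for a, b in pairs)
```

Under 4-neighborhoods the consecutive pixels of the staircase never touch, so the order of characters along the line cannot be enforced by neighborhood rules. Under 8-neighborhoods every consecutive pair is adjacent.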
6.3 Basics of Multi-Line Training

Idea and Definitions

The core goal of this chapter about multi-dimensional connectionist classification is to derive a training algorithm that can be applied to a given deep neural network and training data set in order to maximize the likelihood that the DNN predictions decode to the correct label strings if the decoding algorithm of Chapter 5 is applied. We will discuss multiple approaches to such a training function in the following sections, but first we need to define the common basics for all these approaches.

Symbol $W$ is the parameter set of a deep neural network that is suitable for the functionality discussed in Section 6.1 and specifically for the pipeline of Figure 6.0.1. That is, the DNN should take an image of multi-line text as input and estimate a soft-assignment, that is a probabilistic assignment, between pixels and characters from the alphabet in such a way that the decoding algorithm of Chapter 5 will produce the correct label string. The correct label string is defined by the training data set $S$, which consists of 2-tuples $(x, l) \in S$ with $x$ being the input image of multi-line text and $l$ being the true label sequence for the text as seen in the input image. The symbol $y = \mathrm{DNN}(x, W)$ defines the soft-assignment between pixels in $x$ and glyphs from alphabet $A$ in the same way as used beforehand in this thesis. The goal of training the DNN and optimizing its parameter set $W$ is to maximize the likelihood of $\mathrm{Decoder}(y) \equiv l$.

At this point we need to discuss the difference in the configuration $C$ between the decoding and training algorithms. In decoding, each configuration $C$ is a hard-assignment between pixels and glyphs from the alphabet $A$ in the form of $C \in A^{d_y}$, with $d_y$ being the spatial resolution of the DNN prediction $y$.
A big part of the training method proposed in this thesis is to infer the missing information about the alignment of the truth label string $l$ over the DNN prediction $y$, which is necessary since the DNN predicts a soft-assignment between pixels and glyphs and not directly a label string. As such the network prediction $y \in [0, 1]^{d_y \times |A|}$ is a probabilistic assignment where glyphs $g \in A$ are exclusive per pixel $s$:

\[ \sum_{g \in A} y^s_g = 1, \quad \forall s \in [1, d_y] \tag{6.3.1} \]

In contrast to this, the alignment $z$ is a soft-assignment between characters, that is specific instances of a glyph from alphabet $A$, of the truth label string $l$ and pixels of the DNN prediction $y$. Thus $z \in [0, 1]^{d_y \times |l|}$ and it follows that configurations $C$ in training are hard-assignments between characters of the label string and pixels. This is necessary since multiple characters of the label string may refer to the same glyph in the alphabet. In the same way as with the soft-assignment $y$, character assignments within the same pixel are mutually exclusive to each other while pixels are independent of each other.

This differentiation between the soft-assignment estimated by the DNN and the alignment of the label string leads to formulations such as $y^s_{l_{C^s}}$, where the lower index refers to a glyph $l_{C^s}$ from the alphabet, which itself is defined by a position $C^s$ within the truth label string $l$. The equivalent of this statement for the alignment would be $z^s_{C^s}$.

This mechanism is a fundamental difference between the DNN prediction $y$ and the alignment $z$ of the truth label string. To resolve this we define a marginalization $z_\Sigma$ over characters that refer to the same glyph:

\[ z^s_{\Sigma g} = \sum_{i}^{|l|} \beta(l_i, g) \times z^s_i, \quad \forall s \in [1, d_y] \tag{6.3.2} \]

with

\[ \beta(l_i, g) = \begin{cases} 1 & \text{iff } l_i = g \\ 0 & \text{else} \end{cases} \tag{6.3.3} \]

as an indicator function that ensures marginalization over characters of the same glyph.
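Equations 6.3.2 and 6.3.3 amount to summing, per pixel, the alignment mass of all label positions that carry the same glyph. The following sketch uses plain Python lists and a made-up alignment for the label ‘ABA’; the names `beta` and `marginalize_alignment` as well as the numbers are assumptions for illustration only.

```python
def beta(label_char, glyph):
    """Indicator of Equation 6.3.3: 1 iff the character is of this glyph."""
    return 1.0 if label_char == glyph else 0.0

def marginalize_alignment(z, label, alphabet):
    """Compute z_sigma[s][g] from z[s][i] as in Equation 6.3.2.

    z[s][i] is the alignment of pixel s to position i of the label string.
    """
    z_sigma = []
    for per_position in z:
        per_glyph = {}
        for g in alphabet:
            per_glyph[g] = sum(
                beta(label[i], g) * per_position[i]
                for i in range(len(label))
            )
        z_sigma.append(per_glyph)
    return z_sigma

# Hypothetical example: two pixels, label 'ABA' with two instances of 'A'.
label = "ABA"
alphabet = ["A", "B"]
z = [
    [0.5, 0.25, 0.25],  # pixel 0: mass split over all three characters
    [0.0, 1.0, 0.0],    # pixel 1: fully assigned to the character 'B'
]
z_sigma = marginalize_alignment(z, label, alphabet)
```

In this example both instances of ‘A’ contribute to the same glyph entry of pixel 0, so the result lives in the same glyph-indexed space as the DNN prediction $y$ and still sums to one per pixel.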
As stated before, the spatial resolution of the input image $x$ and the soft-assignments $y$, $z$ and $z_\Sigma$ may differ because of resizing operations beforehand or subsampling and padding effects in the deep neural network. However, from a theoretical viewpoint it is sufficient to assume that the spatial resolutions of the DNN input and output are identical.

Naive Algorithm

We can now derive the first, although impractical, training algorithm for optimizing the parameter set $W$ in such a way that the DNN, after decoding, likely predicts the correct string as seen in the input image $x$. For this naive approach we directly employ the decoding algorithm from Chapter 5 in the training. The likelihood

\[ P(S|W) = \prod_{(x,l) \in S} P(\mathrm{Decoder}(\mathrm{DNN}(x, W)) \equiv l) \tag{6.3.4} \]

defines the likelihood of observing the training data $S$ given the DNN model with parameter set $W$ when directly applying the decoding algorithm. The optimal parameter set

\[ W^\star = \operatorname*{argmax}_{W} P(S|W) \tag{6.3.5} \]

is in this case the one that maximizes this likelihood of observing the training data. Since we are dealing with one-dimensional sequences in $l$, we can use the Edit-distance[81, 151] as a surrogate for the probability of decoding to the true label sequence $l$:

\[ W^\star = \operatorname*{argmin}_{W} \sum_{(x,l) \in S} \mathrm{Edit}(\mathrm{Decoder}(\mathrm{DNN}(x, W)), l) \tag{6.3.6} \]

Unfortunately this direct approach is not practical. Gradient descent[11, 65, 107] and backpropagation[112, 113], see Section 2.3, cannot be applied since the decoding function is not differentiable. The weight space from which the parameter set $W \in \mathbb{R}^\star$ is drawn is high-dimensional and each dimension has infinitely many elements. This makes exhaustive search, grid search or random search infeasible. Heuristic optimization methods would be a possibility, but we will discuss a more direct approach in the remainder of this chapter.

6.4 Maximum Likelihood Training

Loss Function

We will now look at the maximum likelihood approach to training the deep neural network.
This leaves out the 'middle man' of the decoding function and directly maximizes the likelihood P(l|y), as defined by Equation 5.3.2, of observing the true label string l. Again we will use (x, l) ∈ S as our training data set consisting of tuples of the image input and the true label string. The parameters of the deep neural network to be trained form the set W. We can then define the likelihood of observing the training data set S given our model parameters W:

$$P(S|W) = \prod_{(x,l) \in S} P(l|\mathrm{DNN}(x,W)) \tag{6.4.1}$$

This assumes that the examples in the training data set S are independent and identically distributed (i.i.d.), which allows sampling of data for the training set without considering dependencies between examples. This reduces the likelihood for the whole data set to the product of the likelihoods of its individual examples. The optimal parameter set

$$W^\star = \operatorname*{argmax}_{W} P(S|W) \tag{6.4.2}$$

is the one that maximizes the likelihood of observing the training data. Deep neural networks are typically optimized using gradient descent for finding a minimum of the loss function and thus we rewrite this in a log-likelihood formulation:

$$W^\star = \operatorname*{argmin}_{W} \left[ -\log P(S|W) \right] \tag{6.4.3}$$

Using a log-likelihood formulation is a practical choice for optimizing deep neural networks with gradient descent towards a maximum likelihood solution. It improves numerical stability by replacing products with summations, and this in turn facilitates efficient batch training by simply accumulating the gradients of the examples within each batch. This directly leads us to the loss function

$$L = -\log P(S|W) = -\log \prod_{(x,l) \in S} P(l|\mathrm{DNN}(x,W)) = -\sum_{(x,l) \in S} \log P(l|\mathrm{DNN}(x,W)) \tag{6.4.4}$$

which is suitable for gradient-based batch optimization of the parameter set W of the DNN.

Before substituting P(l|DNN(x,W)) we need to define P(l|y) in a way that retrieves the likelihood of observing the truth label string l given a soft-assignment y.
We can achieve this by treating valid configurations of the truth label string in the pixel space as independent events and thus define the likelihood of observing the truth label string as the marginalization over all its configurations. Defining a configuration C as a hard-assignment between pixels and characters of the truth label string l then gives the likelihood of observing the correct string as the marginalization

$$P(l|y) = \sum_{C} \prod_{s}^{d_y} y^s_{l_{C^s}} \prod_{t \in \mathrm{nbr}(s)} \alpha_{s,t}(C^s, C^t, l) \tag{6.4.5}$$

over all possible configurations C. Applying the indicator function

$$\alpha_{s,t}(C^s, C^t, l) = \begin{cases} 1 & \text{iff } C^s, C^t \text{ are valid neighbors in } s, t \text{ according to } l \\ 0 & \text{else} \end{cases} \tag{6.4.6}$$

ensures that the marginalized probabilities are those of configurations C leading to the observation of the correct label string l. See Section 6.3 for a discussion of the hard-assigned configurations C and Section 6.2 for the formal definition of this indicator function. Substituting Equation 6.4.5 into Equation 6.4.4 yields the following loss:

$$L = -\sum_{(x,l) \in S} \log \sum_{C} \prod_{s}^{d_y} \mathrm{DNN}(x,W)^s_{l_{C^s}} \prod_{t \in \mathrm{nbr}(s)} \alpha_{s,t}(C^s, C^t, l) \tag{6.4.7}$$

The calculation of the likelihood P(l|y) as given by Equation 6.4.5 is fully differentiable and thus allows us to employ backpropagation and gradient descent for parameter optimization. The derivative $\partial L / \partial W$ of Equation 6.4.7 gives the gradient necessary for this. The drawback of this approach lies in its computational complexity. Equation 6.4.5, and thus the loss of Equation 6.4.7, requires the enumeration of all possible configurations C for placing a label string l in the pixel space defined by x. As we will see next, this is an intractably large number.

Number of Enumerated Configurations

Estimating the number of these configurations as valid ways of writing text is the topic of the next few paragraphs. Let us assume a very simple indicator function α which only accepts neighborhood relations as valid if text lines are always aligned horizontally with a constant height per text line.
In this case each text line is a perfect rectangle in pixel space. Similarly, it accepts characters in pixel space only if they themselves are perfect rectangles. Let W be the width and H the height of the pixel space. Let L be the number of text lines. Each pair of text lines is separated by a horizontal line of line separators $\epsilon_l$ of one pixel in height, leaving H − L + 1 pixel rows for alignment. There are then $T_l = L - 1$ transitions between text lines and $T_v = H - L$ vertical transitions between pixel rows for free use in configurations C. Interpreting this as a combinatorial problem [138] of choosing $T_l$ elements out of a base set of $T_v$ elements without replacement and without counting permutations twice yields the following expression for the number of possible configurations $N_l$ for placing text lines:

$$N_l = \frac{T_v^{\underline{T_l}}}{T_l!} = \frac{T_v \times (T_v - 1) \times \cdots \times (T_v - T_l + 1)}{T_l \times (T_l - 1) \times (T_l - 2) \times \cdots \times 1} \tag{6.4.8}$$

The same approach can be applied for placing G glyphs in a horizontal pixel row. Counting occurrences of the glyph separator $\epsilon_g$ as part of the sequence of G glyphs, there are then $T_g = G - 1$ transitions between glyphs and $T_h = W - 1$ horizontal transitions between pixel columns. This gives the number of possible configurations for placing the glyphs of a text line in a pixel row:

$$N_g = \frac{T_h^{\underline{T_g}}}{T_g!} = \frac{T_h \times (T_h - 1) \times \cdots \times (T_h - T_g + 1)}{T_g \times (T_g - 1) \times (T_g - 2) \times \cdots \times 1} \tag{6.4.9}$$

The number $N_l$ gives the number of ways to place the text lines in the vertical dimension and $N_g$ the number of ways to place the glyphs of a text line in the horizontal direction. In total the number of configurations C is then

$$N_c = N_l \times N_g \tag{6.4.10}$$

assuming that each text line has an equal number of glyphs and the alignment of their characters is uniform for all lines, or

$$N_c = N_l \times N_g^{L} \tag{6.4.11}$$

assuming that the alignment of characters is independent for each text line.
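The counts of Equations 6.4.8 to 6.4.11 are binomial coefficients and can be evaluated exactly with integer arithmetic. The following is a small sketch; the function name `count_configurations` is hypothetical, and the arguments are the values W = 320, H = 240, L = 5 and G = 50 of the example considered in this section.

```python
import math


def count_configurations(width, height, lines, glyphs):
    """Evaluate Equations 6.4.8 to 6.4.11 for the rectangular toy alpha."""
    t_l = lines - 1        # transitions between text lines
    t_v = height - lines   # free transitions between pixel rows
    t_g = glyphs - 1       # transitions between glyphs
    t_h = width - 1        # transitions between pixel columns
    n_l = math.comb(t_v, t_l)         # Equation 6.4.8
    n_g = math.comb(t_h, t_g)         # Equation 6.4.9
    uniform = n_l * n_g               # Equation 6.4.10
    independent = n_l * n_g ** lines  # Equation 6.4.11
    return n_l, n_g, uniform, independent


n_l, n_g, uniform, independent = count_configurations(320, 240, 5, 50)
print("N_l =", n_l)
# The number of decimal digits minus one gives the order of magnitude.
print("order of magnitude, uniform lines:", len(str(uniform)) - 1)
print("order of magnitude, independent lines:", len(str(independent)) - 1)
```

Python integers have arbitrary precision, so no approximation is involved in these counts.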
In an example with a pixel space of W = 320 by H = 240, aligning a paragraph of L = 5 text lines of G = 50 characters each, the number of possible configurations $N_c$ is approximately $10^{66}$ with uniform text lines as in Equation 6.4.10 and approximately $10^{299}$ with independent character alignments in each text line as assumed in Equation 6.4.11. These numbers are vastly smaller than the number of all configurations independent of their encoded text, but they are still computationally prohibitive even for this simple indicator function α, which only allows perfectly rectangular and axis-parallel lines and characters.

Using the indicator function α as discussed in Section 6.2 will only introduce more degrees of freedom, e.g. text lines and glyphs no longer have to be rectangular, and thus increase the number of different configurations $N_c$. This further increases the computational complexity of applying the maximum likelihood approach to this problem. Please also note the discussion in Section 4.3 on the computational limitations of inference in graphical models.

6.5 Expectation-Maximization Training

Idea and Training Algorithm

In the previous sections we have discussed two methods for training deep neural networks for multi-line text transcription: one based on directly minimizing the Edit-distance [81, 151] between the truth label string and the decoded predicted string, and another that maximizes the likelihood of observing the truth label string in the prediction by enumerating all configurations for placing this truth label in pixel space. Both are computationally inefficient and intractable in their own way. We will now discuss an approach based on expectation-maximization [26][7, ch. 9], which we have discussed in Section 2.4.
This approach will employ a conditional random field and loopy belief propagation in order to infer an approximation of the soft-assignment between pixels and glyphs that will maximize the likelihood of decoding to the correct label string. After that, all that is missing is to optimize the deep neural network for reproducing this soft-assignment during prediction. In this section we will discuss this approach in detail, which is also the training algorithm used in MDCC.

Expectation-maximization is an iterative optimization algorithm used in machine learning for selecting model parameters in the face of latent variables. Each iteration is a two-step process as follows:

1. Expectation step: Keep the model parameters constant while inferring the latent variables.
2. Maximization step: Keep the latent variables constant while updating the model parameters.

In our case, the latent variable is the soft-assignment zΣ between pixels and glyphs in such a way that it decodes to the truth label string. This is in contrast to the soft-assignment y, which is estimated by the deep neural network without knowledge of the correct label string. We need to keep this distinction between zΣ for a soft-assignment matching a specific truth label string and y for the network prediction in mind. The E-step in our case will be to use a conditional random field for inferring z, a soft-assignment between characters and pixels based on the truth label string, and thus zΣ by accumulating characters to glyphs (see Equation 6.3.2 for this distinction), while keeping the DNN parameters constant. The M-step then is to keep the soft-assignment zΣ constant while updating the DNN parameters towards reproducing it without knowledge of the truth label string. Figure 6.5.1 gives an overview of this loop, which will be applied iteratively during the training of the deep neural network.
[Figure 6.5.1, diagram. Recoverable labels: Start here; Image of Multi-Line Text x; DNN; Prediction; Estimated Soft-Assignment y; Prior; Truth Label String l; Topology; CRF; Alignment; Corrected Soft-Assignment z and zΣ; Update Weights W (Training Only); Decoding; Transcribed String (Transcription Only).]

Figure 6.5.1: Loop between the deep neural network and conditional random field to optimize the network parameters using expectation-maximization. The EM loop only exists during training, as does the CRF. Only the DNN and decoding algorithm are required for transcription.

The next few paragraphs will discuss the optimization target for the expectation-maximization training in this thesis. As before, (x, l) ∈ S will denote the training data set S consisting of examples of an image x of multi-line text and the matching truth label string l.

Expectation-maximization requires the definition of an optimization function, in the context of EM also called the distortion function, that will be minimized during training. In the case of MDCC, this function consists of two distinct parts. The first regards the E-step, in which we will minimize the Edit-distance between the decoded soft-assignment zΣ and the truth label string in order to infer this latent variable zΣ. The second part accordingly aims at the M-step, in which the cross-entropy loss will be employed to optimize the DNN parameters towards reproducing the soft-assignment zΣ in its own prediction y. The distortion function J for EM training in MDCC is thus

$$J(W, W_{old}) = \sum_{(x,l) \in S} \left[ \mathrm{Edit}(\mathrm{Decoder}(z_\Sigma), l) + \mathrm{CE}(z_\Sigma, y) \right] \tag{6.5.1}$$

with W being the parameters of the deep neural network. The soft-assignment

$$y = \mathrm{DNN}(x, W) \tag{6.5.2}$$

is estimated by the deep neural network based on the current network parameters.
Estimation of the latent variable in the form of the soft-assignment

$$z = \mathrm{Alignment}(\mathrm{DNN}(x, W_{old}), l) \tag{6.5.3}$$

and its marginalization zΣ according to Equation 6.3.2 is a function of aligning the truth label string l on the DNN prediction based on the network parameters $W_{old}$ of the last EM iteration, which are now kept constant. The function Alignment() for finding the soft-assignment z will be the topic of further discussion in this chapter and especially Section 6.6.

Equation 6.5.1 details the optimization function for training in MDCC. One part of it is to find a soft-assignment zΣ that minimizes the Edit-distance to the truth label string l when decoded with the decoding algorithm of Chapter 5. Since it was possible to place the truth label string l in the input image x in the first place, assuming the truth label and image match and are correctly annotated, it is also possible to find a soft-assignment zΣ that correctly encodes this truth label string. This means that the lower bound for the term Edit(Decoder(zΣ), l) of the proposed EM distortion function is actually zero, which is the Edit-distance if both strings match without difference.

However, the second term CE(zΣ, y) minimizes the cross-entropy between the aligned soft-assignment zΣ, which serves as the truth in this case, and the DNN-predicted soft-assignment y. The goal is that y decodes to the truth label string l without prior knowledge of l. This cross-entropy term means that the EM distortion function will increase in value if the aligned soft-assignment zΣ varies between multiple iterations of the EM algorithm or deviates too much from the DNN prediction y. This effect is counteracted by including the estimated soft-assignment y, based on the last network parameters $W_{old}$, as a prior in the aligned soft-assignment z.
In conclusion, the overall target of the alignment function Alignment() should be to find the alignment zΣ that minimizes the Edit-distance of the decoded alignment Decoder(zΣ) to the truth label string l, ideally to a distance of zero, but that as a side condition is also as similar as possible to the estimated soft-assignment y known beforehand for this example. This also makes practical sense since the deep neural network with parameters $W_{old}$ is the best available estimator for the true soft-assignment without prior knowledge of the label string.

We can now also see why different formulations of P(l|y), Equation 5.3.2 in decoding and Equation 6.4.5 in training, are necessary. On a basic level, the decoding variant is identical to the training variant but with the marginalization of Equation 6.3.2 from label positions to glyphs already built in. This simpler formulation is sufficient during decoding, since there the only goal is to see how well the assumed label string matches the observation of the glyphs. During training we need to infer the spatial position of each character of the label string and thus need to avoid confusion between identical glyphs that occur in different characters and therefore at different spatial positions.

The actual loss function for optimizing the DNN parameters W depends only on the second term in the distortion function of Equation 6.5.1. Only the term CE(zΣ, y) depends on W, while the remainder of the distortion function depends on $W_{old}$. This reduction leads us to the following loss function

$$L = \sum_{(x,l) \in S} \mathrm{CE}(z_\Sigma, \mathrm{DNN}(x, W)) \tag{6.5.4}$$

for training the DNN. This loss function is differentiable with respect to W and thus can be used for gradient-based optimization of this parameter set.

Armed with this knowledge we can derive the algorithm for training the deep neural network using an expectation-maximization loop in combination with backpropagation and gradient descent as detailed in Algorithm 6.5.1.
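The cross-entropy term of Equation 6.5.4 and a single gradient step on it can be sketched as follows. This is a deliberately reduced illustration and not the network used in this thesis: the DNN is replaced by one free logit vector per pixel followed by a softmax, which is an assumption made only to keep the example self-contained. The helper names `softmax`, `cross_entropy` and `m_step` and the toy values are hypothetical.

```python
import math


def softmax(logits):
    """Turn one pixel's logits into glyph probabilities that sum to one."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z_sigma, y, eps=1e-12):
    """CE(z_sigma, y) summed over all pixels and glyphs."""
    return -sum(t * math.log(p + eps)
                for z_row, y_row in zip(z_sigma, y)
                for t, p in zip(z_row, y_row))


def m_step(logits, z_sigma, learning_rate=0.5):
    """One gradient descent step; for softmax outputs dCE/dlogit = y - z_sigma."""
    y = [softmax(row) for row in logits]
    return [[w - learning_rate * (p - t)
             for w, p, t in zip(w_row, y_row, z_row)]
            for w_row, y_row, z_row in zip(logits, y, z_sigma)]


# Hypothetical aligned soft-assignment for two pixels and two glyphs.
z_sigma = [[0.9, 0.1], [0.2, 0.8]]
logits = [[0.0, 0.0], [0.0, 0.0]]  # stand-in for the parameter set W

before = cross_entropy(z_sigma, [softmax(r) for r in logits])
logits = m_step(logits, z_sigma)   # z_sigma is kept constant in the M-step
after = cross_entropy(z_sigma, [softmax(r) for r in logits])
print(before, after)
```

The target zΣ is treated as a constant during the update, which mirrors the M-step: only the prediction side of the cross-entropy depends on the trainable parameters.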
Algorithm 6.5.1 MDCC training based on Expectation-Maximization

    Input training data set S.
    Choose initial parameter set W.
    Outer loop for epoch-based training:
    while convergence criteria are not met do
        Inner loop for processing training examples:
        for (x, l) ∈ S do
            Store the current parameter set: $W_{old} = W$
            Calculate the aligned soft-assignment: $z = \mathrm{Alignment}(\mathrm{DNN}(x, W_{old}), l)$
            Marginalization zΣ according to Equation 6.3.2: $z^s_{\Sigma g} = \sum_{i}^{|l|} \beta(l_i, g) \times z^s_i, \ \forall s \in [1, d_y]$
            Predict the soft-assignment using the DNN: $y = \mathrm{DNN}(x, W)$
            Calculate the DNN loss: $L = \mathrm{CE}(z_\Sigma, y)$
            Update the parameter set: $W = W - \mu \times \partial L / \partial W$
        end for
    end while
    Return final parameter set W.

In this formulation of Algorithm 6.5.1, the parameter set $W_{old}$ is explicitly copied and stored from W. The reasoning is to make clear that the parameter set is kept constant during inference of the alignment zΣ and only modified when updating the parameter set using backpropagation and gradient descent. In a practical implementation, the parameter set would not be copied in each loop and the DNN estimation would only be computed once per loop.

Updating the parameter set is in this case performed by simply applying stochastic gradient descent in the form of $W = W - \mu \times \partial L / \partial W$ with µ being the learning rate. Other optimization algorithms such as Adam [67, 84] can be applied as well. Mini-batch or batch training can be applied to the DNN instead of stochastic optimization by modifying the inner loop of Algorithm 6.5.1 to process multiple training examples at once. The parameter set is in this case updated by accumulating the gradients of the individual examples

$$W = W - \mu \times \sum_{(x,l) \in B} \frac{\partial L}{\partial W} \tag{6.5.5}$$

with B ⊆ S being a mini-batch or batch of examples.

Finding the Soft-Assignment z

So far in this section we have discussed the MDCC training algorithm for deep neural network parameter optimization based on expectation-maximization and gradient descent.
The paragraphs above give the methodology, rationale and equations for this EM training. What is missing is a formulation of how to find the soft-assignment z based on the DNN estimate $y = \mathrm{DNN}(x, W_{old})$ and the true label string l, that is, a definition of the function z = Alignment(y, l). We will discuss this in the following paragraphs.

Let C again be a configuration, that is a hard-assignment of label space positions i ∈ [1, |l|] of the truth label string l to pixel space positions s. Each pixel hard-assignment $C^s \in [1, |l|]$ gives the label position assigned to pixel s. Each pixel is assigned exactly one label position, but the same label position may be assigned to different pixels. This is consistent with the use of the term configuration in the context of graphical models.

We need to refer back to the definition of the likelihood P(l|y) of Equation 6.4.5, which gives the likelihood of observing label l given the soft-assignment y, in order to define the aligned soft-assignment z. The alignment z as required for the expectation-maximization training described above is a soft-assignment between label space positions and pixels. Each pixel is assigned a vector of probabilities, each giving the likelihood that one specific label position, that is one character of the truth label string, occurs in this pixel. Since the characters are mutually exclusive but one character has to be assigned to each pixel, the probability vector per pixel sums to exactly one. To achieve this we can choose the same approach as for P(l|y) of Equation 6.4.5, but to define $z^s_i$ we marginalize pixel-wise only over configurations that assign the label position $C^s = i$ to pixel s. We thus derive a basic formulation of the alignment z by first computing the unnormalized alignment z′ with

$$z'^s_i = \sum_{C} \gamma(C^s, i) \prod_{s'}^{d_y} y^{s'}_{l_{C^{s'}}} \prod_{t \in \mathrm{nbr}(s')} \alpha_{s',t}(C^{s'}, C^t, l) \tag{6.5.6}$$

for all spatial positions s ∈ [1, d_y] and all label positions i ∈ [1, |l|].
In this formulation, γ is an indicator function

$$\gamma(C^s, i) = \begin{cases} 1 & \text{iff } C^s = i \\ 0 & \text{else} \end{cases} \tag{6.5.7}$$

that ensures we only marginalize over configurations that assign the correct label space position i to pixel space position s. The alignment z is the normalization

$$z^s_i = \frac{z'^s_i}{\sum_{j \in [1,|l|]} z'^s_j} \tag{6.5.8}$$

of the unnormalized soft-assignment z′. As in Equation 6.4.5, the indicator function α enforces correct neighborhood relations according to label string l, $d_y$ gives the spatial dimensionality of the soft-assignment y and the function nbr gives the 8-neighborhood around position s. In this equation, the variable s gives the spatial position in question and s′ is an iterator variable over all pixel positions in order to derive the likelihood for the current configuration C in question.

In this case we treat each configuration C as an independent event that either favors the assignment of label position i to pixel s or does not. Accumulating the likelihoods of configurations that favor a specific assignment yields the likelihood of observing this label position in this pixel. Normalization of each pixel in such a way that the likelihoods of its assigned labels sum to one is required since each pixel must be assigned a label. There cannot be any configuration C that correctly encodes l but has one or more pixels with assigned labels of probability zero. This normalization also ensures that P(l|z) = 1, which is necessary since the alignment z purposefully encodes l.

Computing the alignment z according to Equation 6.5.8 again requires the enumeration of all configurations C that encode the correct label string l. This means the same reasoning on the runtime as in Section 6.4 for maximum likelihood training holds true here. Evaluating this equation for the alignment is computationally prohibitive.
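To make the enumeration in Equations 6.5.6 to 6.5.8 concrete, the following sketch computes the exact alignment by brute force on a tiny example. It relies on strong simplifications that are assumptions of this illustration and not part of the method: the pixel space is a single row instead of a 2-D grid, and the indicator α is replaced by a hypothetical 1-D rule (`is_valid`) in which the first pixel shows the first character, the last pixel shows the last character, and each step to the right keeps or advances the label position by one. Indices are 0-based.

```python
from itertools import product


def is_valid(config, n_chars):
    """Hypothetical 1-D stand-in for the indicator function alpha."""
    if config[0] != 0 or config[-1] != n_chars - 1:
        return False
    return all(b - a in (0, 1) for a, b in zip(config, config[1:]))


def brute_force_alignment(y, label):
    """Exact alignment z by enumerating every configuration C."""
    n_pixels, n_chars = len(y), len(label)
    z_unnorm = [[0.0] * n_chars for _ in range(n_pixels)]
    # n_chars ** n_pixels configurations: feasible only for toy sizes.
    for config in product(range(n_chars), repeat=n_pixels):
        if not is_valid(config, n_chars):
            continue
        weight = 1.0
        for s, i in enumerate(config):
            weight *= y[s][label[i]]   # glyph probability of character i
        for s, i in enumerate(config):
            z_unnorm[s][i] += weight   # gamma(C^s, i) selects this entry
    # Per-pixel normalization as in Equation 6.5.8.
    return [[v / sum(row) for v in row] for row in z_unnorm]


# Hypothetical DNN prediction for three pixels over the glyphs 'a' and 'b'.
y = [{"a": 0.9, "b": 0.1}, {"a": 0.7, "b": 0.3}, {"a": 0.2, "b": 0.8}]
z = brute_force_alignment(y, "ab")
for s, row in enumerate(z):
    print(s, row)
```

Only two of the eight configurations are valid here, and the middle pixel is the only one whose assignment remains uncertain. The loop over `product(...)` grows exponentially with the number of pixels, which is the runtime problem described above.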
In Section 6.6 we will discuss how to interpret this two-dimensional alignment problem as an inference problem in general graphical models, namely a conditional random field. This will allow us to approximate the alignment of Equation 6.5.8 in reasonable runtime using loopy belief propagation.

Considerations on EM-training

As discussed in Section 2.4, we need to distinguish between expectation-maximization and generalized expectation-maximization. Expectation-maximization refers to the specific use case where a maximum likelihood solution is heuristically approximated. In this case the E-step finds the expectation value for the latent variables given the current model parameters and according to the distortion function. The M-step maximizes the likelihood of observing the training data given these latent variables. Iteratively repeating this process will lead to convergence towards the maximum likelihood solution. The M-step can also be changed towards a maximum a posteriori (MAP) solution, which is one generalization of the EM algorithm.

Another generalization of EM is to approximate either or both of the E-step and the M-step. In the original formulation of EM, the E-step finds the expectation value for the latent variables given the joint distribution of the observed data set and the latent variables. This is reflected in the distortion function. In a symmetrical fashion, the M-step minimizes the distortion function by choosing the optimal model parameters. Finding the expectation value of the latent variables and optimizing the model parameters is done in every iteration of the EM algorithm. One example of this 'exact' EM is k-means clustering, which we have discussed in Section 2.4.
The EM training proposed in this thesis falls under the term generalized expectation-maximization since both the E-step and the M-step are only approximations of, or iterations towards, the true expectation value of the soft-assignment zΣ or the optimal model parameters W. The DNN training discussed beforehand utilizes gradient descent for optimizing the model parameters. Gradient descent is an iterative algorithm, which only modifies the model parameters in small steps in each iteration. It does not find the optimal model parameters in a single step. This alone leads us to the conclusion that we deal with generalized expectation-maximization in this thesis. The E-step in MDCC approximates the soft-assignment z using a conditional random field and loopy belief propagation. This is the topic of Section 6.6. Again, this E-step is only an approximation of the true expectation value for the soft-assignment zΣ and thus supports the view that this is generalized expectation-maximization.

We need to keep in mind that this is generalized expectation-maximization with step-wise updates of the model parameters instead of one-step optimal solutions for the model. However, there are theoretical discussions [93, ch. 11.4.8] and examples that show that an incremental EM algorithm will still converge to a (local) optimum of the maximum likelihood solution.

6.6 Construction of and Inference in the CRF

Overview and Structure

Section 6.5 discussed the expectation-maximization approach to training the deep neural network in the method proposed in this thesis. We have so far discussed the optimization function, or distortion function, of the EM training in multi-dimensional connectionist classification and how to derive the loss for gradient-based optimization of the DNN parameters. We have also shown the prototypical alignment z = Alignment(y, l) between the truth label string l and the DNN prediction y.
This aligned soft-assignment z between label positions and pixels represents the latent variable in the EM training at hand. This prototypical alignment made clear that exact inference of the alignment z raises the same computational considerations as maximum likelihood optimization of the DNN parameters and is computationally intractable. Formulating the deep neural network training as an expectation-maximization approach did, however, change the nature of the alignment problem from an optimization task on the DNN parameters to an inference task on the latent variable. This allows the application of well-known approximate inference algorithms. In this section we will discuss the formulation of the alignment in MDCC as approximate inference using loopy belief propagation [34, 94][93, ch. 22] on a conditional random field [77][93, ch. 19.6]. Both concepts have been discussed in Section 2.2.

It is necessary to mention that there are two ways of using conditional random fields, or most graphical models in general. One is as a machine learning model in which the model parameters are automatically learned in a supervised scheme using a training data set. The other is as a model of a multivariate probability distribution in which the model parameters are predefined according to expert knowledge of the problem at hand. In this work we will apply the CRF in the latter way, by choosing the graph topology and parameters according to the alignment problem that we have discussed beforehand. We then use the graphical model for inference of the aligned soft-assignment.

We define the conditional random field in MDCC as a pairwise undirected model with discrete states in which each pixel of the alignment problem relates to a node in the CRF and each label position to a state of the CRF.
In other words, each pixel of the pixel space becomes one random variable of the graphical model and each label position of the truth label becomes one possible discrete state of these random variables. The joint distribution of such a discrete pairwise CRF for the alignment problem at hand is defined as follows:

$$P(C|y, l) = \frac{1}{Z} \prod_{s}^{d_y} \psi_s(C^s, l, y) \prod_{t \in \mathrm{nbr}(s)} \psi_{s,t}(C^s, C^t, l) \tag{6.6.1}$$

with Z being the partition function or Zustandssumme, that is the normalizer

$$Z = \sum_{C} \left[ \prod_{s}^{d_y} \psi_s(C^s, l, y) \prod_{t \in \mathrm{nbr}(s)} \psi_{s,t}(C^s, C^t, l) \right] \tag{6.6.2}$$

that ensures that the accumulated likelihood over all possible configurations is one. The functions $\psi_s(C^s, l, y)$ and $\psi_{s,t}(C^s, C^t, l)$ are the potential functions of the CRF. As discussed before, the Hammersley-Clifford Theorem [50, 70, 77] defines the properties that these potential functions need to fulfill. They need to be non-negative functions of the nodes in their clique, which in our case of a pairwise model are cliques of two neighbors. The potential functions define the 'compatibility' between the observations, in our case the DNN prediction y, and the states of the random variables as well as between the states of neighboring random variables. Higher values of the potential functions mean higher compatibility. In line with these properties and the discussions of Section 6.5 we define the node potential function

$$\psi_s(C^s, l, y) = e^{\,y^s_{l_{C^s}}} \tag{6.6.3}$$

as increasing with the estimated soft-assignment y, which serves as a prior. Please note that the soft-assignment y is indexed by a spatial position and a glyph from alphabet A, not a specific instance of it. Thus the necessity to index it by $l_{C^s}$ in the case as applied here. The edge potential function

$$\psi_{s,t}(C^s, C^t, l) = \alpha_{s,t}(C^s, C^t, l) = \alpha_{s,t}(C^s, C^t, l) \times e^{0} \tag{6.6.4}$$

models the topology of the conditional random field according to the indicator function α which we have discussed in Section 6.2.
The indicator function α has a value of 1 whenever the hard assignments $C^s$ and $C^t$ in pixels s and t are valid according to the truth label string l and 0 otherwise. The edge potential function therefore also assumes a value of one or zero depending on whether the neighborhood relation is valid or not. Rewriting this as $\alpha \times e^{0}$ allows it to fit better into the framework of graphical models and practical use, as we will see now.

The edge potential function $\psi_{s,t}$ of a graphical model is essentially an n × n matrix with n being the number of states of each random variable, in our case the size of the label space, and the coefficients of this matrix define the edge between the states in these two nodes of the graphical model. This matrix may contain structural zeros, that is zero coefficients which denote states that are not valid neighbors. As we can see from Equation 6.6.1, configurations C which contain neighbors with such structural zeros have, correctly, a likelihood of zero according to the joint distribution. In our case the structural zeros depend on the structure of the label space and pixel space as defined by the indicator function α.

Undirected graphical models with exponential potential functions, specifically where low-energy configurations have a high probability, are called energy-based models, see e.g. Murphy [93, ch. 19.3.1] for further information, and are common in modeling physical systems. In practice, choosing exponential potential functions offers benefits for the implementation of loopy belief propagation on computers. Applying LBP in sum-product mode, as we will do in this section, requires the repeated application of summation and multiplication operations to the values of the potential functions. Numerical stability is increased in such cases if these operations are done in logarithmic scale. The work on connectionist temporal classification [43, ch. 7.3.1] contains the required equation for addition in logarithmic scale

$$\ln(a + b) = \ln a + \ln\left(1 + e^{\ln b - \ln a}\right) \tag{6.6.5}$$

with multiplication

$$\ln(a \times b) = \ln a + \ln b \tag{6.6.6}$$

and the identity

$$\ln(e^{a}) = a \tag{6.6.7}$$

being general knowledge. These equations allow a numerically stable implementation of the sum-product algorithm in loopy belief propagation and thus favor exponential potential functions.

Restating the edge potential function $\psi_{s,t}$ as a product of two terms, the indicator function α defining the structural zeros and the constant $e^{0}$, allows for an efficient implementation of loopy belief propagation. The above discussions and equations define the conditional random field as used for the alignment z to complete the EM training of Section 6.5. Approximate inference of the node marginals to retrieve the aligned soft-assignment z will be the next topic.

Inferring the Aligned Soft-Assignment

We will now discuss how to apply loopy belief propagation in sum-product mode to the conditional random field described above. As discussed before in Section 2.2, belief propagation [100] is a message passing algorithm that computes the marginalized beliefs, in sum-product mode, or maximum posterior states, in max-product mode, of graphical models. If the graphical model at hand is a polytree, belief propagation will yield the exact marginals. Cyclic graphs, as in our case, are not polytrees, but belief propagation can still be applied iteratively in order to retrieve approximated marginals [100, p. 195]. This iterative variant of belief propagation applied to cyclic graphs is called loopy belief propagation.

The term beliefs in this context refers to the inferred likelihoods of the unobserved variables, in our case the alignment z, on the basis of the observed variables, here the DNN prediction y, and the given graphical model topology. In the case of the alignment problem in this chapter, the beliefs $\mathrm{bel}^s_i$ are proportional to the marginals of Equation 6.5.8: $\mathrm{bel}^s_i \propto z^s_i$.
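The logarithmic-scale operations of Equations 6.6.5 and 6.6.6, on which a sum-product implementation can rely, may be sketched as below. The function names `log_add` and `log_mul` are hypothetical. Two details go beyond the equations and are choices of this sketch: the arguments are swapped so that the exponent is never positive, and a structural zero is represented as `-math.inf`, the logarithm of zero.

```python
import math


def log_add(ln_a, ln_b):
    """Return ln(a + b) from ln a and ln b, following Equation 6.6.5."""
    if ln_a == -math.inf:   # a is a structural zero
        return ln_b
    if ln_b == -math.inf:   # b is a structural zero
        return ln_a
    if ln_b > ln_a:         # keep the exponent below non-positive
        ln_a, ln_b = ln_b, ln_a
    return ln_a + math.log1p(math.exp(ln_b - ln_a))


def log_mul(ln_a, ln_b):
    """Return ln(a * b) from ln a and ln b, following Equation 6.6.6."""
    return ln_a + ln_b


tiny = -1000.0                  # logarithm of a value far below float range
print(math.exp(tiny))           # the value itself underflows in linear scale
print(log_add(tiny, tiny))      # the sum stays representable in log scale
print(log_mul(tiny, tiny))
```

Values whose linear-scale representation underflows remain distinguishable in logarithmic scale, which is why long products of small message values are better accumulated this way.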
Referring back to the discussions of Section 2.2 and especially Algorithm 2.2.1, we will now define the message passing of loopy belief propagation for approximating the aligned soft-assignment $z$.

In the following equations we use the variables $x_s, x_t$ as random variables expressing the state of the pixels, that is, the nodes of the CRF, at spatial positions $s$ and $t$. This is in contrast to the usage of $x$ as input into the deep neural network. The message update

$$m_{s \to t}(x_t) = \sum_{x_s} \Big[ \psi_s(x_s, l, y)\, \psi_{s,t}(x_s, x_t, l) \prod_{u \in \mathrm{nbr}(s) \setminus t} m_{u \to s}(x_s) \Big] \tag{6.6.8}$$

contains the local belief, based only on pixel $s$, about the likelihoods of the discrete states in pixel $t$. Each message is built in a multi-step process. First, the evidence for node $s$ is collected from its neighbors, except node $t$. This evidence is then adjusted by its prior, namely the node potential $\psi_s$, and finally transformed into a belief about node $t$ via the edge potential $\psi_{s,t}$. Thus each message is the belief about the probability distribution in a specific node, given one of its neighbors and the prior information available.

In loopy belief propagation, the message values $m_{s \to t}(x_t)$ are updated and stored at each iteration so that they can be used for updating the other messages in the message passing process. Updating the messages is repeated iteratively until the predefined convergence criteria are met. See Algorithm 2.2.1 or the literature[93, ch. 22] for the full algorithm.

Given all the messages within the CRF, we define the beliefs

$$\mathrm{bel}^s_i \propto \psi_s(i, l, y) \prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(i) \tag{6.6.9}$$

where $s$ is a spatial position in pixel space and $i$ a position within the label space. As stated before, in sum-product mode these beliefs are proportional to the marginals of Equation 6.5.8, and as such we can use the same normalization to approximate

$$z^s_i \approx \frac{\mathrm{bel}^s_i}{\sum_{j \in [1, |l|]} \mathrm{bel}^s_j} \tag{6.6.10}$$

with $i$ and $j$ both being positions in the label space defined by the truth label string $l$.
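The message update of Equation 6.6.8 and the normalized beliefs of Equations 6.6.9 and 6.6.10 can be sketched with NumPy for a single node. This is a hedged illustration in linear scale, not the implementation of this thesis: the function names and the array layout (potentials as vectors over the label positions, the edge potential as a matrix with rows indexed by $x_s$ and columns by $x_t$) are assumptions made for the example.

```python
import numpy as np


def message_update(node_pot_s, edge_pot_st, incoming):
    """Compute m_{s->t}(x_t) as in Equation 6.6.8.

    node_pot_s:  shape (n,), node potential psi_s for each state x_s.
    edge_pot_st: shape (n, n), edge potential psi_{s,t}[x_s, x_t].
    incoming:    shape (k, n), messages m_{u->s} from all neighbors u of s except t.
    """
    # Collect the evidence from the neighbors and adjust it by the prior psi_s.
    evidence = node_pot_s * np.prod(incoming, axis=0)
    # Sum over x_s while transforming the evidence via the edge potential.
    return evidence @ edge_pot_st


def normalized_belief(node_pot_s, incoming_all):
    """Approximate z^s as in Equations 6.6.9 and 6.6.10.

    incoming_all: shape (k, n), messages m_{t->s} from all neighbors t of s.
    """
    belief = node_pot_s * np.prod(incoming_all, axis=0)
    return belief / belief.sum()
```

A structural zero in `edge_pot_st` removes the corresponding pair of states from the sum, so an invalid neighborhood contributes nothing to the message.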
Approximating the beliefs for all spatial positions $s$ and all label positions $i$ approximates the aligned soft-assignment $z$ and thus completes the expectation-maximization training described in this chapter.

Convergence Criteria

So far we have discussed how the aligned soft-assignment $z$ can be inferred using loopy belief propagation. As stated, LBP is an iterative algorithm and requires convergence criteria in order to stop the iteration and use the best available beliefs.

There are some theoretical considerations. At the time of writing it is not clear[94] under which conditions the beliefs in loopy belief propagation converge, or whether the beliefs are near the exact solution if they do converge. However, practical application of loopy belief propagation to Markov random fields and conditional random fields shows that LBP can be employed successfully. It seems that if the beliefs converge to a stable point, they are a reasonable approximation of the true posteriors. In some cases, LBP oscillates without converging to a stable point.

We will now discuss how these problematic behaviors of LBP can be defused for the method proposed in this thesis. For this we will choose appropriate convergence criteria and test whether the approximated marginals are near the exact marginals.

The first convergence criterion is derived from the expectation-maximization distortion function presented in Equation 6.5.1. One term of the distortion function minimizes the Edit-distance between the truth label string $l$ and the aligned soft-assignment $z_\Sigma$. When approximating the soft-assignment $z$ in LBP, we can at the same time approximate $z_\Sigma$ by applying Equation 6.3.2. This in turn means that at every iteration of loopy belief propagation, we can approximate the soft-assignment $z_\Sigma$, decode it with the decoder algorithm presented in Chapter 5 and compute the Edit-distance between the decoded alignment and the truth label string $l$.
This is the term $\mathrm{Edit}(\mathrm{Decoder}(z_\Sigma), l)$ of the EM distortion function of Section 6.5. If this Edit-distance is at its lower limit of zero, meaning no difference at all between the decoded string and the truth string, we can stop the iterations of LBP and use the current marginals as the approximated alignment. While this gives no indication of how large the difference between the approximated soft-assignment $z$ and its true counterpart is, it at least means that there is no string $o \neq l$ for which $P(o \mid z_\Sigma) > P(l \mid z_\Sigma)$ holds according to Equation 5.3.2. This means that $z$ is sufficiently close to the exact solution to apply the discussed expectation-maximization training, which optimizes the deep neural network parameters towards predicting a soft-assignment $y$ that decodes to the correct label string during transcription. This convergence criterion thus defuses the problem that LBP sometimes converges to a stable point that is not close to the exact solution.

The second consideration is to prevent loopy belief propagation from running indefinitely when it does not converge towards a stable point. Armed with the knowledge that we can test the current beliefs for sufficient closeness to the true marginals, we can simply stop LBP after a fixed number of iterations. If the beliefs converge towards a point where they decode to a string with an Edit-distance of zero to the truth label string, LBP is stopped early and these sufficient beliefs are used as the aligned soft-assignment $z$. If no such beliefs are found within the fixed number of iterations, LBP is stopped and the example is ignored for the current expectation-maximization iteration. This effectively removes the example from the training data set for the current epoch of deep neural network training.
It may be that a sufficient solution is found for the same training example in later iterations of EM, after the DNN parameters have been optimized towards correct transcription of multi-line text.

These two ideas in combination result in the following formulation of loopy belief propagation, based on Algorithm 2.2.1, in the context of expectation-maximization as discussed in Section 6.5:

Algorithm 6.6.1 Loopy Belief Propagation in MDCC
  Input: predicted soft-assignment $y$.
  Input: truth label string $l$.
  Initialize messages $m_{s \to t}(x_t) = m_{t \to s}(x_s) = 1$ for all edges $s \sim t$.
  Initialize beliefs $\mathrm{bel}^s_i = 1$ for all nodes $s$.
  Choose a random but fixed order for message updates.
  Propagate beliefs until convergence criteria are met:
  repeat
    Send messages along each edge:
      $m_{s \to t}(x_t) = \sum_{x_s} \big[ \psi_s(x_s, l, y)\, \psi_{s,t}(x_s, x_t, l) \prod_{u \in \mathrm{nbr}(s) \setminus t} m_{u \to s}(x_s) \big]$
    Update beliefs for each node:
      $\mathrm{bel}^s_i \propto \psi_s(i, l, y) \prod_{t \in \mathrm{nbr}(s)} m_{t \to s}(i)$
    Compute approximate $z$ and $z_\Sigma$ based on these beliefs.
    Test if these beliefs decode correctly:
    if $\mathrm{Edit}(\mathrm{Decoder}(z_\Sigma), l) = 0$ then
      Return soft-assignment $z$ and use it in EM training.
    end if
  until limit on the number of iterations is reached.
  Return and discard the training example for the current EM iteration.

Loopy belief propagation as outlined in Algorithm 6.6.1 prevents both infinitely running LBP iterations and training the deep neural network towards invalid predictions. It does so by introducing a limit on the number of iterations and by using only training examples for which the CRF approximation decodes to the truth label string. On the other hand, it potentially removes some training examples from the training data set and adds them back later. This is not a problem from an algorithmic viewpoint, since the expectation-maximization training presented in Section 6.5 is already a batch version and the optimization of the DNN parameters already has a non-stationary loss in the expectation-maximization loop of MDCC.
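The control flow of Algorithm 6.6.1 can be sketched as follows. The sketch only covers the two stopping rules; the actual message passing and the decoder of Chapter 5 are passed in as callables. The names `run_lbp_with_limit`, `lbp_iteration` and `decodes_to_truth` are hypothetical and chosen for this illustration only.

```python
from typing import Callable, Optional, TypeVar

Z = TypeVar("Z")


def run_lbp_with_limit(
    lbp_iteration: Callable[[], Z],
    decodes_to_truth: Callable[[Z], bool],
    max_iterations: int,
) -> Optional[Z]:
    """Run LBP until the beliefs decode to the truth string or the limit is hit.

    lbp_iteration:    performs one sweep of message and belief updates and
                      returns the current approximate soft-assignment z.
    decodes_to_truth: returns True if Edit(Decoder(z_Sigma), l) == 0.
    Returns z on success, or None if the example is to be discarded for the
    current EM iteration.
    """
    for _ in range(max_iterations):
        z = lbp_iteration()
        if decodes_to_truth(z):
            return z  # sufficient beliefs found, use them in EM training
    return None  # iteration limit reached, discard the example for now
```

A caller would skip every training example for which the function returns `None` and retry it in the next expectation-maximization iteration.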
Non-stationary objectives in the optimization of deep neural networks can be addressed by the Adam[67, 84] method.

This concludes the training of the deep neural network towards transcription of multi-line text, which was the goal of this chapter. The following chapters will focus on experimentation with and application of both the discussed decoding algorithm and the training algorithm.

6.7 Emphasizing Segmentation

Section 4.2 stated that one possible solution to Sayre's knot is to treat segmentation and transcription of handwritten text as two products of the same process, not as two different processes. Treating them as two different processes introduces a circular dependency between both, which constitutes Sayre's knot. So far this chapter and Chapter 5 have discussed transcription of multi-line paragraphs using multi-dimensional connectionist classification. The transcription task is encoded in the conditional random field, defined by its structure as discussed in Section 6.2, followed by a suitable decoding algorithm as proposed in this thesis. Segmentation, that is, a correct assignment between pixels in the soft-assignment predicted by the DNN and the presented input image, has so far not been discussed in the context of MDCC. Emphasizing segmentation instead of or in addition to transcription is the topic of this section. It is worth noting that this is an idea on how to emphasize segmentation in multi-dimensional connectionist classification; implementation and evaluation of this approach are not in the scope of this thesis.

Part of the training algorithm of MDCC is to estimate the true soft-assignment of the ground truth label sequence over the two-dimensional soft-assignment estimated by the deep neural network. This alignment process is implemented by constructing a suitable conditional random field, followed by approximate inference using loopy belief propagation. A CRF is defined by its node and edge potentials.
Edge potentials encode the 'compatibility' of labels in neighboring pixels of the CRF. In the case of MDCC, these edge potentials encode neighborhood relations between characters in the ground truth label sequence. As such, edge potentials can be interpreted as facilitating the correct transcription of the text.

Node potentials of a CRF, on the other hand, encode the 'compatibility' between pixels of the CRF and their labels without taking neighboring pixels into account. Node potentials thus depend only on the spatial position of the pixel and the assigned label. In MDCC the node potential function is given by Equation 6.6.3. It defines the node potential as a function of the glyph probabilities estimated by the deep neural network. This is because the idea behind MDCC is to modify the soft-assignment estimated by the DNN as little as possible, but still correct it to facilitate correct glyph neighbors and thus correct transcription. Equation 6.5.1 necessitates this approach, since it defines the distortion function for expectation-maximization in MDCC in such a way that the soft-assignment estimated by LBP needs to show a low cross-entropy towards the soft-assignment estimated by the DNN.

In this section we will discuss the corresponding modification to the node potential function. The original node potential function

$$\psi_s(C^s, l, y) = e^{y^s_{l_{C^s}}} \tag{6.7.1}$$

is augmented by introducing an additional dependency on a static soft-assignment $k$ as

$$\psi_s(C^s, l, y, k) = e^{y^s_{l_{C^s}} + \theta \times k^s_{l_{C^s}}} \tag{6.7.2}$$

where $0 < \theta < 1$ is a constant coefficient to weigh the DNN-estimated soft-assignment $y$ against the static soft-assignment $k$. The static soft-assignment $k$ introduces a prior to the node potential function. A similar approach of augmenting the node potential function is detailed in Section 7.3 for the implementation of a two-dimensional forced alignment.
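As a worked illustration of Equations 6.7.1 and 6.7.2, consider a single pixel $s$ and a single label. The numbers are assumed for this example only and do not come from the experiments of this thesis: a DNN-estimated probability of $0.2$, a static soft-assignment of $1$ for the same label, and $\theta = 0.5$.

```latex
\begin{align*}
\psi_s(C^s, l, y) &= e^{0.2} \approx 1.221 \\
\psi_s(C^s, l, y, k) &= e^{0.2 + 0.5 \times 1} = e^{0.7} \approx 2.014 \\
\frac{\psi_s(C^s, l, y, k)}{\psi_s(C^s, l, y)} &= e^{0.5} \approx 1.649
\end{align*}
```

A label supported by the static soft-assignment thus has its node potential scaled by $e^{\theta \times k}$ relative to the original potential, while labels with $k = 0$ keep their original potential.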
Emphasis is put on segmentation by choosing the static soft-assignment $k$ in such a way that it preserves and represents the assignment between spatial positions in the presented input image and glyphs of the alphabet in use. One way to produce this static soft-assignment would be to move a sliding window over the input image and perform single character recognition using e.g. a convolutional neural network or a support-vector machine. Since this soft-assignment depends only on the input image, it can be computed once per data set and then reused, reducing the impact of this approach on the overall training time in MDCC.

The question is how low the error rate of this single character recognition must be, since its task stands in competition with MDCC and it may seem that augmenting the node potential function just moves the paragraph-level transcription problem to another algorithmic abstraction level. However, the CRF in MDCC encodes the truth label sequence in its edge potentials, and as such the soft-assignment approximated by the CRF and LBP always decodes to the correct string. The single character recognition that produces the soft-assignment $k$ thus does not need to be 100 percent correct. It just needs to preserve the spatial relationship between pixels in the input image and glyphs of the alphabet while being correct in enough cases. 'Enough cases' means that each true prediction of the single character recognition fixes the related node in the CRF to one single label, or at least to a small number of labels. This in turn reduces the possible configurations for placing the ground truth label string to the subset that respects this constraint. This means that each true prediction by the single character recognition, incorporated in the soft-assignment $k$, improves the quality of the segmentation provided by the soft-assignment in MDCC, both as estimated by the DNN and as approximated by the CRF.
Emphasis should be placed on incorporating only reliable, that is highly likely, predictions of the single character recognizer into the static soft-assignment $k$. Applying a threshold, if possible, to the predictions of the single character recognizer facilitates this precaution. This threshold could e.g. be a lower limit on the probabilities given by a softmax function in a convolutional neural network or a lower limit on the separation margin in a support-vector machine. Following this approach, adding a few high-probability predictions to the soft-assignment $k$ should yield better results than adding many low-probability predictions.

As stated, this approach of emphasizing segmentation was not implemented or evaluated in the scope of this thesis. This section still serves as a reminder that MDCC can easily be modified to support variants of the paragraph-level transcription task.

Chapter 7
Text Recognition for Paragraphs

7.1 Overview

In Chapters 5 and 6 we have discussed the methodology and theory for using a deep neural network (DNN) for segmentation-free multi-line offline text transcription. The pipeline first employs a deep neural network and the training algorithm of Chapter 6 to estimate a probabilistic soft-assignment between pixels of the input image and glyphs from the alphabet. Chapter 5 finalizes the transcription pipeline by providing the multi-line decoding algorithm to produce a highly likely string from the estimated soft-assignment. In total, the pipeline discussed in these two chapters of the thesis is capable of transcribing multi-line text from an image. So far the discussion has been on theory and the resulting methods; this chapter details the experiments that have been conducted with this methodology and their results.

This chapter will discuss the practical application of the proposed method and the modifications necessary for this.
These practical changes include data augmentation on the training data set, a well-known approach in deep learning where the training data is slightly altered in a random fashion. These variations lead to a better generalization of the trained model. We will also discuss a two-dimensional forced alignment[124] approach that serves as an initializer for the conditional random fields used in multi-dimensional connectionist classification (MDCC). Both data augmentation and forced alignment were applied to improve the method's error rate on the data set used. The specific problems addressed by these approaches are discussed in Sections 7.2 and 7.3.

Section 7.4 will discuss the specific topology of the deep neural network used for the experiments. The network used for the experiments of this chapter is a combination of a convolutional neural network (CNN) and a recurrent neural network (RNN) in the form of long short-term memory (LSTM) cells. The corresponding section refers back to Chapter 2 for the neural network topologies.

The IAM offline handwriting database[88] served as the basis for all experiments in this chapter. It is the quasi-standard data set for evaluating and comparing segmentation and transcription algorithms on handwritten English text. The experiments and results on this data set, using the previously discussed method and its practical implementation, are detailed in Section 7.5. This section also includes comparisons with the methods discussed in Chapter 3 that either address the same problem as MDCC or are current state-of-the-art methods on the aforementioned IAM database.

7.2 Data Augmentation

The IAM offline handwriting database[88] was used for all the experiments described in this chapter. The IAM database is a set of scanned pages of English handwritten text.
Each page of the IAM database was generated by selecting a text from the Lancaster-Oslo/Bergen (LOB) corpus[62], printing it at the top of a blank physical sheet of paper and then letting a human writer copy the text to the free area below the machine-printed text. Finally, each physical page of handwritten text was scanned at 300 dpi resolution and stored as a grayscale digital image. The truth label strings provided with the IAM database match the handwritten text and reflect the line breaks of the handwritten text, not of the machine-printed text, and also include potential spelling errors.

The IAM database was designed primarily for training and evaluation of line- and word-wise handwritten text transcription and segmentation methods. The handwritten text lines typically have spacing between them. From the IAM offline handwriting database description[88, sect. 2]:

    As the main focus of the research that led to the acquisition of the database described in this paper is on high-level recognition using language models, we wanted to make the image processing part as easy as possible. Therefore, it was decided that the writers had to use rulers. These guiding lines, with 1.5 cm space between them, were printed on a separate sheet of paper which was put under the form.

This poses the first reason for applying data augmentation to the IAM database in the context of this thesis. The method proposed in this thesis is designed for transcribing multi-line handwritten text without prior segmentation, even in the face of overlapping text lines. As such, a certain amount of overlap between text lines is expected in the training data. Line overlaps are artificially created by applying data augmentation.

The second reason for data augmentation is the number of examples available in the IAM database.
All experiments in this chapter use the official split for the large writer independent text line recognition task, which splits the IAM database into four sets (training, validation 1, validation 2 and test) without any overlap of writers between the splits. In the experiments, the training set was used for training the deep neural network using multi-dimensional connectionist classification. Validation set 1 was used for hyper-parameter tuning and model selection. The test set was used for evaluation and comparison with existing works. Validation set 2 was not used in the experiments of this chapter, but in those of Chapter 9.

The number of examples in these splits is listed in Table 7.1. The table shows that the number of examples, in the case of this work the number of paragraphs, in the training and validation sets is on the lower end for robust training of a deep neural network and the associated hyper-parameter optimization.

Table 7.1: Characteristic sizes of the large writer independent text line recognition task on the IAM offline handwriting database.

                     Training   Validation 1   Validation 2   Test
Num. Paragraphs      747        105            115            232
Num. Lines           6161       900            940            1861
Num. Writers         283        46             43             128

To counter these two properties of the IAM database, data augmentation was applied to the training and validation sets. Data augmentation is a technique in machine learning by which the number of examples in a data set is artificially increased by modifying or perturbing the original examples in a systematic way, although sometimes with a random component. The test or evaluation data is typically not augmented, since that would result in distorted and not directly comparable results and error rates. The main goal of data augmentation is to increase the number of examples used for automatic parameter optimization or manual hyper-parameter optimization of the machine learning model in order to reduce the likelihood of overfitting the data.
One data augmentation method applied to the training and validation sets in the experiments of this chapter is to artificially reduce the line spacing between the text lines. This can be done on the IAM database since the annotated data contains segmentation information at line level. This segmentation data was manually corrected by the authors of the IAM database and is thus considered to be the 'perfect' segmentation without error. Using this segmentation information, individual text lines were extracted and then combined again into a whole paragraph image, but with each line moved upwards vertically by a fixed number of pixels. The number of pixels by which the text lines were moved was based on a chosen distance in millimeters and the known image resolution of 300 dpi. Since this reduced line spacing results in overlaps between the text lines, as required for data augmentation purposes, a pixel-wise logical OR operation was applied to the text line images while merging: if the pixel in question was dark from ink in either of the overlapping text lines, the resulting merged pixel is dark in the augmented example. In typical grayscale image encoding schemes, this means the minimal numerical pixel value was used. In this way the IAM data set was augmented by reducing the line spacing of the paragraphs in all examples of the training and validation sets by 3 mm, 5 mm and 10 mm. The truth label string of the examples was not modified by this augmentation.

Figure 7.2.1 shows one example from the IAM database. It is the example image as it is in the database, but cropped to the minimal axis-parallel rectangle around the handwritten paragraph. This crop was necessary since the full paragraph images delivered with the IAM database are digital scans of the whole page, including metadata and the truth text in machine print. In order to retrieve the handwritten paragraphs, image cropping was applied to all data examples within the training, validation and test sets.
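The line spacing reduction described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the code used for the experiments: the function names are hypothetical, each text line is assumed to be given as a grayscale crop together with the top-left corner of its annotated bounding box, and line $i$ (counted from zero) is assumed to be moved up by $i$ times the shift so that every gap between adjacent lines shrinks by the same amount.

```python
import numpy as np


def mm_to_pixels(mm: float, dpi: int = 300) -> int:
    """Convert a distance in millimeters to pixels at the given resolution."""
    return round(mm / 25.4 * dpi)


def reduce_line_spacing(lines, shift_px, height, width):
    """Merge text line crops into one paragraph image with reduced spacing.

    lines: list of (image, top, left) tuples; image is a 2D uint8 array in
           which dark ink has low values, (top, left) is its original position.
    """
    canvas = np.full((height, width), 255, dtype=np.uint8)  # white page
    for i, (img, top, left) in enumerate(lines):
        new_top = top - i * shift_px  # assumption: cumulative upward shift
        if new_top < 0:
            raise ValueError("text line would be moved above the canvas")
        h, w = img.shape
        region = canvas[new_top:new_top + h, left:left + w]
        # Pixel-wise OR on ink corresponds to the minimum of the gray values.
        np.minimum(region, img, out=region)
    return canvas
```

For example, `mm_to_pixels(10.0)` gives the shift in pixels for a reduction of 10 mm at 300 dpi.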
Figure 7.2.1 is thus an example of the images that were used in the experiments of this chapter when no data augmentation was applied. These original paragraphs were cropped to contain only the handwritten paragraph, but otherwise not modified.

Figure 7.2.1: Example from the IAM offline handwriting database. Cropped to the minimal axis-parallel rectangle around the handwritten paragraph. The same crop was applied to all training, validation and test data.

Figures 7.2.2, 7.2.3 and 7.2.4 show the same example paragraph from Figure 7.2.1, but with line spacing reduced by 3 mm, 5 mm and 10 mm respectively.

Figure 7.2.2: Example of Figure 7.2.1 with line spacing reduced by 3 mm.

Figure 7.2.3: Example of Figure 7.2.1 with line spacing reduced by 5 mm.

Figure 7.2.4: Example of Figure 7.2.1 with line spacing reduced by 10 mm.

A second form of data augmentation used in this work is to artificially increase the number of paragraphs by splitting them into smaller parts of at least two text lines. This again can be done since the annotated data of the IAM database contains the true line segmentation information. Data augmentation was applied by generating sub-paragraphs of at least two text lines by cropping the minimal axis-parallel rectangle around the selected text lines according to the annotated segmentation information. This extraction of sub-paragraphs was only done if the resulting axis-parallel rectangle around the text lines did not intersect with the neighboring lines, meaning that no sliver of text from adjacent text lines was included. Only if this was the case for the cropped image region was the sub-paragraph included in the augmented data set. This process of cropping sub-paragraphs was done for all examples and for all possible combinations of starting and ending text lines, given that the resulting crop would not overlap with non-included text lines.
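The enumeration of valid sub-paragraphs can be sketched in plain Python. This is a hedged illustration: the function names and the box format `(top, left, bottom, right)` with exclusive bottom and right coordinates are assumptions, and the sketch also returns the crop spanning all text lines, which may or may not have been counted as a sub-paragraph in the experiments.

```python
from itertools import combinations


def union_box(boxes):
    """Minimal axis-parallel rectangle around the given (top, left, bottom, right) boxes."""
    tops, lefts, bottoms, rights = zip(*boxes)
    return min(tops), min(lefts), max(bottoms), max(rights)


def intersects(a, b):
    """True if the two boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def valid_sub_paragraphs(line_boxes):
    """Return (start, end, crop) for all runs of at least two adjacent text lines
    whose crop does not intersect any text line outside the run."""
    result = []
    # combinations yields start < end, so every run has at least two lines.
    for start, end in combinations(range(len(line_boxes)), 2):
        crop = union_box(line_boxes[start:end + 1])
        others = line_boxes[:start] + line_boxes[end + 1:]
        if not any(intersects(crop, other) for other in others):
            result.append((start, end, crop))
    return result
```

The truth label string of each returned run would then be restricted to the annotated text of lines `start` through `end`.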
At least two text lines were cropped per sub-paragraph in order to ensure that at least one line separator remains and the example is thus still a multi-line paragraph. The annotated label string needed to be modified accordingly, by using only the annotated text related to the cropped text lines.

Figure 7.2.5 shows the crop of the first two text lines of the example given in Figure 7.2.1. Figure 7.2.6 shows another sub-paragraph cropped from this example. In total, 15 sub-paragraphs were extracted by data augmentation for this example.

Figure 7.2.5: Example of a valid sub-paragraph from Figure 7.2.1 cropped to lines 1 and 2.

Figure 7.2.6: Example of a valid sub-paragraph from Figure 7.2.1 cropped to lines 2 through 4.

These two data augmentation methods, reducing the line spacing between text lines and cropping sub-paragraphs of adjacent text lines, were applied to all examples of the training set and validation set 1 of the IAM database. The test set was not augmented at all. This drastically increased the number of examples in these sets. Table 7.2 shows the number of examples after data augmentation. The number of training examples for the experiments went up from 747 to 20698 and the number of validation examples from 105 to 3163.

Table 7.2: Number of examples in the large writer independent text line recognition task on the IAM database after data augmentation.

                              Training   Validation 1   Test
Num. Original Paragraphs      -          -              232
Num. Reduced Line Spacing     2241       315            -
Num. Sub-Paragraph Crops      18457      2848           -
Sum total                     20698      3163           232

The data augmentation described in the above paragraphs and the data splits of Table 7.2 were used for all experiments in the remainder of this chapter.

7.3 Forced Alignment

Idea and One-Dimensional Forced Alignment

In Chapter 6 we have discussed multi-dimensional connectionist classification (MDCC), the training algorithm proposed in this thesis.
MDCC employs a conditional random field to approximate the alignment of the truth label string over the two-dimensional pixel space of the DNN estimation. We will now discuss a modification of this alignment method that was used in the experiments of this chapter to improve the convergence speed of the DNN training in the initial phase. This modification is based on the idea of forced alignment[124] for deep neural networks trained with connectionist temporal classification[46, 47].

The loss function of connectionist temporal classification implicitly computes the one-dimensional alignment of the truth label string over the DNN estimation. This implicit alignment can also be done explicitly; the loss function has to be changed to cross-entropy in this case. This is where forced alignment comes in. In the initial phase of training, the parameters of the deep neural network are random and then iteratively updated, which leads to more or less random estimations from this DNN in the beginning. In the case of random DNN estimations, connectionist temporal classification will still produce a valid alignment of the truth label string. However, the localization of the characters of this label string in the one-dimensional DNN estimate will be random, too. As shown in the forced alignment paper[124], this hinders the DNN optimization in the initial phase.

Forced alignment combats this effect by explicitly generating an alignment based on assumptions about how handwritten text is structured, without taking the DNN estimate into account, and uses this alignment to optimize the deep neural network via the cross-entropy loss. These assumptions are that each character produces only a 'spike' (a narrow peak in probability for this character), that this spike is either at the beginning, in the middle or at the end of the character, and that each character is of roughly the same width.
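A one-dimensional forced alignment under these assumptions can be sketched as follows. This is an illustrative sketch, not the procedure of the forced alignment paper[124]: the function name is hypothetical, the spike is assumed to sit in the middle of each character cell, and all remaining frames are assumed to be assigned to a separator class with index 0.

```python
import numpy as np


def forced_alignment_1d(label_ids, num_frames, num_classes, separator_id=0):
    """Build a (num_frames, num_classes) one-hot target with one spike per character.

    Assumes num_frames >= len(label_ids) so that no two spikes share a frame.
    """
    target = np.zeros((num_frames, num_classes))
    target[:, separator_id] = 1.0  # every frame starts out as separator
    width = num_frames / len(label_ids)  # all characters are equally wide
    for i, label in enumerate(label_ids):
        frame = int((i + 0.5) * width)  # spike in the middle of the character
        target[frame, separator_id] = 0.0
        target[frame, label] = 1.0
    return target
```

The resulting array can serve directly as the target distribution of a frame-wise cross-entropy loss during the initial epochs.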
After some epochs of training, forced alignment is replaced by connectionist temporal classification for further optimization.

Figure 7.3.1: One-dimensional forced alignment on an example word from the IAM offline handwriting database. The characters in the forced alignment are uniformly spaced. (Shown: the characters G, a, i, t, s, k, e, l, l interleaved with glyph separators $\epsilon_g$.)

Two-Dimensional Forced Alignment

The same idea of using the truth label string and some assumptions about how multi-line text is structured can be used to compute a two-dimensional forced alignment. In the case of this work, the forced alignment is a two-step process: first, placing the text lines within the two-dimensional pixel space; second, placing the characters within each text line in pixel space.

Placing the text lines is based on the assumption that prototypical text lines are oriented horizontally and are roughly of the same height. Forced alignment in 2d thus places each text line as a perfectly horizontal, axis-parallel rectangle. All text lines have the same height, or the smallest possible height difference. Two adjacent text lines are separated by a horizontal pixel row of line separators $\epsilon_l$ of exactly one pixel in height. These line separators have a probability of 100 percent in their respective pixels of the forced alignment. It is worth noting that these assumptions for forced alignment are designed for stabilizing the MDCC training on the IAM database with its roughly horizontal text lines. Figure 7.3.2 shows the line separators $\epsilon_l$ of the forced alignment of the example in Figure 7.2.1.

Figure 7.3.2: Line separator $\epsilon_l$ from two-dimensional forced alignment on the example from Figure 7.2.1. Red encodes a high probability, blue a low one.

Placing characters such as the glyph separator $\epsilon_g$ or visible glyphs from the alphabet requires another set of assumptions on the structure of text.
They are assumed to be placed left to right, occupying the full vertical range within their respective text line, being of roughly the same width and having an unsharp transition between two adjacent charac- ters. Forced alignment of the characters starts by calculating the width of each character by dividing the width of the pixel space by the number of characters in the longest text line of the truth label string. This character width in pixels is then applied to all text lines. We then place the mid-points of each character in their text lines, beginning from the left with a margin of half a character width to the left border and one character width between each two adjacent characters. Converting these mid-points of the characters to probabilities is done by placing a normal distribution over each mid-point, with the pixel coordinate of the mid-point being the mean of the normal distribution. Character probabilities are then drawn from these normal distributions. Normalizing these drawn probabilities to sum up to exactly 100 percent per pixel yields the final probabilities for characters in 2d forced alignment. Figure 7.3.3 shows this for the glyph separator ϵg. Figure 7.3.4 shows the forced alignment of the glyph ‘e’ in the example from Figure 7.2.1. Figure 7.3.3: Glyph separator ϵg from two-dimensional forced alignment on the example from Figure 7.2.1. Red encodes a high probability, blue low a one. Not all text lines have the same amount of characters in length and we have already made the assumption to place the characters aligned to the left border and from there left- to-right. This means there is potential unused space to the right of the individual text lines in pixel space. Normalizing the probability vector in each pixel to a sum of 100 percent results in the last character of each text line filling up the space up to the right border of the pixel space. 
This is not how handwritten text is typically structured; instead, there is white space to the right of each line. This is modeled by placing a space character at the end of each text line, exactly at the rightmost pixel column. The probabilities for this trailing space are then also drawn from a normal distribution with its mean at this right border. Figure 7.3.5 illustrates the probabilities for the space glyph in the example of Figure 7.2.1.

Figure 7.3.4: Glyph ‘e’ from two-dimensional forced alignment on the example from Figure 7.2.1. Red encodes a high probability, blue a low one.

Figure 7.3.5: Space glyph from two-dimensional forced alignment on the example from Figure 7.2.1. Red encodes a high probability, blue a low one.

Placing the white space on the right border as shown in Figure 7.3.5 needs to reflect the truth label string and whether the text is expected to be left-aligned. Depending on the data, this filler space can also be placed on the left border or on both borders. The experiments in this chapter were done by aligning the text lines of the 2d forced alignment to the left border, since this is how the examples in the IAM offline handwriting database are written.

Using this technique, 2d forced alignment produces a soft-assignment with the same properties, but different likelihoods, as the soft-assignment z approximated by loopy belief propagation on a conditional random field as described in Section 6.6. This allows replacing the soft-assignment estimated by the conditional random field with this 2d forced alignment. However, with the use of conditional random fields there is a better way of integrating 2d forced alignment, which we will discuss in the next paragraphs.

Forced Alignment for Conditional Random Fields

We recall our discussion of Section 6.6 for the definition of a conditional random field via its node and edge potential functions.
In the case of multi-dimensional connectionist classification, the node potential is defined by Equation 6.6.3. The node potential function ψs gives the ‘compatibility’ between the character Cs of the label string l and a spatial position s, which in MDCC is proportional to the estimated likelihood as given by the soft-assignment y from the deep neural network. To include forced alignment in the CRF, we define the node potential function ψs such that it is also proportional to the forced alignment instead of depending only on the DNN prediction y. We can control the influence of FA by weighting it in this new node potential function

\[ \psi_s(C_s, l, y) = \exp\left( k_{DNN} \times y^{s}_{l_{C_s}} + k_{FA} \times \mathrm{FA}(s, C_s, l) + k_b \right) \tag{7.3.1} \]

where the constant coefficients kDNN and kFA define the relative weighting of the DNN prediction and the forced alignment. The constant kb is a bias that is identical for all spatial positions and all characters of the label string.

The bias kb = 1 was kept constant for all pixels and glyphs. This bias was introduced because it should be possible, in principle, for any glyph to occur in any pixel. This is not always reflected in the node potential function with forced alignment, because the pixels of the same character as estimated by the deep neural network and by forced alignment may be non-overlapping, leaving a ‘hole’ in between. The conditional random field is then in a self-contradicting state, sometimes called a ‘frustrated CRF’ in the literature, which introduces an unwanted random element to MDCC. Allowing any glyph in any pixel combats this phenomenon in MDCC by favoring a low-energy state over the whole CRF.

Both the estimate $y^{s}_{l_{C_s}}$ of the deep neural network and FA(s, Cs, l) are probabilities in the value range [0, 1], which makes choosing the constants kDNN and kFA easier.
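As a small worked illustration of Equation 7.3.1, the following sketch evaluates the node potential for a single pixel and character. The function name `node_potential` is hypothetical; the default constants use the values reported in this section (kDNN = 3, kFA = 1, kb = 1), and the input probabilities are made-up example values rather than measurements.

```python
import math

def node_potential(y_prob, fa_prob, k_dnn=3.0, k_fa=1.0, k_b=1.0):
    """Node potential exp(k_DNN * y + k_FA * FA + k_b) for one pixel and character.

    y_prob:  DNN soft-assignment for the character at this pixel, in [0, 1]
    fa_prob: forced-alignment probability for the same pair, in [0, 1]
    """
    return math.exp(k_dnn * y_prob + k_fa * fa_prob + k_b)

# Early in training the DNN output is low everywhere, so the forced
# alignment decides which of two candidate characters is preferred.
early_a = node_potential(y_prob=0.02, fa_prob=0.9)
early_b = node_potential(y_prob=0.02, fa_prob=0.1)
print(early_a > early_b)

# Later the DNN is confident, and its term outweighs a disagreeing
# forced alignment because k_DNN is larger than k_FA.
late_dnn = node_potential(y_prob=0.95, fa_prob=0.1)
late_fa = node_potential(y_prob=0.05, fa_prob=0.9)
print(late_dnn > late_fa)

# The bias keeps every potential strictly positive, even where both
# inputs are zero.
print(node_potential(0.0, 0.0))
```

Because both inputs lie in [0, 1], the exponent is bounded by kDNN + kFA + kb, which keeps the relative influence of the two terms easy to reason about.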
When beginning to train the deep neural network, with model parameters initialized randomly, the estimate $y^{s}_{l_{C_s}}$ will most likely also be a random low value for all pixels and all glyphs. This changes as the training of the deep neural network progresses: the estimated likelihoods increase for glyph-pixel combinations that the DNN deems correct and decrease otherwise. This means that the maximum value in the soft-assignment y increases over training time. Since the forced alignment term stays fixed while the DNN term grows, weighting the DNN and forced alignment with constants kDNN and kFA decreases the influence of the forced alignment over time. In this work, values of kDNN = 3 and kFA = 1 were chosen by experimentation with different values.

Using the node potential function of Equation 7.3.1 instead of 6.6.3 introduces forced alignment into multi-dimensional connectionist classification while decreasing the influence of the forced alignment over time. All experiments in this chapter were done using a two-dimensional forced alignment in this style.

7.4 Neural Network Model

Network Topology

In this section we will discuss the topology of the deep neural network used for the experiments in this chapter. We will discuss the topology itself as well as the ideas and observations that led to this choice for the neural network model.

The overall type of deep neural network is a mixture of a convolutional neural network and a recurrent neural network, with the mixture being layer-wise, that is, each layer is either convolutional or recurrent. This choice is based on two trains of thought. First, RNNs are well established and proven in the field of handwriting recognition. This trend can be seen starting with the publications[43, 45, 47] of Alex Graves on using multi-dimensional long short-term memory (MDLSTM)[45] for handwriting recognition. Later work[71, 102, 150] follows up on this trend.
The second idea underlying this topology comes from recent publications[12, 104] that discuss the possibility of using CNNs for handwriting recognition. A hybrid RNN-CNN model was thus chosen for this work to keep the benefits of the implicit language modeling capabilities of LSTM networks, while gaining speed benefits from using convolutional layers, which are well supported in GPGPU computing.

Better use of GPGPU capabilities is also the reason for using separable MDLSTM[156] layers instead of ‘classic’ multi-dimensional LSTM[45]. Separable MDLSTM only adds recurrent connections along a single dimension and not along all dimensions. Separable MDLSTM has been discussed in Section 2.3 of this thesis and a visualization of a separable MDLSTM is shown in Figure 2.3.19. MDLSTM layers with recurrent connections along all dimensions introduce dependencies into the computation of the neural activation at each pixel in such a way that only a small set of pixels can be computed in parallel. One way of applying GPGPU processing to MDLSTM layers is to compute the pixels on a common diagonal at the same time. This is implemented in the RETURNN[29] library. However, separable MDLSTM is easy to implement in common deep learning frameworks and parallelizes computation by treating the columns and rows of an image as mini-batches of one-dimensional sequences.

We will now continue with discussing the actual deep neural network topology used in the following experiments. Similar to the schema of Figure 2.3.19, Figure 7.4.1 shows the sequence of operations for a convolutional block in this work, since this exact sequence repeats in every convolutional block of the overall DNN topology.

[Figure 7.4.1, sequence of operations: Input Feature Map, Convolution 2D, Batch Norm 2D, Non-Linear Activation σ(x), Dropout, Output Feature Map]
Figure 7.4.1: Convolutional block as used in this work.
It consists of a two-dimensional convolution, followed by batch normalization, a non-linear activation function and a layer-wise dropout.

The convolutional block shown in Figure 7.4.1 consists of a two-dimensional convolutional layer followed by Batch Normalization[60] and a non-linear activation function. Batch Normalization was added because of the practical observation that it improves the convergence rate of the model error during training. Batch Normalization normalizes the same feature map for all examples within a batch, or over a larger history of examples, to a mean value of approximately zero and a standard deviation of approximately one.

The non-linear activation function applied to all convolutional blocks was Leaky ReLU (Rectified Linear Unit)[85], which is the piecewise linear function σ(x) = max(x, αx) with 0 < α < 1. For a positive value of x, Leaky ReLU is simply the identity function. Negative values of x still result in a linear activation, but with a lower slope of α. In practice, Leaky ReLU is fast because of its low computational complexity and partly mitigates the vanishing gradient effect by having a constant derivative of 1 for positive values and of α (with a typical value of α = 0.01) for negative values.

Dropout[102, 137] was added to the last three of the convolutional blocks of the DNN. Dropout improves generalization and reduces overfitting of the deep neural network by randomly removing feature maps during training. This results in a certain degree of redundancy in the neural network, since no feature map is reliable by itself. During inference, all feature maps are used without removal. Dropout was added only to the last three convolutional blocks based on the idea that the convolutional blocks closer to the input image learn to recognize geometrical features of handwritten glyphs, while convolutional blocks higher up in the network learn abstracted linguistic features.
This means that dropout in the later layers reduces overfitting to specific text, whereas dropout in the earlier layers affects generalization from specific writers. Reducing overfitting on the higher-level layers seems more prudent in this case, especially since (because of the pooling operations after the first convolutional blocks) there is effectively more training data available for the first layers of the neural network. However, this choice of dropout must probably be adapted for different data sets.

Figure 7.4.4 shows the overall deep neural network topology as used in the experiments in this chapter. The convolutional blocks are as described above and the recurrent blocks are separable MDLSTM as shown in Figure 2.3.19.

The input into the DNN as shown in Figure 7.4.4 is a grayscale image of the IAM offline handwriting database with augmentations as discussed in Section 7.2. The input image is presented to the network in multiple variants, two of which apply binarization methods in order to obtain a bimodal image.

Figure 7.4.2: Example paragraph image from the IAM database with the Otsu threshold applied.

The Otsu threshold[98, 130] computes a single scalar value as a threshold for separating the pixels of the digital image into two disjoint classes. The pixel assignment to these two classes, lower and higher intensity than the threshold, is the resulting bimodal image. The Otsu threshold is selected in such a way that the variance within each of the two classes is minimized. The method computes this threshold by first generating the intensity histogram of the image, followed by iteratively testing each threshold for its resulting intra-class variance based on this histogram. The threshold that minimizes the intra-class variance is selected. Figure 7.4.2 shows an example paragraph with the Otsu threshold applied.

Figure 7.4.3: Example paragraph image from the IAM database with Yen’s method applied.
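The threshold search of the Otsu method described above can be sketched in a few lines. This is a minimal illustration, not the implementation used for the experiments: the function names are hypothetical, pixels are assumed to be integer intensities in the range 0 to 255 given as a flat list, and ties between equally good thresholds are resolved by taking the lowest one.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t minimizing the weighted intra-class variance.

    Class 0 holds intensities below t, class 1 holds intensities >= t.
    """
    # Step 1: intensity histogram of the image.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)

    best_t, best_var = 0, float("inf")
    # Step 2: test every candidate threshold on the histogram.
    for t in range(1, levels):
        w0 = sum(hist[:t])
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class would be empty
        mu0 = sum(i * hist[i] for i in range(t)) / w0
        mu1 = sum(i * hist[i] for i in range(t, levels)) / w1
        var0 = sum(hist[i] * (i - mu0) ** 2 for i in range(t)) / w0
        var1 = sum(hist[i] * (i - mu1) ** 2 for i in range(t, levels)) / w1
        within = (w0 * var0 + w1 * var1) / total
        if within < best_var:
            best_var, best_t = within, t
    return best_t

def binarize(pixels, threshold):
    """Assign each pixel to the low (0) or high (1) intensity class."""
    return [1 if p >= threshold else 0 for p in pixels]

# Toy 'image': dark ink around intensity 11, bright paper around 200.
image = [10, 12, 11, 200, 198, 202]
t = otsu_threshold(image)
print(binarize(image, t))
```

Since the search only touches the histogram, its cost depends on the number of intensity levels and not on the image size, apart from building the histogram once.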
Similar to the Otsu threshold, Yen’s method[130, 159] is a histogram-based approach to image thresholding. The first step in this method is to compute the histogram of intensity values of the digital image. The threshold itself is calculated by a weighted combination of multiple threshold candidates in order to maximize the entropy within the two classes. Again, assigning pixels to the two classes according to this threshold value produces the binary image. Figure 7.4.3 shows an example for Yen’s method.

In the experiments in this chapter, paragraph images are presented to the DNN in four different variants:
1. The grayscale image with the values normalized such that the mean value is 0 and the standard deviation is 1 within the current image.
2. The grayscale image with normalization to mean 0 and standard deviation 1 over the whole training data set. The same normalization was also applied to samples outside the training data set.
3. A bimodal image produced by applying the Otsu threshold to the grayscale image.
4. The bimodal image from application of Yen’s method.

All four variants of the example image were provided to the DNN in order to give it the opportunity to automatically learn a good use of these variants during DNN training. The bimodal images are based on the idea that grayscale images of handwritten text basically consist of two different types of material: untouched paper and paper colored by ink. The difference between the two is the color of the pixel. Producing a bimodal image thus separates the actual writing from the background and reduces the low-level image operations that the DNN has to learn during training. This frees up filters in the first convolutional and recurrent blocks for other operations.

Figure 7.4.4 shows the overall topology of the deep neural network used in this thesis. It consists of ten blocks (five of each) of alternating convolutional and recurrent blocks.
The first three convolutional blocks are followed by average pooling operations. In the course of the experiments, both maximum pooling and average pooling operations were used. Average pooling showed a slightly lower error rate than maximum pooling.

[Figure 7.4.4, block sequence from input to output:
Input feature map: 1) image-wise normalization, 2) dataset-wide normalization, 3) local binarization, 4) global binarization
Convolutional block: 5x5 kernel, 16 filters, Leaky ReLU activation, no dropout; average pooling by 3x3
Separable MDLSTM: 32 filters, Tanh activation
Convolutional block: 5x5 kernel, 48 filters, Leaky ReLU activation, no dropout; average pooling by 3x3
Separable MDLSTM: 64 filters, Tanh activation
Convolutional block: 5x5 kernel, 80 filters, Leaky ReLU activation, 25% dropout; average pooling by 2x2
Separable MDLSTM: 96 filters, Tanh activation
Convolutional block: 5x5 kernel, 112 filters, Leaky ReLU activation, 25% dropout
Separable MDLSTM: 128 filters, Tanh activation
Convolutional block: 5x5 kernel, 144 filters, Leaky ReLU activation, 25% dropout
Separable MDLSTM: 160 filters, Tanh activation
Estimated soft-assignment: convolution with 1x1 kernel and one filter per glyph of the alphabet, pixel-wise softmax function]
Figure 7.4.4: Deep neural network as used in this work and the following experiments. It is a combination of a CNN and RNN with alternating layers of two-dimensional convolutions and separable MDLSTM. When used for one-dimensional transcription, a collapse layer is added between the final 1x1 convolution and the softmax function.

The final layer of the DNN is a pixel-wise feed forward layer, which is equal in function to a convolution with a kernel size of one by one pixels, with as many neurons as there are glyphs in the alphabet. This layer provides the estimate for assigning the specific glyph to the specific pixel in question. Normalization to probabilities is implemented by applying a pixel-wise softmax function to this estimate.
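To make the effect of the pooling operations and the final softmax more concrete, the following sketch lists the block sequence as read from Figure 7.4.4 and computes the spatial size of the estimated soft-assignment. Several points are assumptions and not stated in the text: pooling is taken to use a stride equal to its window size, convolutions and recurrent layers are taken to preserve the spatial size, and odd sizes are rounded down. The names `TOPOLOGY`, `output_size` and `pixelwise_softmax` are hypothetical.

```python
import math

# (block type, number of filters, pooling factor applied after the block)
TOPOLOGY = [
    ("conv", 16, 3), ("mdlstm", 32, 1),
    ("conv", 48, 3), ("mdlstm", 64, 1),
    ("conv", 80, 2), ("mdlstm", 96, 1),
    ("conv", 112, 1), ("mdlstm", 128, 1),
    ("conv", 144, 1), ("mdlstm", 160, 1),
]

def output_size(height, width):
    """Spatial size of the soft-assignment for a given input image size.

    Assumes stride == pooling window and size-preserving layers otherwise.
    """
    for _, _, pool in TOPOLOGY:
        height //= pool
        width //= pool
    return height, width

def pixelwise_softmax(scores):
    """Normalize the per-glyph scores of one pixel to probabilities."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Under these assumptions each output pixel covers 18 x 18 input pixels.
print(output_size(1800, 1440))
# Scores of the final 1x1 convolution for one pixel and a 3-glyph alphabet.
print([round(p, 3) for p in pixelwise_softmax([2.0, 0.5, -1.0])])
```

Under these assumptions the three pooling steps reduce each image dimension by a factor of 3 × 3 × 2 = 18, so the glyph probabilities are estimated on a much coarser grid than the input image.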
This same DNN topology can also be used for one-dimensional or line-wise handwriting recognition in combination with connectionist temporal classification. In this case a collapse layer is added in between the very last convolution, which predicts the glyph to pixel assignments, and the softmax function. The collapse layer is described in Section 3.1.

Optimization Method

Optimization of the deep neural network parameters was done automatically by applying multi-dimensional connectionist classification as discussed in Chapter 6 in combination with backpropagation[112, 113] and gradient descent[11, 65, 107]. See Section 2.3 for information on these two techniques.

Gradient descent uses the first-order derivative of the loss function, or optimization function in general, in order to incrementally change the model parameters towards a better solution. There are other optimization methods, based on gradient descent, that include estimated second-order information to improve on shortcomings in gradient descent. All experiments in this thesis were done using Adam[67] with a base learning rate of 0.001. The reasoning for using Adam was twofold. First, it is a method that estimates a suitable learning rate per parameter of the model and adapts these parameter-wise learning rates over the course of the training. In theory, this should show a better behavior in cases with a gradient of low magnitude, e.g. near local optima or saddle points. In practice, Adam showed a fast and stable convergence rate.

The second reason for using Adam was that it can be used for non-stationary loss functions. This is the case in MDCC since the aligned soft-assignment changes at every iteration, even for training examples already processed before. This means that even while MDCC falls into the category of supervised learning, the targets in the training data set actually change at every epoch.
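The per-parameter learning rate adaptation of Adam mentioned above can be illustrated with a single update step. This is a minimal sketch of the standard Adam update rule, not the implementation used for the experiments. Only the base learning rate of 0.001 is taken from the text; the values of `beta1`, `beta2` and `eps` are the commonly used defaults and are assumptions here, as is the function name `adam_step`.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; t is the 1-based step counter.

    m and v hold running estimates of the first and second moment of
    the gradient for every parameter.
    """
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialization.
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)
        # The step is scaled per parameter by the gradient magnitude.
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params

# Two parameters with gradients that differ by three orders of magnitude.
params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
adam_step(params, grads=[10.0, -0.01], m=m, v=v, t=1)
print(params)
```

In this example both parameters move by roughly the base learning rate despite the very different gradient magnitudes, which illustrates why a low-magnitude gradient does not stall the update.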
In the course of the experiments, the deep neural network was first initialized with random parameters. Afterwards, a pre-training was implemented by adding a collapse layer as described above and training the DNN with connectionist temporal classification. This training with CTC was done using the single line examples provided by the IAM offline handwriting database. Pre-training was done for a fixed number of 25 epochs using a mini-batch size of 16. After pre-training, the collapse layer was removed and the DNN was trained using multi-dimensional connectionist classification with a mini-batch size of 8. Pre-training was done to allow the filters in the convolutional layers to adapt to recognizing the geometric shapes inherent in handwritten text.

7.5 Experiments and Results

General Approach

Chapters 5 and 6 detailed the decoding algorithm and training mechanism as proposed in this work for multi-dimensional connectionist classification. Section 7.2 explained the use of the IAM offline handwriting database and Sections 7.3 and 7.4 the two-dimensional forced alignment method and the deep neural network model as used for experimentation and evaluation.

Of the decoding algorithms proposed for MDCC in Chapter 5, the experiments in this section used the continuous separators variant of Algorithm 5.4.2 for finding lines within the DNN prediction. The experiments further applied the beam search variant of Algorithm 5.5.2 for decoding each line to uncover its label sequence.

No explicit language model was employed during the course of the described experiments. The only language model in use was the one implicitly learned by the recurrent neural networks during training via CTC or MDCC. The alphabet used for transcription consisted of all characters occurring in the ground truth texts of the IAM offline handwriting database.
No mapping between upper and lower case characters was done, which means wrong capitalization in the transcription negatively impacts the error rate. The same alphabet was used for all methods and experiments in this chapter.

The first step towards evaluation of experiments is to decide on a measurement of the error that the transcription method in question produced. A commonly used error metric in offline handwriting recognition is the character error rate (CER)

\[ \mathrm{CER}(y, l) = 100 \times \frac{\mathrm{Edit}(\mathrm{Decoder}(y), l)}{|l|} \tag{7.5.1} \]

which measures the ratio of the Edit-distance[81, 151] between the transcribed string and the truth label string l to the length |l| of the truth label string. Symbol y in Equation 7.5.1 is again the soft-assignment as estimated by the deep neural network. Roughly speaking, the CER measures the ratio between the number of wrongly transcribed characters and the total number of characters. Please note that the lower limit of the CER is zero, reached if the transcribed string and the truth label string are identical, but the CER has no upper limit and values over 100 are possible. We will use the CER as the measure for comparison of different transcription methods in the following paragraphs.

Figure 7.5.1 shows the character error rate of Equation 7.5.1 as it was measured during training of the deep neural network for multi-line handwriting recognition. The DNN was first pre-trained using CTC on the provided ground truth line segmentation for 25 epochs and then switched to MDCC for the remainder of the training until convergence. This switch is indicated by the vertical line at epoch 25 of the diagram.

The model parameters used for the multi-dimensional connectionist classification experiments in this chapter and Chapter 9 were reached after 59 epochs of training: 25 epochs of line-wise pre-training with CTC and 34 epochs of paragraph-wise training with MDCC.
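Since all following results are reported as CER values, a short sketch of Equation 7.5.1 may be helpful. It is a minimal illustration that operates directly on an already decoded string, so the decoder itself is not part of it. The function names are hypothetical and the Edit-distance is implemented as the standard Levenshtein distance with unit costs, which is an assumption about the exact cost model.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(transcription, truth):
    """Character error rate in percent relative to the truth length."""
    return 100.0 * edit_distance(transcription, truth) / len(truth)

# A capitalization error counts as a substitution, because upper and
# lower case characters are not mapped onto each other.
print(cer("Gaitskell", "Gaitskell"))
print(round(cer("gaitskel", "Gaitskell"), 2))
# Many spurious insertions push the CER above 100.
print(cer("abcdef", "a"))
```

The last call shows why the CER has no upper limit: the number of edit operations can exceed the length of the truth label string.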
The resulting model parameters produced a CER of 2.53 on the training data, 8.41 on the validation data and 10.22 on the evaluation data when applied to paragraph-wise transcription.

Figure 7.5.1: Convergence of the character error rate while training the deep neural network with multi-dimensional connectionist classification. Red indicates the CER on the training set, blue on the validation set. The vertical line at epoch 25 signals the transition from line-wise pre-training using CTC to paragraph-wise training using MDCC.

The computing hardware used for these experiments was an Intel Core i5 6500 and an Nvidia GeForce GTX 1080 Ti. A single epoch of training on the training set of the IAM offline handwriting database, augmented as detailed in Section 7.2, with subsequent calculation of the CER on all three data sets (training, validation and evaluation) took between 8 and 12 hours. The earlier epochs of training with MDCC took longer since loopy belief propagation takes longer to converge to a stable point while the DNN-predicted soft-assignment is noisy or contains many errors. Loopy belief propagation in multi-dimensional connectionist classification converges faster if the soft-assignments as predicted by the DNN and CRF are already similar. The training of Figure 7.5.1 ran for a total of 14 days.

Comparison with Line-Level Transcription

The first experiments done in this work were aimed at the question of whether multi-dimensional connectionist classification, that is paragraph-wise transcription, performs better than connectionist temporal classification, that is line-wise transcription, for offline handwriting recognition and, if so, in which cases. The deep neural network was trained on the IAM offline handwriting database as described above. The same DNN topology, but with different parameters, was trained on the ground truth line segmentation as provided in the IAM offline handwriting database.
For this, the collapse layer was added in between the final convolutional layer and the softmax function. The paragraph-level network was pre-trained with CTC as discussed. The line-level network was directly trained with CTC until convergence. The result was two deep neural networks with the same topology, except for the collapse layer, and trained on the same data set, but with parameters optimized for either paragraph-level transcription or line-level transcription.

Evaluation of these two deep neural networks included the transcription of the test data set from the Large Writer Independent Text Line Recognition Task. The offset between each two adjacent text lines was reduced by a constant number of millimeters in order to evaluate the sensitivity of the methods to overlapping text lines. This can easily be done on the IAM offline handwriting database since the ground truth line segmentation is provided, which is considered ‘perfect’, and the resulting line images can be artificially moved closer together. A new segmentation into lines was then necessary since this artificial line offset invalidated the provided line segmentation.

Three publicly available line segmentation algorithms were applied to obtain new line images from the modified database. Two were based on open-source optical character recognition (OCR) software in the form of Tesseract and GNU Ocrad. Tesseract (https://github.com/tesseract-ocr/tesseract/, version 4.1.1) is an end-to-end OCR system that has layout analysis and text segmentation included. GNU Ocrad (https://www.gnu.org/software/ocrad/, version 0.27) is also an end-to-end OCR system in the style of Tesseract. The line segmentation algorithms implemented and provided by these two OCR systems were used for obtaining segmented line images from the paragraph images of the IAM offline handwriting database. An A*-based line segmentation method[140] was applied in addition to these two open-source systems.
A publicly available implementation (https://github.com/smeucci/LineSegm, version of August 6th, 2020) of this method was used. This method of using A* path planning for line segmentation is referred to as A* Paths in Tables 7.3, 9.2, 9.4, 9.6 and 9.4.1.

Table 7.3 contains the character error rates that were measured using the two deep neural networks as stated, in combination with a variable amount of artificially reduced line offset and while applying Tesseract, GNU Ocrad or A* Paths as line segmentation algorithms. Connectionist temporal classification on the ground truth line segmentation outperforms every other described method by a margin. This is not really surprising since the ground truth line segmentation in the IAM offline handwriting database has been manually verified and is considered ‘perfect’. However, even using automatic line segmentation on the unmodified data adds roughly 9 points of character error rate to the line-wise transcription using CTC. Artificially decreasing the line spacing increases the error rate for both CTC and MDCC, but with a slower increase for MDCC.

Table 7.3: Average character error rates (CER) for connectionist temporal classification (CTC) and multi-dimensional connectionist classification (MDCC) on full paragraphs of the test set of the IAM offline handwriting database while using different line offsets and line segmentation methods. MDCC does not use a line segmentation, so its CER is the same for all rows of one line offset. The last column gives the percentage of examples where MDCC produces a lower CER than CTC.

Line Offset | Line Segmentation | CER CTC | CER MDCC | Examples where MDCC better than CTC
0 mm | Ground Truth | 7.94 | 10.22 | 15.09%
0 mm | Tesseract | 16.74 | 10.22 | 63.79%
0 mm | Ocrad | 18.48 | 10.22 | 67.24%
0 mm | A* Paths | 16.31 | 10.22 | 71.55%
3 mm | Tesseract | 20.53 | 10.80 | 68.53%
3 mm | Ocrad | 16.87 | 10.80 | 68.10%
3 mm | A* Paths | 19.09 | 10.80 | 75.86%
5 mm | Tesseract | 27.58 | 12.76 | 75.86%
5 mm | Ocrad | 20.19 | 12.76 | 64.22%
5 mm | A* Paths | 24.83 | 12.76 | 81.47%
10 mm | Tesseract | 74.77 | 31.20 | 95.26%
10 mm | Ocrad | 56.87 | 31.20 | 90.52%
10 mm | A* Paths | 63.77 | 31.20 | 95.26%
Average | Tesseract | 34.90 | 16.24 | -
Average | Ocrad | 28.10 | 16.24 | -
Average | A* Paths | 31.00 | 16.24 | -

Comparison with Attention Networks

Section 3.2 discussed the application of attention networks to offline handwriting recognition, specifically the work of Bluche[8] on line-wise paragraph transcription. Since both the attention networks and multi-dimensional connectionist classification tackle the same problem of segmentation-free multi-line offline handwriting recognition, a comparison was in order. A reimplementation of the attention network method was done to facilitate a direct comparison to MDCC. Like the original, the reimplemented attention network consisted of three deep neural networks:

1. The encoder network extracts high-level features from the input image once per example. The encoder network used in the experiments was identical to the DNN described in Figure 7.4.4, except that the final convolutional layer and softmax function were replaced by a convolutional block with 256 filters. This way the model capacities of the attention network encoder and the MDCC network are similar. The difference between the two is that the encoder produces 256 feature maps for further processing while the MDCC network directly estimates glyph probabilities.
2. The attention mechanism is a neural network that performs line-by-line extraction of meaningful features from the encoder in order to transform the two-dimensional image into a one-dimensional sequence of feature vectors.
The attention mechanism in the reimplemented network consisted of two layers of separable MDLSTM, see Section 2.3, with 128 filters each. This was followed by a convolution with a 1x1 kernel, 1 filter and a column-wise softmax over the whole encoded image. This yielded the probability that a specific pixel should be extracted from the encoder in order to be part of the representation of the current text line. The original attention mechanism consisted of two MDLSTM layers with 32 filters each. The increase in the number of filters is to compensate for the switch from MDLSTM to separable MDLSTM.
3. The decoder network transcribes the one-dimensional sequence of feature vectors provided by the encoder network and attention mechanism. Its output is the transcription as trained with connectionist temporal classification. It consisted of one layer of bi-directional LSTM with 256 filters and a convolution with a 1x1 kernel and as many filters as there are glyphs in the alphabet. The activation function was the softmax function in order to obtain glyph probabilities as required by CTC. This is identical to the original work.

The attention mechanism in this deep neural network was executed a constant number of times, each time receiving its last output as additional input. As discussed in Section 3.2, this can be seen as a sort of recursion over the whole DNN. Each time, the output of the attention mechanism was used to extract features from the encoder and collapse them to a one-dimensional sequence. These one-dimensional sequences represent the text lines within the overall paragraph and were concatenated to one large sequence for the full paragraph. This attention loop was performed 12 times since the IAM offline handwriting database has a maximum of 12 text lines per paragraph.

Using the deep neural network from MDCC as the encoder in the attention network results, in principle, in the same capacity for extracting meaningful high-level features in both methods.
The attention network gained some additional model capacity in comparison to MDCC by adding multiple LSTM layers in the attention mechanism and decoder network, which are not present in the MDCC neural network. Both the MDCC neural network and the attention network were trained on the same augmented data from the IAM offline handwriting database with identical mini-batch sizes and parameters for the Adam optimizer. The training of the MDCC network was done as discussed above.

Table 7.4 contains the error rates for these experiments, comparing the attention network method for paragraph-wise transcription with MDCC. It shows that the attention networks produce a consistently lower error rate than MDCC. If the examples are evaluated individually, MDCC still resulted in a lower error in roughly a fifth to a quarter of the examples for line offsets up to 5 mm, and in about 8 percent of the examples at 10 mm.

Table 7.4: Comparison of the character error rates (CER) between MDCC and attention networks with CTC, in a layout similar to Table 7.3. Transcription was done on full paragraphs in 300 dpi resolution.

Line Offset | CER Attn. | CER MDCC | Examples where MDCC better than Attn.
0 mm | 9.01 | 10.22 | 28.88%
3 mm | 9.27 | 10.80 | 26.72%
5 mm | 9.86 | 12.76 | 21.12%
10 mm | 22.44 | 31.20 | 7.76%
Average | 12.64 | 16.24 | -

Table 7.5 shows the time measurements for the transcription using both attention networks and MDCC. The implementation of both methods was done using the PyTorch (https://pytorch.org/) deep learning library in version 1.7.0 with Intel MKL (https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl.html) and Nvidia CUDA (https://developer.nvidia.com/cuda-zone) for hardware acceleration. The hardware in use was an Intel Core i5 6500 with 4 cores at 3.2 GHz for execution on a CPU and an Nvidia GeForce GTX 1080 Ti for execution on a GPU. Linux was used as the operating system. The decoding algorithm was multi-threaded and ran in parallel for the examples within the same mini-batch.
The timings show that MDCC uses roughly half the execution time of the attention network approach, except in the case of a mini-batch size of one example while using GPU acceleration. This result is unsurprising since the attention networks use the same encoder DNN as the MDCC method, with additional neural networks for the attention mechanism and decoder. Still, it highlights that MDCC can be implemented and executed with only a single pass through a deep neural network.

The exception, that MDCC is slower with a mini-batch size of one example when using the GPU for hardware acceleration, can be explained by the complexity of the decoding algorithms for CTC and MDCC. In both cases the decoding algorithm was always executed on the CPU, even when using GPU acceleration. The decoding algorithm for CTC has a lower runtime since it decodes one-dimensional sequences and does not need to identify likely line separators. This means that although the deep neural network in the attention network approach takes longer to execute, the longer runtime of the MDCC decoding algorithm outweighs this advantage. This effect is mitigated for larger mini-batches, where the decoding algorithm is executed in parallel for the examples within the batch.

Table 7.5: Runtime comparison between MDCC and attention networks for full paragraphs and different batch sizes. Decoding for batch sizes greater than one used multiple threads. Time measurements include the total of 1539 paragraphs from the IAMDB.

                         Wall Clock                Average per Example
Batch Size   Hardware    Attn.        MDCC         Attn.      MDCC
1            CPU         51 m 0 s     22 m 32 s    1988 ms    878 ms
1            GPU         8 m 45 s     9 m 15 s     341 ms     360 ms
4            GPU         7 m 15 s     4 m 11 s     282 ms     163 ms
8            GPU         7 m 6 s      3 m 18 s     276 ms     128 ms

Comparison with Published Work

Table 7.6 contains a comparison of multi-dimensional connectionist classification with other methods for offline handwriting recognition as published in the literature.
All character error rates are as reported on the test set of the IAM large writer independent text line recognition task and use an image resolution of 300 dpi if not stated otherwise. Some methods employ data augmentation or normalization techniques and some use explicit language models. This is stated in the corresponding entry. Overall, these results show a clear tendency to favor line-wise transcription methods using connectionist temporal classification if a reliable line segmentation is available, which is the case with the ground truth line segmentation from the IAM offline handwriting database.

Table 7.6: Comparison of the character error rates (CER) between MDCC and methods proposed in other publications. The error rates are on the test set of the Large Writer Independent Text Line Recognition Task on the IAM offline handwriting database. Line-wise transcriptions use the ground truth segmentation as provided in the database.

Type             Method                                                  CER
Paragraph-wise   Multi-dimensional connectionist classification          10.22
Paragraph-wise   Attention networks (as reproduced)                       9.01
Paragraph-wise   Bluche[8] no language model                              7.9
Paragraph-wise   Bluche[8] with language model                            5.5
Paragraph-wise   Bluche et al.[9] at 150 dpi                             16.2
Paragraph-wise   Singh et al.[135] at ∼145 dpi                            6.7
Paragraph-wise   Singh et al.[135] at ∼145 dpi with augmentation          6.3
Paragraph-wise   Coquenet et al.[21] (footnote 7)                         5.45
Paragraph-wise   Coquenet et al.[20] (footnote 7)                         4.32
Line-wise        Connectionist temporal classification (as reproduced)    7.94
Line-wise        Voigtlaender et al.[150]                                 3.5
Line-wise        Doetsch et al.[28]                                       4.7
Line-wise        Pham et al.[102] no language model                      10.8
Line-wise        Pham et al.[102] with language model                     5.1
Line-wise        Kozielski et al.[71] with deslanting                    10.9
Line-wise        Kozielski et al.[71] with deslanting and moments         5.5
Line-wise        Puigcerver[104] no distortions                           8.3
Line-wise        Puigcerver[104] with distortions                         6.2

Table 7.6 compares MDCC to other paragraph-level transcription methods. Work by Bluche et al.[8, 9] is discussed in Section 3.2. It results in a lower CER, but is also limited to a specific DNN topology in the form of attention networks.
Table 7.5 outlines time measurements, which in summary indicate that MDCC performs faster than this type of attention network for paragraph transcription. Singh et al.[135] again apply an attention-based DNN, discussed in Section 3.2, with the same drawbacks. Coquenet et al.[20, 21] use an unofficial data split for evaluation. SPAN[21] is based on reshaping a CNN, as detailed in Section 3.3, which works well if the text lines are oriented in a roughly horizontal fashion with a similar and uniform height per text line. Coquenet et al.[20] also apply an attention network for paragraph transcription. It introduces further restrictions, since its attention mechanism collapses the horizontal dimension, resulting in attention windows that are perfectly horizontal.

Comparison with Forced Alignment Only

Section 7.3 discussed the idea of forced alignment in both the one-dimensional and two-dimensional cases and how to integrate forced alignment with MDCC. Forced alignment on its own already produces a soft-assignment that encodes the truth text, and this forced alignment soft-assignment is applied to the conditional random field (CRF) in MDCC as a prior. This raises the question of how much the soft-assignment as estimated by loopy belief propagation (LBP) adds to the transcription method.

An experiment was performed to compare the transcription of a deep neural network trained with forced alignment only to one trained with MDCC. The setup of data sets, deep neural network architecture and decoding algorithm was identical to the transcription with MDCC. The only difference to the DNN trained with MDCC was that no inference using LBP was done. Instead, the soft-assignment provided by forced alignment was directly used as the target soft-assignment in the cross-entropy loss while optimizing the DNN parameters.

Footnote 7: Using an unofficial split of the training, validation and evaluation data.
The deep neural network parameters after 25 epochs of line-wise pre-training were used as the starting parameters for this experiment. The following paragraph-wise training was done using the soft-assignment of forced alignment as the target values. The best model parameters yielded character error rates of 3.55 on the training data, 11.24 on the validation data and 12.85 on the evaluation data after an additional 187 epochs of training with forced alignment only. Training the DNN for transcription using MDCC yielded an evaluation set CER of 10.22 after only 34 epochs of paragraph-wise training. Thus MDCC decreased the character error rate by 2.63 points (12.85 versus 10.22) while also decreasing the number of required training epochs by 153 (187 versus 34). MDCC both reduced the error rate and increased the convergence rate during training when compared with forced alignment only.

7.6 Discussion

The preceding Section 7.5 detailed the experiments with multi-dimensional connectionist classification, its evaluation and its comparison to established methods. Of special interest was the comparison to line-wise transcription mechanisms using connectionist temporal classification and to the paragraph-wise transcription using attention networks. We will now discuss the implications of these results and how these methods compare besides their respective error rates.

The first set of experiments addressed the question of how multi-dimensional connectionist classification on the paragraph level compares to connectionist temporal classification on the line level. Examples from the IAM offline handwriting database were modified by artificially reducing the spacing between text lines by a fixed number of millimeters. This was followed by a transcription with the method in question. In the case of CTC, two publicly available OCR implementations were applied for line segmentation. The results are detailed in Table 7.3.
This experiment shows that line-wise transcription becomes increasingly difficult with more overlaps between text lines. In this experiment, MDCC performed better than CTC with an identical deep neural network topology and training paradigm. It stands to reason that this is a general benefit of paragraph-wise transcription over line-wise transcription in the face of overlapping, and thus hard to segment, text lines. MDCC provides an advantage over CTC when applied to difficult to segment paragraphs, as can be deduced from the lower error rates of MDCC when artificially reducing the line spacing.

The second set of experiments evaluated the attention networks method by Bluche[8] and includes a comparison of this method against MDCC in both error rate and runtime. Again this was done using artificially decreased spacing between text lines. Table 7.4 compares the character error rate of both methods and shows that these results favor the attention networks. Without a reduction of the line spacing, the CER produced by the attention networks was roughly 1.2 points lower than with MDCC (9.01 versus 10.22). However, MDCC still produced lower error rates in roughly one quarter of the individual examples, except for examples where the line spacing was artificially reduced by 10 millimeters.

Contrary results are observed when measuring the runtime for transcription of examples from the IAM offline handwriting database. Table 7.5 shows the measured total wall clock times for transcribing all 1539 examples, as well as the average runtime per example. It shows that MDCC roughly halves the runtime compared to the attention network method, except for a mini-batch size of one example on the GPU. This favors multi-dimensional connectionist classification in use cases where a low runtime is required or of benefit. Newer attention-based methods seem to follow this trend[135, p. 12].
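The per-example averages and the 'roughly half' statement can be checked directly from the wall clock times in Table 7.5. The following is a minimal sketch; the function and variable names are illustrative, and rounding down to whole milliseconds is an assumption that happens to reproduce the reported averages.

```python
# Wall clock times from Table 7.5 as (minutes, seconds) for all 1539 paragraphs.
N_PARAGRAPHS = 1539

WALL_CLOCK = {
    # (batch size, hardware): times for attention networks and MDCC
    (1, "CPU"): {"attn": (51, 0), "mdcc": (22, 32)},
    (1, "GPU"): {"attn": (8, 45), "mdcc": (9, 15)},
    (4, "GPU"): {"attn": (7, 15), "mdcc": (4, 11)},
    (8, "GPU"): {"attn": (7, 6), "mdcc": (3, 18)},
}


def per_example_ms(minutes, seconds, n_examples=N_PARAGRAPHS):
    """Average runtime per example in whole milliseconds (rounded down)."""
    total_ms = (minutes * 60 + seconds) * 1000
    return total_ms // n_examples


def runtime_ratio(config):
    """MDCC wall clock time divided by attention network wall clock time."""
    attn_min, attn_s = WALL_CLOCK[config]["attn"]
    mdcc_min, mdcc_s = WALL_CLOCK[config]["mdcc"]
    return (mdcc_min * 60 + mdcc_s) / (attn_min * 60 + attn_s)


for config, times in WALL_CLOCK.items():
    attn_ms = per_example_ms(*times["attn"])
    mdcc_ms = per_example_ms(*times["mdcc"])
    print(config, attn_ms, "ms", mdcc_ms, "ms", f"ratio {runtime_ratio(config):.2f}")
```

A ratio below 1 means that MDCC was faster. Only the configuration with a batch size of one on the GPU yields a ratio above 1.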
It is the opinion of the author that there are general properties of multi-dimensional connectionist classification that make the case for further application and research in this direction. These general properties of MDCC are discussed in the next three paragraphs.

Multi-dimensional connectionist classification is implemented outside of the deep neural network in the form of a special loss function based on an expectation-maximization approach and a matching decoding algorithm. This puts relatively low constraints on the type of deep neural network in use for MDCC. The DNN must be capable of estimating the soft-assignment from image input as described in Chapter 6. MDCC further requires only one pass through the DNN per example. These constraints are in contrast to the attention network method, which implements paragraph-level transcription by applying a very specific DNN topology. This means that MDCC can be applied in combination with a range of deep neural network types that are specifically designed for the use case at hand. These could be topologies for execution on special hardware or topologies optimized for runtime or memory usage.

This idea can be taken one step further by removing the assumption that the machine learning model has to be a deep neural network at all. As stated, MDCC requires the model to estimate the soft-assignment from an input image of handwritten text. It then applies supervised training in the form of expectation-maximization for parameter optimization. This is possible with any model that supports supervised training using gradient descent. MDCC can further be applied to machine learning models whose optimization is not gradient-based but still compatible with the M-step of Section 6.5. As such, multi-dimensional connectionist classification could be applied to paragraph-level handwriting recognition with models other than deep neural networks.
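The separation between model and training loop described above can be illustrated with a toy example. The sketch below is not the MDCC implementation: `TableModel` is a hypothetical non-gradient model, and `align` is a strongly simplified stand-in for the E-step that merely multiplies the estimate with a prior and renormalizes each cell, whereas MDCC uses a conditional random field with loopy belief propagation at this point. It only shows that the loop needs nothing from the model beyond estimating a soft-assignment and being fitted towards a target.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


class TableModel:
    """Hypothetical non-gradient model: one glyph distribution per cell."""

    def __init__(self, n_cells, n_glyphs):
        self.probs = [[1.0 / n_glyphs] * n_glyphs for _ in range(n_cells)]

    def predict(self):
        """Return the current soft-assignment estimate."""
        return [row[:] for row in self.probs]

    def fit_towards(self, target, step=0.5):
        """M-step stand-in: interpolate each distribution towards the target."""
        self.probs = [
            [(1 - step) * p + step * t for p, t in zip(p_row, t_row)]
            for p_row, t_row in zip(self.probs, target)
        ]


def align(estimate, prior):
    """E-step stand-in (assumption): combine estimate and prior per cell."""
    return [
        normalize([e * q for e, q in zip(e_row, q_row)])
        for e_row, q_row in zip(estimate, prior)
    ]


def train(model, prior, iterations=20):
    """Alternate between the E-step and M-step stand-ins."""
    for _ in range(iterations):
        target = align(model.predict(), prior)  # E-step
        model.fit_towards(target)               # M-step
    return model


# Toy prior for two cells and three glyphs (illustrative values).
prior = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
model = train(TableModel(n_cells=2, n_glyphs=3), prior)
print(model.predict())
```

Any model offering the two methods used by `train` could be substituted for `TableModel`, which is the property the paragraph above argues for.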
The indicator function α of multi-dimensional connectionist classification as discussed in Section 6.2 and the decoding algorithm as detailed in Chapter 5 currently both assume that text lines are oriented roughly horizontally at an angle of at most 45 degrees and do not include ‘U-turns’. That is, each text line is assumed to occupy exactly one vertical interval per pixel column in the input image. This matches the idea of a collapse layer in connectionist temporal classification. It stands to reason that both parts of MDCC could be generalized in order to remove this constraint and e.g. allow for spiral-formed text. The line-wise paragraph transcription based on attention networks[8] is likewise limited to a maximum curvature or rotation of the text lines of up to 45 degrees. This limitation emerges from the way the softmax function is applied as a column-wise normalization in the attention step. Explicitly modeling the line separators, as is the case in MDCC, instead of relying on the softmax function should make generalization of the text orientation more feasible.

Taken together, these observations show that MDCC is faster in execution time than the method[8] based on attention networks and has general properties that speak in favor of its application. Attention network approaches do however seem superior in the case of roughly horizontal text lines and without tight restrictions on the execution time.

The last part of the experimentation had the goal of establishing a reliable comparison with published methods. This is shown in Table 7.6. Tables 7.6 and 7.3 in combination make clear that connectionist temporal classification yields a lower error rate than paragraph-level transcription as long as a reliable line segmentation is available. This is certainly the case for the ground truth line segmentation provided in the IAM offline handwriting database. Difficult cases of automatic line segmentation then favor a paragraph-level approach to handwriting recognition.
As observed before, MDCC performs worse than the attention network method for paragraph-level transcription, at least in terms of plain error rates. Table 7.6 also shows that the application of an explicit language model or image distortions and deslanting generally improves the error rates in handwriting recognition tasks.

Overall, the experiments show that multi-dimensional connectionist classification is a competitive method for paragraph-level handwriting recognition and for handwriting recognition in general. It also seems to be the case that the deep neural network in use for MDCC should be tuned towards the specific requirements of error rates, hardware employed for execution, as well as runtime and memory limits for the use case at hand. It is however also clear that line-wise transcription should be preferred in cases where robust line segmentation is feasible.

Chapter 8
Hyper-Parameter Search using Visual Analytics

Figure 8.0.1: The visualization technique proposed in this chapter targets the soft-assignment as estimated by the deep neural network or conditional random field, while utilizing the input image as contextual support.

This chapter is based on the following publication, with Section 1.3 discussing the individual contributions of its authors:

Martin Schall, Dominik Sacha, Manuel Stein, Matthias O. Franz, and Daniel A. Keim. “Visualization-Assisted Development of Deep Learning Models in Offline Handwriting Recognition.” In: Symposium on Visualization in Data Science (VDS) at IEEE VIS 2018. Oct. 2018

8.1 Problem Description and Idea

In Chapters 5 and 6 we have discussed the theoretical idea and method of multi-dimensional connectionist classification (MDCC) in terms of training a deep neural network (DNN) for multi-line handwriting recognition and decoding the DNN estimate. Chapter 7 details the practical application of MDCC and provides an experimental evaluation.
It discusses the deep neural network topology, the hyper-parameters of the DNN as well as the optimization method and data augmentation. It did however not discuss how to derive these hyper-parameters. This is the topic of this chapter. The term ‘hyper-parameter’ in this context refers to all parameters of a machine learning model that are not covered by automatic optimization, e.g. gradient descent in our case. These hyper-parameters are typically selected by a model engineer. Examples of hyper-parameters are the number of neurons per layer, the learning rate or the resolution of the input image.

The general ‘black-box’ behavior of deep neural networks is one topic that needs to be respected when choosing and optimizing hyper-parameters in deep neural networks. Since supervised learning in DNNs is a form of curve-fitting of a non-linear function to a finite set of (noisy) data points in a high-dimensional space, general methods to e.g. detect and reduce overfitting or adjust the learning rate apply. We will touch on these in this chapter.

There is however the problem of breaking open the black-box of a deep neural network in order to inspect its specific functionality regarding the task at hand. Doing so allows the model engineer to derive specific actions during data preparation, model building and model training. This problem of breaking open the black-box is part of explainable AI (XAI), a research field that emerged in the last few years. Common techniques for XAI in DNNs include visualizing the sensitivity of the neural network output towards its input, e.g. in convolutional neural networks (CNNs)[134, 161]. explAIner[136] is an approach to track the evolution of a deep neural network's topology and hyper-parameters, combined with visualizing both the DNN topology and metrics that provide insight into its performance.
A visual analytics (VA) loop is then set up between the automatic training and the human expert in order to gain insight into the influence of each hyper-parameter and topology change and to derive meaningful actions that improve the model performance. Combining visualizations and verbalization for explaining the inner workings of deep neural networks is another method[129] to provide insight to the expert user. Vis4ML[114] is an ontology and guidance that provides a map of visual analytics techniques for machine learning and explainable AI. It both shows existing methods and serves to identify under-explored areas of XAI for ML. It is, among other purposes, an entry point for discovering existing XAI methods in ML. The workflow and visualizations of this chapter are included as one example on the Vis4ML website (footnote 1) and also as an example in the talk (footnote 2) given by Dominik Sacha.

One method for finding hyper-parameters in machine learning models is to do an automatic search by applying e.g. grid search or random search. Both methods are based on the idea that training and evaluation of a single model is fast and thus a large number of different combinations of hyper-parameters can be tested in a reasonable amount of time. This assumption does however not hold in the case of multi-dimensional connectionist classification, which is expensive to train since the deep neural network is relatively large and the approximation of the correct alignment using conditional random fields (CRFs) and loopy belief propagation (LBP) takes time. In the experiments discussed in Chapter 7, a single epoch of training on the augmented training data set took between 8 and 12 hours of wall clock time. This amounts to several days or weeks per training run. Grid search or random search for optimizing the hyper-parameters is thus not feasible.
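A short calculation makes the cost argument concrete. The sketch below uses the 8 to 12 hours per epoch stated above; taking the 34 paragraph-wise epochs reported for the MDCC run in Chapter 7 as a typical run length is an assumption, and the grid of four hyper-parameters with three candidate values each is purely hypothetical.

```python
HOURS_PER_EPOCH = (8, 12)  # reported range for one epoch on the augmented training set
EPOCHS_PER_RUN = 34        # assumption: paragraph-wise epochs of the MDCC run in Chapter 7


def run_days(epochs, hours_per_epoch):
    """Wall clock duration of one training run in days."""
    return epochs * hours_per_epoch / 24


def grid_size(values_per_parameter):
    """Number of hyper-parameter combinations in a full grid."""
    size = 1
    for n_values in values_per_parameter:
        size *= n_values
    return size


# Hypothetical grid: 4 hyper-parameters with 3 candidate values each.
n_runs = grid_size([3, 3, 3, 3])

for hours in HOURS_PER_EPOCH:
    days = run_days(EPOCHS_PER_RUN, hours)
    print(f"{hours} h/epoch: {days:.1f} days per run, "
          f"{n_runs * days / 365:.1f} years for {n_runs} sequential runs")
```

Under these assumptions a single run takes between roughly 11 and 17 days, and even the small grid of 81 combinations would take several years if the runs were executed one after another.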
This chapter proposes both a workflow and a heatmap-based visualization technique that together form a visual analytics loop that allows the model engineer to optimize the hyper-parameters for a model while training with multi-dimensional connectionist classification. The workflow proposed in this chapter is designed to steer the model engineer through an optimization process for the hyper-parameters by posing questions that, when answered, provide insight into possible sources of errors in MDCC and guide the model engineer towards meaningful changes in the hyper-parameters. Answering these questions is, where possible, supported by the proposed visualization technique. This process of guiding the model engineer through the workflow is repeatedly applied at different stages of the training, each time improving the hyper-parameters towards a higher model accuracy.

The proposed heatmap-based visualization is designed with the properties of MDCC in mind. It allows the model engineer to visualize both the soft-assignment as estimated by the deep neural network and the aligned soft-assignment produced by the conditional random field in order to inspect both and spot relevant differences. See Figure 8.0.1. The visualization is based on a heatmap, indicating the glyph probabilities, partially superimposed over the input image of multi-line text. This visualization technique and the proposed workflow in combination allow the model engineer to employ expert knowledge to derive meaningful changes to the hyper-parameters in MDCC and to arrive at a reasonable set of hyper-parameters without the excessive runtime requirements of grid search or random search.

Footnote 1: https://vis4ml.dbvis.de/
Footnote 2: https://vimeo.com/303202734

In terms of positioning of this work, we refer to two papers by other authors. Choo and Liu[17] identified understanding, debugging and refinement or steering as the three tasks in XAI for deep learning.
The proposed workflow addresses both debugging and refinement/steering. Debugging means identifying defective or faulty parts of the model or training system in order to improve on those points. Refinement and steering brings expert knowledge into the training process of a machine learning model to quickly derive meaningful changes and to improve the model accuracy or speed up the training process.

A survey[58] by Hohman et al. categorizes VA techniques in deep learning by posing appropriate questions. In the case of the method of this chapter, the answers to these categorizations are as follows:

Why? ‘Debugging & Improving Models’. Our work aims at improving the model accuracy. Comparisons of different models are also possible with this technique.
What? ‘Individual Computational Units’ in the output layer of the deep neural network. This work specifically visualizes the soft-assignment in order to gain insight by comparing the output of the DNN and CRF.
When? After the current training run and before starting a new one. However, stopping the current training run and resuming it is also possible. This is e.g. the case when fixing errors in the ground truth data.
Who? Clearly ‘Model Developers & Builders’.
How? ‘Line Charts for Temporal Metrics’, see Figure 7.5.1. We also present a heatmap-based visualization technique that falls into the ‘Instance-based Analysis & Exploration’ category.
Where? ‘Application Domains & Models’.

8.2 Error Sources in MDCC

A first step in researching and applying the proposed workflow and heatmap-based visualization of this chapter is to understand the sources of error that can occur while training a deep neural network with multi-dimensional connectionist classification. For this we recall the discussions of Chapters 5, 6 and 7. The following list gives an overview of the possible reasons for an unsatisfactory accuracy in MDCC.
The numbering of the error sources will be used later in Figure 8.3.1 and Sections 8.3 and 8.4. The paragraphs following this overview discuss the details of these sources of error.

1. Data
   (a) Too little data in general.
   (b) Too little data for outlier examples.
   (c) Truth data has a systematic fault.
   (d) Truth data has individual examples wrong.
2. Transformation
   (a) Resolution of input image too low.
   (b) Resolution of input image too high.
3. DNN Topology
   (a) DNN topology not suitable for task.
   (b) DNN capacity too small.
   (c) Subsampling within the DNN too large.
   (d) Subsampling within the DNN too small.
4. CRF Alignment
   (a) LBP has not converged to a stable point.
5. Training Process
   (a) Training is not finished yet.
   (b) Overfitting to training data.
   (c) Optimizer hyper-parameters are sub-optimal.
6. General
   (a) General configuration error.
   (b) Implementation bug.

Data: The first type of error source covers problems with the training data and with the data in general that is available for DNN optimization. A common problem in deep learning is that too few annotated data examples are available, which prevents effective optimization in high-dimensional parameter spaces. Increasing the amount of annotated data available is often useful in this case (‘data is king’). We also need to keep in mind that multi-dimensional connectionist classification deals with the transcription of handwritten text, which is often natural language. The frequency of glyphs in natural text follows Zipf’s law[87], which states that the frequency of an item is roughly inversely proportional to its frequency rank, so that the most frequent glyph occurs roughly twice as often as the second most frequent one. This means that training a DNN for transcription of natural texts inherently deals with imbalanced data sets in which the glyphs do not occur a roughly equal number of times. Only a few annotation errors in examples with infrequent glyphs can be harmful to the overall accuracy. In general there are other potential errors in annotation, e.g.
systematic faults such as incorrect capitalization or missing punctuation marks. Of course there may be more general types of errors in the data, for example input images that are geometrically rotated or flipped.

Transformation: This type of error source concerns the image transformations. These errors occur if the available data is correct, but the images are incorrectly transformed while loading them into the deep neural network. The two error types of most concern for MDCC are loading the input image of handwritten text in a resolution that is too high or too low. Choosing a wrong input resolution may lead to errors in recognizing characters or text lines. An example is the glyph ‘i’, which looks like a glyph ‘l’ or ‘I’ if the resolution is too low. On the other hand, a resolution that is too high separates the dot from the main body of the glyph by many time steps in a recurrent neural network, and the glyph will then likely be transcribed incorrectly. There are works[48, 54, 55] that test the ability of long short-term memory networks to recognize long-distance dependencies in sequential data.

DNN Topology: There are two rather general types of error in this category. One is that the topology of the deep neural network is not suitable for the task of multi-line handwritten text recognition. This could be the case if e.g. the non-linear activation functions were of an unsuitable type. Choosing a subsampling, e.g. maximum or average pooling operations, that is too fine or too coarse will lead to similar phenomena as an input image with a resolution that is too low or too high. In the case of a subsampling that is too low or too high, the DNN may be incapable of correctly identifying geometric features of handwritten text and their relations with each other in natural language. Extreme cases occur if the spatial
Extreme cases occur if the spatial 168 size of the soft-assignment is smaller than the size of the truth label, which is the case if the number of pixel rows is smaller than the number of text lines or the number of pixel columns is smaller than the number of characters in the longest text line. In such a case, no correct soft-assignment can be estimated or computed at all, leading to the failure of the alignment using the conditional random field. CRF Alignment: The main topic when looking for error sources in the alignment by conditional random field is to check if the beliefs in loopy belief propagation have converged to a stable point and if this point is a valid solution. Please see Section 2.2 and Chapter 6 for information on aligning the soft-assignment in MDCC. The convergence criteria proposed in Algorithm 6.6.1 do test for both these cases. Still, the possibility remains that a large number of examples did not converge to a valid stable point and thus a large amount of training data was implicitly discarded, hindering the training process. Training Process: Problems found in the the wider field of machine learning also apply to multi-dimensional connectionist classification. Gradient descent, which is an iterative algorithm, is applied for parameter optimization in MDCC. As such it could simply be the case that if an unsatisfying error rate in the model is observed, the training process only needs to be run for a longer timer. Overfitting also applies here, which means that too few training data is used for the model capacity at hand. In this case the model starts fitting noise or outliers in the data instead of generally occurring concepts. The inverse of overfitting would be a too high learning rate in the optimizer, which leads to sporadic divergence of the model parameters, away from a low error solution in gradient descent. 
General: This category covers the most general problems, for example choosing the wrong alphabet for transcription or an implementation bug somewhere in the overall system.

8.3 Workflow for Identification of Error Sources

So far in Section 8.2 we have discussed the potential error sources that lead to an increase in character error rate (CER), that is a lower accuracy, in multi-dimensional connectionist classification. This section discusses a workflow that is designed to guide the model engineer to meaningful changes in the hyper-parameters of the model, the training process or general data acquisition in order to effectively reduce the error rate in MDCC. Figure 8.3.1 outlines this workflow, which is a decision tree encoding expert knowledge regarding training in multi-dimensional connectionist classification. Part of this workflow involves answering questions about the current state of the model in MDCC, and some of these questions are supported by a heatmap-based visualization. This visualization technique will be discussed in Section 8.4.

We will discuss the workflow of Figure 8.3.1 and its suggested actions for error mitigation in the following paragraphs. The nodes in Figure 8.3.1 are encoded in the following way:

• Red signals questions to be answered or tests to be done on the current state of the model.
• Blue suggests the application of the heatmap-based visualization technique discussed in Section 8.4.
• Green indicates suggestions for actions to improve the accuracy of the model trained with MDCC.
• The gray trapezoid suggests multiple options to proceed at this point. It is at the discretion of the model engineer to choose one action or to apply all of the suggested actions.

Figure 8.3.1: Proposed workflow to identify meaningful actions given the current state of the training process.
The workflow begins by posing question A, which is to measure the character error rate on the validation data with the current model parameters. As Figure 7.5.1 shows, the error rate on the validation set converges towards the minimum during gradient descent, provided that no overfitting of the model happens, the model capacity and training set are large enough and no other error occurs during training. The error rate does indeed converge towards the minimum in most cases, but starts to diverge or oscillate at some point. Question A of the workflow asks whether the current CER on the validation data set is satisfactory. If so, the model engineer may choose to stop the training run and use the model parameters which produced the lowest error rate.

Question B of the workflow assesses whether the character error rate on the validation data set is still improving with each epoch of the training. If this is the case, waiting for a better parameter set that produces a lower error rate on the validation set is still an option. Another option in this case is to check the hyper-parameters of the optimizer for potential improvements, e.g. increasing the learning rate if the error rates on both the training and validation set are only decreasing slowly. If the error rate on the validation set is not improving further, then question C tests whether the error rate on the training set is still improving. If this is the case, we are likely dealing with overfitting, which is a common problem in machine learning. One action to mitigate overfitting in multi-dimensional connectionist classification is to collect more annotated data for this task (‘data is king’), thus improving the ratio between the amount of training data and the model capacity. Instead of increasing the amount of data, reducing the model capacity or introducing further constraints on the model and its parameters is also a suitable action when overfitting is observed.
Actions in this direction are to reduce the number of layers in the deep neural network, reduce the number of neurons per layer, add dropout to the layers or add a regularization term to the loss function.

Question D results in a significant split in the workflow, since the outcome of this experiment determines whether the problem at hand lies in the data, model or training system in general, or with specific individual examples or glyphs. Question D can be answered by calculating the character error rate per example from the annotated data and then plotting a histogram or computing the variance over these per-example error rates. If the CER is unsatisfactory across all examples, question E follows up to determine probable causes.

Question E requires that the annotated data is also available line-wise, that is with each paragraph correctly segmented into its contained lines. The experiment of question E is to use the line-wise data set with the same model and hyper-parameters, as far as possible, and to train the model using connectionist temporal classification for line-wise transcription. Training the model with CTC for line-wise transcription is a reliable way of testing the deep neural network architecture in general, since CTC is a robust and well understood transcription method for handwritten lines of text. If this CTC training run achieves a reasonable character error rate, further inspection of individual examples when transcribing using MDCC is in order. If training with connectionist temporal classification does not lead to a reasonable error rate, a general error is to be expected in the data or model. The model could be unsuitable for the task at hand or the model capacity could be too small to solve the task of handwriting recognition. Repeating the training with a different deep neural network architecture or with more neurons, and thus more parameters, in the neural network is advised.
Choosing a suitable DNN architecture is partly dependent on the experience of the model developer. The DNN architecture used in this work is detailed in Section 7.4. On the other hand, there could be too little data, or the data could have a systematic fault. Systematic faults in the annotated truth data in handwriting recognition include e.g. the capitalization of words or missing punctuation marks. Questions D and E in combination establish whether the employed data set and deep neural network architecture are suitable for solving the task of multi-line handwriting recognition.

Once it is established that the DNN model is capable of solving the task of offline handwriting recognition, the workflow directs the model engineer towards error sources within individual data examples.

Question F of the workflow is the first one that requires visual inspection of specific examples. We will discuss the heatmap-based visualization and ways to filter for interesting glyphs in Section 8.4. This proposed visualization technique is capable of answering questions about the localization of characters predicted by the deep neural network, about the resolution of the soft-assignment as estimated by the DNN or conditional random field, about the correctness of the CRF alignment, about glyphs which are affected by a high error rate and about general differences between the deep neural network and conditional random field soft-assignments.

Filtering the data set for interesting data examples is again done by calculating the character error rate per example and selecting a few examples with a high error rate. Those examples are then visualized with the technique proposed in this chapter. If all glyphs show roughly the same error rate and error type, this points to a more general problem with the hyper-parameters of the model, the data or the software implementation. If, on the other hand, only a few glyphs show a high error rate, this points to more localized and specific problems.
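The per-example error rates used in questions D and F can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: it assumes the common definition of the CER as the Levenshtein distance divided by the length of the truth text, and the helper names `edit_distance`, `cer` and `cer_summary` are hypothetical.

```python
from statistics import mean, pvariance


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between truth text `ref` and transcription `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing in hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate (assumed definition: edit distance / truth length)."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def cer_summary(pairs):
    """Per-example CERs, their mean and variance, and example indices worst-first."""
    rates = [cer(ref, hyp) for ref, hyp in pairs]
    worst_first = sorted(range(len(rates)), key=lambda i: rates[i], reverse=True)
    return rates, mean(rates), pvariance(rates), worst_first
```

A low variance together with a high mean suggests a general problem (question E), whereas a few outliers at the front of `worst_first` are candidates for visual inspection.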
Question I picks up this trail by testing whether frequent or infrequent glyphs of the alphabet are impaired by a high error rate. For this, the heatmap-based visualization is employed in the missing mode, which highlights glyphs that the deep neural network falsely missed, or in the ghosting mode, which targets falsely predicted glyphs. Typical problems leading to a higher than expected error rate in infrequent glyphs are a too low model capacity or a poor choice of the optimizer hyper-parameters. Both of these error sources can disproportionately affect infrequent glyphs since, in terms of the MDCC loss function, errors in infrequent glyphs have only a small impact on the overall transcription error of the model. Other types of errors that occur in combination with infrequent glyphs are a data collection bias that further reduces the number of examples with outliers, or systematic errors in the ground truth data annotation that only affect infrequent glyphs.

The remaining options for error sources in multi-dimensional connectionist classification are covered by a multiple choice path in the proposed workflow of Figure 8.3.1. The model engineer may choose to follow only one of the two options proposed in the workflow, although it is recommended to perform both tests. The heatmap-based visualization is employed to inspect one example from the data set to support the decision on how to proceed.

Question G compares the resolution of the input image with the resolution of the soft-assignments estimated by the deep neural network and the conditional random field. The source of error may be that the alignment resolution is too coarse or too fine, in which case the subsampling factor inherent in the DNN architecture or the resolution of the input image should be adapted. In the course of this work, a soft-assignment resolution of 3 to 5 pixels in both height and width per character has worked well.
Characters in the DNN and CRF soft-assignments should however not consist of only a single pixel. Choosing a too low input resolution or too high subsampling factor may render certain glyphs indistinguishable from each other, e.g. the glyphs ‘i’, ‘l’ and ‘I’. A too high input resolution or too low subsampling factor, on the other hand, may create a too large distance between disconnected features of the same glyph and thus introduce additional long-range dependencies into the long short-term memory (LSTM) layers.

Question H is an experiment to test for general errors in the data and software implementation. It may again be that there is a systematic fault in the ground truth data or that individual examples from the data set are wrong. There may also be a general configuration error, e.g. flipping the width and height of the input image at some point, or an implementation bug. Since loopy belief propagation is an iterative inference method that requires repeated multiplication and summation of probabilities, numerical instabilities in the implementation may propagate and impact the overall error rate of the model. Loopy belief propagation may also simply not have converged to a stable point while estimating the soft-assignment in multi-dimensional connectionist classification. Increasing the limit on the number of iterations in LBP is advised in this case.

We will discuss the proposed heatmap-based visualization technique in the remainder of this chapter.

8.4 Heatmap-Based Visualization for MDCC

Sections 8.2 and 8.3 have detailed the sources of error that may occur while applying multi-dimensional connectionist classification to the training of a deep neural network and proposed a workflow for deriving meaningful actions to modify the annotated data, the hyper-parameters of the model or optimizer, or the DNN architecture itself in order to mitigate these errors.
What is missing is the heatmap-based visualization technique that is employed in the workflow of Figure 8.3.1 to inspect data examples for specific errors. Discussing this visualization technique is the topic of this section.

This heatmap-based visualization targets the intelligible visualization of the soft-assignments as estimated by the deep neural network and conditional random field. The structure of these soft-assignments has been discussed in Sections 5.2 and 6.2. In case of the conditional random field, it is specifically the soft-assignment $z_\Sigma$ of Equation 6.3.2 that is targeted by this visualization. This allows direct comparison of the deep neural network and conditional random field soft-assignments since both are probabilities per glyph of the alphabet, not probabilities per character of the label string. These two soft-assignments, the DNN and CRF estimations, are probability distributions with two spatial dimensions for the two dimensions of the input image of handwritten text, and a third dimension over the glyphs of the alphabet in use. Understanding the spatial relationship between glyphs in a text is, so we assume, intuitive for speakers of that language as long as enough context is given. The proposed heatmap-based visualization provides this required context by using the original image of handwritten text as a background and partially superimposing the glyph probabilities as a heatmap. Because the input image is used as background, only one example of the data set can be visualized at a time.

Figure 8.4.1 shows this heatmap-based visualization for the example text of the IAM offline handwriting database often used in this thesis and specifically for the glyph ‘a’ of the alphabet. The left part of the visualization shows the soft-assignment of the glyph ‘a’ as estimated by the deep neural network and the right part the one estimated by the conditional random field. The background in both cases is the input image of handwritten text as a grayscale image.
Superimposed on this background is a heatmap of the probabilities from the soft-assignments for this specific glyph. The heatmap is only partially superimposed in order to leave large parts of the handwritten text visible for reference to the user. This raises the question of how to decide which parts of the image should be overlaid with the heatmap, that is at which spatial positions the soft-assignments should be visualized.

The value range of the soft-assignments is $[0, 1]$ since the values are probabilities. Summing the probabilities over all glyphs for one pixel will always yield a probability of 100 percent since each pixel has to be assigned to one glyph. Noisy predictions will thus yield probabilities near $1/|A|$, with $|A|$ being the number of glyphs in the alphabet in use. High confidence predictions will yield a few probabilities near one, that is for assignments of pixels to glyphs that the DNN or CRF expect, and many probabilities near zero, which indicate assignments of low confidence. This effect is used to decide which pixels of the image of handwritten text to superimpose with the heatmap. Interesting areas are those where the DNN or CRF predict a high probability for the glyph in question, and those areas should be superimposed. In the proposed visualization, these interesting areas are defined as the pixels where the probability of assigning the pixel to the glyph is higher than the mean probability for this glyph over all pixels, plus one standard deviation. Formalized, the heatmap of the soft-assignment $y$ is superimposed for pixels $s$ where $y^s_g > \mu_{y_g} + \sigma_{y_g}$ for the glyph $g$ at hand. For approximately normally distributed probabilities, this means that roughly 16 percent of pixels will be superimposed by the heatmap. The heatmap further encodes the absolute probability of the pixel being assigned to the specific glyph in color, with high probabilities being encoded in yellow.
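The superimposition rule above can be sketched in a few lines of NumPy. This is an illustrative sketch under assumptions, not the code used for Figure 8.4.1: the soft-assignment is assumed to be an array of shape (height, width, number of glyphs), and the function names are hypothetical.

```python
import numpy as np


def heatmap_mask(soft_assignment: np.ndarray, glyph: int) -> np.ndarray:
    """Boolean mask of pixels s with y[s, g] > mean + one standard deviation.

    `soft_assignment` is assumed to have shape (height, width, num_glyphs).
    """
    y_g = soft_assignment[:, :, glyph]
    return y_g > y_g.mean() + y_g.std()


def masked_heatmap(soft_assignment: np.ndarray, glyph: int) -> np.ma.MaskedArray:
    """Glyph probabilities with all non-interesting pixels masked out."""
    y_g = soft_assignment[:, :, glyph]
    # masked_where hides entries where the condition is True, hence the negation
    return np.ma.masked_where(~heatmap_mask(soft_assignment, glyph), y_g)
```

A plotting library that skips masked entries can then draw the result on top of the grayscale input image, once the heatmap has been scaled to the image resolution.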
This color coding of the probabilities further allows the user to distinguish between a superimposed heatmap that shows noise (seemingly random pixels are superimposed) and a heatmap that shows confident predictions by the DNN or CRF (contextually correct pixels are superimposed).

This partially superimposed heatmap allows the user to distinguish between soft-assignments from the deep neural network or conditional random field that are seemingly noise and those that represent high confidence predictions. If the user decides that the soft-assignments at hand are not noise, further inspection of the soft-assignments is possible by incorporating the context provided by the corresponding handwritten text in the background.

Figure 8.4.1: Heatmap visualization technique for inspecting and comparing the deep neural network prediction and conditional random field alignment for a single glyph from the alphabet. The background shows the input image. The partially superimposed heatmap details the probabilities from the corresponding soft-assignments. High probabilities are encoded in yellow, low ones in blue.

Figure 8.4.1 shows a higher resolution of the pixels in the input image than in the superimposed heatmap. This is because the soft-assignments estimated by the deep neural network and conditional random field have a lower spatial resolution than the input image. This effect is caused by subsampling within the DNN architecture, which reduces the spatial resolution during forward propagation through the neural network.

This proposed visualization technique is capable of displaying only a single glyph of a single example from the data set at a time. Section 8.3 discussed the workflow proposed in this chapter and suggested that interesting examples from the data set can be identified by calculating the character error rate (CER) per example and selecting examples that have the highest or at least a higher than expected error rate.
This is a quick way to identify problematic examples which may indicate errors in the model, data or hyper-parameters, but of course any example from the data set can be visualized using this technique. Identifying interesting glyphs from the alphabet at hand is the topic of the next few paragraphs.

We propose three metrics for ordering the glyphs by ‘interestingness’ within one example of the data set: difference, which employs the cross-entropy between the CRF alignment and DNN estimate, ghosting as a measure of false-positives and missing for false-negatives. The user may choose the metric that is most useful for the task at hand and select one or more glyphs to visualize in order to gain insight into the DNN prediction and CRF alignment. Similar to the discussion of expectation-maximization in Section 6.5, the following equations use $z_\Sigma$ to denote the alignment as estimated by the conditional random field and $y$ for the soft-assignment predicted by the deep neural network.

The difference metric

$$\mathrm{Difference}(y, z_\Sigma, g) = -\sum_s z^s_{\Sigma g} \times \log\left(y^s_g\right) \tag{8.4.1}$$

measures the cross-entropy between the soft-assignments $z_\Sigma$ and $y$ for the example at hand. That is, it measures the information gained about $z_\Sigma$ when $y$ is observed. A high value of cross-entropy indicates a low information gain, which shows that the difference between the DNN prediction and CRF alignment is high.

The ghosting metric

$$\mathrm{Ghosting}(y, z_\Sigma, g) = \sum_s \begin{cases} y^s_g & \text{if } z^s_{\Sigma g} < \epsilon \\ 0 & \text{else} \end{cases} \tag{8.4.2}$$

has a high value for glyphs that are predicted by the deep neural network in pixels where the conditional random field does not indicate their assignment. In a standard machine learning classification task, these would be false-positives.

The inverse to the ghosting metric is the missing metric

$$\mathrm{Missing}(y, z_\Sigma, g) = \sum_s \begin{cases} z^s_{\Sigma g} & \text{if } y^s_g < \epsilon \\ 0 & \text{else} \end{cases} \tag{8.4.3}$$

which flips the soft-assignments of the DNN and CRF and thus has a high value for pixels where the CRF assigns the glyph, but not the DNN.
These are false-negatives in classification tasks. In both the ghosting and missing metrics, $\epsilon$ is a small positive probability that serves as a threshold.

Selecting interesting glyphs for visualization is done by choosing one of these three metrics and computing its value for all glyphs in the alphabet and for the specific data set example at hand. Plotting the values of the metric for all glyphs in a histogram then allows the user to quickly filter for interesting glyphs. Figure 8.4.2 shows this histogram for the data set example already known from Figure 8.4.1, with the difference metric applied. The histogram shows a high difference in the line separator glyph.

Figure 8.4.2: Histogram for one specific example and over all glyphs of the alphabet, indicating the difference between the deep neural network prediction and the conditional random field alignment. The histogram indicates a high difference in the line separator $\epsilon_l$.

Figure 8.4.3 applies the ghosting metric to the same example as Figure 8.4.2. There are false-positives in the space glyph and the glyphs ‘o’, ‘e’ and ‘a’. This can partially be explained by the fact that these glyphs are frequent in the English language. Normalizing the ghosting and missing metrics by the frequency of the glyph may however be contraindicated. Both metrics are sums over all pixels in the soft-assignments and, as such, glyphs that occupy a large space tend to have higher values in these metrics. Normalization would thus require knowledge about the spatial extent of each glyph, which is information that is explicitly missing when applying multi-dimensional connectionist classification.

Figure 8.4.3: Histogram similar to Figure 8.4.2 but with the ghosting metric. The space glyph as well as the glyphs ‘o’, ‘e’ and ‘a’ are the most ghosted glyphs in this example.

Plotting and viewing these histograms as in Figures 8.4.2 and 8.4.3 allows the user to filter for interesting glyphs and to apply the heatmap-based visualization of Figure 8.4.1.
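The three metrics of Equations 8.4.1 to 8.4.3 and the ranking of glyphs can be sketched as follows. This is a minimal NumPy illustration under assumptions: both soft-assignments are assumed to be arrays with the glyph index as the last axis, the default threshold `eps=1e-3` and the constant `tiny` (added inside the logarithm to avoid log(0)) are choices made for this sketch and are not values taken from this thesis.

```python
import numpy as np


def difference(y, z, g, tiny=1e-12):
    """Cross-entropy between CRF alignment z and DNN estimate y for glyph g."""
    # `tiny` is an assumption of this sketch to keep log() finite
    return float(-np.sum(z[..., g] * np.log(y[..., g] + tiny)))


def ghosting(y, z, g, eps=1e-3):
    """Sum of DNN probabilities at pixels where the CRF probability is below eps."""
    return float(np.sum(np.where(z[..., g] < eps, y[..., g], 0.0)))


def missing(y, z, g, eps=1e-3):
    """Sum of CRF probabilities at pixels where the DNN probability is below eps."""
    return float(np.sum(np.where(y[..., g] < eps, z[..., g], 0.0)))


def rank_glyphs(metric, y, z):
    """Glyph indices ordered from most to least 'interesting', plus the raw scores."""
    scores = np.array([metric(y, z, g) for g in range(y.shape[-1])])
    return np.argsort(scores)[::-1], scores
```

The scores returned by `rank_glyphs` correspond to the bars of the histograms in Figures 8.4.2 and 8.4.3; the first few indices are the glyphs to inspect with the heatmap.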
Figure 8.4.4 shows the heatmap-based visualization applied to the example at hand for the top-3 glyphs according to the ghosting metric. We can see that the structure of the soft-assignments created by the deep neural network and conditional random field is similar, but there are differences in the exact location of characters and the exact value of the probabilities. This is the case because a forward pass through a deep neural network and loopy belief propagation are still very different inference algorithms and are expected to produce non-identical estimates. However, due to expectation-maximization, the two soft-assignments should converge to a stable point in which they are similar to each other.

Figure 8.4.4 shows the top-3 glyphs according to the ghosting metric, which would be false-positives in typical classification tasks. The glyph ‘o’ does indeed show a false-positive in the bottom-left area, where the word ‘and’ is seen in the background image but the character ‘a’ is mistaken for an ‘o’.

This section proposed and discussed a heatmap-based visualization for multi-dimensional connectionist classification and ways of filtering for interesting data set examples and glyphs from the alphabet. This completes the application of the workflow proposed in Section 8.3.

Figure 8.4.4: Heatmap visualizations for the top-3 glyphs according to the ghosting metric. Much seems to be minor perturbations in the two prediction methods (DNN and CRF), but the glyph ‘o’ is truly ghosted in the bottom-left area of the example.

8.5 Discussion

This chapter proposed a heatmap-based visualization technique and workflow for the identification of error sources while training a deep neural network with multi-dimensional connectionist classification, as well as for proposing meaningful changes to the model, its hyper-parameters and the annotated ground truth data in order to mitigate the identified errors.
The workflow is designed to be executed by the model engineer after a training run with MDCC or during a current training run in order to decide on actions to improve the error rate when applying MDCC to the specific data set. The proposed workflow covers typical machine learning problems such as overfitting or errors in the annotation of the ground truth data, and also extends to problems that are specific to offline handwriting recognition and multi-dimensional connectionist classification. The workflow proposes actions for improvements in both cases.

The heatmap-based visualization is designed to inspect and compare the soft-assignments as estimated by the deep neural network and conditional random field. It provides contextual information to the user by incorporating the input image of handwritten text as a background to the heatmap. The heatmap itself is partially superimposed on this background and visualizes probabilistic assignments between pixels and glyphs in the soft-assignments. The heatmap is superimposed in a way that allows the user to identify and compare spatial positions with high probabilities, but also to identify whether the soft-assignment is largely a product of noise.

Although no user study was performed to evaluate the visualization technique and workflow proposed in this chapter, they were of immense use to the author of this thesis while training deep neural networks using multi-dimensional connectionist classification. The experience collected while performing the earlier experiments showed that this technique is of help while loading and transforming data and choosing a suitable DNN architecture for the task at hand. It also proved useful during hyper-parameter selection for this model. It was furthermore employed during training runs with MDCC to determine whether they were worth continuing or whether it was better to stop the training, adjust the hyper-parameters and start a new training run.
One example of this was a case where a mistake was made in the architecture of the deep neural network: the final neural layer that produced estimates for the glyphs was followed by a non-linear activation function, and on top of this the softmax activation function followed to produce probability estimates. Applying the proposed workflow and heatmap-based visualization showed that infrequent glyphs were predicted poorly and that the predicted soft-assignment for these infrequent glyphs showed random noise. This observation made it clear that the mistake must be of a type that inhibits the recognition of infrequent glyphs. Applying two non-linear activation functions consecutively has this effect since the gradient for each glyph is then reduced by the product of the two function derivatives. In this practical example the workflow and visualization technique pointed towards an error source in the DNN architecture.

Visualization-guided changes to the model, hyper-parameters and potentially the annotated data are crucial in this context since training deep neural networks for offline handwriting recognition is time-intensive, even when using GPGPU acceleration. As stated before, a single epoch of training with MDCC on the IAM offline handwriting database has a duration of up to 12 hours. This amounts to multiple weeks for a full training run until the error rate has converged to a low value. Therefore it is in the interest of the model engineer to detect potential problems early and to mitigate those problems in a directed and meaningful way.

One possible next step in further researching and developing this proposed visualization technique and workflow is to tightly couple it with the training process. This could for example involve automatic identification and visualization of difficult examples and glyphs while training the DNN, potentially tracking changes over time and between different models.
To this end the visualization technique could be incorporated into already existing software for tracking DNN training runs, such as TensorBoard[1]. Integrating these visualizations with semi-automatic annotation software[125] is another path for potential further research. Integrating these heatmap-based visualizations into software for annotating ground truth data could provide further insight into which data examples prove to be consistently erroneous and provide context to support improved annotation to mitigate some problems. Sections 8.2 and 8.3 discussed error sources that occur within the ground truth annotation, and tight integration of visualizations of models with the data annotation process could limit the occurrence of such problems in the first place.

Chapter 9
Combined Models for Text Recognition

9.1 Idea and Overview

This thesis mainly proposes and discusses multi-dimensional connectionist classification (MDCC) as a novel method for paragraph-wise handwriting recognition in Chapters 5 and 6, with empirical experimentation in Chapter 7. The experiments of Chapter 7 focused on a comparison of the character error rate (CER) of paragraphs transcribed with MDCC and with connectionist temporal classification (CTC), which was discussed in Section 3.1. While MDCC is a method for paragraph-wise transcription, CTC is one for line-wise transcription. This chapter proposes methods for combining paragraph- and line-wise transcription. Each of the proposed methods constitutes a classifier that estimates which of the two transcription methods yields the lower error rate on the example at hand.

The idea behind this approach is that not all handwritten texts are equally difficult to transcribe. Many paragraphs consist of text lines that are organized in a neatly horizontal fashion without overlaps between lines.
These are typically well behaved in line-segmentation algorithms and produce good results when applying connectionist temporal classification. On the other hand, some adjacent text lines overlap because of warped base lines of one or multiple text lines or because individual characters in the text lines are overly extended in the vertical direction. Noise in the image is also a factor during line segmentation and transcription. These factors may reduce the quality of line segmentation results and thus favor the application of multi-dimensional connectionist classification as a paragraph-wise transcription method.

The IAM offline handwriting database[88] and its large writer independent text line recognition task again, as in Chapter 7, serve as the data basis for the experiments of this chapter. As discussed before, the large writer independent text line recognition task splits the IAM database into four disjoint sets: training, validation 1, validation 2 and test. Chapter 7 used the training set for automatic parameter optimization, validation 1 for validation and optimization of the hyper-parameters, and test for final evaluation and comparison with other methods. This leaves the validation 2 split unused so far. For the purpose of the current chapter, both training and validation 1 will be used for automatic parameter optimization if a machine learning classifier is used. Validation 2 will thus be used for hyper-parameter optimization and test again for evaluation and comparison. This approach ensures that no data examples that were used for optimization of the transcription models will again be used for optimization of the classifiers for combining line- and paragraph-wise transcription. Similar to Table 7.1, the data split used for the classifiers in this chapter is outlined in Table 9.1.

Table 9.1: Data split of the IAM offline handwriting database as used in this chapter.

                   Training             Validation   Evaluation
                   Train. + Val. 1      Val. 2       Test
Num. Paragraphs    747 + 105 = 852      115          232
Num. Lines         6161 + 900 = 7061    940          1861
Num. Writers       283 + 46 = 329       43           128

Similar to the experiments of Chapter 7, the experiments of this chapter are executed on both the original paragraph images of the IAM offline handwriting database and on paragraph images in which the lines have been artificially moved closer together by offsets of 3, 5 and 10 millimeters. Tesseract¹, GNU Ocrad² and A* path planning[140] have been applied as methods to achieve line segmentation apart from the provided ground truth. Applying these line segmentation algorithms was necessary for experiments with artificial line offsets since no ground truth data was available for these paragraphs.

Before proposing methods for application in real scientific or industrial settings, an overview of the potential gain from combining paragraph- and line-wise transcription is in order. The experiments of Chapter 7 and the error rates detailed in Table 7.3 serve as the basis for the experiments in this chapter. Figure 9.1.1 outlines an ‘oracle’ classifier that decides on an example-by-example basis whether the transcription generated by multi-dimensional connectionist classification or the one generated by connectionist temporal classification yields the lower character error rate. To this end, the ‘oracle’ classifier receives the truth texts as an additional input and chooses the transcribed text that actually has the lower error rate. This means that this ‘oracle’ classifier yields the theoretically optimal decision, but is also impossible to implement in a scenario where the truth texts are not known beforehand. It will serve as a baseline for estimating the gain that a combined transcription method can achieve.

Table 9.2 details the character error rates when applying the ‘oracle’ classifier of Figure 9.1.1 to the line- and paragraph-wise transcription models of Chapter 7 and the evaluation data of Table 9.1.
It shows that there is indeed a small improvement in character error rate when choosing between a line- and paragraph-wise transcription on an example-by-example basis.

The following pages of this chapter will discuss three approaches for building classifiers that decide between line- and paragraph-wise transcription to reduce the overall CER of the system. In contrast to the ‘oracle’ classifier, these three classifiers do not rely on knowledge of the truth data and will thus be applicable to real scientific and industrial systems. They continue to use the data split of Table 9.1, and an overview similar to Figure 9.1.1 will be provided for each classifier.

One note on the approach that was chosen for the classifiers of this chapter: all three classifiers are based on the idea that each of the line- and paragraph-wise transcription methods produces the transcribed text of the full paragraph on its own. The classifiers only choose which of the two transcriptions is expected to yield the lower error rate. Another approach to this problem would be to merge the two transcribed texts, producing a new transcription that is different from each of the two individual ones but also yields an overall lower error rate. However, this is not the approach of this chapter since useful merging of natural texts exceeds the scope of this thesis.

¹ https://github.com/tesseract-ocr/tesseract/, version 4.1.1
² https://www.gnu.org/software/ocrad/, version 0.27

Figure 9.1.1: Diagram showing the data flow and application of methods in an ‘oracle’ classifier that knows the truth text and always makes the correct decision that leads to the lowest possible character error rate (CER). Green boxes are data, blue ones methods. Yellow signifies the classifier and dashed lines are exclusive to each other, depending on the classifier decision.
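The decision rule of the ‘oracle’ classifier in Figure 9.1.1 can be sketched as follows. This is an illustration under assumptions rather than the evaluation code of this thesis: the CER is assumed to be the Levenshtein distance divided by the truth length, ties are resolved in favor of CTC, the combined CER is averaged over examples, and all function names are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between truth text and transcription."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate (assumed definition: edit distance / truth length)."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def oracle_select(truth: str, mdcc_text: str, ctc_text: str):
    """Pick the transcription with the lower CER; requires the truth text."""
    cer_mdcc, cer_ctc = cer(truth, mdcc_text), cer(truth, ctc_text)
    if cer_mdcc < cer_ctc:  # strict comparison: ties go to CTC (assumption)
        return "MDCC", mdcc_text, cer_mdcc
    return "CTC", ctc_text, cer_ctc


def oracle_summary(examples):
    """Share of examples where MDCC is selected and the mean combined CER."""
    picks = [oracle_select(truth, mdcc, ctc) for truth, mdcc, ctc in examples]
    share_mdcc = sum(name == "MDCC" for name, _, _ in picks) / len(picks)
    mean_cer = sum(rate for _, _, rate in picks) / len(picks)
    return share_mdcc, mean_cer
```

The two values returned by `oracle_summary` correspond in spirit to the last two columns of Table 9.2.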
Table 9.2: Average character error rates (CER, in percent) for connectionist temporal classification (CTC) and multi-dimensional connectionist classification (MDCC) on full paragraphs of the test set of the IAM offline handwriting database while using different line offsets and line segmentation methods. The last two columns combine line- and paragraph-wise transcription by applying an ‘oracle’ classifier which always makes the correct decision. They give the percentage of examples in which MDCC yields a lower transcription CER than CTC does and the resulting combined average CER if always making the optimal choice. This serves as the lower bound of the error rate when applying a non-all-knowing classifier. This table is based on Table 7.3.

                               CER               ‘Oracle’
Line Offs.   Line Segm.        CTC      MDCC     MDCC selected in   CER
0 mm         Ground Truth       7.94    10.22    15.09%              7.78
0 mm         Tesseract         16.74    10.22    63.79%              9.48
0 mm         Ocrad             18.48    10.22    67.24%              9.38
0 mm         A* Paths          16.31    10.22    71.55%              9.37
3 mm         Tesseract         20.53    10.80    68.53%             10.10
3 mm         Ocrad             16.87    10.80    68.10%              9.90
3 mm         A* Paths          19.09    10.80    75.86%             10.01
5 mm         Tesseract         27.58    12.76    75.86%             11.72
5 mm         Ocrad             20.19    12.76    64.22%             11.50
5 mm         A* Paths          24.83    12.76    81.47%             11.91
10 mm        Tesseract         74.77    31.20    95.26%             30.75
10 mm        Ocrad             56.87    31.20    90.52%             30.33
10 mm        A* Paths          63.77    31.20    95.26%             30.93
Average      Tesseract         34.90    16.24    -                  15.51
Average      Ocrad             28.10    16.24    -                  15.27
Average      A* Paths          31.00    16.24    -                  15.55

9.2 Classifier on Paragraph Images

Classifier

The first classifier type that we will discuss in this chapter is the application of a support-vector machine (SVM)[10, 30, 49, 147] on grayscale paragraph images. In its basic formulation, an SVM is a two-class classifier that finds the separation plane with the optimal margin. ‘Optimal margin’ in this case means that the minimal distance between any training data point and the separation plane is maximized within the feature space. As such an SVM is very robust to overfitting, given the general limitations regarding the number of features and number of training samples available.
The support-vector machine method can be applied to non-linear classification problems by means of the kernel trick [57], in which the feature space is implicitly mapped to a high-dimensional space via a non-linear mapping. The SVM implementation of Scikit-Learn (https://scikit-learn.org, version 0.23.2) was used for the experiments of this section.

In the work detailed in the following paragraphs, the two classes of the support-vector machine indicate whether line-wise or paragraph-wise transcription is expected to yield the lower error rate. The input feature space consists of handcrafted features extracted from the raw grayscale image of a paragraph of handwritten text. A radial basis function (RBF)[148] kernel was applied to this feature space in order to allow for a non-linear separating surface in the SVM. Support-vector machines include a hyper-parameter C for controlling the regularization during hyperplane fitting, often called the ‘slack variable’, which was chosen as C = 2 for the work of this chapter. Increasing the slack variable decreases the regularization and fits the training data more closely, even at the cost of overfitting. This hyper-parameter and the kernel function were chosen by an exhaustive search over a predefined set of hyper-parameters, using the grid-search implementation of Scikit-Learn. A different kernel function and a different set of hyper-parameters may yield better results when the proposed method is applied to a different data set of handwritten paragraphs.

Figure 9.2.1 details this approach to selecting line- or paragraph-wise transcription per paragraph image using a support-vector machine. During transcription, the first step is to extract features from the grayscale paragraph image. This will be the topic of the next section.
The extracted features are then fed into the SVM, which estimates which of the two transcription methods will yield the lower character error rate. Only the selected method is then applied, which keeps the overall runtime of this approach to the necessary minimum.

[Figure 9.2.1: flow diagram with the boxes Feature Extraction, SVM, Model Parameters, XOR, Line Segm., CTC Transcr. and MDCC Transcr.]

Figure 9.2.1: Diagram similar to Figure 9.1.1, but detailing the application of a support-vector machine on features extracted from the raw grayscale images. Dashed lines are only executed for one of the two alternatives.

Feature Space

The support-vector machine was provided with a handcrafted feature space in which the features were extracted from the grayscale image of a handwritten paragraph. The number of dimensions in this feature space needs to be constant for all examples within the data set, which stands in contrast to the variable width and height of the paragraph images, all of which have a constant resolution of 300 dpi. Features of constant dimension were extracted from the variable-sized paragraph images by first resizing the paragraphs to a standard height while keeping the original aspect ratio of the image, leaving only a variable width. The standard height to which the images were resized was set to the mean height plus one standard deviation over the images within the training data set. Assuming a normal distribution of the height of paragraph images, this leads to an increase of the height in about 84 percent of images, since that is the share of a normal distribution lying below the mean plus one standard deviation. The remaining variable width of each paragraph image was eliminated by applying a windowing approach. Each paragraph image was divided into five equally sized windows along the horizontal dimension. The overall image was treated as the sixth window.
Separating the overall paragraph image into five windows along the horizontal axis accommodates slightly slanted or curved text lines and reduces the impact of such curved base lines on the stability of the extracted features. Because each pixel row of each window is treated as one spatial position, the total number of dimensions in the feature space is a constant: six times the standard height in pixels times the number of different features extracted. To this end, all features were extracted by marginalization within the respective window, which eliminates biases introduced by windows of variable width. Figure 9.2.2 details this approach for an example of the IAM offline handwriting database.

All grayscale images were normalized to a mean pixel value of 0 and a standard deviation of pixel values of 1 over the training data set. The same normalization was also applied to samples outside of the training set. Features were extracted from these normalized brightness values. The features were designed in such a way that they allow the identification of overlapping text lines within the same pixel row. The types of features extracted from each of the windows were as follows, with each feature being extracted per pixel row:

• The mean pixel intensity and the corresponding standard deviation.
• The value span in brightness between the darkest and brightest pixel.
• The number of transitions from a dark to a bright pixel, that is the transition from < 128 to ≥ 128 in unnormalized grayscale images. This count was divided by the width of the window in pixels.
• In the same way, the normalized number of transitions from bright to dark pixels within the row.
• The sum of the two pixel transition quantities above, that is the number of transitions between dark and bright pixels independent of their direction.
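The per-row features listed above can be sketched as follows. This is a minimal illustration rather than the implementation used for the experiments: the function names are hypothetical, the image is assumed to be a list of pixel rows with unnormalized grayscale values from 0 to 255 that has already been resized to the standard height, and the normalization to zero mean and unit standard deviation is omitted for brevity.

```python
from statistics import mean, pstdev

def row_features(row, threshold=128):
    """Six features for one pixel row (a list of grayscale values 0..255)."""
    pairs = list(zip(row, row[1:]))
    # A transition is counted between two horizontally adjacent pixels.
    dark_to_bright = sum(1 for a, b in pairs if a < threshold <= b)
    bright_to_dark = sum(1 for a, b in pairs if a >= threshold > b)
    width = len(row)
    return [
        mean(row),                                  # mean intensity
        pstdev(row),                                # standard deviation
        max(row) - min(row),                        # value span
        dark_to_bright / width,                     # normalized by window width
        bright_to_dark / width,
        (dark_to_bright + bright_to_dark) / width,  # direction-independent sum
    ]

def window_slices(width, n_windows=5):
    """Five equally sized horizontal windows plus the overall window."""
    bounds = [round(k * width / n_windows) for k in range(n_windows + 1)]
    return [slice(a, b) for a, b in zip(bounds, bounds[1:])] + [slice(0, width)]

def paragraph_features(image):
    """Feature vector of length 6 windows x height x 6 features."""
    slices = window_slices(len(image[0]))
    return [v for s in slices for row in image for v in row_features(row[s])]

# Toy image: one striped row (many transitions) and one blank, bright row.
image = [[0, 255] * 5, [255] * 10]
features = paragraph_features(image)
print(len(features))
```

Because every row of every window contributes the same six values, the length of the feature vector depends only on the standard height and not on the width of the paragraph image (the sketch assumes a width of at least five pixels).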
The basic idea behind these features is that pixel rows that cut through a text line end-to-end will be of average brightness and contain many transitions between dark and bright pixels. This is because these pixel rows contain many dark pen strokes, but also many bright spots between and within the glyphs. Pixel rows that were intended by the writer as separators between two adjacent text lines, but contain overlapping glyphs of these text lines, display different characteristics. These pixel rows are assumed to be mostly bright since they contain only few pen strokes. They also show only few transitions between dark and bright pixels. On the other hand, they still show a large value span between the brightest and darkest pixel. Pixel rows that are separators between text lines and do not contain overlapping glyphs will on average be very bright, with no transitions between dark and bright pixels and a low value span within the pixel intensities.

The proposed feature space contains a high number of dimensions, which together with a non-linear kernel function increases the risk of overfitting to the training data. Analysis of variance (ANOVA)[32] was applied as a univariate feature selection strategy to reduce the number of features to one quarter of the number of training examples, in this case a total of 213 features. ANOVA as a feature selection strategy is based on the assumption that an observed variable, in this case the classification into two classes, is actually a mixture of a set of variables, in this case the features. ANOVA selects the top-n features that best explain the variance within the classification target.

[Figure 9.2.2: paragraph image resized to the standard height (‘Resize to standard height’) and divided into Win. 1 to Win. 5 plus the overall window, with the extracted features drawn below.]

Figure 9.2.2: Resizing of the paragraph image to a standard height and application of five vertical windows and one overall window to obtain a standard number of features per paragraph. The lower part shows the features as they are extracted by marginalization over the horizontal axis of each window. Each feature can thus be interpreted as a distribution of values over the vertical axis of each window, here indicated as a histogram along the vertical axis. Please note that the exemplified features do not correspond to the paragraph image above.

Empirical Evaluation

This feature space, in combination with the SVM hyper-parameters described above, was applied to the task at hand. The evaluation is discussed in the next paragraphs. The split between training, validation and evaluation data sets as used in these experiments is detailed in Table 9.1.

Table 9.3 shows the error rates of the support-vector machine in percent of wrongly classified examples for both the training and validation data. It shows a tendency towards a lower error rate with a higher line offset, which is to be expected based on the observation that the character error rate in line-wise transcription increases faster than in paragraph-wise transcription. As Table 9.2 shows, the number of examples that truly should be transcribed using MDCC increases with an increasing line offset. The gap between training and validation error rates in most rows of Table 9.3 also shows that overfitting is to be expected in this approach.

Table 9.3: Percentages of wrongly classified examples from the training and validation data while using the image-based SVM classifier of Figure 9.2.1.

| Line offset | Line segmentation | Error rate (training) | Error rate (validation) |
|---|---|---|---|
| 0 mm | Ground Truth | 2.44% | 10.38% |
| 0 mm | Tesseract | 22.87% | 41.96% |
| 0 mm | Ocrad | 24.26% | 43.64% |
| 0 mm | A* Paths | 27.14% | 40.35% |
| 3 mm | Tesseract | 24.51% | 28.57% |
| 3 mm | Ocrad | 31.77% | 37.27% |
| 3 mm | A* Paths | 29.37% | 36.61% |
| 5 mm | Tesseract | 21.19% | 29.46% |
| 5 mm | Ocrad | 31.17% | 44.64% |
| 5 mm | A* Paths | 19.93% | 30.36% |
| 10 mm | Tesseract | 2.59% | 7.83% |
| 10 mm | Ocrad | 6.00% | 9.73% |
| 10 mm | A* Paths | 3.29% | 1.74% |

Table 9.4 includes the character error rates when applying this image-based support-vector machine classifier to the IAM offline handwriting database.
The reported CER is the average over all examples within the evaluation data set. These results show a slight improvement in the CER when combining multi-dimensional connectionist classification with CTC and the GNU Ocrad line segmentation on paragraphs whose line offset was artificially reduced by 10 millimeters. However, in most cases MDCC alone results in a lower error rate.

Table 9.4: Average CER when combining MDCC and CTC by applying a support-vector machine (SVM) on features extracted from the grayscale image. Table layout is identical to Table 9.2 with the last two columns changed. The highlighted CER (marked with *) is the instance where the combined error rate is lower than that of both the line- and the paragraph-wise transcription on its own.

| Line offset | Line segmentation | CER CTC | CER MDCC | SVM on images: MDCC selected in | SVM on images: CER |
|---|---|---|---|---|---|
| 0 mm | Ground Truth | 7.94 | 10.22 | 0.0% | 7.94 |
| 0 mm | Tesseract | 16.74 | 10.22 | 99.57% | 10.22 |
| 0 mm | Ocrad | 18.48 | 10.22 | 87.07% | 10.71 |
| 0 mm | A* Paths | 16.31 | 10.22 | 99.14% | 10.31 |
| 3 mm | Tesseract | 20.53 | 10.80 | 88.79% | 11.17 |
| 3 mm | Ocrad | 16.87 | 10.80 | 87.50% | 10.95 |
| 3 mm | A* Paths | 19.09 | 10.80 | 94.83% | 11.36 |
| 5 mm | Tesseract | 27.58 | 12.76 | 94.83% | 12.80 |
| 5 mm | Ocrad | 20.19 | 12.76 | 96.55% | 12.85 |
| 5 mm | A* Paths | 24.83 | 12.76 | 98.71% | 12.89 |
| 10 mm | Tesseract | 74.77 | 31.20 | 100.0% | 31.20 |
| 10 mm | Ocrad | 56.87 | 31.20 | 97.41% | 31.08* |
| 10 mm | A* Paths | 63.77 | 31.20 | 100.0% | 31.20 |
| Average | Tesseract | 34.90 | 16.24 | - | 16.34 |
| Average | Ocrad | 28.10 | 16.24 | - | 16.39 |
| Average | A* Paths | 31.00 | 16.24 | - | 16.44 |

9.3 Classifier on Transcribed Texts

Classifier

The approach detailed in this section moves the classifier from the beginning of the pipeline (where only a grayscale image is available) to the end of the pipeline (where the transcribed text is available). The classifier of this section is hence based on extracting character n-grams from the transcribed text and comparing the n-gram frequencies of both the line- and the paragraph-wise transcription with a reference corpus to decide which of the two transcriptions is closer to the expected n-gram frequency distribution. Figure 9.3.1 outlines this classifier method.
Both a line- and a paragraph-wise transcription of the paragraph image are performed to obtain the two transcription variants. All character n-grams of a constant size are then extracted from these transcribed texts. Character n-grams were extracted beforehand from a reference corpus of natural texts of the same language. Comparing the n-gram frequencies of the two transcribed texts with those of the reference corpus according to the Jensen-Shannon divergence[87, 97] yields the classification of which of the two texts is closer to the reference corpus. The transcription with the lower Jensen-Shannon divergence between its n-gram frequencies and those of the reference corpus is assumed to contain fewer transcription errors.

[Figure 9.3.1: flow diagram with the boxes Line Segm., CTC Transcr., MDCC Transcr., N-Grams (one per transcription), N-Grams of Training Set, Lower Jensen-Shannon Divergence and XOR.]

Figure 9.3.1: Diagram showing the application of n-gram frequencies and the Jensen-Shannon divergence to decide between line- and paragraph-wise transcription. Dashed lines are only executed for one of the alternatives.

N-Grams and Reference Corpus

A character n-gram as applied in the context of this classifier is a contiguous sub-sequence of a constant number of characters from a natural text or a transcription thereof. The proposed classifier only extracts full n-grams from the texts, that is, it uses no partial n-grams of shorter length that occur at the beginning or end of the text. Figure 9.3.2 shows the extraction of character 3-grams from the text ‘Hello World’.

[Figure 9.3.2: the text ‘Hello World’ with the 3-grams ‘Hel’, ‘ell’, ..., ‘o W’, ..., ‘orl’ and ‘rld’ marked.]

Figure 9.3.2: Example of extracting character 3-grams from the text ‘Hello World’. Only a subset of the contained 3-grams is shown to outline the process at the beginning, middle and end of the text. The full list of 3-grams in this text is ‘Hel’, ‘ell’, ‘llo’, ‘lo ’, ‘o W’, ‘ Wo’, ‘Wor’, ‘orl’ and ‘rld’.

The first step in the proposed n-gram based classifier is to extract character n-grams in this fashion from the reference corpus of natural texts of the same language. In this work, the truth texts of the training data set were used as the reference corpus. All truth texts of the IAM offline handwriting database originate from the Lancaster-Oslo/Bergen (LOB) corpus[62] and as such are of the same language. The sample split of the larger writer-independent text line recognition task on the IAM database is furthermore designed in such a way that the splits are mutually exclusive regarding their truth texts, that is, no truth text and no writer occurs in more than one sample split. The splits being mutually exclusive makes the choice of the training data set as the reference corpus a safe one, without risk of leaking information from the evaluation data into the training data. Character n-grams were hence extracted from the training data and kept with their respective counts for further use.

The classifier proposed in this section uses character 3-grams for measuring the divergence between two texts. Natural language processing methods that operate on the basis of n-grams typically allow the configuration of the n-gram size. Shorter n-grams are less sensitive to transcription errors, since more n-grams in total are extracted from the same text and there is less chance of a transcription error corrupting a specific individual n-gram. Longer n-grams carry more information and are thus better suited for the detection of transcription errors. However, since there are fewer n-grams in total, a single transcription error may overly impact the result. The choice of the n-gram size is therefore a trade-off between the robustness of the classifier and the significance of each transcription error. The choice of character 3-grams in this work was based on preliminary experiments with lengths of two, three and four characters per n-gram.
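The extraction of full character n-grams described in this section can be sketched in a few lines. The function name `char_ngrams` is a hypothetical choice for this illustration and not taken from the original implementation.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count all full character n-grams of `text`.

    Partial n-grams at the beginning or end of the text are not produced,
    so a text shorter than n characters yields no n-grams at all.
    """
    return Counter(text[k:k + n] for k in range(len(text) - n + 1))

counts = char_ngrams("Hello World")
print(list(counts))
```

For the text ‘Hello World’ this yields the nine 3-grams listed in the caption of Figure 9.3.2, each with a count of one. Applying the same function to the reference corpus gives the counts that are kept for later comparison.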
We will use the symbol $c_{i,t}$ throughout this section to denote the number of occurrences of the character 3-gram $i$ in text or corpus $t$. The three corpora and texts in use are $R$ for the reference corpus and $M$, $N$ for the two transcribed texts. A count of $c_{i,R} = 1$ is assumed whenever the n-gram $i$ does not occur in corpus $R$ but does occur in either transcription $M$ or $N$. The same holds for n-grams missing from one transcribed text in relation to the other text and the reference corpus. Assuming a default n-gram count whenever an n-gram does not exist within a specific text, a so-called out-of-vocabulary word, is necessary to avoid numerical instabilities and the omission of information from the metrics in use.

Frequencies and Jensen-Shannon Divergence

We will now discuss how to apply the Jensen-Shannon divergence in a classifier that estimates whether the line- or the paragraph-wise transcribed text is closer to the natural language of the training corpus. The Jensen-Shannon divergence is a symmetric variant of the Kullback-Leibler (KL) divergence[76], which we need to define first. The KL divergence measures the uncertainty about a reference (unobserved) probability distribution given an observed probability distribution. That is, it measures the mean number of additional bits required to encode the symbols of the reference distribution given the encoding scheme of the observed distribution. The KL divergence is asymmetric in its two arguments. The Jensen-Shannon divergence removes this asymmetry by averaging the Kullback-Leibler divergences of both the reference and the observed probability distribution, each measured in relation to the mean of both distributions. The next few paragraphs detail the Jensen-Shannon divergence applied to character n-grams.

The frequency

$$f_{i,t} = \frac{c_{i,t}}{\sum_{j \in t} c_{j,t}} \qquad (9.3.1)$$

of an n-gram $i$ within a text or corpus $t$ is its number of occurrences $c_{i,t}$ normalized by the total number of n-gram occurrences within the text or corpus.
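As a rough illustration of the default count and of Equation 9.3.1, together with the Jensen-Shannon comparison described in prose above, the following sketch selects the transcription whose 3-gram frequencies are closer to those of the reference corpus. It is not the implementation used in the experiments: the function names and toy counts are hypothetical, base-2 logarithms are assumed because the text speaks of bits, and normalizing after the default count has been applied is an assumption of this sketch.

```python
from math import log2

def frequencies(counts, vocabulary, default=1):
    """Relative frequencies; n-grams missing from `counts` get `default`."""
    smoothed = {g: counts.get(g) or default for g in vocabulary}
    total = sum(smoothed.values())
    return {g: c / total for g, c in smoothed.items()}

def kl(f_u, f_v):
    """Kullback-Leibler divergence of f_u relative to f_v, in bits."""
    return -sum(p * log2(f_v[g] / p) for g, p in f_u.items())

def jsd(f_t, f_r):
    """Mean of the two KL divergences towards the averaged distribution."""
    f_q = {g: 0.5 * (f_t[g] + f_r[g]) for g in f_t}
    return 0.5 * (kl(f_t, f_q) + kl(f_r, f_q))

# Hypothetical toy counts: R is the reference corpus, M and N are the
# two transcriptions; N contains the misread 3-gram "tbe".
c_r = {"the": 5, "he ": 3, "and": 2}
c_m = {"the": 2, "he ": 1}
c_n = {"tbe": 2, "he ": 1}

# All three texts share one vocabulary, so every n-gram has a count.
vocabulary = set(c_r) | set(c_m) | set(c_n)
f_r, f_m, f_n = (frequencies(c, vocabulary) for c in (c_r, c_m, c_n))

chosen = "M" if jsd(f_m, f_r) <= jsd(f_n, f_r) else "N"
print(chosen)
```

With the default count in place, every frequency is strictly positive, so none of the logarithms is evaluated at zero. This is one way to read the remark about numerical instabilities.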
Based on these n-gram frequencies, the Kullback-Leibler divergence

$$\mathrm{KL}(U, V) = -\sum_{i \in U \cup V} \left[ f_{i,U} \log\left( \frac{f_{i,V}}{f_{i,U}} \right) \right] \qquad (9.3.2)$$

measures the average number of additional bits required to encode the n-grams of text or corpus $U$ given a coding scheme based on the n-gram frequencies of text or corpus $V$.

As discussed above, the Jensen-Shannon divergence is based on the KL divergence measured in relation to the mean of the distributions $U$ and $V$. This averaged distribution is a finite discrete set that contains the union of n-grams

$$Q = U \cup V \qquad (9.3.3)$$

with their frequencies

$$f_{i,Q} = \frac{1}{2} \left[ f_{i,U} + f_{i,V} \right] \qquad (9.3.4)$$

being the average of the frequencies given by the sets $U$ and $V$. The Jensen-Shannon divergence of text $t$, in our case either transcription $M$ or $N$, to the reference corpus is then the average of the two KL divergences towards the mean $Q$ of $t$ and $R$. Equation 9.3.5 details the Jensen-Shannon divergence.

$$\mathrm{JSD}(t, R) = \frac{1}{2} \left[ \mathrm{KL}(t, Q) + \mathrm{KL}(R, Q) \right], \quad Q = t \cup R \qquad (9.3.5)$$

The classifier of this section computes the Jensen-Shannon divergence $\mathrm{JSD}(M, R)$ of the transcribed text $M$ and the reference corpus $R$. The same computation is performed for the transcribed text $N$ and the reference corpus. Whichever transcription produces the lower Jensen-Shannon divergence is assumed to be the one with fewer character errors, since its n-gram frequencies are closer to the expectation set by the reference corpus.

Empirical Evaluation

As with the support-vector machine of Section 9.2, this character n-gram based model classifies each paragraph of the IAM database into two categories: either the line-wise or the paragraph-wise transcription is assumed to yield the lower character error rate. The raw classification errors are provided in Table 9.5, where the percentage of wrongly classified examples is listed for the training and validation sets. The training set was used as the reference corpus. Table 9.6 includes the resulting average CER when applying this 3-gram classifier to the IAM database.
There are several cases where the character error rate is lower than that of both line-wise transcription using connectionist temporal classification and paragraph-wise transcription using multi-dimensional connectionist classification. On average, an improved CER can be expected when applying this classifier in combination with MDCC, CTC and the line segmentation of either Tesseract or A* path planning[140].

Table 9.5: Percentages of wrongly classified examples from the training and validation data while using the 3-gram classifier of Figure 9.3.1.

| Line offset | Line segmentation | Error rate (training) | Error rate (validation) |
|---|---|---|---|
| 0 mm | Ground Truth | 48.41% | 52.83% |
| 0 mm | Tesseract | 26.89% | 37.50% |
| 0 mm | Ocrad | 44.09% | 49.09% |
| 0 mm | A* Paths | 26.39% | 28.95% |
| 3 mm | Tesseract | 23.77% | 27.68% |
| 3 mm | Ocrad | 36.93% | 37.27% |
| 3 mm | A* Paths | 21.00% | 24.11% |
| 5 mm | Tesseract | 17.07% | 21.43% |
| 5 mm | Ocrad | 31.41% | 32.14% |
| 5 mm | A* Paths | 14.44% | 14.29% |
| 10 mm | Tesseract | 2.35% | 2.61% |
| 10 mm | Ocrad | 6.82% | 3.54% |
| 10 mm | A* Paths | 2.23% | 0.87% |

Table 9.6: Average CER when combining MDCC and CTC by applying a classifier based on the 3-gram character frequencies of the transcribed strings compared with the 3-gram character frequencies of the training truth texts as the reference corpus. Table layout is identical to Table 9.2 with the last two columns changed. Highlighted error rates (marked with *) are the instances where the combined error rate is lower than that of both the line- and the paragraph-wise transcription on its own.

| Line offset | Line segmentation | CER CTC | CER MDCC | 3-gram freq.: MDCC selected in | 3-gram freq.: CER |
|---|---|---|---|---|---|
| 0 mm | Ground Truth | 7.94 | 10.22 | 60.78% | 9.28 |
| 0 mm | Tesseract | 16.74 | 10.22 | 75.86% | 10.22 |
| 0 mm | Ocrad | 18.48 | 10.22 | 64.22% | 12.73 |
| 0 mm | A* Paths | 16.31 | 10.22 | 74.57% | 10.35 |
| 3 mm | Tesseract | 20.53 | 10.80 | 76.72% | 10.89 |
| 3 mm | Ocrad | 16.87 | 10.80 | 64.66% | 12.80 |
| 3 mm | A* Paths | 19.09 | 10.80 | 82.76% | 10.60* |
| 5 mm | Tesseract | 27.58 | 12.76 | 80.17% | 12.54* |
| 5 mm | Ocrad | 20.19 | 12.76 | 69.83% | 13.63 |
| 5 mm | A* Paths | 24.83 | 12.76 | 85.34% | 12.38* |
| 10 mm | Tesseract | 74.77 | 31.20 | 96.55% | 30.99* |
| 10 mm | Ocrad | 56.87 | 31.20 | 91.81% | 30.69* |
| 10 mm | A* Paths | 63.77 | 31.20 | 98.71% | 31.07* |
| Average | Tesseract | 34.90 | 16.24 | - | 16.16* |
| Average | Ocrad | 28.10 | 16.24 | - | 17.46 |
| Average | A* Paths | 31.00 | 16.24 | - | 16.10* |

9.4 Classifier on Segmentation Information

Classifier

The last classifier that this chapter proposes for deciding between paragraph- and line-wise transcription utilizes geometric information provided by the line segmentation algorithms. Both Tesseract and GNU Ocrad allow the retrieval of additional information about the extracted lines when applying the line segmentation to paragraph images. Tesseract distinguishes between different ‘segmentation levels’; for example, level 4 are whole lines and level 5 are words within lines. The same is true for GNU Ocrad, but with different wording for the levels. The segmentation output of both GNU Ocrad and Tesseract consists of axis-parallel rectangles around the respective segments. In this chapter, the classifier uses the provided coordinates of the four corner points of each line and word to extract features from paragraph images, while using the segment numbering provided by the algorithm for the topological ordering of the features. These coordinates are then transformed to create a feature space suitable for a support-vector machine as the classifier. Figure 9.4.1 outlines this approach.

[Figure 9.4.1: flow diagram with the boxes Tesseract Segm., GNU Ocrad Segm., Geometric Info (one per segmentation), SVM, Model Parameters, XOR, Line Segm., CTC Transcr. and MDCC Transcr.]
Figure 9.4.1: Flow when classifying each example according to the geometric information, that is the corner points of lines, provided by the segmentation algorithms of Tesseract and GNU Ocrad. Dashed lines are only executed for one of the alternatives.

As with the SVM of Section 9.2, the two classes of the task are designed to indicate whether a line- or paragraph-wise transcription can be expected to yield the lower error rate. Scikit-Learn was again employed for an exhaustive search over hyper-parameters and kernel functions. For the SVM of this section, a polynomial kernel of degree 5 was selected. A slack variable of C = 0.5 was used, which increases the strength of the regularization and thus reduces overfitting of the SVM parameters to the training data set. The data split indicated in Table 9.1 was used for training and validation. Applying the trained SVM to previously unseen paragraph images yields the classification of whether a line- or paragraph-wise transcription is expected to yield the lower character error rate.

Feature Space

Features in this method are extracted by applying the line segmentation algorithms of Tesseract and GNU Ocrad to the paragraph image and retrieving line- and word-level segmentation information. The coordinates of each segment are transformed to a minimal axis-parallel rectangle around this segment. Figure 9.4.2 shows an example of this with a paragraph image from the IAM database. Line-level segmentation is shown in green, word-level segmentation in red. The included segmentation is not exhaustive for this example, and there are more lines and words contained than marked in the figure. It serves as an example to illustrate this approach.

Figure 9.4.2: Example paragraph image from the IAM offline handwriting database showing the different segmentation levels employed for this classifier. Green rectangles mark the line-level segmentation, red ones the word-level segmentation.
Please note that the included segmentation is not exhaustive and serves as an example only. The example shows that one true text line may decompose into multiple line segments.

The number of segments extracted from the paragraph image is variable, since the number of text lines differs between paragraphs and each text line may contain a different number of words. As discussed in Section 7.5, there is a maximum of 12 text lines per paragraph in the IAM offline handwriting database, according to the ground truth data included with the data set. In order to obtain a feature space with a fixed number of dimensions, the number of line segments included in the feature transformation was set to 20. Paragraphs with more line segments were truncated to the first 20, and those with fewer were filled up with feature values that cannot occur naturally. The ground truth data of the IAM database indicates an average of 7 words per text line. For feature extraction, the number of word segments was set to 20 per text line.

The features for use in this support-vector machine are transformed per line or word segment and are designed to distinguish between valid and invalid segmentation results. The first step in the proposed feature transformation is to normalize the coordinates of the four corner points of each segment to the interval [0, 1], with the coordinate (0, 0) being the top-left corner and the coordinate (1, 1) always being the bottom-right corner of the paragraph image. This prevents the variable size of the paragraph images from influencing the feature space. Line and word segments are processed during feature transformation in such an order that a top-to-bottom order for lines and a left-to-right order for words can be assumed. This order of processing is based on the topological order of segments as provided by the line segmentation algorithm. It is not based on the pixel coordinates of the segments.
This is an important distinction in this feature transformation, since a valid segmentation result iterated in topological order should also produce a top-to-bottom and left-to-right ordering in pixel coordinates. Invalid orderings result in backward jumps in pixel coordinates if text lines or words were extracted in the wrong order during segmentation. Figure 9.4.3 shows such a backward jump, in which the topological ordering given by the segmentation algorithm is written in the rectangles and arrows indicate movements in pixel space when following the top-left corner of each line segment. A valid movement can be observed from line one to two, but not from two to three. The features included in the feature space are designed to recognize such cases.

[Figure 9.4.3: three line segments labeled Line 1, Line 3 and Line 2, connected by arrows.]

Figure 9.4.3: Example of invalid line segments. The line numberings are given as the topological order by the segmentation algorithm. However, an invalid movement in pixel space is observed going from line two to three.

Another potential problem in line- and word-level segments are overly large jumps in pixel space. Large gaps in pixel space while moving through segments in topological order can be interpreted as words or text lines that exist within the paragraph image but were missed by the line segmentation algorithm. Such gaps are also visible in the proposed feature space.

The features included in the proposed feature space are thus as follows, with all feature types being transformed from both the Tesseract and the GNU Ocrad segmentation:

• The four corner points of the first 20 line segments with pixel coordinates normalized to a [0, 1] interval. If there are fewer than 20 line segments, the remaining ones are filled up with values of −1, which cannot occur in true line segments.
• In the same way, the four corner points of the first 20 word segments per text line.
• The contained area per word segment in normalized coordinates.
• The vector from the top-left corner of the previous word segment to the top-left corner of the current word segment.
• In the same way, the vector from the previous bottom-right corner to the current top-left corner.

Like the feature space of Section 9.2, this feature space includes a large number of dimensions, which slows down the optimization of the SVM parameters and increases the risk of overfitting to the training data set. ANOVA was again applied to reduce the number of dimensions of the feature space to one quarter of the number of training examples.

Empirical Evaluation

The proposed SVM classifier and feature extraction were experimentally applied to the IAM offline handwriting database with the data split detailed in Table 9.1. The resulting raw error rate in terms of the percentage of wrongly classified paragraphs is shown in Table 9.7. These observations are similar to those for the SVM of Section 9.2 in that there is some overfitting to the training data, but the overall error rate decreases slightly for more difficult paragraphs, that is ones with the offset between lines artificially reduced.

Table 9.7: Percentages of wrongly classified examples from the training and validation data while using the segmentation-based SVM classifier of Figure 9.4.1.

| Line offset | Line segmentation | Error rate (training) | Error rate (validation) |
|---|---|---|---|
| 0 mm | Ground Truth | 2.80% | 10.38% |
| 0 mm | Tesseract | 19.95% | 40.18% |
| 0 mm | Ocrad | 21.43% | 43.64% |
| 0 mm | A* Paths | 25.15% | 36.84% |
| 3 mm | Tesseract | 22.92% | 28.57% |
| 3 mm | Ocrad | 28.18% | 37.27% |
| 3 mm | A* Paths | 26.94% | 33.04% |
| 5 mm | Tesseract | 19.98% | 33.04% |
| 5 mm | Ocrad | 27.80% | 35.71% |
| 5 mm | A* Paths | 15.75% | 26.79% |
| 10 mm | Tesseract | 1.88% | 6.96% |
| 10 mm | Ocrad | 4.59% | 6.19% |
| 10 mm | A* Paths | 1.88% | 1.74% |

Table 9.8 shows the character error rate when applying this classifier to the evaluation set of the IAM database, computing the average CER of the transcribed texts.
The transcriptions were retrieved either by applying multi-dimensional connectionist classification to paragraph images or by applying connectionist temporal classification to line images, depending on the outcome of the SVM prediction. Combined transcription methods that show a lower average CER than both MDCC and CTC alone are highlighted (marked with *). Averaging over all artificial line offsets shows that, in general, a lower CER can be expected when applying this method in combination with the GNU Ocrad line segmentation algorithm.

Table 9.8: Average CER when combining MDCC and CTC by applying an SVM on the coordinates provided by the Tesseract and GNU Ocrad line segmentation algorithms. Table layout is identical to Table 9.2 with the last two columns changed. Highlighted error rates (marked with *) are the instances where the combined error rate is lower than that of both the line- and the paragraph-wise transcription on its own.

| Line offset | Line segmentation | CER CTC | CER MDCC | SVM on segm.: MDCC selected in | SVM on segm.: CER |
|---|---|---|---|---|---|
| 0 mm | Ground Truth | 7.94 | 10.22 | 0.0% | 7.94 |
| 0 mm | Tesseract | 16.74 | 10.22 | 98.28% | 10.24 |
| 0 mm | Ocrad | 18.48 | 10.22 | 88.36% | 10.10* |
| 0 mm | A* Paths | 16.31 | 10.22 | 99.14% | 10.25 |
| 3 mm | Tesseract | 20.53 | 10.80 | 96.12% | 10.76* |
| 3 mm | Ocrad | 16.87 | 10.80 | 80.60% | 10.79* |
| 3 mm | A* Paths | 19.09 | 10.80 | 95.69% | 11.11 |
| 5 mm | Tesseract | 27.58 | 12.76 | 98.71% | 12.80 |
| 5 mm | Ocrad | 20.19 | 12.76 | 81.03% | 12.47* |
| 5 mm | A* Paths | 24.83 | 12.76 | 96.12% | 13.21 |
| 10 mm | Tesseract | 74.77 | 31.20 | 98.28% | 31.25 |
| 10 mm | Ocrad | 56.87 | 31.20 | 92.24% | 31.54 |
| 10 mm | A* Paths | 63.77 | 31.20 | 99.57% | 31.17* |
| Average | Tesseract | 34.90 | 16.24 | - | 16.26 |
| Average | Ocrad | 28.10 | 16.24 | - | 16.22* |
| Average | A* Paths | 31.00 | 16.24 | - | 16.43 |

9.5 Discussion

This chapter discussed and detailed three methods for deciding whether line- or paragraph-wise transcription should be used for a specific paragraph. The goal of this decision is to reduce the overall character error rate of the transcription process. Connectionist temporal classification was employed for line-wise transcription and multi-dimensional connectionist classification for paragraph-wise transcription.
As before, the line segmentation algorithms of Tesseract and GNU Ocrad, as well as one based on A* path planning[140], were used to obtain line segmentation images. Results were presented as empirical evaluations on the IAM offline handwriting database, evaluating both the raw classification error of the underlying two-class classifier and the average CER of the transcribed texts.

Section 9.2 proposed the application of support-vector machines on features extracted directly from the grayscale paragraph image. Section 9.4 extracted features from coordinates of line and word segments provided by Tesseract and GNU Ocrad and set up an SVM classifier on these features. Section 9.3 discussed an approach for comparing character n-grams of the transcribed texts to a reference corpus and choosing the transcription that is closer to the expected n-gram distribution.

The evaluations of this chapter show that deciding whether line- or paragraph-wise transcription should be applied while only observing the grayscale paragraph image is difficult. The results in this case show only a slight improvement in CER for one single evaluation case. Classifying the geometric information provided by the corner points of line and word segments from two line segmentation algorithms yielded better results and a lower error rate. However, the overall lowest transcription error was achieved by comparing the transcribed texts to a reference corpus using the n-gram distribution. In combination with an A* path planning segmentation, this approach reduced the CER from 31.00 in a line-wise transcription and 16.24 in a paragraph-wise transcription to 16.10 in the combined transcription.

It is worth noting that all three proposed methods add additional steps to the transcription pipeline that produces transcribed texts from images of handwritten paragraphs. As such, these methods should only be employed if the character error rate of the transcribed texts is the main concern of the task at hand.
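The n-gram comparison of Section 9.3 summarized above can be illustrated with a short sketch. It is not the implementation evaluated in this chapter: the use of bi-grams, the L1 distance between relative frequencies, and all function names are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_distribution(text, n=2):
    """Relative frequencies of the character n-grams in text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


def distribution_distance(p, q):
    """L1 distance between two sparse distributions (assumed metric)."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def choose_transcription(line_wise, paragraph_wise, reference, n=2):
    """Return the transcription whose n-gram distribution is closer to the corpus."""
    ref = ngram_distribution(reference, n)
    d_line = distribution_distance(ngram_distribution(line_wise, n), ref)
    d_paragraph = distribution_distance(ngram_distribution(paragraph_wise, n), ref)
    return line_wise if d_line < d_paragraph else paragraph_wise
```

In this sketch a transcription that shares no n-grams with the reference corpus receives the maximum distance of 2, so a garbled transcription loses against any candidate with at least some overlap.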
The methods of this chapter show that combining line- and paragraph-wise transcription yields an overall lower error rate than either of the two methods alone, given that the deep neural networks in use for both CTC and MDCC are similar in structure and capacity. This shows that paragraph-wise transcription using MDCC can be a worthwhile addition to handwriting recognition systems. It also reinforces the conclusion of Chapter 7 that MDCC is preferable to CTC for hard-to-segment paragraphs, as seen again in the experiments with an artificially reduced line offset. Of course, the symmetry in the DNN topologies between the MDCC and CTC methods can be broken and the neural networks tuned towards either CTC or MDCC to achieve lower error rates. However, one of the goals of this chapter is to show that MDCC as a method, in terms of its training system, loss function and decoding algorithms, is a good addition to the transcription method proposed in CTC.

Chapter 10 Dictionary-Based Decoding Algorithms

10.1 Overview and Relation to This Work

This chapter proposes and details two approaches to single-line decoding in offline handwriting recognition. These two ideas of Sections 10.2 and 10.3 can potentially be applied to both connectionist temporal classification (CTC), see Section 3.1, and the line decoding in multi-dimensional connectionist classification (MDCC), which is detailed in Chapter 5.

Improvement of single-line decoding methods is not the main focus of this thesis, which deals with paragraph-wise transcription of handwritten text. We chose to focus on paragraph-wise transcription in the context of this thesis instead of further following the ideas discussed in Sections 10.2 and 10.3. Please see the original publications[119, 122] for more details on these two approaches.

10.2 Decoding using a Large Lexicon and Fuzzy Search

This section is based on the following paper:

Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Increasing Robustness of Handwriting Recognition Using Character N-Gram Decoding on Large Lexica.” In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS). Apr. 2016, pp. 156–161. DOI: 10.1109/DAS.2016.43

Section 1.3 outlines the author's individual contributions.

Overview

Section 3.1 of this work discussed connectionist temporal classification (CTC)[46], which is a method for transcribing one-dimensional sequences, e.g. single lines of handwritten text or audio recordings. CTC introduced both a loss function for training deep neural networks towards this task and decoding algorithms to retrieve a high-likelihood label sequence from the DNN prediction. This decoded label sequence, e.g. a sequence of glyphs from an alphabet, constitutes the transcription result based on the given input.

The deep neural networks employed for CTC end with a softmax function as their last layer and as such produce the likelihood of each time step of the sequence being part of a specific glyph of the alphabet, with the glyphs being mutually exclusive per time step. Decoding the DNN output means uncovering a high-likelihood label sequence, that is a sequence of glyphs, that explains this prediction. There are different methods to approach this task. We have discussed best path decoding and beam search decoding before, both proposed by Alex Graves for CTC[43, ch. 7.5.2]. Best path decoding identifies the single label sequence, that is one path through e.g. Figure 3.1.1, with the highest overall probability by greedily collecting the glyph with the highest probability per time step. This decoding algorithm is fast and low on memory usage, but transcription quality suffers if parts of the DNN output are weakly predicted.
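As a minimal sketch of best path decoding as described above: the glyph with the highest probability is collected per time step, adjacent repetitions are merged and separators are removed. The function name, the small example matrix and the use of "ϵ" as the separator symbol are assumptions for illustration.

```python
def best_path_decode(probs, alphabet, blank="ϵ"):
    """Greedy CTC decoding.

    probs[t][k] is the predicted probability of alphabet[k] at time step t.
    """
    # Pick the most probable glyph at every time step.
    path = [alphabet[max(range(len(row)), key=row.__getitem__)] for row in probs]
    decoded, previous = [], None
    for glyph in path:
        # Merge adjacent repetitions and drop the separator glyph.
        if glyph != previous and glyph != blank:
            decoded.append(glyph)
        previous = glyph
    return "".join(decoded)


example = [
    [0.6, 0.1, 0.3],  # A
    [0.7, 0.1, 0.2],  # A (repetition of the same glyph)
    [0.1, 0.1, 0.8],  # separator
    [0.2, 0.7, 0.1],  # B
    [0.1, 0.3, 0.6],  # separator
]
print(best_path_decode(example, ["A", "B", "ϵ"]))  # AB
```

Because each time step is decided independently, a single weakly predicted time step can change the decoded string, which is the weakness noted above.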
Beam search decoding builds on best path decoding by observing that there may be multiple paths that decode to the same label sequence. Beam search decoding marginalizes over all paths that decode to the same sequence and in this way introduces robustness against weakly predicted time steps. However, beam search decoding is computationally expensive since it keeps and manages a trie with all prefixes of label sequences known so far.

The publication[119] outlined in this section proposes a decoding algorithm for CTC based on matching character n-grams between the deep neural network prediction and an index of possible strings. It proposes to build an index out of a dictionary of strings beforehand, then to extract character n-grams online from the DNN prediction, weigh them with probabilities and identify dictionary strings with a high probability of explaining the DNN output. This decoding algorithm can be applied both as a stand-alone method and as a pre-filter followed by beam search decoding. The proposed decoding algorithm is both robust in weakly predicted parts and capable of improving decoding speeds in beam search decoding when used as a pre-filter.

Index Generation

The first step in applying n-gram decoding to CTC is to build an n-gram index for later use. This can be done offline, before transcription using CTC, and the resulting index can be stored on disk. The overall process is outlined in Figure 10.2.1. The input into the index generation process is a dictionary of strings to which the decoding algorithm should be able to decode. It is thus assumed to be in the same language as the texts transcribed later and to use the same alphabet.

[Figure 10.2.1: diagram of the index generation process. A dictionary (ID 1 "abc", ID 2 "abd", ...) is turned into an index structure that maps each bi-gram ("AB", "BC", "BD", ...) to a hit list of ID and position entries.]

Figure 10.2.1: Structure of the generated n-gram index. Hit lists are sorted by ascending ID first and ascending position second.

The index structure consists of a map that relates character n-grams, in this example bi-grams, to hit lists. Upper and lower case characters are all mapped to their upper case variants to make the decoding method more robust. The data structure for this map must contain each n-gram (key) exactly once and allow fast exact lookup. Tries or hash maps are suitable examples. When choosing a data structure for this map, emphasis should be placed on a low runtime for lookup, that is for mapping n-grams to hit lists, since this is critical for a low runtime during transcription of handwritten texts using this decoding method.

The index further contains exactly one hit list per character n-gram of the dictionary. Each hit list marks the occurrences of this specific n-gram. The hit lists are simple sorted lists of 2-tuples containing an identifier for the dictionary entry and the position within the string, counted in characters from the beginning of the string. Sorting these hit lists by ascending identifiers and positions is beneficial for applying fast intersection algorithms to these lists.

Decoding using Fuzzy Search

After the index described above has been built, it can be applied to filter the dictionary for strings that have a high probability given the prediction of a deep neural network trained by CTC. As mentioned before, the last layer of a DNN for CTC is a softmax function that produces likelihoods for assigning each time step to one of the glyphs from the given alphabet. Assignments of the glyphs are mutually exclusive per time step. Table 10.1 illustrates a DNN prediction for a simple case.

Table 10.1: Example output of a deep neural network trained with connectionist temporal classification. The sequence consists of ten time steps and predicts labels from an alphabet of three glyphs plus the glyph separator ϵ of CTC. The probabilities of each time step (column) are mutually exclusive and thus sum up to exactly one.

| Time step | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| A | 0.5 | 0.9 | 0.1 | 0.0 | 0.3 | 0.0 | 0.1 | 0.0 | 0.3 | 0.1 |
| B | 0.0 | 0.0 | 0.1 | 0.7 | 0.3 | 0.9 | 0.7 | 0.0 | 0.2 | 0.0 |
| C | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.4 | 0.7 |
| ϵ | 0.2 | 0.1 | 0.8 | 0.3 | 0.4 | 0.1 | 0.2 | 1.0 | 0.1 | 0.2 |

The first step in decoding the soft-assignment of Table 10.1 is to extract character n-grams from it and weigh them with probabilities. The order of the n-grams (bi-gram, tri-gram, ...) must be identical to that of the n-grams used for index generation. Upper and lower case glyphs must also be mapped to all upper case if this was done in index generation. The glyph separator ϵ is contained in the soft-assignment as predicted by the DNN, but not in the index structure of Figure 10.2.1. Glyph separators ϵ are thus used for computing the weight of each n-gram, but are not included in the lookup in the map structure to retrieve the matching hit list.

N-grams are extracted from the soft-assignment by application of a backtracking[22] algorithm. Backtracking is started once at every time step and extracts all n-grams starting at it. Only n-grams starting and ending with a printable glyph, that is not a separator ϵ, are extracted. Since the CTC label sequence alternates between printable glyphs and the separator, each n-gram consists of n time steps of printable glyphs and up to n−1 separator glyphs. There may be fewer than the maximum of n−1 separator glyphs ϵ, since a separator is not required for decoding if the two adjacent glyphs are different: these can be distinguished from a single character that repeats over multiple time steps without a separator glyph in between. The backtracking algorithm thus extracts character n-grams spanning between n and (2 × n) − 1 time steps, beginning at each time step. All n-grams start and end with printable glyphs.
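The extraction step can be illustrated with a small backtracking sketch that enumerates, for one starting time step, every path of n to (2 × n) − 1 time steps that starts and ends with a printable glyph and decodes to exactly n characters. The function names are hypothetical. As an assumption, paths in which a glyph repeats over adjacent time steps (for example AAB for the bi-gram AB) are also enumerated, following the CTC collapsing rule.

```python
BLANK = "ϵ"  # glyph separator of CTC


def collapse(path):
    """Collapse a CTC path: merge adjacent repeats, then drop separators."""
    decoded, previous = [], None
    for glyph in path:
        if glyph != previous and glyph != BLANK:
            decoded.append(glyph)
        previous = glyph
    return "".join(decoded)


def ngram_paths(alphabet, n):
    """Backtrack over all paths of n to 2n-1 time steps.

    Returns a dict mapping each character n-gram to the paths decoding to it.
    """
    symbols = list(alphabet) + [BLANK]
    max_steps = 2 * n - 1
    found = {}

    def extend(path):
        text = collapse(path)
        if len(text) > n:
            return  # too many characters: abandon this branch
        if path and path[-1] != BLANK and len(text) == n:
            found.setdefault(text, []).append("".join(path))
        if len(path) == max_steps:
            return
        for symbol in symbols:
            if not path and symbol == BLANK:
                continue  # paths must start with a printable glyph
            extend(path + [symbol])

    extend([])
    return found
```

For the alphabet {A, B} and bi-grams, this yields the paths AB, AϵB, AAB and ABB for the bi-gram AB, while the bi-gram AA can only be reached through AϵA because the separator is required between two identical characters.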

The proposed backtracking algorithm extracts character n-grams from the DNN prediction while at the same time calculating the mean probability over all labels included in this n-gram. Adjacent repetitions of the same glyph are resolved by using only the maximum probability for each character. This behavior is in line with the observation that DNNs trained with connectionist temporal classification tend to predict ‘spikes’, each one being a very localized high-probability prediction for a character. The weight of each extracted n-gram G is then the mean probability over all ‘spikes’ used during extraction

\[ P(G \mid y) = \frac{1}{(2 \times n) - 1} \sum_{(t, g) \in G} y^t_g \tag{10.2.1} \]

with y being the soft-assignment produced by CTC, e.g. Table 10.1, and t, g being the time step and glyph used for each position in the n-gram G. Some example probabilities of bi-grams contained in Table 10.1 starting at the first time step are:

• $P(A\epsilon A \mid y) = \frac{1}{3}(0.5 + 0.1 + 0.1) \approx 0.233$
• $P(A\epsilon B \mid y) = \frac{1}{3}(0.5 + 0.1 + 0.1) \approx 0.233$
• $P(AAB \mid y) = \frac{1}{3}(0.5 + 0.9 + 0.1) = 0.5$

The second step of the decoding algorithm is to identify the index hit lists, see Figure 10.2.1, matching the extracted n-grams. This is simply a series of exact searches within the map data structure. N-grams that were extracted from the CTC prediction but that are not contained in the index structure are skipped. Each hit list is weighed by the probability of the matching extracted n-gram as defined by Equation 10.2.1.

The last step during decoding is to intersect the n-gram hit lists, weighed by their respective probability, and to identify the dictionary strings with the highest probability. Probabilities of the dictionary strings are again computed by averaging the probabilities of the extracted n-grams contained within the string. This allows unmatched n-grams to be weighed with zero probability, so that index queries can be performed with incomplete information and still yield meaningful dictionary entries.
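The three steps can be sketched together in a few lines. The sketch reproduces the example weights for Table 10.1 and the index of Figure 10.2.1, but it is not the published implementation: all function names are hypothetical, the path weight is simply the mean over the time steps of the path (which equals the normalization of Equation 10.2.1 for the three-step examples), and averaging over the number of n-grams of each dictionary entry is an assumed reading of the scoring rule.

```python
from collections import defaultdict

# Soft-assignment of Table 10.1: ten probabilities per glyph.
Y = {
    "A": [0.5, 0.9, 0.1, 0.0, 0.3, 0.0, 0.1, 0.0, 0.3, 0.1],
    "B": [0.0, 0.0, 0.1, 0.7, 0.3, 0.9, 0.7, 0.0, 0.2, 0.0],
    "C": [0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.7],
    "ϵ": [0.2, 0.1, 0.8, 0.3, 0.4, 0.1, 0.2, 1.0, 0.1, 0.2],
}


def path_weight(y, start, path):
    """Mean probability along path, read from time step start (0-based)."""
    return sum(y[g][start + i] for i, g in enumerate(path)) / len(path)


def build_index(dictionary, n=2):
    """Map each upper-cased n-gram to a sorted hit list of (ID, position)."""
    index = defaultdict(list)
    for entry_id, word in enumerate(dictionary, start=1):
        upper = word.upper()
        for pos in range(len(upper) - n + 1):
            index[upper[pos:pos + n]].append((entry_id, pos + 1))
    return {gram: sorted(hits) for gram, hits in index.items()}


def score_entries(dictionary, index, weights, n=2):
    """Average n-gram weight per entry; unmatched n-grams count as zero."""
    totals = defaultdict(float)
    for gram, weight in weights.items():
        # N-grams missing from the index are skipped by the empty default.
        for entry_id in {entry_id for entry_id, _ in index.get(gram, [])}:
            totals[entry_id] += weight
    return {
        word: totals[entry_id] / max(len(word) - n + 1, 1)
        for entry_id, word in enumerate(dictionary, start=1)
    }


index = build_index(["abc", "abd"])
scores = score_entries(["abc", "abd"], index, {"AB": 0.5, "BC": 0.4})
```

With the assumed weights for "AB" and "BC", the entry "abc" is ranked above "abd" because both of its bi-grams are matched, while "abd" still receives a non-zero score from the single matching bi-gram. The sketch ignores the positions stored in the hit lists, which a full implementation would use during intersection.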
Intersecting the hit lists is related to the multiple search and t-intersection problems. In these, ordered lists are intersected with the additional constraint that each element contained in the intersection must also be contained in at least t of the individual hit lists. Several algorithms have been proposed[4, 72, 146] to address these problems. The dictionary entries contained in the intersection are then weighed and ordered by their mean n-gram probability.

The dictionary entry with the highest probability according to the decoding algorithm outlined above is the one label sequence that explains the DNN prediction best, at least under the assumptions incorporated in this decoding method. This decoding algorithm can also be used as a pre-filter for beam search decoding. In this case not only the highest probability dictionary entry is used, but the top-n entries. These are then used to restrict beam search decoding to process only label sequences, and their prefixes, that are contained within this pre-filtered list of dictionary entries.

Evaluation

The proposed decoding algorithm using character n-grams was tested on a postal data set at Siemens Parcel Logistics. The data set consisted of address lines from the United States and Canada. 135000 line images were used for training and 3000 each for validation and evaluation. These three data splits were disjoint. The dictionary in use consisted of 423170 strings, containing the correct transcription texts and variants of them. The trained deep neural network showed a character error rate (CER) of 5.50 on the training, 7.01 on the validation and 6.86 on the evaluation set when decoded with unconstrained beam search decoding.

Table 10.2 shows the average CER and runtime per example of the evaluation split when decoded with beam search decoding constrained to either the full dictionary or a pre-filter obtained by the proposed n-gram index.
As expected, beam search with an unlimited beam width produces the lowest error rate, but at the highest cost. Pre-filtering the dictionary to the best 500 entries and constraining the beam search to those produces a slightly higher error rate (0.65 instead of 0.58), but for a fraction of the required runtime (18.9 instead of 7444.5 milliseconds). Choosing the single best matched dictionary entry results in an error rate of 3.05, which is high compared to the other decoding methods, with only little benefit in decoding speed.

Table 10.2: Character error rates on the evaluation split when constraining beam search decoding to the full dictionary or to a pre-filtered dictionary using n-gram decoding. Time measurements are the average wall-clock time to decode a single DNN output.

| Decoder | Configuration | CER | Runtime |
|---|---|---|---|
| Constr. on full dict. | beam-w. 10000 | 1.05 | 81.9 ms |
| Constr. on full dict. | beam-w. 100000 | 0.76 | 2297.7 ms |
| Constr. on full dict. | beam-w. unlimited | 0.58 | 7444.5 ms |
| Constr. by n-gram idx. | beam-w. 10000, 3-grams, filter 100 | 0.85 | 15.0 ms |
| Constr. by n-gram idx. | beam-w. 10000, 3-grams, filter 500 | 0.65 | 18.9 ms |
| Single best match | tri-grams | 3.05 | 13.3 ms |

10.3 Decoding using LSTM Networks and Metric Learning

This section is based on the following publication:

Martin Schall, Haiyan P. Buehrig, Marc-Peter Schambach, and Matthias O. Franz. “LSTM Networks for Edit Distance Calculation with Exchangeable Dictionaries.” In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS). Apr. 2018

Section 1.3 outlines the author's individual contributions.

Overview

The method discussed in this section is based on the observation that recurrent neural networks (RNNs) are Turing complete[101] and able to evaluate computer code[160]. In this section we will explore whether a long short-term memory (LSTM)[48, 55] network is capable of learning the algorithm for computing the Edit-distance[81, 151] between a query string and a dictionary of strings.
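For reference, the quantity that the network is asked to approximate can be computed exactly with dynamic programming. The sketch below assumes the Levenshtein variant of the Edit-distance with unit costs for insertion, deletion and substitution; the function names are chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs assumed)."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def distances_to_dictionary(query, dictionary):
    """One Edit-distance per dictionary entry, in dictionary order."""
    return [edit_distance(query, word) for word in dictionary]


print(distances_to_dictionary("house", ["house", "mouse", "horse", "hose"]))
# [0, 1, 1, 1]
```

A list of this form, one scalar per dictionary entry, is the kind of target that a regression model for this task would be trained against.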
Inference in the LSTM network yields the Edit-distances between the query string and every string of the dictionary. Both the query string and the dictionary of strings should be exchangeable for previously unseen variants. In other words, the LSTM network is required to learn to approximate the algorithm for computing the Edit-distance, independent of the data presented. Figure 10.3.1 shows an overview of this approach.

[Figure 10.3.1: diagram. A query string and a dictionary of strings are passed into a deep neural network, which outputs a list of Edit-distances.]

Figure 10.3.1: General idea for the metric learning of this section. The Deep Neural Network estimates the Edit-distances between an unseen query string and an unseen dictionary. The DNN should only learn the metric, that is the algorithm for computation of the Edit-distance, and not memorize a specific dictionary.

This method is related to offline handwriting recognition in terms of its potential application as a decoding algorithm. Incorporating this method would be similar to incorporating the one discussed in the previous Section 10.2. The method discussed in this section takes a specific query string as input and estimates the Edit-distances towards a given dictionary of strings. Replacing the query string with the output of a connectionist temporal classification (CTC)[46], see Section 3.1, would yield a decoding algorithm for CTC. Decoding in this case means finding the dictionary string with the minimal Edit-distance towards the CTC output. It could be employed in the paragraph-wise decoding algorithm of Chapter 5 for decoding individual lines.

Network Topology and Training Method

The first step in designing a deep neural network for estimating the Edit-distance between strings is to encode these strings in a way that is suitable for the task. In this case, each string is a sequence of characters from the Latin alphabet and the glyphs are mutually exclusive per position in the string.
That is, each string position holds exactly one character from the alphabet, not zero or multiple ones. As such, the strings were encoded in a one-hot scheme where each string position is encoded as a 26-dimensional vector, one dimension per Latin glyph, with exactly one position set to the value 1 and the others to 0. The glyph ‘A’ is encoded as a vector with the first coefficient set to 1 and the others to 0, ‘B’ with the second coefficient set to 1, and so on. In the case of these experiments, the strings were limited to a maximum length of ten characters. If the encoded string was shorter, the remaining positions were encoded as vectors with all zeros. The tensor for encoding the query string was thus of shape B × 10 × 26, with B being the mini-batch size and allowing for a maximum string length of 10 over the Latin alphabet of 26 characters.

Dictionaries of strings for comparison with the query string were encoded in a similar fashion. Each dictionary presented to the deep neural network contained 100 strings, stacked along the ‘encoding-dimension’. The tensor containing an encoded dictionary was thus of shape B × 10 × 2600 in the experiments of this section. Both the one-hot encoded query string and dictionary were presented to the LSTM network for estimation of the Edit-distance. Figure 10.3.2 illustrates this approach and topological details.

Figure 10.3.2 details the deep neural network topology employed in this method. The core idea is to stack bidirectional LSTM layers, each time presenting the encoded dictionary to the layer again. This allows the network to optimize towards computing the Edit-distance, not towards forwarding the 100 dictionary strings through its layers. In addition to this desired behavior, this approach allows the dictionary to be encoded outside of the DNN, making it exchangeable for previously unseen ones.
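The encoding described in this section can be sketched with plain nested lists. The mini-batch dimension B is omitted, and the function names as well as the choice to reject strings longer than ten characters are assumptions for illustration.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
MAX_LEN = 10


def encode_string(text):
    """One-hot encode a string as 10 x 26; unused positions stay all-zero."""
    if len(text) > MAX_LEN:
        raise ValueError("strings are limited to ten characters")
    rows = [[0] * len(ALPHABET) for _ in range(MAX_LEN)]
    for position, char in enumerate(text.upper()):
        rows[position][ALPHABET.index(char)] = 1
    return rows


def encode_dictionary(words):
    """Stack the encodings of all words along the encoding dimension."""
    encoded = [encode_string(word) for word in words]
    return [
        [value for word_rows in encoded for value in word_rows[position]]
        for position in range(MAX_LEN)
    ]


query = encode_string("ab")                   # 10 x 26
dictionary = encode_dictionary(["ab"] * 100)  # 10 x 2600
# Concatenating both along the last dimension gives 10 x 2626.
network_input = [q + d for q, d in zip(query, dictionary)]
```

Keeping the dictionary as a separate input tensor is what makes it exchangeable: a different set of 100 strings only changes the data fed to the network, not its parameters.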
Bidirectional LSTM was chosen since it has proven to work well for sequence labeling and sequence transformation tasks in the past, CTC being a prominent example. There is also work[39] showing that LSTM networks with forget gates are capable of recognizing basic syntax.

The last layer of the deep neural network is a fully connected feed-forward layer with a ReLU[85] non-linear activation function. The purpose of this fully connected layer is to estimate the Edit-distances as one scalar per dictionary entry. This way the overall problem is formulated as a regression task. ReLU is suitable here since the Edit-distance has a minimum of zero but is unbounded in the positive range, which is also true for the ReLU function.

[Figure 10.3.2: network diagram. The encoded dictionary (|Batch| x 10 x 2600) and the encoded queries (|Batch| x 10 x 26) are concatenated along the last dimension (|Batch| x 10 x 2626) and passed into a bidirectional LSTM (|Batch| x 10 x #Neurons). The output is again concatenated with the encoded dictionary along the last dimension and passed into the next bidirectional LSTM, repeated according to the topology. A fully connected layer with ReLU (|Batch| x 100) produces the estimated Edit-distances.]

Figure 10.3.2: The Deep Neural Network used for this method consists of a stack of bidirectional long short-term memory layers, ending with a fully connected layer that estimates the Edit-distances. The string length is limited to a maximum of 10 characters in the experiments of this section. The alphabet consists of 26 Latin characters. The dictionary passed into the DNN contains exactly 100 strings.

Data for training and evaluation was derived from the 20k most frequent English words[91]. The word list used for this work was retrieved from https://github.com/first20hours/google-10000-english/blob/master/20k.txt on the 29th of September 2017. These 20k most frequent English words were filtered to ones between 3 and 10 characters in length, leaving 16968 strings. 1000 random ones were used to build 10 dictionaries of 100 strings each. 9 of these 10 dictionaries were used for training and the last one was reserved for validation and evaluation.
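The construction of the dictionaries described above can be sketched as follows. The function name, the fixed random seed and the return layout are assumptions for illustration; only the length filter and the counts are taken from the text.

```python
import random


def build_dictionaries(words, seed=0):
    """Split a word list into 10 dictionaries of 100 strings and query strings."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    usable = [word for word in words if 3 <= len(word) <= 10]
    rng.shuffle(usable)
    dictionary_words, queries = usable[:1000], usable[1000:]
    dictionaries = [dictionary_words[i:i + 100] for i in range(0, 1000, 100)]
    # Nine dictionaries for training, one withheld; the rest are query strings.
    return dictionaries[:9], dictionaries[9], queries
```

Because the dictionary words are removed from the pool before the query strings are drawn, no query string is itself a dictionary entry in this sketch.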
Of the remaining 15968 strings, 80 percent were used as query strings for training of the DNN and 10 percent each for validation and evaluation.

Training was conducted by choosing a random dictionary from the 9 reserved for training. A random set of query strings was also chosen from those set aside for training. The true Edit-distances were computed for this combination of dictionary and query strings, facilitating optimization of the DNN in a supervised fashion. The optimization criterion was to minimize the mean squared error (MSE) between the true and estimated Edit-distances. The DNN parameters were optimized using backpropagation, gradient descent and Adam[67] in mini-batch mode.

Evaluation

Table 10.3 shows the average root mean squared error (RMSE) between the true and estimated Edit-distances after optimizing the deep neural network in the way detailed above. The RMSE is a measure of the absolute difference in the same unit as the estimated quantity, in this case the Edit-distance. The evaluation table is split into two parts: the upper one uses the dictionary withheld from training and the lower one uses the 9 dictionaries applied for training. The columns detail the RMSE for different DNN topologies with 2 to 5 LSTM layers with 30 to 200 neurons each. Figure 10.3.2 illustrates this network topology. The dictionaries were not shuffled and were kept in the same order for these experiments.

Table 10.3: Average RMSE when estimating the Edit-distances for known and unknown dictionaries. The order of strings within the dictionaries was kept constant for training and evaluation.

| Dictionary | Data set | 2 × 30 | 2 × 60 | 3 × 60 | 5 × 200 |
|---|---|---|---|---|---|
| Unkn. Dict. | Eval. set | 1.78 | 1.78 | 1.56 | 2.13 |
| Unkn. Dict. | Val. set | 1.78 | 1.80 | 1.57 | 2.14 |
| Unkn. Dict. | Train. set | 1.78 | 1.79 | 1.57 | 2.12 |
| Known Dict. | Eval. set | 0.37 | 0.30 | 0.29 | 0.36 |
| Known Dict. | Val. set | 0.37 | 0.29 | 0.29 | 0.36 |
| Known Dict. | Train. set | 0.37 | 0.29 | 0.28 | 0.34 |

Column headers give the number of layers × the number of neurons.

Table 10.3 clearly shows a lower prediction error of ∼0.3 to ∼0.4 on the dictionaries seen during training in comparison to an error of ∼1.6 to ∼2.2 on the remaining unseen dictionary. This indicates that the DNN is learning to either recognize or memorize the dictionaries presented during training. This is contrary to the goal laid out in the beginning of this section, namely that the DNN should learn to approximate the algorithm for computing the Edit-distance while allowing exchangeable dictionaries. To this end, a second set of experiments was conducted. This time the words contained in the 9 training dictionaries were shuffled randomly before each optimization iteration. Each dictionary contained 100 strings, yielding $100! \approx 10^{158}$ different permutations per dictionary. Training and evaluation were repeated with these randomized dictionaries as the only change in comparison to the set of experiments detailed in Table 10.3.

Table 10.4 shows the results of the empirical evaluation using the shuffled dictionaries. This time the RMSE of the predicted Edit-distance is between ∼0.8 and ∼0.85 on the dictionaries seen during training and ∼0.85 on the unseen dictionary. Memorizing or recognizing the dictionaries presented during training was no longer possible. Similarly, the prediction errors on the training, validation and evaluation data sets are close together. In conclusion, this training paradigm allows both the dictionary and the query strings to be exchanged, while the DNN itself only learns to approximate the Edit-distance and does not recognize specific inputs presented to it.

Table 10.4: Evaluation as in Table 10.3 but with the order of strings shuffled randomly for each training iteration and evaluation step. Overfitting is reduced by shuffling the dictionaries randomly.

| Dictionary | Data set | 2 × 30 | 2 × 60 | 3 × 60 | 5 × 200 |
|---|---|---|---|---|---|
| Unkn. Dict. | Eval. set | 0.86 | 0.84 | 0.84 | 0.84 |
| Unkn. Dict. | Val. set | 0.86 | 0.84 | 0.84 | 0.84 |
| Unkn. Dict. | Train. set | 0.86 | 0.84 | 0.84 | 0.84 |
| Known Dict. | Eval. set | 0.85 | 0.82 | 0.82 | 0.81 |
| Known Dict. | Val. set | 0.85 | 0.82 | 0.82 | 0.81 |
| Known Dict. | Train. set | 0.85 | 0.82 | 0.82 | 0.81 |

Column headers give the number of layers × the number of neurons.

The achieved RMSE of ∼0.8 for predicting the Edit-distance between a query string and a dictionary of words is not small enough to retrieve the correct Edit-distance by rounding. However, this method shows that LSTM networks are capable of approximating the algorithm for computing the Edit-distance while keeping both the dictionary and query string exchangeable. This opens the door for further scientific research into applications of this type of DNN, for example for fuzzy search or decoding of CTC or MDCC predictions in offline handwriting recognition.

Chapter 11 Discussion and Conclusion

11.1 Achieved Goals

At the beginning of this thesis, Chapter 1 discussed its motivation and scientific contributions. Chapter 3 then detailed the works which can be directly compared to multi-dimensional connectionist classification as proposed in this thesis, followed by Chapter 4, which contains the problem statement for segmentation-free multi-line offline handwriting recognition. The target of this section is now to reflect on this and to discuss the goals which were set and achieved.

Chapters 5 and 6 proposed and detailed multi-dimensional connectionist classification (MDCC), which tackles the problem of segmentation-free multi-line offline handwriting recognition. The goal for this research was to identify a method which is able to apply deep neural networks to the transcription of handwritten multi-line paragraphs with a low error rate, but also without prior segmentation of the input and while providing only the truth texts as labeling during training of the neural network. This goal has been achieved by MDCC, as demonstrated in Chapter 7 by applying it to the IAM offline handwriting database[88] with empirical evaluation and comparison.
The empirical evaluation of MDCC on the IAM offline handwriting database showed competitive error rates in comparison to published methods and results.

The results of Chapter 7 require discussion in the context of the comparison to the related works of Chapter 3. The time frame in which the research into the methods of this thesis was conducted also saw the publication of other methods that address the same problem, most prominently the application of attention networks to multi-line offline handwriting recognition[8, 9, 135]. These have been discussed in Section 3.2 of this thesis. Table 7.6 shows the comparison of MDCC to these methods in terms of the character error rate achieved on the IAM offline handwriting database. This table shows a higher character error rate for MDCC than for the other methods. However, multi-dimensional connectionist classification does offer some benefits in comparison to both applying attention networks, see Section 3.2, to offline handwriting recognition and reshaping the prediction of a convolutional neural network, see Section 3.3:

• MDCC is designed and implemented as a training algorithm for deep neural networks, based on expectation-maximization and conditional random fields. MDCC thus sets only broad requirements on the DNN topology and could in theory even be used in combination with machine learning models other than artificial neural networks. The conditions on the machine learning model are that it produces the probabilistic soft-assignment required for MDCC and that it can be optimized in an expectation-maximization loop. This allows adapting MDCC to different use-case requirements such as memory or general hardware and runtime limitations. This property of MDCC also enables its potential application to future, yet unknown, machine learning models.
• MDCC is fast in transcribing paragraphs. This is achieved by moving a large part of the computational effort into the training method, not into inference during transcription. MDCC proposes a multi-line decoding algorithm, which is fast to execute and easy to parallelize.
• MDCC can handle a variety of cursive writing styles in handwritten text. It does not make assumptions about the shape or size of glyphs; these are learned by the deep neural network. It also does not assume text lines to be of equal height: text lines may have different heights, and individual text lines may vary internally in height. The label space for encoding multi-line text, discussed in Section 6.2, allows slants and angles of text lines of up to 45 degrees. The text lines in MDCC also do not have to be aligned to the left or right border of the image.

Chapter 8 proposed a visual analytics method for identifying and inspecting interesting examples during the training of a deep neural network with MDCC. The express goal of this is to allow a human expert user to identify potential error sources during the training process and address them. Specifically, this method targets the optimization of hyper-parameters in MDCC and the improvement of the available ground truth data. As such, this method is an important part of applying MDCC to specific tasks and data sets.

Chapter 9 finally proposed a method for combining line- and paragraph-wise transcription on a case-by-case basis. This is designed to utilize both methods in order to achieve a lower error rate. This goal is interesting from a purely scientific viewpoint, but combining methods is also a common approach in industry and thus supports the deployment of MDCC in such use-cases.

11.2 Ideas for Future Research

The achieved goals and direct outcomes of the research discussed in this doctoral thesis have been detailed in the preceding sections. This section will now offer a few ideas for research that builds upon the work of this thesis.
Please keep in mind that these suggestions will entail some speculation. However, this speculation is informed by the author's experience while conducting the research into multi-line offline handwriting recognition. It is also worth pointing out that the author of this thesis does not claim these proposed methods as his own. Instead, the following paragraphs are only ideas for continuing the research on MDCC and on the application of expectation-maximization to deep neural networks and conditional random fields where this thesis ends.

Generalized Writing Direction

This thesis discussed the structure of multi-line text in Section 6.2. The conditional random field (CRF) used for inference of the alignment between the truth text given in the labeled data and the pixel space of the image is built on these observations. One of the limitations on the structure of paragraphs that can be transcribed with multi-dimensional connectionist classification (MDCC) is that text lines need to be ordered from top to bottom and characters from left to right, and that text lines may be rotated by a maximum of 45 degrees. This limitation is also reflected in the proposed decoding algorithm of Chapter 5.

Switching the reading order from left-to-right to right-to-left is not a big change in MDCC. Generalizing MDCC in order to transcribe other writing directions, on the other hand, would likely be an interesting research topic. Figure 11.2.1 shows one such example where the text is ordered in a spiral, starting in the outer ring and progressing towards the center of the spiral.

Figure 11.2.1: Text typeset as a spiral. The reading direction starts at the outer end and proceeds towards the middle. Text generated with https://www.loremipsum.de/ on September 24th, 2021.
Transcribing generalized writing directions such as the example of Figure 11.2.1 seems to be possible with an improved version of MDCC since MDCC models the separation of two lines with an explicit symbol. The orientation and extent of individual text lines are thus not implicitly defined by geometry and assumptions about handwritten text, but by the prediction of the line separator glyph. Transcribing text ordered in a spiral would thus entail modeling the line separator glyph as a spiral in between adjacent loops of the same text line.

This concept could be generalized further by removing the knowledge about the writing order, e.g. that the text is written in a spiral, from the labels of the annotated training data. It may be possible to structure the CRF and decoder of multi-dimensional connectionist classification in such a way that they allow general structures of multi-line text. For example, one could introduce two additional tokens into MDCC that signal the start and end of a text line. The CRF could then model text lines in arbitrary shapes as long as no two text lines cross. Decoding would entail starting at a predicted line start and following the corridor between the predicted line separator glyphs, even if this corridor is shaped as e.g. a spiral or a wave.

Document Layout Analysis

The next step up from transcribing generalized writing orders would be to analyze and transcribe general document structures. Figure 11.2.2 is an example from the PubLayNet database[163], which shows a complex layout. Adapting multi-dimensional connectionist classification for the analysis of such documents would entail generalizing the geometric neighborhoods between entities.
These have been discussed in Section 6.2 of this thesis. In the case of this example, the neighborhoods would reflect that there are two columns, left and right of each other. The structure would also entail that there is a table above both of these columns. Hierarchical information also needs to be encoded since the two text columns and the table contain text, which in itself is a label space with relationships between characters. In a similar fashion, the right column contains two figures, which also represents a hierarchical relationship.

Encoding hierarchical structures in the label space of MDCC would not necessarily entail a hierarchical conditional random field. It could also be possible to encode this hierarchy by introducing even more symbols with special meaning into the alphabet of MDCC and thus the label space. In the case of Figure 11.2.2, these symbols could e.g. be ‘upper border of a table’, ‘right border of a table’ and so on for tables, figures and multi-column layouts. This way the labels within e.g. a table could be modeled independently of the outside label space, connected only by the indicated table border. Changes to the decoding algorithm proposed in Chapter 5 would be necessary to reflect these hierarchical relationships in the document and thus the label space.

Scene Text Recognition

Figure 11.2.3 shows an example from the Natural Environment OCR (NEOCR) data set[95, 96], which addresses the scene text recognition task. Scene text recognition is the task of locating and transcribing text within natural images, e.g. scenes of streets with their street signs and advertisements in shop windows.

One future research direction could be to apply multi-dimensional connectionist classification to this task by, as before, introducing some special symbols. In the case of the example in Figure 11.2.3, these symbols could be generic such as ‘sky’, ‘floor’, ‘pole’, ‘tree’.
It would then be possible to model this scene with its text embedded into the natural environment. The label space, see Section 6.2, of MDCC would in this case indicate a tree left of the text, sky to the top-right, a floor below and a pole to the right. The text itself could be modeled, as in this thesis, as multi-line text.

Object Segmentation with Incomplete Information

The last suggestion for further research into multi-dimensional connectionist classification is to apply it to object segmentation, but with incomplete information. The task of object segmentation is to take an image of a natural scene as input and produce a pixel-wise mask that assigns each pixel to an object within this scene image. Object segmentation thus allows both geometric localization and identification of objects. In other words, object segmentation is the task of separating individual objects within an image. Figure 11.2.4 shows a scene image of a smaller airport with a general aviation airplane. The objects in this case would be e.g. the airplane, landing strip, grassland, trees, buildings in the background and the sky.

In contrast to the scene text recognition task, the object segmentation task would require an ‘alphabet’ of very specific symbols, one for each object type within the data set. For example, the COCO data set[83] consists of 80 different object classes. The COCO data set is labeled with the ground-truth pixel-wise segmentation masks and object types and thus provides full information about the scene image. MDCC could be applied to object segmentation with incomplete information by reducing the labeled truth data to object types and their geometric relations to each other, but omitting the pixel-wise masks.

Figure 11.2.2: Example from the PubLayNet database[163]. This example shows a two-column page layout with a spanning table on top. The right column contains two separate figures.
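The reduction of fully labeled data to object types and their geometric relations, as suggested for object segmentation with incomplete information, can be illustrated with a small sketch. It is not part of MDCC as described in this thesis: the classes `Annotation` and `Relation`, the function `reduce_to_relations`, the relation names `left_of` and `above`, the use of bounding boxes in place of pixel-wise masks and all coordinates are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Annotation:
    """Fully labeled object: type plus a bounding box standing in for its mask."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


@dataclass(frozen=True)
class Relation:
    """Weak label: two object types and their coarse geometric relation."""
    first: str
    relation: str  # "left_of" or "above" (hypothetical vocabulary)
    second: str


def reduce_to_relations(annotations):
    """Drop all geometry and keep only object types and pairwise relations."""
    relations = set()
    for a, b in combinations(annotations, 2):
        # Image coordinates: x grows to the right, y grows downwards.
        if a.x_max <= b.x_min:
            relations.add(Relation(a.label, "left_of", b.label))
        elif b.x_max <= a.x_min:
            relations.add(Relation(b.label, "left_of", a.label))
        if a.y_max <= b.y_min:
            relations.add(Relation(a.label, "above", b.label))
        elif b.y_max <= a.y_min:
            relations.add(Relation(b.label, "above", a.label))
    return relations


# Toy scene loosely following the airport example; coordinates are invented.
scene = [
    Annotation("tree", 0, 20, 10, 60),
    Annotation("grassland", 10, 60, 100, 100),
    Annotation("sky", 0, 0, 100, 20),
]
weak_labels = reduce_to_relations(scene)
```

The resulting set contains only statements such as ‘tree left of grassland’ or ‘sky above tree’. All coordinates are discarded, which corresponds to the incomplete information from which a conditional random field would have to infer the pixel-wise masks during training.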
Figure 11.2.3: Scene text image contained in the Natural Environment OCR (NEOCR) data set[95, 96], provided by https://www.cs6.tf.fau.de/neocr.

Figure 11.2.4: Scene image captured at a smaller airport and showing a general aviation airplane. Image from https://commons.wikimedia.org/wiki/File:Glenrothes_Airport.JPG and released into the public domain by author Michael Westwater.

The geometric relationships in Figure 11.2.4 would be the following:
• A tree on the left border.
• A grassland to the right of the tree.
• A small airplane surrounded by the grassland on the top, left and bottom.
• A landing strip surrounded by the grassland on the top, left and bottom.
• Buildings to the right of the tree and on top of the grassland.
• Sky on top of the tree, buildings and grassland, spanning up to the image border.
• And so on...

The label space of MDCC would then reflect these geometric relationships between objects, enabling the conditional random field to infer the pixel-wise mask from the current DNN estimate and the given label space. The DNN trained in this fashion would thus learn a pixel-wise object segmentation mask for each example by optimization on a large data set with only incomplete segmentation information provided.

11.3 Discussion

This section represents the conclusion of this thesis and as such is the place for some final thoughts. The following paragraphs are thus the personal opinion of the author.

Chapters 7 and 9 showed that paragraph-wise transcription becomes increasingly preferable to line-wise transcription of handwritten text as the text lines become harder to segment. This has been shown with experiments in which the offset between adjacent text lines has been artificially reduced. Multi-dimensional connectionist classification and attention-network based methods[8] can be applied for paragraph-wise transcription. Both are robust methods for transcription of overlapping text lines.
MDCC is faster than the compared attention-network method, which can be important in time-sensitive applications. MDCC also shows general properties that allow its application to machine learning models other than deep neural networks. Section 11.2 discusses some ideas on how to generalize MDCC for more complex document layouts than single paragraphs.

One observation to point out is that at the beginning of this research, the state of the art in offline handwriting recognition was to apply connectionist temporal classification (CTC)[46] in combination with deep neural networks. CTC is still, with good reason, part of the state of the art in this research field. However, multiple novel methods for paragraph-wise offline handwriting recognition emerged independently of each other while MDCC was being developed. This seems to indicate some sort of ‘convergent evolution’ in document analysis and that there is indeed interest in and need for multi-line offline handwriting recognition. Multi-dimensional connectionist classification fits into this.

Another point is an observation on the papers published at recent document analysis conferences, especially ICDAR 2021. The number of published papers on different topics seems to signal an increasing interest in extracting information from complex documents, e.g. invoices, with little or no prior explicit layout analysis. On the other hand, there are still numerous publications on historical document analysis. I think this makes sense since the number of contemporary, modern handwritten documents is likely decreasing. This trend does indicate that further research into multi-dimensional connectionist classification for document layout analysis, see the ideas of Section 11.2, may be useful and fruitful. MDCC could potentially be applied to both complex document layouts and historical documents.
Manually debugging and optimizing the MDCC training method and deep neural network hyper-parameters also made it obvious that some sort of visual analytics for identifying error sources in MDCC is necessary. This was discussed in Chapter 8 in the context of this thesis. I do hope that this trend extends to more artificial intelligence methods in what is called explainable AI. I think involving human experts in the general evaluation of artificial intelligence systems and in the identification of their error sources is a component for a widespread use of AI while maintaining the trust of users and others who are affected by these methods.

These observations conclude this thesis, hopefully with a positive outlook on the research fields touched by this work and with the contributions of this thesis being added to them.

Bibliography

[1] Martin Abadi et al. “TensorFlow: A system for large-scale machine learning.” In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). 2016, pp. 265–283.
[2] Stefan Arnborg, Derek G. Corneil, and Andrzej Proskurowski. “Complexity of finding embeddings in a k-tree.” In: SIAM Journal on Algebraic Discrete Methods 8.2 (1987), pp. 277–284.
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation by Jointly Learning to Align and Translate.” In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015.
[4] Jérémy Barbay and Claire Kenyon. “Adaptive Intersection and t-Threshold Problems.” In: Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’02. San Francisco, California: Society for Industrial and Applied Mathematics, 2002, pp. 390–399. ISBN: 089871513X.
[5] Yoshua Bengio. Learning deep architectures for AI. Now Publishers Inc, 2009.
[6] Yoshua Bengio.
“Practical recommendations for gradient-based training of deep architectures.” In: Neural networks: Tricks of the trade. Springer, 2012, pp. 437–478.
[7] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. ISBN: 978-0-387-31073-2.
[8] Théodore Bluche. “Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition.” In: Advances in Neural Information Processing Systems. 2016, pp. 838–846.
[9] Théodore Bluche, Jérôme Louradour, and Ronaldo Messina. “Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention.” In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE. 2017, pp. 1050–1055.
[10] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. “A training algorithm for optimal margin classifiers.” In: Proceedings of the fifth annual workshop on Computational learning theory. 1992, pp. 144–152.
[11] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. “Optimization methods for large-scale machine learning.” In: Siam Review 60.2 (2018), pp. 223–311.
[12] Thomas M. Breuel. “High performance text recognition using a hybrid convolutional-lstm implementation.” In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE. 2017, pp. 11–16.
[13] John S. Bridle. “Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition.” In: Neurocomputing. Springer, 1990, pp. 227–236.
[14] Rich Caruana, Steve Lawrence, and C. Lee Giles. “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping.” In: Advances in neural information processing systems. 2001, pp. 402–408.
[15] Venkat Chandrasekaran, Nathan Srebro, and Prahladh Harsha. “Complexity of Inference in Graphical Models.” In: CoRR abs/1206.3240 (2012). arXiv: 1206.3240.
[16] Darren M. Chitty.
“A data parallel approach to genetic programming using programmable graphics hardware.” In: Proceedings of the 9th annual conference on Genetic and evolutionary computation. 2007, pp. 1566–1573.
[17] Jaegul Choo and Shixia Liu. “Visual analytics for explainable deep learning.” In: IEEE computer graphics and applications 38.4 (2018), pp. 84–92.
[18] Anna Choromanska et al. “The loss surfaces of multilayer networks.” In: Artificial intelligence and statistics. 2015, pp. 192–204.
[19] Barry A. Cipra. “An introduction to the Ising model.” In: The American Mathematical Monthly 94.10 (1987), pp. 937–959.
[20] Denis Coquenet, Clément Chatelain, and Thierry Paquet. “End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network.” In: CoRR abs/2012.03868 (2020). arXiv: 2012.03868.
[21] Denis Coquenet, Clément Chatelain, and Thierry Paquet. “SPAN: A Simple Predict & Align Network for Handwritten Paragraph Recognition.” In: Document Analysis and Recognition – ICDAR 2021. Ed. by Josep Lladós, Daniel Lopresti, and Seiichi Uchida. Cham: Springer International Publishing, 2021, pp. 70–84. ISBN: 978-3-030-86334-0.
[22] Thomas H. Cormen et al. Introduction to algorithms. 2009.
[23] Paul Dagum and Michael Luby. “Approximating probabilistic inference in Bayesian belief networks is NP-hard.” In: Artificial intelligence 60.1 (1993), pp. 141–153.
[24] Sanjoy Dasgupta. “Learning Polytrees.” In: CoRR abs/1301.6688 (2013). arXiv: 1301.6688.
[25] Yann N. Dauphin et al. “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization.” In: Advances in neural information processing systems. 2014, pp. 2933–2941.
[26] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. “Maximum likelihood from incomplete data via the EM algorithm.” In: Journal of the Royal Statistical Society: Series B (Methodological) 39.1 (1977), pp. 1–22.
[27] Jia Deng et al.
“Imagenet: A large-scale hierarchical image database.” In: 2009 IEEE conference on computer vision and pattern recognition. IEEE. 2009, pp. 248–255.
[28] Patrick Doetsch, Michal Kozielski, and Hermann Ney. “Fast and robust training of recurrent neural networks for offline handwriting recognition.” In: 2014 14th International Conference on Frontiers in Handwriting Recognition. IEEE. 2014, pp. 279–284.
[29] Patrick Doetsch et al. “RETURNN: The RWTH extensible training framework for universal recurrent neural networks.” In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2017, pp. 5345–5349.
[30] Harris Drucker et al. “Support Vector Regression Machines.” In: Advances in Neural Information Processing Systems 9, NIPS, Denver, CO, USA, December 2-5, 1996. Ed. by Michael Mozer, Michael I. Jordan, and Thomas Petsche. MIT Press, 1996, pp. 155–161.
[31] B. W. A. C. Farley and W. Clark. “Simulation of self-organizing systems by digital computer.” In: Transactions of the IRE Professional Group on Information Theory 4.4 (1954), pp. 76–84.
[32] Ronald Aylmer Fisher. “Statistical methods for research workers.” In: Breakthroughs in statistics. Springer, 1992, pp. 66–70.
[33] G. David Forney. “The viterbi algorithm.” In: Proceedings of the IEEE 61.3 (1973), pp. 268–278.
[34] Brendan J. Frey and David J. C. MacKay. “A revolution: Belief propagation in graphs with cycles.” In: Advances in neural information processing systems. 1998, pp. 479–485.
[35] Kunihiko Fukushima. “Neural network model for selective attention in visual pattern recognition and associative recall.” In: Applied Optics 26.23 (1987), pp. 4985–4992.
[36] Kunihiko Fukushima. “Neocognitron: A hierarchical neural network capable of visual pattern recognition.” In: Neural networks 1.2 (1988), pp. 119–130.
[37] James Fung and Steve Mann.
“Computer vision signal processing on graphics processing units.” In: 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. Vol. 5. IEEE. 2004, pp. V–93.
[38] Yarin Gal and Zoubin Ghahramani. “A theoretically grounded application of dropout in recurrent neural networks.” In: Advances in neural information processing systems. 2016, pp. 1019–1027.
[39] Felix A. Gers, Jürgen Schmidhuber, and Fred A. Cummins. “Learning to Forget: Continual Prediction with LSTM.” In: Neural Comput. 12.10 (2000), pp. 2451–2471. DOI: 10.1162/089976600300015015.
[40] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber. “Learning precise timing with LSTM recurrent networks.” In: Journal of machine learning research 3.Aug (2002), pp. 115–143.
[41] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT press, 2016. ISBN: 978-0-262-03561-3.
[42] Ian J. Goodfellow and Oriol Vinyals. “Qualitatively characterizing neural network optimization problems.” In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015. URL: http://arxiv.org/abs/1412.6544.
[43] Alex Graves. Supervised sequence labelling with recurrent neural networks. Springer, 2012.
[44] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. “Multi-Dimensional Recurrent Neural Networks.” In: International Conference on Artificial Neural Networks. Springer. 2007, pp. 549–558.
[45] Alex Graves and Jürgen Schmidhuber. “Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks.” In: Advances in Neural Information Processing Systems. 2009, pp. 545–552.
[46] Alex Graves et al. “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.” In: Proceedings of the 23rd International Conference on Machine learning. ACM. 2006, pp. 369–376.
[47] Alex Graves et al.
“A Novel Connectionist System for Unconstrained Handwriting Recognition.” In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31.5 (2008), pp. 855–868.
[48] Klaus Greff et al. “LSTM: A Search Space Odyssey.” In: IEEE transactions on neural networks and learning systems 28.10 (2016), pp. 2222–2232.
[49] Isabelle Guyon, B. Boser, and Vladimir Vapnik. “Automatic capacity tuning of very large VC-dimension classifiers.” In: Advances in neural information processing systems. 1993, pp. 147–155.
[50] John M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices (unpublished). 1971.
[51] Kaiming He et al. “Deep residual learning for image recognition.” In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.
[52] Donald Olding Hebb. The organization of behavior: a neuropsychological theory. J. Wiley; Chapman & Hall, 1949.
[53] Sepp Hochreiter. “Untersuchungen zu dynamischen neuronalen Netzen.” In: Diploma, Technische Universität München 91.1 (1991).
[54] Sepp Hochreiter. “Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.” In: A Field Guide to Dynamical Recurrent Neural Networks (2001), pp. 237–244.
[55] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory.” In: Neural computation 9.8 (1997), pp. 1735–1780.
[56] Arthur E. Hoerl and Robert W. Kennard. “Ridge regression: Biased estimation for nonorthogonal problems.” In: Technometrics 12.1 (1970), pp. 55–67.
[57] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. “Kernel methods in machine learning.” In: The annals of statistics 36.3 (2008), pp. 1171–1220.
[58] Fred Hohman et al. “Visual analytics in deep learning: An interrogative survey for the next frontiers.” In: IEEE transactions on visualization and computer graphics 25.8 (2018), pp. 2674–2693.
[59] Roger A. Horn and Charles R. Johnson. Matrix analysis. Cambridge university press, 2012.
[60] Sergey Ioffe and Christian Szegedy.
“Batch normalization: Accelerating deep network training by reducing internal covariate shift.” In: International conference on machine learning. PMLR. 2015, pp. 448–456.
[61] Ernst Ising. “Beitrag zur Theorie des Ferromagnetismus.” In: Zeitschrift für Physik 31.1 (1925), pp. 253–258.
[62] Stig Johansson, Geoffrey N. Leech, and Helen Goodluck. Lancaster/Oslo-Bergen Corpus Manual. 1978. URL: http://korpus.uib.no/icame/manuals/LOB/INDEX.HTM.
[63] Melvin Johnson et al. “Google’s multilingual neural machine translation system: Enabling zero-shot translation.” In: Transactions of the Association for Computational Linguistics 5 (2017), pp. 339–351.
[64] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. “Grid Long Short-Term Memory.” In: 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2016.
[65] Jack Kiefer, Jacob Wolfowitz, et al. “Stochastic estimation of the maximum of a regression function.” In: The Annals of Mathematical Statistics 23.3 (1952), pp. 462–466.
[66] Ross Kindermann and J. Laurie Snell. Markov Random Fields and Their Applications. Vol. 1. American Mathematical Society, 1980. ISBN: 978-0-8218-5001-5. DOI: 10.1090/conm/001.
[67] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization.” In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015.
[68] Stephen Cole Kleene. Representation of events in nerve nets and finite automata. Tech. rep. RAND PROJECT AIR FORCE SANTA MONICA CA, 1951.
[69] Donald E. Knuth. The Art of Computer Programming, Vol. 1: Fundamental Algorithms. Third. Reading, Mass.: Addison-Wesley, 1997. ISBN: 0201896834 9780201896831.
[70] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT press, 2009.
[71] Michał Kozielski, Patrick Doetsch, Hermann Ney, et al. “Improvements in RWTH’s system for offline handwriting recognition.” In: 2013 12th International Conference on Document Analysis and Recognition. IEEE. 2013, pp. 935–939.
[72] Robert Krauthgamer et al. “Greedy list intersection.” In: 2008 IEEE 24th International Conference on Data Engineering. IEEE. 2008, pp. 1033–1042.
[73] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” In: Advances in neural information processing systems. 2012, pp. 1097–1105.
[74] Anders Krogh and John A. Hertz. “A simple weight decay can improve generalization.” In: Advances in neural information processing systems. 1992, pp. 950–957.
[75] Frank R. Kschischang, Brendan J. Frey, and H.-A. Loeliger. “Factor graphs and the sum-product algorithm.” In: IEEE Transactions on information theory 47.2 (2001), pp. 498–519.
[76] Solomon Kullback and Richard A. Leibler. “On information and sufficiency.” In: The annals of mathematical statistics 22.1 (1951), pp. 79–86.
[77] John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. “Conditional random fields: Probabilistic models for segmenting and labeling sequence data.” In: (2001).
[78] Yann LeCun, Yoshua Bengio, et al. “Convolutional networks for images, speech, and time series.” In: The handbook of brain theory and neural networks 3361.10 (1995), p. 1995.
[79] Yann LeCun et al. “Backpropagation applied to handwritten zip code recognition.” In: Neural computation 1.4 (1989), pp. 541–551.
[80] Vasilica Lepar and Prakash P. Shenoy. “A Comparison of Lauritzen-Spiegelhalter, Hugin, and Shenoy-Shafer Architectures for Computing Marginals of Probability Distributions.” In: CoRR abs/1301.7394 (2013). arXiv: 1301.7394.
[81] Vladimir I. Levenshtein. “Binary codes capable of correcting deletions, insertions, and reversals.” In: Soviet physics doklady. Vol. 10. 8. 1966, pp. 707–710.
[82] Stan Z. Li.
Markov Random Field Modeling in Image Analysis. Advances in Pattern Recognition. Springer, 2009. ISBN: 978-1-84800-278-4. DOI: 10.1007/978-1-84800-279-1.
[83] Tsung-Yi Lin et al. “Microsoft COCO: Common objects in context.” In: European conference on computer vision. Springer. 2014, pp. 740–755.
[84] Ilya Loshchilov and Frank Hutter. “Fixing Weight Decay Regularization in Adam.” In: CoRR abs/1711.05101 (2017). arXiv: 1711.05101.
[85] Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. “Rectifier nonlinearities improve neural network acoustic models.” In: Proc. icml. Vol. 30. 1. Citeseer. 2013, p. 3.
[86] Anders L. Madsen et al. “The Hugin tool for probabilistic graphical models.” In: International Journal on Artificial Intelligence Tools 14.03 (2005), pp. 507–543.
[87] Christopher Manning and Hinrich Schütze. Foundations of statistical natural language processing. MIT press, 1999.
[88] U.-V. Marti and Horst Bunke. “The IAM-database: an English sentence database for offline handwriting recognition.” In: International Journal on Document Analysis and Recognition 5.1 (2002), pp. 39–46.
[89] Warren S. McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in nervous activity.” In: The bulletin of mathematical biophysics 5.4 (1943), pp. 115–133.
[90] Geoffrey J. McLachlan and Kaye E. Basford. Mixture models: Inference and applications to clustering. Vol. 38. M. Dekker New York, 1988.
[91] Jean-Baptiste Michel et al. “Quantitative analysis of culture using millions of digitized books.” In: science 331.6014 (2011), pp. 176–182.
[92] Marvin Minsky and Seymour Papert. Perceptrons - an introduction to computational geometry. MIT Press, 1987. ISBN: 978-0-262-63111-2.
[93] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT press, 2012. ISBN: 978-0-262-01802-9.
[94] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan.
“Loopy Belief Propagation for Approximate Inference: An Empirical Study.” In: UAI ’99: Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Stockholm, Sweden, July 30 - August 1, 1999. Ed. by Kathryn B. Laskey and Henri Prade. Morgan Kaufmann, 1999, pp. 467–475.
[95] Robert Nagy, Anders Dicker, and Klaus Meyer-Wegener. “Definition and Evaluation of the NEOCR Dataset for Natural-Image Text Recognition.” In: (2011).
[96] Robert Nagy, Anders Dicker, and Klaus Meyer-Wegener. “NEOCR: A configurable dataset for natural image text recognition.” In: International Workshop on Camera-Based Document Analysis and Recognition. Springer. 2011, pp. 150–163.
[97] Frank Nielsen. “A family of statistical symmetric divergences based on Jensen’s inequality.” In: CoRR abs/1009.4004 (2010). arXiv: 1009.4004.
[98] Nobuyuki Otsu. “A threshold selection method from gray-level histograms.” In: IEEE transactions on systems, man, and cybernetics 9.1 (1979), pp. 62–66.
[99] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. “On the difficulty of training recurrent neural networks.” In: International conference on machine learning. 2013, pp. 1310–1318.
[100] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[101] Jorge Pérez, Javier Marinkovic, and Pablo Barceló. “On the Turing Completeness of Modern Neural Network Architectures.” In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[102] Vu Pham et al. “Dropout improves recurrent neural networks for handwriting recognition.” In: 2014 14th international conference on frontiers in handwriting recognition. IEEE. 2014, pp. 285–290.
[103] Lutz Prechelt. “Early stopping-but when?” In: Neural Networks: Tricks of the trade. Springer, 1998, pp. 55–69.
[104] Joan Puigcerver.
“Are multidimensional recurrent layers really necessary for handwritten text recognition?” In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE. 2017, pp. 67–72.
[105] Lawrence R. Rabiner. “A tutorial on hidden Markov models and selected applications in speech recognition.” In: Proceedings of the IEEE 77.2 (1989), pp. 257–286.
[106] D. Raj Reddy et al. “Speech understanding systems: A summary of results of the five-year research effort.” In: Department of Computer Science. Carnegie-Mellon University, Pittsburgh, PA 17 (1977), p. 138.
[107] Herbert Robbins and Sutton Monro. “A stochastic approximation method.” In: The annals of mathematical statistics (1951), pp. 400–407.
[108] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. “U-net: Convolutional networks for biomedical image segmentation.” In: International Conference on Medical image computing and computer-assisted intervention. Springer. 2015, pp. 234–241.
[109] Frank Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957.
[110] Frank Rosenblatt. “The perceptron: a probabilistic model for information storage and organization in the brain.” In: Psychological review 65.6 (1958), p. 386.
[111] Dan Roth. “On the hardness of approximate reasoning.” In: Artificial Intelligence 82.1-2 (1996), pp. 273–302.
[112] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. “Learning representations by back-propagating errors.” In: nature 323.6088 (1986), pp. 533–536.
[113] David E. Rumelhart et al. “Backpropagation: The basic theory.” In: Backpropagation: Theory, architectures and applications (1995), pp. 1–34.
[114] Dominik Sacha et al. “Vis4ML: An ontology for visual analytics assisted machine learning.” In: IEEE transactions on visualization and computer graphics 25.1 (2018), pp. 385–395.
[115] Fadil Santosa and William W. Symes.
“Linear inversion of band-limited reflec- tion seismograms.” In: SIAM Journal on Scientific and Statistical Computing 7.4 (1986), pp. 1307–1330. [116] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks.” In: 2nd Interna- tional Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2014. URL: http://arxiv.org/abs/1312.6120. [117] Kenneth M. Sayre. “Machine Recognition of Handwritten Words: A Project Re- port.” In: Pattern Recognition 5.3 (1973), pp. 213–228. [118] Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Improving gradient-based LSTM training for offline handwriting recognition by careful se- lection of the optimization method.” In: BW-CAR| SINCOM (2016), p. 11. [119] Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Increasing Ro- bustness of Handwriting Recognition Using Character N-Gram Decoding on Large Lexica.” In: 2016 12th IAPR Workshop on Document Analysis Systems (DAS). Apr. 2016, pp. 156–161. DOI: 10.1109/DAS.2016.43. [120] Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Multi-Dimensional Connectionist Classification: Reading Text in One Step.” In: 2018 13th IAPR In- ternational Workshop on Document Analysis Systems (DAS). Apr. 2018, pp. 405– 410. DOI: 10.1109/DAS.2018.36. [121] Martin Schall, Marc-Peter Schambach, and Matthias O. Franz. “Dissecting Multi- Line Handwriting for Multi-Dimensional Connectionist Classification.” In: 2019 15th IAPR International Conference on Document Analysis and Recognition (ICDAR). Sept. 2019. DOI: 10.1109/ICDAR.2019.00015. [122] Martin Schall et al. “LSTM Networks for Edit Distance Calculation with Exchange- able Dictionaries.” In: 2018 13th IAPR International Workshop on Document Anal- ysis Systems (DAS). Apr. 2018. [123] Martin Schall et al. 
“Visualization-Assisted Development of Deep Learning Mod- els in Offline Handwriting Recognition.” In: Symposium on Visualization in Data Science (VDS) at IEEE VIS 2018. Oct. 2018. [124] Marc-Peter Schambach and Sheikh Faisal Rashid. “Stabilize Sequence Learning with Recurrent Neural Networks by Forced Alignment.” In: 2013 12th International Conference on Document Analysis and Recognition. IEEE. 2013, pp. 1270–1274. [125] Marc-Peter Schambach, Stephan von der Nüll, and Martin Schall. “Fast and Reli- able Acquisition of Truth Data for Document Analysis using Cyclic Suggest Algo- rithms.” In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). Vol. 2. Sept. 2019, pp. 7–12. DOI: 10.1109/ICDARW.2019. 10030. [126] Jürgen Schmidhuber. “Deep learning in neural networks: An overview.” In: Neural networks 61 (2015), pp. 85–117. [127] Bernhard Schölkopf, Alexander J. Smola, and Francis Bach. Learning with ker- nels: support vector machines, regularization, optimization, and beyond. the MIT Press, 2018. [128] Mike Schuster and Kuldip K. Paliwal. “Bidirectional Recurrent Neural Networks.” In: IEEE Transactions on Signal Processing 45.11 (1997), pp. 2673– 2681. 226 [129] Rita Sevastjanova et al. “Going beyond visualization: Verbalization as complemen- tary medium to explain machine learning models.” In: Workshop on Visualization for AI Explainability at IEEE VIS. 2018. [130] Mehmet Sezgin and Bülent Sankur. “Survey over image thresholding techniques and quantitative performance evaluation.” In: Journal of Electronic imaging 13.1 (2004), pp. 146–165. [131] Prakash P. Shenoy. “Binary join trees for computing marginals in the Shenoy- Shafer architecture.” In: International Journal of approximate reasoning 17.2-3 (1997), pp. 239–263. [132] David Silver et al. “Mastering the game of Go with deep neural networks and tree search.” In: nature 529.7587 (2016), pp. 484–489. [133] David Silver et al. 
“Mastering the game of go without human knowledge.” In: nature 550.7676 (2017), pp. 354–359. [134] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. “Deep Inside Convo- lutional Networks: Visualising Image Classification Models and Saliency Maps.” In: 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings. Ed. by Yoshua Ben- gio and Yann LeCun. 2014. [135] Sumeet S. Singh and Sergey Karayev. “Full Page Handwriting Recognition via Image to Sequence Extraction.” In: Document Analysis and Recognition – ICDAR 2021. Ed. by Josep Lladós, Daniel Lopresti, and Seiichi Uchida. Cham: Springer International Publishing, 2021, pp. 55–69. ISBN: 978-3-030-86334-0. [136] Thilo Spinner et al. “explAIner: A visual analytics framework for interactive and ex- plainable machine learning.” In: IEEE transactions on visualization and computer graphics 26.1 (2019), pp. 1064–1074. [137] Nitish Srivastava et al. “Dropout: a simple way to prevent neural networks from overfitting.” In: The journal of machine learning research 15.1 (2014), pp. 1929– 1958. [138] Richard P. Stanley. “Enumerative Combinatorics Volume 1 second edition.” In: Cambridge studies in advanced mathematics (2011). [139] Erik B. Sudderth and William T. Freeman. “Signal and image processing with belief propagation.” In: IEEE Signal Processing Magazine 25.2 (2008), pp. 114–141. [140] Olarik Surinta et al. “A* path planning for line segmentation of handwritten docu- ments.” In: 2014 14th International Conference on Frontiers in Handwriting Recog- nition. IEEE. 2014, pp. 175–180. [141] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. “Sequence to sequence learning with neural networks.” In: Advances in neural information processing systems. 2014, pp. 3104–3112. [142] Richard S. Sutton, Andrew G. Barto, et al. Introduction to reinforcement learning. Vol. 135. MIT press Cambridge, 1998. [143] Christian Szegedy et al. 
“Scalable, High-Quality Object Detection.” In: CoRR abs/1412.1441 (2014). arXiv: 1412.1441. [144] Christian Szegedy et al. “Going deeper with convolutions.” In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9. [145] Robert Tibshirani. “Regression shrinkage and selection via the lasso.” In: Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996), pp. 267– 288. 227 [146] Dimitris Tsirogiannis, Sudipto Guha, and Nick Koudas. “Improving the perfor- mance of list intersection.” In: Proceedings of the VLDB Endowment 2.1 (2009), pp. 838–849. [147] Vladimir Vapnik, Steven E. Golowich, Alex Smola, et al. “Support vector method for function approximation, regression estimation, and signal processing.” In: Ad- vances in neural information processing systems (1997), pp. 281–287. [148] Jean-Philippe Vert, Koji Tsuda, and Bernhard Schölkopf. “A primer on kernel methods.” In: Kernel methods in computational biology 47 (2004), pp. 35–70. [149] Andrew J. Viterbi. “A personal history of the Viterbi algorithm.” In: IEEE Signal Processing Magazine 23.4 (2006), pp. 120–142. [150] Paul Voigtlaender, Patrick Doetsch, and Hermann Ney. “Handwriting recognition with large multidimensional long short-term memory recurrent neural networks.” In: 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE. 2016, pp. 228–233. [151] Robert A. Wagner and Michael J. Fischer. “The string-to-string correction prob- lem.” In: Journal of the ACM (JACM) 21.1 (1974), pp. 168–173. [152] Paul J. Werbos. “Backpropagation through time: what it does and how to do it.” In: Proceedings of the IEEE 78.10 (1990), pp. 1550–1560. [153] Tobias Weyand, Ilya Kostrikov, and James Philbin. “Planet-photo geolocation with convolutional neural networks.” In: European Conference on Computer Vision. Springer. 2016, pp. 37–55. [154] Alfred Whitehead and Bertrand Russell. Principia mathematica. Cambridge, 1910. [155] D. 
Randall Wilson and Tony R. Martinez. “The general inefficiency of batch training for gradient descent learning.” In: Neural networks 16.10 (2003), pp. 1429–1451. [156] Yi-Chao Wu et al. “Handwritten chinese text recognition using separable multi- dimensional recurrent neural network.” In: 2017 14th IAPR International Confer- ence on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE. 2017, pp. 79– 84. [157] Yonghui Wu et al. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.” In: CoRR abs/1609.08144 (2016). arXiv: 1609.08144. [158] Kelvin Xu et al. “Show, attend and tell: Neural image caption generation with visual attention.” In: International conference on machine learning. 2015, pp. 2048–2057. [159] Jui-Cheng Yen, Fu-Juay Chang, and Shyang Chang. “A new criterion for automatic multilevel thresholding.” In: IEEE Transactions on Image Processing 4.3 (1995), pp. 370–378. [160] Wojciech Zaremba and Ilya Sutskever. “Learning to Execute.” In: CoRR abs/1410.4615 (2014). arXiv: 1410.4615. [161] Matthew D. Zeiler and Rob Fergus. “Visualizing and understanding convolutional networks.” In: European conference on computer vision. Springer. 2014, pp. 818– 833. [162] Nevin L. Zhang and David Poole. “A simple approach to Bayesian network com- putations.” In: Proc. of the Tenth Canadian Conference on Artificial Intelligence. 1994. [163] Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. “Publaynet: largest dataset ever for document layout analysis.” In: 2019 International Conference on Docu- ment Analysis and Recognition (ICDAR). IEEE. 2019, pp. 1015–1022. 228