Combining Bio- and Cheminformatics for Small Data Sets – Microcystins as a Use Case

Jaeger-Honz, Sabrina

Files

Jaeger-Honz_2-varvm4xx3o0n3.pdf (size: 92.32 MB, downloads: 48)

Date

2023

Authors

Jaeger-Honz, Sabrina

URI (citable link)

http://nbn-resolving.de/urn:nbn:de:bsz:352-2-varvm4xx3o0n3

License link

In copyright

Open access publication

Open Access Green

Collections

Computer and Information Science: Publications

Publication type

Dissertation

Publication status

Published

Abstract

In the research fields of bio- and cheminformatics, small data sets are a particular challenge. Due to the limited number of data points available, it is often difficult to draw conclusions or to develop computational models and methods to predict endpoints of interest, e.g. activity or toxicity. Methods such as molecular docking, machine learning, or molecular dynamics (MD) simulation can be used to predict or analyse potential activity or toxicity of small data sets, and help to accelerate and understand mechanisms of bioactivity. The interest in the activity of molecules and their targets often involves understanding biomolecular processes, e.g. protein-ligand binding. For investigations of protein-ligand binding, which is the major interest of this thesis, the research fields of bio- and cheminformatics naturally overlap. Macrocycles are a class of molecules that have twelve or more atoms in their central ring structure. They have been highly neglected in the past, with most research focusing on small molecules or acyclic structures, because macrocycles are expensive and time-consuming to synthesise and test experimentally. Macrocycles were not considered useful for drug discovery, due to the chemical space associated with them. For these reasons, often only small data sets of macrocyclic structures are available. Today, macrocycles are actively researched in cheminformatics, focusing on the study of their structure, conformation, and physicochemical properties in the context of toxicology and drug discovery. This thesis presents interconnected approaches aiming to overcome the difficulties related to studying toxicity on small data sets, particularly in the context of macrocycles. Microcystin (MC) congener toxicity is utilised as a use case. MC congeners are a class of structurally similar macrocycles, of which only a few have been tested and analysed so far.
In the course of this thesis, two prediction models for the toxicity/inhibition of MC congeners on the ser/thr-protein phosphatases (PPPs) PPP1, PPP2A and PPP5 are developed. The first model built is a machine learning model used to classify the inhibition capacity of MC congeners towards PPPs into three toxicity classes (toxic, less toxic, non-toxic). Feature representations from natural language processing are used to represent MC congeners and PPPs as numerical vectors. The data set is imbalanced towards the toxic class, since most MC congeners are highly toxic. To increase the number of data points for the non-toxic class, synthetic minority oversampling is applied. This results in a consensus model with 80-90% correct predictions. Nevertheless, this machine learning model is a black box and explainability is preferred in risk assessment for estimating toxicity. For this reason, the second model built is based on a mathematical optimisation method, the so-called (α, β)-k-Feature Set-Problem, to reduce the dimensionality of the data set based on inter-class differences and intra-class similarities. This significantly reduces the number of features (i.e. dimensions in the extended connectivity fingerprint) while retaining meaningful features. Finally, Boolean rules are derived from the feature set to obtain signatures explaining the toxicity of MC congeners. Our findings are verified by experimental data available in the literature. The machine learning approaches are built on two-dimensional similarity approaches. Recent studies show that three-dimensional conformational similarity could explain similar bioactivity better than two-dimensional similarity. Therefore, MD simulations are run on different MC congeners with a wide range of PPP1 inhibition capacity to investigate the interactions and conformations of MC congeners.
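The feature-selection idea behind the second model can be illustrated with a small sketch. It is not the implementation used in the thesis. It assumes the commonly used formulation of the (α, β)-k-feature set problem: among k chosen features, every pair of samples from different classes must differ in at least α features, and every pair from the same class must agree in at least β features. The function name, the toy fingerprints and the class labels are hypothetical.

```python
from itertools import combinations


def is_alpha_beta_feature_set(samples, labels, features, alpha, beta):
    """Check whether the chosen feature indices satisfy both thresholds.

    Assumed formulation: pairs from different classes must differ in at
    least `alpha` chosen features; pairs from the same class must agree
    in at least `beta` chosen features.
    """
    for i, j in combinations(range(len(samples)), 2):
        agree = sum(samples[i][f] == samples[j][f] for f in features)
        differ = len(features) - agree
        if labels[i] != labels[j]:
            if differ < alpha:
                return False
        elif agree < beta:
            return False
    return True


# Hypothetical 5-bit binary fingerprints, two per class.
samples = [
    (1, 1, 0, 1, 0),  # toxic
    (1, 1, 0, 0, 1),  # toxic
    (0, 0, 1, 1, 0),  # non-toxic
    (0, 0, 1, 0, 1),  # non-toxic
]
labels = ["toxic", "toxic", "non-toxic", "non-toxic"]

# Bits 0-2 separate the classes and are constant within each class.
print(is_alpha_beta_feature_set(samples, labels, (0, 1, 2), alpha=2, beta=2))  # True
# Bits 3-4 vary within a class, so the intra-class threshold fails.
print(is_alpha_beta_feature_set(samples, labels, (3, 4), alpha=1, beta=1))  # False
```

A check like this only validates a candidate feature set; finding a small valid set is the optimisation problem itself and is not attempted here.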
Our results indicate that the toxic MC congener MC-LR has two conformational backbone clusters in solvent and that the other toxic MC congener MC-LF has a similar conformational structure. In contrast, the less toxic and non-toxic MC congeners differ from the toxic MC congeners in their backbone conformation and have less well-defined conformational clusters. A second MD simulation data set of 12 MC congeners and their conjugates shows slightly different conformations for the MC congeners previously simulated, and a greater variety of backbone conformations for the conjugate structures. Some non-toxic MC congeners have the same backbone conformation as toxic congeners, supporting the assumption that the Adda side chain is indeed an important factor in binding. For [Enantio-Adda5]-MC-LF, however, a backbone conformation completely different from that of MC-LF is observed, suggesting that modification of Adda can, but does not necessarily have to, alter the backbone conformation. In addition, the two MD simulation data sets are used to derive Interaction Fingerprints (IFPs). IFPs are numerical vectors that encode an interaction between two molecules, in our case MC congeners and PPP1, as present (1) or absent (0). A method is developed to automatically derive IFPs from MD simulation data for each time step. In contrast to the reported procedure, where the IFPs of all individual time steps are summarised into one so-called aggregated IFP, we develop a method to analyse IFPs in more detail. We consider IFP dynamics and changes, in order to analyse and compare individual changes in interactions between individual MC congeners over time. The development of IFP analysis is not limited to macrocycles, but is applicable to any system where interactions between two molecules are studied. The approach for aggregation and visualisation developed here is therefore a valuable contribution to the field and enables easier and faster analysis of interactions in MD simulations.
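The difference between an aggregated IFP and a per-time-step analysis can be sketched as follows. This is a minimal illustration, not the method developed in the thesis. It assumes that each frame's IFP is a tuple of 0/1 bits and that aggregation means the fraction of frames in which a bit is set; the abstract does not specify the aggregation rule. The function names and the toy trajectory are hypothetical.

```python
def aggregate_ifp(frames):
    """Collapse per-frame IFPs into one vector of occupancy fractions."""
    n_frames = len(frames)
    return [sum(bits) / n_frames for bits in zip(*frames)]


def ifp_changes(frames):
    """List (frame, bit, event) for each bit that flips between frames."""
    events = []
    for t in range(1, len(frames)):
        for bit, (prev, cur) in enumerate(zip(frames[t - 1], frames[t])):
            if prev != cur:
                events.append((t, bit, "formed" if cur else "lost"))
    return events


# Hypothetical trajectory: 4 time steps, 3 interaction bits each.
frames = [
    (1, 0, 1),
    (1, 1, 1),
    (0, 1, 1),
    (1, 1, 1),
]

print(aggregate_ifp(frames))  # [0.75, 0.75, 1.0]
print(ifp_changes(frames))    # [(1, 1, 'formed'), (2, 0, 'lost'), (3, 0, 'formed')]
```

In this toy case the first two bits have the same aggregated value, yet the change list shows that one interaction forms once and stays, while the other is briefly lost and restored. That is the kind of detail a single aggregated vector hides.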
In summary, this thesis presents and discusses interconnected approaches and methods that have been developed to deal with small data sets, and demonstrates their applicability to the use case of MC congeners. These approaches and methods helped to derive biologically meaningful insights by computational means and are applicable to a wide range of other use cases in the area of bio- or cheminformatics.

Subject (DDC)

004 Computer science

Keywords

Bioinformatics, Cheminformatics, Molecular Dynamics Simulations, Machine Learning, Interaction Fingerprints, Phosphatases, Microcystins

Cite

ISO 690

JAEGER-HONZ, Sabrina, 2023. Combining Bio- and Cheminformatics for Small Data Sets – Microcystins as a Use Case [Dissertation]. Konstanz: Universität Konstanz

BibTeX

@phdthesis{JaegerHonz2023Combi-72985,
  title={Combining Bio- and Cheminformatics for Small Data Sets – Microcystins as a Use Case},
  year={2023},
  author={Jaeger-Honz, Sabrina},
  address={Konstanz},
  school={Universität Konstanz}
}

RDF

<rdf:RDF
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:bibo="http://purl.org/ontology/bibo/"
xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:void="http://rdfs.org/ns/void#"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#" >
<rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/72985">
<void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
<dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/72985/4/Jaeger-Honz_2-varvm4xx3o0n3.pdf"/>
<foaf:homepage rdf:resource="http://localhost:8080/"/>
<dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-09T08:28:35Z</dc:date>
<dc:contributor>Jaeger-Honz, Sabrina</dc:contributor>
<bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/72985"/>
<dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
<dc:language>eng</dc:language>
<dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
<dcterms:title>Combining Bio- and Cheminformatics for Small Data Sets – Microcystins as a Use Case</dcterms:title>
<dcterms:issued>2023</dcterms:issued>
<dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-09T08:28:35Z</dcterms:available>
<dspace:hasBitstream rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/72985/4/Jaeger-Honz_2-varvm4xx3o0n3.pdf"/>
<dcterms:rights rdf:resource="https://rightsstatements.org/page/InC/1.0/"/>
<dc:creator>Jaeger-Honz, Sabrina</dc:creator>
<dc:rights>terms-of-use</dc:rights>
</rdf:Description>
</rdf:RDF>

Date of dissertation examination

January 18, 2024

Thesis note

Konstanz, Univ., Diss., 2024

University bibliography

Yes
