Publication: Combining Bio- and Cheminformatics for Small Data Sets – Microcystins as a Use Case
Abstract
In the research fields of bio- and cheminformatics, small data sets are a particular challenge. Due to the limited number of data points available, it is often difficult to draw conclusions or to develop computational models and methods to predict endpoints of interest, e.g. activity or toxicity. Methods such as molecular docking, machine learning, or molecular dynamics (MD) simulation can be used to predict or analyse potential activity or toxicity for small data sets, and can accelerate the understanding of mechanisms of bioactivity. Interest in the activity of molecules and their targets often involves understanding biomolecular processes, e.g. protein-ligand binding. For investigations of protein-ligand binding, which is the main focus of this thesis, the research fields of bio- and cheminformatics naturally overlap. Macrocycles are a class of molecules that have twelve or more atoms in their central ring structure. They have been largely neglected in the past, with most research focusing on small molecules or acyclic structures, because macrocycles are expensive and time-consuming to synthesise and test experimentally. Macrocycles were not considered useful for drug discovery, due to the chemical space associated with them. For these reasons, often only small data sets of macrocyclic structures are available. Today, macrocycles are actively researched in cheminformatics, focusing on the study of their structure, conformation, and physicochemical properties in the context of toxicology and drug discovery. This thesis presents interconnected approaches aiming to overcome the difficulties associated with studying toxicity on small data sets, particularly in the context of macrocycles. Microcystin (MC) congener toxicity serves as the use case. MC congeners are a class of structurally similar macrocycles, of which only a few have been tested and analysed so far. 
In the course of this thesis, two prediction models for the toxicity/inhibition of MC congeners on the serine/threonine protein phosphatases (PPPs) PPP1, PPP2A and PPP5 are developed. The first model is a machine learning model that classifies the inhibition capacity of MC congeners towards PPPs into three toxicity classes (toxic, less toxic, non-toxic). Feature representations from natural language processing are used to represent MC congeners and PPPs as numerical vectors. The data set is imbalanced towards the toxic class, since most MC congeners are highly toxic. To increase the number of data points for the non-toxic class, synthetic minority oversampling is applied. This results in a consensus model with 80–90% correct predictions. Nevertheless, this machine learning model is a black box, whereas explainability is preferred in risk assessment when estimating toxicity. For this reason, the second model is based on a mathematical optimisation method, the so-called (α, β)-k-Feature Set-Problem, to reduce the dimensionality of the data set based on inter-class similarities and intra-class differences. This significantly reduces the number of features (i.e. dimensions in the extended connectivity fingerprint) while retaining meaningful features. Finally, Boolean rules are derived from the feature set to obtain signatures explaining the toxicity of MC congeners. Our findings are verified against experimental data available in the literature. The machine learning approaches are built on two-dimensional similarity. Recent studies show that three-dimensional conformational similarity could explain similar bioactivity better than two-dimensional similarity. Therefore, MD simulations are run on different MC congeners with a wide range of PPP1 inhibition capacity to investigate the interactions and conformations of MC congeners. 
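The synthetic minority oversampling step mentioned above can be illustrated with a minimal sketch. It assumes the common formulation in which each new point is interpolated between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote_like`, the parameter `k`, and the toy vectors are hypothetical and are not taken from the thesis, which may use a different implementation and settings.

```python
import random
from typing import List, Sequence


def smote_like(minority: Sequence[Sequence[float]], n_new: int,
               k: int = 2, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic minority samples by interpolation (illustrative only)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (squared Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy feature vectors standing in for the under-represented (non-toxic) class
non_toxic = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(smote_like(non_toxic, n_new=3))
```

Every synthetic point lies on a line segment between two existing minority samples, so the oversampled class stays within the region already covered by measured data.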
Our results indicate that the toxic MC congener MC-LR has two conformational backbone clusters in solvent and that the other toxic MC congener MC-LF has a similar conformational structure. In contrast, the less toxic and non-toxic MC congeners differ from the toxic MC congeners in their backbone conformation and have less well-defined conformational clusters. A second MD simulation data set of 12 MC congeners and their conjugates shows slightly different conformations for the MC congeners previously simulated, and a greater variety of backbone conformations for the conjugate structures. Some non-toxic MC congeners have the same backbone conformation as toxic congeners, supporting the assumption that the Adda side chain is indeed an important factor in binding. For [Enantio-Adda5]-MC-LF, however, a backbone conformation completely different from that of MC-LF is observed, suggesting that modification of Adda can, but does not necessarily have to, alter the backbone conformation. In addition, the two MD simulation data sets are used to derive Interaction Fingerprints (IFPs). IFPs are numerical vectors that encode an interaction between two molecules, in our case MC congeners and PPP1, as present (1) or absent (0). A method is developed to automatically derive IFPs from MD simulation data for each time step. In contrast to the reported procedure, where the IFPs of all individual time steps are summarised into one so-called aggregated IFP, we develop a method to analyse IFPs in more detail. We consider IFP dynamics and changes, in order to analyse and compare individual changes in interactions between individual MC congeners over time. The IFP analysis developed here is not limited to macrocycles, but is applicable to any system where interactions between two molecules are studied. The approach for aggregation and visualisation developed here is therefore a valuable contribution to the field and enables easier and faster analysis of interactions in MD simulations. 
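The difference between an aggregated IFP and a per-time-step analysis can be sketched as follows. This is a minimal illustration under stated assumptions: aggregation is taken to mean a logical OR over all frames, and the interaction labels, the toy frames, and the function names `aggregate_ifp`, `occupancy`, and `transitions` are hypothetical rather than the method developed in the thesis.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical labels: one position per (residue, interaction type) pair
INTERACTIONS = ["res_A:hbond", "res_B:hydrophobic", "res_C:ionic"]

# Toy per-frame IFPs: one binary vector per MD time step (1 = present, 0 = absent)
frames = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 0],
]


def aggregate_ifp(frames: Sequence[Sequence[int]]) -> List[int]:
    """Aggregated IFP (assumed here: 1 if the interaction occurs in any frame)."""
    return [int(any(column)) for column in zip(*frames)]


def occupancy(frames: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of frames in which each interaction is present."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def transitions(frames: Sequence[Sequence[int]],
                names: Sequence[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Per interaction, the frames where it switches on (+1) or off (-1)."""
    events: Dict[str, List[Tuple[int, int]]] = {name: [] for name in names}
    for t in range(1, len(frames)):
        for name, prev, cur in zip(names, frames[t - 1], frames[t]):
            if prev != cur:
                events[name].append((t, cur - prev))
    return events


print(aggregate_ifp(frames))              # [1, 1, 0]
print(occupancy(frames))                  # [0.75, 0.75, 0.0]
print(transitions(frames, INTERACTIONS))
```

In this toy example the aggregated vector is identical for the first two interactions, while the transition list shows that one of them is briefly lost and regained; this is the kind of detail that a single aggregated fingerprint hides.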
In summary, this thesis presents and discusses interconnected approaches and methods developed to deal with small data sets, and demonstrates their applicability to the use case of MC congeners. These approaches and methods helped to derive biologically meaningful insights with computational methods, and are applicable to a wide range of other use cases in the area of bio- or cheminformatics.
Cite
ISO 690
JAEGER-HONZ, Sabrina, 2023. Combining Bio- and Cheminformatics for Small Data Sets – Microcystins as a Use Case [Dissertation]. Konstanz: Universität Konstanz
BibTeX
@phdthesis{JaegerHonz2023Combi-72985, title={Combining Bio- and Cheminformatics for Small Data Sets – Microcystins as a Use Case}, year={2023}, author={Jaeger-Honz, Sabrina}, address={Konstanz}, school={Universität Konstanz} }
RDF
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:void="http://rdfs.org/ns/void#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/72985"> <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/> <dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/72985/4/Jaeger-Honz_2-varvm4xx3o0n3.pdf"/> <dcterms:abstract>In the research fields of bio- and cheminformatics, small data sets are a particular challenge. Due to the limited number of data points available, it is often difficult to draw conclusions or to develop computational models and methods to predict endpoints of interest, e.g. activity or toxicity. Methods such as molecular docking, machine learning, or molecular dynamics (MD) simulation can be used to predict or analyse potential activity or toxicity of small data sets, and help to accelerate and understand mechanisms of bioactivity. The interest in activity of molecules and their targets often involves understanding biomolecular processes, e.g. protein-ligand binding. For investigations of protein-ligand binding, which is the major interest of this thesis, the research fields of bio- and cheminformatics naturally overlap. Macrocycles are a class of molecules that have twelve or more atoms in their central ring structure. They have been highly neglected in the past, with most research focusing on small molecules or acyclic structures, because macrocycles are expensive and time-consuming to synthesise and test experimentally. Macrocycles were not considered useful for drug discovery, due to the chemical space associated with them. 
For these reasons, often only small data sets of macrocyclic structures are available. Today, macrocycles are actively researched in cheminformatics, focusing on the study of their structure, conformation, and physicochemical properties in the context of toxicology and drug discovery. This thesis presents interconnected approaches aiming to overcome the difficulties related with studying toxicity on small data sets, particularly in the context of macrocycles. Microcystin (MC) congener toxicity is utilised as a use case. MC congeners are a class of structurally similar macrocycles, of which only a few have been tested and analysed yet. In the course of this thesis, two prediction models for the toxicity/inhibition of MC congeners on ser/thr-protein phosphatases (PPP) 1, PPP2A and PPP5 are developed. The first model built is a machine learning model used to classify MC congeners inhibition capacity towards PPPs into three toxicity classes (toxic, less toxic, non-toxic). Feature representations from natural language processing are used to represent MC congeners and PPPs as numerical vectors. The data set is imbalanced towards the toxic class, since most MC congeners are highly toxic. To increase the number of data points for the non-toxic class, synthetic minority oversampling is applied. This results in a consensus model with 80-90% correct predictions. Nevertheless, this machine learning model is a black box and explainability is preferred in risk assessment for estimating toxicity. For this reason, the second model built is based on a mathematical optimisation method, the so-called (α, β)-k-Feature Set-Problem, to reduce the dimensionality of the data set based on inter-class similarities and intra-class differences. This significantly reduces the number of features (i.e. dimensions in the extended connectivity fingerprint) while retaining meaningful features. 
Finally, Boolean rules are derived from the feature set to obtain signatures explaining the toxicity of MC congeners. Our findings are verified by experimental data which is available in the literature. The machine learning approaches are build based on two-dimensional similarity approaches. Recent studies show that three-dimensional conformational similarity could explain similar bioactivity better than two-dimensional similarity. Therefore, MD simulations are run on different MC congeners with a wide range of PPP1 inhibition capacity to investigate the interactions and conformations of MC congeners. Our results indicate that the toxic MC congener MC-LR has two conformational backbone clusters in solvent and that the other toxic MC congener MC-LF has a similar conformational structure. In contrast, the less toxic and non-toxic MC congeners differ from the toxic MC congeners in their backbone conformation and have less well defined conformational clusters. A second MD simulation data set of 12 MC congeners and their conjugates shows slightly different conformations for the MC congeners previously simulated, and a greater variety of backbone conformations for the conjugate structures. Some non-toxic MC congeners have the same backbone conformation as toxic congeners, supporting the assumption that the Adda side chain is indeed an important factor in binding. For [Enantio-Adda5]-MC-LF, however, a completely different backbone to MC-LF is observed, suggesting that modification of Adda can, but does not necessarily have to, alter the backbone conformation. In addition, the two MD simulation data sets are used to derive Interaction Fingerprints (IFP). IFPs are numerical vectors that encode an interaction between two molecules, in our case MC congeners and PPP1, as present (1) or absent (0). A method is developed to automatically derive IFP from MD simulation data for each time step. 
In contrast to the reported procedure, where all IFP of each individual time step are summarised into one so-called aggregated IFP, we develop a method to analyse IFPs in more detail. We consider IFP dynamics and changes, in order to analyse and compare individual changes in interactions between individual MC congeners over time. The development of IFP analysis is not limited to macrocycles, but is applicable to any system where interactions between two molecules are studied. The approach for aggregation and visualisation developed here is therefore a valuable contribution to the field and enables easier and faster analysis of interactions in MD simulations. In summary, this thesis presents and discusses interconnected approaches and methods that have been developed to deal with small data sets, and have demonstrated their applicability to the use case of MC congeners. These approaches and methods helped to derive biologically meaningful insights with computational methods, applicable to a wide range of other use cases in the area of bio- or cheminformatics.</dcterms:abstract> <foaf:homepage rdf:resource="http://localhost:8080/"/> <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-09T08:28:35Z</dc:date> <dc:contributor>Jaeger-Honz, Sabrina</dc:contributor> <bibo:uri rdf:resource="https://kops.uni-konstanz.de/handle/123456789/72985"/> <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/> <dc:language>eng</dc:language> <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/> <dcterms:title>Combining Bio- and Cheminformatics for Small Data Sets – Microcystins as a Use Case</dcterms:title> <dcterms:issued>2023</dcterms:issued> <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2025-04-09T08:28:35Z</dcterms:available> <dspace:hasBitstream 
rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/72985/4/Jaeger-Honz_2-varvm4xx3o0n3.pdf"/> <dcterms:rights rdf:resource="https://rightsstatements.org/page/InC/1.0/"/> <dc:creator>Jaeger-Honz, Sabrina</dc:creator> <dc:rights>terms-of-use</dc:rights> </rdf:Description> </rdf:RDF>