Column subset selection with applications to neuroimaging data

Lade...
Vorschaubild
Dateien
Strauch_0-269383.pdf
Strauch_0-269383.pdfGröße: 15.9 MBDownloads: 416
Datum
2014
Autor:innen
Herausgeber:innen
Kontakt
ISSN der Zeitschrift
Electronic ISSN
ISBN
Bibliografische Daten
Verlag
Schriftenreihe
Auflagebezeichnung
DOI (zitierfähiger Link)
ArXiv-ID
Internationale Patentnummer
Angaben zur Forschungsförderung
Projekt
Open Access-Veröffentlichung
Open Access Green
Core Facility der Universität Konstanz
Gesperrt bis
Titel in einer weiteren Sprache
Forschungsvorhaben
Organisationseinheiten
Zeitschriftenheft
Publikationstyp
Dissertation
Publikationsstatus
Published
Erschienen in
Zusammenfassung

Column (subset) selection problems require to select a subset of the columns from a matrix in an unsupervised fashion and such that a matrix norm error criterion is minimised. Motivations for column selection are 1) data interpretation through identifying relevant columns, 2) speedup by performing computationally expensive operations on a small column subset. This work introduces structural and algorithmic improvements regarding both aspects, along with demonstrating applications to neuroimaging data.


Column selection for data interpretation: NNCX and Convex_cone

For CX factorisation (Drineas et al., SIAM J. Matrix Analysis and Applications, 2008), a $c$-subset of the columns of matrix $A$ ($m \times n$) should be selected into the $m \times c$ matrix $C$, and combined linearly with coefficients in $X$ ($c \times n$), such that the CX norm error $\left| A - CX \right|^2_{Fr}$ is minimised in the Frobenius norm. For non-negative CX (NNCX), the coefficients in $X$ are constrained to be non-negative, which has advantages with respect to data interpretation (Hyvönen et al., ACM SIGKDD, 2008).

The goal is to find good column selection strategies for NNCX, and to analyse the interpretability aspect of CX/NNCX column selection. To this end, a generative model for NNCX is introduced, where the columns of $A$ contain either one of $s$ generating pure signal columns or a linear combination (with non-negative coefficients) of several pure signal columns. An algorithm, Convex_cone, is proposed as a heuristic for selecting the extreme columns of $A$. These extreme columns correspond to the generating columns and they span a convex cone that contains the data points of $A$. The extreme columns are interpretable in the sense that they allow to understand how $A$ has been constructed, and they also serve to reduce the NNCX norm error $\left| A - CX^{0+} \right|^2_{Fr}$ (non-negativity indicated by $^{0+}$).

Empirical evaluation is performed against state-of-the-art algorithms for column selection. With respect to recovering the generating columns and to reducing the NNCX norm error, Convex_cone performs better than established algorithms for CX that have been modified to compute a NNCX factorisation.


Column selection for speedup: Weighted norm error and FastPCA

A fast approximation to Principal Component Analysis (PCA) is proposed: FastPCA. Similar to the paradigm of the Nyström extension (Williams & Seeger, NIPS, 2000), the rank-$k$ reconstruction $A_k$ of matrix $A$, that is based on the top-$k$ principal components of $A$, is approximated by $\widehat{A}_k$. The approximation $\widehat{A}_k$ is based on the top-$k$ principal components of a column subset $C \in A$ to be chosen such that the norm error $\left| A_k - \widehat{A}k \right|{Fr}$ is small.

FastPCA implements a column sampling scheme to quickly find a good $C$, in particular in the light of a priori information about the data type. A weighted norm error criterion is introduced where column weights specify column importance. The a priori information helps to set up appropriate column weights, and it gives rise to a data-dependent probability distribution over the columns that assigns relevant columns a high probability of being sampled.


Applications to neuroimaging data

Both algorithms can be applied to calcium imaging movies of insect brain activity. The imaging movies turn out to have a NNCX-like structure: They can be modelled by linear combination (with non-negative coefficients) of the pure signals of neural units. Among a large number of columns (pixels), Convex_cone selects a subset of columns that contain the pure signals of identifiable neural units. The NNCX factorisation computed by Convex_cone proves to be useful in practical data analysis and visualisation as demonstrated by biological publications.

For FastPCA, a priori information about the spatial aspect of neuroimaging data enables a column importance sampling that takes the spatial relationship of columns (pixels) into account. This is used to explicitly reduce the approximation error for columns with biological signals that are known to occur in spatially contiguous clusters. FastPCA is employed as a preprocessing step for fast dimensionality and noise reduction, ensuring scalability to large data sizes in neuroimaging.

Zusammenfassung in einer weiteren Sprache
Fachgebiet (DDC)
004 Informatik
Schlagwörter
column subset selection, column-based matrix factorisation, neuroimaging data
Konferenz
Rezension
undefined / . - undefined, undefined
Zitieren
ISO 690STRAUCH, Martin, 2014. Column subset selection with applications to neuroimaging data [Dissertation]. Konstanz: University of Konstanz
BibTex
@phdthesis{Strauch2014Colum-29539,
  year={2014},
  title={Column subset selection with applications to neuroimaging data},
  author={Strauch, Martin},
  address={Konstanz},
  school={Universität Konstanz}
}
RDF
<rdf:RDF
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:bibo="http://purl.org/ontology/bibo/"
    xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > 
  <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/29539">
    <dspace:hasBitstream rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/29539/3/Strauch_0-269383.pdf"/>
    <dc:rights>terms-of-use</dc:rights>
    <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-01-19T08:28:51Z</dcterms:available>
    <dcterms:issued>2014</dcterms:issued>
    <dcterms:rights rdf:resource="https://rightsstatements.org/page/InC/1.0/"/>
    <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/>
    <bibo:uri rdf:resource="http://kops.uni-konstanz.de/handle/123456789/29539"/>
    <dc:creator>Strauch, Martin</dc:creator>
    <dc:contributor>Strauch, Martin</dc:contributor>
    <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/>
    <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-01-19T08:28:51Z</dc:date>
    <dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/29539/3/Strauch_0-269383.pdf"/>
    <dcterms:title>Column subset selection with applications to neuroimaging data</dcterms:title>
    <foaf:homepage rdf:resource="http://localhost:8080/"/>
    <dcterms:abstract xml:lang="eng">Column (subset) selection problems require to select a subset of the columns from a matrix in an unsupervised fashion and such that a matrix norm error criterion is minimised. Motivations for column selection are 1) data interpretation through identifying relevant columns, 2) speedup by performing computationally expensive operations on a small column subset. This work introduces structural and algorithmic improvements regarding both aspects, along with demonstrating applications to neuroimaging data.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Column selection for data interpretation: NNCX and Convex_cone&lt;br /&gt;&lt;br /&gt;For CX factorisation (Drineas et al., SIAM J. Matrix Analysis and Applications, 2008), a $c$-subset of the columns of matrix $A$ ($m \times n$) should be selected into the $m \times c$ matrix $C$, and combined linearly with coefficients in $X$ ($c \times n$), such that the CX norm error $\left\| A - CX \right\|^2_{Fr}$ is minimised in the Frobenius norm.  For non-negative CX (NNCX), the coefficients in $X$ are constrained to be non-negative, which has advantages with respect to data interpretation (Hyvönen et al., ACM SIGKDD, 2008).&lt;br /&gt;&lt;br /&gt;The goal is to find good column selection strategies for NNCX, and to analyse the interpretability aspect of CX/NNCX column selection. To this end, a generative model for NNCX is introduced, where the columns of $A$ contain either one of $s$ generating pure signal columns or a linear combination (with non-negative coefficients) of several pure signal columns. An algorithm, Convex_cone, is proposed as a heuristic for selecting the extreme columns of $A$. These extreme columns correspond to the generating columns and they span a convex cone that contains the data points of $A$. The extreme columns are interpretable in the sense that they allow to  understand how $A$ has been constructed, and they also serve to reduce the NNCX norm error $\left\| A - CX^{0+} \right\|^2_{Fr}$ (non-negativity indicated by $^{0+}$).&lt;br /&gt;&lt;br /&gt;Empirical evaluation is performed against state-of-the-art algorithms for column selection. With respect to recovering the generating columns and to reducing the NNCX norm error, Convex_cone performs better than established algorithms for CX that have been modified to compute a NNCX factorisation.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Column selection for speedup: Weighted norm error and FastPCA&lt;br /&gt;&lt;br /&gt;A fast approximation to Principal Component Analysis (PCA) is proposed: FastPCA. Similar to the paradigm of the Nyström extension (Williams &amp; Seeger, NIPS, 2000), the rank-$k$ reconstruction $A_k$ of matrix $A$, that is based on the top-$k$ principal components of $A$, is approximated by $\widehat{A}_k$. The approximation $\widehat{A}_k$ is based on the top-$k$ principal components of a column subset $C \in A$ to be chosen such that the norm error $\left\| A_k - \widehat{A}_k \right\|_{Fr}$ is small.&lt;br /&gt;&lt;br /&gt;FastPCA implements a column sampling scheme to quickly find a good $C$, in particular in the light of a priori information about the data type. A weighted norm error criterion is introduced where column weights specify column importance. The a priori information helps to set up appropriate column weights, and it gives rise to a data-dependent probability distribution over the columns that assigns relevant columns a high probability of being sampled.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Applications to neuroimaging data&lt;br /&gt;&lt;br /&gt;Both algorithms can be applied to calcium imaging movies of insect brain activity. The imaging movies turn out to have a NNCX-like structure: They can be modelled by linear combination (with non-negative coefficients) of the pure signals of neural units. Among a large number of columns (pixels), Convex_cone selects a subset of columns that contain the pure signals of identifiable neural units. The NNCX factorisation computed by Convex_cone proves to be useful in practical data analysis and visualisation as demonstrated by biological publications.&lt;br /&gt;&lt;br /&gt;For FastPCA, a priori information about the spatial aspect of neuroimaging data enables a column importance sampling that takes the spatial relationship of columns (pixels) into account. This is used to explicitly reduce the approximation error for columns with biological signals that are known to occur in spatially contiguous clusters. FastPCA is employed as a preprocessing step for fast dimensionality and noise reduction, ensuring scalability to large data sizes in neuroimaging.</dcterms:abstract>
    <dc:language>eng</dc:language>
  </rdf:Description>
</rdf:RDF>
Interner Vermerk
xmlui.Submission.submit.DescribeStep.inputForms.label.kops_note_fromSubmitter
Kontakt
URL der Originalveröffentl.
Prüfdatum der URL
Prüfungsdatum der Dissertation
December 3, 2014
Hochschulschriftenvermerk
Konstanz, Univ., Diss., 2014
Finanzierungsart
Kommentar zur Publikation
Allianzlizenz
Corresponding Authors der Uni Konstanz vorhanden
Internationale Co-Autor:innen
Universitätsbibliographie
Nein
Begutachtet
Diese Publikation teilen