Column subset selection with applications to neuroimaging data

Column subset selection with applications to neuroimaging data

##### Date

2014

##### Authors

##### Editors

##### Journal ISSN

##### Electronic ISSN

##### ISBN

##### Bibliographical data

##### Publisher

##### Series

##### URI (citable link)

##### International patent number

##### Link to the license

##### EU project number

##### Project

##### Open Access publication

##### Collections

##### Title in another language

##### Publication type

Dissertation

##### Publication status

##### Published in

##### Abstract

Column (subset) selection problems require to select a subset of the columns from a matrix in an unsupervised fashion and such that a matrix norm error criterion is minimised. Motivations for column selection are 1) data interpretation through identifying relevant columns, 2) speedup by performing computationally expensive operations on a small column subset. This work introduces structural and algorithmic improvements regarding both aspects, along with demonstrating applications to neuroimaging data.

Column selection for data interpretation: NNCX and Convex_cone

For CX factorisation (Drineas et al., SIAM J. Matrix Analysis and Applications, 2008), a $c$-subset of the columns of matrix $A$ ($m \times n$) should be selected into the $m \times c$ matrix $C$, and combined linearly with coefficients in $X$ ($c \times n$), such that the CX norm error $\left\| A - CX \right\|^2_{Fr}$ is minimised in the Frobenius norm. For non-negative CX (NNCX), the coefficients in $X$ are constrained to be non-negative, which has advantages with respect to data interpretation (Hyvönen et al., ACM SIGKDD, 2008).

The goal is to find good column selection strategies for NNCX, and to analyse the interpretability aspect of CX/NNCX column selection. To this end, a generative model for NNCX is introduced, where the columns of $A$ contain either one of $s$ generating pure signal columns or a linear combination (with non-negative coefficients) of several pure signal columns. An algorithm, Convex_cone, is proposed as a heuristic for selecting the extreme columns of $A$. These extreme columns correspond to the generating columns and they span a convex cone that contains the data points of $A$. The extreme columns are interpretable in the sense that they allow to understand how $A$ has been constructed, and they also serve to reduce the NNCX norm error $\left\| A - CX^{0+} \right\|^2_{Fr}$ (non-negativity indicated by $^{0+}$).

Empirical evaluation is performed against state-of-the-art algorithms for column selection. With respect to recovering the generating columns and to reducing the NNCX norm error, Convex_cone performs better than established algorithms for CX that have been modified to compute a NNCX factorisation.

Column selection for speedup: Weighted norm error and FastPCA

A fast approximation to Principal Component Analysis (PCA) is proposed: FastPCA. Similar to the paradigm of the Nyström extension (Williams & Seeger, NIPS, 2000), the rank-$k$ reconstruction $A_k$ of matrix $A$, that is based on the top-$k$ principal components of $A$, is approximated by $\widehat{A}_k$. The approximation $\widehat{A}_k$ is based on the top-$k$ principal components of a column subset $C \in A$ to be chosen such that the norm error $\left\| A_k - \widehat{A}_k \right\|_{Fr}$ is small.

FastPCA implements a column sampling scheme to quickly find a good $C$, in particular in the light of a priori information about the data type. A weighted norm error criterion is introduced where column weights specify column importance. The a priori information helps to set up appropriate column weights, and it gives rise to a data-dependent probability distribution over the columns that assigns relevant columns a high probability of being sampled.

Applications to neuroimaging data

Both algorithms can be applied to calcium imaging movies of insect brain activity. The imaging movies turn out to have a NNCX-like structure: They can be modelled by linear combination (with non-negative coefficients) of the pure signals of neural units. Among a large number of columns (pixels), Convex_cone selects a subset of columns that contain the pure signals of identifiable neural units. The NNCX factorisation computed by Convex_cone proves to be useful in practical data analysis and visualisation as demonstrated by biological publications.

For FastPCA, a priori information about the spatial aspect of neuroimaging data enables a column importance sampling that takes the spatial relationship of columns (pixels) into account. This is used to explicitly reduce the approximation error for columns with biological signals that are known to occur in spatially contiguous clusters. FastPCA is employed as a preprocessing step for fast dimensionality and noise reduction, ensuring scalability to large data sizes in neuroimaging.

Column selection for data interpretation: NNCX and Convex_cone

For CX factorisation (Drineas et al., SIAM J. Matrix Analysis and Applications, 2008), a $c$-subset of the columns of matrix $A$ ($m \times n$) should be selected into the $m \times c$ matrix $C$, and combined linearly with coefficients in $X$ ($c \times n$), such that the CX norm error $\left\| A - CX \right\|^2_{Fr}$ is minimised in the Frobenius norm. For non-negative CX (NNCX), the coefficients in $X$ are constrained to be non-negative, which has advantages with respect to data interpretation (Hyvönen et al., ACM SIGKDD, 2008).

The goal is to find good column selection strategies for NNCX, and to analyse the interpretability aspect of CX/NNCX column selection. To this end, a generative model for NNCX is introduced, where the columns of $A$ contain either one of $s$ generating pure signal columns or a linear combination (with non-negative coefficients) of several pure signal columns. An algorithm, Convex_cone, is proposed as a heuristic for selecting the extreme columns of $A$. These extreme columns correspond to the generating columns and they span a convex cone that contains the data points of $A$. The extreme columns are interpretable in the sense that they allow to understand how $A$ has been constructed, and they also serve to reduce the NNCX norm error $\left\| A - CX^{0+} \right\|^2_{Fr}$ (non-negativity indicated by $^{0+}$).

Empirical evaluation is performed against state-of-the-art algorithms for column selection. With respect to recovering the generating columns and to reducing the NNCX norm error, Convex_cone performs better than established algorithms for CX that have been modified to compute a NNCX factorisation.

Column selection for speedup: Weighted norm error and FastPCA

A fast approximation to Principal Component Analysis (PCA) is proposed: FastPCA. Similar to the paradigm of the Nyström extension (Williams & Seeger, NIPS, 2000), the rank-$k$ reconstruction $A_k$ of matrix $A$, that is based on the top-$k$ principal components of $A$, is approximated by $\widehat{A}_k$. The approximation $\widehat{A}_k$ is based on the top-$k$ principal components of a column subset $C \in A$ to be chosen such that the norm error $\left\| A_k - \widehat{A}_k \right\|_{Fr}$ is small.

FastPCA implements a column sampling scheme to quickly find a good $C$, in particular in the light of a priori information about the data type. A weighted norm error criterion is introduced where column weights specify column importance. The a priori information helps to set up appropriate column weights, and it gives rise to a data-dependent probability distribution over the columns that assigns relevant columns a high probability of being sampled.

Applications to neuroimaging data

Both algorithms can be applied to calcium imaging movies of insect brain activity. The imaging movies turn out to have a NNCX-like structure: They can be modelled by linear combination (with non-negative coefficients) of the pure signals of neural units. Among a large number of columns (pixels), Convex_cone selects a subset of columns that contain the pure signals of identifiable neural units. The NNCX factorisation computed by Convex_cone proves to be useful in practical data analysis and visualisation as demonstrated by biological publications.

For FastPCA, a priori information about the spatial aspect of neuroimaging data enables a column importance sampling that takes the spatial relationship of columns (pixels) into account. This is used to explicitly reduce the approximation error for columns with biological signals that are known to occur in spatially contiguous clusters. FastPCA is employed as a preprocessing step for fast dimensionality and noise reduction, ensuring scalability to large data sizes in neuroimaging.

##### Summary in another language

##### Subject (DDC)

004 Computer Science

##### Keywords

column subset selection, column-based matrix factorisation, neuroimaging data

##### Conference

##### Review

undefined / . - undefined, undefined. - (undefined; undefined)

##### Cite This

## ISO 690

STRAUCH, Martin, 2014.*Column subset selection with applications to neuroimaging data*[Dissertation]. Konstanz: University of Konstanz

## BibTex

@phdthesis{Strauch2014Colum-29539, year={2014}, title={Column subset selection with applications to neuroimaging data}, author={Strauch, Martin}, address={Konstanz}, school={Universität Konstanz} }

## RDF

<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:dspace="http://digital-repositories.org/ontologies/dspace/0.1.0#" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:void="http://rdfs.org/ns/void#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" > <rdf:Description rdf:about="https://kops.uni-konstanz.de/server/rdf/resource/123456789/29539"> <dspace:hasBitstream rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/29539/3/Strauch_0-269383.pdf"/> <dc:rights>terms-of-use</dc:rights> <dcterms:available rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-01-19T08:28:51Z</dcterms:available> <dcterms:issued>2014</dcterms:issued> <dcterms:rights rdf:resource="https://rightsstatements.org/page/InC/1.0/"/> <dcterms:isPartOf rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/> <dspace:isPartOfCollection rdf:resource="https://kops.uni-konstanz.de/server/rdf/resource/123456789/36"/> <bibo:uri rdf:resource="http://kops.uni-konstanz.de/handle/123456789/29539"/> <dc:creator>Strauch, Martin</dc:creator> <dc:contributor>Strauch, Martin</dc:contributor> <void:sparqlEndpoint rdf:resource="http://localhost/fuseki/dspace/sparql"/> <dc:date rdf:datatype="http://www.w3.org/2001/XMLSchema#dateTime">2015-01-19T08:28:51Z</dc:date> <dcterms:hasPart rdf:resource="https://kops.uni-konstanz.de/bitstream/123456789/29539/3/Strauch_0-269383.pdf"/> <dcterms:title>Column subset selection with applications to neuroimaging data</dcterms:title> <foaf:homepage rdf:resource="http://localhost:8080/"/> <dcterms:abstract xml:lang="eng">Column (subset) selection problems require to select a subset of the columns from a matrix in an unsupervised fashion and such that a matrix norm error criterion is minimised. Motivations for column selection are 1) data interpretation through identifying relevant columns, 2) speedup by performing computationally expensive operations on a small column subset. This work introduces structural and algorithmic improvements regarding both aspects, along with demonstrating applications to neuroimaging data.<br /><br /><br />Column selection for data interpretation: NNCX and Convex_cone<br /><br />For CX factorisation (Drineas et al., SIAM J. Matrix Analysis and Applications, 2008), a $c$-subset of the columns of matrix $A$ ($m \times n$) should be selected into the $m \times c$ matrix $C$, and combined linearly with coefficients in $X$ ($c \times n$), such that the CX norm error $\left\| A - CX \right\|^2_{Fr}$ is minimised in the Frobenius norm. For non-negative CX (NNCX), the coefficients in $X$ are constrained to be non-negative, which has advantages with respect to data interpretation (Hyvönen et al., ACM SIGKDD, 2008).<br /><br />The goal is to find good column selection strategies for NNCX, and to analyse the interpretability aspect of CX/NNCX column selection. To this end, a generative model for NNCX is introduced, where the columns of $A$ contain either one of $s$ generating pure signal columns or a linear combination (with non-negative coefficients) of several pure signal columns. An algorithm, Convex_cone, is proposed as a heuristic for selecting the extreme columns of $A$. These extreme columns correspond to the generating columns and they span a convex cone that contains the data points of $A$. The extreme columns are interpretable in the sense that they allow to understand how $A$ has been constructed, and they also serve to reduce the NNCX norm error $\left\| A - CX^{0+} \right\|^2_{Fr}$ (non-negativity indicated by $^{0+}$).<br /><br />Empirical evaluation is performed against state-of-the-art algorithms for column selection. With respect to recovering the generating columns and to reducing the NNCX norm error, Convex_cone performs better than established algorithms for CX that have been modified to compute a NNCX factorisation.<br /><br /><br />Column selection for speedup: Weighted norm error and FastPCA<br /><br />A fast approximation to Principal Component Analysis (PCA) is proposed: FastPCA. Similar to the paradigm of the Nyström extension (Williams & Seeger, NIPS, 2000), the rank-$k$ reconstruction $A_k$ of matrix $A$, that is based on the top-$k$ principal components of $A$, is approximated by $\widehat{A}_k$. The approximation $\widehat{A}_k$ is based on the top-$k$ principal components of a column subset $C \in A$ to be chosen such that the norm error $\left\| A_k - \widehat{A}_k \right\|_{Fr}$ is small.<br /><br />FastPCA implements a column sampling scheme to quickly find a good $C$, in particular in the light of a priori information about the data type. A weighted norm error criterion is introduced where column weights specify column importance. The a priori information helps to set up appropriate column weights, and it gives rise to a data-dependent probability distribution over the columns that assigns relevant columns a high probability of being sampled.<br /><br /><br />Applications to neuroimaging data<br /><br />Both algorithms can be applied to calcium imaging movies of insect brain activity. The imaging movies turn out to have a NNCX-like structure: They can be modelled by linear combination (with non-negative coefficients) of the pure signals of neural units. Among a large number of columns (pixels), Convex_cone selects a subset of columns that contain the pure signals of identifiable neural units. The NNCX factorisation computed by Convex_cone proves to be useful in practical data analysis and visualisation as demonstrated by biological publications.<br /><br />For FastPCA, a priori information about the spatial aspect of neuroimaging data enables a column importance sampling that takes the spatial relationship of columns (pixels) into account. This is used to explicitly reduce the approximation error for columns with biological signals that are known to occur in spatially contiguous clusters. FastPCA is employed as a preprocessing step for fast dimensionality and noise reduction, ensuring scalability to large data sizes in neuroimaging.</dcterms:abstract> <dc:language>eng</dc:language> </rdf:Description> </rdf:RDF>

##### Internal note

##### xmlui.Submission.submit.DescribeStep.inputForms.label.kops_note_fromSubmitter

##### Examination date of dissertation

December 3, 2014

##### University note

Konstanz, Univ., Doctoral dissertation, 2014

##### Method of financing

##### Comment on publication

##### Alliance license

##### Corresponding Authors der Uni Konstanz vorhanden

##### International Co-Authors

##### Bibliography of Konstanz

No