KOPS - The Institutional Repository of the University of Konstanz

# Column subset selection with applications to neuroimaging data

STRAUCH, Martin Tobias, 2014. Column subset selection with applications to neuroimaging data [Dissertation]. Konstanz: University of Konstanz

@phdthesis{Strauch2014Colum-29539, title={Column subset selection with applications to neuroimaging data}, year={2014}, author={Strauch, Martin Tobias}, address={Konstanz}, school={Universität Konstanz} }

## Abstract

Column (subset) selection problems require selecting a subset of the columns of a matrix, in an unsupervised fashion, such that a matrix norm error criterion is minimised. Motivations for column selection are 1) data interpretation through identifying relevant columns, and 2) speedup by performing computationally expensive operations on a small column subset. This work introduces structural and algorithmic improvements regarding both aspects and demonstrates applications to neuroimaging data.

### Column selection for data interpretation: NNCX and Convex_cone

For CX factorisation (Drineas et al., SIAM J. Matrix Analysis and Applications, 2008), a $c$-subset of the columns of matrix $A$ ($m \times n$) is selected into the $m \times c$ matrix $C$ and combined linearly with coefficients in $X$ ($c \times n$), such that the CX norm error $\left\| A - CX \right\|^2_{Fr}$, measured in the squared Frobenius norm, is minimised. For non-negative CX (NNCX), the coefficients in $X$ are constrained to be non-negative, which has advantages with respect to data interpretation (Hyvönen et al., ACM SIGKDD, 2008).

The goal is to find good column selection strategies for NNCX, and to analyse the interpretability aspect of CX/NNCX column selection. To this end, a generative model for NNCX is introduced, where each column of $A$ contains either one of $s$ generating pure signal columns or a linear combination (with non-negative coefficients) of several pure signal columns. An algorithm, Convex_cone, is proposed as a heuristic for selecting the extreme columns of $A$. These extreme columns correspond to the generating columns, and they span a convex cone that contains the data points of $A$. The extreme columns are interpretable in the sense that they show how $A$ has been constructed, and they also serve to reduce the NNCX norm error $\left\| A - CX^{0+} \right\|^2_{Fr}$ (non-negativity indicated by $^{0+}$).

Empirical evaluation is performed against state-of-the-art algorithms for column selection. With respect to recovering the generating columns and to reducing the NNCX norm error, Convex_cone performs better than established algorithms for CX that have been modified to compute an NNCX factorisation.

### Column selection for speedup: Weighted norm error and FastPCA

A fast approximation to Principal Component Analysis (PCA) is proposed: FastPCA. Similar to the paradigm of the Nyström extension (Williams & Seeger, NIPS, 2000), the rank-$k$ reconstruction $A_k$ of matrix $A$, which is based on the top-$k$ principal components of $A$, is approximated by $\widehat{A}_k$. The approximation $\widehat{A}_k$ is based on the top-$k$ principal components of a column subset $C$ of $A$, chosen such that the norm error $\left\| A_k - \widehat{A}_k \right\|_{Fr}$ is small.

FastPCA implements a column sampling scheme to quickly find a good $C$, in particular when a priori information about the data type is available. A weighted norm error criterion is introduced where column weights specify column importance. The a priori information helps to set up appropriate column weights, and it gives rise to a data-dependent probability distribution over the columns that assigns relevant columns a high probability of being sampled.

### Applications to neuroimaging data

Both algorithms can be applied to calcium imaging movies of insect brain activity. The imaging movies turn out to have an NNCX-like structure: they can be modelled as linear combinations (with non-negative coefficients) of the pure signals of neural units. Among a large number of columns (pixels), Convex_cone selects a subset of columns that contain the pure signals of identifiable neural units. The NNCX factorisation computed by Convex_cone proves to be useful in practical data analysis and visualisation, as demonstrated in biological publications.

For FastPCA, a priori information about the spatial aspect of neuroimaging data enables a column importance sampling that takes the spatial relationship of columns (pixels) into account. This is used to explicitly reduce the approximation error for columns with biological signals that are known to occur in spatially contiguous clusters. FastPCA is employed as a preprocessing step for fast dimensionality and noise reduction, ensuring scalability to large data sizes in neuroimaging.
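
To make the NNCX norm error from the abstract concrete, the following is a minimal sketch, not the thesis implementation. It assumes that, for a fixed column subset, the non-negative coefficients $X^{0+}$ can be obtained column by column with a standard non-negative least squares solver (here `scipy.optimize.nnls`). The function name `nncx_error`, the toy matrix sizes, and the random seed are hypothetical choices made for illustration. The sketch only evaluates the error for a given subset; it does not implement Convex_cone or any other column selection strategy.

```python
import numpy as np
from scipy.optimize import nnls


def nncx_error(A, cols):
    """Return (||A - C X||_Fr^2, X) with X >= 0 for the column subset `cols`.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    C = A[:, cols]                              # m x c matrix of selected columns
    X = np.zeros((C.shape[1], A.shape[1]))      # c x n coefficient matrix
    for j in range(A.shape[1]):
        # Non-negative least squares: min ||C x - A[:, j]||_2 subject to x >= 0
        X[:, j], _ = nnls(C, A[:, j])
    err = float(np.linalg.norm(A - C @ X, "fro") ** 2)
    return err, X


# Toy data following the generative model described in the abstract:
# s pure signal columns plus non-negative mixtures of them.
rng = np.random.default_rng(0)                  # arbitrary seed (assumption)
m, s, n_mixed = 20, 3, 7                        # arbitrary sizes (assumption)
pure = rng.random((m, s))                       # generating pure signal columns
mixed = pure @ rng.random((s, n_mixed))         # non-negative combinations
A = np.hstack([pure, mixed])                    # columns 0..2 are the pure ones

err_pure, X = nncx_error(A, [0, 1, 2])          # select the generating columns
err_other, _ = nncx_error(A, [3, 4])            # select two mixed columns only
```

In this toy setting, selecting the three generating columns reproduces every column of `A` with non-negative coefficients, so `err_pure` vanishes up to numerical precision, whereas two mixed columns cannot span the cone and `err_other` stays clearly positive. This mirrors the abstract's point that the extreme columns both explain how $A$ was constructed and reduce the NNCX norm error.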
