MC3 Coping with Data: Dimensionality Reduction
High-dimensional data are ubiquitous in many branches of science, such as sociology, psychometrics, and medicine. Modern data science faces huge challenges in extracting useful information from these data. Indeed, high-dimensional data have statistical properties that make them ill-suited to conventional data analysis tools. In addition, the choice among the wide range of modern machine learning tools is difficult because it should be guided by the (unknown) structure and properties of the data. Finally, in many applications, understanding the results of data analysis may be even more important than raw performance, because of the need to convince users and experts.
These reasons make dimensionality reduction an essential step in the data analysis process. Dimensionality reduction aims to provide faithful low-dimensional representations of high-dimensional data. Such representations are useful both to simplify and improve the performance of further data analysis and information extraction (classification, regression, clustering, …) and to better understand the data, including through visualization.
Dimensionality reduction covers both feature selection and feature extraction. Selection restricts the new low-dimensional features to a subset of the original ones, while extraction builds new features, using linear or nonlinear projections based on the idea of manifold learning. Dimensionality reduction can be supervised or unsupervised; in both cases, the choice of the key property to preserve (Euclidean distances, geodesic distances, similarities, …) strongly influences the resulting representation. Quality assessment is also an important issue.
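The difference between selection and extraction can be illustrated with a minimal sketch. It assumes NumPy, uses a simple variance ranking as a stand-in selection criterion (not one of the information-based criteria covered in the course), and uses linear PCA as the simplest extraction method; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def select_features(X, k):
    """Feature selection: keep k of the ORIGINAL columns.

    Assumption: columns are ranked by variance, an illustrative
    unsupervised criterion chosen only for simplicity.
    """
    order = np.argsort(X.var(axis=0))[::-1]   # columns by decreasing variance
    idx = np.sort(order[:k])                  # indices of the kept columns
    return X[:, idx], idx

def extract_features(X, k):
    """Feature extraction: build k NEW features by linear projection (PCA)."""
    Xc = X - X.mean(axis=0)                   # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                       # top-k principal directions
    return Xc @ components.T, components

# Synthetic data: 200 points, 5 features with very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 0.1, 2.0, 0.1, 0.5])

X_sel, kept = select_features(X, 2)           # a subset of the original columns
X_ext, components = extract_features(X, 2)    # linear combinations of all columns
```

In this sketch, `X_sel` contains two unchanged input columns and so remains directly interpretable, whereas each column of `X_ext` mixes all five original features.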
This course will cover basics and advances in machine-learning-based dimensionality reduction. After a brief historical perspective, it will present the “curse of dimensionality” and its consequences for machine learning techniques. It will then cover feature selection, in particular with information-based criteria. Finally, it will cover modern feature extraction methods for dimensionality reduction that rely on distance, neighborhood, or similarity preservation and use either spectral methods or nonlinear optimization tools. It will also cover important issues such as visualization, scalability to big data, user interaction for dynamic exploration, reproducibility, and stability.
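One well-known facet of the curse of dimensionality is the concentration of distances: as the dimension grows, the nearest and farthest neighbors of a point become almost equally far away. The sketch below is an illustration under stated assumptions (NumPy, points drawn uniformly in the unit hypercube, a hypothetical `relative_contrast` helper), not material taken from the course itself.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points random points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)
    d = np.linalg.norm(X - query, axis=1)     # distances to the query point
    return (d.max() - d.min()) / d.min()

# The contrast is expected to shrink as the dimension increases.
for dim in (2, 10, 100, 1000):
    print(dim, relative_contrast(500, dim))
```

A small relative contrast means that distance-based notions such as "nearest neighbor" lose discriminative power, which is one reason why the choice of the property to preserve matters in dimensionality reduction.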
To understand the possibilities, the difficulties and the limitations in extracting information from data represented in high-dimensional spaces. To understand and be able to use feature selection methods and nonlinear dimensionality reduction tools, including assessing the quality of the resulting data representation.
Literature
An introduction to variable and feature selection. I Guyon, A Elisseeff, Journal of Machine Learning Research 3 (Mar), 1157-1182, 2003.
Nonlinear dimensionality reduction. JA Lee, M Verleysen, Springer Science & Business Media, 2007.
Michel Verleysen is a Professor of Machine Learning at the Université catholique de Louvain, Belgium, and Honorary Research Director of the Belgian F.N.R.S. (National Fund for Scientific Research). He is editor-in-chief of the Neural Processing Letters journal (published by Springer), chairman of the annual ESANN conference (European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning), past associate editor of the IEEE Trans. on Neural Networks journal, and member of the editorial board and program committee of several journals and conferences on neural networks and learning. He is the author or co-author of more than 250 scientific papers in international journals and books, or communications to conferences with reviewing committees. His research interests include machine learning, feature selection, nonlinear dimensionality reduction, visualization, high-dimensional data analysis, self-organization, time-series forecasting and biomedical signal processing.