MC3 Coping with Data: Dimensionality Reduction
High-dimensional data are ubiquitous in many branches of science, such as sociology, psychometrics, and medicine. Modern data science faces huge challenges in extracting useful information from such data. Indeed, high-dimensional data have statistical properties that make them ill-suited to conventional data analysis tools. In addition, choosing among the wide range of modern machine learning tools is difficult, because the choice should be guided by the (unknown) structure and properties of the data. Finally, in many applications, understanding the results of a data analysis may be even more important than raw performance, because of the need to convince users and domain experts.
These reasons make dimensionality reduction an essential step in the data analysis process. Dimensionality reduction aims at providing faithful low-dimensional representations of high-dimensional data. Such representations are useful both to simplify and improve further data analysis and information extraction (classification, regression, clustering, …), and to better understand the data, including through visualization.
Dimensionality reduction covers both feature selection and feature extraction. Selection restricts the new low-dimensional features to a subset of the original ones, while extraction builds new features, using linear or nonlinear projections based on the idea of manifold learning. Dimensionality reduction can be supervised or unsupervised; in both cases, choosing the key property to preserve (Euclidean distances, geodesic distances, similarities, …) largely influences the resulting representation. Assessing the quality of that representation is also an important issue.
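As a minimal illustration of linear feature extraction, the sketch below projects data onto its top principal components (PCA) with NumPy; PCA is one concrete choice of linear projection, used here purely as an example, and the synthetic data set is hypothetical.

```python
import numpy as np

def pca_project(X, k):
    """Project data onto the top-k principal components.

    Illustrative sketch of linear feature extraction: among all linear
    projections to k dimensions, PCA retains the maximum variance.
    """
    Xc = X - X.mean(axis=0)                     # center the data
    # SVD of the centered data matrix; rows of Vt are principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                        # coordinates in the k-dim subspace

rng = np.random.default_rng(0)
# 200 points lying near a 2-D plane embedded in a 10-D space
Z = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 10))
X = Z @ A + 0.01 * rng.normal(size=(200, 10))   # small off-plane noise
Y = pca_project(X, 2)                           # faithful 2-D representation
```

Because the data are nearly planar, the two retained components capture almost all of the variance; a nonlinear method would be needed if the underlying manifold were curved.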
This course will cover basics and advances in machine-learning-based dimensionality reduction. After a brief historical perspective, it will present the “curse of dimensionality” and its consequences for machine learning techniques. It will then cover feature selection, in particular with information-based criteria. Finally, it will cover modern feature extraction methods for dimensionality reduction, relying on distance, neighborhood, or similarity preservation and using either spectral methods or nonlinear optimization tools. Important issues such as visualization, scalability to big data, user interaction for dynamic exploration, reproducibility, and stability will also be addressed.
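One facet of the “curse of dimensionality” can be demonstrated in a few lines: for random points, pairwise Euclidean distances concentrate as the dimension grows, so the contrast between the nearest and farthest neighbor vanishes and distance-based tools lose discriminative power. A small sketch (the sample sizes and dimensions are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

def distance_contrast(dim, n=500):
    """Relative contrast (d_max - d_min) / d_min of distances from one point
    to n-1 others drawn uniformly in the unit hypercube [0, 1]^dim."""
    X = rng.uniform(size=(n, dim))
    d = np.linalg.norm(X[1:] - X[0], axis=1)    # distances to the first point
    return (d.max() - d.min()) / d.min()

contrast_low = distance_contrast(2)       # low dimension: large contrast
contrast_high = distance_contrast(1000)   # high dimension: contrast collapses
```

In low dimension the nearest point is much closer than the farthest one, while in high dimension all distances become nearly equal, which is one reason why methods relying on raw Euclidean distances degrade on high-dimensional data.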
To understand the possibilities, the difficulties, and the limitations of extracting information from data represented in high-dimensional spaces. To understand and be able to use feature selection methods and nonlinear dimensionality reduction tools, including assessing the quality of the resulting data representation.

Literature
An Introduction to Variable and Feature Selection. I. Guyon, A. Elisseeff, Journal of Machine Learning Research 3 (Mar 2003), 1157–1182.
Nonlinear Dimensionality Reduction. J. A. Lee, M. Verleysen, Springer Science & Business Media, 2007.