Description: |
In modern data analysis areas such as Image Analysis, Chemometrics or Information Retrieval, the
raw data are often complex and their representation in Euclidean spaces is not straightforward.
However, most statistical data analysis techniques are designed to deal with points in Euclidean
spaces, and hence a representation of the data in some Euclidean coordinate system is always
required as a preliminary step before applying multivariate analysis techniques. This process is crucial to
guarantee the success of the data analysis methodologies and will be a core contribution of this
thesis.
In this work we will develop general data representation techniques in the framework of Functional
Data Analysis (FDA) for classification and clustering problems. In Chapter 1 we motivate
the problems to solve, describe the roadmap of the contributions and set up the notation of this
work.
In Chapter 2 we review some aspects concerning Reproducing Kernel Hilbert Spaces (RKHSs),
Regularization Theory, Integral Operators, Support Vector Machines and Kernel Combinations.
In Chapter 3 we propose a new methodology to obtain finite-dimensional representations of
functional data. The key idea is to consider each functional curve as a point in a general function
space and then project these points onto a Reproducing Kernel Hilbert Space (RKHS) with the
aid of Regularization Theory. We will describe the projection methods, analyze their theoretical
properties and develop a strategy to select appropriate RKHSs to represent the functional data.
Following the functional data analysis approach, we develop in Chapter 4 a new procedure to
deal with proximity (similarity or distance) matrices in classification problems by studying the
connection between proximity measures and a certain class of integral operators. The idea is to
develop a methodology that estimates an integral operator whose associated kernel function,
evaluated at the sample, approximates the sample proximity matrix of the problem. To show the
broad scope of application of the methodology, we will apply it to three cases: (1) classification
problems where the only available information about the data is an asymmetric similarity matrix,
(2) partially labeled classification problems, and (3) classification problems where several sources
of information are available and can be combined to obtain the discrimination function.
In Chapter 5 we propose a spectral framework for information fusion when the sources of
information are given by a set of proximity matrices. Our approach is based on the simultaneous
diagonalization of the original matrices of the problem and it represents a natural way to manage
the redundant information involved in the fusion process. In particular, we define a new metric
for proximity matrices and we propose a method that automatically eliminates the redundant
information among a set of matrices when they are combined.
We conclude the contributions of the thesis in Chapter 6 with a battery of simulated and real
examples devoted to comparing the performance of the proposed methodologies with the state of
the art in representation methods. Finally, in Chapter 7 we include a discussion regarding the
topics described above and we propose some future lines of research that we believe are natural
extensions of the work developed in this thesis. |