Functional Data Analysis Notes 1 - Introduction and Basis Expansion

阅读材料:Ramsay, J.O., Silverman, B.W. (2005) Functional Data Analysis (2nd Edition) Section 1.1, 1.2, 1.3, 1.4, 1.6.1

1 Functional data

Functional Data: Multivariate data with an ordering on the dimensions. (Muller, 2006)

  • Key assumption is smoothness:

    \[y_{ij} = x_i(t_{ij}) + \epsilon_{ij} \]

    with

    • \(y_{ij}\): the ith individual in the jth time piont;
    • \(t\): in a continuum (usually time);
    • \(x_i(t)\): functions for different \(i\).
  • Highest quality data from monitoring equipment. Noiser and less frequent data can also be used.

Eg. Handwriting Data: Measures of position of nib of a pen writing "fda". 20 replications, measurements taken at 200 Hz.
![Handwriting Data]( https://img-blog.csdnimg.cn/2021020820063684.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_7#pic_center)

Fanctional data is complex.

  • Requires more thought/judgement than a t-test.
  • Data needs pre-processing.
  • Parametric inference is rarely available/appropriate.
  • Most approaches are computationally challenging.

FDA is relatively new.

  • First named in Dalzell & Ramsay, 1991.
  • Relatively little penetration into applied fields.
  • Several competing methodologies from different schools of thought (we focus on one).
  • Limited public software/resourses - changed over the last decade.
  • Initial development on data analysis rather than inference.

Necessities for Functional Data:

  • must believably derive from a smooth process;
  • process should not be easily parameterizable (should not be able to write down a formula);
  • enough data to resolve the essential features of the process (peaks, zero-crossings, speed... will depend on application);
  • some repetition in the process;
  • do not need equally spaced or perfect measurements.

2 From discrete to functional data

2.1 Two methods

  • Representing non-parametric continuous-time functions.

    • Basis-expansion methods:
      \(x(t) = \sum_{i = 1}^k \phi_i (t) c_i\)
      for pre-defined \(\phi_i (t)\) and coefficient \(c_i\).
    • Several basis systems available: focus on Fourior and B-splines.
  • Reducing noise in measurements.

    • Smoothing penalties:

    \[c = argmin \sum_{i = 1}^n (y_i - x(t_i))^2 + \lambda \int [Lx(t)]^2 dt \]

    • \(Lx(t)\) measures "roughness" of \(x\)
    • \(\lambda\): a "smoothing parameter" that trades-off fit to the \(y_i\) and roughness; must be chosen.

2.2 Basis expansions

Consider only one record

\[y_i = x(t_i) + \epsilon \]

representing

\[x(t) = \sum_{j=1}^k c_j \phi_j (t_i) = \Phi (t)^T \mathbf{c} \]

\(\Phi (t)\) is a basis system for \(x\).

Eg. Terms for curvature in linear regression

\[y_i = \beta_0 + x_i \beta_1 + x_i^2 \beta_2 + x_i^3 \beta_3 + ... + \epsilon_i \]

implies

\[x(t) = \beta_0 + x_i \beta_1 + x_i^2 \beta_2 + x_i^3 \beta_3 + ... \]

Polynomials are unstable; Fourier basis and B-splines wil be more useful.

2.2.1 The Fourier basis

Add up these basis plots.

![fourier-1]( https://img-blog.csdnimg.cn/20210221125707346.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

![fourier-2]( https://img-blog.csdnimg.cn/20210221125817324.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

![fourier-3]( https://img-blog.csdnimg.cn/20210221130222793.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

Advantages:

  • Only alternative to monomial bases until the middle of the 20th century;
  • Excellent computational properties, especially if the observations are equally spaced;
  • Natural for describing periodic data.

Disadvantages:

  • The functions are periodc which can be a problem for data with growth curves etc.

2.2.2 B-splines

\[P(u) = \sum_{i=0}^k P_i B_{i, j} (u), ~~~~~ u \in [u_{j-1}, u_{k+1}], \]

Splines (\(B_{i, j}\)): polynomial segmetns joined end-to-end.

  • Segments are constrained to be smooth at the join.

Knotes (\(P_i\)): the pointes at which the segments join.

The system is difined by

  • the order \(m\) (order = degree + 1) of the polynomial segments (splines);
  • the location of the knots.

![b-splines-1]( https://img-blog.csdnimg.cn/20210221133822434.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

![b-splines-2]( https://img-blog.csdnimg.cn/20210221141920202.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

![b-splines-3]( https://img-blog.csdnimg.cn/2021022114220878.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

![b-spline-local]( https://img-blog.csdnimg.cn/202102211420419.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

Properties:

  • Number of basis functions = order (\(m\)) + number of interior knots.
  • Derivatives up to \(m - 2\) are continuous.
  • B-spline basis functions are positive over at most \(m\) adjacent intervals \(\rightarrow\) fast computation for even thousands of basis functions.
  • Sum of all B-splines in a basis is always 1; can fit any polynomial of order \(m\).
  • Most popular choice is order 4, implying continuous second derivatives. Second derivatives have straight-line segments.

Choosing the number of knots and order:

  • The order of the spline should be at least \(m + 2\) if you are interested in \(m\) derivatives.
  • Knots are often equally spaced (a useful default).
  • Place more knotes where you know there is strong curvature, and fewer where the function changes slowly.
  • Be sure there is at least one data point in every interval.

2.2.3 Other basis

The fda library in R also allows the following basis:

  • Constant: \(\phi (t) = 1\).
  • Power: \(t^{\lambda_1}, t^{\lambda_2}, t^{\lambda_3}, ...\), powers are distinctbut not necessarily integers or positive.
  • Exponential: \(e^{\lambda_1 t}, e^{\lambda_2 t}, e^{\lambda_3 t}, ...\).

Other possible basis include:

  • Wavelets: especially for sharp, local features.
  • Empirical: functional principal components.
  • Designer: dynamic models, tailoring a basis to data can be more efficient.

2.2.4 Summary

  • Basis expansions: just like adding different independent variables in linear regression
  • Fourier basis: classical, common in signal processing etc. Great for periodic functions. Must be infinitely differentiable.
  • B-spline basis: locally polynomial. Allows control of smoothness and accuracy. Local definition ) good numerics.
  • Other basis systems also exist.
  • What is best depends on the data.

3 R example

3.1 New functions

We need the package "fda". (Package "fda".pdf)

Functions used Detials
matplot(matrixA, matrixB) matrixA would be on the x-axis while matrixB on the y-axis. The number of rows of matrixA and matrixB should match.
create.fourier.basis(vectorRange, numberBasis) Create Fourier basis
create.bspline.basis(vectorRange, numberBasis, numberOrder, breaks = numberKnots) Create B-spline basis
eval.basis(vectorTimepoint, basis) Evaluate the basis. Outcome is a vector with vectorTimepoint as rows and basis as colnumns.

library("fda")

# Plot the temperature data of the 35 cities
matplot(CanadianWeather$dailyAv[, , 1], , type = "l", xlab = "Days")

![matplot]( https://img-blog.csdnimg.cn/20210222104742506.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

# Load the dataset "CanadianWeather"
data(CanadianWeather)

# Temperature and precipitation are contained in the dailyAV element, and we
# will extract them.
temp <- CanadianWeather$dailyAv[, , 1]
precip <- CanadianWeather$dailyAv[, , 2]

# We need corresponding time points. I'll put them half-way through a day. 
# This is because the period is over 0:365 and we'd like the 365 data points 
# to be about symmetric in that period.
daytime <- (1:365) - 0.5

# This is a bit fine for plotting purposes, so we'll also create a vector of
# points to use for plotting every 5 days
day5 <- seq(0, 365, 5)


# Plot these data
matplot(daytime, temp, type='l')
matplot(daytime, precip, type='l')

# We can also plot by region; Atlantic, Pacific, Central and North.
matplot(daytime, temp, type = 'l', col = as.factor(CanadianWeather$region))
matplot(daytime, precip, type = 'l',col = as.factor(CanadianWeather$region))
Temperature of the 35 cities
Precipitation of the 35 cities
Grouped by region
# Perform fda

# Define a basis system
# We'll always need the range.
dayrng = c(0, 365)

3.2 Fourier basis

# 1/ Fourier Basis with 365 basis functions
fbasis = create.fourier.basis(dayrng, 365)

# Plot a basis with just 5 components
plot(create.fourier.basis(dayrng, 5))

![在这里插入图片描述]( https://img-blog.csdnimg.cn/20210222115709613.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

# Try a simple linear regression of the first temperature record on the first
# 20 basis functions.
fb.values <- eval.basis(day5, fbasis)
# The 74 by 365 matrix that results has days as rows, bases as columns.
dim(fb.values)
plot(day5, fb.values[, 1])
plot(day5, fb.values[, 2])
# Extract the first temperature record
ex.temp <- temp[, 1]
plot(ex.temp, main = "Daily Temperature of St. Johns")

![ex.temp]( https://img-blog.csdnimg.cn/20210222123601435.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

# Run a linear regression on temperature with the first 5 basis functions.
# In this case I need to evaluate the basis at the observation times.
Xmat <- eval.basis(daytime, fbasis)
Xmat <- Xmat[, 1:5]  # First 5 basis functions
matplot(Xmat, type = "l")

![Xmat]( https://img-blog.csdnimg.cn/20210222124331968.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

ex.temp.mod <- lm(ex.temp ~ Xmat)

# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main = "Daily Temperature of St. Johns with smoothing (5 basis functions")

![fit_5]( https://img-blog.csdnimg.cn/20210222134244108.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

# Check the residuals
plot(daytime, ex.temp.mod$resid)

![resid]( https://img-blog.csdnimg.cn/202102221249315.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

There's some clear autocorrelation. More basis functions may be warranted.
Repeat the above with different numbers of basis functions (say 20, 50, 100, 200, 365) to find the one give a reasonable smooth.

Xmat <- eval.basis(daytime, fbasis)
Xmat <- Xmat[, 1:20]  # First 20 basis functions
ex.temp.mod <- lm(ex.temp ~ Xmat)

# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main = "Daily Temperature of St. Johns with smoothing (20 basis functions)")

![fit_20]( https://img-blog.csdnimg.cn/20210222134138496.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

# Check the residuals
plot(daytime, ex.temp.mod$resid)

![resid_20]( https://img-blog.csdnimg.cn/20210222134346104.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

3.3 B-splines

# 2/ B-spline bases with knots every 5 days

# First of all define a knot sequence; this will be the same as day5.
knots <- day5

# We'll use fourth-order B-splines
norder <- 4

# This implies the number of basis functions
nbasis = length(knots) + norder - 2

# Define the basis
bbasis <- create.bspline.basis(dayrng, nbasis, norder, knots)

# If in doubt, we can obtain
bbasis$nbasis  # number of basis functions
bbasis$rangeval  # basis range

# Plot the basis
plot(bbasis)

![bbasis]( https://img-blog.csdnimg.cn/20210222130342280.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

# We can look at a smaller number
plot(create.bspline.basis(dayrng, nbasis = 12, norder))

![bbasis-12]( https://img-blog.csdnimg.cn/20210222130446213.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

# We can also look at the inner product of these
in.mat = inprod(bbasis, bbasis)
par(mfrow=c(1, 1))
image(in.mat)

![inner_product]( https://img-blog.csdnimg.cn/20210222130538811.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

It is zero outside of a diagonal band This can help computation a great deal.

Change the order of the basis and observe how the width of the support of the basis changes and how its smoothness properties change.

bbasis <- create.bspline.basis(dayrng, nbasis = 20, norder)
# Run a linear regression on temperature with the first 20 basis functions. 
# In this case I need to evaluate the basis at the observation times.
Xmat <- eval.basis(daytime, bbasis)
Xmat <- Xmat[, 1:20]  # First 20 basis functions
matplot(Xmat, type = "l")

![Xmat_b]( https://img-blog.csdnimg.cn/20210222133226253.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

ex.temp.mod <- lm(ex.temp ~ Xmat)

# Plot the fitted values
plot(daytime, ex.temp)
lines(daytime, ex.temp.mod$fitted, col = 2, lwd = 2)
title(main="Daily Temperature of St. Johns with smoothing")

![b_fitted]( https://img-blog.csdnimg.cn/20210222133431393.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

# Check the residuals
plot(daytime,ex.temp.mod$resid)

![b_residual]( https://img-blog.csdnimg.cn/20210222133504588.png?x-oss-process=image/watermark ,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2JhaWR1XzE4NTQ4MDA5,size_16,color_FFFFFF,t_70#pic_center)

posted @ 2022-02-28 22:08  ZZN而已  阅读(444)  评论(0)    收藏  举报