Main Content

## Data Analysis

### Introduction

Every data analysis has some standard components:

• Preprocessing — Consider outliers and missing values, and smooth data to identify possible models.

• Summarizing — Compute basic statistics to describe the overall location, scale, and shape of the data.

• Visualizing — Plot data to identify patterns and trends.

• Modeling — Give data trends fuller descriptions, suitable for predicting new values.

Data analysis moves among these components with two basic goals in mind:

1. Describe the patterns in the data with simple models that lead to accurate predictions.

2. Understand the relationships among variables that lead to the model.

This section explains how to carry out a basic data analysis in the MATLAB® environment.

### Preprocessing Data

This example shows how to preprocess data for analysis.

Overview

Begin a data analysis by loading data into suitable MATLAB® container variables and sorting out the "good" data from the "bad." This is a preliminary step that assures meaningful conclusions in subsequent parts of the analysis.

Loading the Data

Begin by loading the data in `count.dat`:

`load count.dat`

The 24-by-3 array `count` contains hourly traffic counts (the rows) at three intersections (the columns) for a single day.

Missing Data

The MATLAB `NaN` (Not a Number) value is normally used to represent missing data. `NaN` values allow variables with missing data to maintain their structure - in this case, 24-by-1 vectors with consistent indexing across all three intersections.

Check the data at the third intersection for `NaN` values using the `isnan` function:

```c3 = count(:,3); % Data at intersection 3 c3NaNCount = sum(isnan(c3))```
```c3NaNCount = 0 ```

`isnan` returns a logical vector the same size as `c3`, with entries indicating the presence (`1`) or absence (`0`) of `NaN` values for each of the 24 elements in the data. In this case, the logical values sum to `0`, so there are no `NaN` values in the data.

`NaN` values are introduced into the data in the section on Outliers.

Outliers

Outliers are data values that are dramatically different from patterns in the rest of the data. They might be due to measurement error, or they might represent significant features in the data. Identifying outliers, and deciding what to do with them, depends on an understanding of the data and its source.

One common method for identifying outliers is to look for values more than a certain number of standard deviations $\sigma$ from the mean $\mu$. The following code plots a histogram of the data at the third intersection together with lines at $\mu$ and $\mu +\eta \sigma$, for $\eta =1,2$:

```h = histogram(c3,10); % Histogram N = max(h.Values); % Maximum bin count mu3 = mean(c3); % Data mean sigma3 = std(c3); % Data standard deviation hold on plot([mu3 mu3],[0 N],'r','LineWidth',2) % Mean X = repmat(mu3+(1:2)*sigma3,2,1); Y = repmat([0;N],1,2); plot(X,Y,'Color',[255 153 51]./255,'LineWidth',2) % Standard deviations legend('Data','Mean','Stds') hold off```

The plot shows that some of the data are more than two standard deviations above the mean. If you identify these data as errors (not features), replace them with `NaN` values as follows:

```outliers = (c3 - mu3) > 2*sigma3; c3m = c3; % Copy c3 to c3m c3m(outliers) = NaN; % Add NaN values```

Smoothing and Filtering

A time-series plot of the data at the third intersection (with the outlier removed in Outliers) results in the following plot:

```plot(c3m,'o-') hold on```

The `NaN` value at hour 20 appears as a gap in the plot. This handling of `NaN` values is typical of MATLAB plotting functions.

Noisy data shows random variations about expected values. You might want to smooth the data to reveal its main features before building a model. Two basic assumptions underlie smoothing:

- The relationship between the predictor (time) and the response (traffic volume) is smooth.

- The smoothing algorithm results in values that are better estimates of expected values because the noise has been reduced.

Apply a simple moving average smoother to the data using the MATLAB `convn` function:

```span = 3; % Size of the averaging window window = ones(span,1)/span; smoothed_c3m = convn(c3m,window,'same'); h = plot(smoothed_c3m,'ro-'); legend('Data','Smoothed Data')```

The extent of the smoothing is controlled with the variable `span`. The averaging calculation returns `NaN` values whenever the smoothing window includes the `NaN` value in the data, thus increasing the size of the gap in the smoothed data.

The `filter` function is also used for smoothing data:

```smoothed2_c3m = filter(window,1,c3m); delete(h) plot(smoothed2_c3m,'ro-','DisplayName','Smoothed Data');```

The smoothed data are shifted from the previous plot. `convn` with the `'same'` parameter returns the central part of the convolution, the same length as the data. `filter` returns the initial part of the convolution, the same length as the data. Otherwise, the algorithms are identical.

Smoothing estimates the center of the distribution of response values at each value of the predictor. It invalidates a basic assumption of many fitting algorithms, namely, that the errors at each value of the predictor are independent. Accordingly, you can use smoothed data to identify a model, but avoid using smoothed data to fit a model.

### Summarizing Data

This example shows how to summarize data.

Overview

Many MATLAB® functions enable you to summarize the overall location, scale, and shape of a data sample.

One of the advantages of working in MATLAB® is that functions operate on entire arrays of data, not just on single scalar values. The functions are said to be vectorized. Vectorization allows for both efficient problem formulation, using array-based data, and efficient computation, using vectorized statistical functions.

Measures of Location

Summarize the location of a data sample by finding a "typical" value. Common measures of location or "central tendency" are computed by the functions `mean`, `median`, and `mode`:

```load count.dat x1 = mean(count)```
```x1 = 1×3 32.0000 46.5417 65.5833 ```
`x2 = median(count)`
```x2 = 1×3 23.5000 36.0000 39.0000 ```
`x3 = mode(count)`
```x3 = 1×3 11 9 9 ```

Like all of its statistical functions, the MATLAB® functions above summarize data across observations (rows) while preserving variables (columns). The functions compute the location of the data at each of the three intersections in a single call.

Measures of Scale

There are many ways to measure the scale or "dispersion" of a data sample. The MATLAB® functions `max`, `min`, `std`, and `var` compute some common measures:

`dx1 = max(count)-min(count)`
```dx1 = 1×3 107 136 250 ```
`dx2 = std(count)`
```dx2 = 1×3 25.3703 41.4057 68.0281 ```
`dx3 = var(count)`
```dx3 = 1×3 103 × 0.6437 1.7144 4.6278 ```

Like all of its statistical functions, the MATLAB® functions above summarize data across observations (rows) while preserving variables (columns). The functions compute the scale of the data at each of the three intersections in a single call.

Shape of a Distribution

The shape of a distribution is harder to summarize than its location or scale. The MATLAB® `hist` function plots a histogram that provides a visual summary:

```figure hist(count) legend('Intersection 1',... 'Intersection 2',... 'Intersection 3')```

Parametric models give analytic summaries of distribution shapes. Exponential distributions, with parameter `mu` given by the data mean, are a good choice for the traffic data:

```c1 = count(:,1); % Data at intersection 1 [bin_counts,bin_locations] = hist(c1); bin_width = bin_locations(2) - bin_locations(1); hist_area = (bin_width)*(sum(bin_counts)); figure hist(c1) hold on mu1 = mean(c1); exp_pdf = @(t)(1/mu1)*exp(-t/mu1); % Integrates % to 1 t = 0:150; y = exp_pdf(t); plot(t,(hist_area)*y,'r','LineWidth',2) legend('Distribution','Exponential Fit')```

Methods for fitting general parametric models to data distributions are beyond the scope of this section. Statistics and Machine Learning Toolbox™ software provides functions for computing maximum likelihood estimates of distribution parameters.

### Visualizing Data

#### Overview

You can use many MATLAB graph types for visualizing data patterns and trends. Scatter plots, described in this section, help to visualize relationships among the traffic data at different intersections. Data exploration tools let you query and interact with individual data points on graphs.

Note

This section continues the data analysis from Summarizing Data.

#### 2-D Scatter Plots

A two-dimensional scatter plot, created with the `scatter` function, shows the relationship between the traffic volume at the first two intersections:

```load count.dat c1 = count(:,1); % Data at intersection 1 c2 = count(:,2); % Data at intersection 2 figure scatter(c1,c2,'filled') xlabel('Intersection 1') ylabel('Intersection 2')```

The covariance, computed by the `cov` function measures the strength of the linear relationship between the two variables (how tightly the data lies along a least-squares line through the scatter):

`C12 = cov([c1 c2])`
```C12 = 2×2 103 × 0.6437 0.9802 0.9802 1.7144 ```

The results are displayed in a symmetric square matrix, with the covariance of the ith and jth variables in the (i, j)th position. The ith diagonal element is the variance of the ith variable.

Covariances have the disadvantage of depending on the units used to measure the individual variables. You can divide a covariance by the standard deviations of the variables to normalize values between +1 and –1. The `corrcoef` function computes correlation coefficients:

`R12 = corrcoef([c1 c2])`
```R12 = 2×2 1.0000 0.9331 0.9331 1.0000 ```
`r12 = R12(1,2) % Correlation coefficient`
```r12 = 0.9331 ```
`r12sq = r12^2 % Coefficient of determination`
```r12sq = 0.8707 ```

Because it is normalized, the value of the correlation coefficient is readily comparable to values for other pairs of intersections. Its square, the coefficient of determination, is the variance about the least-squares line divided by the variance about the mean. Thus, it is the proportion of variation in the response (in this case, the traffic volume at intersection 2) that is eliminated or statistically explained by a least-squares line through the scatter.

#### 3-D Scatter Plots

A three-dimensional scatter plot, created with the `scatter3` function, shows the relationship between the traffic volume at all three intersections. Use the variables `c1`, `c2`, and `c3` that you created in the previous step:

```figure c3 = count(:,3); % Data at intersection 3 scatter3(c1,c2,c3,'filled') xlabel('Intersection 1') ylabel('Intersection 2') zlabel('Intersection 3')```

Measure the strength of the linear relationship among the variables in the three-dimensional scatter by computing eigenvalues of the covariance matrix with the `eig` function:

`vars = eig(cov([c1 c2 c3]))`
```vars = 3×1 103 × 0.0442 0.1118 6.8300 ```
`explained = max(vars)/sum(vars)`
```explained = 0.9777 ```

The eigenvalues are the variances along the principal components of the data. The variable `explained` measures the proportion of variation explained by the first principal component, along the axis of the data. Unlike the coefficient of determination for two-dimensional scatters, this measure distinguishes predictor and response variables.

#### Scatter Plot Arrays

Use the `plotmatrix` function to make comparisons of the relationships between multiple pairs of intersections:

```figure plotmatrix(count)```

The plot in the (i, j)th position of the array is a scatter with the ith variable on the vertical axis and the jth variable on the horizontal axis. The plot in the ith diagonal position is a histogram of the ith variable.

#### Exploring Data in Graphs

Using your mouse, you can pick observations on almost any MATLAB graph with two tools from the figure toolbar:

• Data Cursor

• Data Brushing

These tools each place you in exploratory modes in which you can select data points on graphs to identify their values and create workspace variables to contain specific observations. When you use data brushing, you can also copy, remove or replace the selected observations.

For example, make a scatter plot of the first and third columns of `count`:

```load count.dat scatter(count(:,1),count(:,3))```
Select the Data Cursor Tool and click the rightmost data point. A datatip displaying the point's x and y value is placed there.

Datatips display x-, y-, and z- (for three-dimensional plots) coordinates by default. You can drag a datatip from one data point to another to see new values or add additional datatips by right-clicking a datatip and using the context menu. You can also customize the text that datatips display using MATLAB code.

Data brushing is a related feature that lets you highlight one or more observations on a graph by clicking or dragging. To enter data brushing mode, click the left side of the Data Brushing tool on the figure toolbar. Clicking the arrow on the right side of the tool icon drops down a color palette for selecting the color with which to brush observations. This figure shows the same scatter plot as the previous figure, but with all observations beyond one standard deviation of the mean (as identified using the Tools > Data Statistics GUI) brushed in red.

`scatter(count(:,1),count(:,3))`

After you brush data observations, you can perform the following operations on them:

• Delete them.

• Replace them with constant values.

• Replace them with NaN values.

• Drag or copy, and paste them to the Command Window.

• Save them as workspace variables.

For example, use the Data Brush context menu or the Tools > Brushing > Create new variable option to create a new variable called `count13high`.

A new variable in the workspace results:

```count13high count13high = 61 186 75 180 114 257 ```

Linked plots, or data linking, is a feature closely related to data brushing. A plot is said to be linked when it has a live connection to the workspace data it depicts. The copies of variables stored in an object's `XData`, `YData`, (and, where appropriate, `ZData`), are automatically updated whenever the workspace variables to which they are linked change or are deleted. This causes the graphs on which they appear to update automatically.

Linking plots to variables lets you track specific observations through different presentations of them. When you brush data points in linked plots, brushing one graph highlights the same observations in every graph that is linked to the same variables.

Data linking establishes immediate, two-way communication between figures and workspace variables, in the same way that the Variable Editor communicates with workspace variables. You create links by activating the Data Linking tool on a figure's toolbar. Activating this tool causes the Linked Plot information bar, displayed in the next figure, to appear at the top of the plot (possibly obscuring its title). You can dismiss the bar (shown in the following figure) without unlinking the plot; it does not print and is not saved with the figure.

The following two graphs depict scatter plot displays of linked data after brushing some observations on the left graph. The common variable, `count` carries the brush marks to the right figure. Even though the right graph is not in data brushing mode, it displays brush marks because it is linked to its variables.

```figure scatter(count(:,1),count(:,2)) xlabel ('count(:,1)') ylabel ('count(:,2)') figure scatter(count(:,3),count(:,2)) xlabel ('count(:,3)') ylabel ('count(:,2)') ```

The right plot shows that the brushed observations are more linearly related than in the left plot.

Brushed data observations appear highlighted in the brushing color when you display those variables in the Variable Editor, as you can see here:

`openvar count`

In the Variable Editor, you can alter any values of linked plot data, and the graphs will reflect your edits. To brush data observation from the Variable Editor, click its Brushing Tool button. If the variable you brush is currently depicted in a linked plot, the observations you brush highlight in the plot, as well as in the Variable Editor. When you brush a variable that is a column in a matrix, the other columns in that row are also brushed. That is, you can brush individual observations in a row or column vector, but all columns in a matrix highlight in any row you brush, not just the observations you click.

### Modeling Data

#### Overview

Parametric models translate an understanding of data relationships into analytic tools with predictive power. Polynomial and sinusoidal models are simple choices for the up and down trends in the traffic data.

#### Polynomial Regression

Use the `polyfit` function to estimate coefficients of polynomial models, then use the `polyval` function to evaluate the model at arbitrary values of the predictor.

The following code fits the traffic data at the third intersection with a polynomial model of degree six:

```load count.dat c3 = count(:,3); % Data at intersection 3 tdata = (1:24)'; p_coeffs = polyfit(tdata,c3,6); figure plot(c3,'o-') hold on tfit = (1:0.01:24)'; yfit = polyval(p_coeffs,tfit); plot(tfit,yfit,'r-','LineWidth',2) legend('Data','Polynomial Fit','Location','NW')```

The model has the advantage of being simple while following the up-and-down trend. The accuracy of its predictive power, however, is questionable, especially at the ends of the data.

#### General Linear Regression

Assuming that the data are periodic with a 12-hour period and a peak around hour 7, it is reasonable to fit a sinusoidal model of the form:

`$y=a+b\mathrm{cos}\left(\left(2\pi /12\right)\left(t-7\right)\right)$`

The coefficients a and b appear linearly. Use the MATLAB® `mldivide` (backslash) operator to fit general linear models:

```load count.dat c3 = count(:,3); % Data at intersection 3 tdata = (1:24)'; X = [ones(size(tdata)) cos((2*pi/12)*(tdata-7))]; s_coeffs = X\c3; figure plot(c3,'o-') hold on tfit = (1:0.01:24)'; yfit = [ones(size(tfit)) cos((2*pi/12)*(tfit-7))]*s_coeffs; plot(tfit,yfit,'r-','LineWidth',2) legend('Data','Sinusoidal Fit','Location','NW')```

Use the `lscov` function to compute statistics on the fit, such as estimated standard errors of the coefficients and the mean squared error:

`[s_coeffs,stdx,mse] = lscov(X,c3)`
```s_coeffs = 2×1 65.5833 73.2819 ```
```stdx = 2×1 8.9185 12.6127 ```
```mse = 1.9090e+03 ```

Check the assumption of a 12-hour period in the data with a periodogram, computed using the `fft` function:

```Fs = 1; % Sample frequency (per hour) n = length(c3); % Window length Y = fft(c3); % DFT of data f = (0:n-1)*(Fs/n); % Frequency range P = Y.*conj(Y)/n; % Power of the DFT figure plot(f,P) xlabel('Frequency') ylabel('Power')```

`predicted_f = 1/12`
```predicted_f = 0.0833 ```

The peak near `0.0833` supports the assumption, although it occurs at a slightly higher frequency. The model can be adjusted accordingly.

Download ebook