## Use Tall Arrays on a Parallel Pool

If you have Parallel Computing Toolbox™, you can use tall arrays in your local MATLAB® session, or on a local parallel pool. You can also run tall array calculations on a cluster if you have MATLAB Parallel Server™ installed. This example uses the workers in a local cluster on your machine. You can develop code locally, and then scale up to take advantage of the capabilities offered by Parallel Computing Toolbox and MATLAB Parallel Server without having to rewrite your algorithm. See also Big Data Workflow Using Tall Arrays and Datastores.

Create a datastore and convert it into a tall table.

    ds = datastore('airlinesmall.csv');
    varnames = {'ArrDelay', 'DepDelay'};
    ds.SelectedVariableNames = varnames;
    ds.TreatAsMissing = 'NA';

If you have Parallel Computing Toolbox installed, when you use the `tall` function, MATLAB automatically starts a parallel pool of workers, unless you turn off the default parallel pool preference. The default cluster uses local workers on your machine.

**Note**

If you want to turn off automatically opening a parallel pool, change your parallel preferences. If you turn off the **Automatically create a parallel pool** option, then you must explicitly start a pool if you want the `tall` function to use it for parallel processing. See Specify Your Parallel Preferences.
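With the automatic option turned off, one way to start a pool explicitly is sketched below. This is a minimal illustration, not the only approach: it assumes the default cluster profile is the one you want, and the variable name `pool` is chosen here for illustration.

```matlab
% Reuse the current pool if one is already running; otherwise start one.
pool = gcp('nocreate');   % returns an empty array when no pool exists
if isempty(pool)
    pool = parpool;       % starts a pool using the default cluster profile
end

% Subsequent calls to tall can now use this pool for evaluation.
fprintf('Pool is running with %d workers.\n', pool.NumWorkers);
```

Using `gcp('nocreate')` first avoids the error that `parpool` raises when a pool is already open.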

If you have Parallel Computing Toolbox, you can run the same code as the MATLAB tall table example and automatically execute it in parallel on the workers of your local machine.

Create a tall table `tt` from the datastore.

    tt = tall(ds)

    Starting parallel pool (parpool) using the 'Processes' profile ...
    connected to 4 workers.

    tt =

      M×2 tall table

        ArrDelay    DepDelay
        ________    ________

            8          12
            8           1
           21          20
           13          12
            4          -1
           59          63
            3          -2
           11          -1
            :           :
            :           :

The display indicates that the number of rows, `M`, is not yet known. `M` is a placeholder until the calculation completes.
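If you need the actual row count, one option is to request it explicitly, which forces a pass through the data. The following is a sketch under the assumption that a full pass over the file is acceptable; the variable name `numRows` is illustrative and does not appear in the original example.

```matlab
% Recreate the tall table so that this snippet runs on its own.
ds = datastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay', 'DepDelay'};
ds.TreatAsMissing = 'NA';
tt = tall(ds);

% size(tt,1) is itself a deferred tall value; gather evaluates it
% and returns the row count as an ordinary in-memory scalar.
numRows = gather(size(tt, 1));
```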

Extract the arrival delay `ArrDelay` from the tall table. This action creates a new tall array variable to use in subsequent calculations.

    a = tt.ArrDelay;

You can specify a series of operations on your tall array, which are not executed until you call `gather`. Doing so enables you to batch up commands that might take a long time. For example, calculate the mean and standard deviation of the arrival delay. Use these values to construct the upper and lower thresholds for delays that are within 1 standard deviation of the mean.

    m = mean(a,'omitnan');
    s = std(a,'omitnan');
    one_sigma_bounds = [m-s m m+s];

Use `gather` to calculate `one_sigma_bounds`, and bring the answer into memory.

    sig1 = gather(one_sigma_bounds)

    Evaluating tall expression using the Parallel Pool 'Processes':
    - Pass 1 of 1: Completed in 4.5 sec
    Evaluation completed in 6.3 sec

    sig1 =

      -23.4572    7.1201   37.6975

You can specify multiple inputs and outputs to `gather` if you want to evaluate several things at once. Doing so is faster than calling `gather` separately on each tall array. As an example, calculate the minimum and maximum arrival delay.

    [max_delay, min_delay] = gather(max(a),min(a))

    max_delay =

            1014

    min_delay =

       -64

If you want to develop in serial and not use local workers or your specified cluster, enter the following command.

    mapreducer(0);

If you use `mapreducer` to change the execution environment after creating a tall array, then the tall array is invalid and you must recreate it. To use local workers or your specified cluster again, enter the following command.

    mapreducer(gcp);
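The switch between environments, including the step of recreating the tall array each time, might look like the following sketch. The variable names `mSerial` and `mParallel` are introduced here for illustration only; the datastore setup repeats the earlier steps so that the snippet runs on its own.

```matlab
% Set up the datastore as in the earlier steps.
ds = datastore('airlinesmall.csv');
ds.SelectedVariableNames = {'ArrDelay', 'DepDelay'};
ds.TreatAsMissing = 'NA';

% Serial development: evaluate tall arrays in the local MATLAB session.
mapreducer(0);
tt = tall(ds);                                  % create the tall table after choosing the environment
mSerial = gather(mean(tt.ArrDelay, 'omitnan'));

% Back to the parallel pool: earlier tall arrays are now invalid,
% so recreate the tall table from the datastore before using it.
mapreducer(gcp);
tt = tall(ds);
mParallel = gather(mean(tt.ArrDelay, 'omitnan'));
```

Because only the execution environment changes, the two results should agree up to floating-point rounding.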

**Note**

One of the benefits of developing algorithms with tall arrays is that you only need to write the code once. You can develop your code locally, and then use `mapreducer` to scale up to a cluster, without needing to rewrite your algorithm. For an example, see Use Tall Arrays on a Spark Cluster.
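As a rough sketch of that scale-up step for a MATLAB Parallel Server cluster, the snippet below points tall array evaluation at a pool opened on a cluster. The profile name `'MyClusterProfile'` is hypothetical and stands in for whatever profile is configured on your system; the sketch also assumes that the cluster workers can read the data file at the same location.

```matlab
% 'MyClusterProfile' is a hypothetical profile name; substitute your own.
c = parcluster('MyClusterProfile');
pool = parpool(c);        % open a pool on the cluster workers
mapreducer(pool);         % evaluate tall arrays on that pool

% The algorithm itself is unchanged from the local version.
ds = datastore('airlinesmall.csv');   % assumes workers can access this file
ds.SelectedVariableNames = {'ArrDelay', 'DepDelay'};
ds.TreatAsMissing = 'NA';
tt = tall(ds);
meanDelay = gather(mean(tt.ArrDelay, 'omitnan'));
```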

## See Also

`gather` | `tall` | `datastore` | `table` | `mapreducer` | `parpool`

## Related Examples

- Big Data Workflow Using Tall Arrays and Datastores
- Use Tall Arrays on a Spark Cluster
- Tall Arrays for Out-of-Memory Data