# randfeatures

Generate randomized subset of features

## Syntax

`[IDX, Z] = randfeatures(X, Group, 'PropertyName', PropertyValue...)`

`randfeatures(..., 'Classifier', C)`

`randfeatures(..., 'ClassOptions', CO)`

`randfeatures(..., 'PerformanceThreshold', PT)`

`randfeatures(..., 'ConfidenceThreshold', CT)`

`randfeatures(..., 'SubsetSize', SS)`

`randfeatures(..., 'PoolSize', PS)`

`randfeatures(..., 'NumberOfIndices', N)`

`randfeatures(..., 'CrossNorm', CN)`

`randfeatures(..., 'Verbose', VerboseValue)`

## Description

`[IDX, Z] = randfeatures(X, Group, 'PropertyName', PropertyValue...)` performs a randomized subset feature search reinforced by classification. `randfeatures` randomly generates subsets of features used to classify the samples. Every subset is evaluated with the apparent error. Only the best subsets are kept, and they are joined into a single final pool. The cardinality for every feature in the pool gives the measurement of the significance.

`X` contains the training samples. Every column of `X` is an observed vector. `Group` contains the class labels. `Group` can be a numeric vector, a cell array of character vectors, or a string vector; `numel(Group)` must be the same as the number of columns in `X`, and `numel(unique(Group))` must be greater than or equal to `2`. `Z` is the classification significance for every feature. `IDX` contains the indices after sorting `Z`; i.e., the first one points to the most significant feature.
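The procedure described above can be illustrated with a minimal sketch in base MATLAB. This is not the implementation of `randfeatures`: the synthetic data, the nearest-centroid rule standing in for the classifier, the sampling without replacement, the `maxDraws` guard, and all variable names are assumptions made for illustration. It only shows the idea of drawing random subsets, keeping those whose apparent (resubstitution) correct rate passes a threshold, and counting how often every feature lands in the pool.

```matlab
rng(0);                                   % reproducible sketch

% Synthetic data (assumption): features in rows, observations in columns.
numFeatures = 30;
group = [ones(1,20) 2*ones(1,20)];        % two class labels
X = randn(numFeatures, numel(group));
X(1:3, group == 2) = X(1:3, group == 2) + 3;   % make features 1-3 informative

subsetSize    = 5;      % plays the role of 'SubsetSize'
poolSize      = 200;    % plays the role of 'PoolSize'
perfThreshold = 0.8;    % plays the role of 'PerformanceThreshold'
maxDraws      = 1e5;    % hypothetical guard against an endless loop

pool = zeros(poolSize, subsetSize);
accepted = 0;
draws = 0;
while accepted < poolSize && draws < maxDraws
    draws = draws + 1;
    subset = randperm(numFeatures, subsetSize);   % random feature subset

    % Nearest-centroid stand-in classifier, evaluated on the training
    % samples themselves (apparent error). Uses implicit expansion (R2016b+).
    S  = X(subset, :);
    c1 = mean(S(:, group == 1), 2);
    c2 = mean(S(:, group == 2), 2);
    d1 = sum((S - c1).^2, 1);
    d2 = sum((S - c2).^2, 1);
    predicted = 1 + (d2 < d1);

    % Keep only subsets that classify well enough.
    if mean(predicted == group) >= perfThreshold
        accepted = accepted + 1;
        pool(accepted, :) = subset;
    end
end

% Count how often every feature appears in the pool and rank the features.
acceptedPool = pool(1:accepted, :);
counts = accumarray(acceptedPool(:), 1, [numFeatures 1]);
[~, idx] = sort(counts, 'descend');       % idx(1) is the most frequent feature
```

In this sketch `counts` and `idx` are only analogues of `Z` and `IDX`; whether `randfeatures` scales or normalizes the counts is not stated here.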

`randfeatures(..., 'Classifier', C)` sets the classifier. Options are

| Value | Description |
| --- | --- |
| `'da'` (default) | Discriminant analysis |
| `'knn'` | K nearest neighbors |

`randfeatures(..., 'ClassOptions', CO)` specifies a cell array with extra options for the selected classifier. When you specify the discriminant analysis model (`'da'`) as a classifier, `randfeatures` uses the `classify` function with its default parameters. For the KNN classifier, `randfeatures` uses `fitcknn` with the following default options: `{'Distance','correlation','NumNeighbors',5}`.

`randfeatures(..., 'PerformanceThreshold', PT)` sets the correct classification threshold used to pick the subsets included in the final pool. For the `'da'` model, the default is `0.8`. For the `'knn'` model, the default is `0.7`.

`randfeatures(..., 'ConfidenceThreshold', CT)` uses the posterior probability of the discriminant analysis to invalidate classified subvectors with low confidence. When using the `'da'` model, the default is `0.95.^(number of classes)`. When using the `'knn'` model, the default is `1`, meaning any classified subvector must have all *k* neighbors classified to the same class in order to be kept in the pool.
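As a small worked example of these defaults, the following sketch evaluates the `'da'` default for a few class counts and shows what a threshold of 1 implies for five neighbors. The class counts, the neighbor labels, and the way agreement is measured (fraction of neighbors in the majority class) are illustrative assumptions, not taken from the `randfeatures` source.

```matlab
% Default 'da' confidence threshold for a few hypothetical class counts.
numClasses = 2:5;
ctDefault  = 0.95.^numClasses;   % 0.9025, 0.8574, 0.8145, 0.7738 (rounded)

% 'knn' case: with a threshold of 1, all k neighbors must share one class.
neighborLabels = [2 2 2 1 2];    % hypothetical labels of k = 5 neighbors
agreement = mean(neighborLabels == mode(neighborLabels));   % 4 of 5 agree
keep = agreement >= 1;           % false: one neighbor disagrees
```

The `'da'` default therefore becomes more permissive as the number of classes grows, while the `'knn'` default rejects any subvector whose neighbors are not unanimous.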

`randfeatures(..., 'SubsetSize', SS)` sets the number of features considered in every subset. Default is `20`.

`randfeatures(..., 'PoolSize', PS)` sets the targeted number of accepted subsets for the final pool. Default is `1000`.

`randfeatures(..., 'NumberOfIndices', N)` sets the number of output indices in `IDX`. Default is the same as the number of features.

`randfeatures(..., 'CrossNorm', CN)` applies independent normalization across the observations for every feature. Cross-normalization ensures comparability among different features, although it is not always necessary because the selected classifier properties might already account for this. Options are

| Value | Description |
| --- | --- |
| `'none'` (default) | Intensities are not cross-normalized. |
| `'meanvar'` | `x_new = (x - mean(x))/std(x)` |
| `'softmax'` | `x_new = (1+exp((mean(x)-x)/std(x)))^-1` |
| `'minmax'` | `x_new = (x - min(x))/(max(x)-min(x))` |
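To make the three formulas concrete, here is a minimal sketch that applies them literally to a single feature in base MATLAB. The sample vector and the variable names are hypothetical, and `std` is used with its MATLAB default (normalization by N-1), which is an assumption about what the formulas intend.

```matlab
% Hypothetical intensities of one feature across four observations.
x = [2 4 6 8];

% 'meanvar': zero mean and unit standard deviation.
xMeanvar = (x - mean(x)) ./ std(x);

% 'softmax': logistic squashing of the standardized values into (0, 1).
xSoftmax = 1 ./ (1 + exp((mean(x) - x) ./ std(x)));

% 'minmax': linear rescaling so the smallest value is 0 and the largest is 1.
xMinmax = (x - min(x)) ./ (max(x) - min(x));
```

For a full data matrix with features in rows, the same expressions would be applied to every row separately, since the normalization runs across the observations of each feature.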

`randfeatures(..., 'Verbose', VerboseValue)`, when `VerboseValue` is `false`, turns off verbosity. Default is `true`.

## Examples

Find a reduced set of genes that is sufficient for classification of all the cancer types in the t-matrix NCI60 data set.

Load sample data.

`load NCI60tmatrix`

Select features.

`I = randfeatures(X,GROUP,'SubsetSize',15,'Classifier','da');`

Test features with a linear discriminant classifier.

`C = classify(X(I(1:25),:)',X(I(1:25),:)',GROUP);`

`cp = classperf(GROUP,C);`

`cp.CorrectRate`

`ans = 1`

## References

[1] Li, L., Umbach, D.M., Terry, P., and Taylor, J.A. (2004). Application of the GA/KNN method to SELDI proteomics data. Bioinformatics. 20 (10), 1638-1640.

[2] Liu, H., Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers.

[3] Ross, D.T. et al. (2000). Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines. Nature Genetics. 24 (3), 227-235.

## Version History

**Introduced before R2006a**

## See Also

`classperf` | `crossvalind` | `rankfeatures` | `classify` | `sequentialfs`