Discriminant Function

Discriminant function analysis (DFA) examines the separation between two groups of observations. The groups are known a priori. Unlike canonical variate analysis, DFA considers only two groups at a time. The implementation of DFA in MorphoJ can accept more than two groups, but it then performs a separate analysis for each pair of groups, or only for selected pairs.

Background

DFA is one of the classical techniques of multivariate statistics, and details can be found in most textbooks of multivariate statistics (e.g. Timm 2002). The implementation in MorphoJ uses Fisher's classification rule, which sets the cut-off point at a value of zero.
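As an illustration of a classification rule with the cut-off at zero, the following Python sketch computes a two-group linear discriminant function with NumPy. It is not MorphoJ's code: the function names, the use of a pseudoinverse of the pooled covariance matrix, and the sign convention (positive scores for the first group) are assumptions made for this example.

```python
import numpy as np

def pooled_covariance(x1, x2):
    """Pooled within-group covariance matrix (rows are observations)."""
    n1, n2 = len(x1), len(x2)
    s1 = np.cov(x1, rowvar=False)
    s2 = np.cov(x2, rowvar=False)
    return ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)

def fit_discriminant(x1, x2):
    """Return (weights, offset) of a linear discriminant function.

    The offset places the midpoint between the two group means at a
    score of zero, so zero serves as the cut-off point.
    """
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    # A pseudoinverse is used because the covariance matrix may be singular.
    weights = np.linalg.pinv(pooled_covariance(x1, x2)) @ (m1 - m2)
    offset = weights @ (m1 + m2) / 2
    return weights, offset

def discriminant_scores(x, weights, offset):
    """Scores above zero are assigned to group 1, below zero to group 2
    (sign convention assumed for this sketch)."""
    return x @ weights - offset

# Hypothetical toy data: two groups that differ along the first variable.
group1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
group2 = group1 + np.array([3.0, 0.0])

weights, offset = fit_discriminant(group1, group2)
scores1 = discriminant_scores(group1, weights, offset)
scores2 = discriminant_scores(group2, weights, offset)
print(scores1)  # all positive for this toy data
print(scores2)  # all negative for this toy data
```

With these toy data, every observation falls on the correct side of zero, which is the optimistic picture that cross-validation is meant to check.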

Particularly if the sample size is small relative to the number of dimensions (for example, with many landmarks), the discriminant function tends to overestimate the separation between groups. A good separation of the groups on its own therefore does not mean that observations can be reliably classified.

The reliability of the discrimination is assessed by leave-one-out cross-validation (e.g. Lachenbruch 1967). Although efforts were made to optimize performance (e.g. by 'down-dating' rather than recalculating the generalized inverses at each iteration; Lachenbruch 1967; Dunne and Stone 1993), cross-validation is computationally intensive, particularly if the dimensionality of the data is high (many landmarks). Therefore, DFA is most useful for comparisons of specific groups, whereas canonical variate analysis may be more useful for general analysis of group structure in a dataset.
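The idea of leave-one-out cross-validation can be sketched in Python as follows. This is a deliberately naive version that refits the discriminant function from scratch for every observation, so it does not use the down-dating shortcut mentioned above; all function names and the sign convention (positive scores for the first group) are assumptions for illustration only.

```python
import numpy as np

def fit_discriminant(x1, x2):
    """Linear discriminant (weights, offset) with the cut-off at zero."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    weights = np.linalg.pinv(pooled) @ (m1 - m2)
    return weights, weights @ (m1 + m2) / 2

def leave_one_out_scores(x1, x2):
    """Score each observation with a function fitted without it."""
    cv1 = np.empty(len(x1))
    cv2 = np.empty(len(x2))
    for i in range(len(x1)):
        # Refit without observation i of group 1, then score that observation.
        w, c = fit_discriminant(np.delete(x1, i, axis=0), x2)
        cv1[i] = x1[i] @ w - c
    for j in range(len(x2)):
        # Refit without observation j of group 2, then score that observation.
        w, c = fit_discriminant(x1, np.delete(x2, j, axis=0))
        cv2[j] = x2[j] @ w - c
    return cv1, cv2

# Hypothetical toy data: two clearly separated groups.
group1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
group2 = group1 + np.array([3.0, 0.0])

cv1, cv2 = leave_one_out_scores(group1, group2)
# Counts of cross-validated misclassifications per group.
errors1 = int(np.sum(cv1 <= 0))
errors2 = int(np.sum(cv2 >= 0))
print(errors1, errors2)
```

Because each observation is scored by a function that was fitted without it, the resulting error counts are a more honest estimate of classification accuracy than those from the complete data. The cost is one refit per observation, which is why this step becomes slow for large or high-dimensional datasets.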

The analysis automatically includes a parametric T-square test for the difference between group means. At the user's request, this procedure can also include a permutation test (using Procrustes distance and the T-square statistic). Note that the permutation test involves a substantial computational effort and will therefore take some time.
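A permutation test of this kind can be sketched in Python as follows. The sketch assumes that the observations are Procrustes-aligned coordinates stored as rows, so that the distance between group means is taken as the Euclidean distance between the mean vectors. The function names, the default number of rounds, and the way the P-value is counted (the observed grouping counts as one permutation) are assumptions for this example, not a description of MorphoJ's internals.

```python
import numpy as np

def t_square(x1, x2):
    """Two-sample T-square statistic using a pooled covariance matrix."""
    n1, n2 = len(x1), len(x2)
    diff = x1.mean(axis=0) - x2.mean(axis=0)
    pooled = ((n1 - 1) * np.cov(x1, rowvar=False)
              + (n2 - 1) * np.cov(x2, rowvar=False)) / (n1 + n2 - 2)
    # Squared Mahalanobis distance, via a pseudoinverse in case of singularity.
    d2 = diff @ np.linalg.pinv(pooled) @ diff
    return n1 * n2 / (n1 + n2) * d2

def mean_distance(x1, x2):
    """Euclidean distance between the two group means."""
    return np.linalg.norm(x1.mean(axis=0) - x2.mean(axis=0))

def permutation_p_value(x1, x2, statistic, rounds=1000, seed=0):
    """P-value for the null hypothesis of equal group means.

    Group labels are shuffled at random; the P-value is the proportion of
    rounds in which the statistic is at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    combined = np.vstack([x1, x2])
    n1 = len(x1)
    observed = statistic(x1, x2)
    hits = 0
    for _ in range(rounds):
        order = rng.permutation(len(combined))
        if statistic(combined[order[:n1]], combined[order[n1:]]) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)

# Hypothetical toy data: two groups shifted along the first variable.
group1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
group2 = group1 + np.array([3.0, 0.0])

p_distance = permutation_p_value(group1, group2, mean_distance, rounds=200)
p_t_square = permutation_p_value(group1, group2, t_square, rounds=200)
print(p_distance, p_t_square)
```

Each round recomputes the statistic on reshuffled data, so the running time grows in proportion to the number of permutation rounds requested.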

Requesting a DFA

To request a DFA, select Discriminant Function from the Comparison menu.

A dialog box like the following will appear:

At the top of the dialog box, there is a text field for entering a name for the analysis. Below this, there are two drop-down menus for selecting the dataset and the data type to be used.

The box to the left contains a list of the classifiers in the selected dataset. This list can be used to select one or more classifiers to be used to define the groups for the analysis (in the screen shot, two classifiers named "genotype" and "sex" are selected).

The box to the right is for selecting the specific pairs of groups that are to be included in the analysis. The box contains a list of all available pairs of groups, depending on the selection of classifiers (left box). For each pair, the two groups are separated by " -- ". If more than one classifier is used for defining groups, the values of the classifiers are separated by commas in the group designations. In the screen shot, the pair "s1, f" and "s1, m" is selected (for the genotype "s1", the sexes "f" and "m" will be compared).
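The naming scheme for pairs of groups can be illustrated with a short Python sketch. It builds group designations by joining classifier values with commas and then lists all pairs separated by " -- ". The function names and the alphabetical ordering of the groups are assumptions for this example; the order in which MorphoJ lists the pairs may differ.

```python
from itertools import combinations

def group_label(classifier_values):
    """Join the values of the selected classifiers, e.g. ("s1", "f") -> "s1, f"."""
    return ", ".join(classifier_values)

def pair_labels(observations):
    """List all pairs of groups that occur in the data.

    `observations` holds one tuple of classifier values per observation.
    """
    groups = sorted({group_label(values) for values in observations})
    return [f"{a} -- {b}" for a, b in combinations(groups, 2)]

# Hypothetical classifier values (genotype, sex) for four observations.
observations = [("s1", "f"), ("s1", "m"), ("s2", "f"), ("s1", "f")]
labels = pair_labels(observations)
for label in labels:
    print(label)
```

The number of pairs grows quickly with the number of groups (k groups give k(k-1)/2 pairs), which is one reason to select only the pairs of interest.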

If the user wants to include all pairs of groups, the check box under the left box can be used to select all pairs simultaneously.

There is also the option to include a permutation test for the null hypothesis of equal group means. If the check box is selected, the text field for specifying the number of permutation rounds is activated.

Graphical output

The graphical output of the procedure is organized into a separate tab for each pair of groups for which a DFA is conducted. Each of these tabs contains three graphs: a diagram with the shape difference between the two group means, a histogram with the values of the discriminant scores for the original data and another histogram with the scores for the leave-one-out cross-validation.

The difference between shapes is shown as a change from the first to the second group; for instance, if the string describing the pair of groups is "s1, f -- s1, m" as in the example above, the shape change is from the group mean of "s1, f" to that of "s1, m" (you can reverse the direction by setting the scale factor to a negative value).

Text output

The text output provides information about the groups in the analysis and about the shape difference between them: Procrustes distance, Mahalanobis distance, the T-square statistic and associated parametric P-value. If the user has selected to include the permutation test, the permutation P-values for the T-square statistic and Procrustes distance are also given.

The output also contains two classification/misclassification tables: one for the discriminant scores of a DFA with the complete data and another one that contains the results for the cross-validation. To assess the accuracy of classification, the latter table is to be preferred.

Output dataset

The scores from the DFA are contained in a dataset appended to the DFA icon in the Project Tree. The dataset contains separate data matrices with the discriminant scores for the complete data and the scores from the cross-validation procedure. For each comparison in the analysis, each of these data matrices has a separate column. For observations from groups other than the two groups considered in a given DFA comparison, the entries in the data matrices are "NaN" (for "not a number", i.e. a missing value). With more than two groups in the original dataset, much of the data matrices will be filled with such entries.

These values can be exported, for instance, to identify which observations were misclassified (remember, values are scaled so that the cut-off value is at zero).
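As a sketch of how exported scores might be processed, the following Python function flags misclassified observations in one column of scores. It assumes that the scores and group labels have already been read into lists; which of the two groups lies on the positive side of the cut-off is passed in explicitly, because that sign convention is an assumption to be checked against the actual output. The function name and data are hypothetical.

```python
import math

def misclassified(scores, groups, positive_group):
    """Return the indices of observations on the wrong side of zero.

    Entries that are NaN belong to groups outside this comparison
    and are skipped.
    """
    wrong = []
    for index, (score, group) in enumerate(zip(scores, groups)):
        if math.isnan(score):
            continue  # observation is not part of this pair of groups
        assigned_positive = score > 0
        if assigned_positive != (group == positive_group):
            wrong.append(index)
    return wrong

# Hypothetical exported column for the comparison "a -- b".
scores = [1.2, -0.4, float("nan"), -2.0, 0.3]
groups = ["a", "a", "c", "b", "b"]
wrong = misclassified(scores, groups, positive_group="a")
print(wrong)  # [1, 4]
```

Applying the same function to the cross-validation scores rather than the scores for the complete data gives the more reliable list of misclassified observations.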

References

Dunne, T. T., and M. Stone. 1993. Downdating the Moore-Penrose generalized inverse for cross-validation of centred least squares prediction. J. R. Statist. Soc. B 55:369–375.

Lachenbruch, P. A. 1967. An almost unbiased method of obtaining confidence intervals for the probability of misclassification in discriminant analysis. Biometrics 23:639–645.

Timm, N. H. 2002. Applied multivariate analysis. New York, Springer.