Cluster-Size Thresholding in BrainVoyager

The paper "Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates" by Eklund and colleagues received a lot of attention this summer for pointing out that fMRI software packages do not correctly perform multiple comparisons correction of statistical maps when using cluster-extent based thresholding for group analyses. The paper re-analyzes a massive set of resting-state datasets recorded at different sites. Because resting-state data can be assumed to contain no systematic task-related activity changes, they can serve as realizations under the null hypothesis of no effect. While all tested software produced nominal false positive rates for voxel-wise (univariate) tests, cluster-extent thresholding tools based on Gaussian random field theory (SPM, FSL) and Monte Carlo simulation (AFNI) produced elevated false positive rates.

As described in the OHBM blog post "Keep calm and scan on", the negative "hype" resulting from the PNAS paper originated mainly from the rather alarming sentence "These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results". The sentence was picked up by the press, probably because it is prominently placed in the "Significance" box of the original paper (see below). It is somewhat unfortunate that this sentence refers to a particularly simplified thresholding heuristic, called "ad hoc cluster inference" in the text (and indeed used in some fMRI studies), but not to the cluster-extent thresholding tools that were thoroughly analysed in the main part of the paper. The ad hoc clustering approach, which leads to "up to 70% false positives", simply accepts all clusters larger than 80 mm3 (e.g. only 3 contiguous VTC/VMP voxels of 3 mm x 3 mm x 3 mm, or only 10 contiguous 2 mm x 2 mm x 2 mm voxels) in combination with a single-voxel threshold of 0.001. The initial voxel-level threshold is referred to as the cluster-defining threshold (CDT) in the paper, and its value strongly influences the false positive rates produced by the tested methods. While most of the investigated cluster-extent thresholding tools indeed suffer from inflated false positive rates, the reported values were much less severe than 70%: with a CDT of 0.001, false positive rates were in the range of 5%-20%, while with a CDT of 0.01 they were in the range of 20%-40%. It is noteworthy, though expected, that permutation testing did not inflate false positive rates, i.e. rates stayed close to the nominal 5% value controlling the family-wise error (FWE).
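The voxel counts mentioned above follow directly from the voxel volumes; a quick sanity check of the arithmetic (plain Python, nothing BrainVoyager-specific):

```python
# "Ad hoc" cluster-size criterion: clusters larger than 80 mm^3.
threshold_mm3 = 80

# 3 x 3 x 3 mm voxels have a volume of 27 mm^3 each,
# so 3 contiguous voxels (81 mm^3) already exceed the criterion.
assert 3 * 3**3 > threshold_mm3

# 2 x 2 x 2 mm voxels have a volume of 8 mm^3 each,
# so 10 contiguous voxels are needed to reach 80 mm^3.
assert 10 * 2**3 >= threshold_mm3

print(3 * 3**3, 10 * 2**3)  # 81 80
```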

Despite publication of a correction, it is unfortunate that the paper led to overly negative press headlines that have likely done some damage to the reputation of our field. On the positive side, the paper has provided important insights into the complex spatial and temporal dependencies in real data, indicating that synthetic test data is not enough to evaluate cluster-extent thresholding methods because it may miss subtle spatio-temporal properties of the data. In my view, this observation and the discussion of likely causes of inflated false positive rates (heavy-tailed spatial correlations and spatially varying smoothness) are the most important contributions of the paper.

Implications for BrainVoyager

Some of us (besides myself, especially Armin Heinecke, head of support) were contacted by users asking whether BrainVoyager is also affected by the cluster thresholding issue reported in the Eklund et al. paper and what we would advise for proper cluster-size thresholding. Since the main cause of the inflated false positive rates stems from complex, hitherto unknown spatial and temporal dependencies in the data, BrainVoyager is also affected by this issue, but likely only to a modest extent.

Gaussian random-field approach. BrainVoyager does not use the Gaussian random-field (GRF) approach for cluster-size thresholding, so results related to this method are not applicable to our software. My main argument against GRF has always been that fMRI data needs to be heavily smoothed just to meet the criteria for its use, i.e. to fulfil the Gaussian spatial error model. In my opinion, unnecessary spatial smoothing opposes one of the main strengths of fMRI, namely the ability to scan the brain with high spatial resolution. As discussed in the Eklund et al. paper, it is also no surprise to me that despite strong spatial smoothing the subtler requirements of GRF were not met (the spatial smoothness of the fMRI signal must be constant over the brain, and the spatial autocorrelation function must have a squared exponential shape). Even for whole-brain group studies, smoothing should not primarily serve GRF but should be kept as low as possible, e.g. by using advanced volumetric or cortex-based brain normalization tools to enhance spatial correspondence between subjects. Using a priori knowledge, masked/ROI-based analyses should also be considered.

False-discovery rate. BrainVoyager’s main multiple comparisons thresholding tool is the false-discovery rate (FDR) approach, which is automatically employed as the default thresholding method. While it does not explicitly correct for the extent of clusters, it is worth pointing out that FDR adapts to the amount of “active” voxels in the data (based on ranking voxel-wise p values), and its application should be sufficient in most cases. FDR (and its more recent variants) is attractive because it does not focus on FWE, thereby better avoiding the equally important error of missing clusters when real effects actually exist. FDR does not, however, handle well cases in which only one or a few rather small clusters are observed in statistical maps, since the calculated thresholds then approach the FWE Bonferroni threshold.
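The adaptive behavior of FDR can be illustrated with the classic Benjamini-Hochberg procedure (a minimal sketch of the general method based on ranked p values, not BrainVoyager's actual implementation):

```python
import numpy as np

def fdr_threshold(p_values, q=0.05):
    """Benjamini-Hochberg: largest p value surviving FDR control at level q.

    Sort the voxel-wise p values and find the largest rank k with
    p_(k) <= (k / m) * q; all voxels with p <= p_(k) are declared active.
    Returns 0.0 if no voxel survives.
    """
    p = np.sort(np.asarray(p_values, dtype=float))
    m = p.size
    ranks = np.arange(1, m + 1)
    passed = p <= ranks / m * q
    if not passed.any():
        return 0.0
    return float(p[passed][-1])  # largest passing p value

# Many small p values yield a lenient threshold; with only a few small
# p values the threshold approaches the Bonferroni value q / m.
```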

Monte Carlo simulation. As an explicit cluster-extent thresholding tool, BrainVoyager uses Monte Carlo simulations in a similar way to AFNI, extending the Forman et al. (1995) approach from 2D to 3D image space (Goebel et al., 2006). This approach has a good rationale because it uses randomness to estimate a cluster-extent threshold. More specifically, images of normally distributed random numbers are generated and spatially smoothed so that the correlations between neighboring voxels match the average spatial correlation measured in the original statistical map. The spatial correlation value (expressed as the full-width-at-half-maximum (FWHM) of a Gaussian kernel) can also be specified explicitly by the user, e.g. if measured with custom tools. Based on a user-provided initial voxel threshold (i.e. the CDT), the generated random maps are optionally masked and thresholded such that approximately the theoretical number of false positive voxels is activated in each random map. Finally, a cluster identification step counts how often clusters of a certain size are observed in each generated image. Given a sufficient number of random maps (1000 or more recommended), this procedure results in a cluster-extent frequency table from which a cluster-extent threshold can be determined that protects with an FWE threshold of p=0.05, i.e. the likelihood that a cluster of the determined size (or larger) is observed in the brain under the null hypothesis of no effect is p<0.05.
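The steps above can be sketched in a simplified toy version (assuming a single global FWHM and no mask; function and parameter names are illustrative, not taken from BrainVoyager's ClusterThresh code):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, label
from scipy.stats import norm

def cluster_size_threshold(shape, fwhm_vox, cdt_p=0.001, n_iter=1000,
                           alpha=0.05, seed=0):
    """Estimate a cluster-extent threshold (in voxels) by Monte Carlo simulation.

    Generates smoothed Gaussian noise volumes, thresholds each at the
    cluster-defining threshold (CDT), and records the maximum cluster size.
    The (1 - alpha) quantile of these maxima is the FWE-controlling extent.
    """
    rng = np.random.default_rng(seed)
    # Convert FWHM (in voxels) to the sigma of a Gaussian smoothing kernel.
    sigma = fwhm_vox / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    z_cdt = norm.isf(cdt_p)  # one-sided z value matching the CDT
    max_sizes = np.empty(n_iter, dtype=int)
    for i in range(n_iter):
        vol = gaussian_filter(rng.standard_normal(shape), sigma)
        vol /= vol.std()  # re-normalize so the CDT keeps its nominal meaning
        clusters, n = label(vol > z_cdt)
        sizes = np.bincount(clusters.ravel())[1:]  # drop background label 0
        max_sizes[i] = sizes.max() if n > 0 else 0
    return int(np.quantile(max_sizes, 1.0 - alpha))
```

Note how a stricter CDT shrinks the clusters that arise by chance and therefore lowers the required cluster extent; this is the same CDT dependence discussed throughout the paper.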

While BrainVoyager’s cluster-extent thresholding method was not explicitly tested in the paper, results should be similar to those of AFNI, which also uses the Monte Carlo simulation approach. Unfortunately, there was a bug in the respective AFNI routine, but even with corrected software, inflated false positive rates were observed. In light of the analysis in the Eklund et al. paper, this unexpected result might be partly related to the simplifying assumption that spatial smoothness is constant across the brain, i.e. that the same spatial Gaussian smoothing kernel applies at each voxel during image generation. As the Eklund et al. paper points out, the AFNI approach may additionally have underestimated group smoothness because it estimates smoothness on the basis of individual subject data. Since BrainVoyager estimates spatial smoothness directly on the respective group statistical map, it should be less affected than AFNI. As with the tested GRF approaches, the value of the initial cluster-defining threshold also has a large effect on the resulting false positive rates, with values ranging from 27.1% to 31% for a CDT of p=0.01 and from 8.6% to 11.5% for a CDT of p=0.001.

Advice and improvements. While BrainVoyager’s Monte Carlo simulation should be less affected than that of AFNI, the simplifying assumption of constant spatial smoothness might also lead to modestly inflated false positive rates in BrainVoyager’s ClusterThresh plugin (more tests are needed to quantify this). Future research will focus on better estimates of (local) spatial correlation properties to generate more realistic random data. At present, it is advised to start with a CDT of p=0.001, since the issue of inflated false positive rates depends strongly on the initial voxel threshold. The Help document of the ClusterThresh plugin previously recommended a CDT “in the range of 0.01 - 0.00001”. As a consequence of the results of the Eklund paper, the text will be changed to “a CDT of 0.001 or smaller is recommended” in the upcoming release (BV 20.4); if a user supplies a less strict CDT, a warning will be issued that the value is too weak and that an initial threshold of p=0.001 will be used instead.
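The announced warning behavior amounts to a simple clamping rule, which could look roughly like this (a hypothetical sketch only; the actual ClusterThresh plugin code may differ):

```python
RECOMMENDED_CDT = 0.001  # strictest CDT the plugin will announce as acceptable

def validate_cdt(user_cdt):
    """Clamp a too-lenient cluster-defining threshold to the recommended value.

    Returns the CDT to use and a warning message (None if the value is fine).
    Note: a *larger* p value means a *weaker* threshold.
    """
    if user_cdt > RECOMMENDED_CDT:
        warning = (f"CDT of p={user_cdt} is too weak; "
                   f"using p={RECOMMENDED_CDT} instead.")
        return RECOMMENDED_CDT, warning
    return user_cdt, None
```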

It is reassuring that most BrainVoyager users seem to employ a CDT of 0.001 or stricter, as can be seen in the table at the end of a blog post by Tom Nichols, one of the authors of the Eklund et al. paper. Based on data from a survey of 814 fMRI studies published in 2010 and 2011 (Woo et al., 2014), about 23% of BrainVoyager studies used a CDT of 0.01 or weaker, which compares favourably to the numbers for most other software packages. The numbers are especially bad for FSL, which provides a CDT of 0.01 as its default, a setting that most users seem to accept. With the changes to the ClusterThresh plugin described above, the situation for BrainVoyager should further improve in future publications.

The Eklund et al. paper also showed that permutation testing is the safest of all methods, producing false positive rates very close to the nominal value of p=0.05 in all cases. While it is also based on randomness, like the Monte Carlo simulation approach, permutation testing does not require assumptions about the spatial correlations in the data, since it uses the original data with all its spatio-temporal dependencies and only permutes the condition labels to generate a null distribution of cluster sizes. The excellent results of this nonparametric method are motivation enough to add permutation testing as an additional possibility for (cluster) thresholding in one of the next minor releases of BrainVoyager (scheduled for BV 20.6). While simple to implement, permutation testing is compute-intensive, and I will thus provide a fast implementation that exploits multiple CPU cores (using multi-threaded C++ code) and, in a later release, also multiple GPU cores (using OpenCL).
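The logic of such a permutation test can be sketched for a simple two-group comparison (an illustrative sketch with made-up function names, not the implementation planned for BrainVoyager): the group labels are shuffled, a statistical map is recomputed for each shuffle, and the maximum cluster size under the CDT is recorded to build the null distribution.

```python
import numpy as np
from scipy.ndimage import label

def t_map(a, b):
    """Voxel-wise two-sample t statistic (equal-variance pooled estimate)."""
    na, nb = a.shape[0], b.shape[0]
    va, vb = a.var(0, ddof=1), b.var(0, ddof=1)
    sp = np.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (a.mean(0) - b.mean(0)) / (sp * np.sqrt(1 / na + 1 / nb))

def permutation_cluster_threshold(maps_a, maps_b, cdt_t=3.1, n_perm=1000,
                                  alpha=0.05, seed=0):
    """Cluster-extent threshold from a permutation null distribution.

    maps_a, maps_b: contrast maps of shape (n_subjects, x, y, z) per group.
    Group labels are permuted across subjects; for each permutation the
    t map is thresholded at the CDT (here on the t scale) and the largest
    cluster size recorded. The (1 - alpha) quantile of these maxima is the
    FWE-controlling cluster extent.
    """
    rng = np.random.default_rng(seed)
    data = np.concatenate([maps_a, maps_b])
    n_a, n = maps_a.shape[0], data.shape[0]
    max_sizes = np.empty(n_perm, dtype=int)
    for i in range(n_perm):
        idx = rng.permutation(n)  # shuffle the condition labels
        t = t_map(data[idx[:n_a]], data[idx[n_a:]])
        clusters, k = label(np.abs(t) > cdt_t)
        sizes = np.bincount(clusters.ravel())[1:]
        max_sizes[i] = sizes.max() if k > 0 else 0
    return int(np.quantile(max_sizes, 1.0 - alpha))
```

Because the permuted maps are the measured data themselves, all spatial and temporal dependencies are preserved automatically, which is exactly why no smoothness model is needed.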

I want to conclude with a more general piece of advice: it is in my opinion important to inspect the time courses of detected clusters (at least a plot of a single time course averaged across subjects, or better a plot showing cluster time courses for each subject) in order to cross-check that they exhibit the expected BOLD-like shapes. This avoids blindly accepting clusters that pass statistical tests because of various complex noise sources. I have the feeling that discussions about double-dipping have discouraged researchers from inspecting their data, which is in my opinion an unfortunate trend. Besides testing the methods used in fMRI software packages with real data, it is of equal importance to teach good scientific practice and an awareness that complex data and processing pipelines require careful inspection.