S6: Data-Driven (ML) Analysis

This notebook has three sections:

Multimodal-source-decomposition methods on simulated fNIRS-EEG data
ICA Source Extraction
Single Trial Classification

[1]:

# This cells setups the environment when executed in Google Colab.
try:
    import google.colab
    !curl -s https://raw.githubusercontent.com/ibs-lab/cedalion/dev/scripts/colab_setup.py -o colab_setup.py
    # Select branch with --branch "branch name" (default is "dev")
    %run colab_setup.py
except ImportError:
    pass

[2]:

import cedalion
import cedalion.data
import cedalion.sigproc.quality as quality
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import xarray as xr
from cedalion import units
import cedalion.sigproc.physio as physio
import cedalion.sigproc.motion as motion
from cedalion.sigdecomp.unimodal import ICA_ERBM
from cedalion.sigproc.frequency import sampling_rate
from cedalion.sim.datasets.synthetic_fnirs_eeg import (
    BimodalToyDataSimulation,
    standardize,
)

# Limit display to show 3 items at each edge, then "..."
np.set_printoptions(threshold=20, edgeitems=2)
xr.set_options(display_expand_data=False)

# To reaload packages and modules automatically.
%load_ext autoreload
%autoreload 2

ICA Source Extraction

In this notebook we will investigate an example on how Independent Component Analysis by Entropy Rate Bound Minimization (ICA-ERBM) can be used to extract physiological sources from fNIRS data. For this example, we will use a finger tapping dataset.

Let \(X \in \mathbb{R}^{N \times T }\) denote the finger tapping data with \(N\) channels and \(T\) time points. We assume that the data \(X\) consists of unknown independent sources \(S \in \mathbb{R}^{N \times T }\) that were mixed through an unknown mixing matrix \(A \in \mathbb{R}^{N \times N}\), such that

\[X = A \cdot S.\]

Source reconstruction in ICA-ERBM is done by minimizing the mutual information rate of the estimated sources. A demixing matrix \(W\) is determined and the estimated sources \(\hat S \in \mathbb{R}^{N \times T }\) can be computed as

\[\hat S = W \cdot X.\]

Among the extracted sources, we will identify the ones that correspond to the PPG and Mayer Wave signals.

Loading Raw Finger Tapping Data

[32]:

# Load finger tapping data set
finger_tapping_data = cedalion.data.get_fingertappingDOT()

# Extract the fnirs recording
fnirs_data = finger_tapping_data['amp']

# Plot three channels of the fnirs data
fig, ax = plt.subplots(3, 1, sharex=True, figsize=(10, 5))
for i, ch in enumerate(["S1D1", "S1D2", "S7D9"]):
    ax[i].plot(fnirs_data.time, fnirs_data.sel(channel=ch, wavelength="760"), "r-", label="760nm")
    ax[i].plot(fnirs_data.time, fnirs_data.sel(channel=ch, wavelength="850"), "b-", label="850nm")
    ax[i].set_title(f"Channel {ch}")

ax[0].legend()
ax[2].set_xlim(0,60)
ax[2].set_xlabel("time / s")
plt.tight_layout()

../../_images/examples_tutorial_6_data_driven_analysis_68_0.png

Conversion to Optical Density

[33]:

# Convert to Optical Density (OD)
fnirs_data_od = cedalion.nirs.cw.int2od(fnirs_data)

Channel Quality Assessment and Pruning

The Scalp Coupling Index (SCI) and Peak Spectral Power (PSP) are used for quality assessment. We compute SCI and PSP for each channel, and remove channels with less than 75% of clean time.

[34]:

# Calculate masks for SCI and PSP quality metrics
window_length = 5 * units.s
sci_thresh = 0.75
psp_thresh = 0.1
sci_psp_percentage_thresh = 0.75

sci, sci_mask = quality.sci(fnirs_data_od, window_length, sci_thresh)
psp, psp_mask = quality.psp(fnirs_data_od, window_length, psp_thresh)
sci_x_psp_mask = sci_mask & psp_mask
perc_time_clean = sci_x_psp_mask.sum(dim="time") / len(sci.time)
sci_psp_mask = [perc_time_clean >= sci_psp_percentage_thresh]

# Prune channels that do not pass the quality test
fnirs_data_pruned, drop_list = quality.prune_ch(fnirs_data_od, sci_psp_mask, "all")

# Display pruned channels
print(f"List of pruned channels: {drop_list}  ({len(drop_list)})")

List of pruned channels: ['S13D26']  (1)

High-pass filter

[35]:

# Filter the data
# fmax = 0 is used to indicate high-pass filtering
fnirs_data_filtered = fnirs_data_pruned.cd.freq_filter(fmin= 0.01, fmax= 0, butter_order=4)

# Store sampling rate
fnirs_data_samplingrate = sampling_rate(fnirs_data_pruned.time).magnitude

# Plot the filtered data
fig, ax = plt.subplots(3, 1, sharex=True, figsize=(10, 5))
for i, ch in enumerate(["S1D1", "S1D2", "S7D9"]):
    ax[i].plot(fnirs_data_filtered.time, fnirs_data_filtered.sel(channel=ch, wavelength="760"), "r-", label="760nm")
    ax[i].plot(fnirs_data_filtered.time, fnirs_data_filtered.sel(channel=ch, wavelength="850"), "b-", label="850nm")
    ax[i].set_title(f"Channel {ch}")

ax[0].legend()
ax[2].set_xlim(0,60)
ax[2].set_label("time / s")
plt.tight_layout()

../../_images/examples_tutorial_6_data_driven_analysis_74_0.png

Select Channels and Time Slice for ICA

The entire finger tapping dataset was recorded over 30 minutes and contains 99 channels after pruning. Unfortunately, these dimensions result in a long runtime for ICA-ERBM. For this reason, we will use only a subset of the channels and a 10-minute slice of the selected channels. Despite the longer runtime, this example is also applicable to the full dataset.

[36]:

# Choose the best 30 channels based on the percentage of time clean
id_best_channels = np.argsort(perc_time_clean).values[-30:]
best_channels = fnirs_data['channel'][id_best_channels]

# Extract the best channels from the filtered data
fnirs_best_channels = fnirs_data_filtered.sel(channel = best_channels)

# Select a 10 min interval
duration = 10 * 60
buffer = 60
fnirs_best_channels = fnirs_best_channels.sel(time=slice(buffer, buffer + duration))

# Select the first wavelength
X = fnirs_best_channels.values[:, 0, :]
print(f"Shape of data for ICA-ERBM: {X.shape}")

Shape of data for ICA-ERBM: (30, 2616)

/opt/miniconda3/envs/cedalion_250922/lib/python3.11/site-packages/xarray/core/variable.py:315: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  data = np.asarray(data)

Apply ICA-ERBM

ICA-ERBM is applied to the selected channels. For the autoregressive filter used in ICA-ERBM, we use the default parameter \(p = 11\). The source estimates are then computed as \(\hat S = W \cdot X\).

[37]:

# Set filter length
p = 11

# Apply ICA-ERBM to the data
W = ICA_ERBM(X, p)

# Compute separated source as S = W * X
sources = W.dot(X)

[38]:

# Apply z-score normalization to the sources
sources_zscore = sp.stats.zscore(sources, axis=0)

Selection of PPG Source

From the reconstructed sources, we now want to identify those that are most similar to a PPG signal. To this end, we compare the frequency band in which the PPG signal is expected to have large amplitudes with the surrounding frequency bands. The sources with the highest contrast are selected. The PPG signal is expected to exhibit high amplitudes in a frequency band around 1 Hz.

[39]:

# Compute the frequency spectrum for each source
psd_sources = np.abs(np.fft.fft(sources, axis = 1))

# The frequencies corresponding to the spectrum
freqs = np.fft.fftfreq(sources.shape[1], 1/fnirs_data_samplingrate)

# Choose the indices of frequencies that are in the ppg band (0.75 - 1.25 Hz)
ppg_band_ind = np.logical_and(freqs >= 0.75, freqs <= 1.25)

# Choose the indices of frequencies that are in the band (0 - 0.75 Hz and 1.25 - 3.0 Hz)
comp_band =  np.logical_and(freqs >= 0, freqs < 0.75) +  np.logical_and(freqs > 1.25 , freqs <= 3.0)

# Compute the quotient of the  ppg band and the contrast band
psd_quotient = np.sum(psd_sources[:, ppg_band_ind], axis = 1 ) / np.sum(psd_sources[:, comp_band], axis = 1 )

# Choose the indices of the sources with the highest contrast
max_contrast_index = np.argsort(psd_quotient, axis = 0 )[-5:]

# Reverse the order of the indices to have the highest contrast first
max_contrast_index = max_contrast_index[::-1]

# Choose the sources with the highest contrast
ppg_sources = sources_zscore[max_contrast_index, :]

[40]:

# Plot the sources with the highest contrast and their frequency spectrum
fig, ax = plt.subplots(ppg_sources.shape[0], 2, figsize=(12, 2 * ppg_sources.shape[0]))

for i in range(ppg_sources.shape[0]):
    # Plot the source for 60 seconds
    samples = int(fnirs_data_samplingrate * 60 * 1)
    ax[i, 0].plot( 1/fnirs_data_samplingrate * np.arange(0,samples), ppg_sources[i, :samples], label=f"Source {max_contrast_index[i]+1}")
    ax[i, 0].set_title(f"Source {max_contrast_index[i] + 1}", fontsize=10)

    # Plot frequency spectrum of the source
    psd = np.abs(np.fft.rfft(ppg_sources[i, :])) ** 2
    x_freqs = np.fft.rfftfreq(ppg_sources.shape[1], 1 / fnirs_data_samplingrate)
    ax[i, 1].plot(x_freqs, psd, label="Contrast Band")
    ax[i, 1].set_title(f"Frequency Spectrum of Source {max_contrast_index[i]+1}, Contrast Quotient: {psd_quotient[max_contrast_index[i]]:.2f}",  fontsize=10)

    # Highlight the PPG band in the frequency spectrum
    highlight_ppg_band = np.logical_and(x_freqs >= 0.75, x_freqs <= 1.25)
    ax[i, 1].plot(x_freqs[highlight_ppg_band], psd[highlight_ppg_band], color='orange', label='PPG Band')

ax[0, 1].legend()
ax[i, 0].set_xlabel("Time / s")
ax[i, 1].set_xlabel("Frequency / Hz")
fig.suptitle("PPG Sources", fontsize=16)
plt.tight_layout()

../../_images/examples_tutorial_6_data_driven_analysis_82_0.png

Selection of Mayer Wave Source

Mayer waves are expected to have a frequency around 0.1 Hz. Similar to the PPG sources above, we will use the contrast between the frequency band around 0.1 Hz and the surrounding bands to rank the sources and identify those that are most similar to the Mayer wave.

[41]:

# Choose the indices of frequencies that are in the Mayer Wave band (0.05 - 0.15 Hz)
mw_band_ind = np.logical_and(freqs >= 0.05, freqs <= 0.15)

# Choose the indices of frequencies that are in the band (0 - 0.05 Hz and 0.15 - 3.0 Hz)
comp_band =  np.logical_and(freqs >= 0, freqs < 0.05) +  np.logical_and(freqs > 0.15 , freqs <= 3.0)

# Compute the quotient of the Mayer Wave band and the contrast band
psd_quotient = np.sum(psd_sources[:, mw_band_ind], axis = 1 ) / np.sum(psd_sources[:, comp_band], axis = 1 )

# Choose the indices of the sources with the highest contrast
max_contrast_index = np.argsort(psd_quotient, axis = 0 )[-5:]

# Reverse the order of the indices to have the highest contrast first
max_contrast_index = max_contrast_index[::-1]

# Extract the sources with the highest contrast
mw_sources = sources_zscore[max_contrast_index, :]

[42]:

# Plot the sources with the highest contrast and their frequency spectrum
fig, ax = plt.subplots(mw_sources.shape[0], 2, figsize=(12, 2 * mw_sources.shape[0]))

for i in range(mw_sources.shape[0]):
    # Plot the source for 60 seconds
    samples = int(fnirs_data_samplingrate * 60 * 1)
    ax[i, 0].plot( 1/fnirs_data_samplingrate * np.arange(0,samples), mw_sources[i, : samples], label=f"Source {max_contrast_index[i]+1}")
    ax[i, 0].set_title(f"Source {max_contrast_index[i] + 1}", fontsize=10)

    # Plot frequency spectrum of the source
    psd = np.abs(np.fft.rfft(mw_sources[i, :])) ** 2
    x_freqs = np.fft.rfftfreq(mw_sources.shape[1], 1 / fnirs_data_samplingrate)
    ax[i, 1].plot(x_freqs, psd, label="Contrast Band")
    ax[i, 1].set_title(f"Frequency Spectrum of Source {max_contrast_index[i]+1}, Contrast Quotient: {psd_quotient[max_contrast_index[i]]:.2f}",  fontsize=10)

    # Highlight the Mayer Wave band in the frequency spectrum
    highlight_mw_band = np.logical_and(x_freqs >= 0.05, x_freqs <= 0.15)
    ax[i, 1].plot(x_freqs[highlight_mw_band], psd[highlight_mw_band], color='orange', label='Mayer Wave Band')

ax[0, 1].legend()
ax[i, 0].set_xlabel("Time / s")
ax[i, 1].set_xlabel("Frequency / Hz")
fig.suptitle("Mayer Wave Sources", fontsize=16)
plt.tight_layout()

../../_images/examples_tutorial_6_data_driven_analysis_85_0.png

Single Trial Classification

The last section of this notebook demonstrates how Cedalion interfaces with scikit-learn to train a simple single-subject, single-trial classifier. The focus is on data flow: performing preprocessing with short-channel regression within a cross-validation scheme, extracting features from epochs, and passing these features to scikit-learn for training and evaluation while preserving feature metadata to allow tracing each feature back to its origin.

[43]:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import LabelEncoder
from sklearn.inspection import permutation_importance

import cedalion.models.glm as glm
import cedalion.mlutils as mlutils

[44]:

rec = cedalion.data.get_fingertappingDOT()

# assign string labels to events and pool finger-tapping and ball-squeezing trials
rec.stim.cd.rename_events(
    {
        "1": "Control",
        "2": "Motor/Left",  # "FTapping/Left",
        "3": "Motor/Right", # "FTapping/Right",
        "4": "Motor/Left",  # "BallSqueezing/Left",
        "5": "Motor/Right", # "BallSqueezing/Right",
    }
)

# Keep only motor trials. Also remove the last trial so that there
# are equal numbers of trials for Motor/Left and Motor/Right
rec.stim = (
    rec.stim[rec.stim.trial_type.str.startswith("Motor")]
    .sort_values("onset")
    .reset_index(drop=True)
    .iloc[:-1]
)

display(rec.stim.groupby("trial_type").count())
display(rec.stim)

	onset	duration	value
trial_type
Motor/Left	32	32	32
Motor/Right	32	32	32

	onset	duration	value	trial_type
0	8.486912	10.0	1.0	Motor/Left
1	38.764544	10.0	1.0	Motor/Right
2	69.042176	10.0	1.0	Motor/Right
3	99.549184	10.0	1.0	Motor/Left
4	129.597440	10.0	1.0	Motor/Left
...	...	...	...	...
59	1837.301760	10.0	1.0	Motor/Left
60	1868.496896	10.0	1.0	Motor/Left
61	1899.003904	10.0	1.0	Motor/Right
62	1931.116544	10.0	1.0	Motor/Right
63	1962.541056	10.0	1.0	Motor/Left

64 rows × 4 columns

Preprocessing the dataset

As in previous notebooks, motion artifacts are corrected with the TDDR and wavelet algorithms. The data is transformed into optical densities and a global component is subtracted. After bandpass filtering, concentration changes are calculated.

[45]:

rec["od"] = cedalion.nirs.cw.int2od(rec["amp"])
rec["od_tddr"] = motion.tddr(rec["od"])
rec["od_wavelet"] = motion.wavelet(rec["od_tddr"])

# see 2_tutorial_preprocessing.ipynb for channel selection
bad_channels = ['S13D26', 'S14D28']

rec["od_clean"] = rec["od_wavelet"].sel(channel=~rec["od"].channel.isin(['S13D26', 'S14D28']))

od_var = quality.measurement_variance(rec["od_clean"], calc_covariance=False)

rec["od_mean_subtracted"], global_comp = physio.global_component_subtract(
    rec["od_clean"], ts_weights=1 / od_var, k=0
)

rec["od_freqfiltered"] = rec["od_mean_subtracted"].cd.freq_filter(
#rec["od_freqfiltered"] = rec["od_clean"].cd.freq_filter(
    fmin=0.01, fmax=0.5, butter_order=4
)

dpf = xr.DataArray(
    [6, 6],
    dims="wavelength",
    coords={"wavelength": rec["amp"].wavelength},
)

rec["conc"] = cedalion.nirs.cw.od2conc(rec["od_freqfiltered"], rec.geo3d, dpf)

Short-channel regression

Following the approach of von Lühmann et al. (2020), we apply a GLM to regress out physiological noise, thereby improving single-trial classification performance. To incorporate the GLM into a cross-validaton scheme, the design matrix will be masked for each cross-validation fold.

[46]:

# separate long and short channels
rec["conc_long"], rec["conc_short"] = cedalion.nirs.split_long_short_channels(
    rec["conc"], rec.geo3d, distance_threshold=22.5 * units.mm
)

# define the design matrix
dms = (
    glm.design_matrix.hrf_regressors(
        rec["conc_long"],
        rec.stim,
        glm.Gamma(tau=0 * units.s, sigma=3 * units.s),
    )
    & glm.design_matrix.drift_regressors(rec["conc_long"], drift_order=1)
    & glm.design_matrix.average_short_channel_regressor(rec["conc_short"])
)
dms

[46]:

DesignMatrix(common=['HRF Motor/Left','HRF Motor/Right','Drift 0','Drift 1','short'], channel_wise=[])

The dataset has 64 trials, with equal numbers of “Motor/Left” and “Motor/Right” conditions. In the following this dataset is split into 4 folds using the function mlutils.cv.create_cv_splits.

For each fold, a masked design matrix is created by setting the time segment used for testing to zero.

The GLM parameters are estimated and the signal components explained by nuisance regressors are subtracted.

Finally, the time series is segmented into epochs, yielding for each cross-validation fold an array of samples, to which the trial information required for training is appended as additional coordinates.

[47]:

n_splits = 4
before = 5 * cedalion.units.s
after = 25 * cedalion.units.s


cv_folds = []

for i_split, (df_stim_train, df_stim_test) in enumerate(
    mlutils.cv.create_cv_splits(rec.stim, n_splits)
):
    # short-channel regression (SCR):

    # zero-out design matrix in test segment
    dms_masked = mlutils.cv.mask_design_matrix(
        dms,
        df_stim_test,
        before=before,
        after=after,
    )

    # fit long channels with masked design matrix
    result = glm.fit(rec["conc_long"], dms_masked, noise_model="ols")

    # compute component explained by short-channel regressor
    short_component = glm.predict(
        rec["conc_long"],
        result.sm.params.sel(regressor=["short", "Drift 0", "Drift 1"]),
        dms_masked,
    )
    short_component = short_component.pint.quantify(rec["conc_long"].pint.units)

    # subtract short component
    conc_long_scr = rec["conc_long"] - short_component

    # split time series into epochs
    epochs = conc_long_scr.cd.to_epochs(
        rec.stim,
        ["Motor/Left", "Motor/Right"],
        before=before,
        after=after,
    )

    # baseline correction
    baseline = epochs.sel(reltime=(epochs.reltime < 0)).mean("reltime")
    epochs = epochs - baseline

    # assign train-/test-set membership...
    is_train = np.zeros(epochs.sizes["epoch"], dtype=bool)
    is_test = np.zeros(epochs.sizes["epoch"], dtype=bool)
    is_train[df_stim_train.index.values] = True
    is_test[df_stim_test.index.values] = True

    # ... and digitized trial labels ...
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(epochs.trial_type.values)

    # ... as coordinates to the DataArray
    epochs = epochs.assign_coords(
        {
            "is_train": ("epoch", is_train),
            "is_test": ("epoch", is_test),
            "y" : ("epoch", y)
         }
    )

    cv_folds.append(epochs)

 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋    | 43/44 [00:00<00:00, 440.45it/s]
 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋    | 43/44 [00:00<00:00, 447.75it/s]
 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋    | 43/44 [00:00<00:00, 447.00it/s]
 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋    | 43/44 [00:00<00:00, 424.21it/s]
 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋    | 43/44 [00:00<00:00, 400.19it/s]
 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋    | 43/44 [00:00<00:00, 441.67it/s]
 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋    | 43/44 [00:00<00:00, 447.72it/s]
 98%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋    | 43/44 [00:00<00:00, 431.98it/s]

For each cross-validation fold the epoched time series looks like this:

[48]:

cv_folds[0]

[48]:

<xarray.DataArray (epoch: 64, chromo: 2, channel: 44, reltime: 132)> Size: 6MB
[µM] 0.06823 0.04856 0.04232 0.03702 0.03248 ... 0.04141 0.0395 0.03601 0.03161
Coordinates:
  * reltime     (reltime) float64 1kB -5.038 -4.809 -4.58 ... 24.5 24.73 24.96
    trial_type  (epoch) <U11 3kB 'Motor/Left' 'Motor/Right' ... 'Motor/Left'
  * chromo      (chromo) <U3 24B 'HbO' 'HbR'
  * channel     (channel) object 352B 'S1D6' 'S1D8' 'S2D5' ... 'S14D25' 'S14D27'
    source      (channel) object 352B 'S1' 'S1' 'S2' 'S2' ... 'S13' 'S14' 'S14'
    detector    (channel) object 352B 'D6' 'D8' 'D5' 'D9' ... 'D28' 'D25' 'D27'
    is_train    (epoch) bool 64B False False False False ... True True True True
    is_test     (epoch) bool 64B True True True True ... False False False False
    y           (epoch) int64 512B 0 1 1 0 0 0 1 1 1 0 1 ... 0 1 1 0 1 0 0 1 1 0
Dimensions without coordinates: epoch

The hemodynamic response to the two motor tasks can be visualized by block-averaging these epochs.

[49]:

blockaverage = cv_folds[0].groupby("trial_type").mean("epoch")

# Plot block averages. Please ignore errors if the plot is too small in the HD case

noPlts2 = int(np.ceil(np.sqrt(len(blockaverage.channel))))
f,ax = plt.subplots(noPlts2,noPlts2, figsize=(12,10))
ax = ax.flatten()
for i_ch, ch in enumerate(blockaverage.channel):
    for ls, trial_type in zip(["-", "--"], blockaverage.trial_type):
        ax[i_ch].plot(blockaverage.reltime, blockaverage.sel(chromo="HbO", trial_type=trial_type, channel=ch), "r", lw=2, ls=ls)
        ax[i_ch].plot(blockaverage.reltime, blockaverage.sel(chromo="HbR", trial_type=trial_type, channel=ch), "b", lw=2, ls=ls)

    ax[i_ch].grid(1)
    ax[i_ch].set_title(ch.values)
    ax[i_ch].set_ylim(-.3, .3)
    ax[i_ch].set_axis_off()
    ax[i_ch].axhline(0, c="k")
    ax[i_ch].axvline(0, c="k")

for i in range(len(blockaverage.channel), len(ax)):
    ax[i].set_axis_off()

plt.suptitle("HbO: r | HbR: b | left: - | right: --")
plt.tight_layout()

../../_images/examples_tutorial_6_data_driven_analysis_98_0.png

For each of the 64 epochs there are \(N_{chromo} \times N_{channel}\) time traces. From these time traces, features must get extracted.

The function mlutils.features.epoch_features calculates common features of the hemodynamic response, such as slope, mean, maximum, minimum and area under the curve. For each feature type, a time range can be specified, over which the feature is calculated. In the present case, this yields features for each channel and chromophore, which are then stacked into a single feature dimensions. The resulting array has the shape expected by scikit-learn, (\(N_{samples}, N_{features}\)).

[50]:

X = mlutils.features.epoch_features(
    cv_folds[0],
    feature_types=["slope", "mean", "max", "min", "auc"],
    reltime_slices={
        "slope": slice(0, 9),
        "mean": slice(3, 10),
        "max": slice(2, 8),
        "min": slice(2, 8),
    },
)
X

During stacking, the coordinates of the 'chromo' and 'channel' dimensions don’t get lost. They are combined and reassigned to the new 'feature' dimension. This means that for every feature in X, you can trace back which channel and chromophore it originated from.

[51]:

X.feature

Training and Evaluating a LDA classifier

For each cross-validation fold, extract features, train the classifier and estimate classification accuracy.

[52]:

accuracies = []

for epochs in cv_folds:
    # extract features
    X = mlutils.features.epoch_features(
        epochs.sel(chromo="HbO"), # HbO only
        feature_types=["slope", "max", "mean"],
        reltime_slices={
            "slope": slice(0, 9),
            "mean": slice(3, 10),
            "max": slice(2, 8),
        },
    )

    # separate train and test sets
    X_train = X[X.is_train]
    y_train = X_train.y
    X_test = X[X.is_test]
    y_test = X_test.y

    # train a LDA classifier
    clf = LinearDiscriminantAnalysis(n_components=1, solver='lsqr', shrinkage="auto")
    clf.fit(X_train, y_train)

    # evaluate perfomance
    accuracy = clf.score(X_test, y_test)
    accuracies.append(accuracy)
    print(f"#train: {len(y_train)} #test: {len(y_test)} #features: {X_train.shape[1]} accuracy: {accuracy}")

print()
print(rf"average accuracy over cross-validation splits: {np.mean(accuracies):.3f} ± {np.std(accuracies):.3f}")

#train: 48 #test: 16 #features: 132 accuracy: 0.8125
#train: 48 #test: 16 #features: 132 accuracy: 0.8125
#train: 48 #test: 16 #features: 132 accuracy: 0.6875
#train: 48 #test: 16 #features: 132 accuracy: 0.8125

average accuracy over cross-validation splits: 0.781 ± 0.054

S6: Data-Driven (ML) Analysis

Multimodal Source Decomposition Methods on Simulated fNIRS-EEG data

Simulated Dataset

Regularized CCA

CCA

Sparse CCA, Ridge CCA, and ElasticNet CCA

Structured Sparse CCA (ssCCA)

Additional Functionalities

Multiple Components

Partial Least Squares (PLS)

Temporally Embedded CCA (tCCA)

Aditional Functionalities

Multimodal Source Power Co-modulation (mSPoC)