New Datasets Guide¶
In this notebook we are going to explain how to use a new dataset with openretina.
For this example, we are going to use data from Maheswaranathan et al. (2023): Interpreting the retinal neural code for natural scenes: From computations to neurons.
Along the way, we are also going to address some questions that can arise regarding the process for your own data.
import logging
import os
import lightning
from openretina.data_io.base import MoviesTrainTestSplit, ResponsesTrainTestSplit, compute_data_info
from openretina.data_io.base_dataloader import multiple_movies_dataloaders
from openretina.data_io.cyclers import LongCycler, ShortCycler
from openretina.models.core_readout import ExampleCoreReadout
from openretina.utils.file_utils import get_cache_directory, get_local_file_path
from openretina.utils.h5_handling import load_dataset_from_h5, load_h5_into_dict
from openretina.utils.misc import CustomPrettyPrinter
logging.basicConfig(
level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
) # to display logs in jupyter notebooks
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
pp = CustomPrettyPrinter(indent=4, max_lines=40)
First, let's set the cache directory for the data and models.
# The default directory for downloads will be ~/openretina_cache
# To change this, uncomment the following line and change its path
# os.environ["OPENRETINA_CACHE_DIRECTORY"] = "/Data/"
# You can then check if that directory has been correctly set by running:
get_cache_directory()
Let's now download the data from HuggingFace.
data_path = get_local_file_path(
"https://huggingface.co/datasets/open-retina/open-retina/blob/main/baccus_lab/maheswaranathan_2023/neural_code_data.zip"
)
Now let's inspect the structure of this dataset
!ls $data_path/ganglion_cell_data
We can see that the ganglion cell data is structured by sessions. We are going to pick session 15-10-07 to use throughout the examples in this notebook.
!ls $data_path/ganglion_cell_data/15-10-07
Inside each session we have files for two different types of stimuli.
Let's load the file dealing with whitenoise and inspect it.
whitenoise_file = load_h5_into_dict(os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"))
pp.pprint(whitenoise_file)
We can see that at the first level of the .h5 hierarchy the data is split into train, test and spikes.
spikes will contain the spike times for each neuron, which we can ignore.
train and test are structured similarly: they both contain numpy arrays for the stimulus, time (mapping to the spike indices in spikes) and the response. The latter is saved with different binnings (by choosing a different bin width in time, there are more ways to group a sequence of spike times into a firing rate representation).
We can see that the stimulus and response arrays share the time dimension. These are the data we are interested in for model fitting.
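As an aside, the different binnings mentioned above can be illustrated with a small sketch. The spike times here are made up, but the 20 ms bin width mirrors the firing_rate_20ms dataset in the file:

```python
import numpy as np

# Hypothetical spike times (in seconds) for one neuron.
spike_times = np.array([0.013, 0.051, 0.062, 0.118, 0.119, 0.180])

# Binning spike times into 20 ms bins yields a firing-rate trace,
# analogous to the "firing_rate_20ms" array in the .h5 file.
bin_width = 0.020
edges = np.arange(0.0, 0.2 + bin_width, bin_width)
counts, _ = np.histogram(spike_times, bins=edges)
firing_rate = counts / bin_width  # spikes per second in each bin

print(firing_rate)
```

Choosing a wider bin gives a smoother (but lower-resolution) rate estimate; a narrower bin preserves more temporal detail at the cost of noisier counts.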
Now let's see how we can load this to use with openretina.
Loading data¶
What we need are the matching stimulus and response pairs for training and testing. We will then need to feed them into the two classes that handle this data: MoviesTrainTestSplit for the stimuli and ResponsesTrainTestSplit for the responses.
Let's briefly print the classes help information, so we can see which arguments they expect:
MoviesTrainTestSplit?
ResponsesTrainTestSplit?
Let's now start importing the data that we will feed into these classes:
test_stimulus = load_dataset_from_h5(
os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"), "test/stimulus"
)
test_response = load_dataset_from_h5(
os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"), "test/response/firing_rate_20ms"
)
train_stimulus = load_dataset_from_h5(
os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"), "train/stimulus"
)
train_response = load_dataset_from_h5(
os.path.join(data_path, "ganglion_cell_data", "15-10-07", "whitenoise.h5"), "train/response/firing_rate_20ms"
)
print(f"Train stimulus shape: {train_stimulus.shape}")
print(f"Train response shape: {train_response.shape}")
print(f"Test stimulus shape: {test_stimulus.shape}")
print(f"Test response shape: {test_response.shape}")
Looking at the shapes of the arrays we just imported, we need to make some small adjustments to match the assumptions that the classes within openretina make.
- The stimulus needs to be 4-dimensional, with shape color_channels x time x height x width: in this case the channel dimension is missing.
- The responses need to have shape n_neurons x time: this is already the case here.
- The stimulus and response time dimensions should match exactly: here the test response has one extra time bin, which we are simply going to cut.
Let's do all of this here:
test_stimulus = test_stimulus[None, ...]  # add the missing channel dimension
train_stimulus = train_stimulus[None, ...]  # add the missing channel dimension
test_response = test_response[:, :-1]  # drop the extra time bin
print(f"Train stimulus shape: {train_stimulus.shape}")
print(f"Train response shape: {train_response.shape}")
print(f"Test stimulus shape: {test_stimulus.shape}")
print(f"Test response shape: {test_response.shape}")
Before finally initialising our target classes, we should normalise the stimuli (and optionally the responses). This is mostly done to stabilise training, as too wide an input range can lead to exploding gradients.
train_stim_mean = train_stimulus.mean()
train_stim_std = train_stimulus.std()
norm_train_stimulus = (train_stimulus - train_stim_mean) / train_stim_std
norm_test_stimulus = (test_stimulus - train_stim_mean) / train_stim_std
Finally, we can initialise the classes
single_stimulus = MoviesTrainTestSplit(
train=norm_train_stimulus,
test=norm_test_stimulus,
stim_id="whitenoise",
norm_mean=train_stim_mean,
norm_std=train_stim_std,
)
single_response = ResponsesTrainTestSplit(
train=train_response,
test=test_response,
stim_id="whitenoise",
)
Q: How to do this step with your data?¶
What matters for the pipeline within openretina is that you can import your data such that stimuli and responses for each session have the same sampling frequency, and that you end up with two numpy arrays, one for the stimuli and one for the responses, with the exact same length in the time dimension.
This might require some resampling if it is not the case already, and the workflow will vary depending on how your data is exported. This decision and implementation are the responsibility of the user.
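As one possible approach, here is a minimal resampling sketch with numpy (the array names and sampling rates are hypothetical), linearly interpolating each neuron's trace onto the stimulus time base:

```python
import numpy as np

# Hypothetical example: responses sampled at 100 Hz, stimulus at 50 Hz.
response_times = np.arange(0, 10, 0.01)   # 1000 samples at 100 Hz
responses = np.sin(response_times)[None]  # shape (n_neurons=1, time)

stimulus_times = np.arange(0, 10, 0.02)   # 500 samples at 50 Hz

# Linearly interpolate each neuron's trace onto the stimulus time base,
# so that responses and stimulus share the time dimension.
resampled = np.stack(
    [np.interp(stimulus_times, response_times, r) for r in responses]
)
print(resampled.shape)  # (1, 500)
```

Depending on your data, a simple linear interpolation may not be appropriate (e.g. for raw spike counts you may prefer re-binning), so treat this only as a starting point.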
Q: What if I do not have a train and a test split in my data?¶
The train/test split is completely arbitrary, but it is sometimes a direct consequence of certain experimental design choices. For example, test stimuli usually have been repeated multiple times, so that an average response can be computed, along with different estimates of SNR or response reliability. Training stimuli on the other hand tend to have a lower number of repeats, often only 1.
If all your stimuli have multiple repeats by design and no clear train/test separation, you can decide which parts to use for training and which for testing, for example with an 80% / 20% split. It is recommended to use the average test trace across repetitions for testing. Conversely, during training it can be beneficial to introduce some noise, so it is recommended to use the single repeats (this will also give you more training data).
If your data has no clear trial/repetition structure, and you only have one repeat per stimulus, you can similarly decide arbitrarily how to split your data and how much to leave for testing. What you can expect in this case, however, is lower test performance compared to what you would get if your test responses had been collected across multiple trials. The reason for this is simply that averaging over more trials removes noise, which is otherwise treated as ground-truth signal when computing test performance.
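As a rough sketch of the repeated-stimulus case (with made-up shapes; your repeat and bin counts will differ), one can split 80/20 along the time axis, keeping single repeats for training and averaging repeats only for the test portion:

```python
import numpy as np

# Hypothetical recording: 5 repeats, 10 neurons, 1000 time bins.
rng = np.random.default_rng(0)
responses = rng.poisson(2.0, size=(5, 10, 1000)).astype(float)

split = int(responses.shape[-1] * 0.8)  # 80% / 20% split in time

# Training: keep the single repeats, concatenated along time for more data.
train_response = np.concatenate(list(responses[:, :, :split]), axis=-1)

# Testing: average across repeats to reduce noise in the target trace.
test_response = responses[:, :, split:].mean(axis=0)

print(train_response.shape)  # (10, 4000)
print(test_response.shape)   # (10, 200)
```

Note that the matching stimulus would need to be tiled accordingly for the concatenated training repeats.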
Dataloading¶
We are now ready to initialise a dataloader with the stimuli and responses we extracted. Note that dataloader functions within openretina assume that you input a dictionary of stimuli and responses, where keys are session names and values are instances of the ResponsesTrainTestSplit and MoviesTrainTestSplit classes we just created.
We make this assumption to accommodate multiple experimental sessions for training, which is the usual case. If you indeed have data from multiple sessions, you have two options moving forward.
- Manually repeat what we have done above for all sessions.
- Recommended: code up your own data_io functions / modules, one for the stimuli and one for the responses. The output of these functions should be two dictionaries that share the same keys (i.e. the session names) and have as values the ResponsesTrainTestSplit and MoviesTrainTestSplit objects. If you take this route, you can place such functions inside openretina.data_io.your_dataset_name and, if you feel like sharing, submit a PR so we can include your dataset in the repository! For a worked example, check out how we coded up these functions for the current dataset at openretina.data_io.maheswaranathan_2023.
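If you go the recommended route, a hypothetical skeleton for such a pair of functions might look like the following. The session names, shapes, and random data are placeholders; in practice each function would read your files and wrap the arrays in MoviesTrainTestSplit / ResponsesTrainTestSplit objects instead of plain dictionaries:

```python
import numpy as np


def load_all_stimuli(session_names):
    """Hypothetical loader: one normalized train/test stimulus per session.

    A real implementation would read each session's files and return
    MoviesTrainTestSplit objects; here we fake the arrays.
    """
    rng = np.random.default_rng(0)
    return {
        name: {
            "train": rng.normal(size=(1, 800, 50, 50)),  # channels x time x h x w
            "test": rng.normal(size=(1, 200, 50, 50)),
        }
        for name in session_names
    }


def load_all_responses(session_names):
    """Hypothetical loader: matching responses (n_neurons x time) per session."""
    rng = np.random.default_rng(1)
    return {
        name: {
            "train": rng.poisson(2.0, size=(10, 800)).astype(float),
            "test": rng.poisson(2.0, size=(10, 200)).astype(float),
        }
        for name in session_names
    }


sessions = ["15-10-07", "16-01-08"]  # hypothetical session names
stimuli = load_all_stimuli(sessions)
responses = load_all_responses(sessions)
assert stimuli.keys() == responses.keys()  # dictionaries must share session keys
```

The key contract is simply that both dictionaries share their keys and that, per session, stimuli and responses agree in the time dimension.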
To keep things simple, here we initialise one-item dictionaries for the stimuli and responses we just extracted.
stimuli = {"15-10-07": single_stimulus}
responses = {"15-10-07": single_response}
We are now ready to feed our matching dictionaries of stimuli and responses to a dataloader.
dataloaders = multiple_movies_dataloaders(neuron_data_dictionary=responses, movies_dictionary=stimuli)
stimuli["15-10-07"].train.shape
pp.pprint(dataloaders)
Initialising a simple model¶
Digital twin models (like ML models in general) depend heavily on the data they are trained and evaluated on, even in their architecture. Practically:
- The shape of the input stimulus will influence the shape of the convolutional kernels, and is therefore a parameter at model creation
- The number of sessions and the neurons in each session will, in turn, influence the structure and number of parameters in the readout networks, and are also parameters at model creation.
To extract this information from the data, pass it to the model, and store it, openretina has a utility function, compute_data_info, which takes as arguments the same two dictionaries that are fed to the dataloader function.
data_info = compute_data_info(neuron_data_dictionary=responses, movies_dictionary=stimuli)
# Display the data info
pp.pprint(data_info)
Now we can initialise a model:
model = ExampleCoreReadout(
in_shape=(1, 100, 50, 50), # Note that data_info does not include time, we add a dummy time dimension here.
hidden_channels=[32, 64],
temporal_kernel_sizes=[3, 3],
spatial_kernel_sizes=[7, 7],
n_neurons_dict=data_info["n_neurons_dict"],
)
Training¶
The last step before initiating training is to wrap the training, validation and testing dataloaders (which are, in fact, dictionaries of dataloaders) into Cycler objects, utilities that iterate through the data for each session.
(Note that we still need to do this in our one-session running example, because dataloaders["train"], dataloaders["validation"] and dataloaders["test"] will still be dictionaries, in this case with only one item each. Feel free to inspect the dataloaders dictionary we created to make sense of this.)
train_loader = LongCycler(dataloaders["train"])
val_loader = ShortCycler(dataloaders["validation"])
test_loader = ShortCycler(dataloaders["test"])
Here we will just check whether the trainer works.
trainer = lightning.Trainer(fast_dev_run=True)
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
It is recommended to set up training using a training script, either a custom one or our unified interface. More on that below.
Using new data with our unified training script¶
openretina comes with a few command line scripts, among which openretina train. This calls our training script, which uses hydra for config management.
A few things are needed to run training on a completely new dataset using our training script:
- Creating data_io and dataloading functions for the new dataset, and placing them in the openretina/data_io submodule. Earlier parts of this notebook dealt with this.
- Creating data_io and dataloader configs for the new dataset, and placing them in the appropriate folders in configs.
- Creating an "outer" config, placed as a direct child of the configs folder.
Let's go through these step by step.
1. Creating a custom sub-module for the new dataset.¶
Doing something similar to what we did in this notebook, you would need to code up functions for the stimuli and for the responses that create two dictionaries that share the keys (i.e. the session names), and have as values ResponsesTrainTestSplit and MoviesTrainTestSplit objects.
Extending the example in this notebook, we already provide such functions for the maheswaranathan_2023 dataset under openretina.data_io.maheswaranathan_2023.
2. Creating config files for data_io and dataloading¶
Once such dataloading functions are in place, we need to make sure they are used correctly in the training script. This is how dataloading happens in the training script:
{python}
movies_dict = hydra.utils.call(cfg.data_io.stimuli)
neuron_data_dict = hydra.utils.call(cfg.data_io.responses)
dataloaders = hydra.utils.instantiate(
cfg.dataloader,
neuron_data_dictionary=neuron_data_dict,
movies_dictionary=movies_dict,
)
Let's break this down in the case for the stimuli.
hydra.utils.call calls the function specified in the config under data_io.stimuli.
In the main configuration files folder (called configs), we have different subfolders for different possibilities of configuration options. In the data_io folder we have different YAML files dealing with the data_io functions. There, we created a file called maheswaranathan_2023.yaml which looks like this:
{yaml}
stimuli:
_target_: openretina.data_io.maheswaranathan_2023.stimuli.load_all_stimuli
_convert_: object
base_data_path: ${data.data_dir}
stim_type: "naturalscene"
normalize_stimuli: true
responses:
_target_: openretina.data_io.maheswaranathan_2023.responses.load_all_responses
_convert_: object
base_data_path: ${data.data_dir}
stim_type: "naturalscene"
response_type: "firing_rate_20ms"
fr_normalization: 1.0
When we call hydra.utils.call(cfg.data_io.stimuli), Hydra looks up the stimuli key in our configuration and finds that it specifies a function to call:
- _target_: Specifies the fully qualified path of the function to call, in this case openretina.data_io.maheswaranathan_2023.stimuli.load_all_stimuli.
- _convert_: Ensures that the output of the function is returned as an object rather than a dictionary.
- The rest are arguments specific to the function that we coded up.
Importantly, when adding the configuration for a new dataset, the user should specify in a new config file under data_io which functions should be called, and with which parameters, so that they return the dictionaries mapping session names to ResponsesTrainTestSplit and MoviesTrainTestSplit objects.
The same holds for dataloading.
3. Creating an "outer" config.¶
Once data_io functions are coded up, and data_io configs are created, these will need to be referenced in an "outer" config file which orchestrates the run. A template is present under configs/template_outer_config.yaml.
Here is what the template looks like:
{yaml}
defaults:
- data_io: ??? # For new data, create data_io config and put its name here
- dataloader: ??? # For new data, create dataloader config and put its name here
- model: base_core_readout
- training_callbacks:
- early_stopping
- lr_monitor
- model_checkpoint
- logger:
- tensorboard
- csv
- trainer: default_deterministic
- hydra: default
- _self_ # values in this config will overwrite the defaults
exp_name: example_experiment_new_data
seed: 42
check_stimuli_responses_match: false
paths:
cache_dir: ${oc.env:OPENRETINA_CACHE_DIRECTORY} # Remote files are downloaded to this location
# If data_dir is a local path, data will be read from there. If a remote link, the target will be downloaded to cache_dir.
data_dir: ??? # Choose the location of the data. Should be used in data_io functions.
log_dir: "." # Used as parent for output_dir. Will store train logs.
output_dir: ${hydra:runtime.output_dir} # Modify in the "hydra/default.yaml" config
# Overwrite model defaults with specifics for the current data input format
model:
in_shape: ???
hidden_channels: ???
spatial_kernel_sizes: ???
# Can over-ride further model defaults here.
Breaking down the template¶
1. Defaults section:¶
- Hydra uses the defaults section to compose configurations from different files.
- Each line here references a specific configuration file, stored in subdirectories within configs/.
- For example, data_io: ??? means that a specific data_io config must be created and provided (e.g., maheswaranathan_2023).
- Similarly, dataloader: ??? ensures that a dataloader configuration is selected.
- _self_ ensures that values defined later in this file override the defaults.
2. Run specific variables¶
- exp_name: The experiment name, which helps organize logs and outputs.
- seed: A fixed seed for reproducibility.
- check_stimuli_responses_match: A debugging flag to ensure that stimuli and responses are aligned correctly.
3. File paths¶
- cache_dir: The base directory for downloads, if any need to happen.
- data_dir: The location of the dataset, which can be referenced in data_io functions using ${paths.data_dir}. If ${paths.data_dir} is a remote path, its contents will be downloaded to cache_dir, and the path of the downloaded files will be used when loading the data.
- log_dir: Parent folder for the logs, which is used by output_dir.
- output_dir: Where logs, model checkpoints, and results will be saved. Uses log_dir as the parent; the sub-folder structure is set by Hydra.
4. Model specific overrides¶
This section defines the input shape and architecture details, overriding the default model configuration if needed.
- in_shape, hidden_channels, and spatial_kernel_sizes are left as placeholders (???), meaning they should be specified based on the dataset used.
Filling in the Configuration for maheswaranathan_2023¶
Now, let’s see how this template is filled in for an actual experiment using maheswaranathan_2023:
{yaml}
defaults:
- data_io: maheswaranathan_2023
- dataloader: maheswaranathan_2023
- model: base_core_readout
- training_callbacks:
- early_stopping
- lr_monitor
- model_checkpoint
- logger:
- tensorboard
- csv
- trainer: default_deterministic
- hydra: default
- _self_
Instead of ???, we now explicitly specify maheswaranathan_2023 for both data_io and dataloader.
The remaining configuration choices (e.g., logging, training callbacks, trainer) stay the same as the template, but could also be modified further. We provide different options in the respective folders.
Continuing:
{yaml}
exp_name: core_readout_maheswaranathan
seed: 42
check_stimuli_responses_match: false
paths:
cache_dir: null # Assume we already downloaded and unzipped manually the data
data_dir: ${oc.env:HOME}/baccus_data/neural_code_data/ganglion_cell_data/ # Say we downloaded it in home
log_dir: "." # Save logs in the current directory
output_dir: ${hydra:runtime.output_dir} # Keep hydra default for sub-folders, which we set in configs/hydra/default.yaml
model:
in_shape: [1, 100, 50, 50]
hidden_channels: [16, 32]
spatial_kernel_sizes: [15, 11]
- The experiment is now named "core_readout_maheswaranathan", which will be used in logs and outputs.
- The dataset location is explicitly set to "baccus_data/neural_code_data/ganglion_cell_data/", where cache_dir should still be defined by the user.
- The model section is defined, containing a few overrides of the defaults for base_core_readout:
  - in_shape: [1, 100, 50, 50] represents the input dimensions for the dataset.
  - hidden_channels: [16, 32] defines the number of channels in each convolutional layer.
  - spatial_kernel_sizes: [15, 11] specifies the spatial kernel sizes.
Once an outer config is specified, running training with the specified options is done via the command line with:
{bash}
openretina train --config-name "maheswaranathan_2023_core_readout"
Here, you should replace "maheswaranathan_2023_core_readout" with the name of your own outer YAML config.
Conclusion¶
In this tutorial, we walked through the process of integrating a new dataset into openretina and getting started with training on it. While setting up a new dataset can be challenging, taking a structured approach makes it much more manageable, despite the initial learning curve. If you run into issues, don't hesitate to reach out, and explore the Hydra and openretina documentation further for more details.