Welcome to the API docs of the ShiftHappens benchmark!#
We aim to create a community-built benchmark suite for ImageNet models, composed of new datasets for OOD robustness and detection, as well as new tasks for existing OOD datasets.
While the popularity of robustness benchmarks and new test datasets has increased over the past years, the performance of computer vision models is still largely evaluated directly on ImageNet, or on simulated or isolated distribution shifts like those in ImageNet-C.
The goal of this workshop is to enhance the landscape of robustness evaluation datasets for computer vision and to devise new test sets and metrics for quantifying desirable properties of computer vision models. We aim to bring the robustness, domain adaptation, and out-of-distribution detection communities together to work on a new broad-scale benchmark that tests diverse aspects of current computer vision models and guides the way towards the next generation of models.
A submission to the workshop consists of adding a Task, which will be used to test the performance
of various computer vision models on the new evaluation task you specify. Below we provide documentation
for the shifthappens API.
Also make sure to look at the examples in the GitHub repository. If in doubt, or if the API is not yet flexible enough for your needs, consider opening an issue on GitHub or joining our Slack channel.
Benchmark#
Base functions to register new tasks to the benchmark and evaluate models.
To register a new task, decorate a task class inheriting from
shifthappens.tasks.base.Task with the shifthappens.benchmark.register_task
function.
To evaluate a model on all registered tasks, run
shifthappens.benchmark.evaluate_model.
- shifthappens.benchmark.evaluate_model(model, data_root)[source]#
Runs all registered tasks of the benchmark which are supported by the supplied model.
- Parameters:
- Return type:
- Returns:
Associates shifthappens.task_data.task_metadata.TaskMetadata with the respective shifthappens.tasks.task_result.TaskResult.
Examples
>>> import shifthappens.benchmark
>>> from shifthappens.models.torchvision import ResNet18
>>> # import an existing model or create a custom one inheriting from
>>> # shifthappens.models.base.Model and the relevant ModelMixins
>>> model = ResNet18()
>>> shifthappens.benchmark.evaluate_model(model, "path_to_store_tasks_data")
- shifthappens.benchmark.register_task(*, name, relative_data_folder, standalone=True)[source]#
Registers a task class inheriting from
shifthappens.tasks.base.Task as part of the benchmark.
- Parameters:
name (str) – Name of the task (can contain spaces or special characters).
relative_data_folder (str) – Name of the folder in which the data for this task will be saved, relative to the root data folder of the benchmark.
standalone (bool, default: True) – Whether this task is meaningful as a standalone task, or only relevant as part of a collection of tasks.
Examples
>>> @shifthappens.benchmark.register_task(
>>>     name="CustomTask",
>>>     relative_data_folder="path_to_store_task_data",
>>>     standalone=True
>>> )
>>> @dataclasses.dataclass
>>> class CustomTaskClass(Task):
>>>     ...
- shifthappens.benchmark.get_registered_tasks()[source]#
All tasks currently registered as part of the benchmark.
- Return type:
- Returns:
A tuple of all tasks currently registered as part of the benchmark. This tuple is used for task iteration in
shifthappens.benchmark.evaluate_model.
Task implementations#
Base definition of a task in the shift happens benchmark.
Fully defined tasks should subclass the Task abstract base class and implement the
mixins, also part of this module, that correspond to the model outputs required to evaluate the task.
Implementing a new task consists of the following steps:
1. Subclass the Task class and implement its abstract methods to specify the task setup and evaluation scheme.
2. Implement any number of mixins specified in shifthappens.tasks.mixins. You just need to include the mixin in the class definition, e.g. class MyTask(Task, ConfidenceTaskMixin), and do not need to implement additional methods. Specifying the mixin ensures that your task gets the correct model outputs. See the individual mixin classes, or the ModelResult class, for further details.
3. Register your class to the benchmark using the register_task decorator, along with a name and data path for your benchmark; see the sketch below.
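Putting these steps together, a minimal task could look as follows. This is an illustrative sketch, not part of the library: the task name, data folder, and method bodies are placeholders, and the import path for Metric is assumed to follow the module layout shown in these docs.
>>> import dataclasses
>>> import shifthappens.benchmark
>>> from shifthappens.tasks.base import Task
>>> from shifthappens.tasks.metrics import Metric
>>> from shifthappens.tasks.mixins import LabelTaskMixin
>>> from shifthappens.tasks.task_result import TaskResult
>>>
>>> @shifthappens.benchmark.register_task(
>>>     name="MyTask", relative_data_folder="my_task", standalone=True
>>> )
>>> @dataclasses.dataclass
>>> class MyTask(Task, LabelTaskMixin):
>>>     def setup(self):
>>>         ...  # download and prepare the dataset inside self.data_root
>>>
>>>     def _prepare_dataloader(self):
>>>         return None  # models get no access to unlabeled data by default
>>>
>>>     def _evaluate(self, model) -> TaskResult:
>>>         # the LabelTaskMixin ensures the model returns class labels;
>>>         # compare them against the ground truth here
>>>         ...
>>>         return TaskResult(
>>>             accuracy=0.0,  # placeholder metric value
>>>             summary_metrics={Metric.Robustness: "accuracy"},
>>>         )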
- shifthappens.tasks.base.T#
A generic type variable representing arbitrary types.
alias of TypeVar(‘T’)
- shifthappens.tasks.base.parameter(default, options, description=None)[source]#
Registers a task’s parameter. Setting multiple options here allows different flavours of the task to be created automatically. Use this field to store the values of a hyperparameter if you want to run the task with different hyperparameter settings. For example, you can use this mechanism to create tasks with varying difficulties without creating multiple classes/tasks.
- Parameters:
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     max_batch_size: Optional[int] = shifthappens.tasks.base.parameter(
>>>         default=typing.cast(Optional[int], None),
>>>         options=(32, 64, 128, None),  # None corresponds to dataset-sized batch
>>>         description="maximum size of batches fed to the model during evaluation",
>>>     )
>>>     ...
- shifthappens.tasks.base.variable(value)[source]#
Creates a non-parametric variable for a task, i.e. its value won’t be passed to the constructor. Use it to store constants such as links to the data.
- Parameters:
value (TypeVar(T)) – Value of the constant.
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     constant: str = shifthappens.tasks.base.variable("your constant")
>>>     ...
- shifthappens.tasks.base.abstract_variable()[source]#
Marks a variable as abstract such that a child class needs to override it with a non-abstract variable. See
variable for the non-abstract counterpart.
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     constant: str = shifthappens.tasks.base.abstract_variable()
>>>     ...
>>> @dataclasses.dataclass
>>> class InheritedTask(CustomTask):
>>>     constant: str = shifthappens.tasks.base.variable("your constant")
>>>     ...
- class shifthappens.tasks.base.Task(data_root)[source]#
Bases: ABC
Task base class.
Override the setup, _prepare_dataloader and _evaluate methods to define your task. Also make sure to add the appropriate mixins from shifthappens.tasks.mixins to specify which models your task is compatible with (e.g., specify that your task needs labels or confidences from a model).
To include the task in the benchmark, use the register_task decorator.
- Parameters:
data_root (str) – Folder where individual tasks can store their data. This field is initialized with the value passed to shifthappens.benchmark.evaluate_model.
- classmethod iterate_flavours(**kwargs)[source]#
Iterates over all possible task configurations, i.e., different settings of the parameter fields. Parameters should be defined with
shifthappens.tasks.base.parameter, where the options argument specifies the possible configurations of a particular parameter.
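As a sketch, a task with a parameter such as the max_batch_size example above could be instantiated in all its flavours as shown below; this assumes that the keyword arguments (here the data folder) are forwarded to the task constructor.
>>> # hypothetical usage; CustomTask defines `max_batch_size` as a parameter
>>> for task in CustomTask.iterate_flavours(data_root="path_to_store_task_data"):
>>>     print(task.max_batch_size)  # one flavour per entry in `options`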
- abstract setup()[source]#
Set the task up, i.e., download, load and prepare the dataset.
Examples
>>> # imagenet_r example
>>> @shifthappens.benchmark.register_task(
>>>     ...
>>> )
>>> @dataclasses.dataclass
>>> class ImageNetR(Task):
>>>     ...
>>>     def setup(self):
>>>         dataset_folder = os.path.join(self.data_root, "imagenet-r")
>>>         if not os.path.exists(dataset_folder):  # download data
>>>             for file_name, url, md5 in self.resources:
>>>                 sh_utils.download_and_extract_archive(
>>>                     url, self.data_root, md5, file_name
>>>                 )
>>>     ...
- evaluate(model)[source]#
Validates that the model is compatible with the task and then evaluates the model’s performance using the
_evaluate function of this class.
- _prepare_dataloader()[source]#
Prepares a shifthappens.data.base.DataLoader containing just the unlabeled images, which is passed to the model before the actual evaluation. This gives models access to the unlabeled data so that they can, e.g., run unsupervised adaptation mechanisms such as domain adaptation or re-calibration.
Note that this function could also be used to introduce domain shifts for such adaptation methods, by creating a different dataloader in this prepare function than the one used during evaluate.
By default, no shifthappens.data.base.DataLoader is returned, i.e., the models do not get access to the unlabeled data.
- Return type:
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     ...
>>>     def _prepare_dataloader(self) -> DataLoader:
>>>         ...
>>>         return shifthappens.data.base.DataLoader(dataset, max_batch_size)
>>>     ...
- abstract _evaluate(model)[source]#
Evaluate the task and return a dictionary with the calculated metrics.
- Parameters:
model (shifthappens.models.base.Model) – The passed model implements a predict function returning an iterator over shifthappens.models.base.ModelResult. Each result contains predictions such as the class labels assigned to the images, confidences, etc., depending on which mixins this task implements to request these prediction outputs.
- Returns:
The results of the task in the form of a shifthappens.tasks.task_result.TaskResult containing an arbitrary dictionary of metrics, along with a specification of which of these metrics are main results/summary metrics for the task.
- Return type:
Examples
>>> # imagenet_r example
>>> @shifthappens.benchmark.register_task(
>>>     ...
>>> )
>>> @dataclasses.dataclass
>>> class ImageNetR(Task):
>>>     ...
>>>     def _evaluate(self, model: shifthappens.models.base.Model) -> TaskResult:
>>>         dataloader = self._prepare_dataloader()
>>>         all_predicted_labels_list = []
>>>         for predictions in model.predict(
>>>             dataloader, PredictionTargets(class_labels=True)
>>>         ):
>>>             all_predicted_labels_list.append(predictions.class_labels)
>>>         all_predicted_labels = np.concatenate(all_predicted_labels_list, 0)
>>>
>>>         accuracy = all_predicted_labels == np.array(self.ch_dataset.targets)
>>>
>>>         return TaskResult(
>>>             accuracy=accuracy, summary_metrics={Metric.Robustness: "accuracy"}
>>>         )
>>>     ...
Task mixins indicate the requirements of the task on the model in terms of the model’s supported prediction types.
- class shifthappens.tasks.mixins.LabelTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the predicted labels.
Tasks implementing this mixin will be provided with the class_labels attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.ConfidenceTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the confidence scores.
Tasks implementing this mixin will be provided with the confidences attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.UncertaintyTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the uncertainty scores.
Tasks implementing this mixin will be provided with the uncertainties attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.OODScoreTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the OOD scores.
Tasks implementing this mixin will be provided with the ood_scores attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.FeaturesTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the raw features.
Tasks implementing this mixin will be provided with the features attribute in the shifthappens.models.base.ModelResult returned during evaluation.
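As a sketch, a hypothetical task that needs both confidences and OOD scores from the model simply lists the corresponding mixins in its class definition:
>>> import dataclasses
>>> from shifthappens.tasks.base import Task
>>> from shifthappens.tasks.mixins import ConfidenceTaskMixin, OODScoreTaskMixin
>>>
>>> @dataclasses.dataclass
>>> class MyOODTask(Task, ConfidenceTaskMixin, OODScoreTaskMixin):
>>>     # the ModelResults passed during evaluation will carry the
>>>     # `confidences` and `ood_scores` attributes
>>>     ...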
Class for representing the results of a single task.
- class shifthappens.tasks.task_result.TaskResult(*, summary_metrics, **metrics)[source]#
Bases: object
Contains the results of a task, which can be arbitrary metrics. At least one of these metrics must be referenced as a summary metric.
- Parameters:
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     ...
>>>     def _evaluate(self, model: shifthappens.models.base.Model) -> TaskResult:
>>>         ...
>>>         return TaskResult(
>>>             your_robustness_metric=your_robustness_metric,
>>>             your_calibration_metric=your_calibration_metric,
>>>             your_custom_metric=1.0 - your_custom_metric,
>>>             summary_metrics={
>>>                 Metric.Robustness: "your_robustness_metric",
>>>                 Metric.Calibration: "your_calibration_metric",
>>>             },
>>>         )
>>>     ...
Categories in which metrics of tasks will be grouped.
Configuration#
Global configuration options for the benchmark.
- class shifthappens.config.Config(imagenet_validation_path='shifthappens/imagenet', imagenet21k_preprocessed_validation_path='shifthappens/imagenet21k-p', cache_directory_path='shifthappens/cache', verbose=False)[source]#
Bases: object
Global configuration for the benchmark.
- Config options are resolved in the following order:
1. By setting variables explicitly when getting the instance via shifthappens.config.get_config.
2. By setting an environment variable, prefixed as “SH_VARIABLE_NAME”.
3. By relying on the default values defined in this class.
- Parameters:
- imagenet21k_preprocessed_validation_path: str = 'shifthappens/imagenet21k-p'#
The ImageNet21k-P validation path.
- shifthappens.config.get_config(**init_kwargs)[source]#
Returns a global config initialized with the provided arguments. This allows you to change the default paths to the ImageNet validation set, cached model results, etc. Note that reinitializing the config will raise an error. For more details see
get_instance.
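For example, paths can be overridden explicitly when first requesting the config; the path below is a placeholder:
>>> import shifthappens.config
>>> config = shifthappens.config.get_config(
>>>     imagenet_validation_path="/data/imagenet/val"  # placeholder path
>>> )
>>> config.imagenet_validation_path
'/data/imagenet/val'
Alternatively, following the prefix convention above, the same option could be set via the environment variable SH_IMAGENET_VALIDATION_PATH before the config is first created.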
Accessing Scores for the ImageNet Validation Set#
This module provides functionality to access the predictions of a model for the ImageNet validation set. Further, the predictions can be cached and loaded from the cache to reduce computational costs. For path configuration, please have a look at shifthappens.config.
- shifthappens.data.imagenet.get_imagenet_validation_loader(max_batch_size=128)[source]#
Creates a shifthappens.data.base.DataLoader for the validation set of ImageNet. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified.
- Parameters:
max_batch_size (default: 128) – Maximum number of samples allowed per batch.
- Return type:
- Returns:
ImageNet validation set data loader.
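A minimal usage sketch, assuming the validation path is configured; `model` stands for any shifthappens model:
>>> from shifthappens.data.imagenet import get_imagenet_validation_loader
>>> from shifthappens.models.base import PredictionTargets
>>>
>>> val_loader = get_imagenet_validation_loader(max_batch_size=64)
>>> # e.g., feed the loader to a model's predict function
>>> results = model.predict(val_loader, PredictionTargets(class_labels=True))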
- shifthappens.data.imagenet.get_cached_predictions(cls)[source]#
Checks whether cached results exist for the model’s class and, if so, returns them. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified.
- Parameters:
cls – Model’s class, used for specifying the folder name.
- Return type:
- Returns:
Dictionary of loaded model predictions on ImageNet validation set.
- shifthappens.data.imagenet.cache_predictions(cls, imagenet_validation_result)[source]#
Caches model predictions in a cls-named folder, from which they can later be loaded. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified, as well as shifthappens.config.cache_directory_path.
- Parameters:
cls – Model’s class, used for specifying the folder name.
imagenet_validation_result (ModelResult) – The model’s predictions on the ImageNet validation set.
- shifthappens.data.imagenet.is_cached(cls)[source]#
Checks if the model’s results are cached in a cls-named folder. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified, as well as shifthappens.config.cache_directory_path.
- Parameters:
cls – Model’s class, used for specifying the folder name.
- Return type:
- Returns:
True if the model’s results are cached, False otherwise.
- shifthappens.data.imagenet.load_imagenet_targets()[source]#
Returns the ground-truth labels of the ImageNet validation set.
- Return type:
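A sketch of how these caching helpers fit together; `model` and `result` (the model’s ModelResult on the validation set) are placeholders:
>>> from shifthappens.data import imagenet as sh_imagenet
>>>
>>> if not sh_imagenet.is_cached(type(model)):
>>>     # `result` is the model's ModelResult on the validation set
>>>     sh_imagenet.cache_predictions(type(model), result)
>>> predictions = sh_imagenet.get_cached_predictions(type(model))
>>> targets = sh_imagenet.load_imagenet_targets()  # ground-truth labels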
Model implementations#
Base classes and helper functions for adding models to the benchmark.
To add a new model, implement a new wrapper class inheriting from
shifthappens.models.base.Model, and from any of the mixins defined
in shifthappens.models.mixins.
Model results should be converted to numpy.ndarray objects and
packed into a shifthappens.models.base.ModelResult instance.
- class shifthappens.models.base.ModelResult(class_labels, confidences=None, uncertainties=None, ood_scores=None, features=None)[source]#
Bases: object
Outputs of a model after processing a batch of data.
Each model needs to return class labels that are compatible with the ILSVRC2012 labels. We use the same convention as PyTorch regarding the ordering of labels.
- Parameters:
class_labels (ndarray) – (N, k), top-k predictions for each sample in the batch. The choice of k can be selected by the user, and potentially influences the type of accuracy-based benchmarks that the model can be run on. For standard ImageNet and ImageNet-C evaluation, choose at least k=5.
confidences (Optional[ndarray], default: None) – (N, 1000), confidences for each class. Standard PyTorch ImageNet class label order is expected for this array. Scores can be in the range -inf to inf.
uncertainties (Optional[ndarray], default: None) – (N, 1000), uncertainties for the different class predictions. Different from the confidences, this is a measure of certainty of the given confidences, common e.g. in Bayesian deep neural networks.
ood_scores (Optional[ndarray], default: None) – (N,), score for interpreting the sample as an out-of-distribution class, in the range -inf to inf.
features (Optional[ndarray], default: None) – (N, d), where d can be arbitrary; the feature representation used to arrive at the given predictions.
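For illustration, a result for a batch of four samples might be packed as follows; the zero arrays are placeholders for real model outputs:
>>> import numpy as np
>>> from shifthappens.models.base import ModelResult
>>>
>>> n = 4  # batch size
>>> result = ModelResult(
>>>     class_labels=np.zeros((n, 5), dtype=int),  # (N, k) top-k labels, k=5
>>>     confidences=np.zeros((n, 1000)),           # (N, 1000) per-class scores
>>>     ood_scores=np.zeros((n,)),                 # (N,) per-sample OOD scores
>>> )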
- class shifthappens.models.base.PredictionTargets(class_labels=False, confidences=False, uncertainties=False, ood_scores=False, features=False)[source]#
Bases: object
Contains boolean flags indicating which types of targets the model predicts. Note that at least one flag should be set to True, and the model should inherit from the corresponding ModelMixin.
- Parameters:
class_labels (bool, default: False) – Set to True if the model returns predicted labels.
confidences (bool, default: False) – Set to True if the model returns confidences.
uncertainties (bool, default: False) – Set to True if the model returns uncertainties.
ood_scores (bool, default: False) – Set to True if the model returns OOD scores.
features (bool, default: False) – Set to True if the model returns features.
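For example, a task that evaluates labels and confidences would request:
>>> from shifthappens.models.base import PredictionTargets
>>> targets = PredictionTargets(class_labels=True, confidences=True)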
- class shifthappens.models.base.Model[source]#
Bases: ABC
Model base class.
Override the _predict method to define the prediction types of your specific model. If your model uses unsupervised adaptation mechanisms, override prepare as well.
Also make sure that your model inherits from the mixins in shifthappens.models.mixins corresponding to your model’s prediction types (e.g., LabelModelMixin for labels or ConfidenceModelMixin for confidences).
- property imagenet_validation_result#
Access the model’s predictions/evaluation results on the ImageNet validation set.
- Returns:
The model’s evaluation results on the ImageNet validation set, wrapped in a ModelResult.
- prepare(dataloader)[source]#
If the model uses unsupervised adaptation mechanisms, this method runs them.
- Parameters:
dataloader (DataLoader) – Dataloader producing batches of data.
- predict(input_dataloader, targets)[source]#
Yields the predictions of the model for all data samples contained in the dataloader.
- Parameters:
input_dataloader (DataLoader) – Dataloader producing batches of data.
targets (PredictionTargets) – Indicates which kinds of targets should be predicted.
- Return type:
- Returns:
Prediction results for the given batch. Depending on the target arguments this includes the predicted labels, class confidences, class uncertainties, ood scores, and image features, all as numpy.ndarray objects.
- _get_imagenet_predictions(rewrite=False)[source]#
Loads the model’s cached predictions on the ImageNet validation set, or predicts on the ImageNet validation set and caches the result whenever there are no cached predictions for the model or the rewrite argument is set to True.
- Parameters:
rewrite (default: False) – True if the model’s predictions need to be rewritten.
- _predict_imagenet_val()[source]#
Evaluates the model on the ImageNet validation set and stores all possible target scores for the particular model.
- abstract _predict(input_dataloader, targets)[source]#
Override this function for the specific model.
- Parameters:
input_dataloader (DataLoader) – Dataloader producing batches of data.
targets (PredictionTargets) – Indicates which kinds of targets should be predicted.
- Return type:
- Returns:
Yields prediction results for all batches yielded by the dataloader. Depending on the target arguments the model results may include the predicted labels, class confidences, class uncertainties, ood scores, and image features, all as numpy.ndarray objects.
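As a sketch, a hypothetical toy model returning constant top-5 labels could be wrapped as follows. This is illustrative only; in particular, the way batches are obtained from the dataloader is an assumption here (see shifthappens.data.base.DataLoader for the actual iteration API).
>>> import numpy as np
>>> from shifthappens.models import base as sh_models
>>> from shifthappens.models.mixins import LabelModelMixin
>>>
>>> class ConstantModel(sh_models.Model, LabelModelMixin):
>>>     def _predict(self, input_dataloader, targets):
>>>         # `targets` is ignored in this toy sketch; assumption:
>>>         # iterating the loader yields batches of images
>>>         for batch in input_dataloader:
>>>             n = len(batch)
>>>             yield sh_models.ModelResult(
>>>                 class_labels=np.tile(np.arange(5), (n, 1))  # (N, 5)
>>>             )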
Model mixins indicate the supported prediction types of a model.
- class shifthappens.models.mixins.LabelModelMixin[source]#
Bases: object
Inherit from this class if your model returns predicted labels.
- class shifthappens.models.mixins.ConfidenceModelMixin[source]#
Bases: object
Inherit from this class if your model returns confidences.
- class shifthappens.models.mixins.UncertaintyModelMixin[source]#
Bases: object
Inherit from this class if your model returns uncertainties.
- class shifthappens.models.mixins.OODScoreModelMixin[source]#
Bases: object
Inherit from this class if your model returns ood scores.
- class shifthappens.models.mixins.FeaturesModelMixin[source]#
Bases: object
Inherit from this class if your model returns features.
Model implementations from torchvision#
Model baselines from torchvision.
- class shifthappens.models.torchvision.__TorchvisionPreProcessingMixin[source]#
Bases: object
Performs the default preprocessing for torchvision ImageNet classifiers.
- class shifthappens.models.torchvision.__TorchvisionModel(model, feature_layer, max_batch_size, device='cpu')[source]#
Bases: Model, __TorchvisionPreProcessingMixin, LabelModelMixin, ConfidenceModelMixin, FeaturesModelMixin, OODScoreModelMixin
Wraps a torchvision model.
- Parameters:
- _predict(input_dataloader, targets)#
Override this function for the specific model.
- Parameters:
input_dataloader (DataLoader) – Dataloader producing batches of data.
targets (PredictionTargets) – Indicates which kinds of targets should be predicted.
- Return type:
- Returns:
Yields prediction results for all batches yielded by the dataloader. Depending on the target arguments the model results may include the predicted labels, class confidences, class uncertainties, ood scores, and image features, all as numpy.ndarray objects.
- class shifthappens.models.torchvision.ResNet18(max_batch_size=16, device='cpu')[source]#
Bases: __TorchvisionModel
ResNet18 network trained on the ImageNet 2012 train set from torchvision. See torchvision.models.resnet18 for details.
- Parameters:
max_batch_size (int, default: 16) – Maximum number of samples allowed per batch.
device (str, default: 'cpu') – Selected device to run the model on.
- class shifthappens.models.torchvision.ResNet50(max_batch_size=16, device='cpu')[source]#
Bases: __TorchvisionModel
Load a ResNet50 network trained on the ImageNet 2012 train set from torchvision. See torchvision.models.resnet50 for details.
- class shifthappens.models.torchvision.VGG16(max_batch_size=16, device='cpu')[source]#
Bases: __TorchvisionModel
Load a VGG16 network trained on the ImageNet 2012 train set from torchvision. See torchvision.models.vgg16 for details.
Data loading#
Base classes and helper functions for data handling (dataset, dataloader).
- class shifthappens.data.base.Dataset[source]#
Bases: ABC
An abstract class representing an iterable dataset. Your iterable datasets should inherit from this class.
- class shifthappens.data.base.IndexedDataset[source]#
Bases: Dataset
A class representing a map-style dataset. Your map-style datasets should inherit from this class.
- class shifthappens.data.base.DataLoader(dataset, max_batch_size)[source]#
Bases: object
Interface between models and tasks that implements restrictions (e.g., max batch size) for models.
- Parameters:
- shifthappens.data.base.shuffle_data(*, data, seed)[source]#
Randomly shuffles (without replacement) a numpy.ndarray or a list of numpy.ndarray objects with a fixed random seed.
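A small sketch, assuming the function returns the shuffled data:
>>> import numpy as np
>>> from shifthappens.data.base import shuffle_data
>>>
>>> labels = np.arange(10)
>>> shuffled = shuffle_data(data=labels, seed=42)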
Wrappers for PyTorch datasets such that they can be used as datasets for the benchmark.
- class shifthappens.data.torch.TorchDataset(torch_dataset)[source]#
Bases: Dataset
Wraps a torch iterable dataset (i.e., torch.utils.data.IterableDataset).
- Parameters:
torch_dataset (IterableDataset) – Dataset from which to load the data.
- class shifthappens.data.torch.IndexedTorchDataset(torch_dataset)[source]#
Bases: IndexedDataset
Wraps a torch map-style dataset (i.e., torch.utils.data.Dataset).
- Parameters:
torch_dataset (Dataset) – Dataset from which to load the data.
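For example, a torchvision image-folder dataset could be wrapped and turned into a benchmark dataloader as sketched below; the folder path is a placeholder:
>>> import torchvision.datasets
>>> import shifthappens.data.base as sh_data
>>> import shifthappens.data.torch as sh_data_torch
>>>
>>> torch_dataset = torchvision.datasets.ImageFolder("path/to/images")
>>> dataset = sh_data_torch.IndexedTorchDataset(torch_dataset)
>>> dataloader = sh_data.DataLoader(dataset, max_batch_size=128)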
Storing tasks within the benchmark#
Class for storing a task’s metadata.
- class shifthappens.task_data.task_metadata.TaskMetadata(name, relative_data_folder, standalone=True)[source]#
Bases: object
Class for storing a task’s metadata, as required by the task registration mechanism. The arguments are the same as those passed to shifthappens.benchmark.register_task.
- Parameters:
name (str) – Name of the task (can contain spaces or special characters).
relative_data_folder (str) – Name of the folder in which the data for this task will be saved, relative to the root data folder of the benchmark.
standalone (bool, default: True) – Whether this task is meaningful as a standalone task, or only relevant as part of a collection of tasks.
Class for storing a task’s registration for the benchmark.
- class shifthappens.task_data.task_registration.TaskRegistration(cls, metadata)[source]#
Bases: object
Class for storing a task’s registration for the benchmark. The arguments are initialized automatically during the task registration process.
- Parameters:
metadata (TaskMetadata) – Task metadata passed with shifthappens.benchmark.register_task.