Welcome to the API docs of the ShiftHappens benchmark!#
We aim to create a community-built benchmark suite for ImageNet models, composed of new datasets for OOD robustness and detection, as well as new tasks for existing OOD datasets.
While the popularity of robustness benchmarks and new test datasets has increased over the past years, the performance of computer vision models is still largely evaluated directly on ImageNet, or on simulated or isolated distribution shifts like those in ImageNet-C.
The goal of this workshop is to enhance the landscape of robustness evaluation datasets for computer vision and to devise new test sets and metrics for quantifying desirable properties of computer vision models. We aim to bring the robustness, domain adaptation, and out-of-distribution detection communities together to work on a new broad-scale benchmark that tests diverse aspects of current computer vision models and guides the way towards the next generation of models.
A submission to the workshop consists of adding a Task, which will be used to test the performance
of various computer vision models on the new evaluation task you specify. Below we provide documentation
for the shifthappens API.
Also make sure to look at the examples in the GitHub repository. If in doubt, or if the API is not yet flexible enough for your needs, consider opening an issue on GitHub or joining our Slack channel.
Benchmark#
Base functions to register new tasks to the benchmark and evaluate models.
To register a new task, decorate a task class inheriting from
shifthappens.tasks.base.Task with the shifthappens.benchmark.register_task
function.
To evaluate a model on all registered tasks, run
shifthappens.benchmark.evaluate_model.
- shifthappens.benchmark.evaluate_model(model, data_root)[source]#
Runs all registered tasks of the benchmark which are supported by the supplied model.
- Parameters:
- Return type:
- Returns:
Associates shifthappens.task_data.task_metadata.TaskMetadata with the respective shifthappens.tasks.task_result.TaskResult.
Examples
>>> import shifthappens.benchmark
>>> from shifthappens.models.torchvision import ResNet18
>>> # import an existing model or create a custom one inheriting from
>>> # shifthappens.models.base.Model and the relevant ModelMixins
>>> model = ResNet18()
>>> shifthappens.benchmark.evaluate_model(model, "path_to_store_tasks_data")
- shifthappens.benchmark.register_task(*, name, relative_data_folder, standalone=True)[source]#
Registers a task class inheriting from
shifthappens.tasks.base.Task as part of the benchmark.
- Parameters:
name (str) – Name of the task (can contain spaces or special characters).
relative_data_folder (str) – Name of the folder in which the data for this task will be saved, relative to the root data folder of the benchmark.
standalone (bool, default: True) – Whether this task is meaningful as a standalone task, or only relevant as part of a collection of tasks.
Examples
>>> @shifthappens.benchmark.register_task(
>>>     name="CustomTask",
>>>     relative_data_folder="path_to_store_task_data",
>>>     standalone=True
>>> )
>>> @dataclasses.dataclass
>>> class CustomTaskClass(Task):
>>>     ...
- shifthappens.benchmark.get_registered_tasks()[source]#
All tasks currently registered as part of the benchmark.
- Return type:
- Returns:
A tuple of all tasks currently registered as part of the benchmark. This tuple is used for task iteration in
shifthappens.benchmark.evaluate_model.
Task implementations#
Base definition of a task in the shift happens benchmark.
Fully defined tasks should subclass the Task abstract base class and implement the
mixins, also part of this module, that correspond to the model outputs required to evaluate the task.
Implementing a new task consists of the following steps:
1. Subclass the Task class and implement its abstract methods to specify the task setup and evaluation scheme.
2. Implement any number of mixins specified in shifthappens.tasks.mixins. You just need to include the mixin in the class definition, e.g. class MyTask(Task, ConfidenceTaskMixin), and do not need to implement additional methods. Specifying the mixin ensures that your task gets the correct model outputs. See the individual mixin classes, or the ModelResult class, for further details.
3. Register your class to the benchmark using the register_task decorator, along with a name and data path for your benchmark; see the sketch below.
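Putting these steps together, a minimal task could look as follows. This is an illustrative sketch, not part of the library: the task name, data folder, and method bodies are placeholders, and the import path for Metric is assumed to follow the module layout shown in these docs.
>>> import dataclasses
>>> import shifthappens.benchmark
>>> from shifthappens.tasks.base import Task
>>> from shifthappens.tasks.metrics import Metric
>>> from shifthappens.tasks.mixins import LabelTaskMixin
>>> from shifthappens.tasks.task_result import TaskResult
>>>
>>> @shifthappens.benchmark.register_task(
>>>     name="MyTask", relative_data_folder="my_task", standalone=True
>>> )
>>> @dataclasses.dataclass
>>> class MyTask(Task, LabelTaskMixin):
>>>     def setup(self):
>>>         ...  # download and prepare the dataset inside self.data_root
>>>
>>>     def _prepare_dataloader(self):
>>>         return None  # models get no access to unlabeled data by default
>>>
>>>     def _evaluate(self, model) -> TaskResult:
>>>         # the LabelTaskMixin ensures the model returns class labels;
>>>         # compare them against the ground truth here
>>>         ...
>>>         return TaskResult(
>>>             accuracy=0.0,  # placeholder metric value
>>>             summary_metrics={Metric.Robustness: "accuracy"},
>>>         )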
- shifthappens.tasks.base.T#
A generic type variable representing arbitrary types.
alias of TypeVar(‘T’)
- shifthappens.tasks.base.parameter(default, options, description=None)[source]#
Registers a task’s parameter. Setting multiple options here allows different flavours of the task to be created automatically. Use this field to store the values of a hyperparameter if you want to run the task with different hyperparameter settings. For example, you can use this mechanism to create tasks with varying difficulties without creating multiple classes/tasks.
- Parameters:
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     max_batch_size: Optional[int] = shifthappens.tasks.base.parameter(
>>>         default=typing.cast(Optional[int], None),
>>>         options=(32, 64, 128, None),  # None corresponds to dataset-sized batch
>>>         description="maximum size of batches fed to the model during evaluation",
>>>     )
>>>     ...
- shifthappens.tasks.base.variable(value)[source]#
Creates a non-parametric variable for a task, i.e. its value won’t be passed to the constructor. Use it to store constants such as links to the data.
- Parameters:
value (TypeVar(T)) – Value of the constant.
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     constant: str = shifthappens.tasks.base.variable("your constant")
>>>     ...
- shifthappens.tasks.base.abstract_variable()[source]#
Marks a variable as abstract such that a child class needs to override it with a non-abstract variable. See
variable for the non-abstract counterpart.
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     constant: str = shifthappens.tasks.base.abstract_variable()
>>>     ...
>>> @dataclasses.dataclass
>>> class InheritedTask(CustomTask):
>>>     constant: str = shifthappens.tasks.base.variable("your constant")
>>>     ...
- class shifthappens.tasks.base.Task(data_root)[source]#
Bases: ABC
Task base class.
Override the setup, _prepare_dataloader and _evaluate methods to define your task. Also make sure to add the appropriate mixins from shifthappens.tasks.mixins to specify which models your task is compatible with (e.g., specify that your task needs labels or confidences from a model).
To include the task in the benchmark, use the register_task decorator.
- Parameters:
data_root (str) – Folder where individual tasks can store their data. This field is initialized with the value passed to shifthappens.benchmark.evaluate_model.
- classmethod iterate_flavours(**kwargs)[source]#
Iterates over all possible task configurations, i.e., different settings of the parameter fields. Parameters should be defined with
shifthappens.tasks.base.parameter, where the options argument specifies the possible configurations of a particular parameter.
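As a sketch, a task with a parameter such as the max_batch_size example above could be instantiated in all its flavours as shown below; this assumes that the keyword arguments (here the data folder) are forwarded to the task constructor.
>>> # hypothetical usage; CustomTask defines `max_batch_size` as a parameter
>>> for task in CustomTask.iterate_flavours(data_root="path_to_store_task_data"):
>>>     print(task.max_batch_size)  # one flavour per entry in `options`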
- abstract setup()[source]#
Set the task up, i.e., download, load and prepare the dataset.
Examples
>>> # imagenet_r example
>>> @shifthappens.benchmark.register_task(
>>>     ...
>>> )
>>> @dataclasses.dataclass
>>> class ImageNetR(Task):
>>>     ...
>>>     def setup(self):
>>>         dataset_folder = os.path.join(self.data_root, "imagenet-r")
>>>         if not os.path.exists(dataset_folder):  # download data
>>>             for file_name, url, md5 in self.resources:
>>>                 sh_utils.download_and_extract_archive(
>>>                     url, self.data_root, md5, file_name
>>>                 )
>>>     ...
- evaluate(model)[source]#
Validates that the model is compatible with the task and then evaluates the model’s performance using the
_evaluate function of this class.
- _prepare_dataloader()[source]#
Prepares a shifthappens.data.base.DataLoader containing just the unlabeled images, which is passed to the model before the actual evaluation. This gives models access to the unlabeled data so that they can, e.g., run unsupervised adaptation mechanisms such as domain adaptation or re-calibration.
Note that this function could also be used to introduce domain shifts for such adaptation methods, by creating a different dataloader in this prepare function than the one used during evaluate.
By default, no shifthappens.data.base.DataLoader is returned, i.e., the models do not get access to the unlabeled data.
- Return type:
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     ...
>>>     def _prepare_dataloader(self) -> DataLoader:
>>>         ...
>>>         return shifthappens.data.base.DataLoader(dataset, max_batch_size)
>>>     ...
- abstract _evaluate(model)[source]#
Evaluate the task and return a dictionary with the calculated metrics.
- Parameters:
model (shifthappens.models.base.Model) – The passed model implements a predict function returning an iterator over shifthappens.models.base.ModelResult. Each result contains predictions such as the class labels assigned to the images, confidences, etc., depending on which mixins this task implements to request these prediction outputs.
- Returns:
The results of the task in the form of a shifthappens.tasks.task_result.TaskResult containing an arbitrary dictionary of metrics, along with a specification of which of these metrics are main results/summary metrics for the task.
- Return type:
Examples
>>> # imagenet_r example
>>> @shifthappens.benchmark.register_task(
>>>     ...
>>> )
>>> @dataclasses.dataclass
>>> class ImageNetR(Task):
>>>     ...
>>>     def _evaluate(self, model: shifthappens.models.base.Model) -> TaskResult:
>>>         dataloader = self._prepare_dataloader()
>>>         all_predicted_labels_list = []
>>>         for predictions in model.predict(
>>>             dataloader, PredictionTargets(class_labels=True)
>>>         ):
>>>             all_predicted_labels_list.append(predictions.class_labels)
>>>         all_predicted_labels = np.concatenate(all_predicted_labels_list, 0)
>>>
>>>         accuracy = all_predicted_labels == np.array(self.ch_dataset.targets)
>>>
>>>         return TaskResult(
>>>             accuracy=accuracy, summary_metrics={Metric.Robustness: "accuracy"}
>>>         )
>>>     ...
Task mixins indicate the requirements of the task on the model in terms of the model’s supported prediction types.
- class shifthappens.tasks.mixins.LabelTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the predicted labels.
Tasks implementing this mixin will be provided with the class_labels attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.ConfidenceTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the confidence scores.
Tasks implementing this mixin will be provided with the confidences attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.UncertaintyTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the uncertainty scores.
Tasks implementing this mixin will be provided with the uncertainties attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.OODScoreTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the OOD scores.
Tasks implementing this mixin will be provided with the ood_scores attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.FeaturesTaskMixin[source]#
Bases: object
Indicates that the task requires the model to return the raw features.
Tasks implementing this mixin will be provided with the features attribute in the shifthappens.models.base.ModelResult returned during evaluation.
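As a sketch, a hypothetical task that needs both confidences and OOD scores from the model simply lists the corresponding mixins in its class definition:
>>> import dataclasses
>>> from shifthappens.tasks.base import Task
>>> from shifthappens.tasks.mixins import ConfidenceTaskMixin, OODScoreTaskMixin
>>>
>>> @dataclasses.dataclass
>>> class MyOODTask(Task, ConfidenceTaskMixin, OODScoreTaskMixin):
>>>     # the ModelResults passed during evaluation will carry the
>>>     # `confidences` and `ood_scores` attributes
>>>     ...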
Class for representing the results of a single task.
- class shifthappens.tasks.task_result.TaskResult(*, summary_metrics, **metrics)[source]#
Bases: object
Contains the results of a task, which can be arbitrary metrics. At least one of these metrics must be referenced as a summary metric.
- Parameters:
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     ...
>>>     def _evaluate(self, model: shifthappens.models.base.Model) -> TaskResult:
>>>         ...
>>>         return TaskResult(
>>>             your_robustness_metric=your_robustness_metric,
>>>             your_calibration_metric=your_calibration_metric,
>>>             your_custom_metric=1.0 - your_custom_metric,
>>>             summary_metrics={
>>>                 Metric.Robustness: "your_robustness_metric",
>>>                 Metric.Calibration: "your_calibration_metric",
>>>             },
>>>         )
>>>     ...
Categories in which metrics of tasks will be grouped.
Configuration#
Global configuration options for the benchmark.
- class shifthappens.config.Config(imagenet_validation_path='shifthappens/imagenet', imagenet21k_preprocessed_validation_path='shifthappens/imagenet21k-p', cache_directory_path='shifthappens/cache', verbose=False)[source]#
Bases: object
Global configuration for the benchmark.
- Config options are resolved in the following order:
1. By setting variables explicitly when getting the instance via shifthappens.config.get_config.
2. By setting an environment variable, prefixed as “SH_VARIABLE_NAME”.
3. By relying on the default values defined in this class.
- Parameters:
- imagenet21k_preprocessed_validation_path: str = 'shifthappens/imagenet21k-p'#
The ImageNet21k-P validation path.
- shifthappens.config.get_config(**init_kwargs)[source]#
Returns a global config initialized with the provided arguments. This allows you to change the default paths to the ImageNet validation set, cached model results, etc. Note that reinitializing the config will raise an error. For more details see
get_instance.
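For example, paths can be overridden explicitly when first requesting the config; the path below is a placeholder:
>>> import shifthappens.config
>>> config = shifthappens.config.get_config(
>>>     imagenet_validation_path="/data/imagenet/val"  # placeholder path
>>> )
>>> config.imagenet_validation_path
'/data/imagenet/val'
Alternatively, following the prefix convention above, the same option could be set via the environment variable SH_IMAGENET_VALIDATION_PATH before the config is first created.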
Accessing Scores for the ImageNet Validation Set#
This module provides functionality to access the predictions of a model for the ImageNet validation set. Further, the predictions can be cached and loaded from the cache to reduce computational costs. For path configuration, please have a look at shifthappens.config.
- shifthappens.data.imagenet.get_imagenet_validation_loader(max_batch_size=128)[source]#
Creates a shifthappens.data.base.DataLoader for the validation set of ImageNet. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified.
- Parameters:
max_batch_size (default: 128) – Maximum number of samples allowed per batch.
- Return type:
- Returns:
ImageNet validation set data loader.
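A minimal usage sketch, assuming the validation path is configured; `model` stands for any shifthappens model:
>>> from shifthappens.data.imagenet import get_imagenet_validation_loader
>>> from shifthappens.models.base import PredictionTargets
>>>
>>> val_loader = get_imagenet_validation_loader(max_batch_size=64)
>>> # e.g., feed the loader to a model's predict function
>>> results = model.predict(val_loader, PredictionTargets(class_labels=True))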
- shifthappens.data.imagenet.get_cached_predictions(cls)[source]#
Checks whether cached results exist for the model’s class and, if so, returns them. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified.
- Parameters:
cls – Model’s class, used for specifying the folder name.
- Return type:
- Returns:
Dictionary of loaded model predictions on ImageNet validation set.
- shifthappens.data.imagenet.cache_predictions(cls, imagenet_validation_result)[source]#
Caches model predictions in a cls-named folder, from which they can later be loaded. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified, as well as shifthappens.config.cache_directory_path.
- Parameters:
cls – Model’s class, used for specifying the folder name.
imagenet_validation_result (ModelResult) – The model’s predictions on the ImageNet validation set.
- shifthappens.data.imagenet.is_cached(cls)[source]#
Checks if the model’s results are cached in a cls-named folder. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified, as well as shifthappens.config.cache_directory_path.
- Parameters:
cls – Model’s class, used for specifying the folder name.
- Return type:
- Returns:
True if the model’s results are cached, False otherwise.
- shifthappens.data.imagenet.load_imagenet_targets()[source]#
Returns the ground-truth labels of the ImageNet validation set.
- Return type:
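A sketch of how these caching helpers fit together; `model` and `result` (the model’s ModelResult on the validation set) are placeholders:
>>> from shifthappens.data import imagenet as sh_imagenet
>>>
>>> if not sh_imagenet.is_cached(type(model)):
>>>     # `result` is the model's ModelResult on the validation set
>>>     sh_imagenet.cache_predictions(type(model), result)
>>> predictions = sh_imagenet.get_cached_predictions(type(model))
>>> targets = sh_imagenet.load_imagenet_targets()  # ground-truth labels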
Model implementations#
Base classes and helper functions for adding models to the benchmark.
To add a new model, implement a new wrapper class inheriting from
shifthappens.models.base.Model, and from any of the mixins defined
in shifthappens.models.mixins.
Model results should be converted to numpy.ndarray objects and
packed into a shifthappens.models.base.ModelResult instance.
- class shifthappens.models.base.ModelResult(class_labels, confidences=None, uncertainties=None, ood_scores=None, features=None)[source]#
Bases: object
Outputs of a model after processing a batch of data.
Each model needs to return class labels that are compatible with the ILSVRC2012 labels. We use the same convention as PyTorch regarding the ordering of labels.
- Parameters:
class_labels (ndarray) – (N, k), top-k predictions for each sample in the batch. The choice of k can be selected by the user, and potentially influences the type of accuracy-based benchmarks that the model can be run on. For standard ImageNet and ImageNet-C evaluation, choose at least k=5.
confidences (Optional[ndarray], default: None) – (N, 1000), confidences for each class. Standard PyTorch ImageNet class label order is expected for this array. Scores can be in the range -inf to inf.
uncertainties (Optional[ndarray], default: None) – (N, 1000), uncertainties for the different class predictions. Different from the confidences, this is a measure of certainty of the given confidences, common e.g. in Bayesian deep neural networks.
ood_scores (Optional[ndarray], default: None) – (N,), score for interpreting the sample as an out-of-distribution class, in the range -inf to inf.
features (Optional[ndarray], default: None) – (N, d), where d can be arbitrary; the feature representation used to arrive at the given predictions.
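For illustration, a result for a batch of four samples might be packed as follows; the zero arrays are placeholders for real model outputs:
>>> import numpy as np
>>> from shifthappens.models.base import ModelResult
>>>
>>> n = 4  # batch size
>>> result = ModelResult(
>>>     class_labels=np.zeros((n, 5), dtype=int),  # (N, k) top-k labels, k=5
>>>     confidences=np.zeros((n, 1000)),           # (N, 1000) per-class scores
>>>     ood_scores=np.zeros((n,)),                 # (N,) per-sample OOD scores
>>> )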
- class shifthappens.models.base.PredictionTargets(class_labels=False, confidences=False, uncertainties=False, ood_scores=False, features=False)[source]#
Bases: object
Contains boolean flags indicating which types of targets the model predicts. Note that at least one flag should be set to True, and the model should inherit from the corresponding ModelMixin.
- Parameters:
class_labels (bool, default: False) – Set to True if the model returns predicted labels.
confidences (bool, default: False) – Set to True if the model returns confidences.
uncertainties (bool, default: False) – Set to True if the model returns uncertainties.
ood_scores (bool, default: False) – Set to True if the model returns OOD scores.
features (bool, default: False) – Set to True if the model returns features.
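For example, a task that evaluates labels and confidences would request:
>>> from shifthappens.models.base import PredictionTargets
>>> targets = PredictionTargets(class_labels=True, confidences=True)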
- class shifthappens.models.base.Model[source]#
Bases: ABC
Model base class.
Override the _predict method to define the prediction types of your specific model. If your model uses unsupervised adaptation mechanisms, override prepare as well.
Also make sure that your model inherits from the mixins in shifthappens.models.mixins corresponding to your model’s prediction types (e.g., LabelModelMixin for labels or ConfidenceModelMixin for confidences).
- property imagenet_validation_result#
Access the model’s predictions/evaluation results on the ImageNet validation set.
- Returns:
The model’s evaluation results on the ImageNet validation set, wrapped in a ModelResult.
- prepare(dataloader)[source]#
If the model uses unsupervised adaptation mechanisms, this method runs them.
- Parameters:
dataloader (DataLoader) – Dataloader producing batches of data.
- predict(input_dataloader, targets)[source]#
Yields the predictions of the model for all data samples contained in the dataloader.
- Parameters:
input_dataloader (DataLoader) – Dataloader producing batches of data.
targets (PredictionTargets) – Indicates which kinds of targets should be predicted.
- Return type:
- Returns:
Prediction results for the given batch. Depending on the target arguments this includes the predicted labels, class confidences, class uncertainties, ood scores, and image features, all as numpy.ndarray objects.
- _get_imagenet_predictions(rewrite=False)[source]#
Loads the model’s cached predictions on the ImageNet validation set, or predicts on the ImageNet validation set and caches the result whenever there are no cached predictions for the model or the rewrite argument is set to True.
- Parameters:
rewrite (default: False) – True if the model’s predictions need to be rewritten.
- _predict_imagenet_val()[source]#
Evaluates the model on the ImageNet validation set and stores all possible target scores for the particular model.
- abstract _predict(input_dataloader, targets)[source]#
Override this function for the specific model.
- Parameters:
input_dataloader (DataLoader) – Dataloader producing batches of data.
targets (PredictionTargets) – Indicates which kinds of targets should be predicted.
- Return type:
- Returns:
Yields prediction results for all batches yielded by the dataloader. Depending on the target arguments the model results may include the predicted labels, class confidences, class uncertainties, ood scores, and image features, all as numpy.ndarray objects.
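As a sketch, a hypothetical toy model returning constant top-5 labels could be wrapped as follows. This is illustrative only; in particular, the way batches are obtained from the dataloader is an assumption here (see shifthappens.data.base.DataLoader for the actual iteration API).
>>> import numpy as np
>>> from shifthappens.models import base as sh_models
>>> from shifthappens.models.mixins import LabelModelMixin
>>>
>>> class ConstantModel(sh_models.Model, LabelModelMixin):
>>>     def _predict(self, input_dataloader, targets):
>>>         # `targets` is ignored in this toy sketch; assumption:
>>>         # iterating the loader yields batches of images
>>>         for batch in input_dataloader:
>>>             n = len(batch)
>>>             yield sh_models.ModelResult(
>>>                 class_labels=np.tile(np.arange(5), (n, 1))  # (N, 5)
>>>             )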
Model mixins indicate the supported prediction types of a model.
- class shifthappens.models.mixins.LabelModelMixin[source]#
Bases: object
Inherit from this class if your model returns predicted labels.
- class shifthappens.models.mixins.ConfidenceModelMixin[source]#
Bases: object
Inherit from this class if your model returns confidences.
- class shifthappens.models.mixins.UncertaintyModelMixin[source]#
Bases: object
Inherit from this class if your model returns uncertainties.
- class shifthappens.models.mixins.OODScoreModelMixin[source]#
Bases: object
Inherit from this class if your model returns ood scores.
- class shifthappens.models.mixins.FeaturesModelMixin[source]#
Bases: object
Inherit from this class if your model returns features.
Model implementations from torchvision#
Model baselines from torchvision.
- class shifthappens.models.torchvision.__TorchvisionPreProcessingMixin[source]#
Bases: object
Performs the default preprocessing for torchvision ImageNet classifiers.
- class shifthappens.models.torchvision.__TorchvisionModel(model, feature_layer, max_batch_size, device='cpu')[source]#
Bases: Model, __TorchvisionPreProcessingMixin, LabelModelMixin, ConfidenceModelMixin, FeaturesModelMixin, OODScoreModelMixin
Wraps a torchvision model.
- Parameters:
- _predict(input_dataloader, targets)#
Override this function for the specific model.
- Parameters:
input_dataloader (DataLoader) – Dataloader producing batches of data.
targets (PredictionTargets) – Indicates which kinds of targets should be predicted.
- Return type:
- Returns:
Yields prediction results for all batches yielded by the dataloader. Depending on the target arguments the model results may include the predicted labels, class confidences, class uncertainties, ood scores, and image features, all as numpy.ndarray objects.
- class shifthappens.models.torchvision.ResNet18(max_batch_size=16, device='cpu')[source]#
Bases: __TorchvisionModel
ResNet18 network trained on the ImageNet 2012 train set from torchvision. See torchvision.models.resnet18 for details.
- Parameters:
max_batch_size (int, default: 16) – Maximum number of samples allowed per batch.
device (str, default: 'cpu') – Selected device to run the model on.
- class shifthappens.models.torchvision.ResNet50(max_batch_size=16, device='cpu')[source]#
Bases: __TorchvisionModel
Load a ResNet50 network trained on the ImageNet 2012 train set from torchvision. See torchvision.models.resnet50 for details.
- class shifthappens.models.torchvision.VGG16(max_batch_size=16, device='cpu')[source]#
Bases: __TorchvisionModel
Load a VGG16 network trained on the ImageNet 2012 train set from torchvision. See torchvision.models.vgg16 for details.
Data loading#
Base classes and helper functions for data handling (dataset, dataloader).
- class shifthappens.data.base.Dataset[source]#
Bases: ABC
An abstract class representing an iterable dataset. Your iterable datasets should inherit from this class.
- class shifthappens.data.base.IndexedDataset[source]#
Bases: Dataset
A class representing a map-style dataset. Your map-style datasets should inherit from this class.
- class shifthappens.data.base.DataLoader(dataset, max_batch_size)[source]#
Bases: object
Interface between models and tasks that implements restrictions (e.g., max batch size) for models.
- Parameters:
- shifthappens.data.base.shuffle_data(*, data, seed)[source]#
Randomly shuffles (without replacement) a numpy.ndarray or a list of numpy.ndarray objects with a fixed random seed.
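A small sketch, assuming the function returns the shuffled data:
>>> import numpy as np
>>> from shifthappens.data.base import shuffle_data
>>>
>>> labels = np.arange(10)
>>> shuffled = shuffle_data(data=labels, seed=42)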
Wrappers for PyTorch datasets such that they can be used as datasets for the benchmark.
- class shifthappens.data.torch.TorchDataset(torch_dataset)[source]#
Bases: Dataset
Wraps a torch iterable dataset (i.e., torch.utils.data.IterableDataset).
- Parameters:
torch_dataset (IterableDataset) – Dataset from which to load the data.
- class shifthappens.data.torch.IndexedTorchDataset(torch_dataset)[source]#
Bases: IndexedDataset
Wraps a torch map-style dataset (i.e., torch.utils.data.Dataset).
- Parameters:
torch_dataset (Dataset) – Dataset from which to load the data.
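For example, a torchvision image-folder dataset could be wrapped and turned into a benchmark dataloader as sketched below; the folder path is a placeholder:
>>> import torchvision.datasets
>>> import shifthappens.data.base as sh_data
>>> import shifthappens.data.torch as sh_data_torch
>>>
>>> torch_dataset = torchvision.datasets.ImageFolder("path/to/images")
>>> dataset = sh_data_torch.IndexedTorchDataset(torch_dataset)
>>> dataloader = sh_data.DataLoader(dataset, max_batch_size=128)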
Storing tasks within the benchmark#
Class for storing a task’s metadata.
- class shifthappens.task_data.task_metadata.TaskMetadata(name, relative_data_folder, standalone=True)[source]#
Bases: object
Class for storing a task’s metadata, as required by the task registration mechanism. The arguments are the same as those passed to shifthappens.benchmark.register_task.
- Parameters:
name (str) – Name of the task (can contain spaces or special characters).
relative_data_folder (str) – Name of the folder in which the data for this task will be saved, relative to the root data folder of the benchmark.
standalone (bool, default: True) – Whether this task is meaningful as a standalone task, or only relevant as part of a collection of tasks.
Class for storing a task’s registration for the benchmark.
- class shifthappens.task_data.task_registration.TaskRegistration(cls, metadata)[source]#
Bases: object
Class for storing a task’s registration for the benchmark. The arguments are initialized automatically during the task registration process.
- Parameters:
metadata (TaskMetadata) – Task metadata passed with shifthappens.benchmark.register_task.