Welcome to the API docs of the ShiftHappens benchmark!#
We aim to create a community-built benchmark suite for ImageNet models comprised of new datasets for OOD robustness and detection, as well as new tasks for existing OOD datasets.
While the popularity of robustness benchmarks and new test datasets increased over the past years, the performance of computer vision models is still largely evaluated on ImageNet directly, or on simulated or isolated distribution shifts like in ImageNet-C.
The goal of this workshop is to enhance the landscape of robustness evaluation datasets for computer vision and devise new test sets and metrics for quantifying desirable properties of computer vision models. Our goal is to bring the robustness, domain adaptation, and out-of-distribution detection communities together to work on a new broad-scale benchmark that tests diverse aspects of current computer vision models and guides the way towards the next generation of models.
Submissions to the workshop consist of adding a Task, which will be used to test the performance of various computer vision models on a new evaluation task you specify with your submission. Below we provide documentation for the shifthappens API.
Also make sure to look at the examples in the GitHub repository. If in doubt, or if the API is not yet sufficiently flexible to fit your needs, consider opening an issue on GitHub or joining our Slack channel.
Benchmark#
Base functions to register new tasks to the benchmark and evaluate models.
To register a new task, decorate a task class inherited from shifthappens.tasks.base.Task with the shifthappens.benchmark.register_task function.
To evaluate a model on all the registered tasks, run shifthappens.benchmark.evaluate_model.
- shifthappens.benchmark.evaluate_model(model, data_root)[source]#
Runs all registered tasks of the benchmark which are supported by the supplied model.
- Parameters:
- Return type:
- Returns:
Associates shifthappens.task_data.task_metadata.TaskMetadata with the respective shifthappens.tasks.task_result.TaskResult.
Examples
>>> import shifthappens.benchmark
>>> from shifthappens.models.torchvision import ResNet18
>>> # import existing model or create a custom one inherited from
>>> # shifthappens.models.base.Model and ModelMixin's
>>> model = ResNet18()
>>> shifthappens.benchmark.evaluate_model(model, "path_to_store_tasks_data")
- shifthappens.benchmark.register_task(*, name, relative_data_folder, standalone=True)[source]#
Registers a task class inherited from shifthappens.tasks.base.Task as part of the benchmark.
- Parameters:
name (str) – Name of the task (can contain spaces or special characters).
relative_data_folder (str) – Name of the folder in which the data for this task will be saved, relative to the root folder of the benchmark.
standalone (bool, default: True) – Whether this task is meaningful as a standalone task or is only relevant as part of a collection of tasks.
Examples
>>> @shifthappens.benchmark.register_task(
>>>     name="CustomTask",
>>>     relative_data_folder="path_to_store_task_data",
>>>     standalone=True
>>> )
>>> @dataclasses.dataclass
>>> class CustomTaskClass(Task):
...
- shifthappens.benchmark.get_registered_tasks()[source]#
All tasks currently registered as part of the benchmark.
- Return type:
- Returns:
A tuple of all tasks currently registered as part of the benchmark. This tuple is used for task iteration in shifthappens.benchmark.evaluate_model.
Task implementations#
Base definition of a task in the shift happens benchmark.
Fully defined tasks should subclass the Task abstract base class and implement all mixins, also part of this module, that correspond to the model outputs required to evaluate the task.
Implementing a new task consists of the following steps:
1. Subclass the Task class and implement its abstract methods to specify the task setup and evaluation scheme.
2. Implement any number of mixins specified in shifthappens.tasks.mixins. You just need to include the mixin in the class definition, e.g. class MyTask(Task, ConfidenceTaskMixin), and do not need to implement additional methods. Specifying the mixin ensures that your task gets the correct model outputs. See the individual mixin classes or the ModelResult class for further details.
3. Register your class with the benchmark using the register_task decorator, along with a name and data path for your task.
- shifthappens.tasks.base.T#
A generic representing arbitrary types.
alias of TypeVar(‘T’)
- shifthappens.tasks.base.parameter(default, options, description=None)[source]#
Register a task’s parameter. Setting multiple options here allows automatically creating different flavours of the task. Use this field for storing the values of a hyperparameter if you want to run the task with different hyperparameter values. For example, you can use this mechanism to create tasks with varying difficulties without creating multiple classes/tasks.
- Parameters:
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     max_batch_size: Optional[int] = shifthappens.tasks.base.parameter(
>>>         default=typing.cast(Optional[int], None),
>>>         options=(32, 64, 128, None),  # None corresponds to dataset-sized batch
>>>         description="maximum size of batches fed to the model during evaluation",
>>>     )
...
- shifthappens.tasks.base.variable(value)[source]#
Creates a non-parametric variable for a task, i.e. its value won’t be passed to the constructor. Use it to store constants such as links to the data.
- Parameters:
value (TypeVar(T)) – Value of the constant.
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     constant: str = shifthappens.tasks.base.variable("your constant")
...
- shifthappens.tasks.base.abstract_variable()[source]#
Marks a variable as abstract such that a child class needs to override it with a non-abstract variable. See variable for the non-abstract counterpart.
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     constant: str = shifthappens.tasks.base.abstract_variable()
...
>>> @dataclasses.dataclass
>>> class InheritedTask(CustomTask):
>>>     constant: str = shifthappens.tasks.base.variable("your constant")
...
- class shifthappens.tasks.base.Task(data_root)[source]#
Bases:
ABC
Task base class.
Override the setup, _prepare_dataloader and _evaluate methods to define your task. Also make sure to add additional mixins from shifthappens.tasks.mixins to specify which models your task is compatible with (e.g., specify that your task needs labels or confidences from a model).
To include the task in the benchmark, use the register_task decorator.
- Parameters:
data_root (str) – Folder where individual tasks can store their data. This field is initialized with the value passed to shifthappens.benchmark.evaluate_model.
- classmethod iterate_flavours(**kwargs)[source]#
Iterate over all possible task configurations, i.e. different settings of the parameter fields. Parameters should be defined with shifthappens.tasks.base.parameter, where the options argument lists the possible values of a particular parameter.
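To illustrate the idea of flavours, the sketch below enumerates the Cartesian product of all declared options. It is an assumption about how such an iteration could work, not the library's code: the parameter helper here is a simplified stand-in that stores its options in the dataclass field metadata, and ToyTask is a hypothetical class.

```python
import dataclasses
import itertools


def parameter(default, options):
    """Hypothetical stand-in: a dataclass field that remembers its options."""
    return dataclasses.field(default=default, metadata={"options": tuple(options)})


@dataclasses.dataclass
class ToyTask:
    severity: int = parameter(default=1, options=(1, 3, 5))
    grayscale: bool = parameter(default=False, options=(False, True))

    @classmethod
    def iterate_flavours(cls):
        """Yield one instance per combination of the declared options."""
        fields = [f for f in dataclasses.fields(cls) if "options" in f.metadata]
        names = [f.name for f in fields]
        option_sets = [f.metadata["options"] for f in fields]
        for values in itertools.product(*option_sets):
            yield cls(**dict(zip(names, values)))


# Three severities times two grayscale settings give six flavours.
flavours = list(ToyTask.iterate_flavours())
```

With this approach a single class definition covers every difficulty level, which is the motivation given for the parameter mechanism.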
- abstract setup()[source]#
Set the task up, i.e., download, load and prepare the dataset.
Examples
>>> # imagenet_r example
>>> @shifthappens.benchmark.register_task(
>>>     ...
>>> )
>>> @dataclasses.dataclass
>>> class ImageNetR(Task):
>>>     ...
>>>     def setup(self):
>>>         dataset_folder = os.path.join(self.data_root, "imagenet-r")
>>>         if not os.path.exists(dataset_folder):  # download data
>>>             for file_name, url, md5 in self.resources:
>>>                 sh_utils.download_and_extract_archive(
>>>                     url, self.data_root, md5, file_name
>>>                 )
>>>         ...
- evaluate(model)[source]#
Validates that the model is compatible with the task and then evaluates the model’s performance using the
_evaluate
function of this class.
- _prepare_dataloader()[source]#
Prepare a shifthappens.data.base.DataLoader based on just the unlabeled images, which will be passed to the model before the actual evaluation. This gives models access to the unlabeled data so that they can, e.g., run unsupervised domain adaptation techniques or other unsupervised adaptation mechanisms such as re-calibration.
Note that this function could also be used to introduce domain shifts for such adaptation methods, by creating a different dataloader in this prepare function than the one used during evaluate.
By default no shifthappens.data.base.DataLoader is returned, i.e., the models do not get access to the unlabeled data.
- Return type:
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     ...
>>>     def _prepare_dataloader(self) -> DataLoader:
>>>         ...
>>>         return shifthappens.data.base.DataLoader(dataset, max_batch_size)
>>>     ...
- abstract _evaluate(model)[source]#
Evaluate the task and return a dictionary with the calculated metrics.
- Parameters:
model (shifthappens.models.base.Model) – The passed model implements a predict function returning an iterator over shifthappens.models.base.ModelResult. Each result contains predictions such as the class labels assigned to the images, confidences, etc., depending on which mixins this task implements to request these prediction outputs.
- Returns:
The results of the task in the form of a shifthappens.tasks.task_result.TaskResult containing an arbitrary dictionary of metrics, along with a specification of which of these metrics are main results/summary metrics for the task.
- Return type:
Examples
>>> # imagenet_r example
>>> @shifthappens.benchmark.register_task(
>>>     ...
>>> )
>>> @dataclasses.dataclass
>>> class ImageNetR(Task):
>>>     ...
>>>     def _evaluate(self, model: shifthappens.models.base.Model) -> TaskResult:
>>>         dataloader = self._prepare_dataloader()
>>>         all_predicted_labels_list = []
>>>         for predictions in model.predict(
>>>             dataloader, PredictionTargets(class_labels=True)
>>>         ):
>>>             all_predicted_labels_list.append(predictions.class_labels)
>>>         all_predicted_labels = np.concatenate(all_predicted_labels_list, 0)
>>>
>>>         # fraction of correct predictions (a scalar, not a per-sample array)
>>>         accuracy = (
>>>             all_predicted_labels == np.array(self.ch_dataset.targets)
>>>         ).mean()
>>>
>>>         return TaskResult(
>>>             accuracy=accuracy, summary_metrics={Metric.Robustness: "accuracy"}
>>>         )
>>>     ...
Task mixins indicate the requirements of the task on the model in terms of the model’s supported prediction types.
- class shifthappens.tasks.mixins.LabelTaskMixin[source]#
Bases:
object
Indicates that the task requires the model to return the predicted labels.
Tasks implementing this mixin will be provided with the class_labels attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.ConfidenceTaskMixin[source]#
Bases:
object
Indicates that the task requires the model to return the confidence scores.
Tasks implementing this mixin will be provided with the confidences attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.UncertaintyTaskMixin[source]#
Bases:
object
Indicates that the task requires the model to return the uncertainty scores.
Tasks implementing this mixin will be provided with the uncertainties attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.OODScoreTaskMixin[source]#
Bases:
object
Indicates that the task requires the model to return the OOD scores.
Tasks implementing this mixin will be provided with the ood_scores attribute in the shifthappens.models.base.ModelResult returned during evaluation.
- class shifthappens.tasks.mixins.FeaturesTaskMixin[source]#
Bases:
object
Indicates that the task requires the model to return the raw features.
Tasks implementing this mixin will be provided with the features attribute in the shifthappens.models.base.ModelResult returned during evaluation.
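Since task mixins state what a task needs and model mixins state what a model provides, a compatibility check can be expressed as a lookup between the two. The sketch below shows one possible way to do this with empty marker classes. The REQUIRED_MODEL_MIXIN mapping, the is_compatible helper and the toy task and model classes are hypothetical and only illustrate the idea; the real check lives inside the library.

```python
class LabelTaskMixin:
    """Marker: the task needs predicted labels."""


class ConfidenceTaskMixin:
    """Marker: the task needs confidence scores."""


class LabelModelMixin:
    """Marker: the model can return predicted labels."""


class ConfidenceModelMixin:
    """Marker: the model can return confidence scores."""


# Hypothetical mapping from a task-side requirement to a model-side capability.
REQUIRED_MODEL_MIXIN = {
    LabelTaskMixin: LabelModelMixin,
    ConfidenceTaskMixin: ConfidenceModelMixin,
}


def is_compatible(task, model) -> bool:
    """Return True if the model provides every output type the task asks for."""
    return all(
        isinstance(model, model_mixin)
        for task_mixin, model_mixin in REQUIRED_MODEL_MIXIN.items()
        if isinstance(task, task_mixin)
    )


class CalibrationTask(LabelTaskMixin, ConfidenceTaskMixin):
    pass


class LabelOnlyModel(LabelModelMixin):
    pass


class FullModel(LabelModelMixin, ConfidenceModelMixin):
    pass
```

In this sketch a task without any mixin is compatible with every model, because it requests no outputs.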
Class for representing the results of a single task.
- class shifthappens.tasks.task_result.TaskResult(*, summary_metrics, **metrics)[source]#
Bases:
object
Contains the results of a task, which can be arbitrary metrics. At least one of these metrics must be referenced as a summary metric.
- Parameters:
Examples
>>> @dataclasses.dataclass
>>> class CustomTask(Task):
>>>     ...
>>>     def _evaluate(self, model: shifthappens.models.base.Model) -> TaskResult:
>>>         ...
>>>         return TaskResult(
>>>             your_robustness_metric=your_robustness_metric,
>>>             your_calibration_metric=your_calibration_metric,
>>>             your_custom_metric=1.0 - your_custom_metric,
>>>             summary_metrics={
>>>                 Metric.Robustness: "your_robustness_metric",
>>>                 Metric.Calibration: "your_calibration_metric"},
>>>         )
>>>     ...
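The rule that every summary metric must refer to a metric that was actually passed can be checked at construction time. The following sketch shows one way such a container might validate its input. ToyTaskResult and the enum values are hypothetical, and the validation behaviour is an assumption made for illustration rather than a description of what TaskResult does internally.

```python
import enum


class Metric(enum.Enum):
    """Hypothetical metric categories used only for this sketch."""

    Robustness = "robustness"
    Calibration = "calibration"


class ToyTaskResult:
    """Stores arbitrary metrics and checks the summary metric references."""

    def __init__(self, *, summary_metrics, **metrics):
        if not summary_metrics:
            raise ValueError("at least one summary metric is required")
        for category, metric_name in summary_metrics.items():
            if metric_name not in metrics:
                raise KeyError(f"summary metric {metric_name!r} was not provided")
        self.summary_metrics = dict(summary_metrics)
        self.metrics = dict(metrics)


result = ToyTaskResult(
    accuracy=0.71,
    ece=0.05,
    summary_metrics={Metric.Robustness: "accuracy"},
)
```

Failing early in the constructor makes a misspelled metric name visible when the task is evaluated rather than when the results are aggregated.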
Categories in which metrics of tasks will be grouped.
Configuration#
Global configuration options for the benchmark.
- class shifthappens.config.Config(imagenet_validation_path='shifthappens/imagenet', imagenet21k_preprocessed_validation_path='shifthappens/imagenet21k-p', cache_directory_path='shifthappens/cache', verbose=False)[source]#
Bases:
object
Global configuration for the benchmark.
Config options are resolved in the following order of precedence:
1. By setting variables explicitly when getting the instance via shifthappens.config.get_config.
2. By setting an environment variable, prefixed as “SH_VARIABLE_NAME”.
3. By relying on the default values defined in this class.
- Parameters:
- imagenet21k_preprocessed_validation_path: str = 'shifthappens/imagenet21k-p'#
The ImageNet21k-P validation path.
- shifthappens.config.get_config(**init_kwargs)[source]#
Returns a global config initialized with the provided arguments. This allows you to change the default paths to the ImageNet validation set, the cached model results, etc. Note that reinitializing the config will raise an error. For more details see get_instance.
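The three sources of a config value (explicit argument, environment variable, class default) can be pictured as a simple fallback chain. The sketch below is a standalone illustration under stated assumptions: the resolve_option helper is hypothetical, and it assumes that the environment variable name is the upper-cased option name with an SH_ prefix, which the text above only hints at.

```python
import os

# Defaults copied from the Config signature shown above.
DEFAULTS = {
    "imagenet_validation_path": "shifthappens/imagenet",
    "cache_directory_path": "shifthappens/cache",
}


def resolve_option(name, explicit=None, environ=None):
    """Resolve one option: explicit value, then SH_<NAME> variable, then default."""
    environ = os.environ if environ is None else environ
    if explicit is not None:
        return explicit
    env_key = "SH_" + name.upper()  # assumed naming scheme
    if env_key in environ:
        return environ[env_key]
    return DEFAULTS[name]


# A plain dict stands in for os.environ so the example has no side effects.
fake_environ = {"SH_CACHE_DIRECTORY_PATH": "/tmp/cache"}
from_env = resolve_option("cache_directory_path", environ=fake_environ)
from_argument = resolve_option(
    "cache_directory_path", explicit="/data/cache", environ=fake_environ
)
from_default = resolve_option("imagenet_validation_path", environ=fake_environ)
```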
Accessing Scores for the ImageNet Validation Set#
This module provides functionality to access the predictions of a model for the ImageNet validation set. Further, the predictions can be cached and loaded from the cache to reduce computational costs. For path configuration, please have a look at shifthappens.config.
- shifthappens.data.imagenet.get_imagenet_validation_loader(max_batch_size=128)[source]#
Creates a shifthappens.data.base.DataLoader for the validation set of ImageNet. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified.
- Parameters:
max_batch_size (default: 128) – Maximum number of samples to load per batch.
- Return type:
- Returns:
ImageNet validation set data loader.
- shifthappens.data.imagenet.get_cached_predictions(cls)[source]#
Checks whether there exist cached results for the model’s class and, if so, returns them. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified.
- Parameters:
cls – Model’s class. Used for specifying the folder name.
- Return type:
- Returns:
Dictionary of loaded model predictions on the ImageNet validation set.
- shifthappens.data.imagenet.cache_predictions(cls, imagenet_validation_result)[source]#
Caches the model’s predictions in a cls-named folder, from which they can be loaded later. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified, as well as shifthappens.config.cache_directory_path.
- Parameters:
cls – Model’s class. Used for specifying the folder name.
imagenet_validation_result (ModelResult) – Model’s predictions on the ImageNet validation set.
- shifthappens.data.imagenet.is_cached(cls)[source]#
Checks if the model’s results are cached in a cls-named folder. Note that the path to the ImageNet validation set, shifthappens.config.imagenet_validation_path, must be specified, as well as shifthappens.config.cache_directory_path.
- Parameters:
cls – Model’s class. Used for specifying the folder name.
- Return type:
- Returns:
True if the model’s results are cached, False otherwise.
- shifthappens.data.imagenet.load_imagenet_targets()[source]#
Returns the ground-truth labels of the ImageNet validation set.
- Return type:
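The caching functions above share one idea: results are stored in a folder named after the model class, so repeated evaluations can skip the ImageNet forward pass. The sketch below reproduces this idea with the standard library only. The explicit cache_root argument, the file name and the pickle format are assumptions for illustration; the documented functions read their paths from shifthappens.config instead.

```python
import os
import pickle
import tempfile


def _cache_file(cache_root, cls):
    """Path of the cache file inside a folder named after the model class."""
    return os.path.join(cache_root, cls.__name__, "imagenet_validation_result.pkl")


def is_cached(cache_root, cls):
    """Check whether a cached result exists for this model class."""
    return os.path.isfile(_cache_file(cache_root, cls))


def cache_predictions(cache_root, cls, result):
    """Write the result to the class-named folder, creating it if needed."""
    path = _cache_file(cache_root, cls)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as handle:
        pickle.dump(result, handle)


def get_cached_predictions(cache_root, cls):
    """Return the cached result, or None if nothing was cached yet."""
    if not is_cached(cache_root, cls):
        return None
    with open(_cache_file(cache_root, cls), "rb") as handle:
        return pickle.load(handle)


class ToyModel:
    pass


with tempfile.TemporaryDirectory() as cache_root:
    cached_before = is_cached(cache_root, ToyModel)
    cache_predictions(cache_root, ToyModel, {"class_labels": [1, 0, 2]})
    restored = get_cached_predictions(cache_root, ToyModel)
```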
Model implementations#
Base classes and helper functions for adding models to the benchmark.
To add a new model, implement a new wrapper class inheriting from shifthappens.models.base.Model and from any of the mixins defined in shifthappens.models.mixins.
Model results should be converted to numpy.ndarray objects and packed into a shifthappens.models.base.ModelResult instance.
- class shifthappens.models.base.ModelResult(class_labels, confidences=None, uncertainties=None, ood_scores=None, features=None)[source]#
Bases:
object
Emissions of a model after processing a batch of data.
Each model needs to return class labels that are compatible with the ILSVRC2012 labels. We use the same convention as PyTorch regarding the ordering of labels.
- Parameters:
class_labels (ndarray) – (N, k), top-k predictions for each sample in the batch. The value of k can be chosen by the user and potentially influences the type of accuracy-based benchmarks that the model can be run on. For standard ImageNet and ImageNet-C evaluation, choose at least k=5.
confidences (Optional[ndarray], default: None) – (N, 1000), confidences for each class. Standard PyTorch ImageNet class label order is expected for this array. Scores can be in the range -inf to inf.
uncertainties (Optional[ndarray], default: None) – (N, 1000), uncertainties for the different class predictions. Different from the confidences, this is a measure of the certainty of the given confidences, common e.g. in Bayesian deep neural networks.
ood_scores (Optional[ndarray], default: None) – (N,), score for interpreting the sample as an out-of-distribution class, in the range -inf to inf.
features (Optional[ndarray], default: None) – (N, d), where d can be arbitrary, feature representation used to arrive at the given predictions.
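The choice of k matters because a top-k accuracy can only be computed if at least k predictions per sample are available. As a small illustration, the sketch below computes top-k accuracy from (N, k)-shaped predictions using plain Python lists in place of numpy arrays. The top_k_accuracy helper is hypothetical, and it assumes that each row is ordered from most to least likely label.

```python
def top_k_accuracy(class_labels, targets, k):
    """Fraction of samples whose target is among the first k predicted labels.

    class_labels: N rows of predicted labels, assumed ordered from most to
    least likely. targets: N ground-truth labels.
    """
    if len(class_labels) != len(targets):
        raise ValueError("class_labels and targets must have the same length")
    hits = sum(target in row[:k] for row, target in zip(class_labels, targets))
    return hits / len(targets)


# Four samples with three predictions each, i.e. shape (N=4, k=3).
predictions = [[3, 1, 7], [2, 5, 0], [4, 2, 9], [8, 6, 1]]
targets = [3, 5, 9, 0]

top1 = top_k_accuracy(predictions, targets, k=1)
top2 = top_k_accuracy(predictions, targets, k=2)
top3 = top_k_accuracy(predictions, targets, k=3)
```

Here top1 is 0.25, top2 is 0.5 and top3 is 0.75: a model that only returned one label per sample could not be scored on the latter two.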
- class shifthappens.models.base.PredictionTargets(class_labels=False, confidences=False, uncertainties=False, ood_scores=False, features=False)[source]#
Bases:
object
Contains boolean flags indicating which types of targets the model is predicting. Note that at least one flag should be set to True and the model should inherit from the corresponding ModelMixin.
- Parameters:
class_labels (bool, default: False) – Set to True if the model returns predicted labels.
confidences (bool, default: False) – Set to True if the model returns confidences.
uncertainties (bool, default: False) – Set to True if the model returns uncertainties.
ood_scores (bool, default: False) – Set to True if the model returns OOD scores.
features (bool, default: False) – Set to True if the model returns features.
- class shifthappens.models.base.Model[source]#
Bases:
ABC
Model base class.
Override the _predict method to define the prediction types of your specific model. If your model uses unsupervised adaptation mechanisms, override prepare as well.
Also make sure that your model inherits from the mixins in shifthappens.models.mixins corresponding to your model’s prediction types (e.g., LabelModelMixin for labels or ConfidenceModelMixin for confidences).
- property imagenet_validation_result#
Access the model’s predictions/evaluation results on the ImageNet validation set.
- Returns:
Model evaluation result on ImageNet validation set wrapped with ModelResult.
- prepare(dataloader)[source]#
If the model uses unsupervised adaptation mechanisms, it will run those.
- Parameters:
dataloader (DataLoader) – Dataloader producing batches of data.
- predict(input_dataloader, targets)[source]#
Yield all the predictions of the model for all data samples contained in the dataloader.
- Parameters:
input_dataloader (DataLoader) – Dataloader producing batches of data.
targets (PredictionTargets) – Indicates which kinds of targets should be predicted.
- Return type:
- Returns:
Prediction results for the given batch. Depending on the target arguments this includes the predicted labels, class confidences, class uncertainties, OOD scores, and image features, all as numpy.ndarray objects.
- _get_imagenet_predictions(rewrite=False)[source]#
Loads cached predictions on the ImageNet validation set for the model, or predicts on the ImageNet validation set and caches the result whenever there are no cached predictions for the model or the rewrite argument is set to True.
- Parameters:
rewrite (default: False) – True if the model’s predictions need to be rewritten.
- _predict_imagenet_val()[source]#
Evaluates the model on the ImageNet validation set and stores the scores for all targets supported by the particular model.
- abstract _predict(input_dataloader, targets)[source]#
Override this function for the specific model.
- Parameters:
input_dataloader (DataLoader) – Dataloader producing batches of data.
targets (PredictionTargets) – Indicates which kinds of targets should be predicted.
- Return type:
- Returns:
Yields prediction results for all batches yielded by the dataloader. Depending on the target arguments the model results may include the predicted labels, class confidences, class uncertainties, OOD scores, and image features, all as numpy.ndarray objects.
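The contract of _predict (one result per batch, containing only the requested targets) maps naturally onto a Python generator. The sketch below demonstrates that shape with toy stand-ins. ToyTargets, ToyResult and ArgmaxModel are hypothetical simplifications that use lists instead of numpy.ndarray objects and do not inherit from the real base classes.

```python
import dataclasses
from typing import Iterable, Iterator, List, Optional


@dataclasses.dataclass
class ToyTargets:
    """Simplified stand-in for the flags requesting prediction types."""

    class_labels: bool = False
    confidences: bool = False


@dataclasses.dataclass
class ToyResult:
    """Simplified stand-in for a per-batch model result."""

    class_labels: Optional[List[int]] = None
    confidences: Optional[List[List[float]]] = None


class ArgmaxModel:
    """Toy model: each sample is already a list of per-class scores."""

    def _predict(
        self, batches: Iterable[List[List[float]]], targets: ToyTargets
    ) -> Iterator[ToyResult]:
        for batch in batches:  # one result is yielded per batch
            result = ToyResult()
            if targets.class_labels:
                # index of the highest score of each sample
                result.class_labels = [
                    max(range(len(scores)), key=scores.__getitem__)
                    for scores in batch
                ]
            if targets.confidences:
                result.confidences = [list(scores) for scores in batch]
            yield result


batches = [[[0.1, 0.9], [0.8, 0.2]], [[0.3, 0.7]]]
results = list(ArgmaxModel()._predict(batches, ToyTargets(class_labels=True)))
```

Because the generator is lazy, a task can consume predictions batch by batch without holding the outputs for the whole dataset in memory.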
Model mixins indicate the supported prediction types of a model.
- class shifthappens.models.mixins.LabelModelMixin[source]#
Bases:
object
Inherit from this class if your model returns predicted labels.
- class shifthappens.models.mixins.ConfidenceModelMixin[source]#
Bases:
object
Inherit from this class if your model returns confidences.
- class shifthappens.models.mixins.UncertaintyModelMixin[source]#
Bases:
object
Inherit from this class if your model returns uncertainties.
- class shifthappens.models.mixins.OODScoreModelMixin[source]#
Bases:
object
Inherit from this class if your model returns ood scores.
- class shifthappens.models.mixins.FeaturesModelMixin[source]#
Bases:
object
Inherit from this class if your model returns features.
Model implementations from torchvision#
Model baselines from torchvision.
- class shifthappens.models.torchvision.__TorchvisionPreProcessingMixin[source]#
Bases:
object
Performs the default preprocessing for torchvision ImageNet classifiers.
- class shifthappens.models.torchvision.__TorchvisionModel(model, feature_layer, max_batch_size, device='cpu')[source]#
Bases: Model, __TorchvisionPreProcessingMixin, LabelModelMixin, ConfidenceModelMixin, FeaturesModelMixin, OODScoreModelMixin
Wraps a torchvision model.
- Parameters:
- _predict(input_dataloader, targets)#
Override this function for the specific model.
- Parameters:
input_dataloader (DataLoader) – Dataloader producing batches of data.
targets (PredictionTargets) – Indicates which kinds of targets should be predicted.
- Return type:
- Returns:
Yields prediction results for all batches yielded by the dataloader. Depending on the target arguments the model results may include the predicted labels, class confidences, class uncertainties, OOD scores, and image features, all as numpy.ndarray objects.
- class shifthappens.models.torchvision.ResNet18(max_batch_size=16, device='cpu')[source]#
Bases:
__TorchvisionModel
Load a ResNet18 network trained on the ImageNet 2012 train set from torchvision. See torchvision.models.resnet18 for details.
- Parameters:
max_batch_size (int, default: 16) – Maximum number of samples to load per batch.
device (str, default: 'cpu') – Selected device to run the model on.
- class shifthappens.models.torchvision.ResNet50(max_batch_size=16, device='cpu')[source]#
Bases:
__TorchvisionModel
Load a ResNet50 network trained on the ImageNet 2012 train set from torchvision. See torchvision.models.resnet50 for details.
- class shifthappens.models.torchvision.VGG16(max_batch_size=16, device='cpu')[source]#
Bases:
__TorchvisionModel
Load a VGG16 network trained on the ImageNet 2012 train set from torchvision. See torchvision.models.vgg16 for details.
Data loading#
Base classes and helper functions for data handling (dataset, dataloader).
- class shifthappens.data.base.Dataset[source]#
Bases:
ABC
An abstract class representing an iterable dataset. Your iterable datasets should be inherited from this class.
- class shifthappens.data.base.IndexedDataset[source]#
Bases:
Dataset
A class representing a map-style dataset. Your map-style datasets should be inherited from this class.
- class shifthappens.data.base.DataLoader(dataset, max_batch_size)[source]#
Bases:
object
Interface between model and task. It enforces restrictions (e.g. a maximum batch size) for models.
- Parameters:
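One way to picture the maximum batch size restriction is as a cap on whatever batch size a model would like to use. The sketch below shows this capping logic on a plain list. The iterate_batches function and its requested_batch_size argument are hypothetical and are not part of the DataLoader interface documented here. It treats a max_batch_size of None as "no cap", in line with the parameter example above where None stands for a dataset-sized batch.

```python
def iterate_batches(samples, requested_batch_size, max_batch_size=None):
    """Yield consecutive batches no larger than the task-imposed maximum."""
    if requested_batch_size < 1:
        raise ValueError("requested_batch_size must be positive")
    size = requested_batch_size
    if max_batch_size is not None:
        size = min(size, max_batch_size)  # the task's limit wins
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


samples = list(range(5))
capped = list(iterate_batches(samples, requested_batch_size=4, max_batch_size=2))
uncapped = list(iterate_batches(samples, requested_batch_size=4))
```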
- shifthappens.data.base.shuffle_data(*, data, seed)[source]#
Randomly shuffles, without replacement, a numpy.ndarray or a list of numpy.ndarray objects with a fixed random seed.
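Shuffling with a fixed seed makes a random order reproducible across runs. The sketch below illustrates the technique with the standard library, using lists instead of numpy.ndarray objects. The shuffle_together helper is hypothetical, and it assumes that several sequences passed together should receive the same permutation so that images and labels stay aligned.

```python
import random


def shuffle_together(sequences, seed):
    """Apply one seeded permutation to several equally long sequences."""
    length = len(sequences[0])
    if any(len(sequence) != length for sequence in sequences):
        raise ValueError("all sequences must have the same length")
    order = list(range(length))
    random.Random(seed).shuffle(order)  # local RNG; global state is untouched
    return [[sequence[i] for i in order] for sequence in sequences]


images = ["img0", "img1", "img2", "img3"]
labels = [0, 1, 2, 3]
shuffled_images, shuffled_labels = shuffle_together([images, labels], seed=1)
```

Calling the function again with the same seed yields the same order, and every element still appears exactly once.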
Wrappers for PyTorch datasets such that they can be used as datasets for the benchmark.
- class shifthappens.data.torch.TorchDataset(torch_dataset)[source]#
Bases:
Dataset
Wraps a torch iterable dataset (i.e. torch.utils.data.IterableDataset).
- Parameters:
torch_dataset (IterableDataset) – Dataset from which to load the data.
- class shifthappens.data.torch.IndexedTorchDataset(torch_dataset)[source]#
Bases:
IndexedDataset
Wraps a torch map-style dataset (i.e. torch.utils.data.Dataset).
- Parameters:
torch_dataset (Dataset) – Dataset from which to load the data.
Storing tasks within the benchmark#
Class for storing a task’s metadata.
- class shifthappens.task_data.task_metadata.TaskMetadata(name, relative_data_folder, standalone=True)[source]#
Bases:
object
Class for storing a task’s metadata required by the task registration mechanism. The arguments are the same as those of shifthappens.benchmark.register_task, which passes them on.
- Parameters:
name (str) – Name of the task (can contain spaces or special characters).
relative_data_folder (str) – Name of the folder in which the data for this task will be saved, relative to the root folder of the benchmark.
standalone (bool, default: True) – Whether this task is meaningful as a standalone task or is only relevant as part of a collection of tasks.
Class for storing a task’s registration for the benchmark.
- class shifthappens.task_data.task_registration.TaskRegistration(cls, metadata)[source]#
Bases:
object
Class for storing a task’s registration for the benchmark. The arguments are initialized automatically during the task registration process.
- Parameters:
metadata (TaskMetadata) – Task metadata passed with shifthappens.benchmark.register_task.