Datasets

time_interpret provides several datasets used as benchmarks for time series attribution methods. These datasets are listed below:

Summary

tint.datasets.Arma([times, features, ...])

Arma dataset.

tint.datasets.BioBank([label, discretised, ...])

BioBank dataset.

tint.datasets.Hawkes([mu, alpha, decay, ...])

Hawkes dataset.

tint.datasets.HMM([n_signal, n_state, ...])

2-state Hidden Markov Model as described in the DynaMask paper.

tint.datasets.Mimic3([task, data_dir, ...])

MIMIC-III dataset.

Detailed classes and methods

class tint.datasets.Arma(times: int = 50, features: int = 50, subset: int = 5, ar: Optional[list] = None, ma: Optional[list] = None, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/arma', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

Arma dataset.

Parameters:
  • times (int) – Length of each time series. Default to 50

  • features (int) – Number of features in each time series. Default to 50

  • ar (list) – Coefficient for autoregressive lag polynomial, including zero lag. If None, use default values. Default to None

  • ma (list) – Coefficient for moving-average lag polynomial, including zero lag. If None, use default values. Default to None

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • prop_val (float) – Proportion of validation. Default to .2

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

  1. Explaining Time Series Predictions with Dynamic Masks

  2. https://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_process.ArmaProcess.html

Examples

>>> from tint.datasets import Arma

>>> arma = Arma()
>>> arma.download(split="train")
>>> x_train = arma.preprocess(split="train")["x"]
>>> y_train = arma.preprocess(split="train")["y"]
static get_white_box(inputs: Tensor, true_saliency: Tensor) -> Tensor[source]

Create a white box regressor to be interpreted.

Parameters:
  • inputs (th.Tensor) – The input data.

  • true_saliency (th.Tensor) – The true saliency.

Returns:

Output data.

Return type:

th.Tensor
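
A minimal usage sketch for this static method. The tensor shapes below (batch x times x features) and the way the saliency mask is built are illustrative assumptions, not part of the documented API:

>>> import torch as th
>>> from tint.datasets import Arma

>>> inputs = th.rand(8, 50, 50)  # hypothetical batch: batch x times x features
>>> true_saliency = (th.rand(8, 50, 50) > 0.9).float()  # hypothetical saliency mask
>>> outputs = Arma.get_white_box(inputs, true_saliency)  # white-box regressor output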

class tint.datasets.BioBank(label: Optional[str] = None, discretised: bool = False, granularity: int = 1, maximum_time: int = 115, fasttext: Optional[Fasttext] = None, time_to_task: float = 0.5, std_time_to_task: float = 0.2, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/biobank', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

BioBank dataset.

Parameters:
  • label (str) – Condition to be used as label. If None, it is set to type 2 diabetes. Default to None

  • discretised (bool) – Whether to return a discretised dataset or not. Default to False

  • granularity (str, int) – The time granularity. Default to 1 (a year)

  • maximum_time (int) – Maximum time to record. Default to 115 years

  • fasttext (Fasttext) – A Fasttext model to encode categorical features. Default to None

  • time_to_task (float) – Special argument for the diabetes task: stops the recording before diabetes occurs. Default to .5

  • std_time_to_task (float) – Adds randomness to when the recording stops. Default to .2

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • prop_val (float) – Proportion of validation. Default to .2

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

https://www.ukbiobank.ac.uk
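
A hedged usage sketch, assuming this dataset follows the same preprocess interface as the other tint datasets and that the UK Biobank extracts are available locally (both are assumptions):

>>> from tint.datasets import BioBank

>>> biobank = BioBank()  # assumes UK Biobank data is accessible locally
>>> x_train = biobank.preprocess(split="train")["x"]  # assumed to mirror the other datasets
>>> y_train = biobank.preprocess(split="train")["y"]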

build_discretized_features(events: List[Tensor], times: List[Tensor], verbose: Union[bool, int] = False)[source]

Build discretized features.

Parameters:
  • events (list) – The read codes.

  • times (list) – Times of each event.

  • verbose (bool, int) – Verbosity level. Default to False

Returns:

Preprocessed features.

build_discretized_labels(events: list, times: list) -> (th.Tensor, th.Tensor)[source]

Build discretized labels.

Parameters:
  • events (list) – List of events.

  • times (list) – List of times.

Returns:

Two tensors of labels and tasks

Return type:

(th.Tensor, th.Tensor)

build_features(events: List[Tensor], times: List[Tensor], verbose: Union[bool, int] = False)[source]

Build features.

Parameters:
  • events (list) – The read codes.

  • times (list) – Times of each event.

  • verbose (bool, int) – Verbosity level. Default to False

Returns:

Preprocessed features.

build_labels(events: list, times: list) -> (list, list)[source]

Build labels.

Parameters:
  • events (list) – List of events.

  • times (list) – List of times.

Returns:

Two lists of labels and tasks

Return type:

(list, list)

class tint.datasets.HMM(n_signal: int = 3, n_state: int = 1, corr_features: Optional[list] = None, imp_features: Optional[list] = None, scale: Optional[list] = None, p0: Optional[list] = None, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/hmm', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

2-state Hidden Markov Model as described in the DynaMask paper.

Parameters:
  • n_signal (int) – Number of different signals. Default to 3

  • n_state (int) – Number of different possible states. Default to 1

  • corr_features (list) – Features that are correlated with the important feature in each state. If None, use default values. Default to None

  • imp_features (list) – Features that are always set as important. If None, use default values. Default to None

  • scale (list) – Scaling factor for distribution mean in each state. If None, use default values. Default to None

  • p0 (list) – Starting probability. If None, use default values. Default to None

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • prop_val (float) – Proportion of validation. Default to .2

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

Explaining Time Series Predictions with Dynamic Masks

Examples

>>> from tint.datasets import HMM

>>> hmm = HMM()
>>> hmm.download(split="train")
>>> x_train = hmm.preprocess(split="train")["x"]
>>> y_train = hmm.preprocess(split="train")["y"]
class tint.datasets.Hawkes(mu: Optional[list] = None, alpha: Optional[list] = None, decay: Optional[list] = None, window: Optional[int] = None, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/hawkes', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

Hawkes dataset.

Parameters:
  • mu (list) – Intensity baselines. If None, use default values. Default to None

  • alpha (list) – Events parameters. If None, use default values. Default to None

  • decay (list) – Intensity decays. If None, use default values. Default to None

  • window (int) – The window of the simulated process. If None, use default value. Default to None

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • prop_val (float) – Proportion of validation. Default to .2

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

https://x-datainitiative.github.io/tick/modules/hawkes.html

Examples

>>> from tint.datasets import Hawkes

>>> hawkes = Hawkes()
>>> hawkes.download(split="train")
>>> x_train = hawkes.preprocess(split="train")["x"]
>>> y_train = hawkes.preprocess(split="train")["y"]
static generate_points(mu: list, alpha: list, decay: list, window: int, seed: int, dt: float = 0.01)[source]

Generates points of a marked Hawkes process using the tick library.

Parameters:
  • mu (list) – Hawkes baseline.

  • alpha (list) – Event parameter.

  • decay (list) – Decay parameter.

  • window (int) – The window of the simulated process.

  • seed (int) – The random seed.

  • dt (float) – Granularity. Default to 0.01
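
A hedged sketch of a call to this method, using hypothetical parameters for two processes; it assumes the tick library is installed and that mu, alpha and decay follow the N and N x N layouts documented for intensity below:

>>> from tint.datasets import Hawkes

>>> # Hypothetical parameters for N = 2 processes; values are for illustration only
>>> points = Hawkes.generate_points(
...     mu=[0.05, 0.05],
...     alpha=[[0.2, 0.1], [0.1, 0.2]],
...     decay=[[1.0, 1.0], [1.0, 1.0]],
...     window=100,
...     seed=42,
... )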

static get_features(point: list) -> Tensor[source]

Create features from a Hawkes process.

Parameters:

point (list) – A Hawkes process.

static get_labels(point: list) -> Tensor[source]

Create labels from a Hawkes process.

Parameters:

point (list) – A Hawkes process.
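
As a hedged follow-up to the generate_points sketch above, features and labels could then be built from the simulated process; the variable name points is carried over from that sketch:

>>> x = Hawkes.get_features(points)  # features tensor from the simulated process
>>> y = Hawkes.get_labels(points)  # labels tensor from the same process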

static intensity(mu: Tensor, alpha: Tensor, decay: Tensor, times: Tensor, labels: Tensor, t: Tensor) -> Tensor[source]

Given parameters mu, alpha and decay, some times and labels, and a vector of query times t, compute intensities at these time points.

B: Batch size. T: Temporal dim. N: Number of processes. Q: Number of time queries.

Parameters:
  • mu (th.Tensor) – Intensity baselines. Shape N, Values 0..1

  • alpha (th.Tensor) – Events parameters. Shape N x N, Values 0..1

  • decay (th.Tensor) – Intensity decays. Shape N x N, Values 0..1

  • times (th.Tensor) – Times of the process. Shape B x T x 1

  • labels (th.Tensor) – Labels of the process. Shape B x T x 1

  • t (th.Tensor) – Query times. Shape Q

Returns:

Intensities. Shape B x Q x N

Return type:

th.Tensor
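
A shape-only sketch of a call to this method. The sizes and random values below are hypothetical; only the tensor shapes (N, N x N, B x T x 1, Q) follow the docstring above:

>>> import torch as th
>>> from tint.datasets import Hawkes

>>> N, B, T, Q = 2, 4, 10, 5  # hypothetical sizes
>>> mu = th.rand(N)  # baselines, shape N
>>> alpha = th.rand(N, N)  # event parameters, shape N x N
>>> decay = th.rand(N, N)  # decays, shape N x N
>>> times = th.sort(th.rand(B, T, 1), dim=1).values  # event times, shape B x T x 1
>>> labels = th.randint(0, N, (B, T, 1))  # process labels, shape B x T x 1
>>> t = th.linspace(0.1, 0.9, Q)  # query times, shape Q
>>> intensities = Hawkes.intensity(mu, alpha, decay, times, labels, t)  # expected B x Q x N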

true_saliency(split: str = 'train')[source]

Get process true saliency.

Parameters:

split (str) – Data split. Default to 'train'

Returns:

The true saliency.

Return type:

th.Tensor

true_saliency_t(t: Tensor, mu: Optional[Tensor] = None, alpha: Optional[Tensor] = None, decay: Optional[Tensor] = None, times: Optional[Tensor] = None, labels: Optional[Tensor] = None, split: str = 'train')[source]

Compute the true saliency given some time queries.

B: Batch size. T: Temporal dim. N: Number of processes. Q: Number of time queries.

Parameters:
  • t (th.Tensor) – Time queries. Shape Q

  • mu (th.Tensor) – Intensity baselines. Shape N, Values 0..1

  • alpha (th.Tensor) – Events parameters. Shape N x N, Values 0..1

  • decay (th.Tensor) – Intensity decays. Shape N x N, Values 0..1

  • times (th.Tensor) – Times of the process. Shape B x T x 1

  • labels (th.Tensor) – Labels of the process. Shape B x T x 1

  • split (str) – Data split. Default to 'train'

Returns:

The true saliency.

Return type:

th.Tensor
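
A hedged sketch of calling this method on a downloaded split. It assumes that leaving mu, alpha, decay, times and labels as None falls back to the dataset's own simulated process (an assumption):

>>> import torch as th
>>> from tint.datasets import Hawkes

>>> hawkes = Hawkes()
>>> hawkes.download(split="train")
>>> t = th.linspace(0.1, 0.9, 5)  # hypothetical query times, shape Q
>>> saliency = hawkes.true_saliency_t(t, split="train")  # expected shape per the docstring above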

class tint.datasets.Mimic3(task: str = 'mortality', data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/mimic3', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

MIMIC-III dataset.

Download is set up according to this repository: https://github.com/sanatonek/time_series_explainability.

Warning

Using this dataset requires having the MIMIC-III data running on a local server. Please see https://mimic.mit.edu/docs/gettingstarted/local/install-mimic-locally-ubuntu/ for more information.

Parameters:
  • task (str) – Name of the task to perform. Either 'mortality' or 'blood_pressure'. Default to 'mortality'

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • prop_val (float) – Proportion of validation. Default to .2

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

  1. https://physionet.org/content/mimiciii/1.4/

  2. https://github.com/sanatonek/time_series_explainability/blob/master/data_generator/icu_mortality.py

Examples

>>> from tint.datasets import Mimic3

>>> mimic3 = Mimic3()
>>> mimic3.download(sqluser="your_username", split="train")
>>> x_train = mimic3.preprocess(split="train")["x"]
>>> y_train = mimic3.preprocess(split="train")["y"]