Datasets

time_interpret provides several datasets used as benchmarks for time series attribution methods. These datasets are listed below:

Summary

tint.datasets.Arma([times, features, ...])

Arma dataset.

tint.datasets.BioBank([label, discretised, ...])

BioBank dataset.

tint.datasets.Hawkes([mu, alpha, decay, ...])

Hawkes dataset.

tint.datasets.HMM([n_signal, n_state, ...])

2-state Hidden Markov Model as described in the DynaMask paper.

tint.datasets.Mimic3([task, data_dir, ...])

MIMIC-III dataset.

Detailed classes and methods

class tint.datasets.Arma(times: int = 50, features: int = 50, subset: int = 5, ar: Optional[list] = None, ma: Optional[list] = None, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/arma', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

Arma dataset.

Parameters:
  • times (int) – Length of each time series. Default to 50

  • features (int) – Number of features in each time series. Default to 50

  • ar (list) – Coefficient for autoregressive lag polynomial, including zero lag. If None, use default values. Default to None

  • ma (list) – Coefficient for moving-average lag polynomial, including zero lag. If None, use default values. Default to None

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • prop_val (float) – Proportion of validation. Default to .2

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

  1. Explaining Time Series Predictions with Dynamic Masks

  2. https://www.statsmodels.org/dev/generated/statsmodels.tsa.arima_process.ArmaProcess.html

Examples

>>> from tint.datasets import Arma

>>> arma = Arma()
>>> arma.download(split="train")
>>> x_train = arma.preprocess(split="train")["x"]
>>> y_train = arma.preprocess(split="train")["y"]
static get_white_box(inputs: Tensor, true_saliency: Tensor) -> Tensor[source]

Create a white box regressor to be interpreted.

Parameters:
  • inputs (th.Tensor) – The input data.

  • true_saliency (th.Tensor) – The true saliency.

Returns:

Output data.

Return type:

th.Tensor
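
A minimal usage sketch for this static method. The tensor shapes below (batch x times x features) and the way the saliency mask is built are illustrative assumptions, not part of the documented API:

>>> import torch as th
>>> from tint.datasets import Arma

>>> inputs = th.rand(8, 50, 50)  # hypothetical batch: batch x times x features
>>> true_saliency = (th.rand(8, 50, 50) > 0.9).float()  # hypothetical saliency mask
>>> outputs = Arma.get_white_box(inputs, true_saliency)  # white-box regressor output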

class tint.datasets.BioBank(label: Optional[str] = None, discretised: bool = False, granularity: int = 1, maximum_time: int = 115, fasttext: Optional[Fasttext] = None, time_to_task: float = 0.5, std_time_to_task: float = 0.2, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/biobank', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

BioBank dataset.

Parameters:
  • label (str) – Condition to be used as label. If None, it is set to type 2 diabetes. Default to None

  • discretised (bool) – Whether to return a discretised dataset or not. Default to False

  • granularity (str, int) – The time granularity. Default to 1 (a year)

  • maximum_time (int) – Maximum time to record. Default to 115 years

  • fasttext (Fasttext) – A Fasttext model to encode categorical features. Default to None

  • time_to_task (float) – Special argument for the diabetes task: stops the recording before diabetes occurs. Default to .5

  • std_time_to_task (float) – Adds randomness to when the recording stops. Default to .2

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • prop_val (float) – Proportion of validation. Default to .2

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

https://www.ukbiobank.ac.uk
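
A hedged usage sketch, assuming this dataset follows the same preprocess interface as the other tint datasets and that the UK Biobank extracts are available locally (both are assumptions):

>>> from tint.datasets import BioBank

>>> biobank = BioBank()  # assumes UK Biobank data is accessible locally
>>> x_train = biobank.preprocess(split="train")["x"]  # assumed to mirror the other datasets
>>> y_train = biobank.preprocess(split="train")["y"]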

build_discretized_features(events: List[Tensor], times: List[Tensor], verbose: Union[bool, int] = False)[source]

Build discretized features.

Parameters:
  • events (list) – The read codes.

  • times (list) – Times of each event.

  • verbose (bool, int) – Verbosity level. Default to False

Returns:

Preprocessed features.

build_discretized_labels(events: list, times: list) -> (th.Tensor, th.Tensor)[source]

Build discretized labels.

Parameters:
  • events (list) – List of events.

  • times (list) – List of times.

Returns:

Two tensors of labels and tasks

Return type:

(th.Tensor, th.Tensor)

build_features(events: List[Tensor], times: List[Tensor], verbose: Union[bool, int] = False)[source]

Build features.

Parameters:
  • events (list) – The read codes.

  • times (list) – Times of each event.

  • verbose (bool, int) – Verbosity level. Default to False

Returns:

Preprocessed features.

build_labels(events: list, times: list) -> (list, list)[source]

Build labels.

Parameters:
  • events (list) – List of events.

  • times (list) – List of times.

Returns:

Two lists of labels and tasks

Return type:

(list, list)

class tint.datasets.HMM(n_signal: int = 3, n_state: int = 1, corr_features: Optional[list] = None, imp_features: Optional[list] = None, scale: Optional[list] = None, p0: Optional[list] = None, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/hmm', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

2-state Hidden Markov Model as described in the DynaMask paper.

Parameters:
  • n_signal (int) – Number of different signals. Default to 3

  • n_state (int) – Number of different possible states. Default to 1

  • corr_features (list) – Features that are correlated with the important feature in each state. If None, use default values. Default to None

  • imp_features (list) – Features that are always set as important. If None, use default values. Default to None

  • scale (list) – Scaling factor for distribution mean in each state. If None, use default values. Default to None

  • p0 (list) – Starting probability. If None, use default values. Default to None

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • prop_val (float) – Proportion of validation. Default to .2

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

Explaining Time Series Predictions with Dynamic Masks

Examples

>>> from tint.datasets import HMM

>>> hmm = HMM()
>>> hmm.download(split="train")
>>> x_train = hmm.preprocess(split="train")["x"]
>>> y_train = hmm.preprocess(split="train")["y"]
class tint.datasets.Hawkes(mu: Optional[list] = None, alpha: Optional[list] = None, decay: Optional[list] = None, window: Optional[int] = None, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/hawkes', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

Hawkes dataset.

Parameters:
  • mu (list) – Intensity baselines. If None, use default values. Default to None

  • alpha (list) – Events parameters. If None, use default values. Default to None

  • decay (list) – Intensity decays. If None, use default values. Default to None

  • window (int) – The window of the simulated process. If None, use default value. Default to None

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • prop_val (float) – Proportion of validation. Default to .2

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

https://x-datainitiative.github.io/tick/modules/hawkes.html

Examples

>>> from tint.datasets import Hawkes

>>> hawkes = Hawkes()
>>> hawkes.download(split="train")
>>> x_train = hawkes.preprocess(split="train")["x"]
>>> y_train = hawkes.preprocess(split="train")["y"]
static generate_points(mu: list, alpha: list, decay: list, window: int, seed: int, dt: float = 0.01)[source]

Generates points of a marked Hawkes process using the tick library.

Parameters:
  • mu (list) – Hawkes baseline.

  • alpha (list) – Event parameter.

  • decay (list) – Decay parameter.

  • window (int) – The window of the simulated process.

  • seed (int) – The random seed.

  • dt (float) – Granularity. Default to 0.01
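
A hedged sketch of a call to this method, using hypothetical parameters for two processes; it assumes the tick library is installed and that mu, alpha and decay follow the N and N x N layouts documented for intensity below:

>>> from tint.datasets import Hawkes

>>> # Hypothetical parameters for N = 2 processes; values are for illustration only
>>> points = Hawkes.generate_points(
...     mu=[0.05, 0.05],
...     alpha=[[0.2, 0.1], [0.1, 0.2]],
...     decay=[[1.0, 1.0], [1.0, 1.0]],
...     window=100,
...     seed=42,
... )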

static get_features(point: list) -> Tensor[source]

Create features from a Hawkes process.

Parameters:

point (list) – A Hawkes process.

static get_labels(point: list) -> Tensor[source]

Create labels from a Hawkes process.

Parameters:

point (list) – A Hawkes process.
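
As a hedged follow-up to the generate_points sketch above, features and labels could then be built from the simulated process; the variable name points is carried over from that sketch:

>>> x = Hawkes.get_features(points)  # features tensor from the simulated process
>>> y = Hawkes.get_labels(points)  # labels tensor from the same process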

static intensity(mu: Tensor, alpha: Tensor, decay: Tensor, times: Tensor, labels: Tensor, t: Tensor) -> Tensor[source]

Given parameters mu, alpha and decay, some times and labels, and a vector of query times t, compute intensities at these time points.

B: Batch size. T: Temporal dim. N: Number of processes. Q: Number of time queries.

Parameters:
  • mu (th.Tensor) – Intensity baselines. Shape N, Values 0..1

  • alpha (th.Tensor) – Events parameters. Shape N x N, Values 0..1

  • decay (th.Tensor) – Intensity decays. Shape N x N, Values 0..1

  • times (th.Tensor) – Times of the process. Shape B x T x 1

  • labels (th.Tensor) – Labels of the process. Shape B x T x 1

  • t (th.Tensor) – Query times. Shape Q

Returns:

Intensities. Shape B x Q x N

Return type:

th.Tensor
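
A shape-only sketch of a call to this method. The sizes and random values below are hypothetical; only the tensor shapes (N, N x N, B x T x 1, Q) follow the docstring above:

>>> import torch as th
>>> from tint.datasets import Hawkes

>>> N, B, T, Q = 2, 4, 10, 5  # hypothetical sizes
>>> mu = th.rand(N)  # baselines, shape N
>>> alpha = th.rand(N, N)  # event parameters, shape N x N
>>> decay = th.rand(N, N)  # decays, shape N x N
>>> times = th.sort(th.rand(B, T, 1), dim=1).values  # event times, shape B x T x 1
>>> labels = th.randint(0, N, (B, T, 1))  # process labels, shape B x T x 1
>>> t = th.linspace(0.1, 0.9, Q)  # query times, shape Q
>>> intensities = Hawkes.intensity(mu, alpha, decay, times, labels, t)  # expected B x Q x N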

true_saliency(split: str = 'train')[source]

Get process true saliency.

Parameters:

split (str) – Data split. Default to 'train'

Returns:

The true saliency.

Return type:

th.Tensor

true_saliency_t(t: Tensor, mu: Optional[Tensor] = None, alpha: Optional[Tensor] = None, decay: Optional[Tensor] = None, times: Optional[Tensor] = None, labels: Optional[Tensor] = None, split: str = 'train')[source]

Compute the true saliency given some time queries.

B: Batch size. T: Temporal dim. N: Number of processes. Q: Number of time queries.

Parameters:
  • t (th.Tensor) – Time queries. Shape Q

  • mu (th.Tensor) – Intensity baselines. Shape N, Values 0..1

  • alpha (th.Tensor) – Events parameters. Shape N x N, Values 0..1

  • decay (th.Tensor) – Intensity decays. Shape N x N, Values 0..1

  • times (th.Tensor) – Times of the process. Shape B x T x 1

  • labels (th.Tensor) – Labels of the process. Shape B x T x 1

  • split (str) – Data split. Default to 'train'

Returns:

The true saliency.

Return type:

th.Tensor
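
A hedged sketch of calling this method on a downloaded split. It assumes that leaving mu, alpha, decay, times and labels as None falls back to the dataset's own simulated process (an assumption):

>>> import torch as th
>>> from tint.datasets import Hawkes

>>> hawkes = Hawkes()
>>> hawkes.download(split="train")
>>> t = th.linspace(0.1, 0.9, 5)  # hypothetical query times, shape Q
>>> saliency = hawkes.true_saliency_t(t, split="train")  # expected shape per the docstring above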

class tint.datasets.Mimic3(task: str = 'mortality', data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/mimic3', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]

MIMIC-III dataset.

Download is set up according to this repository: https://github.com/sanatonek/time_series_explainability.

Warning

Using this dataset requires having the MIMIC-III data running on a local server. Please see https://mimic.mit.edu/docs/gettingstarted/local/install-mimic-locally-ubuntu/ for more information.

Parameters:
  • task (str) – Name of the task to perform. Either 'mortality' or 'blood_pressure'. Default to 'mortality'

  • data_dir (str) – Where to download files.

  • batch_size (int) – Batch size. Default to 32

  • n_folds (int) – Number of folds for cross validation. If None, the dataset is only split once between train and val using prop_val. Default to None

  • fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None

  • prop_val (float) – Proportion of validation. Default to .2

  • num_workers (int) – Number of workers for the loaders. Default to 0

  • seed (int) – For the random split. Default to 42

References

  1. https://physionet.org/content/mimiciii/1.4/

  2. https://github.com/sanatonek/time_series_explainability/blob/master/data_generator/icu_mortality.py

Examples

>>> from tint.datasets import Mimic3

>>> mimic3 = Mimic3()
>>> mimic3.download(sqluser="your_username", split="train")
>>> x_train = mimic3.preprocess(split="train")["x"]
>>> y_train = mimic3.preprocess(split="train")["y"]