Datasets
time_interpret provides several datasets used as benchmarks for time series attribution methods. These datasets are listed below:
Summary

| Arma | Arma dataset. |
| BioBank | BioBank dataset. |
| Hawkes | Hawkes dataset. |
| HMM | 2-state Hidden Markov Model as described in the DynaMask paper. |
| Mimic3 | MIMIC-III dataset. |
Detailed classes and methods
- class tint.datasets.Arma(times: int = 50, features: int = 50, subset: int = 5, ar: Optional[list] = None, ma: Optional[list] = None, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/arma', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]
Arma dataset.
- Parameters:
times (int) – Length of each time series. Default to 50
features (int) – Number of features in each time series. Default to 50
ar (list) – Coefficients for the autoregressive lag polynomial, including zero lag. If None, use default values. Default to None
ma (list) – Coefficients for the moving-average lag polynomial, including zero lag. If None, use default values. Default to None
data_dir (str) – Where to download files.
batch_size (int) – Batch size. Default to 32
n_folds (int) – Number of folds for cross-validation. If None, the dataset is only split once between train and val using prop_val. Default to None
fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None
prop_val (float) – Proportion of validation. Default to .2
num_workers (int) – Number of workers for the loaders. Default to 0
seed (int) – For the random split. Default to 42
Examples
>>> from tint.datasets import Arma
>>> arma = Arma()
>>> arma.download(split="train")
>>> x_train = arma.preprocess(split="train")["x"]
>>> y_train = arma.preprocess(split="train")["y"]
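When n_folds is set, the same workflow can be repeated over each fold. A minimal sketch, assuming download and preprocess behave exactly as in the example above:
>>> from tint.datasets import Arma
>>> arma = Arma(n_folds=5, fold=0)  # use the first of 5 cross-validation folds
>>> arma.download(split="train")
>>> x_train = arma.preprocess(split="train")["x"]
>>> y_train = arma.preprocess(split="train")["y"]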
- class tint.datasets.BioBank(label: Optional[str] = None, discretised: bool = False, granularity: int = 1, maximum_time: int = 115, fasttext: Optional[Fasttext] = None, time_to_task: float = 0.5, std_time_to_task: float = 0.2, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/biobank', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]
BioBank dataset.
- Parameters:
label (str) – Condition to be used as label. If None, it is set to type 2 diabetes. Default to None
discretised (bool) – Whether to return a discretised dataset or not. Default to False
granularity (str, int) – The time granularity. Default to a year.
maximum_time (int) – Maximum time to record. Default to 115 years
fasttext (Fasttext) – A Fasttext model to encode categorical features. Default to None
time_to_task (float) – Special arg for diabetes task. Stops the recording before diabetes happens. Default to .5
std_time_to_task (float) – Add randomness into when to stop recording. Default to .2
data_dir (str) – Where to download files.
batch_size (int) – Batch size. Default to 32
prop_val (float) – Proportion of validation. Default to .2
n_folds (int) – Number of folds for cross-validation. If None, the dataset is only split once between train and val using prop_val. Default to None
fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None
num_workers (int) – Number of workers for the loaders. Default to 0
seed (int) – For the random split. Default to 42
- build_discretized_features(events: List[Tensor], times: List[Tensor], verbose: Union[bool, int] = False)[source]
Build discretized features.
- build_discretized_labels(events: list, times: list) → (Tensor, Tensor)[source]
Build discretized labels.
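No usage example is provided for this dataset since it requires access to the restricted BioBank data. The sketch below assumes such access is available locally, and assumes preprocess exposes the same dictionary interface as the other datasets in this module; treat it as an illustration rather than the definitive workflow:
>>> from tint.datasets import BioBank
>>> biobank = BioBank(discretised=True)  # label defaults to type 2 diabetes
>>> x_train = biobank.preprocess(split="train")["x"]  # assumed interface
>>> y_train = biobank.preprocess(split="train")["y"]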
- class tint.datasets.HMM(n_signal: int = 3, n_state: int = 1, corr_features: Optional[list] = None, imp_features: Optional[list] = None, scale: Optional[list] = None, p0: Optional[list] = None, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/hmm', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]
2-state Hidden Markov Model as described in the DynaMask paper.
- Parameters:
n_signal (int) – Number of different signals. Default to 3
n_state (int) – Number of different possible states. Default to 1
corr_features (list) – Features that are correlated with the important feature in each state. If None, use default values. Default to None
imp_features (list) – Features that are always set as important. If None, use default values. Default to None
scale (list) – Scaling factor for distribution mean in each state. If None, use default values. Default to None
p0 (list) – Starting probability. If None, use default values. Default to None
data_dir (str) – Where to download files.
batch_size (int) – Batch size. Default to 32
prop_val (float) – Proportion of validation. Default to .2
n_folds (int) – Number of folds for cross-validation. If None, the dataset is only split once between train and val using prop_val. Default to None
fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None
num_workers (int) – Number of workers for the loaders. Default to 0
seed (int) – For the random split. Default to 42
References
Explaining Time Series Predictions with Dynamic Masks
Examples
>>> from tint.datasets import HMM
>>> hmm = HMM()
>>> hmm.download(split="train")
>>> x_train = hmm.preprocess(split="train")["x"]
>>> y_train = hmm.preprocess(split="train")["y"]
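The preprocessed tensors can then be fed to any downstream model. A minimal sketch wrapping them in a plain PyTorch loader, under the assumption that x_train and y_train are torch tensors with matching first dimensions as returned above:
>>> from torch.utils.data import DataLoader, TensorDataset
>>> train_set = TensorDataset(x_train, y_train)  # pairs each series with its labels
>>> train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
>>> x_batch, y_batch = next(iter(train_loader))  # one mini-batch of series and labels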
- class tint.datasets.Hawkes(mu: Optional[list] = None, alpha: Optional[list] = None, decay: Optional[list] = None, window: Optional[int] = None, data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/hawkes', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]
Hawkes dataset.
- Parameters:
mu (list) – Intensity baselines. If None, use default values. Default to None
alpha (list) – Events parameters. If None, use default values. Default to None
decay (list) – Intensity decays. If None, use default values. Default to None
window (int) – The window of the simulated process. If None, use default value. Default to None
data_dir (str) – Where to download files.
batch_size (int) – Batch size. Default to 32
prop_val (float) – Proportion of validation. Default to .2
n_folds (int) – Number of folds for cross-validation. If None, the dataset is only split once between train and val using prop_val. Default to None
fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None
num_workers (int) – Number of workers for the loaders. Default to 0
seed (int) – For the random split. Default to 42
References
https://x-datainitiative.github.io/tick/modules/hawkes.html
Examples
>>> from tint.datasets import Hawkes
>>> hawkes = Hawkes()
>>> hawkes.download(split="train")
>>> x_train = hawkes.preprocess(split="train")["x"]
>>> y_train = hawkes.preprocess(split="train")["y"]
- static generate_points(mu: list, alpha: list, decay: list, window: int, seed: int, dt: float = 0.01)[source]
Generates points of a marked Hawkes process using the tick library.
- static get_features(point: list) → Tensor [source]
Create features from a Hawkes process.
- Parameters:
point (list) – A Hawkes process.
- static get_labels(point: list) → Tensor [source]
Create labels from a Hawkes process.
- Parameters:
point (list) – A Hawkes process.
- static intensity(mu: Tensor, alpha: Tensor, decay: Tensor, times: Tensor, labels: Tensor, t: Tensor) → Tensor [source]
Given parameters mu, alpha and decay, some times and labels, and a vector of query times t, compute intensities at these time points.
B: Batch size. T: Temporal dim. N: Number of processes. Q: Number of time queries.
- Parameters:
mu (th.Tensor) – Intensity baselines. Shape N, Values 0..1
alpha (th.Tensor) – Events parameters. Shape N x N, Values 0..1
decay (th.Tensor) – Intensity decays. Shape N x N, Values 0..1
times (th.Tensor) – Times of the process. Shape B x T x 1
labels (th.Tensor) – Labels of the process. Shape B x T x 1
t (th.Tensor) – Query times. Shape Q
- Returns:
Intensities. Shape B x Q x N
- Return type:
th.Tensor
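A shape-oriented sketch of how intensity can be called. The values and dtypes below are arbitrary illustrations chosen to match the documented shapes; only the shapes follow the documentation above:
>>> import torch as th
>>> from tint.datasets import Hawkes
>>> B, T, N, Q = 4, 10, 2, 25  # batch, temporal dim, processes, query times
>>> mu = th.rand(N)  # intensity baselines, shape N
>>> alpha = th.rand(N, N)  # events parameters, shape N x N
>>> decay = th.rand(N, N)  # intensity decays, shape N x N
>>> times = th.rand(B, T, 1).sort(dim=1).values  # times of the process, shape B x T x 1
>>> labels = th.randint(0, N, (B, T, 1))  # labels of the process, shape B x T x 1
>>> t = th.linspace(0, 1, Q)  # query times, shape Q
>>> intensities = Hawkes.intensity(mu, alpha, decay, times, labels, t)  # shape B x Q x N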
- true_saliency(split: str = 'train')[source]
Get process true saliency.
- Parameters:
split (str) – Data split. Default to 'train'
- Returns:
The true saliency.
- Return type:
th.Tensor
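A short usage sketch, assuming hawkes has been downloaded and preprocessed as in the class example above:
>>> true_saliency = hawkes.true_saliency(split="train")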
- true_saliency_t(t: Tensor, mu: Optional[Tensor] = None, alpha: Optional[Tensor] = None, decay: Optional[Tensor] = None, times: Optional[Tensor] = None, labels: Optional[Tensor] = None, split: str = 'train')[source]
Compute the true saliency given some time queries.
B: Batch size. T: Temporal dim. N: Number of processes. Q: Number of time queries.
- Parameters:
t (th.Tensor) – Time queries. Shape Q
mu (th.Tensor) – Intensity baselines. Shape N, Values 0..1
alpha (th.Tensor) – Events parameters. Shape N x N, Values 0..1
decay (th.Tensor) – Intensity decays. Shape N x N, Values 0..1
times (th.Tensor) – Times of the process. Shape B x T x 1
labels (th.Tensor) – Labels of the process. Shape B x T x 1
split (str) – Data split. Default to 'train'
- Returns:
true_saliency
- Return type:
th.Tensor
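A sketch of querying the time-resolved saliency, reusing the tensors built in the intensity sketch above; passing the process parameters explicitly keeps the call self-contained:
>>> t = th.linspace(0, 1, 25)  # query times, shape Q
>>> saliency_t = hawkes.true_saliency_t(t=t, mu=mu, alpha=alpha, decay=decay, times=times, labels=labels)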
- class tint.datasets.Mimic3(task: str = 'mortality', data_dir: str = '/Users/josephenguehard/Documents/Python/time_interpret/tint/data/mimic3', batch_size: int = 32, prop_val: float = 0.2, n_folds: Optional[int] = None, fold: Optional[int] = None, num_workers: int = 0, seed: int = 42)[source]
MIMIC-III dataset.
Download is set up according to this repository: https://github.com/sanatonek/time_series_explainability.
Warning
Using this dataset requires having the MIMIC-III data running on a local server. Please see https://mimic.mit.edu/docs/gettingstarted/local/install-mimic-locally-ubuntu/ for more information.
- Parameters:
task (str) – Name of the task to perform. Either 'mortality' or 'blood_pressure'. Default to 'mortality'
data_dir (str) – Where to download files.
batch_size (int) – Batch size. Default to 32
n_folds (int) – Number of folds for cross-validation. If None, the dataset is only split once between train and val using prop_val. Default to None
fold (int) – Index of the fold to use with cross-validation. Ignored if n_folds is None. Default to None
prop_val (float) – Proportion of validation. Default to .2
num_workers (int) – Number of workers for the loaders. Default to 0
seed (int) – For the random split. Default to 42
Examples
>>> from tint.datasets import Mimic3
>>> mimic3 = Mimic3()
>>> mimic3.download(sqluser="your_username", split="train")
>>> x_train = mimic3.preprocess(split="train")["x"]
>>> y_train = mimic3.preprocess(split="train")["y"]
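The same workflow applies to the blood pressure task. A minimal sketch, assuming the MIMIC-III database is installed locally as described in the warning above:
>>> mimic3_bp = Mimic3(task="blood_pressure")
>>> mimic3_bp.download(sqluser="your_username", split="train")
>>> x_train = mimic3_bp.preprocess(split="train")["x"]
>>> y_train = mimic3_bp.preprocess(split="train")["y"]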