# Downloading Public Datasets

## What you will learn in this tutorial:

* how to download and extract one of the available public datasets
* how to customize the default directory structure

## Preparations

We import `pymovements` as the alias `pm` for convenience.

In [None]:
import pymovements as pm

pymovements provides a library of publicly available datasets.

You can browse through the available dataset definitions here:
[Datasets](https://pymovements.readthedocs.io/en/latest/reference/pymovements.datasets.html#module-pymovements.datasets)

For this tutorial we will limit ourselves to the `ToyDataset` due to its minimal space requirements.

Other datasets can be downloaded by simply replacing `ToyDataset` with one of the other available datasets.

## Initialization

First we initialize our public dataset by specifying its name and the path of the directory where it should be stored.

The dataset will then be placed in exactly that directory:

In [None]:
dataset = pm.Dataset('ToyDataset', path='data/ToyDataset')

dataset.path

If you only want to specify a root directory which contains all your datasets, you can pass a `DatasetPaths` instance.

The directory of your dataset will have the same name as in the dataset definition.

In [None]:
dataset_paths = pm.DatasetPaths(root='data/')
dataset = pm.Dataset('ToyDataset', path=dataset_paths)

dataset.path

You can also specify an alternative dataset directory name for your downloaded dataset.

In [None]:
dataset_paths_alt = pm.DatasetPaths(root='data/', dataset='my_dataset')
dataset_alt = pm.Dataset('ToyDataset', path=dataset_paths_alt)

dataset_alt.path
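The last two variants differ only in how the final dataset directory is derived from the root directory. As a rough illustration using only the standard library, the rule described above can be sketched as follows. The helper `resolve_dataset_dir` is hypothetical and not part of pymovements; it only mirrors the behaviour described in the text.

```python
from __future__ import annotations

from pathlib import Path


def resolve_dataset_dir(root: str, definition_name: str, dataset: str | None = None) -> Path:
    """Hypothetical helper mirroring the rule described above (not pymovements code).

    Without an explicit ``dataset`` directory name, the name from the
    dataset definition is used as the directory name below ``root``.
    """
    directory_name = definition_name if dataset is None else dataset
    return Path(root) / directory_name


default_dir = resolve_dataset_dir('data/', 'ToyDataset')
custom_dir = resolve_dataset_dir('data/', 'ToyDataset', dataset='my_dataset')

print(default_dir.as_posix())  # data/ToyDataset
print(custom_dir.as_posix())   # data/my_dataset
```

The actual values for your setup are the ones shown by `dataset.path` and `dataset_alt.path` in the cells above.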

## Downloading

The dataset will then be downloaded by calling:

In [None]:
dataset.download()

As we see from the download message, the dataset resource has been downloaded to a downloads directory.

You can get the path to this directory from the `Dataset.paths.downloads` attribute:

In [None]:
dataset.paths.downloads

You can also specify a custom directory name during initialization:

In [None]:
dataset_paths_3 = pm.DatasetPaths(root='data/', downloads='new_downloads')
dataset_3 = pm.Dataset('ToyDataset', path=dataset_paths_3)

dataset_3.paths.downloads
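Only the name of the downloads directory changes here. Where that directory sits is not spelled out in this tutorial, so the sketch below assumes it is a subdirectory of the dataset directory. The `LayoutSketch` class is purely illustrative and is not the pymovements `DatasetPaths` implementation; compare its output with what `dataset_3.paths.downloads` prints on your machine.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LayoutSketch:
    """Illustrative stand-in for a dataset directory layout (not pymovements code)."""

    root: str
    dataset: str
    downloads: str = 'downloads'  # assumed default directory name

    @property
    def dataset_dir(self) -> Path:
        return Path(self.root) / self.dataset

    @property
    def downloads_dir(self) -> Path:
        # Assumption: downloads live inside the dataset directory.
        return self.dataset_dir / self.downloads


default_layout = LayoutSketch(root='data/', dataset='ToyDataset')
custom_layout = LayoutSketch(root='data/', dataset='ToyDataset', downloads='new_downloads')

print(default_layout.downloads_dir.as_posix())  # data/ToyDataset/downloads
print(custom_layout.downloads_dir.as_posix())   # data/ToyDataset/new_downloads
```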

By default, all archives are recursively extracted to `Dataset.paths.raw`:

In [None]:
dataset.paths.raw

If you want to remove the downloaded archives after extraction to save some space, you can set `remove_finished` to `True`:

In [None]:
dataset.extract(remove_finished=True)

The same option is also available for the `Dataset.download()` method:

In [None]:
dataset.download(remove_finished=True)
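To make the idea of recursive extraction with `remove_finished` more concrete, here is a minimal standard-library sketch. It is not the pymovements implementation: the function `extract_recursive` is hypothetical, it handles zip archives only, and placing nested archive contents in a directory named after the archive is an assumption made for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_recursive(archive: Path, destination: Path, remove_finished: bool = False) -> None:
    """Extract a zip archive and any zip archives nested inside it (illustrative only)."""
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zip_file:
        zip_file.extractall(destination)

    if remove_finished:
        # Delete the archive once its contents are on disk.
        archive.unlink()

    # Extract nested archives into a directory named after each archive.
    for nested in sorted(destination.rglob('*.zip')):
        if nested.exists():  # may already have been handled by a deeper call
            extract_recursive(nested, nested.with_suffix(''), remove_finished)


# Small demonstration with a throwaway nested archive.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    downloads_dir = tmp_dir / 'downloads'
    raw_dir = tmp_dir / 'raw'
    downloads_dir.mkdir()

    # Build outer.zip containing inner.zip, which contains one data file.
    inner_path = tmp_dir / 'inner.zip'
    with zipfile.ZipFile(inner_path, 'w') as inner_zip:
        inner_zip.writestr('trial.csv', 'x,y\n1,2\n')
    outer_path = downloads_dir / 'outer.zip'
    with zipfile.ZipFile(outer_path, 'w') as outer_zip:
        outer_zip.write(inner_path, arcname='inner.zip')

    extract_recursive(outer_path, raw_dir, remove_finished=True)

    extracted_files = sorted(
        path.relative_to(raw_dir).as_posix()
        for path in raw_dir.rglob('*')
        if path.is_file()
    )
    archive_still_present = outer_path.exists()

print(extracted_files)        # ['inner/trial.csv']
print(archive_still_present)  # False
```

With `remove_finished=False` the archives would stay on disk next to the extracted files, which costs space but avoids a second download if you need to extract again.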

## Loading into memory

The `PublicDataset` class is a subclass of the `Dataset` class and thus inherits all its functionality.

Hence, we can load the data into our working memory by using the common `load()` method:

In [None]:
dataset.load()

Let's verify that we have correctly scanned the dataset files:

In [None]:
dataset.fileinfo

Wonderful, all of our data has been downloaded and loaded in successfully!

## What you have learned in this tutorial:

* how to initialize a public dataset
* how to download and extract dataset resources
* how to customize the default directory structure
* how to load the dataset into your working memory