Downloading Public Datasets#

What you will learn in this tutorial:#

how to download and extract one of the available public datasets
how to customize the default directory structure

Preparations#

We import pymovements under the alias pm for convenience.

[1]:

import pymovements as pm

pymovements provides a library of publicly available datasets.

You can browse through the available dataset definitions here: Datasets

For this tutorial we will limit ourselves to the ToyDataset due to its minimal space requirements.

Other datasets can be downloaded by simply replacing ToyDataset with one of the other available datasets.

Initialization#

First we initialize our public dataset by specifying its name and the path to its directory.

The dataset will then be placed in that directory:

[2]:

dataset = pm.Dataset('ToyDataset', path='data/ToyDataset')

dataset.path

[2]:

PosixPath('data/ToyDataset')

If you only want to specify a root directory which contains all your datasets, you can pass a DatasetPaths instance.

The directory of your dataset will have the same name as in the dataset definition.

[3]:

dataset_paths = pm.DatasetPaths(root='data/')
dataset = pm.Dataset('ToyDataset', path=dataset_paths)

dataset.path

[3]:

PosixPath('data/ToyDataset')

You can also specify an alternative dataset directory for your downloaded dataset.

[4]:

dataset_paths_alt = pm.DatasetPaths(root='data/', dataset='my_dataset')
dataset_alt = pm.Dataset('ToyDataset', path=dataset_paths_alt)

dataset_alt.path

[4]:

PosixPath('data/my_dataset')

Downloading#

To download the dataset, call its download() method:

[5]:

dataset.download()

Using already downloaded and verified file: data/ToyDataset/downloads/pymovements-toy-dataset.zip
Extracting pymovements-toy-dataset.zip to data/ToyDataset/raw

[5]:

<pymovements.dataset.dataset.Dataset at 0x7fd1b25e0940>

As we see from the download message, the dataset resource has been downloaded to a downloads directory.

You can get the path to this directory from the Dataset.paths.downloads attribute:

[6]:

dataset.paths.downloads

[6]:

PosixPath('data/ToyDataset/downloads')

You can also specify a custom name for the downloads directory during initialization:

[7]:

dataset_paths_3 = pm.DatasetPaths(root='data/', downloads='new_downloads')
dataset_3 = pm.Dataset('ToyDataset', path=dataset_paths_3)

dataset_3.paths.downloads

[7]:

PosixPath('data/ToyDataset/new_downloads')

By default, all archives are recursively extracted to Dataset.paths.raw:

[8]:

dataset.paths.raw

[8]:

PosixPath('data/ToyDataset/raw')
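
To summarize how these directory names combine, here is a minimal sketch that uses only the standard library. The class PathsSketch and its method names are hypothetical: they merely mirror the layout visible in the outputs above and are not how pymovements implements DatasetPaths.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class PathsSketch:
    """Hypothetical stand-in that mirrors the directory layout shown above."""

    root: str = 'data'
    dataset: Optional[str] = None  # falls back to the dataset name if unset
    downloads: str = 'downloads'
    raw: str = 'raw'

    def dataset_dir(self, name: str) -> Path:
        # <root>/<dataset>, where <dataset> defaults to the dataset's own name.
        return Path(self.root) / (self.dataset or name)

    def downloads_dir(self, name: str) -> Path:
        # Archives are stored below the dataset directory.
        return self.dataset_dir(name) / self.downloads

    def raw_dir(self, name: str) -> Path:
        # Extracted files are stored below the dataset directory as well.
        return self.dataset_dir(name) / self.raw


default = PathsSketch(root='data/')
custom = PathsSketch(root='data/', dataset='my_dataset', downloads='new_downloads')

print(default.dataset_dir('ToyDataset'))
print(default.raw_dir('ToyDataset'))
print(custom.downloads_dir('ToyDataset'))
```

The point of the sketch is that only the root is fixed by you; every other directory is resolved relative to it, so overriding a single name leaves the rest of the layout untouched.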

If you want to remove the downloaded archives after extraction to save some space, you can set remove_finished to True:

[9]:

dataset.extract(remove_finished=True)

Extracting pymovements-toy-dataset.zip to data/ToyDataset/raw

[9]:

<pymovements.dataset.dataset.Dataset at 0x7fd1b25e0940>
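
To make the idea of recursive extraction with optional cleanup concrete, the following is a rough standard-library sketch. The function extract_archive is hypothetical and handles zip files only; it illustrates the behaviour described above under those assumptions and is not the pymovements implementation.

```python
import zipfile
from pathlib import Path


def extract_archive(archive: Path, destination: Path, remove_finished: bool = False) -> None:
    """Extract a zip archive and any zip archives nested inside it."""
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        member_names = zf.namelist()
        zf.extractall(destination)

    # Delete the archive only after its contents have been written to disk.
    if remove_finished:
        archive.unlink()

    # Recurse into nested archives, extracting each into a sibling directory
    # named after the archive without its '.zip' suffix.
    for name in member_names:
        member = destination / name
        if member.is_file() and member.suffix == '.zip':
            extract_archive(member, member.with_suffix(''), remove_finished)
```

In this sketch the cleanup flag is passed down the recursion, so nested archives are removed as well. Whether pymovements treats nested archives the same way is not stated in this tutorial.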

The remove_finished option is also available for the download() method:

[10]:

dataset.download(remove_finished=True)

Downloading http://github.com/aeye-lab/pymovements-toy-dataset/zipball/6cb5d663317bf418cec0c9abe1dde5085a8a8ebd/ to data/ToyDataset/downloads/pymovements-toy-dataset.zip

pymovements-toy-dataset.zip: 100%|██████████| 3.06M/3.06M [00:00<00:00, 25.6MB/s]

Checking integrity of pymovements-toy-dataset.zip
Extracting pymovements-toy-dataset.zip to data/ToyDataset/raw

[10]:

<pymovements.dataset.dataset.Dataset at 0x7fd1b25e0940>
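
The log messages in this section suggest that an archive is checked for integrity and that a download is skipped when a verified copy already exists. A simplified sketch of such a check could look as follows. The helper names and the choice of SHA-256 are assumptions made for illustration; the tutorial does not say which checksum pymovements uses.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open('rb') as f:
        # Reading in chunks keeps memory usage flat for large archives.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_checksum: str) -> bool:
    """Hypothetical helper: True unless a verified copy already exists."""
    return not (path.is_file() and file_checksum(path) == expected_checksum)
```

With a check like this, a missing or corrupted archive triggers a fresh download, while an intact one is reused.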

Loading into memory#

The PublicDataset class is a subclass of the Dataset class and thus inherits all its functionality.

Hence, we can load the data into our working memory by using the common load() method:

[11]:

dataset.load()

100%|██████████| 20/20 [00:00<00:00, 20.43it/s]

[11]:

<pymovements.dataset.dataset.Dataset at 0x7fd1b25e0940>

Let’s verify that we have correctly scanned the dataset files:

[12]:

dataset.fileinfo

[12]:

shape: (20, 3)

text_id	page_id	filepath
i64	i64	str
0	1	"aeye-lab-pymov…
0	2	"aeye-lab-pymov…
0	3	"aeye-lab-pymov…
0	4	"aeye-lab-pymov…
0	5	"aeye-lab-pymov…
1	1	"aeye-lab-pymov…
1	2	"aeye-lab-pymov…
1	3	"aeye-lab-pymov…
1	4	"aeye-lab-pymov…
1	5	"aeye-lab-pymov…
2	1	"aeye-lab-pymov…
2	2	"aeye-lab-pymov…
2	3	"aeye-lab-pymov…
2	4	"aeye-lab-pymov…
2	5	"aeye-lab-pymov…
3	1	"aeye-lab-pymov…
3	2	"aeye-lab-pymov…
3	3	"aeye-lab-pymov…
3	4	"aeye-lab-pymov…
3	5	"aeye-lab-pymov…
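
Instead of checking the table by eye, the same verification can be made explicit. The sketch below works on plain (text_id, page_id) tuples that stand in for the rows listed above, and it assumes that the dataset consists of 4 texts with 5 pages each, as the table suggests. The helper missing_pairs is hypothetical and not part of pymovements.

```python
from itertools import product

# (text_id, page_id) pairs as listed in the fileinfo table above.
scanned = [(text_id, page_id) for text_id in range(4) for page_id in range(1, 6)]


def missing_pairs(rows, text_ids, page_ids):
    """Return the expected (text_id, page_id) combinations absent from rows."""
    expected = set(product(text_ids, page_ids))
    return sorted(expected - set(rows))


# An empty result means every text has all of its pages.
print(missing_pairs(scanned, range(4), range(1, 6)))
```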

Wonderful, all of our data has been downloaded and loaded successfully!

What you have learned in this tutorial:#

how to initialize a public dataset
how to download and extract dataset resources
how to customize the default directory structure
how to load the dataset into your working memory