Downloading Public Datasets#

What you will learn in this tutorial:#

how to download and extract one of the available public datasets
how to customize the default directory structure

Preparations#

We import pymovements under the alias pm for convenience.

[1]:

import pymovements as pm

pymovements provides a library of publicly available datasets.

You can browse through the available dataset definitions here: Datasets

For this tutorial we will limit ourselves to the ToyDataset due to its minimal space requirements.

Other datasets can be downloaded by simply replacing ToyDataset with one of the other available datasets.

Initialization#

First we initialize the dataset by specifying the root data directory. Our dataset will then be placed in a directory with the name of the dataset:

[2]:

dataset = pm.datasets.ToyDataset(root='data/')

dataset.path

[2]:

PosixPath('data/ToyDataset')

If you don’t want this additional directory and prefer to use the root path itself as your dataset path, you can set dataset_dirname explicitly to '.':

[3]:

pm.datasets.ToyDataset(root='data/', dataset_dirname='.').path

[3]:

PosixPath('data')
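The two results above behave like ordinary path joins. The following is a minimal sketch of that behaviour using only pathlib; resolve_dataset_path is a hypothetical helper written for illustration and is not part of pymovements:

```python
from pathlib import Path


def resolve_dataset_path(root: str, dataset_dirname: str = "ToyDataset") -> Path:
    """Join the root directory and the dataset directory name (illustrative helper)."""
    # pathlib collapses a single "." component, so dataset_dirname="."
    # resolves to the root directory itself.
    return Path(root) / dataset_dirname


default_path = resolve_dataset_path("data/")
flat_path = resolve_dataset_path("data/", dataset_dirname=".")

print(default_path)
print(flat_path)
```

This is why passing '.' as the directory name leaves no extra path component behind.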

Downloading#

The dataset will then be downloaded by calling:

[4]:

dataset.download()

Using already downloaded and verified file: data/ToyDataset/downloads/pymovements-toy-dataset.zip

As we see from the download message, the dataset resource has been downloaded to a downloads directory.

You can get the path to this directory from the downloads_rootpath attribute:

[5]:

dataset.downloads_rootpath

[5]:

PosixPath('data/ToyDataset/downloads')

You can also specify a custom directory name during initialization:

[6]:

pm.datasets.ToyDataset(root='data/', downloads_dirname='my_downloads').downloads_rootpath

[6]:

PosixPath('data/ToyDataset/my_downloads')
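The download message shown earlier mentions an "already downloaded and verified file". One common way to implement such a check is to compare a checksum of the local file against an expected value and to skip the download when they match. The sketch below assumes this approach and uses SHA-256; the helper names are hypothetical, and whether pymovements uses the same hash function internally is not stated in this tutorial:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(path: Path, expected_sha256: str) -> bool:
    """Return True if the file is missing or its checksum does not match."""
    # Assumption: a file counts as "verified" when its digest equals the expected one.
    return not path.is_file() or sha256_of(path) != expected_sha256
```

With a check like this, repeated calls are cheap: the archive is only fetched again if it is missing or corrupted.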

Extracting#

You can then extract your downloaded data by calling:

[7]:

dataset.extract()

Your data is now extracted to the following directory:

[8]:

dataset.raw_rootpath

[8]:

PosixPath('data/ToyDataset/raw')
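Conceptually, the extraction step unpacks the archive from the downloads directory into the raw directory. The following is a minimal sketch of that idea using the standard zipfile module; extract_archive is a hypothetical helper for illustration, not the pymovements implementation:

```python
import zipfile
from pathlib import Path


def extract_archive(archive: Path, target_dir: Path) -> list[str]:
    """Unpack a zip archive into target_dir and return the archive member names."""
    # Create the target directory (and any missing parents) if needed.
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


# Example call with the paths from this tutorial (not executed here):
# extract_archive(
#     Path("data/ToyDataset/downloads/pymovements-toy-dataset.zip"),
#     Path("data/ToyDataset/raw"),
# )
```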

Loading into memory#

Finally we can load the data into our working memory by using the common load() method:

[9]:

dataset.load()

100%|██████████| 20/20 [00:00<00:00, 200.48it/s]

Let’s verify that we have correctly scanned the dataset files:

[10]:

dataset.fileinfo

[10]:

shape: (20, 3)

text_id	page_id	filepath
i64	i64	str
0	1	"aeye-lab-pymov...
0	2	"aeye-lab-pymov...
0	3	"aeye-lab-pymov...
0	4	"aeye-lab-pymov...
0	5	"aeye-lab-pymov...
1	1	"aeye-lab-pymov...
1	2	"aeye-lab-pymov...
1	3	"aeye-lab-pymov...
1	4	"aeye-lab-pymov...
1	5	"aeye-lab-pymov...
2	1	"aeye-lab-pymov...
2	2	"aeye-lab-pymov...
2	3	"aeye-lab-pymov...
2	4	"aeye-lab-pymov...
2	5	"aeye-lab-pymov...
3	1	"aeye-lab-pymov...
3	2	"aeye-lab-pymov...
3	3	"aeye-lab-pymov...
3	4	"aeye-lab-pymov...
3	5	"aeye-lab-pymov...
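A table like fileinfo can be built by scanning the raw directory and parsing identifiers out of each file name. The sketch below illustrates this with the standard library only. The file name pattern trial_<text_id>_<page_id>.csv is an assumption made for this example, and scan_files is a hypothetical helper; the actual pattern is defined by the dataset definition:

```python
import re
from pathlib import Path

# Assumed file name pattern for illustration only.
FILENAME_PATTERN = re.compile(r"trial_(?P<text_id>\d+)_(?P<page_id>\d+)\.csv")


def scan_files(raw_dir: Path) -> list[dict]:
    """Collect one record per file below raw_dir that matches the pattern."""
    records = []
    for path in sorted(raw_dir.rglob("*.csv")):
        match = FILENAME_PATTERN.fullmatch(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed naming scheme
        records.append({
            "text_id": int(match["text_id"]),
            "page_id": int(match["page_id"]),
            "filepath": path.relative_to(raw_dir).as_posix(),
        })
    return records
```

Each record corresponds to one row of the table, with the identifiers converted to integers and the file path stored relative to the scanned directory.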

Wonderful, all 20 files of our dataset have been downloaded, extracted, and loaded successfully!

What you have learned in this tutorial:#

how to initialize a public dataset
how to download and extract dataset resources
how to customize the default directory structure
how to load the dataset into your working memory