Downloading Public Datasets#
What you will learn in this tutorial:#
how to download and extract one of the available public datasets
how to customize the default directory structure
Preparations#
We import pymovements under the alias pm for convenience.
[1]:
import pymovements as pm
pymovements provides a library of publicly available datasets.
You can browse through the available dataset definitions here: Datasets
For this tutorial we will limit ourselves to the ToyDataset
due to its minimal space requirements.
Other datasets can be downloaded by simply replacing ToyDataset
with one of the other available datasets.
Initialization#
First we initialize our public dataset by specifying its name and the path of the directory in which it will be stored.
Here the directory carries the same name as the dataset:
[2]:
dataset = pm.Dataset('ToyDataset', path='data/ToyDataset')
dataset.path
[2]:
PosixPath('data/ToyDataset')
If you only want to specify a root directory that contains all your datasets, you can pass a DatasetPaths instance.
The directory of your dataset will then have the same name as in the dataset definition.
[3]:
dataset_paths = pm.DatasetPaths(root='data/')
dataset = pm.Dataset('ToyDataset', path=dataset_paths)
dataset.path
[3]:
PosixPath('data/ToyDataset')
You can also specify an alternative dataset directory name for your downloaded dataset.
[4]:
dataset_paths_alt = pm.DatasetPaths(root='data/', dataset='my_dataset')
dataset_alt = pm.Dataset('ToyDataset', path=dataset_paths_alt)
dataset_alt.path
[4]:
PosixPath('data/my_dataset')
Downloading#
The dataset will then be downloaded by calling:
[5]:
dataset.download()
Using already downloaded and verified file: data/ToyDataset/downloads/pymovements-toy-dataset.zip
Extracting pymovements-toy-dataset.zip to data/ToyDataset/raw
[5]:
<pymovements.dataset.dataset.Dataset at 0x7fd1b25e0940>
As the download message shows, the dataset resource has been downloaded to a downloads directory.
You can get the path to this directory from the Dataset.paths.downloads attribute:
[6]:
dataset.paths.downloads
[6]:
PosixPath('data/ToyDataset/downloads')
You can also specify a custom directory name during initialization:
[7]:
dataset_paths_3 = pm.DatasetPaths(root='data/', downloads='new_downloads')
dataset_3 = pm.Dataset('ToyDataset', path=dataset_paths_3)
dataset_3.paths.downloads
[7]:
PosixPath('data/ToyDataset/new_downloads')
By default, all archives are recursively extracted to Dataset.paths.raw:
[8]:
dataset.paths.raw
[8]:
PosixPath('data/ToyDataset/raw')
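The directory options shown so far all follow the same pattern: a root, a dataset directory inside it, and named subdirectories inside that. As a rough mental model only (this is not how pymovements is implemented), the composition can be sketched with plain pathlib. The helper `resolve_dirs` and its parameter names are hypothetical and chosen to mirror the paths printed in this tutorial.

```python
from pathlib import PurePosixPath


def resolve_dirs(root, dataset, downloads='downloads', raw='raw'):
    """Compose dataset directories from a root and directory names.

    Hypothetical helper for illustration only; it reproduces the
    paths printed in this tutorial, not the library's internals.
    """
    dataset_dir = PurePosixPath(root) / dataset
    return {
        'dataset': dataset_dir,
        'downloads': dataset_dir / downloads,
        'raw': dataset_dir / raw,
    }


# Default layout: directory names follow the dataset definition.
default_dirs = resolve_dirs('data/', 'ToyDataset')

# Custom dataset directory and custom downloads directory.
renamed_dirs = resolve_dirs('data/', 'my_dataset')
custom_dirs = resolve_dirs('data/', 'ToyDataset', downloads='new_downloads')

print(default_dirs['raw'])
print(renamed_dirs['dataset'])
print(custom_dirs['downloads'])
```

Each customization changes only one path component, which is why the outputs above differ from the defaults in exactly one segment.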
If you want to remove the downloaded archives after extraction to save some space, you can set remove_finished to True:
[9]:
dataset.extract(remove_finished=True)
Extracting pymovements-toy-dataset.zip to data/ToyDataset/raw
[9]:
<pymovements.dataset.dataset.Dataset at 0x7fd1b25e0940>
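To make the effect of remove_finished concrete, here is a minimal standard-library sketch of the idea: extract an archive, then delete it if requested. The function `extract_archive` is a hypothetical stand-in, not the library's code, and unlike the behaviour described above it does not recurse into nested archives.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive, destination, remove_finished=False):
    """Extract a zip archive and optionally delete it afterwards.

    Hypothetical helper; it only illustrates the idea behind the
    remove_finished flag and handles a single, non-nested zip file.
    """
    archive = Path(archive)
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zip_file:
        zip_file.extractall(destination)
    if remove_finished:
        archive.unlink()  # free disk space once extraction has succeeded
    return destination


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Build a tiny archive that plays the role of a downloaded resource.
    archive_path = tmp / 'downloads' / 'example.zip'
    archive_path.parent.mkdir()
    with zipfile.ZipFile(archive_path, 'w') as zip_file:
        zip_file.writestr('trial.csv', 'x,y\n0,0\n')

    raw_dir = extract_archive(archive_path, tmp / 'raw', remove_finished=True)
    extracted_ok = (raw_dir / 'trial.csv').exists()
    archive_removed = not archive_path.exists()

print(extracted_ok, archive_removed)
```

Once the archive has been removed, a later download has nothing to reuse and must fetch the resource again.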
The same option is also available for the PublicDataset.download() method:
[10]:
dataset.download(remove_finished=True)
Downloading http://github.com/aeye-lab/pymovements-toy-dataset/zipball/6cb5d663317bf418cec0c9abe1dde5085a8a8ebd/ to data/ToyDataset/downloads/pymovements-toy-dataset.zip
pymovements-toy-dataset.zip: 100%|██████████| 3.06M/3.06M [00:00<00:00, 25.6MB/s]
Checking integrity of pymovements-toy-dataset.zip
Extracting pymovements-toy-dataset.zip to data/ToyDataset/raw
[10]:
<pymovements.dataset.dataset.Dataset at 0x7fd1b25e0940>
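The log messages distinguish between reusing an "already downloaded and verified file" and downloading again followed by "Checking integrity". A common way to implement such a check is to compare a file's checksum with an expected value. The sketch below assumes SHA-256 and a hypothetical helper `file_matches_checksum`; the algorithm and function names used by the library may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def file_matches_checksum(path, expected_sha256, chunk_size=1024 * 1024):
    """Return True if the file's SHA-256 digest equals expected_sha256.

    Hypothetical helper; the file is read in chunks so that large
    archives do not have to fit into memory at once.
    """
    digest = hashlib.sha256()
    with open(path, 'rb') as file:
        for chunk in iter(lambda: file.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256


with tempfile.TemporaryDirectory() as tmp:
    content = b'placeholder bytes standing in for an archive'
    archive = Path(tmp) / 'example.zip'
    archive.write_bytes(content)
    expected = hashlib.sha256(content).hexdigest()

    # A download is only needed if the file is missing or fails the check.
    needs_download = not (
        archive.exists() and file_matches_checksum(archive, expected)
    )
    matches_wrong_checksum = file_matches_checksum(archive, '0' * 64)

print(needs_download, matches_wrong_checksum)
```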
Loading into memory#
The PublicDataset class is a subclass of the Dataset class and thus inherits all its functionality.
Hence, we can load the data into our working memory by using the common load() method:
[11]:
dataset.load()
100%|██████████| 20/20 [00:00<00:00, 20.43it/s]
[11]:
<pymovements.dataset.dataset.Dataset at 0x7fd1b25e0940>
Let’s verify that we have correctly scanned the dataset files:
[12]:
dataset.fileinfo
[12]:
text_id | page_id | filepath |
---|---|---|
i64 | i64 | str |
0 | 1 | "aeye-lab-pymov… |
0 | 2 | "aeye-lab-pymov… |
0 | 3 | "aeye-lab-pymov… |
0 | 4 | "aeye-lab-pymov… |
0 | 5 | "aeye-lab-pymov… |
1 | 1 | "aeye-lab-pymov… |
1 | 2 | "aeye-lab-pymov… |
1 | 3 | "aeye-lab-pymov… |
1 | 4 | "aeye-lab-pymov… |
1 | 5 | "aeye-lab-pymov… |
2 | 1 | "aeye-lab-pymov… |
2 | 2 | "aeye-lab-pymov… |
2 | 3 | "aeye-lab-pymov… |
2 | 4 | "aeye-lab-pymov… |
2 | 5 | "aeye-lab-pymov… |
3 | 1 | "aeye-lab-pymov… |
3 | 2 | "aeye-lab-pymov… |
3 | 3 | "aeye-lab-pymov… |
3 | 4 | "aeye-lab-pymov… |
3 | 5 | "aeye-lab-pymov… |
Wonderful, all of our data has been downloaded and loaded in successfully!
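The fileinfo table pairs each file path with metadata columns such as text_id and page_id. One plausible way to obtain such columns is to parse them from the filenames with a pattern. The sketch below illustrates that idea with a regular expression; the filename pattern `trial_<text_id>_<page_id>.csv` and the function `scan_filenames` are assumptions made for this example, since the actual pattern belongs to the dataset definition.

```python
import re

# Assumed filename pattern, used here for illustration only.
FILENAME_PATTERN = re.compile(r'trial_(?P<text_id>\d+)_(?P<page_id>\d+)\.csv')


def scan_filenames(filenames):
    """Parse metadata from filenames and return one record per match."""
    records = []
    for name in sorted(filenames):
        match = FILENAME_PATTERN.fullmatch(name)
        if match is None:
            continue  # skip files that do not follow the pattern
        records.append({
            'text_id': int(match['text_id']),
            'page_id': int(match['page_id']),
            'filepath': name,
        })
    return records


# Four texts with five pages each, plus one unrelated file.
filenames = [
    f'trial_{text_id}_{page_id}.csv'
    for text_id in range(4)
    for page_id in range(1, 6)
]
filenames.append('README.md')

records = scan_filenames(filenames)
print(len(records))
print(records[0])
```

With four texts and five pages each, the scan yields 20 records, which matches the 20 files reported by the progress bar of load().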
What you have learned in this tutorial:#
how to initialize a public dataset
how to download and extract dataset resources
how to customize the default directory structure
how to load the dataset into your working memory