pymovements.datasets.PoTeC

class pymovements.datasets.PoTeC(name: str = 'PoTeC', mirrors: tuple[str, ...] = ('https://osf.io/download/',), resources: tuple[dict[str, str], ...] = ({'filename': 'PoTeC.zip', 'md5': 'cffd45039757c3777e2fd130e5d8a2ad', 'resource': 'tgd9q/'},), experiment: Experiment = <pymovements.gaze.experiment.Experiment object>, filename_format: str = 'reader{subject_id:d}_{text_id}_raw_data.tsv', filename_format_dtypes: dict[str, type] = <factory>, custom_read_kwargs: dict[str, Any] = <factory>, column_map: dict[str, str] = <factory>, trial_columns: list[str] = <factory>, time_column: str = 'time', time_unit: str = 'ms', pixel_columns: list[str] = <factory>, position_columns: list[str] | None = None, velocity_columns: list[str] | None = None, acceleration_columns: list[str] | None = None, distance_column: str | None = None)

PoTeC dataset [Jakobi et al., 2024].

The Potsdam Textbook Corpus (PoTeC) is a naturalistic eye-tracking-while-reading corpus containing data from 75 participants reading 12 scientific texts. PoTeC is the first naturalistic eye-tracking-while-reading corpus that contains eye movements from domain experts as well as novices in a within-participant manipulation: it is based on a 2×2×2 fully crossed factorial design which includes the participants’ level of study and the participants’ discipline of study as between-subject factors and the text domain as a within-subject factor. The participants’ reading comprehension was assessed by a series of text comprehension questions, and their domain knowledge was tested by text-independent background questions for each of the texts. The materials are annotated for a variety of linguistic features at different levels. We envision PoTeC being used for a wide range of studies, including but not limited to analyses of expert and non-expert reading strategies.

The corpus, the accompanying data at every stage of the preprocessing pipeline, and all code used to preprocess the data are available via GitHub.

name

The name of the dataset.

Type: str

mirrors

A tuple of mirrors of the dataset. Each entry must be of type str and end with a ‘/’.

Type: tuple[str, …]

resources

A tuple of dataset resources. Each entry must be a dictionary with the following keys:

- resource: The URL suffix of the resource. This will be concatenated with the mirror.
- filename: The filename under which the file is saved.
- md5: The MD5 checksum of the respective file.

Type: tuple[dict[str, str], …]
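
As a rough illustration of how a mirror and a resource entry might combine, the sketch below concatenates each mirror with the resource suffix and checks a byte string against an MD5 checksum. The helper names `candidate_urls` and `md5_matches` are hypothetical and not part of pymovements; only the mirror and resource values are taken from the default definition above.

```python
import hashlib

# Values copied from the default PoTeC definition shown in the class signature.
mirrors = ('https://osf.io/download/',)
resources = (
    {
        'resource': 'tgd9q/',
        'filename': 'PoTeC.zip',
        'md5': 'cffd45039757c3777e2fd130e5d8a2ad',
    },
)


def candidate_urls(mirrors, resource):
    """Concatenate every mirror with the resource suffix (hypothetical helper)."""
    return [mirror + resource['resource'] for mirror in mirrors]


def md5_matches(data, expected_md5):
    """Compare the MD5 hex digest of some bytes with an expected checksum."""
    return hashlib.md5(data).hexdigest() == expected_md5


urls = candidate_urls(mirrors, resources[0])
print(urls)
```

Because each mirror ends with a ‘/’, plain string concatenation is enough to build the URL in this sketch.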

experiment

The experiment definition.

Type: Experiment

filename_format

Regular expression which will be matched before trying to load the file. Named groups will appear in the fileinfo dataframe.

Type: str
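
To make the role of named groups concrete, here is a minimal sketch using Python's `re` module. The regular expression is a hand-written translation of the default `filename_format` and is an assumption of this example, as are the helper `parse_filename` and the filenames; how pymovements itself interprets the curly-brace pattern is not shown here.

```python
import re

# Hand-written regular expression meant to mirror the default
# filename_format 'reader{subject_id:d}_{text_id}_raw_data.tsv'.
# This translation is an assumption of the sketch.
FILENAME_REGEX = re.compile(
    r'reader(?P<subject_id>\d+)_(?P<text_id>.+)_raw_data\.tsv',
)


def parse_filename(filename):
    """Return the named groups of a matching filename, or None if there is no match."""
    match = FILENAME_REGEX.fullmatch(filename)
    return match.groupdict() if match else None


# Hypothetical filenames, made up for illustration.
matched = parse_filename('reader12_b3_raw_data.tsv')
skipped = parse_filename('notes.txt')
print(matched, skipped)
```

Files that do not match the pattern yield `None` in this sketch, which corresponds to skipping them before any loading is attempted.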

filename_format_dtypes

If named groups are present in the filename_format, this makes it possible to cast specific named groups to a particular datatype.

Type: dict[str, type], optional
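
The casting step can be sketched in plain Python. The mapping used here (`subject_id` to `int`, `text_id` to `str`) is an assumption for illustration, since the signature only shows the default as `<factory>`; `cast_groups` is a hypothetical helper, not a pymovements function.

```python
def cast_groups(groups, dtypes):
    """Cast each named group listed in dtypes; leave the others as strings."""
    return {
        key: dtypes[key](value) if key in dtypes else value
        for key, value in groups.items()
    }


# Named groups as they would come out of a filename match (all strings).
groups = {'subject_id': '12', 'text_id': 'b3'}

# Assumed dtype mapping, chosen for illustration only.
dtypes = {'subject_id': int, 'text_id': str}

casted = cast_groups(groups, dtypes)
print(casted)
```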

column_map

The keys are the columns to read, the values are the names to which they should be renamed.

Type: dict[str, str]
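
A simplified reading of this description is sketched below with the standard library only: the keys select which columns are kept, and the values give their new names. The column names, the sample data and the helper `read_and_rename` are all hypothetical; the actual default `column_map` for PoTeC is not shown on this page.

```python
import csv
import io


def read_and_rename(tsv_text, column_map):
    """Keep only the columns named in column_map and rename them to its values."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    return [
        {new_name: row[old_name] for old_name, new_name in column_map.items()}
        for row in reader
    ]


# Made-up sample file content with four columns.
tsv_text = 'time\tx\ty\tpupil\n0\t10.5\t20.1\t3.2\n1\t10.7\t20.0\t3.1\n'

# Hypothetical mapping: read 'x' and 'y', rename them to 'x_pix' and 'y_pix'.
column_map = {'x': 'x_pix', 'y': 'y_pix'}

rows = read_and_rename(tsv_text, column_map)
print(rows)
```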

custom_read_kwargs

If specified, these keyword arguments will be passed to the file reading function.

Type: dict[str, Any], optional
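
The general pattern of forwarding keyword arguments to a reader can be illustrated as follows. This sketch uses `csv.reader` from the standard library as a stand-in for the file reading function; `load_rows` is a hypothetical helper and the `delimiter` setting is only an example of what such keyword arguments might contain.

```python
import csv
import io


def load_rows(text, custom_read_kwargs=None):
    """Read delimited text, forwarding any extra keyword arguments to csv.reader."""
    kwargs = custom_read_kwargs or {}
    return list(csv.reader(io.StringIO(text), **kwargs))


# Tab-separated input needs a custom delimiter.
tab_rows = load_rows('time\tx\ty\n0\t1\t2\n', {'delimiter': '\t'})

# Without custom keyword arguments the reader's defaults apply (comma-separated).
comma_rows = load_rows('a,b\n1,2\n')

print(tab_rows, comma_rows)
```

Keeping the reader options in a dictionary lets a dataset definition adjust parsing details without changing the loading code itself.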

Examples

Initialize your Dataset object with the PoTeC definition:

>>> import pymovements as pm
>>>
>>> dataset = pm.Dataset("PoTeC", path='data/PoTeC')

Download the dataset resources:

>>> dataset.download()

Load the data into memory:

>>> dataset.load()

__init__(name: str = 'PoTeC', mirrors: tuple[str, ...] = ('https://osf.io/download/',), resources: tuple[dict[str, str], ...] = ({'filename': 'PoTeC.zip', 'md5': 'cffd45039757c3777e2fd130e5d8a2ad', 'resource': 'tgd9q/'},), experiment: Experiment = <pymovements.gaze.experiment.Experiment object>, filename_format: str = 'reader{subject_id:d}_{text_id}_raw_data.tsv', filename_format_dtypes: dict[str, type] = <factory>, custom_read_kwargs: dict[str, Any] = <factory>, column_map: dict[str, str] = <factory>, trial_columns: list[str] = <factory>, time_column: str = 'time', time_unit: str = 'ms', pixel_columns: list[str] = <factory>, position_columns: list[str] | None = None, velocity_columns: list[str] | None = None, acceleration_columns: list[str] | None = None, distance_column: str | None = None) → None

Methods

__init__([name, mirrors, resources, ...])

Attributes

`acceleration_columns`
`distance_column`
`experiment`
`filename_format`
`mirrors`
`name`
`pixel_columns`
`position_columns`
`resources`
`time_column`
`time_unit`
`trial_columns`
`velocity_columns`
`filename_format_dtypes`
`custom_read_kwargs`
`column_map`