CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading

Eye movement recordings of 69 native English speakers and 296 English learners reading Wall Street Journal (WSJ) newswire sentences. Each participant reads 156 sentences: 78 sentences shared across participants and 78 unique to each participant.

Example

Obtaining the Eye-tracking Data

The eye-tracking data is not made directly available due to licensing restrictions of the Penn Treebank (PTB) and BLLIP datasets, from which the reading materials are drawn. To obtain the data with the underlying texts, please follow these instructions (requires Python 3).

1. Obtain the PTB-WSJ and BLLIP corpora through LDC.
   - Copy the README file of the PTB-WSJ (starts with "This is the Penn Treebank Project: Release 2 ...") to the folder ptb_bllip_readmes/.
   - Copy the README.1st file of BLLIP (starts with "File: README.1st ...") to the folder ptb_bllip_readmes/.
2. Run python obtain_data.py. This will download a zipped data_v2.0/ data folder. Extract it to the top level of this directory.
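
Before running obtain_data.py, it can help to confirm that both license READMEs are in place. The following is a minimal sketch, not part of this repository: the function name `missing_readmes` is hypothetical, and it identifies each README only by the opening text quoted in the steps above. How obtain_data.py itself validates the files is not described here.

```python
from pathlib import Path

# Opening text of each README, as quoted in the instructions above.
EXPECTED_OPENINGS = {
    "PTB-WSJ": "This is the Penn Treebank Project: Release 2",
    "BLLIP": "File: README.1st",
}


def missing_readmes(folder="ptb_bllip_readmes"):
    """Return the corpora whose README was not found in `folder` (hypothetical helper)."""
    found = set()
    folder = Path(folder)
    if folder.is_dir():
        for path in folder.iterdir():
            if not path.is_file():
                continue
            # Only the start of the file is needed; tolerate odd encodings.
            with path.open(encoding="utf-8", errors="replace") as handle:
                head = handle.read(500).lstrip()
            for corpus, opening in EXPECTED_OPENINGS.items():
                if head.startswith(opening):
                    found.add(corpus)
    return sorted(set(EXPECTED_OPENINGS) - found)


if __name__ == "__main__":
    missing = missing_readmes()
    if missing:
        print("Missing README for:", ", ".join(missing))
    else:
        print("Both READMEs found; ready to run obtain_data.py")
```

The check matches on file content rather than file name, so it works regardless of what the copied files are called.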

Statistics (v2.0)

	Participants	Sentences	Words
Native	69	5,460	61,272
ESL	296	23,166	260,888
Total	365	28,548	321,260
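
The Total row for sentences is not the sum of the Native and ESL rows. One reading that reproduces all three figures, offered here as an interpretation rather than a statement from the authors, is that the column counts distinct sentences: the 78 shared sentences once per group, plus 78 unique sentences per participant. A short check of that arithmetic:

```python
SHARED = 78             # sentences read by every participant
UNIQUE_PER_READER = 78  # sentences read by a single participant only


def distinct_sentences(n_participants):
    """Distinct sentences for a group: the shared set once, plus each reader's unique set."""
    return SHARED + UNIQUE_PER_READER * n_participants


groups = {"Native": 69, "ESL": 296, "Total": 365}
counts = {name: distinct_sentences(n) for name, n in groups.items()}
print(counts)  # {'Native': 5460, 'ESL': 23166, 'Total': 28548}
```

Under this reading, the Native and ESL rows overlap by the 78 shared sentences, so 5,460 + 23,166 - 78 = 28,548.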

Directory Structure

data_[version]/

SR DataViewer Interest Area and Fixation Reports, and syntactic annotations.

sent_ia.tsv Interest Area report.
sent_fix.tsv Fixations report.
annotations/ Syntactic annotations.
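
As a quick sanity check after extraction, the reports can be inspected with the Python standard library. This is a minimal sketch that assumes the .tsv files are tab-separated with a single header row and UTF-8 encoded; the function name `summarize_report` is hypothetical, and the column names are not listed here (see the Eyetracking Variables documentation).

```python
import csv


def summarize_report(path, encoding="utf-8"):
    """Return (column names, number of data rows) for a tab-separated report."""
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader, [])          # first row holds the column names
        n_rows = sum(1 for _ in reader)    # stream the rest without loading it all
    return header, n_rows


if __name__ == "__main__":
    import os

    # Assumed location after extracting the download to the top level.
    report = os.path.join("data_v2.0", "sent_ia.tsv")
    if os.path.exists(report):
        columns, n_rows = summarize_report(report)
        print(f"{len(columns)} columns, {n_rows} rows")
```

Rows are streamed rather than loaded into memory, since the fixation report in particular may be large.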

participant_metadata/

metadata.tsv metadata on participants.
languages.tsv information on languages spoken besides English.
test_scores/
- test_conversion.tsv unofficial conversion table between standardized proficiency tests (used to convert TOEIC to TOEFL scores).
- michigan-cefr.tsv conversion table between form B and the newer forms D/E/F, as well as to CEFR levels.
- michigan/ item-level responses for the Michigan Placement Test (MPT).
- comprehension/ item-level responses for the reading comprehension questions during the eye-tracking experiment.

splits/

Trial and participant splits.

trials/
- all_trials.txt trial numbers for all the sentences (1-157).
- shared_trials.txt trial numbers of the Shared Text regime.
- individual_trials.txt trial numbers of the Individual Text regime.
participants/[version]/
- random_order.csv random participant order.
- train.csv train participants.
- test.csv test participants.
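
The trial lists can be used to separate Shared Text from Individual Text trials. The sketch below assumes each file in splits/trials/ holds one integer trial number per line, which is not confirmed here; the helper names `read_trials` and `check_trial_splits` are hypothetical. It reports whether the two regimes are disjoint and whether together they cover all_trials.txt.

```python
from pathlib import Path


def read_trials(path):
    """Read trial numbers, assuming one integer per non-empty line."""
    text = Path(path).read_text(encoding="utf-8")
    return {int(line) for line in text.splitlines() if line.strip()}


def check_trial_splits(trials_dir="splits/trials"):
    """Compare the shared and individual trial lists against the full list."""
    trials_dir = Path(trials_dir)
    all_trials = read_trials(trials_dir / "all_trials.txt")
    shared = read_trials(trials_dir / "shared_trials.txt")
    individual = read_trials(trials_dir / "individual_trials.txt")
    return {
        "n_shared": len(shared),
        "n_individual": len(individual),
        "disjoint": shared.isdisjoint(individual),
        "covers_all": (shared | individual) == all_trials,
    }
```

The resulting sets can then be used to filter rows of the reports by trial number, once the relevant column has been identified in the documentation.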

dataset_analyses.Rmd

Analyses for the paper "CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading". Note that this script requires:

- CELER (in the folder data_v2.0/), and
- GECO Augmented (in the folder geco/). Download GECO augmented with frequency and surprisal values and place geco/ at the top level of this directory.

Documentation

- Eyetracking Variables: Description of the variables in the fixations and interest area reports.
- Metadata Variables: Description of the variables in the participants metadata and languages files.
- Language Models: Details on language models for surprisal values.
- Syntactic Annotations: Details on syntactic annotations (POS, phrase structure trees, dependency trees).
- GECO Augmented: Details on new fields added to GECO.
- Experiment Builder Programs: Information on the EB experiment.
- Known Issues: Known issues with the dataset.

Citation

Paper: CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading

@article{celer2022,
    author = {Berzak, Yevgeni and Nakamura, Chie and Smith, Amelia and Weng, Emily and Katz, Boris and Flynn, Suzanne and Levy, Roger},
    title = "{CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading}",
    journal = {Open Mind},
    pages = {1-10},
    year = {2022},
    month = {04},
    issn = {2470-2986},
    doi = {10.1162/opmi_a_00054},
    url = {https://doi.org/10.1162/opmi\_a\_00054},
    eprint = {https://direct.mit.edu/opmi/article-pdf/doi/10.1162/opmi\_a\_00054/2012324/opmi\_a\_00054.pdf},
}

License

This work, with the exception of the underlying PTB-WSJ and BLLIP texts, is licensed under a Creative Commons Attribution 4.0 International License.
