CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading

Eye movement recordings of 69 native English speakers and 296 English learners reading Wall Street Journal (WSJ) newswire sentences. Each participant reads 156 sentences: 78 sentences shared across participants and 78 unique to each participant.

Table of Contents

  1. Obtaining the Eye-tracking Data
  2. Statistics
  3. Directory Structure
  4. Additional Documentation
  5. Citation

Obtaining the Eye-tracking Data

The eyetracking data is not made directly available due to licensing restrictions of the Penn Treebank (PTB) and BLLIP datasets, from which the reading materials are drawn. To obtain the data with the underlying texts, please follow these instructions (requires Python 3).

  1. Obtain the PTB-WSJ and BLLIP corpora through LDC.
    • Copy the README file of the PTB-WSJ (starts with "This is the Penn Treebank Project: Release 2 ...") to the folder ptb_bllip_readmes/.
    • Copy the README.1st file of BLLIP (starts with "File: README.1st ...") to the folder ptb_bllip_readmes/.
  2. Run python obtain_data.py. This downloads a zipped data_v2.0/ data folder. Extract it to the top level of this directory.
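
Before running step 2, it can help to confirm that both README files from step 1 are in place. The following is a minimal sketch, not part of the repository: the helper name missing_readmes is hypothetical, and it assumes the files keep their original names (README and README.1st) and that ptb_bllip_readmes/ sits at the top level of this directory.

```python
from pathlib import Path

# File names come from step 1 above. That obtain_data.py expects exactly these
# names is an assumption of this sketch.
REQUIRED_READMES = ("README", "README.1st")


def missing_readmes(folder: str = "ptb_bllip_readmes") -> list[str]:
    """Return the required README files that are not present in `folder`."""
    base = Path(folder)
    return [name for name in REQUIRED_READMES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_readmes()
    if missing:
        print("Missing from ptb_bllip_readmes/:", ", ".join(missing))
    else:
        print("Both README files found; run: python obtain_data.py")
```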

Statistics

          Participants   Sentences     Words
Native              69       5,460    61,272
ESL                296      23,166   260,888
Total              365      28,548   321,260
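
The sentence counts in each row are consistent with counting distinct sentences, that is, the 78 shared sentences once plus 78 unique sentences per participant. This reading is an inference from the numbers rather than something stated here. A short check under that assumption:

```python
SHARED = 78      # sentences read by every participant
INDIVIDUAL = 78  # sentences unique to each participant


def distinct_sentences(n_participants: int) -> int:
    """Distinct sentences for a group, counting the shared set once (assumed)."""
    return SHARED + n_participants * INDIVIDUAL


for group, n in [("Native", 69), ("ESL", 296), ("Total", 365)]:
    print(group, distinct_sentences(n))
# Native 5460
# ESL 23166
# Total 28548
```

This also explains why the Total row is not the sum of the Native and ESL rows: the shared sentences would otherwise be counted twice.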

Directory Structure

data_[version]/

SR DataViewer Interest Area and Fixation Reports, and syntactic annotations.

  • sent_ia.tsv Interest Area report.
  • sent_fix.tsv Fixations report.
  • annotations/ Syntactic annotations.
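
As a starting point for working with the reports, the sketch below reads a tab-separated file into a list of dictionaries using only the Python standard library. The helper name read_report is hypothetical, UTF-8 encoding is an assumption, and no column names are assumed: the script only prints whatever header the file contains.

```python
import csv
from pathlib import Path


def read_report(path):
    """Read a tab-separated report into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


if __name__ == "__main__":
    # Path assumes the data was extracted to the top level of this directory.
    report = Path("data_v2.0") / "sent_ia.tsv"
    if report.exists():
        rows = read_report(report)
        columns = list(rows[0].keys()) if rows else []
        print(len(rows), "rows; columns:", columns)
```

This loads the whole file into memory. For large reports, iterating over the csv.DictReader directly may be preferable.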

participant_metadata/

  • metadata.tsv metadata on participants.
  • languages.tsv information on languages spoken besides English.
  • test_scores/
    • test_conversion.tsv unofficial conversion table between standardized proficiency tests (used to convert TOEIC to TOEFL scores).
    • michigan-cefr.tsv conversion table between form B and the newer forms D/E/F, as well as to CEFR levels.
    • michigan/ item level responses for the Michigan Placement Test (MPT).
    • comprehension/ item level responses for the reading comprehension during the eyetracking experiment.

splits/

Trial and participant splits.

  • trials/
    • all_trials.txt trial numbers for all the sentences (1-157).
    • shared_trials.txt trial numbers of the Shared Text regime.
    • individual_trials.txt trial numbers of the Individual Text regime.
  • participants/[version]/
    • random_order.csv random participant order.
    • train.csv train participants.
    • test.csv test participants.
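
The trial lists can be compared with a few lines of Python. This is a sketch under an assumption about the file format, which is not documented here: each .txt file is taken to contain integer trial numbers separated by whitespace, commas, or newlines. Both helper names are hypothetical.

```python
import re
from pathlib import Path


def read_trial_numbers(path):
    """Return every integer found in a trial-list file (file format assumed)."""
    text = Path(path).read_text(encoding="utf-8")
    return [int(token) for token in re.findall(r"\d+", text)]


def unassigned_trials(all_trials, shared, individual):
    """Trials listed in `all_trials` that appear in neither regime."""
    return sorted(set(all_trials) - set(shared) - set(individual))


if __name__ == "__main__":
    base = Path("splits") / "trials"
    if base.is_dir():
        all_trials = read_trial_numbers(base / "all_trials.txt")
        shared = read_trial_numbers(base / "shared_trials.txt")
        individual = read_trial_numbers(base / "individual_trials.txt")
        print("Overlap:", sorted(set(shared) & set(individual)))
        print("Unassigned:", unassigned_trials(all_trials, shared, individual))
```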

dataset_analyses.Rmd

Analyses for the paper "CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading". Note that this script requires:

  • CELER (in the folder data_v2.0/), and
  • GECO Augmented (in the folder geco/). Download GECO augmented with frequency and surprisal values and place geco/ at the top level of this directory.

Additional Documentation

Paper: CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading

Citation

@article{celer2022,
    author = {Berzak, Yevgeni and Nakamura, Chie and Smith, Amelia and Weng, Emily and Katz, Boris and Flynn, Suzanne and Levy, Roger},
    title = "{CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading}",
    journal = {Open Mind},
    pages = {1-10},
    year = {2022},
    month = {04},
    issn = {2470-2986},
    doi = {10.1162/opmi_a_00054},
    url = {https://doi.org/10.1162/opmi\_a\_00054},
    eprint = {https://direct.mit.edu/opmi/article-pdf/doi/10.1162/opmi\_a\_00054/2012324/opmi\_a\_00054.pdf},
}

License

This work, with the exception of the underlying PTB-WSJ and BLLIP texts, is licensed under a Creative Commons Attribution 4.0 International License.
