Corpus-Specific Word Frequency Effects on L2 Lexical Processing: Evidence from Korean English Education Corpora
Abstract
This study investigated whether word frequency norms derived from Korean English education corpora (KE) better predict second language (L2) lexical processing than standard native-speaker norms (SUBTLEX-US). In Study 1, we constructed a KE corpus from CSAT examinations and textbooks (4,195 passages; 859,346 tokens; 1994–2025) and validated its frequency estimates against 19 prior psycholinguistic studies, confirming moderate low-frequency reliability despite asymmetric attenuation consistent with corpus-size predictions. In Study 2, 74 Korean university students completed a lexical decision task with 111 target words. A joint linear mixed-effects model (R lme4; Bates et al., 2015) revealed that KE Zipf frequency was a substantially stronger predictor of response latencies (−29.7 ms per SD; ΔAIC = 40.5) than SUBTLEX frequency (−4.6 ms per SD). Controlling for age of acquisition eliminated the SUBTLEX effect (p = .870) while the KE effect remained robust (p < .001), suggesting that SUBTLEX frequency shares variance with AoA whereas KE captures independent educational exposure information. The KE advantage was stable across proficiency levels and robust to controls for orthographic neighborhood density and contextual diversity. These findings suggest that L2 lexical processing research may benefit from frequency norms that approximate learners’ actual input rather than relying on L1-derived norms.
Keywords: word frequency, L2 lexical access, corpus linguistics, lexical decision task, Korean English education, CSAT (수능), age of acquisition, mixed-effects models
1. Introduction
1.1 The Frequency Effect in Lexical Processing
Word frequency is among the most robust predictors of lexical processing speed. High-frequency words are recognized faster and more accurately than low-frequency words — a phenomenon known as the frequency effect (Forster & Chambers, 1973; Murray & Forster, 2004). This effect is observed across a wide range of tasks including lexical decision, naming, eye-tracking during reading, and word production (Balota et al., 2004; Brysbaert & New, 2009). The frequency effect is typically attributed to the cumulative strengthening of lexical representations through repeated exposure: words encountered more often develop stronger memory traces, leading to faster retrieval (Morton, 1969; Forster, 1976).
The magnitude of the frequency effect depends critically on the quality of the frequency estimates used. Early studies relied on corpora such as Kučera and Francis (1967), but more recent work has demonstrated that subtitle-based frequency norms (e.g., SUBTLEX-US; Brysbaert & New, 2009) outperform traditional written corpora in predicting behavioral data. In an updated review, Brysbaert, Mandera, and Keuleers (2018) consolidated evidence that word frequency remains the single strongest predictor of visual word recognition, while highlighting that the predictive advantage of subtitle-based norms may partly reflect their alignment with spoken-language exposure rather than any inherent superiority of the subtitle genre. In addition, Brysbaert, Mandera, and Keuleers (2019) introduced word prevalence — the proportion of people who know a word — as a complementary measure, showing that prevalence explains unique variance beyond frequency and captures individual differences in vocabulary knowledge. The HAL (Hyperspace Analogue to Language; Lund & Burgess, 1996) corpus, derived from Usenet newsgroup postings, represents another widely used frequency norm in psycholinguistic research. These norms are available through the English Lexicon Project (ELP; Balota et al., 2007), which provides a comprehensive database of lexical decision and naming latencies for over 40,000 words.
1.2 The L2 Frequency Problem
Despite the centrality of word frequency in lexical processing research, a fundamental methodological challenge exists in second language (L2) research: the frequency norms commonly used (SUBTLEX-US, HAL, COCA) are based on native speaker (L1) language exposure. L2 learners, however, have qualitatively and quantitatively different exposure histories. For Korean learners of English, the vast majority of English exposure occurs through formal education — textbooks, standardized tests, and classroom instruction — rather than through naturalistic immersion in English-speaking environments (Baek et al., 2023).
However, constructing education-specific corpora necessarily yields smaller datasets than general-purpose corpora, raising concerns about the reliability of frequency estimates — particularly for low-frequency words, where sampling noise is greatest (Brysbaert & New, 2009). Addressing this reliability concern is therefore a prerequisite for interpreting any behavioral comparison between corpus-derived frequency norms.
It should also be noted that formal education is not the sole source of L2 input. Korean learners increasingly encounter English through out-of-school channels such as YouTube, streaming media, and online gaming, and the contribution of such informal exposure to vocabulary knowledge has been documented in comparable EFL populations (De Wilde et al., 2020). The present study focuses on formal educational input because it is the most systematic, documentable, and broadly shared source of English exposure for Korean secondary students, while acknowledging that informal exposure may contribute additional variance.
This discrepancy raises a critical question: if frequency norms reflect exposure patterns, and L2 learners have fundamentally different exposure from L1 speakers, then should L2 lexical processing be modeled using exposure-matched frequency norms rather than standard L1-derived norms?
1.3 Korean English Education Context
The Korean College Scholastic Ability Test (CSAT, 수능) is a high-stakes standardized examination that has shaped English education in South Korea since 1994. The English section, comprising listening and reading comprehension, represents a de facto standard for the level and type of English input that Korean students encounter throughout their secondary education. Together with authorized English textbooks, the CSAT constitutes a substantial — though not exhaustive — proportion of the formal English exposure for Korean secondary learners, as supplementary materials and private tutoring also contribute to the input landscape.
This educational context provides a unique opportunity: by constructing a comprehensive corpus from CSAT examinations and English textbooks, we can derive frequency norms that approximate the actual English input received by Korean L2 learners. If L2 lexical representations are shaped by educational exposure, then these education-derived frequency norms should better predict L2 lexical processing than norms derived from native English language environments.
1.4 Research Questions
The present study addresses two research questions:
- RQ1a (Corpus characterization): What are the distributional properties of the KE corpus, and how do its word frequencies compare with those of naturalistic corpora (SUBTLEX-US, HAL)?
- RQ1b (Frequency reliability): How reliable are KE frequency estimates across the frequency spectrum, given the corpus’s relatively small size?
- RQ2: Which frequency source better predicts Korean L2 learners’ lexical decision latencies — KE frequency or SUBTLEX-US frequency — and does this advantage persist after controlling for age of acquisition, contextual diversity, and orthographic variables?
To address these questions, we conducted two interconnected studies: (1) corpus construction, characterization, and cross-corpus frequency analysis, and (2) a behavioral lexical decision experiment with Korean university students accompanied by comprehensive statistical modeling.
1.5 Research Overview
```mermaid
flowchart TD
    A["Corpus Construction"] --> B["Frequency Analysis"]
    B --> C["Stimuli Selection"]
    C --> D["LDT Experiment"]
    D --> E["Statistical Analysis"]
    style A fill:#4A90D9,color:#fff
    style B fill:#5BA55B,color:#fff
    style C fill:#D4A843,color:#fff
    style D fill:#D96A4A,color:#fff
    style E fill:#9B59B6,color:#fff
```
2. Related Work
2.1 Frequency Norms and L2 Lexical Processing
The role of word frequency in L2 lexical access has been well documented, though methodological debates persist regarding which frequency norms best predict L2 performance. Chen, Dong, and Yu (2018) provided a comprehensive evaluation of 17 frequency norms against L2 lexical decision data, finding that subtitle-based norms (SUBTLEX) generally outperformed text-based corpora for L1 speakers, but that the optimal frequency source for L2 populations remained an open question. More recently, Haeuser and Kray (2025) challenged the assumed superiority of SUBTLEX norms by showing that, for German sentential reading, a text-based frequency database (dlexDB) outperformed SUBTLEX-DE in the majority of models, suggesting that SUBTLEX superiority may be language- and task-dependent rather than universal. Diependaele, Lemhöfer, and Brysbaert (2013) demonstrated robust frequency effects in L2 visual word recognition, with L2 readers showing larger frequency effects than L1 readers, highlighting the importance of frequency estimation accuracy for L2 research.
Cop, Keuleers, Drieghe, and Duyck (2015) examined frequency effects during natural reading of an entire novel and found that bilinguals showed a considerably larger frequency effect in their L2 than in their L1, with the effect decreasing as a function of L1 proficiency. Their results are consistent with an integrated mental lexicon in which lexical entrenchment is determined by cumulative exposure. The present study extends this line of research by asking whether frequency norms derived from learners’ actual educational input provide superior prediction compared to norms based on native speaker exposure. Kuperman and Van Dyke (2013) further demonstrated that corpus-based frequencies systematically overestimate the strength of lexical representations, particularly for low-frequency words and readers with smaller vocabularies, and that subjective frequency ratings predicted by reading experience explained unique variance beyond corpus counts. Their finding underscores the importance of selecting frequency norms that match the target population’s actual exposure — a principle that the present study applies to L2 educational contexts.
2.2 Population-Specific Frequency Norms
A growing body of work suggests that frequency norms tailored to specific populations can outperform general-purpose norms. Korochkina, Birchenough, Dawson, and Sheriston (2024) developed CYP-LEX, a large-scale lexical database (70M+ tokens, 105,000+ types) derived from children’s and young adult literature, demonstrating that age-appropriate frequency norms explained variance in children’s word recognition beyond what adult-derived norms (e.g., SUBTLEX-UK) could capture. Their work establishes the principle that the source of frequency estimates matters — a principle the present study extends to L2 educational contexts. More recently, Nohejl, Vít, Kocmi, and Bojar (2024) challenged the dominance of film-subtitle corpora by showing that YouTube subtitle norms rivaled or outperformed SUBTLEX-based norms in predicting lexical decision and word naming across five languages. Their finding that domain-specific subtitle corpora (e.g., YouTube) can approximate spoken vocabulary more accurately than film subtitles reinforces the broader point that matching the frequency source to the target population’s input improves behavioral prediction.
These studies converge on a theoretical point: frequency effects reflect cumulative exposure, and the predictive validity of any frequency norm depends on how well it approximates the exposure profile of the target population. The present study applies this principle to L2 learners, for whom the relevant exposure profile is dominated by formal educational input rather than naturalistic L1 environments.
2.3 Contextual Diversity and Frequency
Adelman, Brown, and Quesada (2006) demonstrated that contextual diversity (CD) — the number of distinct contexts in which a word appears — is a better predictor of lexical decision and naming times than raw frequency in L1 English. This finding raised the question of whether the frequency effect is truly driven by cumulative exposure or by the diversity of encounters. Hamrick and Pandza (2020) extended this investigation to L2, showing that CD effects are also present in L2 lexical processing. The present study addresses this issue by directly comparing frequency-based and CD-based models within the KE corpus framework.
2.4 Education-Specific Input and L2 Exposure
Recent research has highlighted the importance of education-specific input in L2 vocabulary development. Li, Wolter, Yang, and Siyanova-Chanturia (2025) demonstrated that textbook frequency, along with congruency and word class, affects L2 collocation processing by Chinese EFL learners, providing evidence that the educational register constitutes a distinct source of lexical input for formulaic language. De Wilde, Brysbaert, and Eyckmans (2020) showed that out-of-school exposure to English through media and gaming significantly predicted English vocabulary knowledge among Dutch-speaking adolescents, underscoring the importance of considering actual input sources when modeling L2 lexical knowledge.
In the Korean EFL context, Baek, Lee, and Choi (2023) demonstrated that individual differences in word-frequency effects — captured by lexical processing efficiency rather than overall proficiency scores — serve as a more sensitive index of L2 lexical quality in Korean learners’ English visual word recognition. Relevant to the CSAT context specifically, Murphy Odo (2023) analyzed vocabulary coverage of the Basic English Vocabulary List from the Korean National Curriculum in CSAT reading passages, finding that 6,000 BNC word families achieved approximately 95% coverage, which provides a benchmark for the vocabulary demands of the examination that shapes much of Korean secondary English education.
Unlike Li et al. (2025), who examined collocation-level processing, the present study targets single-word lexical access and constructs a comprehensive multi-source educational corpus rather than relying on textbook frequency alone. Building on these findings, we directly test whether education-derived frequency norms outperform standard norms in predicting L2 lexical processing.
3. Study 1: Corpus Construction and Characterization
3.1 Corpus Materials
Three primary corpora were constructed for this study:
| Corpus | Source | Articles | Sentences | Tokens | Period |
|---|---|---|---|---|---|
| CSAT Reading | CSAT/mock exam (수능/모의고사) reading passages | 2,317 | 17,069 | 339,606 | 1994–2025 |
| CSAT Listening | CSAT/mock exam (수능/모의고사) listening scripts | 1,234 | 20,466 | 166,987 | 1999–2025 |
| Textbook | Korean HS English textbooks | 644 | 23,108 | 352,753 | 2015–2022 revisions |
| KE Total | Combined | 4,195 | 60,643 | 859,346 | 1994–2025 |
The KE corpus yielded 22,510 unique word types across all three sub-corpora. The textbook sub-corpus was compiled from high school English textbooks authorized by the Korean Ministry of Education, spanning multiple publishers and curriculum revisions (see Appendix A for the full textbook list and selection criteria).
Terminology note. Throughout this paper, KE (Korean English) refers to the combined frequency computed across all three sub-corpora (CSAT Reading + Listening + Textbook). KF (Korean English Frequency) refers to frequencies computed separately for the textbook and CSAT components and then compared. In cross-corpus analyses (Table 1), CSAT_RFreq refers to CSAT-specific relative frequency, while RFreq_KF refers to the textbook-component frequency. The combined KE frequency is used in all behavioral analyses (Study 2).
Two external reference corpora were used for comparison:
- SUBTLEX-US (Brysbaert & New, 2009): Subtitle-based frequency norms from approximately 51 million words of American English subtitles
- HAL (Lund & Burgess, 1996): Hyperspace Analogue to Language corpus derived from approximately 131 million words of Usenet newsgroup text
3.2 Preprocessing Pipeline
```mermaid
flowchart TD
    A["Raw Corpus (XLSX files)"] --> B["Metadata Extraction (Year, Source, Type)"]
    B --> C["Sentence Splitting (NLTK sent_tokenize)"]
    C --> D["Custom Tokenization (Contraction/Possessive Preservation)"]
    D --> E["POS Tagging (Penn Treebank)"]
    E --> F["Frequency Calculation (Raw → Zipf Scale)"]
    F --> G["Cross-Corpus Merging (KE ∩ ELP ∩ HAL ∩ SUBTLEX)"]
    style A fill:#e8e8e8,color:#333
    style D fill:#4A90D9,color:#fff
    style F fill:#5BA55B,color:#fff
    style G fill:#D4A843,color:#fff
```
A custom tokenizer was developed to preserve linguistically meaningful units that standard tokenizers (e.g., NLTK’s word_tokenize) would split incorrectly. Specifically, the tokenizer preserves contractions (I’m, don’t, won’t) and possessives (John’s, teacher’s) as single tokens, filters non-alphabetic tokens, and splits sentence-final periods from preceding words. The tokenizer employs a regex-based protection mechanism that identifies and shields target patterns before applying TreebankWordTokenizer, then restores the protected forms. Preprocessing was implemented using a custom pipeline (available at the project repository).
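For illustration, the tokenizer's target behavior can be approximated with a single regular expression. This is a simplified, standard-library-only sketch (the function name `simple_tokenize` is ours); the actual pipeline instead wraps NLTK's TreebankWordTokenizer in the protect-and-restore mechanism described above.

```python
import re

def simple_tokenize(sentence):
    """Regex-only illustration of the tokenizer's goal: keep
    contractions (don't, I'm) and possessives (John's) as single
    tokens and drop non-alphabetic material."""
    # A token is a run of letters, optionally followed by one apostrophe part.
    return [tok.lower() for tok in
            re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence)]

simple_tokenize("I'm sure John's teacher won't mind.")
# → ["i'm", 'sure', "john's", 'teacher', "won't", 'mind']
```

Note that a default tokenizer such as NLTK's `word_tokenize` would instead split these forms into pairs like `do` + `n't`, inflating function-word counts and fragmenting possessives.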
Preprocessing proceeded through type-specific handling for CSAT (type='test') and textbook (type='textbook') corpora. For CSAT listening passages, speaker gender information (M/W markers) was extracted and removed from the text. Frequency values were converted to the Zipf scale (van Heuven et al., 2014), defined as log₁₀(frequency per million) + 3, which provides an intuitive 1–7 scale where everyday words score 4–7 and rare words score 1–3.
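The Zipf conversion is a one-line transformation; the following sketch (function name ours) applies the van Heuven et al. (2014) definition to a raw count:

```python
import math

def zipf_scale(raw_count, corpus_tokens):
    """Zipf value = log10(frequency per million tokens) + 3
    (van Heuven et al., 2014)."""
    per_million = raw_count / corpus_tokens * 1_000_000
    return math.log10(per_million) + 3

# The stimulus-set median raw count (121) in the ~860K-token KE corpus:
zipf_scale(121, 859_346)  # ≈ 5.15
```

A word occurring exactly once per million tokens thus receives a Zipf value of 3.0, the conventional boundary between rare and common words.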
3.3 Corpus Characterization
3.3.1 Distributional Properties
The KE corpus was verified to follow Zipf’s law, with rank-frequency distributions showing the characteristic power-law relationship across all three sub-corpora. Zipf-Mandelbrot model fitting confirmed α ≈ 1, consistent with natural language distributions (Piantadosi, 2014).

Figure 1. Zipf’s law rank-frequency distributions for the three KE sub-corpora (Reading, Listening, Textbook). Log-log linearity confirms the expected power-law relationship (α ≈ 1).
Vocabulary growth was assessed using Heaps’ law (V(N) = K × N^β), which showed that vocabulary size continued to grow as a function of corpus size, indicating that the corpus has not reached vocabulary saturation.

Figure 2. Heaps’ law vocabulary growth curves for the KE sub-corpora, showing vocabulary size as a function of cumulative tokens. The continued upward trajectory confirms that the corpus has not reached vocabulary saturation.
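Since Heaps' law is linear in log-log space (log V = log K + β log N), its parameters can be estimated by ordinary least squares on log-transformed growth data. The sketch below (function name ours, synthetic data) illustrates the fitting logic under that assumption; it is not the authors' fitting code.

```python
import math

def fit_heaps(tokens_seen, vocab_sizes):
    """Estimate Heaps'-law parameters (V = K * N**beta) by ordinary
    least squares in log-log space."""
    xs = [math.log(n) for n in tokens_seen]
    ys = [math.log(v) for v in vocab_sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))   # OLS slope
    K = math.exp(my - beta * mx)                # back-transformed intercept
    return K, beta

# A synthetic growth curve generated with K=10, beta=0.55 is recovered exactly:
N = [1_000, 5_000, 10_000, 50_000, 100_000]
V = [10 * n ** 0.55 for n in N]
K, beta = fit_heaps(N, V)
```

A fitted β well below 1 with a still-rising curve is what licenses the conclusion that the corpus has not reached vocabulary saturation.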
Lexical diversity was assessed using type-token ratio (TTR), corrected TTR (CTTR), and moving-average TTR (MATTR; computed with a 1,000-token sliding window). The textbook sub-corpus showed higher lexical diversity than the CSAT sub-corpora, consistent with the greater topical range of textbook materials.

Figure 3. Lexical diversity metrics (TTR, CTTR, MATTR) across the three KE sub-corpora. The textbook sub-corpus shows higher diversity, consistent with its broader topical range.
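MATTR addresses the text-length sensitivity of raw TTR by averaging the TTR of every fixed-size sliding window. A minimal sketch of the computation (function name ours; the naive loop is O(n·w) and real implementations update type counts incrementally):

```python
def mattr(tokens, window=1000):
    """Moving-average type-token ratio: mean TTR over all
    sliding windows of the given size. Falls back to plain TTR
    for texts shorter than one window."""
    if len(tokens) < window:
        return len(set(tokens)) / len(tokens)
    ratios = [len(set(tokens[i:i + window])) / window
              for i in range(len(tokens) - window + 1)]
    return sum(ratios) / len(ratios)
```

With a 1,000-token window, as used here, every sub-corpus is scored on windows of identical length, making the diversity values comparable despite the sub-corpora's different sizes.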
3.3.2 Sub-corpus Comparison
POS distributions were compared across the three sub-corpora using chi-square tests. The listening corpus showed a higher proportion of pronouns and auxiliary verbs, consistent with its conversational register, whereas reading and textbook corpora were characterized by higher proportions of nouns and adjectives, reflecting their expository nature.

Figure 4. Part-of-speech distribution across the three KE sub-corpora. The listening corpus shows higher proportions of pronouns and auxiliaries (conversational register), while reading and textbook corpora are characterized by more nouns and adjectives (expository register).
Listening passages contained balanced speaker distributions (M: 10,132 sentences; W: 10,147 sentences; monologue: 187 sentences).
Sentence length differed across sub-corpora, with reading passages showing the longest average sentence length, followed by textbook and listening passages. Temporal trend analysis of CSAT passages (1994–2025) revealed gradual increases in passage complexity over time.

Figure 5. Temporal trends in CSAT passage characteristics (1994–2025), showing gradual increases in lexical complexity measures over the 31-year examination period.
3.4 Cross-Corpus Frequency Comparison
3.4.1 Correlation Analysis
Cross-corpus frequency correlations were computed using Pearson, Spearman, and Kendall methods on relative frequency values.
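The two key methods can be sketched from first principles: Spearman's ρ is simply Pearson's r computed on average ranks. The stdlib-only functions below (names ours; Kendall omitted for brevity) illustrate the computations, which in practice would be run with a standard statistics library.

```python
def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1                      # extend over a run of ties
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation = Pearson on the rank-transformed data."""
    return pearson(ranks(x), ranks(y))
```

The rank transform explains why Spearman values in Table 1 can greatly exceed their Pearson counterparts in Table 2: rank order can agree almost perfectly even when the magnitudes of relative frequencies diverge.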
Table 1. Spearman Rank Correlation Matrix (Relative Frequency)
| | RFreq_KF | CSAT_RFreq | SUBTLRWF | RFreq_HAL |
|---|---|---|---|---|
| RFreq_KF | 1.000 | .954 | .657 | .955 |
| CSAT_RFreq | .954 | 1.000 | .818 | .989 |
| SUBTLRWF | .657 | .818 | 1.000 | .799 |
| RFreq_HAL | .955 | .989 | .799 | 1.000 |
Note. KF = Korean English Frequency (here the textbook-component relative frequency; see the terminology note in Section 3.1); CSAT = CSAT-specific frequency; SUBTLRWF = SUBTLEX relative word frequency; HAL = HAL relative frequency. All correlations significant at p < .001.
Table 2. Pearson Correlation Matrix (Relative Frequency)
| | RFreq_KF | CSAT_RFreq | SUBTLRWF | RFreq_HAL |
|---|---|---|---|---|
| RFreq_KF | 1.000 | .689 | .668 | .737 |
| CSAT_RFreq | .689 | 1.000 | .678 | .730 |
| SUBTLRWF | .668 | .678 | 1.000 | .802 |
| RFreq_HAL | .737 | .730 | .802 | 1.000 |
The remarkably high Spearman correlation between CSAT and HAL (ρ = .989) is notable given the vast difference in corpus size (~860K vs. ~131M words) and genre. This suggests that the rank-order of word frequencies is largely preserved across educational and naturalistic English corpora, consistent with the well-known stability of Zipf’s law across genres (Piantadosi, 2014). However, SUBTLEX-US shows notably lower correlations with both KE-derived measures (ρ = .657–.818), indicating that subtitle-based norms capture different aspects of word frequency distributions — likely reflecting the predominance of informal, conversational language in subtitles versus the academic English found in educational materials.
Interpretive note on KE-HAL convergence. The high KE-HAL rank correlation raises the question of whether KE frequency is simply a proxy for general lexical difficulty rather than a measure of L2-specific educational exposure. We address this concern in two ways. First, we note that the high Spearman correlation reflects rank-order agreement, whereas the lower Pearson correlation (KE-HAL: r = .730) indicates that the magnitudes of frequency differences diverge, particularly for words at the extremes. Second, the behavioral analyses in Study 2 provide a direct test: if KE frequency were merely a proxy for general word difficulty, it should not outperform HAL or other large-corpus norms that also capture general difficulty. The AoA mediation analysis (Section 4.6.5) further disentangles these possibilities by showing that KE captures variance independent of AoA, whereas SUBTLEX does not — a dissociation that would not be expected if KE were purely a general-difficulty measure.
3.4.2 Distribution Analysis (Kruskal-Wallis)
Non-parametric Kruskal-Wallis H tests compared Zipf frequency distributions across three sources (Zipf_KE, Zipf_KF, Zipf_SUBTLEX) within high-frequency (HF) and low-frequency (LF) word bands.
High-Frequency Words:
- H(2) = 14.60, p < .001, ε² = .145
- Dunn’s post-hoc: KF vs. SUBTLEX (p < .001); KE vs. SUBTLEX (p = .011); KE vs. KF (p = 1.000, n.s.)
Low-Frequency Words:
- H(2) = 1.49, p = .476, n.s.
This pattern indicates that frequency discrepancies between education-derived and naturalistic corpora are most pronounced for high-frequency words — precisely where the frequency effect is most consequential for lexical processing.
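The H statistic underlying these tests is computed from pooled ranks. The sketch below (function name ours; tie correction and the chi-square p-value omitted for brevity) shows the computation, which in practice would be delegated to a statistics package:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic on pooled average ranks
    (no tie correction)."""
    pooled = [v for g in groups for v in g]
    order = sorted(range(len(pooled)), key=lambda i: pooled[i])
    rank = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):          # assign average ranks, handling ties
        j = i
        while j + 1 < len(pooled) and pooled[order[j + 1]] == pooled[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            rank[order[k]] = avg
        i = j + 1
    N = len(pooled)
    h, idx = 0.0, 0
    for g in groups:                # sum of squared rank-sums / group size
        h += sum(rank[idx:idx + len(g)]) ** 2 / len(g)
        idx += len(g)
    return 12 / (N * (N + 1)) * h - 3 * (N + 1)
```

Under the null hypothesis, H is approximately chi-square distributed with k − 1 degrees of freedom (here df = 2 for the three frequency sources).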
3.4.3 Vocabulary Overlap and Coverage
Vocabulary overlap analysis revealed that the KE corpus and SUBTLEX-US share a substantial core vocabulary, while each corpus contains unique words reflecting its domain. KE-unique words (absent from SUBTLEX) tended to be longer, later-acquired, and more abstract, consistent with academic vocabulary.

Figure 8. Venn diagram showing vocabulary overlap between the KE corpus and SUBTLEX-US. The shared core vocabulary is substantial, while each corpus contains unique words reflecting its domain-specific register.
Rank comparison between KE and SUBTLEX showed that the largest discrepancies involved education-specific words (e.g., academic terms overrepresented in KE) and informal/colloquial words (overrepresented in SUBTLEX).

Figure 9. Frequency rank comparison between KE and SUBTLEX-US. Points deviating from the diagonal represent words with large rank discrepancies between the two corpora, highlighting systematic register differences.
Cumulative coverage analysis demonstrated that the top-ranked KE words provided efficient coverage of the KE corpus, comparable to SUBTLEX coverage efficiency despite the large difference in corpus size (860K vs. 51M tokens; cf. Burgess & Livesay, 1998, on corpus size and frequency stability).

Figure 10. Cumulative coverage curves showing the proportion of corpus tokens accounted for by the top-N most frequent words in KE and SUBTLEX-US. Despite the large size difference, KE achieves comparable coverage efficiency.
3.4.4 Frequency-Band Reliability
The KE corpus (~860K tokens) is substantially smaller than SUBTLEX-US (~51M) and HAL (~131M). Brysbaert and New (2009) demonstrated that frequency estimates from smaller corpora are disproportionately noisy for low-frequency (LF) words, because rare words may occur zero or few times in a limited sample. Brysbaert and Diependaele (2013) further showed that zero and near-zero frequencies create systematic distortions in frequency-based regression models, recommending evidence-based smoothing techniques — a concern particularly relevant for smaller domain-specific corpora such as KE. Before proceeding to behavioral validation, it is therefore important to assess whether KE LF estimates carry a reliable signal or are dominated by sampling noise.
To address this question, we compiled stimuli from 19 prior psycholinguistic studies that reported word-level frequency data, yielding 615 high-frequency (HF) and 772 low-frequency (LF) unique words after matching with the merged KE–ELP–SUBTLEX database. For each study, we computed Pearson correlations between Zipf-scale frequencies for three corpus pairs (KE–KF, KE–SUBTLEX, KF–SUBTLEX) separately within the HF and LF bands, and assessed the proportional drop from HF to LF agreement.
Table 3. Frequency-Band Reliability: Mean Pearson r Across 19 Prior Studies
| Corpus Pair | HF Mean r (SD) | LF Mean r (SD) | HF → LF Drop |
|---|---|---|---|
| KE–KF | .691 (.151) | .435 (.186) | 37.1% |
| KE–SUBTLEX | .600 (.209) | .495 (.166) | 17.6% |
| KF–SUBTLEX | .580 (.236) | .481 (.168) | 17.1% |
Note. KF = Korean English Frequency (counted separately from KE; see the terminology note in Section 3.1). HF/LF split based on each study's frequency classification.

Figure 6. Meta-analytic heatmap of Pearson r values for three corpus pairs (KE–KF, KE–SUBTLEX, KF–SUBTLEX) across 19 prior psycholinguistic studies, separated by frequency band (HF/LF). Color intensity reflects correlation magnitude; the systematic HF-to-LF attenuation is visible across all pairs.
The corpus pair with a small corpus on both sides (KE–KF) showed the steepest HF-to-LF attenuation (37.1%), approximately 2.1 times the drop observed for the pairs anchored by the larger SUBTLEX corpus (17.1–17.6%). This asymmetric pattern is precisely what Brysbaert and New's (2009) corpus-size account predicts: when both corpora are small, LF estimation noise compounds multiplicatively, producing a larger reliability decline than when one member of the pair is a large-sample corpus. At the study level, KE–SUBTLEX LF agreement (r = .495) exceeded KE–KF LF agreement (r = .435) in 11 of 19 studies, though this difference was not statistically significant (Wilcoxon signed-rank W = 66.0, p = .258, Cohen's d = 0.354) — a descriptive pattern consistent with the corpus-size prediction but lacking power given N = 19.

Figure 7. Forest plot showing corpus-pair frequency correlations (with 95% CIs) for each of the 19 prior studies. The systematic pattern of KE–KF correlations (left) falling below KE–SUBTLEX and KF–SUBTLEX correlations in the LF band illustrates the asymmetric attenuation predicted by corpus-size theory.
Crucially, the KE LF estimates are not pure noise: the mean LF correlation of .435–.495 indicates moderate cross-corpus agreement even at the low-frequency tail, and the attenuation pattern is systematic rather than random. Having established that KE frequency estimates show meaningful cross-corpus agreement across the frequency spectrum — with acknowledged limitations at the low-frequency tail — Study 2 tests whether these education-derived norms better predict actual L2 behavioral responses than SUBTLEX-US.
4. Study 2: Lexical Decision Task
4.1 Participants
This study was approved by the Institutional Review Board of [Institution Name] (IRB No. [XXX-XXXX-XXXX]). All participants provided informed consent prior to participation.
Seventy-six Korean university students initially participated in the study. All participants were native Korean speakers studying English as a second language. Two participants were excluded due to near-native proficiency (LexTALE scores exceeding M + 2SD), yielding a final sample of 74 participants (41 female, 33 male; age: M = 21.47, SD = 3.80, range = 18–47). English proficiency was assessed using the LexTALE test (Lemhöfer & Broersma, 2012; M = 64.47, SD = 7.37, range = 45–80).
4.2 Materials
4.2.1 Stimuli Selection
```mermaid
flowchart TD
    A["CSAT Corpus + Textbook Corpus"] --> B["KE Frequency Calculation"]
    C["ELP Database + HAL + SUBTLEX"] --> D["External Frequency Merging"]
    B --> E["Cross-Corpus Intersection (KE ∩ ELP ∩ HAL ∩ SUBTLEX)"]
    D --> E
    E --> F["POS Filtering (Nouns containing 'NN')"]
    F --> G["Frequency Rank Banding (Percentile: 0-100)"]
    G --> H["Stimuli Selection: 60 KE-focused + 60 SUBTLEX-focused"]
    H --> I["Final Stimuli: 111 target words (with 9 overlapping)"]
    J["ELP NonWord Database"] --> K["Nonword Matching (Length, BG_Sum, BG_Mean, Ortho_N)"]
    I --> K
    K --> L["Final Set: 111 words + 111 nonwords"]
    style A fill:#4A90D9,color:#fff
    style E fill:#D4A843,color:#fff
    style I fill:#D96A4A,color:#fff
    style L fill:#5BA55B,color:#fff
```
Target words were selected from the intersection of four corpora: KE, ELP, HAL, and SUBTLEX-US. The selection process proceeded as follows:
- Corpus intersection: Words present in all four corpora were identified
- POS filtering: Only words with noun POS tags (containing ‘NN’) were retained
- Frequency banding: Words were divided into percentile rank bands (0–10, 10–20, …, 90–100) based on both KE and SUBTLEX frequency rankings
- Stimuli selection: Two overlapping sets of 60 words each were selected:
  - KE-focused set (60 words): Words where KE frequency rank diverges most from SUBTLEX rank
  - SUBTLEX-focused set (60 words): Words where SUBTLEX frequency rank diverges most from KE rank
  - 9 words overlapped between the two sets, yielding 111 unique target words
- Balanced design: Each set contained 10 words per Zipf frequency band, with word length matched across bands (mean length ≈ 4.0–4.1 letters per band)
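The percentile-rank banding step can be sketched as follows. This is an illustrative implementation under our own naming (`percentile_bands`), not the selection script itself:

```python
def percentile_bands(freqs, n_bands=10):
    """Assign each word an equal-sized percentile rank band
    (0 = lowest band, n_bands - 1 = highest) based on its
    frequency ranking; ties are broken by list order."""
    order = sorted(range(len(freqs)), key=lambda i: freqs[i])
    bands = [0] * len(freqs)
    for rank, i in enumerate(order):
        bands[i] = min(n_bands - 1, rank * n_bands // len(freqs))
    return bands

percentile_bands([5, 1, 3, 2], n_bands=2)  # → [1, 0, 1, 0]
```

Running this once on KE ranks and once on SUBTLEX ranks yields the two band assignments from which rank-divergent words were drawn.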
Controlled variables: Word length, orthographic neighborhood density (Ortho_N), bigram frequency (BG_Sum, BG_Mean). Across the 51 KE-only and 51 SUBTLEX-only stimuli, all lexical variables were well-matched (all ps > .57).
KE raw frequency distribution: All 111 stimuli appeared at least once in the KE corpus (min = 1). The distribution of KE raw frequencies was right-skewed (Mdn = 121, M = 193.2, SD = 202.4, range = 1–901): 74 words (66.7%) had frequency ≥ 51, 25 words (22.5%) had frequency 11–50, 10 words (9.0%) had frequency 3–10, and only 2 words (1.8%) had frequency ≤ 2. The corresponding Zipf_KE values ranged from 3.36 to 6.01 (Mdn = 5.14, M = 5.02, SD = 0.63). The absence of zero-frequency items and the small proportion of very-low-frequency words (1.8%) indicate that the KE corpus provides non-trivial frequency information for virtually all experimental stimuli, although the two lowest-frequency items (Freq = 1 and 2) should be interpreted with caution given the corpus size (~860K tokens; cf. Brysbaert & Diependaele, 2013).
Note on subset analyses: Because the KE-focused and SUBTLEX-focused subsets were defined by frequency rank divergence, subset-level comparisons are inherently biased toward each set’s defining corpus. Accordingly, the full-set analysis (111 words) serves as the primary comparison; subset analyses are reported as exploratory.
Note on cognate/loanword status: Korean has a substantial inventory of English-derived loanwords (e.g., bus, computer), which may enjoy processing advantages due to cross-linguistic phonological similarity (Dijkstra & Van Heuven, 2002). The present stimulus set was not explicitly controlled for cognate or loanword status. However, the 111 target words were all common English nouns present in both KE and SUBTLEX, and the joint model controls for both frequency sources simultaneously, reducing the likelihood that cognate status systematically confounds the KE-SUBTLEX comparison. Future studies should include cognate status as an explicit covariate.
4.2.2 Nonwords
A set of 111 nonwords was selected from the ELP NonWord dataset, matched to the target words on:
- Length: t(220) = 0.14, p = .888
- BG_Sum: t(220) = 0.09, p = .929
- BG_Mean: t(220) = 0.24, p = .808
- Ortho_N: t(220) = 0.02, p = .986
All matching t-tests were non-significant, confirming successful matching.
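A minimal sketch of this matching check, using simulated length values in place of the actual ELP attributes (df = 111 + 111 − 2 = 220, matching the reported tests):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical length values for 111 words and 111 matched nonwords.
word_len = rng.normal(4.05, 0.9, 111)
nonword_len = rng.normal(4.05, 0.9, 111)

# Independent-samples t-test with pooled variance (Student's t), as implied
# by the reported degrees of freedom.
t, p = stats.ttest_ind(word_len, nonword_len)
```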
4.3 Apparatus and Procedure
The experiment was conducted online via PsychoPy (Peirce et al., 2019) deployed on the Pavlovia platform. Data collection took place between October and December 2025. Online data collection introduces timing variability (typically 10–60 ms additional noise) compared to laboratory settings (Anwyl-Irvine, Massonnié, Flitton, Kirkham, & Evershed, 2021; Bridges, Pitiot, MacAskill, & Peirce, 2020). Crucially, this timing noise is random with respect to our frequency predictors and therefore attenuates rather than inflates frequency effects, making our comparison conservative. However, the absolute effect-size estimates (e.g., 29.7 ms per SD) should be interpreted with this additional noise in mind.
4.3.1 LDT Trial Structure
```mermaid
sequenceDiagram
    participant Screen
    participant Participant
    Screen->>Participant: Fixation cross (+)<br/>500 ms
    Screen->>Participant: Blank screen<br/>200 ms
    Screen->>Participant: Target stimulus<br/>(word or nonword)
    Participant->>Screen: Key press<br/>(L = word, S = nonword)
    Screen->>Participant: Inter-trial interval<br/>500 ms
    Note over Screen,Participant: Next trial begins
```
4.3.2 Experiment Session Flow
```mermaid
flowchart TD
    A["Informed Consent"] --> B["LDT Instructions"]
    B --> C["Practice Trials<br/>(10 trials)"]
    C --> D["LDT Main Trials<br/>(222 trials with breaks)"]
    D --> E["Korean Reading Task (KRT)"]
    E --> F["LexTALE Proficiency Test"]
    F --> G["Demographics Questionnaire"]
    G --> H["Debriefing"]
    style A fill:#e8e8e8,color:#333
    style D fill:#D96A4A,color:#fff
    style F fill:#4A90D9,color:#fff
    style H fill:#5BA55B,color:#fff
```
Participants completed the session in the following order: (1) informed consent, (2) LDT task instructions, (3) 10 practice trials with feedback, (4) 222 main LDT trials (111 words + 111 nonwords) with periodic breaks, (5) Korean Reading Task (KRT), (6) LexTALE English proficiency test, and (7) demographic questionnaire. Trial order was randomized for each participant.
4.4 Data Processing
Response time (RT) data were preprocessed following standard procedures in lexical decision research (Balota et al., 2007; Keuleers et al., 2012).
4.4.1 RT Filtering and Outlier Removal
A multi-step filtering procedure was applied:
- Accuracy filtering: Incorrect trials were excluded, as error RTs do not reflect normal lexical access processes.
- Absolute RT bounds: Trials with RTs below 200 ms (anticipatory responses) or above 3000 ms (attentional lapses) were removed.
- Within-subject trimming: RTs beyond ±2.5 SD from each participant’s mean were excluded to remove individual-specific outliers.
- Participant exclusion: Participants with overall accuracy below 80% were excluded from analysis (n = 0 excluded; all participants met the criterion).
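The four steps above can be sketched in pandas; column names are illustrative:

```python
import pandas as pd

def filter_rts(df):
    """Sketch of the filtering pipeline. Assumes columns 'participant',
    'rt' (in ms), and 'correct' (bool)."""
    # 1. Accuracy filtering: drop error trials.
    df = df[df["correct"]]
    # 2. Absolute RT bounds: 200-3000 ms.
    df = df[df["rt"].between(200, 3000)]
    # 3. Within-subject trimming: keep RTs within +/-2.5 SD of each
    #    participant's own mean.
    z = df.groupby("participant")["rt"].transform(lambda x: (x - x.mean()) / x.std())
    df = df[z.abs() <= 2.5]
    # 4. Participant exclusion (accuracy < 80%) is checked on the unfiltered
    #    data in practice; omitted here for brevity.
    return df
```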
The final dataset comprised 7,568 word trials from 74 participants.
4.4.2 RT Transformation
- Log transformation: RT values were log₁₀-transformed to approximate normality (log_rt = log₁₀(RT)), consistent with standard practice in LDT research (Balota et al., 2007).
- Z-scoring: Zipf frequency values were z-scored within each stimulus set (Zipf_KE_z, Zipf_SUBTLEX_z) to enable comparison of standardized coefficients across predictors.
4.4.3 Covariate Selection Rationale
The following variables were included as fixed-effect covariates:
| Variable | Rationale | Source |
|---|---|---|
| Word length | Physical stimulus property; invariant across populations | Stimulus attribute |
| Age | Individual difference; affects processing speed | Participant survey |
| Trial number | Controls for fatigue/practice effects | Experiment log |
Note on Age of Acquisition (AoA): Available AoA norms (Kuperman et al., 2012) were collected from L1 English speakers rather than L2 learners. As Rodríguez-Cuadrado et al. (2022) noted, AoA norms have limitations when applied to L2 speakers. Despite these limitations, L1-normed AoA remains the best available proxy for word acquisition order in the absence of L2-specific AoA norms. AoA was therefore included in sensitivity analyses (see Section 4.6.3) rather than in the primary model, allowing us to examine its impact on the frequency comparison while acknowledging its limitations for L2 populations.
4.4.4 LexTALE Integration
Individual LexTALE scores (Lemhöfer & Broersma, 2012) were included as a proficiency measure in interaction models to examine whether the frequency effect varied with L2 proficiency.
4.5 Statistical Analysis
4.5.1 Primary Analysis: Joint Mixed-Effects Model
A joint linear mixed-effects model was fitted using R’s lme4 package (Bates, Mächler, Bolker, & Walker, 2015) with lmerTest (Kuznetsova, Brockhoff, & Christensen, 2017) for Satterthwaite-approximated degrees of freedom. Both KE and SUBTLEX frequency predictors were entered simultaneously, allowing them to compete for shared variance:
Fixed: log_rt ~ Zipf_KE_z + Zipf_SUBTLEX_z + Age + Length + trial_num
Random: (1 + Zipf_KE_z + Zipf_SUBTLEX_z | Participant) + (1 | Stimuli)
Method: ML (REML = FALSE, for AIC comparison)
This approach avoids the statistical artifacts associated with comparing separate models (see Supplementary Analysis) and provides a direct test of each predictor’s unique contribution when controlling for the other. The crossed random effects structure follows Barr, Levy, Scheepers, and Tily’s (2013) recommendation for psycholinguistic data with both participant and item variability. Model comparison was conducted using ΔAIC and ΔBIC, with AIC differences greater than 10 considered very strong evidence (Burnham & Anderson, 2002) and BIC reported as a more conservative criterion that penalizes model complexity more heavily. Effect sizes were computed by converting the log₁₀(RT) beta coefficients to millisecond changes per one standard deviation of frequency at the sample mean RT. All models converged without singular fits (optimizer: bobyqa, maxfun = 100,000).
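The coefficient-to-milliseconds conversion can be illustrated with one common back-transform; the paper's exact conversion may differ in detail, so the numbers below are illustrative rather than a reproduction of the reported values:

```python
def beta_to_ms(beta, mean_rt_ms):
    """Convert a log10(RT) regression coefficient to a millisecond change
    per 1 SD of the predictor, evaluated at the sample mean RT.
    One common back-transform: delta_ms = mean_rt * (10**beta - 1)."""
    return mean_rt_ms * (10 ** beta - 1)

beta_to_ms(-0.02, 600.0)  # a 1-SD increase speeds responses by ~27 ms
```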
4.5.2 Supplementary Analysis: OLS Two-Stage
Following Lorch and Myers (1990), we conducted an OLS two-stage analysis as a supplementary, shrinkage-free approach. Per-participant OLS regressions were fitted:
Per participant: log_rt ~ Zipf_KE_z + Zipf_SUBTLEX_z + Length + trial_num
Age was excluded from per-participant models because it is a between-subjects variable absorbed by the participant-level intercept. The extracted KE and SUBTLEX slopes were compared via paired t-test.
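The two-stage logic can be sketched on simulated data; effect sizes, noise levels, and the omission of Length and trial_num are illustrative simplifications:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_subj, n_items = 74, 111
zipf_ke = rng.normal(0, 1, n_items)    # standardized item predictors (simulated)
zipf_sub = rng.normal(0, 1, n_items)

ke_slopes, sub_slopes = [], []
for _ in range(n_subj):
    # Stage 1: one OLS regression per participant, with a stronger KE effect.
    log_rt = 2.79 - 0.023 * zipf_ke - 0.004 * zipf_sub + rng.normal(0, 0.05, n_items)
    X = np.column_stack([np.ones(n_items), zipf_ke, zipf_sub])
    b, *_ = np.linalg.lstsq(X, log_rt, rcond=None)
    ke_slopes.append(b[1])
    sub_slopes.append(b[2])

# Stage 2: paired t-test on the per-participant slopes (Lorch & Myers, 1990).
t, p = stats.ttest_rel(ke_slopes, sub_slopes)
```

Because each participant's slopes are estimated independently, no shrinkage toward the group mean occurs, avoiding the BLUP artifact discussed in Section 4.6.6.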
4.5.3 Sensitivity and Robustness Analyses
Eight sensitivity models (M1–M8) were estimated to assess robustness across different covariate specifications, including models with AoA, contextual diversity, orthographic neighborhood density (Ortho_N), and OLD20. Additional analyses examined frequency × proficiency interactions, sub-corpus decomposition, frequency × AoA interactions, and accuracy data.
4.6 Results
Mean lexical decision latency across all retained word trials was 618 ms (SD = 167 ms; per-participant means: M = 618 ms, SD = 80 ms, range = 451–880 ms).

Figure 11. Distribution of lexical decision response times (RTs) across all retained word trials. The distribution is right-skewed, motivating the log₁₀ transformation used in subsequent analyses. Panels show overall RT distribution and per-participant summaries.
4.6.1 Joint Model Results
Table 4. Joint Linear Mixed-Effects Model (R lme4): KE vs. SUBTLEX Frequency
| Model | KE β | KE t | KE p | SUB β | SUB t | SUB p | AIC |
|---|---|---|---|---|---|---|---|
| A: Base (KE + SUB only) | −0.025 | −6.54 | < .001 | −0.004 | −1.15 | .254 | −15893.4 |
| B: + Covariates | −0.023 | −4.48 | < .001 | −0.004 | −0.74 | .463 | −15888.3 |
| C: + AoA + CD | −0.026 | −5.01 | < .001 | −0.006 | −1.08 | .281 | −15895.6 |
Note. Model B (+ Covariates) is reported as the primary model. t-values were computed with Satterthwaite degrees of freedom (lmerTest). All models converged without singular fits.

Figure 12. Frequency–RT regression plots showing the relationship between Zipf frequency (KE and SUBTLEX) and log-transformed lexical decision latencies across all participants. KE frequency shows a steeper negative slope, reflecting its stronger predictive power.
In the primary model (Model B), KE frequency was a strong, significant predictor of log-transformed lexical decision latencies (β = −0.023, t = −4.48, p < .001), whereas SUBTLEX frequency was not significant (β = −0.004, t = −0.74, p = .463). The KE t-value was 6.1 times larger than the SUBTLEX t-value in absolute magnitude.
4.6.2 Model Comparison (ΔAIC)
Table 5. Model Comparison: AIC and BIC
| Model | AIC | ΔAIC | BIC | ΔBIC |
|---|---|---|---|---|
| Joint + cov + AoA + CD (C) | −15895.6 | 0.0 (Best) | −15784.7 | 0.0 (Best) |
| KE-only (+ cov) | −15891.2 | 4.4 | −15821.8 | — |
| Joint base (A) | −15893.4 | 2.2 | −15817.2 | — |
| Joint + cov (B) | −15888.3 | 7.3 | −15791.3 | — |
| SUBTLEX-only (+ cov) | −15850.7 | 44.9 | −15781.4 | — |
ΔAIC between the SUBTLEX-only and KE-only models was 40.5 (= −15850.7 − (−15891.2)), constituting very strong evidence in favor of KE frequency (Burnham & Anderson, 2002: ΔAIC > 10 indicates essentially no support for the weaker model).
4.6.3 Effect Size
Table 6. Effect Sizes in Milliseconds
| Predictor | β | SE | Δms per 1SD | % change | 95% CI (β) |
|---|---|---|---|---|---|
| Zipf_KE_z | −0.023 | 0.005 | −29.7 ms | −5.1% | [−0.034, −0.013] |
| Zipf_SUBTLEX_z | −0.004 | 0.005 | −4.6 ms | −0.8% | [−0.013, +0.006] |
A one standard deviation increase in KE Zipf frequency was associated with a 29.7 ms decrease in lexical decision latency (5.1% of the sample mean RT), compared to only 4.6 ms (0.8%) for SUBTLEX — a 6.5-fold difference in practical effect magnitude. The wider confidence intervals compared to single-level models reflect the proper estimation of crossed random effects in lme4.
4.6.4 Sensitivity Analysis (M1–M8)
Table 7. Sensitivity Analysis: Joint Model Across Covariate Specifications
| Model | KE β | KE t | KE p | SUB β | SUB t | SUB p | AIC |
|---|---|---|---|---|---|---|---|
| M1: KE + SUB | −.025 | −6.54 | < .001 | −.004 | −1.15 | .254 | −15893.4 |
| M2: + Age + Length + trial | −.023 | −4.48 | < .001 | −.004 | −0.74 | .463 | −15888.3 |
| M3: + AoA | −.023 | −4.50 | < .001 | −.001 | −0.16 | .870 | −15891.9 |
| M4: + CD_KE | −.026 | −5.02 | < .001 | −.008 | −1.67 | .097 | −15892.8 |
| M5: Full (cov + AoA + CD) | −.026 | −5.01 | < .001 | −.006 | −1.08 | .281 | −15895.6 |
| M6: M5 + Ortho_N | −.026 | −5.01 | < .001 | −.005 | −1.00 | .318 | −15893.6 |
| M7: M5 + OLD | −.025 | −4.98 | < .001 | −.006 | −1.13 | .263 | −15893.7 |
| M8: M5 + Ortho_N + OLD | −.025 | −5.01 | < .001 | −.006 | −1.08 | .282 | −15892.6 |
Note. KE Zipf frequency was significant at p < .001 across all eight model specifications. SUBTLEX was non-significant in all models (p range: .097–.870). t-values with Satterthwaite df.
The critical finding was in M3 (+ AoA): when AoA was added to the joint model, the SUBTLEX frequency effect was completely eliminated (β = −.001, p = .870), whereas KE frequency remained highly significant (β = −.023, p < .001). This pattern suggests that SUBTLEX frequency shares substantial variance with AoA — words that are frequent in subtitles tend to be early-acquired words — whereas KE frequency carries independent information about L2 educational exposure that is not reducible to AoA.
Adding orthographic controls (Ortho_N, OLD20; Yarkoni et al., 2008) in M6–M8 produced negligible changes in the KE coefficient (Δβ < 0.001), confirming that orthographic neighborhood density is not a confound. KE frequency showed near-zero correlations with Ortho_N (r = −.016) and OLD (r = .009).

Figure 13. Sensitivity analysis heatmap showing KE and SUBTLEX frequency coefficients and significance levels across eight model specifications (M1–M8). KE frequency is consistently significant at p < .001 across all models, while SUBTLEX significance varies substantially depending on covariate inclusion.
4.6.5 AoA Mediation Analysis
The AoA finding in M3 was further examined through frequency × AoA interactions and item-level mediation analysis.
Frequency × AoA Interactions:
| Parameter | β | t | p |
|---|---|---|---|
| Zipf_KE_z × AoA_z | +0.002 | +0.73 | .467 |
| Zipf_SUBTLEX_z × AoA_z | −0.003 | −0.96 | .341 |
Neither interaction reached significance in the lme4 analysis with fully crossed random effects. The numerically opposing directions (positive for KE, negative for SUBTLEX) are consistent with the expected pattern, but the effects were too small relative to the properly estimated standard errors to reach significance. The key evidence for AoA’s role therefore rests on the M3 covariate analysis (Section 4.6.4) rather than the interaction terms.
Item-level Bootstrap Mediation (N = 111 items, 5,000 resamples):
| Corpus | Total (c) | p | Direct (c′) | p | Indirect 95% CI |
|---|---|---|---|---|---|
| SUBTLEX | −0.012 | .016 | −0.009 | .117 | [−.009, .002] |
| KE | −0.016 | .001 | −0.014 | .005 | [−.006, .002] |
Controlling for AoA eliminated the direct effect of SUBTLEX frequency (p = .117) but not of KE frequency (p = .005), converging with the mixed-model M3 results. The indirect effects were non-significant for both predictors, likely reflecting limited power with N = 111 items.
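The bootstrap mediation procedure can be sketched as follows; the path coefficients, noise levels, and resampling scheme are illustrative assumptions rather than the study's exact pipeline:

```python
import numpy as np

def ols(y, X):
    """OLS coefficients with an intercept prepended."""
    X1 = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def mediation_ci(freq, aoa, rt, n_boot=5000, seed=0):
    """Percentile-bootstrap CI for the indirect effect freq -> AoA -> RT,
    resampling items with replacement."""
    rng = np.random.default_rng(seed)
    n = len(rt)
    indirect = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)                   # resample items
        f, a, r = freq[idx], aoa[idx], rt[idx]
        a_path = ols(a, f[:, None])[1]                # freq -> AoA
        b_path = ols(r, np.column_stack([f, a]))[2]   # AoA -> RT, controlling freq
        indirect[i] = a_path * b_path
    return np.percentile(indirect, [2.5, 97.5])

# Simulated items where frequency's effect on RT runs partly through AoA.
rng = np.random.default_rng(2)
freq = rng.normal(size=111)
aoa = -0.8 * freq + rng.normal(0, 0.3, 111)
log_rt = 0.05 * aoa + rng.normal(0, 0.02, 111)
lo, hi = mediation_ci(freq, aoa, log_rt, n_boot=1000)
```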
4.6.6 OLS Two-Stage (Supplementary)
Table 8. OLS Two-Stage Analysis: Paired t-tests on Participant-Level Slopes
| Dataset | t(73) | p | Cohen’s d |
|---|---|---|---|
| Full (111) | −6.22 | < .001 | −0.72 |
| KE-focused (60) | −3.44 | < .001 | −0.40 |
| SUBTLEX-focused (60) | −0.94 | .348 | −0.11 |
The full-set analysis confirmed a large effect (d = −0.72) favoring KE frequency. The non-significant result for the SUBTLEX-focused subset is consistent with the expectation that stimuli selected for high SUBTLEX divergence will attenuate the KE advantage.
As a methodological note, an earlier version of this analysis used BLUPs (Best Linear Unbiased Predictions) extracted from separate mixed models. BLUP-based paired t-tests yielded inflated test statistics (t = −43.67, d = −5.08) due to shrinkage reducing between-participant slope variability by 53–59% (cf. Hadfield et al., 2010, on the misuse of BLUPs). The OLS two-stage approach avoids this artifact and provides realistic effect-size estimates.

Figure 14. Comparison of per-participant frequency slopes estimated via OLS (two-stage) and BLUP extraction. BLUP slopes show substantially reduced variability (53–59% shrinkage toward the group mean), illustrating why BLUP-based paired t-tests produce inflated test statistics.
4.6.7 Proficiency Interaction
The joint model with frequency × LexTALE interactions showed no significant moderation:
| Parameter | t | p |
|---|---|---|
| Zipf_KE_z × LexTALE_z | −0.13 | .894 |
| Zipf_SUBTLEX_z × LexTALE_z | 1.63 | .108 |
Median-split analysis confirmed that the KE advantage was present in both lower-proficiency (N = 39, d = −0.61) and higher-proficiency (N = 35, d = −0.85) groups. The null interaction indicates that the KE frequency advantage is stable across the proficiency range observed in this sample (LexTALE 45–80), although the restricted range (IQR = 11) limits the power to detect proficiency moderation.
4.6.8 Sub-corpus Decomposition
Table 9. Sub-corpus Frequency Models
| Model | Corpus t | SUB t | AIC | ΔAIC vs. KE |
|---|---|---|---|---|
| Reading + SUBTLEX | −3.56 | −0.95 | −15872.9 | +15.5 |
| Listening + SUBTLEX | −3.59 | 0.04 | −15877.9 | +10.5 |
| Textbook + SUBTLEX | −2.79 | −0.83 | −15873.8 | +14.5 |
| KE (combined) | −4.48 | −0.74 | −15888.3 | 0.0 |
The combined KE frequency was the best predictor (ΔAIC = 10–16 over any single sub-corpus), confirming the value of aggregating across educational sources. All three sub-corpora independently outperformed SUBTLEX. In the listening sub-corpus model, the SUBTLEX frequency effect was effectively zero (t = 0.04), likely reflecting the shared conversational register between listening passages and subtitle-based corpora.
4.6.9 Contextual Diversity
Frequency-based models outperformed CD-based models (ΔAIC = 34.1), indicating that the KE frequency effect is driven by cumulative exposure rather than contextual diversity. When both frequency and CD were entered simultaneously, CD_KE showed a positive coefficient (β = +0.013, t = +2.58, p = .011), acting as a suppressor variable — frequency and CD have opposing effects after mutual control, consistent with Adelman et al.’s (2006) interpretation that frequency and CD capture distinct aspects of lexical experience. Note that SUBTLEX frequency and CD were nearly collinear (r = .969, VIF ≈ 18), precluding their simultaneous estimation.
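For reference, the collinearity between SUBTLEX frequency and CD follows directly from their correlation. In the two-predictor case, VIF = 1/(1 − r²); the reported VIF ≈ 18 presumably reflects the full multivariate model, which can inflate it somewhat further:

```python
def vif_two_predictor(r):
    """Variance inflation factor for one of two correlated predictors.
    With r = .969 this already exceeds the conventional VIF > 10 threshold."""
    return 1.0 / (1.0 - r ** 2)

vif_two_predictor(0.969)  # ~16.4 in the two-predictor case
```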
4.6.10 Accuracy
Item-level accuracy showed a ceiling effect (M = .978, SD = .027), and no predictor reached significance in a logistic regression model (R² = .039). There was no speed-accuracy trade-off (r = −.044, p = .650). Among low-accuracy items (N = 55), both KE (r = −.370, p = .006) and SUBTLEX (r = −.345, p = .010) frequencies correlated with mean RT, showing convergent evidence that frequency effects on accuracy parallel those on RT when item difficulty is sufficient to produce variance.

Figure 15. Accuracy analysis showing (a) the ceiling-level overall accuracy distribution and (b) frequency–accuracy correlations for low-accuracy items (N = 55). Both KE and SUBTLEX frequencies are significantly correlated with accuracy for difficult items.
4.6.11 Concreteness Moderation (Exploratory)
An exploratory analysis tested whether concreteness (Brysbaert et al., 2014) moderated the frequency effect:
| Parameter | t | p |
|---|---|---|
| Zipf_KE_z × Concreteness_z | −1.00 | .322 |
| Zipf_SUBTLEX_z × Concreteness_z | 0.28 | .779 |
Neither concreteness interaction reached significance. The descriptive direction suggested a larger KE frequency effect for concrete words (β = −0.004), but the fully crossed random effects structure yielded larger standard errors, and the effect did not survive the more appropriate lme4 analysis. This null result should be interpreted in the context of (a) concreteness ratings being L1-normed, (b) no correction for multiple comparisons, and (c) the analysis not being pre-registered.
4.6.12 Cross-Validation
GroupKFold cross-validation (K = 10, grouped by participant) compared out-of-sample predictive performance:
Table 10. GroupKFold Cross-Validation RMSE
| Set | KE RMSE | SUB RMSE | Joint RMSE | KE vs. SUB p |
|---|---|---|---|---|
| Full (111) | 0.1023 | 0.1029 | 0.1023 | < .001 |
| SUBTLEX-focused (60) | 0.1081 | 0.1085 | 0.1076 | .141 |
| KE-focused (60) | 0.0957 | 0.0958 | 0.0957 | .290 |
The KE model achieved lower or equivalent RMSE in all analyses, with a statistically significant advantage for the full stimulus set (p < .001), indicating superior out-of-sample generalization.
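The grouped cross-validation scheme can be sketched with scikit-learn's GroupKFold; the data-generating values below are illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(4)

# Hypothetical trial-level data: 74 participants x 111 items.
n_subj, n_items = 74, 111
groups = np.repeat(np.arange(n_subj), n_items)        # participant IDs
zipf_ke = np.tile(rng.normal(0, 1, n_items), n_subj)  # item-level predictor
y = 2.79 - 0.023 * zipf_ke + rng.normal(0, 0.07, n_subj * n_items)
X = zipf_ke[:, None]

# Grouping by participant keeps each subject's trials in a single fold,
# so every model is evaluated on entirely unseen participants.
rmses = []
for train, test in GroupKFold(n_splits=10).split(X, y, groups):
    model = LinearRegression().fit(X[train], y[train])
    pred = model.predict(X[test])
    rmses.append(mean_squared_error(y[test], pred) ** 0.5)
```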
5. General Discussion
5.1 Summary of Findings
The central finding of this study is that word frequency norms derived from Korean English education corpora consistently outperformed SUBTLEX-US norms in predicting L2 lexical decision latencies. In the joint linear mixed-effects model (R lme4; Bates et al., 2015), KE frequency showed a robust, significant effect (β = −0.023, t = −4.48, p < .001; −29.7 ms/SD) while SUBTLEX frequency was non-significant (β = −0.004, t = −0.74, p = .463; −4.6 ms/SD). Model comparison yielded ΔAIC = 40.5, very strong evidence favoring KE (Burnham & Anderson, 2002). This pattern was confirmed by OLS two-stage analysis (d = −0.72) and was robust across eight sensitivity models (M1–M8), three sub-corpus decompositions, and cross-validation. These results converge with Haeuser and Kray’s (2025) finding that SUBTLEX superiority is not universal, extending the evidence to the L2 domain.
5.2 KE Frequency as a Dual Predictor: L2-Specific Exposure or General Lexical Difficulty?
An important question — and a limitation of the present study — is whether the KE advantage reflects L2-specific educational exposure or general lexical difficulty. Given the high Spearman correlation between KE and HAL frequencies (ρ = .989; Table 1), KE frequency is nearly rank-equivalent to a general English text corpus. This raises the possibility that the KE advantage over SUBTLEX reflects a general superiority of text-based frequency norms over subtitle-based norms, rather than anything specific to L2 educational exposure — a finding that would parallel Haeuser and Kray’s (2025) demonstration that text-based dlexDB outperformed SUBTLEX-DE for German L1 reading.
Indeed, preliminary analyses of ELP data suggest that KE frequency also outperforms SUBTLEX for L1 English speakers, which is consistent with the “general lexical difficulty” interpretation. If KE frequency were purely an L2-specific measure, it should not outperform SUBTLEX for L1 speakers who have no exposure to Korean English education.
We propose that KE frequency likely operates as a dual predictor, capturing (a) general lexical difficulty — the academic, edited nature of educational materials provides frequency estimates that correlate with objective word difficulty, similarly to other text-based corpora — and (b) L2-specific educational exposure — to the extent that the KE frequency distribution specifically reflects the input that Korean learners receive through formal education.
The AoA mediation analysis provides the strongest — though not conclusive — evidence for the L2-specific component. When AoA was controlled, SUBTLEX frequency was eliminated (p = .870) while KE frequency remained significant (p < .001). This dissociation suggests that SUBTLEX frequency shares substantial variance with AoA (high-frequency subtitle words tend to be early-acquired words), whereas KE frequency carries information independent of AoA. However, this dissociation is also consistent with a text-based corpus advantage: if text-based norms provide better frequency estimates than subtitle norms in general (as Haeuser & Kray, 2025, suggest), the AoA-independent component may reflect corpus-quality rather than L2-specificity. Separating these two accounts would require comparing KE against another large text-based corpus (e.g., COCA or BNC) in a joint model — a comparison we did not perform in the present study.
This interpretation is consistent with Kuperman and Van Dyke’s (2013) demonstration that frequency norms aligned with a reader’s actual experience outperform generic corpus counts, and with the broader principle that population-matched norms improve behavioral prediction (Korochkina et al., 2024; Nohejl et al., 2024).
5.3 AoA Mediation and the Frequency–Acquisition Interface
Although the frequency × AoA interaction model showed descriptive patterns in the expected direction (KE × AoA: β = +0.002, t = +0.73, p = .467; SUBTLEX × AoA: β = −0.003, t = −0.96, p = .341), neither interaction reached significance in the lme4 analysis with fully crossed random effects. The larger standard errors under this more appropriate specification absorbed the variance that had produced nominally significant interactions under the variance-component approximation.
Critically, the AoA main effect evidence remains robust: when AoA was entered as a covariate (M3), SUBTLEX frequency was eliminated (p = .870) while KE frequency remained significant (p < .001). This dissociation — rather than the interaction patterns — provides the strongest evidence for the frequency–acquisition interface, suggesting that SUBTLEX frequency shares substantial variance with AoA, whereas KE frequency carries independent information. This interpretation aligns with usage-based theories of language acquisition (Ellis, 2002; Tomasello, 2003), which posit that linguistic representations are shaped by the cumulative statistical properties of the input learners receive.
5.4 Cross-Corpus Frequency Relationships
The remarkably high Spearman correlation between CSAT and HAL frequencies (ρ = .989) suggests that the rank-ordering of word frequencies is highly stable across corpora of vastly different sizes and genres. This is consistent with Zipf’s law and suggests that the core vocabulary of English is distributed similarly across educational and naturalistic contexts. However, the lower Pearson correlations (KE-HAL: r = .730) indicate that the magnitudes of frequency differences are less stable, particularly for words at the extremes of the frequency distribution.
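The rank-versus-magnitude dissociation described here can be demonstrated with a toy example: a monotone nonlinear distortion of a skewed variable leaves Spearman's ρ at 1 while lowering Pearson's r:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Two "frequency" measures that agree perfectly on rank order but diverge
# in magnitude at the extremes of a skewed distribution.
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)
y = x ** 2  # same ranks as x, exaggerated tail magnitudes

rho, _ = stats.spearmanr(x, y)  # rank order preserved -> rho = 1
r, _ = stats.pearsonr(x, y)     # magnitudes diverge -> r well below 1
```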
The weaker relationship with SUBTLEX-US (ρ = .657–.818) is noteworthy. Subtitle-based corpora are dominated by informal, conversational language — a register that differs substantially from the academic English found in CSAT passages. This register mismatch may explain why SUBTLEX frequency is a less effective predictor for L2 learners whose exposure is predominantly academic. This finding converges with Nohejl et al.’s (2024) demonstration that the predictive validity of subtitle-based norms varies with the domain of the target task, and with Korochkina et al.’s (2024) evidence that age-appropriate norms (CYP-LEX) outperform adult norms for child populations — both cases illustrating the general principle that frequency-norm validity depends on the match between the corpus and the target population’s input profile.
The frequency-band-specific analysis (Section 3.4.4) revealed asymmetric HF-to-LF attenuation (37.1% for small-corpus pairs vs. 17% for SUBTLEX-anchored pairs), confirming Brysbaert and New’s (2009) prediction that corpus size disproportionately affects low-frequency estimation. Critically, KE LF estimates maintained moderate agreement with SUBTLEX (r = .495), and Study 2 confirmed that KE frequency predicts LDT latencies across the full frequency range — indicating that the remaining LF signal, though noisier than HF estimates, is sufficient for meaningful behavioral prediction. This convergence between the cross-corpus reliability analysis and the behavioral results strengthens the case that education-derived frequency norms capture genuine lexical exposure patterns rather than idiosyncratic corpus artifacts.
5.5 Implications for L2 Research Methodology
Our findings have important methodological implications for L2 lexical processing research. The common practice of using L1-derived frequency norms (SUBTLEX-US, HAL) as predictors in L2 studies introduces a systematic mismatch between the frequency estimates and the learners’ actual exposure. While this mismatch may be small for high-frequency words (where all corpora tend to agree), it becomes consequential for mid- and low-frequency words that differ in frequency across educational and naturalistic contexts.
We recommend that researchers studying L2 populations consider using input-matched frequency norms where available. For populations with well-defined educational exposure (e.g., Korean students with CSAT preparation), education-derived norms provide a more ecologically valid measure of word familiarity. The combined KE corpus outperformed any single sub-corpus (ΔAIC = 10–16), suggesting that aggregating across multiple educational sources provides the most reliable frequency estimates.
5.6 Bilingual Processing Implications
Within the framework of bilingual lexical access models such as BIA+ (Dijkstra & Van Heuven, 2002) and Multilink (Dijkstra et al., 2019), our results suggest that the resting activation levels of L2 lexical representations may be more strongly determined by education-specific input than by general L1-environment exposure. If lexical entries in the bilingual lexicon are activated as a function of cumulative exposure, then the source and context of that exposure matter: KE frequency better approximates the cumulative educational input that determines resting activation levels for Korean L2 learners.
5.7 Limitations
Several limitations should be acknowledged:
-
Corpus size asymmetry: Although the cross-validation analysis in Section 3.4.4 confirmed that KE low-frequency estimates maintain moderate agreement with SUBTLEX-US (r = .495), the asymmetric HF-to-LF attenuation (37.1% for small-corpus pairs; cf. Brysbaert & New, 2009) indicates that larger educational corpora would further improve LF reliability. The KE corpus (~860K tokens) remains considerably smaller than SUBTLEX-US (~51M) and HAL (~131M), and the LF estimation noise inherent to smaller corpora represents a systematic limitation.
-
Online experiment timing: Data were collected via an online platform (Pavlovia), which introduces timing variability of approximately 10–60 ms compared to laboratory settings (Anwyl-Irvine et al., 2021; Bridges et al., 2020). While this noise is random with respect to our frequency predictors and therefore attenuates rather than inflates effects, the KE-SUBTLEX effect-size difference (29.7 vs. 4.6 ms per SD) falls within a range where online timing noise is non-negligible. Laboratory replication is recommended to obtain more precise absolute effect-size estimates.
-
Participant sample: All participants were Korean university students (N = 74), limiting generalizability to other L2 populations with different educational exposure patterns.
-
Random effects specification: Mixed models were estimated using R lme4 (Bates et al., 2015) with crossed random intercepts for participants and items, plus random slopes for both frequency predictors by participant. While this specification follows the recommendations of Barr et al. (2013), a maximal model including random slopes by item did not converge due to the limited number of observations per item. The consistent results across mixed models, OLS two-stage analysis, and cross-validation mitigate concerns about random effects specification.
-
Cognate and loanword status: Korean contains a substantial inventory of English-derived loanwords, which may enjoy processing advantages. The present stimulus set was not explicitly controlled for cognate or loanword status. Although the joint model controls for both frequency sources simultaneously, cognate status could differentially affect KE and SUBTLEX frequency estimates, and future studies should include this as an explicit covariate.
-
Word prevalence: Brysbaert, Mandera, and Keuleers (2019) demonstrated that word prevalence — the proportion of people who know a word — captures unique variance beyond frequency. The present study did not include word prevalence as a covariate, as L2-specific prevalence norms are not available for Korean learners. The extent to which KE frequency captures prevalence-related variance remains an open question.
-
L1-normed psycholinguistic variables: AoA (Kuperman et al., 2012) and Concreteness (Brysbaert et al., 2014) norms were collected from L1 speakers. Their validity for L2 populations is assumed but not verified.
-
Proficiency range: The LexTALE score range (45–80, IQR = 11) represents intermediate-to-upper-intermediate proficiency. The frequency × proficiency interaction may emerge with a wider proficiency range.
Marginal R²: The overall variance explained by frequency predictors was modest (~10%), consistent with the general finding that frequency accounts for a small but reliable portion of RT variance in lexical decision (Brysbaert et al., 2018). The KE–SUBTLEX difference in R² was approximately 1.5 percentage points, though this modest absolute difference corresponds to a 6.5-fold difference in effect magnitude (29.7 vs. 4.6 ms/SD).
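The relationship between the reported effect magnitudes and the model comparison can be checked in a few lines; the exp(ΔAIC/2) evidence-ratio reading follows Burnham and Anderson (2002):

```python
import math

# Reported effect sizes (ms per SD) and model comparison from the text.
ke_effect, subtlex_effect = -29.7, -4.6
delta_aic = 40.5

# Ratio of effect magnitudes: the "6.5-fold" difference cited above.
ratio = abs(ke_effect) / abs(subtlex_effect)
print(round(ratio, 1))  # 6.5

# Under an information-theoretic reading (Burnham & Anderson, 2002),
# the evidence ratio for the better-fitting model is exp(dAIC / 2),
# which is astronomically large here despite the modest R² difference.
evidence_ratio = math.exp(delta_aic / 2)
```

This illustrates why a 1.5-percentage-point R² gap can coexist with decisive model-selection evidence: AIC compares likelihoods, not variance explained.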
Stimulus selection circularity: The KE-focused and SUBTLEX-focused subsets were defined by frequency rank divergence between the two corpora, which means subset-level comparisons are inherently biased toward each set’s defining corpus. Although the full-set analysis (111 words) serves as the primary comparison and mitigates this concern, the overall stimulus pool was still drawn from the intersection of KE, ELP, HAL, and SUBTLEX, excluding words absent from any one corpus. Replication with independently selected stimuli — for example, stratified random sampling without reference to cross-corpus divergence — would provide a stronger test of the KE frequency advantage.
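A minimal sketch of the proposed stratified random sampling, assuming a single Zipf value per candidate word; the function name and parameters are hypothetical:

```python
import random

def stratified_sample(words_by_zipf, n_strata=5, k_per_stratum=20, seed=1):
    """Frequency-stratified random sampling: cut the Zipf range into
    equal-width strata and draw k words at random from each stratum,
    with no reference to cross-corpus rank divergence.
    `words_by_zipf` maps word -> Zipf frequency.
    (Hypothetical helper; names and defaults are illustrative.)"""
    rng = random.Random(seed)
    zipfs = list(words_by_zipf.values())
    lo, hi = min(zipfs), max(zipfs)
    width = (hi - lo) / n_strata or 1.0
    strata = [[] for _ in range(n_strata)]
    for word, z in words_by_zipf.items():
        i = min(int((z - lo) / width), n_strata - 1)
        strata[i].append(word)
    sample = []
    for stratum in strata:
        sample.extend(rng.sample(stratum, min(k_per_stratum, len(stratum))))
    return sample
```

Sampling by frequency band alone keeps the stimulus set balanced across the Zipf range while removing the divergence-based circularity described above.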
Comparison scope: The present study compared KE frequency against SUBTLEX-US only, whereas Chen et al. (2018) evaluated 17 frequency norms. While SUBTLEX-US is the current standard in psycholinguistic research, the inclusion of additional text-based norms (e.g., COCA, BNC) would help disentangle whether the KE advantage is L2-specific or reflects a general text-corpus superiority over subtitle-based norms.
Future directions: Future research may examine whether contextual surprisal from domain-adapted language models provides additional predictive value beyond Zipf frequency. Extending this approach to other L2 populations, incorporating reading-time measures, developing L2-normed AoA databases, and comparing KE against multiple text-based and spoken corpora would further strengthen the evidence base.
6. Conclusion
This study provides evidence that word frequency norms derived from Korean English education corpora outperform SUBTLEX-US in predicting L2 lexical decision latencies. The KE corpus (859,346 tokens; 22,510 word types; 1994–2025) showed a substantial predictive advantage over SUBTLEX-US (ΔAIC = 40.5; −29.7 vs. −4.6 ms per SD), an advantage that was robust across eight sensitivity models, survived cross-validation, and persisted after controlling for age of acquisition, contextual diversity, and orthographic variables.
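Assuming the KE Zipf values follow the standard van Heuven et al. (2014) formulation with the Laplace smoothing recommended by Brysbaert and Diependaele (2013), they can be reproduced from raw counts and the corpus sizes reported above:

```python
import math

# KE corpus size from the text: 859,346 tokens; 22,510 word types.
KE_TOKENS = 859_346
KE_TYPES = 22_510

def zipf(count, n_tokens=KE_TOKENS, n_types=KE_TYPES):
    """Zipf frequency (van Heuven et al., 2014) with Laplace smoothing
    (Brysbaert & Diependaele, 2013): log10 of the smoothed frequency
    per million words, plus 3."""
    per_million = (count + 1) / ((n_tokens + n_types) / 1_000_000)
    return math.log10(per_million) + 3

# With smoothing, a word unattested in the corpus still receives a
# finite Zipf value rather than negative infinity.
print(round(zipf(0), 2))
```

For a corpus of this size, the smoothed floor sits just above Zipf 3, which is why low-frequency estimates attenuate asymmetrically relative to a much larger corpus such as SUBTLEX-US.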
The AoA mediation analysis suggested that SUBTLEX frequency shares substantial variance with AoA (p = .870 after AoA control), whereas KE frequency carries independent information (p < .001 after AoA control). However, the high rank correlation between KE and HAL (ρ = .989) and the preliminary ELP analyses indicate that the KE advantage may partly reflect a general superiority of text-based frequency norms over subtitle-based norms, rather than L2-specific educational exposure alone. Future work comparing KE against other large text-based corpora (e.g., COCA, BNC) is needed to disentangle these accounts. These findings have practical implications for selecting appropriate frequency norms in L2 psycholinguistic research and for understanding how corpus characteristics interact with population-specific exposure profiles.
References
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17(9), 814–823.
Anwyl-Irvine, A. L., Massonnié, J., Flitton, A., Kirkham, N., & Evershed, J. K. (2021). Gorilla in our midst: An online behavioral experiment builder. Behavior Research Methods, 53(1), 388–407. https://doi.org/10.3758/s13428-020-01386-8
Baek, H., Lee, Y., & Choi, W. (2023). Proficiency versus lexical processing efficiency as a measure of L2 lexical quality: Individual differences in word-frequency effects in L2 visual word recognition. Memory & Cognition, 51(8), 1858–1869. https://doi.org/10.3758/s13421-023-01436-0
Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133(2), 283–316.
Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., Neely, J. H., Nelson, D. L., Simpson, G. B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39(3), 445–459.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3), 255–278. https://doi.org/10.1016/j.jml.2012.11.001
Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01
Bridges, D., Pitiot, A., MacAskill, M. R., & Peirce, J. W. (2020). The timing mega-study: Comparing a range of experiment generators, both lab-based and online. PeerJ, 8, e9414.
Brysbaert, M., & Diependaele, K. (2013). Dealing with zero word frequencies: A review of the existing rules of thumb and a suggestion for an evidence-based choice. Behavior Research Methods, 45(2), 422–430. https://doi.org/10.3758/s13428-012-0270-5
Brysbaert, M., Mandera, P., & Keuleers, E. (2018). The word frequency effect in word processing: An updated review. Current Directions in Psychological Science, 27(1), 45–50. https://doi.org/10.1177/0963721417727521
Brysbaert, M., Mandera, P., & Keuleers, E. (2019). Word prevalence norms for 62,000 English lemmas. Behavior Research Methods, 51(2), 467–479. https://doi.org/10.3758/s13428-018-1077-9
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.
Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911.
Burgess, C., & Livesay, K. (1998). The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kučera and Francis. Behavior Research Methods, Instruments, & Computers, 30(2), 272–277.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach (2nd ed.). Springer.
Chen, B., Dong, Y., & Yu, Z. (2018). A comparison of word frequency norms based on different corpora for predicting Chinese-English bilinguals’ lexical decision latencies. Behavior Research Methods, 50(6), 2540–2555.
Cop, U., Keuleers, E., Drieghe, D., & Duyck, W. (2015). Frequency effects in monolingual and bilingual natural reading. Psychonomic Bulletin & Review, 22(5), 1216–1234. https://doi.org/10.3758/s13423-015-0819-2
De Wilde, V., Brysbaert, M., & Eyckmans, J. (2020). Learning English through out-of-school exposure: Which levels of language proficiency are attained and which types of input are important? Bilingualism: Language and Cognition, 23(1), 171–185.
Diependaele, K., Lemhöfer, K., & Brysbaert, M. (2013). The word frequency effect in first- and second-language word recognition: A lexical entrenchment account. Quarterly Journal of Experimental Psychology, 66(5), 843–863.
Dijkstra, T., & Van Heuven, W. J. B. (2002). The architecture of the bilingual word recognition system: From identification to decision. Bilingualism: Language and Cognition, 5(3), 175–197.
Dijkstra, T., Wahl, A., Buytenhuijs, F., Van Halem, N., Al-Jefri, Z., De Korte, M., & Rekké, S. (2019). Multilink: A computational model for bilingual word recognition and word translation. Bilingualism: Language and Cognition, 22(4), 657–679.
Ellis, N. C. (2002). Frequency effects in language processing: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition, 24(2), 143–188.
Forster, K. I. (1976). Accessing the mental lexicon. In R. J. Wales & E. Walker (Eds.), New approaches to language mechanisms (pp. 257–287). North-Holland.
Forster, K. I., & Chambers, S. M. (1973). Lexical access and naming time. Journal of Verbal Learning and Verbal Behavior, 12(6), 627–635.
Hadfield, J. D., Wilson, A. J., Garant, D., Sheldon, B. C., & Kruuk, L. E. B. (2010). The misuse of BLUP in ecology and evolution. The American Naturalist, 175(1), 116–125.
Hamrick, P., & Pandza, H. (2020). Contextual diversity and word learning in a second language. Canadian Journal of Experimental Psychology, 74(3), 233–243.
Haeuser, K. I., & Kray, J. (2025). Not so SUBTLE(X): Word frequency estimates and their fit to sentential reading times in interaction with predictability. Linguistics, 63. https://doi.org/10.1515/ling-2024-0143
Keuleers, E., Lacey, P., Rastle, K., & Brysbaert, M. (2012). The British Lexicon Project: Lexical decision data for 28,730 monosyllabic and disyllabic English words. Behavior Research Methods, 44(1), 287–304.
Korochkina, M., Birchenough, J. M. H., Dawson, N., & Sheriston, L. (2024). CYP-LEX: A large-scale lexical database for children and young people. Quarterly Journal of Experimental Psychology, 77(8), 1573–1592. https://doi.org/10.1177/17470218241229694
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Brown University Press.
Kuznetsova, A., Brockhoff, P. B., & Christensen, R. H. B. (2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26. https://doi.org/10.18637/jss.v082.i13
Kuperman, V., & Van Dyke, J. A. (2013). Reassessing word frequency as a determinant of word recognition for skilled and unskilled readers. Journal of Experimental Psychology: Human Perception and Performance, 39(3), 802–823. https://doi.org/10.1037/a0030859
Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978–990.
Lemhöfer, K., & Broersma, M. (2012). Introducing LexTALE: A quick and valid lexical test for advanced learners of English. Behavior Research Methods, 44(2), 325–343.
Li, N., Wolter, B., Yang, L., & Siyanova-Chanturia, A. (2025). The effects of textbook frequency, congruency, and word class on the processing of L2 collocations by Chinese EFL learners. Humanities and Social Sciences Communications, 12, 767. https://doi.org/10.1057/s41599-025-05045-x
Lorch, R. F., & Myers, J. L. (1990). Regression analyses of repeated measures data in cognitive research. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16(1), 149–157.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203–208.
Morton, J. (1969). Interaction of information in word recognition. Psychological Review, 76(2), 165–178.
Murphy Odo, D. (2023). Vocabulary coverage of the Basic English Vocabulary List from the Korean National Curriculum and the BNC in the CSAT. The Journal of Asia TEFL, 20(4), 906–916.
Murray, W. S., & Forster, K. I. (2004). Serial mechanisms in lexical access: The rank hypothesis. Psychological Review, 111(3), 721–756.
Nohejl, A., Vít, J., Kocmi, T., & Bojar, O. (2024). Beyond film subtitles: Is YouTube the best approximation of spoken vocabulary? arXiv preprint, arXiv:2410.03240. https://doi.org/10.48550/arXiv.2410.03240
Peirce, J. W., Gray, J. R., Simpson, S., MacAskill, M. R., Höchenberger, R., Sogo, H., Kastman, E., & Lindeløv, J. K. (2019). PsychoPy2: Experiments in behavior made easy. Behavior Research Methods, 51(1), 195–203.
Piantadosi, S. T. (2014). Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5), 1112–1130.
Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163.
Rodríguez-Cuadrado, S., Hinojosa, J. A., Guasch, M., Romero-Rivas, C., Sabater, L., Suárez-Coalla, P., & Ferré, P. (2022). Subjective age of acquisition norms for 1604 English words by Spanish L2 speakers of English and their relationship with lexico-semantic, affective, sociolinguistic and proficiency variables. Behavior Research Methods, 54(5), 2410–2423.
Tomasello, M. (2003). Constructing a language: A usage-based theory of language acquisition. Harvard University Press.
van Heuven, W. J., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.
Yarkoni, T., Balota, D., & Yap, M. (2008). Moving beyond Coltheart’s N: A new measure of orthographic similarity. Psychonomic Bulletin & Review, 15(5), 971–979.