The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kučera and Francis

...

The HAL corpus consists of approximately 131 million words gathered during February of 1995 from approximately 3,000 Usenet newsgroups. All Usenet newsgroups that contained text contributed to the corpus. At the time the corpus was collected, about 10 million new words of text were available each day; a recent check of the Usenet feed that we use indicated that about 20 million words were available each day. Usenet is thus an easily available source of text. A major advantage of this source is the extremely broad range of topics that are discussed: with 3,000 newsgroups, virtually any topic is covered. The text is very conversational and noisy, much like spoken language. These features overcome what we saw as two limitations of the Brown corpus: small samples of words and more formal language use. One limitation of the HAL corpus is that it is not yet tagged for parts of speech. Of course, any corpus has limitations and advantages, and these are a function of the question under investigation.

The master vocabulary for the HAL corpus contains 3,461,884 lexical entries. This huge number of types requires some comment, since it is difficult to characterize the complete set of items. Of the 70,000 most frequent items, about half have entries in the standard Unix dictionary; the other half are proper names, slang words, misspellings, and nonword symbols. The remaining 3,391,884 items represent a vast range of LF words, misspellings, hyphenations, nonletter characters, and other nonword symbols. Although for any given user much of this list would be considered noise, we have found it useful in other projects to have frequency counts for misspellings; emoticons, such as ":)" or ":>("; and slang. As a result, it becomes difficult to state precisely the number of words in the count. For the present purposes, we will claim that there are 97,261 words, although the full count is larger. The overhead of maintaining the frequency file is not substantial (21.3 MB).

The genesis of the HAL corpus requires a brief introduction. The HAL (Hyperspace Analogue to Language) model of memory encodes meaning by transducing lexical co-occurrence information into semantic and grammatical information, and this corpus was collected as the language input for the HAL model. Since the model was to simulate normal human memory, a conversational and widely ranging source of language was desired. The basic methodological details for HAL are available in Lund and Burgess (1996). The model has been used to simulate basic word recognition and priming processes (Burgess & Lund, 1997c; Livesay & Burgess, 1997; Lund, Burgess, & Atchley, 1995; Lund, Burgess, & Audet, 1996), as well as neuropsychological memory processing in normals (Burgess & Lund, 1997a) and in deep dyslexics (Buchanan, Burgess, & Lund, 1996). The model has also been used successfully in the simulation of human grammatical and syntactic behavior (Burgess & Lund, 1997b) and of developmental data (Burgess, Lund, & Kromsky, 1997; Lund & Burgess, 1997). An important part of every one of these experiments is the notion that WF is related to how a concept develops. Implicit in the success of these experiments is the assumption that the incidence of words in the corpus corresponds to ordinary human experience. The experiments in the present paper were performed explicitly to test the hypothesis that the larger, more recent, and more conversational corpus would predict human performance on a simple lexical task better than the older, smaller corpus.
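To make the co-occurrence procedure concrete, the sketch below shows how HAL-style word frequencies and weighted co-occurrence counts might be accumulated from a stream of tokens. It is a minimal illustration under stated assumptions, not the implementation used to build the corpus: the 10-word window and the distance-based weighting (nearer words contribute more) follow the general description in Lund and Burgess (1996), but the tokenization and the exact weighting scheme here are assumptions made for the sake of the example.

```python
from collections import Counter, defaultdict

WINDOW = 10  # assumed window size, following the general HAL description


def hal_counts(tokens, window=WINDOW):
    """Accumulate word frequencies and weighted co-occurrence counts.

    Weighting is assumed to fall off with distance inside the window
    (a word 1 position back adds `window`, a word `window` positions
    back adds 1); see Lund & Burgess (1996) for the actual scheme.
    """
    freq = Counter()
    cooc = defaultdict(Counter)  # cooc[target][context] = weighted count
    for i, word in enumerate(tokens):
        freq[word] += 1
        # look back over the preceding `window` words
        for dist in range(1, window + 1):
            j = i - dist
            if j < 0:
                break
            cooc[word][tokens[j]] += window - dist + 1
    return freq, cooc


# toy usage on a tiny "corpus"
text = "the horse raced past the barn fell".split()
freq, cooc = hal_counts(text)
print(freq["the"])          # raw frequency count for "the"
print(cooc["barn"]["the"])  # weighted count of "the" preceding "barn"
```

In a run over the full newsgroup feed, the same single pass that fills the co-occurrence table also yields the raw frequency counts discussed above; only the frequency table need be retained for a norms file.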
Thus, given the importance of WF as a theoretical construct in models of word recognition and other cognitive processes, and given the prevalence of the KF word count, we decided to investigate the correspondence between the KF frequency estimates and a larger, more recent WF count. WF is most often used to segregate stimuli into various frequency ranges (e.g., low, high), to balance sets of items within or between conditions of an experiment, or to set parameters in computer simulations of word retrieval or reading. Inherent in many of these experiments is some kind of evaluation of the role of LF information. An adequate number of LF items is crucial in a corpus because inadequate sampling is most likely for LF items and may therefore introduce considerable unwanted variability into an LF condition.
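As an illustration of the stimulus-selection use described above, the fragment below partitions candidate items into LF and HF sets from a frequency table. The cutoffs, the item names, and the raw counts are hypothetical, and the conversion to occurrences per million words is simply the standard way of putting corpora of very different sizes (e.g., the 1-million-word KF count versus the 131-million-word HAL count) on a comparable scale; it is not claimed to be the procedure used in this paper.

```python
def per_million(count, corpus_size):
    """Convert a raw count to occurrences per million words so counts
    from corpora of different sizes are on a comparable scale."""
    return count * 1_000_000 / corpus_size


def split_by_frequency(items, norms, low_max=10.0, high_min=100.0):
    """Partition candidate stimuli into LF and HF sets; the cutoffs are
    hypothetical per-million values, and items in between are dropped."""
    low = [w for w in items if norms.get(w, 0.0) <= low_max]
    high = [w for w in items if norms.get(w, 0.0) >= high_min]
    return low, high


# hypothetical raw counts from a 131-million-word corpus
raw = {"house": 77_000, "gargoyle": 180, "ennui": 40}
norms = {w: per_million(c, 131_000_000) for w, c in raw.items()}
low_set, high_set = split_by_frequency(raw, norms)
```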