Pearson, Spearman, Zipf, Rfreq
Raw Frequency
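- The number of times a word occurs among all the tokens in the corpus.
- Because it is a plain count over the whole corpus, its range is very wide, so the raw values are poorly suited to analyses such as ANOVA.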
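As a rough illustration of what the raw count is, here is a minimal sketch. The toy corpus and the helper name `raw_frequencies` are made up for this note, and it assumes the text is already tokenized and lowercased.

```python
from collections import Counter


def raw_frequencies(tokens):
    """Return a Counter mapping each token to its number of occurrences."""
    return Counter(tokens)


# Hypothetical toy corpus (assumption: tokenization is a plain whitespace split).
tokens = "the cat sat on the mat and the dog sat too".split()
raw = raw_frequencies(tokens)

corpus_size = sum(raw.values())  # total number of tokens in the corpus
print(raw.most_common(2), corpus_size)
```

A Counter returns 0 for a word it has never seen, which matters later when a log transform is applied.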
Relative Frequency
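- The raw frequency above, adjusted for the size of the corpus:
- raw_freq / corpus_size.
- Because this is only a linear transform (division by a constant), it rescales the values without changing the shape of their distribution, so it is hard to regard it as a meaningful transformation of the data.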
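A small sketch of the normalization, with hypothetical counts and corpus size. The function names are invented for this note. It also shows why the transform is "only linear": the ratio between two words is the same before and after.

```python
def relative_frequency(raw_count, corpus_size):
    """Raw count divided by the total number of tokens in the corpus."""
    return raw_count / corpus_size


def per_million(raw_count, corpus_size):
    """Relative frequency expressed per million words."""
    return raw_count / corpus_size * 1_000_000


# Hypothetical numbers: a rare and a common word in a 10-million-token corpus.
corpus_size = 10_000_000
rare, common = 5, 50_000

ratio_raw = common / rare
ratio_rel = relative_frequency(common, corpus_size) / relative_frequency(rare, corpus_size)

# Both ratios agree (up to floating-point error): the spread is unchanged.
print(ratio_raw, ratio_rel)
```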
Zipf Score
From the SUBTLEX-UK paper
SUBTLEX-UK: A new and improved word frequency database for British English. According to this paper, a word-frequency scale should satisfy the following conditions:
Qualifications for a word-frequency scale
- It should be a logarithmic scale (e.g., like the decibel scale of sound loudness).
- It should have relatively few points, without negative values (e.g., like a typical Likert rating scale, from 1 to 7).
- The middle of the scale should separate the low-frequency words from the high-frequency words.
- The scale should have a straightforward unit.
Zipf Scale
“To meet the last requirement, we propose to call the new scale the Zipf scale, after the American linguist George Kingsley Zipf (1902–1950) who first thoroughly analysed the regularities of word-frequency distribution and formulated a law (Zipf, 1949), which was later named after him.”
…
“The calculation of Zipf values is easy as it equals log10(frequency per billion words) or log10(frequency per million words) + 3. So, a Zipf value of 1 corresponds to words with frequencies of 1 per 100 million words, a Zipf value of 2 corresponds to words with frequencies of 1 per 10 million words, a Zipf value of 3 corresponds to words with frequencies of 1 per million words, and so on.”
One problem is that a word that never occurs in the corpus goes to -inf under the log transform. This is reportedly handled with the Laplace transformation (add-one smoothing), and the same source also says
“Rather than working with the raw frequency counts, one works with the frequency counts + 1”, so let's do that.
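The quoted definition and the "+ 1" adjustment can be sketched as follows. This is a simplified reading of the two quotes, not necessarily the paper's exact formula: it only adds 1 to the raw count, and the published formula may adjust the denominator as well. The function name `zipf_score` and the example numbers are assumptions.

```python
import math


def zipf_score(raw_count, corpus_size):
    """Zipf value: log10(frequency per million words) + 3.

    Assumption: 1 is added to the raw count so that words with a count
    of 0 get a finite value instead of -inf.
    """
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    freq_per_million = (raw_count + 1) / corpus_size * 1_000_000
    return math.log10(freq_per_million) + 3


# Hypothetical examples.
unseen = zipf_score(0, 1_000_000_000)  # word absent from a 1-billion-token corpus
common = zipf_score(999, 1_000_000)    # adjusted count 1000 in a 1-million-token corpus
print(unseen, common)
```

With the adjustment, an unseen word gets the lowest value on the scale for that corpus, so it can stay in the analysis instead of being dropped.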
0-freq word
Brysbert's