Pearson, Spearman, Zipf, Rfreq

Raw Frequency


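  • The number of times a word occurs among all words (tokens) in the corpus.
  • Because it is a plain count over the whole corpus, its range is very wide, so the raw values are not well suited to analyses such as ANOVA.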
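
As a minimal sketch of the definition above, raw frequency is just a token count. The toy corpus and the helper name `raw_frequency` are assumptions made for illustration; they are not from any particular dataset or library.

```python
from collections import Counter

def raw_frequency(tokens):
    """Count how many times each token occurs in the corpus."""
    return Counter(tokens)

# Toy corpus (hypothetical data, for illustration only).
tokens = "the cat sat on the mat the end".split()
raw = raw_frequency(tokens)

print(raw["the"])  # 3
print(raw["cat"])  # 1
print(raw["dog"])  # 0 (a Counter returns 0 for unseen words)
```

In a real corpus these counts span several orders of magnitude, which is the range problem noted above.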

Relative Frequency


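  • The raw frequency above, normalized by corpus size:
    • raw_freq / corpus_size.
    • Since this is only a linear transform, it rescales the values without changing their relative spacing, so it can hardly be regarded as a meaningful transformation of the data.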
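
A small sketch of the normalization above, assuming that corpus size means the total number of tokens. The helper name `relative_frequency` and the toy corpus are hypothetical.

```python
from collections import Counter

def relative_frequency(tokens):
    """Return raw_freq / corpus_size for every word in the corpus."""
    counts = Counter(tokens)
    corpus_size = len(tokens)  # assumption: corpus size = number of tokens
    return {word: count / corpus_size for word, count in counts.items()}

# Toy corpus (hypothetical data, for illustration only).
tokens = "the cat sat on the mat the end".split()
rel = relative_frequency(tokens)

print(rel["the"])  # 0.375 (3 occurrences out of 8 tokens)
```

Every value is divided by the same constant, so the ordering of the words and the ratios between them are identical to those of the raw counts.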

Zipf Score


From the SUBTLEX-UK paper

According to "SUBTLEX-UK: A new and improved word frequency database for British English", a word-frequency scale should satisfy the following conditions:

Qualification for scale of word frequency

  1. It should be a logarithmic scale (e.g., like the decibel scale of sound loudness).
  2. It should have relatively few points, without negative values (e.g., like a typical Likert rating scale, from 1 to 7).
  3. The middle of the scale should separate the low-frequency words from the high-frequency words.
  4. The scale should have a straightforward unit.

Zipf Scale

“To meet the last requirement, we propose to call the new scale the Zipf scale, after the American linguist George Kingsley Zipf (1902–1950) who first thoroughly analysed the regularities of word-frequency distribution and formulated a law (Zipf, 1949), which was later named after him.”

“The calculation of Zipf values is easy as it equals log10(frequency per billion words) or log10(frequency per million words) + 3. So, a Zipf value of 1 corresponds to words with frequencies of 1 per 100 million words, a Zipf value of 2 corresponds to words with frequencies of 1 per 10 million words, a Zipf value of 3 corresponds to words with frequencies of 1 per million words, and so on.”
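
The quoted formula can be sketched as follows. The function name `zipf_value` and its arguments are assumptions for illustration; the input is a raw count plus the corpus size in tokens, converted to a frequency per million words before the log is taken.

```python
import math

def zipf_value(raw_freq, corpus_size):
    """Zipf value = log10(frequency per million words) + 3.

    This equals log10(frequency per billion words).
    """
    if raw_freq <= 0:
        # log10 is undefined for 0, so unseen words need separate handling.
        raise ValueError("raw_freq must be positive")
    per_million = raw_freq * 1_000_000 / corpus_size
    return math.log10(per_million) + 3

# 1 occurrence per million words corresponds to a Zipf value of 3.
print(zipf_value(1, 1_000_000))  # 3.0
```

By the same arithmetic, 1 occurrence per 10 million words gives a value of about 2 and 1 per 100 million gives about 1, matching the examples in the quotation.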

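One problem is that a word that never occurs in the corpus has a count of 0, and its log transform goes to -inf. This is reportedly handled with a Laplace transformation (add-one smoothing). The paper also says:

“Rather than working with the raw frequency counts, one works with the frequency counts + 1”, so let's do it this way.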
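
A minimal sketch of the "+ 1" approach, under stated assumptions: the function name `zipf_laplace` and the optional `n_types` parameter are hypothetical. Standard add-one smoothing also adds the number of distinct word types to the denominator; whether to do so is left as a choice here, with the default `n_types=0` adjusting only the count.

```python
import math

def zipf_laplace(raw_freq, corpus_size, n_types=0):
    """Zipf value computed from (raw_freq + 1) instead of raw_freq.

    n_types is an optional denominator adjustment (assumption: the
    number of distinct word types); 0 leaves the corpus size as-is.
    """
    smoothed_count = raw_freq + 1
    denominator = corpus_size + n_types
    per_million = smoothed_count * 1_000_000 / denominator
    return math.log10(per_million) + 3

# A word that never occurs now gets a finite value instead of -inf.
print(zipf_laplace(0, 1_000_000))  # 3.0
```

With this smoothing, an unseen word receives the value that a word occurring once would get without smoothing, so the scale stays free of -inf and of errors from taking the log of 0.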

0-freq word


Brysbaert's