Frequency

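Put every token, segmented at the word level, into a single list.
Use the pandas value_counts method to get the frequency of each word.
Some corpora do not report raw counts but a rate per fixed number of words, so these need to be put on a common scale.
- e.g. occurrences per million words → unified under the name RFreq.
- SUBTLEX provides frequency per million words, whereas HAL provides the raw counts as they are.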
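
As a minimal sketch of the counting step, assuming the corpus has already been segmented into words: the example below uses `collections.Counter` from the standard library as a stand-in for the method named above, and both the token list and the name `count_frequencies` are made up for illustration.

```python
from collections import Counter

def count_frequencies(tokens: list[str]) -> dict[str, int]:
    """Return raw occurrence counts for each word in a token list."""
    return dict(Counter(tokens))

# Hypothetical, already-segmented tokens (not taken from any real corpus).
tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "end"]
freq = count_frequencies(tokens)
print(freq["the"], freq["cat"])
```

These are raw counts, so they still depend on corpus size and are not yet comparable across corpora.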

Relative Frequency(RFreq)

Computed by converting a word's relative frequency within the corpus to a per-million-token scale.
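
In other words, RFreq = count × 1,000,000 / total tokens. The sketch below shows that conversion; the function name `to_rfreq` and the numbers are hypothetical and only illustrate the arithmetic.

```python
def to_rfreq(count: int, total_tokens: int) -> float:
    """Convert a raw count into occurrences per million tokens."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    # Multiply first so that round numbers stay exact in floating point.
    return count * 1_000_000 / total_tokens

# Hypothetical numbers: a word seen 250 times in a 5,000,000-token corpus.
rfreq = to_rfreq(250, 5_000_000)
print(rfreq)
```

A source that already reports per-million values can be used as is, while a source that reports raw counts also needs the total token count of its corpus.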

Orthographic N(Neighborhood)

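The number of words that can be formed from the target word by substituting exactly one letter.
- e.g. cat → cap, cut, hat, etc…
Defined using the distance function from the Python Levenshtein library.
- An insertion or a deletion changes the word length, so requiring Levenshtein distance = 1 together with equal word length leaves a single substitution as the only possible edit, which is exactly the definition of orthographic N.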

from Levenshtein import distance  # edit-distance function

def orthographic_N(word: str, lexicon: list[str]) -> int:
    """Count the orthographic neighbors of word in lexicon."""
    # With the length held equal, the only edit Levenshtein allows at distance 1 is a substitution.
    # distance == 1 also excludes the word itself (distance 0).
    return sum(1 for w in lexicon if len(w) == len(word) and distance(w, word) == 1)
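
The equivalence noted above can also be checked without the library: for two words of equal length, a Levenshtein distance of 1 is the same as differing at exactly one position. The sketch below counts neighbors that way; the names `differs_by_one_substitution` and `orthographic_n_stdlib` and the toy lexicon are hypothetical.

```python
def differs_by_one_substitution(a: str, b: str) -> bool:
    """True when a and b have equal length and differ at exactly one position."""
    if len(a) != len(b):
        return False
    return sum(ca != cb for ca, cb in zip(a, b)) == 1

def orthographic_n_stdlib(word: str, lexicon: list[str]) -> int:
    """Count lexicon entries that are one substitution away from word."""
    return sum(1 for w in lexicon if differs_by_one_substitution(word, w))

# Hypothetical toy lexicon, for illustration only.
lexicon = ["cat", "cap", "cut", "hat", "cart", "at", "dog"]
neighbors = [w for w in lexicon if differs_by_one_substitution("cat", w)]
print(neighbors, orthographic_n_stdlib("cat", lexicon))
```

Here "cart" and "at" are rejected by the length check and "cat" itself by the one-difference requirement, so only "cap", "cut" and "hat" are counted.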

OLD20

The mean Levenshtein distance from the target word to the 20 words in the lexicon that are closest to it.
Defined using the distance function from the Python Levenshtein library.

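import numpy as np
from Levenshtein import distance  # edit-distance function

def OLD20(word: str, lexicon: list[str]) -> float:
    """Mean Levenshtein distance from word to its 20 closest words in lexicon."""
    # Skip the target itself; its distance of 0 would otherwise pull the mean down.
    dists = sorted(distance(word, w) for w in lexicon if w != word)[:20]
    return np.mean(dists).item()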
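
For reference, here is a self-contained sketch of the same measure using only the standard library. It rests on a few assumptions: the edit distance is a plain dynamic-programming version rather than the library function, the target word is left out of its own neighbor set, and the names `levenshtein` and `old_k`, the parameter `k`, and the toy lexicon are all hypothetical.

```python
import heapq

def levenshtein(a: str, b: str) -> int:
    """Edit distance with insertions, deletions and substitutions (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]

def old_k(word: str, lexicon: list[str], k: int = 20) -> float:
    """Mean distance from word to its k closest other words in lexicon."""
    # Assumption: the target itself is skipped, so its distance of 0 is not averaged in.
    closest = heapq.nsmallest(k, (levenshtein(word, w) for w in lexicon if w != word))
    if not closest:
        raise ValueError("lexicon has no other words")
    return sum(closest) / len(closest)

# Hypothetical toy lexicon; k=3 keeps the arithmetic easy to follow.
lexicon = ["cat", "cap", "cut", "cart", "dog"]
print(old_k("cat", lexicon, k=3))
```

`heapq.nsmallest` keeps only the k smallest distances instead of sorting the full list, which matters little for a toy lexicon but helps when the lexicon is large. Unlike orthographic N, the measure also counts insertions and deletions, so "cart" is at distance 1 from "cat".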

Juhyeon's Blog

KE Corpus Preprocessing(Freq, Ortho_N, OLD20)
