ToDo

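Outlier handling.
- Phonetic feature ranges that fall outside typical values.
Run the analysis after fully separating the data by vowel.
Do not look only at the converging participants; switch back to the full population.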

Data preprocessing

Summary

participant_id : participant number

file_id : file name

word_label : experiment stimulus (word)

vowel_label : vowel within the stimulus

duration : vowel duration

gender : participant gender

F1_mid_point : F1 at mid_time of the participant's vowel production

F2_mid_point : F2 at mid_time of the participant's vowel production

F1_mid_point_z : F1_mid_point z-scored within participant (Lobanov normalization)

F2_mid_point_z : F2_mid_point z-scored within participant (Lobanov normalization)
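
The within-participant z-scoring can be sketched as follows. This is a minimal illustration, not the original pipeline: the helper name `lobanov_z` and the toy values are hypothetical, rows are assumed to be dicts carrying `participant_id` and the formant column, and the sample standard deviation is an assumption (the notes do not say which SD was used).

```python
from collections import defaultdict
from statistics import mean, stdev

def lobanov_z(rows, key):
    """Return z-scores of rows[key], standardised within each participant."""
    by_participant = defaultdict(list)
    for row in rows:
        by_participant[row["participant_id"]].append(row[key])
    # Per-participant mean and sample SD over all of that participant's tokens.
    stats = {pid: (mean(v), stdev(v)) for pid, v in by_participant.items()}
    z_scores = []
    for row in rows:
        m, s = stats[row["participant_id"]]
        z_scores.append((row[key] - m) / s)
    return z_scores

# Hypothetical toy data: two participants with different overall F1 levels.
rows = [
    {"participant_id": 1, "F1_mid_point": 300.0},
    {"participant_id": 1, "F1_mid_point": 500.0},
    {"participant_id": 1, "F1_mid_point": 700.0},
    {"participant_id": 2, "F1_mid_point": 400.0},
    {"participant_id": 2, "F1_mid_point": 600.0},
    {"participant_id": 2, "F1_mid_point": 800.0},
]
f1_z = lobanov_z(rows, "F1_mid_point")
```

After normalization both participants map onto the same scale even though their raw F1 levels differ, which is what makes distances to the model speaker comparable across speakers.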

stage (39,200 rows) : experiment block

stage2 (4,900 rows) : pre-task : visual word only

stage4 (14,700 rows) : test1 block : audio stimuli only

stage5 (14,700 rows) : test2 block : audio stimuli + image stimuli

stage6 (4,900 rows) : post-task : visual word only

actual_list : the speaker identity information actually shown to the participant in the test2 block

perceived_list : the list the participant reported in the survey after the post-task

AI_like, Human_like : ratings collected in the post-task

dist_z, dist_vl : $L_2$ distance measured from the model speaker's values

$DID_{ab}$ : dist_z at stage $a$ minus dist_z at stage $b$ (e.g. DID42 = stage4 minus stage2)
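
A minimal sketch of how `dist_z` and $DID_{ab}$ fit together, assuming `dist_z` is the Euclidean distance between a participant token and the model speaker in normalized (F1, F2) space. The function names and toy coordinates are hypothetical, and how tokens are paired across stages is not specified in these notes.

```python
import math

def l2_distance(token, model):
    """Euclidean (L2) distance between two (F1_z, F2_z) points."""
    return math.dist(token, model)

def did(dist_a, dist_b):
    """DID_ab: distance at stage a minus distance at stage b."""
    return dist_a - dist_b

# Hypothetical values in z-scored formant space.
model_speaker = (0.0, 0.0)
pre_task_token = (3.0, 4.0)   # stage 2
test1_token = (0.6, 0.8)      # stage 4

did42 = did(
    l2_distance(test1_token, model_speaker),
    l2_distance(pre_task_token, model_speaker),
)
```

Here the participant ends up closer to the model speaker in stage 4 than in stage 2, so `did42` is negative; a negative DID is read as convergence throughout these notes.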

EDA

Note

Below are typical formant values by gender, age, and vowel.

from Corner vowels in males and females aged 4 to 20 years: Fundamental and F1-F4 formant frequencies(Houri K. Vorperian and Raymond D. Kent)


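from 연령 및 성별에 따른 한국인 단모음 포먼트 비교에 관한 연구 (2013) [gloss: a comparative study of Korean monophthong formants by age and gender]

The markers use the Korean reference values. With the non-Korean reference values the discrepancy was somewhat larger.
Below, all raw data are plotted.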


Question

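There seem to be too many outliers, but the box regions are nearly the same as the reference. So should I average within participant first and then draw the boxplot?

Below, each participant's formants were first collapsed to a single value per vowel before plotting; the ranges do not overlap at all.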


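→ Aren't the commonly reported mean formant values by gender and vowel averages over all tokens?
→ If so, our ranges overlap well, so the extraction is fine. Averaging within participant first changes the values completely, so keeping those outliers seems to be the right direction.

Collapsing within participant changes the values too much, which is probably why it is not done, and the plotted ranges shift so much that such plots are probably not drawn either. Yet plotting the raw data shows, as seen above, too many outliers.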

Summary

No change to the pre-processing is needed.

The max_formant change described below had already been applied before the step above.

Warning
Even on direct inspection, that F1 range is strange, even for male speakers.

You can see that F4 and F5 were not detected. The formant tracks really were shifted by one slot each.
# Burg-method formant tracking; max_formant is the gender-specific ceiling
formant = audio_segment.to_formant_burg(
	time_step = params['time_step'],
	max_number_of_formants = params['max_formants'],
	maximum_formant = max_formant,
	window_length = params['window_length'],
	pre_emphasis_from = params['pre_emphasis_from']
)
The previous default values were:
self.default_params = {
	'time_step': 0.00625,
	'max_formants' : 5,
	'window_length' : 0.025,
	'pre_emphasis_from' : 50.0,
	'male_max_formant' : 5000,
	'female_max_formant': 5500
}
burg-to_formant

Important

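First, redo formant detection on the files with the higher sampling rate.
The recordings were made at 44.1 kHz, but I had been using the separate 16 kHz files produced by running KFA.
Retried detection on the original files.
→ The results did not change, so the parameters were modified.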

Summary
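The parameters were modified as below.
I tried different values of max_formant in the Python code above.
Number of tokens with f1 > 1000 in the (F, M) groups for each max_formant pair (ordered male, female, as in default_params):

(5000, 5500) : (254, 732)

(5500, 6000) : (230, 321)

(6000, 6500) : (211, 119)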
self.default_params = {
	...
	'male_max_formant' : 5500,
	'female_max_formant': 6000
}
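These values were chosen by inspecting previously mis-detected tokens in Praat, varying the setting on those samples only, and keeping the value that appeared to track the formants well.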

Check

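Revised the formant processing.
Previously the value at the temporal midpoint of the vowel was used, but that seemed unstable, so following Babel the mean over the central 50% of the vowel duration is used instead.
It makes little difference.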
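
The central-50% averaging can be sketched as below. This is an illustration under assumptions: the helper name `mid_half_mean`, the millisecond time stamps, and the toy F1 track are hypothetical, and the formant track is assumed to be available as parallel lists of sample times and values.

```python
from statistics import mean

def mid_half_mean(times, values, start, end):
    """Average the formant samples that fall in the central 50% of [start, end]."""
    quarter = (end - start) / 4
    lo, hi = start + quarter, end - quarter
    window = [v for t, v in zip(times, values) if lo <= t <= hi]
    if not window:
        raise ValueError("no formant samples inside the central 50% window")
    return mean(window)

# Hypothetical F1 track sampled every 10 ms over an 80 ms vowel.
times_ms = [0, 10, 20, 30, 40, 50, 60, 70, 80]
f1_track = [900.0, 700.0, 610.0, 600.0, 590.0, 600.0, 620.0, 750.0, 950.0]

f1_mid50 = mid_half_mean(times_ms, f1_track, start=0, end=80)
f1_midpoint = f1_track[len(f1_track) // 2]  # single-sample alternative
```

The window drops the first and last quarter of the vowel, where transitions into neighbouring consonants tend to distort the track, while smoothing over frame-to-frame jitter that a single midpoint sample would keep.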

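On inspection, some of the most extreme outliers (F1 above 1700) were productions that differed from the expected pronunciation.

For example, there were cases where 밀사 (milsa) was pronounced as 실사 (silsa), so that an /s/ sound entered the segment and F1 was measured as high.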

Success

Following Babel, outliers are handled with a 3 SD criterion within each (gender, vowel) group.
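
A minimal sketch of the 3 SD rule within each (gender, vowel) group. The helper name `flag_outliers` and the toy data are hypothetical, and the notes do not say whether flagged tokens were dropped or replaced, so the sketch only flags them.

```python
from collections import defaultdict
from statistics import mean, stdev

def flag_outliers(rows, value_key, n_sd=3.0):
    """Return one boolean per row: True if the value lies more than
    n_sd sample SDs from the mean of its (gender, vowel_label) group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["gender"], row["vowel_label"])].append(row[value_key])
    bounds = {}
    for key, vals in groups.items():
        # stdev needs at least two tokens per group.
        m, s = mean(vals), stdev(vals)
        bounds[key] = (m - n_sd * s, m + n_sd * s)
    flags = []
    for row in rows:
        lo, hi = bounds[(row["gender"], row["vowel_label"])]
        flags.append(not (lo <= row[value_key] <= hi))
    return flags

# Hypothetical group: 20 ordinary tokens and one extreme F1 value.
values = [490.0, 510.0] * 10 + [1700.0]
rows = [{"gender": "M", "vowel_label": "a", "F1_mid_point": v} for v in values]
flags = flag_outliers(rows, "F1_mid_point")
kept = [row for row, bad in zip(rows, flags) if not bad]
```

Because the extreme token itself inflates the group SD, the rule only catches it when the group is reasonably large; with very few tokens per group no value can exceed 3 SD.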

Summary

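Phonetic convergence is observed in the following places.

pre-task → task1

No difference by gender was found, but an overall tendency to converge was confirmed. For male participants, however, either the data were insufficient for statistical confidence or the variance was large.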

task1 → task2

task2 → post-task

Analysis 1. Is phonetic convergence observed in the whole group?

Summary
Plotting the degree of convergence per participant for the whole group as a histplot:
analysis_df.dropna(subset=['DID42'], inplace=False).groupby(['participant_id'])['DID42'].mean().describe()
The histogram uses one value per participant for 49 participants, so count = 49.

Summary

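from statsmodels.regression.mixed_linear_model import MixedLM

# Does the full sample converge overall?
formula = 'DID42 ~ 1' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.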
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID42'], inplace=False),
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()

See Linear Mixed Effects Model.

Convergence is significant! The mean is roughly -0.042.

by gender

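Averaging the DID values per participant and viewing them as histplots by gender:

[Figures: per-participant mean DID histplots, panels Male and Female, each with its describe() output]

After trimming with the 3 SD criterion: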

Important

For now, gender clearly seems to be involved in the convergence tendency (the mean is on the negative side and the distribution is skewed toward the negative region).

Question

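Looking at the data, my thought is that to detect a main effect of block, shouldn't the model be run on participants who show convergence in the first place?
What we want to see is whether people who first converge then diverge in response to the AI cue.
And, as shown, some of the men did not converge well in the first place, hmm.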

Summary
# Is there a gender difference?
formula = 'DID42 ~ C(gender)' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID42'], inplace=False),
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()
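The intercept, i.e. the group coded as the baseline case, comes out as significantly convergent. → Women converge.

The male term is not significant. Its coefficient is positive, so men tend to diverge relative to women, but not significantly.

So no difference by gender was found: one could say both converge? But this seems somewhat at odds with the histplot results above.

→ Split the data by gender and run a regression on each.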


Male

# Male participants only: do they converge?
formula = 'DID42 ~ 1' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID42'], inplace=False)[male_mask],
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()

Female

# Female participants only: do they converge?
formula = 'DID42 ~ 1' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID42'], inplace=False)[female_mask],
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()

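Alternatively, to keep the statistical power of the full sample, fix the intercept to 0 to remove the baseline, so that each gender gets its own coefficient.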

Summary

# Does each gender converge? (no intercept: one coefficient per gender)
formula = 'DID42 ~ 0 + C(gender)' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID42'], inplace=False),
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()

Women converge; men do not.
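
The two formula codings used above answer different questions, which a toy calculation can illustrate. The sketch below ignores the random effects entirely (in the actual mixed model the estimates will shift), and the DID values are made up for illustration only.

```python
from statistics import mean

# Hypothetical per-token DID values by gender (not real data).
did_by_gender = {
    "F": [-0.10, -0.06, -0.08],
    "M": [-0.02, 0.01, -0.02],
}

# 'DID42 ~ 0 + C(gender)': one coefficient per gender.
# Each coefficient is that gender's mean and is tested against zero,
# i.e. "does this gender converge?"
cell_means = {g: mean(v) for g, v in did_by_gender.items()}

# 'DID42 ~ C(gender)': the intercept is the baseline (F) mean and the
# gender coefficient is the M-minus-F difference,
# i.e. "do the genders differ from each other?"
intercept = cell_means["F"]
male_contrast = cell_means["M"] - cell_means["F"]
```

A non-significant `male_contrast` (no evidence that the genders differ) and a non-significant male cell mean (no evidence that men converge) can therefore coexist without contradicting each other.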

Summary

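Women converge significantly, but men do not. → This contrasts with the regression above that included gender as a predictor!

Considering power, the gender regression above is more trustworthy.

So the write-up should say:
No difference by gender was found, but an overall tendency to converge was confirmed. For male participants, however, either the data were insufficient for statistical confidence or the variance was large.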

by vowel

Summary

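# Does each vowel converge? (vowel is now a fixed effect, so its variance component is commented out)
formula = 'DID42 ~ 0 + C(vowel_label)' # regression formula
vc_formula = { # variance component formula
	#"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.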
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID42'], inplace=False),
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()

i and u converge highly significantly.
Babel reported that ae converges well; was that in an interaction with gender?

gender X vowel

Summary

# gender x vowel: which cells converge?
formula = 'DID42 ~ 0 + C(vowel_label):C(gender)' # regression formula
vc_formula = { # variance component formula
	#"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID42'], inplace=False),
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()

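Converging conditions:
- F
  - a, ae, i
- M
  - i, u

Using masks, compute the gender effect within each vowel separately, as below.

The p-value can decrease.
- $SE = \frac{SD}{\sqrt{N}}$
- This holds only when fitting on a subsample of similar data reduces SD by a larger factor than $\sqrt{N}$.
- There will also be cases where it does not.
The data size shrinks, so power will drop, though.
- Also, the results above show more converging conditions.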
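
A small worked example of the standard-error trade-off, using the usual formula $SE = SD / \sqrt{N}$. The numbers are hypothetical and only illustrate the arithmetic.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: SD divided by the square root of N."""
    return sd / math.sqrt(n)

# Full data set: large N but heterogeneous (large SD).
se_full = standard_error(sd=1.0, n=400)

# Homogeneous subsample: a quarter of the data, so sqrt(N) halves,
# but SD falls by a factor of 2.5, so the SE still shrinks.
se_subset = standard_error(sd=0.4, n=100)

# Subsample whose SD only falls by a factor of 1.25: the SE grows.
se_subset_noisy = standard_error(sd=0.8, n=100)
```

With these values the homogeneous subsample has a smaller standard error than the full data set, while the noisier subsample has a larger one, which is the "only in some cases" caveat from the list above.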

[Per-vowel results, one panel per vowel: a, ae, i, o, u]

Summary

convergence:

F

a, i

M

Analysis 2. Does the AI / Human information cue affect convergence?

Summary
The goal is to compare the lists, so a baseline level is kept in the intercept.
# Do the lists differ from task1 to task2 (DID54)?
formula = 'DID54 ~ C(actual_list)' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID54'], inplace=False),
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()
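A simple between-condition comparison shows no significant difference.
The coefficients do differ, though, so this may be a sample size issue.

To test whether each list converges on its own instead, remove the baseline from the intercept:

'C(actual_list)' → '0 + C(actual_list)'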

Warning
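The problem here is that before task2, i.e. up to the task1 block, there is no difference in the information provided, so the lists should be homogeneous. In the actual participant data, however, the participants who did not converge at all are clustered in one group. So to really observe convergence → divergence, we must first filter for those who converged and track only them.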
analysis_df.dropna(subset=['DID42'], inplace=False).groupby(['actual_list', 'gender'])['DID42'].describe()
Could the DID computation be wrong…?

The computation is fine. There really was a group imbalance, purely by chance.

The male participants randomly assigned to the human list were biased.
# Check whether the lists already differ before task2. The lists have not branched yet, so they should not differ.
formula = 'DID42 ~ C(actual_list)' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID42'], inplace=False),
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()
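In that case, the regression above includes people who did not converge at all, so it should not be interpreted as is.
→ Set a convergence criterion and rerun the regression on those participants only.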

Question

Convergence criterion: DID is negative when going from pre-task to task1 (DID42).
Each participant produced many tokens and so has many DID values; the criterion is that their mean is negative.
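
The criterion can be sketched as a per-participant filter. This is an illustration only: the helper name `converging_participants` and the toy rows are hypothetical, rows are assumed to be dicts, and missing DID values are assumed to be skipped (mirroring the `dropna` calls in the models).

```python
from collections import defaultdict
from statistics import mean

def converging_participants(rows, did_key="DID42"):
    """Return the participant_ids whose mean DID is negative."""
    per_participant = defaultdict(list)
    for row in rows:
        value = row.get(did_key)
        if value is not None:  # skip tokens with a missing DID
            per_participant[row["participant_id"]].append(value)
    return {pid for pid, vals in per_participant.items() if mean(vals) < 0}

# Hypothetical toy data: participants 1 and 3 converge on average, 2 does not.
rows = [
    {"participant_id": 1, "DID42": -0.20},
    {"participant_id": 1, "DID42": 0.05},
    {"participant_id": 2, "DID42": 0.10},
    {"participant_id": 2, "DID42": -0.02},
    {"participant_id": 3, "DID42": None},
    {"participant_id": 3, "DID42": -0.01},
]
keep = converging_participants(rows)

# Row-level boolean mask, analogous to the convergence_mask used in the models.
convergence_mask = [row["participant_id"] in keep for row in rows]
```

Note that the mask is defined per participant, so a converging participant keeps all of their tokens, including individual tokens with a positive DID.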

Check

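Filtering by the criterion above, the surviving participants break down by list and gender as below.

Each participant has 300 DID42 values. So only 4 participants are assigned to the human male class. → That sample is too small.
Only 33 of the 49 participants (67%) converged.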

Summary
# Check whether each list changes between task1 and task2.
formula = 'DID54 ~ 0 + C(actual_list)' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID54'], inplace=False)[convergence_mask],
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()
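DID is the response variable of the regression and no baseline case is absorbed into the intercept, so each coefficient's null hypothesis can be checked on its own. In other words, each coefficient asks: did this list change relative to the previous stage?

First, neither the AI nor the Human case reaches statistical significance, but the coefficients suggest that Human diverges more.

Also, for AI the p-value is very large, so it is reasonable to read this as the convergence from the previous stage being maintained unchanged.

The next question is then: did presenting the label make a significant difference to convergence?
→ Keep a baseline level in the intercept, so that the coefficient tests the difference between lists.
# Check whether the lists differ from each other between task1 and task2.
formula = 'DID54 ~ C(actual_list)' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.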
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID54'], inplace=False)[convergence_mask],
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()
That is not the case either.

Question

Could an interaction be hiding the effect when we look at the sample as a whole?

by gender x actual_list

Summary

# Check each list x gender cell for a change between task1 and task2.
formula = 'DID54 ~ 0 + C(actual_list):C(gender)' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID54'], inplace=False)[convergence_mask],
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()

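None of the conditions is statistically significant, but:
- In the AI condition, women diverge and men converge.
- In the Human condition, women converge and men diverge.
  It is interesting that men and women show different patterns.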

by list x gender x vowel

Summary

# Check each list x gender x vowel cell for a change between task1 and task2.
formula = 'DID54 ~ 0 + C(actual_list):C(gender):C(vowel_label)' # regression formula
vc_formula = { # variance component formula
	#"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID54'], inplace=False)[convergence_mask],
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()

Converging conditions
- M, AI,
  - ae, i, o
- F, Human, i
- F, AI, o
  It is interesting that men and women show different patterns.

Analysis 3. Do participants diverge again in the post-task block?

Analysis 3-1. pre-task to post-task(!!)

Summary
Look at the difference between lists.
Only participants who showed convergence are included.
# Check each list for a change from pre-task to post-task (DID62).
formula = 'DID62 ~ 0 + C(actual_list)' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID62'], inplace=False)[convergence_mask],
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()
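Among the participants who showed convergence:

For the AI list, some of the similarity gained after interacting with the voice remains (statistically significant),

whereas for the Human list, participants seem to have returned to their original pronunciation after interacting with the voice? (not statistically significant)

For this part specifically, it is a bit of a shame that the metric is an L2 (Euclidean) distance,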

by gender x list

Summary
Look at the difference between lists.
Only participants who showed convergence are included.
# Check each list x gender cell for a change from pre-task to post-task (DID62).
formula = 'DID62 ~ 0 + C(actual_list):C(gender)' # regression formula
vc_formula = { # variance component formula
	"vowel": "0 + C(vowel_label)",
	"word": "0 + C(word_label)"
	}
	
re_formula = '~1' # random-effects formula (random intercept only); '~C(vowel_label)' was excluded because the model did not converge.
 
model = MixedLM.from_formula(formula,
	data=analysis_df.dropna(subset=['DID62'], inplace=False)[convergence_mask],
	groups='participant_id',
	vc_formula=vc_formula,
	re_formula=re_formula
	).fit()
	
model.summary()
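None of the cells is statistically significant, but:

For the AI list the coefficient is negative regardless of gender, so the convergence effect still remains in the post-task.

For the Human list, women diverge again (though the coefficient is also very small).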

Juhyeon's Blog


Analysis - 4
