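Let

$$p_{\text{model}}(y \mid x, w) = f_w(x)^{y}\,\bigl(1 - f_w(x)\bigr)^{1-y},$$

then we obtain

$$\begin{aligned}
\hat{w}_{\text{ML}} &= \arg\max_w \sum_i \log p_{\text{model}}(y_i \mid x_i, w) \\
&= \arg\max_w \sum_i \log\bigl[f_w(x_i)^{y_i}\,\bigl(1 - f_w(x_i)\bigr)^{1-y_i}\bigr] \\
&= \arg\min_w \sum_i \bigl[-y_i \log f_w(x_i) - (1 - y_i)\log\bigl(1 - f_w(x_i)\bigr)\bigr].
\end{aligned}$$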
In other words, we minimize the binary cross-entropy (BCE) loss.
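
As a quick numerical check of this equivalence, the sketch below compares the summed BCE loss with the negative log of the Bernoulli likelihood. The arrays `p` (standing in for the model outputs f_w(x_i)) and `y` (the labels y_i) are made-up values, and the helper names `bernoulli_likelihood` and `bce_loss` are hypothetical, chosen only for this illustration.

```python
import numpy as np

def bernoulli_likelihood(p, y):
    """Per-sample likelihood p^y * (1 - p)^(1 - y), with p playing the role of f_w(x)."""
    return p ** y * (1.0 - p) ** (1.0 - y)

def bce_loss(p, y):
    """Summed binary cross-entropy: sum_i -y_i*log(p_i) - (1 - y_i)*log(1 - p_i)."""
    return np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))

# Made-up predictions and binary labels, for illustration only.
p = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1.0, 0.0, 1.0, 0.0])

# Negative log-likelihood of the whole (assumed i.i.d.) sample.
neg_log_likelihood = -np.sum(np.log(bernoulli_likelihood(p, y)))

# The two quantities agree up to floating-point error.
print(np.isclose(bce_loss(p, y), neg_log_likelihood))
```

Since the two values coincide, maximizing the likelihood over w and minimizing the BCE loss over w select the same parameters.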
Remark: The last layer of $f_w(x)$ can be a sigmoid function, so that $f_w(x) \in [0,1]$ and the output can be read as a probability.
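
To illustrate the remark, here is a minimal sketch of a model whose last layer is a sigmoid. The linear score `x @ w` is an assumption made only for this example (the notes do not specify the architecture), and the weights and inputs are made-up values.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, mapping any real score into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def f_w(x, w):
    """Hypothetical model: a linear score followed by a sigmoid last layer."""
    return sigmoid(x @ w)

# Made-up weights and inputs, for illustration only.
w = np.array([0.5, -1.0])
x = np.array([[2.0, 1.0],
              [-3.0, 4.0],
              [0.0, 0.0]])

probs = f_w(x, w)

# Every output is a valid Bernoulli parameter.
print(np.all((probs > 0.0) & (probs < 1.0)))
```

Because the outputs stay strictly between 0 and 1 for moderate scores, both logarithms in the BCE loss are finite. For very large scores the floating-point result can round to exactly 0 or 1, which is why practical implementations often compute the loss directly from the pre-sigmoid score.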