## Relook at Label Bias Problem

**Rohit Joshi** rohit.joshi at nus.edu.sg *Tue Sep 27 10:37:00 SGT 2005*

- Previous message: [GraphReading] MEMM vs CRF
- Next message: [GraphReading] Relook at Label Bias Problem
**Messages sorted by:** [ date ] [ thread ] [ subject ] [ author ]

Hi,

In our discussion of the label bias problem, it was not clear to me why we need normalization in the case of MEMM training, while in the case of Bayes nets we don't. Here is my take; please see if it is correct.

In the case of an MEMM, the problem occurs at the branching point, because there we need to express P(A,B|X) rather than P(A|X) and P(B|X) separately, and P(A,B|X) is not equal to the product P(A|X)P(B|X). Hence we need to normalize: P(A,B|X) = P(A|X)P(B|X)/Z, where Z normalizes over the set {A, B}. I believe this product-of-conditionals requirement is what causes per-state normalization. So the conclusion that undirected graphs are better is not correct; the problem occurs only in the MEMM case, where the future probabilities must be taken into account too.

Let's take the same example of two sequences, ARIB and AROB, with different state distributions for R as in the example considered in the group: (NP, VP). At the branching point we want to express P(NP,VP|A), but due to training we only have P(NP|A) and P(VP|A). So we need to normalize the product P(NP|A)P(VP|A)/Z with respect to the set {NP, VP}. This means the whole branches RIB and ROB must follow the normalization requirement: if we write the potentials of the two edges labeled R as r1 and r2, the edge labeled I as i, the edge labeled O as o, and so on, then the distribution must satisfy the following constraints:

- r1 + r2 = 1
- r1·i + r2·o = 1
- r1·i·b + r2·o·b = 1

which forces i = o = b = 1.

The goal of MEMM decoding is then max(product of conditionals along ARIB, product of conditionals along AROB) = ARIB, since ARIB always stochastically dominates AROB.

So I think the problem is specific to the MEMM, where it needs to express a conditional with a *set* of variables as the consequent, e.g. P(Y,Z|X), while it only has P(Y|X) and P(Z|X) (due to training). A BN, in contrast, always expresses conditionals whose consequent is a single variable, e.g. P(Y|X), rather than a set of variables as in this case.
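The ARIB/AROB branching example can be sketched numerically. This is a minimal illustration, not anyone's actual model: the state names and the 0.6/0.4 split out of A are made up. The point it shows is that once each of the two R states has only one outgoing edge, per-state normalization forces that edge's probability to 1 (i = o = b = 1), so the observations after the branch cannot change which path wins.

```python
# States: start state A branches to R_NP or R_VP; each of those has a
# single successor (I or O respectively), which leads to B.
def transition(state, obs):
    """Per-state-normalized next-state distribution, MEMM-style.
    `obs` is the current observation; note it only matters at A."""
    if state == "A":
        # Branching point: r1 + r2 = 1. Assumed (made-up) training
        # numbers slightly favouring the NP branch.
        return {"R_NP": 0.6, "R_VP": 0.4}
    # One outgoing edge each => forced probability 1 (i = o = b = 1),
    # so the observation is ignored from here on: label bias.
    successor = {"R_NP": "I", "R_VP": "O", "I": "B", "O": "B"}
    return {successor[state]: 1.0}

def path_prob(path, observations):
    """Product of per-state conditionals along a state path."""
    p = 1.0
    for cur, nxt, obs in zip(path, path[1:], observations):
        p *= transition(cur, obs)[nxt]
    return p

np_path = ["A", "R_NP", "I", "B"]
vp_path = ["A", "R_VP", "O", "B"]
for obs in ("rib", "rob"):
    print(obs, path_prob(np_path, obs), path_prob(vp_path, obs))
# For both observation sequences the NP path scores 0.6 and the VP
# path 0.4: the decoder always prefers the same branch, i.e. ARIB
# stochastically dominates AROB whatever we observe.
```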
A BN requires merging the set of consequent variables into one variable. So a BN node may have multiple parents but only one child consequent variable, whereas in our case we have a set of children due to training on different sets. Hence we need to normalize with respect to the set of children.

A comment on expert (human expert) knowledge: experts in the biomedical domain are usually better at giving a conditional distribution, e.g. P(S|E), than a joint distribution P(S,E), where S denotes symptoms and E denotes evidence. We generally want to encode this in a BN, so the way we elicit probabilities from the expert is to adhere to the requirement of having one consequent rather than a set of consequents. That is to say, we deliberately avoid a set of variables as the consequent, e.g. P(S1,S2|E), because a BN cannot encode such a distribution directly. If experts do want to deal with P(S1,S2|E), we need to convert it into one variable S.

Rgds,
Rohit Joshi
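The merging step described above can be sketched as follows. All variable names and probabilities here are made up for illustration: an expert table P(S1,S2|E) over two binary symptoms is recoded as a table P(S|E) for a single node S whose values are pairs (s1, s2), which a BN can then hold directly.

```python
from itertools import product

# Hypothetical expert table P(s1, s2 | e) for binary S1, S2, E.
# The numbers are invented; each conditional column sums to 1.
expert = {
    # e = True
    (True, True, True): 0.5, (True, False, True): 0.2,
    (False, True, True): 0.2, (False, False, True): 0.1,
    # e = False
    (True, True, False): 0.1, (True, False, False): 0.2,
    (False, True, False): 0.2, (False, False, False): 0.5,
}

# Merge (S1, S2) into one variable S with four joint states, so the
# BN node has a single consequent variable with parent E.
cpt_S = {}
for e in (True, False):
    for s1, s2 in product((True, False), repeat=2):
        cpt_S[((s1, s2), e)] = expert[(s1, s2, e)]

# Each conditional column of the merged CPT still sums to 1 (up to
# floating-point rounding), so P(S|E) is a valid distribution.
for e in (True, False):
    total = sum(cpt_S[(s, e)] for s in product((True, False), repeat=2))
    print(e, total)
```

The merged node S has |S1| × |S2| states, so this conversion is only practical for small sets of consequents, which is consistent with eliciting one consequent at a time from the expert.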
