CRF++: what do the float numbers in the CRF model file mean?

When you build your model file with crf_learn using the -t option: crf_learn template train_data -t model

it generates two model files, one of which is model.txt.

Can anybody tell me what the float numbers mean?

See the following example:

version: 100
cost-factor: 1
maxid: 40
xsize: 1

B
I

U00:%x[0,0]
B

36 B
20 U00:、
26 U00:か
18 U00:が
22 U00:こ
8 U00:た
10 U00:ち
2 U00:っ
4 U00:て
34 U00:に
12 U00:の
0 U00:よ
28 U00:ら
24 U00:れ
32 U00:上
14 U00:世
16 U00:代
30 U00:地
6 U00:私

-0.3022268562246992 0.3022268562246989 -0.3629407244093161 0.3629407244093156 -0.3327259487028221 0.3327259487028215 0.3462799099537973 -0.3462799099537980 0.3452020097664334 -0.3452020097664336 -0.3218750203631590 0.3218750203631575 0.0376944272290242 -0.0376944272290280 0.3329631783491211 -0.3329631783491230 -0.3092967308014029 0.3092967308014015 0.3413769126433928 -0.3413769126433950 0.3786782765859961 -0.3786782765859980 0.5208645073272351 -0.5208645073272384 -0.3261580548802839 0.3261580548802814 -0.3615756495615902 0.3615756495615884 -0.3248593224319323 0.3248593224319312 0.3281895709166696 -0.3281895709166719 -0.3040331359589971 0.3040331359589951 0.2836939567332580 -0.2836939567332600 -0.1530917919770705 -0.1613508585854637 0.4245699543724943 -0.1101273038099901

My understanding is that each float should correspond to one feature; for instance, the first float, "-0.3022268562246992", should correspond to "36 B". But why are there about twice as many floats as features? What do these floats mean?

Many thanks,

Shuai Hua

There is 1 answer

shuai (BEST ANSWER)

After reading parts of the CRF++ 0.58 source code, I now understand the crf_learn output. I will use some examples to explain it.

==== Basic ====

Let's assume we have the following training data:

毎  k   B
日  k   I
新  k   I  
聞  k   I
社  k   I
特  k   B
別  k   I 
顧  k   B
問  k   I

Our template is very simple and has only one line: U00:%x[0,0]

  1. The number of features in this case is 9, namely: 毎, 日, 新, 聞, 社, 特, 別, 顧, 問.
  2. Now keep the training data unchanged and add another rule to the template:

U00:%x[0,0]

U00:%x[-1,0]/%x[0,0]/%x[1,0]

Now the template has two rules, so the total number of features becomes 18, namely:

毎, 日, 新, 聞, 社, 特, 別, 顧, 問
../毎/日
毎/日/新
日/新/聞
新/聞/社
聞/社/特
社/特/別
特/別/顧
別/顧/問
顧/問/..

(Both rules of the template are applied at every token position; ".." stands for a position outside the sentence.)
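The counting above can be reproduced with a short sketch. This is not CRF++ code: the expand helper is hypothetical, and the ".." padding simply follows the notation used in the list above rather than whatever marker CRF++ writes internally.

```python
def expand(tokens, offsets, pad=".."):
    """Build one feature string per position by joining the tokens at the given offsets."""
    feats = []
    for i in range(len(tokens)):
        parts = []
        for off in offsets:
            j = i + off
            # Positions outside the sentence get the placeholder.
            parts.append(tokens[j] if 0 <= j < len(tokens) else pad)
        feats.append("/".join(parts))
    return feats

tokens = list("毎日新聞社特別顧問")          # first column of the training data
unigrams = set(expand(tokens, [0]))          # rule %x[0,0]
trigrams = set(expand(tokens, [-1, 0, 1]))   # rule %x[-1,0]/%x[0,0]/%x[1,0]

print(len(unigrams), len(trigrams), len(unigrams | trigrams))
```

Using sets makes the point that only distinct feature strings are counted.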

  3. Now let's add a duplicated word to the training data, as follows:
毎  k   B
毎  k   B
日  k   I
新  k   I  
聞  k   I
社  k   I
特  k   B
別  k   I 
顧  k   B
問  k   I

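The word "毎" now appears twice, but identical feature strings are counted only once, so U00:%x[0,0] still contributes 9 features. (Strictly speaking, the three-token rule now sees the contexts ../毎/毎 and 毎/毎/日 in place of ../毎/日, so with both rules the total becomes 19 rather than 18.)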
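A quick check of the duplicate-word case, as a minimal sketch that only looks at the single-token rule U00:%x[0,0]. The variable names are mine, not from CRF++.

```python
tokens = list("毎毎日新聞社特別顧問")   # training data with 毎 repeated

# One feature string per token; a set keeps each distinct string once.
unigram_features = {"U00:" + t for t in tokens}

print(len(tokens), len(unigram_features))
```

Ten tokens go in, but the repeated 毎 collapses into a single feature string.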

==== Advanced ====

Now let's see how to read the contents of "model.txt".

The blocks are separated by a blank line:

1. First block:

    version: 100
    cost-factor: 1
    maxid: 670
    xsize: 1

maxid depends on the number of features and the number of tags.

Using the first training data as an example (9 different words and two tags, B and I):

the ids start from 0 and go up in steps of 2: 0, 2, 4, ..., 16. Each feature occupies two slots, so maxid is 16 + 2 = 18. (In the model from the question, the last id is 36 for "B", which occupies 4 slots, so maxid is 40.)

Why is the step 2?

Because there are two tags, each word gets one slot per tag:

0 毎 ==> B
1 毎 ==> I

2 日 ==> B
3 日 ==> I 
...
16 問 ==> B
17 問 ==> I
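The id numbering can be sketched as follows. It assumes ids are handed out in order of first appearance and that maxid counts every weight slot (which is what the sample in the question suggests: its last id is 36, "B" takes 4 slots, and maxid is 40). The real CRF++ numbering order may differ; the names here are hypothetical.

```python
tags = ["B", "I"]
words = list("毎日新聞社特別顧問")

feature_id = {}
next_id = 0
for w in words:
    key = "U00:" + w
    if key not in feature_id:
        feature_id[key] = next_id
        next_id += len(tags)      # reserve one weight slot per tag

maxid = next_id                   # total number of weight slots

# Which (feature, tag) pair does each slot belong to?
slots = {fid + k: (key, tag)
         for key, fid in feature_id.items()
         for k, tag in enumerate(tags)}

print(feature_id["U00:問"], maxid, slots[1])
```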

2. Second block:

lists all the tags in the training data:

B

I

3. Third block:

lists all the templates used (here the unigram rule and "B", the bigram template over adjacent output tags):

U00:%x[0,0]

B

4. Fourth block:

the feature id, followed by the template prefix and the corresponding word:

0 U00:毎
2 U00:日
...
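The first four blocks can be read back with a few lines of Python. This is only a sketch: split_blocks is a hypothetical helper, and the sample text is a made-up, shortened model in the same layout, not real crf_learn output.

```python
def split_blocks(text):
    """Split model text into blocks of non-empty lines, using blank lines as separators."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

# Hypothetical, shortened model text (weights block omitted).
sample = """version: 100
cost-factor: 1
maxid: 8
xsize: 1

B
I

U00:%x[0,0]
B

4 B
0 U00:毎
2 U00:日
"""

header, tags, templates, dictionary = split_blocks(sample)
meta = dict(line.split(": ", 1) for line in header)
features = {name: int(fid) for fid, name in (l.split(" ", 1) for l in dictionary)}

print(meta["maxid"], tags, templates, features)
```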

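5. Fifth block:

the weights, one per (feature, tag) pair, listed in id order:

Each word feature has two weights, one for each tag, which is why there are about twice as many floats as features. The bigram feature "B" has one weight per pair of tags (2 x 2 = 4), which accounts for the last four floats.

A negative weight is not ignored; it lowers the score of that tag for the feature.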
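To make the mapping concrete, here is a sketch that looks up floats by feature id using the numbers from the question. The helper names are mine. For the bigram feature I assume the four values are laid out with the previous tag varying slowest; I believe that matches CRF++, but treat it as an assumption.

```python
tags = ["B", "I"]
features = {"U00:よ": 0, "U00:っ": 2, "B": 36}   # a few entries from the question's dictionary

weights = [float(x) for x in """
-0.3022268562246992 0.3022268562246989 -0.3629407244093161 0.3629407244093156
-0.3327259487028221 0.3327259487028215 0.3462799099537973 -0.3462799099537980
0.3452020097664334 -0.3452020097664336 -0.3218750203631590 0.3218750203631575
0.0376944272290242 -0.0376944272290280 0.3329631783491211 -0.3329631783491230
-0.3092967308014029 0.3092967308014015 0.3413769126433928 -0.3413769126433950
0.3786782765859961 -0.3786782765859980 0.5208645073272351 -0.5208645073272384
-0.3261580548802839 0.3261580548802814 -0.3615756495615902 0.3615756495615884
-0.3248593224319323 0.3248593224319312 0.3281895709166696 -0.3281895709166719
-0.3040331359589971 0.3040331359589951 0.2836939567332580 -0.2836939567332600
-0.1530917919770705 -0.1613508585854637 0.4245699543724943 -0.1101273038099901
""".split()]

def unigram_weights(name):
    """One weight per tag, stored at id, id+1, ..."""
    base = features[name]
    return {tag: weights[base + k] for k, tag in enumerate(tags)}

def bigram_weights(name="B"):
    """One weight per (previous tag, current tag) pair; layout order is an assumption."""
    base = features[name]
    n = len(tags)
    return {(prev, cur): weights[base + i * n + j]
            for i, prev in enumerate(tags)
            for j, cur in enumerate(tags)}

print(len(weights))
print(unigram_weights("U00:よ"))
print(bigram_weights())
```

So the first float belongs to feature id 0 (U00:よ with tag B), not to "36 B", whose four weights sit at the end of the list.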

- Shuai Hua