So I wrote myself a small awk script. Its purpose is to sum the last column ($NF) while reading the input, and then to print each record next to the percentage that the line contributes to the sum.
The code looks like this:
#!/bin/awk -f
#
BEGIN {
    OFS = "\t"
}
{
    lines[NR] = $0          # keep the whole record for the second pass
    Distances[NR] = $NF     # last field of the record
    Max_Distance += $NF     # running sum of the last field
    MAX_LINES = NR
    #print length($0)
}
END {
    for (i = 3; i <= MAX_LINES; i++) {
        increment = Distances[i] / Max_Distance * 100
        print length(lines[i]), lines[i], increment
    }
}
Now, I applied this script to the attached files, whose lines always have the same number of fields but sometimes differ in the number of characters.
476 23 281 268 0.0421744
475 24 469 448 0.0426674
474 25 147 141 0.0434187
473 26 70 69 0.0445487
472 27 68 61 0.0482006
471 28 19 15 0.0504292
470 29 315 303 0.0508844
469 30 121 -4 0.0509563
468 31 424 407 0.0511194
467 32 189 180 0.0520713
466 33 18 14 0.0531791
465 34 117 107 0.0532455
464 35 46 43 0.0538684
463 36 179 173 0.0547426
462 37 136 109 0.0550616
461 38 42 38 0.058816
460 39 13 10 0.0640171
459 40 265 250 0.0648825
458 41 120 111 0.064891
457 42 118 99 0.0663346
456 43 464 466 0.0671883
455 44 31 28 0.0681487
454 45 213 201 0.0700088
453 46 -26 129 0.0711404
452 47 185 160 0.0731869
451 48 83 71 0.0735005
450 49 104 -1 0.0736425
449 50 346 330 0.0741638
448 51 311 -20 0.0759164
447 52 400 398 0.0767254
446 53 374 358 0.0770171
445 54 475 465 0.0774754
444 55 90 -12 0.0809141
443 56 -10 -14 0.0831925
And the output is somewhat unexpected:
56 474 25 147 141 0.0434187 2.05997
56 473 26 70 69 0.0445487 2.11359
56 472 27 68 61 0.0482006 2.28685
56 471 28 19 15 0.0504292 2.39258
56 470 29 315 303 0.0508844 2.41418
56 469 30 121 -4 0.0509563 2.41759
56 468 31 424 407 0.0511194 2.42533
56 467 32 189 180 0.0520713 2.47049
56 466 33 18 14 0.0531791 2.52305
56 465 34 117 107 0.0532455 2.5262
56 464 35 46 43 0.0538684 2.55575
56 463 36 179 173 0.0547426 2.59723
56 462 37 136 109 0.0550616 2.61237
55 461 38 42 38 0.058816 2.79049
56 460 39 13 10 0.0640171 3.03725
56 459 40 265 250 0.0648825 3.07831
55 458 41 120 111 0.064891 3.07872
56 457 42 118 99 0.0663346 3.14721
56 456 43 464 466 0.0671883 3.18771
56 455 44 31 28 0.0681487 3.23328
56 454 45 213 201 0.0700088 3.32153
56 453 46 -26 129 0.0711404 3.37521
56 452 47 185 160 0.0731869 3.47231
56 451 48 83 71 0.0735005 3.48719
56 450 49 104 -1 0.0736425 3.49393
56 449 50 346 330 0.0741638 3.51866
56 448 51 311 -20 0.0759164 3.60181
56 447 52 400 398 0.0767254 3.64019
56 446 53 374 358 0.0770171 3.65403
56 445 54 475 465 0.0774754 3.67578
56 444 55 90 -12 0.0809141 3.83892
56 443 56 -10 -14 0.0831925 3.94702
I say unexpected because in two of the lines the last field is separated from the rest of the line by only a blank space, whereas in all the other cases it is separated by a tab. My question now is: why is that, and how can I fix it?
Thanks in advance!
I expect all fields to be equally spaced and that there is always a tab between the last column and the rest of the line. I have already tried using different kinds of print statements.
Remark: I think I have a workaround, but I would like to know why it behaves this way.
I do not understand why you say my question is very specific. We have a case where the output field separator is specified to be a tab, but in a few cases the output shows a space instead of a tab between the field $0 and increment. That is really unexpected behaviour.
Keeping just a few lines to illustrate:
Note that column 6 usually ends on a multiple of eight characters:
However, the lines that display oddly finish one character short of this:
When a tab character is output, the program displaying it normally moves the character that follows to the next "tab stop". Tab stops usually sit every 4 or 8 columns, so with a width of 8 the next character starts at column 9, 17, 25, and so on. A tab always advances by at least one column.
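The tab-stop arithmetic can be sketched in a few lines. This is only an illustration, assuming tab stops every 8 columns (the common terminal default); `tab_width` is a hypothetical helper name, not something from the original script.

```shell
# tab_width: hypothetical helper. Prints how many columns a tab occupies
# when it follows `len` characters of text, assuming tab stops every 8 columns.
tab_width() {
    awk -v len="$1" 'BEGIN { print 8 - (len % 8) }'
}

# Try a few text lengths just before and exactly on a tab stop.
for len in 5 7 8; do
    printf 'text of %d chars -> tab spans %d column(s)\n' "$len" "$(tab_width "$len")"
done
```

With these assumptions, 5 characters of text leave 3 columns for the tab, 7 characters leave only 1, and 8 characters push the tab to a full 8 columns.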
Because most of your lines end on a multiple of eight, the tab advances a full 8 columns before the next character appears.
However, on the short lines, only one column is needed to reach the tab stop, so the tab looks like a single space even though it is still a tab character.
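To check that the separator really is a tab, and to get alignment that does not depend on tab stops, something like the following may help. It is a minimal sketch: the sample records, the helper names `records` and `pad_records`, and the pad width of 10 are all made up for illustration, and the awk program is a cut-down stand-in for the original script.

```shell
# Hypothetical sample input: the first record is 8 characters long, the second 7.
records() {
    printf '%s\n' '12 34 56' '1 23 45'
}

# 1. Same idea as the original script: record, tab (OFS), last field.
#    `sed -n l` makes the tab visible as \t and marks each line end with $.
records | awk 'BEGIN { OFS = "\t" } { print $0, $NF }' | sed -n l

# 2. One possible alternative: pad the record to a fixed width with printf,
#    so the alignment no longer depends on tab stops. The width 10 is an
#    arbitrary choice that fits this sample.
pad_records() {
    awk '{ printf "%-10s%s\n", $0, $NF }'
}
records | pad_records
```

The first pipeline should show `\t` in both lines, including the shorter one. The second trades the tab for space padding, which suits output meant for reading but is less convenient if another tool later splits the output on tabs.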