Awk switching output field separator depending on the line length

So I wrote myself a small awk script. Its purpose is to compute the sum of the last column while reading the file, and then to print each record next to the percentage that line contributes to the sum.

The code looks like this:

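#!/bin/awk -f

BEGIN {
        OFS = "\t"
}

# Store every line and its last field, and accumulate the total.
{
        lines[NR] = $0
        Distances[NR] = $NF
        Max_Distance += $NF
        MAX_LINES = NR
}

END {
        # i starts at 3, so the first two records are not printed.
        for (i = 3; i <= MAX_LINES; i++) {
                # Percentage of the total contributed by this line.
                increment = Distances[i] / Max_Distance * 100
                print length(lines[i]), lines[i], increment
        }
}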

Now, I applied this script to the attached files, whose lines always have the same number of fields but sometimes differ in the number of characters.

 476         23      281             268        0.0421744
 475         24      469             448        0.0426674
 474         25      147             141        0.0434187
 473         26       70              69        0.0445487
 472         27       68              61        0.0482006
 471         28       19              15        0.0504292
 470         29      315             303        0.0508844
 469         30      121              -4        0.0509563
 468         31      424             407        0.0511194
 467         32      189             180        0.0520713
 466         33       18              14        0.0531791
 465         34      117             107        0.0532455
 464         35       46              43        0.0538684
 463         36      179             173        0.0547426
 462         37      136             109        0.0550616
 461         38       42              38        0.058816
 460         39       13              10        0.0640171
 459         40      265             250        0.0648825
 458         41      120             111        0.064891
 457         42      118              99        0.0663346
 456         43      464             466        0.0671883
 455         44       31              28        0.0681487
 454         45      213             201        0.0700088
 453         46      -26             129        0.0711404
 452         47      185             160        0.0731869
 451         48       83              71        0.0735005
 450         49      104              -1        0.0736425
 449         50      346             330        0.0741638
 448         51      311             -20        0.0759164
 447         52      400             398        0.0767254
 446         53      374             358        0.0770171
 445         54      475             465        0.0774754
 444         55       90             -12        0.0809141
 443         56      -10             -14        0.0831925

And the output is somewhat unexpected:

56      474         25      147             141        0.0434187        2.05997
56      473         26       70              69        0.0445487        2.11359
56      472         27       68              61        0.0482006        2.28685
56      471         28       19              15        0.0504292        2.39258
56      470         29      315             303        0.0508844        2.41418
56      469         30      121              -4        0.0509563        2.41759
56      468         31      424             407        0.0511194        2.42533
56      467         32      189             180        0.0520713        2.47049
56      466         33       18              14        0.0531791        2.52305
56      465         34      117             107        0.0532455        2.5262
56      464         35       46              43        0.0538684        2.55575
56      463         36      179             173        0.0547426        2.59723
56      462         37      136             109        0.0550616        2.61237
55      461         38       42              38        0.058816 2.79049
56      460         39       13              10        0.0640171        3.03725
56      459         40      265             250        0.0648825        3.07831
55      458         41      120             111        0.064891 3.07872
56      457         42      118              99        0.0663346        3.14721
56      456         43      464             466        0.0671883        3.18771
56      455         44       31              28        0.0681487        3.23328
56      454         45      213             201        0.0700088        3.32153
56      453         46      -26             129        0.0711404        3.37521
56      452         47      185             160        0.0731869        3.47231
56      451         48       83              71        0.0735005        3.48719
56      450         49      104              -1        0.0736425        3.49393
56      449         50      346             330        0.0741638        3.51866
56      448         51      311             -20        0.0759164        3.60181
56      447         52      400             398        0.0767254        3.64019
56      446         53      374             358        0.0770171        3.65403
56      445         54      475             465        0.0774754        3.67578
56      444         55       90             -12        0.0809141        3.83892
56      443         56      -10             -14        0.0831925        3.94702

I say unexpected because in two of the lines the last field is separated from the rest of the line by only a blank space, whereas in all the other cases it is separated by a tab. My question now is: why is that, and how can I fix it?

Thanks in advance!

I expect all fields to be equally spaced, and that there is always a tab between the last column and the rest of the line. I have already tried using different kinds of print statements.

Remark: I think I have a workaround, but I would like to know why it behaves this way.

I do not understand why you say my question is very specific. We have a case where the output field separator is specified to be a tab. But then, in a few lines of the output, there is a space instead of a tab between $0 and increment. That is really unexpected behaviour.

There are 4 answers

jhnc (best answer)

Keeping just a few lines to illustrate:

56      463         36      179             173        0.0547426        2.59723
56      462         37      136             109        0.0550616        2.61237
55      461         38       42              38        0.058816 2.79049
56      460         39       13              10        0.0640171        3.03725
56      459         40      265             250        0.0648825        3.07831
55      458         41      120             111        0.064891 3.07872
56      457         42      118              99        0.0663346        3.14721
56      456         43      464             466        0.0671883        3.18771

Note that column 6 usually ends on a multiple of eight characters:

|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|.......|.......|.......|.......|.......|.......|.......|.......|.......|.......|
56      462         37      136             109        0.0550616        2.61237

However, the lines that display oddly finish one character short of this:

|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|.......|.......|.......|.......|.......|.......|.......|.......|.......|.......|
55      461         38       42              38        0.058816 2.79049

When a tab character is output, the program displaying it normally places the character that follows at the next "tab stop". Tab stops are often set every 4 or 8 columns, so with 8-wide stops the text resumes at column 9, 17, 25, etc. (a multiple of 8, plus 1). The tab always advances by at least one "space".

Because most of your lines end on a multiple of eight, the next character appears 8 "spaces" further along.

However, on the short lines, only one "space" is required to reach the tab stop.
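
The arithmetic can be checked with a small sketch. The function name tab_pad is made up for this illustration, and it assumes tab stops every 8 columns unless a second argument says otherwise. The column argument is 0-based: after the leading "56" and its tab, the stored line starts at column 8, so a 56-character line ends at column 64 and a 55-character line at column 63.

```shell
# tab_pad COL [WIDTH]: number of columns a tab occupies when it starts at
# 0-based column COL, with tab stops every WIDTH columns (default 8).
# tab_pad is a hypothetical helper, not part of awk or the shell.
tab_pad() {
    awk -v col="$1" -v w="${2:-8}" 'BEGIN { print w - (col % w) }'
}

tab_pad 64   # 56-character line: the tab spans 8 columns
tab_pad 63   # 55-character line: the tab spans 1 column
```

So the separator is the same tab in both cases; only the distance to the next tab stop differs.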

Romeo Ninov

One possible way is to use a formatted print command (printf). You can try it by replacing the line:

print length(lines[i]), lines[i], increment;

with

printf "%s %-60s %-20s\n", length(lines[i]), lines[i], increment;

Feel free to tune the field widths.
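
If the tab separator should be kept, a variation on the same idea is to pad the record to a fixed width before the tab, so the tab always starts at the same column. This is only a sketch: pad_then_tab is a hypothetical helper, the width 8 and the trailing "foo" are placeholders, and a record longer than the chosen width would still push the next field to a later tab stop.

```shell
# pad_then_tab: read records on stdin, left-justify each one in 8 columns
# (an assumed width), then append a tab and a placeholder field "foo".
pad_then_tab() {
    awk '{ printf "%-8s\t%s\n", $0, "foo" }'
}

# expand turns the tabs into spaces, which makes the alignment visible:
# "foo" starts at the same column for the 8-char and the 7-char record.
printf '12345678\n1234567\n' | pad_then_tab | expand
```

For the script in the question, the width would need to be at least the longest line length (56 in the sample output).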

Ed Morton

Let's pick the last 2 columns of one of the lines that is formatted as you want:

$ awk -v OFS='\t' 'BEGIN{ print "0.0550616", "2.61237"}'
0.0550616       2.61237

Now let's change the first string to digits so it's easy to see how many characters are in that string, and change the last one to "foo" to show its value is irrelevant:

$ awk -v OFS='\t' 'BEGIN{ print "123456789", "foo"}'
123456789       foo

You can see that the separator is a tab with:

$ awk -v OFS='\t' 'BEGIN{ print "123456789", "foo"}' | od -c
0000000   1   2   3   4   5   6   7   8   9  \t   f   o   o  \n
0000016

Now let's make that first string 1 char shorter:

$ awk -v OFS='\t' 'BEGIN{ print "12345678", "foo"}'
12345678        foo

and now 1 char shorter again:

$ awk -v OFS='\t' 'BEGIN{ print "1234567", "foo"}'
1234567 foo

You can again see that the separator is still a tab; it didn't somehow become a blank:

$ awk -v OFS='\t' 'BEGIN{ print "1234567", "foo"}' | od -c
0000000   1   2   3   4   5   6   7  \t   f   o   o  \n
0000014

That's because a tab advances the display to the next tab stop, which usually falls every 8 columns (that number is configurable). So if the first field is 8 chars long, the tab makes it look like 8 blanks were added before the next field, but if the field is 7 chars long, the tab makes it look like 1 blank was added. It's still a tab.
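
One way to see the effect of the tab-stop setting without touching the terminal is the POSIX expand utility, which replaces tabs with the spaces a display would show. The following is a minimal sketch; show_tab is a hypothetical helper name, and its first argument is the assumed distance between tab stops.

```shell
# show_tab WIDTH STRING: print STRING, a tab and "foo", with the tab
# expanded to spaces for tab stops every WIDTH columns.
show_tab() {
    printf '%s\tfoo\n' "$2" | expand -t "$1"
}

show_tab 8 1234567     # 1 space before foo
show_tab 8 12345678    # 8 spaces before foo
show_tab 4 12345678    # 4 spaces before foo
```

The input is identical apart from the string length; only the assumed tab width and the starting column change how wide the single tab appears.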

If you want tabular output (i.e. where all the values in each column are aligned) instead of tab-separated output then you can pipe the output to column to get that:

$ awk -v OFS='\t' 'BEGIN{ print "123456789", "foo" ORS "1234567", "bar"}'
123456789       foo
1234567 bar

$ awk -v OFS='\t' 'BEGIN{ print "123456789", "foo" ORS "1234567", "bar"}' | column -s $'\t' -t
123456789  foo
1234567    bar

karakfa

There is a built-in tool to handle table alignment; just pipe to

$ ... | column -t

and done.