Replace whole column with a new value in all multiple lines

Question

Replace whole column with a new value in all multiple lines

155 views Asked by Sam At 14 November 2023 at 14:50

I have the following lines in a file

$ cat test.txt
69|1074330570|1,sip:+121345633210x3Bverstat=TN-Validation-Passed|tel:+12134565534|0
69|1077822111|2,;tel:+2223120011~sip:[email protected];|sip:[email protected]|0
69|1077988012|1,sip:+121510016070x3Bverstat=TN-Validation-Passed|tel:+136965339510x3Bnpd|0

I want to replace the third and fourth columns in the file with just phone numbers as following:

69|1074330570|2134563321|2134565534|0
69|1077822111|2223120011|3123120022|0
69|1077988012|2151001607|3696533951|0

The good part is that all files will have a '+' in third and fourth columns. Now the difficult part is that sometimes we get 11 digits as noticed in the first line third column (121345633210) and sometimes there is no '1' added after a '+'. So if there is a 1 after the plus sign then exclude 1. Otherwise, start the length after the '+' sign. Similarly, if there are 11 digits, just substring 10. Also, if there are more than 1 numbers like line 2, I need to pick up the first number after the '+' sign only like 2223120011 and not 2223120051.

I tried the following

awk -F"|"  'BEGIN {FS="+"}{print substr($3,2,10)}' test.txt

But it is giving incorrect value for second line since it needs to start after the '+' sign.

Original Q&A

There are 4 answers

anubhava On 14 November 2023 at 15:19

You may use this awk solution:

awk '
function extphone(s) {
   sub(/^[^+]*\+1?/, "", s)  # remove all text before + then + and optional 1
   return substr(s, 1, 10)   # extract first 10 characters now
}
BEGIN {FS=OFS="|"}
{
   $3 = extphone($3)
   $4 = extphone($4)
} 1' file

69|1074330570|2134563321|2134565534|0
69|1077822111|2223120011|3123120022|0
69|1077988012|2151001607|3696533951|0

Ed Morton On 14 November 2023 at 15:33

With GNU awk for gensub():

$ awk '
    BEGIN { FS=OFS="|" }
    {
        for ( i=3; i<=4; i++ ) {
            $i = gensub(/[^+]+\+1?([0-9]{10}).*/,"\\1",1,$i)
        }
        print
    } 
' test.txt
69|1074330570|2134563321|2134565534|0
69|1077822111|2223120011|3123120022|0
69|1077988012|2151001607|3696533951|0

or with any awk:

$ awk '
    BEGIN { FS=OFS="|" }
    {
        for ( i=3; i<=4; i++ ) {
            if ( match($i,/\+1?[0-9]{10}/) ) {
                $i = substr($i,RSTART+RLENGTH-10,10)
            }
        }
        print
    }
' test.txt
69|1074330570|2134563321|2134565534|0
69|1077822111|2223120011|3123120022|0
69|1077988012|2151001607|3696533951|0

RavinderSingh13 On 14 November 2023 at 23:16

With your shown samples please try following GNU awk code.

awk '
match($0,/^([^|]*\|[^|]*\|)[^+]*\+1*([0-9]+)[^|]*(\|)[^+]*\+1*([0-9]+)[^|]*(\|.*$)/,arr){
  sub(/0$/,"",arr[2])
  sub(/0$/,"",arr[4])
  print arr[1] arr[2] arr[3] arr[4] arr[5]
}
'  Input_file

Explanation: Adding detailed explanation for above code.

Using GNU awk's match function to use regex in it, which creates capturing groups and later on we can use those for getting required output.
Using regex ^([^|]*\|[^|]*\|)[^+]*\+1*([0-9]+)[^|]*(\|)[^+]*\+1*([0-9]+)[^|]*(\|.*$) inside match function.
If a match of regex is found then removing 0 at last of 2nd and 4th capturing group by using sub here.
Then printing all 5 capturing groups to get required output.

**markp-fuso** · Accepted Answer · 2023-11-14T15:14:28+00:00

Issues with the current code:

-F"|" and FS="+" both define the input field delimiter so, technically, you should pick one or the other but not both; in this case FS="+" takes precedence (ie, -F"|" is ignored)
since the overall requirement is to 'replace' the 3rd and 4th columns you will want to keep the -F"|", but also define it as the output field delimiter (recommend: BEGIN { FS=OFS="|"}
then look at further parsing the 3rd/4th fields with some combination of awk string functions (eg, split(), match(), index(), substr())

Assumptions/Understandings:

3rd/4th fields always have a +
the string after the + is always a 10-digit or 11-digit number
if the 3rd/4th fields have more than one + then we're only interested in the first + (eg, we don't have to further parse based on tel or sip)
it appears that we're dealing with North American phone number formats (eg, we don't have to worry about the different phone number formats of other countries)

One awk idea:

awk '
BEGIN { FS=OFS="|" }
      { for (i=3; i<=4; i++) {                         # loop through 3rd and 4th fields
            split($i,a,"+")                            # split on "+"
            d1 = substr(a[2],1,1)                      # get 1st digit after the "+"
            $i = substr(a[2],(d1==1 ? 2 : 1),10)       # redefine ith field based on value of 1st digit
        }
      }
1                                                      # print current line
' test.txt

NOTES:

this logic is hardcoded based on the assumptions we only have to deal with North American phone number formats (10-digit/11-digit, country code = 1)

This generates:

69|1074330570|2134563321|2134565534|0
69|1077822111|2223120011|3123120022|0
69|1077988012|2151001607|3696533951|0

TechQA.

Replace whole column with a new value in all multiple lines

There are 4 answers

Related Questions in REGEX

Related Questions in LINUX

Related Questions in AWK

Related Questions in REPLACE

Related Questions in SUBSTR

Popular Questions

Trending Questions