awk find/print paragraph containing multiple patterns


Request:

Extract blocks of text that contain 2 or more search terms, something akin to [ AND ] logical operator in [ awk ].

Preferably run as awk in bash/zsh function (but also ok with standalone awk script), accepting input/parameter in regex style:

[ A|B|C ] = return blocks that contain either 'A' or 'B' or 'C'

[ A&B&C ] = return blocks that contain ALL 'A' and 'B' and 'C'

Context: Blocks are separated by at least 5 new lines.

Extra: Highlight search matches.

Input

Given [ veganPackage.txt ] input file:

1. Fruits 
Apple
Banana
Honey
   - tasty combo but too many sugars
   - Low prep time
   - bad for teeth, cavity warning 




2. Drinks
Apple Juice
   - served cold and ripe

Add Kiwi
   - peel first

Banana Smoothie
   - tastes good
   - fast power up



3. Veggies
Frillice
Cucumber 
Tomato



Want

Input                   Blocks to print           Colorize words
Apple|Banana            Fruits, Drinks            Apple, Banana
Apple|Banana|Frillice   Fruits, Drinks, Veggies   Apple, Banana, Frillice
Apple&Banana            Fruits, Drinks            Apple, Banana
Apple&Tomato            nothing                   nothing
Kiwi&Banana             Drinks                    Kiwi, Banana (only in Drinks)

Tried

Bash function

Named as [ searchBlock ]

searchBlock ()
{
...

awk \
  -v RS='\n{4}' \
  -v ORS='\n***\n***\n' \
  -v color=$colorOut \
  -v colorReset=$colorReset \
  -v search=$(echo "$searchTerm" | perl -pe 's/(?<!\\)&+/\/&&\//g and s/^/\//g and s/(.)(?=$)/\1\//g') searchTerm \
  -v searchAND=$(echo $searchTerm | perl -pe 's/&+/|/g') '$0~search{gsub(searchAND,color"&"colorReset);print}' $file |
  vim - -c "/$searchTerm" \
        -c ':AnsiEsc' \
        -c 'highlight ColorReverse gui=reverse cterm=reverse' \
        -c ":match ColorReverse /$searchTerm/"
}

Example Call as: searchBlock -s 'Apple|Banana' veganPackage.txt

Rationale:

  • if OR pattern as [ | ] in input, do regular match
  • if AND pattern as [ & ] in input, preserve [ | ] for colorizing, but change to [ && ] for pattern matching
  • feed operand as part of parameter

Bottleneck

  1. If I manually feed '/Apple/&&/Kiwi/{gsub(/(Apple|Kiwi)/,color"&"colorReset);print}' veganPackage.txt, then the output is as expected:

***
***

2. Drinks
Apple Juice
   - served cold and ripe

Add Kiwi
   - peel first

Banana Smoothie
   - tastes good
   - fast power up

***
***

However, with '$0 ~ search{gsub(searchAND,color"&"colorReset);print}', the [ AND ] pattern fails: nothing is printed, or the output is not what I filtered/searched for (the highlighting/coloring is correct, though).

  • $0 ~ search = for every block that contains pattern in awk [ search ] variable,

  • {gsub(searchAND,color"&"colorReset);print} = print global substitution of searched text surrounded by ANSI Escape sequences

    • [ & ] inside the double-quoted replacement string stands for the matched text (NOT to be confused with [ && ] in the AND-pattern match for awk)

It seems that $0 ~ /Apple/&&Kiwi/ does NOT cooperate with me.

Tests

Input             Code fragment   Result           Expect
Apple|Kiwi        $0~search       Fruits, Drinks   Fruits, Drinks
Apple&Kiwi        $0~search       nothing          Drinks
Apple|Kiwi        search          entire file      Drinks, Fruits
Apple&Kiwi        search          entire file      Drinks
Apple&&Kiwi       search          entire file      Drinks
/Apple/&&/Kiwi/   $0~search       nothing          Drinks
Apple&&Kiwi       $0~search       nothing          Drinks

There is 1 answer

Answer by markp-fuso:

Focusing solely on awk and the use of dynamically generated regexes ...
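A minimal sketch of why the string form fails, assuming the search text arrives in a `-v` variable: the right-hand side of `~` is always treated as a regex, never as awk code, so slashes and `&&` are matched as literal characters. The function name `demo_dynamic_regex` and the two sample lines are made up for illustration.

```shell
# Hypothetical demo: feed two fixed lines to awk and filter with a string regex.
demo_dynamic_regex() {
    # $1 = string handed to awk as the variable "search"
    printf '%s\n' 'Apple and Kiwi' 'literal /Apple/&&/Kiwi/ text' |
        awk -v search="$1" '$0 ~ search'
}

# The slashes and ampersands are plain regex characters here, so only the
# line that literally contains "/Apple/&&/Kiwi/" is printed.
demo_dynamic_regex '/Apple/&&/Kiwi/'

# "|" is regex syntax, so alternation does work from a string: both lines print.
demo_dynamic_regex 'Apple|Kiwi'
```

This matches the question's test table: OR works with `$0~search`, while every `&` variant prints nothing.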

Assumptions:

  • you're able (in bash) to parse/reformat the various inputs into formats that are acceptable to awk

General approach:

  • pass the gsub() regex in as a -v variable=value clause
  • pass the search regex by piecing together strings to build (on-the-fly) the awk script
  • we'll simplify the gsub() code to bracket the matches with a pair of underscores (__); OP can incorporate the color codes later

A simple data set for demonstration purposes:

$ cat simple.dat
line_1 - Apple
line_2 - Banana
line_3 - Kiwi Cherry Apple
line_4 - Apple Kiwi
line_5 - Kiwi

We'll use a bash for loop to test a few different search regexes (the gsub() regex is the same for all 3 search regexes):

for search_regex in "/Apple/ && /Kiwi/" "/Apple|Kiwi/" "/Apple/ || /Kiwi/"
do
    # print the regex under test (passed as an argument, not spliced into the format string)
    printf "\n########## search : %s\n\n" "${search_regex}"

    awk -v gsub_regex="Apple|Kiwi" '
    BEGIN { filler=ignore }
    '"${search_regex}"' { gsub(gsub_regex,"__&__") }
    1
    END   { filler=ignore }
    ' simple.dat

done

NOTE: here's where I'm assuming OP can parse the various input formats into one of these formats for the two regex variables
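One possible way to do that parsing, as a rough sketch only: two small helpers that turn the `A&B` input style from the question into the two strings used above. The names `to_awk_pattern` and `to_gsub_regex` are hypothetical, and the sketch assumes the search terms contain no `/` characters. Because the result is spliced into the awk program, it should only be fed trusted input.

```shell
# Hypothetical helpers (not part of the original answer).

# Apple&Kiwi  ->  /Apple/ && /Kiwi/      Apple|Kiwi  ->  /Apple|Kiwi/
to_awk_pattern() {
    printf '%s\n' "$1" | sed 's#&\{1,\}#/ \&\& /#g; s#^#/#; s#$#/#'
}

# Apple&Kiwi  ->  Apple|Kiwi   (alternation used only for highlighting)
to_gsub_regex() {
    printf '%s\n' "$1" | sed 's/&\{1,\}/|/g'
}

# Usage: build both strings, then splice the pattern into the awk script.
pattern=$(to_awk_pattern 'Apple&Kiwi')
hl=$(to_gsub_regex 'Apple&Kiwi')
printf '%s\n' 'Apple Kiwi' 'Apple only' |
    awk -v gsub_regex="$hl" "$pattern"' { gsub(gsub_regex, "__&__"); print }'
```

Only the first sample line contains both terms, so only that line is printed, with both terms bracketed.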

Where:

  • 1st part of awk script: 'BEGIN { filler=ignore } '
  • 2nd part of awk script: "${search_regex}" ; must be wrapped in double quotes
  • 3rd part of awk script: ' { gsub(gsub_regex,"__&__") } 1; END { filler=ignore }'
  • there must not be any white space between the single quotes (1st/3rd parts) and the double quotes (2nd part) (ie, '"${search_regex}"')
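If splicing shell text into the awk program feels fragile, a different technique (not the one used in this answer) is to pass the terms as an ordinary variable and test them one by one inside awk. The sketch below assumes terms are joined by `&`; the function name `match_all` is hypothetical. Unlike the loop above, it prints only the matching records.

```shell
# Hypothetical helper: print records that match ALL "&"-joined terms.
match_all() {
    # $1 = terms joined by "&" ; remaining args = input files (stdin if none)
    local terms=$1
    shift
    awk -v terms="$terms" '
    BEGIN {
        n = split(terms, t, /&+/)          # t[1..n] = individual regexes
        alt = terms
        gsub(/&+/, "|", alt)               # Apple&Kiwi -> Apple|Kiwi for highlighting
    }
    {
        for (i = 1; i <= n; i++)
            if ($0 !~ t[i]) next           # one term missing: skip this record
        gsub(alt, "__&__")                 # bracket every match
        print
    }' "$@"
}

# Usage with lines taken from simple.dat:
printf '%s\n' 'line_1 - Apple' 'line_3 - Kiwi Cherry Apple' 'line_5 - Kiwi' |
    match_all 'Apple&Kiwi'
```

A single term containing `|` still works as an OR inside that term, since each piece is used as a regex.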

Taking for a test drive:

########## search : /Apple/ && /Kiwi/

line_1 - Apple
line_2 - Banana
line_3 - __Kiwi__ Cherry __Apple__
line_4 - __Apple__ __Kiwi__
line_5 - Kiwi

########## search : /Apple|Kiwi/

line_1 - __Apple__
line_2 - Banana
line_3 - __Kiwi__ Cherry __Apple__
line_4 - __Apple__ __Kiwi__
line_5 - __Kiwi__

########## search : /Apple/ || /Kiwi/

line_1 - __Apple__
line_2 - Banana
line_3 - __Kiwi__ Cherry __Apple__
line_4 - __Apple__ __Kiwi__
line_5 - __Kiwi__