Extract everything between 2 strings in snowflake using regular expression

133 views Asked by At

I have a json like below stored in col1 of a snowflake table table1 from which I want to extract the dump portion into a column dump in a snowflake table table2. I am trying to use regular expression to do it using regexp_substr but am not able to form the correct pattern.

TABLE1.COL1
------------
...
"c_n": "abc",
  "dump": {
    "d_k": "c_s_s",
    "d_v": "{\"abc\":null,\"efg\":3,\"mrc\":3,\"er\":3,\"ff\":{\"ie\":false},\"slm\":false,\"sld\":false,\"lang\":\"hi\"}",
    "date": "20001111112720",
    "fs": "i_t",
    "id": 0000,
    "i_c": null,
    "i_l": null,
    "prof": 0,
    "s_t": "2000-09-11 10:27:20.450924",
    "s_i": "abc57fhg",
    "src": "test_*"
  },
  "ec": "p-ap-p-cl",
...

The logic I am applying is to extract everything in between "dump": and ,"ec" so that I can get below output:

TABLE2.DUMP
------------
{
    "d_k": "c_s_s",
    "d_v": "{\"abc\":null,\"efg\":3,\"mrc\":3,\"er\":3,\"ff\":{\"ie\":false},\"slm\":false,\"sld\":false,\"lang\":\"hi\"}",
    "date": "20001111112720",
    "fs": "i_t",
    "id": 0000,
    "i_c": null,
    "i_l": null,
    "prof": 0,
    "s_t": "2000-09-11 10:27:20.450924",
    "s_i": "abc57fhg",
    "src": "test_*"
  }

Can someone please help in building the correct regular expression for this use case as I am fairly new with using regexp family of functions. Many thanks in advance!

2

There are 2 answers

0
demircioglu On

The example data is not a valid json so json parsing will not work if the source data is formatted like the example.

This can be achieved with SUBSTRING and CHARINDEX assuming "ec" always follows "dump"

WITH table1 (col1) AS (
SELECT * FROM VALUES
('{"c_n": "abc",
  "dump": {
    "d_k": "c_s_s",
    "d_v": "{\"abc\":null,\"efg\":3,\"mrc\":3,\"er\":3,\"ff\":{\"ie\":false},\"slm\":false,\"sld\":false,\"lang\":\"hi\"}",
    "date": "20001111112720",
    "fs": "i_t",
    "id": 0000,
    "i_c": null,
    "i_l": null,
    "prof": 0,
    "s_t": "2000-09-11 10:27:20.450924",
    "s_i": "abc57fhg",
    "src": "test_*"
  },"ec": "p-ap-p-cl"
  }'))
)
SELECT SUBSTRING(col1,
            -- position of dump
            CHARINDEX('"dump": {', col1) + 8, 
            -- length of the string between dump and ec
            NULLIF(CHARINDEX(',"ec":', col1), 0) - (CHARINDEX('"dump": {', col1) + 8) 
        ) AS dump_section
FROM table1

The above SQL assumes the line with ec is formatted as ,ec per the question even though provided sample doesn't follow this.

0
Simeon Pilgrim On

So if I alter the json, to not have the "prior" and "later" rowss, and escape the slashes, so the SQL parser correctly leaves those in the stringified "d_v"

and thus can parse this as JSON, I can then just "access" the JSON element, and not assume the JSON which does not define order of member as important, will just work (tm):

select 
    parse_json($1) as col1
    ,col1:dump as dump
from values
('{"c_n": "abc",
  "dump": {
    "d_k": "c_s_s",
    "d_v": "{\\"abc\\":null,\\"efg\\":3,\\"mrc\\":3,\\"er\\":3,\\"ff\\":{\\"ie\\":false},\\"slm\\":false,\\"sld\\":false,\\"lang\\":\\"hi\\"}",
    "date": "20001111112720",
    "fs": "i_t",
    "id": 0000,
    "i_c": null,
    "i_l": null,
    "prof": 0,
    "s_t": "2000-09-11 10:27:20.450924",
    "s_i": "abc57fhg",
    "src": "test_*"
  },
  "ec": "p-ap-p-cl"
  }')

dump comes out, just as expected:

enter image description here