I have a column in table1 that stores free text comments in it. Within these comments, we have lots of names that I need to redact, somehow.
We set up another table with possible names to be looked for (table2) and now we need to look for these names within table1 and replace either by a fixed string (we thought in something like "Redact" or "Suppressed Name", it doesn't matter now :)) or a value from the table2, where we could, basically, add that string to a second column.
Googling for a solution, I got to the code below, but it's not working so well, because when we have 2+ different names, it doesn't replace it, at all.
Does anyone have any idea?
table1 has about 800,000 rows and table2 has around 14,000 rows. When I run below code, it takes about 30 minutes to complete, which is not exactly a problem, since it'll a one shot. For the daily basis ingest, we'd have less than 1,000 rows.
Code that is partially working:
WITH RECURSIVE replaced AS (
(SELECT pkey,
investigation_comment,
replaced_names,
array_agg(source_names ORDER BY length(source_names) DESC, source_names, replaced_names) AS arrreplacefrom,
array_agg(replaced_names ORDER BY length(source_names) DESC, source_names, replaced_names) AS arrreplaceto,
count(1) AS intcount,
1 AS intindex
FROM table1, table2
WHERE table1.investigation_comment LIKE '%' || source_names || '%'
GROUP BY pkey,investigation_comment,replaced_names
)
UNION ALL
SELECT pkey,
investigation_comment,
replace(replaced_names, arrreplacefrom[intindex], arrreplaceto[intindex]) AS replaced_names,
arrreplacefrom,
arrreplaceto,
intcount,
intindex+1 AS intindex
FROM replaced
WHERE intindex<=intcount
)
SELECT pkey
,investigation_comment
,arrreplacefrom
,replace(investigation_comment,ARRAY_TO_STRING(arrreplacefrom,','),replaced_names) investigation_comment_amended
,ARRAY_TO_STRING(arrreplacefrom,',') strarrreplacefrom
,false as ignore_this_row
FROM replaced
GROUP BY pkey
,investigation_comment
,arrreplacefrom
,replaced_names
So, mock tables could be like that: Table1:
create table testtable1 (pkey text,investigation_comment text);
insert into testtable1
values ('a1','unhappy as last 4 packs havent been sealed propery')
,('a2','unhappy was displayed at £2 and charged £2.75')
,('a3','customer Jack not happy that he was told a few years ago that because he had forgot his Card he go back later and get the money back')
,('a4','Jack and Mary unhappy with PCN')
,('a5','one of the multi pack bags were open and stale')
,('a6','John is unhappy with the quality says the crisps are very small and full of black bits')
Table2:
create table testtable2 (source_names text,replaced_names text);
insert into testtable2
values ('Jack','Suppressed Name')
,('John','Suppressed Name')
,('Mary','Suppressed Name');
Expected Result:
________________________________________________________________________________________________________________________________________________________________
|pkey |__investigation_coment_________________________________________________________________________________________________________________________________|
|a1 | unhappy as last 4 packs havent been sealed propery |
|a2 | unhappy was displayed at £2 and charged £2.75 |
|a3 | customer Suppressed Name not happy that he was told a few years ago that because he had forgot his Card he go back later and get the money back |
|a4 | Suppressed Name and Suppressed Name unhappy with PCN |
|a5 | one of the multi pack bags were open and stale |
|a6 | Suppressed Name is unhappy with the quality says the crisps are very small and full of black bits |
|_______________________________________________________________________________________________________________________________________________________________|