Regex in Google Sheets to extract outline values

63 views Asked by At

Want to extract outline values in a google sheet, using a formula

Tried Regex that will work with Arrayformula(IFERROR(REGEXEXTRACT(RANGE,"REGEX")))

almost there but I'm thinking someone has the solution or an easier way to do this?

1(a) Background
1(a)(i) Historical context
1(a)(i)(A) Early influences
1(a)(i)(A)(1) Roman civilization
IIIV.
AA. Background
1.1 Historical context
34 Historical context
a. Early influences
i. Roman civilization
- Background
- Historical context
We are
I. Roman civilization
II. Romans
IV. 
IV. I am sad
iii

Expected output

1(a)
1(a)(i)
1(a)(i)(A)
1(a)(i)(A)(1)
IIIV.
AA.
1.1
34
a.
i.
-
-

I.
II.
IV.
IV.
iii

This seems to work but everything gets split over multiple columns, Removing some of the parenthesis from capture groups to prevent split resulted in not extracting "-"

Arrayformula(IFERROR(REGEXEXTRACT(RANGE,"REGEX")))

Regex:

^(([A-Za-z]|\d+)((\([A-Za-z\d]+\))|(\.\d+))*\.?|\-)\s+([^\n]+)

Possible solutions?

  1. concatenating the output Arrayformula([working CONCATENATE formula for variable columns](IFERROR(REGEXEXTRACT(RANGE,"REGEX")))

  2. I think the capture groups are needed for the regex to work, if thats true and if the output with google sheets by default uses a split() any ()into a new column, then perhaps using a substitute to replace any "(" or ")" with ~ prior to the regexextract and then after extracting these values

Then I was thinking to just use something like Regexreplace(Range," .+","") but this would extract out the first word of a sentence.

3

There are 3 answers

0
rockinfreakshow On BEST ANSWER

Can you test:

=arrayformula(ifna(regexextract(A:A&"","^.*?[^a-zA-Z](?: |$)"),ifna(regexextract(A:A&"","^[a-z]*(?: |$)"))))

enter image description here

1
The God of Biscuits On

Doubtless there's a pure regex approach to your problem, but (assuming I'm understanding your question properly), I think you could just use choosecols to get the last column of the regexextract 2D array:

=arrayformula(choosecols(iferror(regexextract(RANGE,"^(([A-Za-z]|\d+)((\([A-Za-z\d]+\))|(\.\d+))*\.?|\-)\s+([^\n]+)")),-1))
0
z.. On

Try this out:

=ARRAYFORMULA(REGEXEXTRACT(A2:A14,"(?:^| )([\w ]+$)"))

enter image description here

Explanation

(?:^| ) - match the beginning of the string or a space without capturing it

([\w ]+$) - match and capture a word character \w (equivalent to [A-Za-z0-9_]) or a space one or more times from the end $ of the string