Regex result contains extra match/group with just a return

I would like to match everything between start and end given the following string:

const test = `this is the start
a
b
c
e
f
end
g
h
`;

I have the following regex

const output = test.match(/start((.|\n)*)end/m);

Now output[0] contains the whole matched string (including start and end), output[1] is everything between start and end, and output[2] is only a newline (\n).

What I don't understand is where the second match/group (output[2]) comes from. Any suggestions?

4

There are 4 answers

1
Pointy On BEST ANSWER

This part of your regular expression, ((.|\n)*), creates two capturing groups. The outer group captures the entire run of characters matched by the repeated inner group. The inner group is overwritten on every repetition, so it ends up holding only the last single character it matched.

Note that you'd probably be better off with a slightly different regular expression to avoid the odd effect of collecting too many characters in the groups before backtracking takes over.
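As a minimal sketch of the two groups, the snippet below runs the question's pattern and then a variant where the inner group is made non-capturing with (?:...). The variable names are only illustrative, and the /m flag is left out because it has no effect on this pattern.

```javascript
// Same input as in the question, written with explicit \n escapes.
const input = "this is the start\na\nb\nc\ne\nf\nend\ng\nh\n";

// Two pairs of capturing parentheses produce two groups.
const twoGroups = input.match(/start((.|\n)*)end/);
console.log(twoGroups.length);              // 3: full match, outer group, inner group
console.log(JSON.stringify(twoGroups[2]));  // "\n", the last character the inner group matched

// (?:...) groups without capturing, so only the outer group remains.
const oneGroup = input.match(/start((?:.|\n)*)end/);
console.log(oneGroup.length);               // 2
console.log(JSON.stringify(oneGroup[1]));   // "\na\nb\nc\ne\nf\n"
```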

1
Andrew Parks On

As Pointy mentioned, you have two capturing groups.

If you use a named capturing group, this makes the code easier to write and understand:

const test = `this is the start
a
b
c
e
f
end
g
h
`;

console.log(test.match(/start(?<between>(.|\n)*)end/m).groups.between)
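A hedged variation on the same idea: match() returns null when nothing matches, so reading .groups directly would throw. The sketch below assumes a runtime with optional chaining and nullish coalescing (for example Node 14 or later) and uses a shortened, made-up input; it also makes the inner group non-capturing so that between is the only group.

```javascript
// Shortened sample input, for illustration only.
const sample = "this is the start\na\nb\nend\ng\n";
const pattern = /start(?<between>(?:.|\n)*)end/;

// found is either a match array or null.
const found = sample.match(pattern);
const between = found?.groups?.between ?? null;
console.log(JSON.stringify(between)); // "\na\nb\n"

// With no "start" in the input there is no match, and we get null instead of a TypeError.
const missing = "no markers here".match(pattern);
console.log(missing?.groups?.between ?? null); // null
```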

0
Ishan On

Based on what you have described, I assume you want to match from 'start' to 'end', but your example is not quite doing that. So I am providing two examples you can choose from, depending on what you need.

  1. Match from 'start' to the first 'end'. The ? makes [\s\S]+ non-greedy.
/(?<=start)[\s\S]+?(?=end)/

const test = `this is the start
a
b
c
e
f
end
g
end
h
`;

console.log(test.match(/(?<=start)[\s\S]+?(?=end)/))

  2. Match from 'start' to the last 'end'. Greedy: it keeps going until the last 'end' is matched.
/(?<=start)[\s\S]+(?=end)/

const test = `this is the start
a
b
c
e
f
end
g
end
h
`;

console.log(test.match(/(?<=start)[\s\S]+(?=end)/))

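I am using [\s\S], which means any whitespace or non-whitespace character, i.e. any character at all. It is essentially what you had with (.|\n), but it covers more cases: without the s flag, . also skips \r, \u2028 and \u2029, so (.|\n) cannot match those characters.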

I am also using lookahead and lookbehind, which are now widely supported in JS, so that the text between the markers is at index 0 of the result.
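To illustrate the difference, here is a small sketch using a made-up input with Windows-style \r\n line endings. It assumes an engine that supports lookbehind.

```javascript
// Illustrative input with CRLF line endings.
const crlf = "start\r\na\r\nend";

// Neither . nor \n matches \r, so the repetition stops before "end" is reached.
const dotOrNewline = /start(.|\n)*end/.test(crlf);
console.log(dotOrNewline); // false

// [\s\S] matches any character, including \r.
const anyChar = /start[\s\S]*end/.test(crlf);
console.log(anyChar); // true

// With lookbehind and lookahead, the text between the markers is the whole match.
const viaLookaround = crlf.match(/(?<=start)[\s\S]+?(?=end)/);
console.log(JSON.stringify(viaLookaround[0])); // "\r\na\r\n"
```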

0
The fourth bird On

Your second capture group matches a newline, because repeating a capture group captures the value of the last iteration.

The last iteration of (.|\n)* is the newline after the f char and before end
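The same behaviour can be seen in a much smaller sketch (the input "abc" is just an illustration): a quantifier placed outside the group keeps only the last iteration, while a quantifier inside the group captures the whole run.

```javascript
// Quantifier outside the group: the group is overwritten on each iteration.
const repeated = "abc".match(/(\w)+/);
console.log(repeated[0]); // "abc"
console.log(repeated[1]); // "c"

// Quantifier inside the group: the group captures the whole run.
const whole = "abc".match(/(\w+)/);
console.log(whole[1]); // "abc"
```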

Notes

  • The * is greedy: it first matches all characters up to the end of the string, then backtracks until end can be matched, which is the last occurrence of end in the string
  • You don't need the /m flag for multiline as the pattern has no anchors
  • If you don't want partial word matches you can use word boundaries like \bstart\b and \bend\b
  • If you want multiple matches, you can specify the /g for global matches and make the quantifier non greedy [^]*?
  • The pattern (.|\n)* is very inefficient as you are optionally repeating a single character or a newline, leaving a lot of paths to backtrack into
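A short sketch of the greedy, non-greedy and global notes above, using a made-up single-line input with several start/end pairs:

```javascript
// Illustrative input, not from the question.
const text = "start one end two end, start three end";

// Greedy: runs to the end of the string, then backtracks to the last "end".
const greedy = text.match(/start([^]*)end/)[1];
console.log(JSON.stringify(greedy)); // " one end two end, start three "

// Non-greedy: stops at the first "end".
const lazy = text.match(/start([^]*?)end/)[1];
console.log(JSON.stringify(lazy)); // " one "

// /g with matchAll collects every pair; \b avoids partial word matches.
const all = [...text.matchAll(/\bstart\b([^]*?)\bend\b/g)].map((m) => m[1]);
console.log(JSON.stringify(all)); // [" one "," three "]
```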

You can use a single capture group, and in JavaScript you can write the pattern as:

start([^]*)end

const regex = /start([^]*)end/;
const str = `this is the start
a
b
c
e
f
end
g
h`;

const m = str.match(regex);
if (m) {
  console.log(m[1]);
}