Best regex to match an opening and closing HTML tag pair, whatever the content


I'm parsing an HTML source as a string, and I want to remove all <style> elements (the tags and everything between them). Here's my regex:

/<style.*?>[\s\S]*?<\/style>/i

NB: The question mark after the star makes it lazy.

Running the code below on the following HTML extract (fairly small, around 1,000 characters) takes over 25 seconds. On the full HTML page, it still hasn't completed after 20 minutes.

text = "<main><h4>Une expérience reconnue</h4> <p class=\"home\">Entrez dans la communauté Initiative et <b>rejoignez nos 400 clients</b> qui font croitre chaque jour leur business.<br/><br/> <style=\"text-align: right;\"><a href=\"/references/\"><button class=\"btn btn-border btn-round waves-effect lighten-1 demo2 \" data-color=\"#e9763e\">EN SAVOIR+</button></a></p> </div></li> </ul> </div> <div class=\"col m8 offset-m1 s12\"> <!-- Partie mockup --> <ul class=\"mockup\"><!-- 0 --> <li><img src=\"/wp-content/uploads/2017/04/pilotage-crm.png\" /></li> <!--style=\"width:3000px;height:1800px;\"--> <li><img src=\"/wp-content/uploads/2017/04/CRM-simple.png\"/></li> <!-- 2 --> <li><img src=\"/wp-content/uploads/2017/04/prix-CRM.png\"/></li> <!-- 3 --> <li><img src=\"wp-content/themes/abonline/img/slider-mockup/iPadConfig.png\"/></li> <!-- 4 --> <li><img src=\"wp-content/uploads/2017/04/clients-CRM.png\"/></li> </ul> </div> </div> </div> <section id=\"slider-iconPlus\"> <div class=\"container\"> <div class=\"iconPlus row\"> <div class=\"iconData col l3 m6 s6\"> <img id=\"hovData\" class=\"data\" src=\"wp-content/themes/abonline/img/slider-mockup/DataSecure.png\" alt=\"png\" /><br/>"
text.gsub!(/<style.*?>[\s\S]*?<\/style>/i, '')

The problem comes from the fact that the input contains some <style> tags with no closing </style> (in the extract above, the malformed <style="text-align: right;"> opener). My understanding is that, with the [\s\S]*? part, the regex has to scan the rest of the string looking for a closing </style> and never finds one. Unfortunately I have no control over the input, so I must fix the regex.
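To check that diagnosis on a given input, something like the following sketch could list every `<style...>` opener and whether any closing tag appears after it. `style_openers` is a hypothetical helper name, and `closed: true` only means that some `</style>` occurs later in the string, not that it belongs to that particular opener.

```ruby
# Hypothetical diagnostic helper (not part of the original code).
# Returns one hash per "<style...>" opener found in the string.
def style_openers(html)
  report = []
  pos = 0
  # String#match accepts a start offset, so we can walk opener by opener.
  while (m = html.match(/<style[^>]*>/i, pos))
    # String#index returns nil when no closing tag exists after this opener.
    closed = !html.index(/<\/style>/i, m.end(0)).nil?
    report << { opener: m[0], closed: closed }
    pos = m.end(0)
  end
  report
end

style_openers('<style>b{}</style><p><style="a">x</p>')
# => [{ opener: '<style>', closed: true }, { opener: '<style="a">', closed: false }]
```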

I could replace [\s\S]*? with [^<]* to match anything up to the start of the closing </style>, but that seems wrong: what if there's a < character inside the style definition?

text.gsub!(/<style.*?>[^<]*<\/style>/i, '')

This completes immediately, but most of the regex examples I found online to match HTML tags were using the first way above, so I'm wondering what's the "right" way.
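One possible middle ground, sketched here under assumptions rather than offered as the definitive answer, is a pattern that still allows a `<` inside the style body but limits backtracking: `[^>]*` keeps the opening tag from running past its own `>`, and the body consumes either a run of non-`<` characters or a single `<` that does not begin `</style>`. `STYLE_BLOCK` and `strip_style_blocks` are illustrative names, and the sketch assumes the closing tag is written exactly as `</style>` (no whitespace before the `>`).

```ruby
# Sketch only: STYLE_BLOCK and strip_style_blocks are illustrative names.
STYLE_BLOCK = %r{
  <style\b[^>]*>       # opening tag; the class cannot run past the tag's own ">"
  (?>                  # atomic group: no backtracking into a chunk once consumed
    [^<]+              # a run of characters other than "<"
    | <(?!/style>)     # or a "<" that does not begin the closing tag
  )*
  </style>             # the closing tag must actually be present
}xi

def strip_style_blocks(html)
  html.gsub(STYLE_BLOCK, '')
end

strip_style_blocks('a<style>p > a < b {}</style>b')  # => "ab"
strip_style_blocks('<p><style="x">text</p>')         # unchanged: no closing tag
```

With an unclosed opener the pattern simply fails to match and the text is left as-is. Whether that is the desired outcome for the malformed tags in the input is a separate decision.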

How would you do this?
