Remove all attributes not in whitelist from all HTML tags

Question

Asked by user7381822 on 03 November 2023 at 00:09 (98 views)

So far, I can only keep one attribute, but I am trying to keep both the class and id attributes in the HTML tags.

Code:

$string = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a> with only the class and id  attrtibutes.</div>';

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\sclass=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i", '<$1$2$3>', $string);

Output:

<div class="someClassName">Some text <a class="classLink">link</a> with only the class and id  attrtibutes.</div>

I am trying to remove all other attributes from every tag except the class and id attributes.

Using DOMDocument adds extra p tags to the output for some reason, and I believe XPath is faster?

There is 1 answer

**mickmackusa** · Answer 1 · 2023-11-03T01:11:16+00:00

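Iterate over all nodes in the DOM, then loop over each node's attributes in reverse so that you can safely prune attributes that are not in your whitelist. The reverse order matters because the attribute collection is live: removing an attribute shifts the indexes of the ones after it, so a forward loop would skip some of them.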
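As a rough illustration of what goes wrong without the reverse order, the sketch below runs a forward loop over the attributes instead. The sample markup and the variable names are made up for this example.

```php
<?php
// Illustrative input (not from the question): three attributes, only id is whitelisted.
$sample = '<a href="#" title="t" id="x">link</a>';

$dom = new DOMDocument();
$dom->loadHTML($sample, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($dom->getElementsByTagName('*') as $node) {
    // Forward loop, shown only to demonstrate the pitfall.
    for ($i = 0; $i < $node->attributes->length; ++$i) {
        $name = $node->attributes->item($i)->name;
        if (!in_array($name, ['id', 'class'], true)) {
            // After this call, the next attribute slides into position $i
            // and is never inspected because $i is then incremented.
            $node->removeAttribute($name);
        }
    }
}

$forwardResult = trim($dom->saveHTML());
echo $forwardResult, PHP_EOL; // <a title="t" id="x">link</a>
```

The title attribute survives because it moved into index 0 when href was removed. Counting down from the last index avoids this, since removals only affect positions that have already been visited.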

Code:

$html = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a> with only the class and id  attrtibutes.</div>';

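$dom = new DOMDocument();
// These flags stop libxml from adding a doctype and implied html/body wrappers.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// '//*' selects every element in the document.
foreach ($xpath->query('//*') as $node) {
    // Count down so that removing an attribute never shifts an unvisited index.
    for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
        $attr = $node->attributes->item($i);
        if (!in_array($attr->name, ['id', 'class'])) {
            $node->removeAttribute($attr->name);
        }
    }
}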
echo $dom->saveHTML();

Output:

<div id="one-id" class="someClassName">Some text <a id="linkId" class="classLink">link</a> with only the class and id  attrtibutes.</div>

Actually, XPath isn't really needed here, because we are iterating over every element in the DOM anyway:

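// $html is the same string as in the first snippet.
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// '*' matches every element, so no XPath query is required.
foreach ($dom->getElementsByTagName('*') as $node) {
    // Count down so that removing an attribute never shifts an unvisited index.
    for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
        $attr = $node->attributes->item($i);
        if (!in_array($attr->name, ['id', 'class'])) {
            $node->removeAttribute($attr->name);
        }
    }
}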
echo $dom->saveHTML();
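If the whitelist needs to vary, the same loop can be wrapped in a small function. This is only a sketch: the function name keepOnlyAttributes and its parameters are hypothetical and not part of the answer, and it assumes the input has a single root element, as in the question.

```php
<?php
// Hypothetical helper; the name and signature are assumptions for illustration.
function keepOnlyAttributes(string $html, array $allowed): string
{
    $dom = new DOMDocument();
    // Same flags as in the answer: no doctype, no implied html/body wrappers.
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    foreach ($dom->getElementsByTagName('*') as $node) {
        // Walk backwards so removals never shift an unvisited index.
        for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
            $name = $node->attributes->item($i)->name;
            if (!in_array($name, $allowed, true)) {
                $node->removeAttribute($name);
            }
        }
    }

    // saveHTML() ends its output with a newline; trim it for easier comparison.
    return trim($dom->saveHTML());
}

echo keepOnlyAttributes('<p id="intro" style="color:red" onclick="x()">Hi</p>', ['id', 'class']), PHP_EOL;
// <p id="intro">Hi</p>
```

Passing the whitelist as an argument keeps the traversal in one place, so allowing another attribute later is a one-word change at the call site.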

Trying to parse valid HTML with a regular expression is going to be one or more of the following:

Inaccurate/Unreliable,
Convoluted/Verbose,
Hard to read,
Hard to maintain

Regex does not know the difference between tags and text that merely looks like tags. What if the HTML tags and attributes mix upper and lower case? What if single quotes, double quotes and/or backticks are used? What if an attribute has no value (e.g. readonly or checked)? What if a data- attribute name ends with id or title? What if an attribute value contains a quoting symbol that is escaped instead of HTML-encoded? What if text looks like a start tag, but isn't a tag at all?

These are valid reasons to steer clear of parsing valid HTML with regex.
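To make one of these failure modes concrete, the sketch below feeds the pattern from the question an attribute value that contains a literal ">" and compares the result with the DOM approach. The input string and variable names are made up for this example; the pattern is the asker's, rewritten in single quotes only to simplify escaping.

```php
<?php
// Illustrative input: the title value contains a literal ">".
$html = '<a title="1 > 0" class="x">link</a>';

// Pattern from the question; [^>]* cannot tell that this ">" is inside quotes.
$pattern = '/<([a-z][a-z0-9]*)(?:[^>]*(\sclass=[\'"][^\'"]*[\'"]))?[^>]*?(\/?)>/i';
$viaRegex = preg_replace($pattern, '<$1$2$3>', $html);

// DOM approach from the answer, applied to the same input.
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('*') as $node) {
    for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
        $name = $node->attributes->item($i)->name;
        if (!in_array($name, ['id', 'class'], true)) {
            $node->removeAttribute($name);
        }
    }
}
$viaDom = trim($dom->saveHTML());

echo $viaRegex, PHP_EOL; // <a> 0" class="x">link</a>
echo $viaDom, PHP_EOL;   // <a class="x">link</a>
```

The regex treats the ">" inside the title value as the end of the tag, so the class attribute is lost and the rest of the tag leaks into the text. The parser reads the quoted value as a whole and strips only the title attribute.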
