Remove all attributes not in whitelist from all HTML tags


So far, I can only keep one attribute, but I am trying to keep both the class and id attributes in the HTML tags.

Code:

$string = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a> with only the class and id  attrtibutes.</div>';

preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\sclass=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i", '<$1$2$3>', $string);

Output:

<div class="someClassName">Some text <a class="classLink">link</a> with only the class and id  attrtibutes.</div>

I am trying to remove all other attributes from every tag except the class and id attributes.

Using DOMDocument adds extra p tags to the output for some reason, and I believe XPath is faster?

Answer (mickmackusa):

Iterate over all elements in the DOM, then loop over each element's attributes in reverse so that you can safely prune the attributes that are not in your whitelist.

Code:

$html = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a> with only the class and id  attrtibutes.</div>';

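$dom = new DOMDocument();
// The two flags stop libxml from adding a doctype and <html>/<body> wrappers.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
// '//*' selects every element in the document.
foreach ($xpath->query('//*') as $node) {
    // Walk the attributes backwards so removals do not shift unvisited items.
    for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
        $attr = $node->attributes->item($i);
        if (!in_array($attr->name, ['id', 'class'], true)) {
            $node->removeAttribute($attr->name);
        }
    }
}
echo $dom->saveHTML();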

Output:

<div id="one-id" class="someClassName">Some text <a id="linkId" class="classLink">link</a> with only the class and id  attrtibutes.</div>

Actually, XPath isn't really needed here, because we are iterating every element in the DOM anyway; getElementsByTagName('*') does the same job.

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
foreach ($dom->getElementsByTagName('*') as $node) {
    for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
        $attr = $node->attributes->item($i);
        if (!in_array($attr->name, ['id', 'class'])) {
            $node->removeAttribute($attr->name);
        }
    }
}
echo $dom->saveHTML();
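
If the whitelist needs to vary, the same loop can be wrapped in a small function. This is only a sketch: stripAttributes() is a hypothetical helper name (not part of PHP's DOM extension), and the libxml_use_internal_errors() calls and the trim() are additions of mine, assuming you want warnings about sloppy markup suppressed and no trailing newline.

```php
<?php
/**
 * Remove every attribute whose name is not in $whitelist.
 * Hypothetical helper for illustration; not a built-in function.
 */
function stripAttributes(string $html, array $whitelist): string
{
    $dom = new DOMDocument();

    // Assumption: parser warnings should be collected silently, not printed.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('*') as $node) {
        // Iterate in reverse so removals do not shift unvisited attributes.
        for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
            $name = $node->attributes->item($i)->name;
            if (!in_array($name, $whitelist, true)) {
                $node->removeAttribute($name);
            }
        }
    }

    // saveHTML() appends a newline; trim it for easier comparison.
    return trim($dom->saveHTML());
}

$html = '<div id="one-id" class="someClassName">Some text <a href="#" title="Words" id="linkId" class="classLink">link</a>.</div>';
$clean = stripAttributes($html, ['id', 'class']);
echo $clean, PHP_EOL;
```

Passing a different array, for example ['id', 'class', 'href'], would keep links intact while still dropping everything else.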

Trying to parse valid HTML with a regular expression is going to be one or more of the following:

  1. Inaccurate/Unreliable,
  2. Convoluted/Verbose,
  3. Hard to read,
  4. Hard to maintain

Regex does not know the difference between tags and text that merely looks like tags. What if the HTML tags and attributes mix upper and lower case? What if single quotes, double quotes and/or backticks are used? What if an attribute has no assignment (e.g. readonly or checked)? What if a data- attribute name ends with id or title? What if an attribute value contains a quoting symbol which is escaped instead of HTML-encoded? What if text looks like a starting tag, but isn't a tag at all?

These are valid reasons to steer clear of parsing valid HTML with regex.
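
To make a few of those edge cases concrete, here is a rough check of the DOM approach against some deliberately awkward markup. The input string is invented for illustration, and the strtolower() call is a defensive assumption of mine rather than something the answer above requires. The whitelist is compared against the whole attribute name, so data-id is not mistaken for id.

```php
<?php
// Invented test input: mixed-case names, single quotes, boolean attributes,
// a data-* name ending in "id", and a ">" inside a quoted attribute value.
$html = <<<'HTML'
<p ID='intro' title="x > y" hidden>Text <INPUT TYPE='text' CLASS="big" readonly data-id="5"></p>
HTML;

$whitelist = ['id', 'class'];

$dom = new DOMDocument();
// Assumption: collect parser warnings quietly instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

foreach ($dom->getElementsByTagName('*') as $node) {
    // Reverse order keeps the remaining indexes valid after each removal.
    for ($i = $node->attributes->length - 1; $i >= 0; --$i) {
        $attr = $node->attributes->item($i);
        // Compare the full, lowercased name so "data-id" never matches "id".
        if (!in_array(strtolower($attr->name), $whitelist, true)) {
            $node->removeAttribute($attr->name);
        }
    }
}

$result = trim($dom->saveHTML());
echo $result, PHP_EOL;
```

None of these cases needs special handling in the loop, because the parser has already decided what is a tag, what is an attribute, and where each quoted value ends.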