
PHP DOMDocument cuts part of script


I am trying to parse the following string with DOMDocument:

$html1 = "<script>document.write('<scr'+'ipt>alert(123);</scr'+'ipt>')</script>";
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html1);
$html2 = $dom->saveHTML();

but $html2 contains the string

<html><head><script>document.write('<scr'+'ipt>alert(123);')</script></head></html>

which is missing the </scr'+'ipt> part.

I expect to receive the same string between script tags as in the input.


There is 1 answer

Answer by CarlosH.

If you load a partial HTML fragment into DOMDocument, the parser has to guess at the missing structure, and it sometimes ends up parsing the input as XML.

To avoid this, always make sure that the minimum of an HTML5 document is present before loading it.
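One way to see what the parser makes of a fragment is to collect libxml's diagnostics instead of letting them surface as PHP warnings. This is only a minimal sketch: the variable names are illustrative, and the exact messages (if any) depend on the libxml2 version PHP was built against.

```php
<?php
// Fragment from the question: a <script> whose body contains "</".
$fragment = "<script>document.write('<scr'+'ipt>alert(123);</scr'+'ipt>')</script>";

// Buffer parser diagnostics instead of emitting them as PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($fragment);

$errors = libxml_get_errors();          // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the earlier setting

foreach ($errors as $error) {
    // Each entry carries the position and message reported by libxml2.
    printf("line %d, col %d: %s\n", $error->line, $error->column, trim($error->message));
}
```

If the list is empty on your system, the parser accepted the fragment without complaint, which does not by itself mean the script body survived intact.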

Also, nothing inside the <script> tags may contain a </ sequence, because the parser can read it as the start of an end tag; those sequences need to be escaped before the document is loaded.

In your case, satisfying the minimum requirements, it should look something like this:
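The escape works because, inside a JavaScript string literal, \/ is just /, so '<\/scr' and '</scr' are the same string at run time, while the HTML parser no longer sees a </ sequence. A minimal sketch for a single script body (the function name escapeScriptBody is made up for this example, and it assumes every </ sits inside a string literal):

```php
<?php
// Hypothetical helper: rewrite every "</" in JavaScript source as "<\/".
function escapeScriptBody(string $js): string
{
    // Single-quoted PHP strings keep the backslash, so '<\/' is the 3 characters < \ /
    return str_replace('</', '<\/', $js);
}

$js = "document.write('<scr'+'ipt>alert(123);</scr'+'ipt>')";
echo escapeScriptBody($js), "\n";
// prints: document.write('<scr'+'ipt>alert(123);<\/scr'+'ipt>')
```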

$html1 = "<script>document.write('<scr'+'ipt>alert(123);</scr'+'ipt>')</script>";
$html = '<!DOCTYPE html><html lang="en"><head>'. $html1 . '</head><body></body></html>';
// Escape every "</" inside a <script> body as "<\/" so the HTML parser
// does not mistake it for an end tag.
$html = preg_replace_callback(
  '/(<script.*?>)(.*?)(<\/script>)/is',
  function ($m) {
    // $m[1] = opening tag, $m[2] = script body, $m[3] = closing tag
    return $m[1] . str_replace('</', '<\/', $m[2]) . $m[3];
  },
  $html
);


$dom = new DOMDocument();
$dom->loadHTML($html);
$html2 = $dom->saveHTML();
echo $html2;
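To check that the script body really survives parsing, you can read it back from the DOM rather than eyeballing the saveHTML() output. The following is a sketch under the assumption that the body has already been escaped as described; the variable names are illustrative.

```php
<?php
// Script body with "</" already escaped as "<\/".
$body = "document.write('<scr'+'ipt>alert(123);<\\/scr'+'ipt>')";
$html = '<!DOCTYPE html><html lang="en"><head><script>' . $body
      . '</script></head><body></body></html>';

$dom = new DOMDocument();
// Silence libxml diagnostics for this check only.
$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);

// Read the script text back from the parsed tree.
$script = $dom->getElementsByTagName('script')->item(0);
$parsed = $script !== null ? $script->textContent : null;

// true when the parser kept the whole body, false when part of it was cut.
var_dump($parsed === $body);
```

Comparing textContent against the input catches a truncated script immediately, which is easier to automate than comparing two full HTML strings.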

I just hope this is not real code. Inserting scripts with scripts looks like a code-injection hack, and I would rather not be helping you write one.