PHP XMLReader parser error : xmlParseCharRef: invalid xmlChar value


I'm parsing very large XML files, so I need to use PHP's XMLReader. The files cannot be modified at the source, so they have to be parsed as they are. The problem is that the documents contain HTML-style character references ("&#...;") that the reader rejects as invalid.


        $reader = new XMLReader();
    
        if (!$reader->open($fileNamePath))//File xml
            {
            echo "Error opening file: $fileNamePath".PHP_EOL;
            continue;
            }
        echo "Processing file: $file".PHP_EOL;
       
           
        while($reader->read()) 
            {
            
            if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'AIUTO') 
                {
                
                try {
                    $input =$reader->readOuterXML();
                    $nodeAiuto = new SimpleXMLElement($input);
                    }
                catch(Exception $e)
                    {
                    echo "Error Node AIUTO ".$e->getMessage().PHP_EOL;
                    continue;
                    }
                //Do stuff here
                }
         }
    
         $reader->close();

I get a lot of messages like this:

PHP Warning: XMLReader::readOuterXml(): myfile.xml:162: parser error : xmlParseCharRef: invalid xmlChar value 2... Error Node AIUTO String could not be parsed as XML

Obviously the file contains the sequence &#2;.

Here is some XML from the file that causes the error:

<AIUTO><BASE_GIURIDICA_NAZIONALE>Quadro riepilogativo delle misure a sostegno delle imprese attive nei settori agricolo, forestale, della pesca 
e acquacoltura ai sensi della Comunicazione della Commissione europea C (2020) 1863 final – “Quadro 
temporaneo per le misure di aiuto di Stato a sostegno dell’economia nell’attuale emergenza del COVID&#2;19” e successive modifiche e integrazioni</BASE_GIURIDICA_NAZIONALE></AIUTO>
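
A minimal reproduction may help here. This is only a sketch: the two inline strings are made-up stand-ins for the real file, and `&#45;` (a plain hyphen) is an assumed example of a reference the parser accepts. It uses `libxml_use_internal_errors()` to collect the parser errors instead of emitting PHP warnings.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// &#2; refers to the control character U+0002, which XML 1.0 does not allow.
$bad = simplexml_load_string('<AIUTO>COVID&#2;19</AIUTO>');

// &#45; refers to an ordinary hyphen, which is allowed (assumed example).
$good = simplexml_load_string('<AIUTO>COVID&#45;19</AIUTO>');

// Print whatever the parser reported for the failing string.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), PHP_EOL;
}
libxml_clear_errors();

var_dump($bad === false);   // parsing failed
var_dump((string) $good);   // text content of the valid element
```

So the failure does not depend on XMLReader or on the file size: any reference to a character outside the range XML 1.0 permits is rejected.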

I thought about reading every file as text, line by line, and replacing the invalid sequences.

But that's a little tricky. Does anyone have a better solution?


There are 3 answers

7
Alessandro On

I've been there with an XML file, and the best workaround I found is to replace the offending string with nothing:

$xml = str_replace('YOUR STRING', '', $xml);

If you can't delete the data from the XML, you can try to parse the XML and then loop over each element with:

$xml = simplexml_load_file('file.xml');
foreach ($xml as $object) {
  // your code...
}
5
Casimir et Hippolyte On

What you can do is build a custom stream filter that applies all the fixes you need. This way you can keep reading the file as a stream with XMLReader, without loading the full content at once.

class fix_entities_filter extends php_user_filter
{
    // Called by PHP for each chunk of data read from the stream
    function filter($in, $out, &$consumed, $closing): int
    {
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = $this->fix($bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
    
    // Replace the invalid reference &#2; with &#x202f; (narrow no-break space)
    function fix($data)
    {
        return strtr($data, ['&#2;' => '&#x202f;']);
    }
}

stream_filter_register("fix_entities", "fix_entities_filter")
    or die("Failed to register filter");

$file = 'file.xml';
$fileNamePath = "/path/to/your/$file";
$path = "php://filter/read=fix_entities/resource=$fileNamePath";

$reader = new XMLReader();
    
if (!$reader->open($path)) {
    echo "Error opening file: $fileNamePath", PHP_EOL;
}
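
One caveat with the filter above: it fixes each bucket on its own, so a sequence like `&#2;` that happens to be split across two buckets would slip through. A possible variant, as a sketch only, holds back an unfinished `&...` tail until the next chunk arrives. The class name `fix_entities_carry_filter`, the filter name `fix_entities_carry` and the 16-byte limit are my own assumptions, not something from the original code.

```php
<?php

class fix_entities_carry_filter extends php_user_filter
{
    /** @var string Possibly incomplete "&..." tail kept from the previous chunk. */
    private $carry = '';

    public function filter($in, $out, &$consumed, $closing): int
    {
        // Start with what was held back last time, then add the new data.
        $data = $this->carry;
        $this->carry = '';
        while ($bucket = stream_bucket_make_writeable($in)) {
            $consumed += $bucket->datalen;
            $data .= $bucket->data;
        }

        if (!$closing) {
            // If the chunk ends with a short "&..." that has no ";" yet,
            // keep it for the next call (16 bytes is an arbitrary cap).
            $pos = strrpos($data, '&');
            if ($pos !== false
                && strlen($data) - $pos < 16
                && strpos($data, ';', $pos) === false
            ) {
                $this->carry = substr($data, $pos);
                $data = substr($data, 0, $pos);
            }
        }

        if ($data === '') {
            return PSFS_FEED_ME; // nothing to pass on yet
        }

        // Same replacement as in the filter above.
        $fixed = strtr($data, ['&#2;' => '&#x202f;']);
        stream_bucket_append($out, stream_bucket_new($this->stream, $fixed));
        return PSFS_PASS_ON;
    }
}

stream_filter_register('fix_entities_carry', 'fix_entities_carry_filter')
    or die('Failed to register filter');
```

It is used the same way, with `php://filter/read=fix_entities_carry/resource=...` as the path given to `XMLReader::open()`.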


You can find more information about stream filters in the PHP manual and in the book "Modern PHP" by Josh Lockhart (O'Reilly).

0
Jenemj On

While waiting for a cleaner working solution, I've used my "dirty idea" for now.

I create a temporary XML file, removing line by line the sequences that cause errors.

This works:

$fileNamePath = "/path/to/your/file.xml";
$fileNamePathTmp = "/path/to/your/tmp.xml";

$handle = fopen($fileNamePath, "r");
$handle2 = fopen($fileNamePathTmp, "w");
if ($handle && $handle2) {
    // Copy the file line by line, dropping the invalid character references
    while (($line = fgets($handle)) !== false) {
        $line2 = str_replace(array("&#2;", "&#11;", "&#16;", "&#26;"), "", $line);
        fputs($handle2, $line2);
    }

    fclose($handle);
    fclose($handle2);
}

$reader = new XMLReader();

if (!$reader->open($fileNamePathTmp))//File xml tmp
    {
    echo "Error opening file: $fileNamePathTmp".PHP_EOL;
    continue;
    }
echo "Processing file: $file".PHP_EOL;

   
while($reader->read()) 
    {
    
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'AIUTO') 
        {
        
        try {
            $input =$reader->readOuterXML();
            $nodeAiuto = new SimpleXMLElement($input);
            }
        catch(Exception $e)
            {
            echo "Error Node AIUTO ".$e->getMessage().PHP_EOL;
            continue;
            }
        //Do stuff here
        }
 }

 $reader->close();
 unlink($fileNamePathTmp);//Remove the temp xml
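
The hard-coded list of sequences could probably be generalized. As a rough sketch, assuming the goal is to drop every numeric character reference (decimal or hex) that points outside the character range XML 1.0 allows, something like this might replace the `str_replace` call. The function names `is_valid_xml10_codepoint` and `strip_invalid_char_refs` are made up for this example.

```php
<?php

// True if $cp is a code point allowed by the XML 1.0 "Char" production.
function is_valid_xml10_codepoint(int $cp): bool
{
    return $cp === 0x9 || $cp === 0xA || $cp === 0xD
        || ($cp >= 0x20 && $cp <= 0xD7FF)
        || ($cp >= 0xE000 && $cp <= 0xFFFD)
        || ($cp >= 0x10000 && $cp <= 0x10FFFF);
}

// Removes numeric character references that XML 1.0 forbids,
// leaving valid references and named entities untouched.
function strip_invalid_char_refs(string $line): string
{
    $result = preg_replace_callback(
        // Group 1: hex digits, group 2: decimal digits (leading zeros skipped).
        '/&#(?:x0*([0-9A-Fa-f]{1,7})|0*([0-9]{1,9}));/',
        static function (array $m): string {
            $cp = $m[1] !== '' ? (int) hexdec($m[1]) : (int) $m[2];
            return is_valid_xml10_codepoint($cp) ? $m[0] : '';
        },
        $line
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $line;
}
```

In the loop above this would become `$line2 = strip_invalid_char_refs($line);`. Reading line by line with `fgets` stays safe here, because a character reference cannot contain a line break.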