How can one parse whole XML documents using the LXML Sax module?


I have a script that goes through a directory with many XML files and extracts or adds information to these files. I use XPath to identify the elements of interest.

The relevant piece of code is this:

import os

import lxml.etree as et
import lxml.sax

# deleted non relevant code

for root, dirs, files in os.walk(ROOT):

    # iterate all files
    for file in files:

        if file.endswith('.xml'):

            # join the directory currently being walked (not ROOT) and the file name,
            # so that files in subdirectories resolve correctly
            file_path = os.path.join(root, file)

            # load root element from file
            file_root = et.parse(file_path).getroot()

            # This is a function that I define elsewhere in which I use XPath to identify relevant
            # elements and extract, change or add some information
            xml_dosomething(file_root)

            # init tree object from file_root
            tree = et.ElementTree(file_root)

            # save the modified tree under a new name so that the original file is kept
            tree.write(file_path.replace('.xml', '-clean.xml'), encoding='utf-8', doctype='<!DOCTYPE document SYSTEM "estcorpus.dtd">', xml_declaration=True)

I have seen in various places that people recommend using SAX to speed up the processing of large files. After checking the documentation of the lxml.sax module (https://lxml.de/sax.html), I'm at a loss as to how to modify my code so that I can leverage it. I can see the following in the documentation:

handler = lxml.sax.ElementTreeContentHandler()

Then there is a list of statements like handler.startElementNS((None, 'a'), 'a', {}) that populate the 'handler' "document" (?) with what would be the elements of the XML document. After that I see:

tree = handler.etree
lxml.etree.tostring(tree.getroot())
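
Written out in full, that documentation example looks roughly like the sketch below. The element names `a` and `b`, the attribute `foo` and the text are only illustrative values, and the variable name `xml_bytes` is mine. The point it shows is that each `handler.*` call is one SAX event, and the handler assembles an lxml tree from those events.

```python
import lxml.etree
import lxml.sax

# The handler builds an lxml tree out of the SAX events it receives.
handler = lxml.sax.ElementTreeContentHandler()

# Each call below is one SAX event, issued by hand (illustrative values).
handler.startElementNS((None, 'a'), 'a', {})
handler.startElementNS((None, 'b'), 'b', {(None, 'foo'): 'bar'})
handler.characters('Hello world')
handler.endElementNS((None, 'b'), 'b')
handler.endElementNS((None, 'a'), 'a')

# After the last event, the finished tree is available as handler.etree.
tree = handler.etree
xml_bytes = lxml.etree.tostring(tree.getroot())
print(xml_bytes)  # b'<a><b foo="bar">Hello world</b></a>'
```

So the handler is the consumer of events, not the input; something else has to produce the events.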

I think I understand what handler.etree does, but my problem is that I want the input to 'handler' to be the files in the directory that I'm working with, rather than a document that I build by hand with handler.startElementNS and the like. What do I need to change in my code so that the Sax module does this work with the files as input?
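
One possible direction, as a minimal sketch and not a confirmed answer: the standard library's `xml.sax.parse` can read a file and send the resulting events to any SAX content handler, including `lxml.sax.ElementTreeContentHandler`. The helper name `parse_file_with_sax` and the sample file are hypothetical. This still builds the whole tree in memory, and every event passes through Python code, so I would not assume it is faster than `et.parse`.

```python
import os
import tempfile
import xml.sax

import lxml.etree as et
import lxml.sax


def parse_file_with_sax(file_path):
    """Read one XML file as SAX events and return the lxml tree built from them.

    Hypothetical helper, for illustration only.
    """
    handler = lxml.sax.ElementTreeContentHandler()
    # xml.sax.parse reads the file and calls the handler's event methods
    # (startElement, characters, endElement, ...) for us.
    xml.sax.parse(file_path, handler)
    return handler.etree


# Hypothetical sample file, created only for this demonstration.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'sample.xml')
    with open(sample_path, 'w', encoding='utf-8') as fh:
        fh.write('<document><p id="1">Hello</p><p id="2">world</p></document>')

    tree = parse_file_with_sax(sample_path)

    # The result is an ordinary lxml tree, so XPath works as before.
    paragraphs = tree.xpath('//p/text()')
    print(paragraphs)
```

In the loop above, `parse_file_with_sax(file_path).getroot()` would take the place of `et.parse(file_path).getroot()`. Whether that helps with large files is something to measure, since the tree is still fully loaded.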
