<" /> <" /> <"/>

XML::Twig Add tag using twig_roots

94 views Asked by At

I have the following XML file structure that comes from a .xlsx files internal zip file structure:

<worksheet>
    <sheetData>
        <row r="1"/>
        <row r="2"/>
        <row r="3"/>
        <row r="4"/>
        '
        '
        '
        <row r="500000"/>
        <row r="500001"/>
    </sheetData>
</worksheet>

As shown, in general, there can be hundreds of thousands of <row> tags (in theory 2^20 tags).

I need to update this file by adding a tag after the <sheetData> tag using XML::Twig.

Due to memory restrictions I cannot load the <sheetData> tag because this would load all the <row> child tags and exceed my machines memory capacity.

The only way I have found to solve me problem is as follows:

1.) Using XML::Twig twig_roots on the <row> tags with the twig_print_outside_roots option I remove all row tags and store them in a hash.

use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES :CONSTANTS);
use XML::Twig;
use File::Temp qw/ tempfile tempdir /;

my %InfoRemoved;
my $zipName='/home/public/TestFile.xlsx'
my $zip = Archive::Zip->new();
my $zipread=$zip->read($zipName);
my $sheetX="xl/worksheets/sheet1.xml";
my $sheetXC=$zip->contents($sheetX);

(my $outP2, my $fh_outP2) = tempfile(DIR => '/tmp',SUFFIX => '.xml',UNLINK => 1);
open($outP2, '>:encoding(UTF-8)', $fh_outP2);
my $tP2= XML::Twig->new( 
 twig_roots  => {'/worksheet/sheetData/row' => sub {Remove_and_store_tag(@_,\%InfoRemoved)} },
 twig_print_outside_roots => $outP2);
$tP2->parse($sheetC);
close $outP2;

This produces the new file:

<worksheet>
    <sheetData/>
</worksheet>

2.) Next I add my tag after the <sheetData> tag, which now has no <row> child tags.

(my $outP3, my $fh_outP3) = tempfile(DIR => '/tmp',SUFFIX => '.xml',UNLINK => 1);
open($outP3, '>:encoding(UTF-8)', $fh_outP3);
my $tP3= XML::Twig->new( 
twig_roots  => {'/worksheet/sheetData' => sub {Add_tag(@_)} },
twig_print_outside_roots => $outP3);
$tP3->parsefile($fh_outP2);
close $outP3;

This produces the new file:

<worksheet>
    <sheetData/>
    <tagname/>
</worksheet>

3.) Finally I put all the row tags back from the hash generated in 1.)

(my $outP4, my $fh_outP4) = tempfile(DIR => '/tmp',SUFFIX => '.xml',UNLINK => 1);
open($outP4, '>:encoding(UTF-8)', $fh_outP4);
my $tP4= XML::Twig->new( 
twig_roots  => {'/worksheet/sheetData' => sub {Add_rows_back_in(@_,\%InfoRemoved)} },
twig_print_outside_roots => $outP4);
$tP4->parsefile($fh_outP3);
close $outP4;

Which produces the required result:

<worksheet>
    <sheetData>
        <row r="1"/>
        <row r="2"/>
        <row r="3"/>
        <row r="4"/>
        '
        '
        '
        <row r="500000"/>
        <row r="500001"/>
    </sheetData>
    <tagname/>
</worksheet>

In the code above the following subs have been used:

sub Remove_and_store_tag{
    my ( $twig, $elt, $InfoRemoved) = @_;
    $InfoRemoved->{$elt->{'att'}->{'r'}}=$elt->sprint;
    $elt->delete;
}

sub Add_tag{
    my ( $twig, $elt) = @_;
    $elt->insert_new_elt( 'after', => 'tagname');
    $elt->print;
}

sub Add_rows_back_in{
    my ( $twig, $elt, $InfoRemoved) = @_;
    foreach my $irow (sort {$a <=> $b} keys %$InfoRemoved){
        $elt->insert_new_elt( 'last_child', => $InfoRemoved->{$irow});
    }
    $elt->print;
}

Is there an 'easier' way ?

0

There are 0 answers