I have the following XML file structure that comes from a .xlsx files internal zip file structure:
<worksheet>
<sheetData>
<row r="1"/>
<row r="2"/>
<row r="3"/>
<row r="4"/>
'
'
'
<row r="500000"/>
<row r="500001"/>
</sheetData>
</worksheet>
As shown, in general, there can be hundreds of thousands of <row> tags (in theory 2^20 tags).
I need to update this file by adding a tag after the <sheetData> tag using XML::Twig.
Due to memory restrictions I cannot load the <sheetData> tag because this would load all the <row> child tags and exceed my machines memory capacity.
The only way I have found to solve me problem is as follows:
1.) Using XML::Twig twig_roots on the <row> tags with the twig_print_outside_roots option I remove all row tags and store them in a hash.
use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES :CONSTANTS);
use XML::Twig;
use File::Temp qw/ tempfile tempdir /;
my %InfoRemoved;
my $zipName='/home/public/TestFile.xlsx'
my $zip = Archive::Zip->new();
my $zipread=$zip->read($zipName);
my $sheetX="xl/worksheets/sheet1.xml";
my $sheetXC=$zip->contents($sheetX);
(my $outP2, my $fh_outP2) = tempfile(DIR => '/tmp',SUFFIX => '.xml',UNLINK => 1);
open($outP2, '>:encoding(UTF-8)', $fh_outP2);
my $tP2= XML::Twig->new(
twig_roots => {'/worksheet/sheetData/row' => sub {Remove_and_store_tag(@_,\%InfoRemoved)} },
twig_print_outside_roots => $outP2);
$tP2->parse($sheetC);
close $outP2;
This produces the new file:
<worksheet>
<sheetData/>
</worksheet>
2.) Next I add my tag after the <sheetData> tag, which now has no <row> child tags.
(my $outP3, my $fh_outP3) = tempfile(DIR => '/tmp',SUFFIX => '.xml',UNLINK => 1);
open($outP3, '>:encoding(UTF-8)', $fh_outP3);
my $tP3= XML::Twig->new(
twig_roots => {'/worksheet/sheetData' => sub {Add_tag(@_)} },
twig_print_outside_roots => $outP3);
$tP3->parsefile($fh_outP2);
close $outP3;
This produces the new file:
<worksheet>
<sheetData/>
<tagname/>
</worksheet>
3.) Finally I put all the row tags back from the hash generated in 1.)
(my $outP4, my $fh_outP4) = tempfile(DIR => '/tmp',SUFFIX => '.xml',UNLINK => 1);
open($outP4, '>:encoding(UTF-8)', $fh_outP4);
my $tP4= XML::Twig->new(
twig_roots => {'/worksheet/sheetData' => sub {Add_rows_back_in(@_,\%InfoRemoved)} },
twig_print_outside_roots => $outP4);
$tP4->parsefile($fh_outP3);
close $outP4;
Which produces the required result:
<worksheet>
<sheetData>
<row r="1"/>
<row r="2"/>
<row r="3"/>
<row r="4"/>
'
'
'
<row r="500000"/>
<row r="500001"/>
</sheetData>
<tagname/>
</worksheet>
In the code above the following subs have been used:
sub Remove_and_store_tag{
my ( $twig, $elt, $InfoRemoved) = @_;
$InfoRemoved->{$elt->{'att'}->{'r'}}=$elt->sprint;
$elt->delete;
}
sub Add_tag{
my ( $twig, $elt) = @_;
$elt->insert_new_elt( 'after', => 'tagname');
$elt->print;
}
sub Add_rows_back_in{
my ( $twig, $elt, $InfoRemoved) = @_;
foreach my $irow (sort {$a <=> $b} keys %$InfoRemoved){
$elt->insert_new_elt( 'last_child', => $InfoRemoved->{$irow});
}
$elt->print;
}
Is there an 'easier' way ?