How to process a large XML file downloaded from Wiktionary


I downloaded the file enwiktionary-20231101-pages-articles.xml from Wiktionary, but it seems useless because I cannot open it. The file is around 8GB. I tried VSCode, which fails with "cannot open the large XML". I then tried this snippet in Java and hit the same problem.

try {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse("src/main/resources/enwiktionary-20231101-pages-articles.xml");

    // Access elements and data from the XML
    NodeList nodeList = document.getElementsByTagName("page");
    System.out.println(nodeList.getLength());

//    for (int i = 0; i < nodeList.getLength(); i++) {
//        Node node = nodeList.item(i);
//        System.out.println(node);
//        break;
//    }
} catch (Exception e) {
    e.printStackTrace();
}

I found https://dictionaryapi.dev/, which looks like a replacement for the file above and would save me a lot of effort processing the XML format. My only concern now is getting a list of words, so that I can look them up through that API. Do you know how to achieve this? Thanks!


There are 2 answers

Michael Kay

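You need to use a streaming technology, one that doesn't build a tree in memory. DocumentBuilder.parse builds a DOM tree of the entire document, which is why it fails on an 8GB file. The usual candidates are Java SAX processing, Python ElementTree (using iterparse), or streaming XSLT 3.0.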
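
As a rough illustration of the SAX option, here is a minimal sketch that streams the dump and emits page titles, which is what the question needs for a word list. The class and method names (TitleExtractor, forEachTitle) are made up for this example. It assumes each page element has title and ns children, and that ns 0 marks ordinary dictionary entries; check both against your dump before relying on it.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; the names are illustrative only.
class TitleExtractor {

    /** Streams the XML and passes the title of every page with <ns>0</ns> to sink. */
    static void forEachTitle(InputStream in, Consumer<String> sink) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            private final StringBuilder buf = new StringBuilder();
            private boolean capture;   // true only while inside <title> or <ns>
            private String title;
            private String ns;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                capture = qName.equals("title") || qName.equals("ns");
                buf.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Skip everything else, in particular the large <text> bodies.
                if (capture) buf.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("title")) {
                    title = buf.toString();
                } else if (qName.equals("ns")) {
                    ns = buf.toString().trim();
                } else if (qName.equals("page")) {
                    // Assumption: ns 0 is the main (dictionary entry) namespace.
                    if (title != null && "0".equals(ns)) sink.accept(title);
                    title = null;
                    ns = null;
                }
                capture = false;
            }
        });
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the path to the uncompressed dump.
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            forEachTitle(in, System.out::println);
        }
    }
}
```

Only one title is held at a time, so memory use should stay flat regardless of file size. If the parser stops with a JAXP limit error on a file this large, the JDK's XML processing limits (for example the jdk.xml.totalEntitySizeLimit system property) are the first thing to check.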

jdweng

Use a PowerShell script. The following uses a combination of XmlReader and LINQ to XML (System.Xml.Linq), then writes the results to a CSV file.

using assembly System.Xml
using assembly System.Xml.Linq

$uri = 'https://dumps.wikimedia.org/shnwiktionary/20231101/shnwiktionary-20231101-pages-articles.xml.bz2'
$zipPath = 'c:\Program Files\7-Zip\7z.exe'
$tempFolder = 'c:\temp\'
$Source = 'shnwiktionary-20231101-pages-articles.xml.bz2'
$Target = 'shnwiktionary-20231101-pages-articles.xml'
$csv = 'c:\temp\test.csv'

$response = Invoke-WebRequest -URI $uri
[System.IO.File]::WriteAllBytes($tempFolder + $source, $response.Content)

cd $tempFolder
Set-Alias Start-SevenZip $zipPath
Start-SevenZip x $Source


$reader = [System.Xml.XmlReader]::Create($tempFolder + $Target)
$reader.MoveToContent();
$nsName = $reader.GetAttribute('xmlns');
$namespace = [System.Xml.Linq.XNamespace]::Get($nsName);
$table = [System.Collections.ArrayList]::new()
while($reader.EOF -eq $False)
{
   if ($reader.Name -ne 'page')
   {
      # ReadToFollowing returns a boolean; discard it so it is not written to the output
      $reader.ReadToFollowing('page') | Out-Null
   }
   if ($reader.EOF -eq $False)
   {
       $page = [System.Xml.Linq.XElement]::ReadFrom($reader)
       $title = $page.Element($namespace + 'title').Value
       $ns = $page.Element($namespace + 'ns').Value
       $id = $page.Element($namespace + 'id').Value

       $revision = $page.Element($namespace + 'revision')
       $parentId = $revision.Element($namespace + 'id').Value
       $timestamp = $revision.Element($namespace + 'timestamp').Value

       $contributor = $revision.Element($namespace + 'contributor')
       $username = $contributor.Element($namespace + 'username').Value
       $userId = $contributor.Element($namespace + 'id').Value

       $model = $revision.Element($namespace + 'model').Value
       $format = $revision.Element($namespace + 'format').Value
       $text = $revision.Element($namespace + 'text').Value

       $newRow = [pscustomobject]@{
           title = $title
           ns = $ns
           id = $id
           parentId = $parentId
           timestamp = $timestamp
           username = $username
           userId = $userId
           model = $model
           format = $format
           text = $text
       }
       $table.Add($newRow) | out-null 
   }
}
$table | Export-CSV -Path $csv -NoTypeInformation
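
Since the question is in Java, here is a rough Java counterpart of the same idea using StAX (XMLStreamReader) instead of XmlReader. It is only a sketch: the class and method names (PagesToCsv, convert, csv) are invented for the example, it writes just title, ns and id, and it writes each row as soon as a page ends rather than collecting rows in a table first, which is an assumption about what suits an 8GB input.

```java
import java.io.BufferedInputStream;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class; the names are illustrative only.
class PagesToCsv {

    /** Quotes a CSV field: wraps it in double quotes and doubles embedded quotes. */
    static String csv(String s) {
        return "\"" + (s == null ? "" : s.replace("\"", "\"\"")) + "\"";
    }

    /** Streams pages from in, writes one CSV row per page, returns the page count. */
    static long convert(InputStream in, Writer out) throws XMLStreamException, IOException {
        XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(in);
        out.write("\"title\",\"ns\",\"id\"\n");
        String title = null, ns = null, id = null;
        int depth = 0;       // current element nesting depth
        int pageDepth = -1;  // depth of the enclosing <page>, or -1 outside a page
        long pages = 0;
        while (r.hasNext()) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
                String name = r.getLocalName();
                if (name.equals("page")) {
                    pageDepth = depth;
                    title = ns = id = null;
                } else if (pageDepth != -1 && depth == pageDepth + 1
                        && (name.equals("title") || name.equals("ns") || name.equals("id"))) {
                    // Direct children of <page> only, so revision and contributor ids are skipped.
                    String value = r.getElementText(); // consumes up to the matching end tag
                    depth--;
                    switch (name) {
                        case "title": title = value; break;
                        case "ns": ns = value; break;
                        default: id = value; break;
                    }
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (depth == pageDepth) {
                    out.write(csv(title) + "," + csv(ns) + "," + csv(id) + "\n");
                    pages++;
                    pageDepth = -1;
                }
                depth--;
            }
        }
        r.close();
        return pages;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the uncompressed dump, args[1]: path of the CSV to write.
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])));
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]))) {
            System.out.println(convert(in, out) + " pages written");
        }
    }
}
```

Tracking the nesting depth is what separates the page id from the revision and contributor ids, which share the same element name.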