I downloaded the file enwiktionary-20231101-pages-articles.xml from Wiktionary, but it seems useless to me because I cannot open it. The file is around 8 GB. I tried VS Code, which fails with "cannot open the large XML". I also tried the Java snippet below and ran into the same problem.
try {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document document = builder.parse("src/main/resources/enwiktionary-20231101-pages-articles.xml");
    // Access elements and data from the XML
    NodeList nodeList = document.getElementsByTagName("page");
    System.out.println(nodeList.getLength());
    // for (int i = 0; i < nodeList.getLength(); i++) {
    //     Node node = nodeList.item(i);
    //     System.out.println(node);
    //     break;
    // }
} catch (Exception e) {
    e.printStackTrace();
}
I found https://dictionaryapi.dev/, which looks like a replacement for the file above and could save me a lot of effort in processing the XML format. My only remaining concern is getting a list of words, so that I can look them up via that link. Do you know how to achieve this? Thanks!
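You need to use a streaming technology, i.e. one that processes the document as it is read instead of building a tree in memory. DocumentBuilder.parse builds the whole DOM tree, which is why it fails on an 8 GB file. The usual candidates are Java SAX (or StAX) processing, Python's ElementTree.iterparse, or streaming XSLT 3.0.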
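
As a minimal sketch of the SAX option, the following reads the dump as a stream and hands each page title to a callback, so memory use stays small regardless of file size. It rests on assumptions about the dump layout that you should check against your file: each entry is a <page> element containing a <title> and an <ns> element, and ns 0 marks ordinary dictionary entries. The class name TitleStreamer and the method forEachEntryTitle are made up for this example.

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams a MediaWiki-style dump and reports entry titles.
public class TitleStreamer {

    /**
     * Parses the XML from the stream with SAX and passes the title of every
     * page whose <ns> is "0" to the sink. Assumes the <page>/<title>/<ns>
     * layout described above; nothing is kept in memory beyond one title.
     */
    public static void forEachEntryTitle(InputStream in, Consumer<String> sink) throws Exception {
        // The default factory is not namespace-aware, so qName is the plain tag name.
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(in, new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean capture = false; // true only inside <title> or <ns>
            private String title;
            private String ns;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("page".equals(qName)) {
                    title = null;
                    ns = null;
                }
                capture = "title".equals(qName) || "ns".equals(qName);
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append rather than assign.
                if (capture) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    title = text.toString();
                } else if ("ns".equals(qName)) {
                    ns = text.toString().trim();
                } else if ("page".equals(qName) && "0".equals(ns) && title != null) {
                    sink.accept(title);
                }
                capture = false;
            }
        });
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java TitleStreamer <path-to-dump.xml>");
            return;
        }
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            forEachEntryTitle(in, System.out::println); // one title per line
        }
    }
}
```

Running it with the path to the dump and redirecting stdout to a file should give you one word per line. If the JDK's built-in parser aborts with an entity size limit error (reported as JAXP00010004 on large dumps), a commonly cited workaround is to start the JVM with -Djdk.xml.totalEntitySizeLimit=0; treat that as something to verify on your JDK version rather than a guaranteed fix.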