benchmarking XML processing
I do a lot of XML processing at work and the performance of different native libraries was bothering me - some were way too slow for what should be a simple task.
So I started playing around with different languages, libraries, and parsing models (DOM vs. SAX) to see what actually makes a difference.
Meet xml-i (pronounced “XML eye”) - a CLI tool that takes an XML file and counts how often each node name appears, optionally restricted to a single name.
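To make the task concrete, here is a minimal sketch of the same idea in Python using the standard library's SAX parser. It is not the code from the repository; `NodeCounter`, `count_nodes`, and the `name_filter` parameter are names made up for illustration.

```python
import io
import xml.sax
from collections import Counter


class NodeCounter(xml.sax.ContentHandler):
    """Counts element names while the parser streams through the input."""

    def __init__(self, name_filter=None):
        super().__init__()
        self.name_filter = name_filter  # hypothetical: count only this name
        self.counts = Counter()

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing else is kept in memory.
        if self.name_filter is None or name == self.name_filter:
            self.counts[name] += 1


def count_nodes(source, name_filter=None):
    """source may be a file name or a file-like object."""
    handler = NodeCounter(name_filter)
    xml.sax.parse(source, handler)
    return handler.counts


if __name__ == "__main__":
    sample = b"<root><item/><item><sub/></item></root>"
    print(count_nodes(io.BytesIO(sample)))
    print(count_nodes(io.BytesIO(sample), name_filter="item"))
```

Because the handler only ever holds the counter, memory use stays flat no matter how large the file is, and a malformed document still makes the parser raise an error.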
The baseline is written in Rust using quick-xml and it’s consistently the fastest of the bunch. But the alien/ directory is where it gets interesting - C++, Java, Scala, Julia, .NET, PowerShell, Python, …
I also threw in “(noxml)” tests - text-only, non-validating parsers that strip the XML structure and just count raw text. They’re significantly faster, but without proper validation they’re useless in real-world applications. It proves that if you cut corners you can be fast, but that doesn’t help when you actually need to process valid XML 😉
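For a feel of what cutting corners can look like, here is a rough Python sketch of a text-only counter. This is my assumption of the general approach, not the actual “(noxml)” code from the repo; `OPEN_TAG` and `count_tags_noxml` are hypothetical names.

```python
import re
from collections import Counter

# Matches the name right after "<" in an opening tag such as <item> or
# <item id="1"/>. Closing tags, comments, and processing instructions
# start with "</", "<!", or "<?" and therefore do not match.
OPEN_TAG = re.compile(rb"<([A-Za-z_][\w.\-:]*)")


def count_tags_noxml(data: bytes) -> Counter:
    """Counts tag names by pattern matching only; no parsing, no validation."""
    return Counter(m.group(1).decode() for m in OPEN_TAG.finditer(data))


if __name__ == "__main__":
    # The second <item> is never closed, so this is not well-formed XML,
    # yet the text-only counter reports numbers without complaint.
    broken = b"<root><item/><item></root>"
    print(count_tags_noxml(broken))
```

The sketch never notices the unclosed element, and it would also count anything that merely looks like a tag inside a CDATA section. That is the trade-off in a nutshell: fast, but the numbers are only trustworthy if the input was valid to begin with.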
The benchmark results are the best part. On a 3.2 GB file:
- Rust (quick-xml) finishes in 2.8 seconds using ~2 MB of memory
- C++ (pugixml) does it in 4.5 seconds but chews through 7.6 GB of RAM
- Python takes 52 seconds and peaks at 12.6 GB
- PowerShell Core sits at 9.3 seconds with 130 MB - not bad for a scripting language
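A plausible reason for the huge gap in memory is the parsing model: pugixml is a DOM parser, so it builds the whole document tree in memory before anything can be counted. Here is a minimal DOM-style sketch in Python with `xml.etree.ElementTree` for comparison; `count_nodes_dom` is a made-up name, and I am not claiming this is what the Python entry in the benchmark does.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_nodes_dom(source):
    """DOM-style counting: parse everything first, then walk the tree."""
    tree = ET.parse(source)  # the entire document is materialized here
    return Counter(element.tag for element in tree.iter())


if __name__ == "__main__":
    sample = b"<root><item/><item><sub/></item></root>"
    print(count_nodes_dom(io.BytesIO(sample)))
```

The result is the same as with a streaming parser, but every element stays alive until the walk is finished, so memory grows with the size of the file.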
Takeaway: our computers are incredibly fast. If it’s slow, you’re most likely doing it wrong.
Check it out at github.com/mwallner/xml-i - PRs with new languages are always welcome 😉
~ till next time