Stream Output When Parsing Big Xml With Elixir | by José de Zárate | May, 2022

Well, this was hard for me to figure out, and consequently, I'd like to share it.

There are two big players in Elixir's XML parsing ecosystem:

  • SweetXml , which is the classic "parser": if you feed it a string or stream with XML content, it will produce a structure of elements in which each element is what SweetXml thinks an XML element's representation should be.
  • Saxy , which is based on SAX parsing. You provide it with a module and some state. The module knows what to do when SAX XML-related events occur (such as "start-document", "start-element", and so on) while reading a string or stream with XML contents. The module modifies the state provided to it depending on those events.
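To make the contrast concrete, here is a minimal, illustrative sketch of both styles (assuming sweet_xml and saxy as dependencies; MyHandler is a placeholder name, not a module from either library):

```elixir
import SweetXml

xml = "<rss><item><id>xx1</id></item></rss>"

# SweetXml: parse the document into its element structure, query it with XPath
xpath(xml, ~x"//item/id/text()"s)
# => "xx1"

# Saxy: you hand over a handler module plus an initial state; Saxy invokes
# MyHandler.handle_event/3 for every event (:start_element, :characters, ...)
# and threads the state through the callbacks.
{:ok, final_state} = Saxy.parse_string(xml, MyHandler, _initial_state = [])
```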

I want to read a huge XML file that has some elements repeated many times, and produce some kind of "iterator" from it. Something like this:

Given the XML file:

<rss>
  <item>
    <title>I am a title</title>
    <id>xx1</id>
  </item>
  <item>
    <title>second title</title>
    <id>xx2</id>
    <extra>extra content</extra>
    <whatever>
      <some>thing</some>
      <some>thing2</some>
    </whatever>
  </item>
  <item>
    <extra>thing</extra>
  </item>
</rss>

I'd like to produce some iterator that, when iterated, yields this:

[
  {"title": "I am a title", "id": "xx1"},
  {
    "title": "second title",
    "id": "xx2",
    "extra": "extra content",
    "whatever": {
      "some": ["thing", "thing2"]
    }
  },
  {"extra": "thing"}
]
  • I don't want to hold a structure that represents the entire XML file in memory.
  • I don't even want to hold the entire list of items in memory, because I can use some kind of "iterator", which means the only thing I have to hold in memory is one item at a time.

Saxy is incredibly fast and performant, but it's based on the idea that, as you read the XML file, you "fill" some state object (with whatever you want, and as much as you want, but still, you fill it).

In this scenario, I would "fill" the state with the list of items. That, of course, is a lot less memory than it would take to hold the entire XML structure in memory. But it still establishes a relationship between the size of the XML file and the size of the in-memory list, which I don't like, because it means that with a big enough file I can consume more memory than I'm allowed to.

SweetXml provides a function called stream_tags and when you see what it does, it seems to hit the spot!!! Because it claims to be just what I need: parse an XML and, as it finds certain tags, stream the SweetXml representation of them, without building any in-memory structure representing the XML. So this should be all I need:

iex> list_iterator = File.stream!("some_feed.xml") |> SweetXml.stream_tags!(:item, discard: [:item])

and that would be it. list_iterator is not the whole list but just an iterator over it, which means I don't have to hold the entire list in memory. So theoretically, "some_feed.xml" can be as big as I want, with no memory penalty. But …

It doesn't work like that. I don't know exactly why, but there is some accumulation going on, which means there is some memory hoarding going on, which means that for big enough XML files, I'll exceed my memory allowance.

The idea is that, even if Saxy "accumulates" something in some state, I "clean" the state as often as I need to during the parsing, so the state doesn't end up consuming a lot of memory.

Saxy can parse an XML file in several "partial" attempts (although the state it accumulates into stays the same), like this:

{:ok, partial} = Partial.new(MyEventHandler, initial_state)
{:cont, partial} = Partial.parse(partial, "<foo>")
{:cont, partial} = Partial.parse(partial, "<bar></bar>")
{:cont, partial} = Partial.parse(partial, "</foo>")
{:ok, state} = Partial.terminate(partial)

Again, the trick is to make sure initial_state gets "emptied" from time to time.

The handler: what to do when SAX events are found

First, we have to build a module that knows what to do to the state when XML events are found during XML parsing:
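The original code embed did not survive here, so what follows is a minimal sketch of such a handler, reconstructed from the description below (the module name ItemEventHandler is illustrative, and for brevity it only handles flat children of `<item>`, not nested elements like `<whatever>`):

```elixir
defmodule ItemEventHandler do
  @behaviour Saxy.Handler

  @impl true
  def handle_event(:start_document, _prolog, state), do: {:ok, state}

  def handle_event(:start_element, {"item", _attrs}, state) do
    # entering an <item>: start building a fresh map for it
    {:ok, %{state | current_element: %{}, stack: []}}
  end

  def handle_event(:start_element, {name, _attrs}, %{current_element: current} = state)
      when current != nil do
    # remember which child tag we are inside, so :characters knows where to go
    {:ok, %{state | stack: [name | state.stack]}}
  end

  def handle_event(:start_element, _other, state), do: {:ok, state}

  def handle_event(:characters, chars, %{current_element: current, stack: [name | _]} = state)
      when current != nil do
    # store the text under the current child tag's name
    {:ok, %{state | current_element: Map.put(current, name, chars)}}
  end

  def handle_event(:characters, _chars, state), do: {:ok, state}

  def handle_event(:end_element, "item", %{current_element: item} = state) do
    # a whole <item> was parsed: append it to the accumulated list
    {:ok, %{state | items: state.items ++ [item], current_element: nil}}
  end

  def handle_event(:end_element, _name, %{stack: [_ | rest]} = state) do
    {:ok, %{state | stack: rest}}
  end

  def handle_event(:end_element, _name, state), do: {:ok, state}

  def handle_event(:end_document, _data, state), do: {:ok, state}
end
```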

As you can see:

  • the initial state provided to the parser is %{current_element: nil, stack: nil, items: []} . That is, the initial state contains an empty list of items
  • when the parser module finishes parsing an <item> element, a new item is added to the state
  • so, at any point during the parsing, the state (under the key items ) contains the list of items parsed so far. If we did nothing else, then after parsing the entire XML file we could fetch the complete list of items obtained from it, and that would be fine for not-too-big XML files. But if the XML file is big enough, then the accumulated list of items is big too, and that means eating up a lot of memory!

The output we want is a Stream, so the only memory consumption is the memory needed to hold one item. In order to do that, we have to streamize the XML parsing, which we can do thanks to the Saxy.Partial module. While we do that, we also take care of yielding already-processed items and removing them from the state. This way, we can make sure the state is never going to grow too much.
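The streaming code embed is also missing here; this is a sketch of the approach under these assumptions: a Saxy version that ships Saxy.Partial with get_state/1 and set_state/2, and the ItemEventHandler-style handler from above (module and function names, including fetch_items, follow the article's description but are otherwise illustrative; here each stream element is a single item rather than a per-chunk list of items):

```elixir
defmodule ItemStream do
  alias Saxy.Partial

  @chunk_size 2048

  def stream_items(path) do
    Stream.resource(
      fn ->
        initial_state = %{current_element: nil, stack: nil, items: []}
        {:ok, partial} = Partial.new(ItemEventHandler, initial_state)
        {:ok, io} = File.open(path, [:read, :binary])
        {io, partial}
      end,
      fn
        {io, partial} ->
          case IO.binread(io, @chunk_size) do
            :eof ->
              # end of file: finish parsing and flush the remaining items
              {:ok, state} = Partial.terminate(partial)
              {items, _state} = fetch_items(state)
              File.close(io)
              {items, :halted}

            chunk when is_binary(chunk) ->
              {:cont, partial} = Partial.parse(partial, chunk)
              # yield the items parsed so far and *remove* them from the
              # state, so the state never grows with the file
              {items, state} = fetch_items(Partial.get_state(partial))
              {items, {io, Partial.set_state(partial, state)}}
          end

        :halted ->
          {:halt, :halted}
      end,
      fn _acc -> :ok end
    )
  end

  # yield the accumulated items and put back an emptied state
  defp fetch_items(%{items: items} = state), do: {items, %{state | items: []}}
end
```

With something like this, `ItemStream.stream_items("some_feed.xml") |> Enum.each(&IO.inspect/1)` would print one item at a time while only ever holding one chunk's worth of items in the state.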

As you can see, as it parses the XML it generates a stream, and when retrieving the parsed items (via the fetch_items function), we also remove those items from the state . Therefore the state never holds too many items, and therefore the state never eats too much memory!!!
