How to Sort a 20G File in Rust, and Learn How to Deal with File Handling | by applied.math.coding | Mar, 2022

The problem of sorting a file that does not fit into memory already has a standard solution called 'external sort'.

This story is about giving some details on how one might implement such an external sorting algorithm in Rust.

Although a bunch of ready-to-use crates for this problem already exist, it is very interesting to look at implementation-wise, since one deals with many important concepts from standard file handling.

Let us quickly recap how external sorting works. A typical sorting algorithm requires the targeted list to be held in memory.

In case the available memory is not large enough, we first read the list in smaller portions that fit into memory. Each portion is sorted by some sorting algorithm and subsequently written to a file dedicated to this portion. This leaves us with a number of files, each containing a sorted list and each able to fit into memory.

Finally, we put a readable stream onto each of these files and always pick the smallest element to be written into one 'large' result file. If you are not familiar with this merging strategy, you will easily understand the details from the code later.

In order to obtain a large file of about 20G in size, we write out the first n iterations of the logistic map within its chaotic regime:

x_{n+1} = r * x_n * (1 - x_n)

This is very suitable since it produces a large amount of non-repeating data.

The Iterator for this sequence is implemented like this:
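The original listing does not survive here, but a minimal sketch of such an iterator might look as follows (the tuple-struct name `Logistic` and the start value 0.35 appear later in the article; the parameter r = 3.8, chosen inside the chaotic regime, is an assumption):

```rust
// Iterator over the logistic-map sequence x_{n+1} = r * x_n * (1 - x_n).
// The parameter r = 3.8 lies in the chaotic regime and is an assumption.
struct Logistic(f64);

impl Iterator for Logistic {
    type Item = f64;

    fn next(&mut self) -> Option<f64> {
        const R: f64 = 3.8;
        self.0 = R * self.0 * (1.0 - self.0);
        Some(self.0)
    }
}

fn main() {
    // Print the first three iterates starting from x_0 = 0.35.
    for x in Logistic(0.35).take(3) {
        println!("{}", x);
    }
}
```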

In Rust, one can create a writable stream into a file by simply doing,

use std::fs::{remove_file, File};
let mut file = File::create("filename").unwrap();

Then we could start writing to it line by line,

let mut logistic = Logistic(0.35);
...
writeln!(file, "{}", logistic.next().unwrap());

This approach works for small amounts of data, but in our case it would probably take much too long. Writing to a file line by line is a very inefficient I/O operation.

For this reason, we first buffer as many write operations as possible in memory, and when this buffer overflows, it gets written to the file in a single I/O operation:

The idea is to wrap the File in a BufWriter::with_capacity. Here we can specify how many bytes shall be buffered before the actual writing to the file takes place.

It is also important to call file.flush().unwrap() when using such buffering. This ensures that all data gets written from the buffer to the file before the buffer is dropped.
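Put together, the buffered writing could be sketched like this (the output file name and the 10 MB capacity are assumptions; the article exposes the capacity as the global BUFFER_CAPACITY):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Number of bytes to buffer in memory before a write actually hits the disk.
// The article calls this BUFFER_CAPACITY; the concrete value is an assumption.
const BUFFER_CAPACITY: usize = 10_000_000;

// Logistic-map iterator as described earlier; r = 3.8 is an assumption.
struct Logistic(f64);
impl Iterator for Logistic {
    type Item = f64;
    fn next(&mut self) -> Option<f64> {
        self.0 = 3.8 * self.0 * (1.0 - self.0);
        Some(self.0)
    }
}

fn main() -> std::io::Result<()> {
    let file = File::create("large_file.txt")?;
    // Writes go into the in-memory buffer first and are flushed to the
    // file in large chunks.
    let mut writer = BufWriter::with_capacity(BUFFER_CAPACITY, file);
    for x in Logistic(0.35).take(1_000) {
        writeln!(writer, "{}", x)?;
    }
    // Ensure the bytes still sitting in the buffer reach the file.
    writer.flush()?;
    Ok(())
}
```

For the real 20G run, the `take(1_000)` would of course be replaced by a much larger iteration count.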

The core method of the algorithm is the following:

Here we are reading portions that fit into memory from the generated large file. To open a file for obtaining a readable stream, one just does,

let file = File::open("filename").unwrap();

But again, reading it line by line would result in very inefficient I/O operations. Therefore, we apply the buffering trick again, but this time the other way round:

let file = BufReader::with_capacity(
    10_000_000, File::open(filename).unwrap()
);

Next, we buffer-read the file line by line into a vector. We do this until a certain amount of memory (max_mem_use) is used, and then sort this portion of the list by calling sort_and_write_to_file. For convenience, this method in addition writes the sorted portion into a temporary file.
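A sketch of this portioning step, under the assumption that the file holds one f64 per line (the helper name sort_and_write_to_file comes from the article; the exact signatures, the temporary file naming, and the 8-bytes-per-value memory estimate are assumptions):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Write};

// Maximum estimated memory one portion may occupy; the article calls this
// max_mem_use / MAX_MEM_USE. The concrete value is an assumption.
const MAX_MEM_USE: usize = 500_000_000;

// Sorts one in-memory portion and writes it to a temporary file.
// Signature and file naming are assumptions; the article's version may differ.
fn sort_and_write_to_file(
    portion: &mut Vec<f64>,
    tmp_file_names: &mut Vec<String>,
) -> std::io::Result<()> {
    portion.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let name = format!("tmp_{}.txt", tmp_file_names.len());
    let mut writer = BufWriter::new(File::create(&name)?);
    for x in portion.iter() {
        writeln!(writer, "{}", x)?;
    }
    writer.flush()?;
    tmp_file_names.push(name);
    portion.clear();
    Ok(())
}

// Reads the large file in portions that fit into memory and leaves
// behind one sorted temporary file per portion.
fn split_into_sorted_files(filename: &str) -> std::io::Result<Vec<String>> {
    let reader = BufReader::with_capacity(10_000_000, File::open(filename)?);
    let mut tmp_file_names = Vec::new();
    let mut portion = Vec::new();
    for line in reader.lines() {
        portion.push(line?.parse::<f64>().unwrap());
        // Rough estimate: 8 bytes per f64 held in the vector.
        if portion.len() * 8 >= MAX_MEM_USE {
            sort_and_write_to_file(&mut portion, &mut tmp_file_names)?;
        }
    }
    if !portion.is_empty() {
        sort_and_write_to_file(&mut portion, &mut tmp_file_names)?;
    }
    Ok(tmp_file_names)
}

fn main() -> std::io::Result<()> {
    // Tiny demo input in place of the 20G file.
    std::fs::write("demo_input.txt", "0.5\n0.1\n0.9\n")?;
    let tmp_files = split_into_sorted_files("demo_input.txt")?;
    println!("wrote {} sorted temporary file(s)", tmp_files.len());
    Ok(())
}
```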

The sorting used here is just Rust's standard sorting method on a Vec (at the time, a merge-sort-like algorithm):

v.sort_by(|a, b| a.partial_cmp(b).unwrap());

We need to use partial_cmp since f64 only implements the PartialOrd trait (not Ord).
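As a quick illustration (note that the unwrap panics if a NaN shows up; the logistic-map data contains none, so it is safe here):

```rust
fn main() {
    let mut v: Vec<f64> = vec![0.8645, 0.1, 0.445];
    // f64 is only PartialOrd because NaN compares to nothing,
    // so we sort via partial_cmp and unwrap the resulting Option.
    v.sort_by(|a, b| a.partial_cmp(b).unwrap());
    assert_eq!(v, vec![0.1, 0.445, 0.8645]);
    println!("{:?}", v);
}
```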

Still interesting is the method merge. Here, buffered readable streams are put onto each temporary data file, and these streams are then read line by line (buffered) and compared with each other. The lowest value is written (buffered) to a result file. We only read the next line from a given stream in case its value turned out to be the lowest:

active_readers is a vector holding Lines-iterators over the buffered readable streams onto all the temporary data files (tmp_file_names). Then a vector named values is created that initially holds the first line from each temporary file. We repeatedly search for the lowest value in values and write it to the result file: writeln!(file, "{}", min_val);

In this case, we replace the corresponding index of values with the next line of the corresponding temporary file, should it not have reached EOF.

This way, we are guaranteed to produce an ascendingly sorted list in one 'large' result file.
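A sketch of such a merge over the temporary files (the names active_readers, values and tmp_file_names come from the article's description; the function signature and the parsing details are assumptions):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, BufWriter, Lines, Write};

// k-way merge of already-sorted temporary files into one result file.
// Signature and parsing are assumptions based on the article's description.
fn merge(tmp_file_names: &[String], result_name: &str) -> std::io::Result<()> {
    // One buffered Lines-iterator per temporary file.
    let mut active_readers: Vec<Lines<BufReader<File>>> = tmp_file_names
        .iter()
        .map(|name| Ok(BufReader::new(File::open(name)?).lines()))
        .collect::<std::io::Result<Vec<_>>>()?;

    // `values` holds the current front element of each stream (None = EOF).
    let mut values: Vec<Option<f64>> = active_readers
        .iter_mut()
        .map(|r| r.next().map(|l| l.unwrap().parse().unwrap()))
        .collect();

    let mut writer = BufWriter::new(File::create(result_name)?);
    loop {
        // Find the stream currently holding the smallest value.
        let min_idx = match values
            .iter()
            .enumerate()
            .filter_map(|(i, v)| v.map(|x| (i, x)))
            .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        {
            Some((i, _)) => i,
            None => break, // all streams exhausted
        };
        writeln!(writer, "{}", values[min_idx].unwrap())?;
        // Advance only the stream that supplied the minimum.
        values[min_idx] = active_readers[min_idx]
            .next()
            .map(|l| l.unwrap().parse().unwrap());
    }
    writer.flush()
}

fn main() -> std::io::Result<()> {
    // Two tiny sorted chunks in place of real temporary files.
    std::fs::write("chunk_0.txt", "0.1\n0.5\n")?;
    std::fs::write("chunk_1.txt", "0.2\n0.9\n")?;
    merge(&["chunk_0.txt".into(), "chunk_1.txt".into()], "result.txt")?;
    Ok(())
}
```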

For convenience, let me paste the entire code I used to produce and sort an almost 20G file. In order to make this run on your system, you might need to adjust the global variables BUFFER_CAPACITY and MAX_MEM_USE to your system's memory. However, note that depending on these settings, this might produce a large number of temporary files.
