Rust is an up-and-coming systems programming language developed by Mozilla, the organization behind the Firefox web browser. I first heard of Rust around the onset of 2016, and was mildly curious. See, at the time the extent of my experience with systems programming was limited to just C / C++, and I was glad to see that there is some effort underway to rectify their weak-points through a brand new language. However, it is only now (May of 2017) that I have finally decided to buckle down and start investing some time in learning it.
After giving the Rust Book (2nd edition) a not so brief read, I set out to test my understanding of Rust by producing a small-ish sized project application. Combining that with my unrelated desire to refresh my dwindling memory of what I know about computer networking, I decided to kill two birds with one stone by implementing a bare-bones downloader of web content. To up the difficulty just a notch, I also wished it to fetch and process multiple parts of the target content in parallel, thereby cutting down overall download time to a fraction of its original value.
I gave the application the code name 'Squid Downloader', and drew this handsome logo to accompany it:
.. . :. ..:. :...: : o. .. :..:. :. :. .:: ---- .:. :. . : . / O O\ .: o :. o .\ / : . . : ..:///\\\ o . : . : \\\/// o .:. . :.. ///\\\ ..: : : . o \\\///. :. o .: : o /.\\. :. .. : .:. : : / :. :.:
Table of contents
Simplified, a 'download' is the request for and then transfer of data from a server to a client. The data being transferred does not traverse the medium between the server and client as one homogeneous unit, but rather as a stream of little pieces (more commonly known as packets) arriving one after the other.
The most basic download strategy consists of (not surprisingly) a single such stream, that over time transfers the entirety of the target data content to the client. However, if we introduce the possibility of concurrent operations to the equation, then a more appealing strategy becomes available.
First, regard the target content as a sequence of elementary units of information (when the content is digital it is often convenient to use bytes, and so we shall). Now, given that the target content \(T\) is made up of \(n\) bytes in sequence \((x_1, x_2, \ldots x_n)\), divide it into a set \(C\) of \(m\) chunks taking the form:
Fetching data in a single stream would then be equivalent to "dividing" \(T\) into a single chunk \(R(1, n+1)\) and requesting that from the server. But with concurrency available to us, we can divide \(T\) into several chunks \(c_1, c_2, \ldots c_m\) and retrieve their contents through several streams simultaneously. Once the data has arrived at the client, all that is left to do is to glue these chunks back together in the correct order to reproduce the original sequence of bytes of the target content.
HTTP does allow for clients to request ranges of the target content by including the relevant header in the request, so why is it not possible for a client to have \(m\) be arbitrarily large (up to \(m = n\)) to request many bytes in parallel? Well, technically it is possible, but in practice pursuing such greed is futile for many reasons, with these here being the principal ones:
- The network's data throughput is insufficient.
- The client/server machine's processing power is insufficient.
- The server forbids clients from making too many requests in a short interval of time (to prevent abuse or because such a service is reserved for specific clients).
- The server refuses to open too many streams concurrently to the same client (again, to prevent abuse or because such a service is reserved for specific clients).
Therefore, in practice the number of open concurrent streams at a time is rather modest, but the benefit of partial content fetching is still sometimes noticeable. For example: while testing the application during development, in one instance I measured a 50% reduction in download time. Of course, the degree of improvement varies on a case-by-case basis, depending on the client and server machines, the network infrastructure connecting them, and the content being requested.