I've been spending some time on the #bittorrent channel on Freenode,
to see if I could get any good information from other bittorrent
client developers. After a rough initial reception, I did get some
good suggestions, as well as some tidbits of information I wasn't
aware of.
One of the tidbits was that most of the bittorrent client developers
frowned on the use of selective downloading for torrents, to the
point where some refuse to implement it in their clients (apparently
the latest
mainline client doesn't even have it, though I
haven't checked due to license issues).
This concerned me, as the
DebTorrent client I'm working on
will rely heavily on selective downloading, so that users only
download the packages they want to install. It's not clear to me
that this will be a real issue, though. I'm sure it will make it
more difficult to find a rare package in a large swarm of peers,
but that's to be expected, and is nicely solved by falling back to an
HTTP download from a backup
mirror. This backup downloading could become extreme,
though, making the bittorrent-like download (peer-to-peer) mostly
just an HTTP download (client-server). To see whether it will be a
problem, I ran a simulation of different-sized swarms to
determine how much unnecessary HTTP downloading would
occur.
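To make the fallback concrete, here's a minimal Python sketch of the
logic I have in mind; the mirror URL, the pool_path attribute, and
the peer methods are illustrative stand-ins, not the actual
DebTorrent interfaces:

    import urllib.request

    MIRROR_URL = "http://ftp.debian.org/debian"   # hypothetical backup mirror

    def download_package(package, connected_peers):
        """Try the swarm first; fall back to a plain HTTP fetch from a mirror."""
        for peer in connected_peers:
            if peer.has(package):            # peer advertises the pieces we need
                return peer.fetch(package)   # normal peer-to-peer transfer
        # No connected peer has this (possibly rare) package: use the mirror.
        return urllib.request.urlopen(f"{MIRROR_URL}/{package.pool_path}").read()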
First, some assumptions to make the simulation easier:
- peers join and download one at a time, in order
- peers never leave
- peers download all the packages for their system at once (i.e.
all fresh installs, no upgrades)
- peers are all interested in the same version and architecture of
packages
- there are N total peers, and each can make C connections to other
peers
With these assumptions, the simulation becomes quite simple. I used
the
popcon data to assign packages appropriately to the N
peers. Each peer downloads its assigned packages from the C
previously joined peers when possible, and otherwise by HTTP. I ran
this multiple times, varying N and C, each time calculating how much
was downloaded through the debtorrent protocol and how much through
HTTP.
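In outline, the simulation looks something like this (a simplified
Python sketch, not the actual code I ran: assignments is a list of N
per-peer package sets drawn from the popcon data, size maps each
package to its size in bytes, and I assume each new peer connects to
a random C of the earlier peers):

    import random

    def simulate(assignments, C, size):
        """Peers join one at a time; each connects to up to C earlier peers
        and downloads its packages from them when possible, else by HTTP."""
        joined = []        # the package sets held by peers already in the swarm
        p2p = http = 0
        for wanted in assignments:
            neighbours = random.sample(joined, min(C, len(joined)))
            for pkg in wanted:
                if any(pkg in peer for peer in neighbours):
                    p2p += size[pkg]       # found among our connections
                else:
                    http += size[pkg]      # fall back to the mirror
            joined.append(wanted)
        return http / (http + p2p)         # HTTP share of the total download

Running this over a grid of N (the length of assignments) and C
values gives the curves below.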
This graph shows the percentage of the total download that used HTTP
from a mirror. The optimal line shows the minimum possible (i.e. when
all peers are connected to each other), which corresponds to only one
copy of every package (the first) being downloaded using HTTP.
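The optimal line itself is easy to compute directly from the same
inputs: when everyone is connected to everyone, only the first copy
of each distinct package comes over HTTP (same hypothetical names as
in the sketch above):

    def optimal_http_fraction(assignments, size):
        """Minimum possible HTTP share: only the first copy of each
        distinct package is ever fetched from the mirror."""
        distinct = set().union(*assignments)
        http = sum(size[p] for p in distinct)
        total = sum(size[p] for wanted in assignments for p in wanted)
        return http / total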
This confirms my previous thinking: fracturing the download
population into multiple small swarms results in a lot of
inefficiency, and a lot of HTTP downloading. Every effort will need
to be made to keep the downloading population together. (This is
especially difficult for unstable, where a new swarm could be created
twice a day due to archive updates.) Swarms of 1,000 peers or more
seem to be sufficient to minimize the HTTP downloading.
This graph shows the amount of unnecessary HTTP downloading that
occurred, which is just the difference between each line and the
optimal one in the previous plot. This shows the danger of selective
downloading: all of this unnecessary HTTP downloading occurs because
the swarm is too big for a peer's limited connections to find the
peers that have the desired package (such peers do exist, they're
just not among the ones you're connected to). Fortunately, this
unnecessary downloading seems to approach a maximum as the swarm
size increases.
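In terms of the sketches above, the quantity plotted here is just the
gap between a simulated run and the optimum for the same package
assignments:

    unnecessary = simulate(assignments, C, size) - optimal_http_fraction(assignments, size)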
For the large swarm sizes needed to make the downloading efficient,
and assuming a reasonable number of connections of 100 (bittorrent
defaults to maintaining at least 40, and large swarms usually mean
a large number of connections), we can expect to be using HTTP for
about 4% of the download. This number is surprisingly low due to the
popularity distribution of packages (only rare ones are hard to
find, but they aren't downloaded very often). I think 4% is
manageable in our situation, given that there is a backup method for
finding the rare packages, though it does show the need to have that
backup method available. This is probably more of a problem for
regular bittorrent clients, which have no backup method.