File Sharing Concepts Page

Before you proceed, you should be warned: this is pretty technical stuff ahead. Normal users shouldn't be here. Visitors from Blocks, InfoAnarchy.org, Ars Technica OpenForum, ShouldExist.org, eDonkey, or anywhere else -- come on in! I've been expecting you! :)

[Update 2:45 AM Monday July 2nd -- I last updated this in APRIL? I need to get back into this. Even if I don't have the coding talent to actually implement this design, at least I should be able to create a GUI for other coders to build off of.]

I have a FAQ now!

(Manual hit counter: 902 hits as of 2:45 AM July 2nd -- mostly from search engines now.)

Update Handling: Each individual numbered part (left-indented, big type) will have updates for that section added at the end unless otherwise noted. If I only update the last section, all the updates will be at the end of the document. I might sneak an update in at the end of part 0, though...so you should check the end of part 0 also (for example).

I would make this a message board, but I don't know how, and don't really want to take the time to dig the information up myself. If you think something should go here, or can think of a resource I should consider, please email me at [address shown as a graphic]. Should I organize or format the page in a different way? Did I explain something unclearly? Please, do something you'd never do for any other page author -- send me an email. (For Lynx users...sorry...in ICAO phonetic alphabet form: Mike 'at' Mike Sierra Papa Echo November Charlie Echo Romeo 'dot' November Echo Tango)

I'd like to reiterate: there is not now, nor will there ever be, a company behind this page or this work. I don't collect email addresses and forward them to people. There are no ad banners on this entire site, right? :) I even spam-proofed my email address by putting it in a graphic.


Here's a collection of concepts and resources for file sharing. I'm trying to find a way to put them all together, without making my brain explode.

TODO: give credit to sites and programs I've taken ideas from. They deserve the credit and links.


Part 0: sketch and doodle

The whole idea here is to take a bunch of cool ideas and concepts, combine them together, and see if I can figure out how they all fit together. If I can do that, I'll have created a decent framework for a future p2p application. Feel free to skip this part if you like. So, here are some random ideas:

Batch downloads. I like the idea of training the user to not expect his data to arrive immediately, or to think that his file must come from one user. eDonkey2k does a great job of this -- you can safely queue up gigs of files, come back in a week, and a majority of the files will be complete. Try that on Napster, and you'll be lucky to get half of your downloads complete.

Virtual connections that aren't IP connections. I think persistent nodes are the key to having a good self-organizing network, but how do you make a persistent node out of a dial-up modem user? Give the node something other than his IP number or connection number to identify himself by: perhaps a public key pair? It remains to be seen how we're going to implement a public key trust network, but keep all of the nodes anonymous -- and convince the users that they're really anonymous.

The big one: optional security. Users need to be able to pick how much security they need for various parts of the system. This will be discussed in detail below...but completely secure systems are so difficult to use, and so wasteful of resources, as to be unusable...and completely insecure systems are so open that individuals and companies can easily attack and abuse the system -- for example, Napster's servers were easy to attack, because the index and location servers there had no anonymity protection at all.

Mixnets. I like mixnets. :) This falls under optional security: some users are going to want anonymity protection and traffic deniability, and will be willing to sacrifice the extra bandwidth and processing power required. Mixnets are only bad and wasteful in Gnutella because Gnutella tried to do indexing and namespace stuff through broadcast messages. A routed, non-broadcast mixnet is very efficient, so to speak -- messages get where they're supposed to go, and they 'waste' precisely as much bandwidth as there are hops in the route. This isn't so bad for file data, if all or most of the machines in the chain are 'interested' in the file, and file data is cached and redistributed anyway.

I don't know what to call this, but I like the idea of maximizing resource utility by using a simple 'community feedback' system. If I (a human) am a member of a community -- a chat room for example -- I'm going to be constantly gauging what benefit that connection with a community is giving me. I'm going to have a constant running 'boredom meter' that tells me whether or not participation in this community is worthwhile. (This 'boredom meter' has an analog in filesharing: cost/benefit estimation.) If I find that I'm getting bored -- not enough perceived benefit for the cost -- I might complain, or I might leave. Complain or leave -- that seems very simple, but we should be able to produce some sophisticated grouping behavior with those two impulses. We'll have to define what 'complaining' means down below, though.

Encryption tunnels. What an interesting way to combat 'right and ability to control' and 'direct infringement' -- let people pass along traffic, but not be capable of seeing what's in the traffic. Use this and some trust over a mixnet, and you have an untraceable OpenNap server: put the OpenNap server on a mixnet, such that it only connects directly to certain trusted peers. The OpenNap server advertises its presence through the mixnet -- people know to connect to it through its 'mixnet'-network address and public key. Messages are encrypted and passed through the mixnet...and now it becomes very difficult to find out the identity of this OpenNap server. This isn't foolproof -- obviously, with enough resources, someone interested in one specific server could find out the machine's identity. However, now this can't be automated. If we design a user interface that makes it easy to set up a trust network like this, the complexity of the network could make it very difficult and costly to attack OpenNap servers en-masse. (I use OpenNap because there's been recent media coverage. Our system will probably use something completely different -- but you get the idea.)

Does all of this seem seedy? Do you think people will assume that anyone who participates in any of this extra security or identity protection is automatically a criminal? Remember that this is what computers do -- they take complicated things, and take the manual labor out of them. Sure, some of these methods may seem like seedy criminal behavior turned digital -- but this behavior is usually criminal in real life because it's so costly! It takes time and effort to route anonymous messages around -- take envelopes out of the mailbox, unwrap only one envelope, and mail it out again. Pass things around by word-of-mouth only. Use aliases. In real life, these things are difficult to do and take time and effort...so it can be concluded that the people doing them probably need the extra security or protection. That is, they're probably doing something illegal, so the extra 'cost' is worth it. But this is digital -- these are computers we're talking about. It's very easy to let the computer stand out on the streetcorner for us. We're not peddling high-value illegal material -- many of us merely don't want certain advertising companies using our personal information to enhance their seedy business. This 'shifty behavior' becomes worthwhile at the half-penny-per-transaction level, because computers do all the work. Were it the real world, this same kind of 'shifty behavior' would only be justified at the tens-of-dollars-per-transaction level.

I wonder what encrypted tunnels do to community feedback. If data is encrypted, I can't ascertain the value of the traffic -- it's all deadweight to me. It does nothing for me but consume my bandwidth. Unless it's traffic going to and from somebody I care about. So it seems people who allow traffic to be routed through them will probably either just be generous, and require that the routed encrypted traffic be a small percentage of their total bandwidth, or check that the tunnel traffic is going to and from a known 'friend' (so they know the traffic is probably worth it).

Ratios. I know some people like ratios -- I just answered a thread in eDonkey's forum, where some people were wondering why their movie download wasn't finishing, and weren't sure if it's fair for several people to get that partial movie from them, while they're trying to finish that movie from someone else. This is all part of that community feedback stuff. It isn't fair to think that you can always download from the community without uploading. (Perhaps the amount of resources contributed by generous users will equal the amount of resources consumed by leechers...) It remains to be seen. I wonder if broadband users will be willing to contribute a small fraction, perhaps 1 KB/sec, of their bandwidth toward encrypted tunneled traffic -- traffic that will appear to have no value to them, since they can't read or cache or benefit from it. I wonder if 'ratio nuts' can be convinced that (in the 'take only pictures, leave only footprints' sense) they have used the network fairly and been good citizens...if their upstream and downstream bandwidth is the same. If they download five movies at 3 GB total, then they must upload 3 GB of data before their conscience can be clear. Some enforcement of this is possible -- eDonkey already shares partial blocks of files you've downloaded. If someone can conclude that you aren't sharing your downloads, they can stop being nice to you. Perhaps.

Spam Prevention. What is email spam anyway? What's so frustrating about it? To me, email spam is an abuse of a communications path, where the 'sending side' can be either a human or a machine, but the 'receiving side' has to be a human. I get annoyed when random companies send me mass marketing messages...because I know the person sending the message doesn't care about me...just my money. Conversely, I'm excited and happy when I receive a mail message that was typed and sent by one person exclusively for me -- I know I'm communicating with someone. One way to completely prevent spam: a common-sense 'challenge/response' system. I'll come up with a 'challenge' -- something common sense, that would be very easy for a human to understand and answer, but very difficult for a computer to automatically guess. If someone wants to send me email, they have to 'prove they're human' by answering a simple question: "finish this phrase: if you're happy and you know it, clap..." for example. The computer can check for me that the person answered "your hands", and allow the message through. I merely have to rotate the challenges every now and then...most lazy users would probably only rotate challenges once they've been spammed.
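Here's a minimal sketch of that challenge/response gate, just to show how little machinery it needs. The names `ChallengeGate`, `admit` and `rotate` are made up for illustration, and the answer normalization (ignore case and punctuation) is my own assumption about what "the computer can check for me" would mean in practice.

```python
def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparing."""
    kept = "".join(ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


class ChallengeGate:
    """Hypothetical mail gate: a sender must answer the current challenge."""

    def __init__(self, challenge: str, expected: str) -> None:
        self.rotate(challenge, expected)

    def rotate(self, challenge: str, expected: str) -> None:
        """Swap in a new challenge, e.g. after the old one has been spammed."""
        self.challenge = challenge
        self.expected = normalize(expected)

    def admit(self, answer: str) -> bool:
        """Let the message through only if the answer matches."""
        return normalize(answer) == self.expected


gate = ChallengeGate(
    "finish this phrase: if you're happy and you know it, clap...",
    "your hands",
)
print(gate.admit("Your hands!"))   # a human's answer, sloppy punctuation and all
print(gate.admit("BUY NOW"))       # a bulk mailer guessing
```

A single exact-match answer is the weak point: anything with more than one reasonable answer would need a list of accepted answers instead.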

Hash Cash. In that system, though, there's a risk that people will choose challenges that are too hard -- ones that require a web search or dictionary lookup, for example, or that reference cultural icons that just don't exist where the sender is. So perhaps there should be an alternative: hash cash. The idea is that you can expend a certain amount of CPU resources as a gift or token, in exchange for the resources that will be expended by a human for reading that message. I wouldn't feel so bad receiving email spam, if I was sure that the company that sent it had to put a fast modern processor to work for two solid minutes factoring and counting to find prime numbers, to get me that message. The trick is to find an asymmetric calculation whose result is very difficult to create, but very easy to confirm. TODO: I need to find a way to do this!
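One candidate for the "hard to create, easy to confirm" calculation is a partial hash collision, which is the idea behind Adam Back's Hashcash: the sender searches for a number that, hashed together with the message, gives a digest starting with a required count of zero bits. Finding it takes many tries; checking it takes one hash. The sketch below is only an illustration under my own assumptions: SHA-256 as the hash, a `message:nonce` string layout, and the function names `mint` and `verify` are all invented here, not taken from any real client.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(message: str, nonce: int, bits: int) -> bool:
    """Cheap check: a single hash confirms the sender did the work."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()
    return leading_zero_bits(digest) >= bits


def mint(message: str, bits: int) -> int:
    """Expensive search: about 2**bits hashes on average."""
    for nonce in count():
        if verify(message, nonce, bits):
            return nonce


stamp = mint("to: me, from: stranger, subject: hello", bits=12)
print(verify("to: me, from: stranger, subject: hello", stamp, bits=12))
```

Each extra bit doubles the sender's expected work while the receiver's cost stays at one hash, so the "two solid minutes" figure becomes a matter of picking `bits` for the hardware of the day.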

Offline File Index. Many people will probably download files from the network and move them to CD. Files on CD aren't immediately available to the network...but I'm sure many people who download and burn rare files probably want to contribute back to the network. Perhaps the system should let people index their CD collections, and when files that happen to be on CD are in high demand but short supply on the network, perhaps the client could prompt the person to insert the CD with the rare data. The other idea proposed on the eDonkey forum, automatic caching of recently downloaded data, is probably something that could be built into the data store discussion, detailed elsewhere.

Mojo. Where trust and ratings are very personal things and form a 'local economy', a 'Mojo'-like system is a 'global economy'. How are global trading tokens created and allocated in real life? A central authority certifies that it will manage the creation (and scarcity) of a token. If people trust that central authority, then they accept that token. US Dollars are pretty widely accepted in the US...unless you're paying rent. (*grumble*) People generally accept that the US Treasury will behave in a predictable way. Canadian currency isn't as widely accepted down here (pretty much only banks and airports take Canadian currency) because of the increased risk and transaction cost in converting Canadian currency. Similarly, in the MojoNation system there is one central currency (and set of servers) hard-wired into the client. Suppose someone 'overthrows the Confederacy'? (forgive the Civil War reference) All that currency might be unusable if the financial system fails.

Then again, where there's a will, there's a way. Perhaps a happy medium can be found between global economy and local economy: anyone can be free to start their own 'currency server', but should realize that there isn't much merit in having as many currencies as users. Some currency will become worthless (if the currency's agent vanishes)...and perhaps some currency will become very popular and will become a standard. (Especially if the currency's agent has been around for a long time, and advertises redundant servers, famous and reputable administrators, etc.) Of course, a 'stranger in town' with no currency whatsoever, but with goods to barter, should still be allowed to do business. But depending on the community, he will either be removed from the community, or ignored, or used, or observed closely, etc. This leaves a lot of latitude for individual client rules. --credit to ArsTechnica and eDonkey readers for this discussion

[Added April 11] PDA. Perhaps the same concepts we're using for high-bandwidth utilities could be applied to low-connectivity pocket devices? Email on the road is nice and all...but perhaps it'd be nice to extend our concepts of nodes and connectivity so even a PDA could participate. (Stop picturing mp3's on your palm pilot...this is called 'reduction to absurdity'. This is intended to bend your mind a bit and make you stop thinking about Napster-alikes when you think of peer-to-peer.) The main difference in doing peer-to-peer information transfer over PDA's and cell-phone-modems instead of DSL lines is that bandwidth and storage are *vastly* more expensive...so community policies are going to be set very tight. Nobody wants their PDA to spend half its available storage and bandwidth on routing blocks of MP3 file data around. These users are going to want moderated newsgroup-like content, articles, online books, etc. These users are probably going to have a ton of time on their hands...and aren't going to want to spend that time reading spam on long business flights. (Scale this up just a bit and you'll have your average modem user.) All this started because I ordered one of those Agenda Linux PDA's -- www.agendacomputing.com.

[Added July 2] Separate the GUI from the engine. I wonder how many programmers haven't started building their dream project because there are some aspects of either the GUI or the back-end logic that still elude them, and they know they will never be able to build it all singlehanded. Perhaps the ideal way to implement this system would be with two executables that communicate over an IP socket -- one for the GUI and one for the engine.
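To make the two-executable idea concrete, here's a rough sketch of a GUI and an engine trading one-line JSON messages over a socket. Everything specific is an assumption on my part: the JSON-per-line framing, the `queue` command, and the function names. For a self-contained demo both ends live in one process and talk over `socket.socketpair()`; a real split would use a TCP connection to localhost instead.

```python
import json
import socket


def send_msg(sock: socket.socket, msg: dict) -> None:
    """Frame each message as one line of JSON."""
    sock.sendall(json.dumps(msg).encode("utf-8") + b"\n")


def recv_msg(reader) -> dict:
    """Read one line of JSON from a file-like socket reader."""
    line = reader.readline()
    if not line:
        raise ConnectionError("peer closed the connection")
    return json.loads(line)


def engine_handle(request: dict) -> dict:
    """Hypothetical engine logic: only knows how to queue a download."""
    if request.get("cmd") == "queue":
        return {"ok": True, "queued": request["name"]}
    return {"ok": False, "error": "unknown command"}


def demo() -> dict:
    """One request/reply round trip between the 'GUI' end and the 'engine' end."""
    gui_sock, engine_sock = socket.socketpair()
    with gui_sock, engine_sock, \
            gui_sock.makefile("rb") as gui_reader, \
            engine_sock.makefile("rb") as engine_reader:
        send_msg(gui_sock, {"cmd": "queue", "name": "episode01.avi"})
        request = recv_msg(engine_reader)
        send_msg(engine_sock, engine_handle(request))
        return recv_msg(gui_reader)


print(demo())
```

The point of a text protocol like this is that one coder can write the engine and another the GUI, and each can test against the other with nothing fancier than a terminal.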

Is P2P new? Some people have complained about the fame this 'p2p' thing has received, and claimed that it really isn't anything new. My opinion on this: it's true that UNIX servers and their administrators have always been able to do amazing things. Between the talk daemon, ftp and web servers, nntp servers, everything these nifty p2p clients can do was already possible, and has been done for many years. P2P is still special, though. These P2P programs take all of these common general-purpose Windows PCs, their users with restrictive internet connections and no special experience, and turn them into information publishers. What once required special experience and operating systems now comes in a neat little package.

Part 1: Concept Categories

Network Transports

These are the methods of communication used to get information from one peer to another. Transports can be simple direct-connect protocols, tunneling or masquerading mechanisms, traffic-analysis-avoidance measures, etc.

Message-Passing

These represent a way to pass signals and messages from one node to another. A network transport of some kind has to already exist between them. This would more accurately be called 'datagram transport' -- the important details are that the data being passed is more likely to be of interest only to the sender and recipient. The data probably expires very quickly, and should not be cached for later replay.

Data Store

This represents data of interest to more than one person. A popular web page or MP3 file is a perfect example of this -- more than one person is likely to be interested in this information. Data stores must have one built-in namespace. The method of storing data can vary -- data can be encrypted, hidden behind a cryptic namespace, and may or may not even have a filename.

Namespace

A namespace is a relationship between all objects of a certain type and some kind of name for each object. There must be many different namespaces: a data store that deliberately doesn't associate filenames with data might use a cluster-number and node-number for its namespace. Computers might use 'network address' namespaces. Files correlate a 'filename' namespace with a 'data block' namespace.

Index

An index is a grouping of namespace objects with 'memes'. An index is where you figure out which namespace objects go with which concepts...whether you're searching for MP3 files by a certain artist, or searching for network nodes with a certain trust or interest metric.

Trust Network

A trust network is a system of: nyms, certificates, and assertions. (This uses public-key cryptography, so it helps to be familiar with PGP and the network-of-trust present there. PGP is really powerful stuff -- military-grade encryption for the little guy. If it would be helpful, email me at...the email address in the graphic above...and I'll write up an explanation of PGP and this public key stuff.) A nym is a public key pair -- only one individual holds the secret key, so this nym can be identified and can make trusted assertions, even when its identity really isn't known. An assertion is some kind of statement made using this public key stuff -- like 'this is my public key' or 'This file is from such-and-such series, episode such-and-such, and is the complete episode and of good quality.' or 'This network node has been spamming me for the past several hours! If you trust me and receive this certificate, then don't give services to this node!' A certificate is an assertion signed by a nym and fixed into a data file or database entry somewhere. There's a window of identity vulnerability when a certificate is published -- someone could be monitoring all of a user's connections, and see that they published a certain certificate. So for a user to have an 'anonymous nym' -- publish one-sentence reviews and group files and tell the world about content -- but not tell the entire world which person is associated with all of that data (a marketing company's dream) then we're going to need to publish these certificates anonymously somehow. (Perhaps relay the certificates through a trusted peer.)

Community Policy

A Community Policy is a collection of rules and guidelines which exist on one node only. The node's software will be assigning and using resources on behalf of the user, so the user should be able to instruct the software in how it should act. The client will make decisions based on the perceived benefit and perceived cost of actions it's deciding on. The user will need to tell the client how much benefit/cost ratio is allowed, and how to weigh different kinds of resources. (For example...I have DSL, so bandwidth is less expensive to me than for a 28.8 modem user. Anime movies and episodes, especially subtitled, are more valuable to me than Hollywood movies.) The client can then make informed decisions on my behalf. (It might encourage people to proxy their anime transfers through me, because then I get to keep a copy!)
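As a rough illustration of a client weighing benefit against cost on the user's behalf, here's a small sketch. The class name, the weight tables, and every number in them are invented for the example; the only things taken from the text are the ideas that bandwidth is cheaper for a DSL user than a modem user and that some content is worth more to a given user than other content.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CommunityPolicy:
    """Hypothetical per-node policy: how I price resources and value content."""
    cost_weights: Dict[str, float]     # price per unit of each resource
    benefit_weights: Dict[str, float]  # value per MB of each content category
    min_ratio: float = 1.0             # lowest benefit/cost ratio I'll accept

    def cost(self, resources: Dict[str, float]) -> float:
        return sum(self.cost_weights.get(name, 1.0) * amount
                   for name, amount in resources.items())

    def benefit(self, category: str, size_mb: float) -> float:
        return self.benefit_weights.get(category, 0.0) * size_mb

    def accepts(self, category: str, size_mb: float,
                resources: Dict[str, float]) -> bool:
        """Decide whether an action is worth doing under this policy."""
        total_cost = self.cost(resources)
        if total_cost == 0:
            return True
        return self.benefit(category, size_mb) / total_cost >= self.min_ratio


tastes = {"anime": 2.0, "hollywood": 0.2}
dsl_user = CommunityPolicy({"bandwidth_mb": 0.5}, tastes)
modem_user = CommunityPolicy({"bandwidth_mb": 5.0}, tastes)

# Proxying a 100 MB file costs 200 MB of bandwidth (100 in, 100 out),
# but the proxy gets to keep a copy.
proxy_cost = {"bandwidth_mb": 200}
print(dsl_user.accepts("anime", 100, proxy_cost))
print(dsl_user.accepts("hollywood", 100, proxy_cost))
print(modem_user.accepts("anime", 100, proxy_cost))
```

With these made-up weights the DSL user happily proxies anime, turns down the Hollywood movie, and the modem user turns down both, which is the kind of behavior the paragraph above is after.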

Part 2: Concept Correlation and Discussion

The goal here is to make a list of every combination of the concepts listed in part 1. I'll then brainstorm and discuss how the two items work together, what they might need from each other, etc. This will be long and tiring, but perhaps after this thought exercise some interesting implementation details will become obvious, and we'll be that much closer to designing the next-generation filesharing client!

Network Transports & Message-Passing

One of the concepts in network transports is tunneled connections. Those connections will need to be yet another kind of datagram to be routed around.

Network Transports & Data Store

As data gets routed around the network, it can be stored and cached along the way. If data is never cached, then on a route N nodes long each data block crosses (N-1) links, so it costs the community (N-1) times the block's size in bandwidth. Caching reduces this load.
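A tiny worked example of that arithmetic, under my own simplifying assumptions: nodes on the route are numbered from 0 (the requester) up to N-1 (the original source), every link costs the same, and a cached copy at node k means the block only has to travel k links. The function name is made up.

```python
from typing import Optional


def link_crossings(route_nodes: int, cached_at: Optional[int] = None) -> int:
    """How many links one block crosses to reach the requester (node 0).

    route_nodes: number of nodes on the route, source included.
    cached_at:   index of the nearest node already holding the block,
                 or None if only the source (node route_nodes - 1) has it.
    """
    if route_nodes < 1:
        raise ValueError("a route needs at least one node")
    source = route_nodes - 1
    holder = source if cached_at is None else cached_at
    if not 0 <= holder <= source:
        raise ValueError("cached_at must be a node on the route")
    return holder


print(link_crossings(5))               # no caching: N-1 links
print(link_crossings(5, cached_at=2))  # a copy two hops away
```

So on a five-node route a 1 MB block costs the community 4 MB of transfer without caching, and 2 MB if the node two hops away kept a copy.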

Network Transports & Namespace

Perhaps a namespace will be required for network transports. It seems that if you're on a server-oriented (Napster-like) network, but you need to access resources that are on a private Mixnet, you will need some kind of information that tells you how to talk to the private mixnet.

Blocks has an interesting way of encoding routes into each message. Each node has a connection letter associated with each IP connection. Data can be routed around the network if you reference a network location by a routing path. I can send and receive network traffic with network node DGBAB -- the traffic gets to the location and back without problems. The machine at the other end only knows it's communicating with a network node CEADE. The conversion between those two network addresses happens one node at a time, as the message passes over the network. When someone routes a message to me they pop my node letter off of the route first. When I receive the message, I push the node letter I received the message on onto the route.

At my node, the remote node is address DGBAB.

At the first hop, my address is E and the remote address is GBAB.

At the second hop, my address is DE and the remote address is BAB.

At the third hop, my address is ADE and the remote address is AB.

At the fourth hop, my address is EADE and the remote address is B.

At the remote node, my address is CEADE.

Each of these letters is an abstraction, used to hide the true identities of each connection. For each node, those connection letters represent an IP connection to another node -- but the IP address is being hidden.
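The pop-and-push walk above can be replayed in a few lines. This is only my reading of the scheme, not code from Blocks: I keep the forward route and the return route as two separate strings, and the letters each node assigns to its incoming connection (E, D, A, E, C) are simply read off the example. The function names are invented.

```python
from typing import List, Tuple


def hop(forward: str, back: str, arrived_on: str) -> Tuple[str, str]:
    """One hop: the sender pops the next letter off the forward route,
    and the receiver pushes the letter of the connection the message came in on."""
    if not forward:
        raise ValueError("route is already exhausted")
    return forward[1:], arrived_on + back


def trace(route: str, arrival_letters: str) -> List[Tuple[str, str]]:
    """Return (my address, remote address) as seen at each node along the way."""
    forward, back = route, ""
    seen = []
    for letter in arrival_letters:
        forward, back = hop(forward, back, letter)
        seen.append((back, forward))
    return seen


for my_address, remote_address in trace("DGBAB", "EDAEC"):
    print(f"my address is {my_address!r}, remote address is {remote_address!r}")
```

The last line printed shows an empty remote address and CEADE as my address: the message has arrived, and the remote node now holds a route it can use to reply without ever seeing an IP address beyond its own neighbors.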

This works well for referencing direct connections -- it already does, in Blocks. Encrypted tunnel connections can be represented by routing letters too...but since those connections are for protection, not for network framework...it seems more likely that an encrypted tunnel connection will be always on -- survive reboots, survive network location changes, etc. That makes connection letters seem more scarce. Perhaps someone wants to give you a routing string that points to an OpenNap server -- the OpenNap server doesn't support routing, so the route should terminate at the last node on the mixnet, with one more routing letter pointing to the OpenNap server. The request goes out, comes back, and is re-encoded and sent back through the mixnet. Suppose we have a 'central server' which participates in the mixnet? It may have ten thousand connections.

So the 'namespace' is the entire routing string. This string could merely point to an IP and port...or perhaps it could point to an IP and port to hop onto a mixnet, and then some routing characters after that.

Network Transports & Index

Perhaps we should be able to index and search network connections and addresses? Not sure how to carry this off...we'll have to see if something becomes obvious in lower levels.

Network Transports & Trust Network

Network nodes and addresses are as transient as the wind...so the only way to build trust with a node is to require the node to identify itself somehow. I really don't like the idea of having to communicate with someone who both knows my physical network address (IP number) and knows my nym public key -- he can then relate the two. Also, someone can trick me into thinking that I'm supplying my nym public key to an anonymous channel when the channel actually isn't anonymous -- create a phony mixnet chain of three or four nodes, all controlled by an attacker, and coerce someone into revealing their nym to a source that could also trace their identity.

Perhaps node trust will be a mostly-unused feature:

Erm, interruption in thought: I just had a moment of clarity. Trust network public keys *ARE* encrypted tunnel endpoints! That's all these certificates are: messages. On one hand, people can choose to keep the same public key pair for every transaction. They're allowing their movements to be tracked that way -- good for trust and future behavior prediction (so I know whether or not to be nice to someone) but bad for anonymity. On the other hand, people can choose to create a new public key pair for every individual encrypted tunnel. Then we don't really know whether identities are linked or not -- just as the user wants it. Is there a way to combine the two approaches? Give someone deniability, but also the ability to say 'this well-known but anonymous anime trader says you should give me this file, ahead of everyone else in the queue.' Not sure. I'll have to sleep on this. :)

Network Transports & Community Policy

[Below added April 10, 2001] This is how you make a network self-healing and self-optimizing: use community policy rules to shape the user's network connection to match his or her needs and desires. The community policy will probably be able to affect load-balancing of connections, and handle connecting and disconnecting nodes that don't match the community policy (spammers, abusers, or even freeloaders if the user is so inclined).

This load-balancing is interesting: with each connection there's a set of transactions going on...and each transaction has a certain amount of inherent benefit and a certain inherent cost (mostly bandwidth). If the user isn't happy with the cost/benefit ratio, he can adjust it by rate-limiting the connection, or just certain kinds of packets on a connection. This decreases the cost to acceptable levels. So when you're sending encrypted tunneled traffic through the nodes of complete strangers on the mixnet, don't be surprised if your connection gets rate-limited down to oblivion. Maybe with more intelligent routing (and complaint/response or WILL/WONT/DO/DONT/CAN/CANT negotiation), we can route our traffic through someone who doesn't mind so much.
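One common way to rate-limit a connection, or one class of packets on it, is a token bucket; I'm naming that technique myself, the text above doesn't pick one. The sketch below is a minimal version with the clock passed in explicitly so it's easy to test. The 1 KB/sec figure echoes the tunnel allowance floated in Part 0; the burst size and class name are assumptions.

```python
class TokenBucket:
    """Allow `rate` bytes per second on average, with bursts up to `burst` bytes."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, nbytes: int, now: float) -> bool:
        """Refill for the time elapsed, then spend tokens if enough are left."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False


# A stranger's encrypted tunnel gets 1 KB/sec with a 2 KB burst.
tunnel = TokenBucket(rate=1024, burst=2048)
print(tunnel.allow(2048, now=0.0))  # the burst allowance covers this
print(tunnel.allow(1, now=0.0))     # bucket is empty, packet waits or is dropped
print(tunnel.allow(1024, now=1.0))  # one second later, 1 KB has refilled
```

Keeping one bucket per traffic class on each connection would give the "just certain kinds of packets" behavior: tunnel traffic from strangers gets a small bucket, traffic for a known friend gets a big one.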

Message-Passing & Data Store

These are literally opposites of each other -- messages have very limited lifespan, like 'somebody on this connection is spamming me!'...while data store objects have a large lifespan, like a nym key or a block of data from a shared file or something.

Perhaps there should be a wider range than just temporary or permanent. Ping packets have shorter lives than ICQ instant messages (eFront to the contrary), which have shorter lives than email messages, which have shorter lives than HTML pages. Perhaps data store objects become messages while they're on the wire, and it's up to the individual data store to decide the permanency of each object.
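A minimal sketch of a data store that decides permanency per object, assuming each object is simply stored with a lifetime in seconds (or no lifetime at all for permanent data). The class name and the example lifetimes are placeholders I picked, not measured values.

```python
from typing import Any, Dict, Optional, Tuple


class ExpiringStore:
    """Hypothetical data store where every object carries its own lifetime."""

    def __init__(self) -> None:
        self._items: Dict[str, Tuple[Any, Optional[float]]] = {}

    def put(self, key: str, value: Any,
            lifetime: Optional[float], now: float) -> None:
        """Store a value; lifetime=None means keep it until explicitly removed."""
        expires = None if lifetime is None else now + lifetime
        self._items[key] = (value, expires)

    def get(self, key: str, now: float) -> Optional[Any]:
        """Return the value, or None if it is missing or has expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and now >= expires:
            del self._items[key]
            return None
        return value


store = ExpiringStore()
store.put("ping-reply", "pong", lifetime=5, now=0)          # seconds
store.put("instant-message", "hi!", lifetime=86400, now=0)  # about a day
store.put("nym-key", "public key bytes", lifetime=None, now=0)
print(store.get("ping-reply", now=10))
print(store.get("instant-message", now=10))
print(store.get("nym-key", now=10**9))
```

Seen this way, a 'message' is just a data store object with a very short lifetime, and the receiving node's policy could shorten or lengthen whatever lifetime the sender suggested.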

Message-Passing & Namespace

Again, oil and water. Objects will be referred to by their identities according to a namespace, and those references will happen inside messages sometimes. I can't think of any other similarity.

Message-Passing & Index

You'll use messages to handle search queries and results, most likely. Ah, the Gnutella problem. Central index servers are about as efficient as you can get -- as long as someone doesn't mind being the index server. I think I explained above how you can have central index servers without making the real-world identity or real network location of the index server known.

Message-Passing & Trust Network

Messages can have signatures or encryption applied using the trust network. I don't want to think of these encrypted messages as a virtual connection, because perhaps the connection model isn't relevant here. Once someone has your public key and you have theirs, you two have a 'connection' capability already. You just send messages to each other. A 'connection' can go dormant for weeks, as long as both sides have some memory of who the other is -- the other side's public key stored somewhere along with some state information.

Message-Passing & Community Policy

I treated 'complain and leave' impulses very briefly above. I imagine that an existing small network will be a 'community' where each node knows each other node, and they all probably know each other from somewhere else anyway. Suppose a newcomer tries to join the community. How does that newcomer know how to act? When the newcomer does something, he'll get feedback from the other nodes -- 'quit asking for mp3 files; we're an anime fansub community' or 'go away, spammer' or something like that. These can be modeled as complaints -- perhaps an impulse sent automatically to try to convince the other node to either conform to your policy, or to go away. These impulses could even take a 'negotiation' form: WILL/WONT/DO/DONT/CAN/CANT messages. (Yes, I stole this from the RFC describing telnet.)
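Here's a rough sketch of how one node might answer such impulses. In telnet, DO asks the other side to start doing something, DONT asks it to stop, and WILL/WONT are the agreeing and refusing answers; CAN/CANT aren't telnet verbs, so treating CANT as "I'm not even able to" is purely my guess at what they'd mean here. The topic strings and function name are invented.

```python
from enum import Enum
from typing import Set


class Verb(Enum):
    WILL = "WILL"
    WONT = "WONT"
    DO = "DO"
    DONT = "DONT"
    CAN = "CAN"
    CANT = "CANT"


def respond(verb: Verb, topic: str,
            capabilities: Set[str], willing: Set[str]) -> Verb:
    """Answer a DO or DONT request according to this node's abilities and policy."""
    if verb is Verb.DO:
        if topic not in capabilities:
            return Verb.CANT          # not able to, policy aside
        return Verb.WILL if topic in willing else Verb.WONT
    if verb is Verb.DONT:
        return Verb.WONT              # agree to stop (or never start)
    raise ValueError(f"{verb.value} is an answer, not a request")


# An anime fansub node: it could serve mp3s, but its policy says no.
capabilities = {"share-anime", "share-mp3"}
willing = {"share-anime"}
for request, topic in [(Verb.DO, "share-anime"),
                       (Verb.DO, "share-mp3"),
                       (Verb.DO, "relay-tunnel"),
                       (Verb.DONT, "send-ads")]:
    answer = respond(request, topic, capabilities, willing)
    print(f"{request.value} {topic} -> {answer.value} {topic}")
```

A newcomer that keeps hearing WONT for everything it asks has its answer about whether it belongs in this community, which is the 'complain' half of complain-or-leave.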

Data Store & Namespace

A namespace will be needed for objects inside a data store. (Otherwise, how would we find items we've stored there?) Yes, we all could've guessed.

Data Store & Index

[Below added April 11, 2001] These two seem to be very closely related. Indexes help you find data on both your own data store and on remote data stores. When an index is local, its size and efficiency affect disk space, and when an index is remote, its size and efficiency affect bandwidth.

Data Store & Trust Network

[Below added July 2, 2001] I should describe here how we plan to store trust network information in a data store.

Data Store & Community Policy

Namespace & Index

Namespace & Trust Network

Perhaps some names should be reachable or resolvable only by trusted parties?

Consider my OpenNap server example above. Perhaps the route to the OpenNap server would consist of an IP number to start with...routing letters going from host-to-host...and then a special name at the end that jumps to the OpenNap server. Only some trusted parties can tell what the actual IP of that server is from the special name.

Namespace & Community Policy

Index & Trust Network

Index & Community Policy

Trust Network & Community Policy

(Quick list of concepts, for my ease in cutting-and-pasting: Network Transports, Message-Passing, Data Store, Namespace, Index, Trust Network, Community Policy)