Traffic Conditioning For Inexpensive Installations
Business-Class Performance From Free Software and Commodity Hardware
By Michael Spencer

Broadband internet connections don't handle heavy server loads very well. When many connections are in contention for the same limited upstream bandwidth, problems occur that degrade overall link performance. I have found a solution that can be implemented with inexpensive software on existing hardware, and that sustains reasonable performance even under extremely heavy load. I will describe the problems that occur when dozens of connections all compete for bandwidth, offer some possible theoretical solutions, and then describe in detail an implementation that solves these problems.

Before I talk about the problem, you might want to know whether the solution applies to you. My proposed solution only works well if your upstream bandwidth is constant. If you have a home DSL connection with a fixed upstream rate, this solution is ideal. If you share a university or company internet connection, and you don't actually administer that connection, this solution won't work well for you. For cable modem users it might or might not work, depending on whether their upstream rate is fixed in hardware or simply uses whatever capacity is left over after all other users are done with it.

My proposed solution uses advanced networking features in the 2.4 series Linux kernel. You will need a Linux machine responsible for routing all Internet-bound traffic to your border router (cable modem, DSL modem, etc.). The ideal way to do this is to put two network interfaces in a dedicated machine: one interface on the local network segment, and one interface leading directly to the border router. Not everyone has this kind of hardware just lying around, but it may still be possible to implement this solution even if your Linux machine and border router share the same network segment with the rest of your network, or even if you only have one machine connected to your border router (or your border router is a card installed inside your machine).

If you only have one network segment but you already have a dedicated Linux machine on the network, it may be possible to reconfigure your network so all traffic must pass through the Linux machine. That is the configuration I use at home, and the sample configuration detailed below assumes this.

If you only have one computer connected via ethernet to a border router, or if your border router is a card installed inside your computer, you may still be able to use this technique. VMware Incorporated sells a virtual machine monitor called VMware ($100 for a personal or educational license, $300 for a commercial license). You can create a limited Linux virtual machine with VMware and bind it to the network interface your Internet connection is on. Then configure the VMware machine to communicate directly with the border gateway, and configure your desktop computer to use the VMware machine as its default gateway. You will need to leave VMware running at all times when you need to use the Internet, but VMware can be configured to use as little as 16 MB of memory and very little CPU time.

Once you have a Linux machine between your border router and your network, you will need to add support for several advanced options to that machine.
You will probably need to recompile your kernel with support for the experimental Shaper device, as well as several items under "QoS and/or fair queuing": the CBQ and SFQ packet schedulers, the U32 classifier, and Traffic Policing. These options can be found under "Networking Options" in the kernel configuration program. You will also need two userspace tools: tc and shapecfg. tc can be found in the iproute2 package in your distribution, or source is available from ftp://ftp.inr.ac.ru/ip-routing/iproute2-2.2.4-now-ss??????.tar.gz . Shapecfg already ships with most Linux distributions.

Before we talk about implementation details, it helps to understand why this solution is needed in the first place. The biggest problem is that most inexpensive border gateway devices (cable modems, DSL routers) are configured with rather large packet queues. If you send them data faster than they can transmit it, they will happily queue up nearly two full seconds' worth of data. When that queue is almost completely full, any new packet you send will either have to wait up to 2000 ms before being transmitted, or it will be dropped entirely.

We must understand the nature of the problem, and the kinds of things any possible solution can hope to achieve. We have complete, perfect control over our upstream, but not much control at all over our downstream. This causes problems with TCP's error recovery mechanisms. If one stream is sending huge amounts of data, filling up the border router's queue, while a second stream is sending small amounts of data to an unreliable host, the unreliable host's requests for retransmission will have to wait up to an additional 2000 ms before they see the data they requested.

Time-sensitive traffic is also hurt by a long send queue. A stream that is time-insensitive but moving a lot of data gets its best transfer rate when the router's buffer is kept full. If another stream running at the same time is time-sensitive but not moving much data, it has to endure the high latency induced by the full buffer in the router.

Last, TCP doesn't actually cooperate with other connections to ensure a fair division of bandwidth. If you have several streams, each to different hosts, each stream is going to try to use as much of the link as possible without flooding it. Foreign hosts that can acknowledge packets quickly may receive a disproportionately large share of the available bandwidth.

The Linux kernel offers a variety of tools for solving these problems. The first is the shaper device. This is a virtual ethernet device that's bound to an existing ethernet device and configured with a maximum speed. For example, if the physical ethernet interface you're shaping is eth0, the shaped interface is shaper0. All packets sent via interface shaper0 actually exit via interface eth0. The other tools are all part of the advanced routing and traffic control subsystem.

The second tool is the queuing discipline, called a qdisc. When packets are enqueued for transmission but can't be transmitted immediately, the system has a choice of several behaviors to follow when dequeuing and transmitting packets. If no other behavior is specified, Linux uses the "pfifo" (plain first-in first-out) qdisc, which simply transmits packets in the order they were received. One alternative is the "sfq" (Stochastic Fair Queuing) qdisc. This qdisc tries to track the individual flows present in the queue and allocate a "virtual queue" for each. It then dequeues packets with a round-robin scheduler. The end result is that all flows have a chance to send data at an equal rate.
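As a standalone illustration (not part of the configuration built later in this article), sfq can be attached to an interface with a single command. A minimal sketch, assuming the interface is eth0:

# attach sfq as the root queuing discipline on eth0,
# rehashing flows into virtual queues every 10 seconds
tc qdisc add dev eth0 root sfq perturb 10

# remove it again, restoring the default qdisc
tc qdisc del dev eth0 root

By itself this only makes competing flows share eth0 fairly; it does nothing about the queue inside the border router, which is why the full setup below also limits the transmission rate.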
The third tool is the traffic-control class and the classifying filter. Each class represents a specific kind of traffic, as determined by filter rules. Individual classes may contain other classes, and can have maximum transmission rates or priorities assigned to them. Each class also has one qdisc assigned to it. Packets enqueue in a class, and dequeue using the qdisc assigned to that class.

I will now present an example implementation of these techniques, with observed performance results and a discussion. I designed my traffic conditioning solution with the following goals and constraints in mind:

- My internet connection is a home-office DSL line provided by Qwest, with a solid 640 kbit/sec upstream data rate. This rate isn't shared with any other subscribers.
- My border router is a Cisco 678. It supports arbitrary routing rules, but doesn't have any provisions for fair queuing or traffic shaping.
- I have three computers on my network: a Linux server called mspencer.net, a Windows 2000 desktop machine called michael.mspencer.net, and a Windows XP desktop machine called luann.mspencer.net.
- The users of luann.mspencer.net are end-users, and their perceived performance is important. They don't run any applications that send a large amount of data, so their outbound traffic should have an extremely high priority.
- The desktop machine michael.mspencer.net runs three main types of traffic: time-critical traffic like multiplayer gaming; normal interactive traffic like web browsing; and low-priority, high-throughput traffic like peer-to-peer file sharing.
- The server machine mspencer.net runs five main types of traffic: interactive sessions with users; ftp sessions for identified users with valid usernames; public http traffic with public Internet users; bulk low-priority http traffic with public Internet users requesting extremely large objects (over 50 MB); and low-priority peer-to-peer file sharing traffic.

My first task was to instruct my machines to route all traffic to and from the public Internet through my Linux server, mspencer.net. Before making any changes, my network was as follows:

- router.mspencer.net (209.180.104.206) was the Cisco 678 DSL router. Its routing table said to forward all traffic destined for 209.180.104.200/29 directly onto the ethernet interface.
- michael.mspencer.net (209.180.104.204), luann.mspencer.net (209.180.104.205), and mspencer.net (209.180.104.202) all routed outbound Internet traffic directly through router.mspencer.net (209.180.104.206).

I changed the Cisco router's routing table so that it would pass packets to mspencer.net (209.180.104.202) directly onto the ethernet segment, but would route any traffic destined for michael.mspencer.net (209.180.104.204) or luann.mspencer.net (209.180.104.205) through mspencer.net (209.180.104.202). Next I turned on IP packet forwarding on mspencer.net (209.180.104.202). I didn't modify its routing table; mspencer.net still knows it's directly connected via ethernet to both desktop machines and the router. Finally I updated the network settings for both desktop machines, telling them to use mspencer.net (209.180.104.202) instead of router.mspencer.net (209.180.104.206) for their default gateway.
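On mspencer.net itself, the forwarding change is a one-line switch. A minimal sketch, assuming a 2.4 kernel; the sysctl.conf entry assumes your distribution applies that file at boot:

# enable IP packet forwarding immediately
echo 1 > /proc/sys/net/ipv4/ip_forward

# make the setting persistent across reboots on most distributions
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf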
My network then looked like this:

- router.mspencer.net (209.180.104.206) had its routing table changed to:

cbos#show route
[TARGET]         [MASK]           [GATEWAY]        [M][P] [TYPE] [IF]    [AGE]
0.0.0.0          0.0.0.0          0.0.0.0          1      SA     WAN0-0  0
209.180.104.201  255.255.255.255  209.180.104.202  1      SHAR   ETH0    0
209.180.104.203  255.255.255.255  209.180.104.202  1      SHAR   ETH0    0
209.180.104.204  255.255.255.255  209.180.104.202  1      SHAR   ETH0    0
209.180.104.205  255.255.255.255  209.180.104.202  1      SHAR   ETH0    0
209.180.104.200  255.255.255.248  0.0.0.0          1      LA     ETH0    0
216.161.72.0     255.255.255.0    0.0.0.0          1      A      WAN0-0  0

- mspencer.net (209.180.104.202) had IP forwarding enabled.
- Both michael.mspencer.net (209.180.104.204) and luann.mspencer.net (209.180.104.205) had their default gateway changed.

I have not changed the physical topology of my network at all. I still have only one ethernet switch with four machines plugged in, and none of my machines are multi-homed. However, my Linux machine is now able to control all of my Internet-bound traffic.

My next task was to attach a class-based queue to only my Internet-bound traffic. I want my desktop machines and my Linux server to be able to communicate at full ethernet speed without causing the Linux server to believe Internet bandwidth is being consumed. I compiled support for traffic shaping and class-based queuing into my kernel and rebooted. I also downloaded binary packages for the shapecfg and tc tools.

I created a shaper device with the commands "shapecfg attach shaper0 eth0" and "shapecfg speed shaper0 10000000". This created a virtual ethernet device called shaper0 with a transmission limit of 10 mbit/sec. Please note that while the original intended use of the traffic shaper device is to actually limit your transmission rate, I used it merely to create a second ethernet device that transmits on the same physical link. I brought the shaper interface up and assigned it the same IP address as my ethernet interface. This won't cause a problem – inbound packets will always arrive via eth0, but outbound packets may be sent via eth0 or shaper0.

Next I configured a second IP address for certain Apache virtual hosts. I created an ethernet alias eth0:0 with IP 209.180.104.203, and also a matching shaper alias shaper0:0 with the same IP.

Next I adjusted my routing table so outbound traffic destined for a host on the local ethernet segment was sent through interface eth0 like normal, but outbound traffic destined for the public Internet was sent through interface shaper0. While this has no immediate effect (both interfaces send data to the same wire), it lets me single out traffic on the shaper0 interface.

[root@mspencer /root]# route -n
Kernel IP routing table
Destination     Gateway          Genmask          Flags Metric Ref    Use Iface
209.180.104.0   0.0.0.0          255.255.255.0    U     0      0        0 eth0
127.0.0.0       0.0.0.0          255.0.0.0        U     0      0        0 lo
0.0.0.0         209.180.104.206  0.0.0.0          UG    0      0        0 shaper0

I then started implementing the "Ultimate Traffic Conditioner" script found at http://lartc.org. I replaced the root queuing discipline on shaper0, giving it handle "1:" and changing it from the default pfifo (plain first-in first-out) qdisc to the cbq (class-based queuing) qdisc.

tc qdisc add dev shaper0 root handle 1: cbq avpkt 1000 bandwidth 10mbit

This qdisc doesn't perform any shaping or prioritizing of traffic on its own. Its classes will do that. Next I created a main class, called "1:1".

tc class add dev shaper0 parent 1: classid 1:1 cbq rate 530kbit allot 1600 prio 5 bounded isolated

This class has a maximum transmission rate of 530 kbit/sec. It is also marked "bounded" and "isolated". A bounded class is not allowed to borrow unused bandwidth from sibling classes, and an isolated class is not allowed to lend unused bandwidth to other siblings.
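Before going further, it is worth confirming that the qdisc and the main class actually attached to shaper0. tc can list both; a quick check, not part of the conditioner itself:

# list the queuing discipline and the classes now attached to shaper0
tc qdisc show dev shaper0
tc class show dev shaper0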
I then created two child classes, called "1:10" and "1:20".

tc class add dev shaper0 parent 1:1 classid 1:10 cbq rate 53kbit allot 1600 prio 1 avpkt 1528
tc class add dev shaper0 parent 1:1 classid 1:20 cbq rate 477kbit allot 1600 prio 2 avpkt 1528

Because these classes are siblings, and neither is bounded nor isolated, they will borrow unused bandwidth from each other. However, if both classes are being sent enough traffic to fill their queues, they will neither lend nor borrow. That is, if class 1:10 is being asked to send at 400 kbit/sec and class 1:20 is being asked to send at 600 kbit/sec, neither class will borrow from the other. They won't share bandwidth at a rate proportionate to their inputs (40% for 1:10 and 60% for 1:20); they will stick with their assigned rates (10% for 1:10 and 90% for 1:20).

I changed the default queuing discipline (qdisc) on both 1:10 and 1:20, so packets are dequeued fairly.

tc qdisc add dev shaper0 parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev shaper0 parent 1:20 handle 20: sfq perturb 10

The Stochastic Fair Queuing discipline (sfq) was explained above. It will try to identify individual flows in the traffic that passes through its queue, will create a virtual queue for each flow, and will try to dequeue packets from each virtual queue with a round-robin scheduler. The result is that if you have twenty connections all flooding the interface at their maximum speeds, each connection will receive roughly one twentieth of the available bandwidth. The perturb setting tells sfq how often, in seconds, to recompute the hash it uses to assign flows to virtual queues.

I then added filters to determine which packets will be handled by which classes.

tc filter add dev shaper0 parent 1: protocol ip prio 11 u32 match ip protocol 1 0xff flowid 1:10

The u32 classifying filter works by comparing individual bits in a packet against match criteria. The "protocol" keyword is an alias for the location of the protocol number in the IP header. The next number (1) specifies what value to compare with (protocol number 1 is ICMP), and the number after that (0xff) is a bitmask which specifies which bits in the specified location should be compared. 0xff corresponds to 11111111, which means that all eight bits must match; 0x01 corresponds to 00000001, which means that only the rightmost bit would have to match. Traffic matching this filter is handled by class 1:10.

tc filter add dev shaper0 parent 1: protocol ip prio 10 u32 match ip tos 0x10 0xff flowid 1:10

This filter checks the value of the type-of-service field in outbound packets (0x10 is "minimize delay"). Public Internet routers generally don't honor the requested type of service, but some applications (especially ssh) set the type of service on outgoing packets. This filter adds interactive ssh session traffic to the high-priority class 1:10.

tc filter add dev shaper0 parent 1: protocol ip prio 12 u32 match ip protocol 6 0xff match u8 0x05 0x0f at 0 match u16 0x0000 0xffc0 at 2 match u8 0x10 0xff at 33 flowid 1:10

This filter checks several bits in the IP and TCP headers, to match only TCP acknowledgements. We want to give acknowledgements high priority, so downloads go faster.

tc filter add dev shaper0 parent 1: protocol ip prio 40 u32 match ip dst 0.0.0.0/0 flowid 1:20

This filter is a catch-all that should send all other traffic to 1:20.
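The TCP-acknowledgement filter is the least obvious of the four, so here is that same command again, broken across lines, with my reading of each match (an interpretation, not an authoritative reference):

# my reading of the match fields:
#   protocol 6 0xff        - IP protocol field equals 6 (TCP)
#   u8 0x05 0x0f at 0      - IP header length is 5 words, i.e. no IP options
#   u16 0x0000 0xffc0 at 2 - IP total length is under 64 bytes (ACK-sized)
#   u8 0x10 0xff at 33     - TCP flags byte (offset 20 + 13) is exactly ACK
tc filter add dev shaper0 parent 1: protocol ip prio 12 u32 \
    match ip protocol 6 0xff \
    match u8 0x05 0x0f at 0 \
    match u16 0x0000 0xffc0 at 2 \
    match u8 0x10 0xff at 33 \
    flowid 1:10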
The net result of this is that certain kinds of traffic are selected as "high-priority" and are transmitted almost immediately. All other traffic is grouped together and forwarded according to sfq's round-robin scheduler.

This had a dramatic effect on my link's performance. My average round trip (as measured with ICMP echo requests to and from my ISP's nameserver) went from 1800-2000 ms to a striking 60-90 ms. The reason is that ICMP packets were no longer required to wait with the rest of the outbound traffic; they were put at the head of the line immediately. Note that this would not have been possible if the Cisco DSL router had been allowed to maintain the outbound queue.

Before applying this sfq-based conditioner, my per-task bandwidth usage looked like this:

- Desktop michael.mspencer.net: around 93 bytes/sec
- Bulk http downloads: 40.5 KB/sec
- FTP user pseudonym, retrieving one file: 9.5 KB/sec
- FTP user dev, retrieving one file with a download accelerator (four simultaneous transfers): 4.6 KB/sec
- All other http traffic: 16.1 KB/sec

Grand total: 70.7 KB/sec. Because my upstream link can't actually send that quickly, some of that traffic was being dropped, and the DSL router's queue was kept completely full.

After applying the sfq-based conditioner, per-task bandwidth usage looked like this:

- Desktop michael.mspencer.net: around 8 bytes/sec (demand was lower)
- Bulk http downloads: 52.2 KB/sec
- FTP user pseudonym, retrieving one file: 8.4 KB/sec
- FTP user dev, retrieving one file with four simultaneous transfers: 3.7 KB/sec
- All other http traffic: 868 bytes/sec (demand must have died down)

Grand total: 65.2 KB/sec. This is very close to, but not beyond, my upstream link's capacity. No outbound packets were dropped by my DSL router, and the router's queue stayed empty.

These results were mostly satisfactory, but with some problems. FTP user pseudonym is capable of much higher transfer rates, and his transfer is more valuable to me than the bulk http downloads consuming more than 80% of my available bandwidth. Upon further examination, two http users were downloading eight to ten different sections of the same file simultaneously, using a download accelerator (user-agent "DA 5.0"). This reveals another weakness of the sfq queuing discipline: it weights each connection equally, so someone who opens many simultaneous transfers is allocated a disproportionately high amount of bandwidth.

I then implemented additional classes and classifier rules for the kinds of traffic I wanted to give lower priority to.

tc class add dev shaper0 parent 1:1 classid 1:30 cbq rate 69kbit allot 1600 prio 3 avpkt 1528
tc class add dev shaper0 parent 1:1 classid 1:40 cbq rate 69kbit allot 1600 prio 4 avpkt 1528
tc qdisc add dev shaper0 parent 1:30 handle 30: sfq perturb 10
tc qdisc add dev shaper0 parent 1:40 handle 40: sfq perturb 10

I also modified class 1:20 (normal traffic), decreasing its bandwidth allocation.

tc class change dev shaper0 parent 1:1 classid 1:20 cbq rate 371kbit allot 1600 prio 2 avpkt 1528

I then added filter rules for low-priority traffic, so those kinds of traffic would be handled by classes 1:30 and 1:40. For example:

tc filter add dev shaper0 parent 1: protocol ip prio 13 u32 match ip src 209.180.104.203 flowid 1:30

This assigned all bulk-download http traffic to class 1:30.

tc filter add dev shaper0 parent 1: protocol ip prio 26 u32 match ip sport 3072 0xfc00 flowid 1:40

This assigns all traffic with source ports in a range common to a specific peer-to-peer file sharing program to class 1:40. Note that because these two new classes are not isolated, they will donate any bandwidth they don't use to their siblings. And since the normal-priority class 1:20 is not bounded, it will borrow bandwidth from siblings if it has more demand than it can handle alone.
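To see how the new classes actually divide the upstream while traffic is flowing, the per-class counters can be inspected. A quick check (the exact output format depends on your iproute2 version):

# show per-class statistics (bytes, packets, drops, borrows) on shaper0
tc -s class show dev shaper0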
After applying these changes, my link's performance was mostly the same from my point of view. My users, however, noticed a big change.

- Desktop michael.mspencer.net: around 8 bytes/sec
- Bulk http downloads: 12.7 KB/sec
- FTP user pseudonym, retrieving one file: 44.5 KB/sec
- FTP user dev, retrieving one file with a download accelerator (four simultaneous transfers): 4.4 KB/sec
- All other http traffic: 1.0 KB/sec

Grand total: 62.6 KB/sec. My allocations didn't add up precisely to what they did before, so less data was allowed onto the link. Overall performance was still excellent, and bandwidth was allocated to more appropriate tasks.

One future improvement might be to create a new queuing discipline based mostly on the sfq source code. Even though my bulk http downloads were getting a suitably low amount of bandwidth, that total allocation was divided up unfairly between two download-accelerator users and three single-transfer users. The modification I have in mind would be to reimplement sfq under a new name, but make the new version blind to port numbers. That way, if a download-accelerator user tries to increase their transfer speed by opening more connections, they don't actually receive more data and don't gain any unfair advantage over the other users. For example, they might go from one connection (6 KB/sec) to two connections (3 KB/sec each) to six connections (1 KB/sec each) but never actually receive any more data. Meanwhile, other users with only one connection are unaffected.

Many consumer and home-office broadband internet connections have more than enough bandwidth for heavy-duty server tasks, but lack a sophisticated traffic management system that enables reasonable performance under heavy load. Business DS-1 (T1) connections already come with fair queuing enabled, but most administrators aren't even aware of the difference. They do notice, however, that a home DSL line with one third of the upstream bandwidth of a DS-1 tends to perform much worse under only one third of a T1's normal workload. With the traffic management techniques discussed here, home server owners can handle the same kinds of heavy workloads as more expensive lines.