Traffic Conditioning For Inexpensive Installations
Business-Class Performance From Free Software and Commodity Hardware
By Michael Spencer

Broadband internet connections don't handle heavy server loads very well. When many connections are in contention for the same limited upstream bandwidth, problems occur that degrade overall link performance. I have found a solution that can be implemented with inexpensive software on existing hardware, and that sustains reasonable performance even under extremely heavy load. I will describe the problems that occur when dozens of connections all compete for bandwidth, offer some possible theoretical solutions, and then describe in detail an implementation that solves these problems.

Before I talk about the problem, you might want to know whether the solution applies to you. My proposed solution only works well if your upstream bandwidth is constant. If you have a home DSL connection with a fixed upstream rate, this solution is ideal. If you share a university or company internet connection, and you don't actually administer that connection, this solution won't work well for you. For cable modem users it might or might not work, depending on whether their upstream rate is fixed in hardware or simply uses whatever capacity is left over after all other users are done with it.

My proposed solution uses advanced networking features in the 2.4 series Linux kernel. You will need a Linux machine responsible for routing all Internet-bound traffic to your border router (cable modem, DSL modem, etc.). The ideal way to do this is to put two network interfaces in a dedicated machine: one interface on the local network segment, and one interface leading directly to the border router. Not everyone has this kind of hardware just lying around, but it may still be possible to implement this solution even if your Linux machine and border router share the same network segment with the rest of your network, or even if you only have one machine connected to your border router (or your border router is a card installed inside your machine).

If you only have one network segment but you already have a dedicated Linux machine on the network, it may be possible to reconfigure your network so all traffic must pass through the Linux machine. That is the configuration I use at home, and the sample configuration detailed below assumes this.

If you only have one computer connected via ethernet to a border router, or if your border router is a card installed inside your computer, you may still be able to use this technique. VMware Incorporated sells a virtual machine monitor called VMware ($100 for a personal or educational license, $300 for a commercial license). You can create a limited Linux virtual machine with VMware and bind it to the network interface your Internet connection is on. Then configure the VMware machine to communicate directly with the border gateway, and configure your desktop computer to use the VMware machine as its default gateway. You will need to leave VMware running at all times when you need to use the Internet, but VMware can be configured to use as little as 16 MB of memory and very little CPU time.

Once you have a Linux machine between your border router and your network, you will need to add support for several advanced options to that machine.
You will probably need to recompile your kernel with support for the experimental Shaper device, as well as several items under "QoS and/or fair queuing": the CBQ and SFQ packet schedulers, the U32 classifier, and Traffic Policing. These options can be found under "Networking Options" in the kernel configuration program. You will also need two userspace tools: tc and shapecfg. tc can be found in the iproute2 package in your distribution, or source is available from ftp://ftp.inr.ac.ru/ip-routing/iproute2-2.2.4-now-ss??????.tar.gz . Shapecfg already ships with most Linux distributions.

Before we talk about implementation details, it helps to understand why this solution is needed in the first place. The biggest problem is that most inexpensive border gateway devices (cable modems, DSL routers) are configured with rather large packet queues. If you send them data faster than they can transmit it, they will happily queue up nearly two full seconds' worth of data. When that queue is almost completely full, any new packet you send will either have to wait up to 2000 ms before being transmitted, or it will be dropped entirely.

We must understand the nature of the problem, and the kinds of things any possible solution can hope to achieve. We have complete, perfect control over our upstream, but not much control at all over our downstream. This causes problems with TCP's error recovery mechanisms. If one stream is sending huge amounts of data, filling up the border router's queue, while a second stream is sending small amounts of data to an unreliable host, the unreliable host's requests for retransmission will have to wait up to an additional 2000 ms before they see the data they requested.

Time-sensitive traffic is also hurt by a long send queue. A stream that is time-insensitive but moving a lot of data gets its best transfer rate when the router's buffer is kept full. If another stream running at the same time is time-sensitive but not moving much data, it has to endure the high latency induced by the full buffer in the router.

Last, TCP doesn't actually cooperate with other connections to ensure a fair division of bandwidth. If you have several streams, each to different hosts, each stream is going to try to use as much of the link as possible without flooding it. Foreign hosts that can acknowledge packets quickly may receive a disproportionately large share of the available bandwidth.

The Linux kernel offers a variety of tools for solving these problems. The first is the shaper device. This is a virtual ethernet device that's bound to an existing ethernet device and configured with a maximum speed. For example, if the physical ethernet interface you're shaping is eth0, the shaped interface is shaper0. All packets sent via interface shaper0 actually exit via interface eth0. The other tools are all part of the advanced routing and traffic control subsystem.

The second tool is the queuing discipline, called a qdisc. When packets are enqueued for transmission but can't be transmitted immediately, the system has a choice of several behaviors to follow when dequeuing and transmitting packets. If no other behavior is specified, Linux uses the "pfifo" (plain first-in first-out) qdisc, which simply transmits packets in the order they were received. One alternative is the "sfq" (Stochastic Fair Queuing) qdisc. This qdisc tries to track the individual flows present in the queue and allocate a "virtual queue" for each. It then dequeues packets with a round-robin scheduler. The end result is that all flows have a chance to send data at an equal rate.
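As a standalone illustration (not part of the configuration built later in this article), sfq can be attached to an interface with a single command. A minimal sketch, assuming the interface is eth0:

# attach sfq as the root queuing discipline on eth0,
# rehashing flows into virtual queues every 10 seconds
tc qdisc add dev eth0 root sfq perturb 10

# remove it again, restoring the default qdisc
tc qdisc del dev eth0 root

By itself this only makes competing flows share eth0 fairly; it does nothing about the queue inside the border router, which is why the full setup below also limits the transmission rate.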
The third tool is the traffic-control class and the classifying filter. Each class represents a specific kind of traffic, as determined by filter rules. Individual classes may contain other classes, and can have maximum transmission rates or priorities assigned to them. Each class also has one qdisc assigned to it. Packets enqueue in a class, and dequeue using the qdisc assigned to that class.

I will now present an example implementation of these techniques, with observed performance results and a discussion. I designed my traffic conditioning solution with the following goals and constraints in mind:

- My internet connection is a home-office DSL line provided by Qwest, with a solid 640 kbit/sec upstream data rate. This rate isn't shared with any other subscribers.
- My border router is a Cisco 678. It supports arbitrary routing rules, but doesn't have any provisions for fair queuing or traffic shaping.
- I have three computers on my network: a Linux server called mspencer.net, a Windows 2000 desktop machine called michael.mspencer.net, and a Windows XP desktop machine called luann.mspencer.net.
- The users of luann.mspencer.net are end-users, and their perceived performance is important. They don't run any applications that send a large amount of data, so their outbound traffic should have an extremely high priority.
- The desktop machine michael.mspencer.net runs three main types of traffic: time-critical traffic like multiplayer gaming; normal interactive traffic like web browsing; and low-priority, high-throughput traffic like peer-to-peer file sharing.
- The server machine mspencer.net runs five main types of traffic: interactive sessions with users; ftp sessions for identified users with valid usernames; public http traffic with public Internet users; bulk low-priority http traffic with public Internet users requesting extremely large objects (over 50 MB); and low-priority peer-to-peer file sharing traffic.

My first task was to instruct my machines to route all traffic to and from the public Internet through my Linux server, mspencer.net. Before making any changes, my network was as follows:

- router.mspencer.net (209.180.104.206) was the Cisco 678 DSL router. Its routing table said to forward all traffic destined for 209.180.104.200/29 directly onto the ethernet interface.
- michael.mspencer.net (209.180.104.204), luann.mspencer.net (209.180.104.205), and mspencer.net (209.180.104.202) all routed outbound Internet traffic directly through router.mspencer.net (209.180.104.206).

I changed the Cisco router's routing table so that it would pass packets to mspencer.net (209.180.104.202) directly onto the ethernet segment, but would route any traffic destined for michael.mspencer.net (209.180.104.204) or luann.mspencer.net (209.180.104.205) through mspencer.net (209.180.104.202). Next I turned on IP packet forwarding on mspencer.net (209.180.104.202). I didn't modify its routing table; mspencer.net still knows it's directly connected via ethernet to both desktop machines and the router. Finally I updated the network settings for both desktop machines, telling them to use mspencer.net (209.180.104.202) instead of router.mspencer.net (209.180.104.206) for their default gateway.
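On mspencer.net itself, the forwarding change is a one-line switch. A minimal sketch, assuming a 2.4 kernel; the sysctl.conf entry assumes your distribution applies that file at boot:

# enable IP packet forwarding immediately
echo 1 > /proc/sys/net/ipv4/ip_forward

# make the setting persistent across reboots on most distributions
echo "net.ipv4.ip_forward = 1" >> /etc/sysctl.conf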
My network then looked like this:

- router.mspencer.net (209.180.104.206) had its routing table changed to:

cbos#show route
[TARGET]         [MASK]           [GATEWAY]        [M][P] [TYPE] [IF]    [AGE]
0.0.0.0          0.0.0.0          0.0.0.0          1      SA     WAN0-0  0
209.180.104.201  255.255.255.255  209.180.104.202  1      SHAR   ETH0    0
209.180.104.203  255.255.255.255  209.180.104.202  1      SHAR   ETH0    0
209.180.104.204  255.255.255.255  209.180.104.202  1      SHAR   ETH0    0
209.180.104.205  255.255.255.255  209.180.104.202  1      SHAR   ETH0    0
209.180.104.200  255.255.255.248  0.0.0.0          1      LA     ETH0    0
216.161.72.0     255.255.255.0    0.0.0.0          1      A      WAN0-0  0

- mspencer.net (209.180.104.202) had IP forwarding enabled.
- Both michael.mspencer.net (209.180.104.204) and luann.mspencer.net (209.180.104.205) had their default gateway changed.

I have not changed the physical topology of my network at all. I still have only one ethernet switch with four machines plugged in, and none of my machines are multi-homed. However, my Linux machine is now able to control all of my Internet-bound traffic.

My next task was to attach a class-based queue to only my Internet-bound traffic. I want my desktop machines and my Linux server to be able to communicate at full ethernet speed without causing the Linux server to believe Internet bandwidth is being consumed. I compiled support for traffic shaping and class-based queuing into my kernel and rebooted. I also downloaded binary packages for the shapecfg and tc tools.

I created a shaper device with the commands "shapecfg attach shaper0 eth0" and "shapecfg speed shaper0 10000000". This created a virtual ethernet device called shaper0 with a transmission limit of 10 mbit/sec. Please note that while the original intended use of the traffic shaper device is to actually limit your transmission rate, I used it merely to create a second ethernet device that transmits on the same physical link. I brought the shaper interface up and assigned it the same IP address as my ethernet interface. This won't cause a problem – inbound packets will always arrive via eth0, but outbound packets may be sent via eth0 or shaper0.

Next I configured a second IP address for certain Apache virtual hosts. I created an ethernet alias eth0:0 with IP 209.180.104.203, and also a matching shaper alias shaper0:0 with the same IP.

Next I adjusted my routing table so outbound traffic destined for a host on the local ethernet segment was sent through interface eth0 like normal, but outbound traffic destined for the public Internet was sent through interface shaper0. While this has no immediate effect (both interfaces send data to the same wire), it lets me single out traffic on the shaper0 interface.

[root@mspencer /root]# route -n
Kernel IP routing table
Destination     Gateway          Genmask          Flags Metric Ref    Use Iface
209.180.104.0   0.0.0.0          255.255.255.0    U     0      0        0 eth0
127.0.0.0       0.0.0.0          255.0.0.0        U     0      0        0 lo
0.0.0.0         209.180.104.206  0.0.0.0          UG    0      0        0 shaper0

I then started implementing the "Ultimate Traffic Conditioner" script found at http://lartc.org. I replaced the root queuing discipline on shaper0, giving it handle "1:" and changing it from the default pfifo (plain first-in first-out) qdisc to the cbq (class-based queuing) qdisc.

tc qdisc add dev shaper0 root handle 1: cbq avpkt 1000 bandwidth 10mbit

This qdisc doesn't perform any shaping or prioritizing of traffic on its own. Its classes will do that. Next I created a main class, called "1:1".

tc class add dev shaper0 parent 1: classid 1:1 cbq rate 530kbit allot 1600 prio 5 bounded isolated

This class has a maximum transmission rate of 530 kbit/sec. It is also marked "bounded" and "isolated". A bounded class is not allowed to borrow unused bandwidth from sibling classes, and an isolated class is not allowed to lend unused bandwidth to other siblings.
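Before going further, it is worth confirming that the qdisc and the main class actually attached to shaper0. tc can list both; a quick check, not part of the conditioner itself:

# list the queuing discipline and the classes now attached to shaper0
tc qdisc show dev shaper0
tc class show dev shaper0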
I then created two child classes, called "1:10" and "1:20".

tc class add dev shaper0 parent 1:1 classid 1:10 cbq rate 53kbit allot 1600 prio 1 avpkt 1528
tc class add dev shaper0 parent 1:1 classid 1:20 cbq rate 477kbit allot 1600 prio 2 avpkt 1528

Because these classes are siblings, and neither is bounded nor isolated, they will borrow unused bandwidth from each other. However, if both classes are being sent enough traffic to fill their queues, they will neither lend nor borrow. That is, if class 1:10 is being asked to send at 400 kbit/sec and class 1:20 is being asked to send at 600 kbit/sec, neither class will borrow from the other. They won't share bandwidth at a rate proportionate to their inputs (40% for 1:10 and 60% for 1:20); they will stick with their assigned rates (10% for 1:10 and 90% for 1:20).

I changed the default queuing discipline (qdisc) on both 1:10 and 1:20, so packets are dequeued fairly.

tc qdisc add dev shaper0 parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev shaper0 parent 1:20 handle 20: sfq perturb 10

The Stochastic Fair Queuing discipline (sfq) was explained above. It will try to identify individual flows in the traffic that passes through its queue, will create a virtual queue for each flow, and will try to dequeue packets from each virtual queue with a round-robin scheduler. The result is that if you have twenty connections all flooding the interface at their maximum speeds, each connection will receive roughly one twentieth of the available bandwidth. The perturb setting tells sfq how often, in seconds, to recompute the hash it uses to assign flows to virtual queues.

I then added filters to determine which packets will be handled by which classes.

tc filter add dev shaper0 parent 1: protocol ip prio 11 u32 match ip protocol 1 0xff flowid 1:10

The u32 classifying filter works by comparing individual bits in a packet against match criteria. The "protocol" keyword is an alias for the location of the protocol number in the IP header. The next number (1) specifies what value to compare with (protocol number 1 is ICMP), and the number after that (0xff) is a bitmask which specifies which bits in the specified location should be compared. 0xff corresponds to 11111111, which means that all eight bits must match; 0x01 corresponds to 00000001, which means that only the rightmost bit would have to match. Traffic matching this filter is handled by class 1:10.

tc filter add dev shaper0 parent 1: protocol ip prio 10 u32 match ip tos 0x10 0xff flowid 1:10

This filter checks the value of the type-of-service field in outbound packets (0x10 is "minimize delay"). Public Internet routers generally don't honor the requested type of service, but some applications (especially ssh) set the type of service on outgoing packets. This filter adds interactive ssh session traffic to the high-priority class 1:10.

tc filter add dev shaper0 parent 1: protocol ip prio 12 u32 match ip protocol 6 0xff match u8 0x05 0x0f at 0 match u16 0x0000 0xffc0 at 2 match u8 0x10 0xff at 33 flowid 1:10

This filter checks several bits in the IP and TCP headers, to match only TCP acknowledgements. We want to give acknowledgements high priority, so downloads go faster.

tc filter add dev shaper0 parent 1: protocol ip prio 40 u32 match ip dst 0.0.0.0/0 flowid 1:20

This filter is a catch-all that should send all other traffic to 1:20.
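The TCP-acknowledgement filter is the least obvious of the four, so here is that same command again, broken across lines, with my reading of each match (an interpretation, not an authoritative reference):

# my reading of the match fields:
#   protocol 6 0xff        - IP protocol field equals 6 (TCP)
#   u8 0x05 0x0f at 0      - IP header length is 5 words, i.e. no IP options
#   u16 0x0000 0xffc0 at 2 - IP total length is under 64 bytes (ACK-sized)
#   u8 0x10 0xff at 33     - TCP flags byte (offset 20 + 13) is exactly ACK
tc filter add dev shaper0 parent 1: protocol ip prio 12 u32 \
    match ip protocol 6 0xff \
    match u8 0x05 0x0f at 0 \
    match u16 0x0000 0xffc0 at 2 \
    match u8 0x10 0xff at 33 \
    flowid 1:10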
The net result of this is that certain kinds of traffic are selected as "high-priority" and are transmitted almost immediately. All other traffic is grouped together and forwarded according to sfq's round-robin scheduler.

This had a dramatic effect on my link's performance. My average round trip (as measured with ICMP echo requests to and from my ISP's nameserver) went from 1800-2000 ms to a striking 60-90 ms. The reason is that ICMP packets were no longer required to wait with the rest of the outbound traffic; they were put at the head of the line immediately. Note that this would not have been possible if the Cisco DSL router had been allowed to maintain the outbound queue.

Before applying this sfq-based conditioner, my per-task bandwidth usage looked like this:

- Desktop michael.mspencer.net: around 93 bytes/sec
- Bulk http downloads: 40.5 KB/sec
- FTP user pseudonym, retrieving one file: 9.5 KB/sec
- FTP user dev, retrieving one file with a download accelerator (four simultaneous transfers): 4.6 KB/sec
- All other http traffic: 16.1 KB/sec

Grand total: 70.7 KB/sec. Because my upstream link can't actually send that quickly, some of that traffic was being dropped, and the DSL router's queue was kept completely full.

After applying the sfq-based conditioner, per-task bandwidth usage looked like this:

- Desktop michael.mspencer.net: around 8 bytes/sec (demand was lower)
- Bulk http downloads: 52.2 KB/sec
- FTP user pseudonym, retrieving one file: 8.4 KB/sec
- FTP user dev, retrieving one file with four simultaneous transfers: 3.7 KB/sec
- All other http traffic: 868 bytes/sec (demand must have died down)

Grand total: 65.2 KB/sec. This is very close to, but not beyond, my upstream link's capacity. No outbound packets were dropped by my DSL router, and the router's queue stayed empty.

These results were mostly satisfactory, but with some problems. FTP user pseudonym is capable of much higher transfer rates, and his transfer is more valuable to me than the bulk http downloads consuming more than 80% of my available bandwidth. Upon further examination, two http users were downloading eight to ten different sections of the same file simultaneously, using a download accelerator (user-agent "DA 5.0"). This reveals another weakness of the sfq queuing discipline: it weights each connection equally, so someone who opens many simultaneous transfers is allocated a disproportionately high amount of bandwidth.

I then implemented additional classes and classifier rules for the kinds of traffic I wanted to give lower priority to.

tc class add dev shaper0 parent 1:1 classid 1:30 cbq rate 69kbit allot 1600 prio 3 avpkt 1528
tc class add dev shaper0 parent 1:1 classid 1:40 cbq rate 69kbit allot 1600 prio 4 avpkt 1528
tc qdisc add dev shaper0 parent 1:30 handle 30: sfq perturb 10
tc qdisc add dev shaper0 parent 1:40 handle 40: sfq perturb 10

I also modified class 1:20 (normal traffic), decreasing its bandwidth allocation.

tc class change dev shaper0 parent 1:1 classid 1:20 cbq rate 371kbit allot 1600 prio 2 avpkt 1528

I then added filter rules for low-priority traffic, so those kinds of traffic would be handled by classes 1:30 and 1:40. For example:

tc filter add dev shaper0 parent 1: protocol ip prio 13 u32 match ip src 209.180.104.203 flowid 1:30

This assigned all bulk-download http traffic to class 1:30.

tc filter add dev shaper0 parent 1: protocol ip prio 26 u32 match ip sport 3072 0xfc00 flowid 1:40

This assigns all traffic with source ports in a range common to a specific peer-to-peer file sharing program to class 1:40. Note that because these two new classes are not isolated, they will donate any bandwidth they don't use to their siblings. And since the normal-priority class 1:20 is not bounded, it will borrow bandwidth from siblings if it has more demand than it can handle alone.
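To see how the new classes actually divide the upstream while traffic is flowing, the per-class counters can be inspected. A quick check (the exact output format depends on your iproute2 version):

# show per-class statistics (bytes, packets, drops, borrows) on shaper0
tc -s class show dev shaper0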
After applying these changes, my link's performance was mostly the same from my point of view. My users, however, noticed a big change.

- Desktop michael.mspencer.net: around 8 bytes/sec
- Bulk http downloads: 12.7 KB/sec
- FTP user pseudonym, retrieving one file: 44.5 KB/sec
- FTP user dev, retrieving one file with a download accelerator (four simultaneous transfers): 4.4 KB/sec
- All other http traffic: 1.0 KB/sec

Grand total: 62.6 KB/sec. My allocations didn't add up precisely to what they did before, so less data was allowed onto the link. Overall performance was still excellent, and bandwidth was allocated to more appropriate tasks.

One future improvement might be to create a new queuing discipline based mostly on the sfq source code. Even though my bulk http downloads were getting a suitably low amount of bandwidth, that total allocation was divided up unfairly between two download-accelerator users and three single-transfer users. The modification I have in mind would be to reimplement sfq under a new name, but make the new version blind to port numbers. That way, if a download-accelerator user tries to increase their transfer speed by opening more connections, they don't actually receive more data and don't gain any unfair advantage over the other users. For example, they might go from one connection (6 KB/sec) to two connections (3 KB/sec each) to six connections (1 KB/sec each) but never actually receive any more data. Meanwhile, other users with only one connection are unaffected.

Many consumer and home-office broadband internet connections have more than enough bandwidth for heavy-duty server tasks, but lack a sophisticated traffic management system that enables reasonable performance under heavy load. Business DS-1 (T1) connections already come with fair queuing enabled, but most administrators aren't even aware of the difference. They do notice, however, that a home DSL line with one third of the upstream bandwidth of a DS-1 tends to perform much worse under only one third of a T1's normal workload. With the traffic management techniques discussed here, home server owners can handle the same kinds of heavy workloads as more expensive lines.