On this site I will try to offer information about IP Network Quality of Service (QoS), approaching problems from a practical point of view.




2.1.- Linux Traffic Control

Each network card (NIC = Network Interface Card) on a Linux box is driven by a Network Driver which controls the hardware. The Network Driver acts as an exchange mechanism for packets between the Linux Networking Code and the physical network. This first approach can be viewed this way:

In this diagram by Christian [9] the Linux Networking Code can request the Network Driver to send a packet to the physical network (packet is leaving the box) or the Network Driver can give a packet it has received from the physical network to the Linux Networking Code (packet is entering the box).
Because our interest is to use Linux as a pure router, forwarding packets from one interface to another in the same box, it will neither generate nor consume any packets, and our diagram can be extended to include two NICs:

This way packets can enter from the physical network A at NIC number 0; NIC 0's Network Driver gives the received packets to the Linux Networking Code, where they are forwarded; finally the Linux Networking Code requests NIC 1's Network Driver to send the packets to the physical network B.
Let's now turn our eyes to the Linux Networking Code. Reading from [8], Saravanan wrote:
The basic principle involved in the implementation of QoS in linux is shown in the figure below. This figure shows how the kernel processes incoming packets, and how it generates packets to be sent to the network. The input de-multiplexer examines the incoming packets to determine if the packets are destined for the local node. If so, they are sent to the higher layer for further processing. If not, it sends the packets to the forwarding block. The forwarding block, which may also receive locally generated packets from the higher layer, looks up the routing table and determines the next hop for the packet. After this, it queues the packets to be transmitted on the output interface. It is at this point that linux traffic control comes into play. Linux traffic control can be used to build a complex combination of queuing disciplines, classes and filters that control the packets that are sent on the output interface.

Very well. But we are going to refine things a little. First, we are not interested in local processing; this means we will not accept packets destined for the local node and we will not generate packets locally. Our Linux box will be a pure router: no packets are generated, no packets are consumed, packets are just forwarded or perhaps dropped. This way your box is a pure transfer machine, and it is a lot better if you keep it that way: do not run any service other than forwarding on a router.
Observe also that after packets are forwarded by the forwarding block they are enqueued in an output queue before being transmitted on the output interface. On this output queue something called Linux Traffic Control takes care of packet management. Reading from Werner [6] we can get better and more detailed information, including a very nice figure:
The figure below shows roughly how the kernel processes data received from the network, and how it generates new data to be sent on the network: packets arrive via an input interface, where they may be policed. Policing discards undesirable packets, e.g. if traffic is arriving too fast. After policing, packets are either directly forwarded to the network (e.g. on a different interface, if the machine is acting as a router or a bridge), or they are passed up to higher layers in the protocol stack (e.g. to a transport protocol like UDP or TCP) for further processing. Those higher layers may also generate data on their own and hand it to the lower layers for tasks like encapsulation, routing, and eventually transmission.
"Forwarding" includes the selection of the output interface, the selection of the next hop, encapsulation, etc. Once all this is done, packets are queued on the respective output interface. This is the second point where traffic control comes into play. Traffic control can, among other things, decide if packets are queued or if they are dropped (e.g. if the queue has reached some length limit, or if the traffic exceeds some rate limit), it can decide in which order packets are sent (e.g. to give priority to certain flows), it can delay the sending of packets (e.g. to limit the rate of outbound traffic), etc. Once traffic control has released a packet for sending, the device driver picks it up and emits it on the network.

Werner's explanation helps us a lot. Again, to keep our box as a pure router, no packets will be sent to the upper layers and no packets will be received from the upper layers. Our packet route will then be: Input Interface → Ingress Policing → Input Demultiplexing (just to confirm that packets are not for local use) → Forwarding → Output Queuing → Output Interface.
Where are we going to exercise control over our packets? Just in the blue blocks: Ingress Policing and Output Queuing. These blocks are called the "Traffic Control Code of the Linux kernel". Ingress policing will be our first point of control; here policing discards undesirable packets to enforce an upper limit on the entering throughput. The second point will be the Output (Egress) Queue; here packets are queued and then dropped, delayed or prioritized according to the policy we are interested in implementing.
Drilling down, let's see how the Linux Traffic Control Code is implemented. As its name says, it is just C language code consisting of four major conceptual components:
  • Queuing disciplines
  • Classes
  • Filters
  • Policers
Queuing disciplines are algorithms that control how enqueued packets are treated. A very simple queuing discipline could be a FIFO (First In-First Out) queue of 16 packets. In this queue packets enter at the tail and leave from the head. Something as simple as the picture shown in figure 2.1.5 below.

In this figure the queue is handling 20 packets. Packets 1 and 2 have already entered and left the queue, packets 3 to 18 are currently in the queue, and packets 19 and 20 are in the process of entering it. Of course, before a new packet can be accepted another has to be dispatched. If packets arrive very fast, faster than they can be dispatched, the router doesn't have any choice other than to drop them. But we don't want this; we want to conserve our packets to protect TCP protocol behavior and the applications that send/receive these packets. Or better yet, we want to exercise true control over them: which are going to be forwarded, and how fast; which are going to be delayed; and which, lamentably, dropped. The queue is an area of memory on our router. If we are expecting packets of, say, 1000 bytes, we have to reserve an area of 16 KB of memory to implement this simple 16-packet FIFO queue.
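To fix the idea, the drop-tail behavior just described can be sketched in a few lines of Python. This is a toy model, not kernel code; the class name and packet representation are invented for illustration:

```python
from collections import deque

class FifoQdisc:
    """Toy drop-tail FIFO: enqueue at the tail, dequeue from the head,
    drop new packets when the queue is full."""

    def __init__(self, limit=16):
        self.limit = limit          # maximum number of queued packets
        self.queue = deque()

    def enqueue(self, packet):
        if len(self.queue) >= self.limit:
            return False            # queue full: the packet is dropped
        self.queue.append(packet)   # new packets enter at the tail
        return True

    def dequeue(self):
        # Packets leave the queue from the head, oldest first.
        return self.queue.popleft() if self.queue else None

# A 16-packet queue; with 1000-byte packets it needs about 16 KB of memory.
q = FifoQdisc(limit=16)
accepted = [q.enqueue(n) for n in range(1, 21)]   # offer 20 packets at once
print(accepted.count(True))    # 16 packets fit in the queue
print(accepted.count(False))   # 4 packets are dropped (tail drop)
print(q.dequeue())             # packet 1 leaves first
```

Offering 20 packets to a full-stop 16-packet queue shows exactly the problem described above: once the buffer is full, the router has no choice but to drop the excess.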
To step ahead, let's now extend our knowledge about the Linux Traffic Control code a little. At this point Christian's [9] explanation of the 'Class' concept fits like a glove:
The basic building block of the Traffic Control system is the qdisc, as described above. The queuing disciplines mentioned in the previous section are comparatively simple to deal with. That is, they are set up and maybe given some parameters. Afterwards packets can be enqueued to them and dequeued from them as described.
But many queuing disciplines are of a different nature. These qdiscs do not store packets themselves. Instead, they contain other qdiscs, which they give packets to and take packets from. Such qdiscs are known as qdiscs with classes.
For example, one could imagine a priority-based queuing discipline with the following properties:
  1. Each packet enqueued to the queuing discipline is assigned a priority. For example the priority could be deduced from the source or destination IP address of the packet. Let us say that the priority is a number between 1 and 5.
  2. When a packet is dequeued it will always select a packet it contains with the lowest priority number.
A way to implement such a queuing discipline is to make the priority-based queuing discipline contain 5 other queuing disciplines numbered from 1 to 5. The priority-based queuing discipline will then do the following:
  1. When a packet is enqueued, it calculates the priority number, i.e. a number between 1 and 5. It then enqueues the packet to the queuing discipline indicated by this number
  2. When a packet is dequeued it always dequeues from the non-empty queuing discipline with the lowest number.
What is interesting about this, is that the 5 contained queuing disciplines could be arbitrary queuing disciplines. For example sfq queues or any other queue.
In Linux this concept is handled by classes. That is, a queuing discipline might contain classes. In this example, the priority queuing discipline has 5 classes. Each class can be viewed as a socket to which you can plug in any other queuing discipline. When a qdisc with classes is created, it will typically assign simple FIFO queues to the classes it contains. But these can be replaced with other qdiscs by the tc program.
Bravo!! Some clarification: qdisc means queuing discipline; the qdisc he mentioned in a previous section was just a FIFO queue. But what he is masterfully describing here is a qdisc called the PRIO queue; a queue with classes, certainly. SFQ is another of these qdiscs. And tc is a program written by Alexey Kuznetsov to manage qdiscs on Linux.
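The enqueue/dequeue rules Christian describes for this PRIO-like qdisc can be sketched as follows. Again this is a toy Python model under illustrative assumptions, not the kernel implementation: the five inner qdiscs are plain lists acting as simple FIFOs, exactly as the quote says a qdisc with classes typically starts out:

```python
class PrioQdisc:
    """Toy PRIO-like qdisc: five classes numbered 1..5, each backed by
    an inner qdisc (a plain FIFO list here).  Dequeue always serves the
    non-empty class with the lowest number."""

    def __init__(self):
        # Each class starts with a simple FIFO as its inner qdisc.
        self.classes = {prio: [] for prio in range(1, 6)}

    def enqueue(self, packet, priority):
        # The priority (1..5) selects which inner qdisc gets the packet.
        self.classes[priority].append(packet)

    def dequeue(self):
        for prio in range(1, 6):        # lowest number = highest priority
            if self.classes[prio]:
                return self.classes[prio].pop(0)
        return None                     # all inner queues are empty

q = PrioQdisc()
q.enqueue("bulk", 5)
q.enqueue("voice", 1)
q.enqueue("web", 3)
print(q.dequeue())   # "voice" - class 1 is served first
print(q.dequeue())   # "web"
print(q.dequeue())   # "bulk"
```

Note that each list could be replaced by any other qdisc object with the same enqueue/dequeue interface, which is precisely the "plug any qdisc into a class socket" idea from the quote.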
But Werner in [6] has something to say about this stuff; let's see:
Queuing disciplines and classes are intimately tied together: the presence of classes and their semantics are fundamental properties of the queuing discipline. But the flexibility doesn't end there - classes normally don't store their packets themselves, but use another queuing discipline to take care of that. That queuing discipline can be arbitrarily chosen from the set of available queuing disciplines, and it may well have classes, which in turn use queuing disciplines, etc.
Nice, really nice. If we haven't misunderstood the explanation, we can have a tree of hierarchies: queuing disciplines have classes, but these classes normally don't store their packets themselves; instead they use another queuing discipline to take care of that. And these queuing disciplines can again have classes, which in turn use queuing disciplines, and so on.
But let's review Christian's [9] explanation above to extend our knowledge about the Linux Traffic Control code even more. He wrote: "For example the priority could be deduced from the source or destination IP address of the packet". That sounds familiar. Search back in this document and have a look at the "classifier" concept. Christian is telling us that the PRIO qdisc he described has five classes numbered 1, 2, 3, 4 and 5. When a packet arrives at our router we snoop its IP header to see its source or destination address and, based on either of these, or both, we select a priority from 1 to 5 for the packet; with this priority on hand we put the packet in one of the classes of the qdisc accordingly.
What are we doing? Just classifying the packet. We are saying: okay, you are coming from network, say, 203.16/16, so your priority will be 3; ... take door number 3 please, where you will be attended.
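This kind of classification rule can be sketched as a simple lookup. A toy Python model follows; the filter table is hypothetical, with the 203.16/16 → class 3 rule taken from the example above (written out as 203.16.0.0/16) plus an invented second rule and default class:

```python
from ipaddress import ip_address, ip_network

# Hypothetical filter table: (source network, class number).
# The first matching rule wins; unmatched packets fall through
# to a default class.
FILTERS = [
    (ip_network("203.16.0.0/16"), 3),
    (ip_network("10.0.0.0/8"), 1),
]
DEFAULT_CLASS = 5

def classify(src):
    """Return the class number for a packet with source address src."""
    addr = ip_address(src)
    for network, klass in FILTERS:
        if addr in network:
            return klass
    return DEFAULT_CLASS

print(classify("203.16.7.9"))   # 3 -> "take door number 3, please"
print(classify("10.1.2.3"))     # 1
print(classify("172.16.0.1"))   # 5 (the default class)
```

Each rule here plays the role of one tc filter: a predicate over packet fields that steers the packet into a class.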
To do this we need some kind of filter for selecting packets. Something like an usher who, when you arrive at the theater, asks: what is your seat number? Okay, go there and look for row number 26, etc., etc. If you search back above a little, a filter is the third conceptual component in Werner's [6] explanation. Let's read from him again:
Each network device has a queuing discipline associated with it, which controls how packets enqueued on that device are treated. Figure 2.1.6 shows the symbol we use for a queuing discipline without externally visible internal structure. For example, sch_fifo is such a simple queuing discipline, which just consists of a single queue, where all packets are stored in the order in which they have been enqueued, and which is emptied as fast as the respective device can send. More elaborate queuing disciplines may use filters to distinguish among different classes of packets and process each class in a specific way, e.g. by giving one class priority over other classes. Figure 2.1.7 shows an example of such a queuing discipline. Note that multiple filters may map to the same class.

Eureka!! Now we have some pictures, which help more than a hundred words. In fact, this kind of picture is going to be our daily bread when dealing with queuing disciplines. It's a schematic and easy way of representing them. Looking at the pictures we can form a quick mental scheme of how the qdisc behaves.
For example, looking at figure 2.1.7, packets enter the main qdisc from the left; immediately the classifier takes care of them and, using filters, selects the class where each packet has to be put. Once a packet is in the class, the queuing discipline attached to it takes over. If this last qdisc doesn't have classes, as shown in the figure, the packet is delivered to be sent to the physical network. Observe also something interesting: two (or more) filters can point to the same class. Let's see now another of these diagrams and its explanation, taken from Werner's [6] document:
Figure 2.1.8 shows an example of such a stack: first, there is a queuing discipline with two delay priorities. Packets which are selected by the filter go to the high-priority class, while all other packets go to the low-priority class. Whenever there are packets in the high-priority queue, they are sent before packets in the low-priority queue (e.g. the sch_prio queuing discipline works this way). In order to prevent high-priority traffic from starving low-priority traffic, we use the token bucket filter (TBF) queuing discipline, which enforces a rate of at most 1 Mbps. Finally, the queuing of low-priority packets is done by a FIFO queuing discipline. Note that there are better ways to accomplish what we've done here, e.g. by using class-based queuing (CBQ).

Great!! Werner talked about four Linux queuing disciplines (FIFO, PRIO, TBF and CBQ) and clears up our understanding of this stuff a lot. The diagram implements three of these qdiscs. Now we know we can implement a default class where packets are sent when they do not match any of our filters. This seems familiar; something like "leave a default codepoint and assign it to the best-effort PHB". Werner's figure can also be associated with those taken from the RFC 2475 specification; see figure 1.4.1 somewhere above.
Let's now see in a little more detail how the diagram represents our queuing discipline. The main qdisc is the PRIO queue, which receives the packets. It applies the filter and selects those packets marked as "high priority". How is the mark implemented? We don't know where the packets were marked, but we know that marking could be based on MF (multi-field) selection or on the DS codepoint.
Either way, the classifier puts them in the "High" class above. The rest of the packets (those not marked with our high-priority identifier) go to the "Low" class below. In the "High" class a TBF qdisc is implemented. As Werner explained before: typically, each class "owns" one queue… This time the queue assigned to the "High" class is a TBF queue. What is this queue trying to do? Just limiting the maximum throughput traversing it to 1 Mbps (have a look at the diagram). To the right of the TBF queue representation a little queue and a little clock are shown. They represent the queue's behavior and indicate that some sort of metering is being done on it to know how much traffic is flowing.
The "Low" class queue is a FIFO queue. We saw something about it somewhere above. A simple First In-First Out queue. It doesn't try to exercise any control; it just enqueues and dequeues packets as they arrive. Finally, the qdiscs of both classes deliver the packets on the right side of the main queue to be dispatched to the physical network.
A very interesting observation to be made here is that the "High" priority class is capped by an upper throughput limit (implemented by TBF). We said something about this before: high-priority traffic has to be controlled to avoid "starving" the lower-priority classes. Of course, with this PRIO qdisc implementation this will only work if the transport protocol is responsive, like TCP. Why?
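The metering that the TBF performs can be sketched with a classic token-bucket model. This is a toy Python sketch under assumed parameters, not the kernel's TBF: 1 Mbps is taken here as 125000 bytes/s and the bucket depth as one 1500-byte packet:

```python
class TokenBucket:
    """Toy token-bucket meter: tokens accumulate at `rate` bytes per
    second up to a depth of `burst` bytes; a packet conforms only if
    enough tokens are available, otherwise it must wait or be dropped."""

    def __init__(self, rate, burst):
        self.rate = rate            # bytes per second (1 Mbps = 125000 B/s)
        self.burst = burst          # bucket depth in bytes
        self.tokens = burst         # the bucket starts full
        self.last = 0.0             # time of the last conformance check

    def conforms(self, size, now):
        # Refill the bucket for the time elapsed since the last check.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size     # spend tokens: the packet may go out
            return True
        return False                # over the rate: delay or drop

tbf = TokenBucket(rate=125000, burst=1500)   # roughly a 1 Mbps limit
print(tbf.conforms(1500, now=0.0))    # True: the bucket starts full
print(tbf.conforms(1500, now=0.001))  # False: only ~125 bytes refilled
print(tbf.conforms(1500, now=0.012))  # True: refilled by 12 ms
```

The little clock in the figure corresponds to the refill step: between checks, elapsed time is converted into new tokens, which is how the meter knows whether the flow stays under 1 Mbps.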
To step ahead let's now read from Saravanan [8] document:
This section discusses queuing disciplines, which form a basic building block for the support of QoS in linux. It also discusses the various queuing disciplines that are supported in linux. Each network device has a queue associated with it. There are 11 types of queuing disciplines that are currently supported in linux, which includes:
  • Class Based Queue (CBQ)
  • Token Bucket Flow (TBF)
  • Clark-Shenker-Zhang (CSZ)
  • First In First Out (FIFO)
  • Priority (PRIO)
  • Traffic Equalizer (TEQL)
  • Stochastic Fair Queuing (SFQ)
  • Asynchronous Transfer Mode (ATM)
  • Random Early Detection (RED)
  • Generalized RED (GRED)
  • Diff-Serv Marker (DS_MARK)
Queues are identified by a handle <major number:minor number>, where the minor number is zero for queues. Handles are used to associate classes to queuing disciplines. Classes are discussed in the next subsection.
Gosh!! A lot of queues. Actually there are some more, already developed but not officially included in the Linux kernel, like HTB (Hierarchical Token Bucket) written by Martin Devera, WRR (Weighted Round Robin) written by Christian Worm Mortensen, and IMQ (Intermediate Queuing Discipline) written by Patrick McHardy. Additionally some others exist that are not yet implemented on Linux, like CBT (Class-Based Threshold), FRED (Flow-Based RED), DRR (Differential Round Robin) and D-CBQ (Decoupled-CBQ). Cisco also has some polished queues like WFQ, D-WFQ, WRED, CBWFQ, CQ, PQ, and you can stop counting; there are a lot.
But don't be afraid. For our work we will concentrate on these: FIFO, PRIO, TBF, SFQ, RED, GRED, CBQ and DSMARK. (Note: after this work was started I decided to use HTB instead of CBQ to implement DS; therefore CBQ will be replaced with HTB throughout this document.) It's going to be a very nice trip. At least I will do my best. By the way, for dealing with all these beasts we will need some mechanism to talk to them, that is, a tool which permits us to be the conductor of this orchestra. This tool was developed some years ago by Alexey Kuznetsov and it is called tc (traffic control). tc is not as friendly as most of us would like it to be, but it's what we have. Recently, Werner Almesberger, really impressed by the lovely care that everybody feels for tc, wrote tcng (traffic control next generation). People are still screaming. I'm just joking, of course.
This time our strategy will be very simple. First we will present a brief approach to the behavior of each queue: some theoretical, but not too deep, support to understand how the queue was conceived. Next, how the queue is configured using tc, and along with this battle, an explanation of parameters and recommended settings.
For this part of our study we will have a little help from our friends: Bert Hubert and the Lartc people [7]; Bert Hubert and his excellent work writing a manpage for this stuff; Werner Almesberger, Jamal Hadi Salim and Alexey Kuznetsov [10], mainly for understanding GRED and DSMARK; some shallow diving into Alexey Kuznetsov's code and documentation to understand how tc works; and, last but not least, to illustrate things as well as possible (because Linux people are not distinguished for being good pedagogues), I'm going to use some figures and concepts taken from "Supporting Differentiated Service Classes: Queue Scheduling Disciplines" by Chuck Semeria of Juniper Networks [11]. Hoping not to have rights problems with these people (if I do, please take up a collection to help with my defense), I'm going to dare to use their excellent documentation to support our explanation. Without losing more time, let's start with the simplest one, the FIFO queuing discipline.