A study of network load
Written by Leonardo Balliache

Last year, a customer of mine called me in to evaluate a congestion problem they had in their network. The company, let's call them Company X, had congestion problems in the LAN: very frequent collisions at the main switch and very low performance. They operate a small 10 Mbps LAN with a 100 Mbps main switch, seven 10 Mbps hubs, 50-60 Windows PCs, a Netware server, an Internet server, and a 1536 kbps ADSL Internet connection.
 
Initially they had a 256 kbps ADSL Internet connection and things went more or less well, but after upgrading the Internet connection the problems worsened and became harder to live with.
 
Because three protocols were running in the LAN, IPX/SPX, TCP, and UDP, I suggested making a study of the network load to better understand what was going on before taking any decision about upgrading the network.
 
To take the traffic samples, a SuSE Linux box was installed and connected to one of the hubs. On the Linux box, a script was scheduled with cron to take a sample every 10 minutes, during working and non-working hours and days. The samples were taken with tcpdump to get a copy of the packets moving on the LAN. Each sample captured 100,000 packets.
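The original cron schedule and capture script were not published, so the following is only a minimal sketch of how the sampling could have been set up; the interface name, paths and file names are assumptions of mine.

# Hypothetical crontab entries: one sample every 10 minutes during the
# two sampling windows (07:30-11:50 and 13:00-16:50)
30,40,50 7 * * *    /usr/local/bin/take_sample.sh
*/10 8-11 * * *     /usr/local/bin/take_sample.sh
*/10 13-16 * * *    /usr/local/bin/take_sample.sh

#!/bin/sh
# take_sample.sh -- capture one sample of 100000 packets on the LAN interface
# (eth0 is an assumption) and save it under a timestamped name
STAMP=`date +%m.%d.%Y-@%H.%M`
/usr/sbin/tcpdump -i eth0 -n -c 100000 -w /var/log/samples/$STAMP.pcap

The saved samples can later be read back in text form with tcpdump -n -r for parsing.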
 
Later on, the tcpdump samples were parsed with awk and plotted with gnuplot. To filter out noise, the outputs were smoothed with an EWMA (Exponentially Weighted Moving Average) using µ = 0.9 before plotting. A daily printed system throughput summary was also produced from the samples.
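The parsing scripts themselves were not published either; here is just a minimal sketch of the smoothing step, assuming a two-column "time value" input and assuming that µ = 0.9 is the weight given to the previous average (the article does not say which term µ weights).

# ewma.awk -- exponential weighted moving average with mu = 0.9
# usage: awk -f ewma.awk throughput.dat > throughput.smooth
BEGIN   { mu = 0.9 }
NR == 1 { s = $2 }                      # seed the average with the first sample
NR > 1  { s = mu * s + (1 - mu) * $2 }  # weight history by mu, the new sample by 1-mu
        { print $1, s }

The smoothed file can then be plotted in gnuplot with something like: plot "throughput.smooth" using 1:2 with lines.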
 
Samples were taken on October 16, 17, 18, 19, 20, and 21, 2003, following this schedule:
 
Beginning at 7:30 a.m., one sample every 10 minutes; last sample at 11:50 a.m. Samples in the morning: 27.
 
Beginning at 1:00 p.m., one sample every 10 minutes; last sample at 4:50 p.m. Samples in the afternoon: 24.
 
Total number of samples: (27+24) * 6 = 306 samples.
 
For each sample the following graphs were obtained:
 
  1. A protocol distribution graph, where instant throughput by protocol was plotted vs time. There is one graph of this type per sample, identified by the acronym proto. (A sketch of how these per-protocol series can be built from the samples is given right after this list.)
     
  2. A source address protocol distribution graph, where instant throughput by source address, by protocol, was plotted vs time. There are three graphs of this type per sample (one for each protocol: ipx, tcp, and udp), identified by the acronyms ipx.srcadr, tcp.srcadr, and udp.srcadr.

  3. A destination address protocol distribution graph, where instant throughput by destination address, by protocol, was plotted vs time. There are three graphs of this type per sample (one for each protocol: ipx, tcp, and udp), identified by the acronyms ipx.dstadr, tcp.dstadr, and udp.dstadr.

  4. A source port protocol distribution graph, where instant throughput by source port, by protocol, was plotted vs time. There are three graphs of this type per sample (one for each protocol: ipx, tcp, and udp), identified by the acronyms ipx.srcport, tcp.srcport, and udp.srcport.

  5. A destination port protocol distribution graph, where instant throughput by destination port, by protocol, was plotted vs time. There are three graphs of this type per sample (one for each protocol: ipx, tcp, and udp), identified by the acronyms ipx.dstport, tcp.dstport, and udp.dstport.

  6. A connection protocol distribution graph, where instant throughput per connection (pair of addresses), by protocol, was plotted vs time. There are three graphs of this type per sample (one for each protocol: ipx, tcp, and udp), identified by the acronyms ipx.conn, tcp.conn, and udp.conn.

  7. A host protocol distribution graph, where instant throughput per host (source and destination addresses), by protocol, was plotted vs time. There are three graphs of this type per sample (one for each protocol: ipx, tcp, and udp), identified by the acronyms ipx.host, tcp.host, and udp.host.

  8. A port protocol distribution graph, where instant throughput per port (source and destination ports), by protocol, was plotted vs time. There are three graphs of this type per sample (one for each protocol: ipx, tcp, and udp), identified by the acronyms ipx.port, tcp.port, and udp.port.
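As mentioned in item 1, here is a minimal sketch of how the per-protocol series behind these graphs could be built. It assumes a hypothetical intermediate file (not part of the original tool set, whose awk parser was not published) with one line per packet in the form: epoch-seconds protocol bytes src-addr src-port dst-addr dst-port.

# proto_bins.awk -- sum bytes per protocol (ipx/tcp/udp) in one-second bins
# and print bits per second, ready for smoothing and gnuplot
{
    t = int($1)                 # one-second time slot
    bytes[t, $2] += $3          # accumulate bytes for this slot and protocol
    if (min == "" || t < min) min = t
    if (t > max) max = t
}
END {
    print "# sec ipx tcp udp"
    for (t = min; t <= max; t++)
        print t - min, bytes[t,"ipx"] * 8, bytes[t,"tcp"] * 8, bytes[t,"udp"] * 8
}

The later sketches in this article reuse the same hypothetical line layout.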
The graphs are PNG files, named as follows:

mm.dd.yyyy-@hh.uu.acronym.png

Where mm is the month, dd the day, yyyy the year, hh the hour, and uu the minute when the sample was taken; acronym is the acronym indicated above and png is the file extension. This way, the following file:

10.16.2003-@07.30.ipx.conn.png

Corresponds to a sample taken on October 16, 2003, at 7:30 a.m., and represents instant throughput per connection.

10.21.2003-@14.50.proto.png

Corresponds to a sample taken on October 21, 2003, at 2:50 p.m., and represents instant throughput per protocol.
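For what it is worth, a name in this format can be generated directly with date(1), assuming the hour is written in 24-hour form as in the examples:

# acronym stands for proto, ipx.conn, tcp.srcport, and so on
acronym="ipx.conn"
echo "`date +%m.%d.%Y-@%H.%M`.$acronym.png"    # e.g. 10.16.2003-@07.30.ipx.conn.png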
Also, a daily printed system throughput report (a text file) was obtained, identified as proto.txt. This file is very large because each sample is detailed in the report. You can get a copy from the link above.
There is really a bundle of graph files; for each sample we have one for proto and three for each of the following: srcadr, srcport, dstadr, dstport, conn, host, and port. Total: 22 graph files per sample. The total number of files is:
22 files/sample * 306 samples = 6732 graph files + 6 daily printed reports = 6738 files.

Okay. I packed these files into six zip files, named after the sampling day: 10.16.2003.zip, 10.17.2003.zip, 10.18.2003.zip, 10.19.2003.zip, 10.20.2003.zip, and 10.21.2003.zip. Each file is approximately 9 MB. I uploaded the file corresponding to October 16, 2003 to my site; it is here: 10.16.2003.zip. If you want to check the other files, please let me know and I will e-mail you a temporary link. The files are very large and use up my scarce web space.
If you are curious and have enough time, there is a lot to see in these graphs. They could serve as a means to better understand the relationship between the different kinds of protocol traffic when they compete for network resources. People with better judgment than mine could draw very interesting conclusions from analyzing these graphs. If you need them for a study or some other work, just let me know; the graphs are at your disposal as long as my credit is respected by indicating the address of this site.
Next, let's have a look at some of the files covering the traffic on 10.16.2003.
The protocol distribution graph at 4:10 p.m.

This graph represents the most common behavior of Company X's LAN. Normally, during working hours, the LAN is almost completely taken over by the IPX/SPX protocol. The high, steep IPX peaks show that this protocol is very aggressive; do not forget that these outputs were smoothed before being plotted. TCP shows a much softer behavior, and the UDP contribution is negligible. The very aggressive behavior of the IPX protocol suggests that burstiness could be a problem in this network.
Sometimes the network is controlled by TCP, when the stations using the administrative software, whose database lives on the Netware server, are less busy, as here at 8:40 a.m.

At 9:40 a.m. things look like this:

Again, the IPX protocol dominates the scene. The system throughput share at this time (see the printed report) is:
 
IPX:   84.84%
TCP:   15.07%
UDP:   0.09%

Let's see the source address protocol distribution graph for the TCP protocol at 9:40 a.m.

These are the top 5 source addresses. The Internet server dominates the scene.
The same graph for destination addresses is:

Again, the Internet server seems to dominate the scene, at least by having the largest system throughput (see the printed report). The addresses at the top right are listed from top to bottom by system throughput. It is easy to see that a higher throughput peak does not necessarily mean a higher system throughput.
Checking by addresses is very difficult; we could have a lot of connections.
UDP has a minor contribution. Here we have the source address protocol distribution graph for the UDP protocol at 9:40 a.m.

Again, the Internet server controls the UDP scene. The destination address protocol distribution graph for the UDP protocol at 9:40 a.m. is as follows:

Hmm, very interesting: .255 is a broadcast address. Nevertheless, we checked the broadcast and multicast traffic and did not find any abnormal situation; this traffic is always no more than 1% of the total traffic in the LAN. It is very important to check this type of traffic, because some strange problems are caused by abnormal (excessive) broadcast and multicast traffic.
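For reference, this kind of check can be made directly with a tcpdump read filter on a saved sample (the file name below is hypothetical):

# count broadcast/multicast frames in one sample and compare against the
# total number of frames to estimate their share
tcpdump -n -r 10.16.2003-@09.40.pcap 'broadcast or multicast' | wc -l
tcpdump -n -r 10.16.2003-@09.40.pcap | wc -l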
IPX/SPX uses MAC addresses to identify the flows. Here we have the source address protocol distribution graph for the IPX protocol at 9:40 a.m.

This protocol plays hard, doesn't it? It's like an American football player. I have great respect for Novell and its premium product Netware, but their protocol does not help at all in a congested LAN; it is very aggressive and does not collaborate in trying to keep congestion under control.
00...00.01 is the Netware server. In this sample (10.16.2003 at 9:40), Netware dominates the LAN: its contribution is 84.84%, against 15.07% for TCP and 0.09% for UDP. Have a look below at the printed report for this hour. The system throughput is just 933.22 kbps (about 10% of the IPX peaks), and the IPX contribution to that system throughput is about 791 kbps (84.84% of 933.22 kbps).
The destination address protocol distribution graph for the IPX protocol at 9:40 a.m. is here:

Very short and fast transmissions, and retransmissions, I have to suppose. It would be very nice to have some way to check losses but, using tcpdump on a promiscuous interface, this is not an easy task.
When analyzing traffic it is very interesting to know whether some pair of addresses takes over the LAN bandwidth. To check this we use a connection protocol distribution graph, where instant throughput per connection (pair of addresses), by protocol, is plotted vs time. Here we have this graph for the TCP protocol at 9:40 a.m.

Again, checking by addresses is not easy. The pairs include the Internet server address most of the time, but the other address varies frequently. There is another problem: with P2P protocols in use every PC can be connected to every other PC, and the number of addresses to manage grows very quickly. Anyway, it is necessary to check all the graphs (time permitting, of course) to see whether something abnormal can be detected.
The same graph for the UDP protocol is here:

This graph confuses me more than a little: first the broadcast addresses, and second the high-throughput flow from address 192.168.0.2 to address 192.168.0.255. This could be DNS traffic, but we would have to check the ports being used to be sure of that assertion; I really don't know much about the DNS protocol and its behavior. Also, P2P protocols like FastTrack use UDP to discover peers. People at Company X are using e-mule (and probably Kazaa, see below), and perhaps these lightning flows are provoked by such a protocol. However, the UDP contribution to the total system throughput is really negligible, as can be seen in the printed report, so this does not push us too hard to investigate further. Well, I'm going to check the source file generated by tcpdump again to be sure that everything is right; this output is very suspicious, so if someone has an explanation for this behavior, it will be well received. I really made a mistake: in the printed report I mixed the TCP and UDP traffic in the same output, so I cannot check which ports each protocol is using. This has to be fixed next time.
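A quick way to settle the question would be to look directly at the ports of that flow in the saved sample (the file name is hypothetical); DNS would show up as port 53, NetBIOS name service as ports 137/138:

tcpdump -n -r 10.16.2003-@09.40.pcap \
    'udp and src host 192.168.0.2 and dst host 192.168.0.255' | head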
With IPX being the dominant protocol, this type of graph (by connection) is more useful for checking IPX behavior: the number of IPX machines is low and the check is more consistent. Here we have the graph:

As expected, the Netware server is involved in the first four connections. Something really interesting is that there is a direct connection between two clients; I thought Netware connections were only between clients and servers. Again, I'm not a Netware expert and limit myself to presenting the resulting graphs; of course, I would be very interested in exchanging information with people who know this protocol better.
 
Why do we try to check individual connections? To see whether two, three, or four Netware clients account for most of the IPX traffic with the Netware server. The graph is just one sample, at 9:40 a.m. on October 16, 2003, so it is better to check the summary at the end of the day. We have this information in the daily printed system throughput report; for the day being considered, this report shows the following for the IPX protocol distribution by host:
 

LOAD DISTRIBUTION BY HOST IPX

Host                    Bytes    Share
--------------------------------------

00:00:00:00:00:01   926188676   98.05%
00:30:4f:1a:3c:00   417351520   44.18%
00:20:af:dc:d6:f3   129483773   13.71%
00:30:4f:16:aa:63   115361648   12.21%
00:e0:29:5b:99:bd    70966979    7.51%
00:01:02:65:8d:fb    61102223    6.47%
00:00:1b:4c:ed:8d    52157250    5.52%
00:e0:7d:84:b8:f9    35770822    3.79%
00:01:02:be:56:a4    21097706    2.23%
00:40:f4:00:02:f9    13376371    1.42%
00:01:02:65:84:87    12130274    1.28%
00:80:ad:80:3d:45    10675469    1.13%

Observe that one station, identified by the MAC address 00:30:4f:1a:3c:00, accounts for almost 45% of the IPX traffic with the Netware server. Can this behavior be confirmed for the other 5 sample days? It's just a matter of checking the other daily printed system throughput reports. For example, on October 17, 2003:
 

LOAD DISTRIBUTION BY HOST IPX

Host                    Bytes    Share
--------------------------------------

00:00:00:00:00:01   682797871   94.35%
00:30:4f:1a:3c:00   390771358   54.00%
00:20:af:dc:d6:f3   118329478   16.35%
00:e0:7d:84:b8:f9    74825385   10.34%
00:01:02:65:8d:fb    36140643    4.99%
00:00:1b:4c:ed:8d    21682782    3.00%
00:e0:29:5b:99:bd    19451039    2.69%
00:30:4f:16:aa:63    19091716    2.64%
00:01:02:65:84:a3    15609283    2.16%
00:00:1b:4a:f1:d6    15457972    2.14%
00:04:75:89:39:27    13286254    1.84%
00:01:02:be:56:a4    13069616    1.81%
00:80:ad:8e:ba:14    10447772    1.44%
00:01:02:65:84:87     7874014    1.09%

The same pattern appears for station 00:30:4f:1a:3c:00, and even for the second station, 00:20:af:dc:d6:f3. Let's check the next day; here we have the sample for October 18, 2003:

LOAD DISTRIBUTION BY HOST IPX

Host                    Bytes    Share
--------------------------------------
ff:ff:ff:ff:ff:ff     2129256   79.22%
00:a0:24:e7:fb:f9     1388372   51.66%
00:30:4f:16:aa:63      507798   18.89%
00:a0:24:e7:fc:5d      461196   17.16%
00:00:00:00:00:01      218938    8.15%
00:01:02:65:84:a3      150810    5.61%
00:00:1b:4a:f1:d6      140392    5.22%
00:40:f4:00:02:f9      114134    4.25%
00:80:ad:74:7c:ca       97431    3.63%
00:c0:6c:12:51:15       73169    2.72%
00:01:02:be:56:a4       47828    1.78%
Okay, what's going on here? Nothing to worry about: October 18, 2003 was a Saturday, not a working day, and the IPX contribution is just Netware broadcast traffic. So let's skip Sunday and check Monday, October 20, 2003:


LOAD DISTRIBUTION BY HOST IPX

Host                    Bytes    Share
--------------------------------------
00:00:00:00:00:01  1204422210   99.42%
00:30:4f:1a:3c:00   453485777   37.44%
00:e0:7d:84:b8:f9   279426525   23.07%
00:01:02:be:56:a4   145055683   11.97%
00:20:af:dc:d6:f3   118014000    9.74%
00:00:1b:4c:ed:8d   101067037    8.34%
00:30:4f:16:aa:63    66115394    5.46%
00:80:ad:8e:ba:14    27824359    2.30%
00:80:ad:80:3d:45    11406624    0.94%
00:01:02:65:84:87     5044033    0.42%

The same pattern for the first station. Finally, Tuesday, October 21, 2003:


LOAD DISTRIBUTION BY HOST IPX

Host                    Bytes    Share
--------------------------------------
00:00:00:00:00:01   781949019   98.88%
00:30:4f:1a:3c:00   258782525   32.72%
00:30:4f:16:aa:63   252469788   31.92%
00:e0:7d:84:b8:f9   146697572   18.55%
00:20:af:dc:d6:f3    57005106    7.21%
00:80:ad:8e:ba:14    22282571    2.82%
00:01:02:be:56:a4    19209753    2.43%
00:00:1b:4c:ed:8d    17604644    2.23%
00:01:02:65:8d:fb    15260340    1.93%
00:04:75:89:39:27     5492469    0.69%
00:01:02:65:84:a3     1500767    0.19%
00:01:02:65:84:87     1304750    0.16%

Eureka!! The pattern is repeated again. This check is very important because now we know that the Netware stations 00:30:4f:1a:3c:00, 00:20:af:dc:d6:f3, 00:30:4f:16:aa:63, 00:e0:7d:84:b8:f9, and 00:01:02:be:56:a4 account for most of the IPX traffic with the Netware server in the LAN.
Observe that checking by addresses helps us a lot to frame the situation when dealing with the IPX protocol; this was not the case for the TCP and UDP protocols, where the multitude of addresses makes it difficult to get a clear picture of what is going on. With TCP and UDP it is better to check by ports. Next is the source port protocol distribution graph at 9:40 a.m., where instant throughput by source port was plotted vs time for the TCP protocol:


Incredible!! Port 80 traffic (www) is only in fourth position by system throughput; TCP is mainly being used to transport some other kind of service. Port assignment is regulated by IANA, but some people simply ignore the rules and use port numbers following their own interests. Checking the IANA port assignments at http://www.iana.org/assignments/port-numbers, let's see what they say about ports 4662, 3861, and 4374:
winshadow-hd    3861/tcp   winShadow Host Discovery
winshadow-hd    3861/udp   winShadow Host Discovery

#               4662-4671  Unassigned

#               4359-4425  Unassigned
Well... searching the Internet for winShadow you can find this: http://download.com.com/3000-7240-10173793.html ; winShadow is a remote PC access product whose host-discovery traffic uses this port. Port 4662 is not assigned to any protocol by IANA, but if you search Google for "Port 4662" you will get a bundle of links, and sooner rather than later you will discover that this port is used by e-mule to make P2P connections. Port 4374 is not IANA-assigned to any protocol either, but I couldn't find any interesting information searching for "Port 4374" or even "TCP:4374".
The same graph but for destination port is as follows:

Again, port 4662 dominates the TCP scene.
Let's see what's going on with the UDP protocol when checking by ports; first, the source port protocol distribution graph for UDP at 9:40 a.m.:

Nothing extraordinary; ports 137 and 138 are NetBIOS. This could explain the UDP broadcast traffic that we saw above when checking by address: NetBIOS uses broadcasts to discover and propagate NetBIOS station names. Port 4672 is registered as rfa (remote file access server), but checking on the Internet, this port is used for e-mule traffic too. Port 53 is just DNS.
Looking at the same graph for the destination port, we have:

Something very similar to the first graph. In general we did not put too much effort into verifying the UDP protocol because, as I said before, its contribution to the total system throughput is really negligible.
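For completeness, the per-port aggregation behind these graphs and the printed report can be sketched as follows, again assuming the hypothetical per-packet layout described earlier (epoch-seconds protocol bytes src-addr src-port dst-addr dst-port). Each packet is credited to both its source and its destination port, which would also explain why the shares in the printed reports add up to more than 100%.

# port_share.awk -- total bytes and share per port for one protocol
# usage: awk -v proto=tcp -f port_share.awk packets.txt | sort -k2 -rn | head
$2 != proto { next }
{
    total     += $3
    bytes[$5] += $3      # credit the source port
    bytes[$7] += $3      # credit the destination port
}
END {
    for (p in bytes)
        printf "%8s %12d %7.2f%%\n", p, bytes[p], 100 * bytes[p] / total
}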
Having the throughput distribution by source, by destination, or even by connection does not give us the total throughput handled by each station (host) or port. Checking by host (source and destination together) gives us the load handled by each station.
Let's see the host protocol distribution graph where instant throughput per host (source and destination addresses)  by protocol was plotted vs time for the TCP protocol:

Okay; the Internet server appears to dominate the scene. The rest of the addresses belong to stations outside Company X. Let's see the same graph for the UDP protocol:

Here the broadcast traffic, probably from the NetBIOS protocol, dominates within the internal network 192.168.0/24; the Internet server follows. I think that if Company X re-checked their Windows settings they could use only TCP, eliminating NetBIOS and saving some kbps of broadcast traffic. A friend of mine who knows about Windows (about Windows I only know that the O is round) told me that by using WINS they could eliminate most of the NetBIOS broadcast traffic. Again, I'm not an expert, and it's just a matter of making some tests and measurements.
Finally, for host distribution, let's see the graph for the IPX/SPX protocol; here we have:

This graph just confirms what we said above: five machines account for most of the IPX network traffic.
Now let's have a look at the same graphs, but this time using ports. Here we have the port protocol distribution graph, where instant throughput per port (source and destination ports), by protocol, was plotted vs time, for the TCP protocol:

For TCP, e-mule is the indisputable winner. Let's see what happens with UDP:

For UDP, NetBIOS is the strongest protocol. Port 4672 is also e-mule traffic.
What about the system throughput distribution for these ports? Having a look at the printed report, we have:

Load distribution by ports tcp+udp

             Port         Bytes     Share
-----------------------------------------
             4662       4998868    82.41%
             3861       1281523    21.13%
            other       1281506    21.13%
             4987        902034    14.87%
             4374        879984    14.51%
               80        844288    13.92%
             1414        708072    11.67%
            29564        459024     7.57%
            36322        355742     5.86%
            16261        252996     4.17%
               25        167339     2.76%

To reduce the length of the system throughput report, I combined the TCP and UDP protocols in the same measure and that was probably an error. But taking into account that the UDP contribution is negligible, we can see that e-mule traffic accounts for 82.41% of the TCP traffic. With TCP traffic being on the order of 15% of the total traffic in the network during working hours, e-mule is using about 12.36% of the network resources (0.8241 * 0.15 ≈ 0.1236). Also, the traffic through port 3861 is probably related to e-mule as well, because it is winShadow Host Discovery traffic, used to discover P2P peers; so practically all the network resources used by the TCP protocol are committed to e-mule.
Recommendations
Recommendations to Company X are very simple because their problems are very localized:
  1. They should upgrade the network to increase its throughput. They could begin by replacing the hubs with switches; switch prices have been dropping steadily for some time now, and this change alone can improve network performance quite easily. The next step is to upgrade the network to 100 Mbps.
     
  2. As a palliative, while they decide whether to go ahead with the upgrade, they should stop using e-mule. We suppose that the people working at Company X are using e-mule for themselves, and it has nothing to do with the current business. E-mule is not only consuming about 15% of the network throughput (practically the whole TCP share), it is also pushing the number of TCP and UDP connections to the limit, which exhausts switch and hub resources. Have a look at the printed report to see how many connections are open at the same time; for example, on 10.16.2003 at 16:20 there were 491 TCP connections and 863 UDP connections. It would be a good exercise to check the number of connections again with e-mule out of the network (a minimal counting sketch is given after this list).
     
  3. This is a delicate matter, but it also has to be said: IPX/SPX is not the best friend of a network running at its limit, overloaded and congested most of the time. The protocol is very aggressive, as the graphs above show; it is also a proprietary protocol and we do not know much about its internals; and, finally, we are sure it is not nearly as good as TCP at keeping a network healthy. A Netware specialist should advise Company X about the possibility of continuing to use Netware running over TCP; in fact, we have heard that newer versions of Netware offer TCP/IP as the first option. People at Company X insist on using Netware because they do not trust NT/Windows 2000 Server, starting with virus problems. We understand and share this concern, so the best solution would be to try to run Netware over TCP.
     
  4. To get better performance and fewer transients in Company X's network, it could help a lot to increase the buffer space of the hosts and servers across the network. More buffer space improves network behavior by reducing packet loss and softening the sharp transients generated by aggressive protocols. This network does not carry any kind of delay-sensitive traffic, so the added latency is a price worth paying for smoother traffic. The buffers can be increased very easily using the Windows and Netware facilities for adjusting these operating parameters.
     
  5. There is a very interesting situation with the IPX/SPX traffic. As seen above, most of the IPX/SPX traffic (almost 85% of the network bandwidth) flows between the Netware server and three or four emblematic machines. It would therefore be very worthwhile to spread these machines over different arms of the main switch to improve the traffic balance. This can be done at very low cost, and the change could improve performance very quickly once a better traffic balance is achieved.
     
  6. Finally, it is very important to say that any idea of upgrading the Internet connection again to a higher speed, without first upgrading the local network, has to be rejected. Such an upgrade would make things worse. We think it was a mistake to upgrade the Internet connection from 256 kbps to 1536 kbps without a previously studied LAN upgrade. This is one of those cases where a decision to upgrade something in fact makes things worse than before. In "Buying additional bandwidth is always the solution?" we explain why network upgrade decisions cannot be made flying without instruments, and why a load study is a must before attempting a network upgrade.
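As mentioned in recommendation 2, a rough count of the distinct TCP connections seen in one sample can be obtained with a short script like the one below, again assuming the hypothetical per-packet layout used in the earlier sketches (epoch-seconds protocol bytes src-addr src-port dst-addr dst-port):

# conn_count.awk -- count distinct TCP connections (address:port pairs,
# direction ignored) in one pre-parsed sample
# usage: awk -f conn_count.awk packets.txt
$2 == "tcp" {
    a = $4 ":" $5; b = $6 ":" $7
    key = (a < b) ? a " " b : b " " a      # normalize direction
    if (!(key in seen)) { seen[key] = 1; n++ }
}
END { print n, "distinct TCP connections in this sample" }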