mombe.org
home of the mad cow
  Not A Blog
 

Friday, May 06, 2005

FreeBSD DEVICE_POLLING, GigE and traffic

We have a FreeBSD machine on campus that is tasked with keeping 1300 students off our academic network. It is a 2.8GHz machine built on an Intel Winter Park motherboard, with an Intel PRO/1000 MF Gigabit fibre NIC in it.

About two months ago we started to experience problems with the machine going into livelock during periods of high network activity: it was spending its entire life processing interrupts from the network card. This seemed an obvious candidate for polling(4), so we duly compiled a kernel with DEVICE_POLLING and HZ=1000. This seemed to solve our problems nicely.

Recently, however, we've heard complaints from the students that transfer speeds on their part of the network were somewhat slower than they used to be. It is worth explaining at this point that our student network consists of 40 separate subnets (one per residence) and that the machine in question is the default gateway for each of these subnets. Thus it has about 40 vlan interfaces on it, all sitting on the same em interface provided by the fibre NIC. Connectivity to the rest of our network is provided by the 100Mbps on-board fxp, but experience has shown that the majority of the traffic stays within the residence system. A few tests showed that their complaints were indeed justified.

We initially thought the problem might be saturated 100Mbps links in some of our regional switching centres, so we got MRTG to draw us some graphs. This quickly showed that we were barking up the wrong tree.

Some research showed that the em device was generating large numbers of errors (in the region of 600 per second) and that throughput appeared to be capped at just over 20,000 packets a second (amounting to about 20MB/s of traffic on a gigabit link). This was clearly a problem, as the left hand side of the two graphs below shows:

[Graphs: Errors Per Second; Packets Per Second]
(Note: these graphs cover 48 hours. The final 24 hours is available in more detail: errors, packets.)
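As a rough sanity check on those figures, here is a small Python sketch of the arithmetic. The only inputs taken from above are the 20,000 packets/sec and 20MB/s readings; the function names are made up for illustration, and 125MB/s is simply 1Gbps divided by 8 with no allowance for framing overhead.

```python
GIGABIT_MB_PER_S = 125.0  # 1 Gbit/s divided by 8 bits per byte, ignoring overhead


def average_packet_bytes(mb_per_s: float, packets_per_s: float) -> float:
    """Mean packet size implied by a throughput figure and a packet rate."""
    return mb_per_s * 1_000_000 / packets_per_s


def link_utilisation(mb_per_s: float,
                     capacity_mb_per_s: float = GIGABIT_MB_PER_S) -> float:
    """Fraction of the link's nominal capacity that is in use."""
    return mb_per_s / capacity_mb_per_s


if __name__ == "__main__":
    print(f"average packet: {average_packet_bytes(20, 20_000):.0f} bytes")
    print(f"link utilisation: {link_utilisation(20):.0%}")
```

In other words, the plateau works out to roughly 1000-byte packets on average and well under a fifth of what the link can nominally carry.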

Google suggested a first take at a solution. It appears that what's been happening is that packets have been arriving at the interface faster than the polling loop has been able to remove them. This means that the NIC's buffer will fill up and eventually packets will be discarded as there is no space to store them. It seems that each time the buffer overflows, it generates an error, so the 600 errors/sec that we were seeing at peak times could probably be interpreted as 600 out of every 1000 polling cycles finding that the buffer had overflowed. Not good.
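To make that failure mode concrete, here is a toy model in Python. It is only a sketch of the idea and not how the em driver actually works: the buffer size of 256 slots, the arrival pattern and the function name are all invented for illustration, and "one overflow per tick" mirrors my guess above about how the errors are counted.

```python
from typing import Iterable, Tuple


def simulate_polling(arrivals_per_tick: Iterable[int],
                     buffer_slots: int,
                     burst_limit: int) -> Tuple[int, int, int]:
    """Toy polled receive buffer.

    Each tick, packets arrive first (anything that does not fit is dropped
    and the tick is counted as an overflow), then the poll removes at most
    burst_limit packets. Returns (delivered, dropped, overflow_ticks).
    """
    queued = delivered = dropped = overflow_ticks = 0
    for arrivals in arrivals_per_tick:
        free = buffer_slots - queued
        if arrivals > free:
            dropped += arrivals - free
            overflow_ticks += 1
            queued = buffer_slots
        else:
            queued += arrivals
        taken = min(queued, burst_limit)
        queued -= taken
        delivered += taken
    return delivered, dropped, overflow_ticks


if __name__ == "__main__":
    # Hypothetical bursty traffic: five ticks, 256 slots, 150 packets per poll.
    print(simulate_polling([100, 400, 50, 300, 300], 256, 150))  # (700, 344, 3)
```

Even though this made-up traffic averages 230 packets per tick, which fits in the buffer, the bursts overflow it in three ticks out of five.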

The obvious thing to do was to increase the frequency at which we polled the network card for packets so at about 8.30am (just to the right of middle on the graphs) we installed a new kernel with the HZ value set to 1500. This made a noticeable difference in the number of errors compared to the same time the previous day and our packets/sec count appeared to stop table-topping.

Unfortunately, at about 1pm we noticed the error rate start to creep up again and saw that the packet/sec count was again showing signs of reaching a plateau. A new kernel was compiled and installed at about 3.30pm. Based on our previous success, this time the HZ value was set at 2500. You can see the impact of this as a slight drop in the errors and a slight increase in the number of packets we could handle. Clearly we weren't getting bang for buck any more out of the HZ value.
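For reference, this is what those HZ values mean in terms of time between polls, as a quick Python sketch. The 20,000 packets/sec figure is the plateau mentioned earlier; the function names are mine. Bear in mind these are averages, and real traffic arrives in bursts, so the per-poll numbers understate the peaks.

```python
def poll_interval_ms(hz: int) -> float:
    """Time between polling ticks, in milliseconds."""
    return 1000.0 / hz


def average_packets_per_poll(packets_per_s: float, hz: int) -> float:
    """Packets that accumulate between two ticks at a steady packet rate."""
    return packets_per_s / hz


if __name__ == "__main__":
    for hz in (1000, 1500, 2500):
        print(f"HZ={hz}: poll every {poll_interval_ms(hz):.2f} ms, "
              f"about {average_packets_per_poll(20_000, hz):.1f} packets per poll")
```

Going from 1000 to 2500 shortens the interval from 1ms to 0.4ms, but every extra tick also costs CPU time, which fits with the diminishing returns we saw.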

Some more research followed and this time we decided to experiment with the kern.polling.user_frac value. This value controls how the scheduler splits CPU time between the polling loop and user processes. The idea is that by reserving CPU time for userland we can prevent livelock, at the expense of degraded network performance at times of heavy load. By default 50% is reserved for user processes, so we decided to reduce it to 30% (leaving 70% of the CPU to handle our network traffic). This happened at 5pm and caused the big change visible in both graphs, showing that our problems were now clearly CPU related.

polling(4) suggested another knob to tweak, kern.polling.burst_max. Together with the HZ kernel option, kern.polling.burst_max caps the number of packets the machine is prepared to take from each interface per second (at most HZ * burst_max). Its default value is 150, so we decided to double it to 300. This happened at around 8pm and the results were surprising ... You'll notice a decrease in performance and an increase in errors, which was completely counter to what was expected. My suspicion is that we'd reached the CPU's limit and that we were increasing the number of TCP retries. As a result, we backed off to the default kern.polling.burst_max value of 150.
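The relationship between HZ and the burst limit can be sketched in a few lines of Python. polling(4) lists this sysctl as kern.polling.burst_max and says each interface can receive at most HZ * burst_max packets per second (unless spare idle cycles are available for polling). The function name is mine, and the table just plugs in the values we tried.

```python
def polling_packet_ceiling(hz: int, burst_max: int) -> int:
    """Upper bound on packets/sec per interface under polling: HZ * burst_max."""
    return hz * burst_max


if __name__ == "__main__":
    for hz, burst_max in ((1000, 150), (2500, 150), (2500, 300)):
        print(f"HZ={hz}, burst_max={burst_max}: "
              f"{polling_packet_ceiling(hz, burst_max):,} packets/sec at most")
```

All of these ceilings are far above the roughly 20,000 packets/sec we were actually seeing, which is consistent with the limit being the CPU and not the knob itself.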

So what effect did all of this have on network throughput? The answer to that can be seen in our bandwidth graph for the same day:

[Graph: ResNet Bandwidth]
Again, the last 24 hours is available in more detail.

You'll notice that throughput went from peaking at around 20MB/s to peaking at around 35MB/s. A clear improvement. No doubt there are plenty of people who're a little happier about how fast their crap is downloading.

We weren't satisfied there, however, so this morning we did even more optimisations. Since we know we're CPU bound, and this machine's primary use is to provide a firewall, we looked at the firewall rule-sets. We generate graphs of the traffic generated by each residence and at the time this was all happening, ipfw had four count rules for each of 40 residences. This was 160-odd rules just dedicated to counting traffic. By judicious use of ipfw skipto rules, we managed to reduce this by about two thirds, meaning that for a particular packet, the firewall would only have to process about 50 rules. The effects of this aren't visible on the graphs above, but it did give a slight increase (of about 5000/second) in the number of packets we could process and showed our bandwidth use now peaks at over 40MB/s.
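To show why skipto helps, here is a small Python model of a rule list. It is a sketch under assumptions, not our actual ruleset: the layout (one guard rule in front of each residence's block of four count rules, jumping past the block when the packet isn't from that residence), the rule numbering and all the names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    number: int
    action: str                   # "count" or "skipto"
    residence: int                # residence this rule refers to
    target: Optional[int] = None  # rule number to jump to (skipto only)


def build_flat(residences: int, per_residence: int) -> List[Rule]:
    """Every count rule in one long list, as in the original setup."""
    return [Rule(100 + 10 * i, "count", i // per_residence)
            for i in range(residences * per_residence)]


def build_with_skipto(residences: int, per_residence: int) -> List[Rule]:
    """Hypothetical layout: a guard in front of each residence's block."""
    rules, number = [], 100
    for res in range(residences):
        block_end = number + 10 * (per_residence + 1)
        rules.append(Rule(number, "skipto", res, block_end))
        for i in range(1, per_residence + 1):
            rules.append(Rule(number + 10 * i, "count", res))
        number = block_end
    return rules


def rules_evaluated(rules: List[Rule], packet_residence: int) -> int:
    """Count how many rules a packet is tested against."""
    evaluated, skip_until = 0, 0
    for rule in rules:
        if rule.number < skip_until:
            continue  # jumped over by an earlier skipto
        evaluated += 1
        # In this model the guard jumps when the packet is NOT from its residence.
        if rule.action == "skipto" and rule.residence != packet_residence:
            skip_until = rule.target
    return evaluated


if __name__ == "__main__":
    print(rules_evaluated(build_flat(40, 4), packet_residence=7))         # 160
    print(rules_evaluated(build_with_skipto(40, 4), packet_residence=7))  # 44
```

In this model a packet is tested against 40 guards plus its own four count rules, which lands in the same ballpark as the "about 50" we ended up with.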

For good measure, just in case the fxp 100Mbps interface was slowing things down, we changed it to another em interface running at 1Gbps (it is in fact the second fibre port on the PRO/1000 MF NIC). It hasn't made any noticeable difference though, which is entirely what was expected. Ironically, the 100Mbps links in our regional switching centres are now significantly closer to saturation than they were (we're peaking at 65% utilisation rather than 30%). It's still not (yet) a problem though.

All of this, however, means that we've effectively doubled the throughput of our residence network in the last 24 hours. Students in residence probably owe David and me pizza for two days of unpaid overtime ;-)

We're still not happy though. More optimisations can be made to the firewall to reduce its CPU load and probably more tweaks can be made to the polling knobs. A good start is perhaps to reduce HZ back down to the point at which it stops improving things and then maybe to play with kern.polling.burst_max again.

Another consideration is the fact that the fibre NIC is capable of operating on a PCI-X bus (64 bit, 66 MHz), but the Winter Park board it is in only has a PCI bus. This shouldn't matter as the PCI bandwidth should exceed the 125MB/s that gigabit Ethernet is capable of, but it is a possible bottleneck. We're considering moving the firewall (disk + NIC) to a Torrey Pines-based machine to see if a PCI-X slot will improve things. We could also throw more processor at the problem — the 2.8GHz P4 is now old technology and 3.6GHz processors are easily available. We want to exhaust the free solutions first though ;-)
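The bus arithmetic behind that paragraph looks roughly like this in Python. The PCI-X figures (64 bit, 66 MHz) are from above; the plain PCI figures (32 bit, 33 MHz) are my assumption about the Winter Park slot. These are theoretical peaks for a shared bus, so real throughput will be lower.

```python
def bus_bandwidth_mb_per_s(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak of a parallel bus: bytes per transfer times clock rate."""
    return width_bits / 8 * clock_mhz


GIGE_MB_PER_S = 1000 / 8  # gigabit Ethernet, in MB/s

if __name__ == "__main__":
    pci = bus_bandwidth_mb_per_s(32, 33)    # assumed: conventional 32-bit/33MHz PCI
    pci_x = bus_bandwidth_mb_per_s(64, 66)  # PCI-X as quoted above
    print(f"PCI:   {pci:.0f} MB/s ({pci / GIGE_MB_PER_S:.2f}x gigabit)")
    print(f"PCI-X: {pci_x:.0f} MB/s ({pci_x / GIGE_MB_PER_S:.2f}x gigabit)")
```

If that assumption holds, plain PCI only just clears the 125MB/s mark, and a packet that is forwarded has to cross the bus twice when both NIC ports are in use, so the headroom is thinner than it first looks.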

Update: 2005/05/06 18:10

While I've been dealing with Telkom, David's spent a large part of this afternoon looking at optimising the firewall ruleset that runs on this machine. In particular we've changed the way stats are gathered, and we've tried to reduce the number of rules that a single packet has to pass through before it gets to the end of the ruleset. This has increased the network throughput to a new peak of 55MB/s and, more importantly, has reduced the amount of time the CPU spends processing network traffic from the 70% ceiling allowed by kern.polling.user_frac to about 45%. In other words, we now have capacity to spare. The graph of the last 48 hours shows it all:

[Graph: ResNet Bandwidth]

We go from about 10MB/s two days ago to over 55MB/s at the moment. We're now sitting at just under half the gigabit link's capacity, and we're table-topping the graphs for our regional switching centre uplinks. Which brings us full circle to where this all started :-)

posted by guy at: 12:29 SAST | path: /systems | permanent link

Wednesday, April 06, 2005

QMQP and SMTP

RUCUS is a big Qmail shop. They run Qmail as a mail server on one of their machines, and make use of QMQP to queue mail from their other machines. This avoids the need for a full-blown mail server package on all except the actual mail server and works very well when the QMQP server is running.

Unfortunately it's been helluva unstable recently and has been crashing a lot. This is irritating because you can't send mail from any of the machines (you've lost the benefit of a local queue). So what we need is another QMQP server to act as a backup in case the main one dies. A normal mini-qmail install supports this, but unfortunately there isn't a convenient place to house the backup QMQP server.

All this got me wondering whether one could do QMQP -> SMTP protocol translation and use a standard SMTP server instead. The backup MX server, for instance, which runs Exim and doesn't know diddly-squat about QMQP. Google didn't seem to think so, which bugged me.

So being JAPH I decided that I'd make one. It wasn't too difficult. It just required getting your head around netstrings and realising that Text::Netstring wasn't going to do what I needed it to do.
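For anyone who hasn't met netstrings: a netstring is the length in decimal, a colon, the bytes, and a trailing comma, and a QMQP package is a netstring wrapping the message, the envelope sender and each envelope recipient, each as its own netstring. My implementation is Perl; the following is just a minimal Python sketch of the encoding, with function names invented for illustration.

```python
from typing import List, Tuple


def encode_netstring(data: bytes) -> bytes:
    """b'hello world!' -> b'12:hello world!,'"""
    return str(len(data)).encode("ascii") + b":" + data + b","


def decode_netstring(buf: bytes, start: int = 0) -> Tuple[bytes, int]:
    """Decode one netstring at `start`; return (payload, next offset)."""
    colon = buf.index(b":", start)  # raises ValueError if there is no colon
    length_field = buf[start:colon]
    if not length_field.isdigit():
        raise ValueError("netstring length must be decimal digits")
    end = colon + 1 + int(length_field)
    if buf[end:end + 1] != b",":
        raise ValueError("netstring is truncated or missing its comma")
    return buf[colon + 1:end], end + 1


def build_qmqp_package(message: bytes, sender: bytes,
                       recipients: List[bytes]) -> bytes:
    """Wrap message, sender and recipients the way a QMQP client sends them."""
    inner = encode_netstring(message) + encode_netstring(sender)
    inner += b"".join(encode_netstring(r) for r in recipients)
    return encode_netstring(inner)


def parse_qmqp_package(package: bytes) -> Tuple[bytes, bytes, List[bytes]]:
    """Split a QMQP package into (message, sender, recipients)."""
    inner, _ = decode_netstring(package)
    fields, pos = [], 0
    while pos < len(inner):
        field, pos = decode_netstring(inner, pos)
        fields.append(field)
    if len(fields) < 3:
        raise ValueError("need a message, a sender and at least one recipient")
    message, sender, *recipients = fields
    return message, sender, recipients
```

Once the package is unpacked like this, the three parts map straight onto what an SMTP conversation needs: MAIL FROM, one RCPT TO per recipient, and DATA.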

In case this is useful to someone else, the source code is available. It isn't terribly complicated and relies on the tcpserver from DJB's ucspi-tcp package to do networking. I think it honours the QMQP protocol fairly well.
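The other half of the translation is the reply. A QMQP server answers with a single netstring whose first byte is K (accepted), Z (temporary failure) or D (permanent failure), followed by a description. One plausible way to derive that from the SMTP server's final reply code is sketched below in Python; the mapping and the function names are my illustration here and not necessarily what the published source does.

```python
def qmqp_status_for_smtp(code: int) -> bytes:
    """Map an SMTP reply code onto a QMQP status byte."""
    if 200 <= code < 300:
        return b"K"  # accepted
    if 500 <= code < 600:
        return b"D"  # permanent failure
    return b"Z"      # 4xx, or anything unexpected: ask the client to retry


def qmqp_response(code: int, text: str) -> bytes:
    """Build the netstring a QMQP client expects back."""
    payload = qmqp_status_for_smtp(code) + text.encode("ascii", "replace")
    return str(len(payload)).encode("ascii") + b":" + payload + b","


if __name__ == "__main__":
    print(qmqp_response(250, "ok"))  # b'3:Kok,'
```

Treating anything unrecognised as temporary errs on the side of the client keeping the message and trying again.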

So now I can run a QMQP to SMTP gateway on the QMQP client machine and use it to fail over when the QMQP server fails. It sort of defeats the idea of mini-qmail, but might help minimise the effects of a server crashing.

YMMV and all that.

posted by guy at: 12:27 SAST | path: /systems | permanent link


© 2002-2005, webmaster@mombe.org
 
 

Creative Commons License