PDA

View Full Version : Fragment buffer overflows



purple
02-17-2005, 07:30 AM
Ok, I'm pretty much out of ideas to explain this. So I was hoping someone else here could maybe clue me in if I explain what is going on.

This is what is happening on the console:


Debug: SEQ: Giving up on finding arq 010b in stream zone-client cache, skipping!
Debug: SEQ: Giving up on finding arq 0119 in stream zone-client cache, skipping!
Debug: SEQ: Giving up on finding arq 011c in stream zone-client cache, skipping!
Debug: SEQ: Giving up on finding arq 011d in stream zone-client cache, skipping!
Debug: SEQ: Giving up on finding arq 011e in stream zone-client cache, skipping!

Warning: !!!! EQPacketFragmentSequence::addFragment(): buffer overflow adding in new fragment to buffer with seq 0264 on stream 3, opcode 0164. Buffer is size 103392 and has been filled up to 103074, but tried to add 506 more!


Now let me explain what is happening behind the scenes. I've walked log files for this and am sure this is what is up, but I'm at a loss to explain it.

Packets in the EQ Protocol are sometimes sequenced, which means that they are assigned a chronological number that keeps going up by one. This is referenced as arq or seq above (they are the same thing). Sequenced packets cannot be processed until the whole sequence up to the received one has been processed. Unsequenced packets can be processed immediately.

When you zone, the server throws a boatload of information at you. Some of this information is very large and cannot be sent over in one packet. This is where fragmentation comes in. The oversized payload is broken into pieces and sent across multiple packets. Each packet seems to have a max size of 512 bytes, and they are compressed. So for example, the ItemPlayerPacket, which lists all items on your person and in your bank, can be over 100k in size, split across ~200 packets. In the case above, the ItemPlayerPacket is 103392 bytes in size.

When the first fragment of an oversized packet is received, it contains some extra information. Every fragment in the oversized packet is sequenced. The first one, though, also contains the total length of the whole oversized payload. That is where the 103392 comes from. The first fragment says it needs a buffer that big to write into. It doesn't say "and I last until this arq sequence number". It just says it wants a buffer of a specific size.

As each fragment comes in, if it is the expected member of the sequence and part of an oversized packet, it is added to the existing fragment buffer. This happens until the buffer is exactly filled. You can guarantee this happens because of sequencing. The server sends the split-up fragments properly sequenced, so we should see them that way. By refusing to process sequenced packets that aren't the next one we need, we know that when the buffer is full, the set of fragments is complete and we can process the oversized payload.
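That invariant (allocate from the first fragment's advertised total, append in sequence order, complete exactly when full) can be sketched like this. The class and method names are illustrative, not the actual EQPacketFragmentSequence code; the overflow branch corresponds to the warning in the console output above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical sketch of the reassembly rule: the first fragment carries
// the total payload size, later fragments are appended in sequence order,
// and the payload is complete exactly when the buffer is full.
class FragmentBuffer
{
public:
    // First fragment: size the buffer from the advertised total, then
    // append the first fragment's own data (it counts toward the total).
    void start(uint32_t totalSize, const uint8_t* data, size_t len)
    {
        m_buf.assign(totalSize, 0);
        m_filled = 0;
        append(data, len);
    }

    // Subsequent fragments: append, detecting the overflow case from the
    // warning message (adding more bytes than the buffer can hold).
    bool append(const uint8_t* data, size_t len)
    {
        if (m_filled + len > m_buf.size())
            return false; // "buffer overflow adding in new fragment"
        std::memcpy(m_buf.data() + m_filled, data, len);
        m_filled += len;
        return true;
    }

    bool complete() const { return m_filled == m_buf.size(); }

private:
    std::vector<uint8_t> m_buf;
    size_t m_filled = 0;
};
```

If a fragment in the middle is skipped, the "exactly full" condition can never be met, which is why the real code eventually tries to write past the end of the buffer.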

The problem comes when we "miss" packets. That's what those "giving up on finding arq" messages are. Seq will give up on waiting for a sequence number it needs if its sequenced packet cache has more than a set number of items in it. This limit is called arqSeqGiveUp and I think it defaults to 512. If the cache of future packets (packets we received but whose arq sequence number is ahead of the one we are expecting) gets too large, we give up on ever seeing the one we are expecting.
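The give-up rule can be sketched roughly like this. Names are illustrative, not the actual ShowEQ code, and real code must also handle 16-bit sequence wraparound, which this sketch ignores:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Out-of-order sequenced packets wait in a cache keyed by arq sequence
// number; when the cache grows past arqSeqGiveUp, we stop waiting and
// skip ahead to the oldest cached sequence.
struct SeqCache
{
    std::map<uint16_t, bool> cache; // future packets we can't process yet
    uint16_t expected = 0;          // next arq seq we're waiting on
    std::size_t arqSeqGiveUp = 512; // cache size that triggers giving up
    int processed = 0;

    // Process every cached packet that is now next in line.
    void drain()
    {
        for (auto it = cache.find(expected); it != cache.end();
             it = cache.find(expected))
        {
            ++processed;
            cache.erase(it);
            ++expected;
        }
    }

    void offer(uint16_t seq)
    {
        if (seq == expected)
        {
            ++processed;
            ++expected;
            drain();
        }
        else if (seq > expected)
        {
            cache[seq] = true;
            if (cache.size() > arqSeqGiveUp)
            {
                // "Giving up on finding arq ...": skip to the oldest
                // cached packet and carry on from there.
                expected = cache.begin()->first;
                drain();
            }
        }
        // seq < expected: duplicate of something already processed; drop.
    }
};
```

The skip-ahead in the give-up branch is exactly what breaks fragment reassembly: the fragment buffer never learns that a piece went missing.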

Unfortunately, once we give up on an arq sequenced packet that belongs in our fragment buffer, we have no way of ever picking fragmentation processing back up until someone zones. The only information we have about a set of fragments is the total size we expect all the fragments put together to be. When we skip a piece, we don't know how big that piece was or how it impacts the oversized packet as a whole. Right now, seq just plays dumb and keeps trying to work, so it overflows the allocated fragment buffer because it doesn't know to complete the previous fragment set and start a new one.

Mikey was kind enough to provide logs and I walked through them. I can see where it never receives the packets it gives up on. I can see where it fills the cache up to the point where it starts giving up (512 packets in the cache). It looks like the combination of lots of items/augs on your person and in the bank, a good-sized guild (though by no means that large), and lots of zone spawns/doors/ground spawns can make zoning take almost 1000 packets.

So my initial thought is that things are working ok, but the cache is giving up too early. You can up arqSeqGiveUp under Network->Advanced. Maybe up it to 1024 first to see if that helps. If not, maybe try 1536. The problem with higher numbers is that if you do indeed miss a sequenced packet for some reason, seq will wait for it and wait for it and wait for it. So instead of things crashing because it gave up on the expected arq sequenced packet, it will just hang there and not process any sequenced packets because it hasn't received one it needed.

The only other alternative I can come up with is that the people having problems have craptastic network cards and are losing packets because zoning is slamming the card too hard. But I find this hard to believe. In the logs, I can also see the eq client struggling to make its packet sequences work out, sending a lot of 00 11 packets signifying that it received future sequenced packets, for which the server resends the missing packets. It's just that we don't get what we need to make sense of the stream.

I don't know much about pcap or why it could be dropping packets. I don't know much about how older network cards would impact things, or slower linux boxes or anything. I am pretty sure that the net code is working properly, it's just not seeing what it needs to see in order to work. If increasing arqSeqGiveUp doesn't help, I'm pretty clueless as to what this could be. Maybe posting up your versions of pcap and specs on your linux box (ram, processor, network card) might raise a red flag.

Good luck getting through this post! And look at that, I put the helpful suggestions at the end. Silly me!

Dedpoet
02-17-2005, 07:44 AM
FWIW, I'm really only getting this occasionally zoning into PoK (highly populated), and Natimbi/Barindu which have had serious server-side lag issues on my server. I have gotten it one or two other times in other zones, though I can't tie that to anything.

I don't have my libpcap version here at work, but I'm on a laptop - P3 1.13, 256MB, onboard 3Com NIC, RedHat 9 with updated autoconf, automake, and libtools to work with newest Seq versions. It's no speed demon, but has always been sufficient. 2 Seq machines and 2 EQ machines are attached to a 10/100 hub and that uplinks to my router, which has a couple more boxes on it. Nothing strange or ridiculous.

As I said, it's pretty rare and/or localized to zones that I'm currently avoiding because they are sucking right now anyway, so I don't know if that will help you or not. I'll try bumping my arqSeqGiveUp and see what that does.

Thanks for all the hard work, purple.

purple
02-17-2005, 08:00 AM
Can you try bumping arqSeqGiveUp to 1024, Dedpoet? If that doesn't work, try 1536. If that doesn't work (though you'll probably switch from crashing to hanging) put it back to 1024 and turn on realtime (Network->Real Time Thread). You have to zone for those to take effect I think.

Realtime looks like it schedules the packet capture thread with higher priority and with a realtime flag set.
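On Linux, promoting a thread to realtime scheduling usually looks something like the sketch below. The function name is made up for illustration and this is a guess at what the option does, not the ShowEQ source; the call needs root (or equivalent privileges) to succeed.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <sched.h>

// Ask the kernel to run the calling (capture) thread under SCHED_FIFO
// with an elevated priority, so libpcap reads get serviced promptly
// while zoning floods the wire. Returns 0 on success, or an errno value
// (typically EPERM when not running as root).
int makeCaptureThreadRealtime(int priority)
{
    sched_param sp = {};
    sp.sched_priority = priority; // valid SCHED_FIFO range is usually 1..99
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
}
```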

Ratt
02-17-2005, 09:54 AM
Crappy network card, crappy cable, crappy hub, etc... will definitely cause this problem. If you search WAY back, you'll find some threads on this issue... and it's always been traced back to substandard equipment.

I'm not saying something hasn't changed in the stream, but equipment can most definitely cause this problem.

Dedpoet
02-17-2005, 01:06 PM
Yeah, I've been around here for quite a while and have read every thread - I do remember this issue. All I was pointing out is that I have never once had the issue until the netcode changes. I use name brand nics, cables, and hubs, and my network sees a significant amount of traffic, yet this problem is new to me. I don't run any background junk like p2p or game servers while I play EQ, I have a good cable connection, and on top of everything, this happens in specific zones. I think purple is right on with it being a sequencing issue. I'm at work right now but will be playing tonight. I'll bump to 1024 and zone in and out of PoK a bunch of times and see what I can break.

Ratt
02-17-2005, 01:57 PM
I'm just taking a stab in the dark, but if compression was turned on, could the packets that are being compressed be spanned over multiple packets... and they are not getting decompressed properly prior to processing, and thus you are missing the sequence? Or am I misunderstanding the problem?

purple
02-17-2005, 02:05 PM
This isn't me having the problem. Everything is working great for me. It's select others who are seeing problems. That's why it's so hard to isolate.

Compression is applied at the protocol level to individual packets right now. Failed uncompress calls should print warnings to the console.

Mikey
02-17-2005, 06:56 PM
My setup:

EQ and ShowEQ PC are connected to a hub and are the only machines on the hub. The hub then uplinks to a router that has a built-in switch. Other machines are connected at the router, but because of the switch, the EQ and ShowEQ PC's should be isolated from the rest of the network traffic. The switch/router is a D-Link DI-624. The hub is a NetGear DS104.

The EQ PC is a P4 3.06GHz with 1GB of RAM. The ShowEQ PC is a laptop that I have setup dual booting with a P4 3.2GHz and 1GB of RAM. The network card for the laptop is an SiS 900 based PCI Fast Ethernet built into the laptop.

I never had the problem until the 1/26 patch. Until then, ShowEQ has run flawlessly for me for hours on end.

Dedpoet
02-18-2005, 08:02 AM
My GiveUp was only at 384. Patched to the newest revision and bumped GiveUp to 1024 and didn't have any issues last night, even zoning in and out of PoK constantly for several minutes and messing around in Natimbi, Barindu, and Kod`Taz (which have all been problematic for me). I did get a random seg fault while standing around in the Guild Hall doing nothing, but couldn't recreate it and didn't feel like gdb'ing it, as it was late. The only information on the console was a couple of messages for spells wearing off. If it happens again, I'll debug. Either the increased GiveUp value is helping me out, or I was very lucky last night. Thanks for all of your hard work, purple.

purple
02-18-2005, 08:43 AM
I've gotten a couple reports of that segfault and a backtrace that I don't understand. Gonna do some valgrinding today to see if it turns anything up. It's totally separate from the fragmentation buffer issue and the 64bit zlib thing. Those are the 3 issues I'm aware of with this so far.

Mikey
02-19-2005, 02:59 AM
I bumped the Arq Seq Give Up value to 1024 (was 512) and tested. I did not get the overflow, but did receive a seg fault. I ran again and attached gdb. Got this on crash:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -152065152 (LWP 9378)]
0x004bada3 in strlen () from /lib/tls/libc.so.6

Here is a back trace of the stack:

#0  0x004bada3 in strlen () from /lib/tls/libc.so.6
#1  0x00490446 in vfprintf () from /lib/tls/libc.so.6
#2  0x004ad176 in vsnprintf () from /lib/tls/libc.so.6
#3  0x08186504 in seqWarn (
    format=0x81b15b0 "SEQ: received sequenced %spacket outside the bounds of reasonableness on stream %s (%d) netopcode=%04x size=%d. Expecting seq=%04x got seq=%04x, reasonableness being %d in the future.") at diagnosticmessages.cpp:30
#4  0x08089b5b in EQPacketStream::processPacket (this=0x90632f0, packet=@0xfeefcd40, isSubpacket=false) at packetformat.h:118
#5  0x08089ffc in EQPacketStream::handlePacket (this=0x90632f0, packet=@0xfeefcd40) at packetstream.cpp:485
#6  0x0808fce8 in EQPacket::dispatchPacket (this=0x1865, packet=@0xfeefcd40) at packet.cpp:716
#7  0x0809326b in EQPacket::qt_invoke (this=0x8fdfe08, _id=-17838784, _o=0xfeefcd8e) at packet.cpp:599
#8  0x0428b3a0 in QObject::activate_signal () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#9  0x0428ba7a in QObject::activate_signal () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#10 0x045be2cd in QTimer::timeout () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#11 0x042ab0ac in QTimer::event () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#12 0x0422c849 in QApplication::internalNotify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#13 0x0422c9da in QApplication::notify () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#14 0x04220bbe in QEventLoop::activateTimers () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#15 0x041dcbde in QEventLoop::processEvents () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#16 0x04241e75 in QEventLoop::enterLoop () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#17 0x04241dce in QEventLoop::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#18 0x0422ba4b in QApplication::exec () from /usr/lib/qt-3.3/lib/libqt-mt.so.3
#19 0x080696c4 in main (argc=1, argv=0xfeeff3a0) at main.cpp:689

This was run with the original 2/15 patch file. I will try to reproduce this asap with the newest patch. In the meantime, maybe this will help.

purple
02-19-2005, 07:53 AM
Whoa, that's a good stack trace. Dumb error. New patch soon.

purple
02-19-2005, 08:07 AM
Well, of course the error is just in the warning message, so it was gonna start breaking after that anyway. Didn't want to get your hopes up!

purple
02-19-2005, 08:28 AM
That new patch is up.

Thinking about it, that message is what will happen when it is still going to break. Instead of giving up on missed packets, it will keep holding out for them while the sequenced packets keep rolling in. Once you start seeing packets more than 1024 in the future, it drops them, so if arqSeqGiveUp is more than 1024, it will just never give up: instead of going into the cache, the future packets are dropped. That's a bit silly I guess.

QuerySEQ
02-19-2005, 09:00 AM
Just an FYI, I set my Network Options to Real Time Thread and dropped my ArqSeq give up to 488, and got much better processing speed.

But, I'm on fiber.

Question: is 512 in that just for packet delay checking? Or is there a reason it is at 512?

i.e. packet size is 512... does the arq/seq give up correlate to packet size???

purple
02-19-2005, 10:25 AM
No. arqSeqGiveUp is the number of sequenced packets that must be seen inside a certain window (right now this window is 1024 sequence numbers) before giving up on ever seeing the expected sequenced packet.

So let's say you zone in, and the server sends you 80 packets of character profile, then 210 packets of items, then 300 packets of zone spawns, then 80 packets of your guild. Every one of those packets gets an arq sequence number and each packet is only processed when all before it have been processed. So packet #1 is processed, then #2, then #3. If the packets come in as #1, #4, #5, #6, then #2, then 4-6 will be waiting in a cache and not processed until we see 2 and 3. arqSeqGiveUp sets the size the cache can reach before we give up on ever seeing the arq sequenced packet we are expecting next.

If we never see #3, things will stop processing until the cache has arqSeqGiveUp packets in it. Note that giving up will potentially hose things, if the sequenced packet it was waiting on that never arrived is important to finishing a fragment. Honestly, I'm not sure it even makes sense to ever give up, because there is no real way to recover from missing a packet. But the old seq network code did it, so I left it in there. Then again, the old network protocol tells you when a specific fragment is starting a new oversized packet, something that isn't known in the new network protocol.

So setting arqSeqGiveUp to 512 is like saying "I'm willing to cache 512 future packets in case the stream gets out of order, but once I have 512 packets waiting in the cache, I'm gonna assume I'm never gonna see the one I'm waiting on ever again."

Mikey
02-20-2005, 05:58 AM
I think I found it!!! It seems that my packets are coming in too fast and the packet capture library is dropping them. I'm just not sure how to fix this. I modified my version of ShowEQ to print out the packet capture stats when the application requests a packet from the packet cache, like so:



uint16_t PacketCaptureThread::getPacket(unsigned char *buff)
{
    uint16_t ret = 0;
    struct packetCache *pc = NULL;

    // Pop the oldest packet off the cache under the mutex.
    pthread_mutex_lock(&m_pcache_mutex);
    pc = m_pcache_first;
    if (pc)
    {
        m_pcache_first = pc->next;
        if (!m_pcache_first)
            m_pcache_last = NULL;
    }
    pthread_mutex_unlock(&m_pcache_mutex);

    if (pc)
    {
        // Debugging addition: dump the libpcap capture/drop counters
        // every time a packet is handed to the application.
        printf("Returning Packet\n");
        struct pcap_stat pcs;
        pcap_stats(m_pcache_pcap, &pcs);
        printf("Cap:%d Drop:%d IfDrop:%d\n",
               pcs.ps_recv, pcs.ps_drop, pcs.ps_ifdrop);

        ret = pc->len;
        memcpy(buff, pc->data, ret);
        free(pc);
    }
    return ret;
}


And I found this in the console output during the zoning packet hammering:

Returning Packet
Cap:428 Drop:60 IfDrop:0

That will definitely cause the issue I have. Any ideas on how to fix it?

Once the site for pcap/tcpdump is back up (it's currently not working), I'm going to make sure I have the latest pcap libraries as a start.

purple
02-20-2005, 07:04 AM
Thanks for doing that work, Mikey. I've read some bad things about linux and libpcap, so that doesn't surprise me. What version of linux are you running? Hopefully we can set some flags or kernel params somewhere and work this out.

Mikey
02-20-2005, 03:57 PM
I'm running Fedora Core 3. I checked the version of libpcap and it is current.

purple
02-20-2005, 04:12 PM
Kernel version I mean. What does uname -a say?

Mikey
02-20-2005, 05:05 PM
Kernel is 2.6.9-1.667smp

Mikey
02-20-2005, 07:16 PM
I think I may have fixed it. I added the following to /etc/sysctl.conf

net.core.rmem_default=8388608
net.core.rmem_max=8388608
net.ipv4.tcp_rmem= 4096 87380 8388608

These increase the default size of the receive buffer for sockets.

It seems to have fixed it. I'm going to test it more thoroughly and see how it goes.
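For anyone trying Mikey's settings: the standard way to apply /etc/sysctl.conf changes without rebooting is shown below. This is the generic sysctl workflow, not something specific to ShowEQ; run it as root.

```shell
# Re-read /etc/sysctl.conf and apply the new values.
sysctl -p

# Read the values back to confirm they took effect.
sysctl net.core.rmem_default net.core.rmem_max
```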

Mikey
02-21-2005, 02:25 PM
Used ShowEQ most of last night and didn't have any problems. For now, I think we will call it fixed.

EMT99a
03-03-2005, 09:40 PM
I added that to my /etc/sysctl.conf and although SEQ stabilized, it started lagging really badly... any thoughts???

Acid1789
03-04-2005, 04:36 PM
You could possibly skip fragmented packets (although the real problem seems to be actually getting the missed packets). If you have the start of the packet fragment, you know how large the entire buffer is. You also more than likely have future packets after the packet you are missing. You could probably take a pretty good guess as to which packets belong to the fragment and then skip the entire fragment.

As long as they don't implement the EQ2 encryption in EQ1, that would work. Unfortunately you can't skip anything in EQ2.

purple
03-04-2005, 04:45 PM
What kernel do you use and how much ram do you have? You might just need to find a happy medium between the stock socket receive buffer default/max sizes and that ginormous one Mikey sets.

purple
03-04-2005, 04:48 PM
If all you get are 00 0d packets once a fragment starts, sure. But there's nothing that stops other sequenced packets from happening in the middle, and nothing that stops you from missing those too. You could look for a smaller-than-max-size fragment and assume it is the last one, but I don't like making that assumption either.

If packets are being missed, there's a bigger problem afoot that needs to be addressed.

Acid1789
03-04-2005, 05:18 PM
I agree the bigger problem needs to be solved. I was merely suggesting you could skip the missing fragment piece if you need to (i.e. pcap won't always capture on certain machines).

Even if you got a continuous stream of fragmented packets (0x000D), you could determine where the next fragment(s) started. Unless you were missing the start of that fragment chain as well. You don't really have to assume a whole lot here; you just have to step back a bit, look at the entire stream you do have, and figure out which piece(s) of the puzzle you are missing. You could then continue processing the pieces you do have and save the pieces you don't have until later. Or you could just completely skip the pieces you don't have.

purple
03-04-2005, 05:45 PM
Not to go all Quackrabbit on you, but I disagree.

There's nothing that signals "hey, I am the first fragment of a set." The only thing you have to go on is that the previous fragment completed the previous set (or the session re-initialized). You could make guesses based on the data by checking for a sane buffer size/opcode pair, but it is still possible (though exceedingly unlikely) for those to coincidentally line up.

The only other thing you have to go on is that the last fragment of the set is usually smaller because the oversized payload size usually doesn't work out to be 501+505*x. So the last fragment has a smaller data payload in it in order to complete the oversized payload. But that isn't necessarily always true. There's also nothing that guarantees each intermediate fragment is the maximum possible size either, even though it would be silly for it not to be.

Without making assumptions, I can't see any way to deal with missing packets. Each individual 00 0d in a larger oversized payload just doesn't have enough information to key off of to tell you anything. This works fine for the client because it can 00 11 to get retransmits. But being a passive sniffer, we just have to roll with what we do see.

Mikey
03-04-2005, 06:00 PM
From what I've seen of the packets that are being missed, even if you could determine that you have an incomplete fragmented packet and skip it, ShowEQ wouldn't work any better than it does when it crashes. The packets are being skipped during zoning when the information about the current spawns in the zone is being sent. If you skip information about the zone's current spawns, you won't get your skittles and may as well just close ShowEQ.

While skipping the incomplete data (if we could) will make it so that ShowEQ won't crash, it won't get ShowEQ working much better.

Acid1789
03-04-2005, 06:37 PM
Given:

...
Packet12 = Fragment Start, buffer size: 4000, data: 996
Packet13 = Fragment Piece, data: 500
Packet14 = missing
Packet15 = Fragment Piece, data: 500
Packet16 = missing
Packet17 = Fragment Start, buffer size: 1000, data: 400
Packet18 = Fragment Piece, data: 400
Packet19 = Fragment Piece, data: 200

In the given example, you could determine there is a fragmented buffer following the missing pieces by examining each data and matching it with following packets. The probability of getting a false positive here is very slim (and catchable with decompression).
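The heuristic being proposed here could be sketched as follows: treat a candidate packet's first 4 bytes as a total-payload length, and check whether the data sizes of that packet and the ones after it add up to exactly that total. This is illustrative guesswork, not ShowEQ code, and as the replies note, payload bytes in the middle of a set can coincidentally look like a plausible length.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// sizes[i] = data bytes in the candidate packet (index 0) and the packets
// following it; candidateTotal = value read from the candidate's first
// 4 bytes, interpreted as a fragment-set total length.
bool looksLikeFragmentStart(uint32_t candidateTotal,
                            const std::vector<uint32_t>& sizes)
{
    uint64_t sum = 0;
    for (uint32_t s : sizes)
    {
        sum += s;
        if (sum == candidateTotal)
            return true;  // sizes complete the set exactly
        if (sum > candidateTotal)
            return false; // overshoot: can't be a fragment start
    }
    return false; // ran out of packets before completing the set
}
```

Against the example above, Packet17 checks out (400 + 400 + 200 = 1000), while Packet12's set never completes because of the missing pieces.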

And to Mikey's point: yes, it really does no good to skip packets, since that data is the whole point of showeq. But what if the data is guild data or something else you wouldn't necessarily care about? I'm not suggesting this as a good solution; I'm merely pointing out that it is possible to skip packets in the middle of a fragment with EQ1.

Also, the wait limit implemented in seq may be a large piece of the problem as well. If the client receives a future packet out of order, it sends a 0x0011 - 0x0014 packet (depending on the channel it's waiting on) to the server. The server may be (and usually is) in the middle of sending other packets. The requested packets won't get sent until the next update of your connection. If seq gives up on finding this packet before the server gets around to sending it again, seq fails. The real solution should be to never give up on any packets. If the underlying sniffer or network architecture is failing to deliver packets, then that problem needs to be addressed separately. I don't believe you can fix this in seq. The Sony channels (0x0009 - 0x000A) are reliable socket channels, meaning these packets are guaranteed to be delivered via the ack/seq scheme that is in place. Therefore, if you are missing 'reliable' data, the problem in the middle needs to be addressed.

purple
03-04-2005, 06:58 PM
That's why arqSeqGiveUp is configurable. Not ever giving up on packets is unrealistic. It is possible for a passive sniffer to miss packets under heavy load. The best solution I could come up with would be to detect a problem and drop packets until a new session response/request is made, since once you miss a packet, you might be screwed on that stream until the next session. But I haven't bothered to implement that.

All that aside, given a packet, how do you tell it is a fragment start? It's easy to type an example, but there's no way to tell whether a given 00 0d packet is the first in a fragment unless it is the first seen since a session request/response, or unless the fragment before it has been completed. You have no idea what was missed or how big it was. I feel like I'm repeating myself, but the only clues you can possibly have are watching for smaller-than-505 payloads or assuming each packet is the first and checking the buffer size/opcode against known buffer size/opcodes. Neither of those is appealing.

EQ1 doesn't compress at the application level, only at the protocol level, so you won't get an inflate problem on the entire oversized buffer.

Acid1789
03-04-2005, 07:30 PM
You can detect a fragment start by reading the first 4 bytes of the fragmented packet to get the total length of the set. If the packets following it add up to a complete fragment set, then it is a fragment start packet.


Not ever giving up on packets is unrealistic

That's not the case; EQ2 demands that you have every packet due to the encryption scheme. It shouldn't be possible for a passive sniffer to miss a packet; if it is, there is a bug in the implementation of the software or the hardware.

As a solution to the problem, maybe you should adjust the default arqSeqGiveUp to something high like you mentioned above (or whatever the real data stream warrants).

If you can't live without that data anyway, you may as well wait for it for a while. Chances are, you will get it eventually.

purple
03-04-2005, 07:58 PM
Bits is bits. There's no way to tell a payload length from payload data that belongs in the middle of the oversized payload. It's very improbable that it lines up right, but it is possible. Even with that aside, it seems like a lot of processing time to keep trying to detect fragment starts like that. So either you add overhead for every 00 0d you get, or you add a lot of overhead when you decide you want to give up on ever seeing a specific seq and analyze your entire cache trying to see whether it is possible to give up without trashing the stream.

The default arqSeqGiveUp is 512, which isn't that bad. And the problem in this thread is that they aren't getting the packets. I looked through a huge log from Mikey and the packets were missing.

And missing packets while sniffing is a distinct possibility, especially on Linux, where for some reason pcap is kinda crappy compared to BSD, for example.

quackrabbit
03-05-2005, 08:31 AM
Not to go all Quackrabbit on you, but I disagree.
WTF??!!