Delay of incoming network packet on Linux - How to analyse? - udp

The problem is: sometimes tcpdump shows that an incoming UDP packet is held back until the next UDP packet arrives, although the network tap device shows it going through the cable without any delay.
Scenario: My Profinet stack on Linux (running in user space) has a cyclic connection on which it receives and sends Profinet protocol packets every 4 ms (via raw sockets). About every 30 ms it also receives UDP packets in another thread on a UDP socket and replies to them immediately, as that protocol requires. CPU load is around 10%. Sometimes such a received UDP packet seems to be stuck in the network driver. After 2 seconds the next UDP packet comes in, and then both the delayed packet and the new one are received. There are no dropped packets.
My measurement setup:
I use tcpdump -i eth0 --time-stamp-precision=nano --time-stamp-type=adapter_unsynced -w /tmp/tcpdump.pcap to record the UDP traffic to a RAM disk file.
At the same time I use a network tap device to record the traffic.
Questions:
1. How can I find out where the delay comes from (or is it a known effect)?
2. What does the timestamp that tcpdump attaches to each packet tell me? To which OSI layer does it refer, in other words: when exactly is it taken?
Topology: "embedded device with Linux and eth0" <---> tap device <---> PLC. The program tcpdump runs on the embedded device. The tap device listens on the cable. The actual Profinet connection is between the PLC and the embedded device. A PC is connected to the tap device to record what the tap hears.
Wireshark (via tap and via tcpdump): the effect is visible at packet no. 3189 in tcpdump.pcap.

It was a bug in the Freescale Fast Ethernet Controller driver (fec_main.c), which NXP's awesome support has now fixed.
The actual answer (to the question "How to find out where the delay comes from?") is: build a Linux kernel with tracing enabled, patch the driver code with trace points, and then analyse the trace with the Linux developer tool trace-cmd. It is quite involved, but I'm very happy it is fixed now:
trace-cmd record -o /tmp/trace.dat -p function -l fec_enet_interrupt -l fec_enet_rx_napi -e 'fec:fec_rx_tp' tcpdump -i eth0 --time-stamp-precision=nano --time-stamp-type=adapter_unsynced -w /tmp/tcpdump.pcap
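Regarding question 2: with the default host timestamp type, the timestamp is taken by the kernel when the packet reaches the capture path, i.e. after the driver has handed it to the stack; with --time-stamp-type=adapter_unsynced (as used above) it comes from the NIC itself, on its own unsynchronised clock.
A complementary user-space check: Linux can attach a kernel receive timestamp to every datagram via SO_TIMESTAMPNS, and comparing it with the time recvmsg() returns shows whether a packet sat between kernel and application or was already late when the kernel first saw it. A minimal sketch (Linux-specific; the port number is made up):

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Bind a UDP socket, ask the kernel to attach a receive timestamp
 * (SO_TIMESTAMPNS) to every datagram, and print how long each datagram
 * sat between the kernel timestamp and the moment the application got
 * it back from recvmsg(). Linux-specific. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(34964);                 /* example port, adjust as needed */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPNS, &on, sizeof(on));

    for (;;) {
        char data[2048], ctrl[256];
        struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
        struct msghdr msg = { 0 };
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl;
        msg.msg_controllen = sizeof(ctrl);

        ssize_t n = recvmsg(fd, &msg, 0);
        if (n < 0)
            continue;

        struct timespec now, krx = { 0 };
        clock_gettime(CLOCK_REALTIME, &now);      /* SO_TIMESTAMPNS uses the realtime clock */
        for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
            if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_TIMESTAMPNS)
                memcpy(&krx, CMSG_DATA(c), sizeof(krx));

        double delta_us = (now.tv_sec - krx.tv_sec) * 1e6 +
                          (now.tv_nsec - krx.tv_nsec) / 1e3;
        printf("%zd bytes, kernel-to-application delay %.1f us\n", n, delta_us);
    }
}

If the kernel-to-application delay stays small while the tap shows the packet arriving much earlier, the hold-up is below the socket layer, which is exactly what the kernel trace above then pins down.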

Related

Fragmented UDP packet loss?

We have an application doing UDP broadcast.
The packet size is mostly larger than the MTU, so the packets get fragmented.
tcpdump says the packets are all being received, but the application doesn't get them all.
The whole thing doesn't happen at all if the MTU is set large enough that there is no fragmentation. (This is our workaround right now - but Germans don't like workarounds.)
So it looks like fragmentation is the problem.
But I am not able to understand why and where the packets get lost.
The app developers say they can see the packet loss right at the socket they read from, so their application isn't losing the packets.
My questions are:
Where in the Linux stack does tcpdump monitor the device?
Are the packets already reassembled at that point, or is reassembly done later?
How can I debug this issue further?
tcpdump uses libpcap, which gets copies of packets very early in the Linux network stack. IP fragment reassembly happens after libpcap has seen the individual fragments (and therefore after tcpdump). Save the pcap and view it with Wireshark; it has better analysis features and will help you find any missing IP fragments (if there are any).
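If you suspect reassembly itself is failing (fragments dropped, or the reassembly timer expiring), the kernel's IP statistics will show it. A minimal sketch, assuming a Linux /proc filesystem, that prints the two "Ip:" lines of /proc/net/snmp so you can watch the ReasmReqds/ReasmOKs/ReasmFails columns while the application runs:

#include <stdio.h>
#include <string.h>

/* Print the "Ip:" header and value lines of /proc/net/snmp.
 * The ReasmReqds/ReasmOKs/ReasmFails columns show how IP fragment
 * reassembly is doing on this host. */
int main(void)
{
    FILE *f = fopen("/proc/net/snmp", "r");
    if (!f) {
        perror("fopen /proc/net/snmp");
        return 1;
    }
    char line[1024];
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "Ip:", 3) == 0)
            fputs(line, stdout);    /* first the column names, then the counters */
    fclose(f);
    return 0;
}

Run it before and after a test transfer; a rising ReasmFails count means the kernel is discarding fragments before the application ever sees the datagram.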

What happens between receiving network data on the ethernet port and apache2 doing something?

This one is kind of a vague question, because my own understanding is about as vague. I'm interested in what needs to happen for sporadic voltages on the network cables to cause a program running on your computer to do something.
Say I'm running apache2 on my webserver. Somebody triggers the correct sequence of events on their own internet-connected computer, which results in network data arriving at the server. Then what?
My guess is that there is some peripheral component on the motherboard which listens to the data, which then raises an interrupt in the CPU. Somehow, in the interrupt service routine, Linux must ask the apache2 code to do something. Is this correct? If so, would anyone be willing to share a few extra details?
Thanks
I'll outline what happens from the bottom up, making references to code wherever possible.
Layer 1 (PHY)
Ethernet card (NIC) receives and decodes the signal on the wire, and pushes it into a shift register
See Ethernet over twisted-pair for the line-code details of each *BASE-T Ethernet variant
When a full Ethernet frame has been received, it is placed into a receive (RX) queue in hardware
The NIC raises an interrupt, using a bus-specific mechanism (either a PCI IRQ line or a message-signaled interrupt)
Interrupt controller (APIC) receives interrupt and directs it to a CPU
CPU saves running context and switches to interrupt context
CPU loads interrupt handler vector and begins executing it
IRQs can be shared by multiple devices, so the kernel has to figure out which device is actually interrupting. I'll refer to the e100.c driver, as it is implemented in a single C file and is well commented.
The Linux kernel looks at all devices that share this IRQ, calling each one's driver to determine which device actually raised the interrupt. The driver function called is whatever the driver passed to request_irq(). (See for_each_action_of_desc() in __handle_irq_event_percpu().)
Each driver sharing this IRQ looks at its device's status register to see whether it has an interrupt pending
The NIC driver's interrupt handler (e.g. e100_intr()) sees that the NIC indeed interrupted. It disables the device interrupt (e.g. e100_disable_irq()) and schedules a NAPI callback (__napi_schedule()). The NIC driver "claims" the interrupt by returning IRQ_HANDLED. The interrupt ends.
The Linux kernel NAPI subsystem later calls back into the NIC driver (e.g. e100_poll()), which reads packets from the NIC RX queue, puts each into a struct sk_buff (SKB), and pushes it into the kernel network stack (e.g. e100_rx_indicate())
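The shared-IRQ "claiming" described in the last few steps follows the same pattern in every driver. A minimal sketch of that pattern as a toy kernel module (the IRQ number is made up and there is no real hardware behind it, so the handler always answers "not mine"):

#include <linux/module.h>
#include <linux/interrupt.h>

/* Minimal sketch of the shared-IRQ pattern: each handler registered
 * with IRQF_SHARED must check whether its own device is interrupting
 * and return IRQ_NONE if not, IRQ_HANDLED if it serviced it.
 * IRQ_NUM is a made-up example; a real driver gets it from the bus. */
#define IRQ_NUM 19

static int token;   /* unique cookie required when sharing an IRQ line */

static irqreturn_t demo_intr(int irq, void *dev_id)
{
    /* A real NIC driver (cf. e100_intr()) reads its status register
     * here, disables the device interrupt and schedules NAPI. We own
     * no hardware, so we always say "not mine". */
    return IRQ_NONE;
}

static int __init demo_init(void)
{
    return request_irq(IRQ_NUM, demo_intr, IRQF_SHARED, "irq-demo", &token);
}

static void __exit demo_exit(void)
{
    free_irq(IRQ_NUM, &token);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Build it like any out-of-tree module and load it with insmod; if the chosen IRQ line does not exist or is not shareable, request_irq() simply fails and the module does not load.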
The whole TCP/IP stack is implemented in the Linux kernel for performance reasons:
Layer 2 (MAC)
The kernel Ethernet layer looks at the Ethernet frame and verifies that it is destined for this machine's MAC address
The Ethernet layer sees Ethertype == IP and hands the packet to the IP layer
Note: the protocol field is actually set by the device driver (e.g. in e100_rx_indicate()).
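To see the same fields from user space, a raw AF_PACKET socket delivers complete frames including the Ethernet header. A minimal sketch (needs root / CAP_NET_RAW; my own illustration, not driver code) that prints the EtherType the kernel uses for dispatch:

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>

/* Receive raw Ethernet frames and print their EtherType, roughly the
 * field the kernel's Ethernet layer inspects before dispatching to
 * the IP layer. Needs CAP_NET_RAW (root). */
int main(void)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) {
        perror("socket(AF_PACKET)");
        return 1;
    }
    for (;;) {
        unsigned char frame[2048];
        ssize_t n = recv(fd, frame, sizeof(frame), 0);
        if (n < (ssize_t)sizeof(struct ethhdr))
            continue;
        struct ethhdr eh;
        memcpy(&eh, frame, sizeof(eh));   /* dest MAC, src MAC, EtherType */
        if (ntohs(eh.h_proto) == ETH_P_IP)
            printf("IPv4 frame, %zd bytes\n", n);
        else
            printf("EtherType 0x%04x, %zd bytes\n", ntohs(eh.h_proto), n);
    }
}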
Layer 3 (IP)
Kernel IP layer receives packet (ip_rcv())
Kernel IP layer queues up all IP fragments
When all IP fragments have been received, it reassembles and processes the IP packet. It looks at the protocol field, sees that it is TCP, and hands it to the TCP layer
Layer 4 (TCP)
Kernel TCP layer receives packet (tcp_v4_rcv()).
Kernel TCP layer looks at src/dst IP/port and matches it up with an open TCP connection (socket) (tcp_v4_rcv() calls __inet_lookup_skb()).
If it is a SYN packet (new connection):
TCP will see that there is a listening socket open for port 80
TCP creates a new connection object for this new connection
Kernel wakes up the task that is sleeping, blocked on an accept call - or select
If it is not a SYN packet (there is data):
Kernel queues up the TCP data from this segment on the socket
Kernel wakes up a task that is asleep, blocked on a recv call - or select (sock_def_readable())
Layer 5 (Application - HTTP)
Apache (httpd) will wake up, depending on the system call it is blocked on:
accept() returns when a new child connection is available (this is handled with a wrapper called apr_socket_accept())
recv() returns when a socket has new data, which has been read into a userspace buffer
Apache processes the buffer, parsing HTTP protocol strings
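For comparison with the Apache description above, here is the server side reduced to its bare system calls; a minimal sketch of a blocking TCP server (the port number is made up), where accept() and recv() are exactly the calls the kernel wakes the task up from in Layer 4:

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

/* Minimal blocking TCP server: the process sleeps in accept() until
 * the kernel has completed a new connection, then sleeps in recv()
 * until the kernel has queued data on that socket. */
int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);            /* example port; not 80, to avoid needing root */

    int on = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(lfd, 16) < 0) {
        perror("bind/listen");
        return 1;
    }
    for (;;) {
        int cfd = accept(lfd, NULL, NULL);  /* woken by the kernel on a completed handshake */
        if (cfd < 0)
            continue;
        char buf[4096];
        ssize_t n = recv(cfd, buf, sizeof(buf), 0);  /* woken when TCP data is queued */
        if (n > 0)
            printf("received %zd bytes\n", n);
        close(cfd);
    }
}

Apache adds process/thread management and the APR wrappers mentioned above around this, but the blocking points are the same system calls.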
Additional Resources
Linux networking stack from the ground up

Inaccurate packet counter in OpenvSwitch

I attempted to send a file from host A to host B and measure the packet loss using OpenvSwitch. I connected hosts A and B to separate OpenvSwitch VMs and connected the two OpenvSwitch VMs to each other. The topology looks like this:
A -- OVS_A -- OVS_B -- B
On each OpenvSwitch VM, I added two very simple flows using the commands below:
ovs-ofctl add-flow br0 in_port=1,actions=output:2
ovs-ofctl add-flow br0 in_port=2,actions=output:1
Then I sent a 10 GB file between A and B and compared the packet counts of the egress flow on the sending switch and the ingress flow on the receiving switch. I found that the packet count on the receiving switch is much larger than the count on the sending switch, indicating that more packets were received than sent!
I tried matching more specific flows, e.g. a TCP flow from IP A.A.A.A to B.B.B.B on port C, and got the same result. Is there anything wrong with my settings? Or is this a known bug in OpenvSwitch? Any ideas?
BTW, is there any other way to passively measure the packet loss rate? Meaning measuring the loss rate without introducing any intrusive test flows, but simply using statistics available at the sending/receiving ends or on the switches.
Thanks in advance!
I just realized that it was not Open vSwitch's fault. I tested with a UDP stream and the packet count was correct. I also used tcpdump to capture inbound TCP packets on the switches, and the switch at the receiving end saw more packets than the one at the sending end. This is consistent with what Open vSwitch's flow counters reported. I guess I must have missed something important about TCP.

icmp packets appearing with udp packets

I am sending UDP packets from one PC to another and observing the traffic in Wireshark on the PC where I receive the UDP packets. One interesting thing I see is ICMP packets appearing suddenly; then they disappear and appear again in a cyclic manner. What can be the reason for this? Am I doing something wrong? And what bad effects can it have on my UDP reception performance?
Please also see the attached Wireshark capture taken from the destination PC.
The ICMP packets are created by the other host because the UDP port is not open there (ICMP "port unreachable"). The ICMP packet includes the first X bytes of the packet that was dropped, so the sender can read out which session was affected.
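On the sending host you can see the same thing without Wireshark: on Linux, a connected UDP socket reports the ICMP error as ECONNREFUSED on a later send or recv. A minimal sketch (address and port are made up; the port is assumed to be closed on the peer):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Send a datagram to a (presumably closed) UDP port on the peer.
 * If the peer answers with ICMP port unreachable, Linux queues the
 * error on the connected socket and a later recv() fails with
 * ECONNREFUSED. */
int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port = htons(50000);                       /* example port, assumed closed */
    inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);   /* example address (TEST-NET-1) */

    struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };  /* bound the recv() below */
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }
    send(fd, "ping", 4, 0);
    sleep(1);                                           /* give the ICMP time to come back */

    char buf[64];
    if (recv(fd, buf, sizeof(buf), 0) < 0 && errno == ECONNREFUSED)
        printf("got ECONNREFUSED: the peer sent ICMP port unreachable\n");
    else
        printf("no ICMP error reported\n");
    close(fd);
    return 0;
}

Note that an unconnected UDP socket does not get this error reported unless IP_RECVERR is enabled, which is why a plain sendto() appears to succeed.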

Why is SNMP usually run over UDP and not TCP/IP?

This morning, there were big problems at work because an SNMP trap didn't "go through" because SNMP is run over UDP. I remember from the networking class in college that UDP isn't guaranteed delivery like TCP/IP. And Wikipedia says that SNMP can be run over TCP/IP, but UDP is more common.
I get that some of the advantages of UDP over TCP/IP are speed, broadcasting, and multicasting. But it seems to me that guaranteed delivery is more important for network monitoring than broadcasting ability. Particularly when there are serious high-security needs. One of my coworkers told me that UDP packets are the first to be dropped when traffic gets heavy. That is yet another reason to prefer TCP/IP over UDP for network monitoring (IMO).
So why does SNMP use UDP? I can't figure it out and can't find a good reason on Google either.
UDP is actually expected to work better than TCP in lossy (or congested) networks. TCP is far better at transferring large quantities of data, but when the network fails, it's more likely that UDP will get through. (In fact, I recently did a study testing this, and it found that SNMP over UDP succeeded far better than SNMP over TCP in lossy networks when the UDP timeout was set properly.) Generally, TCP starts behaving poorly at about 5% packet loss, becomes completely useless at around 33%, and UDP will still succeed (eventually).
So the right thing to do, as always, is pick the right tool for the right job. If you're doing routine monitoring of lots of data, you might consider TCP. But be prepared to fall back to UDP for fixing problems. Most stacks these days can actually use both TCP and UDP.
As for sending TRAPs: yes, TRAPs are unreliable because they're not acknowledged. However, SNMP INFORMs are an acknowledged version of an SNMP TRAP. Thus, if you want to know that the notification receiver got the message, use INFORMs. Note that TCP does not solve this problem, as it only provides a transport-level acknowledgement that the message was received; there is no assurance that the notification receiver application actually got it. SNMP INFORMs do application-level acknowledgement and are much more trustworthy than assuming a TCP ACK means the receiver got the message.
If systems sent SNMP traps via TCP they could block waiting for the packets to be ACKed if there was a problem getting the traffic to the receiver. If a lot of traps were generated, it could use up the available sockets on the system and the system would lock up. With UDP that is not an issue because it is stateless. A similar problem took out BitBucket in January although it was syslog protocol rather than SNMP--basically, they were inadvertently using syslog over TCP due to a configuration error, the syslog server went down, and all of the servers locked up waiting for the syslog server to ACK their packets. If SNMP traps were sent over TCP, a similar problem could occur.
http://blog.bitbucket.org/2012/01/12/follow-up-on-our-downtime-last-week/
Check out O'Reilly's writings on SNMP: https://library.oreilly.com/book/9780596008406/essential-snmp/18.xhtml
One advantage of using UDP for SNMP traps is that you can direct UDP to a broadcast address, and then field them with multiple management stations on that subnet.
The use of traps with SNMP is considered unreliable. You really should not be relying on traps.
SNMP was designed to be used as a request/response protocol. The protocol details are simple (hence the name, "simple network management protocol"). And UDP is a very simple transport. Try implementing TCP on your basic agent - it's considerably more complex than a simple agent coded using UDP.
SNMP get/getnext operations have a retry mechanism: if a response is not received within the timeout, the same request is resent, up to a maximum number of tries.
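The same timeout-and-retry loop is easy to sketch with a plain UDP socket (this is not code from any SNMP library; address, port and payload are made up):

#include <stdio.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Send a request over UDP and wait for a reply; on timeout, resend
 * the same request up to MAX_TRIES times. This is the shape of an
 * SNMP get/getnext retry loop. */
#define MAX_TRIES 3

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in srv = { 0 };
    srv.sin_family = AF_INET;
    srv.sin_port = htons(16100);                        /* example port */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);     /* example agent address */

    struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };  /* per-try timeout */
    setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    const char req[] = "request";
    char resp[1500];
    for (int attempt = 1; attempt <= MAX_TRIES; attempt++) {
        sendto(fd, req, sizeof(req), 0, (struct sockaddr *)&srv, sizeof(srv));
        ssize_t n = recvfrom(fd, resp, sizeof(resp), 0, NULL, NULL);
        if (n >= 0) {
            printf("got %zd-byte reply on attempt %d\n", n, attempt);
            return 0;
        }
        fprintf(stderr, "attempt %d timed out, retrying\n", attempt);
    }
    fprintf(stderr, "no reply after %d tries\n", MAX_TRIES);
    return 1;
}

An SNMP manager does essentially the same thing, except the payload is a BER-encoded PDU and the request ID inside the PDU is what matches retries to responses.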
Usually, when you're doing SNMP, you're on a company network; you're not doing this over the long haul. UDP can be more efficient. Let's look at (a gross oversimplification of) the conversation via TCP, then via UDP...
TCP version:
client sends SYN to server
server sends SYN/ACK to client
client sends ACK to server - socket is now established
client sends DATA to server
server sends ACK to client
server sends RESPONSE to client
client sends ACK to server
client sends FIN to server
server sends FIN/ACK to client
client sends ACK to server - socket is torn down
UDP version:
client sends request to server
server sends response to client
Generally, the UDP version succeeds since it's on the same subnet, or not far away (i.e. on the company network).
However, if there is a problem with either the initial request or the response, it's up to the app to decide: A. Can we get by with a missed packet? If so, who cares, just move on. B. Do we need to make sure the message is delivered? Simple, just redo the whole thing: client sends request to server, server sends response to client. The application can include a sequence number so that, if the recipient happens to receive both messages, it knows it's really the same message being sent again.
This same technique is why DNS is done over UDP: it's much lighter weight, and it generally works the first time because you are supposed to be near your DNS resolver.