size of ICMP type 11 packet payload - size

What's the size of the ICMP packet payload when the type is 11, i.e. time exceeded?
Since it contains an IP header and the first 8 Bytes of the IP packet payload generating the ICMP message, I thought its size was 20 + 8 = 28.
I'm replaying some common user traffic with TTL=1. In the ICMP messages I have dumped I noticed that:
all ICMP packets generated by UDP packets have payload of size 28 Bytes
all those generated by TCP packets have payload of size 40 Bytes
Since I need to match ICMP time-exceeded messages with the packets that triggered them by comparing those bytes, this piece of information is essential, but I can't find figure out why this happens.

The problem is that you're quoting the 8-byte header payload from RFC 792, Page 4, but the requirements were changed by RFC 1812...
Time Exceeded Message (in RFC 792)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unused |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Internet Header + 64 bits of Original Data Datagram |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RFC 1812, Section 4.3.2.3 dramatically increases the allowable payload in an ICMP Error message (emphasis mine):
4.3.2.3 Original Message Header
Historically, every ICMP error message has included the Internet
header and at least the first 8 data bytes of the datagram that
triggered the error. This is no longer adequate, due to the use of
IP-in-IP tunneling and other technologies. Therefore, the ICMP
datagram SHOULD contain as much of the original datagram as possible
without the length of the ICMP datagram exceeding 576 bytes. The
returned IP header (and user data) MUST be identical to that which
was received, except that the router is not required to undo any
modifications to the IP header that are normally performed in
forwarding that were performed before the error was detected (e.g.,
decrementing the TTL, or updating options).
The ICMP Errors you're generating from Scapy packets should contain all the information from the IP and TCP layers of the original packet.

As you noted, the ICMP payload is the IP header plus 8 octets of the original packet's payload. IP headers, however, are not always 20 octets long; 20 is only the minimum. The IP header itself may contain options, and the header length is indicated by the value in the IHL field of the header. See sec 3.1 of RFC 791. So it looks like the TCP packets have 12 additional octets of options in their IP headers. RFC 791 defines some standard options such as source routing and timestamping. You'll have to decode the header to determine what options are being used.

I would like to add for future reference that not only do ICMP payloads vary in size as Mike said, they might also be longer than 128 Bytes in the case of ICMP extensions for MPLS. See this draft for more information

Related

Understanding iptables output for filtering packets

I am working on one sles12 system where IPTables are configured in this way:
num target prot opt source destination
1 ACCEPT tcp -- anywhere anywhere tcp dpt:bctp owner UID match dd-test-user
2 DROP tcp -- anywhere anywhere tcp dpt:bctp
3 DROP all -- anywhere instance-data.us-west-2.compute.internal owner GID match test
4 ACCEPT all -- anywhere ip-100-34-0-0.us-west-2.compute.internal/21 owner GID match test
Can someone please help me understand this?
With IPTable rule 2, All packets will be dropped?
What does dpt:bctp mean here? I could not find anything about it manual.
Does Rule 4 even get chance to be applied for the process running from group id of "test" group?
I tried searching online documentation of iptables, but I could not find answer.
I found out that bctp stands for one of the protocol defined on the system
cat /etc/services | grep bctp
bctp 8999/tcp # Brodos Crypto Trade Protocol [Alexander_Sahler]
bctp 8999/udp # Brodos Crypto Trade Protocol [Alexander_Sahler]
Rule 2 is applied when destination port (dpt) is this protocol (8999/tcp). Rule 3 and 4 are applied for rest of ports from the process belonging to users from the group "test".

How do I target the MSS value in a TCP packet using BPF

I am learning BPF and converting some iptables rules to BPF bytcode. I am primarily using the nfbpf_compile application to do this, rather than trying to write C or Assembler. I am having a lot of luck but the syntax of one rule is escaping me.
I'd like to drop packets with the syn flag set that is also missing an MSS value. In iptables the MSS is targetted with --tcp-option 2. I know that MSS is in the TCP options that start at byte 22 of the TCP packet, and MSS is 'kind' 2. I am able to filter the MSS by using tcp[22:2]==$NUMBER in BPF syntax. However, what I want to do is target SYN packets where the MSS is missing entirely.
I have tried every variant of "null" I can think of but am having no luck.
Does anyone know the equivalent of iptables ! --tcp-option 2 in BPF syntax?
An example of something I have tried:
$ ./nfbpf_compile RAW 'tcp[22:2]==0x0' (I know this won't work..it's an example)
12,48 0 0 0,84 0 0 240,21 0 8 64,48 0 0 9,21 0 6 6,40 0 0 6,69 4 0 8191,177 0 0 0,72 0 0 22,21 0 1 0,6 0 0 65535,6 0 0 0
# iptables -I INPUT -m bpf --bytecode '12,48 0 0 0,84 0 0 240,21 0 8 64,48 0 0 9,21 0 6 6,40 0 0 6,69 4 0 8191,177 0 0 0,72 0 0 22,21 0 1 0,6 0 0 65535,6 0 0 0
' -j DROP
TL;DR If you know there are only 0 or 1 TCP options, or if you know the MSS option is always the first option, then you can use the following filter:
tcp && (tcp[tcpflags] == tcp-syn) && ((((tcp[12] & 0xf0) >> 2) < 21) || tcp[20] != 2)
If you don't know this (there are several TCP options and the MSS option may be any of them), which is generally the case, I don't think it's possible to express a matching filter with nfbpf_compile's syntax. In that case, I would recommend writing a C program and loading it with -m bpf --object-pinned /path/to/pinned/bpf.
Let me explain the above filter first. You have two cases to match: 1) there is no TCP option or 2) the first TCP option is not the MSS:
tcp[12] & 0xf0 extracts the data offset field from the TCP header, i.e., the number of 32bit word in the TCP header.
(tcp[12] & 0xf0) >> 2 multiplies this by 4 to get the number of bytes.
If you have less than 21 bytes in your TCP header, then you know there are no TCP options.
tcp[20] != 2 checks that the Option-Kind field of the first TCP option (starts at offset 20) is not 2, the Option-Kind for MSS.
Why is the general case harder to match? TCP options have a variable length (depending on their Option-Kind) and there is a variable, bounded number of TCP options. Say you want to extend the above filter to match on the second TCP option. You first need to know where that option starts; the first option has a variable length so this is not a fixed offset.
With cBPF (the BPF bytecode emitted by nfbpf_compile), you might be able to express that by storing the current option offset in the X register and then loading a byte into register A with the 2nd addressing mode (see the Linux documentation, BPF engine and instruction set). However, I do not think you can do this with the limited nfbpf_compile syntax (assuming it's the same syntax as tcpdump's).

Why received ZFS dataset uses less space than original?

I have a dataset on the server1 that I want to back up to the second server2.
Server1 (original):
zfs list -o name,used,avail,refer,creation,usedds,usedsnap,origin,compression,compressratio,refcompressratio,mounted,atime,lused storage/iscsi/webhost-old produces:
NAME USED AVAIL REFER CREATION USEDDS USEDSNAP ORIGIN COMPRESS RATIO REFRATIO MOUNTED ATIME LUSED
storage/iscsi/webhost-old 67,8G 1,87T 67,8G Út kvě 31 6:54 2016 67,8G 16K - lz4 1.00x 1.00x - - 67,4G
Sending volume to the 2nd server:
zfs send storage/iscsi/webhost-old | pv | ssh -c arcfour,aes128-gcm#openssh.com root#10.0.0.2 zfs receive -Fduv pool/bkp-storage
received 69,6GB stream in 378 seconds (189MB/sec)
Server2 zfs list produces:
NAME USED AVAIL REFER CREATION USEDDS USEDSNAP ORIGIN COMPRESS RATIO REFRATIO MOUNTED ATIME LUSED
pool/bkp-storage/iscsi/webhost-old 36,1G 3,01T 36,1G Pá pro 29 10:25 2017 36,1G 0 - lz4 1.15x 1.15x - - 28,4G
Why is there such a difference in sizes? Thanks.
From what you posted, I noticed 3 things that seemed odd:
the compressratio is 1.15x on system 2, but 1.00x on system 1
on system 2, used is 1.27x higher than logicalused
the logicalused and the number zfs receive report are ~2.3x higher on system 1 than system 2
These terms are all defined in the man page, but are still confusing to reverse-engineer explanations for in practice.
(1) could happen if you enabled compression on the source dataset after you wrote all the data to it, since ZFS doesn't rewrite the data to compress it when you enable that setting. The data sent by zfs send is uncompressed unless you use -c, but system 2 will try to compress it as it runs zfs receive if the setting is enabled on the destination dataset. If both system 1 and system 2 had the same compression settings before the data was written, they would have the same compressratio as well.
(2) can happen due to metadata written along with your data, but in this case it's too high for "normal" metadata, which accounts for 1-2% of most pools. It's probably caused by a pool-wide setting, like configuring RAID-Z, or a weird combination of striping and mirroring (like 4 stripes, but with one of them being a mirror).
For (3), I re-read the man page to try to figure it out:
logicalused
The amount of space that is "logically" consumed by this dataset and
all its descendents. See the used property. The logical space
ignores the effect of the compression and copies properties, giving a
quantity closer to the amount of data that applications see.
If you were sending a dataset (instead of a single iSCSI volume) and the send size matched system 2's logicalused value (instead of system 1's), I would guess you forgot to send some child datasets (i.e. by using zfs send -R). However, neither of those are true in this case.
I had to do some additional digging -- this blog post from 2005 might contain the explanation. If system 1 didn't have compression enabled when the data was written (like I guessed above for (1)), the function responsible for not writing zeroed-out blocks (zio_compress_data) would not be run, so you probably have a bunch of empty blocks written to disk, and accounted for in the logicalused size. However, since lz4 is configured on system 2, it would run there, and those blocks would not be counted.

Understanding the bitstream generated for iCE40 I/O tiles

When I synthesize an empty circuit using Yosys and arachne-pnr, I get a few irregular bits:
.io_tile 6 17
IoCtrl IE_1
.io_tile 6 0
IoCtrl REN_0
IoCtrl REN_1
These are also part of every other file I could generate so far. Since an unused I/O tile has both IE bits set, I read this as:
for IE/REN block 6 17 0, the input buffer is enabled
for IE/REN block 6 0 0, the input buffer is enabled and the pull-up resistor is disabled
for IE/REN block 6 0 1, the input buffer is enabled and the pull-up resistor is disabled
However, according to the documentation, there is no IE/REN block 6 17 0.
What is the meaning of these bits? If the IE bit of block 6 17 0 is unset because the block doesn't exist, why aren't the bits of the other blocks which don't exist unset, too? The other IE/REN blocks seem to correspond to I/O blocks 6 0 1 and 7 0 0. What do these blocks do, and why are they always configured as inputs?
The technology library entry for SB_IO does not mention the IE bit. How is it related to the PIN_TYPE parameter settings?
When I use an I/O pin as an input, the REN bit is set (the pull-up resistor disabled). This suggests that the pull-up resistors are primarily intended to keep unused pins from floating, not to provide a pull-up resistor for conditionally connected inputs (e.g. buttons). Is this assumption correct? Would it be ok to use the internal pull-up resistors for that purpose?
The technology library says the following:
defparam IO_PIN_INST.PULLUP = 1'b0;
// By default, the IO will have NO pull up.
// This parameter is used only on bank 0, 1,
// and 2. Ignored when it is placed at bank 3
Does this mean bank 3 doesn't have pull-up resistors, or merely that they can't be re-enabled using Verilog? What would happen if I clear that bit in the ASCII bitstream manually? (I'd be trying this, but the iCEstick evaluation board doesn't make any pin on bank 3 accessible – a coincidence? – and I'm not sure if I want to mess with the hardware yet.)
When I use an I/O pin as an output, the IE bit isn't cleared, but the input pin function is set to PIN_INPUT. What effect does this have, and why is it done?
The default behavior for unused IO pins is to enable the pullup resistors and disable input enable. On iCE40 1k chips this means IE_0 and IE_1 are set and REN_0 and REN_1 are cleared in an unused IO tile. (On 8k chips IE_* is active high, i.e. all bits are cleared in an unused IO tile on an 8k chip.)
icebox_explain by default hides tiles that have "uninteresting" contents. (Run icebox_explain -A to disable this feature.)
It looks like arachne-pnr does not set those bits for IO pins that are not available in the current package. Thus you get some unusual bit pattern in some IO tiles that contain IE/REN bits for IO blocks not connected to any package pin.
This is what a "normal" unused IO tile looks like on the 1k architecture:
$ icebox_explain -mAt '1 0' example.asc
Reading file 'example.asc'..
Fabric size (without IO tiles): 12 x 16
.io_tile 1 0
B0 ------------------
B1 ------------------
B2 ------------------
B3 ------------------
B4 ------------------
B5 ------------------
B6 ---+--------------
B7 ------------------
B8 ------------------
B9 ---+--------------
B10 ------------------
B11 ------------------
B12 ------------------
B13 ------------------
B14 ------------------
B15 ------------------
IoCtrl IE_0
IoCtrl IE_1
Would it be ok to use the internal pull-up resistors for that purpose?
Yes.
The technology library entry for SB_IO does not mention the IE bit. How is it related to the PIN_TYPE parameter settings?
When D_IN_0 or D_IN_1 from SB_IO is connected to something, then this implies IE.
When I use an I/O pin as an output, the IE bit isn't cleared
Note that IE is active low on 1k chips and active high on 8k chips. When you use an I/O pin as output-only pin on a 1k device, then the corresponding IE bit should be set, otherwise it should be cleared.
I found the explanation, so I'm sharing it here for future reference:
.io_tile 6 17
IoCtrl IE_1
I/O block 16 7 0 is connected to a pin in some packages but not in the TQFP package.
.io_tile 6 0
IoCtrl REN_0
IoCtrl REN_1
This corresponds to I/O blocks 6 0 1 and 7 0 0 (pins 49 and 50) into whose input paths the PLL PORTA and PORTB clocks are fed, respectively.

How to set up RSS hash fuction in XL710 to receive IPv4 flow type?

In DPKD the ETH_RSS_IPV4 data flow is not activated by default for XL710 Intel NIC. So, when you want to distribute packets among lcores you have to select other IPv4 data flows which are supported by XL710, namely ETH_RSS_FRAG_IPV4, ETH_RSS_NONFRAG_IPV4_TCP, ETH_RSS_NONFRAG_IPV4_UDP, ETH_RSS_NONFRAG_IPV4_SCTP, and ETH_RSS_NONFRAG_IPV4_OTHER. However you will face a silly problem when you are dealing with the fragmented IP packets. If you choose to go with ETH_RSS_FRAG_IPV4 and ETH_RSS_NONFRAG_IPV4_TCP options then some fragmented packets of a connection will fall into another queue, because they don't have L4 port numbers. If you exclude ETH_RSS_NONFRAG_IPV4_TCP function then the ETH_RSS_FRAG_IPV4 hash function will not be applied to non-fragmented packets and those packets will go to queue 0. All other combination of hash functions will not work. So, what should we do?
The behavior of XL710 is not compatible with the conventions in DPDK. So, you must directly work with the API offered by i40e driver in order to set up RSS for ETH_RSS_IPV4. As mentioned in the Intel® Ethernet Controller 710 Series Specification Update, page 18 (release Jan 2017):
Functions that require the Hash (RSS) filters on IPv4 packets should
set all IPv4 PCTYPEs in the PFQF_HENA / VFQF_HENA (PCTYPEs 31, 33…36)
Supported packet types (PCTYPE) are mentioned in Intel® Ethernet Controller 710 Series Datasheet pages 597 and 598 (release Jan 2017). You can see that there is no packet type defined for IPv4.
However there is a solution. The clue is to modify the input set for all required flow types (or packet types). Let's try it with testpmd tool which is provided by DPDK in app folder. After compiling DPDK and the app, run the testpmd application:
./app/test-pmd/testpmd -c ff -n 2 -w 0a:00.0 -w 0a:00.1 -- -i --rxq=4 --txq=4
We have two XL710 in our system. With the following commands you can configure XL710 to behave as you want to support IPv4 data flow.
port config all rss all
set_hash_input_set 0 ipv4-tcp src-ipv4 select
set_hash_input_set 0 ipv4-tcp dst-ipv4 add
set_hash_input_set 0 ipv4-udp src-ipv4 select
set_hash_input_set 0 ipv4-udp dst-ipv4 add
set_hash_input_set 1 ipv4-tcp src-ipv4 select
set_hash_input_set 1 ipv4-tcp dst-ipv4 add
set_hash_input_set 1 ipv4-udp src-ipv4 select
set_hash_input_set 1 ipv4-udp dst-ipv4 add
set_hash_global_config 0 default ipv4-frag enable
set_hash_global_config 0 default ipv4-tcp enable
set_hash_global_config 0 default ipv4-udp enable
set_hash_global_config 1 default ipv4-frag enable
set_hash_global_config 1 default ipv4-tcp enable
set_hash_global_config 1 default ipv4-udp enable
It selects the proper input set for TCP and UDP flow types by removing the L4 port section. The set_hash_global_config command enables the symmetric hash if you need it. By modifying the TCP input set, it behaves just like Frag IPv4 flow type and as a result all packets belonging to the same connection go to the same lcore.
Note that the default input set for Frag IPv4 and NonFIPv4, Other is IP4-S and IP4-D. So it doesn't need to be modified. Remember to modify all other IPv4 flows input set and symmetric quality of them.
You can find the API functions of those commands by looking at the source code of the testpmd application.