How to set up the RSS hash function in the XL710 to receive the IPv4 flow type?

In DPDK, the ETH_RSS_IPV4 flow type is not activated by default for the Intel XL710 NIC. So, when you want to distribute packets among lcores, you have to select the other IPv4 flow types that the XL710 does support, namely ETH_RSS_FRAG_IPV4, ETH_RSS_NONFRAG_IPV4_TCP, ETH_RSS_NONFRAG_IPV4_UDP, ETH_RSS_NONFRAG_IPV4_SCTP, and ETH_RSS_NONFRAG_IPV4_OTHER. However, you then run into an annoying problem with fragmented IP packets. If you enable both ETH_RSS_FRAG_IPV4 and ETH_RSS_NONFRAG_IPV4_TCP, the fragments of a connection can land in a different queue than the unfragmented packets, because fragments carry no L4 port numbers. If you drop ETH_RSS_NONFRAG_IPV4_TCP, the ETH_RSS_FRAG_IPV4 hash is not applied to non-fragmented packets, and those packets all go to queue 0. No other combination of hash functions works either. So, what should we do?
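For reference, this is roughly how I select those flow types when configuring the port (a minimal sketch using the ETH_RSS_* flag names of that DPDK generation; error handling omitted):
#include <rte_ethdev.h>
/* RSS configuration with the IPv4 flow types the XL710 supports out of the box.
 * Note: this is exactly the setup that runs into the fragmentation problem
 * described above. */
static const struct rte_eth_conf port_conf = {
	.rxmode = {
		.mq_mode = ETH_MQ_RX_RSS,
	},
	.rx_adv_conf = {
		.rss_conf = {
			.rss_key = NULL,   /* use the driver's default RSS key */
			.rss_hf = ETH_RSS_FRAG_IPV4 |
				  ETH_RSS_NONFRAG_IPV4_TCP |
				  ETH_RSS_NONFRAG_IPV4_UDP |
				  ETH_RSS_NONFRAG_IPV4_SCTP |
				  ETH_RSS_NONFRAG_IPV4_OTHER,
		},
	},
};
/* ... later: rte_eth_dev_configure(port_id, nb_rxq, nb_txq, &port_conf); */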

The behavior of the XL710 is not compatible with the DPDK conventions, so you have to work directly with the API offered by the i40e driver in order to set up RSS for ETH_RSS_IPV4. As mentioned in the Intel® Ethernet Controller 710 Series Specification Update, page 18 (release Jan 2017):
Functions that require the Hash (RSS) filters on IPv4 packets should
set all IPv4 PCTYPEs in the PFQF_HENA / VFQF_HENA (PCTYPEs 31, 33…36)
Supported packet types (PCTYPE) are listed in the Intel® Ethernet Controller 710 Series Datasheet, pages 597 and 598 (release Jan 2017). You can see that there is no packet type defined for plain IPv4.
However, there is a solution. The trick is to modify the input set for all required flow types (or packet types). Let's try it with the testpmd tool, which DPDK provides in the app folder. After compiling DPDK and the app, run the testpmd application:
./app/test-pmd/testpmd -c ff -n 2 -w 0a:00.0 -w 0a:00.1 -- -i --rxq=4 --txq=4
We have two XL710 NICs in our system. With the following commands you can configure them to support the IPv4 flow type the way you want:
port config all rss all
set_hash_input_set 0 ipv4-tcp src-ipv4 select
set_hash_input_set 0 ipv4-tcp dst-ipv4 add
set_hash_input_set 0 ipv4-udp src-ipv4 select
set_hash_input_set 0 ipv4-udp dst-ipv4 add
set_hash_input_set 1 ipv4-tcp src-ipv4 select
set_hash_input_set 1 ipv4-tcp dst-ipv4 add
set_hash_input_set 1 ipv4-udp src-ipv4 select
set_hash_input_set 1 ipv4-udp dst-ipv4 add
set_hash_global_config 0 default ipv4-frag enable
set_hash_global_config 0 default ipv4-tcp enable
set_hash_global_config 0 default ipv4-udp enable
set_hash_global_config 1 default ipv4-frag enable
set_hash_global_config 1 default ipv4-tcp enable
set_hash_global_config 1 default ipv4-udp enable
These commands select the proper input set for the TCP and UDP flow types by removing the L4 port fields, so those flow types are hashed on the IPv4 addresses only, just like the Frag IPv4 flow type; as a result, all packets belonging to the same connection go to the same lcore. The set_hash_global_config commands enable the symmetric hash, if you need it.
Note that the default input set for the Frag IPv4 and NonFrag IPv4, Other flow types is already the source and destination IPv4 addresses (IP4-S and IP4-D), so they do not need to be modified. Remember to adjust the input set (and the symmetric hashing) of any other IPv4 flow types you use in the same way.
You can find the API functions behind those commands by looking at the source code of the testpmd application.
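For example, the set_hash_input_set commands boil down to a call like the following (a rough sketch based on the filter-control API of that DPDK generation; check testpmd's cmdline.c for the exact code):
#include <string.h>
#include <rte_ethdev.h>
#include <rte_eth_ctrl.h>
/* Make ipv4-tcp hash on the source IPv4 address only (the "select" step).
 * A second call with op = RTE_ETH_INPUT_SET_ADD and
 * field[0] = RTE_ETH_INPUT_SET_L3_DST_IP4 adds the destination address,
 * mirroring the pair of testpmd commands above. */
static int
hash_input_set_src_ip4(uint16_t port_id)
{
	struct rte_eth_hash_filter_info info;

	memset(&info, 0, sizeof(info));
	info.info_type = RTE_ETH_HASH_FILTER_INPUT_SET_SELECT;
	info.info.input_set_conf.flow_type = RTE_ETH_FLOW_NONFRAG_IPV4_TCP;
	info.info.input_set_conf.field[0] = RTE_ETH_INPUT_SET_L3_SRC_IP4;
	info.info.input_set_conf.inset_size = 1;
	info.info.input_set_conf.op = RTE_ETH_INPUT_SET_SELECT;

	return rte_eth_dev_filter_ctrl(port_id, RTE_ETH_FILTER_HASH,
				       RTE_ETH_FILTER_SET, &info);
}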

Related

Understanding iptables output for filtering packets

I am working on a SLES 12 system where iptables is configured like this:
num target prot opt source destination
1 ACCEPT tcp -- anywhere anywhere tcp dpt:bctp owner UID match dd-test-user
2 DROP tcp -- anywhere anywhere tcp dpt:bctp
3 DROP all -- anywhere instance-data.us-west-2.compute.internal owner GID match test
4 ACCEPT all -- anywhere ip-100-34-0-0.us-west-2.compute.internal/21 owner GID match test
Can someone please help me understand this?
With iptables rule 2, will all packets be dropped?
What does dpt:bctp mean here? I could not find anything about it in the manual.
Does rule 4 even get a chance to be applied for processes running under the group ID of the "test" group?
I tried searching the online iptables documentation, but I could not find an answer.
I found out that bctp is one of the services defined on the system:
cat /etc/services | grep bctp
bctp 8999/tcp # Brodos Crypto Trade Protocol [Alexander_Sahler]
bctp 8999/udp # Brodos Crypto Trade Protocol [Alexander_Sahler]
Rule 2 is applied when the destination port (dpt) is this service (8999/tcp). Rules 3 and 4 are applied to the remaining ports for processes belonging to users in the "test" group.
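If you want to see the rules without the service-name and hostname translation (numeric ports and addresses are usually easier to reason about), listing the chain verbosely is the usual approach; the chain name OUTPUT is an assumption here, since owner matching only applies to locally generated traffic:
iptables -L OUTPUT -n -v --line-numbers
The -n flag shows 8999 instead of bctp, and the packet/byte counters tell you which rules are actually being hit.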

Why received ZFS dataset uses less space than original?

I have a dataset on server1 that I want to back up to a second server, server2.
Server1 (original):
zfs list -o name,used,avail,refer,creation,usedds,usedsnap,origin,compression,compressratio,refcompressratio,mounted,atime,lused storage/iscsi/webhost-old produces:
NAME USED AVAIL REFER CREATION USEDDS USEDSNAP ORIGIN COMPRESS RATIO REFRATIO MOUNTED ATIME LUSED
storage/iscsi/webhost-old 67,8G 1,87T 67,8G Út kvě 31 6:54 2016 67,8G 16K - lz4 1.00x 1.00x - - 67,4G
Sending volume to the 2nd server:
zfs send storage/iscsi/webhost-old | pv | ssh -c arcfour,aes128-gcm@openssh.com root@10.0.0.2 zfs receive -Fduv pool/bkp-storage
received 69,6GB stream in 378 seconds (189MB/sec)
Server2 zfs list produces:
NAME USED AVAIL REFER CREATION USEDDS USEDSNAP ORIGIN COMPRESS RATIO REFRATIO MOUNTED ATIME LUSED
pool/bkp-storage/iscsi/webhost-old 36,1G 3,01T 36,1G Pá pro 29 10:25 2017 36,1G 0 - lz4 1.15x 1.15x - - 28,4G
Why is there such a difference in sizes? Thanks.
From what you posted, I noticed 3 things that seemed odd:
the compressratio is 1.15x on system 2, but 1.00x on system 1
on system 2, used is 1.27x higher than logicalused
the logicalused and the number zfs receive reports are ~2.3x higher on system 1 than on system 2
These terms are all defined in the man page, but are still confusing to reverse-engineer explanations for in practice.
(1) could happen if you enabled compression on the source dataset after you wrote all the data to it, since ZFS doesn't rewrite the data to compress it when you enable that setting. The data sent by zfs send is uncompressed unless you use -c, but system 2 will try to compress it as it runs zfs receive if the setting is enabled on the destination dataset. If both system 1 and system 2 had the same compression settings before the data was written, they would have the same compressratio as well.
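A quick way to check this is to compare the compression-related properties on both sides (dataset names taken from your output):
zfs get compression,compressratio,used,logicalused storage/iscsi/webhost-old            # on server1
zfs get compression,compressratio,used,logicalused pool/bkp-storage/iscsi/webhost-old   # on server2
If your ZFS version supports it, zfs send -c sends the blocks as they are already compressed on disk, which keeps the stream size down and avoids recompressing on the receiving side.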
(2) can happen due to metadata written along with your data, but in this case it's too high for "normal" metadata, which accounts for 1-2% of most pools. It's probably caused by a pool-wide setting, like configuring RAID-Z, or a weird combination of striping and mirroring (like 4 stripes, but with one of them being a mirror).
For (3), I re-read the man page to try to figure it out:
logicalused
The amount of space that is "logically" consumed by this dataset and
all its descendents. See the used property. The logical space
ignores the effect of the compression and copies properties, giving a
quantity closer to the amount of data that applications see.
If you were sending a dataset (instead of a single iSCSI volume) and the send size matched system 2's logicalused value (instead of system 1's), I would guess you forgot to send some child datasets (i.e. by using zfs send -R). However, neither of those are true in this case.
I had to do some additional digging -- this blog post from 2005 might contain the explanation. If system 1 didn't have compression enabled when the data was written (like I guessed above for (1)), the function responsible for not writing zeroed-out blocks (zio_compress_data) would not be run, so you probably have a bunch of empty blocks written to disk, and accounted for in the logicalused size. However, since lz4 is configured on system 2, it would run there, and those blocks would not be counted.
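You can reproduce that effect with a small experiment (the tank/zerotest-* names are hypothetical scratch datasets, assumed to be mounted at their default mountpoints):
zfs create -o compression=off tank/zerotest-off
zfs create -o compression=lz4 tank/zerotest-lz4
dd if=/dev/zero of=/tank/zerotest-off/zeros bs=1M count=1024
dd if=/dev/zero of=/tank/zerotest-lz4/zeros bs=1M count=1024
sync    # make sure the writes hit disk before checking the accounting
zfs get used,logicalused tank/zerotest-off tank/zerotest-lz4
The dataset with compression off ends up with roughly 1G of used space for the zeroed blocks, while on the lz4 dataset they are detected during compression and take up almost nothing.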

Delimiter string in Telit GL 868 Dual V3

I am using the Telit GL868 Dual V3 modem. The AT command AT#SCFG has two parameters of interest: the packet size to be used and the data-sending timeout for TCP. Is there any AT command which specifies that, if a delimiter string is found, the data will be sent over TCP immediately, ignoring the packet size and timeout?
There are the commands #PADFWD and #PADCMD, which serve this delimiter purpose.
Their syntax and parameters are documented in the AT Commands Reference Guide for the Telit modem.
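As a rough illustration of how these are meant to be used (the parameter values below are assumptions, not taken from the guide, so verify them against your firmware's reference before relying on them):
AT#PADFWD=13
AT#PADCMD=1
The first line would set carriage return (ASCII 13) as the forward/delimiter character, and the second would enable forwarding on that character; the exact <mode> value for #PADCMD differs between firmware versions. Once enabled, buffered data is flushed to the TCP socket as soon as the delimiter arrives, instead of waiting for the packet size or timeout configured with AT#SCFG.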

LVS: All connections are InActConn

I'm a newbie with LVS. I've tried both LVS/TUN and LVS/DR, and the result is the same: all connections show up as InActConn. The real servers can be reached (via ping). Please help!
OS: CentOS 6.2
RemoteAddress:Port Forward Weight ActiveConn InActConn
UDP 192.168.10.240:2345 rr
-> 192.168.10.251:2345 Tunnel 1 0 10
-> 192.168.10.252:2345 Tunnel 1 0 9
-> 192.168.10.253:2345 Tunnel 1 0 9
This is the expected behavior for services that don't maintain connections, like UDP. You may want to read the LVS HOWTO, especially the part about active/inactive connections:
http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.ipvsadm.html#ActiveConn
Old question, but I got to this post from Google and want to add my findings here.
In the above answer, the link posted by @remi-ggacogne misses one step for the real server.
You have to turn rp_filter off (especially on CentOS / RHEL): https://www.slashroot.in/linux-kernel-rpfilter-settings-reverse-path-filtering
Open /etc/sysctl.conf and add the lines below (adjust for your network interface):
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.tunl0.rp_filter = 0
To make the above active, run:
sysctl -p
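To confirm the change took effect and that the real servers are actually receiving traffic, something like the following should do (sysctl on the real servers, ipvsadm on the director):
sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.tunl0.rp_filter
ipvsadm -L -n --stats
Keep in mind that for UDP services ActiveConn stays at 0 even when everything works, so the packet and byte counters from --stats are the better health indicator.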

size of ICMP type 11 packet payload

What's the size of the ICMP packet payload when the type is 11, i.e. time exceeded?
Since it contains an IP header and the first 8 bytes of the payload of the IP packet that generated the ICMP message, I thought its size was 20 + 8 = 28 bytes.
I'm replaying some common user traffic with TTL=1. In the ICMP messages I have dumped I noticed that:
all ICMP packets generated by UDP packets have payload of size 28 Bytes
all those generated by TCP packets have payload of size 40 Bytes
Since I need to match ICMP time-exceeded messages with the packets that triggered them by comparing those bytes, this piece of information is essential, but I can't figure out why this happens.
The problem is that you're quoting the 8-byte header payload from RFC 792, Page 4, but the requirements were changed by RFC 1812...
Time Exceeded Message (in RFC 792)
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Type | Code | Checksum |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| unused |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Internet Header + 64 bits of Original Data Datagram |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
RFC 1812, Section 4.3.2.3 dramatically increases the allowable payload in an ICMP Error message (emphasis mine):
4.3.2.3 Original Message Header
Historically, every ICMP error message has included the Internet
header and at least the first 8 data bytes of the datagram that
triggered the error. This is no longer adequate, due to the use of
IP-in-IP tunneling and other technologies. Therefore, the ICMP
datagram SHOULD contain as much of the original datagram as possible
without the length of the ICMP datagram exceeding 576 bytes. The
returned IP header (and user data) MUST be identical to that which
was received, except that the router is not required to undo any
modifications to the IP header that are normally performed in
forwarding that were performed before the error was detected (e.g.,
decrementing the TTL, or updating options).
The ICMP Errors you're generating from Scapy packets should contain all the information from the IP and TCP layers of the original packet.
As you noted, the ICMP payload is the IP header plus 8 octets of the original packet's payload. IP headers, however, are not always 20 octets long; 20 is only the minimum. The IP header itself may contain options, and the header length is indicated by the value in the IHL field of the header. See sec 3.1 of RFC 791. So it looks like the TCP packets have 12 additional octets of options in their IP headers. RFC 791 defines some standard options such as source routing and timestamping. You'll have to decode the header to determine what options are being used.
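To make the arithmetic concrete: a minimal IP header has IHL = 5, i.e. 5 × 4 = 20 octets, which gives the familiar 20 + 8 = 28-octet ICMP payload you see for the UDP packets. If the TCP packets carry 12 octets of IP options, the IHL field is 8, the header is 8 × 4 = 32 octets, and the payload becomes 32 + 8 = 40 octets, matching your capture.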
I would like to add for future reference that not only do ICMP payloads vary in size, as Mike said, they might also be longer than 128 bytes in the case of ICMP extensions for MPLS. See this draft for more information.