How to scale Cisco Joy capture speed to 5 Gbps or more - packet-capture

I am currently replaying network packets with tcpreplay at 800 Mbps and capturing them with Joy, but I want to scale capture to more than 5 Gbps.
I am running Joy on a server with 16 GB of RAM and 8 cores.
Tcpreplay output:
```
Actual: 2427978 packets (2098973496 bytes) sent in 20.98 seconds
Rated: 100003501.6 Bps, 800.02 Mbps, 115678.59 pps
Flows: 49979 flows, 2381.11 fps, 2426216 flow packets, 1756 non-flow
Statistics for network device: vth0
Successful packets: 2427978
Failed packets: 0
Truncated packets: 0
Retried packets (ENOBUFS): 0
Retried packets (EAGAIN): 0
```
Total packets captured by Joy: 2412876 (around 15000 fewer than the 2427978 sent).
I am running Joy with 4 threads, but even with 24 threads I do not see any significant change in the capture rate.
Joy uses AF_PACKET with a zero-copy ring buffer, and Cisco Mercury uses the same mechanism to write packets, yet they claim Mercury can write at 40 Gbps on server-class hardware. If anyone has a suggestion on this issue, please share it.
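For reference, the usual way to make AF_PACKET capture scale across cores is PACKET_FANOUT: one socket (and one ring) per worker thread, with the kernel hashing flows across the group. Below is a minimal sketch of that setup; this is not Joy's actual code, the interface name `vth0` is taken from the question, and the ring sizes, worker count, and group id are illustrative assumptions. It needs root (CAP_NET_RAW).

```c
/*
 * Sketch: per-thread AF_PACKET sockets joined into a PACKET_FANOUT group,
 * each with a TPACKET_V3 memory-mapped RX ring. Error handling trimmed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

static int open_capture_socket(const char *ifname, int fanout_group)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); exit(1); }

    /* Select the TPACKET_V3 memory-mapped RX ring (the zero-copy path). */
    int version = TPACKET_V3;
    setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));

    struct tpacket_req3 req = {
        .tp_block_size     = 1 << 22,                   /* 4 MiB per block  */
        .tp_block_nr       = 64,                        /* 256 MiB ring     */
        .tp_frame_size     = 1 << 11,                   /* 2 KiB per frame  */
        .tp_frame_nr       = ((1 << 22) / (1 << 11)) * 64,
        .tp_retire_blk_tov = 60,                        /* ms block timeout */
    };
    setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req, sizeof(req));

    struct sockaddr_ll addr = {
        .sll_family   = AF_PACKET,
        .sll_protocol = htons(ETH_P_ALL),
        .sll_ifindex  = (int)if_nametoindex(ifname),
    };
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind"); exit(1);
    }

    /* The part that actually scales: FANOUT_HASH keeps each flow on one
     * socket, so every worker sees only its share of the traffic. */
    int fanout_arg = fanout_group | (PACKET_FANOUT_HASH << 16);
    setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &fanout_arg, sizeof(fanout_arg));
    return fd;
}

int main(void)
{
    enum { NWORKERS = 8 };          /* one per core; more gains nothing */
    int fds[NWORKERS];
    for (int i = 0; i < NWORKERS; i++)
        fds[i] = open_capture_socket("vth0", 42 /* arbitrary group id */);
    /* ...mmap() each ring and hand one fd to each worker thread here... */
    for (int i = 0; i < NWORKERS; i++)
        close(fds[i]);
    return 0;
}
```

Because FANOUT_HASH keys on the flow hash, per-flow state stays thread-local. Note also that with 8 cores, 24 capture threads cannot outrun 8, which may be part of why adding threads changed nothing.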

Related

DVB-S2 communication between two USRP B200

Thank you for reading this.
I'm having difficulties with DVB-S2 communication between two USRP B200 SDR boards connected with an SMA cable.
For the hardware setup, I'm using a Raspberry Pi 4 (4 GB) to run GNU Radio Companion, with a USB 3.0 port and cable connecting the RPi and the USRP B200. I connected a DC block at the Tx port, as described in the USRP manual for DVB-S2. (So the chain is: RPi 4 → USB 3.0 → USRP (Tx) → DC block → SMA cable (1 m) → USRP (Rx) → USB 3.0 → RPi 4.)
I have attached my hardware set-up pictures below.
I am trying to send some sample video over the DVB-S2 link. I got the DVB-S2 GRC flowgraphs from the links below; I've attached the screenshots, too.
https://github.com/drmpeg/gr-dvbs2
https://github.com/drmpeg/gr-dvbs2rx
At my last trial, it was successful with the RF options set as below:
-Constellation: QPSK
-Code rate: 2/5
-Center Freq.: 1 GHz
-Symbol rate: 1 Msym/s * 2 sps (samples per symbol) = 2 Msps sample rate (~2 MHz bandwidth)
-Tx relative gain: 40 dB
(Regarding the code rate and bandwidth, I could see the video was received at a 0.8 Mbps data rate, which matches the calculation below.)
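For reference, the 0.8 Mbps figure follows directly from those settings, ignoring DVB-S2 framing and pilot overhead: QPSK carries 2 bits per symbol, so 1 Msym/s × 2 bits/symbol × 2/5 (code rate) ≈ 0.8 Mbps of useful data.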
But the problem is:
-this connection is very unstable; it often fails even when the RF settings are identical.
-I need to raise the data rate as high as possible, but it's too low for me now. As far as I know, the USRP B200 supports up to ~61.44 Msps, but when I ask for more than about 4 MHz of bandwidth, the log shows 'U's (underflow) at the Tx and 'O's (overflow) at the Rx. I confirmed that the master clock rate setting is fine at 56 MHz.
-So I tried other combinations of constellation, code rate, and sample rate, but they failed. For the 8PSK option I put 3 into the sps variable at the Rx side, since 8PSK is 3 bits per symbol, but the Rx flowgraph rejected it, saying 'sps needs to be even integer >= 2'. It was also not possible to use 16APSK or higher constellations with this USRP or in this flowgraph.
I guess I am missing something.
Is there any way that I can make stable connection and raise up the data rate?
I would really appreciate it if you could help me.

GTX 970 bandwidth calculation

I am trying to calculate the theoretical memory bandwidth of the GTX 970, as per the specs given at:
http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-970/specifications
Memory clock: 7 Gbps
Memory bus width: 256-bit
Bandwidth = 7 × 256 × 2 / 8 (× 2 because it is DDR)
= 448 GB/s
However, in the specs it is given as 224 GB/s.
Why is there a factor-of-2 difference? Am I making a mistake? If so, please correct me.
Thanks
The 7 Gbps figure seems to be the effective rate, i.e. it already includes the transfers-per-clock multiplier. Also note that the field explanation for the relevant Wikipedia list says that "All DDR/GDDR memories operate at half this frequency, except for GDDR5, which operates at one quarter of this frequency", which suggests that GDDR5 chips are in fact quad data rate, despite the DDR abbreviation.
Finally, let me point out this note from Wikipedia, which disqualifies the trivial effective clock * bus width formula:
For accessing its memory, the GTX 970 stripes data across 7 of its 8 32-bit physical memory lanes, at 196 GB/s. The last 1/8 of its memory (0.5 GiB on a 4 GiB card) is accessed on a non-interleaved solitary 32-bit connection at 28 GB/s, one seventh the speed of the rest of the memory space. Because this smaller memory pool uses the same connection as the 7th lane to the larger main pool, it contends with accesses to the larger block reducing the effective memory bandwidth not adding to it as an independent connection could.
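For what it's worth, the note's numbers are internally consistent: one 32-bit lane at the 7 GT/s effective rate moves 7 × 32 / 8 = 28 GB/s, the seven interleaved lanes give 7 × 28 = 196 GB/s, and 196 + 28 = 224 GB/s is the headline figure (with the caveat above that the two pools contend rather than add).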
The clock rate reported is an "effective" clock rate that already accounts for transfers on both rising and falling edges, so the extra factor of 2 for DDR is the mistake: it has already been applied.
Some discussion on devtalk here: https://devtalk.nvidia.com/default/topic/995384/theoretical-bandwidth-vs-effective-bandwidth/
In fact, your formula is correct, but the memory clock is wrong. The GeForce GTX 970's memory clock is 1753 MHz (see https://www.techpowerup.com/gpu-specs/geforce-gtx-970.c2620).
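Plugging that in, with GDDR5 being quad data rate, the formula reproduces the spec sheet: 1753 MHz × 4 × 256 bit / 8 = 224.4 GB/s.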

Can't configure GPU BIOS properly

I have 6 RX 470 GPUs. Each should be mining 25-27 MH/s on average, but each is only doing 20 MH/s, so the rig totals 120 MH/s instead of 150-170 MH/s. I think the problem is the GPU BIOS configuration, but I can't figure out anything else. Any suggestions?
25 MH/s is what you would expect from a stock RX 480. To get the same hashrate from an RX 470, you'd be looking at overclocking the memory (+600 MHz). How to overclock depends on whether you're running Linux or Windows.

Why no linear scaling of Redis Cluster

I am trying to build a horizontally scalable system based on Redis Cluster, so I've measured the throughput of Redis Cluster with different numbers of nodes. However, the measured results don't show the linear scalability the cluster spec claims: "High performance and linear scalability up to 1000 nodes."
redis cluster benchmark:
The image above shows the measured results for Redis Clusters of (3+3), (4+4), (5+5), (6+6), (8+8), (10+10) and (12+12) nodes, where (3+3) means 3 master nodes plus 3 slave nodes. The results for C(reate) and U(pdate) do not show linear scalability, as the following picture shows.
I'd like to know why these measured results don't show linear scalability. Is there anything that could be limiting the scaling?
My test environment and related information are described below.
Server
HW: HP BL460c G9, 24 CPUs (E5-2620 v3 @ 2.40 GHz), 64 GB memory, 300 GB disk
I have two machines. To measure the capacity of one machine, I run all master nodes on one machine and all slave nodes on the other. All Redis nodes are included in one Redis Cluster.
OS: SLES 12
I have updated some system settings to achieve higher performance:
```
echo 65535 > /proc/sys/net/core/somaxconn
echo 65535 > /proc/sys/net/ipv4/tcp_max_syn_backlog
echo never > /sys/kernel/mm/transparent_hugepage/enabled
sysctl vm.overcommit_memory=1
sysctl vm.swappiness=0
```
Furthermore, I've turned off swap entirely, because it caused very unstable throughput when AOF rewrites happened, even with swappiness already set to 0. As observed, the 15 million records in my test occupy around 48 GB of memory.
Redis 3.0.6: To eliminate bursts caused by RDB snapshots, I turned off RDB completely and enabled only AOF. Other configuration values in redis.conf are left at their defaults.
Client
HW: HP DL380 G7, 16 CPUs (E5620 @ 2.40 GHz), 24 GB memory, 600 GB disk
OS: SLES 12
YCSB (0.6.0) with jedis (2.8.0)
I use a hash key to store each record (1 key with 21 fields) and N sorted sets to store all keys with random scores, where N is the number of master nodes in the cluster. The N sorted sets are distributed evenly across the master nodes (see the hash-slot sketch below).
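As an aside on how that even distribution works: Redis Cluster assigns every key to one of 16384 slots via CRC16 (honoring an optional {...} hash tag), and each master serves a subset of the slots. The sketch below implements that documented mapping in C so you can check where each sorted set lands; the key names are made up for illustration, and this is not part of YCSB or jedis.

```c
/*
 * Sketch of Redis Cluster's key-to-slot mapping: CRC16 (XMODEM variant,
 * poly 0x1021, init 0) over the key, or over the substring between the
 * first {...} hash tag if one is present, modulo 16384 slots.
 */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

static uint16_t crc16_xmodem(const char *buf, size_t len)
{
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i]) << 8;
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

static unsigned hash_slot(const char *key)
{
    const char *open = strchr(key, '{');
    if (open) {
        const char *close = strchr(open + 1, '}');
        if (close && close > open + 1)   /* non-empty {tag}: hash only tag */
            return crc16_xmodem(open + 1, (size_t)(close - open - 1)) & 16383;
    }
    return crc16_xmodem(key, strlen(key)) & 16383;   /* mod 16384 slots */
}

int main(void)
{
    /* Hypothetical sorted-set names, one per master; the name decides the slot. */
    printf("%u\n", hash_slot("keyset{tag0}"));
    printf("%u\n", hash_slot("keyset{tag1}"));
    return 0;
}
```

This is why giving each sorted set a distinct name (or hash tag) lets them land on different masters; jedis performs the same slot computation client-side when routing cluster commands.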
The YCSB workload configuration is pasted below:
```
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=15000000
operationcount=150000000
insertstart=0
fieldcount=21
fieldlength=188
readallfields=true
writeallfields=false
fieldlengthdistribution=zipfian
readproportion=0.0
updateproportion=1.0
insertproportion=0
readmodifywriteproportion=0.0
scanproportion=0
maxscanlength=1000
scanlengthdistribution=uniform
insertorder=hashed
requestdistribution=zipfian
hotspotdatafraction=0.2
hotspotopnfraction=0.8
table=subscriber
measurementtype=histogram
histogram.buckets=1000
timeseries.granularity=1000
```
In most cases the machine resources look sufficient to me, even though throughput has already hit its limit:
CPU: plenty of headroom, 60~70% idle
I/O: not very busy, 30~40% utilization at peak
Memory: the only resource that comes close to exhaustion, at peak times, i.e. when AOF rewrites happen; most of the time usage is around 80%

What are the 0 bytes at the end of an Ethernet frame in Wireshark?

After the ARP payload in a frame, there are many 0 bytes. Does anyone know the reason these 0 bytes exist?
Check the expandable Ethernet II section in the packet details; all the 0 bytes are labelled as padding.
Ethernet requires that all packets be at least 60 bytes long (64 bytes if you include the Frame Check Sequence at the end), so if a packet is less than 60 bytes long (including the 14-byte Ethernet header), additional padding bytes have to be added to the end of the packet.
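To put numbers on it for this capture: a standard IPv4 ARP message over Ethernet is 28 bytes, plus the 14-byte Ethernet header makes 42 bytes, so the sender pads with 60 − 42 = 18 zero bytes, which is exactly the run of 0 bytes seen after the ARP payload.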
(Those padding bytes will not show up on packets sent by the machine running Wireshark; the padding is added by the Ethernet hardware, and packets being sent by the machine capturing the traffic are given to the program before being handed to the hardware, so they haven't been padded.)