I used TCP AsyncSocket to transfer a large file from one machine to another over a local connection (using the local IP address as the host).
First, I set up a single TCP socket connection and found the data transfer rate slow, around 1 mb/sec.
To make it faster I created 10 TCP sockets (connecting on separate ports, on separate threads) and started reading partitions of the file simultaneously. But it didn't make any difference; the transfer rate is almost the same as with a single TCP socket connection (or even slower).
Any idea why multiple TCP sockets aren't transferring data in parallel? Any suggestions for transferring a file quickly over TCP?
Parallelizing an I/O operation only helps if the I/O channel isn't already saturated and the bottleneck is something like a single-core-bound task.
Quite likely, adding additional I/O channels will actually slow things down, since you now have multiple clients competing for a scarce resource.
What you need to figure out is where your bottleneck is. Only once you've quantified the actual cause of your performance issue will you be able to fix it.
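One quick way to find the bottleneck is to take disk I/O out of the picture and time a raw socket transfer of in-memory data. Here's a minimal sketch in Python (the original code used AsyncSocket, but the idea is language-agnostic); the host/port are placeholders, and it assumes something on the other end is reading and discarding the bytes:

```python
import socket
import time

CHUNK = 1024 * 1024          # 1 MiB of in-memory data, so disk I/O is excluded
TOTAL = 256 * CHUNK          # send 256 MiB in total
payload = b"\0" * CHUNK

def measure_send(host, port):
    """Send a fixed amount of in-memory data and report raw TCP throughput."""
    with socket.create_connection((host, port)) as sock:
        sent = 0
        start = time.monotonic()
        while sent < TOTAL:
            sock.sendall(payload)
            sent += CHUNK
        elapsed = time.monotonic() - start
    print(f"{sent / elapsed / 1e6:.1f} MB/s over {elapsed:.1f} s")

# measure_send("192.168.1.10", 9000)   # placeholder receiver address/port
```

If this raw number is also around 1 mb/sec, the network path (or the socket configuration) is the limit; if it's much higher, look at how the file is being read and written on each side.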
I use winsock2 with C++ to connect thousands of clients to my server using the TCP protocol.
The clients will rarely send packets to the server; there is about 1 minute between two packets from a client.
Right now I iterate over each client with non-blocking sockets to check if the client has sent anything to the server.
But I think a much better design of the server would be to wait until it receives a packet from the thousands of clients and then ask for the client's socket. That would be better because the server wouldn't have to iterate over each client, where 99.99% of the time nothing happens. And with a million clients, a complete iteration might take seconds....
Is there a way to receive data from each client without iterating over each client's socket?
What you want is to use IO completion ports, I think. See https://learn.microsoft.com/en-us/windows/win32/fileio/i-o-completion-ports. Microsoft even has an example at GitHub for this: https://github.com/microsoft/Windows-classic-samples/blob/main/Samples/Win7Samples/netds/winsock/iocp/server/IocpServer.Cpp
BTW, do not expect to serve a million connections on a single machine. Generally there are limits, like available memory (both user space and kernel space), handle limits, etc. If you are careful, you can probably serve tens of thousands of connections per process.
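IOCP itself is a Win32 C/C++ API, and the linked GitHub sample shows the real thing. Just to illustrate the idea of being woken only when a client actually sends something, rather than iterating over every socket, here is a minimal sketch using Python's asyncio; on Windows the default asyncio event loop (the proactor loop) is itself built on I/O completion ports. The port number and buffer size are arbitrary:

```python
import asyncio

async def handle_client(reader, writer):
    # This coroutine is resumed only when the client actually sends data,
    # so thousands of mostly idle connections cost almost nothing.
    while True:
        data = await reader.read(4096)
        if not data:
            break
        peer = writer.get_extra_info("peername")
        print(f"got {len(data)} bytes from {peer}")
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 5000)
    async with server:
        await server.serve_forever()

# asyncio.run(main())
```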
The server needs to push data to 100K clients, which cannot be connected to directly since the machines are inside a private network. I'm currently thinking of using RabbitMQ: each client is subscribed to a separate queue, and when the server has data to push to a client, it publishes the data to the corresponding queue. Are there any issues with the above approach? The number of clients may go up to 100K.
From a spike, I'm expecting a memory footprint of about 20 GB for maintaining the connections. We can still go ahead with this approach if memory doesn't grow beyond 30 GB.
The question is too generic.
I suggest reading this: RabbitMQ - How many queues can RabbitMQ handle on a single server?
Then you should consider using a cluster to scale the number of queues.
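For what it's worth, the per-client-queue publish itself is simple. A minimal sketch with the pika client, where the queue naming scheme (client.<id>), host, and payload are all placeholder assumptions:

```python
import pika  # third-party RabbitMQ client: pip install pika

def push_to_client(channel, client_id, payload):
    queue = f"client.{client_id}"                     # assumed naming convention
    channel.queue_declare(queue=queue, durable=True)  # idempotent re-declare
    channel.basic_publish(exchange="", routing_key=queue, body=payload)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
push_to_client(channel, "client-0001", b'{"event": "update"}')
connection.close()
```

The open question is really the broker side (100K connections and queues, and the memory they cost), which is what the linked discussion and clustering address; this only shows the data path.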
The question originated from: https://groups.google.com/forum/#!topic/logstash-users/cYv8ULhHeE0
Comparing the Logstash scale-out strategies below, a TCP load balancer gives the best performance if traffic/CPU load is balanced.
However, it seems hard to keep traffic balanced all the time, due to the nature of the logstash-forwarder <-> logstash TCP connections.
Does anyone have a better idea for keeping traffic/CPU load balanced across Logstash nodes?
Thanks for the advice :)
< My scenario >
10+ service nodes equipped with logstash-forwarder, forwarding logs to a central Logstash node (cluster)
each service node's average log throughput, daily throughput distribution, and log type filter complexity vary a lot:
log average throughput: e.g. service_1: 0.5k events/s; service_2: 5k events/s
throughput daily distribution: e.g. service_1 peaks in the morning, service_2 peaks at night
log type filter complexity: consuming 100% of a single Logstash node's CPU, service_1's log type can be processed at 300 events/s, while service_2's log type at 1500 events/s
< TCP load balancer >
TCP connections between logstash-forwarder and Logstash are persistent. That means that even if the number of TCP connections is eventually balanced across all Logstash nodes (whether by least connections, least load, etc.), it doesn't guarantee that traffic/CPU load is balanced across them. In my scenario, each TCP connection's traffic varies in its daily average, over time, and in its event complexity.
So in the worst case, logstash_1 and logstash_2 might both have 10 TCP connections, but logstash_1's CPU load could be 3x that of logstash_2, because logstash_1's connections carry higher traffic and more complex events.
< Manually assign logstash-forwarders to logstash >
Might face the same situation as the TCP load balancer: we can plan to distribute load based on historical daily average traffic, but traffic changes over time, and there's no HA, of course.
< message queue >
Architecture: service node with logstash-forwarder -> queuer: Logstash to RabbitMQ -> indexer: Logstash from RabbitMQ to ElasticSearch.
Around 30% CPU overhead for sending messages to or receiving messages from the queue broker, on all nodes.
I'll focus on one aspect of your question, which is load-balancing a RabbitMQ cluster. RabbitMQ clusters always consist of a single Master node and 0…n Slave nodes. It is therefore favourable to force connections to the Master node, rather than implement round-robin, leastconn, etc.
RabbitMQ will automatically route traffic to the Master node even if your load balancer routes to a different node. This post explains the concept in greater detail.
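In practice, forcing connections to the Master can be as simple as listing that node first in the client's connection parameters, so the client never lands on a Slave in the first place. A sketch with the Python pika client, which as far as I know accepts an ordered list of connection parameters and tries them in order; the hostnames and queue name here are placeholders:

```python
import pika  # pip install pika

# Put the (assumed) Master node first; the second entry is only a fallback.
params = [
    pika.ConnectionParameters(host="rabbit-master.example.com"),
    pika.ConnectionParameters(host="rabbit-slave-1.example.com"),
]

connection = pika.BlockingConnection(params)  # attempts hosts in the given order
channel = connection.channel()
channel.basic_publish(exchange="", routing_key="logstash", body=b"log event")
connection.close()
```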
I'm writing both client and server code using WCF, where I need to know the "perceived" bandwidth of traffic between the client and server. I could use ping statistics to gather this information separately, but I wonder if there is a way to configure the channel stack in WCF so that the same statistics can be gathered simultaneously while performing my web service invocations. This would be particularly useful in cases where ICMP is disabled (e.g. ping won't work).
In short, while making my regular business-related web service calls (REST calls to be precise), is there a way to collect connection speed data implicitly?
Certainly I could time the web service round trip, compared to the size of data used in the round-trip, to give me an idea of throughput - but I won't know how much of that perceived bandwidth was network related, or simply due to server-processing latency. I could perhaps solve that by having the server send back a time delta, representing server latency, so that the client can compute the actual network traffic time. If a more sophisticated approach is not available, that might be my answer...
ICMP was not created with the intention of gathering connection speed statistics, but rather to check whether a valid connection exists between two hosts.
My best guess is that the amount of data sent in those REST calls or in ICMP traffic is not enough to calculate a perceived connection speed / bandwidth.
If you calculate from these metrics you will get bandwidth figures that are either very high or very low (take the file-copy dialog in Windows XP as an example). You need a constant and substantial amount of data to be sent in order to calculate valid throughput statistics.
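If you do go with the approach from the question (time the round trip, subtract a server-reported processing delta, and divide the bytes moved by the remainder), it only becomes meaningful with a sizeable payload. This is not WCF, just the measurement idea sketched in Python; the endpoint URL and the X-Server-Processing-Ms header are hypothetical:

```python
import time
import urllib.request

URL = "https://api.example.com/echo"   # hypothetical endpoint
PAYLOAD = b"x" * (1 << 20)             # 1 MiB, large enough to dominate latency

req = urllib.request.Request(URL, data=PAYLOAD, method="POST")
start = time.monotonic()
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    # Hypothetical header in which the server reports its own processing time.
    server_ms = float(resp.headers.get("X-Server-Processing-Ms", "0"))
round_trip = time.monotonic() - start

network_time = round_trip - server_ms / 1000.0
throughput = (len(PAYLOAD) + len(body)) / network_time
print(f"round trip {round_trip:.3f} s, network {network_time:.3f} s, "
      f"~{throughput / 1e6:.2f} MB/s perceived")
```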
I have a lot of clients (around 4000).
Each client pings my server every 2 seconds.
Can these ping requests put a load on the server and slow it down?
How can I monitor this load?
Right now the server responds slowly, but the processor is almost idle and free memory is fine.
I'm running Apache on Ubuntu.
Assuming you mean a UDP/ICMP ping just to see if the host is alive, 4000 hosts probably isn't much load, and it's fairly easy to calculate. CPU- and memory-wise, ping is handled by your kernel and should be optimized to not take many resources, so you need to look at network resources. The most critical point is if you have a half-duplex link: because all of your hosts are chatty, you'll cause a lot of collisions and retransmissions (and dropped pings). If the links are all full duplex, let's calculate the actual amount of bandwidth required at the server.
4000 clients, one ping every 2 seconds.
Each ping is 74 bytes on the wire (32 bytes data + 8 bytes ICMP header + 20 bytes IP header + 14 bytes Ethernet). You might have some additional overhead if you use VLAN tagging or UDP-based pings.
If we can assume the pings are randomly distributed, we would have 2000 pings per second × 74 bytes = 148,000 bytes per second.
Multiply by 8 to get bits per second = 1,184,000 bps, or about 1.2 Mbps.
On a 100 Mbps LAN, this would be about 1.2% utilization just for the pings.
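The same arithmetic as a few lines of Python, in case you want to plug in your own client count, interval, or frame size:

```python
clients = 4000
interval_s = 2                    # one ping per client every 2 seconds
frame_bytes = 32 + 8 + 20 + 14    # ICMP data + ICMP hdr + IP hdr + Ethernet hdr

pings_per_s = clients / interval_s        # 2000
bytes_per_s = pings_per_s * frame_bytes   # 148,000
bps = bytes_per_s * 8                     # 1,184,000
print(f"{bps / 1e6:.2f} Mbps, {bps / 100e6:.1%} of a 100 Mbps LAN")
```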
If this is a LAN environment, I'd say this is basically no load at all; if it's going across a T1, then it's an immense amount of load. So you should run the same calculation for whichever other network links may be a bottleneck.
Lastly, if you're not using ICMP pings to check the host but have an application-level ping, you will have all the overhead of whatever protocol you are using, the ping will need to go all the way up the protocol stack, and your application needs to respond. Again, this could be a very minimal load or it could be immense, depending on the implementation details and the network speed. If the host is idle, I doubt this is a problem for you.
Yes, they can. A ping request does not put much load on the CPU, but it certainly takes up bandwidth and a nominal amount of processing.
If you want to monitor this, you might use either tcpdump or wireshark, or perhaps set up a firewall rule and monitor the number of packets it matches.
The other problem apart from bandwidth is the CPU. If a ping is directed up to the CPU for processing, thousands of these can cause a load on any CPU. It's worth monitoring - but as you said yours is almost idle so it's probably going to be able to cope. Worth keeping in mind though.
Depending on the clients, ping packets can be different sizes: their payload could be just "aaaaaaaaa", but some may be "thequickbrownfoxjumpedoverthelazydog", which obviously adds further bandwidth requirements.