JVM Runtime.availableProcessors() returns 2 when it should be 4 - amazon-eks

I'm running openjdk11 on alpine linux in a container in an AWS EKS cluster.
The application determines the size of a threadpool based on the number of CPUs as returned by Runtime.getRuntime().availableProcessors()
This call is returning 2 processors even though the container shows that 4 CPUs are available:
# cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
processor : 2
processor : 3
Any idea why and how to solve the problem?
Update
Doing some more digging (prompted by some great questions from #gohm'c in the comments), I found a way to add some trace log prints to the JVM with -Xlog:os+container=trace
[0.001s][trace][os,container] CPU Shares is: 1536
[0.001s][trace][os,container] CPU Share count based on shares: 2
Now, I defined in resources.requests.cpu: "1500m".
I don't know why the slight discrepancy but I changed the value of the CPU request, and indeed the CPU Shares in the log trace changes accordingly.
I understand how the resources.limits.cpu value could affect the CPUs that the JVM sees. But why is the resources.requests.cpu value doing that! This seems like a bug to me? Any thoughts?

Related

concurrent testing of login functionality with 50 users/threads is not working

i have given the thread count = 50
rampup period =0
for 48 threaads it is getting passed , for 2 threads there is no failure captured in the selenium log files.
I am expecting concurrent login of 50 users with 0 rampup period , i am not able to find out the exact reason of failure . please suggest the fixes to handle this scenario.
Check jmeter.log file for any suspicious entries
Add View Results Tree listener to your test plan - it will allow you to inspect request and response details
50 real browsers might be too high for a single machin
as per WebDriver Sampler documentation
From experience, the number of browser (threads) that the reader creates should be limited by the following formula:
C = N + 1
where
C = Number of Cores of the host running the test
and N = Number of Browser (threads).
as per Firefox 62.0 system requirements
512MB of RAM / 2GB of RAM for the 64-bit version
So you will need a machine with 51 cores and 100 GB of RAM in order to ensure there will no be JMeter-side bottleneck. If your machine hardware specifications are lower - you will have to go for Remote Testing

How to change the default number of pods per node?

I'm playing around with AWS EKS and I've set up a cluster with a t3.small node. Somehow the number of allocatable pods is set to 8:
Allocatable:
cpu: 2
ephemeral-storage: 19316009748
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 1902056Ki
pods: 8
It seems to me that the resource request on the node is on the low side
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
310m (15%) 0 (0%) 140Mi (7%) 340Mi (18%)
and so I'd like to ask:
Is 8 a default value?
How can I change that?
Thanks.
The calculation of the max number of pods in a given node in AWS is defined by this formula:
Max Pods = (Maximum Network Interfaces for instance type) * (IPv4 Addresses per Interface) - 1
Since you used t3.small, the value comes down to 3 * 4 - 1 = 11. The number 11 also depends upon the compute resources you're associating with each pod.
The list of network interfaces are listed here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI

Why flink container vcore size is always 1

I am running flink on yarn(more precisely in AWS EMR yarn cluster).
I read flink document and source code that by default for each task manager container, flink will request the number of slot per task manager as the number of vcores when request resource from yarn.
And I also confirmed from the source code:
// Resource requirements for worker containers
int taskManagerSlots = taskManagerParameters.numSlots();
int vcores = config.getInteger(ConfigConstants.YARN_VCORES,
Math.max(taskManagerSlots, 1));
Resource capability = Resource.newInstance(containerMemorySizeMB,
vcores);
resourceManagerClient.addContainerRequest(
new AMRMClient.ContainerRequest(capability, null, null,
priority));
When I use -yn 1 -ys 3 to start flink, I assume yarn will allocate 3 vcores for the only task manager container, but when I checked the number of vcores for each container from yarn resource manager web ui, I always see the number of vcores is 1. I also see vcore to be 1 from yarn resource manager logs.
I debugged the flink source code to the line I pasted below, and I saw value of vcores is 3.
This is really confuse me, can anyone help to clarify for me, thanks.
An answer from Kien Truong
Hi,
You have to enable CPU scheduling in YARN, otherwise, it always shows that only 1 CPU is allocated for each container,
regardless of how many Flink try to allocate. So you should add (edit) the following property in capacity-scheduler.xml:
<property>
<name>yarn.scheduler.capacity.resource-calculator</name>
<!-- <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value> -->
<value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
ALso, taskManager memory is, for example, 1400MB, but Flink reserves some amount for off-heap memory, so the actual heap size is smaller.
This is controlled by 2 settings:
containerized.heap-cutoff-min: default 600MB
containerized.heap-cutoff-ratio: default 15% of TM's memory
That's why your TM's heap size is limitted to ~800MB (1400 - 600)
#yinhua.
Use the command to start a session:./bin/yarn-session.sh, you need add -s arg.
-s,--slots Number of slots per TaskManager
details:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/deployment/yarn_setup.html
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/cli.html#usage
I get the answer finally.
It's because yarn is use "DefaultResourceCalculator" allocation strategy, so only memory is counted for yarn RM, even if flink requested 3 vcores, but yarn simply ignore the cpu core number.

Wrong balance between Aerospike instances in cluster

I have an application with a high load for batch read operations. My Aerospike cluster (v 3.7.2) has 14 servers, each one with 7GB RAM and 2 CPUs in Google Cloud.
By looking at Google Cloud Monitoring Graphs, I noticed a very unbalanced load between servers: some servers have almost 100% CPU load, while others have less than 50% (image below). Even after hours of operation, the cluster unbalanced pattern doesn't change.
Is there any configuration that I could change to make this cluster more homogeneous? How to optimize node balancing?
Edit 1
All servers in the cluster have the same identical aerospike.conf file:
Aerospike database configuration file.
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
paxos-recovery-policy auto-reset-master
pidfile /var/run/aerospike/asd.pid
service-threads 32
transaction-queues 32
transaction-threads-per-queue 32
batch-index-threads 32
proto-fd-max 15000
batch-max-requests 200000
}
logging {
# Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
}
network {
service {
#address any
port 3000
}
heartbeat {
mode mesh
mesh-seed-address-port 10.240.0.6 3002
mesh-seed-address-port 10.240.0.5 3002
port 3002
interval 150
timeout 20
}
fabric {
port 3001
}
info {
port 3003
}
}
namespace test {
replication-factor 3
memory-size 5G
default-ttl 0 # 30 days, use 0 to never expire/evict.
ldt-enabled true
storage-engine device {
file /data/aerospike.dat
write-block-size 1M
filesize 180G
}
}
Edit 2:
$ asinfo
1 : node
BB90600F00A0142
2 : statistics
cluster_size=14;cluster_key=E3C3672DCDD7F51;cluster_integrity=true;objects=3739898;sub-records=0;total-bytes-disk=193273528320;used-bytes-disk=26018492544;free-pct-disk=86;total-bytes-memory=5368709120;used-bytes-memory=239353472;data-used-bytes-memory=0;index-used-bytes-memory=239353472;sindex-used-bytes-memory=0;free-pct-memory=95;stat_read_reqs=2881465329;stat_read_reqs_xdr=0;stat_read_success=2878457632;stat_read_errs_notfound=3007093;stat_read_errs_other=0;stat_write_reqs=551398;stat_write_reqs_xdr=0;stat_write_success=549522;stat_write_errs=90;stat_xdr_pipe_writes=0;stat_xdr_pipe_miss=0;stat_delete_success=4;stat_rw_timeout=1862;udf_read_reqs=0;udf_read_success=0;udf_read_errs_other=0;udf_write_reqs=0;udf_write_success=0;udf_write_err_others=0;udf_delete_reqs=0;udf_delete_success=0;udf_delete_err_others=0;udf_lua_errs=0;udf_scan_rec_reqs=0;udf_query_rec_reqs=0;udf_replica_writes=0;stat_proxy_reqs=7021;stat_proxy_reqs_xdr=0;stat_proxy_success=2121;stat_proxy_errs=4739;stat_ldt_proxy=0;stat_cluster_key_err_ack_dup_trans_reenqueue=607;stat_expired_objects=0;stat_evicted_objects=0;stat_deleted_set_objects=0;stat_evicted_objects_time=0;stat_zero_bin_records=0;stat_nsup_deletes_not_shipped=0;stat_compressed_pkts_received=0;err_tsvc_requests=110;err_tsvc_requests_timeout=0;err_out_of_space=0;err_duplicate_proxy_request=0;err_rw_request_not_found=17;err_rw_pending_limit=19;err_rw_cant_put_unique=0;geo_region_query_count=0;geo_region_query_cells=0;geo_region_query_points=0;geo_region_query_falsepos=0;fabric_msgs_sent=58002818;fabric_msgs_rcvd=57998870;paxos_principal=BB92B00F00A0142;migrate_msgs_sent=55749290;migrate_msgs_recv=55759692;migrate_progress_send=0;migrate_progress_recv=0;migrate_num_incoming_accepted=7228;migrate_num_incoming_refused=0;queue=0;transactions=101978550;reaped_fds=6;scans_active=0;basic_scans_succeeded=0;basic_scans_failed=0;aggr_scans_succeeded=0;aggr_scans_failed=0;udf_bg_scans_succeeded=0;udf_bg_scans_failed=0;batch_index_initiate=40457778;batch_index_queue=0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0;batch_index_complete=40456708;batch_index_timeout=1037;batch_index_errors=33;batch_index_unused_buffers=256;batch_index_huge_buffers=217168717;batch_index_created_buffers=217583519;batch_index_destroyed_buffers=217583263;batch_initiate=0;batch_queue=0;batch_tree_count=0;batch_timeout=0;batch_errors=0;info_queue=0;delete_queue=0;proxy_in_progress=0;proxy_initiate=7021;proxy_action=5519;proxy_retry=0;proxy_retry_q_full=0;proxy_unproxy=0;proxy_retry_same_dest=0;proxy_retry_new_dest=0;write_master=551089;write_prole=1055431;read_dup_prole=14232;rw_err_dup_internal=0;rw_err_dup_cluster_key=1814;rw_err_dup_send=0;rw_err_write_internal=0;rw_err_write_cluster_key=0;rw_err_write_send=0;rw_err_ack_internal=0;rw_err_ack_nomatch=1767;rw_err_ack_badnode=0;client_connections=366;waiting_transactions=0;tree_count=0;record_refs=3739898;record_locks=0;migrate_tx_objs=0;migrate_rx_objs=0;ongoing_write_reqs=0;err_storage_queue_full=0;partition_actual=296;partition_replica=572;partition_desync=0;partition_absent=3228;partition_zombie=0;partition_object_count=3739898;partition_ref_count=4096;system_free_mem_pct=61;sindex_ucgarbage_found=0;sindex_gc_locktimedout=0;sindex_gc_inactivity_dur=0;sindex_gc_activity_dur=0;sindex_gc_list_creation_time=0;sindex_gc_list_deletion_time=0;sindex_gc_objects_validated=0;sindex_gc_garbage_found=0;sindex_gc_garbage_cleaned=0;system_swapping=false;err_replica_null_node=0;err_replica_non_null_node=0;err_sync_copy_null_master=0;storage_defrag_corrupt_record=0;err_write_fail_prole_unknown=0;err_write_fail_prole_generation=0;err_write_fail_unknown=0;err_write_fail_key_exists=0;err_write_fail_generation=0;err_write_fail_generation_xdr=0;err_write_fail_bin_exists=0;err_write_fail_parameter=0;err_write_fail_incompatible_type=0;err_write_fail_noxdr=0;err_write_fail_prole_delete=0;err_write_fail_not_found=0;err_write_fail_key_mismatch=0;err_write_fail_record_too_big=90;err_write_fail_bin_name=0;err_write_fail_bin_not_found=0;err_write_fail_forbidden=0;stat_duplicate_operation=53184;uptime=1001388;stat_write_errs_notfound=0;stat_write_errs_other=90;heartbeat_received_self=0;heartbeat_received_foreign=145137042;query_reqs=0;query_success=0;query_fail=0;query_abort=0;query_avg_rec_count=0;query_short_running=0;query_long_running=0;query_short_queue_full=0;query_long_queue_full=0;query_short_reqs=0;query_long_reqs=0;query_agg=0;query_agg_success=0;query_agg_err=0;query_agg_abort=0;query_agg_avg_rec_count=0;query_lookups=0;query_lookup_success=0;query_lookup_err=0;query_lookup_abort=0;query_lookup_avg_rec_count=0
3 : features
cdt-list;pipelining;geo;float;batch-index;replicas-all;replicas-master;replicas-prole;udf
4 : cluster-generation
61
5 : partition-generation
11811
6 : edition
Aerospike Community Edition
7 : version
Aerospike Community Edition build 3.7.2
8 : build
3.7.2
9 : services
10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.41:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.23:3000
10 : services-alumni
10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.23:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.41:3000
I have a few comments about your configuration. First, transaction-threads-per-queue should be set to 3 or 4 (don't set it to the number of cores).
The second has to do with your batch-read tuning. You're using the (default) batch-index protocol, and the config params you'll need to tune for batch-read performance are:
You have batch-max-requests set very high. This is probably affecting both your CPU load and your memory consumption. It's enough that there's a slight imbalance in the number of keys you're accessing per-node, and that will reflect in the graphs you've shown. At least, this is possibly the issue. It's better that you iterate over smaller batches than try to fetch 200K records per-node at a time.
batch-index-threads – by default its value is 4, and you set it to 32 (of a max of 64). You should do this incrementally by running the same test and benchmarking the performance. On each iteration adjust higher, then down if it's decreased in performance. For example: test with 32, +8 = 40 , +8 = 48, -4 = 44. There's no easy rule-of-thumb for the setting, you'll need to tune through iterations on the hardware you'll be using, and monitor the performance.
batch-max-buffer-per-queue – this is more directly linked to the number of concurrent batch-read operations the node can support. Each batch-read request will consume at least one buffer (more if the data cannot fit in 128K). If you do not have enough of these allocated to support the number of concurrent batch-read requests you will get exceptions with error code 152 BATCH_QUEUES_FULL . Track and log such events clearly, because it means you need to raise this value. Note that this is the number of buffers per-queue. Each batch response worker thread has its own queue, so you'll have batch-index-threads x batch-max-buffer-per-queue buffers, each taking 128K of RAM. The batch-max-unused-buffers caps the memory usage of all these buffers combined, destroying unused buffers until their number is reduced. There's an overhead to allocating and destroying these buffers, so you do not want to set it too low compared to the total. Your current cost is 32 x 256 x 128KB = 1GB.
Finally, you're storing your data on a filesystem. That's fine for development instances, but not recommended for production. In GCE you can provision either a SATA SSD or an NVMe SSD for your data storage, and those should be initialized, and used as block devices. Take a look at the GCE recommendations for more details. I suspect you have warnings in your log about the device not keeping up.
It's likely that one of your nodes is an outlier with regards to the number of partitions it has (and therefore number of objects). You can confirm it with asadm -e 'asinfo -v "objects"'. If that's the case, you can terminate that node, and bring up a new one. This will force the partitions to be redistributed. This does trigger a migration, which takes quite longer in the CE server than in the EE one.
For anyone interested, Aerospike Enterpirse 4.3 introduced 'uniform-balance' which homogeneously balances data partitions. Read more here: https://www.aerospike.com/blog/aerospike-4-3-all-flash-uniform-balance/

Aerospike cluster not clean available blocks

we use aerospike in our projects and caught strange problem.
We have a 3 node cluster and after some node restarting it stop working.
So, we make test to explain our problem
We make test cluster. 3 node, replication count = 2
Here is our namespace config
namespace test{
replication-factor 2
memory-size 100M
high-water-memory-pct 90
high-water-disk-pct 90
stop-writes-pct 95
single-bin true
default-ttl 0
storage-engine device {
cold-start-empty true
file /tmp/test.dat
write-block-size 1M
}
We write 100Mb test data after that we have that situation
available pct equal about 66% and Disk Usage about 34%
All good :slight_smile:
But we stopped one node. After migration we see that available pct = 49% and disk usage 50%
Return node to cluster and after migration we see that disk usage became previous about 32%, but available pct on old nodes stay 49%
Stop node one more time
available pct = 31%
Repeat one more time we get that situation
available pct = 0%
Our cluster crashed, Clients get AerospikeException: Error Code 8: Server memory error
So how we can clean available pct?
If your defrag-q is empty (and you can see whether it is from grepping the logs) then the issue is likely to be that your namespace is smaller than your post-write-queue. Blocks on the post-write-queue are not eligible for defragmentation and so you would see avail-pct trending down with no defragmentation to reclaim the space. By default the post-write-queue is 256 blocks and so in your case that would equate to 256Mb. If your namespace is smaller than that you will see avail-pct continue to drop until you hit stop-writes. You can reduce the size of the post-write-queue dynamically (i.e. no restart needed) using the following command, here I suggest 8 blocks:
asinfo -v 'set-config:context=namespace;id=<NAMESPACE>;post-write-queue=8'
If you are happy with this value you should amend your aerospike.conf to include it so that it persists after a node restart.