concurrent testing of login functionality with 50 users/threads is not working - testing

i have given the thread count = 50
rampup period =0
for 48 threaads it is getting passed , for 2 threads there is no failure captured in the selenium log files.
I am expecting concurrent login of 50 users with 0 rampup period , i am not able to find out the exact reason of failure . please suggest the fixes to handle this scenario.

Check jmeter.log file for any suspicious entries
Add View Results Tree listener to your test plan - it will allow you to inspect request and response details
50 real browsers might be too high for a single machin
as per WebDriver Sampler documentation
From experience, the number of browser (threads) that the reader creates should be limited by the following formula:
C = N + 1
where
C = Number of Cores of the host running the test
and N = Number of Browser (threads).
as per Firefox 62.0 system requirements
512MB of RAM / 2GB of RAM for the 64-bit version
So you will need a machine with 51 cores and 100 GB of RAM in order to ensure there will no be JMeter-side bottleneck. If your machine hardware specifications are lower - you will have to go for Remote Testing

Related

Catchpoint pause vs. waitForNoRequest - What's the difference?

I have a test that was alerting because it was taking extra time for an asset to load. We changed from waitForNoRequest to a pause (at Catchpoint's suggestion). That did not seem to have the expected effect of waiting for things to load. We increased the pause from 3000 to 12000 and that helped to allow the page to load and stop the alert. We noticed some more alerts, so I tried to increase the pause to something like 45000 and it would not allow me to pause for that long.
So the main question here is - what functionality does both of these different features provide? What do I gain by pausing instead of waiting, if anything?
Here's the test, data changed to protect company specific info. Step 3 is where we had some failures and we switched between pause and wait.
// Step - 1
open("https://website.com/")
waitForNoRequest("2000")
click("//*[#id=\"userid\"]")
type("//*[#id=\"userid\"]", "${username}")
setStepName("Step1-Login-")
// Step - 2
clickMouseAndWait("//*[#id=\"continue\"]")
waitForVisible("//*[#id=\"challenge-password\"]")
click("//*[#id=\"challenge-password\"]")
type("//*[#id=\"challenge-password\"]", "${password}")
setStepName("Step2-Login-creds")
// Step - 3
clickMouseAndWait("//*[#id=\"signIn\"]")
setStepName("Step3-dashboard")
waitForTitle("Dashboard")
waitForNoRequest("3000")
click("//*[#id=\"account-header-wrapper\"]")
waitForVisible("//*[#id=\"logout-link\"]")
click("//*[#id=\"logout-link\"]")
// Step - 4
clickAndWait("//*[text()=\"Sign Out\"]")
waitForTitle("Login - ")
verifyTextPresent("You have been logged out.")
setStepName("Step5-Logout")
Rachana here, I’m a member of the Technical Service Team here at Catchpoint, I’ll be happy to answer your questions.
Please find the differences below between waitForNoRequest and Pause commands:
Pause
Purpose: This command pauses the script execution for a specified amount of time, whether there are HTTP/s requests downloading or not. Time value is provided in milliseconds, it can range between 100 to 30,000 ms.
Explanation: This command is used when the agent needs to wait for a set amount of time and this is not impacted by the way the requests are loaded before proceeding to the next step or command. Only a parameter is required for this action.
WaitForNoRequest
Purpose: This commands waits for a specified amount of time, when there was no HTTP/s requests downloading. The wait time parameter can range between 1,000 to 5,000 ms.
Explanation: The only parameter for this action is a wait time. The agent will wait for that specified amount of time before moving onto the next step/command. Which will, in return, allow necessary requests more time to load after document complete.
For instance when you add waitforNoRequest(5000), initially agent waits 5000 ms after doc complete for any network activity. During that period if there is any network activity, then the agent waits another 5000 ms for the next network activity to end and the process goes on until no other request loads within the specified timeframe(5000 ms).
A pause command with 12000 ms, gives exactly 12 seconds to load the page. After 12 seconds the script execution will continue to next command no matter the page is loaded or not.
Since waitForNoRequest has a max time value of 5000 ms, you can tell the agent to wait for a gap of 5 seconds when there is no network activity. In this case, the page did not have any network activity for 3 seconds and hence proceeded to the next action. The page was not loaded completely and the script failed.
I tried to increase the pause to something like 45000 and it would not allow me to pause for that long.
We allow a maximum of 30 seconds pause time hence 45 seconds will not work.
Please reach out to our support team and we’ll be glad to connect you with our scripting SMEs and help you with any scripting needs you might have.

How to speed up selenium tests on AWS device farm?

I'm using Python for testing on AWS device farm. It seems that starting a selenium takes very very long. This is the code I use:
from time import time
from boto3 import client
from selenium import webdriver
def main():
start = time()
device_farm_client = client("devicefarm", region_name='us-west-2')
test_grid_url_response = device_farm_client.create_test_grid_url(
expiresInSeconds=666,
projectArn="arn:aws:devicefarm:us-west-2:..."
)
driver = webdriver.Remote(
command_executor=test_grid_url_response['url'],
desired_capabilities=webdriver.DesiredCapabilities.CHROME,
)
driver.get('https://api.ipify.org')
print(f"Your IP is: {driver.find_element_by_tag_name('pre').text}")
driver.quit()
print(f"took: {time() - start:.2f}")
if __name__ == '__main__':
main()
Output:
Your IP is: 100.10.10.111
took: 99.89s
Using existing selenium-hub infrastructure the IP is obtained in less than 2 seconds!
Is there any way how to reduce the time radically?
To reduce the overall execution time for complete test suite execution take advantage of the 50 concurrent sessions given you by default at no cost. Check this link. For eg:
Lets assume following details
one test Suite has 200 Selenium test cases
each test case takes around 10 seconds to execute
One AWS Device Farm Selenium Session takes around 60 seconds to start
then I will divide my 200 test cases into 50 concurrent sessions by running concurrent batches of 4 test cases per session.
Total Execution Time = (60 seconds to start each session + 10 seconds to start all 50 concurrent sessions with rate of 5 sessions per second + 4*10 seconds to execute the test cases in each session) = 60+10+40 = 110 seconds to finish complete test suite execution
WHEREAS
If you are existing selenium-hub infrastructure and lets say following details are assumed
200 Selenium test cases to execute
2 seconds to start a session
assume at max you can run 10 concurrent sessions
Total Execution Time = 2 seconds to start each session + 20*10 seconds to execute the test cases in each session = 200+2 = 202 seconds to finish complete test suite execution

jmeter and apachetop - why I see different values?

Probably explanation is simple - but I couldn't find answer to my question:
I am running jmeter test from one VM (worker) to another (target). On worker I have jmeter with 100 threads (100 users). On target I have API that runs on Apache. When I run "apachetop -f access_log" on target, I see only about 7 req/s.
Can someone explain me, why I don't see 100 req/s on target?
In test result in jmeter I see always 200 OK, so all request are hitting the target, and moreover target always responds. So I am not dropping any requests here. Network bandwidth between machines is 1G. What I am missing here?
Thanks,
Daddy
100 users doesn't necessarily mean 100 requests per second, even more, it is highly unlikely.
According to JMeter glossary:
Elapsed time. JMeter measures the elapsed time from just before sending the request to just after the last response has been received. JMeter does not include the time needed to render the response, nor does JMeter process any client code, for example Javascript.
Roughly, if JMeter is able to get response from server in 1 second - you will get 100 requests/second. If response time will be 2 seconds - throughput will be 50 requests/second, etc, response time 4 seconds - 25 requests/second, etc.
Also JMeter configuration matters. If you don't provide enough loops you may run into a situation where some threads already finished and some are not even started. See JMeter Test Results: Why the Actual Users Number is Lower than Expected article for more detailed explanation.
Your target load = 100 threads ( you are assuming it should generate 100 req/sec as per your plan)
Your actual load = 7 req / sec = 7*3600 / hour = 25200
Per thread throughput = 25200 / 100 threads = 252 iterations / thread / hour
Per transaction time = 3600 / 252 = 14.2 secs
This means, JMeter should be actually sending each request every 14 secs per thread. i.e., 100 requests for every 14.2 secs.
Now, analyze your JMeter summary report for the transaction timers to find out where the remaining 13.2 secs are being spent.
Possible issues are
1. High DNS resolution time (DNS issue)
2. High connection setup time (indicates load balancer issues)
3. High Request send time (indicates n/w or firewall throttling issues)
4. High request receive time (same as #3)
Now, the time that you see in Apache logs are mostly visible to JMeter as time to first byte time. I am not sure about the machine that you are running your testing. If your worker can support curl, use Curl to find the components for a single request.
echo 'request payload for POST'
| curl -X POST -H 'User-Agent: myBrowser' -H 'Content-Type: application/json' -d #- -s -w '\nDNS time:\t%{time_namelookup}\nTCP Connect time:\t%{time_connect}\nAppCon Protocol time:\t%{time_appconnect}\nRedirect time:\t%{time_redirect}\nPreXfer time:\t%{time_pretransfer}\nStartXfer time:\t%{time_starttransfer}\n\nTotal time:\t%{time_total}\n' http://mytest.test.com
If the above output indicates no such issues then the time must have been spent within JMeter. You should tune your JMeter implementation by using various options like beanshell / JSR223 etc.

Wrong balance between Aerospike instances in cluster

I have an application with a high load for batch read operations. My Aerospike cluster (v 3.7.2) has 14 servers, each one with 7GB RAM and 2 CPUs in Google Cloud.
By looking at Google Cloud Monitoring Graphs, I noticed a very unbalanced load between servers: some servers have almost 100% CPU load, while others have less than 50% (image below). Even after hours of operation, the cluster unbalanced pattern doesn't change.
Is there any configuration that I could change to make this cluster more homogeneous? How to optimize node balancing?
Edit 1
All servers in the cluster have the same identical aerospike.conf file:
Aerospike database configuration file.
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
paxos-recovery-policy auto-reset-master
pidfile /var/run/aerospike/asd.pid
service-threads 32
transaction-queues 32
transaction-threads-per-queue 32
batch-index-threads 32
proto-fd-max 15000
batch-max-requests 200000
}
logging {
# Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
}
network {
service {
#address any
port 3000
}
heartbeat {
mode mesh
mesh-seed-address-port 10.240.0.6 3002
mesh-seed-address-port 10.240.0.5 3002
port 3002
interval 150
timeout 20
}
fabric {
port 3001
}
info {
port 3003
}
}
namespace test {
replication-factor 3
memory-size 5G
default-ttl 0 # 30 days, use 0 to never expire/evict.
ldt-enabled true
storage-engine device {
file /data/aerospike.dat
write-block-size 1M
filesize 180G
}
}
Edit 2:
$ asinfo
1 : node
BB90600F00A0142
2 : statistics
cluster_size=14;cluster_key=E3C3672DCDD7F51;cluster_integrity=true;objects=3739898;sub-records=0;total-bytes-disk=193273528320;used-bytes-disk=26018492544;free-pct-disk=86;total-bytes-memory=5368709120;used-bytes-memory=239353472;data-used-bytes-memory=0;index-used-bytes-memory=239353472;sindex-used-bytes-memory=0;free-pct-memory=95;stat_read_reqs=2881465329;stat_read_reqs_xdr=0;stat_read_success=2878457632;stat_read_errs_notfound=3007093;stat_read_errs_other=0;stat_write_reqs=551398;stat_write_reqs_xdr=0;stat_write_success=549522;stat_write_errs=90;stat_xdr_pipe_writes=0;stat_xdr_pipe_miss=0;stat_delete_success=4;stat_rw_timeout=1862;udf_read_reqs=0;udf_read_success=0;udf_read_errs_other=0;udf_write_reqs=0;udf_write_success=0;udf_write_err_others=0;udf_delete_reqs=0;udf_delete_success=0;udf_delete_err_others=0;udf_lua_errs=0;udf_scan_rec_reqs=0;udf_query_rec_reqs=0;udf_replica_writes=0;stat_proxy_reqs=7021;stat_proxy_reqs_xdr=0;stat_proxy_success=2121;stat_proxy_errs=4739;stat_ldt_proxy=0;stat_cluster_key_err_ack_dup_trans_reenqueue=607;stat_expired_objects=0;stat_evicted_objects=0;stat_deleted_set_objects=0;stat_evicted_objects_time=0;stat_zero_bin_records=0;stat_nsup_deletes_not_shipped=0;stat_compressed_pkts_received=0;err_tsvc_requests=110;err_tsvc_requests_timeout=0;err_out_of_space=0;err_duplicate_proxy_request=0;err_rw_request_not_found=17;err_rw_pending_limit=19;err_rw_cant_put_unique=0;geo_region_query_count=0;geo_region_query_cells=0;geo_region_query_points=0;geo_region_query_falsepos=0;fabric_msgs_sent=58002818;fabric_msgs_rcvd=57998870;paxos_principal=BB92B00F00A0142;migrate_msgs_sent=55749290;migrate_msgs_recv=55759692;migrate_progress_send=0;migrate_progress_recv=0;migrate_num_incoming_accepted=7228;migrate_num_incoming_refused=0;queue=0;transactions=101978550;reaped_fds=6;scans_active=0;basic_scans_succeeded=0;basic_scans_failed=0;aggr_scans_succeeded=0;aggr_scans_failed=0;udf_bg_scans_succeeded=0;udf_bg_scans_failed=0;batch_index_initiate=40457778;batch_index_queue=0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0;batch_index_complete=40456708;batch_index_timeout=1037;batch_index_errors=33;batch_index_unused_buffers=256;batch_index_huge_buffers=217168717;batch_index_created_buffers=217583519;batch_index_destroyed_buffers=217583263;batch_initiate=0;batch_queue=0;batch_tree_count=0;batch_timeout=0;batch_errors=0;info_queue=0;delete_queue=0;proxy_in_progress=0;proxy_initiate=7021;proxy_action=5519;proxy_retry=0;proxy_retry_q_full=0;proxy_unproxy=0;proxy_retry_same_dest=0;proxy_retry_new_dest=0;write_master=551089;write_prole=1055431;read_dup_prole=14232;rw_err_dup_internal=0;rw_err_dup_cluster_key=1814;rw_err_dup_send=0;rw_err_write_internal=0;rw_err_write_cluster_key=0;rw_err_write_send=0;rw_err_ack_internal=0;rw_err_ack_nomatch=1767;rw_err_ack_badnode=0;client_connections=366;waiting_transactions=0;tree_count=0;record_refs=3739898;record_locks=0;migrate_tx_objs=0;migrate_rx_objs=0;ongoing_write_reqs=0;err_storage_queue_full=0;partition_actual=296;partition_replica=572;partition_desync=0;partition_absent=3228;partition_zombie=0;partition_object_count=3739898;partition_ref_count=4096;system_free_mem_pct=61;sindex_ucgarbage_found=0;sindex_gc_locktimedout=0;sindex_gc_inactivity_dur=0;sindex_gc_activity_dur=0;sindex_gc_list_creation_time=0;sindex_gc_list_deletion_time=0;sindex_gc_objects_validated=0;sindex_gc_garbage_found=0;sindex_gc_garbage_cleaned=0;system_swapping=false;err_replica_null_node=0;err_replica_non_null_node=0;err_sync_copy_null_master=0;storage_defrag_corrupt_record=0;err_write_fail_prole_unknown=0;err_write_fail_prole_generation=0;err_write_fail_unknown=0;err_write_fail_key_exists=0;err_write_fail_generation=0;err_write_fail_generation_xdr=0;err_write_fail_bin_exists=0;err_write_fail_parameter=0;err_write_fail_incompatible_type=0;err_write_fail_noxdr=0;err_write_fail_prole_delete=0;err_write_fail_not_found=0;err_write_fail_key_mismatch=0;err_write_fail_record_too_big=90;err_write_fail_bin_name=0;err_write_fail_bin_not_found=0;err_write_fail_forbidden=0;stat_duplicate_operation=53184;uptime=1001388;stat_write_errs_notfound=0;stat_write_errs_other=90;heartbeat_received_self=0;heartbeat_received_foreign=145137042;query_reqs=0;query_success=0;query_fail=0;query_abort=0;query_avg_rec_count=0;query_short_running=0;query_long_running=0;query_short_queue_full=0;query_long_queue_full=0;query_short_reqs=0;query_long_reqs=0;query_agg=0;query_agg_success=0;query_agg_err=0;query_agg_abort=0;query_agg_avg_rec_count=0;query_lookups=0;query_lookup_success=0;query_lookup_err=0;query_lookup_abort=0;query_lookup_avg_rec_count=0
3 : features
cdt-list;pipelining;geo;float;batch-index;replicas-all;replicas-master;replicas-prole;udf
4 : cluster-generation
61
5 : partition-generation
11811
6 : edition
Aerospike Community Edition
7 : version
Aerospike Community Edition build 3.7.2
8 : build
3.7.2
9 : services
10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.41:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.23:3000
10 : services-alumni
10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.23:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.41:3000
I have a few comments about your configuration. First, transaction-threads-per-queue should be set to 3 or 4 (don't set it to the number of cores).
The second has to do with your batch-read tuning. You're using the (default) batch-index protocol, and the config params you'll need to tune for batch-read performance are:
You have batch-max-requests set very high. This is probably affecting both your CPU load and your memory consumption. It's enough that there's a slight imbalance in the number of keys you're accessing per-node, and that will reflect in the graphs you've shown. At least, this is possibly the issue. It's better that you iterate over smaller batches than try to fetch 200K records per-node at a time.
batch-index-threads – by default its value is 4, and you set it to 32 (of a max of 64). You should do this incrementally by running the same test and benchmarking the performance. On each iteration adjust higher, then down if it's decreased in performance. For example: test with 32, +8 = 40 , +8 = 48, -4 = 44. There's no easy rule-of-thumb for the setting, you'll need to tune through iterations on the hardware you'll be using, and monitor the performance.
batch-max-buffer-per-queue – this is more directly linked to the number of concurrent batch-read operations the node can support. Each batch-read request will consume at least one buffer (more if the data cannot fit in 128K). If you do not have enough of these allocated to support the number of concurrent batch-read requests you will get exceptions with error code 152 BATCH_QUEUES_FULL . Track and log such events clearly, because it means you need to raise this value. Note that this is the number of buffers per-queue. Each batch response worker thread has its own queue, so you'll have batch-index-threads x batch-max-buffer-per-queue buffers, each taking 128K of RAM. The batch-max-unused-buffers caps the memory usage of all these buffers combined, destroying unused buffers until their number is reduced. There's an overhead to allocating and destroying these buffers, so you do not want to set it too low compared to the total. Your current cost is 32 x 256 x 128KB = 1GB.
Finally, you're storing your data on a filesystem. That's fine for development instances, but not recommended for production. In GCE you can provision either a SATA SSD or an NVMe SSD for your data storage, and those should be initialized, and used as block devices. Take a look at the GCE recommendations for more details. I suspect you have warnings in your log about the device not keeping up.
It's likely that one of your nodes is an outlier with regards to the number of partitions it has (and therefore number of objects). You can confirm it with asadm -e 'asinfo -v "objects"'. If that's the case, you can terminate that node, and bring up a new one. This will force the partitions to be redistributed. This does trigger a migration, which takes quite longer in the CE server than in the EE one.
For anyone interested, Aerospike Enterpirse 4.3 introduced 'uniform-balance' which homogeneously balances data partitions. Read more here: https://www.aerospike.com/blog/aerospike-4-3-all-flash-uniform-balance/

Aerospike cluster not clean available blocks

we use aerospike in our projects and caught strange problem.
We have a 3 node cluster and after some node restarting it stop working.
So, we make test to explain our problem
We make test cluster. 3 node, replication count = 2
Here is our namespace config
namespace test{
replication-factor 2
memory-size 100M
high-water-memory-pct 90
high-water-disk-pct 90
stop-writes-pct 95
single-bin true
default-ttl 0
storage-engine device {
cold-start-empty true
file /tmp/test.dat
write-block-size 1M
}
We write 100Mb test data after that we have that situation
available pct equal about 66% and Disk Usage about 34%
All good :slight_smile:
But we stopped one node. After migration we see that available pct = 49% and disk usage 50%
Return node to cluster and after migration we see that disk usage became previous about 32%, but available pct on old nodes stay 49%
Stop node one more time
available pct = 31%
Repeat one more time we get that situation
available pct = 0%
Our cluster crashed, Clients get AerospikeException: Error Code 8: Server memory error
So how we can clean available pct?
If your defrag-q is empty (and you can see whether it is from grepping the logs) then the issue is likely to be that your namespace is smaller than your post-write-queue. Blocks on the post-write-queue are not eligible for defragmentation and so you would see avail-pct trending down with no defragmentation to reclaim the space. By default the post-write-queue is 256 blocks and so in your case that would equate to 256Mb. If your namespace is smaller than that you will see avail-pct continue to drop until you hit stop-writes. You can reduce the size of the post-write-queue dynamically (i.e. no restart needed) using the following command, here I suggest 8 blocks:
asinfo -v 'set-config:context=namespace;id=<NAMESPACE>;post-write-queue=8'
If you are happy with this value you should amend your aerospike.conf to include it so that it persists after a node restart.