JMeter database load testing: randomizing a query predicate

I need to conduct a series of database performance tests using JMeter.
The database has ~32 million accounts and ~15 billion transactions.
I have configured a JDBC Connection Configuration and a JDBC Request with a single SELECT statement and a hardcoded vAccountNum, and this works fine.
SELECT col1,col2,col3,col4,col5 from transactions where account=vAccountNum
I need to measure how many result sets can be completed in five minutes for one session, then add sessions and tune until server resources are exhausted.
What is the best way to randomize vAccountNum so that the queried accounts are evenly distributed?

The options depend on the type of vAccountNum:
Various JMeter functions, such as:
__Random function - generates a random number within a defined range
__threadNum function - returns the current thread's number (1 for the first thread, 2 for the second, etc.)
__counter function - a simple counter that is incremented by 1 each time it is called
CSV Data Set Config - reads pre-defined vAccountNum values from a CSV file. In that case, make sure you provide enough account numbers so that you aren't hammering the server with the same query, which is likely to be served from cache.
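For the CSV Data Set Config route, the input file can be generated ahead of the test. The following is a minimal sketch, assuming account numbers are contiguous integers from 1 to roughly 32 million (the question does not say so); the function names and the file name are hypothetical.

```python
import csv
import random


def sample_account_numbers(total_accounts, sample_size, seed=None):
    """Draw distinct account numbers uniformly at random from 1..total_accounts.

    Assumes account numbers are contiguous integers, which may not hold
    for the real table.
    """
    rng = random.Random(seed)
    # range() is lazy, so sampling from millions of candidates
    # does not build a list of all of them first.
    return rng.sample(range(1, total_accounts + 1), sample_size)


def write_account_csv(path, account_numbers):
    """Write one account number per line for JMeter to read."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for number in account_numbers:
            writer.writerow([number])
```

Calling sample_account_numbers(32_000_000, 100_000) and passing the result to write_account_csv("accounts.csv", ...) would produce a file that CSV Data Set Config can map to the variable vAccountNum, referenced in the query as ${vAccountNum}. If the real account numbers have gaps, exporting them from the database and shuffling that list is the safer variant.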

Related

Listing BigQuery Tables in `huge/big` Datasets - 30K-40K+ tables

The task is to programmatically list all the tables in a given dataset that holds more than 30-40K tables.
The first option we explored was the tables.list API (as we always do for normal datasets with a reasonable number of tables).
It looks like this API returns at most 1000 entries per call (even if we set maxResults to a bigger value).
To get the next 1000 we need to wait for the response to the previous request, extract the pageToken, repeat the call, and so on.
For datasets with 30K-40K+ tables this can take 10-15 seconds or more (under good weather).
So the timing is a problem that we want to address!
In the above-mentioned calls we get back only nextPageToken and tables/tableReference/tableId, so the size of each response is extremely small!
Question:
Is there a way to increase maxResults so that we get all tables in one (or very few) call(s), assuming that will be much faster than making 30-40 calls?
The workaround we have tried so far is to use __TABLES_SUMMARY__ with the jobs.insert or jobs.query API.
This way the whole result is returned within seconds, but in our particular case using the BigQuery jobs API is not an option for multiple reasons. We want to be able to use the list API.
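The sequential nature of the token-based loop described above can be sketched as follows. This is not the BigQuery client API: fetch_page is a hypothetical stand-in for one tables.list round trip, and make_fake_fetcher is an in-memory substitute used only to count calls.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A page fetcher takes the previous page token (None for the first page)
# and returns (table_ids, next_page_token). Hypothetical stand-in for one API call.
PageFetcher = Callable[[Optional[str]], Tuple[List[str], Optional[str]]]


def list_all_tables(fetch_page: PageFetcher) -> Iterator[str]:
    """Yield every table id by following page tokens until none is returned."""
    token: Optional[str] = None
    while True:
        # The next request cannot be issued until this one has returned its token.
        table_ids, token = fetch_page(token)
        yield from table_ids
        if token is None:
            break


def make_fake_fetcher(total_tables: int, page_size: int = 1000):
    """Build an in-memory substitute for the API plus a counter of round trips."""
    calls = {"count": 0}

    def fetch_page(token: Optional[str]):
        calls["count"] += 1
        start = int(token) if token is not None else 0
        end = min(start + page_size, total_tables)
        ids = [f"table_{i}" for i in range(start, end)]
        next_token = str(end) if end < total_tables else None
        return ids, next_token

    return fetch_page, calls
```

With 40,000 tables and a page size of 1000, the loop makes 40 round trips one after another, so the total time is roughly 40 times the per-call latency no matter how small each response is.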

Target Based commit point while updating into table

One of my mappings runs for a really long time (2 hours). In the session log I can see the statement "Time out based commit point", which accounts for most of the time, and the busy percentage for the SQL transformation is very high (that is where the time goes; I ran the SQL query manually in the DB and it works fine). Basically there is a Router which splits the records between insert and update, and the update stream is taking long. It has a SQL transformation, an Update Strategy and an Aggregator. I added a Sorter before the Aggregator, but no luck.
I also changed the commit interval, Line Sequential Buffer Length and Maximum Memory Allowed after checking some other blogs. Could you please help me with this?
If possible, try to avoid transformations that create a cache, because if the number of input records grows in future, the cache size will grow too and throughput will drop.
1) Aggregator: try to do the aggregation in the SQL override itself
2) Sorter: try to do the same in the SQL override itself
Generally the SQL transformation is slow for huge data loads, because for each input record an SQL session is invoked, a connection is established to the database and the row is fetched. For example, if there are 1 million records, 1 million SQL sessions are initiated in the backend and the database is called each time.
What is the SQL transformation doing? Is it just generating a surrogate key, or is it fetching a value from a table based on a value derived from the stream?
For fetching a value from a table based on a value derived from the stream:
Try to use a Lookup
For generating a surrogate key, use an Oracle sequence instead
Let me know if its purpose is anything other than that
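The difference between one query per input row and a single cached lookup can be illustrated outside Informatica. The following is a rough sketch using Python's built-in sqlite3 module; the customer table, its columns and the function names are invented for the illustration and do not come from the mapping in question.

```python
import sqlite3


def build_reference_db():
    """Create an in-memory table standing in for the lookup source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, region TEXT)")
    conn.executemany(
        "INSERT INTO customer (id, region) VALUES (?, ?)",
        [(i, f"region_{i % 5}") for i in range(1, 1001)],
    )
    return conn


def enrich_row_by_row(conn, input_ids):
    """One query per input row, the pattern described for the SQL transformation."""
    queries = 0
    result = []
    for customer_id in input_ids:
        row = conn.execute(
            "SELECT region FROM customer WHERE id = ?", (customer_id,)
        ).fetchone()
        queries += 1
        result.append(row[0] if row else None)
    return result, queries


def enrich_with_cache(conn, input_ids):
    """One query builds an in-memory cache; each row is then a dictionary lookup."""
    cache = dict(conn.execute("SELECT id, region FROM customer"))
    queries = 1
    return [cache.get(customer_id) for customer_id in input_ids], queries
```

Both functions return the same values, but the first issues as many queries as there are input rows while the second issues one. The cost of the second approach is the memory held by the cache, which is the trade-off mentioned at the start of this answer.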
Also do the checks below.
Sort the session log by thread and note the start and end times of the following:
1) Lookup cache creation (time between query issued --> first row returned --> cache creation completed)
2) Reader thread first-row return time
Regards,
Raj

SQL - When data are transferred

I need to get a large amount of data from a remote database. The idea is to do a sort of pagination, like this:
1 Select a first block of data
SELECT * FROM TABLE LIMIT 0,10000
2 Process that block
while(mysql_fetch_array()...){
//do something
}
3 Get the next block
and so on.
Assuming 10000 is an acceptable block size for my system, suppose I have 30000 records to get: I make 3 calls to the remote system.
But my question is: when a SELECT is executed, is the result set transmitted and stored locally, so that each fetch is local, or is the result set kept on the remote system, with records arriving one by one on each fetch? Because if the real scenario is the second, I am making not 3 calls but 30000, and that is not what I want.
I hope I explained it clearly. Thanks for the help.
bye
First, it's highly recommended to utilize MySQLi or PDO instead of the deprecated mysql_* functions
http://php.net/manual/en/mysqlinfo.api.choosing.php
By default with the mysql and mysqli extensions, the entire result set is loaded into PHP's memory when executing the query, but this can be changed to load results on demand as rows are retrieved if needed or desired.
mysql
mysql_query() buffers the entire result set in PHP's memory
mysql_unbuffered_query() only retrieves data from the database as rows are requested
mysqli
mysqli::query()
The $resultmode parameter determines behaviour.
The default value of MYSQLI_STORE_RESULT causes the entire result set to be transferred to PHP's memory, but using MYSQLI_USE_RESULT will cause the rows to be retrieved as requested.
PDO with the MySQL driver also buffers the entire result set by default when a query is executed through PDO::query() or PDO::prepare(); PDO::fetch() then reads rows from that buffer.
Setting the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY attribute to false makes the driver retrieve rows from the server as they are fetched.
To retrieve all data from the result set into a PHP array, you can use PDO::fetchAll().
It's probably best to stick with the default behaviour and benchmark any changes to determine if they actually have any positive results; the overhead of transferring results individually may be minor, and other factors may be more important in determining the optimal method.
You would be performing 3 calls, not 30,000. That's for sure.
Each batch of 10,000 results is produced on the server (by running each of the 3 queries). Your while loop iterates through a set of data that has already been returned by MySQL (that's why you don't have 30,000 queries).
That is assuming you would have something like this:
$res = mysql_query(...);
while ($row = mysql_fetch_array($res)) {
//do something with $row
}
Anything you do inside the while loop by making use of $row has to do with already-fetched data from your initial query.
Hope this answers your question.
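The call count can be checked with a small sketch. It uses Python's built-in sqlite3 module rather than PHP and MySQL, so it says nothing about network buffering; it only shows that three LIMIT/OFFSET queries cover 30,000 rows while the per-row loop runs over data already held locally. The items table and function names are hypothetical.

```python
import sqlite3


def build_items(total):
    """Create an in-memory table with `total` rows to page through."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO items (payload) VALUES (?)",
        [(f"row {i}",) for i in range(total)],
    )
    return conn


def process_in_blocks(conn, total_rows, block_size):
    """Read the table one block at a time; return (queries issued, rows processed)."""
    queries = 0
    rows_processed = 0
    for offset in range(0, total_rows, block_size):
        # ORDER BY keeps the blocks stable; without it the row order is not guaranteed.
        block = conn.execute(
            "SELECT id, payload FROM items ORDER BY id LIMIT ? OFFSET ?",
            (block_size, offset),
        ).fetchall()  # the whole block is now held locally
        queries += 1
        for _row in block:
            rows_processed += 1  # placeholder for the per-row work
    return queries, rows_processed
```

Here process_in_blocks(build_items(30000), 30000, 10000) issues 3 queries and visits 30,000 rows. The sketch assumes the total row count is known up front, as in the question; if it is not, one extra query is needed to discover that no rows remain.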
According to the documentation, all the data is fetched from the database first, and then you go through it.
From the page:
Returns an array of strings that corresponds to the fetched row, or FALSE if there are no more rows.
In addition, it seems this is deprecated, so you might want to use one of the alternatives suggested there.

how to get next 1000 records the fastest way

I'm using Azure Table Storage.
Let's say I have a partition in my table with 10,000 records, and I would like to get records number 1000 to 1999. Next time I would like to get records number 4000 to 4999, etc.
What is the fastest way of doing that?
All I have found so far are two options, which I don't like very much:
1. Run a query which returns all 10,000 records, and filter out what I want once I have them all.
2. Run a query which returns 1000 records at a time, and use a continuation token to get the next 1000 records.
Is it possible to get a continuation token without downloading all the corresponding records? It would be great if I could get continuation token 1, then get continuation token 2, and with CT2 get records 2000 to 2999.
Theoretically you should be able to use continuation tokens without downloading the actual data for the first 1000 records, by closing the connection after the first request. I mean closing it at the TCP level, before you have read all the data. Then open a new connection and use the continuation token there. Two plain WebRequests will not do it, since the HTTP implementation will likely use keep-alive, which means all your data is going to be read in the background even though you don't read it in your code. You can, however, configure your HTTP requests not to use keep-alive.
Another way, naturally, is to search on the RowKey if you know it, but I assume you don't know which row keys will be in each 1000-entity batch.
Lastly, I would ask why you have this problem in the first place, and what your access pattern is. If inserts are common and getting these records is rare, I wouldn't bother making it more efficient. If this is a paging problem, I would probably get all the data on the first request and cache it (in the cloud). If inserts are rare but you need to run this query often, I would consider giving each block of 1000 entities its own partition and rebalancing as needed (due to sorting) as entities are inserted.
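The last suggestion, one partition per 1000 entities, amounts to deriving the partition from the record's position. A minimal sketch of that arithmetic follows; the "bucket-NNNNNN" key format and the function names are made up for illustration, and the rebalancing on insert mentioned above is not handled.

```python
def partition_key_for(ordinal: int, bucket_size: int = 1000) -> str:
    """Map a zero-based record position to the partition that would hold it."""
    if ordinal < 0:
        raise ValueError("ordinal must be non-negative")
    # Zero-padding keeps the keys in numeric order when compared as strings.
    return f"bucket-{ordinal // bucket_size:06d}"


def partitions_for_range(first: int, last: int, bucket_size: int = 1000) -> list:
    """List the partition keys covering records first..last (inclusive)."""
    if first > last:
        raise ValueError("first must not exceed last")
    start = first // bucket_size
    end = last // bucket_size
    return [f"bucket-{bucket:06d}" for bucket in range(start, end + 1)]
```

Under this layout, records 4000 to 4999 all live in a single partition, so fetching them becomes one partition query with no continuation tokens to walk through first.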

SQL connection lifetime

I am working on an API to query a database server (Oracle in my case) to retrieve massive amounts of data. (This is actually a layer on top of JDBC.)
The API I created tries to avoid, as much as possible, loading all the queried information into memory. I mean that I prefer to iterate over the result set and process the returned rows one by one instead of loading every row into memory and processing them later.
But I am wondering if this is the best practice, since it has some issues:
The result set is kept open during the whole processing. If the processing takes as long as retrieving the data, my result set will be open twice as long.
Doing another query inside my processing loop means opening another result set while I am already using one; it may not be a good idea to open too many result sets simultaneously.
On the other hand, it has some advantages:
I never have more than one row of data in memory for a result set. Since my queries tend to return around 100k rows, it may be worth it.
Since my framework is heavily based on functional programming concepts, I never rely on multiple rows being in memory at the same time.
Starting the processing on the first rows returned while the database engine is still returning other rows is a great performance boost.
In response to Gandalf, I add some more information:
I will always have to process the entire result set
I am not doing any aggregation of rows
I am integrating with a master data management application and retrieving data in order to either validate them or export them using many different formats (to the ERP, to the web platform, etc.)
There is no universal answer. I personally implemented both solutions dozens of times.
This depends on what matters more to you: memory or network traffic.
If you have a fast network connection (LAN) and a poor client machine, then fetch data row by row from the server.
If you work over the Internet, then batch fetching will help you.
You can set the prefetch count or your database layer's properties and find a golden mean.
The rule of thumb is: fetch everything that you can keep without noticing it.
If you need a more detailed analysis, there are six factors involved:
Row generation response time / rate (how soon Oracle generates the first row / last row)
Row delivery response time / rate (how soon you can get the first row / last row)
Row processing response time / rate (how soon you can show the first row / last row)
One of them will be the bottleneck.
As a rule, rate and response time are antagonists.
With prefetching, you can control the row delivery response time and the row delivery rate: a higher prefetch count increases the rate but makes the first rows arrive later, while a lower prefetch count does the opposite.
Choose which one is more important to you.
You can also do the following: create separate threads for fetching and processing.
Select just enough rows to keep the user amused in low prefetch mode (which returns the first rows quickly), then switch into high prefetch mode.
It will fetch the rows in the background and you can process them in the background too, while the user browses over the first rows.
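The separate-threads idea can be sketched as a producer and a consumer sharing a bounded queue. This is an illustration in Python rather than JDBC: rows is any iterable standing in for a result set, batch_size plays the role of the prefetch count, and all names are hypothetical.

```python
import queue
import threading
from itertools import islice

_DONE = object()  # sentinel marking the end of the row stream


def fetch_in_background(rows, batch_size, out_queue):
    """Pull rows from the source in batches and hand each batch to the consumer."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        out_queue.put(batch)  # blocks while the queue is full, bounding memory use
    out_queue.put(_DONE)


def process_stream(rows, batch_size, process_row, max_pending_batches=4):
    """Process rows on the calling thread while another thread keeps fetching."""
    pending = queue.Queue(maxsize=max_pending_batches)
    fetcher = threading.Thread(
        target=fetch_in_background, args=(rows, batch_size, pending), daemon=True
    )
    fetcher.start()
    results = []
    while True:
        batch = pending.get()
        if batch is _DONE:
            break
        results.extend(process_row(row) for row in batch)
    fetcher.join()
    return results
```

For example, process_stream(range(10), 3, lambda r: r * 2) returns the doubled values in their original order. The bounded queue caps how many batches are held in memory at once, which is one way to keep the memory advantage of row-by-row processing while still overlapping fetching with processing.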