Apache Geode: Improve query performance when an ORDER BY clause is provided in the query

We are currently facing performance issues when an ORDER BY clause is included in the query.
Current specs:
We are running two Geode servers, each with a maximum heap size of 20 GB. Geode holds around 3.1 million records in total, and the table being queried has 1.48 million.
Query:
query --query="SELECT DISTINCT cashFlowId,upstreamSystem,upstreamSystemTxnDate,valueDate,amount,status FROM WHERE AND account IN SET ('XYZ','ABC') AND valueDate >= TO_DATE('20180320', 'yyyyMMdd') AND status = 'Booked' AND isActive = true AND category = 'Actual' ORDER BY amount DESC LIMIT 100"
The above query returns its output in 13-15 seconds, even after being run 2-3 times.
Actual Result Set: 666553
No of Records in the table: 1.49 million
What have we tried/observed so far?
We found that the index (type: range) is being picked correctly.
There is no improvement even after allocating more memory to the JVM.
We verified that the IN operator has no impact on query performance; we tried the same query using the OR operator.
On removing the ORDER BY clause, the query completes in 2 seconds, so we concluded that sorting takes most of the time.
Could you please offer guidance or share any information on improving the query performance?
Server Metrics:
Category | Metric | Value
--------- | --------------------- | ------------
cluster | totalHeapSize | 47135
cache | totalRegionEntryCount | 3100429

As Urizen said, check the number of GCs going on, but there is more to it. Here is the code, and it looks fairly tight: Geode Order By Comparator. There is another factor, related to the nature of distributed sorting, that has little to do with Geode as a product. Each node does its own ordering, but when the results come back from each node, they need to be merged with the results from the other nodes. In other words, given the set {2,4,3,1,6,5}, node 1 can sort {2,5,6} and node 2 can sort {1,3,4}, but the controlling node still has to do a merge for you to get {1,2,3,4,5,6}. I suspect some of that is going on as well. This has nothing to do with Geode per se; it applies to distributed ORDER BYs in general. In database performance optimization theory, the database is the worst place to do an ORDER BY.
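To make the merge step concrete, here is a minimal Python sketch of what a controlling node has to do with per-node results that are already sorted. It is only an illustration of the general technique, not Geode's actual implementation; the function name merge_sorted_partials and its parameters are made up for this example.

```python
import heapq
from itertools import islice


def merge_sorted_partials(partials, limit=None, descending=False):
    """Merge per-node result lists that are each already sorted.

    Every list in `partials` must already be in the requested order
    (ascending by default, descending if `descending` is True).
    """
    merged = heapq.merge(*partials, reverse=descending)
    if limit is not None:
        # Like ORDER BY ... LIMIT n: stop as soon as n rows are merged.
        merged = islice(merged, limit)
    return list(merged)


# The example from the answer: two nodes, each with a sorted partial result.
node1 = [2, 5, 6]
node2 = [1, 3, 4]
print(merge_sorted_partials([node1, node2]))
```

The merge itself is cheap when a LIMIT is applied; the expensive part is that each node still has to sort its full partial result before the merge can start.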
I'm wondering here if the better way to do this is to return 2 answer sets: 1) your answer set that you want but unsorted, and 2) a small KV collection of items where K is amount and V is the key. Then on the client you do a sort of the small KV collection and iterate over the KV collection reading your larger answer set in that order.
If you didn't want to write a function to do that, you could do one additional query up front to select amount, key FROM ..., wrap that in a sorted collection and then do your full unsorted query. This should be really quick since your 2 seconds is partially being consumed by network on such a large answer set.
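A rough sketch of that client-side idea in Python, assuming the unsorted answer set has been loaded into a dict keyed by the region key and the small collection arrives as (amount, key) pairs. The names order_on_client, unsorted_rows and amount_key_pairs are hypothetical, and the sample rows are invented for illustration.

```python
def order_on_client(unsorted_rows, amount_key_pairs, limit=100):
    """Return full rows ordered by amount DESC, sorted on the client.

    unsorted_rows:    dict mapping key -> full row (the large, unsorted answer set)
    amount_key_pairs: iterable of (amount, key) pairs (the small collection)
    """
    # Sort only the small (amount, key) collection, largest amount first.
    ordered = sorted(amount_key_pairs, key=lambda pair: pair[0], reverse=True)
    # Walk the sorted keys and pick the full rows in that order.
    return [unsorted_rows[key] for _, key in ordered[:limit]]


rows = {
    "k1": {"cashFlowId": "k1", "amount": 10.0},
    "k2": {"cashFlowId": "k2", "amount": 250.0},
    "k3": {"cashFlowId": "k3", "amount": 75.5},
}
pairs = [(row["amount"], key) for key, row in rows.items()]
top_two = order_on_client(rows, pairs, limit=2)
```

Pairs are used instead of a map keyed by amount because several rows can share the same amount.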
Jason may have some technical insights but removing the load from the server may be the answer if you have large answer sets like you do.

Related

Optimize joining two big tables ORACLE 19C

How can I optimize the query below?
SELECT A.CNACT, A.FACML, A.LCACT, H.CAECH, H.CMECH, H.MCCMP,H.DAHIS,RANK() OVER (PARTITION BY H.CNACT,H.CAECH,H.CMECH ORDER BY H.DAHIS DESC) RK
FROM NATACF A,HISTER H WHERE A.CNACT = H.CNACT;
select count (*) FROM NATACF; -->74794
select count (*) FROM HISTER; -->2100720
The execution plan is attached.
Thank you.
As you can see, the WINDOW SORT and HASH JOIN steps are not optimized effectively. What is the best way to optimize this?
The screenshot below is from the prod database:
Long story short, you want ALL data from *both* tables, with no filtering in place.
Oracle reads the whole smaller (driving) table into a hash map,
using the joining column CNACT as the key.
Then it reads the whole bigger table and performs a lookup in the hash map for each row read.
The complexity is O(N+M); each row is read only once.
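The steps above can be sketched in a few lines of Python. This is only an illustration of the hash join idea, not how Oracle implements it internally; hash_join and the sample rows are made up for the example.

```python
from collections import defaultdict


def hash_join(small_rows, big_rows, key):
    """Join two lists of dict rows on `key` using a build/probe hash join."""
    # Build phase: read the smaller table once into a hash map, O(N).
    buckets = defaultdict(list)
    for row in small_rows:
        buckets[row[key]].append(row)

    # Probe phase: read the bigger table once, one lookup per row, O(M).
    joined = []
    for row in big_rows:
        for match in buckets.get(row[key], ()):
            joined.append({**match, **row})
    return joined


natacf = [{"CNACT": 1, "LCACT": "A"}, {"CNACT": 2, "LCACT": "B"}]
hister = [
    {"CNACT": 1, "DAHIS": "2020-01-01"},
    {"CNACT": 1, "DAHIS": "2021-01-01"},
    {"CNACT": 3, "DAHIS": "2019-01-01"},
]
result = hash_join(natacf, hister, "CNACT")
```

Each input row is touched once, so the work is proportional to N + M plus the number of output rows.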
There is no way to evaluate such a query in faster way (aside from dirty tricks like putting both tables into CLUSTER, pinning tables in buffer cache,...).
PS: It is strange that the explain plan shows 2 seconds, since neither table is actually that big,
while the prod DB says 5 hours.
Try executing the query with the following settings:
set timing on
set echo off
set pagesize 0
set termout off
set feedback off
set pause off
set verify off
set heading off
Basically, read the whole result, then discard it and print the execution time, and you will see.
Maybe it is the app (or the network) that has a problem transferring the whole big result set. In such a case you would see the "SQL*Net message to client" wait event in AWR, as if the database were waiting for the application to accept more data. It looks like you are sending about 14 GB of data to the application.
For example, Java may have problems with GC, or each row may trigger costly Java object creation.
We resolved the problem using the WITH clause:
WITH Z AS(SELECT
X.DATRA,X.COINI,X.COINT,X.NUCPT,N.COBCN,X.CNACT,N.LCACT,N.CNACR,DECODE(X.CSOPT,NULL,X.CAECH,O.CAECR) CAECH,
DECODE(X.CSOPT,NULL,X.CMECH,O.CMECR) CMECH,X.CSOPT,X.MTSNA,N.COTSJ,X.CSENS,X.QTCCP,D.TXCHA,R.MAINT,X.CODEV,
D.CDVRF,C.TYEDI,C.NUSES
FROM CUMPOR X,
MRXIDE C,.....
The WITH clause helps because the materialized subquery data persists for the duration of the query.

How to limit BigQuery query size for testing a query sample through the web user-interface?

I would like to know whether it is possible to limit the BigQuery query size when running a query through the web user interface.
My idea is just to test the query, but instead of querying all my tables, I would like to query only part of them, for instance a fixed number of rows.
LIMIT does not reduce my query cost, so the idea is to find a function similar to "row_number" or "fetch".
Sorry, I'm a marketer and not a developer, so thank you in advance for your kind help.
How to limit BigQuery query size for testing ... ?
1 - Try to minimize the number of tables involved in your testing
In your query there are 60+ tables involved, one for each date between 2016-12-11 and now:
SELECT <fields_list> FROM
TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
TIMESTAMP('20161211'),
TIMESTAMP('20170315'))
Instead, you can use the same day as the start and end of the time range, drastically reducing the number of tables involved (down to just one) and the overall scan size. For example:
SELECT <fields_list> FROM
TABLE_DATE_RANGE([XXX:85801771.ga_sessions_],
TIMESTAMP('20161211'),
TIMESTAMP('20161211'))
2 - Minimize the number of rows. Your ability to do so really depends on how your table is loaded with data. If the table is loaded incrementally, you can use so-called table decorators.
Note: this technique works only for table data from within the last 7 days.
For example, the query below will scan only the data that was in the table one hour ago (a so-called snapshot decorator):
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212#-3600000]
This works well with the most recent day's table, especially at the start of the day when the table is not big yet.
To limit further, you can use the version below (a so-called range decorator), which gives you the data added between one hour and half an hour ago:
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170212#-3600000--1800000]
Finally, #0 is a special case that references the oldest possible snapshot of the table: either 7 days in the past, or the table's creation time if the table is less than 7 days old. For example
SELECT <fields_list> FROM [XXX:85801771.ga_sessions_20170210#0]
3 - Test against a sampled table. If you expect to experiment with your query again and again, you can first prepare a downsized version of your table with just as many rows as you need, applying sampling logic that fits your business logic. To limit the number of rows you can use the LIMIT clause. To get random rows you can use the RAND function, for example.
Once the sampled table is prepared, run all your queries against it until you have the final version; after that, you can run it against your original table(s).
And by the way, to create the sampled table you need to set a destination table under the options in the Web UI.

Amazon Redshift queries mysteriously dying

Why is my Amazon Redshift query sometimes working, sometimes getting killed, and sometimes running out of memory?
This is a simple query:
dev=# EXPLAIN SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
QUERY PLAN
------------------------------------------------------------------------------------
XN Seq Scan on annotated_apache_logs (cost=0.00..114376.71 rows=9150137 width=207)
Filter: (date = '2015-09-15'::date)
Pulling about 9 million rows:
dev=# SELECT count(*) FROM annotated_apache_logs WHERE date = '2015-09-15';
count
---------
9150137
(1 row)
And choking:
dev=# SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
out of memory
Sometimes the sql says Killed. Sometimes it works. Sometimes I get out of memory. No idea why. The table looks like this (I've removed columns not in the above query):
CREATE TABLE IF NOT EXISTS annotated_apache_logs (
row_number double precision,
browser_cookie character varying(240),
timestamp integer,
request_path character varying(2500),
status character varying(12),
outcome character varying(128),
duration integer,
referrer character varying(2500)
)
DISTKEY (date)
SORTKEY (browser_cookie);
And I've worked very hard to get all of those columns as small as I can to reduce memory usage. What do I look for now? If I read the EXPLAIN output correctly, this might return a couple of gigs of data. Not much data, no joins, nothing fancy. For a "petabyte scale data warehouse", that's trivial, so I'm assuming I'm missing something fundamental here.
You should use cursors to fetch the result set in chunks. See http://docs.aws.amazon.com/redshift/latest/dg/declare.html
If your client application uses an ODBC connection and your query creates a result set that is too large to fit in memory, you can stream the result set to your client application by using a cursor. When you use a cursor, the entire result set is materialized on the leader node, and then your client can fetch the results incrementally.
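As a rough illustration of fetching incrementally instead of pulling the whole result set into client memory, here is a small Python sketch. It uses the standard-library sqlite3 module purely as a stand-in so that it runs anywhere; against Redshift the equivalent would be a DECLARE ... CURSOR inside a transaction followed by repeated FETCH FORWARD statements, as described in the linked documentation. The table name logs, its columns and the helper fetch_in_chunks are assumptions made for the example.

```python
import sqlite3


def fetch_in_chunks(cursor, chunk_size=1000):
    """Yield lists of at most `chunk_size` rows until the cursor is exhausted."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


# Stand-in data: an in-memory table with 2500 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (id INTEGER, status TEXT)")
conn.executemany("INSERT INTO logs VALUES (?, ?)", [(i, "200") for i in range(2500)])

cur = conn.execute("SELECT id, status FROM logs ORDER BY id")
chunk_sizes = []
for chunk in fetch_in_chunks(cur, chunk_size=1000):
    # Process each chunk here, then let it go out of scope.
    chunk_sizes.append(len(chunk))
```

Only one chunk is held by the client at a time, which keeps its memory use bounded regardless of the total row count.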
Edit:
Assuming that you want the entire result set rather than filtering using where/limit.
If your query is actually running out of memory, check the concurrency of the WLM queue under which this query runs. Try increasing the available memory for this queue or reducing its concurrency; either will give your query more memory.
P.S.:
When it says "petabyte scale", it does not mean it has a petabyte of RAM for you. Many factors decide how much memory your query actually gets during execution:
What is the node type you are using?
How many nodes?
What other queries are running when you are running this query?

How to do server paging in SQL correctly?

My situation: my application is slow. As slow as it gets... mostly, I suspect, because the server paging for my dataTables / grids is wrongly implemented.
Let's start:
I have a SQL Server 2008 database with one table holding all the information; it has 10 columns and, at the moment, 19K rows.
My application is based on JavaScript with ASP.NET backend code.
My SQL query is:
WITH Ordered AS
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Created DESC) AS 'RowNumber'
FROM Meetings
WHERE State IN ('Appointed', 'Accepted')
AND [xxx] LIKE '%1%'
AND [yyy] LIKE '%2%'
)
SELECT *
FROM Ordered
WHERE RowNumber BETWEEN 1 AND 41;
At the moment this query runs for around 27 to 32 seconds, which means that above 30 seconds I get a timeout... on 19k rows collected in 1 year... which means that within a month at the latest every query will time out...
As far as I understand, the ordering in this query is the problem: there is no index for it.
The query first sorts, then selects everything with a manual row number, then selects only 40... (of course on page 2 of my grid it gets rows 41 to 81...)
I COULD create an index on "Created desc" and the query would be much, much faster, BUT every column in my grid is sortable, which means "Created desc" could be any other column of my table, in either desc or asc order!
So, how to improve this?
//Edit:
Sorry, I forgot to mention:
The inner query (inner SELECT) runs for 6 seconds, while the total query runs for 31 seconds...
Which means the "WITH Ordered AS" is the problem here!
First things first: you have a performance problem, so approach it with a proper methodology and measure appropriately. "The inner query (Inner Select) runs 6 seconds, while the total query runs 31 seconds... Which means ..." is amateurism. Read How to analyse SQL Server performance for correct ways to measure performance. And before we continue: if you start from 6 seconds, you have already lost the game.
Now, on to the question.
WHERE State in('Appointed','Accepted') AND [xxx] LIKE '%1%' AND [yyy] LIKE '%2%'
This expression is basically non-indexable. Even if you add an index on State it will not help, because of the low cardinality (few values with many rows each). And LIKE '% ... %' is unindexable because it searches for values in the middle of the text.
You could try replacing LIKE '% ... %' with a full-text search such as CONTAINS ..., which will be faster provided you search for specific enough terms. But it does require you to deploy and properly configure full-text indexes.
As for the paging, I do not much favor the ROW_NUMBER approach. Even when a sort column exists, it involves a scan and count to skip the preceding rows, and it gets slower and slower as you go to higher pages. I much prefer the key-based approach:
SELECT TOP (page size) ...
WHERE keys > <last row>
ORDER BY...
but this approach is more difficult to implement as it requires keeping track of keys rather than the page number.
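To show what "keeping track of keys" looks like in practice, here is a small runnable sketch of key-based (keyset) paging. It uses Python with the standard-library sqlite3 module as a stand-in, so it uses LIMIT where SQL Server would use TOP; the meetings table, its two columns, the index name and the fetch_page helper are all assumptions made for the example. The id column serves as a tie-breaker so that rows with equal created values are neither skipped nor repeated.

```python
import sqlite3


def fetch_page(conn, page_size, last_key=None):
    """Return one page ordered by (created DESC, id DESC).

    last_key is the (created, id) pair of the last row on the previous
    page, or None for the first page.
    """
    if last_key is None:
        sql = ("SELECT id, created FROM meetings "
               "ORDER BY created DESC, id DESC LIMIT ?")
        params = (page_size,)
    else:
        created, row_id = last_key
        # Seek past the last row seen instead of counting skipped rows.
        sql = ("SELECT id, created FROM meetings "
               "WHERE created < ? OR (created = ? AND id < ?) "
               "ORDER BY created DESC, id DESC LIMIT ?")
        params = (created, created, row_id, page_size)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meetings (id INTEGER PRIMARY KEY, created TEXT)")
conn.executemany(
    "INSERT INTO meetings VALUES (?, ?)",
    [(i, f"2024-01-{(i % 28) + 1:02d}") for i in range(1, 101)],
)
conn.execute("CREATE INDEX ix_meetings_created ON meetings (created, id)")

pages, last_key = [], None
while True:
    page = fetch_page(conn, 40, last_key)
    if not page:
        break
    pages.append(page)
    # Remember the key of the last row; the client sends it back for the next page.
    last_key = (page[-1][1], page[-1][0])
```

The cost of each page depends on the page size rather than on how deep into the result the page is, which is the main advantage over the ROW_NUMBER approach. The trade-off is that the client can only move to the next page, not jump to an arbitrary page number.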
But expect no miracles. You are asking a relational OLTP system to do the work of an ElasticSearch/Solr. It will never work as you expect. Use a tool appropriate for the job (a Search engine). Also read Dynamic Search Conditions in T‑SQL for a more thorough discussion, but again, expect no miracles.

Cassandra secondary index get_indexed_slices timing out

I am using Cassandra 0.8 with 2 secondary indexes on the columns "DeviceID" and "DayOfYear". I have these two indexes in order to retrieve data for a device within a range of dates. Whenever I get a date filter, I convert it into DayOfYear and search with indexed slices using the .NET Thrift API. Currently I also cannot upgrade the DB.
My problem is this: I usually do not have any issues retrieving rows with the get_indexed_slices query for the current date (using the current day of year). But whenever I query for yesterday's day of year (which is one of the indexed columns), I get a timeout the first time I make the query. Most of the time it returns when I query a second time, and it always returns on the third attempt.
Both these columns are created as double data type in the column family and I generally get 1 record per minute. I have 3 nodes in the cluster and the nodetool reports suggest that the nodes are up and running, though the load distribution report from nodetool looks like this.
Starting NodeTool
Address DC Rack Status State Load Owns
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 7.59 GB 51.39%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 394.24 MB 3.81%
xxx.xx.xxx.xx datacenter1 rack1 Up Normal 4.42 GB 44.80%
and my configuration in YAML is as below.
hinted_handoff_enabled: true
max_hint_window_in_ms: 3600000 # one hour
hinted_handoff_throttle_delay_in_ms: 50
partitioner: org.apache.cassandra.dht.RandomPartitioner
commitlog_sync: periodic
commitlog_sync_period_in_ms: 120000
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 24
sliced_buffer_size_in_kb: 64
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: true
snapshot_before_compaction: false
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
multithreaded_compaction: false
compaction_throughput_mb_per_sec: 16
compaction_preheat_key_cache: true
rpc_timeout_in_ms: 50000
index_interval: 128
Is there something I may be missing? Are there any problems in the config?
Duplicate your data in another column family where the key is your search data. Row slices are much faster.
Personally, I never got to use secondary indexes in production environments. Either I had problems with timeouts, or the speed of data retrieval by secondary index was lower than the rate at which data was inserted. I think it is related to non-sequential reads and HD seek time.
If you come from a relational model, playOrm is just as fast and lets you be relational on a NoSQL store, BUT you need to partition your extremely large tables. If you do that, you can then use "scalable JQL" to do your stuff:
@NoSqlQuery(name="findJoinOnNullPartition", query="PARTITIONS t(:partId) select t FROM TABLE as t INNER JOIN t.security as s where s.securityType = :type and t.numShares = :shares")
It also has the @ManyToOne, @OneToMany, etc. annotations for a basic ORM; some things work differently in NoSQL, but a lot is similar.
I finally solved my problem in a different way. In fact, I realized the problem was with my data model.
The problem arose because we come from an RDBMS background. I restructured the data model a little, and now I get responses faster.