Why is my Amazon Redshift query sometimes working, sometimes getting killed, and sometimes running out of memory?
This is a simple query:
dev=# EXPLAIN SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
QUERY PLAN
------------------------------------------------------------------------------------
XN Seq Scan on annotated_apache_logs (cost=0.00..114376.71 rows=9150137 width=207)
Filter: (date = '2015-09-15'::date)
Pulling about 9 million rows:
dev=# SELECT count(*) FROM annotated_apache_logs WHERE date = '2015-09-15';
count
---------
9150137
(1 row)
And choking:
dev=# SELECT row_number, browser_cookie, "timestamp", request_path,
status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
out of memory
Sometimes the SQL client says Killed. Sometimes it works. Sometimes I get out of memory. No idea why. The table looks like this (I've removed the columns not used in the above query):
CREATE TABLE IF NOT EXISTS annotated_apache_logs (
row_number double precision,
browser_cookie character varying(240),
timestamp integer,
request_path character varying(2500),
status character varying(12),
outcome character varying(128),
duration integer,
referrer character varying(2500)
)
DISTKEY (date)
SORTKEY (browser_cookie);
And I've worked very hard to get all of those columns as small as I can to reduce memory usage. What do I look for now? If I read the EXPLAIN output correctly, this might return a couple of gigs of data. Not much data, no joins, nothing fancy. For a "petabyte scale data warehouse", that's trivial, so I'm assuming I'm missing something fundamental here.
You should use cursors to fetch the result set in chunks. See http://docs.aws.amazon.com/redshift/latest/dg/declare.html
If your client application uses an ODBC connection and your query creates a result set that is too large to fit in memory, you can stream the result set to your client application by using a cursor. When you use a cursor, the entire result set is materialized on the leader node, and then your client can fetch the results incrementally.
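For example, a minimal sketch of the cursor approach with the query from the question (the cursor name and batch size are arbitrary):
BEGIN;
DECLARE log_cur CURSOR FOR
    SELECT row_number, browser_cookie, "timestamp", request_path,
           status, outcome, duration, referrer
    FROM annotated_apache_logs
    WHERE date = '2015-09-15';
FETCH FORWARD 100000 FROM log_cur;   -- repeat until it returns no rows
CLOSE log_cur;
COMMIT;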
Edit:
Assuming that you want the entire result set rather than filtering it with WHERE/LIMIT:
If your query is actually running out of memory, check the concurrency of the WLM queue under which this query runs. Increasing the available memory for that queue, or reducing its concurrency, will give your query more memory.
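As a rough sketch (the slot count of 3 is arbitrary), you can also claim extra memory slots from the queue for just this session:
set wlm_query_slot_count to 3;   -- borrow additional slots (and their memory) from the WLM queue
SELECT row_number, browser_cookie, "timestamp", request_path,
       status, outcome, duration, referrer
FROM annotated_apache_logs
WHERE date = '2015-09-15';
set wlm_query_slot_count to 1;   -- back to the default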
P.S:
When it says "petabyte scale", it does not mean it has a petabyte of RAM for you. There are a lot of factors which decide how much memory your query actually gets during execution:
What is the node type you are using?
How many nodes?
What other queries are running when you are running this query?
Related
How can I optimize the query below?
SELECT A.CNACT, A.FACML, A.LCACT, H.CAECH, H.CMECH, H.MCCMP,H.DAHIS,RANK() OVER (PARTITION BY H.CNACT,H.CAECH,H.CMECH ORDER BY H.DAHIS DESC) RK
FROM NATACF A,HISTER H WHERE A.CNACT = H.CNACT;
select count (*) FROM NATACF; -->74794
select count (*) FROM HISTER; -->2100720
The execution plan is attached.
Thank you.
As you can see, the window sort and hash join are not being optimised effectively. What is the best way to optimise this?
The screenshot below is from the prod database:
Long story short: you want ALL the data from *both* tables - there is no filtering in place.
Oracle reads the whole smaller (driving) table into a hash map.
It uses the joining column CNACT as the key.
Then it reads the whole bigger table and performs a lookup in the hash map for each row read.
The complexity is O(N+M), and each row is read only once.
There is no way to evaluate such a query any faster (aside from dirty tricks like putting both tables into a CLUSTER, pinning the tables in the buffer cache, ...).
PS: it is strange that the explain plan shows 2 seconds while the prod DB says 5 hours - both tables are not actually that big.
Try executing the query with set timing on, set echo off, set pagesize 0, set termout off, set feedback off, set pause off, set verify off, set heading off. Basically, read the whole result, then discard it and print only the execution time. Then you will see.
Maybe it is the app (or the network) that has a problem transferring the whole big result set. In such a case you would see the "SQL*Net message to client" wait event in AWR - as if the database were waiting for the application to accept more data. You are sending about 14GB of data into the application, after all.
For example, Java can have problems with GC, or each row may be turned into a costly Java object.
We resolved the problem using a WITH clause:
WITH Z AS (
    SELECT X.DATRA, X.COINI, X.COINT, X.NUCPT, N.COBCN, X.CNACT, N.LCACT, N.CNACR,
           DECODE(X.CSOPT, NULL, X.CAECH, O.CAECR) CAECH,
           DECODE(X.CSOPT, NULL, X.CMECH, O.CMECR) CMECH,
           X.CSOPT, X.MTSNA, N.COTSJ, X.CSENS, X.QTCCP, D.TXCHA, R.MAINT, X.CODEV,
           D.CDVRF, C.TYEDI, C.NUSES
    FROM CUMPOR X,
         MRXIDE C, .....
The WITH clause: the materialized subquery data is persistent for the duration of the query.
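Whether Oracle actually materializes the factored subquery is an optimizer decision; the (undocumented but widely used) MATERIALIZE hint can force it. A rough sketch based on the original query, not the poster's full solution:
WITH Z AS (
    SELECT /*+ MATERIALIZE */ H.CNACT, H.CAECH, H.CMECH, H.MCCMP, H.DAHIS
    FROM HISTER H
)
SELECT A.CNACT, A.FACML, A.LCACT, Z.CAECH, Z.CMECH, Z.MCCMP, Z.DAHIS,
       RANK() OVER (PARTITION BY Z.CNACT, Z.CAECH, Z.CMECH ORDER BY Z.DAHIS DESC) RK
FROM NATACF A, Z
WHERE A.CNACT = Z.CNACT;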
I am running a query on localhost. I am extremely unfamiliar with SQL. I am using a golang library to generate the query statement. This is for an enterprise app, so I don't have time to evaluate and code all possible performance cases. I'd prefer good performance for the largest possible queries:
up to 6 query parameters, e.g. BETWEEN 'created' AND 'abandoned', BETWEEN X AND Y, IN (1,2,3.....25), IN ('A', 'B', 'C'....'Z')
JOINs/subqueries between 2-5 tables
returning between 50K-5M records (LAT and LNG)
Currently I am using JOIN to find the lat/lng for a record, plus some query parameters. Should I join differently (LEFT, RIGHT)? Should the FROM table be the record or the relation? Subqueries?
Is this query performance reasonable from a UI perspective? This is on localhost (Docker) on a fairly low-performance laptop, under WSL (16GB RAM / 6-core CPU [2.2GHz]).
-- [2547.438ms] [rows:874731]
SELECT "Longitude","Latitude"
FROM Wells
JOIN Well_Reports ON Well_Reports.Well_ID = Wells.Well_ID
JOIN Lithologies on Lithologies.Well_Report_ID = Well_Reports.Well_Report_ID
where Lithologies.Colour IN
(
'NULL',
'Gray','White','Dark','Black','Dark Gray','Dark Brown','Dark Red','Dark Blue',
'Dark Green','Dark Yellow','Bluish Green','Brownish Gray','Brownish Green','Brownish Yellow',
'Light','Light Gray','Light Red','Greenish Gray','Light Yellow','Light Green','Light Blue',
'Light Brown','Blue Gray','Greenish Yellow','Greenish Gray'
);
The UI is a heatmap. I haven't really hit performance issues returning 1 million rows.
Angular is the framework. I'm breaking the HTTP response into 10K-record chunks.
My initial impression was that 3+ seconds is too long for a UI to start populating data. I was already breaking the response to the UI into chunks, and that portion was efficient and async. It never occurred to me to simply break the SQL requests into smaller chunks with LIMIT and OFFSET, so the server can start responding with data immediately (<200ms) even if it takes 5+ seconds to completely finish loading.
I'll write an answer to this effect.
Thanks and best regards,
schmorrison
A few things.
where someColumn in (null, ...)
This will not return rows where the value of someColumn is null, because x in (a, b, c, ...) translates to x = a or x = b or x = c, and null is not equal to null.
You need a construct like this instead
where someColumn is null or someColumn in (...)
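Applied to the query from the question (assuming the string 'NULL' was meant to match actual NULL values in Colour), it would look roughly like this:
SELECT "Longitude","Latitude"
FROM Wells
JOIN Well_Reports ON Well_Reports.Well_ID = Wells.Well_ID
JOIN Lithologies ON Lithologies.Well_Report_ID = Well_Reports.Well_Report_ID
WHERE Lithologies.Colour IS NULL    -- matches real NULLs, unlike the string 'NULL'
   OR Lithologies.Colour IN
   (
    'Gray','White','Dark','Black','Dark Gray','Dark Brown','Dark Red','Dark Blue',
    'Dark Green','Dark Yellow','Bluish Green','Brownish Gray','Brownish Green','Brownish Yellow',
    'Light','Light Gray','Light Red','Greenish Gray','Light Yellow','Light Green','Light Blue',
    'Light Brown','Blue Gray','Greenish Yellow'
   );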
Second, you mentioned you're returning between 50k and 5M rows to the UI. I question the sanity of this... how is the UI rendering 5 million sets of coordinates for the user to see/use? I suppose there could be some extreme edge cases where this is really what you need to do, but it just seems really unlikely.
Third, if your concern is UI responsiveness, the proper way to handle that is to make asynchronous requests. I don't know anything about golang, but I did find this page with a quick google search. Study up on that kind of technique.
Finally, if you really do need to work with data sets this big, the critical point will be to identify the common search criteria and talk to your DBA about appropriate indexes. We can't provide much help in this regard without a lot more schema information, but if you have a specific query that is taking a long time with a particular set of parameters, you can come back and create a question for that query, along with providing the query plan, and we can help you out.
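For instance, if Colour is a common filter, an index that covers both the filter and the join key is the kind of thing to bring to that conversation. A hypothetical sketch using the table/column names from the question; the right choice depends on your engine and data:
CREATE INDEX IX_Lithologies_Colour
    ON Lithologies (Colour, Well_Report_ID);   -- filter column first, join key second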
As pointed out by The Impaler,
"If you want good performance for dynamic queries that return 5 million rows, then you'll need to learn the intricacies of database engines, not only SQL. A high level knowledge unfortunately won't cut it."
and I expected as much.
I was simply querying the SQL DB for the whole set at once; this is wrong and I know that.
The changes I made generate the following statements (I know the specifics have changed):
-- [145.955ms] [rows:1]
SELECT count(*)
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
);
Calculated total(69977) chunk (17494) currentpage(3)
-- [117.195ms] [rows:17494]
SELECT "Easting","Northing"
FROM "tblWellLogs"
WHERE (
DateWellCompleted BETWEEN
'1800-01-01 06:58:00.000' AND '2021-09-12 06:00:00.000'
)
AND "FinalStatusOfWellL" NOT IN
(
5,6,7,8,9,16,27,36
)
AND WaterUseL IN
(
1,29,3,8,26,4,6
)
AND (
wyRate BETWEEN
0 AND 3321
)
AND (
TotalOrFinishedDepth BETWEEN
0 AND 248
)
ORDER BY (SELECT NULL)
OFFSET 34988 ROWS
FETCH NEXT 17494 ROWS ONLY;
Chunk count: 17494
JSON size: 385135
Request time: 278.6612ms
The response to the UI is as so:
GET /data/wells/xxxx/?page=3&completed.start=1800-01-01T06:58:36.000Z&completed.end=2021-09-12T06:00:00.000Z&status=supply,research,geothermal,other,unknown&use=domestic,commercial,industrial,municipal,irrigation,agriculture&depth.start=0&depth.end=248&rate.start=0&rate.end=3321&page=3 HTTP/1.1 200 385135
{
"data":[[0.0, 0.0],......],
"count": 87473,
"total": 874731,
"pages": 11,
"page": 1
}
In this case, the DBs are essentially immutable (I may get an updated dataset once per year). I figure I can predefine a set of chunk size variations for each DB based on query result size rather than just DB size. I am also retrieving the next page before the client requests it. I am caching requests at the browser and client layers. I only perform the Count(*) request if the client doesn't provide it, and it isn't in the cache.
I did find that running the requests concurrently simply over-burdened my CPU: all the requests returned almost simultaneously, but each took 5+ seconds rather than ~1 second.
We are currently facing performance issues when an ORDER BY clause is part of the query.
Current specs:
We are running two Geode servers with a capacity of 20 GB (max heap size) each. Geode has around 3.1 million records, and the table has 1.48 million.
Query:
query --query="SELECT DISTINCT cashFlowId,upstreamSystem,upstreamSystemTxnDate,valueDate,amount,status FROM WHERE AND account IN SET ('XYZ','ABC') AND valueDate >= TO_DATE('20180320', 'yyyyMMdd') AND status = 'Booked' AND isActive = true AND category = 'Actual' ORDER BY amount DESC LIMIT 100"
The above query returns its output in 13-15 seconds after being run 2-3 times.
Actual Result Set: 666553
No of Records in the table: 1.49 million
What have we tried/observed so far?
We found that the index (type: range) is being picked correctly.
No improvement even after allocating more memory to the JVM.
Verified that the IN operator has no impact on the query performance; we tried the same query using the OR operator.
On removing the Order by clause, the query gets completed in 2 seconds. We figured that sorting is eating most of the time.
Could you please guide or shed some information in improving the query performance?
Server Metrics:
Category | Metric | Value
--------- | --------------------- | ------------
cluster | totalHeapSize | 47135
cache | totalRegionEntryCount | 3100429
Like Urizen said, check the number of GCs going on, but there is more. Here is the code, and it looks fairly tight: Geode Order By Comparator. There is another factor related to the nature of distributed sort order that has little to do with Geode as a product. Each node does its own ordering, but when the results get returned from each node, those results need to be merged with the results from the other nodes. In other words, given a set of {2,4,3,1,6,5}, node 1 can sort {2,5,6} and node 2 sorts {1,3,4}, but the controlling node needs to do a merge for you to get {1,2,3,4,5,6}. I suspect that there's some of that going on as well. This has nothing to do with Geode per se, but just with distributed ORDER BYs. In database performance optimization theory, the database is the worst place to do an ORDER BY.
I'm wondering here if the better way to do this is to return 2 answer sets: 1) your answer set that you want but unsorted, and 2) a small KV collection of items where K is amount and V is the key. Then on the client you do a sort of the small KV collection and iterate over the KV collection reading your larger answer set in that order.
If you didn't want to write a function to do that, you could do one additional query up front to select amount, key FROM ..., wrap that in a sorted collection and then do your full unsorted query. This should be really quick since your 2 seconds is partially being consumed by network on such a large answer set.
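A rough sketch of that idea against the region from the question (the region name /CashFlows is a placeholder, since the question omits it). First, the small amount/key query, with no ORDER BY:
SELECT amount, cashFlowId FROM /CashFlows WHERE account IN SET('XYZ','ABC') AND valueDate >= TO_DATE('20180320', 'yyyyMMdd') AND status = 'Booked' AND isActive = true AND category = 'Actual'
Then the full query, again without ORDER BY or LIMIT:
SELECT DISTINCT cashFlowId, upstreamSystem, upstreamSystemTxnDate, valueDate, amount, status FROM /CashFlows WHERE account IN SET('XYZ','ABC') AND valueDate >= TO_DATE('20180320', 'yyyyMMdd') AND status = 'Booked' AND isActive = true AND category = 'Actual'
On the client, sort the first result by amount descending, take the top 100 keys, and read the rows of the second result in that order.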
Jason may have some technical insights but removing the load from the server may be the answer if you have large answer sets like you do.
I am using SQL Server 2012 running on Windows Datacenter 2012.
I have a database with a table that is built as follows:
[ID] (pk,int not null)
[Start] (float,null)
[End] (float, null)
[CID] (int,null) --country id
I have a web service that gets an IP, translates it to a decimal
(see: IP address conversion to decimal and vice versa), and asks the database server for the country id.
The table mentioned above contains ~200K rows, with start and end values representing IP ranges as decimals and a country id related to each range.
I encountered really high CPU usage under some heavy traffic we have been dealing with, so I added indexes on the start and end columns. Afterwards the CPU got a little better, but I think it should have improved much more; this is simply supposed to work like a search in a sorted list, which should be extremely fast, yet the result I expected from adding the index was far from reality.
I suppose that is because it is not searching a list but searching a range.
What would be the best way to make this efficient? I am sure that the resources this simple action is taking are far more than they should be.
Here is a picture from the activity monitor now (lower traffic, after indexing) :
This is running on an Azure ExtraLarge VM (8 cores, 14GB memory) - the VM is doing nothing but running a SQL Server with 1 table that only serves this 1 request! The VM CPU on this lower traffic is ~30%, and ~70% on higher traffic. I am sure some structural/logical changes would let a really small server\service handle this easily.
SELECT TOP 1 *
FROM IP
WHERE StartIP <= yourIP
ORDER BY StartIP DESC
This gets you the candidate range whose StartIP is nearest below (or equal to) the given IP. You then need to test whether the EndIP also matches. So:
SELECT *
FROM (
SELECT TOP 1 *
FROM IP
WHERE StartIP <= yourIP
ORDER BY StartIP DESC
) x
WHERE EndIP >= yourIP
This amounts to a single-row index seek. Perfect performance.
The reason SQL Server cannot automatically do this is that it cannot know that IP ranges are ordered, meaning that the next StartIP is always greater than the current EndIP. We could have ranges of the form (100, 200), (150, 250). That is clearly invalid but it could be in the table.
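For the TOP 1 ... ORDER BY StartIP DESC pattern to be a single-row seek, there has to be an index on StartIP. A minimal sketch, assuming the column names used above plus the country id column from the question:
CREATE NONCLUSTERED INDEX IX_IP_StartIP
    ON IP (StartIP)
    INCLUDE (EndIP, CID);   -- covering index: the seek never has to touch the base table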
In my opinion your main problem is the lack of "parameterization", because (a) query compilation is/can be expensive and (b) these "unparameterized" queries seem to have a lot of executions. The available screenshot shows two things regarding this aspect:
1) The recent expensive queries aren't "parameterized".
2) High values for "Plan count":
Plan Count - The number of cached query plans for this query. A large number might indicate a need for explicit query parameterization. For more information, see Specifying Query Parameterization Behavior by Using Plan Guides.
Source
So, I would try to use parameters for these queries:
SELECT TOP(1) CountryId FROM [IP] WHERE Column1 <= @param AND @param <= Column2
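For example, one way to send it parameterized from the application side is sp_executesql (a sketch; the literal IP value is only an illustration, and the column names follow the query above):
DECLARE @yourIP BIGINT = 3232235777;   -- e.g. 192.168.1.1 converted to decimal

EXEC sp_executesql
    N'SELECT TOP(1) CountryId FROM [IP] WHERE Column1 <= @param AND @param <= Column2',
    N'@param BIGINT',
    @param = @yourIP;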
If you can't change the application (how SQL requests are sent to SQL Server), then you could try plan guides:
http://technet.microsoft.com/en-US/library/ms191275(v=sql.90).aspx
As part of a data analysis project, I will be issuing some long running queries on a mysql database. My future course of action is contingent on the results I obtain along the way. It would be useful for me to be able to view partial results generated by a SELECT statement that is still running.
Is there a way to do this? Or am I stuck with waiting until the query completes to view results which were generated in the very first seconds it ran?
Thank you for any help : )
In the general case, a partial result cannot be produced. For example, if you have an aggregate function with a GROUP BY clause, then all the data has to be analysed before the first row is returned. A LIMIT clause will not help you, because it is applied after the output is computed. Maybe you can share the concrete data and SQL query?
One thing you may consider is sampling your tables down. This is good practice in data analysis in general to get your iteration speed up when you're writing code.
For example, suppose you have table-create privileges and some mega-huge table X with key unique_id and some data data_value.
If unique_id is numeric, in nearly any database
create table sample_table as
select unique_id, data_value
from X
where mod(unique_id, <some_large_prime_number_like_1013>) = 1
will give you a random sample of data to work your queries out on, and you can inner join your sample_table against the other tables to improve the speed of testing / query results. Thanks to the sampling, your query results should be roughly representative of what you will get. Note that the number you're modding by has to be prime, otherwise it won't give a correct sample. The example above will shrink your table down to about 0.1% of the original size (0.0987% to be exact).
Most databases also have better sampling and random-number methods than just using mod. Check the documentation to see what's available for your version.
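For example, in MySQL specifically, a quick alternative to the mod trick is a RAND() filter (a sketch; the 0.001 threshold keeps roughly 0.1% of the rows, and the sample size is only approximate):
CREATE TABLE sample_table AS
SELECT unique_id, data_value
FROM X
WHERE RAND() < 0.001;   -- RAND() is evaluated per row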
Hope that helps,
McPeterson
It depends on what your query is doing. If it needs to have the whole result set before producing output - such as might happen for queries with group by or order by or having clauses, then there is nothing to be done.
If, however, the reason for the delay is client-side buffering (which is the default mode), then that can be adjusted by using mysql_use_result as an attribute of the database handle rather than the default mysql_store_result. This is true for the Perl and Java interfaces; I think in the C interface you have to use the unbuffered version of the function that executes the query.