It takes a reasonable amount of time to run the following query and get the results:
SELECT {col1}, COUNT(col3) FROM {table1} GROUP BY {col1} LIMIT 100
But when I run it for multiple columns (as below), it takes forever!
SELECT {col1}, {col2}, COUNT(col3) FROM {table1} GROUP BY {col1}, {col2} LIMIT 100
I looked for ways to make it more efficient, but I did not find anything that works for me. I am posting to get new thoughts/directions. Thanks!
{table1} is relatively huge (~12B rows)
Scenario: Medical records reporting to state government which requires a pipe delimited text file as input.
Challenge: Select hundreds of values from a fact table and produce a wide result set to be (Redshift) UNLOADed to disk.
What I have tried so far is a SQL that I want to make into a VIEW.
;WITH
CTE_patient_record AS
(
SELECT
record_id
FROM fact_patient_record
WHERE update_date = <yesterday>
)
,CTE_patient_record_item AS
(
SELECT
record_id
,record_item_name
,record_item_value
FROM fact_patient_record_item fpri
INNER JOIN CTE_patient_record cpr ON fpri.record_id = cpr.record_id
)
Note that fact_patient_record has 87M rows and fact_patient_record_item has 97M rows.
The above code runs in 2 seconds for 2 test records, and the CTE_patient_record_item CTE has about 200 rows per record, for a total of about 400.
Now, produce the result set:
,CTE_result AS
(
SELECT
cpr.record_id
,cpri002.record_item_value AS diagnosis_1
,cpri003.record_item_value AS diagnosis_2
,cpri004.record_item_value AS medication_1
...
FROM CTE_patient_record cpr
INNER JOIN CTE_patient_record_item cpri002 ON cpr.record_id = cpri002.record_id
AND cpri002.record_item_name = 'diagnosis_1'
INNER JOIN CTE_patient_record_item cpri003 ON cpr.record_id = cpri003.record_id
AND cpri003.record_item_name = 'diagnosis_2'
INNER JOIN CTE_patient_record_item cpri004 ON cpr.record_id = cpri004.record_id
AND cpri004.record_item_name = 'medication_1'
...
) SELECT * FROM CTE_result
Result set looks like this:
record_id diagnosis_1 diagnosis_2 medication_1 ...
100001 09 9B 88X ...
...and then I use the Redshift UNLOAD command to write the pipe-delimited file to disk.
I am testing this on a full production sized environment but only for 2 test records.
Those 2 test records have about 200 items each.
Processing output is 2 rows 200 columns wide.
It takes 30 to 40 minutes to process just the 2 records.
You might ask me why I am joining on the item name, which is a string. Basically, there is no item ID, no integer, to join on. Long story.
I am looking for suggestions on how to improve performance. With only 2 records, 30 to 40 minutes is unacceptable. What will happen when I have 1000s of records?
I have also tried making the VIEW a MATERIALIZED VIEW; however, it (not surprisingly) also takes 30 to 40 minutes to build the materialized view.
I am not sure which route to take from here.
Stored procedure? I have experience with stored procs.
Create new tables so I can have integer IDs and indexes to join on? However, my managers are "new table" averse.
?
I could just stop with the first two CTEs, pull the data down to Python, and process it with a pandas DataFrame, which I've done before successfully, but it would be nice to have an efficient query, just use Redshift UNLOAD, and be done with it.
Any help would be appreciated.
UPDATE: Many thanks to Paul Coulson and Bill Weiner for pointing me in the right direction! (Paul I am unable to upvote your answer as I am too new here).
Using (pseudo code):
MAX(CASE WHEN t1.name = 'somename' THEN t1.value END ) AS name
...
FROM table1 t1
reduced execution time from 30 minutes to 30 seconds.
The EXPLAIN plan for the original solution is 2,700 lines long; for the new solution using conditional aggregation it is 40 lines long.
Thanks guys.
Without some more information it is impossible to know for sure what is going on, but what you are doing is likely not ideal. An explain plan and the execution time per step would help a bunch.
What I suspect is getting you is that you are reading a 97M row table 200 times. This will slow things down, but it shouldn't take 40 min. So I also suspect that record_item_name is not unique per value of record_id. This will lead to row replication and could be expanding the data set many fold. Also, is record_id unique in fact_patient_record? If not, then this will cause row replication. If all of this is large enough to cause significant spill and significant network broadcasting, your 40 min execution time is very plausible.
There is no need to be joining when all the data is in a single copy of the table. #PhilCoulson is correct that some sort of conditional aggregation could be applied, and the decode() syntax could save you space if you don't like CASE. Several of the above issues that might be affecting your joins would also make this aggregation complicated. What are you looking for if there are several values of record_item_value for a given record_id and record_item_name pair? I expect you have some discovery of what your data holds in your future.
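For illustration, here is a minimal sketch of that conditional-aggregation approach, assuming the table and column names from the question (the item names are just the examples shown above). It reads the item table once and pivots each record_item_name into its own column:
SELECT
    fpri.record_id
    ,MAX(CASE WHEN fpri.record_item_name = 'diagnosis_1'  THEN fpri.record_item_value END) AS diagnosis_1
    ,MAX(CASE WHEN fpri.record_item_name = 'diagnosis_2'  THEN fpri.record_item_value END) AS diagnosis_2
    ,MAX(CASE WHEN fpri.record_item_name = 'medication_1' THEN fpri.record_item_value END) AS medication_1
    -- ...one expression per item name...
FROM fact_patient_record_item fpri
INNER JOIN fact_patient_record fpr
        ON fpr.record_id = fpri.record_id
       AND fpr.update_date = <yesterday>
GROUP BY fpri.record_id
Note that if several rows share the same record_id and record_item_name, MAX() silently picks one of the values, which ties back to the data-quality question above.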
I wrote a query, but I want to know whether its execution time will be slow or fast. Can we use any alternative to IN, since it is used 4 times in such a small query?
Is there a better way to write this query? Moreover, I am confused because someone wants to query it with a large number of ep_et_id values, and the way I wrote it there is nowhere to specify the ep_et_id.
The query below fetches 74,133 results in 14.55 seconds in a direct Postgres call, but I believe it will take more time if I call it from the webpage.
select lp, ob_id
from ob_in
where ob_id in (
    select ob_id
    from ob_for_e
    where ep_et_id in (
        select ep_et_id
        from rrts rep
        left join rrts_inf rinf on rep.rrts_id = rinf.rrts_id
        where rrts_type in ('FR', 'IN')
          and rinf.status in ('V')
    )
)
The tables I am using here are ob_in, ob_for_e, rrts, and rrts_inf.
The location point lp is in table ob_in, and I use multiple IN clauses to fetch IDs from the other tables in order to apply the final condition, type FR or IN and status V, on table rrts_inf.
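One alternative I have considered (I am not sure whether it is faster, or even exactly equivalent) is rewriting the nested IN subqueries as plain joins; this sketch assumes ep_et_id and rrts_type live on the rrts table, and it needs DISTINCT because the joins could otherwise multiply rows:
select distinct oi.lp, oi.ob_id
from ob_in oi
join ob_for_e ofe on ofe.ob_id = oi.ob_id
join rrts rep on rep.ep_et_id = ofe.ep_et_id
join rrts_inf rinf on rinf.rrts_id = rep.rrts_id
where rep.rrts_type in ('FR', 'IN')
  and rinf.status = 'V'
Comparing EXPLAIN ANALYZE output for both forms would show whether the planner actually treats them differently.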
I have a Postgres query as such:
select id
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id
order by id;
It's nothing fancy but works fine -- only takes about 3 minutes when querying a large database to return the result.
My question is, why does this slight modification to the above code cause the query time to more than quadruple?
select id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id, b.ad_description
order by id;
What is going on? The mere inclusion of one simple column of (albeit unique) information is bogging my query down. I am somehow asking Postgres to do a tremendously larger amount of work. For the life of me, I don't see how.
I'd like to preemptively apologize for not including any raw data. I'm hoping this simplified example of what I'm really facing is clear enough for some kind soul to make an enlightening comment. I can say that I'm going over a million rows in each table.
Thanks in advance.
I think the size of the return set is the issue. Since your result set now has a million-ish rows, that would explain the extra time (although it does seem excessive). A couple of things: first, the first column in the select is not bound to a range variable; it probably doesn't matter, but you might want to make it select b.id, b.ad_description instead of id, b.ad_description. Second, normally I use a GROUP BY when one of the selected columns is an aggregate that isn't in the GROUP BY clause, like a count() or something. Maybe I'm missing something, but you might get the same result with:
select distinct b.id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
order by b.id, b.ad_description;
You might want to play with some of the numbers in the postgresql.conf file to beef up workspace/query space.
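For example (illustrative value only, not a tuning recommendation), the per-operation sort/hash memory can be raised for the current session before running the query; the same setting also lives in postgresql.conf:
-- raise the sort/hash workspace for this session only
set work_mem = '64MB';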
Finally, I have this funny hunch that making the ORDER BY match the GROUP BY might also help performance.
A LIMIT 5 wouldn't help much with performance, because the entire query would need to be finished before the first 5 rows could come out.
-g
I guess this has been asked on the site before, but I can't find it.
I've seen on some sites that there is a vague count over the results of a search. For example, here on Stack Overflow, when you search for a question, it sometimes says 5000+ results; in Gmail, when you search by keywords, it says "hundreds"; and Google says approx. X results. Is this just a way to show the user an easy-to-understand huge number, or is it actually a fast way to count results that can be used in a database [I'm learning Oracle at the moment, 10g version]? Something like "hey, if you get more than 1k results, just stop and tell me there are more than 1k".
Thanks
PS. I'm new to databases.
Usually this is just a nice way to display a number.
I don't believe there is a way to do what you are asking for in SQL - count does not have an option for counting up until some number.
I also would not assume this is coming from SQL in either Gmail or Stack Overflow.
Most search engines will return a total number of matches to a search, and then let you page through results.
As for making an exact number more human readable, here is an example from Rails:
http://api.rubyonrails.org/classes/ActionView/Helpers/NumberHelper.html#method-i-number_to_human
With Oracle, you can always resort to analytical functions in order to calculate the exact number of rows about to be returned. This is an example of such a query:
SELECT inner.*, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
This will give you the total number of rows for your specific subquery. When you want to apply paging as well, you can further wrap these SQL parts as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*, ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... your own, sorted search query ...]
) inner
)
WHERE ROWNUM < :max_row
) outer
WHERE outer.RNUM > :min_row
Replace min_row and max_row by meaningful values. But beware that calculating the exact number of rows can be expensive when you're not filtering using UNIQUE SCAN or relatively narrow RANGE SCAN operations on indexes. Read more about this here: Speed of paged queries in Oracle
As others have said, you can always put an absolute upper limit, such as 5000, on your query using a ROWNUM <= 5000 filter and then just indicate that there are 5000+ results. Note that Oracle can be very good at optimising queries when you apply ROWNUM filtering. Find some info on that subject here:
http://www.dba-oracle.com/t_sql_tuning_rownum_equals_one.htm
A vague count is a buffer that can be displayed promptly. If the user wants to see more results, they can request more.
It's a performance facility: after displaying the first results, sites like Google keep searching for more results.
I don't know how fast this will run, but you can try:
SELECT NULL FROM your_tables WHERE your_condition AND ROWNUM <= 1001
If the count of rows in the result equals 1001, then the total count of records is greater than 1000.
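To get that capped number back as a single value, you could wrap it (a sketch, reusing the same placeholder names):
SELECT COUNT(*) AS capped_count
FROM (SELECT NULL
      FROM your_tables
      WHERE your_condition
      AND ROWNUM <= 1001)
If capped_count comes back as 1001, display "1000+"; otherwise show the exact number.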
This question gives some pretty good information.
When you do an SQL query you can set a
LIMIT 0, 100
for example, and you will only get the first hundred answers. So you can then tell your viewer that there are 100+ answers to their request.
As for Google, I couldn't say whether they really know there are more than 27'000'000'000 answers to a request, but I believe they really do know. There are some standard requests whose results are stored and updated in the background.
I am wondering whether SQL Server knows to 'cache', if you like, aggregates within a query when they are used again.
For example,
Select Sum(Field),
Sum(Field) / 12
From Table
Would SQL Server know that it has already calculated the Sum function on the first field and then just divide it by 12 for the second? Or would it run the Sum function again then divide it by 12?
Thanks
It calculates it once:
Select
Sum(Price),
Sum(Price) / 12
From
MyTable
The plan gives:
|--Compute Scalar(DEFINE:([Expr1004]=[Expr1003]/(12.)))
|--Compute Scalar(DEFINE:([Expr1003]=CASE WHEN [Expr1010]=(0) THEN NULL ELSE [Expr1011] END))
|--Stream Aggregate(DEFINE:([Expr1010]=Count(*), [Expr1011]=SUM([myDB].[dbo].[MyTable].[Price])))
|--Index Scan(OBJECT:([myDB].[dbo].[MyTable].[IX_SomeThing]))
This table has 1.35 million rows
Expr1011 = SUM
Expr1003 = some internal handling to do with "no rows" etc., but it is basically Expr1011
Expr1004 = Expr1011 / 12
According to the execution plan, it doesn't re-sum the column.
Good question. I think the answer is no, it doesn't cache it.
I ran a test query with around 3000 counts in it, and it was much slower than one with only a few. I still want to test whether the query would be just as slow selecting plain columns.
Edit: OK, I just tried selecting a large number of columns versus just one, and the number of columns (when talking about thousands being returned) does affect the speed.
Overall, unless you are using that aggregate number a ton of times in your query, you should be fine. If push comes to shove, you could always save the outcome to a variable and do the math after the fact.
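A minimal sketch of the variable approach, reusing the table and column from the example above (the variable name and type are made up for illustration):
DECLARE @total_price DECIMAL(18, 2);
SELECT @total_price = SUM(Price)   -- aggregate computed once
FROM MyTable;
SELECT @total_price      AS total_price,
       @total_price / 12 AS monthly_average;   -- derived from the variable, no second scan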