Joining and Aggregating a Large Number of Fact Tables Efficiently in Redshift - sql

I have a number of large (10M+ row) fact tables in Redshift, each with a natural key memberid and a timestamp column. Let's say I have three tables: transactions, messages, app_opens, with transactions looking like this (all the other tables have a similar structure):
memberid  | revenue | timestamp
----------|---------|--------------------
374893978 | 3.99    | 2021-02-08 18:34:01
374893943 | 7.99    | 2021-02-08 19:34:01
My goal is to create a daily per-memberid aggregation table that looks like this, with a row for each memberid and date:
memberid  | date       | daily_revenue | daily_app_opens | daily_messages
----------|------------|---------------|-----------------|---------------
374893978 | 2021-02-08 | 4.95          | 31              | 45
374893943 | 2021-02-08 | 7.89          | 23              | 7
The SQL I'm currently using for this is the following, which involves unioning separate subqueries:
SELECT memberid,
date,
max(NVL(daily_revenue,0)) daily_revenue,
max(NVL(daily_app_opens,0)) daily_app_opens,
max(NVL(daily_messages,0)) daily_messages
FROM
(
SELECT memberid,
trunc(timestamp) as date,
sum(revenue) daily_revenue,
NULL AS daily_app_opens,
NULL AS daily_messages
FROM transactions
GROUP BY 1,2
UNION ALL
SELECT memberid,
trunc(timestamp) as date,
NULL AS daily_revenue,
count(*) daily_app_opens,
NULL AS daily_messages
FROM app_opens
GROUP BY 1,2
UNION ALL
SELECT memberid,
trunc(timestamp) as date,
NULL AS daily_revenue,
NULL AS daily_app_opens,
count(*) daily_messages
FROM messages
GROUP BY 1,2
)
GROUP BY memberid, date
This works fine and produces the expected output, but I'm wondering if this is the most efficient way to carry out this kind of query. I have also tried using FULL OUTER JOIN in place of UNION ALL, but the performance is essentially identical.
What's the most efficient way to achieve this in Redshift?

Seeing the EXPLAIN plan would help, as it would let us see which parts of the query are the most costly. Based on a quick read of the SQL it looks pretty good. The cost of scanning the fact tables is likely significant, but this is a cost you have to endure. If you can restrict the amount of data read with a WHERE clause this can be reduced, but doing so may not meet your needs.
One place that you should review is the distribution of these tables. Since you are grouping by memberid, having this as the distribution key will make this process faster. Grouping needs to bring rows with the same memberid value together, so distributing on these values will greatly cut down on network traffic within the cluster.
At large data sizes, and with everything else optimized, I'd expect UNION ALL to outperform FULL OUTER JOIN, but this will depend on a number of factors (like how much the data size is reduced by the memberid aggregation). 10M rows is not very big in Redshift terms (I have 160M rows of wide data on a minimal cluster), so I don't think you will see much difference between these plans at these sizes.
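For illustration, here is a minimal sketch of table DDL with memberid as the distribution key and a compound sort key. The column types (and the quoting of "timestamp") are assumptions, not taken from the question:
CREATE TABLE transactions (
    memberid    BIGINT,           -- natural key shared by all the fact tables
    revenue     DECIMAL(12,2),
    "timestamp" TIMESTAMP         -- quoted only to be safe; a name such as event_ts avoids the need
)
DISTSTYLE KEY
DISTKEY (memberid)                -- co-locates rows for the same member on the same slice
COMPOUND SORTKEY (memberid, "timestamp");  -- helps the per-member, per-day grouping and date filters
With all three fact tables distributed on memberid, the GROUP BY (and any join on memberid) can be performed without redistributing rows across the cluster; on recent Redshift versions an existing table can be switched with ALTER TABLE transactions ALTER DISTKEY memberid.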

Related

SQL to find unique counts between two date fields

I was reading this but can't manage to hack it to work on my own problem.
My data has the following fields, in a single table in Postgres:
Seller_id (varchar) (contains duplicates).
SKU (varchar) (contains duplicates).
selling_starts (datetime).
selling_ends (datetime).
I want to query it so I get the count of unique SKUs on sale, per seller, per day. If there are any null days I don't need these.
I've previously tried querying it by using another table to generate a list of unique "filler" dates and then joining where the date is greater than selling_starts and less than selling_ends. However, this is so computationally expensive that I get timeout errors.
I'm vaguely aware there are probably more efficient ways of doing this via WITH statements to create CTEs or some sort of recursive function, but I don't have any experience with these.
Any help much appreciated!
Try this:
WITH list AS
( SELECT generate_series(date_trunc('day', min(selling_starts)), max(selling_ends), '1 day') AS ref_date
FROM your_table
)
SELECT seller_id
, l.ref_date
, count(DISTINCT sku) AS sku_count
FROM your_table AS t
INNER JOIN list AS l
ON t.selling_starts <= l.ref_date
AND t.selling_ends > l.ref_date
GROUP BY seller_id, l.ref_date
If your_table is large, you should create indexes to accelerate the query.
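For example, a sketch (the index names are illustrative; note that a range-overlap predicate like the one above can only partially use a b-tree index, so check EXPLAIN ANALYZE to confirm the indexes actually help):
CREATE INDEX idx_your_table_selling_starts ON your_table (selling_starts);
-- or a composite index that also covers the grouping column and the upper bound
CREATE INDEX idx_your_table_seller_range ON your_table (seller_id, selling_starts, selling_ends);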

Put many columns in group by clause in Oracle SQL

In an Oracle 11g database, suppose we have tables CUSTOMER and PAYMENT as follows:
Customer
CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_AGE | CUSTOMER_CREATION_DATE
--------------------------------------------------------------------
001 John 30 1 Jan 2017
002 Jack 10 2 Jan 2017
003 Jim 50 3 Jan 2017
Payment
CUSTOMER_ID | PAYMENT_ID | PAYMENT_AMOUNT |
-------------------------------------------
001 900 100.00
001 901 200.00
001 902 300.00
003 903 999.00
We want to write a SQL query to get all columns from table CUSTOMER together with the sum of all payments for each customer. There are many possible ways to do this, but I would like to ask which of the following is better.
Solution 1
SELECT C.CUSTOMER_ID
, MAX(C.CUSTOMER_NAME) CUSTOMER_NAME
, MAX(C.CUSTOMER_AGE) CUSTOMER_AGE
, MAX(C.CUSTOMER_CREATION_DATE) CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) TOTAL_PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID;
Solution 2
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, SUM(P.PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN PAYMENT P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
GROUP BY C.CUSTOMER_ID, C.CUSTOMER_NAME, C.CUSTOMER_AGE, C.CUSTOMER_CREATION_DATE
Please notice in Solution 1 that I use MAX not because I actually want the max results, but because I want "ONE" row from the columns which I know are equal for all rows with the same CUSTOMER_ID.
In Solution 2, by contrast, I avoid putting the misleading MAX in the SELECT part by putting the columns in the GROUP BY part instead.
With my current knowledge, I prefer Solution 1 because it is more important to comprehend the logic in the GROUP BY part than in the SELECT part. I would put only a set of unique keys there to express the intention of the query, so the application can infer the expected number of rows. But I don't know about the performance.
I ask this question because I am reviewing a code change to a big SQL query that puts 50 columns in the GROUP BY clause because the editor wants to avoid the MAX function in the SELECT part. I know we could refactor the query in some way to avoid putting the irrelevant columns in both the GROUP BY and SELECT parts, but please set that option aside, because it would affect the application logic and require more time to test.
Update
I have just done the test on my big query in both versions, as everyone suggested. The query is complex: it has 69 lines, involves more than 20 tables, and its execution plan is more than 190 lines, so I think this is not the place to show it.
My production data is quite small for now, about 4,000 customers, and the query was run against the whole database. Only the CUSTOMER table and a few reference tables have TABLE ACCESS FULL in the execution plan; the other tables are accessed by indexes. The execution plans for the two versions differ slightly in the aggregation method (HASH GROUP BY vs SORT AGGREGATE) in some parts.
Both versions take about 13 minutes; no significant difference.
I have also done the test on simplified versions similar to the SQL in the question. Both versions have exactly the same execution plan and elapsed time.
With the current information, I think the most reasonable answer is that the outcome is unpredictable without testing, as the optimizer will do the job of choosing between the two. I would very much appreciate it if anyone could give any information to support or reject this idea.
Another option is
SELECT C.CUSTOMER_ID
, C.CUSTOMER_NAME
, C.CUSTOMER_AGE
, C.CUSTOMER_CREATION_DATE
, P.PAYMENT_AMOUNT
FROM CUSTOMER C
JOIN (
SELECT CUSTOMER_ID, SUM(PAYMENT_AMOUNT) PAYMENT_AMOUNT
FROM PAYMENT
GROUP BY CUSTOMER_ID
) P ON (P.CUSTOMER_ID = C.CUSTOMER_ID)
To decide which of the three is better, just test them and compare the execution plans.
Neither. Do the sum on payment, then join the results.
select C.*, p.total_payment -- c.* gets all columns from table alias c without typing them all out
from Customer C
left join -- I've used left in case you want to include customers with no orders
(
select customer_id, sum(payment_amount) as total_payment
from Payment
group by customer_id
) p
on p.customer_id = c.customer_id
Solution 1 is costly. Even though the optimizer could avoid the unnecessary sorting, at some point you will be forced to add indexes/constraints over irrelevant columns to improve performance. Not a good practice in the long term.
Solution 2 is the Oracle way.
Oracle's documentation states that the SELECT list must contain only aggregates or grouping columns.
The Oracle engineers had valid reasons for this; however, it does not apply to some other RDBMSs, where you can simply write GROUP BY c.customer_id and all will be fine.
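As a hedged illustration of that point, assuming PostgreSQL 9.1 or later (where columns functionally dependent on a grouped primary key do not have to be listed), something like this is accepted:
SELECT c.customer_id,
       c.customer_name,            -- allowed: functionally dependent on the grouped primary key
       c.customer_age,
       c.customer_creation_date,
       SUM(p.payment_amount) AS payment_amount
FROM   customer c
JOIN   payment p ON p.customer_id = c.customer_id
GROUP  BY c.customer_id;           -- only the key is listed
Oracle, by contrast, requires every non-aggregated column in the SELECT list to appear in the GROUP BY clause.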
For the sake of code readability, a --comment would be cheaper.
In general, not embracing a platform's principles has a cost: more code, weird code, memory, disk space, performance, etc.
In Solution 1 the query will repeat the MAX function for each column. I don't know exactly how the MAX function works, but I assume that it sorts all the elements in the column and then picks the first (best-case scenario). It is kind of a time bomb: when your table gets bigger, this query will get worse very fast. So if you are concerned about performance, you should pick Solution 2. It looks messier but will be better for the application.

SQL Server cross join performance

I have a table that has 14,091 rows (2 columns; let's say first name and last name). I then have a calendar table that has 553 rows of just dates (the first of each month). I do a cross join in order to get every combination of first name, last name, and first of month, because this is my requirement. This takes just over a minute.
Is there anything I can do to make this faster, or can a cross join never get any faster, as I suspect?
People Table
first_name varchar2(100)
last_name varchar2(1000)
Dates Table
dt DateTime
select a.first_name, a.last_name, b.dt
from people a, dates b
It will be slow because it is producing every possible combination: 14,091 * 553. It is not going to be fast unless you either add an index or use an inner join instead.
Yeah, it takes over a minute. Let's get this clear: you are talking about 14,091 * 553 rows - that is 7,792,323, roughly 7.8 million rows. And you are loading them into a data table (which is not known for performance).
Want to see slow? Put them into a grid. THEN you will see slow.
The requirement makes no sense in a table. None. Absolutely none.
And no, there is no way to speed up the loading of 7.8 million rows into a data structure that is not meant to hold that amount of data.

t-sql GROUP BY with COUNT, and then include MAX from the COUNT

Suppose you had a table of "Cars" with hundreds of thousands of rows,
and you wanted to do a GROUP BY:
SELECT CarID
, CarName
, COUNT(*) AS Total
FROM dbo.tbl_Cars
GROUP BY CarID
, CarName
The grouping leaves you with a result akin to:
CarID CarName Total
1872 Olds 202,121
547841 BMW 175,298
9877 Ford 10,241
All fine and well.
My question, though, is what is the best way to get the
Total and the MAX Total into one table, in terms of performance and
clean coding, so you have a result like:
CarID CarName Total Max Total
1872 Olds 202,121 202,121
547841 BMW 175,298 202,121
9877 Ford 10,241 202,121
One approach would be to put the GROUP result into a temp table,
and then get the MAX from the temp table into a local variable.
But I'm wondering what the best way to do this would be.
UPDATE
The Common Table Expression seems the most elegant to write,
yet, similar to #EBarr, my limited testing indicates significantly slower performance.
So I won't be going with the CTE.
As the link #EBarr has for the COMPUTE option indicates the feature
is deprecated, that doesn't seem the best route, either.
The option of a local variable for the MAX value and the use of
a temp table will likely be the route I go down, as I'm not
aware of performance issues with it.
A bit more detail about my use case: it could probably end up being a
series of other SO questions. But suffice to say that I'm loading
a large subset of data into a temp table (so a subset of tbl_Cars is
going into #tbl_Cars, and even #tbl_Cars may be further filtered
and have aggregations performed on it), because I have to perform multiple filtering
and aggregation queries on it within a single stored proc
that returns multiple result sets.
UPDATE 2
#EBarr's use of a windowed function is nice and short. Note to self:
if using a RIGHT JOIN to an outer reference table, the COUNT()
function should select a column from tbl_Cars, not '*'.
SELECT M.MachineID
, M.MachineType
, COUNT(C.CarID) AS Total
, MAX(COUNT(C.CarID)) OVER() as MaxTotal
FROM dbo.tbl_Cars C
RIGHT JOIN dbo.tbl_Machines M
ON C.CarID = M.CarID
GROUP BY M.MachineID
, M.MachineType
In terms of speed, it seems fine, but at what point do you have to be
worried about the number of reads?
Mechanically there are a few ways to do this. You could use temp tables/table variable. Another way is with nested queries and/or a CTE as #Aaron_Bertrand showed. A third way is to use WINDOWED FUNCTIONS such as...
SELECT CarName,
COUNT(*) as theCount,
MAX(Count(*)) OVER(PARTITION BY 'foo') as MaxPerGroup
FROM dbo.tbl_Cars
GROUP BY CarName
A DISFAVORED (read: deprecated) fourth way is using the COMPUTE keyword, as such...
SELECT CarID, CarName, Count(*)
FROM dbo.tbl_Cars
GROUP BY CarID, CarName
COMPUTE MAX(Count(*))
The COMPUTE keyword generates totals that appear as additional summary columns at the end of the result set (see this). In the query above you will actually see two record sets.
Fastest
Now, the next issue is what's "best/fastest/easiest." I immediately think of an indexed view. As #Aaron gently reminded me, indexed views have all sorts of restrictions. The above strategy, however, allows you to create an indexed view on the SELECT...FROM...GROUP BY. Then, selecting from the indexed view, apply the windowed function.
Without knowing more about your design, however, it is going to be difficult for anyone to tell you what's best. You will get lightning-fast queries from an indexed view. That performance comes at a price, though. The price is maintenance cost. If the underlying table is the target of a large number of insert/update/delete operations, the maintenance of the indexed view will bog down performance in other areas.
If you share a bit more about your use case and data access patterns people will be able to share more insight.
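For reference, a minimal sketch of such an indexed view, using the Cars table from the micro-test below (the view and index names are made up):
CREATE VIEW dbo.vw_CarCounts
WITH SCHEMABINDING                      -- schema binding is required for an indexed view
AS
SELECT CarName,
       COUNT_BIG(*) AS theCount         -- indexed views must use COUNT_BIG(*) rather than COUNT(*)
FROM dbo.Cars
GROUP BY CarName;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_CarCounts
    ON dbo.vw_CarCounts (CarName);
GO
-- the windowed MAX is then applied on top of the materialized counts
SELECT CarName,
       theCount,
       MAX(theCount) OVER () AS MaxInAnyGroup
FROM dbo.vw_CarCounts WITH (NOEXPAND);  -- NOEXPAND makes non-Enterprise editions use the view's index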
MICRO PERFORMANCE TEST
So I generated a little data script and looked at sql profiler numbers for the CTE performance vs windowed functions. This is a micro-test, so try some real numbers in your system under real load.
Data generation:
Create table Cars ( CarID int identity (1,1) primary key,
CarName varchar(20),
value int)
GO
insert into Cars (CarName, value)
values ('Buick', 100),
('Ford', 10),
('Buick', 300),
('Buick', 100),
('Pontiac', 300),
('Bmw', 100),
('Mecedes', 300),
('Chevy', 300),
('Buick', 100),
('Ford', 200);
GO 1000
This script generates 10,000 rows. I then ran each of the four following queries multiple times:
--just group by
select CarName,COUNT(*) countThis
FROM Cars
GROUP BY CarName
--group by with compute (BAD BAD DEVELOPER!)
select CarName,COUNT(*) countThis
FROM Cars
GROUP BY CarName
COMPUTE MAX(Count(*));
-- windowed aggregates...
SELECT CarName,
COUNT(*) as theCount,
MAX(Count(*)) OVER(PARTITION BY 'foo') as MaxInAnyGroup
FROM Cars
GROUP BY CarName
--CTE version
;WITH x AS (
SELECT CarName,
COUNT(*) AS Total
FROM Cars
GROUP BY CarName
)
SELECT x.CarName, x.Total, x2.[Max Total]
FROM x CROSS JOIN (
SELECT [Max Total] = MAX(Total) FROM x
) AS x2;
After running the above queries, I created an indexed view on the "just group by" query above. Then I ran a query on the indexed view that performed MAX(Count(*)) OVER(PARTITION BY 'foo').
AVERAGE RESULTS
Query                      CPU   Reads   Duration
--------------------------------------------------
Group By                   15    31      7 ms
Group & Compute            15    31      7 ms
Windowed Functions         14    56      8 ms
Common Table Exp.          16    62      15 ms
Windowed on Indexed View   0     24      0 ms
Obviously this is a micro-benchmark and only mildly instructive, so take it for what it's worth.
Here's one way:
;WITH x AS
(
SELECT CarID
, CarName
, COUNT(*) AS Total
FROM dbo.tbl_Cars
GROUP BY CarID, CarName
)
SELECT x.CarID, x.CarName, x.Total, x2.[Max Total]
FROM x CROSS JOIN
(
SELECT [Max Total] = MAX(Total) FROM x
) AS x2;
In SQL Server 2008 R2 and newer versions, you can use:
GROUP BY CarID, CarName WITH ROLLUP

SQL aggregation question

I have three tables:
unmatched_purchases table:
unmatched_purchases_id --primary key
purchases_id --foreign key to events table
location_id --which store
purchase_date
item_id --item purchased
purchases table:
purchases_id --primary key
location_id --which store
customer_id
credit_card_transactions:
transaction_id --primary key
trans_timestamp --timestamp of when the transaction occurred
item_id --item purchased
customer_id
location_id
All three tables are very large. The purchases table has 590,130,404 records (yes, half a billion). Unmatched_purchases has 192,827,577 records. Credit_card_transactions has 79,965,740 records.
I need to find out how many purchases in the unmatched_purchases table match up with entries in the credit_card_transactions table. I need to do this for one location at a time (i.e. run the query for location_id = 123, then run it for location_id = 456). "Match up" is defined as:
1) same customer_id
2) same item_id
3) the trans_timestamp is within a certain window of the purchase_date
(EG if the purchase_date is Jan 3, 2005
and the trans_timestamp is 11:14PM Jan 2, 2005, that's close enough)
I need the following aggregated:
1) How many unmatched purchases are there for that location
2) How many of those unmatched purchases could have been matched with credit_card_transactions for a location.
So, what is a query (or queries) to get this information that won't take forever to run?
Note: all three tables are indexed on location_id
EDIT: as it turns out, the credit_card_transactions table has been partitioned based on location_id. So that will help speed this up for me. I'm asking our DBA if the others could be partitioned as well, but the decision is out of my hands.
CLARIFICATION: I will only need to run this on a few of our many locations, not all of them separately. I need to run it on 3 locations. We have 155 location_ids in our system, but some of them are not used in this part of our system.
Try this (I have no idea how fast it will be - that depends on your indexes):
Select Count(*) TotalPurchases,
       Sum(Case When c.transaction_id Is Not Null
                Then 1 Else 0 End) MatchablePurchases
From unmatched_purchases u
Join purchases p
  On p.purchases_id = u.purchases_id              -- assumes unmatched_purchases.purchases_id references purchases
Left Join credit_card_transactions c
  On c.customer_id = p.customer_id
 And c.item_id = u.item_id
 And c.trans_timestamp - u.purchase_date < #DelayThreshold
Where u.location_id = #Location
At the very least, you'll need more indexes. I propose at least the following: an index on unmatched_purchases.purchases_id, one on purchases.location_id, and another on credit_card_transactions (location_id, customer_id, item_id, trans_timestamp).
Without those indexes, there is little hope IMO.
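A sketch of the corresponding DDL (the index names are made up):
CREATE INDEX ix_unmatched_purch_pid  ON unmatched_purchases (purchases_id);
CREATE INDEX ix_purchases_location   ON purchases (location_id);
CREATE INDEX ix_cct_loc_cust_item_ts ON credit_card_transactions
    (location_id, customer_id, item_id, trans_timestamp);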
I suggest you query ALL locations at once. It will cost you 3 full scans (each table once) plus sorting. I bet this will be faster than querying the locations one by one.
But if you would rather not guess, you at least need to examine the EXPLAIN PLAN and a 10046 trace of your query...
The query ought to be straightforward, but the tricky part is getting it to perform. I'd question why you need to run it once for each location when it would probably be more efficient to run it for every location in a single query.
The join will be the big challenge, but the aggregation ought to be straightforward. I would guess that your best hope performance-wise for the join would be a hash join on the customer and item columns, with a subsequent filter operation on the date range. You might have to fiddle with putting the customer and item join in an inline view and then try to stop the date predicate from being pushed into the inline view.
The hash join would be much more efficient if the tables being equi-joined both had the same hash partitioning key on all the join columns, if that can be arranged.
Whether to use the location index or not ...
Whether the index is worth using or not depends on the clustering factor of the location index, which you can read from the user_indexes view. Can you post the clustering factor along with the number of blocks that the table contains? That will give a measure of how the values for each location are distributed throughout the table. You could also extract the execution plan for a query such as:
select some_other_column
from my_table
where location_id in (value 1, value 2, value 3)
... and see if Oracle thinks the index is useful.
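For reference, a sketch of how the clustering factor and block counts could be pulled from the data dictionary (substitute your own table and index names):
SELECT index_name, clustering_factor
FROM   user_indexes
WHERE  table_name = 'PURCHASES';   -- repeat for each fact table

SELECT table_name, blocks, num_rows
FROM   user_tables
WHERE  table_name = 'PURCHASES';
Roughly speaking, a clustering factor close to the number of table blocks means rows for the same location are stored together and the index is cheap to use, while a clustering factor close to the number of rows means they are scattered and a full scan may well win.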