Redshift db.
Table A is a date/calendar table.
Table B is a member table, structured as a type 6 slowly changing dimension. It has nearly 200 M records.
The goal is to write a performant query that gives the count of members for every day in the last 4 years. My first attempt resulted in a query like so:
select
date,
location,
sub_location,
race,
gender,
dob,
member_type,
count(distinct member_id)
from date_table d
join member_table m
on m.row_start <= d.full_date
and m.row_end >= d.full_date
and m.is_active = 'Y'
and m.row_end >= '2019-01-01'
where d.date_key >= 20190101
and d.date_key <= to_char(current_date, 'yyyymmdd')
group by
date,
location,
sub_location,
race,
gender,
dob,
member_type
The performance on this is god awful because of the join being a nested loop. I've been trying to think of a way to rework this to avoid that issue but have not had any success. Curious if there is a way to do so that would increase performance significantly.
For reference here are the table designs as well as the explain plan:
create table date_table
(
date_key integer not null encode delta
primary key,
full_date date encode delta
)
diststyle all
sortkey (date_key);
create table member_table
(
member_key bigint not null
primary key,
member_id integer,
location integer distkey,
sub_location integer encode zstd,
gender varchar(50) encode zstd,
race varchar(100) encode zstd,
date_of_birth date encode delta32k,
member_type char(10) encode zstd,
active char encode zstd,
row_start timestamp encode zstd,
row_end timestamp encode zstd
)
diststyle key
interleaved sortkey (location, member_id);
execution plan
I've rewritten the query in various ways, none of which meaningfully impacted performance.
The output should be
Date, member attributes, count of records
You're in luck as I solved this exact issue a few years back and wrote up a description on the solution. You can find it here - http://wad-design.s3-website-us-east-1.amazonaws.com/sql_limits_wp.html
The basic issue you are facing is the need to massively grow the data before you can condense it down. These fat-in-the-middle queries can be expensive on Redshift and often spill to disk, making them even more expensive. The solution is to not create a row for each account for each date, but rather to count account starts by date and account ends by date. The number of active accounts on any date is then the difference between the rolling sums of these two values.
I was able to take a client's query run time down from 45 minutes to 17 seconds using this approach.
If the approach isn't clear let me know in a comment and I can help apply this approach to your situation. It can trip people up the first time.
This approach can be used to solve other problems efficiently like joining on the nearest date.
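To make the starts-minus-ends idea concrete, here is a minimal Python sketch of the arithmetic only. It is not the Redshift query from the linked write-up: the function name `daily_active_counts` and the sample rows are made up for illustration, it counts rows rather than distinct members, and it ignores the grouping attributes. In SQL the same running total would typically be a windowed SUM over per-day start and end counts.

```python
from collections import Counter
from datetime import date, timedelta
from itertools import accumulate


def daily_active_counts(intervals, first_day, last_day):
    """Count rows active on each day in [first_day, last_day].

    intervals: iterable of (row_start, row_end) dates, both inclusive.
    Hypothetical helper, shown only to illustrate the technique.
    """
    delta = Counter()
    for start, end in intervals:
        if end < first_day or start > last_day:
            continue  # never active inside the reporting window
        # A row that began before the window counts as a start on first_day.
        delta[max(start, first_day)] += 1
        # The row stops being active the day after row_end. Open-ended rows
        # (for example 9999-12-31) never end inside the window.
        if end < last_day:
            delta[end + timedelta(days=1)] -= 1
    days = [first_day + timedelta(days=i)
            for i in range((last_day - first_day).days + 1)]
    # Rolling sum of (starts - ends) gives the active count per day.
    return dict(zip(days, accumulate(delta[d] for d in days)))


sample = [
    (date(2019, 1, 1), date(2019, 1, 3)),
    (date(2019, 1, 2), date(2019, 1, 5)),
    (date(2018, 12, 20), date(2019, 1, 2)),
]
counts = daily_active_counts(sample, date(2019, 1, 1), date(2019, 1, 5))
for day, active in counts.items():
    print(day, active)
```

Each source row is touched once and the output has one row per day, so nothing has to be expanded to members times days along the way.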
Related
I have a table that stores a list of members - for the sake of simplicity, I will use a simple real-world case that models my use case.
Let's use the analogy of a sports club or gym.
The membership of the gym changes every three months (for example) - with some old members leaving, some new members joining and some members staying unchanged.
I want to run a query on the table - spanning a multi-time period and return the average weight of all of the members in the club.
These are the tables I have come up with so far:
-- A table containing all members the gym has ever had
-- current members have their leave_date field left at NULL
-- departed members have their leave_date field set to the days they left the gym
CREATE TABLE IF NOT EXISTS member (
id PRIMARY KEY NOT NULL,
name TEXT NOT NULL,
join_date DATE NOT NULL,
-- set to NULL if user has not left yet
leave_date DATE DEFAULT NULL
);
-- A table of members weights.
-- This table is populated DAILY,after the weights of CURRENT members
-- has been recorded
CREATE TABLE IF NOT EXISTS current_member_weight (
id PRIMARY KEY NOT NULL,
calendar_date DATE NOT NULL,
member_id INTEGER REFERENCES member(id) NOT NULL,
weight REAL NOT NULL
);
-- I want to write a query that returns the AVERAGE daily weight of
-- CURRENT members of the gym. The query should take a starting_date
-- and an ending_date between which to calculate the daily
-- averages. The aver
-- PSEUDO SQL BELOW!
SELECT calendar_date, AVG(weight)
FROM member, current_member_weight
WHERE calendar_date BETWEEN(starting_date, ending_date);
I have two questions:
can the schema above be improved - if yes, please illustrate
How can I write an SQL* query to return the average weights calculated for all members in the gym during a specified period (t1, t2), where (t1,t2) spans a period that members have joined/left the gym?
[[Note about SQL]]
Preferably, any SQL shown would be database agnostic; however, if a particular flavour of SQL is to be used, I'd prefer PostgreSQL, since that is the database I'm using.
The SQL below would work as long as the data in the gym_member table is consistent with the joining and leaving dates of each member (i.e. for any member, the gym_member table should not have rows with a calendar_date earlier than the joining date or later than the leaving date)
SELECT
gm.calendar_date,
AVG(gm.weight) avg_weight
FROM
member m,
gym_member gm
WHERE
m.id = gm.member_id
AND
gm.calendar_date >= '1-Jan-2017'
AND
gm.calendar_date <= '31-Dec-2017'
GROUP BY
gm.calendar_date
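If the weight table cannot be trusted to respect the join and leave dates, the membership window can be checked in the query itself. The following is a sketch only, run against SQLite from Python so it is self-contained. It uses the table names from the question (`member`, `current_member_weight`), and the sample rows and the function name `daily_average_weight` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE member (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        join_date TEXT NOT NULL,      -- ISO yyyy-mm-dd strings sort like dates
        leave_date TEXT               -- NULL while still a member
    );
    CREATE TABLE current_member_weight (
        id INTEGER PRIMARY KEY,
        calendar_date TEXT NOT NULL,
        member_id INTEGER NOT NULL REFERENCES member(id),
        weight REAL NOT NULL
    );
    INSERT INTO member VALUES
        (1, 'Ann', '2017-01-01', NULL),
        (2, 'Ben', '2017-01-01', '2017-01-02');
    INSERT INTO current_member_weight (calendar_date, member_id, weight) VALUES
        ('2017-01-01', 1, 60.0),
        ('2017-01-01', 2, 80.0),
        ('2017-01-02', 1, 61.0),
        ('2017-01-02', 2, 79.0),
        ('2017-01-03', 1, 62.0),
        ('2017-01-03', 2, 90.0);      -- stray row recorded after Ben left
""")


def daily_average_weight(conn, starting_date, ending_date):
    """Average weight per day, counting only rows inside each membership."""
    sql = """
        SELECT w.calendar_date, AVG(w.weight)
        FROM current_member_weight AS w
        JOIN member AS m ON m.id = w.member_id
        WHERE w.calendar_date BETWEEN ? AND ?
          AND w.calendar_date >= m.join_date
          AND (m.leave_date IS NULL OR w.calendar_date <= m.leave_date)
        GROUP BY w.calendar_date
        ORDER BY w.calendar_date
    """
    return conn.execute(sql, (starting_date, ending_date)).fetchall()


print(daily_average_weight(conn, "2017-01-01", "2017-01-03"))
```

The stray 2017-01-03 row for the departed member is excluded, so that day's average comes from the remaining member only.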
My insurance_pay_dtl table (the insurance table) has an ins_paid_dt (insurance paid date) column.
I need to select all members who have not paid the insurance amount before the due date.
The due date is 1 year (365 days).
How do I do this?
You need to link insurance_pay_dtl table with insurance_farmer_hdr with its primary key, for example:
Select member_id, member_name from insurance_farmer_hdr ifd, insurance_pay_dtl ipd
where ifd.insurance_rec_id = ipd.insurance_rec_id and trunc(sysdate) > ifd.due_date
Change the columns in the above query to match your table columns and try it.
I've got a .Net application with an attendance table which has fields for a Start and End date. I'm struggling to show a graph of attendance for a given period. I can easily find how many rows are applicable on any given day using between but I can't get my head around pivoting results so that I can graph a count of rows per day. I could run a SQL query for every day individually and then graph the results but is there any way of doing this with T-SQL that I could then use to graph with?
Edit:-
Apologies as this is the first time I've asked a question here, but as huMpty duMpty has stated the question probably needs more clarification. I've got both a startdate and enddate column in the SQL db and I need to count, per day, the rows whose date range covers that day within the selection criteria. E.g. if I've got a start date of 2013-01-01 and end date 2013-01-10 and I report on a period of 2013-01-09 to 2013-01-11, then I'm looking at getting a result of 1 for 2013-01-09 and 1 for 2013-01-10... Hope this makes more sense and thanks for your assistance.
I think you have a table with start and end dates; for a given date range, you would like to know the number of records that fall on each date.
I believe this problem may be solved with a numbers table. I created a numbers table on the fly in a stored procedure, but I recommend creating a permanent numbers table in your production code. Here's the SQL Fiddle.
Create Table Attendance (
id int primary key identity(1,1) not null
,start_date date not null
,end_date date not null
);
Go
Insert Attendance(start_date, end_date)
Values ('1/1/2013', '1/10/2013')
,('1/10/2013', '1/15/2013')
,('2/20/2013', '3/1/2013');
Go
-- Create numbers table. See: Method 3 of http://stackoverflow.com/a/1407488/772086
With Numbers(Number) As
(
Select 1 As Number
Union All
Select Number + 1 From Numbers Where Number < 10000
)
Select
AttendanceDate = Convert(date, DateAdd(day,Numbers.number, '1/1/2000'))
,AttendanceCount = Count(*)
From dbo.Attendance
Join Numbers
On Numbers.Number
Between DateDiff(day, '1/1/2000', Attendance.start_date)
And DateDiff(day, '1/1/2000', Attendance.end_date)
-- Reporting range between 1/9 and 1/11
Where DateAdd(day,Numbers.number, '1/1/2000') Between '1/9/2013'
And '1/11/2013'
Group By Convert(date, DateAdd(day,Numbers.number, '1/1/2000'))
Option(MaxRecursion 10000);
All dates are in US format (m/d/yyyy) - you may want to switch those to the international standard (yyyy-mm-dd) in your production code.
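For readers who want to try the calendar-join idea without SQL Server, here is a rough equivalent run against SQLite from Python. This is a sketch under assumptions rather than a translation of the T-SQL above: it builds the list of days with a recursive CTE instead of a numbers table, stores dates as ISO strings, and uses a LEFT JOIN so that days with no attendance show up with a count of 0. The function name `attendance_per_day` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attendance (
        id INTEGER PRIMARY KEY,
        start_date TEXT NOT NULL,   -- ISO yyyy-mm-dd, so string order = date order
        end_date   TEXT NOT NULL
    );
    INSERT INTO attendance (start_date, end_date) VALUES
        ('2013-01-01', '2013-01-10'),
        ('2013-01-10', '2013-01-15'),
        ('2013-02-20', '2013-03-01');
""")


def attendance_per_day(conn, range_start, range_end):
    """Return (day, count) for every day in the inclusive reporting range."""
    sql = """
        WITH RECURSIVE calendar(day) AS (
            SELECT ?
            UNION ALL
            SELECT date(day, '+1 day') FROM calendar WHERE day < ?
        )
        SELECT c.day, COUNT(a.id)
        FROM calendar AS c
        LEFT JOIN attendance AS a
               ON c.day BETWEEN a.start_date AND a.end_date
        GROUP BY c.day
        ORDER BY c.day
    """
    return conn.execute(sql, (range_start, range_end)).fetchall()


# Reporting range between 2013-01-09 and 2013-01-11, as in the question.
for day, count in attendance_per_day(conn, "2013-01-09", "2013-01-11"):
    print(day, count)
```

With the three sample rows, 2013-01-10 is covered by two attendance records and the days on either side by one each.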
You said you wanted a count by day in a date range. That can be done with a COUNT with a GROUP BY clause.
I don't know your schema, but a solution might look like this:
declare @MyTable table
(
ID int identity(1,1) primary key clustered,
MyDate smalldatetime
)
insert into @MyTable (MyDate)
values
('2012-12-31'), -- before the date range, so not included in results
('2013-01-10'),
('2013-01-10'), -- appears twice
('2013-01-11'), -- appears once
('2013-01-12') -- after the date range, so not included in results
select * from @MyTable
select
MyDate,
count(*)
from @MyTable
where MyDate between '2013-01-09' and '2013-01-11'
group by MyDate
My memory is failing me. I have a simple audit log table based on a trigger:
ID int (identity, PK)
CustomerID int
Name varchar(255)
Address varchar(255)
AuditDateTime datetime
AuditCode char(1)
It has data like this:
ID  CustomerID  Name     Address              AuditDateTime            AuditCode
1   123         Bob      123 Internet Way     2009-07-17 13:18:06.353  I
2   123         Bob      123 Internet Way     2009-07-17 13:19:02.117  D
3   123         Jerry    123 Internet Way     2009-07-17 13:36:03.517  I
4   123         Bob      123 My Edited Way    2009-07-17 13:36:08.050  U
5   100         Arnold   100 SkyNet Way       2009-07-17 13:36:18.607  I
6   100         Nicky    100 Star Way         2009-07-17 13:36:25.920  U
7   110         Blondie  110 Another Way      2009-07-17 13:36:42.313  I
8   113         Sally    113 Yet another Way  2009-07-17 13:36:57.627  I
What would be an efficient select statement to get the most current record for each customer between a start and end time? FYI: I for insert, D for delete, and U for update.
Am I missing anything in the audit table? My next step is to create an audit table that only records changes, yet you can extract the most recent records for the given time frame. For the life of me I cannot find it on any search engine easily. Links would work too. Thanks for the help.
Another (better?) method to keep audit history is to use a 'startDate' and 'endDate' column rather than an auditDateTime and AuditCode column. This is often the approach in tracking Type 2 changes (new versions of a row) in data warehouses.
This lets you more directly select the current rows (WHERE endDate is NULL), and you will not need to treat updates differently than inserts or deletes. You simply have three cases:
Insert: copy the full row along with a start date and NULL end date
Delete: set the End Date of the existing current row (endDate is NULL)
Update: do a Delete then Insert
Your select would simply be:
select * from AuditTable where endDate is NULL
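The three cases above can be sketched as follows. This is an illustration against SQLite from Python, not the trigger code itself: the helper names (`insert_row`, `delete_row`, `update_row`, `current_rows`, `rows_as_of`) are hypothetical, timestamps are stored as fixed-format strings, and the replayed events are taken from the sample data in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE AuditTable (
        customerId INTEGER NOT NULL,
        name       TEXT,
        address    TEXT,
        startDate  TEXT NOT NULL,
        endDate    TEXT              -- NULL marks the current version
    )
""")


def insert_row(conn, customer_id, name, address, when):
    # Insert: copy the full row with a start date and a NULL end date.
    conn.execute("INSERT INTO AuditTable VALUES (?, ?, ?, ?, NULL)",
                 (customer_id, name, address, when))


def delete_row(conn, customer_id, when):
    # Delete: close the existing current row.
    conn.execute("UPDATE AuditTable SET endDate = ? "
                 "WHERE customerId = ? AND endDate IS NULL",
                 (when, customer_id))


def update_row(conn, customer_id, name, address, when):
    # Update: a delete followed by an insert.
    delete_row(conn, customer_id, when)
    insert_row(conn, customer_id, name, address, when)


def current_rows(conn):
    return conn.execute("SELECT customerId, name, address FROM AuditTable "
                        "WHERE endDate IS NULL ORDER BY customerId").fetchall()


def rows_as_of(conn, when):
    # Versions that were valid at a given point in time.
    return conn.execute(
        "SELECT customerId, name, address FROM AuditTable "
        "WHERE startDate <= ? AND (endDate IS NULL OR endDate > ?) "
        "ORDER BY customerId", (when, when)).fetchall()


insert_row(conn, 123, "Bob", "123 Internet Way", "2009-07-17 13:18:06.353")
delete_row(conn, 123, "2009-07-17 13:19:02.117")
insert_row(conn, 123, "Jerry", "123 Internet Way", "2009-07-17 13:36:03.517")
update_row(conn, 123, "Bob", "123 My Edited Way", "2009-07-17 13:36:08.050")
insert_row(conn, 100, "Arnold", "100 SkyNet Way", "2009-07-17 13:36:18.607")
update_row(conn, 100, "Nicky", "100 Star Way", "2009-07-17 13:36:25.920")

print(current_rows(conn))
print(rows_as_of(conn, "2009-07-17 13:36:05.000"))
```

A side benefit of this layout is that a point-in-time query needs only a range test on startDate and endDate, with no special handling per audit code.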
Anyway, here's my query for your existing schema:
declare @from datetime
declare @to datetime
select b.* from (
select
customerId,
max(auditdatetime) 'auditDateTime'
from
AuditTable
where
/* deletes must be included so the HAVING clause below can count them */
auditcode in ('I', 'U', 'D')
and auditdatetime between @from and @to
group by customerId
having
/* rely on "current" being defined as INSERTS > DELETES */
sum(case when auditcode = 'I' then 1 else 0 end) >
sum(case when auditcode = 'D' then 1 else 0 end)
) a
cross apply(
select top 1 customerId, name, address, auditdateTime
from AuditTable
where auditdatetime = a.auditdatetime and customerId = a.customerId
) b
Ok, a couple of things for audit log tables.
For most applications, we want audit tables to be extremely quick on insertion.
If the audit log is truly for diagnostic or for very irregular audit reasons, then the quickest insertion criteria is to make the table physically ordered upon insertion time.
And this means to put the audit time as the first column of the clustered index, e.g.
create unique clustered index idx_mytable on mytable(AuditDateTime, ID)
This will allow for extremely efficient select queries upon AuditDateTime O(log n), and O(1) insertions.
If you wish to look up your audit table on a per CustomerID basis, then you will need to compromise.
You may add a nonclustered index upon (CustomerID, AuditDateTime), which will allow for O(log n) lookup of per-customer audit history; however, the cost is maintaining that nonclustered index on every insertion, which is itself O(log n).
However that insertion time penalty may be preferable to the table scan (that is, O(n) time complexity cost) that you will need to pay if you don't have an index on CustomerID and this is a regular query that is performed.
An O(n) lookup that locks the table against writers for an irregular query may block those writers. So it is sometimes in the writers' interest to be slightly slower on insert if that guarantees readers will not block their commits with table scans caused by the lack of a good supporting index.
Addition: if you are looking to restrict to a given timeframe, the most important thing first of all is the index upon AuditDateTime. And make it clustered as you are inserting in AuditDateTime order. This is the biggest thing you can do to make your query efficient from the start.
Next, if you are looking for the most recent update for each CustomerID within a given timespan, a full scan of the data within that insertion-date range is required.
You will need to do a subquery upon your audit table, between the range,
select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
group by CustomerID
and then incorporate that into your select query proper, eg.
select AuditTrail.* from AuditTrail
inner join
(select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
group by CustomerID
) filtration
on filtration.CustomerID = AuditTrail.CustomerID and
filtration.MaxAuditDateTime = AuditTrail.AuditDateTime
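As a quick check of the join-back-on-the-maximum pattern, here is a sketch that runs it against SQLite from Python using the sample rows from the question. The subquery groups by CustomerID and the outer join matches on the per-customer maximum timestamp. The function name `latest_per_customer` and the string timestamp format are assumptions made for the example, and the clustered-index advice above is not modelled here.

```python
import sqlite3

ROWS = [
    (1, 123, "Bob", "123 Internet Way", "2009-07-17 13:18:06.353", "I"),
    (2, 123, "Bob", "123 Internet Way", "2009-07-17 13:19:02.117", "D"),
    (3, 123, "Jerry", "123 Internet Way", "2009-07-17 13:36:03.517", "I"),
    (4, 123, "Bob", "123 My Edited Way", "2009-07-17 13:36:08.050", "U"),
    (5, 100, "Arnold", "100 SkyNet Way", "2009-07-17 13:36:18.607", "I"),
    (6, 100, "Nicky", "100 Star Way", "2009-07-17 13:36:25.920", "U"),
    (7, 110, "Blondie", "110 Another Way", "2009-07-17 13:36:42.313", "I"),
    (8, 113, "Sally", "113 Yet another Way", "2009-07-17 13:36:57.627", "I"),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE AuditTrail (
        ID INTEGER PRIMARY KEY,
        CustomerID INTEGER,
        Name TEXT,
        Address TEXT,
        AuditDateTime TEXT,
        AuditCode TEXT
    )
""")
conn.executemany("INSERT INTO AuditTrail VALUES (?, ?, ?, ?, ?, ?)", ROWS)


def latest_per_customer(conn, begin, end):
    """Most recent audit row per customer inside [begin, end]."""
    sql = """
        SELECT t.ID, t.CustomerID, t.Name, t.AuditCode
        FROM AuditTrail AS t
        INNER JOIN (
            SELECT CustomerID, MAX(AuditDateTime) AS MaxAuditDateTime
            FROM AuditTrail
            WHERE AuditDateTime >= ? AND AuditDateTime <= ?
            GROUP BY CustomerID
        ) AS filtration
          ON filtration.CustomerID = t.CustomerID
         AND filtration.MaxAuditDateTime = t.AuditDateTime
        ORDER BY t.CustomerID
    """
    return conn.execute(sql, (begin, end)).fetchall()


print(latest_per_customer(conn, "2009-07-17 13:00:00.000", "2009-07-17 14:00:00.000"))
```

Note that this returns the latest row whatever its audit code is, so a customer whose last event is a delete would still appear; filtering those out is a separate step.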
Another approach is using a sub select
select a.ID
, a.CustomerID
, a.Name
, a.Address
, a.AuditDateTime
, a.AuditCode
from myauditlogtable a,
(select s.CustomerID, max(s.AuditDateTime) as maxdt
from myauditlogtable as s
group by s.CustomerID)
as subq
where subq.CustomerID=a.CustomerID and subq.maxdt=a.AuditDateTime;
start and end time? e.g as in between 1am to 3am
or start and end date time? e.g as in 2009-07-17 13:36 to 2009-07-18 13:36
Here's the query:
SELECT
count(id) AS count
FROM `numbers`
GROUP BY
MONTH(created_at),
YEAR(created_at)
ORDER BY
YEAR(created_at),
MONTH(created_at)
That query throws a 'Using temporary' and 'Using filesort' when doing EXPLAIN.
Ultimately what I'm doing is looking at a table of user-submitted tracking numbers, counting the number of submitted rows, and grouping the counts by month/year.
ie. In November 2008 there were 11,312 submitted rows.
UPDATE, here's the DESCRIBE for the numbers table.
Field               Type          Null  Key  Default  Extra
id                  int(11)       NO    PRI  NULL     auto_increment
tracking            varchar(255)  YES        NULL
service             varchar(255)  YES        NULL
notes               text          YES        NULL
user_id             int(11)       YES        NULL
active              tinyint(1)    YES        1
deleted             tinyint(1)    YES        0
feed                text          YES        NULL
status              varchar(255)  YES        NULL
created_at          datetime      YES        NULL
updated_at          datetime      YES        NULL
scheduled_delivery  date          YES        NULL
carrier_service     varchar(255)  YES        NULL
Give this a shot:
SELECT COUNT(x.id)
FROM (SELECT t.id,
MONTH(t.created_at) 'created_month',
YEAR(t.created_at) 'created_year'
FROM NUMBERS t) x
GROUP BY x.created_month, x.created_year
ORDER BY x.created_year, x.created_month
It's not a good habit to use functions in the WHERE, GROUP BY and ORDER BY clauses because indexes can't be used.
...query throws a 'Using temporary' and 'Using filesort' when doing EXPLAIN.
From what I found, that's to be expected when using DISTINCT/GROUP BY.
Make sure you have a covering index over YEAR and MONTH (that is, both fields within the same index) so that the ORDER BY component of your query can use an index. This should remove the need for a filesort, although a temporary table may still be needed to handle the grouping.
SELECT
count(`id`) AS count, MONTH(`created_at`) as month, YEAR(`created_at`) as year
FROM `numbers`
GROUP BY month, year
ORDER BY year, month
This will be the best you can get, as far as I can tell. I created a table with an id and a datetime column and filled it with 10000 rows. The other answer's query uses a sub select, but that makes no real difference to the result and only adds the overhead of the sub select. The resulting time for mine was 0.015s and his was 0.016s.
Make sure that you have an index on created_at; this will help your initial query out. It is pretty rare to not end up with a filesort when a GROUP BY is involved, but it may be possible in other situations. MySQL's docs have an article about this if you feel so inclined. I do not see how those methods can be applied here, with the information you have provided.
Whenever MySQL has to do work in memory, and that work exceeds the available amount (innodb_buffer_pool_size), it starts having to use the disk to store temporary work. You could increase the variable I mentioned, but setting it too high could cause performance problems in other areas.
If you're running a dedicated server, set it to ~50-75%.
The best method would be creating a helper column that would contain the numeric values of YEAR and MONTH combined into one number:
YEAR(created_at) * 100 + MONTH(created_at)
Grouping on this column would use INDEX FOR GROUP BY.
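A small sketch of the helper-column idea, using SQLite from Python as a stand-in for MySQL. The column name `created_ym`, the index name, and the functions `year_month_key` and `add_number` are all made up for illustration, and the column is filled by the application here (a trigger or a generated column would be alternatives). SQLite will not reproduce MySQL's EXPLAIN output, so this shows the data layout and the grouping query only, not the index-for-group-by plan.

```python
import sqlite3
from datetime import datetime


def year_month_key(ts):
    """YEAR(created_at) * 100 + MONTH(created_at), e.g. 200811."""
    return ts.year * 100 + ts.month


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE numbers (
        id INTEGER PRIMARY KEY,
        created_at TEXT,
        created_ym INTEGER      -- hypothetical helper column
    )
""")
conn.execute("CREATE INDEX idx_numbers_created_ym ON numbers (created_ym)")


def add_number(conn, created_at):
    # Keep the helper column in step with created_at on every insert.
    conn.execute(
        "INSERT INTO numbers (created_at, created_ym) VALUES (?, ?)",
        (created_at.isoformat(sep=" "), year_month_key(created_at)),
    )


for ts in [
    datetime(2008, 11, 3, 9, 0),
    datetime(2008, 11, 20, 17, 30),
    datetime(2008, 12, 1, 8, 15),
    datetime(2009, 1, 5, 12, 0),
]:
    add_number(conn, ts)

counts = conn.execute(
    "SELECT created_ym, COUNT(*) FROM numbers "
    "GROUP BY created_ym ORDER BY created_ym"
).fetchall()
print(counts)
```

Because the key is year times 100 plus month, sorting on it numerically also sorts chronologically, so the same column serves both the GROUP BY and the ORDER BY.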
However, you can create two helper tables, the first one containing reasonable number of years (say, from 1900 to 2100), the second one containing months (from 0 to 11), and use these tables to generate the sets:
SELECT (
SELECT COUNT(*)
FROM numbers
WHERE created_at >= '1900-01-01' + INTERVAL (y - 1900) YEAR + INTERVAL m MONTH
AND created_at < '1900-01-01' + INTERVAL (y - 1900) YEAR + INTERVAL (m + 1) MONTH
)
FROM year_table
CROSS JOIN
month_table
WHERE y BETWEEN 2008 AND 2010
I'm sorry, but I have to disagree with the other answers.
I think what you need is to add an index to your table, preferably a covering index.
If you add an index on the columns you are searching on (created_at) and also on the columns you want to get a result from (id) then it will be dramatically faster than before.
The reason why you are using a temp table is because you use a group by.
To speed up the group by, you can change the MySQL server settings to increase the size of the tmp table and the max heap table size so that the temp table will be in memory.