Filtering a large table by date - sql

I have a table, VISIT_INFO, with these columns:
pers_key - unique identifier for each person
pers_name - name of person
visit_date - date at which they visited a business
And another table, VALID_DATES, with these columns:
condition - string
start_date - date
end_date - date
I currently have the following query:
select pers_key, pers_name from VISIT_INFO a
CROSS JOIN
(select start_date, end_date from VALID_DATES where condition = 'condition1') b
WHERE (a.visit_date >= b.start_date and a.visit_date <= b.end_date)
GROUP BY a.pers_key, a.pers_name
So 'condition1' has a specific start_date and end_date. I need to filter VISIT_INFO for visits that are between the two dates. I'm wondering if there is a more efficient way to do this. From my current understanding, it currently has to go through the entire table (millions of rows) and add start_date and end_date to each row. Then does it have to go through each row again and test against the WHERE condition?
I ask this because when I remove the cross join and hardcode the start_date and end_date for condition1, it takes substantially less time. I'm trying to avoid hardcoding in the dates because it will lead to serious tedium down the road.
So to reiterate, is there a better way to filter VISIT_INFO by specific dates in VALID_DATES?
Edit: I just realized I left out a pretty huge piece of information, being that this is all in HIVE. So EXISTS and joins on (a between b and c) are out of the question.

How about:
SELECT DISTINCT pers_key, pers_name
FROM visit_info
WHERE EXISTS
(
SELECT 1
FROM valid_dates
WHERE condition = 'condition1'
AND visit_date BETWEEN start_date AND end_date
);
?

with dt as (select start_date, end_date from VALID_DATES where condition = 'condition1')
select a.pers_key, a.pers_name
from VISIT_INFO a
JOIN dt on a.visit_date between dt.start_date and dt.end_date
GROUP BY a.pers_key, a.pers_name

Trying the exists version is definitely a possibility. However, you might be better off expanding the VALID_DATES table, so there is one row per date.
Then, the query:
select vi.*
from VISIT_INFO vi JOIN
VALID_DATES_expanded vde
ON vi.visit_date = vde.valid_date
where vde.condition = 'condition1';
can make use of an index on VISIT_INFO(visit_date) and on VALID_DATES_expanded(condition, valid_date). This is likely to be the fastest approach to solving this problem, if VISIT_INFO is very large and relatively few rows are being selected by the query.
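As a rough sketch of the expansion step, here is a small runnable version using Python's built-in sqlite3 module as a stand-in for the real database. The sample rows are made up, and the recursive CTE is just one way to build VALID_DATES_expanded; in Hive you would have to generate the rows some other way. Once the expanded table exists, the date filter is a plain equi-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit_info (pers_key INTEGER, pers_name TEXT, visit_date TEXT);
CREATE TABLE valid_dates (condition TEXT, start_date TEXT, end_date TEXT);

-- Hypothetical sample data, not from the question.
INSERT INTO visit_info VALUES
    (1, 'Ann', '2020-01-02'),
    (2, 'Bob', '2020-01-10'),
    (3, 'Cy',  '2020-01-04');
INSERT INTO valid_dates VALUES
    ('condition1', '2020-01-01', '2020-01-05'),
    ('condition2', '2020-01-08', '2020-01-12');

-- One row per (condition, date): walk each range one day at a time.
CREATE TABLE valid_dates_expanded AS
WITH RECURSIVE days(condition, valid_date, end_date) AS (
    SELECT condition, start_date, end_date FROM valid_dates
    UNION ALL
    SELECT condition, date(valid_date, '+1 day'), end_date
    FROM days
    WHERE valid_date < end_date
)
SELECT condition, valid_date FROM days;
""")

# The range test has become an equality join on the date column.
rows = conn.execute("""
    SELECT vi.pers_key, vi.pers_name
    FROM visit_info vi
    JOIN valid_dates_expanded vde ON vi.visit_date = vde.valid_date
    WHERE vde.condition = 'condition1'
    ORDER BY vi.pers_key
""").fetchall()
print(rows)
```

The expanded table only has to be rebuilt when VALID_DATES changes, so the cost of generating it is paid once rather than per query.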

Related

PostgreSQL GROUP BY that includes zeros

I have a SQL query (postgresql) that looks something like this:
SELECT
my_timestamp::timestamp::date as the_date,
count(*) as count
FROM my_table
WHERE ...
GROUP BY the_date
ORDER BY the_date
The result is a table of YYYY-MM-DD, count pairs.
Now I've been asked to fill in the empty dates with zero. So if I was previously providing
2022-03-15 3
2022-03-17 1
I'd now want to return
2022-03-15 3
2022-03-16 0
2022-03-17 1
Now I can easily do this client-side (relative to the database) and let my program compute and return the zero-augmented list to its clients based on the original list from postgres. But perhaps it would be better if I could just tell postgresql to include zeros.
I suspect this isn't easy at all, because postgres has no obvious way of knowing what I'm up to. But in the interests of learning more about postgres and SQL, I thought I'd have a try. The try isn't too promising thus far...
Any pointers before I conclude that I was right to leave this to my (postgres client) program?
Update
This is an interesting case where my simplification of the problem led to a correct answer that didn't work for me. For those who come after, I thought it worth documenting what followed, because it takes some fun twists through constructing SQL queries.
@a_horse_with_no_name responded with a query that I've verified works if I simplify my own query to match. Unfortunately, my query had some extra baggage that I didn't think pertinent, and so I had trimmed it out when posting the original question.
Here's my real (original) query, with all names preserved (if shortened):
-- current query
SELECT
LEAST(time1, time2, time3, time4)::timestamp::date as the_date,
count(*) as count
FROM reading_group_reader rgr
INNER JOIN ( SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
WHERE LEAST(time1, time2, time3, time4) > current_date - 30
GROUP BY the_date
ORDER BY the_date;
If I translate that directly into the proposed solution, however, the inner join between reading_group_reader and the temporary table TT causes the left join to become inner (I think) and the date sequence drops its zeros again. Fwiw, the table TT is a table because sometimes it actually is a subselect.
So I transformed my query into this:
SELECT
g.dt::date as the_date,
count(*) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY the_date;
but this outputs 1's instead of 0's at the places that should be 0.
The reason is that the LEFT JOIN now yields a row for every date, so count(*) sees at least one row per date. I need to select an additional field from the joined table (which will be NULL for unmatched dates) and count that instead.
So this query finally does what I want:
SELECT
g.dt::date as the_date,
count(rgrt.device_id) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date,
rgr.device_id
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)
) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt(the_date)
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY g.dt;
And, of course, on re-reading the accepted answer, I eventually saw that he did count an unrelated field, which I'd simply missed on my first several readings.
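For anyone who wants to see the count(*) versus count(column) difference in isolation, here is a tiny sketch using Python's sqlite3 module instead of Postgres. The days and readings tables and their contents are invented for the demo; only the counting behaviour matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical stand-ins for the date series and the joined subquery.
CREATE TABLE days (dt TEXT);
CREATE TABLE readings (the_date TEXT, device_id INTEGER);
INSERT INTO days VALUES ('2022-03-15'), ('2022-03-16'), ('2022-03-17');
INSERT INTO readings VALUES
    ('2022-03-15', 1), ('2022-03-15', 2), ('2022-03-15', 3),
    ('2022-03-17', 4);
""")

counts = conn.execute("""
    SELECT d.dt,
           count(*)           AS star_count,   -- also counts the NULL-extended row
           count(r.device_id) AS column_count  -- skips NULLs, so empty days give 0
    FROM days d
    LEFT JOIN readings r ON r.the_date = d.dt
    GROUP BY d.dt
    ORDER BY d.dt
""").fetchall()

for row in counts:
    print(row)
```

The middle day has no readings, so the outer join produces one row for it with NULLs on the right-hand side: count(*) reports 1, count(r.device_id) reports 0.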
You will need to join to a list of dates. This can e.g. be done using generate_series()
SELECT g.dt::date as the_date,
count(t.my_timestamp) as count
FROM generate_series(date '2022-03-01',
date '2022-03-31',
interval '1 day') as g(dt)
LEFT JOIN my_table as t
ON t.my_timestamp::date = g.dt::date
AND ... -- the original WHERE clause goes here!
GROUP BY the_date
ORDER BY the_date;
Note that the original WHERE conditions need to go into the join condition of the LEFT JOIN. You can't put them into a WHERE clause because that would turn the outer join back into an inner join (which means the missing dates wouldn't be returned).
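The effect of moving the filter can be checked with a small experiment. This is a sketch in Python with sqlite3 rather than Postgres; the events table, its kind column, and the data are assumptions made up for the demo, and a fixed days table stands in for generate_series().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE days (dt TEXT);
CREATE TABLE events (ts TEXT, kind TEXT);
INSERT INTO days VALUES ('2022-03-15'), ('2022-03-16'), ('2022-03-17');
INSERT INTO events VALUES
    ('2022-03-15 09:00:00', 'a'), ('2022-03-15 10:00:00', 'a'),
    ('2022-03-15 11:00:00', 'b'), ('2022-03-16 12:00:00', 'b'),
    ('2022-03-17 13:00:00', 'a');
""")

# Filter inside the join condition: every day survives, unmatched days count 0.
in_on = conn.execute("""
    SELECT d.dt, count(t.ts)
    FROM days d
    LEFT JOIN events t ON date(t.ts) = d.dt AND t.kind = 'a'
    GROUP BY d.dt
    ORDER BY d.dt
""").fetchall()

# Same filter in WHERE: rows where t.kind is NULL or 'b' are discarded,
# so days without a matching event disappear from the result.
in_where = conn.execute("""
    SELECT d.dt, count(t.ts)
    FROM days d
    LEFT JOIN events t ON date(t.ts) = d.dt
    WHERE t.kind = 'a'
    GROUP BY d.dt
    ORDER BY d.dt
""").fetchall()

print(in_on)
print(in_where)
```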

Get apps with the highest review count since a dynamic series of days

I have two tables, apps and reviews (simplified for the sake of discussion):
apps table
id int
reviews table
id int
review_date date
app_id int (foreign key that points to apps)
2 questions:
1. How can I write a query / function to answer the following question?:
Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?
I think I know how to write a query if given an explicit date:
SELECT
apps.id,
count(reviews.*)
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
group by
1
having
min(reviews.review_date) >= '2020-01-01'
order by 2 desc
limit 10;
But I don't know how to query this dynamically given the desired date series and compile all this information in a single view.
2. What's the best way to model this data?
It would be nice to have the # of reviews at the time for each date as well as the app_id. As of now I'm thinking something that might look like:
... 2020-01-01_app_id | 2020-01-01_review_count | 2020-01-02_app_id | 2020-01-02_review_count ...
But I'm wondering if there's a better way to do this. Stitching the data together also seems like a challenge.
I think this is what you are looking for:
Postgres 13 or newer
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT app_id, total_ct
FROM cte c
WHERE c.earliest_review >= d.review_window_start
ORDER BY total_ct DESC
FETCH FIRST 1 ROWS WITH TIES -- new & hot
) sub
GROUP BY 1
) a ON true;
WITH TIES makes it a bit cheaper. Added in Postgres 13 (currently beta). See:
Get top row(s) with highest value, with ties
Postgres 12 or older
WITH cte AS ( -- MATERIALIZED
SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
FROM reviews
GROUP BY 1
)
SELECT *
FROM (
SELECT generate_series(min(review_date)
, max(review_date)
, '1 day')::date
FROM reviews
) d(review_window_start)
LEFT JOIN LATERAL (
SELECT total_ct, array_agg(app_id) AS apps
FROM (
SELECT total_ct, app_id
, rank() OVER (ORDER BY total_ct DESC) AS rnk
FROM cte c
WHERE c.earliest_review >= d.review_window_start
) sub
WHERE rnk = 1
GROUP BY 1
) a ON true;
Same as above, but without WITH TIES.
We don't need to involve the table apps at all. The table reviews has all information we need.
The CTE cte computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.
It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED in Postgres 12 or later to force it. See:
How to force evaluation of subquery before joining / pushing down to foreign server
The optimized generate_series() call produces the series of days from earliest to latest review. See:
Generating time series between two dates in PostgreSQL
Join a count query on generate_series() and retrieve Null values as '0'
Finally, the LEFT JOIN LATERAL you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start. Alternatively, define tiebreaker(s) to get at most one winner. See:
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
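To make it easier to check what the query is supposed to return, here is the same logic written out in plain Python. It is only an illustration of the computation, not a replacement for the SQL, and the review data is invented: per app, compute the earliest review and the total count (the CTE), then for each day keep the apps whose earliest review is on or after that day and return all apps tied for the highest total.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample data: (app_id, review_date) pairs.
reviews = [
    (1, date(2020, 1, 1)), (1, date(2020, 1, 2)), (1, date(2020, 1, 3)),
    (2, date(2020, 1, 2)), (2, date(2020, 1, 3)),
    (3, date(2020, 1, 2)), (3, date(2020, 1, 2)),
]

# Equivalent of the CTE: earliest review and total review count per app.
earliest = {}
total = defaultdict(int)
for app_id, d in reviews:
    earliest[app_id] = min(earliest.get(app_id, d), d)
    total[app_id] += 1

first_day = min(d for _, d in reviews)
last_day = max(d for _, d in reviews)

# Equivalent of generate_series() plus the lateral subquery.
result = {}
day = first_day
while day <= last_day:
    eligible = [a for a in total if earliest[a] >= day]
    if eligible:
        best = max(total[a] for a in eligible)
        result[day] = (best, sorted(a for a in eligible if total[a] == best))
    else:
        # Mirrors the LEFT JOIN: the day is kept even with no winner.
        result[day] = (None, [])
    day += timedelta(days=1)

for d, winners in result.items():
    print(d, winners)
```

The last day has no eligible app in this sample, which corresponds to the NULL row the LEFT JOIN LATERAL keeps.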
If you are looking for hints, then here are a few:
1. Are you aware of generate_series() and how to use it to compose a table of dates given a start and end date? If not, then there are plenty of examples on this site.
2. To answer this question for any given date, you need to have only two measures for each app, and only one of these is used to compare an app against other apps. Your query in part 1 shows that you know what these two measures are.
Hints 1 and 2 should be enough to get this done. The only thing I can add is for you not to worry about making the database do "too much work." That is what it is there to do. If it does not do it quickly enough, then you can think about optimizations, but before you get to that step, concentrate on getting the answer that you want.
Please comment if you need further clarification on this.
The missing piece for me was lateral join.
I can accomplish just about what I want using the following:
select
review_windows.review_window_start,
id,
review_total,
earliest_review
from
(
select
date_trunc('day', review_windows.review_windows) :: date as review_window_start
from
generate_series(
(
SELECT
min(reviews.review_date)
FROM
reviews
),
(
SELECT
max(reviews.review_date)
FROM
reviews
),
'1 year'
) review_windows
order by
1 desc
) review_windows
left join lateral (
SELECT
apps.id,
count(reviews.*) as review_total,
min(reviews.review_date) as earliest_review
FROM
reviews
INNER JOIN apps ON apps.id = reviews.app_id
where
reviews.review_date >= review_windows.review_window_start
group by
1
having
min(reviews.review_date) >= review_windows.review_window_start
order by
2 desc,
3 desc
limit
2
) apps_most_reviews on true;

Recursive subtraction from two separate tables to fill in historical data

I have two datasets hosted in Snowflake with social media follower counts by day. The main table we will be using going forward (follower_counts) shows follower counts by day:
This table is live as of 4/4/2020 and will be updated daily. Unfortunately, I am unable to get historical data in this format. Instead, I have a table with historical data (follower_gains) that shows net follower gains by day for several accounts:
Ideally - I want to take the follower_count value from the minimum date in the current table (follower_counts) and subtract the sum of gains (organic + paid gains) for each day, until the minimum date of the follower_gains table, to fill in the follower_count historically. In addition, there are several accounts with data in these tables, so it would need to be grouped by account. It should look like this:
I've only gotten as far as unioning these two tables together, but don't even know where to start with looping through these rows:
WITH a AS (
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
total_followers_count,
null AS organic_follower_gain,
null AS paid_follower_gain,
account_name,
last_update
FROM follower_counts
UNION ALL
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
null AS total_followers_count,
organic_follower_gain,
paid_follower_gain,
account_name,
last_update
FROM follower_gains)
SELECT
a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.total_followers_count,
a.organic_follower_gain,
a.paid_follower_gain,
a.account_name,
a.last_update
FROM a
ORDER BY date desc LIMIT 100
UPDATE: Changed union to union all and added not exists to remove duplicates. Made changes per the comments.
NOTE: Please make sure you don't post images of the tables. It's difficult to recreate your scenario to write a correct query. Test this solution and update so that I can make modifications if necessary.
You don't loop through rows in SQL because it's not a procedural language. The operation you define in the query is applied to all the rows in a table at once.
with cte as (SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
(a.follower_count - (b.organic_gain+b.paid_gain)) AS follower_count,
a.account_name,
a.last_update,
b.organic_gain,
b.paid_gain
FROM follower_counts a
JOIN follower_gains b ON a.account_id = b.account_id
AND b.date < (select min(date) from
follower_counts c where a.account_id = c.account_id)
)
SELECT b.account_id,
b.date,
b.organizational_entity,
b.organizational_entity_type,
b.vanity_name,
b.localized_name,
b.localized_website,
b.organization_type,
b.follower_count,
b.account_name,
b.last_update,
b.organic_gain,
b.paid_gain
FROM cte b
UNION ALL
SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.follower_count,
a.account_name,
a.last_update,
NULL as organic_gain,
NULL as paid_gain
FROM follower_counts a where not exists (select 1 from
follower_gains c where a.account_id = c.account_id AND a.date = c.date)
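If the goal is a running subtraction (each historical day subtracts all gains between that day and the first live count), a window function is one way to get it. The following is a sketch, not a tested Snowflake query: it runs on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer), the sample numbers are invented, and it assumes the column names from the question's query. It also assumes each daily count is taken before that day's gains are applied, so count(d) = count(d+1) - gain(d); if your snapshots work the other way round, the window frame has to shift by one day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE follower_counts (account_id INTEGER, date TEXT, total_followers_count INTEGER);
CREATE TABLE follower_gains (account_id INTEGER, date TEXT,
                             organic_follower_gain INTEGER, paid_follower_gain INTEGER);

-- Hypothetical sample data.
INSERT INTO follower_counts VALUES
    (1, '2020-04-04', 1000), (1, '2020-04-05', 1010), (2, '2020-04-04', 500);
INSERT INTO follower_gains VALUES
    (1, '2020-04-01', 5, 0), (1, '2020-04-02', 3, 2), (1, '2020-04-03', 10, 0),
    (2, '2020-04-02', 1, 1), (2, '2020-04-03', 0, 4);
""")

backfill = conn.execute("""
    WITH anchor AS (
        -- First live count per account: the starting point for the subtraction.
        SELECT c.account_id, c.date AS first_date, c.total_followers_count
        FROM follower_counts c
        WHERE c.date = (SELECT min(m.date) FROM follower_counts m
                        WHERE m.account_id = c.account_id)
    )
    SELECT g.account_id,
           g.date,
           a.total_followers_count
             - sum(g.organic_follower_gain + g.paid_follower_gain) OVER (
                   PARTITION BY g.account_id
                   ORDER BY g.date DESC
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
               ) AS total_followers_count
    FROM follower_gains g
    JOIN anchor a
      ON a.account_id = g.account_id
     AND g.date < a.first_date
    ORDER BY g.account_id, g.date
""").fetchall()

for row in backfill:
    print(row)
```

The result can then be combined with follower_counts through UNION ALL, as in the query above.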
You could do something like this. Instead of using the variable, you can just wrap the whole expression in another pair of brackets and write ) AS FollowerGrowth at the end.
DECLARE @FollowerGrowth INT =
( SELECT total_followers_count
FROM follower_gains
WHERE AccountID = xx )
-
( SELECT TOP 1 follower_count
FROM follower_counts
WHERE AccountID = xx
ORDER BY date ASC )

Unpivot date columns to a single column of a complex query in Oracle

Hi guys, I am stuck with a stubborn problem which I am unable to solve. I am trying to compile a report wherein all the dates coming from different tables need to go into a single date field in the report. Of course, the max or most recent date from all these date columns needs to be added to the single date column for the report. I have multiple users of multiple branches/courses for whom the report would be generated.
There are multiple blogs, and the latest date with respect to the blogtitle needs to be grouped, i.e. max(date_value) from the six date columns should give the greatest or latest date for that blogtitle.
Expected Result:
select u.batch_uid as ext_person_key, u.user_id, cm.batch_uid as ext_crs_key, cm.crs_id, ir.role_id as
insti_role, (CASE when b.JOURNAL_IND = 'N' then
'BLOG' else 'JOURNAL' end) as item_type, gm.title as item_name, gm.disp_title as ITEM_DISP_NAME, be.blog_pk1 as be_blogPk1, bc.blog_entry_pk1 as bc_blog_entry_pk1,bc.pk1,
b.ENTRY_mod_DATE as b_ENTRY_mod_DATE ,b.CMT_mod_DATE as BlogCmtModDate, be.CMT_mod_DATE as be_cmnt_mod_Date,
b.UPDATE_DATE as BlogUpDate, be.UPDATE_DATE as be_UPDATE_DATE,
bc.creation_date as bc_creation_date,
be.CREATOR_USER_ID as be_CREATOR_USER_ID , bc.creator_user_id as bc_creator_user_id,
b.TITLE as BlogTitle, be.TITLE as be_TITLE,
be.DESCRIPTION as be_DESCRIPTION, bc.DESCRIPTION as bc_DESCRIPTION
FROM users u
INNER JOIN insti_roles ir on u.insti_roles_pk1 = ir.pk1
INNER JOIN crs_users cu ON u.pk1 = cu.users_pk1
INNER JOIN crs_mast cm on cu.crsmast_pk1 = cm.pk1
INNER JOIN blogs b on b.crsmast_pk1 = cm.pk1
INNER JOIN blog_entry be on b.pk1=be.blog_pk1 AND be.creator_user_id = cu.pk1
LEFT JOIN blog_CMT bc on be.pk1=bc.blog_entry_pk1 and bc.CREATOR_USER_ID=cu.pk1
JOIN gradeledger_mast gm ON gm.crsmast_pk1 = cm.pk1 and b.grade_handler = gm.linkId
WHERE cu.ROLE='S' AND BE.STATUS='2' AND B.ALLOW_GRADING='Y' AND u.row_status='0'
AND u.available_ind ='Y' and cm.row_status='0' and u.batch_uid='userA_157'
I am getting a result set for the above query with multiple date columns, which I want to put into a single column. The date has to be the most recent one, i.e. the max of the dates in the date columns.
I have successfully done the unpivot by using a view to store the above result set and putting all the dates in one column. However, I do not want to use a view or a table to store the result set and then unpivot, simply because I cannot keep creating views for every user one would query for.
The max(date_value) from the date columns needs to be put in one single column. The columns are as follows:
1) b.entry_mod_date, 2) b.cmt_mod_date, 3) be.cmt_mod_date, 4) b.update_date, 5) be.update_date, 6) bc.creation_date
Apologies that I could not provide the desc of all the tables and the fields being used.
Any help to get the above-mentioned max of the dates from these multiple date columns into a single column without using a view or a table would be greatly appreciated.
It is not clear what results you want, but the easiest solution is to use greatest().
with t as (
YOURQUERYHERE
)
select t.*,
greatest(b_ENTRY_mod_DATE, BlogCmtModDate, be_cmnt_mod_Date, BlogUpDate,
be_UPDATE_DATE, bc_creation_date
) as greatestdate
from t;
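One thing to watch for: in Oracle, greatest() returns NULL as soon as any argument is NULL, and bc.creation_date comes from a LEFT JOIN, so it can be NULL. The sketch below shows the effect and a coalesce() workaround. It uses SQLite's multi-argument max() through Python's sqlite3 module as a stand-in for greatest(), since it has the same NULL behaviour. The table, column names, and data are invented, and falling back to another date column is only one possible choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical flattened result: comment_created is NULL when no comment exists.
CREATE TABLE t (blogtitle TEXT, entry_mod TEXT, update_date TEXT, comment_created TEXT);
INSERT INTO t VALUES
    ('A', '2021-01-05', '2021-02-01', NULL),
    ('B', '2021-01-05', '2021-01-06', '2021-03-01');
""")

latest = conn.execute("""
    SELECT blogtitle,
           -- Naive row-wise maximum: NULL if any input is NULL.
           max(entry_mod, update_date, comment_created) AS naive_latest,
           -- Replace a missing date with one that is known to be present.
           max(entry_mod, update_date,
               coalesce(comment_created, entry_mod)) AS safe_latest
    FROM t
    ORDER BY blogtitle
""").fetchall()

for row in latest:
    print(row)
```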
select <columns>,
case
when greatest (b_ENTRY_mod_DATE) >= greatest (BlogCmtModDate) and greatest(b_ENTRY_mod_DATE) >= greatest(BlogUpDate)
then greatest( b_ENTRY_mod_DATE )
--<same implementation to compare each time BlogCmtModDate and BlogUpDate separately to get the greatest then 'date'>
,<columns>
FROM table
<rest of the query>
UNION ALL
Select <columns>,
case
when greatest (be_cmnt_mod_Date) >= greatest (be_UPDATE_DATE)
then greatest( be_cmnt_mod_Date )
when greatest (be_UPDATE_DATE) >= greatest (be_cmnt_mod_Date)
then greatest( be_UPDATE_DATE )
,<columns>
FROM table
<rest of the query>
UNION ALL
Select <columns>,
GREATEST(bc_creation_date)
,<columns>
FROM table
<rest of the query>

Query question regarding aggregates over a date range

I have a data set where the structure could be like this
yes_no date
0 1/1/2011
1 1/1/2011
1 1/2/2011
0 1/4/2011
1 1/9/2011
Given a start date and an end date, I would like to create a query that aggregates over the date and provides a 0 for dates that do not exist in the table, for all dates between start_date and end_date, including both.
This is in SQL.
I am stumped. I can get the aggregate queries very simply, but I don't know how to get zeros for dates that do not exist in the table.
If you're working with a DBMS that supports common table expressions, the following will generate a derived table of dates that you can then left join to your table. This was written for MSSQL, so you may need to derive your dates differently (i.e., an object other than master..spt_values)
with AllDates as (
select top 100000
convert(datetime, row_number() over (order by x.name)) as 'Date'
from
master..spt_values x
cross join master..spt_values y
)
select
ad.Date, isnull(yt.yn, 0)
from
AllDates ad
left join (
select date, sum(yes_no) yn
from YourTable yt
group by date
) yt
on ad.date = yt.date
where
ad.Date between YourStartDate and YourEndDate
Generating the dates has to be the way to go.
In Oracle you could join on to a generated list of dates that includes both startdate and enddate:
(SELECT TRUNC(startdate) + LEVEL - 1
FROM DUAL CONNECT BY LEVEL <= (TRUNC(enddate) - TRUNC(startdate)) + 1)
If you can't generate your dates on the fly, a database-agnostic solution would be to create a table containing all of the dates you will ever need and join on to that. (This should be your last resort.)
Here's the pseudo code. You will need to substitute mydates with either the on-the-fly SQL or a select from the date table.
SELECT
a.date,
CASE WHEN COUNT(b.date)=0
THEN
0
ELSE
1
END as yes_no
FROM (mydates) a
LEFT JOIN aggtable b ON a.date=b.date
GROUP BY a.date