Compare one row of a table to every row of a second table - sql

I am trying to retrieve the number of days between a random date and the next known date for a holiday. Let's say my first table looks like this :
date | is_holiday | zone
9/11/18 | 0 | A
22/12/18 | 1 | A
and my holidays table looks like this
start_date | end_date | zone
20/12/18 | 04/01/18 | A
21/12/18 | 04/01/18 | B
...
I want to be able to know how many days are between an entry that is not a holiday in the first table and the next holiday date.
I have tried to get the next row with a later date in a join clause but the join isn't the tool for this task. I also have tried grouping by date and comparing the date with the next row but I can have multiple entries with the same date in the first table so it doesn't work.
This is the join clause I have tried :
SELECT mai.*, vac.start_date, datediff(vac.start_date, mai.date)
FROM (SELECT *
FROM MAIN
WHERE is_holiday = 0
) mai LEFT JOIN
(SELECT start_date, zone
FROM VACATIONS_UPDATED
ORDER BY start_date
) vac
ON mai.date < vac.start_date AND mai.zone = vac.zone
I expect to get a table looking like this :
date | is_holiday | zone | next_holiday
9/11/18 | 0 | A | 11
22/12/18 | 1 | A | 0
Any lead on how to achieve this ?

This can get messy in SQL, but if you are open to doing it in code (Spark, in this example), here is roughly what it looks like. You basically need a crossJoin:
Dataset<Row> table1 = <readData>
Dataset<Row> holidays = <readData>
// then cache the small table to get the best performance
// alias both sides so the filter and the select can tell the columns apart
table1.as("table1").crossJoin(holidays.as("holidays"))
    .filter("table1.zone = holidays.zone AND table1.date < holidays.start_date")
    .selectExpr("table1.*", "holidays.start_date",
                "datediff(holidays.start_date, table1.date) AS nextHoliday")
When one row from table1 matches several holidays, add an id column to table1, then group the crossJoin result by that id and keep the earliest start_date.
// add unique id to the rows
table1 = table1.withColumn("id", functions.monotonically_increasing_id() )
Some details on crossJoins:
http://kirillpavlov.com/blog/2016/04/23/beyond-traditional-join-with-apache-spark/
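For comparison, the same "next holiday" lookup can be sketched in plain SQL with a correlated subquery instead of a cross join. This is only a minimal sketch run through Python's sqlite3 module: the table names `entries` and `vacations`, the ISO date format, and the sample rows are assumptions made for illustration, and `julianday()` is SQLite-specific (other engines have their own date-difference function).

```python
import sqlite3

# In-memory SQLite database with hypothetical sample rows (ISO dates).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entries (date TEXT, is_holiday INTEGER, zone TEXT);
CREATE TABLE vacations (start_date TEXT, end_date TEXT, zone TEXT);
INSERT INTO entries VALUES
  ('2018-12-09', 0, 'A'),
  ('2018-12-09', 0, 'A'),   -- duplicate dates are allowed
  ('2018-12-22', 1, 'A');
INSERT INTO vacations VALUES
  ('2018-12-20', '2019-01-04', 'A'),
  ('2019-02-09', '2019-02-25', 'A'),
  ('2018-12-21', '2019-01-04', 'B');
""")

# For each non-holiday row, pick the earliest later start_date in the same
# zone and subtract the row's date; holiday rows simply get 0.
query = """
SELECT e.date, e.is_holiday, e.zone,
       CASE WHEN e.is_holiday = 1 THEN 0
            ELSE (SELECT CAST(MIN(julianday(v.start_date)) - julianday(e.date) AS INTEGER)
                  FROM vacations v
                  WHERE v.zone = e.zone
                    AND v.start_date > e.date)
       END AS next_holiday
FROM entries e
ORDER BY e.date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Because the subquery is evaluated once per row of `entries`, duplicate dates in the first table are not a problem and no extra id column is needed.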

Related

Running sum of unique users in redshift

I have a table as follows, with user visits by day:
| date | user_id |
|:-------- |:-------- |
| 01/31/23 | a |
| 01/31/23 | a |
| 01/31/23 | b |
| 01/30/23 | c |
| 01/30/23 | a |
| 01/29/23 | c |
| 01/28/23 | d |
| 01/28/23 | e |
| 01/01/23 | a |
| 12/31/22 | c |
I am looking to get a running total of unique user_id values for the last 30 days. Here is the expected output:
| date | distinct_users|
|:-------- |:-------- |
| 01/31/23 | 5 |
| 01/30/23 | 4 |
.
.
.
Here is the query I tried -
SELECT date
, SUM(COUNT(DISTINCT user_id)) over (order by date rows between 30 preceding and current row) AS unique_users
FROM mytable
GROUP BY date
ORDER BY date DESC
The problem I am running into is that this query is not counting the unique user_id values. For instance, the result I am getting for 01/31/23 is 9 instead of 5, because it counts user_id 'a' every time it occurs.
Thank you, appreciate your help!
Not the most performant approach, but you could use a correlated subquery to find the distinct count of users over a window of the past 30 days:
SELECT DISTINCT -- one output row per date, not one per visit
date,
(SELECT COUNT(DISTINCT t2.user_id)
FROM mytable t2
WHERE t2.date BETWEEN t1.date - INTERVAL '30 day' AND t1.date) AS distinct_users
FROM mytable t1
ORDER BY date;
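To see the correlated-subquery idea on the sample data from the question, here is a small sketch run through Python's sqlite3 module. The table name `visits`, the ISO date strings, and SQLite's `date(x, '-30 days')` modifier are assumptions made for this illustration; on Redshift the interval arithmetic from the query above would be used instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (date TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("2023-01-31", "a"), ("2023-01-31", "a"), ("2023-01-31", "b"),
        ("2023-01-30", "c"), ("2023-01-30", "a"),
        ("2023-01-29", "c"),
        ("2023-01-28", "d"), ("2023-01-28", "e"),
        ("2023-01-01", "a"),
        ("2022-12-31", "c"),
    ],
)

# One row per distinct date; the subquery counts the distinct users seen
# between 30 days before that date and the date itself (both inclusive).
query = """
SELECT d.date,
       (SELECT COUNT(DISTINCT t2.user_id)
        FROM visits t2
        WHERE t2.date BETWEEN date(d.date, '-30 days') AND d.date) AS distinct_users
FROM (SELECT DISTINCT date FROM visits) d
ORDER BY d.date DESC
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first two rows come out as 5 and 4 users, matching the expected output in the question.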
There are a few things going on here. First, window functions run after GROUP BY and aggregation, so COUNT(DISTINCT user_id) gives the count of user_ids for each date and only then does the window function run. Also, a window function set up like this works over the past 30 rows, not 30 days, so you will need to fill in missing dates to use it.
As to how to do this, I can only think of the "expand the data so each date and id has a row" method. This will require a CTE to generate the last 2 years of dates plus 30 days so that the look-back window works for the first dates. Then window over the past 30 days for each user_id and date to see which rows have an occurrence of this user_id within the past 30 days, setting the value to NULL if the user_id does not occur within the window. Then count the non-NULL user_ids, grouping by just date, to get the number of unique user_ids for that date.
This means expanding the data significantly, but I see no other way to get truly unique user_ids over the past 30 days. I can help code this up if you need, but it will look something like:
WITH RECURSIVE CTE to generate the needed dates,
CTE to cross join these dates with a distinct set of all the user_ids in user for the past 2 years,
CTE to join the date/user_id data set with the table of real data for past 2 years and 30 days and window back counting non-NULL user_ids, partition by date and user_id, order by date, and setting any zero counts to NULL with a DECODE() or CASE statement,
SELECT, grouping by just date count the user_ids by date;

Add rows between two dates Presto

I have a table that has 3 columns- start, end and emp_num. I want to generate a new table which has all dates between these dates for every employee. Need to use Presto.
I referred to this link: inserting dates into a table between a start and end date in Presto
I tried using the unnest function with a generated sequence, but I don't know how to create the sequence by pulling dates from two columns in another table.
select unnest(seq) as t(days)
from (select sequence(start, end, interval '1' day) as seq
from table1)
Here's table and expected format
Table 1:
start | end | emp_num
2018/01/01 | 2018/01/05 | 1
2019/02/01 | 2019/02/05 | 2
Expected:
start | emp_num
2018/01/01 | 1
2018/01/02 | 1
2018/01/03 | 1
2018/01/04 | 1
2018/01/05 | 1
2019/02/01 | 2
2019/02/02 | 2
2019/02/03 | 2
2019/02/04 | 2
2019/02/05 | 2
Here is a query that might get the job done for your use case.
The logic is to use the Presto sequence() function to generate a wide date range (from 2000 to the end of 2018; you can adapt that as needed) that can be joined with the table to generate the output.
select dt.x, emp_num
from
( select x from unnest(sequence(date '2000-01-01', date '2018-12-31')) t(x) ) dt
inner join table1 ta on dt.x >= ta.start and dt.x <= ta.end
However, as JNevill commented, it would be more efficient to create a calendar table rather than generating it on the fly every time the query runs.
It should be as simple as:
create table calendar as
select x from unnest(sequence(date '1970-01-01', date '2099-01-01')) t(x);
And then your query would become :
select dt.x, emp_num
from
calendar dt
inner join table1 ta on dt.x >= ta.start and dt.x <= ta.end
PS: due to the lack of DB Fiddles for Presto in the wild, I could not test the queries (@PiotrFindeisen, if you happen to read this, a Presto fiddle would be nice to have!).
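Since the Presto queries above are untested, here is the same calendar-join idea as a small sketch that can actually be run, using Python's sqlite3 module. It is not Presto syntax: the recursive CTE stands in for `sequence()`/`unnest`, and the column names `start_date` and `end_date` (instead of `start` and `end`) are assumptions chosen to avoid keyword clashes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (start_date TEXT, end_date TEXT, emp_num INTEGER);
INSERT INTO table1 VALUES
  ('2018-01-01', '2018-01-05', 1),
  ('2019-02-01', '2019-02-05', 2);
""")

# Build a calendar of consecutive days, then keep every (day, employee)
# pair where the day falls inside that employee's start/end range.
query = """
WITH RECURSIVE calendar(x) AS (
    SELECT '2018-01-01'
    UNION ALL
    SELECT date(x, '+1 day') FROM calendar WHERE x < '2019-12-31'
)
SELECT dt.x, ta.emp_num
FROM calendar dt
INNER JOIN table1 ta
        ON dt.x >= ta.start_date AND dt.x <= ta.end_date
ORDER BY ta.emp_num, dt.x
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

This yields five rows per employee, one for each day in the employee's range, which is the shape of the expected output in the question.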

How to dynamically call date instead of hardcoding in WHERE clause?

In my code using SQL Server, I am comparing data between two months where I have the exact dates identified. I am trying to find if the value in a certain column changes in a bunch of different scenarios. That part works, but I'd like to avoid having to go back and change the dates each time I want the results. Is this possible?
My thought was to add a WITH clause, but it is giving me an aggregation error. Is there any way I can make this date problem simpler? Thanks in advance
EDIT
Ok I'd like to clarify. In my WITH statement, I have:
select distinct
d.Date
from Database d
Which returns:
+------+-------------+
| | Date |
+------+-------------+
| 1 | 01-06-2017 |
| 2 | 01-13-2017 |
| 3 | 01-20-2017 |
| 4 | 01-27-2017 |
| 5 | 02-03-2017 |
| 6 | 02-10-2017 |
| 7 | 02-17-2017 |
| 8 | 02-24-2017 |
| 9 | ........ |
+------+-------------+
If I select this statement and execute, it will return just the dates from my table as shown above. What I'd like to do is be able to have sql that will pull from these date values and compare the last date value from one month to the last date value of the next month. In essence, it should compare the values from date 8 to values from date 4, but it should be dynamic enough that it can do the same for any two dates without much tinkering.
If I didn't misunderstand your request, it seems you need a numbers table, also known as a tally table, or in this case a calendar table.
Recommended post: https://dba.stackexchange.com/questions/11506/why-are-numbers-tables-invaluable
Basically, you create a table and populate it with the week numbers of the year and their start and end dates. Then join your main query to this table.
+------+-----------+----------+
| week | startDate | endDate |
+------+-----------+----------+
| 1 | 20170101 | 20170107 |
| 2 | 20170108 | 20170114 |
+------+-----------+----------+
Select b.week, max(a.data) from yourTable a
inner join calendarTable b
on a.Date between b.startDate and b.endDate
group by b.week
Dynamic dates to filter with BETWEEN:
select dateadd(m,-1,dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date))) -- 1st date of last month
select dateadd(day,-datepart(day,cast(getdate() as date)),cast(getdate() as date)) -- last date of last month
select dateadd(day,-(datepart(day,cast(getdate() as date))-1),cast(getdate() as date)) -- 1st date of current month
select dateadd(day,-datepart(day,dateadd(m,1,cast(getdate() as date))),dateadd(m,1,cast(getdate() as date))) -- last date of current month
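The arithmetic behind those four expressions is easier to follow outside of nested dateadd calls. The sketch below mirrors it with Python's datetime module; the function name `month_bounds` and the dictionary keys are made up for this illustration and are not part of any SQL Server API.

```python
from datetime import date, timedelta


def month_bounds(today):
    """Return the first/last day of the previous and current month."""
    # Snap to day 1 of the current month.
    first_of_current = today.replace(day=1)
    # The day before that is the last day of the previous month.
    last_of_previous = first_of_current - timedelta(days=1)
    first_of_previous = last_of_previous.replace(day=1)
    # Jump safely past the end of the current month, snap to day 1,
    # then step back one day to land on the last day of this month.
    first_of_next = (first_of_current + timedelta(days=32)).replace(day=1)
    last_of_current = first_of_next - timedelta(days=1)
    return {
        "first_of_previous": first_of_previous,
        "last_of_previous": last_of_previous,
        "first_of_current": first_of_current,
        "last_of_current": last_of_current,
    }


bounds = month_bounds(date(2017, 3, 15))
for name, value in bounds.items():
    print(name, value.isoformat())
```

Passing the date in as a parameter, rather than calling `date.today()` inside the function, keeps the calculation easy to check against known dates.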

yet another date gap-fill SQL puzzle

I'm using Vertica, which precludes me from using CROSS APPLY, unfortunately. And apparently there's no such thing as CTEs in Vertica.
Here's what I've got:
t:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
Note that on the first day, the delta is equal to the metric value.
I'd like to fill in the gaps, like this:
t_fill:
day | id | metric | d_metric
-----------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0 -- a delta of 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
I've thought of a way to do this day by day, but what I'd really like is a solution that works in one go.
I think I could get something working with LAST_VALUE, but I can't come up with the right JOIN statements that will let me properly partition and order on each id's day-by-day history.
edit:
assume I have a table like this:
calendar:
day
------------
2011-01-01
2011-01-02
...
that can be involved with joins. My intent would be to maintain the date range in calendar to match the date range in t.
edit:
A few more notes on what I'm looking for, just to be specific:
In generating t_fill, I'd like to exactly cover the date range in t, as well as any dates that are missing in between. So a correct t_fill will start on the same date and end on the same date as t.
t_fill has two properties:
1) once an id appears on some date, it will always have a row for each later date. This is the gap-filling implied in the original question.
2) Should no row for an id ever appear again after some date, the t_fill solution should merrily generate rows with the same metric value (and 0 delta) from the date of that last data point up to the end date of t.
A solution might backfill earlier dates up to the start of the date range in t. That is, for any id that appears after the first date in t, rows between the first date in t and the first date for the id will be filled with metric=0 and d_metric=0. I don't prefer this kind of solution, since it has a higher growth factor for each id that enters the system. But I could easily deal with it by selecting into a new table only rows where metric!=0 and d_metric!=0.
This is about what Jonathan Leffler proposed, but in old-fashioned low-level SQL (without fancy CTEs or window functions or aggregating subqueries):
SET search_path='tmp';
DROP TABLE ttable CASCADE;
CREATE TABLE ttable
( zday date NOT NULL
, id INTEGER NOT NULL
, metric INTEGER NOT NULL
, d_metric INTEGER NOT NULL
, PRIMARY KEY (id,zday)
);
INSERT INTO ttable(zday,id,metric,d_metric) VALUES
('2011-12-01',1,10,10)
,('2011-12-03',1,12,2)
,('2011-12-04',1,15,3)
;
DROP TABLE ctable CASCADE;
CREATE TABLE ctable
( zday date NOT NULL
, PRIMARY KEY (zday)
);
INSERT INTO ctable(zday) VALUES
('2011-12-01')
,('2011-12-02')
,('2011-12-03')
,('2011-12-04')
;
CREATE VIEW v_cte AS (
SELECT t.zday,t.id,t.metric,t.d_metric
FROM ttable t
JOIN ctable c ON c.zday = t.zday
UNION
SELECT c.zday,t.id,t.metric, 0
FROM ctable c, ttable t
WHERE t.zday < c.zday
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday = c.zday
)
AND NOT EXISTS ( SELECT *
FROM ttable nx
WHERE nx.id = t.id
AND nx.zday < c.zday
AND nx.zday > t.zday
)
)
;
SELECT * FROM v_cte;
The results:
zday | id | metric | d_metric
------------+----+--------+----------
2011-12-01 | 1 | 10 | 10
2011-12-02 | 1 | 10 | 0
2011-12-03 | 1 | 12 | 2
2011-12-04 | 1 | 15 | 3
(4 rows)
I am not a Vertica user, but if you do not want to use their native support for gap filling, here you can find a more generic SQL-only solution to do so.
If you want to use something like a CTE, how about using a temporary table? Essentially, a CTE is a view for a particular query.
Depending on your needs you can make the temporary table transaction or session-scoped.
I'm still curious to know why gap-filling with constant-interpolation wouldn't work here.
Given the complete calendar table, it is doable, though not exactly trivial. Without the calendar table, it would be a lot harder.
Your query needs to be stated moderately precisely, which is usually half the battle in any issue with 'how to write the query'. I think you are looking for:
For each date in Calendar between the minimum and maximum dates represented in T (or other stipulated range),
For each distinct ID represented in T,
Find the metric for the given ID for the most recent record in T on or before the date.
This gives you a complete list of dates with metrics.
You then need to self-join two copies of that list with dates one day apart to form the deltas.
Note that if some ID values don't appear at the start of the date range, they won't show up.
With that as guidance, you should be able to get going, I believe.
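The outline above (calendar dates times distinct ids, most recent metric on or before each date, then a self-join one day apart for the deltas) can be sketched as follows. This is only an illustration run through Python's sqlite3 module: it uses a CTE and `LIMIT`, which are SQLite conveniences and not a claim about what Vertica supports, and the second id (2) is a made-up row added to show the "id appears late and then stops" case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (day TEXT, id INTEGER, metric INTEGER);
CREATE TABLE calendar (day TEXT PRIMARY KEY);
INSERT INTO t VALUES
  ('2011-12-01', 1, 10),
  ('2011-12-03', 1, 12),
  ('2011-12-04', 1, 15),
  ('2011-12-03', 2, 5);      -- hypothetical id that appears late and stops
INSERT INTO calendar VALUES
  ('2011-11-30'), ('2011-12-01'), ('2011-12-02'),
  ('2011-12-03'), ('2011-12-04'), ('2011-12-05');
""")

query = """
WITH filled AS (
    -- every calendar day in t's date range, for every distinct id,
    -- carrying the most recent metric on or before that day
    SELECT c.day, i.id,
           (SELECT t.metric FROM t
            WHERE t.id = i.id AND t.day <= c.day
            ORDER BY t.day DESC LIMIT 1) AS metric
    FROM calendar c
    CROSS JOIN (SELECT DISTINCT id FROM t) i
    WHERE c.day BETWEEN (SELECT MIN(day) FROM t) AND (SELECT MAX(day) FROM t)
)
SELECT cur.day, cur.id, cur.metric,
       cur.metric - COALESCE(prev.metric, 0) AS d_metric
FROM filled cur
LEFT JOIN filled prev
       ON prev.id = cur.id AND prev.day = date(cur.day, '-1 day')
WHERE cur.metric IS NOT NULL      -- drop days before an id first appears
ORDER BY cur.id, cur.day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Days before an id first appears are dropped rather than backfilled with zeros, which corresponds to the variant the question says it prefers.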

Finding correlated values from second table without resorting to PL/SQL

I have the following two tables in my database:
a) A table containing values acquired at a certain date (you may think of these as, say, temperature readings):
sensor_id | acquired | value
----------+---------------------+--------
1 | 2009-04-01 10:00:00 | 20
1 | 2009-04-01 10:01:00 | 21
1 | 2009-04-01 10:02:00 | 20
1 | 2009-04-01 10:09:00 | 20
1 | 2009-04-01 10:11:00 | 25
1 | 2009-04-01 10:15:00 | 30
...
The interval between the readings may differ, but the combination of (sensor_id, acquired) is unique.
b) A second table containing time periods and a description (you may think of these as, say, periods when someone turned on the radiator):
sensor_id | start_date | end_date | description
----------+---------------------+---------------------+------------------
1 | 2009-04-01 10:00:00 | 2009-04-01 10:02:00 | some description
1 | 2009-04-01 10:10:00 | 2009-04-01 10:14:00 | something else
Again, the length of the period may differ, but there will never be overlapping time periods for any given sensor.
I want to get a result that looks like this for any sensor and any date range:
sensor id | start date | v1 | end date | v2 | description
----------+---------------------+----+---------------------+----+------------------
1 | 2009-04-01 10:00:00 | 20 | 2009-04-01 10:02:00 | 20 | some description
1 | 2009-04-01 10:10:00 | 25 | 2009-04-01 10:14:00 | 30 | something else
Or in text form: given a sensor_id and a date range of range_start and range_end,
find me all time periods which overlap the date range (that is, start_date < range_end and end_date > range_start) and, for each of these rows, find the corresponding values from the value table for the time period's start_date and end_date (the first row with acquired >= start_date and the first row with acquired >= end_date, respectively).
If it wasn't for the start_value and end_value columns, this would be a textbook trivial example of how to join two tables.
Can I somehow get the output I need in one SQL statement without resorting to writing a PL/SQL function to find these values?
Unless I have overlooked something blatantly obvious, this can't be done with simple subselects.
Database is Oracle 11g, so any Oracle-specific features are acceptable.
Edit: yes, looping is possible, but I want to know if this can be done with a single SQL select.
You can give this a try. Note the caveats at the end though.
SELECT
RNG.sensor_id,
RNG.start_date,
RDG1.value AS v1,
RNG.end_date,
RDG2.value AS v2,
RNG.description
FROM
Ranges RNG
INNER JOIN Readings RDG1 ON
RDG1.sensor_id = RNG.sensor_id AND
RDG1.acquired >= RNG.start_date
LEFT OUTER JOIN Readings RDG1_NE ON
RDG1_NE.sensor_id = RDG1.sensor_id AND
RDG1_NE.acquired >= RNG.start_date AND
RDG1_NE.acquired < RDG1.acquired
INNER JOIN Readings RDG2 ON
RDG2.sensor_id = RNG.sensor_id AND
RDG2.acquired >= RNG.end_date
LEFT OUTER JOIN Readings RDG2_NE ON
RDG2_NE.sensor_id = RDG2.sensor_id AND
RDG2_NE.acquired >= RNG.end_date AND
RDG2_NE.acquired < RDG2.acquired
WHERE
RDG1_NE.sensor_id IS NULL AND
RDG2_NE.sensor_id IS NULL
This uses the first reading after the start date of the range and the first reading after the end date (personally, I'd think using the last date before the start and end would make more sense or the closest value, but I don't know your application). If there is no such reading then you won't get anything at all. You can change the INNER JOINs to OUTER and put additional logic in to handle those situations based on your own business rules.
It seems pretty straightforward.
Find the sensor values for each range. Find a row (I will call the acquired value of this row X) where X > start_date and no other row for the same sensor exists with acquired > start_date and acquired < X. Do the same for the end date.
Select only the ranges that meet the query: start_date before range_end and end_date after range_start, which is the overlap condition from the question.
In SQL this would be something like this:
SELECT R1.*, SV1.acquired, SV2.acquired
FROM ranges R1
INNER JOIN sensor_values SV1 ON SV1.sensor_id = R1.sensor_id
INNER JOIN sensor_values SV2 ON SV2.sensor_id = R1.sensor_id
WHERE SV1.acquired > R1.start_date
AND NOT EXISTS (
SELECT *
FROM sensor_values SV3
WHERE SV3.sensor_id = R1.sensor_id -- only compare readings of the same sensor
AND SV3.acquired > R1.start_date
AND SV3.acquired < SV1.acquired)
AND SV2.acquired > R1.end_date
AND NOT EXISTS (
SELECT *
FROM sensor_values SV4
WHERE SV4.sensor_id = R1.sensor_id -- only compare readings of the same sensor
AND SV4.acquired > R1.end_date
AND SV4.acquired < SV2.acquired)
AND R1.start_date < :range_end
AND R1.end_date > :range_start
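To check the "first reading at or after a timestamp" logic against the sample data from the question, here is a small sketch run through Python's sqlite3 module. It is not Oracle syntax: the table names `readings` and `ranges`, the `ORDER BY ... LIMIT 1` subqueries, and the query range passed as parameters are all assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE readings (sensor_id INTEGER, acquired TEXT, value INTEGER);
CREATE TABLE ranges (sensor_id INTEGER, start_date TEXT, end_date TEXT,
                     description TEXT);
INSERT INTO readings VALUES
  (1, '2009-04-01 10:00:00', 20), (1, '2009-04-01 10:01:00', 21),
  (1, '2009-04-01 10:02:00', 20), (1, '2009-04-01 10:09:00', 20),
  (1, '2009-04-01 10:11:00', 25), (1, '2009-04-01 10:15:00', 30);
INSERT INTO ranges VALUES
  (1, '2009-04-01 10:00:00', '2009-04-01 10:02:00', 'some description'),
  (1, '2009-04-01 10:10:00', '2009-04-01 10:14:00', 'something else');
""")

# Each scalar subquery picks the earliest reading of the same sensor taken
# at or after the period boundary; the WHERE clause is the overlap test.
query = """
SELECT r.sensor_id, r.start_date,
       (SELECT v.value FROM readings v
        WHERE v.sensor_id = r.sensor_id AND v.acquired >= r.start_date
        ORDER BY v.acquired LIMIT 1) AS v1,
       r.end_date,
       (SELECT v.value FROM readings v
        WHERE v.sensor_id = r.sensor_id AND v.acquired >= r.end_date
        ORDER BY v.acquired LIMIT 1) AS v2,
       r.description
FROM ranges r
WHERE r.sensor_id = :sensor_id
  AND r.start_date < :range_end
  AND r.end_date > :range_start
ORDER BY r.start_date
"""
params = {
    "sensor_id": 1,
    "range_start": "2009-04-01 09:00:00",
    "range_end": "2009-04-01 11:00:00",
}
rows = conn.execute(query, params).fetchall()
for row in rows:
    print(row)
```

If a period has no reading at or after one of its boundaries, the corresponding value comes back as NULL instead of the row disappearing, which is the outer-join behaviour mentioned in the first answer.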