Missing row using left join and generate_series - sql

I don't understand why my two tables behave incorrectly; the generated series doesn't show up in the joined result.
TABLE 1: attendance
id | employee_id | flag | entry_date
-------------------------------
1 | 1 | 0 | 2017-11-17
2 | 2 | 0 | 2017-11-17
3 | 3 | 0 | 2017-11-17
Here's my query:
SELECT
    attendance.employee_id,
    TO_CHAR(series::date, 'YYYY-MM-DD')
FROM generate_series('2017-11-16', '2017-11-30', '1 day'::INTERVAL) series
LEFT JOIN (
    SELECT
        employee_id,
        to_char(entry_date, 'YYYY-MM-DD') entry_date
    FROM attendance
    WHERE entry_date >= '2017-11-16' AND entry_date <= '2017-11-30'
    GROUP BY to_char(entry_date, 'YYYY-MM-DD'), employee_id
) attendance ON attendance.entry_date = TO_CHAR(series::date, 'YYYY-MM-DD')
ORDER BY employee_id
The result is:
employee_id | to_char
------------------------
1 | 2017-11-17
2 | 2017-11-17
3 | 2017-11-17
I'm expecting something a little different, in which 2017-11-16 will show up, since my generated series starts on that date.
Expected result:
employee_id | to_char
------------------------
null | 2017-11-16
null | 2017-11-16
null | 2017-11-16
1 | 2017-11-17
2 | 2017-11-17
3 | 2017-11-17
Here's a sample SQL Fiddle to test:
http://sqlfiddle.com/#!17/d7907/3
Update [Jan/4/2017]: Looks like I'm thinking about this the wrong way; I decided to perform a correlated subquery instead.
What I wanted to do is to count the number of dates present according to the generated series.

It's hard to understand what you really need. What you described under "The result is" are rows from your inner query. If you left join generate_series with this inner query, you will get one row with a NULL employee_id for every date except 2017-11-17, and 3 rows for 2017-11-17. I checked it, and it worked as expected:
employee_id | to_char
-------------+------------
1 | 2017-11-17
2 | 2017-11-17
3 | 2017-11-17
| 2017-11-20
| 2017-11-21
| 2017-11-22
| 2017-11-23
| 2017-11-24
| 2017-11-25
| 2017-11-26
| 2017-11-27
| 2017-11-28
| 2017-11-29
| 2017-11-16
| 2017-11-30
| 2017-11-18
| 2017-11-19
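Given your update (counting how many of the generated dates each employee has an attendance row for), here is a sketch of one way to do it, assuming entry_date is a DATE column as in the sample table:

SELECT e.employee_id,
       COUNT(a.entry_date) AS days_present  -- number of series dates with a matching attendance row
FROM (SELECT DISTINCT employee_id FROM attendance) e
CROSS JOIN generate_series('2017-11-16'::date, '2017-11-30'::date, '1 day'::interval) AS s(day)
LEFT JOIN attendance a
       ON a.employee_id = e.employee_id
      AND a.entry_date = s.day::date
GROUP BY e.employee_id
ORDER BY e.employee_id;

If instead you want one row per employee per day (with NULLs on the days an employee was absent), select s.day as well and group by both columns.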


Insert a row for each month in the range [duplicate]

I want to make my table here in Oracle
+----+------------+------------+
| N | Start | End |
+----+------------+------------+
| 1 | 2018-01-01 | 2018-05-31 |
| 1 | 2018-01-01 | 2018-06-31 |
+----+------------+------------+
Into this; as silly as it looks, I need to insert one row for each month in the range, for each row in the first table:
+----+------------+
| N  | month      |
+----+------------+
| 1 | 2018-01-01 |
| 1 | 2018-01-01 |
| 1 | 2018-02-01 |
| 1 | 2018-02-01 |
| 1 | 2018-03-01 |
| 1 | 2018-03-01 |
| 1 | 2018-04-01 |
| 1 | 2018-04-01 |
| 1 | 2018-05-01 |
| 1 | 2018-05-01 |
| 1 | 2018-06-01 |
+----+------------+
I've been trying to follow SQL: Generate Record Per Month In Date Range but I haven't had any luck getting the result I want.
Thanks for helping
My best guess is that you want to show every beginning of a month that falls within the start-to-end interval in your table.
create table t1 as
select date'2018-01-01' start_d, date'2018-05-31' end_d from dual union all
select date'2018-01-01' start_d, date'2018-06-30' end_d from dual;
with cal as
(select add_months(date'2018-01-01', rownum-1) month_d
from dual connect by level <= 12)
select cal.month_d from cal
join t1 on cal.month_d between t1.start_d and t1.end_d
order by 1;
MONTH_D
-------------------
01.01.2018 00:00:00
01.01.2018 00:00:00
01.02.2018 00:00:00
01.02.2018 00:00:00
01.03.2018 00:00:00
01.03.2018 00:00:00
01.04.2018 00:00:00
01.04.2018 00:00:00
01.05.2018 00:00:00
01.05.2018 00:00:00
01.06.2018 00:00:00
So probably there is a cut & paste error in your expectation for January.
Some other points:
Do not use a reserved word such as START as a column name.
Use the DATE type to store dates, to avoid invalid entries such as 2018-06-31.
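For illustration, a minimal sketch (hypothetical table name, avoiding START and END as column names) showing that DATE columns reject an invalid literal such as 2018-06-31 at insert time:

create table month_ranges (
  n       number,
  start_d date,
  end_d   date
);

-- Accepted:
insert into month_ranges values (1, date '2018-01-01', date '2018-05-31');

-- Rejected with an ORA- date-validation error (June has no 31st day):
insert into month_ranges values (1, date '2018-01-01', date '2018-06-31');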
You can use a recursive CTE. For example:
with
n (s, e, cur) as (
select s, e, s from t
union all
select s, e, add_months(cur, 1)
from n
where add_months(cur, 1) < e
)
select cur from n;
Result:
CUR
---------
01-JAN-18
01-JAN-18
01-FEB-18
01-FEB-18
01-MAR-18
01-MAR-18
01-APR-18
01-APR-18
01-MAY-18
01-MAY-18
01-JUN-18
See running example at db<>fiddle.
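To run this against the t1 sample table created in the first answer, the same query with the column names adjusted (a sketch):

with n (start_d, end_d, cur) as (
  select start_d, end_d, start_d from t1
  union all
  select start_d, end_d, add_months(cur, 1)
  from n
  where add_months(cur, 1) < end_d
)
select cur from n
order by cur;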

How do I use AVG() with GROUP BY in time_bucket_gapfill() in TimeScaleDB, PostgreSQL?

I'm using TimescaleDB in my PostgreSQL database and I have the following two tables:
windows_log
| windows_log_id | timestamp | computer_id | log_count |
------------------------------------------------------------------
| 1 | 2021-01-01 00:01:02 | 382 | 30 |
| 2 | 2021-01-02 14:59:55 | 382 | 20 |
| 3 | 2021-01-02 19:08:24 | 382 | 20 |
| 4 | 2021-01-03 13:05:36 | 382 | 10 |
| 5 | 2021-01-03 22:21:14 | 382 | 40 |
windows_reliability_score
| computer_id (FK) | timestamp | reliability_score |
--------------------------------------------------------------
| 382 | 2021-01-01 22:21:14 | 6 |
| 382 | 2021-01-01 22:21:14 | 6 |
| 382 | 2021-01-01 22:21:14 | 6 |
| 382 | 2021-01-02 22:21:14 | 1 |
| 382 | 2021-01-02 22:21:14 | 3 |
| 382 | 2021-01-03 22:21:14 | 7 |
| 382 | 2021-01-03 22:21:14 | 8 |
| 382 | 2021-01-03 22:21:14 | 9 |
Note: both tables are indexed on the timestamp column (hypertables).
So I'm trying to get the average reliability_score for each time bucket, but it just gives me the average for everything, instead of the average per specific bucket...
This is my query:
SELECT time_bucket_gapfill(CAST(1 * INTERVAL '1 day' AS INTERVAL), wl.timestamp) AS timestamp,
COALESCE(SUM(log_count), 0) AS log_count,
AVG(reliability_score) AS reliability_score
FROM windows_log wl
JOIN reliability_score USING (computer_id)
WHERE wl.time >= '2021-01-01 00:00:00.0' AND wl.time < '2021-01-04 00:00:00.0'
GROUP BY timestamp
ORDER BY timestamp asc
This is the result I'm looking for:
| timestamp | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 | 30 | 6 |
| 2021-01-02 00:00:00 | 20 | 2 |
| 2021-01-03 00:00:00 | 20 | 8 |
But this is what I get:
| timestamp | log_count | reliability_score |
-------------------------------------------------------
| 2021-01-01 00:00:00 | 30 | 5.75 |
| 2021-01-02 00:00:00 | 20 | 5.75 |
| 2021-01-03 00:00:00 | 20 | 5.75 |
Given what we can glean from your example, there's no simple way to do a join between these two tables, with the given functions, and achieve the results you want. The schema, as presented, just makes that difficult.
If this is really what your data/schema look like, then one solution is to use multiple CTE's to get the two values from each distinct table and then join based on bucket and computer.
WITH wrs AS (
SELECT time_bucket_gapfill('1 day', timestamp) AS bucket,
computer_id,
AVG(reliability_score) AS reliability_score
FROM windows_reliability_score
WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, 2
),
wl AS (
SELECT time_bucket_gapfill('1 day', wl.timestamp) bucket, wl.computer_id,
sum(log_count) total_logs
FROM windows_log wl
WHERE timestamp >= '2021-01-01 00:00:00.0' AND timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, 2
)
SELECT wrs.bucket, wrs.computer_id, reliability_score, total_logs
FROM wrs LEFT JOIN wl ON wrs.bucket = wl.bucket AND wrs.computer_id = wl.computer_id;
The filtering has to be applied inside each CTE because predicate pushdown from the outer query likely wouldn't happen, so otherwise you would scan the entire hypertable before the date filter is applied (not what you want, I assume).
I tried to quickly re-create your sample schema, so I apologize if I got names wrong somewhere.
The main issue is that the join condition is on column computer_id, where both tables have the same value 382. Thus each row from table windows_log is joined with each row from table reliability_score (a Cartesian product of all rows). Also, the grouping is done on the column timestamp, which is ambiguous and is likely to be resolved to timestamp from windows_log. This means the average uses all values of reliability_score for each value of timestamp from windows_log, which explains the undesired result.
The resolution of the grouping ambiguity, which is decided in favor of the input column, i.e., the table column, is explained in the GROUP BY description of the SELECT documentation:
In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name.
To avoid groups that include all rows matching on computer_id, windows_log_id can be used for grouping. This also allows bringing log_count into the query result. And if it is desired to keep the output name timestamp, GROUP BY should reference the column position. For example:
SELECT time_bucket_gapfill('1 day'::INTERVAL, rs.timestamp) AS timestamp,
AVG(reliability_score) AS reliability_score,
log_count
FROM windows_log wl
JOIN reliability_score rs USING (computer_id)
WHERE rs.timestamp >= '2021-01-01 00:00:00.0' AND rs.timestamp < '2021-01-04 00:00:00.0'
GROUP BY 1, windows_log_id, log_count
ORDER BY timestamp asc
For ORDER BY it is not an issue, since the output name is used. From the same doc:
If an ORDER BY expression is a simple name that matches both an output column name and an input column name, ORDER BY will interpret it as the output column name.
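A small illustration of the two rules quoted above (hypothetical table and column names):

-- "ts" is both an input column and an output alias.
SELECT date_trunc('day', ts) AS ts,   -- output alias shadows the raw column name
       AVG(score) AS score
FROM measurements
GROUP BY 1      -- position 1 = the truncated expression; "GROUP BY ts" would
                -- group by the raw input column instead and defeat the bucketing
ORDER BY ts;    -- ORDER BY prefers the output alias, so this sorts by the bucket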

Cumulative open subscriptions with start_date and end_date on Redshift

I am trying to write a query that will allow to me to count the number of active subscriptions by day in Redshift.
I have the following table:
sub_id | start_date | end_date
---------------------------------------
20001 | 2017-09-01 | NULL
20002 | 2017-08-01 | 2017-08-29
20003 | 2016-01-01 | 2017-04-25
20004 | 2016-07-01 | 2017-09-03
I would like to be able to state, for each date between two dates how many subscriptions are active, such that:
date | active_subs
------------------------
2016-06-30 | 1
2016-07-01 | 2
... |
2017-04-24 | 2
2017-04-25 | 1
... |
2017-07-31 | 1
2017-08-01 | 2
... |
2017-08-28 | 2
2017-08-29 | 1
2017-08-30 | 1
2017-08-31 | 1
2017-09-01 | 2
2017-09-02 | 2
2017-09-03 | 1
I have a reference table named date from which a query can draw one row per day; the relevant column is date.ref_date (in the YYYY-MM-DD format).
Do I write this query using window functions, or is there a better way?
Thanks
If I understood you correctly, you need neither window functions nor a cumulative count, just a join (to the date table). You can do this:
SELECT t.ref_date,
       COUNT(s.sub_id) AS active_subs
FROM dateTable t
LEFT JOIN YourTable s
       ON (t.ref_date BETWEEN s.start_date
           AND COALESCE(s.end_date, <Put A late date here>))
GROUP BY t.ref_date
I would do this as:
with cte as (
select start_date as dte, 1 as inc
from t
union all
select coalesce(end_date, current_date), -1 as inc
from t
)
select dte,
sum(sum(inc)) over (order by dte)
from cte
group by dte
order by dte;
There may be off-by-one errors, depending on whether you count stops on the date given or on the next day.
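If every calendar day in the range is needed, not just the dates on which a subscription starts or ends, the running total can be joined back to the date reference table mentioned in the question; here is a sketch, assuming that table is named date with column ref_date:

with cte as (
    select start_date as dte, 1 as inc from t
    union all
    select coalesce(end_date, current_date) as dte, -1 as inc from t
),
deltas as (
    select dte,
           sum(sum(inc)) over (order by dte rows unbounded preceding) as active_subs
    from cte
    group by dte
)
select d.ref_date,
       coalesce(last_value(deltas.active_subs ignore nulls)
                    over (order by d.ref_date rows unbounded preceding),
                0) as active_subs        -- carry the last known total forward
from "date" d                            -- quoted because date is a reserved word
left join deltas on deltas.dte = d.ref_date
where d.ref_date between '2016-06-30' and '2017-09-03'
order by d.ref_date;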

BigQuery - how many entries per partition?

I have big partitioned tables and am trying to figure out how many entries are in each day partition.
So far I have used a for loop in a script, but there must be a simpler way of doing it.
Google did not help me. Does anyone know the right query?
Thanks
You can run the following query to count how many entries you have in each partition:
#standardSQL
SELECT
_PARTITIONTIME AS pt,
COUNT(1)
FROM
`dataset.table`
GROUP BY
1
ORDER BY
1 DESC
and in legacy SQL:
#legacySQL
SELECT
_PARTITIONTIME AS pt,
COUNT(1)
FROM
[dataset:table]
GROUP BY
1
ORDER BY
1 DESC
It returns a table like this. Please note that the NULL entries are still in the streaming buffer. Hint: to obtain the records that are in the streaming buffer, use a query that filters on _PARTITIONTIME IS NULL.
+-------------------------+-----+--+
| 2017-02-14 00:00:00 UTC | 252 | |
+-------------------------+-----+--+
| 2017-02-13 00:00:00 UTC | 257 | |
+-------------------------+-----+--+
| 2017-02-12 00:00:00 UTC | 188 | |
+-------------------------+-----+--+
| 2017-02-11 00:00:00 UTC | 234 | |
+-------------------------+-----+--+
| 2017-02-10 00:00:00 UTC | 107 | |
+-------------------------+-----+--+
| null | 13 | |
+-------------------------+-----+--+
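A sketch of that hint: counting only the rows that are still in the streaming buffer (the table name is a placeholder):

#standardSQL
SELECT COUNT(1) AS buffered_rows
FROM `dataset.table`
WHERE _PARTITIONTIME IS NULL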

how to get second max date in postgres sql

I have the following situation where I need to get several values between two invoice dates.
The query returns data based on invoices; what I now need to do, for some values, is fetch data between this invoice's date and the previous invoice's date.
Ways I have already tried:
1) A subquery would easily solve this, but I have to do this for 4-5 columns and it's a 15 GB database, so that's not possible.
2) If I go like this
left join (select inv.date, inv.actno from invoice inv) as invo on invo.actno = act.id and invo.date < inv.date
then it gives all the rows with a smaller date, but I need only the single row immediately below the main invoice date.
3) We cannot get the second-max value in a subquery of the FROM clause because the outer invoice is not grouped, so it might be the max, the middle, or the least value.
4) We cannot reference values from the other table inside the subquery of a joined table.
Example:
create table inv (id serial ,date timestamp without time zone);
insert into inv (date) values('2017-01-31 00:00:00'),('2017-01-30 00:00:00'),('2017-01-29 00:00:00'),('2017-01-28 00:00:00'),('2017-01-27 00:00:00');
select id, date from inv;
id | date
----+---------------------
1 | 2017-01-31 00:00:00
2 | 2017-01-30 00:00:00
3 | 2017-01-29 00:00:00
4 | 2017-01-28 00:00:00
5 | 2017-01-27 00:00:00
(5 rows)
I need this
id | date                | prev date           | prev id
1 | 2017-01-31 00:00:00 | 2017-01-30 00:00:00 | 2
2 | 2017-01-30 00:00:00 | 2017-01-29 00:00:00 | 3
3 | 2017-01-29 00:00:00 | 2017-01-28 00:00:00 | 4
4 | 2017-01-28 00:00:00 | 2017-01-27 00:00:00 | 5
5 | 2017-01-27 00:00:00 |
I can't do a subquery in the SELECT since the database is big and I need to do this for 4-5 columns.
UPDATE 1
I need this from the same table, using it twice in the FROM clause. My requirement is that I need several pieces of data joined from the invoice table, and then there are 4-5 columns in which I need things like the sum of the amount paid between the last invoice and this one.
So I can take both invoice dates in a subquery and get the data between them.
UPDATE 2
lag will not solve this
select i.id,i.date, lag(date) over (order by date) from inv i order by id ;
id | date | lag
----+---------------------+---------------------
1 | 2017-01-31 00:00:00 | 2017-01-30 00:00:00
2 | 2017-01-30 00:00:00 | 2017-01-29 00:00:00
3 | 2017-01-29 00:00:00 | 2017-01-28 00:00:00
4 | 2017-01-28 00:00:00 | 2017-01-27 00:00:00
5 | 2017-01-27 00:00:00 |
(5 rows)
Time: 0.480 ms
test=# select i.id,i.date, lag(date) over (order by date) from inv i where id=2 order by id ;
id | date | lag
----+---------------------+-----
2 | 2017-01-30 00:00:00 |
(1 row)
Time: 0.525 ms
test=# select i.id,i.date, lag(date) over (order by date) from inv i where id in (2,3) order by id ;
id | date | lag
----+---------------------+---------------------
2 | 2017-01-30 00:00:00 | 2017-01-29 00:00:00
3 | 2017-01-29 00:00:00 |
It calculates LAG only over the data the query retrieves from the table; the window is bound to that query's result set. See here: id 3 should have a lag, but it could not get one because the query's filter does not allow it. Something in the left join needs to be done so the lag date can be taken from the same table by referencing it again in the FROM clause. Thanks again, buddy.
Like here?:
t=# select date as d1,
lag(date) over (order by date)
from inv
order by 1 desc;
d1 | lag
---------------------+---------------------
2017-01-31 00:00:00 | 2017-01-30 00:00:00
2017-01-30 00:00:00 | 2017-01-29 00:00:00
2017-01-29 00:00:00 | 2017-01-28 00:00:00
2017-01-28 00:00:00 | 2017-01-27 00:00:00
2017-01-27 00:00:00 |
(5 rows)
Time: 1.416 ms
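If the problem from Update 2 is that the WHERE clause shrinks the window LAG sees, one option (a sketch using the inv table from the question) is to fetch the previous row with a LEFT JOIN LATERAL against the full table, so outer filters do not hide it:

select i.id,
       i.date,
       prev.date as prev_date,
       prev.id   as prev_id
from inv i
left join lateral (
    select p.id, p.date
    from inv p
    where p.date < i.date        -- previous invoice, regardless of outer filters
    order by p.date desc
    limit 1
) prev on true
where i.id in (2, 3)             -- the filter no longer hides the previous row
order by i.id;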