Count devices per day in a given date range - sql

I have a table which has devices with 3 statuses, Pass, Fail and Warning.
Device   Status   Date
Device1  Pass     12/1/2020
Device2  Fail     12/1/2020
Device3  Warning  12/1/2020
Device1  Fail     12/2/2020
Device2  Warning  12/2/2020
Device3  Pass     12/2/2020
I want to generate a trend graph of count of devices based on the daily status. The count is on all the devices for each day. The table above will have device data repeated for multiple dates.
Example:
I want to generate a stacked bar graph, which will show count of devices which are pass, fail or warning. Need to get a query which I can use to get the response back with DateTime, count of failed devices, count of devices passed, count of devices having warning over a range of dates.
select (select count(*) from status_table where overall_status = 'Fail' and startDate > "" and endDate < "") as failedCount,
(select count(*) from status_table where overall_status = 'Warning' and startDate > "" and endDate < "") as WarningCount,
(select count(*) from status_table where overall_status = 'Pass' and startDate > "" and endDate < "") as passCount from status_table
Is there a better solution?

You can use the aggregate FILTER clause to do it in a single query.
This gets three counts (fail, warning, pass) for every device on every day in the selected date range. A count is NULL for days on which the device does not appear at all, and 0 if the device appeared, but not with this status:
SELECT date, device_name
, fail_count, warning_count, pass_count
FROM (SELECT DISTINCT device_name FROM status_table) d -- all devices ①
CROSS JOIN (
SELECT generate_series(timestamp '2020-12-01'
, timestamp '2020-12-31'
, interval '1 day')::date
) t(date) -- all dates
LEFT JOIN (
SELECT date, device_name
, count(*) FILTER (WHERE overall_status = 'Fail') AS fail_count
, count(*) FILTER (WHERE overall_status = 'Warning') AS warning_count
, count(*) FILTER (WHERE overall_status = 'Pass') AS pass_count
FROM status_table
WHERE date >= '2020-12-01' -- same date range as above
AND date <= '2020-12-31'
GROUP BY 1, 2
) s USING (date, device_name)
ORDER BY 1, 2;
Basically, you CROSS JOIN all devices to all dates (Cartesian product), then append data where data can be found with a LEFT JOIN.
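The grid-then-LEFT-JOIN idea can be tried on the sample rows from the question. This is only a sketch run against SQLite through Python's sqlite3 module, not Postgres: a recursive CTE stands in for generate_series(), SUM(CASE ...) stands in for count(*) FILTER (...), and the column names device_name, overall_status and date (stored as ISO text) are assumptions.

```python
import sqlite3

# Sample rows from the question, with dates rewritten as ISO text (assumption).
SAMPLE = [
    ("Device1", "Pass", "2020-12-01"), ("Device2", "Fail", "2020-12-01"),
    ("Device3", "Warning", "2020-12-01"), ("Device1", "Fail", "2020-12-02"),
    ("Device2", "Warning", "2020-12-02"), ("Device3", "Pass", "2020-12-02"),
]

def daily_counts(rows, first_day, last_day):
    """Return (day, device, fail, warning, pass) for every device on every day."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE status_table (device_name TEXT, overall_status TEXT, date TEXT)")
    con.executemany("INSERT INTO status_table VALUES (?, ?, ?)", rows)
    sql = """
    WITH RECURSIVE days(day) AS (              -- all dates in the range
        SELECT :first
        UNION ALL
        SELECT date(day, '+1 day') FROM days WHERE day < :last
    )
    SELECT t.day, d.device_name, s.fail_count, s.warning_count, s.pass_count
    FROM (SELECT DISTINCT device_name FROM status_table) d   -- all devices
    CROSS JOIN days t
    LEFT JOIN (                                 -- counts only where data exists
        SELECT date, device_name,
               SUM(CASE WHEN overall_status = 'Fail' THEN 1 ELSE 0 END) AS fail_count,
               SUM(CASE WHEN overall_status = 'Warning' THEN 1 ELSE 0 END) AS warning_count,
               SUM(CASE WHEN overall_status = 'Pass' THEN 1 ELSE 0 END) AS pass_count
        FROM status_table
        WHERE date >= :first AND date <= :last
        GROUP BY date, device_name
    ) s ON s.date = t.day AND s.device_name = d.device_name
    ORDER BY t.day, d.device_name
    """
    result = con.execute(sql, {"first": first_day, "last": last_day}).fetchall()
    con.close()
    return result
```

Calling daily_counts(SAMPLE, "2020-12-01", "2020-12-03") returns nine rows. The three rows for 2020-12-03 carry NULL (None in Python) in all count columns, because no device reported on that day.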
① Since you don't seem to have a device table (which you probably should), generate the full list on the fly. The above query with DISTINCT is good for few rows per device. Else, there are (much) faster techniques like:
WITH RECURSIVE cte AS (
(SELECT device_name FROM status_table ORDER BY 1 LIMIT 1)
UNION ALL
SELECT (SELECT device_name FROM status_table
WHERE device_name > t.device_name ORDER BY 1 LIMIT 1)
FROM cte t -- alias t is referenced in the subquery above
WHERE t.device_name IS NOT NULL
)
SELECT * FROM cte
WHERE device_name IS NOT NULL;
See:
https://wiki.postgresql.org/wiki/Loose_indexscan
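The same "hop to the next distinct value" idea can be tried outside Postgres. Below is a rough sketch against SQLite through Python's sqlite3 module. SQLite does not accept the parenthesized ORDER BY ... LIMIT 1 form inside a compound query, so min() is swapped in for it. The index name is hypothetical.

```python
import sqlite3

def distinct_devices(con):
    """Walk the distinct device names one at a time instead of scanning all rows."""
    sql = """
    WITH RECURSIVE cte(device_name) AS (
        SELECT min(device_name) FROM status_table          -- smallest name
        UNION ALL
        SELECT (SELECT min(s.device_name) FROM status_table s
                WHERE s.device_name > c.device_name)       -- next larger name
        FROM cte c
        WHERE c.device_name IS NOT NULL                    -- stop after the last one
    )
    SELECT device_name FROM cte WHERE device_name IS NOT NULL
    """
    return [row[0] for row in con.execute(sql)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE status_table (device_name TEXT, overall_status TEXT, date TEXT)")
# Hypothetical index; each min() lookup can use it instead of a full scan.
con.execute("CREATE INDEX status_table_device_idx ON status_table (device_name)")
con.executemany(
    "INSERT INTO status_table VALUES (?, ?, ?)",
    [("Device2", "Fail", "2020-12-01"), ("Device1", "Pass", "2020-12-01"),
     ("Device3", "Warning", "2020-12-01"), ("Device1", "Fail", "2020-12-02"),
     ("Device2", "Warning", "2020-12-02")],
)
devices = distinct_devices(con)
```

This only pays off with many rows per device and an index on device_name. With few rows per device, plain DISTINCT is simpler.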
The subquery s aggregates only rows from the given date range. It's strictly optional. You can also left-join to the underlying table directly, and then aggregate all. But this approach is typically (much) faster.
You can convert NULL to zero or vice versa with COALESCE / NULLIF.
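As a quick illustration of the two conversions (run here against SQLite via Python's sqlite3 module as a stand-in; both functions are standard SQL and behave the same way in Postgres):

```python
import sqlite3

def null_conversions():
    """COALESCE turns NULL into a default; NULLIF turns a given value into NULL."""
    con = sqlite3.connect(":memory:")
    row = con.execute(
        "SELECT COALESCE(NULL, 0), COALESCE(5, 0), NULLIF(0, 0), NULLIF(5, 0)"
    ).fetchone()
    con.close()
    return row

print(null_conversions())  # (0, 5, None, 5)
```

Applied to the query above, COALESCE(fail_count, 0) would report 0 for days without any appearance, while NULLIF(fail_count, 0) would report NULL whenever the count is 0.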
Related:
PostgreSQL: running count of rows for a query 'by minute'
Aggregate columns with additional (distinct) filters
For more flags, a crosstab() query might be faster. See:
PostgreSQL Crosstab Query
About generating a date range:
Generating time series between two dates in PostgreSQL
Be aware that dates are defined by your current time zone setting if you operate with timestamp with time zone. See:
Ignoring time zones altogether in Rails and PostgreSQL

Related

Select rows for last n days after event occurs

I have the following table and data:
PatientID PatientName Diagnosed ReportDate ...
1 0
1 0
1 0
1 1
So there are multiple rows for each patient, as the reports come few times a day.
Whenever the diagnosed field is changed to 1, for that patient, I'd like to get the past 3 days of data. So when Diagnosed ==1, get report time -3 days of data for each patient.
SELECT Patients.ReportDate
FROM Patients
WHERE Diagnosed = 1 and date > ReportDate - interval '3' day;
So getting the past 3 days of data, can be done with ReportDate - interval time, but how do I specify that for every patient (since multiple ids can be for that patient) based on the diagnosed field?
I usually do this filtering after getting csvs in python, but the data set is too large, so I'd like to filter before I convert them to dataframes.
You can look at this another way, which is whether diagnosed = 1 in the next three days -- and take all rows where that is true:
select p.*
from (select p.*,
count(*) filter (where diagnosed = 1) over (partition by patientId order by reportDate range between interval '0 day' following and interval '3 day' following) as cnt_diagnosed_3
from patients p
) p
where cnt_diagnosed_3 > 0
order by patientId, reportDate;
Whenever the diagnosed field is changed to 1, for that patient, I'd like to get the past 3 days of data.
SELECT (p).*
FROM (
SELECT p
, diagnosed
, bool_or(diagnosed = 1) OVER (w RANGE BETWEEN CURRENT ROW AND '3 days' FOLLOWING) AS in_range
, lag(diagnosed) OVER w AS last_diagnosed
FROM patients p
WINDOW w AS (PARTITION BY patientid ORDER BY reportdate)
) sub
WHERE diagnosed = 0 AND in_range
OR diagnosed = 1 AND last_diagnosed = 0
ORDER BY patientid, reportdate;
Returns the "past 3 days of data" where the "field is changed to 1" (previous row had "0").
The WINDOW clause is just syntactic sugar to avoid spelling out the same window definition repeatedly. (No additional benefit for performance.)
SELECT p in the innermost subquery is a neat way to get the whole row. The outer SELECT (p).* returns complete rows without auxiliary columns added in the subquery. This way we get whole rows without spelling out all columns (or even needing to know all of them).
RANGE distance PRECEDING/FOLLOWING requires Postgres 11 or later.
Here is a slower alternative that also works for older versions:
SELECT p.*
FROM (
SELECT patientid, reportdate
FROM (
SELECT patientid, reportdate, diagnosed
, lag(diagnosed) OVER (PARTITION BY patientid ORDER BY reportdate) AS last_diagnosed
FROM patients
) p0
WHERE diagnosed = 1
AND last_diagnosed = 0
) d
JOIN patients p USING (patientid)
WHERE p.reportdate BETWEEN d.reportdate - interval '3 days' AND d.reportdate
ORDER BY p.patientid, p.reportdate;
Subquery d selects rows where Diagnosed just switched to 1. Then self-join to select your time frame.
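To check the lag-plus-self-join variant on a few rows, here is a sketch against SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The reduced three-column table, the text timestamps and the sample rows are assumptions. datetime(..., '-3 days') stands in for the Postgres interval arithmetic.

```python
import sqlite3

def rows_before_switch(rows, days=3):
    """Rows from the `days` days leading up to each 0 -> 1 switch, per patient."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE patients (patientid INTEGER, diagnosed INTEGER, reportdate TEXT)")
    con.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)
    sql = """
    SELECT p.patientid, p.diagnosed, p.reportdate
    FROM (
        SELECT patientid, reportdate
        FROM (
            SELECT patientid, reportdate, diagnosed,
                   lag(diagnosed) OVER (PARTITION BY patientid
                                        ORDER BY reportdate) AS last_diagnosed
            FROM patients
        ) p0
        WHERE diagnosed = 1 AND last_diagnosed = 0     -- just switched to 1
    ) d
    JOIN patients p ON p.patientid = d.patientid
    WHERE p.reportdate BETWEEN datetime(d.reportdate, :offset) AND d.reportdate
    ORDER BY p.patientid, p.reportdate
    """
    out = con.execute(sql, {"offset": f"-{int(days)} days"}).fetchall()
    con.close()
    return out

SAMPLE = [
    (1, 0, "2021-01-01 08:00:00"), (1, 0, "2021-01-05 08:00:00"),
    (1, 0, "2021-01-06 08:00:00"), (1, 1, "2021-01-07 08:00:00"),
    (1, 1, "2021-01-08 08:00:00"), (2, 0, "2021-01-06 09:00:00"),
]
```

With SAMPLE, only patient 1 switches (on 2021-01-07), so the result holds the rows from 2021-01-05 through 2021-01-07. The later row that is still at 1 and the never-diagnosed patient 2 are left out.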
For gaps-and-islands basics, see:
Select longest continuous sequence
You also added:
So when Diagnosed ==1, get report time -3 days of data for each patient.
That's a wider definition, and that's what Gordon's query does. Goes to show the importance of an exact definition of requirements.

Same output in two different lateral joins

I'm working on a bit of PostgreSQL to grab the first 10 and last 10 invoices of every month between certain dates. I am having unexpected output in the lateral joins. Firstly the limit is not working, and each of the array_agg aggregates is returning hundreds of rows instead of limiting to 10. Secondly, the aggregates appear to be the same, even though one is ordered ASC and the other DESC.
How can I retrieve only the first 10 and last 10 invoices of each month group?
SELECT first.invoice_month,
array_agg(first.id) first_ten,
array_agg(last.id) last_ten
FROM public.invoice i
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id ASC
LIMIT 10
) first ON i.id = first.id
JOIN LATERAL (
SELECT id, to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE id = i.id
ORDER BY invoice_date, id DESC
LIMIT 10
) last on i.id = last.id
WHERE i.invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
GROUP BY first.invoice_month, last.invoice_month;
This can be done with a recursive query that generates the series of months for which we need to find the first and last 10 invoices.
WITH RECURSIVE all_months AS (
SELECT date_trunc('month','2018-01-01'::TIMESTAMP) as c_date, date_trunc('month', '2018-05-11'::TIMESTAMP) as end_date, to_char('2018-01-01'::timestamp, 'YYYY-MM') as current_month
UNION
SELECT c_date + interval '1 month' as c_date,
end_date,
to_char(c_date + INTERVAL '1 month', 'YYYY-MM') as current_month
FROM all_months
WHERE c_date + INTERVAL '1 month' <= end_date
),
invoices_with_month as (
SELECT *, to_char(invoice_date::TIMESTAMP, 'YYYY-MM') invoice_month FROM invoice
)
SELECT current_month, array_agg(first_10.id), 'FIRST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invoices_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date ASC limit 10
) first_10 ON TRUE
GROUP BY current_month
UNION
SELECT current_month, array_agg(last_10.id), 'LAST 10' as type FROM all_months
JOIN LATERAL (
SELECT * FROM invoices_with_month
WHERE all_months.current_month = invoice_month AND invoice_date >= '2018-01-01' AND invoice_date <= '2018-05-11'
ORDER BY invoice_date DESC limit 10
) last_10 ON TRUE
GROUP BY current_month;
In the code above, '2018-01-01' and '2018-05-11' represent the dates between which we want to find the invoices. Based on those dates, we generate the months (2018-01, 2018-02, 2018-03, 2018-04, 2018-05) that we need to find the invoices for.
We store this data in all_months.
After we get the months, we do a lateral join in order to join the invoices for every month. We need 2 lateral joins in order to get the first and last 10 invoices.
Finally, the result is represented as:
current_month - the month
array_agg - ids of all selected invoices for that month
type - type of the selected invoices ('first 10' or 'last 10').
So in the current implementation, you will have 2 rows for each month (if there is at least 1 invoice for that month). You can easily join that in one row if you need to.
LIMIT is working fine. It's your query that's broken. JOIN is just 100% the wrong tool here; it doesn't even do anything close to what you need. By joining up to 10 rows with up to another 10 rows, you get up to 100 rows back. There's also no reason to self join just to combine filters.
Consider instead window queries. In particular, we have the dense_rank function, which can number every row in the result set according to groups:
SELECT
invoice_month,
time_of_month,
ARRAY_AGG(id) invoice_ids
FROM (
SELECT
id,
invoice_month,
-- Categorize as end or beginning of month
CASE
WHEN month_rank <= 10 THEN 'beginning'
WHEN month_reverse_rank <= 10 THEN 'end'
ELSE 'bug' -- Should never happen. Just a fall back in case of a bug.
END AS time_of_month
FROM (
SELECT
id,
invoice_month,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date) month_rank,
dense_rank() OVER (PARTITION BY invoice_month ORDER BY invoice_date DESC) month_reverse_rank
FROM (
SELECT
id,
invoice_date,
to_char(invoice_date, 'Mon-yy') AS invoice_month
FROM public.invoice
WHERE invoice_date BETWEEN date '2017-10-01' AND date '2018-09-30'
) AS fiscal_year_invoices
) ranked_invoices
-- Get first and last 10
WHERE month_rank <= 10 OR month_reverse_rank <= 10
) first_and_last_by_month
GROUP BY
invoice_month,
time_of_month
Don't be intimidated by the length. This query is actually very straightforward; it just needed a few subqueries.
This is what it does logically:
Fetch the rows for the fiscal year in question
Assign a "rank" to the row within its month, both counting from the beginning and from the end
Filter out everything that doesn't rank in the 10 top for its month (counting from either direction)
Add an indicator as to whether it was at the beginning or end of the month. (Note that if there are fewer than 20 rows in a month, it will categorize more of them as "beginning".)
Aggregate the IDs together
This is the tool set designed for the job you're trying to do. If really needed, you can adjust this approach slightly to get them into the same row, but you have to aggregate before joining the results together and then join on the month; you can't join and then aggregate.
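For experimenting with the ranking idea, here is a small sketch against SQLite through Python's sqlite3 module (SQLite 3.25 or later for window functions). It deviates from the query above in two labeled ways: row_number() is swapped in for dense_rank(), so ties on invoice_date cannot return more than n rows, and the two lists per month are collected in Python rather than with array_agg. The two-column table is an assumption.

```python
import sqlite3
from collections import defaultdict

def first_and_last(invoices, n=10):
    """Map each month to the ids of its first n and last n invoices."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE invoice (id INTEGER PRIMARY KEY, invoice_date TEXT)")
    con.executemany("INSERT INTO invoice VALUES (?, ?)", invoices)
    sql = """
    SELECT invoice_month, id, month_rank, month_reverse_rank
    FROM (
        SELECT id,
               strftime('%Y-%m', invoice_date) AS invoice_month,
               row_number() OVER (PARTITION BY strftime('%Y-%m', invoice_date)
                                  ORDER BY invoice_date, id) AS month_rank,
               row_number() OVER (PARTITION BY strftime('%Y-%m', invoice_date)
                                  ORDER BY invoice_date DESC, id DESC) AS month_reverse_rank
        FROM invoice
    ) ranked
    WHERE month_rank <= :n OR month_reverse_rank <= :n
    ORDER BY invoice_month, month_rank
    """
    result = defaultdict(lambda: {"first": [], "last": []})
    for month, inv_id, rank, reverse_rank in con.execute(sql, {"n": n}):
        if rank <= n:
            result[month]["first"].append(inv_id)
        if reverse_rank <= n:          # an invoice may land in both lists
            result[month]["last"].append(inv_id)
    con.close()
    return dict(result)

SAMPLE = [(i, f"2018-01-0{i}") for i in range(1, 6)] + [(6, "2018-02-01"), (7, "2018-02-02")]
```

With SAMPLE and n=2, January yields first [1, 2] and last [4, 5]. February has only two invoices, so both land in each list.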

Fetch max value from a sort-of incomplete dataset

A number of devices return a value. Only upon change, this value gets stored in a table:
Device Value Date
B 5 2017-07-01
C 2 2017-07-01
A 3 2017-07-02
C 1 2017-07-04
A 6 2017-07-04
Values may enter the table at any date (i.e. the date doesn't increment continuously). Several devices may store their value on the same date.
Note that, even though there are usually only a few devices for each date in the table, all devices actually have a value at that date: it's the latest one stored until then. For example, on 2017-07-02 only device A stored a value. The values for B and C on that date are the ones stored on 2017-07-01; these are still valid on -02, they just did not change.
To retrieve the values for all devices on a given date, e.g. 2017-07-04, I'm using this:
select data.device, data.value
from data
inner join (select device, max(date) as date
from data
where date <= '2017-07-04'
group by device) latestdate
on data.device = latestdate.device and data.date = latestdate.date
Device Value
A 6
B 5
C 1
Question: I'd like to read the max value of all devices on all dates in a given range. The result set would be like this:
Date max(value)
2017-07-01 5
2017-07-02 5
2017-07-04 6
.. and I have no clue if that's possible using only SQL. Until now all I got was lost in an exceptional bunch of joins and groupings.
(Database is sqlite3. Generic SQL would be nice, but I'd still be happy to hear about solutions specific to other databases, especially PostgreSQL or MariaDB.)
Extra bonus: Include the missing date -03, to be exact: returning values at given dates, not necessarily the ones appearing in the table.
Date max(value)
2017-07-01 5
2017-07-02 5
2017-07-03 5
2017-07-04 6
I think the most generic way to approach this is using a separate query for each date. There are definitely simpler methods, depending on the database. But getting one that works for SQLite, MariaDB, and Postgres is not going to use any sophisticated functionality:
select '2017-07-01' as date, max(data.value)
from data inner join
(select device, max(date) as date
from data
where date <= '2017-07-01' group by device
) latestdate
on data.device = latestdate.device and data.date = latestdate.date
union all
select '2017-07-02' as date, max(data.value)
from data inner join
(select device, max(date) as date
from data
where date <= '2017-07-02' group by device
) latestdate
on data.device = latestdate.device and data.date = latestdate.date
union all
select '2017-07-03' as date, max(data.value)
from data inner join
(select device, max(date) as date
from data
where date <= '2017-07-03' group by device
) latestdate
on data.device = latestdate.device and data.date = latestdate.date
union all
select '2017-07-04' as date, max(data.value)
from data inner join
(select device, max(date) as date
from data
where date <= '2017-07-04' group by device
) latestdate
on data.device = latestdate.device and data.date = latestdate.date;
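If writing one branch per date gets unwieldy, the same per-date lookup can be driven by a generated list of dates. The sketch below is one possible SQLite-specific variant (the question mentions sqlite3), run through Python's sqlite3 module. It is not the portable form above: it relies on a recursive CTE and SQLite's date() function. The table layout follows the question.

```python
import sqlite3

# Sample rows from the question: (device, value, date).
SAMPLE = [
    ("B", 5, "2017-07-01"), ("C", 2, "2017-07-01"), ("A", 3, "2017-07-02"),
    ("C", 1, "2017-07-04"), ("A", 6, "2017-07-04"),
]

def max_per_day(rows, first_day, last_day):
    """For each day in the range, the max over each device's latest stored value."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE data (device TEXT, value INTEGER, date TEXT)")
    con.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)
    sql = """
    WITH RECURSIVE days(day) AS (              -- every date in the range
        SELECT :first
        UNION ALL
        SELECT date(day, '+1 day') FROM days WHERE day < :last
    )
    SELECT days.day, max(data.value)
    FROM days
    JOIN data
      ON data.date = (SELECT max(d2.date)      -- latest entry of this device
                      FROM data d2             -- on or before the given day
                      WHERE d2.device = data.device
                        AND d2.date <= days.day)
    GROUP BY days.day
    ORDER BY days.day
    """
    out = con.execute(sql, {"first": first_day, "last": last_day}).fetchall()
    con.close()
    return out
```

With SAMPLE and the range 2017-07-01 to 2017-07-04, this returns 5, 5, 5, 6, including the missing date -03. Days before the first stored value are dropped by the inner join.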
This should be a solution for your problem.
It should be cross-database, since the OVER clause is supported by most databases.
You should create a table with all the dates ("ALL_DATE" in the query); otherwise, every database has its own way to generate them without a table.
WITH GROUPED_BY_DATE_DEVICE AS (
SELECT DATE, DEVICE, SUM(VALUE) AS VALUE FROM DEVICE_INFO
GROUP BY DATE, DEVICE
), GROUPED_BY_DATE AS (
SELECT A.DATE, MAX(VALUE) AS VALUE
FROM ALL_DATE A
LEFT JOIN GROUPED_BY_DATE_DEVICE B
ON A.DATE = B.DATE
GROUP BY A.DATE
)
SELECT DATE, MAX(VALUE) OVER (ORDER BY DATE) AS MAX_VALUE
FROM GROUPED_BY_DATE
ORDER BY DATE;

Calculating business days in Teradata

I need help in business days calculation.
I've two tables
1) One table ACTUAL_TABLE containing order date and contact date with timestamp datatypes.
2) The second table BUSINESS_DATES has each of the calendar dates listed and has a flag to indicate weekend days.
Using these two tables, I need to ensure that business days, not calendar days (which is the current logic), are calculated between these two fields.
My thought process was to first get a range of dates by comparing ORDER_DATE with TABLE_DATE field and then do a similar comparison of CONTACT_DATE to TABLE_DATE field. This would get me a range from the BUSINESS_DATES table which I can then use to calculate count of days, sum(Holiday_WKND_Flag) fields making the result look like:
Order# | Count(*) As DAYS | SUM(WEEKEND DATES)
100 | 25 | 8
However, this only works when I use a specific order number, and I can't bring all order numbers into a subquery.
My Query:
SELECT SUM(Holiday_WKND_Flag), COUNT(*) FROM
(
SELECT
* FROM
BUSINESS_DATES
WHERE BUSINESS.Business BETWEEN (SELECT ORDER_DATE FROM ACTUAL_TABLE
WHERE ORDER# = '100'
)
AND
(SELECT CONTACT_DATE FROM ACTUAL_TABLE
WHERE ORDER# = '100'
)
TEMP
Uploading the table structure for your reference.
SELECT ORDER#, SUM(Holiday_WKND_Flag), COUNT(*)
FROM business_dates bd
INNER JOIN actual_table at ON bd.table_date BETWEEN at.order_date AND at.contact_date
GROUP BY ORDER#
Instead of joining on a BETWEEN (which always results in a bad Product Join) followed by a COUNT, you'd better assign a business day number to each date (in the best case, this is calculated only once and added as a column to your calendar table). Then it's two Equi-Joins and no aggregation is needed:
WITH cte AS
(
SELECT
Cast(table_date AS DATE) AS table_date,
-- assign a consecutive number to each business day, i.e. not increased during weekends, etc.
Sum(CASE WHEN Holiday_WKND_Flag = 1 THEN 0 ELSE 1 end)
Over (ORDER BY table_date
ROWS Unbounded Preceding) AS business_day_nbr
FROM business_dates
)
SELECT ORDER#,
Cast(t.contact_date AS DATE) - Cast(t.order_date AS DATE) AS #_of_days,
b2.business_day_nbr - b1.business_day_nbr AS #_of_business_days
FROM actual_table AS t
JOIN cte AS b1
ON Cast(t.order_date AS DATE) = b1.table_date
JOIN cte AS b2
ON Cast(t.contact_date AS DATE) = b2.table_date
Btw, why are table_date and order_date timestamp instead of a date?
Porting from Oracle?
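To see the business-day-number trick on concrete dates, here is a sketch that runs against SQLite through Python's sqlite3 module rather than Teradata (SQLite 3.25 or later for the windowed SUM). The calendar is generated in Python with only weekends flagged, order_no replaces ORDER#, and date() / julianday() stand in for the Teradata casts. All of that is assumption for the sake of a runnable example.

```python
import sqlite3
from datetime import date, timedelta

def business_days(orders, first_day, last_day):
    """Return (order_no, calendar_days, business_days) for each order."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE business_dates (table_date TEXT PRIMARY KEY, holiday_wknd_flag INTEGER)")
    con.execute("CREATE TABLE actual_table (order_no INTEGER, order_date TEXT, contact_date TEXT)")
    day = first_day
    while day <= last_day:                      # flag Saturdays and Sundays only
        con.execute("INSERT INTO business_dates VALUES (?, ?)",
                    (day.isoformat(), 1 if day.weekday() >= 5 else 0))
        day += timedelta(days=1)
    con.executemany("INSERT INTO actual_table VALUES (?, ?, ?)", orders)
    sql = """
    WITH cte AS (
        SELECT table_date,
               -- running count of business days; not increased on flagged days
               SUM(CASE WHEN holiday_wknd_flag = 1 THEN 0 ELSE 1 END)
                   OVER (ORDER BY table_date
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS business_day_nbr
        FROM business_dates
    )
    SELECT t.order_no,
           CAST(julianday(date(t.contact_date)) - julianday(date(t.order_date)) AS INTEGER),
           b2.business_day_nbr - b1.business_day_nbr
    FROM actual_table t
    JOIN cte b1 ON date(t.order_date) = b1.table_date
    JOIN cte b2 ON date(t.contact_date) = b2.table_date
    ORDER BY t.order_no
    """
    out = con.execute(sql).fetchall()
    con.close()
    return out

ORDERS = [
    (100, "2020-12-03 09:30:00", "2020-12-08 16:00:00"),   # Thursday to Tuesday
    (101, "2020-12-04 10:00:00", "2020-12-07 11:00:00"),   # Friday to Monday
]
```

For ORDERS with a calendar from 2020-12-01 to 2020-12-14, order 100 spans 5 calendar days but 3 business days, and order 101 spans 3 calendar days but 1 business day.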
You can use this query. Hope it helps
select order#,
order_date,
contact_date,
(select count(1)
from business_dates
where table_date between a.order_date and a.contact_date
and holiday_wknd_flag = 0
) business_days
from actual_table a

Combine two queries with monthly average

I need to put together the results of these two queries into a single return with the following structure:
"date", avg(selic."Taxa"), avg(titulos."puVenda")
Partial structure of tables:
selic
"dtFechamento" date,
"pTaxa" real
titulos
"dtTitulo" date,
"puVenda" real,
"nomeTitulo" character(30)
Query table selic:
select to_char("dtFechamento", 'YYYY-MM') as data, avg("pTaxa")
from "selic"
group by data
order by data
Query table titulos:
select to_char("dtTitulo", 'YYYY-MM') as data, avg("puVenda")
from "titulos"
where "nomeTitulo" = 'LFT010321'
group by data
order by data
I tried a subquery, but it returned the fields side by side and I could not combine them.
select *
from (select to_char("dtFechamento", 'YYYY-MM') as data, avg("pTaxa")
from "selic"
group by data
order by data) as selic,
(select to_char("dtTitulo", 'YYYY-MM') as data, avg("puVenda")
from "titulos"
where "nomeTitulo" = 'LFT010321'
group by data
order by data) as LFT010321;
Assuming you want to return one row per month where either of your two queries returns a row, and pad missing values from the other query with NULL.
Use a FULL [OUTER] JOIN:
SELECT to_char(mon, 'YYYY-MM') AS data, s.avg_taxa, t.avg_venda
FROM (
SELECT date_trunc('month', "dtFechamento") AS mon, avg("pTaxa") AS avg_taxa
FROM selic
GROUP BY 1
) s
FULL JOIN (
SELECT date_trunc('month', "dtTitulo") AS mon, avg("puVenda") AS avg_venda
FROM titulos
WHERE "nomeTitulo" = 'LFT010321'
GROUP BY 1
) t USING (mon)
ORDER BY mon;
It is substantially faster to join after aggregating than before (fewer join operations).
It is also faster to GROUP BY, JOIN and ORDER on timestamp values than on a text rendition. Typically also cleaner and less error prone (although text is unambiguous in this particular case). That's why I use date_trunc() instead of to_char() on lower levels.
If the format for the month is not important, you can just return the timestamp value. Else you can format any way you like after you are done processing.
Similar case with more explanation:
PostgreSQL merge two queries with COUNT and GROUP BY in each
This should get what you need. The inner "PQ" (PreQuery) does a UNION ALL of the two averages and adds a flag column to identify which average each row belongs to. Each part is grouped by date, so the outer query will have AT MOST 2 records for a given date: one for tax, the other for Venda. So you don't need a full outer join, nor do you need to build a dynamic calendar to get the details for all possible dates.
So, it is possible for only a Tax average OR a Venda average OR BOTH.
SELECT
PQ.Data,
SUM( CASE when PQ.SumType = 'T' then PQ.TypeAvg else 0 end ) as AvgTax,
SUM( CASE when PQ.SumType = 'V' then PQ.TypeAvg else 0 end ) as AvgVenda
from
( select
to_char( "dtFechamento", 'YYYY-MM') as data,
'T' as sumtype,
avg( "pTaxa" ) as TypeAvg
from
selic
group by
to_char( "dtFechamento", 'YYYY-MM')
UNION ALL
select
to_char( "dtTitulo", 'YYYY-MM') as data,
'V' as sumType,
avg( "puVenda" ) as TypeAvg
from
titulos
where
"nomeTitulo" = 'LFT010321'
group by
to_char( "dtTitulo", 'YYYY-MM') ) PQ
group by
PQ.Data
order by
PQ.Data
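The UNION ALL plus conditional aggregation approach can be checked on a handful of rows. The sketch below runs against SQLite through Python's sqlite3 module, so strftime('%Y-%m', ...) stands in for to_char(). It also deviates from the query above in one labeled way: MAX(CASE ... END) without an ELSE branch is used instead of SUM(... else 0), so a month missing from one table shows NULL rather than 0. The sample values are made up.

```python
import sqlite3

def monthly_averages(selic_rows, titulos_rows, titulo):
    """One row per month: (month, avg pTaxa or None, avg puVenda or None)."""
    con = sqlite3.connect(":memory:")
    con.execute('CREATE TABLE selic ("dtFechamento" TEXT, "pTaxa" REAL)')
    con.execute('CREATE TABLE titulos ("dtTitulo" TEXT, "puVenda" REAL, "nomeTitulo" TEXT)')
    con.executemany("INSERT INTO selic VALUES (?, ?)", selic_rows)
    con.executemany("INSERT INTO titulos VALUES (?, ?, ?)", titulos_rows)
    sql = """
    SELECT pq.data,
           MAX(CASE WHEN pq.sumtype = 'T' THEN pq.typeavg END) AS avg_taxa,
           MAX(CASE WHEN pq.sumtype = 'V' THEN pq.typeavg END) AS avg_venda
    FROM (
        SELECT strftime('%Y-%m', "dtFechamento") AS data,
               'T' AS sumtype, avg("pTaxa") AS typeavg
        FROM selic
        GROUP BY 1
        UNION ALL
        SELECT strftime('%Y-%m', "dtTitulo"), 'V', avg("puVenda")
        FROM titulos
        WHERE "nomeTitulo" = :titulo
        GROUP BY 1
    ) pq
    GROUP BY pq.data          -- at most two rows per month are folded into one
    ORDER BY pq.data
    """
    out = con.execute(sql, {"titulo": titulo}).fetchall()
    con.close()
    return out

SELIC = [("2021-01-04", 2.0), ("2021-01-05", 4.0), ("2021-02-01", 3.0)]
TITULOS = [
    ("2021-02-01", 100.0, "LFT010321"), ("2021-02-02", 102.0, "LFT010321"),
    ("2021-03-01", 110.0, "LFT010321"), ("2021-03-01", 999.0, "OTHER"),
]
```

With the sample rows, January has only a selic average, March only a titulos average, and February has both.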