Check if the value has been the same for x consecutive days - SQL

My goal is to get a list of users (uploaders) who have uploaded 0-byte files for 4 consecutive days.
My current query:
SELECT * FROM file_uploads
JOIN (
    SELECT uploader_id, max(date_added) as latest_date
    FROM file_uploads
    WHERE size = 0
    GROUP BY uploader_id
) as list ON list.uploader_id = file_uploads.uploader_id
So in the JOIN I try to get the latest date, grouped by uploader, on which a file of size 0 was uploaded.
Now I need to check whether the size has been 0 for the 4 consecutive days ending at latest_date, but that is what I can't figure out. Can I check it with a WHERE clause, or do I need an IF statement or something else?
This is an example of the table I'm working with: https://ibb.co/Sf3RFSk

If there is one row per date, you can filter down to each uploader's last four days and then count the rows whose size is 0:
select uploader_id
from file_uploads fu
where fu.date_added >= (select max(fu2.date_added) - interval 3 day
                        from file_uploads fu2
                        where fu2.uploader_id = fu.uploader_id
                       ) and
      fu.size = 0
group by uploader_id
having count(*) = 4;
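If an uploader can log more than one row on the same day, counting rows can overstate the streak. A hedged variant of the same query, assuming the question's file_uploads table and date_added column, counts distinct dates instead:
select uploader_id
from file_uploads fu
where fu.date_added >= (select max(fu2.date_added) - interval 3 day
                        from file_uploads fu2
                        where fu2.uploader_id = fu.uploader_id
                       ) and
      fu.size = 0
group by uploader_id
having count(distinct fu.date_added) = 4;  -- one hit per day, four days running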

Related

SQL - Calculate percentage by group, for multiple groups

I have a table in GBQ in the following format:
UserId  Orders  Month
XDT     23      1
XDT     0       4
FKR     3       6
GHR     23      4
...     ...     ...
It shows the number of orders per user and month.
I want to calculate the percentage of users who have orders, which I did as follows:
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
GROUP BY
HasOrders
ORDER BY
Parts
It gives me the following result:
HasOrders  Parts
0          35
1          65
I need to calculate the percentage of users who have orders by month, in such a way that every month totals 100%.
Currently I execute the query once per month, which is not practical:
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
WHERE Month = 1
GROUP BY
HasOrders
ORDER BY
Parts
Is there a way to execute the query once and get this result?
HasOrders  Parts  Month
0          25     1
1          75     1
0          45     2
1          55     2
...        ...    ...
You can group by Month and SIGN(Orders) directly, using a window SUM partitioned by month as the denominator:
SELECT
    SIGN(Orders) AS HasOrders,
    ROUND(COUNT(*) * 100.000 / SUM(COUNT(*)) OVER (PARTITION BY Month), 2) AS Parts,
    Month
FROM T
GROUP BY Month, SIGN(Orders)
ORDER BY Month, SIGN(Orders)
Demo on Postgres:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=4cd2d1455673469c2dfc060eccea8020
You've stated that it's important for each month's total to be exactly 100%, so you might consider rounding down for the no-orders row and rounding up for the has-orders row in those cases where a percentage falls precisely on an odd multiple of 0.5%. Rounding toward even, or rounding the smaller share toward zero, are other options:
WITH DATA AS (
    SELECT SIGN(Orders) AS HasOrders, Month,
           COUNT(*) * 100.000 / SUM(COUNT(*)) OVER (PARTITION BY Month) AS PartsPercent
    FROM T
    GROUP BY Month, SIGN(Orders)
    ORDER BY Month, SIGN(Orders)
)
SELECT HasOrders, Month, PartsPercent,
       PartsPercent - TRUNCATE(PartsPercent) AS Fraction,
       CASE WHEN HasOrders = 0
            THEN FLOOR(PartsPercent) ELSE CEILING(PartsPercent)
       END AS PartsRound0Down,
       CASE WHEN PartsPercent - TRUNCATE(PartsPercent) = 0.5
                 AND MOD(TRUNCATE(PartsPercent), 2) = 0
            THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway rounds up
       END AS PartsRoundTowardEven,
       CASE WHEN PartsPercent - TRUNCATE(PartsPercent) = 0.5 AND PartsPercent < 50
            THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway rounds up
       END AS PartsSmallestTowardZero
FROM DATA
It's usually not advisable to test floating-point values for equality, and I don't know exactly how BigQuery's float64 behaves in the comparison against 0.5 (one half is, at least, exactly representable in binary). See these options applied to a case where the breakout is 101 vs 99. I don't have immediate access to BigQuery, so be aware that Postgres's rounding behavior is different:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=c8237e272427a0d1114c3d8056a01a09
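If the exact equality test still feels too fragile, a tolerance comparison is a common hedge. This is only a sketch of the half-detection with an epsilon, reusing the names from the query above; I haven't verified it on BigQuery:
CASE WHEN ABS(PartsPercent - TRUNCATE(PartsPercent) - 0.5) < 1e-9
          AND MOD(TRUNCATE(PartsPercent), 2) = 0
     THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent)
END AS PartsRoundTowardEven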
Consider the approach below:
select hasOrders, round(100 * parts, 2) as parts, month
from (
    select month,
           countif(orders = 0) / count(*) `0`,
           countif(orders > 0) / count(*) `1`
    from your_table
    group by month
)
unpivot (parts for hasOrders in (`0`, `1`))
which yields output in the shape requested above: one row per month for each hasOrders value.

Snowflake SQL UDFs: SELECT TOP N, LIMIT, ROW_NUMBER() AND RANK() not working in subqueries?

I've been experimenting with Snowflake SQL UDF solutions to add a desired number of working days to a timestamp. I've been trying to define a function that takes a timestamp and the desired number of working days to add as parameters and returns a date. The function uses a date dimension table. It works when I pass it a single date as a parameter, but whenever I try to give it a full column of dates, it throws the error "Unsupported subquery type cannot be evaluated". This seems to happen whenever I use SELECT TOP N, LIMIT, ROW_NUMBER() or RANK() in a subquery.
Here is an example of an approach I tried:
CREATE OR REPLACE FUNCTION "ADDWORKINGDAYSTOWORKINGDAY"(STARTDATE TIMESTAMP_NTZ, DAYS NUMBER)
RETURNS DATE
LANGUAGE SQL
AS '
WITH CTE AS (
    SELECT PAIVA
    FROM EDW_DEV.REPORTING_SCHEMA."D_PAIVA" P
    WHERE ARKIPAIVA = 1 AND ARKIPYHA_FI = FALSE
      AND 1 = CASE WHEN DAYS < 0 AND P.PAIVA < TO_DATE(STARTDATE) THEN 1
                   WHEN DAYS < 0 AND P.PAIVA >= TO_DATE(STARTDATE) THEN 0
                   WHEN DAYS >= 0 AND P.PAIVA > TO_DATE(STARTDATE) THEN 1
                   ELSE 0
              END),
CTE2 AS (
    SELECT PAIVA,
           CASE WHEN DAYS >= 0 THEN RANK() OVER (ORDER BY PAIVA)
                ELSE RANK() OVER (ORDER BY PAIVA DESC)
           END AS RANK
    FROM CTE
    ORDER BY RANK)
SELECT TOP 1
    ANY_VALUE(CASE WHEN DAYS IS NULL OR TO_DATE(STARTDATE) IS NULL THEN NULL
                   WHEN DAYS = 0 THEN TO_DATE(STARTDATE)
                   ELSE PAIVA
              END) AS PAIVA
FROM CTE2
WHERE CASE WHEN DAYS IS NULL OR TO_DATE(STARTDATE) IS NULL THEN 1 = 1
           WHEN DAYS > 0 THEN RANK = DAYS
           WHEN DAYS = 0 THEN 1 = 1
           ELSE RANK = -DAYS
      END
';
UDFs are scalar: they return only one value of a specified type, in this case a date. If you want to return a set of values for a column, you may want to investigate UDTFs, user-defined table functions, which return a table.
https://docs.snowflake.net/manuals/sql-reference/udf-table-functions.html
With a bit of modification, your UDF can be converted to a UDTF and passed columns instead of scalar values. You can then join the table resulting from the UDTF to the base table to get the working-day addition values, as in the sketch below.
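As a hedged illustration only (the function name, the simplified body, and the ORDERS base table are hypothetical; the real body would carry over the D_PAIVA ranking logic from the question), a SQL UDTF and a lateral join against it have this shape:
CREATE OR REPLACE FUNCTION ADD_WORKING_DAYS_T(STARTDATE TIMESTAMP_NTZ, DAYS NUMBER)
RETURNS TABLE (RESULT_DATE DATE)
AS
$$
    -- hypothetical simplified body: first working day after STARTDATE;
    -- the real version would rank D_PAIVA rows and pick the DAYS-th one
    SELECT MIN(PAIVA) AS RESULT_DATE
    FROM EDW_DEV.REPORTING_SCHEMA."D_PAIVA"
    WHERE ARKIPAIVA = 1 AND ARKIPYHA_FI = FALSE
      AND PAIVA > TO_DATE(STARTDATE)
$$;
-- pass a whole column of dates by joining laterally against the UDTF
SELECT o.ORDER_DATE, f.RESULT_DATE
FROM ORDERS o,
     TABLE(ADD_WORKING_DAYS_T(o.ORDER_DATE, 5)) f;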

Improving the performance of a query

My background is Oracle, but we've moved to Hadoop on AWS and I'm accessing our logs using Hive SQL. I've been asked to produce a report, broken out by uptime band, of cases where the number of high-severity errors of any given type on a system exceeds 9 in any rolling 30-day period (I use 2 instead of 9 in the example to keep the data volume down). I've written code that does this, but I don't really understand performance tuning in Hive, and a lot of what I learned in Oracle doesn't seem applicable.
Can this be improved?
The data is roughly:
CREATE TABLE LOG_TABLE
(SYSTEM_ID VARCHAR(1),
EVENT_TYPE VARCHAR(2),
EVENT_ID VARCHAR(3),
EVENT_DATE DATE,
UPTIME INT);
INSERT INTO LOG_TABLE
VALUES
('1','A1','138','2018-10-29',34),
('1','A2','146','2018-11-13',49),
('1','A3','140','2018-11-02',38),
('1','B1','130','2018-10-13',18),
('1','B1','150','2018-11-19',55),
('1','B2','137','2018-10-27',32),
('2','A1','128','2018-10-11',59),
('2','A1','131','2018-10-16',64),
('2','A1','136','2018-10-25',73),
('2','A2','139','2018-10-31',79),
('2','A2','145','2018-11-11',90),
('2','A2','147','2018-11-14',93),
('2','A3','135','2018-10-24',72),
('2','B1','124','2018-10-03',51),
('2','B1','133','2018-10-19',67),
('2','B2','134','2018-10-22',70),
('2','B2','142','2018-11-06',85),
('2','B2','148','2018-11-15',94),
('2','B2','149','2018-11-17',96),
('3','A2','127','2018-10-10',122),
('3','A3','123','2018-10-01',113),
('3','A3','125','2018-10-06',118),
('3','A3','126','2018-10-07',119),
('3','A3','141','2018-11-05',148),
('3','A3','144','2018-11-10',153),
('3','B1','132','2018-10-18',130),
('3','B1','143','2018-11-08',151),
('3','B2','129','2018-10-12',124);
and the code that works is as follows. I do a self join on the log table to return all pairs of records with the gap between them, keeping those with a gap of 30 days or less. I then select those with more than 2 events into a second CTE, and from these I count distinct event types and event IDs by system and uptime range:
WITH EVENTGAP AS
(SELECT T1.EVENT_TYPE,
T1.SYSTEM_ID,
T1.EVENT_ID,
T2.EVENT_ID AS EVENT_ID2,
T1.EVENT_DATE,
T2.EVENT_DATE AS EVENT_DATE2,
T1.UPTIME,
DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) AS EVENT_GAP
FROM LOG_TABLE T1
INNER JOIN LOG_TABLE T2
ON (T1.EVENT_TYPE=T2.EVENT_TYPE
AND T1.SYSTEM_ID=T2.SYSTEM_ID)
WHERE DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) BETWEEN 0 AND 30
AND T1.UPTIME BETWEEN 0 AND 299
AND T2.UPTIME BETWEEN 0 AND 330),
EVENTCOUNT
AS (SELECT EVENT_TYPE,
SYSTEM_ID,
EVENT_ID,
EVENT_DATE,
COUNT(1)
FROM EVENTGAP
GROUP BY EVENT_TYPE,
SYSTEM_ID,
EVENT_ID,
EVENT_DATE
HAVING COUNT(1)>2)
SELECT EVENTGAP.SYSTEM_ID,
CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END AS UPTIME_BAND,
COUNT(DISTINCT EVENTGAP.EVENT_ID2) AS EVENT_COUNT,
COUNT(DISTINCT EVENTGAP.EVENT_TYPE) AS TYPE_COUNT
FROM EVENTGAP
WHERE EVENTGAP.EVENT_ID IN (SELECT DISTINCT EVENTCOUNT.EVENT_ID FROM EVENTCOUNT)
GROUP BY EVENTGAP.SYSTEM_ID,
CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END
This gives the following result: unique counts of event IDs and event types that have 3 or more events falling in any rolling 30-day period. Some events may fall in more than one period but will only be counted once.
EVENTGAP.SYSTEM_ID  UPTIME_BAND  EVENT_COUNT  TYPE_COUNT
2                   50-99        10           3
3                   100-149      4            1
In both Hive and Oracle you would do this with window functions, using a window frame clause; the exact syntax differs between the two databases.
In Hive you can use RANGE BETWEEN if you convert event_date to a number. A typical method is to take the day difference from a fixed date (a day-number sketch follows the query below); another is to use Unix timestamps:
select lt.*
from (select lt.*,
             count(*) over (partition by system_id, event_type
                            order by unix_timestamp(event_date)
                            range between 60*60*24*30 preceding and current row
                           ) as rolling_count
      from log_table lt
     ) lt
where rolling_count > 2 -- or > 9 for the real threshold
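For the fixed-date method mentioned above, ordering by a day number lets the frame be expressed directly in days. A sketch of the same query using Hive's datediff against an arbitrary anchor date:
select lt.*
from (select lt.*,
             count(*) over (partition by system_id, event_type
                            order by datediff(event_date, '2000-01-01')
                            range between 30 preceding and current row
                           ) as rolling_count
      from log_table lt
     ) lt
where rolling_count > 2 -- or > 9 for the real threshold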

Count how many first and last entries in given period of time are equal

Given a table structured like that:
id | news_id (fkey) | status      | date
1  | 10             | PUBLISHED   | 2016-01-10
2  | 20             | UNPUBLISHED | 2016-01-10
3  | 10             | UNPUBLISHED | 2016-01-12
4  | 10             | PUBLISHED   | 2016-01-15
5  | 10             | UNPUBLISHED | 2016-01-16
6  | 20             | PUBLISHED   | 2016-01-18
7  | 10             | PUBLISHED   | 2016-01-18
8  | 20             | UNPUBLISHED | 2016-01-20
9  | 30             | PUBLISHED   | 2016-01-20
10 | 30             | UNPUBLISHED | 2016-01-21
I'd like to count the distinct news items that, in a given time period, had equal first and last statuses (and that status equal to the one given in the query).
So, for this table, a query over 2016-01-01 to 2016-02-01 would return:
1 with WHERE status = 'PUBLISHED', because news_id 10 had PUBLISHED in both its first (2016-01-10) and its last (2016-01-18) row
1 with WHERE status = 'UNPUBLISHED', because news_id 20 had UNPUBLISHED in both its first and its last row
Notice how news_id = 30 does not appear in the results, as its first and last statuses were contrary.
I have done that using following query:
SELECT count(*) FROM
(
SELECT DISTINCT ON (news_id)
news_id, status as first_status
FROM news_events
where date >= '2015-11-12 15:01:56.195'
ORDER BY news_id, date
) first
JOIN (
SELECT DISTINCT ON (news_id)
news_id, status as last_status
FROM news_events
where date >= '2015-11-12 15:01:56.195'
ORDER BY news_id, date DESC
) last
using (news_id)
where first_status = last_status
and first_status = 'PUBLISHED'
Now I have to translate the query for our internal Java framework, which unfortunately does not support subqueries except inside EXISTS or NOT EXISTS. I was told to transform the query into one using an EXISTS clause (if that is possible) or to find another solution, but I am clueless. Could anyone help me do that?
Edit: as I am being told right now, the problem lies not with our framework but with Hibernate; if I understood correctly, "you cannot join an inner select in HQL" (?)
Not sure if this addresses your problem correctly, since it is more of a workaround, but consider the following: news must be published before it can be unpublished. So if you add 1 for each PUBLISHED and subtract 1 for each UNPUBLISHED, the balance will be positive (1, to be exact) if the first and last statuses are both PUBLISHED. It will be 0 if there are as many UNPUBLISHED as PUBLISHED rows, and negative if there are more UNPUBLISHED than PUBLISHED, which logically cannot happen but might still show up here, since the query's date threshold can cut off a PUBLISHED row that occurred earlier.
You might use this query to find out:
SELECT news_id,
       SUM(CASE status WHEN 'PUBLISHED' THEN 1 ELSE -1 END) AS publishbalance
FROM news_events
WHERE date >= '2015-11-12 15:01:56.195'
GROUP BY news_id
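If the framework cannot wrap that in a subquery, a HAVING clause keeps everything in one statement: each returned row is a news item with a positive balance (first = last = PUBLISHED per the reasoning above), so the row count is the answer. A sketch against the question's news_events table:
SELECT news_id
FROM news_events
WHERE date >= '2015-11-12 15:01:56.195'
GROUP BY news_id
HAVING SUM(CASE status WHEN 'PUBLISHED' THEN 1 ELSE -1 END) > 0;  -- one row per qualifying news item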
First of all, subqueries are a substantial part of SQL; a framework forbidding their use is a bad framework.
However, "first" and "last" can be expressed with NOT EXISTS: there must not exist an earlier (or later) entry for the same news_id within the date range.
select count(*)
from mytable first
join mytable last on last.news_id = first.news_id
where first.date between #from and #to
and last.date between #from and #to
and not exists
(
    select *
    from mytable before_first
    where before_first.news_id = first.news_id
    and before_first.date < first.date
    and before_first.date >= #from
)
and not exists
(
    select *
    from mytable after_last
    where after_last.news_id = last.news_id
    and after_last.date > last.date
    and after_last.date <= #to
)
and first.status = #status
and last.status = #status;
NOT EXISTS to the rescue:
SELECT ff.id, ff.news_id, ff.status, ff.zdate AS startdate,
       ll.zdate AS enddate
FROM newsflash ff
JOIN newsflash ll
    ON ff.news_id = ll.news_id
    AND ff.status = ll.status
    AND ff.zdate < ll.zdate
WHERE ff.zdate >= '2016-01-01' AND ff.zdate < '2016-02-01'
AND ll.zdate >= '2016-01-01' AND ll.zdate < '2016-02-01'
AND NOT EXISTS (
    SELECT * FROM newsflash nx
    WHERE nx.news_id = ff.news_id
    AND nx.zdate >= '2016-01-01' AND nx.zdate < '2016-02-01'
    AND (nx.zdate < ff.zdate OR nx.zdate > ll.zdate)
)
ORDER BY ff.id
;

SQL DB2 - select records from either table

I have an order file with order ID and ship date. Orders can only be shipped Monday through Friday, which means no records exist for Saturday and Sunday.
I use the same order file to get all order dates, with the date in the same format (yyyymmdd).
I want to select a count of all the records from the order file based on order date, and (I believe) FULL OUTER JOIN (or maybe RIGHT JOIN?) the date file, because I would like to see:
20120330 293
20120331 0
20120401 0
20120402 920
20120403 430
20120404 827
etc...
However, my SQL statement is still not returning zero-count rows for the 31st and the 1st.
with DatesTable as (
select ohordt "Date" from kivalib.orhdrpf
where ohordt between 20120315 and 20120406
group by ohordt order by ohordt
)
SELECT ohscdt, count(OHTXN#) "Count"
FROM KIVALIB.ORHDRPF full outer join DatesTable dts on dts."Date" = ohordt
--/*order status = filled & order type = 1 & date between (some fill date range)*/
WHERE OHSTAT = 'F' AND OHTYP = 1 and ohscdt between 20120401 and 20120406
GROUP BY ohscdt ORDER BY ohscdt
Any ideas what I'm doing wrong?
Thanks!
It's because there is no data for those days, so they do not show up as rows. You can use a recursive CTE to build a contiguous list of dates between two values for the query to join against.
It will look something like:
WITH dates (val) AS (
    SELECT CAST('2012-04-01' AS DATE)
    FROM SYSIBM.SYSDUMMY1
    UNION ALL
    SELECT val + 1 DAYS
    FROM dates
    WHERE val < CAST('2012-04-06' AS DATE)
)
SELECT d.val AS "Date", COUNT(o.ohtxn#) AS "Count"
FROM dates AS d
LEFT JOIN KIVALIB.ORHDRPF AS o
    ON o.ohordt = INT(TO_CHAR(d.val, 'YYYYMMDD'))
    -- keep the filters in the ON clause: in the WHERE clause they would
    -- discard the unmatched (NULL) rows and turn the LEFT JOIN into an inner join
    AND o.ohstat = 'F'
    AND o.ohtyp = 1
GROUP BY d.val
ORDER BY d.val