Get consecutive days of rows in a table - SQL

I am trying to create a table that stores information about tables failing quality tests.
My goal is to build a table listing the tables that are currently failing some sort of test for more than 2 days IN A ROW. However, I am unable to handle all the possible patterns of entries in the logs table I am working from.
Here's the query I am using:
SELECT run_time,
       full_table_name,
       expectation_type,
       meta,
       expectation_columns,
       origin_type,
       MAX(consecutive_days) AS consecutive_days
FROM (
    WITH a AS (
        SELECT full_table_name,
               expectation_type,
               date_trunc('day', run_time) AS run_time,
               origin_type,
               meta,
               expectation_columns
        FROM MY_TABLE
        WHERE is_successful = false
          AND origin_type = 'DWH_TESTS'
        GROUP BY 1, 2, 3, 4, 5, 6
    )
    SELECT full_table_name,
           expectation_type,
           meta,
           expectation_columns,
           origin_type,
           run_time,
           ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type ORDER BY run_time) AS consecutive_days
    FROM a
)
GROUP BY full_table_name,
         expectation_type,
         meta,
         expectation_columns,
         origin_type,
         run_time
HAVING MAX(consecutive_days) > 2
ORDER BY run_time DESC
The problem is that it counts all rows with failed results; I need the count of consecutive failing days only.
Meaning:
If a table has been failing the same expectation day after day, I want it returned as a single row with the start date of that failing streak, relative to today. Once the issue is resolved, I don't want it to keep appearing in the query results (unless it later crosses the two-day threshold again).
I've been trying to solve this for a few days now, without success.
Thanks in advance!
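For what it's worth, the standard way to express this kind of consecutive-day requirement is the "gaps and islands" trick: give each failing day a streak anchor (the day minus its row number, so consecutive days share the same anchor), then keep only streaks longer than two days whose latest day is still current. Below is a minimal, untested sketch against the column names from the question, assuming a Snowflake/Redshift-style dateadd (in PostgreSQL you would subtract row_number() * interval '1 day' instead); extend the partitioning and grouping with meta, expectation_columns, etc. if those distinguish separate tests.
WITH daily_failures AS (
    -- one row per table/expectation/day that failed
    SELECT DISTINCT
        full_table_name,
        expectation_type,
        date_trunc('day', run_time) AS fail_day
    FROM MY_TABLE
    WHERE is_successful = false
      AND origin_type = 'DWH_TESTS'
), islands AS (
    SELECT
        full_table_name,
        expectation_type,
        fail_day,
        -- consecutive days share the same anchor: the day minus its row number
        dateadd('day',
                -ROW_NUMBER() OVER (PARTITION BY full_table_name, expectation_type ORDER BY fail_day),
                fail_day) AS streak_anchor
    FROM daily_failures
)
SELECT
    full_table_name,
    expectation_type,
    MIN(fail_day) AS streak_start,
    COUNT(*)      AS consecutive_days
FROM islands
GROUP BY full_table_name, expectation_type, streak_anchor
HAVING COUNT(*) > 2
   -- keep only streaks that are still open; adjust the cutoff to when your tests run
   AND MAX(fail_day) >= dateadd('day', -1, CURRENT_DATE)
ORDER BY streak_start;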

Related

Building a monthly Snapshot VIEW from historical data

I have historical data that looks like the following:
I need a DB VIEW showing how each of these Opportunities looked at the end of each month, e.g.
What is the best way to implement this as a Snowflake VIEW?
Well, there are two parts: the SQL that goes in the view, and the type of view to use.
The SQL appears to be "the last value of the month" per opportunity, where deleted is not true, which can be done a couple of ways.
Off the top of my head, "the most recent value of the month" sounds like a WHERE-style filter against ROW_NUMBER:
SELECT last_modified_date,
       opportunity,
       amount
FROM table
WHERE deleted = false
  AND ROW_NUMBER() OVER (PARTITION BY opportunity, date_trunc('month', last_modified_date) ORDER BY last_modified_date DESC) = 1
This will keep the values you want; then we need to shift last_modified_date to the last day of the month, which can be done via
SELECT dateadd('day', -1 ,dateadd('month', 1, date_trunc('month', last_modified_date))) as month_end
which can be combined as:
SELECT dateadd('day', -1, dateadd('month', 1, date_trunced)) AS month_end,
       opportunity,
       amount
FROM (
    SELECT date_trunc('month', last_modified_date) AS date_trunced,
           opportunity,
           amount,
           ROW_NUMBER() OVER (PARTITION BY opportunity, date_trunced ORDER BY last_modified_date DESC) AS rn
    FROM table
    WHERE deleted = false
)
WHERE rn = 1
ORDER BY 1, 2;
Then, if you have a few million rows, you could just create a normal view. BUT if you have billions of rows and years of data, the prior months are going to be rather expensive to churn over, so a materialized view would be helpful. However, I am fairly sure you cannot use ROW_NUMBER in a materialized view, so you could manually build a "historic months" table and a "current months" view, union those together, and on some cadence (say, monthly) roll the current period's data into the history table as it becomes stable.
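To make that last idea concrete, here is a rough, hypothetical sketch of the split; opportunity_month_end_hist and opportunity_month_end are made-up names (not from the original question), and the details depend on how you roll data into history:
CREATE VIEW opportunity_month_end AS
-- stable, pre-computed prior months, refreshed on a schedule
SELECT month_end, opportunity, amount
FROM opportunity_month_end_hist
UNION ALL
-- the still-changing current month, computed on the fly
SELECT dateadd('day', -1, dateadd('month', 1, date_trunc('month', last_modified_date))) AS month_end,
       opportunity,
       amount
FROM (
    SELECT last_modified_date,
           opportunity,
           amount,
           ROW_NUMBER() OVER (PARTITION BY opportunity, date_trunc('month', last_modified_date)
                              ORDER BY last_modified_date DESC) AS rn
    FROM table
    WHERE deleted = false
      AND last_modified_date >= date_trunc('month', CURRENT_DATE)  -- current month only
)
WHERE rn = 1;
On whatever cadence you choose (say, at month close), you would insert the newly finalized month into opportunity_month_end_hist, so the on-the-fly part of the view only ever scans the current month.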

SQL question: count of occurrence greater than N in any given hour

I'm looking through login logs (in Netezza) and trying to find users who have more than a certain number of logins in any 1-hour period (any consecutive 60-minute window, as opposed to strictly a clock hour) since December 1st. I've looked at the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of timestamps and check whether the record that came 19 entries earlier is within one hour of the current one:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output shows the matching users along with the first time at which they logged in for the twentieth time within one hour.
I think you could do something like this (for the sake of simplicity I'll use a login table with just user and datetime columns):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to fill in your actual columns (user, datetime).
It is also up to you to find the right date-arithmetic facilities (ua.datetime + 1 hour won't work as written); this is largely dependent on the DB implementation, for example it is DATE_ADD in MySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp).
Because of the subquery (select count(*) ...), the whole query will not be the fastest: it is a correlated subquery, so it has to be re-evaluated for each row.
The WITH is simply there to compute a subset of user_logons and minimize its cost. It might not help much, but it does reduce the complexity of the query.
You might get better performance using a stored function or a function written in a host language (e.g. Java, PHP, ...).
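For illustration only, here is the same query with the date arithmetic spelled out using an interval literal; this is PostgreSQL-style syntax that Netezza largely inherits, but verify it against your platform before relying on it:
with connections as (
    select ua.user
         , ua.datetime
    from user_logons ua
    where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
     , ua.datetime
     , (select count(*)
        from connections ut
        where ut.user = ua.user
          -- interval literal assumed; swap in your dialect's date-add function if needed
          and ut.datetime between ua.datetime and ua.datetime + interval '1 hour'
       ) as consecutive_logons
from connections ua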

How do I design a SQL query that will show me all users who visited at least one page for the last 20 out of 24 hours?

In order to identify human traffic (as opposed to crawlers, bots, etc.), I would like to design an SQL query that identifies all unique visitor IDs that have visited websites in the last 20 of 24 hours (as most humans would not be browsing for that long). I believe I understand how I want to structure it: "how many UNIQUE hours have any activity for each visitor in the past 24 hours, and WHERE 20 hours have at least some activity".
While the specifics of such a query would depend on the tables involved, I'm having trouble understanding whether my structure is on the right track:
SELECT page_url, affinity, num
FROM (
SELECT AGG GROUP BY visitor_id, pages.page_url, max(v.max_affinity) as affinity, COUNT(*) as num, Row_Number()
OVER (Partition By v.visitor_id ORDER BY COUNT(visitor_id) DESC) AS RowNumber
FROM audience_lab_active_visitors v
SELECT pages ON pages.p_date >= '2017-09-14'
WHERE v.p_date='2017-09-14'
GROUP BY v.vispage_visitors, pages.page_url
) tbl WHERE RowNumber < 20
I don't believe your query is valid SQL, but I have an idea of what you're trying to accomplish. Rather than use a static date, I filtered by the past 24 hours and truncated the current timestamp to the hour, otherwise the query would be considering 25 unique hours. I also removed page_url from the query since it didn't seem relevant to the results based on what you're trying to solve.
For each visitor_id, the query counts the number of unique hours recorded, based on the column (timestamp_col in this example) that records the timestamp of the page view. HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20 returns those you've identified as humans, meaning they visited the website during at most 19 of the past 24 hours.
SELECT
visitor_id,
COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) AS num,
MAX(v.max_affinity) AS affinity
FROM audience_lab_active_visitors AS v
JOIN pages AS p ON v.page_url = p.page_url
WHERE
v.p_date >= DATE_TRUNC('hour', CURRENT_TIMESTAMP) - INTERVAL '24' hour
GROUP BY 1
HAVING COUNT(DISTINCT DATE_TRUNC('hour', timestamp_col)) < 20;

How to use window functions to get metrics for today, last 7 days, last 30 days for each value of the date?

My problem seems simple on paper:
For a given date, give me the active users for that date, the active users over the previous 7 days (given_date() - 7), and the active users over the previous 30 days (given_date() - 30).
Here is some sample data:
"timestamp" "user_public_id"
"23-Sep-15" "805a47023fa611e58ebb22000b680490"
"28-Sep-15" "d842b5bc5b1711e5a84322000b680490"
"01-Oct-15" "ac6b5f70b95911e0ac5312313d06dad5"
"21-Oct-15" "8c3e91e2749f11e296bb12313d086540"
"29-Nov-15" "b144298810ee11e4a3091231390eb251"
For 01-Oct the count for today would be 1, last_7_days would be 3, and last_30_days would be 3+n (where n is the count of user_ids falling on dates that precede Oct 1st within a 30-day window).
I am on Amazon Redshift. Can somebody provide a sample SQL query to help me get started?
The output should look like this:
"timestamp" "users_today", "users_last_7_days", "users_30_days"
"01-Oct-15" 1 3 (3+n)
I know posting incomplete solutions is frowned upon, but this question is not getting any other attention, so I thought I would do my bit.
I have been pulling my hair out trying to nut this one out; alas, I am a beginner and something is not clicking for me. Perhaps you or others will be able to drastically improve my answer, but I think I am on the right track.
SELECT replace(convert(varchar, [timestamp], 111), '/','-') AS [timestamp], -- to get date in same format as you require
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE ([TIMESTAMP]) = ([timestamp])) AS users_today,
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY,-7,[TIMESTAMP]) AND [TIMESTAMP]) AS users_last_7_days ,
(SELECT COUNT([TIMESTAMP]) FROM #SIMPLE WHERE [TIMESTAMP] BETWEEN DATEADD(DY,-30,[TIMESTAMP]) AND [timestamp]) AS users_last_30_days
FROM #SIMPLE
GROUP BY [timestamp]
Starting with this:
CREATE TABLE #SIMPLE (
[timestamp] datetime, user_public_id varchar(32)
)
INSERT INTO #SIMPLE
VALUES('23-Sep-15','805a47023fa611e58ebb22000b680490'),
('28-Sep-15','d842b5bc5b1711e5a84322000b680490'),
('01-Oct-15','ac6b5f70b95911e0ac5312313d06dad5'),
('21-Oct-15','8c3e91e2749f11e296bb12313d086540'),
('29-Nov-15','b144298810ee11e4a3091231390eb251')
The problem I am having is that each row contains the same counts, despite my grouping by [timestamp].
Step 1-- Create a table which has daily counts.
create temp table daily_mobile_Sessions as
select "timestamp" ,
count(user_public_id) over (partition by "timestamp" ) as "today"
from mobile_sessions
group by 1, mobile_sessions.user_public_id
order by 1 DESC
Step 2 -- From the table above, we create yet another query that uses the "today" field and applies window functions to sum the counts.
select "timestamp", today,
sum(today) over (order by "timestamp" rows between 6 PRECEDING and CURRENT ROW) as "last_7days",
sum(today) over (order by "timestamp" rows between 29 PRECEDING and CURRENT ROW) as "last_30days"
from daily_mobile_Sessions group by "timestamp" , 2 order by 1 desc

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: If I add GROUP EACH BY (and drop the outer ORDER BY), the query still fails, this time claiming that GROUP EACH BY cannot be parallelized here.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
That runs too, now with 57,862 different groups.
I tried different combinations to reproduce the same error. I was able to get the same error as yours by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? Since you are already partitioning the percentile_rank by day, could you add an additional filter so that only a fraction of the days is analyzed (for example, only the last month)?
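For example, the extra filter could be pushed into the inner query so the ranking only has to run over recent days. This is just a sketch with an arbitrary cutoff date, assuming timestamp is stored in microseconds as the UTC_USEC_TO_DAY call suggests:
SELECT day, AVG(value)/(1024*1024) FROM (
  SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
    PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
  FROM [Datastore.PerformanceDatum]
  WHERE type = "MemoryPerf"
    AND timestamp >= PARSE_UTC_USEC("2014-05-01 00:00:00")  -- example cutoff: only analyze recent days
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;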