SQL query to identify 0 AFTER a 1

Let's say I have two columns: Date and Indicator
Usually the indicator goes from 0 to 1 (when the data is sorted by date) and I want to be able to identify if it goes from 1 to 0 instead. Is there an easy way to do this with SQL?
I am already aggregating other fields in the same table. If I can add this as another aggregation (e.g., without using a separate WHERE clause or passing over the data a second time) that would be pretty awesome.
This is the phenomenon I want to catch:
Date Indicator
1/5/01 0
1/4/01 0
1/3/01 1
1/2/01 1
1/1/01 0

This isn't a Teradata-specific answer, but it can be done in normal SQL.
Assuming that the sequence is already 'complete' and x(n+1) can be derived from x(n), such as when the dates are sequential and all present:
SELECT curr.date -- the 0 on the day following the 1
FROM r curr
JOIN r prev
  -- join each day with the previous day
  ON curr.date = dateadd(d, 1, prev.date)
WHERE curr.indicator = 0
  AND prev.indicator = 1
YMMV on the ability of such a query to use indexes efficiently.
If the sequence is not complete, the same approach can be applied after generating a surrogate sequence which is well ordered and similarly 'complete'.
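A minimal sketch of that surrogate-sequence idea, assuming the same table r(date, indicator): ROW_NUMBER() produces a gapless ordering even when dates are missing, so the 'previous row' join can use rn instead of the date arithmetic.
WITH seq AS (
    SELECT date, indicator,
           ROW_NUMBER() OVER (ORDER BY date) AS rn
    FROM r
)
SELECT curr.date -- the 0 on the row following the 1
FROM seq curr
JOIN seq prev
  ON curr.rn = prev.rn + 1
WHERE curr.indicator = 0
  AND prev.indicator = 1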
This can also be done using correlated subqueries, each selecting the indicator of the 'previous max', but... ugh.

Joining the table against itself is quite generic, but most SQL dialects now support analytic functions. Ideally you could use LAG(), but Teradata seems to support the absolute minimum of these, and so they point you to SUM() combined with a ROWS ... PRECEDING window.
In any regard, this method avoids a potentially costly join and effectively deals with gaps in the data, whilst making maximum use of indexes.
SELECT *
FROM yourTable t
QUALIFY t.indicator < SUM(t.indicator) OVER (
    PARTITION BY t.somecolumn /* optional */
    ORDER BY t.Date
    ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
)
QUALIFY is a bit Teradata-specific, but slightly tidier than the alternative...
SELECT *
FROM
(
    SELECT
        *,
        SUM(t.indicator) OVER (
            PARTITION BY t.somecolumn /* optional */
            ORDER BY t.Date
            ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING
        ) AS previous_indicator
    FROM yourTable t
) lagged
WHERE lagged.indicator < lagged.previous_indicator
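For dialects that do support LAG(), the same test reads more directly. A minimal sketch, assuming the same yourTable and optional partition column as above:
SELECT *
FROM
(
    SELECT
        *,
        LAG(t.indicator) OVER (
            PARTITION BY t.somecolumn /* optional */
            ORDER BY t.Date
        ) AS previous_indicator
    FROM yourTable t
) lagged
WHERE lagged.indicator < lagged.previous_indicator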

Supposing you mean that you want to determine whether any row having 1 as its indicator value has an earlier Date than a row in its group having 0 as its indicator value, you can identify groups with that characteristic by including the appropriate extreme dates in your aggregate results:
SELECT
    ...
    MAX(CASE indicator WHEN 0 THEN Date END) AS last_ind_0,
    MIN(CASE indicator WHEN 1 THEN Date END) AS first_ind_1,
    ...
You then test whether first_ind_1 is less than last_ind_0, either in code or as another selection item.
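Put together, a minimal sketch of how the test might sit alongside the existing aggregations (group_col and the flag name are illustrative, not from the original query):
SELECT
    group_col,
    -- ... your existing aggregates ...
    CASE
        WHEN MIN(CASE indicator WHEN 1 THEN Date END)
           < MAX(CASE indicator WHEN 0 THEN Date END)
        THEN 1 ELSE 0
    END AS went_1_then_0
FROM yourTable
GROUP BY group_col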

Related

Executing an Aggregate function within a CASE without GROUP BY

I am trying to assign a specific code to a client based on the number of gifts that they have given in the past 6 months using a CASE. I am unable to use WITH (screenshot) due to the limitations of the software that I am creating the query in; it only allows plain SELECT statements. I am unsure how to get a distinct count from another table (transaction data) and use it as a parameter in the CASE I have currently built (based on my client information table). Does anyone know of any workarounds for this?

I am unable to GROUP BY clientID at the end of my query because not all of my columns are aggregates, and I only need the grouping by clientID for this particular WHEN branch of the CASE. I have looked into the OVER() clause, but I need the date range that I am evaluating to be dynamic (counting transactions over the last six months), and the number of rows included is variable, as the transaction count varies month to month. Also, the software that I am building this in does not recognize the PARTITION BY parameter of the OVER clause.
Any help would be great!
EDIT:
it is not letting me attach an image... -____- I have added the two sections of code that I am looking for assistance with!
WITH "6MonthGIftCount" (
"ConstituentID"
,"GiftCount"
)
AS (
SELECT COUNT(DISTINCT "GiftView"."GiftID" FROM "GiftView" WHERE MONTHS_BETWEEN("GiftView"."GiftDate", getdate()) <= 6 GROUP BY "GiftView"."ConstituentID")
SELECT...CASE
WHEN "6MonthGiftCount"."GiftCount" >= 4
THEN 'A010'
)
Perform your grouping/COUNT(1) in a subquery to obtain the total # of donations by ConstituentID, then JOIN this total into your main query that uses this new column to perform its CASE statement.
select
    hist.*,
    case when timesDonated > 5 then 'gracious donor'
         when timesDonated > 3 then 'repeated donor'
         when timesDonated >= 1 then 'donor'
         else null end as donorCode
from gifthistory hist
left join ( /* your grouping subquery here, pretending to be a new table */
    select
        personID,
        count(1) as timesDonated
    from gifthistory i
    where abs(months_between(giftDate, sysdate)) <= 6
    group by personID ) grp on hist.personID = grp.personID
order by 1;
Naturally, syntax will vary by DB; you didn't specify which one you're using, but you should be able to adapt this template to whichever you utilize. It works in both Oracle and SQL Server after tweaking the month calculation appropriately.
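If the tool truly only accepts a single SELECT (no CTE, no trailing GROUP BY), a correlated scalar subquery inside the CASE is another possibility. A sketch using the question's own function names; "ClientInfo" and gift_code are hypothetical stand-ins for the client information table and the output column:
SELECT
    c.*,
    CASE WHEN (SELECT COUNT(DISTINCT g."GiftID")
               FROM "GiftView" g
               WHERE g."ConstituentID" = c."ConstituentID"
                 AND ABS(MONTHS_BETWEEN(g."GiftDate", getdate())) <= 6) >= 4
         THEN 'A010'
    END AS gift_code
FROM "ClientInfo" c -- hypothetical client information table
The subquery is evaluated logically once per client row, so on large gift tables the JOIN version above is usually the safer choice.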

How to use COUNT with CASE WHEN

I'm a newbie to HiveSQL.
I have a raw table with 6 million records like this:
I want to count the number of IP_addresses accessing each Modem_id every week.
The result table I want will be like this:
I did it with a left join, and it worked. But since the join is time-consuming, I want to do it with a CASE WHEN statement instead - but I can't write a correct one. Do you have any ideas?
This is the join statement I used:
select a.modem_id,
       a.Number_of_IP_in_Day_1,
       b.Number_of_IP_in_Day_2
from
    (select modem_id,
            count(distinct ip_address) as Number_of_IP_in_Day_1
     from F_ACS_DEVICE_INFORMATION_NEW
     where day = 1
     group by modem_id) a
left join
    (select modem_id,
            count(distinct param_value) as Number_of_IP_in_Day_2
     from F_ACS_DEVICE_INFORMATION_NEW
     where day = 2
     group by modem_id) b
on a.modem_id = b.modem_id;
You can express your logic using just aggregation:
select a.modem_id,
       count(distinct case when day = 1 then ip_address end) as day_1,
       count(distinct case when day = 2 then ip_address end) as day_2
from F_ACS_DEVICE_INFORMATION_NEW a
group by a.modem_id;
You can obviously extend this for more days.
Note: As your question and code are written, this assumes that your base table has data for only one week. Otherwise, I would expect some date filtering. Presumably, that is what the _NEW suffix means on the table name.
Based on your question and further comments, you would like
The number of different IP addresses accessed by each modem
In counts by week (as columns) for 4 weeks
e.g., result would be 5 columns
modem_id
IPs_accessed_week1
IPs_accessed_week2
IPs_accessed_week3
IPs_accessed_week4
My answer here is based on knowledge of SQL - I haven't used Hive but it appears to support the things I use (e.g., CTEs). You may need to tweak the answer a bit.
The first key step is to turn the day_number into a week_number. A straightforward way to do this is FLOOR((day_num-1)/7)+1, so days 1-7 become week 1, days 8-14 become week 2, etc.
Note - it is up to you to make sure the day_nums are correct. I would guess you'd actually want info for the last 4 weeks, not the first four weeks of data - and as such you'd probably calculate the day_num as something like SELECT DATEDIFF(day, IP_access_date, CAST(getdate() AS date)) - whatever the equivalent is in Hive.
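In Hive, that calculation might look something like the following sketch; ip_access_date is a hypothetical column name, Hive's DATEDIFF takes the later date first, and the +1 makes today day 1 so the FLOOR formula above applies directly:
SELECT modem_id,
       ip_address,
       DATEDIFF(CURRENT_DATE, TO_DATE(ip_access_date)) + 1 AS day_num -- 1 = today
FROM F_ACS_DEVICE_INFORMATION_NEW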
There are a few ways to do this - I think the clearest is to use a CTE to convert your dataset to what you need e.g.,
convert day_nums to weeknums
get rid of duplicates within the week (your code has COUNT(DISTINCT ...) - I assume this is what you want) - I'm doing this with SELECT DISTINCT (rather than grouping by all fields)
From there, you could PIVOT the data to get it into your table, or just use SUM of CASE statements. I'll use SUM of CASE here as I think it's clearer to understand.
WITH IPs_per_week AS
(
    SELECT DISTINCT
        modem_id,
        ip_address,
        FLOOR((day - 1) / 7) + 1 AS week_num -- note: called day_num in the text above
    FROM F_ACS_DEVICE_INFORMATION_NEW
)
SELECT modem_id,
       SUM(CASE WHEN week_num = 1 THEN 1 ELSE 0 END) AS IPs_accessed_week1,
       SUM(CASE WHEN week_num = 2 THEN 1 ELSE 0 END) AS IPs_accessed_week2,
       SUM(CASE WHEN week_num = 3 THEN 1 ELSE 0 END) AS IPs_accessed_week3,
       SUM(CASE WHEN week_num = 4 THEN 1 ELSE 0 END) AS IPs_accessed_week4
FROM IPs_per_week
GROUP BY modem_id;
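Equivalently, the whole thing collapses to the conditional-aggregation pattern from the earlier answer, counting distinct IPs directly; a sketch under the same day-numbering assumption:
SELECT modem_id,
       COUNT(DISTINCT CASE WHEN FLOOR((day - 1) / 7) + 1 = 1 THEN ip_address END) AS IPs_accessed_week1,
       COUNT(DISTINCT CASE WHEN FLOOR((day - 1) / 7) + 1 = 2 THEN ip_address END) AS IPs_accessed_week2,
       COUNT(DISTINCT CASE WHEN FLOOR((day - 1) / 7) + 1 = 3 THEN ip_address END) AS IPs_accessed_week3,
       COUNT(DISTINCT CASE WHEN FLOOR((day - 1) / 7) + 1 = 4 THEN ip_address END) AS IPs_accessed_week4
FROM F_ACS_DEVICE_INFORMATION_NEW
GROUP BY modem_id;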

Impala get the difference between 2 dates excluding weekends

I'm trying to get the day difference between 2 dates in Impala but I need to exclude weekends.
I know it should be something like this but I'm not sure how the weekend piece would go...
DATEDIFF(resolution_date,created_date)
Thanks!
One approach to such a task is to enumerate each and every day in the range, and then filter out the weekends before counting.
Some databases have specific features to generate date series, while others offer recursive common table expressions. Impala does not support recursive queries, so we need to look at alternative solutions.
If you have a table with at least as many rows as the maximum number of days in a range, you can use row_number() to offset the starting date, and then conditional aggregation to count working days.
Assuming that your table is called mytable, with column id as primary key, and that the big table is called bigtable, you would do:
select
    t.id,
    sum(
        case when dayofweek(date_add(t.created_date, n.rn)) between 2 and 6
             then 1 else 0 end
    ) as no_days
from mytable t
inner join (select row_number() over (order by 1) - 1 as rn from bigtable) n
    on t.resolution_date > date_add(t.created_date, n.rn)
group by t.id
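If no suitably large table exists, a cross join of a small inline digits view can stand in for bigtable. A sketch producing rn = 0..99, assuming Impala's VALUES syntax (column aliases go on the first row):
select d1.x * 10 + d2.x as rn
from (values (0 as x), (1), (2), (3), (4), (5), (6), (7), (8), (9)) d1
cross join (values (0 as x), (1), (2), (3), (4), (5), (6), (7), (8), (9)) d2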

Why Window Functions Require My Aggregated Column in Group

I have been working with window functions a fair amount but I don't think I understand enough about how they work to answer why they behave the way they do.
For the query that I was working on (below), why am I required to take my aggregated field and add it to the group by? (In the second half of my query below I am unable to produce a result if I don't include "Events" in my second group by)
With Data as (
    Select
        CohortDate as month
        ,datediff(week, CohortDate, EventDate) as EventAge
        ,count(distinct case when EventDate is not null then GUID end) as Events
    From MyTable
    where month >= [getdate():month] - interval '12 months'
    group by 1, 2
    order by 1, 2
)
Select
    month
    ,EventAge
    ,sum(Events) over (partition by month order by SubAge asc rows between unbounded preceding and current row) as TotEvents
from data
group by 1, 2, Events
order by 1, 2
I have run into this enough that I have just taken it for granted, but would really love some more color as to why this is needed. Is there a way I should be formatting these differently in order to avoid this (somewhat non-intuitive) requirement?
Thanks a ton!
What you are looking for is presumably a cumulative sum. That would be:
select month, EventAge,
       sum(sum(Events)) over (partition by month
                              order by SubAge asc
                              rows between unbounded preceding and current row
                             ) as TotEvents
from data
group by 1, 2
order by 1, 2;
Why? That might be a little hard to explain. Perhaps if you see the equivalent version with a subquery it will be clearer:
select me.*,
       sum(sum_events) over (partition by month
                             order by SubAge asc
                             rows between unbounded preceding and current row
                            ) as TotEvents
from (select month, EventAge, sum(events) as sum_events
      from data
      group by 1, 2
     ) me
order by 1, 2;
This is pretty much an exact shorthand for that query. The window function is evaluated after aggregation. You want to sum the SUM of the events after the aggregation. Hence, you need sum(sum(events)). After the aggregation, events is no longer available.
The nesting of aggregation functions is awkward at first -- at least it was for me. When I first started using window functions, I think I first spent a few days writing aggregation queries using subqueries and then rewriting without the subqueries. Quickly, I got used to writing them without subqueries.

Redshift: Find MAX in list disregarding non-incremental numbers

I work for a sports film analysis company. We have teams with unique team IDs and I would like to find the number of consecutive weeks they have uploaded film to our site moving backwards from today. Each upload also has its own row in a separate table that I can join on teamid and has a unique date of when it was uploaded. So far I put together a simple query that pulls each unique DATEDIFF(week) value and groups on teamid.
Select teamid, MAX(weekdiff)
from
    (Select teamid, DATEDIFF(week, dateuploaded, GETDATE()) as weekdiff
     from leroy_events
     group by teamid, weekdiff) t
group by teamid
What I am given is a list of teamIDs and unique weekly date differences. I would like to then find the max for each teamID without breaking an increment of 1. For example, if my data set is:
Team datediff
11453 0
11453 1
11453 2
11453 5
11453 7
11453 13
I would like the max value for team: 11453 to be 2.
Any ideas would be awesome.
I have simplified your example assuming that I already have a table with weekdiff column. That would be what you're doing with DATEDIFF to calculate it.
First, I'm using LAG() window function to assign previous value (in ordered set) of a weekdiff to the current row.
Then, using a WHERE condition, I'm retrieving the max(weekdiff) value whose previous value is current_value - 1, i.e. part of a consecutive run of weekdiffs.
Data:
create table leroy_events ( teamid int, weekdiff int);
insert into leroy_events values (11453,0),(11453,1),(11453,2),(11453,5),(11453,7),(11453,13);
Code:
WITH initial_data AS (
    SELECT
        teamid,
        weekdiff,
        lag(weekdiff, 1) over (partition by teamid order by weekdiff) as lag_weekdiff
    FROM leroy_events
)
SELECT
    teamid,
    max(weekdiff) AS max_weekdiff_consecutive
FROM initial_data
WHERE weekdiff = lag_weekdiff + 1 -- this ensures retrieving max() without breaking the consecutive increment
GROUP BY 1
SQLFiddle with your sample data to see how this code works.
Result:
teamid max_weekdiff_consecutive
11453 2
You can use SQL window functions to probe relationships between rows of the table. In this case the lag() function can be used to look at the previous row relative to a given order and grouping. That way you can determine whether a given row is part of a group of consecutive rows.
Overall, you still need to aggregate or filter to reduce the number of rows for each group of interest (i.e. each team) to 1. It's convenient in this case to aggregate. It might look like this:
select
    team,
    case min(datediff)
        when 0 then max(datediff)
        else -1
    end as max_weeks
from (
    select
        team,
        datediff,
        case
            when (lag(datediff) over (partition by team order by datediff) != datediff - 1)
                then 0
            else 1
        end as is_consec
    from diffs
) cd
where is_consec = 1
group by team
The inline view just adds an is_consec column to the data, marking whether each row is part of a group of consecutive rows. The outer query filters on that column (you cannot filter directly on a window function), and chooses the maximum datediff from the remaining rows for each team.
There are a few subtleties there:
The case expression in the inline view is written as it is to exploit the fact that the lag() computed for the first row of each partition will be NULL, which does not evaluate unequal (nor equal) to any value. Thus the first row in each partition is always marked consecutive.
The case testing min(datediff) in the outer select clause picks up teams that have no record with datediff = 0, and assigns -1 to column max_weeks for them.
It would also have been possible to mark rows non-consecutive if the first in their group did not have datediff = 0, but then you would lose such teams from the results altogether.
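A sketch of that variation: a MIN() window over the whole partition marks every row of a team non-consecutive when the team has no datediff = 0 at all, so such teams drop out of the results entirely instead of appearing as -1.
select
    team,
    max(datediff) as max_weeks
from (
    select
        team,
        datediff,
        case
            when min(datediff) over (partition by team) != 0 then 0 -- group never starts at 0
            when lag(datediff) over (partition by team order by datediff) != datediff - 1 then 0
            else 1
        end as is_consec
    from diffs
) cd
where is_consec = 1
group by team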