A SQL query to work on a table with subgroups

I have a large Oracle DB table which contains nearly 200 million rows. It has only three columns: a subscriber id, a date, and an offer id.
For each row in this table, I need to find whether this row has any corresponding rows in the table such that:
1) They belong to the same subscriber (same subscriber id)
2) They lie within a certain distance in the past of the current row (for example, if our current row is A, a row B with the same subscriber id should satisfy A.date > B.date >= A.date - 30 (days))
3) In addition to 2), we also have to filter on a specific offer id: (A.date > B.date >= A.date - 30 AND B.offerid = some_id)
I am aware of the Oracle analytic functions LAG and LEAD and I planned to use them for this purpose. These functions return field values from rows above or below the current row in the ordered table, according to some given ordering. The trouble is that the number of rows sharing the same subscriber id varies, up to 84. If I ORDER BY (SUBSCRIBER_ID, DATE) and use LAG, then for each row I would need to check up to 84 rows above the current one just to make sure they share the same SUBSCRIBER_ID. Since some subscriber id subgroups have only 3-4 rows, this amount of unnecessary row access is wasteful.
How can I accomplish this without having to check 84 rows each time, for each row? Does Oracle support any method that works solely on subgroups, like those generated by a GROUP BY?

One option is to use a self-join like this:
SELECT t1.*, NVL2(t2.subscriber_id, 'Yes', 'No') AS match_found
FROM myTable t1
LEFT JOIN myTable t2
  ON t1.subscriber_id = t2.subscriber_id
  AND t1.date > t2.date AND t2.date >= t1.date - 30
  AND t2.offerid = <filter_offer_id>
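As a quick sanity check of the self-join idea, here is a small runnable sketch using SQLite through Python. The table name, column names, sample data, and the filtered offer id are all made up; Oracle's NVL2 is replaced by a CASE expression, which SQLite understands.

```python
import sqlite3

# Hypothetical sample data: schema and offer id are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myTable (subscriber_id INTEGER, dt INTEGER, offerid INTEGER);
INSERT INTO myTable VALUES (1, 100, 5), (1, 120, 7), (2, 200, 5);
""")
# Self-join: for each row, look for an earlier row (within 30 days) from the
# same subscriber carrying the filtered offer id.
rows = conn.execute("""
SELECT t1.subscriber_id, t1.dt, t1.offerid,
       CASE WHEN t2.subscriber_id IS NOT NULL THEN 'Yes' ELSE 'No' END AS match_found
FROM myTable t1
LEFT JOIN myTable t2
  ON  t1.subscriber_id = t2.subscriber_id
  AND t1.dt > t2.dt AND t2.dt >= t1.dt - 30
  AND t2.offerid = 5
ORDER BY t1.subscriber_id, t1.dt
""").fetchall()
for r in rows:
    print(r)
```

One caveat: if several earlier rows match, the LEFT JOIN duplicates the current row; wrapping the check in an EXISTS subquery avoids that.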

Actually, Oracle's analytic COUNT(*) function did the necessary work for me. I used the following structure:
SELECT
  SUBSCRIBER_ID,
  SEGMENTATION_DATE,
  OFFER_ID,
  COUNT(*) OVER (PARTITION BY SUBSCRIBER_ID ORDER BY SEGMENTATION_DATE RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS SENDEVER,
  COUNT(*) OVER (PARTITION BY SUBSCRIBER_ID ORDER BY SEGMENTATION_DATE RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING) AS SEND1M,
  COUNT(CASE WHEN OFFER_ID = 580169 THEN 1 END) OVER (PARTITION BY SUBSCRIBER_ID ORDER BY SEGMENTATION_DATE RANGE BETWEEN 180 PRECEDING AND 1 PRECEDING) AS SEND6M580169
FROM myTable
PARTITION BY groups the table by the SUBSCRIBER_ID field, and with appropriate RANGE BETWEEN clauses on each group's rows I pick only the ones whose dates fall in the desired time interval.
A CASE WHEN expression on the OFFER_ID field further filters the rows within the current SUBSCRIBER_ID group, discarding all rows with a non-matching offer id.
The nice thing is that no self-join is needed, which reduces the cost of the operation by an order of magnitude.
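The windowed COUNT(*) with RANGE frames can be tried out in SQLite (3.28 or later) as well. This sketch stores dates as integer day numbers so the frame arithmetic is plain subtraction, whereas Oracle DATE columns support interval arithmetic directly; the data and aliases are illustrative.

```python
import sqlite3  # stdlib; RANGE frames with offsets need SQLite 3.28+

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myTable (subscriber_id INTEGER, seg_day INTEGER, offer_id INTEGER);
INSERT INTO myTable VALUES
  (1, 100, 580169), (1, 110, 111), (1, 160, 580169), (2, 100, 111);
""")
# sendever: all earlier rows for the subscriber; send30d: earlier rows
# within the last 30 days. The current row is excluded by "1 PRECEDING".
rows = conn.execute("""
SELECT subscriber_id, seg_day, offer_id,
       COUNT(*) OVER (PARTITION BY subscriber_id ORDER BY seg_day
                      RANGE BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS sendever,
       COUNT(*) OVER (PARTITION BY subscriber_id ORDER BY seg_day
                      RANGE BETWEEN 30 PRECEDING AND 1 PRECEDING) AS send30d
FROM myTable
ORDER BY subscriber_id, seg_day
""").fetchall()
for r in rows:
    print(r)
```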

Related

Select max value of each group using partition by

I have the following code which is taking a very long time to execute. What I need to do is select the rows having row number equal to 1 after partitioning by three columns (col_1, col_2, col_3), which are also the key columns, and ordering by some columns as shown below. The table has around 90 million records. Am I following the best approach, or is there a better one?
with cte as (
    SELECT b.*,
           ROW_NUMBER() OVER (PARTITION BY col_1, col_2, col_3
                              ORDER BY new_col DESC, new_col_2 DESC, new_col_3 DESC) AS ROW_NUMBER
    FROM (
        SELECT *,
               CASE WHEN update_col = ' ' THEN new_update_col ELSE update_col END AS new_col_1
        FROM schema_name.table_name
    ) b
)
select top 10 * from cte WHERE ROW_NUMBER = 1
Currently you are applying the CASE expression to every row in the table, and CASE with a string comparison is a relatively costly operation.
In the end you keep only the records with ROW_NUMBER = 1. If that filter discards a large share of your records, the query should run faster if you filter first (generate the row number and keep only rows with ROW_NUMBER = 1) and then apply the CASE expression to the surviving rows.
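To illustrate the filter-first rewrite, here is a small SQLite sketch (table name, column names, and data are invented). Whether this actually helps depends on the optimizer, but structurally the CASE expression now applies only to rows that survive the ROW_NUMBER() = 1 filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (col_1 INT, col_2 INT, ord INT, update_col TEXT, new_update_col TEXT);
INSERT INTO t VALUES
  (1, 1, 5, ' ', 'fallback'), (1, 1, 3, 'real', 'x'), (2, 1, 9, 'keep', 'y');
""")
# Rank first on base columns, filter to rn = 1, then apply CASE.
rows = conn.execute("""
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY col_1, col_2 ORDER BY ord DESC) AS rn
  FROM t
)
SELECT col_1, col_2,
       CASE WHEN update_col = ' ' THEN new_update_col ELSE update_col END AS new_col_1
FROM ranked
WHERE rn = 1
ORDER BY col_1
""").fetchall()
print(rows)
```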

SQL Server - count distinct over function or row_number with rows window function

I am currently trying to get a distinct count of customers over a 90-day rolling period. I got the amount using SUM with OVER (PARTITION BY ...). However, SQL Server does not support COUNT(DISTINCT ...) as a window function.
I have also attempted to use ROW_NUMBER() with OVER (PARTITION BY ...) and a frame of the current row and the preceding 89 rows, but ranking functions don't accept a window frame either.
Would greatly appreciate any suggested work around to resolve this problem.
I have attempted to solve the problem using 2 approaches, both which have failed based on the limitations outlined above.
Approach 1
select date
,count(distinct(customer_id)) over partition () order by date rows current row and 89 preceding as cust_count_distinct
from table
Approach 2
select date
,customer_id
,row_number() over partition (customer_id) order by date rows current row and 89 preceding as rn
from table
-- was then going to filter for rn = '1' but the rows functionality not possible with ranking function windows.
The simplest method is a correlated subquery of some sort:
select d.date, c.nt
from (select distinct date from t) d cross apply
(select count(distinct customerid) as cnt
from t t2
where t2.date >= dateadd(day, -89, d.date) and
t2.date <= d.date
) c;
This is not particularly efficient (it can be a killer even on a medium-sized data set), but it might serve your needs.
You can restrict the dates being returned to test to see if it works.
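For reference, an equivalent of the CROSS APPLY query can be sketched in SQLite (which has no APPLY) using a correlated scalar subquery. Dates are integer day numbers here and the data is made up, so the 90-day window is plain subtraction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (dt INTEGER, customer_id INTEGER);
INSERT INTO t VALUES (1, 10), (1, 11), (50, 10), (200, 12);
""")
# For each distinct date, count distinct customers seen in the 90-day
# window ending on that date (the date itself plus the 89 days before it).
rows = conn.execute("""
SELECT d.dt,
       (SELECT COUNT(DISTINCT t2.customer_id)
        FROM t t2
        WHERE t2.dt >= d.dt - 89 AND t2.dt <= d.dt) AS cnt
FROM (SELECT DISTINCT dt FROM t) d
ORDER BY d.dt
""").fetchall()
print(rows)
```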

Populate blank values in a Field with number from last populated value

There is no specific number of blank values; it can be none or many. Here is the current result, with the blank cells to be populated:
You can use analytic functions. I think this will work:
select t.*, coalesce(coil, lag(coil ignore nulls) over (order by datetime))
from t;
I know Oracle has supported IGNORE NULLS for a long, long time; I don't remember off-hand whether ancient versions supported it.
The approach below should work (or will hopefully give you enough to go on). The idea is to join the table to itself, matching each row you want to update with the most recent earlier row in which the column to fill is not NULL.
SELECT YT1.ID, YT2.COIL
FROM Your_Table YT1
INNER JOIN Your_Table YT2
  ON YT2.ID = (SELECT TOP 1 ID
               FROM Your_Table
               WHERE [start_date] < YT1.[start_date]
                 AND COIL IS NOT NULL
               ORDER BY [start_date] DESC)
WHERE YT1.COIL IS NULL OR LEN(YT1.COIL) = 0
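SQLite supports neither TOP 1 nor IGNORE NULLS, but the same fill-forward can be emulated with a correlated subquery that grabs the latest earlier non-NULL value; sample data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, coil TEXT);
INSERT INTO t VALUES (1, 'A'), (2, NULL), (3, NULL), (4, 'B'), (5, NULL);
""")
# COALESCE keeps the value when present; otherwise the subquery pulls the
# most recent earlier non-NULL coil (ordering by id as a stand-in for time).
rows = conn.execute("""
SELECT id,
       COALESCE(coil,
                (SELECT t2.coil FROM t t2
                 WHERE t2.id < t.id AND t2.coil IS NOT NULL
                 ORDER BY t2.id DESC LIMIT 1)) AS coil_filled
FROM t ORDER BY id
""").fetchall()
print(rows)
```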

How to count rows in SQL Server 2012?

I am trying to find out whether a person (id = A3) was continuously active in a program for at least five months in a given year (2013). Any suggestion would be appreciated. My data looks like this:
You simply use group by and a conditional expression:
select id,
(case when count(ActiveMonthYear) >= 5 then 'YES!' else 'NAW' end)
from table t
where ListOfTheMonths between '201301' and '201312'
group by id;
EDIT:
I suppose "continuously" means more than just any five months. For that, there are various ways; I like the difference-of-row-numbers approach:
select distinct id
from (select t.*,
(row_number() over (partition by id order by ListOfTheMonths) -
count(ActiveMonthYear) over (partition by id order by ListOfTheMonths)
) as grp
from table t
where ListOfTheMonths between '201301' and '201312'
) t
where ActiveMonthYear is not null
group by id, grp
having count(*) >= 5;
The difference in the subquery is constant for groups of consecutive active months, so it can be used as a grouping key. The result is a list of all ids that meet the criterion. You can add a filter for a particular id (do it in the subquery).
By the way, this is written using SELECT DISTINCT together with GROUP BY, one of the rare cases where the two are appropriately used together: a single id could have two periods of five active months in the same year, and there is no reason to include that person twice in the result set.
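The difference-of-row-numbers trick can be checked end to end in SQLite. Months are plain integers 1-12 here and the active marker is a nullable column, a simplification of the original schema: 'A3' is active in months 2-7 (six consecutive), while a made-up 'B1' only has two shorter streaks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, month INTEGER, active INTEGER)")
data = [("A3", m, 1 if 2 <= m <= 7 else None) for m in range(1, 13)]
data += [("B1", m, 1 if m in (1, 2, 3, 9, 10) else None) for m in range(1, 13)]
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", data)
# row_number minus the running count of active months is constant within
# each run of consecutive active months, so it works as a group key.
rows = conn.execute("""
SELECT DISTINCT id
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY month) -
             COUNT(active)  OVER (PARTITION BY id ORDER BY month) AS grp
      FROM t) x
WHERE active IS NOT NULL
GROUP BY id, grp
HAVING COUNT(*) >= 5
""").fetchall()
print(rows)
```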

Hive SQL aggregate merge multiple sqls into one

I have a series of SQL queries like:
select count(distinct userId) from table where hour >= 0 and hour <= 0;
select count(distinct userId) from table where hour >= 0 and hour <= 1;
select count(distinct userId) from table where hour >= 0 and hour <= 2;
...
select count(distinct userId) from table where hour >= 0 and hour <= 14;
Is there a way to merge them into one query?
It looks like you are trying to keep a cumulative count, bracketed by the hour. To do that, you can use a window function, like this:
SELECT DISTINCT
A.hour AS hour,
SUM(COALESCE(M.include, 0)) OVER (ORDER BY A.hour) AS cumulative_count
FROM ( -- get all records, with 0 for include
SELECT
name,
hour,
0 AS include
FROM
table
) A
LEFT JOIN
( -- get the record with lowest `hour` for each `name`, and 1 for include
SELECT
name,
MIN(hour) AS hour,
1 AS include
FROM
table
GROUP BY
name
) M
ON M.name = A.name
AND M.hour = A.hour
;
There might be a simpler way, but this should yield the correct answer in general.
Explanation:
This uses 2 subqueries against the same input table, with a derived field called include to keep track of which records should contribute to the final total for each bucket. The first subquery simply takes all records in the table and assigns 0 AS include. The second subquery finds all unique names and the lowest hour slot in which that name appears, and assigns them 1 AS include. The 2 subqueries are LEFT JOIN'ed by the enclosing query.
The outermost query does COALESCE(M.include, 0) to fill in any NULLs produced by the LEFT JOIN, and those 1's and 0's are SUMmed in a window ordered by hour. This needs to be SELECT DISTINCT rather than GROUP BY because a GROUP BY would have to list both hour and include, collapsing all records in a given hour/include group into a single row (still with include=1) before the SUM sees them. The DISTINCT is applied after the SUM, so it removes duplicates without discarding any input rows.
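The query can be sanity-checked in SQLite (Hive's syntax for this part is close enough); the names and hours below are made up. Each user contributes a 1 only at their first hour, and the windowed SUM accumulates those over the hour axis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (name TEXT, hour INTEGER);
INSERT INTO t VALUES ('u1', 0), ('u1', 3), ('u2', 1), ('u3', 1), ('u3', 2);
""")
# A: every record with include=0; M: each user's first hour with include=1.
# The windowed SUM over hour yields the cumulative distinct-user count.
rows = conn.execute("""
SELECT DISTINCT A.hour AS hour,
       SUM(COALESCE(M.include, 0)) OVER (ORDER BY A.hour) AS cumulative_count
FROM (SELECT name, hour, 0 AS include FROM t) A
LEFT JOIN (SELECT name, MIN(hour) AS hour, 1 AS include FROM t GROUP BY name) M
  ON M.name = A.name AND M.hour = A.hour
ORDER BY hour
""").fetchall()
print(rows)
```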