query with partition and count

Given the following table (it records users' item viewing history with session)
create table view_log (
server_time timestamp,
device char(2),
session_id char(10),
uid char(7),
item_id char(7)
);
I'm trying to understand what the following code does:
create table coo_cs as
select
item_id,
session_id,
count(distinct session_id) / (sum(count(distinct session_id)) over (partition by item_id)) cs
from view_log
group by item_id, session_id;
I tried to break down the line with the partition to understand what it is doing, but then it fails with the error "DISTINCT is not implemented for window functions".
I understand basic partition and group by, but I can't make sense of the SQL above.
Edit: there is a rather large data set for testing:
http://pakdd2017.recobell.io/site_view_log_small.csv000.gz

Some databases do not (yet) support count(distinct) as a window function. For this query, the count(distinct) is not necessary, because you are aggregating by the same column used for the count(distinct). Hence, count(distinct session_id) is 1 on each row.
Your query is essentially:
select item_id, session_id,
1.0 / count(session_id) over (partition by item_id) as cs
from view_log
group by item_id, session_id;
I wouldn't be surprised if you wanted the ratios at the level of item_id, in which case the intended query is:
select item_id, count(distinct session_id),
count(distinct session_id) * 1.0 / sum(count(distinct session_id)) over () as cs
from view_log
group by item_id;
If so, the equivalent logic can use a subquery:
select vl.*, numsessions * 1.0 / sum(numsessions) over () as cs
from (select item_id, count(distinct session_id) as numsessions
from view_log vl
group by item_id
) vl;
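The claim that count(distinct session_id) is 1 in every (item_id, session_id) group can be checked on a toy table. This is a minimal sketch using Python's built-in sqlite3 module; the sample rows are made up, the table is cut down to two columns, and it assumes the bundled SQLite is 3.25 or newer (the first release with window functions). The grouping is done in a subquery so that no DISTINCT is needed inside the window function.

```python
import sqlite3

# Hypothetical sample rows: item A is viewed in session s1 (twice) and s2, item B in s1.
rows = [
    ("A", "s1"),
    ("A", "s1"),
    ("A", "s2"),
    ("B", "s1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table view_log (item_id text, session_id text)")
conn.executemany("insert into view_log values (?, ?)", rows)

# After grouping by (item_id, session_id), count(distinct session_id) is 1 in every group.
per_group = conn.execute(
    """
    select item_id, session_id, count(distinct session_id) as n
    from view_log
    group by item_id, session_id
    order by item_id, session_id
    """
).fetchall()

# So cs reduces to 1 / (number of distinct sessions that viewed the item).
cs_rows = conn.execute(
    """
    select item_id, session_id,
           1.0 / count(*) over (partition by item_id) as cs
    from (select item_id, session_id
          from view_log
          group by item_id, session_id) g
    order by item_id, session_id
    """
).fetchall()

for row in cs_rows:
    print(row)
```

With these rows, item A was seen in two sessions, so each of its rows gets cs = 0.5, while item B gets cs = 1.0.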

Related

Remove duplicates from SQL Window function

I'm trying to sum values inside a window function, but I can't figure out how to prevent summing duplicates. Below is a snippet of the results I have right now. For the last column, I want to sum REG_MOVEMENT across unique STORE_IDs and then divide by the number of unique stores. This column should be 5603.5 ((9359 + 1848) / 2), since three rows share one STORE_ID and one row has a different one.
KEY_ID  PRODUCT_ID  STORE_ID  REG_MOVEMENT  (No column name)
154     5214266     28002     9359          7481.25
155     5214266     28002     9359          7481.25
156     5214266     28002     9359          7481.25
173     5214266     28005     1848          7481.25
My current code is
SELECT
KEY_ID,
PRODUCT_ID,
STORE_ID,
REG_MOVEMENT,
SUM(REG_MOVEMENT) OVER(PARTITION BY PRODUCT_ID) / COUNT(STORE_ID) OVER(PARTITION BY PRODUCT_ID)

You need a distinct count in the denominator, but SQL Server does not allow this in a single count window function call. As a workaround, we can use DENSE_RANK:
WITH cte AS (
SELECT *, DENSE_RANK() OVER (PARTITION BY PRODUCT_ID ORDER BY STORE_ID) dr
FROM yourTable
)
SELECT
KEY_ID,
PRODUCT_ID,
STORE_ID,
REG_MOVEMENT,
SUM(REG_MOVEMENT) OVER (PARTITION BY PRODUCT_ID) /
MAX(dr) OVER (PARTITION BY PRODUCT_ID) AS new_col
FROM cte
ORDER BY PRODUCT_ID, STORE_ID;
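To see how the DENSE_RANK trick stands in for a distinct count, here is a small sketch that runs only the denominator on the four rows from the question. It uses Python's sqlite3 module rather than SQL Server, so it illustrates the idea, not the exact engine behaviour; it assumes SQLite 3.25 or newer.

```python
import sqlite3

# The four rows shown in the question.
rows = [
    (154, 5214266, 28002, 9359),
    (155, 5214266, 28002, 9359),
    (156, 5214266, 28002, 9359),
    (173, 5214266, 28005, 1848),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table yourTable (KEY_ID int, PRODUCT_ID int, STORE_ID int, REG_MOVEMENT int)"
)
conn.executemany("insert into yourTable values (?, ?, ?, ?)", rows)

# The highest dense rank within a product equals its number of distinct stores.
store_counts = conn.execute(
    """
    with cte as (
        select KEY_ID, PRODUCT_ID,
               dense_rank() over (partition by PRODUCT_ID order by STORE_ID) as dr
        from yourTable
    )
    select KEY_ID, max(dr) over (partition by PRODUCT_ID) as distinct_stores
    from cte
    order by KEY_ID
    """
).fetchall()

for row in store_counts:
    print(row)
```

Every row reports 2 distinct stores. Note that on this sample the numerator SUM(REG_MOVEMENT) still adds the three duplicate rows for store 28002 (29925 in total), so this sketch checks only the denominator.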

One way with a subquery to de-duplicate (store_id, reg_movement) rows:
select
KEY_ID, PRODUCT_ID, STORE_ID, REG_MOVEMENT,
(select avg(reg_movement)
from (select distinct store_id, reg_movement
from Tbl) Unq
) As NewCol
from Tbl
(Tbl is yourtable)

SELECT AVG(reg_movement)
FROM (
SELECT DISTINCT store_id,
CAST(reg_movement AS FLOAT) AS reg_movement
FROM Table1
) a
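As a quick check of the de-duplicating subquery, the sketch below runs it on the four rows from the question with Python's sqlite3 module. SQLite's avg() always returns a floating-point value, so the cast used for SQL Server (where AVG over an integer column returns an integer) is left out here.

```python
import sqlite3

# The four rows shown in the question.
rows = [
    (154, 5214266, 28002, 9359),
    (155, 5214266, 28002, 9359),
    (156, 5214266, 28002, 9359),
    (173, 5214266, 28005, 1848),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Table1 (key_id int, product_id int, store_id int, reg_movement int)"
)
conn.executemany("insert into Table1 values (?, ?, ?, ?)", rows)

# Collapse the duplicate (store_id, reg_movement) pairs, then average what is left.
avg_movement = conn.execute(
    """
    select avg(reg_movement)
    from (select distinct store_id, reg_movement
          from Table1) a
    """
).fetchone()[0]

print(avg_movement)
```

This yields (9359 + 1848) / 2 = 5603.5, the value the question asks for.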

Oracle SQL Count distinct values in a certain column

I am trying to query a table with a certain logic and I want to remove the records which have a count of 2 or more distinct values in PERSON_ID column. I cannot find an appropriate window query to achieve this. I already tried using:
SELECT
CUSTOMER_ID, PERSON_ID, CODE,
DENSE_RANK() OVER (PARTITION BY CUSTOMER_iD, PERSON_ID ORDER BY PERSON_ID ASC) AS NR
FROM TBL_1;
But I get the following result:
I want to achieve the result below, which counts the distinct values in the PERSON_ID column for each CUSTOMER_ID. In my case, customer "444333" is a record I want to remove because it has 2 distinct PERSON_ID values.

Here is what you need:
SELECT
customer_id, count(distinct PERSON_ID) distinct_person_count
FROM TBL_1
group by customer_id
And if you want to show it for each row, you can join it back to the table:
select * from TBL_1 t
join (
select customer_id, count(distinct PERSON_ID) distinct_person_count
from TBL_1
group by customer_id
) tt
on t.customer_id = tt.customer_id
Note: not every database allows distinct within window functions.
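The join approach can be sketched end to end, including the filter the question asks for. This uses Python's sqlite3 module with made-up rows (only customer "444333" comes from the question), and the final where clause is an assumption about how the unwanted customers would be dropped.

```python
import sqlite3

# Hypothetical rows: customer 111222 has one person, customer 444333 has two.
rows = [
    ("111222", 1, "A"),
    ("111222", 1, "B"),
    ("444333", 2, "A"),
    ("444333", 3, "A"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table TBL_1 (customer_id text, person_id int, code text)")
conn.executemany("insert into TBL_1 values (?, ?, ?)", rows)

# Attach the per-customer distinct count to every row, then keep only
# customers with fewer than 2 distinct person_id values.
kept = conn.execute(
    """
    select t.customer_id, t.person_id, t.code, tt.distinct_person_count
    from TBL_1 t
    join (
        select customer_id, count(distinct person_id) as distinct_person_count
        from TBL_1
        group by customer_id
    ) tt
      on t.customer_id = tt.customer_id
    where tt.distinct_person_count < 2
    order by t.customer_id, t.code
    """
).fetchall()

for row in kept:
    print(row)
```

Only the two rows for customer 111222 remain; both rows for 444333 are filtered out.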

If you want the distinct count on each row, then use a window function:
select t.*,
count(distinct person_id) over (partition by customer_id)
from t;
Oracle does support distinct in window functions.

Correlated subquery not working in Netezza

I have a query like this in Netezza, but not sure how I can rewrite it so it will work. Thanks
with dates as (
select distinct event_date from table
)
select event_date,
(select count(distinct id)
from table
where event_date < dates.event_date
)
from dates
It fails with the error: "This form of correlated query is not supported - consider rewriting"

This would be more efficient using window functions anyway. I think the logic is:
select event_date,
sum(count(*)) over (order by event_date) - count(*) as events_before
from table
group by event_date
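The running-total idea can be tried on a few rows. This is a sketch with Python's sqlite3 module, not Netezza; the table name events and the sample rows are hypothetical, and the aggregation is moved into a CTE so the window function works on one row per date.

```python
import sqlite3

# Hypothetical events: two on Jan 1, one on Jan 2, three on Jan 3.
rows = [
    ("2020-01-01", 1),
    ("2020-01-01", 2),
    ("2020-01-02", 1),
    ("2020-01-03", 3),
    ("2020-01-03", 4),
    ("2020-01-03", 5),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table events (event_date text, id int)")
conn.executemany("insert into events values (?, ?)", rows)

# Running total of rows up to and including each date, minus that date's own
# rows, leaves the number of rows strictly before the date.
before = conn.execute(
    """
    with per_day as (
        select event_date, count(*) as n
        from events
        group by event_date
    )
    select event_date,
           sum(n) over (order by event_date) - n as events_before
    from per_day
    order by event_date
    """
).fetchall()

for row in before:
    print(row)
```

Like the query above, this counts rows rather than distinct id values, so the two differ whenever an id repeats across dates (id 1 does in this sample).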

I get different results every time I run my query that uses the lead function in SQL Impala

I have the following code:
select *, lead(session_end_type) over (partition by user_id, session_id order by user_id, session_id, log_time) as next_session_end_type
from table_name;
However, it seems like this results in different results every time I run it.
What makes this difference?
Thanks in advance!
(I've checked that the code outputs different results through the following code:
create table t1 as
select *, lead(session_end_type) over (partition by user_id, session_id order by user_id, session_id, log_time) as next_session_end_type
from table_name;
create table t2 as
select *, lead(session_end_type) over (partition by user_id, session_id order by user_id, session_id, log_time) as next_session_end_type
from table_name;
select count (*) from
(
select * from t1
union
select * from t2
) as t;
The resulting row count is different from t1's row count and t2's row count; meaning that the result of t1 and t2 is different.)

First, there is no need to repeat the partition by columns in the order by. You can simplify this to:
lead(session_end_type) over (partition by user_id, session_id order by log_time) as next_session_end_type
Second, if log_time is not unique for a given user_id/session_id, then the results are unstable. Remember, SQL tables represent unordered sets, so if there are ties in sort keys then there is no "natural" order to fall back on.
You can check this with:
select user_id, session_id, log_time, count(*)
from table_name
group by user_id, session_id, log_time
having count(*) > 1
order by count(*) desc;
If you do have a column that uniquely identifies each row (or each user/user session row), then include that in the order by:
lead(session_end_type) over (partition by user_id, session_id
order by log_time, <make it stable column>) as next_session_end_type
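Both steps, finding ties and adding a tie-breaker, can be sketched on a tiny table. This uses Python's sqlite3 module instead of Impala, and the row_id column is a hypothetical unique key standing in for whatever uniquely identifies a row in the real table.

```python
import sqlite3

# Hypothetical log rows: rows 2 and 3 share the same log_time.
rows = [
    (1, "u1", "s1", 10, "a"),
    (2, "u1", "s1", 20, "b"),
    (3, "u1", "s1", 20, "c"),
    (4, "u1", "s1", 30, "d"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table table_name "
    "(row_id int, user_id text, session_id text, log_time int, session_end_type text)"
)
conn.executemany("insert into table_name values (?, ?, ?, ?, ?)", rows)

# Step 1: list the sort keys that occur more than once.
ties = conn.execute(
    """
    select user_id, session_id, log_time, count(*)
    from table_name
    group by user_id, session_id, log_time
    having count(*) > 1
    """
).fetchall()

# Step 2: add the unique row_id as a tie-breaker so the ordering is total.
stable = conn.execute(
    """
    select row_id,
           lead(session_end_type) over (partition by user_id, session_id
                                        order by log_time, row_id) as next_session_end_type
    from table_name
    order by row_id
    """
).fetchall()

print(ties)
print(stable)
```

With the tie-breaker in place, the row at log_time 20 that comes first is decided by row_id, so repeated runs return the same next_session_end_type for every row.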

SQL server - SELECT a columns entry on a CASE WHEN query

I'd like to find out the first provider (PROVIDER_ID) to the client (CLIENT_ID) in a database table of bookings (BOOKING_ID)
I currently SELECT the CLIENT_ID first, then calculate various other things.
I group by (CLIENT_ID) and the count is correct.
What I'm looking for is
SELECT case when(min(BOOKING_ID)) then PROVIDER_ID else null end)
But I am unable to perform sub queries within the SELECT/CASE WHEN
I hope this makes sense and the question is clear.
Ideally I would like a solution that is within a single SELECT

Assuming you want to get the PROVIDER_ID for the MIN(BOOKING_ID) grouping by CLIENT_ID the following should work:
SELECT
Client_ID,
Booking_ID,
Provider_ID
FROM (
SELECT
Client_ID,
Provider_ID,
Booking_ID,
ROW_NUMBER() OVER (PARTITION BY Client_ID ORDER BY Booking_ID) as RowNumber
FROM
Bookings
) OrderedTable
WHERE
OrderedTable.RowNumber = 1
How does it work? ROW_NUMBER() OVER (ORDER BY field) numbers the rows as if the result set were ordered by that field. PARTITION BY restarts the numbering for each value of the partition key (in this case Client_ID), so RowNumber = 1 marks the first booking for that particular client.
More details here: http://msdn.microsoft.com/en-us/library/ms186734.aspx
Using WITH syntax:
WITH OrderedTable AS
(
SELECT
Client_ID,
Provider_ID,
Booking_ID,
ROW_NUMBER() OVER (PARTITION BY Client_ID ORDER BY Booking_ID) as RowNumber
FROM
Bookings
)
SELECT
Client_ID,
Provider_ID,
Booking_ID
FROM
OrderedTable
WHERE
RowNumber = 1
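The ROW_NUMBER pattern can be tried on a few sample bookings. This sketch uses Python's sqlite3 module rather than SQL Server, with made-up rows, and assumes SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical bookings: client c1 booked pA first, client c2 booked pC first.
rows = [
    (1, "c1", "pA"),
    (2, "c1", "pB"),
    (3, "c2", "pC"),
    (4, "c2", "pA"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table Bookings (Booking_ID int, Client_ID text, Provider_ID text)")
conn.executemany("insert into Bookings values (?, ?, ?)", rows)

# Number each client's bookings by Booking_ID and keep only the first one.
first_providers = conn.execute(
    """
    with OrderedTable as (
        select Client_ID, Provider_ID, Booking_ID,
               row_number() over (partition by Client_ID order by Booking_ID) as RowNumber
        from Bookings
    )
    select Client_ID, Booking_ID, Provider_ID
    from OrderedTable
    where RowNumber = 1
    order by Client_ID
    """
).fetchall()

for row in first_providers:
    print(row)
```

Each client appears once, paired with the provider of its lowest Booking_ID.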
