Not able to COUNT DISTINCT using WINDOW functions (Spark SQL) - sql

Let's say I have a dataset sample (table 1) as shown below -
Here, one customer can use multiple tokens, and one token can be used by multiple customers. For each token, customer, and record creation date, I am trying to get the number of customers who used this token before the creation date.
When I try to run this in Spark SQL, I run into the following problems:
Option 1 (correlated subquery)
SELECT
t1.token,
t1.customer_id,
t1.creation_date,
(SELECT COUNT(DISTINCT t2.customer_id) FROM Table 1 t2
WHERE t1.token = t2.token
AND t2.creation_date < t1.creation_date) cust_cnt
FROM Table 1 t1;
Error: Correlated column is not allowed in a non-equality predicate
Option 2 (cross - join)
SELECT
t1.token,
t1.customer_id,
t1.creation_date,
COUNT(DISTINCT t2.customer_id) AS cust_cnt
FROM Table 1 t1, Table 1 t2
WHERE t1.token = t2.token
AND t2.creation_date < t1.creation_date
GROUP BY t1.token, t1.customer_id, t1.creation_date;
Problem: Long running query since Table 1 has millions of rows
Is there any workaround (e.g. using a window function) to optimize this query in Spark SQL? Note: window functions do not allow a distinct count.

Count the first time a customer appears:
SELECT t1.token, t1.customer_id, t1.creation_date,
SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY token ORDER BY creation_date) as cust_cnt
FROM (SELECT t1.*,
ROW_NUMBER() OVER (PARTITION BY token, customer_id ORDER BY creation_date) as seqnum
FROM Table1 t1
) t1;
Note: This is also counting the current row. I'm guessing that is acceptable for what you want to do.
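A minimal sketch of the same idea, run against SQLite through Python's sqlite3 module rather than Spark SQL (this assumes a bundled SQLite of 3.25 or later, which is where window functions were added). The sample rows and the function name running_distinct_customers are made up for illustration; only the query shape comes from the answer above.

```python
import sqlite3

# Hypothetical sample rows: (token, customer_id, creation_date).
ROWS = [
    ("tok_a", "c1", "2020-01-01"),
    ("tok_a", "c2", "2020-01-02"),
    ("tok_a", "c1", "2020-01-03"),  # c1 again: must not be counted twice
    ("tok_a", "c3", "2020-01-04"),
    ("tok_b", "c1", "2020-01-01"),
]

# seqnum = 1 marks the first time a customer appears for a token;
# the running SUM of those marks is a running distinct count.
QUERY = """
SELECT token, customer_id, creation_date,
       SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY token ORDER BY creation_date) AS cust_cnt
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY token, customer_id
                                ORDER BY creation_date) AS seqnum
      FROM table1 t) s
ORDER BY token, creation_date
"""


def running_distinct_customers(rows):
    """Load rows into an in-memory table and return the running distinct count."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 (token TEXT, customer_id TEXT, creation_date TEXT)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in running_distinct_customers(ROWS):
    print(row)
```

As in the answer, cust_cnt includes the current row, so the third tok_a row (a repeat of c1) keeps the count at 2 instead of raising it.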

Related

How to write SQL query without join?

Recently, during an interview, I was asked a question: if I have a table like the one below:
The requirement is: how many orders and how many shipments per day (based on date column) - output needs to be like this:
I wrote the following code, but the interviewer asked me to write a SQL query that achieves the same output without JOIN or UNION.
SELECT
COALESCE(a.order_date, b.ship_date), orders, shipments
FROM
(SELECT
order_date, COUNT(1) AS orders
FROM
table
GROUP BY 1) a
FULL JOIN
(SELECT
ship_date, COUNT(1) AS shipments
FROM table
GROUP BY 1) b ON a.order_date = b.ship_date
Is this possible? Could you please advise?
You can use UNION and GROUP BY with conditional aggregation as follows:
SELECT DATE_,
COUNT(CASE WHEN FLAG = 'ORDER' THEN 1 END) AS ORDERS,
COUNT(CASE WHEN FLAG = 'SHIP' THEN 1 END) AS SHIPMENTS
FROM (SELECT ORDER_DATE AS DATE_, 'ORDER' AS FLAG FROM YOUR_TABLE
UNION ALL
SELECT SHIP_DATE AS DATE_, 'SHIP' AS FLAG FROM YOUR_TABLE) T
GROUP BY DATE_
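To see the UNION ALL plus conditional aggregation approach produce per-day counts, here is a small sketch using Python's sqlite3 module. The three sample rows, the order_id column, and the function name orders_and_shipments_per_day are assumptions made for the example.

```python
import sqlite3

# Hypothetical sample rows: (order_id, order_date, ship_date).
ROWS = [
    (1, "2021-01-01", "2021-01-02"),
    (2, "2021-01-01", "2021-01-03"),
    (3, "2021-01-02", "2021-01-03"),
]

# Unpivot both date columns into one, tagging each row with its origin,
# then count each tag per date.
QUERY = """
SELECT date_,
       COUNT(CASE WHEN flag = 'ORDER' THEN 1 END) AS orders,
       COUNT(CASE WHEN flag = 'SHIP' THEN 1 END) AS shipments
FROM (SELECT order_date AS date_, 'ORDER' AS flag FROM your_table
      UNION ALL
      SELECT ship_date AS date_, 'SHIP' AS flag FROM your_table) t
GROUP BY date_
ORDER BY date_
"""


def orders_and_shipments_per_day(rows):
    """Return (date, orders, shipments) tuples for the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE your_table (order_id INTEGER, order_date TEXT, ship_date TEXT)"
    )
    conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in orders_and_shipments_per_day(ROWS):
    print(row)
```

COUNT ignores NULLs, and a CASE with no ELSE yields NULL, which is why each COUNT only sees rows carrying its own flag.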
In BigQuery, I would express this as:
select date, countif(n = 0) as orders, countif(n = 1) as numships
from t cross join
unnest(array[order_date, ship_date]) date with offset n
group by 1
order by date;
The advantage of this approach (over union all) is two-fold. First, it only scans the table once. More importantly, the unnest() is all on the same node where the data resides -- so data does not need to be moved for the unpivot.

Unable to get duplicate records from table

I have a table with the structure given below:
Each User_ID has values for its respective items in a specific time interval. An item value can be text or an integer, depending on the item.
I want to check whether two or more UserIds have the same values, meaning their items are the same, with the same values, in the same time interval.
In the table above, UserId 213456 and UserId 213458 have the same records.
I tried using a cursor and loops, but it takes too long. My table has more than 50 million UserIds. Is there an efficient way to do this?
I also tried GROUP BY with subqueries, but none of my attempts produced a good query.
I created the following query based on "How do I find duplicate values in a table in Oracle?":
select t1.USERID, count(t1.USERID)
from USERS_ITEM_VAL t1
where exists ( select *
from USERS_ITEM_VAL t2
where t1.rowid <> t2.rowid and
t2.ITEMID = t1.ITEMID and
t2.TEXT_VALUE = t1.TEXT_VALUE and
--t2.INTEGER_VALUE = t1.INTEGER_VALUE and
t2.INIT_DATE = t1.INIT_DATE and
t2.FINAL_DATE = t1.FINAL_DATE )
group by t1.USERID having count(t1.USERID) > 1 order by count(t1.USERID);
The problem is that it works when I exclude the INTEGER_VALUE column, but returns no output when I include INTEGER_VALUE in the join, even though the data in the INTEGER_VALUE column is the same.
Here is the structure of my table:
USERID - NUMBER
ITEMID - NUMBER
TEXT_VALUE - VARCHAR2(500)
INTEGER_VALUE - NUMBER
INIT_DATE - DATE
FINAL_DATE - DATE
One way to approach this uses a self join. The idea is to count the number of items that two users have in common (taking the date columns into account). Then compare this to the number of items that each has:
with u as (
select t.*, count(*) over (partition by userid) as numitems
from USERS_ITEM_VAL t
)
select t1.userid, t2.userid
from u t1 join
u t2
on t1.userid < t2.userid and
t1.itemid = t2.itemid and
t1.init_date = t2.init_date and
t1.final_date = t2.final_date and
t1.numitems = t2.numitems
group by t1.userid, t2.userid, t1.numitems
having count(*) = t1.numitems;
The reason your query failed is that either text_value or integer_value will be NULL in every row. For this reason, it's not possible to use an equality predicate in the self-join without using NVL functions to plug the NULL values.
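The NULL behaviour is easy to check in isolation. The sketch below uses Python's sqlite3 module and COALESCE, the standard counterpart of Oracle's NVL. The sentinel value -1 and the function name compare_nullable are assumptions for the example; a real sentinel must be a value that never occurs in the column.

```python
import sqlite3

# Assumed sentinel: must not collide with any real value in the column.
SENTINEL = -1


def compare_nullable(a, b):
    """Return (plain_equality, plugged_equality) as evaluated by SQL.

    plain_equality is a = b, which is NULL (None) when either side is NULL.
    plugged_equality replaces NULLs with SENTINEL before comparing.
    """
    conn = sqlite3.connect(":memory:")
    row = conn.execute(
        "SELECT ? = ?, COALESCE(?, ?) = COALESCE(?, ?)",
        (a, b, a, SENTINEL, b, SENTINEL),
    ).fetchone()
    conn.close()
    return row


print(compare_nullable(None, None))
print(compare_nullable(10, 10))
print(compare_nullable(10, None))
```

A predicate that evaluates to NULL is not true, so a row pair where both sides are NULL is dropped by a plain equality join but kept once the NULLs are plugged.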
However, below is a query that uses an analytic function to accomplish the goal:
Select * From (
Select t.*, Count(*) Over (Partition By t.itemId,
t.text_value,
t.integer_value,
t.init_date,
t.final_date) as Cnt
From USERS_ITEM_VAL t)
Where cnt > 1;
The query returns all rows where multiple records have identical values in the five columns of the Partition By clause.
A benefit of this technique over the self-join approach is that the table is scanned only once, whereas it would be scanned twice with a self join. This could result in better performance if the table is large.
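A small sketch of the analytic-function approach, run on SQLite via Python's sqlite3 module instead of Oracle (window functions need SQLite 3.25 or later). User ids 213456 and 213458 come from the question; the third user, the item values, and the function name find_duplicate_rows are invented for illustration.

```python
import sqlite3

# (userid, itemid, text_value, integer_value, init_date, final_date)
# Exactly one of text_value / integer_value is NULL in each row.
ROWS = [
    (213456, 1, "abc", None, "2020-01-01", "2020-12-31"),
    (213456, 2, None, 10, "2020-01-01", "2020-12-31"),
    (213457, 1, "xyz", None, "2020-01-01", "2020-12-31"),
    (213458, 1, "abc", None, "2020-01-01", "2020-12-31"),
    (213458, 2, None, 10, "2020-01-01", "2020-12-31"),
]

# PARTITION BY groups NULLs together, so no NULL plugging is needed here.
QUERY = """
SELECT userid, itemid, cnt
FROM (SELECT t.*,
             COUNT(*) OVER (PARTITION BY itemid, text_value, integer_value,
                                         init_date, final_date) AS cnt
      FROM users_item_val t) s
WHERE cnt > 1
ORDER BY userid, itemid
"""


def find_duplicate_rows(rows):
    """Return (userid, itemid, cnt) for rows whose five value columns repeat."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users_item_val (userid INTEGER, itemid INTEGER, "
        "text_value TEXT, integer_value INTEGER, init_date TEXT, final_date TEXT)"
    )
    conn.executemany("INSERT INTO users_item_val VALUES (?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in find_duplicate_rows(ROWS):
    print(row)
```

Note that this flags duplicated rows, not users whose whole item set matches; deciding that two users are identical still requires comparing the matched-row count with each user's total, as the self-join answer does.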

SQL select minimum date from same column

I'm trying to write a query based on accounts and their contracts. The table has all contracts for each account, whether the contract is active, expired, etc. I want the query to bring back only the contract with the earliest start date per account, so only one row for each account. However, I don't know the status of the earliest contract for each account. Some might be active, some might be pending. The problem I run into is that it brings back multiple records for each account if the contract status is in the list I specify. Simple sample code below:
Select t.account, t.contract, t.status, Min(t.start_date)
From table t
where t.status in ('Active','Countersigned','Pending')
If your database supports it (e.g. Oracle, Postgres, SQL Server, MySQL 8.0+, or SQLite 3.25+, but not older MySQL or SQLite versions), you can use window functions. For instance, you can rank the contracts within each account by starting_at:
SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
FROM contracts
Then you can use that in a subquery to join to accounts and only take contracts with a rank of 1. You'll need to put it in a subquery because, unfortunately (in Postgres at least), you can't use window functions inside WHERE. So this won't work:
SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
FROM contracts
WHERE rank = 1
but this will:
SELECT *
FROM (SELECT *, rank() OVER (PARTITION BY account_id ORDER BY starting_at ASC) AS rank
FROM contracts) x
WHERE rank = 1
Note you can easily add filtering by status, etc. to any of these queries.
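Here is a runnable sketch of the rank-in-a-subquery pattern with a status filter added, using Python's sqlite3 module (SQLite 3.25 or later). The sample contracts and the function name earliest_contracts are hypothetical, and the rank column is called rnk here only to avoid clashing with the function name. With the filter inside the subquery, the result is the earliest contract among the listed statuses, which may not be the account's earliest contract overall.

```python
import sqlite3

# Hypothetical sample rows: (account_id, contract, status, starting_at).
ROWS = [
    ("acc1", "k1", "Expired", "2019-01-01"),
    ("acc1", "k2", "Active", "2020-01-01"),
    ("acc2", "k3", "Pending", "2021-06-01"),
    ("acc2", "k4", "Active", "2021-03-01"),
]

# The window function is computed in the inner query; the outer query
# is then free to filter on its result.
QUERY = """
SELECT account_id, contract, status, starting_at
FROM (SELECT c.*,
             RANK() OVER (PARTITION BY account_id
                          ORDER BY starting_at ASC) AS rnk
      FROM contracts c
      WHERE status IN ('Active', 'Countersigned', 'Pending')) x
WHERE rnk = 1
ORDER BY account_id
"""


def earliest_contracts(rows):
    """Return the earliest qualifying contract (or tied contracts) per account."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contracts (account_id TEXT, contract TEXT, "
        "status TEXT, starting_at TEXT)"
    )
    conn.executemany("INSERT INTO contracts VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in earliest_contracts(ROWS):
    print(row)
```

RANK() gives tied start dates the same rank, so an account with two contracts starting on the same earliest date returns both; ROW_NUMBER() with a tie-breaking column in the ORDER BY is one way to force a single row.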
This should work:
select account, contract, status, MinDate
from
(
Select t.account, t.contract, t.status, t.start_date,
Min(t.start_date) over(partition by t.account) MinDate
From table t
where t.status in ('Active','Countersigned','Pending')
) x
where start_date=MinDate
Here is a solution that works as long as no account has several contracts sharing the same MIN(start_date). If one does, you will get multiple rows for that account and will need to decide which of those contracts you want to see:
SELECT t.*
FROM (
Select t.account, Min(t.start_date) AS MinDate
From table t
where t.status in ('Active','Countersigned','Pending')
GROUP BY t.account
) AS t2
INNER JOIN table t ON t.account = t2.account AND t.start_date = t2.MinDate

SQL Server "cannot perform an aggregate function on an expression containing an aggregate or a subquery", but Sybase can

This issue has been discussed before, but none of the answers address my specific problem because I am dealing with different where clauses in the inner and outer selects. This query executed just fine under Sybase, but gives the error in the title of this post when executed under SQL Server. The query is complicated, but the general outline of the query is:
select sum ( t.graduates -
( select sum ( t1.graduates )
from table as t1
where t1.id = t.id and t1.group_code not in ('total', 'others' ) ) )
from table as t
where t.group_code = 'total'
The following describes the situation I am trying to resolve:
all group codes represent races except for 'total' and 'others'
group code 'total' represents the total graduates of all races
however, multi-race is missing, so the race graduate counts may not add up to the total graduate counts
this missing data is what needs to be calculated
Is there any way to rewrite this using derived tables or joins to get the same results?
Update: I created sample data and 3 solutions to my specific problem (2 influenced by sgeddes). The one that I added involves moving the correlated subquery to a derived table in the FROM clause. Thanks for the help guys!
One option is to put the subquery in a LEFT JOIN:
select sum ( t.graduates ) - t1.summedGraduates
from table as t
left join
(
select sum ( graduates ) summedGraduates, id
from table
where group_code not in ('total', 'others' )
group by id
) t1 on t.id = t1.id
where t.group_code = 'total'
group by t1.summedGraduates
Perhaps a better option would be to use SUM with CASE:
select sum(case when group_code = 'total' then graduates end) -
sum(case when group_code not in ('total','others') then graduates end)
from yourtable
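A quick check of the SUM with CASE version, using Python's sqlite3 module. The race group codes, the counts, and the function name missing_graduates are made up for the example; only 'total' and 'others' are taken from the question.

```python
import sqlite3

# Hypothetical sample rows: (id, group_code, graduates).
ROWS = [
    (1, "total", 100), (1, "asian", 30), (1, "white", 50), (1, "others", 5),
    (2, "total", 40), (2, "black", 25), (2, "white", 10),
]

# One pass over the table: sum the 'total' rows and the race rows
# separately, then subtract. 'others' rows are ignored by both sums.
QUERY = """
SELECT SUM(CASE WHEN group_code = 'total' THEN graduates END) -
       SUM(CASE WHEN group_code NOT IN ('total', 'others') THEN graduates END)
FROM yourtable
"""


def missing_graduates(rows):
    """Return total graduates minus the sum over the individual race groups."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE yourtable (id INTEGER, group_code TEXT, graduates INTEGER)"
    )
    conn.executemany("INSERT INTO yourtable VALUES (?, ?, ?)", rows)
    (result,) = conn.execute(QUERY).fetchone()
    conn.close()
    return result


print(missing_graduates(ROWS))
```

With these rows the totals add up to 140 and the race groups to 115, so the query reports 25 graduates unaccounted for. Because no aggregate is nested inside another, this form avoids the SQL Server error from the title.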

SQL Query to select top 2 for each value

I have a table with 3 columns, the data in column1 has repeating values and column 3 has totals, what I'd like to do is to return the top 2 totals for each value in column 1.
My query to create this table is below:
SELECT service,name, total
FROM [test].[dbo].[TestTable]
join test1.dbo.service
on substring(servan,0,4)=servicebn
where substring(servan,0,4)=servicebn and name <> testname
group by service,name,total
order by service,total desc
Any help would be much appreciated.
If you are using SQL Server 2005+, you can use a common table expression and a window function:
WITH recordsList
AS
(
SELECT service, name, total,
DENSE_RANK() OVER (PARTITION BY service
ORDER BY total DESC) rn
FROM [test].[dbo].[TestTable]
INNER join test1.dbo.servd
on substring(servan,0,4)=servicebn
where substring(servan,0,4) = servicebn and
name <> testname
)
SELECT service, name, total
FROM recordsList
WHERE rn <= 2
As a side note, this query performs poorly because it requires a full table scan on every table. The cause is the join condition substring(servan,0,4)=servicebn: applying a function to the column prevents the use of an ordinary index on servan.
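The top-2-per-group pattern can be tried on its own, without the join, in a short sketch using Python's sqlite3 module (SQLite 3.25 or later). The table name test_table, the sample rows, and the function name top_two_per_service are assumptions for the example.

```python
import sqlite3

# Hypothetical sample rows: (service, name, total).
ROWS = [
    ("svc1", "a", 10),
    ("svc1", "b", 30),
    ("svc1", "c", 20),
    ("svc2", "d", 5),
    ("svc2", "e", 7),
]

# Rank totals within each service, highest first, then keep ranks 1 and 2.
QUERY = """
WITH records_list AS (
    SELECT service, name, total,
           DENSE_RANK() OVER (PARTITION BY service
                              ORDER BY total DESC) AS rn
    FROM test_table
)
SELECT service, name, total
FROM records_list
WHERE rn <= 2
ORDER BY service, total DESC
"""


def top_two_per_service(rows):
    """Return the rows holding the two highest totals for each service."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE test_table (service TEXT, name TEXT, total INTEGER)")
    conn.executemany("INSERT INTO test_table VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


for row in top_two_per_service(ROWS):
    print(row)
```

DENSE_RANK keeps every row tied on one of the top two totals, so a service can return more than two rows; ROW_NUMBER is the usual substitute when exactly two rows per service are wanted.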