This is an extension of my prior question, How to distribute values when prior rank is zero. The solution worked great in Postgres, but now I need to replicate it in a Databricks environment (Spark SQL).
The question is the same as before, but now I am trying to work out how to convert this Postgres query to Spark SQL. Basically, it sums up allocation amounts where there are gaps in the data (i.e., no micro_geo's when grouping by location and geo3). The "imputed allocation" will equal 1 for all location & zip3 groups.
This is the postgres query, which works great:
select location_code, geo3, distance_group, has_micro_geo, imputed_allocation
from (
    select ia.*,
           (case when has_micro_geo > 0
                 then sum(allocation) over (partition by location_code, geo3, grp)
                 else 0
            end) as imputed_allocation
    from (select s.*,
                 count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
          from staging_groups s
         ) ia
) z
But it doesn't translate well and produces this error in databricks:
Error in SQL statement: ParseException:
mismatched input 'from' expecting <EOF>(line 1, pos 78)
== SQL ==
select location_code, geo3, distance_group, has_micro_geo, imputed_allocation from
------------------------------------------------------------------------------^^^
(
select ia.*,
(case when has_micro_geo > 0
then sum(allocation) over (partition by location_code, geo3, grp)
else 0
end) as imputed_allocation
from (select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
) ia
)z
Or at a minimum, how to convert just part of this inner query which creates a "grp", and then perhaps the rest will work. I have been trying to replace this filter-where logic with something else, but attempts have not worked as desired.
select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
Here's a db-fiddle with data https://www.db-fiddle.com/f/wisvDZJL9BkWxNFkfLXdEu/0 which is currently set to postgres, but again I need to run this in a spark sql environment. I've tried breaking this down and creating different tables, but my groups are not working as desired.
Here's an image to better visualize the output:
You need to rewrite this subquery:
select s.*,
count(*) filter (where has_micro_geo <> 0) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
Although the filter() clause to window and aggregate functions is standard SQL, few databases support it so far. Instead, consider a conditional window sum(), which produces the same result:
select s.*,
sum(case when has_micro_geo <> 0 then 1 else 0 end) over (partition by location_code, geo3 order by distance_group desc) as grp
from staging_groups s
I think that the rest of the query should run fine in Spark SQL.
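To check that the conditional sum reproduces the grouping, here is a minimal sketch that runs the rewritten query end to end. It uses Python's built-in sqlite3 module as a stand-in engine (an assumption made only so the example is runnable; Spark SQL is not involved), and the five sample rows are invented, with allocation stored as whole percentages.

```python
import sqlite3

# Invented sample data: one location/geo3 pair, allocation in whole percent.
ROWS = [
    # location_code, geo3, distance_group, has_micro_geo, allocation
    ("L1", "123", 1, 0, 10),
    ("L1", "123", 2, 1, 30),
    ("L1", "123", 3, 0, 5),
    ("L1", "123", 4, 0, 15),
    ("L1", "123", 5, 1, 40),
]

# Same shape as the query in the question, with count(*) filter (...)
# replaced by a conditional window sum.
QUERY = """
select distance_group, has_micro_geo, grp,
       case when has_micro_geo > 0
            then sum(allocation) over (partition by location_code, geo3, grp)
            else 0
       end as imputed_allocation
from (select s.*,
             sum(case when has_micro_geo <> 0 then 1 else 0 end)
                 over (partition by location_code, geo3
                       order by distance_group desc) as grp
      from staging_groups s
     ) ia
order by distance_group
"""


def imputed_rows():
    """Load the sample rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table staging_groups (location_code text, geo3 text, "
        "distance_group integer, has_micro_geo integer, allocation integer)"
    )
    conn.executemany("insert into staging_groups values (?, ?, ?, ?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in imputed_rows():
        print(row)
```

With this data, distance groups 3 and 4 fold into group 5 (40 + 15 + 5 = 60) and group 1 folds into group 2 (30 + 10 = 40), so the imputed allocations still add up to 100.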
As has_micro_geo is already a 0/1 flag, you can rewrite the count(*) filter (...) as
sum(has_micro_geo)
over (partition by location_code, geo3
order by distance_group desc
rows unbounded preceding) as grp
I added rows unbounded preceding to avoid the default frame, range unbounded preceding, which might be less performant.
Btw, I wrote that already in my comment to Gordon's solution to your prior question :-)
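One caveat that may be worth checking before switching frames: rows and range give the same running total only when the ordering key is unique within the partition. With ties, the default range frame includes all peers of the current row, while rows stops at the current row. A minimal sketch, assuming an invented table t(k, flag) and using Python's sqlite3 module as a stand-in engine:

```python
import sqlite3


def running_sums():
    """Compare the default (range) frame with an explicit rows frame."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (k integer, flag integer)")
    # k = 2 appears twice, so the ordering key has a tie.
    conn.executemany("insert into t values (?, ?)",
                     [(1, 1), (2, 1), (2, 1), (3, 0)])
    rows = conn.execute("""
        select k,
               sum(flag) over (order by k) as range_sum,
               sum(flag) over (order by k rows unbounded preceding) as rows_sum
        from t
        order by k, rows_sum
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in running_sums():
        print(row)
```

Both rows with k = 2 get range_sum = 3, but their rows_sum values are 2 and 3. If distance_group can repeat within a location_code/geo3 partition, the two variants would therefore assign different grp values.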
Related
I have the table
I need to calculate cumsum group by id for every row with type="end".
Can anyone see the problem?
Output result
This is a little tricky. One method is to assign a grouping by reverse counting the ends. Then use dense_rank():
select t.*,
dense_rank() over (order by grp desc) as result
from (select t.*,
count(*) filter (where type = 'end') over (order by created desc) as grp
from t
) t;
You can also do this without a subquery:
select t.*,
(count(*) filter (where type = 'end') over () -
count(*) filter (where type = 'end') over (order by created desc) +
1
) as result
from t;
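A quick way to convince yourself that the two forms agree is to run both on a small table. The sketch below is an illustration under assumptions: the six events are invented, Python's sqlite3 module stands in for the real database, and the filter clause is written as a conditional count so the example does not depend on filter support. In this check the arithmetic form matches dense_rank() when it ends in + 1.

```python
import sqlite3

# Invented events: "end" rows close a group.
EVENTS = [(1, "start"), (2, "end"), (3, "start"),
          (4, "end"), (5, "start"), (6, "start")]

WITH_SUBQUERY = """
select created, dense_rank() over (order by grp desc) as result
from (select t.*,
             count(case when type = 'end' then 1 end)
                 over (order by created desc) as grp
      from t
     ) x
order by created
"""

WITHOUT_SUBQUERY = """
select created,
       count(case when type = 'end' then 1 end) over ()
       - count(case when type = 'end' then 1 end) over (order by created desc)
       + 1 as result
from t
order by created
"""


def run(query):
    """Run one of the queries against a fresh in-memory copy of the events."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (created integer, type text)")
    conn.executemany("insert into t values (?, ?)", EVENTS)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run(WITH_SUBQUERY))
    print(run(WITHOUT_SUBQUERY))
```

Both queries number the rows up to and including the first end as 1, the rows up to the second end as 2, and the trailing rows as 3.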
I have a problem with functions such as maxif and sumif. When I try to use any of them in my project the console returns 'Function not found: sumif/maxif; Did you mean sum/max?'
It is odd, because function countif works perfectly fine, and both of maxif and sumif are described in the BigQuery documentation, so I'm kind of worried what to do with them in order to run the code properly.
Beneath is a part of my code, any suggestions would be most welcome:
SELECT
DISTINCT *,
COUNTIF(status ='completed') OVER (PARTITION BY id ORDER BY created_at) cpp, --this works
sumif(value,status='completed') OVER (PARTITION BY id ORDER BY created_at) spp, -- this doesn't
maxif(created_at, status = 'completed') OVER (PARTITION BY id ORDER BY created_at DESC) lastpp,
FROM
`production.payment_transactions`
Below is for BigQuery Standard SQL
#standardSQL
SELECT
DISTINCT *,
COUNTIF(status = 'completed') OVER (PARTITION BY id ORDER BY created_at) cpp, --this works
SUM(IF(status = 'completed', value, NULL)) OVER (PARTITION BY id ORDER BY created_at) spp, -- this now works
MAX(IF(status = 'completed', created_at, NULL)) OVER (PARTITION BY id ORDER BY created_at DESC) lastpp, -- this now works
FROM `production.payment_transactions`
SUMIF() and MAXIF() are not BigQuery functions. Use a CASE expression inside the regular aggregate:
max(case when status = 'completed' then created_at end) over (partition by id order by created_at desc)
This is confusing because the functions are used in other parts of the GCP environment, particularly a component called Dataprep.
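The CASE approach works because a CASE without an ELSE branch yields NULL, and aggregates skip NULLs. Here is a minimal sketch of the three columns from the question, under assumptions: the four payments are invented, and Python's sqlite3 module stands in for BigQuery, so COUNTIF is written as a conditional count too.

```python
import sqlite3

# Invented payments for a single id.
PAYMENTS = [
    (1, "2020-01-01", "completed", 10),
    (1, "2020-01-02", "failed", 5),
    (1, "2020-01-03", "completed", 7),
    (1, "2020-01-04", "failed", 3),
]

QUERY = """
select created_at,
       count(case when status = 'completed' then 1 end)
           over (partition by id order by created_at) as cpp,
       sum(case when status = 'completed' then value end)
           over (partition by id order by created_at) as spp,
       max(case when status = 'completed' then created_at end)
           over (partition by id order by created_at desc) as lastpp
from payment_transactions
order by created_at
"""


def conditional_aggregates():
    """Run the conditional window aggregates on the sample payments."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table payment_transactions "
        "(id integer, created_at text, status text, value integer)"
    )
    conn.executemany("insert into payment_transactions values (?, ?, ?, ?)",
                     PAYMENTS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in conditional_aggregates():
        print(row)
```

Note that lastpp is ordered descending, so each row sees the latest completed payment at or after it; the final failed row has none and gets NULL.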
My current data is as follows:
And I want Data to be
When I use the row_number function, it reorders the rows and gives me the wrong row number, as shown below.
The "Adjusted conversion COst" value 0.160 comes out at the top of the result and is numbered 1, which is wrong; per the first screenshot, it should be numbered 3.
Thanks
MySQL, using a user variable
Result - http://www.sqlfiddle.com/#!9/406f64/8/0
select
colo1,f7,
if(colo1='Total Adj. Conversion Spend',@initVal:=@initVal+1,1) as RowNumber
from temp,(select @initVal:=0) vars
MS-SQL Using Rank and Row Number
I've used Row_Number() to preserve the order and then using Rank() inside a case statement
http://www.sqlfiddle.com/#!18/fde9f/15/0
select subquery_1.colo1,subquery_1.f7
,case when subquery_1.colo1='Total Adj. Conversion Spend' then
rank() over (partition by colo1 order by rownum) else 1 end as rnk
from
(select *,row_number() OVER (ORDER BY (Select 0)) as rownum from temp) as subquery_1
order by subquery_1.rownum
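Both queries above lean on the rows coming back in insertion order, which a table does not guarantee. If the data can carry an explicit ordering column, the same conditional numbering no longer depends on that. A minimal sketch under assumptions: the table report_rows, its pos column, and the four rows are all hypothetical, and Python's sqlite3 module is used as a stand-in engine.

```python
import sqlite3

LABEL = "Total Adj. Conversion Spend"

# Hypothetical rows; pos records the intended display order explicitly.
REPORT_ROWS = [
    (1, LABEL, 0.50),
    (2, "Adjusted Conversion Cost", 0.16),
    (3, LABEL, 0.70),
    (4, LABEL, 0.90),
]

QUERY = """
select pos, colo1,
       case when colo1 = ?
            then row_number() over (partition by colo1 order by pos)
            else 1
       end as rn
from report_rows
order by pos
"""


def numbered_rows():
    """Number only the rows carrying LABEL, in pos order; others get 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table report_rows (pos integer, colo1 text, f7 real)")
    conn.executemany("insert into report_rows values (?, ?, ?)", REPORT_ROWS)
    rows = conn.execute(QUERY, (LABEL,)).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in numbered_rows():
        print(row)
```

The final order by pos keeps the output in the original sequence, while the counter only advances on the labelled rows.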
I have a query consisting of multiple subqueries. I used 'join' because I'm not allowed to use 'with'. The subqueries each have a 'from' clause, which is causing an issue.
I have to display two columns, each produced by its own logic. To produce the two columns, I need subqueries, and each requires a 'from' clause. I'm not sure how to write the 'from' clause so that the whole query fits together and runs. I have checked the individual queries and they all work fine.
select lead(dt) over
(partition by t1.id_user order by f.topup_date desc rows between 0
preceding and unbounded following )
from
(select *,
(max(case when f.topup_value >= 20 then f.topup_date end) over (partition
by f.id_user order by f.topup_date desc rows between 0 preceding and
unbounded following )) as dt
from topups f) as f, //(<-I think this is incorrect)
CAST(f.topup_value as float)/CAST(t1.topup_value as float) from
(SELECT t1.seq,t1.id_user,t1.topup_value,row_number()
over (partition by t1.id_user order by t1.topup_date )
as rowrank from topups t1) as t1
inner join topups f on f.id_user=t1.id_user
inner join topups t2 on t1.seq=t2.seq
You're getting a syntax error because a query can only have a single FROM clause. It's difficult to tell the outcome you're trying to achieve, but turning the first query into a non-correlated subquery and using it for column f might be what you're looking for:
select
(select lead(dt) over (partition by t1.id_user order by f.topup_date desc rows between 0 preceding and unbounded following )
from (
select *,
(max(case when f.topup_value >= 20 then f.topup_date end) over (partition by f.id_user order by f.topup_date desc rows between 0 preceding and unbounded following )) as dt
from topups f
) x) as f,
CAST(f.topup_value as float)/CAST(t1.topup_value as float)
from (
SELECT t1.seq, t1.id_user, t1.topup_value, row_number() over (partition by t1.id_user order by t1.topup_date ) as rowrank
from topups t1
) as t1
inner join topups f on f.id_user=t1.id_user
inner join topups t2 on t1.seq=t2.seq
Really hard to read that query. What you marked as possible incorrectness is wrong because you're trying to add what looks like another SELECT after your original FROM clause. That's incorrect syntax. Think of your FROM subquery as a temporary table. You couldn't say something like:
SELECT some_column
FROM a_table, some_other_column
That's cross-join syntax. some_other_column would need to be a table for that to even be valid.
Consider adding a CREATE TABLE and sample data so we can test.
You might be looking for something along the lines of this:
SELECT LEAD(temp.dt) OVER(PARTITION BY temp.id_user ORDER BY temp.topup_date DESC ROWS BETWEEN 0 PRECEDING AND UNBOUNDED FOLLOWING)
, temp.division
FROM
(
SELECT (max(CASE WHEN f.topup_value >= 20 THEN f.topup_date END) OVER(PARTITION BY f.id_user ORDER BY f.topup_date DESC ROWS BETWEEN 0 PRECEDING AND UNBOUNDED FOLLOWING )) AS dt
, f.topup_value::float / t1.topup_value::float AS division
, t1.id_user
, f.topup_date
FROM topups t1
JOIN topups f USING (id_user)
) temp
;
Just an opinion, but it's less noisy to use the :: operator for casts (this is PostgreSQL syntax). Instead of CAST(f.topup_value as float), just use f.topup_value::float
I'm converting some code from Oracle to SQL Server (2012) and have run into an issue where this subquery is using a PARTITION/ORDER BY to retrieve the most recent record. The subquery runs fine on its own, but as it is a subquery, I'm getting the error:
SQL Server Database Error: The ORDER BY clause is invalid in views,
inline functions, derived tables, subqueries, and common table
expressions, unless TOP, OFFSET or FOR XML is also specified.
Here's the section of SQL:
FROM (
SELECT distinct enr.MemberNum,
(ISNULL(enr.MemberFirstName, '') + ' ' + ISNULL(enr.MemberLastName, '')) AS MEMBER_NAME,
enr.MemberBirthDate as DOB,
enr.MemberGender as Gender,
LAST_VALUE(enr.MemberCurrentAge) OVER (PARTITION BY MemberNum ORDER BY StaticDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS AGE,
LAST_VALUE(enr.EligStateAidCategory)OVER (PARTITION BY MemberNum ORDER BY StaticDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS EligStateAidCategory,
LAST_VALUE(enr.EligStateAidCategory)OVER (PARTITION BY MemberNum ORDER BY StaticDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS AID_CAT_ROLL_UP,
LAST_VALUE(enr.EligFinanceAidCategoryRollup)OVER (PARTITION BY MemberNum ORDER BY StaticDate ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS EligFinanceAidCategoryRollup,
SUM(enr.MemberMonth) OVER (PARTITION BY MemberNum) AS TOTAL_MEMBER_MONTHS
FROM dv_Enrollment enr
WHERE enr.StaticDate BETWEEN '01-JUN-2016' AND '30-JUN-2016'
)A
So, I've looked around and found that you can use the TOP (2147483647) hack, so I tried changing the first line to:
SELECT distinct TOP (2147483647) enr.MemberNum,
But I'm still getting the same error. All the other ways I've thought of also require an ORDER BY (using DENSE RANK, etc).
In both databases, I would write this like:
FROM (SELECT enr.MemberNum,
             (ISNULL(enr.MemberFirstName, '') + ' ' + ISNULL(enr.MemberLastName, '')) AS MEMBER_NAME,
             enr.MemberBirthDate as DOB,
             enr.MemberGender as Gender,
             MAX(CASE WHEN seqnum = 1 THEN enr.MemberCurrentAge END) AS AGE,
             MAX(CASE WHEN seqnum = 1 THEN enr.EligStateAidCategory END) AS EligStateAidCategory,
             MAX(CASE WHEN seqnum = 1 THEN enr.EligStateAidCategory END) AS AID_CAT_ROLL_UP,
             MAX(CASE WHEN seqnum = 1 THEN enr.EligFinanceAidCategoryRollup END) AS EligFinanceAidCategoryRollup,
             SUM(enr.MemberMonth) as TOTAL_MEMBER_MONTHS
      FROM (SELECT enr.*,
                   ROW_NUMBER() OVER (PARTITION BY MemberNum ORDER BY StaticDate DESC) as seqnum
            FROM dv_Enrollment enr
            -- filter before numbering, so seqnum = 1 is the latest row within the date range
            WHERE enr.StaticDate >= DATE '2016-06-01' AND -- DATE not needed in SQL Server
                  enr.StaticDate < DATE '2016-07-01' -- DATE not needed in SQL Server
           ) enr
      GROUP BY enr.MemberNum, enr.MemberFirstName, enr.MemberLastName,
               enr.MemberBirthDate, enr.MemberGender
     ) A
Why the changes?
The date changes are just to be careful about time components on the date. BETWEEN with date/times is a bad habit, because it can silently exclude rows on the last day that carry a time component, which leads to incorrect results and hard-to-debug errors.
I simply do not like using SELECT DISTINCT to mean GROUP BY. It is clever to use it with window functions (and necessary with LAST_VALUE()); but I think the code ends up being misleading.
I find the use of the subquery with seqnum to make it clear that the four "last value" variables are all pulling data from the last row.
In addition, if the sort is not stable (that is, the key is not unique), seqnum guarantees that the values are all from the same row. last_value() does not.
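The seqnum pattern can be sketched on a toy table. This is an illustration under assumptions: the enrollment table, its lowercase column names, and the rows are invented, and Python's sqlite3 module stands in for SQL Server or Oracle. The date filter sits inside the subquery so that seqnum = 1 is the latest row within the range, and the range is half-open, as recommended above.

```python
import sqlite3

# Invented enrollment rows: member, static_date, age, aid_cat, months.
ENROLLMENT = [
    (1, "2016-06-01", 30, "Z", 1),
    (1, "2016-06-15", 31, "B", 1),
    (1, "2016-07-05", 32, "Q", 1),  # outside June, must be ignored
    (2, "2016-06-10", 40, "C", 1),
]

QUERY = """
select member,
       max(case when seqnum = 1 then age end) as age,
       max(case when seqnum = 1 then aid_cat end) as aid_cat,
       sum(months) as total_member_months
from (select e.*,
             row_number() over (partition by member
                                order by static_date desc) as seqnum
      from enrollment e
      where static_date >= '2016-06-01'
        and static_date < '2016-07-01'
     ) e
group by member
order by member
"""


def latest_per_member():
    """Return the most recent June attributes and the June month total."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table enrollment (member integer, static_date text, "
        "age integer, aid_cat text, months integer)"
    )
    conn.executemany("insert into enrollment values (?, ?, ?, ?, ?)",
                     ENROLLMENT)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in latest_per_member():
        print(row)
```

For member 1 a plain max(aid_cat) would return 'Z'; tying both columns to seqnum = 1 returns age 31 and category 'B', which come from the same (latest June) row.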
Switch this over to an aggregate subquery and cross apply() and see what happens.
select
e.MemberNum
, e.MemberName
, e.DOB
, e.Gender
, x.MemberCurrentAge
, x.EligStateAidCategory
, x.EligFinanceAidCategoryRollup
, x.MemberMonth
, e.Total_Member_Months
from (
select
enr.MemberNum
, MemberName = isnull(enr.MemberFirstName+' ', '') + isnull(enr.MemberLastName, '')
, DOB = enr.MemberBirthDate
, Gender = enr.MemberGender
/* This sounds like a weird thing to sum */
, Total_Member_Months = sum(enr.MemberMonth)
from dv_Enrollment enr
group by
enr.MemberNum
, isnull(enr.MemberFirstName+' ', '') + isnull(enr.MemberLastName, '')
, enr.MemberBirthDate
, enr.MemberGender
) as e
/* cross apply() is like an inner join
, use outer apply() for something like a left join */
cross apply (
select top 1
i.MemberCurrentAge
, i.EligStateAidCategory
, i.EligFinanceAidCategoryRollup
, i.MemberMonth
from dv_Enrollment as i
where i.MemberNum = e.MemberNum
and i.StaticDate >= '20160601'
and i.StaticDate <= '20160630'
order by i.StaticDate desc -- descending for most recent
) as x