Spark SQL RANK() over ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING fails - apache-spark-sql

I encountered a SQL clause on which Spark SQL behaves differently (a bug?) from other engines (I compared with Hive).
You can copy and paste the following statements into the Hive shell to test.
hive>
CREATE TABLE t (v INT);
INSERT INTO t (v) VALUES (11), (21), (31), (42), (52);
SELECT v % 10 AS d, v, RANK() OVER (PARTITION BY v % 10 ORDER BY v ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS rank FROM t;
hive>
The result shows what we expect.
d  v   rank
1  11  1
1  21  2
1  31  3
2  42  1
2  52  2
However, when testing the following equivalent code in spark-shell,
scala>
Seq(11, 21, 31, 42, 52).toDF("v").createOrReplaceTempView("t")
spark.sql("SELECT v % 10 AS d, v, RANK() OVER (PARTITION BY v % 10 ORDER BY v ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS rank FROM t").show
scala>
we get an exception.
org.apache.spark.sql.AnalysisException: Window Frame ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING must match the required frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2153)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30$$anonfun$applyOrElse$11.applyOrElse(Analyzer.scala:2149)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:258)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:258)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:258)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:249)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30.applyOrElse(Analyzer.scala:2149)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$$anonfun$apply$30.applyOrElse(Analyzer.scala:2148)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$.apply(Analyzer.scala:2148)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame$.apply(Analyzer.scala:2147)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:623)
... 48 elided
This happens in both Spark 2.2.0 and Spark 2.3.0.
Is it a bug? Or some misunderstanding of mine?
P.S. I've also tested the following functions (by replacing RANK() in the clause).
ROW_NUMBER() => same exception
DENSE_RANK() => same exception
CUME_DIST() => same exception
PERCENT_RANK() => same exception
NTILE() => same exception
LEAD(v) => same exception
LAG(v) => same exception
FIRST_VALUE(v) => OK
LAST_VALUE(v) => OK
COUNT(v) => OK
SUM(v) => OK
AVG(v) => OK
MEAN(v) => OK
MIN(v) => OK
MAX(v) => OK
VARIANCE(v) => OK
STDDEV(v) => OK
COLLECT_LIST(v) => OK
COLLECT_SET(v) => OK

My first point is: do you know why you are using ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING at all?
You first have to understand what
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
means: it defines a frame that covers every row of the partition, from the first to the last. Now think about what that frame would mean for ROW_NUMBER()/LEAD/LAG or the other ranking functions listed above. If all you want is a row number or a rank, why specify
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING?
The error says exactly that: the frame you supplied does not match the frame these functions require (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
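As a minimal sketch (mine, not part of the answer above), run against the same temp view t from the question, the query analyzes fine in Spark once the frame clause is dropped, because Spark then applies the frame that rank-like functions require:
-- Hedged sketch, not from the answer above: with the frame clause omitted,
-- Spark resolves RANK() to its required frame
-- (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), so the AnalysisException goes away.
SELECT v % 10 AS d, v,
       RANK() OVER (PARTITION BY v % 10 ORDER BY v) AS rank
FROM t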

Related

Date-wise and bucket capacity-wise serial number

I couldn't give an appropriate title to my problem. Let me explain it through an example.
Suppose I have the following table
INPUT
What I want
First, I want to group the transactions by date (dd-MM-yyyy)
Then I want to create chunks/buckets of at most 2 items and assign a sub_batch_ref_id to each chunk/bucket. All transactions in a chunk/bucket must belong to exactly one date.
The convention for SUB_BATCH_REF_ID is BATCH_REF{global serial number of the chunk/bucket}
A chunk or bucket can contain at most 2 items of the same date
I understand that this could be achieved quite easily in any general-purpose programming language (as opposed to a data-oriented language like SQL), but I don't have that option. The solution can be shaped into the following pseudocode for better understanding:
Pseudocode
// I have the following map (assumed)
Map<Date, List<Transaction>> dateWiseTransactions;
BUCKET_CAPACITY = 2
GLOBAL_SERIAL = 0
for each entry in dateWiseTransactions
LOOP
    GLOBAL_SERIAL = GLOBAL_SERIAL + 1;
    for each transaction in entry.value i.e. List<Transaction>
    LOOP
        // start a new bucket whenever the previous one is full
        // (loopIndex starts from 1)
        if loopIndex > 1 and MOD(loopIndex - 1, BUCKET_CAPACITY) = 0
            GLOBAL_SERIAL = GLOBAL_SERIAL + 1
        end if;
        transaction.SUB_BATCH_REF_ID = CONCAT(transaction.BATCH_REF, GLOBAL_SERIAL)
    END LOOP;
END LOOP;
EXPECTED OUTPUT
What I tried
I tried to partition the transaction data by date first and then assign a row number, but I couldn't get to a solution.
SELECT
T.*,
ROW_NUMBER() over (PARTITION BY TRUNC(INSERT_DATE) ORDER BY TRANSACTION_ID) rn
FROM TRANSACTION T
WHERE BATCH_REF='XYZ'
Any help is much appreciated.
SQL FIDDLE
This is not an optimal solution. Your actual problem is a rather complex graph problem -- because you want each transaction to be used only once across all the dates.
One solution is to assign a "working" date to each transaction. The following does this randomly:
SELECT T.*
FROM (SELECT T.*,
             ROW_NUMBER() OVER (PARTITION BY TRUNC(INSERT_DATE) ORDER BY TRANSACTION_ID) AS rn
      FROM (SELECT T.*,
                   ROW_NUMBER() OVER (PARTITION BY BATCH_REF, TRANSACTION_ID ORDER BY DBMS_RANDOM.RANDOM) AS seqnum
            FROM TRANSACTION T
           ) T
      WHERE BATCH_REF = 'XYZ' AND
            SEQNUM = 1
     ) T
WHERE rn <= 2;
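For the simpler case where each transaction already belongs to exactly one date, here is a minimal sketch of my own (not part of the answer above; it assumes Oracle, a bucket capacity of 2, and the same TRANSACTION table) that turns the per-date row number into the global bucket serial the question asks for:
-- Hedged sketch, assuming each transaction maps to exactly one date:
-- CEIL(rn / 2) buckets rows 1-2, 3-4, 5-6, ... within a date, and
-- DENSE_RANK over (date, bucket) numbers the buckets globally.
SELECT T2.*,
       T2.BATCH_REF || DENSE_RANK() OVER (ORDER BY TRUNC(T2.INSERT_DATE), T2.bucket_in_day) AS sub_batch_ref_id
FROM (SELECT T.*,
             CEIL(ROW_NUMBER() OVER (PARTITION BY TRUNC(INSERT_DATE)
                                     ORDER BY TRANSACTION_ID) / 2) AS bucket_in_day
      FROM TRANSACTION T
      WHERE BATCH_REF = 'XYZ'
     ) T2
ORDER BY TRUNC(T2.INSERT_DATE), T2.bucket_in_day;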

SQL return only the next 1 row of same type after a certain row

I have 4 'Operations': Start, Finish, Available, Unavailable. Every time I see a row where Operation = 'Available', I want to return only the next single row where Operation = 'Start' (while keeping the 'Finish' row for that same ID), until the next row where Operation = 'Available' (at which point I again want to return only the next row where Operation = 'Start').
So starting with this dataset
Time ID Operation
6:34:50 AM 2016544 Finish
6:33:09 AM 2016544 Start
6:32:12 AM 2015289 Finish
6:32:07 AM 2015268 Finish
6:31:53 AM 2015834 Finish
6:31:39 AM 2015539 Finish
6:31:14 AM Available Available
6:31:12 AM Unavailable Unavailable
6:31:02 AM 2015289 Start
6:30:57 AM 2015268 Start
6:30:42 AM 2015834 Start
6:30:28 AM 2015539 Start
6:30:22 AM Available Available
I would like to get to this
Time ID Operation
6:34:50 AM 2016544 Finish
6:33:09 AM 2016544 Start
6:31:39 AM 2015539 Finish
6:31:14 AM Available Available
6:31:12 AM Unavailable Unavailable
6:30:28 AM 2015539 Start
6:30:22 AM Available Available
I don't fully follow the explanation, but your sample data and results suggest that you want the first row of each run of consecutive operations of the same type:
select t.*
from (select t.*, lag(operation) over (order by time) as prev_operation
      from t
     ) t
where prev_operation is null or prev_operation <> operation;
From your desired result I guess you want the last operation of each run of identical operations. Try:
select Time, ID, Operation
from (
    select Time, ID, Operation,
           row_number() over (partition by Operation, grp order by Time desc) rn
    from (
        select *,
               row_number() over (order by time) -
               row_number() over (partition by Operation order by time) grp
        from MyTable
    ) a
) a
where rn = 1
The inner query assigns the same grp value to every row of a run of consecutive identical operations (the usual gaps-and-islands trick); the outer query then keeps the latest row of each run.
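To see why the row-number difference marks a run, here is a small hand-worked illustration on hypothetical rows (mine, not from the answer above), with the rows already ordered by time:
time  Operation   rn_overall  rn_per_operation  grp
t1    Start       1           1                 0
t2    Start       2           2                 0
t3    Available   3           1                 2
t4    Start       4           3                 1
t5    Finish      5           1                 4
t6    Finish      6           2                 4
grp stays constant within each run of consecutive identical operations; because the same grp value can occur for different operations, the outer row_number partitions by Operation and grp together, and rn = 1 then picks the latest row of each run.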

BigQuery query fails the first time and successfully completes the 2nd time

I'm executing the following query.
SELECT properties.os, boundary, user, td,
       SUM(boundary) OVER(ORDER BY rows) AS session
FROM (
  SELECT properties.os, ROW_NUMBER() OVER() AS rows, user, td,
         CASE WHEN td > 1800 THEN 1 ELSE 0 END AS boundary
  FROM (
    SELECT properties.os, t1.properties.distinct_id AS user,
           (t2.properties.time - t1.properties.time) AS td
    FROM (
      SELECT properties.os, properties.distinct_id, properties.time, srlno,
             srlno-1 AS prev_srlno
      FROM (
        SELECT properties.os, properties.distinct_id, properties.time,
               ROW_NUMBER() OVER (PARTITION BY properties.distinct_id
                                  ORDER BY properties.time) AS srlno
        FROM [ziptrips.ziptrips_events]
        WHERE properties.time > 1367916800
          AND properties.time < 1380003200)) AS t1
    JOIN (
      SELECT properties.distinct_id, properties.time, srlno,
             srlno-1 AS prev_srlno
      FROM (
        SELECT properties.distinct_id, properties.time,
               ROW_NUMBER() OVER (PARTITION BY properties.distinct_id
                                  ORDER BY properties.time) AS srlno
        FROM [ziptrips.ziptrips_events]
        WHERE properties.time > 1367916800
          AND properties.time < 1380003200)) AS t2
    ON t1.srlno = t2.prev_srlno
       AND t1.properties.distinct_id = t2.properties.distinct_id
    WHERE (t2.properties.time - t1.properties.time) > 0))
It fails the first time with the following error. However on 2nd run it completes without any issue. I'd appreciate any pointers on what might be causing this.
The error message is:
Query Failed
Error: Field 'properties.os' not found in table '__R2'.
Job ID: job_VWunPesUJVLxWGZsMgpoti14BM4
Thanks,
Navneet
We (the BigQuery team) are in the process of rolling out a new version of the query engine that fixes a number of issues like this one. You likely hit an old version of the query engine and then when you retried, hit the new one. It may take us a day or so with a portion of traffic pointing at the updated version in order to verify there aren't any regressions. Please let us know if you hit this again after 24 hours or so.

Cypher - Issue with Where + Aggregate + With

I am trying to execute the following Cypher query
START b=node:customer_idx(ID = 'ABCD')
MATCH p = b-[r1:LIKES]->stuff, someone_else_too-[r2:LIKES]->stuff
with b,someone_else_too, count(*) as matchingstuffcount
where matchingstuffcount > 1
//with b, someone_else_too, matchingstuffcount, CASE WHEN ...that has r1, r2... END as SortIndex
return someone_else_too, SortIndex
order by SortIndex
The above query works fine, but the moment I uncomment the lower "with" I get the following errors:
Unknown identifier `b`.
Unknown identifier `someone_else_too`.
Unknown identifier `matchingstuffcount`.
Unknown identifier `r1`.
Unknown identifier `r2`.
To get around this, I changed the top with from "with b, someone_else_too, count(*) as matchingstuffcount" to "with b, r1, r2, someone_else_too, count(*) as matchingstuffcount". This messes up my count(*) > 1 condition, as count(*) no longer aggregates properly.
Any workarounds / suggestions for filtering on count(*) > 1 while making sure the CASE WHEN can also be evaluated?
Under neo4j 2.0 via console.neo4j.org I was able to get the following query to work. I tried to mimic the constructs you had, namely the WITH/WHERE/WITH/RETURN sequence. (If I missed something, please let me know!)
START n=node:node_auto_index(name='Neo')
MATCH n-[r:KNOWS|LOVES*]->m
WITH n,COUNT(r) AS cnt,m
WHERE cnt >1
WITH n, cnt, m, CASE WHEN m.name?='Cypher' THEN 1 ELSE 0 END AS isCypher
RETURN n AS Neo, cnt, m, isCypher
ORDER BY cnt
You can update it or change it in that console.

SQL Window Max() function having issues in code

I am writing a window function that is supposed to create a month window and grab only the records that have the max value in the update_flag field within that month window.
I am having issues with my window function: it is still showing all results in the window when it should show only the max value.
I have left my code below. Please help.
SELECT
    gb1.SKU_Id,
    gb1.Warehouse_Code,
    gb1.Period_Start,
    gb1.country,
    tm.c445_month,
    tm.report_date,
    gb1.update_flag,
    MAX(gb1.update_flag) OVER (PARTITION BY tm.yearmonth ORDER BY gb1.update_flag
                               RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS update_window,
    SUM(gb1.TOTAL_NEW_SALES_FORECAST) AS dc_forecast
FROM BAS_E2E_OUTPUT_GLOBAL_FCST gb1
INNER JOIN (
    SELECT
        gb2.SKU_Id,
        gb2.Warehouse_Code,
        gb2.Period_Start,
        gb2.country,
        gb2.update_flag,
        gb2.report_date,
        tm1.week_date,
        tm1.c445_month,
        tm1.yearmonth
    FROM BAS_E2E_OUTPUT_GLOBAL_FCST AS gb2
    LEFT JOIN (
        SELECT DISTINCT(week_date) AS week_date,
               c445_month,
               yearmonth
        FROM "PROD"."INV_PROD"."BAS_445_MONTH_ALIGNMENT"
        GROUP BY c445_month, week_date, yearmonth
    ) AS tm1 ON gb2.report_date = tm1.week_date
    GROUP BY SKU_Id,
             Warehouse_Code,
             Period_Start,
             country,
             update_flag,
             report_date,
             tm1.week_date,
             tm1.c445_month,
             tm1.yearmonth
) AS tm
    ON gb1.report_date = tm.week_date
    AND gb1.SKU_ID = tm.sku_id
    AND gb1.Warehouse_Code = tm.warehouse_code
    AND gb1.Period_Start = tm.period_start
    AND gb1.country = tm.country
GROUP BY
    gb1.SKU_Id,
    gb1.Warehouse_Code,
    gb1.Period_Start,
    gb1.country,
    tm.c445_month,
    tm.yearmonth,
    tm.report_date,
    gb1.update_flag
You are currently using MAX with the window frame defined as every preceding row up to and including the current one, so the max value it returns can legitimately change from record to record. Perhaps you wanted to take the max over the whole partition instead:
MAX(gb1.update_flag) OVER (PARTITION BY tm.yearmonth) AS update_window
By the way, if you really did intend your current window logic, on most SQL dialects the window specification can be simplified by dropping the frame clause:
MAX(gb1.update_flag) OVER (PARTITION BY tm.yearmonth ORDER BY gb1.update_flag) AS update_window
That is, when an ORDER BY is present the default frame runs from unbounded preceding to the current row, so there is no need to spell it out.