Please help me understand whether the ntile function differs between Spark and Impala.
Consider the same source table for both queries.
Below is the Impala query:
select provider_id, ntile(10) over (order by col asc) as VOL_DECILE from p340B_rpt2_hcp_total_vol_prod_state
Below is the Spark code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, ntile}

val window_spec = Window.orderBy(col("rnk.col").asc)
val p340B_rpt2_hcp_vol_prod_state = p340B_rpt2_hcp_total_vol_prod_state.as("rnk")
  .select(
    col("provider_id"),
    ntile(10).over(window_spec).as("VOL_DECILE")
  )
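For a like-for-like comparison, the same logic can also be expressed in Spark SQL against the same source (a sketch, assuming the table is visible to Spark, e.g. registered as a temporary view under the same name). Both engines implement NTILE with the standard bucketing semantics, so given identical input the deciles should match; the one caveat is that rows tying on col may land in different buckets, since neither engine guarantees a stable order for ties.
-- Sketch: Spark SQL equivalent of the Impala query, for a side-by-side comparison
SELECT provider_id,
       NTILE(10) OVER (ORDER BY col ASC) AS VOL_DECILE
FROM p340B_rpt2_hcp_total_vol_prod_state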
I am trying to get the STDEV of the MCW_NM column, but I want it to be the STDEV of all rows, not per BLADEID group. In Variance_Blade_MCW, however, I need it grouped by BLADEID. I have tried OVER() but I get this error:
Column 'ENG.DBO.MCW_BCL_WEDGE.MCW_NM' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
Can anyone help me? Below is my query.
PS: I am having difficulty explaining the problem, so please bear with me. Let me know if you need any clarification! Thanks a lot!
SELECT
BladeID,
Total_Sigma_MCW = STDEV(MCW_NM) OVER (),
CountD_Blade = COUNT(BLADEID) OVER (),
Variance_Blade_MCW = SQUARE(STDEV(MCW_NM))
FROM
ENG.DBO.MCW_BCL_WEDGE
WHERE
TESTDATE > GETDATE() - 6
GROUP BY
BLADEID
HAVING
COUNT(BladeID) >= 5000
I don't have access to MSSQL at the moment, but this might work. The inner query computes what I think are the aggregates you want per BladeID. The problem is that window functions always return one row for each row in the source, so the outer query flattens this down to one row per BladeID.
SELECT DISTINCT
    BladeID,
    Total_Sigma_MCW = STDEV(MCW_NM) OVER (PARTITION BY 1),
    Variance_Blade_MCW,
    CountD_Blade
FROM
(
    SELECT
        BladeID,
        MCW_NM,
        CountD_Blade = COUNT(*) OVER (PARTITION BY BladeID),
        Variance_Blade_MCW = SQUARE(STDEV(MCW_NM) OVER (PARTITION BY BladeID))
    FROM
        ENG.DBO.MCW_BCL_WEDGE
    WHERE
        TESTDATE > GETDATE() - 6
) q
WHERE CountD_Blade >= 5000
It may be more efficient to create two queries, one grouped by BladeID and one over the full dataset, and then join them.
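A rough sketch of that two-query approach (untested, and assuming the same table, date filter, and row-count threshold as above):
SELECT
    g.BladeID,
    t.Total_Sigma_MCW,
    SQUARE(g.Sigma_Blade_MCW) AS Variance_Blade_MCW,
    g.CountD_Blade
FROM
(
    -- Per-blade aggregates
    SELECT
        BladeID,
        STDEV(MCW_NM) AS Sigma_Blade_MCW,
        COUNT(BladeID) AS CountD_Blade
    FROM ENG.DBO.MCW_BCL_WEDGE
    WHERE TESTDATE > GETDATE() - 6
    GROUP BY BladeID
    HAVING COUNT(BladeID) >= 5000
) g
CROSS JOIN
(
    -- One-row aggregate over the full (filtered) dataset
    SELECT STDEV(MCW_NM) AS Total_Sigma_MCW
    FROM ENG.DBO.MCW_BCL_WEDGE
    WHERE TESTDATE > GETDATE() - 6
) t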
We have a SELECT statement in production that takes quite a lot of time.
The current query uses ROW_NUMBER, a window function.
I am trying to rewrite the query and test it. Since it is an ORC table, my assumption is that fetching aggregate values instead of using ROW_NUMBER may help reduce the execution time.
Is something like this possible? Let me know if I am missing anything.
Sorry, I am trying to learn, so please bear with my mistakes, if any.
I tried to rewrite the query as shown below.
Original query:
SELECT
    Q.id,
    Q.crt_ts,
    Q.upd_ts,
    Q.exp_ts,
    Q.biz_effdt
FROM
(
    SELECT u.id, u.crt_ts, u.upd_ts, u.exp_ts, u.biz_effdt,
           ROW_NUMBER() OVER (PARTITION BY u.id ORDER BY u.crt_ts DESC) AS ROW_N
    FROM (
        SELECT cust_prd.id, cust_prd.crt_ts, cust_prd.upd_ts, cust_prd.exp_ts, cust_prd.biz_effdt
        FROM MSTR_CORE.cust_prd
        WHERE biz_effdt IN ( SELECT MAX(cust_prd.biz_effdt) FROM MSTR_CORE.cust_prd )
    ) U
) Q
WHERE Q.ROW_N = 1
My attempt:
SELECT cust_prd.id, cust_prd.crt_ts, cust_prd.upd_ts, cust_prd.exp_ts, cust_prd.biz_effdt FROM MSTR_CORE.cust_prd
WHERE biz_effdt IN ( SELECT MAX(cust_prd.biz_effdt) FROM MSTR_CORE.cust_prd )
having cust_prd.crt_ts = max (cust_prd.crt_ts)
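The HAVING clause in that attempt will not work as written, because crt_ts is neither grouped nor aggregated. One way to express the aggregate idea without ROW_NUMBER is a self-join on the per-id maximum crt_ts; note this is only equivalent to the original query if (id, crt_ts) is unique within the latest biz_effdt (a sketch, untested):
SELECT c.id, c.crt_ts, c.upd_ts, c.exp_ts, c.biz_effdt
FROM MSTR_CORE.cust_prd c
JOIN (
    -- latest crt_ts per id, restricted to the latest biz_effdt
    SELECT id, MAX(crt_ts) AS max_crt_ts
    FROM MSTR_CORE.cust_prd
    WHERE biz_effdt IN ( SELECT MAX(biz_effdt) FROM MSTR_CORE.cust_prd )
    GROUP BY id
) m
  ON c.id = m.id AND c.crt_ts = m.max_crt_ts
WHERE c.biz_effdt IN ( SELECT MAX(biz_effdt) FROM MSTR_CORE.cust_prd )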
I am using the following query to remove all rows where volume is in the top 1%. I am structuring my query after the following Stack Overflow question: Select top 10 percent, also bottom percent in SQL Server.
However, my query is generating an error. I was hoping for some input on what has to be changed.
CREATE TABLE TEST AS
WITH PERCENTILE AS
(
SELECT SEGMENT,
VOLUME,
OUTLIER_VOL = NTILE(100) OVER (ORDER BY VOLUME)
FROM OFFER_PERIOD_SEGMENT
)
SELECT *
FROM PERCENTILE
WHERE OUTLIER_VOL NOT IN (99,100)
I am receiving the following error:
CLI prepare error: [Oracle][ODBC][Ora]ORA-00923: FROM keyword not found where expected
Try to change
OUTLIER_VOL = NTILE(100) OVER (ORDER BY VOLUME)
to:
NTILE(100) OVER (ORDER BY VOLUME) OUTLIER_VOL
That <column alias> = <value> syntax is specific to SQL Server, I believe.
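Applied to the statement above, that would look something like this (a sketch, untested):
CREATE TABLE TEST AS
WITH PERCENTILE AS
(
    SELECT SEGMENT,
           VOLUME,
           NTILE(100) OVER (ORDER BY VOLUME) OUTLIER_VOL
    FROM OFFER_PERIOD_SEGMENT
)
SELECT *
FROM PERCENTILE
WHERE OUTLIER_VOL NOT IN (99, 100)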
If someone stumbles upon this in the future, I want to add that I was calculating the percentiles incorrectly. The code below calculates the percentiles, whereas in the scenario above you are creating 100 equal-sized buckets into which your data is placed:
CREATE TABLE TEST AS
WITH PERCENTILE AS
(
SELECT SEGMENT,
VOLUME,
PERCENT_RANK() OVER (ORDER BY VOLUME) AS OUTLIER_VOL
FROM OFFER_PERIOD_SEGMENT
)
SELECT *
FROM PERCENTILE
WHERE OUTLIER_VOL < 0.99
I realize that I am often tempted to do the following:
import org.apache.spark.sql.functions.mean

val df_mean = df.groupBy("category").agg(mean("column1") as "mean")
val df_with_mean = df.join(df_mean, Seq("category"))
So basically I want all the rows of my initial DataFrame to have a column that holds the mean value of their category.
Is this the correct way to achieve that? Any better ideas?
It is correct (it yields the expected results) and idiomatic. The DataFrame DSL is just a wrapper around SQL, and the standard SQL solution can be expressed as follows:
WITH means AS (SELECT category, avg(column1) AS mean FROM df GROUP BY category)
SELECT df.category, df.column1, means.mean
FROM df JOIN means ON df.category = means.category
You can easily check that this generates the same execution plan as df_with_mean.
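For example, with df registered as a temporary table named df, you could compare the plan printed by the SQL statement below with the output of df_with_mean.explain (a sketch):
EXPLAIN
WITH means AS (SELECT category, avg(column1) AS mean FROM df GROUP BY category)
SELECT df.category, df.column1, means.mean
FROM df JOIN means ON df.category = means.category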
It is possible to express the same logic using window functions:
SELECT *, avg(column1) OVER w AS mean FROM df
WINDOW w AS (
PARTITION BY category
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
with the DSL equivalent:
val w = Window.partitionBy($"category").rowsBetween(Long.MinValue, Long.MaxValue)
df.select($"*", avg($"column1").over(w).alias("mean"))
but in general Spark doesn't perform particularly well with an UNBOUNDED FOLLOWING frame.
Does anyone know the best way for Apache Spark SQL to achieve the same results as the standard SQL qualify() + rnk or row_number statements?
For example:
I have a Spark DataFrame called statement_data with 12 monthly records for each of 100 unique account_numbers, therefore 1,200 records in total.
Each monthly record has a field called "statement_date" that can be used to determine the most recent record.
I want my final result to be a new Spark DataFrame with the 3 most recent records (as determined by statement_date descending) for each of the 100 unique account_numbers, therefore 300 records in total.
In standard Teradata SQL, I can do the following:
select * from statement_data
qualify row_number ()
over(partition by acct_id order by statement_date desc) <= 3
Apache Spark SQL does not have a standalone qualify function that I'm aware of, maybe I'm screwing up the syntax or can't find documentation that qualify exists.
It is fine if I need to do this in two steps as long as those two steps are:
A select query or alternative method to assign rank/row numbering for each account_number's records
A select query where I'm selecting all records with rank <= 3 (i.e. choose 1st, 2nd, and 3rd most recent records).
EDIT 1 - 7/23 2:09pm:
The initial solution provided by zero323 was not working for me in Spark 1.4.1 with the Spark SQL 1.4.1 dependency installed.
EDIT 2 - 7/23 3:24pm:
It turns out the error was related to using a SQLContext object for my query instead of a HiveContext. I am now able to run the solution below correctly after adding the following code to create and use a HiveContext:
final JavaSparkContext sc2;
final HiveContext hc2;
DataFrame df;
hc2 = TestHive$.MODULE$;
sc2 = new JavaSparkContext(hc2.sparkContext());
....
// Initial Spark/SQL contexts to set up Dataframes
SparkConf conf = new SparkConf().setAppName("Statement Test");
...
DataFrame stmtSummary =
hc2.sql("SELECT * FROM (SELECT acct_id, stmt_end_dt, stmt_curr_bal, row_number() over (partition by acct_id order by stmt_curr_bal DESC) rank_num FROM stmt_data) tmp WHERE rank_num <= 3");
There is no qualify (it is usually useful to check the parser source), but you can use a subquery like this:
SELECT * FROM (
SELECT *, row_number() OVER (
PARTITION BY acct_id ORDER BY statement_date DESC
) rank FROM df
) tmp WHERE rank <= 3
See also SPARK : failure: ``union'' expected but `(' found