Does anyone know the best way for Apache Spark SQL to achieve the same results as the standard SQL qualify() + rnk or row_number statements?
For example:
I have a Spark Dataframe called statement_data with 12 monthly records each for 100 unique account_numbers, therefore 1200 records in total
Each monthly record has a field called "statement_date" that can be used for determining the most recent record
I want my final result to be a new Spark Dataframe with the 3 most recent records (as determined by statement_date descending) for each of the 100 unique account_numbers, therefore 300 final records in total.
In standard Teradata SQL, I can do the following:
select * from statement_data
qualify row_number ()
over(partition by acct_id order by statement_date desc) <= 3
Apache Spark SQL does not have a standalone qualify function that I'm aware of, maybe I'm screwing up the syntax or can't find documentation that qualify exists.
It is fine if I need to do this in two steps as long as those two steps are:
A select query or alternative method to assign rank/row numbering for each account_number's records
A select query where I'm selecting all records with rank <= 3 (i.e. choose 1st, 2nd, and 3rd most recent records).
EDIT 1 - 7/23 2:09pm:
The initial solution provided by zero323 was not working for me in Spark 1.4.1 with Spark SQL 1.4.1 dependency installed.
EDIT 2 - 7/23 3:24pm:
It turns out the error was related to using SQL Context objects for my query instead of Hive Context. I am now able to run the below solution correctly after adding the following code to create and use a Hive Context:
final JavaSparkContext sc2;
final HiveContext hc2;
DataFrame df;
hc2 = TestHive$.MODULE$;
sc2 = new JavaSparkContext(hc2.sparkContext());
....
// Initial Spark/SQL contexts to set up Dataframes
SparkConf conf = new SparkConf().setAppName("Statement Test");
...
DataFrame stmtSummary =
hc2.sql("SELECT * FROM (SELECT acct_id, stmt_end_dt, stmt_curr_bal, row_number() over (partition by acct_id order by stmt_curr_bal DESC) rank_num FROM stmt_data) tmp WHERE rank_num <= 3");
There is no qualify (it is usually useful to check parser source) but you can use subquery like this:
SELECT * FROM (
SELECT *, row_number() OVER (
PARTITION BY acct_id ORDER BY statement_date DESC
) rank FROM df
) tmp WHERE rank <= 3
See also SPARK : failure: ``union'' expected but `(' found
Related
I have a table which contains two column with values that are not unique, those values are generated automatically and I have no way to do anything about it, cannot edit the table, db nor make custom functions.
With that in mind I've solved this problem in sql server, but it contains some functions that does not exist in ms-access.
The columns are Volume and ComponentID, here is my code in sql:
with rows as (
select row_number() over (order by volume) as rownum, volume
from test where componentid = 'S3')
select top 10
rowsMinusOne.volume, coalesce(rowsMinusOne.volume - rows.volume,0) as diff
from rows as rowsMinusOne
left outer join rows
on rows.rownum = rowsMinusOne.rownum - 1
Sample data:
58.29168
70.57396
85.67902
97.04888
107.7026
108.2022
108.3975
108.5777
109
109.8944
Expected results:
Volume
diff
58.29168
0
70.57396
12.28228
85.67902
15.10506
97.04888
11.36986
107.7026
10.65368
108.2022
0.4996719
108.3975
0.1952896
108.5777
0.1801834
109
0.4223404
109.8944
0.89431
I have solved the part of the coalesce by replacing it with NZ, I have tryed to use the DCOUNT to solve the row_number (How to show the record number in a MS Access report table?) but I reveive the error that it cannot find the function (I am reading the data by code, that is the only thing I can do).
I also tryed this but, as the answer says I need a column with a unique value which I do not have nor can create Microsoft Access query to duplicate ROW_NUMBER
Consider:
SELECT TOP 10 Table1.ComponentID,
DCount("*","Table1","ComponentID = 'S3' AND Volume<" & [Volume])+1 AS Seq, Table1.Volume,
Nz(Table1.Volume -
(SELECT Top 1 Dup.Volume FROM Table1 AS Dup
WHERE Dup.ComponentID = Table1.ComponentID AND Dup.Volume<Table1.Volume
ORDER BY Volume DESC),0) AS Diff
FROM Table1
WHERE (((Table1.ComponentID)="S3"))
ORDER BY Table1.Volume;
This will likely perform very slowly with large dataset.
Alternative solutions:
build query that calculates difference, use that query as source for a report, use textbox RunningSum property to calculate sequence number
VBA looping through recordset and saving results to a 'temp' table
export to Excel
Have a query that uses double select (with select max) to fetch the row with the latest 'calculation_time' column among multiple rows which can have the same 'patient_set_id'. If there are multiple rows with the same 'patient_set_id', only the row with the latest 'calculation_time' should be retrieved. Calculation time is a date.
So far I've tried this but I'm not really sure if there is any better way for this, maybe using ORDER BY. But I'm very new to sql and need to know which one would be the fastest and more appropriate?
SELECT median from diagnostic_risk_stats WHERE
calculation_time=(SELECT MAX(calculation_time) FROM diagnostic_risk_stats WHERE
patient_set_id = UNHEX(REPLACE('5a9dbfca-74d6-471a-af27-31beb4b53bb2', "-","")));
You can use not exists as follows:
SELECT median
from diagnostic_risk_stats t
WHERE not exists
(Select 1 from diagnostic_risk_stats tt
Where t.patient_set_id = tt.patient_set_id
And tt.calculation_time > t.calculation_time)
And t.patient_set_id = UNHEX(REPLACE('5a9dbfca-74d6-471a-af27-31beb4b53bb2', "-",""));
I need to write a query in sql and I can't do it correctly. I have a table with 7 columns 1st_num, 2nd_num, 3rd_num, opening_Date, Amount, code, cancel_Flag.
For every 1st_num, 2nd_num, 3rd_num I want to take only the record with the min (cancel_flag), and if there's more then 1 row so take the the newest opening Date.
But when I do group by and choose min and max for the relevant fields, I get a mix of the rows, for example:
1. 12,130,45678,2015-01-01,2005,333,0
2. 12,130,45678,2015-01-09,105,313,0
The result will be
:12,130,45678,2015-01-09,2005,333,0
and that mixes the rows into one
Microsoft sql server 2008 . using ssis by visual studio 2008
my code is :
SELECT
1st_num,
2nd_num,
3rd_num,
MAX(opening_date),
MAX (Amount),
code,
MIN(cancel_flag)
FROM do. tablename
GROUP BY
1st_num,
2nd_num,
3rd_num,
code
HAVING COUNT(*) > 1
How do I take the row with the max date or.min cancel flag as it is without mixing values?
I can't really post my code because of security reasons but I'm sure you can help.
thank you,
Oren
It is very difficult like this to answer, because every DBMS has different syntax.
Anyways, for most dbms this should work. Using row_number() function to rank the rows, and take only the first one by our definition (all your conditions):
SELECT * FROM (
SELECT t.*,
ROW_NUMBER() OVER ( PARTITION BY t.1st_num,t.2nd_num,t.3rd_num order by t.cancel_flag asc,t.opening_date desc) as row_num
FROM YourTable t ) as tableTempName
WHERE row_num = 1
Use NOT EXISTS to return a row as long as no other row with same 1st_num, 2nd_num, 3rd_num has a lower cancel_flag value, or same cancel_flag but a higher opening_Date.
select *
from tablename t1
where not exists (select 1 from tablename t2
where t2.1st_num = t1.1st_num
and t2.2nd_num = t1.2nd_num
and t2.3rd_num = t1.3rd_num
and (t2.cancel_flag < t1.cancel_flag
or (t2.cancel_flag = t1.cancel_flag and
t2.opening_Date > t1.opening_Date)))
Core ANSI SQL-99, expected to work with (almost) any dbms.
Can anyone please help me in unstanding below csum function.
What will be the output in each case.
csum(1,1),
csum(1,1) + emp_no
csum(1,emp_no)+emp_no
CSUM is an old deprecated function from V2R3, over 15 years ago. It can always be rewritten using newer Standard SQL compliant syntax.
CSUM(1,1) returns the same as ROW_NUMBER() OVER (ORDER BY 1), a sequence starting with 1.
But you should never use it like that as ORDER BY 1 within a Windowed Aggregate Function is not the same as the final ORDER BY 1 of a SELECT, it's ordering all rows by the same value 1. Teradata calculates those functions in parallel based on the values in PARTITION BY and ORDER BY, this means all rows with the same PARTITION/ORDER data are processed on a single AMP, if there's only a single value one AMP will process all rows, resulting in a totally skewed distribution.
Instead of ORDER BY 1 you should use a column which is more or less unique in best case.
csum(1,emp_no)+emp_no is probably used with another SELECT to get the current maximum value of a column and add the new sequential values to it, i.e. creating your own gap-less sequence numbers.
This is the best way to do it:
SELECT ROW_NUMBER() OVER (ORDER BY column(s)_with_a_low_number_of_rows_per_value)
+ COALESCE((SELECT MAX(seqnum) FROM table),0)
,....
FROM table
I am trying to get the query below to return the TWO lowest PlayedTo results for each PlayerID.
select
x1.PlayerID, x1.RoundID, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and x2.PlayedTo <= x1.PlayedTo
) <3
order by PlayerID, PlayedTo, RoundID;
Unfortunately at the moment it doesn't return a result when there is a tie for one of the lowest scores. A copy of the dataset and code is here http://sqlfiddle.com/#!3/4a9fc/13.
PlayerID 47 has only one result returned as there are two different RoundID's that are tied for the second lowest PlayedTo. For what I am trying to calculate it doesn't matter which of these two it returns as I just need to know what the number is but for reporting I ideally need to know the one with the newest date.
One other slight problem with the query is the time it takes to run. It takes about 2 minutes in Access to run through the 83 records but it will need to run on about 1000 records when the database is fully up and running.
Any help will be much appreciated.
Resolve the tie by adding DatePlayed to your internal sorting (you wanted the one with the newest date anyway):
select
x1.PlayerID, x1.RoundID
, x1.PlayedTo
from P_7to8Calcs as x1
where
(
select count(*)
from P_7to8Calcs as x2
where x2.PlayerID = x1.PlayerID
and (x2.PlayedTo < x1.PlayedTo
or x2.PlayedTo = x1.PlayedTo
and x2.DatePlayed >= x1.DatePlayed
)
) <3
order by PlayerID, PlayedTo, RoundID;
For performance create an index supporting the join condition. Something like:
create index P_7to8Calcs__PlayerID_RoundID on P_7to8Calcs(PlayerId, PlayedTo);
Note: I used your SQLFiddle as I do not have Acess available here.
Edit: In case the index does not improve performance enough, you might want to try the following query using window functions (which avoids nested sub-query). It works in your SQLFiddle but I am not sure if this is supported by Access.
select x1.PlayerID, x1.RoundID, x1.PlayedTo
from (
select PlayerID, RoundID, PlayedTo
, RANK() OVER (PARTITION BY PlayerId ORDER BY PlayedTo, DatePlayed DESC) AS Rank
from P_7to8Calcs
) as x1
where x1.RANK < 3
order by PlayerID, PlayedTo, RoundID;
See OVER clause and Ranking Functions for documentation.