Hive Windowing: distinct results on partition

Hello, I was learning the WINDOWING functionality of Hive and came across a problem.
I was trying to find the number of customers in a month:
my_table:
date_in_out: date of acquisition
rate_plan_name: string
stock: int
incomers: int
I partition on 3 variables: the year and month of acquisition, and rate_plan_name.
SELECT (first_value(stock) OVER w + sum(incomers) OVER w) AS stock_monthly,
year(date_in_out) AS year_in,
month(date_in_out) AS month_in,
rate_plan_name
FROM my_table
WINDOW w AS (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out) ORDER BY date_in_out ASC);
When I run it, I get different stock_monthly values, whereas year_in, month_in, and rate_plan_name are the same in my dataset.
My question is: why is this value different? I would expect it to be the same here.

With an ORDER BY date_in_out in the window specification, the default frame runs from the start of the partition to the current row, so the sum is computed as a running total for every row. If you need it aggregated at a year/month level, use
WINDOW w AS (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out))
But note that first_value still needs an ORDER BY.
I think you are looking for:
SELECT first_value(stock) OVER(w ORDER BY date_in_out) + sum(incomers) OVER w AS stock_monthly,
year(date_in_out) AS year_in,
month(date_in_out) AS month_in,
rate_plan_name
FROM my_table
WINDOW w AS (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out))
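For clarity, here is the same query with both windows written out in full and the frames made explicit (a sketch, not tested against your data). It shows why the ORDER BY turned the sum into a running total: with an ORDER BY, the frame defaults to everything from the partition start up to the current row, while without one it is the entire partition.
SELECT first_value(stock) OVER (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out)
                                ORDER BY date_in_out ASC) -- frame defaults to partition start .. current row
     + sum(incomers) OVER (PARTITION BY rate_plan_name, year(date_in_out), month(date_in_out)
                           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) -- entire partition
       AS stock_monthly,
       year(date_in_out) AS year_in,
       month(date_in_out) AS month_in,
       rate_plan_name
FROM my_table;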

Related

Count half of rest of a partition by from position

I'm trying to achieve the following results.
Right now, the group comes from
SUM(CASE WHEN seqnum <= (0.5 * seqnum_rev) THEN i.[P&L] END) OVER(PARTITION BY i.bracket_label ORDER BY i.event_id) AS [P&L 50%],
For each row, I need it to count the total number of rows from that position to the end of the partition (seq_inv) and sum the P&L amounts for only the first half of them, starting from that position.
For example, when seq = 2, seq_inv = 13; half of it is 6, so I need to sum the following 6 positions from seq = 2.
When seq = 4 there are 11 positions till the end (seq_inv = 11), so half is 5, so I want to sum 5 positions from seq = 4.
I hope this makes sense. I'm trying to come up with a rule that can adapt to my case, since the partition is what gives me the numbers that need to be summed.
I was also wondering whether there is something like a partition by top 50%, but I guess that doesn't exist.
I have the advantage that I've helped him before and have a little extra context.
That context is that this is just the later stage of a very long chain of common table expressions. That means self-joins and/or correlated sub-queries are unfortunately expensive.
Preferably, this should be answerable using window functions, as the data set is already available in the appropriate ordering and partitioning.
My reading is this...
The SUM(5:9) (meaning the sum of rows 5 to 9, inclusive) is equal to SUM(5:end) - SUM(10:end).
That leads me to this...
WITH
cumulative AS
(
SELECT
*,
SUM([P&L]) OVER (PARTITION BY bracket_label ORDER BY event_id DESC) AS cumulative_p_and_l
FROM
data
)
SELECT
*,
cumulative_p_and_l - LEAD(cumulative_p_and_l, seq_inv/2, 0) OVER (PARTITION BY bracket_label ORDER BY event_id) AS p_and_l_50_perc,
cumulative_p_and_l - LEAD(cumulative_p_and_l, seq_inv/4, 0) OVER (PARTITION BY bracket_label ORDER BY event_id) AS p_and_l_25_perc
FROM
cumulative
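To see why subtracting the LEAD value works, here is a tiny worked check (the numbers are invented for illustration, not taken from the question):
-- [P&L] in ascending event_id order:        10, 20, 30, 40
-- cumulative_p_and_l (summed descending):  100, 90, 70, 40
-- SUM of rows 2..3 = 20 + 30 = 50
--                  = cumulative_p_and_l at row 2 - cumulative_p_and_l at row 4
--                  = 90 - 40, i.e. cum - LEAD(cum, 2) evaluated at row 2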
NOTE: Using spaces, &, and % in column names is horrendous, don't do it ;)
EDIT: Corrected the ORDER BY in the cumulative sum.
I don't think that window functions can do what you want. You could use a correlated subquery instead, with the following logic:
select
t.*,
(
select sum(t1.[P&L])
from mytable t1
where t1.seq - t.seq between 0 and t.seq_inv/2
) [P&L 50%]
from mytable t

How to find neighboring records in an SQL table in terms of month and year?

Please help me to optimize my SQL query.
I have a table with the fields: date, commodity_id, exp_month_id, exp_year, price, where the first 4 fields are the primary key. The months are designated with alphabetically ordered letters: e.g. F (for Jan), G (for Feb), H (for Mar), etc. Thus the letter of a month more distant from January is larger than the letter of a less distant month (F < G < H < ...). Some commodity_ids have all 12 months in the table, some only 5 or 3, and these are constant across years.
I need to calculate the difference between the prices (the gradient) of neighboring records in terms of (exp_month_id, exp_year). As a first step, I want to define for every couple (exp_month_id, exp_year) the valid couple (next_month_id, next_year). The main problem here is that if the current exp_month_id is the last in the year, then next_year = exp_year + 1 and next_month_id should be the first one in the year.
I have written the following query to do the job:
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id
FROM futures
ORDER BY exp_month_id
)
SELECT DISTINCT f.commodity_id,
f.exp_month_id,
f.exp_year,
(
WITH [temp] AS (
SELECT exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id
)
SELECT exp_month_id
FROM [temp]
WHERE exp_month_id > f.exp_month_id
UNION ALL
SELECT exp_month_id
FROM [temp]
LIMIT 1
)
AS next_month_id,
(
SELECT CASE WHEN EXISTS (
SELECT commodity_id,
exp_month_id
FROM trading_months
WHERE commodity_id = f.commodity_id AND
exp_month_id > f.exp_month_id
LIMIT 1
)
THEN f.exp_year ELSE f.exp_year + 1 END
)
AS next_year
FROM futures AS f
This query serves as a base for a dynamic table (view) which is subsequently used for calculating the gradient. However, the execution of this query takes more than one second and thus the whole process takes minutes. I wonder if you could help me optimizing the query.
Note: The following requires SQLite 3.25 or newer for window function support.
Lack of sample data (preferably as CREATE TABLE and INSERT statements for easy importing) and expected results makes this hard to test, but if your end goal is computing the difference in prices between expiration dates (making your question a bit of an XY problem), maybe something like:
SELECT date, commodity_id, price, exp_year, exp_month_id
, price - lag(price, 1) OVER (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id) AS "change from last price"
FROM futures;
Thanks to @Shawn's hint to use window functions, I could rewrite the query in a much shorter form:
CREATE VIEW "futures_nextmonths_win" AS
WITH trading_months AS (
SELECT DISTINCT commodity_id,
exp_month_id,
exp_year
FROM futures)
SELECT commodity_id,
exp_month_id,
exp_year,
lead(exp_month_id) OVER w AS next_month_id,
lead(exp_year) OVER w AS next_year
FROM trading_months
WINDOW w AS (PARTITION BY commodity_id ORDER BY exp_year, exp_month_id);
which is also slightly faster than the original one.
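From there, the gradient itself could be computed by joining the view back to futures. A rough sketch (it assumes the two prices being compared were observed on the same date; adjust the join to your actual matching rule):
SELECT f.date,
       f.commodity_id,
       f.exp_month_id,
       f.exp_year,
       n.price - f.price AS gradient
FROM futures AS f
JOIN futures_nextmonths_win AS v
  ON v.commodity_id = f.commodity_id
 AND v.exp_month_id = f.exp_month_id
 AND v.exp_year = f.exp_year
JOIN futures AS n
  ON n.commodity_id = f.commodity_id
 AND n.exp_month_id = v.next_month_id
 AND n.exp_year = v.next_year
 AND n.date = f.date;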

SPARK SQL Equivalent of Qualify + Row_number statements

Does anyone know the best way for Apache Spark SQL to achieve the same results as the standard SQL qualify() + rnk or row_number statements?
For example:
I have a Spark Dataframe called statement_data with 12 monthly records each for 100 unique account_numbers, therefore 1200 records in total
Each monthly record has a field called "statement_date" that can be used for determining the most recent record
I want my final result to be a new Spark Dataframe with the 3 most recent records (as determined by statement_date descending) for each of the 100 unique account_numbers, therefore 300 final records in total.
In standard Teradata SQL, I can do the following:
select * from statement_data
qualify row_number ()
over(partition by acct_id order by statement_date desc) <= 3
Apache Spark SQL does not have a standalone qualify function that I'm aware of; maybe I'm screwing up the syntax, or can't find documentation that qualify exists.
It is fine if I need to do this in two steps as long as those two steps are:
A select query or alternative method to assign rank/row numbering for each account_number's records
A select query where I'm selecting all records with rank <= 3 (i.e. choose 1st, 2nd, and 3rd most recent records).
EDIT 1 - 7/23 2:09pm:
The initial solution provided by zero323 was not working for me in Spark 1.4.1 with the Spark SQL 1.4.1 dependency installed.
EDIT 2 - 7/23 3:24pm:
It turns out the error was related to using SQLContext objects for my query instead of HiveContext. I am now able to run the solution below correctly after adding the following code to create and use a HiveContext:
final JavaSparkContext sc2;
final HiveContext hc2;
DataFrame df;
hc2 = TestHive$.MODULE$;
sc2 = new JavaSparkContext(hc2.sparkContext());
....
// Initial Spark/SQL contexts to set up Dataframes
SparkConf conf = new SparkConf().setAppName("Statement Test");
...
DataFrame stmtSummary =
hc2.sql("SELECT * FROM (SELECT acct_id, stmt_end_dt, stmt_curr_bal, row_number() over (partition by acct_id order by stmt_curr_bal DESC) rank_num FROM stmt_data) tmp WHERE rank_num <= 3");
There is no qualify (it is usually useful to check the parser source), but you can use a subquery like this:
SELECT * FROM (
SELECT *, row_number() OVER (
PARTITION BY acct_id ORDER BY statement_date DESC
) rank FROM df
) tmp WHERE rank <= 3
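If you prefer the two explicit steps described in the question, the same logic splits cleanly. A sketch (CREATE TEMPORARY VIEW needs Spark 2.0+; the view name ranked is made up, and statement_data is assumed to be registered as a table):
CREATE TEMPORARY VIEW ranked AS
SELECT *, row_number() OVER (
  PARTITION BY acct_id ORDER BY statement_date DESC
) AS rank
FROM statement_data;

SELECT * FROM ranked WHERE rank <= 3;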
See also SPARK : failure: ``union'' expected but `(' found

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example, with the following sample data:
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the WHERE clause, but I can't even get a start.
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
Uses the aggregate function min() as a window function to get the minimum place of the last three rows plus the current one.
The then-trivial check for "no win" (best > 1) has to be done at the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or subselect for a condition on the result of a window function.
Details about window function calls are in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers; I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant, unlike my previous effort (obsolete: http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
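To consider more or less history, only the frame bound changes; e.g. for the previous 4 events plus the current row (a sketch):
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 4 PRECEDING AND CURRENT ROW)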
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.

SQL max multiple columns

I am trying to display the maximum of a specific value, along with the corresponding timestamp for that value. I have the command working properly, but unfortunately, if the value is at its maximum for more than one time period, it displays all of the timestamps. This can be cumbersome with multiple targets as well. Here is what I am using now:
select target_name,value,collection_timestamp
from (select target_name,value,collection_timestamp,
max(value) over (partition by target_name) max_value
from mgmt$metric_details
where target_type='host' and metric_name='TotalDiskUsage'
and column_label='Total Disk Utilized (%) (across all local filesystems)'
)
where value=max_value;
I want to use the same kind of command (trying to avoid inner joins etc. because of the lack of bandwidth), but only show one max value/timestamp per target_name. Is there a way to work a GROUP BY or a limit into this without breaking it? I am somewhat unfamiliar with SQL, so this is all new territory.
Your query is so close. Instead of doing the max, do a row_number():
select target_name,value,collection_timestamp
from (select target_name,value,collection_timestamp,
row_number() over (partition by target_name order by value desc) as seqnum
from mgmt$metric_details
where target_type='host' and metric_name='TotalDiskUsage'
and column_label='Total Disk Utilized (%) (across all local filesystems)'
)
where seqnum = 1
This orders everything in the partition by value. You want the one largest value, so order by descending value and take the first in the sequence.
Use the ROW_NUMBER() function instead of MAX(), with an appropriate ORDER BY in the window to resolve ties:
select target_name,value,collection_timestamp
from (select target_name,value,collection_timestamp,
ROW_NUMBER() OVER (partition by target_name
ORDER BY value DESC,
collection_timestamp DESC )
AS rn
from mgmt$metric_details
where target_type='host' and metric_name='TotalDiskUsage'
and column_label='Total Disk Utilized (%) (across all local filesystems)'
)
where rn = 1 ;
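As an aside on the design choice: ROW_NUMBER() always returns exactly one row per target_name. If you ever wanted to keep every row that ties on both value and timestamp, swapping in RANK() with the same ORDER BY would return all of them (a sketch of that variant):
select target_name,value,collection_timestamp
from (select target_name,value,collection_timestamp,
             RANK() OVER (partition by target_name
                          ORDER BY value DESC,
                          collection_timestamp DESC) AS rnk
      from mgmt$metric_details
      where target_type='host' and metric_name='TotalDiskUsage'
      and column_label='Total Disk Utilized (%) (across all local filesystems)'
     )
where rnk = 1;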