I am using the query below to derive outliers from my data. Using DISTINCT creates too much shuffle, and the final tasks take a huge amount of time to complete. Are there any optimizations that can be done to speed it up?
query = """SELECT
DISTINCT NAME,
PERIODICITY,
PERCENTILE(CAST(AMOUNT AS INT), 0.997) OVER(PARTITION BY NAME, PERIODICITY) as OUTLIER_UPPER_THRESHOLD,
CASE
WHEN PERIODICITY = "WEEKLY" THEN 100
WHEN PERIODICITY = "BI_WEEKLY" THEN 200
WHEN PERIODICITY = "MONTHLY" THEN 250
WHEN PERIODICITY = "BI_MONTHLY" THEN 400
WHEN PERIODICITY = "QUARTERLY" THEN 900
ELSE 0
END AS OUTLIER_LOWER_THRESHOLD
FROM base"""
I would suggest rephrasing this so you can filter before aggregating:
SELECT NAME, PERIODICITY, OUTLIER_LOWER_THRESHOLD,
MIN(AMOUNT) AS OUTLIER_UPPER_THRESHOLD
FROM (SELECT NAME, PERIODICITY,
RANK() OVER (PARTITION BY NAME, PERIODICITY ORDER BY AMOUNT) as seqnum,
COUNT(*) OVER (PARTITION BY NAME, PERIODICITY) as cnt,
(CASE . . . END) as OUTLIER_LOWER_THRESHOLD
FROM base
) b
WHERE seqnum >= 0.997 * cnt
GROUP BY NAME, PERIODICITY, OUTLIER_LOWER_THRESHOLD;
Note: RANK() gives duplicate amounts the same (lowest) rank, so some NAME/PERIODICITY pairs may have no row with seqnum >= 0.997 * cnt and will be missing from the results. They can easily be added back in using a LEFT JOIN.
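A minimal sketch of that LEFT JOIN, assuming the filtered query above is available as outliers (the name is only for illustration):
SELECT p.NAME, p.PERIODICITY, o.OUTLIER_LOWER_THRESHOLD, o.OUTLIER_UPPER_THRESHOLD
FROM (SELECT DISTINCT NAME, PERIODICITY FROM base) p   -- every pair present in base
LEFT JOIN outliers o                                    -- pairs that dropped out come back with NULL thresholds
  ON o.NAME = p.NAME
 AND o.PERIODICITY = p.PERIODICITY;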
The easiest way to deal with a large shuffle, independent of what the shuffle is, is to use a larger cluster. It's the easiest way because you don't have to think much about it. Machine time is usually much cheaper than human time refactoring code.
The second-easiest way to deal with a large shuffle, when it is the union of independent parts, is to break it into smaller shuffles. In your case, you could run a separate query for each periodicity, filtering the data down before the shuffle, and then union the results, as sketched below.
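Assuming this is Spark SQL, a sketch of that split; each branch filters to a single periodicity before its (much smaller) shuffle and hard-codes the matching CASE constant:
SELECT DISTINCT NAME, PERIODICITY,
       PERCENTILE(CAST(AMOUNT AS INT), 0.997) OVER (PARTITION BY NAME) AS OUTLIER_UPPER_THRESHOLD,
       100 AS OUTLIER_LOWER_THRESHOLD
FROM base
WHERE PERIODICITY = "WEEKLY"
UNION ALL
SELECT DISTINCT NAME, PERIODICITY,
       PERCENTILE(CAST(AMOUNT AS INT), 0.997) OVER (PARTITION BY NAME) AS OUTLIER_UPPER_THRESHOLD,
       250 AS OUTLIER_LOWER_THRESHOLD
FROM base
WHERE PERIODICITY = "MONTHLY"
-- ... one more branch each for BI_WEEKLY (200), BI_MONTHLY (400) and QUARTERLY (900)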
If the first two approaches are not applicable for some reason, it's time to refactor. In your case you are doing two shuffles: first to compute OUTLIER_UPPER_THRESHOLD, which you attach to every row, and then to deduplicate the rows. In other words, you are doing a manual, two-phase GROUP BY. Why not just group by NAME, PERIODICITY and compute the percentile as an aggregate?
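A sketch of that refactor, assuming Spark SQL, where PERCENTILE can also be used as a plain aggregate (percentile_approx is a cheaper alternative if an approximation is acceptable):
SELECT NAME,
       PERIODICITY,
       PERCENTILE(CAST(AMOUNT AS INT), 0.997) AS OUTLIER_UPPER_THRESHOLD,
       CASE
         WHEN PERIODICITY = "WEEKLY" THEN 100
         WHEN PERIODICITY = "BI_WEEKLY" THEN 200
         WHEN PERIODICITY = "MONTHLY" THEN 250
         WHEN PERIODICITY = "BI_MONTHLY" THEN 400
         WHEN PERIODICITY = "QUARTERLY" THEN 900
         ELSE 0
       END AS OUTLIER_LOWER_THRESHOLD
FROM base
GROUP BY NAME, PERIODICITY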
I am fairly new to PostgreSQL and have a question.
Suppose I have a table with data like the sample below. How do I select ten rows from each of the classes (or sentiments), i.e. 10 positive tweets, 10 extremely positive tweets, 10 negative tweets, etc. (there are around 11 such classes)? The data is too large to do this manually (~1k rows).
My actual goal is to make a 10:90 split of the entire table and add another column with a test-train label, while keeping the class balance intact. So, if there is a better way to do this, please suggest it.
(I do realise that Python could make life easier, but I would like to know whether it can be done directly in SQL; I am also somewhat constrained to SQL at the moment.)
Original Tweet | Sentiment
New Yorkers encounter empty supermarket shelves ... | Extremely Negative
When I couldn't find hand sanitizer at Fred Meyer, I turned ... | Positive
Find out how you can protect yourself and loved ones from ... | Extremely Positive
#Panic buying hits #NewYork City as anxious shoppers stock up on... | Negative
#toiletpaper #dunnypaper #coronavirus #coronavirusaustralia ... | Neutral
Do you remember the last time you paid $2.99 a gallon for ... | Neutral
You can use row_number():
select t.*
from (select t.*,
row_number() over (partition by sentiment order by random()) as seqnum
from t
) t
where seqnum <= 10;
If you wanted to split the data into 90-10 groups randomly, I would probably recommend ntile():
select t.*,
       (case when ntile(10) over (order by random()) <= 9 then 'train' else 'test'
        end) as grp
from t;
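If the 90:10 split should also hold within each class, as the question asks, a variation that tiles within each sentiment might look like this (a sketch, using PostgreSQL's random()):
select t.*,
       (case when ntile(10) over (partition by sentiment order by random()) <= 9
             then 'train' else 'test'
        end) as grp
from t;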
Let's say, for example, I have a db table Jumper for tracking high jumpers. It has three columns of interest: attempt_id, athlete, and result (a boolean for whether the jumper cleared the bar or not).
I want to write a query that will compare all athletes' performance across different attempts, yielding a table with this information: attempt number, number of cleared attempts, total attempts. In other words, what is the chance that an athlete will clear the bar on the x-th attempt?
What is the best way of writing this query? It is trickier than it might seem at first, because you need to determine the attempt number for each athlete before you can compute the final totals.
I would prefer answers be written with Django ORM, but SQL will also be accepted.
Edit: To be clear, I need it to be grouped by attempt, not by athlete. So it would be all athletes' combined x-th attempts.
You could solve it using SQL:
SELECT t.attempt_id,
SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
COUNT(*) AS total
FROM Jumper t
GROUP BY t.attempt_id
EDIT: If attempt_id is just a sequence, and you want to use it to calculate the attempt number for each jumper, you could use this query instead:
SELECT t.attempt_number,
SUM(CASE t.result WHEN TRUE THEN 1 ELSE 0 END) AS cleared,
COUNT(*) AS total
FROM (SELECT s.*,
ROW_NUMBER() OVER(PARTITION BY athlete
ORDER BY attempt_id) AS attempt_number
FROM Jumper s) t
GROUP BY t.attempt_number
This way, you group every first attempt from all athletes, every second attempt from all athletes, and so on...
Given a BigQuery table with some ordering and some numbers, I'd like to compute a "moving maximum" of the numbers, similar to a moving average but for a maximum instead. From "Trying to calculate EMA (exponential moving average) using BigQuery" it seems like the best way to do this is to use LEAD() and then do the aggregation myself. ("Bigquery moving average" suggests essentially a CROSS JOIN, but that seems like it would be quite slow, given the size of the data.)
Ideally, I might be able to just return a single repeated field, rather than 20 individual fields, from the inner query, and then use normal aggregation over the repeated field, but I haven't figured out a way to do that, so I'm stuck with rolling my own aggregation. While this is easy enough for a sum or average, computing the max inline is pretty tricky, and I haven't figured out a good way to do it.
(The examples below are of course somewhat contrived in order to use public datasets. They also do the rolling max over 3 elements, whereas I'd like to do it for around 20. I'm already generating the query programmatically, so making it short isn't a big issue.)
One approach is to do the following:
SELECT word,
(CASE
WHEN word_count >= word_count_1 AND word_count >= word_count_2 THEN word_count
WHEN word_count_1 >= word_count AND word_count_1 >= word_count_2 THEN word_count_1
ELSE word_count_2 END
) AS max_count
FROM (
SELECT word, word_count,
LEAD(word_count, 1) OVER (ORDER BY word) AS word_count_1,
LEAD(word_count, 2) OVER (ORDER BY word) AS word_count_2,
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth'
)
This is O(n^2), but it at least works. I could also do a nested chain of IFs, like this:
SELECT word,
IF(word_count >= word_count_1,
IF(word_count >= word_count_2, word_count, word_count_2),
IF(word_count_1 >= word_count_2, word_count_1, word_count_2)) AS max_count
FROM ...
This is O(n) to evaluate, but the query size is exponential in n, so I don't think it's a good option; certainly it would surpass the BigQuery query size limit for n=20. I could also do n nested queries:
SELECT word,
IF(word_count_2 >= max_count, word_count_2, max_count) AS max_count
FROM (
SELECT word,
IF(word_count_1 >= word_count, word_count_1, word_count) AS max_count
FROM ...
)
It seems like doing 20 nested queries might not be a great idea performance-wise, though.
Is there a good way to do this kind of query? If not, am I correct that for n around 20, the first is the least bad?
A trick I'm using for rolling windows: CROSS JOIN with a table of numbers. In this case, to get a moving window of 3 years, I cross join with the numbers 0, 1, 2. Then you can create an id for each group (ending_at_year = year - i) and group by that.
SELECT ending_at_year, MAX(mean_temp) max_temp, COUNT(DISTINCT year) c
FROM
(
SELECT mean_temp, year-i ending_at_year, year
FROM [publicdata:samples.gsod] a
CROSS JOIN
(SELECT i FROM [fh-bigquery:public_dump.numbers_255] WHERE i<3) b
WHERE station_number=722860
)
GROUP BY ending_at_year
HAVING c=3
ORDER BY ending_at_year;
I have another way to do what you are trying to achieve; see the query below:
SELECT word, max(words)
FROM
(SELECT word,
word_count AS words
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth'),
(SELECT word,
LEAD(word_count, 1) OVER (ORDER BY word) AS words
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth'),
(SELECT word,
LEAD(word_count, 2) OVER (ORDER BY word) AS words
FROM [publicdata:samples.shakespeare]
WHERE corpus = 'macbeth')
group by word order by word
You can try it and compare performance with your approach (I haven't benchmarked it myself).
There's an example of creating a moving average using a window function in the docs here.
Quoting:
The following example calculates a moving average of the values in the current row and the row preceding it. The window frame comprises two rows that move with the current row.
#legacySQL
SELECT
name,
value,
AVG(value)
OVER (ORDER BY value
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)
AS MovingAverage
FROM
(SELECT "a" AS name, 0 AS value),
(SELECT "b" AS name, 1 AS value),
(SELECT "c" AS name, 2 AS value),
(SELECT "d" AS name, 3 AS value),
(SELECT "e" AS name, 4 AS value);
Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the WHERE clause, but I can't even get started.
I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
->SQLfiddle
Uses the aggregate function min() as a window function to get the minimum place of the last three rows plus the current one.
The then-trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or sub-select to put a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers; I substituted day for your date.
This assumes at most one competition per day; otherwise you have to define how to deal with peers in the timeline, or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant, unlike my previous effort (obsolete: http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n attempts for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs. @Andomar's query when executed on a dataset of non-trivial size; they're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.
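For reference, a sketch of that index (the index name is arbitrary):
CREATE INDEX table1_athlete_date ON Table1 (athlete, "date");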
; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.
That's my current query; it works, but it is slow:
SELECT row, MIN(flg) ||' to ' ||Max (flg) as xyz , avg(amt_won), count(*)
FROM(
SELECT (ROW_NUMBER() OVER (ORDER BY flg))*100/
(SELECT count(*)+100 as temprow FROM temporary_six_max) as row, flg, amt_won
FROM temporary_six_max
JOIN (
SELECT id_player AS pid, avg(flg_vpip::int) AS flg
FROM temporary_six_max
GROUP BY id_player
) AS auxtable
ON pid = id_player
) as auxtable2
group by 1
order by 1;
I am grouping into a fixed (or almost fixed) count of 100 ranges, ordered by avg(flg_vpip) grouped by id_player.
Here I've pasted the results in case it may help to understand:
https://spreadsheets0.google.com/ccc?key=tFVsxkWVn4fMWYBxxGYokwQ&authkey=CNDvuOcG&authkey=CNDvuOcG#gid=0
I wonder if there is a better function than ROW_NUMBER() to use in this case. I feel like I am doing too many subselects, but I don't know how to optimize it.
I'd very much appreciate any help. If something is not clear, just let me know.
Thank you.
EDIT:
The reason I created auxtable2 is that when I use ROW_NUMBER() OVER (ORDER BY flg) together with other aggregate functions such as avg(amt_won) and count(*), which are essential, I get an error saying that flg should be in an aggregate function, but I can't order by an aggregate of flg.
I generated some data to test with like this:
create table temporary_six_max as
select id_player, flg_vpip,
random()*100 * (case flg_vpip when 0 then 1 else -1 end) as amt_won
from (select (random()*1000)::int as id_player, random()::int as flg_vpip
from generate_series(1,1000000)) source;
create index on temporary_six_max(id_player);
Your query runs successfully against that, but doesn't generate quite the same plan: I get a nested loop in the lower arm rather than a merge, and a seq scan in the init plan. You haven't turned off enable_seqscan, I hope?
A solution just using a single scan of the table:
select row, min(flg) || ' to ' || max(flg) as xyz, avg(amt_won), count(*)
from (select flg, amt_won, ntile(100) over(order by flg) as row
from (select id_player as pid, amt_won,
avg(flg_vpip::int) over (partition by id_player) as flg
from temporary_six_max
) player_stats
) chunks
group by 1
order by 1
The bad news is that this actually performs worse on my machine, especially if I bump work_mem up enough to avoid the first disk sort (building player_stats, sorted by flg). Increasing work_mem did halve the query time, though, so I guess that is at least a start?
Having said that, my queries are running for about 5 seconds to process 10E6 input rows in temporary_six_max, which is an order of magnitude faster than you posted. Does your table fit into your buffer cache? If not, a single-scan solution may be much better for you. (Which version of Postgresql are you using? "explain (analyze on, buffers on) select..." will show you buffer hit/miss rates in 9.0, or just look at your "shared_buffers" setting and compare with the table size)