We are running a Spark batch job which performs the following operations:
Create a DataFrame by reading from a Hive table
Convert the DataFrame to an RDD
Store the RDD in a list
The above steps are performed for 2 different tables, and a variable (called minNumberPartitions) is set to the minimum number of partitions across the 2 RDDs created.
When the job starts, the coalesce value is initialized to a constant. This value is used to coalesce the RDDs created above only if it is less than minNumberPartitions (set in the step above). But if the coalesce value is greater than minNumberPartitions, it is reset to minNumberPartitions (i.e. coalesceValue = minNumberPartitions), and both RDDs are then coalesced with this value.
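In code, the logic looks roughly like this (a minimal sketch; coalesceValue, minNumberPartitions, rdd1 and rdd2 are illustrative names for the values described above):
// Cap the configured coalesce value at the smaller RDD's partition count,
// then coalesce both RDDs with that value (no shuffle).
val effectiveCoalesceValue = math.min(coalesceValue, minNumberPartitions)
val coalescedRdd1 = rdd1.coalesce(effectiveCoalesceValue, shuffle = false)
val coalescedRdd2 = rdd2.coalesce(effectiveCoalesceValue, shuffle = false)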
In our scenario, we are facing an issue in the latter case, when the coalesce value is greater than minNumberPartitions. The scenario is roughly this:
coalesceValue is initialized to 20000. The number of partitions of RDD1 (created from Dataframe1 after reading hivetable1) is 187, and the number of partitions of RDD2 (created from Dataframe2 after reading hivetable2) is 10. So minNumberPartitions is set to 10.
Hence coalesceValue is reset to 10 and both RDDs are coalesced with the value 10, i.e. RDD1.coalesce(10, false, null) and RDD2.coalesce(10, false, null) [here shuffle is set to false and the partition coalescer argument is null].
As commonly understood, the number of partitions of RDD1 should be reduced from 187 to 10, and RDD2 should remain unchanged at 10. In practice, the number of partitions of RDD1 is indeed reduced from 187 to 10, but for RDD2 the number of partitions is reduced from 10 to 9. Because of this behaviour some downstream operations are hampered and the Spark job ultimately fails.
Please help us understand whether coalesce behaves differently when the coalesce value is the same as the RDD's existing number of partitions.
UPDATE:
I found an open Jira ticket (SPARK-13365) for the same issue, but it is not conclusive. Moreover, I don't understand the meaning of this statement in the ticket:
' One case I've seen this is actually when users do coalesce(1000)
without the shuffle which really turns into a coalesce(100) '
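Presumably this means that with shuffle = false, coalesce only merges existing partitions together (it never moves data), so the resulting partition count can come out lower than the number requested. A minimal sketch of the usual workaround when the exact count matters, with rdd2 standing in for the 10-partition RDD above:
// rdd2 is an illustrative name for the 10-partition RDD from the scenario above.
val withoutShuffle = rdd2.coalesce(10, shuffle = false) // may end up with fewer than 10 partitions
val withShuffle    = rdd2.coalesce(10, shuffle = true)  // produces exactly 10 partitions, at the cost of a shuffle
// coalesce(n, shuffle = true) is what repartition(n) does under the hood.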
I have used a DataFrame which contains the query
val df: DataFrame = spark.sql(s"show partitions $yourtablename")
Now the number of partitions changes every day, as the job runs daily.
The main concern is that I need to fetch the latest partition.
Suppose I get the partitions of a random table for a particular day
like
year=2019/month=1/day=1
year=2019/month=1/day=10
year=2019/month=1/day=2
year=2019/month=1/day=21
year=2019/month=1/day=22
year=2019/month=1/day=23
year=2019/month=1/day=24
year=2019/month=1/day=25
year=2019/month=1/day=26
year=2019/month=2/day=27
year=2019/month=2/day=3
As you can see, the partitions are sorted as strings, so day=10 comes right after day=1. This creates a problem, as I need to fetch the latest partition.
I have managed to get a partition by using
val df = dff.orderBy(col("partition").desc).limit(1)
but this gives me the tail -1 partition and not the latest partition.
How can I get the latest partition from the table, overcoming Hive's limitation of ordering partitions as strings?
So suppose in the above example I need to pick up
year=2019/month=2/day=27
and not
year=2019/month=2/day=3
which is the last partition in the table.
You can get the max partition from SHOW PARTITIONS:
spark.sql("SHOW PARTITIONS my_database.my_table").select(max('partition)).show(false)
I would not rely on positional ordering, but if you were to do so I would at least zero-pad the values, e.g. year=2019/month=2/day=03.
Instead, I would rely on partition pruning via an SQL statement. I am not sure whether you are using ORC, Parquet, etc., but partition pruning should be a goer.
E.g.
val df = sparkSession.sql(""" select max(partition_col)
from randomtable
""")
val maxVal = df.first().getString(0) // the SQL result is a DataFrame, so read the value from its first row
See also https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
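Another option that sidesteps the string-ordering problem is to parse the SHOW PARTITIONS output numerically on the driver. A minimal sketch, assuming the year=.../month=.../day=... layout from the question (yourtablename as before; the partition list is small, so collecting it is cheap):
val latestPartition = spark.sql(s"show partitions $yourtablename")
  .collect()
  .map(_.getString(0)) // e.g. "year=2019/month=2/day=27"
  .maxBy { p =>
    // parse each key=value segment as an Int so that day=10 sorts after day=2
    val kv = p.split("/").map(_.split("=", 2)).map(a => a(0) -> a(1).toInt).toMap
    (kv("year"), kv("month"), kv("day"))
  }
With the example listing above, latestPartition would be the year=2019/month=2/day=27 entry.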
I have a table with 2 integer fields, x and y, and a few million rows.
The fields are created with the following code:
Field.newBuilder("x", LegacySQLTypeName.INTEGER).setMode(Field.Mode.NULLABLE).build();
If I run the following from the web UI:
SELECT x,y FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: "Valid: This query will process 64.9 MB when run."
compared to:
SELECT x FROM [myproject:Test.Test] where x=1 LIMIT 50
Query Editor: " Valid: This query will process 32.4 MB when run."
It scans more than double of the original data scanned.
I would expect it will first find the relevant rows based on where clause and then bring the extra field without scanning the entire second field.
Any inputs on why it doubles the data scanned and how to avoid it will be appreciated.
In my application I have hundreds of possible fields which I need to fetch for a very small number of rows (50) that answer the query.
Does this mean I will need to process all of those fields' data?
* I'm aware of how a columnar database works, but I wasn't aware of the huge price you pay when you want to bring back lots of fields based on a very specific WHERE clause.
The following link provides a very clear answer:
best-practices-performance-input
BigQuery does not have a concept of an index or anything like that. When you query a column, BigQuery will scan through all the values of that column and then perform the operations you want (for a deeper understanding, they have some pretty cool posts about the inner workings of BQ).
That means that when you select x and y where x = 1, BQ will read through all the values of both x and y and only then find where x = 1. That is why selecting both columns scans roughly twice as much here: each INTEGER column contributes about 32.4 MB, regardless of the WHERE clause.
This ends up being an amazing feature of BQ: you just load your data and it just works. It does force you to be aware of how much data each query reads, though. Queries of the form SELECT * FROM table should be used only if you really need all columns.
I am trying to get a sample of data from a large table and want to make sure this can be repeated later on. Other SQL engines allow repeatable sampling, either by setting a seed using set.seed(integer) or with a REPEATABLE (integer) clause. However, this is not working for me in Presto. Is such a command not available yet? Thanks.
One solution is to simulate the sampling by adding a column (or creating a view) with random content (such as a UUID) and then selecting rows by filtering on this column (for example, UUIDs ending with '1'). You can tune the condition to get the sample size you need.
By design, the result is random and also repeatable across multiple runs.
If you are using Presto 0.263 or higher you can use key_sampling_percent to reproducibly generate a double between 0.0 and 1.0 from a varchar.
For example, to reproducibly sample 20% of records in table using the id column:
select
id
from table
where key_sampling_percent(id) < 0.2
If you are using an older version of Presto (e.g. AWS Athena), you can use what's in the source code for key_sampling_percent:
select
id
from table
where (abs(from_ieee754_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
I have found that you have to use from_big_endian_64 instead of from_ieee754_64 to get reliable results in Athena. Otherwise I got too many numbers close to zero, because of the negative exponent.
select id
from table
where (abs(from_big_endian_64(xxhash64(cast(id as varbinary)))) % 100) / 100. < 0.2
You may create a simple intermediate table with selected ids:
CREATE TABLE IF NOT EXISTS <temp1>
AS
SELECT <id_column>
FROM <tablename> TABLESAMPLE SYSTEM (10);
This will contain only the sampled ids and will be ready to use downstream in your analysis by joining with the data of interest.
I have a very large table, CLAIMS, with the following columns:
p_key
c_key
claim_type
Each row is uniquely defined by p_key, c_key. Often there will be multiple c_keys for each p_key. The table would look like this:
p_key c_key claim_type
1 1 A
1 2 A
2 3 B
2 5 C
3 1 B
I want to find the minimum c_key for each p_key. This is my query:
SELECT p_key,
min(c_key) as min_ckey
from CLAIMS
GROUP BY p_key
The issue is that when I run this as a MapReduce job through the Hive CLI (0.13), the reduce portion takes 30 minutes to even get 5% done. I'm not entirely sure what could cause such a simple query to take so long. This query exhibits the same issue:
SELECT p_key,
row_number() OVER(PARTITION BY p_key ORDER BY c_key) as RowNum
from CLAIMS
So my question is why would the reduce portion of a seemingly simple mapreduce job take so long? Any suggestions on how to investigate this/improve the query would also be appreciated.
Do you know if the data is imbalanced? If there is one p_key with a very large number of c_key values compared to the average case, then the reducer which deals with that p_key will take a very long time.
Alternatively, is it possible that there are a small number of p_key values in general? Since you're grouping by p_key that would limit the number of reducers doing useful work.
The reduce phase occurs in three stages: up to 33% is the shuffle, from 33% to 66% is the sort, and from 67% onwards is the actual reduce.
Your job sounds like it is getting hung up in the shuffle portion of the reduce phase. My guess is that your data is spread all over and this portion is I/O bound, as your observations are being moved to the reducers.
You can try bucketing your data:
create table claim_bucket (p_key string, c_key string, claim_type string)
clustered by (p_key) into 6 buckets
row format delimited fields terminated by ",";
You may want more or fewer buckets. This will require some heavy lifting by Hive initially, but it should speed up subsequent queries of the table where p_key is used.
Of course you haven't left much else to go on here. If you post an edit and give more information you might get a better answer. Good luck.
I just stumbled over jOOQ's maxDistinct SQL aggregation function.
What does MAX(DISTINCT x) do different from just MAX(x) ?
maxDistinct and minDistinct were defined in order to keep consistency with the other aggregate functions where having a distinct option actually makes a difference (e.g., countDistinct, sumDistinct).
Since the maximum (or minimum) calculated over the distinct values of a dataset is mathematically equivalent to the simple maximum (or minimum) of the full set, these functions are essentially redundant.
In short, there is no difference. In the case of MySQL, it's even stated on the manual page:
Returns the maximum value of expr. MAX() may take a string argument;
in such cases, it returns the maximum string value. See Section 8.5.3,
“How MySQL Uses Indexes”. The DISTINCT keyword can be used to find the
maximum of the distinct values of expr, however, this produces the
same result as omitting DISTINCT.
The reason this is allowed is to keep compatibility with other platforms. Internally, there is no difference: MySQL simply ignores the DISTINCT. It will not do anything extra with the set of rows (i.e. it will not produce the distinct set first). For indexed columns the plan will be "Select tables optimized away" (thus reading a single value from the index, not the table); for non-indexed columns it is a full scan.
If I'm not wrong, there is no difference.
For a column
ID
1
2
2
3
3
4
5
5
The output of both queries is the same: 5.
MAX(DISTINCT x)
// ID = 1,2,2,3,3,4,5,5
// DISTINCT = 1,2,3,4,5
// MAX = 5
// 1 row
and for
MAX(x)
// ID = 1,2,2,3,3,4,5,5
// MAX = 5
// 1 row
Theoretically, DISTINCT x reduces a set to its distinct elements, and the MAX operator selects the highest value from a set. In plain SQL there should be no difference between the two.