get number of partitions in pyspark - dataframe

I select everything from a table and create a dataframe (df) out of it using PySpark. The table is partitioned by:
partitionBy('date', 't', 's', 'p')
Now I want to get the number of partitions using
df.rdd.getNumPartitions()
but it returns a much larger number (15642 partitions) than expected (18 partitions).
Output of the show partitions command in Hive:
date=2019-10-02/t=u/s=u/p=s
date=2019-10-03/t=u/s=u/p=s
date=2019-10-04/t=u/s=u/p=s
date=2019-10-05/t=u/s=u/p=s
date=2019-10-06/t=u/s=u/p=s
date=2019-10-07/t=u/s=u/p=s
date=2019-10-08/t=u/s=u/p=s
date=2019-10-09/t=u/s=u/p=s
date=2019-10-10/t=u/s=u/p=s
date=2019-10-11/t=u/s=u/p=s
date=2019-10-12/t=u/s=u/p=s
date=2019-10-13/t=u/s=u/p=s
date=2019-10-14/t=u/s=u/p=s
date=2019-10-15/t=u/s=u/p=s
date=2019-10-16/t=u/s=u/p=s
date=2019-10-17/t=u/s=u/p=s
date=2019-10-18/t=u/s=u/p=s
date=2019-10-19/t=u/s=u/p=s
Any idea why the number of partitions is so huge? And how can I get the number of partitions as expected (18)?

spark.sql("show partitions hivetablename").count()
The number of partitions in an RDD is different from the Hive partitions.
Spark generally partitions your RDD based on the number of executors in the cluster, so that each executor gets a fair share of the work.
You can control the RDD partitions by using sc.parallelize(data, numPartitions), df.repartition(n) or df.coalesce(n).
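For illustration, a minimal PySpark sketch (the table name my_db.my_table is a placeholder) contrasting the two counts and showing how repartition()/coalesce() control the RDD-side number:
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

df = spark.sql("SELECT * FROM my_db.my_table")

# Partitions of the underlying RDD, chosen by Spark when reading the files
print(df.rdd.getNumPartitions())

# Hive partitions registered in the metastore (the 18 you expect)
print(spark.sql("SHOW PARTITIONS my_db.my_table").count())

# Explicitly controlling the RDD partition count
print(df.repartition(18).rdd.getNumPartitions())  # full shuffle, exactly 18
print(df.coalesce(18).rdd.getNumPartitions())     # merge only, at most 18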

I found an easier workaround:
>>> t = spark.sql("show partitions my_table")
>>> t.count()
18

Related

DolphinDB: chunks distribution of a dfs table in a cluster

How to get the distribution of all the chunks of a dfs table in a cluster with DolphinDB? I've tried getChunksMeta but it only returned the chunk information.
Use the DolphinDB function getTabletsMeta() to view the chunk metadata on each data node. The output includes the data node where each chunk is located. Then wrap it in a query function:
def chunkDistribution(dbName, tbName){
    return select count(*) from pnodeRun(getTabletsMeta{"/"+substr(dbName,6)+"/%",tbName,true,-1}) group by node
}
dbName = "dfs://testDB"
tbName = "testTable"
chunkDistribution(dbName, tbName)

Computing grouped medians in DolphinDB

I have a DFS table in DolphinDB. I tried to run a query that would compute grouped medians on this table. But it just threw an exception.
select median(col1) from t group by col2
The aggregated function in column med(v1) doesn't have a map-reduce implementation and can't be applied to a partitioned or distributed table.
It seems to me that DolphinDB does not support a distributed median algorithm.
The aggregate function median differs from average in that it can't be computed with map-reduce, so we have to pull the data and then apply the median function to each group.
DolphinDB's repartition mechanism makes such work much easier.
ds = repartitionDS(<select first(col2) as col2, median(col1) as col1 from t>,`col2, VALUE)
mr(ds, x->x,,unionAll{false})

how to read most recent partition in apache spark

I have a dataframe which contains the result of the query
val df: DataFrame = spark.sql(s"show partitions $yourtablename")
Now the number of partitions changes every day, as the job runs daily.
The main concern is that I need to fetch the latest partition.
Suppose I get the partitions for a random table on a particular day, like:
year=2019/month=1/day=1
year=2019/month=1/day=10
year=2019/month=1/day=2
year=2019/month=1/day=21
year=2019/month=1/day=22
year=2019/month=1/day=23
year=2019/month=1/day=24
year=2019/month=1/day=25
year=2019/month=1/day=26
year=2019/month=2/day=27
year=2019/month=2/day=3
As you can see, the partitions are sorted as strings, so day=10 comes right after day=1. This creates a problem, as I need to fetch the latest partition.
I have managed to get the partition by using
val df = dff.orderBy(col("partition").desc).limit(1)
but this gives me the tail -1 partition and not the latest partition.
How can I get the latest partition from the table, overcoming Hive's ordering of partitions?
So suppose in the above example I need to pick up
year=2019/month=2/day=27
and not
year=2019/month=2/day=3
which is the last partition in the table.
You can get the max partition from SHOW PARTITIONS:
spark.sql("SHOW PARTITIONS my_database.my_table").select(max('partition)).show(false)
I would not rely on positional ordering, but if you were to do so I would at least zero-pad the values, e.g. year=2019/month=2/day=03.
I would rely on partition pruning via an SQL statement. I am not sure if you are using ORC, Parquet, etc., but partition pruning should be a goer.
E.g.
val df = sparkSession.sql(""" select max(partition_col)
from randomtable
""")
val maxVal = df.first().getString(0) // the SQL result is a DataFrame; take the first value as a String
See also https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
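If you do want to start from the SHOW PARTITIONS output rather than a query, one option is to parse each partition string into a real date before taking the maximum, since the lexicographic maximum of the raw strings suffers from the same day=3 vs day=27 problem. A PySpark sketch, assuming the year=/month=/day= layout above (the table name is a placeholder):
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Each row of SHOW PARTITIONS looks like "year=2019/month=2/day=27"
rows = spark.sql("SHOW PARTITIONS my_database.randomtable").collect()
partitions = [r.partition for r in rows]

def partition_date(p):
    # "year=2019/month=2/day=27" -> datetime.date(2019, 2, 27)
    kv = dict(field.split("=") for field in p.split("/"))
    return datetime.date(int(kv["year"]), int(kv["month"]), int(kv["day"]))

latest = max(partitions, key=partition_date)
print(latest)  # year=2019/month=2/day=27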

How can i write a data frame to a specific partition of a date partitioned BQ table using to_gbq()

I have a dataframe which I want to write to a date-partitioned BQ table. I am using the to_gbq() method to do this. I am able to replace or append to the existing table, but I can't write to a specific partition of the table using to_gbq().
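For reference, a minimal sketch of the kind of call that does work (project, dataset and table names are placeholders):
import pandas

df = pandas.DataFrame({"transaction_id": [1], "transaction_date": ["2021-10-21"]})

# if_exists="replace" or "append" targets the whole table; to_gbq()
# exposes no argument for addressing a single date partition.
df.to_gbq("your_dataset.your_table",
          project_id="your_project",
          if_exists="append")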
Since to_gbq() doesn't support this yet, I created a code snippet for doing it with the BigQuery API client.
Assuming you have an existing date-partitioned table that was created like this (you don't need to pre-create it, more details later):
CREATE TABLE
your_dataset.your_table (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
and you have a DataFrame like this:
import pandas
import datetime

records = [
    {"transaction_id": 1, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 2, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 3, "transaction_date": datetime.date(2021, 10, 21)},
]
df = pandas.DataFrame(records)
here's how to write to a specific partition:
from google.cloud import bigquery

client = bigquery.Client(project='your_project')

job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE",
    # This is needed if the table doesn't exist, but won't hurt otherwise:
    time_partitioning=bigquery.table.TimePartitioning(type_="DAY"),
)

# Include the target partition in the table id:
table_id = "your_project.your_dataset.your_table$20211021"

job = client.load_table_from_dataframe(df, table_id, job_config=job_config)  # Make an API request
job.result()  # Wait for the job to finish
The important part is the $... suffix in the table id. It tells the API to update only a specific partition. If your data contains records which belong to a different partition, the operation will fail.
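To sanity-check the load afterwards, you can query just that partition (names again hypothetical):
from google.cloud import bigquery

client = bigquery.Client(project="your_project")

# Count the rows that landed in the 2021-10-21 partition
query = """
    SELECT COUNT(*) AS n
    FROM `your_project.your_dataset.your_table`
    WHERE transaction_date = DATE '2021-10-21'
"""
for row in client.query(query).result():
    print(row.n)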
I believe that to_gbq() is not supported yet for partitioned tables.
You can check recent issues here: https://github.com/pydata/pandas-gbq/issues/43.
I would recommend using the Google BigQuery API client library: https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html
You can upload a dataframe to a BigQuery table with it too:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe

Spark coalesce on rdd resulting in less partitions than expected

We are running a Spark batch job which performs the following operations:
Create dataframe by reading from hive table
Convert dataframe to rdd
Store the RDD in a list
The above steps are performed for 2 different tables, and a variable (called minNumberPartitions) is set which holds the minimum number of partitions across the 2 RDDs created.
When the job starts, a coalesce value is initialized to a constant. This value is used to coalesce the RDDs created above only if it is less than minNumberPartitions (set in the step above). But if the coalesce value is greater than minNumberPartitions, it is reset to minNumberPartitions (i.e. coalesceValue = minNumberPartitions) and both RDDs are then coalesced with this value.
In our scenario, we are facing an issue in the latter case, when the coalesce value is greater than minNumberPartitions. The scenario is as follows:
coalesceValue is initialized to 20000. The number of partitions of RDD1, created from Dataframe1 after reading hivetable1, is 187, and the number of partitions of RDD2, created from Dataframe2 after reading hivetable2, is 10. So minNumberPartitions is set to 10.
Hence coalesceValue is reset to 10 and both RDDs are coalesced with the value 10, i.e. RDD1.coalesce(10, false, null) and RDD2.coalesce(10, false, null) [here shuffle in coalesce is set to false and the ordering to null].
According to common understanding, the number of partitions of RDD1 should be reduced from 187 to 10 and RDD2 should remain the same, i.e. 10. In practice, RDD1 is indeed reduced from 187 to 10 partitions, but RDD2 is reduced from 10 to 9 partitions. Due to this behaviour some operations are hampered and the Spark job eventually fails.
Please help us understand whether coalesce works differently when the coalesce value is the same as the RDD's existing number of partitions.
UPDATE :
I found an open Jira ticket (SPARK-13365) for the same issue, but it is not conclusive. Moreover, I don't understand the meaning of this statement in the Jira ticket:
"One case I've seen this is actually when users do coalesce(1000) without the shuffle which really turns into a coalesce(100)"
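For reference, a minimal PySpark sketch of the flow described in the question (the parallelize sizes and partition counts are stand-ins for the two Hive tables). Coalesce with shuffle=False only merges existing partitions and, as discussed in SPARK-13365, can yield fewer partitions than requested on a cluster, while shuffle=True enforces the requested count:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Stand-ins for the RDDs read from hivetable1 and hivetable2
rdd1 = sc.parallelize(range(1000), 187)
rdd2 = sc.parallelize(range(100), 10)

coalesce_value = 20000
min_num_partitions = min(rdd1.getNumPartitions(), rdd2.getNumPartitions())
if coalesce_value > min_num_partitions:
    coalesce_value = min_num_partitions  # 10, as in the question

# Without shuffle, coalesce merges existing partitions based on locality;
# on a cluster this can produce fewer partitions than requested.
print(rdd1.coalesce(coalesce_value, shuffle=False).getNumPartitions())
print(rdd2.coalesce(coalesce_value, shuffle=False).getNumPartitions())

# With shuffle=True (equivalent to repartition) the requested count is enforced.
print(rdd2.coalesce(coalesce_value, shuffle=True).getNumPartitions())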