I have a DataFrame that holds the result of this query:
val df: DataFrame = spark.sql(s"SHOW PARTITIONS $yourtablename")
The number of partitions changes every day, because the job runs daily.
My main concern is that I need to fetch the latest partition.
Suppose that, on a particular day, I get the following partitions for some table:
year=2019/month=1/day=1
year=2019/month=1/day=10
year=2019/month=1/day=2
year=2019/month=1/day=21
year=2019/month=1/day=22
year=2019/month=1/day=23
year=2019/month=1/day=24
year=2019/month=1/day=25
year=2019/month=1/day=26
year=2019/month=2/day=27
year=2019/month=2/day=3
As you can see, the partitions are sorted as strings, so day=10 comes right after day=1. This creates a problem, as I need to fetch the latest partition.
I have managed to get a single partition by using
val df = dff.orderBy(col("partition").desc).limit(1)
but this gives me the last partition in the listing (the tail -1 one) and not the latest partition.
How can I get the latest partition from the table, working around Hive's way of ordering partitions?
So suppose in the above example I need to pick up
year=2019/month=2/day=27
and not
year=2019/month=2/day=3
which is the last partition in the table.
You can get the maximum partition from SHOW PARTITIONS:
spark.sql("SHOW PARTITIONS my_database.my_table").select(max('partition)).show(false)
I would not rely on positional ordering, but if you were to do so I would at least zero-pad the values, e.g. year=2019/month=2/day=03.
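If the partition values cannot be zero-padded, another option is to compare them as numbers rather than as strings. The following is a minimal sketch, not something taken from the answers here: it assumes the partition strings have already been collected to the driver (the SHOW PARTITIONS output is usually small) and that every value is an integer. The helper names parse_partition and latest_partition are made up for illustration.

```python
def parse_partition(spec):
    """Turn 'year=2019/month=2/day=27' into the tuple (2019, 2, 27)."""
    return tuple(int(part.split("=", 1)[1]) for part in spec.split("/"))


def latest_partition(specs):
    """Return the spec whose numeric (year, month, day) tuple is largest."""
    return max(specs, key=parse_partition)


# Sample values copied from the question.
partitions = [
    "year=2019/month=1/day=1",
    "year=2019/month=1/day=10",
    "year=2019/month=1/day=26",
    "year=2019/month=2/day=27",
    "year=2019/month=2/day=3",
]

print(latest_partition(partitions))  # year=2019/month=2/day=27
print(max(partitions))               # year=2019/month=2/day=3 (plain string comparison)
```

The second print shows why a plain string maximum picks day=3 over day=27: the character '3' sorts after '2'.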
I would rely on partition pruning and a plain SQL statement. I am not sure whether you are using ORC, Parquet, etc., but partition pruning should work.
E.g.
val df = sparkSession.sql(
  """SELECT max(partition_col)
    |FROM randomtable""".stripMargin)
val maxVal = df.first().getString(0) // the SQL result is a DataFrame, so read the first row
See also https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
Related
I have a SQL table like this, and I want to find the average adjusted amount for products, partitioned by store_id, so that the result looks like this
Here, I need to compute adj_amt, which is the product of the previous two columns.
For this, I need to fill the nulls in av_quantity with the first non-null value in the partition. The query I use is below.
select
  CASE WHEN av_quantity is null then
    -- the boolean here makes first_value skip nulls and return the first non-null value
    first_value(av_quantity, True) over (
      partition by store_no order by product_id
      range between current row and unbounded following
    )
  else av_quantity
  end as adj_av_quantity
I'm having trouble with the SQL required to get the adjusted cost, since it's not pulling the first non-null value for factor but still fetches it from the same row as adj_av_quantity. Any thoughts on how I could do this?
FYI, I've simplified the data here. The actual dataset is huge (> 125 million rows with 800+ columns), so I can't use joins and have to do this via window functions. I'm using spark-sql.
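To make the intended fill rule concrete, here is a small pure-Python sketch of what the window expression in the query computes. It is only an illustration on made-up rows, not a replacement for the Spark SQL, and it assumes product_id is unique within each store_no. The function name fill_with_next_non_null is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def fill_with_next_non_null(rows):
    """Add adj_av_quantity: the first non-null av_quantity at or after each row,
    within the same store_no, ordered by product_id."""
    ordered = sorted(rows, key=itemgetter("store_no", "product_id"))
    result = []
    for _, group in groupby(ordered, key=itemgetter("store_no")):
        next_value = None
        filled = []
        # Walk each store backwards so the nearest following non-null value is known.
        for row in reversed(list(group)):
            if row["av_quantity"] is not None:
                next_value = row["av_quantity"]
            filled.append({**row, "adj_av_quantity": next_value})
        result.extend(reversed(filled))
    return result


# Made-up sample rows, for illustration only.
rows = [
    {"store_no": 1, "product_id": 1, "av_quantity": None},
    {"store_no": 1, "product_id": 2, "av_quantity": 4.0},
    {"store_no": 1, "product_id": 3, "av_quantity": None},
    {"store_no": 2, "product_id": 1, "av_quantity": None},
    {"store_no": 2, "product_id": 2, "av_quantity": 7.0},
]

for row in fill_with_next_non_null(rows):
    print(row["store_no"], row["product_id"], row["adj_av_quantity"])
```

Note that a row with no non-null value after it in its store stays null, which is also what the frame "current row and unbounded following" gives.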
In BigQuery (GCP), I am trying to fetch rows from a table where the date matches one of the dates in a list of values I have. If I hardcode the list of values in the query, it processes vastly less data than if I use a temporary structure such as an array.
Is there a way to use the temporary structure but avoid the enormous processing cost?
Why is it so expensive for something as small and simple as this?
Please see the examples below:
1/ Array structure example (this query processes 144.8 GB):
WITH
  get_a AS (
    SELECT
      GENERATE_DATE_ARRAY('2000-01-01', '2000-01-02') AS array_of_dates
  )
SELECT
  a.heading AS title,
  a.ingest_time AS proc_date
FROM
  `veiw_a.events` AS a,
  get_a AS b,
  UNNEST(b.array_of_dates) AS c
WHERE
  c IN (CAST(a.ingest_time AS DATE))
2/ Hardcoded example (this query processes 936.5 MB, over 154 times less):
SELECT
  a.heading AS title,
  a.ingest_time AS proc_date
FROM
  `veiw_a.events` AS a
WHERE
  CAST(a.ingest_time AS DATE) IN ('2000-01-01', '2000-01-02')
Presumably, your view_a.events table is partitioned by ingest_time.
The issue is that partition pruning is very conservative (buggy?). With the direct comparisons, BigQuery is smart enough to recognize exactly which partitions the query needs. With the generated array, BigQuery cannot work this out up front, so the entire table has to be read.
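One possible workaround, which is an assumption on my part and not something the answer above proposes, is to generate the date list on the client and embed it as literals, so the query BigQuery receives looks like the hardcoded version. The sketch below only builds the SQL text. The helper names date_literals and build_query are made up, and the table name is copied from the question as written. Whether this keeps the scan small for a given table should be confirmed with a dry run before relying on it.

```python
from datetime import date, timedelta


def date_literals(start, end):
    """Return ISO date strings for every day from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def build_query(start, end):
    """Build the hardcoded-style query with a literal IN list.

    The values come from date objects, not user input, so plain string
    formatting is acceptable in this sketch.
    """
    in_list = ", ".join(f"'{d}'" for d in date_literals(start, end))
    return (
        "SELECT a.heading AS title, a.ingest_time AS proc_date\n"
        "FROM `veiw_a.events` AS a\n"
        f"WHERE CAST(a.ingest_time AS DATE) IN ({in_list})"
    )


print(build_query(date(2000, 1, 1), date(2000, 1, 2)))
```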
Which one is best from the perspective of cost, time and processing? Here etl_batch_date is the partition column of the table.
1. Query: this query will process 607.7 KB when run
Table size: 9.77 MB
SELECT count(*) from demo
WHERE etlbatchid = '20200003094244327' and etl_batch_date = '2020-06-03'
2. Query: this query will process 427.6 KB when run
Table size: 9.77 MB
SELECT count(*) from demo WHERE etlbatchid = '20200003094244327'
Also, when you run the second query, does it read the data from every partition?
Your valuable comments will be appreciated.
Rule of thumb: Always use the partitioned column to filter data.
Play with this query:
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v3.pageviews_2020`
WHERE DATE(datehour) IN ('2020-01-01', '2020-01-02')
# 2.2 GB processed
For every datehour you add to the filter, an extra gigabyte of data will be queried. That's because:
Filtering by datehour implies a read of the datehour column. So this makes the query go over more data.
But since the datehour column is the partitioned column, then it only scans that day of data.
Now, if I add another filter:
SELECT COUNT(*)
FROM `fh-bigquery.wikipedia_v3.pageviews_2020`
WHERE DATE(datehour) IN ('2020-01-01', '2020-01-02')
AND wiki='en'
# 686.8 MB processed
That processed less data!
That's because wiki is the main clustering column.
So try to always use partitions and clusters, even though for smaller tables the results might look less intuitive.
I select everything from a table and create a DataFrame (df) from it using PySpark. The table is partitioned as:
partitionBy('date', 't', 's', 'p')
Now I want to get the number of partitions using
df.rdd.getNumPartitions()
but it returns a much larger number (15642 partitions) than expected (18 partitions).
Output of the show partitions command in Hive:
date=2019-10-02/t=u/s=u/p=s
date=2019-10-03/t=u/s=u/p=s
date=2019-10-04/t=u/s=u/p=s
date=2019-10-05/t=u/s=u/p=s
date=2019-10-06/t=u/s=u/p=s
date=2019-10-07/t=u/s=u/p=s
date=2019-10-08/t=u/s=u/p=s
date=2019-10-09/t=u/s=u/p=s
date=2019-10-10/t=u/s=u/p=s
date=2019-10-11/t=u/s=u/p=s
date=2019-10-12/t=u/s=u/p=s
date=2019-10-13/t=u/s=u/p=s
date=2019-10-14/t=u/s=u/p=s
date=2019-10-15/t=u/s=u/p=s
date=2019-10-16/t=u/s=u/p=s
date=2019-10-17/t=u/s=u/p=s
date=2019-10-18/t=u/s=u/p=s
date=2019-10-19/t=u/s=u/p=s
Any idea why the number of partitions is so large, and how can I get the expected number of partitions (18)?
spark.sql("show partitions hivetablename").count()
The number of partitions in an RDD is different from the number of Hive partitions.
Spark generally partitions your RDD based on the number of executors in the cluster, so that each executor gets a fair share of the tasks.
You can control the RDD partitions by using sc.parallelize(, ), df.repartition() or coalesce().
I found an easier workaround:
>>> t = spark.sql("show partitions my_table")
>>> t.count()
18
I have a DataFrame that I want to write to a date-partitioned BigQuery table. I am using the to_gbq() method to do this. I am able to replace or append to the existing table, but I can't write to a specific partition of the table using to_gbq().
Since to_gbq() doesn't support this yet, I wrote a code snippet that does it with the BigQuery API client.
Assuming you have an existing date-partitioned table that was created like this (you don't need to pre-create it, more details later):
CREATE TABLE
your_dataset.your_table (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
and you have a DataFrame like this:
import pandas
import datetime
records = [
    {"transaction_id": 1, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 2, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 3, "transaction_date": datetime.date(2021, 10, 21)},
]
df = pandas.DataFrame(records)
here's how to write to a specific partition:
from google.cloud import bigquery
client = bigquery.Client(project='your_project')
job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE",
    # This is needed if table doesn't exist, but won't hurt otherwise:
    time_partitioning=bigquery.table.TimePartitioning(type_="DAY")
)
# Include target partition in the table id:
table_id = "your_project.your_dataset.your_table$20211021"
job = client.load_table_from_dataframe(df, table_id, job_config=job_config) # Make an API request
job.result() # Wait for job to finish
The important part is the $... part in the table id. It tells the API to only update a specific partition. If your data contains records which belong to a different partition, the operation is going to fail.
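Since the load fails when a record belongs to a different partition, it can help to build the decorated table id from a date and to check the records before submitting the job. This is a minimal sketch using only the standard library. The helper names partition_table_id and check_single_partition are hypothetical and are not part of the BigQuery client.

```python
import datetime


def partition_table_id(table_id, day):
    """Append the $YYYYMMDD partition decorator to a table id."""
    return f"{table_id}${day:%Y%m%d}"


def check_single_partition(records, column, day):
    """Raise ValueError if any record's date differs from the target day."""
    stray = [r for r in records if r[column] != day]
    if stray:
        raise ValueError(f"{len(stray)} record(s) do not belong to partition {day}")


target_day = datetime.date(2021, 10, 21)
records = [
    {"transaction_id": 1, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 2, "transaction_date": datetime.date(2021, 10, 21)},
]

check_single_partition(records, "transaction_date", target_day)
table_id = partition_table_id("your_project.your_dataset.your_table", target_day)
print(table_id)  # your_project.your_dataset.your_table$20211021
```

The resulting table_id can then be passed to load_table_from_dataframe as in the snippet above.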
I believe that to_gbq() does not yet support partitioned tables.
You can check recent issues here: https://github.com/pydata/pandas-gbq/issues/43
I would recommend using the Google BigQuery API client library: https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html
You can upload a DataFrame to a BigQuery table with it, too:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe