How can I write a DataFrame to a specific partition of a date-partitioned BQ table using to_gbq() - pandas

I have a DataFrame which I want to write to a date-partitioned BQ table. I am using the to_gbq() method to do this. I am able to replace or append to the existing table, but I can't write to a specific partition of the table using to_gbq().

Since to_gbq() doesn't support this yet, I wrote a code snippet that does it with the BigQuery API client.
Assuming you have an existing date-partitioned table that was created like this (you don't need to pre-create it, more details later):
CREATE TABLE
your_dataset.your_table (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
and you have a DataFrame like this:
import pandas
import datetime
records = [
    {"transaction_id": 1, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 2, "transaction_date": datetime.date(2021, 10, 21)},
    {"transaction_id": 3, "transaction_date": datetime.date(2021, 10, 21)},
]
df = pandas.DataFrame(records)
here's how to write to a specific partition:
from google.cloud import bigquery
client = bigquery.Client(project='your_project')
job_config = bigquery.LoadJobConfig(
    write_disposition="WRITE_TRUNCATE",
    # This is needed if table doesn't exist, but won't hurt otherwise:
    time_partitioning=bigquery.table.TimePartitioning(type_="DAY")
)
# Include target partition in the table id:
table_id = "your_project.your_dataset.your_table$20211021"
job = client.load_table_from_dataframe(df, table_id, job_config=job_config) # Make an API request
job.result() # Wait for job to finish
The important part is the $... part in the table id. It tells the API to only update a specific partition. If your data contains records which belong to a different partition, the operation is going to fail.
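As a minimal sketch of the decorator idea above, the table id can be built from a date, and the data can be checked before the load job is submitted. The helper names partition_table_id and check_single_partition are hypothetical and not part of any BigQuery library; the sketch uses only the standard library.

```python
import datetime


def partition_table_id(table_id: str, day: datetime.date) -> str:
    """Append a daily partition decorator ($YYYYMMDD) to a table id."""
    return f"{table_id}${day:%Y%m%d}"


def check_single_partition(dates, day: datetime.date) -> None:
    """Raise ValueError if any date falls outside the target day."""
    stray = sorted({d for d in dates if d != day})
    if stray:
        raise ValueError(f"rows outside partition {day}: {stray}")


target_day = datetime.date(2021, 10, 21)
table_id = partition_table_id("your_project.your_dataset.your_table", target_day)

# Both values belong to the target day, so this passes silently.
check_single_partition([target_day, target_day], target_day)
print(table_id)  # your_project.your_dataset.your_table$20211021
```

With the DataFrame from the answer, the values of df["transaction_date"] could be passed as the dates argument, so that a mismatch is reported locally rather than by a failed load job.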

I believe that to_gbq() does not yet support partitioned tables.
You can check the related issue here: https://github.com/pydata/pandas-gbq/issues/43.
I would recommend using the Google BigQuery API client library instead: https://googlecloudplatform.github.io/google-cloud-python/latest/bigquery/usage.html
With it you can also upload a DataFrame to a BigQuery table:
https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-dataframe

Related

DolphinDB: chunks distribution of a dfs table in a cluster

How to get the distribution of all the chunks of a dfs table in a cluster with DolphinDB? I've tried getChunksMeta but it only returned the chunk information.
Use the DolphinDB function getTabletsMeta() to view the chunk metadata on a data node. The output includes the data node where each chunk is located. Then wrap the query in a function:
def chunkDistribution(dbName, tbName){
return select count(*) from pnodeRun(getTabletsMeta{"/"+substr(dbName,6)+"/%",tbName,true,-1}) group by node
}
dbName = "dfs://testDB"
tbName = "testTable"
chunkDistribution(dbName, tbName)

get number of partitions in pyspark

I select all rows from a table and create a DataFrame (df) from it using PySpark. The table is partitioned as:
partitionBy('date', 't', 's', 'p')
Now I want to get the number of partitions using
df.rdd.getNumPartitions()
but it returns a much larger number (15642 partitions) than expected (18 partitions).
Output of the show partitions command in Hive:
date=2019-10-02/t=u/s=u/p=s
date=2019-10-03/t=u/s=u/p=s
date=2019-10-04/t=u/s=u/p=s
date=2019-10-05/t=u/s=u/p=s
date=2019-10-06/t=u/s=u/p=s
date=2019-10-07/t=u/s=u/p=s
date=2019-10-08/t=u/s=u/p=s
date=2019-10-09/t=u/s=u/p=s
date=2019-10-10/t=u/s=u/p=s
date=2019-10-11/t=u/s=u/p=s
date=2019-10-12/t=u/s=u/p=s
date=2019-10-13/t=u/s=u/p=s
date=2019-10-14/t=u/s=u/p=s
date=2019-10-15/t=u/s=u/p=s
date=2019-10-16/t=u/s=u/p=s
date=2019-10-17/t=u/s=u/p=s
date=2019-10-18/t=u/s=u/p=s
date=2019-10-19/t=u/s=u/p=s
Any idea why the number of partitions is so large, and how can I get the expected number of partitions (18)?
spark.sql("show partitions hivetablename").count()
The number of partitions in the RDD is different from the number of Hive partitions.
Spark generally partitions your RDD based on the number of executors in the cluster so that each executor gets a fair share of the tasks.
You can control the RDD partitions by using sc.parallelize(data, numSlices), df.repartition() or coalesce().
I found an easier workaround:
>>> t = spark.sql("show partitions my_table")
>>> t.count()
18

how to read most recent partition in apache spark

I have a DataFrame that holds the result of the query
val df: DataFrame = spark.sql(s"SHOW PARTITIONS $yourtablename")
The number of partitions changes every day, because the job runs daily.
My main concern is that I need to fetch the latest partition.
Suppose I get the partitions of some table for a particular day,
like
year=2019/month=1/day=1
year=2019/month=1/day=10
year=2019/month=1/day=2
year=2019/month=1/day=21
year=2019/month=1/day=22
year=2019/month=1/day=23
year=2019/month=1/day=24
year=2019/month=1/day=25
year=2019/month=1/day=26
year=2019/month=2/day=27
year=2019/month=2/day=3
As you can see, the partitions are sorted as strings, so day=10 comes right after day=1. This creates a problem, as I need to fetch the latest partition.
I have managed to get a partition by using
val df = dff.orderBy(col("partition").desc).limit(1)
but this gives me the last partition in the listing (as with tail -1), not the latest partition.
How can I get the latest partition from the table, working around Hive's string ordering of partitions?
So suppose in the above example I need to pick up
year=2019/month=2/day=27
and not
year=2019/month=2/day=3
which is the last partition in the table.
You can get the max partition from SHOW PARTITIONS:
spark.sql("SHOW PARTITIONS my_database.my_table").select(max('partition)).show(false)
I would not rely on positional ordering, but if you do, I would at least zero-pad the values, e.g. year=2019/month=2/day=03.
I would rely on partition pruning and SQL via an SQL statement. I am not sure if you are using ORC, PARQUET, etc. but partition pruning should be a goer.
E.g.
val df = sparkSession.sql(""" select max(partition_col)
from randomtable
""")
val maxVal = df.first().getString(0) // this as sql result is a DF
See also https://mapr.com/blog/tips-and-best-practices-to-take-advantage-of-spark-2-x/
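To illustrate the ordering problem from the question, here is a minimal Python sketch (not taken from any of the answers) that compares the partition strings by their numeric values instead of as text. The helper name partition_key is hypothetical, and the sketch assumes every partition value is an integer.

```python
def partition_key(spec: str) -> tuple:
    """Turn 'year=2019/month=2/day=27' into the tuple (2019, 2, 27)."""
    return tuple(int(part.split("=", 1)[1]) for part in spec.split("/"))


partitions = [
    "year=2019/month=1/day=1",
    "year=2019/month=1/day=10",
    "year=2019/month=1/day=2",
    "year=2019/month=2/day=27",
    "year=2019/month=2/day=3",
]

# Plain string comparison puts day=3 after day=27.
latest_as_text = max(partitions)
# Numeric comparison picks the partition with the highest date.
latest = max(partitions, key=partition_key)

print(latest_as_text)  # year=2019/month=2/day=3
print(latest)          # year=2019/month=2/day=27
```

In a Spark job, the list could presumably be filled from the partition column of the SHOW PARTITIONS result after collecting it to the driver, which is reasonable because that listing is small.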

BigQuery, None until set from the server property

I am trying to get the number of rows in a table in BigQuery using the num_rows property, but I get None as the result. When I checked the documentation, the code says :returns: the row count (None until set from the server). When will the server set the number of rows in a table, or should I perform some operation before reading this property?
Below is my code
from google.cloud import bigquery
bqclient = bigquery.Client.from_service_account_json('service_account.json')
datasets = list(bqclient.list_datasets())
for dataset in datasets:
    for table in bqclient.list_dataset_tables(dataset):
        print(table.num_rows)
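Try this instead. Listing tables returns only partial table metadata, so num_rows stays None until the full table is fetched with get_table() (in newer releases of google-cloud-bigquery, list_dataset_tables is named list_tables):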
for dataset in datasets:
    for table in bqclient.list_dataset_tables(dataset):
        print("Table {} has {} rows".format(table.table_id,
                                            bqclient.get_table(table).num_rows))

pandas read sql query improvement

So I downloaded some data from a database which conveniently has a sequential ID column. I saved the max ID for each table I am querying to a small text file which I read into memory (max_ids dataframe).
I was trying to create a query where I would say give me all of the data where the Idcol > max_id for that table. I was getting errors that Series are mutable so I could not use them in a parameter. The code below ended up working but it was literally just a guess and check process. I turned it into an int and then a string which basically extracted the actual value from the dataframe.
Is this the correct way to accomplish what I am trying to do before I replicate this for about 32 different tables? I want to always be able to grab only the latest data from these tables which I am then doing stuff to in pandas and eventually consolidating and exporting to another database.
df= pd.read_sql_query('SELECT * FROM table WHERE Idcol > %s;', engine, params={'max_id', str(int(max_ids['table_max']))})
Can I also make the table name more dynamic as well? I need to go through a list of tables. The database is MS SQL and I am using pymssql and sqlalchemy.
Here is an example of where I ran max_ids['table_max']:
Out[11]:
0 1900564174
Name: max_id, dtype: int64
Assuming that your max_ids DataFrame looks like the following:
In [24]: max_ids
Out[24]:
   table  table_max
0  tab_a      33333
1  tab_b     555555
2  tab_c   66666666
you can do it this way:
from sqlalchemy import text

qry = 'SELECT * FROM {} WHERE Idcol > :max_id'
for i, r in max_ids.iterrows():
    print('Executing: [%s], max_id: %s' % (qry.format(r['table']), r['table_max']))
    # text() lets SQLAlchemy translate :max_id to the driver's parameter style
    df = pd.read_sql_query(text(qry.format(r['table'])), engine, params={'max_id': int(r['table_max'])})
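Since table names cannot be passed as bound parameters, one cautious pattern is to format only trusted names into the SQL text and bind the id value as usual. The sketch below shows the idea with the standard-library sqlite3 module as a stand-in for MS SQL, so the ? placeholder style differs from pymssql; ALLOWED_TABLES and fetch_new_rows are hypothetical names chosen for illustration.

```python
import sqlite3

# Hypothetical whitelist: only names from this trusted set are ever
# formatted into the SQL text, because table names cannot be bound.
ALLOWED_TABLES = {"tab_a", "tab_b"}


def fetch_new_rows(conn, table, max_id):
    """Return rows of `table` whose Idcol is greater than max_id."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table name: {table!r}")
    query = f"SELECT * FROM {table} WHERE Idcol > ? ORDER BY Idcol"
    # int() turns a numpy integer taken from a DataFrame into a plain int.
    return conn.execute(query, (int(max_id),)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab_a (Idcol INTEGER, value TEXT)")
conn.executemany(
    "INSERT INTO tab_a VALUES (?, ?)",
    [(1, "old"), (2, "new"), (3, "newer")],
)

rows = fetch_new_rows(conn, "tab_a", 1)
print(rows)  # [(2, 'new'), (3, 'newer')]
```

The same check could sit in front of the loop over max_ids, with the whitelist holding the 32 table names.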