Adding labels to a BigQuery table from a PySpark job on Dataproc using the Spark BigQuery connector - google-bigquery

I am trying to use PySpark on a Google Dataproc cluster to run a Spark job and write the results to a BigQuery table.
Spark BigQuery Connector documentation - https://github.com/GoogleCloudDataproc/spark-bigquery-connector
The requirement is that certain labels should be present on the BigQuery table when it is created.
The Spark BigQuery connector does not provide any option to add labels during the write operation:
df.write.format("bigquery") \
    .mode("overwrite") \
    .option("temporaryGcsBucket", "tempdataprocbqpath") \
    .option("createDisposition", "CREATE_IF_NEEDED") \
    .save("abc.tg_dataset_1.test_table_with_labels")
The above command creates a BigQuery load job in the background that loads the table with the data.
On checking further, the BigQuery load job itself does not support adding labels, in contrast to the BigQuery query job.
Is there any plan to support the following?
Support for labels in the BigQuery load job
Support for labels in the write operation of the Spark BigQuery connector
Since there is no provision to add labels during the load/write operation, the current workaround is to create the table with the schema and labels before the PySpark job runs, for example as sketched below.
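A minimal sketch of that workaround using the google-cloud-bigquery client; the schema and label values here are illustrative assumptions, not taken from the original job:
from google.cloud import bigquery

client = bigquery.Client()

# Pre-create the destination table with the required labels before the Spark job runs
table_id = "abc.tg_dataset_1.test_table_with_labels"
schema = [
    bigquery.SchemaField("id", "STRING"),       # hypothetical columns
    bigquery.SchemaField("value", "INTEGER"),
]

table = bigquery.Table(table_id, schema=schema)
table.labels = {"team": "data-eng", "env": "prod"}  # assumed label keys/values

client.create_table(table, exists_ok=True)
The Spark job then writes into the pre-created table; it is worth verifying that the chosen write mode and dispositions do not drop and recreate the table, since that would discard the labels.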

Related

Is it possible to export all BigQuery scheduled queries into a CSV?

I need a programmatic way to get the up-to-date list of BigQuery scheduled queries into a CSV (or a Google Sheet / BigQuery table if possible), but I cannot find documentation for that. For now I can only select all the text manually from the BigQuery scheduled queries page.
Below is the information needed:
Display name and its URL
Schedule (UTC)
Next scheduled run
Author
Destination dataset and destination table
With new scheduled queries still being created, it is getting harder to keep track as the list keeps growing.
To list the scheduled queries with the bq CLI:
bq ls --transfer_config --transfer_location=US --format=prettyjson
To view the details of a scheduled query:
bq show --transfer_config [RESOURCE_NAME]
# [RESOURCE_NAME] is the value from the above bq ls command
In Python you can use the code below to list the transfer configurations in a project.
from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

project_id = "my-project"
parent = transfer_client.common_project_path(project_id)

configs = transfer_client.list_transfer_configs(parent=parent)

print("Got the following configs:")
for config in configs:
    print(f"\tID: {config.name}, Schedule: {config.schedule}")
For more information you can refer to link1 and link2.
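Since the goal is a CSV, here is a minimal sketch extending the listing above to write the fields to a file; the file name and field selection are assumptions, and note that the author of a scheduled query is not exposed as a simple field on the transfer config:
import csv

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()
parent = transfer_client.common_project_path("my-project")

with open("scheduled_queries.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["resource_name", "display_name", "schedule",
                     "next_run_time", "destination_dataset"])
    for config in transfer_client.list_transfer_configs(parent=parent):
        writer.writerow([
            config.name,
            config.display_name,
            config.schedule,
            config.next_run_time,            # populated for active schedules
            config.destination_dataset_id,
        ])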

Is it possible to reduce the number of MetaStore checks when querying a Hive table with lots of columns?

I am using Spark SQL on Databricks, which uses a Hive metastore, and I am trying to set up a job/query that uses quite a few columns (20+).
The time it takes to run the metastore validation checks scales linearly with the number of columns in my query - is there any way to skip this step? Or to pre-compute the checks? Or at least to make the metastore check once per table rather than once per column?
A small example: when I run the below, the metastore check happens once, even before calling display or collect:
new_table = table.withColumn("new_col1", F.col("col1"))
and when I run the below, the metastore check happens multiple times and therefore takes longer:
new_table = (table
    .withColumn("new_col1", F.col("col1"))
    .withColumn("new_col2", F.col("col2"))
    .withColumn("new_col3", F.col("col3"))
    .withColumn("new_col4", F.col("col4"))
    .withColumn("new_col5", F.col("col5"))
)
The metastore checks it's doing look like this in the driver node:
20/01/09 11:29:24 INFO HiveMetaStore: 6: get_database: xxx
20/01/09 11:29:24 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_database: xxx
The view to the user on databricks is:
Performing Hive catalog operation: databaseExists
Performing Hive catalog operation: tableExists
Performing Hive catalog operation: getRawTable
Running command...
I would be interested to know if anyone can confirm that this is just the way it works (a metastore check per column), and whether I simply have to plan for the overhead of the metastore checks.
I am surprised by this behavior as it does not fit with the Spark processing model and I cannot replicate it in Scala. It is possible that it is somehow specific to PySpark but I doubt that since PySpark is just an API for creating Spark plans.
What is happening, however, is that after every withColumn(...) the plan is analyzed. If the plan is large, this can take a while. There is a simple optimization, however: replace multiple withColumn(...) calls for independent columns with df.select(F.col("*"), F.col("col2").alias("new_col2"), ...). In this case, only a single analysis is performed.
In some cases of extremely large plans, we've saved 10+ minutes of analysis for a single notebook cell.
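A minimal PySpark sketch of that rewrite, using the illustrative column names from the question:
from pyspark.sql import functions as F

# Single select instead of five chained withColumn calls: the plan is analyzed only once
new_table = table.select(
    F.col("*"),
    F.col("col1").alias("new_col1"),
    F.col("col2").alias("new_col2"),
    F.col("col3").alias("new_col3"),
    F.col("col4").alias("new_col4"),
    F.col("col5").alias("new_col5"),
)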

Saving a Spark DataFrame created from an Oracle query into a Hive table?

I use Impala/Hive via HUE on a Cloudera platform.
If I pull a table from Hive into a Spark DataFrame via PySpark, I can save it as a different table with something like this:
sdf.write.mode("overwrite").saveAsTable("schema.PythonTest")
Then when I refresh my tables in HUE under either Hive or Impala, I can see the new table there and start writing HQL against it.
However, when I pull data from Oracle into a Spark DataFrame, I get errors when trying that same syntax:
sdf = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:UN/PW!#blah.bleh.com:port/sid") \
    .option("dbtable", mySQL) \
    .option("user", "UN") \
    .option("password", "pw!") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()
I'm at a loss for an explanation. Why would the syntax work when a Hive query pulls data into the sdf, but not when Oracle does?
The SQL for Oracle runs fine, and for testing purposes it is only 2 columns and 2 rows. When I use the type(sdf) function, I can clearly see that I am successfully creating a Spark DataFrame.
Am I missing some settings or steps?
What's the error you're getting while pulling the data from Oracle?
Also, should the format be jdbc?
If this is happening to you:
Make sure you are not stopping and starting the SparkContext. If you are, you are most likely losing the default options that allow the Spark DataFrame to be saved to Hive via saveAsTable.
I restarted my kernel, skipped the cell where I was stopping and starting a new SparkContext, and it worked fine.
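As a hedged illustration of keeping a single Hive-enabled session for the whole job (the connection string, query, and table names below are placeholders, not the poster's actual setup):
from pyspark.sql import SparkSession

# Reuse one Hive-enabled SparkSession instead of stopping and recreating the context
spark = (SparkSession.builder
         .appName("oracle-to-hive")
         .enableHiveSupport()   # keeps the Hive metastore wiring that saveAsTable relies on
         .getOrCreate())

sdf = (spark.read.format("jdbc")
       .option("url", "jdbc:oracle:thin:@//dbhost:1521/sid")   # placeholder connection string
       .option("dbtable", "(SELECT col1, col2 FROM some_table) t")
       .option("user", "UN")
       .option("password", "pw")
       .option("driver", "oracle.jdbc.driver.OracleDriver")
       .load())

sdf.write.mode("overwrite").saveAsTable("schema.PythonTest")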

Result of a BigQuery job running on a table into which data is loaded via the streaming API

I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will later be exported to Google Cloud Storage.)
I'm running this query using the BQ CLI with all_companies as the destination table, and this generates a BQ job (runtime: 20 min+).
The company_* tables are populated constantly via the streaming API.
I've read about BigQuery jobs, but I can't find any information about how they interact with streaming.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
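For reference, the kind of command being described would look roughly like this; project, dataset, and table names are placeholders:
bq query \
    --use_legacy_sql=false \
    --destination_table=mydataset.all_companies \
    --replace \
    'SELECT * FROM `myproject.mydataset.company_*`'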
As described here, the query engine looks at both the columnar storage and the streaming buffer, so the query should potentially see the streamed data.
It depends on what you mean by a runtime of 20 minutes+. If the query is run 20 minutes after you create the job, then all data in the streaming buffer by T0+20min will be included.
If, on the other hand, the job starts immediately and takes 20 minutes to complete, you will only see the data that was in the streaming buffer at the moment the table was queried.
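As a related, hedged tip: whether a table currently has rows in the streaming buffer can be checked from its metadata, for example (dataset and table names are placeholders):
bq show --format=prettyjson mydataset.company_1
# the output includes a "streamingBuffer" section while unflushed streamed rows exist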

How does BigQuery use data stored in Google Cloud Storage?

A very basic question, but I am not able to figure it out. Please help me out.
Q1: When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
Q2: Let's say my data directory is gs://sp2040/raw/cards/cust/ for the customer files. The table structure defined is:
bq mk --time_partitioning_type=DAY market.cust \
custid:string,grp:integer,odate:string
Every day I create a new directory in the bucket, such as 20170101, 20170102, ..., to hold a new dataset. After the data is loaded into the bucket, do I need to run the commands below?
D1:
bq load --source_format=CSV 'market.cust$20170101' \
gs://sp2040/raw/cards/cust/20170101/20170101_cust.csv
D2:
bq load --source_format=CSV 'market.cust$20170102' \
gs://sp2040/raw/cards/cust/20170102/20170102_cust.csv
When we create a BigQuery table using the command below, does the data reside in the same Cloud Storage?
Nope! BigQuery does not use Cloud Storage for storing data (unless it is a federated/external table linked to Cloud Storage).
Check out BigQuery Under the Hood with Tino Tereshko and Jordan Tigani - you will like it.
Do I need to run the commands below?
Yes. You need to load those files into BigQuery so you can query the data.
Yes, you would need to load the data into BigQuery using those commands.
However, there are a couple of alternatives:
Pub/Sub and Dataflow: You could configure Cloud Storage to publish a Pub/Sub notification when files are added, as described here. You could then have a Dataflow job that imports each file into BigQuery. Dataflow documentation
BigQuery external tables: BigQuery can query CSV files stored in Cloud Storage without importing the data, as described here and sketched below. There is wildcard support for filenames, so it can be configured once. Performance might not be as good as storing the data directly in BigQuery.
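A minimal sketch of the external-table option with the Python BigQuery client, assuming the bucket layout and schema from the question (the project and table names are illustrative):
from google.cloud import bigquery

client = bigquery.Client()

# External (federated) table over CSV files that stay in Cloud Storage
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://sp2040/raw/cards/cust/*"]   # single wildcard covers the daily dirs
external_config.schema = [
    bigquery.SchemaField("custid", "STRING"),
    bigquery.SchemaField("grp", "INTEGER"),
    bigquery.SchemaField("odate", "STRING"),
]

table = bigquery.Table("my-project.market.cust_external")   # hypothetical project/table name
table.external_data_configuration = external_config
client.create_table(table)   # no data is copied into BigQuery storage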