Data to Native table - sql

I want to move data from my external tables to native tables.
Is there a simple way of doing this?
Thank you.

A better approach would be to create the native tables directly from the files in Google Cloud Storage, since external tables are read-only and cannot be modified in place. That said, here is code for your use case, i.e. moving data from multiple external tables to native tables.
I am using the Python client library, so you need to follow this document to install the library and set up authentication before running the code. If you grant BigQuery-specific roles to the service account, you will also need to add the storage.objectViewer role to it, since we are querying external tables. My external tables (along with many other tables) are in dataset "Dataset1" and I am copying them to another dataset, "Dataset2", which I have already created from the console. Please replace ProjectID, Dataset1 and Dataset2 with suitable values.
Code:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# This query returns the names of only the external tables in the dataset.
query1 = """
SELECT table_name
FROM `ProjectID.Dataset1.INFORMATION_SCHEMA.TABLES`
WHERE table_type = 'EXTERNAL'
"""
query_job = client.query(query1)

for row in query_job:
    # Row values can be accessed by field name or index.
    table_name = row[0]
    # Select everything from each table whose name was returned by query1.
    query2 = f"SELECT * FROM `ProjectID.Dataset1.{table_name}`"
    # Create the table in the other dataset with the same name it had in the first dataset.
    destination = f"ProjectID.Dataset2.{table_name}"
    job_config = bigquery.QueryJobConfig(destination=destination)
    query_job2 = client.query(query2, job_config=job_config)
    query_job2.result()
After I run the code, the external tables are copied to Dataset2 as native tables.
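For completeness, here is a rough sketch of the alternative mentioned at the top: loading the files behind an external table straight from Cloud Storage into a native table. The bucket path, file format and table names below are placeholders, so adjust them to your setup.
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder load configuration; switch source_format to match your files.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,
)

# Load directly from the GCS URIs into a native table (placeholder names).
load_job = client.load_table_from_uri(
    "gs://your-bucket/path/*.csv",
    "ProjectID.Dataset2.native_table",
    job_config=job_config,
)
load_job.result()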

Related

How to dynamically copy multiple datasets from a Google BigQuery project using Azure Synapse Analytics

Is it possible to dynamically copy all datasets from a BigQuery project to Azure Synapse Analytics, and then dynamically copy all tables within each dataset? I know we can dynamically copy all tables within a BigQuery dataset (see this answered question: Loop over of table names ADFv2), but is there a way to do it at the project level, using the Lookup activity to loop through all datasets? Is there a way to do a SELECT * over the datasets?
SELECT
*
FROM
gcp_project_name.dataset_name.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
According to Microsoft's documentation for the Lookup activity in Azure Data Factory and Azure Synapse Analytics, this only reaches the dataset level.
I also tried putting the GCP project name into the Lookup activity's query, but it did not work; see Understanding the "Not found: Dataset ### was not found in location US" error.
This can be done using a two-level pipeline. I tried to reproduce this, and below is the approach.
Add a Lookup activity and set the Google BigQuery dataset as its source. In the Query text box, enter the query below.
SELECT schema_name
FROM `project_name`.INFORMATION_SCHEMA.SCHEMATA
This query will list the datasets in the project.
Add a ForEach activity after the Lookup activity. In the ForEach activity's Items setting, enter @activity('Lookup1').output.value as dynamic content.
Then, inside the ForEach activity, add another Lookup activity with the same BigQuery dataset as its source. Enter the query below as dynamic content.
SELECT
*
FROM
gcp_project_name.@{item().schema_name}.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE'
This will give a list of all tables within each dataset.
Since you cannot nest a ForEach inside another ForEach in ADF, design a two-level pipeline in which an outer pipeline with the outer ForEach loop invokes an inner pipeline containing the nested loop.
Refer to NiharikaMoola-MT's answer on this SO thread for nested ForEach in ADF.

How can I save a pyspark dataframe in databricks to SAP Hana (SAP Data Warehouse Cloud) using hdbcli?

I need to push some data from Databricks on AWS to SAP Data Warehouse Cloud, and have been encouraged to use the Python hdbcli package (https://pypi.org/project/hdbcli/). The only documentation I have been able to find is the one on PyPI, which is quite sparse. I can see an example of how to push individual rows to a SQL table, but I have found no examples of how to save a PySpark dataframe to a table in SAP Data Warehouse Cloud.
Documentation examples:
sql = 'INSERT INTO T1 (ID, C2) VALUES (:id, :c2)'
cursor = conn.cursor()
id = 3
c2 = "goodbye"
cursor.execute(sql, {"id": id, "c2": c2})
# returns True
cursor.close()
I have tried the following in my Databricks notebook:
df.createOrReplaceTempView("final_result_local")
sql = "INSERT INTO final_result SELECT * FROM final_result_local"
cursor.execute(sql)
cursor.close()
After this I got the following error:
invalid table name: Could not find table/view FINAL_RESULT_LOCAL in
schema DATABRICKS_SCHEMA
It seems df.createOrReplaceTempView created the SQL table in a different context from the one used by hdbcli, and I don't know how to push the local table to SAP Data Warehouse Cloud. Any help would be much appreciated.
You should consider using the Python machine learning client for SAP HANA (hana-ml). You can think of it as an abstraction layer on top of hdbcli. The central object for sending and retrieving data is the HANA dataframe, which behaves similarly to a Pandas dataframe but is persisted on the database side (i.e. it can be a table).
For your scenario, you should be able to create a HANA dataframe, and thus a table, using the function create_dataframe_from_spark() (see documentation).
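For example, a rough sketch (the host, credentials and table name below are placeholders, and the exact keyword arguments of create_dataframe_from_spark() can differ between hana-ml versions, so check the linked documentation):
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_spark

# Placeholder connection details for the SAP Data Warehouse Cloud instance.
cc = ConnectionContext(address="<dwc_host>", port=443,
                       user="<user>", password="<password>", encrypt=True)

# df is the PySpark dataframe from the question; this persists it as a table
# (placeholder name FINAL_RESULT) and returns a HANA dataframe pointing at it.
# The table_name/force keywords are assumptions based on the hana-ml docs.
hana_df = create_dataframe_from_spark(cc, df, table_name="FINAL_RESULT", force=True)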
Regarding the direct use of hdbcli, you can find the full documentation here (also linked on PyPi).
I disagree with using hdbcli for this. Instead, look into connecting from Spark directly; this instruction should be helpful.
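For instance, a minimal sketch of writing the dataframe over JDBC with the SAP HANA JDBC driver (ngdbc), assuming the driver jar is available on the cluster; the host, credentials and target table are placeholders:
# Write the PySpark dataframe df to a HANA table over JDBC.
(df.write
   .format("jdbc")
   .option("driver", "com.sap.db.jdbc.Driver")
   .option("url", "jdbc:sap://<dwc_host>:443/?encrypt=true")  # placeholder host
   .option("dbtable", "DATABRICKS_SCHEMA.FINAL_RESULT")       # placeholder schema.table
   .option("user", "<user>")
   .option("password", "<password>")
   .mode("append")
   .save())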

Bigquery - browsing table data on table of type view

I want to use the preview/head feature of BigQuery to see sample data of a table free of charge,
as described here, and to do so I tried using the Python API listed here:
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the table to browse data rows.
# table_id = "your-project.your_dataset.your_table_name"
# Download all rows from a table.
rows_iter = client.list_rows(table_id) # Make an API request.
# Iterate over rows to make the API requests to fetch row data.
rows = list(rows_iter)
which results in:
BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/your-project/datasets/your_dataset/tables/your_table_name/data?formatOptions.useInt64Timestamp=True&prettyPrint=false: Cannot list a table of type VIEW.
Is there a way to preview a table of type view?
Is there another free alternative?
You cannot use the tabledata.list JSON API method (the API being used here) to retrieve data from a view; this is a limitation of views. So the only way is to preview the data on the original base table.
I assume you could write the contents of the view into a temporary table and preview that instead. Not the cleanest solution, I agree.
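A rough sketch of that idea (the table names are placeholders, and note that the query that materializes the view is itself billed; only the subsequent preview via list_rows is free):
from google.cloud import bigquery

client = bigquery.Client()

# Materialize the view into a regular table (placeholder names).
dest = "your-project.your_dataset.view_copy"
job_config = bigquery.QueryJobConfig(destination=dest, write_disposition="WRITE_TRUNCATE")
client.query("SELECT * FROM `your-project.your_dataset.your_view`", job_config=job_config).result()

# Preview the copy; tabledata.list works on a regular table.
for row in client.list_rows(dest, max_results=10):
    print(dict(row))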

Can I issue a query rather than specify a table when using the BigQuery connector for Spark?

I have used the Use the BigQuery connector with Spark guide to extract data from a table in BigQuery by running the code on Google Dataproc. As far as I'm aware, the code shared there:
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Output Parameters.
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
copies the entirety of the named table into input_directory. The table I need to extract data from contains more than 500 million rows, and I don't need all of them. Is there a way to issue a query instead (as opposed to specifying a table) so that I can copy only a subset of the data from the table?
It doesn't look like BigQuery supports any kind of filtering/querying for table exports at the moment:
https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract
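As a possible workaround (a sketch only, with placeholder staging table and dataset names): materialize the subset you need into a staging table with a query job, then point the connector's mapred.bq.input.* settings at that staging table.
from google.cloud import bigquery

client = bigquery.Client()

# Materialize only the rows you need into a staging table (placeholder names).
staging = "your-project.staging_dataset.shakespeare_subset"
job_config = bigquery.QueryJobConfig(destination=staging, write_disposition="WRITE_TRUNCATE")
client.query(
    "SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare` WHERE word_count > 100",
    job_config=job_config,
).result()

# Then read the staging table from Spark instead of the full table:
# 'mapred.bq.input.project.id': 'your-project',
# 'mapred.bq.input.dataset.id': 'staging_dataset',
# 'mapred.bq.input.table.id': 'shakespeare_subset',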

Google BQ: Running Parameterized Queries where Parameter Variable is the BQ Table Destination

I am trying to run a SQL script from the Linux command line for a BQ table destination. This SQL script will be used for multiple dates, clients, and BQ table destinations, so this requires using parameters in my BQ command-line calls (the --parameter flag). Now, I have followed this link to learn about parameterized queries: https://cloud.google.com/bigquery/docs/parameterized-queries , but it is of limited help when it comes to declaring a table name.
My SQL script, called Advertiser_Date_Check.sql, is the following:
#standardSQL
SELECT *
FROM (SELECT *
FROM @variable_table
WHERE CAST(_PARTITIONTIME AS DATE) = @variable_date) as final
WHERE final.Advertiser IN UNNEST(@variable_clients)
Where the parameter variables represent the following:
variable_table: The BQ Table destination that I want to call
variable_date: The Date that I want to pull from the BQ table
variable_clients: An Array list of specific clients that I want to pull from the data (which is from the date I referenced)
Now, my command line (Linux) for the BQ data is the following:
TABLE_NAME=table_name_example
BQ_TABLE=$(echo '`project_id.dataset_id.'$TABLE_NAME'`')
TODAY=$(date +%F)
/bin/bq query --use_legacy_sql=false \
--parameter='variable_table::'$BQ_TABLE'' \
--parameter=variable_date::"$TODAY" \
--parameter='variable_clients:ARRAY<STRING>:["Client_1","Client_2","Client_3"]' \
"`cat /path/to/script/Advertiser_Date_Check.sql`"
The parameters @variable_date and @variable_clients have worked just fine in the past when they were the only ones. However, since I want to run this exact SQL command on various tables in a loop, I created a parameter called variable_table. Parameterized queries have to be in standard SQL format, so the table name needs to follow this convention:
`project_id.dataset_id.table_name`
Whenever I try to run this on the command line, I usually get the following error:
Error in query string: Error processing job ... : Syntax error: Unexpected "@" at [4:12]
which references the parameter @variable_table, so it is having a hard time processing that this refers to a table name.
In past attempts, there has even been the error:
project_id.dataset_id.table_name: command not found
but that was mostly due to a poorly formed reference to the table destination name. The first error is the most common occurrence.
Overall, my questions regarding this matter are:
How do I reference a BQ table as a parameter in the FROM clause of a parameterized query on the command line (as I try to do with @variable_table)? Is it even possible?
Do you know of other methods for running a query on multiple BQ tables from the command line besides the way I am currently doing it?
Hope this all makes sense and thank you for your assistance!
From the documentation that you linked:
Parameters cannot be used as substitutes for identifiers, column names, table names, or other parts of the query.
I think what might work for you in this case, though, is injecting the table name as a regular shell variable (instead of a query parameter). You'd want to make sure that you trust its contents, or that you build the string yourself, in order to avoid SQL injection. One approach is to have hard-coded constants for the table names and then choose which one to insert into the query text based on the user input.
I thought I would just post my example here, which only covers your question about creating a "dynamic table name", but you can also use my approach for your other variables. My approach is to do this operation directly in Python just before making the BigQuery API call, by leveraging Python's datetime module (assuming you want your variables to be time-based).
Create a BigQuery table via the Python BQ API:
from google.colab import auth
from datetime import datetime
from google.cloud import bigquery
auth.authenticate_user()
now = datetime.now()
current_time = now.strftime("%Y%m%d%H%M")
project_id = '<project_id>'
client = bigquery.Client(project=project_id)
table_id = "<project_id>.<dataset_id>.table_"
table_id = table_id + current_time
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """
SELECT
dataset_id,
project_id,
table_id,
CASE
WHEN type = 1 THEN 'table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external'
ELSE '?'
END AS type,
DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date,
TIMESTAMP_MILLIS(creation_time) AS creation_time,
row_count,
size_bytes,
round(safe_divide(size_bytes, (1000*1000)),1) as size_mb,
round(safe_divide(size_bytes, (1000*1000*1000)),3) as size_gb
FROM (select * from `<project_id>.<dataset_id>.__TABLES__`)
ORDER BY dataset_id, table_id asc;
"""
query_job = client.query(sql, job_config=job_config)
query_job.result()
print("Query results loaded to the table {}".format(table_id))
# Output:
# Query results loaded to the table <project_id>.<dataset_id>.table_202101141450
Feel free to copy and test it within a Google Colab notebook. Just fill in your own:
<project_id>
<dataset_id>