Is there a way to use dynamic dataset name in bigquery - google-bigquery

Problem Statement :
I am trying to use the BigQueryOperator in Airflow. The aim is to run the same queries repeatedly while dynamically changing the dataset name, i.e. the dataset name is passed in as a parameter.
example:
project.dataset1_layer1.tablename1, project.dataset2_layer1.tablename1
Expected:
I want to maintain a single copy of the SQL in which the dataset name is passed as a parameter and substituted for each particular dataset.
Error Messages:
I tried to pass the dynamic dataset name as part of query_params, but it failed with the error message below.
The query got parsed as
INFO - Executing: [u'SELECT col1, col2 FROM project.#partner_layer1.tablename']
ERROR - BigQuery job failed. Final error was: {u'reason': u'invalidQuery', u'message': u'Query parameters cannot be used in place of table names at [1:37]', u'location': u'query'}. u'CREATE_IF_NEEDED', u'query': u'SELECT col1, col2 FROM project.#partner_layer1.tablename'}, u'jobType': u'QUERY'}}
Things I have tried so far
The query template temp.sql is as below:
SELECT col1, col2 FROM `project.#partner_layer1.tablename`;
The Airflow BigQueryOperator is used as below:
query_template_dict = {
    'partner_list': ['val1', 'val2', 'val3', 'val4'],
    'google_project': 'project_name',
    'queries': {
        'layer3': {
            'template': 'temp.sql',
            'output_dataset': '_layer3',
            'output_tbl': 'table_{}'.format(table_date),
            'output_tbl_schema': 'temp.txt'
        }
    },
    'applicable_tasks': {
        'val1': {
            'table_layer3': []
        },
        'val2': {
            'table_layer3': []
        },
        'val3': {
            'table_layer3': []
        },
        'val4': {
            'table_layer3': []
        }
    }
}

for partner in query_template_dict['partner_list']:
    # Loop over applicable report queries for a partner
    applicable_tasks = query_template_dict['applicable_tasks'][partner].keys()
    for task in applicable_tasks:
        destination_tbl = '{}.{}{}.{}'.format(query_template_dict['google_project'], partner,
                                              query_template_dict['queries'][task]['output_dataset'],
                                              query_template_dict['queries'][task]['output_tbl'])
        # Actual destination table structure
        # destination_tbl = 'project.partner_layer3.table_20200223'
        run_bq_cmd = BigQueryOperator(
            task_id=partner + '-' + task,
            sql=[query_template_dict['queries'][task]['template']],
            destination_dataset_table=destination_tbl,
            use_legacy_sql=False,
            write_disposition='WRITE_APPEND',
            create_disposition='CREATE_IF_NEEDED',
            allow_large_results=True,
            query_params=[
                {
                    "name": "partner",
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": partner}
                },
                {
                    "name": "batch_date",
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": batch_date}
                }
            ],
            dag=dag,
        )
Can anybody help me with this issue?
Is there a limitation in BigQuery on passing dataset names dynamically?

Replace the dataset name in Airflow, not in BigQuery. Query parameters can only stand in for values, never for table or dataset identifiers, which is exactly what the error message says.
So do the substitution before the query is sent to BigQuery - use Python string replacement within Airflow.
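A minimal sketch of that approach, building on the loop above (the {partner} placeholder and the temp.sql path are illustrative assumptions, not from the original post):
# Sketch: render the dataset name into the SQL in Python, then hand the
# finished string to the operator. Assumes the template contains a literal
# {partner} placeholder, e.g.
#   SELECT col1, col2 FROM `project.{partner}_layer1.tablename`
with open('temp.sql') as f:
    sql_template = f.read()

rendered_sql = sql_template.format(partner=partner)

run_bq_cmd = BigQueryOperator(
    task_id=partner + '-' + task,
    sql=rendered_sql,                  # already-rendered SQL, no query parameter for the dataset
    destination_dataset_table=destination_tbl,
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    create_disposition='CREATE_IF_NEEDED',
    allow_large_results=True,
    # keep query_params only for real values such as batch_date, never for identifiers
    dag=dag,
)
Since sql is a templated field on the operator, an equivalent alternative should be to leave a Jinja placeholder such as {{ params.partner }} in the file and pass params={'partner': partner} to the operator.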

Related

Apache Superset with MongoDB (NoSQL database)

I am using MongoDB. My task is to build dashboard charts for the data, so I am using Apache Superset. I connected MongoDB to Apache Drill, since it won't connect directly to Superset, and then connected Apache Drill to Apache Superset. My collection is nested. How can I process this nested data so it can be used for dashboard charts? My data looks as below:
{
    "_id": {
        "$oid": "6229d3cfdbfc81a8777e4821"
    },
    "jobs": [
        {
            "job_ID": {
                "$oid": "62289ded8079821eb24760e0"
            },
            "New": false,
            "Expired": false
        },
        {
            "job_ID": {
                "$oid": "6228a252fb4554dd5c48202a"
            },
            "New": true,
            "Expired": true
        },
        {
            "job_ID": {
                "$oid": "622af1c391b290d34701af9f"
            },
            "New": true,
            "Expired": false
        }
    ],
    "email": "mani2090996#ail.com"
}
I am querying in Apache Drill as follows:
SELECT flat.fill FROM (SELECT FLATTEN(t.jobs) AS fill FROM mongo.recruitingdb.flatten.`Vendorjobs` t) flat WHERE flat.fill.New = flase;
And I am getting a parsing error:
org.apache.drill.common.exceptions.UserRemoteException: PARSE ERROR: Encountered "." at line 1, column 123.
Superset doesn't really handle nested data very well. Drill does however, so you'll have to craft queries to produce columns that can be visualized.
Take a look here: https://drill.apache.org/docs/json-data-model/
and here: https://drill.apache.org/docs/querying-complex-data-introduction/.
UPDATE:
Try the query below. The FROM clause may not be exactly right, but you should get the idea from this.
Note that you can access maps in Drill in two ways:
tablename.mapname.field OR
mapname['field']
You can do this for any level of nesting.
SELECT mongoTable.jobs.job_ID.`$oid` AS job_ID,
mongoTable.jobs.`New` AS new,
mongoTable.jobs.`Expired` AS expired
FROM
(
SELECT flatten(jobs) AS jobs
FROM mongo.recruitingdb.flatten.`Vendorjobs` AS t1
WHERE t1.jobs.New = false
) AS mongoTable

Query for entire JSON document in nested JSON schema

Background:
I wish to locate the entire JSON document that has a condition where "state" = "new" and where length(Features.id) > 4
{
    "id": "123",
    "feedback": {
        "Features": [
            {
                "state": "new",
                "id": "12345"
            }
        ]
    }
}
This is what I have tried to do:
Since this is a nested document, my query looks like the one below.
A Stack Overflow member has helped me access the nested contents within the query, but is there a way to obtain the full document?
I have used:
SELECT VALUE t.id FROM t IN f.feedback.Features where t.state = 'new' and length(t.id)>4
This will give me the ids. What I want is access to the full document, shown above, whenever this condition matches.
Any help is appreciated
Try this
SELECT *
FROM f
WHERE
f.feedback.Features[0].state = 'new'
AND length(f.feedback.Features[0].id)>4
Here is the SELECT spec for CosmosDB for more details
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-select
Also, check out "working with JSON" in CosmosDB notes
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-working-with-json
If the Features array has more than one value, you can use the EXISTS clause to search within them. See the specs of EXISTS here, with examples:
https://learn.microsoft.com/en-us/azure/cosmos-db/sql-query-subquery#exists-expression
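For instance, a rough sketch of the EXISTS form wrapped in the Python azure-cosmos SDK (the endpoint, key, database and container names are placeholders, not from the original question):
from azure.cosmos import CosmosClient

# Placeholders: substitute your own endpoint, key, database and container names.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("<database>").get_container_client("<container>")

# EXISTS matches any element of the Features array, while the outer
# SELECT VALUE f returns the entire document rather than just the ids.
query = """
SELECT VALUE f
FROM f
WHERE EXISTS (
    SELECT VALUE t
    FROM t IN f.feedback.Features
    WHERE t.state = 'new' AND LENGTH(t.id) > 4
)
"""

for doc in container.query_items(query=query, enable_cross_partition_query=True):
    print(doc)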

SQL query as string in JSON file

Is it possible to store SQL queries in a JSON array of objects? Because when I have something like this:
[{
"id": "1",
"query": "SELECT ID FROM table"
},
{
"id": "2",
"query": "SELECT ID FROM table"
},
{
"id": "3",
"query": "SELECT USER FROM table"
}
]
The JSON file in VSCode is OK with no errors; it gets nasty when I want to store complex queries with joins etc.
For example, this query, even if I format it correctly, will generate a formatting error in the JSON file
(just an example, I know it is not valid):
SELECT user, id, , count(price) as numrev
FROM price
where id = 1 and user = 0
group by user, id, price
The complaint is that it can't be stored in a string.
It is easy to do, but requires one extra step.
Simply convert/encode your raw SQL queries as base64 text.
Decode the text before you execute the queries in your code.
This also works if the JSON file is created automatically by a program:
almost all programming languages provide base64 encode/decode functions as part of their core library; if not, download a compatible package/library to achieve this automation.
var queries = [{
"id": "1",
"query": "U0VMRUNUIElEIEZST00gdGFibGU="
},
{
"id": "2",
"query": "U0VMRUNUIElEIEZST00gdGFibGU="
},
{
"id": "3",
"query": "U0VMRUNUIFVTRVIgRlJPTSB0YWJsZQ=="
},
{
"id": "4",
"query": "U0VMRUNUIHVzZXIsIGlkLCAsIGNvdW50KHByaWNlKSBhcyBudW1yZXYKICBGUk9NIHByaWNlCiAgd2hlcmUgaWQgPSAxIGFuZCB1c2VyID0gMCAKICBncm91cCBieSB1c2VyLCBpZCwgcHJpY2U="
}
];
for (i = 0; i < queries.length; i++) {
console.log("id = " + queries[i].id + ", query = " + atob(queries[i].query));
}
When you parse your JSON array, make sure to decode before you execute the SQL queries.
Let me know if this one helped you. ☺
FYI, refer to http://www.utilities-online.info/base64/
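For completeness, the same encode/decode round trip sketched in Python with the standard base64 module (the query text is adapted from the example in the question):
import base64
import json

# Encode a raw SQL query before writing it into the JSON file ...
raw_sql = """SELECT user, id, count(price) AS numrev
FROM price
WHERE id = 1 AND user = 0
GROUP BY user, id, price"""
encoded = base64.b64encode(raw_sql.encode("utf-8")).decode("ascii")
entry = json.dumps({"id": "4", "query": encoded})

# ... and decode it again right before executing it.
decoded_sql = base64.b64decode(json.loads(entry)["query"]).decode("utf-8")
print(decoded_sql)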

Nested pymongo queries (mlab)

I have some documents in mlab mongodb; the format is:
{
    "_id": {
        "$oid": "58aeb1d074fece33edf2b356"
    },
    "sensordata": {
        "operation": "chgstatus",
        "user": {
            "status": "0",
            "uniqueid": "191b117fcf5c"
        }
    },
    "created_date": {
        "$date": "2017-02-23T15:26:29.840Z"
    }
}
database name : mparking_sensor
collection name : sensor
I want to query in Python and extract only the status key-value pair and the created_date key-value pair.
My Python code is:
import sys
import pymongo

uri = 'mongodb://thorburn:tekush1!#ds157529.mlab.com:57529/mparking_sensor'
client = pymongo.MongoClient(uri)
db = client.get_default_database().sensor
print db
results = db.find()
for record in results:
    print(record["sensordata"], record['created_date'])
    print()
client.close()
which gives me everything under sensordata as expected, but dot notation gives me an error. Can somebody help?
PyMongo represents BSON documents as Python dictionaries, and subdocuments as dictionaries within dictionaries. To access a value in a nested dictionary:
record["sensordata"]["user"]["status"]
So a complete print statement might be:
print("%s %s" % (record["sensordata"]["user"]["status"], record['created_date']))
That prints:
0 {'$date': '2017-02-23T15:26:29.840Z'}
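Since the question only asks for status and created_date, a projection can also limit what is fetched in the first place. A minimal sketch, reusing the db collection object from the question's code:
# Fetch only the two fields the question asks for, via a projection.
results = db.find(
    {},
    {"sensordata.user.status": 1, "created_date": 1, "_id": 0}
)
for record in results:
    status = record["sensordata"]["user"]["status"]
    created = record["created_date"]
    print("%s %s" % (status, created))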

AWS: Other functions than COPY when transferring data from S3 to Redshift with amazon-data-pipeline

I'm trying to transfer data from Amazon S3 to Amazon Redshift with the AWS Data Pipeline tool.
Is it possible to change the data while transferring it, e.g. with an SQL statement, so that only the results of that SQL statement become the input into Redshift?
I only found the copy command, like:
{
"id": "S3Input",
"type": "S3DataNode",
"schedule": {
"ref": "MySchedule"
},
"filePath": "s3://example-bucket/source/inputfile.csv"
},
Source: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-get-started-copy-data-cli.html
Yes, it is possible. There are two approaches to it:
Use transformSql of RedshiftCopyActivity
transformSql is useful if the transformation is applied within the scope of the records that are being loaded on a schedule, e.g. every day or hour. That way changes are applied only to the batch and not to the whole table.
Here is an excerpt from the documentation:
transformSql: The SQL SELECT expression used to transform the input data. When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called staging and initially loads it in there. Data from this table is used to update the target table. If the transformSql option is specified, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated in the final target table. So transformSql must be run on the table named staging and the output schema of transformSql must match the final target table's schema.
Please find an example of the usage of transformSql below. Notice that the select is from the staging table. It will effectively run CREATE TEMPORARY TABLE staging2 AS SELECT <...> FROM staging;. Also, all fields must be included and must match the existing table in the Redshift DB.
{
    "id": "LoadUsersRedshiftCopyActivity",
    "name": "Load Users",
    "insertMode": "OVERWRITE_EXISTING",
    "transformSql": "SELECT u.id, u.email, u.first_name, u.last_name, u.admin, u.guest, CONVERT_TIMEZONE('US/Pacific', u.created_at_pst) AS created_at_pst, CONVERT_TIMEZONE('US/Pacific', u.updated_at_pst) AS updated_at_pst FROM staging u;",
    "type": "RedshiftCopyActivity",
    "runsOn": {
        "ref": "OregonEc2Resource"
    },
    "schedule": {
        "ref": "HourlySchedule"
    },
    "input": {
        "ref": "OregonUsersS3DataNode"
    },
    "output": {
        "ref": "OregonUsersDashboardRedshiftDatabase"
    },
    "onSuccess": {
        "ref": "LoadUsersSuccessSnsAlarm"
    },
    "onFail": {
        "ref": "LoadUsersFailureSnsAlarm"
    },
    "dependsOn": {
        "ref": "BewteenRegionsCopyActivity"
    }
}
Use the script of SqlActivity
SqlActivity allows operations on the whole dataset and can be scheduled to run after particular events through the dependsOn mechanism:
{
    "name": "Add location ID",
    "id": "AddCardpoolLocationSqlActivity",
    "type": "SqlActivity",
    "script": "INSERT INTO locations (id) SELECT 100000 WHERE NOT EXISTS (SELECT * FROM locations WHERE id = 100000);",
    "database": {
        "ref": "DashboardRedshiftDatabase"
    },
    "schedule": {
        "ref": "HourlySchedule"
    },
    "output": {
        "ref": "LocationsDashboardRedshiftDatabase"
    },
    "runsOn": {
        "ref": "OregonEc2Resource"
    },
    "dependsOn": {
        "ref": "LoadLocationsRedshiftCopyActivity"
    }
}
There is an optional field in RedshiftCopyActivity called 'transformSql'.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
I have not personally used this, but from the looks of it, you treat your S3 data as sitting in a temp table and this SQL statement returns the transformed data for Redshift to insert.
So you will need to list all fields in the select, whether or not you are transforming that field.
AWS Datapipeline SqlActivity
{
"id" : "MySqlActivity",
"type" : "SqlActivity",
"database" : { "ref": "MyDatabase" },
"script" : "insert into AnalyticsTable (select (cast(requestEndTime as bigint) - cast(requestBeginTime as bigint)) as requestTime, hostname from StructuredLogs where hostname LIKE '%.domain.sfx');",
"schedule" : { "ref": "Hour" },
"queue" : "priority"
}
So basically, in "script" you can put any SQL script/transformations/commands (see Amazon Redshift SQL Commands).
transformSql is fine, but it supports only "The SQL SELECT expression used to transform the input data" (ref: RedshiftCopyActivity).