I have tried using the BigQueryInsertJobOperator in Airflow to insert data into a table. It works completely fine for 1-2 rows.
My SQL returns 100 million rows; the operator's task runs to completion with a successful status and even creates the table, but the table is empty.
task2 = BigQueryInsertJobOperator(
    task_id="insert_query_job",
    gcp_conn_id='xxxxxxxxx',
    configuration={
        "query": {
            "query": my_sql_query,
            "useLegacySql": False,
            "destinationTable": {
                "projectId": "xxxxxxxx",
                "datasetId": 'yyyyyyyy',
                "tableId": 'zzzzzzzzz'
            },
            "allowLargeResults": True
        }
    }
)
Problem Statement:
I am trying to use the BigQueryOperator in Airflow. The aim is to reuse the same queries while dynamically changing the dataset name, i.e. the dataset name is passed in as a parameter.
example:
project.dataset1_layer1.tablename1, project.dataset2_layer1.tablename1
Expected:
I want to maintain a single copy of the SQL in which dataset names can be passed as parameters and replaced for each particular dataset.
Error Messages:
I tried to pass the dynamic dataset name as part of query_params, but it failed with the error message below. The query was parsed as:
INFO - Executing: [u'SELECT col1, col2 FROM project.#partner_layer1.tablename']
ERROR - BigQuery job failed. Final error was: {u'reason': u'invalidQuery', u'message': u'Query parameters cannot be used in place of table names at [1:37]', u'location': u'query'}. u'CREATE_IF_NEEDED', u'query': u'SELECT col1, col2 FROM project.#partner_layer1.tablename'}, u'jobType': u'QUERY'}}
Things I have tried so far
The query template temp.sql is as below:
SELECT col1, col2 FROM `project.#partner_layer1.tablename`;
The Airflow BigQueryOperator is used as below:
query_template_dict = {
    'partner_list': ['val1', 'val2', 'val3', 'val4'],
    'google_project': 'project_name',
    'queries': {
        'table_layer3': {
            'template': 'temp.sql',
            'output_dataset': '_layer3',
            'output_tbl': 'table_{}'.format(table_date),
            'output_tbl_schema': 'temp.txt'
        }
    },
    'applicable_tasks': {
        'val1': {
            'table_layer3': []
        },
        'val2': {
            'table_layer3': []
        },
        'val3': {
            'table_layer3': []
        },
        'val4': {
            'table_layer3': []
        }
    }
}
for partner in query_template_dict['partner_list']:
    # Loop over applicable report queries for a partner
    applicable_tasks = query_template_dict['applicable_tasks'][partner].keys()
    for task in applicable_tasks:
        destination_tbl = '{}.{}{}.{}'.format(query_template_dict['google_project'], partner,
                                              query_template_dict['queries'][task]['output_dataset'],
                                              query_template_dict['queries'][task]['output_tbl'])
        # Actual destination table structure
        # destination_tbl = 'project.partner_layer3.table_20200223'
        run_bq_cmd = BigQueryOperator(
            task_id=partner + '-' + task,
            sql=[query_template_dict['queries'][task]['template']],
            destination_dataset_table=destination_tbl,
            use_legacy_sql=False,
            write_disposition='WRITE_APPEND',
            create_disposition='CREATE_IF_NEEDED',
            allow_large_results=True,
            query_params=[
                {
                    "name": "partner",
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": partner}
                },
                {
                    "name": "batch_date",
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": batch_date}
                }
            ],
            dag=dag,
        )
Can anybody help me with this issue?
Is there a limitation in BigQuery to dynamically pass dataset names?
Replace the dataset name in Airflow, not in BigQuery.
So do this before the query is sent to BigQuery - use Python string replacement within Airflow.
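For example, here is a minimal sketch of that approach reusing the names from the question (temp.sql, the #partner placeholder, query_template_dict, destination_tbl and dag are all assumed to exist as above; the file handling itself is only illustrative):
# Render the shared template per partner before handing plain SQL to the operator.
with open('temp.sql') as f:
    sql_template = f.read()  # SELECT col1, col2 FROM `project.#partner_layer1.tablename`;

for partner in query_template_dict['partner_list']:
    rendered_sql = sql_template.replace('#partner', partner)

    run_bq_cmd = BigQueryOperator(
        task_id=partner + '-table_layer3',
        sql=rendered_sql,                       # plain SQL string, no query_params needed for the table name
        destination_dataset_table=destination_tbl,  # built per partner as in the question
        use_legacy_sql=False,
        write_disposition='WRITE_APPEND',
        create_disposition='CREATE_IF_NEEDED',
        allow_large_results=True,
        dag=dag,
    )
With the dataset name substituted in Python, the rendered statement reaches BigQuery as plain SQL, so the "query parameters cannot be used in place of table names" error does not apply.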
We have a number of .gs files in an Apps Script project which schedule a number of BigQuery SQL queries into tables.
Everything was fine till a few days ago when one table started not updating correctly. We've looked into the query history, and seen that one of our tables hasn't been updated in quite a while. When we run the Apps Script responsible for that table, and check the BigQuery query history, it actually is running a different query, even though the script is valid and references different source and destination tables.
Our scripts mostly look like the below:
function table_load_1() {
  var configuration = {
    "query": {
      "useQueryCache": false,
      "destinationTable": {
        "projectId": "project",
        "datasetId": "schema",
        "tableId": "destination_table"
      },
      "writeDisposition": "WRITE_TRUNCATE",
      "createDisposition": "CREATE_IF_NEEDED",
      "allowLargeResults": true,
      "useLegacySql": false,
      "query": "select * from `project.schema.source_table` "
    }
  };

  var job = {
    "configuration": configuration
  };

  var jobResult = BigQuery.Jobs.insert(job, "project");
  Logger.log(jobResult);
}
Any idea why this would be happening?
Given there is a Partitioned table in BigQuery, is it possible to make it non-partitioned?
AFAIK there is no feature/function to unpartition a partitioned table. However, there's nothing stopping you doing a select * from partitioned_table and writing the results to a new (non-partitioned) table. With this approach you'll of course take a hit on query cost.
Another way could be to export your table(s) to GCS and then load the exported file(s) back in. Loading doesn't cost anything, so you'd only pay for the brief amount of time the files are stored in GCS.
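For the export/load route, a rough sketch with the google-cloud-bigquery Python client might look like the following (the project, dataset, table and bucket names are placeholders, and Avro as the export format is just one possible choice):
from google.cloud import bigquery

client = bigquery.Client(project="xxx")

# Export the partitioned table to GCS (free of charge)...
extract_job = client.extract_table(
    "xxx.yyy.your_partitioned_table",
    "gs://your-bucket/export/part-*.avro",
    job_config=bigquery.ExtractJobConfig(destination_format="AVRO"),
)
extract_job.result()

# ...then load the files back into a new, non-partitioned table (also free);
# only the brief GCS storage time is billed.
load_job = client.load_table_from_uri(
    "gs://your-bucket/export/part-*.avro",
    "xxx.yyy.your_new_non_partitioned_table",
    job_config=bigquery.LoadJobConfig(source_format="AVRO"),
)
load_job.result()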
Is it possible to make a Partitioned table Non-partitioned in BigQuery?
It is possible!! Quick and Free ($0.00)
Step 1 - Create a new non-partitioned table with exactly the same schema as your source table and no rows
#standardSQL
SELECT *
FROM `xxx.yyy.your_partitioned_table`
WHERE FALSE
Run the above with the destination table set to xxx.yyy.your_new_NON_PARTITIONED_table.
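If you are scripting this rather than using the console, step 1 could look roughly like the following with the Python client (a sketch; the google-cloud-bigquery library and the placeholder names are assumptions):
from google.cloud import bigquery

client = bigquery.Client(project="xxx")

# Step 1: materialize an empty, non-partitioned table with the source schema.
job_config = bigquery.QueryJobConfig(
    destination="xxx.yyy.your_new_NON_PARTITIONED_table",
    create_disposition="CREATE_IF_NEEDED",
)
client.query(
    "SELECT * FROM `xxx.yyy.your_partitioned_table` WHERE FALSE",
    job_config=job_config,
).result()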
Step 2 - Invoke a copy job with WRITE_APPEND as the writeDisposition
Your copy config will look like the one below:
{
  "configuration": {
    "copy": {
      "sourceTable": {
        "projectId": "xxx",
        "datasetId": "yyy",
        "tableId": "your_partitioned_table"
      },
      "destinationTable": {
        "projectId": "xxx",
        "datasetId": "yyy",
        "tableId": "your_new_NON_PARTITIONED_table"
      },
      "createDisposition": "CREATE_IF_NEEDED",
      "writeDisposition": "WRITE_APPEND"
    }
  }
}
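The same copy (step 2) driven from the Python client would be roughly as below, again just a sketch with the placeholder names:
from google.cloud import bigquery

client = bigquery.Client(project="xxx")

# Step 2: copy jobs are free; WRITE_APPEND writes all partitions into the
# already-created non-partitioned table.
client.copy_table(
    "xxx.yyy.your_partitioned_table",
    "xxx.yyy.your_new_NON_PARTITIONED_table",
    job_config=bigquery.CopyJobConfig(write_disposition="WRITE_APPEND"),
).result()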
Mission accomplished!
Note: Both steps are free of charge!
I'm trying to transfer data from Amazon S3 to Amazon Redshift with the AWS Data Pipeline tool.
Is it possible to transform the data during the transfer, e.g. with an SQL statement, so that only the results of that SQL statement become the input to Redshift?
I have only found the plain copy setup, like:
{
  "id": "S3Input",
  "type": "S3DataNode",
  "schedule": {
    "ref": "MySchedule"
  },
  "filePath": "s3://example-bucket/source/inputfile.csv"
},
Source: https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-get-started-copy-data-cli.html
Yes, it is possible. There are two approaches to it:
Use transformSql of RedshiftCopyActivity
transformSql is useful if the transformation is performed within the scope of the records that are being loaded on a regular basis, e.g. every day or hour. That way changes are applied only to the batch and not to the whole table.
Here is an excerpt from the documentation:
transformSql: The SQL SELECT expression used to transform the input data. When you copy data from DynamoDB or Amazon S3, AWS Data Pipeline creates a table called staging and initially loads it in there. Data from this table is used to update the target table. If the transformSql option is specified, a second staging table is created from the specified SQL statement. The data from this second staging table is then updated in the final target table. So transformSql must be run on the table named staging and the output schema of transformSql must match the final target table's schema.
Please find an example of transformSql usage below. Notice that the select is from the staging table; it will effectively run CREATE TEMPORARY TABLE staging2 AS SELECT <...> FROM staging;. Also, all fields must be included and must match the existing table in the Redshift DB.
{
  "id": "LoadUsersRedshiftCopyActivity",
  "name": "Load Users",
  "insertMode": "OVERWRITE_EXISTING",
  "transformSql": "SELECT u.id, u.email, u.first_name, u.last_name, u.admin, u.guest, CONVERT_TIMEZONE('US/Pacific', cs.created_at_pst) AS created_at_pst, CONVERT_TIMEZONE('US/Pacific', cs.updated_at_pst) AS updated_at_pst FROM staging u;",
  "type": "RedshiftCopyActivity",
  "runsOn": {
    "ref": "OregonEc2Resource"
  },
  "schedule": {
    "ref": "HourlySchedule"
  },
  "input": {
    "ref": "OregonUsersS3DataNode"
  },
  "output": {
    "ref": "OregonUsersDashboardRedshiftDatabase"
  },
  "onSuccess": {
    "ref": "LoadUsersSuccessSnsAlarm"
  },
  "onFail": {
    "ref": "LoadUsersFailureSnsAlarm"
  },
  "dependsOn": {
    "ref": "BewteenRegionsCopyActivity"
  }
}
Use script of SqlActivity
SqlActivity allows operations on the whole dataset, and it can be scheduled to run after particular events through the dependsOn mechanism:
{
  "name": "Add location ID",
  "id": "AddCardpoolLocationSqlActivity",
  "type": "SqlActivity",
  "script": "INSERT INTO locations (id) SELECT 100000 WHERE NOT EXISTS (SELECT * FROM locations WHERE id = 100000);",
  "database": {
    "ref": "DashboardRedshiftDatabase"
  },
  "schedule": {
    "ref": "HourlySchedule"
  },
  "output": {
    "ref": "LocationsDashboardRedshiftDatabase"
  },
  "runsOn": {
    "ref": "OregonEc2Resource"
  },
  "dependsOn": {
    "ref": "LoadLocationsRedshiftCopyActivity"
  }
}
There is an optional field in RedshiftCopyActivity called 'transformSql'.
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-redshiftcopyactivity.html
I have not personally used this, but from the looks of it, you treat your S3 data as if it were in a temp table, and this SQL statement returns the transformed data for Redshift to insert.
So, you will need to list all fields in the select whether or not you are transforming that field.
AWS Datapipeline SqlActivity
{
  "id": "MySqlActivity",
  "type": "SqlActivity",
  "database": { "ref": "MyDatabase" },
  "script": "insert into AnalyticsTable (select (cast(requestEndTime as bigint) - cast(requestBeginTime as bigint)) as requestTime, hostname from StructuredLogs where hostname LIKE '%.domain.sfx');",
  "schedule": { "ref": "Hour" },
  "queue": "priority"
}
So basically, in "script" you can put any SQL script, transformation, or command from the Amazon Redshift SQL commands.
transformSql is fine, but it supports only the SQL SELECT expression used to transform the input data (ref: RedshiftCopyActivity).
I am trying to do a simple facet request over a field containing more than one word (e.g. 'Name1 Name2', sometimes with dots and commas inside), but what I get is...
"terms" : [{
"term" : "Name1",
"count" : 15
},
{
"term" : "Name2",
"count" : 15
}]
so my field value is split on spaces and the facet request then runs over the individual tokens...
Query example:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
  "query": {
    "query_string": {
      "fields": [
        "dataset"
      ],
      "query": "2",
      "default_operator": "AND"
    }
  },
  "facets": {
    "test": {
      "terms": {
        "field": [
          "speciesName"
        ],
        "size": 50000
      }
    }
  }
}'
Your field shouldn't be analyzed, or at least not tokenized. You need to update your mapping and then reindex if you want to index the field without tokenizing it.
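As an illustration, recreating the index with a non-tokenized mapping could look roughly like this (a sketch assuming an Elasticsearch version from the facets era, where "index": "not_analyzed" keeps a string field as a single token; the new index name and the use of Python's requests library are only for illustration, and existing documents still have to be reindexed into the new index afterwards):
import requests

# Create a new index where speciesName is kept as one untokenized term, so
# facets count whole values like "Name1 Name2" instead of individual words.
mapping = {
    "mappings": {
        "Occurrence": {
            "properties": {
                "speciesName": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}
requests.put("http://my_server:9200/idx_occurrence_v2", json=mapping)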
First of all, javanna provided a very good answer from a practical perspective. However, for the sake of completeness, I want to mention that in some cases there is a way to do it without reindexing the data.
If the speciesName field is stored and your queries produce a relatively small number of results, you can use script_field to retrieve the stored field values:
curl -XGET http://my_server:9200/idx_occurrence/Occurrence/_search?pretty=true -d '{
  "query": {
    "query_string": {
      "fields": ["dataset"],
      "query": "2",
      "default_operator": "AND"
    }
  },
  "facets": {
    "test": {
      "terms": {
        "script_field": "_fields['\''speciesName'\''].value",
        "size": 50000
      }
    }
  }
}
'
As a result of this query, elasticsearch will retrieve the speciesName field for every record in your result set and it will construct facets from these values. Needless to say, if your result set contains millions of records, performance of this query might be sluggish.
Similarly, if the field is not stored but the record source is, you can use a script_field facet to retrieve field values from the source:
......
"script_field": "_source['\''speciesName'\'']",
......
Again, source for each record in the result list will be retrieved and parsed, so you might need some patience to run this query on a large set of records.