Is it possible to execute a BigQuery query directly with Google Cloud Dataflow and fetch the resulting data, rather than reading data from a table and then applying conditions?
For example, something like: PCollection res = p.apply(BigQueryIO.execute("SELECT col1, col2 FROM publicdata:samples.shakespeare WHERE ...."))
Instead of re-implementing iteratively what BigQuery queries already provide, we could use them directly.
Thanks and Regards
Ajay K N
BigQueryIO currently only supports reading from a Table and not a Query or View (FAQ).
One way to work around this is to have your main program create a permanent BigQuery table by issuing a query before you run your Dataflow job. After your job runs, you could delete the table.
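As a rough sketch of that workaround (assuming the Python BigQuery client; the project, dataset, table, and query are hypothetical placeholders), the main program could do something like:

from google.cloud import bigquery

client = bigquery.Client()
dest = bigquery.TableReference.from_string("my-project.my_dataset.dataflow_input")  # hypothetical table

# 1. Materialize the query result into a permanent table before launching the Dataflow job.
job_config = bigquery.QueryJobConfig(destination=dest, write_disposition="WRITE_TRUNCATE")
client.query(
    "SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare` WHERE word_count > 10",
    job_config=job_config,
).result()  # wait until the destination table is written

# 2. Run the Dataflow job, pointing BigQueryIO's table read at my_dataset.dataflow_input.

# 3. After the job finishes, drop the intermediate table.
client.delete_table(dest)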
Related
I have a similar question to the one asked in this link: BigQuery - Export query results to local file/Google storage
I need to extract data from 2 BigQuery tables using joins and WHERE conditions. The extracted data has to be placed in a file on Cloud Storage, preferably a CSV file, and I want to go with a simple solution. Can I use BigQuery's EXPORT DATA statement in standard SQL and schedule it? Does it have a limitation of 1 GB per export? If so, what is the best possible way to implement this: creating another temp table to save the results from the query and using a Dataflow job to extract the data from the temp table? Please advise.
Basically, Google Cloud now supports this. Please see the code snippet in the Cloud documentation:
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#exporting_data_to_csv_format
I'm wondering whether I can use the above statement to export the data into a file, with a SELECT query that joins the 2 tables and applies the other conditions.
This query would be set up as a scheduled query in BigQuery.
Any inputs, please?
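For reference, a sketch of what that EXPORT DATA statement with a join could look like (the project, dataset, column, and bucket names are hypothetical), run here through the Python BigQuery client; the wildcard URI lets BigQuery split a large result across multiple files:

from google.cloud import bigquery

client = bigquery.Client()

export_sql = """
EXPORT DATA OPTIONS(
  uri='gs://my-bucket/exports/result-*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter=','
) AS
SELECT a.id, a.name, b.amount
FROM `my-project.my_dataset.table_a` AS a
JOIN `my-project.my_dataset.table_b` AS b
  ON a.id = b.a_id
WHERE b.created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
"""

client.query(export_sql).result()  # waits for the export to finish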
Trying to use Dataflow SQL for stream ingestion:
We have a Pub/Sub topic (source) and a BigQuery table (sink).
To achieve this, we need to follow these steps:
From the BigQuery UI, add the schema for the topic manually.
Question: Can we automate this process using command-line options?
Write the SQL for the transformation and execute it using the gcloud Dataflow SQL query command (this helps us with dynamic queries and automation).
Question: Suppose a key is missing from some Pub/Sub messages, and the pipeline marks those messages as errors in Stackdriver. Can we add a capability such that if schema validation fails the message moves to table y, and otherwise to table x? Something like: if we get a message of type y, move it to table y, else to table x?
You can use gcloud to add a schema to a topic. This was actually the only way to do it, at first: https://cloud.google.com/dataflow/docs/guides/sql/data-sources-destinations#gcloud
For saving messages that cannot be parsed into SQL rows, the functionality is often called a "dead letter queue". It is available in Beam SQL DDL for Pub/Sub, but it is not yet available when using Dataflow SQL through the BigQuery UI. See https://beam.apache.org/documentation/dsls/sql/extensions/create-external-table/#pubsub
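Outside of Dataflow SQL, here is a minimal Beam Python sketch of the routing the question asks about: messages failing a simple schema check go to one BigQuery table and the rest go to another. The topic, table, and key names are hypothetical, and both tables are assumed to already exist:

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

REQUIRED_KEYS = {"event_id", "event_ts"}  # hypothetical required schema keys


class RouteBySchema(beam.DoFn):
    """Tag each message as 'valid' or 'invalid' based on a simple key check."""

    def process(self, message):
        row = json.loads(message.decode("utf-8"))
        tag = "valid" if REQUIRED_KEYS.issubset(row) else "invalid"
        yield beam.pvalue.TaggedOutput(tag, row)


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    routed = (
        p
        | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/my-topic")
        | "Route" >> beam.ParDo(RouteBySchema()).with_outputs("valid", "invalid")
    )
    routed.valid | "WriteTableX" >> beam.io.WriteToBigQuery(
        "my-project:my_dataset.table_x",
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    )
    routed.invalid | "WriteTableY" >> beam.io.WriteToBigQuery(
        "my-project:my_dataset.table_y",
        create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    )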
I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest table rows from the past 24 hours. There is a column which specifies the creation date of each row. Is it possible to transform just these rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some built-in transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
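As a small illustration of one of those built-in transforms (Filter) for this use case, under assumed catalog, column, and bucket names (note that the filter here is applied inside the Glue job after the source is read):

from datetime import datetime, timedelta

from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the source table from the Glue Data Catalog (hypothetical database/table names).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# Keep only rows created in the past 24 hours (assumes a timestamp column named 'created_date').
cutoff = datetime.utcnow() - timedelta(hours=24)
recent = Filter.apply(frame=dyf, f=lambda row: row["created_date"] >= cutoff)

# Write the filtered rows as CSV to S3 (hypothetical bucket).
glue_context.write_dynamic_frame.from_options(
    frame=recent,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/exports/"},
    format="csv",
)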
You haven't mentioned the type of database that you are trying to connect to. In any case, for JDBC connections Spark has a query option, with which you can issue a regular SQL query to fetch only the rows you need.
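A sketch of that JDBC approach in PySpark, so the 24-hour filter is pushed down to the source database; the connection URL, credentials, and column name are hypothetical placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("recent-rows-to-csv").getOrCreate()

# The 'query' option (Spark 2.4+) pushes the SQL, including the 24-hour filter,
# down to the source database. Connection details and column names are placeholders.
recent = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://my-host:5432/my_db")
    .option("user", "my_user")
    .option("password", "my_password")
    .option("driver", "org.postgresql.Driver")
    .option(
        "query",
        "SELECT * FROM my_table "
        "WHERE created_date >= NOW() - INTERVAL '24 hours'",
    )
    .load()
)

# Write only the filtered rows out as CSV.
recent.write.option("header", True).csv("s3://my-bucket/exports/")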
I would like to save the result of a query on an external database to BigQuery.
I am using pyodbc to manage the odbc connection.
What is the most efficient way to perform such operation?
Should I fetch each cursor row one at a time (fetchone) and then insert it into BigQuery?
Does the result have a large amount of data?
If the result is small, you can just read all the rows and insert them into BigQuery. The benefit is that the result is immediately available to BigQuery queries. However, for large results, streaming inserts might be expensive (see https://cloud.google.com/bigquery/pricing).
For large results I would just save the result to a file (commonly CSV), upload it to Cloud Storage, and run a load job.
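A rough sketch of the file-plus-load-job route, with hypothetical connection string, query, and table names (for small results, you could instead pass the rows straight to client.insert_rows_json):

import csv
import io

import pyodbc
from google.cloud import bigquery

# 1. Run the query on the external database over ODBC (hypothetical DSN and query).
conn = pyodbc.connect("DSN=my_external_db;UID=my_user;PWD=my_password")
cursor = conn.cursor()
cursor.execute("SELECT id, name, amount FROM my_source_table")

# 2. Write the rows to an in-memory CSV; for very large results, spill to a local
#    file or directly to Cloud Storage instead.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([col[0] for col in cursor.description])  # header row
writer.writerows(cursor.fetchall())

# 3. Load the CSV into BigQuery with a load job (load jobs are free, unlike streaming inserts).
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
client.load_table_from_file(
    io.BytesIO(buf.getvalue().encode("utf-8")),
    "my-project.my_dataset.my_table",
    job_config=job_config,
).result()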
What is the command to execute DML statements like INSERT, UPDATE, and DELETE in Google BigQuery?
I tried using bq query "select query"
but it works only for SELECT statements.
Note that BigQuery really excels at being a secondary database used for performing fast analytical queries on big data that is static, such as recorded data analysis, logs, and audit history.
If you instead require regular data updates, it is highly recommended to use a separate master database such as the Datastore to perform fast entity operations and updates. You would then persist your data from your master database to your secondary BigQuery database for further analysis.
To access the Data Manipulation Language (DML) functionality, you must tell the bq command line to use standard SQL by passing --use_legacy_sql=false, instead of BigQuery's original default legacy SQL.
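For example (with hypothetical dataset, table, and column names):

bq query --use_legacy_sql=false 'UPDATE my_dataset.my_table SET status = "done" WHERE id = 123'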