Bigquery - Create external tables using a query - google-bigquery

I have a complex query which has multiple joins and aggregations. The output of the data is in billions now. Looking for a best approach to have this output data into an external table
I was able to do the following things (both were done using python)
-Export the query output into an Internal table
-Export Internal table into GCS
I had to combine the above steps and then create an external table on top of it. Is there any other efficient way to create an external table based on the query output directly?

Related

Updating/Inserting Oracle table from csv data file

I am still learning Oracle SQL, and I've been trying to find the best way to update/insert records in an OracleSQL table with the data from a CSV file.
So far, I've figured out how to load the csv into a temporary table using External Tables in Oracle, but I'm having difficulty finding a detailed guide on how to update/insert (UPSERT) the loaded data into an existing table.
What is the best way to do this, when I have 30+ fields in the table? For example, is it best to read the csv line by line with something like pandas and update each record one by one, or is it best to do it with a sql script using something like a merge statement? Not all records in the csv have a value for the primary key, in which case I need to insert rather than update. Thanks for the help!
That looks like a MERGE, indeed.
Data from external table would then be used to
update values in existing rows
create new rows in the target table
Pandas and row-by-row processing? I wouldn't do that. If you already have a powerful database, then use its capabilities. Row-by-row is usually slow-by-slow and there's rarely some benefit in doing it that way.

Add new partition-scheme to existing table in athena with SQL code

Is it even possible to add a partition to an existing table in Athena that currently is without partitions? If so, please also write syntax for doing so in the answer.
For example:
ALTER TABLE table1 ADD PARTITION (ourDateStringCol = '2021-01-01')
The above command will give the following error:
FAILED: SemanticException table is not partitioned but partition spec exists
Note: I have done a web-search, and variants exist for SQL server, or adding a partition to an already partitioned table. However, I personally could not find a case where one could successfully add a partition to an existing non-partitioned table.
This is extremely similar to:
SemanticException adding partiton Hive table
However, the answer given there requires re-creating the table.
I want to do so without re-creating the table.
Partitions in Athena are based on folder structure in S3. Unlike standard RDBMS that are loading the data into their disks or memory, Athena is based on scanning data in S3. This is how you enjoy the scale and low cost of the service.
What it means is that you have to have your data in different folders in a meaningful structure such as year=2019, year=2020, and make sure that the data for each year is all and only in that folder.
The simple solution is to run a CREATE TABLE AS SELECT (CTAS) query that will copy the data and create a new table that can be optimized for your analytical queries. You can choose the table format (Parquet, for example), the compression (SNAPPY, for example), and also the partition schema (per year, for example).

Can I use big query export data statement and scheduled the query?

I have a similar question asked in this link BigQuery - Export query results to local file/Google storage
I need to extract data from 2 big query tables using joins and where conditions. The extracted data has to be placed in a file on cloud storage. Mostly csv file. I want to go with a simple solution. Can I use big query export data statement In standard sql and schedule it?? Does it has a limitation of 1 Gb export?? If yes, what is the best possible way to implement this? Creating another temp table to save results from the query and using a data flow job to extras the data from the temp table? Please advise.
Basically google cloud now supports below
Please see code snippet in cloud documentation
https://cloud.google.com/bigquery/docs/reference/standard-sql/other-statements#exporting_data_to_csv_format
I’m thinking if I can use the above statement to export data into a file and select query will have join from 2 tables and other conditions.
This query will be a scheduled query in big query.
Any inputs please??

U-SQL job to query multiple tables with dynamic names

Our challenge is the following one :
in an Azure SQL database, we have multiple tables with the following table names : table_num where num is just an integer. These tables are created dynamically so the number of tables can vary. (from table_1, table_2 to table_N) All tables have the same columns.
As part of a U-SQL script file, we would like to execute the same query on all of these tables and generate an output csv file with the combined results of all these queries.
We tried several things :
U-SQL does not allow looping so we were thinking creating a View in our Azure SQL database that would combine all the tables using a cursor of some sort. Then, the U-SQL file would query this View (using external source). However, a View in Azure SQL database can only be created via a function and a function cannot execute dynamic SQL or even call a stored procedure...
We did not find a way to call a stored procedure of the external data source directly from U-SQL
we dont want to update our U-SQL job each time a new table is added...
Is there a way to do that in U-SQL through a custom extractor for instance? Any other ideas?
One solution I can think of is to use Azure Data Factory (v2) to assist in this.
You could create a pipeline with the following activities:
Lookup activity configured to execute the stored procedure
For Each activity that uses the output of the lookup activity as a source
As a child item use a U-Sql Activity that executes your U-Sql script which writes the output of a single table (the item of the For Each activity) to blob or datalake
Add a Copy Activity that merges the blobs from step 2.1 to one final blob.
If you have little or no experience working with ADF v2 do mind that it takes some time to get to know it but once you do, you won't regret it. Having a GUI to create the pipeline is a nice bonus.
Edit: as #wBob mentions another (far easier) solution is to somehow create a single table with all rows since all dynamically generated table have the same schema. You can create a stored procedure for populating this table for example.

Create Partition table in Big Query

Can anyone please suggest how to create partition table in Big Query ?.
Example: Suppose I have one log data in google storage for the year of 2016. I stored all data in one bucket partitioned by year , month and date wise. Here I want create table with partitioned by date.
Thanks in Advance
Documentation for partitioned tables is here:
https://cloud.google.com/bigquery/docs/creating-partitioned-tables
In this case, you'd create a partitioned table and populate the partitions with the data. You can run a query job that reads from GCS (and filters data for the specific date) and writes to the corresponding partition of a table. For example, to load data for May 1st, 2016 -- you'd specify the destination_table as table$20160501.
Currently, you'll have to run several query jobs to achieve this process. Please note that you'll be charged for each query job based on bytes processed.
Please see this post for some more details:
Migrating from non-partitioned to Partitioned tables
There are two options:
Option 1
You can load each daily file into separate respective table with name as YourLogs_YYYYMMDD
See details on how to Load Data from Cloud Storage
After tables created, you can access them either using Table wildcard functions (Legacy SQL) or using Wildcard Table (Standar SQL). See also Querying Multiple Tables Using a Wildcard Table for more examples
Option 2
You can create Date-Partitioned Table (just one table - YourLogs) - but you still will need to load each daily file into respective partition - see Creating and Updating Date-Partitioned Tables
After table is loaded you can easily Query Date-Partitioned Tables
Having partitions for an External Table is not allowed as for now. There is a Feature Request for it:
https://issuetracker.google.com/issues/62993684
(please vote for it if you're interested in it!)
Google says that they are considering it.