Create new files from multiple EXPORT DATA queries in BigQuery - SQL

I'm trying to run multiple export statements in BigQuery, like this:
EXPORT DATA OPTIONS(
uri='gs://bucket/folder/*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT field1, field2 FROM mydataset.table1 ORDER BY field1 LIMIT 10
The problem is that I don't want to overwrite new files, but instead I want new files to be created only if the query returns something.
I've tried changing the uri to gs://bucket/folder/*1.csv but this is creating an empty file (which I don't want).
I've also set the overwrite parameter to false; this results in Invalid value: overwrite option is not specified and destination is not empty. at [109:1]
Any ways to fix this?

You may try the approach below.
EXECUTE IMMEDIATE CONCAT(
"EXPORT DATA OPTIONS(uri='gs://your-bucket/your-folder/file_",
FORMAT_TIMESTAMP("%F-%T", CURRENT_TIMESTAMP(), "UTC"),
"_*.csv', format = 'CSV', overwrite = true, header = true, field_delimiter = ';') AS SELECT field1, field2 FROM `your-dataset.table1` ORDER BY field1"
)
This approach concatenates the current timestamp into the filename, so it serves as a unique identifier and a new file is created whenever the query returns rows.
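The same statement can also be assembled client-side instead of with EXECUTE IMMEDIATE, for example when you submit the job from a script. A minimal Python sketch (the bucket, folder, and table names are placeholders; you would pass the resulting string to the BigQuery client yourself):

```python
from datetime import datetime, timezone

def build_export_sql(bucket: str, folder: str, table: str) -> str:
    # Embed a UTC timestamp in the file prefix so each run writes new
    # files instead of overwriting earlier ones ("%Y-%m-%d-%H:%M:%S"
    # mirrors the %F-%T format used in the EXECUTE IMMEDIATE version).
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H:%M:%S")
    uri = f"gs://{bucket}/{folder}/file_{stamp}_*.csv"
    return (
        f"EXPORT DATA OPTIONS(uri='{uri}', format='CSV', "
        "overwrite=true, header=true, field_delimiter=';') AS "
        f"SELECT field1, field2 FROM {table} ORDER BY field1"
    )

sql = build_export_sql("my-bucket", "my-folder", "mydataset.table1")
```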
In addition, if you experience a location error when executing the above query, the posted answer here will solve the problem.

Related

BigQuery: Insert dynamic folder name into an EXPORT DATA statement

I am trying to dynamically set a folder name on a data export.
The statement below creates a folder literally called "#date" and does not substitute the actual date variable.
Is this possible to insert dynamic values into the uri?
EXECUTE IMMEDIATE
"EXPORT DATA OPTIONS(uri='gs://bucket/folder/#date/*.csv', format='CSV', overwrite=true, header=true, field_delimiter=',') AS SELECT field1, field2 FROM `dataset.table` ORDER BY field1 LIMIT 10"
USING CAST(CURRENT_DATE() AS STRING) as date;
I found another approach to insert a formatted timestamp that feels simpler.
See the '|| timestamp ||' section within the uri below
"EXPORT DATA OPTIONS ( uri='gs://bucket/folder-
'|| FORMAT_TIMESTAMP("%F-%T", TIMESTAMP_ADD(Current_TIMESTAMP(), INTERVAL 10 HOUR)) ||'
-*.csv', format='CSV', overwrite=true, header=true, field_delimiter=',') AS SELECT field1, field2 FROM `dataset.table` "
My answer may not be suitable for your question, but it might help others who, like me, need the file name with the GCS path passed in as an input parameter, like below:
StoredProcedures_Offers(filePath STRING)
then this filePath needs to be passed into the EXPORT DATA OPTIONS uri section as follows:
EXPORT DATA OPTIONS( uri=""||filePath||"", format='JSON', overwrite=true) AS
Note: the query part is excluded here just to highlight the EXPORT section.
Try concat:
EXECUTE IMMEDIATE CONCAT(
"EXPORT DATA OPTIONS(uri='gs://bucket/folder/",
CAST(CURRENT_DATE() AS STRING),
"/*.csv', format='CSV', overwrite=true, header=true, field_delimiter=',') AS SELECT field1, field2 FROM `dataset.table` ORDER BY field1 LIMIT 10"
)
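Since the original goal was to create files only when the query returns rows, the dynamic statement can also be wrapped in a scripting IF guard. A hedged sketch in Python that only builds the script string (the table and bucket names are placeholders, and treating EXPORT DATA inside an IF block as valid scripting is an assumption you should verify):

```python
def build_guarded_export(date_str: str) -> str:
    # Guard the export with a scripting IF so that no empty file is
    # written when the source query matches nothing.
    query = "SELECT field1, field2 FROM `dataset.table` ORDER BY field1 LIMIT 10"
    export = (
        f"EXPORT DATA OPTIONS(uri='gs://bucket/folder/{date_str}/*.csv', "
        "format='CSV', overwrite=true, header=true, field_delimiter=',') "
        f"AS {query}"
    )
    return f"IF EXISTS (SELECT 1 FROM `dataset.table`) THEN\n  {export};\nEND IF;"

script = build_guarded_export("2024-01-01")
```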

How to add the archive date to the archive name with the EXPORT DATA OPTIONS command in BigQuery

I have a problem. I want to create a file with EXPORT DATA OPTIONS and include the archive date in the file name. I am trying to create a variable test to do this, but it doesn't work. Do you have any idea how I can solve it?
SELECT
CONCAT("'gs://archivage-base/base1_",today,"*.csv'") as test
FROM
(SELECT
CURRENT_DATE() as today);
EXPORT DATA OPTIONS(
uri=test,
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM `base-042021.campagnemakers.makerlocal`
Thank you
You can use SQL functions right within your uri declaration. Examples: CURRENT_DATE(), CURRENT_DATETIME(), etc. Check out this article for dozens of options.
EXPORT DATA OPTIONS(
uri='gs://bucket_name/my_prefix/my_archive_name_'||current_date()||'-part-*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM `base-042021.campagnemakers.makerlocal`
You can do it using EXECUTE IMMEDIATE.
EXECUTE IMMEDIATE """
EXPORT DATA OPTIONS(
uri='gs://bucket_name/my_prefix/my_archive_name_#EXPORT_DATE-part-*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM `base-042021.campagnemakers.makerlocal`
"""
using current_date('+/- time offset from UTC') as EXPORT_DATE
Also note that you have to specify a URI with a wildcard, because the result of the query you export might be very large; see the EXPORT DATA docs.
See the documentation for more information on how to specify a time zone.
EXECUTE IMMEDIATE is a scripting capability that lets you build a SQL statement from a string at runtime, so you can modify any part of the statement. You could even create a stored procedure that takes a table name and a GCS prefix as input, so you can export different tables to different locations via a simple stored-procedure call.
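To make that stored-procedure idea concrete without writing the DDL, here is a sketch (in Python, since the statement is just a string) that builds the export for a given table name and GCS prefix; the regex check is a crude, hypothetical guard against injecting SQL through the table name:

```python
import re

def export_statement(table_name: str, gcs_prefix: str) -> str:
    # Reject anything that does not look like a plain table path before
    # splicing it into dynamic SQL.
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", table_name):
        raise ValueError("suspicious table name: %r" % table_name)
    return (
        f"EXPORT DATA OPTIONS(uri='{gcs_prefix}/*.csv', format='CSV', "
        "overwrite=true, header=true) AS "
        f"SELECT * FROM `{table_name}`"
    )
```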

Create SQL table from parquet files

I am using R to handle large datasets (largest dataframe 30,000,000 x 120). These are stored in Azure Data Lake Storage as parquet files, and we need to query them daily and load them into a local SQL database. Parquet files can be read without loading the data into memory, which is handy. However, creating SQL tables from parquet files is more challenging, as I'd prefer not to load the data into memory.
Here is the code I used. Unfortunately, this is not a perfect reprex, as the SQL database needs to exist for this to work.
# load packages
library(tidyverse)
library(arrow)
library(sparklyr)
library(DBI)
# Create test data
test <- data.frame(matrix(rnorm(20), nrow = 10))
# Save as parquet file
parquet_path <- tempfile(fileext = ".parquet")
write_parquet(test, parquet_path)
# Load main table (without pulling the data into memory)
sc <- spark_connect(master = "local", spark_home = spark_home_dir())
test <- spark_read_parquet(sc, name = "test_main", path = parquet_path, memory = FALSE, overwrite = TRUE)
# Save into SQL table (`connection` is an existing DBI connection)
DBI::dbWriteTable(conn = connection,
                  name = DBI::Id(schema = "schema", table = "table"),
                  value = test)
Is it possible to write a SQL table without loading parquet files into memory?
I lack experience with T-SQL bulk import and export, but this is likely where you'll find your answer.
library(arrow)
library(DBI)
test <- data.frame(matrix(rnorm(20), nrow = 10))
f <- tempfile(fileext = '.parquet')
write_parquet(test, f)
# Upload table using bulk insert
dbExecute(connection,
          paste0("
  BULK INSERT [database].[schema].[table]
  FROM '", gsub('\\\\', '/', f), "' WITH (FORMAT = 'PARQUET');
"))
Here I use T-SQL's own BULK INSERT command.
Disclaimer: I have not yet used this command in T-SQL, so it may be riddled with errors. For example, I can't see a place to specify snappy compression within the documentation, although it can be specified if one instead defines a custom file format with CREATE EXTERNAL FILE FORMAT.
Now, the above only inserts into an existing table. For your specific case, where you'd like to create a new table from the file, you would likely be looking for OPENROWSET with CREATE TABLE AS [select statement].
column_definition <- paste(names(column_defs), column_defs, collapse = ', ')
dbExecute(connection,
          paste0("CREATE TABLE MySqlTable
  AS
  SELECT *
  FROM OPENROWSET(
    BULK '", f, "', FORMAT = 'PARQUET'
  ) WITH (
    ", column_definition, "
  );
"))
where column_defs would be a named list or vector giving the SQL data-type definition for each column. A (more or less) complete translation from R data types to T-SQL data types is available in the T-SQL documentation (note two very necessary translations: Date and POSIXlt are not present). Once again, disclaimer: my time in T-SQL did not get to BULK INSERT or similar.
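The column-definition string itself is easy to generate from a name-to-type mapping; a small sketch in Python (the column names and T-SQL types here are invented for illustration):

```python
def tsql_with_clause(column_defs):
    # Render a name -> T-SQL type mapping as the WITH (...) column list
    # expected by OPENROWSET.
    body = ", ".join("[%s] %s" % (name, sqltype) for name, sqltype in column_defs.items())
    return "WITH (%s)" % body

clause = tsql_with_clause({"id": "INT", "date": "DATE", "spf": "VARCHAR(255)"})
```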

Is it possible to change the delimiter of AWS athena output file

Here is my sample code, where I create a file in an S3 bucket using AWS Athena. The file is in CSV format by default. Is there a way to change it to a pipe delimiter?
import json
import boto3

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    client = boto3.client('athena')
    # Start Query Execution
    response = client.start_query_execution(
        QueryString="""
            select * from srvgrp
            where category_code = 'ACOMNCDU'
        """,
        QueryExecutionContext={
            'Database': 'tmp_db'
        },
        ResultConfiguration={
            'OutputLocation': 's3://tmp-results/athena/'
        }
    )
    queryId = response['QueryExecutionId']
    print('Query id is :' + str(queryId))
There is a way to do that with a CTAS query.
BUT:
This is a hacky way and not what CTAS queries are supposed to be used for, since it will also create a new table definition in AWS Glue Data Catalog.
I'm not sure about performance
CREATE TABLE "UNIQU_PREFIX__new_table"
WITH (
format = 'TEXTFILE',
external_location = 's3://tmp-results/athena/__SOMETHING_UNIQUE__',
field_delimiter = '|',
bucketed_by = ARRAY['__SOME_COLUMN__'],
bucket_count = 1
) AS
SELECT *
FROM srvgrp
WHERE category_code = 'ACOMNCDU'
Note:
It is important to set bucket_count = 1, otherwise Athena will create multiple files.
The name of the table in CREATE TABLE ... should also be unique, e.g. use a timestamp prefix/suffix, which you can inject at Python runtime.
The external location should be unique as well, e.g. use a timestamp prefix/suffix injected at Python runtime. I would advise embedding the table name into the S3 path.
You need to include in bucketed_by only one of the columns from the SELECT.
At some point you will need to clean up the AWS Glue Data Catalog of all the table definitions created this way.
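The unique-name requirements above can be handled inside the Lambda itself; a sketch that generates the table name, S3 location, and CTAS text (the bucket and column names follow the example, the naming scheme is illustrative):

```python
import time

def build_ctas(bucket_prefix):
    # A timestamp suffix keeps both the Glue table name and the S3
    # location unique across runs.
    suffix = time.strftime("%Y%m%d_%H%M%S", time.gmtime())
    table = "tmp_%s_pipe_export" % suffix
    location = "%s/%s/" % (bucket_prefix, table)
    sql = (
        'CREATE TABLE "%s" WITH (format = \'TEXTFILE\', '
        "external_location = '%s', field_delimiter = '|', "
        "bucketed_by = ARRAY['category_code'], bucket_count = 1) AS "
        "SELECT * FROM srvgrp WHERE category_code = 'ACOMNCDU'"
    ) % (table, location)
    return table, sql

table, ctas = build_ctas("s3://tmp-results/athena")
```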

What's the best way to copy a subset of a table's rows from one database to another in Postgres?

I've got a production DB with, say, ten million rows. I'd like to extract the 10,000 or so rows from the past hour off of production and copy them to my local box. How do I do that?
Let's say the query is:
SELECT * FROM mytable WHERE date > '2009-01-05 12:00:00';
How do I take the output, export it to some sort of dump file, and then import that dump file into my local development copy of the database -- as quickly and easily as possible?
Source:
psql -c "COPY (SELECT * FROM mytable WHERE ...) TO STDOUT" > mytable.copy
Destination:
psql -c "COPY mytable FROM STDIN" < mytable.copy
This assumes mytable has the same schema and column order in both the source and destination. If this isn't the case, you could try STDOUT CSV HEADER and STDIN CSV HEADER instead of STDOUT and STDIN, but I haven't tried it.
If you have any custom triggers on mytable, you may need to disable them on import:
psql -c "ALTER TABLE mytable DISABLE TRIGGER USER; \
COPY mytable FROM STDIN; \
ALTER TABLE mytable ENABLE TRIGGER USER" < mytable.copy
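If you drive this from a script, the two invocations are easy to assemble; a sketch in Python that only builds the argument lists (connection flags such as -h and -d are omitted and would be your own; redirect stdout/stdin to the dump file when running them):

```python
def copy_commands(query, table):
    # Source side streams the subset to stdout; destination side reads
    # it back from stdin. Run with e.g. subprocess.run(src, stdout=f)
    # and subprocess.run(dst, stdin=f).
    src = ["psql", "-c", "COPY (%s) TO STDOUT" % query]
    dst = ["psql", "-c", "COPY %s FROM STDIN" % table]
    return src, dst

src, dst = copy_commands("SELECT * FROM mytable WHERE date > '2009-01-05 12:00:00'", "mytable")
```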
source server:
BEGIN;
CREATE TEMP TABLE mmm_your_table_here AS
SELECT * FROM your_table_here WHERE your_condition_here;
COPY mmm_your_table_here TO 'u:\\source.copy';
ROLLBACK;
your local box:
-- your_destination_table_here must be created first on your box
COPY your_destination_table_here FROM 'u:\\source.copy';
article: http://www.postgresql.org/docs/8.1/static/sql-copy.html
From within psql, you just use copy with the query you gave us, exporting this as a CSV (or whatever format), switch database with \c and import it.
Look into \h copy in psql.
With the constraint you added (not being superuser), I do not find a pure-SQL solution. But doing it in your favorite language is quite simple. You open a connection to the "old" database, another one to the new database, you SELECT in one and INSERT in the other. Here is a tested-and-working solution in Python.
#!/usr/bin/python
"""
Copy a *part* of a database to another one. See
<http://stackoverflow.com/questions/414849/whats-the-best-way-to-copy-a-subset-of-a-tables-rows-from-one-database-to-anoth>
With PostgreSQL, the only pure-SQL solution is to use COPY, which is
not available to the ordinary user.
Stephane Bortzmeyer <bortzmeyer#nic.fr>
"""
import psycopg2

table_name = "Tests"
# List here the columns you want to copy. Yes, "*" would be simpler
# but also more brittle.
names = ["id", "uuid", "date", "domain", "broken", "spf"]
constraint = "date > '2009-01-01'"

old_db = psycopg2.connect("dbname=dnswitness-spf")
new_db = psycopg2.connect("dbname=essais")
old_cursor = old_db.cursor()
old_cursor.execute("""SET TRANSACTION READ ONLY""")  # Security
new_cursor = new_db.cursor()

old_cursor.execute("""SELECT %s FROM %s WHERE %s""" % \
                   (",".join(names), table_name, constraint))
print "%i rows retrieved" % old_cursor.rowcount

new_cursor.execute("""BEGIN""")
placeholders = []
for name in names:
    placeholders.append("%%(%s)s" % name)
command = "INSERT INTO %s (%s) VALUES (%s)" % \
          (table_name, ",".join(names), ",".join(placeholders))
for row in old_cursor.fetchall():
    namesandvalues = {}
    i = 0
    for name in names:
        namesandvalues[name] = row[i]
        i = i + 1
    new_cursor.execute(command, namesandvalues)
new_cursor.execute("""COMMIT""")
old_cursor.close()
new_cursor.close()
old_db.close()
new_db.close()