BigQuery: Insert dynamic folder name into an EXPORT DATA statement - sql

I am trying to dynamically set a folder name on a data export.
The statement below creates a folder literally named "#date" instead of substituting the actual date value.
Is it possible to insert dynamic values into the uri?
EXECUTE IMMEDIATE
"EXPORT DATA OPTIONS(uri='gs://bucket/folder/#date/*.csv', format='CSV', overwrite=true, header=true, field_delimiter=',') AS SELECT field1, field2 FROM `dataset.table` ORDER BY field1 LIMIT 10"
USING CAST(CURRENT_DATE() AS STRING) as date;

I found another approach, inserting a formatted timestamp, that feels simpler.
See the '|| FORMAT_TIMESTAMP(...) ||' concatenation within the uri below:
EXPORT DATA OPTIONS(
  uri='gs://bucket/folder-' || FORMAT_TIMESTAMP("%F-%T", TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 10 HOUR)) || '-*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter=',') AS
SELECT field1, field2 FROM `dataset.table`

My answer may not suit your exact case, but it might help others who, like me, need the GCS path and file name passed in as an input parameter, for example:
StoredProcedures_Offers(filePath STRING)
This filePath then needs to be passed into the EXPORT DATA OPTIONS uri as follows:
EXPORT DATA OPTIONS( uri=""||filePath||"", format='JSON', overwrite=true) AS
Note: I am excluding the query part here just to highlight the EXPORT section; a fuller sketch follows below.
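A minimal sketch of how such a procedure might look, assuming (as the snippets above suggest) that the uri option accepts a concatenated string expression; the project, dataset, table, and column names here are hypothetical:
-- Hypothetical stored procedure: the export destination is taken from the filePath parameter.
CREATE OR REPLACE PROCEDURE `my_project.my_dataset.StoredProcedures_Offers`(filePath STRING)
BEGIN
  EXPORT DATA OPTIONS(
    uri=""||filePath||"",   -- expected to look like 'gs://bucket/offers/export-*.json'
    format='JSON',
    overwrite=true) AS
  SELECT field1, field2
  FROM `my_project.my_dataset.offers`;
END;

-- Example call:
CALL `my_project.my_dataset.StoredProcedures_Offers`('gs://bucket/offers/export-*.json');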

Try CONCAT:
EXECUTE IMMEDIATE CONCAT(
"EXPORT DATA OPTIONS(uri='gs://bucket/folder/",
CAST(CURRENT_DATE() AS STRING),
"/*.csv', format='CSV', overwrite=true, header=true, field_delimiter=',') AS SELECT field1, field2 FROM `dataset.table` ORDER BY field1 LIMIT 10"
)
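As a variant sketch (not from the original answers), the same statement string can also be assembled with FORMAT, which keeps the query text readable; the bucket, folder, and table names are placeholders:
EXECUTE IMMEDIATE FORMAT("""
EXPORT DATA OPTIONS(
  uri='gs://bucket/folder/%s/*.csv',
  format='CSV',
  overwrite=true,
  header=true,
  field_delimiter=',') AS
SELECT field1, field2 FROM `dataset.table` ORDER BY field1 LIMIT 10
""", CAST(CURRENT_DATE() AS STRING));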

Related

Inserting data into a BigQuery table through Python

I am trying to insert data into an existing BigQuery table, but I'm struggling. I am sorry, but I am new to BigQuery, so I am surely missing something. I am using the BigQuery API and I want to append/insert the data through Python.
from google.cloud import bigquery
from google.api_core.exceptions import NotFound
import logging
import os


def run_transformation_query(query_file, pipeline_settings):
    result = invoke_transformation_queries(query_file, pipeline_settings)
    error_value = result.error_result


def invoke_transformation_queries(sql_file_name, pipeline_settings):
    logging.basicConfig(filename='error.log', level=logging.ERROR)
    client = bigquery.Client(project=pipeline_settings['project'])
    print("check")
    # debug_string = f"{pipeline_settings['repo_path']}sql/{sql_file_name}"
    # sql = get_query_text(
    #     f"{pipeline_settings['repo_path']}sql/{sql_file_name}")
    file_path = os.path.join('sql', sql_file_name)
    sql = get_query_text(file_path)
    # ... the SQL is then submitted with the client and the job result returned


def get_query_text(file_path):
    with open(file_path, 'r') as file:
        query_text = file.read()
    return query_text


run_transformation_query("A_DAILY_ORDER_NORMALIZE.sql", pipeline_settings)
My SQL File is as follows:
DECLARE project_name STRING DEFAULT 'xxx-dl-cat-training';
DECLARE dataset_name STRING DEFAULT 'xxxxxxx_bootcamp_dataset_source';
DECLARE table_name_source STRING DEFAULT 'xxxxxxx-bootcamp-table-source';
DECLARE table_name_target STRING DEFAULT 'xxxxxxx-bootcamp-table-target';
DECLARE todays_date STRING;
SET todays_date = FORMAT_DATE("%Y%m%d", CURRENT_DATE());
WITH ORDERS AS (
SELECT
date_order,
area,
customer_name,
SPLIT(order_details, ',') AS items_list,
total_transaction
FROM
`${project_name}.${dataset_name}.${table_name_source}`
), TRANSFORMED_ORDERS AS (
SELECT
date_order,
area,
customer_name,
TRIM(IFNULL(
REGEXP_REPLACE(
item,
r'-\s*\d+(\.\d+)?$',
''
),
item
)) AS item_name,
CAST(NULLIF(TRIM(REGEXP_EXTRACT(item, r'-\s*\d+(\.\d+)?')), '') AS FLOAT64) AS item_price,
total_transaction
FROM ORDERS, UNNEST(items_list) as item
WHERE CAST(NULLIF(TRIM(REGEXP_EXTRACT(item, r'-\s*\d+(\.\d+)?')), '') AS FLOAT64) IS NOT NULL
)
CREATE OR REPLACE TABLE `${project_name}.${dataset_name}.${table_name_target}`
SELECT *
FROM TRANSFORMED_ORDERS;
As soon as my subquery (the CTE) ends, I get this error:
Expected "(" or "," or keyword SELECT but got keyword CREATE at [36:1]
I am not sure where I am messing up. Any help will be appreciated. I have run the transformation in the BigQuery UI and I am happy with it; it all works OK there.
I shall be grateful if someone can help.
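For reference, BigQuery does not allow a standalone WITH clause to be followed by a CREATE statement; the CREATE OR REPLACE TABLE needs to come first, with the query attached via AS. A sketch of the restructured statement, reusing the names from the file above (the ${...} placeholders are left as-is and assumed to be expanded before the script reaches BigQuery):
CREATE OR REPLACE TABLE `${project_name}.${dataset_name}.${table_name_target}` AS
WITH ORDERS AS (
  SELECT
    date_order,
    area,
    customer_name,
    SPLIT(order_details, ',') AS items_list,
    total_transaction
  FROM
    `${project_name}.${dataset_name}.${table_name_source}`
), TRANSFORMED_ORDERS AS (
  SELECT
    date_order,
    area,
    customer_name,
    TRIM(IFNULL(REGEXP_REPLACE(item, r'-\s*\d+(\.\d+)?$', ''), item)) AS item_name,
    CAST(NULLIF(TRIM(REGEXP_EXTRACT(item, r'-\s*\d+(\.\d+)?')), '') AS FLOAT64) AS item_price,
    total_transaction
  FROM ORDERS, UNNEST(items_list) AS item
  WHERE CAST(NULLIF(TRIM(REGEXP_EXTRACT(item, r'-\s*\d+(\.\d+)?')), '') AS FLOAT64) IS NOT NULL
)
SELECT * FROM TRANSFORMED_ORDERS;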

Create new files from multiple export query data in bigquery

I'm trying to run multiple export statements in BigQuery, like this:
EXPORT DATA OPTIONS(
uri='gs://bucket/folder/*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT field1, field2 FROM mydataset.table1 ORDER BY field1 LIMIT 10
The problem is that I don't want to overwrite existing files; instead, I want new files to be created only if the query returns something.
I've tried changing the uri to gs://bucket/folder/*1.csv, but this creates an empty file (which I don't want).
I've also set the overwrite parameter to false; this results in: Invalid value: overwrite option is not specified and destination is not empty. at [109:1]
Any ways to fix this?
You may try and consider the approach below.
EXECUTE IMMEDIATE CONCAT(
"EXPORT DATA OPTIONS(uri='gs://your-bucket/your-folder/file_",
FORMAT_TIMESTAMP("%F-%T", CURRENT_TIMESTAMP(), "UTC"),
"_*.csv', format = 'CSV', overwrite = true, header = true, field_delimiter = ';') AS SELECT field1, field2 FROM `your-dataset.table1` ORDER BY field1"
)
This approach concatenates the current timestamp to your file name so that it serves as a unique identifier, and a new file is created whenever your query returns something.
Sample output on my Bucket:
In addition, if you experience a location error when executing the above query, the posted answer here will solve the problem.

How to add the archive date to the file name with the EXPORT DATA OPTIONS command in BigQuery

I have a problem. I want to create a file with EXPORT DATA OPTIONS and I want the archive date in the name of the file. I am trying to create a variable test in order to do it, but it doesn't work. Do you have any idea how I can solve it?
SELECT
CONCAT("'gs://archivage-base/base1_",today,"*.csv'") as test
FROM
(SELECT
CURRENT_DATE() as today);
EXPORT DATA OPTIONS(
uri=test,
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM `base-042021.campagnemakers.makerlocal`
Thank you
You can use SQL functions right within your uri declaration, for example current_date(), current_datetime(), etc. Check out this article for dozens of options.
EXPORT DATA OPTIONS(
uri='gs://bucket_name/my_prefix/my_archive_name_'||current_date()||'-part-*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM `base-042021.campagnemakers.makerlocal`
You can do it using EXECUTE IMMEDIATE.
EXECUTE IMMEDIATE """
EXPORT DATA OPTIONS(
uri='gs://bucket_name/my_prefix/my_archive_name_#EXPORT_DATE-part-*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT * FROM `base-042021.campagnemakers.makerlocal`
"""
using current_date('+/- time offset from UTC') as EXPORT_DATE
Also note that you have to specify a URI with a wildcard, because the result of the query to export might be very large; see the EXPORT DATA docs.
See the documentation for more information on how to specify a time zone.
EXECUTE IMMEDIATE is a scripting capability that lets you build a SQL statement from a string at runtime, so you can modify any part of the statement. You could even create a stored procedure that takes a table name and a GCS prefix as input, so that different tables can be exported to different locations via a simple call to a stored procedure, as sketched below.
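A minimal sketch of what such a procedure could look like; all project, dataset, and procedure names here are hypothetical, and the statement string is assembled with FORMAT before being run with EXECUTE IMMEDIATE:
-- Hypothetical procedure: exports any table to a caller-supplied GCS prefix.
CREATE OR REPLACE PROCEDURE `my_project.my_dataset.export_table_to_gcs`(table_name STRING, gcs_prefix STRING)
BEGIN
  EXECUTE IMMEDIATE FORMAT("""
    EXPORT DATA OPTIONS(
      uri='%s/%s-*.csv',
      format='CSV',
      overwrite=true,
      header=true,
      field_delimiter=';') AS
    SELECT * FROM `%s`
  """, gcs_prefix, table_name, table_name);
END;

-- Example call:
CALL `my_project.my_dataset.export_table_to_gcs`('my_dataset.table1', 'gs://bucket/folder');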

BigQuery: Array<string> field with WriteToBigQuery

I'm creating a Google Dataflow template in Python:
query = "#standardSQL" + """
SELECT
Frame.Serial,
Frame.Fecha,
Frame.Longitud,
Frame.Latitud,
ARRAY_AGG (CONCAT (ID, '-', Valor) ORDER BY ID) AS Resumen
FROM <...>
TABLE_SCHEMA = 'Serial:STRING,Fecha:DATETIME,Longitud:STRING,Latitud:STRING,Resumen:STRING'
| 'Read from BQ' >> beam.io.Read(beam.io.BigQuerySource(query=query,dataset="xxx",use_standard_sql=True))
| 'Write transform to BigQuery' >> WriteToBigQuery('table',TABLE_SCHEMA)
The problem
This fails because the Resumen field is an array:
Array specified for non-repeated field.
What I tested
Create the table directly in the BigQuery UI with the statement:
CREATE TABLE test (Resumen ARRAY<STRING>)
This works. The table is created with:
Type: string
Mode: Repeated
Change the TABLE_SCHEMA and run the pipeline:
TABLE_SCHEMA ='Serial:STRING,Fecha:DATETIME,Longitud:STRING,Latitud:STRING,Resumen:ARRAY<STRING>'
With the error:
"Invalid value for: ARRAY\u003cSTRING\u003e is not a valid value".
What should the TABLE_SCHEMA be to create the table and use it with beam.io.WriteToBigQuery()?
Looks like repeated or nested fields are not supported if you specify a BQ schema in a single string: https://beam.apache.org/documentation/io/built-in/google-bigquery/#creating-a-table-schema
You will need to describe your schema explicitly and set the field mode to repeated: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/bigquery_schema.py#L95
# From the cookbook example (imports and parent schema shown here for context):
from apache_beam.io.gcp.internal.clients import bigquery

table_schema = bigquery.TableSchema()

# A repeated field.
children_schema = bigquery.TableFieldSchema()
children_schema.name = 'children'
children_schema.type = 'string'
children_schema.mode = 'repeated'
table_schema.fields.append(children_schema)

Exporting data containing line feeds as CSV from PostgreSQL

I'm trying to export data from PostgreSQL to CSV.
First I created the query and tried exporting from pgAdmin with File -> Export to CSV. The CSV is wrong, as it contains, for example:
The header: Field1;Field2;Field3;Field4
Now, the rows begin well, except for the last field, which is put on another line:
Example :
Data1;Data2;Data3;
Data4;
The problem is that I get an error when trying to import the data into another server.
The data is from a view I created.
I also tried
COPY view(field1,field2...) TO 'C:\test.csv' DELIMITER ',' CSV HEADER;
It exports the same file.
I just want to export the data to another server.
Edit:
When trying to import the CSV I get the error:
ERROR: Extra data after the last expected column. Context: COPY actions, line 3: <<"Data1, data2 etc.">>
So the first line is the header, and the second line is the first data row minus the last field, which sits on the 3rd line, alone.
In order to export the file to another server you have two options:
1. Create a shared folder between the two servers, so that the database also has access to this directory:
COPY (SELECT field1,field2 FROM your_table) TO '[shared directory]' DELIMITER ',' CSV HEADER;
2. Trigger the export from the target server using the STDOUT of COPY. Using psql you can achieve this by running the following command:
psql yourdb -c "COPY (SELECT * FROM your_table) TO STDOUT" > output.csv
EDIT: Addressing the issue of fields containing line feeds (\n)
If you want to get rid of the line feeds, use the REPLACE function.
Example:
SELECT E'foo\nbar';
?column?
----------
foo +
bar
(1 row)
Removing the line feed:
SELECT REPLACE(E'foo\nbaar',E'\n','');
replace
---------
foobaar
(1 row)
So your COPY should look like this:
COPY (SELECT field1,REPLACE(field2,E'\n','') AS field2 FROM your_table) TO '[shared directory]' DELIMITER ',' CSV HEADER;
The export procedure described above is OK, e.g.:
t=# create table so(i int, t text);
CREATE TABLE
t=# insert into so select 1,chr(10)||'aaa';
INSERT 0 1
t=# copy so to stdout csv header;
i,t
1,"
aaa"
t=# create table so1(i int, t text);
CREATE TABLE
t=# copy so1 from stdout csv header;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself, or an EOF signal.
>> i,t
1,"
aaa"
>> >> >> \.
COPY 1
t=# select * from so1;
i | t
---+-----
1 | +
| aaa
(1 row)