GBQ BQ-CLI Export CSV to Google Storage with query and specified delimiter - google-bigquery

I'm trying to export a table from Google Bigquery into CSV and save the file into a google storage using bq extract but the issue is I want to export that table into 2 CSV files with different filters, and also using ; as delimiter for my CSV file. But I cannot find any documentation online where I can use queries with bq extract.
For example, I have table mytable.cities and I would like to export that table into 2 CSV files: the first CSV file I want export the table with a condition where city = 'Los Angeles' and for my second CSV with where city = 'New York'.
My syntax currently is this:
bq extract --destination_format=CSV --field_delimiter=';' mytable.cities gs://myBucket/myFile.csv
I did not use the command bq query because it doesn't give me the option to change my delimiter to ;
How can I achieve this?

An easy way to do this is to write a code that uses BQ API to run your queries and then use GCS API to save the results as CSV. See code below to do this:
from google.cloud import bigquery
from google.cloud import storage
bq_client = bigquery.Client()
gcs_client = storage.Client()
bucket = gcs_client.get_bucket("your-bucket-name")
query_1 = """
SELECT mascot_name,mascot FROM `bigquery-public-data.ncaa_basketball.mascots`
where non_tax_type = 'Devils'
"""
query_2 = """
SELECT mascot_name,mascot FROM `bigquery-public-data.ncaa_basketball.mascots`
where non_tax_type = 'Dragons'
"""
df_1 = bq_client.query(query_1).to_dataframe()
bucket.blob("devils.csv").upload_from_string(df_1.to_csv(index=False,sep=";"),"text/csv")
df_2 = bq_client.query(query_2).to_dataframe()
bucket.blob("dragons.csv").upload_from_string(df_2.to_csv(index=False,sep=";"),"text/csv")
NOTE: The example above used a public dataset to test the query and import to csv.
See bucket after code is ran:
Sample file content (devils.csv):

Related

Ignore Last row in CSV file as part of BigQuery External table command

I have about 40 odd csv files, comma delimited in GCS however the last line of all the files has quotes and dot
”.
So these are not exactly conformed csv schema and has data quality issue which i have to get around
My aim is to create an external table referencing to the gcs files and then be able to select the data.
example:
create or replace dataset.tableName
options (
uris = ['gs://bucket_path/allCSVFILES_*.csv'],
format = 'CSV',
skip_leading_rows = 1,
ignore_unknown_values = true
)
the external table gets created without any error. however, when I select the data, I ran to error
"error message: CSV table references column position 16, but line starting at position:18628631 contains only 1 columns"
This is due to quotes and dot ”. at the end of file.
My question is: is there any way in BigQuery to consume to data without the LAST LINE. as part of options we have skip_leading_rows to skip header but any way to skip to last row?
Currently my best placed option is to clean the files, using sed/tail command.
I have checked the create or replace external table options list below and have tried using ignore_unknown_values but other than this option i don't see any other option which will work.
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language#create_external_table_statement
You can try below work around:
I tried with pandas and removed the last record from the csv file.
from google.cloud import bigquery
import pandas as pd
from google.cloud import storage
df=pd.read_csv('gs://samplecsv.csv')
client = bigquery.Client()
dataset_ref = client.dataset('dataset')
table_ref = dataset_ref.table('new_table')
df.drop(df.tail(1).index,inplace=True)
client.load_table_from_dataframe(df, table_ref).result()
For more information you can refer to this link which mentions the limitation for loading csv files to Bigquery.

Generate Multiple files Table Structure and create table

I'm trying to Generate Multiple files Table Structure to Snowflake via Python
I have list of files in Directory, I want to read data from files create the tables dynamically in snowflake using file names.
below is what I tried so far
# Generate Multiple files Table Structure to Snowflake via Python
import os
from os import path
import pandas as pd
import snowflake.connector
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer
dir = r"D:\Datasets\users_dataset"
engine = create_engine(URL(
account='<account>',
user='<user>',
password='<password>',
role='ACCOUNTADMIN',
warehouse='COMPUTE_WH',
database='DEM0_DB',
schema='PUBLIC'
))
connection = engine.connect()
connection.execute("USE DATABASE DEMO_DB")
connection.execute("USE SCHEMA PUBLIC")
results = connection.execute('USE DATABASE DEMO_DB').fetchone()
print(results)
# read the files from the directory and split the filename and extension
for file in os.listdir(dir):
name, extr = path.splitext(file)
print(name)
file_path = os.path.join(dir, file)
print(file_path)
df = pd.read_csv(file_path, delimiter=',')
df.to_sql(name, con=engine, index=False)
I'm getting below error
sqlalchemy.exc.ProgrammingError: (snowflake.connector.errors.ProgrammingError) 090105 (22000): Cannot perform CREATE TABLE. This session does not have a current database. Call 'USE DATABASE', or use a qualified name.
[SQL:
CREATE TABLE desktop (
"[.ShellClassInfo]" FLOAT
)
]
(Background on this error at: https://sqlalche.me/e/14/f405)
I checked for permission issues on snowflake, and I haven't found any issues.
Can someone please help with this error
At the create_engine the database is called DEM0_DB instead of DEMO_DB:
engine = create_engine(URL(
...
#database='DEM0_DB',
database='DEMO_DB',
schema='PUBLIC'
))

how can I upload a gzipped json file to gcs bucket from bigquery

I need to load bigquery data ( select with some filter) to gcs bucket with json format and then compress. Current airflow operator is exporting table from bq to gcs, Is there any way to push some select data with some filter from BQ to GCS?
You can just set the compression parameter of BigQueryToGCSOperator:
from airflow.providers.google.cloud.transfers.bigquery_to_gcs import BigQueryToGCSOperator
bigquery_to_gcs = BigQueryToGCSOperator(
task_id="bigquery_to_gcs",
source_project_dataset_table="DATASET_NAME.TABLE",
destination_cloud_storage_uris=["gs://folder/your_file"],
compression='gzip'
)
There is a pure SQL solution to this problem using EXPORT DATA statement in BigQuery. See the following example:
EXPORT DATA
OPTIONS (
compression = GZIP,
format = JSON,
uri = 'gs://bucket/path/file_*'
) AS
-- query_statement
select 1 as x, 2 as y;
After downloading the file from GCS and extracting it from the arhive I get the following data:
{"x":"1","y":"2"}

How to Convert Many CSV files to Parquet using AWS Glue

I'm using AWS S3, Glue, and Athena with the following setup:
S3 --> Glue --> Athena
My raw data is stored on S3 as CSV files. I'm using Glue for ETL, and I'm using Athena to query the data.
Since I'm using Athena, I'd like to convert the CSV files to Parquet. I'm using AWS Glue to do this right now. This is the current process I'm using:
Run Crawler to read CSV files and populate Data Catalog.
Run ETL job to create Parquet file from Data Catalog.
Run a Crawler to populate Data Catalog using Parquet file.
The Glue job only allows me to convert one table at a time. If I have many CSV files, this process quickly becomes unmanageable. Is there a better way, perhaps a "correct" way, of converting many CSV files to Parquet using AWS Glue or some other AWS service?
I had the exact same situation where I wanted to efficiently loop through the catalog tables catalogued by crawler which are pointing to csv files and then convert them to parquet. Unfortunately there is not much information available in the web yet. That's why I have written a blog in LinkedIn explaining how I have done it. Please have a read; specially point #5. Hope that helps. Please let me know your feedback.
Note: As per Antti's feedback, I am pasting the excerpt solution from my blog below:
Iterating through catalog/database/tables
The Job Wizard comes with option to run predefined script on a data source. Problem is that the data source you can select is a single table from the catalog. It does not give you option to run the job on the whole database or a set of tables. You can modify the script later anyways but the way to iterate through the database tables in glue catalog is also very difficult to find. There are Catalog APIs but lacking suitable examples. The github example repo can be enriched with lot more scenarios to help developers.
After some mucking around, I came up with the script below which does the job. I have used boto3 client to loop through the table. I am pasting it here if it comes to someone’s help. I would also like to hear from you if you have a better suggestion
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
client = boto3.client('glue', region_name='ap-southeast-2')
databaseName = 'tpc-ds-csv'
print '\ndatabaseName: ' + databaseName
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables['TableList']
for table in tableList:
tableName = table['Name']
print '\n-- tableName: ' + tableName
datasource0 = glueContext.create_dynamic_frame.from_catalog(
database="tpc-ds-csv",
table_name=tableName,
transformation_ctx="datasource0"
)
datasink4 = glueContext.write_dynamic_frame.from_options(
frame=datasource0,
connection_type="s3",
connection_options={
"path": "s3://aws-glue-tpcds-parquet/"+ tableName + "/"
},
format="parquet",
transformation_ctx="datasink4"
)
job.commit()
Please refer to EDIT for updated info.
S3 --> Athena
Why not you use CSV format directly with Athena?
https://docs.aws.amazon.com/athena/latest/ug/supported-format.html
CSV is one of the supported formats. Also to make it efficient, you can compress multiple CSV files for faster loading.
Supported compression,
https://docs.aws.amazon.com/athena/latest/ug/compression-formats.html
Hope it helps.
EDIT:
Why Parquet format is more helpful than CSV?
https://dzone.com/articles/how-to-be-a-hero-with-powerful-parquet-google-and
S3 --> Glue --> Athena
More details on CSV to Parquet conversion,
https://aws.amazon.com/blogs/big-data/build-a-data-lake-foundation-with-aws-glue-and-amazon-s3/
I'm not a big fan of Glue, nor creating schemas from data
Here's how to do it in Athena, which is dramatically faster than Glue.
This is for the CSV files:
create table foo (
id int,
name string,
some date
)
row format delimited
fields terminated by ','
location 's3://mybucket/path/to/csvs/'
This is for the parquet files:
create table bar
with (
external_location = 's3://mybucket/path/to/parquet/',
format = 'PARQUET'
)
as select * from foo
You don't need to create that path for parquet, even if you use partitioning
you can convert either JSON or CSV files into parquet directly, without importing it to the catalog first.
This is for the JSON files - the below code would convert anything hosted at the rawFiles directory
import sys
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsglue.utils import getResolvedOptions
## #params: [JOB_NAME] args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sparkContext = SparkContext()
glueContext = GlueContext(sparkContext)
spark = glueContext.spark_session
job = Job(glueContext) job.init(args['JOB_NAME'], args)
s3_json_path = 's3://rawFiles/'
s3_parquet_path = 's3://convertedFiles/'
output = spark.read.load(s3_json_path, format='json')
output.write.parquet(s3_parquet_path)
job.commit()
Sounds like in your step 1 you are crawling the individual csv file (e.g some-bucket/container-path/file.csv), but if you instead set your crawler to look at a path level instead of a file level (e.g some-bucket/container-path/) and all your csv files are uniform then the crawler should only create a single external table instead of an external table per file and you’ll be able to extract the data from all of the files at once.

Using the BigQuery Connector with Spark

I'm not getting the Google example work
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example
PySpark
There are a few mistakes in the code i think, like:
'# Output Parameters
'mapred.bq.project.id': '',
Should be: 'mapred.bq.output.project.id': '',
and
'# Write data back into new BigQuery table.
'# BigQueryOutputFormat discards keys, so set key to None.
(word_counts
.map(lambda pair: None, json.dumps(pair))
.saveAsNewAPIHadoopDataset(conf))
will give an error message. If I change it to:
(word_counts
.map(lambda pair: (None, json.dumps(pair)))
.saveAsNewAPIHadoopDataset(conf))
I get the error message:
org.apache.hadoop.io.Text cannot be cast to com.google.gson.JsonObject
And whatever I try I can not make this work.
There is a dataset created in BigQuery with the name I gave it in the 'conf' with a trailing '_hadoop_temporary_job_201512081419_0008'
And a table is created with '_attempt_201512081419_0008_r_000000_0' on the end. But are always empty
Can anybody help me with this?
Thanks
We are working to update the documentation because, as you noted, the docs are incorrect in this case. Sorry about that! While we're working to update the docs, I wanted to get you a reply ASAP.
Casting problem
The most important problem you mention is the casting issue. Unfortunately,PySpark cannot use the BigQueryOutputFormat to create Java GSON objects. The solution (workaround) is to save the output data into Google Cloud Storage (GCS) and then load it manually with the bq command.
Code example
Here is a code sample which exports to GCS and loads the data into BigQuery. You could also use subprocess and Python to execute the bq command programatically.
#!/usr/bin/python
"""BigQuery I/O PySpark example."""
import json
import pprint
import pyspark
sc = pyspark.SparkContext()
# Use the Google Cloud Storage bucket for temporary BigQuery export data used
# by the InputFormat. This assumes the Google Cloud Storage connector for
# Hadoop is configured.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory ='gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
# Input Parameters
'mapred.bq.project.id': project,
'mapred.bq.gcs.bucket': bucket,
'mapred.bq.temp.gcs.path': input_directory,
'mapred.bq.input.project.id': 'publicdata',
'mapred.bq.input.dataset.id': 'samples',
'mapred.bq.input.table.id': 'shakespeare',
}
# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
# Perform word count.
word_counts = (
table_data
.map(lambda (_, record): json.loads(record))
.map(lambda x: (x['word'].lower(), int(x['word_count'])))
.reduceByKey(lambda x, y: x + y))
# Display 10 results.
pprint.pprint(word_counts.take(10))
# Stage data formatted as newline delimited json in Google Cloud Storage.
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)
partitions = range(word_counts.getNumPartitions())
output_files = [output_directory + '/part-{:05}'.format(i) for i in partitions]
(word_counts
.map(lambda (w, c): json.dumps({'word': w, 'word_count': c}))
.saveAsTextFile(output_directory))
# Manually clean up the input_directory, otherwise there will be BigQuery export
# files left over indefinitely.
input_path = sc._jvm.org.apache.hadoop.fs.Path(input_directory)
input_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(input_path, True)
print """
###########################################################################
# Finish uploading data to BigQuery using a client e.g.
bq load --source_format NEWLINE_DELIMITED_JSON \
--schema 'word:STRING,word_count:INTEGER' \
wordcount_dataset.wordcount_table {files}
# Clean up the output
gsutil -m rm -r {output_directory}
###########################################################################
""".format(
files=','.join(output_files),
output_directory=output_directory)