How to export a null field on BigQuery - google-bigquery

When I export a JSON object from BigQuery and a field has a null value, the field disappears from the downloaded results.
An example export query:
EXPORT DATA OPTIONS(
  uri='gs://analytics-export/_*',
  format='JSON',
  overwrite=true) AS
SELECT NULL AS field1
The actual result is: {}
The expected result is: {"field1": null}
How can I force the export to include the null value, as shown in the expected result?

For this, you can use one of the following:
SELECT TO_JSON_STRING(NULL) AS field1
SELECT 'null' AS field1
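For illustration, here is a rough sketch of how the first suggestion would slot into the original EXPORT DATA statement (the URI is reused from the question). Note that TO_JSON_STRING(NULL) yields the string 'null', so inspect the exported file to confirm the serialization is what your consumer expects:
from google.cloud import bigquery

client = bigquery.Client()

# Hedged sketch: the export from the question, with the field wrapped in
# TO_JSON_STRING so it is no longer dropped from the output.
export_sql = """
EXPORT DATA OPTIONS(
  uri='gs://analytics-export/_*',
  format='JSON',
  overwrite=true) AS
SELECT TO_JSON_STRING(NULL) AS field1
"""
client.query(export_sql).result()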
The EXPORT DATA documentation has no reference to an option that includes null values in the output, so you may want to go to the feature request page and open a request for it. There are also similar observations on other projects indicating that it is not supported yet; see details here.
There are several workarounds for this; let me show you two options:
Option 1: Call BigQuery directly from Python using the BigQuery client library
from google.cloud import bigquery
import json

client = bigquery.Client()

query = "select null as field1, null as field2"
query_job = client.query(query)

# Build the row dict explicitly so null columns are kept as None,
# which json.dumps serializes as null.
json_list = {}
for row in query_job:
    json_row = {'field1': row[0], 'field2': row[1]}
    json_list.update(json_row)

with open('test.json', 'w+') as file:
    file.write(json.dumps(json_list))
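If you also need the file in Cloud Storage, as with the original EXPORT DATA, a minimal sketch using the storage client could look like this (the bucket name is taken from the question's URI; the object name is made up for illustration):
from google.cloud import storage

# Hedged sketch: upload the locally written JSON to the bucket from the question.
storage_client = storage.Client()
bucket = storage_client.bucket('analytics-export')
bucket.blob('export_with_nulls.json').upload_from_filename('test.json')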
Option 2: Use an Apache Beam (Dataflow) pipeline in Python with BigQuery to produce the desired output
import argparse
import json

import apache_beam as beam
from apache_beam.io import BigQuerySource
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def add_null_field(row, field):
    # Put the field back with a None value if it was dropped;
    # with field='skip' the rows pass through unchanged.
    if field != 'skip':
        row.update({field: row.get(field, None)})
    return row


def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output',
        dest='output',
        required=True,
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | beam.io.Read(beam.io.BigQuerySource(query='SELECT null as field1, null as field2'))
         | beam.Map(add_null_field, field='skip')
         | beam.Map(json.dumps)
         | beam.io.Write(beam.io.WriteToText(known_args.output, file_name_suffix='.json')))


if __name__ == '__main__':
    run()
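If the Beam SDK is not installed yet, the pipeline needs the GCP extras (a typical install command, assuming pip):
pip install "apache-beam[gcp]"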
To run it:
python -m export --output gs://my_bucket_id/output/ \
--runner DataflowRunner \
--project my_project_id \
--region my_region \
--temp_location gs://my_bucket_id/tmp/
Note: just replace my_project_id, my_bucket_id and my_region with the appropriate values, then look in your Cloud Storage bucket for the output file.
Both options will produce the output you are looking for:
{"field1": null, "field2": null}
Please let me know if this helps you get the result you want to achieve.

Related

snowflake.connector SQL compilation error invalid identifier from pandas dataframe

I'm trying to ingest a DataFrame I created from a JSON response into an existing table (the table is currently empty because I can't seem to get this to work).
The DataFrame looks something like the table below:

index  clicks_affiliated
0      3214
1      2221
but I'm seeing the following error:
snowflake.connector.errors.ProgrammingError: 000904 (42000): SQL
compilation error: error line 1 at position 94
invalid identifier '"clicks_affiliated"'
The column names in Snowflake match the columns in my DataFrame.
This is my code:
import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas, pd_writer
from pandas import json_normalize
import requests

df_norm = json_normalize(json_response, 'reports')
# I've also tried adding the line below (and removing it), but I see the same error
df = df_norm.reset_index(drop=True)

def create_db_engine(db_name, schema_name):
    engine = URL(
        account="ab12345.us-west-2",
        user="my_user",
        password="my_pw",
        database="DB",
        schema="PUBLIC",
        warehouse="WH1",
        role="DEV"
    )
    return engine

def create_table(out_df, table_name, idx=False):
    url = create_db_engine(db_name="DB", schema_name="PUBLIC")
    engine = create_engine(url)
    connection = engine.connect()
    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )
    except ConnectionError:
        print("Unable to connect to database!")
    finally:
        connection.close()
        engine.dispose()
    return True

print(df.head())
create_table(df, "reporting")
So... it turns out I needed to change the columns in my DataFrame to uppercase.
I added this after the DataFrame creation and it worked:
df.columns = map(lambda x: str(x).upper(), df.columns)
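For context, Snowflake folds unquoted identifiers to uppercase, while pd_writer quotes the DataFrame's column names verbatim, so lowercase names won't match the table's columns. A minimal sketch of the fix in place (reusing the connection and table name from the question):
# Uppercase the columns so the quoted identifiers emitted by pd_writer
# match Snowflake's uppercase column names.
df.columns = [str(c).upper() for c in df.columns]

df.to_sql(
    "reporting", connection, if_exists="append", index=False, method=pd_writer
)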

Not able to export data from BigQuery to S3 directly

I have a simple BigQuery command that I invoke from the Python client.
The export to GCS works fine, but when I export to S3 I get a 400 error: "google.api_core.exceptions.BadRequest: 400 Unknown keyword EXPORT".
I have seen a sample from Google for exporting directly from BQ to S3: https://cloud.google.com/bigquery/docs/samples/bigquery-omni-export-query-result-to-s3 Is this only supported in BigQuery Omni?
Not sure what is missing... please advise.
There is no permission issue anywhere; I have double-checked.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    "dev-key.json", scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
client = bigquery.Client(credentials=credentials, project=credentials.project_id)

bq_export_to_s3 = """
EXPORT DATA WITH CONNECTION `aws-us-east-1.s3-write-conn`
OPTIONS(
  uri='s3://my-bucket/logs/edo/dengg_audit/bq-demo/temp4/*',
  format='CSV',
  overwrite=true,
  header=false,
  field_delimiter='^') AS
select col1, col2 from `project.schema.table` where clientguid = '1234' limit 10
"""

bq_export_to_gs = """
EXPORT DATA OPTIONS(
  uri='gs://my-bucket/logs/edo/dengg_audit/bq-demo/temp4/*',
  format='CSV',
  overwrite=true,
  header=false,
  field_delimiter='^') AS
select col1, col2 from `project.schema.table` where clientguid = '1234' limit 10
"""

query_job = client.query(bq_export_to_s3)
results = query_job.result()
for row in results:
    print(row)
pip3 install google-api-python-client==1.12.8 \
&& pip3 install google-cloud-storage==1.33.0 \
&& pip3 install google-cloud-bigquery==2.10.0

Bigquery bitcoin dataset SQL query to obtain transactions after a timestamp

I want to create a CSV of all Bitcoin transactions after timestamp 1572491526, so I tried the code below. I want the CSV to have four columns:

transaction_id  timestamp  input  output
1               1          aaa    bbb
1               1          abc    cde
2               2          pqr    xyz

This is what I have tried so far:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()

QUERY = """
SELECT timestamp, transaction_id, inputs, outputs
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
WHERE timestamp > 1572491526
LIMIT 1
"""
queryjob = client.query(QUERY)  # API request
rows = queryjob.result()
row = list(rows)

headlines = pd.DataFrame(data=[list(x.values()) for x in row], columns=list(row[0].keys()))
headlines
But the output I am getting is incorrect. How do I solve this?
timestamp transaction_id inputs outputs
0 1237254030000 8425ac5096ff2b55e0feefa7c78ba609a245e6f185ecde... [{'input_script_bytes': b'\x04\xff\xff\x00\x1d... [{'output_satoshis': 5000000000, 'output_scrip...
The output you are getting is correct. I tested your query in the BigQuery UI and it returns the same thing you are seeing. Keep in mind that the fields inputs and outputs are arrays, which is probably where the confusion is.
I also tested your code and got the same output:
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()

QUERY = """
SELECT timestamp, transaction_id, inputs, outputs
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
WHERE timestamp > 1572491526
LIMIT 1
"""
queryjob = client.query(QUERY)  # API request
rows = queryjob.result()
row = list(rows)

headlines = pd.DataFrame(data=[list(x.values()) for x in row], columns=list(row[0].keys()))
headlines.to_csv('output.csv', index=False, header=True)
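If the goal is one row per input/output combination, as in the desired four-column layout, one option (a sketch, not from the original answer) is to UNNEST the arrays in the query. The struct field names below (input_script_bytes, output_satoshis) are just the ones visible in the printed output above; check the table schema and pick the fields you actually need:
# Hedged sketch: flatten the inputs/outputs arrays so each CSV row holds scalars.
QUERY_FLAT = """
SELECT
  transaction_id,
  timestamp,
  i.input_script_bytes AS input,
  o.output_satoshis AS output
FROM `bigquery-public-data.bitcoin_blockchain.transactions`,
  UNNEST(inputs) AS i,
  UNNEST(outputs) AS o
WHERE timestamp > 1572491526
LIMIT 10
"""
flat = client.query(QUERY_FLAT).to_dataframe()
flat.to_csv('output_flat.csv', index=False)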

How to log/print message in pyspark pandas_udf?

I have found that neither logger nor print can output a message from inside a pandas_udf, in either cluster mode or client mode.
Test code:
import sys
import logging

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

logger = logging.getLogger('test')

spark = (SparkSession
         .builder
         .appName('test')
         .getOrCreate())

df = spark.createDataFrame(pd.DataFrame({
    'y': np.random.randint(1, 10, (20,)),
    'ds': np.random.randint(1000, 9999, (20,)),
    'store_id': ['a'] * 10 + ['b'] * 7 + ['q'] * 3,
    'product_id': ['c'] * 5 + ['d'] * 12 + ['e'] * 3,
}))

@pandas_udf('y int, ds int, store_id string, product_id string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    print('#' * 100)
    logger.info('$' * 100)
    logger.error('&' * 100)
    return pd.DataFrame([], columns=['y', 'ds', 'store_id', 'product_id'])

df1 = df.groupby(['store_id', 'product_id']).apply(train_predict)
Also note:
log4jLogger = spark.sparkContext._jvm.org.apache.log4j
LOGGER = log4jLogger.LogManager.getLogger(__name__)
LOGGER.info("#"*50)
You can't use this in a pandas_udf, because this logger belongs to the Spark context object and you can't refer to the Spark session/context inside a UDF.
The only way I know of is to raise an Exception, as in the answer I wrote below.
But it is tricky and has a drawback.
I want to know if there is any way to simply print a message from inside a pandas_udf.
So far I have tried every way I could find in Spark 2.4.
Without logs, it is hard to debug a faulty pandas_udf. The only workable way I know of to surface an error message from a pandas_udf is to raise an Exception. Debugging this way costs a lot of time, but I don't know a better one.
@pandas_udf('y int, ds int, store_id string, product_id string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    print('#' * 100)
    logger.info('$' * 100)
    logger.error('&' * 100)
    raise Exception('#' * 100)  # the only way I know to surface a message, but it breaks execution
    return pd.DataFrame([], columns=['y', 'ds', 'store_id', 'product_id'])
The drawback is that you can't keep Spark running after printing the message.
One thing you can do is put the log message into the DataFrame itself.
For example:
@pandas_udf('y int, ds int, store_id string, product_id string, log string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    return pd.DataFrame([[3, 5, 'store123', 'product123', 'My log message']],
                        columns=['y', 'ds', 'store_id', 'product_id', 'log'])
After this, you can select the log column (with any related information) into another DataFrame and write it out to a file, then drop it from the original DataFrame.
It's not perfect, but it might be helpful.
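Building on that, a minimal sketch of the follow-up step (the output path is made up for illustration; the names come from the example above):
# Hedged sketch: run the grouped UDF from the example above, split the log
# column into its own DataFrame, write it out, then keep only the data columns.
result = df.groupby(['store_id', 'product_id']).apply(train_predict)

logs = result.select('store_id', 'product_id', 'log')
logs.write.mode('overwrite').csv('/tmp/train_predict_logs')  # illustrative path

data = result.drop('log')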

How can we parse multiline using pyspark

I have a test CSV file with the content below:
"TVBQGEM461
2016-10-05 14:04:33 cvisser gefixt door company
"
I need to store this entire content in one single row. However, while processing with PySpark, it gets split into 2 rows.
Below is the PySpark code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *

sqlContext = SQLContext(sc)

customSchema = StructType([
    StructField("desc", StringType(), True)])

df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', inferschema='true').load('/dev/landingzone/test.csv', schema=customSchema)
df.registerTempTable("temp")
sqlContext.sql("create table dev_core_source.test as select * from temp")
The data is getting loaded into the Hive table, but it is split into 2 rows instead of 1.
I have also tried some other options for creating the DataFrame, like the one below, but I still face the same issue.
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .option('inferschema', 'true') \
    .option('wholeFile', 'true') \
    .options(parserLib='UNIVOCITY') \
    .load('/dev/landingzone/test.csv', schema=customSchema)
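For reference, on Spark 2.2+ the built-in CSV reader has a multiLine option that keeps a quoted newline inside a single record. A minimal sketch, assuming a SparkSession named spark and the schema from above (this is not from the original thread):
# Hedged sketch: native CSV reader with multiLine, so the quoted newline stays
# inside one record instead of being split into two rows.
df = spark.read \
    .option('multiLine', 'true') \
    .option('quote', '"') \
    .option('escape', '"') \
    .schema(customSchema) \
    .csv('/dev/landingzone/test.csv')

df.show(truncate=False)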