Not able to export data from BigQuery to S3 directly - google-bigquery

I have a simple BigQuery command that I invoke from a Python client.
Exporting to GCS works fine, but when I export to S3 I get a 400 error: "google.api_core.exceptions.BadRequest: 400 Unknown keyword EXPORT"
I have seen a sample from Google that exports directly from BigQuery to S3: https://cloud.google.com/bigquery/docs/samples/bigquery-omni-export-query-result-to-s3 Is this only supported in BigQuery Omni?
Not sure what is missing; please advise.
There is no permission issue anywhere; I have double-checked.
from google.cloud import bigquery
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_file("dev-key.json", scopes=["https://www.googleapis.com/auth/cloud-platform"],)
client = bigquery.Client(credentials=credentials, project=credentials.project_id,)
bq_export_to_s3 = """
EXPORT DATA WITH CONNECTION `aws-us-east-1.s3-write-conn`
OPTIONS(
uri='s3://my-bucket/logs/edo/dengg_audit/bq-demo/temp4/*',
format='CSV',
overwrite=true,
header=false,
field_delimiter='^') AS
select col1 , col2 from `project.schema.table` where clientguid = '1234' limit 10
"""
bq_export_to_gs = """
EXPORT DATA OPTIONS(
uri='gs://my-bucket/logs/edo/dengg_audit/bq-demo/temp4/*',
format='CSV',
overwrite=true,
header=false,
field_delimiter='^') AS
select col1 , col2 from `project.schema.table` where clientguid = '1234' limit 10
"""
# Run the S3 export; swap in bq_export_to_gs to run the GCS export instead
query_job = client.query(bq_export_to_s3)
results = query_job.result()
for row in results:
    print(row)
pip3 install google-api-python-client==1.12.8 \
&& pip3 install google-cloud-storage==1.33.0 \
&& pip3 install google-cloud-bigquery==2.10.0

Related

How to export a null field on BigQuery

When I try to export a JSON object through BigQuery and a field has a null value, it disappears from the downloaded results.
An example of a downloaded query:
EXPORT DATA OPTIONS(
uri='gs://analytics-export/_*',
format='JSON',
overwrite=true) AS
SELECT NULL AS field1
The actual result is: {}
The expected result is: {field1: null}
How can I force the export to include the null value, as shown in the expected result?
For this, you can use:
Select TO_JSON_STRING(NULL) as field1
Select 'null' as field1
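For illustration only, here is a minimal sketch (not part of the original answer) that plugs the suggested TO_JSON_STRING workaround into the EXPORT DATA statement from the question; the URI and options are taken from the example above:
from google.cloud import bigquery

client = bigquery.Client()
# Same EXPORT DATA statement as in the question, with TO_JSON_STRING so the
# field is no longer a bare NULL that gets dropped from the JSON output.
export_with_nulls = """
EXPORT DATA OPTIONS(
  uri='gs://analytics-export/_*',
  format='JSON',
  overwrite=true) AS
SELECT TO_JSON_STRING(NULL) AS field1
"""
client.query(export_with_nulls).result()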
In the EXPORT DATA documentation there is no reference to an option that includes null values in the output, so I think you can go to the feature request page and create a request for it. Also, there are similar observations on other projects pointing out that it is not supported yet; see details here.
There are many workarounds for this; let me show you two options below.
Option 1: Call BigQuery directly from Python using the BigQuery client library
from google.cloud import bigquery
import json

client = bigquery.Client()
query = "select null as field1, null as field2"
query_job = client.query(query)
json_list = {}
for row in query_job:
    json_row = {'field1': row[0], 'field2': row[1]}
    json_list.update(json_row)
with open('test.json', 'w+') as file:
    file.write(json.dumps(json_list))
Option 2: Use Apache Beam on Dataflow with Python and BigQuery to produce the desired output
import argparse
import re
import json
import apache_beam as beam
from apache_beam.io import BigQuerySource
from apache_beam.io import WriteToText
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions

def add_null_field(row, field):
    if field != 'skip':
        row.update({field: row.get(field, None)})
    return row

def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output',
        dest='output',
        required=True,
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)
    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session
    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | beam.io.Read(beam.io.BigQuerySource(query='SELECT null as field1, null as field2'))
         | beam.Map(add_null_field, field='skip')
         | beam.Map(json.dumps)
         | beam.io.Write(beam.io.WriteToText(known_args.output, file_name_suffix='.json')))

if __name__ == '__main__':
    run()
To run it:
python -m export --output gs://my_bucket_id/output/ \
--runner DataflowRunner \
--project my_project_id \
--region my_region \
--temp_location gs://my_bucket_id/tmp/
Note: just replace my_project_id, my_bucket_id and my_region with the appropriate values, then look in your Cloud Storage bucket for the output file.
Both options will produce the output you are looking for:
{"field1": null, "field2": null}
Please let me know if it helps you and gives you the result you want to achieve.

Jupyter load dataframe into DB2 table

I have a Jupyter notebook and I am trying to read CSV files from URLs and then load them into DB2. My first problem is that sometimes the table already exists and I need to drop it first; second, it seems impossible to load the data into the tables. What am I doing wrong?
import pandas as pd
import io
import requests
import ibm_db
import ibm_db_dbi
After this I try:
dsn = "DRIVER={{IBM DB2 ODBC DRIVER}};" + \
"DATABASE=BLUDB;" + \
"HOSTNAME=myhostname;" + \
"PORT=50000;" + \
"PROTOCOL=TCPIP;" + \
"UID=myuid;" + \
"PWD=mypassword;"
url1="https://ibm.box.com/shared/static/05c3415cbfbtfnr2fx4atenb2sd361ze.csv"
s1=requests.get(url1).content
df1=pd.read_csv(io.StringIO(s1.decode('utf-8')))
url2="https://ibm.box.com/shared/static/f9gjvj1gjmxxzycdhplzt01qtz0s7ew7.csv"
s2=requests.get(url2).content
df2=pd.read_csv(io.StringIO(s2.decode('utf-8')))
url3="https://ibm.box.com/shared/static/svflyugsr9zbqy5bmowgswqemfpm1x7f.csv"
s3=requests.get(url3).content
df3=pd.read_csv(io.StringIO(s3.decode('utf-8')))
All of this works, and so does the connection:
hdbc = ibm_db.connect(dsn, "", "")
hdbi = ibm_db_dbi.Connection(hdbc)
And this fails:
#DropTableIfExists = ibm_db.exec_immediate(hdbc, 'DROP TABLE QBG03137.DATAFRAME1')
CreateTable = ibm_db.exec_immediate(hdbc, sql)
resultSet = ibm_db.exec_immediate(hdbc, sql)
#define row and fetch tuple
row = ibm_db.fetch_tuple(resultSet)
comma = ""
while (row != False):
    for column in row:
        print(comma, end="")
        print(column, end="")
        comma = ","
    print()
    row = ibm_db.fetch_tuple(resultSet)
With this error:
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-94-840854e92bd7> in <module>
5
6 # Testing table exists
----> 7 resultSet = ibm_db.exec_immediate(hdbc, sql)
8
9 #define row and fetch tuple
Exception: [IBM][CLI Driver][DB2/LINUXX8664] SQL0601N The name of the object to be created is identical to the existing name "QBG03137.DATAFRAME1" of type "TABLE". SQLSTATE=42710 SQLCODE=-601
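For reference, a minimal sketch of the drop-then-create pattern hinted at by the commented-out DropTableIfExists line (this is an assumption, not a posted answer; it reuses the hdbc connection and the sql variable from the question):
import ibm_db

# Drop the table if it already exists; ibm_db raises an Exception when the
# DROP fails because the table is not there, so that case is simply ignored.
try:
    ibm_db.exec_immediate(hdbc, 'DROP TABLE QBG03137.DATAFRAME1')
except Exception:
    pass  # table did not exist yet, nothing to drop

# The CREATE TABLE held in `sql` no longer collides with an existing name.
ibm_db.exec_immediate(hdbc, sql)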

Bigquery bitcoin dataset SQL query to obtain transactions after a timestamp

I want to create a CSV of all Bitcoin transactions after timestamp 1572491526, so I tried the code below. I want the CSV to have four columns:
transaction_id, timestamp, input, output
1 1 aaa bbb
1 1 abc cde
2 2 pqr xyz
This is what I have tried so far:
from google.cloud import bigquery
client = bigquery.Client()
QUERY = """
SELECT timestamp, transaction_id, inputs, outputs
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
WHERE timestamp > 1572491526
LIMIT 1
"""
queryjob = client.query(QUERY) # API request
rows = queryjob.result()
row = list(rows)
import pandas as pd
headlines = pd.DataFrame(data=[list(x.values()) for x in row], columns=list(row[0].keys()))
headlines
But the output I am getting is incorrect. How do I solve this?
timestamp transaction_id inputs outputs
0 1237254030000 8425ac5096ff2b55e0feefa7c78ba609a245e6f185ecde... [{'input_script_bytes': b'\x04\xff\xff\x00\x1d... [{'output_satoshis': 5000000000, 'output_scrip...
The output you are getting is correct; I tested your query in the BigQuery UI and it returns the same thing you are getting. Please consider that the fields inputs and outputs are arrays; the confusion might be there.
Also, I tested your code and I got the same output:
from google.cloud import bigquery
client = bigquery.Client()
QUERY = """
SELECT timestamp, transaction_id, inputs, outputs
FROM `bigquery-public-data.bitcoin_blockchain.transactions`
WHERE timestamp > 1572491526
LIMIT 1
"""
queryjob = client.query(QUERY) # API request
rows = queryjob.result()
row = list(rows)
import pandas as pd
headlines = pd.DataFrame(data=[list(x.values()) for x in row], columns=list(row[0].keys()))
headlines.to_csv('output.csv', index=False, header=True)
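If the goal is a CSV where the array columns are readable, one option is to serialize inputs and outputs to JSON strings before writing. This is a hedged sketch, not part of the original answer; it reuses the row list built in the code above:
import json
import pandas as pd

# Convert each BigQuery Row to a dict and turn the array columns into JSON
# strings so the CSV cells hold readable text instead of Python reprs.
records = [dict(r.items()) for r in row]
for rec in records:
    rec['inputs'] = json.dumps(rec['inputs'], default=str)
    rec['outputs'] = json.dumps(rec['outputs'], default=str)
pd.DataFrame(records).to_csv('output_flat.csv', index=False)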

S3 select query not recognizing data

I generate a dataframe, write the dataframe to S3 as a CSV file, and perform a select query on the CSV in the S3 bucket. Based on the query and data I expect to see '4' and '10' printed, but I only see '4'. For some reason S3 is not seeing the '10'.
It works fine for filtering between dates.
import pandas as pd
import s3fs
import boto3
# dataframe
d = {'date':['1990-1-1','1990-1-2','1990-1-3','1999-1-4'], 'speed':[0,10,3,4]}
df = pd.DataFrame(d)
# write csv to s3
bytes_to_write = df.to_csv(index=False).encode()
fs = s3fs.S3FileSystem()
with fs.open('app-storage/test.csv', 'wb') as f:
    f.write(bytes_to_write)
# query csv in s3 bucket
s3 = boto3.client('s3',region_name='us-east-1')
resp = s3.select_object_content(
Bucket='app-storage',
Key='test.csv',
ExpressionType='SQL',
Expression="SELECT s.\"speed\" FROM s3Object s WHERE s.\"speed\" > '3'",
InputSerialization = {'CSV': {"FileHeaderInfo": "Use"}},
OutputSerialization = {'CSV': {}},
)
for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)
Just needed to cast the string to float in the SQL statement:
"SELECT s.\"speed\" FROM s3Object s WHERE cast(s.\"speed\" as float) > 3"
Now it works without a problem.
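Put back into the original call, that looks like this (a sketch reusing the bucket and key from the question):
# Same select_object_content request as above, with CAST applied so the
# speed column is compared numerically instead of as a string.
resp = s3.select_object_content(
    Bucket='app-storage',
    Key='test.csv',
    ExpressionType='SQL',
    Expression="SELECT s.\"speed\" FROM s3Object s WHERE cast(s.\"speed\" as float) > 3",
    InputSerialization={'CSV': {"FileHeaderInfo": "Use"}},
    OutputSerialization={'CSV': {}},
)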

How can we parse multiline using pyspark

I have a test CSV file with the content below:
"TVBQGEM461
2016-10-05 14:04:33 cvisser gefixt door company
"
I need to store this entire content in one single row. However, while processing with pyspark, it is getting split into 2 rows.
Below is the pyspark code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
customSchema = StructType([
    StructField("desc", StringType(), True)])
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false', inferschema='true').load('/dev/landingzone/test.csv', schema = customSchema)
df.registerTempTable("temp")
sqlContext.sql("create table dev_core_source.test as select * from temp")
The data is getting loaded into the Hive table, but it is split into 2 rows instead of 1.
I have also tried some other options, like the one below, for creating the data frame, but I am still facing the same issue.
df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .options(header='true') \
    .option('inferschema', 'true') \
    .option('wholeFile', 'true') \
    .options(parserLib='UNIVOCITY') \
    .load('/dev/landingzone/test.csv', schema=customSchema)
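Not part of the original post, but for reference: on Spark 2.2 or later the built-in CSV reader has a multiLine option that keeps quoted line breaks inside a single record, which is what this file needs. A minimal sketch under that assumption:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
customSchema = StructType([StructField("desc", StringType(), True)])

# multiLine keeps the quoted newline inside one record instead of splitting it
df = (spark.read
      .option("header", "false")
      .option("multiLine", "true")
      .schema(customSchema)
      .csv('/dev/landingzone/test.csv'))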