Scio saveAsTypedBigQuery write to a partition for SCollection of Typed Big Query case class - google-bigquery

I'm trying to write a SCollection to a partition in Big Query using:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
val date = LocateDate.parse("2017-06-21")
val col = sCollection.typedBigQuery[Blah](query)
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.ISO_LOCAL_DATE),
writeDisposition = WriteDisposition.WRITE_EMPTY,
createDisposition = CreateDisposition.CREATE_IF_NEEDED)
The error I get is
Table IDs must be alphanumeric (plus underscores) and must be at most 1024 characters long. Also, Table decorators cannot be used."
How can I write to a partition? I don't see any options to specify partitions via either saveAsTypedBigQuery method so I was trying the Legacy SQL table decorators.

See: BigqueryIO Unable to Write to Date-Partitioned Table. You need to manually create the table. BQ IO cannot create a table and partition it.
Additionally, the no table decorators was a complete ruse. It's the alphanumeric part I was missing.
col.saveAsTypedBigQuery(
tableSpec = "test.test$" + date.format(DateTimeFormatter.BASIC_ISO_DATE),
writeDisposition = WriteDisposition.WRITE_APPEND,
createDisposition = CreateDisposition.CREATE_NEVER)

Related

how to dynamically build select list from a API payload using PyPika

I have a JSON API payload containing tablename, columnlist - how to build a SELECT query from it using pypika?
So far I have been able to use a string columnlist, but not able to do advanced querying using functions, analytics etc.
from pypika import Table, Query, functions as fn
def generate_sql (tablename, collist):
table = Table(tablename)
columns = [str(table)+'.'+each for each in collist]
q = Query.from_(table).select(*columns)
return q.get_sql(quote_char=None)
tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue)']
print (generate_sql(tablename, collist)) #1
table = Table(tablename)
q = Query.from_(table).select(table.id, table.fname, fn.Sum(table.revenue))
print (q.get_sql(quote_char=None)) #2
#1 outputs
SELECT "customers".id,"customers".fname,"customers".fn.Sum(revenue) FROM customers
#2 outputs correctly
SELECT id,fname,SUM(revenue) FROM customers
You should not be trying to assemble the query in a string by yourself, that defeats the whole purpose of pypika.
What you can do in your case, that you have the name of the table and the columns coming as texts in a json object, you can use * to unpack those values from the collist and use the syntax obj[key] to get the table attribute with by name with a string.
q = Query.from_(table).select(*(table[col] for col in collist))
# SELECT id,fname,fn.Sum(revenue) FROM customers
Hmm... that doesn't quite work for the fn.Sum(revenue). The goal is to get SUM(revenue).
This can get much more complicated from this point. If you are only sending column names that you know to belong to that table, the above solution is enough.
But if you have complex sql expressions, making reference to sql functions or even different tables, I suggest you to rethink your decision of sending that as json. You might end up with something as complex as pypika itself, like a custom parser or wathever. than your better option here would be to change the format of your json response object.
If you know you only need to support a very limited set of capabilities, it could be feasible. For example, you can assume the following constraints:
all column names refer to only one table, no joins or alias
all functions will be prefixed by fn.
no fancy stuff like window functions, distinct, count(*)...
Then you can do something like:
from pypika import Table, Query, functions as fn
import re
tablename = 'customers'
collist = ['id', 'fname', 'fn.Sum(revenue / 2)', 'revenue % fn.Count(id)']
def parsed(cols):
pattern = r'(?:\bfn\.[a-zA-Z]\w*)|([a-zA-Z]\w*)'
subst = lambda m: f"{'' if m.group().startswith('fn.') else 'table.'}{m.group()}"
yield from (re.sub(pattern, subst, col) for col in cols)
table = Table(tablename)
env = dict(table=table, fn=fn)
q = Query.from_(table).select(*(eval(col, env) for col in parsed(collist)))
print (q.get_sql(quote_char=None)) #2
Output:
SELECT id,fname,SUM(revenue/2),MOD(revenue,COUNT(id)) FROM customers

Right way to implement pandas.read_sql with ClickHouse

Trying to implement pandas.read_sql function.
I created a clickhouse table and filled it:
create table regions
(
date DateTime Default now(),
region String
)
engine = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY tuple()
SETTINGS index_granularity = 8192;
insert into regions (region) values ('Asia'), ('Europe')
Then python code:
import pandas as pd
from sqlalchemy import create_engine
uri = 'clickhouse://default:#localhost/default'
engine = create_engine(uri)
query = 'select * from regions'
pd.read_sql(query, engine)
As the result I expected to get a dataframe with columns date and region but all I get is empty dataframe:
Empty DataFrame
Columns: [2021-01-08 09:24:33, Asia]
Index: []
UPD. It occured that defining clickhouse+native solves the problem.
Can it be solved without +native?
There is encient issue https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/10. Also there is a hint which assumes to add FORMAT TabSeparatedWithNamesAndTypes at the end of a query. So the init query will be look like this:
select *
from regions
FORMAT TabSeparatedWithNamesAndTypes

how to refer values based on column names in python

i am trying to extract and read the data from a SQL query.
Below is the sample data from SQL developer:
target_name expected_instances environment system_name hostname
--------------------------------------------------------------------------------------
ORAUAT_host1 1 UAT ORAUAT_host1_sys host1.sample.net
ORAUAT_host2 1 UAT ORAUAT_host1_sys host2.sample.net
Normally i pass the system_name to the query (which has a bind variable for system_name) and get the data as a list,but not the column names.
Is there a way in Python to retrieve the data along with the column names and reference values with column name like target_name[0] giving the value ORAUAT_host1?Please suggest.Thanks.
If what you want is to get the column names from the table you are querying, you can do something like this:
My example is printing a csv file
import os
import sys
import cx_Oracle
db = cx_Oracle.connect('user/pass#host:1521/service_name')
SQL = "select * from dual"
print(SQL)
cursor = db.cursor()
f = open("C:\dual.csv", "w")
writer = csv.writer(f, lineterminator="\n", quoting=csv.QUOTE_NONNUMERIC)
r = cursor.execute(SQL)
#this takes the column names
col_names = [row[0] for row in cursor.description]
writer.writerow(col_names)
for row in cursor:
writer.writerow(row)
f.close()
The way to print the columns is using the method description of the cursor object
Cursor.description
This read-only attribute is a sequence of 7-item sequences. Each of
these sequences contains information describing one result column:
(name, type, display_size, internal_size, precision, scale, null_ok).
This attribute will be None for operations that do not return rows or
if the cursor has not had an operation invoked via the execute()
method yet.
The type will be one of the database type constants defined at the
module level.
https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#

Can BigQuery API overwrite existing table/view with create_table() (tables insert)?

I'm using the Python client create_table() function which calls the underlying tables insert API. There is an exists_ok parameter but this causes the function to simply ignore the create if the table already exists. The problem with this is that when creating a view, I would like to overwrite the existing view SQL if it's already there. What I'm currently doing to get around this is:
if overwrite:
bq_client.delete_table(view, not_found_ok=True)
view = bq_client.create_table(view)
What I don't like about this is there are potentially several seconds during which the view no longer exists. And if the code dies for whatever reason after the delete but before the create then the view is effectively gone.
My question: is there a way to create a table (view) such that it overwrites any existing object? Or perhaps I have to detect this situation and run some kind of update_table() (patch)?
If you want to overwrite an existing table, you can use google.cloud.bigquery.job.WriteDisposition class, please refer to official documentation.
You have three possibilities here: WRITE_APPEND, WRITE_EMPTY and WRITE_TRUNCATE. What you should use, is WRITE_TRUNCATE, which overwrites the table data.
You can see following example here:
from google.cloud import bigquery
import pandas
client = bigquery.Client()
table_id = "<YOUR_PROJECT>.<YOUR_DATASET>.<YOUR_TABLE_NAME>"
records = [
{"artist": u"Michael Jackson", "birth_year": 1958},
{"artist": u"Madonna", "birth_year": 1958},
{"artist": u"Shakira", "birth_year": 1977},
{"artist": u"Taylor Swift", "birth_year": 1989},
]
dataframe = pandas.DataFrame(
records,
columns=["artist", "birth_year"],
index=pandas.Index(
[u"Q2831", u"Q1744", u"Q34424", u"Q26876"], name="wikidata_id"
),
)
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("artist", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("wikidata_id", bigquery.enums.SqlTypeNames.STRING),
],
write_disposition="WRITE_TRUNCATE",
)
job = client.load_table_from_dataframe(
dataframe, table_id, job_config=job_config
)
job.result()
table = client.get_table(table_id)
Let me know if it suits your need. I hope it helps.
UPDATED:
You can use following Python code to update a table view using the client library:
client = bigquery.Client(project="projectName")
table_ref = client.dataset('datasetName').table('tableViewName')
table = client.get_table(table_ref)
table.view_query = "SELECT * FROM `projectName.dataset.sourceTableName`"
table = client.update_table(table, ['view_query'])
You can do it this way.
Hope this may help!
from google.cloud import bigquery
clientBQ = bigquery.Client()
def tableExists(tableID, client=clientBQ):
"""
Check if a table already exists using the tableID.
return : (Boolean)
"""
try:
table = client.get_table(tableID)
return True
except NotFound:
return False
if tableExists(viewID, client=clientBQ):
print("View already exists, Deleting the view ... ")
clientBQ .delete_table(viewID)
view = bigquery.Table(viewID)
view.view_query = "SELECT * FROM `PROJECT_ID.DATASET_NAME.TABLE_NAME`"
clientBQ.create_table(view)

SparkSQL Staging Table Row Count vs Hive Row count

I am attempting to extract data from Cassandra, into a specific partitioned Hive table using Spark 2.1.1 on Hadoop 2.7. To do this, I have all the data from Cassandra into an rdd which I transform into a dataframe via rdd.toDF(), and passed into the following function:
public def writeToHive(ss: SparkSession, df: DataFrame) {
df.createOrReplaceTempView(tablename)
val cols = df.columns
val schema = df.schema
// logs 358
LOG.info(s"""SELECT COUNT(*) FROM ${tablename}""")
val outdf = ss.sql(s"""INSERT INTO TABLE ${db}.${t} PARTITION (date="${destPartition}") SELECT * FROM ${tablename}""")
// Have also tried the following lines below, but yielded the same results
// var dfInput_1 = dfInput.withColumn("region", lit(s"${destPartition}"))
// dfInput_1.write.mode("append").insertInto(s"${db}.${t}")
// logs 358
LOG.info(s"""SELECT COUNT(*) FROM ${tablename}""")
// logs 423
LOG.info(s"""SELECT COUNT(*) FROM ${db}.${t} where date='${destPartition}'""")
}
When looking in Cassandra, there are indeed 358 rows in the table. I saw this post on Hortonworks https://community.hortonworks.com/questions/51322/count-msmatch-while-using-the-parquet-file-in-spar.html but there doesn't seem to be a solution. I have tried setting spark.sql.hive.metastorePartitionPruning to true, but no changes were seen in the row counts.
Would love any feedback as to why there is a discrepancy between the row counts. Thanks!
EDIT: bad data coming in.... should've seen that coming
Sometimes data contains non-utf8 characters like Japanese or Chinese. Check if data contains any such non-utf8 characters.
If this is a case insert it in ORC format. By default it is text, and text doesn't support non-utf8 characters.