BigQuery: Array<string> field with WriteToBigQuery - google-bigquery

I'm creating a Google Dataflow template in Python:
query = "#standardSQL" + """
SELECT
Frame.Serial,
Frame.Fecha,
Frame.Longitud,
Frame.Latitud,
ARRAY_AGG (CONCAT (ID, '-', Valor) ORDER BY ID) AS Resumen
FROM <...>
TABLE_SCHEMA = 'Serial:STRING,Fecha:DATETIME,Longitud:STRING,Latitud:STRING,Resumen:STRING'
| 'Read from BQ' >> beam.io.Read(beam.io.BigQuerySource(query=query,dataset="xxx",use_standard_sql=True))
| 'Write transform to BigQuery' >> WriteToBigQuery('table',TABLE_SCHEMA)
The problem
This fails because the Resumen field is an array:
Array specified for non-repeated field.
What I tested
Create the table directly in the BigQuery UI with this statement:
CREATE TABLE test (Resumen ARRAY<STRING>)
This works. The table is created with:
Type: string
Mode: Repeated
Change the TABLE_SCHEMA and run the pipeline:
TABLE_SCHEMA ='Serial:STRING,Fecha:DATETIME,Longitud:STRING,Latitud:STRING,Resumen:ARRAY<STRING>'
This fails with the error:
"Invalid value for: ARRAY\u003cSTRING\u003e is not a valid value".
What should the TABLE_SCHEMA be in order to create the table and use it with beam.io.WriteToBigQuery()?

Looks like repeated or nested fields are not supported if you specify a BQ schema in a single string: https://beam.apache.org/documentation/io/built-in/google-bigquery/#creating-a-table-schema
You will need to describe your schema explicitly and set the field mode to repeated: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/cookbook/bigquery_schema.py#L95
# A repeated field.
children_schema = bigquery.TableFieldSchema()
children_schema.name = 'children'
children_schema.type = 'string'
children_schema.mode = 'repeated'
table_schema.fields.append(children_schema)
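Applied to the schema in the question, a minimal sketch might look like the following (field names are taken from the question; everything else follows the cookbook example linked above, and only Resumen needs mode 'repeated'):

from apache_beam.io.gcp.internal.clients import bigquery

table_schema = bigquery.TableSchema()

# Plain (non-repeated) fields from the question.
for name, field_type in [('Serial', 'STRING'), ('Fecha', 'DATETIME'),
                         ('Longitud', 'STRING'), ('Latitud', 'STRING')]:
    field = bigquery.TableFieldSchema()
    field.name = name
    field.type = field_type
    field.mode = 'nullable'
    table_schema.fields.append(field)

# The repeated field that caused the error.
resumen = bigquery.TableFieldSchema()
resumen.name = 'Resumen'
resumen.type = 'STRING'
resumen.mode = 'repeated'
table_schema.fields.append(resumen)

# ... | 'Write transform to BigQuery' >> WriteToBigQuery('table', schema=table_schema)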

Related

Create new files from multiple export query data in bigquery

I'm trying to run multiple export statements in BigQuery, like this:
EXPORT DATA OPTIONS(
uri='gs://bucket/folder/*.csv',
format='CSV',
overwrite=true,
header=true,
field_delimiter=';') AS
SELECT field1, field2 FROM mydataset.table1 ORDER BY field1 LIMIT 10
The problem is that I don't want to overwrite new files, but instead I want new files to be created only if the query returns something.
I've tried changing the uri to gs://bucket/folder/*1.csv but this is creating an empty file (which I don't want).
I've also set the overwrite parameter to false; this results in Invalid value: overwrite option is not specified and destination is not empty. at [109:1]
Any ways to fix this?
You may try the approach below.
EXECUTE IMMEDIATE CONCAT(
  "EXPORT DATA OPTIONS(uri='gs://your-bucket/your-folder/file_",
  FORMAT_TIMESTAMP("%F-%T", CURRENT_TIMESTAMP(), "UTC"),
  "_*.csv', format = 'CSV', overwrite = true, header = true, field_delimiter = ';') AS SELECT field1, field2 FROM your-dataset.table1 ORDER BY field1"
)
This approach concatenates the current timestamp to your filename so that it serves as a unique identifier for the file names, and a new file is created each time your query returns something.
In addition, if you experience a location error when executing the above query, the posted answer here will solve the problem.
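If you also want to make sure nothing is written when the query returns no rows, one option is to drive the export from a small Python script and check the row count first. This is only a sketch, assuming the google-cloud-bigquery client; the bucket, folder and table names are placeholders:

import datetime
from google.cloud import bigquery

client = bigquery.Client()
select_sql = "SELECT field1, field2 FROM mydataset.table1 ORDER BY field1 LIMIT 10"

# Count the rows first and skip the export entirely when there is nothing to write.
row_count = list(client.query(f"SELECT COUNT(*) AS c FROM ({select_sql})").result())[0]["c"]
if row_count > 0:
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H%M%S")
    export_sql = (
        f"EXPORT DATA OPTIONS(uri='gs://bucket/folder/file_{stamp}_*.csv', "
        "format='CSV', overwrite=true, header=true, field_delimiter=';') AS "
        + select_sql
    )
    client.query(export_sql).result()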

Databricks notebook - return all values from a table

For test purposes, I have an empty DB into which I populate a tiny amount of data, extracted and transformed from a json file.
I would like to create a notebook using Scala which gets all values from all columns of a given table, and exits the notebook returning this result as a string.
I've tried variations of the following:
val result = spark.sql("select * from table.DB").as[String];
dbutils.notebook.exit(result)
However, the first command fails with this error:
AnalysisException: Try to map struct<Version:bigint,metadataInformation:struct<metadataID:string... etc ...> to Tuple1, but failed as the number of fields does not line up.;
However, something like the following works to retrieve the value of a specific field from a column:
val result = spark.sql("select column.jsonfield from table.DB").as[String].first();
dbutils.notebook.exit(result)
How can I return the content of all columns?
// Collect all rows to the driver, flatten each Row into its values,
// and join everything into one comma-separated string.
val result = spark.sql("SELECT x FROM y").collect().toList.flatMap(x => x.toSeq).mkString(",")
dbutils.notebook.exit(result)
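If you end up calling this from a Python cell instead, a roughly equivalent sketch (same placeholder table as above) would be:

# Collect the rows on the driver and join every value into one comma-separated string.
rows = spark.sql("SELECT x FROM y").collect()
result = ",".join(str(value) for row in rows for value in row)
dbutils.notebook.exit(result)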

how to refer values based on column names in python

I am trying to extract and read the data from a SQL query.
Below is the sample data from SQL developer:
target_name   expected_instances  environment  system_name       hostname
--------------------------------------------------------------------------
ORAUAT_host1  1                   UAT          ORAUAT_host1_sys  host1.sample.net
ORAUAT_host2  1                   UAT          ORAUAT_host1_sys  host2.sample.net
Normally I pass the system_name to the query (which has a bind variable for system_name) and get the data back as a list, but not the column names.
Is there a way in Python to retrieve the data along with the column names, and to reference values by column name, e.g. target_name[0] giving the value ORAUAT_host1? Please suggest. Thanks.
If what you want is to get the column names from the table you are querying, you can do something like this:
My example writes the output to a CSV file:
import csv
import cx_Oracle

# Connect and run the query.
db = cx_Oracle.connect('user/pass@host:1521/service_name')
SQL = "select * from dual"
print(SQL)
cursor = db.cursor()
f = open(r"C:\dual.csv", "w")
writer = csv.writer(f, lineterminator="\n", quoting=csv.QUOTE_NONNUMERIC)
r = cursor.execute(SQL)
# cursor.description holds the column metadata; item 0 of each entry is the column name.
col_names = [row[0] for row in cursor.description]
writer.writerow(col_names)
for row in cursor:
    writer.writerow(row)
f.close()
The way to get the column names is to use the description attribute of the cursor object:
Cursor.description
This read-only attribute is a sequence of 7-item sequences. Each of
these sequences contains information describing one result column:
(name, type, display_size, internal_size, precision, scale, null_ok).
This attribute will be None for operations that do not return rows or
if the cursor has not had an operation invoked via the execute()
method yet.
The type will be one of the database type constants defined at the
module level.
https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#
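If the goal is to reference each value by its column name, as with target_name in the question, a small sketch building on the same cursor; the table name and bind variable are placeholders, and note that Oracle returns column names in upper case by default:

# Build one dict per row, keyed by column name.
cursor.execute("SELECT * FROM your_table WHERE system_name = :sysname",
               sysname="ORAUAT_host1_sys")
columns = [desc[0] for desc in cursor.description]
rows = [dict(zip(columns, row)) for row in cursor]
print(rows[0]["TARGET_NAME"])   # e.g. 'ORAUAT_host1'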

Cannot define a BigQuery column as ARRAY<STRUCT<INT64, INT64>>

I am trying to define a table that has a column that is an array of structs, using standard SQL. The docs here suggest this should work:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<INT64,INT64>>
)
but I get an error:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat xxxx.ddl.temp.sql | awk 'ORS=" "')"
Waiting on bqjob_r6735048b_00000173ed2d9645_1 ... (0s) Current status: DONE
Error in query string: Error processing job 'xxxxx-10843454-yyyyy-
dev:bqjob_r6735048b_00000173ed2d9645_1': Illegal field name:
Changing the field (edit: column!) name does not fix it. What am I doing wrong?
The fields within the struct need to be named so this works:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<x INT64,y INT64>>
)
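For reference, the same column can be declared through the Python client as a REPEATED RECORD field with named sub-fields. This is just a sketch, assuming the google-cloud-bigquery library and a placeholder project name:

from google.cloud import bigquery

schema = [
    bigquery.SchemaField("id", "STRING"),
    # ARRAY<STRUCT<x INT64, y INT64>> maps to a REPEATED RECORD with named sub-fields.
    bigquery.SchemaField(
        "something", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("x", "INT64"),
            bigquery.SchemaField("y", "INT64"),
        ],
    ),
]
table = bigquery.Table("your-project.ta_producer_conformed.FundStaticData", schema=schema)
# bigquery.Client().create_table(table)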

find tables with specific columns' names in a database on databricks by pyspark

I would like to find the tables that contain a specific column in a database on Databricks, using PySpark SQL.
I tried the approach from the following post, but it does not work:
https://medium.com/@rajnishkumargarg/find-all-the-tables-by-column-name-in-hive-51caebb94832
On SQL server my code:
SELECT Table_Name, Column_Name
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_CATALOG = 'YOUR_DATABASE'
AND COLUMN_NAME LIKE '%YOUR_COLUMN%'
but I cannot figure out how to do the same thing in PySpark SQL.
Thanks.
The SparkSession has a catalog property. The catalog's listTables method returns a list of all tables known to the SparkSession. With this list you can query all columns for each table with listColumns:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("test").getOrCreate()

spark.sql("CREATE TABLE tab1 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab2 (name STRING, age INT) USING parquet")
spark.sql("CREATE TABLE tab3 (street STRING, age INT) USING parquet")

for table in spark.catalog.listTables():
    for column in spark.catalog.listColumns(table.name):
        if column.name == 'name':
            print('Found column {} in table {}'.format(column.name, table.name))
prints
Found column name in table tab1
Found column name in table tab2
Both methods, listTables and listColumns, accept a database name as an optional argument if you want to restrict your search to a single database, for example:
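A minimal sketch restricted to a single database (the database name sales is only a placeholder):

# Only look at tables and columns inside the 'sales' database.
for table in spark.catalog.listTables("sales"):
    for column in spark.catalog.listColumns(table.name, dbName="sales"):
        if column.name == 'name':
            print('Found column {} in table sales.{}'.format(column.name, table.name))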
I had a similar problem to the OP: I needed to find all columns, including nested columns, that match a LIKE clause.
I wrote a post about it here https://medium.com/helmes-people/how-to-view-all-databases-tables-and-columns-in-databricks-9683b12fee10
But you can find the full code below.
The benefit of this solution, in comparison with the previous answers, is that it works when you need to search columns with LIKE '%%', as the OP asked. It also allows you to search for a name in nested fields. Finally, it creates a SQL-like view, similar to the INFORMATION_SCHEMA views.
from pyspark.sql.types import StructType

# get field name from schema (recursive for getting nested values)
def get_schema_field_name(field, parent=None):
    if type(field.dataType) == StructType:
        if parent == None:
            prt = field.name
        else:
            prt = parent + "." + field.name  # using dot notation
        res = []
        for i in field.dataType.fields:
            res.append(get_schema_field_name(i, prt))
        return res
    else:
        if parent == None:
            res = field.name
        else:
            res = parent + "." + field.name
        return res

# flatten list, from https://stackoverflow.com/a/12472564/4920394
def flatten(S):
    if S == []:
        return S
    if isinstance(S[0], list):
        return flatten(S[0]) + flatten(S[1:])
    return S[:1] + flatten(S[1:])

# list of databases
db_list = [x[0] for x in spark.sql("SHOW DATABASES").rdd.collect()]
for i in db_list:
    spark.sql("SHOW TABLES IN {}".format(i)).createOrReplaceTempView(str(i) + "TablesList")

# create a query for fetching all tables from all databases
union_string = "SELECT database, tableName FROM "
for idx, item in enumerate(db_list):
    if idx == 0:
        union_string += str(item) + "TablesList WHERE isTemporary = 'false'"
    else:
        union_string += " UNION ALL SELECT database, tableName FROM {}".format(str(item) + "TablesList WHERE isTemporary = 'false'")
spark.sql(union_string).createOrReplaceTempView("allTables")

# full list = schema, table, column
full_list = []
for i in spark.sql("SELECT * FROM allTables").collect():
    table_name = i[0] + "." + i[1]
    table_schema = spark.sql("SELECT * FROM {}".format(table_name))
    column_list = []
    for j in table_schema.schema:
        column_list.append(get_schema_field_name(j))
    column_list = flatten(column_list)
    for k in column_list:
        full_list.append([i[0], i[1], k])
spark.createDataFrame(full_list, schema=['database', 'tableName', 'columnName']).createOrReplaceTempView("allColumns")
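Once the allColumns view exists, the search the question asks about becomes a single SQL query, for example:

# Find every table that has a column (or nested field) matching the pattern.
spark.sql("SELECT * FROM allColumns WHERE columnName LIKE '%YOUR_COLUMN%'").show(truncate=False)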
# The following code will create a TempView containing all the tables,
# and all their columns along with their type, for a specified database.
cls = []
spark.sql("Drop view if exists allTables")
spark.sql("Drop view if exists allColumns")
for table in spark.catalog.listTables("TYPE_IN_YOUR_DB_NAME_HERE"):
    for column in spark.catalog.listColumns(table.name, table.database):
        cls.append([table.database, table.name, column.name, column.dataType])
spark.createDataFrame(cls, schema=['databaseName', 'tableName', 'columnName',
                                   'columnDataType']).createOrReplaceTempView("allColumns")
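A quick usage sketch against this view, after filling in the placeholder database name above:

# Show matching columns together with their data types.
spark.sql("SELECT tableName, columnName, columnDataType FROM allColumns "
          "WHERE columnName LIKE '%YOUR_COLUMN%'").show(truncate=False)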
SparkSession really does have a catalog property, as werner mentioned.
If I understand you correctly, you want to get the tables that have a specific column.
You can try this code (sorry for Scala code instead of Python):
val databases = spark.catalog.listDatabases().select($"name".as("db_name")).as("databases")
val tables = spark.catalog.listTables().select($"name".as("table_name"), $"database").as("tables")
val tablesWithDatabase = databases.join(tables, $"databases.db_name" === $"tables.database", "inner").collect()

tablesWithDatabase.foreach(row => {
  val dbName = row.get(0).asInstanceOf[String]
  val tableName = row.get(1).asInstanceOf[String]
  val columns = spark.catalog.listColumns(dbName, tableName)
  columns.foreach(column => {
    if (column.name == "Your column")
      // Do your logic here
      null
  })
})
Notice that I am doing a collect, so if you have a lot of tables/databases it can cause an OOM error. The reason I am collecting is that, in contrast to listTables and listDatabases, which can be called with no arguments at all, listColumns needs a dbName and tableName, and there is no unique column id that maps a column back to its table.
So the search for the column will be done locally on the driver.
Hope that helps.