Apache Beam Dataflow BigQuery IO without schema

Is there any way to write unstructured data to a BigQuery table using the Apache Beam Dataflow BigQuery IO API (i.e. without providing a schema upfront)?

BigQuery needs to know the schema when it creates the table, or when one writes to it. Depending on your situation, you may be able to determine the schema dynamically in the pipeline construction code rather than hard-coding it.

Alternatively, create a table with just a single STRING column to store the data from Dataflow:
CREATE TABLE IF NOT EXISTS `your_project.dataset.rawdata` (
  raw STRING
);
You can store any data as a string without knowing its schema. For example, you can store a JSON message as a single string, a CSV line as a string, and so on.
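If you are writing your own Beam pipeline rather than using a pre-built Dataflow template, a minimal Java sketch of this approach could look like the following. The table name and the sample JSON strings are placeholders; a schema is still passed to BigQueryIO, but it is only the trivial single-STRING-column one and is built in the pipeline construction code rather than derived from the data.

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class RawStringWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // The only schema BigQuery ever sees: a single STRING column holding the
    // raw payload. It is built at pipeline-construction time, not taken from the data.
    TableSchema schema = new TableSchema().setFields(Collections.singletonList(
        new TableFieldSchema().setName("raw").setType("STRING")));

    // Hypothetical sample payloads; in practice this would be PubsubIO, TextIO, etc.
    p.apply(Create.of("{\"user\":\"kim\"}", "{\"user\":\"lee\",\"age\":3}"))
        .apply(BigQueryIO.<String>write()
            .to("your_project:dataset.rawdata") // the table from the DDL above
            .withFormatFunction(json -> new TableRow().set("raw", json))
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}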
Specify the table as the destination of your Dataflow job. If you use a Dataflow template, you may need to provide a JavaScript UDF which converts a message from the source into a single string compatible with the schema of the table above.
/**
 * User-defined function (UDF) to transform events
 * as part of a Dataflow template job.
 *
 * @param {string} inJson input Pub/Sub JSON message (stringified)
 * @return {string} outJson output JSON message (stringified)
 */
function process(inJson) {
  var obj = JSON.parse(inJson),
      includePubsubMessage = obj.data && obj.attributes,
      data = includePubsubMessage ? obj.data : obj;

  // INSERT CUSTOM TRANSFORMATION LOGIC HERE

  return JSON.stringify(obj);
}
https://cloud.google.com/blog/topics/developers-practitioners/extend-your-dataflow-template-with-udfs
As you can see, the sample UDF above returns a JSON string.
You can later interpret the data with a schema (a.k.a. the schema-on-read strategy), like the following:
SELECT
  JSON_VALUE(raw, '$.json_path_you_have') AS column1,
  JSON_QUERY_ARRAY(raw, '$.json_path_you_have') AS column2,
  ...
FROM `your_project.dataset.rawdata`
Depending on your source data, you can use JSON functions or regular expressions to organize the data into a table with whatever schema you want.

Related

In BigQuery, query to get GCS metadata (filenames in GCS)

We have a GCS bucket with a subfolder at the URL https://storage.googleapis.com/our-bucket/path-to-subfolder. This subfolder contains the files:
file_1_3.png
file_7_4.png
file_3_2.png
file_4_1.png
We'd like to create a table in BigQuery with a column number1 with values 1,7,3,4 (first number in filename) and a column number2 with the second numbers. String splitting is easy, once the data (a column with filenames) is in BigQuery. How can the filenames be retrieved? Is it possible to query a GCS bucket for metadata on files?
Updating the answer to reflect the question of how to retrieve GCS bucket metadata on files.
There are two options here, depending on the use case:
Utilize a Cloud Function on a cron schedule to read the metadata (like in the example you shared), then use the BQ client library to perform an insert. Then apply the regex listed below.
This option utilizes a feature (remote functions) that is in preview, so you may not have the functionality needed; however, you may be able to request it. This option would get you the latest data on read. It involves the following:
Create a Cloud Function that returns an array of blob names, see code below.
Create a connection resource in BigQuery (the overall process is listed here; however, since the remote function portion is in preview, the documentation and potentially your UI may not reflect the necessary options, which was the case for me).
Create a remote function (third code block in the link).
Call the function from your code, then manipulate the result as needed with a regexp.
Example Cloud Function for option 2:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    storage_client = storage.Client()

    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)

    blob_array = []
    for blob in blobs:
        blob_array.append(blob.name)
    return blob_array
Example remote function from documentation:
CREATE FUNCTION mydataset.remoteMultiplyInputs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
REMOTE WITH CONNECTION us.myconnection
OPTIONS(endpoint="https://us-central1-myproject.cloudfunctions.net/multiply");
Once it's in, it will return the full GCS path of the file. From there you can use a regex like regexp_extract(_FILE_NAME, 'file_(.+)_') to extract the important information.
Now that BQ Remote Functions (RF) are GA, as well as JSON, I thought I'd share a way to get any property of the blobs in a bucket, right from BQ SQL.
!! Make sure to carefully read the official documentation first on how to set up an RF, as it's easy to miss a step. There are slight differences if you'd rather use a 2nd gen function or Cloud Run.
Create the following storage Cloud Function (here in Python; 1st gen is good enough):
import json
from google.cloud import storage

storage_client = storage.Client()

def list_blobs(request):
    print(request_json := request.get_json())  # print for debugging
    calls = request_json['calls']
    bucket_name = calls[0][0]
    blobs = storage_client.list_blobs(bucket_name)
    reply = [b._properties for b in blobs]
    return json.dumps({'replies': [reply]})
Create BQ remote function (assumes fns dataset, us.api connection and my_project_id):
CREATE FUNCTION fns.list_blobs(bucket STRING)
RETURNS JSON
REMOTE WITH CONNECTION us.api
OPTIONS(endpoint="https://us-central1-my_project_id.cloudfunctions.net/storage")
The trick to returning multiple values for a single request is to use the JSON type.
SELECT whatever properties you want:
SELECT STRING(blob.name), STRING(blob.size), CAST(STRING(blob.updated) AS TIMESTAMP)
FROM UNNEST(
  JSON_EXTRACT_ARRAY(
    fns.list_blobs('my_bucket')
  )
) blob
The JSON is converted to an ARRAY, and UNNEST() pivots it to multiple rows (unfortunately not to columns too).
Voila! I wish there were an easier way to fully parse a JSON array into a table, populating all columns at once, but as of this writing you must explicitly extract the properties you want.
You can do many more cool things by extending the functions (Cloud and remote) so you don't have to leave SQL, like:
generate and return a signed URL to display/download right from a query result (e.g. in a BI tool)
use user_defined_context and branching logic in the CF code to perform other operations like deleting blobs
Object tables are read-only tables containing a metadata index over the unstructured data objects in a specified Cloud Storage bucket. Each row of the table corresponds to an object, and the table columns correspond to the object metadata generated by Cloud Storage, including any custom metadata.
With object tables we can get the file names and operate on them in BigQuery itself.
https://cloud.google.com/bigquery/docs/object-table-introduction

How to auto detect schema from file in GCS and load to BigQuery?

I'm trying to load a file from GCS to BigQuery with a schema auto-generated from the file in GCS. I'm using Apache Airflow to do this; the problem I'm having is that when I use schema auto-detection, BigQuery creates the schema based on roughly the first 100 values.
For example, in my case there is a column, say X; the values in X are mostly of integer type, but some values are of string type, so bq load will fail with a schema mismatch. In such a scenario we need to change the data type to STRING.
So what I could do is manually create a new table by generating the schema on my own, or I could set the max_bad_records value to something like 50, but that doesn't seem like a good solution. An ideal solution would look like this:
Try to load the file from GCS to BigQuery, if the table was created successfully in BQ without any data mismatch, then I don't need to do anything.
Otherwise I need to be able to update the schema dynamically and complete the table creation.
As you cannot change a column's type in BQ (see this link):
BigQuery natively supports the following schema modifications:
* Adding columns to a schema definition
* Relaxing a column's mode from REQUIRED to NULLABLE
All other schema modifications are unsupported and require manual workarounds
So as a workaround I suggest:
Use --max_rows_per_request = 1 in your script
Use one line which is the most suitable for your case, with the optimized field types.
This will create the table with the correct schema and one line, and from there you can load the rest of the data.
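A rough sketch of that idea with the BigQuery Java client library (bucket, dataset, table names and file layout are placeholders): load one representative line first so the table is created with the schema you want, then load the remaining data against that now-fixed schema.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class TwoStepLoad {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bq = BigQueryOptions.getDefaultInstance().getService();
    TableId table = TableId.of("my_dataset", "my_table"); // hypothetical names

    // Step 1: load a single representative line whose values force the types
    // you want (e.g. the problematic column X contains a string), so the
    // table is created with that schema.
    LoadJobConfiguration seed = LoadJobConfiguration
        .newBuilder(table, "gs://my-bucket/seed_one_line.csv", FormatOptions.csv())
        .setAutodetect(true)
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
        .build();
    bq.create(JobInfo.of(seed)).waitFor();

    // Step 2: load the full file into the now-existing table; the fixed
    // schema is used instead of re-running auto-detection.
    LoadJobConfiguration full = LoadJobConfiguration
        .newBuilder(table, "gs://my-bucket/full_file.csv", FormatOptions.csv())
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_APPEND)
        .build();
    bq.create(JobInfo.of(full)).waitFor();
  }
}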

Can I issue a query rather than specify a table when using the BigQuery connector for Spark?

I have used the "Use the BigQuery connector with Spark" guide to extract data from a table in BigQuery by running the code on Google Dataproc. As far as I'm aware, the code shared there:
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

# Output Parameters.
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'

# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
copies the entirety of the named table into input_directory. The table I need to extract data from contains >500m rows and I don't need all of those rows. Is there a way to instead issue a query (as opposed to specifying a table) so that I can copy a subset of the data from a table?
It doesn't look like BigQuery supports any kind of filtering/querying for table exports at the moment:
https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract
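Since the export itself cannot filter, one commonly used workaround is to first materialize the subset with a query job into its own table and then point the connector's mapred.bq.input.* settings at that smaller table. A rough sketch with the BigQuery Java client library, where the project, dataset and destination table names are placeholders:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class MaterializeSubset {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bq = BigQueryOptions.getDefaultInstance().getService();

    // Write only the rows you need into a staging table...
    QueryJobConfiguration query = QueryJobConfiguration.newBuilder(
            "SELECT word, word_count "
                + "FROM `bigquery-public-data.samples.shakespeare` "
                + "WHERE corpus = 'hamlet'")
        .setDestinationTable(TableId.of("my_project", "staging", "shakespeare_subset"))
        .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
        .setUseLegacySql(false)
        .build();
    bq.create(JobInfo.of(query)).waitFor();

    // ...then point the connector at the staging table instead:
    //   'mapred.bq.input.project.id': 'my_project'
    //   'mapred.bq.input.dataset.id': 'staging'
    //   'mapred.bq.input.table.id':   'shakespeare_subset'
  }
}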

How to deserialize the ProtoBuf serialized HBase columns in Hive?

I have used protobufs to serialize the class and store it in HBase columns.
I want to reduce the number of MapReduce jobs for simple aggregations, so I need an SQL-like tool to query the data.
If I use Hive, is it possible to extend the HBaseStorageHandler and write our own SerDe for each table?
Or is any other good solution available?
Updated:
I created the HBase table as:
create 'hive:users', 'i'
and inserted user data from the Java API:
public static final byte[] INFO_FAMILY = Bytes.toBytes("i");
private static final byte[] USER_COL = Bytes.toBytes(0);

public Put mkPut(User u)
{
    Put p = new Put(Bytes.toBytes(u.userid));
    p.addColumn(INFO_FAMILY, USER_COL, UserConverter.fromDomainToProto(u).toByteArray());
    return p;
}
My scan gave the following results:
hbase(main):016:0> scan 'hive:users'
ROW COLUMN+CELL
kim123 column=i:\x00, timestamp=1521409843085, value=\x0A\x06kim123\x12\x06kimkim\x1A\x10kim123#gmail.com
1 row(s) in 0.0340 seconds
When I query the table in Hive, I don't see any records.
Here is the command I used to create the table:
create external table users(userid binary, userobj binary)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties("hbase.columns.mapping" = ":key, i:0", "hbase.table.default.storage.type" = "binary")
tblproperties("hbase.table.name" = "hive:users");
When I query the Hive table, I don't see the record inserted from HBase.
Can you please tell me what is wrong here?
You could try writing a UDF which takes the binary protobuf and converts it to some readable structure (comma-separated or JSON). You would have to make sure to map the values as binary data.
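A minimal sketch of such a UDF, assuming the protobuf-generated User class implied by UserConverter in the question (the package name is a placeholder) and protobuf-java-util's JsonFormat for the JSON output; it uses the older simple Hive UDF API:

import com.google.protobuf.util.JsonFormat;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;

// Hypothetical generated protobuf class for the question's User message.
import com.example.proto.UserProtos.User;

public final class ProtoToJsonUDF extends UDF {
  public Text evaluate(BytesWritable value) throws Exception {
    if (value == null) {
      return null;
    }
    // copyBytes() trims the backing array to the actual cell length.
    User user = User.parseFrom(value.copyBytes());
    return new Text(JsonFormat.printer().print(user));
  }
}

After packaging it into a jar, you would register it in Hive with ADD JAR and CREATE TEMPORARY FUNCTION proto_to_json AS 'ProtoToJsonUDF', then query it as SELECT proto_to_json(userobj) FROM users.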

Keep Column types from Java ResultSet in CSV export

I'm currently building a tool that pulls data directly from a database (because SPSS Modeler is too slow) and stores it in a Java ResultSet first of all.
I then try to export the data into a CSV (or similar) file while keeping as many of the column types as possible.
Currently I'm using opencsv, but it casts decimals and many other types to strings. When I load the file back into SPSS Modeler I only get integers and strings.
Are there any CSV libraries (maybe with a special encoding) or other file types I can use to export the data with its column types (like IBM InfoSphere Data Architect can do), so I can load it directly back into SPSS Modeler without changing it back manually there?
Thank you!
Retrieving the Metadata from the DB Information Schema
If the data is currently stored in a database, you can retrieve the column types from the information schema. All you need to do is retrieve this information after you have queried the table and store it so that you can reuse it later.
// connect to the DB as usual
Statement stmt = conn.createStatement();

// create your query
// Note that you can use a dummy query here: you only need to access the
// metadata of the table, regardless of the actual query.
ResultSet rse = stmt.executeQuery("SELECT A, B FROM table WHERE ..");

// get the ResultSetMetaData
ResultSetMetaData rsmd = rse.getMetaData();

// Get the database-specific type
rsmd.getColumnTypeName(1); // database-specific type name for column 1 (e.g. VARCHAR)
rsmd.getColumnTypeName(2); // database-specific type name for column 2 (e.g. DATETIME)
// ...

// Get the generic JDBC type, see http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
rsmd.getColumnType(1); // generic type for column 1 (e.g. 12)
rsmd.getColumnType(2); // generic type for column 2
Processing
You could store this information in a CSV schema and apply it during the transformation process.
I recommend that you use SuperCSV, which is available here.
This library provides so-called cell processors, which allow you to define the types of the columns.
Description:
Cell processors are an integral part of reading and writing with Super CSV - they automate the data type conversions, and enforce constraints. They implement the chain of responsibility design pattern - each processor has a single, well-defined purpose and can be chained together with other processors to fully automate all of the required conversions and constraint validation for a single CSV column.
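A rough sketch of how the two parts could fit together, assuming SuperCSV's CsvResultSetWriter and a handful of cell processors chosen from the JDBC column types; the connection string, query, and number/date formats are placeholders:

import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;
import java.sql.Types;

import org.supercsv.cellprocessor.FmtDate;
import org.supercsv.cellprocessor.FmtNumber;
import org.supercsv.cellprocessor.Optional;
import org.supercsv.cellprocessor.ift.CellProcessor;
import org.supercsv.io.CsvResultSetWriter;
import org.supercsv.prefs.CsvPreference;

public class TypedCsvExport {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:...");   // your DB
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT A, B, C FROM my_table")) {

      // Derive one cell processor per column from the JDBC metadata so that
      // numbers and dates are written in an unambiguous, re-importable format.
      ResultSetMetaData md = rs.getMetaData();
      CellProcessor[] processors = new CellProcessor[md.getColumnCount()];
      for (int i = 1; i <= md.getColumnCount(); i++) {
        switch (md.getColumnType(i)) {
          case Types.DECIMAL:
          case Types.NUMERIC:
          case Types.DOUBLE:
            processors[i - 1] = new Optional(new FmtNumber("0.00####"));
            break;
          case Types.DATE:
          case Types.TIMESTAMP:
            processors[i - 1] = new Optional(new FmtDate("yyyy-MM-dd HH:mm:ss"));
            break;
          default:
            processors[i - 1] = new Optional(); // written as-is
        }
      }

      try (CsvResultSetWriter writer = new CsvResultSetWriter(
          new FileWriter("export.csv"), CsvPreference.STANDARD_PREFERENCE)) {
        writer.write(rs, processors);
      }
    }
  }
}

Because the numeric and date formats are explicit, the exported values can be re-imported with their intended types instead of collapsing to plain integers and strings.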