Data not being populated after altering the Hive table

I have a Hive table that is populated from an underlying Parquet file in an HDFS location. I have now altered the table schema by renaming a column, but that column is being populated with NULL instead of the original data in the Parquet file.
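For context, the rename was done with a statement along these lines (table and column names here are hypothetical):

ALTER TABLE my_table CHANGE COLUMN sta_dte start_date STRING;

By default Hive's Parquet reader resolves columns by name, so after a rename the existing data files still carry the old column name and reads of the renamed column come back as NULL.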

Give this a try. Open up the .avsc file. For the column, you will find something like:
{
  "name" : "start_date",
  "type" : [ "null", "string" ],
  "default" : null,
  "columnName" : "start_date",
  "sqlType" : "12"
}
Add an alias:
{
  "name" : "start_date",
  "type" : [ "null", "string" ],
  "default" : null,
  "columnName" : "start_date",
  "aliases" : [ "sta_dte" ],
  "sqlType" : "12"
}

Related

Use null as default for a field in Avro4k

I am using avro4k and I have a field that is nullable, like this:
@Serializable
data class Product(
    @AvroDefault("null")
    @ScalePrecision(DEFAULT_SCALE, DEFAULT_PRECISION)
    @Serializable(with = BigDecimalSerializer::class)
    val price: BigDecimal? = null
)
This is the generated schema:
{
  "type" : "record",
  "name" : "Product",
  "namespace" : "org.company",
  "fields" : [ {
    "name" : "price",
    "type" : [ {
      "type" : "bytes",
      "logicalType" : "decimal",
      "precision" : 7,
      "scale" : 2
    }, "null" ],
    "default" : "null"
  } ]
}
The Avro specification expects that null should not have quotes.
I think it's also a problem that the null type appears last in the union, since according to the documentation:
Unions, as mentioned above, are represented using JSON arrays. For example, ["null", "string"] declares a schema which may be either a null or string.
(Note that when a default value is specified for a record field whose type is a union, the type of the default value must match the first element of the union. Thus, for unions containing “null”, the “null” is usually listed first, since the default value of such unions is typically null.)
Is there a way to fix both these issues?
You're supposed to use Avro.NULL in that case.
@Serializable
data class Product(
    @AvroDefault(Avro.NULL)
    @ScalePrecision(DEFAULT_SCALE, DEFAULT_PRECISION)
    @Serializable(with = BigDecimalSerializer::class)
    val price: BigDecimal? = null
)
It also fixes the order.
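For reference, with Avro.NULL the generated schema should come out along these lines (a sketch of the expected output, not captured verbatim): the union lists "null" first and the default is an unquoted null.
{
  "type" : "record",
  "name" : "Product",
  "namespace" : "org.company",
  "fields" : [ {
    "name" : "price",
    "type" : [ "null", {
      "type" : "bytes",
      "logicalType" : "decimal",
      "precision" : 7,
      "scale" : 2
    } ],
    "default" : null
  } ]
}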

How to return nested json in one row using JSON_TABLE

I am struggling with the following query. I have JSON, and the query is splitting the duration and distance elements over two rows. I need them in one row and cannot see where to change this.
The following will run as-is from SQL*Plus etc. (Oracle 20.1):
select *
from json_table(
  '{
     "destination_addresses" : [ "Page St, London SW1P 4BG, UK" ],
     "origin_addresses" : [ "Corbridge Drive, Luton, LU2 9UH, UK" ],
     "rows" : [
       {
         "elements" : [
           {
             "distance" : {
               "text" : "88 km",
               "value" : 87773
             },
             "duration" : {
               "text" : "1 hours 25 mins",
               "value" : 4594
             },
             "status" : "OK"
           }
         ]
       }
     ],
     "status" : "OK"
   }',
  '$' COLUMNS
  (
    origin_addresses      varchar2(1000 char) path '$.origin_addresses[*]',
    destination_addresses varchar2(1000 char) path '$.destination_addresses[*]',
    nested path '$.rows.elements.distance[*]'
      COLUMNS(
        distance_text  varchar2(100) path '$.text',
        distance_value varchar2(100) path '$.value'
      ),
    nested path '$.rows.elements.duration[*]'
      COLUMNS(
        duration_text  varchar2(100) path '$.text',
        duration_value varchar2(100) path '$.value'
      )
  )
);
Mathguy: I don't have influence over the JSON that Google returns, but origin and destinations are arrays; it is just that this search is from A to Z rather than A,B,C to X,Y,Z, for example, which would return 9 results instead of 1.
See results here
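One way this kind of row splitting is commonly avoided (a sketch, not tested against the data above; :json stands in for the JSON literal from the question) is to use a single NESTED PATH over the elements array and project both distance and duration from it, so they land in the same row:
select *
from json_table(
  :json,
  '$' COLUMNS
  (
    origin_addresses      varchar2(1000 char) path '$.origin_addresses[*]',
    destination_addresses varchar2(1000 char) path '$.destination_addresses[*]',
    nested path '$.rows[*].elements[*]'
      COLUMNS(
        distance_text  varchar2(100) path '$.distance.text',
        distance_value varchar2(100) path '$.distance.value',
        duration_text  varchar2(100) path '$.duration.text',
        duration_value varchar2(100) path '$.duration.value'
      )
  )
);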

Is there a way to match avro schema with Bigquery and Bigtable?

I'd like to import BigQuery data into Bigtable using Google Composer.
Exporting BigQuery rows in Avro format to GCS was successful. However, importing the Avro data into Bigtable was not.
The error says:
Caused by: org.apache.avro.AvroTypeException: Found Root, expecting com.google.cloud.teleport.bigtable.BigtableRow, missing required field key
I guess the schemas of BigQuery and Bigtable should match each other, but I have no idea how to do this.
For every record read from the Avro files:
Attributes present in the files and in the table are loaded into the table.
Attributes present in the file but not in the table are subject to ignore_unknown_fields.
Attributes that exist in the table but not in the file will use their default value, if there is one set.
The links below are helpful.
[1] https://cloud.google.com/dataflow/docs/guides/templates/provided-batch#cloud-storage-avro-to-bigtable
[2] https://github.com/GoogleCloudPlatform/DataflowTemplates/blob/master/src/main/resources/schema/avro/bigtable.avsc
[3] Avro to BigTable - Schema issue?
For those of you who, like me, still have this problem because you are not familiar with Avro, here is one working schema transformation that I found after some tinkering.
For example, if you have a table from BigQuery like this
and you want to use user_id as the Bigtable row key and ingest all columns, here is example code to encode them as an Avro file.
import json
from datetime import datetime

from avro.schema import Parse
from avro.io import DatumWriter
from avro.datafile import DataFileWriter

# Avro schema expected by the Cloud Storage Avro to Bigtable template
bigtable_schema = {
    "name": "BigtableRow",
    "type": "record",
    "namespace": "com.google.cloud.teleport.bigtable",
    "fields": [
        {"name": "key", "type": "bytes"},
        {"name": "cells",
         "type": {
             "type": "array",
             "items": {
                 "name": "BigtableCell",
                 "type": "record",
                 "fields": [
                     {"name": "family", "type": "string"},
                     {"name": "qualifier", "type": "bytes"},
                     {"name": "timestamp", "type": "long", "logicalType": "timestamp-micros"},
                     {"name": "value", "type": "bytes"}
                 ]
             }
         }}
    ]
}
parsed_schema = Parse(json.dumps(bigtable_schema))

row_key = 'user_id'
family_name = 'feature_name'
feature_list = ['channel', 'zip_code', 'history']

# df is the pandas DataFrame holding the rows exported from BigQuery
with open('features.avro', 'wb') as f:
    writer = DataFileWriter(f, DatumWriter(), parsed_schema)
    for item in df.iterrows():
        row = item[1]
        ts = int(datetime.now().timestamp()) * 1000 * 1000  # timestamp in microseconds
        for feat in feature_list:
            writer.append({
                "key": row[row_key].encode('utf-8'),
                "cells": [{"family": family_name,
                           "qualifier": feat.encode('utf-8'),
                           "timestamp": ts,
                           "value": str(row[feat]).encode('utf-8')}]
            })
    writer.close()
Then you can use the Dataflow template job to run the ingestion (an example command is shown below).
Complete code can be found here: https://github.com/mitbal/sidu/blob/master/bigquery_to_bigtable.ipynb
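If it helps, running the provided Cloud Storage Avro to Cloud Bigtable template (see link [1] above) looks roughly like this; the region, project, instance, table and bucket names are placeholders, and the parameter names should be double-checked against the template documentation:
gcloud dataflow jobs run avro-to-bigtable \
    --gcs-location gs://dataflow-templates/latest/GCS_Avro_to_Cloud_Bigtable \
    --region us-central1 \
    --parameters bigtableProjectId=my-project,bigtableInstanceId=my-instance,bigtableTableId=my-table,inputFilePattern=gs://my-bucket/features.avro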

OPENJSON does not select all documents into the SQL table

I have been trying to import the contents of a JSON file into a SQL Server table. However, despite the presence of multiple rows in the JSON, the output SQL table contains only the first row from the JSON. The code I am using is as follows:
DROP TABLE IF EXISTS testingtable;
DECLARE @json VARCHAR(MAX) = '{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" },
{ "_id" : "01002", "city" : "CUSHMAN", "loc" : [ -72.51564999999999, 42.377017 ], "pop" : 36963, "state" : "MA" }';
SELECT * INTO testingtable FROM OPENJSON(@json)
WITH (_id int, city varchar(20), loc float(50), pop int, state varchar(5))
SELECT * FROM testingtable
And the output obtained is as follows:
Click to view
A multi-line JSON text needs to be enclosed in square brackets (i.e. it needs to be a JSON array), for example:
[
{first data set},
{second data set}, .....
]
You can either add square brackets while passing data to this query, or else you can add square brackets to your @json variable (e.g. '[' + @json + ']'):
DECLARE @json VARCHAR(MAX) = '{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" },
{ "_id" : "01002", "city" : "CUSHMAN", "loc" : [ -72.51564999999999, 42.377017 ], "pop" : 36963, "state" : "MA" }';
SELECT * INTO testingtable FROM OPENJSON('[' + @json + ']')
WITH (_id int, city varchar(20), loc float(50), pop int, state varchar(5))
SELECT * FROM testingtable
The string isn't valid JSON. You can't have two root objects in a JSON document. Properly formatted, the JSON string looks like this:
DECLARE @json VARCHAR(MAX) = '{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" },
{ "_id" : "01002", "city" : "CUSHMAN", "loc" : [ -72.51564999999999, 42.377017 ], "pop" : 36963, "state" : "MA" }';
It should be:
DECLARE @json VARCHAR(MAX) = '[{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" },
{ "_id" : "01002", "city" : "CUSHMAN", "loc" : [ -72.51564999999999, 42.377017 ], "pop" : 36963, "state" : "MA" }
]';
It looks like OPENJSON parsed the first object and stopped as soon as it encountered the invalid text.
The quick & dirty way to fix this would be to add the missing square brackets:
SELECT * FROM OPENJSON('[' + @json + ']') WITH (_id int, city varchar(20), loc float(50), pop int, state varchar(5))
I suspect that string came from a log or event file that stores individual records on separate lines. That's not valid JSON, nor is there any kind of standard or specification for this (name squatters notwithstanding), but a lot of high-traffic applications use it, e.g. in log files or event streaming.
The reason they do this is that there's no need to construct or read the entire array to get a record. It's easy to just append a new line for each record. Reading huge files and processing them in parallel is also easier - just read the text line by line and feed it to workers. Or split the file in N parts to the nearest newline and feed individual parts to different machines. That's how Map-Reduce works.
That's why adding the square brackets is a dirty solution - you have to read the entire text from a multi-MB or GB-sized file before you can parse it. That's not what OPENJSON was built to do.
The proper solution would be to read the file line-by-line using another tool, parse the records and insert the values into the target tables.
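As a sketch of that line-by-line approach (assuming Python with pyodbc, a file named records.json with one JSON object per line, and the testingtable layout from the question; the connection string and column handling would need adjusting):
import json
import pyodbc

# Hypothetical connection string; adjust driver, server and database.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};SERVER=.;DATABASE=test;Trusted_Connection=yes")
cursor = conn.cursor()

with open("records.json") as f:
    for line in f:
        line = line.strip().rstrip(",")   # tolerate the trailing comma after each record
        if not line:
            continue
        doc = json.loads(line)            # parse one record per line
        cursor.execute(
            "INSERT INTO testingtable (_id, city, pop, state) VALUES (?, ?, ?, ?)",
            doc["_id"], doc["city"], doc["pop"], doc["state"])

conn.commit()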
If you know the JSON docs will not contain any internal newline characters, you can split the string with string_split. OPENJSON doesn't care about leading whitespace or a trailing comma. That way you avoid adding the [ ] characters and don't have to parse the whole thing as one big document.
EG:
DROP TABLE IF EXISTS testingtable;
DECLARE @jsonFragment VARCHAR(MAX) = '{ "_id" : "01001", "city" : "AGAWAM", "loc" : [ -72.622739, 42.070206 ], "pop" : 15338, "state" : "MA" },
{ "_id" : "01002", "city" : "CUSHMAN", "loc" : [ -72.51564999999999, 42.377017 ], "pop" : 36963, "state" : "MA" }';
SELECT *
INTO testingtable
FROM string_split(@jsonFragment, CHAR(10)) docs
cross apply
(
    select *
    from openjson(docs.value)
    WITH (_id int, city varchar(20), loc float(50), pop int, state varchar(5))
) d
SELECT * FROM testingtable
This format is what you might call a "JSON fragment", by analogy with XML. And this is another difference between XML and JSON in SQL Server: for XML, the engine is happy to parse and store XML fragments, but it is not for JSON.

BigQuery: --[no]use_avro_logical_types flag doesn't work

I am trying to use the bq command with the --[no]use_avro_logical_types flag to load Avro files into a BigQuery table which does not exist before executing the command. The Avro schema contains a timestamp-millis logical type value. When the command is executed, a new table is created, but the type of that column becomes INTEGER.
This is a recently released feature, so I cannot find examples and I don't know what I am missing. Could anyone give me a good example?
My Avro schema looks like the following:
...
}, {
  "name" : "timestamp",
  "type" : [ "null", "long" ],
  "default" : null,
  "logicalType" : [ "null", "timestamp-millis" ]
}, {
...
And the command being executed is this:
bq load --source_format=AVRO --use_avro_logical_types <table> <path/to/file>
To use the timestamp-millis logical type, you can specify the field in the following way:
{
  "name" : "timestamp",
  "type" : {"type" : "long", "logicalType" : "timestamp-millis"}
}
In order to provide an optional 'null' value, you can try out the following spec:
{
  "name" : "timestamp",
  "type" : [ "null", {"type" : "long", "logicalType" : "timestamp-millis"} ]
}
For a full list of supported Avro logical types, please refer to the Avro spec: https://avro.apache.org/docs/1.8.0/spec.html#Logical+Types.
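Putting the pieces together, a load using the union spec above should produce a TIMESTAMP column; one way to check the resulting schema (dataset, table and bucket names here are placeholders) is:
bq load --source_format=AVRO --use_avro_logical_types mydataset.mytable gs://my-bucket/myfile.avro
bq show --schema --format=prettyjson mydataset.mytable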
According to https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro, the Avro type timestamp-millis is converted to an INTEGER once loaded into BigQuery.