Hive/Beeline: creation of a table with a subarray

I'm trying to create a table in Hive using Beeline. The data are stored in HDFS as Parquet files and have the following schema:
{
  "object_type": "test",
  "heartbeat": 1496755564224,
  "events": [
    {
      "timestamp": 1496755582985,
      "hostname": "hostname1",
      "instance": "instance1",
      "metrics_array": [
        {
          "metric_name": "metric1_1",
          "metric_value": "value1_1"
        }
      ]
    },
    {
      "timestamp": 1496756626551,
      "hostname": "hostname2",
      "instance": "instance1",
      "metrics_array": [
        {
          "metric_name": "metric2_1",
          "metric_value": "value2_1"
        }
      ]
    }
  ]
}
My HQL script for creating the table is the following:
set hive.support.sql11.reserved.keywords=false;
CREATE DATABASE IF NOT EXISTS datalake;
DROP TABLE IF EXISTS datalake.test;
CREATE EXTERNAL TABLE IF NOT EXISTS datalake.test
(
  object_type STRING,
  heartbeat BIGINT,
  events STRUCT<
    metrics_array: STRUCT<
      metric_name: STRING,
      metric_value: STRING
    >,
    timestamp: BIGINT,
    hostname: STRING,
    instance: STRING
  >
)
STORED AS PARQUET
LOCATION '/tmp/test/';
Here is the error I get when I do a SELECT * FROM datalake.test:
Error: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://tmp/test/part-r-00000-7e58b193-a08f-44b1-87fa-bb12b4053bdf.gz.parquet (state=,code=0)
Any ideas? Thanks!
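For reference, in the sample JSON above both events and metrics_array are arrays of structs, whereas the DDL declares them as plain STRUCTs, so the table schema does not match the nesting of the data. A sketch of a DDL that mirrors the JSON nesting (untested against the actual Parquet files) would be:

CREATE EXTERNAL TABLE IF NOT EXISTS datalake.test
(
  object_type STRING,
  heartbeat BIGINT,
  events ARRAY<STRUCT<
    timestamp: BIGINT,
    hostname: STRING,
    instance: STRING,
    metrics_array: ARRAY<STRUCT<
      metric_name: STRING,
      metric_value: STRING
    >>
  >>
)
STORED AS PARQUET
LOCATION '/tmp/test/';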

Related

How to represent dynamic keys in a struct type?

I have a PostgreSQL table that has a JSONB field. The table can be created by:
create table mytable
(
  id   uuid primary key default gen_random_uuid(),
  data jsonb not null
);

insert into mytable (data)
values ('{
  "user_roles": {
    "0x101": [
      "admin"
    ],
    "0x102": [
      "employee",
      "customer"
    ]
  }
}'::jsonb);
In the above example, I am using "0x101" and "0x102" to represent two UIDs. In reality, there are more UIDs.
I am using jackc/pgx to read that JSONB field.
Here is my code
import (
    "context"
    "fmt"

    "github.com/jackc/pgx/v4/pgxpool"
)

type Data struct {
    UserRoles struct {
        UID []string `json:"uid,omitempty"`
        // ^ The above does not work because there is no fixed field called "uid".
        //   Instead, the keys are "0x101", "0x102", ...
    } `json:"user_roles,omitempty"`
}

type MyTable struct {
    ID   string
    Data Data
}

pg, err := pgxpool.Connect(context.Background(), databaseURL)
sql := "SELECT data FROM mytable"
myTable := new(MyTable)
err = pg.QueryRow(context.Background(), sql).Scan(&myTable.Data)
fmt.Printf("%v", myTable.Data)
As the comment inside mentions, the above code does not work.
How can I represent dynamic keys in a struct type, or how can I return all of the JSONB field's data? Thanks!
Edit your Data struct as follows:
type Data struct {
    UserRoles map[string][]string `json:"user_roles,omitempty"`
}
You can also use a UUID type as the map's key type if you are using a package like https://github.com/google/uuid for UUIDs.
However, please note that this way, if the JSON object user_roles has more than one entry for a particular user (with the same UUID), only one of them will be fetched.
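For completeness, a minimal sketch of reading the column with the map-based struct (the connection string and error handling here are illustrative, not taken from the question):

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/jackc/pgx/v4/pgxpool"
)

// Data maps each dynamic UID key ("0x101", "0x102", ...) to its list of roles.
type Data struct {
    UserRoles map[string][]string `json:"user_roles,omitempty"`
}

func main() {
    // Placeholder connection string.
    pg, err := pgxpool.Connect(context.Background(), "postgres://user:pass@localhost:5432/mydb")
    if err != nil {
        log.Fatal(err)
    }
    defer pg.Close()

    var data Data
    // pgx decodes the jsonb column into the struct (falling back to encoding/json).
    if err := pg.QueryRow(context.Background(), "SELECT data FROM mytable").Scan(&data); err != nil {
        log.Fatal(err)
    }

    // The dynamic keys are now just map keys.
    for uid, roles := range data.UserRoles {
        fmt.Println(uid, roles)
    }
}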

How to create a column of type RECORD of INTEGER in my Terraform file for BigQuery

I am trying to write the Terraform schema for my BigQuery table, and I need a column of type RECORD that will be populated with INTEGER values.
The field in question is a bracketed list of integers; there can be one or multiple, separated by commas, e.g. [1].
I tried writing it like this:
resource "google_bigquery_table" "categories" {
project = "abcd-data-ods-${terraform.workspace}"
dataset_id = google_bigquery_dataset.bq_dataset_op.dataset_id
table_id = "categories"
schema = <<EOF
[
{"type":"STRING","name":"a","mode":"NULLABLE"},
{"type":"RECORD[INTEGER]","name":"b","mode":"NULLABLE"}
]
EOF
}
and like this:
resource "google_bigquery_table" "categories" {
project = "abcd-data-ods-${terraform.workspace}"
dataset_id = google_bigquery_dataset.bq_dataset_op.dataset_id
table_id = "categories"
schema = <<EOF
[
{"type":"STRING","name":"a","mode":"NULLABLE"},
{"type":"RECORD","name":"b","mode":"NULLABLE"}
]
EOF
}
But it didn't work, as I keep getting an error in my GitLab CI/CD pipeline.
The error for the first attempt:
Error: googleapi: Error 400: Invalid value for type: RECORD[INTEGER] is not a valid value, invalid
The error for the second attempt:
Error: googleapi: Error 400: Field b is type RECORD but has no schema, invalid
I presume that the second implementation is the closest to the solution, given the error, but it is still missing something.
Does anyone have an idea about the right way to declare it?
Just as stated in the second error:
Error: googleapi: Error 400: Field b is type RECORD but has no schema, invalid
You must provide a schema for RECORD types (you can read more in the docs). For instance, a valid example could be:
resource "google_bigquery_table" "categories" {
project = "abcd-data-ods-${terraform.workspace}"
dataset_id = google_bigquery_dataset.bq_dataset_op.dataset_id
table_id = "categories"
schema = <<EOF
[
{
"type":"STRING",
"name":"a",
"mode":"NULLABLE"
},
{
"type":"RECORD",
"name":"b",
"mode":"NULLABLE",
"fields": [{
"name": "c",
"type": "INTEGER",
"mode": "NULLABLE"
}]
}
]
EOF
}
Hope this helps.
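As a side note (an assumption based on the question's description of the data, e.g. [1] with one or more integers): if column b is simply a list of integers rather than an array of objects, a REPEATED INTEGER field might be all that is needed. A sketch:

resource "google_bigquery_table" "categories" {
  project    = "abcd-data-ods-${terraform.workspace}"
  dataset_id = google_bigquery_dataset.bq_dataset_op.dataset_id
  table_id   = "categories"

  schema = <<EOF
[
  {"type":"STRING","name":"a","mode":"NULLABLE"},
  {"type":"INTEGER","name":"b","mode":"REPEATED"}
]
EOF
}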

BigQuery: Convert a record of repeated fields into a repeated record

I have a BigQuery table represented by this JSON (a record of repeated fields):
{
  "createdBy": [
    "foo",
    "foo"
  ],
  "fileName": [
    "bar1",
    "bar2"
  ]
}
that I need to convert into a repeated record:
[
  {
    "createdBy": "foo",
    "fileName": "bar1"
  },
  {
    "createdBy": "foo",
    "fileName": "bar2"
  }
]
To make this conversion, you take index 0 of every array to create the first object, index 1 for the second object, and so on.
I performed this kind of transformation using a UDF, but the problem is that, due to BigQuery limits, I'm unable to save a VIEW that performs this transformation:
No support for CREATE TEMPORARY FUNCTION statements inside views
Below is the full statement to generate a sample table, along with the function:
CREATE TEMP FUNCTION filesObjectArrayToArrayObject(filesJson STRING)
RETURNS ARRAY<STRUCT<createdBy STRING, fileName STRING>>
LANGUAGE js AS """
  function filesObjectArrayToArrayObject_execute(files) {
    var createdBy = files["createdBy"];
    var fileName = files["fileName"];
    var output = [];
    for (var i = 0; i < createdBy.length; i++) {
      output.push({ "createdBy": createdBy[i], "fileName": fileName[i] });
    }
    return output;
  }
  return filesObjectArrayToArrayObject_execute(JSON.parse(filesJson));
""";

WITH sample_table AS (
  SELECT STRUCT<createdBy ARRAY<STRING>, fileName ARRAY<STRING>>(
    ["foo", "foo"],
    ["bar1", "bar2"]
  ) AS files
)
SELECT
  files AS filesOriginal,
  filesObjectArrayToArrayObject(TO_JSON_STRING(files)) AS filesConverted
FROM sample_table
Is there a way to perform the same kind of task using native BigQuery statements?
Please note that:
The real data has more than 2 keys, but their names are fixed.
The length of the arrays is not fixed; it can be 0, 1, 10, 20, ...
Below is for BigQuery Standard SQL; it zips the arrays positionally by joining them on their offsets:
#standardSQL
WITH sample_table AS (
  SELECT STRUCT<createdBy ARRAY<STRING>, fileName ARRAY<STRING>>(
    ["foo", "foo"],
    ["bar1", "bar2"]
  ) AS files
)
SELECT
  ARRAY(
    SELECT STRUCT(createdBy, fileName)
    FROM t.files.createdBy AS createdBy WITH OFFSET
    JOIN t.files.fileName AS fileName WITH OFFSET
    USING(OFFSET)
  ) files
FROM `sample_table` t
with output:
Row   files.createdBy   files.fileName
1     foo               bar1
      foo               bar2
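Since the real data has more than two keys (with fixed names), the same pattern extends by joining each additional array on its offset. The sketch below uses createdAt as a hypothetical third array field with made-up values, purely to illustrate the extension:

#standardSQL
WITH sample_table AS (
  SELECT STRUCT<createdBy ARRAY<STRING>, fileName ARRAY<STRING>, createdAt ARRAY<STRING>>(
    ["foo", "foo"],
    ["bar1", "bar2"],
    ["2018-01-01", "2018-01-02"]
  ) AS files
)
SELECT
  ARRAY(
    SELECT STRUCT(createdBy, fileName, createdAt)
    FROM t.files.createdBy AS createdBy WITH OFFSET o1
    JOIN t.files.fileName AS fileName WITH OFFSET o2 ON o1 = o2
    JOIN t.files.createdAt AS createdAt WITH OFFSET o3 ON o2 = o3
  ) AS files
FROM sample_table t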

SQL query to search for a string without a key, in a column which has only JSON data

I need to search for a string, for example 'RecId', in a column which has only JSON data.
JSON data in the first cell:
{
  "AuditedFieldsAndRelationships": null,
  "AuditObjectChanges": false,
  "CalculatedRules": {
    "AuditHistoryDescription": {
      "Calculated": "Always",
      "CalculatedExpression": {
        "Description": null,
        "FieldRefs": ["RecId", "Rel_CIComponent_InstalledApplication_Name", "Rel_Software_Id", "Rel_Software_Name"]
      }
    }
  }
}
Database: Microsoft SQL Server 2014
I found a solution to a pretty similar problem in the linked answer, but it works with respect to a key:
SELECT * FROM #table CROSS APPLY OPENJSON(Col,'$.Key') WHERE value ='SearchedString'
but it shows the error Invalid object name 'OPENJSON'.
For that error, I tried the solution below, given in the linked answer:
SELECT compatibility_level FROM sys.databases WHERE name = 'DataBaseName';
But I am getting the below error:
Could someone help me out here?
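One note that may help: OPENJSON requires SQL Server 2016 or later and a database compatibility level of at least 130, so it is not available on SQL Server 2014. A rough fallback sketch (table and column names below are placeholders) is a plain substring search over the JSON text:

-- Searches the raw JSON text for the string, without parsing the JSON.
SELECT *
FROM dbo.MyTable                      -- placeholder table name
WHERE JsonColumn LIKE '%"RecId"%';    -- placeholder column name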

Dataflow job: Failed to copy Column partitioned table to Column partitioned meta table: not supported

I have an Apache Beam project that uses the Google Dataflow runner to process a fairly large amount of data stored in BigQuery. The flow reads one main table and uses three different side streams. For every row in the input data set, we calculate a 'label', which generates 5 different output streams. The main BigQuery table we read is 60 GB, and the 3 side streams are 2 GB, 51 GB and 110 GB respectively. These are all converted to a PCollectionView<Map<String, Iterable<TableRow>>>.
Eventually, these 5 streams are combined and written back to BigQuery.
When I run this job on a subset of the data (1 million rows), the job works as expected, but when I run it on the full data set (177 million rows), the job returns the following error: Failed to copy Column partitioned table to Column partitioned meta table: not supported
What does this error mean? And how can I fix this? Thanks!
Full stack trace:
java.lang.RuntimeException: Failed to create copy job with id prefix beam_load_poisrschellenberger0810134033c63e44ed_e7cf725c5321409b96a4f20e7ec234bc_3d9288a5ff3a24b9eb8b1ec9c621e7dc_00000, reached max retries: 3, last failed copy job: {
"configuration" : {
"copy" : {
"createDisposition" : "CREATE_IF_NEEDED",
"destinationTable" : {
"datasetId" : "KPI",
"projectId" : "bolcom-stg-kpi-logistics-f6c",
"tableId" : "some_table_v1$20180811"
},
"sourceTables" : [ {
"datasetId" : "KPI",
"projectId" : "bolcom-stg-kpi-logistics-f6c",
"tableId" : "beam_load_poisrschellenberger0810134033c63e44ed_e7cf725c5321409b96a4f20e7ec234bc_3d9288a5ff3a24b9eb8b1ec9c621e7dc_00002_00000"
}, {
"datasetId" : "KPI",
"projectId" : "bolcom-stg-kpi-logistics-f6c",
"tableId" : "beam_load_poisrschellenberger0810134033c63e44ed_e7cf725c5321409b96a4f20e7ec234bc_3d9288a5ff3a24b9eb8b1ec9c621e7dc_00001_00000"
}, {
"datasetId" : "KPI",
"projectId" : "bolcom-stg-kpi-logistics-f6c",
"tableId" : "beam_load_poisrschellenberger0810134033c63e44ed_e7cf725c5321409b96a4f20e7ec234bc_3d9288a5ff3a24b9eb8b1ec9c621e7dc_00004_00000"
}, {
"datasetId" : "KPI",
"projectId" : "bolcom-stg-kpi-logistics-f6c",
"tableId" : "beam_load_poisrschellenberger0810134033c63e44ed_e7cf725c5321409b96a4f20e7ec234bc_3d9288a5ff3a24b9eb8b1ec9c621e7dc_00003_00000"
} ],
"writeDisposition" : "WRITE_APPEND"
}
},
"etag" : "\"HbYIGVDrlNbv2nDGLHCFlwJG0rI/oNgxlMGidSDy59VClvLIlEu08aU\"",
"id" : "bolcom-stg-kpi-logistics-f6c:EU.beam_load_poisrschellenberger0810134033c63e44ed_e7cf725c5321409b96a4f20e7ec234bc_3d9288a5ff3a24b9eb8b1ec9c621e7dc_00000-2",
"jobReference" : {
"jobId" : "beam_load_poisrschellenberger0810134033c63e44ed_e7cf725c5321409b96a4f20e7ec234bc_3d9288a5ff3a24b9eb8b1ec9c621e7dc_00000-2",
"location" : "EU",
"projectId" : "bolcom-stg-kpi-logistics-f6c"
},
"kind" : "bigquery#job",
"selfLink" : "https://www.googleapis.com/bigquery/v2/projects/bolcom-stg-kpi-logistics-f6c/jobs/beam_load_poisrschellenberger0810134033c63e44ed_e7cf725c5321409b96a4f20e7ec234bc_3d9288a5ff3a24b9eb8b1ec9c621e7dc_00000-2?location=EU",
"statistics" : {
"creationTime" : "1533957446953",
"endTime" : "1533957447111",
"startTime" : "1533957447111"
},
"status" : {
"errorResult" : {
"message" : "Failed to copy Column partitioned table to Column partitioned meta table: not supported.",
"reason" : "invalid"
},
"errors" : [ {
"message" : "Failed to copy Column partitioned table to Column partitioned meta table: not supported.",
"reason" : "invalid"
} ],
"state" : "DONE"
},
"user_email" : "595758839781-compute#developer.gserviceaccount.com"
}.
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.copy(WriteRename.java:166)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.writeRename(WriteRename.java:107)
at org.apache.beam.sdk.io.gcp.bigquery.WriteRename.processElement(WriteRename.java:80)
The table to write to is created as follows:
private static void write(final PCollection<TableRow> data) {
    // Write to BigQuery.
    data.apply(BigQueryIO.writeTableRows()
        .to(new GetPartitionFromTableRowFn("table_name"))
        .withSchema(getOutputSchema())
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
}

private static TableSchema getOutputSchema() {
    final List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName(ORDER_LINE_REFERENCE).setType("INTEGER"));
    fields.add(new TableFieldSchema().setName(COLUMN_LABEL).setType("STRING"));
    fields.add(new TableFieldSchema().setName(COLUMN_INSERTION_DATETIME).setType("TIMESTAMP"));
    fields.add(new TableFieldSchema().setName(COLUMN_PARTITION_DATE).setType("DATE"));
    return new TableSchema().setFields(fields);
}
With the following SerializableFunction:
public class GetPartitionFromTableRowFn implements SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination> {
    private final String tableDestination;

    public GetPartitionFromTableRowFn(final String tableDestination) {
        this.tableDestination = tableDestination;
    }

    public TableDestination apply(final ValueInSingleWindow<TableRow> element) {
        final TableDestination tableDestination;
        if (null != element.getValue()) {
            final TimePartitioning timePartitioning = new TimePartitioning().setType("DAY");
            timePartitioning.setField(Constants.COLUMN_PARTITION_DATE);
            final String formattedDate = element.getValue().get(Constants.COLUMN_PARTITION_DATE).toString().replaceAll("-", "");
            // e.g. output$20180801
            final String tableName = String.format("%s$%s", this.tableDestination, formattedDate);
            tableDestination = new TableDestination(tableName, null, timePartitioning);
        } else {
            tableDestination = new TableDestination(this.tableDestination, null);
        }
        return tableDestination;
    }
}
1) You are trying to write to a column-partitioned table using a partition decorator in the table suffix (some_table_v1$20180811); this is not possible. That syntax works only on ingestion-time partitioned tables (a related sketch follows the note below).
As your table is already partitioned by a column, according to the error message, this operation is not supported. You would need to run UPDATE or MERGE statements to update a column-based partition, and one job is limited to changing at most 1,000 partitions. Alternatively, drop the column-based partitioning and use only ingestion-time partitioned tables.
Note that BigQuery supports two kinds of partitioning:
ingestion-time based
column based
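As a hedged illustration of point 1 (an assumption, not verified against this pipeline): when the destination table is column partitioned, the $YYYYMMDD decorator can be dropped and each row is routed by the value of the partitioning column, e.g.:

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TimePartitioning;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.ValueInSingleWindow;

// Sketch: always target the base column-partitioned table; BigQuery assigns the
// partition from the partitioning column, so no $YYYYMMDD suffix is needed.
public class GetColumnPartitionedDestinationFn
        implements SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination> {

    private final String tableDestination;

    public GetColumnPartitionedDestinationFn(final String tableDestination) {
        this.tableDestination = tableDestination;
    }

    @Override
    public TableDestination apply(final ValueInSingleWindow<TableRow> element) {
        // "partition_date" stands in for Constants.COLUMN_PARTITION_DATE from the question.
        final TimePartitioning timePartitioning = new TimePartitioning().setType("DAY").setField("partition_date");
        return new TableDestination(this.tableDestination, null, timePartitioning);
    }
}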
2) If this is not the case, then you need to check your source tables again:
When you copy multiple partitioned tables, note the following:
If you copy multiple source tables into a partitioned table in the same job, the source tables can't contain a mixture of partitioned and non-partitioned tables.
If all of the source tables are partitioned tables, the partition specifications for all source tables must match the destination table's partition specification. Your settings determine whether the destination table is appended or overwritten.
The source and destination tables must be in datasets in the same location.
P.S. For further details, please post your table definitions.
3) Look at this solution: BigQuery partitioning with Beam streams