ERROR: column "blob" is of type jsonb but expression is of type character - apache-spark-sql

Trying to read a parquet file and dump it into Postgres. One of the columns in the Postgres table is of JSONB type, while in parquet it is stored as a string.
val parquetDF = session.read.parquet("s3a://test/ovd").selectExpr("id", "topic", "update_id", "blob")
parquetDF.write.format("jdbc")
.option("driver", "org.postgresql.Driver")
.option("url", "jdbc:postgresql://localhost:5432/db_metamorphosis?binaryTransfer=true&stringtype=unspecified")
.option("dbtable", "entitlements.general")
.option("user", "mdev")
.option("password", "")
.option("stringtype", "unspecified")
.mode(SaveMode.Append)
.save()
And it fails with this error:
Caused by: org.postgresql.util.PSQLException: ERROR: column "blob" is of type jsonb but expression is of type character
Hint: You will need to rewrite or cast the expression.
Position: 85
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2270)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1998)
... 16 more
Someone on SO suggested setting stringtype=unspecified so that Postgres decides the datatype for strings itself, but it does not seem to work.
<scala.major.version>2.12</scala.major.version>
<scala.version>${scala.major.version}.8</scala.version>
<spark.version>2.4.0</spark.version>
<postgres.version>9.4-1200-jdbc41</postgres.version>
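For reference, the cast the hint asks for can be demonstrated directly in psql. A minimal sketch against a throwaway table (not the table from the question):

-- Throwaway table for illustration
CREATE TABLE jsonb_demo (blob jsonb);

-- A parameter with an explicit character type fails, mirroring the Spark error:
INSERT INTO jsonb_demo (blob) VALUES ('{"k": 1}'::varchar);
-- ERROR: column "blob" is of type jsonb but expression is of type character varying

-- An untyped literal, or an explicit jsonb cast, succeeds; this is what
-- stringtype=unspecified is supposed to achieve for JDBC string parameters:
INSERT INTO jsonb_demo (blob) VALUES ('{"k": 1}');
INSERT INTO jsonb_demo (blob) VALUES ('{"k": 1}'::varchar::jsonb);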

Related

Cannot define a BigQuery column as ARRAY<STRUCT<INT64, INT64>>

I am trying to define a table that has a column that is an arrays of structs using standard sql. The docs here suggest this should work:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<INT64,INT64>>
)
but I get an error:
$ bq query --use_legacy_sql=false --location=asia-east2 "$(cat xxxx.ddl.temp.sql | awk 'ORS=" "')"
Waiting on bqjob_r6735048b_00000173ed2d9645_1 ... (0s) Current status: DONE
Error in query string: Error processing job 'xxxxx-10843454-yyyyy-
dev:bqjob_r6735048b_00000173ed2d9645_1': Illegal field name:
Changing the field (edit: column!) name does not fix it. What am I doing wrong?
The fields within the struct need to be named, so this works:
CREATE OR REPLACE TABLE ta_producer_conformed.FundStaticData
(
id STRING,
something ARRAY<STRUCT<x INT64,y INT64>>
)
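For completeness, a hypothetical insert into that table, using the field names x and y from the corrected DDL (the values are made up):

INSERT INTO ta_producer_conformed.FundStaticData (id, something)
VALUES ('fund-1', [STRUCT(1 AS x, 2 AS y)]);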

Postgres | V9.4 | Extract value from json

I have a table in which one of the columns is of type TEXT and holds a JSON object.
I want to reach a key inside that JSON and check its value.
The column name is json_representation and the JSON looks like this:
{
"additionalInfo": {
"dbSources": [{
"user": "Mike"
}]
}
}
I want to get the value of "user" and check whether it is equal to "Mike".
I tried the following:
select
json_representation->'additionalInfo'->'dbSources'->>'user' as singleUser
from users
where singleUser = 'Mike';
I keep getting an error:
Query execution failed
Reason:
SQL Error [42883]: ERROR: operator does not exist: text -> unknown
Hint: No operator matches the given name and argument type(s). You might need to add explicit type casts.
Position: 31
Please advise.
Thanks
The error message tells you what to do: add an explicit type cast.
And as you cannot reference a column alias in the WHERE clause, you need to wrap the query in a derived table:
select *
from (
select json_representation::json ->'additionalInfo'->'dbSources' -> 0 ->>'user' as single_user
from users
) as t
where t.single_user = 'Mike';
:: is Postgres' cast operator
But the better solution would be to change the column's data type to json permanently. And once you upgrade to a supported version of Postgres, you should use jsonb.
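Once the column is jsonb (or the text is cast to it), a sketch of the same check using the containment operator @>, which avoids the derived table:

select *
from users
where json_representation::jsonb -> 'additionalInfo' -> 'dbSources' @> '[{"user": "Mike"}]';

@> tests whether the jsonb value on the left contains the one on the right, so it matches regardless of the element's position in the dbSources array.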

Query a hive table with array<array<string>> type

I have a Hive table and need to apply a filter where the value of the column = []. The type of the column is array<array<string>>. I tried array_contains but it gave the following error:
Error while compiling statement: FAILED: SemanticException [Error
10016]: line 2:41 Argument type mismatch ''[]'': "array"
expected at function ARRAY_CONTAINS, but "string" is found
The sample values of the column could be
[]
[['a','b', 'c']]
[['a'],['b'], ['c']]
[]
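One way to match the empty-array rows without array_contains is Hive's size() function, which returns 0 for an empty array (and -1 for NULL). A sketch with assumed table and column names:

select *
from my_table
where size(my_col) = 0;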

How do I add columns to a table created using serde when creating a Hive table?

table desc info
hive> desc log23;
OK
col_name data_type comment
17/05/25 10:49:12 INFO mapred.FileInputFormat: Total input files to process : 1
host string from deserializer
remote_host string from deserializer
remote_logname string from deserializer
remote_user string from deserializer
request_time string from deserializer
request_method string from deserializer
request_url string from deserializer
first_line string from deserializer
http_status string from deserializer
bytes string from deserializer
referer string from deserializer
agent string from deserializer
Time taken: 0.049 seconds, Fetched: 12 row(s)
Apache log format SerDe:
serializationLib:org.apache.hadoop.hive.contrib.serde2.RegexSerDe, parameters:{output.format.string=%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s, serialization.format=1, input.regex=([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) "(.[A-Z]*) (.*) (.*)" (-|[0-9]*) (-|[0-9]*) "(.*)" "(.*)"})
Adding a column using an ALTER query:
hive> alter table log23 add columns (code string);
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask.
Error: type expected at the position 0 of derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:derived from deserializer:<derived from deserializer:derived from deserializer:string but>'<' is found.
I get an error like the one above. How do I add a column?
Unfortunately you cannot add columns if you've used serde. It is a known issue:
https://issues.apache.org/jira/browse/HIVE-17713
ADD COLUMNS lets you add new columns to the end of the existing columns but before the partition columns. This is supported for Avro backed tables as well, for Hive 0.14 and later.
REPLACE COLUMNS removes all existing columns and adds the new set of columns. This can be done only for tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe and ColumnarSerDe). Refer to Hive SerDe for more information. REPLACE COLUMNS can also be used to drop columns. For example, "ALTER TABLE test_change REPLACE COLUMNS (a int, b int);" will remove column 'c' from test_change's schema.
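Until that issue is resolved, a common workaround is to recreate the table with the extra column and reload it. A sketch with only two of the original twelve columns shown:

-- Define a new table that includes the extra column, then copy the data over
create table log23_v2 (host string, remote_host string, code string);
insert into table log23_v2
select host, remote_host, cast(null as string) from log23;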
I tried the same, but I am able to create a table and add the columns at the end:
create table log23 (host String, remote_host String);
alter table log23 add columns(code String);
This works with the textfile format. Please let me know if you are using a different file format so that I can try to replicate the issue.

Why does concatenation in Hive return an error?

I use the concat function in Hive (Amazon Elastic MapReduce) to specify the path to the S3 bucket (dval is a date value, which will be changed automatically):
add jar s3://mySerdeBucket/hive-json-serde.jar;
set dval='03';
set pthstring=concat('s3://mybucket/',${hiveconf:dval},'/');
CREATE EXTERNAL TABLE table1 (uuid string, tm string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES ('rename_columns'='device_uuid>uuid,at>tmstmp')
LOCATION ${hiveconf:pthstring};
The server returns the following error:
FAILED: ParseException line 4:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
As far as I understand, Hive reads the pthstring variable as a string containing the function call, not as the result of the concat function. How can I fix this?
Note that concat() is a built-in Hive string function that can be applied to column values in a query, not to Hive variables.
Use it like this:
hive (default)> set dval=03;
hive (default)> set pthstring=s3://mybucket/${hiveconf:dval}/;
hive (default)> set dval; ---------> gives : dval=03
hive (default)> set pthstring; ----> gives : pthstring=s3://mybucket/03/
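Putting it together: substitution happens before parsing, so the variable can be dropped into a quoted string literal for LOCATION. A sketch based on the original DDL:

set dval=03;
set pthstring=s3://mybucket/${hiveconf:dval}/;
CREATE EXTERNAL TABLE table1 (uuid string, tm string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
WITH SERDEPROPERTIES ('rename_columns'='device_uuid>uuid,at>tmstmp')
LOCATION '${hiveconf:pthstring}';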