Spectrum Scan Error while reading from external table (S3 to RS)

I created an external table in Redshift from JSON files stored in S3 buckets.
All the columns are defined as varchar (the source data contains both numbers and strings, but I import everything as varchar to avoid errors).
After creating the table, I tried to query it and got this error:
SQL Error [XX000]: ERROR: Spectrum Scan Error
Detail:
-----------------------------------------------
error: Spectrum Scan Error
code: 15001
context: Error while reading Ion/JSON int value: Numeric overflow.
What am I doing wrong? Why do I get a 'numeric overflow' error if I defined the columns as varchar?
I'm using the following command to create the table:
CREATE EXTERNAL TABLE spectrum_schema.example_table(
column_1 varchar,
column_2 varchar,
column_3 varchar,
column_4 varchar
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://************/files/'
;
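A likely cause, judging from the error text: Spectrum parses each JSON scalar before applying the column type, so an unquoted number wider than a 64-bit integer overflows during parsing, even though the target column is varchar. A minimal sketch of the failure mode (the record value is hypothetical; quoting the number in the source JSON is one way around it):
-- Even a plain scan of the varchar column forces Spectrum to parse the raw JSON,
-- so a source record such as {"column_1": 99999999999999999999} (hypothetical)
-- can fail with "Error while reading Ion/JSON int value: Numeric overflow"
-- before any varchar conversion happens. Rewriting the source as
-- {"column_1": "99999999999999999999"} avoids the numeric parse entirely.
SELECT column_1
FROM spectrum_schema.example_table
LIMIT 10;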

Related

Impala insert vs hive insert

When I tried to insert integer values into a column of a Parquet table with a Hive command, the values were not inserted and show up as NULL. But when I used an Impala command it worked. However, the partition size is smaller with the Impala insert, and the number of rows in the partitions (SHOW PARTITIONS) shows as -1. What is the reason for this?
CREATE TABLE `TEST.LOGS`(
`recordtype` string,
`recordstatus` string,
`recordnumber` string,
`starttime` string,
`endtime` string,
`acctsessionid` string,
`subscriberid` string,
`framedip` string,
`servicename` string,
`totalbytes` int,
`rxbytes` int,
`txbytes` int,
`time` int,
`plan` string,
`tcpudp` string,
`intport` string)
PARTITIONED BY (`ymd` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://dev-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
TBLPROPERTIES (
'transient_lastDdlTime'='1634390569')
Insert Statement
Hive
sudo -u hdfs hive -e 'insert into table TEST.LOGS partition (ymd="20220221") select * from TEMP.LOGS;'
Impala
impala-shell --ssl -i xxxxxxxxxxx:21000 -q 'insert into table TEST.LOGS partition (ymd="20220221") select * from TEMP.LOGS;'
When I tried to insert integer values into a column of a Parquet table with a Hive command, the values were not inserted and show up as NULL.
Could you please share your exact insert statement and table definition for a precise answer? If I have to guess, this may be because of a difference in implicit data type conversion between Hive and Impala.
HIVE - If you set hive.metastore.disallow.incompatible.col.type.changes to false, the types of columns in the Metastore can be changed from any type to any other type. After such a type change, if the data can be represented correctly with the new type, it will be displayed; otherwise it will be displayed as NULL. As per the documentation, forward conversion works (int -> bigint) whereas backward conversion (bigint -> smallint) doesn't, and produces NULL.
Impala - it supports a limited set of implicit casts to avoid undesired results from unexpected casting behavior. Impala does perform implicit casts among the numeric types when going from a smaller or less precise type to a larger or more precise one. For example, Impala will implicitly convert a SMALLINT to a BIGINT.
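If that guess is right, a hedged workaround (assuming TEMP.LOGS has the same column order as TEST.LOGS, possibly with wider integer types) is to make the narrowing explicit in the Hive insert:
-- Sketch: cast the integer columns explicitly rather than relying on
-- implicit conversion; values outside the INT range still cannot be
-- represented, so check the result for NULLs afterwards.
INSERT INTO TABLE TEST.LOGS PARTITION (ymd='20220221')
SELECT recordtype, recordstatus, recordnumber, starttime, endtime,
       acctsessionid, subscriberid, framedip, servicename,
       CAST(totalbytes AS INT), CAST(rxbytes AS INT), CAST(txbytes AS INT),
       CAST(`time` AS INT), `plan`, tcpudp, intport
FROM TEMP.LOGS;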
As for the number of rows in the partitions (SHOW PARTITIONS) showing as -1:
Run COMPUTE STATS table_name to fix this issue.
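For reference, in impala-shell (table name taken from the question):
-- Recompute table and partition statistics
COMPUTE STATS TEST.LOGS;
-- SHOW PARTITIONS should now report real row counts instead of -1
SHOW PARTITIONS TEST.LOGS;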

Cloudera - Hive/Impala Show Create Table - Error with the syntax

I'm building some automated processes to create tables on Cloudera Hive.
For that I am using the SHOW CREATE TABLE statement, which gives me (for example) the following DDL:
CREATE TABLE clsd_core.factual_player ( player_name STRING, number_goals INT ) PARTITIONED BY ( player_name STRING ) WITH SERDEPROPERTIES ('serialization.format'='1') STORED AS PARQUET LOCATION 'hdfs://nameservice1/factual_player'
What I need is to run that DDL somewhere else to create a table with the same name.
However, when I run that code it returns the following error:
Error while compiling statement: FAILED: ParseException line 1:123 missing EOF at 'WITH' near ')'
When I manually removed the part "WITH SERDEPROPERTIES ('serialization.format'='1')", the table was created successfully.
Is there a better way to retrieve a table's DDL without the SERDE information?
The first issue in your DDL is that the partition column should not be listed in the column spec, only in PARTITIONED BY. A partition is a folder named partition_column=value, and this column is not stored in the table files, only in the partition directory name. If you want the partition column to also be present in the data files, it should be named differently.
The second issue is that SERDEPROPERTIES is part of the SERDE specification; if you do not specify a SERDE, there should be no SERDEPROPERTIES. See this manual: Storage Formats and SerDe
Fixed DDL:
CREATE TABLE factual_player (number_goals INT)
PARTITIONED BY (player_name STRING)
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/factual_player';
STORED AS PARQUET already implies the SERDE, INPUTFORMAT and OUTPUTFORMAT.
If you want to specify a SERDE with its properties, use this syntax:
CREATE TABLE factual_player(number_goals int)
PARTITIONED BY (player_name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format'='1') --I believe you really do not need this
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://nameservice1/factual_player'

Amazon Athena returning "mismatched input 'partitioned' expecting {, 'with'}" error when creating partitions

I'd like to use this query to create a partitioned table in Amazon Athena:
CREATE TABLE IF NOT EXISTS
testing.partitioned_test(order_id bigint, name string, car string, country string)
PARTITIONED BY (year int)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS 'PARQUET'
LOCATION 's3://testing-imcm-into/partitions'
Unfortunately I get the following error message:
line 3:2: mismatched input 'partitioned' expecting {, 'with'}
The quotes around 'PARQUET' seemed to be causing a problem (and Athena expects EXTERNAL for plain CREATE TABLE statements).
Try this:
CREATE EXTERNAL TABLE IF NOT EXISTS
partitioned_test (order_id bigint, name string, car string, country string)
PARTITIONED BY (year int)
STORED AS PARQUET
LOCATION 's3://testing-imcm-into/partitions/'
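One caveat worth adding: a freshly created partitioned Athena table has no partitions registered, so queries return nothing until they are added. Assuming the S3 prefixes are Hive-style (year=2021/ and so on), either of these should register them:
-- Discover and register all Hive-style partitions under the table location
MSCK REPAIR TABLE partitioned_test;

-- Or add a single partition explicitly if the prefixes are not Hive-style
-- (the 2021/ prefix here is a hypothetical example)
ALTER TABLE partitioned_test ADD PARTITION (year = 2021)
LOCATION 's3://testing-imcm-into/partitions/2021/';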

Ingesting decimals into a Hive table with the Avro SerDe

I am trying to check whether I can change the precision and scale of a decimal field in Hive with the Avro SerDe, so I have written the code below.
create database test_avro;
use test_avro_table;
create external table test_table(
name string,
salary decimal(17,2),
country string
)
row format delimited
fields terminated by ","
STORED AS textfile;
LOAD DATA LOCAL INPATH '/home/appsdesdssu/data/CACS_POC/data/' INTO TABLE
test_table;
create external table test_table_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
tblproperties ('avro.schema.literal'='{
"name": "my_record",
"type": "record",
"fields": [
{"name":"name", "type":"string"},
{"name":"salary","type": "bytes","logicalType": "decimal","precision":
17,"scale": 2},
{"name":"country", "type":"string"}
]}');
insert overwrite table test_table_avro select * from test_table;
Here I am getting an error saying:
FAILED: UDFArgumentException Only string, char, varchar or binary data can be cast into binary data types.
Data file:
steve,976475632987465.257,USA
rogers,349643905318384.137,mexico
groot,534563663653653.896,titan
If I am missing anything here, please let me know.
Hive does not support casting a decimal directly to binary, so we have to work around it by first converting the decimal to string and then to binary. So the line
insert overwrite table test_table_avro select * from test_table;
needs to change to
insert overwrite table test_table_avro select name,cast(cast(salary as string) as binary),country from test_table;
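To sanity-check the load, a sketch of the reverse conversion (untested, but both casts are supported by Hive):
-- Read the bytes column back through string into a decimal to verify the values
SELECT name,
       CAST(CAST(salary AS string) AS decimal(17,2)) AS salary,
       country
FROM test_table_avro;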

Error while creating table in Hive

I am new to Hadoop. I need help regarding an error encountered in Hive while creating a new table. I have gone through this: Hive FAILED: ParseException line 2:0 cannot recognize input near ''macaddress'' 'CHAR' '(' in column specification
My question: is it necessary to write the location of the table in the script? I ask because I write the table location at the start, and I am afraid that specifying it could disturb the rest of my databases through some malfunctioning operation.
Here is my query:
CREATE TABLE meta_statistics.tank_items (
shop_offers_history_before bigint,
shop_offers_temp bigint,
videos_distinct_temp bigint,
deleted_temp bigint,
t_stamp timestamp )
CLUSTERED BY (
tank_items_id)
INTO 8 BUCKETS
ROW FORMAT SERDE
TBLPROPERTIES (transactional=true)
STORED AS ORC;
The error I am getting is:
ParseException line 1:3 cannot recognize input near 'TBLPROPERTIES'
'(' 'transactional'
What other errors are possible here, and how can I remove them?
There is a syntax error in your create query. The error you have shared says that Hive cannot recognize the input near 'TBLPROPERTIES'.
Solution:
As per Hive syntax, the key and value passed in TBLPROPERTIES should be in double quotes. It should be like this: TBLPROPERTIES ("transactional"="true")
So if I correct your query it will be:
CREATE TABLE meta_statistics.tank_items (
shop_offers_history_before bigint,
shop_offers_temp bigint,
videos_distinct_temp bigint,
deleted_temp bigint,
t_stamp timestamp
) CLUSTERED BY (tank_items_id) INTO 8 BUCKETS
ROW FORMAT SERDE TBLPROPERTIES ("transactional"="true") STORED AS ORC;
Execute the above query; if you then get any other syntax error, make sure that the order of STORED AS, CLUSTERED BY and TBLPROPERTIES is as per the Hive syntax.
Refer this for more details:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable
1) ROW FORMAT SERDE -> you should pass a SerDe class with it
2) the TBLPROPERTIES key and value should be in double quotes
3) if you use CLUSTERED BY, the value should be one of the columns given
Replace it as follows:
CREATE TABLE meta_statistics.tank_items (
shop_offers_history_before bigint,
shop_offers_temp bigint,
videos_distinct_temp bigint,
deleted_temp bigint,
t_stamp timestamp
)
CLUSTERED BY (shop_offers_history_before) INTO 8 BUCKETS
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS ORC
TBLPROPERTIES ("transactional"="true");
Hope this helps.