Creating Internal Table in Amazon Athena

I'm trying to create an internal table in Athena, on data in S3 in parquet format:
CREATE TABLE IF NOT EXISTS `vdp_dev.owners_daily`(
`owner_id` string COMMENT 'from deserializer',
`username` string COMMENT 'from deserializer',
`billing_with` string COMMENT 'from deserializer',
`billing_contacts` string COMMENT 'from deserializer',
`error_code` string COMMENT 'from deserializer')
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://xxxxx-xx-xxxx-xxxxxx/dim/daily/owners';
but getting the following error:
Only external table creation is supported. (Service: AmazonAthena;
Status Code: 400; Error Code: InvalidRequestException; Request ID:
13c5325b-2217-4989-b5f3-e717462329c1)
Does anyone know why this happens?
Why can't I create an internal table in Athena?

From the Athena documentation:
All Tables Are EXTERNAL
If you use CREATE TABLE without the EXTERNAL keyword, Athena issues an error; only tables with the EXTERNAL keyword can be created. We recommend that you always use the EXTERNAL keyword. When you drop a table in Athena, only the table metadata is removed; the data remains in Amazon S3.
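Given that constraint, the original statement only needs the EXTERNAL keyword added; everything else can stay as it was. A sketch, with the column list abbreviated for brevity:

```sql
-- Same statement with the EXTERNAL keyword Athena requires;
-- remaining columns omitted here for brevity.
CREATE EXTERNAL TABLE IF NOT EXISTS `vdp_dev.owners_daily`(
  `owner_id` string COMMENT 'from deserializer',
  `username` string COMMENT 'from deserializer')
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://xxxxx-xx-xxxx-xxxxxx/dim/daily/owners';
```

Dropping the table afterwards removes only the metadata; the Parquet files stay in S3.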

Related

Spectrum Scan Error while reading from external table (S3 to RS)

I created an external table in Redshift from JSON files which are stored in S3 buckets.
All the columns are defined as varchar (the source data contains both numbers and strings, but I import everything as varchar to avoid errors).
After creating the table and trying to query it, I got this error:
SQL Error [XX000]: ERROR: Spectrum Scan Error
Detail:
-----------------------------------------------
error: Spectrum Scan Error
code: 15001
context: Error while reading Ion/JSON int value: Numeric overflow.
What am I doing wrong? Why do I get a 'numeric overflow' error if I defined the columns as varchar?
I'm using the following command in order to create the table:
CREATE EXTERNAL TABLE spectrum_schema.example_table(
column_1 varchar,
column_2 varchar,
column_3 varchar,
column_4 varchar
)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION
's3://************/files/'
;

How to create a Hive external table with Parquet format

I am trying to create an external table in Hive, over data in HDFS, with the following query.
CREATE EXTERNAL TABLE `post` (
FileSK STRING,
OriginalSK STRING,
FileStatus STRING,
TransactionType STRING,
TransactionDate STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS PARQUET TBLPROPERTIES("Parquet.compression"="SNAPPY")
LOCATION 'hdfs://.../post'
getting error
Error while compiling statement: FAILED: ParseException line 11:2
missing EOF at 'LOCATION' near ')'
What is the best way to create a HIVE external table with data stored in parquet format?
I am able to create the table after removing the property TBLPROPERTIES("Parquet.compression"="SNAPPY"):
CREATE EXTERNAL TABLE `post` (
FileSK STRING,
OriginalSK STRING,
FileStatus STRING,
TransactionType STRING,
TransactionDate STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS PARQUET
LOCATION 'hdfs://.../post'
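The ParseException is most likely a clause-ordering problem rather than the property itself: in Hive DDL, LOCATION must come before TBLPROPERTIES. A sketch that keeps the Snappy setting (the ROW FORMAT DELIMITED clause is dropped here because a field delimiter has no effect on Parquet files, and the property key is conventionally lowercase):

```sql
CREATE EXTERNAL TABLE `post` (
  FileSK          STRING,
  OriginalSK      STRING,
  FileStatus      STRING,
  TransactionType STRING,
  TransactionDate STRING
)
STORED AS PARQUET
LOCATION 'hdfs://.../post'
TBLPROPERTIES ("parquet.compression"="SNAPPY");
```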

Cloudera - Hive/Impala Show Create Table - Error with the syntax

I'm making some automatic processes to create tables on Cloudera Hive.
For that I am using the SHOW CREATE TABLE statement, which gives me (for example) the following DDL:
CREATE TABLE clsd_core.factual_player ( player_name STRING, number_goals INT ) PARTITIONED BY ( player_name STRING ) WITH SERDEPROPERTIES ('serialization.format'='1') STORED AS PARQUET LOCATION 'hdfs://nameservice1/factual_player'
What I need is to run the DDL in a different place to create a table with the same name.
However, when I run that code I get the following error:
Error while compiling statement: FAILED: ParseException line 1:123 missing EOF at 'WITH' near ')'
When I manually removed the part "WITH SERDEPROPERTIES ('serialization.format'='1')", the table was created successfully.
Is there a better way to retrieve a table's DDL without the SERDE information?
The first issue in your DDL is that the partition column should not be listed in the column spec, only in the PARTITIONED BY clause. A partition is a folder named partition_column=value, and this column is not stored in the table files, only in the partition directory name. If you want the partition column to also appear in the data files, it must be named differently.
The second issue is that SERDEPROPERTIES is part of the SERDE specification; if you do not specify a SERDE, there should be no SERDEPROPERTIES. See the manual: Storage Format and SerDe.
Fixed DDL:
CREATE TABLE factual_player (number_goals INT)
PARTITIONED BY (player_name STRING)
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/factual_player';
STORED AS PARQUET already implies the SERDE, INPUTFORMAT and OUTPUTFORMAT.
If you want to specify the SERDE with its properties, use this syntax:
CREATE TABLE factual_player(number_goals int)
PARTITIONED BY (player_name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format'='1') --I believe you really do not need this
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://nameservice1/factual_player'

Impala CREATE EXTERNAL TABLE and remove double quotes

I have data in CSV, for example:
"Female","44","0","0","Yes","Govt_job","Urban","103.59","32.7","formerly smoked"
I put it into HDFS with hdfs dfs -put,
and now I want to create an external table from it in Impala (not in Hive).
Is there an option to load it without the double quotes?
this is what i run by impala-shell:
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( `gender` STRING,`age` STRING,`hypertension` STRING,`heart_disease` STRING,`ever_married` STRING,`work_type` STRING,`Residence_type` STRING,`avg_glucose_level` STRING,`bmi` STRING,`smoking_status` STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION "/user/test/tmp/test1"
Update 28.11
I managed to do it by creating the external table and then creating a VIEW that SELECTs each column through a CASE WHEN / concat() expression.
Impala uses the Hive metastore so anything created in Hive is available from Impala after issuing an INVALIDATE METADATA dbname.tablename. HOWEVER, to remove the quotes you need to use the Hive Serde library 'org.apache.hadoop.hive.serde2.OpenCSVSerde' and this is not accessible from Impala. My suggestion would be to do the following:
Create the external table in Hive
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( gender STRING, age STRING, hypertension STRING, heart_disease STRING, ever_married STRING, work_type STRING, Residence_type STRING, avg_glucose_level STRING, bmi STRING, smoking_status STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ",",
"quoteChar" = """
)
STORED AS TEXTFILE
LOCATION "/user/test/tmp/test1"
Create a managed table in Hive using CTAS
CREATE TABLE mytable AS SELECT * FROM test_test.test1_ext;
Make it available in Impala
INVALIDATE METADATA db.mytable;

How to import compressed AVRO files to Impala table?

In my work, I import Avro files into Impala tables by copying the files into HDFS and then executing REFRESH in Impala.
But when I tried to do it with compressed files, it didn't work.
The only documentation I've found about enabling compression with Avro tables is this link: http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_avro.html#avro_compression_unique_1 .
Here is what I do:
Enable Hive compress in hive shell:
hive> set hive.exec.compress.output=true;
hive> set avro.output.codec=bzip2;
Create a table:
CREATE TABLE log_bzip2(
timestamp bigint COMMENT 'from deserializer',
appid string COMMENT 'from deserializer',
clientid string COMMENT 'from deserializer',
statkey string COMMENT 'from deserializer',
expid string COMMENT 'from deserializer',
modid string COMMENT 'from deserializer',
value double COMMENT 'from deserializer',
summary string COMMENT 'from deserializer',
custom string COMMENT 'from deserializer')
PARTITIONED BY (
day string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs://szq2.appadhoc.com:8020/user/hive/log.avsc');
Load the compressed AVRO file into HDFS:
hdfs dfs -put log.2016-03-07.1457184357726.avro.bz2 /user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07
Add partition and refresh in Impala shell:
alter table log_bzip2 add partition (day="2016-03-07") location '/user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07/';
refresh log_bzip2;
Query it, but it does not work:
select * from log_bzip2 limit 10;
Query: select * from log_bzip2 limit 10
WARNINGS: Invalid AVRO_VERSION_HEADER: '42 5a 68 39 '
How can I do it right? Thanks!
It turns out that the Avro format has its own way of compressing data, rather than compressing the generated Avro file manually. (The warning confirms this: 42 5a 68 39 is the bzip2 magic header "BZh9", not an Avro container header.) What we need to do is set the compression codec on the Avro writer while writing the file; the data blocks inside the container are then compressed by the Avro encoder. Loading such a file into Hive works as-is; nothing else needs to be configured.
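Alternatively, if the data is loaded through Hive rather than copied in by hand, the session settings from the question are exactly how the codec gets embedded: Hive's Avro output format honors avro.output.codec when it writes the files, so each container carries its compression internally. A hedged sketch, assuming a hypothetical uncompressed staging table staging_log with the same columns:

```sql
SET hive.exec.compress.output=true;
SET avro.output.codec=bzip2;
-- Hive's AvroContainerOutputFormat writes bzip2-compressed blocks
-- inside the .avro container, which Impala can read after a REFRESH.
INSERT OVERWRITE TABLE log_bzip2 PARTITION (day='2016-03-07')
SELECT `timestamp`, appid, clientid, statkey, expid,
       modid, `value`, summary, custom
FROM staging_log;  -- hypothetical staging table
```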