In my work, I import Avro files into Impala tables by copying the files into HDFS and then executing "refresh" in Impala.
But when I try to do the same with compressed files, it doesn't work.
The only documentation I've found about enabling compression for Avro tables is this link: http://www.cloudera.com/documentation/archive/impala/2-x/2-1-x/topics/impala_avro.html#avro_compression_unique_1 .
Here is what I do:
Enable Hive output compression in the Hive shell:
hive> set hive.exec.compress.output=true;
hive> set avro.output.codec=bzip2;
Create a table:
CREATE TABLE log_bzip2(
timestamp bigint COMMENT 'from deserializer',
appid string COMMENT 'from deserializer',
clientid string COMMENT 'from deserializer',
statkey string COMMENT 'from deserializer',
expid string COMMENT 'from deserializer',
modid string COMMENT 'from deserializer',
value double COMMENT 'from deserializer',
summary string COMMENT 'from deserializer',
custom string COMMENT 'from deserializer')
PARTITIONED BY (
day string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.url'='hdfs://szq2.appadhoc.com:8020/user/hive/log.avsc');
Load the compressed AVRO file into HDFS:
hdfs dfs -put log.2016-03-07.1457184357726.avro.bz2 /user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07
Add partition and refresh in Impala shell:
alter table log_bzip2 add partition (day="2016-03-07") location '/user/hive/warehouse/adhoc_data_fast.db/log_bzip2/2016-03-07/';
refresh log_bzip2;
Query it, but it doesn't work:
select * from log_bzip2 limit 10;
Query: select * from log_bzip2 limit 10
WARNINGS: Invalid AVRO_VERSION_HEADER: '42 5a 68 39 '
How can I do it right? Thanks!
It turns out that the Avro format has its own way of compressing data, rather than compressing the generated Avro file manually (the warning above is a clue: the header bytes 42 5a 68 39 are just the bzip2 magic "BZh9", so Impala sees a raw bzip2 stream instead of an Avro container). So what we need to do is set the compression option on the Avro writer while writing the file; the generated file is then compressed internally by the Avro encoder. Putting this file into Hive works as-is; nothing else needs to be configured.
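For reference, the Cloudera document linked above shows what this looks like when Hive itself is the writer: with the compression settings from the first step above, Hive's Avro writer records the codec inside the Avro container rather than compressing the file as a whole. A minimal sketch, assuming a hypothetical uncompressed staging table log_raw with the same columns (the snappy codec and the staging table name are illustrative):
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;  -- codec is recorded inside the Avro container; snappy/deflate are the codecs the linked doc shows
-- log_raw is a hypothetical staging table holding the uncompressed rows for this day
INSERT OVERWRITE TABLE log_bzip2 PARTITION (day='2016-03-07')
SELECT `timestamp`, appid, clientid, statkey, expid, modid, `value`, summary, custom
FROM log_raw;
After the usual add partition and refresh steps, Impala can read such files, because the compression now lives inside the Avro container instead of wrapping it.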
Related
I would like to set the location value in my Athena SQL create table statement to a single CSV file, as I do not want to query every file in the path. I can set the location to an S3 directory (prefix) and successfully query all files under that path, but not a single file. Is setting a single file as the location supported?
Successfully queries CSV files in path:
LOCATION 's3://my_bucket/path/'
Returns zero results:
LOCATION 's3://my_bucket/path/filename.csv.gz'
Create table statement:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_db` (
`name` string,
`occupation` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = ','
) LOCATION 's3://bucket-name/path/filename.csv.gz'
TBLPROPERTIES ('has_encrypted_data'='false');
I have read this Q&A and this one, but neither seems to address the same issue.
Thank you.
You could try filtering on the path of that particular object in a WHERE condition when querying, using Athena's "$path" pseudo-column:
SELECT * FROM default.my_db
WHERE "$path" = 's3://bucket-name/path/filename.csv.gz'
I'm trying to create an internal table in Athena, on data in S3 in parquet format:
CREATE TABLE IF NOT EXISTS `vdp_dev.owners_daily`(
`owner_id` string COMMENT 'from deserializer',
`username` string COMMENT 'from deserializer',
`billing_with` string COMMENT 'from deserializer',
`billing_contacts` string COMMENT 'from deserializer',
`error_code` string COMMENT 'from deserializer')
PARTITIONED BY (
`dt` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://xxxxx-xx-xxxx-xxxxxx/dim/daily/owners';
but getting the following error:
Only external table creation is supported. (Service: AmazonAthena;
Status Code: 400; Error Code: InvalidRequestException; Request ID:
13c5325b-2217-4989-b5f3-e717462329c1)
Does someone know why it happens?
Why can't I create an internal table in Athena?
From the Athena documentation:
All Tables Are EXTERNAL
If you use CREATE TABLE without the EXTERNAL keyword, Athena issues an error; only tables with the EXTERNAL keyword can be created. We recommend that you always use the EXTERNAL keyword. When you drop a table in Athena, only the table metadata is removed; the data remains in Amazon S3.
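So the fix is to add the EXTERNAL keyword to the statement from the question. A minimal sketch of the corrected DDL, with the column list abbreviated and everything else unchanged:
CREATE EXTERNAL TABLE IF NOT EXISTS `vdp_dev.owners_daily`(
  `owner_id` string,
  -- ... remaining columns exactly as in the original statement ...
  `error_code` string)
PARTITIONED BY (
  `dt` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://xxxxx-xx-xxxx-xxxxxx/dim/daily/owners';
As the documentation notes, dropping this table later only removes the metadata; the Parquet files in S3 are left untouched.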
I placed a comma-delimited text file in an S3 bucket. I am attempting to query the folder the file resides in, but it returns zero results.
Create table DDL:
CREATE EXTERNAL TABLE myDatabase.myTable (
`field_1` string,
`field_2` string,
...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://bucket/files from boss/'
TBLPROPERTIES ('has_encrypted_data'='false');
The issue was the whitespace in the location:
LOCATION 's3://bucket/files from boss/'
I removed the whitespace from the folder name in S3 and I was able to query without issue:
LOCATION 's3://bucket/files_from_boss/'
I have a few processes where I use the copy command to copy data from S3 into Redshift.
I have a new CSV file where I am unable to figure out how to bring in the "note" field, which is a free-hand field a salesperson can write anything into. It can contain ";", ",", ".", spaces, new lines, anything.
Are there any common approaches to copying this type of field? It is of type varchar(max) in table_name.
Using this:
copy table_name
from 's3://location'
iam_role 'something'
delimiter as ','
ignoreheader 1
escape
removequotes
acceptinvchars
I get Delimiter not found
Using this:
copy table_name
from 's3://location'
iam_role 'something'
delimiter as ','
fillrecord
ignoreheader 1
escape
removequotes
acceptinvchars
I get String length exceeds DDL length
The second copy command fixed your initial issue, namely COPY failing to parse the CSV file. But now the data can't be inserted because an input value exceeds the maximum length of the corresponding column in the database. Try increasing the size of the column:
Alter column data type in Amazon Redshift
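A hedged sketch of what that looks like, with hypothetical table and column names (note being the free-text column); recent Redshift releases can widen a VARCHAR in place, while the linked answer's add/copy/drop/rename workaround covers older clusters and other type changes:
-- In-place widen (supported for VARCHAR length increases); 65535 is Redshift's VARCHAR(MAX)
ALTER TABLE table_name ALTER COLUMN note TYPE VARCHAR(65535);
-- Workaround from the linked answer: add a wider column, copy the data, swap names
ALTER TABLE table_name ADD COLUMN note_new VARCHAR(65535);
UPDATE table_name SET note_new = note;
ALTER TABLE table_name DROP COLUMN note;
ALTER TABLE table_name RENAME COLUMN note_new TO note;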
I have the following file structure
/base/{yyyy-mm-dd}/
folder1/
folderContainingCSV/
logs/
I want to load the data from my base directory for all dates. The problem is that there are files that are not in csv.gz format in the logs/ directory. Is there a way to select only the csv.gz files when querying from the base directory level?
Sample query:
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = '|'
) LOCATION 's3://base/'
TBLPROPERTIES ('has_encrypted_data'='true');
You cannot do this at the table-creation level. You need to copy all the *.gz files separately into another folder.
This can be done within the Hive script (the one containing the create table statement) itself. Just add the commands below at the beginning of the script, just before the create table statement:
dfs -mkdir -p /new/path/folder
dfs -cp /regular/log/file/*.gz /new/path/folder
Now you can create the external table pointing to /new/path/folder.