How to rename a column when creating an external table in Athena based on Parquet files in S3? - hive

Does anybody know how to rename a column when creating an external table in Athena based on Parquet files in S3?
The Parquet files I'm trying to load have both a column named export_date and an export_date partition in the S3 structure.
An example file path is: 's3://bucket_x/path/to/data/export_date=2020-08-01/platform=platform_a'
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
So what I would like to do is rename the export_date column to export_date_exp. The AWS documentation indicates that:
To make Parquet read by index, which will allow you to rename
columns, you must create a table with parquet.column.index.access
SerDe property set to true.
https://docs.amazonaws.cn/en_us/athena/latest/ug/handling-schema-updates-chapter.html#parquet-read-by-name
But the following code does not load any data in the export_date_exp column:
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date_exp` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 'parquet.column.index.access'='true')
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
This question has been asked already, but did not receive an answer:
How to rename AWS Athena columns with parquet file source?
I am asking again because the documentation explicitly says it is possible.
As a side note: in my particular use case I can just not load the export_date column, as I've learned that reading Parquet by name does not require you to load every column. In my case I don't need the export_date column, so this avoids the conflict with the partition name.
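To make the distinction in the quoted documentation concrete, here is a minimal Python sketch of name-based versus index-based column resolution. It is only a toy model: the helper names resolve_by_name and resolve_by_index and the sample row are made up for illustration, it is not how Athena or the Parquet SerDe is implemented, and it does not explain why the DDL above returned no data.

```python
from typing import Any, Dict, List, Optional

def resolve_by_name(file_columns: List[str], file_row: List[Any],
                    table_columns: List[str]) -> Dict[str, Optional[Any]]:
    """Match table columns to file columns by name; unmatched names yield None."""
    lookup = dict(zip(file_columns, file_row))
    return {col: lookup.get(col) for col in table_columns}

def resolve_by_index(file_row: List[Any],
                     table_columns: List[str]) -> Dict[str, Optional[Any]]:
    """Match table columns to file columns by position, ignoring file column names."""
    return {col: (file_row[i] if i < len(file_row) else None)
            for i, col in enumerate(table_columns)}

# Hypothetical file schema and one row (shortened to three columns).
file_columns = ["column_1", "export_date", "column_3"]
file_row = ["a", "2020-08-01", "2020-08-02"]

# Table definition with the second column renamed.
table_columns = ["column_1", "export_date_exp", "column_3"]

by_name = resolve_by_name(file_columns, file_row, table_columns)
by_index = resolve_by_index(file_row, table_columns)

print(by_name)   # the renamed column has no match in the file
print(by_index)  # the renamed column takes the value at position 1
```

In this model a rename only survives when columns are matched by position, which is the reason the documentation ties renaming to parquet.column.index.access. It also shows why reading by index requires the table columns to be declared in exactly the same order as in the files.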

Related

BigQuery external table over GCS path with partitions

I have some data stored in GCS bucket in the following path:
gcs://my-bucket/my_data/subfolder1/subfolder2/**.csv.gz
I intend to create an external table mapping to my_data, and I want the external table to be partitioned by the different levels of subfolders. Note that subfolder1 and subfolder2 don't have a Hive partition prefix, i.e., they are not in the format prefix=value.
If I would write some pseudo code in Athena syntax, it would be something like below:
CREATE EXTERNAL TABLE `my_data`(
--Column specs go here---
)
PARTITIONED BY (
`partition_0` string,
`partition_1` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'gcs://my-bucket/my_data/'
TBLPROPERTIES (...)
As a result of the pseudo code, the table will consist of two partition columns in addition to the columns defined in the column spec:
partition_0
partition_1
Queries filtering on these two columns will then benefit from partition pruning.
Could anyone please advise whether this is possible in BigQuery? If yes, how should I go about it in SQL?

Athena returns blank response for Partitioned data, what am I missing?

I have created a partitioned table. I tried two ways of mapping my S3 bucket folders, shown below, but either way I get no records when I query with a WHERE clause on the partition columns.
My S3 bucket looks like the following. part*.csv is what I want to query in Athena. There are other folders at the same location, alongside output and within output.
s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/part-filename121231.csv
s3://bucket-rootname/XYZ-CASE/report/678d1234-2c3a-481b-a1eb-5169d2a97747/output/part-filename213123.csv
My table looks like the following.
Version 1:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",id="f78dea49-2c3a-481b-a1eb-5169d2a97747") location "s3://bucket-rootname/casename=ABC-CASE/report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
select * from mytable_trial1 where casename='ABC-CASE' and report='report' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747' and foldername='output';
Version 2:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`report` string,
`id` string,
`foldername` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1')
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/casename=ABC-CASE/report=report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/foldername=output/";
select * from mytable_trial1 where casename='ABC-CASE' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747'
SHOW PARTITIONS lists this partition, but no records are found when I query with the WHERE clause.
I worked with AWS Support and we were able to narrow down the issue. Version 2 was the right one to use, since it has four partitions, matching the four folder levels in my S3 bucket. The ALTER TABLE command also had a problem with its location: I used a Hive-format location (key=value folders), which was incorrect because my actual S3 layout is not in Hive format. Correcting the command to the following worked for me:
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
Preview table now shows my entries.
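If there are many such folders, the corrected statement can be generated from the object keys instead of being typed by hand. Below is a rough Python sketch under some assumptions: the helper name add_partition_statement is made up, the leading folder levels of each key are assumed to map one-to-one onto the partition columns, and the quoting simply mirrors the statement above.

```python
from typing import List

def add_partition_statement(table: str, bucket: str, key: str,
                            partition_names: List[str]) -> str:
    """Build an ALTER TABLE ... ADD PARTITION statement for a non-Hive S3 layout.

    The first len(partition_names) folder levels of `key` become the partition
    values, and LOCATION points at the real folder path (no name=value parts).
    """
    parts = key.strip("/").split("/")
    if len(parts) < len(partition_names):
        raise ValueError("key has fewer folder levels than partition columns")
    folders = parts[:len(partition_names)]
    spec = ",".join(f'{name}="{value}"'
                    for name, value in zip(partition_names, folders))
    location = f"s3://{bucket}/" + "/".join(folders) + "/"
    return f'ALTER TABLE {table} ADD PARTITION ({spec}) LOCATION "{location}";'

statement = add_partition_statement(
    table="mytable_trial1",
    bucket="bucket-rootname",
    key="ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/part-filename121231.csv",
    partition_names=["casename", "report", "id", "foldername"],
)
print(statement)
```

The point of the sketch is that the partition values in the PARTITION (...) clause are just labels, while LOCATION has to be the path that actually exists in the bucket.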

data appears as null on redshift external table while working right on athena

So I'm trying to run the following simple query on redshift spectrum:
select * from company.vehicles where vehicle_id is not null
and it returns 0 rows (all of the rows in the table appear as null). However, when I run the same query on Athena it works fine and returns results. I tried MSCK REPAIR, but Athena and Redshift use the same metastore, so it shouldn't matter.
I also don't see any errors.
The format of the files is orc.
The create table query is:
CREATE EXTERNAL TABLE `vehicles`(
`vehicle_id` bigint,
`parent_id` bigint,
`client_id` bigint,
`assets_group` int,
`drivers_group` int)
PARTITIONED BY (
`dt` string,
`datacenter` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
How did you create your external table?
For Spectrum, you have to explicitly set a parameter that defines what should be treated as null.
Add the parameter 'serialization.null.format'='' to the TABLE PROPERTIES of your external table in Spectrum so that all columns containing '' are treated as NULL:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
Alternatively, you can set the SERDEPROPERTIES when creating the external table so that NULL values are recognized automatically.
Eventually it turned out to be a bug in redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE PROPERTIES ('orc.schema.resolution'='position');
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I ran SET json_serialization_enable TO true; before my queries to make the nested JSON columns queryable. This led to some columns being NULL when the JSON exceeded a size limit, see here:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.

Parquet Files Generation with hive

I'm trying to generate some Parquet files with Hive. To accomplish this, I loaded a regular Hive table from some .tbl files using this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these 2 lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet files. Instead I find files with names like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The INSERT statement doesn't create files with an extension, but these are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
Additional note: you can also create a new table from the source table using the query below:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It creates the new table in Parquet format and copies the structure as well as the data.
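If you want to double-check that the extension-less files really are Parquet, one option is to look at the magic bytes: a Parquet file starts and ends with the 4 bytes PAR1. The Python sketch below assumes the file has first been copied out of HDFS to local disk (for example with hdfs dfs -get); the helper name looks_like_parquet and the temporary demo files are made up for illustration, and the check only inspects the magic bytes, not the rest of the file.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at the start and end of a Parquet file

def looks_like_parquet(path: str) -> bool:
    """Return True if the file begins and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC

# Demo with two stand-in files: one framed by the magic bytes, one plain text.
with tempfile.TemporaryDirectory() as tmp:
    parquet_like = os.path.join(tmp, "000000_0")
    text_like = os.path.join(tmp, "region.tbl")
    with open(parquet_like, "wb") as f:
        f.write(PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC)
    with open(text_like, "wb") as f:
        f.write(b"0|AFRICA|some comment|\n")
    parquet_result = looks_like_parquet(parquet_like)
    text_result = looks_like_parquet(text_like)

print(parquet_result, text_result)
```

The demo files are not valid Parquet data; they only exercise the check. On a real file pulled from the table's HDFS directory, a True result is a quick hint that the INSERT wrote Parquet despite the missing extension.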

I have a json file and I want to create Hive external table over it but with more descriptive field names

I have a JSON file and I want to create a Hive external table over it, but with more descriptive field names. Basically, I want to map the less descriptive field names in the JSON file to more descriptive column names in the Hive external table.
e.g.
{"field1":"data1","field2":100}
Hive Table:
Create External Table my_table (Name string, Id int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/path-to/my_table/';
Where Name points to field1 and Id points to field2.
Thanks!!
You can use this SerDe that allows custom mappings between the JSON data and the hive columns: https://github.com/rcongiu/Hive-JSON-Serde
See in particular this part: https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords
So, in your case, you'd need to do something like:
CREATE EXTERNAL TABLE my_table(name STRING, id INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"mapping.name" = "field1",
"mapping.id" = "field2" )
LOCATION '/path-to/my_table/'
Note that hive column names are case insensitive, while JSON attributes
are case sensitive.
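To illustrate what the mapping properties express, here is a small Python sketch that applies the same two mappings to the sample record from the question. It is only a model of the idea: the function map_record is made up, and this is not the SerDe's actual implementation.

```python
import json
from typing import Any, Dict, List

# Same mappings as in the SERDEPROPERTIES above: Hive column -> JSON attribute.
serde_properties = {"mapping.name": "field1", "mapping.id": "field2"}

def map_record(json_line: str, columns: List[str],
               properties: Dict[str, str]) -> Dict[str, Any]:
    """Build a row keyed by Hive column name from one JSON line.

    A column with a "mapping.<column>" property reads from that JSON attribute;
    otherwise it reads from the attribute with the column's own name.
    """
    record = json.loads(json_line)
    row = {}
    for column in columns:
        json_key = properties.get(f"mapping.{column.lower()}", column)
        row[column] = record.get(json_key)  # JSON key lookup is case sensitive
    return row

row = map_record('{"field1":"data1","field2":100}', ["name", "id"], serde_properties)
print(row)
```

Here the column names on the left of each mapping are lower-cased before the lookup, while the JSON attribute on the right is matched exactly, which mirrors the note about case sensitivity.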