I have a JSON file and I want to create a Hive external table over it with more descriptive field names

I have a JSON file and I want to create a Hive external table over it, but with more descriptive field names. Basically, I want to map the less descriptive field names present in the JSON file to more descriptive column names in the Hive external table.
e.g.
{"field1":"data1","field2":100}
Hive Table:
Create External Table my_table (Name string, Id int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/path-to/my_table/';
Where Name points to field1 and Id points to field2.
Thanks!!

You can use this SerDe, which allows custom mappings between the JSON data and the Hive columns: https://github.com/rcongiu/Hive-JSON-Serde
See in particular this part: https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords
So, in your case, you'd need to do something like:
CREATE EXTERNAL TABLE my_table (name STRING, id INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"mapping.name" = "field1",
"mapping.id" = "field2" )
LOCATION '/path-to/my_table/';
Note that Hive column names are case-insensitive, while JSON attributes are case-sensitive.
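Depending on how your Hive environment is set up, you may also need to put the SerDe jar on Hive's classpath before creating the table. A minimal sketch, where the jar path and version are placeholders for whatever build of the Hive-JSON-Serde you actually download or build:
-- placeholder path and version; substitute the jar you actually use
ADD JAR /path/to/json-serde-1.3.8-jar-with-dependencies.jar;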

Related

Json to Athena table gives 0 results

I have a JSON file that looks like this. No nesting.
[{"id": [1984262,1984260]}]
I want to create a table in Athena using SQL such that I have a column "id" and each row in that column contains a value from the array. Something like this:
id
1984262
1984260
What I tried
CREATE EXTERNAL TABLE table1 (
id string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://data-bucket/data.json';
and
CREATE EXTERNAL TABLE table2 (
id array<string>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://data-bucket/data.json';
and
CREATE EXTERNAL TABLE table2 (
id array<bigint>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://data-bucket/data.json';
When I preview the table I see empty rows with absolutely no data. Please help.
Long story short: your JSON file needs to be compliant with the JSON SerDe.
To query JSON data with Athena you need to define a JSON (de-)serializer; you chose the Hive JSON SerDe.
https://docs.aws.amazon.com/athena/latest/ug/json-serde.html
Now your data needs to be compliant with that serializer. For the Hive JSON SerDe, that means each line must be a single-line JSON object corresponding to one record. For you that would mean:
{ "id" : 1984262 }
{ "id" : 1984260 }
and the corresponding table definition would be
CREATE EXTERNAL TABLE table1 (
id bigint
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://data-bucket/data.json';
https://github.com/rcongiu/Hive-JSON-Serde/blob/develop/README.md
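If reshaping the source into one scalar value per line is not convenient, an alternative sketch is to keep the array type and flatten it at query time. This assumes the file is rewritten as one JSON object per line (e.g. {"id": [1984262,1984260]}) and placed under a folder-style prefix such as s3://data-bucket/data/ (hypothetical), since Athena expects LOCATION to point to an S3 prefix rather than a single object:
CREATE EXTERNAL TABLE table2 (
  id array<bigint>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://data-bucket/data/';

-- one output row per array element
SELECT id_value
FROM table2
CROSS JOIN UNNEST(id) AS t(id_value);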

Using logic to assign values to external table

I am using the Greenplum external table feature to create a temporary readable table to load all files from a specific folder in S3. Here is how the external table is created:
CREATE READABLE EXTERNAL TEMPORARY TABLE app_json_ext (json_data text)
LOCATION ('s3://some_s3_location/2021/06/17/ config=some_config.conf')
FORMAT 'TEXT' (DELIMITER 'OFF' null E'' escape E'\t')
LOG ERRORS SEGMENT REJECT LIMIT 100 PERCENT;
with file names in S3 like:
appData_A2342342342.json
(where A2342342342 is the ID of the JSON file)
Full path based on above example would be:
s3://some_s3_location/2021/06/17/appData_A2342342342.json
Note that the ext table only contains a single column (the JSON data).
I would like to have another column for the ID of the JSON file (in this case A2342342342)
How would I set up a CREATE EXTERNAL TABLE statement so that I can grab the file name from S3 and parse it for the json_id column value?
Something like this:
CREATE READABLE EXTERNAL TEMPORARY TABLE app_json_ext (json_id text, json_data text)

How to rename a column when creating an external table in Athena based on Parquet files in S3?

Does anybody know how to rename a column when creating an external table in Athena based on Parquet files in S3?
The Parquet files I'm trying to load have both a column named export_date as well as an export_date partition in the s3 structure.
An example file path is: 's3://bucket_x/path/to/data/export_date=2020-08-01/platform=platform_a'
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
So what I would like to do, is to rename the export_date column to export_date_exp. The AWS documentation indicates that:
To make Parquet read by index, which will allow you to rename
columns, you must create a table with parquet.column.index.access
SerDe property set to true.
https://docs.amazonaws.cn/en_us/athena/latest/ug/handling-schema-updates-chapter.html#parquet-read-by-name
But the following code does not load any data in the export_date_exp column:
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
`column_1` string,
`export_date_exp` DATE,
`column_3` DATE,
`column_4` bigint,
`column_5` string)
PARTITIONED BY (
`export_date` string,
`platform` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ( 'parquet.column.index.access'='true')
LOCATION
's3://bucket_x/path/to/data'
TBLPROPERTIES (
'parquet.compression'='GZIP')
;
This question has been asked already, but did not receive an answer:
How to rename AWS Athena columns with parquet file source?
I am asking again because the documentation explicitly says it is possible.
As a side note: in my particular use case I can just not load the export_date column, as I've learned that reading Parquet by name does not require you to load every column. In my case I don't need the export_date column, so this avoids the conflict with the partition name.
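For reference, the workaround from that side note might look like the following sketch (assuming the export_date values coming from the partition path are sufficient, so the Parquet column of the same name is simply left out of the schema):
CREATE EXTERNAL TABLE `user_john_doe.new_table`(
  `column_1` string,
  `column_3` DATE,
  `column_4` bigint,
  `column_5` string)
PARTITIONED BY (
  `export_date` string,
  `platform` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
LOCATION
  's3://bucket_x/path/to/data'
TBLPROPERTIES (
  'parquet.compression'='GZIP');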

How to create a Hive external table on &-delimited key-value pairs

I have a simple requirement: create a Hive external table on a text file whose data is in the format
colAAA=2&colDDD=1065985&colBBB=valueBB&colCCC=875
COL_NAME=VALUE&COL_NAME=VALUE&COL_NAME=VALUE
I cannot use the RegEx SerDe as the column names don't come in a defined order. Is there a way to create the external table without writing a custom SerDe?
create external table if not exists custom_table_name(
colAAA int,
colBBB int,
colCCC string,
colDDD int)
row format delimited
fields terminated by '&'
????????????? How to make it read the Key-Value ??
I would like to avoid writing a custom SerDe unless there is no open-source SerDe available... Thanks.
First, create an external table with one map column to parse your data:
create external table some_table
(map_col map<string, string>)
row format delimited
collection items terminated by '&'
map keys terminated by '='
stored as textfile
location '<your_location>';
Then select the map keys of your interest:
create table another_table as
select map_col['colAAA'] as colAAA, ...etc
from some_table
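Putting it together for the columns in the question, the second step might look like this sketch (assuming the types implied by the sample record, i.e. colBBB is a string and the other columns are integers):
create table another_table as
select cast(map_col['colAAA'] as int) as colAAA,
       map_col['colBBB']              as colBBB,
       cast(map_col['colCCC'] as int) as colCCC,
       cast(map_col['colDDD'] as int) as colDDD
from some_table;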

Creation of a partitioned external table with hive: no data available

I have a comma-delimited file on HDFS; each line contains a date string and a session count.
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive, but I cannot see any data when I query the table.
Since it's an external table, data insertion should be done automatically, right?
Your file should have the columns in this sequence:
int,string
Here your file contents are in the sequence below:
string,int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder containing a CSV file that matches the table structure, the data will be loaded automatically and you can use SELECT queries to see it.
Solved!
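As a side note, if there are already several date_string=... subfolders under the table location, they can all be registered in one step instead of one ALTER TABLE per partition (a sketch, assuming the default Hive partition directory layout):
MSCK REPAIR TABLE google_analytics;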