using logic to assign values to external table - amazon-s3

I am using the Greenplum external table feature to create a temporary readable table that loads all files from a specific folder in S3. Here is how the external table is created:
CREATE READABLE EXTERNAL TEMPORARY TABLE app_json_ext (json_data text)
LOCATION ('s3://some_s3_location/2021/06/17/ config=some_config.conf')
FORMAT 'TEXT' (DELIMITER 'OFF' null E'' escape E'\t')
LOG ERRORS SEGMENT REJECT LIMIT 100 PERCENT;
with file names in S3 like:
appData_A2342342342.json
(where A2342342342 is the ID of the JSON file)
Full path based on above example would be:
s3://some_s3_location/2021/06/17/appData_A2342342342.json
Note that the ext table only contains a single column (the JSON data).
I would like to have another column for the ID of the JSON file (in this case A2342342342).
How would I set up a create ext table statement so that I can grab the file name from S3 and parse it for the JSON_ID column value?
Something like this:
CREATE READABLE EXTERNAL TEMPORARY TABLE app_json_ext (json_id text, json_data text)
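To make the intended parsing concrete, here is a minimal Python sketch of how the ID could be pulled out of a file name like the one above. It says nothing about whether Greenplum can do this inside the external table definition; it only shows the parsing step, for example in a preprocessing script. The function name json_id_from_path and the assumption that IDs are purely alphanumeric are hypothetical.

```python
import re
from posixpath import basename

# Assumption: file names look like appData_<ID>.json, where <ID> is alphanumeric.
ID_PATTERN = re.compile(r"^appData_([A-Za-z0-9]+)\.json$")

def json_id_from_path(s3_path):
    """Return the ID embedded in the file name, or None if the name does not match."""
    match = ID_PATTERN.match(basename(s3_path))
    return match.group(1) if match else None

if __name__ == "__main__":
    print(json_id_from_path(
        "s3://some_s3_location/2021/06/17/appData_A2342342342.json"
    ))
```

File names that do not follow the appData_<ID>.json pattern yield None rather than raising, so the caller can decide how to handle them.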

Related

How can I copy data from CSV to a destination table based on column names?

Context
I am receiving CSV files in S3, which do not always follow the same schema and/or order. For example, sometimes files look like:
foo, bar, bla
hi , 007, 42
bye, 008, 44
But other times, they can look like (bar can be missing):
foo, bla
hi , 42
bye, 44
Now let's say I'm only interested in getting the foo column regardless of what else is there. But I can't really count on the order of the columns in the CSV, so on some days foo could be the first column, but on other days foo could be the third column. By the way, I am using Snowflake as a database.
What I have tried to do
I created a destination table like:
CREATE TABLE woof.meow (foo TEXT);
Then I tried to use Snowflake's COPY INTO command to copy data from the CSV into the table I created. The catch here is that I tried to do it the same way I normally do for Parquet files (matching by column names!), like:
COPY INTO woof.meow
FROM '#STAGES.MY_S3_BUCKET_STAGE/'
file_format = (
TYPE=CSV,
COMPRESSION=GZIP,
)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
But sadly I always got: error: Insert value list does not match column list expecting 1 but got 0
Some research led me to this section of the docs (about MATCH_BY_COLUMN_NAME), where I discovered that CSV is not supported:
This copy option is supported for the following data formats:
- JSON
- Avro
- ORC
- Parquet
Desired objective
How can I copy data from the STAGE (containing CSV files on S3) to a pre-created table based on column names?
I am happy to provide any further information if needed.
You are trying to insert CSV (comma-separated values) file data into a single text column. To my knowledge, the column order in your source data files must match the column order of the target table in Snowflake. That is, if your source CSV file has foo, bar and bla as columns, the target table should also have them as separate columns, in the same order as in the source CSV files.
If you are unsure which columns will appear in your source file, I would recommend transforming the file to JSON (that is my choice; you can choose another option such as Avro) and loading that content into a VARIANT column in Snowflake.
That way you do not have to worry about the order of columns in the source files: you store the data as JSON/Avro in the target table and use the JSON handling mechanism to convert JSON values into columns (flatten the JSON to convert it into a relational table).
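As an illustration of the CSV-to-JSON transformation suggested above, here is a minimal Python sketch that turns a CSV with a header row into one JSON object per line, keyed by column name. The function name csv_to_json_lines is hypothetical, and stripping whitespace around names and values is an assumption based on the sample files in the question.

```python
import csv
import io
import json

def csv_to_json_lines(csv_text):
    """Convert CSV text with a header row into a list of JSON strings, one per data row."""
    reader = csv.reader(io.StringIO(csv_text))
    # The header decides the keys, so column order in the file no longer matters.
    header = [name.strip() for name in next(reader)]
    lines = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = {key: value.strip() for key, value in zip(header, row)}
        lines.append(json.dumps(record))
    return lines

if __name__ == "__main__":
    sample = "foo, bar, bla\nhi , 007, 42\nbye, 008, 44\n"
    for line in csv_to_json_lines(sample):
        print(line)
```

Because every record carries its own keys, a file that lacks bar or puts foo in the third position still produces objects from which foo can be selected by name after loading.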

Find hive external table name from HDFS directory

Is it possible to get the external table name if the only information I have is the HDFS directory?
For example, I create the table with
CREATE EXTERNAL TABLE IF NOT EXISTS userinfo(id String, name String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///user/testuser/log/2019-02-18/'
To get the location from table name, I can use
show create table userinfo;
But what if I want to get the table name from "hdfs:///user/testuser/log/2019-02-18/"?
Is it possible to find the table name "userinfo" from the directory?
Thanks
David

Parquet Files Generation with hive

I'm trying to generate some Parquet files with Hive. To accomplish this, I loaded a regular Hive table from some .tbl files through this command in Hive:
CREATE TABLE REGION (
R_REGIONKEY BIGINT,
R_NAME STRING,
R_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
location '/tmp/tpch-generate';
After this I just execute these 2 lines:
create table parquet_region LIKE region STORED AS PARQUET;
insert into parquet_region select * from region;
But when I check the output generated in HDFS, I don't find any .parquet file. Instead I find files with names like 0000_0 to 0000_21, and the sum of their sizes is much bigger than the original .tbl file.
What am I doing wrong?
The insert statement doesn't create files with an extension, but these are the Parquet files.
You can use DESCRIBE FORMATTED <table> to show table information.
hive> DESCRIBE FORMATTED <table_name>
Additional Note: You can also create new table from source table using below query:
CREATE TABLE new_test STORED AS PARQUET AS SELECT * FROM source_table;
It will create the new table in Parquet format and copy the structure as well as the data.

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive but I cannot see any data:
Since it's an external table, data insertion should be done automatically, right?
Your file should have its fields in this order:
int,string
Here, your file contents are in the following order:
string, int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords (such as date) as column names.
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder with a csv file corresponding to the structure of the table, data will be automatically loaded and you can already use select queries to see the data.
Solved!

I have a json file and I want to create Hive external table over it but with more descriptive field names

I have a JSON file and I want to create a Hive external table over it, but with more descriptive field names. Basically, I want to map the less descriptive field names present in the JSON file to more descriptive fields in the Hive external table.
e.g.
{"field1":"data1","field2":100}
Hive Table:
Create External Table my_table (Name string, Id int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/path-to/my_table/';
Where Name points to field1 and Id points to field2.
Thanks!!
You can use this SerDe that allows custom mappings between the JSON data and the hive columns: https://github.com/rcongiu/Hive-JSON-Serde
See in particular this part: https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords
So, in your case, you'd need to do something like:
CREATE EXTERNAL TABLE my_table(name STRING, id INT)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
"mapping.name" = "field1",
"mapping.id" = "field2" )
LOCATION '/path-to/my_table/'
Note that Hive column names are case insensitive, while JSON attributes are case sensitive.
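To show what such a mapping amounts to, here is a small Python sketch that renames the JSON attributes from the example into the descriptive column names. It is only an illustration of the idea, not how the SerDe is implemented; the function name map_record and the MAPPING dictionary are hypothetical.

```python
import json

# Hypothetical mapping mirroring the SERDEPROPERTIES above: column name -> JSON attribute.
MAPPING = {"name": "field1", "id": "field2"}

def map_record(json_line, mapping=MAPPING):
    """Return a dict keyed by (lower-cased) column name, looked up by exact JSON attribute."""
    record = json.loads(json_line)
    # Column names are normalized to lower case; JSON attributes are matched
    # case-sensitively, so "Field1" would not match "field1".
    return {column.lower(): record.get(attribute) for column, attribute in mapping.items()}

if __name__ == "__main__":
    print(map_record('{"field1":"data1","field2":100}'))
```

An attribute spelled with different casing in the data (for example Field1) is simply not found here and comes back as None, which mirrors the case-sensitivity caveat above.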