Loading data from multiple S3 folders selectively into a table in Hive - hive

I have an S3 bucket with multiple folders, say A and B, along with some other folders. The folder structure is as below:
s3://buckets/AGGREGATED_STUDENT_REPORT/data/A/,
s3://buckets/AGGREGATED_STUDENT_REPORT/data/B/ etc.
Inside these two folders a daily report is generated in a further subfolder such as run_date=2019-01-01, so the resulting folder structure is something like below:
s3://buckets/AGGREGATED_STUDENT_REPORT/data/A/run_date=2019-01-01/..,
s3://buckets/AGGREGATED_STUDENT_REPORT/data/B/run_date=2019-01-01/..
Now, in Hive, I want to create an external table over only the data generated on the last day of every month in these two folders, ignoring the others, as follows:
CREATE EXTERNAL TABLE STUDENT_SUMMARY
(
ROLL_NUM STRING,
CLASS STRING,
REMARKS STRING,
LAST_UPDATED STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION 's3://AGGREGATED_STUDENT_REPORT/data/*/run_date=2018-12-31';
But in the above query I am not able to figure out how to point the table at just this group of selected folders.

Any chance you can copy the folders to HDFS?
Two reasons:
a) You can create just one folder in HDFS, copy all of A, B, etc. into that same HDFS folder, and use it as your LOCATION parameter, as sketched below.
b) I am guessing the query performance would be better if the data resides in HDFS rather than S3.
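A minimal sketch of option a), assuming the month-end folders have already been copied into a single hypothetical HDFS directory /data/student_summary/2018-12-31/:
-- Assumption: only the month-end run_date folders from A and B were first copied into one
-- HDFS directory, e.g. with hadoop distcp or s3-dist-cp, landing in /data/student_summary/2018-12-31/
CREATE EXTERNAL TABLE STUDENT_SUMMARY
(
  ROLL_NUM     STRING,
  CLASS        STRING,
  REMARKS      STRING,
  LAST_UPDATED STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/student_summary/2018-12-31';  -- a single HDFS folder, so no wildcard is needed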

Related

INSERT OVERWRITE on just created table

I have to replicate a process for a client. I have never worked with Hive, so I am trying to understand what they were doing in other cases.
The Hive script I am trying to understand is this one:
DROP TABLE IF EXISTS distribution.030601_TI11;
CREATE EXTERNAL TABLE IF NOT EXISTS distribution.030601_TI11(
mygroup STRING, year STRING, type1 STRING, type2 STRING,
type3 STRING, myvalue INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/warehouse/distribution/030601_TI11';
INSERT OVERWRITE TABLE distribution.030601_TI11
SELECT *
FROM develop.030601_TI11;
What are they doing?
As far as I have read about Hive, a DROP TABLE IF EXISTS statement on an external table will only delete the table metadata, not the table data. But I would like to know whether that INSERT OVERWRITE statement drops the entries previously stored in the table and inserts only the new rows from the specified location.
Also, how is the LOCATION managed? I want to create the table from a single .csv file. Can I write something like LOCATION '/warehouse/develop/myfile.csv', or can I only provide an HDFS directory as the location?
INSERT OVERWRITE TABLE removes all files inside the table location and moves the new files in. This happens at the very end: once the query has successfully executed and the result files have been created in a temporary location, the load task removes all files in the table location and moves the files from the temporary location into it. See also this answer: https://stackoverflow.com/a/63378038/2700344
If you want to create a table on top of a single file, put it in some folder, make sure there are no other files in that folder, and specify that folder as the location in the CREATE TABLE DDL. You can also put the file into an existing table location using the hdfs dfs -put command, the LOAD DATA command, or some other means. The main point is that the table should have its own location; it does not matter how many files are in that location, one or many, because the location is a folder (directory), not a file. Even if it were possible to create a table on top of a single file instead of a folder, it would be unsafe, because an overwrite can create other files and the table would be left pointing to a non-existent file. Carefully read the answers to this question: How to point to a single file with external table
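A minimal sketch of that setup, assuming a hypothetical file myfile.csv placed in its own dedicated directory /warehouse/develop/myfile_dir and comma-separated values:
-- Assumption: the single CSV was first placed in its own directory, e.g.
--   hdfs dfs -mkdir -p /warehouse/develop/myfile_dir
--   hdfs dfs -put myfile.csv /warehouse/develop/myfile_dir/
CREATE EXTERNAL TABLE develop.my_single_file_table (
  mygroup STRING, year STRING, type1 STRING, type2 STRING,
  type3 STRING, myvalue INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/warehouse/develop/myfile_dir';   -- the directory, never the file itself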
You are right: for an external table, the data at the location will remain as is when the table is dropped. So, with the drop-then-create statements they are making sure the table definition does not already exist before creating it. The table also seems to be dynamic in nature, which can be another reason for the drop-create.
Please notice they are using CREATE EXTERNAL TABLE IF NOT EXISTS, which means that if the table already exists it will not be recreated.
The storage will be cleaned and reloaded by the INSERT OVERWRITE.
Now, if you want to create a table on top of a CSV file, put the file in its own directory and point the location at that directory, e.g. LOCATION '/warehouse/develop/myfile'. You do not put the .csv file name in the location.

Issue with loading Athena tables from multiple zip files in S3

I have a requirement to create an Athena table from multiple zipped files in multiple folders in S3.
The folder structure in S3 is as follows: S3 bucket ==> Clients folder ==> multiple folders for multiple countries (US, JAPAN, UK, ... up to 50 countries) ==> 10 to 50 '.gz' files in each country folder.
I need to merge all the '.gz' files from all the country folders and create a single table, 'companies_all_regions', in Athena. I used the Glue crawlers and classifiers, but the files are not getting merged into one table.
Please help me with other ways to create the table.
You could create an Amazon Athena external table at the top-level of the bucket. All files at that level, and in sub-folders, will be included in the table. All files will need to be in the same format.
If your CSV files contain commas within a column, then the values for the column would need to be placed "inside double quotes".
If you are able to change the way the files are created, you could choose an alternate column separator, such as the pipe (|) character. That will avoid problems with commas inside field values. You can then configure the table to use the pipe as the separator character.
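A minimal sketch of such a table, assuming pipe-delimited, gzip-compressed text files and hypothetical column names; the LOCATION points at the Clients prefix so every country folder underneath it is picked up:
CREATE EXTERNAL TABLE companies_all_regions (
  company_id   STRING,   -- hypothetical columns: replace with the real schema
  company_name STRING,
  revenue      DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://my-bucket/Clients/';   -- hypothetical bucket; all country sub-folders are scanned
-- Athena reads .gz text files transparently, so no separate decompression step is needed.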

Hive: load gzipped CSV from HDFS as read-only into a table

I have an HDFS folder with many csv.gz files in it, all with the same schema. My customer needs to read the content of these tables through Hive.
I tried to apply https://cwiki.apache.org/confluence/display/Hive/CompressedStorage . However, it moves the file, whereas I need it to stay in its initial directory.
Another problem is that I would have to load each file one by one; I would rather create a table from the directory and not manage the files individually.
I do not master Hive at all. Is this possible?
Yes, this is possible via Hive. You can create an external table and reference the existing HDFS location containing the gzip files. The schema for the data should be specified during the table creation.
hive> CREATE EXTERNAL TABLE my_data
(
column_1 int,
column_2 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'hdfs:///my_data_folder_with_gzip_files';
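Because the table is external and the files are plain gzipped text, Hive decompresses them on the fly at query time and never moves or modifies them, so the data stays in its original directory. A quick check, assuming the table above:
hive> SELECT column_1, column_2 FROM my_data LIMIT 10;   -- the .gz files remain untouched in hdfs:///my_data_folder_with_gzip_files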

HiveQL Where In Clause That Points to a Set of Files

I have a set of ~100 files each with 50k IDs in them. I want to be able to make a query against Hive that has a Where In clause using the IDs from these files. I could also do this directly from Groovy, but I'm thinking the code would be cleaner if I did all of the processing from Hive instead of referencing an external Set. Is this possible?
Create an external table describing the format of your files, and set the location to the HDFS path of a directory containing the files, e.g. for tab-delimited files:
create external table my_ids(
id bigint,
other_col string
)
row format delimited fields terminated by "\t"
stored as textfile
location 'hdfs://mydfs/data/myids';
Now you can use Hive to access this data, for example with an IN subquery or a semi join, as shown below.
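A sketch of the kind of query this enables, assuming a hypothetical fact table my_events with an id column:
-- Filter using an IN subquery (supported since Hive 0.13), with my_ids built from the ID files:
SELECT e.*
FROM my_events e
WHERE e.id IN (SELECT id FROM my_ids);

-- Equivalent formulation using a semi join:
SELECT e.*
FROM my_events e
LEFT SEMI JOIN my_ids i ON (e.id = i.id);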

HIVE script - Specify file name as S3 Location

I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "TableDynamoDB",
"dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt),
but the way it currently works is that the above script creates a folder named 'MyData.txt'
and then generates a file with some random name under that folder.
Is it at all possible to specify a file name in S3 using Hive?
Thank you!
A few things:
There are 2 different ways Hadoop can write data to S3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and Hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in Hadoop handle directories pretty seamlessly, so you shouldn't let it bug you too much. Also, you can use things like hadoop fs -getmerge on a directory to read all of the files in a single stream.
AFAIK, the LOCATION argument in the DDL for an external Hive table is always treated as a directory, for the reasons above.
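So a common workaround, sketched below, is to point the external table at a directory (the MyData/ folder name is hypothetical) and, if a single named file is really required, merge and rename the output afterwards outside of Hive:
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData/';   -- a directory, not a file

INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;

-- Afterwards, the part files written to the directory can be combined into one named file, e.g.:
--   hadoop fs -getmerge s3://myBucket/DataFiles/MyData/ MyData.txt
-- and the merged local file uploaded back to S3 under the desired key.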