Running HIVE queries directly from S3 input files - amazon-s3

I am using an Interactive Hive Session in Elastic MapReduce to run Hive. Previously I was loading data from S3 into Hive tables. Now I want to run some scripts on S3 input files without loading the data into Hive tables.
Is this possible? If yes, how can it be achieved?

You can run queries on data right in S3.
CREATE EXTERNAL TABLE mydata (key STRING, value INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\n'
LOCATION 's3n://mys3bucket/';
or similar
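Once the external table is defined you can query the files in place; a minimal sketch, assuming the table above (the aggregation is just illustrative):
-- Hive reads the files directly from S3 at query time; nothing is loaded
-- into a managed table first.
SELECT key, SUM(value) AS total
FROM mydata
GROUP BY key
LIMIT 10;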

Related

pyspark script to copy any parquet file data to any oracle table

We have one S3 bucket called Customers/.
Inside it we have multiple folders, and sub-folders inside those.
At the bottom level we have parquet files of data.
Now I want to read any parquet file (not one specific file) and load the data into Oracle.
For now my script works for one S3 path: it reads one parquet file, e.g. customer_info.parquet, and loads the data into an Oracle database table called customer.customer_info.
I need help generating a generic script that can read any parquet file and load the data into the corresponding database table.
For example:
S3 location: s3/Customers/new_customrers/new_customer_info.parquet
Oracle Database: Customer
Oracle table: new_customers
S3 location: s3/Customers/old_customrers/old_customer_info.parquet
Oracle Database: Customer
Oracle table: old_customers
S3 location: s3/Customers/current_customrers/current_customer_info.parquet
Oracle Database: Customer
Oracle table: current_customers
Is there any way to make this copy process generic? The database stays the same; only the Oracle table changes according to the parquet file.
My current script is a PySpark script that reads one S3 file's data into a Spark dataframe and writes that dataframe to one Oracle table.

How to read data partitions in S3 from Trino

I'm trying to read data partitions in S3 from Trino.
What I did exactly:
I uploaded my data with all partitions into S3. I have a specified Avro schema, which I put in the local file system.
Then I created an external Hive table pointing to the data location in S3 and to the Avro schema in the local file system.
The table is created.
Then, normally, I should be able to query my data and partitions in S3 from Trino.
trino> select * from hive.default.my_table;
It returns only the column names.
trino> select * from hive.default."my_table$partitions";
It returns only the names of the partitions.
Could you please suggest a solution for how I can read the data partitions in S3 from Trino?
Note that I'm using Apache Hive 2; even when I query the table in Hive to list the table partitions, it returns OK but displays nothing. I think that is because with Hive 2 we should use the MSCK command.
In Hive, uploading partition folders and files into S3 and creating the table is not enough; the partition metadata needs to be created as well. Folders can exist without being mounted as partitions. To mount all existing sub-folders in the table location as partitions:
Use msck repair table command:
MSCK [REPAIR] TABLE tablename;
or Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
This will create the partition metadata in the Hive metastore, and the partitions will become available.
Read more details about both commands here: RECOVER PARTITIONS
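For the table from the question that would look roughly like this, run in Hive (the table name is taken from the Trino query above):
-- Create partition metadata for every sub-folder under the table location,
-- then list the partitions the metastore now knows about.
MSCK REPAIR TABLE my_table;
SHOW PARTITIONS my_table;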
Faced the same issue. Once the table is created, we need to manually sync the partition metadata to the metastore using the following Trino command:
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html
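With the catalog, schema, and table names from the question filled in, the call would look something like this (the hive catalog name is assumed from the queries above):
-- 'ADD' registers partitions that exist in S3 but are missing from the metastore.
CALL hive.system.sync_partition_metadata('default', 'my_table', 'ADD');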

How can we load ORC data into Hive using the NiFi Hive streaming processor?

I have ORC files and their schema. I have tried loading these ORC files into a local Hive instance and it works fine. Now I will generate multiple ORC files and need to load them into a Hive table using the NiFi PutHiveStreaming processor. How can I do this?
PutHiveStreaming expects incoming flow files to be in Avro format. If you are using PutHive3Streaming you have more flexibility, but it doesn't accept flow files in ORC format either; instead, both of those processors convert the input into ORC and write it into a managed table in Hive.
If your files are already in ORC format, you can use PutHDFS to place them directly into HDFS. If you don't have permissions to write directly into a managed table location, you could write to a temporary location, create an external table on top of it, and then load from there into the managed table using something like INSERT INTO myTable SELECT * FROM externalTable.
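A rough sketch of that last approach in Hive; the staging path, columns, and table names here are illustrative, not taken from the question:
-- Illustrative only: an external table over ORC files already copied to a
-- staging directory with PutHDFS, then a copy into the managed table.
CREATE EXTERNAL TABLE staging_orc (id INT, name STRING)
STORED AS ORC
LOCATION '/tmp/staging_orc';

INSERT INTO TABLE my_managed_table
SELECT * FROM staging_orc;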

How to query data from a gz file in Amazon S3 using a Qubole Hive query?

I need to get specific data from a gz file.
How do I write the SQL?
Can I just query the file as if it were a database table, like this:
Select * from gz_File_Name where key = 'keyname' limit 10;
But it always comes back with an error.
You need to create a Hive external table over this file's location (folder) to be able to query it using Hive. Hive will recognize the gzip format. Like this:
create external table hive_schema.your_table (
col_one string,
col_two string
)
stored as textfile --specify your file type, or use serde
LOCATION
's3://your_s3_path_to_the_folder_where_the_file_is_located'
;
See the manual on Hive tables here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTableCreate/Drop/TruncateTable
To be precise, S3 under the hood does not store folders; file names containing slashes are presented by tools such as Hive as if they were a folder structure. See here: https://stackoverflow.com/a/42877381/2700344
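Once the external table exists, the kind of query from the question works against it; a minimal sketch, with col_one standing in for the question's key column:
-- Hive decompresses .gz text files transparently at read time, so the
-- gzipped data can be filtered in place. The column name is illustrative.
SELECT *
FROM hive_schema.your_table
WHERE col_one = 'keyname'
LIMIT 10;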

HIVE script - Specify file name as S3 Location

I am exporting data from DynamoDB to S3 using the following script:
CREATE EXTERNAL TABLE TableDynamoDB(col1 String, col2 String)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler' TBLPROPERTIES (
"dynamodb.table.name" = "TableDynamoDB",
"dynamodb.column.mapping" = "col1:col1,col2:col2"
);
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData.txt';
INSERT OVERWRITE TABLE TableS3
SELECT * FROM TableDynamoDB;
In S3, I want to write the output to a given file name (MyData.txt),
but the way it currently works is that the above script creates a folder named 'MyData.txt'
and then generates a file with a random name under that folder.
Is it at all possible to specify a file name in S3 using Hive?
Thank you!
A few things:
There are two different ways Hadoop can write data to S3. This wiki describes the differences in a little more detail. Since you are using the "s3" scheme, you are probably seeing a block number.
In general, M/R jobs (and Hive queries) are going to want to write their output to multiple files. This is an artifact of parallel processing. In practice, most commands/APIs in Hadoop handle directories pretty seamlessly, so you shouldn't let it bug you too much. Also, you can use something like hadoop fs -getmerge on a directory to read all of the files in a single stream.
AFAIK, the LOCATION argument in the DDL for an external Hive table is always treated as a directory, for the reasons above.
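In other words, pointing LOCATION at a prefix (directory) rather than at a file name is the expected pattern; here is a minimal variant of the DDL from the question, with the path adjusted to a directory:
-- Hive treats LOCATION as a directory; the query output files (with
-- generated names) are written underneath this prefix.
CREATE EXTERNAL TABLE TableS3(col1 String, col2 String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://myBucket/DataFiles/MyData/';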