How read data partitons in S3 from Trino - amazon-s3

I'm trying to read data partitons in S3 from Trino.
What I did exactly:
I uploaded my data with all partitions into S3. I have a specified avro schema, I put it in file local system.
Then I created an external hive table to point to the data location in S3 and to the avro schema in file local system.
Table is created.
Then, normaly I can query my data and partitions in S3 from Trino.
Trino>select * from hive.default.my_table;
It return only columns names.
trino>select * from hive.default."my_table$partitions";
it return only name of partitions.
Could you please suggest me a solution how can I read data partitons in S3 from Trino ?
Knowing that I'm using Apache Hive 2, even when I query the table in hive to return the table partitions, it return Ok, and display any thing. I think because Hive 2 we should use MSCK command

In Hive uploading partition folders and files into S3 and creating table is not enough, partition metadata should be created. Normally you can have folders not mounted as partitions. To mount all existing sub-folders in the table location as partitions:
Use msck repair table command:
MSCK [REPAIR] TABLE tablename;
or Amazon EMR version:
ALTER TABLE tablename RECOVER PARTITIONS;
It will create partition metadata in Hive metastore and partitions will become available.
Read more details about both commands here: RECOVER PARTITIONS

Faced the same issue. Once the table is created, we need to manually sync up the schema to the metastore using the below command of trino.
CALL system.sync_partition_metadata('<schema>', '<table>', 'ADD');
Ref.: https://trino.io/episodes/5.html

Related

Is it possible to delete records from Hive external table with AWS S3 bucket as location?

Is it possible to delete records from Hive external table with AWS S3 bucket as location using IICS.
For example : DELETE FROM MY_HIVE_TABLE WHERE COLUMN1='TEST1';
You can not delete from hive(unless they are kudu).
As per your comment, i think you can add a filter transformation right before target hive S3 table with condition COLUMN1<>'TEST1'
And then overwrite the hive target table in S3 using IICS.
This will overwrite same table with everything but COLUMN1=TEST1 i.e. deleting the data.

How Can I create a Hive Table on top of a Parquet File

Facing issue on creating hive table on top of parquet file. Can someone help me on the same.? I have read many articles and followed the guidelines but not able to load a parquet file in Hive Table.
According "Using Parquet Tables in Hive" it is often useful to create the table as an external table pointing to the location where the files will be created, if a table will be populated with data files generated outside of Hive.
hive> create external table parquet_table_name (<yourParquetDataStructure>)
STORED AS PARQUET
LOCATION '/<yourPath>/<yourParquetFile>';

How to make a table that is automatically updated Hive

I have created an external table that in Hive that uses data from a Parquet store in HDFS.
When the data in HDFS is deleted, there is no data in the table. When the data is inserted again in the same spot in HDFS, the table does not get updated to contain the new data. If I insert new records into the existing table that contains data, no new data is shown when I run my Hive queries.
How I create the table in Hive:
CREATE EXTERNAL TABLE nodes (id string) STORED AS PARQUET LOCATION "/hdfs/nodes";
The relevant error:
Error: java.io.FileNotFoundException: File does not exist: /hdfs/nodes/part-r-00038-2149d17d-f890-48bc-a9dd-5ea07b0ec590.gz.parquet
I have seen several posts that explain that external tables should have the most up to date data in them, such as here. However, this is not the case for me, and I don't know what is happening.
I inserted the same data into the database again, and queried the table. It contained the same amount of data as before. I then created an identical table with a different name. It had twice as much data in it, which was the right amount.
The issue might be with the metastore database. I am using PostgreSQL instead of Derby for the the database.
Relevant information:
Hive 0.13.0
Spark Streaming 1.4.1
PostgreSQL 9.3
CentOS 7
EDIT:
After examining the Parquet files, I found that the part files have seemingly incompatible file names.
-rw-r--r-- 3 hdfs hdfs 18702811 2015-08-27 08:22 /hdfs/nodes/part-r-00000-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18703029 2015-08-26 15:43 /hdfs/nodes/part-r-00000-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18724320 2015-08-27 08:22 /hdfs/nodes/part-r-00001-1670f7a9-9d7c-4206-84b5-e812d1d8fd9a.gz.parquet
-rw-r--r-- 3 hdfs hdfs 18723575 2015-08-26 15:43 /hdfs/nodes/part-r-00001-7251c663-f76e-4903-8c5d-e0c6f61e0192.gz.parquet
These files are the files that causes Hive to error when it can't find it in the error described above. This means that the external table is not acting dynamically, accepting any files in the directory (if you call it that in HDFS), but instead is probably just keeping track of the list of parquet files inside the directory when it was created.
Sample Spark code:
nodes.foreachRDD(rdd => {
if (!rdd.isEmpty())
sqlContext.createDataFrame(rdd.map(
n => Row(n.stuff), ParquetStore.nodeSchema)
.write.mode(SaveMode.Append).parquet(node_name)
})
Where the nodeSchema is the schema and node_name is "/hdfs/nodes"
See my other question about getting Hive external tables to detect new files.
In order to get Hive to update its tables, I had to resort to using the partitioning feature of Hive. By creating a new partition during each Spark run, I create a series of directories internal to the /hdfs/nodes directory like this:
/hdfs/nodes/timestamp=<a-timestamp>/<parquet-files>
/hdfs/nodes/timestamp=<a-different-timestamp>/<parquet-files>
Then, after each Spark job completes, I run the Hive command MSCK REPAIR TABLE nodes using a HiveContext in my Spark job, which finds new partitions and updates the table.
I realize this isn't automatic, but it at least works.
Ok, so probably you need to encapsulate the file in a folder. Hive external table must be mapped on a folder where there could be more than one file.
try to write the file to: /path/to/hdfs/nodes/file
and then map the external table to /path/to/hdfs/nodes
so in the folder nodes you will have only the parquet file and it should works

What will be DataSet size in hive

I have 1 TB data in my HDFS in .csv format. When I load it in my Hive table what will be the total size of data. I mean will there be 2 copies of same data i.e 1 Copy in HDFS and other in Hive table ? Plz clarify. Thanks in advance.
If you create a hive external table, you provide a HDFS location for the table and you store that data into that particular location.
When you create a hive internal table hive create a directory into /apps/hive/warehouse/ directory.
Say, your table name is table1 then your directory will be /apps/hive/warehouse/table1
This directory is also a HDFS directory and when you load data into the table into internal table it goes into its directory.
Hive creates a mapping between table and their corresponding HDFS location and hence when you read the data its reading from the corresponding mapped directory.
Hence there wont be duplicate copy of data corresponding to table and their HDFS location.
But if in your Hadoop cluster Data Replication factor is set to 3(default replication) then it will take 3TB cluster disk space(as you have 1TB data) but there wont be any effect of your hive table data.
Please see below link to know more about Data replication.
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
It depends whether you are creating an internal or external table in Hive.
If you create an external table in Hive, it will create a mapping on where your data is stored in HDFS and there won't be any duplication at all. Hive will automatically pick the data where ever it is stored in HDFS.
Read more about external tables here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables

Where does hive stores its table?

I am new to Hadoop and I just started working on Hive, I my understanding it provides a query language to process data in HDFS. With HiveQl we can create tables and load data into it from HDFS.
So my question is: where are those tables stored? Specifically if we have 100 GB file in our HDFS and we want to make a hive table out of that data what will be the size of that table and where is it stored?
If my understanding about this concept is wrong please correct me ..
If the table is 100GB you should consider an Hive External Table (as opposed to a "managed table", for the difference, see this).
With an external table the data itself will be still stored on the HDFS in the file path that you specify (note that you may specify a directory of files as long as they all have the same structure), but Hive will create a map of it in the meta-store whereas the managed table will store the data "in Hive".
When you drop a managed table, it drops the underlying data as opposed to dropping a hive external table which only drops the meta-data from the meta-store referencing that data.
Either way you are using only 100GB as viewed by the user and are taking advantage of the HDFS' robustness though duplication of the data.
Hive will create a directory on HDFS. If you didn't specify any location it will create a directory at /user/hive/warehouse on HDFS. After load command the files are moved to the /warehouse/tablename. You can also point to the HDFS directory if it contains partitions (if the files are partitioned), or use external table concept.