External table in Hive from two Parquet files

Question in short:
Is it possible to mount an external Hive table on top of multiple files with differing schemas, where the table exposes all, or a subset, of the columns found across those files?
I know the question is a bit complex; see the scenario in detail below.
I have two Parquet files with differing schemas in two different locations, partitioned by date in yyyymmdd format:
/app/data/source/file-1/20170501/file-1.pqt
/app/data/source/file-2/20170501/file-2.pqt
Let's assume the two files look like this (for the time being, both are in Parquet format).
File-1
ID|Name
1|My Zone
File-2
APP ID, APP Name, APP owner
1,My App, Manager-1
I want these mounted under a single external Hive table, so that when somebody issues select * from table, they get this result:
ID, Name, App Name, App Owner
1, My Zone, NULL, NULL
1, NULL, My APP, Manager-1
If this is not possible, what is the recommended approach?
By the way, please note that the two files come from entirely different sources; this is not a case of an evolved schema.
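One way to get a result shaped like the one above would be to mount each source as its own external table and pad the missing columns in a UNION ALL view. This is only a minimal sketch of that idea; the table names, view name, column names, and types below are guesses based on the sample data.

CREATE EXTERNAL TABLE file_1 (id INT, name STRING)
PARTITIONED BY (yyyymmdd STRING)
STORED AS PARQUET
LOCATION '/app/data/source/file-1';

-- The date directories are not in key=value form, so each partition's
-- location has to be registered explicitly.
ALTER TABLE file_1 ADD PARTITION (yyyymmdd = '20170501')
LOCATION '/app/data/source/file-1/20170501';

CREATE EXTERNAL TABLE file_2 (app_id INT, app_name STRING, app_owner STRING)
PARTITIONED BY (yyyymmdd STRING)
STORED AS PARQUET
LOCATION '/app/data/source/file-2';

ALTER TABLE file_2 ADD PARTITION (yyyymmdd = '20170501')
LOCATION '/app/data/source/file-2/20170501';

-- Pad the columns each source lacks with NULLs and union the two tables.
CREATE VIEW combined AS
SELECT id, name, CAST(NULL AS STRING) AS app_name, CAST(NULL AS STRING) AS app_owner
FROM file_1
UNION ALL
SELECT app_id AS id, CAST(NULL AS STRING) AS name, app_name, app_owner
FROM file_2;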

Related

Create a date-limited view on a hive table containing complex types in a way that is queryable with Impala?

I have a very large parquet table containing nested complex types such as structs and arrays. I have partitioned it by date and would like to restrict certain users to, say, the latest week of data.
The usual way of doing this would be to create a time-limited view on top of the table, e.g.:
CREATE VIEW time_limited_view
AS SELECT * FROM my_table
WHERE partition_date >= '2020-01-01'
This will work fine when querying the view in Hive. However, if I try to query this view from Impala, I get an error:
AnalysisException: Expr 'my_table.struct_column' in select list returns a complex type
The reason for this is that Impala does not allow complex types in the select list. Any view I build which selects the complex columns will cause errors like this. If I flatten/unnest the complex types, this would of course get around this issue. However due to the layers of nesting involved I would like to keep the table structure as is.
I see another suggested workaround has been to use Ranger row-level filtering but I do not have Ranger and will not be able to install it on the cluster. Any suggestions on Hive/Impala SQL workarounds would be appreciated
While working on a different problem I came across a kind of solution that fits my needs (but is by no means a general solution). I figured I'd post it in case anyone has similar needs.
Rather than using a view, I can simply use an external table. So firstly I would create a table in database_1 using Hive, which has a corresponding location, location_1, in hdfs. This is my "production" database/table which I use for ETL and contains a very large amount of data. Only certain users have access to this database.
CREATE TABLE database_1.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Next, I create a second, external table in the same location in hdfs. However this table is stored in a database with a much broader user group (database_2).
CREATE EXTERNAL TABLE database_2.tablename
(`col_1` BIGINT,
`col_2` array<STRUCT<X:INT, Y:STRING>>)
PARTITIONED BY (`date_col` STRING)
STORED AS PARQUET
LOCATION 'location_1';
Since this is an external table, I can add/drop date partitions at will without affecting the underlying data. I can add one week's worth of date partitions to the metastore and, as far as end users can tell, that's all that is available in the table. I can even make this part of my ETL job: each time new data is added, I add that partition to the external table and then drop the partition from a week ago, resulting in a rolling window of one week's data being made available to this user group without having to duplicate a load of data to a separate location.
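As a sketch of that ETL step (the partition values here are made up), each run adds the newest day to the external table and drops the oldest:

-- Expose the newly loaded day to the broader user group.
ALTER TABLE database_2.tablename ADD IF NOT EXISTS
PARTITION (date_col = '2020-01-08');

-- Retire the day that has fallen out of the one-week window.
-- The files in location_1 are left untouched because the table is external.
ALTER TABLE database_2.tablename DROP IF EXISTS
PARTITION (date_col = '2020-01-01');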
This is by no means a row-filtering solution, but is a handy way to use partitions to expose a subset of data to a broader user group without having to duplicate that data in a separate location.

How to relationalize, join and aggregate multiple files from S3

I have a bucket in S3 containing hundreds of folders, each of which contains files with the same structure: CSV representations of relational db tables. The folders differ in the content of the data, but overlaps might occur.
In each folder, I want to join 3 tables and store the output in a dedicated table. The dedicated table should eventually hold the joined data from all the folders. Duplicates might occur between folders, but the records have a unique key that can help with the aggregation.
The data for a specific folder, across all files, can reach 5 GB of disk space. Two of the files contain hundreds of thousands of records; the third can reach up to 20M records.
The result should be stored in AWS RDS, on a PostgreSQL instance. However, I am considering switching to Redshift. Would it be better for this scale of data?
The 3 tables are:
Providers
Consumers
Transactions
All of them are indexed by the key which is used in the join.
My approach is to iterate over the S3 bucket and, for each folder, load the 3 files into the db. Then I create the joined table for the 3 tables using SQL, and finally add the joined data to the aggregated table that should contain the data from all folders.
I am currently trying to handle one folder, in order to better understand how to make the process optimal in both time and space.
After loading, I noticed the db uses around 2x the disk space I expected. Why does joining cost so much disk space? Is there a way of loading and joining with minimal cost? The data loaded initially for each folder is used as a staging table until I drop the duplicates and load it into the aggregated table, so its lifespan will be relatively short. I tried using CREATE UNLOGGED TABLE but it didn't have much effect.
CREATE UNLOGGED TABLE agg_data AS SELECT * FROM
transactions t
INNER JOIN consumers c USING (consumer_id)
INNER JOIN providers p USING (provider_id);
This works OK for one folder, time-wise, but it takes a lot more disk space than I assumed it would.
How will this work at mass scale, for hundreds of folders? And how will the aggregation behave over time, as I will need to search for duplicate records in a continuously growing table?
To summarize my questions:
How should I choose between RDS and Redshift? My concerns are the tens of millions of records in the target table, and the need to drop duplicates while adding new data to it.
Why does joining data take so much db storage? Is there a way to minimize it for data that is temporary?
What is an efficient way of inserting new data into the destination table while dropping duplicates?
Would it be better to join and store the files in S3 using AWS Glue, and then load them into the target db? Currently that does not seem like an option, as Glue takes forever to join the data.
I would recommend using Amazon Athena to join the files and produce the desired output.
First, each directory needs to be recognised as a table. This can be done by manually running a CREATE EXTERNAL TABLE command in Athena and pointing at the folder. All files within the folder will be treated as containing data for the table and they should all be of the same format.
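As a sketch, assuming a plain CSV layout and made-up bucket path and column names, the table definition for one such folder might look like:

CREATE EXTERNAL TABLE providers (
  provider_id string,
  provider_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-bucket/providers/';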
If desired, an AWS Glue crawler can instead be used to create the table definition. Create a crawler and point it to the folder. Glue will create the table definition in the AWS Glue Data Catalog, which is accessible to Athena.
Once the three input tables have been defined, you can run a query in Amazon Athena that joins the three tables and produces an output table using CREATE TABLE AS.
See: Creating a Table from Query Results (CTAS) - Amazon Athena
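A minimal CTAS sketch for this case, assuming the three tables have been defined in Athena (the output location and the non-key column names are made up):

CREATE TABLE joined_output
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/output/joined/'
) AS
SELECT
  t.transaction_id,
  t.consumer_id,
  t.provider_id,
  c.consumer_name,
  p.provider_name
FROM transactions t
JOIN consumers c ON c.consumer_id = t.consumer_id
JOIN providers p ON p.provider_id = t.provider_id;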
Glue can also be used to run ETL scripts (see Program AWS Glue ETL Scripts in Python - AWS Glue), but I haven't tried this so I can't offer advice on it. However, I have used AWS Glue crawlers to create tables that I then query via Amazon Athena.
Once you have the output data, you can then load it into the database of your choice. Which database you choose depends upon your use case. I would suggest starting with Amazon RDS for PostgreSQL, since it is a traditional database and you seem to be comfortable with it. If you later need improved performance (e.g. billions of rows instead of millions), you can move to Amazon Redshift.
General comment: It is rather strange that you wish to join those 3 tables since there will presumably be a lot of duplicated data (very denormalized). You could instead simply load those tables into your desired database and then do the joins in the database, possibly being selective as to which columns you wish to include.
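For example, if you load the three raw tables into PostgreSQL and join there as suggested, the unique key mentioned in the question can drive deduplication at insert time. A sketch, with every table and column name assumed:

-- Target table keyed on the record's unique id (name assumed).
CREATE TABLE IF NOT EXISTS agg_data (
    transaction_id bigint PRIMARY KEY,
    consumer_id    bigint,
    provider_id    bigint,
    consumer_name  text,
    provider_name  text
);

-- Select only the columns you need; rows already present (duplicates
-- from other folders) are skipped via the unique key.
INSERT INTO agg_data
SELECT t.transaction_id, t.consumer_id, t.provider_id, c.consumer_name, p.provider_name
FROM transactions t
JOIN consumers c USING (consumer_id)
JOIN providers p USING (provider_id)
ON CONFLICT (transaction_id) DO NOTHING;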

Partition Athena query by S3 created date

I have an S3 bucket with ~70 million JSONs (~15 TB) and an Athena table to query by timestamp and some other keys defined in the JSON.
It is guaranteed that the timestamp in the JSON is more or less equal to the S3 createdDate of the JSON (or at least equal enough for the purpose of my query).
Can I somehow improve query performance (and cost) by adding the createdDate as something like a "partition", which I understand is only possible for prefixes/folders?
edit:
I currently simulate that by using the S3 inventory CSV to pre-filter by createdDate and then downloading all the JSONs and doing the rest of the filtering, but I'd like to do that completely inside Athena, if possible.
There is no way to make Athena use things like S3 object metadata for query planning. The only way to make Athena skip reading objects is to organize the objects in a way that makes it possible to set up a partitioned table, and then query with filters on the partition keys.
It sounds like you have an idea of how partitioning in Athena works, and I assume there is a reason that you are not using it. However, for the benefit of others with similar problems coming across this question, I'll start by explaining what you can do if you can change the way the objects are organized. I'll give an alternative suggestion at the end; you may want to jump straight to that.
I would suggest you organize the JSON objects using prefixes that contain some part of the timestamps of the objects. Exactly how much depends on the way you query the data: you don't want it too granular, nor too coarse. Making it too granular will make Athena spend more time listing files on S3; making it too coarse will make it read too many files. If the most common time period of queries is a month, that is a good granularity; if the most common period is a couple of days, then day is probably better.
For example, if day is the best granularity for your dataset you could organize the objects using keys like this:
s3://some-bucket/data/2019-03-07/object0.json
s3://some-bucket/data/2019-03-07/object1.json
s3://some-bucket/data/2019-03-08/object0.json
s3://some-bucket/data/2019-03-08/object1.json
s3://some-bucket/data/2019-03-08/object2.json
You can also use a Hive-style partitioning scheme, which is what other tools like Glue, Spark, and Hive expect, so unless you have reasons not to it can save you grief in the future:
s3://some-bucket/data/created_date=2019-03-07/object0.json
s3://some-bucket/data/created_date=2019-03-07/object1.json
s3://some-bucket/data/created_date=2019-03-08/object0.json
I chose the name created_date here; I don't know what would be a good name for your data. You can use just date, but remember to always quote it (and quote it in different ways in DML and DDL…) since it's a reserved word.
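For example, if the column were named date instead of created_date, the quoting would differ between the Hive-flavoured DDL and the Presto-flavoured queries:

-- In DDL, escape the reserved word with backticks:
--   PARTITIONED BY (`date` date)
-- In queries, escape it with double quotes:
SELECT COUNT(*) FROM my_data WHERE "date" = DATE '2019-03-08'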
Then you create a partitioned table:
CREATE EXTERNAL TABLE my_data (
column0 string,
column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')
Some guides will then tell you to run MSCK REPAIR TABLE to load the partitions for the table. If you use Hive-style partitioning (i.e. …/created_date=2019-03-08/…) you can do this, but it will take a long time and I wouldn't recommend it. You can do a much better job of it by manually adding the partitions, which you do like this:
ALTER TABLE my_data ADD
PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'
Finally, when you query the table make sure to include the created_date column to give Athena the information it needs to read only the objects that are relevant for the query:
SELECT COUNT(*)
FROM my_data
WHERE created_date >= DATE '2019-03-07'
You can verify that the query will be cheaper by observing the difference in the data scanned when you change from for example created_date >= DATE '2019-03-07' to created_date = DATE '2019-03-07'.
If you are not able to change the way the objects are organized on S3, there is a poorly documented feature that makes it possible to create a partitioned table even when you can't change the data objects. What you do is you create the same prefixes as I suggest above, but instead of moving the JSON objects into this structure you put a file called symlink.txt in each partition's prefix:
s3://some-bucket/data/created_date=2019-03-07/symlink.txt
s3://some-bucket/data/created_date=2019-03-08/symlink.txt
In each symlink.txt you put the full S3 URI of the files that you want to include in that partition. For example, in the first file you could put:
s3://data-bucket/data/object0.json
s3://data-bucket/data/object1.json
and the second file:
s3://data-bucket/data/object2.json
s3://data-bucket/data/object3.json
s3://data-bucket/data/object4.json
Then you create a table that looks very similar to the table above, but with one small difference:
CREATE EXTERNAL TABLE my_data (
column0 string,
column1 int
)
PARTITIONED BY (created_date date)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://some-bucket/data/'
TBLPROPERTIES ('has_encrypted_data'='false')
Notice the value of the INPUTFORMAT property.
You add partitions just like you do for any partitioned table:
ALTER TABLE my_data ADD
PARTITION (created_date = '2019-03-07') LOCATION 's3://some-bucket/data/created_date=2019-03-07/'
PARTITION (created_date = '2019-03-08') LOCATION 's3://some-bucket/data/created_date=2019-03-08/'
The only Athena-related documentation of this feature that I have come across is in the S3 Inventory docs for integrating with Athena.
I started working with Theo's answer and it was very close (thank you, Theo, for the excellent and very detailed response), but according to the documentation, when adding multiple partitions you only need to specify "ADD" once, near the beginning of the query.
I tried specifying "ADD" on each line per Theo's example but received an error. It works when only specified once, though. Below is the format I used which was successful:
ALTER TABLE db.table_name ADD IF NOT EXISTS
PARTITION (event_date = '2019-03-01') LOCATION 's3://bucket-name/2019-03-01/'
PARTITION (event_date = '2019-03-02') LOCATION 's3://bucket-name/2019-03-02/'
PARTITION (event_date = '2019-03-03') LOCATION 's3://bucket-name/2019-03-03/'
...

Regarding HIVE Table Storage

Can anyone please help me understanding the below point.
I have created one Hive table which is not a partitioned table, but I am working on a 10-node cluster. In this case, will the data of that table (it is a large table) be spread across different data nodes, or will it live on only one node?
If it is spread across different data nodes, then how can we see only one file under the /hive/warehouse folder?
Also, please give a little idea of how storage is allocated for a partitioned table.
The data for the table and the metadata of the table are different things.
The data for the table, which is basically just one or more files in HDFS, will be stored as per HDFS rules: based on your configuration, a file will be split into n blocks and distributed across the datanodes.
In your case, the data for one Hive table (a file, or any number of files) will be distributed among all 10 nodes in the cluster.
Also, this distribution is done at the block level and is not visible at the user level.
You can easily check the number of blocks created for the file in the Web UI.
A partitioned table just adds another level of directories inside the table directory in HDFS, so it follows the same rules.
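A small sketch (table and column names made up) of how that maps onto HDFS:

CREATE TABLE sales (item STRING, amount DOUBLE)
PARTITIONED BY (sale_date STRING);
-- Each partition becomes a subdirectory under the table directory, e.g.
--   /user/hive/warehouse/sales/sale_date=2017-05-01/
--   /user/hive/warehouse/sales/sale_date=2017-05-02/
-- Every file in those subdirectories is still split into HDFS blocks
-- and distributed across the datanodes, just like a non-partitioned table.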

When creating an external table in hive can I point the location to specific files in a directory?

I have defined a table as such:
create external table PageViews (Userid string, Page_View string)
partitioned by (ds string)
row format delimited fields terminated by ','
stored as textfile location '/user/data';
I do not want all the files in the /user/data directory to be used as part of the table. Is it possible for me to do the following?
location 'user/data/*.csv'
What kmosley said is true. As of now, you can't selectively choose certain files to be a part of your Hive table. However, there are 2 ways to get around it.
Option 1:
You can move all the csv files into another HDFS directory and create a Hive table on top of that. If it works better for you, you can create a subdirectory (say, csv) within your present directory that houses all CSV files. You can then create a Hive table on top of this subdirectory. Keep in mind that any Hive tables created on top of the parent directory will NOT contain the data from the subdirectory.
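A sketch of Option 1, assuming the CSV files have been moved into a csv subdirectory of /user/data:

-- The table now only sees files under /user/data/csv.
create external table PageViews (Userid string, Page_View string)
partitioned by (ds string)
row format delimited fields terminated by ','
stored as textfile location '/user/data/csv';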
Option 2:
You can change your queries to make use of a virtual column called INPUT__FILE__NAME.
Your query would look something like:
SELECT
*
FROM
my_table
WHERE
INPUT__FILE__NAME LIKE '%csv';
The downside of this approach is that the Hive query will have to churn through all the data in the directory even though you only care about specific files. The query won't skip files based on the INPUT__FILE__NAME predicate; it will just filter out the records that don't match the predicate during the map phase (consequently discarding all records from particular files), but the mappers will still run over the unnecessary files as well. It will give you the correct result, with some, probably minor, performance overhead.
The benefit of this approach is that you can use the same Hive table when you have multiple files in your table and want some queries to read all files from the table (or a partition) and other queries to read only a subset of them. You can make use of the INPUT__FILE__NAME virtual column to achieve that. As an example:
if a partition in your HDFS directory /user/hive/warehouse/web_logs/ looked like:
/user/hive/warehouse/web_logs/dt=2012-06-30/
/user/hive/warehouse/web_logs/dt=2012-06-30/00.log
/user/hive/warehouse/web_logs/dt=2012-06-30/01.log
.
.
.
/user/hive/warehouse/web_logs/dt=2012-06-30/23.log
Let's say your table definition looked like:
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_table (col1 STRING)
PARTITIONED BY (dt STRING)
LOCATION '/user/hive/warehouse/web_logs';
After adding the appropriate partitions, you could query all logs in the partition using a query like:
SELECT
*
FROM
web_logs_table w
WHERE
dt='2012-06-30';
However, if you only cared about the logs from the first hour of the day, you could query the logs for the first hour using a query like:
SELECT
*
FROM
web_logs_table w
WHERE
dt ='2012-06-30'
AND INPUT__FILE__NAME LIKE '%00.log';
Another similar use case could be a directory that contains web logs from different domains and various queries need to analyze logs on different sets of domains. The queries can filter out domains using the INPUT__FILE__NAME virtual column.
In both the above use-cases, having a sub partition for hour or domain would solve the problem as well, without having to use the virtual column. However, there might exist some design trade-offs that require you to not create sub-partitions. In that case, arguably, using INPUT__FILE__NAME virtual column is your best bet.
Deciding between the 2 options:
It really depends on your use case. If you will never care about the files you are trying to exclude from the Hive table, using Option 2 is probably overkill; you should fix up the directory structure and create a Hive table on top of a directory containing only the files you care about.
If the files you are presently excluding follow the same format as the other files (so they can all be part of the same Hive table) and you could see yourself writing a query that would analyze all the data in the directory, then go with Option 2.
I came across this thread when I had a similar problem to solve. I was able to resolve it by using a custom SerDe, adding SerDe properties that specify which regex to apply to the file name patterns for any particular table.
A custom SerDe might seem like overkill if you are only dealing with standard CSV files; I had a more complex file format to deal with. Still, this is a very viable solution if you don't shy away from writing some Java. It is particularly useful when you are unable to restructure the data in your storage location and you are looking for a very specific file pattern among a disproportionately large file set.
CREATE EXTERNAL TABLE PageViews (Userid string, Page_View string)
ROW FORMAT SERDE 'com.something.MySimpleSerDe'
WITH SERDEPROPERTIES ("input.regex" = ".*\\.csv")
LOCATION '/user/data';
No you cannot currently do that. There is a JIRA ticket open to allow regex selection of included files for Hive tables (https://issues.apache.org/jira/browse/HIVE-951).
For now your best bet is to create a table over a different directory and just copy in the files you want to query.