I am new to Hive. I want to compare two tables and their data, and get the rows that do not match. In SQL we can do this using MINUS, but it is not supported in Hive. So I want to know the different ways that will get my job done.
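For context, a common workaround is a left join with a NULL filter on the right-hand side. A minimal sketch, assuming two hypothetical tables t1 and t2 with columns id and val:
-- rows in t1 with no matching row in t2 (emulates t1 MINUS t2)
select t1.id, t1.val
from t1
left join t2
  on t1.id = t2.id and t1.val = t2.val
where t2.id is null;
Note that NULLs never compare equal in the join condition, so rows containing NULLs need extra care.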
Related
I want to join two large tables with many columns using Presto SQL syntax in AWS Athena. My code is pretty simple:
select *
from TableA as A
left join TableB as B
  on A.key_id = B.key_id;
After joining, the primary key column (key_id) appears twice. Both tables have more than 100 columns, and the join takes very long. How can I fix it so that the key_id column does not appear twice in the final result?
P.S. AWS Athena does not support the EXCEPT column-exclusion syntax (SELECT * EXCEPT(...)), unlike Google BigQuery.
This would be a nice feature, but it is not part of standard SQL. In standard SQL, the EXCEPT keyword is a set operation that filters rows, not columns.
In Athena, as with standard SQL, you will have to specify the columns you want to include. The argument for this is that it is lower maintenance; in fact, best practice is to always state the columns you want explicitly and never leave it as "whatever columns exist". This helps ensure your queries don't change behaviour if/when your table structure changes.
Some SQL dialects have features like this; I understand Oracle does too. But to my knowledge Athena (/ PrestoSQL / Trino) does not.
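As an illustration, a minimal sketch that lists the columns explicitly so key_id appears only once (the non-key column names here are hypothetical):
select A.key_id, A.col_a1, A.col_a2, B.col_b1
from TableA as A
left join TableB as B
  on A.key_id = B.key_id;
Depending on the engine version, "left join TableB using (key_id)" may also return the key column only once with "select *", but verify that behaviour on your Athena engine before relying on it.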
I have two pass-through queries coming from two different databases. The data structure of the databases is identical, and the layout of both queries is similar. How do I combine the results of the two queries into one table?
I do understand that this should be some form of a UNION. However, in MS Access I only know how to union two local tables. So a potential solution would be to first convert the results of the respective pass-through queries into local tables using a macro, and then do a union from there. However, this being my first time working with pass-through queries, I am not even sure how to convert the result of a pass-through query into a local table. I am more used to working with standard linked tables. I am also not sure whether this solution would be the most elegant.
Any assistance will be greatly appreciated.
AFAIK, once you have saved your two pass-through queries, you can write a union just as if they were local tables; a minimal sketch follows the steps below. However, the performance will probably be terrible, just as with any heterogeneous data sources.
Depending on the use case (especially if you need to read that union many times), you might rather:
1. build (or empty) a local table, or create it using a 'make table query'
2. append the data from your first PTQ into the local table
3. append the data from your second PTQ into the local table
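As an illustration, a minimal sketch assuming the two saved pass-through queries are named ptqServerA and ptqServerB (hypothetical names); the first statement is the direct union, and the append queries mirror steps 2 and 3:
-- direct union of the two saved pass-through queries
SELECT * FROM ptqServerA
UNION ALL
SELECT * FROM ptqServerB;
-- or append each result into a local table tblCombined
INSERT INTO tblCombined SELECT * FROM ptqServerA;
INSERT INTO tblCombined SELECT * FROM ptqServerB;
UNION ALL keeps duplicate rows; use UNION if duplicates should be removed.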
I have a CSV file.
It has 5 columns, 4000 rows.
The database will start with a single table, and each year I will add a new table to the database.
The tables themselves will never be updated; each will be created only once.
I expect many concurrent reads and queries at the same time.
There won't be any complex queries. Queries will be basically filtering on only one column.
The users will use sorting on one column.
Based on this, my gut feeling tells me that I should use a SQL solution, like MySQL or PostgreSQL. I am wondering what your thoughts are: should I use SQL, NoSQL, or something else (Redis, maybe)?
In my opinion, MySQL, provided you have enough DB storage.
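For a workload like this (filter on one column, sort on another), a minimal MySQL sketch; the table and column names are hypothetical:
-- one table per year; indexes support the single-column filter and sort
CREATE TABLE records_2024 (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    category  VARCHAR(50),    -- the column users filter on
    score     INT,            -- the column users sort on
    note_a    VARCHAR(100),
    note_b    VARCHAR(100),
    INDEX idx_category (category),
    INDEX idx_score (score)
);
SELECT * FROM records_2024 WHERE category = 'example' ORDER BY score;
At 4000 rows per table, any mainstream relational database will handle many concurrent reads without trouble.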
Hi, I recently joined a new job that uses Hive and PostgreSQL. The existing ETL scripts gather data from Hive, partitioned by date, and create tables for those data in PostgreSQL; the PostgreSQL scripts/queries then perform left joins and create the final table for reporting purposes. I have heard in the past that Hive joins are not a good idea. However, I noticed that Hive does allow joins, so I'm not sure why it would be a bad idea.
I wanted to use something like Talend or MuleSoft to create joins and do aggregations within Hive, create a temporary table, and transfer that temporary table as the final table to PostgreSQL for reporting.
Any suggestions are welcome, especially if this is not good practice with Hive. I'm new to Hive.
Thanks.
The major issue with joining in Hive has to do with data locality.
Hive queries are executed as MapReduce jobs, and several mappers will launch, as much as possible, on the nodes where the data lies.
However, when joining tables, the rows from the left-hand and right-hand tables will not, in general, be on the same node, which may cause a significant amount of network traffic between nodes.
Joining in Hive is not bad per se, but if the two tables being joined are large, it may result in slow jobs.
If one of the tables is significantly smaller than the other, you may want to store it in the HDFS cache, making its data available on every node, which allows the join algorithm to retrieve all data locally.
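A minimal sketch of that small-table approach, using Hive's map join hint (the table and column names are hypothetical):
-- broadcast the small dimension table to every mapper
SELECT /*+ MAPJOIN(d) */ f.order_id, d.region_name
FROM fact_orders f
JOIN dim_region d
  ON f.region_id = d.region_id;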
So, there's nothing wrong with running large joins in Hive, you just need to be aware they need their time to finish.
Hive is growing in maturity
It is possible that arguments against using joins no longer apply to recent versions of Hive.
The clearest example I found is in the manual section on join optimization:
The MAPJOIN implementation prior to Hive 0.11 has these limitations:
The mapjoin operator can only handle one key at a time
Therefore I would recommend asking what the foundation of their reluctance is, and then checking carefully whether it still applies. Their arguments may well still be valid, or might have been resolved.
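If map joins come up in that discussion, note that recent Hive versions can convert qualifying joins to map joins automatically; a minimal configuration sketch (the size threshold shown is the common default, not a recommendation):
-- let Hive convert joins against a small enough table into map joins
SET hive.auto.convert.join=true;
-- tables below this size (in bytes) count as 'small'
SET hive.mapjoin.smalltable.filesize=25000000;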
Sidenote:
Personally, I find Pig code much easier to re-use and maintain than Hive; consider using Pig rather than Hive to do map-reduce operations on your (Hive table) data.
It's perfectly fine to do joins in Hive. I am an ETL tester and have performed left joins on big tables in Hive; most of the time the queries run smoothly, but sometimes the jobs do get stuck or are slow due to network traffic.
It also depends on the number of nodes in the cluster.
Thanks
I recently came across Apache Kylin and was curious what its use cases are. From what I can tell, it seems to be a tool designed to solve very specific problems at the scale of 10+ billion rows: aggregating, caching, and querying data from other sources (HBase, Hadoop, Hive). Am I correct in this assumption?
Apache Kylin's use case is interactive big data analysis on Hadoop. It lets you query big Hive tables at sub-second latency in 3 simple steps.
1. Identify a set of Hive tables in a star schema.
2. Build a cube from the Hive tables in an offline batch process.
3. Query the Hive tables using SQL and get results in sub-seconds, via REST API, ODBC, or JDBC.
The use case is quite general: Kylin can quickly query any Hive tables, as long as you can define a star schema and model cubes from the tables. Check out the Kylin terminology if you are not sure what a star schema or a cube is.
Kylin provides an ANSI SQL interface, so you can query the Hive tables pretty much the same way you used to. One limitation, however, is that Kylin provides only aggregated results; in other words, the SQL should contain a "group by" clause to yield correct results. This is usually fine, because big data analysis focuses more on aggregated results than on individual records.
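As an illustration, a minimal sketch of the kind of aggregate query Kylin serves (the table and column names are hypothetical):
-- an aggregate over a star schema; the GROUP BY is what Kylin requires
SELECT d.year, SUM(f.sale_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d
  ON f.date_id = d.date_id
GROUP BY d.year;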