Inequality Join in Hive and Pig

I am starting a project where I need to perform a non-equality (inequality) join.
Now, I've read that neither Pig nor Hive supports inequality joins directly.
I have also read that Pig can emulate one by using CROSS followed by FILTER.
Could I do the same in Hive using a WHERE clause? Are there any cases where it is not possible?
Finally, supposing that I can do this in both Pig and Hive, which would be better in terms of performance?

I remember that Hive can only use one reducer to perform a CROSS. Pig uses a smarter approach to implement CROSS and runs it in parallel, so it usually performs better than Hive here.
BTW, I have not updated my knowledge of Hive and Pig for about a year, so I'm not sure whether Hive has improved its CROSS in the meantime.
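To make the Hive side concrete, the CROSS + FILTER pattern from Pig maps onto a cross join plus a WHERE predicate in HiveQL. A minimal sketch (table and column names here are made up for illustration):

```sql
-- Emulating an inequality join in Hive: pair every row of t1 with every
-- row of t2 via a cross join, then filter the pairs on the non-equality
-- condition. Note the cross product can be huge, so this is expensive
-- on large tables.
SELECT a.id, b.id
FROM t1 a
CROSS JOIN t2 b
WHERE a.start_ts < b.end_ts;
```

The WHERE clause can hold any inequality predicate; the cost concern is that the intermediate cross product grows as |t1| × |t2| before filtering.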

Related

Performance-wise, should calculations/joins/conditional logic/aggregate functions, etc., be done in the source qualifier or in transformations?

I've been trying to research how to optimize my Informatica mapping's performance. I am trying to figure out whether I should have my source qualifier do as much of the work as possible (calculations, joining tables, conditional logic), or whether my mappings will perform better using transformations instead: an Expression transformation for calculations and other manipulations, a Joiner transformation for joining tables, and so on. I am also fairly new to Informatica.
There is no right answer to your question. It depends on the power of your DBMS v. the power of your Informatica environment and how efficiently your DBMS performs each type of transformation compared to your Informatica environment.
The only way is to try it and see - on your specific environment.
I think @NickW gave the best answer, but here are my two cents.
If your mapping has simple transformations like
filter,
aggregation,
case-when/if-else logic,
joins of 5–10 tables, or lookups on a few tables in the same DB,
you can use SQL. Normally the DB can handle these kinds of operations better than Informatica. It also means less data is transferred between the DB and Informatica, so we can assume it will be faster.
Factors to consider: if you have a large table with no index and the SQL itself takes hours to return data, you can think about using Informatica instead. In such a scenario, first run the query in the DB and see whether you can improve its performance there.
Please note that Informatica fetches the physical data onto the Informatica server, so for a very large table this can be a problem for Informatica as well.
Now, if you have
complex mapping logic,
joins across different DBs or lookups from a different DB, or
File/XML/COBOL/XL sources,
you need to use Informatica.
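As an illustration of pushing simple logic into the source qualifier, a SQL override along these lines (table and column names are hypothetical) can replace separate Filter, Joiner, and Aggregator transformations:

```sql
-- Hypothetical source qualifier SQL override: the DB does the join,
-- filter, and aggregation, so Informatica only moves the reduced result
-- set instead of both full tables.
SELECT o.customer_id,
       COUNT(*)      AS order_cnt,
       SUM(o.amount) AS total_amount
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id
WHERE  o.order_date >= DATE '2020-01-01'
  AND  c.status = 'ACTIVE'
GROUP BY o.customer_id;
```

The win comes from the reduced row count crossing the network; whether it actually beats the equivalent Informatica transformations still has to be measured in your environment, as the answers above note.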

Hive count distinct UDAF

I ran into a Hive query calculating a count distinct without grouping, which runs very slowly. So I was wondering how this functionality is implemented in Hive: is there a UDAFCountDistinct for this?
Hive 1.2.0+ provides auto-rewrite optimization for count(distinct). Check this setting:
hive.optimize.distinct.rewrite=true;
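The rewrite that this optimization applies can also be written by hand. The subquery form spreads the deduplication across multiple reducers instead of funneling every row through one (table and column names here are illustrative):

```sql
-- Slow form: all rows are shipped to a single reducer to deduplicate.
SELECT COUNT(DISTINCT user_id) FROM events;

-- Equivalent rewritten form: the DISTINCT is computed in parallel
-- (effectively a group-by on user_id), and only the final count runs
-- through a single reducer.
SELECT COUNT(*)
FROM (SELECT DISTINCT user_id FROM events) t;
```

On Hive versions before the auto-rewrite, applying this transformation manually is the usual fix for a slow ungrouped count distinct.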

How many maximum number of subqueries can you put in a single hive query?

I am working in SQL. Hive, to be particular.
There is no such limit; however, you can refer to the documentation for the kinds of subqueries that are supported.
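For illustration, Hive accepts nested subqueries in the FROM clause as long as each one is given an alias (names below are made up):

```sql
-- Each FROM-clause subquery in Hive must carry an alias (t1, t2 here).
SELECT t2.dept, t2.avg_sal
FROM (
    SELECT dept, AVG(salary) AS avg_sal
    FROM (
        SELECT dept, salary
        FROM employees
        WHERE active = true
    ) t1
    GROUP BY dept
) t2
WHERE t2.avg_sal > 50000;
```

Depth is limited in practice by planner memory and query readability rather than by a documented hard cap.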

Is it bad to do joins in Hive?

Hi, I recently joined a new job that uses Hive and PostgreSQL. The existing ETL scripts gather data from Hive, partitioned by dates, and create tables for that data in PostgreSQL; PostgreSQL scripts/queries then perform left joins and create the final table for reporting purposes. I have heard in the past that Hive joins are not a good idea. However, I noticed that Hive does allow joins, so I'm not sure why it would be a bad idea.
I wanted to use something like Talend or MuleSoft to create joins and do aggregations within Hive, create a temporary table, and transfer that temporary table to PostgreSQL as the final table for reporting.
Any suggestions are welcome, especially if this is not good practice with Hive. I'm new to Hive.
Thanks.
The major issue with joining in hive has to do with data locality.
Hive queries are executed as MapReduce jobs, and several mappers will launch, as far as possible, on the nodes where the data lies.
However, when joining tables, the matching rows from the LHS and RHS tables will not, in general, be on the same node, which may cause a significant amount of network traffic between nodes.
Joining in Hive is not bad per se, but if the two tables being joined are large, it may result in slow jobs.
If one of the tables is significantly smaller than the other you may want to store it in HDFS cache, making its data available in every node, which allows the join algorithm to retrieve all data locally.
So, there's nothing wrong with running large joins in Hive, you just need to be aware they need their time to finish.
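The small-table case described above is Hive's map join (broadcast join). It can be enabled automatically via configuration or requested with a query hint; a sketch, with hypothetical table names and an example size threshold:

```sql
-- Let Hive convert joins to map joins automatically when one side is
-- below the small-table size threshold.
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 25000000;  -- ~25 MB threshold

-- Or request it explicitly with a hint: small_dim is broadcast to every
-- mapper, so the join happens map-side with no shuffle of big_fact.
SELECT /*+ MAPJOIN(small_dim) */ f.id, d.name
FROM big_fact f
JOIN small_dim d ON f.dim_id = d.id;
```

With the small table replicated to every node, each mapper joins its local slice of the large table without the reduce-phase network traffic described above.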
Hive is growing in maturity
It is possible that arguments against using joins, no longer apply for recent versions of hive.
The most clear example I found in the manual section on join optimization:
The MAPJOIN implementation prior to Hive 0.11 has these limitations:
The mapjoin operator can only handle one key at a time
Therefore I would recommend asking what the foundation of their reluctance is, and then checking carefully whether it still applies. Their arguments may well still be valid, or might have been resolved.
Sidenote:
Personally I find Pig code much easier to re-use and maintain than hive, consider using Pig rather than hive to do map-reduce operations on your (hive table) data.
It's perfectly fine to do joins in Hive. I am an ETL tester and have performed left joins on big tables in Hive; most of the time the queries run smoothly, but sometimes the jobs get stuck or run slowly due to network traffic.
It also depends on the number of nodes the cluster has.
Thanks

What are Apache Kylin Use Cases?

I recently came across Apache Kylin and was curious what its use cases are. From what I can tell, it seems to be a tool designed to solve very specific problems involving upwards of 10+ billion rows: aggregating, caching, and querying data from other sources (HBase, Hadoop, Hive). Am I correct in this assumption?
Apache Kylin's use case is interactive big data analysis on Hadoop. It lets you query big Hive tables at sub-second latency in 3 simple steps.
Identify a set of Hive tables in star schema.
Build a cube from the Hive tables in an offline batch process.
Query the Hive tables using SQL and get results in sub-seconds, via REST API, ODBC, or JDBC.
The use case is pretty general: Kylin can quickly query any Hive tables as long as you can define a star schema and model cubes from the tables. Check out the Kylin terminology if you are not sure what a star schema or a cube is.
Kylin provides an ANSI SQL interface, so you can query the Hive tables pretty much the same way you are used to. One limitation, however, is that Kylin returns only aggregated results; in other words, the SQL should contain a GROUP BY clause to yield a correct result. This is usually fine, because big data analysis focuses more on aggregated results than on individual records.
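For example, a query of this shape (the star-schema tables are hypothetical) is what Kylin answers from the pre-built cube; note the required GROUP BY:

```sql
-- Aggregate query over a star schema: a fact table joined to dimension
-- tables, grouped on dimension attributes. Kylin serves this from the
-- cube built in step 2 rather than scanning the raw Hive tables.
SELECT d.cal_year, p.category, SUM(f.sales_amount) AS total_sales
FROM   fact_sales f
JOIN   dim_date    d ON f.date_id    = d.id
JOIN   dim_product p ON f.product_id = p.id
GROUP BY d.cal_year, p.category;
```

A query selecting individual fact rows (no GROUP BY) would fall outside what the cube pre-computes, which is the limitation mentioned above.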