Hive: Can't select one random match on right table in left outer join - hive

EDIT - I don't care about the skew or things being slow. I found out that the slowness was mostly caused by a many-to-many join on keys with many matches in my left outer join... Please skip down to the bottom.
I have a skewed table, that is, some join keys appear in far more rows than others. My problem is that more than one key appears a huge number of times.
Stats on this table and table I am joining with:
Larger table: totalSize=47431500000, numRows=509500000, rawDataSize=47022050000, and 21052 distinct keys
Smaller table: totalSize=1154984612, numRows=13780692, rawDataSize=1141203920, and 39313 distinct keys
The smaller table also has repeated rows of keys. The other challenge is that I need to randomly select a matching key from the smaller table.
What I have tried so far:
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=1155mb;
and
CREATE TABLE joined_table AS SELECT * FROM (
select * from larger_table as a
LEFT OUTER JOIN smaller_table as b
on a.key = b.key
ORDER BY RAND()
)b;
It has been running for a day now.
I thought about handling this manually but, since I have more than one key with a huge number of rows, I would have to make a bunch of tables and merge them. Which I can do if that is my only option :O
But I wanted to reach out to you all on SO.
Thanks for the help in advance
EDIT June 20th
I found to try:
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 200000;
I had already created a few separate tables to separate and join up the highest-appearing keys, such that the highest-appearing key in the rest was now 200k. Running the query to join the rest took 25 minutes and finished all tasks successfully, according to the job tracker on the web interface. But on the command line the Hive shell is still sitting there, and when I go to check, the table does not exist.
EDIT #2 - After a lot of reading and trying out a lot of Hive SQL... the one solution that should have worked in theory did not work; specifically, the ORDER BY rand() never even happened...
CREATE TABLE joined_table AS SELECT * FROM (
select * from larger_table as a
JOIN
(SELECT *, row_number() over (partition by key order by rand() ) as row_num
from smaller_table) as b
on a.key = b.key
and b.row_num=1
)b;
In the results it is matched with the first row, not a random row at all.
Any other options or anything I did incorrectly here?
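For what it's worth, here is a minimal sketch of the pattern that is usually suggested for picking one random match per key (table and column names follow the examples above; the point is that the row_num = 1 filter goes in a WHERE inside the subquery, before the join, rather than in the join condition):
SELECT a.*, b.*
FROM larger_table a
LEFT OUTER JOIN (
    SELECT * FROM (
        SELECT s.*,
               row_number() OVER (PARTITION BY key ORDER BY rand()) AS row_num
        FROM smaller_table s
    ) ranked
    WHERE row_num = 1
) b
ON a.key = b.key;
If you wrap this in a CREATE TABLE ... AS, list the output columns explicitly instead of a.*, b.* so the key column appearing on both sides does not collide.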

Related

Abap subquery Where Cond [duplicate]

I have a requirement to pull records that do not have history in an archive table. Two fields of one record need to be checked for in the archive.
In a technical sense my requirement is a left join where the right side is null (a.k.a. an excluding join), which in ABAP Open SQL is commonly implemented like this (for my scenario anyway):
Select * from xxxx //xxxx is a result for a multiple table join
where xxxx~key not in (select key from archive_table where [conditions] )
and xxxx~foreign_key not in (select key from archive_table where [conditions] )
Those 2 fields are also checked against 2 more tables, so that would mean a total of 6 subqueries.
Database engines that I have worked with previously usually had some methods to deal with such problems (such as excluding join or outer apply).
For this particular case I will be trying to use ABAP logic with 'for all entries', but I would still like to know if it is possible to use the results of a sub-query to check more than one field, or to use another form of excluding-join logic on multiple fields in SQL (without involving the application server).
I have tested quite a few variations of sub-queries in the life-cycle of the program I was making. NOT EXISTS with multiple field check (shortened example below) to exclude based on 2 keys works in certain cases.
Performance is acceptable (processing time is about 5 seconds), although it's noticeably slower than the same query when excluding based on 1 field.
Select * from xxxx //xxxx is a result for a multiple table inner joins and 1 left join ( 1-* relation )
where NOT EXISTS (
select key from archive_table
where key = xxxx~key OR key = xxxx~foreign_key
)
EDIT:
With changing requirements (for more filtering) a lot has changed, so I figured I would update this. The construct I marked as xxxx in my example contained a single left join (where the main-to-secondary table relation is 1-*) and it appeared relatively fast.
This is where context becomes helpful for understanding the problem:
Initial requirement: pull all vendors without financial records in 3 tables.
Additional requirements: also exclude based on alternative payers (1-* relationship). This is what the example above is based on.
More requirements: also exclude based on alternative payees (*-* relationship between payer and payee).
The many-to-many join multiplied the record count within the construct I labeled xxxx, which in turn produces a lot of unnecessary work. For instance: a single vendor with 3 payers and 3 payees produced 9 rows, with a total of 27 fields to check (3 per row), when in reality there are only 7 unique values.
At this point, moving the left-joined tables from the main query into sub-queries and splitting them gave significantly better performance than any smarter-looking alternatives.
select * from lfa1 inner join lfb1
where
( lfa1~lifnr not in ( select lifnr from bsik where bsik~lifnr = lfa1~lifnr )
and lfa1~lifnr not in ( select wyt3~lifnr from wyt3 inner join t024e on wyt3~ekorg = t024e~ekorg and wyt3~lifnr <> wyt3~lifn2
inner join bsik on bsik~lifnr = wyt3~lifn2 where wyt3~lifnr = lfa1~lifnr and t024e~bukrs = lfb1~bukrs )
and lfa1~lifnr not in ( select lfza~lifnr from lfza inner join bsik on bsik~lifnr = lfza~empfk where lfza~lifnr = lfa1~lifnr )
)
and [3 more sets of sub queries like the 3 above, just checking different tables].
My Conclusion:
When exclusion is based on a single field, both NOT IN and NOT EXISTS work. One might be better than the other, depending on the filters you use.
When exclusion is based on 2 or more fields and you don't have many-to-many join in main query, not exists ( select .. from table where id = a.id or id = b.id or... ) appears to be the best.
The moment your exclusion criteria implement a many-to-many relationship within your main query, I would recommend looking for an optimal way to implement multiple sub-queries instead (even having a sub-query for each key-table combination will perform better than a many-to-many join with one good-looking sub-query).
Anyways, any additional insight into this is welcome.
EDIT2: Although it's slightly off topic, given how my question was about sub-queries, I figured I would post an update. After over a year I had to revisit the solution I worked on in order to expand it. I learned that a proper excluding join works. I just failed horribly at implementing it the first time.
select header~key
from headers left join items on headers~key = items~key
where items~key is null
if it is possible to use results of a sub-query to check more than 1 field or use another form of excluding join logic on multiple fields
No, it is not possible to check two columns in a subquery, as the SAP Help clearly says:
The clauses in the subquery subquery_clauses must constitute a scalar subquery.
Scalar is the keyword here, i.e. it should return exactly one column.
Your subquery can have a multi-column key, and such syntax is completely legit:
SELECT planetype, seatsmax
FROM saplane AS plane
WHERE seatsmax < #wa-seatsmax AND
seatsmax >= ALL ( SELECT seatsocc
FROM sflight
WHERE carrid = #wa-carrid AND
connid = #wa-connid )
however you say that these two fields should be checked against different tables
Those 2 fields are also checked against two more tables
so that's not the case for you. Your only choice seems to be a multi-join.
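For illustration, a generic-SQL sketch of that multi-join (excluding-join) idea, checking both fields against the archive with two separate left joins; ABAP Open SQL would use the ~ notation instead of the dot, and the names simply follow the question:
SELECT x.*
FROM xxxx AS x
LEFT JOIN archive_table AS a1
       ON a1.key = x.key           -- first field checked against the archive
LEFT JOIN archive_table AS a2
       ON a2.key = x.foreign_key   -- second field checked against the archive
WHERE a1.key IS NULL
  AND a2.key IS NULL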
P.S. FOR ALL ENTRIES does not support negation logic, you cannot just use some sort of NOT IN FOR ALL ENTRIES, it won't be that easy.

Request optimisation

I have two tables. On one there are all the trips (courses) that the buses make:
dbo.Courses_Bus
|ID|ID_Bus|ID_Line|DateHour_Start_Course|DateHour_End_Course|
On the other, all payments made on these buses:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|
The goal is to add the notion of a Line to the Payments table, to get something like this:
dbo.Payments
|ID|ID_Bus|DateHour_Payment|Line|
So I tried this:
/** I first added a Line column to the dbo.Payments table**/
UPDATE
Table_A
SET
Table_A.Line = Table_B.ID_Line
FROM
[dbo].[Payments] AS Table_A
INNER JOIN [dbo].[Courses_Bus] AS Table_B
ON Table_A.ID_Bus = Table_B.ID_Bus
AND Table_A.DateHour_Payment BETWEEN Table_B.DateHour_Start_Course AND Table_B.DateHour_End_Course
And this
UPDATE
Table_A
SET
Table_A.Line = Table_B.ID_Line
FROM
[dbo].[Payments] AS Table_A
INNER JOIN (
SELECT
P.*,
CP.ID_Line AS ID_Line
FROM
[dbo].[Payments] AS P
INNER JOIN [dbo].[Courses_Bus] CP ON CP.ID_Bus = P.ID_Bus
AND CP.DateHour_Start_Course <= P.DateHour_Payment
AND CP.DateHour_End_Course >= P.DateHour_Payment
) AS Table_B ON Table_A.ID_Bus = Table_B.ID_Bus
The main problem, apart from the fact that these queries do not seem to work properly, is that each table has several million rows and is growing every day, and because of the date-hour filter (mandatory, since a single bus can run on several lines every day) SQL Server must compare each row of one table to all rows of the other table.
So it takes an enormous amount of time, and it will only get worse every day.
How can I make it work and optimise it?
Assuming that this is the logic you want:
UPDATE p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course;
To optimize this query, you want an index on Courses_Bus(ID_Bus, DateHour_Start_Course, DateHour_End_Course).
There might be slightly more efficient ways to optimize the query, but your question doesn't have enough information -- is there always exactly one match, for instance?
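For reference, a sketch of that index (the index name is just illustrative):
CREATE INDEX IX_Courses_Bus_Bus_Dates
    ON dbo.Courses_Bus (ID_Bus, DateHour_Start_Course, DateHour_End_Course);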
Another big issue is that updating all the rows is quite expensive. You might find that it is better to do this in loops, one chunk at a time:
UPDATE TOP (10000) p
SET p.Line = cb.ID_Line
FROM [dbo].[Payments] p JOIN
[dbo].[Courses_Bus] cb
ON p.ID_Bus = cb.ID_Bus AND
p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
WHERE p.Line IS NULL;
Once again, though, this structure depends on all the initial values being NULL and an exact match for all rows.
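As a sketch of how that chunking is usually driven (assuming Line starts out NULL everywhere), repeat the batch until nothing is left to update:
WHILE 1 = 1
BEGIN
    UPDATE TOP (10000) p
    SET p.Line = cb.ID_Line
    FROM [dbo].[Payments] p JOIN
         [dbo].[Courses_Bus] cb
         ON p.ID_Bus = cb.ID_Bus AND
            p.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
    WHERE p.Line IS NULL;

    IF @@ROWCOUNT = 0 BREAK;  -- stop once no rows were updated in this batch
END;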
Thank you Gordon for your answer.
I investigated and came up with this query:
MERGE [dbo].[Payments] AS p
USING [dbo].[Courses_Bus] AS cb
ON p.ID_Bus= cb.ID_Bus AND
p.DateHour_Payment>= cb.DateHour_Start_Course AND
p.DateHour_Payment<= cb.DateHour_End_Course
WHEN MATCHED THEN
UPDATE SET p.Line = cb.ID_Line;
As it seems to be the most suitable in an MS-SQL environment.
It also produced this error:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
I understood this to mean that it finds several rows with identical
[p.ID_Bus= cb.ID_Bus AND
p.DateHour_Payment >= cb.DateHour_Start_Course AND
p.DateHour_Payment <= cb.DateHour_End_Course]
Yes, this is a possible case, however the ID is different each time.
For example, if two blue cards are beeped at the same time, or if there is a loss of network and the equipment was updated later, thus putting the beeps at the same time. These are different rows that must be treated separately, and you can obtain, for example:
|ID|ID_Bus|DateHour_Payment|Line|
----------------------------------
|56|204|2021-01-01 10:00:00|15|
|82|204|2021-01-01 10:00:00|15|
How can I improve this query so that it takes into account different payment IDs?
I can't figure out how to do this with the help I find online. Maybe this method is not the right one in this context.
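For what it's worth, the error message itself suggests the usual fix: reduce the source to at most one course per payment before merging. A sketch of that idea (not necessarily the right choice here), using ROW_NUMBER to keep one arbitrary matching course per payment ID:
MERGE [dbo].[Payments] AS p
USING (
    SELECT Payment_ID, ID_Line
    FROM (
        SELECT pay.ID AS Payment_ID, cb.ID_Line,
               ROW_NUMBER() OVER (PARTITION BY pay.ID
                                  ORDER BY cb.DateHour_Start_Course) AS rn
        FROM [dbo].[Payments] pay
        JOIN [dbo].[Courses_Bus] cb
          ON pay.ID_Bus = cb.ID_Bus
         AND pay.DateHour_Payment BETWEEN cb.DateHour_Start_Course AND cb.DateHour_End_Course
    ) ranked
    WHERE rn = 1
) AS src
ON p.ID = src.Payment_ID
WHEN MATCHED THEN
    UPDATE SET p.Line = src.ID_Line;
Each payment keeps its own row (so the two 10:00:00 beeps above are updated independently), while each payment now matches at most one course row.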

Determine datatypes of columns - SQL selection

Is it possible to determine the data type of each column after a SQL selection, based on the received results? I know it is possible through information_schema.columns, but the data I receive comes from multiple tables joined together, and the columns are renamed. Besides that, I'm not able to see or use this query or execute other queries myself.
My job is to store this received data in another table, but without knowing beforehand what I will receive. I'm obviously able to check, for example, whether a certain column contains numbers or text, but not whether it is originally stored as a TINYINT(1) or a BIGINT(128). How should I approach this? To clarify, it is alright if the data types of the source and destination columns aren't exactly the same, but I don't want to reserve too much space beforehand (or too little, for that matter).
As I'm typing, I realize I'm formulating the question wrong. What would be the best approach to handle the described situation? I thought about altering tables on the fly (e.g. increasing a column's size if needed), but that seems a bit, well, wrong and not the proper way.
Thanks
Can you issue the following query about your new table after you create it?
SELECT *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'JoinedQueryResults'
Is the query too big to run before knowing how big the results will be? Get an idea of how many rows it may return; the trick with join queries is to group on the columns you are joining on, so the estimate returns more quickly. Here's an example that just returns a row count for the query above, which would have created the JoinedQueryResults table.
SELECT SUM(A.NumRows * B.NumRows)
FROM (SELECT ID, COUNT(*) AS NumRows
FROM TableA
GROUP BY ID) AS A
INNER JOIN (SELECT ID, COUNT(*) AS NumRows
FROM TableB
GROUP BY ID) AS B ON A.ID = B.ID
The query above will run faster if all you need is a record count to help you estimate a size.
Also try instantiating a table for your results with a query like this.
SELECT TOP 0 *
INTO JoinedQueryResults
FROM TableA AS A
INNER JOIN TableB AS B ON A.ID = B.ID
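If this is SQL Server 2012 or later (an assumption; the SELECT ... INTO syntax above is SQL Server), sp_describe_first_result_set can also report the column types of an arbitrary query without materializing any rows:
EXEC sp_describe_first_result_set
    @tsql = N'SELECT *
              FROM TableA AS A
              INNER JOIN TableB AS B ON A.ID = B.ID';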

Hive query stuck at 99%

I am inserting records using a left join in Hive. When I set limit 1 the query works, but for all records the query gets stuck at 99% of the reduce job.
Below query works
Insert overwrite table tablename select a.id , b.name from a left join b on a.id = b.id limit 1;
But this does not
Insert overwrite table tablename select table1.id , table2.name from table1 left join table2 on table1.id = table2.id;
I have increased the number of reducers but it still doesn't work.
Here are a few Hive optimizations that might help the query optimizer and reduce the overhead of data sent across the wire.
set hive.exec.parallel=true;
set mapred.compress.map.output=true;
set mapred.output.compress=true;
set hive.exec.compress.output=true;
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
However, I think there's a greater chance that the underlying problem is skew in the join key. For a full description of skew and possible workarounds, see https://cwiki.apache.org/confluence/display/Hive/Skewed+Join+Optimization
You also mentioned that table1 is much smaller than table2. You might try a map-side join depending on your hardware constraints. (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins)
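A minimal sketch of enabling that map-side join conversion (the threshold value here is only an example and is given in bytes); note that for a LEFT OUTER JOIN Hive can only hold the right-hand table in memory, so this helps when the right table is the small one:
set hive.auto.convert.join=true;
-- file-size threshold (bytes) below which a table is loaded into memory
set hive.mapjoin.smalltable.filesize=250000000;

Insert overwrite table tablename
select table1.id, table2.name
from table1 left join table2 on table1.id = table2.id;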
If your query is getting stuck at 99%, check out the following options -
Data skew - if you have skewed data, it is possible that one reducer is doing all the work
Duplicate keys on both sides - if you have many duplicate join keys on both sides, your output might explode and the query might get stuck
One of your tables is small - try to use a map join or, if possible, an SMB join, which is a huge performance gain over a reduce-side join (see the settings sketch after this list)
Go to the resource manager log and see the amount of data the job is reading and writing.
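A sketch of the SMB join settings mentioned in the list above (they only take effect when both tables are bucketed and sorted on the join key):
-- both tables must be bucketed and sorted on the join key
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;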
Hive automatically does some optimizations when it comes to joins and loads one side of the join to memory if it fits the requirements. However in some cases these jobs get stuck at 99% and never really finish.
I have faced this multiple times, and the way I have avoided it is by explicitly specifying some settings to Hive. Try the settings below and see if they work for you.
hive.auto.convert.join=false
mapred.compress.map.output=true
hive.exec.parallel=true
Make sure you don't have rows with duplicate id values in one of your data tables!
I recently encountered the same issue with a left join's map-reduce process getting stuck on 99% in Hue.
After a little snooping I discovered the root of my problem: there were rows with duplicate member_id matching variables in one of my tables. Left joining all of the duplicate member_ids would have created a new table containing hundreds of millions of rows, consuming more than my allotted memory on our company's Hadoop server.
Use these configurations and try:
hive> set mapreduce.map.memory.mb=9000;
hive> set mapreduce.map.java.opts=-Xmx7200m;
hive> set mapreduce.reduce.memory.mb=9000;
hive> set mapreduce.reduce.java.opts=-Xmx7200m;
I faced the same problem with a left outer join similar to:
select bt.*, sm.newparam from
big_table bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
I did an analysis based on the answers already given and I saw two of the problems mentioned:
Left table was more than 100x bigger than the right table
select count(*) from big_table -- returned 130M
select count(*) from small_table -- returned 1.3M
I also detected that one of the join variables was rather skewed in the right table:
select count(*), cate
from small_table
group by cate
-- returned
-- A 70K
-- B 1.1M
-- C 120K
I tried most of the solutions given in the other answers, plus some extra parameters I found. Without success:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Lastly I found out that the left table had a really high % of null values for the join columns: bt.ident and bt.cate
So I tried one last thing, which finally worked for me: splitting the left table depending on whether bt.ident and bt.cate were null or not, and then doing a union all of both branches:
select * from
(select bt.*, sm.newparam
from (select * from big_table where ident is not null or cate is not null) bt
left outer join
small_table sm
on bt.ident = sm.ident
and bt.cate = sm.cate
union all
select *, null as newparam from big_table nbt where ident is null and cate is null) combined

Why does my left join in Access have fewer rows than the left table?

I have two tables in an MS Access 2010 database: TBLIndividuals and TblIndividualsUpdates. They have a lot of the same data, but the primary key may not be the same for a given person's record in both tables. So I'm doing a join between the two tables on names and birthdates to see which records correspond. I'm using a left join so that I also get rows for the people who are in TblIndividualsUpdates but not in TBLIndividuals. That way I know which records need to be added to TBLIndividuals to get it up to date.
SELECT TblIndividuals.PersonID AS OldID,
TblIndividualsUpdates.PersonID AS UpdateID
FROM TblIndividualsUpdates LEFT JOIN TblIndividuals
ON ( (TblIndividuals.FirstName = TblIndividualsUpdates.FirstName)
and (TblIndividuals.LastName = TblIndividualsUpdates.LastName)
AND (TblIndividuals.DateBorn = TblIndividualsUpdates.DateBorn
or (TblIndividuals.DateBorn is null
and (TblIndividuals.MidName is null and TblIndividualsUpdates.MidName is null
or TblIndividuals.MidName = TblIndividualsUpdates.MidName))));
TblIndividualsUpdates has 4149 rows, but the query returns only 4103 rows. There are about 50 new records in TblIndividualsUpdates, but only 4 rows in the query result where OldID is null.
If I export the data from Access to PostgreSQL and run the same query there, I get all 4149 rows.
Is this a bug in Access? Is there a difference between Access's left join semantics and PostgreSQL's? Is my database corrupted (Compact and Repair doesn't help)?
ON (
TblIndividuals.FirstName = TblIndividualsUpdates.FirstName
and
TblIndividuals.LastName = TblIndividualsUpdates.LastName
AND (
TblIndividuals.DateBorn = TblIndividualsUpdates.DateBorn
or
(
TblIndividuals.DateBorn is null
and
(
TblIndividuals.MidName is null
and TblIndividualsUpdates.MidName is null
or TblIndividuals.MidName = TblIndividualsUpdates.MidName
)
)
)
);
What I would do is systematically remove all the join conditions except the first two until you find where the records drop off. Then you will know where your problem is.
This should never happen. Unless rows are being inserted/deleted in the meantime,
the query:
SELECT *
FROM a LEFT JOIN b
ON whatever ;
should never return fewer rows than:
SELECT *
FROM a ;
If it happens, it's a bug. Are you sure the queries are exactly like this (and you haven't omitted some detail, like a WHERE clause)? Are you sure that the first returns 4149 rows and the second one 4103 rows? You could make another check by changing the * above to COUNT(*).
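For instance, a quick sanity check along those lines (Access SQL, using the simplified join condition from the answer below) might be:
SELECT COUNT(*) AS TotalRows
FROM TblIndividualsUpdates;

SELECT COUNT(*) AS JoinedRows
FROM TblIndividualsUpdates LEFT JOIN TblIndividuals
ON ((TblIndividuals.FirstName = TblIndividualsUpdates.FirstName)
AND (TblIndividuals.LastName = TblIndividualsUpdates.LastName)
AND (TblIndividuals.DateBorn = TblIndividualsUpdates.DateBorn));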
Drop any indexes from both tables which include those JOIN fields (FirstName, LastName, and DateBorn). Then see whether you get the expected 4,149 rows with this simplified query.
SELECT
i.PersonID AS OldID,
u.PersonID AS UpdateID
FROM
TblIndividualsUpdates AS u
LEFT JOIN TblIndividuals AS i
ON
(
(i.FirstName = u.FirstName)
AND (i.LastName = u.LastName)
AND (i.DateBorn = u.DateBorn)
);
For whatever it is worth, since this seems to be a deceptive bug and any additional information could help in resolving it, I have had the same problem.
The query is too big to post here and I don't have the time to reduce it now to something suitable, but I can report what I found. In the below, all joins are left joins.
I was gradually refining and changing my query. It had a derived table in it (D). And the whole thing was made into a derived table (T) and then joined to a last table (L). In any case, at one point in its development, no field in T that originated in D participated in the join to L. It was then that the problem occurred: the total number of rows mysteriously became less than in the main table, which should be impossible. As soon as I again let a field from D participate (via T) in the join to L, the number increased back to normal.
It was as if the join condition to D was moved to a WHERE clause when no field in it was participating (via T) in the join to L. But I don't really know what the explanation is.