Hive: Column data mismatch after left outer join?

I am seeing strange behaviour, explained below. I have two Hive tables, T1 and T2, joined with a LEFT OUTER JOIN. After the join I get unexpected values in two columns of T2, t2c2 and t2c3.
Complete details below:
Table T1 :
create table T1 ( t1c1 int , t1c2 int , t1c3 int ) clustered by (t1c1) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');
4 4 1
4 4 0
1 1 1
1 1 0
5 5 1
Table T2:
create table T2 ( t2c0 int , t2c1 int , t2c2 int , t2c3 int ) clustered by ( t2c1) into 2 buckets stored as orc TBLPROPERTIES('transactional'='true');
0 1 -1 3
0 1 0 0
0 4 6 6
0 4 1 6
1 1 0 2
1 4 3 5
1 4 2 5
Query :
select *
from T1 a
LEFT OUTER JOIN
T2 b
ON a.t1c2 = b.t2c1;
Result set (not expected):
a.t1c1 a.t1c2 a.t1c3 b.t2c0 b.t2c1 b.t2c2 b.t2c3
4 4 1 1 4 4 3
4 4 1 0 4 4 6
4 4 1 1 4 4 2
4 4 1 0 4 4 1
4 4 0 1 4 4 3
4 4 0 0 4 4 6
4 4 0 1 4 4 2
4 4 0 0 4 4 1
1 1 1 0 1 1 -1
1 1 1 1 1 1 0
1 1 1 0 1 1 0
1 1 0 0 1 1 -1
1 1 0 1 1 1 0
1 1 0 0 1 1 0
5 5 1 NULL NULL NULL NULL
Error description:
The values of b.t2c2 and b.t2c3 in the result set are not what I expect. The -1 shown in b.t2c3 never occurs in T2.t2c3, and the 4 shown in b.t2c2 never occurs in T2.t2c2.
I am not sure what is wrong. Please help me identify the issue and resolve it.
Expected result :
1 1 1 0 1 -1 3
1 1 1 1 1 0 2
1 1 1 0 1 0 0
1 1 0 0 1 -1 3
1 1 0 1 1 0 2
1 1 0 0 1 0 0
4 4 1 1 4 3 5
4 4 1 0 4 6 6
4 4 1 1 4 2 5
4 4 1 0 4 1 6
4 4 0 1 4 3 5
4 4 0 0 4 6 6
4 4 0 1 4 2 5
4 4 0 0 4 1 6
5 5 1 <null> <null> <null> <null>
But if I change the query to the one below, it starts giving the correct result:
select *
from ( select * from T1) a
LEFT OUTER JOIN
( select * from T2) b
ON a.t1c2 = b.t2c1;
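As a sanity check on the expected result, the same data and join can be run on a different engine. The sketch below uses Python's built-in sqlite3 module as a stand-in (an assumption made purely for illustration; it does not reproduce or explain the Hive behaviour, and the bucketing and ORC/transactional settings are left out). It only confirms what a LEFT OUTER JOIN should return here: 15 rows, every non-NULL right-hand side being an actual row of T2.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE T1 (t1c1 INT, t1c2 INT, t1c3 INT);
    CREATE TABLE T2 (t2c0 INT, t2c1 INT, t2c2 INT, t2c3 INT);
    INSERT INTO T1 VALUES (4,4,1),(4,4,0),(1,1,1),(1,1,0),(5,5,1);
    INSERT INTO T2 VALUES (0,1,-1,3),(0,1,0,0),(0,4,6,6),(0,4,1,6),
                          (1,1,0,2),(1,4,3,5),(1,4,2,5);
""")

rows = con.execute("""
    SELECT a.*, b.*
    FROM T1 a
    LEFT OUTER JOIN T2 b ON a.t1c2 = b.t2c1
""").fetchall()

t2_rows = set(con.execute("SELECT * FROM T2"))

# Split the result into rows that found a match in T2 and rows that did not.
matched = [r for r in rows if r[3] is not None]
unmatched = [r for r in rows if r[3] is None]

# Columns 3..6 of every matched row must be an unmodified row of T2.
all_from_t2 = all(r[3:] in t2_rows for r in matched)

print(len(rows), all_from_t2, unmatched)
```

If a check like `all_from_t2` fails on the Hive output, the join is returning column values that do not come from a single T2 row, which is what the unexpected result set above shows.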

Related

How to compute column sum on the basis of other column value in pandas dataframe?

P T1 T2 T3
0 1 2 3
1 1 2 0
2 3 1 2
3 1 0 2
In the above pandas dataframe df,
I want to sum columns depending on the value of column 'P'.
if df['P'] == 0: 0
if df['P'] == 1: T1 (=1)
if df['P'] == 2: T1+T2 (=3+1=4)
if df['P'] == 3: T1+T2+T3 (=1+0+2=3)
In other words, I want to add from T1 to TN if df['P'] == N.
How can I implement this with Python code?
EDIT:
To sum values according to the P column, select the relevant columns with DataFrame.filter, build a mask by broadcasting np.arange over the number of filtered columns and comparing it with the P values, pass that mask to DataFrame.where, and finally sum per row:
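import numpy as np
import pandas as pd

np.random.seed(20)
cols = [f'{x}{i + 1}' for x in ['T','U','V'] for i in range(3)]
df = pd.DataFrame(np.random.randint(4, size=(10,10)), columns=['P'] + cols)
# P as a column vector so it broadcasts against the column positions
arrP = df['P'].to_numpy()[:, None]
for c in ['T','U','V']:
    df1 = df.filter(regex=rf'^{c}')
    # keep only the first P columns of each row, zero the rest, then sum
    df[f'{c}_SUM'] = df1.where(np.arange(len(df1.columns)) < arrP, 0).sum(axis=1)
print (df)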
P T1 T2 T3 U1 U2 U3 V1 V2 V3 T_SUM U_SUM V_SUM
0 3 2 3 3 0 2 1 0 3 2 8 3 5
1 3 2 0 2 0 1 2 2 3 3 4 3 8
2 0 1 2 2 2 0 1 1 3 1 0 0 0
3 3 2 2 2 1 3 2 1 3 2 6 6 6
4 3 1 1 3 1 2 2 0 2 3 5 5 5
5 2 3 2 3 1 1 1 0 3 0 5 2 3
6 2 3 2 3 3 3 2 1 1 2 5 6 2
7 3 2 0 2 1 1 2 2 2 3 4 4 7
8 2 2 1 0 2 2 0 3 3 0 3 4 6
9 2 2 3 2 2 3 2 2 1 1 5 5 3
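The same masking idea can be applied directly to the four-row frame from the question. This is a minimal sketch that assumes the columns are named T1..T3 and appear in that order; the name `T_SUM` follows the answer's naming.

```python
import numpy as np
import pandas as pd

# The frame from the question.
df = pd.DataFrame({"P":  [0, 1, 2, 3],
                   "T1": [1, 1, 3, 1],
                   "T2": [2, 2, 1, 0],
                   "T3": [3, 0, 2, 2]})

t_cols = df.filter(regex=r"^T\d+$")       # T1, T2, T3 in column order
p = df["P"].to_numpy()[:, None]           # shape (n_rows, 1) for broadcasting
mask = np.arange(t_cols.shape[1]) < p     # True for the first P columns of each row

# Values outside the mask become 0, so the row sum only covers T1..TP.
df["T_SUM"] = t_cols.where(mask, 0).sum(axis=1)

print(df["T_SUM"].tolist())  # [0, 1, 4, 3]
```

Column position 0 corresponds to T1, so a row with P == 2 keeps positions 0 and 1 (T1 and T2), which matches the 3+1=4 worked out in the question.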

Select only rows whose columns do not have specific corresponding values

I want to select only the rows where certain IDs do not have a specific corresponding value.
Table Values:
1 D675F009-6908-47A4-816A-AD25A68D8514 0
2 7C96A948-B889-4630-BF67-2187ECFA37DC 1
3 FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E 1
4 178B055F-45FF-4951-A9E2-3470B1DE25E9 1
5 FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E 0
6 D675F009-6908-47A4-816A-AD25A68D8514 0
7 59737584-F44F-4B42-AF9C-1550DFEC1EA5 1
8 FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E 1
9 D675F009-6908-47A4-816A-AD25A68D8514 1
10 7C96A948-B889-4630-BF67-2187ECFA37DC 0
11 178B055F-45FF-4951-A9E2-3470B1DE25E9 1
12 016FAF52-8FBF-4C9C-802D-CA9E13071719 0
Do not select rows where:
(D675F009-6908-47A4-816A-AD25A68D8514) has the value 1, or
(FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E) has the value 1.
Rows that may be selected:
(D675F009-6908-47A4-816A-AD25A68D8514) with the value 0, and
(FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E) with the value 0.
Expected result:
1 D675F009-6908-47A4-816A-AD25A68D8514 0
2 7C96A948-B889-4630-BF67-2187ECFA37DC 1
4 178B055F-45FF-4951-A9E2-3470B1DE25E9 1
5 FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E 0
6 D675F009-6908-47A4-816A-AD25A68D8514 0
7 59737584-F44F-4B42-AF9C-1550DFEC1EA5 1
10 7C96A948-B889-4630-BF67-2187ECFA37DC 0
11 178B055F-45FF-4951-A9E2-3470B1DE25E9 1
12 016FAF52-8FBF-4C9C-802D-CA9E13071719 0
Is this what you want?
Select * from table where
(is_active=1 and
Participant_id NOT IN
('D675F009-6908-47A4-816A-AD25A68D8514', 'FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E' )
) or
is_active=0;
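To check that filter against the sample rows, here is a small sketch using Python's built-in sqlite3 module. The table name `participants` and the column name `id` are hypothetical; `Participant_id` and `is_active` are the names the answer assumes, since the question does not give any.

```python
import sqlite3

D675 = "D675F009-6908-47A4-816A-AD25A68D8514"
FD6D = "FD6DD4B4-6E5D-4282-B421-A849DB4B1D3E"

# The twelve rows from the question: (id, Participant_id, is_active).
data = [
    (1, D675, 0), (2, "7C96A948-B889-4630-BF67-2187ECFA37DC", 1),
    (3, FD6D, 1), (4, "178B055F-45FF-4951-A9E2-3470B1DE25E9", 1),
    (5, FD6D, 0), (6, D675, 0),
    (7, "59737584-F44F-4B42-AF9C-1550DFEC1EA5", 1), (8, FD6D, 1),
    (9, D675, 1), (10, "7C96A948-B889-4630-BF67-2187ECFA37DC", 0),
    (11, "178B055F-45FF-4951-A9E2-3470B1DE25E9", 1),
    (12, "016FAF52-8FBF-4C9C-802D-CA9E13071719", 0),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE participants (id INT, Participant_id TEXT, is_active INT)")
con.executemany("INSERT INTO participants VALUES (?, ?, ?)", data)

# Keep every inactive row, and active rows only for the other participants.
kept = [r[0] for r in con.execute("""
    SELECT id FROM participants
    WHERE (is_active = 1 AND Participant_id NOT IN (?, ?))
       OR is_active = 0
    ORDER BY id
""", (D675, FD6D))]

print(kept)  # [1, 2, 4, 5, 6, 7, 10, 11, 12]
```

Rows 3, 8 and 9 are the only ones dropped, which matches the expected result listed in the question.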

Sum distinct group values only

I would like to sum values distinct per group. Pardon the wordy post...
Context. Suppose I have a table of the form:
ID Foo Value
A 1 2
B 0 2
C 0 3
A 1 2
A 1 2
C 0 3
B 0 2
Each ID/Foo combo has one value. I'd like to join this table onto another CTE that has a cumulative field; e.g. suppose that after joining using rows unbounded preceding I have a new field called Cumulative. The data is the same, just repeated 3 times with a different Cumulative value:
ID Foo Value Cumulative
A 1 2 1
B 0 2 1
C 0 3 1
A 1 2 1
A 1 2 1
C 0 3 1
B 0 2 1
A 1 2 2
B 0 2 2
C 0 3 2
A 1 2 2
A 1 2 2
C 0 3 2
B 0 2 2
A 1 2 3
B 0 2 3
C 0 3 3
A 1 2 3
A 1 2 3
C 0 3 3
B 0 2 3
I want to add a new field 'segment_value' that, for each Foo, holds the sum of the values of the distinct IDs. E.g. the distinct ID/Foo combinations are:
ID Foo Value
A 1 2
B 0 2
C 0 3
I would therefore like a new field, 'segment_value', that returns 2 for Foo=1 and 5 for Foo=0. Desired result:
ID Foo Value Cumulative segment_value
A 1 2 1 2
B 0 2 1 5
C 0 3 1 5
A 1 2 1 2
A 1 2 1 2
C 0 3 1 5
B 0 2 1 5
A 1 2 2 2
B 0 2 2 5
C 0 3 2 5
A 1 2 2 2
A 1 2 2 2
C 0 3 2 5
B 0 2 2 5
A 1 2 3 2
B 0 2 3 5
C 0 3 3 5
A 1 2 3 2
A 1 2 3 2
C 0 3 3 5
B 0 2 3 5
How can I achieve this?
I don't think you explained your problem very well and I might have misunderstood something, but can't you extract the segment_value using a query such as this one:
select
    foo,
    sum(val) as segment_value
from (
    -- one row per ID/Foo combination, so IDs that share a value are not merged
    select distinct id, foo, val from table
) tab
group by foo
this would return the following result:
foo segment_value
1 2
0 5
Then you could join this to the rest of your query and use it as needed.
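The join back onto the detail rows could look like the sketch below, run here with Python's built-in sqlite3 module. The table name `items` and the lowercase column names are hypothetical, and the Cumulative field is left out because it does not affect segment_value. The inner DISTINCT includes `id`, on the assumption that two different IDs with the same Foo and Value should both count.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id TEXT, foo INT, val INT)")
con.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("A", 1, 2), ("B", 0, 2), ("C", 0, 3), ("A", 1, 2),
     ("A", 1, 2), ("C", 0, 3), ("B", 0, 2)],
)

rows = con.execute("""
    SELECT i.id, i.foo, i.val, s.segment_value
    FROM items i
    JOIN (
        -- sum each ID's value once per foo
        SELECT foo, SUM(val) AS segment_value
        FROM (SELECT DISTINCT id, foo, val FROM items) d
        GROUP BY foo
    ) s ON s.foo = i.foo
""").fetchall()

# Collapse to one segment_value per foo for a quick check.
segment_by_foo = {foo: seg for _, foo, _, seg in rows}
print(len(rows), segment_by_foo)
```

Every one of the seven detail rows is kept, and each carries 2 when foo is 1 and 5 when foo is 0.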

How to order "group" of row?

This is the query:
SELECT WorkTypeId, WorktypeWorkID, LevelID
FROM Worktypes as w
LEFT JOIN WorktypesWorks as ww on w.ID = ww.WorktypeID
LEFT JOIN WorktypesWorksLevels as wwl on ww.ID = wwl.WorktypeWorkID
This is the result:
WorkTypeId WorktypeWorkID LevelID
1 1 1
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
1 3 1
1 4 1
1 4 2
1 5 1
NULL NULL NULL
3 19 2
4 6 1
4 7 1
4 7 2
4 7 3
4 17 1
4 17 2
4 18 1
4 18 2
NULL NULL NULL
I'd like to order the blocks of rows for each WorktypeWorkID, placing first the blocks whose highest LevelID is lowest.
Here's the result that I'd like to get:
WorkTypeId WorktypeWorkID LevelID
NULL NULL NULL
NULL NULL NULL
1 3 1 // blocks whose highest LevelID is 1
1 5 1
4 6 1
1 4 1 // blocks whose highest LevelID is 2
1 4 2
3 19 2
4 17 1
4 17 2
4 18 1
4 18 2
1 1 1 // blocks whose highest LevelID is 3
1 1 2
1 1 3
1 2 1
1 2 2
1 2 3
4 7 1
4 7 2
4 7 3
I think this is what you are looking for:
SELECT WorkTypeId, WorktypeWorkID, LevelID,
       MAX(LevelID) OVER (PARTITION BY WorktypeWorkID) as maxLevelID
FROM Worktypes as w
LEFT JOIN WorktypesWorks as ww on w.ID = ww.WorktypeID
LEFT JOIN WorktypesWorksLevels as wwl on ww.ID = wwl.WorktypeWorkID
-- tie-breakers keep each block together and ordered by LevelID
ORDER BY maxLevelID, WorkTypeId, WorktypeWorkID, LevelID
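The windowed-MAX ordering can be tried on the sample rows with Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). This is only a sketch: the already-joined result is loaded into a single hypothetical table `joined` instead of recreating the three source tables, and the extra ORDER BY columns are an assumption about how ties between blocks should be broken. SQLite happens to sort NULLs first in ascending order, as the desired output requires.

```python
import sqlite3

# The joined result from the question, including the two all-NULL rows.
rows = [(1, 1, 1), (1, 1, 2), (1, 1, 3), (1, 2, 1), (1, 2, 2), (1, 2, 3),
        (1, 3, 1), (1, 4, 1), (1, 4, 2), (1, 5, 1), (None, None, None),
        (3, 19, 2), (4, 6, 1), (4, 7, 1), (4, 7, 2), (4, 7, 3),
        (4, 17, 1), (4, 17, 2), (4, 18, 1), (4, 18, 2), (None, None, None)]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE joined (WorkTypeId INT, WorktypeWorkID INT, LevelID INT)")
con.executemany("INSERT INTO joined VALUES (?, ?, ?)", rows)

ordered = con.execute("""
    SELECT WorkTypeId, WorktypeWorkID, LevelID
    FROM (
        -- highest LevelID of the block each row belongs to
        SELECT *, MAX(LevelID) OVER (PARTITION BY WorktypeWorkID) AS maxLevelID
        FROM joined
    )
    ORDER BY maxLevelID, WorkTypeId, WorktypeWorkID, LevelID
""").fetchall()

for row in ordered:
    print(row)
```

Ordering by maxLevelID alone would group the blocks correctly but leaves the order of rows with equal maxLevelID unspecified, which is why the remaining columns are listed as tie-breakers.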

Which type of join can I use to reproduce these results

I have the following view which contains this data
ActivityRecId RegionRecId IsExcluded
1 null 1
2 null 1
3 1 1
3 2 1
4 1 1
5 null 0
What I would like to do is join the region table to the view above to get the following records.
ActivityRecId RegionRecId IsExcluded
1 null 1
2 null 1
3 1 1
3 2 1
3 3 0
3 4 0
4 1 1
4 2 0
4 3 0
4 4 0
5 null 0
The region table has the following columns:
RegionRecId
RegionName
Any suggestions? Let me know if you need any other information.
--------------------- CORRECT QUESTION ------------------------
ActivityRecId RegionRecId IsExcluded
1 null 1
2 null 1
3 1 1
3 2 1
3 3 0
3 4 0
4 1 1
4 2 0
4 3 0
4 4 0
5 1 0
5 2 0
5 3 0
5 4 0
If it makes it easier, activities 1 and 2 can also list all the regions.
Thanks,
I don't have an SQL Server handy to test this, but would something like
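select *
from myView
-- per the corrected question, an activity expanded to every region drops its null-region row
where RegionRecId is not null or IsExcluded = 1
union
select myView.ActivityRecId,
       region.RegionRecId,
       0 as IsExcluded
from myView cross join region
where (myView.RegionRecId is not null or myView.IsExcluded = 0)
and not exists (
    -- alias the inner reference so it is correlated with the outer activity
    select null
    from myView v2
    where v2.ActivityRecId = myView.ActivityRecId
    and v2.RegionRecId = region.RegionRecId
)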
be what you want?
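A cross join plus NOT EXISTS of this kind can be checked against the sample data with Python's built-in sqlite3 module. This is a sketch under several assumptions: the view is modelled as a plain table, the region table is taken to hold RegionRecId 1 to 4 with made-up names, and the target is the output of the corrected question. The subquery is correlated on both ActivityRecId and RegionRecId, so a region is only skipped for the activity that already lists it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE myView (ActivityRecId INT, RegionRecId INT, IsExcluded INT);
    CREATE TABLE region (RegionRecId INT, RegionName TEXT);
    INSERT INTO myView VALUES (1,NULL,1),(2,NULL,1),(3,1,1),(3,2,1),(4,1,1),(5,NULL,0);
    INSERT INTO region VALUES (1,'R1'),(2,'R2'),(3,'R3'),(4,'R4');
""")

result = con.execute("""
    -- existing rows, minus the null-region row of activities that get expanded
    SELECT ActivityRecId, RegionRecId, IsExcluded
    FROM myView
    WHERE RegionRecId IS NOT NULL OR IsExcluded = 1
    UNION
    -- every region the activity does not already list, marked as not excluded
    SELECT v.ActivityRecId, r.RegionRecId, 0
    FROM myView v CROSS JOIN region r
    WHERE (v.RegionRecId IS NOT NULL OR v.IsExcluded = 0)
      AND NOT EXISTS (
          SELECT 1 FROM myView v2
          WHERE v2.ActivityRecId = v.ActivityRecId
            AND v2.RegionRecId = r.RegionRecId
      )
    ORDER BY 1, 2
""").fetchall()

for row in result:
    print(row)
```

Activities 1 and 2 keep their single null-region row, activities 3 and 4 gain the regions they did not list, and activity 5 is expanded to all four regions.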
Here's a great reference to take a look at for joins. I think you'd need a left outer join.