Pandas Merge 2 DataFrames

I'm trying to do a merge and am having trouble.
These are my 2 dataframes:
DF1
Team_Id Team_Name Season Daynum Wteam Wscore Lteam
0 1104 Alabama 1985 137 1104 50 1112
1 1104 Alabama 1985 139 1104 63 1433
2 1104 Alabama 1986 137 1104 97 1462
3 1104 Alabama 1986 139 1104 58 1228
4 1104 Alabama 1987 136 1104 88 1299
DF2
Season Seed Team
0 1985 X07 1104
1 1986 Y05 1104
2 1987 X02 1104
I want the seeds from DF2 to be in the rows of DF1. There is more information in DF2 than there is in DF1.
The expected results are:
Team_Id Team_Name Season Daynum Wteam Wscore Lteam Seed
0 1104 Alabama 1985 137 1104 50 1112 X07
1 1104 Alabama 1985 139 1104 63 1433 X07
2 1104 Alabama 1986 137 1104 97 1462 Y05
3 1104 Alabama 1986 139 1104 58 1228 Y05
4 1104 Alabama 1987 136 1104 88 1299 X02

You need merge with left_on and right_on:
DF1.merge(DF2, left_on=['Season','Team_Id'], right_on=['Season','Team'])
Output:
Team_Id Team_Name Season Daynum Wteam Wscore Lteam Seed Team
0 1104 Alabama 1985 137 1104 50 1112 X07 1104
1 1104 Alabama 1985 139 1104 63 1433 X07 1104
2 1104 Alabama 1986 137 1104 97 1462 Y05 1104
3 1104 Alabama 1986 139 1104 58 1228 Y05 1104
4 1104 Alabama 1987 136 1104 88 1299 X02 1104
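If the duplicate Team column (it repeats Team_Id) is not needed in the result, it can be dropped after the merge, for example:
merged = DF1.merge(DF2, left_on=['Season', 'Team_Id'], right_on=['Season', 'Team'])
merged = merged.drop(columns='Team')  # Team duplicates Team_Id after the merge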

Related

How to fill the gaps and assign the values when using cumulative sum with Pandas?

I have the following dataset (much larger; this is just a small sample from it):
City Year Votes
Detroit 1964 23
Detroit 1977 61
Detroit 1978 89
Detroit 1986 116
Detroit 1993 144
Baltimore 1964 42
Baltimore 1965 91
Baltimore 1966 161
Baltimore 1967 219
Baltimore 1968 263
Baltimore 1969 312
Baltimore 1970 346
Baltimore 1978 375
Baltimore 1980 415
Baltimore 1981 449
Baltimore 1995 484
Baltimore 1996 529
Baltimore 1997 578
Baltimore 1998 619
Baltimore 1999 660
Baltimore 2000 713
Baltimore 2001 757
Baltimore 2002 807
Baltimore 2003 852
Baltimore 2004 884
Boston 1968 47
Boston 1969 101
Boston 1970 123
Boston 2007 157
Phoenix 1971 41
Phoenix 1972 41
Phoenix 1979 76
Phoenix 1981 112
Phoenix 1982 154
Phoenix 1983 197
Phoenix 1984 242
Phoenix 1985 279
Phoenix 1997 319
Phoenix 1998 351
Phoenix 2000 381
Phoenix 2003 417
Phoenix 2005 457
Phoenix 2006 494
Phoenix 2007 536
Phoenix 2008 570
Phoenix 2009 598
Phoenix 2021 633
Phoenix 2022 661
Years should be in the range from 1950 to 2023, and I would like to populate the missing years for each city:
if a city has votes at the starting year (1950), use that value
if a city doesn't have votes at the starting year (1950), use 0 as the start
for every city, fill the missing years with the following logic: always use the value of votes from the previous year.
The result (only Detroit shown, since I did it manually, but it should be done for all cities) should look like this:
City Year Votes
Detroit 1950 0
Detroit 1951 0
Detroit 1952 0
Detroit 1953 0
Detroit 1954 0
Detroit 1955 0
Detroit 1956 0
Detroit 1957 0
Detroit 1958 0
Detroit 1959 0
Detroit 1960 0
Detroit 1961 0
Detroit 1962 0
Detroit 1963 0
Detroit 1964 23
Detroit 1965 23
Detroit 1966 23
Detroit 1967 23
Detroit 1968 23
Detroit 1969 23
Detroit 1970 23
Detroit 1971 23
Detroit 1972 23
Detroit 1973 23
Detroit 1974 23
Detroit 1975 23
Detroit 1976 23
Detroit 1977 61
Detroit 1978 89
Detroit 1979 89
Detroit 1980 89
Detroit 1981 89
Detroit 1982 89
Detroit 1983 89
Detroit 1984 89
Detroit 1985 89
Detroit 1986 116
Detroit 1987 116
Detroit 1988 116
Detroit 1989 116
Detroit 1990 116
Detroit 1991 116
Detroit 1992 116
Detroit 1993 144
Detroit 1994 144
Detroit 1995 144
Detroit 1996 144
Detroit 1997 144
Detroit 1998 144
Detroit 1999 144
Detroit 2000 144
Detroit 2001 144
Detroit 2002 144
Detroit 2003 144
Detroit 2004 144
Detroit 2005 144
Detroit 2006 144
Detroit 2007 144
Detroit 2008 144
Detroit 2009 144
Detroit 2010 144
Detroit 2011 144
Detroit 2012 144
Detroit 2013 144
Detroit 2014 144
Detroit 2015 144
Detroit 2016 144
Detroit 2017 144
Detroit 2018 144
Detroit 2019 144
Detroit 2020 144
Detroit 2021 144
Detroit 2022 144
Detroit 2023 144
import pandas as pd

df = pd.read_clipboard()  # Your df here
cities = df["City"].unique()
years = range(1950, 2024)
# Build the full (City, Year) grid, reindex onto it, forward-fill per city, and start missing cities at 0
index = pd.MultiIndex.from_product([cities, years], names=["City", "Year"])
out = (df.set_index(["City", "Year"]).reindex(index)
         .groupby(level=0).ffill().fillna(0).astype(int).reset_index())
One option is with complete from pyjanitor to expose the missing rows:
# pip install pyjanitor
import pandas as pd
import janitor

# create a dictionary for the range of all the years;
# the key of the dictionary must exist in the dataframe
years = {'Year': range(1950, 2024)}
(df
 .complete(years, 'City')
 # forward-fill per city, then let cities with no earlier votes start at 0
 .assign(Votes=lambda df: df.groupby('City')['Votes'].ffill().fillna(0, downcast='infer'))
)
City Year Votes
0 Detroit 1950 0
1 Baltimore 1950 0
2 Boston 1950 0
3 Phoenix 1950 0
4 Detroit 1951 0
.. ... ... ...
291 Phoenix 2022 661
292 Detroit 2023 144
293 Baltimore 2023 884
294 Boston 2023 157
295 Phoenix 2023 661
[296 rows x 3 columns]

finding duplicate values with join

ITEMS
ITEM_ID NAME_ID ITEM_NAME
1001 2001 Office chair
1002 2002 Writing Desk
1003 2003 Filing cabinet
1004 2004 Bookshelf bookcase
1005 2005 Table lamp
1006 2001 Office chair
1007 2002 Writing Desk
1008 2003 Filing cabinet
1009 2004 Bookshelf bookcase
1010 2005 Table lamp
1011 2001 Office chair
1012 2002 Writing Desk
1013 2003 Filing cabinet
1014 2004 Bookshelf bookcase
1015 2005 Table lamp
1016 2016 Triangle window
1017 2017 Screen
1018 2018 Cradle
1019 2017 Screen
1020 2018 Cradle
1021 2017 Screen
1022 2018 Cradle
1023 2023 Futon
1024 2024 Single bed
1025 2025 Bunk beds
1026 2026 Sofa bed
1027 2027 Camp bed cot sleeping bag
1028 2028 Airbed air mattress
1029 2029 Hammock
1030 2030 Loveseat
1031 2031 Sleeper sofa
1032 2032 Settee
1032 2032 Settee
1033 2001 Office chair
1034 2002 Writing Desk
1035 2003 Filing cabinet
1036 2004 Bookshelf/bookcase
1037 2005 Table lamp
1038 2001 Office chair
1039 2002 Writing Desk
1040 2003 Filing cabinet
1041 2004 Bookshelf/bookcase
1042 2005 Table lamp
1043 2017 Screen
1044 2018 Cradle
1045 2017 Screen
1046 2018 Cradle
1047 2017 Screen
1048 2018 Cradle
1049 2017 Screen
1050 2018 Cradle
ITEMS_DETAILS:
CITY ITEM_ID SHOP_ID
NEW YORK 1001 4001
NEW YORK 1002 4002
NEW YORK 1003 4003
NEW YORK 1004 4004
NEW YORK 1005 4005
DALLAS 1006 4006
DALLAS 1007 4007
DALLAS 1008 4008
DALLAS 1009 4001
DALLAS 1010 4002
DALLAS 1011 4003
DALLAS 1012 4004
WASHINGTON 1013 4005
WASHINGTON 1014 4006
WASHINGTON 1015 4007
WASHINGTON 1016 4008
WASHINGTON 1017 4009
WASHINGTON 1018 4010
WASHINGTON 1019 4011
SANFRANSISCO 1020 4012
SANFRANSISCO 1021 4013
CHICAGO 1022 4014
CHICAGO 1023 4015
CHICAGO 1024 4016
CHICAGO 1025 4017
BOSTON 1026 4018
BOSTON 1027 4019
BOSTON 1028 4020
BOSTON 1029 4021
BOSTON 1030 4022
SANFRANSISCO 1031 4023
SANFRANSISCO 1032 4024
SANFRANSISCO 1032 4025
SANFRANSISCO 1033 4026
Las Vegas 1034 4027
Austin 1035 4028
Houston 1036 4029
Los Angeles 1037 4030
Seattle 1038 4031
Atlanta 1039 4032
McKinney 1040 4033
Vancouver 1041 4034
Las Vegas 1042 4035
Austin 1043 4036
Houston 1044 4037
Los Angeles 1045 4038
Seattle 1046 4034
Atlanta 1047 4035
McKinney 1048 4036
Vancouver 1049 4037
Las Vegas 1050 4043
Austin 1051 4044
Houston 1052 4045
Los Angeles 1053 4046
Seattle 1054 4047
Atlanta 1055 4048
McKinney 1056 4049
Vancouver 1057 4050
Las Vegas 1058 4051
Austin 1059 4052
Houston 1060 4053
Hi All,
I am trying to find the duplicate values of the columns in the result of joining ITEMS and ITEMS_DETAILS.
I know the SQL for finding duplicate values of a column on a single table, but I'm a bit confused when a join is involved.
Logic: If ITEM_NAME is the same but SHOP_ID is different, it should show as a duplicate. If SHOP_ID is the same, it should show as unique.
Please help me.
I tried as below:
select * from (
select a.NAME_ID from ITEMS a inner join ITEMS_DETAILS b on b.ITEM_ID = a.ITEM_ID) x
inner join ITEMS y on y.NAME_ID=x.NAME_ID
inner join ITEMS_DETAILS z on z.ITEM_ID=y.ITEM_ID
If you are interested in grouping and counting duplicates, then try the query below:
SELECT
COUNT(*) As DupCount,
y.ITEM_ID
FROM
ITEMS y
INNER JOIN ITEMS_DETAILS z ON z.ITEM_ID=y.ITEM_ID
GROUP BY
y.ITEM_ID
HAVING
COUNT(*) > 1

Oracle SQL Flight Database - Find flight from A to B with stopover?

I have a flight database with the following table:
FID ffrom fto dep arrive days flightno
---- ----- --- ---- ------ ---- --------
167 MPB KYM 1020 1040 0 EA5203
168 MPB KYM 1510 1530 0 EA5205
169 MPB KYM 1705 1725 0 EA5207
221 NEB KYM 850 1025 0 EA782
222 NEB KYM 1355 1530 0 EA784
223 NEB KYM 1810 1945 0 EA786
557 BAH NEB 1305 1500 0 EA686
558 BAH ELM 605 715 0 EA162
559 BAH ELM 1005 1115 0 EA340
560 BAH ELM 1230 1340 0 EA872
561 BAH ELM 1325 1435 0 EA442
562 BAH ELM 1400 1510 0 EA872
563 BAH ELM 1455 1605 0 EA978
564 BAH ELM 1640 1750 0 EA640
565 BAH ELM 2025 2135 0 EA940
566 BAH YDS 645 845 0 EA992
567 BAH YDS 945 1130 0 EA974
1163 PPP KYM 1040 1110 0 EA3201
1164 PPP KYM 1450 1520 0 EA3207
1190 OKR KYM 825 920 0 EA3200
1191 OKR KYM 1010 1100 0 EA3204
1192 OKR KYM 1500 1605 0 EA3214
1517 SVT KYM 810 920 0 EA3201
1518 SVT KYM 940 1050 0 EA3201
1519 SVT KYM 1215 1310 0 EA3211
1520 SVT KYM 1510 1605 0 EA3211
How do I query it to show indirect flights from BAH to KYM?
I've tried a number of ways to no avail. Any help is greatly appreciated.
select CONNECT_BY_ROOT ffrom || SYS_CONNECT_BY_PATH(fto, '/') path_, level, ffrom, fto
from FLIGHTS_TABLE
where level > 1
start with ffrom = 'BAH'
CONNECT BY PRIOR fto = ffrom
ORDER SIBLINGS BY ffrom;
It returns:
PATH_ |LEVEL| FFROM | FTO
BAH/NEB/KYM | 2 | NEB | KYM
BAH/NEB/KYM | 2 | NEB | KYM
BAH/NEB/KYM | 2 | NEB | KYM
It returns 3 rows because there are 3 rows/flights from 'NEB' to 'KYM'. I don't know which flights form the indirect connection.

Oracle HR schema right self join on employees table

I have the following right self-join query performed on Oracle's HR schema, but I can't really understand what it returns. When I performed the exact same query with LEFT JOIN, I understood that it returned all employees regardless of whether they have a supervisor.
The manager IDs are a bit confusing, for example 156, King - but King has an ID of 100.
SELECT emps.employee_id as "Employee", emps.last_name, mgr.employee_id as "Manager", mgr.last_name
FROM employees emps
RIGHT JOIN employees mgr
ON emps.manager_id = mgr.employee_id
The result
Employee LAST_NAME Manager LAST_NAME
---------- ------------------------- ---------- -------------------------
101 Kochhar 100 King
102 De Haan 100 King
103 Hunold 102 De Haan
104 Ernst 103 Hunold
105 Austin 103 Hunold
106 Pataballa 103 Hunold
107 Lorentz 103 Hunold
108 Greenberg 101 Kochhar
109 Faviet 108 Greenberg
110 Chen 108 Greenberg
111 Sciarra 108 Greenberg
112 Urman 108 Greenberg
113 Popp 108 Greenberg
114 Raphaely 100 King
115 Khoo 114 Raphaely
116 Baida 114 Raphaely
117 Tobias 114 Raphaely
118 Himuro 114 Raphaely
119 Colmenares 114 Raphaely
120 Weiss 100 King
121 Fripp 100 King
122 Kaufling 100 King
123 Vollman 100 King
124 Mourgos 100 King
125 Nayer 120 Weiss
126 Mikkilineni 120 Weiss
127 Landry 120 Weiss
128 Markle 120 Weiss
129 Bissot 121 Fripp
130 Atkinson 121 Fripp
131 Marlow 121 Fripp
132 Olson 121 Fripp
133 Mallin 122 Kaufling
134 Rogers 122 Kaufling
135 Gee 122 Kaufling
136 Philtanker 122 Kaufling
137 Ladwig 123 Vollman
138 Stiles 123 Vollman
139 Seo 123 Vollman
140 Patel 123 Vollman
141 Rajs 124 Mourgos
142 Davies 124 Mourgos
143 Matos 124 Mourgos
144 Vargas 124 Mourgos
145 Russell 100 King
146 Partners 100 King
147 Errazuriz 100 King
148 Cambrault 100 King
149 Zlotkey 100 King
150 Tucker 145 Russell
151 Bernstein 145 Russell
152 Hall 145 Russell
153 Olsen 145 Russell
154 Cambrault 145 Russell
155 Tuvault 145 Russell
156 King 146 Partners
157 Sully 146 Partners
158 McEwen 146 Partners
159 Smith 146 Partners
160 Doran 146 Partners
161 Sewall 146 Partners
162 Vishney 147 Errazuriz
163 Greene 147 Errazuriz
164 Marvins 147 Errazuriz
165 Lee 147 Errazuriz
166 Ande 147 Errazuriz
167 Banda 147 Errazuriz
168 Ozer 148 Cambrault
169 Bloom 148 Cambrault
170 Fox 148 Cambrault
171 Smith 148 Cambrault
172 Bates 148 Cambrault
173 Kumar 148 Cambrault
174 Abel 149 Zlotkey
175 Hutton 149 Zlotkey
176 Taylor 149 Zlotkey
177 Livingston 149 Zlotkey
178 Grant 149 Zlotkey
179 Johnson 149 Zlotkey
180 Taylor 120 Weiss
181 Fleaur 120 Weiss
182 Sullivan 120 Weiss
183 Geoni 120 Weiss
184 Sarchand 121 Fripp
185 Bull 121 Fripp
186 Dellinger 121 Fripp
187 Cabrio 121 Fripp
188 Chung 122 Kaufling
189 Dilly 122 Kaufling
190 Gates 122 Kaufling
191 Perkins 122 Kaufling
192 Bell 123 Vollman
193 Everett 123 Vollman
194 McCain 123 Vollman
195 Jones 123 Vollman
196 Walsh 124 Mourgos
197 Feeney 124 Mourgos
198 OConnell 124 Mourgos
199 Grant 124 Mourgos
200 Whalen 101 Kochhar
201 Hartstein 100 King
202 Fay 201 Hartstein
203 Mavris 101 Kochhar
204 Baer 101 Kochhar
205 Higgins 101 Kochhar
206 Gietz 205 Higgins
162 Vishney
133 Mallin
136 Philtanker
154 Cambrault
196 Walsh
104 Ernst
184 Sarchand
172 Bates
197 Feeney
150 Tucker
142 Davies
143 Matos
191 Perkins
119 Colmenares
200 Whalen
183 Geoni
180 Taylor
152 Hall
137 Ladwig
139 Seo
126 Mikkilineni
125 Nayer
170 Fox
175 Hutton
129 Bissot
163 Greene
105 Austin
176 Taylor
188 Chung
116 Baida
115 Khoo
144 Vargas
195 Jones
174 Abel
157 Sully
182 Sullivan
156 King
194 McCain
193 Everett
187 Cabrio
117 Tobias
179 Johnson
135 Gee
159 Smith
131 Marlow
190 Gates
169 Bloom
166 Ande
151 Bernstein
204 Baer
203 Mavris
160 Doran
155 Tuvault
107 Lorentz
185 Bull
128 Markle
134 Rogers
140 Patel
168 Ozer
178 Grant
141 Rajs
181 Fleaur
165 Lee
138 Stiles
173 Kumar
206 Gietz
164 Marvins
202 Fay
112 Urman
189 Dilly
110 Chen
153 Olsen
161 Sewall
186 Dellinger
109 Faviet
177 Livingston
198 OConnell
106 Pataballa
111 Sciarra
118 Himuro
132 Olson
192 Bell
113 Popp
171 Smith
127 Landry
167 Banda
130 Atkinson
158 McEwen
199 Grant
195 rows selected
In my opinion, the query with the right join mixes up managers and employees, which is why it doesn't seem to return a clear answer. With a left join, you ask for "all employees, and if there is a related manager, also that manager". With a right join, you still get "all employees", whether they are managers or not, so the meaning of the "right side" gets lost.
Of course the table is intended to contain both kinds of rows, but you may get a clearer picture with a better separation.
Say a manager is everybody who has no manager_id (which holds if your table is not a deep tree). Then look at this modification:
SELECT emps.employee_id as "Employee", emps.last_name, mgr.employee_id as "Manager", mgr.last_name
FROM employees emps
RIGHT JOIN (SELECT * FROM employees WHERE manager_id IS NULL) mgr
ON emps.manager_id = mgr.employee_id
This way, the data basis for the right join would be a proper selection of all managers. Each of them would then also show up even if no employee reports to them; with the kind of relation you have chosen, however, this last "even" does not actually happen.
Finally, I fully agree with @Ameya Desphande that there is a second King with ID 156. Which is even more puzzling ;-)

Pandas series bar chart plotting

I have the following data in the format of a pandas.core.series.Series (after processing the original DataFrame), and I want to do some visualisation of the data. What I need is to plot a bar chart per Fruit (Apple, Pear, Oranges, etc.) where the annual production figures of producers X1, X2, X3 sit next to each other on the chart (a grouped bar chart). The x axis of the figure should be the Year.
Could anyone help please!
Thanks
The data:
Fruit Producer Year Production (tons)
Apple X1 1981 125
1982 146
1983 251
1984 278
1985 161
X2 1981 510
1982 456
1983 531
1984 563
1985 508
X3 1981 68
1982 121
1983 126
1984 189
1985 134
Pear X1 1981 126
1982 148
1983 255
1984 272
1985 166
X2 1981 515
1982 454
1983 539
1984 565
1985 558
X3 1981 516
1982 485
1983 567
1984 519
1985 588
Oranges X1 1981 68
1982 100
1983 109
1984 190
1985 136
X2 1981 50
1982 155
1983 126
1984 155
1985 139
X3 1981 12
1982 163
1983 198
1984 174
1985 136
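A minimal sketch of one possible approach, assuming the data is a Series named s with a three-level MultiIndex (Fruit, Producer, Year): unstack the Producer level so each producer becomes a column, then draw one grouped bar chart per fruit with Year on the x axis.
import matplotlib.pyplot as plt

# s is assumed to be the Series shown above, with a MultiIndex (Fruit, Producer, Year)
wide = s.unstack("Producer")  # columns: X1, X2, X3; index: (Fruit, Year)
for fruit, sub in wide.groupby(level="Fruit"):
    ax = sub.droplevel("Fruit").plot(kind="bar", title=fruit)  # grouped bars, x axis = Year
    ax.set_xlabel("Year")
    ax.set_ylabel("Production (tons)")
plt.show()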