How can I merge two dataframes outside the intersection of the data? - pandas

I have a dataframe of presidential candidates, the donation amounts they received, and the states the donations came from (contbr_st).
However, contbr_st includes non-state abbreviations such as AA, FF, and FM, as shown below. I also have a single-column dataframe of state abbreviations.
The dataframe below is "total":
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AB 2048.00 NaN
AE 42973.75 5680.00
AK 281840.15 86204.24
AL 543123.48 527303.51
AP 37130.50 1655.00
AR 359247.28 105556.00
AS 2955.00 NaN
AZ 1506476.98 1888436.23
CA 23824984.24 11237636.60
CO 2132429.49 1506714.12
CT 2068291.26 3499475.45
DC 4373538.80 1025137.50
DE 336669.14 82712.00
FF NaN 99030.00
FL 7318178.58 8338458.81
FM 600.00 NaN
The dataframe below is the list of state abbreviations; it is "state":
state
0 AL
1 AK
2 AZ
3 AR
4 CA
5 CO
6 CT
7 DC
8 DE
9 FL
10 GA
11 HI
12 ID
13 IL
14 IN
15 IA
16 KS
17 KY
18 LA
19 ME
20 MD
21 MA
22 MI
23 MN
24 MS
25 MO
26 MT
27 NE
28 NV
29 NH
30 NJ
31 NM
32 NY
33 NC
34 ND
35 OH
36 OK
37 OR
38 PA
39 RI
40 SC
41 SD
42 TN
43 TX
44 UT
45 VT
46 VA
47 WA
48 WV
49 WI
50 WY
Is there a simple way in pandas to combine these two dataframes so as to discard the intersecting states and keep only the non-state rows from the original dataframe ("total")?
So my expected output would include only the non-state abbreviation data, as below:
cand_nm Obama, Barack Romney, Mitt
contbr_st
AA 56405.00 135.00
AP 37130.50 1655.00
FF NaN 99030.00
FM 600.00 NaN
.
.
The only way I can think of is to take the state labels from each dataframe, convert them to sets, and use the difference() method; then convert the result back to a dataframe and merge it with "total".
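A minimal sketch of that set-difference idea, assuming "total" is indexed by contbr_st and "state" has a single column named state:
# labels in "total" that are not real state abbreviations
non_states = set(total.index).difference(state['state'])
result = total.loc[sorted(non_states)]
print(result)
Equivalently, total[~total.index.isin(state['state'])] gives the same rows without the round-trip through sets.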

Related

Not sure of the order of melt/stack/unstack to morph my DataFrame

I have a dataframe with multiindex columns. I want to preserve the existing index, but move a level from the multiindex columns to become a sublevel of the index instead.
I can't figure out the correct incantation of melt/stack/unstack/pivot to move from what I have to what I want. unstack() turned things into a Series and lost the original date index.
import numpy as np
import pandas as pd

names = ['mike', 'matt', 'dave']
details = ['bla', 'foo']
columns = pd.MultiIndex.from_tuples([(n, d) for n in names for d in details])
index = pd.date_range(start="2022-10-30", end="2022-11-3", freq="d")
have = pd.DataFrame(np.random.randint(0, 100, size=(5, 6)), index=index, columns=columns)
have
want_columns = details
want_index = pd.MultiIndex.from_product([index, names])
want = pd.DataFrame(np.random.randint(0, 100, size=(15, 2)), index=want_index, columns=want_columns)
want
Use DataFrame.stack with level=0:
print (have.stack(level=0))
bla foo
2022-10-30 dave 88 18
matt 49 55
mike 92 45
2022-10-31 dave 33 27
matt 53 41
mike 24 16
2022-11-01 dave 48 19
matt 94 75
mike 11 19
2022-11-02 dave 16 90
matt 14 93
mike 38 72
2022-11-03 dave 80 15
matt 97 2
mike 11 94
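Note that stack sorts the new index level alphabetically (dave, matt, mike), while want was built in the original names order. If that ordering matters, reindexing on the second level restores it; a minimal sketch:
res = have.stack(level=0)
res = res.reindex(names, level=1)  # back to mike, matt, dave within each date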

Find maximum value of each group within a Pandas DataFrame

I have a question and am hoping you could give me a little support. I looked into the archive here and found a solution, but it takes a lot of time and isn't "beautiful", since it works with loops.
Suppose you have the following frame:
System Country_Key Name Bank_number_length Check rule for bank acct no.
PEM AD Andorra 8 2
PL1 AD Andorra 15 5
PPE AD Andorra 14 5
P11 AD Andorra 9 5
P16 AD Andorra 12 4
PEM AE Emirates 3 5
PL1 AE Emirates 15 4
PPE AE Emirates 15 5
P11 AE Emirates 15 6
P16 AE Emirates 13 5
I found the following approach for two columns: Get the max value from each group with pandas.DataFrame.groupby.
However, in my case I really have many columns, and I need to set the index to the first three columns, "System", "Country_Key" and "Name".
My desired output would be the following:
System Country_Key Name Bank_number_length Check rule for bank acct no.
PEM AD Andorra
PL1 15 5
PPE 5
P11 5
P16
PEM AE Emirates
PL1 15
PPE 15
P11 15 6
P16
So effectively dropping every value except the group maximum. Any kind of hint would be really beneficial.
You can try masking the values that are not the group max to an empty string, and likewise masking the duplicated key values to an empty string:
keys = ['Country_Key', 'Name']
cols = ['Bank_number_length', 'Check rule for bank acct no.']
# blank out every value that is not the maximum of its group
df[cols] = df[cols].mask(df[cols].ne(df.groupby(keys)[cols].transform('max')), '')
# blank out repeated key values so each group is labelled only once
df.loc[df.duplicated(keys), keys] = ''
print(df)
System Country_Key Name Bank_number_length Check rule for bank acct no.
0 PEM AD Andorra
1 PL1 15 5
2 PPE 5
3 P11 5
4 P16
5 PEM AE Emirates
6 PL1 15
7 PPE 15
8 P11 15 6
9 P16
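Note that masking to empty strings turns the numeric columns into object dtype, so this is best kept as a final display step. If you only need the maxima themselves, the plain aggregation referenced above keeps the numbers numeric (run it on the original frame, before the masking); a minimal sketch:
cols = ['Bank_number_length', 'Check rule for bank acct no.']
print(df.groupby(['Country_Key', 'Name'], as_index=False)[cols].max())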

Need assistance with the query below

I'm getting this error:
Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
Code:
import webbrowser
import pandas as pd

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
webbrowser.open(website)
league_frame = pd.read_clipboard()
And the above-mentioned error comes next.
I believe you need read_html, which returns all the parsed tables; select the DataFrame by position:
import pandas as pd

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
# select the first parsed table
df1 = pd.read_html(website)[0]
print (df1.head())
Win % Wins Losses Year Team Comment
0 0.798 67 17 1882 Chicago White Stockings best pre-modern season
1 0.763 116 36 1906 Chicago Cubs best 154-game NL season
2 0.721 111 43 1954 Cleveland Indians best 154-game AL season
3 0.716 116 46 2001 Seattle Mariners best 162-game AL season
4 0.667 108 54 1975 Cincinnati Reds best 162-game NL season
# select the second parsed table
df2 = pd.read_html(website)[1]
print (df2)
Win % Wins Losses Season Team \
0 0.890 73 9 2015–16 Golden State Warriors
1 0.110 9 73 1972–73 Philadelphia 76ers
2 0.106 7 59 2011–12 Charlotte Bobcats
Comment
0 best 82 game season
1 worst 82-game season
2 worst season statistically
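Note that pd.read_html downloads and parses the page on every call (and needs an HTML parser such as lxml or html5lib installed), so it can be worth fetching once and indexing into the returned list; a minimal sketch:
import pandas as pd

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
tables = pd.read_html(website)   # one fetch, every table on the page
df1, df2 = tables[0], tables[1]  # select tables by position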

AWS Athena SQL to group and find minimum in distinct rows

I have a query against AWS Athena, and the core of it works great. My company's code is AA (field ACD) and our competitors' codes are BB, CC and DD (field OCD). So for each distinct trip my company makes, I get a series of similar trips from competitors. I end up with a table like this:
main =
AID ATRIPDT ACD ACAR CY1 CY2 OID OTRIPDT OCD BCAR DELMN
0 10/30/2018 AA XX22 LAS LAX 300 10/30/2018 BB ZZ1 21
0 10/30/2018 AA XX22 LAS LAX 544 10/30/2018 CC T09 36
0 10/30/2018 AA XX22 LAS LAX 755 10/30/2018 BB KLQ 57
0 10/30/2018 AA XX22 LAS LAX 912 10/30/2018 DD 75Q 5
1 10/30/2018 AA P700 LAS LAX 390 10/30/2018 BB MNZ 13
1 10/30/2018 AA P700 LAS LAX 603 10/30/2018 BB JJ1 30
However, the last step is to group by AID and select only one record for each OCD, which should be the one with the minimum value of DELMN.
In this case I am looking for this as a result:
AID ATRIPDT ACD ACAR CY1 CY2 OID OTRIPDT OCD BCAR DELMN
0 10/30/2018 AA XX22 LAS LAX 300 10/30/2018 BB ZZ1 21
0 10/30/2018 AA XX22 LAS LAX 544 10/30/2018 CC T09 36
0 10/30/2018 AA XX22 LAS LAX 912 10/30/2018 DD 75Q 5
1 10/30/2018 AA P700 LAS LAX 390 10/30/2018 BB MNZ 13
I tried this:
with main as
(
<complex query that returns main table>
)
select * from main
where DELMN = (select min(DELMN) from main as b where b.OCD = main.OCD)
which returns a total of three records, so I am not setting up the grouping correctly. My brain is drained, so I'm not sure what else to try.
You want one row per AID+OCD value, but your subquery correlates only on OCD, so it takes the minimum across both of your trips. Correlate on AID as well:
WITH main AS
(
<complex query that returns main table>
)
SELECT *
FROM main
WHERE DELMN = (SELECT MIN(DELMN)
FROM main AS b
WHERE b.OCD = main.OCD
AND b.AID = main.AID)
It won't be a very efficient query, but it should work. It can be made more efficient by JOINing to a query that pulls the minimum DELMN grouped by AID and OCD (rather than using a sub-select that runs for every row); that way the table only needs to be scanned once, as sketched below. Don't worry about that unless you have lots of rows and it slows down.
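A minimal sketch of that JOIN form (mins and min_delmn are just illustrative names, and ties on DELMN within a group will all be returned):
WITH main AS
(
<complex query that returns main table>
),
mins AS
(
SELECT AID, OCD, MIN(DELMN) AS min_delmn
FROM main
GROUP BY AID, OCD
)
SELECT main.*
FROM main
JOIN mins
ON main.AID = mins.AID
AND main.OCD = mins.OCD
AND main.DELMN = mins.min_delmn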

How to concat corresponding row values to make column names in pandas?

I have the messy dataframe below, and I need to combine rows 0 and 1 to make the column names, keeping the remaining rows as-is:
Start Date 2005-01-01 Unnamed: 3 Unnamed: 4 Unnamed: 5
Dat an_1 an_2 an_3 an_4 an_5
mt mt s t inch km
23 45 67 78 89 9000
Change it to the dataframe below:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
23 45 67 78 89 9000
IIUC
df.columns = df.loc[0] + '_' + df.loc[1]  # join the two header rows with '_'
df = df.loc[[2]]                          # keep only the data row
df
Out[429]:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
2 23 45 67 78 89 9000
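If the leftover row label 2 is unwanted, or if there are more data rows below it to keep, a reset_index gives a clean 0-based index; a minimal sketch:
df = df.loc[2:].reset_index(drop=True)  # keep all rows from 2 on, renumbered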