Need assistance with the query below (pandas)

I'm getting this error:
Error tokenizing data. C error: Expected 2 fields in line 11, saw 3
Code:
import webbrowser
import pandas as pd

website = 'https://en.wikipedia.org/wiki/Winning_percentage'
webbrowser.open(website)
league_frame = pd.read_clipboard()
The error above is raised by the read_clipboard call.

I believe you need to use read_html, which returns a list of all parsed tables; then select a DataFrame by position:
website = 'https://en.wikipedia.org/wiki/Winning_percentage'
#select first parsed table
df1 = pd.read_html(website)[0]
print (df1.head())
   Win %  Wins  Losses  Year                     Team                  Comment
0  0.798    67      17  1882  Chicago White Stockings   best pre-modern season
1  0.763   116      36  1906             Chicago Cubs  best 154-game NL season
2  0.721   111      43  1954        Cleveland Indians  best 154-game AL season
3  0.716   116      46  2001         Seattle Mariners  best 162-game AL season
4  0.667   108      54  1975          Cincinnati Reds  best 162-game NL season
#select second parsed table
df2 = pd.read_html(website)[1]
print (df2)
   Win %  Wins  Losses   Season                   Team  \
0  0.890    73       9  2015–16  Golden State Warriors
1  0.110     9      73  1972–73     Philadelphia 76ers
2  0.106     7      59  2011–12      Charlotte Bobcats
                      Comment
0         best 82 game season
1        worst 82-game season
2  worst season statistically
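For context on why the original call failed: read_clipboard passes the clipboard text to read_csv with a whitespace separator, and webbrowser.open only opens the page without copying anything, so whatever happened to be on the clipboard got parsed. A minimal sketch that reproduces the same kind of error, assuming made-up text in which one line has more fields than the line before it (the sample string and the variable names are only for illustration):

```python
import io

import pandas as pd

# Hypothetical stand-in for clipboard contents: the last line has 3 fields
clipboard_text = "a b\n1 2\n3 4 5\n"

message = ""
try:
    # read_clipboard forwards its text to read_csv, by default with sep=r"\s+"
    pd.read_csv(io.StringIO(clipboard_text), sep=r"\s+")
except pd.errors.ParserError as exc:
    message = str(exc)

print(message)
```

So the error says more about the clipboard contents than about the Wikipedia page, which is why reading the tables directly with read_html is the more reliable route.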

Related

Not sure of the order of melt/stack/unstack to reshape my DataFrame

I have a dataframe with MultiIndex columns. I want to preserve the existing index, but move one level of the column MultiIndex so that it becomes a sublevel of the index instead.
I can't figure out the correct incantation of melt/stack/unstack/pivot to get from what I have to what I want. Calling unstack() turned things into a Series and lost the original date index.
import numpy as np
import pandas as pd

names = ['mike', 'matt', 'dave']
details = ['bla', 'foo']
columns = pd.MultiIndex.from_tuples((n, d) for n in names for d in details)
index = pd.date_range(start="2022-10-30", end="2022-11-03", freq="D")
have = pd.DataFrame(np.random.randint(0, 100, size=(5, 6)), index=index, columns=columns)
have
want_columns = details
want_index = pd.MultiIndex.from_product([index, names])
want = pd.DataFrame(np.random.randint(0, 100, size=(15, 2)), index=want_index, columns=want_columns)
want
Use DataFrame.stack with level=0:
print (have.stack(level=0))
                 bla  foo
2022-10-30 dave   88   18
           matt   49   55
           mike   92   45
2022-10-31 dave   33   27
           matt   53   41
           mike   24   16
2022-11-01 dave   48   19
           matt   94   75
           mike   11   19
2022-11-02 dave   16   90
           matt   14   93
           mike   38   72
2022-11-03 dave   80   15
           matt   97    2
           mike   11   94
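A self-contained sketch of the same idea, assuming the sample frame from the question. The reordering step at the end and the name stacked are my additions: depending on the pandas version, stack may sort the moved level (as in the output above) or keep the original column order, so the last lines force the names back into the order used in the question.

```python
import numpy as np
import pandas as pd

names = ["mike", "matt", "dave"]
details = ["bla", "foo"]
columns = pd.MultiIndex.from_product([names, details])
index = pd.date_range(start="2022-10-30", end="2022-11-03", freq="D")
have = pd.DataFrame(np.random.randint(0, 100, size=(5, 6)), index=index, columns=columns)

# Move the outer column level (the names) into the row index
stacked = have.stack(level=0)

# Optional: put names and detail columns back in the order from the question
want_index = pd.MultiIndex.from_product([index, names])
stacked = stacked.reindex(want_index)[details]
print(stacked)
```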

Smoothed Average over rows and columns with pandas

I am trying to create a function that averages over both row and column. For example:
State   1943  1944  1945  1946  1947  (1947_AVG)  1948  (1948_AVG)
Alaska     1     2     3     4     5           2     6           3
CA       234   234   234  6677    34
I want code that computes the average for 1947 from 1943, 1944, and 1945, the average for 1948 from 1944, 1945, and 1946, and so on.
I currently have:
d3['pandas_SMA_Year'] = d3.iloc[:,1].rolling(window=3).mean()
But this is simply working over the rows, not the columns, and it doesn't take into account the fact that I'm looking 2 years back. Please and thank you for any guidance!
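One possible approach, sketched under the assumption that the year columns are in chronological order and that State is (or can be made) the index: transpose so the years run down the rows, take a 3-year rolling mean, shift it forward by 2 years, and transpose back. The Alaska values come from the example above; the names sma and result and the _AVG suffix are illustrative.

```python
import pandas as pd

# Alaska row from the example table, with State as the index
d3 = pd.DataFrame(
    {"State": ["Alaska"], "1943": [1], "1944": [2], "1945": [3],
     "1946": [4], "1947": [5], "1948": [6]}
).set_index("State")

# Years become rows, so rolling() runs across years for each state.
# The window ending in 1945 covers 1943-1945; shift(2) moves it to 1947.
sma = d3.T.rolling(window=3).mean().shift(2).T.add_suffix("_AVG")

# Attach only the averages that have a full 3-year window behind them
result = d3.join(sma.dropna(axis=1, how="all"))
print(result)
```

Years without three usable earlier values (1943 to 1946 here) come out as NaN and are dropped before the join.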

How can I merge two dataframes outside the intersection of the data?

I have a dataframe of presidential candidates, the donation amounts they received, and the states the donations came from (contbr_st).
However, the state column includes non-state abbreviations such as AA, FF, and FM, as shown below. I also have a single-column dataframe of state abbreviations.
The dataframe below is "total":
cand_nm    Obama, Barack  Romney, Mitt
contbr_st
AA              56405.00        135.00
AB               2048.00           NaN
AE              42973.75       5680.00
AK             281840.15      86204.24
AL             543123.48     527303.51
AP              37130.50       1655.00
AR             359247.28     105556.00
AS               2955.00           NaN
AZ            1506476.98    1888436.23
CA           23824984.24   11237636.60
CO            2132429.49    1506714.12
CT            2068291.26    3499475.45
DC            4373538.80    1025137.50
DE             336669.14      82712.00
FF                   NaN      99030.00
FL            7318178.58    8338458.81
FM                600.00           NaN
The dataframe below holds the state abbreviations (the 50 states plus DC); it is "state":
state
0 AL
1 AK
2 AZ
3 AR
4 CA
5 CO
6 CT
7 DC
8 DE
9 FL
10 GA
11 HI
12 ID
13 IL
14 IN
15 IA
16 KS
17 KY
18 LA
19 ME
20 MD
21 MA
22 MI
23 MN
24 MS
25 MO
26 MT
27 NE
28 NV
29 NH
30 NJ
31 NM
32 NY
33 NC
34 ND
35 OH
36 OK
37 OR
38 PA
39 RI
40 SC
41 SD
42 TN
43 TX
44 UT
45 VT
46 VA
47 WA
48 WV
49 WI
50 WY
Is there a simple way in pandas to combine these two dataframes so that the rows for actual states are discarded and only the non-state rows of the original dataframe ('total') are kept?
My expected output would contain only the non-state abbreviation data, as below:
cand_nm    Obama, Barack  Romney, Mitt
contbr_st
AA              56405.00        135.00
AP              37130.50       1655.00
FF                   NaN      99030.00
FM                600.00           NaN
.
.
The only way I can think of is to take the state list from each dataframe, convert both to sets, and use the difference() method; then convert the result to a dataframe and merge it with the "total" dataframe.
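A merge is probably not needed here. A minimal sketch, assuming contbr_st is the index of total and the abbreviations sit in state['state'] as shown: build a boolean mask with Index.isin and invert it. Only a few rows of each frame are reproduced, and the name non_state is illustrative.

```python
import numpy as np
import pandas as pd

# A few rows of "total" from the question, indexed by contbr_st
total = pd.DataFrame(
    {
        "Obama, Barack": [56405.00, 281840.15, 37130.50, np.nan, 600.00],
        "Romney, Mitt": [135.00, 86204.24, 1655.00, 99030.00, np.nan],
    },
    index=pd.Index(["AA", "AK", "AP", "FF", "FM"], name="contbr_st"),
)
total.columns.name = "cand_nm"

# A few rows of the single-column "state" frame
state = pd.DataFrame({"state": ["AL", "AK", "AZ"]})

# Keep only the rows whose index label is NOT a known state abbreviation
non_state = total[~total.index.isin(state["state"])]
print(non_state)
```

The set difference idea from the question gives the same labels; the mask simply skips the round trip through a set and a second dataframe.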

How to concatenate corresponding row values to make column names in pandas?

I have the messy dataframe below. I need to combine rows 0 and 1 to form the column names and keep the remaining rows (from the third row onward) as is:
Start Date 2005-01-01 Unnamed: 3 Unnamed: 4 Unnamed: 5
Dat an_1 an_2 an_3 an_4 an_5
mt mt s t inch km
23 45 67 78 89 9000
and change it to the dataframe below:
Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
23 45 67 78 89 9000
IIUC
# build the new column names from rows 0 and 1
df.columns = df.loc[0] + '_' + df.loc[1]
# keep the remaining rows (row 2 onward)
df = df.loc[2:]
df
Out[429]:
  Dat_mt an_1_mt an_2_s an_3_t an_4_inch an_5_km
2     23      45     67     78        89    9000
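A self-contained version of the same idea, as a sketch. The original header names are assumptions here (the question shows them only partially), and out and new_cols are illustrative names. The astype(str) calls guard against non-string cells in the two header rows.

```python
import pandas as pd

# Rebuild the messy frame; the column labels are placeholders for the real header
df = pd.DataFrame(
    [
        ["Dat", "an_1", "an_2", "an_3", "an_4", "an_5"],
        ["mt", "mt", "s", "t", "inch", "km"],
        [23, 45, 67, 78, 89, 9000],
    ],
    columns=["Start Date", "2005-01-01", "Unnamed: 2", "Unnamed: 3", "Unnamed: 4", "Unnamed: 5"],
)

# Join row 0 and row 1 element-wise to form the new column names
new_cols = df.iloc[0].astype(str) + "_" + df.iloc[1].astype(str)

# Keep everything from the third row onward and renumber the index
out = df.iloc[2:].copy()
out.columns = new_cols.to_list()
out = out.reset_index(drop=True)
print(out)
```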

Dictionary of sorted values gets unsorted when transformed into pandas dataframe

I am reading a csv file with the GDP for the 32 states in Mexico from 1940 to 2004. The columns are state names, and GDP values for each year.
Unfortunately, I can't add images just now... but, basically, the dataframe has as columns the following: state_name, 1940, 1950, etc... the values for state_name are the names of each state (as strings), and the values for the rest of the columns are the GDPs per state per year.
So, I am trying to produce a new dataframe in which there is no longer a state_name column, only the columns 1940, 1950, etc., whose values are no longer the corresponding GDPs but the names of the states ranked by GDP in the given year. The column 1940 in the new dataframe would then list the states not in alphabetical order, as the current output does, but in the order produced by sorting on GDP (as in the loop below, which builds a dictionary).
I am using the following loop to sort the entire data frame by each year from 1940 to 2004 (stored in states) and then take the names column of that sorted data frame (stored in names).
ranks = {}
for year in pibe.columns.values[1:]:
    # DataFrame.sort has been removed from pandas; sort_values replaces it
    states = pibe.sort_values(by=year, ascending=False)
    names = states["entidad"]
    ranks[year] = names
The output of this dictionary looks like below:
{'1940': 1 Baja California
22 Quintana Roo
8 Distrito Federal
9 Durango
21 Queretaro
0 Aguascalientes
2 Baja California Sur
...
Name: entidad, dtype: object,
'1950': 22 Quintana Roo
1 Baja California
8 Distrito Federal
2 Baja California Sur
5 Chihuahua...}
So far so good. But when I try to transform the dictionary into a data frame, it somehow overrides my previous sorting and returns an alphabetically ordered list of state names, so every year column in the new data frame is populated by the same list of names.
To transform the dictionary into a data frame I am using:
pd.DataFrame(ranks)
Create a new dataframe based on the ordering that you need:
In [6]: ordered_df = original_df.sort_values(['Year','GDP'], axis=0, ascending=False)
Create a new dictionary to pass into the final dataframe (this can be done more efficiently):
In [7]: unique_years = {item[1]['Year']:[] for item in ordered_df.iterrows()}
Loop through new dataframe populating the dictionary:
In [8]: for row in ordered_df.iterrows():
   ...:     unique_years[row[1]['Year']].append(row[1]['State'])
Create final dataframe:
In [9]: final_df = pd.DataFrame(unique_years)
Input:
In [11]: original_df
Out[11]:
    Year       State  GDP
0   1945    New York   84
1   1945       Texas   38
2   1945  California   84
3   1946    New York   56
4   1946       Texas    6
5   1946  California   84
6   1947    New York   75
7   1947       Texas   95
8   1947  California   92
9   1948    New York   50
10  1948       Texas   25
11  1948  California   30
12  1949    New York   16
13  1949       Texas   33
14  1949  California   31
15  1950    New York   37
16  1950       Texas   75
17  1950  California   49
18  1951    New York   28
19  1951       Texas   74
20  1951  California   78
21  1952    New York   57
22  1952       Texas    5
23  1952  California   28
Output:
In [12]: final_df
Out[12]:
         1945        1946        1947        1948        1949        1950  \
0    New York  California       Texas    New York       Texas       Texas
1  California    New York  California  California  California  California
2       Texas       Texas    New York       Texas    New York    New York
         1951        1952
0  California    New York
1       Texas  California
2    New York       Texas
Check final dataframe against the ordered dataframe to ensure proper GDP ordering:
In [13]: ordered_df
Out[13]:
    Year       State  GDP
21  1952    New York   57
23  1952  California   28
22  1952       Texas    5
20  1951  California   78
19  1951       Texas   74
18  1951    New York   28
16  1950       Texas   75
17  1950  California   49
15  1950    New York   37
13  1949       Texas   33
14  1949  California   31
12  1949    New York   16
9   1948    New York   50
11  1948  California   30
10  1948       Texas   25
7   1947       Texas   95
8   1947  California   92
6   1947    New York   75
5   1946  California   84
3   1946    New York   56
4   1946       Texas    6
0   1945    New York   84
2   1945  California   84
1   1945       Texas   38
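As for why the ordering disappears in the question: pd.DataFrame aligns the Series in a dict on their index labels, so each sorted Series is put back into index order. A short sketch of one way around that for the wide layout in the question, assuming a tiny made-up frame (the GDP numbers are invented for illustration; the column name entidad and the frame name pibe come from the question): strip the index with to_numpy() before building the dictionary.

```python
import pandas as pd

# Hypothetical miniature of the wide GDP table; values are made up
pibe = pd.DataFrame(
    {
        "entidad": ["Aguascalientes", "Baja California", "Durango"],
        "1940": [10, 30, 20],
        "1950": [25, 5, 15],
    }
)

# For each year, sort by GDP and keep the names as a plain array, so the
# DataFrame constructor has no index labels to realign on
ranks = {
    year: pibe.sort_values(by=year, ascending=False)["entidad"].to_numpy()
    for year in pibe.columns[1:]
}
ranked = pd.DataFrame(ranks)
print(ranked)
```

Row 0 of ranked then holds the top state for each year, row 1 the second, and so on.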