My code raises an error when I run it. Why might that be?
import pandas as pd
df1 = pd.read_csv('sample.csv')
points = [0,1,2,3,4,5,6,7,8,9,10,11,12]
bins = ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']
df1['DA'] = pd.cut(df1.AA,bins,labels=points)
df1['DE'] = pd.cut(df1['BB'],bins,labels=points)
df1['CDI'] = pd.cut(df1.CC,bins,labels=points)
The error:
ValueError: could not convert string to float: 'X'
EDIT: Those are student grades that I want to convert to points: grade A is 12 points, and so on down the list...
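For context: pd.cut bins numeric data into intervals, so its bins argument must be a monotonically increasing sequence of numbers; passing the grade strings as bins is what triggers the ValueError above. A minimal sketch of valid pd.cut usage (values invented for illustration):

```python
import pandas as pd

# bins are numeric edges; labels name the resulting intervals
s = pd.Series([3, 7, 11])
cut = pd.cut(s, bins=[0, 5, 10, 15], labels=['low', 'mid', 'high'])
print(cut.tolist())  # ['low', 'mid', 'high']
```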
You can try using replace instead. First create a dict with the conversion you want to apply, then you can create your columns:
# Sample DataFrame
df = pd.DataFrame({'AA': ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']})
# conversion dict
points = [0,1,2,3,4,5,6,7,8,9,10,11,12]
grades = ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']
conversion = dict(zip(grades, points))
# applying conversion
df['DA'] = df.AA.replace(conversion)
The DataFrame will now look like:
AA DA
0 X 0
1 E 1
2 D- 2
3 D 3
4 D+ 4
5 C- 5
6 C 6
7 C+ 7
8 B- 8
9 B 9
10 B+ 10
11 A- 11
12 A 12
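As a side note, Series.map with the same dict is a common alternative sketch; the difference is that map turns any value missing from the dict into NaN, while replace leaves such values untouched:

```python
import pandas as pd

grades = ['X','E','D-','D','D+','C-','C','C+','B-','B','B+','A-','A']
conversion = dict(zip(grades, range(13)))

# map looks up every value in the dict
s = pd.Series(['A', 'B+', 'X'])
print(s.map(conversion).tolist())  # [12, 10, 0]
```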
I have 3 DataFrames like below.
A =
ops lat
0 9,453 13,536
1 8,666 14,768
2 8,377 15,278
3 8,236 15,536
4 8,167 15,668
5 8,099 15,799
6 8,066 15,867
7 8,029 15,936
8 7,997 16,004
9 7,969 16,058
10 7,962 16,073
B =
ops lat
0 9,865 12,967
1 8,908 14,366
2 8,546 14,976
3 8,368 15,294
4 8,289 15,439
5 8,217 15,571
6 8,171 15,662
7 8,130 15,741
8 8,093 15,809
9 8,072 15,855
10 8,058 15,882
C =
ops lat
0 9,594 13,332
1 8,718 14,670
2 8,396 15,242
3 8,229 15,553
4 8,137 15,725
5 8,062 15,875
6 8,008 15,982
7 7,963 16,070
8 7,919 16,159
9 7,892 16,218
10 7,874 16,255
How do I merge them into a single dataframe where the ops column is the sum and the lat column is the average of these three dataframes?
pd.concat() seems to just append the dataframes.
There are likely many ways, but to keep it on the same line of thinking as you had with pd.concat, the below will work.
First, concat your dataframes together; then calculate .sum() and .mean() on the combined dataframe and construct the final table from those two values.
Dummy Data and Example Below:
import pandas as pd
data = {'Name': ['node1','node1','node1','node2','node2','node3'],
        'Value': [1000,20000,40000,30000,589,682],
        'Value2': [303,2084,494,2028,4049,112]}
df1 = pd.DataFrame(data)
data2 = {'Name': ['node1','node1','node1','node2','node2','node3'],
         'Value': [1000,20000,40000,30000,589,682],
         'Value2': [8,234,75,123,689,1256]}
df2 = pd.DataFrame(data2)
joined = pd.concat([df1, df2])
final = pd.DataFrame({'Sum_Col': [joined['Value'].sum()],
                      'Mean_Col': [joined['Value2'].mean()]})
print(final)  # use display(final) in a Jupyter notebook
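Note that the above collapses everything into a single sum and a single mean. If the goal is instead row-by-row (sum ops and average lat across the three frames at each index position), a groupby-on-index sketch, with small invented stand-ins for A, B, and C:

```python
import pandas as pd

# Invented stand-ins for the question's A, B, C
A = pd.DataFrame({'ops': [10, 20], 'lat': [1.0, 2.0]})
B = pd.DataFrame({'ops': [30, 40], 'lat': [3.0, 4.0]})
C = pd.DataFrame({'ops': [50, 60], 'lat': [5.0, 6.0]})

# Stack the frames, then aggregate rows sharing the same index label
merged = pd.concat([A, B, C]).groupby(level=0).agg({'ops': 'sum', 'lat': 'mean'})
print(merged)
```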
I'm facing a bit of a problem. This is my dataframe:
Students Subject Mark
1 M F 7 4 3 7
2 I 5 6
3 M F I S 2 3 0
4 M 2 2
5 F M I 5 1
6 I M F 6 2 3
7 I M 7
I want to plot a barplot with four bars, one for each of these groups of students:
Have 3 or more letters in the column "Subject"
Have at least one 3 in the column "Mark"
Satisfy both conditions
Satisfy neither condition
At first I was stuck, but it was suggested that I proceed this way:
df["Subject"].str.count(r"\w+") >= 3
df["Mark"].str.count("3") >= 1
(df["Subject"].str.count(r"\w+") >= 3) & (df["Mark"].str.count("3") >= 1)
What I obtain are three boolean columns, but I don't know how to go from here to plot the barplot.
I was thinking about counting the values in each column, but I can't seem to find a way to do so, since it looks like value_counts() can't be applied to the boolean columns.
If you have any idea, please help!
I think you need to create a DataFrame with all 4 masks, then count the Trues with sum, and finally plot:
m1 = df["Subject"].str.count(r"\w+") >= 3
m2 = df["Mark"].str.count("3") >= 1
df1 = pd.concat([m1, m2, m1 & m2, ~m1 & ~m2], axis=1, keys=('a','b','c','d'))
out = df1.sum()
If you need a seaborn solution:
import seaborn as sns
ax = sns.barplot(x="index", y="val", data=out.reset_index(name='val'))
For a pandas (matplotlib) solution:
out.plot.bar()
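Putting the answer together with a few invented rows (column names as in the question), the counts that feed the plot come out as:

```python
import pandas as pd

# Hypothetical sample; each cell holds space-separated letters/marks
df = pd.DataFrame({'Subject': ['M F', 'I', 'M F I S'],
                   'Mark': ['7 4 3 7', '5 6', '2 3 0']})

m1 = df['Subject'].str.count(r'\w+') >= 3   # 3 or more letters
m2 = df['Mark'].str.count('3') >= 1         # at least one 3

# One column per condition; summing booleans counts the Trues
out = pd.concat([m1, m2, m1 & m2, ~m1 & ~m2],
                axis=1, keys=('a', 'b', 'c', 'd')).sum()
print(out.tolist())  # [1, 2, 1, 1]
```

out.plot.bar() then draws one bar per condition.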
When debugging a nasty error in my code I came across this, which looks like an inconsistency in the way DataFrames work (using pandas 1.0.3):
import pandas as pd
df = pd.DataFrame([[10*k, 11, 22, 33] for k in range(4)], columns=['d', 'k', 'c1', 'c2'])
y = df.k
X = df[['c1', 'c2']]
Then I tried to add a column to y (forgetting that y is a Series, not a Dataframe):
y['d'] = df['d']
I'm now aware that this adds a weird row to the Series; y is now:
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
But the weird thing is that now:
>>> df.shape, df['k'].shape
((4, 4), (5,))
And df and df['k'] look like:
d k c1 c2
0 0 11 22 33
1 10 11 22 33
2 20 11 22 33
3 30 11 22 33
and
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
There are a few things at work here:
A pandas series can store objects of arbitrary types.
y['d'] = _ adds a new entry to the series y with name 'd'.
Thus, y['d'] = df['d'] adds a new entry to the series y with name 'd' whose value is the entire series df['d'].
So you have added a series as the last entry of the series y. You can verify that
(y['d'] == y.iloc[-1]).all() == True and
(y.iloc[-1] == df['d']).all() == True.
To clarify the inconsistency between df and df.k: note that df.k, df['k'], and df.loc[:, 'k'] return a 'view' of column k, so adding an entry to the series appends it directly through this view. However, df.k shows the entire series, whereas df only shows the series up to df.shape[0] rows. Hence the inconsistent behavior.
I agree that this behavior is prone to bugs and should be fixed. View vs. copy is a common cause for many issues. In this case, df.iloc[:, 1] behaves correctly and should be used instead.
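A sketch of the safer pattern: select with a list of column names to get a DataFrame copy, so adding a column behaves as intended and leaves df alone (the variable name Y is mine):

```python
import pandas as pd

df = pd.DataFrame([[10 * k, 11, 22, 33] for k in range(4)],
                  columns=['d', 'k', 'c1', 'c2'])

# df[['k']] (note the list) returns a DataFrame copy, not a Series view
Y = df[['k']].copy()
Y['d'] = df['d']
print(Y.shape, df['k'].shape)  # (4, 2) (4,)
```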
Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like to produce a new dataframe by multiplying each row of the second dataframe against the first dataframe: the Name1 value is multiplied by the values in the dataframe1 column that matches that row's Name4 value.
Is there any nice way to do this? I was trying to look at methods like where, condition, and apply, but I haven't understood them well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
I changed the data in df1 at this point so this new example turns out better.
From those two dataframes, the dataframe I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged, but depending on the letter in col4, col1 was multiplied by the corresponding column from df1:
d3 = {'col1': [3,6,9,12,15,18, 12,10,8,6,4,2, 7,14,21,28,35,42, 8,8,8,8,8,8],
      'col2': [2]*24,
      'col3': ['data']*24,
      'col4': ['D']*6 + ['L']*6 + ['D']*6 + ['W']*6}
df3 = pd.DataFrame(data=d3)
I think I understand what you are trying to achieve. You want to multiply each row r in df2 with the corresponding column c in df1, where the elements of c are multiplied only with the first element of r; the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2 but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    repeated_rows = pd.concat([row] * len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows
pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
#resulting in
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
0 12 2 data L
1 10 2 data L
2 8 2 data L
3 6 2 data L
4 4 2 data L
5 2 2 data L
0 7 2 data D
1 14 2 data D
2 21 2 data D
3 28 2 data D
4 35 2 data D
5 42 2 data D
0 8 2 data W
1 8 2 data W
2 8 2 data W
3 8 2 data W
4 8 2 data W
5 8 2 data W
...
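A vectorized sketch of the same expansion, assuming the d1/d2 data from the question's EDIT (df1 has 6 rows here, so each df2 row expands to 6 rows):

```python
import pandas as pd

df1 = pd.DataFrame({'D': [1,2,3,4,5,6], 'W': [2,2,2,2,2,2],
                    'L': [6,5,4,3,2,1], 'S': [1,2,3,4,5,6]})
df2 = pd.DataFrame({'col1': [3,2,7,4,5,6], 'col2': [2,2,2,2,3,4],
                    'col3': ['data'] * 6, 'col4': ['D','L','D','W','S','S']})

# Repeat every df2 row once per df1 row, then multiply col1 by the
# df1 column named in col4. df1[...] selects those columns in order;
# transposing and flattening lines the factors up row block by row block.
out = df2.loc[df2.index.repeat(len(df1))].reset_index(drop=True)
factors = df1[df2['col4'].tolist()].to_numpy().T.ravel()
out['col1'] = out['col1'] * factors
print(out['col1'].tolist()[:12])  # [3, 6, 9, 12, 15, 18, 12, 10, 8, 6, 4, 2]
```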
Given a pandas crosstab, how do you convert that into a stacked dataframe?
Assume you have a stacked dataframe. First we convert it into a crosstab. Now I would like to revert back to the original stacked dataframe. I searched for a problem statement that addresses this requirement, but could not find one that hits the nail on the head. In case I have missed any, please leave a note in the comments.
I would like to document the best practice here, so thank you for your support.
I know that pandas.DataFrame.stack() would be the best approach, but one needs to be careful about the "level" the stacking is applied to.
Input: Crosstab:
Label a b c d r
ID
1 0 1 0 0 0
2 1 1 0 1 1
3 1 0 0 0 1
4 1 0 0 1 0
6 1 0 0 0 0
7 0 0 1 0 0
8 1 0 1 0 0
9 0 1 0 0 0
Output: Stacked DataFrame:
ID Label
0 1 b
1 2 a
2 2 b
3 2 d
4 2 r
5 3 a
6 3 r
7 4 a
8 4 d
9 6 a
10 7 c
11 8 a
12 8 c
13 9 b
Step-by-step Explanation:
First, let's make a function that creates our data. Note that it randomly generates the stacked dataframe, so the final output may differ from what I have given below.
Helper Function: Make the Stacked And Crosstab DataFrames
import numpy as np
import pandas as pd
# Make stacked dataframe
def _create_df():
    """
    This dataframe will be used to create a crosstab.
    """
    B = np.array(list('abracadabra'))
    A = np.arange(len(B))
    AB = list()
    for i in range(20):
        a = np.random.randint(1, 10)
        b = np.random.randint(1, 10)
        AB += [(a, b)]
    AB = np.unique(np.array(AB), axis=0)
    AB = np.unique(np.array(list(zip(A[AB[:, 0]], B[AB[:, 1]]))), axis=0)
    AB_df = pd.DataFrame({'ID': AB[:, 0], 'Label': AB[:, 1]})
    return AB_df
original_stacked_df = _create_df()
# Make crosstab
crosstab_df = pd.crosstab(original_stacked_df['ID'],
                          original_stacked_df['Label']).reindex()
What to expect?
You would expect a function that regenerates the stacked dataframe from the crosstab. I will provide my own solution in the answer section. If you can suggest something better, that would be great.
Other References:
Closest stackoverflow discussion: pandas stacking a dataframe
Misleading stackoverflow question-topic: change pandas crossstab dataframe into plain table format:
You can just use stack:
df[df.astype(bool)].stack().reset_index().drop(columns=0)
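A small round-trip sketch of that one-liner (data invented; keeping only the ID and Label columns after the stack):

```python
import pandas as pd

stacked = pd.DataFrame({'ID': [1, 2, 2, 3], 'Label': ['b', 'a', 'b', 'a']})
ct = pd.crosstab(stacked['ID'], stacked['Label'])

# Mask the zero cells, stack away the NaNs, keep the (ID, Label) pairs
recon = ct[ct.astype(bool)].stack().dropna().reset_index()[['ID', 'Label']]
print(recon.values.tolist())  # [[1, 'b'], [2, 'a'], [2, 'b'], [3, 'a']]
```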
The following produces the desired outcome.
def crosstab2stacked(crosstab):
    stacked = crosstab.stack(dropna=True).reset_index()
    stacked = stacked[stacked.replace(0, np.nan)[0].notnull()].drop(columns=[0])
    return stacked
# Make original dataframe
original_stacked_df = _create_df()
# Make crosstab dataframe
crosstab_df = pd.crosstab(original_stacked_df['ID'],
                          original_stacked_df['Label']).reindex()
# Recontruct stacked dataframe
recon_stacked_df = crosstab2stacked(crosstab = crosstab_df)
Check if original == reconstructed:
np.all(original_stacked_df == recon_stacked_df)
Output: True