Obtain a barplot from boolean values? - pandas

I'm facing a bit of a problem. This is my dataframe:
Students  Subject   Mark
1         M F       7 4 3 7
2         I         5 6
3         M F I S   2 3 0
4         M         2 2
5         F M I     5 1
6         I M F     6 2 3
7         I M       7
I want to plot a barplot with four bars, one for each group of students satisfying the following conditions:
Have three or more letters in the column "Subject"
Have at least one 3 in the column "Mark"
Satisfy both conditions
Satisfy neither condition
At first I was stuck, but it was suggested that I proceed this way:
df["Subject"].str.count(r"\w+") >= 3
df["Mark"].str.count("3") >= 1
(df["Subject"].str.count(r"\w+") >= 3) & (df["Mark"].str.count("3") >= 1)
What I obtain are three boolean columns, but I don't know how to go from here to plot the barplot.
I was thinking about counting the values in each column, but I don't seem to find a way to do so, since it looks like I can't apply value_counts() to the boolean columns.
If you have any idea, please help!

I think you need to create a DataFrame from all 4 masks, then count the True values with sum and finally plot:
m1 = df["Subject"].str.count(r"\w+") >= 3
m2 = df["Mark"].str.count("3") >= 1
# one column per condition: a = first mask, b = second mask, c = both, d = neither
df1 = pd.concat([m1, m2, m1 & m2, ~m1 & ~m2], axis=1, keys=('a','b','c','d'))
out = df1.sum()
If you need a seaborn solution:
import seaborn as sns
ax = sns.barplot(x="index", y="val", data=out.reset_index(name='val'))
For a pandas (matplotlib) solution:
out.plot.bar()
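Putting it together as a rough, self-contained sketch (this assumes the letters in each row form "Subject" and the digits form "Mark", which is how I read the sample table above):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    "Students": [1, 2, 3, 4, 5, 6, 7],
    "Subject": ["M F", "I", "M F I S", "M", "F M I", "I M F", "I M"],
    "Mark": ["7 4 3 7", "5 6", "2 3 0", "2 2", "5 1", "6 2 3", "7"],
})
m1 = df["Subject"].str.count(r"\w+") >= 3   # three or more subjects
m2 = df["Mark"].str.count("3") >= 1         # at least one mark equal to 3
df1 = pd.concat([m1, m2, m1 & m2, ~m1 & ~m2], axis=1, keys=("a", "b", "c", "d"))
out = df1.sum()                             # number of True values per condition
out.plot.bar()
plt.show()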

Related

Merge Dataframes and apply different math function by column

I have 3 DataFrames like below.
A =
ops lat
0 9,453 13,536
1 8,666 14,768
2 8,377 15,278
3 8,236 15,536
4 8,167 15,668
5 8,099 15,799
6 8,066 15,867
7 8,029 15,936
8 7,997 16,004
9 7,969 16,058
10 7,962 16,073
B =
ops lat
0 9,865 12,967
1 8,908 14,366
2 8,546 14,976
3 8,368 15,294
4 8,289 15,439
5 8,217 15,571
6 8,171 15,662
7 8,130 15,741
8 8,093 15,809
9 8,072 15,855
10 8,058 15,882
C =
ops lat
0 9,594 13,332
1 8,718 14,670
2 8,396 15,242
3 8,229 15,553
4 8,137 15,725
5 8,062 15,875
6 8,008 15,982
7 7,963 16,070
8 7,919 16,159
9 7,892 16,218
10 7,874 16,255
How do I merge them into a single dataframe where the ops column is the sum and the lat column is the average over these three dataframes?
pd.concat() - seems to append the dataframes.
There are likely many ways, but to keep it on the same line of thinking as you had with pd.concat, the below will work.
First, concat your dataframes together; then calculate .sum() and .mean() on the newly created dataframe and construct the final table from those two fields.
Dummy Data and Example Below:
import pandas as pd

data = {'Name': ['node1','node1','node1','node2','node2','node3'],
        'Value': [1000,20000,40000,30000,589,682],
        'Value2': [303,2084,494,2028,4049,112]}
df1 = pd.DataFrame(data)

data2 = {'Name': ['node1','node1','node1','node2','node2','node3'],
         'Value': [1000,20000,40000,30000,589,682],
         'Value2': [8,234,75,123,689,1256]}
df2 = pd.DataFrame(data2)

joined = pd.concat([df1, df2])

final = pd.DataFrame({'Sum_Col': [joined["Value"].sum()],
                      'Mean_Col': [joined["Value2"].mean()]})
display(final)  # display() is available in Jupyter/IPython; use print(final) elsewhere
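For the original A/B/C frames, if the intention is a row-by-row combination (ops summed, lat averaged per row), here is a minimal sketch under the assumption that the three frames share the same index and that the comma-formatted numbers have already been parsed as numeric:
import pandas as pd
# tiny stand-ins for A, B and C with a shared index
A = pd.DataFrame({'ops': [9453, 8666], 'lat': [13536, 14768]})
B = pd.DataFrame({'ops': [9865, 8908], 'lat': [12967, 14366]})
C = pd.DataFrame({'ops': [9594, 8718], 'lat': [13332, 14670]})
stacked = pd.concat([A, B, C])   # appends the frames, keeping the original index
merged = stacked.groupby(level=0).agg({'ops': 'sum', 'lat': 'mean'})
print(merged)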

How to group by one column or another in pandas

I have a table like:
col1 col2
0 1 a
1 2 b
2 2 c
3 3 c
4 4 d
I'd like rows to be grouped together if they have a matching value in col1 or col2. That is, I'd like something like this:
> (
df
.groupby(set('col1', 'col2')) # Made-up syntax
.ngroup())
0 0
1 1
2 1
3 1
4 2
Is there a way to do this with pandas?
This is not easy to achieve with pandas alone. Indeed, two groups that seem far apart can become connected when their items are linked through the second column.
You can approach this using graph theory: find the connected components of the graph whose edges are formed by the two (or more) grouping columns. A Python library for this is networkx:
import networkx as nx
g1 = df.groupby('col1').ngroup()
g2 = 'a' + df.groupby('col2').ngroup().astype(str)   # prefix so col2 group ids don't collide with col1 ids
# make graph and get connected components to form a mapping dictionary
G = nx.from_edgelist(zip(g1, g2))
d = {k: v for v, s in enumerate(nx.connected_components(G)) for k in s}
# find common group
group = g1.map(d)
df.groupby(group).ngroup()
output:
0 0
1 1
2 1
3 1
4 2
dtype: int64
graph: (image of the resulting connected-component graph omitted)
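For reference, a self-contained version of the snippet above (using the sample table from the question) could look like this:
import pandas as pd
import networkx as nx
df = pd.DataFrame({'col1': [1, 2, 2, 3, 4],
                   'col2': ['a', 'b', 'c', 'c', 'd']})
g1 = df.groupby('col1').ngroup()
g2 = 'a' + df.groupby('col2').ngroup().astype(str)
G = nx.from_edgelist(zip(g1, g2))
d = {k: v for v, s in enumerate(nx.connected_components(G)) for k in s}
print(df.groupby(g1.map(d)).ngroup())   # 0, 1, 1, 1, 2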

drop consecutive duplicates of groups

I am removing consecutive duplicates in groups in a dataframe. I am looking for a faster way than this:
def remove_consecutive_dupes(subdf):
    dupe_ids = ["A", "B"]
    is_duped = (subdf[dupe_ids].shift(-1) == subdf[dupe_ids]).all(axis=1)
    subdf = subdf[~is_duped]
    return subdf

# dataframe with columns key, A, B
df.groupby("key").apply(remove_consecutive_dupes).reset_index()
Is it possible to remove these without grouping first? Applying the above function to each group individually takes a lot of time, especially if the group count is like half the row count. Is there a way to do this operation on the entire dataframe at once?
A simple example for the algorithm if the above was not clear:
input:
key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5
5 x 1 2
output:
key A B
0 x 1 2
1 y 1 4
3 x 1 4
4 y 2 5
5 x 1 2
Row 2 was dropped because A=1 B=2 was also the previous row in group x.
Row 5 will not be dropped because it is not a consecutive duplicate in group x.
According to your code, you drop a row only if it directly follows an identical row once the rows are grouped by the key. So rows with another key in between do not influence this logic, but you still want to preserve the original order of the records.
I guess the biggest influence on the runtime is the per-group call of your function, and possibly not the grouping itself.
If you want to avoid this, you can try the following approach:
# create a column to restore the original order of the dataframe
df.reset_index(drop=True, inplace=True)
df.reset_index(drop=False, inplace=True)
df.columns= ['original_order'] + list(df.columns[1:])
# add a group column, that contains consecutive numbers if
# two consecutive rows differ in at least one of the columns
# key, A, B
compare_columns= ['key', 'A', 'B']
df.sort_values(['key', 'original_order'], inplace=True)
df['group']= (df[compare_columns] != df[compare_columns].shift(1)).any(axis=1).cumsum()
df.drop_duplicates(['group'], keep='first', inplace=True)
df.drop(columns=['group'], inplace=True)
# now just restore the original index and its order
df.set_index('original_order', inplace=True)
df.sort_index(inplace=True)
df
Testing this, results in:
key A B
original_order
0 x 1 2
1 y 1 4
3 x 1 4
4 y 2 5
If you don't like the index name above (original_order), you just need to add the following line to remove it:
df.index.name= None
Testdata:
from io import StringIO
infile= StringIO(
""" key A B
0 x 1 2
1 y 1 4
2 x 1 2
3 x 1 4
4 y 2 5"""
)
df= pd.read_csv(infile, sep=r'\s+')
df
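As a further vectorized variant (not from the answer above, just a sketch): compare each row against the previous row within its own key group via groupby(...).shift(), which keeps the original order without any sorting:
import pandas as pd
df = pd.DataFrame({'key': ['x', 'y', 'x', 'x', 'y', 'x'],
                   'A':   [1, 1, 1, 1, 2, 1],
                   'B':   [2, 4, 2, 4, 5, 2]})
cols = ['A', 'B']
prev = df.groupby('key')[cols].shift()    # previous row within the same key group
keep = (df[cols] != prev).any(axis=1)     # first row of each group differs (NaN) and is kept
print(df[keep])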

Pandas, multiply part of one DF against another based on condition

Pretty new to this and am having trouble finding the right way to do this.
Say I have dataframe1 looking like this with column names and a bunch of numbers as data:
D L W S
1 2 3 4
4 3 2 1
1 2 3 4
and I have dataframe2 looking like this:
Name1 Name2 Name3 Name4
2 data data D
3 data data S
4 data data L
5 data data S
6 data data W
I would like to produce a new dataframe by multiplying each row of the second dataframe against each row of the first dataframe, where the value of Name1 is multiplied by the values in the dataframe1 column whose name matches the Name4 value of that dataframe2 row.
Is there any nice way to do this? I was trying to look at using methods like where, condition, and apply but haven't been understanding things well enough to get something working.
EDIT: Use the following code to create fake data for the DataFrames:
d1 = {'D':[1,2,3,4,5,6],'W':[2,2,2,2,2,2],'L':[6,5,4,3,2,1],'S':[1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2':[2,2,2,2,3,4], 'col3':['data', 'data', 'data','data', 'data', 'data' ], 'col4':['D','L','D','W','S','S']}
df1 = pd.DataFrame(data = d1)
df2 = pd.DataFrame(data = d2)
EDIT AGAIN FOR MORE INFO
First, I changed the data in df1 at this point so this new example turns out better.
From those two dataframes, the dataframe I'd like to create would come out like this if the multiplication went through for the first four rows of df2. You can see that col2 and col3 are unchanged, but depending on the letter in col4, col1 was multiplied by the corresponding column of df1:
d3 = { 'col1':[3,6,9,12,15,18,12,10,8,6,4,2,7,14,21,28,35,42,8,8,8,8,8,8], 'col2':[2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2], 'col3':['data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data','data'], 'col4':['D','D','D','D','D','D','L','L','L','L','L','L','D','D','D','D','D','D','W','W','W','W','W','W']}
df3 = pd.DataFrame(data = d3)
I think I understand what you are trying to achieve. You want to multiply each row r of df2 with the corresponding column c of df1, but the elements from c are only multiplied with the first element of r; the rest of the row doesn't change.
I was thinking there might be a way to join df1.transpose() and df2 but I didn't find one.
While not pretty, I think the code below solves your problem:
def stretch(row):
    # repeat the df2 row once for every row of df1
    repeated_rows = pd.concat([row] * len(df1), axis=1, ignore_index=True).transpose()
    factor = row['col1']
    label = row['col4']
    first_column = df1[label] * factor
    repeated_rows['col1'] = first_column
    return repeated_rows
pd.concat((stretch(r) for _, r in df2.iterrows()), ignore_index=True)
#resulting in
col1 col2 col3 col4
0 3 2 data D
1 6 2 data D
2 9 2 data D
3 12 2 data D
4 15 2 data D
5 18 2 data D
0 12 2 data L
1 10 2 data L
2 8 2 data L
3 6 2 data L
4 4 2 data L
5 2 2 data L
0 7 2 data D
1 14 2 data D
2 21 2 data D
3 28 2 data D
4 35 2 data D
5 42 2 data D
0 8 2 data W
1 8 2 data W
2 8 2 data W
3 8 2 data W
4 8 2 data W
5 8 2 data W
...
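As an aside (not part of the answer above), a vectorized sketch that should give the same result, assuming df1 and df2 are built from the d1/d2 dictionaries in the EDIT:
import pandas as pd
d1 = {'D': [1,2,3,4,5,6], 'W': [2,2,2,2,2,2], 'L': [6,5,4,3,2,1], 'S': [1,2,3,4,5,6]}
d2 = {'col1': [3,2,7,4,5,6], 'col2': [2,2,2,2,3,4],
      'col3': ['data']*6, 'col4': ['D','L','D','W','S','S']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
# for every df2 row, pick the df1 column named in col4 -> shape (len(df1), len(df2))
factors = df1[df2['col4'].tolist()].to_numpy()
# repeat every df2 row len(df1) times, then overwrite col1 with the products
out = df2.loc[df2.index.repeat(len(df1))].reset_index(drop=True)
out['col1'] = out['col1'].to_numpy() * factors.T.ravel()
print(out)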

Pandas sort groupby groups by arbitrary condition on their contents

Ok, this is getting ridiculous ... I've spent way too much time on something that should be trivial.
I want to group a data frame by a column, then sort the groups (not within the group) by some condition (in my case maximum over some column B in the group).
I expected something along these lines:
df.groupby('A').sort_index(lambda group_content: group_content.B.max())
I also tried:
groups = df.groupby('A')
maxx = groups['B'].max()
groups.sort_index(...)
But, of course, there is no sort_index on a groupby object ...
EDIT:
I ended up using (almost) the solution suggested by @jezrael:
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)
I had to add ascending=True to sort_values, but more importantly sort=False to groupby, otherwise I would get the groups sorted lexicographically (A contains strings).
If the same max can occur for several groups, use GroupBy.transform with max to build a helper column and then sort by DataFrame.sort_values:
df = pd.DataFrame({
    'A': list('aaabcc'),
    'B': [7, 8, 9, 100, 20, 30]
})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max','A'])
print (df)
A B max
0 a 7 9
1 a 8 9
2 a 9 9
4 c 20 30
5 c 30 30
3 b 100 100
If the max values are always unique, use Series.argsort:
s = df.groupby('A')['B'].transform('max')
df = df.iloc[s.argsort()]
print (df)
A B
0 a 7
1 a 8
2 a 9
4 c 20
5 c 30
3 b 100
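Mirroring the edit in the question (helper column used only for sorting, then grouping without lexicographic re-sorting), a small sketch might look like this:
import pandas as pd
df = pd.DataFrame({'A': list('aaabcc'),
                   'B': [7, 8, 9, 100, 20, 30]})
df['max'] = df.groupby('A')['B'].transform('max')
df = df.sort_values(['max', 'B'], ascending=True).drop('max', axis=1)
groups = df.groupby('A', sort=False)   # sort=False keeps the groups in the order produced above
print(groups.ngroup())                 # a -> 0, c -> 1, b -> 2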