I want to combine multiple rows that have a missing column id into one row, using pandas.

Column_id  statements
a1         My dog
NaN        ate
NaN        a bone
a2         an apple,
NaN        banana &
NaN        orange
The original dataframe is called df. I would like each run of statement rows where Column_id is missing to be merged into the preceding row. The resulting dataframe would look like this:
Column_id  statements
a1         My dog ate a bone
a2         an apple, banana & orange

I figured out a way to go about it, and I am posting the solution to help anyone stuck on the same problem:
# rename(None) drops the grouping key's name so the result keeps a clean index
agg_spec = {'Column_id': 'first',
            'statements': lambda x: ' '.join(x.dropna())}
df_new = df.groupby(df.Column_id.notnull().cumsum().rename(None)).agg(agg_spec)
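To see what the grouping key looks like, here is the intermediate result on the sample data (a quick sketch, assuming the missing ids are genuine NaN values):
print(df.Column_id.notnull().cumsum().tolist())
# [1, 1, 1, 2, 2, 2] -> rows 0-2 form group 1 (a1), rows 3-5 form group 2 (a2)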


python pandas : how to merge multiple columns into one column and use a pie chart

pd.DataFrame([["Stress", "NaN"], ["NaN", "Pregnancy"], ["Alcohol", "Pregnancy"]], columns=['causes', 'causes.2'])
I have a sample dataset here. Technically, these columns should have been merged into one, but for some reason they weren't. Now I am tasked with making a pie chart, and I know how to do that with one column, hence I want to merge these columns into a single column with a distinct name.
I tried using df.stack().reset_index(), but that gives me a weird object I do not know how to manipulate:
   level_0   level_1          0
0        0    causes     Stress
1        0  causes.2        NaN
2        1    causes        NaN
3        1  causes.2  Pregnancy
4        2    causes    Alcohol
5        2  causes.2  Pregnancy
Anyone know how I could achieve this?
For the pie chart, I plan on using:
values = df["Cause of...."].value_counts()
ax = values.plot(kind="pie", autopct='%1.1f%%', shadow=True, legend=True, title="", ylabel='', labeldistance=None)
ax.legend(bbox_to_anchor=(1, 1.02), loc='upper left')
plt.show()
You can flatten using the underlying numpy array and create a new Series:
pd.Series(df.to_numpy().ravel(), name='causes')
Output:
0 Stress
1 NaN
2 NaN
3 Pregnancy
4 Alcohol
5 Pregnancy
Name: causes, dtype: object
If you have many columns, you need to select only the ones you want to flatten, for example selecting by name:
pd.Series(df.filter(like='causes').to_numpy().ravel(), name='causes')
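Tying this back to the pie chart: a minimal sketch, reusing the plotting code from the question (assumes matplotlib is installed; since the sample data encodes missing values as the string "NaN", they are converted to real NaN and dropped first so they don't become a slice):
import numpy as np
import matplotlib.pyplot as plt

causes = pd.Series(df.to_numpy().ravel(), name='causes').replace("NaN", np.nan).dropna()
values = causes.value_counts()
ax = values.plot(kind="pie", autopct='%1.1f%%', shadow=True, legend=True, title="", ylabel='', labeldistance=None)
ax.legend(bbox_to_anchor=(1, 1.02), loc='upper left')
plt.show()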

KeyError after dropping null values in pandas dataframe

I have a simple dataframe:
   1    2     3
1  NaN  like  dislike
2  Cow  dog   snail
After dropping the NaN value the dataframe is:
   1    2    3
2  Cow  dog  snail
Now when I try the following code to print the values, it raises a KeyError:
for i in range(len(data)):
    print(data.loc[i, :])
Any help will be appreciated.
Please add the following line after dropping the NaN values:
data = data.reset_index(drop=True)
You're dropping a row with a specific index, and after dropping the NA values that index no longer exists. In your case the row labeled 1 is dropped, so you can no longer iterate the dataframe positionally via len. I highly recommend using iterrows instead, as it's better form. Example:
for index, value in mydataframe.iterrows():
    print(index, " ", value)
# 2
# 1 Cow
# 2 dog
# 3 snail
The value is of class 'pandas.core.series.Series', which in your case functions a lot like a dictionary. Pay attention to the column names, which are exactly the ones in your example.
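For completeness, here is a minimal end-to-end sketch of the reset_index fix, with the sample data reconstructed from the question (the column and index labels are assumed from the example above):
import numpy as np
import pandas as pd

data = pd.DataFrame([[np.nan, 'like', 'dislike'], ['Cow', 'dog', 'snail']], columns=[1, 2, 3], index=[1, 2])
data = data.dropna()                # removes the row labeled 1
data = data.reset_index(drop=True)  # rows are relabeled 0..n-1
for i in range(len(data)):
    print(data.loc[i, :])           # the positional loop works again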

Pandas: Filter rows by comparing a column's value to another value for the same column in a different row

I have searched the heck out of this one, and I don't think I've found anything applicable. But I'm new to Pandas, so I may have missed something; apologies in that case.
Suppose I have a dataframe, df, with the following contents:
Column1 Column2
A Apple
B Apple
A Pear
A Orange
B Orange
A Pear
I want to filter the dataframe to show ONLY rows where:
- Column2's value matches at least one other row's Column2 value
- among those matching rows, at least one of the Column1 values differs
Expected results of the above df:
Column1 Column2
A Apple
B Apple
A Orange
B Orange
I have tried using the .loc indexer for this, but I cannot find an appropriate filter/set of filters. (I also tried a 'for i in df' loop, but this just gave an error.)
I would usually post some sample code in these situations, but I don't think any of my approaches so far have made much sense.
Any help would be much appreciated-thanks.
Use GroupBy.transform with nunique:
df_filtered = df[df.groupby('Column2')['Column1'].transform('nunique').gt(1)]
print(df_filtered)
We could also use pd.crosstab, mapping each Column2 value to whether it co-occurs with more than one distinct Column1 value:
df[df['Column2'].map(pd.crosstab(df['Column1'], df['Column2']).gt(0).sum().gt(1))]
#df[df['Column2'].map(pd.crosstab(df['Column1'], df['Column2']).where(lambda x: x > 0).count().gt(1))]
We could also use groupby.filter, though in general this is slower:
df.groupby('Column2').filter(lambda x: x.Column1.nunique() > 1)
Output
Column1 Column2
0 A Apple
1 B Apple
3 A Orange
4 B Orange
Of the options above, the first one with groupby.transform is generally the fastest.
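To see why the transform approach works, here are the intermediate per-row counts on the sample data (a quick sketch):
counts = df.groupby('Column2')['Column1'].transform('nunique')
print(counts.tolist())
# [2, 2, 1, 2, 2, 1] -> the two Pear rows only ever pair with A, so they fall below the >1 cutoff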
You can use a groupby and filter:
(
    df.groupby('Column2')
      .filter(lambda x: len(x.drop_duplicates(subset='Column1')) > 1)
)
Column1 Column2
0 A Apple
1 B Apple
3 A Orange
4 B Orange

Pandas merge: how to return the column on which you have done the merge?

I have a dataframe with some categorical data. I want to create a new column which shows only some of those values, and converts the others to 'other'. E.g. to show only the top 10 cities, or, in the example below, show only two colours and convert the others to 'other'.
I want to do it via a pandas.merge, like a SQL outer join: on one side my table, on the other a table with only the values I want to keep (i.e. not convert to 'other').
The problem, and it took me a bit of debugging and swearing to find this out, is that pandas.merge does not return both columns on which you have done the merge, even if one of the columns has nulls.
The solution I have found is to create another column with the same values - which I think would make anyone familiar with SQL cringe. Is there a more elegant way?
This is the code to show what I mean:
import pandas as pd

df = pd.DataFrame()
df['colour'] = ['yellow', 'yellow', 'green', 'red']
mycols = pd.DataFrame()
mycols['colour'] = ['yellow', 'red']
# after this merge, I have no way of knowing which colour in df has no match in mycols
newdf = pd.merge(df, mycols, on='colour', how='outer', suffixes=('', '_r'))
# so I need to create another column in mycols
mycols['colour copied'] = mycols['colour']
newdf2 = pd.merge(df, mycols, on='colour', how='outer', suffixes=('', '_r'))
newdf2['colour copied'] = newdf2['colour copied'].fillna('other')
newdf2 = newdf2.rename(columns={'colour copied': 'colour - reclassified'})
You can add the parameter indicator=True, which adds a _merge column reporting whether each row matched both, left_only or right_only:
newdf=pd.merge(df, mycols, on='colour', how='outer', suffixes=('','_r'), indicator=True)
print (newdf)
colour _merge
0 yellow both
1 yellow both
2 green left_only
3 red both
Then set the values by condition: if _merge is 'both', keep the value from colour, otherwise use 'other', via numpy.where; DataFrame.pop extracts (and removes) the helper column:
import numpy as np
newdf['colour copied'] = np.where(newdf.pop('_merge') == 'both', newdf['colour'], 'other')
print (newdf)
colour colour copied
0 yellow yellow
1 yellow yellow
2 green other
3 red red
But if you are working with only one column, a simpler alternative is possible: compare with Series.isin to test membership:
df['colour copied'] = np.where(df['colour'].isin(mycols['colour']), df['colour'], 'other')
print (df)
colour colour copied
0 yellow yellow
1 yellow yellow
2 green other
3 red red
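As an aside, the same isin pattern covers the 'show only the top 10 cities' case from the question without any merge; a minimal sketch (the city column name is hypothetical):
# keep the 10 most frequent cities, reclassify everything else as 'other'
top10 = df['city'].value_counts().nlargest(10).index
df['city - reclassified'] = np.where(df['city'].isin(top10), df['city'], 'other')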

Pandas: fill in NaN values with a dictionary that references another column

I have a dictionary that looks like this:
mapping = {'b': '5', 'c': '4'}
My dataframe looks something like this
A B
0 a 2
1 b NaN
2 c NaN
Is there a way to fill in the NaN values in B by using the dictionary to map from column A, while keeping the rest of the column's values?
You can map the dictionary's values inside fillna:
df.B = df.B.fillna(df.A.map(mapping))
print(df)
A B
0 a 2
1 b 5
2 c 4
This can be done simply with:
df['B'] = df['B'].fillna(df['A'].apply(lambda x: mapping.get(x)))
This works for bigger datasets as well, though Series.map is usually faster than apply with a lambda.
Unfortunately, this isn't one of the options for a built-in function like pd.fillna().
Edit: thanks for the correction; apparently this is possible, as illustrated in Vaishali's answer.
However, you can first subset the dataframe on the missing values and then apply the map with your dictionary:
df.loc[df['B'].isnull(), 'B'] = df['A'].map(mapping)
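Put together, a minimal runnable sketch of the subset-then-map approach, using the sample data from the question:
import numpy as np
import pandas as pd

mapping = {'b': '5', 'c': '4'}
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['2', np.nan, np.nan]})
df.loc[df['B'].isnull(), 'B'] = df['A'].map(mapping)  # only the missing rows are assigned
print(df)
#    A  B
# 0  a  2
# 1  b  5
# 2  c  4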