Pandas: Replace duplicates by their mean values in a dataframe [duplicate]

This question already has answers here:
group by in group by and average
(3 answers)
Closed 4 years ago.
I have been working with a dataframe in Pandas that contains duplicate entries along with non-duplicates in a column. The dataframe looks something like this:
country_name values category
0 country_1 10 a
1 country_2 20 b
2 country_1 50 a
3 country_2 10 b
4 country_3 100 c
5 country_4 10 d
I want to write something that converts (replaces) duplicates with their mean values in my dataframe. An ideal output would be something like the following:
country_name values category
0 country_1 30 a
1 country_2 15 b
2 country_3 100 c
3 country_4 10 d
I have been struggling with this for a while, so I would appreciate any help. I forgot to add the category column at first. The problem with the groupby() method is that, as you know, when you call mean() it does not return the category column. My solution was to take the numeric columns together with the column that has duplicates, apply groupby().mean(), and then concatenate the result back to the categorical columns. So I am looking for a solution shorter than what I have done.
My method gets tedious when you are dealing with many categorical columns.

You can use df.groupby(); pass numeric_only=True so the non-numeric category column is skipped instead of raising a TypeError in recent pandas:
df.groupby('country_name').mean(numeric_only=True).reset_index()
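Note that this drops the category column, which is exactly what the question complains about. If the categorical columns are constant within each group (as in the sample data, where category is determined by country_name), a minimal sketch that keeps them is to include them in the grouping keys:

import pandas as pd

df = pd.DataFrame({
    'country_name': ['country_1', 'country_2', 'country_1',
                     'country_2', 'country_3', 'country_4'],
    'values': [10, 20, 50, 10, 100, 10],
    'category': ['a', 'b', 'a', 'b', 'c', 'd'],
})

# Group on every non-numeric column so they survive the aggregation;
# only the numeric 'values' column is averaged.
out = df.groupby(['country_name', 'category'], as_index=False)['values'].mean()
print(out)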

Related

Convert transactions with several products from columns to row [duplicate]

I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES VALUE
john_1 1
john_2 2
john_3 3
bro_1 4
bro_2 5
bro_3 6
guy_1 7
guy_2 8
guy_3 9
And I would like to go to:
NAMES VALUE1 VALUE2 VALUE3
john 1 2 3
bro 4 5 6
guy 7 8 9
I have tried this with pandas: I first split the index (NAMES) and can create the new columns, but I have trouble getting the values into the right columns.
Can someone at least point me in the direction of a solution? I don't expect full code (I know that is not appreciated), but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
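A vectorized alternative to the zip/list-comprehension one-liner, assuming every name contains exactly one underscore, is str.split with expand=True:

# Splits NAMES into two columns in one pass, staying inside pandas
df[['NAMES', 'NAME_NBR']] = df['NAMES'].str.split('_', expand=True)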

pandas dataframe - how to find multiple column names with minimum values

I have a dataframe (small sample shown below, it has more columns), and I want to find the column names with the minimum values.
Right now, I have the following code to deal with it:
finaldf['min_pillar_score'] = finaldf.iloc[:, 2:9].idxmin(axis="columns")
This works fine, but it does not return multiple column names when more than one column holds the minimum value. How can I change it to return multiple column names in that case?
Please note, I want row wise results, i.e. minimum column names for each row.
Thanks!
Try the code below and see if it's in the output format you anticipated; it produces the intended result, at least.
The result will be stored in mins.
# Start from the single idxmin per row, then widen each entry to the
# full list of columns that tie for that row's minimum.
mins = df.idxmin(axis="columns")
for i, r in df.iterrows():
    mins[i] = list(r[r == r[mins[i]]].index)
Get column name where value is something in pandas dataframe might also be helpful.
Assuming this input as df:
A B C D
0 5 8 9 5
1 0 0 1 7
2 6 9 2 4
3 5 2 4 2
4 4 7 7 9
You can use the underlying NumPy array to get the overall minimum value, then compare every cell to it and take the columns that contain a match:
s = df.eq(df.to_numpy().min()).any()
list(s[s].index)
output: ['A', 'B']
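Note that this finds the columns containing the frame-wide minimum. For the row-wise result the question asks for, a minimal sketch that compares each row to its own minimum:

row_min = df.min(axis=1)           # each row's minimum
mask = df.eq(row_min, axis=0)      # True where a cell ties its row's minimum
mins = mask.apply(lambda r: list(r[r].index), axis=1)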

Is there a way to use the unique values of a count of occurrences as column headers in pandas? [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 2 years ago.
I have a dataframe df1.
Time Category
23:05:07 a
23:11:12 b
23:12:15 a
23:16:12 a
Another dataframe df2 has been created that holds the count of occurrences of each unique value in the Category column, binned into intervals of 5 minutes, using this code: df2 = df1.resample('5T').Category.value_counts()
df2
Time Category Category
23:05 a 1
23:10 a 1
23:10 b 1
23:15 a 1
Is there a way to use the unique values as column headers, to look like:
Time a b
23:05 1 0
23:10 1 1
23:15 1 0
value_counts returns a MultiIndex Series, so you just need to work directly on its result by chaining unstack to get your desired output.
Since you were able to resample, I assume Time is your index and is already of datetime or timedelta dtype:
df_final = df1.resample('5T').Category.value_counts().unstack(fill_value=0)
Out[79]:
Category a b
Time
23:05:07 1 0
23:10:07 1 1
23:15:07 1 0
The code below should work. I renamed your second 'Category' column to 'CatUnique'.
df.groupby(['Time','Category'])['CatUnique'].sum().unstack().fillna(0).reset_index()
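For reference, a self-contained run of the resample + unstack approach from the first answer (the date here is arbitrary and only needed to build a DatetimeIndex):

import pandas as pd

idx = pd.to_datetime(['2021-01-01 23:05:07', '2021-01-01 23:11:12',
                      '2021-01-01 23:12:15', '2021-01-01 23:16:12'])
df1 = pd.DataFrame({'Category': ['a', 'b', 'a', 'a']}, index=idx)
df1.index.name = 'Time'

# Count each category per 5-minute bin, then spread categories into columns
df_final = df1.resample('5T').Category.value_counts().unstack(fill_value=0)
print(df_final)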

How can I get the total counts of columns in a Dataset having null values? [duplicate]

This question already has answers here:
How to check if any value is NaN in a Pandas DataFrame
(27 answers)
Closed 2 years ago.
Say I have a dataset with 100 columns, 25 of which have one or more null values.
How can I get the counts of columns as output, i.e. something like: out of 100 columns, 25 columns have null values and 75 have none?
The following code gives me an error:
data[data.columns[data.isnull() == True]].shape[1]
You need to use any():
s = data.isnull().any()
# number of columns with at least one null
num_col_with_null = s.sum()
# number without
data.shape[1] - num_col_with_null
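A quick runnable check on a toy frame (column names are illustrative):

import pandas as pd
import numpy as np

data = pd.DataFrame({'a': [1, np.nan], 'b': [1, 2], 'c': [np.nan, np.nan]})

s = data.isnull().any()            # one boolean per column
print(s.sum())                     # 2 columns contain nulls
print(data.shape[1] - s.sum())     # 1 column does not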

Apply diffs down columns of pandas dataframe [duplicate]

This question already has answers here:
How to replace NaNs by preceding or next values in pandas DataFrame?
(10 answers)
Closed 3 years ago.
I want to apply diffs down columns for a pandas dataframe.
EX:
A B C
23 40000 1
24 nan nan
nan 42000 2
I would want something like:
A B C
23 40000 1
24 40000 1
24 42000 2
I have tried variations of pandas groupby, which I think is probably the right approach (or applying some function down the columns, though I am not sure that is efficient; correct me if I'm wrong).
I was able to "apply diffs down the column" and get something like:
A B C
24 42000 2
by calling df = df.groupby('col', as_index=False).last() for each column, but this is not what I am looking for. I am not a pandas expert, so apologies if this is a silly question.
What you want here is a forward fill rather than a diff. See the fillna documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
df = df.ffill()  # equivalent to the older df.fillna(method='ffill')
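Applied to the sample frame above, a minimal sketch:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [23, 24, np.nan],
                   'B': [40000, np.nan, 42000],
                   'C': [1, np.nan, 2]})

# Propagate the last valid value down each column
print(df.ffill())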