Merge dataframe - pandas

I have a dataframe df as follows:
I would like to convert it in the following way.
How do I go about it?
All help appreciated.
Thanks

Try creating a DataFrame for each class_id then concat on axis=1:
import pandas as pd
df = pd.DataFrame({'student_id': [1, 1, 1, 2],
                   'class_id': [1, 2, 3, 1],
                   'teacher': ['rex', 'fred', 'hulio', 'ross']})
# Generate DF for each class_id
dfs = tuple(df[df['class_id'].eq(c_id)].reset_index(drop=True)
            for c_id in df['class_id'].unique())
# concat on axis 1
new_df = pd.concat(dfs, axis=1)
# For Display
print(new_df.fillna('').to_string(index=False))
new_df:
 student_id  class_id teacher  student_id  class_id teacher  student_id  class_id teacher
          1         1     rex         1.0       2.0    fred         1.0       3.0   hulio
          2         1    ross
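The same wide layout can also be built without the generator loop, by numbering the rows within each class_id and pivoting. This is a sketch, not part of the original answer; the resulting columns are a (value, class_id) MultiIndex rather than repeated flat columns:

```python
import pandas as pd

df = pd.DataFrame({'student_id': [1, 1, 1, 2],
                   'class_id': [1, 2, 3, 1],
                   'teacher': ['rex', 'fred', 'hulio', 'ross']})

# Number the rows within each class_id, then pivot so every class
# gets its own set of columns; missing cells become NaN
wide = (df.assign(row=df.groupby('class_id').cumcount())
          .pivot(index='row', columns='class_id'))
print(wide)
```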

Related

Pandas dataframe median of a column with condition

So I have a dataframe with two columns (price, location). Now I want to get the median of price where the location is, e.g., "Paris". How do I achieve that?
dataframe:
location  price
paris         5
paris         2
rome          5
paris         4
...
desired result: 4 (median of 2,5,4)
I think you need df.groupby to group on location, and then .median():
median = df.groupby('location').median()
To get the value for each location:
median.loc['paris', 'price']
Output:
4
import pandas as pd
# Build dataframe
data = [['Paris', 2], ['New York', 3], ['Rome', 4], ['Paris', 5], ['Paris', 4]]
df = pd.DataFrame(data, columns=['location', 'price'])
# Get paris only rows
df_paris = df[df['location'] == 'Paris']
# Print median
print(df_paris['price'].median())
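The filter and the median can also be collapsed into a single expression with a boolean mask, same data as above:

```python
import pandas as pd

data = [['Paris', 2], ['New York', 3], ['Rome', 4], ['Paris', 5], ['Paris', 4]]
df = pd.DataFrame(data, columns=['location', 'price'])

# Select the Paris prices and take their median in one step
paris_median = df.loc[df['location'] == 'Paris', 'price'].median()
print(paris_median)  # 4.0
```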

Selecting specific rows by date in a multi-dimensional df Python

[image of df: a (Symbol, Date) MultiIndex frame with Adj Cls and ExMA columns]
I would like to select a specific date, e.g. 2020-07-07, and get the Adj Cls and ExMA for each of the symbols. I'm new to Python and I tried using df.loc['xy'] (xy being a specific date on the datetime index) but keep getting a KeyError. Any insight is greatly appreciated.
Info on the df MultiIndex: 30 entries, (SNAP, 2020-07-06 00:00:00) to (YUM, 2020-07-10 00:00:00)
Data columns (total 2 columns):
dtypes: float64(2)
You can use pandas.DataFrame.xs for this.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.arange(8).reshape(4, 2), index=[[0, 0, 1, 1], [2, 3, 2, 3]], columns=list("ab")
)
print(df)
#      a  b
# 0 2  0  1
#   3  2  3
# 1 2  4  5
#   3  6  7
print(df.xs(3, level=1).filter(["a"]))
# a
# 0 2
# 1 6
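Applied to a frame shaped like the asker's, this picks out one row per symbol for the chosen date. The frame below is a hypothetical reconstruction: the symbols come from the question's info dump, but the numbers are made up for illustration:

```python
import pandas as pd

# Reconstruct a (Symbol, Date) MultiIndex like the asker's
idx = pd.MultiIndex.from_product(
    [['SNAP', 'YUM'], pd.to_datetime(['2020-07-06', '2020-07-07'])],
    names=['Symbol', 'Date'])
df = pd.DataFrame({'Adj Cls': [23.1, 23.4, 88.0, 88.5],
                   'ExMA':    [22.9, 23.0, 87.5, 87.8]}, index=idx)

# Cross-section on the Date level: one row per symbol for that day
print(df.xs('2020-07-07', level='Date'))
```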

optimizing groupby excluding last row

I am trying to compute a groupby mean over the first n-1 rows of each group and then assign that mean to the n-th (last) row in pandas. Here is my current code and the desired output. It takes a long time to run, and I wonder if anyone knows how to optimize it.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B', 'C'],
'vals': [2, 3, 4, 5, 6, 7]})
# current solution
for h in df['id'].unique():
    h_df = df[df['id'] == h]
    indices = h_df.index
    size = h_df.shape[0]
    last_index = indices[size - 1]
    if size == 1:
        df.iloc[last_index, df.columns.get_loc('vals')] = np.nan
        continue
    exclude_last = h_df[:size - 1]
    avg = (exclude_last.groupby('id')['vals'].mean()).values[0]
    df.iloc[last_index, df.columns.get_loc('vals')] = avg
# output
# id vals
# A 2
# A 3
# A 2.5 => (2+3) / 2
# B 5
# B 5 => (5/1)
# C np.nan
There's no need to iterate over the unique values, select each group, and run another groupby inside the loop. The .groupby itself can do all of that:
In [1]: def mean_head(group):
   ...:     group.vals.iloc[-1] = group.vals.iloc[:-1].mean()
   ...:     return group
   ...:
In [2]: df.groupby("id").apply(mean_head)
Out[2]:
  id  vals
0  A   2.0
1  A   3.0
2  A   2.5
3  B   5.0
4  B   5.0
5  C   NaN
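If the groupby.apply is still too slow on a very large frame, the same result can be computed fully vectorized by masking the last row of each group. This is a sketch, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B', 'C'],
                   'vals': [2, 3, 4, 5, 6, 7]})
df['vals'] = df['vals'].astype(float)  # allow NaN for single-row groups

# True on the last row of each id (including single-row groups)
last = ~df.duplicated('id', keep='last')

# Mean of the non-last rows, per id; single-row ids are simply absent
means = df.loc[~last].groupby('id')['vals'].mean()

# Overwrite each group's last row; ids missing from `means` map to NaN
df.loc[last, 'vals'] = df.loc[last, 'id'].map(means)
print(df)
```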

How do I do an average plus count of a column using pandas data frame?

This code looks really stupid but this is a basic representation of the problem I've been dealing with all day - I have 3 columns, type, day and month. I'd like to count the number of dogs/cats by day, and then average it out over the month.
import numpy as np
import pandas as pd
data = {'Type':['Dog', 'Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Cat'], 'Day':[1, 1, 2, 2, 3, 3, 4, 4], 'Month': [1, 1, 1, 1, 2, 2, 2, 2]}
newDF = pd.DataFrame(data)
Which creates a dataframe that looks like this:
Type|Day|Month
---------
Dog|1|1
Cat|1|1
Cat|2|1
Cat|2|1
Dog|3|2
Dog|3|2
Dog|4|2
Cat|4|2
What I'm trying to do here is create a table below showing this:
Type | Month1 | Month2
------------------------
Dog | 1 | 1.5
Cat | 1.5 | 1
So basically, I just want to use some combination of pivot table or groupby to create a pivot_table containing the count of number of cats / dogs per day, and then average that out over the month. For some reason, I just can't manage to figure it out. Can someone smart enough with pandas please help? Thank you!
Two groupbys + unstack
(newDF.groupby(['Type', 'Day', 'Month']).size()
      .groupby(level=[0, 2]).mean()
      .unstack()
      .add_prefix('Month')
      .rename_axis(None, axis=1))
Output:
Month1 Month2
Type
Cat 1.5 1.0
Dog 1.0 1.5
Just a groupby combined with an unstack and mean:
newDF.groupby(newDF.columns.tolist()) \
     .size() \
     .unstack(level='Day') \
     .mean(axis=1) \
     .unstack(level='Month')
Output:
Month 1 2
Type
Cat 1.5 1.0
Dog 1.0 1.5
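The same table also falls out of a single pivot_table call, counting each day's occurrences in a cell and averaging those counts. This is a sketch with an ad-hoc aggfunc, not from either answer above:

```python
import pandas as pd

data = {'Type': ['Dog', 'Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Dog', 'Cat'],
        'Day': [1, 1, 2, 2, 3, 3, 4, 4],
        'Month': [1, 1, 1, 1, 2, 2, 2, 2]}
newDF = pd.DataFrame(data)

# For each (Type, Month) cell: count occurrences per day, then average them
out = newDF.pivot_table(index='Type', columns='Month', values='Day',
                        aggfunc=lambda s: s.value_counts().mean())
print(out)
```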

Pandas get unique values in every column

I'd like to print unique values in each column of a grouped dataframe and the following code snippet doesn't work as expected:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [5, 5, 5, 5], 'c': [11, 12, 13, 14]})
print(
    df.groupby(['a']).apply(
        lambda df: df.apply(
            lambda col: col.unique(), axis=0))
)
I'd expect it to print
1 [5] [11, 13]
2 [5] [12, 14]
While there are other ways of doing so, I'd like to understand what's wrong with this approach. Any ideas?
This should do the trick:
print(df.groupby(['a', 'b'])['c'].unique())
a | b |
--+---+---------
1 | 5 | [11, 13]
2 | 5 | [12, 14]
As to what's wrong with your approach: when you groupby on df and then apply some function f, the input for f is a DataFrame with all of df's columns, unless otherwise specified (as in my snippet, with ['c']). So your outer apply passes each group as a DataFrame with all 3 columns, including the grouping column a itself, and your inner apply then computes the unique values of a as well, which is why the result doesn't match the two-column output you expected.
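To get exactly the shape the asker expected, one row per a with the unique values of every other column, a per-column agg avoids the nested applies entirely. A sketch, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [5, 5, 5, 5], 'c': [11, 12, 13, 14]})

# agg runs the function once per remaining column, per group,
# so the grouping column `a` is excluded automatically
out = df.groupby('a').agg(lambda s: list(s.unique()))
print(out)
```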