Make pivot with duplicates that are equal - pandas

I want to make a pivot from a dataframe with multiple duplicates in 'index' and 'column', where the values I want are always equal when 'index' and 'column' are duplicates.
df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
                   "bar": ['A', 'A', 'B', 'C'],
                   "baz": [1, 1, 3, 4]})
But I get:
ValueError: Index contains duplicate entries, cannot reshape
when I try
df.pivot(index='foo', columns='bar', values='baz')

Try this:
df1 = df[~df.duplicated()].pivot(index='foo', columns='bar', values='baz')
print(df1)
bar    A    B    C
foo
one  1.0  NaN  NaN
two  NaN  3.0  4.0
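A minimal sketch of two ways to collapse identical duplicates before reshaping, assuming (as in the question) that repeated foo/bar pairs always carry the same baz. The pivot_table variant with aggfunc='first' is an alternative not mentioned above; it keeps the first value per cell without checking that the duplicates really agree.

```python
import pandas as pd

df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
                   "bar": ['A', 'A', 'B', 'C'],
                   "baz": [1, 1, 3, 4]})

# Option 1: drop fully identical rows, then pivot as usual.
deduped = df.drop_duplicates()
wide = deduped.pivot(index='foo', columns='bar', values='baz')

# Option 2: let pivot_table pick the first value in each (foo, bar) cell.
wide_first = df.pivot_table(index='foo', columns='bar', values='baz', aggfunc='first')

print(wide)
print(wide_first)
```

Option 1 is the safer choice when the duplicates are expected to be exact copies, because a pair with conflicting values would still raise the ValueError instead of being silently resolved.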

Related

How to merge same name column from two different dataframes?

I have four different datasets and have merged three of them correctly. The third and fourth datasets share a column name. When I merge the result with the fourth dataset, the values of that shared column do not line up properly: the user_id rows are repeated, and the del_keys column shows NaN in the original rows while the actual values appear in extra rows at the bottom of the table. I want the values of the same-named column merged on the basis of user_id.
In the above image you can see what kind of problem I am getting.
In my expected output, no user_id should be repeated.
Use merge on the user_id column:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'del_keys': [1.0, np.nan, np.nan, np.nan]
})
df2 = pd.DataFrame({
    'user_id': [3, 4, 5],
    'del_keys': [1.0, 2.0, 3.0]
})
# Both frames have a del_keys column, so merge suffixes them as del_keys_x and del_keys_y
final = df1.merge(df2, on="user_id", how="outer")
Then use combine_first to fill the NaN values, and drop duplicates:
# Prefer df2's value (del_keys_y) and fall back to df1's (del_keys_x)
final["del_keys"] = final['del_keys_y'].combine_first(final['del_keys_x'])
final = final.drop(columns=["del_keys_x", "del_keys_y"])
final = final.drop_duplicates(subset="user_id")
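As a hedged alternative sketch (not part of the answer above), a similar result can be reached without the suffixed columns by indexing both frames on user_id and calling DataFrame.combine_first. This assumes user_id is unique within each frame. Here df2's values take priority and df1 fills the gaps.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'user_id': [1, 2, 3, 4],
                    'del_keys': [1.0, np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({'user_id': [3, 4, 5],
                    'del_keys': [1.0, 2.0, 3.0]})

# Align both frames on user_id; values from df2 win, df1 fills what df2 lacks.
combined = (
    df2.set_index('user_id')
       .combine_first(df1.set_index('user_id'))
       .reset_index()
)
print(combined)
```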
I'm guessing that you use pd.concat to merge the dataframes.
Some dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'user_id': [1, 2, 3],
'del_keys': [1.0, np.nan, np.nan]
})
df2 = pd.DataFrame({
'user_id': [3, 4, 5],
'del_keys': [1.0, 2.0, 3.0]
})
Merge using pd.concat:
df = pd.concat([df1, df2])
>>> user_id del_keys
0 1 1.0
1 2 NaN
2 3 NaN
0 3 1.0
1 4 2.0
2 5 3.0
Remove duplicates using DataFrame.drop_duplicates:
(
df
.sort_values('del_keys')
.drop_duplicates('user_id', keep='first')
.sort_values('user_id')
)
>>> user_id del_keys
0 1 1.0
1 2 NaN
0 3 1.0
1 4 2.0
2 5 3.0
First, we sort the values by del_keys so that all NaNs end up at the bottom of the dataframe (sort_values places NaN last by default). Then we drop the duplicates and keep the first occurrence for each user_id, which is a non-NaN value whenever one exists. Lastly, we sort by user_id to restore the original order.
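The sort-then-drop idea can also be sketched with groupby, assuming one non-null value per user_id is enough: GroupBy.first returns the first non-null entry in each group, so no explicit sorting is needed. This is an alternative, not the method used above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'user_id': [1, 2, 3], 'del_keys': [1.0, np.nan, np.nan]})
df2 = pd.DataFrame({'user_id': [3, 4, 5], 'del_keys': [1.0, 2.0, 3.0]})

stacked = pd.concat([df1, df2], ignore_index=True)

# first() skips NaN within each user_id group; groups with only NaN stay NaN.
deduped = stacked.groupby('user_id', as_index=False)['del_keys'].first()
print(deduped)
```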

How to convert pandas multiindex into table columns and index?

I have a DataFrame
It has a multi-index and each row has a value
I want to unstack it, but it has duplicated keys. I want to convert it into this format:
where index1 and index2 come from one level of the multi-index, index3 and index4 come from the other level, and value1 to value3 are values from the original df. If a value doesn't exist, it should be NaN.
How should I do this?
See if this example helps you:
import pandas as pd
df = pd.DataFrame(data={'value': [1, 3, 5, 8, 5, 7]}, index=[[1, 2, 3, 4, 5, 6], ['A', 'B', 'A', 'C', 'C', 'A']])
df = df.reset_index(level=1)
df = df.pivot_table(values='value', index=df.index, columns='level_1')
Original data:
Result:
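A small sketch of the same approach with named index levels, assuming duplicated key pairs should be averaged (which is also pivot_table's default aggregation). The level names row_key and col_key and the data are made up for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1, 'A'), (1, 'B'), (2, 'A'), (2, 'A')],  # (2, 'A') appears twice
    names=['row_key', 'col_key'])
df = pd.DataFrame({'value': [10, 20, 30, 50]}, index=index)

# Move both index levels into ordinary columns so pivot_table can use them by name.
flat = df.reset_index()

# Duplicated (row_key, col_key) pairs are aggregated; missing pairs become NaN.
wide = flat.pivot_table(values='value', index='row_key',
                        columns='col_key', aggfunc='mean')
print(wide)
```

A plain unstack would fail on this data because of the repeated (2, 'A') key; swapping aggfunc for 'first', 'sum' or another function changes how such repeats are resolved.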

Why I can't merge the columns together

My goal is to turn the array into a DataFrame, and the error is caused only by the columns=... argument.
housing_extra = pd.DataFrame(housing_extra_attribs,
index=housing_num.index,
columns=[housing.columns,'rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Consequently, it returns
AssertionError: Number of manager items must equal union of block items
# manager items: 4, # tot_items: 12
The message says I passed only 4 columns, but housing.columns alone has 9 entries.
Here is what I get when I run housing.columns:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income',
'ocean_proximity'],
dtype='object')
So my question is: how can I combine the existing columns (housing.columns) with the 3 new columns ['rooms_per_household', 'population_per_household', 'bedrooms_per_room']?
You can use Index.union to add a list of columns to existing dataframe columns:
columns= housing.columns.union(
['rooms_per_household', 'population_per_household', 'bedrooms_per_room'],
sort=False)
Or convert to list and then add the remaining columns as list:
columns = (housing.columns.tolist() +
['rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Then:
housing_extra = pd.DataFrame(housing_extra_attribs,
index=housing_num.index,
columns=columns)
Some example:
Assume this df:
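df = pd.DataFrame(columns=['A', 'B', 'C', 'D'])  # stand-in for pd.util.testing.makeDataFrame(), which newer pandas versions no longer provide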
print(df.columns)
#Index(['A', 'B', 'C', 'D'], dtype='object')
When you pass this into a list:
[df.columns,'E','F','G']
you get a list of 4 items, the first of which is the whole Index (this is why the error reports 4 manager items):
[Index(['A', 'B', 'C', 'D'], dtype='object'), 'E', 'F', 'G']
versus when you use union:
df.columns.union(['E','F','G'],sort=False)
You get:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
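To tie this back to the original problem, here is a minimal sketch with a made-up two-column frame and a stand-in array (the names housing_small and extra_attribs are hypothetical). It shows that the combined label list has one entry per array column, which is what the DataFrame constructor needs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for housing and housing_extra_attribs.
housing_small = pd.DataFrame({'longitude': [-122.2, -122.3],
                              'latitude': [37.9, 37.8]})
extra_attribs = np.array([[-122.2, 37.9, 6.9],
                          [-122.3, 37.8, 6.2]])

# Flatten the existing labels into a plain list before adding the new one.
columns = housing_small.columns.tolist() + ['rooms_per_household']

housing_extra = pd.DataFrame(extra_attribs,
                             index=housing_small.index,
                             columns=columns)
print(housing_extra.columns.tolist())
```

Note that Index.union removes duplicate labels, so the list concatenation is the more predictable choice if a new column name could collide with an existing one.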

Cumulative standard deviation of groups

How can one calculate cumulative standard deviation of groups with varying lengths?
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'foo',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
df.groupby('A')['B'].nunique() gives bar: 2, foo: 3
...but...
df.groupby('A')[['C', 'D']].rolling(df.groupby('A')['B'].nunique(), min_periods=2).std()
...gives...
ValueError: window must be an integer
I think you could use expanding (new since Pandas 0.18) to get a rolling window that expands with the size of the group, first adding B as index and sorting:
df.set_index('B').sort_index().groupby(['A'])[['C', 'D']].expanding(2).std()
                  C         D
A   B
bar one         NaN       NaN
    two    0.174318  0.039794
foo one         NaN       NaN
    one    1.395085  1.364566
    three  1.010592  1.029694
    three  0.986744  0.957615
    two    0.854773  0.876763
    two    1.048024  0.807519
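A deterministic sketch of a per-group expanding standard deviation, using made-up values so the numbers can be checked by hand. It uses the double-bracket column selection [['C']] that current pandas versions require, and min_periods=2 so the first row of each group is NaN.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Each group's window grows from its first row to the current row,
# so groups of different lengths need no explicit window size.
cum_std = df.groupby('A')[['C']].expanding(min_periods=2).std()
print(cum_std)
```

For the foo group the values are 1, 3, 5, so the last row holds the sample standard deviation of all three, which is 2.0. The result is indexed by the group key and the original row label.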

Pandas - understanding output of pivot table

Here is my example:
import pandas as pd
df = pd.DataFrame({
'Student': ['A', 'B', 'B'],
'Assessor': ['C', 'D', 'D'],
'Score': [72, 19, 92]})
df = df.pivot_table(
index='Student',
columns='Assessor',
values='Score',
aggfunc=lambda x: x)
print(df)
The output looks like:
Assessor C D
Student
A 72 NaN
B NaN [1, 2]
I am not sure why I get '[1,2]' as output. I would expect something like:
Assessor C D
Student
A 72 NaN
B NaN 19
B NaN 92
Here is a related question:
if I replace my dataframe with
df = pd.DataFrame({
'Student': ['A', 'B', 'B'],
'Assessor': ['C', 'D', 'D'],
'Score': ['foo', 'bar', 'foo']})
The output of the same pivot is going to be
Process finished with exit code 255
Any thoughts?
pivot_table finds the unique values of the index/columns and aggregates if there are multiple rows in the original DataFrame in a particular cell.
Indexes/columns are generally meant to be unique, so if you want to get the data in that form, you have to do something a little ugly like this, although you probably don't want to.
# df here is the original (un-pivoted) DataFrame
pivoted = pd.DataFrame(columns=df['Assessor'], index=df['Student'])
# Select the columns explicitly so the unpacking does not depend on column order
for student, assessor, score in df[['Student', 'Assessor', 'Score']].itertuples(index=False):
    pivoted.loc[student, assessor] = score
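If the goal really is one row per duplicate, as in the expected output in the question, one possible sketch is to number the repeats within each Student/Assessor pair with cumcount and include that counter in the index before unstacking. The helper column name n is made up for illustration; this is an alternative to the loop above, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Student': ['A', 'B', 'B'],
                   'Assessor': ['C', 'D', 'D'],
                   'Score': [72, 19, 92]})

# Number repeated (Student, Assessor) pairs 0, 1, ... so each row gets a unique key.
df['n'] = df.groupby(['Student', 'Assessor']).cumcount()

# With (Student, n, Assessor) unique, unstack needs no aggregation.
wide = df.set_index(['Student', 'n', 'Assessor'])['Score'].unstack('Assessor')
print(wide)
```

The counter keeps the index unique, so student B appears on two rows (n = 0 and n = 1) instead of being collapsed into a single cell.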
For your second question, the reason that groupby generally fails is that there are no numeric columns to aggregate, although it seems to be a bug that it completely crashes like that. I added a note to the issue here.