Make pivot with duplicates that are equal

I want to make a pivot from a dataframe with multiple duplicates in 'index' and 'column', where the values I want are always equal when 'index' and 'column' are duplicates.
import pandas as pd
df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
                   "bar": ['A', 'A', 'B', 'C'],
                   "baz": [1, 1, 3, 4]})
But I get:
ValueError: Index contains duplicate entries, cannot reshape
when I try
df.pivot(index='foo', columns='bar', values='baz')

Try this:
df1 = df[~df.duplicated()].pivot(index='foo', columns='bar', values='baz')
print(df1)
bar    A    B    C
foo
one  1.0  NaN  NaN
two  NaN  3.0  4.0
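The same idea can be sketched with drop_duplicates, which reads a little more directly than the boolean mask. The second option, pivot_table with aggfunc="first", is an alternative not mentioned in the answer; it assumes the duplicated cells really do hold equal values, because it silently keeps the first value otherwise.

```python
import pandas as pd

df = pd.DataFrame({
    "foo": ["one", "one", "two", "two"],
    "bar": ["A", "A", "B", "C"],
    "baz": [1, 1, 3, 4],
})

# Option 1: drop fully duplicated rows, then pivot as usual.
wide = df.drop_duplicates().pivot(index="foo", columns="bar", values="baz")

# Option 2: let pivot_table collapse each (foo, bar) cell to its first value.
# Assumption: duplicates within a cell are equal, so "first" loses nothing.
wide_first = df.pivot_table(index="foo", columns="bar", values="baz", aggfunc="first")

print(wide)
print(wide_first)
```

Option 1 still raises the original ValueError if two rows share foo and bar but differ in baz, which is a useful safety check when the "always equal" assumption might not hold.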

Related

How to merge same name column from two different dataframes?

I have four different datasets and have merged three of the dataframes correctly. The 3rd and 4th datasets have a column with the same name. When I merge in the 4th dataset, the values of that same-name column do not line up properly: the user_id is repeated, and the del_keys column shows NaN in the existing rows while the actual values appear in extra rows at the end of the table. I don't want the user_id to repeat, and I want the values of the same-name column combined on the basis of their user_id.
[Image: merged output showing repeated user_id values and NaN in del_keys]
[Image: expected output, with no repeated user_id]

Merge on the user_id column:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'del_keys': [1.0, np.nan, np.nan, np.nan]
})
df2 = pd.DataFrame({
    'user_id': [3, 4, 5],
    'del_keys': [1.0, 2.0, 3.0]
})
# Both frames have a del_keys column, so the merge yields del_keys_x and del_keys_y
final = df1.merge(df2, on="user_id", how="outer")
Use combine_first to fill the NaN values, then drop the helper columns and any duplicates:
final["del_keys"] = final["del_keys_y"].combine_first(final["del_keys_x"])
final = final.drop(columns=["del_keys_x", "del_keys_y"])
final = final.drop_duplicates(subset="user_id")

I'm guessing that you use pd.concat to merge the dataframes.
Some dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'user_id': [1, 2, 3],
'del_keys': [1.0, np.nan, np.nan]
})
df2 = pd.DataFrame({
'user_id': [3, 4, 5],
'del_keys': [1.0, 2.0, 3.0]
})
Merge using pd.concat:
df = pd.concat([df1, df2])
print(df)
   user_id  del_keys
0        1       1.0
1        2       NaN
2        3       NaN
0        3       1.0
1        4       2.0
2        5       3.0
Remove duplicates using DataFrame.drop_duplicates:
result = (
    df
    .sort_values('del_keys')
    .drop_duplicates('user_id', keep='first')
    .sort_values('user_id')
)
print(result)
   user_id  del_keys
0        1       1.0
1        2       NaN
0        3       1.0
1        4       2.0
2        5       3.0
First, we sort the values by del_keys so that all NaNs are at the bottom of the dataframe (sort_values puts NaN last by default). Then we drop the duplicates and keep the first occurrence for each user_id, which is a non-NaN value whenever one exists. Lastly, we sort by user_id to restore the original order.
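The sort-then-drop steps can also be sketched with a groupby, as an alternative to the approach described above. This relies on GroupBy.first() returning the first non-null value in each group; the variable name combined is only illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"user_id": [1, 2, 3], "del_keys": [1.0, np.nan, np.nan]})
df2 = pd.DataFrame({"user_id": [3, 4, 5], "del_keys": [1.0, 2.0, 3.0]})

# Stack the frames, then keep the first non-null del_keys per user_id.
# GroupBy.first() skips NaN, so user 3 gets 1.0 from df2 and user 2 stays NaN.
combined = (
    pd.concat([df1, df2], ignore_index=True)
    .groupby("user_id")["del_keys"]
    .first()
    .reset_index()
)

print(combined)
```

Unlike the sort-based version, this does not depend on row order, but it also gives a fresh 0..n-1 index instead of keeping the original row labels.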

How to convert pandas multiindex into table columns and index?

I have a DataFrame with a multi-index, where each row has a value.
I want to unstack it, but it has duplicated keys. I want to convert it into a table where index1 and index2 belong to one level of the multi-index, index3 and index4 belong to the other level, and value1 to value3 are the values in the original df. If a value doesn't exist, the cell should be NaN.
How should I do this?

See if this example helps you:
import pandas as pd
df = pd.DataFrame(data={'value': [1, 3, 5, 8, 5, 7]}, index=[[1, 2, 3, 4, 5, 6], ['A', 'B', 'A', 'C', 'C', 'A']])
df = df.reset_index(level=1)
df = df.pivot_table(values='value', index=df.index, columns='level_1')
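The original data has a two-level index (1 to 6 and A/B/C) and a single value column. The result keeps 1 to 6 as the row index and has one column per letter (A, B, C), with NaN where no value exists for that combination.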
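When every pair of index labels is unique, the same reshape can be sketched with unstack, which moves one index level into the columns. This is an alternative to the pivot_table call, not part of the original answer. If pairs repeat, unstack raises "Index contains duplicate entries", and pivot_table (which aggregates, using the mean by default) is the safer route.

```python
import pandas as pd

df = pd.DataFrame(
    data={"value": [1, 3, 5, 8, 5, 7]},
    index=[[1, 2, 3, 4, 5, 6], ["A", "B", "A", "C", "C", "A"]],
)

# Move the second index level (the letters) into the columns.
# Missing (row, letter) combinations become NaN.
wide = df["value"].unstack(level=1)

print(wide)
```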

Why I can't merge the columns together

My goal is to transform the array into a DataFrame, and the error occurs only at the columns=... argument:
housing_extra = pd.DataFrame(housing_extra_attribs,
index=housing_num.index,
columns=[housing.columns,'rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Consequently, it returns
AssertionError: Number of manager items must equal union of block items
# manager items: 4, # tot_items: 12
It says I only passed 4 columns, but housing.columns itself has 9 columns.
Here is what I get when I run housing.columns:
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
'total_bedrooms', 'population', 'households', 'median_income',
'ocean_proximity'],
dtype='object')
So my question is: how can I combine the existing columns in housing.columns with the 3 new columns ['rooms_per_household', 'population_per_household', 'bedrooms_per_room']?

You can use Index.union to add a list of columns to the existing DataFrame columns:
columns = housing.columns.union(
    ['rooms_per_household', 'population_per_household', 'bedrooms_per_room'],
    sort=False)
Or convert to a list and then append the remaining columns as a list:
columns = (housing.columns.tolist() +
           ['rooms_per_household', 'population_per_household', 'bedrooms_per_room'])
Then:
housing_extra = pd.DataFrame(housing_extra_attribs,
                             index=housing_num.index,
                             columns=columns)
An example:
Assume this df:
df = pd.DataFrame(columns=['A', 'B', 'C', 'D'])
print(df.columns)
# Index(['A', 'B', 'C', 'D'], dtype='object')
When you put this inside a list:
[df.columns, 'E', 'F', 'G']
you get a 4-item list whose first item is the whole Index:
[Index(['A', 'B', 'C', 'D'], dtype='object'), 'E', 'F', 'G']
whereas when you use union:
df.columns.union(['E', 'F', 'G'], sort=False)
you get a single flat Index:
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G'], dtype='object')
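To make the difference concrete, here is a minimal, self-contained sketch. The housing frame and the toy array below are hypothetical stand-ins (three existing columns plus two new ones), not the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `housing` frame: three existing columns.
housing = pd.DataFrame(columns=["longitude", "latitude", "population"])
extra_cols = ["rooms_per_household", "bedrooms_per_room"]

# Nesting the Index inside a list gives a 3-item list, not 5 column names.
nested = [housing.columns, *extra_cols]

# Flattening first gives one name per array column.
columns = housing.columns.tolist() + extra_cols

values = np.arange(10).reshape(2, 5)  # toy array with 5 columns
housing_extra = pd.DataFrame(values, columns=columns)

print(len(nested), len(columns))
print(housing_extra.columns.tolist())
```

The list-concatenation form keeps duplicate names if any exist, whereas Index.union removes them, so the list form is the safer choice when the number of names must match the number of array columns exactly.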

Cumulative standard deviation of groups

How can one calculate cumulative standard deviation of groups with varying lengths?
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df.groupby('A')['B'].nunique() gives bar: 2, foo: 3
...but...
df.groupby('A')[['C', 'D']].rolling(df.groupby('A')['B'].nunique(), min_periods=2).std()
...gives...
ValueError: window must be an integer

I think you could use expanding (new in pandas 0.18) to get a window that grows with the size of the group, first setting B as the index and sorting:
df.set_index('B').sort_index().groupby(['A'])[['C', 'D']].expanding(2).std()
                  C         D
A   B
bar one         NaN       NaN
    two    0.174318  0.039794
foo one         NaN       NaN
    one    1.395085  1.364566
    three  1.010592  1.029694
    three  0.986744  0.957615
    two    0.854773  0.876763
    two    1.048024  0.807519
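Because the numbers above come from random data, a small deterministic sketch may make the behaviour easier to check by hand. The values in C are made up for illustration; expanding(min_periods=2) is the keyword form of expanding(2).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "foo", "bar", "foo"],
    "C": [1.0, 2.0, 3.0, 5.0, 4.0, 7.0],
})

# Within each group the window grows by one row at a time;
# min_periods=2 leaves the first row of every group as NaN.
cum_std = df.groupby("A")["C"].expanding(min_periods=2).std()

print(cum_std)

# Cross-check one entry: the first three "foo" values are 1, 3, 5.
manual = np.std([1.0, 3.0, 5.0], ddof=1)
print(manual)
```

The result is indexed by group label and original row label, so cum_std.loc[("foo", 3)] is the sample standard deviation of the first three foo rows.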

Pandas - understanding output of pivot table

Here is my example:
import pandas as pd
df = pd.DataFrame({
'Student': ['A', 'B', 'B'],
'Assessor': ['C', 'D', 'D'],
'Score': [72, 19, 92]})
df = df.pivot_table(
index='Student',
columns='Assessor',
values='Score',
aggfunc=lambda x: x)
print(df)
The output looks like:
Assessor    C       D
Student
A          72     NaN
B         NaN  [1, 2]
I am not sure why I get '[1, 2]' as output. I would expect something like:
Assessor    C    D
Student
A          72  NaN
B         NaN   19
B         NaN   92
Here is a related question:
if I replace my dataframe with
df = pd.DataFrame({
'Student': ['A', 'B', 'B'],
'Assessor': ['C', 'D', 'D'],
'Score': ['foo', 'bar', 'foo']})
The output of the same pivot is going to be
Process finished with exit code 255
Any thoughts?

pivot_table finds the unique values of the index/columns and aggregates whenever multiple rows of the original DataFrame fall into the same cell.
Indexes/columns are generally meant to be unique, so if you want to get the data in that form, you have to do something a little ugly like this, although you probably don't want to. Run it on the original df, before the pivot_table call; rows are filled by position because the Student labels repeat.
pivoted = pd.DataFrame(index=df['Student'], columns=df['Assessor'].unique())
rows = df[['Student', 'Assessor', 'Score']].itertuples(index=False)
for row, (student, assessor, score) in enumerate(rows):
    pivoted.iloc[row, pivoted.columns.get_loc(assessor)] = score
For your second question, the reason that groupby generally fails is that there are no numeric columns to aggregate, although it seems to be a bug that it completely crashes like that. I added a note to the issue here.
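If the one-row-per-original-row layout from the question is really what is needed, one possible sketch avoids the explicit loop by keeping the original row number in the index while unstacking. This is an alternative suggestion, not the answerer's method.

```python
import pandas as pd

df = pd.DataFrame({
    "Student": ["A", "B", "B"],
    "Assessor": ["C", "D", "D"],
    "Score": [72, 19, 92],
})

# Keep the original row number as an extra index level so that every
# (row, Student, Assessor) key is unique, then move Assessor to the columns.
wide = (
    df.set_index(["Student", "Assessor"], append=True)["Score"]
    .unstack("Assessor")
    .droplevel(0)  # drop the helper row-number level again
)

print(wide)
```

The resulting index contains the label B twice, so label-based lookups such as wide.loc["B"] return both rows.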
