Pandas transformation, duplicate index values to column values

I have the following pandas dataframe:
   0
0
A  0
B  0
C  0
C  4
A  1
A  7
Now some index letters (A and C) appear multiple times. I want the values for these repeated index letters in extra columns instead of extra rows. The desired pandas dataframe looks like:
   0  1       2
0
A  0  1       7
B  0  np.nan  np.nan
C  0  4       np.nan
Anything would help!

IIUC, you need to add a helper column:
(df.assign(group=df.groupby(level=0).cumcount())
   .set_index('group', append=True)[0]   # 0 is the name of the column here
   .unstack('group')
)
or:
(df.reset_index()
   .assign(group=lambda d: d.groupby('index').cumcount())
   .pivot(index='index', columns='group', values=0)   # column name here again
)
output:
group    0    1    2
A      0.0  1.0  7.0
B      0.0  NaN  NaN
C      0.0  4.0  NaN
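A self-contained version of the first approach, reconstructing the sample frame from the question (the single column literally named 0 is an assumption based on the printout above):
import pandas as pd

df = pd.DataFrame({0: [0, 0, 0, 4, 1, 7]},
                  index=['A', 'B', 'C', 'C', 'A', 'A'])

out = (df.assign(group=df.groupby(level=0).cumcount())  # 0, 1, 2 within each letter
         .set_index('group', append=True)[0]
         .unstack('group'))
print(out)
# group    0    1    2
# A      0.0  1.0  7.0
# B      0.0  NaN  NaN
# C      0.0  4.0  NaN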

What is the difference between the 'set' operation using loc vs iloc?

df.iloc[2, df.columns.get_loc('ColName')] = 3
# vs
df.loc[2, 'ColName'] = 3
Why does the documentation page for iloc (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) not have any 'set' examples like those shown on the loc page (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)? Is loc the preferred way?
There isn't much of a difference to speak of; it comes down to what you have available.
If you have the index label and the column name (most of the time), you should use the loc (label-based location) accessor to assign values.
If, as with an ordinary matrix, you only have the integer positions of the row and column, you should use iloc (integer-based location) for assignment.
A pandas DataFrame supports indexing both by integer position and by label. The ambiguity arises when the index itself consists of integers rather than strings, so to make it explicit whether the user wants position-based or label-based indexing, the two separate accessors are provided.
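A minimal sketch of the distinction, using a small frame with a string index (the names here are just for illustration):
import pandas as pd

df = pd.DataFrame({'ColName': [10, 20, 30]}, index=['x', 'y', 'z'])

df.iloc[2, df.columns.get_loc('ColName')] = 3   # by position: the third row
df.loc['y', 'ColName'] = 5                      # by label: the row labelled 'y'

print(df)
#    ColName
# x       10
# y        5
# z        3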
The main difference is that iloc sets values by position and loc by label. Here are some examples.
Sample with a non-default integer index (if label 2 exists, the cell is overwritten; otherwise a new row with that label is appended):
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)),
                  columns=['A','B','C'], index=[2,1,8])
print (df)
   A  B  C
2  2  2  6
1  1  3  9
8  6  1  0
Set by position (the third row is the one with label 8):
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
    A  B  C
2   2  2  6
1   1  3  9
8  30  1  0
Overwritten label 2 (the label exists, so the cell is replaced):
df.loc[2, 'A'] = 50
print (df)
    A  B  C
2  50  2  6
1   1  3  9
8  30  1  0
Appended new row with label 0 (the label does not exist, so a row is added and the dtypes become float because of the NaN values):
df.loc[0, 'A'] = 70
print (df)
      A    B    C
2  50.0  2.0  6.0
1   1.0  3.0  9.0
8  30.0  1.0  0.0
0  70.0  NaN  NaN
Default index (both work the same here, because the third positional row also has label 2):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'])
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
    A  B  C
0   2  2  6
1   1  3  9
2  30  1  0
df.loc[2, 'A'] = 50
print (df)
    A  B  C
0   2  2  6
1   1  3  9
2  50  1  0
Non-integer index (setting by position works; setting by label 2 appends a new row, because that label does not exist):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)),
                  columns=['A','B','C'], index=list('abc'))
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
    A  B  C
a   2  2  6
b   1  3  9
c  30  1  0
df.loc[2, 'A'] = 50
print (df)
      A    B    C
a   2.0  2.0  6.0
b   1.0  3.0  9.0
c  30.0  1.0  0.0
2  50.0  NaN  NaN
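As a side note, for single-cell assignment pandas also provides the scalar accessors at (label-based) and iat (position-based), which mirror the loc/iloc split and are usually faster for one value; a small self-contained sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  columns=['A', 'B', 'C'], index=list('abc'))

df.iat[2, df.columns.get_loc('A')] = 30   # position-based scalar set
df.at['c', 'A'] = 40                      # label-based scalar set
print(df)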

Sum columns in pandas having string and number

I need to sum column a and column b, which contain strings in the first row:
>>> df
   a  b
0  c  d
1  1  2
2  3  4
>>> df['sum'] = df.sum(1)
>>> df
   a  b sum
0  c  d  cd
1  1  2   3
2  3  4   7
I only need to add numeric values and get an output like
>>> df
   a  b  sum
0  c  d  "dummyString/NaN"
1  1  2  3
2  3  4  7
I need to add only some columns:
df['sum'] = df['a'] + df['b']
Solution for mixed data (numeric with strings):
I think the simplest is to convert non-numeric values to NaN with to_numeric after summing:
df['sum'] = pd.to_numeric(df[['a','b']].sum(axis=1), errors='coerce')
Or:
df['sum'] = pd.to_numeric(df['a']+df['b'], errors='coerce')
print (df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0
EDIT:
Solutions if the numbers are string representations: first convert to numeric and then sum:
df['sum'] = pd.to_numeric(df['a'], errors='coerce') + pd.to_numeric(df['b'], errors='coerce')
print (df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0
Or:
df['sum'] = (df[['a', 'b']].apply(lambda x: pd.to_numeric(x, errors='coerce'))
                           .sum(axis=1, min_count=1))
print (df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0
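Why min_count=1: by default sum(axis=1) returns 0 for an all-NaN row, so the coerced string row would show 0 instead of NaN; min_count=1 requires at least one valid value. A quick sketch:
import numpy as np
import pandas as pd

row = pd.DataFrame({'a': [np.nan], 'b': [np.nan]})
print(row.sum(axis=1))               # 0.0: an all-NaN row sums to 0 by default
print(row.sum(axis=1, min_count=1))  # NaN: requires at least one valid value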

pandas: How to compare value of column with next value

I have a dataframe which looks as follows:
  colA  colB
0    A    10
1    B    20
2    C     5
3    D     2
4    F    30
....
I would like to compare the colB values to detect two successive decrements. That is, I want to report the index values where two successive decrements of colB occur. For example, I want to report 'B' because the two rows following B each have a decremented colB value. I am not sure how to approach this without writing a loop. (If there is no way to avoid a loop, I'd like to know.)
Thanks
You can use loc for this:
desired = frame.loc[(frame["colB"] >= frame["colB"].shift(-1)) &
                    (frame["colB"].shift(-1) >= frame["colB"].shift(-2))]
print(desired)
The output will be:
  colA  colB
1    B    20
If you only wish to report the value B:
desired = frame["colA"].loc[(frame["colB"] >= frame["colB"].shift(-1)) &
                            (frame["colB"].shift(-1) >= frame["colB"].shift(-2))]
print(desired.values)
The output will be:
['B']
Yes, you can do this without using a loop.
import numpy as np
import pandas as pd

df = pd.DataFrame({'colA': ['A', 'B', 'C', 'D', 'F'], 'colB': [10, 20, 5, 2, 30]})
>>> df['colC'] = df['colB'].diff(-1)
>>> df
  colA  colB  colC
0    A    10 -10.0
1    B    20  15.0
2    C     5   3.0
3    D     2 -28.0
4    F    30   NaN
'colC' is the difference between consecutive rows (each value minus the next).
>>> df['colD'] = np.where(df['colC'] > 0, 1, 0)
>>> df
  colA  colB  colC  colD
0    A    10 -10.0     0
1    B    20  15.0     1
2    C     5   3.0     1
3    D     2 -28.0     0
4    F    30   NaN     0
In 'colD' we set a flag where the difference is greater than 0.
>>> df['s'] = df['colD'].shift(-1)
>>> df
  colA  colB  colC  colD    s
0    A    10 -10.0     0  1.0
1    B    20  15.0     1  1.0
2    C     5   3.0     1  0.0
3    D     2 -28.0     0  0.0
4    F    30   NaN     0  NaN
In column 's' we shift the values of 'colD' up by one.
>>> df['flag'] = np.where((df['colD'] == 1) & (df['colD'] == df['s']), 1, 0)
>>> df
  colA  colB  colC  colD    s  flag
0    A    10 -10.0     0  1.0     0
1    B    20  15.0     1  1.0     1
2    C     5   3.0     1  0.0     0
3    D     2 -28.0     0  0.0     0
4    F    30   NaN     0  NaN     0
'flag' is then the required column: it is 1 exactly at the rows followed by two successive decrements.
We need a little bit of logic here:
s = df.colB.diff().gt(0)  # True where colB increases from the previous row
# each run starts at an increase; a run of at least 3 rows means the
# starting row is followed by two successively decreasing values
df.loc[df.groupby(s.cumsum()).colA.transform('count').ge(3) & s, 'colA']
Out[45]:
1    B
Name: colA, dtype: object
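A self-contained restatement of that idea with the intermediate steps spelled out, assuming the same sample frame:
import pandas as pd

df = pd.DataFrame({'colA': ['A', 'B', 'C', 'D', 'F'],
                   'colB': [10, 20, 5, 2, 30]})

s = df['colB'].diff().gt(0)   # True where colB increases from the previous row
group = s.cumsum()            # a new group starts at every increase
size = df.groupby(group)['colA'].transform('count')   # length of each run

# a run of length >= 3 that starts at an increase means the starting row
# is followed by at least two successive decrements
print(df.loc[size.ge(3) & s, 'colA'])   # 1    B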

Drop 'NaN' value in dataframe

Is there any way to drop only the 'NaN' values from a dataset, rather than removing the entire row or column that contains a 'NaN'? I have tried the code below, but the result was not what I wanted.
df = pd.read_csv('...csv')
df.stack()
Here is part of the CSV, and here is the result after '.stack()' (screenshots omitted): the headers get mixed up with the actual data, which is not what I want.
You can use:
df.fillna('')
which will fill NaN with an empty string ''. Or you can fill it with whatever you like.
You can also filter with a condition instead of dropna: a NaN value is not equal to itself, so comparing a column with itself keeps only the non-NaN rows (see the sketch below).
And you can drop a whole column or row by using:
column: del df['column_name']
row: df.drop([row_index])
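A small sketch of that NaN-is-not-equal-to-itself trick (the column name 'col' is just an example):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': [1.0, np.nan, 3.0]})

# NaN != NaN, so this boolean mask is False exactly at the NaN rows
print(df[df['col'] == df['col']])
# equivalent and more readable:
print(df[df['col'].notna()])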
Consider the dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.iloc[1, 1] = np.nan
print(df)
   0    1  2
0  0  1.0  2
1  3  NaN  5
2  6  7.0  8
You can drop just the middle cell, but only if we stack (stack drops NaN by default):
df.stack()
0  0    0.0
   1    1.0
   2    2.0
1  0    3.0
   2    5.0
2  0    6.0
   1    7.0
   2    8.0
dtype: float64
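A note on the design: once stacked, the NaN cell is simply gone; unstacking restores the rectangular shape, and with it the NaN. A quick sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3))
df.iloc[1, 1] = np.nan

s = df.stack()      # the NaN cell is dropped by default (dropna=True)
print(s.unstack())  # back to 3x3; the hole reappears as NaN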

how to insert a new integer index ipython pandas

I made a value-count dataframe from another dataframe, for example:
         freq
0           2
0.33333    10
1.66667    13
Automatically, its index values are 0, 0.33333 and 1.66667, and they can vary, because I intend to make many dataframes based on a specific value. How can I insert an integer index, like this?
            freq
0  0           2
1  0.33333    10
2  1.66667    13
Thanks
The result you get back from value_counts is a Series, and to set a generic 0 ... n-1 index, you can use reset_index:
In [4]: s = pd.Series([0, 0.3, 0.3, 1.6])

In [5]: s.value_counts()
Out[5]:
0.3    2
1.6    1
0.0    1
dtype: int64

In [9]: s.value_counts().reset_index(name='freq')
Out[9]:
   index  freq
0    0.3     2
1    1.6     1
2    0.0     1
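If the generic 'index' column name bothers you, a rename afterwards gives it a meaningful label (the name 'value' here is just an example):
import pandas as pd

s = pd.Series([0, 0.3, 0.3, 1.6])
df = (s.value_counts()
       .reset_index(name='freq')
       .rename(columns={'index': 'value'}))
print(df)
#    value  freq
# 0    0.3     2
# 1    1.6     1
# 2    0.0     1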