What does this line do (uniques) in dataframes - pandas

can someone help me understand this line. Thanks in advance
uniques = data.apply(lambda x: [x.unique])

I believe that the syntax that you're looking for is data.apply(lambda x: [x.unique()]). If so, then it outputs a DataFrame with a single row. Each element is a list containing the unique elements in that column. An example:-
In [14]: data.head()
Out[14]:
ipip wage1992 prtyid ideol gender
1 3 0.614226 7.0 5.000000 1
2 4 0.674754 5.0 5.000000 1
3 4 0.267990 7.0 6.000000 1
4 1 0.261490 1.0 4.000000 1
5 3 0.109723 2.0 1.760314 1
In [15]: data.apply(lambda x: [x.unique()])
Out[15]:
ipip ... gender
0 [3, 4, 1, 5, 2] ... [1, 0]
[1 rows x 5 columns]
As an example, the gender column contains two unique values, 1 and 0.

Related

Setting multiple column at once give error "Not in index error!"

import pandas as pd
df = pd.DataFrame(
[
[5, 2],
[3, 5],
[5, 5],
[8, 9],
[90, 55]
],
columns = ['max_speed', 'shield']
)
df.loc[(df.max_speed > df.shield), ['stat', 'delta']] \
= 'overspeed', df['max_speed'] - df['shield']
I am setting multiple column using .loc as above, for some cases I get Not in index error!. Am I doing something wrong above?
Create list of tuples by same size like number of Trues with filtered Series after subtract with repeat scalar overspeed:
m = (df.max_speed > df.shield)
s = df['max_speed'] - df['shield']
df.loc[m, ['stat', 'delta']] = list(zip(['overspeed'] * m.sum(), s[m]))
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0
Another idea with helper DataFrame:
df.loc[m, ['stat', 'delta']] = pd.DataFrame({'stat':'overspeed', 'delta':s})[m]
Details:
print(list(zip(['overspeed'] * m.sum(), s[m])))
[('overspeed', 3), ('overspeed', 35)]
print (pd.DataFrame({'stat':'overspeed', 'delta':s})[m])
stat delta
0 overspeed 3
4 overspeed 35
Simpliest is assign separately:
df.loc[m, 'stat'] = 'overspeed'
df.loc[m, 'delta'] = df['max_speed'] - df['shield']
print(df)
max_speed shield stat delta
0 5 2 overspeed 3.0
1 3 5 NaN NaN
2 5 5 NaN NaN
3 8 9 NaN NaN
4 90 55 overspeed 35.0

Subtract values from different groups

I have the following DataFrame:
A X
Time
1 a 10
2 b 17
3 b 20
4 c 21
5 c 36
6 d 40
given by pd.DataFrame({'Time': [1, 2, 3, 4, 5, 6], 'A': ['a', 'b', 'b', 'c', 'c', 'd'], 'X': [10, 17, 20, 21, 36, 40]}).set_index('Time')
The desired output is:
Time Difference
0 2 7
1 4 1
2 6 4
The first difference 1 is a result of subtracting 21 from 20: (first "c" value - last "b" value).
I'm open to numPy transformations as well.
Aggregate by GroupBy.agg with GroupBy.first,
GroupBy.last and then subtract shifted values for last column with omit first row by positions:
df = df.reset_index()
df1 = df.groupby('A',as_index=False, sort=False).agg(first=('X', 'first'),
last=('X','last'),
Time=('Time','first'))
df1['Difference'] = df1['first'].sub(df1['last'].shift(fill_value=0))
df1 = df1[['Time','Difference']].iloc[1:].reset_index(drop=True)
print (df1)
Time Difference
0 2 7
1 4 1
2 6 4
IIUC, you can pivot, ffill the columns, and compute the difference:
g = df.reset_index().groupby('A')
(df.assign(col=g.cumcount().values)
.pivot('A', 'col', 'X')
.ffill(axis=1)
.assign(Time=g['Time'].first(),
diff=lambda d: d[0]-d[1].shift())
[['Time', 'diff']].iloc[1:]
.rename_axis(index=None, columns=None)
)
output:
Time Difference
b 2 7.0
c 4 1.0
d 6 4.0
Intermediate, pivoted/ffilled dataframe:
col 0 1 Time Difference
A
a 10.0 10.0 1 NaN
b 17.0 20.0 2 7.0
c 21.0 36.0 4 1.0
d 40.0 40.0 6 4.0
Another possible solution:
(df.assign(Y = df['X'].shift())
.iloc[df.index % 2 == 0]
.assign(Difference = lambda z: z['X'] - z['Y'])
.reset_index()
.loc[:, ['Time', 'Difference']]
)
Output:
Time Difference
0 2 7.0
1 4 1.0
2 6 4.0

pandas dataframe how to sum all value of bigger columns per row

I have a dataframe:
0. 1. 2. 3
2. 3. 5. 9
5. 1. 0. 3
And for columns 1,2,3 - I want the value for each row to be the sum of higher columns
So the new df will be:
0. 1. 2. 3
2. 17 14. 9
5. 4. 3. 3
What is the best way to do so?
Use inverse DataFrame.cumsum:
L = [1,2,3]
df[L[::-1]] = df[L[::-1]].cumsum(axis=1)
print (df)
0 1 2 3
0 2.0 17.0 14.0 9.0
1 5.0 4.0 3.0 3.0
Let us try cumsum along axis=1 after reverting the columns 1, 2, 3:
c = [1, 2, 3]
df.loc[:, c] = df.loc[:, c[::-1]].cumsum(axis=1)
0 1 2 3
0 2 17 14 9
1 5 4 3 3

Replace all values from one pandas dataframe to another without extra columns

These are my two dataframes:
df1 = pd.DataFrame({'animal': ['falcon', 'dog', 'spider', 'fish'],'num_legs': [2, 4, 8, 0],'num_wings': [2, 0, 0, 0],'num_specimen_seen': [10, 2, 1, 8]})
df2 = pd.DataFrame({'animal': ['falcon', 'dog'],'num_legs': [4, 2],'num_wings': [0, 2],'num_specimen_seen': [2, 10]})
When I use left join , this is the result:
merge = df1.merge(df2, on='animal', how='left')
Output:
animal num_legs_x num_wings_x num_specimen_seen_x num_legs_y num_wings_y num_specimen_seen_y
falcon 2 2 10 4 0 2
dog 4 0 2 2 2 10
spider 8 0 1 NaN NaN NaN
fish 0 0 8 NaN NaN NaN
I am looking for an output like this , where row 1 and 2 values are replaced by values coming from df2 :
animal num_legs num_wings num_specimen_seen
falcon 4 0 2
dog 2 2 10
spider 8 0 1
fish 0 0 8
I attempted using np.where but couldnt write something correctly
df = np.where(df1.animal == df2.animal, ?, ?)
Maybe left join isnt correct way to achieve what I want. I am new to pandas , any help would be appreciated.
Let us do update
df1 = df1.set_index('animal')
df1.update(df2.set_index('animal'))
df1 = df1.reset_index()
df1
animal num_legs num_wings num_specimen_seen
0 falcon 4.0 0.0 2.0
1 dog 2.0 2.0 10.0
2 spider 8.0 0.0 1.0
3 fish 0.0 0.0 8.0

How to fill in pandas column with previous column value using apply [duplicate]

Suppose I have a DataFrame with some NaNs:
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df
0 1 2
0 1 2 3
1 4 NaN NaN
2 NaN NaN 9
What I need to do is replace every NaN with the first non-NaN value in the same column above it. It is assumed that the first row will never contain a NaN. So for the previous example the result would be
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
I can just loop through the whole DataFrame column-by-column, element-by-element and set the values directly, but is there an easy (optimally a loop-free) way of achieving this?
You could use the fillna method on the DataFrame and specify the method as ffill (forward fill):
>>> df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
>>> df.fillna(method='ffill')
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
This method...
propagate[s] last valid observation forward to next valid
To go the opposite way, there's also a bfill method.
This method doesn't modify the DataFrame inplace - you'll need to rebind the returned DataFrame to a variable or else specify inplace=True:
df.fillna(method='ffill', inplace=True)
The accepted answer is perfect. I had a related but slightly different situation where I had to fill in forward but only within groups. In case someone has the same need, know that fillna works on a DataFrameGroupBy object.
>>> example = pd.DataFrame({'number':[0,1,2,nan,4,nan,6,7,8,9],'name':list('aaabbbcccc')})
>>> example
name number
0 a 0.0
1 a 1.0
2 a 2.0
3 b NaN
4 b 4.0
5 b NaN
6 c 6.0
7 c 7.0
8 c 8.0
9 c 9.0
>>> example.groupby('name')['number'].fillna(method='ffill') # fill in row 5 but not row 3
0 0.0
1 1.0
2 2.0
3 NaN
4 4.0
5 4.0
6 6.0
7 7.0
8 8.0
9 9.0
Name: number, dtype: float64
You can use pandas.DataFrame.fillna with the method='ffill' option. 'ffill' stands for 'forward fill' and will propagate last valid observation forward. The alternative is 'bfill' which works the same way, but backwards.
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df = df.fillna(method='ffill')
print(df)
# 0 1 2
#0 1 2 3
#1 4 2 3
#2 4 2 9
There is also a direct synonym function for this, pandas.DataFrame.ffill, to make things simpler.
One thing that I noticed when trying this solution is that if you have N/A at the start or the end of the array, ffill and bfill don't quite work. You need both.
In [224]: df = pd.DataFrame([None, 1, 2, 3, None, 4, 5, 6, None])
In [225]: df.ffill()
Out[225]:
0
0 NaN
1 1.0
...
7 6.0
8 6.0
In [226]: df.bfill()
Out[226]:
0
0 1.0
1 1.0
...
7 6.0
8 NaN
In [227]: df.bfill().ffill()
Out[227]:
0
0 1.0
1 1.0
...
7 6.0
8 6.0
Only one column version
Fill NAN with last valid value
df[column_name].fillna(method='ffill', inplace=True)
Fill NAN with next valid value
df[column_name].fillna(method='backfill', inplace=True)
Just agreeing with ffill method, but one extra info is that you can limit the forward fill with keyword argument limit.
>>> import pandas as pd
>>> df = pd.DataFrame([[1, 2, 3], [None, None, 6], [None, None, 9]])
>>> df
0 1 2
0 1.0 2.0 3
1 NaN NaN 6
2 NaN NaN 9
>>> df[1].fillna(method='ffill', inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 NaN 2.0 6
2 NaN 2.0 9
Now with limit keyword argument
>>> df[0].fillna(method='ffill', limit=1, inplace=True)
>>> df
0 1 2
0 1.0 2.0 3
1 1.0 2.0 6
2 NaN 2.0 9
ffill now has it's own method pd.DataFrame.ffill
df.ffill()
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
You can use fillna to remove or replace NaN values.
NaN Remove
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df.fillna(method='ffill')
0 1 2
0 1.0 2.0 3.0
1 4.0 2.0 3.0
2 4.0 2.0 9.0
NaN Replace
df.fillna(0) # 0 means What Value you want to replace
0 1 2
0 1.0 2.0 3.0
1 4.0 0.0 0.0
2 0.0 0.0 9.0
Reference pandas.DataFrame.fillna
There's also pandas.Interpolate, which I think gives one more control
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, None, None], [None, None, 9]])
df=df.interpolate(method="pad",limit=None, downcast="infer") #downcast keeps dtype as int
print(df)
0 1 2
0 1 2 3
1 4 2 3
2 4 2 9
In my case, we have time series from different devices but some devices could not send any value during some period. So we should create NA values for every device and time period and after that do fillna.
df = pd.DataFrame([["device1", 1, 'first val of device1'], ["device2", 2, 'first val of device2'], ["device3", 3, 'first val of device3']])
df.pivot(index=1, columns=0, values=2).fillna(method='ffill').unstack().reset_index(name='value')
Result:
0 1 value
0 device1 1 first val of device1
1 device1 2 first val of device1
2 device1 3 first val of device1
3 device2 1 None
4 device2 2 first val of device2
5 device2 3 first val of device2
6 device3 1 None
7 device3 2 None
8 device3 3 first val of device3