What is the difference between the 'set' operation using loc vs iloc?

df.iloc[2, df.columns.get_loc('ColName')] = 3
#vs#
df.loc[2, 'ColName'] = 3
Why does the iloc documentation page (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) not show any 'set' examples like those on the loc page (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)? Is loc the preferred way?

There isn't much of a difference; it comes down to what you have available.
If you know the index label and the column name (the most common case), use loc (label-based location) to assign the value.
If, as with an ordinary matrix, you only have the integer positions of the row and column, use iloc (integer-based location) for the assignment.
A pandas DataFrame supports both kinds of indexing: positional (integer-based) and label-based.
Ambiguity arises when the index itself consists of integers rather than strings, so the two accessors exist to make explicit whether the user wants position-based or label-based indexing.
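A minimal sketch of the distinction, using a small hypothetical DataFrame whose index happens to contain integers:
import pandas as pd

df = pd.DataFrame({'ColName': [10, 20, 30]}, index=[3, 2, 1])
# iloc: position 2 is the last row, the one labelled 1
df.iloc[2, df.columns.get_loc('ColName')] = 3
# loc: label 2 is the middle row
df.loc[2, 'ColName'] = 3
print (df)
   ColName
3       10
2        3
1        3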

The main difference is that iloc sets values by position, while loc sets them by label.
Here are some alternatives:
Sample:
Non-default index (if label 2 exists its cell is overwritten, otherwise a new row with that label is appended):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'], index=[2,1,8])
print (df)
   A  B  C
2  2  2  6
1  1  3  9
8  6  1  0
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
    A  B  C
2   2  2  6
1   1  3  9
8  30  1  0
A new row with label 0 is appended:
df.loc[0, 'A'] = 70
print (df)
      A    B    C
2   2.0  2.0  6.0
1   1.0  3.0  9.0
8  30.0  1.0  0.0
0  70.0  NaN  NaN
The cell at label 2 is overwritten:
df.loc[2, 'A'] = 50
print (df)
    A  B  C
2  50  2  6
1   1  3  9
8  30  1  0
Default index (works the same here, because the row at position 2 also has label 2):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'])
print (df)
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
    A  B  C
0   2  2  6
1   1  3  9
2  30  1  0
df.loc[2, 'A'] = 50
print (df)
    A  B  C
0   2  2  6
1   1  3  9
2  50  1  0
Non-integer index (setting by position works; using loc with label 2 appends a new row):
np.random.seed(123)
df = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=['A','B','C'], index=list('abc'))
print (df)
df.iloc[2, df.columns.get_loc('A')] = 30
print (df)
    A  B  C
a   2  2  6
b   1  3  9
c  30  1  0
df.loc[2, 'A'] = 50
print (df)
      A    B    C
a   2.0  2.0  6.0
b   1.0  3.0  9.0
c  30.0  1.0  0.0
2  50.0  NaN  NaN

Related

iLocation based boolean indexing on an integer type is not available

I have an issue: I want to get the rows that contain missing values in the 'Mileage' column of my table, using iloc and pd.isnull.
import pandas as pd
df=pd.read_csv('BikeList.csv')
d1=df['Mileage']
print(d1)
print(pd.isnull(df['Mileage']))
d2=df.iloc[pd.isnull(df['Mileage']),['Bike','Mileage']]
I am getting this error:
iLocation based boolean indexing on an integer type is not available
You need to use DataFrame.loc, because you select by the labels Bike and Mileage:
d2 = df.loc[pd.isnull(df['Mileage']),['Bike','Mileage']]
Or use Series.isna:
d2 = df.loc[df['Mileage'].isna(),['Bike','Mileage']]
If DataFrame.iloc is needed, it is necessary to convert the boolean mask to a numpy array and also the column labels to positions with Index.get_indexer:
d2 = df.iloc[pd.isnull(df['Mileage']).values, df.columns.get_indexer(['Bike','Mileage'])]
Sample:
import numpy as np

df = pd.DataFrame({
    'A':list('abcdef'),
    'Mileage':[np.nan,5,4,5,5,np.nan],
    'Bike':[7,8,9,4,2,3],
    'D':[1,3,5,7,1,0],
    'E':[5,3,6,9,2,4],
    'F':list('aaabbb')
})
print (df)
   A  Mileage  Bike  D  E  F
0  a      NaN     7  1  5  a
1  b      5.0     8  3  3  a
2  c      4.0     9  5  6  a
3  d      5.0     4  7  9  b
4  e      5.0     2  1  2  b
5  f      NaN     3  0  4  b
d2 = df.loc[pd.isnull(df['Mileage']),['Bike','Mileage']]
print (d2)
   Bike  Mileage
0     7      NaN
5     3      NaN
d2 = df.iloc[pd.isnull(df['Mileage']).values, df.columns.get_indexer(['Bike','Mileage'])]
print (d2)
   Bike  Mileage
0     7      NaN
5     3      NaN

Sum columns in pandas having string and number

I need to sum column a and column b, which contain strings in the 1st row
>>> df
   a  b
0  c  d
1  1  2
2  3  4
>>> df['sum'] = df.sum(1)
>>> df
   a  b sum
0  c  d  cd
1  1  2   3
2  3  4   7
I only need to add numeric values and get an output like
>>> df
   a  b                sum
0  c  d  "dummyString/NaN"
1  1  2                  3
2  3  4                  7
I need to add only some columns:
df['sum'] = df['a'] + df['b']
Solution if the data is mixed - numeric with strings:
I think the simplest is to convert non-numeric values to NaN after the sum with to_numeric:
df['sum'] = pd.to_numeric(df[['a','b']].sum(1), errors='coerce')
Or:
df['sum'] = pd.to_numeric(df['a']+df['b'], errors='coerce')
print (df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0
EDIT:
Solution if the numbers are string representations - first convert to numeric and then sum:
df['sum'] = pd.to_numeric(df['a'], errors='coerce') + pd.to_numeric(df['b'], errors='coerce')
print (df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0
Or:
df['sum'] = (df[['a', 'b']].apply(lambda x: pd.to_numeric(x, errors='coerce'))
.sum(axis=1, min_count=1))
print (df)
   a  b  sum
0  c  d  NaN
1  1  2  3.0
2  3  4  7.0

pandas column operation on certain row in succession

I have a pandas dataframe like this:
   second block
0       1     a
1       2     b
2       3     c
3       4     a
4       5     c
This is sequential data, and I would like to get a new column which is the time difference between the current block and the next time it repeats.
   second block  freq
0       1     a     3   // (4-1)
1       2     b     0   // (not repeating)
2       3     c     2   // (5-3)
3       4     a     0   // (not repeating)
4       5     c     0   // (not repeating)
I have tried getting the unique list of blocks and then running a for loop like the one below.
for i in unique_block:
    df['freq'] = df['timestamp'].shift(-1) - df['timestamp']
I do not know how to get 0 for row indexes 1, 3 and 4, and since the dataframe is very big this is not efficient. It is not working.
Thanks.
Use groupby + diff(periods=-1). Multiply by -1 to get your difference convention and fillna with 0.
df['freq'] = (df.groupby('block').diff(-1)*-1).fillna(0)
   second block  freq
0       1     a   3.0
1       2     b   0.0
2       3     c   2.0
3       4     a   0.0
4       5     c   0.0
You can use shift and transform in your groupby:
df['freq'] = df.groupby('block').second.transform(lambda x: x.shift(-1) - x).fillna(0)
>>> df
   second block  freq
0       1     a   3.0
1       2     b   0.0
2       3     c   2.0
3       4     a   0.0
4       5     c   0.0
Using apply with diff and shift:
df.groupby('block').second.apply(lambda x : x.diff().shift(-1)).fillna(0)
Out[242]:
0    3.0
1    0.0
2    2.0
3    0.0
4    0.0
Name: second, dtype: float64

How to find average of two tables in pandas?

I have one table with 1000s of rows that looks like this:
file1:
apples1 + hate 0 0 0 2 4 6 0 1
apples2 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 2 0 4 4 6 0 2
and another file, file2, with the same layout - note that some row labels (e.g. apples3) are missing from file1:
apples1 + hate 0 0 0 1 4 6 0 2
apples2 + hate 0 1 0 6 4 6 0 2
apples3 + hate 0 2 0 4 4 6 0 2
apples4 + hate 0 1 0 3 4 3 0 1
I want to compare the two files in pandas and average the values of matching rows. I do not want rows that appear in only one file to be in the output. So the resulting file would look like:
apples1 + hate 0 0 0 1.5 4 6 0 1.5
apples2 + hate 0 1.5 0 5 4 6 0 2
apples4 + hate 0 2 0 3.5 4 6 0 2
There are two steps in this solution.
Concatenate your dataframes by stacking them vertically (axis=0, the default) using pandas.concat(...), specifying join='inner' to keep only the columns that appear in all the dataframes.
Call the mean(...) function on the resulting dataframe.
Example:
In [1]: df1 = pd.DataFrame([[1,2,3], [4,5,6]], columns=['a','b','c'])
In [2]: df2 = pd.DataFrame([[1,2],[3,4]], columns=['a','c'])
In [3]: df1
Out[3]:
   a  b  c
0  1  2  3
1  4  5  6
In [4]: df2
Out[4]:
   a  c
0  1  2
1  3  4
In [5]: df3 = pd.concat([df1, df2], join='inner')
In [6]: df3
Out[6]:
   a  c
0  1  3
1  4  6
0  1  2
1  3  4
In [7]: df3.mean()
Out[7]:
a    2.25
c    3.75
dtype: float64
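If you then want one averaged row per key rather than a single column-wise mean, a possible follow-up sketch (assuming df1 and df2 are the two files read with header=None as in the next answer, so the first three columns hold the key) is:
df3 = pd.concat([df1.set_index([0, 1, 2]), df2.set_index([0, 1, 2])], join='inner')
# average each column per key; note keys present in only one file would keep their single value
result = df3.groupby(level=[0, 1, 2]).mean().reset_index()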
Let's try this:
df1 = pd.read_csv('file1', header=None)
df2 = pd.read_csv('file2', header=None)
Set the index to the first three columns, i.e. "apples1 + hate":
df1 = df1.set_index([0,1,2])
df2 = df2.set_index([0,1,2])
Let's use merge to inner-join the dataframes on their indexes, then group columns sharing the same base name and aggregate with mean:
df1.merge(df2, right_index=True, left_index=True)\
   .pipe(lambda x: x.groupby(x.columns.str.extract(r'(\w+)_[xy]', expand=False),
                             axis=1, sort=False).mean()).reset_index()
Output:
         0  1     2    3    4    5    6    7    8    9    10
0  apples1  +  hate  0.0  0.0  0.0  1.5  4.0  6.0  0.0  1.5
1  apples2  +  hate  0.0  1.5  0.0  5.0  4.0  6.0  0.0  2.0
2  apples4  +  hate  0.0  1.5  0.0  3.5  4.0  4.5  0.0  1.5

create new column using a shift within a groupby values

I want to create a new column which is the result of a shift function applied to grouped values.
df = pd.DataFrame({'X': [0,1,0,1,0,1,0,1], 'Y':[2,4,3,1,2,3,4,5]})
df
   X  Y
0  0  2
1  1  4
2  0  3
3  1  1
4  0  2
5  1  3
6  0  4
7  1  5
def func(x):
    x['Z'] = test['Y']-test['Y'].shift(1)
    return x
df_new = df.groupby('X').apply(func)
   X  Y    Z
0  0  2  NaN
1  1  4  2.0
2  0  3 -1.0
3  1  1 -2.0
4  0  2  1.0
5  1  3  1.0
6  0  4  1.0
7  1  5  1.0
As you can see from the output, the values are shifted sequentially without accounting for the groups.
I have seen a similar question, but I could not figure out why it does not work as expected.
Python Pandas: how to add a totally new column to a data frame inside of a groupby/transform operation
The values are shifted without accounting for the groups because your func uses test (presumably some other object, likely another name for what you call df) directly instead of simply the group x.
def func(x):
    x['Z'] = x['Y']-x['Y'].shift(1)
    return x
gives me
In [8]: df_new
Out[8]:
   X  Y    Z
0  0  2  NaN
1  1  4  NaN
2  0  3  1.0
3  1  1 -3.0
4  0  2 -1.0
5  1  3  2.0
6  0  4  2.0
7  1  5  2.0
but note that in this particular case you don't need to write a custom function, you can just call diff on the groupby object directly. (Of course other functions you might want to work with may be more complicated).
In [13]: df_new["Z2"] = df.groupby("X")["Y"].diff()
In [14]: df_new
Out[14]:
   X  Y    Z   Z2
0  0  2  NaN  NaN
1  1  4  NaN  NaN
2  0  3  1.0  1.0
3  1  1 -3.0 -3.0
4  0  2 -1.0 -1.0
5  1  3  2.0  2.0
6  0  4  2.0  2.0
7  1  5  2.0  2.0