Pandas using iloc, apply and lambda with df column as part of condition

So I have this kind of code:
import pandas as pd
import numpy as np
myData = {'Price': [30000, 199, 30000, 199, 199],
          'Length': [7, 7, 7, 7, 6]}
df = pd.DataFrame(myData, columns=['Price', 'Length'])
print(df)
df.iloc[:, np.r_[0]] = df.iloc[:, np.r_[0]].apply(lambda x: [y if y >= 30000 else round(y / 2, 0) for y in x])
print(df)
What it does is: it takes the value from column "Price" and, if it's equal to or above 30,000, it leaves the value unchanged; otherwise it divides it by 2 and rounds to a whole number.
This works great, but the problem I have is how to change this code to divide by the value in column "Length" instead.
I need to use iloc since I don't know the names of the columns (they may change, but their positions won't), and I would like to have it solved using apply and lambda.
My other question is how to do the same thing for two columns at once (let's say divide "Price" and "Age" by the values in column "Length").
Thanks for any help on this issue.
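(For reference, the apply/lambda form the question asks for can also be written row-wise; a sketch, assuming "Price" sits at position 0 and "Length" at position 1. The answer below shows a faster vectorized alternative.)
df.iloc[:, 0] = df.apply(
    lambda row: row.iloc[0] if row.iloc[0] >= 30000
    else round(row.iloc[0] / row.iloc[1], 0),
    axis=1)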
EDIT:
Based on the answer below from jezrael, I managed to solve my second question by using a loop:
import pandas as pd
import numpy as np
myData = {'Price': [30000, 199, 30000, 199, 199],
          'Age': [7, 14, 21, 28, 30000],
          'Length': [7, 7, 7, 7, 7]}
df = pd.DataFrame(myData, columns=['Price', 'Age', 'Length'])
for column in df.columns[np.r_[0, 1]]:
    df[column] = np.where(df[column] >= 30000, df[column], (df[column] / df.iloc[:, 2]).round())
    print(df[column])
print(df)
I wonder if it can be done without using loops, though?

Use numpy.where with the condition; apply is not recommended here because it is slow:
df.iloc[:, 0] = np.where(df.iloc[:, 0] >= 30000,
                         df.iloc[:, 0],
                         (df.iloc[:, 0] / df.iloc[:, 1]).round())
print(df)
     Price  Length
0  30000.0       7
1     28.0       7
2  30000.0       7
3     28.0       7
4     33.0       6
EDIT:
For working with multiple columns use DataFrame.iloc and divide values by DataFrame.div with axis=0:
df.iloc[:, [0, 1]] = np.where(df.iloc[:, [0, 1]] >= 30000,
                              df.iloc[:, [0, 1]],
                              df.iloc[:, [0, 1]].div(df.iloc[:, 2], axis=0).round())
print(df)
     Price      Age  Length
0  30000.0      1.0       7
1     28.0      2.0       7
2  30000.0      3.0       7
3     28.0      4.0       7
4     28.0  30000.0       7
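An all-pandas variant of the same idea is possible with DataFrame.mask (a sketch; column positions assumed as in the question):
cols = df.columns[[0, 1]]
df[cols] = df[cols].mask(df[cols] < 30000,
                         df[cols].div(df.iloc[:, 2], axis=0).round())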

One way is to find all the indexes where the column is less than 30000 using .loc and .iloc. With this filter, apply the division to the desired data:
mask = df.loc[df.iloc[:,0] < 30000].index
df.iloc[mask, 0] = (df.iloc[mask, 0] / df.iloc[mask, 1]).round()
# Output:
     Price  Length
0  30000.0       7
1     28.0       7
2  30000.0       7
3     28.0       7
4     33.0       6
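Note that df.iloc[mask, 0] only works here because the default RangeIndex makes the labels returned by .loc coincide with positions. A sketch that avoids that assumption, using a boolean mask with .loc and positionally resolved column names:
below = df.iloc[:, 0] < 30000                 # boolean mask, aligned on the index
price, length = df.columns[0], df.columns[1]  # resolve names by position
df.loc[below, price] = (df.loc[below, price] / df.loc[below, length]).round()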

Convert a column of a pandas DataFrame into dtype object (or a factor)

In pandas, how can I convert a column of a DataFrame into dtype object?
Or better yet, into a factor? (For those who speak R, in Python, how do I as.factor()?)
Also, what's the difference between pandas.Factor and pandas.Categorical?
You can use the astype method to cast a Series (one column):
df['col_name'] = df['col_name'].astype(object)
Or the entire DataFrame:
df = df.astype(object)
Update
Since version 0.15, you can use the category datatype in a Series/column:
df['col_name'] = df['col_name'].astype('category')
Note: pd.Factor was deprecated and has been removed in favor of pd.Categorical.
There's also the pd.factorize function:
# use the df data from @herrfz's answer
In [150]: pd.factorize(df.b)
Out[150]: (array([0, 1, 0, 1, 2]), array(['yes', 'no', 'absent'], dtype=object))
In [152]: df['c'] = pd.factorize(df.b)[0]
In [153]: df
Out[153]:
a b c
0 1 yes 0
1 2 no 1
2 3 yes 0
3 4 no 1
4 5 absent 2
Factor and Categorical are the same, as far as I know. I think it was initially called Factor and then changed to Categorical. To convert to Categorical, you can use pandas.Categorical.from_array, something like this:
In [27]: df = pd.DataFrame({'a' : [1, 2, 3, 4, 5], 'b' : ['yes', 'no', 'yes', 'no', 'absent']})
In [28]: df
Out[28]:
a b
0 1 yes
1 2 no
2 3 yes
3 4 no
4 5 absent
In [29]: df['c'] = pd.Categorical.from_array(df.b).labels
In [30]: df
Out[30]:
a b c
0 1 yes 2
1 2 no 1
2 3 yes 2
3 4 no 1
4 5 absent 0
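Categorical.from_array and the .labels attribute were later deprecated and removed; in current pandas the same integer codes are exposed as .codes. A minimal modern sketch:
df['c'] = pd.Categorical(df['b']).codes
# or, going through the category dtype:
df['c'] = df['b'].astype('category').cat.codes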

Pandas: slice by named index using loc, but not include first index

I have a DataFrame with a named index and need to select all rows after a particular index label, not including the label itself.
For example:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
I need to select the rows below cobra. In pseudocode: df.loc['cobra' + 1:]
There are several ways to go about this:
>>> df.iloc[df.index.tolist().index('cobra')+1:]
max_speed shield
viper 4 5
sidewinder 7 8
>>> df.drop('cobra', axis=0)
max_speed shield
viper 4 5
sidewinder 7 8
>>> df[df.index != 'cobra']
max_speed shield
viper 4 5
sidewinder 7 8
An additional method that @Quang Hoang proposed:
>>> df.iloc[df.index.get_indexer(['cobra'])[0]+1:]
max_speed shield
viper 4 5
sidewinder 7 8
Selecting without including cobra:
df.iloc[df.index.get_indexer(['cobra'])[0]+1:, :]
Try:
df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])
print(df.loc[df.index > 'cobra'])
Output:
max_speed shield
viper 4 5
sidewinder 7 8
Note that this relies on lexicographic comparison of the labels ('cobra' < 'sidewinder' < 'viper'), so it only works when the labels happen to sort in the desired order.
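A position-based sketch that avoids the ordering assumption, using Index.get_loc (assumes the label is unique):
pos = df.index.get_loc('cobra')  # integer position of the label
print(df.iloc[pos + 1:])         # everything after it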

Numpy vs Pandas axis

Why does axis differ between NumPy and pandas?
Example:
If I want to get rid of a column in pandas I can do this:
df.drop("column", axis = 1, inplace = True)
Here, we are using axis = 1 to drop a column (vertically in a DF).
In Numpy, if I want to sum a matrix A vertically I would use:
A.sum(axis = 0)
Here I use axis = 0.
axis isn't used that often in pandas. A DataFrame has 2 dimensions, which are often treated quite differently. In drop, the axis definition is well documented and actually corresponds to the numpy usage.
Make a simple array and data frame:
In [180]: x = np.arange(9).reshape(3,3)
In [181]: df = pd.DataFrame(x)
In [182]: df
Out[182]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
Delete a row from the array, or a column:
In [183]: np.delete(x, 1, 0)
Out[183]:
array([[0, 1, 2],
[6, 7, 8]])
In [184]: np.delete(x, 1, 1)
Out[184]:
array([[0, 2],
[3, 5],
[6, 8]])
Drop does the same thing for the same axis:
In [185]: df.drop(1, axis=0)
Out[185]:
0 1 2
0 0 1 2
2 6 7 8
In [186]: df.drop(1, axis=1)
Out[186]:
0 2
0 0 2
1 3 5
2 6 8
In sum, the definitions are the same as well:
In [188]: x.sum(axis=0)
Out[188]: array([ 9, 12, 15])
In [189]: df.sum(axis=0)
Out[189]:
0 9
1 12
2 15
dtype: int64
In [190]: x.sum(axis=1)
Out[190]: array([ 3, 12, 21])
In [191]: df.sum(axis=1)
Out[191]:
0 3
1 12
2 21
dtype: int64
The pandas sums are Series, which are the pandas equivalent of a 1d array.
Visualizing what axis does with reduction operations like sum is a bit tricky - especially with 2d arrays. Is the axis kept or removed? It can help to think about axis for 1d arrays (the only axis is removed), or 3d arrays, where one axis is removed leaving two.
When you get rid of a column, its label is picked from axis 1, the horizontal axis. When you sum along axis 0, you sum vertically.
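A quick sketch of that mental model with a 3d array, where the summed axis simply disappears from the shape:
import numpy as np

y = np.arange(24).reshape(2, 3, 4)
print(y.sum(axis=0).shape)  # (3, 4) -- axis 0 removed
print(y.sum(axis=1).shape)  # (2, 4)
print(y.sum(axis=2).shape)  # (2, 3)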

Convert all entries in a pandas (list) to just first entry of list [duplicate]

I have a pandas DataFrame with a column containing list objects:
        A
0  [1, 2]
1  [3, 4]
2  [8, 9]
3  [2, 6]
How can I access the first element of each list and save it into a new column of the DataFrame? To get a result like this:
        A  new_col
0  [1, 2]        1
1  [3, 4]        3
2  [8, 9]        8
3  [2, 6]        2
I know this could be done by iterating over each row, but is there a more "pythonic" way?
As always, remember that storing non-scalar objects in frames is generally disfavoured, and should really only be used as a temporary intermediate step.
That said, you can use the .str accessor even though it's not a column of strings:
>>> df = pd.DataFrame({"A": [[1,2],[3,4],[8,9],[2,6]]})
>>> df["new_col"] = df["A"].str[0]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
>>> df["new_col"]
0 1
1 3
2 8
3 2
Name: new_col, dtype: int64
You can use map and a lambda function:
df.loc[:, 'new_col'] = df.A.map(lambda x: x[0])
Use apply with x[0]:
df['new_col'] = df.A.apply(lambda x: x[0])
print(df)
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
You can use the method str.get:
df['A'].str.get(0)
You can just use a conditional list comprehension which takes the first value of any iterable or else uses None for that item. List comprehensions are very Pythonic.
df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
Timings
df = pd.concat([df] * 10000)
%timeit df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
100 loops, best of 3: 13.2 ms per loop
%timeit df["new_col"] = df["A"].str[0]
100 loops, best of 3: 15.3 ms per loop
%timeit df['new_col'] = df.A.apply(lambda x: x[0])
100 loops, best of 3: 12.1 ms per loop
%timeit df.A.map(lambda x: x[0])
100 loops, best of 3: 11.1 ms per loop
Removing the safety check ensuring an iterable:
%timeit df['new_col'] = [val[0] for val in df["A"]]
100 loops, best of 3: 7.38 ms per loop
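The plain list comprehension likely comes out ahead because the column holds Python list objects either way, so each of these methods ends up looping at the Python level; the comprehension simply carries the least per-element overhead.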

Aggregating a time series in Pandas given a window size

Let's say I have this data:
a = pandas.Series([1,2,3,4,5,6,7,8])
a
Out[313]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
dtype: int64
I would like to aggregate the data by grouping n rows at a time and summing them up. So if n=2, the new series would look like {3, 7, 11, 15}.
Try this:
In [39]: a.groupby(a.index//2).sum()
Out[39]:
0 3
1 7
2 11
3 15
dtype: int64
In [41]: a.index//2
Out[41]: Int64Index([0, 0, 1, 1, 2, 2, 3, 3], dtype='int64')
n=3
In [42]: n=3
In [43]: a.groupby(a.index//n).sum()
Out[43]:
0 6
1 15
2 15
dtype: int64
In [44]: a.index//n
Out[44]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2], dtype='int64')
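Note that a.index // n assumes the default RangeIndex; a sketch that builds the grouper from positions instead, so it works with any index:
import numpy as np

n = 2
out = a.groupby(np.arange(len(a)) // n).sum()
print(out.tolist())  # [3, 7, 11, 15]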
You can use pandas' rolling sum and get it like the following:
if n is your interval:
sums = list(a.rolling(n).sum()[n-1::n])
# Optional: also include the sum of the leftover rows
rem = len(a) % n
if rem != 0:
    sums.append(a[-rem:].sum())
The first line is enough if the data divides evenly into groups of n; otherwise the optional part appends the sum of the remaining rows as well (depending on your preference).
For example, in the above case with n=3, you may want either {6, 15, 15} or just {6, 15}. The code above gives the former; skipping the optional part gives just {6, 15}.