Numpy vs Pandas axis

Why does axis differ between Numpy and Pandas?
Example:
If I want to get rid of a column in Pandas I could do this:
df.drop("column", axis=1, inplace=True)
Here, we are using axis=1 to drop a column (vertical in a DataFrame).
In Numpy, if I want to sum a matrix A vertically I would use:
A.sum(axis=0)
Here I use axis=0.

axis isn't used that often in pandas. A DataFrame has 2 dimensions, which are often treated quite differently. In drop, the axis definition is well documented and actually corresponds to the numpy usage.
Make a simple array and data frame:
In [180]: x = np.arange(9).reshape(3,3)
In [181]: df = pd.DataFrame(x)
In [182]: df
Out[182]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
Delete a row from the array, or a column:
In [183]: np.delete(x, 1, 0)
Out[183]:
array([[0, 1, 2],
       [6, 7, 8]])
In [184]: np.delete(x, 1, 1)
Out[184]:
array([[0, 2],
       [3, 5],
       [6, 8]])
Drop does the same thing for the same axis:
In [185]: df.drop(1, axis=0)
Out[185]:
   0  1  2
0  0  1  2
2  6  7  8
In [186]: df.drop(1, axis=1)
Out[186]:
   0  2
0  0  2
1  3  5
2  6  8
For sum, the definitions are the same as well:
In [188]: x.sum(axis=0)
Out[188]: array([ 9, 12, 15])
In [189]: df.sum(axis=0)
Out[189]:
0     9
1    12
2    15
dtype: int64
In [190]: x.sum(axis=1)
Out[190]: array([ 3, 12, 21])
In [191]: df.sum(axis=1)
Out[191]:
0     3
1    12
2    21
dtype: int64
The pandas sums are Series, which are the pandas equivalent of a 1d array.
Visualizing what axis does with reduction operations like sum is a bit tricky, especially with 2d arrays: is the axis kept or removed? It can help to think about axis for 1d arrays (the only axis is removed) or 3d arrays, where one axis is removed, leaving two.
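To make that concrete, a quick sketch: in a reduction, the axis you name is the one that disappears from the shape.
import numpy as np

v = np.arange(3)                     # shape (3,)
v.sum(axis=0)                        # 3 -- the only axis is removed, leaving a scalar

t = np.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4)
t.sum(axis=0).shape                  # (3, 4) -- axis 0 removed
t.sum(axis=1).shape                  # (2, 4) -- axis 1 removed
t.sum(axis=2).shape                  # (2, 3) -- axis 2 removed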

When you drop a column, its name is looked up along axis 1, the horizontal axis. When you sum along axis 0, you sum vertically, down the rows.

In pandas, how can I convert a column of a DataFrame into dtype object?
Or better yet, into a factor? (For those who speak R, in Python, how do I as.factor()?)
Also, what's the difference between pandas.Factor and pandas.Categorical?
You can use the astype method to cast a Series (one column):
df['col_name'] = df['col_name'].astype(object)
Or the entire DataFrame:
df = df.astype(object)
Update
Since version 0.15, you can use the category datatype in a Series/column:
df['col_name'] = df['col_name'].astype('category')
Note: pd.Factor was deprecated and has been removed in favor of pd.Categorical.
There's also the pd.factorize function:
# using the df defined in the example below
In [150]: pd.factorize(df.b)
Out[150]: (array([0, 1, 0, 1, 2]), array(['yes', 'no', 'absent'], dtype=object))
In [152]: df['c'] = pd.factorize(df.b)[0]
In [153]: df
Out[153]:
   a       b  c
0  1     yes  0
1  2      no  1
2  3     yes  0
3  4      no  1
4  5  absent  2
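For comparison, a sketch of the category-dtype route to the same integer labels, using the same df as above. Note that .cat.codes numbers the sorted categories, so the labels differ from factorize's first-seen order:
df['c'] = df['b'].astype('category').cat.codes
# categories sort alphabetically, so 'absent' -> 0, 'no' -> 1, 'yes' -> 2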
Factor and Categorical are the same, as far as I know. I think it was initially called Factor and then changed to Categorical. To convert to Categorical you can use pandas.Categorical.from_array, something like this:
In [27]: df = pd.DataFrame({'a' : [1, 2, 3, 4, 5], 'b' : ['yes', 'no', 'yes', 'no', 'absent']})
In [28]: df
Out[28]:
   a       b
0  1     yes
1  2      no
2  3     yes
3  4      no
4  5  absent
In [29]: df['c'] = pd.Categorical.from_array(df.b).labels
In [30]: df
Out[30]:
   a       b  c
0  1     yes  2
1  2      no  1
2  3     yes  2
3  4      no  1
4  5  absent  0
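Note that in current pandas versions, Categorical.from_array and the .labels attribute have been removed; assuming a recent install, the equivalent is to construct the Categorical directly and read .codes:
df['c'] = pd.Categorical(df.b).codes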

Pandas using iloc, apply and lambda with df column as part of condition

So I have this kind of code:
import pandas as pd
import numpy as np
myData = {'Price': [30000, 199, 30000, 199, 199],
          'Length': [7, 7, 7, 7, 6]
          }
df = pd.DataFrame(myData, columns=['Price', 'Length'])
print(df)
df.iloc[:, np.r_[0]] = df.iloc[:, np.r_[0]].apply(lambda x: [y if y >= 30000 else round(y / 2, 0) for y in x])
print(df)
What it does is take the value from column "Price"; if it's equal to or above 30000 it leaves the value unchanged, otherwise it divides it by 2 and rounds to a whole number.
This works great, but my problem is: how do I change this code to divide by the value in column "Length" instead?
I need to use iloc since I don't know the names of the columns (they may change, but their positions won't), and I would like to have it solved using apply and lambda.
Another question: how do I do the same thing but divide two columns (let's say "Price" and "Age") by the values in column "Length"?
Thanks for any help on this issue.
EDIT:
Based on the answer below from jezrael, I managed to solve my second question by using a loop:
import pandas as pd
import numpy as np
myData = {'Price': [30000, 199, 30000, 199, 199],
          'Age': [7, 14, 21, 28, 30000],
          'Length': [7, 7, 7, 7, 7]
          }
df = pd.DataFrame(myData, columns=['Price', 'Age', 'Length'])
for column in df.columns[np.r_[0, 1]]:
    df[column] = np.where(df[column] >= 30000, df[column], (df[column] / df.iloc[:, 2]).round())
    print(df[column])
print(df)
I wonder if it can be done without using a loop, though?
Use numpy.where with the condition; apply is not recommended here because it is slow:
df.iloc[:, 0] = np.where(df.iloc[:, 0] >= 30000,
                         df.iloc[:, 0],
                         (df.iloc[:, 0] / df.iloc[:, 1]).round())
print(df)
     Price  Length
0  30000.0       7
1     28.0       7
2  30000.0       7
3     28.0       7
4     33.0       6
EDIT:
For working with multiple columns, use DataFrame.iloc and divide the values using DataFrame.div with axis=0:
df.iloc[:, [0, 1]] = np.where(df.iloc[:, [0, 1]] >= 30000,
                              df.iloc[:, [0, 1]],
                              df.iloc[:, [0, 1]].div(df.iloc[:, 2], axis=0).round())
print (df)
     Price      Age  Length
0  30000.0      1.0       7
1     28.0      2.0       7
2  30000.0      3.0       7
3     28.0      4.0       7
4     28.0  30000.0       7
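If you prefer to stay inside pandas rather than call np.where, DataFrame.mask expresses the same logic; a sketch on the same df, keeping values where the condition is False and substituting the divided values elsewhere:
cols = df.iloc[:, [0, 1]]
df.iloc[:, [0, 1]] = cols.mask(cols < 30000,
                               cols.div(df.iloc[:, 2], axis=0).round())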
One way is to find all the indexes where the column is less than 30000 using .loc, and then apply the division to just those rows with .iloc:
mask = df.loc[df.iloc[:,0] < 30000].index
df.iloc[mask, 0] = (df.iloc[mask, 0] / df.iloc[mask, 1]).round()
#output
     Price  Length
0  30000.0       7
1     28.0       7
2  30000.0       7
3     28.0       7
4     33.0       6

Access elements of pandas series

I have a dataframe and I want to extract the frequency of 0/1 in a particular column.
df=pd.DataFrame({'A':[0,0,1,0,1]})
df
Out[6]:
   A
0  0
1  0
2  1
3  0
4  1
Calculating the frequency of occurrence of 0s/1s:
df['A'].value_counts()
Out[8]:
0    3
1    2
Name: A, dtype: int64
type(df['A'].value_counts())
Out[9]: pandas.core.series.Series
How can I extract the frequency of 0s and 1s into, let's suppose, two variables, zeros and ones, as:
zeros=3, ones=2
I think it would be a bit more flexible to return a dictionary:
In [234]: df['A'].value_counts().to_dict()
Out[234]: {0: 3, 1: 2}
or
In [236]: d = df['A'].astype(str).replace(['0','1'], ['zeros','ones']).value_counts().to_dict()
In [237]: d
Out[237]: {'ones': 2, 'zeros': 3}
In [238]: d['ones']
Out[238]: 2
In [239]: d['zeros']
Out[239]: 3
you can also access it directly:
In [3]: df['A'].value_counts().loc[0]
Out[3]: 3
In [4]: df['A'].value_counts().loc[1]
Out[4]: 2
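If one of the values might be absent from the column entirely, .loc raises a KeyError; a small sketch of the safer lookup, using Series.get with a default:
counts = df['A'].value_counts()
zeros = counts.get(0, 0)   # 3; falls back to 0 instead of raising
ones = counts.get(1, 0)    # 2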
Another way to solve this is to use the collections library and its Counter class:
import collections
c = collections.Counter(df['A'])
c
Out[31]: Counter({0: 3, 1: 2})
count_0s = c[0]  # returns 3
count_1s = c[1]  # returns 2

Check whether a column in a dataframe is an integer or not, and perform operation

Check whether a column in a dataframe is an integer or not, and if it is an integer, it must be multiplied by 10
import numpy as np
import pandas as pd
df = pd.DataFrame(....)
# function to check and multiply if a column is integer
def xtimes(x):
    for col in x:
        if type(x[col]) == np.int64:
            return x[col]*10
        else:
            return x[col]
#using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to get numeric columns and then multiply.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
You could pass your own specific list of dtypes via include=[... np.int64, ..., etc].
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status; see the numpy documentation for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
   0  1  2    3  4  5
0  1  2  3  4.0  5  6
1  1  2  3  4.0  5  6
The dtypes are
df.dtypes
0      int32
1      int16
2      int64
3    float64
4     object
5     object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0     True
1     True
2     True
3    False
4    False
5    False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
    0   1   2    3  4  5
0  10  20  30  4.0  5  6
1  10  20  30  4.0  5  6
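On newer pandas/numpy versions the dtype <= np.integer comparison reads obscurely; a more explicit sketch of the same selection uses pandas.api.types:
from pandas.api.types import is_integer_dtype

int_cols = [c for c in df.columns if is_integer_dtype(df[c])]
df[int_cols] = df[int_cols] * 10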

Aggregating a time series in Pandas given a window size

Let's say I have this data:
a = pandas.Series([1,2,3,4,5,6,7,8])
a
Out[313]:
0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
dtype: int64
I would like to aggregate the data, grouping n rows at a time and summing them. So if n=2 the new series would look like {3, 7, 11, 15}.
try this:
In [39]: a.groupby(a.index//2).sum()
Out[39]:
0     3
1     7
2    11
3    15
dtype: int64
In [41]: a.index//2
Out[41]: Int64Index([0, 0, 1, 1, 2, 2, 3, 3], dtype='int64')
n=3
In [42]: n=3
In [43]: a.groupby(a.index//n).sum()
Out[43]:
0     6
1    15
2    15
dtype: int64
In [44]: a.index//n
Out[44]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2], dtype='int64')
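Note that a.index // n relies on the default RangeIndex starting at 0; for an arbitrary index, a position-based key (a sketch) gives the same grouping:
import numpy as np
a.groupby(np.arange(len(a)) // n).sum()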
you can use pandas' rolling window sum and get it like the following, if n is your interval:
sums = list(a.rolling(n).sum()[n-1::n])
# Optional !!!
rem = len(a) % n
if rem != 0:
    sums.append(a[-rem:].sum())
The first line adds up the rows cleanly when the data divides evenly into groups; otherwise, we can also append the sum of the remaining rows (depends on your preference).
For example, in the case above with n=3, you may want either {6, 15, 15} or just {6, 15}. The code above handles the former; skipping the optional part gives just {6, 15}.
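As a quick check, the rolling route reproduces the n=2 case from the question; note that rolling sums come back as floats:
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
n = 2
sums = list(a.rolling(n).sum()[n - 1::n])
print(sums)   # [3.0, 7.0, 11.0, 15.0]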