Check whether a column in a dataframe is an integer or not, and perform operation - pandas

Check whether a column in a dataframe is an integer or not, and if it is an integer, it must be multiplied by 10
import numpy as np
import pandas as pd
df = pd.dataframe(....)
#function to check and multiply if a column is integer
def xtimes(x):
for col in x:
if type(x[col]) == np.int64:
return x[col]*10
else:
return x[col]
#using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')

You could use select_dtypes to get numeric columns and then multiply.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
You could have your specific check list for include=[... np.int64, ..., etc]

You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See this document for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
[1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
0 1 2 3 4 5
0 1 2 3 4.0 5 6
1 1 2 3 4.0 5 6
The dtypes are
df.dtypes
0 int32
1 int16
2 int64
3 float64
4 object
5 object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
0 1 2 3 4 5
0 10 20 30 4.0 5 6
1 10 20 30 4.0 5 6

Related

how to add costum ID in pandas dataframe [duplicate]

In pandas, how can I convert a column of a DataFrame into dtype object?
Or better yet, into a factor? (For those who speak R, in Python, how do I as.factor()?)
Also, what's the difference between pandas.Factor and pandas.Categorical?
You can use the astype method to cast a Series (one column):
df['col_name'] = df['col_name'].astype(object)
Or the entire DataFrame:
df = df.astype(object)
Update
Since version 0.15, you can use the category datatype in a Series/column:
df['col_name'] = df['col_name'].astype('category')
Note: pd.Factor was been deprecated and has been removed in favor of pd.Categorical.
There's also pd.factorize function to use:
# use the df data from #herrfz
In [150]: pd.factorize(df.b)
Out[150]: (array([0, 1, 0, 1, 2]), array(['yes', 'no', 'absent'], dtype=object))
In [152]: df['c'] = pd.factorize(df.b)[0]
In [153]: df
Out[153]:
a b c
0 1 yes 0
1 2 no 1
2 3 yes 0
3 4 no 1
4 5 absent 2
Factor and Categorical are the same, as far as I know. I think it was initially called Factor, and then changed to Categorical. To convert to Categorical maybe you can use pandas.Categorical.from_array, something like this:
In [27]: df = pd.DataFrame({'a' : [1, 2, 3, 4, 5], 'b' : ['yes', 'no', 'yes', 'no', 'absent']})
In [28]: df
Out[28]:
a b
0 1 yes
1 2 no
2 3 yes
3 4 no
4 5 absent
In [29]: df['c'] = pd.Categorical.from_array(df.b).labels
In [30]: df
Out[30]:
a b c
0 1 yes 2
1 2 no 1
2 3 yes 2
3 4 no 1
4 5 absent 0

Why does .loc not always match column names?

I noticed this today and wanted to ask because I am a little confused about this.
Lets say we have two df's
df = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('ABC'))
A B C
0 3 1 6
1 2 4 0
2 8 8 0
3 8 6 7
4 4 5 0
df2 = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('CBA'))
C B A
0 3 5 5
1 7 4 6
2 0 7 7
3 6 6 5
4 4 0 6
If we wanted to conditionally assign new values in the first df with values, we could do this:
df.loc[df['A'].gt(3)] = df2
I would expect the columns to be aligned, and if there were missing columns, for the values in the first df to be populated with nan. However when the above code is run, it replaces the data and does not take into account the column names. (it does take the index names into account however)
A B C
0 3 1 6
1 2 4 0
2 0 7 7
3 6 6 5
4 4 0 6
on index 2 instead of [7,7,0] we have [0,7,7].
However, if we pass the names of the columns into the loc statement, without changing the order of the columns in df2, it aligns with the columns.
df.loc[df['A'].gt(3),['A','B','C']] = df2
A B C
0 3 1 6
1 2 4 0
2 7 7 0
3 5 6 6
4 6 0 4
Why does this happen?
Interestingly, loc performs a number of optimizations to improve performance, one of those optimizations is checking the type of the index passed in.
Both Row and Column Indexes Included
When passing both a row index and a column index the __setitem__ function:
def __setitem__(self, key, value):
if isinstance(key, tuple):
key = tuple(com.apply_if_callable(x, self.obj) for x in key)
else:
key = com.apply_if_callable(key, self.obj)
indexer = self._get_setitem_indexer(key)
self._has_valid_setitem_indexer(key)
iloc = self if self.name == "iloc" else self.obj.iloc
iloc._setitem_with_indexer(indexer, value, self.name)
Interprets the key as a tuple.
key:
(0 False
1 False
2 True
3 True
4 True
Name: A, dtype: bool,
['A', 'B', 'C'])
This is then passed to _get_setitem_indexer to convert to a positional indexer from label-based:
indexer = self._get_setitem_indexer(key)
def _get_setitem_indexer(self, key):
"""
Convert a potentially-label-based key into a positional indexer.
"""
if self.name == "loc":
self._ensure_listlike_indexer(key)
if self.axis is not None:
return self._convert_tuple(key, is_setter=True)
ax = self.obj._get_axis(0)
if isinstance(ax, ABCMultiIndex) and self.name != "iloc":
with suppress(TypeError, KeyError, InvalidIndexError):
# TypeError e.g. passed a bool
return ax.get_loc(key)
if isinstance(key, tuple):
with suppress(IndexingError):
return self._convert_tuple(key, is_setter=True)
if isinstance(key, range):
return list(key)
try:
return self._convert_to_indexer(key, axis=0, is_setter=True)
except TypeError as e:
# invalid indexer type vs 'other' indexing errors
if "cannot do" in str(e):
raise
elif "unhashable type" in str(e):
raise
raise IndexingError(key) from e
This generates a tuple indexer (both rows and columns are converted):
if isinstance(key, tuple):
with suppress(IndexingError):
return self._convert_tuple(key, is_setter=True)
returns
(array([2, 3, 4], dtype=int64), array([0, 1, 2], dtype=int64))
Only Row Index Included
However, when only a row index is passed to loc the indexer is not a tuple and, as such, only a single dimension is converted from label to positional:
if isinstance(key, range):
return list(key)
returns
[2 3 4]
For this reason, no alignment happens among columns when only a single value is passed to loc, as no parsing is done to align the columns.
That is why an empty slice is often used:
df.loc[df['A'].gt(3), :] = df2
As this is sufficient to align the columns appropriately.
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('CBA'))
print(df)
print(df2)
df.loc[df['A'].gt(3), :] = df2
print(df)
Example:
df:
A B C
0 3 6 6
1 0 8 4
2 7 0 0
3 7 1 5
4 7 0 1
df2:
C B A
0 4 6 2
1 1 2 7
2 0 5 0
3 0 4 4
4 3 2 4
df.loc[df['A'].gt(3), :] = df2:
A B C
0 3 6 6
1 0 8 4
2 0 5 0
3 4 4 0 # Aligned as expected
4 4 2 3

Pandas .corr() returning "__"

It was working great until it wasn't, and no idea what I'm doing wrong. I've reduced it to a very simple datsaset t:
1 2 3 4 5 6 7 8
0 3 16 3 2 17 2 3 2
1 3 16 3 2 19 4 3 2
2 3 16 3 2 9 2 3 2
3 3 16 3 2 19 1 3 2
4 3 16 3 2 17 2 3 1
5 3 16 3 2 17 1 17 1
6 3 16 3 2 19 1 17 2
7 3 16 3 2 19 4 3 1
8 3 16 3 2 19 1 3 2
9 3 16 3 2 7 2 17 1
corr = t.corr()
corr
returns "__"
and
sns.heatmap(corr)
throws the following error "zero-size array to reduction operation minimum which has no identity"
I have no idea what's wrong? I've tried it with more rows etc, and double checked that I don't have nay missing values...what's going on? I had such a pretty heatmap earlier, I've been trying to
As mentioned above, change type to float. Simply,
corr = t.astype('float64').corr()
The problem here is not the dataframe itself but the origin of it. I found same problem by using drop or iloc in a dataframe. The key is the global type the dataframe has.
Let's say we have the following dataframe:
list_ex = [[1.1,2.1,3.1,4,5,6,7,8],[1.2,2.2,3.3,4.1,5.5,6,7,8],
[1.3,2.3,3,4,5,6.2,7,8],[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new=pd.DataFrame(list_ex)
you can calculate the list_ex_new.corr() with no problem. If you check the arguments of the dataframe by vars(list_ex_new), you'll obtain:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=0, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 4, dtype: float64, '_item_cache': {}}
where dtype is float64.
A new dataframe can be defined by list_new_new = list_ex_new.iloc[1:,:] and the correlations can be evaluated, successfully. A check of the dataframe's attributes shows:
{'_is_copy': ,
'_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=1, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 3, dtype: float64,
'_item_cache': {}}
where dtype is still float64.
A third dataframe can be defined:
list_ex_w = [['a','a','a','a','a','a','a','a'],[1.1,2.1,3.1,4,5,6,7,8],
[1.2,2.2,3.3,4.1,5.5,6,7,8],[1.3,2.3,3,4,5,6.2,7,8],
[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new_w=pd.DataFrame(list_ex_w)
A evaluation of dataframe's correlation will result in a empty dataframe, since list_ex_w attributes look like:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: Index(['a', 1, 2, 3, 4], dtype='object')
ObjectBlock: slice(0, 8, 1), 8 x 5, dtype: object, '_item_cache': {}}
Where now dtype is 'object', since the dataframe is not consistent in its types. there are strings and floats together. Finally, a fourth dataframe can be generated:
list_new_new_w = list_ex_new_w.iloc[1:,:]
this will generate as a result same notebook but with no 'a's, apparently a perfectly correct dataframe to calculate the correlations. However this will return again an empty dataframe. A final check of the dataframe attributes shows:
vars(list_new_new_w)
{'_is_copy': None, '_data': BlockManager
Items: Index([1, 2, 3, 4], dtype='object')
Axis 1: RangeIndex(start=0, stop=8, step=1)
ObjectBlock: slice(0, 4, 1), 4 x 8, dtype: object, '_item_cache': {}}
where dtype is still object, thus the method corr returns an empty dataframe.
This problem can be solved by using astype(float)
list_new_new_w.astype(float).corr()
In summary, it seems pandas at the time corr or cov among others methods are called generate a new dataframe with same attibutes ignoring the case the new dataframe has a consistent global type. I've been checking out the pandas source code and I understand this is the correct interpretation of pandas' implementation.

Access elements of pandas series

I have a dataframe and I want to extract the frequency of 0/1 in a particular column.
df=pd.DataFrame({'A':[0,0,1,0,1]})
df
Out[6]:
A
0 0
1 0
2 1
3 0
4 1
Calculating frequency of occurance of 0/1s -
df['A'].value_counts()
Out[8]:
0 3
1 2
Name: A, dtype: int64
type(df['A'].value_counts())
Out[9]: pandas.core.series.Series
How can I extract the frequency of 0s and 1s, in lets suppose two variables, ones and zeros as -
zeros=3, ones=2
I think it would be bit more flexible to return a dictionary:
In [234]: df['A'].value_counts().to_dict()
Out[234]: {0: 3, 1: 2}
or
In [236]: d = df['A'].astype(str).replace(['0','1'], ['zeros','ones']).value_counts().to_dict()
In [237]: d
Out[237]: {'ones': 2, 'zeros': 3}
In [238]: d['ones']
Out[238]: 2
In [239]: d['zeros']
Out[239]: 3
you can also access it directly:
In [3]: df['A'].value_counts().loc[0]
Out[3]: 3
In [4]: df['A'].value_counts().loc[1]
Out[4]: 2
Another way to solve this issue is to use collections library and the function counter() in it.
import collections
c = collections.Counter(df['A'])
c
Out[31]: Counter({0: 3, 1: 2})
count_0s=c.Counter(df['A'])[0]#Returns 3
count_1s=c.Counter(df['A'])[1]#Returns 2

In PANDAS, how to get the index of a known value?

If we have a known value in a column, how can we get its index-value? For example:
In [148]: a = pd.DataFrame(np.arange(10).reshape(5,2),columns=['c1','c2'])
In [149]: a
Out[149]:
c1 c2
0 0 1
1 2 3
2 4 5
........
As we know, we can get a value by the index corresponding to it, like this.
In [151]: a.ix[0,1] In [152]: a.c2[0] In [154]: a.c2.ix[0] <-- use index
Out[151]: 1 Out[152]: 1 Out[154]: 1 <-- get value
But how to get the index by value?
There might be more than one index map to your value, it make more sense to return a list:
In [48]: a
Out[48]:
c1 c2
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
In [49]: a.c1[a.c1 == 8].index.tolist()
Out[49]: [4]
Using the .loc[] accessor:
In [25]: a.loc[a['c1'] == 8].index[0]
Out[25]: 4
Can also use the get_loc() by setting 'c1' as the index. This will not change the original dataframe.
In [17]: a.set_index('c1').index.get_loc(8)
Out[17]: 4
The other way around using numpy.where() :
import numpy as np
import pandas as pd
In [800]: df = pd.DataFrame(np.arange(10).reshape(5,2),columns=['c1','c2'])
In [801]: df
Out[801]:
c1 c2
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
In [802]: np.where(df["c1"]==6)
Out[802]: (array([3]),)
In [803]: indices = list(np.where(df["c1"]==6)[0])
In [804]: df.iloc[indices]
Out[804]:
c1 c2
3 6 7
In [805]: df.iloc[indices].index
Out[805]: Int64Index([3], dtype='int64')
In [806]: df.iloc[indices].index.tolist()
Out[806]: [3]
To get the index by value, simply add .index[0] to the end of a query. This will return the index of the first row of the result...
So, applied to your dataframe:
In [1]: a[a['c2'] == 1].index[0] In [2]: a[a['c1'] > 7].index[0]
Out[1]: 0 Out[2]: 4
Where the query returns more than one row, the additional index results can be accessed by specifying the desired index, e.g. .index[n]
In [3]: a[a['c2'] >= 7].index[1] In [4]: a[(a['c2'] > 1) & (a['c1'] < 8)].index[2]
Out[3]: 4 Out[4]: 3
I think this may help you , both index and columns of the values.
value you are looking for is not duplicated:
poz=matrix[matrix==minv].dropna(axis=1,how='all').dropna(how='all')
value=poz.iloc[0,0]
index=poz.index.item()
column=poz.columns.item()
you can get its index and column
duplicated:
matrix=pd.DataFrame([[1,1],[1,np.NAN]],index=['q','g'],columns=['f','h'])
matrix
Out[83]:
f h
q 1 1.0
g 1 NaN
poz=matrix[matrix==minv].dropna(axis=1,how='all').dropna(how='all')
index=poz.stack().index.tolist()
index
Out[87]: [('q', 'f'), ('q', 'h'), ('g', 'f')]
you will get a list