Pandas .corr() returning "__" - pandas

It was working great until it wasn't, and no idea what I'm doing wrong. I've reduced it to a very simple datsaset t:
1 2 3 4 5 6 7 8
0 3 16 3 2 17 2 3 2
1 3 16 3 2 19 4 3 2
2 3 16 3 2 9 2 3 2
3 3 16 3 2 19 1 3 2
4 3 16 3 2 17 2 3 1
5 3 16 3 2 17 1 17 1
6 3 16 3 2 19 1 17 2
7 3 16 3 2 19 4 3 1
8 3 16 3 2 19 1 3 2
9 3 16 3 2 7 2 17 1
corr = t.corr()
corr
returns "__"
and
sns.heatmap(corr)
throws the following error "zero-size array to reduction operation minimum which has no identity"
I have no idea what's wrong? I've tried it with more rows etc, and double checked that I don't have nay missing values...what's going on? I had such a pretty heatmap earlier, I've been trying to

As mentioned above, change type to float. Simply,
corr = t.astype('float64').corr()

The problem here is not the dataframe itself but the origin of it. I found same problem by using drop or iloc in a dataframe. The key is the global type the dataframe has.
Let's say we have the following dataframe:
list_ex = [[1.1,2.1,3.1,4,5,6,7,8],[1.2,2.2,3.3,4.1,5.5,6,7,8],
[1.3,2.3,3,4,5,6.2,7,8],[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new=pd.DataFrame(list_ex)
you can calculate the list_ex_new.corr() with no problem. If you check the arguments of the dataframe by vars(list_ex_new), you'll obtain:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=0, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 4, dtype: float64, '_item_cache': {}}
where dtype is float64.
A new dataframe can be defined by list_new_new = list_ex_new.iloc[1:,:] and the correlations can be evaluated, successfully. A check of the dataframe's attributes shows:
{'_is_copy': ,
'_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=1, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 3, dtype: float64,
'_item_cache': {}}
where dtype is still float64.
A third dataframe can be defined:
list_ex_w = [['a','a','a','a','a','a','a','a'],[1.1,2.1,3.1,4,5,6,7,8],
[1.2,2.2,3.3,4.1,5.5,6,7,8],[1.3,2.3,3,4,5,6.2,7,8],
[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new_w=pd.DataFrame(list_ex_w)
A evaluation of dataframe's correlation will result in a empty dataframe, since list_ex_w attributes look like:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: Index(['a', 1, 2, 3, 4], dtype='object')
ObjectBlock: slice(0, 8, 1), 8 x 5, dtype: object, '_item_cache': {}}
Where now dtype is 'object', since the dataframe is not consistent in its types. there are strings and floats together. Finally, a fourth dataframe can be generated:
list_new_new_w = list_ex_new_w.iloc[1:,:]
this will generate as a result same notebook but with no 'a's, apparently a perfectly correct dataframe to calculate the correlations. However this will return again an empty dataframe. A final check of the dataframe attributes shows:
vars(list_new_new_w)
{'_is_copy': None, '_data': BlockManager
Items: Index([1, 2, 3, 4], dtype='object')
Axis 1: RangeIndex(start=0, stop=8, step=1)
ObjectBlock: slice(0, 4, 1), 4 x 8, dtype: object, '_item_cache': {}}
where dtype is still object, thus the method corr returns an empty dataframe.
This problem can be solved by using astype(float)
list_new_new_w.astype(float).corr()
In summary, it seems pandas at the time corr or cov among others methods are called generate a new dataframe with same attibutes ignoring the case the new dataframe has a consistent global type. I've been checking out the pandas source code and I understand this is the correct interpretation of pandas' implementation.

Related

Find common values within groupby in pandas Dataframe based on two columns

I have following dataframe:
period symptoms recovery
1 4 2
1 5 2
1 6 2
2 3 1
2 5 2
2 8 4
2 12 6
3 4 2
3 5 2
3 6 3
3 8 5
4 5 2
4 8 4
4 12 6
I'm trying to find the common values of df['period'] groups (1, 2, 3, 4) based on value
of two columns 'symptoms' and 'recovery'
Result should be :
symptoms recovery period
5 2 [1, 2, 3, 4]
8 4 [2, 4]
where each same two columns values has the periods occurrence in a list or column.
I'm I approaching the problem in the wrong way ? Appreciate your help.
I tried to turn each period into dict and loop through to find values but didn't work for me. Also tried to use grouby().apply() but I'm not getting a meaningful data frame.
Tried sorting values based on 3 columns but couldn't get the common ones between each period section.
Last attempt :
df2 = df[['period', 'how_long', 'days_to_ex']].copy()
#s = df.groupby(["period", "symptoms", "recovery"]).size()
s = df.groupby(["symptoms", "recovery"]).size()
You were almost there:
from io import StringIO
import pandas as pd
# setup sample data
data = StringIO("""
period;symptoms;recovery
1;4;2
1;5;2
1;6;2
2;3;1
2;5;2
2;8;4
2;12;6
3;4;2
3;5;2
3;6;3
3;8;5
4;5;2
4;8;4
4;12;6
""")
df = pd.read_csv(data, sep=";")
# collect unique periods
df.groupby(['symptoms','recovery'])[['period']].agg(list).reset_index()
This gives
symptoms recovery period
0 3 1 [2]
1 4 2 [1, 3]
2 5 2 [1, 2, 3, 4]
3 6 2 [1]
4 6 3 [3]
5 8 4 [2, 4]
6 8 5 [3]
7 12 6 [2, 4]

how to convert a pandas column containing list into dataframe

I have a pandas dataframe.
One of its columns contains a list of 60 elements, constant across its rows.
How do I convert each of these lists into a row of a new dataframe?
Just to be clearer: say A is the original dataframe with n rows. One of its columns contains a list of 60 elements.
I need to create a new dataframe nx60.
My tentative:
def expand(x):
return(pd.DataFrame(np.array(x)).reshape(-1,len(x)))
df["col"].apply(lambda x: expand(x))]
it gives funny results....
The weird thing is that if i call the function "expand" on a single raw, it does exactly what I expect from it
expand(df["col"][0])
To ChootsMagoots: Thjis is the result when i try to apply your suggestion. It does not work.
Sample data
df = pd.DataFrame()
df['col'] = np.arange(4*5).reshape(4,5).tolist()
df
Output:
col
0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11, 12, 13, 14]
3 [15, 16, 17, 18, 19]
now exctract DataFrame from col
df.col.apply(pd.Series)
Output:
0 1 2 3 4
0 0 1 2 3 4
1 5 6 7 8 9
2 10 11 12 13 14
3 15 16 17 18 19
Try this:
new_df = pd.DataFrame(df["col"].tolist())
This is a little frankensteinish, but you could also try:
import numpy as np
np.savetxt('outfile.csv', np.array(df['col'].tolist()), delimiter=',')
new_df = pd.read_csv('outfile.csv')
You can try this as well:
newCol = pd.Series(yourList)
df['colD'] = newCol.values
The above code:
1. Creates a pandas series.
2. Maps the series value to columns in original dataframe.

Check whether a column in a dataframe is an integer or not, and perform operation

Check whether a column in a dataframe is an integer or not, and if it is an integer, it must be multiplied by 10
import numpy as np
import pandas as pd
df = pd.dataframe(....)
#function to check and multiply if a column is integer
def xtimes(x):
for col in x:
if type(x[col]) == np.int64:
return x[col]*10
else:
return x[col]
#using apply to apply that function on df
df.apply(xtimes).head(10)
I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to get numeric columns and then multiply.
In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10
You could have your specific check list for include=[... np.int64, ..., etc]
You can use the dtypes attribute and loc.
df.loc[:, df.dtypes <= np.integer] *= 10
Explanation
pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See this document for the numpy.dtype hierarchy.
Demo
Consider the dataframe df
df = pd.DataFrame([
[1, 2, 3, 4, 5, 6],
[1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))
df
0 1 2 3 4 5
0 1 2 3 4.0 5 6
1 1 2 3 4.0 5 6
The dtypes are
df.dtypes
0 int32
1 int16
2 int64
3 float64
4 object
5 object
dtype: object
We'd like to change columns 0, 1, and 2
Conveniently
df.dtypes <= np.integer
0 True
1 True
2 True
3 False
4 False
5 False
dtype: bool
And that is what enables us to use this within a loc assignment.
df.loc[:, df.dtypes <= np.integer] *= 10
df
0 1 2 3 4 5
0 10 20 30 4.0 5 6
1 10 20 30 4.0 5 6

How to multiply iteratively down a column?

I am having a tough time with this one - not sure why...maybe it's the late hour.
I have a dataframe in pandas as follows:
1 10
2 11
3 20
4 5
5 10
I would like to calculate for each row the multiplicand for each row above it. For example, at row 3, I would like to calculate 10*11*20, or 2,200.
How do I do this?
Use cumprod.
Example:
df = pd.DataFrame({'A': [10, 11, 20, 5, 10]}, index=range(1, 6))
df['cprod'] = df['A'].cumprod()
Note, since your example is just a single column, a cumulative product can be done succinctly with a Series:
import pandas as pd
s = pd.Series([10, 11, 20, 5, 10])
s
# Output
0 10
1 11
2 20
3 5
4 10
dtype: int64
s.cumprod()
# Output
0 10
1 110
2 2200
3 11000
4 110000
dtype: int64
Kudos to #bananafish for locating the inherent cumprod method.

Pandas dropping columns by index drops all columns with same name

Consider following dataframe which has columns with same name (Apparently this does happens, currently I have a dataset like this! :( )
>>> df = pd.DataFrame({"a":range(10,15),"b":range(5,10)})
>>> df.rename(columns={"b":"a"},inplace=True)
df
a a
0 10 5
1 11 6
2 12 7
3 13 8
4 14 9
>>> df.columns
Index(['a', 'a'], dtype='object')
I would expect that when dropping by index , only the column with the respective index would be gone, but apparently this is not the case.
>>> df.drop(df.columns[-1],1)
0
1
2
3
4
Is there a way to get rid of columns with duplicated column names?
EDIT: I choose missleading values for the first column, fixed now
EDIT2: the expected outcome is
a
0 10
1 11
2 12
3 13
4 14
Actually just do this:
In [183]:
df.ix[:,~df.columns.duplicated()]
Out[183]:
a
0 0
1 1
2 2
3 3
4 4
So this index all rows and then uses the column mask generated from duplicated and invert the mask using ~
The output from duplicated:
In [184]:
df.columns.duplicated()
Out[184]:
array([False, True], dtype=bool)
UPDATE
As .ix is deprecated (since v0.20.1) you should do any of the following:
df.iloc[:,~df.columns.duplicated()]
or
df.loc[:,~df.columns.duplicated()]
Thanks to #DavideFiocco for alerting me