Pandas .corr() returning an empty DataFrame - pandas
It was working great until it wasn't, and I have no idea what I'm doing wrong. I've reduced it to a very simple dataset t:
1 2 3 4 5 6 7 8
0 3 16 3 2 17 2 3 2
1 3 16 3 2 19 4 3 2
2 3 16 3 2 9 2 3 2
3 3 16 3 2 19 1 3 2
4 3 16 3 2 17 2 3 1
5 3 16 3 2 17 1 17 1
6 3 16 3 2 19 1 17 2
7 3 16 3 2 19 4 3 1
8 3 16 3 2 19 1 3 2
9 3 16 3 2 7 2 17 1
corr = t.corr()
corr
returns "__"
and
sns.heatmap(corr)
throws the following error "zero-size array to reduction operation minimum which has no identity"
I have no idea what's wrong. I've tried it with more rows etc., and double checked that I don't have any missing values... what's going on? I had such a pretty heatmap earlier, and I've been trying to get it back.
As mentioned above, change the type to float. Simply:
corr = t.astype('float64').corr()
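For context, a minimal sketch of why the cast helps; the tiny DataFrame below is made up for illustration, and the behaviour described (corr() silently dropping non-numeric columns) is that of the older pandas versions this thread is about:

import pandas as pd

# hypothetical data: numeric values stored with object dtype
t = pd.DataFrame([[3, 16, 2], [4, 17, 1], [5, 19, 2]], dtype=object)

t.corr()                      # empty DataFrame: no column counts as numeric
t.astype('float64').corr()    # works: every column is cast to float first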
The problem here is not the dataframe itself but its origin. I ran into the same problem when using drop or iloc on a dataframe. The key is the global dtype the dataframe has.
Let's say we have the following dataframe:
import pandas as pd

list_ex = [[1.1, 2.1, 3.1, 4, 5, 6, 7, 8], [1.2, 2.2, 3.3, 4.1, 5.5, 6, 7, 8],
           [1.3, 2.3, 3, 4, 5, 6.2, 7, 8], [1.4, 2.4, 3, 4, 5, 6.2, 7.3, 8.1]]
list_ex_new = pd.DataFrame(list_ex)
you can calculate list_ex_new.corr() with no problem. If you check the attributes of the dataframe with vars(list_ex_new), you'll obtain:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=0, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 4, dtype: float64, '_item_cache': {}}
where dtype is float64.
A new dataframe can be defined by list_new_new = list_ex_new.iloc[1:,:] and the correlations can be evaluated successfully. A check of the dataframe's attributes shows:
{'_is_copy': ,
'_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: RangeIndex(start=1, stop=4, step=1)
FloatBlock: slice(0, 8, 1), 8 x 3, dtype: float64,
'_item_cache': {}}
where dtype is still float64.
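A quicker way to see the same thing, without digging into vars(), is to look at the dtypes attribute directly; a small sketch:

list_new_new = list_ex_new.iloc[1:, :]
list_new_new.dtypes    # every column reports float64
list_new_new.corr()    # so corr() works as expected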
A third dataframe can be defined:
list_ex_w = [['a','a','a','a','a','a','a','a'],[1.1,2.1,3.1,4,5,6,7,8],
[1.2,2.2,3.3,4.1,5.5,6,7,8],[1.3,2.3,3,4,5,6.2,7,8],
[1.4,2.4,3,4,5,6.2,7.3,8.1]]
list_ex_new_w=pd.DataFrame(list_ex_w)
An evaluation of the dataframe's correlation will result in an empty dataframe, since list_ex_new_w's attributes look like:
{'_is_copy': None, '_data': BlockManager
Items: RangeIndex(start=0, stop=8, step=1)
Axis 1: Index(['a', 1, 2, 3, 4], dtype='object')
ObjectBlock: slice(0, 8, 1), 8 x 5, dtype: object, '_item_cache': {}}
where now the dtype is 'object', since the dataframe is not consistent in its types: there are strings and floats together. Finally, a fourth dataframe can be generated:
list_new_new_w = list_ex_new_w.iloc[1:,:]
This will generate the same dataframe but with no 'a's, apparently a perfectly correct dataframe for calculating the correlations. However, this will again return an empty dataframe. A final check of the dataframe's attributes shows:
vars(list_new_new_w)
{'_is_copy': None, '_data': BlockManager
Items: Index([1, 2, 3, 4], dtype='object')
Axis 1: RangeIndex(start=0, stop=8, step=1)
ObjectBlock: slice(0, 4, 1), 4 x 8, dtype: object, '_item_cache': {}}
where the dtype is still object, thus the corr method returns an empty dataframe.
This problem can be solved by using astype(float):
list_new_new_w.astype(float).corr()
In summary, it seems that when corr or cov (among other methods) is called, pandas generates a new dataframe with the same attributes as the original, ignoring whether the new dataframe now has a consistent global type. I've been checking the pandas source code, and I understand this to be the correct interpretation of pandas' implementation.
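As a practical takeaway, a small sketch of a guard you can put in front of corr(): coerce every column to numeric before correlating. The helper name and the errors='coerce' choice (non-parsable values become NaN) are my own assumptions, not part of the answers above:

import pandas as pd

def corr_numeric(df):
    # cast every column to a numeric dtype; anything that cannot be parsed becomes NaN
    return df.apply(pd.to_numeric, errors='coerce').corr()

corr_numeric(list_new_new_w)   # same result as list_new_new_w.astype(float).corr()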
Related
Find common values within groupby in pandas Dataframe based on two columns
I have the following dataframe:

   period  symptoms  recovery
        1         4         2
        1         5         2
        1         6         2
        2         3         1
        2         5         2
        2         8         4
        2        12         6
        3         4         2
        3         5         2
        3         6         3
        3         8         5
        4         5         2
        4         8         4
        4        12         6

I'm trying to find the common values of the df['period'] groups (1, 2, 3, 4) based on the values of the two columns 'symptoms' and 'recovery'. The result should be:

   symptoms  recovery        period
          5         2  [1, 2, 3, 4]
          8         4        [2, 4]

where each pair of values from the two columns has its period occurrences in a list or column. Am I approaching the problem in the wrong way? Appreciate your help. I tried to turn each period into a dict and loop through it to find values, but that didn't work for me. I also tried to use groupby().apply(), but I'm not getting a meaningful dataframe. I tried sorting values based on the 3 columns but couldn't get the common ones between each period section. Last attempt:

df2 = df[['period', 'how_long', 'days_to_ex']].copy()
#s = df.groupby(["period", "symptoms", "recovery"]).size()
s = df.groupby(["symptoms", "recovery"]).size()
You were almost there:

from io import StringIO
import pandas as pd

# setup sample data
data = StringIO("""
period;symptoms;recovery
1;4;2
1;5;2
1;6;2
2;3;1
2;5;2
2;8;4
2;12;6
3;4;2
3;5;2
3;6;3
3;8;5
4;5;2
4;8;4
4;12;6
""")
df = pd.read_csv(data, sep=";")

# collect unique periods
df.groupby(['symptoms','recovery'])[['period']].agg(list).reset_index()

This gives

   symptoms  recovery        period
0         3         1           [2]
1         4         2        [1, 3]
2         5         2  [1, 2, 3, 4]
3         6         2           [1]
4         6         3           [3]
5         8         4        [2, 4]
6         8         5           [3]
7        12         6        [2, 4]
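If you only want the combinations that occur in more than one period (as in your expected output), one possible follow-up, just a sketch built on the answer above, is to filter on the length of the collected list:

out = df.groupby(['symptoms', 'recovery'])[['period']].agg(list).reset_index()
out[out['period'].str.len() > 1]   # keep only (symptoms, recovery) pairs shared by several periods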
How to convert a pandas column containing lists into a dataframe
I have a pandas dataframe. One of its columns contains a list of 60 elements, constant across its rows. How do I convert each of these lists into a row of a new dataframe?

Just to be clearer: say A is the original dataframe with n rows. One of its columns contains a list of 60 elements. I need to create a new dataframe n x 60. My tentative:

def expand(x):
    return pd.DataFrame(np.array(x)).reshape(-1, len(x))

df["col"].apply(lambda x: expand(x))

it gives funny results.... The weird thing is that if I call the function expand on a single row, it does exactly what I expect from it:

expand(df["col"][0])

To ChootsMagoots: This is the result when I try to apply your suggestion. It does not work.
Sample data:

df = pd.DataFrame()
df['col'] = np.arange(4*5).reshape(4,5).tolist()
df

Output:

                    col
0       [0, 1, 2, 3, 4]
1       [5, 6, 7, 8, 9]
2  [10, 11, 12, 13, 14]
3  [15, 16, 17, 18, 19]

Now extract a DataFrame from col:

df.col.apply(pd.Series)

Output:

    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
Try this:

new_df = pd.DataFrame(df["col"].tolist())

This is a little frankensteinish, but you could also try:

import numpy as np
np.savetxt('outfile.csv', np.array(df['col'].tolist()), delimiter=',')
new_df = pd.read_csv('outfile.csv')
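One caveat worth noting (my own addition, not part of the original answer): pd.DataFrame(df["col"].tolist()) builds a fresh RangeIndex; if the new n x 60 frame should stay aligned with A's rows, pass the original index explicitly:

new_df = pd.DataFrame(df["col"].tolist(), index=df.index)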
You can try this as well:

newCol = pd.Series(yourList)
df['colD'] = newCol.values

The above code:
1. Creates a pandas series.
2. Maps the series values to a column in the original dataframe.
Check whether a column in a dataframe is an integer or not, and perform operation
Check whether a column in a dataframe is an integer or not, and if it is an integer, it must be multiplied by 10.

import numpy as np
import pandas as pd

df = pd.DataFrame(....)

# function to check and multiply if a column is an integer
def xtimes(x):
    for col in x:
        if type(x[col]) == np.int64:
            return x[col]*10
        else:
            return x[col]

# using apply to apply that function on df
df.apply(xtimes).head(10)

I am getting an error like ('GP', 'occurred at index school')
You could use select_dtypes to get the numeric columns and then multiply.

In [1284]: df[df.select_dtypes(include=['int', 'int64', np.number]).columns] *= 10

You could have your specific check list for include=[... np.int64, ..., etc]
You can use the dtypes attribute and loc.

df.loc[:, df.dtypes <= np.integer] *= 10

Explanation

pd.DataFrame.dtypes returns a pd.Series of numpy dtype objects. We can use the comparison operators to determine subdtype status. See this document for the numpy.dtype hierarchy.

Demo

Consider the dataframe df

df = pd.DataFrame([
    [1, 2, 3, 4, 5, 6],
    [1, 2, 3, 4, 5, 6]
]).astype(pd.Series([np.int32, np.int16, np.int64, float, object, str]))

df

   0  1  2    3  4  5
0  1  2  3  4.0  5  6
1  1  2  3  4.0  5  6

The dtypes are

df.dtypes

0      int32
1      int16
2      int64
3    float64
4     object
5     object
dtype: object

We'd like to change columns 0, 1, and 2. Conveniently

df.dtypes <= np.integer

0     True
1     True
2     True
3    False
4    False
5    False
dtype: bool

And that is what enables us to use this within a loc assignment.

df.loc[:, df.dtypes <= np.integer] *= 10

df

    0   1   2    3  4  5
0  10  20  30  4.0  5  6
1  10  20  30  4.0  5  6
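A more explicit alternative (a sketch of my own, not from the answers above) uses pandas' type-checking helper pd.api.types.is_integer_dtype to pick the integer columns by label before multiplying:

import pandas as pd

# select only columns whose dtype is an integer type, then multiply them in place
int_cols = [c for c in df.columns if pd.api.types.is_integer_dtype(df[c])]
df[int_cols] = df[int_cols] * 10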
How to multiply iteratively down a column?
I am having a tough time with this one - not sure why... maybe it's the late hour. I have a dataframe in pandas as follows:

1    10
2    11
3    20
4     5
5    10

I would like to calculate, for each row, the product of that row's value and every row above it. For example, at row 3, I would like to calculate 10*11*20, or 2,200. How do I do this?
Use cumprod. Example:

df = pd.DataFrame({'A': [10, 11, 20, 5, 10]}, index=range(1, 6))
df['cprod'] = df['A'].cumprod()
Note, since your example is just a single column, a cumulative product can be done succinctly with a Series:

import pandas as pd

s = pd.Series([10, 11, 20, 5, 10])
s

# Output
0    10
1    11
2    20
3     5
4    10
dtype: int64

s.cumprod()

# Output
0        10
1       110
2      2200
3     11000
4    110000
dtype: int64

Kudos to @bananafish for locating the inherent cumprod method.
Pandas dropping columns by index drops all columns with same name
Consider the following dataframe, which has columns with the same name (apparently this does happen; currently I have a dataset like this! :( )

>>> df = pd.DataFrame({"a":range(10,15),"b":range(5,10)})
>>> df.rename(columns={"b":"a"}, inplace=True)
>>> df
    a  a
0  10  5
1  11  6
2  12  7
3  13  8
4  14  9
>>> df.columns
Index(['a', 'a'], dtype='object')

I would expect that when dropping by index, only the column with the respective index would be gone, but apparently this is not the case:

>>> df.drop(df.columns[-1], 1)
0
1
2
3
4

Is there a way to get rid of columns with duplicated column names?

EDIT: I chose misleading values for the first column, fixed now.

EDIT2: the expected outcome is

    a
0  10
1  11
2  12
3  13
4  14
Actually just do this:

In [183]: df.ix[:, ~df.columns.duplicated()]
Out[183]:
   a
0  0
1  1
2  2
3  3
4  4

So this indexes all rows and then uses the column mask generated from duplicated, inverting the mask using ~. The output from duplicated:

In [184]: df.columns.duplicated()
Out[184]: array([False,  True], dtype=bool)

UPDATE

As .ix is deprecated (since v0.20.1) you should do either of the following:

df.iloc[:, ~df.columns.duplicated()]

or

df.loc[:, ~df.columns.duplicated()]

Thanks to @DavideFiocco for alerting me
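If instead you want to drop one particular duplicate by position (rather than keep the first occurrence of each name), a small sketch using integer positions; the position 1 here is just the example's second column:

keep = [i for i in range(df.shape[1]) if i != 1]   # every column position except the one to drop
df.iloc[:, keep]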