Truth value of a Dataframe is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all() - pandas

I've tried to make a pairplot with seaborn from my CSV data (this link), using the following code based on the seaborn site.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
freq_data = pd.read_csv('C:\\Users\\frequency.csv')
freq = sns.load_dataset(freq_data)
df = sns.pairplot(iris, hue="condition", height=2.5)
plt.show()
The result is the following traceback about the DataFrame truth value being ambiguous:
Traceback (most recent call last):
File "\.vscode\test.py", line 8, in <module>
freq = sns.load_dataset(freq_data)
File "\site-packages\seaborn\utils.py", line 485, in load_dataset
if name not in get_dataset_names():
File "\site-packages\pandas\core\generic.py", line 1441, in __nonzero__
raise ValueError(
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I've checked my data; the result is shown here:
condition area sphericity aspect_ratio
0 20 kHz 0.249 0.287 1.376
1 20 kHz 0.954 0.721 1.421
2 20 kHz 0.118 0.260 1.409
3 20 kHz 0.540 0.552 1.526
4 20 kHz 0.448 0.465 1.160
.. ... ... ... ...
310 30 kHz 6.056 0.955 2.029
311 30 kHz 4.115 1.097 1.398
312 30 kHz 11.055 1.816 1.838
313 30 kHz 4.360 1.183 1.162
314 30 kHz 10.596 0.940 1.715
[315 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 315 entries, 0 to 314
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 condition 315 non-null object
1 area 315 non-null float64
2 sphericity 315 non-null float64
3 aspect_ratio 315 non-null float64
dtypes: float64(3), object(1)
memory usage: 10.0+ KB
I have no idea what is happening with my dataframe :(
Please advise me on how to solve this problem.
Thank you everyone.

The first argument of seaborn.load_dataset() is the name of a dataset ({name}.csv on https://github.com/mwaskom/seaborn-data), not a pandas.DataFrame object. The return value of seaborn.load_dataset() is just a pandas.DataFrame, so since read_csv already gives you one, you don't need to do
freq = sns.load_dataset(freq_data)
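For reference, a minimal sketch of what load_dataset() is actually meant for (it downloads one of the named example datasets, so it needs an internet connection):
import seaborn as sns
iris = sns.load_dataset("iris")  # fetches iris.csv from the seaborn-data repository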
Moreover, you may want freq_data rather than iris in df = sns.pairplot(iris, hue="condition", height=2.5).
Here is the final example code:
from io import StringIO
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
TESTDATA = StringIO("""condition;area;sphericity;aspect_ratio
20 kHz;0.249;0.287;1.376
20 kHz;0.954;0.721;1.421
20 kHz;0.118;0.260;1.409
20 kHz;0.540;0.552;1.526
20 kHz;0.448;0.465;1.160
30 kHz;6.056;0.955;2.029
30 kHz;4.115;1.097;1.398
30 kHz;11.055;1.816;1.838
30 kHz;4.360;1.183;1.162
30 kHz;10.596;0.940;1.715
""")
freq_data = pd.read_csv(TESTDATA, sep=";")
sns.pairplot(freq_data, hue="condition", height=2.5)
plt.show()
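Applied to your own file, the fix is simply to pass the DataFrame returned by read_csv straight to pairplot (the path below is taken from your question):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
freq_data = pd.read_csv('C:\\Users\\frequency.csv')  # your original CSV
sns.pairplot(freq_data, hue="condition", height=2.5)  # no load_dataset() needed
plt.show()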

Related

Getting variable no of pandas rows w.r.t. a dictionary lookup

In this sample dataframe df:
import pandas as pd
import numpy as np
import random, string
max_rows = {'A': 3, 'B': 2, 'D': 4} # max number of rows to be extracted
data_size = 1000
df = pd.DataFrame({'symbol': pd.Series(random.choice(string.ascii_uppercase) for _ in range(data_size)),
                   'qty': np.random.randn(data_size)}).sort_values('symbol')
How do I get a dataframe with a variable number of rows per symbol, driven by the dictionary?
I tried [df.groupby('symbol').head(i) for i in df.symbol.map(max_rows)], but it gives a RuntimeWarning and looks very incorrect.
You can use concat with a list comprehension:
print(pd.concat([df.loc[df["symbol"].eq(k)].head(v) for k, v in max_rows.items()]))
symbol qty
640 A -0.725947
22 A -1.361063
190 A -0.596261
451 B -0.992223
489 B -2.014979
593 D 1.581863
600 D -2.162044
793 D -1.162758
738 D 0.345683
Adding another method, using groupby + cumcount and df.query:
df.assign(v=df.groupby("symbol").cumcount()+1,k=df['symbol'].map(max_rows)).query("v<=k")
Or the same logic without assigning extra columns (thanks @jezrael):
df[df.groupby("symbol").cumcount()+1 <= df['symbol'].map(max_rows)]
symbol qty
882 A -0.249236
27 A 0.625584
122 A -1.154539
229 B -1.269212
55 B 1.403455
457 D -2.592831
449 D -0.433731
634 D 0.099493
734 D -1.551012
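Yet another sketch, using groupby().apply() with a per-group head; note that symbols missing from max_rows fall back to 0 rows here, so adjust the default if you want to keep them:
out = (df.groupby("symbol", group_keys=False)
         .apply(lambda g: g.head(max_rows.get(g.name, 0))))
print(out)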

TypeError: '<=' not supported between instances of 'Timestamp' and 'numpy.float64'

I am trying to plot using hvplot, and I am getting this:
TypeError: '<=' not supported between instances of 'Timestamp' and 'numpy.float64'
Here is my data:
TimeConv Hospitalizations
1 2020-04-04 827
2 2020-04-05 1132
3 2020-04-06 1153
4 2020-04-07 1252
5 2020-04-08 1491
... ... ...
71 2020-06-13 2242
72 2020-06-14 2287
73 2020-06-15 2326
74 NaT NaN
75 NaT NaN
Below is my code:
import numpy as np
import matplotlib.pyplot as plt
import xlsxwriter
import pandas as pd
from pandas import DataFrame
path = ('Casecountdata.xlsx')
xl = pd.ExcelFile(path)
df1 = xl.parse('Hospitalization by Day')
df2 = df1[['Unnamed: 1','Unnamed: 2']]
df2 = df2.drop(df2.index[0])
df2 = df2.rename(columns={"Unnamed: 1": "Time", "Unnamed: 2": "Hospitalizations"})
df2['TimeConv'] = pd.to_datetime(df2.Time)
df3 = df2[['TimeConv','Hospitalizations']]
When I take a sample of your data above and try to plot it, it works for me, so there is probably something wrong in the way you read your data from Excel into pandas. You can run df.info() to see what the datatypes look like: column TimeConv should be datetime64[ns] and column Hospitalizations should be int64 (or float). It could also be a version problem... do you have the latest versions of hvplot etc. installed? But my guess is that your data doesn't look right.
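As a first check, a minimal cleaning sketch along those lines (using the df3 from your code) often resolves this:
df3 = df3.dropna(subset=['TimeConv'])  # drop the trailing NaT rows
df3['Hospitalizations'] = pd.to_numeric(df3['Hospitalizations'], errors='coerce')
df3.info()  # TimeConv should now be datetime64[ns], Hospitalizations int64 or float64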
In any case, when I run the following, it works and plots your data:
# import libraries
import pandas as pd
import hvplot.pandas
import holoviews as hv
hv.extension('bokeh')
from io import StringIO # need this to read your text data
# your sample data
text_data = StringIO("""
column1 TimeConv Hospitalizations
1 2020-04-04 827
2 2020-04-05 1132
72 2020-06-14 2287
73 2020-06-15 2326
74 NaT NaN
""")
# read text data to dataframe
df = pd.read_csv(text_data, sep=r"\s+")
df['TimeConv'] = pd.to_datetime(df.TimeConv, yearfirst=True)
# shortly checkout datatypes of your data
df.info()
# create scatter plot of your data
df.hvplot.scatter(
    x='TimeConv',
    y='Hospitalizations',
    width=500,
    title='Showing hospitalizations over time',
)
This code results in the following plot:

pandas simpleimputer preserve datatypes

I am facing a simple error with the code below.
My objective is to use SimpleImputer to fill missing values of different datatypes in one shot.
When I try to do that, fit_transform does not seem to work as expected.
When the dtype argument is not used, the code works just fine, but the resulting dataframe loses its data type information. When I include the dtype list in the arguments, I see the error below. You should be able to reproduce the error by just copying and pasting the code locally.
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
import sklearn
print(sklearn.__version__)
0.21.dev0
data = [['Alex','NJ',21,5.10],['Mary','NY',20,np.nan],['Sam',np.nan,np.nan,6.3]]
df = pd.DataFrame(data,columns=['Name','State','Age','Height'])
df.dtypes
Name object
State object
Age float64
Height float64
dtype: object
imp = SimpleImputer(strategy="most_frequent")
#df = pd.DataFrame(imp.fit_transform(df),columns=df.columns) <<<<----- This works just fine
#df
#Name State Age Height
#0 Alex NJ 21 5.1
#1 Mary NY 20 5.1
#2 Sam NJ 20 6.3
#df.dtypes
#Name object
#State object
#Age object
#Height object
#dtype: object
The statement below fails with the error listed underneath (I am trying to preserve dtypes during the imputing process):
df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-23-e9780979921f> in <module>()
7
8 imp = SimpleImputer(strategy="most_frequent")
----> 9 df = pd.DataFrame(imp.fit_transform(df),columns=df.columns,dtype=df.dtypes)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
337 data = {}
338 if dtype is not None:
--> 339 dtype = self._validate_dtype(dtype)
340
341 if isinstance(data, DataFrame):
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in _validate_dtype(self, dtype)
166
167 if dtype is not None:
--> 168 dtype = pandas_dtype(dtype)
169
170 # a compound dtype
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\common.py in pandas_dtype(dtype)
2020 # which we safeguard against by catching them earlier and returning
2021 # np.dtype(valid_dtype) before this condition is evaluated.
-> 2022 if dtype in [object, np.object_, 'object', 'O']:
2023 return npdtype
2024 elif npdtype.kind == 'O':
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
1574 raise ValueError("The truth value of a {0} is ambiguous. "
1575 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1576 .format(self.__class__.__name__))
1577
1578 __bool__ = __nonzero__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
If you want to preserve the dtypes, I recommend using pandas to find the mode and then calling fillna:
df = df.fillna(df.agg(lambda x: pd.Series.mode(x)[0], axis=0))
print(df)
Name State Age Height
0 Alex NJ 21.0 5.1
1 Mary NY 20.0 5.1
2 Sam NJ 20.0 6.3
print(df.dtypes)
Name object
State object
Age float64
Height float64
dtype: object
Alternatively, use astype and pass a dictionary:
df = pd.DataFrame(
    imp.fit_transform(df), columns=df.columns
).astype(df.dtypes.to_dict())
print(df)
Name State Age Height
0 Alex NJ 21.0 5.1
1 Mary NY 20.0 5.1
2 Sam NJ 20.0 6.3
print(df.dtypes)
Name object
State object
Age float64
Height float64
dtype: object
An explicit astype call is needed because, as per the documentation, only a single dtype can be passed to the pd.DataFrame constructor:
?pd.DataFrame
...
dtype : dtype, default None
| Data type to force. Only a single dtype is allowed.
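If you would rather stay entirely within scikit-learn, a sketch of another option is to impute the numeric and object columns separately so each block keeps its dtype (the select_dtypes column split is my addition, not from your code):
num_cols = df.select_dtypes(include='number').columns
obj_cols = df.select_dtypes(include='object').columns
df[num_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[num_cols])
df[obj_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[obj_cols])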

How do I read tabulator separated CSV in blaze?

I have a "CSV" data file with the following format (well, it's rather a TSV):
event pdg x y z t px py pz ekin
3383 11 -161.515 5.01938e-05 -0.000187112 0.195413 0.664065 0.126078 -0.736968 0.00723234
1694 11 -161.515 -0.000355633 0.000263174 0.195413 0.511853 -0.523429 0.681196 0.00472714
4228 11 -161.535 6.59631e-06 -3.32796e-05 0.194947 -0.713983 -0.0265468 -0.69966 0.0108681
4233 11 -161.515 -0.000524488 6.5069e-05 0.195413 0.942642 0.331324 0.0406377 0.017594
This file is interpretable as-is in pandas:
from pandas import read_csv, read_table
data = read_csv("test.csv", sep="\t", index_col=False) # Works
data = read_table("test.csv", index_col=False) # Works
However, when I try to read it in blaze (which claims to accept pandas keyword arguments), an exception is thrown:
from blaze import Data
Data("test.csv") # Attempt 1
Data("test.csv", sep="\t") # Attempt 2
Data("test.csv", sep="\t", index_col=False) # Attempt 3
None of these works, and pandas is not used at all. The "sniffer" that tries to deduce column names and types just calls csv.Sniffer.sniff() from the standard library (which fails).
Is there a way to properly read this file in blaze? (Given that its "little brother" has a few hundred MB, I want to use blaze's sequential processing capabilities.)
Thanks for any ideas.
Edit: I think it might be a problem of odo/csv and filed an issue: https://github.com/blaze/odo/issues/327
Edit2:
Complete error:
Error                                 Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 bz.Data("test.csv", sep="\t", index_col=False)
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/blaze/interactive.py in Data(data, dshape, name, fields, columns, schema, **kwargs)
54 if isinstance(data, _strtypes):
55 data = resource(data, schema=schema, dshape=dshape, columns=columns,
---> 56 **kwargs)
57 if (isinstance(data, Iterator) and
58 not isinstance(data, tuple(not_an_iterator))):
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/regex.py in __call__(self, s, *args, **kwargs)
62
63 def __call__(self, s, *args, **kwargs):
---> 64 return self.dispatch(s)(s, *args, **kwargs)
65
66 @property
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in resource_csv(uri, **kwargs)
276 @resource.register('.+\.(csv|tsv|ssv|data|dat)(\.gz|\.bz2?)?')
277 def resource_csv(uri, **kwargs):
--> 278 return CSV(uri, **kwargs)
279
280
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in __init__(self, path, has_header, encoding, sniff_nbytes, **kwargs)
102 if has_header is None:
103 self.has_header = (not os.path.exists(path) or
--> 104 infer_header(path, sniff_nbytes))
105 else:
106 self.has_header = has_header
/home/[username-hidden]/anaconda3/lib/python3.4/site-packages/odo/backends/csv.py in infer_header(path, nbytes, encoding, **kwargs)
58 with open_file(path, 'rb') as f:
59 raw = f.read(nbytes)
---> 60 return csv.Sniffer().has_header(raw if PY2 else raw.decode(encoding))
61
62
/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in has_header(self, sample)
392 # subtracting from the likelihood of the first row being a header.
393
--> 394 rdr = reader(StringIO(sample), self.sniff(sample))
395
396 header = next(rdr) # assume first row is header
/home/[username-hidden]/anaconda3/lib/python3.4/csv.py in sniff(self, sample, delimiters)
187
188 if not delimiter:
--> 189 raise Error("Could not determine delimiter")
190
191 class dialect(Dialect):
Error: Could not determine delimiter
I am working with Python 2.7.10, dask v0.7.1, blaze v0.8.2 and conda v3.17.0.
conda install dask
conda install blaze
Here is a way you can import the data for use with blaze. Parse the data first with pandas and then convert it into blaze. Perhaps this defeats the purpose, but there are no troubles this way.
As a side note, in order to parse the data file correctly, your pandas parsing statement should be:
from blaze import Data
from pandas import DataFrame, read_csv
data = read_csv("csvdata.dat", sep="\s*", index_col=False)
bdata = Data(data)
Now the data is formatted correctly with no errors; bdata shows:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
3 0.040638 0.017594
Here is an alternative: use dask, which can probably do the same chunking or large-scale processing you are looking for. Dask certainly makes it easy to load a TSV format correctly.
In [17]: import dask.dataframe as dd
In [18]: df = dd.read_csv('tsvdata.txt', sep='\t', index_col=False)
In [19]: df.head()
Out[19]:
event pdg x y z t px py \
0 3383 11 -161.515 0.000050 -0.000187 0.195413 0.664065 0.126078
1 1694 11 -161.515 -0.000356 0.000263 0.195413 0.511853 -0.523429
2 4228 11 -161.535 0.000007 -0.000033 0.194947 -0.713983 -0.026547
3 4233 11 -161.515 -0.000524 0.000065 0.195413 0.942642 0.331324
4 854 11 -161.515 0.000032 0.000418 0.195414 0.675752 0.315671
pz ekin
0 -0.736968 0.007232
1 0.681196 0.004727
2 -0.699660 0.010868
3 0.040638 0.017594
4 -0.666116 0.012641
See also: http://dask.pydata.org/en/latest/array-blaze.html#how-to-use-blaze-with-dask
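If blaze keeps refusing the file, a plain-pandas sketch of the sequential processing you mention is chunked reading (process() here is just a placeholder for whatever you do per chunk):
from pandas import read_csv
for chunk in read_csv("test.csv", sep="\t", index_col=False, chunksize=100000):
    process(chunk)  # placeholder: aggregate, filter, or append per-chunk results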

Condition in Pandas

I have a very peculiar problem in Pandas: one condition works but the other does not. You may download the linked file to test my code. Thanks!
I have a file (stars.txt) that I read in with pandas. I would like to create two groups: (1) with Log_G < 4.0 and (2) with Log_G > 4.0. In my code (see below) I can successfully get the rows for group (1):
Kepler_ID RA Dec Teff Log_G g H
3 2305372 19 27 57.679 +37 40 21.90 5664 3.974 14.341 12.201
14 2708156 19 21 08.906 +37 56 11.44 11061 3.717 10.672 10.525
19 2997455 19 32 31.296 +38 07 40.04 4795 3.167 14.694 11.500
34 3352751 19 36 17.249 +38 25 36.91 7909 3.791 13.541 12.304
36 3440230 19 21 53.100 +38 31 42.82 7869 3.657 13.706 12.486
But for some reason I cannot get (2). The code returns the following, which I take to be an error:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90 entries, 0 to 108
Data columns (total 7 columns):
Kepler_ID 90 non-null values
RA 90 non-null values
Dec 90 non-null values
Teff 90 non-null values
Log_G 90 non-null values
g 90 non-null values
H 90 non-null values
dtypes: float64(4), int64(1), object(2)
Here's my code:
#------------------------------------------
# IMPORT STATEMENTS
#------------------------------------------
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#------------------------------------------
# READ FILE AND ASSOCIATE COMPONENTS
#------------------------------------------
star_file = 'stars.txt'
header_row = ['Kepler_ID', 'RA','Dec','Teff', 'Log_G', 'g', 'H']
df = pd.read_csv(star_file, names=header_row, skiprows=2)
#------------------------------------------
# ASSOCIATE VARIABLES
#------------------------------------------
Kepler_ID = df['Kepler_ID']
#RA = df['RA']
#Dec = df['Dec']
Teff = df['Teff']
Log_G = df['Log_G']
g = df['g']
H = df['H']
#------------------------------------------
# SUBSTITUTE MISSING DATA WITH NAN
#------------------------------------------
df = df.replace('', np.nan)
#------------------------------------------
# CHANGE DATA TYPE OF THE REST OF DATA TO FLOAT
#------------------------------------------
df[['Teff', 'Log_G', 'g', 'H']] = df[['Teff', 'Log_G', 'g', 'H']].astype(float)
#------------------------------------------
# SORTING SPECTRA TYPES FOR GIANTS
#------------------------------------------
# FIND GIANTS IN THE SAMPLE
giants = df[(df['Log_G'] < 4.)]
#print giants
# FIND GIANTS IN THE SAMPLE
dwarfs = df[(df['Log_G'] > 4.)]
print dwarfs
This is not an error. You are seeing a summarized view of the DataFrame:
In [11]: df = pd.DataFrame([[2, 1], [3, 4]])
In [12]: df
Out[12]:
0 1
0 2 1
1 3 4
In [13]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
0 2 non-null values
1 2 non-null values
dtypes: int64(2)
What is displayed is decided by several display options, for example max_rows:
In [14]: pd.options.display.max_rows
Out[14]: 60
In [15]: pd.options.display.max_rows = 120
In 0.13, this behaviour changed, so you will see the first max_rows rows followed by ....
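If you just want to dump every row regardless of the display options, a small sketch (Python 2 print, matching the question):
print dwarfs.to_string()  # renders the full DataFrame as plain text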