TypeError: '<=' not supported between instances of 'Timestamp' and 'numpy.float64' - pandas

I am trying to plot using hvplot, and I am getting this:
TypeError: '<=' not supported between instances of 'Timestamp' and 'numpy.float64'
Here is my data:
     TimeConv  Hospitalizations
1  2020-04-04               827
2  2020-04-05              1132
3  2020-04-06              1153
4  2020-04-07              1252
5  2020-04-08              1491
..        ...               ...
71 2020-06-13              2242
72 2020-06-14              2287
73 2020-06-15              2326
74        NaT               NaN
75        NaT               NaN
Below is my code:
import numpy as np
import matplotlib.pyplot as plt
import xlsxwriter
import pandas as pd
from pandas import DataFrame
path = ('Casecountdata.xlsx')
xl = pd.ExcelFile(path)
df1 = xl.parse('Hospitalization by Day')
df2 = df1[['Unnamed: 1','Unnamed: 2']]
df2 = df2.drop(df2.index[0])
df2 = df2.rename(columns={"Unnamed: 1": "Time", "Unnamed: 2": "Hospitalizations"})
df2['TimeConv'] = pd.to_datetime(df2.Time)
df3 = df2[['TimeConv','Hospitalizations']]

When I take a sample of the data above and plot it, it works for me, so something probably goes wrong in the way you read your data from Excel into pandas. Try df.info() to see what the datatypes look like: column TimeConv should be datetime64[ns] and column Hospitalizations should be int64 (or float). It could also be a version problem... do you have the latest versions of hvplot etc. installed? But my guess is that your data doesn't look right.
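For example, a quick check and cleanup could look like this (just a sketch, assuming df3 from your code, and that the trailing NaT/NaN rows or a non-numeric column are the culprit):
import hvplot.pandas  # registers .hvplot on DataFrames
# inspect the dtypes that actually arrived from Excel
df3.info()
# drop the trailing rows that have no date, and force the counts to be numeric
df3 = df3.dropna(subset=['TimeConv'])
df3['Hospitalizations'] = pd.to_numeric(df3['Hospitalizations'], errors='coerce')
df3.hvplot.scatter(x='TimeConv', y='Hospitalizations')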
In any case, when I run the following, it works and plots your data:
# import libraries
import pandas as pd
import hvplot.pandas
import holoviews as hv
hv.extension('bokeh')
from io import StringIO # need this to read your text data
# your sample data
text_data = StringIO("""
column1 TimeConv Hospitalizations
1 2020-04-04 827
2 2020-04-05 1132
72 2020-06-14 2287
73 2020-06-15 2326
74 NaT NaN
""")
# read text data to dataframe
df = pd.read_csv(text_data, sep=r"\s+")
df['TimeConv'] = pd.to_datetime(df.TimeConv, yearfirst=True)
# quickly check the datatypes of your data
df.info()
# create scatter plot of your data
df.hvplot.scatter(
    x='TimeConv',
    y='Hospitalizations',
    width=500,
    title='Showing hospitalizations over time',
)
This code results in the following plot:

Related

Python DataFrame: How to write and read multiple tickers time-series dataframe?

This is a fairly complicated dataframe produced by a simple download. After saving it to a file (to_csv), I can't seem to read it back (read_csv) into a dataframe with the same shape as before. Please help.
import yfinance as yf
import pandas as pd
tickers=['AAPL', 'MSFT']
header = ['Open', 'High', 'Low', 'Close', 'Adj Close']
df = yf.download(tickers, period='1y')[header]
df.to_csv("data.csv", index=True)
dfr = pd.read_csv("data.csv")
dfr = dfr.set_index('Date')
print(dfr)
KeyError: "None of ['Date'] are in the columns"
Note:
df: Date is the Index
              Open            High
              AAPL    MSFT    AAPL    MSFT
Date
2022-02-07  172.86  306.17  173.95  307.84
2022-02-08  171.73  301.25  175.35  305.56
2022-02-09  176.05  309.87  176.65  311.93
2022-02-10  174.14  304.04  175.48  309.12
2022-02-11  172.33  303.19  173.08  304.29
But dfr (after read_csv)
   Unnamed: 0    Open  ...    High  High.1
0         NaN    AAPL  ...    AAPL    MSFT
1        Date     NaN  ...     NaN     NaN
2  2022-02-07  172.86  ...  173.94  307.83
3  2022-02-08  171.72  ...  175.35  305.55
4  2022-02-09  176.05  ...  176.64  311.92
How can I make dfr look like df?
I ran the code, but got the error:
KeyError: "None of ['Date'] are in the columns"
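One way to get dfr back into the same shape is to tell read_csv about the two header rows and the index column; a minimal sketch, assuming data.csv was written by the code above:
import pandas as pd
# header=[0, 1] rebuilds the (price, ticker) column MultiIndex,
# index_col=0 restores Date as the row index
dfr = pd.read_csv("data.csv", header=[0, 1], index_col=0)
dfr.index = pd.to_datetime(dfr.index)  # dates come back as strings otherwise
print(dfr)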

Truth value of a Dataframe is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

I've tried to create a pairplot with seaborn from my csv data (this link), following the example code on the seaborn site.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
freq_data = pd.read_csv('C:\\Users\\frequency.csv')
freq = sns.load_dataset(freq_data)
df = sns.pairplot(iris, hue="condition", height=2.5)
plt.show()
The result shows a traceback about the truth value of a DataFrame being ambiguous:
Traceback (most recent call last):
File "\.vscode\test.py", line 8, in <module>
freq = sns.load_dataset(freq_data)
File "\site-packages\seaborn\utils.py", line 485, in load_dataset
if name not in get_dataset_names():
File "\site-packages\pandas\core\generic.py", line 1441, in __nonzero__
raise ValueError(
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I've checked my data, and the result is here:
    condition    area  sphericity  aspect_ratio
0      20 kHz   0.249       0.287         1.376
1      20 kHz   0.954       0.721         1.421
2      20 kHz   0.118       0.260         1.409
3      20 kHz   0.540       0.552         1.526
4      20 kHz   0.448       0.465         1.160
..        ...     ...         ...           ...
310    30 kHz   6.056       0.955         2.029
311    30 kHz   4.115       1.097         1.398
312    30 kHz  11.055       1.816         1.838
313    30 kHz   4.360       1.183         1.162
314    30 kHz  10.596       0.940         1.715
[315 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 315 entries, 0 to 314
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   condition     315 non-null    object
 1   area          315 non-null    float64
 2   sphericity    315 non-null    float64
 3   aspect_ratio  315 non-null    float64
dtypes: float64(3), object(1)
memory usage: 10.0+ KB
I have no idea what is happening with my dataframe :(
Please advise me on how to solve this problem.
Thank you, everyone.
The first argument of seaborn.load_dataset() is the name of a dataset ({name}.csv on https://github.com/mwaskom/seaborn-data), not a pandas.DataFrame object. The return value of seaborn.load_dataset() is just a pandas.DataFrame, so you don't need to do
freq = sns.load_dataset(freq_data)
Moreover, you may want freq_data rather than iris in df = sns.pairplot(iris, hue="condition", height=2.5).
Here is the final example code
from io import StringIO
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
TESTDATA = StringIO("""condition;area;sphericity;aspect_ratio
20 kHz;0.249;0.287;1.376
20 kHz;0.954;0.721;1.421
20 kHz;0.118;0.260;1.409
20 kHz;0.540;0.552;1.526
20 kHz;0.448;0.465;1.160
30 kHz;6.056;0.955;2.029
30 kHz;4.115;1.097;1.398
30 kHz;11.055;1.816;1.838
30 kHz;4.360;1.183;1.162
30 kHz;10.596;0.940;1.715
""")
freq_data = pd.read_csv(TESTDATA, sep=";")
sns.pairplot(freq_data, hue="condition", height=2.5)
plt.show()

Finding greatest fall and rise in a dynamic rolling window based on index

Have a df of readings as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1000, size=100), index=range(100), columns = ['reading'])
Want to find the greatest rise and the greatest fall for each row based on its index, which theoretically may be achieved using the formula...
How can this be coded?
Tried:
df.assign(gr8Rise=df.rolling(df.index).apply(lambda x: x[-1]-x[0], raw=True).max())
...and failed with ValueError: window must be an integer
UPDATE: Based on @jezrael's dataset, the expected output for gr8Rise is as follows:
Use:
np.random.seed(2019)
df = pd.DataFrame(np.random.randint(100, size=10), index=range(10), columns=['reading'])
# for each row i, use a window of length i+1 and take the max of (first - last) over all such windows
df['gr8Rise'] = [df['reading'].rolling(x).apply(lambda x: x[0]-x[-1], raw=True).max()
                 for x in range(1, len(df)+1)]
df.loc[0, 'gr8Rise'] = np.nan  # a window of length 1 has no rise or fall
print(df)
   reading  gr8Rise
0       72      NaN
1       31     41.0
2       37     64.0
3       88     59.0
4       62     73.0
5       24     76.0
6       29     72.0
7       15     57.0
8       12     60.0
9       16     56.0
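For longer series, the same per-row values can be computed with plain NumPy shifts instead of rolling().apply(); a sketch building on the df above (gr8Rise_np is just a name chosen here for comparison):
import numpy as np
vals = df['reading'].to_numpy()
# row i uses windows of length i+1: take the max of (first - last) over all of them
df['gr8Rise_np'] = [np.nan] + [
    (vals[:len(vals) - i] - vals[i:]).max() for i in range(1, len(vals))
]
print(df)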

Create a heatmap from pandas dataframe

I have a pandas dataframe of the form:
colA | colB   | counts
car1 | plane1 | 23
car2 | plane2 | 51
car1 | plane2 | 12
car2 | plane3 | 41
I first want to create a pandas dataframe that looks a bit like a matrix (similar to the df in this example), also filling the missing values with 0. So the desired result for the above would be:
      plane1  plane2  plane3
car1      23      12       0
car2       0      51      41
And then be able to turn this into a heat map. Is there a pandas command I can use for this?
Use pandas.pivot_table to reshape the data and seaborn.heatmap to draw the heatmap:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# data from the question
df = pd.DataFrame({'colA': ['car1', 'car2', 'car1', 'car2'],
                   'colB': ['plane1', 'plane2', 'plane2', 'plane3'],
                   'counts': [23, 51, 12, 41]})
# pivot to a car-by-plane matrix, filling missing combinations with 0
piv = pd.pivot_table(df, index='colA', columns='colB', aggfunc='sum', fill_value=0)
piv.columns = piv.columns.droplevel(0)  # drop the extra 'counts' column level
sns.heatmap(piv)
plt.show()
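If every (colA, colB) pair occurs only once, a plain pivot works as well, and annot=True prints the counts inside the cells; a small variant, continuing from the code above:
# pivot directly on the counts column and annotate the cells
piv = df.pivot(index='colA', columns='colB', values='counts').fillna(0)
sns.heatmap(piv, annot=True, fmt='g')
plt.show()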

Fast conversion from String array to Pandas Dataframe

I have a string array where each element is one row of a csv file (comma separated). I want to convert this into a pandas DataFrame. When I tried building it row by row it was very slow. Can a faster alternative be proposed, apart from writelines() followed by pandas.read_csv()?
CSV Import
In pandas you can read an entire csv at once without looping over the lines.
Use read_csv, which normally takes the filename (or any file-like object) as its argument:
import pandas as pd
from io import StringIO
# Set up fake csv data as test for example only
fake_csv = '''
Col_0,Col_1,Col_2,Col_3
0,0.5,A,123
1,0.2,J,234
2,1.4,F,345
3,0.7,E,456
4,0.4,G,576
5,0.8,T,678
6,1.6,A,789
'''
# Read in whole csv to DataFrame at once
# StringIO is for example only
# Normally you would load your file with
# df = pd.read_csv('/path/to/your/file.csv')
df = pd.read_csv(StringIO(fake_csv))
print('DataFrame from CSV:')
print(df)
DataFrame from CSV:
   Col_0  Col_1 Col_2  Col_3
0      0    0.5     A    123
1      1    0.2     J    234
2      2    1.4     F    345
3      3    0.7     E    456
4      4    0.4     G    576
5      5    0.8     T    678
6      6    1.6     A    789
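Since the question starts from a list of row strings rather than a file, the same single read_csv call still applies after joining the rows once; a minimal sketch, assuming rows is that list and its first element is the header:
import pandas as pd
from io import StringIO
# hypothetical list of csv rows; in practice this is your string array
rows = ['Col_0,Col_1,Col_2,Col_3', '0,0.5,A,123', '1,0.2,J,234']
# join once and parse everything in a single call instead of looping row by row
df = pd.read_csv(StringIO('\n'.join(rows)))
print(df)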