argument must be a string or a number, not 'datetime.datetime', but I have a string (Pandas + Matplotlib)

I have a pandas dataframe
published | sentiment
2022-01-31 10:00:00 | 0
2021-12-29 00:30:00 | 5
2021-12-20 | -5
Since some rows don't have hours, minutes and seconds, I strip the time part from every row:
df_dominant_topic2['published']=df_dominant_topic2['published'].astype(str).str.slice(0, 10)
df_dominant_topic2['published']=df_dominant_topic2['published'].str.slice(0, 10)
I get:
published | sentiment
2022-01-31 | 0
2021-12-29 | 5
2021-12-20 | -5
If I plot the data:
plt.pyplot.plot_date(df['published'],df['sentiment'] )
I get this error:
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
But I don't know why since it should be a string.
How can I plot it (possibly keeping the temporal order)? Thank you

Try this:
import pandas as pd
from matplotlib import pyplot as plt
values=[('2022-01-31 10:00:00',0),('2021-12-29 00:30:00',5),('2021-12-20',-5)]
cols=['published','sentiment']
df_dominant_topic2 = pd.DataFrame.from_records(values, columns=cols)
# keep only the date part (first 10 characters, YYYY-MM-DD)
df_dominant_topic2['published'] = df_dominant_topic2['published'].astype(str).str.slice(0, 10)
# ISO-formatted date strings sort chronologically, so sorting keeps the temporal order
df_dominant_topic2.sort_values(by='published', ascending=True, inplace=True)
plt.plot(df_dominant_topic2['published'], df_dominant_topic2['sentiment'])
plt.show()
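If a real date axis is preferred over string labels, one option is to convert the trimmed strings to datetimes before plotting. The following is only a sketch under that assumption; the frame name df is a stand-in built from the question's sample rows.

```python
import pandas as pd

# Stand-in for the question's data (assumed sample rows).
df = pd.DataFrame(
    {
        "published": ["2022-01-31 10:00:00", "2021-12-29 00:30:00", "2021-12-20"],
        "sentiment": [0, 5, -5],
    }
)

# Trim every value to YYYY-MM-DD, then parse with an explicit format.
df["published"] = pd.to_datetime(df["published"].str.slice(0, 10), format="%Y-%m-%d")

# Sort chronologically so a line plot runs from oldest to newest.
df = df.sort_values("published").reset_index(drop=True)
```

The resulting columns can then be passed to plt.plot(df["published"], df["sentiment"]), which should place the points on a date-scaled axis.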

Related

Count how many non-zero entries at each month in a dataframe column

I have a dataframe, df, with a DatetimeIndex and a single column.
I need to count how many non-zero entries I have in each month. For example, in my data I would have 2 entries in January, 1 entry in February and 2 entries in March. I have more months in the dataframe, but I guess that explains the problem.
I tried using pandas groupby:
df.groupby(df.index.month).count()
But that just gives me the total number of days in each month, and I didn't see any other parameter in count() that I could use here.
Any ideas?
Try index.to_period()
For example:
In [1]: import pandas as pd
import numpy as np
x_df = pd.DataFrame(
{
'values': np.random.randint(low=0, high=2, size=(120,))
} ,
index = pd.date_range("2022-01-01", periods=120, freq="D")
)
In [2]: x_df
Out[2]:
values
2022-01-01 0
2022-01-02 0
2022-01-03 1
2022-01-04 0
2022-01-05 0
...
2022-04-26 1
2022-04-27 0
2022-04-28 0
2022-04-29 1
2022-04-30 1
[120 rows x 1 columns]
In [3]: x_df[x_df['values'] != 0].groupby(lambda x: x.to_period("M")).count()
Out[3]:
values
2022-01 17
2022-02 15
2022-03 16
2022-04 17
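You can also try this:
import numpy as np
# turn zeros into NaN so that they can be dropped
dfx['col1'] = dfx['col1'].replace(0, np.nan)
dfx = dfx.dropna()
# count the remaining (non-zero) rows in each month
dfx = dfx.resample('1M').count()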
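Another way to express the same idea, shown here only as a sketch, is to build a boolean "is non-zero" mask and sum it per month. The sample frame below is made up so that it matches the counts described in the question (2 in January, 1 in February, 2 in March).

```python
import pandas as pd

# Made-up sample data with a DatetimeIndex and a single column.
df = pd.DataFrame(
    {"values": [0, 3, 1, 0, 2, 0, 4, 5]},
    index=pd.to_datetime(
        [
            "2022-01-03", "2022-01-10", "2022-01-20",
            "2022-02-01", "2022-02-15",
            "2022-03-02", "2022-03-09", "2022-03-30",
        ]
    ),
)

# True where the entry is non-zero; summing booleans counts the True values.
nonzero_per_month = df["values"].ne(0).groupby(df.index.to_period("M")).sum()
print(nonzero_per_month)
```

Grouping by to_period("M") keeps the year in the key, so January 2022 and January 2023 would not be merged the way they are with df.index.month.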

Plt Nameerror (name 'xxx' not defined) - Jupyter Notebook, Python 3.8.3 [duplicate]

apple is a dataframe with the following structure:
apple
Date Open High Low Close Adj Close
0 2017-01-03 115.800003 116.330002 114.760002 116.150002 114.311760
1 2017-01-04 115.849998 116.510002 115.750000 116.019997 114.183815
2 2017-01-05 115.919998 116.860001 115.809998 116.610001 114.764473
3 2017-01-06 116.779999 118.160004 116.470001 117.910004 116.043915
4 2017-01-09 117.949997 119.430000 117.940002 118.989998 117.106812
5 2017-01-10 118.769997 119.379997 118.300003 119.110001 117.224907
6 2017-01-11 118.739998 119.930000 118.599998 119.750000 117.854782
7 2017-01-12 118.900002 119.300003 118.209999 119.250000 117.362694
8 2017-01-13 119.110001 119.620003 118.809998 119.040001 117.156021
9 2017-01-17 118.339996 120.239998 118.220001 120.000000 118.100822
Now I want to select two columns, Date and Close, and plot them with Date on the x axis and Close on the y axis. How can I do that?
import pandas as pd
import matplotlib.pyplot as plt
x=pd.DataFrame({'key':apple['Date'],'data':apple['Close']})
x.plot()
plt.show()
I got a graph, but the x axis is not the Date column!
A new DataFrame is not necessary. Plot apple directly and use the parameters x and y:
#if not datetime column first convert
#apple['Date'] = pd.to_datetime(apple['Date'])
apple.plot(x='Date', y='Close')
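DataFrame.plot() uses the index for the x axis unless x is given, which is why the original attempt did not show the dates. An alternative sketch is to make Date the index first; the three-row frame below is a hypothetical sample copied from the first rows shown in the question.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the question's `apple` frame.
apple = pd.DataFrame(
    {
        "Date": ["2017-01-03", "2017-01-04", "2017-01-05"],
        "Close": [116.150002, 116.019997, 116.610001],
    }
)

# Parse the strings so the axis is treated as dates rather than labels.
apple["Date"] = pd.to_datetime(apple["Date"])

# With Date as the index, the Series carries its own x values.
close_by_date = apple.set_index("Date")["Close"]
```

Calling close_by_date.plot() should then draw Close against Date without any extra arguments.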

Efficient Dataframe column (Object) to DateTime conversion

I'm attempting to create a new column that contains the data of the Date input column as a datetime. I'd also happily accept changing the datatype of the Date column, but I'm just as unsure how to do that.
I'm currently using DateTime = dd.to_datetime. I'm importing from a CSV and letting dask decide on data types.
I'm fairly new to this, so I've tried a few stackoverflow answers, but I'm just fumbling and getting more errors than answers.
My input date string is, for example:
2019-20-09 04:00
This is what I currently have,
import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
# Dataframes implement the Pandas API
import dask.dataframe as dd
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\State_Weathergrids.csv')
print(ddf.describe(include='all'))
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%y-%d-%m %H:%M')
The error I'm receiving is below. I'm assuming that the last line is the most relevant piece, but for the life of me I cannot work out why the date format given doesn't match the format I'm specifying.
TypeError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\pandas\core\tools\datetimes.py in _convert_listlike_datetimes(arg, box, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
290 try:
--> 291 values, tz = conversion.datetime_to_datetime64(arg)
292 return DatetimeIndex._simple_new(values, name=name, tz=tz)
pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()
TypeError: Unrecognized value type: <class 'str'>
During handling of the above exception, another exception occurred:
....
ValueError: time data '2019-20-09 04:00' does not match format '%y-%d-%m %H:%M' (match)
Data Frame current properties using describe:
Dask DataFrame Structure:
              Location    Date Temperature       RH
npartitions=1
               float64  object     float64  float64
                   ...     ...         ...      ...
Dask Name: describe, 971 tasks
Sample Data
+-----------+------------------+-------------+--------+
| Location | Date | Temperature | RH |
+-----------+------------------+-------------+--------+
| 1075 | 2019-20-09 04:00 | 6.8 | 99.3 |
| 1075 | 2019-20-09 05:00 | 6.4 | 100.0 |
| 1075 | 2019-20-09 06:00 | 6.7 | 99.3 |
| 1075 | 2019-20-09 07:00 | 8.6 | 95.4 |
| 1075 | 2019-20-09 08:00 | 12.2 | 76.0 |
+-----------+------------------+-------------+--------+
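Try this:
ddf['DateTime'] = dd.to_datetime(ddf['Date'], format='%Y-%d-%m %H:%M', errors='coerce')
The format needs %Y (four-digit year) rather than %y (two-digit year). errors='coerce' returns NaT wherever to_datetime fails to parse a value; errors='ignore' would return the input unchanged instead.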
For more detail visit https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html
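The effect of the format directive can be checked on a few values with plain pandas before running it through dask. This is a sketch only; the third value is an invented bad entry added to show what errors="coerce" does.

```python
import pandas as pd

# Two values from the question plus one invented unparseable entry.
dates = pd.Series(["2019-20-09 04:00", "2019-20-09 05:00", "not a date"])

# %Y matches the four-digit year; unparseable values become NaT with errors="coerce".
parsed = pd.to_datetime(dates, format="%Y-%d-%m %H:%M", errors="coerce")

# %y expects a two-digit year, so the same strings are rejected.
try:
    pd.to_datetime(dates.iloc[:2], format="%y-%d-%m %H:%M")
    two_digit_year_failed = False
except ValueError:
    two_digit_year_failed = True

print(parsed)
print("two-digit %y format failed:", two_digit_year_failed)
```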

How to change datetime to numeric discarding 0s at end [duplicate]

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_timestamp. I am trying to figure out how to calculate people's ages from the time difference between 'entry_date' and 'dob'. To do this I need the difference in days between the two columns, so that I can then do something like round(days/365.25). I cannot find a way to do this with a vectorized operation. When I do munged_data.entry_date-munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However i do not seem to be able to extract the days as an integer so that i can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
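Applied to the question's use case, the .dt.days accessor can feed the age calculation directly. The sketch below assumes the column names from the question (entry_date, dob) and uses made-up dates.

```python
import pandas as pd

# Made-up rows using the column names from the question.
munged_data = pd.DataFrame(
    {
        "dob": pd.to_datetime(["2001-01-01", "2004-06-01"]),
        "entry_date": pd.to_datetime(["2013-04-19", "2013-04-19"]),
    }
)

# Vectorized difference in whole days between the two columns.
days = (munged_data["entry_date"] - munged_data["dob"]).dt.days

# Approximate age in years, as suggested in the question.
age_years = (days / 365.25).round().astype(int)
print(age_years)
```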
You need pandas 0.11 for this (0.11rc1 is out; the final release is probably next week).
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (similar to how Timestamps are now used for datetime64[ns]); that is coming in 0.12.
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method, where X is the target unit:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
Let's say that you have a pandas Series named time_difference whose values are of type
numpy.timedelta64[ns]
One way of extracting just the days (or whatever attribute you need) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
The conversion to pd.Timedelta is used because a numpy.timedelta64 object does not have a 'days' attribute.
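To convert a duration into whole days, just use pd.Timedelta(...).days. Use a fixed-length unit such as hours; the 'Y' unit is ambiguous and is not accepted by current pandas versions:
pd.Timedelta(48, unit='h').days
2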
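Note that .days returns only the whole-day component. If fractional days are wanted, dividing by a one-day Timedelta is one option. A small sketch with invented durations:

```python
import pandas as pd

# Invented durations for illustration.
durations = pd.Series([pd.Timedelta(hours=36), pd.Timedelta(days=2)])

# Whole-day component only (36 hours gives 1).
whole_days = durations.dt.days

# Dividing by a one-day Timedelta gives a float number of days (36 hours gives 1.5).
fractional_days = durations / pd.Timedelta(days=1)

print(whole_days.tolist(), fractional_days.tolist())
```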

KeyError after resampling

I have two dataframes: df, indexed by datetime, and df2, which has a column Date (a Series).
Before resampling I can run:
>>> df[df2['Date'][0]]
and obtain all rows corresponding to the day df2['Date'][0], which is 2013-08-07 in this example. However, after resampling by day I can no longer obtain the row for that day:
>>> df.resample('D', how=np.max)[df2['Date'][0]]
KeyError: u'no item named 2013-08-07'
although that day is in the dataset:
>>> df.resample('D', how=np.max).head()
| Temp | etc
Date | |
---------------------------
2013-08-07 | 26.1 |
---------------------------
2013-08-08 | 28.2 |
---------------------------
etc
I am not sure whether this is a bug or intended behaviour, or, if it is intended, why. But you can do this to get the desired result:
In [168]:
df1=pd.DataFrame(np.random.random(100), columns=['Temp'])
df1.index=pd.date_range('2013-08-07',periods=100,freq='5H')
df1.index.name='Date'
In [169]:
df2=pd.DataFrame(pd.date_range('2013-08-07',periods=23, freq='D'), columns=['Date'])
In [170]:
# You can do this (current pandas; older versions used resample('D', how=np.max))
df3 = df1.resample('D').max()
print(df3[df3.index == df2['Date'][0]])
Temp
Date
2013-08-07 0.8128
[1 rows x 1 columns]
In [171]:
df3[df2['Date'][0]]
#Error
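Row lookups by index label can also be written with .loc, which selects rows by label. The sketch below rebuilds frames shaped like the ones above, but with deterministic values instead of random ones so the result is predictable.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for df1: one reading every 5 hours.
df1 = pd.DataFrame(
    {"Temp": np.arange(100, dtype=float)},
    index=pd.date_range("2013-08-07", periods=100, freq=pd.Timedelta(hours=5)),
)
df2 = pd.DataFrame({"Date": pd.date_range("2013-08-07", periods=23, freq="D")})

# Daily maximum, then select the row for the first date in df2 by label.
daily_max = df1.resample("D").max()
day = df2["Date"].iloc[0]
row = daily_max.loc[day]
print(row)
```

Here row is a Series holding the resampled values for 2013-08-07.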