How do you deal with datetime obj when applying ANN models? I have thought of writing function which iterates through the column but there has to be a cleaner way to do so, right?
dataset.info()
#  #   Column      Non-Null Count  Dtype
# ---  ------      --------------  -----
#  0   Unnamed: 0  299 non-null    int64
#  1   ZIP         299 non-null    int64
#  2   START_TIME  299 non-null    datetime64[ns]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
This raises:
TypeError: float() argument must be a string or a number, not 'Timestamp'
With another attempt, in relation to a scatter plot:
TypeError: float() argument must be a string or a number, not 'datetime.time'
and:
ValueError: could not convert string to float: '2022-03-16 11:55:00'
I would suggest doing the following steps:
converting string to datetime.datetime objects
from datetime import datetime
t = datetime.strptime("2022-03-16 11:55:00","%Y-%m-%d %H:%M:%S")
Then extract the necessary components to pass as inputs to the network:
x1,x2,x3 = t.month, t.hour, t.minute
As an aside, I noticed you are directly scaling the time components. Instead, apply pre-processing suited to the problem: for example, extract the sine and cosine of the time components rather than using or scaling them directly. Sine and cosine encodings preserve the cyclic distance between time points (e.g. 23:00 stays close to 00:00).
import numpy as np
# map the hour onto the unit circle (period = 24 hours);
# passing the raw hour to cos/sin would treat it as radians
hour_cos = np.cos(2 * np.pi * t.hour / 24)
hour_sin = np.sin(2 * np.pi * t.hour / 24)
Extract other periodic components as necessary for the problem. For example, if you are looking at a weather variable, sine and cosine of hour and month are typically useful; if you are looking at sales, sine and cosine of day of month, month, and day of week are useful.
Update: from the comments I noticed you mentioned that you are predicting decibel levels. Assuming you are already factoring in spatial input variables, you should definitely try something like a sine/cosine transformation, provided the events generating the sounds exhibit a periodic pattern. Again, this is an assumption and might not be completely true.
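To illustrate the distance-preserving property, here is a small self-contained sketch (the 24-hour and 12-month periods are assumptions; adjust them to your data's actual periodicity):

```python
import numpy as np
from datetime import datetime

def cyclical_features(t: datetime) -> dict:
    # Map each periodic component onto the unit circle so that, e.g.,
    # 23:00 and 00:00 end up near each other, unlike the raw hour values.
    return {
        "hour_sin": np.sin(2 * np.pi * t.hour / 24),
        "hour_cos": np.cos(2 * np.pi * t.hour / 24),
        "month_sin": np.sin(2 * np.pi * (t.month - 1) / 12),
        "month_cos": np.cos(2 * np.pi * (t.month - 1) / 12),
    }

feats = cyclical_features(datetime(2022, 3, 16, 11, 55))
```

The four resulting floats can be fed to the network directly; they are already bounded in [-1, 1], so no further scaling is needed.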
dataset['START_TIME'] = pd.to_datetime(dataset['START_TIME']).apply(lambda x: x.value)
Seems like a clean way of doing so, but I'm still open to alternatives.
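For reference, the same nanoseconds-since-epoch conversion can be done as a single vectorised cast, without the per-element lambda (sketched on a hypothetical two-row frame):

```python
import pandas as pd

# hypothetical frame standing in for the real dataset
df = pd.DataFrame({"START_TIME": ["2022-03-16 11:55:00", "2022-03-16 12:00:00"]})

# .astype("int64") yields the same nanoseconds-since-epoch as Timestamp.value,
# but as one vectorised cast instead of a Python-level apply
df["START_TIME"] = pd.to_datetime(df["START_TIME"]).astype("int64")
```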
I am hoping to plot a graph representing monthly temperature from 1981-01-01 to 2016-12-31.
I would like the months "Jan Feb Mar Apr May...Dec" on the x-axis and the temperature record as the y-axis as my plan is to compare monthly temperature record of 1981 - 1999 with 2000 - 2016.
I have read in the data no problem.
temp1 = xr.open_dataarray('temp1981-1999.nc')
temp2 = xr.open_dataarray('temp2000-2016.nc')
and have got rid of the lat and lon dimensions
temp1mean = temp1.mean(dim=['latitude','longitude'])
temp2mean = temp2.mean(dim=['latitude','longitude'])
I tried to convert it into a dataframe to allow me to carry out the next steps, such as averaging the months using groupby:
temp1.cftime_range(start=None, end=None, periods=None, freq='M', normalize=False, name=None, closed=None, calendar='standard')
                   t2m
time
1981-01-01  276.033295
1981-02-01  278.882935
1981-03-01  282.905579
1981-04-01  289.908936
1981-05-01  294.862457
...                ...
1999-08-01  295.841553
1999-09-01  294.598053
1999-10-01  289.514771
1999-11-01  283.360687
1999-12-01  278.854431
monthly = temp1mean.groupby(temp1mean.index.month).mean()
However I got the following error.
"'DataArray' object has no attribute 'index'"
Therefore, I am wondering if there's any way to group all the monthly means and create such a graph.
In addition to the main question, I would greatly appreciate it if you could also suggest ways to convert the unit kelvin into Celsius when plotting the graph.
As I have tried the command
celsius = temp1mean.attrs['units'] = 'kelvin'
but the output is merely
"'air_temperature"
I greatly appreciate any suggestions you may have for plotting this graph! Thank you so much, and if you need any further information please do not hesitate to ask; I will reply as soon as possible.
Computing monthly means
The xarray docs have a helpful section on using the datetime accessor on any datetime dimensions:
Similar to pandas, the components of datetime objects contained in a given DataArray can be quickly computed using a special .dt accessor.
...
The .dt accessor works on both coordinate dimensions as well as multi-dimensional data.
xarray also supports a notion of “virtual” or “derived” coordinates for datetime components implemented by pandas, including “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday” and “quarter”
In your case, you need to use the name of the datetime coordinate (whatever it is named) along with the .dt.month reference in your groupby. If your datetime coordinate is named "time", the groupby operation would be:
monthly_means = temp1mean.groupby(temp1mean.time.dt.month).mean()
or, using the string shorthand:
monthly_means = temp1mean.groupby('time.month').mean()
Units in xarray
As for units, you should definitely know that xarray does not interpret/use attributes or metadata in any way, with the exception of plotting and display.
The following assignment:
temp1mean.attrs['units'] = 'kelvin'
simply assigns the string "kelvin" to the user-defined attribute "units" - nothing else. This may show up as the data's units in plots, but that doesn't mean the data isn't in Fahrenheit or dollars or m/s. It's just a string you put there.
If the data is in fact in kelvin, the best way to convert it to Celsius that I know of is temp1mean - 273.15 :)
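A minimal sketch of that conversion (subtraction is element-wise, so the same line works on an xarray DataArray, which additionally keeps its coordinates intact):

```python
import numpy as np

temp_kelvin = np.array([276.03, 289.91, 295.84])  # sample monthly means in K
temp_celsius = temp_kelvin - 273.15  # element-wise; same idiom for a DataArray
```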
If you do want to work with units explicitly, check out the pint-xarray extension project. It's currently in early stages and is experimental, but it does what I think you're looking for.
Is there a preferred way to keep the data type of a numpy array fixed as int (or int64 or whatever), while still having an element inside listed as numpy.NaN?
In particular, I am converting an in-house data structure to a Pandas DataFrame. In our structure, we have integer-type columns that still have NaN's (but the dtype of the column is int). It seems to recast everything as a float if we make this a DataFrame, but we'd really like to be int.
Thoughts?
Things tried:
I tried using the from_records() function under pandas.DataFrame, with coerce_float=False and this did not help. I also tried using NumPy masked arrays, with NaN fill_value, which also did not work. All of these caused the column data type to become a float.
NaN can't be stored in an integer array. This is a known limitation of pandas at the moment; I have been waiting for progress to be made with NA values in NumPy (similar to NAs in R), but it will be at least 6 months to a year before NumPy gets these features, it seems:
http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na
This capability has been added to pandas beginning with version 0.24. Note that it requires the use of the extension dtype Int64 (capitalized), rather than the default dtype int64 (lowercase):
https://pandas.pydata.org/pandas-docs/version/0.24/whatsnew/v0.24.0.html#optional-integer-na-support
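A quick illustration of the capitalized dtype, using plain pandas:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype="Int64")  # capital "I": nullable integer
# the missing entry is stored as <NA> while the dtype stays integer
```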
If performance is not the main issue, you can store strings instead.
df['col'] = df['col'].dropna().apply(lambda x: str(int(x)))
Then you can mix them with NaN as much as you want. If you really want to have integers, depending on your application, you can use -1, or 0, or 1234567890, or some other dedicated sentinel value to represent NaN.
You can also temporarily duplicate the columns: one as you have, with floats; the other one experimental, with ints or strings. Then inserts asserts in every reasonable place checking that the two are in sync. After enough testing you can let go of the floats.
In case you are trying to convert a float vector (e.g. 1.143) to integer (1), and that vector has NAs, converting it to the new 'Int64' dtype will give you an error. To solve this, round the numbers first and then do .astype('Int64'):
import numpy as np
import pandas as pd

s1 = pd.Series([1.434, 2.343, np.nan])
# without round() the next line raises:
# TypeError: cannot safely cast non-equivalent float64 to int64
s1.astype('Int64')
# with round() it works
s1.round().astype('Int64')
0 1
1 2
2 NaN
dtype: Int64
My use case was a float series that I wanted to round to int, but .round() alone still returns a float with decimals; you need the cast to int to actually remove them.
This is not a solution for all cases, but in mine (genomic coordinates) I've resorted to using 0 as NaN:
a3['MapInfo'] = a3['MapInfo'].fillna(0).astype(int)
This at least allows the proper 'native' column type to be used; operations like subtraction and comparison work as expected.
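A sketch of that sentinel approach on a made-up frame (0 must be safe to treat as "missing" in your domain, as it is for these coordinates):

```python
import numpy as np
import pandas as pd

# hypothetical frame standing in for a3
a3 = pd.DataFrame({"MapInfo": [12345.0, np.nan, 67890.0]})
a3["MapInfo"] = a3["MapInfo"].fillna(0).astype(int)
```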
Pandas v0.24+
Functionality to support NaN in integer series will be available in v0.24 upwards. There's information on this in the v0.24 "What's New" section, and more details under Nullable Integer Data Type.
Pandas v0.23 and earlier
In general, it's best to work with float series where possible, even when the series is upcast from int to float due to inclusion of NaN values. This enables vectorised NumPy-based calculations where, otherwise, Python-level loops would be processed.
The docs do suggest: "One possibility is to use dtype=object arrays instead." For example:
s = pd.Series([1, 2, 3, np.nan])
print(s.astype(object))
0 1
1 2
2 3
3 NaN
dtype: object
For cosmetic reasons, e.g. output to a file, this may be preferable.
Pandas v0.23 and earlier: background
NaN is considered a float. The docs currently (as of v0.23) specify the reason why integer series are upcast to float:
In the absence of high performance NA support being built into NumPy
from the ground up, the primary casualty is the ability to represent
NAs in integer arrays.
This trade-off is made largely for memory and performance reasons, and
also so that the resulting Series continues to be “numeric”.
The docs also provide rules for upcasting due to NaN inclusion:
Typeclass  Promotion dtype for storing NAs
floating   no change
object     no change
integer    cast to float64
boolean    cast to object
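The integer-to-float64 promotion from the table can be seen directly; introducing a single missing value is enough to trigger it:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])             # dtype: int64
s_with_na = s.reindex([0, 1, 2, 3])  # introduces a NaN, so pandas upcasts
```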
New for Pandas v1.00 +
With the nullable integer dtype you no longer use numpy.nan as the missing value.
Now you have pandas.NA.
Please read: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
IntegerArray is currently experimental. Its API or implementation may
change without warning.
Changed in version 1.0.0: Now uses pandas.NA as the missing value
rather than numpy.nan.
In Working with missing data, we saw that pandas primarily uses NaN to
represent missing data. Because NaN is a float, this forces an array
of integers with any missing values to become floating point. In some
cases, this may not matter much. But if your integer column is, say,
an identifier, casting to float can be problematic. Some integers
cannot even be represented as floating point numbers.
If there are blanks in the text data, columns that would normally be integers will be cast to float64, because the int64 dtype cannot handle nulls. This can cause an inconsistent schema if you are loading multiple files, some with blanks (which will end up as float64) and others without (which will end up as int64).
This code will attempt to convert any numeric columns to Int64 (as opposed to int64), since Int64 can handle nulls:
import pandas as pd
import numpy as np

# show datatypes before transformation
mydf.dtypes

for c in mydf.select_dtypes(np.number).columns:
    try:
        mydf[c] = mydf[c].astype('Int64')
        print('casted {} as Int64'.format(c))
    except (TypeError, ValueError):
        print('could not cast {} to Int64'.format(c))

# show datatypes after transformation
mydf.dtypes
This is now possible, since pandas v0.24.0.
pandas 0.24.x release notes
Quote: "Pandas has gained the ability to hold integer dtypes with missing values."
I know that OP has asked for NumPy or Pandas only, but I think it is worth mentioning polars as an alternative that supports the requested feature.
In Polars any missing values in an integer column are simply null values and the column remains an integer column.
See Polars - User Guide > Coming from Pandas for more info.
I am quite a newbie to time series analysis and this might be a stupid question.
I am trying to generate the trend, seasonal, and residual time series components; however, my timestamp index actually consists of strings (let's say 'window1', 'window2', 'window3'). Now, when I try to apply seasonal_decompose(data, model='multiplicative'), it returns the error "'Index' object has no attribute 'inferred_freq'", which is pretty understandable.
However, how do I work around this issue while keeping strings as the time series index?
Basically, here you need to specify the freq parameter.
Suppose you have following dataset
s = pd.Series([102,200,322,420], index=['window1', 'window2', 'window3','window4'])
s
>>>window1 102
window2 200
window3 322
window4 420
dtype: int64
Now specify the freq parameter; in this case I used freq=1:
import matplotlib.pyplot as plt
import statsmodels.api as sm

plt.style.use('default')
plt.figure(figsize=(16, 8))
sm.tsa.seasonal_decompose(s.values, freq=1).plot()
result = sm.tsa.stattools.adfuller(s, maxlag=1)
plt.show()
I am not allowed to post an image, but I hope this code solves your problem. Also, the default maxlag gave an error for my dataset, therefore I used maxlag=1. If you are not sure about its value, use the default value for maxlag.
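An alternative to passing freq directly is to give the data a synthetic, evenly spaced DatetimeIndex derived from the window order, keeping the original string labels in a column. This is a pure-pandas sketch; the daily frequency is an arbitrary assumption, since only the even spacing matters:

```python
import pandas as pd

s = pd.Series([102, 200, 322, 420],
              index=["window1", "window2", "window3", "window4"])

# move the string labels into a column, then attach synthetic timestamps
ts = s.reset_index().rename(columns={"index": "window", 0: "value"})
ts.index = pd.date_range("2000-01-01", periods=len(ts), freq="D")
```

With this index, inferred_freq resolves to a real frequency, so frequency-aware routines such as seasonal_decompose can run, while the window names remain available for labelling plots.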
I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
ATG12 Norm  ATG5 Norm  ATG7 Norm  Cancer Stage
      5.55       4.99       8.99  IIA
      4.87       5.77       8.88  IIA
      5.98       7.88       8.34  IIC
The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import scipy
import scikits.bootstrap as bootstrap
CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are your low and high confidence intervals
I tried to apply this method to each subset of data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]:  # PROBLEM!!
        Series = i[p]
        print p
        print Series.mean()
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)
Which produced an error message
AttributeError: 'tuple' object has no attribute 'columns'
Not knowing what "tuples" are, I have some reading to do but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to easily plot from. I'm new to Pandas so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.
The way you iterate over the groupby-object is wrong! When you use groupby(), your data frame is sliced along the values in your groupby-column(s), together with these values as group names, forming a so-called "tuple":
(name, dataforgroup). The correct recipe for iterating over groupby-objects is
for name, group in data.groupby('Cancer Stage'):
    print name
    for p in group.columns[0:3]:
        ...
Please read more about the groupby-functionality of pandas here and go through the python-reference in order to understand what tuples are!
Grouping data frames and applying a function is essentially done in one statement, using the apply-functionality of pandas:
cols = data.columns[0:2]
for col in cols:
    print data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x, statfunction=scipy.mean))
does everything you need in one statement per column, and produces a (nicely plottable) series for you.
EDIT:
I toyed around with a data frame object I created myself:
df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})
for col in ['A', 'C']:
    print df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]
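If scikits.bootstrap isn't available, a plain-NumPy percentile bootstrap does the same job. Note this is a substitute, not the identical algorithm: bootstrap.ci defaults to BCa intervals, while the sketch below uses the simpler percentile method, and the function name and defaults here are my own:

```python
import numpy as np

def bootstrap_ci(data, statfunction=np.mean, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample with replacement, take quantiles."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # statfunction must accept an axis argument (np.mean, np.median, ... do)
    samples = rng.choice(data, size=(n_boot, data.size), replace=True)
    stats = statfunction(samples, axis=1)
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_ci([5.55, 4.87, 5.98, 4.99, 5.77, 7.88])
```

The returned pair can be passed straight to matplotlib's yerr (as distances from the mean) when drawing the bar graph.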
How to convert a variable from Python's datetime.timedelta to numpy.timedelta64?
import numpy as np
np.array([datetime.timedelta(1)], dtype="timedelta64[ms]")[0]
This link explains many things about datetime64 and timedelta64.
This is also relevant for converting datetime.datetime to datetime64
You can do this without creating an np.array by mapping the fundamental integer representations in datetime.timedelta (days, seconds, and microseconds) to corresponding np.timedelta64 representations, and then summing.
The downside of this approach is that, while you will get the same delta duration, you will not always get the same units. The upside of this approach is that, if you are converting single values rather than large arrays of values, it will generally be faster than creating an array.
You can also just call np.timedelta64() with a datetime.timedelta, but that approach only returns a np.timedelta64() with microsecond units.
import datetime
import operator
from functools import reduce

import numpy as np

TIME_DELTA_ATTR_MAP = (
    ('days', 'D'),
    ('seconds', 's'),
    ('microseconds', 'us'),
)

def to_timedelta64(value: datetime.timedelta) -> np.timedelta64:
    # Sum all three components unconditionally: filtering out non-positive
    # ones would break timedelta(0) (reduce over an empty sequence raises
    # TypeError) and negative deltas, where days can be negative while
    # seconds and microseconds stay non-negative.
    return reduce(operator.add,
                  (np.timedelta64(getattr(value, attr), code)
                   for attr, code in TIME_DELTA_ATTR_MAP))
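A quick sanity check of the component-wise idea, written as a self-contained block rather than reusing the function above: the mixed-unit sum should equal NumPy's direct conversion, which always lands in microsecond units.

```python
import datetime

import numpy as np

td = datetime.timedelta(days=1, seconds=30, microseconds=500)

# component-wise sum in mixed units...
parts = (np.timedelta64(td.days, "D")
         + np.timedelta64(td.seconds, "s")
         + np.timedelta64(td.microseconds, "us"))

# ...equals the direct conversion, which is always in microseconds
direct = np.timedelta64(td)
```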