Grouping xarray daily data into monthly means - matplotlib

I am hoping to plot a graph representing monthly temperature from 1981-01-01 to 2016-12-31.
I would like the months "Jan Feb Mar Apr May ... Dec" on the x-axis and the temperature record on the y-axis, as my plan is to compare the monthly temperature record of 1981-1999 with that of 2000-2016.
I have read in the data with no problem:
temp1 = xr.open_dataarray('temp1981-1999.nc')
temp2 = xr.open_dataarray('temp2000-2016.nc')
and have got rid of the lat and lon dimensions:
temp1mean = temp1.mean(dim=['latitude','longitude'])
temp2mean = temp2.mean(dim=['latitude','longitude'])
I then tried to convert it into a dataframe so I could carry on to the next step, such as averaging the months using groupby:
temp1.cftime_range(start=None, end=None, periods=None, freq='M', normalize=False, name=None, closed=None, calendar='standard')
t2m
time
1981-01-01    276.033295
1981-02-01    278.882935
1981-03-01    282.905579
1981-04-01    289.908936
1981-05-01    294.862457
...                  ...
1999-08-01    295.841553
1999-09-01    294.598053
1999-10-01    289.514771
1999-11-01    283.360687
1999-12-01    278.854431
monthly = temp1mean.groupby(temp1mean.index.month).mean()
However, I got the following error:
"'DataArray' object has no attribute 'index'"
Therefore, I am wondering if there is any way to group all the monthly means and create a graph as described above.
In addition to the main question, I would greatly appreciate it if you could also suggest how to convert the unit from kelvin to Celsius when plotting the graph, as I have tried the command
celsius = temp1mean.attrs['units'] = 'kelvin'
but the output is merely
"'air_temperature"
I greatly appreciate any suggestions you may have for plotting this graph! Thank you so much, and if you need any further information please do not hesitate to ask; I will reply as soon as possible.

Computing monthly means
The xarray docs have a helpful section on using the datetime accessor on any datetime dimensions:
Similar to pandas, the components of datetime objects contained in a given DataArray can be quickly computed using a special .dt accessor.
...
The .dt accessor works on both coordinate dimensions as well as multi-dimensional data.
xarray also supports a notion of "virtual" or "derived" coordinates for datetime components implemented by pandas, including "year", "month", "day", "hour", "minute", "second", "dayofyear", "week", "dayofweek", "weekday" and "quarter".
In your case, you need to use the name of the datetime coordinate (whatever it is named) along with the .dt.month reference in your groupby. If your datetime coordinate is named "time", the groupby operation would be:
monthly_means = temp1mean.groupby(temp1mean.time.dt.month).mean()
or, using the string shorthand:
monthly_means = temp1mean.groupby('time.month').mean()
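Putting the pieces together, a minimal end-to-end sketch of the comparison plot you describe might look like this (the file names, dimension names, and the "time" coordinate name are taken from the question; the plotting details are just one way to do it):

import xarray as xr
import matplotlib.pyplot as plt

temp1 = xr.open_dataarray('temp1981-1999.nc')
temp2 = xr.open_dataarray('temp2000-2016.nc')

# collapse the spatial dimensions, then average all Januaries together,
# all Februaries together, and so on
monthly1 = temp1.mean(dim=['latitude', 'longitude']).groupby('time.month').mean()
monthly2 = temp2.mean(dim=['latitude', 'longitude']).groupby('time.month').mean()

fig, ax = plt.subplots()
monthly1.plot(ax=ax, label='1981-1999')
monthly2.plot(ax=ax, label='2000-2016')
ax.set_xticks(range(1, 13))
ax.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                    'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
ax.legend()
plt.show()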
Units in xarray
As for units, you should definitely know that xarray does not interpret/use attributes or metadata in any way, with the exception of plotting and display.
The following assignment:
temp1mean.attrs['units'] = 'kelvin'
simply assigns the string "kelvin" to the user-defined attribute "units" - nothing else. This may show up as the data's units in plots, but that doesn't mean the data isn't in Fahrenheit or dollars or m/s. It's just a string you put there.
If the data is in fact in kelvin, the best way to convert it to Celsius that I know of is temp1mean - 273.15 :)
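For example (a two-line sketch, assuming the data really is in kelvin; note that arithmetic returns a new DataArray and drops attrs, so re-set the units string if you want it to show up on plot labels):

temp1mean_c = temp1mean - 273.15
temp1mean_c.attrs['units'] = 'degC'  # purely a label; xarray does not convert anything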
If you do want to work with units explicitly, check out the pint-xarray extension project. It's currently in early stages and is experimental, but it does what I think you're looking for.

Related

How do you deal with datetime obj when applying ANN models?

I have thought of writing a function which iterates through the column, but there has to be a cleaner way to do so, right?
dataset.info()

 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   Unnamed: 0   299 non-null    int64
 1   ZIP          299 non-null    int64
 2   START_TIME   299 non-null    datetime64[ns]
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = sc.fit_transform(x)
TypeError: float() argument must be a string or a number, not 'Timestamp'
Other attempts ran into similar errors:
TypeError: float() argument must be a string or a number, not 'datetime.time' (in relation to a scatter plot)
ValueError: could not convert string to float: '2022-03-16 11:55:00'
I would suggest the following steps:
First, convert the strings to datetime.datetime objects:
from datetime import datetime
t = datetime.strptime("2022-03-16 11:55:00","%Y-%m-%d %H:%M:%S")
Then extract the necessary components to pass as inputs to the network:
x1,x2,x3 = t.month, t.hour, t.minute
As an aside, I noticed you are directly scaling the time components. Instead, consider pre-processing suited to the problem: for example, extracting sine and cosine transformations of the time components rather than using or scaling them directly. Sine and cosine components preserve the cyclical distance between time points (e.g. hour 23 is close to hour 0).
import numpy as np

# scale by 2*pi/period so the values wrap around the unit circle once per cycle
hour_cos = np.cos(2 * np.pi * t.hour / 24)
hour_sin = np.sin(2 * np.pi * t.hour / 24)
# extract other periodic components as necessary for the problem
For example, if you are looking at a weather variable, sine and cosine of hour and month are typically useful; if you are looking at sales, sine and cosine of day of month, month, and day of week are useful.
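Applied to the START_TIME column from the question, a sketch could look like this (the DataFrame below is a made-up stand-in; only the column name comes from the question):

import numpy as np
import pandas as pd

df = pd.DataFrame({'START_TIME': pd.to_datetime(['2022-03-16 11:55:00',
                                                 '2022-03-16 23:05:00'])})

t = df['START_TIME'].dt
df['hour_sin'] = np.sin(2 * np.pi * t.hour / 24)
df['hour_cos'] = np.cos(2 * np.pi * t.hour / 24)
df['month_sin'] = np.sin(2 * np.pi * (t.month - 1) / 12)
df['month_cos'] = np.cos(2 * np.pi * (t.month - 1) / 12)

# these numeric columns can now be fed to StandardScaler / the network
# in place of the raw datetime column
features = df.drop(columns=['START_TIME'])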
Update: from the comments I noticed you mentioned that you are predicting decibel levels. Assuming you are already factoring in spatial input variables, you should definitely try something like a sine/cosine transformation, provided the events generating the sounds exhibit a periodic pattern. Again, this is an assumption and might not be completely true.
dataset['START_TIME'] = pd.to_datetime(dataset['START_TIME']).apply(lambda x: x.value)
Seems like a clean way of doing so, but I'm still open to alternatives.

Issues with Decomposing Trend, Seasonal, and Residual Time Series Elements

I am quite a newbie to Time Series Analysis and this might be a stupid question.
I am trying to generate the trend, seasonal, and residual time series elements; however, my timestamp index actually consists of strings (let's say 'window1', 'window2', 'window3'). Now, when I try to apply seasonal_decompose(data, model='multiplicative'), it returns the error 'Index' object has no attribute 'inferred_freq', which is pretty understandable.
However, how do I get around this issue while keeping the strings as the time series index?
Basically, you need to specify the freq parameter here.
Suppose you have the following dataset:
s = pd.Series([102, 200, 322, 420],
              index=['window1', 'window2', 'window3', 'window4'])
s
>>> window1    102
    window2    200
    window3    322
    window4    420
    dtype: int64
Now specify the freq parameter; in this case I used freq=1:
import statsmodels.api as sm
import matplotlib.pyplot as plt

plt.style.use('default')
plt.figure(figsize=(16, 8))
sm.tsa.seasonal_decompose(s.values, freq=1).plot()
# unrelated to the decomposition plot: augmented Dickey-Fuller stationarity test
result = sm.tsa.stattools.adfuller(s, maxlag=1)
plt.show()
I am not allowed to post an image, but I hope this code will solve your problem. Also, the default maxlag gave an error for my dataset, therefore I used maxlag=1. If you are not sure about its value, use the default maxlag.
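One version note: newer statsmodels releases renamed the freq argument of seasonal_decompose to period, so on a recent install the equivalent call would be (a sketch along the same lines, same data as above):

import statsmodels.api as sm
sm.tsa.seasonal_decompose(s.values, period=1).plot()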

Is there a way of using Pandas or Matplotlib to plot Pandas Time Series density?

I am having a hard time plotting the density of a Pandas time series.
I have a data frame with perfectly organised timestamps, like below:
It's a web log, and I want to show the density of the timestamps, which indicates how many visitors there were in a certain period of time.
My solution at the moment is extracting the year, month, week and day of each timestamp, and grouping by them, like below:
But I don't think that is an efficient way of dealing with time, and I couldn't find any good info on this; most of what I found is about plotting pre-calculated values against dates.
So, anybody have any suggestions on how to plot Pandas time series?
Much appreciated!
The best way to compute the values you want to plot is to use Series.resample; for example, to aggregate the count of dates daily, use this:
ser = pd.Series(1, index=dates)
ser.resample('D').sum()
The documentation there has more details depending on exactly how you want to resample & aggregate the data.
If you want to plot the result, you can use Pandas built-in plotting capabilities; for example:
ser.resample('D').sum().plot()
More info on plotting is here.
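For a self-contained example, here is a sketch with synthetic timestamps standing in for the web-log data (everything below is illustrative, not from the question's data):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# fake visit timestamps spread over ~90 days
rng = np.random.default_rng(0)
offsets = pd.to_timedelta(rng.integers(0, 90 * 24 * 3600, size=5000), unit='s')
dates = (pd.Timestamp('2015-01-01') + offsets).sort_values()

ser = pd.Series(1, index=dates)
ser.resample('D').sum().plot()   # visits per day
plt.ylabel('visits per day')
plt.show()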

Pandas, compute many means with bootstrap confidence intervals for plotting

I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:
ATG12 Norm  ATG5 Norm  ATG7 Norm  Cancer Stage
      5.55       4.99       8.99           IIA
      4.87       5.77       8.88           IIA
      5.98       7.88       8.34           IIC
The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:
df.groupby('Cancer Stage')['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm'].mean()
But I need to compute bootstrap confidence intervals to use as error bars for each of those means using the approach described here: http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/
It boils down to:
import scipy
import scikits.bootstrap as bootstrap
CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are your low and high confidence intervals
I tried to apply this method to each subset of data with a nested-loop script:
for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]:  # PROBLEM!!
        Series = i[p]
        print p
        print Series.mean()
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)
which produced the error message:
AttributeError: 'tuple' object has no attribute 'columns'
Not knowing what "tuples" are, I have some reading to do but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to easily plot from. I'm new to Pandas so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.
The way you iterate over the groupby object is wrong! When you use groupby(), your data frame is sliced along the values in your groupby column(s), and each slice is yielded together with its value as the group name, forming a so-called tuple: (name, dataforgroup). The correct recipe for iterating over groupby objects is:
for name, group in data.groupby('Cancer Stage'):
    print name
    for p in group.columns[0:3]:
        ...
Please read more about the groupby functionality of pandas here, and go through the Python reference in order to understand what tuples are!
Grouping data frames and applying a function is essentially done in one statement, using the apply-functionality of pandas:
cols = data.columns[0:2]
for col in cols:
    print data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x, statfunction=scipy.mean))
does everything you need in one line, and produces a (nicely plottable) series for you.
EDIT:
I toyed around with a data frame object I created myself:
df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})
for col in ['A', 'C']:
    print df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values))
yields two series that look like this:
B
a [6.58333333333, 14.3333333333]
b [8.5, 16.25]
B
a [21.5833333333, 29.3333333333]
b [23.4166666667, 31.25]
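To get from there to the bar graph with error bars, one option is a plain-NumPy percentile bootstrap, which avoids the scikits.bootstrap dependency entirely (a sketch; the column names come from the question, everything else is illustrative):

import numpy as np
import matplotlib.pyplot as plt

def bootstrap_ci(values, n_boot=10000, alpha=0.05, seed=0):
    # percentile bootstrap confidence interval for the mean
    rng = np.random.default_rng(seed)
    resamples = rng.choice(values, size=(n_boot, len(values)), replace=True)
    means = resamples.mean(axis=1)
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

cols = ['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']
means = df.groupby('Cancer Stage')[cols].mean()

for col in cols:
    ci = df.groupby('Cancer Stage')[col].apply(lambda x: bootstrap_ci(x.values))
    lower = means[col] - ci.map(lambda c: c[0])   # distance from mean down to CI low
    upper = ci.map(lambda c: c[1]) - means[col]   # distance from mean up to CI high
    plt.bar(means.index, means[col], yerr=[lower, upper], capsize=4)
    plt.title(col)
    plt.show()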

Which features of Pandas DataFrame could be used to model GPS Tracklog data (read from GPX file)

It's been months now since I started to use Pandas DataFrame to deserialize GPS data and perform some data processing and analyses.
Although I am very impressed with Pandas' robustness, flexibility and power, I'm a bit lost as to which features I should use, and in which way, to properly model the data for clarity, simplicity and computational speed.
Basically, each DataFrame is primarily indexed by a datetime object, having at least one column for a latitude-longitude tuple, and one column for elevation.
The first thing I do is calculate a new column with the geodesic distance between consecutive coordinate pairs (the first one being 0.0), using a function that takes two coordinate pairs as arguments. From that new column I can calculate the cumulative distance along the track, which I use as a Linear Referencing System.
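For what it's worth, the pairwise-distance step can be vectorized directly over the coordinate columns instead of looping; a sketch using the haversine approximation (this assumes separate numeric lat/lon columns rather than a tuple column, and all names here are illustrative):

import numpy as np

def haversine_km(lat, lon):
    # great-circle distance between consecutive points, in km
    R = 6371.0
    phi, lam = np.radians(lat), np.radians(lon)
    dphi, dlam = np.diff(phi), np.diff(lam)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi[:-1]) * np.cos(phi[1:]) * np.sin(dlam / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# given a track DataFrame df with 'lat' and 'lon' columns:
# steps = np.concatenate([[0.0], haversine_km(df['lat'].values, df['lon'].values)])
# df['cumdist'] = steps.cumsum()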
The questions I need to address would be:
Is there a way in which I can use, in the same dataframe, two different monotonically increasing columns (cumulative distance and timestamp), choosing whatever is more convenient in each given context at runtime, and use these indexes to auto-align newly inserted rows?
In the specific case of applying a diff function that could be vectorized (applied like an array operation instead of an iterative pairwise loop), is there a way to do that idiomatically in pandas? Should I create a "coordinate" class that supports the diff (__sub__) operation so I could use dataframe.latlng.diff directly?
I'm not sure these questions are well formulated, but that is due, at least in part, to the overwhelming number of possibilities and the (as yet) somewhat fragmented documentation.
Also, any tip about using Pandas for GPS data (tracklogs) or Geospatial data in general is very much welcome.
Thanks for any help!