Querying a DataFrame values from a specific year - pandas

I have a pandas DataFrame created from weather data that shows the high and low temperatures by day from 2005-2015. I want to query the DataFrame so that it only shows the values from the year 2015. Is there any way to do this without first reformatting the datetime values to show only the year (i.e. without applying strftime('%Y') first)?
DataFrame Creation:
df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
df['Date'] = pd.to_datetime(df.Date)
df['Date'] = df['Date'].dt.strftime('%m-%d-%y')  # note: strftime converts the column back to strings
Attempt to Query:
daily_df = df[df['Date'] == datetime.date(year=2015)]
Error: a TypeError asking for the month and day to be specified as well.
Data:
An NOAA dataset has been stored in the file data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv. The data for this assignment comes from a subset of The National Centers for Environmental Information (NCEI) Daily Global Historical Climatology Network (GHCN-Daily). The GHCN-Daily is comprised of daily climate records from thousands of land surface stations across the globe.
Each row in the assignment datafile corresponds to a single observation.
The following variables are provided to you:
id : station identification code
date : date in YYYY-MM-DD format (e.g. 2012-01-24 = January 24, 2012)
element : indicator of element type
    TMAX : Maximum temperature (tenths of degrees C)
    TMIN : Minimum temperature (tenths of degrees C)
value : data value for element (tenths of degrees C)
Image of DataFrame:

I resolved this by adding a column with just the year and then querying on that, but there has to be a better way to do this?
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%d-%m-%y')
df['Year'] = pd.to_datetime(df['Date']).dt.strftime('%y')
daily_df = df[df['Year'] == '15']
return daily_df
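A cleaner approach is to keep the column as a real datetime and filter on the .dt.year accessor, which avoids string formatting entirely. A minimal sketch (column names follow the question):

import pandas as pd

df = pd.read_csv('data/C2A2_data/BinnedCsvs_d400/fb441e62df2d58994928907a91895ec62c2c42e6cd075c2700843b89.csv')
df['Date'] = pd.to_datetime(df['Date'])    # keep real datetimes, no strftime
daily_df = df[df['Date'].dt.year == 2015]  # boolean mask on the year component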

Related

Calculate a value based on other columns' values with a row step (offset)

Total beginner here. If my question is irrelevant, apologies in advance, I'll remove it. So, I have a question: using pandas, I want to calculate an evolution ratio for one week's data compared with the rolling mean of the previous 4 weeks.
df['rolling_mean_fourweeks'] = df.rolling(4).mean().round(decimals=1)
From here I want to create a new column for the evolution ratio, based on each week's data compared with the rolling mean at the previous week.
What is the best way to go here? (I don't have big data.) I have tried unsuccessfully with .shift() but am very unfamiliar with .shift()... I should get NaN for week 3 (the fourth week) and ~47% for the fifth week.
Any suggestion for retrieving the value at the row with step -1?
Thanks and have a good day!
Your idea of using shift can work perfectly. The shift(x) function simply shifts a series (a full column in your case) by x steps.
A simple way to check if the rolling_mean_fourweeks is a good predictor can be to shift Column1 and then check how it differs from rolling_mean_fourweeks:
df['column1_shifted'] = df['Column1'].shift(-1)
df['rolling_accuracy'] = ((df['column1_shifted'] - df['rolling_mean_fourweeks'])
                          / df['rolling_mean_fourweeks'])
resulting in a rolling_accuracy column that holds the relative deviation of the shifted values from the rolling mean.
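For the evolution ratio itself, a minimal runnable sketch with made-up weekly numbers (the column name Column1 follows the thread; the values are illustrative, not the asker's data):

import pandas as pd

# Made-up weekly values, purely illustrative
df = pd.DataFrame({'Column1': [30, 45, 40, 45, 70]})
df['rolling_mean_fourweeks'] = df['Column1'].rolling(4).mean().round(decimals=1)

# shift(1) moves the rolling mean down one row, so week n is compared with
# the mean of weeks n-4..n-1; the first four rows come out as NaN.
prev_mean = df['rolling_mean_fourweeks'].shift(1)
df['evolution_ratio'] = (df['Column1'] - prev_mean) / prev_mean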

How to divide a groupby Object by pandas Series efficiently? Or how to convert yfinance multiple ticker data to another currency?

I am pulling historical price data for the S&P500 index components with yfinance and would now like to convert the Close & Volume from USD into EUR.
This is what I tried:
data = yf.download(set(components), group_by="Ticker", start=get_data.start_date)
where start="2020-11-04" and components is a list of the yfinance tickers of the S&P 500 members plus "EURUSD=X", the symbol for the conversion rate
# Group by Ticker and Date
df = data.stack(level=0).rename_axis(['Date', 'Ticker']).reset_index(level=1)
df = df.sort_values(by='Ticker', axis='index', kind='stable')
After adding columns for the name, sector and currency (I need this because in my application I append several dataframes with tickers in different currencies) and dropping columns I don't need, I have a dataframe that looks like this:
I now want to convert the Close and Volume columns into EUR. I have found a way that works on most of the data except the S&P 500 and other US stocks, which is why I am posting the question here.
# Check again if currency is not EUR
if currency != "EUR":
    df['Close in EUR'] = df.groupby('Ticker')['Close'].apply(
        lambda group: group.iloc[::] / df[df['Ticker'] == currency]['Close'])
    df['Volume in Mio. EUR'] = df['Volume'] * df['Close in EUR'] / 1000000
else:
    df['Volume in Mio. EUR'] = df['Volume'] * df['Close'] / 1000000
This not only takes a lot of time (~46 seconds), it also produces NaN values in the "Close in EUR" and "Volume in Mio. EUR" columns. Do you have any idea why?
I have found that df[df['Ticker']==currency] has more rows than the stock tickers do, due to public holidays of the stock exchanges, and even after deleting the unmatched rows I am left with NaN values. Doing the whole process for other index members, e.g. ^JKLQ45 (Indonesia Stock Exchange index), works, which is surprising.
Any help, or even an idea of how to do this more efficiently, is highly appreciated!
If you want to get a sense of my final project - check out: https://equityanalysis.herokuapp.com/
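One direction worth sketching (an assumption-laden sketch, not a verified fix for this exact dataset): compute the EURUSD=X close once as a per-date series, forward-fill it over the exchange holidays that cause the row mismatch, and let pandas align it on the Date index instead of dividing inside a groupby:

import pandas as pd

# df as built above: Date index, 'Ticker' column, 'Close'/'Volume' columns,
# with 'EURUSD=X' among the tickers.
dates = df.index.unique().sort_values()
rate = (df.loc[df['Ticker'] == 'EURUSD=X', 'Close']
          .reindex(dates)
          .ffill()          # carry the last known rate across holidays
          .rename('eurusd'))

df = df.join(rate)          # one index alignment instead of per-group division
df['Close in EUR'] = df['Close'] / df['eurusd']
df['Volume in Mio. EUR'] = df['Volume'] * df['Close in EUR'] / 1000000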

How to set Custom Business Day End Frequency in Pandas

I have a pandas dataframe with an unusual DatetimeIndex. The frame contains daily data (end of each day) from 1985 to 1990 but some "random" days are missing:
DatetimeIndex(['1985-01-02', '1985-01-03', '1985-01-04', '1985-01-07',
               '1985-01-08', '1985-01-09', '1985-01-10', '1985-01-11',
               '1985-01-14', '1985-01-15',
               ...
               '1990-12-17', '1990-12-18', '1990-12-19', '1990-12-20',
               '1990-12-21', '1990-12-24', '1990-12-26', '1990-12-27',
               '1990-12-28', '1990-12-31'],
              dtype='datetime64[ns]', name='date', length=1516, freq=None)
I often need operations like shifting an entire column such that a value at the last day of a month in my DatetimeIndex (e.g. '1985-05-30') is shifted to the last day of the next month (e.g. '1985-06-27').
While looking for a smart way to perform such shifts, I stumbled over the offset aliases provided by pandas.tseries.offsets, among them the custom business day frequency (C) and the custom business month end frequency (CBM). Looking at an example, it seems like this could provide exactly what I need:
from pandas.tseries.holiday import USFederalHolidayCalendar

mth_us = pd.offsets.CustomBusinessMonthEnd(calendar=USFederalHolidayCalendar())
day_us = pd.offsets.CustomBusinessDay(calendar=USFederalHolidayCalendar())
df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_us)  # shifted by 1 month
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq=day_us)  # shifted by 1 day
The problem is that my DatetimeIndex does not match USFederalHolidayCalendar(). Can someone please tell me how I can use pd.offsets.CustomBusinessMonthEnd (and also pd.offsets.CustomBusinessDay) with my own custom DatetimeIndex?
If not, does anyone have an idea how to tackle this issue in a different way?
Thanks a lot for your help!
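One possible direction (a sketch, not a verified answer): both offsets also accept an explicit holidays= argument, so every business day that is missing from your own index can be treated as a "holiday" and the offsets built from that:

import pandas as pd

# All weekdays in the covered range that the index skips become "holidays"
all_bdays = pd.bdate_range(df.index.min(), df.index.max())
holidays = all_bdays.difference(df.index)

day_custom = pd.offsets.CustomBusinessDay(holidays=holidays)
mth_custom = pd.offsets.CustomBusinessMonthEnd(holidays=holidays)

df['Col1_shifted'] = df['Col1'].shift(periods=1, freq=mth_custom)  # 1 custom month
df['Col2_shifted'] = df['Col2'].shift(periods=1, freq=day_custom)  # 1 custom day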

Summarize data to one point per year and plot them in a graph

I want to summarize data points by year.
That is, I want to plot a graph across years, summarizing a specified attribute, say coins, for the years 2011-2015.
I chose a pandas DataFrame for this.
I used:

# Summarize the data points (way 1)
testn = test.groupby(by=[test.index.year, test.index.month]).sum()
print(testn['SalesDateTime'])
print(type(testn))

This gives me an error telling me that there is no attribute year or month.

# Summarize the data points (way 2)
nieuw = test.groupby(by=[test.index.month, test.index.year])
print(nieuw)

testn.plot("SalesDateTime", "Coins")
plt.show()
I think the problem is that there is no DatetimeIndex, so you need to_datetime:
test.index = pd.to_datetime(test.index)
testn = test.groupby(by=[test.index.year,test.index.month]).sum()
Or:
testn = (test.groupby(by=[test.index.year.rename('year'),
                          test.index.month.rename('month')])
             .sum()
             .reset_index())
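To get the year-level totals the question actually asks for (e.g. coins for 2011-2015) and plot them, a minimal sketch building on the above (the Coins column name follows the question):

import pandas as pd
import matplotlib.pyplot as plt

test.index = pd.to_datetime(test.index)
yearly = test.groupby(test.index.year)['Coins'].sum()  # one total per year

yearly.plot(kind='bar')
plt.show()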

Calling preprocessing.scale on a heterogeneous array

I have this TypeError, as per below. I have checked my df and it contains numbers only; could this have been caused when I converted to a numpy array? After the conversion the array has items like
[Timestamp('1993-02-11 00:00:00') 28.1216 28.3374 ...]
Any suggestion on how to solve this, please?
df:
    Date        Open     High     Low      Close   Volume
9   1993-02-11  28.1216  28.3374  28.1216  28.2197   19500
10  1993-02-12  28.1804  28.1804  28.0038  28.0038   42500
11  1993-02-16  27.9253  27.9253  27.2581  27.2974  374800
12  1993-02-17  27.2974  27.3366  27.1796  27.2777  210900
X = np.array(df.drop(['High'], 1))
X = preprocessing.scale(X)
TypeError: float() argument must be a string or a number
While you're saying that your dataframe "all contains numbers only", you also note that the first column consists of datetime objects. The error is telling you that preprocessing.scale only wants to work with float values.
The real question, however, is what you expect to happen to begin with. preprocessing.scale centers values on the mean and normalizes the variance. This is such that measured quantities are all represented on roughly the same footing. Now, your first column tells you what dates your data correspond to, while the rest of the columns are numeric data themselves. Why would you want to normalize the dates? How would you normalize the dates?
Semantically speaking, I believe you should leave your dates alone. Whatever post-processing you're planning to perform on your numerical data, the normalized data should still be parameterized by the original dates. If you want to process your dates too, you need to come up with an explicit way to map your dates to something numeric (say, elapsed time from a given date in given units).
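If you do want the dates in the feature matrix, a minimal sketch of one such mapping (elapsed days since the first observation; the days_elapsed column name is made up):

import pandas as pd
from sklearn import preprocessing

df['Date'] = pd.to_datetime(df['Date'])
df['days_elapsed'] = (df['Date'] - df['Date'].min()).dt.days  # dates as day counts

X = preprocessing.scale(df.drop(columns=['Date', 'High']).astype(float))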
So I believe you should drop your dates from your processing round altogether, and start with
X = df.drop(columns=['Date', 'High']).to_numpy()  # as_matrix() was removed in newer pandas