Extract index values from groupby using numpy array - numpy

I'm trying to extract 7 rows of data grouped by the DTY column in a Dataframe. I think I need to use a numpy array to filter, but can't seem to get it working.
Here's an extract of my original Climate_DF dataframe (note there are NaN values in some of the columns):
Year Month Day DTY Precip9am MaxT ... Wind ms 21 hr
1989 1 1 1 0 29.7 ... 0
1989 1 2 2 0 31.1 ... 4.6
1989 1 3 3 0.4 32 ... 2.1
... ... ... ... ... ... ... ...
2019 12 31 365 21.2 31.3 ... 2.1
First, I created a numpy array filter based on a given date - this works and creates the right array:
#Enter Day of Interest (DOI) yyyy, mm, dd
import datetime
import numpy as np

DOI = datetime.datetime(2020, 2, 6)
DTY = DOI.timetuple()[7]   # tm_yday: day of the year (37 for 6 Feb 2020)
DTYMinus3 = DTY - 3
DTYPlus3 = DTY + 3
DTY_Array = np.linspace(DTYMinus3, DTYPlus3, 7)
DTY_Array = np.array(DTY_Array)
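Note that np.linspace returns floats, so DTY_Array ends up holding 34.0 through 40.0 rather than integers, which is presumably why those exact values show up in the KeyError further down. If integer day-of-year labels are wanted, np.arange is one option (a small sketch, not part of the original code):
# Alternative sketch: integer day-of-year values instead of floats
DTY_Array = np.arange(DTYMinus3, DTYPlus3 + 1)   # array([34, 35, 36, 37, 38, 39, 40])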
I now want to extract all values from all years for the columns 'Precip9am' through to 'Wind ms 21 hr', filtered to just the rows whose DTY value is in DTY_Array.
I've grouped the Climate_DF to DTY and then try to apply a filter:
ClimateDTY_DF = Climate_DF.groupby("DTY")
DTY_Climate = ClimateDTY_DF[DTY_Array]
I get the following error:
KeyError: 'Columns not found: 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0'
I'm assuming this is because the filter is trying to find the values in the ClimateDTY_DF columns, but I need **it to be filtering through the DTY column to find all the indexes from the array and extract the values from each column after 'Precip9am'**.
How do I do this? Do I transpose first? Or do I need to create some kind of loop that creates a DF for DTY and each column from 'Precip9am' through to 'Winds ms 21 hr' and extracts just the DTY from the array for each?
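For what it's worth, here is a minimal sketch of one way to get there, assuming the columns really appear in the order shown above and that the goal is simply to keep the rows whose DTY falls in DTY_Array; boolean filtering with .isin on the DTY column replaces the indexing of the groupby object:
# Sketch: keep only the rows whose DTY is in DTY_Array (column labels as in the sample above)
filtered = Climate_DF[Climate_DF['DTY'].isin(DTY_Array)]
DTY_Climate = filtered.loc[:, 'Precip9am':'Wind ms 21 hr']
# and a per-day-of-year grouping of that subset, if still needed
ClimateDTY_DF = filtered.groupby('DTY')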

Related

How to split data with respect to months?

Hi I have a time series data set. I would like to make a new column for each month.
data:
creationDate fre skill
2019-02-15T20:43:29Z 14 A
2019-02-15T21:10:32Z 15 B
2019-03-22T07:14:50Z 41 A
2019-03-22T06:47:41Z 64 B
2019-04-11T09:49:46Z 25 A
2019-04-11T09:49:46Z 29 B
output:
skill 2019-02 2019-03 2019-04
A 14 41 25
B 15 64 29
I know I can do it manually like below and make columns (when I have date1_start and date1_end):
dfdate1=data[(data['creationDate'] >= date1_start) & (data['creationDate']<= date1_end)]
But since I have many months, it is not feasible to do this for each month separately.
Use DataFrame.pivot after converting the datetimes to month periods with Series.dt.to_period:
df['dates'] = pd.to_datetime(df['creationDate']).dt.to_period('M')
df = df.pivot(index='skill', columns='dates', values='fre')
Or convert to custom YYYY-MM strings with Series.dt.strftime:
df['dates'] = pd.to_datetime(df['creationDate']).dt.strftime('%Y-%m')
df = df.pivot(index='skill', columns='dates', values='fre')
EDIT:
ValueError: Index contains duplicate entries, cannot reshape
It means there are duplicates; use DataFrame.pivot_table with an aggregation function, e.g. sum or mean:
df = df.pivot_table(index='skill',columns='dates',values='fre', aggfunc='sum')
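For reference, a small end-to-end sketch built from the sample rows in the question (pivot_table is used so that the duplicate-entries case is covered as well):
import pandas as pd

data = pd.DataFrame({
    'creationDate': ['2019-02-15T20:43:29Z', '2019-02-15T21:10:32Z',
                     '2019-03-22T07:14:50Z', '2019-03-22T06:47:41Z',
                     '2019-04-11T09:49:46Z', '2019-04-11T09:49:46Z'],
    'fre': [14, 15, 41, 64, 25, 29],
    'skill': ['A', 'B', 'A', 'B', 'A', 'B']})

data['dates'] = pd.to_datetime(data['creationDate']).dt.strftime('%Y-%m')
out = data.pivot_table(index='skill', columns='dates', values='fre', aggfunc='sum')
print(out)   # one row per skill, columns 2019-02, 2019-03, 2019-04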

Creating a base 100 Index from time series that begins with a number of NaNs

I have the following dataframe (time-series of returns truncated for succinctness):
import pandas as pd
import numpy as np
df = pd.DataFrame({'return':np.array([np.nan, np.nan, np.nan, 0.015, -0.024, 0.033, 0.021, 0.014, -0.092])})
I'm trying to start the index (i.e., "base-100") at the last NaN before the first return, while at the same time keeping the NaNs preceding the 100 value in place (thinking in terms of appending to the existing dataframe, and for graphing purposes).
I only have found a way to create said index when there are no NaNs in the return vector:
df['index'] = 100*np.exp(np.nan_to_num(df['return'].cumsum()))
Any ideas - thx in advance!
If your initial array is
zz = np.array([np.nan, np.nan, np.nan, 0.015, -0.024, 0.033, 0.021, 0.014, -0.092])
Then you can obtain your desired output like this (although there's probably a more optimized way to do it):
np.concatenate((zz[:np.argmax(np.isfinite(zz))],
                100*np.exp(np.cumsum(zz[np.isfinite(zz)]))))
Use Series.isna, reverse the order by indexing, and get the index of the last NaN with Series.idxmax:
idx = df['return'].isna().iloc[::-1].idxmax()
Pass it to DataFrame.loc, replace the missing value, and take the cumulative sum:
df['return'] = df.loc[idx:, 'return'].fillna(100).cumsum()
print (df)
return
0 NaN
1 NaN
2 100.000
3 100.015
4 99.991
5 100.024
6 100.045
7 100.059
8 99.967
You can use Series.isna with Series.cumsum and compare against the max, then replace the last NaN via Series.fillna and finally take the cumulative sum:
s = df['return'].isna().cumsum()
df['return'] = df['return'].mask(s.eq(s.max()), df['return'].fillna(100)).cumsum()
print (df)
return
0 NaN
1 NaN
2 100.000
3 100.015
4 99.991
5 100.024
6 100.045
7 100.059
8 99.967
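If the multiplicative form from the question (100*np.exp of the cumulative return) is wanted rather than the additive sums shown above, a similar sketch that adapts the same idx trick, using the original df defined in the question, would be:
# Sketch: keep the leading NaNs, start the base 100 at the last NaN,
# then compound with exp of the cumulative return as in the question's formula
idx = df['return'].isna().iloc[::-1].idxmax()    # label of the last NaN
df['index'] = 100 * np.exp(df.loc[idx:, 'return'].fillna(0).cumsum())
# rows before idx are never assigned, so they remain NaN in the new column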

Pandas - Filtering out data by weekday

I have a Dataframe that has list of dates with sales count for each of the days as shown below:
date,count
11/1/2018,345
11/2/2018,100
11/5/2018,432
11/7/2018,500
11/11/2018,555
11/17/2018,754
I am trying to check, of all the sales that were done, how many were done on a weekday. To pull all weekdays in November I am doing the below:
weekday = pd.DataFrame(pd.bdate_range('2018-11-01', '2018-11-30'))
Now I am trying to compare dates in df with value in weekday as below:
df_final = df[df['date'].isin(weekday)]
But the above returns no rows.
You should remove pd.DataFrame when creating weekday. When a DataFrame is passed to isin, pandas matches not only the values but also the index and columns; since the original index and columns differ from those of the newly created dataframe weekday, everything comes back False.
df.date=pd.to_datetime(df.date)
weekday = pd.bdate_range('2018-11-01', '2018-11-30')
df_final = df[df['date'].isin(weekday)]
df_final
Out[39]:
date count
0 2018-11-01 345
1 2018-11-02 100
2 2018-11-05 432
3 2018-11-07 500
A simple example to illustrate the issue mentioned above:
df=pd.DataFrame({'A':[1,2,3,4,5]})
newdf=pd.DataFrame({'B':[2,3]})
df.isin(newdf)
Out[43]:
A
0 False
1 False
2 False
3 False
4 False
df.isin(newdf.B.tolist())
Out[44]:
A
0 False
1 True
2 True
3 False
4 False
Use a DatetimeIndex and let pandas do the work for you as follows:
# generate some sample sales data for the month of November
df = pd.DataFrame(
    {'count': np.random.randint(0, 900, 30)},
    index=pd.date_range('2018-11-01', '2018-11-30', name='date')
)
# resample by business day and call `.asfreq()` on the resulting groupby-like object to get your desired filtering
df.resample(rule='B').asfreq()
Other values for the resampling rule can be found in the pandas documentation on offset aliases.
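Another standard option, sticking with the question's original df and its date column (a sketch of an alternative, not taken from the answers above), is to filter on the weekday number directly, where Monday is 0 and Sunday is 6:
df['date'] = pd.to_datetime(df['date'])
# keep Monday through Friday only
df_final = df[df['date'].dt.dayofweek < 5]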

Division between two numbers in a Dataframe

I am trying to calculate a percent change between 2 numbers in one column when a signal from another column is triggered.
The trigger can be found with np.where(), but what I am having trouble with is the percent change. .pct_change does not work because with .pct_change(-5) you get 16.03/20.35, and I want the number the opposite way, 20.35/16.03. See the table below. I have tried returning the array from the index in the np.where and adding it to an .iloc from the 'Close' column, but it says I can't use that array to get an .iloc position. Can anyone help me solve this problem? Thank you.
IdxNum | Close | Signal (1s)
==============================
0 21.45 0
1 21.41 0
2 21.52 0
3 21.71 0
4 20.8 0
5 20.35 0
6 20.44 0
7 16.99 0
8 17.02 0
9 16.69 0
10 16.03 1<< 26.9% <<< 20.35/16.03-1 (df.Close[5]/df.Close[10]-1)
11 15.67 0
12 15.6 0
You can try this code block:
#Create DataFrame
df = pd.DataFrame({'IdxNum': range(13),
                   'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
                             16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
                   'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1   # .ix was removed in pandas 1.0; .loc does the same here

#Create a function that calculates the reqd diff
def cal_diff(row):
    if row['Signal'] == 1:
        signal_index = int(row['IdxNum'])
        row['diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
    return row

#Create a column and apply that difference
df['diff'] = 0
df = df.apply(lambda x: cal_diff(x), axis=1)
In case you don't have IdxNum column, you can use the index to calculate difference
#Create DataFrame
df = pd.DataFrame({
    'Close': [21.45, 21.41, 21.52, 21.71, 20.8, 20.35, 20.44,
              16.99, 17.02, 16.69, 16.03, 15.67, 15.6],
    'Signal': [0] * 13})
df.loc[10, 'Signal'] = 1

#Calculate the reqd difference
df['diff'] = 0
signal_index = df[df['Signal'] == 1].index[0]
df.loc[signal_index, 'diff'] = df.Close[signal_index - 5] / df.Close[signal_index] - 1
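A vectorized sketch that avoids apply altogether (an alternative take, assuming the fixed 5-row lookback from the example):
import numpy as np

# Close.shift(5) holds the value from 5 rows earlier, so the ratio is only
# filled in on the rows where Signal equals 1; everything else stays 0
df['diff'] = np.where(df['Signal'].eq(1),
                      df['Close'].shift(5) / df['Close'] - 1,
                      0)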

Append a tuple to a dataframe as a row

I am looking for a solution to add rows to a dataframe. Here is the data I have:
A grouped object (obtained by grouping a dataframe on month and year, i.e. in this grouped object the key is [month, year] and the value is all the rows/dates in that month and year).
I want to extract all the month, year combinations and put them in a new dataframe. Issue: when I iterate over the grouped object, the (month, year) key is a tuple, so I converted the tuple into a list and added it to a dataframe using the append command. Instead of getting added as rows:
1 2014
2 2014
3 2014
it got added in one column
0 1
1 2014
0 2
1 2014
0 3
1 2014
...
I want to store these values in a new dataframe. Here is how I want the new dataframe to be :
month year
1 2014
2 2014
3 2014
I tried converting the tuple to list and then I tried various other things like pivoting. Inputs would be really helpful.
Here is the sample code:
df = df.groupby(['month','year'])
df = pd.DataFrame()
for key, value in df:
    print "type of key is:", type(key)
    print "type of list(key) is:", type(list(key))
    df = df.append(list(key))
print df
When you do the groupby the resulting MultiIndex is available as:
In [11]: df = pd.DataFrame([[1, 2014, 42], [1, 2014, 44], [2, 2014, 23]], columns=['month', 'year', 'val'])
In [12]: df
Out[12]:
month year val
0 1 2014 42
1 1 2014 44
2 2 2014 23
In [13]: g = df.groupby(['month', 'year'])
In [14]: g.grouper.result_index
Out[14]:
MultiIndex(levels=[[1, 2], [2014]],
           labels=[[0, 1], [0, 0]],
           names=['month', 'year'])
Often this will be sufficient, and you won't need a DataFrame. If you do, one way is the following:
In [21]: pd.DataFrame(index=g.grouper.result_index).reset_index()
Out[21]:
month year
0 1 2014
1 2 2014
I thought there was a method to get this, but can't recall it.
If you really want the tuples you can use .values or to_series:
In [31]: g.grouper.result_index.values
Out[31]: array([(1, 2014), (2, 2014)], dtype=object)
In [32]: g.grouper.result_index.to_series()
Out[32]:
month year
1 2014 (1, 2014)
2 2014 (2, 2014)
dtype: object
You had initially declared both the groupby and the empty dataframe as df. Here's a modified version of your code that allows you to append a tuple as a dataframe row.
g = df.groupby(['month','year'])
df = pd.DataFrame()
for (key1, key2), value in g:
    row_series = pd.Series((key1, key2), index=['month','year'])
    df = df.append(row_series, ignore_index=True)
print df
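A loop-free sketch using the same groupby object, which also sidesteps DataFrame.append (removed in pandas 2.0): g.groups is a dict keyed by the (month, year) tuples, so the unique pairs can be built directly from it.
# g as created above; the dict keys of g.groups are the unique (month, year) tuples
unique_pairs = pd.DataFrame(list(g.groups.keys()), columns=['month', 'year'])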
If all you want are the unique values, you could use drop_duplicates
In [29]: df[['month','year']].drop_duplicates()
Out[29]:
month year
0 1 2014
2 2 2014