Find index range of a sequence in dataframe column - pandas

I have a timeseries:
Sales
2018-01-01 66.65
2018-01-02 66.68
2018-01-03 65.87
2018-01-04 66.79
2018-01-05 67.97
2018-01-06 96.92
2018-01-07 96.90
2018-01-08 96.90
2018-01-09 96.38
2018-01-10 95.57
Given an arbitrary sequence of values, let's say [66.79,67.97,96.92,96.90], how could I obtain the corresponding indices, for example: [2018-01-04, 2018-01-05,2018-01-06,2018-01-07]?

Use pandas.Series.isin to filter the column Sales, then pandas.DataFrame.index to return the row labels (i.e. the index, the dates in your df), and finally pandas.Index.to_list to build a list:
vals = [66.79,67.97,96.92,96.90]
result = df[df['Sales'].isin(vals)].index.to_list()
# Output:
print(result)
['2018-01-04', '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08']
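Note that isin matches the values anywhere in the column regardless of order, which is why 2018-01-08 (a second 96.90) also appears above. If only the contiguous run that matches the exact order is wanted, one option is a sliding-window comparison; a minimal sketch, assuming df has a 'Sales' column:
import numpy as np

vals = [66.79, 67.97, 96.92, 96.90]
n = len(vals)
arr = df['Sales'].to_numpy()
# keep the start positions where the whole window equals the sequence, in order
starts = [i for i in range(len(arr) - n + 1) if np.allclose(arr[i:i + n], vals)]
matches = [df.index[i:i + n].tolist() for i in starts]
print(matches)  # one match: the four labels from 2018-01-04 through 2018-01-07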

Related

loc function for the months [duplicate]

With a datetime index to a Pandas dataframe, it is easy to get a range of dates:
df[datetime(2018,1,1):datetime(2018,1,10)]
Filtering is straightforward too:
df[(df['column A'] == 'Done') & (df['column B'] < 3.14)]
But what is the best way to simultaneously filter by range of dates and any other non-date criteria?
3 boolean conditions
c0 = df.index.to_series().between('2018-01-01', '2018-01-10')
c1 = df['column A'] == 'Done'
c2 = df['column B'] < 3.14
df[c0 & c1 & c2]
column A column B
2018-01-04 Done 2.533385
2018-01-06 Done 2.789072
2018-01-08 Done 2.230017
Setup
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame({
    'column A': ['Done', 'Not Done'] * 10,
    'column B': np.random.randn(20) + np.pi
}, pd.date_range('2017-12-25', periods=20))
df
column A column B
2017-12-25 Done 1.011868
2017-12-26 Not Done 1.873127
2017-12-27 Done 1.171093
2017-12-28 Not Done 0.882538
2017-12-29 Done 2.792306
2017-12-30 Not Done 3.114638
2017-12-31 Done 3.457829
2018-01-01 Not Done 3.490375
2018-01-02 Done 3.856957
2018-01-03 Not Done 3.912356
2018-01-04 Done 2.533385
2018-01-05 Not Done 3.493983
2018-01-06 Done 2.789072
2018-01-07 Not Done 2.725724
2018-01-08 Done 2.230017
2018-01-09 Not Done 2.999055
2018-01-10 Done 3.888432
2018-01-11 Not Done 1.637436
2018-01-12 Done 3.752955
2018-01-13 Not Done 3.541812
If there are multiple boolean masks, it is possible to use np.logical_and.reduce:
m1 = df.index > '2018-01-01'
m2 = df.index < '2018-01-10'
m3 = df['column A'] == 'Done'
m4 = df['column B'] < 3.14
#piRSquared's data sample
df = df[np.logical_and.reduce([m1, m2, m3, m4])]
print (df)
column A column B
2018-01-04 Done 2.533385
2018-01-06 Done 2.789072
2018-01-08 Done 2.230017
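The same rows can also be obtained by slicing the date range with .loc first and then applying the remaining masks; a sketch, assuming the df from the setup above (DatetimeIndex plus 'column A' and 'column B'):
sub = df.loc['2018-01-01':'2018-01-10']   # label slicing on a DatetimeIndex is inclusive
out = sub[(sub['column A'] == 'Done') & (sub['column B'] < 3.14)]
print(out)   # the same three rows: 2018-01-04, 2018-01-06 and 2018-01-08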
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((200,3)))
df['date'] = pd.date_range('2018-1-1', periods=200, freq='D')
df = df.set_index(['date'])
print(df.loc['2018-2-1':'2018-2-10'])
Hope it helps.
I did the following to filter both dataframes so that they cover the same dates:
corn_url = 'https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=WPU012202&scale=left&cosd=1971-01-01&coed=2020-04-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2009-06-01&line_index=1&transformation=lin&vintage_date=2020-06-09&revision_date=2020-06-09&nd=1971-01-01'
wheat_url ='https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1168&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=WPU0121&scale=left&cosd=1947-01-01&coed=2020-04-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2009-06-01&line_index=1&transformation=lin&vintage_date=2020-06-09&revision_date=2020-06-09&nd=1947-01-01'
corn = pd.read_csv(corn_url,index_col=0,parse_dates=True)
wheat = pd.read_csv(wheat_url,index_col=0, parse_dates=True)
corn.head()
PP Index 1982
DATE
1971-01-01 63.4
1971-02-01 63.6
1971-03-01 62.0
1971-04-01 60.8
1971-05-01 60.2
wheat.head()
PP Index 1982
DATE
1947-01-01 53.1
1947-02-01 56.5
1947-03-01 68.0
1947-04-01 66.0
1947-05-01 66.7
wheat = wheat[wheat.index > '1970-12-31']
wheat.head()
PP Index 1982
DATE
1971-01-01 42.6
1971-02-01 42.6
1971-03-01 41.4
1971-04-01 41.7
1971-05-01 41.8
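If the two frames still end up with slightly different dates, a further option (a sketch, not from the original answer) is to keep only the dates present in both indexes:
common_dates = corn.index.intersection(wheat.index)
corn = corn.loc[common_dates]
wheat = wheat.loc[common_dates]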

Using a custom scoring function with pandas groupby to create a column in another dataframe

This is my partial df:
dStart y_test y_pred
2018-01-01 1 2
2018-01-01 2 2
2018-01-02 3 3
2018-01-02 1 2
2018-01-02 2 3
I want to create a column in another dataframe (df1) with the Mathews Correlation Coefficient of each unique dStart.
from sklearn.metrics import matthews_corrcoef
def mcc_func(y_test, y_pred):
    return matthews_corrcoef(df[y_test].values, df[y_pred].values)
df1['mcc']=df.groupby('dStart').apply(mcc_func('y_test','y_pred'))
This function doesn't work -- I think because the function returns a float, and 'apply' wants to use the function on the groupby data itself, but I can't figure out how to give the right function to apply.
You need to apply the function within the grouped object -
g = df.groupby('dStart')
g.apply(lambda x: matthews_corrcoef(x['y_test'], x['y_pred']))
#OUTPUT
#dStart
#2018-01-01 0.0
#2018-01-02 0.0
#dtype: float64
Use apply with a lambda function:
df = (df.groupby(['dStart']).apply(lambda x: matthews_corrcoef(x['y_test'], x['y_pred']))
.reset_index(name='Matthews_corrcoef'))
print(df)
dStart Matthews_corrcoef
0 2018-01-01 0.0
1 2018-01-02 0.0
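To get the result into another dataframe as asked, the per-date series can be merged back in; a sketch, assuming df1 has a dStart column to join on (its actual structure is not shown in the question):
mcc = (df.groupby('dStart')
         .apply(lambda x: matthews_corrcoef(x['y_test'], x['y_pred']))
         .rename('mcc'))
# assumes df1 has a 'dStart' column; adjust the join key to df1's real layout
df1 = df1.merge(mcc, left_on='dStart', right_index=True, how='left')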

Loading a time series from CSV into a DataFrame

Is it possible to create a Daru DataFrame from a CSV in which the first column is a series of dates?
Take the following CSV, for instance:
time,min,max
2018-01-01,101,103
2018-01-02,102,105
2018-01-03,103,200
2018-01-04,104,109
2018-01-05,105,110
If loaded with Daru::DataFrame.from_csv it will create a 5x3 DataFrame with a 0-based numerical index, instead of a 5x2 DataFrame with a DateTimeIndex.
Is there a way to instruct Daru to use the first vector as a DateTimeIndex?
df = Daru::DataFrame.from_csv("df.csv")
df.set_index "time"
df.index = Daru::DateTimeIndex.new(df.index)
df
<Daru::DataFrame(5x2)>
min max
2018-01-01 101 103
2018-01-02 102 105
2018-01-03 103 200
2018-01-04 104 109
2018-01-05 105 110

difference between group sums pandas

DF:
fruits date amount
0 Apple 2018-01-01 100
1 Orange 2018-01-01 200
2 Apple 2018-01-01 150
3 Apple 2018-01-02 100
4 Orange 2018-01-02 100
5 Orange 2018-01-02 100
Code to create this:
f = [["Apple","2018-01-01",100],["Orange","2018-01-01",200],["Apple","2018-01-01",150],
["Apple","2018-01-02",100],["Orange","2018-01-02",100],["Orange","2018-01-02",100]]
df = pd.DataFrame(f,columns = ["fruits","date","amount"])
I am trying to aggregate the sales of fruits for each date and find the difference between the sums.
Expected output:
date        diff
2018-01-01    50
2018-01-02  -100
That is, find the sum of sales for Apple and for Orange on each date, then take the difference between the two sums.
I am able to find the sum:
df.groupby(["date","fruits"])["amount"].agg("sum")
date fruits
2018-01-01 Apple 250
Orange 200
2018-01-02 Apple 100
Orange 200
Name: amount, dtype: int64
Any suggestions on how to find the difference in pandas itself?
Use groupby on date and apply a lambda function:
df.groupby("date").apply(lambda x: x.loc[x['fruits']=='Apple','amount'].sum() -
x.loc[x['fruits']=='Orange','amount'].sum())
date
2018-01-01 50
2018-01-02 -100
dtype: int64
Or group the fruits separately and take the difference (note this computes Orange minus Apple, so the signs are flipped relative to the expected output; see the swap after this output):
A = df[df.fruits.isin(['Apple'])].groupby('date')['amount'].sum()
O = df[df.fruits.isin(['Orange'])].groupby('date')['amount'].sum()
O-A
date
2018-01-01 -50
2018-01-02 100
Name: amount, dtype: int64
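Swapping the subtraction gives the expected signs:
A - O
date
2018-01-01     50
2018-01-02   -100
Name: amount, dtype: int64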
def get_diff(grp):
    # groups are sorted alphabetically, so position 0 is Apple and position 1 is Orange
    grp = grp.groupby('fruits').agg('sum')['amount'].values
    return grp[0] - grp[1]

df.groupby('date').apply(get_diff)
Output
date
2018-01-01 50
2018-01-02 -100
Add unstack to reshape, then subtract using pop to extract the columns:
df = df.groupby(["date","fruits"])["amount"].sum().unstack()
df['diff'] = df.pop('Apple') - df.pop('Orange')
print (df)
fruits diff
date
2018-01-01 50
2018-01-02 -100

Re-sampling and interpolating data using pandas from a given date column to a different date column

Using pandas, I can mostly find conversions and down/upsampling from e.g. a daily date range to a monthly one, or from a monthly/yearly date range to a daily one.
Is there a way that given data for some arbitrary days one can map them to different days using interpolation/extrapolation?
Index.union, reindex, and interpolate
MCVE
Create toy data. Three rows every other day.
tidx = pd.date_range('2018-01-01', periods=3, freq='2D')
df = pd.DataFrame(dict(A=[1, 3, 5]), tidx)
df
A
2018-01-01 1
2018-01-03 3
2018-01-05 5
New index for those days in between
other_tidx = pd.date_range(tidx.min(), tidx.max()).difference(tidx)
Solution
Create a new index that is the union of the old index and the new index
union_idx = other_tidx.union(df.index)
When we reindex with this we get
df.reindex(union_idx)
A
2018-01-01 1.0
2018-01-02 NaN
2018-01-03 3.0
2018-01-04 NaN
2018-01-05 5.0
We see the gaps we expected. Now we can use interpolate. But we need to use the argument method='index' to ensure we interpolate relative to the size of the gaps in the index.
df.reindex(union_idx).interpolate('index')
A
2018-01-01 1.0
2018-01-02 2.0
2018-01-03 3.0
2018-01-04 4.0
2018-01-05 5.0
And now those gaps are filled.
We can reindex again to reduce to just the other index values
df.reindex(union_idx).interpolate('index').reindex(other_tidx)
A
2018-01-02 2.0
2018-01-04 4.0
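Putting the three steps together, a small helper (a sketch along the lines of the answer above; interpolate_to is a hypothetical name) could look like:
import pandas as pd

def interpolate_to(df, new_index):
    # reindex onto the union of old and new labels, interpolate by index
    # distance, then keep only the requested labels
    union_idx = df.index.union(new_index)
    return df.reindex(union_idx).interpolate(method='index').reindex(new_index)

other_tidx = pd.date_range(tidx.min(), tidx.max()).difference(tidx)
print(interpolate_to(df, other_tidx))
#               A
# 2018-01-02  2.0
# 2018-01-04  4.0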