I’m trying to sum the values in a column VAL for the last 14 days from T_DATE, by account.
My expression is
if([RND_FLG]=1, Sum([VAL]) over (Intersect([T_ACC],LastPeriods(14,[T_DATE]))), null)
Nine times out of ten the results are accurate, but not always.
Any help is appreciated.
Sample data below:
ALLDATE T_ACC VAL 14DAYVAL
12/13/2016 1501313137 500000 500000
12/15/2016 1501313137 800000 1300000
12/19/2016 1501313137 500000 1800000
12/20/2016 1501313137 500000 2300000
12/21/2016 1501313137 500000 2300000
12/22/2016 1501313137 500000 3300000
12/30/2016 1501313137 200000 3500000
You are probably getting incorrect results when you have gaps in your dates. LastPeriods() doesn't operate on the last n days; it aggregates over the last n rows, so any missing dates pull older rows into the window. You can normalize your data to have one row per date to get around this.
Try adding a rank column like Rank([T_DATE],[T_ACC]), then sum using over with Intersect() and LastPeriods() against that rank instead of the raw date.
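The rows-versus-days pitfall is easy to reproduce outside Spotfire. A minimal pandas sketch, using the sample data above, contrasts a 14-row window (what LastPeriods(14, ...) effectively gives you) with a true 14-calendar-day window:
import pandas as pd

# Sample data with a gap between 12/22 and 12/30, as in the question
df = pd.DataFrame({
    "T_DATE": pd.to_datetime(["2016-12-13", "2016-12-15", "2016-12-19",
                              "2016-12-20", "2016-12-21", "2016-12-22",
                              "2016-12-30"]),
    "VAL": [500000, 800000, 500000, 500000, 500000, 500000, 200000],
}).set_index("T_DATE")

# Row-based window: sums the last 14 rows regardless of how far apart they are
row_window = df["VAL"].rolling(14, min_periods=1).sum()

# Time-based window: sums only the last 14 calendar days, so gaps are handled
day_window = df["VAL"].rolling("14D").sum()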
I am working with the following dataframe. I have data for multiple companies, each row associated with a specific datadate, so there are many rows per company, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the Market yield on US Treasury securities at 10-year constant maturity - measured daily. Each row represents the return associated with a specific day, each day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the market yield on US Treasury securities at 10-year constant maturity (from dataframe 2) over the six months preceding that ipodate. The result should either be in a new dataframe or in an additional column in dataframe 1. The two dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with.
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])
# Calculate the mean market yield over the previous 6 months. Six months is not
# a fixed length of time, so I replaced it with 180 days.
tmp = df2.rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result. Note that this exact-match merge only fills rows whose ipodate
# appears in df2["date"]; yields are quoted only on trading days.
result = df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
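If some IPO dates fall on days with no Treasury quote (weekends, holidays), the exact merge above will leave NaNs. A sketch of an alternative, assuming both frames are sorted by their date columns, uses merge_asof to pick the most recent quote on or before each ipodate:
# Match each ipodate to the latest available rolling mean on or before it,
# instead of requiring an exact date match.
df1 = df1.sort_values("ipodate")
tmp = tmp.sort_values("date")
result = pd.merge_asof(df1, tmp, left_on="ipodate", right_on="date",
                       direction="backward")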
I need to look up the commission based on an amount. That is, given an amount like 300000, I want to get the commission for the slab that 300000 falls into.
SlabStartAmount  SlabEndAmount  CommissionAmount
100000           200000         62.5
2000001          5000000        75
5000001          7500000        81.25
7500001          10000000       87.5
10000001         0              100
You can use a range condition to find the slab. Since the last slab uses 0 as its SlabEndAmount to mark an open-ended range, it needs its own check:
WHERE #Amount BETWEEN SlabStartAmount AND SlabEndAmount
   OR (SlabEndAmount = 0 AND #Amount >= SlabStartAmount)
See this fiddle.
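For illustration, here is a minimal Python sketch of the same lookup, with the slab values copied from the table above. Note that, with the table exactly as posted, 300000 falls into the gap between 200000 and 2000001, so the second slab's start may be intended as 200001:
# Slab table from the question: (start, end, commission); end == 0 marks the
# open-ended top slab.
slabs = [
    (100000, 200000, 62.5),
    (2000001, 5000000, 75),
    (5000001, 7500000, 81.25),
    (7500001, 10000000, 87.5),
    (10000001, 0, 100),
]

def commission_for(amount):
    for start, end, commission in slabs:
        if start <= amount <= end or (end == 0 and amount >= start):
            return commission
    return None  # amount falls outside every slab

print(commission_for(6000000))   # 81.25
print(commission_for(12000000))  # 100 (open-ended top slab)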
I'm using Microsoft Access and I need a SQL query to return the top x (40 in my case) most recent sales for each neighborhood (NBHD). My data looks something like this:
PARID PRICE SALEDT SALEVAL NBHD
04021000 140000 1/29/2016 11 700
04021000 160000 2/16/2016 11 700
04018470 250000 4/23/2015 08 701
04018470 300000 4/23/2015 08 701
04016180 40000 5/9/2017 11 705
04023430 600000 6/12/2017 19 700
What I need is the top 40 most recent SALEDT entries for each NBHD; if the same PARID would show up in that top 40 twice or more, I want only the most recent one. If rows have the same PARID and the same SALEDT, I want only the most expensive one. For this small set of sample data, I would get:
PARID PRICE SALEDT SALEVAL NBHD
04021000 160000 2/16/2016 11 700
04023430 600000 6/12/2017 19 700
04018470 300000 4/23/2015 08 701
04016180 40000 5/9/2017 11 705
I keep row 2 (as it has a later SALEDT than row 1), row 4 (as it has a higher PRICE than row 3), and rows 5 and 6. Hopefully that is clear. I'm using MS Access SQL for this, but wouldn't be opposed to a VBA solution if that is easier. Thanks in advance.
Here you go. This deduplicates by PARID, keeping the latest SALEDT and, for date ties, the highest PRICE; note that on its own it does not limit the output to 40 rows per NBHD:
select a.parid, max(a.price) as price, a.saledt, a.saleval, a.nbhd
from #table a
join (select parid, max(saledt) as saledt
      from #table
      group by parid) b
  on a.parid = b.parid and a.saledt = b.saledt
group by a.parid, a.saledt, a.saleval, a.nbhd
order by a.nbhd
In MS Access, you can do the following to get the 40 most recent entries for each neighborhood:
select t.*
from t
where t.saledt in (select top 40 t2.saledt
                   from t as t2
                   where t2.nbhd = t.nbhd
                   order by t2.saledt desc
                  );
Your additional constraints are rather confusing. I'm not sure I fully follow them because I don't know what the columns really refer to.
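For what it's worth, the intended logic can be cross-checked outside Access. A pandas sketch, using the sample rows from the question, would be:
import pandas as pd

sales = pd.DataFrame({
    "PARID": ["04021000", "04021000", "04018470",
              "04018470", "04016180", "04023430"],
    "PRICE": [140000, 160000, 250000, 300000, 40000, 600000],
    "SALEDT": pd.to_datetime(["1/29/2016", "2/16/2016", "4/23/2015",
                              "4/23/2015", "5/9/2017", "6/12/2017"]),
    "SALEVAL": ["11", "11", "08", "08", "11", "19"],
    "NBHD": [700, 700, 701, 701, 705, 700],
})

# Per PARID keep the latest SALEDT, breaking date ties by highest PRICE
dedup = sales.sort_values(["SALEDT", "PRICE"]).groupby("PARID").tail(1)

# Per NBHD keep the 40 most recent remaining sales
top40 = dedup.sort_values("SALEDT", ascending=False).groupby("NBHD").head(40)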
I want to make a regression model from this dataset (the first two columns are the independent variables and the last one is the dependent variable). I have imported the dataset using dataset = pd.read_csv('data.csv').
I have built regression models before, but never with dates as an independent variable, so how should I handle this date format to build the model?
Also, how should I handle the 0 values in the given dataset?
My dataset, in .csv format, looks like:
Month/Day, Sales, Revenue
01/01, 0, 0
01/02, 100000, 0
01/03, 400000, 0
01/06, 300000, 0
01/07, 950000, 1000000
01/08, 10000, 15000
01/10, 909000, 1000000
01/30, 12200, 12000
02/01, 950000, 1000000
02/09, 10000, 15000
02/13, 909000, 1000000
02/15, 12200, 12000
I don't know how to handle this date format and the 0 values.
Here's a start. I saved your data into a file and stripped all the whitespace.
import pandas as pd
df = pd.read_csv('20180112-2.csv')
df['Month/Day'] = pd.to_datetime(df['Month/Day'], format = '%m/%d')
print(df)
Output:
Month/Day Sales Revenue
0 1900-01-01 0 0
1 1900-01-02 100000 0
2 1900-01-03 400000 0
3 1900-01-06 300000 0
4 1900-01-07 950000 1000000
5 1900-01-08 10000 15000
6 1900-01-10 909000 1000000
7 1900-01-30 12200 12000
8 1900-02-01 950000 1000000
9 1900-02-09 10000 15000
10 1900-02-13 909000 1000000
11 1900-02-15 12200 12000
The year defaults to 1900 since it is not provided in your data. To change the year, see: Pandas: Change day
df['Month/Day'] = df['Month/Day'].apply(lambda d: d.replace(year=2017))
print(df)
Output:
Month/Day Sales Revenue
0 2017-01-01 0 0
1 2017-01-02 100000 0
2 2017-01-03 400000 0
3 2017-01-06 300000 0
4 2017-01-07 950000 1000000
5 2017-01-08 10000 15000
6 2017-01-10 909000 1000000
7 2017-01-30 12200 12000
8 2017-02-01 950000 1000000
9 2017-02-09 10000 15000
10 2017-02-13 909000 1000000
11 2017-02-15 12200 12000
Finally, to find the correlation between columns, just use df.corr():
print(df.corr())
Output:
Sales Revenue
Sales 1.000000 0.953077
Revenue 0.953077 1.000000
How to handle missing data?
There are a number of ways to replace it: by mean, by median, with a moving-average window, or even with an RF approach (or something similar, such as MICE).
For the 'Sales' column you can try any of these methods.
For the 'Revenue' column it is better not to use any of them, especially if you have many missing values (it will harm the model). Just remove the rows with missing values in the 'Revenue' column.
By the way, a few ML methods accept missing values, such as XGBoost and, to some extent, trees/forests. For the latter you may replace zeros with some very different value like -999999.
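A minimal pandas sketch of that suggestion, assuming the df from the previous answer and treating 0 as missing:
import numpy as np

# Treat 0 as missing in both columns
df[['Sales', 'Revenue']] = df[['Sales', 'Revenue']].replace(0, np.nan)

# Impute 'Sales' with the median; drop rows with missing 'Revenue'
df['Sales'] = df['Sales'].fillna(df['Sales'].median())
df = df.dropna(subset=['Revenue'])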
What to do with the data?
Many things related to feature engineering can be done here:
1. Day of week
2. Weekday or weekend
3. Day in month (number)
4. Pre- or post-holiday
5. Week number
6. Month number
7. Year number
8. Indication of some factors (for example, if it is fruit sales data you can add some boolean columns related to it)
9. And so on...
Almost every feature here should be preprocessed via one-hot encoding, and of course cleaned of correlations if you use linear models.
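A minimal sketch of a few of these features in pandas, assuming the parsed 'Month/Day' column and the pandas import from the earlier answer:
# Derive calendar features from the datetime column
df['day_of_week'] = df['Month/Day'].dt.dayofweek           # day of week (0 = Monday)
df['is_weekend'] = df['day_of_week'] >= 5                  # weekday or weekend
df['day_in_month'] = df['Month/Day'].dt.day                # day in month
df['week_number'] = df['Month/Day'].dt.isocalendar().week  # week number
df['month'] = df['Month/Day'].dt.month                     # month number

# One-hot encode the categorical features, e.g. for a linear model
features = pd.get_dummies(df, columns=['day_of_week', 'month'])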
I have a requirement with the table below.
Conditions:
1. Take the average of the salaries of clients whose dates of birth are one day apart.
2. If a client has no other client with a DOB exactly one day away, that client should not be considered.
Please see the expected results.
Table:
ClientID ClinetDOB's Slaries
1 2012-03-14 300
2 2012-04-11 400
3 2012-05-09 200
4 2012-06-06 400
5 2012-07-30 600
6 2012-08-14 1200
7 2012-08-15 1800
8 2012-08-17 1200
9 2012-08-20 2400
10 2012-08-21 1500
The result should look like this:
ClientID ClinetDOB's AVG(Slaries)
7 2012-08-15 1500 -- avg of 1200 and 1800 (ClientIDs 6 and 7 have DOBs one day apart)
10 2012-08-20 1950 -- avg of 2400 and 1500 (ClientIDs 9 and 10 have DOBs one day apart)
Please help.
Thank you in advance!
A self-join connects each record with all records dated one day earlier. In this context, GROUP BY allows several records sharing the same date to be counted. t1's own salary needs to be accounted for separately, so it is added in afterwards, and count(*) is incremented by one to compute the average.
Here is a SQL Fiddle with an example.
select t1.ClientID,
       t1.ClinetDOBs,
       (t1.Slaries + sum(t2.Slaries)) / (count(*) + 1) as Avg_Slaries
from table1 t1
inner join table1 t2
  on t1.ClinetDOBs = dateadd(day, 1, t2.ClinetDOBs)
group by t1.ClientID,
         t1.ClinetDOBs,
         t1.Slaries
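For illustration, a pandas sketch of the same self-join idea (column names copied from the question, data trimmed to the matching clients):
import pandas as pd

clients = pd.DataFrame({
    "ClientID": [6, 7, 9, 10],
    "ClinetDOBs": pd.to_datetime(["2012-08-14", "2012-08-15",
                                  "2012-08-20", "2012-08-21"]),
    "Slaries": [1200, 1800, 2400, 1500],
})

# Shift t2's DOB forward one day, so an exact merge pairs each client (t1)
# with every client born exactly one day earlier (t2)
t2 = clients.assign(ClinetDOBs=clients["ClinetDOBs"] + pd.Timedelta(days=1))
pairs = clients.merge(t2, on="ClinetDOBs", suffixes=("", "_prev"))

# Add t1's own salary and increment the count, as in the SQL above
avg = (pairs.groupby(["ClientID", "ClinetDOBs", "Slaries"])["Slaries_prev"]
            .agg(["sum", "count"]).reset_index())
avg["Avg_Slaries"] = (avg["Slaries"] + avg["sum"]) / (avg["count"] + 1)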