I have a data frame of the format
county ones emplvl
date
2003-01-01 1001 1 10955.000000
2003-04-01 1001 1 11090.333333
2003-07-01 1001 1 11157.000000
2003-10-01 1001 1 11335.666667
2004-01-01 1001 1 11045.000000
2004-04-01 1001 1 11175.666667
2004-07-01 1001 1 11135.666667
2004-10-01 1001 1 11480.333333
2005-01-01 1001 1 11441.000000
2005-04-01 1001 1 11531.000000
2005-07-01 1001 1 11320.000000
2005-10-01 1001 1 11516.666667
2006-01-01 1001 1 11291.000000
2006-04-01 1001 1 11223.000000
2006-07-01 1001 1 11230.000000
2006-10-01 1001 1 11293.000000
2007-01-01 1001 1 11126.666667
2007-04-01 1001 1 11383.666667
2007-07-01 1001 1 11535.666667
2007-10-01 1001 1 11567.333333
2008-01-01 1001 1 11226.666667
2008-04-01 1001 1 11342.000000
2008-07-01 1001 1 11201.666667
2008-10-01 1001 1 11321.000000
2009-01-01 1001 1 11082.333333
2009-04-01 1001 1 11099.000000
2009-07-01 1001 1 10905.666667
2009-10-01 1001 1 10928.333333
2010-01-01 1001 1 10616.000000
2010-04-01 1001 1 10746.333333
2010-07-01 1001 1 10652.333333
2010-10-01 1001 1 10761.000000
2011-01-01 1001 1 10659.000000
2011-04-01 1001 1 10821.000000
2011-07-01 1001 1 10442.666667
2011-10-01 1001 1 10585.333333
2012-01-01 1001 1 10065.333333
2012-04-01 1001 1 10172.666667
2012-07-01 1001 1 10042.000000
2012-10-01 1001 1 10267.666667
and I would like to run a regression for each group (fitted on values before 2007) and then add predicted values for the whole time period. The code I have right now iterates over each group. As I have hundreds of groups, it takes quite a long time to run:
import pandas as pd
import statsmodels.api as sm

def predictedValues(group):
    # Fit only on observations before 2007
    sub = group[group.year < 2007]
    if len(sub) == 0:
        return None
    regression = sm.OLS(sub.emplvl, sub[['ones', 'quarter_index']], hasconst=True).fit()
    # Predict for the whole time period of the group
    result = regression.predict(group[['ones', 'quarter_index']])
    result = pd.DataFrame(data=result, columns=['predicted'], index=group.index)
    return result

result = df.groupby(['county']).apply(predictedValues)
What is a more efficient way to do this? I'd prefer statsmodels over pandas, as pandas.ols is deprecated.
A little more efficient
The following runs quite quickly, but it isn't typical pandas code, so I'd still welcome improvements:
for county in df.county.unique():
    group = df.loc[df.county == county]
    df.loc[df.county == county, 'predicted'] = predictedValues(group)
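Because each model here has only an intercept and one regressor, the fit has a closed form, which is one possible way to avoid building a statsmodels object per county. The sketch below, in plain Python, shows just the "fit on a subset, predict for every row" step. `fit_and_predict` is a hypothetical helper, the numbers are made-up toy values, and treating `quarter_index` as the only regressor is an assumption based on the code above. It is not the statsmodels API.

```python
def fit_and_predict(x, y, train_mask):
    """Fit y = a + b*x on rows where train_mask is True, then predict for all rows."""
    xs = [xi for xi, keep in zip(x, train_mask) if keep]
    ys = [yi for yi, keep in zip(y, train_mask) if keep]
    if len(xs) < 2:
        return None  # not enough points to fit a line
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    if sxx == 0:
        return None  # all x values identical, slope undefined
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Predictions cover every row, including those excluded from the fit
    return [intercept + slope * xi for xi in x]


# Toy data (not from the question): the first four points lie on y = 1 + 2x
quarter_index = [0, 1, 2, 3, 4, 5]
emplvl = [1.0, 3.0, 5.0, 7.0, 20.0, 30.0]
before_2007 = [True, True, True, True, False, False]
predicted = fit_and_predict(quarter_index, emplvl, before_2007)
```

The last two observations are ignored by the fit but still receive predictions, which mirrors what `predictedValues` does for the post-2007 rows.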
I have the following table:
id name start end
1 Asla 2021-01-01 2021-12-31
1 Asla 2022-01-01 2022-04-15
2 Tina 2021-05-16 2021-09-23
3 Layla 2021-01-01 2021-09-27
3 Layla 2022-01-01 2022-07-18
2 Sim 2020-05-12 2020-08-13
3 Anderas 2021-07-01 2021-09-13
3 Anderas 2021-10-01 2021-11-18
3 Anderas 2022-01-01 2029-11-18
4 Klara 2022-01-01 null
What I want is to get the people who had work during 2021 and add a new column showing a status: 'ok' if the person continues to have work during 2022, otherwise 'not ok', and 'new' if the person is new, like Klara. I also want to show only the last record for every person. Maybe rows where end = null need handling too?
I tried this:
select w.id ,w.name ,w.start ,w.end, max_date.end
from Work_date w
left join (select * from Work_date where start>='2022-01-01')max_date on max_date.id=id
where w.start>='2021-01-01'
But the problem is that I get this result:
<pre>
id name start end
1 Asla 2021-01-01 null
1 Asla 2022-01-01 2022-04-15
2 Tina 2021-05-16 null
3 Layla 2021-01-01 null
3 Layla 2022-01-01 2022-07-18
3 Anderas 2021-07-01 null
3 Anderas 2021-10-01 2021-11-18
3 Anderas 2022-01-01 null
4 Klara 2022-01-01 null
</pre>
But I want to get this result:
<pre>
id name start end status
1 Asla 2022-01-01 2022-04-15 ok
2 Tina 2021-05-16 2021-09-23 not ok
3 Layla 2022-01-01 2022-07-18 ok
3 Anderas 2022-01-01 2029-11-18 ok
4 Klara 2022-01-01 null ok
</pre>
Looks like you can simply aggregate.
Then use a CASE WHEN for the status.
select
w.id
, w.name
, max(w.start) as start
, max(w.end) as end
, case
when year(max(end)) < 2022 then 'not ok'
else 'ok'
end as status
from Work_date w
where w.start >= '2021-01-01'
group by w.id, w.name
order by w.id, max(w.start), max(w.end);
ID NAME START END STATUS
1 Asla 2022-01-01 2022-04-15 ok
2 Tina 2021-05-16 2021-09-23 not ok
3 Layla 2022-01-01 2022-07-18 ok
3 Anderas 2022-01-01 2029-11-18 ok
4 Klara 2022-01-01 null ok
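To make the aggregate-then-CASE logic easier to follow, here is a minimal plain-Python sketch of the same steps on the question's rows. It is only an illustration of what the query computes, not a replacement for it; `None` stands in for NULL, and the variable names are my own.

```python
from collections import defaultdict

# Rows from the question: (id, name, start, end); None stands for NULL
rows = [
    (1, "Asla", "2021-01-01", "2021-12-31"),
    (1, "Asla", "2022-01-01", "2022-04-15"),
    (2, "Tina", "2021-05-16", "2021-09-23"),
    (3, "Layla", "2021-01-01", "2021-09-27"),
    (3, "Layla", "2022-01-01", "2022-07-18"),
    (2, "Sim", "2020-05-12", "2020-08-13"),
    (3, "Anderas", "2021-07-01", "2021-09-13"),
    (3, "Anderas", "2021-10-01", "2021-11-18"),
    (3, "Anderas", "2022-01-01", "2029-11-18"),
    (4, "Klara", "2022-01-01", None),
]

# WHERE start >= '2021-01-01' and GROUP BY id, name
# (ISO-formatted dates compare correctly as strings)
groups = defaultdict(list)
for pid, name, start, end in rows:
    if start >= "2021-01-01":
        groups[(pid, name)].append((start, end))

result = []
for (pid, name), spans in groups.items():
    max_start = max(s for s, _ in spans)
    ends = [e for _, e in spans if e is not None]  # MAX() ignores NULLs
    max_end = max(ends) if ends else None
    # A NULL end never satisfies "< 2022", so it falls through to 'ok'
    status = "not ok" if max_end is not None and int(max_end[:4]) < 2022 else "ok"
    result.append((pid, name, max_start, max_end, status))

# ORDER BY id, max(start), max(end)
result.sort(key=lambda r: (r[0], r[2], r[3] or ""))
```

Note that Klara ends up as 'ok' only because a NULL end date fails the "not ok" test, which is the same reason the SQL returns 'ok' for her.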
I have a dataframe as follows:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1
XYZ 3/3/2020 1
XYZ 3/4/2020 1
XYZ 3/5/2020 1
XYZ 3/5/2020 0
XYZ 3/6/2020 1
XYZ 3/8/2020 1
ABC 3/9/2020 0
ABC 3/10/2020 1
ABC 3/11/2020 0
ABC 3/12/2020 1
The relTweet column shows whether the tweet is relevant (1) or not (0).
I need to find the difference in days (GaplastRel) between successive rows for each company, with the condition that the previous tweet must be a relevant one (i.e. relTweet = 1). For example, for the first record GaplastRel should be 0. For the 2nd record, GaplastRel should be 1, as the last relevant tweet was made one day ago.
Below is the example of needed output:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1 0
XYZ 3/3/2020 1 1
XYZ 3/4/2020 1 1
XYZ 3/5/2020 1 1
XYZ 3/5/2020 0 1
XYZ 3/6/2020 1 1
XYZ 3/8/2020 1 2
ABC 3/9/2020 0 0
ABC 3/10/2020 1 0
ABC 3/11/2020 0 1
ABC 3/12/2020 1 2
Following is my code:
dataDf['Date'] = pd.to_datetime(dataDf['Date'], format='%m/%d/%Y')
# Store the per-company day difference in GaplastRel (relTweet stays untouched)
dataDf['GaplastRel'] = (dataDf.groupby('Company', group_keys=False)
                        .apply(lambda g: g['Date'].diff().replace(0, np.nan).ffill()))
This code gives the difference in days between successive rows for each company without considering the relTweet = 1 condition. I am not sure how to apply the condition.
Following is the output of the above code:
Company Date relTweet GaplastRel
XYZ 3/2/2020 1 NaT
XYZ 3/3/2020 1 1 days
XYZ 3/4/2020 1 1 days
XYZ 3/5/2020 1 1 days
XYZ 3/5/2020 0 0 days
XYZ 3/6/2020 1 1 days
XYZ 3/8/2020 1 2 days
ABC 3/9/2020 0 NaT
ABC 3/10/2020 1 1 days
ABC 3/11/2020 0 1 days
ABC 3/12/2020 1 1 days
Try thinking about it differently: sometimes merge_asof is a better fit than groupby.
df1=df.loc[df['relTweet']==1,['Company','Date']]
df=pd.merge_asof(df,df1.assign(Date1=df1.Date),by='Company',on='Date', allow_exact_matches=False)
df['GaplastRel']=(df.Date-df.Date1).dt.days.fillna(0)
df
Out[31]:
Company Date relTweet Date1 GaplastRel
0 XYZ 2020-03-02 1 NaT 0.0
1 XYZ 2020-03-03 1 2020-03-02 1.0
2 XYZ 2020-03-04 1 2020-03-03 1.0
3 XYZ 2020-03-05 1 2020-03-04 1.0
4 XYZ 2020-03-05 0 2020-03-04 1.0
5 XYZ 2020-03-06 1 2020-03-05 1.0
6 XYZ 2020-03-08 1 2020-03-06 2.0
7 ABC 2020-03-09 0 NaT 0.0
8 ABC 2020-03-10 1 NaT 0.0
9 ABC 2020-03-11 0 2020-03-10 1.0
10 ABC 2020-03-12 1 2020-03-10 2.0
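To show what the as-of merge is doing here, the following plain-Python sketch reproduces the lookup by hand: for each row it finds the latest relevant date strictly before that row's date within the same company, which is the role of `allow_exact_matches=False`. This is only an illustration of the matching rule using the standard library, not how pandas implements it, and the variable names are my own.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date

# (Company, Date, relTweet) rows from the question
rows = [
    ("XYZ", date(2020, 3, 2), 1),
    ("XYZ", date(2020, 3, 3), 1),
    ("XYZ", date(2020, 3, 4), 1),
    ("XYZ", date(2020, 3, 5), 1),
    ("XYZ", date(2020, 3, 5), 0),
    ("XYZ", date(2020, 3, 6), 1),
    ("XYZ", date(2020, 3, 8), 1),
    ("ABC", date(2020, 3, 9), 0),
    ("ABC", date(2020, 3, 10), 1),
    ("ABC", date(2020, 3, 11), 0),
    ("ABC", date(2020, 3, 12), 1),
]

# Sorted dates of relevant tweets per company (the role of df1)
relevant = defaultdict(list)
for company, day, rel in rows:
    if rel == 1:
        relevant[company].append(day)
for days in relevant.values():
    days.sort()

gaps = []
for company, day, _ in rows:
    days = relevant[company]
    i = bisect_left(days, day)  # count of relevant dates strictly before `day`
    # No earlier relevant tweet -> 0, matching the fillna(0) in the answer
    gaps.append((day - days[i - 1]).days if i > 0 else 0)
```

Using `bisect_left` rather than `bisect_right` is what excludes a relevant tweet made on the same day, so the second 3/5 row looks back to 3/4.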
I have a table that looks like:
id code date1 date2 block
--------------------------------------------------
20 1234 2017-07-01 2017-07-31 1
15 1234 2017-06-01 2017-06-30 1
13 1234 2017-05-01 2017-05-31 0
11 1234 2017-03-01 2017-03-31 0
9 1234 2017-02-01 2017-02-28 1
8 1234 2017-01-01 2017-01-31 0
7 1234 2016-11-01 2016-11-30 0
6 1234 2016-10-01 2016-10-31 1
2 1234 2016-09-01 2016-09-30 1
I need to rank the rows according to the blocks of 0's and 1's, like:
id code date1 date2 block desired_rank
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 1
13 1234 2017-05-01 2017-05-31 0 2
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 4
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 5
2 1234 2016-09-01 2016-09-30 1 5
I've tried to use rank() and dense_rank(), but the result I end up with is:
id code date1 date2 block dense_rank()
-------------------------------------------------------------------
20 1234 2017-07-01 2017-07-31 1 1
15 1234 2017-06-01 2017-06-30 1 2
13 1234 2017-05-01 2017-05-31 0 1
11 1234 2017-03-01 2017-03-31 0 2
9 1234 2017-02-01 2017-02-28 1 3
8 1234 2017-01-01 2017-01-31 0 3
7 1234 2016-11-01 2016-11-30 0 4
6 1234 2016-10-01 2016-10-31 1 4
2 1234 2016-09-01 2016-09-30 1 5
In the last table, the rank doesn't care about the rows, it just takes all the 1's and 0's as a unit and sets an ascending count starting at the first 1 and 0.
My query goes like this:
CREATE TEMP TABLE data (id integer,code text, date1 date, date2 date, block integer);
INSERT INTO data VALUES
(20,'1234', '2017-07-01','2017-07-31',1),
(15,'1234', '2017-06-01','2017-06-30',1),
(13,'1234', '2017-05-01','2017-05-31',0),
(11,'1234', '2017-03-01','2017-03-31',0),
(9, '1234', '2017-02-01','2017-02-28',1),
(8, '1234', '2017-01-01','2017-01-31',0),
(7, '1234', '2016-11-01','2016-11-30',0),
(6, '1234', '2016-10-01','2016-10-31',1),
(2, '1234', '2016-09-01','2016-09-30',1);
SELECT *,dense_rank() OVER (PARTITION BY code,block ORDER BY date2 DESC)
FROM data
ORDER BY date2 DESC;
By the way, the database is PostgreSQL.
I hope there's a workaround... Thanks :)
Edit: Note that the blocks of 0's and 1's aren't all the same length.
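There's no way to get this result using a single window function. You need two nested ones: an inner LAG to flag the first row of each block, and an outer running SUM over those flags: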
SELECT *,
Sum(flag) -- now sum the 0/1 to create the "rank"
Over (PARTITION BY code
ORDER BY date2 DESC)
FROM
(
SELECT *,
CASE
WHEN Lag(block) -- check if this is the 1st row of a new block
Over (PARTITION BY code
ORDER BY date2 DESC) = block
THEN 0
ELSE 1
END AS flag
FROM DATA
) AS dt
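The flag-then-running-sum idea in the query above can be sketched in a few lines of plain Python, which may make the two steps easier to see. The `blocks` list is the question's `block` column in `date2 DESC` order; the variable names are my own.

```python
from itertools import accumulate

# block values ordered by date2 DESC, as in the question
blocks = [1, 1, 0, 0, 1, 0, 0, 1, 1]

# Step 1 (the LAG comparison): flag = 1 on the first row of each run,
# i.e. when there is no previous row or the previous value differs
flags = [1 if i == 0 or blocks[i - 1] != b else 0 for i, b in enumerate(blocks)]

# Step 2 (the running SUM): the cumulative total of the flags is the rank
ranks = list(accumulate(flags))
```

The first row gets a flag of 1 because it has no predecessor, which corresponds to LAG returning NULL and the CASE falling through to ELSE 1.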
I have two SQL Server tables containing the following information:
Table t_venues:
venue_id is unique
venue_id | start_date | end_date
1 | 01/01/2014 | 02/01/2014
2 | 05/01/2014 | 05/01/2014
3 | 09/01/2014 | 15/01/2014
4 | 20/01/2014 | 30/01/2014
Table t_venueuser:
venue_id is not unique
venue_id | start_date | end_date
1 | 02/01/2014 | 02/01/2014
2 | 05/01/2014 | 05/01/2014
3 | 09/01/2014 | 10/01/2014
4 | 23/01/2014 | 25/01/2014
From these two tables I need to find the dates that haven't been selected for each range, so the output would look like this:
venue_id | start_date | end_date
1 | 01/01/2014 | 01/01/2014
3 | 11/01/2014 | 15/01/2014
4 | 20/01/2014 | 22/01/2014
4 | 26/01/2014 | 30/01/2014
I can compare the two tables and get the date ranges from t_venues to appear in my query using 'except' but I can't get the query to produce the non-selected dates. Any help would be appreciated.
Calendar Table!
Another perfect candidate for a calendar table. If you can't be bothered to search for one, here's one I made earlier.
Setup Data
DECLARE @t_venues table (
    venue_id int
  , start_date date
  , end_date date
);
INSERT INTO @t_venues (venue_id, start_date, end_date)
VALUES (1, '2014-01-01', '2014-01-02')
     , (2, '2014-01-05', '2014-01-05')
     , (3, '2014-01-09', '2014-01-15')
     , (4, '2014-01-20', '2014-01-30')
;
DECLARE @t_venueuser table (
    venue_id int
  , start_date date
  , end_date date
);
INSERT INTO @t_venueuser (venue_id, start_date, end_date)
VALUES (1, '2014-01-02', '2014-01-02')
     , (2, '2014-01-05', '2014-01-05')
     , (3, '2014-01-09', '2014-01-10')
     , (4, '2014-01-23', '2014-01-25')
;
The Query
SELECT t_venues.venue_id
     , calendar.the_date
     , CASE WHEN t_venueuser.venue_id IS NULL THEN 1 ELSE 0 END As is_available
FROM dbo.calendar /* see: http://gvee.co.uk/files/sql/dbo.numbers%20&%20dbo.calendar.sql for an example */
INNER
JOIN @t_venues As t_venues
  ON t_venues.start_date <= calendar.the_date
 AND t_venues.end_date >= calendar.the_date
LEFT
JOIN @t_venueuser As t_venueuser
  ON t_venueuser.venue_id = t_venues.venue_id
 AND t_venueuser.start_date <= calendar.the_date
 AND t_venueuser.end_date >= calendar.the_date
ORDER
   BY t_venues.venue_id
    , calendar.the_date
;
The Result
venue_id the_date is_available
----------- ----------------------- ------------
1 2014-01-01 00:00:00.000 1
1 2014-01-02 00:00:00.000 0
2 2014-01-05 00:00:00.000 0
3 2014-01-09 00:00:00.000 0
3 2014-01-10 00:00:00.000 0
3 2014-01-11 00:00:00.000 1
3 2014-01-12 00:00:00.000 1
3 2014-01-13 00:00:00.000 1
3 2014-01-14 00:00:00.000 1
3 2014-01-15 00:00:00.000 1
4 2014-01-20 00:00:00.000 1
4 2014-01-21 00:00:00.000 1
4 2014-01-22 00:00:00.000 1
4 2014-01-23 00:00:00.000 0
4 2014-01-24 00:00:00.000 0
4 2014-01-25 00:00:00.000 0
4 2014-01-26 00:00:00.000 1
4 2014-01-27 00:00:00.000 1
4 2014-01-28 00:00:00.000 1
4 2014-01-29 00:00:00.000 1
4 2014-01-30 00:00:00.000 1
(21 row(s) affected)
The Explanation
Our calendar table contains an entry for every date.
We join our t_venues (as an aside, if you have the choice, lose the t_ prefix!) to return every day between our start_date and end_date. Example output for venue_id=4 for just this join:
venue_id the_date
----------- -----------------------
4 2014-01-20 00:00:00.000
4 2014-01-21 00:00:00.000
4 2014-01-22 00:00:00.000
4 2014-01-23 00:00:00.000
4 2014-01-24 00:00:00.000
4 2014-01-25 00:00:00.000
4 2014-01-26 00:00:00.000
4 2014-01-27 00:00:00.000
4 2014-01-28 00:00:00.000
4 2014-01-29 00:00:00.000
4 2014-01-30 00:00:00.000
(11 row(s) affected)
Now that we have one row per day, we [outer] join our t_venueuser table. We join this in much the same manner as before, but with one added twist: we need to join based on the venue_id too!
Running this for venue_id=4 gives this result:
venue_id the_date t_venueuser_venue_id
----------- ----------------------- --------------------
4 2014-01-20 00:00:00.000 NULL
4 2014-01-21 00:00:00.000 NULL
4 2014-01-22 00:00:00.000 NULL
4 2014-01-23 00:00:00.000 4
4 2014-01-24 00:00:00.000 4
4 2014-01-25 00:00:00.000 4
4 2014-01-26 00:00:00.000 NULL
4 2014-01-27 00:00:00.000 NULL
4 2014-01-28 00:00:00.000 NULL
4 2014-01-29 00:00:00.000 NULL
4 2014-01-30 00:00:00.000 NULL
(11 row(s) affected)
See how we have a NULL value for rows where there is no t_venueuser record. Genius, no? ;-)
So in my first query I gave you a quick CASE statement that shows availability (1=available, 0=not available). This is for illustration only, but could be useful to you.
You can then either wrap the query up and then apply an extra filter on this calculated column or simply add a where clause in: WHERE t_venueuser.venue_id IS NULL and that will do the same trick.
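The two joins and the final IS NULL filter can be mimicked in plain Python, which may help when checking the logic outside the database. This is only a sketch of the idea: the inner loop plays the part of the calendar join, the `any(...)` test plays the part of the outer join, and the names `days_between` and `free_days` are my own.

```python
from datetime import date, timedelta

# venue_id -> (start_date, end_date), from t_venues
venues = {
    1: (date(2014, 1, 1), date(2014, 1, 2)),
    2: (date(2014, 1, 5), date(2014, 1, 5)),
    3: (date(2014, 1, 9), date(2014, 1, 15)),
    4: (date(2014, 1, 20), date(2014, 1, 30)),
}
# (venue_id, start_date, end_date), from t_venueuser
bookings = [
    (1, date(2014, 1, 2), date(2014, 1, 2)),
    (2, date(2014, 1, 5), date(2014, 1, 5)),
    (3, date(2014, 1, 9), date(2014, 1, 10)),
    (4, date(2014, 1, 23), date(2014, 1, 25)),
]


def days_between(start, end):
    """Yield every date from start to end inclusive (the calendar join)."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


free_days = []
for venue_id, (start, end) in sorted(venues.items()):
    for day in days_between(start, end):
        # Outer join: is there a booking for this venue covering this day?
        booked = any(v == venue_id and s <= day <= e for v, s, e in bookings)
        if not booked:  # same effect as WHERE t_venueuser.venue_id IS NULL
            free_days.append((venue_id, day))
```

On the sample data this leaves 14 free days: one for venue 1, none for venue 2, five for venue 3 and eight for venue 4.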
This is a complete hack, but it gives the results you require. I've only tested it on the data you provided, so there may well be gotchas with larger sets.
In general, what you are trying to solve here is a variation of the gaps-and-islands problem. Briefly, this is a sequence in which some items are missing: the missing items are referred to as gaps and the existing items are referred to as islands. If you would like to understand the problem in general, check a few of these articles:
Simple talk article
blogs.MSDN article
SO answers tagged gaps-and-islands
Code:
;with dates as
(
SELECT vdates.venue_id,
vdates.vdate
FROM ( SELECT DATEADD(d,sv.number,v.start_date) vdate
, v.venue_id
FROM t_venues v
INNER JOIN master..spt_values sv
ON sv.type='P'
AND sv.number BETWEEN 0 AND datediff(d, v.start_date, v.end_date)) vdates
LEFT JOIN t_venueuser vu
ON vdates.vdate >= vu.start_date
AND vdates.vdate <= vu.end_date
AND vdates.venue_id = vu.venue_id
WHERE ISNULL(vu.venue_id,-1) = -1
)
SELECT venue_id, ISNULL([1],[2]) StartDate, [2] EndDate
FROM (SELECT venue_id, rDate, ROW_NUMBER() OVER (PARTITION BY venue_id, DateType ORDER BY rDate) AS rType, DateType as dType
FROM( SELECT d1.venue_id
,d1.vdate AS rDate
,'1' AS DateType
FROM dates AS d1
LEFT JOIN dates AS d0
ON DATEADD(d,-1,d1.vdate) = d0.vdate
LEFT JOIN dates AS d2
ON DATEADD(d,1,d1.vdate) = d2.vdate
WHERE CASE ISNULL(d2.vdate, '01 Jan 1753') WHEN '01 Jan 1753' THEN '2' ELSE '1' END = 1
AND ISNULL(d0.vdate, '01 Jan 1753') = '01 Jan 1753'
UNION
SELECT d1.venue_id
,ISNULL(d2.vdate,d1.vdate)
,'2'
FROM dates AS d1
LEFT JOIN dates AS d2
ON DATEADD(d,1,d1.vdate) = d2.vdate
WHERE CASE ISNULL(d2.vdate, '01 Jan 1753') WHEN '01 Jan 1753' THEN '2' ELSE '1' END = 2
) res
) src
PIVOT (MIN (rDate)
FOR dType IN
( [1], [2] )
) AS pvt
Results:
venue_id StartDate EndDate
1 2014-01-01 2014-01-01
3 2014-01-11 2014-01-15
4 2014-01-20 2014-01-22
4 2014-01-26 2014-01-30
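The "islands" step, collapsing runs of consecutive free days into start/end ranges, can be sketched in plain Python as follows. This is a simple illustration of the idea rather than a translation of the SQL above; `collapse` is a hypothetical helper, and the input is the list of free days implied by the question's data.

```python
from datetime import date, timedelta


def collapse(days_by_venue):
    """Merge (venue_id, day) pairs into (venue_id, start, end) ranges."""
    ranges = []
    for venue_id, day in sorted(days_by_venue):
        last = ranges[-1] if ranges else None
        if last and last[0] == venue_id and day - last[2] == timedelta(days=1):
            ranges[-1] = (venue_id, last[1], day)  # extend the current island
        else:
            ranges.append((venue_id, day, day))  # start a new island
    return ranges


# Free days per venue, as implied by the question's two tables
free_days = (
    [(1, date(2014, 1, 1))]
    + [(3, date(2014, 1, d)) for d in range(11, 16)]
    + [(4, date(2014, 1, d)) for d in (20, 21, 22, 26, 27, 28, 29, 30)]
)
ranges = collapse(free_days)
```

Because the comparison checks the venue as well as the date, two venues with adjacent free days are never merged into one range.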
I have these tables:
day_shift
id_dshift dshift_name on_duty off_duty in_start in_end out_start out_end workday
1001 ds_normal 7:00 15:00 7:00 10:00 11:00 20:00 1
1002 ds_Saturday 7:00 14:00 7:00 10:00 11:00 14:00 1
week_shift
id_wshift wshift_name mon tue wed thu fri sat sun
2001 ws_normal 1001 1001 1001 1001 1001 1002 1001
2002 ws_2013_w1 0 1001 1001 1001 1001 1001 0
2003 ws_2013_w2 1003 1001 1001 1001 1001 1002 1001
daily_attendance
emp_id checkdate in out emp_shift_id
10 15/06/2013 7:10 15:05 2001 <-- saturday
10 16/06/2013 7:05 15:03 2001 <-- sunday
What I want is a result like this:
emp_id checkdate in out on_duty off_duty
10 15/06/2013 7:10 15:05 07:00 14:00
10 16/06/2013 7:30 14:30 07:00 15:00
In the first row of daily_attendance the weekday is Saturday, so I want to get the value of week_shift.sat (1002).
If the weekday is Sunday, I want to get the value of week_shift.sun (1001).
That way I can get the on_duty and off_duty values from day_shift.
How can I do this in a query?
The trick here would be to create a saved query in Access named [week_shift_transformed] to transform your [week_shift] table into separate rows for each day of the week:
SELECT id_wshift, wshift_name, 1 AS [weekday], [sun] as id_dshift FROM week_shift
UNION ALL
SELECT id_wshift, wshift_name, 2 AS [weekday], [mon] as id_dshift FROM week_shift
UNION ALL
SELECT id_wshift, wshift_name, 3 AS [weekday], [tue] as id_dshift FROM week_shift
UNION ALL
SELECT id_wshift, wshift_name, 4 AS [weekday], [wed] as id_dshift FROM week_shift
UNION ALL
SELECT id_wshift, wshift_name, 5 AS [weekday], [thu] as id_dshift FROM week_shift
UNION ALL
SELECT id_wshift, wshift_name, 6 AS [weekday], [fri] as id_dshift FROM week_shift
UNION ALL
SELECT id_wshift, wshift_name, 7 AS [weekday], [sat] as id_dshift FROM week_shift
That will give you
id_wshift wshift_name weekday id_dshift
--------- ----------- ------- ---------
2001 ws_normal 1 1001
2002 ws_2013_w1 1 0
2003 ws_2013_w2 1 1001
2001 ws_normal 2 1001
2002 ws_2013_w1 2 0
2003 ws_2013_w2 2 1003
2001 ws_normal 3 1001
2002 ws_2013_w1 3 1001
2003 ws_2013_w2 3 1001
2001 ws_normal 4 1001
2002 ws_2013_w1 4 1001
2003 ws_2013_w2 4 1001
2001 ws_normal 5 1001
2002 ws_2013_w1 5 1001
2003 ws_2013_w2 5 1001
2001 ws_normal 6 1001
2002 ws_2013_w1 6 1001
2003 ws_2013_w2 6 1001
2001 ws_normal 7 1002
2002 ws_2013_w1 7 1001
2003 ws_2013_w2 7 1002
Then you can use a query like this:
SELECT da.emp_id, da.checkdate, da.in, da.out, ds.on_duty, ds.off_duty
FROM
daily_attendance da
INNER JOIN
(
week_shift_transformed wtt
INNER JOIN
day_shift ds
ON ds.id_dshift = wtt.id_dshift
)
ON wtt.weekday = Weekday(da.checkdate)
AND wtt.id_wshift = da.emp_shift_id
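As a way to check the logic outside Access, the unpivot and the weekday lookup can be sketched in plain Python. The sketch assumes Access's default `Weekday()` numbering (1 = Sunday through 7 = Saturday), which is also what the saved query above uses; only the `ws_normal` week shift is included, and `access_weekday` is a hypothetical helper.

```python
from datetime import date

# id_wshift -> day-shift id per weekday column (only ws_normal shown)
week_shift = {
    2001: {"mon": 1001, "tue": 1001, "wed": 1001, "thu": 1001,
           "fri": 1001, "sat": 1002, "sun": 1001},
}
# id_dshift -> (on_duty, off_duty)
day_shift = {1001: ("7:00", "15:00"), 1002: ("7:00", "14:00")}
# (emp_id, checkdate, in, out, emp_shift_id)
attendance = [
    (10, date(2013, 6, 15), "7:10", "15:05", 2001),  # Saturday
    (10, date(2013, 6, 16), "7:05", "15:03", 2001),  # Sunday
]

# Column order matching Weekday(): 1 = Sunday ... 7 = Saturday
COLUMNS = ["sun", "mon", "tue", "wed", "thu", "fri", "sat"]

# Unpivot: one entry per (id_wshift, weekday), like week_shift_transformed
transformed = {
    (wid, weekday): days[col]
    for wid, days in week_shift.items()
    for weekday, col in enumerate(COLUMNS, start=1)
}


def access_weekday(d):
    """Map a date to Sunday = 1 ... Saturday = 7."""
    return d.isoweekday() % 7 + 1


result = []
for emp_id, checkdate, t_in, t_out, wid in attendance:
    on_duty, off_duty = day_shift[transformed[(wid, access_weekday(checkdate))]]
    result.append((emp_id, checkdate, t_in, t_out, on_duty, off_duty))
```

The Saturday row picks up shift 1002 (7:00 to 14:00) and the Sunday row picks up shift 1001 (7:00 to 15:00), matching the on_duty and off_duty values in the desired output.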