Group by year with pd.Timestamp / datetime64 in the format YYYY-MM-DD while keeping the full timestamp - pandas

I have a dataframe with two columns: "time" (dates in the format YYYY-MM-DD) and "value" (np.int64).
time | value
2009-11-03 | 13
2009-11-14 | 25
2009-12-05 | 25
2016-03-02 | 80
2016-05-17 | 56
I need to group by year and get the maximum value per year. If several days within the same year share the highest value, I need to keep all of them. I also need to keep the full timestamp.
Desired output:
time | value
2009-11-14 | 25
2009-12-05 | 25
2016-03-02 | 80
My code so far:
df["year"] = df["time"].dt.year
df = df.groupby(["year"], sort=False)['value'].max()
But this removes the timestamp and I only have the year + value as a column. How can I get the desired result?

Try transform first to broadcast each year's maximum back onto its rows, then filter with the resulting mask:
m=df.value.eq(df.groupby(df.time.dt.year).value.transform('max'))
df=df[m]
Out[111]:
time value
1 2009-11-14 25
2 2009-12-05 25
3 2016-03-02 80
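To make the two steps explicit, here is a minimal sketch of the same logic in plain Python, without pandas. The names `rows`, `year_max` and `kept` are made up for illustration; the data is the sample from the question.

```python
from datetime import date

# Sample data from the question: (time, value) pairs.
rows = [
    (date(2009, 11, 3), 13),
    (date(2009, 11, 14), 25),
    (date(2009, 12, 5), 25),
    (date(2016, 3, 2), 80),
    (date(2016, 5, 17), 56),
]

# Step 1: per-year maximum. This is the value that
# groupby(...).transform('max') broadcasts back onto every row.
year_max = {}
for day, value in rows:
    year_max[day.year] = max(value, year_max.get(day.year, value))

# Step 2: keep every row whose value equals its year's maximum.
# Ties survive because the comparison is done row by row.
kept = [(day, value) for day, value in rows if value == year_max[day.year]]

for day, value in kept:
    print(day.isoformat(), value)
```

Because the filter compares each row against its group's maximum instead of collapsing the group, the original timestamps are never lost.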

Calculate the maximum values per year, and then join the result with the original data frame:
df["year"] = pd.to_datetime(df["time"]).dt.year
max_val = df.groupby(["year"], sort=False)['value'].max()
pd.merge(max_val, df, on=["value", "year"])
result:
value year time
0 25 2009 2009-11-14
1 25 2009 2009-12-05
2 80 2016 2016-03-02

Related

Extract 30 minutes from timestamp and group it by 30 mins time interval -PGSQL

In PostgreSQL I am extracting hour from the timestamp using below query.
select count(*) as logged_users, EXTRACT(hour from login_time::timestamp) as Hour
from loginhistory
where login_time::date = '2021-04-21'
group by Hour order by Hour;
And the output is as follows
logged_users | hour
--------------+------
27 | 7
82 | 8
229 | 9
1620 | 10
1264 | 11
1990 | 12
1027 | 13
1273 | 14
1794 | 15
1733 | 16
878 | 17
126 | 18
21 | 19
5 | 20
3 | 21
1 | 22
I want the same output from the same query, but grouped into 30-minute intervals. Please suggest how.
SELECT to_timestamp((extract(epoch FROM login_time::timestamp)::bigint / 1800) * 1800)::timestamp AS interval_30_min
, count(*) AS logged_users
FROM loginhistory
WHERE login_time::date = '2021-04-21' -- inefficient!
GROUP BY 1
ORDER BY 1;
Extracting the epoch gets the number of seconds since the epoch. Integer division truncates. Multiplying back effectively rounds down, achieving the same as date_trunc() for arbitrary time intervals.
1800 because 30 minutes contain 1800 seconds.
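The arithmetic can be checked outside the database. The following is a small Python sketch of the same truncation, assuming timezone-aware UTC datetimes; `floor_to_interval` is a hypothetical helper name, not a PostgreSQL or library function.

```python
from datetime import datetime, timezone

def floor_to_interval(ts, seconds=1800):
    """Round an aware datetime down to a multiple of `seconds` since the epoch."""
    epoch = int(ts.timestamp())            # seconds since 1970-01-01 00:00:00 UTC
    floored = epoch // seconds * seconds   # integer division truncates, multiplying restores the scale
    return datetime.fromtimestamp(floored, tz=timezone.utc)

login = datetime(2021, 4, 21, 7, 44, 59, tzinfo=timezone.utc)
print(floor_to_interval(login))        # start of the 30-minute bucket
print(floor_to_interval(login, 900))   # the same idea with 15-minute buckets
```

Changing `seconds` is all it takes to switch to another interval, which is the advantage over extracting individual fields.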
Detailed explanation:
Truncate timestamp to arbitrary intervals
The cast to timestamp makes me wonder about the actual data type of login_time. If it's timestamptz, the cast depends on your current time zone setting and sets you up for surprises if that setting changes. See:
How do I match an entire day to a datetime field?
Subtract hours from the now() function
Ignoring time zones altogether in Rails and PostgreSQL
Depending on the actual data type, and exact definition of your date boundaries, there is a more efficient way to phrase your WHERE clause.
You can change the column on which you're aggregating to use the minute too:
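select
count(*) as logged_users,
-- label each row with its hour and the half hour it falls into (0 or 30)
CONCAT(EXTRACT(hour from login_time::timestamp), '-', CASE WHEN EXTRACT(minute from login_time::timestamp) < 30 THEN 0 ELSE 30 END) as HalfHour
from loginhistory
where login_time::date = '2021-04-21'
group by HalfHour
-- HalfHour is text, so ordering by it would put '10-0' before '7-0'; order by time instead
order by min(login_time::timestamp);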
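As an illustration of the bucketing rule, here is a short Python sketch. The `logins` list and the `half_hour_bucket` helper are invented for the example. It also shows why a text label such as '10-0' is a poor sort key compared with a numeric one.

```python
from collections import Counter
from datetime import datetime

# Hypothetical login times, only for illustration.
logins = [
    datetime(2021, 4, 21, 7, 5),
    datetime(2021, 4, 21, 7, 40),
    datetime(2021, 4, 21, 7, 55),
    datetime(2021, 4, 21, 10, 10),
]

def half_hour_bucket(ts):
    """Map a timestamp to (hour, 0) or (hour, 30), like the CASE expression."""
    return (ts.hour, 0 if ts.minute < 30 else 30)

counts = Counter(half_hour_bucket(ts) for ts in logins)

# Tuples of numbers sort chronologically.
ordered = sorted(counts.items())

# The equivalent text labels sort character by character instead.
labels = sorted(f"{hour}-{minute}" for hour, minute in counts)

print(ordered)
print(labels)
```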

Query a table so that data in one column could be shown as different fields

I have a table that stores customer care data. The table/view has the following structure.
userid calls_received calls_answered calls_rejected call_date
-----------------------------------------------------------------------
1030 134 100 34 28-05-2018
1012 140 120 20 28-05-2018
1045 120 80 40 28-05-2018
1030 99 39 50 28-04-2018
1045 50 30 20 28-04-2018
1045 200 100 100 28-05-2017
1030 160 90 70 28-04-2017
1045 50 30 20 28-04-2017
This is the sample data. The data is stored on a daily basis.
I have to create a report in report designer software that takes a date as input. When the user selects a date, e.g. 28/05/2018, it is sent as the parameter ${call_date}. I have to query the view so that the result looks like the table below. If the user selects 28/05/2018, then the data for 28/04/2018 and 28/05/2017 should be displayed side by side in the following column order.
userid | cl_cur | ans_cur | rej_cur |success_percentage |diff_percent|position_last_month| cl_last_mon | ans_las_mon | rej_last_mon |percentage_lm|cl_last_year | ans_last_year | rej_last_year
1030 | 134 | 100 | 34 | 74.6 % | 14% | 2 | 99 | 39 | 50 | 39.3% | 160 | 90 | 70
1045 | 120 | 80 | 40 | 66.6% | 26.7% | 1 | 50 | 30 | 20 | 60% | 50 | 30 | 20
The objective of this query is to show the data of the selected day, of the same day in the previous month, and of the same day in the previous year in columns, so that the user can compare them. The result is ordered by the success percentage of the selected day (ans_cur/cl_cur) in descending order, and that percentage is shown under success_percentage.
The column position_last_month is the position of that particular employee in the previous month when the rows are ordered by percentage in descending order. In this example, userid 1030 was in 2nd position last month and userid 1045 was in 1st position. I have to calculate the same for the previous year.
There is also a field called diff_percent, which is the difference in percentage compared with the person who was in the same position last month. I have to do the same for last year. How can I achieve this result? Please help.
THIS ANSWERS THE ORIGINAL VERSION OF THE QUESTION.
One method is a join:
select t.userid,
t.calls_received as cl_cur, t.calls_answered as ans_cur, t.calls_rejected as rej_cur,
tm.calls_received as cl_last_mon, tm.calls_answered as ans_last_mon, tm.calls_rejected as rej_last_mon,
ty.calls_received as cl_last_year, ty.calls_answered as ans_last_year, ty.calls_rejected as rej_last_year
from t left join
t tm -- same user, same day one month earlier
on tm.userid = t.userid and
tm.call_date = dateadd(month, -1, t.call_date) left join
t ty -- same user, same day one year earlier
on ty.userid = t.userid and
ty.call_date = dateadd(year, -1, t.call_date)
where t.call_date = ${call_date};
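The two left joins are lookups of the same user on two shifted dates. Below is a rough Python sketch of that idea, using the sample rows from the question. `months_back` is a hypothetical helper, and clamping to the last day of a shorter month is an assumption about how the month arithmetic should behave.

```python
import calendar
from datetime import date

def months_back(day, months):
    """Same day-of-month `months` earlier, clamped to the month's last day (assumed behaviour)."""
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))

# (userid, call_date) -> (calls_received, calls_answered, calls_rejected)
calls = {
    (1030, date(2018, 5, 28)): (134, 100, 34),
    (1012, date(2018, 5, 28)): (140, 120, 20),
    (1045, date(2018, 5, 28)): (120, 80, 40),
    (1030, date(2018, 4, 28)): (99, 39, 50),
    (1045, date(2018, 4, 28)): (50, 30, 20),
    (1045, date(2017, 5, 28)): (200, 100, 100),
    (1030, date(2017, 4, 28)): (160, 90, 70),
    (1045, date(2017, 4, 28)): (50, 30, 20),
}

selected = date(2018, 5, 28)
report = {}
for (userid, call_date), current in calls.items():
    if call_date != selected:
        continue
    report[userid] = {
        "current": current,
        # dict.get returns None when there is no match, like the NULLs of a left join
        "last_month": calls.get((userid, months_back(selected, 1))),
        "last_year": calls.get((userid, months_back(selected, 12))),
    }

for userid, columns in report.items():
    print(userid, columns)
```

Users without a row on the shifted date still appear in the report, only with empty comparison columns.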

PostgreSQL select only rows whose dates match a specific number of the week in a table

I have a table that looks like this for a span of many years:
dump_time | group_id | client_count
---------------------+----------+--------------
2014-10-21 19:45:00 | 145 | 74
2014-10-21 19:45:00 | 131 | 279
2014-10-21 19:45:00 | 139 | 49
where dump_time is of type 'timestamp without time zone'.
I want to select only rows that match a specific week of the year and a specific day of the week. For instance, I want all rows that are 3rd day of the 15th week of the year. Any idea on how I could do this? I've explored the EXTRACT command, but haven't quite figured it out.
Thanks!
select *
from testme
where extract(week from dump_time) = 15 -- ISO 8601 week number
and extract(dow from dump_time) = 3; -- day of week, 0 = Sunday ... 6 = Saturday, so 3 = Wednesday
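For readers who want to check which dates such a filter selects, here is a small Python sketch of the same test. It relies on PostgreSQL's week field being the ISO 8601 week number and dow counting from 0 (Sunday) to 6 (Saturday). The `matches` helper is hypothetical.

```python
from datetime import datetime

def matches(ts, week=15, dow=3):
    """True if ts falls in the given ISO week and on the given day of week (0 = Sunday)."""
    iso_week = ts.isocalendar()[1]   # ISO 8601 week number
    pg_dow = ts.isoweekday() % 7     # isoweekday: Monday = 1 ... Sunday = 7, so Sunday becomes 0
    return iso_week == week and pg_dow == dow

print(matches(datetime(2014, 4, 9, 19, 45)))    # a Wednesday in ISO week 15
print(matches(datetime(2014, 10, 21, 19, 45)))  # the sample row from the question
```

If "3rd day of the week" is meant to count from Monday, extract(isodow ...) with Monday = 1 may be the better fit. For Wednesday both numberings happen to give 3.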

Left join with nested selects and aggregate functions

Problem
I have one table of generated dates (s) which I want to join with another table (d) which is a list of dates where a specific occurrence has happened.
table s
Wednesday 23rd August 2017
Thursday 24th August 2017
Friday 25th August 2017
Saturday 26th August 2017
table d
day_created -------------------------------- count
Thursday 24th August 2017 ---------------- 45
Saturday 26th August 2017 ---------------- 32
I want to show rows where the occurrence does not take place, which I cannot do if I just have table d.
I want something that looks like:
day_created -------------------------------- count
Wednesday 23rd August --------------------- 0
Thursday 24th August 2017 ---------------- 45
Friday 25th August 2017 ------------------ 0
Saturday 26th August 2017 ---------------- 32
I've tried joining with a left join as follows:
SELECT day_created, COUNT(d.day_created) as total_per_day
FROM
(SELECT date_trunc('day', task_1.created_at) as day_created
FROM task_1
)
d
LEFT JOIN (
SELECT (generate_series('2017-05-01', current_date, '1 day'::INTERVAL)) as standard_date
)
s
ON d.day_created=s.standard_date
GROUP BY d.day_created
ORDER BY day_created DESC;
I don't get an error however the join isn't working (i.e. it doesn't return dates where the count is null). What it returns is the dates from table d and the count, but not the dates in between where there are 0 occurrences.
I've been going round in circles and have understood that I need to make table s (I think!) the left table, but I'm getting confused as a newbie with the syntax.
This is all in PostgreSQL 9.5.8.
Basically, you had the LEFT JOIN backwards. This should work, with some other simplifications and performance optimizations:
SELECT s.standard_date, COUNT(d.created_at) AS total_per_day
FROM generate_series('2017-05-01', current_date, interval '1 day') s(standard_date)
LEFT JOIN task_1 d ON d.created_at >= s.standard_date
AND d.created_at < s.standard_date + interval '1 day'
GROUP BY 1
ORDER BY 1;
This counts rows in d, like you commented. Does not sum values.
Be aware that generate_series() still returns timestamp with time zone, even if you pass date values to it. You may want to cast to date or format with to_char() for display in the outer SELECT. (But rather group and order by the original timestamp value, not the formatted string.)
There may be corner cases, depending on the current time zone setting and on the actual (undisclosed) table definition.
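The join condition assigns each row to the half-open range from the start of a day up to, but not including, the start of the next day. Here is a minimal Python sketch of that counting, with invented `created_at` values and naive datetimes (so the time zone caveats above are ignored):

```python
from datetime import datetime, timedelta

# Hypothetical creation timestamps, only for illustration.
created_at = [
    datetime(2017, 8, 24, 9, 15),
    datetime(2017, 8, 24, 23, 59, 59),
    datetime(2017, 8, 26, 0, 0),
]

one_day = timedelta(days=1)
start = datetime(2017, 8, 23)
days = [start + i * one_day for i in range(4)]  # the generated series

# Every generated day is kept; days without matches count 0.
totals = [
    (day.date(), sum(day <= ts < day + one_day for ts in created_at))
    for day in days
]

for day, total in totals:
    print(day, total)
```

A timestamp at exactly midnight belongs to the day that starts there, never to the previous one, which is why the upper bound is exclusive.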
Related:
How to avoid a subquery in FILTER clause?
I have one table of generated dates (s)
In real databases, we don't store a generated series. We just generate them when needed.
which I want to join with another table (d) which is a list of dates where a specific occurrence has happened. [...] I want to show rows where the occurrence does not take place, which I cannot do if I just have table d.
Nah, you can do it.
CREATE TABLE d(day_created, count) AS VALUES
('24 August 2017'::date, 45),
('26 August 2017'::date, 32);
SELECT day_created, coalesce(count,0)
FROM (
SELECT d::date
FROM generate_series(
'2017-08-01'::timestamp without time zone,
'2017-09-01'::timestamp without time zone,
'1 day'
) AS gs(d)
) AS gs(day_created)
LEFT OUTER JOIN d USING(day_created)
ORDER BY day_created;
day_created | coalesce
-------------+----------
2017-08-01 | 0
2017-08-02 | 0
2017-08-03 | 0
2017-08-04 | 0
2017-08-05 | 0
2017-08-06 | 0
2017-08-07 | 0
2017-08-08 | 0
2017-08-09 | 0
2017-08-10 | 0
2017-08-11 | 0
2017-08-12 | 0
2017-08-13 | 0
2017-08-14 | 0
2017-08-15 | 0
2017-08-16 | 0
2017-08-17 | 0
2017-08-18 | 0
2017-08-19 | 0
2017-08-20 | 0
2017-08-21 | 0
2017-08-22 | 0
2017-08-23 | 0
2017-08-24 | 45
2017-08-25 | 0
2017-08-26 | 32
2017-08-27 | 0
2017-08-28 | 0
2017-08-29 | 0
2017-08-30 | 0
2017-08-31 | 0
2017-09-01 | 0
(32 rows)
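The same pattern, generating the calendar and then looking up each day with a default of 0, can be sketched in Python using the four days from the question. `date_series` is a made-up stand-in for generate_series().

```python
from datetime import date, timedelta

def date_series(start, stop):
    """Yield every date from start to stop, both inclusive."""
    day = start
    while day <= stop:
        yield day
        day += timedelta(days=1)

# Table d from the question: only days on which something happened.
counts = {date(2017, 8, 24): 45, date(2017, 8, 26): 32}

# The generated dates drive the result; a missing lookup becomes 0 (the COALESCE).
report = [
    (day, counts.get(day, 0))
    for day in date_series(date(2017, 8, 23), date(2017, 8, 26))
]

for day, count in report:
    print(day.isoformat(), count)
```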

How to average data on periods from a table in SQL

I'm trying to average data over fixed periods of time and, for each period, also average the corresponding dates.
Having data like:
value | datetime
-------+------------------------
15 | 2015-08-16 01:00:40+02
22 | 2015-08-16 01:01:40+02
16 | 2015-08-16 01:02:40+02
19 | 2015-08-16 01:03:40+02
21 | 2015-08-16 01:04:40+02
18 | 2015-08-16 01:05:40+02
29 | 2015-08-16 01:06:40+02
16 | 2015-08-16 01:07:40+02
16 | 2015-08-16 01:08:40+02
15 | 2015-08-16 01:09:40+02
I would like to obtain something like in one query:
value | datetime
-------+------------------------
18.6 | 2015-08-16 01:03:00+02
18.8 | 2015-08-16 01:08:00+02
where each value is the average of 5 consecutive input values and each datetime is the middle (or average) of the corresponding 5 input datetimes; 5 is the interval length n.
I saw some posts that put me on the right track with avg, group by and averaging dates in SQL, but I still can't work out exactly what to do.
I'm working under PostgreSQL 9.4
You would need to share more information but here is a way to do it. Here is more information on it : HERE
mysql> SELECT AVG(value), AVG(datetime)
FROM database.table
WHERE datetime > date1
AND datetime < date2;
Something like
SELECT
to_timestamp(round(AVG(EXTRACT(epoch from datetime)))) as middleDate,
avg(value) AS avgValue
FROM
myTable
GROUP BY
(id) / ((SELECT Count(*) FROM myTable) / 100);
roughly met my requirements, with 100 controlling the length of the averaged intervals (it is approximately the number of output rows).
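To show the idea on the sample data, here is a Python sketch that averages consecutive chunks of n rows. Unlike the query above, it fixes the chunk size instead of the number of output rows, and the averaged time is the exact mean of the timestamps, not a rounded minute. All names are illustrative.

```python
from datetime import datetime, timedelta, timezone

tz = timezone(timedelta(hours=2))  # the +02 offset from the sample data
values = [15, 22, 16, 19, 21, 18, 29, 16, 16, 15]
rows = [
    (value, datetime(2015, 8, 16, 1, minute, 40, tzinfo=tz))
    for minute, value in enumerate(values)
]

def average_chunks(rows, n):
    """Average value and timestamp over consecutive chunks of n rows."""
    result = []
    for i in range(0, len(rows), n):
        chunk = rows[i:i + n]
        avg_value = sum(value for value, _ in chunk) / len(chunk)
        # Average the timestamps via seconds since the epoch, like EXTRACT(epoch ...).
        avg_epoch = sum(ts.timestamp() for _, ts in chunk) / len(chunk)
        result.append((avg_value, datetime.fromtimestamp(round(avg_epoch), tz=tz)))
    return result

averaged = average_chunks(rows, 5)
for avg_value, middle in averaged:
    print(avg_value, middle.isoformat())
```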