Grouping by overall score based on a range of values and a score table in SQLite - sql

Given a big table of data about when people begin and complete tasks, e.g:
Person | Task     | Date started | Date ended
---------------------------------------------
A      | Cleaning | 20-FEB-2012  | 22-FEB-2012
N      | Dishes   | 20-FEB-2012  | 24-FEB-2012
Z      | Cleaning | 21-FEB-2012  | 23-FEB-2012
and a score table which assigns scores of 2,3,4 for each task based on how long it takes them to do it, e.g.:
Task     | Days taken | Score
-----------------------------
Cleaning | 2          | 2
Cleaning | 1.5        | 3
Cleaning | 1          | 4
Dishes   | 3          | 2
Dishes   | 2.5        | 3
Dishes   | 2          | 4
how might I produce a query which gives the overall score for each person for each task, e.g.:
Person | Task     | Overall Score
---------------------------------
A      | Cleaning | 3.1
A      | Dishes   | 2.7
N      | Cleaning | 3.4
The solution's been subtly eluding me; some assistance would be appreciated! I'm using SQLite at present.

Your definitions are a bit vague. However, the following should help you:
select t.person, t.task, sum(s.score)
from tasks t left outer join
     score s
     on t.task = s.task and
        s.days_taken = julianday(t.dateended) - julianday(t.datestarted)
group by t.person, t.task
Handling ranges is a bit more difficult. You need to get the two ends of the interval, and then do the join:
select t.person, t.task, sum(s.score)
from tasks t left outer join
     (select s.*,
             (select min(s2.days_taken) from score s2 where s2.task = s.task and s2.days_taken > s.days_taken
             ) as nextDays_Taken
      from score s
     ) s
     on t.task = s.task and
        julianday(t.dateended) - julianday(t.datestarted) >= s.days_taken and
        julianday(t.dateended) - julianday(t.datestarted) < nextDays_Taken
group by t.person, t.task
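As a concrete check, here is a small, self-contained sketch of the range-join idea using Python's sqlite3 module. The table and column names are assumptions based on the question, dates are rewritten in ISO format so julianday() can parse them, and a NULL upper bound (the slowest bracket) is treated as open-ended, which the answer's query as written would exclude:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (person TEXT, task TEXT, datestarted TEXT, dateended TEXT);
CREATE TABLE score (task TEXT, days_taken REAL, score INTEGER);

INSERT INTO tasks VALUES
  ('A', 'Cleaning', '2012-02-20 00:00:00', '2012-02-22 00:00:00'),  -- 2.0 days
  ('Z', 'Cleaning', '2012-02-21 00:00:00', '2012-02-22 12:00:00');  -- 1.5 days

INSERT INTO score VALUES
  ('Cleaning', 2,   2),
  ('Cleaning', 1.5, 3),
  ('Cleaning', 1,   4);
""")

# Each score row covers the half-open interval [days_taken, next higher
# days_taken); a NULL upper bound means "this bracket and slower".
rows = conn.execute("""
SELECT t.person, t.task, sum(s.score) AS score
FROM tasks t
LEFT JOIN (SELECT s.*,
                  (SELECT MIN(s2.days_taken) FROM score s2
                   WHERE s2.task = s.task AND s2.days_taken > s.days_taken
                  ) AS next_days
           FROM score s) s
  ON t.task = s.task
 AND julianday(t.dateended) - julianday(t.datestarted) >= s.days_taken
 AND (julianday(t.dateended) - julianday(t.datestarted) < s.next_days
      OR s.next_days IS NULL)
GROUP BY t.person, t.task
ORDER BY t.person
""").fetchall()

print(rows)  # [('A', 'Cleaning', 2), ('Z', 'Cleaning', 3)]
```

A took 2 days, so lands in the slowest bracket (score 2); Z took 1.5 days and lands in the middle bracket (score 3).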

Related

Postgres, groupBy and count for table and relations at the same time

I have a table called 'users' that has the following structure:
id (PK) | campaign_id | createdAt
-----------------------------------------
1       | 123         | 2022-07-14T10:30:01.967Z
2       | 1234        | 2022-07-14T10:30:01.967Z
3       | 123         | 2022-07-14T10:30:01.967Z
4       | 123         | 2022-07-14T10:30:01.967Z
At the same time I have a table that tracks clicks per user:
id (PK) | user_id (FK) | createdAt
-----------------------------------------
1       | 1            | 2022-07-14T10:30:01.967Z
2       | 2            | 2022-07-14T10:30:01.967Z
3       | 2            | 2022-07-14T10:30:01.967Z
4       | 2            | 2022-07-14T10:30:01.967Z
Both of these tables run to millions of records... I need the most efficient query to group the data per campaign_id.
The result I am looking for would look like this:
campaign_id | total_users | total_clicks
----------------------------------------
123         | 3           | 1
1234        | 1           | 3
Unfortunately, I have no idea how to achieve this while minding performance. Most important of all, I need to use WHERE or HAVING to limit the query to a certain time range by createdAt.
Note, PostgreSQL is not my forte, nor is SQL. But I'm learning, and I spent some time on your question. Have a go with an INNER JOIN after two separate SELECT statements:
SELECT * FROM
(
SELECT campaign_id, COUNT (t1."id(PK)") total_users FROM t1 GROUP BY campaign_id
) tbl1
INNER JOIN
(
SELECT campaign_id, COUNT (t2."user_id(FK)") total_clicks FROM t2 INNER JOIN t1 ON t1."id(PK)" = t2."user_id(FK)" GROUP BY campaign_id
) tbl2
USING(campaign_id)
See an online fiddle. I believe this is now also ready for a WHERE clause in both SELECT statements to filter by "createdAt". I'm pretty sure someone else will come up with something better.
Good luck.
Hope this will help you.
select u.campaign_id,
count(distinct u.id) users_count,
count(c.user_id) clicks_count
from
users u left join clicks c on u.id=c.user_id
group by 1;
See here query output
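The LEFT JOIN / COUNT(DISTINCT ...) shape is portable; here is a small sketch of it run against SQLite via Python's sqlite3 (the question targets Postgres, but these aggregates behave the same way; the data is copied from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, campaign_id INTEGER, createdAt TEXT);
CREATE TABLE clicks (id INTEGER PRIMARY KEY, user_id INTEGER, createdAt TEXT);

INSERT INTO users  VALUES (1, 123, '2022-07-14'), (2, 1234, '2022-07-14'),
                          (3, 123, '2022-07-14'), (4, 123,  '2022-07-14');
INSERT INTO clicks VALUES (1, 1, '2022-07-14'), (2, 2, '2022-07-14'),
                          (3, 2, '2022-07-14'), (4, 2, '2022-07-14');
""")

# One row per campaign: COUNT(DISTINCT u.id) undoes the row multiplication
# from the join, and COUNT(c.user_id) skips the NULLs produced for users
# with no clicks.
rows = conn.execute("""
SELECT u.campaign_id,
       COUNT(DISTINCT u.id) AS users_count,
       COUNT(c.user_id)     AS clicks_count
FROM users u LEFT JOIN clicks c ON u.id = c.user_id
GROUP BY 1
ORDER BY 1
""").fetchall()

print(rows)  # [(123, 3, 1), (1234, 1, 3)]
```

A WHERE on createdAt can be added before the GROUP BY without changing the shape of the query.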

Loop through rows and match values in SQL

I'd appreciate any help with my problem! I have an org chart of all employees, with columns for their supervisors. I am trying to find, for each employee, the first supervisor in the org structure who has 3+ years' experience. So if supervisor 1 has only 1 year, I need to move to the next column, supervisor 2, and see if they have more experience. At the end, I would like to return a column of supervisors' ids [experienced_supervisor column].
Table: org_chart
id | experience | supervisor_id_1 | supervisor_id_2 | experienced_supervisor
----------------------------------------------------------------------------
A  | 2          | X               | C               | X
C  | 5          | V               | D               | D
V  | 1          | M               | X               | M
X  | 3
D  | 8
M  | 11
I am new to SQL and not even sure if this is the best approach. But here is my thinking: I will use CASE to look through every row (employee) and compare their supervisor's experience.
SELECT CASE
WHEN experience >= 3 THEN supervisor_id_1
ELSE
CASE WHEN experience >= 3 THEN supervisor_id_2
ELSE 'not found'
END
END AS experienced_supervisor
FROM org_chart
Questions:
Is this the best way to tackle the problem?
Can I look up the value [experience years] of supervisors by matching supervisor_id_1, supervisor_id_2 to id? Or do I need to create a new column supervisor_id_1_experience and fill the years of experience by doing the join?
I am using Redshift.
You only need one case expression, but a lot of joins or subqueries. Perhaps:
SELECT (CASE WHEN (SELECT oc2.experience FROM org_chart oc2 WHERE oc2.id = oc.supervisor_id_1) >= 3
THEN supervisor_id_1
WHEN (SELECT oc2.experience FROM org_chart oc2 WHERE oc2.id = oc.supervisor_id_2) >= 3
THEN supervisor_id_2
. . .
END) AS experienced_supervisor
FROM org_chart oc
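This lookup-by-correlated-subquery pattern is easy to check in miniature. A sketch using Python's sqlite3 (the question uses Redshift, but this construct is standard SQL; the data is the org chart from the question, with NULL supervisors for the top level):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE org_chart (id TEXT, experience INTEGER,
                        supervisor_id_1 TEXT, supervisor_id_2 TEXT);
INSERT INTO org_chart VALUES
  ('A', 2,  'X',  'C'),
  ('C', 5,  'V',  'D'),
  ('V', 1,  'M',  'X'),
  ('X', 3,  NULL, NULL),
  ('D', 8,  NULL, NULL),
  ('M', 11, NULL, NULL);
""")

# Look up each supervisor's experience by matching the supervisor id back
# to the id column, and take the first one with 3+ years.
rows = conn.execute("""
SELECT oc.id,
       CASE
         WHEN (SELECT oc2.experience FROM org_chart oc2
               WHERE oc2.id = oc.supervisor_id_1) >= 3 THEN oc.supervisor_id_1
         WHEN (SELECT oc2.experience FROM org_chart oc2
               WHERE oc2.id = oc.supervisor_id_2) >= 3 THEN oc.supervisor_id_2
       END AS experienced_supervisor
FROM org_chart oc
WHERE oc.supervisor_id_1 IS NOT NULL
ORDER BY oc.id
""").fetchall()

print(rows)  # [('A', 'X'), ('C', 'D'), ('V', 'M')]
```

This matches the expected experienced_supervisor column: X for A, D for C (skipping V with 1 year), M for V.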
After lots of trial and error, here is the result that worked for my problem. I am using Redshift in this case.
-- Use common table expression to find job level for each supervisor from reporting levels 8 to 2
WITH cte1 AS
(
SELECT B.id as employee
,B.experience as employee_experience
,B.supervisor_id_1 as manager_1
,A.experience as supervisor_1_experience
FROM org_chart A
INNER JOIN org_chart B ON B.supervisor_id_1 = A.id
),
cte2 AS
(
SELECT B.id as employee2
,B.experience as employee_experience
,B.supervisor_id_2 as manager_2
,A.experience as supervisor_2_experience
FROM org_chart A
INNER JOIN org_chart B ON B.supervisor_id_2 = A.id
),
........-- Write as many statements as I have columns with reporting levels
-- Join all tables above
cte3 AS
(
SELECT employee
,employee_experience
,manager_1
,supervisor_1_experience
,manager_2
,supervisor_2_experience
FROM cte1
JOIN cte2 ON cte2.employee2 = cte1.employee
....... -- Write as many statements as I have columns with reporting levels
)
-- Run through every row and evaluate if each supervisor has more than 3 years of experience
SELECT *
,CASE
WHEN cte3.supervisor_1_experience >= 3 THEN cte3.manager_1
WHEN cte3.supervisor_1_experience < 3
AND cte3.supervisor_2_experience >=3
THEN cte3.manager_2
........ -- Write as many statements as I have columns with reporting levels
END experienced_supervisor
FROM cte3

SQL to delete entries whose number of items is less than the required item count

I have two tables - StepModels (support plan) and FeedbackStepModels (feedback), StepModels keeps how many steps each support plan requires.
SELECT [SupportPlanID],COUNT(*)AS Steps
FROM [StepModels]
GROUP BY SupportPlanID
SupportPlanID | Steps
-------------------------------
1             | 4
2             | 9
3             | 3
4             | 10
FeedbackStepModels keeps how many steps the employee entered into the system:
SELECT [FeedbackID],SupportPlanID,Count(*)AS StepsNumber
FROM [FeedbackStepModels]
GROUP BY FeedbackID,SupportPlanID
FeedbackID | SupportPlanID | StepsNumber
---------------------------------------------
1          | 1             | 3   --> this is supposed to be 4
2          | 2             | 9   --> Correct
3          | 3             | 0   --> this is supposed to be 3
4          | 4             | 10  --> Correct
If a submitted feedback's step total is less than the required total, I want to delete that wrong entry from the database. Basically I need to delete FeedbackID 1 and 3.
I could load the data into a List, compare, and delete, but I want to know if we can do this in SQL rather than in C# code.
You can use the query below to remove the unwanted data with a SQL script:
DELETE f
FROM FeedbackStepModels f
INNER JOIN (
SELECT [FeedbackID],SupportPlanID, Count(*) AS StepsNumber
FROM [FeedbackStepModels]
GROUP BY FeedbackID,SupportPlanID
) f_derived on f_derived.FeedbackID = f.FeedbackID and f_derived.SupportPlanID = f.SupportPlanID
INNER JOIN (
SELECT [SupportPlanID],COUNT(*)AS Steps
FROM [StepModels]
GROUP BY SupportPlanID
) s_derived on s_derived.SupportPlanID = f.SupportPlanID
WHERE f_derived.StepsNumber < s_derived.Steps
I think you want something like this.
DELETE FROM [FeedbackStepModels]
WHERE FeedbackID IN
(
SELECT a.FeedbackID
FROM
(
SELECT [FeedbackID],
SupportPlanID,
COUNT(*) AS StepsNumber
FROM [FeedbackStepModels]
GROUP BY FeedbackID,
SupportPlanID
) AS a
INNER JOIN
(
SELECT [SupportPlanID],
COUNT(*) AS Steps
FROM [StepModels]
GROUP BY SupportPlanID
) AS b ON a.SupportPlanID = b.[SupportPlanID]
WHERE a.StepsNumber < b.Steps
);
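As a sanity check, here is a miniature runnable version of the second (IN-subquery) form using Python's sqlite3 — SQLite has no DELETE ... JOIN, so this form is the more portable one. The plan/step data is made up for illustration: plan 1 requires 2 steps, plan 2 requires 3, and only feedback 1 is incomplete:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StepModels (SupportPlanID INTEGER, StepName TEXT);
CREATE TABLE FeedbackStepModels (FeedbackID INTEGER, SupportPlanID INTEGER);

-- plan 1 requires 2 steps, plan 2 requires 3 steps
INSERT INTO StepModels VALUES (1, 's1'), (1, 's2'), (2, 's1'), (2, 's2'), (2, 's3');

-- feedback 1 entered only 1 of plan 1's 2 steps (incomplete);
-- feedback 2 entered all 3 of plan 2's steps (complete)
INSERT INTO FeedbackStepModels VALUES (1, 1), (2, 2), (2, 2), (2, 2);
""")

# Delete every feedback whose entered-step count falls short of the
# plan's required step count.
conn.execute("""
DELETE FROM FeedbackStepModels
WHERE FeedbackID IN (
  SELECT a.FeedbackID
  FROM (SELECT FeedbackID, SupportPlanID, COUNT(*) AS StepsNumber
        FROM FeedbackStepModels
        GROUP BY FeedbackID, SupportPlanID) AS a
  INNER JOIN (SELECT SupportPlanID, COUNT(*) AS Steps
              FROM StepModels
              GROUP BY SupportPlanID) AS b
          ON a.SupportPlanID = b.SupportPlanID
  WHERE a.StepsNumber < b.Steps
)
""")

remaining = [r[0] for r in conn.execute(
    "SELECT DISTINCT FeedbackID FROM FeedbackStepModels ORDER BY 1")]
print(remaining)  # [2]
```

Feedback 1 (1 of 2 steps) is removed; feedback 2 (3 of 3 steps) survives.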

Get 10 distinct projects with the latest updates in related tasks

I have two tables in a PostgreSQL 9.5 database:
project
- id
- name
task
- id
- project_id
- name
- updated_at
There are ~ 1000 projects (updated very rarely) and ~ 10 million tasks (updated very often).
I want to list those 10 distinct projects that have the latest task updates.
A basic query would be:
SELECT * FROM task ORDER BY updated_at DESC LIMIT 10;
However, there can be many updated tasks per project. So I won't get 10 unique projects.
If I try to add DISTINCT(project_id) somewhere in the query, I'm getting an error:
for SELECT DISTINCT, ORDER BY expressions must appear in select list
Problem is, I can't sort (primarily) by project_id, because I need to have tasks sorted by time. Sorting by updated_at DESC, project_id ASC doesn't work either, because several tasks of the same project can be among the latest.
I can't download all records because there are millions of them.
As a workaround I download 10x the needed rows (without distinct) and filter them in the backend. This works for most cases, but it's obviously not reliable: sometimes I don't get 10 unique projects.
Can this be solved efficiently in Postgres 9.5?
Example
id | name
----+-----------
1 | Project 1
2 | Project 2
3 | Project 3
id | project_id | name | updated_at
----+------------+--------+-----------------
1 | 1 | Task 1 | 13:12:43.361387
2 | 1 | Task 2 | 13:12:46.369279
3 | 2 | Task 3 | 13:12:54.680891
4 | 3 | Task 4 | 13:13:00.472579
5 | 3 | Task 5 | 13:13:04.384477
If I query:
SELECT project_id, updated_at FROM task ORDER BY updated_at DESC LIMIT 2
I get:
project_id | updated_at
------------+-----------------
3 | 13:13:04.384477
3 | 13:13:00.472579
But I want to get 2 distinct projects with the respective latest task.update_at like this:
project_id | updated_at
------------+-----------------
3 | 13:13:04.384477
2 | 13:12:54.680891 -- from Task 3
The simple (logically correct) solution is to aggregate tasks to get the latest update per project, and then pick the latest 10, like the one #Nemeros provided.
However, this incurs a sequential scan on task, which is undesirable (expensive) for big tables.
If you have relatively few projects (many task entries per project), there are faster alternatives using (bitmap) index scans.
SELECT *
FROM project p
, LATERAL (
SELECT updated_at AS last_updated_at
FROM task
WHERE project_id = p.id
ORDER BY updated_at DESC
LIMIT 1
) t
ORDER BY t.last_updated_at DESC
LIMIT 10;
Key to performance is a matching multicolumn index:
CREATE INDEX task_project_id_updated_at ON task (project_id, updated_at DESC);
A setup with 1000 projects and 10 million tasks (like you commented) is a perfect candidate for this.
Background:
Optimize GROUP BY query to retrieve latest record per user
Select first row in each GROUP BY group?
NULL and "no row"
The above solution assumes updated_at is defined NOT NULL. Else use ORDER BY updated_at DESC NULLS LAST and ideally make the index match.
Projects without any tasks are eliminated from the result by the implicit CROSS JOIN, so NULL values cannot creep in this way. This is subtly different from the correlated subqueries #Nemeros added to his answer: those return NULL for "no row" (a project with no related tasks at all). The outer descending sort order then lists NULL on top unless instructed otherwise. Most probably not what you want.
Related:
PostgreSQL sort by datetime asc, null first?
What is the difference between LATERAL and a subquery in PostgreSQL?
Try a GROUP BY expression; that's what it's meant for:
SELECT project_id, max(updated_at) as max_upd_date
FROM task t
GROUP BY project_id
ORDER BY max_upd_date DESC
LIMIT 10
Do not forget to add an index that begins with project_id, updated_at if you want to avoid full table scans.
Well, the only way to use the index seems to be with a correlated subquery:
select p.id,
(select updated_at from task t where p.id = t.project_id order by updated_at desc limit 1) as max_dte
from project p
order by max_dte desc
limit 10
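SQLite has no LATERAL, but this correlated-subquery form runs there too. A small sketch via Python's sqlite3 with the example data from the question (the timestamps are stored as text, which happens to sort correctly here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (id INTEGER, name TEXT);
CREATE TABLE task (id INTEGER, project_id INTEGER, name TEXT, updated_at TEXT);
CREATE INDEX task_project_id_updated_at ON task (project_id, updated_at DESC);

INSERT INTO project VALUES (1, 'Project 1'), (2, 'Project 2'), (3, 'Project 3');
INSERT INTO task VALUES
  (1, 1, 'Task 1', '13:12:43.361387'),
  (2, 1, 'Task 2', '13:12:46.369279'),
  (3, 2, 'Task 3', '13:12:54.680891'),
  (4, 3, 'Task 4', '13:13:00.472579'),
  (5, 3, 'Task 5', '13:13:04.384477');
""")

# Latest task timestamp per project via a correlated subquery,
# then the two most recently active projects.
rows = conn.execute("""
SELECT p.id,
       (SELECT t.updated_at FROM task t
        WHERE t.project_id = p.id
        ORDER BY t.updated_at DESC LIMIT 1) AS max_dte
FROM project p
ORDER BY max_dte DESC
LIMIT 2
""").fetchall()

print(rows)  # [(3, '13:13:04.384477'), (2, '13:12:54.680891')]
```

This reproduces the "2 distinct projects with the respective latest task.updated_at" output from the question.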
Try using:
SELECT project_id,
Max (updated_at)
FROM task
GROUP BY project_id
ORDER BY Max(updated_at) DESC
LIMIT 10
I believe row_number() over() can be used for this, but you will still need the final order by and limit clauses:
select
mt.*
from (
SELECT
* , row_number() over(partition by project_id order by updated_at DESC) rn
FROM tasks
) mt
-- inner join Projects p on mt.project_id = p.id
where mt.rn = 1
order by mt.updated_at DESC
limit 2
The advantage of this approach is that it gives you access to the full row corresponding to the maximum updated_at for each project. You can optionally join the projects table as well.
result:
| id | project_id | name | updated_at | rn |
|----|------------|--------|-----------------|----|
| 5 | 3 | Task 5 | 13:13:04.384477 | 1 |
| 3 | 2 | Task 3 | 13:12:54.680891 | 1 |
see: http://sqlfiddle.com/#!15/ee039/1
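A quick runnable sketch of this row_number() approach using Python's sqlite3 (window functions need SQLite 3.25+; the table is named task here to match the question's schema, whereas the fiddle uses tasks):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (id INTEGER, project_id INTEGER, name TEXT, updated_at TEXT);
INSERT INTO task VALUES
  (1, 1, 'Task 1', '13:12:43.361387'),
  (2, 1, 'Task 2', '13:12:46.369279'),
  (3, 2, 'Task 3', '13:12:54.680891'),
  (4, 3, 'Task 4', '13:13:00.472579'),
  (5, 3, 'Task 5', '13:13:04.384477');
""")

# rn = 1 marks each project's most recently updated task; the outer
# ORDER BY / LIMIT then picks the 2 most recently active projects.
rows = conn.execute("""
SELECT mt.id, mt.project_id, mt.name, mt.updated_at
FROM (SELECT *, row_number() OVER (PARTITION BY project_id
                                   ORDER BY updated_at DESC) AS rn
      FROM task) mt
WHERE mt.rn = 1
ORDER BY mt.updated_at DESC
LIMIT 2
""").fetchall()

print(rows)
# [(5, 3, 'Task 5', '13:13:04.384477'), (3, 2, 'Task 3', '13:12:54.680891')]
```

The result matches the table shown above: Task 5 for project 3 and Task 3 for project 2.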
How about sorting the records by the most recent update and then doing distinct on?
select distinct on (t.project_id) t.*
from tasks t
order by max(t.update_date) over (partition by t.project_id), t.project_id;
EDIT:
I didn't realize Postgres did that check. Here is the version with a subquery:
select distinct on (maxud, t.project_id) t.*
from (select t.*,
max(t.update_date) over (partition by t.project_id) as maxud
from tasks t
) t
order by maxud, t.project_id;
You could probably put the analytic call in the distinct on, but I think this is clearer anyway.

Select a row used for GROUP BY

I have this table:
id | owner | asset | rate
-------------------------
1 | 1 | 3 | 1
2 | 1 | 4 | 2
3 | 2 | 3 | 3
4 | 2 | 5 | 4
And I'm using
SELECT asset, max(rate)
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
ORDER BY max(rate) DESC
to get intersection of assets for specified owners with best rate.
I also need the id of the row used for max(rate), but I can't find a way to include it in the SELECT. Any ideas?
Edit:
I need
Find all assets that belongs to both owners (1 and 2)
From the same asset i need only one with the best rate (3)
I also need other columns (owner) that belongs to the specific asset with best rate
I expect the following output:
id | asset | rate
-------------------------
3 | 3 | 3
Oops, all 3s. But basically I need the id of the 3rd row to query the same table again, so the resulting output (after the second query) will be:
id | owner | asset | rate
-------------------------
3 | 2 | 3 | 3
Let's say it's Postgres, but i'd prefer reasonably cross-DBMS solution.
Edit 2:
Guys, I know how to do this with JOINs. Sorry for the misleading question, but I need to know how to get something extra out of the existing query. I already have the needed assets and rates selected; I just need one extra field along with max(rate) and the given conditions, if possible.
Another solution that might or might not be faster than a self join (depending on the DBMS' optimizer)
SELECT id,
asset,
rate,
asset_count
FROM (
SELECT id,
asset,
rate,
rank() over (partition by asset order by rate desc) as rank_rate,
count(*) over (partition by asset) as asset_count
FROM test
WHERE owner IN (1, 2)
) t
WHERE rank_rate = 1
AND asset_count > 1
ORDER BY rate DESC
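A runnable sketch of this analytic approach using Python's sqlite3 (window functions need SQLite 3.25+). Note that the count must be partitioned per asset and filtered in the outer query to mirror the original HAVING count(asset) > 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (id INTEGER, owner INTEGER, asset INTEGER, rate INTEGER);
INSERT INTO test VALUES (1, 1, 3, 1), (2, 1, 4, 2), (3, 2, 3, 3), (4, 2, 5, 4);
""")

# rank_rate = 1 keeps the best-rated row per asset; asset_count > 1 keeps
# only assets that appear for more than one of the selected owners.
rows = conn.execute("""
SELECT id, asset, rate
FROM (SELECT id, asset, rate,
             rank()   OVER (PARTITION BY asset ORDER BY rate DESC) AS rank_rate,
             count(*) OVER (PARTITION BY asset)                    AS asset_count
      FROM test
      WHERE owner IN (1, 2)) t
WHERE rank_rate = 1 AND asset_count > 1
ORDER BY rate DESC
""").fetchall()

print(rows)  # [(3, 3, 3)]
```

This returns exactly the row the question asks for: id 3, asset 3, rate 3.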
You are dealing with two questions and trying to solve them as if they were one. With a subquery, you can refine better by filtering the list in the proper order first (max(rate)), but as soon as you group, you lose this. As such, I would set up two queries (the same procedure, if you are using procedures, but two queries) and ask the questions separately. Unless ... you need some of the information in a single grid when output.
I guess the better direction to head is to have you show how you want the output to look. Once you bake the input and the output, the middle of the oreo is easier to fill.
SELECT b.id, b.asset, b.rate
from
(
SELECT asset, max(rate) maxrate
FROM test
WHERE owner IN (1, 2)
GROUP BY asset
HAVING count(asset) > 1
) a, test b
WHERE a.asset = b.asset
AND a.maxrate = b.rate
ORDER BY b.rate DESC
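The self-join version can be verified the same way; a small sketch with Python's sqlite3 and the table from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test (id INTEGER, owner INTEGER, asset INTEGER, rate INTEGER);
INSERT INTO test VALUES (1, 1, 3, 1), (2, 1, 4, 2), (3, 2, 3, 3), (4, 2, 5, 4);
""")

# Inner query: assets shared by both owners, with their best rate.
# Join back to the base table to recover the row holding that rate.
rows = conn.execute("""
SELECT b.id, b.asset, b.rate
FROM (SELECT asset, max(rate) AS maxrate
      FROM test
      WHERE owner IN (1, 2)
      GROUP BY asset
      HAVING count(asset) > 1) a
JOIN test b ON a.asset = b.asset AND a.maxrate = b.rate
ORDER BY b.rate DESC
""").fetchall()

print(rows)  # [(3, 3, 3)]
```

Note that if two rows of the same asset tie on the maximum rate, the join-back returns both; the analytic versions below and above pick one by rank or id instead.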
You don't specify what type of database you're running on, but if you have analytical functions available you can do this:
select id, asset, max_rate
from (
select ID, asset, max(rate) over (partition by asset) max_rate,
row_number() over (partition by asset order by rate desc) row_num
from test
where owner in (1,2)
) q
where row_num = 1
I'm not sure how to add in the "having count(asset) > 1" in this way though.
This first searches for rows with the maximum rate per asset. Then it takes the highest id per asset, and selects that:
select *
from test
inner join
(
select max(id) as MaxIdWithMaxRate
from test
inner join
(
select asset
, max(rate) as MaxRate
from test
group by
asset
) filter
on filter.asset = test.asset
and filter.MaxRate = test.rate
group by
asset
) filter2
on filter2.MaxIdWithMaxRate = test.id
If multiple assets share the maximum rate, this will display the one with the highest id.