Postgres: Unable to determine percent of successful events ending in a completed trip - sql

SQL Gurus,
I'm practicing my SQL skills and trying to solve this challenging problem, but I'm stuck and would appreciate some help.
A signup is defined as an event labelled ‘sign_up_success’ within the events table. For each city (‘A’ and ‘B’) and each day of the week, determine the percentage of signups in the first week of 2016 that resulted in a completed trip within 10 hours of the sign up date.
Table Name: trips
Column Name      Datatype
id               integer
client_id        integer (Foreign keyed to events.rider_id)
driver_id        integer
city_id          Integer (Foreign keyed to cities.city_id)
client_rating    integer
driver_rating    integer
request_at       Timestamp with timezone
predicted_eta    Integer
actual_eta       Integer
status           Enum(‘completed’, ‘cancelled_by_driver’, ‘cancelled_by_client’)

Table Name: cities
Column Name      Datatype
city_id          integer
city_name        string

Table Name: events
Column Name      Datatype
device_id        integer
rider_id         integer
city_id          integer
event_name       Enum(‘sign_up_success’, ‘attempted_sign_up’, ‘sign_up_failure’)
_ts              Timestamp with timezone
I tried something along these lines, but it is nowhere near the expected answer:
SELECT *
FROM trips AS trips
LEFT JOIN cities AS cities ON trips.city_id = cities.city_id
LEFT JOIN events AS events ON events.client_id = events.rider_id
WHERE events.event_name = "sign_up_success"
AND Convert(datetime, trips.request_at') <= Convert(datetime, '2016-01-07' )
AND DATEDIFF(d, Convert(datetime, events._ts), Convert(datetime, trips.request_at)) < 7 days
AND events.status = "completed
Desired Results look like below:
Monday A x%
Monday B y%
Tuesday A z%
Tuesday B p%
Can someone please help.

First of all, I assume that "trips"."city_id" is mandatory, so I use INNER JOIN instead of LEFT JOIN when joining with cities.
Then, to specify string constants, you need to use single quotes.
There are some other changes in the query -- hope you'll notice them yourself.
Also, the query might fail, since I haven't actually run it (unfortunately you didn't provide boilerplate SQL to test against).
date_trunc() function with 'week' first parameter converts your timestamp to "first day of the corresponding week, time 00:00:00", based on your current timezone settings (see https://www.postgresql.org/docs/current/static/functions-datetime.html).
I used GROUP BY on that value and second "layer" of grouping was city ID.
Also, I used "filter (where ...)" next to count(); it restricts the count to the rows that match the condition.
Finally, I used a CTE to improve the query's structure and readability.
Let me know if it fails and I'll fix it. In general, this approach should work.
with data as (
    select
        -- first day of the week containing the trip request, as 'YYYY-MM-DD'
        left(date_trunc('week', t.request_at)::text, 10) as period,
        c.city_id,
        count(distinct t.id) as trips_count,
        -- count only rows joined to a successful signup whose timestamp
        -- is earlier than the trip request time plus 10 hours
        count(*) filter (
            where
                e.event_name = 'sign_up_success'
                and e._ts < t.request_at + interval '10 hour'
        ) as successes_count
    from trips as t
    join cities as c on t.city_id = c.city_id
    left join events as e on t.client_id = e.rider_id
    where
        t.request_at between '2016-01-01' and '2016-01-08'
    group by 1, 2
)
select
    *,
    round(100 * successes_count::numeric / trips_count, 2)::text || '%' as ratio_percent
from data
order by period, city_id
;
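The query above groups by calendar week and divides by the number of trips, whereas the question asks for a percentage of signups per day of the week and city. As a rough sketch of that reading, the following starts from the signup events and flags each one that has a completed trip requested within 10 hours. It runs on SQLite through Python's sqlite3 module purely so that it can be executed as-is; the sample rows are made up, timestamps are stored as text, and strftime('%w', ...) (0 = Sunday) stands in for whatever day-of-week function your database offers. In Postgres the same shape should work with timestamp and interval arithmetic instead of SQLite's datetime().

```python
import sqlite3


def build_demo_db():
    """In-memory database with a few made-up rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cities (city_id INTEGER, city_name TEXT);
        CREATE TABLE events (rider_id INTEGER, city_id INTEGER,
                             event_name TEXT, _ts TEXT);
        CREATE TABLE trips  (id INTEGER, client_id INTEGER, city_id INTEGER,
                             request_at TEXT, status TEXT);
        INSERT INTO cities VALUES (1, 'A'), (2, 'B');
        INSERT INTO events VALUES
            (1, 1, 'sign_up_success', '2016-01-04 08:00:00'),
            (2, 1, 'sign_up_success', '2016-01-04 09:00:00'),
            (3, 2, 'sign_up_success', '2016-01-05 10:00:00'),
            (4, 2, 'sign_up_failure', '2016-01-05 10:00:00'),
            (5, 1, 'sign_up_success', '2016-01-11 08:00:00');
        INSERT INTO trips VALUES
            (1, 1, 1, '2016-01-04 12:00:00', 'completed'),
            (2, 2, 1, '2016-01-04 20:30:00', 'completed'),
            (3, 3, 2, '2016-01-05 11:00:00', 'cancelled_by_driver');
    """)
    return conn


QUERY = """
SELECT dow, city_name,
       ROUND(100.0 * SUM(converted) / COUNT(*), 2) AS pct
FROM (
    -- one row per successful signup in the first week of 2016
    SELECT strftime('%w', e._ts) AS dow,
           c.city_name AS city_name,
           -- 1 if a completed trip was requested within 10 hours, else 0
           EXISTS (
               SELECT 1
               FROM trips AS t
               WHERE t.client_id = e.rider_id
                 AND t.status = 'completed'
                 AND t.request_at >= e._ts
                 AND t.request_at <= datetime(e._ts, '+10 hours')
           ) AS converted
    FROM events AS e
    JOIN cities AS c ON c.city_id = e.city_id
    WHERE e.event_name = 'sign_up_success'
      AND e._ts >= '2016-01-01 00:00:00'
      AND e._ts <  '2016-01-08 00:00:00'
) AS per_signup
GROUP BY dow, city_name
ORDER BY dow, city_name
"""


def signup_conversion(conn):
    """Return (day_of_week, city_name, percent) rows."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in signup_conversion(build_demo_db()):
        print(row)
```

Starting from events rather than trips keeps signups with no trip at all in the denominator, which a trips-first join would drop.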

Related

How can I create a query in SQL Server, using as base table a date function and linking it to another table?

I am trying to write a query that combines a date function with a table of shifts, so that it shows each worker's schedule for every day, whether that day is a shift or a rest day.
What I have: a date function that returns every date in a range I supply. I attach an example:
I also have a table of shifts that contains only the days on which a person has a shift. If a day is a rest day, there is no row for that date. I attach an example:
As you can see, the shift table only has records for days on which a person has a shift.
Problem: when I join the function to the shift table on the date field, the result only shows the dates on which the worker has a shift; it does not include the dates on which the worker has a break. I attach an example:
Desired result:
The idea is that when the worker has a break, the row is blank apart from the date and the worker's ID, or it says the word "break".
I hope you can help me. Thank you so much.
Use a LEFT JOIN so that dates with no matching row in the shift table are not dropped.
Use two subqueries to get the desired result. The first subquery CROSS JOINs the date function with the distinct id_trabajador values found in the shift table for the specified date range; without it, id_trabajador would be blank on any date for which that worker has no shift. The second subquery retrieves all shift rows for the given date range.
-- SQL Server
SELECT tmp.fecha, tmp.id_trabajador
     , tmd.inicio, tmd.termino
     , COALESCE(CAST(tmd.jornada AS varchar(20)), 'DESCANSO') jornada
FROM (SELECT *
      FROM shift_cmr..fnRangoFechas('01-sep-2021', '31-dec-2021') t
      CROSS JOIN (SELECT id_trabajador
                  FROM shift_cmr..trabajadores_turnos_planificados
                  WHERE fecha BETWEEN '2021-09-01' AND '2021-12-31'
                  GROUP BY id_trabajador) tt
     ) tmp
LEFT JOIN (SELECT *
           FROM shift_cmr..trabajadores_turnos_planificados
           WHERE fecha BETWEEN '2021-09-01' AND '2021-12-31') tmd
       ON tmp.fecha = tmd.fecha
      AND tmp.id_trabajador = tmd.id_trabajador
You need to start with the date table and LEFT JOIN everything else to it:
SELECT
    dates.fecha,
    sh.id_trabajador,
    sh.inicio,
    sh.termino,
    jornada = ISNULL(CAST(sh.jornada AS varchar(10)), 'DESCANSO')
FROM shift_cmr..fnRangoFechas('01-sep-2021', '31-dec-2021') dates
LEFT JOIN shift_cmr..trabajadores_turnos_planificados sh
    ON sh.fecha = dates.fecha
This only gives you one blank row per date. If you need a blank row for every id_trabajador, then you need to cross join the dates with the list of workers first:
SELECT
    dates.fecha,
    t.id_trabajador,
    sh.inicio,
    sh.termino,
    jornada = ISNULL(CAST(sh.jornada AS varchar(10)), 'DESCANSO')
FROM shift_cmr..fnRangoFechas('01-sep-2021', '31-dec-2021') dates
CROSS JOIN shift_cmr..trabajadores t -- guessing the table name
LEFT JOIN shift_cmr..trabajadores_turnos_planificados sh
    ON sh.fecha = dates.fecha AND t.id_trabajador = sh.id_trabajador
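To see the dates-times-workers pattern in action without a SQL Server instance, here is a small sketch on SQLite via Python's sqlite3 module. It is only an illustration under assumptions: a recursive CTE stands in for fnRangoFechas, the worker list is derived from the shift table itself (as in the first answer), and the shift rows and jornada values are invented.

```python
import sqlite3


def build_demo_db():
    """In-memory shift table with two made-up rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE trabajadores_turnos_planificados (
            fecha TEXT, id_trabajador INTEGER, jornada TEXT);
        INSERT INTO trabajadores_turnos_planificados VALUES
            ('2021-09-01', 1, 'T1'),
            ('2021-09-02', 2, 'T2');
    """)
    return conn


QUERY = """
WITH RECURSIVE dates(fecha) AS (
    -- stand-in for fnRangoFechas: one row per day from :start to :end
    SELECT :start
    UNION ALL
    SELECT date(fecha, '+1 day') FROM dates WHERE fecha < :end
)
SELECT d.fecha,
       w.id_trabajador,
       COALESCE(s.jornada, 'DESCANSO') AS jornada
FROM dates AS d
-- every date paired with every worker, shift or not
CROSS JOIN (SELECT DISTINCT id_trabajador
            FROM trabajadores_turnos_planificados) AS w
-- shifts attach where they exist; otherwise the columns stay NULL
LEFT JOIN trabajadores_turnos_planificados AS s
       ON s.fecha = d.fecha
      AND s.id_trabajador = w.id_trabajador
ORDER BY d.fecha, w.id_trabajador
"""


def schedule(conn, start, end):
    """Return one (fecha, id_trabajador, jornada) row per date and worker."""
    return conn.execute(QUERY, {"start": start, "end": end}).fetchall()


if __name__ == "__main__":
    for row in schedule(build_demo_db(), "2021-09-01", "2021-09-02"):
        print(row)
```

The key point is the order of operations: build the full grid of dates and workers first, then LEFT JOIN the shifts onto that grid so missing days survive as rows.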

Quickly querying with effective dates on rows themselves with dates

Suppose I have a table called measurement. Its purpose is to record a numeric "value" (which is itself calculated from other data) for a "series_id" at a particular "date".
Now let's add effective dating to this table with "effective_start" (inclusive) and "effective_end" (inclusive) fields.
DDL:
CREATE TABLE public.measurement
(
    date date NOT NULL,
    effective_end date NOT NULL,
    effective_start date NOT NULL,
    series_id character varying(255) COLLATE pg_catalog."default" NOT NULL,
    value numeric,
    CONSTRAINT measurement_pkey PRIMARY KEY (date, effective_end, effective_start, series_id)
)
My challenge is now to construct a query, quickly and with SQL only (I have Java code and a partial query that solves this), that returns the following:
For all series, at a particular point in time (a query parameter), return the most recent measurement (maximum "date") that was effective at the point in time being queried.
My current "all-SQL" solution is a view, combined with a query over the view:
DDL for the view:
CREATE OR REPLACE VIEW public.known_at AS
SELECT o.date,
       o.effective_end,
       o.effective_start,
       o.series_id,
       o.value
FROM measurement o
JOIN ( SELECT o_1.series_id,
              min(o_1.effective_start) AS effective_start,
              o_1.date
       FROM measurement o_1
       GROUP BY o_1.series_id, o_1.date) x
  ON o.series_id::text = x.series_id::text
 AND o.effective_start = x.effective_start
 AND o.date = x.date
JOIN ( SELECT o_1.series_id,
              o_1.effective_start,
              max(o_1.date) AS date
       FROM measurement o_1
       GROUP BY o_1.series_id, o_1.effective_start) y
  ON x.series_id::text = y.series_id::text
 AND x.effective_start = y.effective_start
 AND x.date = y.date
WHERE o.date <= o.effective_start
ORDER BY o.date DESC, o.series_id DESC;
Query:
select k.*
from known_at k
inner join (
    select
        k.series_id,
        max(k.date) as date
    from known_at k
    -- the passed in date here is a parameter as described above
    where k.date <= '2020-03-26'
    group by k.series_id) as mx
  on k.series_id = mx.series_id and k.date = mx.date
order by k.series_id;
Unfortunately, the combination view and query is slow (~400ms) despite btree indices on series_id, date, effective_end, and effective_start. How can I do better?
I think this query should give you the results you want, though without having your dataset it's hard to say what its performance would be like. For this query, I'd recommend a multi-column index on (effective_start, effective_end, series_id, date DESC).
SELECT DISTINCT ON (series_id) *
FROM measurement
WHERE effective_start <= '2020-03-26' -- the passed-in date
AND effective_end >= '2020-03-26' -- the passed-in date
ORDER BY series_id, date DESC;
Explanation: The query filters for rows that include the passed-in date within the effective period, then for each series_id in the filtered rows, the row with the max date is taken.
Also, you may want to consider using a daterange type for the effective dates. Range types come with some useful range operators.
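The filter-then-latest-per-series logic described above can be checked on a tiny dataset. The sketch below is not the DISTINCT ON query itself: SQLite has no DISTINCT ON, so it swaps in a join against a grouped MAX(date) subquery, which should select the same rows as long as a series has at most one effective row per date. Dates are stored as ISO text, the rows are invented, and the function name latest_effective is hypothetical.

```python
import sqlite3


def build_demo_db():
    """In-memory measurement table with made-up rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE measurement (
            "date" TEXT, effective_start TEXT, effective_end TEXT,
            series_id TEXT, value REAL);
        INSERT INTO measurement VALUES
            ('2020-03-20', '2020-03-20', '2020-03-31', 's1', 1.0),
            ('2020-03-25', '2020-03-25', '2020-03-31', 's1', 2.0),
            ('2020-03-27', '2020-03-27', '2020-03-31', 's1', 3.0),
            ('2020-03-10', '2020-03-10', '2020-03-15', 's2', 9.0),
            ('2020-03-24', '2020-03-24', '2020-04-30', 's2', 4.0);
    """)
    return conn


QUERY = """
SELECT m.series_id, m."date", m.value
FROM measurement AS m
JOIN (
    -- latest date per series among rows effective on :as_of
    SELECT series_id, MAX("date") AS max_date
    FROM measurement
    WHERE effective_start <= :as_of AND effective_end >= :as_of
    GROUP BY series_id
) AS latest
  ON latest.series_id = m.series_id
 AND latest.max_date = m."date"
-- repeat the effectiveness filter so only effective rows are returned
WHERE m.effective_start <= :as_of AND m.effective_end >= :as_of
ORDER BY m.series_id
"""


def latest_effective(conn, as_of):
    """Return (series_id, date, value) for the latest row effective on as_of."""
    return conn.execute(QUERY, {"as_of": as_of}).fetchall()


if __name__ == "__main__":
    for row in latest_effective(build_demo_db(), "2020-03-26"):
        print(row)
```

For series s1 the row dated 2020-03-27 is skipped because it only becomes effective after the queried date, and for s2 the expired row is skipped, which is the behaviour the explanation describes.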

How to join to inner query and calculate column based on different groupings?

I have a table that contains data about a series of visits to shops.
The raw data for these visits can be found here.
My main table will have 1 row per Country, and will use something along the lines of:
Select Distinct o.Country from OtherTable as o
I need to add a new column to my main table that uses the following calculation:
"Avg Visits by User" = (Sum of (No. Call IDs / No. unique User IDs) for each day) / No. of unique days (based on Actual Start) for the row.
I have formed this additional select statement to get the number of calls and users by day - but I am struggling to join this to my main table:
Select DATEPART(DAY, c.ActualStart) As 'Day',
       CAST(CAST(COUNT(c.CallID) AS DECIMAL (5,1)) / CAST(COUNT(Distinct c.UserID) AS DECIMAL (5,1)) AS DECIMAL (5,1)) as 'Value'
from CallInfo as c
where (c.Status = 3)
Group by DATEPART(DAY, c.ActualStart)
For the country GB, I would expect to see the following output:
Day       Calls   Users   Calls / Users
13-Jun    29      8       3.625
14-Jun    31      7       4.428571429
So, in my main table, the calculation for my new column would be:
8.053571 / 2
Therefore, if I somehow add this to my table I would expect the following output:
Country   Unique Days   Sum of (Calls / Users) for each day   Final Calc
GB        2             8.053571429                           4.026785714
I have tried adding this as a join, but I don't know how to join this to my main table. I could for example join on Call Id - but this would require the addition of a callID column in my inner query, and this would mean that the values are incorrect.
You can use a subquery to make calculations by day and after that make calculations by country. The result SQL query can be like this:
-- Make calculation by country, from the subquery
SELECT Country,
       UniqueDays = count(TheDay),
       CallsUserPerDay = sum(CallsPerUser),
       FinalCalc = sum(CallsPerUser) / cast(count(TheDay) as DECIMAL)
FROM (
    -- SUBQUERY: Make calculations by day
    -- (assumes ActualStart carries a time of day, so it is cast to date
    --  to put all calls from the same day into one group)
    SELECT c.Country,
           CAST(c.ActualStart AS date) as TheDay,
           Calls = COUNT(c.CallID),
           Users = COUNT(Distinct c.UserID),
           CallsPerUser = COUNT(c.CallID)
                          / CAST(COUNT(Distinct c.UserID) AS DECIMAL)
    FROM CallInfo as c
    WHERE (c.Status = 3)
    GROUP BY c.Country, CAST(c.ActualStart AS date)
) data
GROUP BY Country
Note: I did not specify a precision in the DECIMAL casts, to avoid rounding the final result.
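As a sanity check on the two-level aggregation, the sketch below reproduces the GB figures from the question (29 calls by 8 users and 31 calls by 7 users) on SQLite via Python's sqlite3 module. The individual rows, the year 2016, the times of day and the assumption that CallInfo has a Country column are all invented for illustration; SQLite's date() plays the role of the cast to date.

```python
import sqlite3


def build_demo_db():
    """In-memory CallInfo table matching the question's GB totals (made-up rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE CallInfo (CallID INTEGER, UserID INTEGER, "
        "Country TEXT, ActualStart TEXT, Status INTEGER)"
    )
    rows = []
    call_id = 0
    for day, n_calls, n_users in (("2016-06-13", 29, 8), ("2016-06-14", 31, 7)):
        for i in range(n_calls):
            call_id += 1
            # Spread calls over users and over different hours of the day.
            start = f"{day} {9 + i % 8:02d}:00:00"
            rows.append((call_id, i % n_users, "GB", start, 3))
    # A call with another status, which the Status = 3 filter should ignore.
    rows.append((call_id + 1, 99, "GB", "2016-06-13 10:00:00", 1))
    conn.executemany("INSERT INTO CallInfo VALUES (?, ?, ?, ?, ?)", rows)
    return conn


QUERY = """
SELECT Country,
       COUNT(*)                     AS UniqueDays,
       SUM(CallsPerUser)            AS SumCallsPerUser,
       SUM(CallsPerUser) / COUNT(*) AS FinalCalc
FROM (
    -- inner level: one row per country and calendar day
    SELECT Country,
           date(ActualStart) AS TheDay,
           COUNT(CallID) * 1.0 / COUNT(DISTINCT UserID) AS CallsPerUser
    FROM CallInfo
    WHERE Status = 3
    GROUP BY Country, date(ActualStart)
) AS per_day
GROUP BY Country
"""


def avg_visits_by_user(conn):
    """Return (Country, UniqueDays, SumCallsPerUser, FinalCalc) rows."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in avg_visits_by_user(build_demo_db()):
        print(row)
```

Once the per-day ratios exist as rows, the outer query is an ordinary GROUP BY, so it can be joined to the main table on Country rather than on CallID.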

SQL Queries in Oracle

I have created a database of a hospital and "management would like to know how many people got diagnosed with cancer in the last year".
CREATE TABLE patients (
    ID_patients INTEGER NOT NULL,
    Name VARCHAR2(100) NOT NULL  -- length is assumed; Oracle requires one
);
CREATE TABLE visit(
    ID_visit INTEGER NOT NULL,
    DATE_visit DATE NOT NULL,
    FK_patients INTEGER NOT NULL
);
CREATE TABLE Diagnosis(
    ID_Diagnosis INTEGER NOT NULL,
    FK_disease INTEGER NOT NULL,
    FK_visit INTEGER
);
CREATE TABLE Disease(
    ID_disease INTEGER NOT NULL,
    Name_disease VARCHAR2(100) NOT NULL  -- length is assumed
);
Now we need to find out: which patients got diagnosed with cancer last year.
I used the query below to get the patients who visited last year, but I do not know how to target those with cancer. I think I should use "VIEW AS", but I'm not sure.
SELECT *
FROM Visit
WHERE Date_Visit BETWEEN
(CURRENT_DATE - interval '2' year) AND CURRENT_DATE - INTERVAL '1' YEAR;
Assuming you only need a patient count and you already know how to define cancer, you'll want to use a JOIN to connect these tables together:
SELECT COUNT(DISTINCT v.FK_patients)  -- DISTINCT so each patient is counted once
FROM visit v
JOIN Diagnosis d ON d.FK_visit = v.ID_visit --Here is where you connect the tables
WHERE v.Date_Visit BETWEEN
      (CURRENT_DATE - interval '2' year) AND CURRENT_DATE - INTERVAL '1' YEAR
  AND d.FK_disease IN (
      -- your list of cancer disease IDs goes here
  );
Dank's answer illustrates the join very nicely. We can also use Oracle's date functions instead of INTERVAL arithmetic, which makes the code look cleaner. Since we want data for the past year, I am assuming you need the data for 01/01/2015 to 12/31/2015. I hope the snippet below helps.
SELECT COUNT(DISTINCT v.FK_patients)
FROM visit v
JOIN Diagnosis d ON d.FK_visit = v.ID_visit --Here is where you connect the tables
WHERE v.Date_Visit BETWEEN
      TRUNC(ADD_MONTHS(SYSDATE,-12),'YEAR') AND TRUNC(SYSDATE,'YEAR')-1 ;
This should be straightforward, I guess:
select pa.ID_patients, pa.Name
from patients pa, visit vi, Diagnosis dia, Disease dis
where vi.FK_patients = pa.ID_patients
  and dia.FK_visit = vi.ID_visit
  and dis.ID_disease = dia.FK_disease
  and upper(dis.Name_disease) like '%CANCER%'
Just add your date filtering to it and it should show the desired result...
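Putting the join chain and the date filter together, a minimal runnable sketch might look like the following. It uses SQLite through Python's sqlite3 module rather than Oracle, joins Diagnosis to visit through the FK_visit column shown in the question's DDL, and takes the date range as parameters instead of computing it from the current date. The patients, visits and disease names are invented.

```python
import sqlite3


def build_demo_db():
    """In-memory copy of the question's four tables with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patients  (ID_patients INTEGER, Name TEXT);
        CREATE TABLE visit     (ID_visit INTEGER, DATE_visit TEXT,
                                FK_patients INTEGER);
        CREATE TABLE Diagnosis (ID_Diagnosis INTEGER, FK_disease INTEGER,
                                FK_visit INTEGER);
        CREATE TABLE Disease   (ID_disease INTEGER, Name_disease TEXT);
        INSERT INTO patients VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO Disease VALUES (1, 'Lung cancer'), (2, 'Flu');
        INSERT INTO visit VALUES
            (1, '2015-03-01', 1),
            (2, '2015-09-01', 1),
            (3, '2015-05-05', 2),
            (4, '2014-05-05', 3);
        INSERT INTO Diagnosis VALUES
            (1, 1, 1),   -- Ann, cancer, in range
            (2, 1, 2),   -- Ann again, cancer, in range
            (3, 2, 3),   -- Bob, flu
            (4, 1, 4);   -- Cy, cancer, but the year before
    """)
    return conn


QUERY = """
SELECT DISTINCT pa.ID_patients, pa.Name
FROM patients AS pa
JOIN visit     AS vi  ON vi.FK_patients = pa.ID_patients
JOIN Diagnosis AS dia ON dia.FK_visit   = vi.ID_visit
JOIN Disease   AS dis ON dis.ID_disease = dia.FK_disease
WHERE UPPER(dis.Name_disease) LIKE '%CANCER%'
  AND vi.DATE_visit BETWEEN :first_day AND :last_day
ORDER BY pa.ID_patients
"""


def cancer_patients(conn, first_day, last_day):
    """Return distinct (ID_patients, Name) diagnosed with cancer in the range."""
    params = {"first_day": first_day, "last_day": last_day}
    return conn.execute(QUERY, params).fetchall()


if __name__ == "__main__":
    found = cancer_patients(build_demo_db(), "2015-01-01", "2015-12-31")
    print(len(found), found)
```

DISTINCT matters here because a patient with several qualifying visits would otherwise appear, and be counted, once per diagnosis.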

sql select number divided aggregate sum function

I have this schema
and I want a query that calculates the cost per consultant per hour per month. In other words, a consultant has a monthly salary, and I want to divide that salary by the number of hours he/she worked that month.
SELECT
    concat_ws(' ', consultants.first_name::text, consultants.last_name::text) as name,
    EXTRACT(MONTH FROM tasks.init_time) as task_month,
    SUM(tasks.finish_time::timestamp::time - tasks.init_time::timestamp::time) as duration,
    EXTRACT(MONTH FROM salaries.payment_date) as salary_month,
    salaries.payment
FROM consultants
INNER JOIN tasks ON consultants.id = tasks.consultant_id
INNER JOIN salaries ON consultants.id = salaries.consultant_id
WHERE EXTRACT(MONTH FROM tasks.init_time) = EXTRACT(MONTH FROM salaries.payment_date)
GROUP BY (consultants.id, EXTRACT(MONTH FROM tasks.init_time), EXTRACT(MONTH FROM salaries.payment_date), salaries.payment);
However, it is not possible to do this in the select list:
salaries.payment / SUM(tasks.finish_time::timestamp::time - tasks.init_time::timestamp::time)
Is there another way to do it? Is it possible to solve it in one query?
Assumptions made for this answer:
The model is not entirely clear to me, so I am assuming the following:
you are using PostgreSQL
salaries.payment_date is defined as a date column that stores the day when a consultant was paid
tasks.init_time and tasks.finish_time are defined as timestamp columns storing the date & time when a consultant started and finished work on a specific task.
Your join on only the month is wrong as far as I can tell. For one, because it would also include months from different years, but more importantly because this would lead to a result where the same row from salaries appeared several times. I think you need to join on the complete date:
FROM consultants c
JOIN tasks t ON c.id = t.consultant_id
JOIN salaries s ON c.id = s.consultant_id
AND t.init_time::date = s.payment_date --<< here
If my assumptions about the data types are correct, the cast to a timestamp and then back to a time is useless and wrong. Useless because you can simply subtract two timestamps, and wrong because you are ignoring the actual date in the timestamp, so (although unlikely) if init_time and finish_time are not on the same day, the result is wrong.
So the calculation of the duration can be simplified to:
t.finish_time - t.init_time
To get the cost per hour per month, you need to convert the interval (which is the result of subtracting one timestamp from another) to a decimal number of hours. You can do this by extracting the seconds from the interval and then dividing that by 3600, e.g.
extract(epoch from sum(t.finish_time - t.init_time)) / 3600
If you divide the sum of the payments by that number you get your cost per hour per month:
SELECT concat_ws(' ', c.first_name, c.last_name) as name,
       to_char(s.payment_date, 'yyyy-mm') as salary_month,
       extract(epoch from sum(t.finish_time - t.init_time)) / 3600 as worked_hours,
       sum(s.payment) / (extract(epoch from sum(t.finish_time - t.init_time)) / 3600) as cost_per_hour
FROM consultants c
JOIN tasks t ON c.id = t.consultant_id
JOIN salaries s ON c.id = s.consultant_id AND t.init_time::date = s.payment_date
GROUP BY c.id, to_char(s.payment_date, 'yyyy-mm') --<< no parentheses!
order by name, salary_month;
As you want the report broken down by month you should convert the month into something that contains the year as well. I used to_char() to get a string with only year and month. You also need to remove salaries.payment from the group by clause.
You also don't need the "payment month" and "salary month" because both will always be the same as that is the join condition.
And finally you don't need the cast to ::text for the name columns because they are most certainly defined as varchar or text anyway.
The sample data I made up for this: http://sqlfiddle.com/#!15/ae0c9
Somewhat unrelated, but:
You should also not put the column list of the group by in parentheses. Putting a column list in parentheses in Postgres creates an anonymous record, which is something completely different than having multiple columns. This is also true for the columns in the select list.
If the goal is simply to do it all in one query, have you tried using a CTE?
Like
WITH cte_pymt AS
(
    -- your existing query goes here
)
SELECT <your required data> FROM cte_pymt
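One way the CTE idea could be fleshed out, offered only as a sketch under assumptions, is to total the worked hours per consultant and month in one CTE, total the payments per consultant and month in another, and divide after joining the two. This differs from the join on the exact payment date shown earlier: it assumes one salary figure covers all tasks started in the same calendar month. The code runs on SQLite via Python's sqlite3 module, with strftime('%s', ...) standing in for extract(epoch ...); the column names follow the queries above, while the sample consultant, tasks and payments are invented.

```python
import sqlite3


def build_demo_db():
    """In-memory consultants/tasks/salaries tables with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE consultants (id INTEGER, first_name TEXT, last_name TEXT);
        CREATE TABLE tasks (consultant_id INTEGER, init_time TEXT,
                            finish_time TEXT);
        CREATE TABLE salaries (consultant_id INTEGER, payment_date TEXT,
                               payment REAL);
        INSERT INTO consultants VALUES (1, 'Ada', 'Lovelace');
        INSERT INTO tasks VALUES
            (1, '2016-01-04 09:00:00', '2016-01-04 13:00:00'),  -- 4 hours
            (1, '2016-01-05 09:00:00', '2016-01-05 15:00:00'),  -- 6 hours
            (1, '2016-02-01 09:00:00', '2016-02-01 11:00:00');  -- 2 hours
        INSERT INTO salaries VALUES
            (1, '2016-01-31', 1000.0),
            (1, '2016-02-29', 1000.0);
    """)
    return conn


QUERY = """
WITH hours AS (
    -- worked hours per consultant and month (seconds / 3600)
    SELECT consultant_id,
           strftime('%Y-%m', init_time) AS ym,
           SUM(CAST(strftime('%s', finish_time) AS INTEGER)
             - CAST(strftime('%s', init_time)   AS INTEGER)) / 3600.0
               AS worked_hours
    FROM tasks
    GROUP BY consultant_id, ym
),
pay AS (
    -- salary paid per consultant and month
    SELECT consultant_id,
           strftime('%Y-%m', payment_date) AS ym,
           SUM(payment) AS total_payment
    FROM salaries
    GROUP BY consultant_id, ym
)
SELECT c.first_name || ' ' || c.last_name AS name,
       h.ym                               AS month,
       h.worked_hours,
       p.total_payment / h.worked_hours   AS cost_per_hour
FROM hours AS h
JOIN pay AS p ON p.consultant_id = h.consultant_id AND p.ym = h.ym
JOIN consultants AS c ON c.id = h.consultant_id
ORDER BY name, month
"""


def cost_per_hour(conn):
    """Return (name, month, worked_hours, cost_per_hour) rows."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in cost_per_hour(build_demo_db()):
        print(row)
```

Aggregating each side before the join means a salary row is never repeated once per task, so the payment does not need to appear in a GROUP BY or be summed across task rows.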