Redshift SQL - Skipped sequence - sql

I'm working on applicant pipeline data and need to get a count of applicants who made it to each phase of the pipeline/funnel. If an applicant skips a phase, I need to need to count them in the phase anyway. Here's an example of how that data might look for one applicant:
Stage name | Entered on
Application Review | 9/7/2018
Recruiter Screen | 9/10/2018
Phone Interview | blank
Interview | 9/17/2018
Interview 2 | 9/20/2018
Offer | blank
this is what the table looks like:
CREATE TABLE application_stages (
application_id bigint,
stage_id bigint,
entered_on timestamp without time zone,
exited_on timestamp without time zone,
stage_name character varying
);
In this example, I want to count Application Review through Interview 2 (including the skipped/blank Phone Interview phase), but not the Offer. How would I write the above in SQL? (Data is stored in Amazon Redshift. Using SQL workbench to query.)
Also, please let me know if there is anything else I can add to my question to make the issue/solution clearer.

You can hardcode the stages of the pipeline in event_list table like this:
id | stage_name
1 | first stage
2 | second stage
3 | third stage
4 | fourth stage
UPD: The deeper is the stage of the funnel, the higher is its ID. This way, you can compare them, i.e. third stage is deeper than second stage because 3>2. Thus, if you need to find people that reached the 2nd stage it includes people that have events with id=2 OR events with id>2, i.e. events deeper in the funnel.
If the second stage is missed and the third stage is recorded for some person you can still count that person as "reached second stage" by joining your event data to this table by stage_name and counting the number of records with id>=2, like
select count(distinct user_id)
from event_data t1
join event_list t2
using (stage_name)
where t2.id>=2
Alternatively, you can left join your event table to event_list and fill the gaps using lag function that returns the value of the previous row (i.e. assigning the timestamp of first stage to the second stage in the case above)

Here is the SQL I ended up with. Thanks for the ideas, #AlexYes!
select stage_name,
application_stages.application_id, entered_on,
case when entered_on is NULL then lead(entered_on,1)
ignore nulls
over
(PARTITION BY application_stages.application_id order by case stage_name
when 'Application Review' then 1
when 'Recruiter Screen' then 2
when 'Phone Interview' then 3
when 'Interview' then 4
when 'Interview 2' then 5
when 'Offer' then 6
when 'Hired' then 7 end) else entered_on end as for_count, exited_on
from application_stages
I realize that the above SQL doesn't give me the counts but I am doing the counts in Tableau. Happy to have the format above in case I need to do other calculations on the new "for_count" field.

Related

SQL - Tracking student exam records as they move between schools

I'd like to pick some of your glorious minds for an optimal solution to my dilemma.
Scenario:
Schools have children and children take tests.
The tests point to the child, not the school.
If the child moves school, the test records are taken to the new school and the previous school has no record of the test being done as they are linked to the child.
Obviously, this isn't ideal and is the result of the database not being designed with this in mind. What would the correct course of action be; I’ve currently identified the 3 possibilities listed below which would solve the current problem. However, i cannot be sure which is best for the issue at hand - and if any better solutions exist.
Have each test store the school & student within the test records (requiring current records to be updated & increasing the size of the database)
Create a new child record, duplicating the existing data for the new school with a new ID so the test remains linked to the previous school (complicating the ability to identify previous test scores)
Separately keep track of moves to other schools, then use this additional table to identify current and previous using the timestamps (increased complexity and computational requirements)
EDIT:
So i tried to use a basic example, but requests for the task at hand have been requested.
Here's the DB Schema for the tables (simplified for problem, note: Postnatal is not important):
Patients: ID, MidwifeID, TeamID
Midwives: ID
Groups: ID
GroupsMidwives: MidwifeID, GroupsID
PatientObservations: ID, MidwifeID, PatientID
Using a query as follows:
SELECT Some Information
from Postnatals
JOIN Midwives on Postnatals.MidwifeID = Midwives.ID
JOIN Patients on Patients.PatientID = Postnatals.PatientID
JOIN GroupsMidwives on GroupsMidwives.MidwifeID = Midwives.ID
JOIN Groups on Groups.ID = GroupsMidwives.GroupID
JOIN PatientObservations on PatientObservations.PatientID =
Postnatals.PatientID
WHERE groups.Name = ?
*some extra checks*
GROUP BY Midwives.Firstname, Midwives.Surname, Midwives.ID
However, in the event that a midwife is moved to a different team, the data associated with the previous team is now owned by the newly assigned team. As described in the example detailed previously.
Thus a modification (which modification is yet to be realised) is required to make the data submitted - prior to a team change - assigned to the previous team, as of current, because of the way the records are owned by the midwife, this is not possible.
You should below suggestion as per you concern.
Step 1 ) You need to create School Master Table
ID | School | IsActive
1 | ABC | 1
2 | XYZ | 1
Step 2 ) You need to create Children Master having school id as foreign key
ID | School | Children Name| IsActive
1 | 2 | Mak | 1
2 | 2 | Jak | 1
Step 3 ) You need to create test table having children id as foreign key
ID | Children_id | Test Name | IsActive
1 | 2 | Math | 1
2 | 2 | Eng | 1
Now whenever child moves school then make child record inactive and create another active record with new school. This will help you to bifurcate the old test and new test.
do let me know in case morehelp required

How can I group flight legs together into routes for counting?

I have a person's flight history and want to find their most frequent route. All flights are stored as a single row in a table, even return trips where a->b will be in one row and b->a will be in another.
I need to identify where two legs equate to a route; for example:
This person has flown 16 times in total
New York to Paris 2 times (Flight key: JFKCDG)
Paris to New York 2 times (Flight Key: CDGJFK)
New York to London 3 times (Flight Key: JFKLHR)
Currently I don't know a way to group the first two above as a 'Route' and therefore any query I write considers JFKLHR to be the most frequent route (6 times between NY and London) even though I can see from the data that this person has flown between NY and Paris a total of 10 times
Sample Table:
User ID¦Flight Key
-------------------
1 ¦JFKCDG
1 ¦JFKCDG
1 ¦CDGJFK
1 ¦CDGJFK
1 ¦JFKLHR
1 ¦JFKLHR
1 ¦JFKLHR
Expected Output
User ID¦Flight Key¦Count
------------------------
1 ¦JFKCDGJFK ¦4
Building on the clever idea in the answer by #fancyPants. You can use string functions to compare each leg of a route and patch together a full return trip.
I believe this query should work. The first part of the common table expression turns those flights that are round trips into three parts (src-dst-src) and the second part returns those that are one way (as src-dst).
with flights_cte as (
select
USERID,
case when left(flightkey,3) > right(flightkey,3)
then concat(flightkey, left(flightkey,3))
else concat(right(flightkey,3), flightkey)
end as flightkey,
count(*) count
from flights f
where exists (
select 1 from flights where right(f.flightkey,3) = left(flightKey,3)
)
group by
userid,
case
when left(flightkey,3) > right(flightkey,3)
then concat(flightkey, left(flightkey,3))
else concat(right(flightkey,3), flightkey)
end
union all
select userid, FlightKey, count(*)
from flights f
where not exists (
select 1 from flights where right(f.flightkey,3) = left(flightKey,3)
)
group by UserID, FlightKey
)
select flights_cte.userid, flights_cte.flightkey, flights_cte.count
from flights_cte
join (select userid, max(count) _max_count from flights_cte group by userid) _max
on flights_cte.UserID=_max.UserID and flights_cte.count = _max_count
A sample SQL Fiddle gives this output:
| USERID | FLIGHTKEY | COUNT |
|--------|-----------|-------|
| 1 | JFKCDGJFK | 4 |
Assuming routes are not a single row, otherwise you wouldn't be asking.. (although I would guess that the whole route is in some other table, maybe reservation-related)
Guessing the first step is to group this data by person and flights that compose a 'route'. I have an article called T-SQL: Identify bad dates in a time series where the time series can be modified to detect gaps between legs of over a day (guess) to differentiate routes. Second step would be to convert legs into route, i.e. JFK-CDG and CDG-JFK to single value JFK-CDG-JFK.
Then it would be a single query, counting the above single value route, and ORDER BY that count.
Good luck.

Multi-Row Per Record SQL Statement

I'm not sure this is possible but my manager wants me to do it...
Using the below picture as a reference, is it possible to retrieve a group of records, where each record has 2 rows of columns?
So columns: Number, Incident Number, Vendor Number, Customer Name, Customer Location, Status, Opened and Updated would be part of the first row and column: Work Notes would be a new row that spans the width of the report. Each record would have two rows. Is this possible with a GROUP BY statement?
Record 1
Row 1 = Number, Incident Number, Vendor Number, Customer Name, Customer Location, Status, Opened and Updated
Row 2 = Work Notes
Record 2
Row 1 = Number, Incident Number, Vendor Number, Customer Name, Customer Location, Status, Opened and Updated
Row 2 = Work Notes
Record n
...
I don't think that possible with the built in report engine. You'll need to export the data and format it using something else.
You could have something similar to what you want on short description (list report, group by short description), but you can't group by work notes so that's out.
One thing to note is that the work_notes field is not actually a field on the table, the work_notes field is of type journal_input, which means it's really just a gateway to the actual underlying data model. "Modifying" work_notes actually just inserts into sys_journal_field.
sys_journal_field is the table which stores the work notes you're looking for. Given a sys_id of an incident record, this URL will give you all journal field entries for that particular record:
/sys_journal_field_list.do?sysparm_query=name=task^element_id=<YOUR_SYS_ID>
You will notice this includes ALL journal fields (comments + work_notes + anything else), so if you just wanted work notes, you could simply add a query against element thusly:
/sys_journal_field_list.do?sysparm_query=name=task^element=work_notes^element_id=<YOUR_SYS_ID>
What this means for you!
While you can't separate a physical row into multiple logical rows in the UI, in the case of journal fields you can join your target table against the sys_journal_field table using a Database View. This deviates from your goal in that you wouldn't get a single row for all work notes, but rather an additional row for each matched work note.
Given an incident INC123 with 3 work notes, your report against the Database View would look kind of like this:
Row 1: INT123 | markmilly | This is a test incident |
Row 2: INT123 | | | Work note #1
Row 3: INT123 | | | Work note #2
Row 4: INT123 | | | Work note #3

Cumulative average number of records created for specific day of week or date range

Yeah, so I'm filling out a requirements document for a new client project and they're asking for growth trends and performance expectations calculated from existing data within our database.
The best source of data for something like this would be our logs table as we pretty much log every single transaction that occurs within our application.
Now, here's the issue, I don't have a whole lot of experience with MySql when it comes to collating cumulative sum and running averages. I've thrown together the following query which kind of makes sense to me, but it just keeps locking up the command console. The thing takes forever to execute and there are only 80k records within the test sample.
So, given the following basic table structure:
id | action | date_created
1 | 'merp' | 2007-06-20 17:17:00
2 | 'foo' | 2007-06-21 09:54:48
3 | 'bar' | 2007-06-21 12:47:30
... thousands of records ...
3545 | 'stab' | 2007-07-05 11:28:36
How would I go about calculating the average number of records created for each given day of the week?
day_of_week | average_records_created
1 | 234
2 | 23
3 | 5
4 | 67
5 | 234
6 | 12
7 | 36
I have the following query which makes me want to murderdeathkill myself by casting my body down an elevator shaft... and onto some bullets:
SELECT
DISTINCT(DAYOFWEEK(DATE(t1.datetime_entry))) AS t1.day_of_week,
AVG((SELECT COUNT(*) FROM VMS_LOGS t2 WHERE DAYOFWEEK(DATE(t2.date_time_entry)) = t1.day_of_week)) AS average_records_created
FROM VMS_LOGS t1
GROUP BY t1.day_of_week;
Halps? Please, don't make me cut myself again. :'(
How far back do you need to go when sampling this information? This solution works as long as it's less than a year.
Because day of week and week number are constant for a record, create a companion table that has the ID, WeekNumber, and DayOfWeek. Whenever you want to run this statistic, just generate the "missing" records from your master table.
Then, your report can be something along the lines of:
select
DayOfWeek
, count(*)/count(distinct(WeekNumber)) as Average
from
MyCompanionTable
group by
DayOfWeek
Of course if the table is too large, then you can instead pre-summarize the data on a daily basis and just use that, and add in "today's" data from your master table when running the report.
I rewrote your query as:
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week
The reason why your query takes so long is because of your inner select, you are essentialy running 6,400,000,000 queries. With a query like this your best solution may be to develop a timed reporting system, where the user receives an email when the query is done and the report is constructed or the user logs in and checks the report after.
Even with the optimization written by OMG Ponies (bellow) you are still looking at around the same number of queries.
SELECT x.day_of_week,
AVG(x.count) 'average_records_created'
FROM (SELECT DAYOFWEEK(t.datetime_entry) 'day_of_week',
COUNT(*) 'count'
FROM VMS_LOGS t
GROUP BY DAYOFWEEK(t.datetime_entry)) x
GROUP BY x.day_of_week

Cross Tab - Storing different dates (Meeting1, Meeting2, Meeting 3 etc) in the same column

I need to keep track of different dates (dynamic). So for a specific Task you could have X number of dates to track (for example DDR1 meeting date, DDR2 meeting date, Due Date, etc).
My strategy was to create one table (DateTypeID, DateDescription) which would store the description of each date. Then I could create the main table (ID, TaskDescription, DateTypeID). So all the dates would be in one column and you could tell what that date represents by looking at the TypeID. The problem is displaying it in a grid. I know I should use a cross tab query, but i cannot get it to work. For example, I use a Case statement in SQL Server 2000 to pivot the table over so that each column name is the name of the date type. IF we have the following tables:
DateType Table
DateTypeID | DateDescription
1 | DDR1
2 | DDR2
3 | DueDate
Tasks Table
ID | TaskDescription
1 | Create Design
2 | Submit Paperwork
Tasks_DateType Table
TasksID | DateTypeID | Date
1 | 1 | 09/09/2009
1 | 2 | 10/10/2009
2 | 1 | 11/11/2009
2 | 3 | 12/12/2009
THE RESULT SHOULD BE:
TaskDescription | DDr1 | DDR2 | DueDate
Create Design |09/09/2009 | 10/10/2009 | null
Submit Paperwork |11/11/2009 | null | 12/12/2009
IF anyone has any idea how I can go about researching this, I appreciate it. The reason I do this instead of making a column for each date, has to do with the ability to let the user in the future add as many dates as they want without having to manually add columns to the table and editing html code. This also allows simple code for comparing dates or show upcoming tasks by their type (ex. 'Create design's DDR1 date is coming up' ) If anyone can point me in the right direction, I appreciate it.
Here is a proper answer, tested with your data. I only used the first two date types, but you'd build this up on the fly anyway.
Select
Tasks.TaskDescription,
Min(Case DateType.DateDescription When 'DDR1' Then Tasks_DateType.Date End) As DDR1,
Min(Case DateType.DateDescription When 'DDR2' Then Tasks_DateType.Date End) As DDR2
From
Tasks_DateType
INNER JOIN Tasks ON Tasks_DateType.TaskID = Tasks.TaskID
INNER JOIN DateType ON Tasks_DateType.DateTypeID = DateType.DateTypeID
Group By
Tasks.TaskDescription
EDIT
van mentioned that tasks with no dates won't show up. This is correct. Using left joins (again, mentioned by van) and restructuring the query a bit will return all tasks, even though this is not your need at the moment.
Select
Tasks.TaskDescription,
Min(Case DateType.DateDescription When 'DDR1' Then Tasks_DateType.Date End) As DDR1,
Min(Case DateType.DateDescription When 'DDR2' Then Tasks_DateType.Date End) As DDR2
From
Tasks
LEFT OUTER JOIN Tasks_DateType ON Tasks_DateType.TaskID = Tasks.TaskID
LEFT OUTER JOIN DateType ON Tasks_DateType.DateTypeID = DateType.DateTypeID
Group By
Tasks.TaskDescription
If the pivoted columns are unknown (dynamic), then you'll have to build up your query manually in either ms-sql 2000 or 2005, ie with out without PIVOT.
This involves either executing dynamic sql in a stored procedure (generally a no-no) or querying a view with dynamic sql. The latter is the approach I generally go with.
For pivoting, I prefer the Rozenshtein method over case statements, as explained here:
http://www.stephenforte.net/PermaLink.aspx?guid=2b0532fc-4318-4ac0-a405-15d6d813eeb8
EDIT
You can also do this in linq-to-sql, but it emits some pretty inefficient code (at least when I view it through linqpad), so I don't recommend it. If you're still curious I can post an example of how to do it.
I don't have personal experience with the pivot operator, it may provide a better solution.
But I've used a case statement in the past
SELECT
TaskDescription,
CASE(DateTypeID = 1, Tasks_DateType.Date) AS DDr1,
CASE(DateTypeID = 2, Tasks_DateType.Date) AS DDr2,
...
FROM Tasks
INNER JOIN Tasks_DateType ON Tasks.ID = Tasks_DateType.TasksID
INNER JOIN DateType ON Tasks_DateType.DateTypeID = DateType.DateTypeID
GROUP BY TaskDescription
This will work, but will require you to change the SQL whenever there are more Task descriptions added, so it's not ideal.
EDIT:
It appears as though the PIVOT keyword was added in SqlServer 2005, this example shows how to do a pivot query in both 2000 & 2005, but it is similar to my answer.
Version-1: +simple, -must be changed every time DateType is added. So is not great for a dynamic solution:
SELECT tt.ID,
tt.TaskDescription,
td1.Date AS DDR1,
td2.Date AS DDR2,
td3.Date AS DueDate
FROM Tasks tt
LEFT JOIN Tasks_DateType td1
ON td1.TasksID = tt.ID AND td1.DateTypeID = 1
LEFT JOIN Tasks_DateType td2
ON td2.TasksID = tt.ID AND td2.DateTypeID = 2
LEFT JOIN Tasks_DateType td3
ON td3.TasksID = tt.ID AND td3.DateTypeID = 3
Version-2: completely dynamic (with some limitations, but they can be handled - just google for it):
Dynamic pivot query creation. See Dynamic Cross-Tabs/Pivot Tables: you need to create one SP of UDF and then can use it for multiple purposes. This is the original post, to which you may find many links and improvements.
Version-3: just leave it for your client code to handle. I would not design my SQL to return a dynamic set of data, but rather handle it on the client (presentation layer). I just would not like to handle some dynamic columns that come as a result of my query, where I need to guess what is that exactly. The only reason I use Version-2 is when the result is presented directly as a table for a report. In all other cases for truly dynamic data I use client code. For example: having structure you have, how will you attach logic that field DueDate is mandatory - you cannot use DB constraints; how will you ensure that DDR1 is not higher then DDR2? If these are not separate (static) columns in the database (where you can use CONSTRAINTS), then the client code is the one that validates your data consistency.
Good luck!