SQL multiple join with count, having trouble - sql

First of all this is a homework assignment so I'm looking for assistance, not solutions. Let me try to explain my schema. I have three tables we'll call users (with columns id and name), parties (with columns id, partydate, and user_id) and questions (with columns id, createdate, and user_id). My requirement is to show for every user the number of parties within the last year and questions created within the last year. At first I had something like this:
SELECT users.id, users.name,
COUNT(parties.id) AS numparties, COUNT(questions.id) AS numquestions
FROM users
FULL JOIN parties ON users.id=parties.user_id
FULL JOIN questions ON users.id=questions.user_id
WHERE (parties.partydate > NOW() - interval '1 year' OR parties.partydate IS NULL)
OR (questions.createdate > NOW() - interval '1 year' OR questions.createdate IS NULL)
GROUP BY users.id, users.name
Now this works, almost! The problem is, if a user has no parties nor questions within the past year, they don't show up at all in the result. I want such a user to show up, I just want it to show them with 0 for each numparties and numquestions.
What I think I need here is some sort of conditional counting, where I only want to COUNT(parties.id) WHERE that party's partydate is within the past year, and the same for questions. I'm just unsure how to do that. I have a hacky-workaround way to do what I want, where I basically UNION the above query with a near identical copy of itself, except I use SUM(0) for numparties and numquestions and my WHERE statement is just where the date is <= instead of >. I feel this is not the best way to go about it.
Any pointers in the right direction? Thanks for the help!

Take a look at this: http://www.w3schools.com/sql/sql_join_left.asp. I think it might point you in the right direction.

Think I've got it with this:
SUM(CASE WHEN (parties.partydate > NOW() - interval '1 year') THEN 1 ELSE 0 END) as numparties
and just removed the WHERE clauses.
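For anyone who wants to try the conditional count in isolation, here is a minimal sketch run against an in-memory SQLite database from Python. The sample rows and the fixed `cutoff` string (standing in for NOW() - interval '1 year') are made up for illustration. It joins only parties on purpose: joining parties and questions in the same query pairs every party with every question for a user, which would inflate both sums.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE parties (id INTEGER PRIMARY KEY, partydate TEXT, user_id INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO parties VALUES
        (1, '2024-03-01', 1),
        (2, '2020-01-01', 1),
        (3, '2019-06-01', 2);
""")

# Hypothetical fixed cutoff so the result does not depend on today's date.
cutoff = "2023-06-01"

# No WHERE clause: every user survives the LEFT JOIN, and the CASE decides
# whether each joined party row contributes 1 or 0 to the sum.
party_counts = conn.execute("""
    SELECT users.id, users.name,
           SUM(CASE WHEN parties.partydate > ? THEN 1 ELSE 0 END) AS numparties
    FROM users
    LEFT JOIN parties ON users.id = parties.user_id
    GROUP BY users.id, users.name
    ORDER BY users.id
""", (cutoff,)).fetchall()

print(party_counts)
```

A user with only old parties, or with no parties at all, still gets a row with 0, because a NULL partydate fails the comparison and falls through to the ELSE branch.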

I think I'd resort to a subquery for this. Homework questions are fun to answer, so I'll heavily pseudocode this. Take the entire query you have and call it 'x'.
First thing you'll want is a list of all users regardless of how many questions they've asked.
Select distinct users.id,users.name from users
That will give you a full list of your users. The query you have above gives you their counts, so left join the two together.
Select (fields you want)
from users
left join (enter your query above here) x on x.id = users.id
Hopefully the logic here makes sense for you. Use one query to get the list of users and join that to the subquery to get their counts.
edit to add: this will bring back nulls anytime there are no records. You can make your select statement show nulls as 0's
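A rough sketch of that shape, again in SQLite through Python with invented sample data and a fixed `cutoff` in place of the real date arithmetic. The subquery `x` here is a simplified stand-in for the original query (it counts questions only), and COALESCE is one way to show the nulls as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users     (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE questions (id INTEGER PRIMARY KEY, createdate TEXT, user_id INTEGER);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO questions VALUES
        (1, '2024-02-01', 1),
        (2, '2024-04-01', 1),
        (3, '2018-01-01', 2);
""")

# Hypothetical fixed cutoff standing in for "one year ago".
cutoff = "2023-06-01"

# The subquery x only contains users with recent questions; the outer
# LEFT JOIN brings back everyone else with NULL, which COALESCE turns into 0.
question_counts = conn.execute("""
    SELECT users.id, users.name, COALESCE(x.numquestions, 0) AS numquestions
    FROM users
    LEFT JOIN (
        SELECT user_id, COUNT(*) AS numquestions
        FROM questions
        WHERE createdate > ?
        GROUP BY user_id
    ) AS x ON x.user_id = users.id
    ORDER BY users.id
""", (cutoff,)).fetchall()

print(question_counts)
```

Because each subquery is aggregated to one row per user before the join, a second subquery for parties could be joined the same way without the counts multiplying each other.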

Related

BigQuery: SELECT in WHERE-clause with filter based on a value in the current row

I know the title is probably pretty stupid but I have a hard time phrasing it differently.
I have to use BigQuery at work atm for some report. BigQuery is connected to a Google Analytics view of ours. This gives us a dataset with 1 table for each day. The rows of the tables are user-sessions on our site, while columns have some information about the sessions.
The problem I have is the following:
I want to select sessions with transactions, but only if the user was referred to our site by a certain referrer in the last x days before the transaction happened. I'm only familiar with basic SQL and not with any advanced concepts. It's really frustrating to me because this would be a no-brainer with any proper programming language given a .csv of the data, but I'm lacking knowledge of the relevant concepts in SQL.
#standardSQL
SELECT
COUNT(*)
FROM
`dataset.ga_sessions_2017*`
WHERE
totals.transactions > 0 AND
fullVisitorId IN (SELECT
fullVisitorId
FROM
`dataset.ga_sessions_2017*`
WHERE
trafficSource.source = "xyz.com"
) AND
< date difference thing>
I could filter for the date difference like I did with the trafficSource (referrer). The problem for me is that while "xyz.com" is a static thing, I'd need to reference the date value of the current row I'm in. So the date by which I'd filter the 2nd SELECT would be dynamically changing from row to row. Can anyone guide me on how this is usually done? This seems like a thing that would come up often.
I'm not familiar with the GA tables specifically, but having written some wildcard queries in BigQuery before, I think what you're looking for can be done using the _TABLE_SUFFIX pseudo column:
CAST(_TABLE_SUFFIX AS INT64) >= 1217
Where 1217 is today's date in MMDD format minus 3 days, assuming the table names are _20171217, _20171218, etc. Otherwise you can just use REPLACE to remove underscores before casting to an int. There are also functions that will generate today's date for you if you needed this query to run automatically.
Also, I think the fullVisitorId business could be replaced with a simple WHERE trafficSource.source = "xyz.com" but it's hard to say for sure without being able to run the query myself.
So the full query would look something like this:
#standardSQL
SELECT
COUNT(*)
FROM
`dataset.ga_sessions_2017*`
WHERE
totals.transactions > 0 AND
trafficSource.source = "xyz.com" AND
CAST(_TABLE_SUFFIX AS INT64) >= 1217
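On the general question of a filter that depends on the current row: one common SQL pattern is a correlated subquery, where the inner query refers to columns of the outer row. The sketch below shows the idea in SQLite through Python, not in BigQuery. The `sessions` table, its columns, and the 3-day window are all hypothetical and do not match the GA export schema; BigQuery has its own date functions and may need the same logic rewritten as a self-join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sessions (visitor_id TEXT, visit_date TEXT,
                           source TEXT, transactions INTEGER);
    INSERT INTO sessions VALUES
        ('v1', '2017-12-01', 'xyz.com', 0),
        ('v1', '2017-12-03', 'direct',  1),
        ('v2', '2017-11-01', 'xyz.com', 0),
        ('v2', '2017-12-10', 'direct',  1),
        ('v3', '2017-12-05', 'direct',  1);
""")

max_days = 3  # the "x days" window from the question (assumed value)

# For each transaction row t, look for a referral row r from the same
# visitor whose date is at most max_days before t's date. The inner query
# reads t.visit_date, so the date filter changes from row to row.
referred_transactions = conn.execute("""
    SELECT COUNT(*)
    FROM sessions AS t
    WHERE t.transactions > 0
      AND EXISTS (
          SELECT 1
          FROM sessions AS r
          WHERE r.visitor_id = t.visitor_id
            AND r.source = 'xyz.com'
            AND julianday(t.visit_date) - julianday(r.visit_date) BETWEEN 0 AND ?
      )
""", (max_days,)).fetchone()[0]

print(referred_transactions)
```

Only v1 counts here: v2 was referred too long before the transaction and v3 was never referred.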

comparing usage of inner join and where in

with two tables - all_data and selected_place_day_hours
all_data has place_id, day, hour, metric
selected_place_day_hours has fields place_id, day, hour
I need to subset all_data such that only records with place_id, day, hour in selected_place_day_hours are selected.
I can go two ways about it
1. Use inner join
select a.*
from all_data as a
inner join selected_place_day_hours as b
on (a.place_id = b.place_id)
and ( a.day = b.day)
and ( a.hour = b.hour)
;
2. Use where in
select *
from all_data
where
place_id in (select place_id from selected_place_day_hours)
and day in (select day from selected_place_day_hours)
and hour in (select hour from selected_place_day_hours)
;
I want to get some idea on why, when, if you would choose one over the other from a functional and performance perspective ?
One thought is that in #2 above, the sub-selects are probably not performance friendly, and the code is also longer.
The two are semantically different.
The IN does a semi-join, meaning that it returns each row from all_data at most once, regardless of how many rows match in selected_place_day_hours.
The JOIN can return the same all_data row multiple times, once per matching row.
So, the first piece of advice is to use the version that is correct for what you want to accomplish.
Assuming the data in selected_place_day_hours guarantees at most one match, the remaining question is performance. The advice there is to try both queries on your data and on your system. However, JOIN is often optimized at least as well as IN, so it is usually a safe choice.
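To see the semantic difference concretely, here is a small sketch in SQLite through Python. The data is invented: selected_place_day_hours deliberately contains the same (place_id, day, hour) twice. The IN version uses a row-value comparison on all three columns (this assumes a SQLite version that supports row values) so that it matches on the same key as the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_data (place_id INTEGER, day INTEGER, hour INTEGER, metric REAL);
    CREATE TABLE selected_place_day_hours (place_id INTEGER, day INTEGER, hour INTEGER);
    INSERT INTO all_data VALUES (1, 4, 6, 10.0);
    -- The same key appears twice on purpose.
    INSERT INTO selected_place_day_hours VALUES (1, 4, 6), (1, 4, 6);
""")

# Inner join: one output row per matching pair of rows.
join_rows = conn.execute("""
    SELECT a.*
    FROM all_data AS a
    INNER JOIN selected_place_day_hours AS b
        ON a.place_id = b.place_id AND a.day = b.day AND a.hour = b.hour
""").fetchall()

# Semi-join via IN: each all_data row appears at most once.
in_rows = conn.execute("""
    SELECT *
    FROM all_data
    WHERE (place_id, day, hour) IN (
        SELECT place_id, day, hour FROM selected_place_day_hours
    )
""").fetchall()

print(len(join_rows), len(in_rows))
```

The join returns the all_data row twice and the IN returns it once, which is the difference to settle before worrying about performance.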
These days, SQL tends to ignore what you say and do its own thing.
This is why SQL is called a declarative language rather than a procedural one: you tell it what you want, not how to do it. The query optimizer will work out what you want and devise its own plan for how to get the results.
In this case, the 2 versions will probably produce an identical plan, regardless of how you write it. In any case, the optimizer will pick the plan it estimates to be the most efficient.
The reasons to prefer the join syntax over the older where syntax are:
to look cool: you don’t want anybody catching you with code that is old-fashioned
the join syntax is easy to adapt to outer joins
the join syntax allows you to separate the join part from additional filter by distinguishing between join and where
The reasons do not include whether one is faster, because the optimizer will handle that.
These are some more notes that are too long for a comment.
First, it should be shown that your two queries are different. (Maybe the second query you wrote is wrong.)
For example:
all_data
place_id day hour other_cols...
1 4 6 ....
selected_place_day_hours
place_id day hour
1 4 9
4444 4444 6
Then your 1st query will get no row in return, and your 2nd will return (1, 4, 6)
One more note: if (place_id, day, hour) is unique in selected_place_day_hours, your first query is equivalent to the following query:
SELECT *
FROM all_data
WHERE
(place_id, day, hour) IN (
SELECT place_id, day, hour
FROM selected_place_day_hours
);
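The example can be checked with a quick sketch in SQLite through Python. It uses an all_data row of (1, 4, 6), the row the answer reports as returned, and assumes the third IN was meant to compare hour against hour. The metric column is left out for brevity, and the row-value IN assumes a SQLite version that supports it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE all_data (place_id INTEGER, day INTEGER, hour INTEGER);
    CREATE TABLE selected_place_day_hours (place_id INTEGER, day INTEGER, hour INTEGER);
    INSERT INTO all_data VALUES (1, 4, 6);
    INSERT INTO selected_place_day_hours VALUES (1, 4, 9), (4444, 4444, 6);
""")

# Three independent INs: each column only has to appear somewhere in the
# other table, not together in the same row.
independent_in = conn.execute("""
    SELECT * FROM all_data
    WHERE place_id IN (SELECT place_id FROM selected_place_day_hours)
      AND day      IN (SELECT day      FROM selected_place_day_hours)
      AND hour     IN (SELECT hour     FROM selected_place_day_hours)
""").fetchall()

# Row-value IN: the three columns must match within a single row.
tuple_in = conn.execute("""
    SELECT * FROM all_data
    WHERE (place_id, day, hour) IN (
        SELECT place_id, day, hour FROM selected_place_day_hours
    )
""").fetchall()

# The inner join from the question, for comparison.
joined = conn.execute("""
    SELECT a.* FROM all_data AS a
    INNER JOIN selected_place_day_hours AS b
        ON a.place_id = b.place_id AND a.day = b.day AND a.hour = b.hour
""").fetchall()

print(independent_in, tuple_in, joined)
```

The independent INs let (1, 4, 6) through by combining place 1 and day 4 from one row with hour 6 from another, while the row-value IN and the join both return nothing.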

Access SQL Query: Comparing Date In Select Statement

I have a problem that I simply cannot seem to figure out. I have a list of employees with different travel dates and I want to display all of them in a cascading list format. The problem is that I only want to see employees once, and only the date closest to today.
For example I could have 'Smith' in there multiple times with dates before and after today, as we also keep historical records. This means I can't just do min, as it will try and display a date before today, and max is too far forward.
The code example below ALMOST works. The problem is in the select statement. I want to show the minimum date after today, but instead it gives me 0's and -1's where the dates should be. There might just be another way to do this all together, but this is the only configuration that seems to allow the other information such as Site, Position, and Comments to be displayed correctly alongside it.
SELECT A.`Last Name` AS [Last Name], Min(A.`Date In`) > Now() AS [Date In], Max(B.Site) AS Site, Max(B.Position), Max(B.Comments) AS Comments
FROM Deployments AS A
INNER JOIN Deployments AS B ON A.ID = B.ID
GROUP BY A.`FSR Name`
HAVING (((Max(A.`Actual TEP IN`))>Now()));
I did a group by Name because I only want to see each individual once. If I don't add the table to itself with a join it gives a self reference error. This is my first time posting so I hope this makes sense! All help will be greatly appreciated!
Not sure what DB you're on, but in general, you need to return MIN(date) instead of the result of the comparison "Min(Date) > Now()" - I'm guessing this is where you're seeing 0's and -1's, since that would be the result of the comparison, when you want the minimum date value itself.
Also, if you are just wanting people who have a trip date in the future, just restrict your query with a WHERE clause, do a GROUP BY, and you get rid of the self-join. Also note that the example below aligns some discrepancies in your OP like where you're selecting based on "Last Name" but grouping on "FSR Name" - these things must be consistent, whichever field you're concerned about.
Example:
SELECT A.[FSR Name] AS [FSR Name],
Min(A.[Date In]) AS [Date In],
Max(A.Site) AS Site,
Max(A.Position) AS Position,
Max(A.Comments) AS Comments
FROM Deployments AS A
WHERE A.[Date In] > Now()
GROUP BY A.[FSR Name];
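A small sketch of the difference, run in SQLite through Python rather than Access (SQLite happens to accept the bracketed identifiers). The rows, the ISO date strings, and the fixed `now` value are invented so the result is repeatable. Note that SQLite reports a true comparison as 1, whereas Access shows -1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Deployments ([FSR Name] TEXT, [Date In] TEXT);
    INSERT INTO Deployments VALUES
        ('Smith', '2020-01-01'),
        ('Smith', '2030-05-01'),
        ('Smith', '2031-01-01'),
        ('Jones', '2019-01-01'),
        ('Lee',   '2030-02-02');
""")

now = "2025-01-01"  # hypothetical stand-in for Now()

# Selecting the comparison returns a true/false flag per person, not a date.
flags = conn.execute("""
    SELECT [FSR Name], MIN([Date In]) > ? AS is_future
    FROM Deployments
    GROUP BY [FSR Name]
    ORDER BY [FSR Name]
""", (now,)).fetchall()

# Filtering first and then taking MIN returns the next upcoming date.
next_dates = conn.execute("""
    SELECT [FSR Name], MIN([Date In]) AS next_date
    FROM Deployments
    WHERE [Date In] > ?
    GROUP BY [FSR Name]
    ORDER BY [FSR Name]
""", (now,)).fetchall()

print(flags)
print(next_dates)
```

Jones drops out of the second result entirely because that person has no future date, which matches the "only people with a trip date in the future" reading.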
EDIT: If you need to make sure that Site,Position,Comments all came from the same row, you have to do something like one of these options:
If you have a Primary Key:
select * from Deployments A3 where A3.pk_value =
(select max(A2.pk_value) from Deployments A2
where A2.[Date In] =
(select Max([Date In]) from Deployments A where A.[FSR Name] = A2.[FSR Name])
and A2.[FSR Name] = A3.[FSR Name]
)
This guarantees you to get 1 row per FSR Name, even if there are multiple rows for that FSR with the same "latest" date.
Otherwise, you can leave out the secondary query dealing with the pk_value, but you run a risk of getting multiple rows for an FSR that has multiple records with the same "latest" date.
Note: when you get to queries this complex, running on a full-featured database (SQL Server, Oracle, anything but Access) allows you to use much more sophistication. For this example, "Windowing Functions" would give you the answer without as much wrangling. Not sure if you're stuck with Access for now, but consider this for the future, anyway.
Try something like this
Select A.LastName, A.DateIn, A.Site, A.Position, A.Comments
From deployments a
Where not exists (Select *
From deployments b
Where b.id <> a.id
and (abs(datediff(d, getdate(), a.datein)) > abs(datediff(d, getdate(), b.datein))
or (abs(datediff(d, getdate(), a.datein)) = abs(datediff(d, getdate(), b.datein)) and a.id > b.id)))
Instead of the funny mins and maxes that you are using to try to get the row with the datein that is closest to today, try using datediff. With this function, you can specify what type of date or time value you are looking to compare (day, month, year, minute) and then find the difference between two different datetimes. In this case, I used getdate() to find the current date and time. Then, we want the datein with the least value for datediff, the datein that is closest to today. Datediff will return positive or negative values, so I used abs to get the absolute value of the result. I did this because it doesn't matter if the date is before today or after today.
Then we are looking in the deployments table. The subquery says that we should look at all the values which are not the current value. Then, find all the rows that have a smaller datediff than the current record. Also, find all the records that have the same datediff as the current record and a smaller id. We will only include the current record if there isn't anything that fits these criteria. It is a little weird to think about, but this type of query should help you find what you are looking for a lot more easily. The only thing is that you will need to add criteria in the where clause of the subquery to determine which entries to compare. As it stands, this query will look at all of the entries in your deployments table and pull back the one row that has a datein closest to today. Since you want one row for each person, this will need a few more specifications.

How to adapt this query to use window functions

When I started tackling this problem, I thought, "This will be a great query to learn about Window Functions." I wasn't able to end up getting it to work with window functions, but I was able to get what I wanted using a join.
How would you adapt this query to use window functions:
SELECT
day,
COUNT(i.project) as num_open
FROM generate_series(0, 364) as t(day)
LEFT JOIN issues i on (day BETWEEN i.closed_days_ago AND i.created_days_ago)
GROUP BY day
ORDER BY day;
The query above takes a list of issues that have a range represented by created_days_ago and closed_days_ago, and for each of the last 365 days it counts the number of issues that had been created but not yet closed on that day.
http://sqlfiddle.com/#!15/663f6/2
The issues table looks like:
CREATE TABLE issues (
id SERIAL,
project VARCHAR(255),
created_days_ago INTEGER,
closed_days_ago INTEGER);
What I was thinking was that the partition for a given day should include all the rows in issues where day is between the created and closed days ago. Something like SELECT day, COUNT(i.project) OVER (PARTITION day BETWEEN created_days_ago AND closed_days_ago) ...
I've never used window functions before, so I might be missing something basic, but it seemed like this was just the type of query that makes window functions so awesome.
The fact that you use generate_series() to create a full range of days, including those days with no changes, and thus no rows in table issues, does not rule out the use of window functions.
In fact, this query runs 50 times faster than the query in the Q in my local test:
SELECT t.day
, COALESCE(sum(a.created) OVER (ORDER BY t.day DESC), 0)
- COALESCE(sum(b.closed) OVER (ORDER BY t.day DESC), 0) AS open_tickets
FROM generate_series(0, 364) t(day)
LEFT JOIN (SELECT created_days_ago AS day, count(*) AS created
FROM issues GROUP BY 1) a USING (day)
LEFT JOIN (SELECT closed_days_ago AS day, count(*) AS closed
FROM issues GROUP BY 1) b USING (day)
ORDER BY 1;
It is also correct, as opposed to the query in the question, which results in 17 open tickets on day 0, although all of them have been closed.
The error is due to BETWEEN in your join condition, which includes upper and lower border. This way tickets are still counted as "open" on the day they are closed.
Each row in the result reflects the number of open tickets at the end of the day.
Explain
The query combines window functions with aggregate functions.
Subquery a counts the number of created tickets per day. This results in a single row per day, making the rest easier.
Subquery b does the same for closed tickets.
Use LEFT JOINs to join to the generated list of days in subquery t.
Be wary of joining to multiple unaggregated tables! That could trigger a CROSS JOIN among the joined tables for multiple matches per row, generating incorrect results. Compare:
Two SQL LEFT JOINS produce incorrect result
Finally use two window functions to compute the running total of created versus closed tickets.
An alternative would be to use this in the outer SELECT
sum(COALESCE(a.created, 0)
- COALESCE(b.closed, 0)) OVER (ORDER BY t.day DESC) AS open_tickets
Performs the same in my tests.
Aside: I would never store "days_ago" in a table, but the absolute date / timestamp. Looks like a simplification for the purpose of this question.
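For readers without Postgres at hand, the running-total idea can be tried in SQLite through Python. This is only a sketch under a few assumptions: the SQLite build supports window functions, a recursive CTE named `days` stands in for generate_series(), the range is shortened to 6 days, and the three issues are invented. It uses the single-window variant of the sum.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE issues (id INTEGER PRIMARY KEY, project TEXT,
                         created_days_ago INTEGER, closed_days_ago INTEGER);
    INSERT INTO issues (project, created_days_ago, closed_days_ago) VALUES
        ('a', 5, 2),
        ('b', 3, 0),
        ('c', 1, 1);
""")

# Subqueries a and b are aggregated to one row per day before joining, so
# the two LEFT JOINs cannot multiply rows. The window then walks from the
# oldest day (largest days_ago) to today, accumulating created minus closed.
open_per_day = conn.execute("""
    WITH RECURSIVE days(day) AS (
        SELECT 0
        UNION ALL
        SELECT day + 1 FROM days WHERE day < 5
    )
    SELECT days.day,
           SUM(COALESCE(a.created, 0) - COALESCE(b.closed, 0))
               OVER (ORDER BY days.day DESC) AS open_tickets
    FROM days
    LEFT JOIN (SELECT created_days_ago AS day, COUNT(*) AS created
               FROM issues GROUP BY created_days_ago) AS a ON a.day = days.day
    LEFT JOIN (SELECT closed_days_ago AS day, COUNT(*) AS closed
               FROM issues GROUP BY closed_days_ago) AS b ON b.day = days.day
    ORDER BY days.day
""").fetchall()

print(open_per_day)
```

Each value is the count at the end of that day, so issue 'c', opened and closed on day 1, never shows up as open, and day 0 ends with nothing open.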

Get aggregation values for all possible distinct entries in group by

Ok, I know the title is confusing, but the idea is pretty simple. I just need to figure out how many flights were flown at five different sites during a given time period. Sometimes a site won't have any flights during the period and this is where I'm having the problem. If I use:
select count(*)
from Flight
where date between '9/9/2013' and '9/15/2013'
group by Site
order by Site
I will only get the sites that have actually flown, but I would like to have those sites where there were no flights during that period (but have flown at other times and have records in the table) still return a value of 0.
Use conditional summation. That is, move the condition from the where clause into a case expression:
select sum(case when date between '9/9/2013' and '9/15/2013' then 1 else 0 end)
from Flight
group by Site
order by Site;
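A quick way to compare the two versions, sketched in SQLite through Python with invented rows. Dates are stored as ISO strings here so that BETWEEN compares them correctly as text, and Site is added to the select list so the counts can be told apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Flight (Site TEXT, date TEXT);
    INSERT INTO Flight VALUES
        ('A', '2013-09-10'),
        ('A', '2013-08-01'),
        ('B', '2013-07-04');
""")

# WHERE removes site B's only row before grouping, so B disappears.
filtered = conn.execute("""
    SELECT Site, COUNT(*)
    FROM Flight
    WHERE date BETWEEN '2013-09-09' AND '2013-09-15'
    GROUP BY Site
    ORDER BY Site
""").fetchall()

# With the condition inside CASE, every site keeps its group and rows
# outside the period simply add 0.
conditional = conn.execute("""
    SELECT Site,
           SUM(CASE WHEN date BETWEEN '2013-09-09' AND '2013-09-15'
                    THEN 1 ELSE 0 END) AS flights
    FROM Flight
    GROUP BY Site
    ORDER BY Site
""").fetchall()

print(filtered)
print(conditional)
```

This only covers sites that have at least one row in Flight, which is the situation described in the question.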