Non-repeated values in BigQuery - sql

I am fairly new to SQL, so this may be easy for most people, but I am having trouble with joins in BigQuery. I have two tables:
TABLE A
id name purchases
1 alex 2
2 jane 7
3 peter 8
4 mario 1
5 luigi 6
TABLE B
id name visited
1 alex jan
2 jane jan
2 jane feb
3 peter jan
3 peter feb
3 peter mar
4 mario feb
5 luigi mar
I want each id/name to show its purchases value only once, on the first visit row, and 0 on every later row, like the following:
TABLE C
id name visited purchases
1 alex jan 2
2 jane jan 7
2 jane feb 0
3 peter jan 8
3 peter feb 0
3 peter mar 0
4 mario feb 1
5 luigi mar 6
However, no matter which join I use, the purchases value is repeated on every visit row for the user, like the following:
id name visited purchases
1 alex jan 2
2 jane jan 7
2 jane feb 7
3 peter jan 8
3 peter feb 8
3 peter mar 8
4 mario feb 1
5 luigi mar 6
What would be the query to have Table C from Tables A and B?
Thank you.

One method is to use row_number():
select b.*, coalesce(a.purchases, 0) purchases
from (
select *, row_number() over(partition by id order by visited) rn
from b ) b
left join a on a.id = b.id and b.rn=1
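Note that ordering by visited sorts the month names alphabetically ('feb' comes before 'jan'), so you may wish to decode visited to an ordinal depending on your ordering requirements, for example: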
.. order by case visited when 'jan' then 1 when .. end ..
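A minimal sketch of the row_number() approach above, run against the sample tables. It uses Python's built-in sqlite3 module as a stand-in for BigQuery (an assumption made only so the example is runnable; window functions need SQLite 3.25 or newer), and the CASE mapping covers just the three months that appear in the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, name TEXT, purchases INTEGER);
CREATE TABLE b (id INTEGER, name TEXT, visited TEXT);
INSERT INTO a VALUES (1,'alex',2),(2,'jane',7),(3,'peter',8),(4,'mario',1),(5,'luigi',6);
INSERT INTO b VALUES (1,'alex','jan'),(2,'jane','jan'),(2,'jane','feb'),
                     (3,'peter','jan'),(3,'peter','feb'),(3,'peter','mar'),
                     (4,'mario','feb'),(5,'luigi','mar');
""")

# Number each user's visits in calendar order, then attach purchases
# only to the first visit (rn = 1); every other row falls back to 0.
query = """
SELECT b.id, b.name, b.visited, COALESCE(a.purchases, 0) AS purchases
FROM (
    SELECT *,
           CASE visited WHEN 'jan' THEN 1 WHEN 'feb' THEN 2 WHEN 'mar' THEN 3 END AS month_no,
           ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY CASE visited WHEN 'jan' THEN 1 WHEN 'feb' THEN 2 WHEN 'mar' THEN 3 END
           ) AS rn
    FROM b
) AS b
LEFT JOIN a ON a.id = b.id AND b.rn = 1
ORDER BY b.id, b.month_no
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With the sample data this reproduces Table C: jane's purchases land on the jan row rather than the feb row because the ordering uses the month number, not the month name.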

Related

R - get a vector that tells me if a value of another vector is the first appearance or not

I have a data frame of sales with three columns: the customer code, the month in which the customer bought the item, and the year.
A customer can buy something in September and then make another purchase in December, and so appear twice. But I'm interested in counting the genuinely new customers by month and year.
My idea is to iterate over the rows, use the %in% operator to build a boolean vector that says whether each customer is new, and then count by month and year with SQL using this new vector.
But I'm wondering if there's a specific function or a better way to do that.
This is an example of the data I would like to have:
date cust month new_customer
1 14975 25 1 TRUE
2 14976 30 1 TRUE
3 14977 22 1 TRUE
4 14978 4 1 TRUE
5 14979 25 1 FALSE
6 14980 11 1 TRUE
7 14981 17 1 TRUE
8 14982 17 1 FALSE
9 14983 18 1 TRUE
10 14984 7 1 TRUE
11 14985 24 1 TRUE
12 14986 22 1 FALSE
To put it more simply: the data frame is sorted by date, and I'm interested in a vector (new_customer) that tells me whether the customer purchased something for the first time or not. For example, customer 25 bought something on the first day and then bought something again four days later, so on that second row they are not a new customer. The same can be seen with customers 17 and 22.
I created dummy data myself with id, month (in numeric format), and year:
dat <-data.frame(
id = c(1,2,3,4,5,6,7,8,1,3,4,5,1,2,2),
month = c(1,6,7,8,2,3,4,8,11,1,10,9,1,12,2),
year = c(2019,2019,2019,2019,2019,2020,2020,2020,2020,2020,2021,2021,2021,2021,2021)
)
id month year
1 1 1 2019
2 2 6 2019
3 3 7 2019
4 4 8 2019
5 5 2 2019
6 6 3 2020
7 7 4 2020
8 8 8 2020
9 1 11 2020
10 3 1 2020
11 4 10 2021
12 5 9 2021
13 1 1 2021
14 2 12 2021
15 2 2 2021
Then group by id and arrange by year and month (the order matters), and use filter() with row_number() to keep each id's first purchase.
library(dplyr)
dat %>%
  group_by(id) %>%
  arrange(year, month) %>%
  filter(row_number() == 1)
id month year
<dbl> <dbl> <dbl>
1 1 1 2019
2 5 2 2019
3 2 6 2019
4 3 7 2019
5 4 8 2019
6 6 3 2020
7 7 4 2020
8 8 8 2020
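The filter above keeps only each customer's first purchase. If what is needed is the boolean new_customer vector from the question, the same idea can be sketched by remembering which customers have already been seen. The snippet below is in Python rather than R, and the function name first_appearance_flags is hypothetical; it assumes the input is already sorted by date, as the question states.

```python
def first_appearance_flags(customers):
    """Return True for the first occurrence of each value, False for repeats."""
    seen = set()
    flags = []
    for cust in customers:
        flags.append(cust not in seen)  # new only if not seen earlier
        seen.add(cust)
    return flags


# The cust column from the question, in date order.
cust = [25, 30, 22, 4, 25, 11, 17, 17, 18, 7, 24, 22]
new_customer = first_appearance_flags(cust)
print(new_customer)
```

Customers 25, 17 and 22 are flagged False on their second appearance, matching the new_customer column in the question.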
Sample Code
You can adapt your code to follow this logic.
Create Table:
CREATE TABLE PURCHASE(Posting_Date DATE,Customer_Id VARCHAR(10),Customer_Name VARCHAR(15));
Insert Data Into Table:
Posting_Date Customer_Id Customer_Name
2018-01-01 C_01 Jack
2018-02-01 C_01 Jack
2018-03-01 C_01 Jack
2018-04-01 C_02 James
2019-04-01 C_01 Jack
2019-05-01 C_01 Jack
2019-05-01 C_03 Gill
2020-01-01 C_02 James
2020-01-01 C_04 Jones
Code
WITH Date_CTE (PostingDate,CustomerID,FirstYear)
AS
(
SELECT MIN(Posting_Date) as [Date],
Customer_Id,
YEAR(MIN(Posting_Date)) as [F_Purchase_Year]
FROM PURCHASE
GROUP BY Customer_Id
)
SELECT T.[ActualYear],(CASE WHEN T.[Customer Status] = 'new' THEN COUNT(T.[Customer Status]) END) AS [New Customer]
FROM (
SELECT DISTINCT YEAR(T2.Posting_Date) AS [ActualYear],
T2.Customer_Id,
(CASE WHEN T1.FirstYear = YEAR(T2.Posting_Date) THEN 'new' ELSE 'old' END) AS [Customer Status]
FROM Date_CTE AS T1
left outer join PURCHASE AS T2 ON T1.CustomerID = T2.Customer_Id
) AS T
GROUP BY T.[ActualYear],T.[Customer Status]
Final Result
ActualYear New Customer
2018 2
2019 1
2020 1
2019 NULL
2020 NULL
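The NULL rows in the result come from grouping by the 'old' status as well. A shorter sketch of the same idea is to take each customer's first purchase year and count customers per first year. The example below runs in SQLite through Python's sqlite3 module (an assumption for runnability; the answer above is written in T-SQL), so the year is extracted with substr() from ISO-formatted text dates instead of YEAR().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchase (posting_date TEXT, customer_id TEXT, customer_name TEXT);
INSERT INTO purchase VALUES
  ('2018-01-01','C_01','Jack'),
  ('2018-02-01','C_01','Jack'),
  ('2018-03-01','C_01','Jack'),
  ('2018-04-01','C_02','James'),
  ('2019-04-01','C_01','Jack'),
  ('2019-05-01','C_01','Jack'),
  ('2019-05-01','C_03','Gill'),
  ('2020-01-01','C_02','James'),
  ('2020-01-01','C_04','Jones');
""")

# Inner query: one row per customer with the year of their first purchase.
# Outer query: how many customers had their first purchase in each year.
query = """
SELECT first_year, COUNT(*) AS new_customers
FROM (
    SELECT customer_id,
           CAST(substr(MIN(posting_date), 1, 4) AS INTEGER) AS first_year
    FROM purchase
    GROUP BY customer_id
) AS firsts
GROUP BY first_year
ORDER BY first_year
"""
rows = conn.execute(query).fetchall()
print(rows)
```

This gives one row per year with the same counts as the non-NULL rows above. A year with no new customers would simply be absent from the output.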

Compare data for a specific column grouping and update based on criteria

I have a table with the following structure:
Employee Project Task Accomplishment Score Year
John A 1 5 60 2016
John A 1 6 40 2018
John A 2 3 30 2016
Simon B 2 0 30 2017
Simon B 2 4 30 2019
David C 1 3 20 2015
David C 1 2 40 2016
David C 3 0 25 2017
David C 3 5 35 2017
I want to create a view with Oracle SQL from the above table, which looks as follows:
Employee Project Task Accomplishment Score Year UpdateScore Comment
John A 1 5 60 2016 60
John A 1 6 40 2018 100 (=60+40)
John A 2 3 30 2016 30
Simon B 2 0 30 2017 30
Simon B 2 4 40 2019 40 (no update because Accomplishment was 0)
David C 1 3 20 2015 20
David C 1 2 40 2016 60 (=20+40)
David C 3 0 25 2017 25
David C 3 5 35 2017 35 (no update because Accomplishment was 0)
The Grouping is: Employee-Project-Task.
The Rule of the UpdateScore column:
If for a specific Employee-Project-Task group Accomplishment column value is greater than 0 for the previous year, add the previous year's score to the latest year for the same Employee-Project-Task group.
For example: John-A-1 is a group which is different from John-A-2. So as we can see for John-A-1 the Accomplishment is 5 (which is greater than 0) in 2016, so we add the Score from 2016 with the score of 2018 for the John-A-1 and the updated score becomes 100.
For Simon-B-2, the accomplishment was 0, so there will be no update for 2019 for Simon-B-2.
Note: I don't need the Comment field, it is there just for more clarification.
Use analytic functions to determine if there was a score for the previous year, and if so, add it to the UpdatedScore.
select Employee, Project, Task, Accomplishment, Score, Year,
case when lag(Year) over (partition by Employee, Project, Task order by Year) = Year - 1
then lag(Score) over (partition by Employee, Project, Task order by Year)
else 0
end + Score as UpdatedScore
from EmployeeScore;
This is a bit strange -- you are counting the accomplishment of 0 in one year but not the next. Okay.
Use analytic functions:
select t.*,
(case when lag(accomplishment) over (partition by Employee, Project, Task order by year) > 0
then lag(score) over (partition by Employee, Project, Task order by year)
else 0
end) + score as update_score
from t;
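As a rough check of the lag() approach partitioned by Employee, Project and Task, the sketch below loads the question's source table into SQLite via Python's sqlite3 module (a stand-in for Oracle, assumed only for runnability; it needs SQLite 3.25 or newer). The two David-C-3 rows share the same year, so the window order is tied there; using rowid as a tiebreaker is an assumption of this sketch, not something the question specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (employee TEXT, project TEXT, task INTEGER,
                accomplishment INTEGER, score INTEGER, year INTEGER);
INSERT INTO t VALUES
  ('John','A',1,5,60,2016),
  ('John','A',1,6,40,2018),
  ('John','A',2,3,30,2016),
  ('Simon','B',2,0,30,2017),
  ('Simon','B',2,4,30,2019),
  ('David','C',1,3,20,2015),
  ('David','C',1,2,40,2016),
  ('David','C',3,0,25,2017),
  ('David','C',3,5,35,2017);
""")

# Add the previous row's score only when the previous row's accomplishment
# was greater than 0. For the first row of a group, lag() returns NULL,
# the comparison is not true, and the ELSE branch contributes 0.
query = """
SELECT employee, project, task, year, score,
       CASE WHEN LAG(accomplishment) OVER (
                     PARTITION BY employee, project, task ORDER BY year, rowid) > 0
            THEN LAG(score) OVER (
                     PARTITION BY employee, project, task ORDER BY year, rowid)
            ELSE 0
       END + score AS update_score
FROM t
ORDER BY rowid
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With these inputs, John-A-1 in 2018 becomes 100 and David-C-1 in 2016 becomes 60, while the rows that follow an accomplishment of 0 keep their own score.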

Joining From Differently Formatted Tables

Table 1:
ID Year Month
-----------------
1 2018 1
2 2018 1
3 2018 1
1 2018 2
2 2018 2
3 2018 2
Table 2:
ID Year Jan Feb Mar
------------------------
1 2018 100 200 300
2 2018 200 400 300
3 2018 200 500 700
How can I join these two tables even though they are laid out differently?
I was exploring a case join but that doesn't seem to be exactly what I need.
I'd like my output to be:
ID Year Month Data
1 2018 1 100
2 2018 1 200
3 2018 1 200
1 2018 2 200
2 2018 2 400
3 2018 2 500
1 2018 3 300
2 2018 3 300
3 2018 3 700
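So, first we get TableB into the right format:
SELECT B.ID, B.Year, B.Month, B.MonthValue
INTO TableB_New
FROM TableB T
UNPIVOT
(
MonthValue FOR Month IN (Jan, Feb, Mar)
) AS B
Then join TableB_New to the first table on ID, Year, and Month, mapping the month names (Jan, Feb, Mar) to the month numbers used there. Good luck!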
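For engines without UNPIVOT, the same reshaping can be sketched with UNION ALL, which also lets the month names be mapped straight to month numbers. The example below uses SQLite through Python's sqlite3 module; the table names table1 and table2 and the output column name data are assumptions chosen to mirror the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INTEGER, year INTEGER, month INTEGER);
CREATE TABLE table2 (id INTEGER, year INTEGER, jan INTEGER, feb INTEGER, mar INTEGER);
INSERT INTO table1 VALUES (1,2018,1),(2,2018,1),(3,2018,1),
                          (1,2018,2),(2,2018,2),(3,2018,2);
INSERT INTO table2 VALUES (1,2018,100,200,300),(2,2018,200,400,300),(3,2018,200,500,700);
""")

# Turn each month column into its own row with a numeric month,
# then join on the full key (id, year, month).
query = """
WITH unpivoted AS (
    SELECT id, year, 1 AS month, jan AS data FROM table2
    UNION ALL
    SELECT id, year, 2, feb FROM table2
    UNION ALL
    SELECT id, year, 3, mar FROM table2
)
SELECT t1.id, t1.year, t1.month, u.data
FROM table1 AS t1
JOIN unpivoted AS u
  ON u.id = t1.id AND u.year = t1.year AND u.month = t1.month
ORDER BY t1.month, t1.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Only months present in the first table come back from the join, so the month 3 rows in the desired output appear only if Table 1 contains rows for month 3.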

return the last row that meets a condition in sql

I have two tables:
Meter
ID SerialNumber
=======================
1 ABC1
2 ABC2
3 ABC3
4 ABC4
5 ABC5
6 ABC6
RegisterLevelInformation
ID MeterID ReadValue Consumption PreviousReadDate ReadType
============================================================================
1 1 250 250 1 jan 2015 EST
2 1 550 300 1 feb 2015 ACT
3 1 1000 450 1 apr 2015 EST
4 2 350 350 1 jan 2015 EST
5 2 850 500 1 feb 2015 ACT
6 2 1000 150 1 apr 2015 ACT
7 3 1500 1500 1 jan 2015 EST
8 3 2500 1000 1 mar 2015 EST
9 3 5000 2500 4 apr 2015 EST
10 4 250 250 1 jan 2015 EST
11 4 550 300 1 feb 2015 ACT
12 4 1000 450 1 apr 2015 EST
13 5 350 350 1 jan 2015 ACT
14 5 850 500 1 feb 2015 ACT
15 5 1000 150 1 apr 2015 ACT
16 6 1500 1500 1 jan 2015 EST
17 6 2500 1000 1 mar 2015 EST
18 6 5000 2500 4 apr 2015 EST
I am trying to group by meter serial and return the last actual read date for each of the meters but I am unsure as to how to accomplish this. Here is the sql I have thus far:
select a.SerialNumber, ReadTypeCode, MAX(PreviousReadDate) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
where ReadType = 'ACT'
group by a.SerialNumber,b.ReadTypeCode, PreviousReadDate
order by a.SerialNumber
I can't seem to get the MAX function to take effect in returning only the latest actual reading row and it returns all dates and the same meter serial is displayed several times.
If I use the following sql:
select a.SerialNumber, count(*) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
group by a.SerialNumber
order by a.SerialNumber
then each serial is shown only once. Any help would be greatly appreciated.
As @PaulGriffin said in his comment, you need to remove the PreviousReadDate column from your GROUP BY clause.
Why are you experiencing this behaviour?
You are grouping by (SerialNumber, ReadTypeCode, PreviousReadDate), so each distinct combination of those three values produces its own output row. Because PreviousReadDate is itself one of the grouping columns, every group contains a single date, and MAX(PreviousReadDate) simply returns that date, exactly as if MAX() were not there.
What you wanted to achieve
You want the MAX value of PreviousReadDate for every pair of (SerialNumber, ReadTypeCode), so these are the only columns your GROUP BY clause should include.
select a.SerialNumber, ReadTypeCode, MAX(PreviousReadDate) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
where ReadType = 'ACT'
group by a.SerialNumber,b.ReadTypeCode
order by a.SerialNumber
This is the correct SQL query for what you want.
Difference example
ID MeterID ReadValue Consumption PreviousReadDate ReadType
============================================================================
1 1 250 250 1 jan 2015 EST
2 1 550 300 1 feb 2015 ACT
3 1 1000 450 1 apr 2015 EST
Here, if you run the query grouped by all three columns, you get this result:
SerialNumber | ReadTypeCode | PreviousReadDate
ABC1 | EST | 1 jan 2015 -- which is MAX of 1 value (1 jan 2015)
ABC1 | ACT | 1 feb 2015
ABC1 | EST | 1 apr 2015
But when you group only by SerialNumber and ReadTypeCode, the result (for the sample data that I posted) is:
SerialNumber | ReadTypeCode | PreviousReadDate
ABC1 | EST | 1 apr 2015 -- which is MAX of 2 values (1 jan 2015, 1 apr 2015)
ABC1 | ACT | 1 feb 2015 -- which is MAX of 1 value (because ReadTypeCode is different from the row above)
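The effect of the extra grouping column can be sketched on part of the sample data. The example below uses SQLite through Python's sqlite3 module and makes a few assumptions: dates are stored as ISO text so that MAX() orders them correctly, the join uses Meter.ID because that is the key column in the sample table, and a plain inner join is used since the WHERE filter on ReadType discards unmatched meters anyway.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Meter (ID INTEGER PRIMARY KEY, SerialNumber TEXT);
CREATE TABLE RegisterLevelInformation (
    ID INTEGER PRIMARY KEY, MeterID INTEGER, ReadValue INTEGER,
    Consumption INTEGER, PreviousReadDate TEXT, ReadType TEXT);
INSERT INTO Meter VALUES (1,'ABC1'),(2,'ABC2'),(3,'ABC3');
INSERT INTO RegisterLevelInformation VALUES
  (1,1,250,250,'2015-01-01','EST'),
  (2,1,550,300,'2015-02-01','ACT'),
  (3,1,1000,450,'2015-04-01','EST'),
  (4,2,350,350,'2015-01-01','EST'),
  (5,2,850,500,'2015-02-01','ACT'),
  (6,2,1000,150,'2015-04-01','ACT'),
  (7,3,1500,1500,'2015-01-01','EST'),
  (8,3,2500,1000,'2015-03-01','EST'),
  (9,3,5000,2500,'2015-04-04','EST');
""")

# Same query twice; only the GROUP BY column list changes.
base = """
SELECT m.SerialNumber, r.ReadType, MAX(r.PreviousReadDate)
FROM Meter AS m
JOIN RegisterLevelInformation AS r ON r.MeterID = m.ID
WHERE r.ReadType = 'ACT'
GROUP BY {cols}
ORDER BY m.SerialNumber, MAX(r.PreviousReadDate)
"""
# Grouping by the date as well: one row per distinct date, MAX() has no effect.
too_fine = conn.execute(
    base.format(cols="m.SerialNumber, r.ReadType, r.PreviousReadDate")).fetchall()
# Grouping by serial and read type only: one row per meter with its latest ACT date.
per_meter = conn.execute(
    base.format(cols="m.SerialNumber, r.ReadType")).fetchall()

print(too_fine)
print(per_meter)
```

ABC2 appears twice in the first result and once in the second, and ABC3 is absent from both because it has no ACT reads in the sample.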
Explanation of your second query
In this query - you are right indeed - each serial is shown only once.
select a.SerialNumber, count(*) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
group by a.SerialNumber
order by a.SerialNumber
But this query would give you unexpected results if you added more columns to the GROUP BY (which is what you did in your first query; try it yourself).
You need to remove PreviousReadDate from your Group By clause.
This is what your query should look like:
select a.SerialNumber, ReadTypeCode, MAX(PreviousReadDate) from Meter as a
left join RegisterLevelInformation as b on a.MeterID = b.MeterID
where ReadType = 'ACT'
group by a.SerialNumber,b.ReadTypeCode
order by a.SerialNumber
To understand how the group by clause works when you mention multiple columns, follow this link: Using group by on multiple columns
You will understand what was wrong with your query and why it returns all dates and the same meter serial is displayed several times.
Good luck!
Kudos! :)

SQL : How to find number of occurrences without using HAVING or COUNT?

This is a trivial example, but I am trying to understand how to think creatively using SQL.
For example, I have the following tables below, and I want to query the names of folks who have three or more questions. How can I do this without using HAVING or COUNT? I wonder if this is possible using JOINS or something similar?
FOLKS
folkID name
---------- --------------
01 Bill
02 Joe
03 Amy
04 Mike
05 Chris
06 Elizabeth
07 James
08 Ashley
QUESTION
folkID questionRating questionDate
---------- ---------- ----------
01 2 2011-01-22
01 4 2011-01-27
02 4
03 2 2011-01-20
03 4 2011-01-12
03 2 2011-01-30
04 3 2011-01-09
05 3 2011-01-27
05 2 2011-01-22
05 4
06 3 2011-01-15
06 5 2011-01-19
07 5 2011-01-20
08 3 2011-01-02
Using SUM or CASE seems to be cheating to me!
I'm not sure if it's possible in your current formulation, but if you add a primary key to the question table (questionid) then the following seems to work:
SELECT DISTINCT Folks.folkid, Folks.name
FROM ((Folks
INNER JOIN Question AS Question_1 ON Folks.folkid = Question_1.folkid)
INNER JOIN Question AS Question_2 ON Folks.folkid = Question_2.folkid)
INNER JOIN Question AS Question_3 ON Folks.folkid = Question_3.folkid
WHERE (((Question_1.questionid) <> [Question_2].[questionid] And
(Question_1.questionid) <> [Question_3].[questionid]) AND
(Question_2.questionid) <> [Question_3].[questionid]);
Sorry, this is in MS Access SQL, but it should translate to any flavour of SQL.
Returns:
folkid name
3 Amy
5 Chris
Update: Just to explain why this works. Each join returns all the question ids asked by that person. The where clause then keeps only the rows in which all three question ids are different. If fewer than three questions were asked, there are no such rows.
For example, Bill:
folkid name Question_3.questionid Question_1.questionid Question_2.questionid
1 Bill 1 1 1
1 Bill 1 1 2
1 Bill 1 2 1
1 Bill 1 2 2
1 Bill 2 1 1
1 Bill 2 1 2
1 Bill 2 2 1
1 Bill 2 2 2
There are no rows where all the ids are different.
However, for Amy:
folkid name Question_3.questionid Question_1.questionid Question_2.questionid
3 Amy 4 4 5
3 Amy 4 4 4
3 Amy 4 4 6
3 Amy 4 5 4
3 Amy 4 5 5
3 Amy 4 5 6
3 Amy 4 6 4
3 Amy 4 6 5
3 Amy 4 6 6
3 Amy 5 4 4
3 Amy 5 4 5
3 Amy 5 4 6
3 Amy 5 5 4
3 Amy 5 5 5
3 Amy 5 5 6
3 Amy 5 6 4
3 Amy 5 6 5
3 Amy 5 6 6
3 Amy 6 4 4
3 Amy 6 4 5
3 Amy 6 4 6
3 Amy 6 5 4
3 Amy 6 5 5
3 Amy 6 5 6
3 Amy 6 6 4
3 Amy 6 6 5
3 Amy 6 6 6
Several rows have three different ids (for example 4, 5, 6), and hence Amy is returned by the above query.
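The self-join idea can be tried out on the sample data as follows. This is a sketch in SQLite via Python's sqlite3 module, not the Access query itself: it assumes an auto-assigned questionid primary key (as the answer suggests adding), and it uses a strict ordering q1 < q2 < q3 in place of the three <> comparisons, a variant that finds the same people while avoiding permutations of the same three questions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE folks (folkid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE question (questionid INTEGER PRIMARY KEY, folkid INTEGER,
                       rating INTEGER, qdate TEXT);
INSERT INTO folks VALUES (1,'Bill'),(2,'Joe'),(3,'Amy'),(4,'Mike'),
                         (5,'Chris'),(6,'Elizabeth'),(7,'James'),(8,'Ashley');
INSERT INTO question (folkid, rating, qdate) VALUES
  (1,2,'2011-01-22'),(1,4,'2011-01-27'),(2,4,NULL),
  (3,2,'2011-01-20'),(3,4,'2011-01-12'),(3,2,'2011-01-30'),
  (4,3,'2011-01-09'),
  (5,3,'2011-01-27'),(5,2,'2011-01-22'),(5,4,NULL),
  (6,3,'2011-01-15'),(6,5,'2011-01-19'),
  (7,5,'2011-01-20'),(8,3,'2011-01-02');
""")

# A person survives all three joins only if they own at least
# three distinct questions (q1 < q2 < q3 by questionid).
query = """
SELECT DISTINCT f.folkid, f.name
FROM folks AS f
JOIN question AS q1 ON q1.folkid = f.folkid
JOIN question AS q2 ON q2.folkid = f.folkid AND q2.questionid > q1.questionid
JOIN question AS q3 ON q3.folkid = f.folkid AND q3.questionid > q2.questionid
ORDER BY f.folkid
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Neither HAVING nor COUNT appears in the query; the threshold of three is encoded in the number of joins, so a different threshold would need a different number of self-joins.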
You can try SUM in place of COUNT:
SELECT SUM(CASE WHEN field_name >= 3 THEN field_name ELSE 0 END)
FROM table_name
SELECT f.*
FROM (
SELECT DISTINCT
COUNT(*) OVER (PARTITION BY q.folkID) AS [Count] --count questions for folks
,q.folkID
FROM QUESTION AS q
) AS p
INNER JOIN FOLKS as f ON f.folkID = p.folkID
WHERE p.[Count] >= 3