I have been stuck on this problem for a while now and can't resolve it, so I would greatly appreciate some guidance.
I am comparing records in a persons table to see whether they are possibly the same person. To do this I am using a WITH clause to take the values I need and then looking for matches:
CREATE TABLE persons (
serialno VARCHAR(20) NOT NULL,
given VARCHAR(30) NOT NULL,
family VARCHAR(30) NOT NULL,
dob DATE NOT NULL,
gender VARCHAR2(20 BYTE),
address VARCHAR2(64 BYTE)
);
INSERT ALL
INTO persons ( serialno, given, family, dob, gender, address ) VALUES ( '001', 'Mick', 'Dundon', TO_DATE('01/01/1970', 'DD/MM/YYYY'), 'Male', 'Main St' )
INTO persons ( serialno, given, family, dob, gender, address ) VALUES ( '002', 'Mick', 'Dundon', TO_DATE('01/01/1970', 'DD/MM/YYYY'), 'Male', 'Montague St' )
INTO persons ( serialno, given, family, dob, gender, address ) VALUES ( '003', 'Dave', 'Doyle', TO_DATE('13/10/1981', 'DD/MM/YYYY'), 'Male', 'Rathmines' )
INTO persons ( serialno, given, family, dob, gender, address ) VALUES ( '004', 'Jim', 'Morrison', TO_DATE('15/08/1956', 'DD/MM/YYYY'), 'Male', 'Newtown' )
INTO persons ( serialno, given, family, dob, gender, address ) VALUES ( '005', 'Sam', 'Wise', TO_DATE('12/12/1992', 'DD/MM/YYYY'), 'Male', 'High St' )
SELECT 1 FROM dual;
with rec as
(select serialno,given,family,dob,gender,address
from persons)
select *
from rec r1
join rec r2
on r1.given = r2.given
and r1.family = r2.family
and r1.gender = r2.gender
and r1.serialno <> r2.serialno
The code works fine, except that I end up with duplicates: each matched pair appears twice, because the R1 record appears further down in the output as R2, and vice versa.
Is there a simple way I can avoid this kind of duplication?
You can get all the duplicates without a self-join by using the analytic COUNT function:
SELECT serialno, given, family, dob, gender, address
FROM (
SELECT serialno, given, family, dob, gender, address,
COUNT(*) OVER (PARTITION BY given, family, gender) AS num_matches
FROM persons
)
WHERE num_matches > 1;
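To try the analytic COUNT approach without an Oracle instance, here is a minimal sketch that runs the same query shape against an in-memory SQLite database. SQLite is only a stand-in (an assumption, it needs version 3.25 or later for window functions), and the dates are stored as ISO text rather than Oracle DATE values.

```python
import sqlite3

# In-memory SQLite stand-in for the Oracle table (window functions need SQLite 3.25+).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons (serialno TEXT, given TEXT, family TEXT, "
    "dob TEXT, gender TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("001", "Mick", "Dundon", "1970-01-01", "Male", "Main St"),
        ("002", "Mick", "Dundon", "1970-01-01", "Male", "Montague St"),
        ("003", "Dave", "Doyle", "1981-10-13", "Male", "Rathmines"),
        ("004", "Jim", "Morrison", "1956-08-15", "Male", "Newtown"),
        ("005", "Sam", "Wise", "1992-12-12", "Male", "High St"),
    ],
)

# Each row carries the size of its given/family/gender group; keep groups larger than 1.
duplicates = conn.execute(
    """
    SELECT serialno, given, family
    FROM (
        SELECT serialno, given, family,
               COUNT(*) OVER (PARTITION BY given, family, gender) AS num_matches
        FROM persons
    )
    WHERE num_matches > 1
    ORDER BY serialno
    """
).fetchall()

print(duplicates)  # [('001', 'Mick', 'Dundon'), ('002', 'Mick', 'Dundon')]
```

Each member of a matching group is listed once, so the mirrored-pair duplication never arises.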
If you also want to compare each row's values to the row with the same given/family/gender combination and the minimum serial number then, again, you can avoid a self-join by using analytic functions:
SELECT serialno, given, family, dob, gender, address,
min_serialno, min_dob, min_address
FROM (
SELECT serialno,
given,
family,
dob,
gender,
address,
MIN(serialno) OVER (PARTITION BY given, family, gender) AS min_serialno,
MIN(dob) KEEP (DENSE_RANK FIRST ORDER BY serialno)
OVER (PARTITION BY given, family, gender) AS min_dob,
MIN(address) KEEP (DENSE_RANK FIRST ORDER BY serialno)
OVER (PARTITION BY given, family, gender) AS min_address
FROM persons
)
WHERE serialno > min_serialno;
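KEEP (DENSE_RANK FIRST ...) is Oracle-specific. As a rough, runnable sketch of the same idea, the following uses FIRST_VALUE ordered by serialno in an in-memory SQLite database (an assumed stand-in, SQLite 3.25+). The two forms agree only when serialno is unique within each partition, which is assumed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons (serialno TEXT, given TEXT, family TEXT, "
    "dob TEXT, gender TEXT, address TEXT)"
)
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("001", "Mick", "Dundon", "1970-01-01", "Male", "Main St"),
        ("002", "Mick", "Dundon", "1970-01-01", "Male", "Montague St"),
        ("003", "Dave", "Doyle", "1981-10-13", "Male", "Rathmines"),
    ],
)

# FIRST_VALUE(...) ORDER BY serialno picks the address of the lowest serialno
# in the group, standing in for MIN(address) KEEP (DENSE_RANK FIRST ORDER BY serialno).
compared = conn.execute(
    """
    SELECT serialno, address, min_serialno, min_address
    FROM (
        SELECT serialno, address,
               MIN(serialno) OVER (PARTITION BY given, family, gender) AS min_serialno,
               FIRST_VALUE(address) OVER (
                   PARTITION BY given, family, gender ORDER BY serialno
               ) AS min_address
        FROM persons
    )
    WHERE serialno > min_serialno
    """
).fetchall()

print(compared)  # [('002', 'Montague St', '001', 'Main St')]
```

Only record 002 is returned, paired with the values of record 001, so each later record is compared against the group's first record exactly once.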
If, in Oracle, you want to get all possible combinations then you can avoid a self-join by using a hierarchical query:
SELECT serialno, given, family, dob, gender, address,
PRIOR serialno AS p_serialno,
PRIOR dob AS p_dob,
PRIOR address AS p_address
FROM persons
WHERE LEVEL = 2
CONNECT BY
PRIOR gender = gender
AND PRIOR given = given
AND PRIOR family = family
AND PRIOR serialno < serialno
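The PRIOR serialno < serialno condition is what keeps each pair from showing up twice. The same ordering trick also works on the original self-join: comparing with < instead of <> keeps only one orientation of every pair. A small sketch, using an in-memory SQLite database as an assumed stand-in for Oracle:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE persons (serialno TEXT, given TEXT, family TEXT, gender TEXT)"
)
conn.executemany(
    "INSERT INTO persons VALUES (?, ?, ?, ?)",
    [
        ("001", "Mick", "Dundon", "Male"),
        ("002", "Mick", "Dundon", "Male"),
        ("003", "Dave", "Doyle", "Male"),
    ],
)

# The comparison operator is substituted into the otherwise unchanged self-join.
JOIN_SQL = """
    WITH rec AS (SELECT serialno, given, family, gender FROM persons)
    SELECT r1.serialno, r2.serialno
    FROM rec r1
    JOIN rec r2
      ON r1.given = r2.given
     AND r1.family = r2.family
     AND r1.gender = r2.gender
     AND r1.serialno {op} r2.serialno
    ORDER BY r1.serialno, r2.serialno
"""

mirrored = conn.execute(JOIN_SQL.format(op="<>")).fetchall()  # both orientations
once = conn.execute(JOIN_SQL.format(op="<")).fetchall()       # one orientation

print(mirrored)  # [('001', '002'), ('002', '001')]
print(once)      # [('001', '002')]
```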
Here's the initial table's structure :
yearquarter,user_id,gender,generation,country,group_id
2019-03,zfuzhfuzh,M,Y,FR,Group_1
2019-04,zfuzhfuzh,M,Y,FR,Group_1
2020-04,zfuzhfuzh,M,Y,FR,Group_1
2019-03,ggezegz,F,Y,FR,Group_2
2019-04,ggezegz,F,Y,FR,Group_2
2020-04,ggezegz,F,X,FR,Group_2
....
I want to know the cumulative number of user_id values, quarter after quarter, grouped by gender, generation and country. Expected result: for a given combination of gender, generation and country, I need the cumulative number of users quarter after quarter.
I started with this :
SELECT yearquarter, gender, generation, country, array_agg(distinct user_id IGNORE NULLS) as users FROM mytable
WHERE group_id= "mygroup"
GROUP BY 1,2,3,4
But I don't know how to go from this to the result I'm looking for...
You can use aggregation to count the number of users per gender, generation, country and period, and then compute a window sum over the periods:
select
gender,
generation,
country,
yearquarter,
sum(count(distinct user_id)) over(partition by gender, generation, country order by yearquarter) cnt
from mytable
where group_id = 'mygroup'
group by gender, generation, country, yearquarter
order by gender, generation, country, yearquarter
I am unsure whether BigQuery supports distinct in window functions. If it doesn't, we can use a subquery:
select
gender,
generation,
country,
yearquarter,
sum(count(*)) over(partition by gender, generation, country order by yearquarter) cnt
from (
select distinct gender, generation, country, yearquarter, user_id
from mytable
where group_id = 'mygroup'
) t
group by gender, generation, country, yearquarter
order by gender, generation, country, yearquarter
If you want each user to be counted only once, for their first appearance period:
select
gender,
generation,
country,
yearquarter,
sum(count(*)) over(partition by gender, generation, country order by yearquarter) cnt
from (
select gender, generation, country, user_id, min(yearquarter) yearquarter
from mytable
where group_id = 'mygroup'
group by gender, generation, country, user_id
) t
group by gender, generation, country, yearquarter
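The difference between summing per-quarter distinct counts and counting each user only at first appearance is easy to miss, so here is a sketch that runs both on the six sample rows from the question. It uses an in-memory SQLite database as an assumed stand-in for BigQuery, omits the group_id filter, and splits the aggregate and the window sum into two steps for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (yearquarter TEXT, user_id TEXT, gender TEXT, "
    "generation TEXT, country TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?, ?)",
    [
        ("2019-03", "zfuzhfuzh", "M", "Y", "FR"),
        ("2019-04", "zfuzhfuzh", "M", "Y", "FR"),
        ("2020-04", "zfuzhfuzh", "M", "Y", "FR"),
        ("2019-03", "ggezegz", "F", "Y", "FR"),
        ("2019-04", "ggezegz", "F", "Y", "FR"),
        ("2020-04", "ggezegz", "F", "X", "FR"),
    ],
)


def running_total(per_period_sql):
    """Window-sum the per-period count `n` within each gender/generation/country."""
    return conn.execute(
        f"""
        SELECT gender, generation, country, yearquarter,
               SUM(n) OVER (PARTITION BY gender, generation, country
                            ORDER BY yearquarter) AS cnt
        FROM ({per_period_sql})
        ORDER BY gender, generation, country, yearquarter
        """
    ).fetchall()


# Variant 1: distinct users per quarter, then summed (a returning user counts again).
per_quarter = running_total(
    """
    SELECT gender, generation, country, yearquarter,
           COUNT(DISTINCT user_id) AS n
    FROM mytable
    GROUP BY gender, generation, country, yearquarter
    """
)

# Variant 2: each user counted once, in the quarter of first appearance.
first_seen = running_total(
    """
    SELECT gender, generation, country, yearquarter, COUNT(*) AS n
    FROM (
        SELECT gender, generation, country, user_id,
               MIN(yearquarter) AS yearquarter
        FROM mytable
        GROUP BY gender, generation, country, user_id
    )
    GROUP BY gender, generation, country, yearquarter
    """
)

print(per_quarter[-1])  # ('M', 'Y', 'FR', '2020-04', 3)
print(first_seen)
```

With this data the first variant reaches 3 for the M/Y/FR combination even though it contains a single user, while the second variant reports 1. The second variant only emits rows for quarters in which a new user appears.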
Below is for BigQuery Standard SQL - built purely on top of your initial query with ARRAY_AGG replaced with STRING_AGG
#standardSQL
SELECT yearquarter, gender, generation, country,
(SELECT COUNT(DISTINCT id) FROM UNNEST(SPLIT(cumulative_users)) AS id) AS cumulative_number_of_users
FROM (
SELECT *,
STRING_AGG(users) OVER(PARTITION BY gender, generation, country ORDER BY yearquarter) AS cumulative_users
FROM (
SELECT
yearquarter, gender, generation, country,
STRING_AGG(DISTINCT user_id) AS users
FROM `project.dataset.table`
WHERE group_id= "mygroup"
GROUP BY yearquarter, gender, generation, country
)
)
-- ORDER BY yearquarter, gender, generation, country
Helping a customer out. I'm trying to copy one nested BigQuery table into another nested table and am running into the following error: "Syntax error: Expected ")" or "," but got ".""
Query:
INSERT INTO `<GCP_PROJECT_NAME>.Test_Tables.Nested_Person_Table2` (id,
first_name,
last_name,
dob,
address.status,
address.address,
address.city,
address.state,
address.zip,
address.numberOfYears)
SELECT
id,
first_name,
last_name,
dob,
address.status,
address.address,
address.city,
address.state,
address.zip,
address.numberOfYears
FROM
`<GCP_PROJECT_NAME>.Test_Tables.Nested_Person_Table`
Answer below. Hope this helps someone else out too!
INSERT INTO
`<GCP_PROJECT_NAME>.Test_Tables.Nested_Person_Table2`
(id,
first_name,
last_name,
dob,
addresses)
SELECT
id,
first_name,
last_name,
dob,
ARRAY_AGG(STRUCT(a1.status,
a1.address,
a1.city,
a1.state,
a1.zip,
a1.numberOfYears)) AS addresses
FROM
`<GCP_PROJECT_NAME>.Test_Tables.Nested_Person_Table`,
UNNEST(addresses) AS a1
GROUP BY
id,
first_name,
last_name,
dob
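As a plain-Python illustration of what the query above does (not BigQuery code), the sketch below flattens a repeated `addresses` field and then regroups it, mirroring UNNEST followed by ARRAY_AGG(STRUCT(...)). The sample people and address fields are hypothetical. It also shows a side effect worth knowing: a comma join with UNNEST behaves like a cross join, so a person whose array is empty produces no rows and is dropped.

```python
from collections import defaultdict

# Hypothetical rows shaped like the nested source table.
source = [
    {
        "id": 1, "first_name": "Ann", "last_name": "Lee", "dob": "1980-01-01",
        "addresses": [
            {"status": "current", "city": "Austin"},
            {"status": "previous", "city": "Dallas"},
        ],
    },
    {
        "id": 2, "first_name": "Bob", "last_name": "Ray", "dob": "1975-05-05",
        "addresses": [],  # empty array: yields no rows after the cross join
    },
]

# UNNEST(addresses) AS a1: one flat row per (person, address) pair.
flat = [
    (p["id"], p["first_name"], p["last_name"], p["dob"], a)
    for p in source
    for a in p["addresses"]
]

# GROUP BY id, first_name, last_name, dob with ARRAY_AGG(STRUCT(...)).
grouped = defaultdict(list)
for pid, first, last, dob, addr in flat:
    grouped[(pid, first, last, dob)].append(addr)

target = [
    {"id": pid, "first_name": first, "last_name": last, "dob": dob,
     "addresses": addrs}
    for (pid, first, last, dob), addrs in grouped.items()
]

print(len(flat), len(target))  # 2 1
```

If people without addresses must be copied too, a LEFT JOIN on the UNNEST (or selecting the array column directly) is the usual way to keep them.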
I have a time query that returns the following columns:
ID, StaffName, PName, Description, startd1, endd1, startt1, endt1, startd2, endd2, startt2, endt2, startd3, endd3, startt3, endt3, startd4, endd4, startt4, endt4, startd5, endd5, startt5, endt5
I need to split the row so it will show
ID, StaffName, PName, Description, startd1, endd1, startt1, endt1
ID, StaffName, PName, Description, startd2, endd2, startt2, endt2
ID, StaffName, PName, Description, startd3, endd3, startt3, endt3
ID, StaffName, PName, Description, startd4, endd4, startt4, endt4
ID, StaffName, PName, Description, startd5, endd5, startt5, endt5
Any help would be appreciated.
The generic way in SQL is to use union all:
select ID, StaffName, PName, Description, startd1 as startd, endd1 as endd, startt1 as startt, endt1 as endt
from t
union all
select ID, StaffName, PName, Description, startd2, endd2, startt2, endt2
from t
union all
select ID, StaffName, PName, Description, startd3, endd3, startt3, endt3
from t
union all
select ID, StaffName, PName, Description, startd4, endd4, startt4, endt4
from t
union all
select ID, StaffName, PName, Description, startd5, endd5, startt5, endt5
from t;
This requires scanning the table once for each subquery, so on a large table there are more efficient methods.
The column names come from the first subquery, which renames the columns to drop the numeric suffix.
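One commonly used alternative to repeated UNION ALL branches is to cross join the table with a small list of slot numbers and pick the matching column with CASE, which typically lets the engine read the table once. A minimal sketch under stated assumptions: an in-memory SQLite database, a hypothetical cut-down table `t` with only two date slots, and made-up sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down version of the wide table: two slots instead of five.
conn.execute(
    "CREATE TABLE t (ID INTEGER, StaffName TEXT, "
    "startd1 TEXT, endd1 TEXT, startd2 TEXT, endd2 TEXT)"
)
conn.execute(
    "INSERT INTO t VALUES "
    "(1, 'Ann', '2020-01-01', '2020-01-02', '2020-02-01', '2020-02-02')"
)

# Each source row is repeated once per slot number; CASE selects that slot's columns.
unpivoted = conn.execute(
    """
    SELECT t.ID, t.StaffName,
           CASE n.slot WHEN 1 THEN t.startd1 WHEN 2 THEN t.startd2 END AS startd,
           CASE n.slot WHEN 1 THEN t.endd1 WHEN 2 THEN t.endd2 END AS endd
    FROM t
    CROSS JOIN (SELECT 1 AS slot UNION ALL SELECT 2) AS n
    ORDER BY t.ID, n.slot
    """
).fetchall()

for row in unpivoted:
    print(row)
```

Extending this to five slots means adding slots 3 to 5 to the derived table and one WHEN branch per slot in each CASE expression.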
I was told to put this data into UNF/1NF/2NF/3NF. Is this correct?
Show the above data as a relation in UNF (unnormalised data).
Customer (CustomerID, FirstName, LastName, Address, City, Phone, State, Postcode, Qty, ProductNo, Description, Unit price, Total, Subtotal, Shipping, Tax Rate, Date, OrderNo)
Show the data as a relation/s in 1NF. (Indicate any keys.)
Customer (CustomerID, FirstName, LastName, Address, City, state, Phone, State, Postcode)
Product (ProductNo, Qty, Description, Unitprice, total, subtotal, shipping, Tax rate(s), CustomerID(FK).)
Order (OrderNo, Date, ProductNo(FK).)
Show the data as a relation/s in 2NF. (Indicate any keys.)
Customer( CustomerID, FirstName, LastName, Address, City, Phone, State, Postcode)
Product ( ProductNo, Qty, Description, UnitPrice, CustomerID(FK), Total(FK).)
Order( OrderNo, Date, CustomerID(FK), ProductNo(FK).)
Total(Total, subtotal, shipping, Tax Rates, ProductNo(FK),CustomerID(FK) )
Show the data as a relation/s in 3NF. (Indicate any keys.)
Customer (CustomerID, FirstName, LastName, Address, City, Phone, State, Postcode)
Product (ProductNo, Description, Unit Price, CustomerID(FK), Total(FK) )
Order (OrderNo, Date, CustomerID(FK), ProductNo(FK) )
Total(Total, subtotal, ProductNo(FK), CustomerID(FK) )
Shipping(Shipping, Tax Rates, Total(FK), OrderNo(FK) )
Qty( QtyID, Qty, ProductNo(FK), OrderNo(FK).)
It looks good to me, but you are missing one crucial piece of the design. You haven't defined any Primary Keys on your tables, although you have identified the foreign keys (use the foreign keys you have to work out the primary keys on each of the tables :)).
Show the above data as a relation in UNF (unnormalised data).
Customer (CustomerID, FirstName, LastName, Address, City, Phone,
State, Postcode,Qty, ProductNo, Description, Unit price, Total,
Subtotal, Shipping, Tax Rate, Date, OrderNo)
No, that's not right. There doesn't seem to be any customer ID number on the invoice. Normalization doesn't involve introducing new attributes. As an unnormalized collection of attributes, labeling that list as "Customer" is premature.
Show the data as a relation/s in 1NF. (Indicate any keys.)
Customer (CustomerID, FirstName, LastName, Address, City, state,
Phone, State, Postcode)
Product (ProductNo, Qty, Description,
Unitprice, total, subtotal, shipping, Tax rate(s), CustomerID(FK).)
Order (OrderNo, Date, ProductNo(FK).)
Drop CustomerID. (See above.) I'm guessing that one of the candidate keys for the "Product" table is "ProductNo". If that's the case, why does that table include "CustomerID"?
Show the data as a relation/s in 2NF. (Indicate any keys.)
Customer( CustomerID, FirstName, LastName, Address, City, Phone, State, Postcode)
Product ( ProductNo, Qty, Description, UnitPrice, CustomerID(FK), Total(FK).)
Order( OrderNo, Date, CustomerID(FK), ProductNo(FK).)
Total(Total, subtotal, shipping, Tax Rates, ProductNo(FK),CustomerID(FK) )
2NF has to do with removing partial key dependencies. What partial key dependency did you identify that justified creating the table "Total"? (Hint: there isn't any justification for that.) Do this thought experiment (or build it in SQL): If "Total" is the primary key for the table "Total", what will you do if two orders result in the same total?
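The thought experiment can be run directly. This sketch builds a table keyed on Total in an in-memory SQLite database (an assumed stand-in for whatever DBMS the course uses) and tries to store two orders that happen to add up to the same amount; the amounts are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "Total" as primary key, as in the proposed design.
conn.execute(
    "CREATE TABLE total (total REAL PRIMARY KEY, subtotal REAL, shipping REAL)"
)

# First order: 20.00 + 5.00 shipping = 25.00
conn.execute("INSERT INTO total VALUES (25.0, 20.0, 5.0)")

# Second, unrelated order: 22.00 + 3.00 shipping = 25.00 as well
try:
    conn.execute("INSERT INTO total VALUES (25.0, 22.0, 3.0)")
    second_insert_error = None
except sqlite3.IntegrityError as exc:
    second_insert_error = str(exc)

stored_rows = conn.execute("SELECT COUNT(*) FROM total").fetchone()[0]
print(second_insert_error)
print(stored_rows)
```

The second insert is rejected by the key constraint, so the second order cannot be recorded at all. A total is a value derived from an order, not something that identifies one.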
I'll stop there for now, because you've really gotten off on the wrong foot. You need to start with a list of all attributes, then identify the candidate keys and functional dependencies. Without starting there, you're unlikely to find 3NF.
An interesting thing about invoices: J Frompton orders a rake today, and at some time in the future the price will change. However, that does not change the price Frompton paid today.
Once invoices are fulfilled, they really should be moved to a table that is 1NF.
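One way to read the point about prices, sketched here with hypothetical table names (`product`, `invoice_line`) and made-up prices in an in-memory SQLite database: the invoice line keeps its own copy of the unit price taken at order time, so a later change to the product price leaves the invoice untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (product_no TEXT PRIMARY KEY, description TEXT, "
    "unit_price REAL)"
)
# The invoice line stores the price charged, not a reference to the current price.
conn.execute(
    "CREATE TABLE invoice_line (order_no INTEGER, product_no TEXT, qty INTEGER, "
    "unit_price REAL)"
)

conn.execute("INSERT INTO product VALUES ('P1', 'Rake', 10.0)")

# Order placed today: copy the current price onto the invoice line.
conn.execute(
    "INSERT INTO invoice_line "
    "SELECT 1, product_no, 1, unit_price FROM product WHERE product_no = 'P1'"
)

# Later, the catalogue price changes.
conn.execute("UPDATE product SET unit_price = 12.5 WHERE product_no = 'P1'")

invoiced_price = conn.execute(
    "SELECT unit_price FROM invoice_line WHERE order_no = 1"
).fetchone()[0]
current_price = conn.execute(
    "SELECT unit_price FROM product WHERE product_no = 'P1'"
).fetchone()[0]

print(invoiced_price, current_price)  # 10.0 12.5
```

The duplicated price column looks like redundancy from a normalization point of view, but it records a different fact: what was charged on that invoice.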