How do you merge duplicate rows in a table in BigQuery, replacing missing values with the most recent records?

For example, I have a table of leads from a marketing database, and multiple records share the same email value. I'd like to merge all of the duplicates into the most recently updated record. Where that record is missing a value for a field, the value should be filled in from the next most recently updated record that has one.
Table:
First | Last | Email                | Phone      | Job Title | State | Last Updated
John  | Doe  | john.doe@example.com |            |           | MD    | 1/1/2019
John  | low  | john.doe@example.com | 1234567891 | Coach     | VA    | 1/1/2018
John  | Doe  | john.doe@example.com | 3214569875 | Teacher   | CA    | 1/1/2017
Andy  | Yes  | john.doe@example.com |            |           | DC    | 1/1/2021
Roby  | Doe  | john.doe@example.com | 8628423578 | Scientist | VA    | 1/1/2025
Output - One record:
First | Last | Email                | Phone      | Job Title | State | Last Updated
Andy  | Yes  | john.doe@example.com | 1234567891 | Coach     | DC    | 1/1/2021
In this example, since the 2021 record is missing a phone number and job title, those values are pulled from the most recently updated record that has them (2018).
I've thought about using Distinct or Unique functions but not sure how to execute on the merge using the last updated record and then filling in the blank values with the other most recent records. Any help would be greatly appreciated!!
Thank you in advance.
Best,
Dawit

Consider the approach below. I think it is the most generic: you just need to make sure you have the correct list of fields in the unpivot and pivot lines. It does assume that the fields First, Last, Phone, Job_Title and State are all of the string data type.
select First, Last, Email, Phone, Job_Title, State, max_Last_Updated as Last_Updated
from (
  select * except(Last_Updated),
    max(Last_Updated) over(partition by Email) as max_Last_Updated
  from data
  unpivot (value for col in (First, Last, Phone, Job_Title, State))
  where true
  qualify row_number() over(partition by Email, col order by Last_Updated desc) = 1
)
pivot (max(value) for col in ('First', 'Last', 'Phone', 'Job_Title', 'State', 'Last_Updated'))
If applied to the sample data in your question (excluding the 2025 row), the output is the single merged record shown as the expected output in the question.

You need a method to know that these are all the same record. You can use last_value(ignore nulls) for this purpose:
select t.*,
       last_value(first ignore nulls) over (partition by email order by last_updated) as imputed_first,
       last_value(last ignore nulls) over (partition by email order by last_updated) as imputed_last,
       . . . -- and so on for the other columns
from t;
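Both answers above rely on the same idea: for each email, take the newest non-null value of every column. A minimal sketch of that idea follows, using Python's sqlite3 module instead of BigQuery so it can be run locally. The table name leads, the column names first_name and last_name, and the ISO-formatted dates are assumptions made for the sketch, and correlated subqueries stand in for unpivot/pivot and last_value(ignore nulls), which SQLite does not offer. As in the first answer, the 2025 row is left out.

```python
import sqlite3

# Hypothetical table and column names. Dates are stored as ISO strings
# (an assumption) so that text ordering matches chronological ordering.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table leads (first_name text, last_name text, email text, "
    "phone text, job_title text, state text, last_updated text)"
)
conn.executemany(
    "insert into leads values (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John", "Doe", "john.doe@example.com", None, None, "MD", "2019-01-01"),
        ("John", "low", "john.doe@example.com", "1234567891", "Coach", "VA", "2018-01-01"),
        ("John", "Doe", "john.doe@example.com", "3214569875", "Teacher", "CA", "2017-01-01"),
        ("Andy", "Yes", "john.doe@example.com", None, None, "DC", "2021-01-01"),
    ],
)


def latest_non_null(column):
    # Scalar subquery: newest non-null value of `column` for the outer email.
    return (
        f"(select l.{column} from leads l "
        f"where l.email = e.email and l.{column} is not null "
        f"order by l.last_updated desc limit 1) as {column}"
    )


fields = ["first_name", "last_name", "phone", "job_title", "state"]
sql = (
    "select e.email, "
    + ", ".join(latest_non_null(f) for f in fields)
    + ", e.last_updated "
    "from (select email, max(last_updated) as last_updated "
    "from leads group by email) e"
)
merged = conn.execute(sql).fetchall()
print(merged)
```

The list of fields is written once and expanded into one subquery per column, which mirrors the "keep the field list correct" caveat of the unpivot/pivot answer.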

Related

How to select a foreign key after narrowing down via Group By and Having in a subquery

I've got a unique problem. I'm querying a replicated database table cost_plan_breakdown, and the replication is known to have some duplicates due to issues with deleting records. I'm not the Admin so I'm trying to sidestep these duplicates as efficiently as possible. The table looks like this:
sys_id | sys_created_on      | cost_plan                      | breakdown_start_date
axr123 | 2020-10-01 09:31:15 | Outlook KTLO - Lisa Lymon      | 10-01-2020
pqo100 | 2020-12-23 05:50:20 | Outlook KTLO - Lisa Lymon      | 10-01-2020
cji985 | 2020-10-01 09:31:15 | Outlook KTLO - Lisa Lymon      | 11-01-2020
twg795 | 2020-10-05 13:23:08 | DataPyramid CTB - Dave Dods    | 10-01-2020
jqr820 | 2020-09-28 16:11:54 | Revoluccion CTB - Marcus Vance | 11-01-2020
vjo150 | 2021-01-13 11:10:09 | Server KTLO - Tom Smith        | 10-01-2020
Cost Plans typically have between 1 and 12 breakdowns during their lifespan, but there should only be one breakdown per cost plan per month. Notice that the Outlook Cost Plan has two breakdowns within the same month (October) with differing sys_id and sys_created_on.
So by using a smaller subquery in the where clause, I'm trying to determine the following:
"Group the rows with identical month and year of breakdown_start_date, and identical cost_plan. Of the remaining rows, select the one with the MAX sys_created_on. Take the sys_id of that row and feed it to the parent query to only include these rows."
...rest of query above
WHERE cpb.breakdown_type = 'requirement'
AND cpb.sys_id IN
(SELECT cpb2.sys_id
FROM cost_plan_breakdown cpb2
GROUP BY cpb2.name,
YEAR(cpb2.start_date_time),
MONTH(cpb2.start_date_time)
HAVING MAX(cpb2.sys_created_on))
At this point, I'm running into the error
cpb2.sys_id is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
I've previously semi-solved this by putting the MAX sys_created_on in the SELECT statement, and matching off that, but I realized that could pull in unwanted dupe records just because they match the sys_created_on of another.
I feel like the solution may be staring me in the face, but I'm stuck. Appreciate your help!
Use row_number to number the duplicate rows and then exclude them. Ordering the row number by sys_created_on desc ensures you get the latest of each per month.
declare @Test table (sys_id varchar(6), sys_created_on datetime2(0), cost_plan varchar(32), breakdown_start_date date);
insert into @Test (sys_id, sys_created_on, cost_plan, breakdown_start_date)
values
  ('axr123', '2020-10-01 09:31:15', 'Outlook KTLO - Lisa Lymon', '10-01-2020'),
  ('pqo100', '2020-12-23 05:50:20', 'Outlook KTLO - Lisa Lymon', '10-01-2020'),
  ('cji985', '2020-10-01 09:31:15', 'Outlook KTLO - Lisa Lymon', '11-01-2020'),
  ('twg795', '2020-10-05 13:23:08', 'DataPyramid CTB - Dave Dods', '10-01-2020'),
  ('jqr820', '2020-09-28 16:11:54', 'Revoluccion CTB - Marcus Vance', '11-01-2020'),
  ('vjo150', '2021-01-13 11:10:09', 'Server KTLO - Tom Smith', '10-01-2020');
with cte as (
  select *
    , row_number() over (partition by cost_plan, datepart(year,breakdown_start_date), datepart(month,breakdown_start_date) order by sys_created_on desc) rn
  from @Test
)
select *
from cte
where rn = 1;
As per your comments this (the CTE) is just a neat way to write a sub-query/derived table and can still be written as follows:
select *
from (
  select *
    , row_number() over (partition by cost_plan, datepart(year,breakdown_start_date), datepart(month,breakdown_start_date) order by sys_created_on desc) rn
  from @Test
) cte
where rn = 1;
Note: If you provide DDL+DML as shown above you make it much easier for people to assist.
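For readers without a SQL Server instance at hand, the same row_number technique can be tried with Python's sqlite3 module. This is only a sketch under a few assumptions: it needs SQLite 3.25 or later for window functions, dates are stored as ISO strings, and strftime('%Y-%m', ...) stands in for the two datepart calls. The resulting list of sys_id values is the kind of list the question wanted to feed to its IN (...) filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table cost_plan_breakdown (sys_id text, sys_created_on text, "
    "cost_plan text, breakdown_start_date text)"
)
conn.executemany(
    "insert into cost_plan_breakdown values (?, ?, ?, ?)",
    [
        ("axr123", "2020-10-01 09:31:15", "Outlook KTLO - Lisa Lymon", "2020-10-01"),
        ("pqo100", "2020-12-23 05:50:20", "Outlook KTLO - Lisa Lymon", "2020-10-01"),
        ("cji985", "2020-10-01 09:31:15", "Outlook KTLO - Lisa Lymon", "2020-11-01"),
        ("twg795", "2020-10-05 13:23:08", "DataPyramid CTB - Dave Dods", "2020-10-01"),
        ("jqr820", "2020-09-28 16:11:54", "Revoluccion CTB - Marcus Vance", "2020-11-01"),
        ("vjo150", "2021-01-13 11:10:09", "Server KTLO - Tom Smith", "2020-10-01"),
    ],
)

# Number the rows within each (cost_plan, year-month) group, newest first,
# and keep only the first row of every group.
latest_ids = [
    row[0]
    for row in conn.execute(
        """
        select sys_id
        from (
            select sys_id,
                   row_number() over (
                       partition by cost_plan,
                                    strftime('%Y-%m', breakdown_start_date)
                       order by sys_created_on desc
                   ) as rn
            from cost_plan_breakdown
        )
        where rn = 1
        order by sys_id
        """
    )
]
print(latest_ids)
```

The older October duplicate (axr123) is the only row dropped; every other row is the sole or newest breakdown for its cost plan and month.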

How to get the differences between two rows and the name of the field where the difference is, in BigQuery?

I have a table in BigQuery like this:
Name | Phone Number | Address
John | 123456778564 | 1 Penny Lane
John | 873452987424 | 1 Penny Lane
Mary | 845704562848 | 87 5th Avenue
Mary | 845704562848 | 54 Lincoln Rd.
Amy  | 342847327234 | 4 Ocean Drive Avenue
Amy  | 347907387469 | 98 Truman Rd.
I want to get a table with the differences between two consecutive rows and the name of the field where the difference occurs, like this:
Name | Field        | Before               | After
John | Phone Number | 123456778564         | 873452987424
Mary | Address      | 87 5th Avenue        | 54 Lincoln Rd.
Amy  | Phone Number | 342847327234         | 347907387469
Amy  | Address      | 4 Ocean Drive Avenue | 98 Truman Rd.
How can I do this? I've looked at other posts but couldn't find anything that corresponds to my need.
Thank you
Consider the BigQuery-style solution below:
select Name, ['Phone Number', 'Address'][offset(offset)] Field,
  prev_field as Before, field as After
from (
  select timestamp, Name, offset, field,
    lag(field) over (partition by Name, offset order by timestamp) as prev_field
  from yourtable,
  unnest([`Phone Number`, Address]) field with offset
)
where prev_field != field
If applied to the sample data in your question, the output is the Name / Field / Before / After table you asked for.
No matter how many columns in your table you need to compare, it is still just one query, with no unions and such.
You just need to enumerate your columns in two places:
['Phone Number', 'Address'][offset(offset)] Field
and
unnest([`Phone Number`, Address]) field with offset
Note: you can further refactor above using scripting's execute immediate to compose such lists within the query on the fly (check my other answers - I frequently use such technique in them)
One method is just to use lag() and union all:
select name, 'phone', prev_phone as before, phone as after
from (select name, phone,
             lag(phone) over (partition by name order by timestamp) as prev_phone
      from t
     ) t
where prev_phone <> phone
union all
select name, 'address', prev_address as before, address as after
from (select name, address,
             lag(address) over (partition by name order by timestamp) as prev_address
      from t
     ) t
where prev_address <> address
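Both answers order rows by a timestamp column that the sample table does not show, so some ordering column has to exist. The sketch below runs the lag() plus union all approach in SQLite (3.25 or later) through Python's sqlite3 module. The table name contacts and the seq ordering column are hypothetical, and the result columns are named old_value and new_value here in place of Before and After. The per-column queries are generated from a single list, so columns are still enumerated only once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `seq` is a hypothetical ordering column standing in for `timestamp`.
conn.execute("create table contacts (seq integer, name text, phone text, address text)")
conn.executemany(
    "insert into contacts values (?, ?, ?, ?)",
    [
        (1, "John", "123456778564", "1 Penny Lane"),
        (2, "John", "873452987424", "1 Penny Lane"),
        (3, "Mary", "845704562848", "87 5th Avenue"),
        (4, "Mary", "845704562848", "54 Lincoln Rd."),
        (5, "Amy", "342847327234", "4 Ocean Drive Avenue"),
        (6, "Amy", "347907387469", "98 Truman Rd."),
    ],
)

columns = ["phone", "address"]
# One lag() query per compared column, glued together with union all.
parts = [
    f"select name, '{c}' as field, prev as old_value, cur as new_value "
    f"from (select name, {c} as cur, "
    f"lag({c}) over (partition by name order by seq) as prev from contacts) "
    f"where prev <> cur"
    for c in columns
]
sql = " union all ".join(parts) + " order by name, field"
changes = conn.execute(sql).fetchall()
for change in changes:
    print(change)
```

The first row of each name has no previous row, so its lag() is null and the prev <> cur filter drops it without any extra condition.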

displaying sql results by a group based on column

I have, say, thousands of records in my table. I want to display records together by city. It's a bit more complicated than that, since I also need them displayed in alphabetical order by customer name. How do I achieve this? Group By seems to want to give me a total instead of displaying each of my records. So:
mark zuck some city
john smith cherryville
bill gates some city
jane doe cherryville
should return
bill gates some city
mark zuck some city
jane doe cherryville
john smith cherryville
This is an over-simplification but the idea stands. I appreciate all the help. thank you!
Group By is for aggregations, and there is no aggregation in your query. You just want your output to be sorted, so Order By is the right fit.
select * from table1
order by city, customer
In English: get all table1 data, sorted first by city, then by customer.
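A quick way to try this out is a small sketch with Python's sqlite3 module, using the four sample rows from the question. The table name customers and its two columns are assumptions made for the sketch. Note that the cities themselves are sorted alphabetically too, so cherryville comes before some city.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table customers (customer text, city text)")
conn.executemany(
    "insert into customers values (?, ?)",
    [
        ("mark zuck", "some city"),
        ("john smith", "cherryville"),
        ("bill gates", "some city"),
        ("jane doe", "cherryville"),
    ],
)

# Every row is kept; rows are only sorted, first by city, then by customer.
rows = conn.execute(
    "select customer, city from customers order by city, customer"
).fetchall()
for row in rows:
    print(row)
```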

Populating column for Oracle Text search from 2 tables

I am investigating the benefits of Oracle Text search, and currently am looking at collecting search text data from multiple (related) tables and storing the data in the smaller table in a 1-to-many relationship.
Consider these 2 simple tables, house and inhabitants, and there are NEVER any uninhabited houses:
HOUSE
ID Address Search_Text
1 44 Some Road
2 31 Letsby Avenue
3 18 Moon Crescent
INHABITANT
ID House Name Nickname
1 1 Jane Doe Janey
2 1 John Doe JD
3 2 Jo Smythe Smithy
4 2 Percy Plum PC
5 3 Apollo Lander Moony
I want to write SQL that updates the HOUSE.Search_Text column with text from INHABITANT. Because this is a 1-to-many relationship, the SQL needs to collate the data in INHABITANT for each matching row in HOUSE, combine the data (comma separated), and update the Search_Text field.
Once done, the Oracle Text search index on HOUSE.Search_Text will return me HOUSEs that match the search criteria, and I can look up INHABITANTs accordingly.
Of course, this is a very simplified example, I want to pick up data from many columns and Full Text Search across fields in both tables.
With the help of a colleague we've got:
select id, ADDRESS||'; '||Names||'; '||Nicknames as Search_Text
from house left join(
SELECT distinct house_id,
LISTAGG(NAME, ', ') WITHIN GROUP (ORDER BY NAME) OVER (PARTITION BY house_id) as Names,
LISTAGG(NICKNAME, ', ') WITHIN GROUP (ORDER BY NICKNAME) OVER (PARTITION BY house_id) as Nicknames
FROM INHABITANT)
i on house.id = i.house_id;
which returns:
1 44 Some Road; Jane Doe, John Doe; JD, Janey
2 31 Letsby Avenue; Jo Smythe, Percy Plum; PC, Smithy
3 18 Moon Crescent; Apollo Lander; Moony
Some questions:
Is this an efficient query to return this data? I'm slightly concerned about the distinct.
Is this the right way to use Oracle Text search across multiple text fields?
How to update House.Search_Text with the results above? I think I need a correlated subquery, but can't quite work it out.
Would it be more efficient to create a new table containing House_ID and Search_Text only, rather than update House?

Ordering 2 columns on the same order

I have the table:
Example:
Name   | Last Name
Albert | Rigs
Carl   | Dimonds
Robert | Big
Julian | Berg
I need to order like this:
Name   | Last Name
Albert | Rigs      (name)
Julian | Berg      (last name)
Robert | Big       (last name)
Carl   | Dimonds
I need something like ordering by name and last name in a single ordering.
In the example, the first row is ordered by the name Albert. The next name alphabetically would be Carl, but the last names Berg and Big come before Carl (B comes before C), so those rows are placed next, ordered by last name.
It's as if the two columns were one column, but they aren't.
It's hard to explain, I'm sorry.
Is it possible?
Thanks in advance.
To order by the minimum of (Name, LastName), you could:
select *
from YourTable
order by
  case
    when Name > LastName then LastName
    else Name
  end
A syntactic improvement on the CASE, which also allows a tie-break on the other column:
select *
from my_table
order by least(name,last_name),
greatest(name,last_name)
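The ordering can be checked against the sample data with a short sketch in Python's sqlite3 module. SQLite has no least() or greatest(), so the two-argument scalar forms of min() and max() are swapped in here. The table name people is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table people (name text, last_name text)")
conn.executemany(
    "insert into people values (?, ?)",
    [("Albert", "Rigs"), ("Carl", "Dimonds"), ("Robert", "Big"), ("Julian", "Berg")],
)

# With two arguments, SQLite's min()/max() are scalar functions and play the
# role of least()/greatest(): sort by the smaller of the two values, then
# break ties with the larger one.
ordered = conn.execute(
    "select name, last_name from people "
    "order by min(name, last_name), max(name, last_name)"
).fetchall()
for row in ordered:
    print(row)
```

The sort keys are Albert, Berg, Big and Carl, which gives the row order requested in the question.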