How to get the differences between two rows **and** the name of the field where the difference is, in BigQuery? - sql

I have a table in BigQuery like this:
Name | Phone Number | Address
John | 123456778564 | 1 Penny Lane
John | 873452987424 | 1 Penny Lane
Mary | 845704562848 | 87 5th Avenue
Mary | 845704562848 | 54 Lincoln Rd.
Amy  | 342847327234 | 4 Ocean Drive Avenue
Amy  | 347907387469 | 98 Truman Rd.
I want to get a table with the differences between two consecutive rows and the name of the field where the difference occurs, i.e. this:
Name | Field        | Before               | After
John | Phone Number | 123456778564         | 873452987424
Mary | Address      | 87 5th Avenue        | 54 Lincoln Rd.
Amy  | Phone Number | 342847327234         | 347907387469
Amy  | Address      | 4 Ocean Drive Avenue | 98 Truman Rd.
How can I do this? I've looked at other posts but couldn't find anything that corresponds to my need.
Thank you

Consider the below BigQuery'ish solution:
select Name, ['Phone Number', 'Address'][offset(offset)] Field,
  prev_field as Before, field as After
from (
  -- note: a timestamp column is assumed here to define the order of the rows
  select timestamp, Name, offset, field,
    lag(field) over (partition by Name, offset order by timestamp) as prev_field
  from yourtable,
  unnest([`Phone Number`, Address]) field with offset
)
where prev_field != field
If applied to the sample data in your question, the output is the table of differences shown above.
As you can see here, no matter how many columns in your table you need to compare, it is still just one query - no unions and such.
You just need to enumerate your columns in two places:
['Phone Number', 'Address'][offset(offset)] Field
and
unnest([`Phone Number`, Address]) field with offset
Note: you can further refactor the above using scripting's execute immediate to compose such column lists on the fly within the query (check my other answers - I frequently use this technique in them).
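For illustration, here is a hedged sketch of that refactoring (the yourproject.yourdataset.yourtable names are placeholders, and it assumes all compared columns are STRING, since they are packed into a single array):

declare cols array<string>;

-- collect the columns to compare from the table's schema,
-- skipping the key and ordering columns; both spliced lists below
-- come from this same array, so their offsets stay aligned
set cols = (
  select array_agg(column_name)
  from `yourproject.yourdataset`.INFORMATION_SCHEMA.COLUMNS
  where table_name = 'yourtable'
    and column_name not in ('Name', 'timestamp')
);

-- build and run the comparison query with the column list spliced in twice
execute immediate format("""
select Name, [%s][offset(offset)] as Field,
  prev_field as Before, field as After
from (
  select Name, offset, field,
    lag(field) over (partition by Name, offset order by timestamp) as prev_field
  from `yourproject.yourdataset.yourtable`,
  unnest([%s]) field with offset
)
where prev_field != field
""",
  (select string_agg(format("'%s'", c), ', ') from unnest(cols) c),
  (select string_agg(format('`%s`', c), ', ') from unnest(cols) c)
);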

One method is just to use lag() and union all:
select name, 'phone', prev_phone as before, phone as after
from (select name, phone,
             lag(phone) over (partition by name order by timestamp) as prev_phone
      from t
     ) t
where prev_phone <> phone
union all
select name, 'address', prev_address as before, address as after
from (select name, address,
             lag(address) over (partition by name order by timestamp) as prev_address
      from t
     ) t
where prev_address <> address

Related

How do you merge duplicate rows in a table in BigQuery - replacing missing values with most recent records

For example, I have a table of leads from a marketing database. There are multiple records with duplicate email values. I'd like to merge all of the duplicate records, rolling them up into the latest updated record; if the latest updated record is missing values for certain fields, those fields should be filled from the next most recently updated records that have them.
Table:
First | Last | Email                | Phone      | Job Title | State | Last Updated
John  | Doe  | john.doe#example.com |            |           | MD    | 1/1/2019
John  | low  | john.doe#example.com | 1234567891 | Coach     | VA    | 1/1/2018
John  | Doe  | john.doe#example.com | 3214569875 | Teacher   | CA    | 1/1/2017
Andy  | Yes  | john.doe#example.com |            |           | DC    | 1/1/2021
Roby  | Doe  | john.doe#example.com | 8628423578 | Scientist | VA    | 1/1/2025
Output - one record:
First | Last | Email                | Phone      | Job Title | State | Last Updated
Andy  | Yes  | john.doe#example.com | 1234567891 | Coach     | DC    | 1/1/2021
In this example, since the 2021 record is missing a phone number and job title, those values are pulled from the next most recently updated record that has them (2018).
I've thought about using DISTINCT or UNIQUE functions, but I'm not sure how to execute the merge using the last updated record and then fill in the blank values from the other most recent records. Any help would be greatly appreciated!
Thank you in advance.
Best,
Dawit
Consider the below approach - I think it is the most generic one; you just need to make sure you have the correct list of fields in the unpivot and pivot lines. Note the assumption, though, that the fields First, Last, Phone, Job_Title, and State are all of string data type.
select First, Last, Email, Phone, Job_Title, State, max_Last_Updated as Last_Updated
from (
  select * except(Last_Updated),
    max(Last_Updated) over(partition by Email) as max_Last_Updated
  from data
  -- unpivot drops rows whose value is null, so only populated fields survive
  unpivot (value for col in (First, Last, Phone, Job_Title, State))
  where true
  -- per (Email, field), keep the value from the most recently updated row
  qualify row_number() over(partition by Email, col order by Last_Updated desc) = 1
)
pivot (max(value) for col in ('First', 'Last', 'Phone', 'Job_Title', 'State', 'Last_Updated'))
If applied to the sample data in your question (excluding the 2025 row), the output is the expected record shown above.
You need a method to know that these are all the same record. You can use last_value(ignore nulls) for this purpose:
select t.*,
  last_value(first ignore nulls) over (partition by email order by last_updated) as imputed_first,
  last_value(last ignore nulls) over (partition by email order by last_updated) as imputed_last,
  . . . -- and so on for the other columns
from t;
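To get from those imputed columns to the single merged record the question asks for, one possible continuation (a sketch only, assuming BigQuery and the question's column names, with t as the source table) is to fill the gaps and then keep just the most recently updated row per email:

select email, first, last,
  imputed_phone as phone,
  imputed_job_title as job_title,
  state, last_updated
from (
  select t.*,
    -- latest non-null value per email, in last_updated order
    last_value(phone ignore nulls)
      over (partition by email order by last_updated) as imputed_phone,
    last_value(job_title ignore nulls)
      over (partition by email order by last_updated) as imputed_job_title
  from t
)
where true
-- keep only the most recently updated row per email
qualify row_number() over (partition by email order by last_updated desc) = 1;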

Select top three records grouping by two factors

I am trying to identify the three records with the highest values, grouped by two factors. I realize this question is similar to PostgreSQL: select top three in each group, but I can't figure out how to generalize from that example, which involves a single factor, to two factors. I have searched Stack Overflow for an answer beyond the one listed above and can't find one, but perhaps I'm not searching for the correct terms.
Briefly, I'm connecting to a table with the following schema
city, country, value
I only have a single row per city/country combination, but the number of city entries I have per country is variable. For example, I have a few dozen cities for Canada, a hundred for the United States, but only two for Uzbekistan.
What I want as output is a table with the same schema, but containing only the rows with the highest three values for city, nested within country. For example, if Canada has the cities and values of
{Canada, toronto, 100}, {Canada, vancouver, 80},
{Canada, montreal,112}, {Canada, calgary, 109},
{Canada, edmonton, 76}, {Canada, winnipeg, 73},
and the United States has the entries of
{{us, nyc, 104}, {us, chicago, 87},
{us, boston, 98}, {us, seattle, 105},
{us, sanfran, 88}, {us, minneapolis, 84},
{us, miami, 103}, {us, houston, 112},
{us, dallas, 78}, {us, tucson, 83}}
and Uzbekistan has the entries of
{uzbekistan, qarshi, 95}, {uzbekistan, gluiston, 101}
What I would like as output would be
Canada, montreal, 112
Canada, calgary, 109
Canada, toronto, 100
us, houston, 112
us, seattle, 105
us, nyc, 104
uzbekistan, gluiston, 101
uzbekistan, qarshi, 95
I've tried the following query
SELECT logincity, logincountry, val
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY logincountry, logincity
                               ORDER BY val DESC) AS Row_ID
  FROM a_table
)
WHERE Row_ID < 4
ORDER BY logincity
But I end up with more than three cities per country.
Can someone help me out?
Thanks Stack Overflow!
I think you only need to partition by logincountry:
SELECT logincity, logincountry, val
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY logincountry
                            ORDER BY val DESC) AS Row_ID
  FROM a_table
) T
WHERE Row_ID < 4
ORDER BY logincity
TIP: You will probably spot the problem if you include Row_ID in the SELECT:
SELECT logincity, logincountry, val, Row_ID
In your original query, every row has Row_ID = 1.
TIP 2: Your query wants the top 3 cities for each country, so you only have one partition column: country. The linked question is therefore the right answer - top 3 of each group, where the group here is country.
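To make the first tip concrete, here is a quick diagnostic sketch against the same (hypothetical) a_table; with the original two-column partition, each (logincountry, logincity) pair is its own partition, so every row comes back with Row_ID = 1 and the filter Row_ID < 4 removes nothing:

SELECT logincity, logincountry, val,
       ROW_NUMBER() OVER (PARTITION BY logincountry, logincity
                          ORDER BY val DESC) AS Row_ID
FROM a_table;
-- every row shows Row_ID = 1, because each (logincountry, logincity)
-- partition contains exactly one row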

Populating column for Oracle Text search from 2 tables

I am investigating the benefits of Oracle Text search, and am currently looking at collecting search text from multiple (related) tables and storing it in the smaller table of a 1-to-many relationship.
Consider these 2 simple tables, house and inhabitants, and there are NEVER any uninhabited houses:
HOUSE
ID Address Search_Text
1 44 Some Road
2 31 Letsby Avenue
3 18 Moon Crescent
INHABITANT
ID House Name Nickname
1 1 Jane Doe Janey
2 1 John Doe JD
3 2 Jo Smythe Smithy
4 2 Percy Plum PC
5 3 Apollo Lander Moony
I want to write SQL that updates the HOUSE.Search_Text column with text from INHABITANT. Because this is a 1-to-many relationship, the SQL needs to collate the data in INHABITANT for each matching row in HOUSE, combine it (comma separated), and update the Search_Text field.
Once done, the Oracle Text search index on HOUSE.Search_Text will return me HOUSEs that match the search criteria, and I can look up INHABITANTs accordingly.
Of course, this is a very simplified example, I want to pick up data from many columns and Full Text Search across fields in both tables.
With the help of a colleague we've got:
select id, ADDRESS || '; ' || Names || '; ' || Nicknames as Search_Text
from house
left join (
  SELECT distinct house_id,
    LISTAGG(NAME, ', ') WITHIN GROUP (ORDER BY NAME) OVER (PARTITION BY house_id) as Names,
    LISTAGG(NICKNAME, ', ') WITHIN GROUP (ORDER BY NICKNAME) OVER (PARTITION BY house_id) as Nicknames
  FROM INHABITANT
) i on house.id = i.house_id;
which returns:
1 44 Some Road; Jane Doe, John Doe; JD, Janey
2 31 Letsby Avenue; Jo Smythe, Percy Plum; PC, Smithy
3 18 Moon Crescent; Apollo Lander; Moony
Some questions:
1. Is this an efficient query to return this data? I'm slightly concerned about the distinct.
2. Is this the right way to use Oracle Text search across multiple text fields?
3. How do I update HOUSE.Search_Text with the results above? I think I need a correlated subquery, but can't quite work it out.
4. Would it be more efficient to create a new table containing House_ID and Search_Text only, rather than update HOUSE?
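On question 3, here is a hedged sketch of one way to do the update (table and column names as in the question): a MERGE fed by a plain GROUP BY aggregation. Using the aggregate form of LISTAGG instead of the analytic OVER form also removes the need for the DISTINCT from question 1, since it already produces one row per house:

merge into house h
using (
  -- one aggregated row per house, so no DISTINCT is needed
  select house_id,
         listagg(name, ', ') within group (order by name) as names,
         listagg(nickname, ', ') within group (order by nickname) as nicknames
  from inhabitant
  group by house_id
) i
on (h.id = i.house_id)
when matched then update set
  h.search_text = h.address || '; ' || i.names || '; ' || i.nicknames;

Since the question states there are never any uninhabited houses, every HOUSE row matches and gets its Search_Text populated.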

Count number of rows that have a specific word in a varchar (in postgresql)

I have a table similar to the below:
id | name    | direction
---+---------+---------------------
 1 | Jhon    | Washington, DC
 2 | Diego   | Miami, Florida
 3 | Michael | Orlando, Florida
 4 | Jenny   | Olympia, washington
 5 | Joe     | Austin, Texas
 6 | Barack  | Denver, Colorado
and I want to count how many people live in a specific state:
Washington 2
Florida 2
Texas 1
Colorado 1
How can I do this? (By the way, this is just a question with an academic point of view.)
Thanks in advance!
Postgres offers the function split_part(), which will break up a string by a delimiter. You want the second part (the part after the comma):
select split_part(direction, ', ', 2) as state, count(*)
from t
group by split_part(direction, ', ', 2);
First, I would obtain the state from the direction field. Once you have that, it's quite simple:
SELECT state, count(*) as total FROM initial_table GROUP BY state
To obtain the state, various functions are useful depending on the DBMS. A possible pseudocode (given a function like MySQL's substring_index) for the query would be:
SELECT substring_index(direction, ',', -1) as state, count(*) as total
FROM initial_table GROUP BY substring_index(direction, ',', -1)
Edit: as suggested above, the query should return 1 for the Washington state.
My way of writing such queries is two-step: first, prepare the fields you need; second, do your grouping or other calculations. That way you follow the DRY principle and don't repeat yourself. I think a CTE is the best tool for this:
with cte as (
  -- we don't need other fields, only state
  select split_part(direction, ', ', 2) as state
  from table1
)
select state, count(*)
from cte
group by state
sql fiddle demo
If you write queries this way, it's easy to change the grouping field in the future.
Hope that helps, and remember - readability counts! :)

Ordering 2 columns on the same order

I have this table, for example:
Name   | Last Name
Albert | Rigs
Carl   | Dimonds
Robert | Big
Julian | Berg
I need to order it like this:
Name   | Last Name
Albert | Rigs     (name)
Julian | Berg     (last name)
Robert | Big      (last name)
Carl   | Dimonds
I need something like ordering by name and last name as if they were the same column. See the example: I have the name Albert, and the next row by first name would be Carl, but the last names Big and Berg come before C, so the last-name order wins for the second and third rows.
It's like the two columns are treated as one, but they aren't.
It's hard to explain, I'm sorry.
Is it possible?
Thanks in advance.
To order by the minimum of (Name, Lastname), you could:
select *
from YourTable
order by case
           when Name > LastName then LastName
           else Name
         end
A syntactic improvement on the CASE, also allowing a tie-break on the other column:
select *
from my_table
order by least(name,last_name),
greatest(name,last_name)
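Applied to the sample data, least(name, last_name) yields Albert, Carl, Big, and Berg, which sorts to Albert, Berg, Big, Carl - exactly the desired order. One caveat, since the question doesn't name a DBMS: in Oracle and MySQL, least()/greatest() return NULL when any argument is NULL (PostgreSQL skips NULL arguments), so nullable columns may need a coalesce() wrapper.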