Easiest way to merge rows in Google Refine (OpenRefine) if all columns are identical - openrefine

I'm cleaning data from multiple sources with OpenRefine (formerly Google Refine). The files contain companies and share identical column definitions, e.g.
UNID | Name | Street | City | Country | Phone | ...
sg52d | Company a | A street | a city | c country | 12345
sg52d | Company a | A street | a city | c country | 0099835
dfnsd | Company B | B Street | City B | c country | 33445
dfnsd | Company B | Different | Another | c country | 33445
xxbb3 | Company C | C Street | City B | Country A | 1111
xxbb3 | Company C | C Street | City B | Country A | 1111
What I want is this result (only the last company is merged, because all of its columns are identical):
UNID | Name | Street | City | Country | Phone | ...
sg52d | Company a | A street | a city | c country | 12345
sg52d | Company a | A street | a city | c country | 0099835
dfnsd | Company B | B Street | City B | c country | 33445
dfnsd | Company B | Different | Another | c country | 33445
xxbb3 | Company C | C Street | City B | Country A | 1111
Is there a simple way to do this?
I understand that I can concatenate all columns into a new column, but this is a little PITA, because of the number of columns.
Perhaps there is a way for the new column definition to loop through all other columns and merge it?

It is a strange approach but this should work: http://googlerefine.blogspot.com/2011/08/remove-duplicate.html
Make sure you make the sort change permanent.

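You could create a new column with an expression like the one below and deduplicate on it. A separator in join keeps adjacent values from running together (with join(""), "ab" + "c" and "a" + "bc" would produce the same key):
forEach(["UNID", "Name", "Street", "City", "..." ],x,cells[x].value).join("|")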
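For comparison outside OpenRefine, the same idea (build one key from every column, then keep the first row per key) can be sketched in Python. This is only an illustration under assumptions: the function name `drop_identical_rows` is hypothetical, and the rows are the sample data from the question with the trailing "..." columns left out.

```python
def drop_identical_rows(rows):
    """Keep the first occurrence of each row.

    A later row is dropped only when every column matches an earlier row.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # compare all columns at once, no string joining needed
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Sample rows from the question (hypothetical in-memory copy).
rows = [
    ["sg52d", "Company a", "A street", "a city", "c country", "12345"],
    ["sg52d", "Company a", "A street", "a city", "c country", "0099835"],
    ["dfnsd", "Company B", "B Street", "City B", "c country", "33445"],
    ["dfnsd", "Company B", "Different", "Another", "c country", "33445"],
    ["xxbb3", "Company C", "C Street", "City B", "Country A", "1111"],
    ["xxbb3", "Company C", "C Street", "City B", "Country A", "1111"],
]

for row in drop_identical_rows(rows):
    print(" | ".join(row))
```

Using a tuple as the key sidesteps the separator question entirely, since the column boundaries are preserved.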

Related

FIRST & LAST values in Oracle SQL

I am having trouble querying some data. The table I am trying to pull the data from is a LOG table, where I would like to see changes in the values next to each other (example below)
Table:
+-----------+----+-------------+----------+------------+
| UNIQUE_ID | ID | NAME | CITY | DATE |
+-----------+----+-------------+----------+------------+
| xa220 | 1 | John Smith | Berlin | 2020.05.01 |
| xa195 | 1 | John Smith | Berlin | 2020.03.01 |
| xa111 | 1 | John Smith | München | 2020.01.01 |
| xa106 | 2 | James Brown | Atlanta | 2018.04.04 |
| xa100 | 2 | James Brown | Boston | 2017.12.10 |
| xa76 | 3 | Emily Wolf | Shanghai | 2016.11.03 |
| xa20 | 3 | Emily Wolf | Shanghai | 2016.07.03 |
| xa15 | 3 | Emily Wolf | Tokyo | 2014.02.22 |
| xa12 | 3 | Emily Wolf | null | 2014.02.22 |
+-----------+----+-------------+----------+------------+
Desired outcome:
+----+-------------+----------+---------------+
| ID | NAME | CITY | PREVIOUS_CITY |
+----+-------------+----------+---------------+
| 1 | John Smith | Berlin | München |
| 2 | James Brown | Atlanta | Boston |
| 3 | Emily Wolf | Shanghai | Tokyo |
| 3 | Emily Wolf | Tokyo | null |
+----+-------------+----------+---------------+
I have been trying to use FIRST and LAST values, however, cannot get the desired outcome.
select distinct id,
name,
city,
first_value(city) over (partition by id order by city) as previous_city
from test
Any help is appreciated!
Thank you!
Use the LAG function to get the city for the previous date, and display only the rows where the current city and the result of LAG are different:
WITH cte AS (
SELECT t.*, LAG(CITY, 1, CITY) OVER (PARTITION BY ID ORDER BY "DATE") LAG_CITY
FROM yourTable t
)
SELECT ID, NAME, CITY, LAG_CITY AS PREVIOUS_CITY
FROM cte
WHERE
CITY <> LAG_CITY OR
CITY IS NULL AND LAG_CITY IS NOT NULL OR
CITY IS NOT NULL AND LAG_CITY IS NULL
ORDER BY
ID, "DATE" DESC;
Some comments on how LAG is used and how its result is checked are warranted. We use the three-parameter version of LAG here. The second parameter is the number of records to look back, which in this case is 1 (the default). The third parameter is the default value to use when a record is the first in its ID partition and so has no previous record. Here the default is the record's own CITY value, so the first record of each partition compares as unchanged and never appears in the result set.
For the WHERE clause above, a matching record is one for which the city and lag city are different, or for which one of the two is NULL and the other is not. The explicit NULL checks are needed because CITY <> LAG_CITY evaluates to unknown, not true, when either side is NULL.
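To try the LAG approach without an Oracle instance, here is a minimal sketch that runs the same query shape in SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or newer). The table name `log_table` is made up for the example. Because the two oldest rows for ID 3 share a date, the sketch adds `unique_id` as a tiebreaker in the ordering; that tiebreaker is an assumption, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE log_table '
    '(unique_id TEXT, id INTEGER, name TEXT, city TEXT, "date" TEXT)'
)
conn.executemany(
    "INSERT INTO log_table VALUES (?, ?, ?, ?, ?)",
    [
        ("xa220", 1, "John Smith", "Berlin", "2020.05.01"),
        ("xa195", 1, "John Smith", "Berlin", "2020.03.01"),
        ("xa111", 1, "John Smith", "München", "2020.01.01"),
        ("xa106", 2, "James Brown", "Atlanta", "2018.04.04"),
        ("xa100", 2, "James Brown", "Boston", "2017.12.10"),
        ("xa76", 3, "Emily Wolf", "Shanghai", "2016.11.03"),
        ("xa20", 3, "Emily Wolf", "Shanghai", "2016.07.03"),
        ("xa15", 3, "Emily Wolf", "Tokyo", "2014.02.22"),
        ("xa12", 3, "Emily Wolf", None, "2014.02.22"),
    ],
)

query = """
WITH cte AS (
    SELECT t.*,
           -- default = the row's own city, so the first row per id never differs
           LAG(city, 1, city) OVER (
               PARTITION BY id ORDER BY "date", unique_id  -- unique_id: assumed tiebreaker
           ) AS lag_city
    FROM log_table t
)
SELECT id, name, city, lag_city AS previous_city
FROM cte
WHERE city <> lag_city
   OR (city IS NULL AND lag_city IS NOT NULL)
   OR (city IS NOT NULL AND lag_city IS NULL)
ORDER BY id, "date" DESC, unique_id DESC
"""

changes = conn.execute(query).fetchall()
for row in changes:
    print(row)
```

The parentheses around the two NULL checks do not change the result (AND already binds tighter than OR); they only make the grouping explicit.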

Combining similar names and counting distinct from a single column

I am attempting to combine rows with similar names within a single column. The data I am using has some clients listed with the same name:
| Client_Column | Address |
| XYZ ISD | 1 Drive |
| XYZ ISD | 2 Drive |
| XYZ ISD | 3 Drive |
However, there are some records where the same client has a distinct, but similar name with an alternate address:
| Client_Column | Address |
| ABC Inc. | 101 Lane |
| ABC Inc. - P1 | 102 Lane |
| ABC Inc. - P2 | 103 Lane |
I would like to count the total number of distinct addresses for each client in the table. So in this example, XYZ ISD and ABC Inc. would each have 3 unique addresses. How can I adjust my script so that I can achieve this result?
I think I need to use SUM to aggregate the total, but I am not sure how. I am able to count the addresses for clients with the exact same name with accuracy, but the ones with only similar names display a single record for AddressCount.
Here is the script:
SELECT ClientName, COUNT(Address) AS AddressCount
FROM tableA
GROUP BY ClientName
ORDER BY AddressCount
Result:
| Client_Column | AddressCount |
| Blue LLC | 15 |
| PSC Parts | 14 |
| ABC Inc. | 1 |
| ABC Inc. - P1 | 1 |
| ABC Inc. - P2 | 1 |

SQL Filter based on results from SQL query

Input table - t1
make | model | engine | kms_covered | start | end
-------------------------------------------------------
suzuki | sx4 | petrol | 11 | City A | City D
suzuki | sx4 | diesel | 150 | City B | City C
suzuki | swift | petrol | 140 | City C | City B
suzuki | swift | diesel | 18 | City D | City A
toyota | prius | petrol | 16 | City E | City A
toyota | prius | hybrid | 250 | City B | City E
I need a subset of the records: those whose start or end is a city in which both diesel and hybrid cars appear, as either start or end.
In the above case, only City B qualifies, so I expect the output table below.
output table
make | model | engine | kms_covered | start | end
-------------------------------------------------------
suzuki | sx4 | diesel | 150 | City B | City C
suzuki | swift | petrol | 140 | City C | City B
toyota | prius | hybrid | 250 | City B | City E
Two-step process:
1. Get the list of cities where both diesel and hybrid cars appear in either start or end
2. Subset the table to only the records having cities from #1
I need help with a starting point, as below.
select * from t1
where start in () or end in ()
Hmmmm . . . If I understand the question, you can get the list of cities using a CTE and then use it to filter the table:
with c as (
select city
from (select start as city, engine
from t1
union all
select end, engine
from t1
) x
where engine in ('diesel', 'hybrid')
group by city
having count(distinct engine) = 2
)
select t1.*
from t1
where t1.start in (select city from c) or
t1.end in (select city from c);
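The two-step idea (first find the qualifying cities, then filter the table) can be checked against the sample data with a small SQLite sketch in Python. It filters on 'diesel' and 'hybrid' as the question asks and keeps a row when either endpoint qualifies. The derived-table alias `endpoints` and the quoting of "start" and "end" are choices made for SQLite in this example, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE t1 (make TEXT, model TEXT, engine TEXT, '
    'kms_covered INTEGER, "start" TEXT, "end" TEXT)'
)
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("suzuki", "sx4", "petrol", 11, "City A", "City D"),
        ("suzuki", "sx4", "diesel", 150, "City B", "City C"),
        ("suzuki", "swift", "petrol", 140, "City C", "City B"),
        ("suzuki", "swift", "diesel", 18, "City D", "City A"),
        ("toyota", "prius", "petrol", 16, "City E", "City A"),
        ("toyota", "prius", "hybrid", 250, "City B", "City E"),
    ],
)

query = """
WITH c AS (
    -- step 1: cities seen (as start or end) with both diesel and hybrid
    SELECT city
    FROM (SELECT "start" AS city, engine FROM t1
          UNION ALL
          SELECT "end" AS city, engine FROM t1) AS endpoints
    WHERE engine IN ('diesel', 'hybrid')
    GROUP BY city
    HAVING COUNT(DISTINCT engine) = 2
)
-- step 2: keep rows that touch one of those cities
SELECT t1.*
FROM t1
WHERE t1."start" IN (SELECT city FROM c)
   OR t1."end" IN (SELECT city FROM c)
ORDER BY t1.rowid
"""

subset = conn.execute(query).fetchall()
for row in subset:
    print(row)
```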

SQL one-to-many tables group by

Consider this example: I have the following tables: TableA with people and TableB containing the language skills of these people. Each person row can have zero, one, or more rows in TableB. Example below:
People
+-----+--------+
| pId | Name |
+-----+--------+
| 0 | Thomas |
| 1 | Henry |
| 2 | John |
+-----+--------+
Skills
+-----+-----+----------+---------------+
| lID | pId | Language | LanguageSkill |
+-----+-----+----------+---------------+
| 0 | 0 | Dutch | 0 |
| 1 | 0 | French | 4 |
| 2 | 0 | Italian | 2 |
| 3 | 2 | Italian | 2 |
+-----+-----+----------+---------------+
Thomas knows dutch, french and italian, Henry doesn't know any foreign language and John knows italian.
What I want to get is the best known language for each person from TableA:
+--------+----------+
| Name | Language |
+--------+----------+
| Thomas | French |
| Henry | NULL |
| John | Italian |
+--------+----------+
I have a feeling that this is quite an easy thing, but I don't have an idea how to achieve it in a simple way.
Thanks for your responses.
You first need the best language for each person: compute the maximum skill per person and join it back to the skill rows:
SELECT s.pid, s.language
from TableB s
JOIN (SELECT pid, max(languageskill) AS best
from TableB
group by pid) m
ON m.pid = s.pid AND m.best = s.languageskill
Then you join it onto the People table:
SELECT a.name, b.language
from TableA a
LEFT JOIN
(
SELECT s.pid, s.language
from TableB s
JOIN (SELECT pid, max(languageskill) AS best
from TableB
group by pid) m
ON m.pid = s.pid AND m.best = s.languageskill
) b
ON a.pid = b.pid
Note that if a person has a 'tied' best language, this method returns one row per tied language for that person.
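One portable way to pick the top-skill row per person is to join each skill row to that person's maximum skill. The sketch below runs this against the sample tables in SQLite via Python; the aliases `s`, `m`, and `best` are names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (pId INTEGER, Name TEXT)")
conn.execute(
    "CREATE TABLE TableB "
    "(lID INTEGER, pId INTEGER, Language TEXT, LanguageSkill INTEGER)"
)
conn.executemany(
    "INSERT INTO TableA VALUES (?, ?)",
    [(0, "Thomas"), (1, "Henry"), (2, "John")],
)
conn.executemany(
    "INSERT INTO TableB VALUES (?, ?, ?, ?)",
    [
        (0, 0, "Dutch", 0),
        (1, 0, "French", 4),
        (2, 0, "Italian", 2),
        (3, 2, "Italian", 2),
    ],
)

query = """
SELECT a.Name, b.Language
FROM TableA a
LEFT JOIN (
    -- skill rows whose level equals the person's maximum level
    SELECT s.pId, s.Language
    FROM TableB s
    JOIN (SELECT pId, MAX(LanguageSkill) AS best
          FROM TableB
          GROUP BY pId) m
      ON m.pId = s.pId AND m.best = s.LanguageSkill
) b ON a.pId = b.pId
ORDER BY a.pId
"""

best_languages = conn.execute(query).fetchall()
for row in best_languages:
    print(row)
```

The LEFT JOIN is what keeps Henry in the result with a NULL language, since he has no rows in TableB.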

Split column into two columns based on type code in third column

My SQL is very rusty. I'm trying to transform this table:
+----+-----+--------------+-------+
| ID | SIN | CONTACT | TYPE |
+----+-----+--------------+-------+
| 1 | 737 | b#bacon.com | email |
| 2 | 760 | 250-555-0100 | phone |
| 3 | 737 | 250-555-0101 | phone |
| 4 | 800 | 250-555-0102 | phone |
| 5 | 850 | l#lemon.com | email |
+----+-----+--------------+-------+
Into this table:
+----+-----+--------------+-------------+
| ID | SIN | PHONE | EMAIL |
+----+-----+--------------+-------------+
| 1 | 737 | 250-555-0101 | b#bacon.com |
| 2 | 760 | 250-555-0100 | |
| 4 | 800 | 250-555-0102 | |
| 5 | 850 | | l#lemon.com |
+----+-----+--------------+-------------+
I wrote this query:
SELECT *
FROM (SELECT *
FROM people
WHERE TYPE = 'phone') phoneNumbers
FULL JOIN (SELECT *
FROM people
WHERE TYPE = 'email') emailAddresses
ON phoneNumbers.SIN = emailAddresses.SIN;
Which produces:
+----+-----+--------------+-------+------+-------+-------------+--------+
| ID | SIN | CONTACT | TYPE | ID_1 | SIN_1 | CONTACT_1 | TYPE_1 |
+----+-----+--------------+-------+------+-------+-------------+--------+
| 2 | 760 | 250-555-0100 | phone | | | | |
| 3 | 737 | 250-555-0101 | phone | 1 | 737 | b#bacon.com | email |
| 4 | 800 | 250-555-0102 | phone | | | | |
| | | | | 5 | 850 | l#lemon.com | email |
+----+-----+--------------+-------+------+-------+-------------+--------+
I know that I can select the columns I want, but the SIN column is incomplete. I seem to recall that I should join in the table a third time to get a complete SIN column, but I cannot remember how.
How can I produce my target table (ID, SIN, PHONE, EMAIL)?
Edit and clarification: I am grateful for the answers I have received so far, but as a SQL greenhorn I am unfamiliar with the techniques you are using (case statements, conditional aggregation, and pivoting). Can this not be done using JOIN and SELECT? Please excuse my ignorance in this matter. (It's not that I am not interested in superior techniques, but I do not want to move too fast too soon.)
One way to approach this is conditional aggregation:
select min(ID) as ID, SIN,
max(case when type = 'phone' then contact end) as phone,
max(case when type = 'email' then contact end) as email
from people t
group by sin;
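For readers new to conditional aggregation, a runnable sketch may help: each CASE expression yields the contact only for rows of one type (NULL otherwise), and MAX then picks the single non-NULL value per SIN. The example uses SQLite through Python's sqlite3 module with the sample rows from the question; the `ORDER BY sin` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER, sin INTEGER, contact TEXT, type TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        (1, 737, "b#bacon.com", "email"),
        (2, 760, "250-555-0100", "phone"),
        (3, 737, "250-555-0101", "phone"),
        (4, 800, "250-555-0102", "phone"),
        (5, 850, "l#lemon.com", "email"),
    ],
)

query = """
SELECT MIN(id) AS id,
       sin,
       -- CASE without ELSE yields NULL, which MAX ignores
       MAX(CASE WHEN type = 'phone' THEN contact END) AS phone,
       MAX(CASE WHEN type = 'email' THEN contact END) AS email
FROM people
GROUP BY sin
ORDER BY sin
"""

contacts = conn.execute(query).fetchall()
for row in contacts:
    print(row)
```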
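A pivot (oracle.com) should also work here. The pivot source must leave out ID, because PIVOT groups by every column that is not aggregated or pivoted (so this version returns no ID column), and the IN values must match the lowercase type codes in the data:
SELECT SIN, PHONE, EMAIL
FROM (SELECT SIN, CONTACT, TYPE FROM PEOPLE)
PIVOT (
MAX(CONTACT)
FOR TYPE IN ('email' AS EMAIL, 'phone' AS PHONE)
)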
I realize this is less elegant than all the solutions posted, but here it is anyhow, a solution using only JOIN and SELECT:
SELECT sins.SIN, phone, email
FROM ((SELECT SIN email_sin, contact email
FROM people
WHERE TYPE = 'email') email
FULL JOIN (SELECT SIN phone_sin, contact phone
FROM people
WHERE TYPE = 'phone') phone
ON email.email_sin = phone.phone_sin)
RIGHT JOIN (SELECT DISTINCT SIN FROM people) sins
ON sins.SIN = phone_sin OR sins.SIN = email_sin;
This lacks the ID column.
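A variation on the JOIN-only idea, swapped in here as an alternative and not taken from the answers above: start from the distinct SIN list and LEFT JOIN the table twice, once per contact type. That avoids FULL JOIN and the OR in the join condition. The sketch runs in SQLite via Python; the aliases `s`, `p`, and `e` are chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER, sin INTEGER, contact TEXT, type TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        (1, 737, "b#bacon.com", "email"),
        (2, 760, "250-555-0100", "phone"),
        (3, 737, "250-555-0101", "phone"),
        (4, 800, "250-555-0102", "phone"),
        (5, 850, "l#lemon.com", "email"),
    ],
)

query = """
SELECT s.sin, p.contact AS phone, e.contact AS email
FROM (SELECT DISTINCT sin FROM people) s
LEFT JOIN people p ON p.sin = s.sin AND p.type = 'phone'  -- phone row, if any
LEFT JOIN people e ON e.sin = s.sin AND e.type = 'email'  -- email row, if any
ORDER BY s.sin
"""

joined = conn.execute(query).fetchall()
for row in joined:
    print(row)
```

Like the FULL JOIN version, this returns no ID column, and a SIN with several phones or several emails would produce one row per combination.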