I want to clean a dataset because there are repeated keys that should not be there. Although the key is repeated, other fields do change. On repetition, I want to keep those entries whose country field is not null. Let's see it with a simplified example:
| email | country |
| 1#x.com | null |
| 1#x.com | PT |
| 2#x.com | SP |
| 2#x.com | PT |
| 3#x.com | null |
| 3#x.com | null |
| 4#x.com | UK |
| 5#x.com | null |
Email acts as key, and country is the field which I want to filter by. On email repetition:
Retrieve the entry whose country is not null (case 1)
If there are several entries whose country is not null, retrieve one of them, the first occurrence for simplicity (case 2)
If all the entries' country is null, again, retrieve only one of them (case 3)
If the entry key is not repeated, just retrieve it no matter what its country is (case 4 and 5)
The expected output should be:
| email | country |
| 1#x.com | PT |
| 2#x.com | SP |
| 3#x.com | null |
| 4#x.com | UK |
| 5#x.com | null |
I have thought of doing a UNION or some type of JOIN to achieve this. One possibility could be querying:
SELECT
...
FROM (
SELECT *
FROM `myproject.mydataset.mytable`
WHERE country IS NOT NULL
) AS a
...
and then match it with the full table to add those values which are missing, but I am not able to imagine the way since my experience with SQL is limited.
Also, I have read about the COALESCE function and I think it could be helpful for the task.
Consider below approach
select *
from `myproject.mydataset.mytable`
where true
qualify row_number() over(partition by email order by country nulls last) = 1
Related
I have a DB structure that looks something like this:
Table "public.person"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | not null |
Table "public.person_name"
Column | Type | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
person | integer | | not null |
name | character varying | | |
Foreign-key constraints:
"person_name_person_fkey" FOREIGN KEY (person) REFERENCES person(id)
Table "public.event"
Column | Type | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
id | integer | | not null |
name | character varying | | |
Table "public.attendee"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
event | integer | | |
person | integer | | |
Foreign-key constraints:
"attendee_event_fkey" FOREIGN KEY (event) REFERENCES public.event(id)
"attendee_person_fkey" FOREIGN KEY (person) REFERENCES person(id)
With some sample data:
person:
id
----
0
1
2
3
person_name:
person | name
--------+-----------
0 | Alex
0 | Alexander
1 | Barbara
1 | Barb
2 | Cecilia
3 | Dave
3 | David
event:
id | name
----+------------
0 | Wedding
1 | Party
2 | Funeral
attendee:
event | person
-------+--------
0 | 0
0 | 1
0 | 2
1 | 1
1 | 2
2 | 2
2 | 3
I'd like to make a select statement that returns all events, with a row for every combination of nicknames that all attendees have, like this:
event_id | event_name | attendee_list
----------+------------+---------------
0 | Wedding | Alex, Barbara, Cecilia
0 | Wedding | Alexander, Barbara, Cecilia
0 | Wedding | Alex, Barb, Cecilia
0 | Wedding | Alexander, Barb, Cecilia
1 | Party | Barbara, Cecilia
1 | Party | Barb, Cecilia
2 | Funeral | Cecilia, Dave
2 | Funeral | Cecilia, David
My initial intuition was to join all of the tables together, group by event, and then use string_agg, but that puts all of everybody's nicknames in the list (of course, since it's aggregating over the whole join). My second attempt was to select the attendee names from a subquery, but subqueries can't return multiple rows. I also tried aggregating using arrays instead, as described here, but you can't aggregate arrays of differing dimensionality. Finally, I tried using some recursive magic as described here, but found it difficult to adapt to my problem, and ultimately couldn't get it to work.
Here's a recursive query that does it. I made a array of the person IDs, and in each stage of the recursion I joined the next ID with the person_name table.
WITH RECURSIVE recur AS (
SELECT
event as event_id,
event.name as event_name,
array_agg(person) as person_id_list,
ARRAY[]::text[] as person_name_list,
1 as index
FROM attendee, event
WHERE attendee.event = event.id
GROUP BY event, event.name
UNION ALL
SELECT
event_id,
event_name,
person_id_list,
person_name_list || person_name.name,
index + 1
FROM recur
JOIN person_name on (person_name.person = recur.person_id_list[recur.index])
WHERE cardinality(recur.person_id_list) >= recur.index
)
SELECT event_id, event_name, array_to_string(person_name_list, ', ') as attendee_list
FROM recur
WHERE cardinality(recur.person_id_list) < recur.index
ORDER BY event_id;
I think I got it figured out with the "recursive magic" linked before. The issue was that my real data was a little more complicated and each attendee had a "position" in the list, which didn't always work with the r.id < t.id constraint. Here's a query that works with the sample data in the question:
with recursive recur as (
select
array[person_name.person] as persons,
array[name] as names,
attendee.event
from person_name
join attendee
on person_name.person=attendee.person
union all
select
persons || t.person,
names || t.name,
attendee.event
from person_name t
join recur r
on t.person != all(r.persons)
join attendee
on t.person=attendee.person
and attendee.event=r.event
)
select event, names
from recur
where cardinality(names)=(
select count(*)
from attendee
where attendee.event=recur.event
);
This does return an additional row for every possible order of attendees as well, but I'm fine with that (and like I said, my real data has a "position" field that would constrain that). If you needed only one ordering, the ordering has to be specified in the data somewhere, so for example adding back that r.id < t.id bit would work.
I want to get data in a single row from two tables which have one to many relation.
Primary table
Secondary table
I know that for each record of primary table secondary table can have maximum 10 rows. Here is structure of the table
Primary Table
-------------------------------------------------
| ImportRecordId | Summary |
--------------------------------------------------
| 1 | Imported Successfully |
| 2 | Failed |
| 3 | Imported Successfully |
-------------------------------------------------
Secondary table
------------------------------------------------------
| ImportRecordId | CodeName | CodeValue |
-------------------------------------------------------
| 1 | ABC | 123456A |
| 1 | DEF | 8766339 |
| 1 | GHI | 887790H |
------------------------------------------------------
I want to write a query with inner join to get data from both table in a way that from secondary table each row should be treated as column instead showing as multiple row.
I can hard code 20 columns names(as maximum 10 records can exist in secondary table and i want to display values of two columns in a single row) so if there are less than 10 records in the secondary table all other columns will be show as null.
Here is expected Output. You can see that for first record in primary table there was only three rows that's why two required columns from these three rows are converted into columns and for all others columns values are null.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| ImportRecordId | Summary | CodeName1 | CodeValue1 | CodeName2 | CodeValue2 | CodeName3 | CodeValue3 | CodeName4 | CodeValue4| CodeName5 | CodeValue5| CodeName6 | CodeValue6| CodeName7 | CodeValue7 | CodeName8 | CodeValue8 | CodeName9 | CodeValue9 | CodeName10 | CodeValue10|
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 1 | Imported Successfully | ABC | 123456A | DEF | 8766339 | GHI | 887790H | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Here is my simple SQL query which return all data from both tables but instead multiple rows from secondary table i want to get them in a single row like above result set.
Select p.ImportRecordId,p.Summary,s.*
from [dbo].[primary_table] p
inner join [dbo].[secondary_table] s on p.ImportRecordId = s.ImportRecordId
The following uses Row_Number(), a JOIN and a CROSS APPLY to create the source of the PIVOT
You'll have to add the CodeName/Value 4...10
Example
Select *
From (
Select A.[ImportRecordId]
,B.Summary
,C.*
From (
Select *
,RN = Row_Number() over (Partition by [ImportRecordId] Order by [CodeName])
From Secondary A
) A
Join Primary B on A.[ImportRecordId]=B.[ImportRecordId]
Cross Apply (values (concat('CodeName' ,RN),CodeName)
,(concat('CodeValue',RN),CodeValue)
) C(Item,Value)
) src
Pivot (max(value) for Item in (CodeName1,CodeValue1,CodeName2,CodeValue2,CodeName3,CodeValue3) ) pvt
Returns
ImportRecordId Summary CodeName1 CodeValue1 CodeName2 CodeValue2 CodeName3 CodeValue3
1 Imported Successfully ABC 123456A DEF 8766339 GHI 887790H
I have tables below as follows:
tbl_tasks
+---------+-------------+
| Task_ID | Assigned_ID |
+---------+-------------+
| 1 | 8 |
| 2 | 12 |
| 3 | 31 |
+---------+-------------+
tbl_resources
+---------+-----------+
| Task_ID | Source_ID |
+---------+-----------+
| 1 | 4 |
| 1 | 10 |
| 2 | 42 |
| 4 | 8 |
+---------+-----------+
A task is assigned to at least one person (denoted by the "assigned_ID") and then any number of people can be assigned as a source (denoted by "source_ID"). The ID numbers are all linked to names in another table. Though the ID numbers are named differently, they all return to the same table.
Would there be any way for me to combine the two tables based on ID such that I could search based on someone's ID number? For example- if I decide to search on or do a WHERE User_ID = 8, in order to see what Tasks that 8 is involved in, I would get back Task 1 and Task 4.
Right now, by joining all the tables together, I can easily filter on "Assigned" but not "Source" due to all the multiple entries in the table.
Use union all:
select distinct task_id
from ((select task_id, assigned_id as id
from tbl_tasks
) union all
(select task_id, source_id
from tbl_resources
)
) ti
where id = ?;
Note that this uses select distinct in case someone is assigned to the same task in both tables. If not, remove the distinct.
I have a database like this
| Contact | Incident | OpenTime | Country | Product |
| C1 | | 1/1/2014 | MX | Office |
| C2 | I1 | 2/2/2014 | BR | SAP |
| C3 | | 3/2/2014 | US | SAP |
| C4 | I2 | 3/3/2014 | US | SAP |
| C5 | I3 | 3/4/2014 | US | Office |
| C6 | | 3/5/2014 | TW | SAP |
I want to run a query with criteria on country and and open time, and I want to receive back something like this:
| Product | Contacts with | Incidents |
| | no Incidents | |
| Office | 1 | 1 |
| SAP | 2 | 2 |
I can easily get one part to work with a query like
SELECT Service, count(
FROM database
WHERE criterias AND Incident is Null //(or Not Null) depending on the row
GROUP BY Product
What I am struggling to do is counting Incident is Null, and Incident is not Null on the same table as a result of the same query as in the example above.
I have tried the following
SELECT Service AS Service,
(SELECT count Contacts FROM Database Where Incident Is Null) as Contact,
(SELECT count Contacts FROM Database Where Incident Is not Null) as Incident
FROM database
WHERE criterias AND Incident is Null //(or Not Null) depending on the row
GROUP BY Product
The issue I have with the above sentence is that whatever criteria I use on the "main" select are ignored by the nested Selects.
I have tried using UNION ALL as well, but did not managed to make it work.
Ultimately I resolved it with this approach: I counted the total contacts per product, counted the numbers of incidents and added a calculated field with the result
SELECT Service, COUNT (Contact) AS Total, COUNT (Incident) as Incidents,
(Total - Incident) as Only Contact
From Database
Where <criterias>
GROUP BY Service
Although I make it work, I am still sure that there is a more elegant approach for it.
How can I retrieve the different counting on the same column with different count criteria in one query?
Just use conditional aggregation:
SELECT Product,
SUM(IIF(incident is not null, 1, 1)) as incidents,
SUM(IIF(incident is null, 1, 1)) as noincidents
FROM database
WHERE criterias
GROUP BY Product;
Possibly a very MS Access solution would suit:
TRANSFORM Count(tmp.Contact) AS CountOfContact
SELECT tmp.Product
FROM tmp
GROUP BY tmp.Product
PIVOT IIf(Trim([Incident] & "")="","No Incident","Incident");
This IIf(Trim([Incident] & "")="" covers all possibilities of Null string, Null and space filled.
tmp is the name of the table.
I have a table in a postgresql-9.1.x database which is defined as follows:
# \d cms
Table "public.cms"
Column | Type | Modifiers
-------------+-----------------------------+--------------------------------------------------
id | integer | not null default nextval('cms_id_seq'::regclass)
last_update | timestamp without time zone | not null default now()
system | text | not null
owner | text | not null
action | text | not null
notes | text
Here's a sample of the data in the table:
id | last_update | system | owner | action |
notes
----+----------------------------+----------------------+-----------+------------------------------------- +-
----------------
584 | 2012-05-04 14:20:53.487282 | linux32-test5 | rfell | replaced MoBo/CPU |
34 | 2011-03-21 17:37:44.301984 | linux-gputest13 | apeyrovan | System deployed with production GPU |
636 | 2012-05-23 12:51:39.313209 | mac64-cvs11 | kbhatt | replaced HD |
211 | 2011-09-12 16:58:16.166901 | linux64-test12 | rfell | HD swap |
drive too small
What I'd like to do is craft a SQL query that returns only the unique/distinct values from the system and owner columns (and filling in NULLs if the number of values in one column's results is less than the other column's results), while ignoring the association between them. So something like this:
system | owner
-----------------+------------------
linux32-test5 | apeyrovan
linux-gputest13 | kbhatt
linux64-test12 | rfell
mac64-cvs11 |
The only way that I can figure out to get this data is with two separate SQL queries:
SELECT system FROM cms GROUP BY system;
SELECT owner FROM cms GROUP BY owner;
Far be it from me to inquire why you would want to do such a thing. The following query does this by doing a join, on a calculated column using the row_number() function:
select ts.system, town.owner
from (select system, row_number() over (order by system) as seqnum
from (select distinct system
from t
) ts
) ts full outer join
(select owner, row_number() over (order by owner) as seqnum
from (select distinct owner
from t
) town
) town
on ts.seqnum = town.seqnum
The full outer join makes sure that the longer of the two lists is returned in full.