obtaining unique/distinct values from multiple unassociated columns - sql

I have a table in a postgresql-9.1.x database which is defined as follows:
# \d cms
Table "public.cms"
Column | Type | Modifiers
-------------+-----------------------------+--------------------------------------------------
id | integer | not null default nextval('cms_id_seq'::regclass)
last_update | timestamp without time zone | not null default now()
system | text | not null
owner | text | not null
action | text | not null
notes | text
Here's a sample of the data in the table:
id | last_update | system | owner | action | notes
----+----------------------------+-----------------+-----------+-------------------------------------+-----------------
584 | 2012-05-04 14:20:53.487282 | linux32-test5 | rfell | replaced MoBo/CPU |
34 | 2011-03-21 17:37:44.301984 | linux-gputest13 | apeyrovan | System deployed with production GPU |
636 | 2012-05-23 12:51:39.313209 | mac64-cvs11 | kbhatt | replaced HD |
211 | 2011-09-12 16:58:16.166901 | linux64-test12 | rfell | HD swap | drive too small
What I'd like to do is craft a SQL query that returns only the unique/distinct values from the system and owner columns (and filling in NULLs if the number of values in one column's results is less than the other column's results), while ignoring the association between them. So something like this:
system | owner
-----------------+------------------
linux32-test5 | apeyrovan
linux-gputest13 | kbhatt
linux64-test12 | rfell
mac64-cvs11 |
The only way that I can figure out to get this data is with two separate SQL queries:
SELECT system FROM cms GROUP BY system;
SELECT owner FROM cms GROUP BY owner;

Far be it from me to inquire why you would want to do such a thing. The following query does this by doing a join, on a calculated column using the row_number() function:
select ts.system, town.owner
from (select system, row_number() over (order by system) as seqnum
from (select distinct system
from t
) ts
) ts full outer join
(select owner, row_number() over (order by owner) as seqnum
from (select distinct owner
from t
) town
) town
on ts.seqnum = town.seqnum
The full outer join makes sure that the longer of the two lists is returned in full.
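To see what the row_number() pairing amounts to, here is a minimal Python sketch of the same idea outside the database. The function name pair_distinct and the in-memory rows are illustrative assumptions, not part of the answer above; itertools.zip_longest plays the role of the full outer join on seqnum. Python sorts strings by code point, so the order may differ from what your database collation produces.

```python
from itertools import zip_longest

def pair_distinct(rows):
    """Pair the sorted distinct systems with the sorted distinct owners by position."""
    systems = sorted({system for system, _ in rows})
    owners = sorted({owner for _, owner in rows})
    # zip_longest pads the shorter list with None, like the NULLs from a full outer join.
    return list(zip_longest(systems, owners))

# (system, owner) pairs taken from the sample data in the question
rows = [
    ("linux32-test5", "rfell"),
    ("linux-gputest13", "apeyrovan"),
    ("mac64-cvs11", "kbhatt"),
    ("linux64-test12", "rfell"),
]
pairs = pair_distinct(rows)
for system, owner in pairs:
    print(system, "|", owner)
```

The last pair has None for the owner, because there are four distinct systems but only three distinct owners.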

Related

SQL - Get unique values by key selected by condition

I want to clean a dataset because there are repeated keys that should not be there. Although the key is repeated, other fields do change. On repetition, I want to keep those entries whose country field is not null. Let's see it with a simplified example:
| email | country |
| 1#x.com | null |
| 1#x.com | PT |
| 2#x.com | SP |
| 2#x.com | PT |
| 3#x.com | null |
| 3#x.com | null |
| 4#x.com | UK |
| 5#x.com | null |
Email acts as key, and country is the field which I want to filter by. On email repetition:
Retrieve the entry whose country is not null (case 1)
If there are several entries whose country is not null, retrieve one of them, the first occurrence for simplicity (case 2)
If all the entries' country is null, again, retrieve only one of them (case 3)
If the entry key is not repeated, just retrieve it no matter what its country is (case 4 and 5)
The expected output should be:
| email | country |
| 1#x.com | PT |
| 2#x.com | SP |
| 3#x.com | null |
| 4#x.com | UK |
| 5#x.com | null |
I have thought of doing a UNION or some type of JOIN to achieve this. One possibility could be querying:
SELECT
...
FROM (
SELECT *
FROM `myproject.mydataset.mytable`
WHERE country IS NOT NULL
) AS a
...
and then match it with the full table to add those values which are missing, but I am not able to imagine the way since my experience with SQL is limited.
Also, I have read about the COALESCE function and I think it could be helpful for the task.
Consider the approach below:
select *
from `myproject.mydataset.mytable`
where true
qualify row_number() over(partition by email order by country nulls last) = 1
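For readers who want to check the rule against the sample data, here is a small Python sketch of the same "one row per email, prefer a non-null country" logic. The function name keep_one_per_email is a hypothetical helper, and None stands in for null. Note one difference: the query above orders by country, so for 2#x.com it would pick PT, whereas this sketch keeps the first non-null occurrence in input order (SP), as the question's case 2 describes.

```python
def keep_one_per_email(rows):
    """Keep one row per email, preferring a row whose country is not None."""
    best = {}
    for email, country in rows:
        # Take the first row seen for an email, or upgrade from None to a real country.
        if email not in best or (best[email] is None and country is not None):
            best[email] = country
    return list(best.items())

# Sample data from the question
rows = [
    ("1#x.com", None), ("1#x.com", "PT"),
    ("2#x.com", "SP"), ("2#x.com", "PT"),
    ("3#x.com", None), ("3#x.com", None),
    ("4#x.com", "UK"),
    ("5#x.com", None),
]
deduped = keep_one_per_email(rows)
```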

SQL - Given sequence of data, how do I query the origin?

Let's assume we have the following data.
| UUID | SEENTIME | LAST_SEENTIME |
------------------------------------------------------
| UUID1 | 2020-11-10T05:00:00 | |
| UUID2 | 2020-11-10T05:01:00 | 2020-11-10T05:00:00 |
| UUID3 | 2020-11-10T05:03:00 | 2020-11-10T05:01:00 |
| UUID4 | 2020-11-10T05:04:00 | 2020-11-10T05:03:00 |
| UUID5 | 2020-11-10T05:07:00 | 2020-11-10T05:04:00 |
| UUID6 | 2020-11-10T05:08:00 | 2020-11-10T05:07:00 |
Each row is connected to the previous one via LAST_SEENTIME.
In such a case, is there a way to use SQL to identify these connected events as one? I want to find the start and end so that I can calculate the duration of the event.
You can use a recursive CTE. The exact syntax varies by database, but something like this:
with recursive cte as (
select uuid as orig_uuid, uuid, seentime
from t
where last_seentime is null
union all
select cte.orig_uuid, t.uuid, t.seentime
from cte join
t
on cte.seentime = t.last_seentime
)
select orig_uuid,
max(seentime) - min(seentime) -- or whatever your database uses
from cte
group by orig_uuid;
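As a way to try the recursive CTE without a database server, the sketch below runs it against an in-memory SQLite database through Python's sqlite3 module. SQLite is an assumption here (the question does not name a database), the timestamps are stored as ISO-8601 text, and the duration is computed in Python rather than in SQL because date arithmetic differs between databases.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("create table t (uuid text, seentime text, last_seentime text)")
conn.executemany(
    "insert into t values (?, ?, ?)",
    [
        ("UUID1", "2020-11-10T05:00:00", None),
        ("UUID2", "2020-11-10T05:01:00", "2020-11-10T05:00:00"),
        ("UUID3", "2020-11-10T05:03:00", "2020-11-10T05:01:00"),
        ("UUID4", "2020-11-10T05:04:00", "2020-11-10T05:03:00"),
        ("UUID5", "2020-11-10T05:07:00", "2020-11-10T05:04:00"),
        ("UUID6", "2020-11-10T05:08:00", "2020-11-10T05:07:00"),
    ],
)

query = """
with recursive cte as (
    -- anchor: rows with no predecessor start a chain
    select uuid as orig_uuid, uuid, seentime
    from t
    where last_seentime is null
    union all
    -- step: follow the link from a row's seentime to the next row's last_seentime
    select cte.orig_uuid, t.uuid, t.seentime
    from cte join t on cte.seentime = t.last_seentime
)
select orig_uuid, min(seentime), max(seentime)
from cte
group by orig_uuid
"""
events = conn.execute(query).fetchall()

# ISO-8601 strings sort chronologically, so min/max give the start and end.
orig_uuid, start, end = events[0]
duration = datetime.fromisoformat(end) - datetime.fromisoformat(start)
print(orig_uuid, duration)
```

With the sample rows, all six events collapse into one chain that starts at UUID1.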

Postgresql - How to remove last one from array_agg() in one select query?

I have a special requirement for the table below:
Table "public.skill_name"
Column | Type | Collation | Nullable | Default
----------+---------+-----------+----------+---------
position | integer | | not null |
value | text | | not null |
id | text | | not null |
skill | text | | |
Indexes:
"skill_name.id" UNIQUE, btree (id)
Foreign-key constraints:
"skill_name_skill_fkey" FOREIGN KEY (skill) REFERENCES skill(id) ON DELETE SET NULL
and some sample data like below:
position | value | id | skill
----------+---------------------------------------------------------------------------------------+-----------------------------+-----------------------------
1000 | Python | ck5bxmk67101790acuf05cikujw | ck5bxmk62101789acuf7pj1qmj6
2000 | Python Language | ck5bxmk69101791acufih7mc6u6 | ck5bxmk62101789acuf7pj1qmj6
3000 | Stdlib | ck5bxmk6c101792acuflzcu2avg | ck5bxmk62101789acuf7pj1qmj6
4000 | functools | ck5bxmk6e101793acuf42ih0evn | ck5bxmk62101789acuf7pj1qmj6
5000 | lru_cache | ck5bxmk6g101794acuf690rjgzp | ck5bxmk62101789acuf7pj1qmj6
1000 | Python | ck5bxysvp102005acuf6unt4cb7 | ck2wk5gba044342xbyaulv17i
2000 | Python Language | ck5bxysvs102012acuf5862l0gx | ck2wk5gba044342xbyaulv17i
3000 | Python Syntax | ck5bxysvu102021acufjcmxi1ij | ck2wk5gba044342xbyaulv17i
4000 | Classes | ck5bxysvx102030acufbaz3kml3 | ck2wk5gba044342xbyaulv17i
5000 | metaclasses | ck5bxysvz102037acufa5lmbuhj | ck2wk5gba044342xbyaulv17i
The requirement is to generate a result like the one below (NOTE: the last value in each skill group is excluded from the path column):
skill | path
-----------------------------+---------------------------------------------------------------------------------------
ck5bxmk62101789acuf7pj1qmj6 | Python,Python Language,Stdlib,functools
ck2wk5gba044342xbyaulv17i | Python,Python Language,Python Syntax,Classes
I have the SQL below, but it does not work; it complains: more than one row returned by a subquery used as an expression
SELECT
skill,
ARRAY_REMOVE(
ARRAY_AGG(value),
(
SELECT
skill_name.value
FROM (
SELECT
*,
skill AS skill_id,
LAST_VALUE(position) OVER (
PARTITION BY skill
ORDER BY position
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_pos
FROM skill_name
) skill_name
WHERE last_pos=position
GROUP BY skill_name.value
)
) as path
FROM skill_name
GROUP BY skill;
I do not know how to fix that; could anyone help?
You could use row_number() to locate and eliminate the last value from the resultset before aggregating:
select skill, array_agg(value order by position) path
from (
select t.*, row_number() over(partition by skill order by position desc) rn
from mytable t
) t
where rn > 1
group by skill
Postgres has pretty sophisticated array functions. You don't need a subquery to do this:
select skill,
(array_agg(value order by position))[1:count(*) - 1] as path
from t
group by skill
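The slice [1:count(*) - 1] simply drops the final element of each ordered array. A rough Python equivalent, for checking the expected output by hand, might look like this. The function name paths_without_last and the shortened skill ids skill_a and skill_b are placeholders, not values from the question.

```python
from collections import defaultdict

def paths_without_last(rows):
    """Group values by skill in position order, then drop the last value of each group."""
    grouped = defaultdict(list)
    # Tuples sort by their first field, so this orders rows by position.
    for position, value, skill in sorted(rows):
        grouped[skill].append(value)
    # values[:-1] mirrors the SQL array slice [1:count(*) - 1].
    return {skill: values[:-1] for skill, values in grouped.items()}

rows = [
    (1000, "Python", "skill_a"), (2000, "Python Language", "skill_a"),
    (3000, "Stdlib", "skill_a"), (4000, "functools", "skill_a"),
    (5000, "lru_cache", "skill_a"),
    (1000, "Python", "skill_b"), (2000, "Python Language", "skill_b"),
    (3000, "Python Syntax", "skill_b"), (4000, "Classes", "skill_b"),
    (5000, "metaclasses", "skill_b"),
]
paths = paths_without_last(rows)
```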

How to Count the same field with different criteria on the same Query

I have a database like this
| Contact | Incident | OpenTime | Country | Product |
| C1 | | 1/1/2014 | MX | Office |
| C2 | I1 | 2/2/2014 | BR | SAP |
| C3 | | 3/2/2014 | US | SAP |
| C4 | I2 | 3/3/2014 | US | SAP |
| C5 | I3 | 3/4/2014 | US | Office |
| C6 | | 3/5/2014 | TW | SAP |
I want to run a query with criteria on country and open time, and I want to receive back something like this:
| Product | Contacts with | Incidents |
| | no Incidents | |
| Office | 1 | 1 |
| SAP | 2 | 2 |
I can easily get one part to work with a query like
SELECT Service, count(Contact)
FROM database
WHERE criterias AND Incident is Null //(or Not Null) depending on the row
GROUP BY Product
What I am struggling to do is counting Incident is Null, and Incident is not Null on the same table as a result of the same query as in the example above.
I have tried the following
SELECT Service AS Service,
(SELECT count Contacts FROM Database Where Incident Is Null) as Contact,
(SELECT count Contacts FROM Database Where Incident Is not Null) as Incident
FROM database
WHERE criterias AND Incident is Null //(or Not Null) depending on the row
GROUP BY Product
The issue I have with the above statement is that whatever criteria I use on the "main" select are ignored by the nested selects.
I have tried using UNION ALL as well, but did not manage to make it work.
Ultimately I resolved it with this approach: I counted the total contacts per product, counted the numbers of incidents and added a calculated field with the result
SELECT Service, COUNT (Contact) AS Total, COUNT (Incident) as Incidents,
(Total - Incident) as Only Contact
From Database
Where <criterias>
GROUP BY Service
Although I made it work, I am still sure that there is a more elegant approach.
How can I retrieve different counts of the same column, each with its own criteria, in one query?
Just use conditional aggregation:
SELECT Product,
SUM(IIF(incident is not null, 1, 0)) as incidents,
SUM(IIF(incident is null, 1, 0)) as noincidents
FROM database
WHERE criterias
GROUP BY Product;
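To check conditional aggregation against the sample data, here is a hedged sketch that runs it in an in-memory SQLite database via Python's sqlite3 module. SQLite and the table name contacts are assumptions for the demo, the WHERE criteria are left out, and CASE WHEN is used as the portable equivalent of IIF.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table contacts (contact text, incident text, opentime text, country text, product text)"
)
conn.executemany(
    "insert into contacts values (?, ?, ?, ?, ?)",
    [
        ("C1", None, "1/1/2014", "MX", "Office"),
        ("C2", "I1", "2/2/2014", "BR", "SAP"),
        ("C3", None, "3/2/2014", "US", "SAP"),
        ("C4", "I2", "3/3/2014", "US", "SAP"),
        ("C5", "I3", "3/4/2014", "US", "Office"),
        ("C6", None, "3/5/2014", "TW", "SAP"),
    ],
)

query = """
select product,
       -- each SUM counts only the rows that satisfy its own condition
       sum(case when incident is null then 1 else 0 end) as no_incidents,
       sum(case when incident is not null then 1 else 0 end) as incidents
from contacts
group by product
order by product
"""
counts = conn.execute(query).fetchall()
for product, no_incidents, incidents in counts:
    print(product, no_incidents, incidents)
```

Because both sums are computed over the same filtered rows, any criteria added in a WHERE clause apply to both counts at once.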
Possibly a very MS Access solution would suit:
TRANSFORM Count(tmp.Contact) AS CountOfContact
SELECT tmp.Product
FROM tmp
GROUP BY tmp.Product
PIVOT IIf(Trim([Incident] & "")="","No Incident","Incident");
The expression IIf(Trim([Incident] & "")="", ...) covers all the cases: Null, an empty string, and a space-filled string.
tmp is the name of the table.

how to make postgresql result unique

This is somewhat hard to describe, but I have a postgresql 9.1 table (planet_osm_roads).
My query is
SELECT
osm_id, name, highway, way, md5(astext(way)) AS md5
FROM planet_osm_roads
WHERE highway IS NOT NULL
AND md5(astext(way)) IN (
SELECT DISTINCT md5(astext(way))
FROM planet_osm_roads
WHERE highway IS NOT NULL
GROUP BY md5
HAVING count(osm_id) > 1
)
ORDER BY osm_id
The result is
osm_id | name | highway | ...way ... | md5
----------+------+---------------+-------...----...--+----------------------------------
-1641383 | | motorway | 010200...CA96...0 | 04b4336b997e7ea9d99208bd487bbe7d
-1641383 | | motorway | 010200...EC3E...0 | ae945148417ada285130c59277c48a25
-1641383 | | motorway | 010200...7BF6...0 | 5c5a1b8ae40c1b7f24e293a012ad2add
23133731 | | motorway_link | 010200...EC3E...0 | ae945148417ada285130c59277c48a25
31309105 | | motorway | 010200...7BF6...0 | 5c5a1b8ae40c1b7f24e293a012ad2add
49339926 | | motorway | 010200...CA96...0 | 04b4336b997e7ea9d99208bd487bbe7d
(6 rows)
I want a result with 3 rows (one per md5 hash), each carrying the values of any one of the corresponding rows.
So a valid row for "ae945148417ada285130c59277c48a25" may contain the osm_id-highway pair "-1641383" & "motorway" or "23133731" & "motorway_link". I don't mind and will consider both correct.
How can I solve this, and what is the required operation/technique called, so that I know next time what to call it and what to search for?
select
md5(astext(way)) as md5,
min(osm_id) osm_id,
min(name) name,
min(highway) highway,
min(way) way
from planet_osm_roads
where highway is not null
group by 1
having count(osm_id) > 1