Return first entry in table for every row given a specific column sort order [duplicate] - sql

This question already has answers here:
Select first row in each GROUP BY group?
(20 answers)
Closed 5 years ago.
I have a table with three columns: hostname, address, and virtual. The address column is unique, but a host can have up to two address entries, one virtual and one non-virtual; in other words, the (hostname, virtual) pair is also unique. I want to produce a result set with one address entry per host, giving priority to the virtual address. For example, I have:
hostname | address | virtual
---------+---------+--------
first | 1.1.1.1 | TRUE
first | 1.1.1.2 | FALSE
second | 1.1.2.1 | FALSE
third | 1.1.3.1 | TRUE
fourth | 1.1.4.2 | FALSE
fourth | 1.1.4.1 | TRUE
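For reference, a minimal Postgres setup reproducing this sample (table and column names are taken from the question; the column types are an assumption):

```sql
-- Hypothetical schema matching the question's sample data.
CREATE TABLE system (
    hostname text NOT NULL,
    address  text NOT NULL UNIQUE,
    virtual  boolean,
    UNIQUE (hostname, virtual)
);

INSERT INTO system (hostname, address, virtual) VALUES
    ('first',  '1.1.1.1', TRUE),
    ('first',  '1.1.1.2', FALSE),
    ('second', '1.1.2.1', FALSE),
    ('third',  '1.1.3.1', TRUE),
    ('fourth', '1.1.4.2', FALSE),
    ('fourth', '1.1.4.1', TRUE);
```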
The query should return the results:
hostname | address
---------+--------
first | 1.1.1.1
second | 1.1.2.1
third | 1.1.3.1
fourth | 1.1.4.1
That is, the virtual address for every host, and the non-virtual address for hosts lacking a virtual address. The closest I've come is asking for one specific host:
SELECT hostname, address
FROM system
WHERE hostname = 'first'
ORDER BY virtual DESC NULLS LAST
LIMIT 1;
Which gives this:
hostname | address
---------+--------
first | 1.1.1.1
I would like to get this for every host in the table with a single query if possible.

What you're looking for is a RANK function. It would look something like this:
SELECT * FROM (
SELECT hostname, address
, RANK() OVER (PARTITION BY hostname ORDER BY virtual DESC NULLS LAST) AS rk
FROM system
) AS ranked
WHERE rk = 1
This approach is broadly portable: it also works in Oracle, and in SQL Server if you drop the NULLS LAST clause (which SQL Server does not support) or emulate it with a CASE expression in the ORDER BY. Note that the derived table must have an alias in Postgres and SQL Server.
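If exactly one row per host is required even when ties on virtual are possible, ROW_NUMBER() can be swapped in for RANK(), since ROW_NUMBER() never assigns duplicates within a partition. A sketch, with address added as an arbitrary tiebreaker:

```sql
-- ROW_NUMBER() guarantees a single rn = 1 row per hostname,
-- even if two rows tie on the virtual flag.
SELECT hostname, address FROM (
  SELECT hostname, address,
         ROW_NUMBER() OVER (PARTITION BY hostname
                            ORDER BY virtual DESC NULLS LAST, address) AS rn
  FROM system
) AS ranked
WHERE rn = 1;
```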

In Postgres, the simplest way is distinct on:
SELECT DISTINCT ON (hostname) hostname, address
FROM system
ORDER BY hostname, virtual DESC NULLS LAST
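On a large table, the DISTINCT ON query can benefit from an index matching its ORDER BY, which lets the planner read the rows in the right order. A sketch, assuming Postgres (the index name is made up):

```sql
-- Hypothetical index supporting the DISTINCT ON query above.
CREATE INDEX system_hostname_virtual_idx
    ON system (hostname, virtual DESC NULLS LAST);
```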

Related

Incremental integer ID in Impala

I am using Impala to query Parquet tables and cannot find a way to generate an integer column incrementing over 1..n. The column is supposed to be used as an ID reference. Currently I am aware of the uuid() function, which
Returns a universal unique identifier, a 128-bit value encoded as a string with groups of hexadecimal digits separated by dashes.
Anyhow, this is not suitable for me, since I have to pass the ID to another system that expects IDs in the style of 1..n. I also already know that Impala has no auto-increment implementation.
The desired result should look like:
-- UUID() provided as example - I want to achieve the `my_id`-column.
| my_id | example_uuid | some_content |
|-------|--------------|--------------|
| 1 | 50d53ca4-b...| "a" |
| 2 | 6ba8dd54-1...| "b" |
| 3 | 515362df-f...| "c" |
| 4 | a52db5e9-e...| "d" |
|-------|--------------|--------------|
How can I achieve the desired result (integer-ID ranging from 1..n)?
Note: This question differs from this one which specifically handles Kudu-tables. However, answers should be applicable for this question as well.
Since other Q&As like this one only came up with uuid()-like answers, I put some thought into it and finally came up with this solution:
SELECT
row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") as my_id
, some_content
FROM some_table
row_number() generates a continuous integer number over the given partition. Unlike rank(), row_number() always increments within its partition, even when duplicate ordering values occur.
PARTITION BY "dummy" puts the entire table into a single partition. This works because "dummy" is interpreted in the execution graph as a temporary column holding only the string value "dummy"; any analogous constant works as well.
ORDER BY is required in order to generate the increment. Since we don't care about the order in this example (otherwise just set your respective column), the "dummy" workaround is used there too.
The statement creates the desired incremental ID without any nested SQL statements or other tricks.
| my_id | some_content |
|-------|--------------|
| 1 | "a" |
| 2 | "b" |
| 3 | "c" |
| 4 | "d" |
|-------|--------------|
I used Markus's answer on a large partitioned table and found that I was getting duplicate ids. I suspect the ids were only unique within each partition; possibly PARTITION BY "dummy" leads Impala to evaluate row_number() separately per partition. I was able to get it working by specifying an actual column to order by and no PARTITION BY:
SELECT
row_number() OVER (ORDER BY actual_column) as my_id
, some_content
FROM some_table
It doesn't seem to matter whether the values in the column are unique (mine weren't), but using the actual partition key might result in the same issue as the "dummy" column.
Understandably, it took a lot longer to run than the dummy version.
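Given the duplicate-id report above, it may be worth sanity-checking whichever variant you use before handing the ids downstream. A sketch, with the table and column names carried over from the example:

```sql
-- If the generated ids are unique, both counts are equal.
SELECT count(*)              AS total_rows,
       count(DISTINCT my_id) AS distinct_ids
FROM (
  SELECT row_number() OVER (ORDER BY actual_column) AS my_id,
         some_content
  FROM some_table
) numbered;
```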

Postgres | SQL | Get a row only if the subnet is part of a given IP list

I have a table with a text column that holds an IP with a subnet
| ip
-------------
| 1.1.1.1/30
when you convert 1.1.1.1/30 to a list of IPs you get:
1.1.1.0
1.1.1.1
1.1.1.2
1.1.1.3
I want to run a SQL query on this table, passing a list of IPs (as part of a WHERE clause or anything else), and get this row only if the list of IPs that I give contains all the IPs of the range in the row.
meaning,
where ('1.1.1.0','1.1.1.1')
--> I will not get the row
but:
where ('1.1.1.0','1.1.1.1','1.1.1.2','1.1.1.3')
--> I will get the row
and also:
where ('1.1.1.0','1.1.1.1','1.1.1.2','1.1.1.3','1.1.1.4','1.1.1.5')
--> I will get the row
Is there any way to do that?
You have to expand out the inet into all its host values and then use containment to accomplish this:
with blowout as (
select t.ip, array_agg(host(network(t.ip::inet) + gs.n)) as all_ips
from t
cross join lateral
generate_series(0, broadcast(t.ip::inet) - network(t.ip::inet)) as gs(n)
group by t.ip
)
select *
from blowout
where all_ips <@ array['1.1.1.0', '1.1.1.1', '1.1.1.2',
'1.1.1.3', '1.1.1.4', '1.1.1.5']::text[]
;
Since you are not using any special inet functions in the comparison, it is best to do the comparisons using text.
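To reproduce the question's example, a minimal setup sketch (the table and column names are assumptions carried over from the query above):

```sql
-- Single row holding the /30 subnet from the question.
CREATE TABLE t (ip text PRIMARY KEY);
INSERT INTO t VALUES ('1.1.1.1/30');
-- Per the question: with the full range ('1.1.1.0' through '1.1.1.3')
-- in the array, the row is returned; with only ('1.1.1.0','1.1.1.1')
-- it is not; extra addresses beyond the range do not hurt.
```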

Postgres | How to get rows according to an exact-or-less IP list and a simple text list

I have a table that holds two text columns: type and ip.
Both of them can be a single value or multiple values separated by '####'.
In addition, the ip can also be a range such as 1.1.1.1/24.
The table is below:
row |type | ip
-------------------------------------------
1 |red | 1.1.1.1
2. |red####blue | 1.1.1.1####2.2.2.2
3. |blue | 1.1.1.1/32####2.2.2.2/32
4. |yellow | 1.1.1.1
5. |red | 3.3.3.3
6. |yellow####red | 1.1.1.1
7. |blue | 1.1.1.1####3.3.3.3
I want to get all the rows that have
type red or blue or both (exactly red and blue or less, meaning a single red or a single blue)
AND
IP 1.1.1.1 or 2.2.2.2 or both, including ranges (exactly 1.1.1.1 and 2.2.2.2 or less, meaning a single 1.1.1.1 or a single 2.2.2.2; if there are multiple IPs, they need to match the range exactly or be a subset)
meaning I want to get rows 1, 2, 3
I started to write the next query but I can't get it right:
SELECT * FROM t where
regexp_split_to_array(t.type, '####')::text[] in ('red','blue')
and
regexp_split_to_array(t.ip, '####')::inet[] in ('1.1.1.1','2.2.2.2')
Thanks in advance!
You want the overlaps operator:
SELECT *
FROM t
WHERE regexp_split_to_array(t.type, '####')::text[] && array['red', 'blue'] and
regexp_split_to_array(t.ip, '####')::inet[] && array['1.1.1.1', '2.2.2.2']
Matching type so that one, the other, or both match can be accomplished with the containment operator since the underlying comparison is equality.
Matching inet types to subnets is a different story. That has to use the inet && operator (contains or is contained by), so the ip array has to be turned to rows by unnest. The requirement to again match one, the other, or both means we need a count of ip values and return rows only where the count of matches equals the count of ip values.
This query appears to do the job. Fiddle here.
with asarrays as (
SELECT row, regexp_split_to_array(t.type, '####')::text[] as types,
unnest(regexp_split_to_array(t.ip, '####')::inet[]) as ip
FROM t
), typematch as (
select *, count(*) over (partition by row) as totcount
from asarrays
where types <@ array['red', 'blue']
), ipmatch as (
select *, count(*) over (partition by row) as matchcount
from typematch
where ip && any(array['1.1.1.1'::inet, '2.2.2.2'::inet])
)
select row, types, array_agg(ip) as ip
from ipmatch
where matchcount = totcount
group by row, types;
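A minimal setup sketch for trying this out (names are taken from the question; note that row is a reserved word in Postgres, so it is quoted here, and the query above may likewise need it quoted):

```sql
-- Sample data from the question; the CTE query should keep rows 1, 2 and 3.
CREATE TABLE t ("row" int, type text, ip text);
INSERT INTO t ("row", type, ip) VALUES
    (1, 'red',           '1.1.1.1'),
    (2, 'red####blue',   '1.1.1.1####2.2.2.2'),
    (3, 'blue',          '1.1.1.1/32####2.2.2.2/32'),
    (4, 'yellow',        '1.1.1.1'),
    (5, 'red',           '3.3.3.3'),
    (6, 'yellow####red', '1.1.1.1'),
    (7, 'blue',          '1.1.1.1####3.3.3.3');
```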

DB | Postgres | How to check if IP is in a list of IPs or a range

I have a table with a TEXT column that holds an IP, multiple IPs, or a range (for example 1.1.1.1/24).
In the case of multiple IPs, they are separated by ####,
for example 1.1.1.1####2.2.2.2.
The table with 5 rows:
ip
------------------
1.1.1.1
1.1.1.1####2.2.2.2
1.1.1.1/24
3.3.3.3
2.2.2.2
I want to get all the rows that contain the ip 1.1.1.1 or 3.3.3.3, meaning I want to get the first 4 rows.
(1.1.1.1,1.1.1.1####2.2.2.2,1.1.1.1/24,3.3.3.3)
I found this solution in another stack-overflow question:
select inet '192.168.1.5' << any (array['192.168.1/24', '10/8']::inet[]);
but I cannot understand how I can make it work for my specific table to get all of the first 4 rows.
Please help
Thanks in advance
I think this does what you want:
select t.*
from t
where '1.1.1.1'::inet <<= any(regexp_split_to_array(t.ip, '####')::inet[])
Here is a db<>fiddle.
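The query above checks only 1.1.1.1, but the question also wants matches on 3.3.3.3; one might simply OR a second condition. A sketch, reusing the question's ip column:

```sql
-- <<= matches an address that equals, or is contained in, each inet value,
-- so 1.1.1.1 also matches the 1.1.1.1/24 range row.
SELECT t.*
FROM t
WHERE '1.1.1.1'::inet <<= any(regexp_split_to_array(t.ip, '####')::inet[])
   OR '3.3.3.3'::inet <<= any(regexp_split_to_array(t.ip, '####')::inet[]);
```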

SQL groupby having count distinct

I've got a postgres database that contains a table with IP, User, and time fields. I need a query to give me the complete set of all IPs that have only a single user active on them over a defined time period (i.e. I need to filter out IPs with multiple or no users, and should only have one row per IP). The user field contains some null values, that I can filter out. I'm using Pandas' read_sql() method to get a dataframe directly.
I can get the full dataframe of data from the defined time period easily with:
SELECT ip, user FROM table WHERE user IS NOT NULL AND time >= start AND time <= end
I can then take this data and wrangle the information I need out of it easily using pandas with groupby and filter operations. However, I would like to be able to get what I need using a single SQL query. Unfortunately, my SQL chops ain't too hot. My first attempt below isn't great; the dataframe I end up with isn't the same as when I create the dataframe manually using the original query above and some pandas wrangling.
SELECT DISTINCT ip, user FROM table WHERE user IS NOT NULL AND ip IN (SELECT ip FROM table WHERE user IS NOT NULL AND time >= start AND time <= end GROUP BY ip HAVING COUNT(DISTINCT user) = 1)
Can anyone point me in the right direction here? Thanks.
edit: I neglected to mention that there are multiple entries for each user/ip combination. The source is network authentication traffic, and users authenticate on IPs very frequently.
Sample table head:
---------------------------------
ip | user | time
---------------------------------
172.18.0.0 | jbloggs | 1531987000
172.18.0.0 | jbloggs | 1531987100
172.18.0.1 | jsmith | 1531987200
172.18.0.1 | jbloggs | 1531987300
172.18.0.2 | odin | 1531987400
If I were to query this example table for the time range 1531987000 to 1531987400 I would like the following output:
---------------------
ip | user
--------------------
172.18.0.0 | jbloggs
172.18.0.2 | odin
This should work:
SELECT ip
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
GROUP BY ip
HAVING COUNT(DISTINCT user) = 1
Explanation:
SELECT ip FROM table WHERE user IS NOT NULL AND time >= start AND time <= end - filters out the nulls and restricts to the time period
...GROUP BY ip HAVING COUNT(DISTINCT user) = 1 - if an IP has multiple distinct users, the count would be greater than 1, so those IPs are dropped. (Plain COUNT(ip) = 1 would not work here: it counts rows, and the edit above notes there are multiple rows per user/IP pair.)
If by "single user" you mean that there could be multiple rows with only one user, then:
SELECT ip
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
GROUP BY ip
HAVING MIN(user) = MAX(user) AND COUNT(user) = COUNT(*);
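Since each group kept by this HAVING clause has exactly one distinct user, that user can be returned alongside the ip (as in the desired output) by adding MIN(user) to the select list. A sketch, keeping the question's placeholder names:

```sql
-- MIN(user) is safe here: the HAVING clause guarantees the group
-- contains a single distinct user, so MIN(user) is that user.
SELECT ip, MIN(user) AS single_user
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
GROUP BY ip
HAVING MIN(user) = MAX(user) AND COUNT(user) = COUNT(*);
```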
I have figured out a query that gets me what I want:
SELECT DISTINCT ip, user
FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end AND ip IN
(SELECT ip FROM table
WHERE user IS NOT NULL AND time >= start AND time <= end
GROUP BY ip HAVING COUNT(DISTINCT user) = 1)
Explanation:
The inner select gets me all IPs that have only one user across the specified time range. I then need to select the distinct ip/user pairs from the main table where the IPs are in the nested select.
It seems messy that I have to do the same filtering (of time range and non-null user fields) twice, though; is there a better way to do this?
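One way to avoid repeating the filter is to apply it once in a CTE and do both the grouping and the pairing from that. A sketch using the question's placeholder names (on a real Postgres table, user, time, and table are reserved words and would need quoting or renaming):

```sql
-- The time/null filter is written once; the CTE result is reused.
WITH filtered AS (
  SELECT ip, user
  FROM table
  WHERE user IS NOT NULL AND time >= start AND time <= end
)
SELECT ip, MIN(user) AS user
FROM filtered
GROUP BY ip
HAVING COUNT(DISTINCT user) = 1;
```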