SQL aggregate function for any non-specific value from a group - sql

Is there an aggregate function that returns any value from a group? I could use MIN or MAX, but would rather avoid the overhead if possible, given it's a text field.
My situation is an error log summary. The errors are grouped by the type of error and an example of the error text is displayed for each group. It doesn't matter which error message is used as the example.
SELECT
ref_code,
log_type,
error_number,
COUNT(*) AS count,
MIN(data) AS example
FROM data
GROUP BY
ref_code,
log_type,
error_number
What can I replace MIN(data) with to not have to compare 100,000s of varchar(2000) values?

You can use MIN coupled with KEEP, like this:
MIN(data) keep (dense_rank first order by rowid) AS EXAMPLE
The idea behind this is that the database engine will be sorting data over ROWID instead of the VARCHAR(2000) values, which theoretically should be faster. You can replace ROWID with the primary key value and check whether that is faster.
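For readers without an Oracle instance, here is a minimal sketch of the same idea (pick the example by a cheap row identifier instead of comparing the text) using Python's built-in sqlite3 module. It does not use Oracle's KEEP; it relies on a documented SQLite-specific behaviour: when a query contains a single MIN() or MAX(), bare columns take their values from the row that holds that minimum or maximum. The function name and the sample rows in the usage note are hypothetical.

```python
import sqlite3


def example_by_first_rowid(rows):
    """Return (ref_code, log_type, error_number, count, example) per group.

    SQLite-specific sketch, not Oracle's KEEP syntax: with a single MIN()
    in the query, the bare column `data` is taken from the row holding the
    minimum rowid, so the `data` values themselves are never compared.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data ("
        "ref_code TEXT, log_type TEXT, error_number INTEGER, data TEXT)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT ref_code, log_type, error_number,
               COUNT(*)   AS cnt,
               data       AS example,     -- bare column, follows MIN(rowid)
               MIN(rowid) AS first_rowid
        FROM data
        GROUP BY ref_code, log_type, error_number
        ORDER BY ref_code, log_type, error_number
        """
    ).fetchall()
    conn.close()
    # Drop the helper rowid column before returning.
    return [row[:5] for row in result]
```

Calling it with two rows for the same error type and one row for another returns one summary row per type, with the example text taken from whichever row was inserted first.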

Going by the proposed answers, it appears that MIN(data) (or MAX(data)) is the fastest way to achieve what I want. I'm trying to over-optimise unnecessarily.
I'll try out any other answers that come up while I have access to this database, but in the mean time, this comes out on top.
Thank you for everyone's effort!

Well, since you asked about OVER with PARTITION BY and ORDER BY, below is a version that does your GROUP BY, but then also uses ROW_NUMBER() with OVER, PARTITION BY and ORDER BY to number the rows within each ref_code, log_type, error_number combination. The first row it comes across in a combination gets row number 1 (with whatever data value is there), and the numbering restarts at 1 at the next distinct combination. So you can then simply pull the data field at row number 1 as a representative data field for a given ref_code, log_type, error_number.
It's still lacking something. It would be more elegant if I didn't have the double pass (once for aggregation and once for ROW_NUMBER()); however, it might perform very well nonetheless. I'll have to think about it some more to see if I can eliminate the double pass.
But it avoids any comparison of the large data field, and it represents a way to do what you asked: to pull one representative sample from the data field in correlation with the aggregated fields.
SELECT
t.ref_code,
t.log_type,
t.error_number,
t.count,
d.data
FROM
(
SELECT
ref_code,
log_type,
error_number,
COUNT(*) as count
FROM data
GROUP BY
ref_code,
log_type,
error_number
) t
INNER JOIN
(
SELECT
ref_code,
log_type,
error_number,
data,
ROW_NUMBER() OVER
(
PARTITION BY
ref_code,
log_type,
error_number
ORDER BY
ref_code,
log_type,
error_number
) as row_number
FROM data
) d on
d.ref_code = t.ref_code and
d.log_type = t.log_type and
d.error_number = t.error_number and
row_number = 1
Final caveat: I don't have Oracle to try this on. But I did put it together from reading Oracle documentation.
I added the below after I thought further about how to eliminate the GROUP BY, which I only had in there for COUNT(*). I don't know if it's any faster, though.
SELECT *
FROM
(
SELECT
ref_code,
log_type,
error_number,
data,
ROW_NUMBER() OVER
(
PARTITION BY
ref_code,
log_type,
error_number
ORDER BY
ref_code,
log_type,
error_number
) as row_number,
COUNT(*) OVER
(
PARTITION BY
ref_code,
log_type,
error_number
ORDER BY
ref_code,
log_type,
error_number
) as count
FROM data
) t
WHERE row_number = 1
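Since the author could not run the query, here is a hedged sketch of the single-pass variant against an in-memory SQLite database (assuming SQLite 3.25 or later, which added window functions). It is not Oracle, so it only shows that the shape of the query works; it says nothing about Oracle performance. The aliases rn and cnt and the function name are choices made for this sketch, picked to avoid clashing with the function names ROW_NUMBER and COUNT.

```python
import sqlite3


def summarize_errors(rows):
    """One row per (ref_code, log_type, error_number) with a count and
    an arbitrary example, computed with window functions only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data ("
        "ref_code TEXT, log_type TEXT, error_number INTEGER, data TEXT)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT ref_code, log_type, error_number, cnt, data
        FROM (
            SELECT ref_code, log_type, error_number, data,
                   -- Ordering by the partition keys leaves every row tied,
                   -- so which row gets number 1 is arbitrary.
                   ROW_NUMBER() OVER (
                       PARTITION BY ref_code, log_type, error_number
                       ORDER BY ref_code, log_type, error_number
                   ) AS rn,
                   -- No ORDER BY here: the count covers the whole partition.
                   COUNT(*) OVER (
                       PARTITION BY ref_code, log_type, error_number
                   ) AS cnt
            FROM data
        )
        WHERE rn = 1
        ORDER BY ref_code, log_type, error_number
        """
    ).fetchall()
    conn.close()
    return result
```

Because the ordering inside each partition is a tie, the example text can be any row of the group, which is exactly what the question asked for.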

Related

What else do I need to add to my SQL query to bring related information in other columns if using MIN() GROUP BY

There is a table with the following column headers: indi_cod, ries_cod, date, time and level. Each ries_cod contains more than one indi_cod, and these indi_cod are random consecutive numbers.
Which SQL query would be appropriate to build if the aim is to find the smallest ID of each ries_cod, and at the same time bring its related information corresponding to date, time and level?
I tried the following query:
SELECT MIN (indi_cod) AS min_indi_cod
FROM my-project-01-354113.indi_cod.second_step
GROUP BY ries_cod
ORDER BY ries_cod
And, indeed, it presented me with the minimum value of indi_cod for each group of ries_cod, but I couldn't write the appropriate query to bring me the information from the date, time and level columns corresponding to each indi_cod.
I usually use some kind of ranking for this type of thing. You can use row_number, rank, or dense_rank, depending on your RDBMS. Here is an example.
with t as (select a.*,
row_number() over (partition by ries_cod order by indi_cod) as rn
from mytable a)
select * from t where rn = 1
In addition, if you are using Oracle, you can do this without two queries by using KEEP.
https://renenyffenegger.ch/notes/development/databases/SQL/select/group-by/keep-dense_rank/index
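As a rough illustration of the ranking approach, the sketch below runs it against an in-memory SQLite database (assuming SQLite 3.25 or later for window functions) rather than the asker's BigQuery table. The table name readings and the function name are hypothetical; the column names come from the question.

```python
import sqlite3


def first_row_per_ries(rows):
    """For each ries_cod, return the row with the smallest indi_cod,
    together with its date, time and level."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE readings ('
        'indi_cod INTEGER, ries_cod TEXT, "date" TEXT, "time" TEXT, "level" INTEGER)'
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        WITH ranked AS (
            SELECT r.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY ries_cod
                       ORDER BY indi_cod
                   ) AS rn
            FROM readings AS r
        )
        SELECT indi_cod, ries_cod, "date", "time", "level"
        FROM ranked
        WHERE rn = 1          -- keep only the smallest indi_cod per group
        ORDER BY ries_cod
        """
    ).fetchall()
    conn.close()
    return result
```

The outer SELECT lists the columns explicitly so the helper column rn does not appear in the result.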
I think you just need to group by with the other columns
SELECT MIN (indi_cod) AS min_indi_cod, ries_cod, date, time, level
FROM mytable p
GROUP BY ries_cod, date, time, level
ORDER BY ries_cod

Taking a Random Sample From Each Group in Big Query

I'm trying to figure out what is the best way to take a random sample of 100 records for each group in a table in Big Query.
For example, I have a table where column A is a unique recordID, and column B is the groupID to which the record belongs. For every distinct groupID, I would like to take a random sample of 100 recordIDs. Is there a simple way to complete this?
Something like below should work
SELECT recordID, groupID
FROM (
SELECT
recordID, groupID,
RAND() AS rnd, ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY rnd) AS pos
FROM yourTable
)
WHERE pos <= 100
ORDER BY groupID, recordID
Also check the RAND() documentation if you want to improve randomness.
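The same pattern can be tried locally with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: it uses SQLite (3.25 or later) instead of BigQuery, SQLite's RANDOM() instead of RAND(), and a hypothetical function name; the random key is placed directly in the window's ORDER BY instead of being referenced through an alias.

```python
import sqlite3


def sample_per_group(records, n):
    """Return up to n randomly chosen (recordID, groupID) rows per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE yourTable (recordID INTEGER PRIMARY KEY, groupID TEXT)"
    )
    conn.executemany("INSERT INTO yourTable VALUES (?, ?)", records)
    result = conn.execute(
        """
        SELECT recordID, groupID
        FROM (
            SELECT recordID, groupID,
                   -- Number the rows of each group in random order.
                   ROW_NUMBER() OVER (
                       PARTITION BY groupID
                       ORDER BY RANDOM()
                   ) AS pos
            FROM yourTable
        )
        WHERE pos <= ?
        ORDER BY groupID, recordID
        """,
        (n,),
    ).fetchall()
    conn.close()
    return result
```

Groups with fewer than n records are returned in full, since every one of their rows has a position no larger than n.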
I had a similar need, namely cluster sampling, over 400M and more columns, but hit an "Exceeded resources..." error when using ROW_NUMBER().
If you don't need RAND() because your data is unordered anyway, this performs quite well (<30s in my case):
SELECT ARRAY_AGG(x LIMIT 100)
FROM yourtable x
GROUP BY groupId
You can also decorate it with UNNEST() if the front end cannot render nested records, and add ORDER BY groupId to find or confirm patterns more quickly.

SQL Row_Count function with Partition

I have a query that returns a set of results as a table called DATA, from several UNION ALL joined queries.
I am then doing ROW_NUMBER() on this, to get the row number for a specific grouping (WorksOrderNo)
ROW_NUMBER() Over(partition by Data.WorksOrderNo order by Data.WorksOrderNo) as RowNo,
Is there an equivalent ROW_Count function where I can specify a partition, and return the count of rows for that partition?
ROW_Count() Over(partition by Data.WorksOrderNo order by Data.WorksOrderNo) as RowNo ???
The reason is that this query is being used to drive a report layout.
As part of this, I need to format based on whether the total row count for each WorksOrderNo is >1 or not.
So for instance if there were three rows for a works order, the row_number function currently returns 1, 2 and 3, where the row count would return 3 on each row.
The function is simply COUNT(). In SQL Server, all the aggregation functions can be used as window functions, as long as they do not use DISTINCT.
Note that for the total count, you do not want the ORDER BY:
COUNT(*) Over (partition by Data.WorksOrderNo) as cnt
If you include the ORDER BY, then the COUNT() is cumulative, rather than constant for all rows in the partition.
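The difference between the two forms is easy to see side by side. The sketch below uses an in-memory SQLite database (3.25 or later) as a stand-in for SQL Server; the column line_no and the function name are hypothetical, added only so there is a unique value to order by within each works order.

```python
import sqlite3


def count_variants(rows):
    """Return (WorksOrderNo, line_no, total_cnt, running_cnt) per row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Data (WorksOrderNo TEXT, line_no INTEGER)")
    conn.executemany("INSERT INTO Data VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT WorksOrderNo, line_no,
               -- No ORDER BY: the same total on every row of the partition.
               COUNT(*) OVER (PARTITION BY WorksOrderNo) AS total_cnt,
               -- With ORDER BY on a unique column: a running count.
               COUNT(*) OVER (
                   PARTITION BY WorksOrderNo
                   ORDER BY line_no
               ) AS running_cnt
        FROM Data
        ORDER BY WorksOrderNo, line_no
        """
    ).fetchall()
    conn.close()
    return result
```

For a works order with three rows, total_cnt is 3 on every row while running_cnt climbs 1, 2, 3, so total_cnt is the column to test against 1 when formatting the report.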
It looks like you just need group by and count:
select WorksOrderNo, count(*) as Row_Count
from Data
group by WorksOrderNo

Return only the newest rows from a BigQuery table with a duplicate items

I have a table with many duplicate items – Many rows with the same id, perhaps with the only difference being a requested_at column.
I'd like to do a select * from the table, but only return one row with the same id – the most recently requested.
I've looked into group by id but then I need to do an aggregate for each column. This is easy with requested_at – max(requested_at) as requested_at – but the others are tough.
How do I make sure I get the value for title, etc that corresponds to that most recently updated row?
I suggest a similar form that avoids a sort in the window function:
SELECT *
FROM (
SELECT
*,
MAX(<timestamp_column>)
OVER (PARTITION BY <id_column>)
AS max_timestamp,
FROM <table>
)
WHERE <timestamp_column> = max_timestamp
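Here is a small sketch of this MAX() OVER form, run against an in-memory SQLite database (3.25 or later) instead of BigQuery. The table name items and the function name are hypothetical; id, title and requested_at come from the question. One property worth knowing: if two rows of the same id share the newest timestamp, both are returned, because the filter is an equality test and not a row number.

```python
import sqlite3


def newest_rows(rows):
    """Return the rows whose requested_at equals the newest value
    seen for their id."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER, title TEXT, requested_at TEXT)"
    )
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT id, title, requested_at
        FROM (
            SELECT *,
                   -- Newest timestamp per id, repeated on every row.
                   MAX(requested_at) OVER (PARTITION BY id) AS max_requested_at
            FROM items
        )
        WHERE requested_at = max_requested_at
        ORDER BY id, title
        """
    ).fetchall()
    conn.close()
    return result
```

The timestamps in this sketch are ISO-8601 strings, which compare correctly as text; a real table would normally use a timestamp type.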
Try something like this:
SELECT *
FROM (
SELECT
*,
ROW_NUMBER()
OVER (
PARTITION BY <id_column>
ORDER BY <timestamp_column> DESC)
row_number,
FROM <table>
)
WHERE row_number = 1
Note it will add a row_number column, which you might not want. To fix this, you can select individual columns by name in the outer select statement.
In your case, it sounds like the requested_at column is the one you want to use in the ORDER BY.
And, you will also want to use allow_large_results, set a destination table, and specify no flattening of results (if you have a schema with repeated fields).

SQL Server rand() aggregate

Problem: a table of coordinate lat/lngs. Two rows can potentially have the same coordinate. We want a query that returns a set of rows with unique coordinates (within the returned set). Note that distinct is not usable because I need to return the id column which is, by definition, distinct. This sort of works (#maxcount is the number of rows we need, intid is a unique int id column):
select top (#maxcount) max(intid)
from Documents d
group by d.geoLng, d.geoLat
Unfortunately, it will always return the same row for a given coordinate, which is a bit of a shame for my use. If only we had a rand() aggregate we could use instead of max()... Note that you can't use max() with guids created by newid().
Any ideas?
(there's some more background here, if you're interested: http://www.itu.dk/~friism/blog/?p=121)
UPDATE: Full solution here
You might be able to use a CTE for this with the ROW_NUMBER function across lat and long and then use rand() against that. Something like:
WITH cte AS
(
SELECT
intID,
ROW_NUMBER() OVER
(
PARTITION BY geoLat, geoLng
ORDER BY NEWID()
) AS row_num,
COUNT(intID) OVER (PARTITION BY geoLat, geoLng) AS TotalCount
FROM
dbo.Documents
)
SELECT TOP (#maxcount)
intID, RAND(intID)
FROM
cte
WHERE
row_num = 1 + FLOOR(RAND() * TotalCount)
This will always return the first sets of lat and lngs and I haven't been able to make the order random. Maybe someone can continue on with this approach. It will give you a random row within the matching lat and lng combinations though.
If I have more time later I'll try to get around that last obstacle.
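One possible way around that obstacle, sketched here in SQLite (3.25 or later) via Python's sqlite3 module and not in T-SQL: pick one random row per coordinate with ROW_NUMBER() ordered randomly, then shuffle the surviving rows before applying the limit. SQLite's RANDOM() stands in for NEWID(), and the function name is hypothetical, so treat this as an illustration of the idea and not as a tested SQL Server solution.

```python
import sqlite3


def random_unique_coordinates(docs, maxcount):
    """Return up to maxcount ids, no two sharing a coordinate, with both
    the row per coordinate and the set of coordinates chosen at random."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Documents ("
        "intid INTEGER PRIMARY KEY, geoLat REAL, geoLng REAL)"
    )
    conn.executemany("INSERT INTO Documents VALUES (?, ?, ?)", docs)
    result = conn.execute(
        """
        SELECT intid
        FROM (
            SELECT intid,
                   -- Random order inside each coordinate group.
                   ROW_NUMBER() OVER (
                       PARTITION BY geoLat, geoLng
                       ORDER BY RANDOM()
                   ) AS row_num
            FROM Documents
        )
        WHERE row_num = 1        -- one random row per coordinate
        ORDER BY RANDOM()        -- random choice of which coordinates survive
        LIMIT ?
        """,
        (maxcount,),
    ).fetchall()
    conn.close()
    return [row[0] for row in result]
```

Since row_num is already assigned in random order, filtering on row_num = 1 is enough to get a random row per coordinate; no second random draw against the group size is needed in this sketch.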
Doesn't this work for you?
select top (#maxcount) *
from
(
select max(intid) as id from Documents d group by d.geoLng, d.geoLat
) t
order by newid()
Where did you get the idea that DISTINCT only works on one column? Anyway, you could also use a GROUP BY clause.