SQL aggregation for identical values - sql

Suppose I have data something like this
id  project-id  thing-count  country
1   1           4            GBR
2   1           2            GBR
3   1           8            GBR
4   2           1            USA
5   2           4            USA
6   2           9            USA
I want to group the data using the project-id and keep the country. I know that the country does not vary within a project. There seem to be two ways I can do this:
SELECT
    project-id,
    MIN(country) AS country
FROM data
GROUP BY project-id
or
SELECT
    project-id,
    country
FROM data
GROUP BY
    project-id,
    country
These both work, but neither seems right. The first puts an extra burden on the GROUP BY since there's an unnecessary MIN calculation, while the second suggests to anyone reading the query that I want to GROUP BY the country data.
I'm always surprised that there is no FIRST, LAST, or ANY aggregation function, but as far as I can tell neither SQL Server, MySQL, nor Postgres has that.
How can I write this query so that the GROUP BY does not need to do extra work aggregating a column whose entries are identical, in a way that makes it obvious in the SQL that I do not care which of the values being aggregated is chosen to represent the set of values?
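For anyone who wants to try the two variants side by side, here is a minimal sketch. It assumes SQLite driven from Python's sqlite3 module, which is not one of the engines mentioned in the question, and it double-quotes the hyphenated column names because an unquoted project-id would be parsed as a subtraction. It only checks that both queries return the same rows for the sample data; it says nothing about their relative cost.

```python
import sqlite3

# In-memory SQLite database (an assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE data (id INTEGER, "project-id" INTEGER, '
    '"thing-count" INTEGER, country TEXT)'
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [(1, 1, 4, "GBR"), (2, 1, 2, "GBR"), (3, 1, 8, "GBR"),
     (4, 2, 1, "USA"), (5, 2, 4, "USA"), (6, 2, 9, "USA")],
)

# Variant 1: aggregate the country with MIN.
min_rows = conn.execute(
    'SELECT "project-id", MIN(country) AS country FROM data '
    'GROUP BY "project-id" ORDER BY "project-id"'
).fetchall()

# Variant 2: add the country to the GROUP BY list.
grouped_rows = conn.execute(
    'SELECT "project-id", country FROM data '
    'GROUP BY "project-id", country ORDER BY "project-id"'
).fetchall()

print(min_rows)
print(grouped_rows)
```

Both queries return one row per project here only because the country really is constant within each project; if it varied, the second query would return extra rows while the first would silently pick the smallest value.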

Two errors in my original question: FIRST and LAST. I need to constantly remind myself that SQL rows have no implicit ordering (think sets not lists) and so these aggregate functions would not make sense. But ANY does make sense, and I have expected it to be available time and time again.
In a comment on the question #amir-saleem points out that this does exist in MySQL. Using the ANY_VALUE aggregation function I could write
SELECT project-id, ANY_VALUE(country) AS country FROM data GROUP BY project-id
(N.B. ANY_VALUE looks like a useful aggregation function, but it is included in the Miscellaneous Functions documentation rather than the Aggregate Functions documentation; I do not know why.)
There is no similar aggregate function in Postgres as far as I can tell, though I could use a custom aggregate function to achieve this. Here are some community-provided examples that would work in my case:
First/last (aggregate)
Aggregate_Random
(I have not investigated SQL Server based solutions.)
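To illustrate the custom-aggregate idea without claiming anything about the Postgres examples linked above, here is a rough sketch in SQLite via Python's sqlite3 module (an assumption made purely so the example runs anywhere). The aggregate name any_val and the class AnyVal are hypothetical names I made up; in Postgres the equivalent would be built with CREATE AGGREGATE instead.

```python
import sqlite3


class AnyVal:
    """Hypothetical aggregate: keeps the first non-NULL value it sees."""

    def __init__(self):
        self.value = None

    def step(self, value):
        # Ignore everything once a non-NULL value has been stored.
        if self.value is None:
            self.value = value

    def finalize(self):
        return self.value


conn = sqlite3.connect(":memory:")
# Register the aggregate under the made-up SQL name any_val (one argument).
conn.create_aggregate("any_val", 1, AnyVal)

conn.execute('CREATE TABLE data (id INTEGER, "project-id" INTEGER, country TEXT)')
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?)",
    [(1, 1, "GBR"), (2, 1, "GBR"), (3, 1, "GBR"),
     (4, 2, "USA"), (5, 2, "USA"), (6, 2, "USA")],
)

any_rows = conn.execute(
    'SELECT "project-id", any_val(country) AS country FROM data '
    'GROUP BY "project-id" ORDER BY "project-id"'
).fetchall()
print(any_rows)
```

Which value the aggregate returns depends on the order in which the engine feeds rows to it, so this only makes sense when, as in my case, all values in a group are identical.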

If I'm understanding correctly, you basically want a list of all project IDs and their associated countries?
If so, you can do so by simply adding a DISTINCT clause after SELECT, as shown below:
SELECT DISTINCT project-id, country FROM data
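A quick way to check this against the sample data is sketched below. It assumes SQLite through Python's sqlite3 module and double-quoted column names (both my assumptions, not something from the question).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE data (id INTEGER, "project-id" INTEGER, '
    '"thing-count" INTEGER, country TEXT)'
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [(1, 1, 4, "GBR"), (2, 1, 2, "GBR"), (3, 1, 8, "GBR"),
     (4, 2, 1, "USA"), (5, 2, 4, "USA"), (6, 2, 9, "USA")],
)

# DISTINCT collapses duplicate (project-id, country) pairs.
distinct_rows = conn.execute(
    'SELECT DISTINCT "project-id", country FROM data ORDER BY "project-id"'
).fetchall()
print(distinct_rows)
```

Note that this only covers the case where no other aggregate is needed; as soon as you also want something like a SUM of thing-count per project, you are back to GROUP BY.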

Related

HANA concat rows

I use an SAP HANA database. I have a simple four-column table whose columns are number, name, noodles, and fish. The rows are these:
number name noodles fish
1 tom x
1 tom x
1 jack
2 jack x
I would like to group the rows by the id, and concatenate the names into a field, and thus obtain this:
number name noodles fish
1 tom x x
2 jack x
Can you please tell me how we can perform this operation in sap-hana? Thanks in advance.
Well, you did not really concatenate the names, but instead kept the same ones (if you had concatenated the names as well, you would get something like jackjack in your result). I guess your x's indicate some sort of ABAP-style flags.
In any case, you would do this with grouping. This is a completely non-HANA thing (you can use the same basic SQL for any DB). You can group against several columns. All other columns that you want to select must be used in an aggregated expression (e.g. a SUM, MAX, COUNT, etc.).
To get the output from your question, I wrote the following code:
SELECT "ID", "NAME", MAX("FISH"), MAX("NOODLES")
FROM #TEST GROUP BY "ID", "NAME";
And got the same output as you. I used the MAX function based on the following assumption: you would want to get X if there is any X in the "concatenated" (aggregated) rows in that column. You get nothing / space if all the "concatenated" rows have space in them.
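Here is a small sketch of the same MAX-based grouping, run on SQLite through Python's sqlite3 module rather than HANA (an assumption, chosen only so it can be run anywhere). The rows are illustrative: I use NULL for an empty flag and a plain table named test, neither of which comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, name TEXT, noodles TEXT, fish TEXT)")
# Illustrative rows: 'x' marks a set flag, NULL an empty one.
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?, ?)",
    [(1, "tom", "x", None),
     (1, "tom", None, "x"),
     (2, "jack", "x", None)],
)

# MAX skips NULLs, so a group gets 'x' if any of its rows has 'x'.
flag_rows = conn.execute(
    "SELECT id, name, MAX(noodles), MAX(fish) FROM test "
    "GROUP BY id, name ORDER BY id"
).fetchall()
print(flag_rows)
```

If the empty flags are stored as spaces or empty strings instead of NULL, MAX still prefers 'x' because it sorts after them.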

How does the ORDER BY clause work if two values are equal?

This is my NEWSPAPER table.
National News A 1
Sports D 1
Editorials A 12
Business E 1
Weather C 2
Television B 7
Births F 7
Classified F 8
Modern Life B 1
Comics C 4
Movies B 4
Bridge B 2
Obituaries F 6
Doctor Is In F 6
When I run this query
select feature,section,page from NEWSPAPER
where section = 'F'
order by page;
It gives this output
Doctor Is In F 6
Obituaries F 6
Births F 7
Classified F 8
But in Kevin Loney's Oracle 10g Complete Reference the output is like this
Obituaries F 6
Doctor Is In F 6
Births F 7
Classified F 8
Please help me understand why this happens.
If you need reliable, reproducible ordering to occur when two values in your ORDER BY clause's first column are the same, you should always provide another, secondary column to also order on. While you might be able to assume that they will sort themselves based on order entered (almost always the case to my knowledge, but be aware that the SQL standard does not specify any form of default ordering) or index, you never should (unless it is specifically documented as such for the engine you are using--and even then I'd personally never rely on that).
Your query, if you wanted alphabetical sorting by feature within each page, should be:
SELECT feature,section,page FROM NEWSPAPER
WHERE section = 'F'
ORDER BY page, feature;
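As a quick check of the tie-breaker, here is a sketch that runs the query above on SQLite via Python's sqlite3 module (an assumption; the question is about Oracle) using only the section F rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NEWSPAPER (feature TEXT, section TEXT, page INTEGER)")
conn.executemany(
    "INSERT INTO NEWSPAPER VALUES (?, ?, ?)",
    [("Births", "F", 7), ("Classified", "F", 8),
     ("Obituaries", "F", 6), ("Doctor Is In", "F", 6)],
)

# The secondary key (feature) decides the order of rows sharing a page.
ordered_rows = conn.execute(
    "SELECT feature, section, page FROM NEWSPAPER "
    "WHERE section = 'F' ORDER BY page, feature"
).fetchall()
for row in ordered_rows:
    print(row)
```

With the second sort key in place, the two page 6 rows always come back as Doctor Is In followed by Obituaries, regardless of insertion order.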
In relational databases, tables are sets and are unordered. The order by clause is used primarily for output purposes (and a few other cases such as a subquery containing rownum).
This is a good place to start. The SQL standard does not specify what has to happen when the keys on an order by are the same. And this is for good reason. Different techniques can be used for sorting. Some might be stable (preserving original order). Some methods might not be.
Focus on whether the same rows are in the sets, not their ordering. By the way, I would consider this an unfortunate example. The book should not have ambiguous sorts in its examples.
When you use the SELECT statement to query data from a table, the order which rows appear in the result set may not be what you expected.
In some cases, the rows that appear in the result set are in the order that they are stored in the table physically. However, in case the query optimizer uses an index to process the query, the rows will appear as they are stored in the index key order. For this reason, the order of rows in the result set is undetermined or unpredictable.
The query optimizer is a built-in software component in the database system that determines the most efficient way for an SQL statement to query the requested data.

Dynamic use of MDX AVG function

Anyone have advice on how to build an average measure that is dynamic -- it doesn't specify a particular slice but instead uses your current view? I'm working within a front-end OLAP viewer (Strategy Companion) and I need a "dynamic" implementation based on the dimensions that are currently filtered in the data view.
My fact table looks something like this:
Key AmountA IndicatorA AmountB Other Data
1 5 1 null 25
2 6 1 null 52
3 7 1 2 106
4 null 0 4 108
Now I can specify a simple average for "[Measures].[AmountA]" with "[Measures].[AmountA] / [Measures].[IndicatorA]" which works great - "[IndicatorA]" sums up to the number of non-null values of "[AmountA]". And this also works great no matter what dimensions are selected in the view - it always divides by the count of rows that have been filtered in.
But what about [AmountB]? I don't have a null indicator column. I want to get an average value of [AmountB] for whatever rows have been filtered in for my current view. If I try to use the count of rows as a simple formula (pseudo-code "[Measures].[AmountB] / Count([Measures].[Key])") I get the wrong result, because it is counting all the null rows in the average.
So, I need a way to use the AVG function to specify the average of [AmountB] over the set of "whatever rows I'm currently filtering in, based on whatever dimensions I'm currently using". How do I specify this dynamic set?
I've tried several different uses of the AVG function and they have either returned null or summed up to huge numbers, clearly not the average I'm looking for.
Thanks-
Matt
Sorry, my first suggestion was wrong. If you don't have access to the OLAP cube itself, you can't write an MDX query for this purpose (IMHO), because at that access level you have no detailed data from your fact table, only the aggregated data and dimensions from your cube.
Otherwise (if you have access to the OLAP database), you can create this metric (a count of non-NULL rows) in your measure group and then use it for the AVG calculation (as a calculated member in your cube, or in the WITH section of your MDX query).
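This is not MDX, but the arithmetic behind the non-NULL count measure can be sketched in plain SQL. The example below assumes SQLite via Python's sqlite3 module and made-up column names (row_key, amount_b) for the fact table from the question; it only shows why dividing by the row count gives the wrong average and dividing by the non-NULL count gives the right one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (row_key INTEGER, amount_b INTEGER)")
# AmountB values from the question's fact table; None stands for null.
conn.executemany(
    "INSERT INTO fact VALUES (?, ?)",
    [(1, None), (2, None), (3, 2), (4, 4)],
)

per_row, per_non_null, built_in = conn.execute(
    "SELECT SUM(amount_b) * 1.0 / COUNT(*), "         # divides by all 4 rows
    "       SUM(amount_b) * 1.0 / COUNT(amount_b), "  # divides by the 2 non-NULL rows
    "       AVG(amount_b) "                           # AVG also skips NULLs
    "FROM fact"
).fetchone()
print(per_row, per_non_null, built_in)
```

A measure defined as the count of non-NULL AmountB values plays the role of COUNT(amount_b) here, in the same way [IndicatorA] does for [AmountA].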

Minimum/Maximum function in T-SQL?

I am not asking about the aggregate Min/Max functions here. I would like to know if there are functions to get the min or max of two values as in:
SELECT Maximum(a,b)
FROM Foo
If table Foo contains
a b
1 2
4 3
Then the results should be 2, then 4.
I can do this with an IF or CASE statement, but you'd think there would be some simple math functions for this.
Thank you,
Daniel
There is not. You can write your own UDFs but UDFs can slow queries down. Another option is to UNPIVOT the data so you can use the aggregate function. But for small applications CASE is best.
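For reference, the CASE version might look like the sketch below. It runs on SQLite through Python's sqlite3 module instead of SQL Server (an assumption, only so the snippet is runnable); the CASE expression itself is standard SQL. Rows with NULL in either column are not handled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO Foo VALUES (?, ?)", [(1, 2), (4, 3)])

# Row-wise maximum of two columns using CASE.
max_rows = conn.execute(
    "SELECT CASE WHEN a >= b THEN a ELSE b END AS maximum "
    "FROM Foo ORDER BY rowid"
).fetchall()
print(max_rows)  # [(2,), (4,)]
```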

Retrieve names by ratio of their occurrence

I'm somewhat new to SQL queries, and I'm struggling with this particular problem.
Let's say I have query that returns the following 3 records (kept to one column for simplicity):
Tom
Jack
Tom
And I want to have those results grouped by the name and also include the fraction (ratio) of the occurrence of that name out of the total records returned.
So, the desired result would be (as two columns):
Tom | 2/3
Jack | 1/3
How would I go about it? Determining the numerator is pretty easy (I can just use COUNT() and GROUP BY name), but I'm having trouble translating that into a ratio out of the total rows returned.
SELECT name, COUNT(name) * 1.0 / (SELECT COUNT(*) FROM names) AS ratio FROM names GROUP BY name;
Since the denominator is fixed, the "ratio" is directly proportional to the numerator. Unless you really need to show the denominator, it'll be a lot easier to just use something like:
select name, count(*) from your_table_name
group by name
order by count(*) desc
and you'll get the right data in the right order, but the number that's shown will be the count instead of the ratio.
If you really want that denominator, you'd do a count(*) on a non-grouped version of the same select -- but depending on how long the select takes, that could be pretty slow.
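To see the ratio approach in action, here is a sketch using SQLite via Python's sqlite3 module (an assumption; the question does not name an engine) and a table called names holding the three sample rows. It also shows why the numerator is multiplied by 1.0: on engines where dividing two integers yields an integer, the plain COUNT / COUNT form truncates every ratio to 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (name TEXT)")
conn.executemany("INSERT INTO names VALUES (?)", [("Tom",), ("Jack",), ("Tom",)])

# Integer division: both counts are integers, so the result is truncated.
truncated = conn.execute(
    "SELECT name, COUNT(*) / (SELECT COUNT(*) FROM names) "
    "FROM names GROUP BY name ORDER BY name"
).fetchall()

# Multiplying by 1.0 forces floating-point division.
ratios = conn.execute(
    "SELECT name, COUNT(*) * 1.0 / (SELECT COUNT(*) FROM names) AS ratio "
    "FROM names GROUP BY name ORDER BY ratio DESC"
).fetchall()

print(truncated)
print(ratios)
```

The subquery is the "count on a non-grouped version of the same select" mentioned above; for a cheap select like this one its cost is negligible.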