Join members from different dimensions on rows in MDX - ssas

I have 4 separate dimensions I'm interested in: (A, B, C, date).
Each dimension has multiple attribute hierarchies.
The dimensions theoretically map onto each other: C -> B -> A.
In other words, multiple members of B map to a single member of A, and multiple members of C map to a single member of B.
Originally I had the following query, which worked:
SELECT
    (
        [Measures].[Count]
    )
    ON COLUMNS,
    (
        [A].[Id].[Id].MEMBERS,
        FILTER
        (
            [A].[Name].[Name].MEMBERS,
            LEFT([A].[Name].CURRENTMEMBER.NAME, 4) <> "test"
        ),
        [A].[Start].[Start].MEMBERS,
        [A].[Owner].[Owner].MEMBERS
    )
    ON ROWS
FROM
(
    SELECT
        {[A].[Start].&[2020-05-10] : [A].[Start].&[2020-05-25]}
        ON COLUMNS
    FROM [Model]
)
WHERE
(
    {[date].[date].&[2020-05-10] : [date].[date].&[2020-05-25]},
    {[B].[End].&[2020-05-25] : NULL},
    [A].[Product].&[ASDF]
)
The output I was getting looked as follows:
A.id | A.Name | A.Owner | Count
-----|--------|---------|-------
1    | A      | asdf    | (null)
2    | B      | asdf    | 23
3    | C      | asdd    | (null)
4    | D      | asdf    | (null)
5    | E      | qwer    | 5067
6    | F      | adfd    | (null)
7    | G      | wert    | (null)
...  | ...    | ...     | ...
25   | Y      | werd    | (null)
As you can see there are a lot of nulls in the data.
I now have an additional requirement to filter only to the "Enabled" members of the B.id hierarchy.
So in the WHERE clause I added the following line: [B].[Status].&[Enabled].
This did not change my output, but I know it should, because I have the table I need mocked up in Power BI and this condition eliminates a few members from the A.id hierarchy.
The new output should look something like this:
A.id | A.Name | A.Owner | Count
-----|--------|---------|-------
2    | B      | asdf    | 23
4    | D      | asdf    | (null)
5    | E      | qwer    | 5067
7    | G      | wert    | (null)
...  | ...    | ...     | ...
25   | Y      | werd    | (null)
As you can see, some nulls should still be there because they have "Enabled" B.Id members mapped to them.
I then tried to add [B].[id].[id].MEMBERS and [B].[Status].[Status].MEMBERS on rows to see what the relationship is and why certain members of A.id are not being dropped. I did that as follows:
(
    [A].[Id].[Id].MEMBERS,
    FILTER
    (
        [A].[Name].[Name].MEMBERS,
        LEFT([A].[Name].CURRENTMEMBER.NAME, 4) <> "test"
    ),
    [A].[Start].[Start].MEMBERS,
    [A].[Owner].[Owner].MEMBERS,
    [B].[id].[id].MEMBERS,
    [B].[Status].[Status].MEMBERS
)
ON ROWS
But this showed every single member of A paired with every single member of B, basically a crossjoin. This is not what I need. Like I mentioned, there are unique members in B that map to one member in A. I did a lot of googling and came across the LINKMEMBER() function, but it does not seem to work for the implementation I need. Any help or advice is appreciated.
The current query I am running is below. I have tried adding the commented-out [B].[Status].&[Enabled] to both the WHERE clause and ON ROWS, but it gave me the same results as always. I also tried using the FILTER function to filter to only [B].[Status].CURRENTMEMBER.NAME = "Enabled", but that produced an empty table with no output.
SELECT
    (
        [Measures].[Count]
    )
    ON COLUMNS,
    ORDER
    (
        (
            --[B].[Status].&[Enabled],
            [A].[Id].Children,
            FILTER
            (
                [A].[Name].Children,
                LEFT([A].[Name].CURRENTMEMBER.NAME, 4) <> "test"
            ),
            [A].[Start].Children,
            [A].[Owner].Children
        ),
        [A].[Start].CurrentMember.Member_Key,
        BASC
    )
    ON ROWS
FROM
(
    SELECT
        {[A].[Start].&[2020-05-21] : [A].[Start].&[2020-05-27]}
        ON COLUMNS
    FROM [Model]
)
WHERE
(
    {[date].[date].&[2020-05-21] : [date].[date].&[2020-05-27]},
    --[B].[Status].&[Enabled],
    [A].[Product].&[ASDF]
)
I'm fairly new to MDX, so I apologize for the extensive explanation.

Welcome to Stack Overflow and MDX. The problem you are facing is addressed by using NON EMPTY.
In MDX, writing (DimA.Attribute1.members, DimB.Attribute1.members) asks for a cross join. To ensure that only valid combinations are returned, you have to use NON EMPTY. Try your modified query below:
SELECT
    (
        [Measures].[Count]
    )
    ON COLUMNS,
    NON EMPTY
    (
        [A].[Id].[Id].MEMBERS,
        FILTER
        (
            [A].[Name].[Name].MEMBERS,
            LEFT([A].[Name].CURRENTMEMBER.NAME, 4) <> "test"
        ),
        [A].[Start].[Start].MEMBERS,
        [A].[Owner].[Owner].MEMBERS
    )
    ON ROWS
FROM
(
    SELECT
        {[A].[Start].&[2020-05-10] : [A].[Start].&[2020-05-25]}
        ON COLUMNS
    FROM [Model]
)
WHERE
(
    {[date].[date].&[2020-05-10] : [date].[date].&[2020-05-25]},
    {[B].[End].&[2020-05-25] : NULL},
    [A].[Product].&[ASDF],
    [B].[Status].&[Enabled]
)
One thing to remember is that this happens when you are using attributes of different dimensions; if the attributes are from the same dimension, it is handled automatically. For example, (Dim1.attribute1.members, Dim1.attribute2.members) will only return data points that exist.
Try the query below.
SELECT
    (
        [Measures].[Count]
    )
    ON COLUMNS,
    (
        [B].[Status].&[Enabled],
        [A].[Id].[Id].MEMBERS,
        FILTER
        (
            [A].[Name].[Name].MEMBERS,
            LEFT([A].[Name].CURRENTMEMBER.NAME, 4) <> "test"
        ),
        [A].[Start].[Start].MEMBERS,
        [A].[Owner].[Owner].MEMBERS
    )
    ON ROWS
FROM
(
    SELECT
        {[A].[Start].&[2020-05-10] : [A].[Start].&[2020-05-25]}
        ON COLUMNS
    FROM [Model]
)
WHERE
(
    {[date].[date].&[2020-05-10] : [date].[date].&[2020-05-25]},
    {[B].[End].&[2020-05-25] : NULL},
    [A].[Product].&[ASDF]
)
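If NON EMPTY drops the (null) Count rows you want to keep, the EXISTS() function with a measure group name is also worth a try: it restricts the [A] members to those that relate to the Enabled [B] member through the fact table, without requiring the measure itself to be non-empty. This is an untested sketch; the measure group name "Count" is an assumption, so substitute your cube's actual measure group:
SELECT
    (
        [Measures].[Count]
    )
    ON COLUMNS,
    (
        // Keep A members that have fact rows with an Enabled B member,
        // without dropping rows whose Count is null
        EXISTS([A].[Id].[Id].MEMBERS, [B].[Status].&[Enabled], "Count"),
        FILTER
        (
            [A].[Name].[Name].MEMBERS,
            LEFT([A].[Name].CURRENTMEMBER.NAME, 4) <> "test"
        ),
        [A].[Owner].[Owner].MEMBERS
    )
    ON ROWS
FROM [Model]
WHERE
(
    {[date].[date].&[2020-05-10] : [date].[date].&[2020-05-25]},
    [A].[Product].&[ASDF]
)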

Related

SQL/PostgreSQL: How to select limited amount of rows of different types based on limits stored in a different table?

I have a table (table 1) where the first column is the key and the second column contains elements of different types. In table 1 there are three types (A, B, C), but the actual database has many more types.
Table 1. A minimal example.
_KEY | attribute
-----|----------
k1   | A
k2   | A
k3   | B
k4   | C
k5   | C
From table 1, I am interested in retrieving only a limited number of elements of each type. The limit for a given type is provided by table 2, in which the element type is the key (_Element).
To clarify: the number of elements of type A to obtain from table 1 in this minimal example is 1. Likewise, for type B it is 2 and for type C it is 1.
Table 2. Limits of items to obtain for each type in table 1.
_Element | Limit
---------|------
A        | 1
B        | 2
C        | 1
Finally, the elements should be retrieved from table 1 from top to bottom.
Thanks for any help and/or pointers / gus.
P.S.
For the above minimal example, the expected output would be
Key | Attribute
----|----------
k1  | A
k3  | B
k4  | C
Since there exists only one C element in this particular minimal example. Note that if there had existed, say, five elements of type C, then the following table would have been obtained instead (assuming the limit for C elements were 2):
Key | Attribute
----|----------
k1  | A
k3  | B
k4  | C
k5  | C
You can always do it with a union.
(SELECT * FROM table1 WHERE attribute = 'A' ORDER BY _key
 LIMIT (SELECT "limit" FROM table2 WHERE _element = 'A'))
UNION ALL
(SELECT * FROM table1 WHERE attribute = 'B' ORDER BY _key
 LIMIT (SELECT "limit" FROM table2 WHERE _element = 'B'))
UNION ALL
(SELECT * FROM table1 WHERE attribute = 'C' ORDER BY _key
 LIMIT (SELECT "limit" FROM table2 WHERE _element = 'C'))
Or using row_number:
with cte as (
    select _key,
           attribute,
           row_number() over (partition by attribute order by _key asc) as rowno
    from table1
)
select cte._key, cte.attribute
from cte
join table2 on table2._element = cte.attribute
where cte.rowno <= table2."limit";
I truly like the power of PostgreSQL arrays. So
select
table2._element,
unnest((array_agg(table1._key order by table1._key desc)[1:table2.limit])) as _key
from
table1 join table2 on (table1.attribute = table2._element)
group by
table2._element, table2.limit
where in the second field of the query:
array_agg(table1._key order by table1._key desc) - collects values into an array in the specified order (note that order by table1._key desc is just an example; you might skip it or specify another ordering),
(...)[1:table2.limit] - returns array elements from 1 to table2.limit,
unnest(...) - unwraps previous result to rows.
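For reference, a minimal setup sketch for trying both approaches above (PostgreSQL; the column types are assumptions, and limit has to be double-quoted in DDL because it is a reserved word):
CREATE TABLE table1 (_key text PRIMARY KEY, attribute text);
CREATE TABLE table2 (_element text PRIMARY KEY, "limit" int);

INSERT INTO table1 VALUES ('k1', 'A'), ('k2', 'A'), ('k3', 'B'), ('k4', 'C'), ('k5', 'C');
INSERT INTO table2 VALUES ('A', 1), ('B', 2), ('C', 1);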

Insert into multiple tables

A brief explanation on the relevant domain part:
A Category is composed of four pieces of data:
Gender (Male/Female)
Age Division (Mighty Mite to Master)
Belt Color (White to Black)
Weight Division (Rooster to Heavy)
So, Male Adult Black Rooster forms one category. Some combinations may not exist, such as mighty mite black belt.
An Athlete fights Athletes of the same Category, and if he classifies, he fights Athletes of different Weight Divisions (but of the same Gender, Age and Belt).
On to the modeling. I have a Category table, already populated with all combinations that exist in the domain.
CREATE TABLE Category (
[Id] [int] IDENTITY(1,1) NOT NULL,
[AgeDivision_Id] [int] NULL,
[Gender] [int] NULL,
[BeltColor] [int] NULL,
[WeightDivision] [int] NULL
)
A CategorySet and a CategorySet_Category, which forms a many to many relationship with Category.
CREATE TABLE CategorySet (
[Id] [int] IDENTITY(1,1) NOT NULL,
[Championship_Id] [int] NOT NULL
)
CREATE TABLE CategorySet_Category (
[CategorySet_Id] [int] NOT NULL,
[Category_Id] [int] NOT NULL
)
Given the following result set:
| Options_Id | Championship_Id | AgeDivision_Id | BeltColor | Gender | WeightDivision |
|------------|-----------------|----------------|-----------|--------|----------------|
| 2963       | 422             | 15             | 7         | 0      | 0              |
| 2963       | 422             | 15             | 7         | 0      | 1              |
| 2963       | 422             | 15             | 7         | 0      | 2              |
| 2963       | 422             | 15             | 7         | 0      | 3              |
| 2964       | 422             | 15             | 8         | 0      | 0              |
| 2964       | 422             | 15             | 8         | 0      | 1              |
| 2964       | 422             | 15             | 8         | 0      | 2              |
| 2964       | 422             | 15             | 8         | 0      | 3              |
Because athletes may fight in two CategorySets, I need CategorySet and CategorySet_Category to be populated in two different ways (it can be two queries):
One CategorySet for each row, with one CategorySet_Category pointing to the corresponding Category.
One CategorySet that groups all WeightDivisions with the same AgeDivision_Id, BeltColor and Gender. In this example, only BeltColor varies.
So the final result would have a total of 10 CategorySet rows:
| Id | Championship_Id |
|----|-----------------|
| 1 | 422 |
| 2 | 422 |
| 3 | 422 |
| 4 | 422 |
| 5 | 422 |
| 6 | 422 |
| 7 | 422 |
| 8 | 422 |
| 9 | 422 | /* groups different Weight Division for BeltColor 7 */
| 10 | 422 | /* groups different Weight Division for BeltColor 8 */
And CategorySet_Category would have 16 rows:
| CategorySet_Id | Category_Id |
|----------------|-------------|
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
| 7 | 7 |
| 8 | 8 |
| 9 | 1 | /* groups different Weight Division for BeltColor 7 */
| 9 | 2 | /* groups different Weight Division for BeltColor 7 */
| 9 | 3 | /* groups different Weight Division for BeltColor 7 */
| 9 | 4 | /* groups different Weight Division for BeltColor 7 */
| 10 | 5 | /* groups different Weight Division for BeltColor 8 */
| 10 | 6 | /* groups different Weight Division for BeltColor 8 */
| 10 | 7 | /* groups different Weight Division for BeltColor 8 */
| 10 | 8 | /* groups different Weight Division for BeltColor 8 */
I have no idea how to insert into CategorySet, grab its generated Id, then use it to insert into CategorySet_Category.
I hope I've made my intentions clear.
I've also created a SQLFiddle.
Edit 1: I commented in Jacek's answer that this would run only once, but that is false; it will run a couple of times a week. I have the option to run it as a SQL command from C# or as a stored procedure. Performance is not crucial.
Edit 2: Jacek suggested using SCOPE_IDENTITY to return the Id. Problem is, SCOPE_IDENTITY returns only the last inserted Id, and I insert more than one row into CategorySet.
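(For reference: a plain INSERT with an OUTPUT clause can capture every generated Id, as in the hypothetical sketch below, but that OUTPUT clause cannot reference source columns, which is the gap the MERGE approach in the accepted answer fills.)
DECLARE @NewIds TABLE (Id int);

INSERT INTO CategorySet (Championship_Id)
-- OUTPUT captures the Id of every inserted row, not just the last one
OUTPUT inserted.Id INTO @NewIds (Id)
SELECT DISTINCT Championship_Id FROM FakeResultSet;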
Edit 3: Answer to @FutbolFan, who asked how the FakeResultSet is retrieved.
It is a table CategoriesOption (Id, Price_Id, MaxAthletesByTeam)
and tables CategoriesOptionBeltColor, CategoriesOptionAgeDivision, CategoriesOptionWeightDivision, CategoriesOptionGender. Those four tables are basically the same (Id, CategoriesOption_Id, Value).
The query look like this:
SELECT * FROM CategoriesOption co
LEFT JOIN CategoriesOptionAgeDivision ON
CategoriesOptionAgeDivision.CategoriesOption_Id = co.Id
LEFT JOIN CategoriesOptionBeltColor ON
CategoriesOptionBeltColor.CategoriesOption_Id = co.Id
LEFT JOIN CategoriesOptionGender ON
CategoriesOptionGender.CategoriesOption_Id = co.Id
LEFT JOIN CategoriesOptionWeightDivision ON
CategoriesOptionWeightDivision.CategoriesOption_Id = co.Id
The solution described here will work correctly in multi-user environment and when destination tables CategorySet and CategorySet_Category are not empty.
I used schema and sample data from your SQL Fiddle.
The first part is straightforward:
(ab)use MERGE with an OUTPUT clause.
MERGE can INSERT, UPDATE and DELETE rows. In our case we only need to INSERT. 1=0 is always false, so the NOT MATCHED BY TARGET branch is always executed. In general there could be other branches (see the docs): WHEN MATCHED is usually used to UPDATE, and WHEN NOT MATCHED BY SOURCE is usually used to DELETE, but we don't need them here.
This convoluted form of MERGE is equivalent to a simple INSERT, but unlike a simple INSERT its OUTPUT clause allows us to refer to the columns we need.
MERGE INTO CategorySet
USING
(
SELECT
FakeResultSet.Championship_Id
,FakeResultSet.Price_Id
,FakeResultSet.MaxAthletesByTeam
,Category.Id AS Category_Id
FROM
FakeResultSet
INNER JOIN Category ON
Category.AgeDivision_Id = FakeResultSet.AgeDivision_Id AND
Category.Gender = FakeResultSet.Gender AND
Category.BeltColor = FakeResultSet.BeltColor AND
Category.WeightDivision = FakeResultSet.WeightDivision
) AS Src
ON 1 = 0
WHEN NOT MATCHED BY TARGET THEN
INSERT
(Championship_Id
,Price_Id
,MaxAthletesByTeam)
VALUES
(Src.Championship_Id
,Src.Price_Id
,Src.MaxAthletesByTeam)
OUTPUT inserted.id AS CategorySet_Id, Src.Category_Id
INTO CategorySet_Category (CategorySet_Id, Category_Id)
;
FakeResultSet is joined with Category to get Category.id for each row of FakeResultSet. It is assumed that Category has unique combinations of AgeDivision_Id, Gender, BeltColor, WeightDivision.
In the OUTPUT clause we need columns from both the source and destination tables. The OUTPUT clause of a simple INSERT statement doesn't provide them, so we use MERGE, which does.
The MERGE query above would insert 8 rows into CategorySet and insert 8 rows into CategorySet_Category using generated IDs.
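A quick way to sanity-check the generated rows (an illustrative query, not part of the load itself):
SELECT cs.Id, cs.Championship_Id, csc.Category_Id
FROM CategorySet cs
JOIN CategorySet_Category csc ON csc.CategorySet_Id = cs.Id
ORDER BY cs.Id, csc.Category_Id;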
The second part needs a temporary table. I'll use a table variable to store the generated IDs.
DECLARE @T TABLE (
    CategorySet_Id int
    ,AgeDivision_Id int
    ,Gender int
    ,BeltColor int);
We need to remember the generated CategorySet_Id together with the combination of AgeDivision_Id, Gender, BeltColor that caused it.
MERGE INTO CategorySet
USING
(
SELECT
FakeResultSet.Championship_Id
,FakeResultSet.Price_Id
,FakeResultSet.MaxAthletesByTeam
,FakeResultSet.AgeDivision_Id
,FakeResultSet.Gender
,FakeResultSet.BeltColor
FROM
FakeResultSet
GROUP BY
FakeResultSet.Championship_Id
,FakeResultSet.Price_Id
,FakeResultSet.MaxAthletesByTeam
,FakeResultSet.AgeDivision_Id
,FakeResultSet.Gender
,FakeResultSet.BeltColor
) AS Src
ON 1 = 0
WHEN NOT MATCHED BY TARGET THEN
INSERT
(Championship_Id
,Price_Id
,MaxAthletesByTeam)
VALUES
(Src.Championship_Id
,Src.Price_Id
,Src.MaxAthletesByTeam)
OUTPUT
inserted.id AS CategorySet_Id
,Src.AgeDivision_Id
,Src.Gender
,Src.BeltColor
INTO @T(CategorySet_Id, AgeDivision_Id, Gender, BeltColor)
;
The MERGE above groups FakeResultSet as needed and inserts 2 rows into CategorySet and 2 rows into @T.
Then join @T with Category to get the Category IDs:
INSERT INTO CategorySet_Category (CategorySet_Id, Category_Id)
SELECT
TT.CategorySet_Id
,Category.Id AS Category_Id
FROM
@T AS TT
INNER JOIN Category ON
Category.AgeDivision_Id = TT.AgeDivision_Id AND
Category.Gender = TT.Gender AND
Category.BeltColor = TT.BeltColor
;
This will insert 8 rows into CategorySet_Category.
This is not the full answer, but a direction you can use to solve it:
1st query:
select row_number() over(order by t, Id) as n, Championship_Id
from (
select distinct 0 as t, b.Id, a.Championship_Id
from FakeResultSet as a
inner join
Category as b
on
a.AgeDivision_Id=b.AgeDivision_Id and
a.Gender=b.Gender and
a.BeltColor=b.BeltColor and
a.WeightDivision=b.WeightDivision
union all
select distinct 1, BeltColor, Championship_Id
from FakeResultSet
) as q
2nd query:
select q2.CategorySet_Id, c.Id as Category_Id from (
select row_number() over(order by t, Id) as CategorySet_Id, Id, BeltColor
from (
select distinct 0 as t, b.Id, null as BeltColor
from FakeResultSet as a
inner join
Category as b
on
a.AgeDivision_Id=b.AgeDivision_Id and
a.Gender=b.Gender and
a.BeltColor=b.BeltColor and
a.WeightDivision=b.WeightDivision
union all
select distinct 1, BeltColor, BeltColor
from FakeResultSet
) as q
) as q2
inner join
Category as c
on
(q2.BeltColor is null and q2.Id=c.Id)
OR
(q2.BeltColor = c.BeltColor)
Of course this will work only for empty CategorySet and CategorySet_Category tables, but you can use select coalesce(max(Id), 0) from CategorySet to get the current number and add it to row_number; that way you will get the real ID that will be inserted into the CategorySet row for the second query.
What I do when I run into these situations is create one or more temporary tables with ROW_NUMBER() OVER clauses giving me identities on the temporary tables. Then I check for the existence of each record in the actual tables, and if a record exists I update the temporary table with the actual record id. Finally I run a WHILE EXISTS loop over the temporary-table records that are missing the actual id and insert them one at a time; after each insert I update the temporary table record with the actual id. This lets you work through all the data in a controlled manner, as sketched below.
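A rough T-SQL sketch of that pattern, applied to this question's tables (the staging columns are illustrative):
DECLARE @RowId int, @ChampionshipId int;

CREATE TABLE #Staging (
    RowId           int IDENTITY(1,1) PRIMARY KEY, -- identity on the temp table
    Championship_Id int,
    ActualId        int NULL                       -- the real CategorySet.Id, filled in below
);

INSERT INTO #Staging (Championship_Id)
SELECT DISTINCT Championship_Id FROM FakeResultSet;

-- If matching rows might already exist in the destination,
-- update #Staging.ActualId from a lookup here before the loop.

WHILE EXISTS (SELECT 1 FROM #Staging WHERE ActualId IS NULL)
BEGIN
    SELECT TOP (1) @RowId = RowId, @ChampionshipId = Championship_Id
    FROM #Staging
    WHERE ActualId IS NULL
    ORDER BY RowId;

    INSERT INTO CategorySet (Championship_Id) VALUES (@ChampionshipId);

    -- Record the generated id against the staging row just processed
    UPDATE #Staging SET ActualId = SCOPE_IDENTITY() WHERE RowId = @RowId;
END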
@@IDENTITY is your friend for the 2nd part of the question:
https://msdn.microsoft.com/en-us/library/ms187342.aspx
and
Best way to get identity of inserted row?
Some APIs (drivers) return an int from the update() function, i.e. the ID if it is an "insert". What API/environment do you use?
I don't understand the 1st problem. You should not insert the identity column.
The query below will give the final result for the CategorySet rows:
SELECT
ROW_NUMBER () OVER (PARTITION BY Championship_Id ORDER BY Championship_Id) RNK,
Championship_Id
FROM
(
SELECT
Championship_Id
,BeltColor
FROM #FakeResultSet
UNION ALL
SELECT
Championship_Id,BeltColor
FROM #FakeResultSet
GROUP BY Championship_Id,BeltColor
)BASE

CTE to represent a logical table for the rows in a table which have the max value in one column

I have an "insert only" database, wherein records aren't physically updated, but rather logically updated by adding a new record, with a CRUD value, carrying a larger sequence. In this case, the "seq" (sequence) column is more in line with what you may consider a primary key, but the "id" is the logical identifier for the record. In the example below,
This is the physical representation of the table:
seq | id | name  | CRUD
----|----|-------|-----
1   | 10 | john  | C
2   | 10 | joe   | U
3   | 11 | kent  | C
4   | 12 | katie | C
5   | 12 | sue   | U
6   | 13 | jill  | C
7   | 14 | bill  | C
This is the logical representation of the table, considering the "most recent" records:
seq | id | name | CRUD
----|----|------|-----
2   | 10 | joe  | U
3   | 11 | kent | C
5   | 12 | sue  | U
6   | 13 | jill | C
7   | 14 | bill | C
In order to, for instance, retrieve the most recent record for the person with id=12, I would currently do something like this:
SELECT
*
FROM
PEOPLE P
WHERE
P.ID = 12
AND
P.SEQ = (
SELECT
MAX(P1.SEQ)
FROM
PEOPLE P1
WHERE P1.ID = P.ID
)
...and I would receive this row:
seq | id | name | CRUD
----|----|------|-----
5   | 12 | sue  | U
What I'd rather do is something like this:
WITH
NEW_P
AS
(
--CTE representing all of the most recent records
--i.e. for any given id, the most recent sequence
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
The first SQL example using the subquery already works for us.
Question: How can I leverage a CTE to simplify my predicates when I need the "most recent" logical view of the table? In essence, I don't want to inline a subquery every single time I want to get at the most recent record. I'd rather define a CTE and leverage it in any subsequent predicate.
P.S. While I'm currently using DB2, I'm looking for a solution that is database agnostic.
This is a clear case for window (or OLAP) functions, which are supported by all modern SQL databases. For example:
WITH
ORD_P
AS
(
SELECT p.*, ROW_NUMBER() OVER ( PARTITION BY id ORDER BY seq DESC) rn
FROM people p
)
,
NEW_P
AS
(
SELECT * from ORD_P
WHERE rn = 1
)
SELECT
*
FROM
NEW_P P2
WHERE
P2.ID = 12
PS. Not tested. You may need to explicitly list all columns in the CTE clauses.
I guess you already put it together. First find the max seq associated with each id, then use that to join back to the main table:
WITH newp AS (
SELECT id, MAX(seq) AS latestseq
FROM people
GROUP BY id
)
SELECT p.*
FROM people p
JOIN newp n ON (n.latestseq = p.seq)
ORDER BY p.id
What you originally had would work, or you could move the CTE into the "from" clause. Maybe you want to use a timestamp field rather than a sequence number for the ordering?
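For example, with a hypothetical updated_at timestamp column, the same idea would read (a sketch, untested):
WITH newp AS (
    SELECT id, MAX(updated_at) AS latest_ts
    FROM people
    GROUP BY id
)
SELECT p.*
FROM people p
JOIN newp n ON (n.latest_ts = p.updated_at AND n.id = p.id)
ORDER BY p.id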
Following up on @Glenn's answer, here is an updated query which meets my original goal and is on par with @mustaccio's answer, but I'm still not sure what the performance (and other) implications of this approach vs. the others are.
WITH
LATEST_PERSON_SEQS AS
(
SELECT
ID,
MAX(SEQ) AS LATEST_SEQ
FROM
PERSON
GROUP BY
ID
)
,
LATEST_PERSON AS
(
SELECT
P.*
FROM
PERSON P
JOIN
LATEST_PERSON_SEQS L
ON
(
L.LATEST_SEQ = P.SEQ)
)
SELECT
*
FROM
LATEST_PERSON L2
WHERE
L2.ID = 12

Select multiple (non-aggregate function) columns with GROUP BY

I am trying to select the max value from one column, while grouping by another non-unique id column which has multiple duplicate values. The original database looks something like:
mukey | comppct_r | name | type
65789 | 20 | a | 7n
65789 | 15 | b | 8m
65789 | 1 | c | 1o
65790 | 10 | a | 7n
65790 | 26 | b | 8m
65790 | 5 | c | 1o
...
This works just fine using:
SELECT c.mukey, Max(c.comppct_r) AS ComponentPercent
FROM c
GROUP BY c.mukey;
Which returns a table like:
mukey | ComponentPercent
65789 | 20
65790 | 26
65791 | 50
65792 | 90
I want to be able to add other columns without affecting the GROUP BY, including columns like name and type in the output table:
mukey | comppct_r | name | type
65789 | 20 | a | 7n
65790 | 26 | b | 8m
65791 | 50 | c | 7n
65792 | 90 | d | 7n
but it always outputs an error saying I need to use an aggregate function in the SELECT statement. How should I go about doing this?
You have yourself a greatest-n-per-group problem. This is one of the possible solutions:
select c.mukey, c.comppct_r, c.name, c.type
from c
inner join (
    select mukey, max(comppct_r) as comppct_r
    from c
    group by mukey
) ss on c.mukey = ss.mukey and c.comppct_r = ss.comppct_r
Another possible approach, same output:
select c1.*
from c c1
left outer join c c2
on (c1.mukey = c2.mukey and c1.comppct_r < c2.comppct_r)
where c2.mukey is null;
There's a comprehensive and explanatory answer on the topic here: SQL Select only rows with Max Value on a Column
Any non-aggregate column should be in the GROUP BY clause. Why?
t1
x1 | y1 | z1
---|----|---
1  | 2  | 5
2  | 2  | 7
Now suppose you try to write a query like:
select x1, y1, max(z1) from t1 group by y1;
This query should return only one row, but what should the value of x1 be? That is basically undefined behaviour, so SQL errors out on such a query.
Now, coming to the point: you can either choose an aggregate function for x1 or add x1 to the GROUP BY. It all depends on your requirement.
If you want all rows with the aggregation on z1, grouping by y1, you may use the subquery approach:
select x1, y1, (select max(z1) from t1 where t1.y1 = tt.y1)
from t1 tt;
This will produce a result like:
x1 | y1 | max(z1)
---|----|--------
1  | 2  | 7
2  | 2  | 7
Try using a virtual table as follows:
SELECT vt.*, c.name FROM (
    SELECT c.mukey, Max(c.comppct_r) AS ComponentPercent
    FROM c
    GROUP BY c.mukey
) AS vt, c
WHERE vt.mukey = c.mukey
You can't just add additional columns without adding them to the GROUP BY or applying an aggregate function. The reason is that the values of a column can differ within one group. For example, you could have two rows:
mukey | comppct_r | name | type
65789 | 20 | a | 7n
65789 | 20 | b | 9f
What should the aggregated group look like for the columns name and type?
If name and type are always the same inside a group, just add them to the GROUP BY clause:
SELECT c.mukey, Max(c.comppct_r) AS ComponentPercent
FROM c
GROUP BY c.mukey, c.name, c.type;
Use a 'Having' clause
SELECT *
FROM c
GROUP BY c.mukey
HAVING c.comppct_r = Max(c.comppct_r);

Deleting similar columns in SQL

In PostgreSQL 8.3, let's say I have a table called widgets with the following:
id | type | count
--------------------
1 | A | 21
2 | A | 29
3 | C | 4
4 | B | 1
5 | C | 4
6 | C | 3
7 | B | 14
I want to remove duplicates based upon the type column, leaving only those with the highest count column value in the table. The final data would look like this:
id | type | count
--------------------
2 | A | 29
3 | C | 4 /* `id` for this record might be '5' depending on your query */
7 | B | 14
I feel like I'm close, but I can't seem to wrap my head around a query that works to get rid of the duplicate rows.
count is a SQL reserved word, so it has to be escaped somehow. In PostgreSQL you quote identifiers with double quotes, so I've written it as "count" below. In any case, the following should theoretically work (but I didn't actually test it):
delete from widgets where id not in (
    select max(w2.id) from widgets as w2 inner join
    (select max(w1."count") as "count", type from widgets as w1 group by w1.type) as sq
    on sq."count" = w2."count" and sq.type = w2.type group by w2.type
);
There is a slightly simpler answer than Asaph's, with the EXISTS SQL operator:
DELETE FROM widgets AS a
WHERE EXISTS
(SELECT * FROM widgets AS b
WHERE (a.type = b.type AND b.count > a.count)
OR (b.id > a.id AND a.type = b.type AND b.count = a.count))
The EXISTS operator returns TRUE if the subquery returns at least one record.
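To try either DELETE above, here is a minimal setup sketch with the question's data (PostgreSQL; "count" is quoted only because it clashes with the aggregate name):
CREATE TABLE widgets (id int PRIMARY KEY, type text, "count" int);

INSERT INTO widgets VALUES
    (1, 'A', 21), (2, 'A', 29), (3, 'C', 4), (4, 'B', 1),
    (5, 'C', 4), (6, 'C', 3), (7, 'B', 14);

-- After removing duplicates, ids 2 and 7 should remain, plus one of 3/5.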
According to your requirements, it seems to me that this should work:
DELETE
FROM widgets
WHERE (type, "count") NOT IN
(
    SELECT type, MAX("count")
    FROM widgets
    GROUP BY type
)