In Oracle SQL, how can I find all values in one column for which more than one distinct value exists in another column?

I have an Oracle table like this
| id | code | info | More cols |
|----|------|------------------|-----------|
| 1 | 13 | The Thirteen | dggf |
| 1 | 18 | The Eighteen | ghdgffg |
| 1 | 18 | The Eighteen | |
| 1 | 9 | The Nine | ghdfgjgf |
| 1 | 9 | Die Neun | ghdfgjgf |
| 1 | 75 | The Seventy-five | ghfgh |
| 1 | 75 | The Seventy-five | ghfgh |
| 1 | 2 | The Two | ghfgh |
| 1 | 27 | The Twenty-Seven | |
| 1 | 27 | The Twenty-Seven | |
| 1 | 27 | el veintisiete | fghfg |
| . | . | . | . |
| . | . | . | . |
| . | . | . | . |
In this table I want to find all rows with values in column code which have more than one distinct value in the info column. So from the listed rows this would be the values 9 and 27 and the associated rows.
I tried to construct a first query like
SELECT code FROM mytable
WHERE COUNT(DISTINCT info) >1
but I get an "ORA-00934: group function is not allowed here" error. I also don't know how to express the condition COUNT(DISTINCT info) "for a fixed code".

You need HAVING together with GROUP BY. Aggregate functions are not allowed in a WHERE clause, because WHERE filters individual rows before they are grouped:
SELECT code
FROM mytable
group by code
having COUNT(DISTINCT info) >1

I would write your query as:
SELECT code
FROM yourTable
GROUP BY code
HAVING MIN(info) <> MAX(info);
Writing the HAVING logic this way leaves the query sargable, meaning that an index on (code, info) should be usable.
You could also do this using exists logic:
SELECT DISTINCT code
FROM yourTable t1
WHERE EXISTS (SELECT 1 FROM yourTable t2 WHERE t2.code = t1.code AND t2.info <> t1.info);
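To check that the three queries above agree on the sample data, here is a minimal sketch that runs them against an in-memory SQLite database instead of Oracle (an assumption made only so the example is self-contained; the SQL used here is plain enough to behave the same way). The "More cols" column is left out, and the last query shows one possible way to fetch the associated rows as well.

```python
import sqlite3

# Sample rows from the question as (id, code, info); "More cols" is omitted.
ROWS = [
    (1, 13, "The Thirteen"), (1, 18, "The Eighteen"), (1, 18, "The Eighteen"),
    (1, 9, "The Nine"), (1, 9, "Die Neun"),
    (1, 75, "The Seventy-five"), (1, 75, "The Seventy-five"),
    (1, 2, "The Two"),
    (1, 27, "The Twenty-Seven"), (1, 27, "The Twenty-Seven"),
    (1, 27, "el veintisiete"),
]

conn = sqlite3.connect(":memory:")  # SQLite stands in for Oracle here
conn.execute("CREATE TABLE mytable (id INTEGER, code INTEGER, info TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", ROWS)

QUERIES = {
    "count_distinct": """
        SELECT code FROM mytable
        GROUP BY code HAVING COUNT(DISTINCT info) > 1""",
    "min_max": """
        SELECT code FROM mytable
        GROUP BY code HAVING MIN(info) <> MAX(info)""",
    "exists": """
        SELECT DISTINCT code FROM mytable t1
        WHERE EXISTS (SELECT 1 FROM mytable t2
                      WHERE t2.code = t1.code AND t2.info <> t1.info)""",
}

# Codes found by each variant, sorted so the lists are comparable.
results = {name: sorted(r[0] for r in conn.execute(sql))
           for name, sql in QUERIES.items()}

# The associated rows: filter the table by the codes found above.
matching_rows = conn.execute("""
    SELECT code, info FROM mytable
    WHERE code IN (SELECT code FROM mytable
                   GROUP BY code HAVING COUNT(DISTINCT info) > 1)
    ORDER BY code, info""").fetchall()

print(results)
print(matching_rows)
```

All three variants ignore rows where info is NULL, so a code whose only difference is "value versus NULL" would not be reported by any of them.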

Related

Replace null values with most recent non-null values SQL

I have a table where each row consists of an ID, date, variable values (eg. var1).
When there is a null value for var1 in a row, I would like to replace it with the most recent non-null value before that date for that ID. How can I do this quickly for a very large table?
So presume I start with this table:
+----+------------+-------+
| id |date | var1 |
+----+------------+-------+
| 1 |'01-01-2022'|55 |
| 2 |'01-01-2022'|12 |
| 3 |'01-01-2022'|45 |
| 1 |'01-02-2022'|Null |
| 2 |'01-02-2022'|Null |
| 3 |'01-02-2022'|20 |
| 1 |'01-03-2022'|15 |
| 2 |'01-03-2022'|Null |
| 3 |'01-03-2022'|Null |
| 1 |'01-04-2022'|Null |
| 2 |'01-04-2022'|77 |
+----+------------+-------+
Then I want this
+----+------------+-------+
| id |date | var1 |
+----+------------+-------+
| 1 |'01-01-2022'|55 |
| 2 |'01-01-2022'|12 |
| 3 |'01-01-2022'|45 |
| 1 |'01-02-2022'|55 |
| 2 |'01-02-2022'|12 |
| 3 |'01-02-2022'|20 |
| 1 |'01-03-2022'|15 |
| 2 |'01-03-2022'|12 |
| 3 |'01-03-2022'|20 |
| 1 |'01-04-2022'|15 |
| 2 |'01-04-2022'|77 |
+----+------------+-------+
A CTE suits this well.
The snippet below returns every row with the nulls filled in; the CTE holds only the non-null rows, and the outer query looks up the most recent of them for each row of the full table. Turning this into an UPDATE is the remaining step (I will update my response).
WITH selectcte AS
(
SELECT * FROM testnulls WHERE var1 IS NOT NULL
)
SELECT t1A.id, t1A.date, ISNULL(t1A.var1, t1B.var1) AS varvalue
FROM testnulls t1A
OUTER APPLY (SELECT TOP 1 s.var1
FROM selectcte s
WHERE s.id = t1A.id AND s.date < t1A.date
ORDER BY s.date DESC) t1B
You can read more about CTEs here:
https://learn.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql?view=sql-server-ver16
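The "most recent non-null value before this date" lookup can also be tried outside SQL Server. The sketch below is only an illustration under some assumptions: it uses an in-memory SQLite database, and because SQLite has neither OUTER APPLY nor ISNULL, a correlated subquery with COALESCE is swapped in for them. The table name testnulls is taken from the answer; the dates are kept as the strings from the question.

```python
import sqlite3

# Sample data from the question; None marks the NULL values.
# Note: the dates are plain strings. String comparison happens to order
# these sample values correctly; a real table should use a proper date type.
ROWS = [
    (1, "01-01-2022", 55), (2, "01-01-2022", 12), (3, "01-01-2022", 45),
    (1, "01-02-2022", None), (2, "01-02-2022", None), (3, "01-02-2022", 20),
    (1, "01-03-2022", 15), (2, "01-03-2022", None), (3, "01-03-2022", None),
    (1, "01-04-2022", None), (2, "01-04-2022", 77),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE testnulls (id INTEGER, date TEXT, var1 INTEGER)")
conn.executemany("INSERT INTO testnulls VALUES (?, ?, ?)", ROWS)

# For each row, keep var1 if present; otherwise take the latest earlier
# non-null var1 of the same id (correlated subquery instead of OUTER APPLY).
FILL_SQL = """
    SELECT t.id, t.date,
           COALESCE(t.var1,
                    (SELECT p.var1
                     FROM testnulls p
                     WHERE p.id = t.id
                       AND p.date < t.date
                       AND p.var1 IS NOT NULL
                     ORDER BY p.date DESC
                     LIMIT 1)) AS var1
    FROM testnulls t
    ORDER BY t.date, t.id
"""

filled = conn.execute(FILL_SQL).fetchall()
for row in filled:
    print(row)
```

On a very large table an index on (id, date) would probably matter more than the exact query shape, since every null row triggers one lookup per id.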

SQL to Get Latest Field Value

I'm trying to write an SQL query (SQL Server) that returns the latest value of a field from a history table.
The table structure is basically as below:
ISSUE TABLE:
issueid
10
20
30
CHANGEGROUP TABLE:
changegroupid | issueid | updated |
1 | 10 | 01/01/2020 |
2 | 10 | 02/01/2020 |
3 | 10 | 03/01/2020 |
4 | 20 | 05/01/2020 |
5 | 20 | 06/01/2020 |
6 | 20 | 07/01/2020 |
7 | 30 | 04/01/2020 |
8 | 30 | 05/01/2020 |
9 | 30 | 06/01/2020 |
CHANGEITEM TABLE:
changegroupid | field | newvalue |
1 | ONE | 1 |
1 | TWO | A |
1 | THREE | Z |
2 | ONE | J |
2 | ONE | K |
2 | ONE | L |
3 | THREE | K |
3 | ONE | 2 |
3 | ONE | 1 | <--
4 | ONE | 1A |
5 | ONE | 1B |
6 | ONE | 1C | <--
7 | ONE | 1D |
8 | ONE | 1E |
9 | ONE | 1F | <--
EXPECTED RESULT:
issueid | updated | newvalue
10 | 03/01/2020 | 1
20 | 07/01/2020 | 1C
30 | 06/01/2020 | 1F
So each change to an issue item creates 1 change group record with the date the change was made, which can then contain 1 or more change item records.
Each change item shows the field name that was changed and the new value.
I then need to link those tables together to get each issue, the latest value of the field name called 'ONE', and ideally the date of the latest change.
These tables are from Jira, for those familiar with that table structure.
I've been trying to get this to work for a while now, so far I've got this query:
SELECT IssueId, MIN(Created) AS updated FROM
(
SELECT ISSUE.IssueId, UpdGrp.Created AS Created, UpdItm.NEWVALUE
FROM ISSUE
JOIN ChangeGroup UpdGrp ON (UpdGrp.IssueID = ISSUE.IssueId)
JOIN CHANGEITEM UpdItm ON (UpdGrp.ID = UpdItm.groupid)
WHERE UPPER(UpdItm.FIELD) = UPPER('ONE')
) AS dummy
GROUP BY IssueId
ORDER BY IssueId
This returns the first 2 columns I'm looking for but I'm struggling to work out how to return the final column as when I include that in the first line I get an error saying "Column is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause."
I've done a search on here and can't find anything that exactly matches my requirements.
Use window functions:
SELECT x.*
FROM (SELECT i.IssueId, cg.Created AS Created, ci.NEWVALUE,
ROW_NUMBER() OVER (PARTITION BY i.IssueId ORDER BY cg.Created DESC) AS seqnum
FROM ISSUE i JOIN
ChangeGroup cg
ON cg.IssueID = i.IssueId JOIN
CHANGEITEM ci
ON cg.ID = ci.groupid
WHERE UPPER(ci.FIELD) = UPPER('ONE')
) x
WHERE seqnum = 1
ORDER BY IssueId;
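Here is a small sketch of the ROW_NUMBER() approach on the sample data, run in an in-memory SQLite database (window functions need SQLite 3.25 or newer) rather than SQL Server. It uses the simplified column names from the table listings in the question (changegroupid, issueid, updated, field, newvalue), and the dates are rewritten as ISO strings, reading the originals as DD/MM/YYYY; both choices are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE changegroup (changegroupid INTEGER, issueid INTEGER, updated TEXT);
CREATE TABLE changeitem  (changegroupid INTEGER, field TEXT, newvalue TEXT);
""")
conn.executemany("INSERT INTO changegroup VALUES (?, ?, ?)", [
    (1, 10, "2020-01-01"), (2, 10, "2020-01-02"), (3, 10, "2020-01-03"),
    (4, 20, "2020-01-05"), (5, 20, "2020-01-06"), (6, 20, "2020-01-07"),
    (7, 30, "2020-01-04"), (8, 30, "2020-01-05"), (9, 30, "2020-01-06"),
])
conn.executemany("INSERT INTO changeitem VALUES (?, ?, ?)", [
    (1, "ONE", "1"), (1, "TWO", "A"), (1, "THREE", "Z"),
    (2, "ONE", "J"), (2, "ONE", "K"), (2, "ONE", "L"),
    (3, "THREE", "K"), (3, "ONE", "2"), (3, "ONE", "1"),
    (4, "ONE", "1A"), (5, "ONE", "1B"), (6, "ONE", "1C"),
    (7, "ONE", "1D"), (8, "ONE", "1E"), (9, "ONE", "1F"),
])

# Number the 'ONE' changes of each issue from newest to oldest, keep number 1.
LATEST_SQL = """
    SELECT issueid, updated, newvalue
    FROM (SELECT cg.issueid, cg.updated, ci.newvalue,
                 ROW_NUMBER() OVER (PARTITION BY cg.issueid
                                    ORDER BY cg.updated DESC) AS seqnum
          FROM changegroup cg
          JOIN changeitem ci ON ci.changegroupid = cg.changegroupid
          WHERE UPPER(ci.field) = 'ONE') AS latest
    WHERE seqnum = 1
    ORDER BY issueid
"""

latest = conn.execute(LATEST_SQL).fetchall()
for row in latest:
    print(row)
```

One thing the sample data shows: change group 3 contains two 'ONE' items with the same timestamp, so which of them gets seqnum = 1 for issue 10 is not determined by the ORDER BY. If the data has a column that orders items within a change group, adding it to the ORDER BY would make the result deterministic.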

Filtering using aggregation functions

I would like to filter my table using the MIN() function but still keep columns which can't be grouped.
I have table:
+----+----------+----------------------+
| ID | distance | geom |
+----+----------+----------------------+
| 1 | 2 | DSDGSAsd23423DSFF |
| 2 | 11.2 | SXSADVERG678BNDVS4 |
| 2 | 2 | XCZFETEFD567687SDF |
| 3 | 24 | SADASDSVG3423FD |
| 3 | 10 | SDFSDFSDF343DFDGF |
| 4 | 34 | SFDHGHJ546GHJHJHJ |
| 5 | 22 | SDFSGTHHGHGFHUKJYU45 |
| 6 | 78 | SDFDGDHKIKUI45 |
| 6 | 15 | DSGDHHJGHJKHGKHJKJ65 |
+----+----------+----------------------+
This is what I would like to achieve:
+----+----------+----------------------+
| ID | distance | geom |
+----+----------+----------------------+
| 1 | 2 | DSDGSAsd23423DSFF |
| 2 | 2 | XCZFETEFD567687SDF |
| 3 | 10 | SDFSDFSDF343DFDGF |
| 4 | 34 | SFDHGHJ546GHJHJHJ |
| 5 | 22 | SDFSGTHHGHGFHUKJYU45 |
| 6 | 15 | DSGDHHJGHJKHGKHJKJ65 |
+----+----------+----------------------+
It is possible when I use MIN() on the distance column and group by ID, but then I lose my geom, which is essential.
The query looks like this:
SELECT "ID", MIN(distance) AS distance FROM somefile GROUP BY "ID"
the result is:
+----+----------+
| ID | distance |
+----+----------+
| 1 | 2 |
| 2 | 2 |
| 3 | 10 |
| 4 | 34 |
| 5 | 22 |
| 6 | 15 |
+----+----------+
but this is not what I want.
Any suggestions?
One common approach to this is to find the minimum values in a derived table that you join with:
SELECT somefile."ID", somefile.distance, somefile.geom
FROM somefile
JOIN (
SELECT "ID", MIN(distance) AS distance FROM somefile GROUP BY "ID"
) t ON t.distance = somefile.distance AND t."ID" = somefile."ID";
You can use a window function to do this:
SELECT "ID", distance, geom
FROM (
SELECT "ID", distance, geom, rank() OVER (PARTITION BY "ID" ORDER BY distance) AS rnk
FROM somefile) sub
WHERE rnk = 1;
This effectively orders the entire set of rows first by the "ID" value, then by the distance and returns the record for each "ID" where the distance is minimal - no need to do a GROUP BY.
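To compare the derived-table join and the rank() variant, here is a minimal sketch on the sample data. It runs in an in-memory SQLite database (window functions need SQLite 3.25 or newer) instead of PostgreSQL, which is an assumption made only to keep it self-contained; the column is named id without quotes because SQLite identifiers are case-insensitive.

```python
import sqlite3

# Sample rows from the question as (id, distance, geom).
ROWS = [
    (1, 2, "DSDGSAsd23423DSFF"),
    (2, 11.2, "SXSADVERG678BNDVS4"), (2, 2, "XCZFETEFD567687SDF"),
    (3, 24, "SADASDSVG3423FD"), (3, 10, "SDFSDFSDF343DFDGF"),
    (4, 34, "SFDHGHJ546GHJHJHJ"),
    (5, 22, "SDFSGTHHGHGFHUKJYU45"),
    (6, 78, "SDFDGDHKIKUI45"), (6, 15, "DSGDHHJGHJKHGKHJKJ65"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE somefile (id INTEGER, distance REAL, geom TEXT)")
conn.executemany("INSERT INTO somefile VALUES (?, ?, ?)", ROWS)

# Variant 1: compute the minimum per id, then join back to get geom.
JOIN_SQL = """
    SELECT s.id, s.distance, s.geom
    FROM somefile s
    JOIN (SELECT id, MIN(distance) AS distance
          FROM somefile GROUP BY id) t
      ON t.id = s.id AND t.distance = s.distance
    ORDER BY s.id
"""

# Variant 2: rank rows within each id by distance and keep rank 1.
RANK_SQL = """
    SELECT id, distance, geom
    FROM (SELECT id, distance, geom,
                 RANK() OVER (PARTITION BY id ORDER BY distance) AS rnk
          FROM somefile) sub
    WHERE rnk = 1
    ORDER BY id
"""

via_join = conn.execute(JOIN_SQL).fetchall()
via_rank = conn.execute(RANK_SQL).fetchall()
for row in via_rank:
    print(row)
```

Both variants return every row that ties for the minimum distance of an id. If exactly one row per id is required, ROW_NUMBER() in place of RANK() would be one way to get that, at the cost of picking an arbitrary row among ties.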
select a.*, b.geom from
(SELECT "ID", MIN(distance) AS distance FROM somefile GROUP BY "ID") as a
inner join somefile as b on a."ID" = b."ID" and a.distance = b.distance
You can use PostgreSQL's DISTINCT ON clause.
select distinct on (id) id, distance, geom
from table_name
order by id, distance;
The ORDER BY has to start with the DISTINCT ON expression; the remaining sort key (distance) decides which row is kept for each id.
I think this is exactly what you are looking for.
For more details on how DISTINCT ON works, refer to the PostgreSQL documentation.
Keep in mind, though, that DISTINCT ON is a PostgreSQL extension and is not part of the SQL standard.

Error in executing two groupbys in sparkSQL

I am new to Spark SQL and was experimenting with a few queries.
This is the query I am trying to execute:
sqlContext.sql("SELECT id, category, AVG(mark) FROM data GROUP BY id, category")
I am not getting the proper output when I run the query.
Instead of the actual value of category I am getting values such as 1, 2, 3.
I have been stuck on this weird error for a long time.
But when I run a simple SELECT statement or a single GROUP BY, it works perfectly:
sqlContext.sql("SELECT id, category FROM data")
sqlContext.sql("SELECT id, AVG(mark) FROM data GROUP BY id")
What is wrong? Does Spark SQL have a problem with grouping by multiple columns?
Right now I am running this more complex query:
sqlContext.sql("SELECT data.id, data.category, AVG(id_avg.met_avg) FROM (SELECT id, AVG(mark) AS met_avg FROM data GROUP BY id) AS id_avg, data GROUP BY data.category, data.id")
This works, but it takes longer to execute.
Please help.
Sample data:
|id | category | marks
| 1 | a | 40
| 2 | b | 44
| 3 | a | 50
| 4 | b | 40
| 1 | a | 30
The output should be:
|id | category | avg
| 1 | a | 35
| 2 | b | 44
| 3 | a | 50
| 4 | b | 40
Please try this query:
SELECT
data.id
, data.category
, AVG(mark)
FROM data
GROUP BY
data.id
, data.category
Based on this sample data:
|id | category | marks
| 1 | a | 40
| 2 | b | 44
| 3 | a | 50
| 4 | b | 40
| 1 | a | 30
The output WILL be this:
|id | category | avg
| 1 | a | 35
| 2 | b | 44
| 3 | a | 50
| 4 | b | 40
and, the following expected row cannot be produced using group by:
| 5 | a | 30
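As a sanity check of what the two-column GROUP BY should return, the query can be run on the sample data in any SQL engine. The sketch below uses an in-memory SQLite database, not Spark, so it only illustrates the expected semantics of the query and says nothing about the Spark behaviour described in the question.

```python
import sqlite3

# Sample data from the question as (id, category, mark).
ROWS = [(1, "a", 40), (2, "b", 44), (3, "a", 50), (4, "b", 40), (1, "a", 30)]

conn = sqlite3.connect(":memory:")  # SQLite, used only to show the semantics
conn.execute("CREATE TABLE data (id INTEGER, category TEXT, mark INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", ROWS)

# One output row per distinct (id, category) pair, with the average mark.
averages = conn.execute("""
    SELECT id, category, AVG(mark) AS avg_mark
    FROM data
    GROUP BY id, category
    ORDER BY id
""").fetchall()

for row in averages:
    print(row)
```

The category column comes back with its original values here, which is the behaviour the question expected from Spark SQL as well.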
That is a bug in Spark SQL.
Try the next version; it is fixed there.
I got the proper output by using spark-1.0.2.
It also worked with pure Scala code. Try either of them :)

Aggregate function across two tables

For a further processing step I need a query that calculates several aggregate functions across two (maybe more) tables. But as soon as I include more than one table, I get odd results caused by the join. First I used this query:
SELECT
sum(s.bedarf2050_kwh_a) AS bedarf_kWh_a,
sum(s.bedarf2050_kwh_a)*0.2 AS netzverlust,
sum(s.bedarf2050_kwh_a) + sum(s.bedarf2050_kwh_a)*0.2 AS gesamtbedarf,
sum(pv.modulflaeche_qm) AS instbar_modulflaeche_qm
FROM
siedlungsareale_wbm s, pv_st_potenziale_gis pv
WHERE
s.vg_solar LIKE '%NWS 2%'
AND
ST_Covers(s.geom, pv.geom);
Using SUM with DISTINCT returns accurate values, but only if all input values are unique. That's not a solution I can use:
SELECT
SUM(DISTINCT s.bedarf2050_kwh_a) AS bedarf_kWh_a,
SUM(DISTINCT s.bedarf2050_kwh_a)*0.2 AS netzverlust,
SUM(DISTINCT s.bedarf2050_kwh_a) + SUM(DISTINCT s.bedarf2050_kwh_a)*0.2 AS gesamtbedarf,
SUM(pv.modulflaeche_qm) AS instbar_modulflaeche_qm,
(SUM(DISTINCT s.bedarf2050_kwh_a) + SUM(DISTINCT s.bedarf2050_kwh_a)*0.2)*0.01499 AS startwert_speichergroesse
FROM
siedlungsareale_wbm s, pv_st_potenziale_gis pv
WHERE
pv.vg_solar LIKE '%NWS 2%'
AND
ST_Covers(s.geom, pv.geom);
DISTINCT would be a proper solution if it could refer to a different column than the one being aggregated. Alternatives would be a subquery or a different join condition, but everything I tried resulted in errors or wrong values.
I found some solutions that use UNION to apply aggregate functions across multiple tables, but when I tried to adapt that code to my query I got errors.
For example this one:
Can SQL calculate aggregate functions across multiple tables?
I hope someone can help me build a working query for my task.
[EDIT] simple example
siedlungsareale
id | bedarf2050_kWh_a | a | b | c | vg_solar | geom
---|------------------|---|---|---|----------|-----
1 | 20 | | | | NWS 2 | xxxxx
2 | 10 | | | | NWS 2 | xxxxx
3 | 30 | | | | NWS 2 | xxxxx
4 | 5 | | | | NWS 2 | xxxxx
5 | 15 | | | | NWS 2 | xxxxx
sum = 80
pv_st_potenziale_gis
id | modulflaeche_qm | x | y | z | geom
---|------------------|---|---|---|---------
1 | 10 | | | | xxxxx
2 | 10 | | | | xxxxx
3 | 20 | | | | xxxxx
4 | 10 | | | | xxxxx
5 | 30 | | | | xxxxx
6 | 30 | | | | xxxxx
7 | 10 | | | | xxxxx
8 | 10 | | | | xxxxx
9 | 10 | | | | xxxxx
10 | 10 | | | | xxxxx
sum = 140
SELECT sum(s.bedarfxxxx) AS bedarf, sum(pv.mflaeche) As mflaeche
FROM siedlungsareale s, pv_st_potenziale_gis pv
WHERE s.vg_solar LIKE '%NWS 2%' AND ST_Covers(s.geom,pv.geom);
Expected correct result:
bedarf | mflaeche
---------|----------
80 | 140
Here I would get the sum of all values of column 'bedarf' from 'siedlungsareale' and of all values of 'mflaeche' from 'pv_st_potenziale_gis'.
But the values this query actually calculates for column 'bedarf' are much higher because of the cross join: each row of 'siedlungsareale' is counted once for every matching row of 'pv_st_potenziale_gis'.
And the other query:
SELECT sum(DISTINCT s.bedarfxxxx) AS bedarf, sum(DISTINCT pv.mflaeche) As mflaeche
FROM siedlungsareale s, pv_st_potenziale_gis pv
WHERE s.vg_solar LIKE '%NWS 2%' AND ST_Covers(s.geom,pv.geom);
returns:
bedarf | mflaeche
---------|-----------
80 | 60
The value for 'bedarf' is accurate because its values are unique. But for 'mflaeche', where some values occur several times, the result is wrong.