Something like COALESCE for a complicated query? - sql

Suppose I have a table1:
column1 | column2 | state
--------|---------|------
test1   |       2 |     0
test1   |       3 |     0
test1   |       1 |     1
test2   |       2 |     1
test2   |       1 |     2
I want to select (actually delete, but I use SELECT for testing) all rows whose column1 is not unique, and not select (actually retain) only the row that has:
1. state = 0 and the smallest value in column2;
2. if no row with state = 0 exists, then just the smallest value in column2.
So the result of the select should be:
column1 | column2 | state
--------|---------|------
test1   |       3 |     0
test1   |       1 |     1
test2   |       2 |     1
and the retained rows (in case of delete) should be:
column1 | column2 | state
--------|---------|------
test1   |       2 |     0
test2   |       1 |     2
I tried to achieve it with the following (which does not work):
SELECT * FROM table1 AS result1
WHERE
    result1.column1 IN
        (SELECT result2.column1
         FROM table1 AS result2
         WHERE /*part that works*/)
    AND
    result1.column2 >
        (SELECT min(result3.column2)
         FROM table1 AS result3
         WHERE (COALESCE(
             result3.column1 = result1.column1
             AND
             result3.state = 0,
             WHERE
             result3.column1 = result1.column1
         )))
The part that I can't figure out is the part after result1.column2 >.
I want to compare result1.column2 with:
1. the smallest value from the result set where result3.state = 0;
2. if 1. does not exist, then the smallest value from a similar result set without the result3.state = 0 condition.
That is my problem; I hope it makes sense. Maybe the whole thing can be rewritten in a more efficient/neater way.
Can you help me to fix that query?

Is this what you want?
SELECT *
FROM table1 AS result1
WHERE
    result1.column1 IN (SELECT result2.column1
                        FROM table1 AS result2
                        WHERE /*part that works*/)
    AND result1.column2 > COALESCE( (SELECT min(result3.column2)
                                     FROM table1 AS result3
                                     WHERE result3.column1 = result1.column1
                                       AND result3.state = 0)
                                  , (SELECT min(result3.column2)
                                     FROM table1 AS result3
                                     WHERE result3.column1 = result1.column1)
                                  );
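Since you mention the real goal is a DELETE, here is a minimal sketch of the same logic in DELETE form (the /*part that works*/ placeholder is yours; it keeps the answer's comparison unchanged, and note that some engines, e.g. MySQL, need a workaround before you can delete from a table that is also referenced in a subquery):
DELETE FROM table1
WHERE
    table1.column1 IN (SELECT result2.column1
                       FROM table1 AS result2
                       WHERE /*part that works*/)
    -- COALESCE returns the state = 0 minimum when that subquery finds a row,
    -- otherwise it falls back to the overall minimum for that column1
    AND table1.column2 > COALESCE( (SELECT min(result3.column2)
                                    FROM table1 AS result3
                                    WHERE result3.column1 = table1.column1
                                      AND result3.state = 0)
                                 , (SELECT min(result3.column2)
                                    FROM table1 AS result3
                                    WHERE result3.column1 = table1.column1)
                                 );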

Related

Vertica: repeat category from previous period if it is not listed in the current one

I'm trying to make some sort of running total in a table with gaps. I have a period, a category and a value, and I want to list all categories used in the current and previous periods for a given storage_id, even if there is no value for that category in the current period.
My data:
period | storage_id | category | value
-------|------------|----------|------
     1 |          1 | a        | foo
     2 |          1 | b        | bar
     3 |          1 | a        | bar
     3 |          1 | b        | foo
     1 |          2 | a        | foo
     2 |          2 | b        | bar
     4 |          2 | c        | foo
My goal:
period | storage_id | category | value
-------|------------|----------|------
     1 |          1 | a        | foo
     2 |          1 | a        | NULL
     2 |          1 | b        | bar
     3 |          1 | a        | bar
     3 |          1 | b        | foo
     1 |          2 | a        | foo
     2 |          2 | a        | NULL
     2 |          2 | b        | bar
     4 |          2 | a        | NULL
     4 |          2 | b        | NULL
     4 |          2 | c        | foo
I managed to do it using a temporary table and two self-joins. Is there a more efficient way to do this, e.g., using window functions?
Reproducible example:
CREATE LOCAL TEMPORARY TABLE tt (
storage_id int
, category varchar(255)
, value varchar(255)
, period int
) ON COMMIT PRESERVE ROWS;
INSERT INTO tt
SELECT 1, 'a', 'foo', 1 UNION ALL
SELECT 1, 'b', 'bar', 2 UNION ALL
SELECT 1, 'a', 'bar', 3 UNION ALL
SELECT 1, 'b', 'foo', 3 UNION ALL
SELECT 2, 'a', 'foo', 1 UNION ALL
SELECT 2, 'b', 'bar', 2 UNION ALL
SELECT 2, 'c', 'foo', 4
;
My imperfect solution:
WITH
cat as (
SELECT
t1.category
, t1.storage_id
, t2.period
FROM
tt as t1 join tt as t2
on t1.storage_id = t2.storage_id
and t1.period <= t2.period
GROUP BY
t1.category
, t1.storage_id
, t2.period
)
SELECT
cat.period
, cat.storage_id
, cat.category
, tt.value
FROM cat
LEFT JOIN tt
ON tt.category = cat.category
and tt.storage_id = cat.storage_id
and tt.period = cat.period
ORDER BY
storage_id, period;
11 rows, 178 ms
I want to list all categories used in current and previous periods even if there is no value of that category in current period.
I don't see how your result set illustrates this, because you have not carried all results to the end.
For the problem you describe, the following should do what you want:
select p.period, sc.storage_id, sc.category, tt.value
from (select distinct period from tt) p
join (select storage_id, category, min(period) as first_period
      from tt
      group by 1, 2
     ) sc
     on p.period >= sc.first_period
left join tt
     on tt.period = p.period and
        tt.storage_id = sc.storage_id and
        tt.category = sc.category
order by p.period, sc.storage_id, sc.category;
Here is a db<>fiddle.
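For intuition, the inner derived table sc on its own (run against your tt sample; group by 1, 2 is positional shorthand for storage_id, category) yields each category's first appearance per storage_id, and the join on p.period >= sc.first_period then carries that category into every later period:
select storage_id, category, min(period) as first_period
from tt
group by storage_id, category;
-- storage_id | category | first_period
--          1 | a        |            1
--          1 | b        |            2
--          2 | a        |            1
--          2 | b        |            2
--          2 | c        |            4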
I can't figure out the actual logic that produces the result set you want.

SQL Server select query with ids: count of ids grouped by date cast from datetime

I am struggling to find the right way to write a select query that produces a count of ids per unique date. I have a Logs table:
id | DateTime
---|--------------------
 1 | 23-03-2019 18:27:45
 1 | 23-03-2019 18:27:45
 2 | 23-03-2019 18:27:50
 2 | 23-03-2019 18:27:51
 2 | 23-03-2019 18:28:01
 3 | 23-03-2019 18:33:15
 1 | 24-03-2019 18:13:18
 2 | 23-03-2019 18:27:12
 2 | 23-03-2019 15:27:46
 3 | 23-03-2019 18:21:58
 3 | 23-03-2019 18:21:58
 4 | 24-03-2019 10:11:14
What I have tried:
select id, count(cast(DateTime as DATE)) as Counts from Logs group by id
It produces the proper counts per id, like:
id | count
---|------
 1 |     2
 2 |     3
 3 |     1
 1 |     1
 2 |     2
 3 |     2
 4 |     1
What I want is to add the DateTime column cast as a date:
id | count | Date
---|-------|-----------
 1 |     2 | 23-03-2019
 2 |     3 | 23-03-2019
 3 |     1 | 23-03-2019
 1 |     1 | 24-03-2019
 2 |     2 | 24-03-2019
 3 |     2 | 24-03-2019
 4 |     1 | 24-03-2019
However, I get an error saying:
Column 'Logs.DateTime' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.
when I try:
select id, count(cast(DateTime as DATE)) as Counts from Logs group by id
You need to add cast(DateTime as DATE) to the group by as well:
select id,cast(DateTime as DATE) as dateval, count(cast(DateTime as DATE)) as Counts
from Logs
group by id,cast(DateTime as DATE)
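If you also want the rows ordered by date, as in your expected output, a small variant (a sketch against the same Logs table; count(*) is equivalent here as long as DateTime is never null):
select id, cast(DateTime as DATE) as dateval, count(*) as Counts
from Logs
group by id, cast(DateTime as DATE)
order by dateval, id;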

How to update a row based on another row with the same id

With a Spark dataframe, I want to update a row's value based on other rows with the same id.
For example, I have the records below:
id,value
1,10
1,null
1,null
2,20
2,null
2,null
I want to get the result below:
id,value
1,10
1,10
1,10
2,20
2,20
2,20
To summarize: the value column is null in some rows, and I want to update them if there is another row with the same id that has a valid value.
In SQL, I can simply write an update statement with an inner join, but I didn't find an equivalent in Spark SQL.
update combineCols a
inner join combineCols b
on a.id = b.id
set a.value = b.value
(this is how I would do it in SQL)
Let's use the SQL method to solve this issue:
myValues = [(1,10),(1,None),(1,None),(2,20),(2,None),(2,None)]
df = sqlContext.createDataFrame(myValues,['id','value'])
df.registerTempTable('table_view')
df1=sqlContext.sql(
'select id, sum(value) over (partition by id) as value from table_view'
)
df1.show()
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
Caveat: This code assumes that there is only one non-null value for any particular id. When we group values, we have to use an aggregation function, and I have used sum. If there are two non-null values for some id, they will be summed up. If an id could have multiple non-null values, then it's better to use min/max, so that we get one of the values rather than their sum:
df1=sqlContext.sql(
'select id, max(value) over (partition by id) as value from table_view'
)
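Another option that avoids aggregation semantics altogether (a sketch; it assumes Spark SQL's first aggregate with its optional ignore-nulls argument, which simply picks the first non-null value per id):
df1 = sqlContext.sql(
    'select id, first(value, true) over (partition by id) as value from table_view'
)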
You can use a window to do this (in PySpark):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# create dataframe
df = sc.parallelize([
[1,10],
[1,None],
[1,None],
[2,20],
[2,None],
[2,None],
]).toDF(('id', 'value'))
# sort each id's rows so the non-null value comes first
# (Spark places NULLs last when sorting in descending order)
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
    .withColumn('value', F.first('value').over(window)) \
    .show()
Results:
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
You can use the same functions in Scala.

SQL - Pivot or Unpivot?

Another time, another problem. I have the following table:
assemb. | Repl_1 | Repl_2 | Repl_3 | Repl_4 | Repl_5 | Amount_1 | Amount_2 | Amount_3 | Amount_4 | Amount_5
--------|--------|--------|--------|--------|--------|----------|----------|----------|----------|---------
4711001 | 111000 | 222000 | 333000 | 444000 | 555000 |        1 |        1 |        1 |        1 |        1
4711002 | 222000 | 333000 | 444000 | 555000 | 666000 |        1 |        1 |        1 |        1 |        1
And here is what I need:
Article | Amount
--------|-------
 111000 |      1
 222000 |      2
 333000 |      2
 444000 |      2
 555000 |      2
 666000 |      1
Repl_1 to Repl_10 are replacement articles of the assembly. I can have n assemblies with up to 10 replacement articles. At the end I need an overview of all articles with their amounts across all assemblies.
THX.
Best greetz
Vegeta
This is probably the quickest way of achieving it, using UNION ALL. However, I'd recommend normalising your table:
SELECT Article, SUM(Amount)
FROM (
    SELECT Repl_1 AS Article, SUM(Amount_1) AS Amount FROM #Test GROUP BY Repl_1
    UNION ALL
    SELECT Repl_2 AS Article, SUM(Amount_2) AS Amount FROM #Test GROUP BY Repl_2
    UNION ALL
    SELECT Repl_3 AS Article, SUM(Amount_3) AS Amount FROM #Test GROUP BY Repl_3
    UNION ALL
    SELECT Repl_4 AS Article, SUM(Amount_4) AS Amount FROM #Test GROUP BY Repl_4
    UNION ALL
    SELECT Repl_5 AS Article, SUM(Amount_5) AS Amount FROM #Test GROUP BY Repl_5
) tbl
GROUP BY Article
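If this is SQL Server (the #Test temp table suggests so), an alternative sketch unpivots the Repl/Amount pairs in a single scan with CROSS APPLY and a VALUES constructor, instead of five UNION ALL branches:
SELECT v.Article, SUM(v.Amount) AS Amount
FROM #Test AS t
CROSS APPLY (VALUES
    (t.Repl_1, t.Amount_1),
    (t.Repl_2, t.Amount_2),
    (t.Repl_3, t.Amount_3),
    (t.Repl_4, t.Amount_4),
    (t.Repl_5, t.Amount_5)
) AS v(Article, Amount)
GROUP BY v.Article;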

Return only first distinct row from each join

The scenario is simple. I have 4 tables: an A table, a B table, a C1 table and a C2 table. A is a root-level table, B references A, and C1 and C2 reference B, but each B.ID can only be referenced by either C1 or C2, never both. The results are exported to a .CSV file which is then used for a variety of purposes, and the question here has to do with readability as well as making it easier to manage the information in external software.
I wrote a query that returns all data in all 4 tables keeping the relations intact, ordering them by A, B, C1 and C2.
SELECT A.*, B.*, C1.*, C2.*
FROM A
JOIN B
LEFT JOIN C1
LEFT JOIN C2
ORDER BY A.ID, B.ID, etc.
And got this:
A.ID | B.ID | C1.ID | C2.ID
   1 |    1 |     1 | NULL
   1 |    1 |     2 | NULL
   1 |    2 |     1 | NULL
   1 |    2 |     2 | NULL
   1 |    2 |     3 | NULL
   2 |    1 |  NULL | 1
   2 |    1 |  NULL | 2
....
Now, the question here is this: how do I return only the first distinct row for each join, so that the result set doesn't get clogged with redundant data? Basically, the result above should become this:
A.ID | B.ID | C1.ID | C2.ID
   1 |    1 |     1 | NULL
     |      |     2 | NULL
     |    2 |     1 | NULL
     |      |     2 | NULL
     |      |     3 | NULL
   2 |    1 |  NULL | 1
     |      |  NULL | 2
....
I can probably do this by making each join a subquery and partitioning the results by rank, or alternatively by creating a temporary table and slamming the results into it with the required logic, but since this will be used in a console app, I'd like to keep the solution as clean, simple and optimized as possible.
Any ideas?
This is reporting / formatting, not data, so it should be handled by the application, not by SQL.
That said, this will produce something close to your requirements:
select
    -- print each id only on the first row of its group; blank it otherwise
    case arn when 1 then convert(varchar(10),aid) else '' end as aid,
    case brn when 1 then convert(varchar(10),bid) else '' end as bid,
    case crn when 1 then convert(varchar(10),c1id) else '' end as c1id,
    c2id
from
(
    select a.id aid, b.id bid, c1.id c1id, c2.id c2id,
        -- row number within each A group, each (A, B) group and each (A, B, C1) group
        ROW_NUMBER() over(partition by a.id order by a.id,b.id,c1.id,c2.id) arn,
        ROW_NUMBER() over(partition by a.id,b.id order by a.id,b.id,c1.id,c2.id) brn,
        ROW_NUMBER() over(partition by a.id,b.id,c1.id order by a.id,b.id,c1.id,c2.id) crn
    FROM A
    JOIN B      -- join conditions omitted here, as in the question
    LEFT JOIN C1
    LEFT JOIN C2
) v