Display Rows if Column Value is repeated - sql

I have a SQL table that looks like this:
DATA              | TEST_ID | PARAM_ID
--------------------------------------
c:\desktop\image1 | 11      | 1
c:\desktop\image2 | 12      | 1
c:\desktop\image3 | 13      | 1
c:\desktop\image4 | 14      | 1
Fail              | 14      | 2
0.45              | 14      | 3
c:\desktop\image5 | 15      | 1
Fail              | 15      | 2
0.68              | 15      | 3
c:\desktop\image6 | 16      | 1
Fail              | 16      | 2
0.25              | 16      | 3
I would like to create a query where the result only shows DATA if TEST_ID has the same value repeated 3 times.
Ideal Result:
DATA              | TEST_ID | PARAM_ID
--------------------------------------
c:\desktop\image4 | 14      | 1
Fail              | 14      | 2
0.45              | 14      | 3
c:\desktop\image5 | 15      | 1
Fail              | 15      | 2
0.68              | 15      | 3
c:\desktop\image6 | 16      | 1
Fail              | 16      | 2
0.25              | 16      | 3
Would the best approach be to use COUNT(*)>2 for the TEST_ID column?

Use window functions:
select t.*
from (select t.*, count(*) over (partition by test_id) as cnt
      from t
     ) t
where cnt >= 3;
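If you want to sanity-check the approach, here is a minimal sqlite3 sketch (SQLite stands in for whatever engine you use; the table name `t` follows the answer):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (data TEXT, test_id INTEGER, param_id INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("c:\\desktop\\image1", 11, 1),
    ("c:\\desktop\\image2", 12, 1),
    ("c:\\desktop\\image3", 13, 1),
    ("c:\\desktop\\image4", 14, 1), ("Fail", 14, 2), ("0.45", 14, 3),
    ("c:\\desktop\\image5", 15, 1), ("Fail", 15, 2), ("0.68", 15, 3),
    ("c:\\desktop\\image6", 16, 1), ("Fail", 16, 2), ("0.25", 16, 3),
])

# Inner query counts rows per test_id; outer query keeps test_ids seen 3+ times.
result = con.execute("""
    SELECT data, test_id, param_id
    FROM (SELECT t.*, COUNT(*) OVER (PARTITION BY test_id) AS cnt FROM t) t
    WHERE cnt >= 3
""").fetchall()
```

Only the nine rows for test_ids 14, 15 and 16 come back, matching the ideal result. (Window functions need SQLite 3.25+.)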

BigQuery - Count how many times a value shows up in a column

I have a table like this
Col1 | Col2 | Col3
A | 1 | 23
B | 3 | 23
B | 2 | 64
A | 4 | 75
C | 5 | 23
A | 6 | 12
A | 2 | 33
B | 3 | 52
A | 1 | 83
C | 5 | 24
A | 6 | 74
and I need a query that will show how many times the value in Col1 appeared:
Col1 | Col2 | Col3 | Col4
A | 1 | 23 | 6
B | 3 | 23 | 3
B | 2 | 64 | 3
A | 4 | 75 | 6
C | 5 | 23 | 2
A | 6 | 12 | 6
A | 2 | 33 | 6
B | 3 | 52 | 3
A | 1 | 83 | 6
C | 5 | 24 | 2
A | 6 | 74 | 6
How can I do it in BigQuery?
Easier with a window function
select *, count(*) over (partition by col1) as col4
from t;
You just need to use a window function:
select *, count(col1) over (partition by col1) as col4
from t;
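Both answers rely on `COUNT(...) OVER (PARTITION BY col1)`; a sqlite3 sketch of the same query (SQLite in place of BigQuery) shows the counts it produces:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col1 TEXT, col2 INTEGER, col3 INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    ("A", 1, 23), ("B", 3, 23), ("B", 2, 64), ("A", 4, 75),
    ("C", 5, 23), ("A", 6, 12), ("A", 2, 33), ("B", 3, 52),
    ("A", 1, 83), ("C", 5, 24), ("A", 6, 74),
])

# col4 = how many rows share this row's col1 value
result = con.execute(
    "SELECT *, COUNT(*) OVER (PARTITION BY col1) AS col4 FROM t"
).fetchall()
```

Every "A" row gets 6, every "B" row 3, every "C" row 2, and no rows are collapsed by the count.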

How to sum 2 columns and add them to the previous row's total in SQL?

I have a table with these rows:
+------+--------+---------+---------+
| ID | Date | Amount1 | Amount2 |
+------+--------+---------+---------+
| 1 | 13 Nov | 8 | 3 |
| 2 | 11 Nov | 5 | 1 |
| 3 | 15 Nov | 0 | 3 |
| 4 | 18 Nov | 5 | 7 |
| 5 | 20 Nov | 10 | 0 |
+------+--------+---------+---------+
I would like a query that produces this result, using the formula:
Total = (Amount1 - Amount2) + previous row's Total
+------+--------+---------+---------+---------+
| ID | Date | Plus | Minus | Total |
+------+--------+---------+---------+---------+
| 2 | 11 Nov | 5 | 1 | 4 |
| 1 | 13 Nov | 8 | 3 | 9 |
| 3 | 15 Nov | 0 | 3 | 6 |
| 4 | 18 Nov | 5 | 7 | 4 |
| 5 | 20 Nov | 10 | 0 | 14 |
+------+--------+---------+---------+---------+
Is there any way to query this without binding the Total to a column in a temporary table?
To get a running total, you can use SUM(columnname) OVER (ORDER BY sortedcolumnname).
To me it's actually a little counterintuitive compared to most windowed functions, as it doesn't have a partition but produces different results over the set of rows. However, it does work.
Here is some somewhat-obfuscated documentation from Microsoft about it.
I think you can therefore use
SELECT mt.[ID],
       mt.[Date],
       mt.[Amount1] AS [Plus],
       mt.[Amount2] AS [Minus],
       SUM(mt.[Amount1] - mt.[Amount2]) OVER (ORDER BY mt.[Date], mt.[ID]) AS Total
FROM mytable mt
ORDER BY mt.[Date],
         mt.[ID];
And here are the results - they match yours.
ID Date Plus Minus Total
2 2020-11-11 5 1 4
1 2020-11-13 8 3 9
3 2020-11-15 0 3 6
4 2020-11-18 5 7 4
5 2020-11-20 10 0 14
You can achieve this using a CTE first, followed by a self join. Note that for id = 3, amount1 - amount2 gives 0 - 3 = -3, so the result below differs from yours from that row on.
DECLARE @t table(id int, dateval date, amount1 int, amount2 int);
INSERT INTO @t
VALUES
(1, '2020-11-13', 8, 3),
(2, '2020-11-11', 5, 1),
(3, '2020-11-15', 0, 3),
(4, '2020-11-18', 5, 7),
(5, '2020-11-20', 10, 0);

;WITH CTE_First AS
(
    SELECT id, dateval, amount1 AS plus, amount2 AS minus, (amount1 - amount2) AS total,
           ROW_NUMBER() OVER (ORDER BY dateval) AS rnk
    FROM @t
)
SELECT c.id, c.dateval, c.plus, c.minus, c.total + ISNULL(c1.total, 0) AS new_total
FROM CTE_First AS c
LEFT OUTER JOIN CTE_First AS c1
    ON c1.rnk = c.rnk - 1;
+----+------------+------+-------+-----------+
| ID | DATEVAL | plus | minus | new_total |
+----+------------+------+-------+-----------+
| 2 | 2020-11-11 | 5 | 1 | 4 |
| 1 | 2020-11-13 | 8 | 3 | 9 |
| 3 | 2020-11-15 | 0 | 3 | 2 |
| 4 | 2020-11-18 | 5 | 7 | -5 |
| 5 | 2020-11-20 | 10 | 0 | 8 |
+----+------------+------+-------+-----------+
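The running-total answer with `SUM(...) OVER (ORDER BY ...)` can be checked end to end; here is a sqlite3 sketch (SQLite in place of SQL Server, so no bracketed identifiers):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE mytable (id INTEGER, date TEXT, amount1 INTEGER, amount2 INTEGER)"
)
con.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", [
    (1, "2020-11-13", 8, 3), (2, "2020-11-11", 5, 1), (3, "2020-11-15", 0, 3),
    (4, "2020-11-18", 5, 7), (5, "2020-11-20", 10, 0),
])

# Running total of (amount1 - amount2) in date order
result = con.execute("""
    SELECT id, date, amount1 AS plus, amount2 AS minus,
           SUM(amount1 - amount2) OVER (ORDER BY date, id) AS total
    FROM mytable
    ORDER BY date, id
""").fetchall()
```

The totals come out as 4, 9, 6, 4, 14 in date order, matching the asker's expected result, which the self-join answer does not reproduce for id 3 onward.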

Is there a way to shuffle rows in a table into distinctive fixed size chunks using SQL only?

I have a very big table (~300 million rows) with the following structure:
my_table(id, group, chunk, new_id), where chunk and new_id are set to NULL.
I want to set the rows of each group to a random chunk with distinct new_id in the chunk. Each chunk should be of fixed size of 100.
For example if group A has 1278 rows, they should go into 13 chunks (0-12), 12 chunks with 100 rows s.t. new_id are in range (0-99) and another single chunk with 78 rows s.t. new_id are in range (0-77).
The organization into chunks and within the chunks should be a random permutation where each row in A is assigned with a unique (chunk, new_id) tuple.
I'm successfully doing it using pandas but it takes hours, mostly due to memory and bandwidth limitations.
Is it possible to execute using only a SQL query?
I'm using postgres 9.6.
You could do this with row_number():
select id, group, rn / 100 chunk, rn % 100 new_id
from (select t.*, row_number() over(order by random()) - 1 rn from mytable t) t
The inner query assigns a random integer number to each record (starting at 0). The outer query does arithmetic to compute the chunk and new id.
If you want an update query:
update mytable t set chunk = x.rn / 3, new_id = x.rn % 3
from (select id, row_number() over(order by random()) - 1 rn from mytable t) x
where x.id = t.id
Demo on DB Fiddle for a dataset of 20 records with chunks of 3 records .
Before:
id | grp | chunk | new_id
-: | --: | ----: | -----:
 1 |   1 |  null |   null
 2 |   2 |  null |   null
 3 |   3 |  null |   null
 4 |   4 |  null |   null
 5 |   5 |  null |   null
 6 |   6 |  null |   null
 7 |   7 |  null |   null
 8 |   8 |  null |   null
 9 |   9 |  null |   null
10 |  10 |  null |   null
11 |  11 |  null |   null
12 |  12 |  null |   null
13 |  13 |  null |   null
14 |  14 |  null |   null
15 |  15 |  null |   null
16 |  16 |  null |   null
17 |  17 |  null |   null
18 |  18 |  null |   null
19 |  19 |  null |   null
20 |  20 |  null |   null
After:
id | grp | chunk | new_id
-: | --: | ----: | -----:
19 | 19 | 0 | 0
11 | 11 | 0 | 1
20 | 20 | 0 | 2
12 | 12 | 1 | 0
14 | 14 | 1 | 1
17 | 17 | 1 | 2
3 | 3 | 2 | 0
8 | 8 | 2 | 1
5 | 5 | 2 | 2
13 | 13 | 3 | 0
10 | 10 | 3 | 1
2 | 2 | 3 | 2
16 | 16 | 4 | 0
18 | 18 | 4 | 1
6 | 6 | 4 | 2
1 | 1 | 5 | 0
15 | 15 | 5 | 1
7 | 7 | 5 | 2
4 | 4 | 6 | 0
9 | 9 | 6 | 1
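A sqlite3 sketch of the SELECT version, with chunks of 3 as in the demo (the column is named `grp` here since `group` is a reserved word; integer division of the 0-based random row number yields the chunk, the remainder the position within it):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, grp INTEGER)")
con.executemany("INSERT INTO mytable VALUES (?, ?)", [(i, i) for i in range(1, 21)])

# Random permutation -> 0-based row number -> chunk of 3 and position inside it
result = con.execute("""
    SELECT id, grp, rn / 3 AS chunk, rn % 3 AS new_id
    FROM (SELECT id, grp, ROW_NUMBER() OVER (ORDER BY random()) - 1 AS rn
          FROM mytable)
""").fetchall()
```

Every run gives a different permutation, but always 20 unique (chunk, new_id) pairs: six full chunks of 3 and one final chunk of 2.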

Deleting recursively in a function (ERROR: query has no destination for result data)

I have this table of relationships (only id_padre and id_hijo are interesting):
id | id_padre | id_hijo | cantidad | posicion
----+----------+---------+----------+----------
0 | | 1 | 1 | 0
1 | 1 | 2 | 1 | 0
2 | 1 | 3 | 1 | 1
3 | 3 | 4 | 1 | 0
4 | 4 | 5 | 0.5 | 0
5 | 4 | 6 | 0.5 | 1
6 | 4 | 7 | 24 | 2
7 | 4 | 8 | 0.11 | 3
8 | 8 | 6 | 0.12 | 0
9 | 8 | 9 | 0.05 | 1
10 | 8 | 10 | 0.3 | 2
11 | 8 | 11 | 0.02 | 3
12 | 3 | 12 | 250 | 1
13 | 12 | 5 | 0.8 | 0
14 | 12 | 6 | 0.8 | 1
15 | 12 | 13 | 26 | 2
16 | 12 | 8 | 0.15 | 3
This table stores the links between nodes (id_padre = parent node and id_hijo = child node).
I'm trying to write a function that deletes rows recursively, starting from a particular row. After deleting it, I check whether any other rows still have the same id_hijo value as the row I just deleted.
If no such rows exist, I must delete all the rows whose id_padre equals the id_hijo of the deleted row.
For example, if I start by deleting the row where id_padre = 3 and id_hijo = 4, then I delete this row:
id | id_padre | id_hijo | cantidad | posicion
----+----------+---------+----------+----------
3 | 3 | 4 | 1 | 0
and the table remains like that:
id | id_padre | id_hijo | cantidad | posicion
----+----------+---------+----------+----------
0 | | 1 | 1 | 0
1 | 1 | 2 | 1 | 0
2 | 1 | 3 | 1 | 1
4 | 4 | 5 | 0.5 | 0
5 | 4 | 6 | 0.5 | 1
6 | 4 | 7 | 24 | 2
7 | 4 | 8 | 0.11 | 3
8 | 8 | 6 | 0.12 | 0
9 | 8 | 9 | 0.05 | 1
10 | 8 | 10 | 0.3 | 2
11 | 8 | 11 | 0.02 | 3
12 | 3 | 12 | 250 | 1
13 | 12 | 5 | 0.8 | 0
14 | 12 | 6 | 0.8 | 1
15 | 12 | 13 | 26 | 2
16 | 12 | 8 | 0.15 | 3
Because there is no longer any row with id_hijo = 4, I delete the rows where id_padre = 4, and so on, recursively. (In this example the process ends here.)
I have tried this function (the function calls itself):
CREATE OR REPLACE FUNCTION borrar(integer, integer) RETURNS VOID AS
$BODY$
DECLARE
    padre ALIAS FOR $1;
    hijo  ALIAS FOR $2;
    r     copia_rel%rowtype;
BEGIN
    DELETE FROM copia_rel WHERE id_padre = padre AND id_hijo = hijo;
    IF NOT EXISTS (SELECT id_hijo FROM copia_rel WHERE id_hijo = hijo) THEN
        FOR r IN SELECT * FROM copia_rel WHERE id_padre = hijo LOOP
            RAISE NOTICE 'Selecciono: %,%', r.id_padre, r.id_hijo; -- for debugging
            SELECT borrar(r.id_padre, r.id_hijo);
        END LOOP;
    END IF;
END;
$BODY$
LANGUAGE plpgsql;
But I get this error:
ERROR: query has no destination for result data
I know that there are specific recursive techniques in PostgreSQL with CTEs. I have used them to traverse my graph, but I don't know how I could use them in this case.
The error is due to the SELECT used to call the function recursively. PostgreSQL wants to put the results somewhere but is not told where.
If you want to run a function and discard results use PERFORM instead of SELECT in PL/PgSQL functions.
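The intended control flow (recursive call whose result is discarded, i.e. PERFORM instead of SELECT) can be sketched outside PL/pgSQL. Here the recursion is plain Python over an in-memory SQLite copy of `copia_rel`, keeping only the columns the logic needs:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE copia_rel (id INTEGER, id_padre INTEGER, id_hijo INTEGER)")
con.executemany("INSERT INTO copia_rel VALUES (?, ?, ?)", [
    (0, None, 1), (1, 1, 2), (2, 1, 3), (3, 3, 4), (4, 4, 5), (5, 4, 6),
    (6, 4, 7), (7, 4, 8), (8, 8, 6), (9, 8, 9), (10, 8, 10), (11, 8, 11),
    (12, 3, 12), (13, 12, 5), (14, 12, 6), (15, 12, 13), (16, 12, 8),
])

def borrar(con, padre, hijo):
    # Delete the given link
    con.execute("DELETE FROM copia_rel WHERE id_padre = ? AND id_hijo = ?",
                (padre, hijo))
    # If nothing else still points at the child, delete its outgoing links too
    still_linked = con.execute(
        "SELECT COUNT(*) FROM copia_rel WHERE id_hijo = ?", (hijo,)
    ).fetchone()[0]
    if still_linked == 0:
        children = con.execute(
            "SELECT id_padre, id_hijo FROM copia_rel WHERE id_padre = ?", (hijo,)
        ).fetchall()
        for p, h in children:
            borrar(con, p, h)

borrar(con, 3, 4)
```

Starting from (3, 4) this removes that row plus the four rows with id_padre = 4; nodes 5, 6 and 8 survive because other parents still link to them, exactly as in the example.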

Merge queries into one query

I have the two following tables (with some sample data):
LOGS:
ID | SETID | DATE
========================
1 | 1 | 2010-02-25
2 | 2 | 2010-02-25
3 | 1 | 2010-02-26
4 | 2 | 2010-02-26
5 | 1 | 2010-02-27
6 | 2 | 2010-02-27
7 | 1 | 2010-02-28
8 | 2 | 2010-02-28
9 | 1 | 2010-03-01
STATS:
ID | OBJECTID | FREQUENCY | STARTID | ENDID
=============================================
1 | 1 | 0.5 | 1 | 5
2 | 2 | 0.6 | 1 | 5
3 | 3 | 0.02 | 1 | 5
4 | 4 | 0.6 | 2 | 6
5 | 5 | 0.6 | 2 | 6
6 | 6 | 0.4 | 2 | 6
7 | 1 | 0.35 | 3 | 7
8 | 2 | 0.6 | 3 | 7
9 | 3 | 0.03 | 3 | 7
10 | 4 | 0.6 | 4 | 8
11 | 5 | 0.6 | 4 | 8
7 | 1 | 0.45 | 5 | 9
8 | 2 | 0.6 | 5 | 9
9 | 3 | 0.02 | 5 | 9
Every day, new logs are analyzed for different sets of objects and stored in the LOGS table.
Among other processes, some statistics are computed on the objects contained in these sets, and the results are stored in the STATS table. These statistics are computed across several logs (identified by the STARTID and ENDID columns).
So, what SQL query would give me the latest computed stats for all the objects, with the corresponding log dates?
In the given example, the result rows would be:
OBJECTID | SETID | FREQUENCY | STARTDATE | ENDDATE
======================================================
1 | 1 | 0.45 | 2010-02-27 | 2010-03-01
2 | 1 | 0.6 | 2010-02-27 | 2010-03-01
3 | 1 | 0.02 | 2010-02-27 | 2010-03-01
4 | 2 | 0.6 | 2010-02-26 | 2010-02-28
5 | 2 | 0.6 | 2010-02-26 | 2010-02-28
So, the most recent stats for set 1 are computed with logs from Feb 27 to Mar 1, whereas the stats for set 2 are computed from Feb 26 to Feb 28.
Object 6 is not in the result rows, as there is no stat for it within the most recent period.
One last thing: I use MySQL.
Any idea?
Does this query fit your question?
SELECT objectid, l1.setid, frequency, l1.date as startdate, l2.date as enddate
FROM `logs` l1
INNER JOIN `stats` s ON (s.startid=l1.id)
INNER JOIN `logs` l2 ON (l2.id=s.endid)
INNER JOIN
(
SELECT setid, MAX(date) as date
FROM `logs` l
INNER JOIN `stats` s ON (s.startid=l.id)
GROUP BY setid
) d ON (d.setid=l1.setid and d.date=l1.date)
ORDER BY objectid
If there are no ties, you can use a filtering join. For example:
select stats.objectid
, stats.frequency
, startlog.setid
, startlog.date
, endlog.date
from stats
join logs startlog
on startlog.id = stats.startid
join logs endlog
on endlog.id = stats.endid
join (
select objectid, max(endlog.date) as maxenddate
from stats
join logs endlog
on endlog.id = stats.endid
group by objectid
) filter
on stats.objectid = filter.objectid
and filter.maxenddate = endlog.date
order by stats.objectid
Your example results appear to be slightly off, for example there is no row for objectid 5 where the frequency equals 0.35.
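The filtering join from the second answer can be exercised with a sqlite3 sketch (SQLite standing in for MySQL; the `filter` alias is renamed to `f`, since FILTER is a keyword in some engines). One caveat: because the filter picks each object's own latest end date, object 6 does appear in this output, with its only stat:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE logs (id INTEGER, setid INTEGER, date TEXT)")
con.execute("CREATE TABLE stats (id INTEGER, objectid INTEGER, frequency REAL,"
            " startid INTEGER, endid INTEGER)")
con.executemany("INSERT INTO logs VALUES (?, ?, ?)", [
    (1, 1, "2010-02-25"), (2, 2, "2010-02-25"), (3, 1, "2010-02-26"),
    (4, 2, "2010-02-26"), (5, 1, "2010-02-27"), (6, 2, "2010-02-27"),
    (7, 1, "2010-02-28"), (8, 2, "2010-02-28"), (9, 1, "2010-03-01"),
])
con.executemany("INSERT INTO stats VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 0.5, 1, 5), (2, 2, 0.6, 1, 5), (3, 3, 0.02, 1, 5),
    (4, 4, 0.6, 2, 6), (5, 5, 0.6, 2, 6), (6, 6, 0.4, 2, 6),
    (7, 1, 0.35, 3, 7), (8, 2, 0.6, 3, 7), (9, 3, 0.03, 3, 7),
    (10, 4, 0.6, 4, 8), (11, 5, 0.6, 4, 8),
    (7, 1, 0.45, 5, 9), (8, 2, 0.6, 5, 9), (9, 3, 0.02, 5, 9),  # ids repeat as in the question
])

# For each object, keep only the stat whose end-log date is the latest
result = con.execute("""
    SELECT stats.objectid, stats.frequency, startlog.setid,
           startlog.date AS startdate, endlog.date AS enddate
    FROM stats
    JOIN logs startlog ON startlog.id = stats.startid
    JOIN logs endlog   ON endlog.id   = stats.endid
    JOIN (SELECT objectid, MAX(endlog.date) AS maxenddate
          FROM stats
          JOIN logs endlog ON endlog.id = stats.endid
          GROUP BY objectid) f
      ON stats.objectid = f.objectid AND f.maxenddate = endlog.date
    ORDER BY stats.objectid
""").fetchall()
```

Objects 1-3 come back with the Feb 27 to Mar 1 window and objects 4-5 with Feb 26 to Feb 28, matching the asker's expected rows; excluding object 6 would need an extra condition on a global cutoff date.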