Complex aggregation with SELECT - SQL

I have a table in the DB like this (the ID column is not a unique UUID, just an object ID; the primary key still exists but is omitted from the example):
ID | Option | Value | Number of searches | Search date
---+--------+-------+--------------------+------------
 1 | abc    | a     | 1                  | 2021-01-01
 1 | abc    | b     | 2                  | 2021-01-01
 1 | abc    | a     | 3                  | 2021-01-02
 1 | abc    | b     | 4                  | 2021-01-02
 1 | def    | a     | 5                  | 2021-01-01
 1 | def    | b     | 6                  | 2021-01-01
 1 | def    | a     | 7                  | 2021-01-02
 1 | def    | b     | 8                  | 2021-01-02
 2 | ...    | ...   | ...                | ...
...| ...    | ...   | ...                | ...
 N | xyz    | xyz   | M                  | any date
I want to get some kind of statistical report like:
ID | Total searches | Option | Total number of option searches | Value | Total value searches
---+----------------+--------+---------------------------------+-------+---------------------
 1 | 36             | abc    | 10                              | a     | 4
   |                |        |                                 | b     | 6
   |                | def    | 26                              | a     | 12
   |                |        |                                 | b     | 14
Is it possible in some way? UNION isn't working here, and I have no idea how a GROUP BY clause could solve it either.
I can do it easily in Kotlin by requesting everything and aggregating into classes like these:
data class SearchAggregate(
    val id: String,
    val options: List<Option>,
    val values: List<Value>
)

data class Option(
    val name: String,
    val totalSearches: Long
)

data class Value(
    val name: String,
    val totalSearches: Long
)
and then export it to a file, but here I have to get the data with SQL.

You can use the COUNT() window function in a subquery to preprocess the data. For example:
select
  id,
  max(total_searches) as total_searches,
  option,
  max(total_options) as total_options,
  value,
  max(total_values) as total_values
from (
  select
    id,
    count(*) over(partition by id) as total_searches,
    option,
    count(*) over(partition by id, option) as total_options,
    value,
    count(*) over(partition by id, option, value) as total_values
  from t
) x
group by id, option, value
See running example at DB Fiddle #1.
Or you can use a shorter query, as in:
select
  id,
  sum(cnt) over(partition by id) as total_searches,
  option,
  sum(cnt) over(partition by id, option) as total_options,
  value,
  cnt
from (
  select id, option, value, count(*) as cnt from t group by id, option, value
) x
See running example at DB Fiddle #2.

The first option is to use ROLLUP, as that is the intended SQL pattern. It doesn't give you the results in the format you asked for, but that's a reflection of the format you asked for not being normalised.
SELECT
  id,
  option,
  value,
  SUM(`Number of searches`) AS total_searches
FROM
  your_table
GROUP BY
  ROLLUP(
    id,
    option,
    value
  )
It's concise, standard practice, SQL-friendly, etc.
Thinking in terms of these normalised patterns will make your use of SQL much more effective.
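If you need to tell the subtotal rows apart from the detail rows in that ROLLUP output, the standard GROUPING() function can flag them. A minimal sketch, assuming your dialect supports GROUPING() alongside ROLLUP (SQL Server, PostgreSQL, Oracle; MySQL 8 uses WITH ROLLUP instead):
SELECT
  id,
  option,
  value,
  SUM(`Number of searches`) AS total_searches,
  GROUPING(value)  AS is_option_subtotal,  -- 1 on rows where value has been rolled up
  GROUPING(option) AS is_id_subtotal       -- 1 on rows where option has been rolled up
FROM
  your_table
GROUP BY
  ROLLUP(id, option, value)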
That said, you CAN use SQL to aggregate and restructure the results. You get the structure you want, but with more code and increased maintenance, lower flexibility, etc.
SELECT
  id,
  SUM(SUM(`Number of searches`)) OVER (PARTITION BY id) AS total_by_id,
  option,
  SUM(SUM(`Number of searches`)) OVER (PARTITION BY id, option) AS total_by_id_option,
  value,
  SUM(`Number of searches`) AS total_by_id_option_value
FROM
  your_table
GROUP BY
  id,
  option,
  value
That doesn't leave blanks where you have them, but that's because to do so is a SQL Anti-Pattern, and should be handled in your presentation layer, not in the database.
Oh, and please don't use column names with spaces; stick to alphanumeric characters with underscores.
Demo : https://www.db-fiddle.com/f/fX3tNL82gqgVCRoa3v6snP/5


How to use a GROUP BY but still access every numeric value cleverly

I need to "group by" my data to distinguish between tests (each test has a specific id, name and temperature) and to calculate their count, standard deviation, etc. But I also need access to every raw data value from each group, for further index calculations that I do in a Python script.
I have found two solutions to this problem, but both seem non-optimal/flawed:
1) Using listagg to store every raw value that was grouped into a single string per row. It does the job, but it is not optimized: I concatenate multiple float values into a giant string that I will immediately de-concatenate and convert back to floats. That seems unnecessary and costly.
2) Removing the group by entirely and doing the count and standard deviation through partitioning. But that seems even worse to me. I don't know if PL/SQL/Oracle optimizes this; it could be calculating the same count and standard deviation for every line (I don't know how to check this). The query result also becomes messy: since there is no 'group by' anymore, I have to add multiple checks in my Python script in order to differentiate each test's data (specific id, name and temperature).
I think that my first solution can be improved, but I don't know how. How can I use a group by but still access every numeric value cleverly?
A function similar to listagg but with a collection/array output type instead of a string output type could maybe do the trick (a sort of 'array_agg' compatible with Oracle), but I don't know of any.
EDIT:
The sample data is complex and probably restricted to company viewing, but I can show you my simplified query for solution 1):
SELECT
    rav.rav_testid as test_id,
    tte.tte_testname as test_name,
    tsc.tsc_temperature as temperature,
    listagg(rav.rav_value, ' ') WITHIN GROUP (ORDER BY rav.rav_value) as all_specific_test_values,
    COUNT(rav.rav_value) as n,
    STDDEV(rav.rav_value) as sigma
FROM
    ...
    (8 inner joins)
GROUP BY
    rav.rav_testid, tte.tte_testname, tsc.tsc_temperature
ORDER BY
    rav.RAV_testid, tte.tte_testname, spd.SPD_SPLITNAMEINTERNAL, tsc.tsc_temperature
The result looks like:
test_id | test_name | temperature | all_specific_test_values | n | sigma
-------------------------------------------------------------------------
6001 |VADC_A(...) | -40 | ,8094034194946289 ,8(...)| 58 | 0,54
6001 |VADC_A(...) | 25 | ,5054857852946545 ,6(...)| 56 | 0,24
6001 |VADC_A(...) | 150 | ,8625754277452524 ,4(...)| 56 | 0,26
6002 |VADC_B(...) | -40 | ,9874514651548454 ,5(...)| 57 | 0,44
I think you want analytic functions:
select t.*,
count(*) over (partition by test) as cnt,
avg(value) over (partition by test) as avg_value,
stddev(value) over (partition by test) as stddev_value
from t;
This adds additional columns on each row.
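Since every row of a test then carries the same cnt / avg / stddev, the script only has to read the statistics off the first row it sees for each test; adding an ORDER BY on the partition key makes that grouping explicit. A sketch, reusing the simplified names from the query above:
select t.*,
       count(*)      over (partition by test) as cnt,
       avg(value)    over (partition by test) as avg_value,
       stddev(value) over (partition by test) as stddev_value
from t
order by test, value;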
I would suggest going with @Gordon_Linoff's solution. That is likely the most standard solution.
If you want to go with a less standard solution, you can have a group by that returns a collection as one of the columns. Presumably, your script could iterate through that collection, though it might take a bit of work in the script to do that.
create type num_tbl as table of number;
/
create table foo (
    grp integer,
    val number
);
insert into foo values( 1, 1.1 );
insert into foo values( 2, 1.2 );
insert into foo values( 1, 1.3 );
insert into foo values( 2, 1.4 );
select grp, avg(val), cast( collect( val ) as num_tbl )
from foo
group by grp
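If you would rather expand that collection back into rows in SQL itself (instead of iterating over it in the script), Oracle's TABLE() operator can unnest it. A sketch using the num_tbl type and foo table defined above:
select x.grp, t.column_value as val
from (
    select grp, cast(collect(val) as num_tbl) as vals
    from foo
    group by grp
) x,
table(x.vals) t;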

BigQuery full table to partition

I have 340 GB of data in one table (270 days' worth of data). I am now planning to move this data to a partitioned table.
That means I will have 270 partitions. What is the best way to move this data to the partitioned table?
I don't want to run 270 queries, which would be a very costly operation, so I am looking for an optimized solution.
I have multiple tables like this. I need to migrate all these tables to partitioned tables.
Thanks,
I see three options:

Direct Extraction out of the original table:
  Actions (how many queries to run) = Days [to extract] = 270
  Full Scans (how much data is scanned, measured in full scans of the original table) = Days = 270
  Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 270 = $459.00

Hierarchical (recursive) Extraction: (described in Mosha's answer)
  Actions = 2^log2(Days) - 2 = 510
  Full Scans = 2*log2(Days) = 18
  Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 18 = $30.60

Clustered Extraction: (I will describe it in a sec)
  Actions = Days + 1 = 271
  Full Scans = [always] 2 = 2
  Cost, $ = $5 x Table Size, TB x Full Scans = $5 x 0.34 x 2 = $3.40
Summary

Method                                Actions   Total Full Scans   Total Cost
Direct Extraction                     270       270                $459.00
Hierarchical (recursive) Extraction   510       18                 $30.60
Clustered Extraction                  271       2                  $3.40
Definitely, for most practical purposes Mosha's solution is the way to go (I use it in most such cases).
It is relatively simple and straightforward.
Even though you need to run the query 510 times, the query is "relatively" simple and the orchestration logic is easy to implement with whatever client you usually use.
And the cost saving is quite visible: from $460 down to $31, almost 15 times less!
In case you
a) want to lower the cost even further, by yet another 9 times (so it will be 135 times lower in total),
b) and like having fun and more challenges,
take a look at the third option.
“Clustered Extraction” Explanation
Idea / Goal:
Step 1
We want to transform the original table into another [single] table with 270 columns, one column per day.
Each column will hold the serialized rows of the respective day from the original table.
The total number of rows in this new table will be equal to the number of rows in the "heaviest" day.
This will require just one query (see example below) with one full scan
Step 2
After this new table is ready, we will extract the data day by day, querying ONLY the respective column and writing into the final daily table (the daily tables have the very same schema as the original table, and all of them can be pre-created).
This will require 270 queries to be run, with scans approximately equivalent in total (this really depends on how complex your schema is, so it can vary) to one full scan of the original table.
While querying a column, we will need to de-serialize the row's value and parse it back to the original schema.
Very simplified example: (using BigQuery Standard SQL here)
The purpose of this example is just to give direction, in case you find the idea interesting.
Serialization / de-serialization is extremely simplified to keep the focus on the idea and less on the particular implementation, which can differ from case to case (it mostly depends on the schema).
So, assume the original table (theTable) looks somewhat like below:
SELECT 1 AS id, "101" AS x, 1 AS ts UNION ALL
SELECT 2 AS id, "102" AS x, 1 AS ts UNION ALL
SELECT 3 AS id, "103" AS x, 1 AS ts UNION ALL
SELECT 4 AS id, "104" AS x, 1 AS ts UNION ALL
SELECT 5 AS id, "105" AS x, 1 AS ts UNION ALL
SELECT 6 AS id, "106" AS x, 2 AS ts UNION ALL
SELECT 7 AS id, "107" AS x, 2 AS ts UNION ALL
SELECT 8 AS id, "108" AS x, 2 AS ts UNION ALL
SELECT 9 AS id, "109" AS x, 2 AS ts UNION ALL
SELECT 10 AS id, "110" AS x, 3 AS ts UNION ALL
SELECT 11 AS id, "111" AS x, 3 AS ts UNION ALL
SELECT 12 AS id, "112" AS x, 3 AS ts UNION ALL
SELECT 13 AS id, "113" AS x, 3 AS ts UNION ALL
SELECT 14 AS id, "114" AS x, 3 AS ts UNION ALL
SELECT 15 AS id, "115" AS x, 3 AS ts UNION ALL
SELECT 16 AS id, "116" AS x, 3 AS ts UNION ALL
SELECT 17 AS id, "117" AS x, 3 AS ts UNION ALL
SELECT 18 AS id, "118" AS x, 3 AS ts UNION ALL
SELECT 19 AS id, "119" AS x, 4 AS ts UNION ALL
SELECT 20 AS id, "120" AS x, 4 AS ts
Step 1 – transform table and write result into tempTable
SELECT
  num,
  MAX(IF(ts=1, ser, NULL)) AS ts_1,
  MAX(IF(ts=2, ser, NULL)) AS ts_2,
  MAX(IF(ts=3, ser, NULL)) AS ts_3,
  MAX(IF(ts=4, ser, NULL)) AS ts_4
FROM (
  SELECT
    ts,
    CONCAT(CAST(id AS STRING), "|", x, "|", CAST(ts AS STRING)) AS ser,
    ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) num
  FROM theTable
)
GROUP BY num
tempTable will look like below:
num ts_1 ts_2 ts_3 ts_4
1 1|101|1 6|106|2 10|110|3 19|119|4
2 2|102|1 7|107|2 11|111|3 20|120|4
3 3|103|1 8|108|2 12|112|3 null
4 4|104|1 9|109|2 13|113|3 null
5 5|105|1 null 14|114|3 null
6 null null 15|115|3 null
7 null null 16|116|3 null
8 null null 17|117|3 null
9 null null 18|118|3 null
Here, I am using simple concatenation for serialization
Step 2 - extract rows for a specific day and write the output to the respective daily table
Please note: in the example below we are extracting rows for ts = 2; this corresponds to column ts_2
SELECT
  r[OFFSET(0)] AS id,
  r[OFFSET(1)] AS x,
  r[OFFSET(2)] AS ts
FROM (
  SELECT SPLIT(ts_2, "|") AS r
  FROM tempTable
  WHERE NOT ts_2 IS NULL
)
The result will look like below (which is expected):
id x ts
6 106 2
7 107 2
8 108 2
9 109 2
I wish I had more time to write this up, so don't judge too harshly if something is missing; this is more of a directional answer. At the same time the example is pretty reasonable, and if you have a plain, simple schema, almost no extra thinking is required. Of course, with records and nested stuff in the schema, the most challenging part is the serialization / de-serialization, but that's where the fun is, along with the extra $ savings.
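For schemas with nested or repeated fields, one possible variation (my own suggestion, not something the answer above shows) is to let BigQuery do the serialization with TO_JSON_STRING() instead of a hand-rolled CONCAT; the inner query of Step 1 would then look roughly like below, and Step 2 would parse the fields back out with JSON_EXTRACT_SCALAR():
SELECT
  ts,
  TO_JSON_STRING(t) AS ser,  -- whole row serialized as one JSON string
  ROW_NUMBER() OVER(PARTITION BY ts ORDER BY id) AS num
FROM theTable t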
I will add a fourth option to @Mikhail's answer
DML QUERY
Action = 1 query to run
Full scans = 1
Cost = $5 x 0.34 = $1.70 (270 times cheaper than solution #1 \o/)
With the new DML feature of BigQuery you can convert a non-partitioned table to a partitioned one while doing only one full scan of the source table.
To illustrate my solution I will use one of BQ's public tables, namely bigquery-public-data:hacker_news.comments. Below is the table's schema:
name    | type      | description
--------|-----------|----------------------------------------------------
id      | INTEGER   | ...
by      | STRING    | ...
author  | STRING    | ...
...     |           |
time_ts | TIMESTAMP | human readable timestamp in UTC YYYY-MM-DD hh:mm:ss /!\ /!\ /!\
...     |           |
We are going to partition the comments table based on time_ts
#standardSQL
CREATE TABLE my_dataset.comments_partitioned
PARTITION BY DATE(time_ts)
AS
SELECT *
FROM `bigquery-public-data.hacker_news.comments`
I hope it helps :)
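Once the CREATE TABLE ... PARTITION BY statement above has run, each day can be read back cheaply by filtering on the partitioning expression; just an illustration (the date is arbitrary):
#standardSQL
SELECT *
FROM my_dataset.comments_partitioned
WHERE DATE(time_ts) = DATE '2015-01-01'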
If your data was in sharded tables (i.e. with a YYYYmmdd suffix), you could have used the "bq partition" command. But with data in a single table, you will have to scan it multiple times, applying different WHERE clauses on your partition key column.
The only optimization I can think of is to do it hierarchically, i.e. instead of 270 queries which would do 270 full table scans, first split the table in half, then each half in half, etc. This way you will need to pay for 2*log_2(270) ≈ 2*9 = 18 full scans.
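As an illustration of one split step (the table names, the ts_column partition key and the cutoff date below are placeholders, not from the original question), each half is just a filtered copy written out via a destination table, and the process is then repeated on each half:
#standardSQL
-- first half of the date range, written to dataset.original_h1 via a destination table
SELECT * FROM dataset.original WHERE ts_column <  TIMESTAMP '2017-01-01';

-- second half, written to dataset.original_h2; then split each half again
SELECT * FROM dataset.original WHERE ts_column >= TIMESTAMP '2017-01-01';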
Once the conversion is done - all the temporary tables can be deleted to eliminate extra storage costs.

How to write an SQL query to have alternating pattern between rows (like -1 and 1)?

I cannot get an alternating pattern of 1 and -1 with my database.
This is what I am trying to get:
ID Purpose Date Val
1 Derp 4/1/1969 1
1 Derp 4/1/1969 -1
2 Derp 4/2/2011 1
2 Derp 4/2/2011 -1
From a database that is something like
ID Purpose Date
1 Derp 4/1/1969
1 Herp 4/1/1911
2 Woot 4/2/1311
2 Wall 4/2/211
Here is my attempt:
SELECT
     ID
    ,Purpose
    ,Date
    ,Val as 1
FROM (
    SELECT FIRST(Purpose)
    FROM DerpTable WHERE Purpose LIKE '%DERP%'
    GROUP BY ID, DATE) as HerpTable, DerpTable
WHERE HerpTable.ID = DerpTable.ID AND DerpTable.ID = HerpTable.ID
This query does not work for me because SSMS does not recognize 'FIRST' or 'FIRST_VALUE' as built-in functions. Thus, I have no way of numbering the first incidence of Derp and giving it a value.
Problems:
I am using SQL Server 2012 and thus cannot use FIRST.
I tried using LAST_VALUE and FIRST_VALUE as seen here, but I get errors indicating that the function is not found.
I have tried a bunch of SQL queries and have been staring at the MSDN T-SQL help pages.
What I need is a fresh perspective and assistance. Am I making this too hard?
Use a subquery along with ROW_NUMBER and the modulo operator:
select
    ID,
    Purpose,
    Date,
    case when rownum % 2 = 0 then 1 else -1 end as Val
from (
    SELECT
         ID
        ,Purpose
        ,Date
        ,ROW_NUMBER() over (order by ID) as rownum
    FROM (
        SELECT
            ID,
            Purpose,
            Date
        FROM DerpTable WHERE Purpose LIKE '%DERP%'
        GROUP BY ID, DATE) as HerpTable, DerpTable
    WHERE HerpTable.ID = DerpTable.ID AND DerpTable.ID = HerpTable.ID
) [t1]
ROW_NUMBER will assign a value to each row; in this case it's an incrementing value. Taking the modulus with 2 allows us to check whether it's even or odd and assign 1 or -1.
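A tiny self-contained illustration of the modulo trick (runs on SQL Server 2012 without any table; the sample numbers are arbitrary):
SELECT n,
       CASE WHEN n % 2 = 0 THEN 1 ELSE -1 END AS Val
FROM (VALUES (1), (2), (3), (4)) AS t(n);
-- returns -1, 1, -1, 1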
Note: I don't know if this query will run since I don't know the architecture of your database, but the idea should get you there.
You can use first_value() in SQL Server 2012. I'm not sure what the WHERE condition is in your query, but the following should return your desired results:
SELECT ID,
       FIRST_VALUE(Purpose) OVER (PARTITION BY ID ORDER BY DATE) as Purpose,
       DATE,
       2 * (ROW_NUMBER() OVER (PARTITION BY ID ORDER BY DATE) % 2) - 1 as Val
FROM DERPTABLE
Why not add an incremental column, update the table using modulo to determine if it's even or odd, then drop the column?
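A rough sketch of that idea in T-SQL (assuming the table already has a Val column to update; the temporary IDENTITY column is dropped afterwards, and the order in which it numbers existing rows is not guaranteed):
ALTER TABLE DerpTable ADD RowNum INT IDENTITY(1, 1);

UPDATE DerpTable
SET Val = CASE WHEN RowNum % 2 = 0 THEN 1 ELSE -1 END;

ALTER TABLE DerpTable DROP COLUMN RowNum;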

Removing duplicate rows by selecting only those with minimum length value

I have a table with two string columns: Name and Code. Code is unique, but Name is not. Sample data:
Name Code
-------- ----
Jacket 15
Jeans 003
Jeans 26
I want to select unique rows with the smallest Code value, but not in terms of numeric value; rather, the length of the string. Of course this does not work:
SELECT Name, Min(Code) as Code
FROM Clothes
GROUP BY Name, Code
The above code will return one row for Jeans like such:
Jeans | 003
That is correct, because as a number, 003 is less than 26. But not in my application, which cares about the length of the value, not the actual value. A value with a length of three characters is greater than a value with two characters. I actually need it to return this:
Jeans | 26
Because the length of 26 is shorter than the length of 003.
So how do I write SQL code that will select row that has the code with the minimum length, not the actual minimum value? I tried doing this:
SELECT Name, Min(Len(Code)) as Code
FROM Clothes
GROUP BY Name, Code
The above returns me only a single character so I end up with this:
Jeans | 2
;WITH cte AS
(
    SELECT Name, Code,
           rn = ROW_NUMBER() OVER (PARTITION BY Name ORDER BY LEN(Code))
    FROM dbo.Clothes
)
SELECT Name, Code
FROM cte
WHERE rn = 1;
SQLfiddle demo
If you have multiple values of code that share the same length, the choice will be arbitrary, so you can break the tie by adding an additional order by clause, e.g.
OVER (PARTITION BY Name ORDER BY LEN(Code), CONVERT(INT, Code) DESC)
SQLfiddle demo
Try this:
select clothes.name, MIN(code)
from clothes
inner join
(
    SELECT Name, Min(Len(Code)) as CodeLen
    FROM clothes
    GROUP BY Name
) results
    on clothes.name = results.name
    and LEN(clothes.code) = results.CodeLen
group by clothes.name
It sounds like you are trying to sort on the numeric value of the Code field. If so, the correct approach would be to cast it to INT first, and use that for sorting/min functions (in a subquery), then select the original code in your main query clause.
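A minimal sketch of that approach against the Clothes table above (assuming Code always contains only digits, so the CAST cannot fail):
SELECT c.Name, c.Code
FROM Clothes c
INNER JOIN (
    SELECT Name, MIN(CAST(Code AS INT)) AS MinCode
    FROM Clothes
    GROUP BY Name
) m
    ON m.Name = c.Name
   AND CAST(c.Code AS INT) = m.MinCode;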

How to select one tuple in rows based on a variable field value

I'm quite new to SQL and I'd like to write a SELECT statement that retrieves only the first row of a set, based on a column value. I'll try to make it clearer with a table example.
Here is my table data:
chip_id | sample_id
-------------------
1 | 45
1 | 55
1 | 5986
2 | 453
2 | 12
3 | 4567
3 | 9
I'd like a SELECT statement that fetches one line for each of chip_id = 1, 2, 3.
Like this:
chip_id | sample_id
-------------------
1 | 45 or 55 or whatever
2 | 12 or 453 ...
3 | 9 or ...
How can I do this?
Thanks
I'd probably:
- set a variable = 0
- order your table by chip_id
- read the table in row by row
- if table[row] > variable, store table[row] in a result array and increment the variable
- loop till done
- return your result array
Though depending on your DB, query and versions, you'll probably get unpredictable/unreliable results.
You can get one value using row_number():
select chip_id, sample_id
from (select chip_id, sample_id,
             row_number() over (partition by chip_id order by rand()) as seqnum
      from your_table
     ) t
where seqnum = 1;
This returns a random value. In SQL, tables are inherently unordered, so there is no concept of "first". You need an auto incrementing id or creation date or some way of defining "first" to get the "first".
If you have such a column, then replace rand() with the column.
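For example, with a hypothetical created_at column defining "first" (both created_at and the your_table name are assumptions for illustration):
select chip_id, sample_id
from (select chip_id, sample_id,
             row_number() over (partition by chip_id order by created_at) as seqnum
      from your_table
     ) t
where seqnum = 1;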
Provided I understood your output, if you are using PostgreSQL 9, you can use this:
SELECT chip_id,
       string_agg(sample_id::text, ' or ')
FROM your_table
GROUP BY chip_id
You need to group your data with a GROUP BY query.
When you group, generally you want the max, the min, or some other value to represent your group. You can do sums, counts, all kinds of group operations.
For your example, you don't seem to want a specific group operation, so the query could be as simple as this one :
SELECT chip_id, MAX(sample_id)
FROM table
GROUP BY chip_id
This way you are retrieving the maximum sample_id for each chip_id.