Is there a way to normalize a column in ClickHouse?
I was trying to do it by getting the column into an array via groupArray and then using arrayMap with a lambda function,
arrayMap(x -> (x - minArray(c)) / (maxArray(c) - minArray(c)), c)
to normalize the data in the array.
But it seems a little clunky, because it requires a subquery that repeats the actual query and then a JOIN back to it.
So, is there a better solution to it?
Hmm... just try using the standard aggregate functions as window functions, like this:
SELECT c, (c - min(c) OVER ()) / (max(c) OVER () - min(c) OVER ()) AS normalized_c FROM table
(With a plain GROUP BY c, min(c) and max(c) would collapse to c itself within each group, so the window form is what you want here.)
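Alternatively, the groupArray route from the question can be done in a single pass without the self-join. A minimal sketch, assuming the arrayMin/arrayMax functions available in newer ClickHouse releases (t and value are placeholder names):

-- Sketch only: t and value are placeholders for the real table/column.
SELECT arrayMap(x -> (x - arrayMin(c)) / (arrayMax(c) - arrayMin(c)), c) AS normalized
FROM (SELECT groupArray(value) AS c FROM t);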
Related
I'm trying to roll up the distinct non-null values of timestamps stored in a PostgreSQL 9.6 database column.
So given a table containing the following:
date_array
------------------------
{2019-10-21 00:00:00.0}
{2019-08-06 00:00:00.0,2019-08-05 00:00:00.0}
{2019-08-05 00:00:00.0}
(null)
{2019-08-01 00:00:00.0,2019-08-06 00:00:00.0,null}
The desired result would be:
{2019-10-21 00:00:00.0, 2019-08-06 00:00:00.0, 2019-08-05 00:00:00.0, 2019-08-01 00:00:00.0}
The arrays can be different sizes, so most solutions I've tried end up running into this error:
SQL State: 2202E
ERROR: cannot accumulate arrays of different dimensionality
Some other caveats:
The arrays can be null, and the arrays can contain a null. They happen to be timestamps of just dates (e.g. without time or timezone). But in trying to simplify the problem, I've had no luck reproducing it with the sample data changed to strings (e.g. {foo, bar, (null)}, {foo,baz}), just to focus on the problem and rule out any issues I miss or misunderstand about timestamps without time zone.
This following SQL is the closest I've come (it resolves all but the different dimensionality issues):
SELECT
  ARRAY_REMOVE(
    ARRAY(
      SELECT DISTINCT UNNEST(
        ARRAY_AGG(
          CASE WHEN ARRAY_NDIMS(example.date_array) > 0
                    AND example.date_array IS NOT NULL
               THEN example.date_array
               ELSE '{null}'
          END))),
    NULL) AS actualDates
FROM example;
I created the following DB fiddle with sample data that illustrates the problem if the above is lacking: https://www.db-fiddle.com/f/8m469XTDmnt4iRkc5Si1eS/0
Additionally, I've perused Stack Overflow on the issue (as well as the PostgreSQL documentation), and there are similar questions with answers, but I've found none articulating the same problem I'm having.
Use unnest() in FROM clause (in a lateral join):
select array_agg(distinct elem order by elem desc) as result
from example
cross join unnest(date_array) as elem
where elem is not null
Test it in DB Fiddle.
A general note. An alternative solution using an array constructor is more efficient, especially in cases as simple as this one. Personally, I prefer aggregate functions, because that query structure is more general and flexible: it is easy to extend to more complex problems, e.g. aggregating more than one column or grouping by another column, as sketched below. In those non-trivial cases the performance difference tends to shrink, while the aggregate version stays cleaner and more readable. That is an extremely important factor when you have to maintain really large and complex projects.
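For instance, a hypothetical grouped variant (assuming an extra column client_id, which is not in the original example table) could look like:

-- Sketch only: client_id is an assumed grouping column on example.
select client_id, array_agg(distinct elem order by elem desc) as result
from example
cross join unnest(date_array) as elem
where elem is not null
group by client_id;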
See also In Postgres select, return a column subquery as an array?
Plain array_agg() does this with arrays:
Concatenates all the input arrays into an array of one higher
dimension. (The inputs must all have the same dimensionality, and
cannot be empty or null.)
Not what you need. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
You need something like this: unnest(), process and sort the elements, and feed the resulting set to an ARRAY constructor:
SELECT ARRAY(
  SELECT DISTINCT elem::date
  FROM (SELECT unnest(date_array) FROM example) AS e(elem)
  WHERE elem IS NOT NULL
  ORDER BY elem::date DESC
);
db<>fiddle here
To be clear: we could use array_agg() (taking non-array input, different from your incorrect use) instead of the final ARRAY constructor. But the latter is faster (and simpler, too, IMO).
They happen to be timestamps of just dates (eg without time or timezone)
So cast to date and trim the noise.
Should be the fastest way:
A correlated subquery is a bit faster than a LATERAL one (and does the simple job).
An ARRAY constructor is a bit faster than the aggregate function array_agg() (and does the simple job).
Most importantly, sorting and applying DISTINCT in a subquery is typically faster than inline ORDER BY and DISTINCT in an aggregate function (and does the simple job).
See:
Unnest arrays of different dimensions
How to select 1d array from 2d array?
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Performance comparison:
db<>fiddle here
I want to group by two columns, however MS Access won't let me do it.
Here is the code I wrote:
SELECT
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter,
tbl_Produktion.ProduktionsID, tbl_Produktion.Linie,
tbl_Produktion.Schicht, tbl_Produktion.Anzahl_Schichten_P,
tbl_Produktion.Schichtteam, tbl_Produktion.Von, tbl_Produktion.Bis,
tbl_Produktion.Pause, tbl_Produktion.Kunde, tbl_Produktion.TeileNr,
tbl_Produktion.FormNr, tbl_Produktion.LabyNr,
SUM(tbl_Produktion.Stueckzahl_Prod),
tbl_Produktion.Stueckzahl_Ausschuss, tbl_Produktion.Ausschussgrund,
tbl_Produktion.Kommentar, tbl_Produktion.StvSchichtleiter,
tbl_Produktion.Von2, tbl_Produktion.Bis2, tbl_Produktion.Pause2,
tbl_Produktion.Arbeiter3, tbl_Produktion.Von3, tbl_Produktion.Bis3,
tbl_Produktion.Pause3, tbl_Produktion.Arbeiter4,
tbl_Produktion.Von4, tbl_Produktion.Bis4, tbl_Produktion.Pause4,
tbl_Produktion.Leiharbeiter5, tbl_Produktion.Von5,
tbl_Produktion.Bis5, tbl_Produktion.Pause5,
tbl_Produktion.Leiharbeiter6, tbl_Produktion.Von6,
tbl_Produktion.Bis6, tbl_Produktion.Pause6, tbl_Produktion.Muster
FROM
tbl_Personal
INNER JOIN
tbl_Produktion ON tbl_Personal.PersID = tbl_Produktion.Schichtleiter
GROUP BY
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter;
It works when I group by all the columns, but not like this.
The error message says that the rest of the columns aren't part of an aggregate function (translated from German to English as best I could).
PS: I also need the sum of tbl_Produktion.Stueckzahl_Prod, which is why I added the SUM function (I haven't been able to test it yet).
Have you tried something along these lines?
SELECT
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter,
MAX(tbl_Produktion.ProduktionsID), MAX(tbl_Produktion.Linie),
MAX(tbl_Produktion.Schicht), MAX(tbl_Produktion.Anzahl_Schichten_P),
MAX(tbl_Produktion.Schichtteam), MAX(tbl_Produktion.Von), MAX(tbl_Produktion.Bis),
SUM(tbl_Produktion.Stueckzahl_Prod)
FROM
tbl_Personal
INNER JOIN
tbl_Produktion ON tbl_Personal.PersID = tbl_Produktion.Schichtleiter
GROUP BY
tbl_Produktion.Datum, tbl_Produktion.Schichtleiter;
I have used the MAX function for all the data except the two items you specify in the GROUP BY and the one where you want the SUM. I took the liberty of leaving out much of your data just to get started.
Using the MAX function is a convenient workaround when the data item is known to be unique within each group. We cannot know your data or your intent, so we cannot tell you whether MAX will yield the results you need.
If you use an aggregate function in the SELECT clause, you must group by every selected column that isn't aggregated. If you don't want to do that for some reason (perhaps it changes the output of the aggregation in a way you don't intend), you either have to pick an aggregate for those columns (Average? Max? Min?) or run two queries: one for the aggregates and one for the non-aggregated columns, as sketched below. You then have to decide how to match the non-aggregated fields to each aggregate row (or show them all in a table, I suppose).
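A minimal sketch of that two-query idea in Access SQL, reusing the table and column names from the question (the aliases are made up, and Access SQL has no comment syntax, so the explanation stays up here): aggregate in a derived table, then join the detail rows back on the grouping keys.

SELECT p.*, s.SumStueckzahl
FROM tbl_Produktion AS p
INNER JOIN
    (SELECT Datum, Schichtleiter, SUM(Stueckzahl_Prod) AS SumStueckzahl
     FROM tbl_Produktion
     GROUP BY Datum, Schichtleiter) AS s
    ON (p.Datum = s.Datum AND p.Schichtleiter = s.Schichtleiter);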
I need to join two tables using Python on an NVL-type function, because we have one table where, based on the part type, we only store the first 7 characters of the part number.
So far I have not found a simple way to do this in python.
Is there a function that will do this, or another easy way to achieve it?
Thank you in advance
I joined on part_number, removed the rows where the other table's fields were NaN, joined again as a new table on the substring, then appended the tables together, and ended up with the wrong number of rows.
left join on nvl(nvl(thistable.part_number, substr(thistable.part_number, 1, 7)), 'not in defn table') = othertable.part_number
Output might be like this:
thistable.part_number  othertable.description
abc123                 real part
def456                 another real part
1234567-02             part stored as 1234567 in othertable
koue49c                not in defn table
I think you want coalesce():
on coalesce(thistable.part_number, substr(thistable.part_number, 1, 7), 'not in defn table') = othertable.part_number
coalesce() takes multiple arguments, so you don't need to nest the function calls.
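If the intent is an exact match first, then a 7-character prefix match as the fallback (as the sample output suggests), one way to express that is two LEFT JOINs with coalesce() over the results. A sketch, assuming the table names from the question and a hypothetical description column:

-- Sketch only: "description" is an assumed column name.
select t.part_number,
       coalesce(o1.description, o2.description, 'not in defn table') as description
from thistable t
left join othertable o1 on o1.part_number = t.part_number
left join othertable o2 on o2.part_number = substr(t.part_number, 1, 7);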
Is it possible to store data in a table that can then be converted into either a SQL query or a UDF - like a javascript eval()?
The use case is that I have a list of clients whose earnings are calculated in significantly different ways, and this can change over time. So I would like a lookup table that can be updated with a formula for calculating this figure, rather than having to write, and then maintain, hundreds of queries (one per client).
I have tried to think if there is a way of having a standard formula that would be flexible enough, but I really don't think it's possible unfortunately.
Sure! BigQuery can define and use JS UDFs. The good news is that eval() works as expected:
CREATE TEMP FUNCTION calculate(x FLOAT64, y FLOAT64, formula STRING)
RETURNS FLOAT64
LANGUAGE js AS """
  return eval(formula);
""";

WITH table AS (
  SELECT 1 AS x, 5 AS y, 'x+y' AS formula
  UNION ALL SELECT 2, 10, 'x-y'
  UNION ALL SELECT 3, 15, 'x*y'
)
SELECT x, y, formula, calculate(x, y, formula) AS result
FROM table;
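To tie this back to the use case, the per-client formulas could live in a lookup table and be joined in before calling the UDF. A rough sketch, where earnings_inputs and client_formulas (and their columns) are hypothetical names:

-- Sketch only: both tables and all columns are assumptions.
SELECT e.client_id, calculate(e.x, e.y, f.formula) AS earnings
FROM earnings_inputs AS e
JOIN client_formulas AS f USING (client_id);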
I am trying to figure out, in Hive, how to write a UDF that takes a list as input and outputs a list of all ordered 2-way combinations of its elements.
Input:
list_variable_b
[5142430,5146974,5141766]
Output:
list_variable_b
[(5142430,5146974),(5146974,5141766),(5142430,5141766)]
So you're asking how to write a UDF that can take an array<bigint> and
turn it into an array<struct<int,int>> or array<array<int>>.
It sounds like you want what's called "n choose k", which will produce n!/((n-k)!k!) elements.
Now, Hive has two kinds of UDFs. The simple kind can only process primitive (non-collection) types; since you are processing an array here, you'll need a generic UDF. Generic UDFs can do much more than simple UDFs, but they are also more difficult to write. A good guide on how to write one is here: http://www.baynote.com/2012/11/a-word-from-the-engineers/
Another way would be to use a double LATERAL VIEW with the caveat that all the elements in the array have to be unique for this to work.
If the table is
create table xx ( col array<int>);
such that
select * from xx;
OK
[5142430,5146974,5141766]
Using a double lateral view to compute the cartesian product of the array with itself, then keeping only the pairs where one element is bigger than the other:
select a1, b1 from xx
lateral view explode(col) a as a1
lateral view explode(col) b as b1
where a1 < b1;
5142430 5146974
5141766 5142430
5141766 5146974
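If you need the pairs back as a single list (as in the desired output), a possible follow-up is to wrap each pair in a struct and collect them. A sketch, assuming a Hive version where collect_list accepts non-primitive types (0.13+):

-- Sketch only: collects every pair into one array; add a grouping key
-- to the outer query if xx has more than one row.
select collect_list(named_struct('a', a1, 'b', b1)) as pairs
from (
  select a1, b1 from xx
  lateral view explode(col) a as a1
  lateral view explode(col) b as b1
  where a1 < b1
) t;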