In postgres, when I call a function on some data, like so:
select f(col_nums) from tbl_name
where col_str = '12345'
then the function f will be applied to each row where col_str = '12345'.
On the other hand, if I call an aggregation function on some data, like so:
select g_agg(col_nums) from tbl_name
where col_str = '12345'
then the function g_agg will be called on the entire column but will result in a single value.
Q: How can I make a function that is applied to the entire column and returns a column of the same size, while at the same time being aware of all the values in the subset?
For example, can I create a function to calculate a cumulative sum?
select *, sum_accum(col_nums) as cs from tbl_name
where col_str = '12345'
such that the result of the above query would look like this:
col_str | more_cols | col_nums | cs
---------+-----------+----------+----
 12345   |       567 |        1 |  1
 12345   |       568 |        2 |  3
 12345   |       569 |        3 |  6
 12345   |       570 |        4 | 10
Is there no choice but to pass a sub-query result to a function and then join with the original table?
Use window functions
A window function performs a calculation across a set of table rows
that are somehow related to the current row. This is comparable to the
type of calculation that can be done with an aggregate function. But
unlike regular aggregate functions, use of a window function does not
cause rows to become grouped into a single output row — the rows
retain their separate identities. Behind the scenes, the window
function is able to access more than just the current row of the query
result.
e.g.
select *, sum(col_nums) OVER(PARTITION BY T.COLX, T.COLY) as cs
from tbl_name T
where col_str = '12345'
Note that it is the addition of an OVER clause that changes an aggregate from its traditional use into a window function:
the OVER clause causes it to be treated as a window function and
computed across an appropriate set of rows
The OVER clause can contain a PARTITION BY (analogous to GROUP BY), which controls the window that the calculations are performed in; it also allows an ORDER BY, which for an aggregate turns the calculation into a running one over the window frame.
select *
-- running sum using an order by
, sum(col_nums) OVER(PARTITION BY T.COLX ORDER BY T.COLY) as cs
-- count over the whole partition (no ORDER BY, so not a running count)
, count(*) OVER(PARTITION BY T.COLX) as cs_count
from tbl_name T
where col_str = '12345'
The function that you want is a cumulative sum. This is handled by window functions:
select t.*, sum(col_nums) over (order by more_cols) as cs
from tbl_name t
where col_str = '12345';
I am guessing that the order by sequence is defined by the second column. It can be any column including col_nums.
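One subtlety worth noting: with only an ORDER BY, the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie on the ordering column all get the same running total. If the ordering column can contain duplicates and you want a strictly row-by-row cumulative sum, spell the frame out explicitly; this is a sketch using the question's column names:
select t.*,
       sum(col_nums) over (order by more_cols
                           rows between unbounded preceding
                                    and current row) as cs
from tbl_name t
where col_str = '12345';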
You can do this for all values of col_str at the same time, using the partition by clause:
select t.*, sum(col_nums) over (partition by col_str order by more_cols) as cs
from tbl_name t
Related
I have a table which requires filtering based on the dates.
| Group | Account | Values | Date_ingested |
| ----- | ------- | ------ | ------------- |
| X     | 3000    | 0      | 2023-01-07    |
| Y     | 3000    | null   | 2021-02-22    |
The goal is to select the row with the latest date when there are multiple data points, as in the example above.
Account 3000 occurs under two Groups, but the up-to-date and correct result should only reflect group X because it was ingested into Databricks more recently.
Now, if I try to use the code below with grouping, the query executes but the max function is effectively ignored and the results contain two rows for account 3000, one with group X and one with group Y.
Select Group, Account, Values, max(Date_ingested) from datatableX
group by Group, Account, Values
If I choose to use the code without grouping, I get the following error
Error in SQL statement: AnalysisException: grouping expressions sequence is empty, and 'datatableX.Account' is not an aggregate function. Wrap '(max(spark_catalog.datatableX.Date_ingested) AS `max(Date_ingested)`)' in windowing function(s) or wrap 'spark_catalog.datatableX.Account' in first() (or first_value) if you don't care which value you get.
I can't, however, figure out a way to do the above. I tried reading about aggregate functions but I can't grasp the concept.
Select Group, Account, Values, max(Date_ingested) from datatableX
or
Select Group, Account, Values, max(Date_ingested) from datatableX
group by Group, Account, Values
You want the entire latest record per account, which suggests filtering rather than aggregation.
A typical approach uses row_number() to enumerate the records for each account by descending date of ingestion, then filters to the top-ranked record per account in the outer query:
select *
from (
    select d.*,
           row_number() over (partition by account
                              order by date_ingested desc) as rn
    from datatableX d
) t
where rn = 1
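If two rows can tie on the latest date for an account and you want to keep all of them, rank() can be swapped in for row_number(); this is a sketch under the same table and column names:
select *
from (
    select d.*,
           rank() over (partition by account
                        order by date_ingested desc) as rnk
    from datatableX d
) t
where rnk = 1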
I have a PostgreSQL table with a column that holds an array of strings. Some rows have only unique strings in the array, while others contain duplicates. I want to remove the duplicate strings from each row where they exist.
I have tried some queries but couldn't make it work.
Following is the table:
veh_id | vehicle_types
--------+----------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8","viper"}
7 | {"ferrariff","viper","viper","volt"}
I am expecting following output:
veh_id | vehicle_types
--------+----------------------------------------
1 | {"byd_tang","volt","viper","laferrari"}
2 | {"volt","viper"}
3 | {"byd_tang","sonata","jaguarxf"}
4 | {"swift","teslax","mirai"}
5 | {"volt","viper"}
6 | {"viper","ferrariff","bmwi8"}
7 | {"ferrariff","viper","volt"}
Since each row's array is independent, a plain correlated subquery with an ARRAY constructor would do the job:
SELECT *, ARRAY(SELECT DISTINCT unnest(vehicle_types)) AS vehicle_types_uni
FROM vehicle;
See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
Note that NULL is converted to an empty array ('{}'). We'd need to special-case it, but it is excluded in the UPDATE below anyway.
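If keeping NULL as NULL matters, a minimal way to special-case it is a CASE expression; this is a sketch against the same table:
SELECT *, CASE WHEN vehicle_types IS NULL THEN NULL
               ELSE ARRAY(SELECT DISTINCT unnest(vehicle_types))
          END AS vehicle_types_uni
FROM vehicle;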
The ARRAY(SELECT DISTINCT ...) query above is fast and simple. But don't use this: you didn't say so, but typically you'd want to preserve the original order of array elements, and your sample suggests as much. Use WITH ORDINALITY in the correlated subquery, which becomes a bit more sophisticated:
SELECT *, ARRAY(SELECT v
                FROM unnest(vehicle_types) WITH ORDINALITY t(v, ord)
                GROUP BY 1
                ORDER BY min(ord)
               ) AS vehicle_types_uni
FROM vehicle;
See:
PostgreSQL unnest() with element number
UPDATE to actually remove dupes:
UPDATE vehicle
SET vehicle_types = ARRAY (
SELECT v
FROM unnest(vehicle_types) WITH ORDINALITY t(v,ord)
GROUP BY 1
ORDER BY min(ord)
)
WHERE cardinality(vehicle_types) > 1 -- optional
AND vehicle_types <> ARRAY (
SELECT v
FROM unnest(vehicle_types) WITH ORDINALITY t(v,ord)
GROUP BY 1
ORDER BY min(ord)
); -- suppress empty updates (optional)
Both added WHERE conditions are optional and serve to improve performance. The first is logically redundant (the second already excludes arrays that wouldn't change) but is cheaper to check. Each condition also excludes the NULL case. The second suppresses all empty updates.
See:
How do I (or can I) SELECT DISTINCT on multiple columns?
If you tried to do that without preserving original order, you'd likely update most rows without need, just because the order of elements changed even without dupes.
Requires Postgres 9.4 or later.
db<>fiddle here
I don't claim it's efficient, but something like this might work:
with expanded as (
    select veh_id, unnest(vehicle_types) as vehicle_type
    from vehicles
)
select veh_id, array_agg(distinct vehicle_type)
from expanded
group by veh_id
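If the other columns of the table are needed alongside the deduplicated array, one option is to join the aggregate back to the base table; a sketch, assuming veh_id is unique:
with deduped as (
    select veh_id, array_agg(distinct vehicle_type) as vehicle_types_uni
    from (select veh_id, unnest(vehicle_types) as vehicle_type
          from vehicles) x
    group by veh_id
)
select v.*, d.vehicle_types_uni
from vehicles v
join deduped d using (veh_id)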
If you really want to get fancy, you can write a custom function (note that the = ANY check rescans the output array, so this is O(n²) in the worst case, though it avoids the unnest/aggregate machinery):
create or replace function unique_array(input_array text[])
returns text[] as $$
DECLARE
  output_array text[];
  i integer;
BEGIN
  output_array := array[]::text[];
  -- append each element only if it has not been seen before
  for i in 1..cardinality(input_array) loop
    if not (input_array[i] = any (output_array)) then
      output_array := output_array || input_array[i];
    end if;
  end loop;
  return output_array;
END;
$$ language plpgsql;
Usage example:
select veh_id, unique_array(vehicle_types)
from vehicles
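To persist the cleanup with this function, an UPDATE along these lines should work (same table and column names as above; the NULL filter matters because the loop bounds would be null otherwise):
update vehicles
set vehicle_types = unique_array(vehicle_types)
where vehicle_types is not null;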
In Hive a statement like this:
SELECT MIN('FOO') AS id,
MIN('Foo') as name;
will return a result set like this:
+------------+---------+
| id | name |
+------------+---------+
| Foo | Foo |
+------------+---------+
Even though I would expect :
FOO, Foo
(MIN('FOO') is the min value over a group of one, and MIN('Foo') is the min value over another group of one).
Using more than one function or appending a ' ' to one of the values produces the expected result.
SELECT MIN('FOO') AS id,
Max('Foo') as name;
or
SELECT MIN('FOO') AS id,
MIN(concat('Foo', '')) as name;
Is this a bug, or does grouping in Hive work at a row level over all columns in the row with the same function, case-insensitively?
Try setting hive.cache.expr.evaluation=false
This might be related to this bug in Hive.
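For example, in the same session before re-running the query (assuming you are allowed to override the setting):
SET hive.cache.expr.evaluation=false;
SELECT MIN('FOO') AS id,
       MIN('Foo') AS name;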
I have a table that looks like this:
Column A | Column B | Counter
---------+----------+--------
A        | B        | 53
B        | C        | 23
A        | D        | 11
C        | B        | 22
I need to remove the last row because it's cyclic to the second row. Can't seem to figure out how to do it.
EDIT
There is an indexed date field. This is for a Sankey diagram. The data in the sample table is actually the result of a query. The underlying table has:
date | source node | target node | path count
The query to build the table is:
SELECT source_node, target_node, COUNT(1)
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
GROUP BY source_node, target_node
In the sample, the last row (C to B) goes backwards and I need to ignore it or the Sankey won't display. I need to show only forward paths.
Removing all edges from your graph where the tuple (source_node, target_node) is not ordered alphabetically and the symmetric row exists should give you what you want:
DELETE
FROM sankey_table t1
WHERE source_node > target_node
AND EXISTS (
SELECT NULL from sankey_table t2
WHERE t2.source_node = t1.target_node
AND t2.target_node = t1.source_node)
If you don't want to DELETE them, just use this WHERE clause in your query for generating the input for the diagram.
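Folded into the aggregation query, that could look like this (a sketch against the same underlying table):
SELECT source_node, target_node, COUNT(1)
FROM sankey_table t1
WHERE TO_CHAR(data_date, 'yyyy-mm-dd') = '2013-08-19'
AND NOT (source_node > target_node
         AND EXISTS (SELECT NULL
                     FROM sankey_table t2
                     WHERE t2.source_node = t1.target_node
                     AND t2.target_node = t1.source_node))
GROUP BY source_node, target_node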
If you can adjust how your table is populated, you can change the query you're using to only retrieve the values for the first direction (for that date) in the first place, with a little analytic manipulation:
SELECT source_node, target_node, counter FROM (
SELECT source_node,
target_node,
COUNT(*) OVER (PARTITION BY source_node, target_node) AS counter,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table
WHERE TO_CHAR(data_date, 'yyyy-mm-dd')='2013-08-19'
)
WHERE rnk = 1;
The inner query gets the same data you collect now but adds a ranking column, which will be 1 for the first row for any source/target pair in any order for a given day. The outer query then just ignores everything else.
This might be a candidate for a materialised view if you're truncating and repopulating it daily.
If you can't change your intermediate table but can still see the underlying table you could join back to it using the same kind of idea; assuming the table you're querying from is called sankey_agg_table:
SELECT sat.source_node, sat.target_node, sat.counter
FROM sankey_agg_table sat
JOIN (SELECT source_node, target_node,
RANK () OVER (PARTITION BY GREATEST(source_node, target_node),
LEAST(source_node, target_node), TRUNC(data_date)
ORDER BY data_date) AS rnk
FROM sankey_table) st
ON st.source_node = sat.source_node
AND st.target_node = sat.target_node
AND st.rnk = 1;
SQL Fiddle demos.
DELETE FROM yourTable
where [Column A]='C'
given that these are all your rows
EDIT
I would recommend that you clean up your source data if you can, i.e. delete the rows that you call backwards, if those rows are incorrect as you state in your comments.
I'm quite new to SQL and I'd like to write a SELECT statement that retrieves only the first row of a set based on a column value. I'll try to make it clearer with an example table.
Here is my table data :
chip_id | sample_id
-------------------
1 | 45
1 | 55
1 | 5986
2 | 453
2 | 12
3 | 4567
3 | 9
I'd like to have a SELECT statement that fetches the first row for each chip_id (1, 2, 3)
Like this :
chip_id | sample_id
-------------------
1 | 45 or 55 or whatever
2 | 12 or 453 ...
3 | 9 or ...
How can I do this?
Thanks
I'd probably:
1. set a variable = 0
2. order your table by chip_id
3. read the table in row by row
4. if table[row].chip_id > variable, store table[row] in a result array and set the variable to that chip_id
5. loop till done
6. return your result array
Though depending on your DB, query, and versions, you'll probably get unpredictable/unreliable returns.
You can get one value using row_number():
select chip_id, sample_id
from (select chip_id, sample_id,
             row_number() over (partition by chip_id order by rand()) as seqnum
      from your_table
     ) t
where seqnum = 1
This returns a random value per group (rand() is MySQL's spelling; Postgres calls it random()). In SQL, tables are inherently unordered, so there is no concept of "first". You need an auto-incrementing id, a creation date, or some other way of defining "first" to get the "first".
If you have such a column, then replace rand() with that column.
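For instance, if the lowest sample_id counts as "first" (an assumption, since no ordering column is given; your_table stands in for the unnamed table):
select chip_id, sample_id
from (select chip_id, sample_id,
             row_number() over (partition by chip_id order by sample_id) as seqnum
      from your_table
     ) t
where seqnum = 1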
Provided I understood your output, if you are using PostgreSQL 9, you can use this (the cast is needed because string_agg expects text):
SELECT chip_id,
       string_agg(sample_id::text, ' or ')
FROM your_table
GROUP BY chip_id
You need to group your data with a GROUP BY query.
When you group, generally you want the max, the min, or some other value to represent the group. You can do sums, counts, and all kinds of group operations.
For your example, you don't seem to want a specific group operation, so the query could be as simple as this one :
SELECT chip_id, MAX(sample_id)
FROM your_table
GROUP BY chip_id
This way you are retrieving the maximum sample_id for each of the chip_id.