Greater than any element of array - sql

I am facing this issue:
Is there a way in PostgreSQL to put aggregated timestamp data into an array (for example using array_agg function) and then perform any match on some condition?
I am doing something similar with LIKE on aggregated strings (using string_agg(column,';')). But how to perform something similar on timestamps?
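Roughly like this (a simplified sketch; table and column names are made up):

-- simplified sketch of the current string-based approach (made-up names)
SELECT grp_id, string_agg(txt_col, ';') AS txt_agg
FROM   some_table
GROUP  BY grp_id
HAVING string_agg(txt_col, ';') LIKE '%foo%';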
So if the result would be '{10.10.2021,20.12.2021,1.1.1996}' as timestamp_array, how would I filter rows that have at least one array element after some input?
For example, ... WHERE 31.12.2021 > timestamp_array ... would not match the row above because there is no array element after 31.12.2021.
But if I query ... WHERE 31.12.1996 > timestamp_array ..., the row above would be matched (because at least one element of the array is in the given interval).

First, you would use standard date formats. Then you can use:
where '2021-12-31' > any (timestamp_array)
Here is a db<>fiddle to illustrate the idea.

I would like to filter rows that have at least one array element that is after some input?
You can use the ANY construct as has been advised.
WHERE '1996-12-31'::timestamp < ANY ('{2021-10-10, 2021-12-20, 1996-01-01}'::timestamp[])
Has to be <, not >, obviously.
Your "timestamps" look a lot like dates - timestamp input accepts that, too.
But always use the recommended ISO 8601 format (as demonstrated), else your input depends on the settings of the current session.
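For example, the ambiguous format flips meaning with the session's DateStyle, while the ISO form does not:

SET datestyle = 'DMY';
SELECT '10.12.2021'::date;  -- December 10
SET datestyle = 'MDY';
SELECT '10.12.2021'::date;  -- October 12
SELECT '2021-12-31'::date;  -- ISO 8601: unambiguous under either setting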
See:
IN vs ANY operator in PostgreSQL
But chances are, there is a much more efficient way. You speak of "aggregated timestamp data". Typically, it's much more efficient to check before aggregating, not least because that can use indexes, as opposed to your approach. Typically, EXISTS does the job. Something like:
SELECT ...
FROM   tbl t
WHERE  EXISTS (
   SELECT FROM tbl t1
   WHERE  t1.id = t.id
   AND    t1.timestamp_column > '1996-12-31'
   )
GROUP  BY ...
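For illustration only, a rough sketch of how the filter could combine with the aggregation, assuming a table tbl(id, timestamp_column) with several rows per id (names are made up):

SELECT t.id, array_agg(t.timestamp_column) AS timestamp_array
FROM   tbl t
WHERE  EXISTS (
   SELECT FROM tbl t1
   WHERE  t1.id = t.id
   AND    t1.timestamp_column > '1996-12-31'
   )
GROUP  BY t.id;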
Start a new question with details of your query to get a fitting solution.

Related

Aggregate single array of distinct elements from array column, excluding NULL

I'm trying to roll up the distinct non-null values of timestamps stored in a PostgreSQL 9.6 database column.
So given a table containing the following:
date_array
------------------------
{2019-10-21 00:00:00.0}
{2019-08-06 00:00:00.0,2019-08-05 00:00:00.0}
{2019-08-05 00:00:00.0}
(null)
{2019-08-01 00:00:00.0,2019-08-06 00:00:00.0,null}
The desired result would be:
{2019-10-21 00:00:00.0, 2019-08-06 00:00:00.0, 2019-08-05 00:00:00.0, 2019-08-01 00:00:00.0}
The arrays can be different sizes so most solutions I've tried end up running into a Code 0:
SQL State: 2202E
ERROR: cannot accumulate arrays of different dimensionality.
Some other caveats:
The arrays can be null, and the arrays can contain a null. They happen to be timestamps of just dates (e.g. without time or timezone). But in trying to simplify the problem, I've had no luck changing the sample data to strings (e.g. {foo, bar, (null)}, {foo,baz}) - just to focus on the problem and eliminate any issues I miss/don't understand about timestamps w/o timezone.
This following SQL is the closest I've come (it resolves all but the different dimensionality issues):
SELECT ARRAY_REMOVE(
         ARRAY(
           SELECT DISTINCT UNNEST(
             ARRAY_AGG(
               CASE WHEN ARRAY_NDIMS(example.date_array) > 0
                     AND example.date_array IS NOT NULL
                    THEN example.date_array
                    ELSE '{null}'
               END))), NULL) AS actualDates
FROM example;
I created the following DB fiddle with sample data that illustrates the problem if the above is lacking: https://www.db-fiddle.com/f/8m469XTDmnt4iRkc5Si1eS/0
Additionally, I've perused stackoverflow on the issue (as well as PostgreSQL documentation) and there are similar questions with answers, but I've found none that are articulating the same problem I'm having.
Use unnest() in FROM clause (in a lateral join):
select array_agg(distinct elem order by elem desc) as result
from example
cross join unnest(date_array) as elem
where elem is not null
Test it in DB Fiddle.
A general note. An alternative solution using an array constructor is more efficient, especially in cases as simple as described. Personally, I prefer to use aggregate functions because this query structure is more general and flexible, easy to extend to handle more complex problems (e.g. having to aggregate more than one column, grouping by another column, etc). In these non-trivial cases, the performance differences tend to decrease, but the code using aggregates remains cleaner and more readable. It's an extremely important factor when you have to maintain really large and complex projects.
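For instance, the aggregate form extends naturally to a per-group roll-up (category is a hypothetical extra column on example):

select category, array_agg(distinct elem order by elem desc) as result
from example
cross join unnest(date_array) as elem
where elem is not null
group by category;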
See also In Postgres select, return a column subquery as an array?
Plain array_agg() does this with arrays:
Concatenates all the input arrays into an array of one higher
dimension. (The inputs must all have the same dimensionality, and
cannot be empty or null.)
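For example (illustration only; literal arrays made up):

SELECT array_agg(a)  -- -> {{1,2},{3,4}}
FROM  (VALUES ('{1,2}'::int[]), ('{3,4}'::int[])) t(a);

-- Inputs of different lengths raise the error from the question:
-- ERROR: cannot accumulate arrays of different dimensionality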
Not what you need. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
You need something like this: unnest(), process and sort elements, and feed the resulting set to an ARRAY constructor:
SELECT ARRAY(
SELECT DISTINCT elem::date
FROM (SELECT unnest(date_array) FROM example) AS e(elem)
WHERE elem IS NOT NULL
ORDER BY elem DESC
);
db<>fiddle here
To be clear: we could use array_agg() (taking non-array input, different from your incorrect use) instead of the final ARRAY constructor. But the latter is faster (and simpler, too, IMO).
They happen to be timestamps of just dates (eg without time or timezone)
So cast to date and trim the noise.
Should be the fastest way:
A correlated subquery is a bit faster than a LATERAL one (and does the simple job).
An ARRAY constructor is a bit faster than the aggregate function array_agg() (and does the simple job).
Most importantly, sorting and applying DISTINCT in a subquery is typically faster than inline ORDER BY and DISTINCT in an aggregate function (and does the simple job).
See:
Unnest arrays of different dimensions
How to select 1d array from 2d array?
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
Performance comparison:
db<>fiddle here

SQL Query for Search Page

I am working on a small project for an online databases course and I was wondering if you could help me out with a problem I am having.
I have a web page that is searching a movie database and retrieving specific columns using a movie initial input field, a number input field, and a code field. These will all be converted to strings and used as user input for the query.
Below is what I tried before:
select A.CD, A.INIT, A.NBR, A.STN, A.ST, A.CRET_ID, A.CMNT, A.DT
from MOVIE_ONE A
where A.INIT = :init
AND A.CD = :cd
AND A.NBR = :num
The way the page must search is in three different cases:
(initial and number)
(code)
(initial and number and code)
The cases have to be independent, so if certain fields are empty but a certain case is fulfilled, the search goes through. It also must be one query. I am stuck on how to implement the cases.
The parameters in the query are taken from the Java parameters in the method found in an SQLJ file.
If you could possibly provide some aid on how I can go about this problem, I'd greatly appreciate it!
Consider wrapping the equality expressions in NVL (synonymous with COALESCE), so that if a parameter input is blank, the corresponding column is checked against itself. Also, be sure to kick the a-b-c table aliasing habit.
SELECT m.CD, m.INIT, m.NBR, m.STN, m.ST, m.CRET_ID, m.CMNT, m.DT
FROM MOVIE_ONE m
WHERE m.INIT = NVL(:init, m.INIT)
AND m.CD = NVL(:cd, m.CD)
AND m.NBR = COALESCE(:num, m.NBR)
To demonstrate, consider the DB2 fiddles below, where each case can be checked by adjusting the VALUES-CTE parameters, all running on the exact same data.
Case 1
WITH
i(init) AS (VALUES('db2')),
c(cd) AS (VALUES(NULL)),
n(num) AS (VALUES(53)),
cte AS
...
Case 2
WITH
i(init) AS (VALUES(NULL)),
c(cd) AS (VALUES(2018)),
n(num) AS (VALUES(NULL)),
cte AS
...
Case 3
WITH
i(init) AS (VALUES('db2')),
c(cd) AS (VALUES(2018)),
n(num) AS (VALUES(53)),
cte AS
...
However, do be aware that the fiddle runs different SQL due to the nature of the data (i.e., doubles and dates). But the query does reflect the same concept, with NVL matching expressions on both sides.
SELECT *
FROM cte, i, c, n
WHERE cte.mytype = NVL(i.init, cte.mytype)
AND YEAR(CAST(cte.mydate AS date)) = NVL(c.cd, YEAR(CAST(cte.mydate AS date)))
AND ROUND(cte.mynum, 0) = NVL(n.num, ROUND(cte.mynum, 0));
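As a side note, an equivalent pattern (sketch only) spells out the NULL checks explicitly, assuming the host variables are bound as NULL when a form field is left empty:

SELECT m.CD, m.INIT, m.NBR, m.STN, m.ST, m.CRET_ID, m.CMNT, m.DT
FROM MOVIE_ONE m
WHERE (:init IS NULL OR m.INIT = :init)
  AND (:cd   IS NULL OR m.CD   = :cd)
  AND (:num  IS NULL OR m.NBR  = :num)

One difference: the NVL form also filters out rows where the column itself is NULL (since NULL = NULL is not true), while the explicit form keeps them when the corresponding parameter is NULL.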

Efficient way to select one from each category - Rails

I'm developing a simple app to return a random selection of exercises, one for each bodypart.
bodypart is an indexed enum column on an Exercise model. DB is PostgreSQL.
The below achieves the result I want, but feels horribly inefficient (hitting the db once for every bodypart):
BODYPARTS = %w(legs core chest back shoulders).freeze
@exercises = BODYPARTS.map do |bp|
  Exercise.public_send(bp).sample
end.shuffle
So, this gives a random exercise for each bodypart, and mixes up the order at the end.
I could also store all exercises in memory and select from them; however, I imagine this would scale horribly (there are only a dozen or so seed records at present).
@exercises = Exercise.all
BODYPARTS.map do |bp|
  @exercises.select { |e| e[:bodypart] == bp }.sample
end.shuffle
Benchmarking these shows the select approach as the more effective on a small scale:
Queries: 0.072902 0.020728 0.093630 ( 0.088008)
Select: 0.000962 0.000225 0.001187 ( 0.001113)
MrYoshiji's answer: 0.000072 0.000008 0.000080 ( 0.000072)
My question is whether there's an efficient way to achieve this output, and, if so, what that approach might look like. Ideally, I'd like to keep this to a single db query.
Happy to compose this using ActiveRecord or directly in SQL. Any thoughts greatly appreciated.
From my comment, you should be able to do (thanks to PostgreSQL's DISTINCT ON):
Exercise.select('distinct on (bodypart) *')
.order('bodypart, random()')
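For reference, the SQL this generates is roughly:

SELECT DISTINCT ON (bodypart) *
FROM exercises
ORDER BY bodypart, random();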
Postgres' DISTINCT ON is very handy, and performance is typically great, too - for many distinct bodyparts with few rows each. But for only a few distinct values of bodypart with many rows each (big table - and your use case) there are far superior query techniques.
This will be massively faster in such a case:
SELECT e.*
FROM unnest(enum_range(null::bodypart)) b(bodypart)
CROSS JOIN LATERAL (
SELECT *
FROM exercises
WHERE bodypart = b.bodypart
-- ORDER BY ??? -- for a deterministic pick
LIMIT 1 -- arbitrary pick!
) e;
Assuming that bodypart is the name of the enum as well as the table column.
enum_range is an enum support function that (quoting the manual):
Returns all values of the input enum type in an ordered array
I unnest it and run a LATERAL subquery for each value, which is very fast when supported with the right index. Detailed explanation for the query technique and the needed index (focus on chapter "2a. LATERAL join"):
Optimize GROUP BY query to retrieve latest record per user
For just an arbitrary row for each bodypart, a simple index on exercises(bodypart) does the job. But you can have a deterministic pick like "the latest entry" with the right multicolumn index and a matching ORDER BY clause and almost the same performance.
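For example, a hypothetical "latest entry per bodypart" pick, assuming a created_at column (untested sketch):

CREATE INDEX ON exercises (bodypart, created_at DESC);

SELECT e.*
FROM   unnest(enum_range(null::bodypart)) b(bodypart)
CROSS  JOIN LATERAL (
   SELECT *
   FROM   exercises
   WHERE  bodypart = b.bodypart
   ORDER  BY created_at DESC
   LIMIT  1
   ) e;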
Related:
Is it a bad practice to query pg_type for enums on a regular basis?
Select first row in each GROUP BY group?

How to handle string ordering in order by clause?

Suppose I want to order the records by a field (string data type) called STORY_LENGTH. This field is a multi-valued field and I represent the multiple values using commas. For example, for record1 its value is "1", for record2 its value is "1,3", and for record3 its value is "1,2". Now when I want to order the records according to STORY_LENGTH, the records are ordered like this: record1 > record3 > record2. It's clear that STORY_LENGTH's data type is string and ORDER BY ASC is ordering that value as a string. But here comes the problem. For example, when record4="10" and record5="2" and I try to order, it looks like record4 > record5, which obviously I don't want, because 2 > 10, and I am using a string format just because of the multiple values of the field.
So, anybody, can you help me out of this? I need some good idea to fix.
thanks
Multi-valued fields as you describe mean your data model is broken and should be normalized.
Once this is done, querying becomes much simpler.
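A minimal sketch of what normalized storage might look like (names made up; exact syntax varies by RDBMS):

CREATE TABLE story (
  story_id INT PRIMARY KEY
  -- other story columns ...
);

CREATE TABLE story_length (
  story_id     INT NOT NULL REFERENCES story (story_id),
  story_length INT NOT NULL
);

-- Ordering stories by their smallest length becomes a plain numeric sort:
SELECT s.story_id, MIN(sl.story_length) AS min_length
FROM story s
JOIN story_length sl ON sl.story_id = s.story_id
GROUP BY s.story_id
ORDER BY min_length;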
From what I've understood, you want to sort items by the second or first number in comma-separated values stored in a VARCHAR field. The implementation would depend on the database used; for example, in MySQL it would look like:
SELECT * FROM stories
ORDER BY CAST(COALESCE(SUBSTRING_INDEX(story_length, ',', -1), '0') AS SIGNED)
Yet it is generally not a good idea to use such sorting, for performance reasons: sorting would require scanning the whole table instead of using an index on the field.
Edit: After the edits it looks like you want to sort on the first value and ignore the value(s) after the comma. As, according to a comment above, changes to the database design are not an option, just use the following code for sorting:
SELECT * FROM stories
ORDER BY CAST(COALESCE(NULLIF(SUBSTRING_INDEX(story_length, ',', 1), ''), '0') AS SIGNED)

SQL - Calculating if an alphanumeric value exists in an alphanumeric range

I have a varchar column in a database, and a requirement has come in that a user can enter a to/from range, e.g. ABC001 to ABC100.
I have the following query but feel it might not be strict enough to work out if any values within that range exist.
SELECT count(*) FROM MyTable where MyColumn between 'ABC001' and 'ABC005'
I have a feeling an ORDER BY should be used, or is there a better way to calculate the existence of values within an alphanumeric range?
No ORDER BY is required. That should be perfect.
If you want to speed that operation up, you can create an index on the column.
The ORDER BY operation is done at the end of query execution, so the data will be retrieved in the same way.
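For example (exact syntax depends on the RDBMS):

CREATE INDEX idx_mytable_mycolumn ON MyTable (MyColumn);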
OP said:
or is there a better way to calculate the existence of values within a alphanumeric range
The best way would be:
SELECT count(*) FROM MyTable where MyColumn>='ABC001' and MyColumn<='ABC005'
I find most people can't remember whether BETWEEN includes or excludes the "end points". By always using >= and/or > and/or <= and/or < you have more clarity and flexibility.
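For example, these two are equivalent (BETWEEN includes both end points):

SELECT count(*) FROM MyTable WHERE MyColumn BETWEEN 'ABC001' AND 'ABC005';
SELECT count(*) FROM MyTable WHERE MyColumn >= 'ABC001' AND MyColumn <= 'ABC005';

-- and a half-open range (upper bound excluded) needs the explicit form:
SELECT count(*) FROM MyTable WHERE MyColumn >= 'ABC001' AND MyColumn < 'ABC006';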
Any ORDER BY would be applied to the resulting set of rows that meet the WHERE condition, and has nothing to do with the WHERE filtering. You can use it if you want the final result set in a particular order, but it will have no effect on which rows are included in the results.