SQL - Calculating if an alphanumeric value exists in an alphanumeric range

I have a varchar column in a database, and a requirement has come in that lets a user enter a from/to range, e.g. ABC001 to ABC100.
I have the following query, but I feel it might not be strict enough to work out whether any values within that range exist.
SELECT count(*) FROM MyTable where MyColumn between 'ABC001' and 'ABC005'
I have a feeling an ORDER BY should be used, or is there a better way to calculate the existence of values within an alphanumeric range?

No ORDER BY is required. That should be perfect.

If you want to speed that query up you can create an index on the column.
The ORDER BY operation is done at the end of query execution, so the same rows would be retrieved either way.
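A minimal sketch of such an index, assuming the question's table and column names (the index name is made up):
CREATE INDEX IX_MyTable_MyColumn ON MyTable (MyColumn);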

OP said:
or is there a better way to calculate
the existence of values within a
alphanumeric range
The best way would be:
SELECT count(*) FROM MyTable where MyColumn >= 'ABC001' and MyColumn <= 'ABC005'
I find most people can't remember if BETWEEN includes or excludes the "end points". By just always using >= and/or > and/or <= and/or < you have more clarity and flexibility.
Any ORDER BY would be applied to the resulting set of rows that meet the WHERE condition, and has nothing to do with the WHERE filtering. You can use it if you want the final result set in a particular order, but it will have no effect on which rows are included in the results.
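For instance, if you did want the matching values back in sorted order, the ORDER BY merely sorts the rows the WHERE clause already selected (a sketch using the question's names):
SELECT MyColumn FROM MyTable where MyColumn >= 'ABC001' and MyColumn <= 'ABC005' ORDER BY MyColumn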

Related

Finding the last 4, 3, 2, 1 months consecutive order drops among clients based on drop variance

Here I have a query that finds the drop percentage for a set of clients based on the orders they have received (i.e. it finds the percentage difference in orders by comparing the current month with the previous month). What I want to achieve is a field where I can see which clients had a 4-month continuous drop, a 3-month drop, a 2-month drop, and a 1-month drop.
I know it can only be achieved by comparing the last 4 months using the LAG function or subqueries. Can you please help me out on this one? I would appreciate it very much.
select
    fd.customers2, fd.Month1, fd.year1, fd.variance,
    case when (fd.variance < -0.00001 and fd.year1 = '2022.0' and fd.Month1 = '1')
         then '1month drop' else fd.customers2 end as 1_most_host_drop
from
    (SELECT
        c.*,
        sa.customers as customers2,
        sum(sa.order) as orders,
        date_part(mon, sa.date) as Month1,
        date_part(year, sa.date) as year1,
        (cast(orders - LAG(orders) OVER (partition by customers2 ORDER BY year1, Month1) as NUMERIC(10,2))
            / NULLIF(LAG(orders) OVER (partition by customers2 ORDER BY year1, Month1) * 1, 0)) AS variance
    FROM stats sa
    join (select distinct d.id, d.customers from configer d) c
        on sa.customers = c.customers
    WHERE sa.date >= '2021-04-1'
    GROUP BY Month1, sa.customers, c.id, year1, c.customers) fd
In a spirit of friendliness: I think you are a little premature in posting this here as there are several issues with the syntax before even reaching the point where you can solve the problem:
You have at least two places with a comma immediately preceding the word FROM:
...AS variance, FROM stats_archive sa ...
...d.customers, FROM config d...
Recommend you don't use VARIANCE as an alias (it is a system function in PostgreSQL and so is likely also a system function name in Redshift)
Not super important, but there's no need for c.* - just select the columns you will use
DATE_PART requires a string as the first parameter, e.g. DATE_PART('mon', current_date)
I might be wrong about this, but I suspect you cannot use column aliases in the partition by or order by of a window function. Put the originating expressions there instead:
... OVER (PARTITION BY customers2 ORDER BY DATE_PART('year',sa.date),DATE_PART('mon',sa.date))
LAG has three parameters. (1) The column you want to retrieve the value from, (2) the row offset, where a positive integer indicates how many rows prior to the current row you should retrieve a value from according to the partition and order context and (3) the value the function should return as a default (in case of the first row in the partition). As such, you don't need NULLIF. So, to get the row immediately prior to the current row, or return 0 in case the current row is the first row in the partition:
LAG(orders,1,0) OVER (PARTITION BY customers2 ORDER BY DATE_PART('year',sa.date),DATE_PART('mon',sa.date))
If you use 0 as the default in the calculation of what is currently aliased variance, you will almost certainly run into a div/0 error, either now or, worse, when you least expect it in the future. You should protect against that with some CASE logic (see the sketch after this list), or better, provide a more appropriate default value, or better still, calculate the LAG with the default 0 and then filter out the 0 rows before doing the calculation.
You can't use column aliases in the GROUP BY. You must reference each field that is not participating in an aggregate in the group by, whether through direct mention (sa.date) or indirectly in an expression (DATE_PART('mon',sa.date))
Your date should be '2021-04-01'
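For illustration, the CASE protection mentioned above could look like this (prev_orders is a hypothetical alias for the LAG result, computed in a source query):
CASE WHEN prev_orders = 0 THEN NULL
     ELSE CAST(orders - prev_orders AS NUMERIC(10,2)) / prev_orders
END AS pct_change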
All in all, without sample data and expected results based on that sample data, and without first removing the syntax errors, it is a tall order to offer advice on the problem that is any more specific than this:
Build the source of the calculation as a completely separate query first, and calculate the LAG in that source query. Only when you've run that source query and verified that the LAG is producing the correct result should you wrap it as a sub-query or CTE (not sure if Redshift supports these, but presumably), at which point you can filter out the rows with a zero in the denominator (the first month of orders for each customer).
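Purely as a structural sketch of that shape (it borrows the question's table and column names; sa."order" is quoted because ORDER is a reserved word; this is not a tested, drop-in fix):
WITH monthly AS (
    SELECT sa.customers AS customers2,
           DATE_PART('year', sa.date) AS year1,
           DATE_PART('mon', sa.date)  AS month1,
           SUM(sa."order")            AS orders
    FROM stats sa
    WHERE sa.date >= '2021-04-01'
    GROUP BY sa.customers, DATE_PART('year', sa.date), DATE_PART('mon', sa.date)
),
source AS (
    SELECT customers2, year1, month1, orders,
           LAG(orders, 1, 0) OVER (PARTITION BY customers2
                                   ORDER BY year1, month1) AS prev_orders
    FROM monthly
)
SELECT customers2, year1, month1,
       CAST(orders - prev_orders AS NUMERIC(10,2)) / prev_orders AS pct_change
FROM source
WHERE prev_orders <> 0;    -- drop each customer's first month (zero denominator)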
Good luck!

Greater than any element of array

I am facing this issue:
Is there a way in PostgreSQL to put aggregated timestamp data into an array (for example using the array_agg function) and then match any element against some condition?
I am doing something similar with LIKE on aggregated strings (using string_agg(column, ';')). But how do I do something similar with timestamps?
So if the result were '{10.10.2021,20.12.2021,1.1.1996}' as timestamp_array, I would like to filter rows that have at least one array element after some input.
For example, ... WHERE 31.12.2021 > timestamp_array ... would not match the row above, because there is no array element after 31.12.2021.
But if I query ... WHERE 31.12.1996 > timestamp_array ..., the row above would be matched (because at least one element of the array is in the given interval).
First, you would use standard date formats. Then you can use:
where '2021-12-31' > any (timestamp_array)
Here is a db<>fiddle to illustrate the idea.
I would like to filter rows that have at least one array element that is after some input?
You can use the ANY construct as has been advised.
WHERE '1996-12-31'::timestamp < ANY ('{2021-10-10, 2021-12-20, 1996-01-01}'::timestamp[])
Has to be <, not >, obviously.
Your "timestamps" look a lot like dates - timestamp input accepts that, too.
But always use the recommended ISO 8601 format (as demonstrated), else your input depends on the settings of the current session.
See:
IN vs ANY operator in PostgreSQL
But chances are, there is a much more efficient way. You speak of "aggregated timestamp data". Typically it's much more efficient to check before aggregating, not least because that can use indexes, which your approach cannot. Typically, EXISTS does the job. Something like:
SELECT ...
FROM tbl t
WHERE EXISTS (SELECT FROM tbl t1 WHERE t1.id = t.id AND t1.timestamp_column > '1996-12-31')
GROUP BY ...
Start a new question with details of your query to get a fitting solution.

String ending in range of numbers

I have a column with data of the following structure:
aaa5644988
aaa4898494
aaa5642185
aaa5482312
aaa4648848
I have a range that can be anything, like 100-30000 for example. I want all values that end in a number within that range.
I tried
like '%[100-30000]'
but this doesn't work apparently.
I have seen a lot of similar questions but none of them solved my problem.
Edit: I'm using SQL Server 2008.
Example:
Value
1. aaa45645695
2. aaa28568720
3. aaa65818450
4. 8789212
5. 6566700
For the range 600-1200, I want to retrieve rows 1, 2 and 5, because they end with a number in that range.
In SQL, LIKE normally supports only the two wildcards % and _. (SQL Server also allows a [ ] character class, but it matches a single character, so it cannot express a multi-digit numeric range.) That's why like '%[100-30000]' doesn't work.
Depending on your use case, there could be two solutions to this problem:
If you only need to do this query two or three times (and don't care how long it takes), or the dataset is not very big, you can select all the data from this column and then do the filtering in another programming language.
Take ruby for example, you can do:
column_data = @connection.execute("select your_column_name from your_table")
result = column_data.map { |x| x.gsub(/^.*[^\d]/, '').to_i }.select { |x| x >= 100 && x <= 30000 }
If you need to do this query regularly, I'd suggest adding a new column to the table containing only the number from the current column (a sketch for populating such a column follows the query below), so as to get much better query performance.
SELECT *
FROM your_table
WHERE number_column BETWEEN 100 AND 30000
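One way such a column could be computed in SQL Server 2008, as a rough sketch (your_table and value are placeholder names; PATINDEX and REVERSE pick off the trailing digits):
SELECT value,
       CAST(CASE
                WHEN value NOT LIKE '%[^0-9]%' THEN value   -- the whole value is digits
                WHEN value LIKE '%[0-9]'
                    THEN RIGHT(value, PATINDEX('%[^0-9]%', REVERSE(value)) - 1)  -- trailing digits only
                ELSE NULL                                   -- no trailing digits
            END AS INT) AS number_column
FROM your_table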

How to handle string ordering in order by clause?

Suppose I want to order records by a field (string data type) called STORY_LENGTH. This field is multi-valued, and I represent the multiple values using commas. For example, record1's value is "1", record2's value is "1,3", and record3's value is "1,2". Now, when I order the records by STORY_LENGTH, they come out as record1 > record3 > record2. Clearly STORY_LENGTH is a string, and ORDER BY ASC is ordering the values as strings. But here comes the problem: when record4="10" and record5="2", the string ordering puts "10" before "2", which obviously I don't want, because numerically 2 < 10. I am only using a string format because of the multiple values in the field.
So, anybody, can you help me out of this? I need some good idea to fix.
thanks
Multi-valued fields like you describe mean your data model is broken and should be normalized.
Once this is done, querying becomes much simpler.
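For illustration, a sketch of what that could look like (table and column names are made up; stories is borrowed from the answer below):
CREATE TABLE story_lengths (
    story_id     INTEGER NOT NULL,   -- references stories.id (hypothetical key)
    length_value INTEGER NOT NULL    -- one row per value instead of "1,3"
);
SELECT s.id
FROM stories s
JOIN story_lengths sl ON sl.story_id = s.id
GROUP BY s.id
ORDER BY MIN(sl.length_value);       -- plain numeric sort, no string parsing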
From what I've understood, you want to sort items by the second or first number in comma-separated values stored in a VARCHAR field. The implementation would depend on the database used; for example, in MySQL it would look like:
SELECT * FROM stories
ORDER BY CAST(COALESCE(SUBSTRING_INDEX(story_length, ',', -1), '0') AS SIGNED)
Yet it is generally not good to use such sorting, for performance reasons, as it requires scanning the whole table instead of using an index on the field.
Edit: After your edits, it looks like you want to sort on the first value and ignore the value(s) after the comma. As, according to a comment above, changes to the database design are not an option, just use the following code for sorting:
SELECT * FROM stories
ORDER BY CAST(COALESCE(NULLIF(SUBSTRING_INDEX(story_length, ',', 1), ''), '0') AS SIGNED)

SQL statement HAVING MAX(some+thing)=some+thing

I'm having trouble with Microsoft Access 2003, it's complaining about this statement:
select cardnr
from change
where year(date)<2009
group by cardnr
having max(time+date) = (time+date) and cardto='VIP'
What I want to do is, for every distinct cardnr in the table change, to find the row with the latest (time+date) that is before year 2009, and then just select the rows with cardto='VIP'.
This validator says it's OK, Access says it's not OK.
This is the message I get: "you tried to execute a query that does not include the specified expression 'max(time+date)=time+date and cardto='VIP' and cardnr=' as part of an aggregate function."
Could someone please explain what I'm doing wrong and the right way to do it? Thanks
Note: The field and table names are translated and do not collide with any reserved words, I have no trouble with the names.
Try to think of it like this: HAVING is applied after the aggregation is done.
Therefore it cannot compare to unaggregated expressions (neither time+date nor cardto).
However, to get the last time and date (the principle is the same for getting rows related to other aggregate functions as well), you can do something like:
SELECT cardnr
FROM change main
WHERE time+date IN (SELECT MAX(time+date)
                    FROM change sub
                    WHERE sub.cardnr = main.cardnr
                      AND year(date) < 2009
                      AND cardto = 'VIP')
(assuming that the date part of your time field is the same for all records; having two fields for date/time is not in your best interest, and using reserved words for field names can backfire in certain cases)
It works because the subquery is filtered only on the records that you are interested in from the outer query.
Applying the same year(date)<2009 and cardto='VIP' filters to the outer query can improve performance further.
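A sketch of what that could look like (untested; same table and field names as above):
SELECT cardnr
FROM change main
WHERE year(date) < 2009
  AND cardto = 'VIP'
  AND time+date = (SELECT MAX(time+date)
                   FROM change sub
                   WHERE sub.cardnr = main.cardnr
                     AND year(sub.date) < 2009
                     AND sub.cardto = 'VIP')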