Find entries with an array of dates where any date falls within a certain range in Postgres (SQL)

I have a postgres table with columns:
id: text
availabilities: integer[]
A given ID can have multiple availabilities (different, non-consecutive days spread over a range of up to a few years). Each availability is a Unix timestamp (in seconds) for a certain day.
Hours, minutes, seconds and milliseconds are set to 0, i.e. each timestamp represents the start of a day.
Question:
How can I find, very fast, all IDs that contain at least one availability within a certain from-to range (also given as timestamps)?
I could also store them differently in the array, e.g. as "days since epoch", if needed (to get steps of 1 (day) instead of 86400 (seconds)).
However, if possible (and the speed is roughly the same), I want to use an array and one row per entry.
Example:
Data (0 = day 1, 86400 = day 2, ...):
| id | availabilities |
| 1  | [0, 86400, 172800, 259200] |
| 2  | [86400, 259200] |
| 3  | [345600] |
| 4  | [172800] |
| 5  | [0] |
Now I want to get a list of IDs that contain at least one availability which:
is between 86400 AND 259200 --> IDs 1, 2, 4
is between 172800 AND 172800 --> IDs 1, 4
is between 259200 AND (max int) --> IDs 1, 2, 3

In PostgreSQL, the unnest function is the standard way to convert array elements into rows, and it performs well. Sample query:
with mytable as (
    select 1 as id, '{12,2500,6000,200}'::int[] as pint
    union all
    select 2 as id, '{0,200,3500,150}'::int[]
    union all
    select 4 as id, '{20,10,8500,1100,9000,25000}'::int[]
)
select id, unnest(pint) as pt from mytable;
-- Returns:
id | pt
---+-------
 1 | 12
 1 | 2500
 1 | 6000
 1 | 200
 2 | 0
 2 | 200
 2 | 3500
 2 | 150
 4 | 20
 4 | 10
 4 | 8500
 4 | 1100
 4 | 9000
 4 | 25000
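Applied to the original question (a minimal sketch; mytable, id and availabilities are the names from the question), you can keep one row per ID and test the array with an EXISTS over unnest():

-- Find all IDs with at least one availability in the range [86400, 259200].
SELECT t.id
FROM mytable t
WHERE EXISTS (
    SELECT 1
    FROM unnest(t.availabilities) AS a
    WHERE a BETWEEN 86400 AND 259200
);

With the example data above, this returns IDs 1, 2 and 4.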

Related

Multiplication of returns by company increasing in time (BHARs)

I have the following DataFrame, organized as panel data. It contains daily returns of many companies on different days following the IPO date. day_diff represents the days that have passed since the IPO, and return_1 represents the daily individual return for that specific day and company, to which I have already added +1. Each company has its own company_tic, and I have about 300 companies. My goal is to calculate the first component of the right-hand side of the BHAR equation, i.e. the cumulative product of the (1 + return) terms, with results for each day_diff and company_tic, always starting at day 0 and running to the last day of data (e.g. from day 0 to day 1, then from day 0 to day 2, from day 0 to day 3, and so on until my last day, which is day 730). I have tried df.groupby(['company_tic', 'day_diff'])['return_1'].expanding().prod() but it doesn't work. Any alternatives?
Index   day_diff  company_tic  return_1
0       0         xyz          1.8914
1       1         xyz          1.0542
2       2         xyz          1.0016
3       0         abc          1.4398
4       1         abc          1.1023
5       2         abc          1.0233
...     ...       ...          ...
[159236 rows x 3 columns]
I'm not sure I fully understand what you want, but you might want to use cumprod instead of expanding().prod().
Here's what I tried:
df['return_1_prod'] = df.groupby('company_tic')['return_1'].cumprod()
Output:
   day_diff company_tic  return_1  return_1_prod
0         0         xyz    1.8914       1.891400
1         1         xyz    1.0542       1.993914
2         2         xyz    1.0016       1.997104
3         0         abc    1.4398       1.439800
4         1         abc    1.1023       1.587092
5         2         abc    1.0233       1.624071

Find correct bucket for the number

I have buckets of different widths (3-3-4, repeating), and I need to find the right bucket for a number:
1-3 4-6 7-10 11-13 14-16 17-20 21-23 24-26 27-30
What would be an efficient formula to find the right bucket? For example, the number 5 lies in bucket 4-6, and 18 lies in bucket 17-20.
I need to write a SQL query for this.
I would maintain a separate bona fide table containing the range values. Then, join to it to get the output you want.
Table: ranges
start | end
1 | 3
4 | 6
7 | 10
11 | 13
14 | 16
17 | 20
21 | 23
24 | 26
27 | 30
WITH buckets AS (
    SELECT 5 AS val UNION ALL
    SELECT 18
)
SELECT b.val, CAST(r.start AS VARCHAR(10)) + '-' + CAST(r.end AS VARCHAR(10))
FROM buckets b
INNER JOIN ranges r
    ON b.val BETWEEN r.start AND r.end
ORDER BY b.val;
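A side note: the + string concatenation in that query is SQL Server syntax. In PostgreSQL (the dialect used elsewhere on this page) the equivalent uses ||, and end would have to be quoted because it is a reserved word:

-- PostgreSQL version of the same join; "end" must be quoted.
SELECT b.val, r.start::text || '-' || r."end"::text AS bucket
FROM buckets b
INNER JOIN ranges r
    ON b.val BETWEEN r.start AND r."end"
ORDER BY b.val;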
For non-overlapping ranges, you only need the starting value of each range (or, equivalently, the ending). Then INDEX that column. Then this is very efficient:
SELECT ...
FROM ...
WHERE n >= ?
ORDER BY n
LIMIT 1
Because of the INDEX, this runs in O(1). (Or O(log N) if you include the cost of the BTree lookup.)
More details (couched in IP-addresses): http://mysql.rjweb.org/doc.php/ipranges
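To make that concrete, here is a minimal sketch under those assumptions (bucket_ends and end_val are hypothetical names; it stores each bucket's ending value, matching the query shape above):

-- Only the ending value of each non-overlapping range is stored.
CREATE TABLE bucket_ends (end_val INT PRIMARY KEY);  -- the PRIMARY KEY provides the index
INSERT INTO bucket_ends VALUES (3), (6), (10), (13), (16), (20), (23), (26), (30);

-- Bucket for 18: the smallest ending value >= 18.
SELECT end_val
FROM bucket_ends
WHERE end_val >= 18
ORDER BY end_val
LIMIT 1;  -- returns 20, i.e. the 17-20 bucket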

width_bucket not returning buckets of equal width

I'm using Postgres version 9.6.9 and attempting to use width_bucket() to generate a histogram with buckets of equal width. However, the query I'm using does not return buckets of equal width.
As you can see in the example below, the buckets have varying widths; e.g. bucket 1 has a min of 7 and a max of 23 (a width of 16), while bucket 3 has a min of 52 and a max of 55 (a width of 3).
How can I adjust my query to ensure that each bucket has the same width?
Here's what the data looks like:
value
-------
7
7
15
17
18
22
23
25
29
42
52
52
55
60
74
85
90
90
92
95
(20 rows)
Here's the query and resulting histogram:
WITH min_max AS (
    SELECT
        min(value) AS min_val,
        max(value) AS max_val
    FROM table
)
SELECT
    min(value),
    max(value),
    count(*),
    width_bucket(value, min_val, max_val, 5) AS bucket
FROM table, min_max
GROUP BY bucket
ORDER BY bucket;
min | max | count | bucket
-----+-----+-------+--------
7 | 23 | 7 | 1
25 | 42 | 3 | 2
52 | 55 | 3 | 3
60 | 74 | 2 | 4
85 | 92 | 4 | 5
95 | 95 | 1 | 6
(6 rows)
Have a look at ntile(). From https://prestodb.io/docs/current/functions/window.html:

ntile(n) → bigint

Divides the rows for each window partition into n buckets ranging from 1 to at most n. Bucket values will differ by at most 1. If the number of rows in the partition does not divide evenly into the number of buckets, then the remainder values are distributed one per bucket, starting with the first bucket.
For example, with 6 rows and 4 buckets, the bucket values would be as follows: 1 1 2 2 3 4

Or, say, to rank each runner's 100m race times to find their personal best out of their 10 races:

SELECT NTILE(10) OVER (PARTITION BY runners ORDER BY racetimes)
FROM table
Your buckets are the same size. You just don't have data that accurately represents the end-points.
For instance, would 24 be in the first or second bucket? This is more notable for the ranges without any data, such as 75-83.
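To make that concrete (simple arithmetic from the data above, shown as a query sketch): with min = 7, max = 95 and 5 buckets, each bucket is (95 - 7) / 5 = 17.6 wide:

-- Compute the actual bucket boundaries width_bucket() uses here.
SELECT bucket,
       7 + (bucket - 1) * 17.6 AS lower_bound,  -- inclusive
       7 + bucket * 17.6       AS upper_bound   -- exclusive
FROM generate_series(1, 5) AS bucket;
-- bucket 1: [7.0, 24.6)   bucket 2: [24.6, 42.2)  bucket 3: [42.2, 59.8)
-- bucket 4: [59.8, 77.4)  bucket 5: [77.4, 95.0)  (95 itself falls into bucket 6)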
From https://www.oreilly.com/library/view/sql-in-a/9780596155322/re91.html:

WIDTH_BUCKET(expression, min, max, buckets)

The buckets argument specifies the number of buckets to create over the range defined by min through max. min is inclusive, whereas max is not.

The maximum is not included, so set:

WIDTH_BUCKET(expression, min, max + 1, buckets)
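Applied to the query above (a sketch of the same fix; adding 1 works cleanly here because the values are integers):

WITH min_max AS (
    SELECT
        min(value) AS min_val,
        max(value) AS max_val
    FROM table
)
SELECT
    min(value),
    max(value),
    count(*),
    width_bucket(value, min_val, max_val + 1, 5) AS bucket
FROM table, min_max
GROUP BY bucket
ORDER BY bucket;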

Percentage of variable corresponding to percentage of other variable

I have two numerical variables, and would like to calculate the percentage of one variable corresponding to at least 50% of the other variable's sum.
For example:
A | B
__________
2 | 8
1 | 20
3 | 12
5 | 4
2 | 7
1 | 11
4 | 5
Here, the sum of column B is 67, so I'm looking for the rows (taken in B's descending order) whose cumulative sum reaches at least half of that (33.5).
In this case, those are rows 2, 3 & 6 (cumulative sum of 43). The sum of these rows' column A is 5, which I want to compare to the total sum of column A (18).
Therefore, the result I'm looking for is 5 / 18 * 100 ≈ 27.78%.
I'm looking for a way to implement this in QlikSense, or in SQL.
Here's one way you can do it - there is probably some optimisation to be done, but this gives what you want.
Source:
LOAD
    *,
    RowNo() as RowNo_Source
Inline [
A , B
2 , 8
1 , 20
3 , 12
5 , 4
2 , 7
1 , 11
4 , 5
];

SourceSorted:
NoConcatenate LOAD
    *,
    RowNo() as RowNo_SourceSorted
Resident Source
Order by B asc;

drop table Source;

BTotal:
LOAD sum(B) as BTotal
Resident SourceSorted;

let BTotal = peek('BTotal', 0);

SourceWithCumu:
NoConcatenate LOAD
    *,
    rangesum(peek('BCumu'), B) as BCumu,
    $(BTotal) as BTotal,
    rangesum(peek('BCumu'), B) / $(BTotal) as BCumuPct,
    if(rangesum(peek('BCumu'), B) / $(BTotal) >= 0.5, A, 0) as AFiltered
Resident SourceSorted;

Drop Table SourceSorted;
I included a few debug fields that might be useful, but you could of course remove them.
Then in the front end you do your calculation of sum(AFiltered)/sum(A) to get the stat you want and format it as a percentage.
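Since the question also asks for SQL, here is a minimal window-function sketch of the same logic (the table t and columns a and b are assumed names, not from the original post):

WITH ranked AS (
    SELECT a, b,
           SUM(b) OVER (ORDER BY b DESC
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS b_running,
           SUM(b) OVER () AS b_total,  -- total of B (67 in the example)
           SUM(a) OVER () AS a_total   -- total of A (18 in the example)
    FROM t
)
SELECT 100.0 * SUM(a) / MAX(a_total) AS pct_of_a
FROM ranked
WHERE b_running - b < b_total / 2.0;  -- keep rows until the running B sum reaches half

For the example data this keeps the rows with B = 20, 12 and 11 and returns 100 * 5 / 18 ≈ 27.78.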

Using temporary extended table to make a sum

From a given table I want to be able to sum values having the same number (should be easy, right?).
Problem: a given value can be assigned to anywhere from 2 to n consecutive numbers.
For some reason, this information is stored in a single row describing the value, the starting number and the ending number, as below.
TABLE A
 id | starting_number | ending_number | value
----+-----------------+---------------+-------
  1 |               2 |             5 |     8
  2 |               0 |             3 |     5
  3 |               4 |             6 |     6
  4 |               7 |             8 |    10
For instance, the first row means: value 8 is assigned to numbers 2, 3 and 4 (5 is excluded).
So, I would like the following intermediary result table:
TABLE B
 id | number | value
----+--------+-------
  1 |      2 |     8
  1 |      3 |     8
  1 |      4 |     8
  2 |      0 |     5
  2 |      1 |     5
  2 |      2 |     5
  3 |      4 |     6
  3 |      5 |     6
  4 |      7 |    10
So I can sum value for elements having the same number:
SELECT number, sum(value)
FROM B
GROUP BY number
TABLE C
 number | sum(value)
--------+------------
      2 |         13
      3 |          8
      4 |         14
      0 |          5
      1 |          5
      5 |          6
      7 |         10
I don't know how to do this and didn't find any answer on the web (maybe I wasn't searching with the appropriate keywords...).
Any idea?
You can do what you want with generate_series(). So, TABLE B is basically:

select id, generate_series(starting_number, ending_number - 1, 1) as n, value
from tableA;

Your aggregation is then:

select n, sum(value)
from (select id, generate_series(starting_number, ending_number - 1, 1) as n, value
      from tableA
     ) a
group by n;
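As a side note (a sketch of the same logic): in PostgreSQL 9.3+ you can also move generate_series() into the FROM clause with LATERAL, which avoids the derived table:

SELECT n, sum(value)
FROM tableA
CROSS JOIN LATERAL generate_series(starting_number, ending_number - 1) AS n
GROUP BY n
ORDER BY n;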