Regex to get nth occurrence of pattern (Trino) - sql

WITH t(x,y) AS (
VALUES
(1,'[2]'),
(2,'[1, 2]'),
(3,'[2, 1]'),
(4,'[3, 2, 5]'),
(5,'[3, 2, 5, 2, 4]'),
(6,'[3, 2, 2, 0, 4]')
)
--- my wrong answer below
SELECT
REGEXP_EXTRACT(y, '(\d+,\s)?(2)(,\s\d+)?') AS _1st,
REGEXP_EXTRACT(y,'(.*?(2)){1}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _2nd,
REGEXP_EXTRACT(y,'(.*?(2)){2}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _3rd
FROM t
Expected answer:
| x | y | 1st | 2nd | nth |
| - | --------------- | ------- | ------- | ------- |
| 1 | [2] | 2 | | |
| 2 | [1, 2] | 1, 2 | | |
| 3 | [2, 1] | 2, 1 | | |
| 4 | [3, 2, 5] | 3, 2, 5 | | |
| 5 | [3, 2, 5, 2, 4] | 3, 2, 5 | 5, 2, 4 | |
| 6 | [3, 2, 2, 0, 4] | 3, 2, 2 | 2, 2, 0 | |
Need help with the regex for the REGEXP_EXTRACT function in Presto to get the nth occurrence of the number '2', including the figures before and after it (if any)
Additional info:
The figures in column y are not necessarily single digits.
The order of the numbers is important.
1st, 2nd, 3rd refer to the nth occurrence of the number I am seeking.
I will be looking for a list of numbers, not just 2; 2 is used for illustration purposes.

Must it be a regular expression?
If you treat the text (VARCHAR) [1,2,3] as an array representation (JSON or the internal Array data type), you have more functions available to solve your task.
See related functions supported by Presto:
JSON functions
Array function
I would recommend casting it to an array of integers: CAST('[1,23,456]' AS ARRAY(INTEGER))
Finding the n-th occurrence
From Array functions, array_position(x, element, instance) → bigint to find the n-th occurrence:
If instance > 0, returns the position of the instance-th occurrence of the element in array x.
If instance < 0, returns the position of the instance-to-last occurrence of the element in array x.
If no matching element instance is found, 0 is returned.
Example:
SELECT CAST('[1,2,23,2,456]' AS ARRAY(INTEGER));
SELECT array_position(CAST('[1,2,23,2,456]' AS ARRAY(INTEGER)), 2, 1); -- found in position 2
Note the argument order from the signature above: the array comes first, then the element, then the instance. Now use the found position to build your slice (relative to that position).
Slicing and extracting sub-arrays
either parse it as JSON into a JSON array, then use a JSON path to slice out a sub-array as desired (array slice operator in JSON path: [start, stop, step]);
or cast it as Array and then use slice(x, start, length) → array
Subsets array x starting from index start (or starting from the end if start is negative) with a length of length.
Examples:
SELECT json_extract(json_parse('[1,2,3]'), '$[-2, -1]'); -- the last two elements
SELECT slice(CAST('[1,23,456]' AS ARRAY(INTEGER)), -2, 2); -- [23, 456]
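To make the window logic concrete outside SQL, here is a plain-Python sketch (the helper name `nth_window` is hypothetical, not a Presto function) of "nth occurrence plus its neighbours, clamped at the list ends":

```python
def nth_window(values, target, instance, width=1):
    """Return the sub-list of `width` neighbours on each side of the
    instance-th occurrence of `target`, clamped at the list ends."""
    seen = 0
    for pos, v in enumerate(values):
        if v == target:
            seen += 1
            if seen == instance:
                return values[max(0, pos - width):pos + width + 1]
    return None  # fewer than `instance` occurrences

print(nth_window([3, 2, 5, 2, 4], 2, 2))  # second occurrence of 2 -> [5, 2, 4]
```

This mirrors what array_position plus slice would give you after the cast.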

Related

Filtering based on value and creating list in spark dataframe

I am new to Spark and I am trying to do the following using PySpark:
I have a dataframe with 3 columns: "id", "number1", "number2".
For each value of "id" I have multiple rows, and what I want to do is create a list of tuples of all the rows that correspond to each id.
Eg, for the following dataframe
id | number1 | number2 |
a | 1 | 1 |
a | 2 | 2 |
b | 3 | 3 |
b | 4 | 4 |
the desired outcome would be 2 lists as such:
[(1, 1), (2, 2)]
and
[(3, 3), (4, 4)]
I'm not sure how to approach this, since I'm a newbie. I have managed to get a list of the distinct ids by doing the following
distinct_ids = [x for x in df.select('id').distinct().collect()]
In pandas, which I'm more familiar with, I would now loop through the dataframe for each distinct id and gather all its rows, but I'm sure this is far from optimal.
Can you give me any ideas? groupBy comes to mind, but I'm not sure how to approach it.
You can use groupBy and aggregate using collect_list and array:
import pyspark.sql.functions as F
df2 = df.groupBy('id').agg(F.collect_list(F.array('number1', 'number2')).alias('number'))
df2.show()
+---+----------------+
| id| number|
+---+----------------+
| b|[[3, 3], [4, 4]]|
| a|[[1, 1], [2, 2]]|
+---+----------------+
And if you want to get back a list of tuples,
result = [[tuple(j) for j in i] for i in [r[0] for r in df2.select('number').orderBy('number').collect()]]
which gives result as [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]
If you want a numpy array, you can do
result = np.array([r[0] for r in df2.select('number').collect()])
which gives
array([[[3, 3],
[4, 4]],
[[1, 1],
[2, 2]]])

Array to columns in PostgreSQL when array length is non static

I need something similar to unnest(), but for unnesting to columns rather than rows.
I have a table which has id column and array column. How can I unnest array to columns? Arrays with same ids always have same array length.
EDIT: I'm looking for a query which would work with any array length
SELECT ???? FROM table WHERE id=1;

 id | array              array1 | array2 | ... | arrayn
----+----------------   --------+--------+-----+--------
 1  | {1, 2, ..., 3} ->    1    |   2    | ... |   3
 1  | {4, 5, ..., 6}       4    |   5    | ... |   6
 2  | {7, 8, ..., 9}

Anyone got an idea?
Wouldn't this be the logic?
select array[1] as array1, array[2] as array2
from t
where id = 1;
A SQL query returns a fixed set of columns. You cannot have a regular query that sometimes returns two columns and sometimes returns one or three. In fact, that is one reason to use arrays -- it gives you the flexibility to have variable numbers of values.
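Because the result-set shape must be fixed, a variable-width pivot like this is usually done client-side after fetching the arrays. A minimal plain-Python sketch (hypothetical sample data), padding every row out to the widest array:

```python
# fetched rows: (id, array) pairs, array lengths may differ across ids
rows = [(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [7, 8])]

# the widest array determines how many columns the output needs
width = max(len(arr) for _, arr in rows)
header = ['id'] + [f'array{i + 1}' for i in range(width)]

# pad shorter arrays with None so every output row has the same columns
table = [[id_] + arr + [None] * (width - len(arr)) for id_, arr in rows]
print(header)
print(table)
```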

CodeChef C_HOLIC2 Solution Find the smallest N whose factorial produces P Trailing Zeroes

For CodeChef problem C_HOLIC2, I tried iterating over the elements 5, 10, 15, 20, 25, ... and, for each number, checking the number of trailing zeros using the efficient technique specified over here, but got TLE.
What is the fastest way to solve this using formula method?
Here is the Problem Link
As we know for counting the number of trailing zeros in factorial of a number, the trick used is:
The number of multiples of 5 that are less than or equal to 500 is 500÷5=100
Then, the number of multiples of 25 is 500÷25=20
Then, the number of multiples of 125 is 500÷125=4
The next power of 5 is 625, which is > than 500.
Therefore, the number of trailing zeros of 500! is 100+20+4=124
For detailed explanation check this page
Thus, this count can be represented as:
count = floor(N/5) + floor(N/25) + floor(N/125) + ... (summing until the power of 5 exceeds N)
Using this trick, given a number N you can determine the number of trailing zeros, count, in its factorial. Codechef Problem Link
Now, suppose we are given the number of trailing zeros, count, and we are asked for the smallest number N whose factorial has count trailing zeros. Codechef Problem Link
Here the question is: how can we split count into this representation?
This is harder than it looks because, as the following examples show, count jumps even though N increases by the same amount.
As you can see from the following table, count jumps at values of N whose factorials contain higher powers of 5 as factors, e.g. 25, 50, ..., 125, ...
+-------+-----+
| count | N |
+-------+-----+
| 1 | 5 |
+-------+-----+
| 2 | 10 |
+-------+-----+
| 3 | 15 |
+-------+-----+
| 4 | 20 |
+-------+-----+
| 6 | 25 |
+-------+-----+
| 7 | 30 |
+-------+-----+
| 8 | 35 |
+-------+-----+
| 9 | 40 |
+-------+-----+
| 10 | 45 |
+-------+-----+
| 12 | 50 |
+-------+-----+
| ... | ... |
+-------+-----+
| 28 | 120 |
+-------+-----+
| 31 | 125 |
+-------+-----+
| 32 | 130 |
+-------+-----+
| ... | ... |
+-------+-----+
You can see from any brute-force program for this task that these jumps occur regularly, i.e. at count = 6, 12, 18, 24 in the case of numbers whose factorials contain 25 as a factor. (Interval = 6 = 1×5+1)
After count = 31, factorials will also have a factor of 125. The jumps corresponding to 25 will still occur with the same frequency, i.e. at 31, 37, 43, ...
Now the next jump corresponding to 125 will be at 31 + 31 = 62. Thus jumps corresponding to 125 will occur at 31, 62, 93, 124. (Interval = 31 = 6×5+1)
Now the jump corresponding to 625 will occur at 31×5 + 1 = 155 + 1 = 156.
Thus you can see there exists a pattern. We need to find the formula for this pattern to proceed.
The series formed is 1, 6, 31, 156, ...
which is 1, 1+5, 1+5+5^2, 1+5+5^2+5^3, ...
Thus, the nth term is the sum of the first n terms of a G.P. with a = 1, r = 5, i.e. t_n = (5^n - 1)/4.
Thus, count can be decomposed as something like 31+31+6+1+1, etc.
We need to find the t_n (the sum of the G.P. above) which is less than or equal to count but closest to it.
Say the number is count = 35; then we identify that t_n = 31 is closest. For count = 63 we again get t_n = 31 as the closest, but note that here 31 can be subtracted twice from count = 63. So we go on finding this n and keep subtracting t_n from count until count becomes 0.
The algorithm used (as runnable Python):

import math

count = int(input())
N = 0
while count != 0:
    n = int(math.log(4 * count + 1, 5))          # floor of log base 5
    baseSum = (5 ** n - 1) // 4                  # t_n = (5^n - 1) / 4
    baseOffset = (5 ** n) * (count // baseSum)   # integer division
    count = count % baseSum
    N += baseOffset
print(N)

Here, 5 ** n means 5^n.
Let's try working this out for an example. Say count = 70:
Iteration 1: n = 3, baseSum = 31, baseOffset = 125 × (70 ÷ 31) = 250, count = 70 mod 31 = 8, N = 250
Iteration 2: n = 2, baseSum = 6, baseOffset = 25 × (8 ÷ 6) = 25, count = 8 mod 6 = 2, N = 275
Iteration 3: n = 1, baseSum = 1, baseOffset = 5 × (2 ÷ 1) = 10, count = 0, N = 285
So N = 285, and indeed 285! has 70 trailing zeros.
Take another example. Say count = 124, which is the one discussed at the beginning of this page:
Iteration 1: n = 3, baseSum = 31, baseOffset = 125 × (124 ÷ 31) = 500, count = 124 mod 31 = 0, N = 500
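The whole approach can be put together as a small self-contained Python sketch. It follows the algorithm above, but finds n with an integer loop instead of a floating-point log, to avoid rounding issues at exact powers of 5; trailing_zeros is the standard count from the beginning of this answer:

```python
def trailing_zeros(n):
    """Number of trailing zeros of n!: n//5 + n//25 + n//125 + ..."""
    count, p = 0, 5
    while p <= n:
        count += n // p
        p *= 5
    return count

def smallest_n(count):
    """Smallest N whose factorial has `count` trailing zeros, via the
    greedy G.P. decomposition described above. Some counts (e.g. 5) are
    unreachable, so the result should be verified with trailing_zeros."""
    N = 0
    while count != 0:
        # largest n with t_n = (5**n - 1) // 4 <= count (integer-only)
        n = 1
        while (5 ** (n + 1) - 1) // 4 <= count:
            n += 1
        base_sum = (5 ** n - 1) // 4
        N += (5 ** n) * (count // base_sum)
        count %= base_sum
    return N

print(smallest_n(70), smallest_n(124))  # -> 285 500
```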

How to calculate power consumption from power records?

I have a table which contains power values (kW) for devices. Values are read from each device once a minute and inserted into the table with a timestamp. What I need to do is calculate the power consumption (kWh) for a given time span and return the 10 most power-consuming devices. Right now I query the results for the given time span and do the calculation in the backend, looping over all records. This works fine with a small number of devices and a short time span, but in a real use case I could have thousands of devices and a long time span.
So my question is how could I do this all in PostgreSQL 9.4.4 so that my query would return only 10 most power consuming (device_id, power_consumption) pairs?
Example table:
CREATE TABLE measurements (
id serial primary key,
device_id integer,
power real,
created_at timestamp
);
Simple data example:
| id | device_id | power | created_at |
|----|-----------|-------|--------------------------|
| 1 | 1 | 10 | August, 26 2015 08:23:25 |
| 2 | 1 | 13 | August, 26 2015 08:24:25 |
| 3 | 1 | 12 | August, 26 2015 08:25:25 |
| 4 | 2 | 103 | August, 26 2015 08:23:25 |
| 5 | 2 | 134 | August, 26 2015 08:24:25 |
| 6 | 2 | 2 | August, 26 2015 08:25:25 |
| 7 | 3 | 10 | August, 26 2015 08:23:25 |
| 8 | 3 | 13 | August, 26 2015 08:24:25 |
| 9 | 3 | 20 | August, 26 2015 08:25:25 |
Wanted results for query:
| id | device_id | power_consumption |
|----|-----------|-------------------|
| 1 | 1 | 24.0 |
| 2 | 2 | 186.5 |
| 3 | 3 | 28.0 |
Simplified example (created_at in hours) of how I calculate the kWh value:
data = [
[
{ 'id': 1, 'device_id': 1, 'power': 10.0, 'created_at': 0 },
{ 'id': 2, 'device_id': 1, 'power': 13.0, 'created_at': 1 },
{ 'id': 3, 'device_id': 1, 'power': 12.0, 'created_at': 2 }
],
[
{ 'id': 4, 'device_id': 2, 'power': 103.0, 'created_at': 0 },
{ 'id': 5, 'device_id': 2, 'power': 134.0, 'created_at': 1 },
{ 'id': 6, 'device_id': 2, 'power': 2.0, 'created_at': 2 }
],
[
{ 'id': 7, 'device_id': 3, 'power': 10.0, 'created_at': 0 },
{ 'id': 8, 'device_id': 3, 'power': 13.0, 'created_at': 1 },
{ 'id': 9, 'device_id': 3, 'power': 20.0, 'created_at': 2 }
]
]
# device_id: power_consumption
results = { 1: 0, 2: 0, 3: 0 }
for d in data:
    for i in range(len(d) - 1):
        # Area under the line between two adjacent records gives kWh
        # X-axis is time (h)
        # Y-axis is power (kW)
        x1 = d[i]['created_at']
        x2 = d[i+1]['created_at']
        y1 = d[i]['power']
        y2 = d[i+1]['power']
        results[d[i]['device_id']] += ((x2 - x1) * (y2 + y1)) / 2
print(results)
EDIT: Check this to see how I ended up solving this.
Some of the elements that you'll need in order to do this are:
Sum() aggregations, to calculate the total of a number of records
Lag()/Lead() functions, to calculate for a given record what the "previous" or "next" record's values were.
So where for a given row you can get the current created_at and power records, in SQL you'd probably use a Lead() windowing function to get the created_at and power records for the record for the same device id that has the next highest value for created_at.
Docs for Lead() are here: http://www.postgresql.org/docs/9.4/static/functions-window.html
When for each row you have calculated the power consumption by reference to the "next" record, you can use a Sum() to aggregate up all of the calculated powers for that one device.
When you have calculated the power per device, you can use ORDER BY and LIMIT to select the top n power-consuming devices.
Steps to follow, if you're not confident enough to plunge in and just write the final SQL -- after each step make sure you have SQL you understand, and which returns just the data you need:
Start small, by selecting the data rows that you want.
Work out the Lead() function, defining the appropriate partition and order clauses to get the next row.
Add the calculation of power per row.
Define the Sum() function, and group by the device id.
Add the ORDER BY and LIMIT clauses.
If you have trouble with any one of these steps, they would each make a decent StackOverflow question.
If someone happens to wonder the same thing, here is how I solved it.
I followed the instructions by David and made this:
SELECT
t.device_id,
sum(len_y*(extract(epoch from date_trunc('milliseconds', len_x)))/7200) AS total
FROM (
SELECT
m.id,
m.device_id,
m.power,
m.created_at,
m.power+lag(m.power) OVER (
PARTITION BY device_id
ORDER BY m.created_at
) AS len_y,
m.created_at-lag(m.created_at) OVER (
PARTITION BY device_id
ORDER BY m.created_at
) AS len_x
FROM
    measurements AS m
WHERE m.created_at BETWEEN '2015-08-26 13:39:57.834674'::timestamp
    AND '2015-08-26 13:43:57.834674'::timestamp
) AS t
GROUP BY t.device_id
ORDER BY total DESC
LIMIT 10;
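As a sanity check, the trapezoidal calculation this query performs can be reproduced in plain Python against the question's simplified sample data (timestamps in hours):

```python
# (created_at_hours, power_kw) samples per device, from the question
samples = {
    1: [(0, 10.0), (1, 13.0), (2, 12.0)],
    2: [(0, 103.0), (1, 134.0), (2, 2.0)],
    3: [(0, 10.0), (1, 13.0), (2, 20.0)],
}

def consumption_kwh(points):
    """Trapezoidal integration, mirroring the SQL's
    (power + lag(power)) * interval / 2 per adjacent pair."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

totals = {d: consumption_kwh(p) for d, p in samples.items()}
top = sorted(totals, key=totals.get, reverse=True)
print(totals)  # {1: 24.0, 2: 186.5, 3: 28.0}
```

This reproduces the question's expected results (24.0, 186.5, 28.0) and ranks device 2 first.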

Convert multi-dimensional array to records

Given: {{1,"a"},{2,"b"},{3,"c"}}
Desired:
foo | bar
-----+------
1 | a
2 | b
3 | c
You can get the intended result with the following query; however, it'd be better to have something that scales with the size of the array.
SELECT arr[subscript][1] as foo, arr[subscript][2] as bar
FROM ( select generate_subscripts(arr,1) as subscript, arr
from (select '{{1,"a"},{2,"b"},{3,"c"}}'::text[][] as arr) input
) sub;
This works:
select key as foo, value as bar
from json_each_text(
json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);
Result:
foo | bar
-----+------
1 | a
2 | b
3 | c
Docs
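In case the mechanics of the trick are unclear, here is its shape in plain Python (an illustration only, not PostgreSQL):

```python
# the two-dimensional array becomes key/value pairs, which is how
# json_object treats a text[][] of {key, value} entries
pairs = [('1', 'a'), ('2', 'b'), ('3', 'c')]

obj = dict(pairs)         # json_object step: {"1": "a", "2": "b", "3": "c"}
rows = list(obj.items())  # json_each_text step: one (key, value) row each
print(rows)  # [('1', 'a'), ('2', 'b'), ('3', 'c')]
```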
Not sure what exactly you mean by "it'd be better to have something that scales with the size of the array". Of course, you cannot have extra columns added to the result set as the inner array size grows, because PostgreSQL must know the exact columns of a query before executing it (i.e. before it begins to read the string).
But I would like to propose converting the string into normal relational representation of matrix:
select i, j, arr[i][j] a_i_j from (
select i, generate_subscripts(arr,2) as j, arr from (
select generate_subscripts(arr,1) as i, arr
from (select ('{{1,"a",11},{2,"b",22},{3,"c",33},{4,"d",44}}'::text[][]) arr) input
) sub_i
) sub_j
Which gives:
i | j | a_i_j
--+---+------
1 | 1 | 1
1 | 2 | a
1 | 3 | 11
2 | 1 | 2
2 | 2 | b
2 | 3 | 22
3 | 1 | 3
3 | 2 | c
3 | 3 | 33
4 | 1 | 4
4 | 2 | d
4 | 3 | 44
Such a result may be quite usable in further data processing, I think.
Of course, such a query can only handle arrays with a predefined number of dimensions, but the array sizes in each dimension can change without rewriting the query, so this is a somewhat more flexible approach.
ADDITION: Yes, using WITH RECURSIVE one can build a similar query capable of handling arrays with arbitrary dimensions. Nonetheless, there is no way to overcome the limitation of the relational data model: the exact set of columns must be defined at query parse time and cannot be delayed until execution time. So we are forced to store all indices in one column, using another array.
Here is the query that extracts all elements from arbitrary multi-dimensional array along with their zero-based indices (stored in another one-dimensional array):
with recursive extract_index(k,idx,elem,arr,n) as (
select (row_number() over())-1 k, idx, elem, arr, n from (
select array[]::bigint[] idx, unnest(arr) elem, arr, array_ndims(arr) n
from ( select '{{{1,"a"},{11,111}},{{2,"b"},{22,222}},{{3,"c"},{33,333}},{{4,"d"},{44,444}}}'::text[] arr ) input
) plain_indexed
union all
select k/array_length(arr,n)::bigint k, array_prepend(k%array_length(arr,2),idx) idx, elem, arr, n-1 n
from extract_index
where n!=1
)
select array_prepend(k,idx) idx, elem from extract_index where n=1
Which gives:
idx | elem
--------+-----
{0,0,0} | 1
{0,0,1} | a
{0,1,0} | 11
{0,1,1} | 111
{1,0,0} | 2
{1,0,1} | b
{1,1,0} | 22
{1,1,1} | 222
{2,0,0} | 3
{2,0,1} | c
{2,1,0} | 33
{2,1,1} | 333
{3,0,0} | 4
{3,0,1} | d
{3,1,0} | 44
{3,1,1} | 444
Formally, this seems to prove the concept, but I wonder what real practical use one could make of it :)
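As a plain-Python analogue (not Presto or PostgreSQL), the same index extraction over an arbitrarily nested array can be written recursively, which may make the idea easier to follow (the data here is a trimmed version of the answer's sample):

```python
def extract_indices(arr, prefix=()):
    """Yield (index_tuple, element) pairs for an arbitrarily nested
    rectangular list, with zero-based indices -- the Python counterpart
    of the recursive CTE's idx/elem output."""
    for i, item in enumerate(arr):
        if isinstance(item, list):
            yield from extract_indices(item, prefix + (i,))
        else:
            yield prefix + (i,), item

data = [[['1', 'a'], ['11', '111']], [['2', 'b'], ['22', '222']]]
for idx, elem in extract_indices(data):
    print(idx, elem)  # e.g. (0, 0, 0) 1 ... (1, 1, 1) 222
```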