WITH t(x,y) AS (
VALUES
(1,'[2]'),
(2,'[1, 2]'),
(3,'[2, 1]'),
(4,'[3, 2, 5]'),
(5,'[3, 2, 5, 2, 4]'),
(6,'[3, 2, 2, 0, 4]')
)
--- my wrong answer below
SELECT
REGEXP_EXTRACT(y, '(\d+,\s)?(2)(,\s\d+)?') AS _1st,
REGEXP_EXTRACT(y,'(.*?(2)){1}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _2nd,
REGEXP_EXTRACT(y,'(.*?(2)){2}.*?(\d+,\s(2)(,\s\d+)?)',3) AS _3rd
FROM t
Expected ans:
| x | y | 1st | 2nd | nth |
| - | --------------- | ------- | ------- | ------- |
| 1 | [2] | 2 | | |
| 2 | [1, 2] | 1, 2 | | |
| 3 | [2, 1] | 2, 1 | | |
| 4 | [3, 2, 5] | 3, 2, 5 | | |
| 5 | [3, 2, 5, 2, 4] | 3, 2, 5 | 5, 2, 4 | |
| 6 | [3, 2, 2, 0, 4] | 3, 2, 2 | 2, 2, 0 | |
I need help with the regex for Presto's REGEXP_EXTRACT function to get the nth occurrence of the number 2, including the numbers immediately before and after it (if any).
Additional info:
The numbers in column y are not necessarily single digits.
The order of the numbers is important.
1st, 2nd, 3rd refer to the nth occurrence of the number that I am seeking.
I will be looking for a list of numbers, not just 2. I am using 2 for illustration purposes.
Must it be a regular-expression?
If you treat the text (VARCHAR) [1,2,3] as an array representation (JSON or the internal Array data type), you have more functions available to solve your task.
See related functions supported by Presto:
JSON functions
Array function
I would recommend parsing it as JSON and casting the result to an array of integers (a VARCHAR cannot be cast to an array directly): CAST(json_parse('[1,23,456]') AS ARRAY(INTEGER))
Finding the n-th occurrence
From Array functions, array_position(x, element, instance) → bigint to find the n-th occurrence:
If instance > 0, returns the position of the instance-th occurrence of the element in array x.
If instance < 0, returns the position of the instance-to-last occurrence of the element in array x.
If no matching element instance is found, 0 is returned.
Example:
SELECT CAST(json_parse('[1,2,23,2,456]') AS ARRAY(INTEGER));
SELECT array_position(CAST(json_parse('[1,2,23,2,456]') AS ARRAY(INTEGER)), 2, 1); -- found in position 2
Now use the found position to build your slice relative to it.
Slicing and extracting sub-arrays
either parse it as JSON to a JSON-array. Then use a JSON-path to slice (extract a sub-array) as desired: Array slice operator in JSON-path: [start, stop, step]
or cast it as Array and then use slice(x, start, length) → array
Subsets array x starting from index start (or starting from the end if start is negative) with a length of length.
Examples:
SELECT json_extract(json_parse('[1,2,3]'), '$[-2, -1]'); -- the last two elements
SELECT slice(CAST(json_parse('[1,23,456]') AS ARRAY(INTEGER)), -2, 2); -- [23, 456]
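To make the position-then-slice idea concrete, here is a minimal sketch in Python rather than Presto SQL. The helper names `array_position` and `window_around` are hypothetical; the first only mimics the 1-based semantics described above, and the second clamps the window at the array boundaries.

```python
def array_position(xs, element, instance=1):
    """Mimic the 1-based array_position(x, element, instance); 0 if not found."""
    positions = [i + 1 for i, v in enumerate(xs) if v == element]
    if instance > 0 and instance <= len(positions):
        return positions[instance - 1]
    if instance < 0 and -instance <= len(positions):
        return positions[instance]
    return 0


def window_around(xs, element, instance):
    """Return the instance-th occurrence of element with its neighbours, or None."""
    pos = array_position(xs, element, instance)
    if pos == 0:
        return None
    start = max(pos - 1, 1)        # 1-based index of the left neighbour, if any
    end = min(pos + 1, len(xs))    # 1-based index of the right neighbour, if any
    return xs[start - 1:end]


rows = [[2], [1, 2], [2, 1], [3, 2, 5], [3, 2, 5, 2, 4], [3, 2, 2, 0, 4]]
for y in rows:
    print(y, window_around(y, 2, 1), window_around(y, 2, 2))
```

In Presto the same two steps would map to array_position for the position and slice for the window, with the start and length clamped in the same way.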
I am new to spark and I am trying to do the following, using Pyspark:
I have a dataframe with 3 columns, "id", "number1", "number2".
For each value of "id" I have multiple rows and what I want to do is create a list of tuples with all the rows that correspond to each id.
Eg, for the following dataframe
id | number1 | number2 |
a | 1 | 1 |
a | 2 | 2 |
b | 3 | 3 |
b | 4 | 4 |
the desired outcome would be 2 lists as such:
[(1, 1), (2, 2)]
and
[(3, 3), (4, 4)]
I'm not sure how to approach this, since I'm a newbie. I have managed to get a list of the distinct ids doing the following
distinct_ids = [x for x in df.select('id').distinct().collect()]
In pandas that I'm more familiar with, now I would loop through the dataframe for each distinct id and gather all the rows for it, but I'm sure this is far from optimal.
Can you give me any ideas? Groupby comes to mind but I'm not sure how to approach
You can use groupby and aggregate using collect_list and array:
import pyspark.sql.functions as F
df2 = df.groupBy('id').agg(F.collect_list(F.array('number1', 'number2')).alias('number'))
df2.show()
+---+----------------+
| id| number|
+---+----------------+
| b|[[3, 3], [4, 4]]|
| a|[[1, 1], [2, 2]]|
+---+----------------+
And if you want to get back a list of tuples,
result = [[tuple(j) for j in i] for i in [r[0] for r in df2.select('number').orderBy('number').collect()]]
which gives result as [[(1, 1), (2, 2)], [(3, 3), (4, 4)]]
If you want a numpy array, you can do
import numpy as np
result = np.array([r[0] for r in df2.select('number').collect()])
which gives
array([[[3, 3],
[4, 4]],
[[1, 1],
[2, 2]]])
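For readers who want to see the shape of the grouped result without a Spark session, here is a plain-Python sketch of the same grouping. It is only an illustration of what groupBy plus collect_list produces, not a replacement for the Spark code; the `rows` list is a hand-written copy of the example dataframe.

```python
from collections import defaultdict

# Hand-written copy of the example dataframe: (id, number1, number2)
rows = [("a", 1, 1), ("a", 2, 2), ("b", 3, 3), ("b", 4, 4)]

# Collect the (number1, number2) pairs of each id, like groupBy + collect_list
groups = defaultdict(list)
for id_, number1, number2 in rows:
    groups[id_].append((number1, number2))

# One list of tuples per id, ordered by id
result = [groups[key] for key in sorted(groups)]
print(result)
```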
I am trying to populate the column num_crimes. Since the zipcode repeats in the houses data frame, I just want to add the number of crimes related to that zipcode from the dictionary containing all the crimes per zipcode.
the houses dataframe contains 5000 entries, and the dictionary contains only 67, so I cannot just merge them.
This is the houses dataframe:
sold_price | zipcode | fireplaces | num_crimes
5300000 | 85637 | 6 | NaN
4200000 | 85646 | 5 | NaN
4200000 | 85646 | 5 | NaN
4500000 | 85646 | 6 | NaN
3411450 | 85750 | 4 | NaN
and this is the dictionary:
{85141: 1,85601: 2, 85607: 1, 85614: 4, 85622: 2, 85629: 4, 85634: 1....}
Problem: this is the code I used for that, but it is not changing the values in num_crimes:
def populate(df1):
for row, rows in df1.iterrows():
if rows[1] in my_dict:
rows[3]=my_dict[rows[1]]
else:
rows[3]=0
You can just do something like:
df["num_crimes"] = df["zipcode"].apply(lambda z: my_dict[z])
If df contains zipcodes that are not in my_dict, you need to handle that as well:
df["num_crimes"] = df["zipcode"].apply(lambda z: my_dict[z] if z in my_dict else -1)
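The lookup with a default can also be written with dict.get. The sketch below uses plain Python lists so that it runs without pandas; the `zipcodes` list is a hypothetical stand-in for the zipcode column, and the default of 0 follows the question's own code rather than the -1 used above.

```python
# Truncated sample of the crimes-per-zipcode dictionary from the question
my_dict = {85141: 1, 85601: 2, 85607: 1, 85614: 4}

# Hypothetical zipcode column values; 85637 is not in the dictionary
zipcodes = [85637, 85614, 85614, 85601]

# dict.get returns the default (0 here) for zipcodes missing from the dictionary
num_crimes = [my_dict.get(z, 0) for z in zipcodes]
print(num_crimes)
```

With pandas, the equivalent one-liner would likely be df["zipcode"].map(my_dict).fillna(0), since map leaves NaN for keys that are missing from the dictionary.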
It's a lot easier to answer your questions if you post your data as text rather than images. Anyways, you could make the dict into a dataframe and then join it with the original dataframe. So something like this:
houses.set_index("zipcode").join(pd.DataFrame.from_dict(my_dict, orient='index', columns = ["Crimes from dict"]))
Would that work?
For CodeChef problem C_HOLIC2, I tried iterating over elements: 5, 10, 15, 20, 25,... and for each number checking the number of trailing zeros using the efficient technique as specified over here, but got TLE.
What is the fastest way to solve this using formula method?
Here is the Problem Link
As we know, the trick for counting the number of trailing zeros in the factorial of a number (here 500!) is:
The number of multiples of 5 that are less than or equal to 500 is 500÷5=100
Then, the number of multiples of 25 is 500÷25=20
Then, the number of multiples of 125 is 500÷125=4
The next power of 5 is 625, which is greater than 500.
Therefore, the number of trailing zeros of 500! is 100+20+4=124
For detailed explanation check this page
Thus, for a number N this count can be represented as:
count(N) = floor(N/5) + floor(N/5^2) + floor(N/5^3) + ...
Using this trick, given a number N you can determine the number of trailing zeros in its factorial. Codechef Problem Link
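The counting trick above can be sketched as a short Python function. The name `trailing_zeros` is an illustrative choice; the function simply sums the integer quotients of N by successive powers of 5.

```python
def trailing_zeros(n):
    """Number of trailing zeros of n!, summing n // 5, n // 25, n // 125, ..."""
    count = 0
    power = 5
    while power <= n:
        count += n // power
        power *= 5
    return count


print(trailing_zeros(500))  # 100 + 20 + 4
```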
Now, suppose we are given the number of trailing zeros, count, and we are asked for the smallest number N whose factorial has count trailing zeros. Codechef Problem Link
Here the question is: how can we split count into this representation?
This is difficult because, as the following examples show, the count does not grow evenly.
The count jumps even though N increases by the same amount each time.
As you can see from the following table, count jumps at values of N that are multiples of higher powers of 5, e.g. 25, 50, ..., 125, ...
+-------+-----+
| count | N |
+-------+-----+
| 1 | 5 |
+-------+-----+
| 2 | 10 |
+-------+-----+
| 3 | 15 |
+-------+-----+
| 4 | 20 |
+-------+-----+
| 6 | 25 |
+-------+-----+
| 7 | 30 |
+-------+-----+
| 8 | 35 |
+-------+-----+
| 9 | 40 |
+-------+-----+
| 10 | 45 |
+-------+-----+
| 12 | 50 |
+-------+-----+
| ... | ... |
+-------+-----+
| 28 | 120 |
+-------+-----+
| 31 | 125 |
+-------+-----+
| 32 | 130 |
+-------+-----+
| ... | ... |
+-------+-----+
You can see from any brute-force program for this task that these jumps occur at regular intervals, i.e. at counts 6, 12, 18, 24 for numbers whose factorials have a factor of 25 (interval = 6 = 1×5+1).
From count=31 (N=125) onward, factorials also have a factor of 125. The jumps corresponding to 25 still occur with the same frequency, i.e. at 31, 37, 43, ...
The next jump corresponding to 125 is at 31+31 = 62. Thus the jumps corresponding to 125 occur at 31, 62, 93, 124 (interval = 31 = 6×5+1).
The jump corresponding to 625 occurs at 31×5+1 = 155+1 = 156.
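A brute-force program of the kind mentioned above might look like the following sketch. The function names are illustrative; `jump_counts` walks the multiples of 5 and records every count that is reached by a step larger than 1.

```python
def trailing_zeros(n):
    """Number of trailing zeros of n!."""
    count = 0
    power = 5
    while power <= n:
        count += n // power
        power *= 5
    return count


def jump_counts(limit):
    """Counts reached at multiples of 5 where the count rises by more than 1."""
    jumps = []
    previous = 0
    for n in range(5, limit + 1, 5):
        current = trailing_zeros(n)
        if current - previous > 1:
            jumps.append(current)
        previous = current
    return jumps


print(jump_counts(250))
```

For N up to 250 this prints the counts 6, 12, 18, 24, 31, 37, 43, 49, 55, 62, which matches the intervals described above.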
Thus you can see there exists a pattern. We need to find the formula for this pattern to proceed.
The series formed is 1, 6, 31, 156, ...
which is 1, 1+5, 1+5+5^2, 1+5+5^2+5^3, ...
Thus, the nth term t_n is the sum of n terms of a G.P. with a = 1, r = 5, i.e. t_n = (5^n - 1)/4
Thus, the count can be something like 31+31+6+1+1, etc.
We need to find the t_n that is closest to count without exceeding it, i.e. n = floor(log_5(4×count + 1)).
Say count=35; using this we find that t_n=31 is closest. For count=63 the formula again gives t_n=31 as the closest, but note that here 31 can be subtracted twice from count=63. We keep finding this n and subtracting t_n from count until count becomes 0.
The algorithm used is:
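from math import floor, log

count = int(input())
N = 0
while count != 0:
    # floating-point log may be off by one near exact powers of 5 for large count
    n = floor(log(4 * count + 1, 5))
    base_sum = (5**n - 1) // 4                     # t_n, always an integer
    base_offset = (5**n) * (count // base_sum)     # integer division
    count = count % base_sum
    N += base_offset
print(N)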
Here, 5**n is 5^n (5 raised to the power n)
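As a cross-check, here is a sketch of the same loop wrapped in a function that finds n with integer arithmetic instead of a floating-point logarithm (that substitution is mine, not part of the algorithm above). The names `smallest_n` and `trailing_zeros` are illustrative, and `trailing_zeros` is included only to verify the results.

```python
def trailing_zeros(n):
    """Number of trailing zeros of n!."""
    count = 0
    power = 5
    while power <= n:
        count += n // power
        power *= 5
    return count


def smallest_n(count):
    """Repeatedly subtract the largest t_n = (5**n - 1) // 4 that fits into count."""
    total = 0
    while count != 0:
        # Largest n with t_n <= count, found without floating point
        n = 1
        while (5 ** (n + 1) - 1) // 4 <= count:
            n += 1
        base_sum = (5 ** n - 1) // 4
        total += 5 ** n * (count // base_sum)
        count %= base_sum
    return total


for c in (70, 124):
    n_found = smallest_n(c)
    print(c, n_found, trailing_zeros(n_found))
```

Note that some counts are never attained by any factorial: for count=5 this sketch returns 25, and 25! has 6 trailing zeros.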
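Let's try working this out for an example:
Say count = 70,
Iteration 1: n = floor(log_5(4×70+1)) = floor(log_5(281)) = 3, baseSum = (5^3 - 1)/4 = 31, baseOffset = 5^3 × (70÷31) = 125×2 = 250, count = 70 mod 31 = 8, N = 250
Iteration 2: n = floor(log_5(33)) = 2, baseSum = (5^2 - 1)/4 = 6, baseOffset = 5^2 × (8÷6) = 25×1 = 25, count = 8 mod 6 = 2, N = 275
Iteration 3: n = floor(log_5(9)) = 1, baseSum = (5^1 - 1)/4 = 1, baseOffset = 5 × (2÷1) = 10, count = 2 mod 1 = 0, N = 285
Take another example. Say count=124, which is the one discussed at the beginning of this answer:
Iteration 1: n = floor(log_5(497)) = 3, baseSum = 31, baseOffset = 5^3 × (124÷31) = 125×4 = 500, count = 124 mod 31 = 0, N = 500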
Given: {{1,"a"},{2,"b"},{3,"c"}}
Desired:
foo | bar
-----+------
1 | a
2 | b
3 | c
You can get the intended result with the following query; however, it'd be better to have something that scales with the size of the array.
SELECT arr[subscript][1] as foo, arr[subscript][2] as bar
FROM ( select generate_subscripts(arr,1) as subscript, arr
from (select '{{1,"a"},{2,"b"},{3,"c"}}'::text[][] as arr) input
) sub;
This works:
select key as foo, value as bar
from json_each_text(
json_object('{{1,"a"},{2,"b"},{3,"c"}}')
);
Result:
foo | bar
-----+------
1 | a
2 | b
3 | c
I am not sure what exactly you mean by "it'd be better to have something that scales with the size of the array". Of course you cannot have extra columns added to the result set as the inner array size grows, because PostgreSQL must know the exact columns of a query before its execution (so before it begins to read the string).
But I would like to propose converting the string into normal relational representation of matrix:
select i, j, arr[i][j] a_i_j from (
select i, generate_subscripts(arr,2) as j, arr from (
select generate_subscripts(arr,1) as i, arr
from (select ('{{1,"a",11},{2,"b",22},{3,"c",33},{4,"d",44}}'::text[][]) arr) input
) sub_i
) sub_j
Which gives:
i | j | a_i_j
--+---+------
1 | 1 | 1
1 | 2 | a
1 | 3 | 11
2 | 1 | 2
2 | 2 | b
2 | 3 | 22
3 | 1 | 3
3 | 2 | c
3 | 3 | 33
4 | 1 | 4
4 | 2 | d
4 | 3 | 44
Such a result may be quite usable in further data processing, I think.
Of course, such a query can handle only arrays with a predefined number of dimensions, but the array sizes in all of its dimensions can change without rewriting the query, so this is a somewhat more flexible approach.
ADDITION: Yes, using WITH RECURSIVE one can build a similar query capable of handling arrays with an arbitrary number of dimensions. Nonetheless, there is no way to overcome the limitation coming from the relational data model: the exact set of columns must be defined at query parse time, and there is no way to delay this until execution time. So we are forced to store all indices in one column, using another array.
Here is the query that extracts all elements from arbitrary multi-dimensional array along with their zero-based indices (stored in another one-dimensional array):
with recursive extract_index(k,idx,elem,arr,n) as (
select (row_number() over())-1 k, idx, elem, arr, n from (
select array[]::bigint[] idx, unnest(arr) elem, arr, array_ndims(arr) n
from ( select '{{{1,"a"},{11,111}},{{2,"b"},{22,222}},{{3,"c"},{33,333}},{{4,"d"},{44,444}}}'::text[] arr ) input
) plain_indexed
union all
select k/array_length(arr,n)::bigint k, array_prepend(k%array_length(arr,n),idx) idx, elem, arr, n-1 n
from extract_index
where n!=1
)
select array_prepend(k,idx) idx, elem from extract_index where n=1
Which gives:
idx | elem
--------+-----
{0,0,0} | 1
{0,0,1} | a
{0,1,0} | 11
{0,1,1} | 111
{1,0,0} | 2
{1,0,1} | b
{1,1,0} | 22
{1,1,1} | 222
{2,0,0} | 3
{2,0,1} | c
{2,1,0} | 33
{2,1,1} | 333
{3,0,0} | 4
{3,0,1} | d
{3,1,0} | 44
{3,1,1} | 444
Formally, this seems to prove the concept, but I wonder what a real practical use one could make out of it :)
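The index arithmetic behind the recursive query can be sketched outside SQL as well. The Python function below (the name `unravel` is hypothetical) converts the flat, zero-based position k produced by unnest into a multi-dimensional index by repeatedly dividing by the length of the last remaining dimension, which is the same step the recursive part of the query performs once per level.

```python
def unravel(k, shape):
    """Convert a flat zero-based position into a zero-based multi-dimensional index."""
    idx = []
    for length in reversed(shape):
        # The remainder is the index in this dimension; the quotient moves one level up
        k, remainder = divmod(k, length)
        idx.insert(0, remainder)
    return idx


shape = (4, 2, 2)  # dimensions of the example array
for k in range(6):
    print(k, unravel(k, shape))
```

For the example array, position 5 maps to the index {1,0,1}, which is the row holding b in the output above.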