Simplify Array Query with Range - sql

I have a BigQuery table with 512 variables stored as arrays with quite long names (x__x_arrVal_arrSlices_0__arrValues to arrSlices_511). Each array contains 360 values. The BI tool cannot process an array in this form, which is why I want each value as a separate output column.
The query excerpt I use right now is:
SELECT
timestamp, x_stArrayTag_sArrayName, x_stArrayTag_sComission,
1 as row,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(1)] AS f001,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(10)] AS f010,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(20)] AS f020,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(30)] AS f030,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(40)] AS f040,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(50)] AS f050,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(60)] AS f060,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(70)] AS f070,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(80)] AS f080,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(90)] AS f090,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(100)] AS f100,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(110)] AS f110,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(120)] AS f120,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(130)] AS f130,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(140)] AS f140,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(150)] AS f150,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(160)] AS f160,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(170)] AS f170,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(180)] AS f180,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(190)] AS f190,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(200)] AS f200,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(210)] AS f210,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(220)] AS f220,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(230)] AS f230,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(240)] AS f240,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(250)] AS f250,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(260)] AS f260,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(270)] AS f270,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(280)] AS f280,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(290)] AS f290,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(300)] AS f300,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(310)] AS f310,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(320)] AS f320,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(330)] AS f330,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(340)] AS f340,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(350)] AS f350,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(359)] AS f359
FROM
`project.table`
WHERE
_PARTITIONTIME >= "2017-01-01 00:00:00"
AND _PARTITIONTIME < "2018-02-16 00:00:00"
UNION ALL
Unfortunately, the output I get is only a fraction of all values. Getting all 512 × 360 values with this query is not possible, because if I used it for all slices I would hit BigQuery's query limit.
Is there a possibility to rename the long names and to select a range?
best regards
scotti

You can get 360 rows and 512 columns by using UNNEST. Here is a small example:
WITH data AS (
SELECT
[1, 2, 3, 4] as a,
[2, 3, 4, 5] as b,
[3, 4, 5, 6] as c
)
SELECT v1, b[OFFSET(off)] as v2, c[OFFSET(off)] as v3
FROM data, unnest(a) as v1 WITH OFFSET off
Output:
v1 v2 v3
1 2 3
2 3 4
3 4 5
4 5 6

Having in mind the somewhat messy table you are dealing with: when deciding on restructuring, the important aspect is the practicality of the query that implements that decision.
In your specific case I would recommend fully flattening the data as below. Each row is transformed into ~184,000 rows (512 arrays × 360 elements), each representing one element of one of the arrays in the original row; the slice field gives the array number and pos gives the element's position within that array. The query is generic enough to handle any number/names of slices and any array sizes, and at the same time the result is flexible and generic enough to be used in any imaginable algorithm.
#standardSQL
SELECT
id,
slice,
pos,
value
FROM `project.dataset.messytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]')) slice WITH OFFSET x
JOIN UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]')) arr WITH OFFSET y
ON x = y,
UNNEST(SPLIT(arr)) value WITH OFFSET pos
You can test/play with it using the dummy example below:
#standardSQL
WITH `project.dataset.messytable` AS (
SELECT 1 id,
[ 1, 2, 3, 4, 5] x__x_arrVal_arrSlices_0,
[11, 12, 13, 14, 15] x__x_arrVal_arrSlices_1,
[21, 22, 23, 24, 25] x__x_arrVal_arrSlices_2 UNION ALL
SELECT 2 id,
[ 6, 7, 8, 9, 10] x__x_arrVal_arrSlices_0,
[16, 17, 18, 19, 20] x__x_arrVal_arrSlices_1,
[26, 27, 28, 29, 30] x__x_arrVal_arrSlices_2
)
SELECT
id,
slice,
pos,
value
FROM `project.dataset.messytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]')) slice WITH OFFSET x
JOIN UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]')) arr WITH OFFSET y
ON x = y,
UNNEST(SPLIT(arr)) value WITH OFFSET pos
The result is as below:
Row id slice pos value
1 1 0 0 1
2 1 0 1 2
3 1 0 2 3
4 1 0 3 4
5 1 0 4 5
6 1 1 0 11
7 1 1 1 12
8 1 1 2 13
9 1 1 3 14
10 1 1 4 15
11 1 2 0 21
12 1 2 1 22
13 1 2 2 23
14 1 2 3 24
15 1 2 4 25
16 2 0 0 6
17 2 0 1 7
18 2 0 2 8
19 2 0 3 9
20 2 0 4 10
21 2 1 0 16
22 2 1 1 17
23 2 1 2 18
24 2 1 3 19
25 2 1 4 20
26 2 2 0 26
27 2 2 1 27
28 2 2 2 28
29 2 2 3 29
30 2 2 4 30
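Outside BigQuery, the two REGEXP_EXTRACT_ALL patterns are easy to sanity-check. Below is a minimal Python sketch of the same extraction, using a hand-built compact JSON string as a stand-in for TO_JSON_STRING output (the row literal is hypothetical, shaped like the dummy example above):

```python
import json
import re

# One row serialized the way TO_JSON_STRING emits it (compact, no spaces);
# a hypothetical stand-in matching the dummy example above.
row = json.dumps({
    "id": 1,
    "x__x_arrVal_arrSlices_0": [1, 2, 3, 4, 5],
    "x__x_arrVal_arrSlices_1": [11, 12, 13, 14, 15],
}, separators=(",", ":"))

# The same two patterns as in the query: one captures the slice number,
# the other captures the comma-separated array body.
slices = re.findall(r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]', row)
arrays = re.findall(r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]', row)

# Zip them positionally (the JOIN ... ON x = y), then split each body
# (UNNEST(SPLIT(arr)) ... WITH OFFSET pos).
flat = [(int(s), pos, val)
        for s, body in zip(slices, arrays)
        for pos, val in enumerate(body.split(","))]
```

Because both patterns scan the same serialized row left to right, the two result arrays stay aligned, which is exactly what the positional JOIN on the offsets relies on.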

Related

pandas: compare two columns in df, return combined value ranges across rows

I have a large df that looks like:
test = pd.DataFrame({'start': [1, 1, 2, 8, 2000],
                     'end': [5, 3, 6, 9, 3000]})
start end
0 1 5
1 1 3
2 2 6
3 8 9
4 2000 3000
I want to compare all rows of test and get the combined value ranges:
desired output:
start end
0 1 6
1 8 9
2 2000 3000
I know I can compare within a row, e.g.
test['start'] < test['end']
I'm just not sure of the best/fastest way to compare and combine over millions of rows.
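A minimal sketch of one standard technique for this: sort by start, track the running maximum end with cummax, and open a new group whenever an interval starts past that maximum. This is vectorized, so it scales to millions of rows:

```python
import pandas as pd

test = pd.DataFrame({'start': [1, 1, 2, 8, 2000],
                     'end': [5, 3, 6, 9, 3000]})

# Sort by start; an interval begins a new group when it starts after the
# running maximum end (cummax) of everything seen so far.
t = test.sort_values('start')
new_group = t['start'] > t['end'].cummax().shift()
merged = (t.groupby(new_group.cumsum())
           .agg(start=('start', 'min'), end=('end', 'max'))
           .reset_index(drop=True))
# merged:
#    start   end
# 0      1     6
# 1      8     9
# 2   2000  3000
```

Note the strict `>`: intervals that merely touch (one starts exactly where another ends) are merged into the same group; switch to `>=` if touching ranges should stay separate.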

To prepare a dataframe with elements being repeated from a list in python

I have a list as primary = ['A' , 'B' , 'C' , 'D']
and a DataFrame as
df2 = pd.DataFrame(data=dateRange, columns = ['Date'])
which contains one date column running from 01-July-2020 through 31-Dec-2020.
I created another column 'DayNum' containing the weekday number of each date; 01-July-2020 is a Wednesday, so its 'DayNum' is 2, and so on.
Now, using the list, I want to create another column 'Primary' in which the elements of the list repeat. You can think of it as a roster showing the person on duty on a weekly basis, where Monday is the start (day 0) and Sunday the end (day 6).
The output should be like this:
Date DayNum Primary
0 01-Jul-20 2 A
1 02-Jul-20 3 A
2 03-Jul-20 4 A
3 04-Jul-20 5 A
4 05-Jul-20 6 A
5 06-Jul-20 0 B
6 07-Jul-20 1 B
7 08-Jul-20 2 B
8 09-Jul-20 3 B
9 10-Jul-20 4 B
10 11-Jul-20 5 B
11 12-Jul-20 6 B
12 13-Jul-20 0 C
13 14-Jul-20 1 C
14 15-Jul-20 2 C
15 16-Jul-20 3 C
16 17-Jul-20 4 C
17 18-Jul-20 5 C
18 19-Jul-20 6 C
19 20-Jul-20 0 D
20 21-Jul-20 1 D
21 22-Jul-20 2 D
22 23-Jul-20 3 D
23 24-Jul-20 4 D
24 25-Jul-20 5 D
25 26-Jul-20 6 D
26 27-Jul-20 0 A
27 28-Jul-20 1 A
28 29-Jul-20 2 A
29 30-Jul-20 3 A
30 31-Jul-20 4 A
First compare the column with 0 using Series.eq and take the cumulative sum with Series.cumsum to form a group for each week; then take the modulo with the number of values in the list using Series.mod, and finally map through a dictionary built by enumerate over the list, using Series.map:
primary = ['A','B','C','D']
d = dict(enumerate(primary))
df['Primary'] = df['DayNum'].eq(0).cumsum().mod(len(primary)).map(d)
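A self-contained version of the above that rebuilds the example frame from scratch (pd.date_range and dt.dayofweek are assumptions for how Date and DayNum were originally created):

```python
import pandas as pd

primary = ['A', 'B', 'C', 'D']

# Rebuild the example frame: dates 01-Jul-2020 .. 31-Dec-2020,
# DayNum = weekday with Monday = 0 (01-Jul-2020 is a Wednesday -> 2).
df = pd.DataFrame({'Date': pd.date_range('2020-07-01', '2020-12-31')})
df['DayNum'] = df['Date'].dt.dayofweek

# Each Monday (DayNum == 0) advances the roster by one person;
# mod wraps around the list, and the dict maps 0..3 -> A..D.
d = dict(enumerate(primary))
df['Primary'] = df['DayNum'].eq(0).cumsum().mod(len(primary)).map(d)
```

The first partial week (Wed 01-Jul to Sun 05-Jul) gets 'A' because no Monday has been seen yet, matching the expected output above.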

Paired difference of columns in dataframe to generate dataframe with 1.3 million columns

I have a dataframe with 1600 columns.
The dataframe df looks as follows, where the column names are 1, 3, 2:
Row Labels 1 3 2
41730Type1 9 6 5
41730Type2 14 12 20
41731Type1 2 15 5
41731Type2 3 20 12
41732Type1 8 10 5
41732Type2 8 18 16
I need to create the following dataframe df2 pythonically:
Row Labels (1, 2) (1, 3) (2, 3)
41730Type1 -4 -3 1
41730Type2 6 -2 -8
41731Type1 3 13 10
41731Type2 9 17 8
41732Type1 -3 2 5
41732Type2 8 10 2
where e.g. column (1, 2) is created by df[2] - df[1].
The column names for df2 are created by pairing the column headers of df such that the second element of each pair is greater than the first, e.g. (1, 2), (1, 3), (2, 3).
The second challenge: can a pandas DataFrame support 1.3 million columns?
We can take combinations of the columns, then create the dict and concat it back:
import itertools
l=itertools.combinations(df.columns,2)
d={'{0[0]}|{0[1]}'.format(x) : df[x[0]]-df[x[1]] for x in [*l] }
newdf=pd.concat(d,axis=1)
1|3 1|2 3|2
RowLabels
41730Type1 3 4 1
41730Type2 2 -6 -8
41731Type1 -13 -3 10
41731Type2 -17 -9 8
41732Type1 -2 3 5
41732Type2 -10 -8 2
itertools.combinations seems the obvious choice, same as @YOBEN_S; here is a different route to the solution, using numpy arrays and a dictionary:
from itertools import combinations
new_data = combinations(df.to_numpy().T,2)
new_cols = combinations(df.columns, 2)
result = {key : np.subtract(arr1,arr2)
if key[0] > key[1]
else np.subtract(arr2,arr1)
for (arr1, arr2), key
in zip(new_data,new_cols)}
outcome = pd.DataFrame.from_dict(result,orient='index').sort_index().T
outcome
(1, 2) (1, 3) (3, 2)
0 -4 -3 1
1 6 -2 -8
2 3 13 10
3 9 17 8
4 -3 2 5
5 8 10 2
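On the second challenge: 1600 columns give C(1600, 2) = 1,279,200 pairs, which is where the 1.3 million comes from. A DataFrame can technically hold that many columns, but it will be slow and memory-hungry; a long format is usually preferable. Below is a sketch (on a two-row slice of the example data) that also matches the sign convention the question asked for, by sorting the columns first so each pair (a, b) has a < b and taking b - a:

```python
import itertools
from math import comb

import pandas as pd

# 1600 source columns -> C(1600, 2) pairwise difference columns.
assert comb(1600, 2) == 1_279_200

df = pd.DataFrame({1: [9, 14], 3: [6, 12], 2: [5, 20]},
                  index=['41730Type1', '41730Type2'])

# Sort the columns so each pair (a, b) has a < b, then take b - a,
# matching the desired (1, 2) = df[2] - df[1] convention.
pairs = itertools.combinations(sorted(df.columns), 2)
df2 = pd.concat({f'({a}, {b})': df[b] - df[a] for a, b in pairs}, axis=1)
# df2:
#             (1, 2)  (1, 3)  (2, 3)
# 41730Type1      -4      -3       1
# 41730Type2       6      -2      -8
```

The two answers above use other orderings (first minus second, or ordering by key comparison), which is why their signs differ from the desired df2.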

'Series' objects are mutable, thus they cannot be hashed trying to sum columns and datatype is float

I am trying to sum all values in a range of columns, from the third to the last of several thousand columns, using:
day3prep['D3counts'] = day3prep.sum(day3prep.iloc[:, 2:].sum(axis=1))
dataframe is formated as:
ID G1 Z1 Z2 ...ZN
0 50 13 12 ...62
1 51 62 23 ...19
dataframe with summed column:
ID G1 Z1 Z2 ...ZN D3counts
0 50 13 12 ...62 sum(Z1:ZN in row 0)
1 51 62 23 ...19 sum(Z1:ZN in row 1)
I've changed the NaNs to 0's. The datatype is float but I am getting the error:
'Series' objects are mutable, thus they cannot be hashed
You only need the inner part. The outer call passed a Series as the first argument of DataFrame.sum, which pandas interprets as the axis parameter and tries to hash, hence the error:
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
With some random numbers:
import pandas as pd
import random
random.seed(42)
day3prep = pd.DataFrame({'ID': random.sample(range(10), 5), 'G1': random.sample(range(10), 5),
'Z1': random.sample(range(10), 5), 'Z2': random.sample(range(10), 5), 'Z3': random.sample(range(10), 5)})
day3prep['D3counts'] = day3prep.iloc[:, 2:].sum(axis=1)
Output:
> day3prep
ID G1 Z1 Z2 Z3 D3counts
0 1 2 0 8 8 16
1 0 1 9 0 6 15
2 4 8 1 3 3 7
3 9 4 7 5 7 19
4 6 3 6 6 4 16

How to create a partially filled column in pandas

I have a df_trg with, say 10 rows numbered 0-9.
I get from various sources values for an additional column foo which contains only a subset of rows, e.g. S1 has 0-3, 7, 9 and S2 has 4, 6.
I would like to get a data frame with a single new column foo where some rows may remain NaN.
Is there a "nicer" way other than:
df_trg['foo'] = np.nan
for src in sources:
df_trg['foo'][df_trg.index.isin(src.index)] = src
for example, using join or merge?
Let's create the source DataFrame (df), s1 and s2 (Series objects with updating data) and a list of them (sources):
df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])
sources = [s1, s2]
Start the computation by adding the foo column, initially filled with an empty string:
df = df.assign(foo='')
Then run the following "updating" loop:
for src in sources:
df.foo.update(other=src)
The result is:
0 1 2 3 4 foo
0 1 11 21 31 41 11
1 2 12 22 32 42 12
2 3 13 23 33 43 13
3 4 14 24 34 44 14
4 5 15 25 35 45 27
5 6 16 26 36 46
6 7 17 27 37 47 28
7 8 18 28 38 48 15
8 9 19 29 39 49
9 10 20 30 40 50 16
In my opinion, this solution is (at least a little) nicer than yours and
shorter.
Alternative: fill the foo column initially with NaN; but then the updated values will be converted to float (a side effect of using NaN).
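The NaN alternative can also be written without the update loop at all; a sketch using pd.concat plus reindex, valid under the assumption that the source indices never overlap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 51).reshape((5, -1)).T)
s1 = pd.Series([11, 12, 13, 14, 15, 16], index=[0, 1, 2, 3, 7, 9])
s2 = pd.Series([27, 28], index=[4, 6])

# Stack all sources and align them to df's index; rows covered by no
# source become NaN, which also makes the column float.
df['foo'] = pd.concat([s1, s2]).reindex(df.index)
```

This is closer to the join/merge style the question asked about, at the cost of the float conversion described above.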