Count preceding rows that match criteria - SQL

I am working with time series data and I need to count the number of rows preceding the current row that match a condition. For example, I need to know how many months prior to a row's month, for the same customer, had sales (NETSALES > 0). Ideally I would maintain a row counter that resets when the condition fails (e.g. NETSALES = 0).
Another way of solving the problem would be to flag any row that had more than 12 prior periods with NETSALES.
The closest I came was using the following windowed count:
COUNT(*) OVER (PARTITION BY cust ORDER BY dt ROWS 12 PRECEDING) AS CtWindow
http://sqlfiddle.com/#!6/990eb/2
In the example above, 201310 is correctly flagged as 12 but ideally the previous row would have been 11.
The solution can be in R or T-SQL.
Updated with data.table example:
library(data.table)
set.seed(50)
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20)
The goal is to calculate a "run" column like below -- which gets reset to zero when the value is zero
NETSALES cust dt run
1: 36.956464 1 1 1
2: 83.767621 1 2 2
3: 28.585003 1 3 3
4: 10.250524 1 4 4
5: 6.537188 1 5 5
6: 0.000000 1 6 6
7: 95.489944 1 7 7
8: 46.351387 1 8 8
9: 0.000000 1 9 0
10: 0.000000 1 10 0
11: 99.621881 1 11 1
12: 76.755104 1 12 2
13: 64.288721 1 13 3
14: 0.000000 1 14 0
15: 36.504473 1 15 1
16: 43.157142 1 16 2
17: 71.808349 1 17 3
18: 53.039105 1 18 4
19: 0.000000 1 19 0
20: 27.387369 1 20 1
21: 58.308899 2 1 1
22: 65.929296 2 2 2
23: 20.529473 2 3 3
24: 58.970898 2 4 4
25: 13.785201 2 5 5
26: 4.796752 2 6 6
27: 72.758112 2 7 7
28: 7.088647 2 8 8
29: 14.516362 2 9 9
30: 94.470714 2 10 10
31: 51.254178 2 11 11
32: 99.544261 2 12 12
33: 66.475412 2 13 13
34: 8.362936 2 14 14
35: 96.742115 2 15 15
36: 15.677712 2 16 16
37: 0.000000 2 17 0
38: 95.684652 2 18 1
39: 65.639292 2 19 2
40: 95.721081 2 20 3
NETSALES cust dt run

This seems to do it:
library(data.table)
set.seed(50)
DT <- data.table(NETSALES=ifelse(runif(40)<.15,0,runif(40,1,100)), cust=rep(1:2, each=20), dt=1:20)
DT[,dir:=ifelse(NETSALES>0,1,0)]
dir.rle <- rle(DT$dir)
DT <- transform(DT, indexer = rep(1:length(dir.rle$lengths), dir.rle$lengths))
DT[,runl:=cumsum(dir),by=indexer]
credit to "Cumulative sums over run lengths. Can this loop be vectorized?"
Edit by Roland:
Here is the same with better performance and also considering different customers:
#no need for ifelse
DT[,dir:= NETSALES>0]
#use a function to avoid storing the rle, which could be huge
runseq <- function(x) {
  x.rle <- rle(x)
  rep(1:length(x.rle$lengths), x.rle$lengths)
}
#never use transform with data.table
DT[,indexer := runseq(dir)]
#include cust in by
DT[,runl:=cumsum(dir),by=list(indexer,cust)]
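Newer versions of data.table also ship rleid(), which can replace the runseq() helper entirely. A minimal sketch of the same logic (the over12 column name is just illustrative, addressing the "more than 12 prior periods" part of the question):
# rleid() assigns a new id every time NETSALES > 0 flips, so grouping by it
# together with cust reproduces the indexer/cust grouping above
DT[, run := cumsum(NETSALES > 0), by = .(cust, rleid(NETSALES > 0))]
# one way to flag long runs (e.g. more than 12 consecutive periods with sales)
DT[, over12 := run > 12]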
Edit: Joe added an SQL solution:
http://sqlfiddle.com/#!6/990eb/22
The SQL solution took 48 minutes across 22M rows on a machine with 128 GB of RAM; the R solution took about 20 seconds on a workstation with 4 GB of RAM. Go R!


What would be a strategy to generate a semilog sequence with just a few asm instructions?

I need to generate the following number sequences. The actual use case is to find the bin in a histogram with progressively larger bins.
The first column is just a log sequence such that the histogram's bins increase size in powers of two.
The next column (2) is a further subdivision of the first column such that each larger bin is subdivided in two sub-bins.
I have a reference discrete solution but I am looking to further reduce the number of cycles, perhaps using floating point tricks.
Index: (1) (2) (3) (4) (5) (6) (7)
1: 1 1 1 1 1 1 1
2: 2 2 2 2 2 2 2
3: 2 3 3 3 3 3 3
4: 3 4 4 4 4 4 4
5: 3 4 5 5 5 5 5
6: 3 5 6 6 6 6 6
7: 3 5 7 7 7 7 7
8: 4 6 8 8 8 8 8
9: 4 6 8 9 9 9 9
10: 4 6 9 10 10 10 10
11: 4 6 9 11 11 11 11
12: 4 7 10 12 12 12 12
13: 4 7 10 13 13 13 13
14: 4 7 11 14 14 14 14
15: 4 7 11 15 15 15 15
16: 5 8 12 16 16 16 16
17: 5 8 12 16 17 17 17
18: 5 8 12 17 18 18 18
19: 5 8 12 17 19 19 19
20: 5 8 13 18 20 20 20
21: 5 8 13 18 21 21 21
22: 5 8 13 19 22 22 22
23: 5 8 13 19 23 23 23
24: 5 9 14 20 24 24 24
25: 5 9 14 20 25 25 25
26: 5 9 14 21 26 26 26
27: 5 9 14 21 27 27 27
28: 5 9 15 22 28 28 28
29: 5 9 15 22 29 29 29
30: 5 9 15 23 30 30 30
31: 5 9 15 23 31 31 31
32: 6 10 16 24 32 32 32
33: 6 10 16 24 32 33 33
34: 6 10 16 24 33 34 34
35: 6 10 16 24 33 35 35
36: 6 10 16 25 34 36 36
37: 6 10 16 25 34 37 37
38: 6 10 16 25 35 38 38
39: 6 10 16 25 35 39 39
40: 6 10 17 26 36 40 40
41: 6 10 17 26 36 41 41
42: 6 10 17 26 37 42 42
43: 6 10 17 26 37 43 43
44: 6 10 17 27 38 44 44
45: 6 10 17 27 38 45 45
46: 6 10 17 27 39 46 46
47: 6 10 17 27 39 47 47
48: 6 11 18 28 40 48 48
49: 6 11 18 28 40 49 49
50: 6 11 18 28 41 50 50
51: 6 11 18 28 41 51 51
52: 6 11 18 29 42 52 52
53: 6 11 18 29 42 53 53
54: 6 11 18 29 43 54 54
55: 6 11 18 29 43 55 55
56: 6 11 19 30 44 56 56
57: 6 11 19 30 44 57 57
58: 6 11 19 30 45 58 58
59: 6 11 19 30 45 59 59
60: 6 11 19 31 46 60 60
61: 6 11 19 31 46 61 61
62: 6 11 19 31 47 62 62
63: 6 11 19 31 47 63 63
The columns show a semi-logarithmic mapping of values. This means that while the major switchover points proceed in logarithmic progression, intermediate points proceed in linear fashion. This is strongly reminiscent of IEEE-754 floating-point encodings, where the exponent field expresses the binary logarithm of the represented quantity, while the significand field provides a linear progression between powers of two. While use of IEEE-754 floating-point formats is extremely widespread, it is not universal, so this approach can only be considered semi-portable.
One idea for an efficient implementation on most modern processors, CPUs and GPU alike, is therefore to convert the input Index into an IEEE-754 binary32 number represented as float at C or C++ source code level. One then extracts the appropriate bits (consisting of the exponent bits and some leading bits of the significand) from the binary representation of that float number, where the number of included significand bits increases by one with each output column, i.e. the column number is a granularity factor of the mapping.
There are various ways of accomplishing the details of the process outlined above. The most straightforward implementation scales the smallest input into the lowest binade of the binary32 format, either with ldexpf() or by multiplying with an appropriate power of two produced via exp2f().
The biggest caveat with this approach is that the lowest binade (biased exponent of 0) contains subnormal numbers, which are not available on platforms operating in FTZ (flush-to-zero) mode. This may be the case for both x86 SIMD and NVIDIA GPUs, for example. Another caveat is that ldexpf() or exp2f() may not be implemented efficiently, that is, via (almost) direct support in hardware. Lastly, the accuracy of these functions may be insufficient. For example, with CUDA 11.8 I found that exp2f() with a negative integer argument does not always deliver the correct power-of-two result (specifically, exp2f(-127) is off by one ulp), making the variant using exp2f() fail.
Alternate approaches convert Index into a floating-point number without scaling, i.e. starting the mapping in a binade near unity. This raises the issue that for column j > 0, the first 2^j entries incorrectly have the logarithmic mapping applied to them. This can be solved by manually enforcing a linear mapping for these entries, so that the result equals Index for the first 2^j entries. The IEEE-754 exponent bias for the logarithmic portion of the computed values can be removed prior to or after bit-field extraction, with the preference depending on specifics of the instruction set architecture.
The four design variants described above are enumerated in the exemplary code below by the macro VARIANT which can take values 0 through 3. From some initial experiments it seems that when compiling with modern compilers at high optimization level (I tried gcc, clang, and icx) coding at the assembly level may not be necessary. On platforms without IEEE-754 floating-point arithmetic, a quick simulation of integer to floating-point conversion based on a CLZ (count leading zeros) instruction may be helpful.
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <cstring>
#include <cmath>

#define VARIANT (3)

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

// provide semi-logarithmic mapping of i to output based on granularity parameter j
int semilogmap (int i, int j)
{
    const int FP32_MANT_BITS = 23;
    const int FP32_EXPO_BIAS = 127;
#if VARIANT == 0 // this requires subnormal support and will break when using FTZ (flush to zero)!
    return float_as_uint32 ((float)i * exp2f (1 - FP32_EXPO_BIAS - j)) >> (FP32_MANT_BITS - j);
#elif VARIANT == 1 // this requires subnormal support and will break when using FTZ (flush to zero)!
    return float_as_uint32 (ldexpf ((float)i, 1 - FP32_EXPO_BIAS - j)) >> (FP32_MANT_BITS - j);
#elif VARIANT == 2
    return (i < (1 << j)) ? i :
           ((float_as_uint32 ((float)i) - ((FP32_EXPO_BIAS - 1 + j) << FP32_MANT_BITS)) >> (FP32_MANT_BITS - j));
#elif VARIANT == 3
    return (i < (1 << j)) ? i :
           ((float_as_uint32 ((float)i) >> (FP32_MANT_BITS - j)) - ((FP32_EXPO_BIAS - 1 + j) << j));
#else
#error unsupported VARIANT
#endif // VARIANT
}

int main (void)
{
    int col [64][7];
    for (int i = 1; i <= 63; i++) {
        for (int j = 0; j < 7; j++) {
            col[i][j] = semilogmap (i, j);
        }
    }
    for (int i = 1; i <= 63; i++) {
        printf ("%2d: ", i);
        for (int j = 0; j < 7; j++) {
            printf (" %2d", col[i][j]);
        }
        printf ("\n");
    }
    return EXIT_SUCCESS;
}
In terms of the number of instructions generated, it might be instructive to look at a CUDA version of variant 0 for execution on NVIDIA GPUs. I had to implement my own version of exp2f() to achieve the necessary accuracy.
/* compute 2**x with a maximum error of 2.055 ulp */
__forceinline__ __device__ float my_exp2f (float x)
{
    float r, t;
    t = x;
    if (x < -126.0f) t = t + 24.0f;
    asm ("ex2.approx.ftz.f32 %0,%1;" : "=f"(r) : "f"(t));
    if (x < -126.0f) r = r * 5.9604644775390625e-8f; // 0x1.0p-24
    return r;
}

/* semi-logarithmic mapping of i to output based on granularity parameter j */
__device__ int semilogmap (int i, int j)
{
    const int FP32_MANT_BITS = 23;
    const int FP32_EXPO_BIAS = 127;
    return __float_as_int ((float)i * my_exp2f (1 - FP32_EXPO_BIAS - j)) >> (FP32_MANT_BITS - j);
}
Compiled with nvcc -c -rdc=true -arch=sm_75 -o semilogmap.obj semilogmap.cu using the toolchain from CUDA 11.8, the following code (comprising 11 instructions, including the function return) is generated for semilogmap():
code for sm_75
Function : _Z10semilogmapii
.headerflags @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
/*0000*/ IADD3 R0, -R5.reuse, -0x7e, RZ ;
/*0010*/ I2F R4, R4 ;
/*0020*/ ISETP.GT.AND P0, PT, R5.reuse, RZ, PT ;
/*0030*/ IADD3 R6, -R5, 0x17, RZ ;
/*0040*/ I2F R3, R0 ;
/*0050*/ @P0 FADD R3, R3, 24 ;
/*0060*/ MUFU.EX2 R7, R3 ;
/*0070*/ @P0 FMUL R7, R7, 5.9604644775390625e-08 ;
/*0080*/ FMUL R7, R4, R7 ;
/*0090*/ SHF.R.S32.HI R4, RZ, R6, R7 ;
/*00a0*/ RET.ABS.NODEC R20 0x0 ;

Pandas transform rows with specific character

I am working on feature transformation and ran into this issue. Let me know what you think. Thanks!
I have a table like the one in the sample data below, and I want to create an output column like the one shown in the answers.
Some info:
All the outputs are based on the numbers that end with a ':' (the part before the ':' should be carried down to the rows that follow).
I have 100M+ rows in this table, so performance needs to be considered.
Let me know if you have some good ideas. Thanks!
Here is some copy and paste-able sample data:
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
Solution #1:
You can use .str.contains(':') with np.where() to identify the matching values and otherwise return np.nan. Then use ffill() to forward-fill the NaN values:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
df['Output'] = np.where(df['Number'].str.contains(':'),df['Number'].str.split(':').str[0],np.nan)
df['Output'] = df['Output'].ffill()
df
Solution #2 - even easier, and potentially better performing: some regex with str.extract() and then again ffill():
df['Output'] = df['Number'].str.extract('^(\d+):').ffill()
df
Out[1]:
Number Output
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7
I think this is what you are looking for:
import pandas as pd
c = ['Number']
d = ['1:00',100,1001,1321,3254,'15:00',20,60,80,90,'4:00',26,45,90,89]
df = pd.DataFrame(data=d,columns=c)
temp= df['Number'].str.split(":", n = 1, expand = True)
df['New_Val'] = temp[0].ffill()
print(df)
The output of this will be as follows:
Number New_Val
0 1:00 1
1 100 1
2 1001 1
3 1321 1
4 3254 1
5 15:00 15
6 20 15
7 60 15
8 80 15
9 90 15
10 4:00 4
11 26 4
12 45 4
13 90 4
14 89 4
It looks like your DataFrame has string values. In the example above I treated the column as a mix of numbers and strings.
Here's the solution if df['Number'] is all strings.
df1 = pd.DataFrame({'Number': {0: '1000',1: '1000021', 2: '15:00', 3: '23424234',
4: '23423', 5: '3', 6 : '9:00', 7: '3423', 8: '32', 9: '7:00'}})
temp= df1['Number'].str.split(":", n = 1, expand = True)
temp.loc[temp[1].astype(bool) != False, 'New_val'] = temp[0]
df1['New_val'] = temp['New_val'].ffill()
print (df1)
The output of df1 will be:
Number New_val
0 1000 NaN
1 1000021 NaN
2 15:00 15
3 23424234 15
4 23423 15
5 3 15
6 9:00 9
7 3423 9
8 32 9
9 7:00 7

How to find synchronous IDs based on timestamp and value

I am trying to find synchronous data entries, which share a certain value ("ref") over a certain number of timestamps.
Dummy Data:
library(data.table)
dft <- data.table(
id = rep(1:5, each=5),
time = rep(1:5, 5),
ref = c(10,11,11,11,11,
10,11,11,11,21,
20,31,31,31,31,
20,41,41,41,41,
20,51,51,51,51)
)
setorder(dft, time)
dft[, time := as.POSIXct(time, origin = "2018-10-14")]
dft
In that example, IDs 1 and 2 would be synchronous over 4 timestamps (rows 1, 2, 6, 7, 11, 12, 16, 17), as they share the same ref value (those rows are marked with ! below). NOTE: they share the same ref value within one timestamp and might share another ref value in the next timestamp.
How could I approach this problem? I would also like to define the number of timestamps over which the values have to be identical. If I required at least 5 synchronous timestamps, no IDs should be returned in this example; with 4 or fewer, IDs 1 & 2 should be reported as synchronous data entries.
I have to do this calculation over several million rows, so I would prefer a data.table or dplyr solution or any other performant approach (SQL would also be fine).
id time ref
1: 1 2018-10-14 02:00:01 10 !
2: 2 2018-10-14 02:00:01 10 !
3: 3 2018-10-14 02:00:01 20
4: 4 2018-10-14 02:00:01 20
5: 5 2018-10-14 02:00:01 20
6: 1 2018-10-14 02:00:02 11 !
7: 2 2018-10-14 02:00:02 11 !
8: 3 2018-10-14 02:00:02 31
9: 4 2018-10-14 02:00:02 41
10: 5 2018-10-14 02:00:02 51
11: 1 2018-10-14 02:00:03 11 !
12: 2 2018-10-14 02:00:03 11 !
13: 3 2018-10-14 02:00:03 31
14: 4 2018-10-14 02:00:03 41
15: 5 2018-10-14 02:00:03 51
16: 1 2018-10-14 02:00:04 11 !
17: 2 2018-10-14 02:00:04 11 !
18: 3 2018-10-14 02:00:04 31
19: 4 2018-10-14 02:00:04 41
20: 5 2018-10-14 02:00:04 51
21: 1 2018-10-14 02:00:05 11
22: 2 2018-10-14 02:00:05 21
23: 3 2018-10-14 02:00:05 31
24: 4 2018-10-14 02:00:05 41
25: 5 2018-10-14 02:00:05 51
Benchmarking both examples from @DavidArenburg:
library(microbenchmark)
mc = microbenchmark(times = 100,
res1 = dft[dft, .(id, id2 = x.id), on = .(id > id, time, ref), nomatch = 0L, allow.cartesian=TRUE][, .N, by = .(id, id2)],
res2= dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref), allow.cartesian=TRUE][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
)
mc
Unit: milliseconds
expr min lq mean median uq max neval cld
res1 156.8389 158.8122 165.1828 159.6931 165.9156 292.7987 100 a
res2 311.1658 324.5684 350.3006 331.4310 343.6755 815.8397 100 b
A possible data.table solution
dft[dft, .(id, id2 = x.id), # get the desired columns
on = .(id > id, time, ref), # the join condition
nomatch = 0L, # remove unmatched records (NAs)
allow.cartesian = TRUE # In case of a big join, allow Cartesian join
][, .N, by = .(id, id2)] # Count obs. per ids combinations
# id id2 N
# 1: 1 2 4
# 2: 3 4 1
# 3: 3 5 1
# 4: 4 5 1
Explanation
We do a self-join on time and ref while specifying id > id, so a row is never joined to its own id, and extract the joined ids (id and x.id, the ids from both data sets) while removing all unmatched rows (nomatch = 0L). Finally, we count the matched combinations (.N is a special symbol in data.table that holds the number of observations per group).
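The question also asks to report only IDs that stay synchronous for at least a given number of timestamps; with this approach that is just a filter on N. A small sketch, reusing the join above:
res <- dft[dft, .(id, id2 = x.id), on = .(id > id, time, ref),
           nomatch = 0L, allow.cartesian = TRUE][, .N, by = .(id, id2)]
res[N >= 4]  # IDs 1 and 2 qualify here; with N >= 5 no pair is returned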
Old (and a bit more involved) solution
dft[dft, .(pmin(id, i.id), pmax(id, i.id)), on = .(time, ref)
][V1 != V2, .(synced = .N / 2L), by = .(id1 = V1, id2 = V2)]
Translating @DavidArenburg's code to SQL gives me:
SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id
GROUP BY a.id, b.id
ORDER BY count(*) DESC;
And selecting only those with count > 1:
SELECT a.id as id, b.id as id2, count(*) FROM testdata a
INNER JOIN testdata b ON a.ref = b.ref AND a.timest = b.timest
WHERE a.id > b.id
GROUP BY a.id, b.id HAVING count(*) > 1
ORDER BY count(*) DESC;
Code to produce the SQL table from the resulting data.table (dft) in the question:
R:
fwrite(x = dft, file = "C:/testdata.csv", row.names = F)
SQL:
CREATE TABLE testdata (
id serial NOT NULL,
timest timestamp,
ref integer
);
COPY testdata(id, timest, ref)
FROM 'C:/testdata.csv' DELIMITER ',' CSV;

Simplify Array Query with Range

I have a BigQuery table with 512 variables stored as arrays with quite long names (x__x_arrVal_arrSlices_0__arrValues to arrSlices_511). Each array holds 360 values. The BI tool cannot work with arrays in this form, which is why I want each value as a separate output.
The query excerpt I use right now is:
SELECT
timestamp, x_stArrayTag_sArrayName, x_stArrayTag_sComission,
1 as row,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(1)] AS f001,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(10)] AS f010,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(20)] AS f020,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(30)] AS f030,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(40)] AS f040,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(50)] AS f050,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(60)] AS f060,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(70)] AS f070,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(80)] AS f080,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(90)] AS f090,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(100)] AS f100,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(110)] AS f110,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(120)] AS f120,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(130)] AS f130,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(140)] AS f140,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(150)] AS f150,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(160)] AS f160,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(170)] AS f170,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(180)] AS f180,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(190)] AS f190,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(200)] AS f200,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(210)] AS f210,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(220)] AS f220,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(230)] AS f230,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(240)] AS f240,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(250)] AS f250,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(260)] AS f260,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(270)] AS f270,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(280)] AS f280,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(290)] as f290,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(300)] AS f300,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(310)] AS f310,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(320)] AS f320,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(330)] AS f330,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(340)] AS f340,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(350)] AS f350,
x__x_arrVal_arrSlices_1__arrValues[OFFSET(359)] AS f359
FROM
`project.table`
WHERE
_PARTITIONTIME >= "2017-01-01 00:00:00"
AND _PARTITIONTIME < "2018-02-16 00:00:00"
UNION ALL
The output I get is unfortunately only a fraction of all values. Getting all 512*360 values with this query is not possible, because if I used this query for all slices I would reach BigQuery's limit.
Is there a possibility to rename the long names and to select a range?
best regards
scotti
You can get 360 rows and 512 columns by using UNNEST. Here is a small example:
WITH data AS (
SELECT
[1, 2, 3, 4] as a,
[2, 3, 4, 5] as b,
[3, 4, 5, 6] as c
)
SELECT v1, b[OFFSET(off)] as v2, c[OFFSET(off)] as v3
FROM data, unnest(a) as v1 WITH OFFSET off
Output:
v1 v2 v3
1 2 3
2 3 4
3 4 5
4 5 6
Keeping in mind the somewhat messy table you are dealing with, the important aspect of any restructuring decision is how practical the query implementing it will be.
In your specific case I would recommend fully flattening the data as below: each row is transformed into ~180,000 rows, each representing one element of one of the arrays in the original row, where the slice field gives the array number and pos gives the element's position within that array. The query is generic enough to handle any number and names of slices and any array sizes, and at the same time the result is flexible and generic enough to be used in practically any downstream algorithm.
#standardSQL
SELECT
id,
slice,
pos,
value
FROM `project.dataset.messytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]')) slice WITH OFFSET x
JOIN UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]')) arr WITH OFFSET y
ON x = y,
UNNEST(SPLIT(arr)) value WITH OFFSET pos
You can test / play with it using the dummy example below:
#standardSQL
WITH `project.dataset.messytable` AS (
SELECT 1 id,
[ 1, 2, 3, 4, 5] x__x_arrVal_arrSlices_0,
[11, 12, 13, 14, 15] x__x_arrVal_arrSlices_1,
[21, 22, 23, 24, 25] x__x_arrVal_arrSlices_2 UNION ALL
SELECT 2 id,
[ 6, 7, 8, 9, 10] x__x_arrVal_arrSlices_0,
[16, 17, 18, 19, 20] x__x_arrVal_arrSlices_1,
[26, 27, 28, 29, 30] x__x_arrVal_arrSlices_2
)
SELECT
id,
slice,
pos,
value
FROM `project.dataset.messytable` t,
UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_(\d+)":\[.*?\]')) slice WITH OFFSET x
JOIN UNNEST(REGEXP_EXTRACT_ALL(TO_JSON_STRING(t), r'"x__x_arrVal_arrSlices_\d+":\[(.*?)\]')) arr WITH OFFSET y
ON x = y,
UNNEST(SPLIT(arr)) value WITH OFFSET pos
The result is as below:
Row id slice pos value
1 1 0 0 1
2 1 0 1 2
3 1 0 2 3
4 1 0 3 4
5 1 0 4 5
6 1 1 0 11
7 1 1 1 12
8 1 1 2 13
9 1 1 3 14
10 1 1 4 15
11 1 2 0 21
12 1 2 1 22
13 1 2 2 23
14 1 2 3 24
15 1 2 4 25
16 2 0 0 6
17 2 0 1 7
18 2 0 2 8
19 2 0 3 9
20 2 0 4 10
21 2 1 0 16
22 2 1 1 17
23 2 1 2 18
24 2 1 3 19
25 2 1 4 20
26 2 2 0 26
27 2 2 1 27
28 2 2 2 28
29 2 2 3 29
30 2 2 4 30

How to emulate SQL's rank functions in R?

What is the R equivalent of rank functions like the Oracle ROW_NUMBER(), RANK(), or DENSE_RANK() ("assign integer values to the rows depending on their order"; see http://www.orafaq.com/node/55)?
I agree that the functionality of each function can potentially be achieved in an ad-hoc fashion. But my main concern is the performance. It would be good to avoid using join or indexing access, for the sake of memory and speed.
The data.table package, especially starting with version 1.8.1, offers much of the functionality of partition in SQL terms. rank(x, ties.method = "min") in R is similar to Oracle RANK(), and there's a way using factors (described below) to mimic the DENSE_RANK() function. A way to mimic ROW_NUMBER should be obvious by the end.
Here's an example: Load the latest version of data.table from R-Forge:
install.packages("data.table",
repos= c("http://R-Forge.R-project.org", getOption("repos")))
library(data.table)
Create some example data:
set.seed(10)
DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),
info=c(sample(c("a","b"),4*2,replace=TRUE),
sample(c("c","d"),4,replace=TRUE)),key="ID")
> DT
ID group value info
1: 1 1 0.01874617 a
2: 2 1 -0.18425254 b
3: 3 1 -1.37133055 b
4: 4 2 -0.59916772 a
5: 5 2 0.29454513 b
6: 6 2 0.38979430 a
7: 7 3 -1.20807618 b
8: 8 3 -0.36367602 a
9: 9 3 -1.62667268 c
10: 10 4 -0.25647839 d
11: 11 4 1.10177950 c
12: 12 4 0.75578151 d
Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):
> DT[,valRank:=rank(-value),by="group"]
ID group value info valRank
1: 1 1 0.01874617 a 1
2: 2 1 -0.18425254 b 2
3: 3 1 -1.37133055 b 3
4: 4 2 -0.59916772 a 3
5: 5 2 0.29454513 b 2
6: 6 2 0.38979430 a 1
7: 7 3 -1.20807618 b 2
8: 8 3 -0.36367602 a 1
9: 9 3 -1.62667268 c 3
10: 10 4 -0.25647839 d 3
11: 11 4 1.10177950 c 1
12: 12 4 0.75578151 d 2
For DENSE_RANK() with ties in the value being ranked, you could convert the value to a factor and then return the underlying integer values. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):
DT[,infoRank:=rank(info,ties.method="min"),by="group"]
DT[,infoRankDense:=as.integer(factor(info)),by="group"]
R> DT
ID group value info valRank infoRank infoRankDense
1: 1 1 0.01874617 a 1 1 1
2: 2 1 -0.18425254 b 2 2 2
3: 3 1 -1.37133055 b 3 2 2
4: 4 2 -0.59916772 a 3 1 1
5: 5 2 0.29454513 b 2 3 2
6: 6 2 0.38979430 a 1 1 1
7: 7 3 -1.20807618 b 2 2 2
8: 8 3 -0.36367602 a 1 1 1
9: 9 3 -1.62667268 c 3 3 3
10: 10 4 -0.25647839 d 3 2 2
11: 11 4 1.10177950 c 1 1 1
12: 12 4 0.75578151 d 2 2 2
p.s. Hi Matthew Dowle.
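To mimic ROW_NUMBER() explicitly (it numbers rows within a partition, breaking ties by order of appearance), ties.method = "first" or a plain row sequence after ordering both work; a small sketch on the same DT (the rowNum/rowNum2 names are just illustrative):
# ROW_NUMBER() OVER (PARTITION BY group ORDER BY value DESC)
DT[, rowNum := rank(-value, ties.method = "first"), by = "group"]
# equivalently, physically order the table and number the rows within each group
# (note that setorder() reorders DT and drops its key)
setorder(DT, group, -value)
DT[, rowNum2 := seq_len(.N), by = "group"]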
LEAD and LAG
For imitating LEAD and LAG, start with the answer provided here. I would create a rank variable based on the order of IDs within groups. This wouldn't be necessary with the fake data as above, but if the IDs are not in sequential order within groups, then this would make life a bit more difficult. So here's some new fake data with non-sequential IDs:
set.seed(10)
DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),
info=c(sample(c("a","b"),4*2,replace=TRUE),
sample(c("c","d"),4,replace=TRUE)),key="ID")
DT[,idRank:=rank(ID),by="group"]
setkey(DT,group, idRank)
> DT
ID group value info idRank
1: 4 1 -0.36367602 b 1
2: 5 1 -1.62667268 b 2
3: 7 1 -1.20807618 b 3
4: 1 2 1.10177950 a 1
5: 2 2 0.75578151 a 2
6: 12 2 -0.25647839 b 3
7: 3 3 0.74139013 c 1
8: 6 3 0.98744470 b 2
9: 9 3 -0.23823356 a 3
10: 8 4 -0.19515038 c 1
11: 10 4 0.08934727 c 2
12: 11 4 -0.95494386 c 3
Then, to get the value from the previous record, join on the group and idRank variables, subtracting 1 from idRank, and use the mult = 'last' argument. To get the value from the record two entries above, subtract 2.
DT[,prev:=DT[J(group,idRank-1), value, mult='last']]
DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]
ID group value info idRank prev prev2
1: 4 1 -0.36367602 b 1 NA NA
2: 5 1 -1.62667268 b 2 -0.36367602 NA
3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760
4: 1 2 1.10177950 a 1 NA NA
5: 2 2 0.75578151 a 2 1.10177950 NA
6: 12 2 -0.25647839 b 3 0.75578151 1.1017795
7: 3 3 0.74139013 c 1 NA NA
8: 6 3 0.98744470 b 2 0.74139013 NA
9: 9 3 -0.23823356 a 3 0.98744470 0.7413901
10: 8 4 -0.19515038 c 1 NA NA
11: 10 4 0.08934727 c 2 -0.19515038 NA
12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504
For LEAD, add the appropriate offset to the idRank variable and switch to mult = 'first':
DT[,nex:=DT[J(group,idRank+1), value, mult='first']]
DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]
ID group value info idRank prev prev2 nex nex2
1: 4 1 -0.36367602 b 1 NA NA -1.62667268 -1.2080762
2: 5 1 -1.62667268 b 2 -0.36367602 NA -1.20807618 NA
3: 7 1 -1.20807618 b 3 -1.62667268 -0.3636760 NA NA
4: 1 2 1.10177950 a 1 NA NA 0.75578151 -0.2564784
5: 2 2 0.75578151 a 2 1.10177950 NA -0.25647839 NA
6: 12 2 -0.25647839 b 3 0.75578151 1.1017795 NA NA
7: 3 3 0.74139013 c 1 NA NA 0.98744470 -0.2382336
8: 6 3 0.98744470 b 2 0.74139013 NA -0.23823356 NA
9: 9 3 -0.23823356 a 3 0.98744470 0.7413901 NA NA
10: 8 4 -0.19515038 c 1 NA NA 0.08934727 -0.9549439
11: 10 4 0.08934727 c 2 -0.19515038 NA -0.95494386 NA
12: 11 4 -0.95494386 c 3 0.08934727 -0.1951504 NA NA
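Newer data.table versions also offer shift(), which implements LAG/LEAD directly and avoids the self-joins above. A sketch on the same DT (already ordered by group and idRank), reproducing the prev/prev2/nex/nex2 columns:
# shift() within by = group gives lagged/leading values per group
DT[, `:=`(prev  = shift(value, 1L, type = "lag"),
          prev2 = shift(value, 2L, type = "lag"),
          nex   = shift(value, 1L, type = "lead"),
          nex2  = shift(value, 2L, type = "lead")),
   by = group]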
As of data.table v1.9.5+, the function frank() (for fast rank) is implemented. frank() is useful in interactive scenarios, whereas frankv() makes it easy to program with.
It implements every operation available in base::rank. In addition, the advantages are:
frank() operates on lists, data.frames and data.tables in addition to atomic vectors.
We can specify, for each column, whether the rank should be computed in increasing or decreasing order.
It also implements the dense rank type in addition to the other types in base.
You can use - on a character column as well to rank in decreasing order.
Here's an illustration of all the above points using the same data.table DT from @BenBarnes' (excellent) post.
data:
require(data.table)
set.seed(10)
sample_n <- function(x, n) sample(x, n, replace=TRUE)
DT <- data.table(
ID = seq_len(4*3),
group = rep(1:4,each=3),
value = rnorm(4*3),
info = c(sample_n(letters[1:2], 8), sample_n(letters[3:4], 4)))
On single columns:
Compute dense rank:
DT[, rank := frank(value, ties.method="dense"), by=group]
You can also use the other methods min, max, random, average and first.
In decreasing order:
DT[, rank := frank(-value, ties.method="dense"), by=group]
Using frankv, similar to frank:
# increasing order
frankv(DT, "value", ties.method="dense")
# decreasing order
frankv(DT, "value", order=-1L, ties.method="dense")
On multiple columns
You can use .SD, which stands for Subset of Data and contains the data corresponding to that group. See the Introduction to data.table HTML vignette for more on .SD.
Rank by info, value columns while grouping by group:
DT[, rank := frank(.SD, info, value, ties.method="dense"), by=group]
Use - to specify decreasing order:
DT[, rank := frank(.SD, info, -value, ties.method="dense"), by=group]
You can also use - directly on character columns
DT[, rank := frank(.SD, -info, -value, ties.method="dense"), by=group]
You can use frankv() similarly, providing the columns via the cols argument and the order in which each column should be ranked via the order argument.
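For instance, a sketch of the frankv() form of the last call (ungrouped here, with one order value per column):
frankv(DT, cols = c("info", "value"), order = c(-1L, -1L), ties.method = "dense")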
Small benchmark to compare with base::rank:
set.seed(45L)
x = sample(1e4, 1e7, TRUE)
system.time(ans1 <- base::rank(x, ties.method="first"))
# user system elapsed
# 22.200 0.255 22.536
system.time(ans2 <- frank(x, ties.method="first"))
# user system elapsed
# 0.745 0.014 0.762
identical(ans1, ans2) # [1] TRUE
I like data.table as much as the next guy, but it isn't always necessary. data.table will always be faster, but even for moderately large data sets, if the number of groups is fairly small, plyr will still perform adequately.
What BenBarnes did using data.tables can be done just as compactly (but as I noted before probably slower in many cases) using plyr:
library(plyr)
ddply(DT,.(group),transform,valRank = rank(-value))
ddply(DT,.(group),transform,valRank = rank(info,ties.method = "min"),
valRankDense = as.integer(factor(info)))
and even without loading a single extra package at all:
do.call(rbind,by(DT,DT$group,transform,valRank = rank(-value)))
do.call(rbind,by(DT,DT$group,transform,valRank = rank(info,ties.method = "min"),
valRankDense = as.integer(factor(info))))
although you do lose some of the syntactic niceties in that last case.
dplyr now has window functions, including row_number() and dense_rank(): https://dplyr.tidyverse.org/reference/ranking.html:
df <- tibble::tribble(
~subjects, ~date, ~visits,
1L, "21/09/1999", 2L,
1L, "29/04/1999", 4L,
2L, "18/02/1999", 15L,
3L, "10/07/1999", 13L,
4L, "27/08/1999", 7L,
7L, "27/10/1999", 14L,
10L, "18/04/1999", 8L,
13L, "27/09/1999", 14L,
14L, "15/09/1999", 6L,
16L, "27/11/1999", 14L,
20L, "06/02/1999", 4L,
22L, "07/09/1999", 12L,
23L, "24/03/1999", 14L,
24L, "19/01/1999", 7L,
)
Note that ORDER BY does not need to be stipulated, unlike in the ROW_NUMBER() SQL code.
df_partition <- df %>%
  group_by(subjects) %>% # group_by is the equivalent of PARTITION BY in the SQL window function
  # ROW_NUMBER() equivalents:
  mutate(rn = row_number(visits),
         rn_reversed = row_number(desc(visits))) %>%
  ungroup() %>% # grouping by subjects remains on the data unless removed like this
  mutate(dense_rank = dense_rank(visits))
I don't think there's a direct equivalent to Oracle's analytic functions. plyr will likely be able to achieve some of the analytic functions, but not all of them directly. I'm sure R can replicate each function separately, but I don't think there's a single package that does it all.
If there's a specific operation you need to achieve in R, do some googling, and if you come up empty, ask a specific question here on Stack Overflow.