BigQuery: count rows whose values in two columns are greater than or equal to a given row's values, and find the percentage overall - google-bigquery

Let's say I have a table of millions of records resulting from a simulation; a sample is below:
TO Sim DUR Cost
1 1 20 145
1 2 24 120
1 3 27 176
1 4 30 170
1 5 23 173
1 6 26 148
1 7 21 175
1 8 22 171
1 9 23 169
1 10 23 178
2 1 23 172
2 2 29 152
2 3 25 162
2 4 20 179
2 5 26 154
2 6 27 137
2 7 27 131
2 8 28 148
2 9 25 156
2 10 22 169
How can I do this calculation in BigQuery to find the percentage of rows that satisfy two conditions? (I can write a UDF, but I would like it to be all in SQL statements.)
The Excel equivalent for the new calculated column would be =countifs($C$2:$C$21,">="&C2,$D$2:$D$21,">="&D2,$A$2:$A$21,A2) / countif($A$2:$A$21,A2)
The results would look like:
TO Sim DUR Cost f0
1 1 20 145 0.90
1 2 24 120 0.40
1 3 27 176 0.10
1 4 30 170 0.10
1 5 23 173 0.30
1 6 26 148 0.30
1 7 21 175 0.30
1 8 22 171 0.40
1 9 23 169 0.50
1 10 23 178 0.10
2 1 23 172 0.10
2 2 29 152 0.10
2 3 25 162 0.10
2 4 20 179 0.10
2 5 26 154 0.10
2 6 27 137 0.30
2 7 27 131 0.40
2 8 28 148 0.20
2 9 25 156 0.20
2 10 22 169 0.20

Below is for BigQuery Standard SQL
#standardSQL
SELECT ANY_VALUE(a).*, COUNTIF(b.dur >= a.dur AND b.cost >= a.cost) / COUNT(1) calc
FROM `project.dataset.table` a
JOIN `project.dataset.table` b
USING (to_)
GROUP BY FORMAT('%t', a)
-- ORDER BY to_, sim
If applied to the sample data from your question, the result is:
Row to_ sim dur cost calc
1 1 1 20 145 0.9
2 1 2 24 120 0.4
3 1 3 27 176 0.1
4 1 4 30 170 0.1
5 1 5 23 173 0.3
6 1 6 26 148 0.3
7 1 7 21 175 0.3
8 1 8 22 171 0.4
9 1 9 23 169 0.5
10 1 10 23 178 0.1
11 2 1 23 172 0.1
12 2 2 29 152 0.1
13 2 3 25 162 0.1
14 2 4 20 179 0.1
15 2 5 26 154 0.1
16 2 6 27 137 0.3
17 2 7 27 131 0.4
18 2 8 28 148 0.2
19 2 9 25 156 0.2
20 2 10 22 169 0.2
Note: I am using the field name to_ instead of to, which is a reserved keyword and not allowed as a column name.
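As a quick sanity check outside BigQuery, the same per-row fraction can be reproduced in pandas; this is only a sketch, and the frame below is a re-typed copy of the to_=1 rows from the question:
import pandas as pd

# re-typed sample rows for scenario to_ = 1
df = pd.DataFrame({
    'to_':  [1] * 10,
    'sim':  list(range(1, 11)),
    'dur':  [20, 24, 27, 30, 23, 26, 21, 22, 23, 23],
    'cost': [145, 120, 176, 170, 173, 148, 175, 171, 169, 178],
})

def pairwise_frac(g):
    # for each row: share of rows in the same scenario with dur >= and cost >= that row's values
    return g.apply(
        lambda r: ((g['dur'] >= r['dur']) & (g['cost'] >= r['cost'])).mean(),
        axis=1)

df['calc'] = df.groupby('to_', group_keys=False).apply(pairwise_frac)
print(df)   # calc matches the expected f0 values (0.9, 0.4, 0.1, ...)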

Related

How to calculate shift and rolling sum over missing dates without adding them to data frame in Pandas?

I have a data set with dates, customers and income:
Date Customer Income
0 1/1/2018 A 53
1 2/1/2018 A 36
2 3/1/2018 A 53
3 5/1/2018 A 89
4 6/1/2018 A 84
5 8/1/2018 A 84
6 9/1/2018 A 54
7 10/1/2018 A 19
8 11/1/2018 A 44
9 12/1/2018 A 80
10 1/1/2018 B 24
11 2/1/2018 B 100
12 9/1/2018 B 40
13 10/1/2018 B 47
14 12/1/2018 B 10
15 2/1/2019 B 5
For both customers there are missing dates, as they purchased nothing in some months.
For each customer, I want to add the previous month's income and also the rolling sum of income for the last year.
Meaning, if there's a missing month, I'll see '0' in the shift(1) column of the following month that has income, and I'll see the rolling 12-month sum even if there weren't 12 observations.
This is the expected result:
Date Customer Income S(1) R(12)
0 1/1/2018 A 53 0 53
1 2/1/2018 A 36 53 89
2 3/1/2018 A 53 36 142
3 5/1/2018 A 89 0 231
4 6/1/2018 A 84 89 315
5 8/1/2018 A 84 0 399
6 9/1/2018 A 54 84 453
7 10/1/2018 A 19 54 472
8 11/1/2018 A 44 19 516
9 12/1/2018 A 80 44 596
10 1/1/2018 B 24 0 24
11 2/1/2018 B 100 24 124
12 9/1/2018 B 40 0 164
13 10/1/2018 B 47 40 211
14 12/1/2018 B 10 0 221
15 2/1/2019 B 5 0 102
So far, I've added the rows with missing dates using stack and unstack, but with many dates and customers this explodes the data to millions of rows, crashing the kernel, with most rows being 0's.
You can use .shift, but add logic so that if the gap is > 31 days, S(1) is set to 0.
The rolling 12-month calculation requires figuring out a "Rolling Date" and doing a somewhat involved list comprehension to decide whether or not to include a value, then taking the sum of each list per row.
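The code below assumes df already holds the sample data from the question; a minimal, hypothetical setup for testing would be:
import numpy as np
import pandas as pd

# re-typed sample data from the question
df = pd.DataFrame({
    'Date': ['1/1/2018', '2/1/2018', '3/1/2018', '5/1/2018', '6/1/2018',
             '8/1/2018', '9/1/2018', '10/1/2018', '11/1/2018', '12/1/2018',
             '1/1/2018', '2/1/2018', '9/1/2018', '10/1/2018', '12/1/2018', '2/1/2019'],
    'Customer': ['A'] * 10 + ['B'] * 6,
    'Income': [53, 36, 53, 89, 84, 84, 54, 19, 44, 80, 24, 100, 40, 47, 10, 5],
})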
df['Date'] = pd.to_datetime(df['Date']).dt.date
df['S(1)'] = df.groupby('Customer')['Income'].transform('shift').fillna(0)
s = (df['Date'] - df['Date'].shift())/np.timedelta64(1, '31D') <= 1
df['S(1)'] = df['S(1)'].where(s,0).astype(int)
df['Rolling Date'] = (df['Date'] - pd.Timedelta('1Y'))
df['R(12)'] = df.apply(lambda d: sum([z for x, y, z in
                                      zip(df['Customer'], df['Date'], df['Income'])
                                      if y > d['Rolling Date']
                                      if y <= d['Date']
                                      if x == d['Customer']]), axis=1)
df = df.drop('Rolling Date', axis=1)
df
Out[1]:
Date Customer Income S(1) R(12)
0 2018-01-01 A 53 0 53
1 2018-02-01 A 36 53 89
2 2018-03-01 A 53 36 142
3 2018-05-01 A 89 0 231
4 2018-06-01 A 84 89 315
5 2018-08-01 A 84 0 399
6 2018-09-01 A 54 84 453
7 2018-10-01 A 19 54 472
8 2018-11-01 A 44 19 516
9 2018-12-01 A 80 44 596
10 2018-01-01 B 24 0 24
11 2018-02-01 B 100 24 124
12 2018-09-01 B 40 0 164
13 2018-10-01 B 47 40 211
14 2018-12-01 B 10 0 221
15 2019-02-01 B 5 0 102
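For the R(12) part only, a rough alternative is pandas' offset-based rolling window, which avoids the per-row list comprehension; this is a sketch, and whether a purchase exactly one year earlier is included depends on the window's closed side, so results at that boundary may differ from the apply-based version above:
# time-based rolling sum per customer, no reindexing of missing months needed
tmp = df.assign(Date=pd.to_datetime(df['Date'])).sort_values(['Customer', 'Date'])
r12 = (tmp.set_index('Date')
          .groupby('Customer')['Income']
          .rolling('365D')        # window of the last 365 days
          .sum()
          .rename('R(12)')
          .reset_index())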

How to replace the last n values of a column with zero

I want to replace the last 2 values of one of the columns with zero. I understand that for NaN values I can use .fillna(0), but I would like to replace the row 6 value of the last column as well.
Weight Name Age d_id_max
0 45 Sam 14 2
1 88 Andrea 25 1
2 56 Alex 55 1
3 15 Robin 8 3
4 71 Kia 21 3
5 44 Sia 43 2
6 54 Ryan 45 1
7 34 Dimi 65 NaN
df.drop(df.tail(2).index,inplace=True)
The expected output is:
Weight Name Age d_id_max
0 45 Sam 14 2
1 88 Andrea 25 1
2 56 Alex 55 1
3 15 Robin 8 3
4 71 Kia 21 3
5 44 Sia 43 2
6 54 Ryan 45 0
7 34 Dimi 65 0
Before pandas 0.20.0 (a long time ago) this was a job for ix, but it is now deprecated. So you can use:
DataFrame.iloc to get the last rows, together with Index.get_loc for the position of column d_id_max:
df.iloc[-2:, df.columns.get_loc('d_id_max')] = 0
print (df)
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
Or DataFrame.loc, indexing by index values:
df.loc[df.index[-2:], 'd_id_max'] = 0
Try .iloc and get_loc
df.iloc[[-1,-2], df.columns.get_loc('d_id_max')] = 0
Out[232]:
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
You can use:
df['d_id_max'].iloc[-2:] = 0
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
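A side note on that last variant: df['d_id_max'].iloc[-2:] = 0 is chained assignment, which can raise SettingWithCopyWarning and, under copy-on-write (opt-in in pandas 2.x and slated to become the default), will not modify df at all. A single .loc call, as in the earlier answer, is the safer form:
# single-indexer assignment; updates df reliably
df.loc[df.index[-2:], 'd_id_max'] = 0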

Getting all rows after the smallest value in a column in each group after groupby

I have a dataframe as below.
After I do the groupby on 'Cycle' & 'Type', I want to find the sum/mean/std of only the positive values or the negative values of the rows AFTER the smallest 'Switch' value.
How do I do this? I think the easier option would be to get a dataframe which has the rows after the smallest Switch for each 'Cycle' & 'Type' group.
Cycle Type Time Switch
10 1 101 -0.134
10 1 102 0.001
10 1 103 -0.058
10 1 104 0.035
10 1 105 -0.209
10 1 106 0.002
10 1 107 -0.0443
10 1 108 0.001
10 1 109 -0.368
10 1 110 0.015
10 1 111 -0.009
10 1 112 0.055
10 1 113 -0.014
10 1 114 0.004
10 1 115 -0.033
10 1 116 0.003
10 1 117 -0.0401
10 1 118 0.003
10 1 119 -0.088
10 1 120 0.005
10 1 121 -0.026
10 1 122 0.001
10 1 123 -0.115
10 1 124 0.005
10 1 125 -0.085
10 1 126 0.002
10 1 127 -0.054
10 1 128 0.012
10 1 129 -0.034
8 1 101 -1.876
8 1 102 0.003
8 1 103 -0.134
8 1 104 0.002
8 1 105 -0.036
8 1 106 0.012
8 1 107 -0.08
8 1 108 0.037
8 1 109 -0.027
8 1 110 0.022
8 1 111 -0.001
8 1 112 0.028
8 1 113 -0.009
8 1 114 0.002
8 1 115 -0.006
8 1 116 0.01
8 1 117 -0.002
8 1 118 0.002
8 1 119 -0.002
8 1 120 0.008
8 1 121 -0.011
8 1 122 0.001
8 1 123 -0.028
8 1 124 0.003
8 1 125 -0.063
8 1 126 0.013
8 1 127 -0.003
8 1 128 0.02
8 1 129 -0.113
8 1 130 0.003
8 1 131 -0.03
8 1 132 0.012
8 1 133 -0.078
8 1 134 0.001
8 1 135 -0.764
8 1 136 0.006
8 1 137 -0.268
8 1 138 0.016
8 1 139 -0.171
8 1 140 0.013
8 1 141 -0.286
8 1 142 0.023
For the given dataframe, the output would be all the rows below -0.368 for cycle=10 & type=1. On the other hand, for cycle=8 & type=1, all rows below -1.876 (so all rows below the first row).
The output dataframe would be as below (the first 9 rows of cycle 10, type 1 are removed, and the first row of cycle 8, type 1 is removed):
Cycle Type Time Switch
10 1 110 0.015
10 1 111 -0.009
10 1 112 0.055
10 1 113 -0.014
10 1 114 0.004
10 1 115 -0.033
10 1 116 0.003
10 1 117 -0.0401
10 1 118 0.003
10 1 119 -0.088
10 1 120 0.005
10 1 121 -0.026
10 1 122 0.001
10 1 123 -0.115
10 1 124 0.005
10 1 125 -0.085
10 1 126 0.002
10 1 127 -0.054
10 1 128 0.012
10 1 129 -0.034
8 1 102 0.003
8 1 103 -0.134
8 1 104 0.002
8 1 105 -0.036
8 1 106 0.012
8 1 107 -0.08
8 1 108 0.037
8 1 109 -0.027
8 1 110 0.022
8 1 111 -0.001
8 1 112 0.028
8 1 113 -0.009
8 1 114 0.002
8 1 115 -0.006
8 1 116 0.01
8 1 117 -0.002
8 1 118 0.002
8 1 119 -0.002
8 1 120 0.008
8 1 121 -0.011
8 1 122 0.001
8 1 123 -0.028
8 1 124 0.003
8 1 125 -0.063
8 1 126 0.013
8 1 127 -0.003
8 1 128 0.02
8 1 129 -0.113
8 1 130 0.003
8 1 131 -0.03
8 1 132 0.012
8 1 133 -0.078
8 1 134 0.001
8 1 135 -0.764
8 1 136 0.006
8 1 137 -0.268
8 1 138 0.016
8 1 139 -0.171
8 1 140 0.013
8 1 141 -0.286
8 1 142 0.023
How do I accomplish this?
In the same way, if I have to get all the rows after the 2nd lowest, how do I do it?
The time keeps increasing, so if we know the time of the lowest 'Switch', all the rows with 'Time' values above this time can be the contents of the new dataframe.
My progress so far is below.
I was able to find the time of the lowest value with the code below:
after_min = switch.groupby(['Cycle','Type'],as_index=False).apply(lambda x: x.nsmallest(1, 'Switch'))
a = after_min.groupby(['Cycle','Type'])['Time'].agg('first').reset_index(name='Time')
I am not able to create a mask or something similar that would filter the values below the lowest time in each group. Can anyone help?
If there is a way to get all the rows after the lowest value of 'Switch' even if we did not have the 'Time' column, please let me know.
Update: The answer suggested by WeNYoBen works perfectly if I want to get all the rows after the LOWEST value. However, if I want to get all the rows after the 2nd lowest, it will not work.
With the same logic as mentioned by WeNYoBen, if I can transform the last value in level=1 of the result of the code below to my group, there may be a possibility to get the rows after the 2nd lowest.
df.groupby(['Cycle','Type'],as_index=False).apply(lambda x: x.nsmallest(2, 'Switch'))
The code above gives the output shown in the screenshot; 63 & 4 are the indexes of the 2nd lowest 'Switch' values. If I can only transform these values to each group respectively, then I can get the rows below the 2nd lowest values using the logic from WeNYoBen (this can also be scaled by changing the nsmallest value in the code above to the number desired). I am just not able to transform 63 & 1 to each group. Can anyone help?
Here is a solution using transform and idxmin:
df[df.index>df.groupby(['Cycle','Type']).Switch.transform('idxmin')]
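For the follow-up about the 2nd lowest value, the same transform idea can be extended; this is only a rough sketch, assuming row labels increase with Time within each group (as in the question) and each group has at least two rows:
# label of the row holding the 2nd-smallest Switch, broadcast to its whole group
cutoff = (df.groupby(['Cycle', 'Type'])['Switch']
            .transform(lambda s: s.nsmallest(2).index[-1]))
# keep only the rows that come after that label
result = df[df.index > cutoff]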

How to find the mean (average) of every 3 values in one column of unknown size using pandas, without NumPy

I am trying to figure out how to apply such functions (mean, std, etc.) to different values of a CSV file. To keep it simple, here is an example with one column:
S08
10
20
30
40
50
60
70
80
90
100
110
120
130
140
150
D1 = D.loc[:,'S08']
rang = len(D1)
for i in range(rang):
    x = D1.iloc[:,i+2]
    m = x.mean()
    print(m)
time S08 S09 S15 S37 S38 S39 S41 S45 S49
1 10 5 100 5 145 1500 1 10 99
2 20 15 200 15 135 1400 2 150 99
3 30 25 300 25 125 1300 3 140 99
4 40 35 400 35 115 1200 4 130 99
5 50 45 500 45 105 1100 5 120 99
6 60 55 600 55 95 1000 6 110 99
7 70 65 700 65 85 900 7 100 99
8 80 75 800 75 75 800 8 90 99
9 90 85 900 85 65 700 9 80 99
10 100 95 1000 95 55 600 10 70 99
11 110 105 1100 105 45 500 11 60 99
12 120 115 1200 115 35 400 12 50 99
13 130 125 1300 125 25 300 13 40 99
14 140 135 1400 135 15 200 14 30 99
15 150 145 1500 145 5 100 15 20 99
Use groupby on the index floor-divided by 3, and then aggregate columns with agg:
#create monotonic unique index (0,1,2...) if necessary
#df = df.reset_index(drop=True)
df = df.groupby(df.index // 3).agg({'col1':'mean', 'col2':'std'})
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(5, size=(10,3)), columns=list('ABC'))
print (df)
A B C
0 0 0 3
1 0 2 4
2 2 2 2
3 2 1 0
4 0 4 3
5 4 2 0
6 3 1 2
7 3 4 4
8 1 3 4
9 4 3 3
df1 = df.groupby(df.index // 3).agg({'A':'mean', 'B':'std'})
print (df1)
A B
0 0.666667 1.154701
1 2.000000 1.527525
2 2.333333 1.527525
3 4.000000 NaN
#floor divide index values to create groups of three
print (df.index // 3)
Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')
EDIT:
df1 = df.groupby(df.index // 3).agg(['mean','std'])
df1.columns = df1.columns.map('_'.join)
print (df1)
A_mean A_std B_mean B_std C_mean C_std
0 0.666667 1.154701 1.333333 1.154701 3.000000 1.000000
1 2.000000 2.000000 2.333333 1.527525 1.000000 1.732051
2 2.333333 1.154701 2.666667 1.527525 3.333333 1.154701
3 4.000000 NaN 3.000000 NaN 3.000000 NaN
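Applied to the frame from the question (assuming it is loaded as D with a default 0..n-1 index, as in the original snippet), the same pattern gives the block-of-3 statistics for the S08 column:
# mean and std of S08 over consecutive, non-overlapping blocks of 3 rows
out = D.groupby(D.index // 3)['S08'].agg(['mean', 'std'])
print(out)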

How to eliminate a unique record using SQL?

I have a table with 3 columns. cid is the user, when is a timestamp of some transaction, and the 3rd column is me fumbling with how to achieve my objective.
In DB2, using this query:
SELECT cid, when, ROW_NUMBER() OVER (PARTITION BY cid ORDER BY when ASC) AS cid_when_rank
FROM (SELECT DISTINCT cid, when FROM yrb_purchase ORDER BY cid) AS temp
I get this table:
CID WHEN CID_WHEN_RANK
1 1999-04-20-12.12.00.000000 1
1 2001-12-01-11.59.00.000000 2
2 1998-08-08-17.33.00.000000 1
2 1999-02-13-15.13.00.000000 2
2 1999-04-16-11.46.00.000000 3
2 2001-02-23-12.37.00.000000 4
2 2001-04-24-17.02.00.000000 5
2 2001-10-21-11.05.00.000000 6
2 2001-12-01-15.39.00.000000 7
3 1998-01-27-09.19.00.000000 1
3 2001-10-06-11.12.00.000000 2
4 2000-06-13-09.45.00.000000 1
4 2001-06-30-13.58.00.000000 2
4 2001-08-11-17.40.00.000000 3
5 2001-07-17-16.27.00.000000 1
6 2000-05-18-11.43.00.000000 1
6 2001-07-08-18.09.00.000000 2
6 2001-10-02-12.37.00.000000 3
7 1999-06-15-12.13.00.000000 1
7 2000-05-05-14.49.00.000000 2
7 2000-09-26-16.32.00.000000 3
8 1999-01-19-09.32.00.000000 1
8 1999-08-02-09.20.00.000000 2
8 2000-07-03-12.39.00.000000 3
8 2001-08-13-13.11.00.000000 4
8 2001-10-18-10.18.00.000000 5
9 2001-09-10-13.03.00.000000 1
10 2000-03-11-10.05.00.000000 1
10 2001-03-11-15.46.00.000000 2
10 2001-04-29-18.30.00.000000 3
11 2001-07-27-11.45.00.000000 1
12 1999-02-07-10.59.00.000000 1
12 2001-08-24-11.12.00.000000 2
13 1998-03-17-14.04.00.000000 1
13 2001-05-18-10.11.00.000000 2
13 2001-09-14-12.56.00.000000 3
14 2001-10-10-17.18.00.000000 1
15 2000-12-01-18.27.00.000000 1
16 2000-01-04-14.18.00.000000 1
16 2001-02-27-15.08.00.000000 2
16 2001-11-16-09.52.00.000000 3
17 1998-04-08-17.59.00.000000 1
17 1999-06-07-10.13.00.000000 2
17 2001-09-13-12.08.00.000000 3
18 2001-09-22-10.01.00.000000 1
19 1999-03-09-12.11.00.000000 1
19 2001-07-23-09.27.00.000000 2
19 2001-12-01-16.10.00.000000 3
20 1999-11-22-14.29.00.000000 1
20 2000-05-27-17.56.00.000000 2
20 2001-06-01-09.37.00.000000 3
21 1998-02-17-16.08.00.000000 1
21 2000-02-15-13.22.00.000000 2
21 2001-03-10-15.05.00.000000 3
21 2001-03-10-16.22.00.000000 4
21 2001-10-25-10.15.00.000000 5
21 2001-11-19-11.02.00.000000 6
22 2001-03-04-17.13.00.000000 1
22 2001-08-16-16.59.00.000000 2
22 2001-10-23-11.24.00.000000 3
23 1998-07-04-16.33.00.000000 1
23 2000-09-26-13.17.00.000000 2
23 2000-09-27-12.27.00.000000 3
23 2001-06-23-16.45.00.000000 4
23 2001-10-27-18.01.00.000000 5
24 2001-10-23-14.59.00.000000 1
25 2001-03-14-09.26.00.000000 1
25 2001-11-30-14.23.00.000000 2
26 2001-04-27-15.07.00.000000 1
26 2001-06-30-11.26.00.000000 2
26 2001-12-01-18.04.00.000000 3
27 2000-06-05-09.44.00.000000 1
28 1999-07-17-10.14.00.000000 1
28 2001-02-03-15.50.00.000000 2
28 2001-02-13-12.08.00.000000 3
28 2001-07-20-16.52.00.000000 4
29 2001-06-10-17.16.00.000000 1
29 2001-09-20-10.19.00.000000 2
30 1999-05-22-16.59.00.000000 1
30 2001-10-20-15.28.00.000000 2
30 2001-12-01-14.50.00.000000 3
32 1999-05-05-14.20.00.000000 1
32 2000-05-12-13.51.00.000000 2
32 2001-05-18-10.43.00.000000 3
33 1999-02-07-18.58.00.000000 1
33 1999-09-30-14.05.00.000000 2
33 2001-09-18-12.48.00.000000 3
34 1999-05-29-15.57.00.000000 1
35 2001-03-19-18.38.00.000000 1
35 2001-03-28-15.49.00.000000 2
36 1999-06-22-11.42.00.000000 1
36 1999-10-30-15.25.00.000000 2
36 2000-01-27-10.17.00.000000 3
36 2000-11-04-09.06.00.000000 4
37 1999-01-11-09.51.00.000000 1
37 2000-11-25-17.53.00.000000 2
37 2000-12-01-17.21.00.000000 3
37 2001-10-21-16.49.00.000000 4
38 1997-10-11-17.15.00.000000 1
39 2000-03-09-13.46.00.000000 1
39 2001-01-09-16.22.00.000000 2
39 2001-07-03-14.12.00.000000 3
40 1998-07-27-17.39.00.000000 1
40 1999-01-27-09.36.00.000000 2
40 1999-06-12-17.18.00.000000 3
40 2000-05-17-14.17.00.000000 4
40 2001-04-08-15.39.00.000000 5
40 2001-09-30-10.26.00.000000 6
41 1998-06-05-10.06.00.000000 1
41 1998-08-23-09.39.00.000000 2
41 1999-12-01-18.42.00.000000 3
41 2001-03-30-15.26.00.000000 4
41 2001-11-15-15.33.00.000000 5
42 2000-06-22-12.16.00.000000 1
42 2001-01-13-15.03.00.000000 2
42 2001-08-19-14.18.00.000000 3
43 1998-07-07-11.29.00.000000 1
43 1999-01-22-15.46.00.000000 2
43 2000-08-04-12.16.00.000000 3
43 2001-03-17-14.18.00.000000 4
44 1999-11-03-09.32.00.000000 1
44 2001-05-26-17.23.00.000000 2
44 2001-07-18-12.59.00.000000 3
44 2001-10-23-10.04.00.000000 4
44 2001-11-09-16.18.00.000000 5
45 2000-03-19-10.31.00.000000 1
45 2001-07-14-11.36.00.000000 2
I am trying to eliminate all the customers (cid) who have made only one purchase; cid=5 and cid=9 are good examples. The logic is that if they have a cid_when_rank=1 but no cid_when_rank=2, I need to drop those tuples. I have been breaking my head using INTERSECT, EXCEPT, and logic in the WHERE clause, but no luck. I looked online for how to eliminate DISTINCT records, but all I found was people discovering the DISTINCT keyword.
Please do not suggest hard-coding cid=5 or cid=9, as there are more than those two records in the table.
Can you please suggest a simple SQL way to get this done? Please be aware I am not very strong at SQL yet and would appreciate the most basic answer.
Thanks in advance!
************************************EDIT #1**********************************
When I tried the first and second suggested answers, my table went from 127 records to 287. I am trying to simply remove the records where a cid has a rank of 1 and does not have a rank of 2. Hope you can help.
The results of both suggested answers yield the same table:
CID WHEN CID_WHEN_RANK
1 1999-04-20-12.12.00.000000 1
1 2001-12-01-11.59.00.000000 2
1 2001-12-01-11.59.00.000000 3
1 2001-12-01-11.59.00.000000 4
1 2001-12-01-11.59.00.000000 5
2 1998-08-08-17.33.00.000000 1
2 1998-08-08-17.33.00.000000 2
2 1999-02-13-15.13.00.000000 3
2 1999-04-16-11.46.00.000000 4
2 2001-02-23-12.37.00.000000 5
2 2001-04-24-17.02.00.000000 6
2 2001-04-24-17.02.00.000000 7
2 2001-04-24-17.02.00.000000 8
2 2001-10-21-11.05.00.000000 9
2 2001-10-21-11.05.00.000000 10
2 2001-12-01-15.39.00.000000 11
3 1998-01-27-09.19.00.000000 1
3 1998-01-27-09.19.00.000000 2
3 1998-01-27-09.19.00.000000 3
3 2001-10-06-11.12.00.000000 4
3 2001-10-06-11.12.00.000000 5
3 2001-10-06-11.12.00.000000 6
3 2001-10-06-11.12.00.000000 7
3 2001-10-06-11.12.00.000000 8
4 2000-06-13-09.45.00.000000 1
4 2001-06-30-13.58.00.000000 2
4 2001-06-30-13.58.00.000000 3
4 2001-06-30-13.58.00.000000 4
4 2001-08-11-17.40.00.000000 5
5 2001-07-17-16.27.00.000000 1
5 2001-07-17-16.27.00.000000 2
5 2001-07-17-16.27.00.000000 3
5 2001-07-17-16.27.00.000000 4
5 2001-07-17-16.27.00.000000 5
5 2001-07-17-16.27.00.000000 6
5 2001-07-17-16.27.00.000000 7
6 2000-05-18-11.43.00.000000 1
6 2000-05-18-11.43.00.000000 2
6 2000-05-18-11.43.00.000000 3
6 2001-07-08-18.09.00.000000 4
6 2001-07-08-18.09.00.000000 5
6 2001-10-02-12.37.00.000000 6
7 1999-06-15-12.13.00.000000 1
7 1999-06-15-12.13.00.000000 2
7 2000-05-05-14.49.00.000000 3
7 2000-09-26-16.32.00.000000 4
8 1999-01-19-09.32.00.000000 1
8 1999-08-02-09.20.00.000000 2
8 2000-07-03-12.39.00.000000 3
8 2000-07-03-12.39.00.000000 4
8 2001-08-13-13.11.00.000000 5
8 2001-10-18-10.18.00.000000 6
8 2001-10-18-10.18.00.000000 7
9 2001-09-10-13.03.00.000000 1
9 2001-09-10-13.03.00.000000 2
9 2001-09-10-13.03.00.000000 3
9 2001-09-10-13.03.00.000000 4
9 2001-09-10-13.03.00.000000 5
9 2001-09-10-13.03.00.000000 6
9 2001-09-10-13.03.00.000000 7
9 2001-09-10-13.03.00.000000 8
10 2000-03-11-10.05.00.000000 1
10 2001-03-11-15.46.00.000000 2
10 2001-03-11-15.46.00.000000 3
10 2001-04-29-18.30.00.000000 4
10 2001-04-29-18.30.00.000000 5
11 2001-07-27-11.45.00.000000 1
11 2001-07-27-11.45.00.000000 2
11 2001-07-27-11.45.00.000000 3
11 2001-07-27-11.45.00.000000 4
11 2001-07-27-11.45.00.000000 5
12 1999-02-07-10.59.00.000000 1
12 2001-08-24-11.12.00.000000 2
12 2001-08-24-11.12.00.000000 3
12 2001-08-24-11.12.00.000000 4
13 1998-03-17-14.04.00.000000 1
13 2001-05-18-10.11.00.000000 2
13 2001-05-18-10.11.00.000000 3
13 2001-05-18-10.11.00.000000 4
13 2001-09-14-12.56.00.000000 5
14 2001-10-10-17.18.00.000000 1
14 2001-10-10-17.18.00.000000 2
14 2001-10-10-17.18.00.000000 3
14 2001-10-10-17.18.00.000000 4
14 2001-10-10-17.18.00.000000 5
14 2001-10-10-17.18.00.000000 6
14 2001-10-10-17.18.00.000000 7
14 2001-10-10-17.18.00.000000 8
15 2000-12-01-18.27.00.000000 1
15 2000-12-01-18.27.00.000000 2
15 2000-12-01-18.27.00.000000 3
15 2000-12-01-18.27.00.000000 4
15 2000-12-01-18.27.00.000000 5
16 2000-01-04-14.18.00.000000 1
16 2001-02-27-15.08.00.000000 2
16 2001-02-27-15.08.00.000000 3
16 2001-02-27-15.08.00.000000 4
16 2001-11-16-09.52.00.000000 5
16 2001-11-16-09.52.00.000000 6
16 2001-11-16-09.52.00.000000 7
17 1998-04-08-17.59.00.000000 1
17 1999-06-07-10.13.00.000000 2
17 2001-09-13-12.08.00.000000 3
17 2001-09-13-12.08.00.000000 4
17 2001-09-13-12.08.00.000000 5
18 2001-09-22-10.01.00.000000 1
18 2001-09-22-10.01.00.000000 2
18 2001-09-22-10.01.00.000000 3
19 1999-03-09-12.11.00.000000 1
19 1999-03-09-12.11.00.000000 2
19 1999-03-09-12.11.00.000000 3
19 2001-07-23-09.27.00.000000 4
19 2001-07-23-09.27.00.000000 5
19 2001-07-23-09.27.00.000000 6
19 2001-12-01-16.10.00.000000 7
19 2001-12-01-16.10.00.000000 8
19 2001-12-01-16.10.00.000000 9
19 2001-12-01-16.10.00.000000 10
19 2001-12-01-16.10.00.000000 11
20 1999-11-22-14.29.00.000000 1
20 1999-11-22-14.29.00.000000 2
20 2000-05-27-17.56.00.000000 3
20 2001-06-01-09.37.00.000000 4
20 2001-06-01-09.37.00.000000 5
21 1998-02-17-16.08.00.000000 1
21 2000-02-15-13.22.00.000000 2
21 2001-03-10-15.05.00.000000 3
21 2001-03-10-15.05.00.000000 4
21 2001-03-10-15.05.00.000000 5
21 2001-03-10-16.22.00.000000 6
21 2001-10-25-10.15.00.000000 7
21 2001-11-19-11.02.00.000000 8
21 2001-11-19-11.02.00.000000 9
21 2001-11-19-11.02.00.000000 10
21 2001-11-19-11.02.00.000000 11
22 2001-03-04-17.13.00.000000 1
22 2001-03-04-17.13.00.000000 2
22 2001-03-04-17.13.00.000000 3
22 2001-03-04-17.13.00.000000 4
22 2001-08-16-16.59.00.000000 5
22 2001-10-23-11.24.00.000000 6
23 1998-07-04-16.33.00.000000 1
23 2000-09-26-13.17.00.000000 2
23 2000-09-26-13.17.00.000000 3
23 2000-09-27-12.27.00.000000 4
23 2000-09-27-12.27.00.000000 5
23 2001-06-23-16.45.00.000000 6
23 2001-06-23-16.45.00.000000 7
23 2001-10-27-18.01.00.000000 8
23 2001-10-27-18.01.00.000000 9
23 2001-10-27-18.01.00.000000 10
23 2001-10-27-18.01.00.000000 11
24 2001-10-23-14.59.00.000000 1
24 2001-10-23-14.59.00.000000 2
24 2001-10-23-14.59.00.000000 3
25 2001-03-14-09.26.00.000000 1
25 2001-03-14-09.26.00.000000 2
25 2001-03-14-09.26.00.000000 3
25 2001-11-30-14.23.00.000000 4
26 2001-04-27-15.07.00.000000 1
26 2001-04-27-15.07.00.000000 2
26 2001-04-27-15.07.00.000000 3
26 2001-04-27-15.07.00.000000 4
26 2001-04-27-15.07.00.000000 5
26 2001-06-30-11.26.00.000000 6
26 2001-06-30-11.26.00.000000 7
26 2001-06-30-11.26.00.000000 8
26 2001-12-01-18.04.00.000000 9
26 2001-12-01-18.04.00.000000 10
26 2001-12-01-18.04.00.000000 11
27 2000-06-05-09.44.00.000000 1
27 2000-06-05-09.44.00.000000 2
28 1999-07-17-10.14.00.000000 1
28 2001-02-03-15.50.00.000000 2
28 2001-02-03-15.50.00.000000 3
28 2001-02-03-15.50.00.000000 4
28 2001-02-13-12.08.00.000000 5
28 2001-02-13-12.08.00.000000 6
28 2001-07-20-16.52.00.000000 7
28 2001-07-20-16.52.00.000000 8
29 2001-06-10-17.16.00.000000 1
29 2001-06-10-17.16.00.000000 2
29 2001-06-10-17.16.00.000000 3
29 2001-09-20-10.19.00.000000 4
29 2001-09-20-10.19.00.000000 5
29 2001-09-20-10.19.00.000000 6
30 1999-05-22-16.59.00.000000 1
30 2001-10-20-15.28.00.000000 2
30 2001-10-20-15.28.00.000000 3
30 2001-10-20-15.28.00.000000 4
30 2001-10-20-15.28.00.000000 5
30 2001-12-01-14.50.00.000000 6
30 2001-12-01-14.50.00.000000 7
32 1999-05-05-14.20.00.000000 1
32 1999-05-05-14.20.00.000000 2
32 2000-05-12-13.51.00.000000 3
32 2001-05-18-10.43.00.000000 4
32 2001-05-18-10.43.00.000000 5
32 2001-05-18-10.43.00.000000 6
32 2001-05-18-10.43.00.000000 7
32 2001-05-18-10.43.00.000000 8
33 1999-02-07-18.58.00.000000 1
33 1999-02-07-18.58.00.000000 2
33 1999-02-07-18.58.00.000000 3
33 1999-09-30-14.05.00.000000 4
33 1999-09-30-14.05.00.000000 5
33 1999-09-30-14.05.00.000000 6
33 2001-09-18-12.48.00.000000 7
33 2001-09-18-12.48.00.000000 8
34 1999-05-29-15.57.00.000000 1
34 1999-05-29-15.57.00.000000 2
35 2001-03-19-18.38.00.000000 1
35 2001-03-19-18.38.00.000000 2
35 2001-03-28-15.49.00.000000 3
35 2001-03-28-15.49.00.000000 4
36 1999-06-22-11.42.00.000000 1
36 1999-10-30-15.25.00.000000 2
36 1999-10-30-15.25.00.000000 3
36 1999-10-30-15.25.00.000000 4
36 2000-01-27-10.17.00.000000 5
36 2000-11-04-09.06.00.000000 6
37 1999-01-11-09.51.00.000000 1
37 1999-01-11-09.51.00.000000 2
37 1999-01-11-09.51.00.000000 3
37 2000-11-25-17.53.00.000000 4
37 2000-11-25-17.53.00.000000 5
37 2000-12-01-17.21.00.000000 6
37 2000-12-01-17.21.00.000000 7
37 2001-10-21-16.49.00.000000 8
38 1997-10-11-17.15.00.000000 1
38 1997-10-11-17.15.00.000000 2
38 1997-10-11-17.15.00.000000 3
38 1997-10-11-17.15.00.000000 4
38 1997-10-11-17.15.00.000000 5
38 1997-10-11-17.15.00.000000 6
39 2000-03-09-13.46.00.000000 1
39 2000-03-09-13.46.00.000000 2
39 2001-01-09-16.22.00.000000 3
39 2001-01-09-16.22.00.000000 4
39 2001-01-09-16.22.00.000000 5
39 2001-01-09-16.22.00.000000 6
39 2001-07-03-14.12.00.000000 7
40 1998-07-27-17.39.00.000000 1
40 1999-01-27-09.36.00.000000 2
40 1999-06-12-17.18.00.000000 3
40 1999-06-12-17.18.00.000000 4
40 2000-05-17-14.17.00.000000 5
40 2001-04-08-15.39.00.000000 6
40 2001-09-30-10.26.00.000000 7
40 2001-09-30-10.26.00.000000 8
41 1998-06-05-10.06.00.000000 1
41 1998-06-05-10.06.00.000000 2
41 1998-06-05-10.06.00.000000 3
41 1998-08-23-09.39.00.000000 4
41 1998-08-23-09.39.00.000000 5
41 1999-12-01-18.42.00.000000 6
41 1999-12-01-18.42.00.000000 7
41 1999-12-01-18.42.00.000000 8
41 2001-03-30-15.26.00.000000 9
41 2001-03-30-15.26.00.000000 10
41 2001-11-15-15.33.00.000000 11
42 2000-06-22-12.16.00.000000 1
42 2000-06-22-12.16.00.000000 2
42 2001-01-13-15.03.00.000000 3
42 2001-01-13-15.03.00.000000 4
42 2001-08-19-14.18.00.000000 5
42 2001-08-19-14.18.00.000000 6
42 2001-08-19-14.18.00.000000 7
42 2001-08-19-14.18.00.000000 8
43 1998-07-07-11.29.00.000000 1
43 1999-01-22-15.46.00.000000 2
43 2000-08-04-12.16.00.000000 3
43 2001-03-17-14.18.00.000000 4
43 2001-03-17-14.18.00.000000 5
43 2001-03-17-14.18.00.000000 6
44 1999-11-03-09.32.00.000000 1
44 2001-05-26-17.23.00.000000 2
44 2001-07-18-12.59.00.000000 3
44 2001-10-23-10.04.00.000000 4
44 2001-10-23-10.04.00.000000 5
44 2001-10-23-10.04.00.000000 6
44 2001-10-23-10.04.00.000000 7
44 2001-11-09-16.18.00.000000 8
45 2000-03-19-10.31.00.000000 1
45 2000-03-19-10.31.00.000000 2
45 2000-03-19-10.31.00.000000 3
45 2001-07-14-11.36.00.000000 4
287 record(s) selected.
Any suggestions?
You can use the COUNT window function to fetch cids that have more than one row.
select cid,when,cid_when_rank
from (
SELECT cid, when, ROW_NUMBER() OVER(PARTITION BY cid ORDER BY when ASC) AS cid_when_rank
,COUNT(*) OVER(PARTITION BY cid) as cnt
FROM yrb_purchase
) t
where cnt > 1
Edit: Based on OP's comment,
select cid,when,cid_when_rank
from (
SELECT cid, when, ROW_NUMBER() OVER(PARTITION BY cid ORDER BY when ASC) AS cid_when_rank
,COUNT(*) OVER(PARTITION BY cid) as cnt
FROM (SELECT DISTINCT cid, when FROM yrb_purchase) tmp
) t
where cnt > 1
Using count(*) as a window function is a very good solution. One way that might return results faster is exists:
select p.*
from yrb_purchase p
where exists (select 1 from yrb_purchase p2 where p2.cid = p.cid and p2.when <> p.when);
Of course, if you need the row number as well, then the overhead for the count is probably immeasurable.