Calculating pairwise cosine similarity between a large number of vectors in BigQuery

I have a table id_vectors that contains ids and their corresponding coordinates. Each coordinates value is a repeated field with 512 elements.
I am looking for the pairwise cosine similarity between all those vectors. For example, if I have three ids 1, 2 and 3, then I am looking for a table with the cosine similarity between each pair (computed over the 512 coordinates), like below:
id1 id2 similarity
1 2 0.5
1 3 0.1
2 3 0.99
Now my table has 424,970 unique IDs and their corresponding 512-dimensional coordinates. That means I basically need to create around 424,970 * 424,969 / 2 ≈ 90.3 billion unique pairs of IDs and calculate their similarity.
I first tried with the following query, using a reference from here:
#standardSQL
with pairwise as
(SELECT t1.id as id_1, t1.coords as coord1, t2.id as id_2, t2.coords as coord2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
on t1.id < t2.id)
SELECT id_1, id_2, (
SELECT
SUM(value1 * value2)/
SQRT(SUM(value1 * value1))/
SQRT(SUM(value2 * value2))
FROM UNNEST(coord1) value1 WITH OFFSET pos1
JOIN UNNEST(coord2) value2 WITH OFFSET pos2
ON pos1 = pos2
) cosine_similarity
FROM pairwise
But after running for 6 hours, I encountered the following error message:
Query exceeded resource limits. 2.2127481953201417E7 CPU seconds were used, and this query must use less than 428000.0 CPU seconds.
Then I thought: rather than using the intermediate pairwise result, why don't I create that table first, then do the cosine similarity calculation?
So I tried the following query:
SELECT t1.id as id_1, t1.coords as coord1, t2.id as id_2, t2.coords as coord2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
on t1.id < t2.id
But this time the query could not be completed and I encountered the following message:
Error: Quota exceeded: Your project exceeded quota for total shuffle size limit. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors.
Then I tried to create an even smaller table, containing just the combination pairs of the ids with the coordinates stripped off, using the following query:
SELECT t1.id as id_1, t2.id as id_2
FROM `project.dataset.id_vectors` t1
inner join `project.dataset.id_vectors` t2
on t1.id < t2.id
Again my query ended up with the error message: Query exceeded resource limits. 610104.3843576935 CPU seconds were used, and this query must use less than 3000.0 CPU seconds. (error code: billingTierLimitExceeded)
I totally understand that this is a huge query and that my stopping point is my billing quota.
What I am asking is: is there a smarter way to execute the query so that I do not exceed any of the resourceLimit, shuffleSizeLimit or billingTierLimit quotas?
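For example, I wondered whether splitting the join into smaller batches would help - something like the sketch below, which buckets the IDs by a modulus and processes one bucket pair per job (assuming integer ids; the bucket count of 100 is arbitrary):
#standardSQL
-- illustrative sketch: one batch out of all (i, j) bucket combinations with i <= j;
-- an external loop would substitute each bucket pair in turn
SELECT t1.id AS id_1, t2.id AS id_2
FROM `project.dataset.id_vectors` t1
INNER JOIN `project.dataset.id_vectors` t2
ON t1.id < t2.id
WHERE MOD(t1.id, 100) = 0
AND MOD(t2.id, 100) = 1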

A quick idea: instead of joining the table to itself with the redundant coordinate arrays, first create a simple table of pairs (id1, id2); you can then "dress" the respective IDs with their coordinate vectors via two extra joins to `project.dataset.id_vectors`.
Below is a quick example of what this could look like:
#standardSQL
WITH pairwise AS (
SELECT t1.id AS id_1, t2.id AS id_2
FROM `project.dataset.id_vectors` t1
INNER JOIN `project.dataset.id_vectors` t2
ON t1.id < t2.id
)
SELECT id_1, id_2, (
SELECT
SUM(value1 * value2)/
SQRT(SUM(value1 * value1))/
SQRT(SUM(value2 * value2))
FROM UNNEST(a.coords) value1 WITH OFFSET pos1
JOIN UNNEST(b.coords) value2 WITH OFFSET pos2
ON pos1 = pos2
) cosine_similarity
FROM pairwise t
JOIN `project.dataset.id_vectors` a ON a.id = id_1
JOIN `project.dataset.id_vectors` b ON b.id = id_2
Obviously it works on a small dummy set, as you can see below:
#standardSQL
WITH `project.dataset.id_vectors` AS (
SELECT 1 id, [1.0, 2.0, 3.0, 4.0] coords UNION ALL
SELECT 2, [1.0, 2.0, 3.0, 4.0] UNION ALL
SELECT 3, [2.0, 0.0, 1.0, 1.0] UNION ALL
SELECT 4, [0, 2.0, 1.0, 1.0] UNION ALL
SELECT 5, [2.0, 1.0, 1.0, 0.0] UNION ALL
SELECT 6, [1.0, 1.0, 1.0, 1.0]
), pairwise AS (
SELECT t1.id AS id_1, t2.id AS id_2
FROM `project.dataset.id_vectors` t1
INNER JOIN `project.dataset.id_vectors` t2
ON t1.id < t2.id
)
SELECT id_1, id_2, (
SELECT
SUM(value1 * value2)/
SQRT(SUM(value1 * value1))/
SQRT(SUM(value2 * value2))
FROM UNNEST(a.coords) value1 WITH OFFSET pos1
JOIN UNNEST(b.coords) value2 WITH OFFSET pos2
ON pos1 = pos2
) cosine_similarity
FROM pairwise t
JOIN `project.dataset.id_vectors` a ON a.id = id_1
JOIN `project.dataset.id_vectors` b ON b.id = id_2
with the result:
Row id_1 id_2 cosine_similarity
1 1 2 1.0
2 1 3 0.6708203932499369
3 1 4 0.819891591749923
4 1 5 0.521749194749951
5 1 6 0.9128709291752769
6 2 3 0.6708203932499369
7 2 4 0.819891591749923
8 2 5 0.521749194749951
9 2 6 0.9128709291752769
10 3 4 0.3333333333333334
11 3 5 0.8333333333333335
12 3 6 0.8164965809277261
13 4 5 0.5000000000000001
14 4 6 0.8164965809277261
15 5 6 0.8164965809277261
So, try it on your real data and let's see how it works for you :o)
And ... obviously you should pre-create / materialize the pairwise table.
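For example (a sketch; the destination table name is illustrative):
#standardSQL
CREATE TABLE `project.dataset.pairwise` AS
SELECT t1.id AS id_1, t2.id AS id_2
FROM `project.dataset.id_vectors` t1
INNER JOIN `project.dataset.id_vectors` t2
ON t1.id < t2.id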
Another optimization idea is to pre-calculate the values of SQRT(SUM(value1 * value1)) in your project.dataset.id_vectors - this can save quite a lot of CPU. It should be a simple adjustment, so I leave it to you :o)
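For instance, a minimal sketch of that adjustment (the id_vectors_normed table and norm column are illustrative names):
#standardSQL
-- one-time: store each vector together with its pre-computed L2 norm
CREATE TABLE `project.dataset.id_vectors_normed` AS
SELECT
id,
coords,
SQRT((SELECT SUM(c * c) FROM UNNEST(coords) c)) AS norm
FROM `project.dataset.id_vectors`;

-- per pair only the dot product remains; the two square roots are read, not recomputed
SELECT id_1, id_2, (
SELECT SUM(value1 * value2)
FROM UNNEST(a.coords) value1 WITH OFFSET pos1
JOIN UNNEST(b.coords) value2 WITH OFFSET pos2
ON pos1 = pos2
) / (a.norm * b.norm) AS cosine_similarity
FROM pairwise t
JOIN `project.dataset.id_vectors_normed` a ON a.id = id_1
JOIN `project.dataset.id_vectors_normed` b ON b.id = id_2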

Related

BigQuery recursively join based on links between 2 ID columns

Given a table representing a many-many join between IDs like the following:
WITH t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
)
SELECT * FROM t
id_1 id_2
1 a
2 a
2 b
3 b
4 c
5 c
6 d
6 e
7 f
I would like to be able to recursively join and then aggregate rows in order to find each disconnected sub-graph represented by these links - that is, each collection of IDs that are linked together:
The desired output for the example above would look something like this:
id_1_coll | id_2_coll
1, 2, 3 | a, b
4, 5 | c
6 | d, e
7 | f
where each row contains all the other IDs one could reach following the links in the table.
Note that 1 links to b even though there is no explicit link row, because we can follow the path 1 --> a --> 2 --> b using the links in the first 3 rows.
One potential approach is to remodel the relationships between id_1 and id_2 so that we get all the links from id_1 to itself, then use a recursive common table expression to traverse all the possible paths between id_1 values, and finally aggregate (somewhat arbitrarily) to the lowest such value that can be reached from each id_1.
Explanation
Our steps are:
1. Remodel the relationship into a series of self-joins for id_1
2. Map each id_1 to the lowest id_1 that it is linked to via a recursive CTE
3. Aggregate the recursive CTE using the lowest id_1s as the GROUP BY column, grabbing all the linked id_1 and id_2 values via the ARRAY_AGG() function
We can use something like this to remodel the relationships into a self join (1.):
SELECT
a.id_1, a.id_2, b.id_1 AS linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
Next - to set up the recursive table expression (2.) - we can tweak the query above to also give us the lowest (LEAST) of the id_1 values at each link, then use this as the base iteration:
WITH RECURSIVE base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
)
We can also grab the lowest id_1 value at this time:
id_1 linked_id lowest_linked_id
1 2 1
2 1 1
2 3 2
3 2 2
4 5 4
5 4 4
For our recursive loop, we want to maintain an ARRAY of linked ids and join each new iteration such that the id_1 value of the (n+1)th iteration is equal to the linked_id value of the nth iteration AND the nth linked_id value is not in the array of previously linked ids.
We can code this as follows:
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids)
)
Giving us the following results:
|id_1|linked_id|lowest_linked_id|linked_ids|
|----|---------|------------|---|
|3|2|1|[1,2]|
|2|3|1|[1,2,3]|
|4|5|4|[5]|
|1|2|1|[2]|
|5|4|4|[4]|
|2|3|2|[3]|
|2|1|1|[1]|
|3|2|2|[2]|
We can now link these back to the original table for the id_2 values and then aggregate (3.), as shown in the complete query below.
Solution
WITH RECURSIVE t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
base_iter AS (
SELECT
a.id_1, b.id_1 AS linked_id, LEAST(a.id_1, b.id_1) AS lowest_linked_id
FROM t as a
INNER JOIN t as b
ON a.id_2 = b.id_2
WHERE a.id_1 != b.id_1
),
recursive_loop AS (
SELECT id_1, linked_id, lowest_linked_id, [linked_id] AS linked_ids
FROM base_iter
UNION ALL
SELECT
prev_iter.id_1, prev_iter.linked_id,
iter.lowest_linked_id,
ARRAY_CONCAT(iter.linked_ids, [prev_iter.linked_id])
FROM base_iter AS prev_iter
JOIN recursive_loop AS iter
ON iter.id_1 = prev_iter.linked_id
AND iter.lowest_linked_id < prev_iter.lowest_linked_id
AND prev_iter.linked_id NOT IN UNNEST(iter.linked_ids)
),
link_back AS (
SELECT
t.id_1, IFNULL(lowest_linked_id, t.id_1) AS lowest_linked_id, t.id_2
FROM t
LEFT JOIN recursive_loop
ON t.id_1 = recursive_loop.id_1
),
by_id_1 AS (
SELECT
id_1,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
by_id_2 AS (
SELECT
id_2,
MIN(lowest_linked_id) AS grp
FROM link_back
GROUP BY 1
),
result AS (
SELECT
by_id_1.grp,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) AS id1_coll,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) AS id2_coll,
FROM
by_id_1
INNER JOIN by_id_2
ON by_id_1.grp = by_id_2.grp
GROUP BY grp
)
SELECT grp, TO_JSON(id1_coll) AS id1_coll, TO_JSON(id2_coll) AS id2_coll
FROM result ORDER BY grp
Giving us the required output:
grp id1_coll id2_coll
1 [1,2,3] [a,b]
4 [4,5] [c]
6 [6] [d,e]
7 [7] [f]
Limitations/Issues
Unfortunately this approach is inefficient (we have to traverse every single pathway before aggregating it back together) and fails for the real-world case, where we have several million join rows. When trying to execute on this data, BigQuery runs up a huge "Slot time consumed" and eventually errors out with:
Resources exceeded during query execution: Your project or organization exceeded the maximum disk and memory limit available for shuffle operations. Consider provisioning more slots, reducing query concurrency, or using more efficient logic in this job.
I hope there might be a better way of doing the recursive join such that pathways can be merged/aggregated as we go (if we have an id_1 value AND a linked_id already in the list of linked_ids, we don't need to check it further).
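For example, an iterative "min-label propagation" over the link edges - where every node repeatedly takes the lowest group label among itself and its neighbours until nothing changes - converges in roughly as many passes as the graph diameter instead of enumerating every path. A rough BigQuery-scripting sketch (untested at scale; assumes the link table is materialized as project.dataset.t):
DECLARE changed INT64 DEFAULT 1;

-- seed: the undirected edge list and one label per node (its own id)
CREATE TEMP TABLE edges AS
SELECT a.id_1, b.id_1 AS linked_id
FROM `project.dataset.t` a
JOIN `project.dataset.t` b ON a.id_2 = b.id_2 AND a.id_1 != b.id_1;

CREATE TEMP TABLE labels AS
SELECT DISTINCT id_1, id_1 AS grp FROM `project.dataset.t`;

WHILE changed > 0 DO
-- each node takes the minimum label among itself and its neighbours
CREATE OR REPLACE TEMP TABLE labels_next AS
SELECT cur.id_1, LEAST(cur.grp, IFNULL(MIN(nb.grp), cur.grp)) AS grp
FROM labels cur
LEFT JOIN edges e ON e.id_1 = cur.id_1
LEFT JOIN labels nb ON nb.id_1 = e.linked_id
GROUP BY cur.id_1, cur.grp;

SET changed = (SELECT COUNT(*)
FROM labels_next n
JOIN labels o ON o.id_1 = n.id_1 AND o.grp != n.grp);

CREATE OR REPLACE TEMP TABLE labels AS SELECT * FROM labels_next;
END WHILE;
-- labels now maps every id_1 to the lowest id_1 in its connected component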
Using ROW_NUMBER(), the query is as follows:
WITH RECURSIVE
t AS (
SELECT 1 AS id_1, 'a' AS id_2,
UNION ALL SELECT 2, 'a'
UNION ALL SELECT 2, 'b'
UNION ALL SELECT 3, 'b'
UNION ALL SELECT 4, 'c'
UNION ALL SELECT 5, 'c'
UNION ALL SELECT 6, 'd'
UNION ALL SELECT 6, 'e'
UNION ALL SELECT 7, 'f'
),
t1 AS (
SELECT ROW_NUMBER() OVER(ORDER BY t.id_1) n, t.id_1, t.id_2 FROM t
),
t2 AS (
SELECT n, [n] n_arr, [id_1] arr_1, [id_2] arr_2, id_1, id_2 FROM t1
WHERE n IN (SELECT MIN(n) FROM t1 GROUP BY id_1)
UNION ALL
SELECT t2.n, ARRAY_CONCAT(t2.n_arr, [t1.n]),
CASE WHEN t1.id_1 NOT IN UNNEST(t2.arr_1)
THEN ARRAY_CONCAT(t2.arr_1, [t1.id_1])
ELSE t2.arr_1 END,
CASE WHEN t1.id_2 NOT IN UNNEST(t2.arr_2)
THEN ARRAY_CONCAT(t2.arr_2, [t1.id_2])
ELSE t2.arr_2 END,
t1.id_1, t1.id_2
FROM t2 JOIN t1 ON
t2.n < t1.n AND
t1.n NOT IN UNNEST(t2.n_arr) AND
(t2.id_1 = t1.id_1 OR t2.id_2 = t1.id_2) AND
(t1.id_1 NOT IN UNNEST(t2.arr_1) OR t1.id_2 NOT IN UNNEST(t2.arr_2))
),
t3 AS (
SELECT
n,
ARRAY_AGG(DISTINCT id_1 ORDER BY id_1) arr_1,
ARRAY_AGG(DISTINCT id_2 ORDER BY id_2) arr_2
FROM t2
WHERE n IN (SELECT MIN(n) FROM t2 GROUP BY id_1)
GROUP BY n
)
SELECT n, TO_JSON(arr_1), TO_JSON(arr_2) FROM t3 ORDER BY n
t1: append row numbers to the base table.
t2: extract rows matching either id_1 or id_2 via the recursive query.
t3: build arrays from id_1 and id_2 with ARRAY_AGG().
However, it may not help your Limitations/Issues.
The way this question is phrased makes it appear you want "show me distinct groups from a presorted list, unchained to a previous group". For that, something like this should suffice (assuming an auto-incrementing order where one or both ids move to the next value):
SELECT GrpNr,
STRING_AGG(DISTINCT CAST(id_1 as STRING), ',') as id_1_coll,
STRING_AGG(DISTINCT CAST(id_2 as STRING), ',') as id_2_coll
FROM
(
SELECT id_1, id_2,
SUM(CASE WHEN a.id_1 <> a.previous_id_1 and a.id_2 <> a.previous_id_2 THEN 1 ELSE 0 END)
OVER (ORDER BY RowNr) as GrpNr
FROM
(
SELECT *,
ROW_NUMBER() OVER () as RowNr,
LAG(t.id_1, 1) OVER (ORDER BY 1) AS previous_id_1,
LAG(t.id_2, 1) OVER (ORDER BY 1) AS previous_id_2
FROM t
) a
ORDER BY RowNr
) a
GROUP BY GrpNr
ORDER BY GrpNr
I don't think this is the question you mean to ask. This seems to be a graph-walking problem, as referenced in the other answers and in the response from @GordonLinoff to the question here, which I tested (and presume works for BigQuery).
This can also be done using sequential updates as done by @RomanPekar
here (which I also tested). The main consideration seems to be performance. I'd assume DBMSs have gotten better at recursion since this was posted.
Rolling it up in either case should be fairly easy using String_Agg() as given above or as you have.
I'd be curious to see a more accurate representation of the data. If there is some consistency to how the data is stored/limitations to levels of nesting/other group structures there may be a shortcut approach other than recursion or iterative updates.

Comparing Column Values and returning ERROR or OK

I need to verify a source system against a destination system and ensure the values match between them. The problem is that the source system is a total mess and is proving hard to validate.
I've got the following sample data where they should all be OK, but they're showing as ERROR. Does anyone know a way of doing a comparison that would result as an OK for all for the below?
CREATE TABLE #testdata (
ID INT
,ValueSource VARCHAR(800)
,ValueDestination VARCHAR(800)
,Value_Varchar_Check AS (
CASE
WHEN coalesce(ValueSource, '0') = coalesce(ValueDestination, '0')
THEN 'OK'
ELSE 'ERROR'
END
)
)
INSERT INTO #testdata (
ID
,ValueSource
,ValueDestination
)
SELECT 1
,'hepatitis c,other (specify)'
,'hepatitis c, other (specify)'
UNION ALL
SELECT 2
,'lung problems / asthma,lung problems / asthma'
,'lung problems / asthma'
UNION ALL
SELECT 3
,'lung problems / asthma,diabetes'
,'diabetes, lung problems / asthma'
UNION ALL
SELECT 4
,'seizures/epilepsy,hepatitis c,seizures/epilepsy'
,'hepatitis c, seizures/epilepsy'
I don't think you can write this as a computed column, as it is quite a tricky thing to compute. If you are using SQL Server 2016 or later, you can use STRING_SPLIT to convert the ValueSource and ValueDestination values into tables and then sort them alphabetically using a query like this:
SELECT DISTINCT ID, TRIM(value) AS value,
DENSE_RANK() OVER (PARTITION BY ID ORDER BY TRIM(value)) AS rn
FROM testdata
CROSS APPLY STRING_SPLIT(ValueSource, ',')
For ValueSource, this produces:
ID value rn
1 hepatitis c 1
1 other (specify) 2
2 lung problems / asthma 1
3 diabetes 1
3 lung problems / asthma 2
4 hepatitis c 1
4 seizures/epilepsy 2
You can then FULL OUTER JOIN those two tables on ID, value and rn, and detect an error when there are null values from either side (since that implies that the values for a given ID and rn don't match):
WITH t1 AS (
SELECT DISTINCT ID, TRIM(value) AS value,
DENSE_RANK() OVER (PARTITION BY ID ORDER BY TRIM(value)) AS rn
FROM testdata
CROSS APPLY STRING_SPLIT(ValueSource, ',')
),
t2 AS (
SELECT DISTINCT ID, TRIM(value) AS value,
DENSE_RANK() OVER (PARTITION BY ID ORDER BY TRIM(value)) AS rn
FROM testdata
CROSS APPLY STRING_SPLIT(ValueDestination, ',')
)
SELECT COALESCE(t1.ID, t2.ID) AS ID,
CASE WHEN COUNT(CASE WHEN t1.value IS NULL OR t2.value IS NULL THEN 1 END) > 0 THEN 'Error'
ELSE 'OK'
END AS Status
FROM t1
FULL OUTER JOIN t2 ON t2.ID = t1.ID AND t2.rn = t1.rn AND t2.value = t1.value
GROUP BY COALESCE(t1.ID, t2.ID)
Output (for your sample data):
ID Status
1 OK
2 OK
3 OK
4 OK
Demo on SQLFiddle
You can then use the entire query above as a CTE (call it t3) to update your original table:
UPDATE t
SET t.Value_Varchar_Check = t3.Status
FROM testdata t
JOIN t3 ON t.ID = t3.ID
Output:
ID ValueSource ValueDestination Value_Varchar_Check
1 hepatitis c,other (specify) hepatitis c, other (specify) OK
2 lung problems / asthma,lung problems / asthma lung problems / asthma OK
3 lung problems / asthma,diabetes diabetes, lung problems / asthma OK
4 seizures/epilepsy,hepatitis c,seizures/epilepsy hepatitis c, seizures/epilepsy OK
Demo on SQLFiddle
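For reference, a minimal sketch of the assembled statement (T-SQL allows the UPDATE to follow the CTE chain; note this assumes Value_Varchar_Check is an ordinary column - the computed column in the sample DDL cannot be the target of an UPDATE):
WITH t1 AS (
SELECT DISTINCT ID, TRIM(value) AS value,
DENSE_RANK() OVER (PARTITION BY ID ORDER BY TRIM(value)) AS rn
FROM testdata
CROSS APPLY STRING_SPLIT(ValueSource, ',')
),
t2 AS (
SELECT DISTINCT ID, TRIM(value) AS value,
DENSE_RANK() OVER (PARTITION BY ID ORDER BY TRIM(value)) AS rn
FROM testdata
CROSS APPLY STRING_SPLIT(ValueDestination, ',')
),
t3 AS (
SELECT COALESCE(t1.ID, t2.ID) AS ID,
CASE WHEN COUNT(CASE WHEN t1.value IS NULL OR t2.value IS NULL THEN 1 END) > 0 THEN 'ERROR'
ELSE 'OK'
END AS Status
FROM t1
FULL OUTER JOIN t2 ON t2.ID = t1.ID AND t2.rn = t1.rn AND t2.value = t1.value
GROUP BY COALESCE(t1.ID, t2.ID)
)
UPDATE t
SET t.Value_Varchar_Check = t3.Status
FROM testdata t
JOIN t3 ON t.ID = t3.ID;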

SQL join where value in second table is first lower value w.r.t the first table

Let's say I have 2 tables, and both of them have a column that contains timestamps for various events. The timestamp values in the two tables are different, as they are for different events.
I want to join the two tables such that every record in Table1 is joined with the first lower timestamp in Table2.
For e.g.
Table1 Table2
142.13 141.16
157.34 145.45
168.45 155.85
170.23 166.76
       168.44
Joined Table should be:
142.13,141.16
157.34,155.85
168.45,166.76
170.23,168.44
I am using Apache Spark SQL.
I am a noob at SQL, and this doesn't look like a job for a noob :). Thanks.
Try this:
with t1 as (
select 142.13 v from dual union all
select 157.34 v from dual union all
select 168.45 v from dual union all
select 170.23 v from dual
),
t2 as (
select 141.16 v from dual union all
select 145.45 v from dual union all
select 155.85 v from dual union all
select 166.76 v from dual union all
select 168.44 v from dual
)
select v, ( select max(v) from t2 where t2.v <= t1.v )
from t1;
V (SELECTMAX(V)FROMT2WHERET2.V<=T1.V)
---------- -----------------------------------
142.13 141.16
157.34 155.85
168.45 168.44
170.23 168.44
4 rows selected.
the WITH clause is just me faking the data ...
the simplified query is just:
select t1.v, ( select max(t2.v) from table2 t2 where t2.v <= t1.v ) from table1 t1
[edit]
admittedly, I'm not familiar with Spark .. but this is simple enough SQL .. I'm assuming it works :)
[/edit]
Ditto has shown the straightforward way to solve this. If Apache Spark really has problems with this very basic query, join first (which can lead to a big intermediate result) and aggregate afterwards:
select t1.v, max(t2.v)
from table1 t1
join table2 t2 on t2.v <= t1.v
group by t1.v
order by t1.v;
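If even that join is too heavy, a window function over the union of the two tables avoids the non-equi join entirely - a sketch in Spark SQL, assuming single-column tables table1(v) and table2(v):
WITH combined AS (
SELECT v, 1 AS is_t1 FROM table1
UNION ALL
SELECT v, 0 AS is_t1 FROM table2
)
SELECT v, prev_t2
FROM (
SELECT v, is_t1,
-- greatest table2 value at or before this point in the merged ordering
MAX(CASE WHEN is_t1 = 0 THEN v END) OVER (
ORDER BY v, is_t1
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
) AS prev_t2
FROM combined
) x
WHERE is_t1 = 1
This sorts once and scans, rather than comparing every pair of rows.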
If you are using Apache Spark SQL, you can join these two tables as DataFrames by adding a column using monotonically_increasing_id():
val t1 = spark.sparkContext.parallelize(Seq(142.13, 157.34, 168.45, 170.23)).toDF("c1")
val t2 = spark.sparkContext.parallelize(Seq(141.16,145.45,155.85,166.76,168.44)).toDF("c2")
val t11 = t1.withColumn("id", monotonically_increasing_id())
val t22 = t2.withColumn("id", monotonically_increasing_id())
val res = t11.join(t22, t11("id") + 1 === t22("id") ).drop("id")
Output:
+------+------+
| c1| c2|
+------+------+
|142.13|145.45|
|168.45|166.76|
|157.34|155.85|
|170.23|168.44|
+------+------+
Hope this helps

When IDs are identical, check that the Ordinal is greater than the previous submission

Example query
USE HES
SELECT T1.ID, T2.DATE, T1.ORDINAL
FROM TABLE1 AS T1
LEFT JOIN TABLE2 AS T2
ON T1.ID = T2.ID AND T1.PARTYEAR = T2.PARTYEAR
WHERE
T1.MONTHYEAR = '201501'
Results from example query
ID Date Ordinal
1 01/01/2016 1
1 02/01/2016 2
1 03/01/2016 3
2 04/01/2016 1
2 05/01/2016 2
3 06/01/2016 1
3 07/01/2016 2
3 08/01/2016 3
4 09/01/2016 1
4 10/01/2016 1
Question
Each user has a unique ID. For each ID, how would I check that each data submission contains an Ordinal that is greater than the one previously submitted?
So, in the example query results above, ID 4 contains an issue.
I'm fairly new to SQL, I've been searching for similar examples but with no success.
Any help would be greatly appreciated.
Use LAG with an OVER clause:
WITH cte AS
(
SELECT T1.ID, T2.DATE, T1.ORDINAL, LAG(T1.ORDINAL) OVER(PARTITION BY T1.ID ORDER BY T1.ORDINAL) AS LagOrdinal
FROM TABLE1 AS T1
LEFT JOIN TABLE2 AS T2
ON T1.ID = T2.ID AND T1.PARTYEAR = T2.PARTYEAR
WHERE
T1.MONTHYEAR = '201501'
)
SELECT ID, DATE, ORDINAL, CASE WHEN ORDINAL > LagOrdinal THEN 1 ELSE 0 END AS OrdinalIsGreater
FROM cte;
Try this one:
SELECT * INTO #tmp
FROM (VALUES
(1, CONVERT(date, '01/01/2016'), 1),
(1, '02/01/2016', 2),
(1, '03/01/2016', 3),
(2, '04/01/2016', 1),
(2, '05/01/2016', 2),
(3, '06/01/2016', 1),
(3, '07/01/2016', 2),
(3, '08/01/2016', 3),
(4, '09/01/2016', 1),
(4, '10/01/2016', 1)
)T(ID, Date, Ordinal)
WITH Numbered AS
(
SELECT ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Date) R, *
FROM #tmp
)
SELECT N2.ID, N2.Date, N1.Ordinal Prev, N2.Ordinal Curr
FROM Numbered N1
JOIN Numbered N2 ON N1.R+1=N2.R AND N1.ID=N2.ID
WHERE N1.Ordinal >= N2.Ordinal
This can be simplified when the SQL Server version is >= 2012; #tmp stands in for your current result.
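For instance, on 2012+ LAG removes the self-join entirely - a quick sketch against the same #tmp:
SELECT ID, Date, Ordinal
FROM (
SELECT *, LAG(Ordinal) OVER (PARTITION BY ID ORDER BY Date) AS PrevOrdinal
FROM #tmp
) x
WHERE PrevOrdinal >= Ordinal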
Like @Serg said, you can achieve this using LAG:
select *
from (
SELECT T1.ID, T2.DATE, T1.ORDINAL,
lag(t1.ordinal) over (partition by t1.id order by t2.date) as prevOrdinal
FROM TABLE1 AS T1
LEFT JOIN TABLE2 AS T2
ON T1.ID = T2.ID AND T1.PARTYEAR = T2.PARTYEAR
WHERE
T1.MONTHYEAR = '201501') as t
where t.prevOrdinal >= t.ordinal;
OUTPUT
ID DATE ORDINAL prevOrdinal
4 2016-10-01 1 1

Max sum for the continuous N rows

I have the following table (both A and B are integers):
Update 1 - Could anyone do me a favour and run the solution on a set of 1M records, with B being a random decimal (to avoid overflows) in the [0, 1] range, for N = 10, 100 and 1000? I'd like to get a flavour of the time required to run the solution query. Thanks a lot in advance.
Sample data:
A B
1 1
2 8
3 1
4 11
5 1
6 1
7 6
8 1
9 1
10 2
How do I get the maximum Sum of B values for any N sequential A's? The solution mustn't use cursors; usage of table variables/temp tables has to be strongly justified.
I can use SQLCLR in case it will give a distinct performance boost.
Some clarifications:
Max Sum for 1 element is 11 (see A = 4)
Max Sum for 2 elements is 12 (it's either A => 3 & 4 or A => 4 & 5),
Max Sum for 3 elements is 20 (A=>2, 3, 4),
Max Sum for 4 is 21 (A=>1,2,3,4 or A=>2,3,4,5) etc.
Since the A values are guaranteed to be consecutive integers, given N we know for any particular A which values we are interested in. So
SELECT
A,
(SELECT SUM(B) FROM Table T2 WHERE T.A <= T2.A AND T2.A <= T.A + N - 1)
AS SumOfBs
FROM Table T
WHERE A + N - 1 <= (SELECT COUNT(*) FROM Table)
gives, for each A, the sum of the B values for the N rows starting there. The WHERE restricts us to rows that do actually have N rows starting there. Put this in a subquery and we can get the maximum:
SELECT
MAX(SumOfBs) AS DesiredValue
FROM
(
SELECT
A,
(SELECT SUM(B) FROM Table T2 WHERE T.A <= T2.A AND T2.A <= T.A + N - 1)
AS SumOfBs
FROM Table T
WHERE A + N - 1 <= (SELECT COUNT(*) FROM Table)
) Intermediate
should do the job.
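On SQL Server 2012 and later, a sliding window frame expresses the same idea without a correlated subquery - a sketch for N = 3 (frame bounds must be literals, so N - 1 is hard-coded; the table name MyTable is illustrative):
SELECT MAX(SumOfBs) AS DesiredValue
FROM (
SELECT
SUM(B) OVER (ORDER BY A ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS SumOfBs,
ROW_NUMBER() OVER (ORDER BY A) AS rn
FROM MyTable
) x
WHERE rn >= 3 -- discard the first N - 1 windows, which span fewer than N rows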
I've loaded your test data into a table called data.
The following SQL gives me the answer 20 for N=3:
declare @N int
set @N = 3
select max(SumB)
from data d
cross apply (select SumB = SUM(B) from data sub where sub.A between d.A - (@N-1) and d.A) x
Try:
with cte as
(select 1 window_count union all
select window_count+1 window_count from cte where window_count<@N)
select max(sum_B) from
(select T1.A,
sum(T2.B) sum_B
from MyTable T1
cross join cte
join MyTable T2 on T1.A = T2.A + cte.window_count - 1
group by T1.A) sq
I'm possibly not understanding the question fully, but it looks to me like...
SELECT SUM(B) FROM table WHERE A <= n
If not correct, can you explain a bit more?