find a line based on string pattern and move it X places - awk

I want to move the first instance of the AND F.col_x IS NOT NULL pattern (for each of the blocks 22, 23, 24, 25, etc.) that follows the WHERE F.col_x = D.col_x pattern in either of these ways:
--look for the closing bracket just before GROUP BY and add it in there, or
--alternately, move that line one line below from where it was taken.
Either way the results would be the same.
INPUT
*22
Select ((MYLILFUNC(F.col_x,-99999))) AS WIDTH,
COUNT(*) AS SIZE
FROM MYDB.BGSQLTB F where NOT EXISTS ( sel '1' from MYDB.col_x D
WHERE F.col_x = D.col_x
AND F.col_x IS NOT NULL
AND D.col_x IS NOT NULL )
GROUP BY F.col_x;
*23
Select ((MYLILFUNC(F.COL_y,-99999))) AS WIDTH,
COUNT(*) AS SIZE
FROM MYDB.BGSQLTB F where NOT EXISTS ( sel '1' from MYDB.DIM_DRG_CODE D
WHERE F.COL_y = D.COL_y
AND F.COL_y IS NOT NULL
AND D.COL_y IS NOT NULL )
GROUP BY F.COL_y;
*24
Select ((MYLILFUNC(F.COL_Z,-99999))) AS WIDTH,
COUNT(*) AS SIZE
FROM MYDB.BGSQLTB F where NOT EXISTS ( sel '1' from MYDB.COL_Z D
WHERE F.COL_Z = D.COL_Z
AND F.COL_Z IS NOT NULL
AND D.COL_Z IS NOT NULL )
GROUP BY F.COL_Z;
*25
Select ((MYLILFUNC(F.COL_XXX,-99999))) AS WIDTH,
COUNT(*) AS SIZE
FROM MYDB.BGSQLTB F where NOT EXISTS ( sel '1' from MYDB.COL_XX D
WHERE F.COL_XXX = D.COL_XXX
AND F.COL_XXX IS NOT NULL
AND D.COL_XXX IS NOT NULL )
GROUP BY F.COL_XXX;
OUTPUT
*22
Select ((MYLILFUNC(F.col_x,-99999))) AS WIDTH,
COUNT(*) AS SIZE
FROM MYDB.BGSQLTB F where NOT EXISTS ( sel '1' from MYDB.col_x D
WHERE F.col_x = D.col_x
AND D.col_x IS NOT NULL ) AND F.col_x IS NOT NULL
GROUP BY F.col_x;
*23
Select ((MYLILFUNC(F.COL_y,-99999))) AS WIDTH,
COUNT(*) AS SIZE
FROM MYDB.BGSQLTB F where NOT EXISTS ( sel '1' from MYDB.DIM_DRG_CODE D
WHERE F.COL_y = D.COL_y
AND D.COL_y IS NOT NULL ) AND F.COL_y IS NOT NULL
GROUP BY F.COL_y;
*24
Select ((MYLILFUNC(F.COL_Z,-99999))) AS WIDTH,
COUNT(*) AS SIZE
FROM MYDB.BGSQLTB F where NOT EXISTS ( sel '1' from MYDB.COL_Z D
WHERE F.COL_Z = D.COL_Z
AND D.COL_Z IS NOT NULL ) AND F.COL_Z IS NOT NULL
GROUP BY F.COL_Z;
*25
Select ((MYLILFUNC(F.COL_XXX,-99999))) AS WIDTH,
COUNT(*) AS SIZE
FROM MYDB.BGSQLTB F where NOT EXISTS ( sel '1' from MYDB.COL_XX D
WHERE F.COL_XXX = D.COL_XXX
AND D.COL_XXX IS NOT NULL ) AND F.COL_XXX IS NOT NULL
GROUP BY F.COL_XXX;
My search pattern using ed is a bit too wide and matches more lines than intended, and I am not sure how to express the moving logic, because the move is relative to each matched line.

You can do this in a couple of ways. With sed you can do
sed -e '/AND F\.[a-zA-Z_]* *IS *NOT *NULL/ { h; d }; /GROUP BY/ { H; x }'
What happens is that anytime the first regular expression matches, the { h; d } commands store the line in the hold buffer and move to the next line without printing anything. Whenever the second regexp matches, the { H; x } commands append the current line to the hold buffer with a newline in between and then swap the hold buffer and the pattern buffer, so sed's automatic print outputs the saved line followed by the GROUP BY line. It's easy for this not to work correctly depending on your input, but it works fine for the sample you provided.
In awk it would be:
awk 'tolower($0) ~ /and f\.col_[a-z]* is not null/ {save = $0; next} /GROUP BY/ { print save } {print}'
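If you want the first variant from your question (folding the clause onto the line with the closing bracket, as in your OUTPUT), here is a minimal awk sketch. It assumes each block is shaped like your sample, i.e. the held clause is followed by a line ending in a closing bracket before the next GROUP BY; input.sql is a placeholder for your file:
awk '
  tolower($0) ~ /^ *and f\.[a-z_]+ is not null *$/ { saved = $0; next }  # hold the F.<col> IS NOT NULL line
  saved != "" && /\) *$/ { print $0, saved; saved = ""; next }           # append it to the line that closes the NOT EXISTS (...)
  { print }
' input.sql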

Related

Avoid handling invalid codepoint in Bigquery

I'm using the following SQL code in order to decode Unicode-escaped characters in BigQuery:
#standardSQL
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
(SELECT CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
FROM UNNEST(SPLIT(s, '\\u')) AS x
WHERE x != ''
)
);
WITH normal AS (
select '\\u05DE\\u05EA\\u05DE\\u05D8\\u05D9\\u05E7\\u05D4123' as edited
), uchars AS (
SELECT DISTINCT
c,
DecodeUnicode(c) uchar
FROM normal,
UNNEST(REGEXP_EXTRACT_ALL(edited, r'(\\u[ABCDEF0-9]{4,8})')) c
)
SELECT
edited,
STRING_AGG(IFNULL(uchar, x), '' ORDER BY pos) decoded
FROM(
SELECT
edited,
pos,
SUBSTR(edited,
SUM(CASE char WHEN '' THEN 1 ELSE 6 END)
OVER(PARTITION BY edited ORDER BY pos) - CASE char WHEN '' THEN 0 ELSE 5 END,
CASE char WHEN '' THEN 1 ELSE 6 END) x,
uchar
FROM normal ,
UNNEST(REGEXP_EXTRACT_ALL(edited, r'(\\u[ABCDEF0-9]{4,8})|.')) char WITH
OFFSET AS pos LEFT JOIN uchars u ON u.c = char
)
GROUP BY edited
The problem is that some of the values I'm handling are not valid inputs for the function above ('DecodeUnicode')
- for example, the part 'u05D4123' does not map to a valid code point.
What can I change in my code so that the function will not handle such values, and therefore I will not get the 'Invalid codepoint' error that I get now?
One option is to use SAFE.CODE_POINTS_TO_STRING instead of CODE_POINTS_TO_STRING, but then you will still need to eliminate the unhandled escapes from the result - for example using a regexp, as in the example below:
#standardSQL
CREATE TEMP FUNCTION DecodeUnicode(s STRING) AS (
(SELECT SAFE.CODE_POINTS_TO_STRING(ARRAY_AGG(CAST(CONCAT('0x', x) AS INT64)))
FROM UNNEST(SPLIT(s, '\\u')) AS x
WHERE x != ''
)
);
WITH normal AS (
SELECT '\\u05DE\\u05EA\\u05DE\\u05D8\\u05D9\\u05E7\\u05D4123' AS edited
), uchars AS (
SELECT DISTINCT
c,
DecodeUnicode(c) uchar
FROM normal,
UNNEST(REGEXP_EXTRACT_ALL(edited, r'(\\u[ABCDEF0-9]{4,8})')) c
)
SELECT
edited,
STRING_AGG(IFNULL(uchar, x), '' ORDER BY pos) decoded,
REGEXP_REPLACE(STRING_AGG(IFNULL(uchar, x), '' ORDER BY pos) ,r'\\u[ABCDEF0-9]{4,8}', '') decoded_and_fixed
FROM(
SELECT
edited,
pos,
SUBSTR(edited,
SUM(CASE char WHEN '' THEN 1 ELSE 6 END)
OVER(PARTITION BY edited ORDER BY pos) - CASE char WHEN '' THEN 0 ELSE 5 END,
CASE char WHEN '' THEN 1 ELSE 6 END) x,
uchar
FROM normal ,
UNNEST(REGEXP_EXTRACT_ALL(edited, r'(\\u[ABCDEF0-9]{4,8})|.')) char WITH
OFFSET AS pos LEFT JOIN uchars u ON u.c = char
)
GROUP BY edited
with the result:
Row edited decoded decoded_and_fixed
1 \u05DE\u05EA\u05DE\u05D8\u05D9\u05E7\u05D4123 מתמטיק\u05D4 מתמטיק
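For reference, the difference the SAFE. prefix makes can be seen in isolation with a minimal query (the out-of-range code point below is just an illustrative value):
#standardSQL
SELECT
CODE_POINTS_TO_STRING([1502]) AS valid_char,             -- 1502 is U+05DE ('מ')
SAFE.CODE_POINTS_TO_STRING([97993000]) AS out_of_range   -- returns NULL instead of raising an 'Invalid codepoint' error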

Add rows based on a column value

I am having issues with creating additional rows based on a column value.
If my PageCount = 3, then I would need to have 2 additional rows where the PONo is repeated but the ImagePath is incremented by 1 for each new row.
I am able to get the first row, but creating the additional rows with the ImagePath incremented by 1 is where I am stuck.
My result and expected result were shown as screenshots (not reproduced here).
Current Select statement:
SELECT PO, CASE WHEN LEFT(u.Path,3)= 'M:\' THEN '\\ServerName\'+RIGHT(u.Path,LEN(u.Path)-3) ELSE u.Path END AS [Imagepath],PAGECOUNT
FROM OPENQUERY([LinkedServer],'select * from data.vw_purchasing_docs_unc') AS u INNER JOIN
OPENQUERY([LinkedServer],'select * from data.purchasing_docs') AS d ON u.docid=d.docid
WHERE (CONVERT(VARCHAR(10),d.STATUS_DATE,120)=CONVERT(VARCHAR(10),GETDATE(),120))
batch file:
bcp "select d.DOCID,DOC_TYPE,PO,d.STATUS, CASE WHEN LEFT(Path,3)= 'M:\' THEN '\\ServerName'+RIGHT(DWPath,LEN(Path)-3) ELSE Path END AS ImagePath, STATUS_DATE,'No' AS dwimport from openquery([LinkedServer],'select * from data.vw_purchasing_docs_unc') as u INNER JOIN openquery([LinkedServer],'select * from dwdata.purchasing_docs') as d ON u.docid=d.docid WHERE (CONVERT(varchar(10),STATUS_DATE,120)=CONVERT(varchar(10),GETDATE(),120)) AND d.STATUS IN ('FILED - Processing Complete','FILED - Partial Payment','FILED - Confirming') AND DOC_TYPE IN ('CO = Change Order','Purchase Order','CP = Capital Projects','Change Order','PO = Purchase Order','PO','PR = General Operating')" queryout "E:\Data\PO Trigger CSV\PO_Trigger_Doc.csv" -r \n -T -c -t"," -Umv -Smtvwrtst -Pm -q -k
Select ponumber,b.rplc,pagecount
from table t
cross apply
(select replace(imagepath,'f'+cast((n-1) as varchar(100)),'f0') as rplc from numbers n where n<=t.pagecount)b
To create the numbers table (if you are wondering why you need it, look here):
CREATE TABLE Number (N INT IDENTITY(1,1) PRIMARY KEY NOT NULL);
GO
INSERT INTO Number DEFAULT VALUES;
GO 10000
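(GO 10000 relies on the batch-separator repeat count supported by SSMS and sqlcmd: it runs the preceding INSERT batch 10000 times, filling the table with the numbers 1 to 10000.)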
Using your select statement after update:
;With cte(ponumber,imagepath,pagecount)
as
(
SELECT PO, CASE WHEN LEFT(u.Path,3)= 'M:\' THEN '\\ServerName\'+RIGHT(u.Path,LEN(u.Path)-3) ELSE u.Path END AS [Imagepath],PAGECOUNT
FROM OPENQUERY([LinkedServer],'select * from data.vw_purchasing_docs_unc') AS u INNER JOIN
OPENQUERY([LinkedServer],'select * from data.purchasing_docs') AS d ON u.docid=d.docid
WHERE (CONVERT(VARCHAR(10),d.STATUS_DATE,120)=CONVERT(VARCHAR(10),GETDATE(),120))
)
select Ponumber,b.rplc,pagecount from cte c
cross apply
(select replace(imagepath,'f'+cast((n-1) as varchar(100)),'f0') as rplc from numbers n where n<=c.pagecount)b
If you would like to avoid the additional table, you can use a CTE:
WITH Images AS
(
SELECT * FROM (VALUES
('C:\Folder', 2),
('D:\Folder', 3)) T(ImagePath, Val)
), Numbers AS
(
SELECT * FROM (VALUES (1),(2),(3),(4)) T(N)
UNION ALL
SELECT N1.N*4+T.N N FROM (VALUES(1),(2),(3),(4)) T(N) CROSS JOIN Numbers N1
WHERE N1.N*4+T.N<=100
)
SELECT ImagePath + '\f' + CONVERT(nvarchar(10) ,ROW_NUMBER() OVER (PARTITION BY ImagePath ORDER BY (SELECT 1))) NewPath
FROM Images
CROSS APPLY (SELECT TOP(Val) * FROM Numbers) T(N)
Images is your source table. It can be anything, e.g. the result of your OPENQUERY. It produces:
NewPath
-------
C:\Folder\f1
C:\Folder\f2
D:\Folder\f1
D:\Folder\f2
D:\Folder\f3
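Stripped to its core, the pattern all the variants above use is: join the source rows to a numbers table so that each row is repeated once per page. A sketch with illustrative table and column names (SourceTable, PONo):
SELECT s.PONo,
       s.ImagePath + '\f' + CONVERT(varchar(10), n.N) AS ImagePath,  -- one path per page number
       s.PageCount
FROM SourceTable AS s
JOIN Numbers AS n
  ON n.N <= s.PageCount;  -- emits PageCount rows per source row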

PIG query to pivot rows and columns with count of like rows

Trying to create SQL or Pig queries that will yield a count of distinct values based on type.
In other words, given this table:
Type: Value:
A x
B y
C y
B y
C z
A x
A z
A z
A x
B x
B z
B x
C x
I want to get the following results:
Type: x: y: z:
A 3 0 2
B 2 2 1
C 1 1 1
Additionally, a table of averages as a result would be helpful too
Type: x: y: z:
A 0.60 0.00 0.40
B 0.40 0.40 0.20
C 0.33 0.33 0.33
EDIT 4
I am a newbie at Pig, but after reading 8 different Stack Overflow posts I came up with this.
When I use this Pig query:
A = LOAD 'tablex' USING org.apache.hcatalog.pig.HCatLoader();
x = foreach A GENERATE id_orig_h;
xx = distinct x;
y = foreach A GENERATE id_resp_h;
yy = distinct y;
yyy = group yy all;
zz = GROUP A BY (id_orig_h, id_resp_h);
B = CROSS xx, yy;
C = foreach B generate xx::id_orig_h as id_orig_h, yy::id_resp_h as id_resp_h;
D = foreach zz GENERATE flatten (group) as (id_orig_h, id_resp_h), COUNT(A) as count;
E = JOIN C by (id_orig_h, id_resp_h) LEFT OUTER, D BY (id_orig_h, id_resp_h);
F = foreach E generate C::id_orig_h as id_orig_h, C::id_resp_h as id_resp_h, D::count as count;
G = foreach yyy generate 0 as id:chararray, flatten(BagToTuple(yy));
H = group F by id_orig_h;
I = foreach H generate group as id_orig_h, flatten(BagToTuple(F.count)) as count;
dump G;
dump I;
it sort of works - I get this:
(0,x,y,z)
(A,3,0,2)
(B,2,2,1)
(C,1,1,1)
I can import this into a text file, strip the "(" and ")", and use it as a CSV with the schema as the first line. So this sort of works, but it is SO SLOW. I would like a nicer, faster, cleaner way of doing this. If anyone out there knows of a way, please let me know.
The best I can think of works only with Oracle, and although it wouldn't provide you with a column for each value, it would present the data like this:
A x=3,y=3,z=3
B x=4,y=3
C y=3,z=2
Of course, if you have 900 values it would show:
A x=3,y=6,...,ff=12
etc...
I'm not able to add a comment, so I can't ask you whether Oracle is OK. Anyway, here's the query that would achieve that:
SELECT type, vals FROM
(SELECT type, SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','),2) vals, seq,
MAX(seq) OVER (partition by type) max
FROM
(SELECT type, value, OCC, ROW_NUMBER () OVER (partition by type ORDER BY type, value) seq
FROM
(SELECT type, value, COUNT(*) OCC
FROM tableName
GROUP BY type, value))
START WITH seq=1
CONNECT by PRIOR
seq+1=seq
AND PRIOR
type=type)
WHERE seq = max;
For the averages you need to add the total information before all the rest; here's the code:
SELECT * FROM
(SELECT type,
SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || OCC, ','),2) vals,
SUBSTR(SYS_CONNECT_BY_PATH(value || '=' || (OCC / TOT), ','),2) average,
seq, MAX(seq) OVER (partition by type) max
FROM
(SELECT type, value, TOT, OCC, ROW_NUMBER () OVER (partition by type ORDER BY type, value) seq
FROM
(
SELECT type, value, TOT, COUNT(*) OCC
FROM (SELECT type, value, COUNT(*) OVER (partition by type) TOT
FROM tableName)
GROUP BY type, value, TOT
))
START WITH seq=1
CONNECT by PRIOR
seq+1=seq
AND PRIOR
type=type)
WHERE seq = max;
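If the set of distinct values is small and known in advance, the pivot itself can also be written as plain conditional aggregation, which works in most SQL dialects. A sketch assuming the x/y/z values and the tableName used above:
SELECT type,
       SUM(CASE WHEN value = 'x' THEN 1 ELSE 0 END) AS x,
       SUM(CASE WHEN value = 'y' THEN 1 ELSE 0 END) AS y,
       SUM(CASE WHEN value = 'z' THEN 1 ELSE 0 END) AS z
FROM tableName
GROUP BY type;
Dividing each SUM by COUNT(*) gives the averages table in the same shape.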
You can do this using the vector operation UDFs in Brickhouse (http://github.com/klout/brickhouse). Consider that each 'value' is a dimension in a very high-dimensional space. You can interpret a single value instance as a vector in that dimension, with value 1. In Hive, we would represent such a vector simply as a map with a string as the key and an int or other numeric as the value.
What you want to create is a vector which is the sum of all the vectors, grouped by type. The query would be:
SELECT type,
union_vector_sum( map( value, 1 ) ) as vector
FROM table
GROUP BY type;
Brickhouse even has a normalize function, which will produce your 'averages'
SELECT type,
vector_normalize(union_vector_sum( map( value, 1 ) ))
as normalized_vector
FROM table
GROUP BY type;
Updated code according to Edit#3 in question:
A = load '/path/to/input/file' using AvroStorage();
B = group A by (type, value);
C = foreach B generate flatten(group) as (type, value), COUNT(A) as count;
-- Now get all the values.
M = foreach A generate value;
-- Left Outer Join all the values with C, so that every type has exactly same number of values associated
N = join M by value left outer, C by value;
O = foreach N generate
C::type as type,
M::value as value,
(C::count is null ? 0 : C::count) as count; --count = 0 means the value was not associated with the type
P = group O by type;
Q = foreach P {
R = order O by value asc; --Ordered by value, so values counts are ordered consistently in all the rows.
generate group as type, flatten(R.count);
}
Please note that I did not execute the code above. These are just the representational steps.

#1222 - The used SELECT statements have a different number of columns

Why am I getting a #1222 - The used SELECT statements have a different number of columns error? I am trying to load wall posts from this user's friends and from the user himself.
SELECT u.id AS pid, b2.id AS id, b2.message AS message, b2.date AS date FROM
(
(
SELECT b.id AS id, b.pid AS pid, b.message AS message, b.date AS date FROM
wall_posts AS b
JOIN Friends AS f ON f.id = b.pid
WHERE f.buddy_id = '1' AND f.status = 'b'
ORDER BY date DESC
LIMIT 0, 10
)
UNION
(
SELECT * FROM
wall_posts
WHERE pid = '1'
ORDER BY date DESC
LIMIT 0, 10
)
ORDER BY date DESC
LIMIT 0, 10
) AS b2
JOIN Users AS u
ON b2.pid = u.id
WHERE u.banned='0' AND u.email_activated='1'
ORDER BY date DESC
LIMIT 0, 10
The wall_posts table structure looks like id date privacy pid uid message
The Friends table structure looks like Fid id buddy_id invite_up_date status
pid stands for profile id. I am not really sure what's going on.
The first statement in the UNION returns four columns:
SELECT b.id AS id,
b.pid AS pid,
b.message AS message,
b.date AS date
FROM wall_posts AS b
The second one returns six, because the * expands to include all the columns from WALL_POSTS:
SELECT b.id,
b.date,
b.privacy,
b.pid,
b.uid,
b.message
FROM wall_posts AS b
The UNION and UNION ALL operators require that:
The same number of columns exist in all the statements that make up the UNION'd query
The data types have to match at each position/column
Use:
FROM ((SELECT b.id AS id,
b.pid AS pid,
b.message AS message,
b.date AS date
FROM wall_posts AS b
JOIN Friends AS f ON f.id = b.pid
WHERE f.buddy_id = '1' AND f.status = 'b'
ORDER BY date DESC
LIMIT 0, 10)
UNION
(SELECT id,
pid,
message,
date
FROM wall_posts
WHERE pid = '1'
ORDER BY date DESC
LIMIT 0, 10))
You're taking the UNION of a 4-column relation (id, pid, message, and date) with a 6-column relation (* = the 6 columns of wall_posts). SQL doesn't let you do that.
(
SELECT b.id AS id, b.pid AS pid, b.message AS message, b.date AS date FROM
wall_posts AS b
JOIN Friends AS f ON f.id = b.pid
WHERE f.buddy_id = '1' AND f.status = 'b'
ORDER BY date DESC
LIMIT 0, 10
)
UNION
(
SELECT id, pid , message , date
FROM
wall_posts
WHERE pid = '1'
ORDER BY date DESC
LIMIT 0, 10
)
You were selecting 4 in the first query and 6 in the second, so match them up.
Besides the answer given by @omg-ponies, I just want to add that this error can also occur in variable assignment. In my case I used an insert; associated with that insert was a trigger, and I mistakenly assigned a different number of fields than variables. Below are my case details.
INSERT INTO tab1 (event, eventTypeID, fromDate, toDate, remarks)
-> SELECT event, eventTypeID,
-> fromDate, toDate, remarks FROM rrp group by trainingCode;
ERROR 1222 (21000): The used SELECT statements have a different number of columns
So you see, I got this error by issuing an insert statement instead of a union statement. My case differences were:
I issued a bulk insert SQL statement,
i.e. insert into tab1 (field, ...) select field, ... from tab2
tab2 had an on-insert trigger; this trigger basically declines duplicates.
It turns out that I had an error in the trigger: I fetched records based on the new input data and assigned them to an incorrect number of variables.
DELIMITER ##
DROP TRIGGER trgInsertTrigger ##
CREATE TRIGGER trgInsertTrigger
BEFORE INSERT ON training
FOR EACH ROW
BEGIN
SET @recs = 0;
SET @trgID = 0;
SET @trgDescID = 0;
SET @trgDesc = '';
SET @district = '';
SET @msg = '';
SELECT COUNT(*), t.trainingID, td.trgDescID, td.trgDescName, t.trgDistrictID
INTO @recs, @trgID, @trgDescID, @proj, @trgDesc, @district
from training as t
left join trainingDistrict as tdist on t.trainingID = tdist.trainingID
left join trgDesc as td on t.trgDescID = td.trgDescID
WHERE
t.trgDescID = NEW.trgDescID
AND t.venue = NEW.venue
AND t.fromDate = NEW.fromDate
AND t.toDate = NEW.toDate
AND t.gender = NEW.gender
AND t.totalParticipants = NEW.totalParticipants
AND t.districtIDs = NEW.districtIDs;
IF @recs > 0 THEN
SET @msg = CONCAT('Error: Duplicate Training: previous ID ', CAST(@trgID AS CHAR CHARACTER SET utf8) COLLATE utf8_bin);
SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = @msg;
END IF;
END ##
DELIMITER ;
As you can see, I am fetching 5 fields but assigning them to 6 variables. (Totally my fault - I forgot to delete the variable after editing.)
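The fix is simply to make the two lists the same length - five selected expressions into five variables. A sketch of the corrected assignment inside the trigger body, with the leftover @proj dropped:
SELECT COUNT(*), t.trainingID, td.trgDescID, td.trgDescName, t.trgDistrictID
INTO @recs, @trgID, @trgDescID, @trgDesc, @district   -- 5 expressions, 5 variables
from training as t
left join trainingDistrict as tdist on t.trainingID = tdist.trainingID
left join trgDesc as td on t.trgDescID = td.trgDescID
WHERE t.trgDescID = NEW.trgDescID
AND t.venue = NEW.venue
AND t.fromDate = NEW.fromDate
AND t.toDate = NEW.toDate
AND t.gender = NEW.gender
AND t.totalParticipants = NEW.totalParticipants
AND t.districtIDs = NEW.districtIDs;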
You are using MySQL Union.
UNION is used to combine the result from multiple SELECT statements into a single result set.
The column names from the first SELECT statement are used as the column names for the results returned. Selected columns listed in corresponding positions of each SELECT statement should have the same data type. (For example, the first column selected by the first statement should have the same type as the first column selected by the other statements.)
Reference: MySQL Union
Your first select statement has 4 columns and the second has 6, since, as you said, wall_posts has 6 columns.
You should have the same number of columns, in the same order, in both statements;
otherwise it shows an error or returns wrong data.
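If you actually need extra columns from one side (say you also wanted f.status from the joined query), the usual alternative to trimming is to pad the other SELECT with NULL placeholders so the counts and positions still line up. A sketch on the question's tables:
(SELECT b.id, b.pid, b.message, b.date, f.status
FROM wall_posts AS b
JOIN Friends AS f ON f.id = b.pid
WHERE f.buddy_id = '1' AND f.status = 'b')
UNION
(SELECT id, pid, message, date, NULL   -- NULL placeholder keeps the column count at 5
FROM wall_posts
WHERE pid = '1')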

T-SQL problem with running sum

I am trying to write a T-SQL script which will find "open" records in one table.
The structure of the data is the following:
Id (int PK) Ts (datetime) Art_id (int) Amount (float)
1 '2009-01-01' 1 1
2 '2009-01-05' 1 -1
3 '2009-01-10' 1 1
4 '2009-01-11' 1 -1
5 '2009-01-13' 1 1
6 '2009-01-14' 1 1
7 '2009-01-15' 2 1
8 '2009-01-17' 2 -1
9 '2009-01-18' 2 1
According to my needs, I am trying to show, for every article, only the records that come after the last point (by date) at which the running sum of Amount was zero. So I am trying to extract (show) records 5 and 6 for Art_id = 1 and record 9 for Art_id = 2. I am using MSSQL 2005 and my table has around 30K records with 6000 distinct values of ART_ID.
In this solution I simply want to find all the rows where there isn't a subsequent row for that Art_id where the running sum was 0. I am assuming we can use the ID as a better tiebreaker than TS, since two rows can come in with the same timestamp but they will get sequential identity values.
;WITH base AS
(
SELECT
ID, Art_id, TS, Amount,
RunningSum = Amount + COALESCE
(
(
SELECT SUM(Amount)
FROM dbo.foo
WHERE Art_id = f.Art_id
AND ID < f.ID
)
, 0
)
FROM dbo.foo AS f
)
SELECT ID, Art_id, TS, Amount
FROM base AS b1
WHERE NOT EXISTS
(
SELECT 1
FROM base AS b2
WHERE Art_id = b1.Art_id
AND ID >= b1.ID
AND RunningSum = 0
)
ORDER BY ID;
Complete working query:
SELECT
*
FROM TABLE_NAME E
JOIN
(SELECT
C.ART_ID,
MAX(TS) MAX_TS
FROM
(SELECT
ART_ID,
TS,
COALESCE((SELECT SUM(AMOUNT) FROM TABLE_NAME B WHERE (B.Art_id = A.Art_id) AND (B.Ts < A.Ts)),0) ROW_SUM
FROM TABLE_NAME A) C
WHERE C.ROW_SUM = 0
GROUP BY C.ART_ID) D
ON
(D.ART_ID = E.ART_ID) AND
(E.TS >= D.MAX_TS)
First we calculate, for every row, the sum of the amounts that came before it:
SELECT
ART_ID,
TS,
COALESCE((SELECT SUM(AMOUNT) FROM TABLE_NAME B WHERE (B.Art_id = A.Art_id) AND (B.Ts < A.Ts)),0) ROW_SUM
FROM TABLE_NAME A
Then we look for the last timestamp per article at which that sum is 0:
SELECT
C.ART_ID,
MAX(TS) MAX_TS
FROM
(SELECT
ART_ID,
TS,
COALESCE((SELECT SUM(AMOUNT) FROM TABLE_NAME B WHERE (B.Art_id = A.Art_id) AND (B.Ts < A.Ts)),0) ROW_SUM
FROM TABLE_NAME A) C
WHERE C.ROW_SUM = 0
GROUP BY C.ART_ID
You can find all rows where the running sum is zero with:
select cur.id, cur.art_id
from #articles cur
left join #articles prev
on prev.art_id = cur.art_id
and prev.id <= cur.id
group by cur.id, cur.art_id
having sum(prev.amount) = 0
Then you can query all rows that come after the last row with a zero running sum:
select a.*
from #articles a
left join (
select cur.id, cur.art_id, running = sum(prev.amount)
from #articles cur
left join #articles prev
on prev.art_id = cur.art_id
and prev.ts <= cur.ts
group by cur.id, cur.art_id
having sum(prev.amount) = 0
) later_zero_running on
a.art_id = later_zero_running.art_id
and a.id <= later_zero_running.id
where later_zero_running.id is null
The LEFT JOIN in combination with the WHERE says: there cannot be a row at or after this row where the running sum is zero.
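On SQL Server 2012 and later, the same running sum can be computed with a windowed SUM instead of a correlated subquery or self-join, which scales much better. A sketch assuming a table dbo.foo with the question's columns:
WITH sums AS
(
    SELECT Id, Ts, Art_id, Amount,
           SUM(Amount) OVER (PARTITION BY Art_id ORDER BY Id
                             ROWS UNBOUNDED PRECEDING) AS RunningSum  -- true running sum per article
    FROM dbo.foo
)
SELECT Id, Ts, Art_id, Amount
FROM sums AS s
WHERE NOT EXISTS
(
    SELECT 1 FROM sums AS z
    WHERE z.Art_id = s.Art_id
      AND z.Id >= s.Id
      AND z.RunningSum = 0   -- keep only rows after the last zero
)
ORDER BY Id;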