Slow SELECT FOR ALL ENTRIES - abap

The SELECT below runs with the internal table GIT_KUNNR_TAB containing 2,291,000 lines with unique customers (kunnr) and takes 16 minutes to complete.
select kunnr umsks umskz gjahr belnr buzei bschl shkzg dmbtr bldat
zfbdt zbd1t zbd2t zbd3t rebzg rebzj rebzz rebzt
into corresponding fields of table git_oi_tab
from bsid
for all entries in git_kunnr_tab
where bukrs = p_bukrs
and kunnr = git_kunnr_tab-kunnr
and umsks = ' '
and augdt = clear_augdt
and budat le p_key
and blart in s_blart
and xref3 in s_xref3.
BSID contains 20,000,000 records in total, and for the 2,291,000 unique customers the SELECT fetches 445,000 records from BSID.
Most of the time there are even more lines in GIT_KUNNR_TAB.
Is there any quicker selection?

Drop the FOR ALL ENTRIES part
Most likely the rest of the WHERE condition is selective enough. You get back more records than necessary, but much quicker.
As git_kunnr_tab is unique, you can turn it into a HASHED table, and filter git_oi_tab with that on the application server.
SELECT kunnr umsks umskz gjahr belnr buzei bschl shkzg dmbtr bldat
zfbdt zbd1t zbd2t zbd3t rebzg rebzj rebzz rebzt
INTO corresponding fields of table git_oi_tab
FROM bsid
WHERE bukrs = p_bukrs
AND umsks = ' '
AND augdt = clear_augdt
AND budat le p_key
AND blart in s_blart
AND xref3 in s_xref3.
DATA: lt_kunnr_tab TYPE HASHED TABLE OF <type of git_kunnr_tab>
      WITH UNIQUE KEY kunnr.
lt_kunnr_tab = git_kunnr_tab.
LOOP AT git_oi_tab ASSIGNING FIELD-SYMBOL(<fs_oi>).
  READ TABLE lt_kunnr_tab TRANSPORTING NO FIELDS
       WITH KEY kunnr = <fs_oi>-kunnr.
  IF sy-subrc <> 0.
    DELETE git_oi_tab.
  ENDIF.
ENDLOOP.
FREE lt_kunnr_tab.
This is not a general solution
If the FAE driver table contains more than roughly 20% of the rows of the target table, dropping it completely is usually beneficial for speed.
If it has fewer rows, FAE is the better solution.
Be careful however, dropping FAE can significantly increase the memory consumption of the resulting internal table!
FOR ALL ENTRIES vs Range table
You can see in many places on the internet that Range tables are faster than FAE. This is true only in some very specific cases:
Only one field is used from the FAE driver table [1]
There are more rows in the driver table than FAE sends down in one batch
By default the batch size is 5 in Oracle, 50 in DB2, 100 in HANA
There are not so many rows in the Range table that it causes a dump
The maximum statement length is 1,048,576 bytes (SAP Note 1002491)
Range tables can be faster than FAE because all the filtering conditions are sent down in one query. This is of course dangerous, as the size of a query is limited; if it exceeds that limit, you get a dump.
However, by using the hints MAX_IN_BLOCKING_FACTOR and MAX_BLOCKING_FACTOR to increase the batch size, you can give FAE all the benefits of range tables without the downsides (see the sketch below).
So only use Range tables with actual ranges, like between A and C, or not between G and J.
[1] This one is not about speed but functional correctness: Range tables treat fields independently, while FAE works with rows.
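For illustration, a sketch of such a blocking-factor hint on an Oracle-based system; the '&max_in_blocking_factor ...&' string and the value 255 are assumptions, so check the database interface hint documentation for your release before relying on it:
SELECT kunnr belnr buzei dmbtr
  INTO CORRESPONDING FIELDS OF TABLE git_oi_tab
  FROM bsid
  FOR ALL ENTRIES IN git_kunnr_tab
  WHERE bukrs = p_bukrs
    AND kunnr = git_kunnr_tab-kunnr
  %_HINTS ORACLE '&max_in_blocking_factor 255&'.
" &...& is an ABAP database interface hint, not a native Oracle hint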

Normally, when only one field is compared, using a range is much faster.
So if you are selecting the data by some key from the internal table and comparing only one field, turn it into a range instead of FAE:
TYPES:
  tr_kunnr TYPE RANGE OF kunnr.

* or just do LOOP/APPEND if you are on an old system (< 7.40)
DATA(lr_kunnr) = VALUE tr_kunnr(
  FOR <fs_kunnr> IN git_kunnr_tab
  ( sign   = 'I'
    option = 'EQ'
    low    = <fs_kunnr>-kunnr ) ).

select kunnr umsks umskz gjahr belnr buzei bschl shkzg dmbtr bldat
       zfbdt zbd1t zbd2t zbd3t rebzg rebzj rebzz rebzt
  into corresponding fields of table git_oi_tab
  from bsid
  where bukrs = p_bukrs
  and kunnr in lr_kunnr ...
I can't find the article, but an investigation was made, and a range was much faster than FAE in the single-field comparison case.


Sort by given "rank" of column values

I have a table like this (unsorted):
risk     category
Low      A
Medium   B
High     C
Medium   A
Low      B
High     A
Low      C
Low      E
Low      D
High     B
I need to sort rows by category, but first based on the value of risk. The desired result should look like this (sorted):
risk     category
Low      A
Low      B
Low      C
Low      D
Low      E
Medium   A
Medium   B
High     A
High     B
High     C
I've come up with the query below but wonder if it is correct:
SELECT *
FROM some_table
ORDER BY
  CASE
    WHEN risk = 'Low'    THEN 0
    WHEN risk = 'Medium' THEN 1
    WHEN risk = 'High'   THEN 2
    ELSE 3
  END,
  category;
Just want to understand whether the query is correct or not. The actual data set is huge and there are many other values for risk and categories and hence I can't figure out if the results are correct or not. I've just simplified it here.
Basically correct, but you can simplify:
SELECT *
FROM some_table
ORDER BY CASE risk
WHEN 'Low' THEN 0
WHEN 'Medium' THEN 1
WHEN 'High' THEN 2
-- rest defaults to NULL and sorts last
END
, category;
A "switched" CASE is shorter and slightly cheaper.
In the absence of an ELSE branch, all remaining cases default to NULL, and NULL sorts last in default ascending sort order. So you don't need to do anything extra.
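If you ever wanted the unranked values to come first instead, you can say so explicitly:
SELECT *
FROM some_table
ORDER BY CASE risk
            WHEN 'Low' THEN 0
            WHEN 'Medium' THEN 1
            WHEN 'High' THEN 2
         END NULLS FIRST
       , category;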
Many other values
... there are many other values for risk
As long as all other values can simply be lumped together at the bottom of the sort order, this is OK.
If all of those many values get their individual ranking, I would suggest an additional table to handle ranks of risk values. Like:
CREATE TABLE riskrank (
risk text PRIMARY KEY
, riskrank real
);
INSERT INTO riskrank VALUES
('Low' , 0)
, ('Medium', 1)
, ('High' , 2)
-- many more?
;
The rank column is data type real, so it's easy to squeeze in new rows with fractional ranks between existing ones later (similar to how enum values are handled internally).
Then your query is:
SELECT s.*
FROM some_table s
LEFT JOIN riskrank rr USING (risk)
ORDER BY rr.riskrank, s.category;
LEFT JOIN, so missing entries in riskrank don't eliminate rows.
enum?
I already mentioned the data type enum. That's a possible alternative as enum values are sorted in the order they are defined (not how they are spelled). They only occupy 4 bytes on disk (real internally), are fast and enforce valid values implicitly. See:
How to change the data type of a table column to enum?
However, I would only even consider an enum if the sort order of your values is immutable. Changing sort order and adding / removing allowed values is cumbersome. The manual:
Although enum types are primarily intended for static sets of values,
there is support for adding new values to an existing enum type, and
for renaming values (see ALTER TYPE). Existing values cannot be
removed from an enum type, nor can the sort ordering of such values be
changed, short of dropping and re-creating the enum type.
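A minimal sketch of the enum route, assuming the risk column is currently text and the three values above are the complete set:
CREATE TYPE risk_level AS ENUM ('Low', 'Medium', 'High');

ALTER TABLE some_table
  ALTER COLUMN risk TYPE risk_level USING risk::risk_level;

-- enum values sort in declaration order, so no CASE expression is needed
SELECT *
FROM some_table
ORDER BY risk, category;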

BigQuery query execution costs

I know there is documentation on BigQuery pricing, but I am confused about which value they charge you on. When you compose a query, the editor will show "This query will process 69.3 GB when run.", but when you've executed the query, there is a Job Information tab next to the Results tab. In that Job Information tab there are two values: "Bytes Processed" and "Bytes Billed".
I was informed that you are charged on the "Bytes Billed" value (seems logical based on the name!).
What's causing my confusion is that the Bytes Billed for the above 69.3 GB query is 472 MB. I'm given to believe that the WHERE clause does not impact pricing.
Why is it so much less?
How can I accurately estimate query costs if I can't see the Bytes Billed beforehand?
Thanks in advance
Edit 1
Here is my query:
SELECT
timestamp_trunc(DateTimeUTC, SECOND) as DateTimeUTC,
ANY_VALUE(if(Code = 'Aftrtmnt_1_Scr_Cat_Tank_Level', value, null)) as Aftrtmnt_1_Scr_Cat_Tank_Level,
ANY_VALUE(if(Code = 'ctv_ds_ect', value, null)) as ctv_ds_ect,
ANY_VALUE(if(Code = 'Engine_Coolant_Level', value, null)) as Engine_Coolant_Level,
ANY_VALUE(if(Code = 'ctv_batt_volt_min', value, null)) as ctv_batt_volt_min,
ANY_VALUE(if(Code = 'ctv_moderate_brake_count', value, null)) as ctv_moderate_brake_count,
ANY_VALUE(if(Code = 'ctv_amber_lamp_count', value, null)) as ctv_amber_lamp_count,
VIN,
ANY_VALUE(if(Code = 'ctv_trip_distance_miles', value, null)) as ctv_trip_distance_miles,
FROM `xxxx.yyyy.zzzz`
WHERE
DATE(DateTimeUTC) > '2021-03-01' and DATE(DateTimeUTC) < '2021-06-01' and
Code in ('Aftrtmnt_1_Scr_Cat_Tank_Level', 'ctv_ds_ect', 'Engine_Coolant_Level', 'ctv_trip_distance_miles', 'ctv_batt_volt_min', 'ctv_moderate_brake_count', 'ctv_amber_lamp_count')
and event_name = 'Trip Detail'
group by timestamp_trunc(DateTimeUTC, SECOND), VIN
Essentially it just pivots the main table, and the intention is to insert the result into another table.
This article states that the WHERE clause does not impact cost, which is different from what I previously thought.
I believe that your actual cost should never be more than estimated, but could be less.
Consider a table that is both partitioned and clustered. Let's assume it is partitioned on a date field my_date and clustered on a string field my_type.
Then, consider the following query...
select my_date, my_type from <table>
The estimate assumes you are scanning both columns in their entirety, so your billing should match the estimate.
However, if you filter against the partition, you should see a reduction in both the estimation and the billed amount.
select my_date, my_type from <table> where my_date = '2021-06-17'
But, if you filter against the clustered column, I don't believe the estimate evaluates that filter, because it doesn't know what you are filtering, just which columns. However, when you execute the query, you do get the benefit of the clustering, because it won't actually scan the entire column, just the relevant clusters.
select my_date, my_type from <table> where my_type = 'A'
It is not checking 'A' against the clustering in the estimation. Consider a case where 'A' doesn't exist in that clustered column: the estimator would still show an estimate, but you would actually scan 0 bytes when you execute.
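For reference, a table with the layout assumed above could be declared like this (dataset and table names are placeholders):
-- partitioned on my_date, clustered on my_type
CREATE TABLE my_dataset.my_table (
  my_date DATE,
  my_type STRING
)
PARTITION BY my_date
CLUSTER BY my_type;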

Does a query against a nested field in BigQuery count only the size of the subfield as the "Processed Data Amount" in on-demand pricing?

The other possibility is that the "Processed Data Amount" reflects the size of the top, enclosing STRUCT/RECORD type even though only one subfield of the STRUCT/RECORD column is selected.
The online doc says "0 bytes + the size of the contained fields", which is not explicit to me. Can someone help to clarify? Thanks.
Think of a record as a storage mechanism. When you query against it (like you would a regular table), you are still only charged for the columns you use (select, filter, join, etc).
Check out the following query estimates for these similar queries.
-- This query would process 5.4GB
select
* -- everything
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
-- This query would process 33.7MB
select
visitorId, -- integer
totals -- record
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
-- This query would process 6.9MB
select
visitorId, -- integer
totals.hits -- specific column from record
from `bigquery-public-data.google_analytics_sample.ga_sessions_*`
On-demand pricing is based on the number of bytes processed by a query; for the STRUCT/RECORD data type you are charged according to the columns you select within the record.
The expression "0 bytes + the size of the contained fields" means that the size depends on the data types of the columns within the record.
Moreover, you can estimate costs before running your query using the query validator.

COLLECT and MODIFY optimization for large itab

I have 2 parts of code. Both of them process 1.5 million records, but the first part takes 20 minutes and the 2nd part takes 13.5 hours!
Here is the 1st part:
loop at it_bkpf.
select * from bseg into corresponding fields of itab
where bukrs = it_bkpf-bukrs and
belnr = it_bkpf-belnr and
gjahr = it_bkpf-gjahr and
hkont in s_hkont.
if sy-subrc = 0 .
itab-budat = it_bkpf-budat.
clear *bseg .
select single * from *bseg
where bukrs = itab-bukrs and
belnr = itab-belnr and
gjahr = itab-gjahr and
hkont = wtax .
if sy-subrc <> 0 .
itab-budat = '99991231'.
endif.
endif.
append itab.
endselect.
endloop.
The 2nd part which is doing 13,5 hours is the following:
sort itab by belnr.
loop at itab where hkont(2) = '73'.
move-corresponding itab to itab2.
collect itab2.
endloop.
loop at itab2.
lv_5per_total = con_5per_tax * itab2-dmbtr.
lv_5per_upper = lv_5per_total + '0.02'.
lv_5per_lower = lv_5per_total - '0.02'.
read table itab with key belnr = itab2-belnr
hkont = wtax.
if sy-subrc = 0.
if itab-dmbtr between lv_5per_lower and lv_5per_upper.
itab-budat = '99991231'.
modify itab transporting budat where belnr = itab2-belnr.
endif.
endif.
endloop.
Does anyone have an idea on how to fix the 2nd part?
Some extra things:
it_bkpf has 1.5 million records.
After the 1st part, ITAB has 1.5 million records.
In the 2nd part, in the 1st loop, I sum the amounts per belnr for the accounts that start with 73.
In the 2nd loop I compare the sum per belnr with the amount of the belnr/account and do what the code says.
Thanks
ADDITIONAL INFORMATION:
First of all, the initial code already existed and I added the new part. ITAB existed and ITAB2 is mine. The declarations of the tables were:
DATA : BEGIN OF itab OCCURS 0,
bukrs LIKE bseg-bukrs,
hkont LIKE bseg-hkont,
belnr LIKE bkpf-belnr,
gjahr LIKE bkpf-gjahr,
dmbtr LIKE bseg-dmbtr,
shkzg LIKE bseg-shkzg ,
budat LIKE bkpf-budat,
zzcode LIKE bseg-zzcode.
DATA END OF itab.
DATA : BEGIN OF itab2 OCCURS 0 ,
belnr LIKE bkpf-belnr,
dmbtr LIKE bseg-dmbtr,
END OF itab2.
After your suggestion I made the following changes:
types: begin of ty_belnr_sums,
belnr like bkpf-belnr,
dmbtr like bseg-dmbtr,
end of ty_belnr_sums.
data: git_belnr_sums type sorted table of ty_belnr_sums
with unique key belnr.
data: gwa_belnr_sums type ty_belnr_sums.
data: lv_5per_upper type p decimals 2,
lv_5per_lower type p decimals 2,
lv_5per_total type p decimals 2.
sort itab by belnr hkont.
loop at itab where hkont(2) = '73'.
move-corresponding itab to gwa_belnr_sums.
collect gwa_belnr_sums into git_belnr_sums .
endloop.
loop at git_belnr_sums into gwa_belnr_sums.
lv_5per_total = con_5per_tax * gwa_belnr_sums-dmbtr.
lv_5per_upper = lv_5per_total + '0.02'.
lv_5per_lower = lv_5per_total - '0.02'.
read table itab with key belnr = gwa_belnr_sums-belnr
hkont = wtax
binary search.
if sy-subrc = 0.
if itab-dmbtr between lv_5per_lower and lv_5per_upper.
itab-budat = '99991231'.
modify itab transporting budat
where belnr = gwa_belnr_sums-belnr.
endif.
endif.
endloop.
Now I am running it in the background for 1.5 million records and it is still running after 1 hour.
I guess people have given you all the hints to optimize your code. Essentially, as you want to optimize only the 2nd part, the only issues seem to be with the itab operations (LOOP, READ, MODIFY), which you may improve by using an index or a hash table on the internal table itab. If these concepts are unclear, I recommend the ABAP documentation: Row-Based Administration Costs of Internal Tables and Internal Tables - Performance Notes
Two more things:
you should measure the performance of your code with transaction SAT, and you will see on which operation the time is spent.
Not related to the performance but to the data model: the key of a financial document is made of 3 fields, company (BUKRS), document number (BELNR) and year (GJAHR). So you shouldn't group ("collect") by document number only, or you risk mixing documents of distinct companies or years. Instead, keep the 3 fields, as sketched right below.
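For example, the sum table could carry the full document key; a sketch based on the declarations in the question (git_doc_sums is an illustrative name):
TYPES: BEGIN OF ty_doc_sums,
         bukrs TYPE bseg-bukrs,
         belnr TYPE bkpf-belnr,
         gjahr TYPE bkpf-gjahr,
         dmbtr TYPE bseg-dmbtr,
       END OF ty_doc_sums.
DATA git_doc_sums TYPE HASHED TABLE OF ty_doc_sums
     WITH UNIQUE KEY bukrs belnr gjahr.
" COLLECT into this table sums DMBTR per complete document key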
Now, if you accept to adapt a few little things in the 1st part, right after reading the lines of BSEG of every document, then you may simply loop at these lines, no need of an index or binary search or hash table, and that's done. The only thing to do is to first store the lines of BSEG in a temporary itab_temp, to contain only the lines of one document at a time, so that you loop only at these lines, and then add them to itab.
If you want to update the BUDAT component later, not in the 1st part, then store the document key + BUDAT in a distinct internal table (say itab_budat), define a non-unique secondary index on itab with the document key fields (say by_doc_key), and you'll only need this straightforward code:
DATA ls_budat LIKE LINE OF itab.
LOOP AT itab_budat ASSIGNING FIELD-SYMBOL(<line_budat>).
  " only the BUDAT component of the work area is transported
  ls_budat-budat = <line_budat>-budat.
  MODIFY itab FROM ls_budat USING KEY by_doc_key TRANSPORTING budat
         WHERE bukrs = <line_budat>-bukrs
           AND belnr = <line_budat>-belnr
           AND gjahr = <line_budat>-gjahr.
ENDLOOP.
Note that I propose a (non-unique) secondary index, not a primary one, because as you don't want to change (too much) the 1st part, this will avoid potential issues that you might have with a primary index.
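A sketch of how such a secondary key could be declared, assuming itab is re-declared without the obsolete header line (the component list is shortened to the fields used here):
TYPES: BEGIN OF ty_itab_line,
         bukrs TYPE bseg-bukrs,
         belnr TYPE bkpf-belnr,
         gjahr TYPE bkpf-gjahr,
         hkont TYPE bseg-hkont,
         dmbtr TYPE bseg-dmbtr,
         budat TYPE bkpf-budat,
       END OF ty_itab_line.
DATA itab TYPE STANDARD TABLE OF ty_itab_line
     WITH DEFAULT KEY
     WITH NON-UNIQUE SORTED KEY by_doc_key
          COMPONENTS bukrs belnr gjahr.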
If you need more information on the terms, syntaxes, and so on, you'll find them in the ABAP documentation (inline or web).
There are several performance issues here:
1st part:
Never do SELECT in a LOOP. I said never. Instead, use SELECT ... FOR ALL ENTRIES or SELECT ... JOIN (BSEG is a cluster table in ECC, so no JOIN is possible there, but it is a transparent table in S/4HANA, so there you can JOIN it with BKPF). As far as I can see from the coding, you select G/L items, so check whether using BSIS/BSAS would be better (instead of BSEG). You can also check table BSET (it looks like you are interested in "tax" lines).
2nd part:
If you do a LOOP with a WHERE condition, you get the best performance if the internal table is TYPE SORTED (with a proper key). However, your WHERE condition uses an offset, so I am not sure whether the SORTED table will help, but it is worth a try.
However, the real issue is here: COLLECT itab2. COLLECT is not recommended for TYPE STANDARD tables (the SAP Help advises using it only with hashed tables or sorted tables with a unique key); the internal table should be TYPE HASHED (there are technical reasons behind this that I won't go into).
After that you LOOP over one internal table (itab2) and READ TABLE the other (itab). To get the best performance for the single-line read, itab has to be TYPE HASHED (or at least TYPE SORTED with the proper key).
Instead of using internal tables with header lines, LOOP ... ASSIGNING FIELD-SYMBOL(<...>), so you don't need the MODIFY anymore; this also slightly improves performance (see the sketch below).
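A minimal sketch of that ASSIGNING pattern, only to show the mechanics (names follow the question's declarations; a 7.40+ system is assumed for the inline declaration):
LOOP AT itab ASSIGNING FIELD-SYMBOL(<fs_itab>) WHERE hkont(2) = '73'.
  " changes through the field symbol update the table line in place,
  " so no separate MODIFY statement is needed
  <fs_itab>-budat = '99991231'.
ENDLOOP.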
As said by JozsefSzikszai, it would probably be better to use BSAS / BSIS as they have indexes. If you really want to use BSEG:
BSEG is a cluster table. AFAIK, it is preferable for those tables to perform a SELECT using the key fields to get an internal table, and then work on the internal table for the additional fields:
select * from bseg into corresponding fields of itab
for all entries in it_bkpf
where bukrs = it_bkpf-bukrs and
belnr = it_bkpf-belnr and
gjahr = it_bkpf-gjahr.
delete itab where hkont not in s_hkont.
In the second part, you're performing a READ TABLE on a standard table (itab).
Using BINARY SEARCH would greatly cut your runtime. A READ TABLE on a standard table without it is a linear scan (the lines are read until the value is found). Sorting itab by belnr and hkont (instead of only belnr) and adding BINARY SEARCH turns it into a dichotomic (binary) search.
The first part is faster because it is just linearly bad.
The second one is quadratic.
The first part looks quadratic too, but bukrs, belnr and gjahr are almost the complete primary key of BSEG, so for every line of it_bkpf only a few lines are found by the SELECTs.
First part
As JozsefSzikszai wrote, FOR ALL ENTRIES is much faster than nested SELECTs. Moreover, a huge amount of unnecessary data is moved from the DB to the application server: the SELECT SINGLE is just an existence check, yet it reads all 362 columns of BSEG.
So create a SORTED table before the LOOP:
TYPES: BEGIN OF tt_bseg,
         bukrs TYPE bukrs,
         belnr TYPE belnr_d,
         gjahr TYPE gjahr,
       END OF tt_bseg.
DATA: lt_bseg TYPE SORTED TABLE OF tt_bseg WITH NON-UNIQUE KEY bukrs belnr gjahr.
SELECT bukrs belnr gjahr FROM bseg
INTO TABLE lt_bseg
FOR ALL ENTRIES IN it_bkpf
WHERE bukrs = it_bkpf-bukrs
AND belnr = it_bkpf-belnr
AND gjahr = it_bkpf-gjahr
AND hkont = wtax.
Instead of the SELECT SINGLE, use READ TABLE:
READ TABLE lt_bseg TRANSPORTING NO FIELDS
WITH KEY bukrs = itab-bukrs
belnr = itab-belnr
gjahr = itab-gjahr.
Second part
This is quadratic because for every iteration of itab2, all lines of itab are checked for the MODIFY.
Simplest solution:
Instead of SORT itab BY belnr, load the content of itab into a SORTED table and use that later instead:
DATA: lt_itab LIKE SORTED TABLE OF itab
      WITH NON-UNIQUE KEY belnr hkont.
lt_itab = itab[].
FREE itab.
There were many reasonable recommendations from the guys, especially from Sandra Rossi, but I'll just add my two cents about your latest variant that runs for 1 hour:
use field-symbols as much as possible. They really can influence performance on large datasets (and 1.5M is large, yes!)
use secondary keys, as proposed by Sandra. It really matters a lot (see standard program DEMO_SECONDARY_KEYS)
do not use INTO CORRESPONDING FIELDS unless needed, it slows down the queries a bit
there is some double work in your snippet: you collect rows in the first loop over itab (BSEG), but BUDAT is written only when the BSEG position sum fits the lower/upper range, so some of the collected sums are wasted effort.
All in all, I haven't fully understood your logic, as you write BUDAT not to the whole matched row set but only to the first row of the group with the matched BELNR, which makes little sense to me, but never mind.
Nevertheless, if one tries to just repeat your logic without any alterations and applies the new GROUP BY syntax, here is what one might get.
* To omit `INTO CORRESPONDING` you should properly declare your structure
* and fetch only those fields which are needed. Also avoid obsolete `OCCURS` syntax.
TYPES: BEGIN OF ty_itab,
bukrs TYPE bseg-bukrs,
belnr TYPE bkpf-belnr,
gjahr TYPE bkpf-gjahr,
hkont TYPE bseg-hkont,
dmbtr TYPE bseg-dmbtr,
shkzg TYPE bseg-shkzg ,
budat TYPE bkpf-budat,
END OF ty_itab.
TYPES: begin of ty_belnr_sums,
belnr type bkpf-belnr,
dmbtr type bseg-dmbtr,
end of ty_belnr_sums.
DATA: itab TYPE TABLE OF ty_itab INITIAL SIZE 0,
con_5per_tax type p decimals 2 value '0.03'.
SELECT g~bukrs g~belnr g~gjahr g~hkont g~dmbtr g~shkzg f~budat UP TO 1500000 rows
INTO table itab
FROM bseg AS g
JOIN bkpf AS f
ON g~bukrs = f~bukrs
AND g~belnr = f~belnr
AND g~gjahr = f~gjahr.
DATA members LIKE itab.
LOOP AT itab ASSIGNING FIELD-SYMBOL(<fs_itab>)
     GROUP BY ( belnr = <fs_itab>-belnr
                hkont = <fs_itab>-hkont )
     ASCENDING
     ASSIGNING FIELD-SYMBOL(<group>).
CLEAR members.
CHECK <group>-hkont(2) = '73'.
LOOP AT GROUP <group> ASSIGNING FIELD-SYMBOL(<group_line>).
members = VALUE #( BASE members ( <group_line> ) ).
ENDLOOP.
DATA(sum) = REDUCE dmbtr( INIT val TYPE dmbtr
FOR wa IN members
NEXT val = val + wa-dmbtr ).
ASSIGN members[ 1 ] TO FIELD-SYMBOL(<first_line>).
IF <first_line>-dmbtr BETWEEN con_5per_tax * sum - '0.02' AND con_5per_tax * sum + '0.02'.
  <first_line>-budat = '99991231'.
ENDIF.
ENDLOOP.
During my test on a 1.5M dataset I measured the runtime of the old snippet and of my snippet with GET RUN TIME FIELD.

Unique Rows that match a specific criteria

My data is Microsoft Office 365 Mailbox audit logs.
I am working with 14 columns, incorporating names, timestamps, IP addresses, etc.
I have two tables, let's call them EXISTING and NEW. The column definition, order and count are identical in the two tables.
The data in Existing is (very close to!) Distinct.
The data in New is drawn from multiple overlapping searches and is not Distinct.
There are millions of rows in Existing and hundreds of thousands in New.
Data is being written to New all the time, 24x7, with about 1 million rows a day being added.
~95% of the rows in New are already present in Existing and are therefore unwanted duplicates. However, the data in New has many gaps; there are many recent rows in Existing that are NOT present in New.
I want to select all rows from New that are not present in Existing, using Invoke-SqlCmd in PowerShell.
Then I want to delete all the processed rows from New so it doesn't grow uncontrollably.
My approach so far has been:
Add a [Processed] column to New.
Set [Processed] to 0 for all existing data for selection purposes. New rows that continue to be added will have [Processed] = NULL, and will be left alone.
SELECT DISTINCT all data with [Processed] = 0 from New and copy it to a temporary table called Staging. Find the oldest timestamp ([LastAccessed]) in this data. Then delete all rows from New with [Processed] = 0.
Copy all data from Existing with [LastAccessed] equal to or later than the above timestamp across to Staging, adding the column [Processed] = 1.
Now I want all data in Staging where [Processed] = 0 and there is no duplicate.
The nearest concept I can come up with is:
SELECT MailboxOwnerUPN
,MailboxResolvedOwnerName
,LastAccessed
,ClientIPAddress
,ClientInfoString
,MailboxGuid
,Operation
,OperationResult
,LogonType
,ExternalAccess
,InternalLogonType
,LogonUserDisplayName
,OriginatingServer
FROM dbo.Office365Staging
GROUP BY MailboxOwnerUPN
,MailboxResolvedOwnerName
,LastAccessed
,ClientIPAddress
,ClientInfoString
,MailboxGuid
,Operation
,OperationResult
,LogonType
,ExternalAccess
,InternalLogonType
,LogonUserDisplayName
,OriginatingServer
HAVING Count(1) = 1 and Processed = 0;
Which of course I can't do, because [Processed] isn't part of the SELECT or GROUP BY. If I add the column [Processed], then all lines are unique and there are no duplicates. I have tried a variety of joins and other techniques, without success thus far.
Initially without [Processed] = 0, the query worked, but returned unwanted unique lines from Existing. I only want unique lines from New.
Clearly due to the size of these structures efficiency is a consideration. This process will be happening regularly, every 15 minutes ideally.
Identifying these new lines then starts another process of Geo-IP, reputation, alerting, etc in PowerShell....
I thought the performance of the following would be horrid, but it is OK at ~27 seconds:
SELECT [MailboxOwnerUPN]
,[MailboxResolvedOwnerName]
,[LastAccessed]
,[ClientIPAddress]
,[ClientInfoString]
,[MailboxGuid]
,[Operation]
,[OperationResult]
,[LogonType]
,[ExternalAccess]
,[InternalLogonType]
,[LogonUserDisplayName]
,[OriginatingServer]
FROM dbo.New
WHERE [Processed] = 1 and
NOT EXISTS (Select * From dbo.Existing
Where New.LastAccessed = Existing.LastAccessed and
New.ClientIPAddress = Existing.ClientIPAddress and
New.ClientInfoString = Existing.ClientInfoString and
New.MailboxGuid = Existing.MailboxGuid)
GO
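If the ~27 seconds ever becomes a problem, a covering index on Existing over the four correlated columns is the usual next step. A sketch (the index name, column order and INCLUDE choice are assumptions to be tuned against the real data):
CREATE NONCLUSTERED INDEX IX_Existing_NewDedup
    ON dbo.Existing (LastAccessed, MailboxGuid, ClientIPAddress)
    INCLUDE (ClientInfoString);
GO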