Google Pub/Sub to Dataflow, avoid duplicates with Record ID - google-bigquery

I'm trying to build a Streaming Dataflow Job which read events from Pub/Sub and write them into BigQuery.
According to the documentation, Dataflow can detect duplicate messages delivery if a Record ID is used (see: https://cloud.google.com/dataflow/model/pubsub-io#using-record-ids)
But even using this Record ID, I still have some duplicates
(around 0.0002%).
Did I miss something ?
EDIT:
I use Spotify Async PubSub Client to publish messages with the following snipplet:
Message
.builder()
.data(new String(Base64.encodeBase64(json.getBytes())))
.attributes("myid", id, "mytimestamp", timestamp.toString)
.build()
Then I use Spotify scio to read the message from pub/sub and save it to DataFlow:
val input = sc.withName("ReadFromSubscription")
.pubsubSubscription(subscriptionName, "myid", "mytimestamp")
input
.withName("FixedWindow")
.withFixedWindows(windowSize) // apply windowing logic
.toWindowed // convert to WindowedSCollection
//
.withName("ParseJson")
.map { wv =>
wv.copy(value = TableRow(
"message_id" -> (Json.parse(wv.value) \ "id").as[String],
"message" -> wv.value)
)
}
//
.toSCollection // convert back to normal SCollection
//
.withName("SaveToBigQuery")
.saveAsBigQuery(bigQueryTable(opts), BQ_SCHEMA, WriteDisposition.WRITE_APPEND)
The Window size is 1 minute.
After only few seconds injecting messages I already have duplicates in BigQuery.
I use this query to count duplicates:
SELECT
COUNT(message_id) AS TOTAL,
COUNT(DISTINCT message_id) AS DISTINCT_TOTAL
FROM my_dataset.my_table
//returning 273666 273564
And this one to look at them:
SELECT *
FROM my_dataset.my_table
WHERE message_id IN (
SELECT message_id
FROM my_dataset.my_table
GROUP BY message_id
HAVING COUNT(*) > 1
) ORDER BY message_id
//returning for instance:
row|id | processed_at | processed_at_epoch
1 00166a5c-9143-3b9e-92c6-aab52601b0be 2017-02-02 14:06:50 UTC 1486044410367 { ...json1... }
2 00166a5c-9143-3b9e-92c6-aab52601b0be 2017-02-02 14:06:50 UTC 1486044410368 { ...json1... }
3 00354cc4-4794-3878-8762-f8784187c843 2017-02-02 13:59:33 UTC 1486043973907 { ...json2... }
4 00354cc4-4794-3878-8762-f8784187c843 2017-02-02 13:59:33 UTC 1486043973741 { ...json2... }
5 0047284e-0e89-3d57-b04d-ebe4c673cc1a 2017-02-02 14:09:10 UTC 1486044550489 { ...json3... }
6 0047284e-0e89-3d57-b04d-ebe4c673cc1a 2017-02-02 14:08:52 UTC 1486044532680 { ...json3... }

The BigQuery documentation states that there may be rare cases where duplicates arrive:
"BigQuery remembers this ID for at least one minute" -- if Dataflow takes more than one minute before retrying the insert BigQuery may allow the duplicate in. You may be able to look at the logs from the pipeline to determine if this is the case.
"In the rare instance of a Google datacenter losing connectivity unexpectedly, automatic deduplication may not be possible."
You may want to try the instructions for manually removing duplicates. This will also allow you to see the insertID that was used with each row to determine if the problem was on the Dataflow side (generating different insertIDs for the same record) or on the BigQuery side (failing to deduplicate rows based on their insertID).

Related

How to filter a date-field with a swift vapor-fluent query

To avoid multiple inserts of the same person in a database, I wrote the following function:
func anzahlDoubletten(_ req: Request, nname: String, vname: String, gebTag: Date)
async throws -> Int {
try await
Teilnehmer.query(on: req.db)
.filter(\.$nname == nname)
.filter(\.$vname == vname)
.filter(\.$gebTag == gebTag)
.count()
}
The function always returns 0, even if there are multiple records with the same surname, prename and birthday in the database.
Here is the resulting sql-query:
[ DEBUG ] SELECT COUNT("teilnehmer"."id") AS "aggregate" FROM "teilnehmer" WHERE "teilnehmer"."nname" = $1 AND "teilnehmer"."vname" = $2 AND "teilnehmer"."geburtstag" = $3 ["neumann", "alfred e.", 1999-09-09 00:00:00 +0000] [database-id: psql, request-id: 1AC70C41-EADE-43C2-A12A-99C19462EDE3] (FluentPostgresDriver/FluentPostgresDatabase.swift:29)
[ INFO ] anzahlDoubletten=0 [request-id: 1AC70C41-EADE-43C2-A12A-99C19462EDE3] (App/Controllers/TeilnehmerController.swift:49)
if I query directly I obtain:
lwm=# select nname, vname, geburtstag from teilnehmer;
nname | vname | geburtstag
---------+-----------+------------
neumann | alfred e. | 1999-09-09
neumann | alfred e. | 1999-09-09
neumann | alfred e. | 1999-09-09
neumann | alfred e. | 1999-09-09
so count() should return 4 not 0:
lwm=# select count(*) from teilnehmer where nname = 'neumann' and vname = 'alfred e.' and geburtstag = '1999-09-09';
count
-------
4
My DateFormatter is defined like so:
let dateFormatter = ISO8601DateFormatter()
dateFormatter.formatOptions = [.withFullDate, .withDashSeparatorInDate]
And finally the attribute "birthday" in my model:
...
#Field(key: "geburtstag")
var gebTag: Date
...
I inserted the 4 alfreds in my database using the model and fluent, passing the birthday "1999-09-09" as a String and fluent inserted all records correctly.
But .filter(\.$gebTag == gebTag) seems to return constantly 'false'.
Is it at all possible to use .filter() with data types other than String?
And if so, what am I doing wrong?
Many thanks for your help
Michael
The problem you've hit is that you're storing only dates whereas you're filtering on dates with times. Unfortunately there's no native way to store just a date. However there are a few options.
The easiest way is to change the date field to a String and then use your date formatter (make sure you remove the time part) to convert the query option to a String.
I am guessing slightly here, but I suspect that your table was not created by a Migration? If it had been, your geburtstag field would include a time component as this is the default and you would have spotted the problem quickly.
In any event, the filter is actually filtering on the time component of gebTag as well as the date. This is why it is returning zero.
I suggest converting the geburtstag to a type that includes the time and ensuring that the time component is set to 0:00:00 when you store it. You can reset the time component to 'midnight' using something like this:
extension Date {
var midnight: Date { return Calendar.current.date(bySettingHour: 0, minute: 0, second: 0, of: self)! }
}
Then change your filter to:
.filter(\.$gebTag == gebTag.midnight)
Alternatively, just use the static method in Calendar:
.filter(\.$gebTag == Calendar.startOfDay(for:gebTag))
I think this is the most straightforward way of doing it.

How to optimize querying multiple unrelated tables in SQLite?

I have scenario when I have to iterate through multiple tables in quite big sqlite database. In tables I store informations about planet position on sky through years. So e.g. for Mars I have tables Mars_2000, Mars_2001 and so on. Table structure is always the same:
|id:INTEGER|date:TEXT|longitude:REAL|
Thing is that for certain task I need to iterate through this tables, which cost much time (for more than 10 queries it's painful).
I suppose that if I merge all tables with years to one big table performance might be better as one query through one big table is better than 50 through smaller tables. I wanted to make sure that this might work, as database is humongous (around 20Gb), and reshaping it would cost a while.
Is this plan I just described viable? Is there any other solution for such case?
It might be helpfull so I attach function that produces my SQL query that is unique for each table:
pub fn transition_query(
select_param: &str, // usually asterix
table_name: &str, // table I'd like to query
birth_degree: &f64, // constant number
wanted_degree: &f64, // another constant number
orb: &f64, // another constant number
upper_date_limit: DateTime<Utc>, // casts to SQL-like string
lower_date_limit: DateTime<Utc>, // casts to SQL-like string
) -> String {
let parsed_upper_date_limit = CelestialBodyPosition::parse_date(upper_date_limit);
let parsed_lower_date_limit = CelestialBodyPosition::parse_date(lower_date_limit);
return format!("
SELECT *,(SECOND_LAG>60 OR SECOND_LAG IS NULL) AS TRANSIT_START, (SECOND_LEAD > 60 OR SECOND_LEAD IS NULL) AS TRANSIT_END, time FROM (
SELECT
*,
UNIX_TIME - LAG(UNIX_TIME,1) OVER (ORDER BY time) as SECOND_LAG,
LEAD(UNIX_TIME,1) OVER (ORDER BY time) - UNIX_TIME as SECOND_LEAD FROM (
SELECT {select_param},
DATE(time) as day_scoped_date,
CAST(strftime('%s', time) AS INT) AS UNIX_TIME,
longitude
FROM {table_name}
WHERE ((-{orb} <= abs(realModulo(longitude -{birth_degree} -{wanted_degree},360))
AND abs(realModulo(longitude -{birth_degree} -{wanted_degree},360)) <= {orb})
OR
(-{orb} <= abs(realModulo(longitude -{birth_degree} +{wanted_degree},360))
AND abs(realModulo(longitude -{birth_degree} +{wanted_degree},360)) <= {orb}))
AND time < '{parsed_upper_date_limit}' AND time > '{parsed_lower_date_limit}'
)
) WHERE (TRANSIT_START AND NOT TRANSIT_END) OR (TRANSIT_END AND NOT TRANSIT_START) ;
");
}
I solved the issue programmatically. Whole thing was done with Rust and r2d2_sqlite library. I'm still doing a lot of queries, but now it's done in threads. It allowed me to reduce execution time from 25s to around 3s. Here's the code:
use std::sync::mpsc;
use std::thread;
use r2d2_sqlite::SqliteConnectionManager;
use r2d2;
let manager = SqliteConnectionManager::file("db_path");
let pool = r2d2::Pool::builder().build(manager).unwrap();
let mut result: Vec<CelestialBodyPosition> = vec![]; // Vector of structs
let (tx, rx) = mpsc::channel(); // Allows ansynchronous communication
let mut children = vec![]; //vector of join handlers (not sure if needed at all
for query in queries {
let pool = pool.clone(); // For each loop I clone connection to databse
let inner_tx = tx.clone(); // and messager, as each thread should have spearated one.
children.push(thread::spawn(move || {
let conn = pool.get().unwrap();
add_real_modulo_function(&conn); // this adds custom sqlite function I needed
let mut sql = conn.prepare(&query).unwrap();
// this does query, and maps result to my internal type
let positions: Vec<CelestialBodyPosition> = sql
.query_map(params![], |row| {
Ok(CelestialBodyPosition::new(row.get(1)?, row.get(2)?))
})
.unwrap()
.map(|position| position.unwrap())
.collect();
// this sends partial result to receiver
return inner_tx.send(positions).unwrap();
}));
}
// first messenger has to be dropped, otherwise program will wait for its input
drop(tx);
for received in rx {
result.extend(received); // combine all results
}
return result;
As you can see no optimization happened from sqlite site, which kinda makes me feel I'm doing something wrong, but for now it's alright. It might be good to press some more control over amount of spawned threads.

NextgenSplunk: Need help forming a splunk query which takes sessionId from a particular set of logs, use it to form next query

I need to form a Splunk query to find a particular sessionId for which log a is available but log b is not. Both are part of the same transaction but code breaking in between somewhere.
LOGGER.info("Log a:: setting some details in session");
Response response = handler.transactionMethod(token); //throws some exception
LOGGER.info("Log b:: getting details in session");
So in the success scenario, both Log a and Log b will be printed. But when transactionMethod throws an exception, only Log a will be printed for that sessionId and not Log b.
The requirement is I need to find any of the sessionId for which only Log a is present, not Log b.
Assuming that you have 2 fields TEXT and SessionID already defined, we will use the following test data:
SessionID=1001 TEXT="setting
SessionID=1001 TEXT="getting
SessionID=1002 TEXT="setting
SessionID=1003 TEXT="getting"
Splunk query:
| makeresults count=4
| streamstats count
| eval TEXT=case(count=1 OR count=3, "setting", count=2 OR count=4, "getting")
| eval SessionID=case(count=1 OR count=2, 1001, count=3, 1002, count=4, 1003)
``` The above is just setting of the test data ```
``` Below is the actual SPL for the task ```
| stats count(eval(TEXT=setting")) as LogA count(eval(TEXT="getting") as Logb by SessionID
| search LogA > 0 and LogB = 0
As you can see I specifically excluded the case when only "LogB" record is present (SessionID=3)

How to track SLA of VM availability set (or availability zone) through heartbeats with Log Analytics (KQL)

I want to track the SLAs of our VMs in a Monitor Workbook using a Log Analytics query.
For this, I use the 'Heartbeat' table, which gives the heartbeats of each VM.
However, some of our VMs are in an availability set/zone and as such, the SLA is only broken,
if in an interval of 1 minute, both heartbeats are missing.
As such I need to be able to group the heartbeats by availability set/zone in the query, but there doesn't seem to be such a property on the heartbeat.
I can use a separate Azure Resource Graph query to search for which VMs are in an availability set/zone, but when I merge this query with my Log Analytics query, I can't do any further Kusto Query Language processing on the query (I can only merge the tables).
For information, these are my Log Analytics Heartbeat query and my Resource Graph SLA query:
let timeRangeStart = {TimeRange:start};
let timeRangeEnd = {TimeRange:end};
Heartbeat
| where ResourceType == "virtualMachines"
| extend ResourceGroup = case(ResourceGroup <> "", ResourceGroup, "On-Prem")
| where TimeGenerated > timeRangeStart and TimeGenerated < timeRangeEnd and Computer in ({Servers})
| extend Resource=tolower(iff(isempty(_ResourceId), Resource, _ResourceId))
| summarize heartbeat_tot = count() by Resource,ResourceGroup, SubscriptionId
| extend total_number_of_buckets=round((timeRangeEnd-timeRangeStart)/1m)
| extend round(availability_rate=heartbeat_tot*100/total_number_of_buckets,2)
| extend availability_rate = min_of(availability_rate, 100)
| order by availability_rate asc
Resources // VMs
| where type == 'microsoft.compute/virtualmachines'
| extend AvSet = properties.availabilitySet.id
| extend AvZone = properties.availabilityZone.id
| extend VMname_SLA = iff(isnotempty(AvZone), AvZone, iff(isnotempty(AvSet), AvSet, id))
| extend SLA_VM = iff(isnotnull(AvZone), '99.99%', iff(isnotnull(AvSet), '99.95%', ''))
| extend managedBy = tolower(id)
| join kind = leftouter (
Resources // Disks
| where type == 'microsoft.compute/disks'
| where isnotempty(managedBy)
| extend managedBy = tolower(managedBy)
// What do Standard HDD disks have as SKU tag??? I used StandardHDD for the time being
| extend Tier_disk = sku.tier
| extend SLA_disk = iff(Tier_disk == 'StandardHDD', '95%', iff(Tier_disk == 'Standard', '99.5%', '99.9%'))
) on managedBy
| extend SLA_tot = iff(isnotempty(SLA_VM), SLA_VM, SLA_disk)
| project managedBy, VMname_SLA, SLA_tot
| order by managedBy asc
How many resources is it?
If it is not a large number of resources, a workaround would be:
run your ARG query in text parameter, and format the results of the query to effectively generate a json array of objects, with id, location, etc that you need. then mark this parameter as hidden
in your Logs query, reference that parameter json text before the query, and use KQL operators to turn that JSON structure into a table. then you can join/filter on that table in the query
it isn't optimal, and won't work well if there are large numbers of resources since every time you run your query you're effectively "uploading" a json blob and then immediately parsing it apart again.

PIG FILTER relation with next row the same relation

i'm searching for a long time now to solve my problem but nearly found nothing helpful.
Hopefully some of you can give me a tip.
I have a relation A with the following format: username, timestamp, ip
For example:
Harald 2014-02-18T16:14:49.503Z 123.123.123.123
Harald 2014-02-18T16:14:51.503Z 123.123.123.123
Harald 2014-02-18T16:14:55.503Z 321.321.321.321
And i want to find out, who changed his ip adress in less then 5 seconds. So the second and the third row should be interesting.
I want do group the relation by username und want to compare the timestamp of the actuall row with the next row. if the ip adress isnt the same and the timestamp is less then 5 seconds bigger, this should be at the output.
could someone help me with that issue?
regards.
first i want to thank you for your time.
but i actually stuck at the Sessionize part.
this is my data comming in:
aoebcu 2014-02-19T14:23:17.503Z 220.61.65.25
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
aoebcu 2014-02-19T14:23:14.503Z 222.117.144.19
jekgru 2014-02-19T14:23:14.503Z 213.56.157.109
zmembx 2014-02-19T14:23:12.503Z 199.188.198.91
qhixcg 2014-02-19T14:23:11.503Z 203.40.104.119
and my code till now looks like this:
hijack_Reduced = FOREACH finalLogs GENERATE ClientUserName, timestamp, OriginalClientIP;
hijack_Filtered = FILTER hijack_Reduced BY OriginalClientIP != '-';
hijack_Sessionized = FOREACH (GROUP hijack_Filtered BY ClientUserName) {
views = ORDER hijack_Filtered BY timestamp;
GENERATE FLATTEN(Sessionize(views)) AS (ClientUserName,timestamp,OriginalClientIP,session_id);
}
but when i run this script, i got the following error Message:
15:36:22 ERROR -
org.apache.pig.tools.pigstats.SimplePigStats.setBackendException(542)
| ERROR 0: Exception while executing [POUserFunc (Name:
POUserFunc(datafu.pig.sessions.Sessionize)[bag] - scope-199 Operator
Key: scope-199) children: null at []]:
java.lang.IllegalArgumentException: Invalid format: "aoebcu"
i already tried a lot, but nothing worked.
do you got an idea?
Regards
While you could write a UDF for this, you can actually make use of the UDFs already available in Apache DataFu to solve this.
My solution involves applying sessionization to the data. Basically you look at consecutive events and assign each event a session ID. If the time elapsed between two events exceeds a specified amount of time, in your case 5 seconds, then the next event gets a new session ID. Otherwise consecutive events get the same session ID. Once each event is assigned its session ID the rest is easy. We group by session ID and look for sessions that have more than one distinct IP address.
I'll walk through my solution.
Suppose you have the following input data. Both Harold and Kumar change their IP addresses. But Harold does it within 5 seconds, while Kumar does not. So the output of our script should just be simply "Harold".
Harold,2014-02-18T16:14:49.503Z,123.123.123.123
Harold,2014-02-18T16:14:51.503Z,123.123.123.123
Harold,2014-02-18T16:14:55.503Z,321.321.321.321
Kumar,2014-02-18T16:14:49.503Z,123.123.123.123
Kumar,2014-02-18T16:14:55.503Z,123.123.123.123
Kumar,2014-02-18T16:15:05.503Z,321.321.321.321
Load the data
data = LOAD 'input' using PigStorage(',')
AS (user:chararray,time:chararray,ip:chararray);
Now define a couple UDFs from DataFu. The Sessionize UDF performs sessionization as I described earlier. The DistinctBy UDF will be used to find the distinct IP addresses within each session.
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1');
Group the data by user, sort by time, and apply the Sessonize UDF. Note that the timestamp must be the first field, as this is what Sessionize expects. This UDF appends a session ID to each tuple.
data = FOREACH data GENERATE time,user,ip;
data_sessionized = FOREACH (GROUP data BY user) {
views = ORDER data BY time;
GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
}
Now that the data is sessionized, we can group by the user and session. I group by user too because I want to spit this value back out. We pass the bag of events into the DistinctBy UDF. Check the documentation of this UDF for a more detailed description. But essentially we will get as many tuples as there are distinct IP addresses per session. Note that I have removed the time from the relation below. This is because 1) it isn't needed, and 2) the DistinctBy in 1.2.0 of DataFu has a bug when handling fields containing dashes, as the time field does.
data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
group.user as user,
SIZE(DistinctBy(data_sessionized)) as distinctIpCount;
Now select all the sessions that had more than one distinct IP address and return the distinct users for these sessions.
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
This produces simply:
Harold
Here is the full source code, which you should be able to paste directly into the DataFu unit tests and run:
/**
define Sessionize datafu.pig.sessions.Sessionize('5s');
define DistinctBy datafu.pig.bags.DistinctBy('1'); -- distinct by ip
data = LOAD 'input' using PigStorage(',') AS (user:chararray,time:chararray,ip:chararray);
data = FOREACH data GENERATE time,user,ip;
data_sessionized = FOREACH (GROUP data BY user) {
views = ORDER data BY time;
GENERATE flatten(Sessionize(views)) as (time,user,ip,session_id);
}
data_sessionized = FOREACH data_sessionized GENERATE user,ip,session_id;
data_sessionized = FOREACH (GROUP data_sessionized BY (user, session_id)) GENERATE
group.user as user,
SIZE(DistinctBy(data_sessionized)) as distinctIpCount;
data_sessionized = FILTER data_sessionized BY distinctIpCount > 1;
data_sessionized = FOREACH data_sessionized GENERATE user;
data_sessionized = DISTINCT data_sessionized;
STORE data_sessionized INTO 'output';
*/
#Multiline private String sessionizeUserIpTest;
private String[] sessionizeUserIpTestData = new String[] {
"Harold,2014-02-18T16:14:49.503Z,123.123.123.123",
"Harold,2014-02-18T16:14:51.503Z,123.123.123.123",
"Harold,2014-02-18T16:14:55.503Z,321.321.321.321",
"Kumar,2014-02-18T16:14:49.503Z,123.123.123.123",
"Kumar,2014-02-18T16:14:55.503Z,123.123.123.123",
"Kumar,2014-02-18T16:15:05.503Z,321.321.321.321"
};
#Test
public void sessionizeUserIpTest() throws Exception
{
PigTest test = createPigTestFromString(sessionizeUserIpTest);
this.writeLinesToFile("input",
sessionizeUserIpTestData);
List<Tuple> result = this.getLinesForAlias(test, "data_sessionized");
assertEquals(result.size(),1);
assertEquals(result.get(0).get(0),"Harold");
}