apache fink 0.10 Filtering duplicates over an infinite stream with time window purge - purge

how can i filter out duplicates over an infinite stream with a time window purge? I dont have infinite space / ram and i know that after say, 2 seconds (on the local clock), any duplicate that can occur will have occured. which means that after 2 seconds i can throw away (purge) the old data.
Filtering duplicates over an infinite stream with time window purge.
I got an excellent answer to how to remove the duplicates here in this question (big thanks to Till): apache flink 0.10 how to get the first occurence of a composite key from an unbounded input dataStream?
but i dont know how to tell flink to throw away the old data after 2 seconds (local time).
how can i do this with flink 0.10 please?
Thanks alot!!!
here is the statement which removes duplicates but doesnt purge:
input.keyBy(0, 1).flatMap(new DuplicateFilter()).print();
if I add .timeWindow(Time.minutes(1), Time.seconds(30)) after keyBy(0, 1) its not compilable.

Thanks to Till - the answer is given in an update to the following link:
apache flink 0.10 how to get the first occurence of a composite key from an unbounded input dataStream?
see the update.

Related

Airflow Operator BigQueryTablePartitionExistenceSensor Question

I'm trying to use this BigQueryTablePartitionExistenceSensor operator in Airflow and I was wondering if this operator checks whether the partition is fully loaded or can potentially mark to success even if the data isn't complete yet.
For example, if my table is partitioned on DAY and the load for 20220420 has started but isn't complete, would this sensor trigger? Or, would it wait until that load step has been completed before marking the sensor to success?
Thanks
The Operator will not wait until your data has loaded, it will just check for the existence of the partition value at that moment in time. So if a single row gets inserted into that partition then this sensor would return True. See the sensor code that gets called by this operator.
An idea I've used in the past for similar problems has been to use a sentinel Label on the partitioned table to mark a load as "in-progress" or "done"
As has already been answered, it does not await anything except the existence of the partition.
If your data is streamed into partitions, and you have ordered delivery, you can probably add a sensor for the next-day partition — on the assumption that the previous day is complete when events have started streaming into the next.
If the load is managed by the same Airflow instance, I'd suggest using an ExternalTaskSensor on the load job. If not, you might be able to use the more generic SqlSensor, and run a custom SQL query on metadata tables to determine if a partition is complete, perhaps you can add a label or something with the Load job that you then query for.

Apache Spark: count vs head(1).isEmpty

For a given spark df, I want to know if a certain column has null value or not. The code I had was -
if (df.filter(col(colName).isNull).count() > 0) {//throw exception}
This was taking long and was being called 2 times for 1 df since I was checking for 2 columns. Each time it was called, I saw a job for count, so 2 jobs for 1 df.
I then changed the code to look like this -
if (!df.filter(col(colName).isNull).head(1).isEmpty) {//throw exception}
With this change, I now see 4 head jobs compared to the 2 count jobs before, increasing the overall time.
Can you experts please help me understand why the number of jobs doubled? The head function should be called only 2 times.
Thanks for your help!
N
Update: added screenshot showing the jobs for both cases. The left side shows the one with count and right side is the head. That's the only line that is different between the 2 runs.
dataframe.head(1) does 2 things -
1. Executes the action behind the dataframe on executor(s).
2. Collects 1st row of the result from executor(s) to the driver.
dataframe.count() does 2 things -
1. Executes the action behind the dataframe on executor(s). If there are no transformation on the file and parquet format is used then it is basically scanning the statistics of the file(s).
2. Collects count from executor(s) to the driver.
Based on the source of dataframe being a file which stores statistics and absence of any transformation, count() can run faster than head.
I am not 100% sure why there are 2 jobs vs 4. Can you please paste the screenshot.
Is hard to say just looking for this line of code, but there is one reason for head can taking more time. head is a deterministic request if you have sort or order_by in any part that will request a shuffle to always return the first row. With the case of count you don't need the result ordered, so there is no need to shuffle, basic a simple mapreduce step. That is probably why your head can taking more time.

How to increase the "uniqueness" of the apache session

We are running an application with a lot of visitors on an Apache 2.5 with php5.6.
After considering our sessionids as unique for the longest time we discovered, that after about 12 months duplicates of sessionids are generated, which mess up our saved records in the database, which is connected to the sessionid as identifier.
Is there a possibility to make the sessionid "more" unique to reduce the possibility of duplicates?
Looks like I created a duplicate with
How unique is the php session id
So increasing the entropy length should help.
My system has as current configuration
session.entropy_length = 32;
session.entropy_file = /dev/urandom;
the other solution suggests 512 instead of 32.
However the main key to avoid duplicates (or make them less probable) is to increase the length of the session string. I guess we will add 4 characters more, that should help with our problem.

Target Based commit point while updating into table

One of my mappings is running for a really long time (2 hours).From the session log i can see the statment "Time out based commit poin" which is tking most of the time and Busy percentage for the SQL tranfsormation is very high(which is taking time,I ran the SQL query manually in DB,its working fine ).So, basically there is a router which splits the record between insert and update.And the update stream is taking long.It has a SQL transforamtion,Update statrtergy and aggregator.I added an sorter before aggregator but no luck.
Also changed comit interval ,Lins Sequential Buffer lenght and Maximum memory allowed by checking some of the other blogs.Could you please help me with this.
If possible try to avoid the transformations which are creating cache because in future if the input records increase. Cache size will also increase and decrease the throughput
1) Aggregator : Try to use the Aggregation in SQL override itself
2) Sorter : Try to do the same in the SQL Override itself
Generally SQL transformation is slow for huge data loads, because for each input record an SQL session is invoked and a connection is established to database and the row is fetched. Say for example there are 1 million records, 1 million SQL sessions are initiated in the backend and the database is called.
What the SQL transformation doing ? Is it just generating a Surrogate key or its fetching a value from a table based on derived value from the stream
For fetching a value from a table based on derived value from the stream:
Try to use lookup
For generating Surrogate key, Use Oracle Sequence instead
Let me know if its purpose is any thing other than that
Also do the below checks
Sort the session log on thread and just make a note of start and end times of
the following
1) lookup caches creation (time between Query issued --> First row returned --> Cache creation completed)
2) Reader thread first row return time
Regards,
Raj

Prevent output for query --destination_table command

Is there a way to prevent screen output for the query --destination_table?
I wan to move data sets through the workflow, but not necessarily see the all the rows
bug on job_73d3dffab7974d9db360f5c31a3a9fa7
This is a known issue, we'll fix it in the next version of bq. To work around, you can add --max_rows=0. This only changes the number of rows that get sent back, not the number of rows that get returned by the query (you can use LIMIT N for that in the query).