How to get Last 1 hour data, every 5 minutes, without grouping? - apache-spark-sql

How to trigger every 5 minutes and get data for the last 1 hour? I came up with this but it does not seem to give me all the rows in the last 1 hr. My reasoning is :
Read the stream,
filter data for last 1 hr based on timestamp column, and
write/print using forEachbatch. And
watermark it so that it does not hold on to all the past data.
spark.
readStream.format("delta").table("xxx")
.withWatermark("ts", "60 minutes")
.filter($"ts" > current_timestamp - expr("INTERVAL 60 minutes"))
.writeStream
.format("console")
.trigger(Trigger.ProcessingTime("5 minutes"))
.foreachBatch{ (batchDF: DataFrame, batchId: Long) => batchDF.collect().foreach(println)
}
.start()
Or do I have to use a Window? But I can't seem to get rid of GroupBy if I use Window and I don't want to group.
spark.
readStream.format("delta").table("xxx")
.withWatermark("ts", "1 hour")
.groupBy(window($"ts", "1 hour"))
.count()
.writeStream
.format("console")
.trigger(Trigger.ProcessingTime("5 minutes"))
.foreachBatch{ (batchDF: DataFrame, batchId: Long) =>
print("...entering foreachBatch...\n")
batchDF.collect().foreach(println)
}
.start()

Instead of using spark streaming to execution your spark code every 5 minutes, you should use either an external scheduler (cron, etc...) or API java.util.Timer if you want to schedule processing in your code
Why you shouldn't spark-streaming to schedule spark code execution
If you use spark-streaming to schedule code, you will have two issues.
First issue, spark-streaming processes data only once. So every 5 minutes, only the new records are loaded. You can think of bypassing this by using window function and retrieving aggregated list of rows by using collect_list, or an user defined aggregate function, but then you will meet the second issue.
Second issue, although your treatment will be triggered every 5 minutes, function inside foreachBatch will be executed only if there are new records to process. Without new records during the 5 minutes interval between two execution, nothing happens.
In conclusion, spark streaming is not designed to schedule spark code to be executed at specific time interval.
Solution with java.util.Timer
So instead of using spark streaming, you should use a scheduler, either external such as cron, oozie, airflow, etc... or in your code
If you need to do it in your code, you can use java.util.Timer as below:
import org.apache.spark.sql.functions.{current_timestamp, expr}
import spark.implicits._
val t = new java.util.Timer()
val task = new java.util.TimerTask {
def run(): Unit = {
spark.read.format("delta").table("xxx")
.filter($"ts" > (current_timestamp() - expr("INTERVAL 60 minutes")))
.collect()
.foreach(println)
}
}
t.schedule(task, 5*60*1000L, 5*60*1000L) // 5 minutes
task.run()

Related

How to add 2 minutes delay between jobs in a queue?

I am using Hangfire in ASP.NET Core with a server that has 20 workers, which means 20 jobs can be enqueued at the same time.
What I need is to enqueue them one by one with 2 minutes delay between each one and another. Each job can take 1-45 minutes, but I don't have a problem running jobs concurrently, but I do have a problem starting 20 jobs at the same time. That's why changing the worker count to 1 is not practical for me (this will slow the process a lot).
The idea is that I just don't want 2 jobs to run at the same second since this may make some conflicts in my logic, but if the second job started 2 minutes after the first one, then I am good.
How can I achieve that?
You can use BackgroundJob.Schedule() to run your job run at a specific time:
BackgroundJob.Schedule(() => Console.WriteLine("Hello"), dateTimeToExecute);
Based on that set a date for the first job to execute, and then increase this date to 2 minutes for each new job.
Something like this:
var dateStartDate = DateTime.Now;
foreach (var j in listOfjobsToExecute)
{
BackgroundJob.Schedule(() => j.Run(), dateStartDate);
dateStartDate = dateStartDate.AddMinutes(2);
}
See more here:
https://docs.hangfire.io/en/latest/background-methods/calling-methods-with-delay.html?highlight=delay

How to read data from BigQuery periodically in Apache Beam?

I want to read data from Bigquery periodically in Beam, and the test codes as below
pipeline.apply("Generate Sequence",
GenerateSequence.from(0).withRate(1, Duration.standardMinutes(2)))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(2))))
.apply("Read from BQ", new ReadBQ())
.apply("Convert Row",
MapElements.into(TypeDescriptor.of(MyData.class)).via(MyData::fromTableRow))
.apply("Map TableRow", ParDo.of(new MapTableRowV1()))
;
static class ReadBQ extends PTransform<PCollection<Long>, PCollection<TableRow>> {
#Override
public PCollection<TableRow> expand(PCollection<Long> input) {
BigQueryIO.TypedRead<TableRow> rows = BigQueryIO.readTableRows()
.fromQuery("select * from project.dataset.table limit 10")
.usingStandardSql();
return rows.expand(input.getPipeline().begin());
}
}
static class MapTableRowV1 extends DoFn<AdUnitECPM, Void> {
#ProcessElement
public void processElement(ProcessContext pc) {
LOG.info("String of mydata is " + pc.element().toString());
}
}
Since BigQueryIO.TypedRead is related to PBegin, one trick is done in ReadBQ through rows.expand(input.getPipeline().begin()). However, this job does NOT run every two minutes. How to read data from bigquery periodically?
Look at using Looping Timers. That provides the right pattern.
As written your code would only fire once after sequence is built. For fixed windows you would need an input value coming into the Window for it to trigger. For example, have the pipeline read a Pub/Sub input and then have an agent push events every 2 minutes into the topic/sub.
I am going to assume that you are running in streaming mode here; however, another way to do this would be to use a batch job and then run it every 2 mins from Composer. Reason being if your job is idle for effectively 90 secs (2 min - processing time) your streaming job wasting some resources.
One other note: Look at thinning down you column selection in your BigQuery SQL (to save time and money). Look at using some filter on your SQL to pick up a partition or cluster. Look at using #timestamp filter to only scan records that are new in last N. This could give you better control over how you deal with latency and variability at the db level.
As you have mentioned in the question, BigQueryIO read transforms start with PBegin, which puts it at the start of the Graph. In order to achieve what you are looking for, you will need to make use of the BigQuery client libraries directly within a DoFn.
For an example of this have a look at this
transform
Using a normal DoFn for this will be ok for small amounts of data, but for a large amount of data, you will want to look at implementing that logic in a SDF.

Spark structured streaming groupBy not working in append mode (works in update)

I'm trying to get a streaming aggregation/groupBy working in append output mode, to be able to use the resulting stream in a stream-to-stream join. I'm working on (Py)Spark 2.3.2, and I'm consuming from Kafka topics.
My pseudo-code is something like below, running in a Zeppelin notebook
orderStream = spark.readStream().format("kafka").option("startingOffsets", "earliest").....
orderGroupDF = (orderStream
.withWatermark("LAST_MOD", "20 seconds")
.groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
.agg(
collect_list(struct("attra", "attrb2",...)).alias("orders"),
count("ID").alias("number_of_orders"),
sum("PLACED").alias("number_of_placed_orders"),
min("LAST_MOD").alias("first_order_tsd")
)
)
debug = (orderGroupDF.writeStream
.outputMode("append")
.format("memory").queryName("debug").start()
)
After that, I would expected that data appears on the debug query and I can select from it (after the late arrival window of 20 seconds has expired. But no data every appears on the debug query (I waited several minutes)
When I changed output mode to update the query works immediately.
Any hint what I'm doing wrong?
EDIT: after some more experimentation, I can add the following (but I still don't understand it).
When starting the Spark application, there is quite a lot of old data (with event timestamps << current time) on the topic from which I consume. After starting, it seems to read all these messages (MicroBatchExecution in the log reports "numRowsTotal = 6224" for example), but nothing is produced on the output, and the eventTime watermark in the log from MicroBatchExecution stays at epoch (1970-01-01).
After producing a fresh message onto the input topic with eventTimestamp very close to current time, the query immediately outputs all the "queued" records at once, and bumps the eventTime watermark in the query.
What I can also see that there seems to be an issue with the timezone. My Spark programs runs in CET (UTC+2 currently). The timestamps in the incoming Kafka messages are in UTC, e.g "LAST__MOD": "2019-05-14 12:39:39.955595000". I have set spark_sess.conf.set("spark.sql.session.timeZone", "UTC"). Still, the microbatch report after that "new" message has been produced onto the input topic says
"eventTime" : {
"avg" : "2019-05-14T10:39:39.955Z",
"max" : "2019-05-14T10:39:39.955Z",
"min" : "2019-05-14T10:39:39.955Z",
"watermark" : "2019-05-14T10:35:25.255Z"
},
So the eventTime somehow links of with the time in the input message, but it is 2 hours off. The UTC difference has been subtraced twice. Additionally, I fail to see how the watermark calculation works. Given that I set it to 20 seconds, I would have expected it to be 20 seconds older than the max eventtime. But apparently it is 4 mins 14 secs older. I fail to see the logic behind this.
I'm very confused...
It seems that this was related to the Spark version 2.3.2 that I used, and maybe more concretely to SPARK-24156. I have upgraded to Spark 2.4.3 and here I get the results of the groupBy immediately (well, of course after the watermark lateThreshold has expired, but "in the expected timeframe".

Spring Flux. How to buffer objects for a minute and then emmit Mono

I have a class/method that generates SomeData every second.
I need to:
collect this SomeData into Somewhere<SomeData> for 1 minute,
after 1 minute take collection and prepare some ReportObject
and emmit report via EmitterProcessor<ReportObject>.
How can I implement that using Flux?
You can use something like:
Flux<SomeData> data = //...
Flux<ReportObject> reports = data
// if you want 60 seconds windows
.window(Duration.ofMinutes(1))
// use a reduce operator to create a report
.flatMap(window -> window.reduce(..));

Spark : Data processing using Spark for large number of files says SocketException : Read timed out

I am running Spark in standalone mode on 2 machines which have these configs
500gb memory, 4 cores, 7.5 RAM
250gb memory, 8 cores, 15 RAM
I have created a master and a slave on 8core machine, giving 7 cores to worker. I have created another slave on 4core machine with 3 worker cores. The UI shows 13.7 and 6.5 G usable RAM for 8core and 4core respectively.
Now on this I have to process an aggregate of user ratings over a period of 15 days. I am trying to do this using Pyspark
This data is stored in hourwise files in day-wise directories in an s3 bucket, every file must be around 100MB eg
s3://some_bucket/2015-04/2015-04-09/data_files_hour1
I am reading the files like this
a = sc.textFile(files, 15).coalesce(7*sc.defaultParallelism) #to restrict partitions
where files is a string of this form 's3://some_bucket/2015-04/2015-04-09/*,s3://some_bucket/2015-04/2015-04-09/*'
Then I do a series of maps and filters and persist the result
a.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to do a reduceByKey to get an aggregate score over the span of days.
b = a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
Then I need to make a redis call for the actual terms for the items the user has rated, so I call mapPartitions like this
final_scores = b.mapPartitions(get_tags)
get_tags function creates a redis connection each time of invocation and calls redis and yield a (user, item, rate) tuple
(The redis hash is stored in the 4core)
I have tweaked the settings for SparkConf to be at
conf = (SparkConf().setAppName(APP_NAME).setMaster(master)
.set("spark.executor.memory", "5g")
.set("spark.akka.timeout", "10000")
.set("spark.akka.frameSize", "1000")
.set("spark.task.cpus", "5")
.set("spark.cores.max", "10")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.max.mb", "10")
.set("spark.shuffle.consolidateFiles", "True")
.set("spark.files.fetchTimeout", "500")
.set("spark.task.maxFailures", "5"))
I run the job with driver-memory of 2g in client mode, since cluster mode doesn't seem to be supported here.
The above process takes a long time for 2 days' of data (around 2.5hours) and completely gives up on 14 days'.
What needs to improve here?
Is this infrastructure insufficient in terms of RAM and cores (This is offline and can take hours, but it has got to finish in 5 hours or so)
Should I increase/decrease the number of partitions?
Redis could be slowing the system, but the number of keys is just too huge to make a one time call.
I am not sure where the task is failing, in reading the files or in reducing.
Should I not use Python given better Spark APIs in Scala, will that help with efficiency as well?
This is the exception trace
Lost task 4.1 in stage 0.0 (TID 11, <node>): java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:164)
at org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:227)
at org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at org.apache.http.util.EntityUtils.consume(EntityUtils.java:88)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.releaseConnection(HttpMethodReleaseInputStream.java:102)
at org.jets3t.service.impl.rest.httpclient.HttpMethodReleaseInputStream.close(HttpMethodReleaseInputStream.java:194)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.seek(NativeS3FileSystem.java:152)
at org.apache.hadoop.fs.BufferedFSInputStream.seek(BufferedFSInputStream.java:89)
at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:63)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:126)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:236)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:212)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:93)
at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:92)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:405)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1617)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
I could really use some help, thanks in advance
Here is what my main code looks like
def main(sc):
f=get_files()
a=sc.textFile(f, 15)
.coalesce(7*sc.defaultParallelism)
.map(lambda line: line.split(","))
.filter(len(line)>0)
.map(lambda line: (line[18], line[2], line[13], line[15])).map(scoring)
.map(lambda line: ((line[0], line[1]), line[2])).persist(StorageLevel.MEMORY_ONLY_SER)
b=a.reduceByKey(lambda x, y: x+y).map(aggregate)
b.persist(StorageLevel.MEMORY_ONLY_SER)
c=taggings.mapPartitions(get_tags)
c.saveAsTextFile("f")
a.unpersist()
b.unpersist()
The get_tags function is
def get_tags(partition):
rh = redis.Redis(host=settings['REDIS_HOST'], port=settings['REDIS_PORT'], db=0)
for element in partition:
user = element[0]
song = element[1]
rating = element[2]
tags = rh.hget(settings['REDIS_HASH'], song)
if tags:
tags = json.loads(tags)
else:
tags = scrape(song, rh)
if tags:
for tag in tags:
yield (user, tag, rating)
The get_files function is as:
def get_files():
paths = get_path_from_dates(DAYS)
base_path = 's3n://acc_key:sec_key#bucket/'
files = list()
for path in paths:
fle = base_path+path+'/file_format.*'
files.append(fle)
return ','.join(files)
The get_path_from_dates(DAYS) is
def get_path_from_dates(last):
days = list()
t = 0
while t <= last:
d = today - timedelta(days=t)
path = d.strftime('%Y-%m')+'/'+d.strftime('%Y-%m-%d')
days.append(path)
t += 1
return days
As a small optimization, I have created two separate tasks, one to read from s3 and get additive sum, second to read transformations from redis. The first tasks has high number of partitions since there are around 2300 files to read. The second one has much lesser number of partitions to prevent redis connection latency, and there is only one file to read which is on the EC2 cluster itself. This is only partial, still looking for suggestions to improve ...
I was in a similar usecase: doing coalesce on a RDD with 300,000+ partitions. The difference is that I was using s3a(SocketTimeoutException from S3AFileSystem.waitAysncCopy). Finally the issue was resolved by setting a larger fs.s3a.connection.timeout(Hadoop's core-site.xml). Hopefully you can get a clue.