Spark Structured Streaming Window() function - GeneratedIterator grows beyond 64 KB - apache-spark-sql

I am running the following Sliding Window SQL query using Spark Structured Streaming approach.
"SELECT WINDOW(record_time, \"120 seconds\",\"1 seconds\"), COUNT(*) FROM records GROUP BY WINDOW(record_time, \"120 seconds\",\"1 seconds\")";
I am getting the following error if I keep the window Size as 120 seconds and sliding interval as 1 seconds:
org.codehaus.janino.JaninoRuntimeException: Code of method "agg_doAggregateWithKeys$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$GeneratedIterator;)V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" grows beyond 64 KB
For Window(90s,1s) and Window(120s,2s) its working fine.
Even though I got this error, but still I got the output for the query on the console.
Is this Ok? Should I ignore this error?

Just try saying 'window' instead. So, your query should look as follows:
SELECT window, COUNT(*) FROM records GROUP BY WINDOW(record_time, "120 seconds","1 seconds");

Related

"The maximum total size of all input parameters is 5242880 bytes." Error message when using BigQuery remote function

When calling a remote function on BigQuery, I get the following error The maximum total size of all input parameters is 5242880 bytes.. This error is undocumented (or my Googling ability is lacking). It is even more unexpected as the query that pop this error is as simple as it gets:
SELECT
`remote_function`( payload ) AS results
FROM
`source_table`
The error is definitely on BQ's side, as no error is observed on the cloud-function called.
How to proceed on such issue, I'm guessing I hitting some kind of limit related to remote functions, but which one?

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application, that reads something like 70K records from the DB to a data frame, each record has 2 fields.
After reading the data from the DB, we make minor mapping and load this as a broadcast for later usage.
Now, in local environment, there is an exception, timeout from the RetryingBlockFetcher while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR
org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while
beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
at
org.apache.spark.network.client.
TransportClientFactory.createClient(Transpor .tClientFactory.java:253)
at
org.apache.spark.network.client.
TransportClientFactory.createClient(TransportClientFactory.java:195)
at
org.apache.spark.network.netty.
NettyBlockTransferService$$anon$2.
createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the spark session with local "spark.master"
When I limit the max of records to 20K, it works well.
Can you please help? maybe I need to configure something in my local environment in order that the original code will work properly?
Update:
I tried to change a lot of Spark-related configurations in my local environment, both memory, a number of executors, timeout-related settings, and more, but nothing helped! I just got the timeout after more time...
I realized that the data frame that I'm reading from the DB has 1 partition of 62K records, while trying to repartition with 2 or more partitions the process worked correctly and I managed to map and collect as needed.
Any idea why this solves the issue? Is there a configuration in the spark that can solve this instead of repartition?
Thanks!

Spark structured streaming groupBy not working in append mode (works in update)

I'm trying to get a streaming aggregation/groupBy working in append output mode, to be able to use the resulting stream in a stream-to-stream join. I'm working on (Py)Spark 2.3.2, and I'm consuming from Kafka topics.
My pseudo-code is something like below, running in a Zeppelin notebook
orderStream = spark.readStream().format("kafka").option("startingOffsets", "earliest").....
orderGroupDF = (orderStream
.withWatermark("LAST_MOD", "20 seconds")
.groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
.agg(
collect_list(struct("attra", "attrb2",...)).alias("orders"),
count("ID").alias("number_of_orders"),
sum("PLACED").alias("number_of_placed_orders"),
min("LAST_MOD").alias("first_order_tsd")
)
)
debug = (orderGroupDF.writeStream
.outputMode("append")
.format("memory").queryName("debug").start()
)
After that, I would expected that data appears on the debug query and I can select from it (after the late arrival window of 20 seconds has expired. But no data every appears on the debug query (I waited several minutes)
When I changed output mode to update the query works immediately.
Any hint what I'm doing wrong?
EDIT: after some more experimentation, I can add the following (but I still don't understand it).
When starting the Spark application, there is quite a lot of old data (with event timestamps << current time) on the topic from which I consume. After starting, it seems to read all these messages (MicroBatchExecution in the log reports "numRowsTotal = 6224" for example), but nothing is produced on the output, and the eventTime watermark in the log from MicroBatchExecution stays at epoch (1970-01-01).
After producing a fresh message onto the input topic with eventTimestamp very close to current time, the query immediately outputs all the "queued" records at once, and bumps the eventTime watermark in the query.
What I can also see that there seems to be an issue with the timezone. My Spark programs runs in CET (UTC+2 currently). The timestamps in the incoming Kafka messages are in UTC, e.g "LAST__MOD": "2019-05-14 12:39:39.955595000". I have set spark_sess.conf.set("spark.sql.session.timeZone", "UTC"). Still, the microbatch report after that "new" message has been produced onto the input topic says
"eventTime" : {
"avg" : "2019-05-14T10:39:39.955Z",
"max" : "2019-05-14T10:39:39.955Z",
"min" : "2019-05-14T10:39:39.955Z",
"watermark" : "2019-05-14T10:35:25.255Z"
},
So the eventTime somehow links of with the time in the input message, but it is 2 hours off. The UTC difference has been subtraced twice. Additionally, I fail to see how the watermark calculation works. Given that I set it to 20 seconds, I would have expected it to be 20 seconds older than the max eventtime. But apparently it is 4 mins 14 secs older. I fail to see the logic behind this.
I'm very confused...
It seems that this was related to the Spark version 2.3.2 that I used, and maybe more concretely to SPARK-24156. I have upgraded to Spark 2.4.3 and here I get the results of the groupBy immediately (well, of course after the watermark lateThreshold has expired, but "in the expected timeframe".

How to avoid Hitting the 10 sec limit per user

We run multiple short queries in parallel, and hit the 10 sec limit.
According to the docs, throttling might occur if we hit a limit of 10 API requests per user per project.
We send a "start query job", and then we call the "getGueryResutls()" with timeoutMs of 60,000, however, we get a response after ~ 1 sec, we look for JOB Complete in the JSON response, and since it is not there, we need to send the GetQueryResults() again many times and hit the threshold, that is causing an error, not a slowdown. the sample code is below.
our questions are as such:
1. What is a "user" is it an appengine user, is it a user-id that we can put in the connection string or in the query itslef?
2. Is it really per API project of BigQuery?
3. What is the behavior?we got an error: "Exceeded rate limits: too many user/method api request limit for this user_method", and not a throttling behavior as the doc say and all of our process fails.
4. As seen below in the code, why we get the response after 1 sec & not according to our timeout? are we doing something wrong?
Thanks a lot
Here is the a sample code:
while (res is None or 'jobComplete' not in res or not res['jobComplete']) :
try:
res = self.service.jobs().getQueryResults(projectId=self.project_id,
jobId=jobId, timeoutMs=60000, maxResults=maxResults).execute()
except HTTPException:
if independent:
raise
Are you saying that even though you specify timeoutMs=60000, it is returning within 1 second but the job is not yet complete? If so, this is a bug.
The quota limits for getQueryResults are actually currently much higher than 10 requests per second. The reason the docs say only 10 is because we want to have the ability to throttle it down to that amount if someone is hitting us too hard. If you're currently seeing an error on this API, it is likely that you're calling it at a very high rate.
I'll try to reproduce the problem where we don't wait for the timeout ... if that is really what is happening it may be the root of your problems.
def query_results_long(self, jobId, maxResults, res=None):
start_time = query_time = None
while res is None or 'jobComplete' not in res or not res['jobComplete']:
if start_time:
logging.info('requested for query results ended after %s', query_time)
time.sleep(2)
start_time = datetime.now()
res = self.service.jobs().getQueryResults(projectId=self.project_id,
jobId=jobId, timeoutMs=60000, maxResults=maxResults).execute()
query_time = datetime.now() - start_time
return res
then in appengine log I had this:
requested for query results ended after 0:00:04.959110

Sort by minute doesn't work

I'm trying to sort my photos by hour and minute but it will not catch the minutes - it's just keep saying that nothing exists for that hour and minute. If I'm trying to sort only after the hours, it works perfectly. I have tested WHERE DATEPART(minute, exif_taken) = "'.$_GET['min'].'" but I'm keep getting the following error message:
Fatal error: Uncaught exception 'PDOException' with message 'SQLSTATE[42000]: Syntax error or access violation: 1305 FUNCTION gallery.DATEPART does not exist' in ...
I'm using WAMP Server with default settings except for some modules activated for both Apache and PHP like mod_rewrite and php_exif. Here's how my SQL query looks like:
SELECT *
FROM photos
WHERE HOUR(exif_taken) = "'.$_GET['h'].'"
AND MINUTE(exif_taken) = "'.$_GET['min'].'"
ORDER BY exif_taken DESC
$_GET['h'] is the hour and $_GET['min'] the minute.
How can I solve my problem?
Thanks in advance.
Are you using MySQL? If true, DATEPART() function doesn't exist. You shoul use MINUTE() function..
Here is the complete documentation.
https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html
BTW, for god sake, sanitize your _GET vars before send to the SQL query. You are exposing your system to SQL Injections.
In some odd way it's working perfectly now with the SQL query I posted in my question O.o