influxdb v2 query requesting microseconds-since-epoch timestamp - api

When I query InfluxDB v2 using the /api/v2/query API, timestamps are returned in RFC3339 format -- something that is ridiculously slow to parse back into a useful timestamp. InfluxDB 1.x used to allow specifying seconds/microseconds/nanoseconds since the epoch. How does one do this in v2?

Since the InfluxDB 2.0 API documentation doesn't provide a way to return epoch timestamps directly, my approach is to convert the datetime to a uint in Flux. The conversion is cheap, and much faster than parsing RFC3339 on the client side:
|> map(fn: (r) => ({r with epoch_ms: uint(v: r._time) / uint(v: 1000000)}))
|> drop(columns: ["_start", "_stop", "_time"])
https://docs.influxdata.com/influxdb/v2.1/api/#operation/PostQuery
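For completeness, here is a minimal client-side sketch (assuming a local InfluxDB 2.x instance, placeholder org/token/bucket names and a -1h range, and the Python requests library) that posts the Flux above to /api/v2/query and reads epoch_ms out of the CSV response:
import csv
import io
import requests

# Placeholder connection details -- adjust to your instance.
URL = "http://localhost:8086/api/v2/query?org=my-org"
TOKEN = "my-token"

flux = """
from(bucket: "my-bucket")
  |> range(start: -1h)
  |> map(fn: (r) => ({r with epoch_ms: uint(v: r._time) / uint(v: 1000000)}))
  |> drop(columns: ["_start", "_stop", "_time"])
"""

resp = requests.post(
    URL,
    headers={
        "Authorization": f"Token {TOKEN}",
        "Content-Type": "application/vnd.flux",
        "Accept": "application/csv",
    },
    data=flux,
)
resp.raise_for_status()

# The response is CSV (possibly with annotation rows); pick epoch_ms out of each table.
header = None
for row in csv.reader(io.StringIO(resp.text)):
    if not row or row[0].startswith("#"):
        continue                  # skip blank lines and annotation rows
    if "epoch_ms" in row:
        header = row              # header row of a (new) result table
        continue
    record = dict(zip(header, row))
    print(int(record["epoch_ms"]))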

Related

Fetching results from Celery backend is abnormally slow

I'm using Celery with a Redis broker to do some "heavy" processing for my Django app. Everything is running locally in Docker containers on WSL2.
The tasks output a JSON payload of roughly 2.5 MB, and it takes up to 9 seconds to retrieve the result via get() in the Django app. For smaller payloads, the time goes down.
I tried increasing the RAM and CPU for WSL2 up to 6 CPUs and 8 GB RAM. Celery was configured with --max-memory-per-child=1024000 --concurrency=4.
I've tried different result_backend configurations, with similar results:
Redis
RPC
SQLite with SQLAlchemy
I tried setting a polling interval when using SQLite (it doesn't matter for RPC & Redis), which gave about a 0.5 s improvement: get(interval=0.01).
I also tried changing the result_serializer from JSON to pickle, which only made performance worse. But I don't think the serializer is the culprit here, as serializing/deserializing the same payload is pretty fast in the console:
>>> pickled = pickle.dumps(big_dict, 0)
>>> timeit.timeit(lambda: pickle.dumps(big_dict, 0), number=10)
0.567067899999528
>>> timeit.timeit(lambda: pickle.loads(pickled), number=10)
0.3542163999991317
I tried using compression; only zlib seemed to provide a small gain.
I'm not too familiar with this setup, but IMHO I should be able to retrieve results faster. The best I could achieve was 6 seconds. Any idea how to improve this, or how to explain it?
settings.py
CELERY_BROKER_URL = "redis://{host}:{port}/{db}".format(
    host=os.environ.get('REDIS_HOST'),
    port=os.environ.get('REDIS_PORT'),
    db=os.environ.get('CELERY_REDIS_DB')
)
CELERY_RESULT_BACKEND = "redis://{host}:{port}/{db}".format(
    host=os.environ.get('REDIS_HOST'),
    port=os.environ.get('REDIS_PORT'),
    db=os.environ.get('CELERY_REDIS_DB')
)
# CELERY_RESULT_BACKEND = 'db+sqlite:///celery.sqlite'  # SQL example (needs SQLAlchemy==1.4.29 in requirements.txt)
# CELERY_RESULT_BACKEND = 'rpc://localhost'  # RPC example
CELERY_ACCEPT_CONTENT = ['json']
CELERY_TASK_SERIALIZER = 'json'
CELERY_RESULT_SERIALIZER = 'json'
Thanks
Redis has a reputation for being bad at dealing with large objects and is not generally intended to be a large-object store. You're better off using a general-purpose RDBMS or a file store and returning a key with which the JSON can be retrieved.
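As a rough sketch of that pattern (hypothetical task name and paths, with a local directory standing in for whatever store you pick), the task persists the payload itself and the Celery result is just a small reference:
import json
import uuid
from pathlib import Path

from celery import Celery

app = Celery("tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/1")

RESULT_DIR = Path("/tmp/task_results")  # stand-in for S3, a shared volume, a BLOB column, ...

@app.task
def heavy_processing():
    big_payload = {"rows": list(range(100_000))}   # placeholder for the real ~2.5 MB JSON
    RESULT_DIR.mkdir(parents=True, exist_ok=True)
    path = RESULT_DIR / f"{uuid.uuid4()}.json"
    path.write_text(json.dumps(big_payload))
    return str(path)                               # only this short string goes through the result backend

# In the Django view:
#   path = heavy_processing.delay().get()
#   data = json.loads(Path(path).read_text())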

java/jdbc timeout in clojure

I am trying to add a timeout to jdbc/query and jdbc/execute!. Somewhere on the web I found that both functions take :timeout as an option. The documentation also says the options are passed to prepare-statement, which takes :timeout as an option.
My function calls look like,
(jdbc/query db-read-spec query {:timeout 2000})
(jdbc/execute! db-write-spec query {:timeout 2000})
Is this how it is done? If yes, How do I test this?
If there is different way of doing this which is testable, that works too.
The :timeout option causes .setQueryTimeout to be called on the PreparedStatement used under the hood of clojure.java.jdbc. It is in seconds, not milliseconds, so your query would have to be extremely slow for a timeout of 2,000 seconds (just over half an hour) to take effect.
JDBC supports several different timeouts across several of its classes. For example, javax.sql.DataSource supports .setLoginTimeout (also in seconds), as does java.sql.DriverManager.
There are also database-specific options you can add to the connection string (which you can add as additional key/value pairs in your "db-spec") to control lower-level timeouts. For example, MySQL supports connectTimeout and socketTimeout in the connection string -- and both of those are in milliseconds. clojure.java.jdbc allows those to be provided in your "db-spec" hash map as :connectTimeout and :socketTimeout keys respectively.
Note that clojure.java.jdbc is considered "Stable" at this point and all current and future development effort is focused on next.jdbc. next.jdbc makes it easier to use loginTimeout since it operates on JDBC objects directly, so the whole (Java) API is available as well. It also has built-in support for connection pooling and is, overall, simpler and faster than clojure.java.jdbc.
You can leverage a query hint on MySQL SELECT queries (time in ms):
SELECT /*+ MAX_EXECUTION_TIME(1000) */ * FROM t1 INNER JOIN t2 WHERE....
then you can just wrap your queries:
(defn timed-query [db query t]
  (j/query db [(str (subs query 0 6)
                    (format " /*+ MAX_EXECUTION_TIME(%s) */ " t)
                    (subs query 7))]))
and test:
(deftest test-query-timeout
  (is (thrown? Exception (timed-query db "select * from Employees where id>5" 1))))
You would need a much more complex query for this to work with a 1 ms limit.
I figured out a workaround to test this. Since I use Postgres, I could leverage select pg_sleep(time-in-seconds).
My test looks like:
(is (thrown-with-msg? PSQLException #"ERROR: canceling statement due to user request"
                      (fetch-or-save "select pg_sleep(3)")))

Spark structured streaming groupBy not working in append mode (works in update)

I'm trying to get a streaming aggregation/groupBy working in append output mode, to be able to use the resulting stream in a stream-to-stream join. I'm working on (Py)Spark 2.3.2, and I'm consuming from Kafka topics.
My pseudo-code is something like below, running in a Zeppelin notebook
orderStream = spark.readStream().format("kafka").option("startingOffsets", "earliest").....
orderGroupDF = (orderStream
    .withWatermark("LAST_MOD", "20 seconds")
    .groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
    .agg(
        collect_list(struct("attra", "attrb2", ...)).alias("orders"),
        count("ID").alias("number_of_orders"),
        sum("PLACED").alias("number_of_placed_orders"),
        min("LAST_MOD").alias("first_order_tsd")
    )
)
debug = (orderGroupDF.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("debug")
    .start()
)
After that, I would have expected data to appear on the debug query so I could select from it (after the late-arrival window of 20 seconds has expired). But no data ever appears on the debug query (I waited several minutes).
When I change the output mode to update, the query works immediately.
Any hint as to what I'm doing wrong?
EDIT: after some more experimentation, I can add the following (but I still don't understand it).
When starting the Spark application, there is quite a lot of old data (with event timestamps << current time) on the topic from which I consume. After starting, it seems to read all these messages (MicroBatchExecution in the log reports "numRowsTotal = 6224" for example), but nothing is produced on the output, and the eventTime watermark in the log from MicroBatchExecution stays at epoch (1970-01-01).
After producing a fresh message onto the input topic with eventTimestamp very close to current time, the query immediately outputs all the "queued" records at once, and bumps the eventTime watermark in the query.
What I can also see is that there seems to be an issue with the timezone. My Spark program runs in CET (currently UTC+2). The timestamps in the incoming Kafka messages are in UTC, e.g. "LAST_MOD": "2019-05-14 12:39:39.955595000". I have set spark_sess.conf.set("spark.sql.session.timeZone", "UTC"). Still, the microbatch report after that "new" message has been produced onto the input topic says:
"eventTime" : {
"avg" : "2019-05-14T10:39:39.955Z",
"max" : "2019-05-14T10:39:39.955Z",
"min" : "2019-05-14T10:39:39.955Z",
"watermark" : "2019-05-14T10:35:25.255Z"
},
So the eventTime somehow lines up with the time in the input message, but it is 2 hours off; the UTC difference has been subtracted twice. Additionally, I fail to see how the watermark calculation works. Given that I set it to 20 seconds, I would have expected it to be 20 seconds older than the max event time. But apparently it is 4 min 14 s older. I fail to see the logic behind this.
I'm very confused...
It seems that this was related to the Spark version 2.3.2 that I used, and maybe more concretely to SPARK-24156. I have upgraded to Spark 2.4.3, and there I get the results of the groupBy immediately (well, of course only after the watermark lateThreshold has expired, but "in the expected timeframe").
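For anyone trying to reproduce this, a self-contained sketch of the same pattern (using the built-in rate source instead of Kafka, so the column names here are made up) that shows a watermarked, windowed groupBy emitting results in append mode once the watermark passes a window's end:
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.appName("append-mode-demo").getOrCreate()

# The rate source generates rows with 'timestamp' and 'value' columns.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

agg = (events
    .withWatermark("timestamp", "20 seconds")
    .groupBy(window("timestamp", "10 seconds", "5 seconds"))
    .agg(count("value").alias("n")))

# In append mode a window is only emitted once the watermark
# (max event time seen so far minus 20 seconds) moves past the window's end.
query = (agg.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("debug")
    .start())

# spark.sql("select * from debug").show()  # rows only appear after windows have closed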

Date Format for API 4.0

When attempting to send an event via POST to your API in version 4, I am sending:
"data"=>
{"id"=>"bfc50100-02eb-11e9-b178-db8890d0b369",
"name"=>"Name of Event",
"type"=>nil,
"description"=>nil,
"start_epoch"=>1343815200,
"end_epoch"=>1343869200,
"archived"=>0,
"deleted"=>0,
"is_public"=>0,
"status"=>"ACTIVE",
"has_time"=>1,
"timezone"=>nil,
"legacy_id"=>nil,
"created_at"=>"2018-12-18T17:38:36.000Z",
"updated_at"=>"2018-12-18T17:38:36.000Z",
"industry"=>nil}}
I am receiving success from your API, but when going to the URL for this event, I see the date formatted as 1/18/70, though as a Unix time this should show as 8/1/2012.
This occurs with all dates. Am I missing something? Is there another date format you would like? The term epoch led me to believe that you wanted a standard Unix timestamp (in seconds).
You need to send the Unix timestamp in milliseconds, e.g. 1545326867000 rather than 1545326867.
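In other words (a tiny sanity check in Python, using the start_epoch value from the question):
from datetime import datetime, timezone

start_epoch_s = 1343815200             # seconds, as sent in the question
start_epoch_ms = start_epoch_s * 1000  # 1343815200000 -- what the API appears to expect

# Interpreting the seconds value as milliseconds lands in mid-January 1970,
# which matches the 1970 date shown on the event page.
print(datetime.fromtimestamp(start_epoch_s / 1000, tz=timezone.utc))   # 1970-01-16 ...
print(datetime.fromtimestamp(start_epoch_ms / 1000, tz=timezone.utc))  # 2012-08-01 ...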

Parsing structured syslog with syslog-ng

I am trying to leverage the parsing of structured data feature in syslog-ng. From my firewall, I am forwarding the following message:
<14>1 2012-10-06T11:03:56.493 SRX100 RT_FLOW - RT_FLOW_SESSION_CLOSE [junos@2636.1.1.1.2.36 reason="TCP FIN" source-address="192.168.199.207" source-port="59292" destination-address="184.73.190.157" destination-port="80" service-name="junos-http" nat-source-address="50.193.12.149" nat-source-port="19230" nat-destination-address="184.73.190.157" nat-destination-port="80" src-nat-rule-name="source-nat-rule" dst-nat-rule-name="None" protocol-id="6" policy-name="trust-to-untrust" source-zone-name="trust" destination-zone-name="untrust" session-id-32="9375" packets-from-client="9" bytes-from-client="4342" packets-from-server="7" bytes-from-server="1507" elapsed-time="1" application="UNKNOWN" nested-application="UNKNOWN" username="N/A" roles="N/A" packet-incoming-interface="vlan.0"]
Based on the IETF syslog format, the message appears to be correct, but for some reason the structured data is ending up in the message portion of the log instead of being parsed as structured data.
On the syslog-ng side, you need to use either a syslog() source, or a tcp() source with flags(syslog-proto) set; the structured data will then end up in name-value pairs like ${.SDATA.junos@2636.1.1.1.2.36.reason} and so on, which you can use as you see fit.
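A minimal config sketch of that setup (the transport, port, file path and chosen SDATA fields are just examples; adjust to your environment):
source s_fw {
  syslog(transport("udp") port(514));
};

destination d_flows {
  file("/var/log/srx-flows.log"
       template("${.SDATA.junos@2636.1.1.1.2.36.source-address} -> ${.SDATA.junos@2636.1.1.1.2.36.destination-address} reason=${.SDATA.junos@2636.1.1.1.2.36.reason}\n"));
};

log { source(s_fw); destination(d_flows); };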