Mongodb Long Query data is not being inserted - ruby-on-rails-3

We are facing a very weird problem with an insert query: it works fine on local MongoDB but causes problems on MongoHQ.
When the data in the insert query is a little long, Rails/Mongoid and MongoHQ return true but the data is not actually inserted. The following queries are logged in the cases below.
Failure cases:
Dev-Stats['gm_metrics'].insert([
{"_id"=>BSON::ObjectId('50c01eccb4e36b0002000003'), "metric_name"=>2, "count"=>1, "parameters"=>[{"param"=>"user_id", "value"=>"4f0be09cc8c1950001000001"}, {"param"=>"created_at", "value"=>"2012-12-06 04:27:56 +0000"}, {"param"=>"game_id", "value"=>"4e8d712aa510cb0001000002"}], "updated_at"=>2012-12-06 04:27:56 UTC, "created_at"=>2012-12-06 04:27:56 UTC}
])
Our Rails query, from which the above MongoDB query is generated:
GmMetric.with(:safe=>false).create({:metric_name=>2, :count=>1,:parameters => [{"param"=>"user_id", "value"=>"4f0be09cc8c1950001000001"},{"param"=>"created_at", :value=>"#{Time.now}"},{"param"=>"game_id", :value1=>"4e8d712aa510cb0001000002"}]})
Dev-Stats['gm_metrics'].insert([{"_id"=>BSON::ObjectId('50c01eccb4e36b0002000003'), "metric_name"=>2, "count"=>1, "parameters"=>[{"param"=>"user_id", "value"=>"123456"}, {"param"=>"created_at", "value"=>"2012-12-06 04:27:56 +0000"}, {"param"=>"game_id", "value"=>"123456"},{"param"=>"game_id", "value"=>"123456"}], "updated_at"=>2012-12-06 04:28:56 UTC, "created_at"=>2012-12-06 04:28:56 UTC}])
Success Cases
Dev-Stats['gm_metrics'].insert([{"_id"=>BSON::ObjectId('50c01eccb4e36b0002000003'), "metric_name"=>2, "count"=>1, "parameters"=>[{"param"=>"user_id", "value"=>"4f0be09cc8c1950001000001"}, {"param"=>"created_at", "value"=>"2012-12-06 04:27:56 +0000"}, {"param"=>"game_id", "value"=>"123456"}], "updated_at"=>2012-12-06 04:28:56 UTC, "created_at"=>2012-12-06 04:28:56 UTC}])
Dev-Stats['gm_metrics'].insert([{"_id"=>BSON::ObjectId('50c01eccb4e36b0002000003'), "metric_name"=>2, "count"=>1, "parameters"=>[{"param"=>"user_id", "value"=>"4f0be09cc8c1950001000001"}, {"param"=>"created_at", "value"=>"4f0be09cc8c1950001000001"}], "updated_at"=>2012-12-06 04:28:56 UTC, "created_at"=>2012-12-06 04:28:56 UTC}])
From the above cases it is confirmed that when the query is long the data is not inserted into the database, but when we try to insert with fewer parameters and short values, we always succeed. Please guide us on how to fix it.

Try inserting with the safe: true option (I'm not much into Ruby, but it will give you enough insight into why it is failing).
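For illustration only (in Python with pymongo rather than the Ruby driver, and with a placeholder connection string): with an acknowledged ("safe") write concern the driver raises if the server rejects the document, instead of returning true while the data silently goes missing.

from pymongo import MongoClient
from pymongo.errors import PyMongoError
from pymongo.write_concern import WriteConcern

# Placeholder MongoHQ-style URI; use your own host and credentials.
client = MongoClient("mongodb://user:pass@example.mongohq.com:10000/dev-stats")
coll = client["dev-stats"].get_collection(
    "gm_metrics",
    write_concern=WriteConcern(w=1),  # acknowledged ("safe") writes
)

doc = {
    "metric_name": 2,
    "count": 1,
    "parameters": [
        {"param": "user_id", "value": "4f0be09cc8c1950001000001"},
        {"param": "game_id", "value": "4e8d712aa510cb0001000002"},
    ],
}

try:
    coll.insert_one(doc)
except PyMongoError as exc:
    # With acknowledged writes the server's reason for refusing the insert shows up here.
    print("insert failed:", exc)

The Ruby driver exposes the same idea; the point is that an unacknowledged ("unsafe") write returns before the server has accepted the document, so errors are never reported back.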

java/jdbc timeout in clojure

I am trying to add a timeout to jdbc/query and jdbc/execute!. Somewhere on the web I found that both functions take :timeout as an option. The documentation also says the options are passed to prepare-statement, which takes :timeout as an option.
My function calls look like,
(jdbc/query db-read-spec query {:timeout 2})
(jdbc/execute! db-write-spec query {:timeout 2})
Is this how it is done? If yes, how do I test this?
If there is a different way of doing this which is testable, that works too.
The :timeout option causes .setQueryTimeout to be called on the PreparedStatement used under the hood of clojure.java.jdbc. It is in seconds, not milliseconds, so :timeout 2 cancels the statement after two seconds; if you passed a milliseconds-style value such as 2000, your query would have to be extremely slow for that timeout of 2,000 seconds (just over half an hour) to take effect.
JDBC supports several different timeouts across several of its classes. For example, javax.sql.DataSource supports .setLoginTimeout (also in seconds), as does java.sql.DriverManager.
There are also database-specific options you can add to the connection string (which you can add as additional key/value pairs in your "db-spec") to control lower-level timeouts. For example, MySQL supports connectTimeout and socketTimeout in the connection string -- and both of those are in milliseconds. clojure.java.jdbc allows those to be provided in your "db-spec" hash map as :connectTimeout and :socketTimeout keys respectively.
Note that clojure.java.jdbc is considered "Stable" at this point and all current and future development effort is focused on next.jdbc. next.jdbc makes it easier to use loginTimeout since it operates on JDBC objects directly, so the whole (Java) API is available as well. It also has built-in support for connection pooling and is, overall, simpler and faster than clojure.java.jdbc.
You can leverage the MAX_EXECUTION_TIME query hint on MySQL SELECT queries (time in ms):
SELECT /*+ MAX_EXECUTION_TIME(1000) */ * FROM t1 INNER JOIN t2 WHERE....
then you can just wrap your queries:
(defn timed-query [db query t]
  (j/query db [(str (subs query 0 6)
                    (format " /*+ MAX_EXECUTION_TIME(%s) */ " t)
                    (subs query 7))]))
and test:
(deftest test-query-timeout
  (is (thrown? Exception (timed-query db "select * from Employees where id>5" 1))))
You will need a much more complex query for this to trigger with a 1 ms limit.
I figured out a workaround to test this. Since I use Postgres, I could leverage select pg_sleep(time-in-seconds).
And my test looks like:
(is (thrown-with-msg? PSQLException #"ERROR: canceling statement due to user request"
      (fetch-or-save "select pg_sleep(3)")))

Spark structured streaming groupBy not working in append mode (works in update)

I'm trying to get a streaming aggregation/groupBy working in append output mode, to be able to use the resulting stream in a stream-to-stream join. I'm working on (Py)Spark 2.3.2, and I'm consuming from Kafka topics.
My pseudo-code is something like below, running in a Zeppelin notebook
from pyspark.sql.functions import window, struct, collect_list, count, sum, min

orderStream = spark.readStream.format("kafka").option("startingOffsets", "earliest").....
orderGroupDF = (orderStream
    .withWatermark("LAST_MOD", "20 seconds")
    .groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
    .agg(
        collect_list(struct("attra", "attrb2", ...)).alias("orders"),
        count("ID").alias("number_of_orders"),
        sum("PLACED").alias("number_of_placed_orders"),
        min("LAST_MOD").alias("first_order_tsd")
    )
)
debug = (orderGroupDF.writeStream
    .outputMode("append")
    .format("memory").queryName("debug").start()
)
After that, I would expect data to appear in the debug query so I can select from it (after the late-arrival window of 20 seconds has expired). But no data ever appears in the debug query (I waited several minutes).
When I change the output mode to update, the query works immediately.
Any hint what I'm doing wrong?
EDIT: after some more experimentation, I can add the following (but I still don't understand it).
When starting the Spark application, there is quite a lot of old data (with event timestamps << current time) on the topic from which I consume. After starting, it seems to read all these messages (MicroBatchExecution in the log reports "numRowsTotal = 6224" for example), but nothing is produced on the output, and the eventTime watermark in the log from MicroBatchExecution stays at epoch (1970-01-01).
After producing a fresh message onto the input topic with eventTimestamp very close to current time, the query immediately outputs all the "queued" records at once, and bumps the eventTime watermark in the query.
What I can also see is that there seems to be an issue with the timezone. My Spark program runs in CET (UTC+2 currently). The timestamps in the incoming Kafka messages are in UTC, e.g. "LAST__MOD": "2019-05-14 12:39:39.955595000". I have set spark_sess.conf.set("spark.sql.session.timeZone", "UTC"). Still, the microbatch report after that "new" message has been produced onto the input topic says
"eventTime" : {
"avg" : "2019-05-14T10:39:39.955Z",
"max" : "2019-05-14T10:39:39.955Z",
"min" : "2019-05-14T10:39:39.955Z",
"watermark" : "2019-05-14T10:35:25.255Z"
},
So the eventTime somehow lines up with the time in the input message, but it is 2 hours off: the UTC difference has been subtracted twice. Additionally, I fail to see how the watermark calculation works. Given that I set it to 20 seconds, I would have expected it to be 20 seconds older than the max event time. But apparently it is 4 min 14 s older. I fail to see the logic behind this.
I'm very confused...
It seems that this was related to the Spark version 2.3.2 that I used, and maybe more concretely to SPARK-24156. I have upgraded to Spark 2.4.3 and there I get the results of the groupBy immediately (well, of course only after the watermark lateThreshold has expired, but "in the expected timeframe").
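As an aside for anyone debugging the same symptom: instead of digging through the MicroBatchExecution log lines, you can poll the streaming query's progress from the notebook and watch the event-time watermark advance. A rough sketch, assuming the debug query started above:

import time

# Poll the last micro-batch's metrics; lastProgress is None before the first batch.
for _ in range(10):
    progress = debug.lastProgress
    if progress and "eventTime" in progress:
        # The watermark stays at 1970-01-01 until fresh events move it forward.
        print(progress["eventTime"].get("watermark"),
              "rows:", progress["numInputRows"])
    time.sleep(10)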

Bigquery Stream: Missing data after new table created

We recently noticed that, within a short period after a new table is created, data that was streamed in, without any exceptions or errors, simply went missing. Is there any known grace period the streaming should wait for?
I finally figured out what happened by printing out trace info step by step. Multi-threading helped cover up the issue for a long time.
This is the original 'missing data' code to create a table:
insert = sBIGQUERY.tables().insert(mProjectId, mDataset, table);
logger.info("Table " + tid.toString() + " is created at "
        + new Date(insert.execute().getCreationTime()));
where insert.execute().getCreationTime() never returned (I don't know why), and thus the rest of my process (putting the data back on the sending queue to wait for the next stream) didn't execute.
After I changed it to:
sBIGQUERY.tables().insert(mProjectId, mDataset, table).execute();
logger.info("Table " + tid.toString()+" is created");
It runs properly and we get all the data up to BQ.
@Jordan Tigani, do you know why getCreationTime() never comes back (or takes longer than I can reasonably wait)?
There is a 'warm up' time of a few seconds after streaming first occurs on a table before it is available for querying. There is a similar warm up time if you stop streaming to the table for more than 24 hours and then start again.
See the docs here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
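For reference, the same flow (create the table, then stream rows and check the per-row error list) looks roughly like this with the Python client instead of the Java API used above; this is only a sketch, and the project, dataset, table, and field names are made up:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # made-up names

schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("value", "INTEGER"),
]
client.create_table(bigquery.Table(table_id, schema=schema), exists_ok=True)

# Streaming inserts report problems via the returned error list rather than
# by raising, so always check it instead of assuming success.
errors = client.insert_rows_json(table_id, [{"name": "test", "value": 1}])
if errors:
    print("rows were rejected:", errors)
else:
    print("rows buffered; on a freshly created table they may take a few "
          "seconds to become visible to queries")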

Apache Pig join error 2087 "Found index:0 in multiple LocalRearrange operators"

So I've got two relations:
pageview counts by a GUID and URL: pv_counts
events by the same GUID and URL: ev_counts
I'm trying to join them with joined_counts = JOIN ev_counts BY ev_site_guid, pv_counts BY pv_site_guid;, but I keep getting this error:
ERROR 2087: Unexpected problem during optimization. Found index:0 in multiple LocalRearrange operators.
I've tried using Pig 0.10 and Pig 0.11, but both return the same error.
I've Googled it, but I mostly just come up with the Pig source code, not an explanation of what the error is or means. I've tried making sure I don't have any nulls or empty strings in the keys.
Anyone have any idea what I'm doing wrong?
Here's the schema and some sample data:
pv_counts
describe pv_counts;
{group::pv_site_guid:chararray, group::pv_hostname:chararray, pv_count:long}
dump pv_counts;
(bSAw-mF-0r4Q-4acwqm_6r,example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,sports.example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,opinion.example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,newsinfo.example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,lifestyle.example-url.com,10)
.... many more pageviews than events ....
(dZiLDGjsGr3O3zacn9QLBk,example-url2.com.com,10)
(dZiLDGjsGr3O3zacn9QLBk,example-url3.com,10)
ev_counts
describe ev_counts;
{group::ev_site_guid:chararray, group::ee_hostname:chararray, ev1count:long, ev2count:long, ev3count:long, ev4count:long, ev5count:long}
dump ev_counts;
(bSAw-mF-0r4Q-4acwqm_6r,example-url.com,29,0,0,0,0)
(bSAw-mF-0r4Q-4acwqm_6r,sports.example-url.com,7,0,0,0,0)
(bSAw-mF-0r4Q-4acwqm_6r,lifestyle.example-url.com,2,0,0,0,0)
.... not as many events as pageviews ....
(dZiLDGjsGr3O3zacn9QLBk,example-url2.com.com,0,0,37,0,0)
(dZiLDGjsGr3O3zacn9QLBk,example-url3.com,0,0,1,0,0)
I can dump the relations just fine in Pig and Grunt.
When I add the following join statement, it gets to the very end and dies:
joined_counts = JOIN ev_counts BY ev_site_guid, pv_counts BY pv_site_guid;
dump joined_counts;
It'll throw the "ERROR 2087: Unexpected problem during optimization. Found index:0 in multiple LocalRearrange operators." error and an ugly stacktrace. I'm relatively new to Pig, so I've never dug into its internals.
If anyone has any tips or things to try, I'd gladly try them. We're running on Cloudera's CDH3U3 (0.20.2).

Sort by minute doesn't work

I'm trying to sort my photos by hour and minute, but it will not catch the minutes - it just keeps saying that nothing exists for that hour and minute. If I sort only by the hour, it works perfectly. I have tested WHERE DATEPART(minute, exif_taken) = "'.$_GET['min'].'" but I keep getting the following error message:
Fatal error: Uncaught exception 'PDOException' with message 'SQLSTATE[42000]: Syntax error or access violation: 1305 FUNCTION gallery.DATEPART does not exist' in ...
I'm using WAMP Server with default settings, except for some modules activated for both Apache and PHP such as mod_rewrite and php_exif. Here's what my SQL query looks like:
SELECT *
FROM photos
WHERE HOUR(exif_taken) = "'.$_GET['h'].'"
AND MINUTE(exif_taken) = "'.$_GET['min'].'"
ORDER BY exif_taken DESC
$_GET['h'] is the hour and $_GET['min'] the minute.
How can I solve my problem?
Thanks in advance.
Are you using MySQL? If so, the DATEPART() function doesn't exist there; you should use the MINUTE() function.
Here is the complete documentation.
https://dev.mysql.com/doc/refman/5.5/en/date-and-time-functions.html
BTW, for god's sake, sanitize your $_GET vars before sending them to the SQL query. You are exposing your system to SQL injection.
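As a sketch of what bound parameters look like (shown here with Python's mysql-connector purely for illustration; PDO's prepare/execute follows the same pattern, and the connection details below are placeholders):

import mysql.connector

cnx = mysql.connector.connect(user="gallery_user", password="secret",
                              host="localhost", database="gallery")
cur = cnx.cursor(dictionary=True)

hour, minute = 14, 30  # e.g. validated integers taken from $_GET['h'] / $_GET['min']

# HOUR()/MINUTE() are the MySQL functions to use, and the values are bound
# as parameters instead of being concatenated into the SQL string.
cur.execute(
    "SELECT * FROM photos"
    " WHERE HOUR(exif_taken) = %s AND MINUTE(exif_taken) = %s"
    " ORDER BY exif_taken DESC",
    (hour, minute),
)
rows = cur.fetchall()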
In some odd way it's working perfectly now with the SQL query I posted in my question O.o