Why are there multiple calls to the DB? - spring-webflux

I am playing with R2DBC using PostgreSQL. The use case I am trying is to get a Film by ID along with its Language, Actors and Category. This is the corresponding piece of code in the ServiceImpl:
@Override
public Mono<FilmModel> getById(Long id) {
    Mono<Film> filmMono = filmRepository.findById(id)
            .switchIfEmpty(Mono.error(DataFormatException::new))
            .subscribeOn(Schedulers.boundedElastic());
    Flux<Actor> actorFlux = filmMono.flatMapMany(this::getByActorId)
            .subscribeOn(Schedulers.boundedElastic());
    Mono<String> language = filmMono
            .flatMap(film -> languageRepository.findById(film.getLanguageId()))
            .map(Language::getName)
            .subscribeOn(Schedulers.boundedElastic());
    Mono<String> category = filmMono
            .flatMap(film -> filmCategoryRepository.findFirstByFilmId(film.getFilmId()))
            .flatMap(filmCategory -> categoryRepository.findById(filmCategory.getCategoryId()))
            .map(Category::getName)
            .subscribeOn(Schedulers.boundedElastic());
    return Mono.zip(filmMono, actorFlux.collectList(), language, category)
            .map(tuple -> {
                FilmModel filmModel = GenericMapper.INSTANCE.filmToFilmModel(tuple.getT1());
                List<ActorModel> actors = tuple.getT2()
                        .stream()
                        .map(act -> GenericMapper.INSTANCE.actorToActorModel(act))
                        .collect(Collectors.toList());
                filmModel.setActorModelList(actors);
                filmModel.setLanguage(tuple.getT3());
                filmModel.setCategory(tuple.getT4());
                return filmModel;
            });
}
The logs show 4 calls to the film table:
2021-12-16 21:21:20.026 DEBUG 32493 --- [ctor-tcp-nio-10] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT film.* FROM film WHERE film.film_id = $1 LIMIT 2]
2021-12-16 21:21:20.026 DEBUG 32493 --- [actor-tcp-nio-9] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT film.* FROM film WHERE film.film_id = $1 LIMIT 2]
2021-12-16 21:21:20.026 DEBUG 32493 --- [ctor-tcp-nio-12] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT film.* FROM film WHERE film.film_id = $1 LIMIT 2]
2021-12-16 21:21:20.026 DEBUG 32493 --- [actor-tcp-nio-7] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT film.* FROM film WHERE film.film_id = $1 LIMIT 2]
2021-12-16 21:21:20.162 DEBUG 32493 --- [actor-tcp-nio-9] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT language.* FROM language WHERE language.language_id = $1 LIMIT 2]
2021-12-16 21:21:20.188 DEBUG 32493 --- [actor-tcp-nio-7] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT film_actor.actor_id, film_actor.film_id, film_actor.last_update FROM film_actor WHERE film_actor.film_id = $1]
2021-12-16 21:21:20.188 DEBUG 32493 --- [ctor-tcp-nio-10] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT film_category.film_id, film_category.category_id, film_category.last_update FROM film_category WHERE film_category.film_id = $1 LIMIT 1]
2021-12-16 21:21:20.313 DEBUG 32493 --- [ctor-tcp-nio-10] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT category.* FROM category WHERE category.category_id = $1 LIMIT 2]
2021-12-16 21:21:20.563 DEBUG 32493 --- [actor-tcp-nio-7] o.s.r2dbc.core.DefaultDatabaseClient : Executing SQL statement [SELECT actor.* FROM actor WHERE actor.actor_id = $1 LIMIT 2]
I am not looking for SQL optimizations (joins etc.); I can definitely make this more performant. The question is why I see 4 SQL queries to the film table. Just to add, I have already fixed the code, but I am not able to understand the core reason. Thanks in advance.

Why do I see 4 SQL queries to the Film table?
The reason is quite simple. You are subscribing to the Mono<Film> 4 times:
Mono<Film> filmMono = filmRepository.findById(id);
Flux<Actor> actorFlux = filmMono.flatMapMany(...); (1)
Mono<String> language = filmMono.flatMap(...); (2)
Mono<String> category = filmMono.flatMap(...); (3)
Mono.zip(filmMono, actorFlux.collectList(), language, category) (4)
Each subscription to filmMono triggers a new query. Note that you can change this by using the Mono#cache operator to turn filmMono into a hot source and cache the result for all four subscribers.
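A minimal sketch of that fix, reusing only names that already appear in the question (the repositories, DataFormatException and Schedulers usage are taken from the post, not verified):

Mono<Film> filmMono = filmRepository.findById(id)
        .switchIfEmpty(Mono.error(DataFormatException::new))
        .subscribeOn(Schedulers.boundedElastic())
        .cache(); // replay the single Film to every downstream subscriber

// These derived publishers now reuse the cached Film instead of re-running
// "SELECT film.* FROM film WHERE film.film_id = $1" on every subscription.
Flux<Actor> actorFlux = filmMono.flatMapMany(this::getByActorId);
Mono<String> language = filmMono
        .flatMap(film -> languageRepository.findById(film.getLanguageId()))
        .map(Language::getName);

Mono.zip(...) still subscribes four times, but only the first subscription reaches the database; the other three are served from the cached value.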

I'm not terribly familiar with your stack, so this is a high-level answer to hit on your "why". There WILL be a more specific answer for you somewhere down the pipe (e.g. someone who can confirm whether this thread is relevant).
While I'm no Spring Daisy (or Spring dev), you bind an expression to filmMono that resolves as the query select film.* from film..... You reference that expression four times, and it's resolved four times, in separate contexts. The ordering of the statements is likely a partially-successful attempt by the lib author to lazily evaluate the expression that you bound locally, such that it's able to batch the four accidentally identical queries. You most likely resolved this by collecting into an actual container, and then mapping on that container instead of the expression bound to filmMono.
In general, this situation is because the options available to library authors aren't good when the language doesn't natively support lazy evaluation. Because any operation might alter the dataset, the library author has to choose between:
A, construct just enough scaffolding to fully record all resources needed, copy the dataset for any operations that need to mutate records in some way, and hope that they can detect any edge-cases that might leak the scaffolding when the resolved dataset was expected (getting this right is...hard).
B, resolve each level of mapping as a query, for each context it appears in, lest any operations mutate the dataset in ways that might surprise the integrator (e.g. you).
C, as above, except instead of duplicating the original request, just duplicate the data...at every step. Pass-by-copy gets real painful real fast on the JVM, and languages like Clojure and Scala handle this by just making the dev be very specific about whether they want to mutate in-place, or copy then mutate.
In your case, B made the most sense to the folks that wrote that lib. In fact, they apparently got close enough to A that they were able to batch all the queries that were produced by resolving the expression bound to filmMono (which are only accidentally identical), so color me a bit impressed.
Many access patterns can be rewritten to optimize for the resulting queries instead. Your mileage may vary...wildly. Getting familiar with raw SQL, or else a special-purpose language like GraphQL, can give much more consistent results than relational mappers, but I'm ever more appreciative of good IDE support, and mixing domains like that often means giving up auto-complete, context highlighting, lang-server solution-proofs and linting.
Given that the scope of the question was "why did this happen?", even noting my lack of familiarity with your stack, the answer is "lazy evaluation in a language that doesn't natively support it is really hard."

Related

java/jdbc timeout in clojure

I am trying to add a timeout to jdbc/query and jdbc/execute!. Somewhere on the web I found that both functions take :timeout as an option. The documentation also says the options are passed to prepare-statement, which takes :timeout as an option.
My function calls look like,
(jdbc/query db-read-spec query {:timeout 2})
(jdbc/execute! db-write-spec query {:timeout 2})
Is this how it is done? If yes, how do I test this?
If there is a different way of doing this which is testable, that works too.
The :timeout option causes .setQueryTimeout to be called on the PreparedStatement used under the hood of clojure.java.jdbc. It is in seconds, not milliseconds, so {:timeout 2} means a two-second limit; if you were thinking in milliseconds and passed 2,000, that would be 2,000 seconds (just over half an hour), and your query would have to be extremely slow for it to take effect.
JDBC supports several different timeouts across several of its classes. For example, javax.sql.DataSource supports .setLoginTimeout (also in seconds), as does java.sql.DriverManager.
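For reference, this is roughly what those calls look like at the plain JDBC level (a sketch only; the PostgreSQL URL, credentials and pg_sleep query are placeholders, not taken from the question):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TimeoutDemo {
    public static void main(String[] args) throws SQLException {
        // Login timeout for DriverManager-based connections, in seconds.
        DriverManager.setLoginTimeout(5);

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
             PreparedStatement ps = conn.prepareStatement("select pg_sleep(10)")) {
            // This is the setting clojure.java.jdbc's :timeout option maps to -- seconds, not ms.
            ps.setQueryTimeout(2);
            ps.executeQuery(); // should fail after ~2s with an SQLException from the driver
        }
    }
}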
There are also database-specific options you can add to the connection string (or as additional key/value pairs in your "db-spec") to control lower-level timeouts. For example, MySQL supports connectTimeout and socketTimeout in the connection string -- and both of those are in milliseconds. clojure.java.jdbc allows those to be provided in your "db-spec" hash map as :connectTimeout and :socketTimeout keys respectively.
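For instance, with MySQL Connector/J those two options go straight on the JDBC URL (host, port and database name here are placeholders), with values in milliseconds:

jdbc:mysql://localhost:3306/mydb?connectTimeout=5000&socketTimeout=10000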
Note that clojure.java.jdbc is considered "Stable" at this point and all current and future development effort is focused on next.jdbc. next.jdbc makes it easier to use loginTimeout since it operates on JDBC objects directly, so the whole (Java) API is available as well. It also has built-in support for connection pooling and is, overall, simpler and faster than clojure.java.jdbc.
You can leverage a query hint on MySQL SELECT queries (time in ms):
SELECT /*+ MAX_EXECUTION_TIME(1000) */ * FROM t1 INNER JOIN t2 WHERE....
then you can just wrap your queries:
;; assumes (:require [clojure.java.jdbc :as j])
(defn timed-query [db query t]
  (j/query db [(str (subs query 0 6)
                    (format " /*+ MAX_EXECUTION_TIME(%s) */ " t)
                    (subs query 7))]))
and test:
(deftest test-query-timeout
  (is (thrown? Exception (timed-query db "select * from Employees where id>5" 1))))
You would need a much more complex query for this to actually hit a 1 ms limit.
I figured out a workaround to test this. Since I use Postgres, I could leverage select pg_sleep(time-in-seconds), and my test looks like:
(is (thrown-with-msg? PSQLException #"ERROR: canceling statement due to user request"
      (fetch-or-save "select pg_sleep(3)")))

Understanding Erlang ODBC application

I'm connecting to an DB source with Erlang ODBC. My code looks like:
main() ->
    Sql = "SELECT 1",
    Connection = connect(),
    case odbc:sql_query(Connection, Sql) of
        {selected, Columns, Results} ->
            io:format("Success!~n Columns: ~p~n Results: ~p~n",
                      [Columns, Results]),
            ok;
        {error, Reason} ->
            {error, Reason}
    end.

connect() ->
    ConnectionString = "DSN=dsn_name;UID=uid;PWD=pqd",
    odbc:start(),
    {ok, Conn} = odbc:connect(ConnectionString, []),
    Conn.
It's ok now. But how can I handle errors, at least in my query? As I understand it, they are contained in {error, Reason}, but how can I output the reason when something goes wrong? I tried adding io:format like in the first clause, but it doesn't work.
Secondly, unfortunately, I can't find any reference that explains the syntax well, and I can't understand what ok means in this code (the first on line 8 and the second on line 16). If I'm right, it just means the case where the connection is ok and this variable isn't assigned? But what does it mean at line 8?
ok in line 8 is the return value of the case statement when the call to odbc:sql_query(Connection, Sql) returns a result that can match the expression {selected, Columns, Results}. In this case it is useless since the function io:format(...) already returns ok.
The second ok, in {ok, Conn}, is a very common Erlang usage: the function returns a tuple {ok, Value} in case of success and {error, Reason} in case of failure, so you can match on the success case and extract the returned value with this single line: {ok, Conn} = odbc:connect(ConnectionString, []),
In this case the function connect() doesn't handle the error case, so this code has 4 different possible behaviors:
It can fail to connect to the database: the process will crash with a badmatch error at line 16.
It connects to the database but the query fails: the main function will return the value {error, Reason}.
It connects to the database and the query returns an answer that matches neither {selected, Columns, Results} nor {error, Reason}: the process will crash with a case_clause error at line 4.
It connects to the database and the query returns an answer that matches the tuple {selected, Columns, Results}: the function will print
Success!
Columns: Column
Results: Result
and return ok.
So I found something. The {error, Reason} contains connection errors, meaning that we specified a wrong DSN name etc. Regarding my attempt to catch query errors, we can read this in the Erlang reference:
Guards: All API-functions are guarded and if you pass an argument of the wrong type a runtime error will occur. All input parameters to internal functions are trusted to be correct. It is a good programming practise to only distrust input from truly external sources. You are not supposed to catch these errors, it will only make the code very messy and much more complex, which introduces more bugs and in the worst case also covers up the actual faults. Put your effort on testing instead, you should trust your own input.
This means we should be careful about what we write. That's not bad.

Determine actual errors from a load job

Using the Java SDK I am creating a load job for just a single record with a fairly complicated schema. When I monitor the status of the load job, it takes a surprisingly long time (but perhaps this is due to working out the schema), and then says:
11:21:06.975 [main] INFO xxx.GoogleBigQuery - Job status (21694ms) create_scans_1384744805079_172221126: DONE
11:24:50.618 [main] ERROR xxx.GoogleBigQuery - Job create_scans_1384744805079_172221126 caused error (invalid) with message
Too many errors encountered. Limit is: 0.
11:24:50.810 [main] ERROR xxx.GoogleBigQuery - {
"message" : "Too many errors encountered. Limit is: 0.",
"reason" : "invalid"
}
BTW - how do I tell the job that it can have more than zero errors using Java?
This load job does not appear in the list of recent jobs in the console, and as far as I can see, none of the Java objects contains any more details about the actual errors encountered. So how can I programmatically find out what is going wrong? All I can find is:
if (err != null) {
    log.error("Job {} caused error ({}) with message\n{}", jobID, err.getReason(), err.getMessage());
    try {
        log.error(err.toPrettyString());
    }
    ...
In general I am having a difficult time finding good documentation for some of these things and am working it out by trial and error and short snippets of code found on here and older groups. If there is a better source of information than the getting started guides, then I would appreciate any pointers to that information. The Javadoc does not really help and I cannot find any complete examples of loading, querying, testing for errors, cataloging errors and so on.
This job is submitted as a NEWLINE_DELIMITED_JSON record, supplied to the job via:
InputStream dummy = getClass().getResourceAsStream("/googlebigquery/xxx.record");
final InputStreamContent jsonIn = new InputStreamContent("application/octet-stream", dummy);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();
My authentication and so on seems to work correctly, as separate Java code that lists the projects and the datasets in the project works fine. So I just need help working out what the actual error is - does it not like the schema (I have records nested within records, for instance), or does it think there is an error in the data I am submitting?
Thanks in advance for any help. The job number cited above is an actual failed load job if that helps any Google staffers who might read this.
It sounds like you have a couple of questions, so I'll try to address them all.
First, the way to get the status of the job that failed is to call jobs().get(jobId), which returns a job object that has an errorResult object containing the error that caused the job to fail (e.g. "too many errors"). The errors list on the job status is a list of all of the errors on the job, which should tell you which lines hit errors.
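A rough sketch of that lookup with the same client object used in the question (bigQuery, projectId and jobId are assumed to be the ones from your load job; the model classes are com.google.api.services.bigquery.model.Job and ErrorProto):

Job failedJob = bigQuery.jobs().get(projectId, jobId).execute();

// The single error that caused the job to fail, e.g. "Too many errors encountered."
ErrorProto errorResult = failedJob.getStatus().getErrorResult();
log.error("Job failed: {} ({})", errorResult.getMessage(), errorResult.getReason());

// The full list of errors, which should point at the offending lines/fields.
List<ErrorProto> errors = failedJob.getStatus().getErrors();
if (errors != null) {
    for (ErrorProto e : errors) {
        log.error("  {} at {}: {}", e.getReason(), e.getLocation(), e.getMessage());
    }
}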
Note that if you have the job id, it may be easier to use bq to look up the job -- you can run bq show <job_id> to get the job error information. If you add --format=prettyjson, it will print out all of the information in the job.
A hint you also might want to consider is to supply your own job id when you create the job -- then even if there is an error starting the job (i.e. the insert() call fails, perhaps due to a network error) you can look up the job to see what actually happened.
To tell BigQuery that some errors are allowed during import, you can use the maxBadRecords setting in the load job. See https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#getMaxBadRecords().
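And for the "more than zero errors" part of your question, a sketch assuming loadJob from your snippet is a com.google.api.services.bigquery.model.Job whose configuration holds a JobConfigurationLoad:

// Allow up to 10 bad records before the whole load job is failed.
loadJob.getConfiguration().getLoad().setMaxBadRecords(10);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();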

Extensions for Computationally-Intensive Cypher queries

As a follow up to a previous question of mine, I want to find all 30 pathways that exist between two given nodes within a depth of 4. Something to the effect of this:
start startnode = node(1), endnode = node(1000)
match startnode-[r:rel_Type*1..4]->endnode
return r
limit 30;
My database contains ~50k nodes and 2M relationships.
Expectedly, the computation time for this query is very, very large; I even ended up with the following GC message in the message.log file: GC Monitor: Application threads blocked for an additional 14813ms [total block time: 182.589s]. This error keeps occurring and blocks all threads for an indefinite period of time. Therefore, I am looking for a way to lower the computational strain of this query on the server by optimizing the query.
Is there any extension I could use to help optimize this query?
Give this one a try:
https://github.com/wfreeman/findpaths
You can query the extension like so:
.../findpathslen/1/1000/4/30
And it will give you a json response with the paths found. Hopefully that helps you.
The meat of it is here, using the built-in graph algorithm to find paths of a certain length:
@GET
@Path("/findpathslen/{id1}/{id2}/{len}/{count}")
@Produces(Array("application/json"))
def fof(@PathParam("id1") id1: Long, @PathParam("id2") id2: Long, @PathParam("len") len: Int, @PathParam("count") count: Int, @Context db: GraphDatabaseService) = {
  val node1 = db.getNodeById(id1)
  val node2 = db.getNodeById(id2)
  val pathFinder = GraphAlgoFactory.pathsWithLength(Traversal.pathExpanderForAllTypes(Direction.OUTGOING), len)
  val pathIterator = pathFinder.findAllPaths(node1, node2).asScala
  val jsonMap = pathIterator.take(count).map(p => obj(p))
  Response.ok(compact(render(decompose(jsonMap))), MediaType.APPLICATION_JSON).build()
}
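If you would rather run the same algorithm from embedded Java code instead of deploying the unmanaged extension, the calls above translate fairly directly. This is a sketch only, assuming the Neo4j 2.x embedded API that the extension uses; the store path and node IDs are placeholders:

import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.kernel.Traversal;

public class FindPathsOfLength {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase("data/graph.db");
        try (Transaction tx = db.beginTx()) {
            Node start = db.getNodeById(1L);
            Node end = db.getNodeById(1000L);

            // Same call the extension uses: paths of exactly this length, outgoing rels only.
            PathFinder<Path> finder = GraphAlgoFactory.pathsWithLength(
                    Traversal.pathExpanderForAllTypes(Direction.OUTGOING), 4);

            int printed = 0;
            for (Path p : finder.findAllPaths(start, end)) {
                System.out.println(p);
                if (++printed >= 30) break; // cap at 30 paths, like .../4/30 in the URL
            }
            tx.success();
        }
        db.shutdown();
    }
}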

Apache Pig join error 2087 "Found index:0 in multiple LocalRearrange operators"

So I've got two relations:
pageview counts by a GUID and URL (pv_counts)
events by the same GUID and URL (ev_counts)
I'm trying to join them with joined_counts = JOIN ev_counts BY ev_site_guid, pv_counts BY pv_site_guid;, but I keep getting this error:
ERROR 2087: Unexpected problem during optimization. Found index:0 in multiple LocalRearrange operators.
I've tried using Pig 10 and Pig 11, but both return the same error.
I've Googled it, but I'm mostly just coming up with the Pig source code, not an explanation of what the error is or means. I've tried making sure I don't have any nulls or empty strings in the keys.
Anyone have any idea what I'm doing wrong?
Here's the schema and some sample data:
pv_counts
describe pv_counts;
{group::pv_site_guid:chararray, group::pv_hostname:chararray, pv_count:long}
dump pv_counts;
(bSAw-mF-0r4Q-4acwqm_6r,example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,sports.example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,opinion.example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,newsinfo.example-url.com,10)
(bSAw-mF-0r4Q-4acwqm_6r,lifestyle.example-url.com,10)
.... many more pageviews than events ....
(dZiLDGjsGr3O3zacn9QLBk,example-url2.com.com,10)
(dZiLDGjsGr3O3zacn9QLBk,example-url3.com,10)
ev_counts
describe ev_counts;
{group::ev_site_guid:chararray, group::ee_hostname:chararray, ev1count:long, ev2count:long, ev3count:long, ev4count:long, ev5count:long}
dump ev_counts;
(bSAw-mF-0r4Q-4acwqm_6r,example-url.com,29,0,0,0,0)
(bSAw-mF-0r4Q-4acwqm_6r,sports.example-url.com,7,0,0,0,0)
(bSAw-mF-0r4Q-4acwqm_6r,lifestyle.example-url.com,2,0,0,0,0)
.... not as many events as pageviews ....
(dZiLDGjsGr3O3zacn9QLBk,example-url2.com.com,0,0,37,0,0)
(dZiLDGjsGr3O3zacn9QLBk,example-url3.com,0,0,1,0,0)
I can dump the relations just fine in Pig and Grunt.
When I add the following join statement, it gets to the very end and dies:
joined_counts = JOIN ev_counts BY ev_site_guid, pv_counts BY pv_site_guid;
dump joined_counts;
It'll throw the "ERROR 2087: Unexpected problem during optimization. Found index:0 in multiple LocalRearrange operators." error and an ugly stacktrace. I'm relatively new to Pig, so I've never dug into its internals.
If anyone has any tips or things to try, I'd gladly try them. We're running on Cloudera's CDH3U3 (0.20.2).