GemFire OQL Query - How do I use the count of a SELECT statement in the WHERE clause?

I am trying to query all the IDs of the records from /ExampleRegion. I want to retrieve records if the count of the ID is only 1, so there is only 1 record in the region with that ID.
SELECT COUNT(*), id from /ExampleRegion group by id --> Only if the count for that id is 1.
How can I use COUNT as a Condition in the WHERE Condition?
I have tried the following but it doesn't work:
SELECT * from /ExampleRegion a where (SELECT count(*) as c, b.id from /ExampleRegion b where b.id = a.id and c = 1)
SELECT * from /ExampleRegion a where (SELECT count(*) as c from /ExampleRegion b where b.id = a.id ) = 1
I would think GROUP BY would work, although I still can't seem to find the correct OQL.
Much appreciated.

NOTE: To help address and (try to) resolve this question, I created a test class along with a simple User application domain model class to put this problem into context.
In short...
Regarding...
"How can I use COUNT as a Condition in the WHERE Condition?"
You cannot use an aggregate OQL query function like count in a predicate of the WHERE clause (as I suspect you already found out), such as:
SELECT x.id, count(*) AS cnt FROM /Users x WHERE count(*) = 1 GROUP BY x.id
This results in the following Exception:
Caused by: org.apache.geode.cache.query.QueryInvalidException: Aggregate functions can not be used as part of the WHERE clause.
at org.apache.geode.cache.query.internal.QCompiler.checkWhereClauseForAggregates(QCompiler.java:204)
at org.apache.geode.cache.query.internal.QCompiler.checkWhereClauseForAggregates(QCompiler.java:214)
at org.apache.geode.cache.query.internal.QCompiler.select(QCompiler.java:260)
...
Additionally, and unfortunately, the following OQL query:
SELECT x.id, count(*) AS cnt FROM /Users x WHERE cnt = 1 GROUP BY x.id
Returns no results!
The opposite OQL query used to find duplicates also returns no results:
SELECT x.id, count(*) AS cnt FROM /Users x WHERE cnt > 1 GROUP BY x.id
Although I am not entirely certain why, I suspect it is due to the same limitation as in the first OQL query above, where the count aggregate function was used in a predicate of the WHERE clause. This latter form is just less informative: according to GemFire, the OQL query is syntactically correct, so I suspect it might be swallowing an Exception somewhere.
However, if you only care about the IDs then you can simply run a similar OQL query:
SELECT x.id, count(*) AS cnt FROM /Users x GROUP BY x.id
Of course, this OQL query returns a projection (a GemFire Struct) containing the count for every User ID (duplicate and unique). Clearly, if the count for a User ID is 1, then it is unique, and if it is greater than 1, it is a duplicate (i.e. not unique).
In detail...
Typically though, users want to get access to the actual object (e.g. User) when the User instance either has a unique ID (in your case) or a duplicate ID. Users do this to perform some operation on the Region entry value (e.g. User) returned by the OQL query, which is particularly common inside Functions used to process PARTITION Regions in a parallel and distributed fashion.
But, I have to admit, I am a bit dumbfounded by not being able to (completely) solve this problem.
I honestly thought this problem should have been solvable with the following GemFire OQL query:
SELECT u
FROM /Users u, (SELECT DISTINCT x.id AS id, count(*) AS cnt FROM /Users x GROUP BY x.id) v
WHERE v.cnt = 1
AND u.id = v.id
ORDER BY u.name ASC
Essentially, this OQL query selects all Users where their ID is unique because they are 1 of a kind.
Strangely, this results in a GemFire QueryInvalidException:
org.springframework.data.gemfire.GemfireQueryException: ; nested exception is org.apache.geode.cache.query.QueryInvalidException:
at org.springframework.data.gemfire.GemfireCacheUtils.convertGemfireAccessException(GemfireCacheUtils.java:303)
at org.springframework.data.gemfire.GemfireCacheUtils.convertQueryExceptions(GemfireCacheUtils.java:325)
at org.springframework.data.gemfire.GemfireAccessor.convertGemFireQueryException(GemfireAccessor.java:109)
at org.springframework.data.gemfire.GemfireTemplate.find(GemfireTemplate.java:326)
at org.springframework.data.gemfire.repository.query.StringBasedGemfireRepositoryQuery.execute(StringBasedGemfireRepositoryQuery.java:159)
at org.springframework.data.repository.core.support.RepositoryMethodInvoker.doInvoke(RepositoryMethodInvoker.java:135)
at org.springframework.data.repository.core.support.RepositoryMethodInvoker.invoke(RepositoryMethodInvoker.java:119)
at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.doInvoke(QueryExecutorMethodInterceptor.java:151)
at org.springframework.data.repository.core.support.QueryExecutorMethodInterceptor.invoke(QueryExecutorMethodInterceptor.java:130)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
at io.stackoverflow.questions.apache.geode.query.$Proxy45.findUsersWithDuplicateId(Unknown Source)
at io.stackoverflow.questions.apache.geode.query.QueryCountEqualToOneIntegrationTests.duplicateCountQueryIsCorrect(QueryCountEqualToOneIntegrationTests.java:112)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:564)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.springframework.test.context.junit4.statements.RunBeforeTestExecutionCallbacks.evaluate(RunBeforeTestExecutionCallbacks.java:74)
at org.springframework.test.context.junit4.statements.RunAfterTestExecutionCallbacks.evaluate(RunAfterTestExecutionCallbacks.java:84)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.springframework.test.context.junit4.statements.RunBeforeTestMethodCallbacks.evaluate(RunBeforeTestMethodCallbacks.java:75)
at org.springframework.test.context.junit4.statements.RunAfterTestMethodCallbacks.evaluate(RunAfterTestMethodCallbacks.java:86)
at org.springframework.test.context.junit4.statements.SpringRepeat.evaluate(SpringRepeat.java:84)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:251)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:97)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:70)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:190)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)
Caused by: org.apache.geode.cache.query.QueryInvalidException:
at org.apache.geode.cache.query.internal.DefaultQuery.<init>(DefaultQuery.java:172)
at org.apache.geode.cache.query.internal.DefaultQueryService.newQuery(DefaultQueryService.java:150)
at org.springframework.data.gemfire.GemfireTemplate.find(GemfireTemplate.java:313)
... 43 more
Caused by: org.apache.geode.cache.query.TypeMismatchException: Exception in evaluating the Collection Expression in getRuntimeIterator() even though the Collection is independent of any RuntimeIterator
at org.apache.geode.cache.query.internal.CompiledIteratorDef.evaluateCollectionForIndependentIterator(CompiledIteratorDef.java:143)
at org.apache.geode.cache.query.internal.CompiledIteratorDef.getRuntimeIterator(CompiledIteratorDef.java:117)
at org.apache.geode.cache.query.internal.CompiledSelect.computeDependencies(CompiledSelect.java:189)
at org.apache.geode.cache.query.internal.DefaultQuery.<init>(DefaultQuery.java:170)
... 45 more
Caused by: java.lang.NullPointerException
at org.apache.geode.cache.query.internal.CompiledSelect.applyProjectionAndAddToResultSet(CompiledSelect.java:1309)
at org.apache.geode.cache.query.internal.CompiledSelect.doNestedIterations(CompiledSelect.java:800)
at org.apache.geode.cache.query.internal.CompiledSelect.doNestedIterations(CompiledSelect.java:844)
at org.apache.geode.cache.query.internal.CompiledSelect.doIterationEvaluate(CompiledSelect.java:703)
at org.apache.geode.cache.query.internal.CompiledSelect.evaluate(CompiledSelect.java:426)
at org.apache.geode.cache.query.internal.CompiledGroupBySelect.evaluate(CompiledGroupBySelect.java:157)
at org.apache.geode.cache.query.internal.CompiledGroupBySelect.evaluate(CompiledGroupBySelect.java:42)
at org.apache.geode.cache.query.internal.CompiledIteratorDef.evaluateCollection(CompiledIteratorDef.java:184)
at org.apache.geode.cache.query.internal.RuntimeIterator.evaluateCollection(RuntimeIterator.java:104)
at org.apache.geode.cache.query.internal.CompiledIteratorDef.evaluateCollectionForIndependentIterator(CompiledIteratorDef.java:128)
... 48 more
There is nothing more blatantly irritating in software to me than NPEs! They are a clear and present programmer error; not a user error!
Seemingly, GemFire is not happy with the nested OQL query declared in the FROM clause, which would in essence create a queryable collection, or intermediate result set, used in the outer query (much like an RDBMS temporary table):
TypeMismatchException: Exception in evaluating the Collection Expression in getRuntimeIterator() even though the Collection is independent of any RuntimeIterator
And perhaps, GemFire/Geode is specifically not happy about the "projection" of this nested (temporary) collection, hence the NPE here:
Caused by: java.lang.NullPointerException
at org.apache.geode.cache.query.internal.CompiledSelect.applyProjectionAndAddToResultSet(CompiledSelect.java:1309)
at org.apache.geode.cache.query.internal.CompiledSelect.doNestedIterations(CompiledSelect.java:800)
When I look at the affected GemFire/Geode code, the exact condition really makes no sense to me since I tested with a ClientCache using a LOCAL (only) Region. #sigh
Nevertheless, I even tried to test with a peer Cache instance using a PARTITION Region (with PDX enabled (required for PRs actually)) and that led to the same result! #sigh
Given that the GemFire query engine is seemingly having trouble with the projection of the nested OQL query (containing the count and GROUP BY clause), I decided to provide more information to the query engine in hopes of better informing it about the projected values. So, I created the UserIdCount projection class type and used it in my OQL query like so:
IMPORT io.stackoverflow.questions.spring.geode.app.model.UserIdCount;
SELECT DISTINCT u
FROM /Users u, (SELECT DISTINCT x.id AS id, count(*) AS cnt FROM /Users x GROUP BY x.id) v TYPE UserIdCount
WHERE v.cnt = 1
AND u.id = v.id
ORDER BY u.name ASC
Of course, and unfortunately, this also did not have the intended effect and only led to the following Exception:
java.lang.IllegalArgumentException: element type must be struct
at org.apache.geode.cache.query.internal.StructSet.setElementType(StructSet.java:365)
at org.apache.geode.cache.query.internal.CompiledIteratorDef.prepareIteratorDef(CompiledIteratorDef.java:275)
at org.apache.geode.cache.query.internal.CompiledIteratorDef.evaluateCollection(CompiledIteratorDef.java:200)
at org.apache.geode.cache.query.internal.RuntimeIterator.evaluateCollection(RuntimeIterator.java:104)
at org.apache.geode.cache.query.internal.CompiledSelect.doNestedIterations(CompiledSelect.java:813)
...
It seems I am stuck with a GemFire Struct, which you'd think GemFire would know how to process in a nested query when accessing the projection values in the outer query. But, whatever!
I really feel like the NPE is an unintended consequence from GemFire and that GemFire really ought to be able to (and, possibly can) handle this type of OQL query.
So, what are you left with?
Well, as I stated above, if all you care about is the IDs, then you can return all IDs with their counts and iterate the List of Structs to find the IDs with a count of 1.
Of course, if you are ultimately interested in the objects with unique (or perhaps, duplicate) IDs to perform additional processing, then you will need to break this into 2 individual and separate OQL queries, first to get the IDs of interest, and then use those IDs to get the objects/values (e.g. Users) in another query.
I have demonstrated this 2-phase query approach for your use case (i.e. unique IDs) in this test case.
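The 2-phase idea can be sketched independently of GemFire. The following is a minimal illustration in plain Python, not the linked test case and not GemFire API: the `users` list is a hypothetical stand-in for the /Users Region, and `ids_with_count` and `records_with_ids` are made-up helper names. Phase 1 mimics the GROUP BY query plus client-side filtering of the (id, cnt) pairs; phase 2 mimics a second query restricted to the selected IDs.

```python
from collections import Counter

# Hypothetical stand-in for the /Users Region: plain dicts, not GemFire entries.
users = [
    {"id": 1, "name": "Ann"},
    {"id": 2, "name": "Bob"},
    {"id": 2, "name": "Bobby"},
    {"id": 3, "name": "Cat"},
]


def ids_with_count(records, predicate):
    # Phase 1: the equivalent of SELECT x.id, count(*) ... GROUP BY x.id,
    # followed by filtering the (id, cnt) pairs on the client side.
    counts = Counter(r["id"] for r in records)
    return {rid for rid, cnt in counts.items() if predicate(cnt)}


def records_with_ids(records, ids):
    # Phase 2: the equivalent of a second query restricted to the chosen IDs,
    # ordered by name as in the original query.
    return sorted((r for r in records if r["id"] in ids), key=lambda r: r["name"])


unique_ids = ids_with_count(users, lambda cnt: cnt == 1)
unique_users = records_with_ids(users, unique_ids)
duplicate_users = records_with_ids(users, ids_with_count(users, lambda cnt: cnt > 1))
```

Swapping the predicate (cnt == 1 versus cnt > 1) switches between the unique and duplicate cases without changing either phase.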
Anyway, I hope this gives you a few options or things to think about.
Cheers!

Related

Teiid not performing optimal join

For our Teiid Springboot project we use a row filter in a where clause to determine what results a user gets.
Example:
SELECT * FROM very_large_table WHERE id IN ('01', '03')
We want the context in the IN clause to be dynamic like so:
SELECT * FROM very_large_table WHERE id IN (SELECT other_id from very_small_table)
The problem now is that Teiid gets all the data from very_large_table and only then tries to filter with the WHERE clause, which makes the query 10-20 times slower. The data in very_small_table is only about 1-10 records and is based on the user context we get from Java.
The very_large_table is located on an Oracle database and the very_small_table is on the Teiid Pod/Container. Somehow I can't force Teiid to ship the data to Oracle and perform the filtering there.
Things that I have tried:
I have specified the foreign data wrappers as follows:
CREATE FOREIGN DATA WRAPPER "oracle_override" TYPE "oracle" OPTIONS (EnableDependentsJoins 'true');
CREATE SERVER server_name FOREIGN DATA WRAPPER "oracle_override";
I also tried an EXISTS statement, and a JOIN clause instead of the WHERE clause, to see whether pushdown happened. Join hints don't seem to matter either.
Sadly, the performance impact at the moment is so high that we can't reach our performance targets.
Are there any cardinalities on very_small_table and very_large_table? If not the planner will assume a default plan.
You can also use a dependent join hint:
SELECT * FROM very_large_table WHERE id IN /*+ dj */ (SELECT other_id from very_small_table)
Often, exists performs better than in:
SELECT vlt.*
FROM very_large_table vlt
WHERE EXISTS (SELECT 1 FROM very_small_table vst WHERE vst.other_id = vlt.id);
However, this might end up scanning the large table.
If id is unique in vlt and there are no duplicates in vst, then a JOIN might optimize better:
select vlt.*
from very_small_table vst join
very_large_table vlt
on vst.other_id = vlt.id;
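The uniqueness caveat on the JOIN form can be checked with a small sketch. It uses Python's built-in sqlite3 as a stand-in (an assumption; it says nothing about how Teiid or Oracle plan the query), with made-up sample rows and a hypothetical `ids` helper. With a duplicate in very_small_table, IN and EXISTS still return each matching row once, while the JOIN repeats it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE very_large_table (id TEXT PRIMARY KEY, payload TEXT);
    CREATE TABLE very_small_table (other_id TEXT);
    INSERT INTO very_large_table VALUES ('01', 'a'), ('02', 'b'), ('03', 'c');
    -- '01' appears twice on purpose to show the effect of duplicates.
    INSERT INTO very_small_table VALUES ('01'), ('01'), ('03');
""")


def ids(sql):
    # Run a query and return the first column as a sorted list.
    return sorted(row[0] for row in conn.execute(sql))


in_ids = ids(
    "SELECT id FROM very_large_table "
    "WHERE id IN (SELECT other_id FROM very_small_table)"
)
exists_ids = ids(
    "SELECT vlt.id FROM very_large_table vlt WHERE EXISTS "
    "(SELECT 1 FROM very_small_table vst WHERE vst.other_id = vlt.id)"
)
join_ids = ids(
    "SELECT vlt.id FROM very_small_table vst "
    "JOIN very_large_table vlt ON vst.other_id = vlt.id"
)
```

Here in_ids and exists_ids contain '01' once, whereas join_ids contains it twice, so the JOIN is only a drop-in replacement when very_small_table has no duplicates.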

SQL Server Execute Order

As far as I know, the order of execution in SQL is
FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY
So I am confused with the correlated query like the below code.
Is the FROM/WHERE clause in the outer query executed first, or is the SELECT in the inner query executed first? Can anyone give me an idea and an explanation? Thanks
SELECT
*, COUNT(1) OVER(PARTITION BY A) pt
FROM
(SELECT
tt.*,
(SELECT COUNT(id) FROM t WHERE data <= 10 AND ID < tt.ID) AS A
FROM
t tt
WHERE
data > 10) t1
"As far as I know, the order of execution in SQL is FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY"
False. False. False. Presumably what you are referring to is this part of the documentation:
The following steps show the logical processing order, or binding
order, for a SELECT statement. This order determines when the objects
defined in one step are made available to the clauses in subsequent
steps.
As the documentation explains, this refers to the scoping rules when a query is parsed. It has nothing to do with the execution order. SQL Server -- as with almost any database -- reserves the ability to rearrange the query however it likes for processing.
In fact, the execution plan is really a directed acyclic graph (DAG), whose components generally do not have a 1-1 relationship with the clauses in a query. SQL Server is free to execute your query in whatever way it decides is best, so long as it produces the result set that you have described.

SQL queries with views and subqueries

select nid, avg, std from sView1
where sid = 4891
and nid in (select distinct nid from tblref where rid = 799)
and oid in (select distinct oid from tblref where rid = 799)
and anscount > 3
This is a query I'm currently trying to run. And running it like this takes about 3-4 seconds. However, if I replace the "4891" value with a subquery saying (select distinct sid from tblref where rid = 799) the procedure just hangs, even though the subquery only returns one sid.
The query is supposed to return a dataset with averages (avg) and standard deviations (std) over a resultset which is calculated through nested views in sView1. This dataset is then run through another view to get some top-level averages and stdevs.
The averages may need to include more than 1 sid (sid identifies a dataset).
It's difficult describing it more without revealing codebase and codestructure that shouldn't be revealed ;)
Can anyone suggest why the query hangs when trying to use the subquery? (The code is rebuilt from originally using nested cursors, since I have been told that cursors are the work of the devil, and nested cursors may make me sterile)
Try this. EXISTS returns as soon as it finds a matching row, whereas SELECT DISTINCT requires going through the whole dataset and possibly sorting it to remove the duplicates.
SELECT nid,avg,std from sView1 AS SV
WHERE EXISTS (SELECT * FROM TblRef AS TR WHERE sv.sid = Tr.sid AND Sv.nid = tr.nid AND sv.oid = tr.oid AND tr.rid = 799)
AND ansCount>3
Also, it is pretty difficult to provide a meaningful answer without access to query plans and table structures. So DDL and sample data will definitely help.
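One thing worth checking before adopting the EXISTS rewrite is that it correlates sid, nid and oid against the same tblref row, whereas the original applies three independent IN filters. The sketch below uses Python's built-in sqlite3 with made-up rows to show that the two forms can return different results. Here sview1 is a plain table rather than the nested view from the question, and only the columns needed for the comparison are modelled, so treat this as an assumption-laden illustration rather than a statement about the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- sview1 is a plain table here; in the question it is a view.
    CREATE TABLE sview1 (sid INTEGER, nid INTEGER, oid INTEGER, anscount INTEGER);
    CREATE TABLE tblref (rid INTEGER, sid INTEGER, nid INTEGER, oid INTEGER);
    INSERT INTO tblref VALUES (799, 4891, 1, 10), (799, 4891, 2, 20);
    INSERT INTO sview1 VALUES
        (4891, 1, 10, 5),   -- nid and oid come from the same tblref row
        (4891, 1, 20, 5);   -- nid and oid come from different tblref rows
""")

# Original shape: three independent IN filters.
in_rows = conn.execute("""
    SELECT nid, oid FROM sview1
    WHERE sid IN (SELECT sid FROM tblref WHERE rid = 799)
      AND nid IN (SELECT nid FROM tblref WHERE rid = 799)
      AND oid IN (SELECT oid FROM tblref WHERE rid = 799)
      AND anscount > 3
    ORDER BY nid, oid
""").fetchall()

# Rewritten shape: one EXISTS correlated on all three columns at once.
exists_rows = conn.execute("""
    SELECT nid, oid FROM sview1 AS sv
    WHERE EXISTS (SELECT 1 FROM tblref AS tr
                  WHERE sv.sid = tr.sid AND sv.nid = tr.nid
                    AND sv.oid = tr.oid AND tr.rid = 799)
      AND anscount > 3
    ORDER BY nid, oid
""").fetchall()
```

With this data the IN form keeps both rows while the EXISTS form keeps only the first, so the rewrite is equivalent only if every wanted (sid, nid, oid) combination actually occurs together in tblref.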

OR query performance and strategies with Postgresql

In my application I have a table of application events that are used to generate a user-specific feed of application events. Because it is generated using an OR query, I'm concerned about performance of this heavily used query and am wondering if I'm approaching this wrong.
In the application, users can follow both other users and groups. When an action is performed (eg, a new post is created), a feed_item record is created with the actor_id set to the user's id and the subject_id set to the group id in which the action was performed, and actor_type and subject_type are set to the class names of the models. Since users can follow both groups and users, I need to generate a query that checks both the actor_id and subject_id, and it needs to select distinct records to avoid duplicates. Since it's an OR query, I can't use a normal index. And since a record is created every time an action is performed, I expect this table to have a lot of records rather quickly.
Here's the current query (the following table joins users to feeders, aka, users and groups)
SELECT DISTINCT feed_items.* FROM "feed_items"
INNER JOIN "followings"
ON (
(followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type)
OR
(followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type)
)
WHERE (followings.follower_id = 42) ORDER BY feed_items.created_at DESC LIMIT 30 OFFSET 0
So my questions:
Since this is a heavily used query, is there a performance problem here?
Is there any obvious way to simplify or optimize this that I'm missing?
What you have is called an exclusive arc and you're seeing exactly why it's a bad idea. The best approach for this kind of problem is to make the feed item type dynamic:
Feed Items: id, type (A or S for Actor or Subject), subtype (replaces actor_type and subject_type)
and then your query becomes
SELECT DISTINCT fi.*
FROM feed_items fi
JOIN followings f ON f.feeder_id = fi.id AND f.feeder_type = fi.type AND f.feeder_subtype = fi.subtype
or similar.
This may not completely or exactly represent what you need to do but the principle is sound: you need to eliminate the reason for the OR condition by changing your data model in such a way to lend itself to having performant queries being written against it.
Explain analyze and time query to see if there is a problem.
Also, you could try expressing the query as a union:
SELECT x.* FROM
(
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.subject_id
AND followings.feeder_type = feed_items.subject_type
WHERE (followings.follower_id = 42)
UNION
SELECT feed_items.* FROM feed_items
INNER JOIN followings
ON followings.feeder_id = feed_items.actor_id
AND followings.feeder_type = feed_items.actor_type
WHERE (followings.follower_id = 42)
) AS x
ORDER BY x.created_at DESC
LIMIT 30
But again explain analyze and benchmark.
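Before benchmarking, it helps to confirm that the UNION form returns the same rows as the OR join. A small sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL (an assumption; it demonstrates equivalence of results, not PostgreSQL's plan or timing), with made-up sample rows and shortened table aliases:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE feed_items (id INTEGER PRIMARY KEY, actor_id INTEGER, actor_type TEXT,
                             subject_id INTEGER, subject_type TEXT, created_at INTEGER);
    CREATE TABLE followings (follower_id INTEGER, feeder_id INTEGER, feeder_type TEXT);
    -- User 42 follows user 7 and group 3.
    INSERT INTO followings VALUES (42, 7, 'User'), (42, 3, 'Group');
    INSERT INTO feed_items VALUES
        (1, 7, 'User', 3, 'Group', 100),  -- matches both branches
        (2, 7, 'User', 9, 'Group', 200),  -- matches the actor branch only
        (3, 8, 'User', 3, 'Group', 300),  -- matches the subject branch only
        (4, 8, 'User', 9, 'Group', 400);  -- matches neither
""")

# OR form: one join with two alternative match conditions, de-duplicated by DISTINCT.
or_ids = [r[0] for r in conn.execute("""
    SELECT DISTINCT fi.id, fi.created_at FROM feed_items fi
    JOIN followings f
      ON (f.feeder_id = fi.subject_id AND f.feeder_type = fi.subject_type)
      OR (f.feeder_id = fi.actor_id AND f.feeder_type = fi.actor_type)
    WHERE f.follower_id = 42
    ORDER BY fi.created_at DESC LIMIT 30
""")]

# UNION form: one simple join per branch; UNION removes the duplicates.
union_ids = [r[0] for r in conn.execute("""
    SELECT x.id FROM (
        SELECT fi.* FROM feed_items fi
        JOIN followings f ON f.feeder_id = fi.subject_id AND f.feeder_type = fi.subject_type
        WHERE f.follower_id = 42
        UNION
        SELECT fi.* FROM feed_items fi
        JOIN followings f ON f.feeder_id = fi.actor_id AND f.feeder_type = fi.actor_type
        WHERE f.follower_id = 42
    ) AS x
    ORDER BY x.created_at DESC LIMIT 30
""")]
```

Both lists come out as the same three item ids in descending created_at order. Each UNION branch is a plain equality join, which is the property that lets an ordinary index on the join columns be considered for it.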
To find out if there is a performance problem, measure it. PostgreSQL can explain the query for you.
I don't think the query needs simplifying. If you identify a performance problem, then you may need to revise your indexes.

How to avoid nested SQL query in this case?

I have an SQL question, related to this and this question (but different). Basically I want to know how I can avoid a nested query.
Let's say I have a huge table of jobs (jobs) executed by a company in their history. These jobs are characterized by year, month, location and the code belonging to the tool used for the job. Additionally I have a table of tools (tools), translating tool codes to tool descriptions and further data about the tool. Now they want a website where they can select year, month, location and tool using a dropdown box, after which the matching jobs will be displayed. I want to fill the last dropdown with only the relevant tools matching the previous selection of year, month and location, so I write the following nested query:
SELECT c.tool_code, t.tool_description
FROM (
SELECT DISTINCT j.tool_code
FROM jobs AS j
WHERE j.year = ....
AND j.month = ....
AND j.location = ....
) AS c
LEFT JOIN tools as t
ON c.tool_code = t.tool_code
ORDER BY c.tool_code ASC
I resorted to this nested query because it was much faster than performing a JOIN on the complete database and selecting from that. It got my query time down a lot. But as I have recently read that MySQL nested queries should be avoided at all cost, I am wondering whether I am wrong in this approach. Should I rewrite my query differently? And how?
No, you shouldn't, your query is fine.
Just create an index on jobs (year, month, location, tool_code) and tools (tool_code) so that the INDEX FOR GROUP-BY can be used.
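As a rough sketch of that setup, the following uses Python's built-in sqlite3 as a stand-in for MySQL (an assumption: it shows the indexes and the nested query working together, not MySQL's INDEX FOR GROUP-BY optimization itself). The index names, the sample rows and the `tools_for` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (year INTEGER, month INTEGER, location TEXT, tool_code TEXT);
    CREATE TABLE tools (tool_code TEXT, tool_description TEXT);
    -- Index names are hypothetical; the column order follows the answer.
    CREATE INDEX ix_jobs_filter ON jobs (year, month, location, tool_code);
    CREATE INDEX ix_tools_code ON tools (tool_code);
    INSERT INTO tools VALUES ('T1', 'Drill'), ('T2', 'Saw');
    INSERT INTO jobs VALUES
        (2009, 5, 'Berlin', 'T2'), (2009, 5, 'Berlin', 'T2'),
        (2009, 5, 'Berlin', 'T9'), (2009, 6, 'Berlin', 'T1');
""")


def tools_for(year, month, location):
    # Same shape as the nested query in the question, with bound parameters
    # in place of the elided filter values.
    return conn.execute("""
        SELECT c.tool_code, t.tool_description
        FROM (
            SELECT DISTINCT j.tool_code FROM jobs AS j
            WHERE j.year = ? AND j.month = ? AND j.location = ?
        ) AS c
        LEFT JOIN tools AS t ON c.tool_code = t.tool_code
        ORDER BY c.tool_code ASC
    """, (year, month, location)).fetchall()


berlin_may = tools_for(2009, 5, "Berlin")
```

In this sample, tool code T9 has no row in tools, so the LEFT JOIN returns it with an empty description; an inner JOIN would drop it from the dropdown.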
The article you provided describes subquery predicates (IN (SELECT ...)), not nested queries (SELECT FROM (SELECT ...)).
Even with the subqueries, the article is wrong: while MySQL is not able to optimize all subqueries, it deals with IN (SELECT …) predicates just fine.
I don't know why the author chose to put DISTINCT here:
SELECT id, name, price
FROM widgets
WHERE id IN
(
SELECT DISTINCT widgetId
FROM widgetOrders
)
and why do they think this will help to improve performance, but given that widgetID is indexed, MySQL will just transform this query:
SELECT id, name, price
FROM widgets
WHERE id IN
(
SELECT widgetId
FROM widgetOrders
)
into an index_subquery.
Essentially, this is just like EXISTS clause: the inner subquery will be executed once per widgets row with the additional predicate added:
SELECT NULL
FROM widgetOrders
WHERE widgetId = widgets.id
and stop on the first match in widgetOrders.
This query:
SELECT DISTINCT w.id,w.name,w.price
FROM widgets w
INNER JOIN
widgetOrders o
ON w.id = o.widgetId
will have to use temporary to get rid of the duplicates and will be much slower.
You could avoid the subquery by using GROUP BY, but if the subquery performs better, keep it.
Why do you use a LEFT JOIN instead of a JOIN to join tools?