Neo4j Cypher query too slow

The following query takes between 1.5 and 9 seconds, depending on {keywords}:
MATCH (pr:Property)
WHERE pr.name IN {keywords}
WITH pr
MATCH (pr)<--(it:Item)
MATCH (it)-->(pr2)<-[:CAT]-(ca)
RETURN DISTINCT pr2 AS prop, count(DISTINCT it) AS sum, ca.name AS rType
LIMIT 10
Each Item is connected to 100 Properties.
A sample profile on the server:
neo4j-sh (?)$ profile match (pr:Property)
WHERE (pr.name in ["GREEN","SHORT","PLAIN","SHORT-SLEEVE"])
with pr
MATCH (pr)<--(it:Item)
MATCH (it)-->(pr2)<-[:CAT]-(ca)
return distinct pr2 as prop,count(distinct it) as sum , ca.name as rType
limit 40;
+------------------------------------------------------------------------------------------
40 rows
ColumnFilter(symKeys=["prop", "rType", " INTERNAL_AGGREGATE58d28d0e-5727-4850-81ef-7298d63d7be8"], returnItemNames=["prop", "sum", "rType"], _rows=40, _db_hits=0)
Slice(limit="Literal(40)", _rows=40, _db_hits=0)
EagerAggregation(keys=["Cached(prop of type Node)", "Cached(rType of type Any)"], aggregates=["( INTERNAL_AGGREGATE58d28d0e-5727-4850-81ef-7298d63d7be8,Distinct(Count(it),it))"], _rows=40, _db_hits=0)
Extract(symKeys=["it", "ca", " UNNAMED122", "pr", "pr2", " UNNAMED130", " UNNAMED99"], exprKeys=["prop", "rType"], _rows=645685, _db_hits=645685)
SimplePatternMatcher(g="(it)-[' UNNAMED122']-(pr2),(ca)-[' UNNAMED130']-(pr2)", _rows=645685, _db_hits=0)
Filter(pred="hasLabel(it:Item(0))", _rows=6258, _db_hits=0)
SimplePatternMatcher(g="(it)-[' UNNAMED99']-(pr)", _rows=6258, _db_hits=0)
Filter(pred="any(-_-INNER-_- in Collection(List(Literal(GREEN), Literal(SHORT), Literal(PLAIN), Literal(SHORT-SLEEVE))) where Property(pr,name(1)) == -_-INNER-_-)", _rows=4, _db_hits=1210)
NodeByLabel(identifier="pr", _db_hits=0, _rows=304, label="Property", identifiers=["pr"], producer="NodeByLabel")
neo4j version: 2.0.1
Heap size: 3.2 GB max (we don't come close to using it)
Database disk usage: 270 MB
Number of nodes: 4,368
Number of relationships: 395,693
Computer: AWS EC2 c3.large.
I also tried running it on a machine four times as fast, and the results were the same.
Watching JConsole, I can see the heap grow from 50 MB to 70 MB and then get cleaned by GC.
Any way to make this faster? This performance is way too slow for me.
EDIT:
As suggested, I tried combining the MATCHes, but it is slower, as you can see in the profile:
neo4j-sh (?)$ profile match (pr:Property)
WHERE (pr.name in ["GREEN","SHORT","PLAIN","SHORT-SLEEVE"])
with pr
MATCH (pr)<--(it:Item)-->(pr2)<-[:CAT]-(ca)
return distinct pr2 as prop,count(distinct it) as sum , ca.name as rType
limit 40;
ColumnFilter(symKeys=["prop", "rType", " INTERNAL_AGGREGATEa6eaa53b-5cf4-4823-9e4d-0d1d66120d51"], returnItemNames=["prop", "sum", "rType"], _rows=40, _db_hits=0)
Slice(limit="Literal(40)", _rows=40, _db_hits=0)
EagerAggregation(keys=["Cached(prop of type Node)", "Cached(rType of type Any)"], aggregates=["( INTERNAL_AGGREGATEa6eaa53b-5cf4-4823-9e4d-0d1d66120d51,Distinct(Count(it),it))"], _rows=40, _db_hits=0)
Extract(symKeys=[" UNNAMED111", "it", "ca", " UNNAMED119", "pr", "pr2", " UNNAMED99"], exprKeys=["prop", "rType"], _rows=639427, _db_hits=639427)
Filter(pred="(hasLabel(it:Item(0)) AND hasLabel(it:Item(0)))", _rows=639427, _db_hits=0)
SimplePatternMatcher(g="(ca)-[' UNNAMED119']-(pr2),(it)-[' UNNAMED99']-(pr),(it)-[' UNNAMED111']-(pr2)", _rows=639427, _db_hits=0)
Filter(pred="any(-_-INNER-_- in Collection(List(Literal(GREEN), Literal(SHORT), Literal(PLAIN), Literal(SHORT-SLEEVE))) where Property(pr,name(1)) == -_-INNER-_-)", _rows=4, _db_hits=1210)
NodeByLabel(identifier="pr", _db_hits=0, _rows=304, label="Property", identifiers=["pr"], producer="NodeByLabel")

First of all, make sure that the name property on the Property label is indexed. As far as I know, indexes aren't used with an IN clause yet, but this should be resolved in a future version, at which point performance will improve.
CREATE INDEX ON :Property(name)
You can reduce the query as follows:
MATCH (pr:Property)
WHERE pr.name IN {keywords}
MATCH (pr)<--(it:Item)-->(pr2)<-[:CAT]-(ca)
RETURN DISTINCT pr2 AS prop, count(DISTINCT it) AS sum, ca.name AS rType
LIMIT 10

There are two workarounds you can use until index lookup with IN is fixed:
UNION
Split it up into two queries: the first one uses an index lookup per keyword and a UNION ALL of all of these, like
MATCH (pr:Property {name: {keyword1}}) RETURN id(pr)
UNION ALL
MATCH (pr:Property {name: {keyword2}}) RETURN id(pr)
...
etc.
then in the second query do:
MATCH (pr) WHERE id(pr) IN {ids}
MATCH (pr)<--(it:Item)
MATCH (it)-->(pr2)<-[:CAT]-(ca)
RETURN DISTINCT pr2 AS prop, count(DISTINCT it) AS sum, ca.name AS rType
LIMIT 10
Legacy Index
Create a node_auto_index on "name" and then use Lucene query syntax to do your initial lookup.
START pr=node:node_auto_index('name:("GREEN" "SHORT" "PLAIN" "SHORT-SLEEVE")')
MATCH (pr)<--(it:Item)
MATCH (it)-->(pr2)<-[:CAT]-(ca)
RETURN DISTINCT pr2 AS prop, count(DISTINCT it) AS sum, ca.name AS rType
LIMIT 10
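A node_auto_index only indexes the properties listed in the auto-index configuration, so that has to be switched on first. A minimal sketch, assuming the default conf/neo4j.properties location in Neo4j 2.0:
node_auto_indexing=true
node_keys_indexable=name
Legacy auto-indexing only picks up property writes that happen after it is enabled, so existing nodes may need their name property re-set once to show up in the index.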

Related

Using Athena to get terminatingrule from rulegrouplist in AWS WAF logs

I followed these instructions to get my AWS WAF data into an Athena table.
I would like to query the data to find the latest requests with an action of BLOCK. This query works:
SELECT
from_unixtime(timestamp / 1000e0) AS date,
action,
httprequest.clientip AS ip,
httprequest.uri AS request,
httprequest.country as country,
terminatingruleid,
rulegrouplist
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
LIMIT 100;
My issue is cleanly identifying the "terminatingrule" - the reason the request was blocked. As an example, a result has
terminatingrule = AWS-AWSManagedRulesCommonRuleSet
And
rulegrouplist = [
  {
    "nonterminatingmatchingrules": [],
    "rulegroupid": "AWS#AWSManagedRulesAmazonIpReputationList",
    "terminatingrule": "null",
    "excludedrules": "null"
  },
  {
    "nonterminatingmatchingrules": [],
    "rulegroupid": "AWS#AWSManagedRulesKnownBadInputsRuleSet",
    "terminatingrule": "null",
    "excludedrules": "null"
  },
  {
    "nonterminatingmatchingrules": [],
    "rulegroupid": "AWS#AWSManagedRulesLinuxRuleSet",
    "terminatingrule": "null",
    "excludedrules": "null"
  },
  {
    "nonterminatingmatchingrules": [],
    "rulegroupid": "AWS#AWSManagedRulesCommonRuleSet",
    "terminatingrule": {
      "rulematchdetails": "null",
      "action": "BLOCK",
      "ruleid": "NoUserAgent_HEADER"
    },
    "excludedrules": "null"
  }
]
The piece of data I would like separated into a column is rulegrouplist[terminatingrule].ruleid which has a value of NoUserAgent_HEADER
AWS provides useful information on querying nested Athena arrays, but I have been unable to get the result I want.
I have framed this as an AWS question but since Athena uses SQL queries, it's likely that anyone with good SQL skills could work this out.
It's not entirely clear to me exactly what you want, but I'm going to assume you are after the array element where terminatingrule is not "null" (I will also assume that if there are multiple you want the first).
The documentation you link to says that the type of the rulegrouplist column is array<string>. The reason it is string and not a complex type is that there seem to be multiple different schemas for this column; for example, the terminatingrule property is either the string "null" or a struct/object – something that can't be described using Athena's type system.
This is not a problem, however. When dealing with JSON there's a whole set of JSON functions that can be used. Here's one way to use json_extract combined with filter and element_at to remove array elements where the terminatingrule property is the string "null" and then pick the first of the remaining elements:
SELECT
element_at(
filter(
rulegrouplist,
rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
),
1
) AS first_non_null_terminatingrule
FROM waf_logs
WHERE action = 'BLOCK'
ORDER BY date DESC
You say you want the "latest", which to me is ambiguous and could mean both first non-null and last non-null element. The query above will return the first non-null element, and if you want the last you can change the second argument to element_at to -1 (Athena's array indexing starts from 1, and -1 is counting from the end).
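For example, returning the last matching element instead of the first only changes that argument:
SELECT
  element_at(
    filter(
      rulegrouplist,
      rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
    ),
    -1 -- negative indexes count from the end of the array
  ) AS last_non_null_terminatingrule
FROM waf_logs
WHERE action = 'BLOCK'
ORDER BY date DESC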
To return the individual ruleid element of the json:
SELECT
  from_unixtime(timestamp / 1000e0) AS date,
  action,
  httprequest.clientip AS ip,
  httprequest.uri AS request,
  httprequest.country AS country,
  terminatingruleid,
  json_extract(
    element_at(
      filter(
        rulegrouplist,
        rulegroup -> json_extract(rulegroup, '$.terminatingrule') <> CAST('null' AS JSON)
      ),
      1
    ),
    '$.terminatingrule.ruleid'
  ) AS ruleid
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
I had the same issue but the solution posted by Theo didn't work for me, even though the table was created according to the instructions linked to in the original post.
Here is what worked for me, which is basically the same as Theo's solution, but without the json conversion:
SELECT
from_unixtime(timestamp / 1000e0) AS date,
action,
httprequest.clientip AS ip,
httprequest.uri AS request,
httprequest.country as country,
terminatingruleid,
rulegrouplist,
element_at(filter(ruleGroupList, ruleGroup -> ruleGroup.terminatingRule IS NOT NULL),1).terminatingRule.ruleId AS ruleId
FROM waf_logs
WHERE action='BLOCK'
ORDER BY date DESC
LIMIT 100;

Virtuoso SQL query of a SPARQL query in iSQL

I was going through this link, which uses EXPLAIN() to show the SQL query that Virtuoso generates (and uses internally) for an input SPARQL query. I tried it on my Virtuoso 7.x installation and found that I get a different output, which I am not able to fully understand. Would it be possible to explain what this iSQL output means and how I would interpret a SQL query from it?
The SPARQL query is
SPARQL SELECT DISTINCT ?s FROM <http://dbpedia.org> WHERE {
?s a <http://dbpedia.org/ontology/Cricketer> .
?s <http://dbpedia.org/property/testdebutyear> ?o .
};
The output I get is
{
Subquery 27
{
RDF_QUAD 3.2e+03 rows(s_1_2_t1.S)
inlined P = #/testdebutyear G = #/dbpedia.org
RDF_QUAD unq 0.8 rows (s_1_2_t0.S)
inlined P = ##type , S = s_1_2_t1.S , O = #/Cricketer , G = #/dbpedia.org
Distinct (s_1_2_t0.S)
After code:
0: s := := artm s_1_2_t0.S
4: BReturn 0
Subquery Select(s)
}
After code:
0: s := Call __ro2sq (s)
5: BReturn 0
Select (s)
}
20 Rows. -- 2 msec.
How do I find the SQL query in this case? Is there a command or link that I am missing?
To answer your specific question, you might read further down the same page you pointed to above, to where it talks specifically about how to "Translate a SPARQL query into the correspondent SQL."
This is also discussed on another feature-specific page.
(Also note, the sample output on both these pages came from Virtuoso 6.x, while you're running 7.x, so your output will likely still differ.)
Here's what I got from my local Virtuoso 7.2.4 (Commercial Edition) --
SQL> SET SPARQL_TRANSLATE ON ;
SQL> SELECT DISTINCT ?s FROM <http://dbpedia.org> WHERE { ?s a <http://dbpedia.org/ontology/Cricketer> . ?s <http://dbpedia.org/property/testdebutyear> ?o . } ;
SPARQL_TO_SQL_TEXT
LONG VARCHAR
_______________________________________________________________________________
SELECT __ro2sq ("s_1_2_rbc"."s") AS "s" FROM (SELECT DISTINCT "s_1_2_t0"."S" AS "s"
FROM DB.DBA.RDF_QUAD AS "s_1_2_t0"
INNER JOIN DB.DBA.RDF_QUAD AS "s_1_2_t1"
ON (
"s_1_2_t0"."S" = "s_1_2_t1"."S")
WHERE
"s_1_2_t0"."G" = __i2idn ( __bft( 'http://dbpedia.org' , 1))
AND
"s_1_2_t0"."P" = __i2idn ( __bft( 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type' , 1))
AND
"s_1_2_t0"."O" = __i2idn ( __bft( 'http://dbpedia.org/ontology/Cricketer' , 1))
AND
"s_1_2_t1"."G" = __i2idn ( __bft( 'http://dbpedia.org' , 1))
AND
"s_1_2_t1"."P" = __i2idn ( __bft( 'http://dbpedia.org/property/testdebutyear' , 1))
OPTION (QUIETCAST)) AS "s_1_2_rbc"
1 Rows. -- 3 msec.
SQL> SET SPARQL_TRANSLATE OFF ;
NOTE -- this feature is only available in the command-line iSQL; it is not found in the browser-based iSQL interface.
All that said... I strongly suggest you ask us your initial questions going forward, instead of likely getting distracted by XY Problems. There's fairly little to understand about the single table of quad data (DB.DBA.RDF_QUAD), which has 2 full and 3 partial indexes by default, which are sufficient for most typical uses, all as discussed in the documentation.
The Virtuoso Users mailing list is often a better resource for Virtuoso-specific questions than here, where you tend to get a fair amount of guessing responses.
(ObDisclaimer: I work for OpenLink Software, producer of Virtuoso.)

Discrepancy in String matching between Teradata and HIVE

I am getting into Hive and learning it. I have a customer table in Teradata and used Sqoop to extract the complete table into Hive, which worked fine.
See the customer table below, in both Teradata and Hive.
In Teradata :
select TOP 4 id,name,'"'||status||'"' from customer;
3172460 Customer#003172460 "BUILDING "
3017726 Customer#003017726 "BUILDING "
2817987 Customer#002817987 "COMPLETE "
2817984 Customer#002817984 "BUILDING "
In HIVE :
select id,name,CONCAT ('"' , status , '"') from customer LIMIT 4;
3172460 Customer#003172460 "BUILDING "
3017726 Customer#003017726 "BUILDING "
2817987 Customer#002817987 "COMPLETE "
2817984 Customer#002817984 "BUILDING "
When I tried to fetch records from the customer table by matching on a column of string type, I got different results for the same query in the two environments.
See the query results below.
In Teradata :
select TOP 2 id,name,'"'||status||'"' from customer WHERE status = 'BUILDING';
3172460 Customer#003172460 "BUILDING "
3017726 Customer#003017726 "BUILDING "
In HIVE :
select id,name,CONCAT ('"' , status , '"') from customer WHERE status = 'BUILDING' LIMIT 2;
**<<No Result>>**
It seems that Teradata is doing some sort of trimming before actually comparing the string values, whereas Hive matches the strings as they are.
I'm not sure whether this is expected behaviour, a bug, or something that could be raised as an enhancement.
One possible solution I see: convert the comparison into a LIKE expression with a wildcard character before and after.
Looking forward to your responses. How can this be handled/achieved in Hive?
You could use the rtrim function, e.g.:
select id,name,CONCAT ('"' , status , '"') from customer WHERE rtrim(status) = 'BUILDING' LIMIT 2;
But the question that arises here is: which standard does Hive follow for string comparison? According to ANSI/ISO SQL-92, 'BUILDING' = 'BUILDING '. Here is a link for an article about it.
Answering my own question... I got this from the hive-user mailing list:
Hive is only partially SQL compliant; in Hive, string comparisons work just like they would in Java.
So in Hive:
"BUILDING" = "BUILDING"
"BUILDING " != "BUILDING" (extra space added)

Repository - order by in native query not working

I have a spring data JPA repository (on a postgres db) and from time to time I need to use native queries using the nativeQuery = true option.
However in my current situation I need to pass in an order field and am doing so like this:
The call:
targetStatusHistoryRepository.findSirenAlarmTimeActivation([uuid,uuid2],"activation_name DESC", 0, 10)
...and the repo method:
@Query(
nativeQuery = true,
value = """select
a.name as activation_name,
min(transition_from_active_in_millis),
max(transition_from_active_in_millis),
avg(transition_from_active_in_millis) from target_status_history t, activation_scenario a
where t.activation_uuid=a.activation_scenario_id and t.transition_from_active_in_millis > 0 and t.activation_uuid in (:activationUUIDs) group by a.name,t.activation_uuid
order by :orderClause offset :offset limit :limit """
)
List<Object[]> findSirenAlarmTimeActivation(@Param("activationUUIDs") List<UUID> activationUUIDs,
@Param("orderClause") String orderClause, @Param("offset") int offset, @Param("limit") int limit )
I wrote a unit test with a DESC call and then an ASC call, and vice versa, and it seems that whatever the first call is, the second gives the same result.
If that's a prepared statement, and that's a bind value being supplied in the ORDER BY clause, that is valid, BUT...
The bind value supplied won't be interpreted as SQL text. That is, the value will be seen as just a value (like a literal string). It won't be seen as a column name, or an ASC or DESC keyword.
In the context of your statement, supplying a value for the :orderClause bind placeholder, that's going to have the same effect as if you had written ORDER BY 'some literal'.
And that's not really doing any ordering of the rows at all.
(This is true at least in every SQL client library I've used with DB2, Teradata, Oracle, SQL Server, MySQL, and MariaDB: JDBC, Perl DBI, ODBC, Pro/C, et al.)
(MyBatis does provide a convenient mechanism for doing variable substitution within the SQL text, dynamically changing the SQL text before it's prepared, but those substitutions are handled BEFORE the statement is prepared, and don't turn into bind placeholders in the statement.)
It is possible to get some modicum of "dynamic" ordering with some carefully crafted expressions in the ORDER BY clause. For example, we can have our static SQL text be something like this:
ORDER BY CASE WHEN :sort_param = 'name ASC' THEN activation_name END ASC
, CASE WHEN :sort_param = 'name DESC' THEN activation_name END DESC
The SQL text here isn't dynamic; it's actually static, as if we had written:
ORDER BY expr1 ASC
, expr1 DESC
The "trick" is that the expressions in the ORDER BY clause are conditionally returning either the value of some column from each row, or they are returning a literal (in the example above, the literal NULL), depending on the value of a bind value, evaluated at execution time.
The net effect is that we can "dynamically" get the effect of either:
ORDER BY activation_name ASC, NULL DESC
or
ORDER BY NULL ASC, activation_name DESC
or
ORDER BY NULL ASC, NULL DESC
depending on what value we supply for the :sort_param placeholder.
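Applied to the query from the question, the whole statement stays static and :sort_param (the same illustrative placeholder as above) picks the ordering at execution time. A sketch:
SELECT a.name AS activation_name,
       min(transition_from_active_in_millis),
       max(transition_from_active_in_millis),
       avg(transition_from_active_in_millis)
FROM target_status_history t, activation_scenario a
WHERE t.activation_uuid = a.activation_scenario_id
  AND t.transition_from_active_in_millis > 0
  AND t.activation_uuid IN (:activationUUIDs)
GROUP BY a.name, t.activation_uuid
-- static text, dynamic effect: only the matching CASE yields a sort key
ORDER BY CASE WHEN :sort_param = 'name ASC'  THEN a.name END ASC,
         CASE WHEN :sort_param = 'name DESC' THEN a.name END DESC
OFFSET :offset LIMIT :limit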
You can use a Pageable with the SpEL language. The Sort object in the Pageable will be used to append an "order by" at the end of the query.
Here is an example.
Use createNativeQuery and append the order-by value directly into the query string rather than using setParameter(). It worked fine for me.
I had the same problem using a native query in Spring Boot, and the way I solved it was:
1- Create a Pageable:
Pageable pageable = PageRequest.of(numPage, sizePage, Sort.by(direction , nameField));
2- Add "ORDER BY true" into the query, for example:
@Query(value = " SELECT * " +
"FROM articulos a " +
"WHERE a.id = :id " +
"ORDER BY TRUE",
countQuery = " SELECT * " +
"FROM articulos a " +
"WHERE a.id = :id " +
"ORDER BY TRUE"
, nativeQuery = true)
Page<PrecioArticuloVO> obtenerPreciosPorCategoriaProveedor(@Param("id") Long id, Pageable pagina);

num_rows in Postgres always returns 1

I'm trying to do a SELECT COUNT(*) with Postgres.
What I need: to capture the number of rows returned by the query. It's a school system; if the student is not registered, do something (in an if block).
What I tried:
$query = pg_query("SELECT COUNT(*) FROM inscritossimulado
WHERE codigo_da_escola = '".$CodEscola."'
AND codigo_do_simulado = '".$simulado."'
AND codigo_do_aluno = '".$aluno."'");
if(pg_num_rows($query) == 0)
{
echo "Error you're not registered!";
}
else
{
echo "Hello!";
}
Note: The student in question IS NOT REGISTERED, but the result is always 1 and not 0.
For some reason, when I "show" the query, the result is: "Resource id #21". But, I look many times in the table, and the user is not there.
You are counting the number of rows in the answer, and your query always returns a single line.
Your query says: return one row giving the number of students matching my criteria. If no one matches, you will get back one row with the value 0. If you have 7 people matching, you will get back one row with the value 7.
If you change your query to select * from ... you will get the right answer from pg_num_rows().
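The difference is easy to see by comparing the two query shapes (a sketch; the student code '42' is made up):
-- Aggregate form: always exactly one row, even with no matches,
-- so pg_num_rows() is always 1:
SELECT count(*) FROM inscritossimulado WHERE codigo_do_aluno = '42';
--  count
-- -------
--      0
-- (1 row)
-- Plain form: zero rows when nothing matches, so pg_num_rows() works:
SELECT * FROM inscritossimulado WHERE codigo_do_aluno = '42';
-- (0 rows)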
Actually, don't count at all. You don't need the count. Just check for existence, which is proven if a single row qualifies:
$query = pg_query(
'SELECT 1
FROM inscritossimulado
WHERE codigo_da_escola = $$' . $CodEscola . '$$
AND codigo_do_simulado = $$' . $simulado. '$$
AND codigo_do_aluno = $$' . $aluno . '$$
LIMIT 1');
Returns 1 row if found, else no row.
Using dollar-quoting in the SQL code, so we can use the safer and faster single quotes in PHP (I presume).
The problem with the aggregate function count() (besides being more expensive) is that it always returns a row - with the value 0 if no rows qualify.
But this still stinks. Don't use string concatenation, which is an open invitation for SQL injection. Rather use prepared statements ... Check out PDO ...
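With a prepared statement the values travel separately from the SQL text, so no quoting or concatenation is needed. A sketch of the parameterized query with PostgreSQL-style $n placeholders (as consumed by pg_query_params(); PDO uses ? or :name markers instead):
SELECT 1
FROM inscritossimulado
WHERE codigo_da_escola   = $1
  AND codigo_do_simulado = $2
  AND codigo_do_aluno    = $3
LIMIT 1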