DSE: Query Timeout/Slow - DataStax

I am currently running a cluster of 3 nodes with roughly 200 million elements of data; the specific vertex label I'm querying has a total of 25 million vertices and 30 million edges. I am running the following query:
g.V().hasLabel('people_node').has("age", inside(0,25)).filter(outE('posted_question').count().is(gt(1))).profile()
I have tried this query on a smaller set of ~100 vertices and edges, and the profiler showed that indexes were used for all parts of the query. However, I think the problem might be in my schema, which is shown below.
Schema
schema.propertyKey('id').Text().ifNotExists().create()
schema.propertyKey('name').Text().ifNotExists().create()
schema.propertyKey('age').Int().ifNotExists().create()
schema.propertyKey('location').Point().withGeoBounds().ifNotExists().create()
schema.propertyKey('gender').Text().ifNotExists().create()
schema.propertyKey('dob').Timestamp().ifNotExists().create()
schema.propertyKey('tags').Text().ifNotExists().create()
schema.propertyKey('date_posted').Timestamp().ifNotExists().create()
schema.vertexLabel('people_node').properties('id','name','location','gender','dob').create()
schema.vertexLabel('questions_node').properties('id','tags','date_posted').create()
schema.edgeLabel('posted_question').single().connection('people_node','questions_node').create()
Indexes Used
schema.vertexLabel("people_node").index("search").search().by("name").by("age").by("gender").by("location").by("dob").ifNotExists().add()
schema.vertexLabel("people_node").index("people_node_index").materialized().by("id").ifNotExists().add()
schema.vertexLabel("questions_node").index("search").search().by("date_posted").by("tags").ifNotExists().add()
schema.vertexLabel("questions_node").index("questions_node_index").materialized().by("id").ifNotExists().add()
I have also read about "OLAP" queries. I believe I have activated them, but the query is still way too slow. Any advice or insight into what is slowing it down would be greatly appreciated.
Profile Statement (OLTP)
gremlin> g1.V().has("people_node","age", inside(0,25)).filter(outE('posted_question').count().is(gt(1))).profile()
==>Traversal Metrics
Step                                                               Count  Traversers  Time (ms)  % Dur
=============================================================================================================
DsegGraphStep(vertex,[],(age < 25 & age > 0 & l...                     1           1     38.310  25.54
  query-optimizer                                                                         0.219
    \_condition=((age < 25 & age > 0 & label = people_node) & (true))
  query-setup                                                                             0.001
    \_isFitted=true
    \_isSorted=false
    \_isScan=false
  index-query                                                                            26.581
    \_indexType=Search
    \_usesCache=false
    \_statement=SELECT "community_id", "member_id" FROM "MiniGraph"."people_node_p" WHERE "solr_query" = '{"q":"*:*", "fq":["age:{0 TO 25}"]}' LIMIT ?; with params (java.lang.Integer) 50000
    \_options=Options{consistency=Optional[ONE], serialConsistency=Optional.empty, fallbackConsistency=Optional.empty, pagingState=null, pageSize=-1, user=Optional[cassandra], waitForSchemaAgreement=true, async=true}
TraversalFilterStep([DsegVertexStep(OUT,[posted...                                      111.471  74.32
  DsegVertexStep(OUT,[posted_question],edge,(di...                     1           1     42.814
    query-optimizer                                                                       0.227
      \_condition=((direction = OUT & label = posted_question) & (true))
    query-setup                                                                           0.036
      \_isFitted=true
      \_isSorted=false
      \_isScan=false
    vertex-query                                                                         29.908
      \_usesCache=false
      \_statement=SELECT * FROM "MiniGraph"."people_node_e" WHERE "community_id" = ? AND "member_id" = ? AND "~~edge_label_id" = ? LIMIT ? ALLOW FILTERING; with params (java.lang.Integer) 1300987392, (java.lang.Long) 1026, (java.lang.Integer) 65584, (java.lang.Integer) 2
      \_options=Options{consistency=Optional[ONE], serialConsistency=Optional.empty, fallbackConsistency=Optional.empty, pagingState=null, pageSize=-1, user=Optional[cassandra], waitForSchemaAgreement=true, async=true}
      \_usesIndex=false
  RangeGlobalStep(0,2)                                                 1           1      0.097
  CountGlobalStep                                                      1           1      0.050
  IsStep(gt(1))                                                                          68.209
DsegPropertyLoadStep                                                                      0.205   0.14
                                            >TOTAL                     -           -    149.986      -
Next, since the partial query is much faster, I assume the time is being spent on the necessary graph traversals. Hence, is it possible to cache the results or activate the indexes (_usesIndex=false) so that OLAP queries run much faster?

Will you please post the output of the .profile() statement?
Semantically, it looks like you're trying to find all "people" under the age of 25 that have more than one posted question. Is that accurate?

Related

How to use `sum` within `summarize` in a KQL query?

I'm working on logging an Azure Storage Account. I have a Diagnostic Setting applied and am using Log Analytics to write KQL queries.
My goal is to determine the number of GetBlob requests (OperationName) for a given fileSize (RequestBodySize).
The challenge is that I need to sum the RequestBodySize for all GetBlob operations on each file. I'm not sure how to nest sum in summarize.
Tried so far:
StorageBlobLogs
| where TimeGenerated >= ago(5h)
and AccountName == 'storageAccount'
and OperationName == 'GetBlob'
| summarize count() by Uri, fileSize = format_bytes(RequestBodySize)
| render scatterchart
Results in:
Also tried: fileSize = format_bytes(sum(RequestBodySize)) but this errored out.
Any ideas?
EDIT 1: Testing out @Yoni's solution.
Here is an example of RequestBodySize with no summarization:
When implementing the summarize query (| summarize count() by Uri, fileSize = format_bytes(RequestBodySize)), the results are 0 bytes.
Though it's clear there are multiple calls for a given Uri, the sum doesn't seem to be working.
EDIT 2:
And yeah... pays to verify the field names! There is no RequestBodySize field available, only ResponseBodySize. Using the correct value worked (imagine that!).
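For completeness, plugging the corrected field into the answer's query (assuming ResponseBodySize is the field these diagnostic logs actually expose, as EDIT 2 found) looks like:

```kql
StorageBlobLogs
| where TimeGenerated >= ago(5h)
    and AccountName == 'storageAccount'
    and OperationName == 'GetBlob'
| summarize count(), total_size = format_bytes(sum(ResponseBodySize)) by Uri
```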
I need to sum the RequestBodySize for all GetBlob operations on each file
If I understood your question correctly, you could try this:
StorageBlobLogs
| where TimeGenerated >= ago(5h)
and AccountName == 'storageAccount'
and OperationName == 'GetBlob'
| summarize count(), total_size = format_bytes(sum(RequestBodySize)) by Uri
Here's an example using a dummy data set:
datatable(Url:string, ResponseBodySize:long)
[
"https://something1", 33554432,
"https://something3", 12341234,
"https://something1", 33554432,
"https://something2", 12345678,
"https://something2", 98765432,
]
| summarize count(), total_size = format_bytes(sum(ResponseBodySize)) by Url
Url                  count_  total_size
https://something1   2       64 MB
https://something3   1       12 MB
https://something2   2       106 MB

How Can I Optimize an SQL Query With Multiple Conditions and Inner Join

How can I optimize the SQL query below? It takes too much time. It selects from a very large astronomical table (a few hundred million objects), and it runs for over 12 hours, which is more than the maximum allowed query time. Adding more restrictions via additional conditions doesn't seem to help at all.
Your help is much appreciated!
select AVG(o.zMeanPSFMagStd) into mydb.z1819
from MeanObject as o
join ObjectThin ot on ot.ObjID=o.ObjID
inner join StackObjectAttributes soa on soa.objID=o.objid
where o.zMeanPSFMag>=18
and o.zMeanPSFMag<19
and o.zQfPerfect > 0.85
and (ot.qualityFlag & 1) = 0
and (ot.qualityFlag & 2) = 0
and o.zMeanPSFMagErr <> -999
and o.zMeanPSFMagStd <> -999
and ot.nz > 10 and soa.zpsfMajorFWHM < 6
and soa.zpsfMinorFWHM/nullif(soa.zpsfMajorFWHM,0) > 0.65
and soa.zpsfFlux/nullif(soa.zpsfFluxErr,0)>5
and (ot.b>10 or ot.b<-10)
and (ot.raMean>0 and ot.raMean<180)
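No answer is recorded here, but one common direction is to make sure the driving range predicate (zMeanPSFMag between 18 and 19) is backed by an index. The sketch below is only an illustration: the index name and INCLUDE column list are assumptions, and whether you may create indexes at all depends on your access rights to this shared database.

```sql
-- Hypothetical index supporting the zMeanPSFMag range scan;
-- INCLUDE columns let the other filters be evaluated without extra lookups.
CREATE INDEX IX_MeanObject_zMeanPSFMag
    ON MeanObject (zMeanPSFMag)
    INCLUDE (zQfPerfect, zMeanPSFMagErr, zMeanPSFMagStd, ObjID);
```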

Round in SQL Server

I'm trying to figure out a way to do this in SQL:
100000 % 1000 = 0
1150000 % 1000000 = 150000
I don't know if there is anything like that in SQL, or even if there is, I don't know what it is called.
Any ideas?
SELECT 1150000 / 1000000 AS result, 1150000 % 1000000 AS remainder;
Which gives: result = 1 and remainder = 150000.
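Since the question title mentions rounding: the same % (modulo) operator can be combined with subtraction to round a value down to the nearest multiple, using the numbers from the question:

```sql
SELECT 1150000 - 1150000 % 1000000 AS rounded_down;  -- rounded_down = 1000000
```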

DB2 COALESCE - notable impact on query time execution

I've noted that using COALESCE (in my case, to handle a possible NULL value in a prepared statement) causes a decrease in DB query execution time. Can someone explain the root cause, and how I can overcome this issue? Query samples below:
QUERY 1 (execution time 3 s):
SELECT TABLE_A.Y, TABLE_B.X
FROM ...
WHERE Z = ? AND TABLE_A.ABC = ? AND
TABLE_A.QWERTY = ? AND TABLE_A.Q = TABLE_B.Q;
QUERY 2 (execution time 210 s):
SELECT TABLE_A.Y, TABLE_B.X
FROM ...
WHERE Z = ? AND (
(COALESCE(?,'')='') OR
(TABLE_A.ABC = ? AND TABLE_A.QWERTY = ? AND TABLE_A.Q = TABLE_B.Q)
);
The only difference is using (COALESCE(?,'')='').
The bigger problem I see is that QUERY 1 has 3 placeholders whereas QUERY 2 has 4 placeholders.
I think what you're trying to do is that you want to make your placeholders optional.
A simple way to do this is to fix QUERY 1 as follows:
SELECT TABLE_A.Y, TABLE_B.X
FROM TABLE_A
INNER JOIN TABLE_B ON TABLE_A.Q = TABLE_B.Q
WHERE Z = ?
  AND TABLE_A.ABC = COALESCE(?, TABLE_A.ABC)
  AND TABLE_A.QWERTY = COALESCE(?, TABLE_A.QWERTY);

VBA with sql query - optimization

All,
I have two tables: Cases(IdCase, Name) and Alerts(IdAlert, refIdCase, Profile).
One case can have multiple alerts, connected by refIdCase.
I'm displaying the list in a VBA listbox which shows the case name and the profiles assigned to that case.
First I download 50 cases; then for each record I look up the profile names. Unfortunately it takes some time :( Is there any faster way to achieve that?
Set rsCases = cn.Execute("SELECT TOP 50 * FROM Cases")
Do Until rsCases.EOF
    Set rsProfiles = cn.Execute("SELECT DISTINCT TOP 50 Profile FROM Alert WHERE refIdCase = " & rsCases.Fields("IdCase").Value & ";")
    rsCases.MoveNext
Loop
UPDATE: I believe the problem is with our connection to the SQL server. We are located in Poland and the server is in North America. I performed the same action from a computer located in NA and it took only 4 seconds, but from Poland it takes around 45 seconds.
Thank you,
TJ
The problem is that you are sending 51 requests to the database. Send 1:
set rstCases = cn.Execute("SELECT c.IdCase, c.Name, a.IdAlert, a.Profile
FROM Cases c
INNER JOIN
(SELECT TOP 50 IdAlert, Profile
FROM Alerts
ORDER BY ???) a
ON c.IdCase=a.refIdCase
ORDER BY ???")
(Line breaks are for clarity - don't put them in your code)
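Consuming that single result set then takes one VBA loop instead of 51 round trips. This is only a sketch under the assumption that the joined query orders by IdCase, so each case's rows arrive together; the listbox handling is left out:

```vba
' Hypothetical sketch: one round trip, then group rows per case client-side
Set rsCases = cn.Execute("SELECT c.IdCase, c.Name, a.Profile " & _
    "FROM Cases c INNER JOIN Alerts a ON c.IdCase = a.refIdCase " & _
    "ORDER BY c.IdCase")
Do Until rsCases.EOF
    ' Rows for the same IdCase are adjacent, so the profiles for one case
    ' can be concatenated here before moving on to the next case.
    rsCases.MoveNext
Loop
```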