Fast JPA Query but slow mapping to entity - sql

Because of performance issues with fetching about 30k results from DB as entities when using Hibernate JPA, i instead tried to write a namedQuery to have more control over the query and its runtime. What i end up with is almost 20 seconds just for those few entities, and those 20 seconds are necessary for the "old" query and my own namedQuery (which doesn't take a second to get the result when executed in a sql client), so basically it doesn't make any difference whether i use a namedQuery or the hibernate-generated query.
Is it safe to assume that 98% of the time is used for mapping those results to their corresponding entities? And if so, how should i speed this up? Below is the query that i wrote myself (note that i explicitly have to state all the columns in the SELECT)
SELECT exp.ID
,exp.CREATEDBY
,exp.CREATEDTIME
,exp.DELETED
,exp.LASTCHANGE
,exp.LASTCHANGEBY
,exp.STATUS
,exp.BRIXFIGURE
,exp.GRAMMAGE
,exp.INDIVIDUALPACKAGING
,exp.MINORDERQUANTITY
,exp.PACKAGINGHEIGHT
,exp.PACKAGINGLENGTH
,exp.PACKAGINGWIDTH
,exp.PALETTESIZE
,exp.QUANTITY
,exp.UNIT
,exp.VALIDUNTIL
,exp.EXPORTELEMENT_START
,exp.EXPORTSTATUS
,exp.webServiceResponse
,exp.CATEGORYID
,exp.COMMENTID
,exp.SUPPLIERID
,exp.TRANSPORTPACKAGINGID
,exp.LocationId
,exp.PriceRowId
,exp.EXPORTELEMENT_ENDDATE
,exp.BASEPRICE
,exp.BASEUNIT
,exp.BARCODES
,exp.EXPIRYDATE
,exp.PREORDERPERIOD
,exp.EXPORTWEEKID
,exp.EXPORT_TENDER_UID
,exp.EXPORT_UID
,exp.CURRENCY_ID
,exp.WEIGHT_PER_BOX
FROM EXPORTELEMENT AS exp
JOIN EXPORTELEMENT_LOCATION as exlo ON exlo.EXPORTELEMENTID = exp.ID
WHERE exlo.LOCATIONID = :locationId
AND exp.EXPORTELEMENT_ENDDATE <= :endDate
AND exp.EXPORTELEMENT_START >= :startDate
AND exp.DELETED = :deleted

Writing raw sql vs. letting hibernate/jpa do it for you doesn't improve the performance. The reason might be that your object is mapped to other objects (Fetch eager as opposed to lazy) that map to other objects etc...So you could potentially be pulling your whole db. You might think your query is the only one being executed, but reality is the other mappings might be creating/executing more sql queries...In my case for 10,000 rows doing the mapping myself took 100 milliseconds, but letting hibernate/jpa do the mapping took 10s, a whole 100x.
What improves the performance is doing the mapping yourself. Something like this:
#Query(nativeQuery = true, value = "your_raw_sql_here")
List<Object[]> yourNativeQueryMethod();
Then you can map the object yourself:
for( Object[] objectArray: results) {
BigInteger id = (BigInteger) objectArray[0];
//etc...
}

Related

Building relationships in Neo4j via Py2neo is very slow

We have 5 different types of nodes in database. Largest one has ~290k, the smallest is only ~3k. Each node type has an id field and they are all indexed. I am using py2neo to build relationship, but it is very slow (~ 2 relationships inserted per second)
I used pandas read from a relationship csv, iterate each row to create a relationship wrapped in transaction. I tried batch out 10k creation statements in one transaction, but it does not seem to improve the speed a lot.
Below is the code:
df = pd.read_csv(r"C:\relationship.csv",dtype = datatype, skipinitialspace=True, usecols=fields)
df.fillna('',inplace=True)
def f(node_1 ,rel_type, node_2):
try:
tx = graph.begin()
tx.evaluate('MATCH (a {node_id:$label1}),(b {node_id:$label2}) MERGE (a)-[r:'+rel_type+']->(b)',
parameters = {'label1': node_1, 'label2': node_2})
tx.commit()
except Exception as e:
print(str(e))
for index, row in df.iterrows():
if(index%1000000 == 0):
print(index)
try:
f(row["node_1"],row["rel_type"],row["node_2"])
except:
print("error index: " + index)
Can someone help me what I did wrong here. Thanks!
You state that there are "5 different types of nodes" (which I interpret to mean 5 node labels, in neo4j terminology). And, furthermore, you state that their id properties are already indexed.
But your f() function is not generating a Cypher query that uses the labels at all, and neither does it use the id property. In order to take advantage of your indexes, your Cypher query has to specify the node label and the id value.
Since there is currently no efficient way to parameterize the label when performing a MATCH, the following version of the f() function generates a Cypher query that has hardcoded labels (as well as a hardcoded relationship type):
def f(label_1, id_1, rel_type, label_2, id_2):
try:
tx = graph.begin()
tx.evaluate(
'MATCH' +
'(a:' + label_1 + '{id:$id1}),' +
'(b:' + label_2 + '{id:$id2}) ' +
'MERGE (a)-[r:'+rel_type+']->(b)',
parameters = {'id1': id_1, 'id2': id_2})
tx.commit()
except Exception as e:
print(str(e))
The code that calls f() will also have to be changed to pass in both the label names and the id values for a and b. Hopefully, your df rows will contain that data (or enough info for you to derive that data).
If your aim is for better performance then you will need to consider a different pattern for loading these, i.e. batching. You're currently running one Cypher MERGE statement for each relationship and wrapping that in its own transaction in a separate function call.
Batching these by looking at multiple statements per transaction or per function call will reduce the number of network hops and should improve performance.

Getting Term Frequencies For Query

In Lucene, a query can be composed of many sub-queries. (such as TermQuery objects)
I'd like a way to iterate over the documents returned by a search, and for each document, to then iterate over the sub-queries.
For each sub-query, I'd like to get the number of times it matched. (I'm also interested in the fieldNorm, etc.)
I can get access to that data by using indexSearcher.explain, but that feels quite hacky because I would then need to parse the "description" member of each nested Explanation object to try and find the term frequency, etc. (also, calling "explain" is very slow, so I'm hoping for a faster approach)
The context here is that I'd like to experiment with re-ranking Lucene's top N search results, and to do that it's obviously helpful to extract as many "features" as possible about the matches.
Via looking at the source code for classes like TermQuery, the following appears to be a basic approach:
// For each document... (scoreDoc.doc is an integer)
Weight weight = weightCache.get(query);
if (weight == null)
{
weight = query.createWeight(indexSearcher, true);
weightCache.put(query, weight);
}
IndexReaderContext context = indexReader.getContext();
List<LeafReaderContext> leafContexts = context.leaves();
int n = ReaderUtil.subIndex(scoreDoc.doc, leafContexts);
LeafReaderContext leafReaderContext = leafContexts.get(n);
Scorer scorer = weight.scorer(leafReaderContext);
int deBasedDoc = scoreDoc.doc - leafReaderContext.docBase;
int thisDoc = scorer.iterator().advance(deBasedDoc);
float freq = 0;
if (thisDoc == deBasedDoc)
{
freq = scorer.freq();
}
The 'weightCache' is of type Map and is useful so that you don't have to re-create the Weight object for every document you process. (otherwise, the code runs about 10x slower)
Is this approximately what I should be doing? Are there any obvious ways to make this run faster? (it takes approx 2 ms for 280 documents, as compared to about 1 ms to perform the query itself)
Another challenge with this approach is that it requires code to navigate through your Query object to try and find the sub-queries. For example, if it's a BooleanQuery, you call query.clauses() and recurse on them to look for all leaf TermQuery objects, etc. Not sure if there is a more elegant / less brittle way to do that.

Implementing Sticky Threads In Django

Note: I'm new to both django and databases, so please excuse my ignorance.
I'm trying to implement a forum in django and wish to have sticky threads. The naive way that I was thinking of do this was to define the Thread model like this:
class Thread(models.Model):
title = models.CharField(max_length=max_title_length)
author = models.ForeignKey(Player, related_name="nonsticky_threads")
post_date = models.DateField()
parent = models.ForeignKey(Subsection, related_name="nonsticky_threads")
closed = models.BooleanField()
sticky = models.BooleanField()
and then to get the sticky threads, do something like this:
sticky_threads = Thread.objects.all().filter(sticky=True)
The problem is that at least theoretically this has O(n) complexity, which sounds bad. (Since sticky threads are always displayed on the first page, this query will be run fairly frequently) However, I don't know how database/django cleverness will affect the final performance or if it will still be bad.
My current alternative is to also create distinct Thread and Sticky_Thread classes:
class Thread(models.Model):
title = models.CharField(max_length=max_title_length)
author = models.ForeignKey(Player, related_name="nonsticky_threads")
post_date = models.DateField()
parent = models.ForeignKey(Subsection, related_name="nonsticky_threads")
closed = models.BooleanField()
class Sticky_Thread(models.Model):
title = models.CharField(max_length=max_title_length)
author = models.ForeignKey(Player, related_name="sticky_threads")
post_date = models.DateField()
parent = models.ForeignKey(Subsection, related_name="sticky_threads")
closed = models.BooleanField()
letting me grab the sticky threads in O(1) time no matter what. What I don't like about this approach is that now if I want to just get all of a player's threads, I have to implement a special threads property like this:
class Player(models.Model):
[snip]
#property
def threads(self):
return self.sticky_threads | self.nonsticky_threads
and this approach feels ugly.
Is there an obviously best way to imeplement something like this? Do I just need to do timings to see if the naive way is acceptable? (I'm implementing this as a learning exercise, so I don't really have hard limits, making this check a little difficult) (If so, how would you recommend I do that? (IS something like timeit the bast way?) Is there a better alternative?
Thanks!
Your analysis of the complexity of those two operations is way off. It's simply not true to classify the filter operation as O(n) and the two separate classes as O(1) - I don't know what you're using to make that distinction. Databases are highly optimized for selecting on individual criteria: an index on the sticky column will make the filter query almost exactly the same as querying for everything from a separate table.
The first way is without question the right way to go about this, as long as you ensure that your sticky column is indexed.

Groovy Sql Paging Behavior

I am using the groovy.sql.Sql class to query a database and process the results. My problem is that the ResultSet can be very large; so large that I risk running out of memory if I try to process the whole ResultSet at once. I know the Sql.rows() method supports paging using offset and max results parameters but I haven't been able to find a good example of how to use it (and I'm not certain that paging is what I'm looking for).
Basically, here's what I'm trying to do:
def endOfResultSet = false
for(int x = 1; !endOfResultSet; x+=1000){
def result = sql.rows("Select * from table", x, 1000)
processResult(result)
endOfResultSet = result.size()!=1000
}
My question is if Groovy is smart enough to reuse the same result set for the sql.rows("Select * from table", x, 1000) call or if it will be repeatedly be running the same statement on the database and then paging to where the offset starts.
Your help is appreciated, Thanks!
Edit: What I'm trying to avoid is running the same query on the database multiple times. I'd like to run the query once, get the first 1,000 rows, process them, get the next 1,000 rows, etc... until all the rows are processed.
I assume you've seen this blog post about paging?
To answer your question, if we look at the code for the Sql class in Groovy, we can see that the code for rows(String,int,int) calls rows(String,int,int,null)
And the code for that is:
AbstractQueryCommand command = createQueryCommand(sql);
ResultSet rs = null;
try {
rs = command.execute();
List<GroovyRowResult> result = asList(sql, rs, offset, maxRows, metaClosure);
rs = null;
return result;
} finally {
command.closeResources(rs);
}
So as you can see, it gets the full ResultSet, then steps through this inside the asList method, filling a List<GroovyRowResult> object with just the results you requested.
Edit (after the question was edited)
As I said in my comment below, I think you're going to need to write your own paging query for the specific database you are using... For example, with MySQL, your above query can be changed to:
def result = sql.rows( "SELECT * FROM table LIMIT ${Sql.expand x}, 1000" )
Other databases will have different methods for this sort of thing...I don't believe there is a standard implementation
Answer from above is not correct. If you dig deeper, you'll find that if the ResultSet is not TYPE_FORWARD_ONLY, then the "absolute" method of the ResultSet is invoked to position a server side cursor. Then maxRows are returned. If the ResultSet is TYPE_FORWARD_ONLY, then ResultSet.next() is invoked offset number of times, then maxRows are returned. The exact performance characteristics will depend on the underlying jdbc driver implementation, but usually you want a scrollable result set when using the paging feature.
The resultset is not reused between invocations. Sounds like you want something like streaming, not paging.
Also, I wrote the patch, btw.
http://jira.codehaus.org/browse/GROOVY-4622

NDepend Count, Average, etc.. reporting... aggregates. Possible? clean work arounds?

We have a huge code base, where methods with too many local variables alone returns 226 methods. I don't want this huge table being dumped into the xml output to clutter it up, and I'd like the top 10 if possible, but what I really want is the count so we can do trending and executive summaries. Is there a clean/efficient/scalable non-hacky way to do this?
I imagine I could use an executable task, instead of the ndepend task (so that the merge is not automatic) and the clutter doesn't get merged. Then manually operate on those files to get a summary, but I'd like to know if there is a shorter path?
What about defining a base line to only take account of new flaws?
what I really want is the count so we can do trending and executive summaries
Trending can be easily achieved with code queries and rules over LINQ (CQLinq) like: Avoid making complex methods even more complex (Source CC)
// <Name>Avoid making complex methods even more complex (Source CC)</Name>
// To visualize changes in code, right-click a matched method and select:
// - Compare older and newer versions of source file
// - Compare older and newer versions disassembled with Reflector
warnif count > 0
from m in JustMyCode.Methods where
!m.IsAbstract &&
m.IsPresentInBothBuilds() &&
m.CodeWasChanged()
let oldCC = m.OlderVersion().CyclomaticComplexity
where oldCC > 6 && m.CyclomaticComplexity > oldCC
select new { m,
oldCC ,
newCC = m.CyclomaticComplexity ,
oldLoc = m.OlderVersion().NbLinesOfCode,
newLoc = m.NbLinesOfCode,
}
or Avoid transforming an immutable type into a mutable one.