Very slow read from a database - SQL

I use Spring Boot with Spring Data JPA, Hibernate and Oracle.
In my table I have around 10 million records. I need to do some operation on each record, write info to a file and afterwards delete the record.
It's a basic SQL query:
select * from zzz where status = 2;
I did a test without doing the operation or deleting the records:
long start = System.nanoTime();
int page = 0;
Pageable pageable = PageRequest.of(page, LIMIT);
Page<Billing> pageBilling = billingRepository.findAllByStatus(pageable);
while (true) {
    for (Billing billing : pageBilling.getContent()) {
        // process
        // write to file
        // delete element
    }
    if (!pageBilling.hasNext()) {
        break;
    }
    pageable = pageBilling.nextPageable();
    pageBilling = billingRepository.findAllByStatus(pageable);
}
long end = System.nanoTime();
long microseconds = (end - start) / 1000;
System.out.println(microseconds + " microseconds");
The result is bad: with a limit of 10,000 it took 157 minutes; with 100,000, 28 minutes; with a limit in the millions, 19 minutes.
Is there a better solution to increase the performance?

The following are likely to improve the performance significantly:
You should not iterate past the first page. Instead, delete the processed data and select the first page again. Actually, you don't need a Pageable for that; you can encode the limit in the method name (for example findFirst1000ByStatus). Selecting late pages is rather inefficient.
The process of loading, processing and deleting one batch of items should run in its own transaction. Otherwise the EntityManager will hold on to every entity it has ever loaded, which will make things really slow. A sketch of this approach follows below.
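A minimal sketch of that approach, assuming a Spring Data derived query like findFirst1000ByStatus (the service class, the batch size of 1,000 and the hard-coded status 2 are illustrative assumptions, not the asker's actual code):

import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class BillingCleanupService {

    private final BillingRepository billingRepository;

    public BillingCleanupService(BillingRepository billingRepository) {
        this.billingRepository = billingRepository;
    }

    // One batch per transaction keeps the persistence context small.
    @Transactional
    public int processNextBatch() {
        // Always re-read the "first" batch; the previous one has already been deleted.
        List<Billing> batch = billingRepository.findFirst1000ByStatus(2);
        for (Billing billing : batch) {
            // process the record and write it to the file
        }
        billingRepository.deleteAll(batch);
        return batch.size();
    }
}

The caller simply repeats until nothing is left, so each iteration gets its own short transaction:

while (billingCleanupService.processNextBatch() > 0) {
    // keep going until no rows with status = 2 remain
}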
If that still isn't sufficient, you may look into the following:
Inspect the SQL that actually gets executed. Does it look sensible? If not, consider switching to JdbcTemplate or NamedParameterJdbcTemplate. With a query method that takes a RowCallbackHandler you should be able to load and process all rows with a single select statement, and at the end issue one delete statement to remove all rows (see the sketch after this list). This requires that the status you use for filtering does not change in the meantime.
What do the execution plans look like? If they seem off, inspect your indexes.
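A rough sketch of the JdbcTemplate variant (the table and filter are taken from the question's query; the class name and the file-writing part are only placeholders):

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class BillingExport {

    private final JdbcTemplate jdbcTemplate;

    public BillingExport(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Stream all matching rows through a single select, then remove them with one delete.
    // This only works if no other process changes rows with status = 2 in the meantime.
    public void exportAndDelete() {
        jdbcTemplate.query(
                "select * from zzz where status = 2",
                (RowCallbackHandler) rs -> {
                    // process the current row and write it to the file,
                    // e.g. via rs.getLong(...) / rs.getString(...)
                });
        jdbcTemplate.update("delete from zzz where status = 2");
    }
}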

Related

How to handle caching counters with Redis?

I am using Postgres as the main DB and Redis for caching. I am working on a caching mechanism for one DB query which takes too much time (it's about 5-6 JOINs plus nested SELECTs). For now I am caching the results of this query using SET 'some key' JSON.stringify(query.result). This works fine; however, I have one column that cannot be cached - it is called commentsCount. It has to be always up to date. As a temporary solution, I am querying the DB just for this one particular field, like this:
app.get('/post/getBySlug/:slug', function(req, res, next){
    var cacheKey = req.params.slug + '|' + req.params.language; // "my-post-slug|en-us" for example
    cache.get(cacheKey, function(err, post){
        if (err) throw err;
        if (post) {
            db.getPostCommentsCount({ where: { id: post.id }}).done(function(err, commentsCount){
                if (err) throw err;
                post.commentsCount = commentsCount;
                res.json(post);
                next();
            });
        } else {
            db.getFullPostBySlug(req.params.slug, req.params.language).done(function(err, post){
                if (err) throw err;
                cache.set(cacheKey, post);
                res.json(post);
                next();
            });
        }
    });
});
But it is still not what I want, because the main DB is still queried. Is there any standard/good practice for storing counters in Redis? My comment insert function looks like this:
START TRANSACTION
INSERT INTO "Comments" VALUES (...) -- insert the comment
UPDATE "Posts" SET "commentsCount" = "commentsCount" + 1 WHERE "Posts"."id" = 123456 -- update the counter on the post
COMMIT TRANSACTION
I am using a transaction because I don't want a comment to be inserted without incrementing the comments count. As a "side" question: is it better to make two SQL queries in a transaction, or to write a trigger to handle incrementing the counter?
Regarding my query (I posted a link to a gist in the comments):
We don't plan on more than 2 languages (though it is possible).
I made those counters because I have to keep the counters separate per language, be able to order by those separate counters, and also be able to order by the sum of the counters (the total for all languages) - I found it hard to write a query that would order by the sum of columns from separate rows while still returning those rows... (At the beginning the counters were stored in the language translations.)
Generally this query looks for a post for which a translation exists with a specific 'slug' and 'language' (slug+language on the post translation is a unique index). Moreover, the post has to be published (isPublished = boolean) and post.status has to be 'published' (status = enum), or post.isComingSoon has to be true (isComingSoon = boolean). Do you have an idea what index/ordering I could add to this query? Or should I just remove the limit?
In every translation table I keep the language as TEXT. It can be for example en-us or zh-cn etc. Do you think I should make it an enum, or maybe I should make another table to store languages and just keep a language_id in the translations?
The author actually can be null :)

twitter4j: when to STOP when no more tweets are available?

So, I've figured out how to get more than 100 tweets, thanks to How to retrieve more than 100 results using Twitter4j.
However, how do I make the script stop, and print "stop", when the maximum number of results has been reached? For example, I set
int numberOfTweets = 512;
and it finds just 82 tweets matching my query.
However, because of
while (tweets.size() < numberOfTweets)
it continues to query over and over until I max out my rate limit of 180 requests per 15 minutes.
I'm really a novice at Java, so I would really appreciate it if you could show me how to resolve this by modifying the first answer's script at How to retrieve more than 100 results using Twitter4j.
Thanks in advance!
You only need to modify things in the try{} block. One solution is to check whether the ID of the last tweet you found in the previous iteration of the while loop (previousLastID) is the same as the ID of the last tweet (lastID) in the newly collected batch (newTweets). If it is, it means the new batch's elements already exist in the previous array, and that we have reached the end of the available tweets for this hashtag.
try {
    QueryResult result = twitter.search(query);
    List<Status> newTweets = result.getTweets();
    long previousLastID = lastID;
    for (Status t : newTweets) {
        if (t.getId() < lastID) lastID = t.getId();
    }
    if (previousLastID == lastID) {
        println("Last batch (" + tweets.size() + " tweets) was the same as the previous one. Stopping the gathering process.");
        break;
    }
    // ... the rest of the original try/catch block stays unchanged
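For context, here is a self-contained sketch of how the whole gathering loop could look with that check in place (the hashtag, the 512 limit and the class name are made up for illustration; this is not the script from the linked answer):

import java.util.ArrayList;
import java.util.List;
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;

public class TweetGatherer {

    public static void main(String[] args) {
        Twitter twitter = TwitterFactory.getSingleton();
        int numberOfTweets = 512;
        long lastID = Long.MAX_VALUE;
        List<Status> tweets = new ArrayList<Status>();
        Query query = new Query("#example"); // hypothetical hashtag

        while (tweets.size() < numberOfTweets) {
            try {
                QueryResult result = twitter.search(query);
                List<Status> newTweets = result.getTweets();
                tweets.addAll(newTweets);

                // Track the smallest (oldest) tweet ID seen so far.
                long previousLastID = lastID;
                for (Status t : newTweets) {
                    if (t.getId() < lastID) lastID = t.getId();
                }

                // No older tweet was found: the results are exhausted.
                if (previousLastID == lastID) {
                    System.out.println("Stopped after " + tweets.size() + " tweets; no more available.");
                    break;
                }

                // Only ask for tweets older than the oldest one we already have.
                query.setMaxId(lastID - 1);
            } catch (TwitterException e) {
                System.err.println("Search failed: " + e.getMessage());
                break;
            }
        }
    }
}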

Optimization of SQL Select for enumerators

How can this query be optimized for enumerators:
SELECT * FROM Customers
Table Customers
customerId int - has index on it
customerName, etc
A SqlReader that returns a set of customers will be read on demand in an enumerator fashion. While it can return huge datasets that are read/consumed slowly in a foreach loop, every other query on the same table will encounter a lot of contention. How can this be optimized/avoided? Cursors, or selecting into temp tables?
Here is a code example that will cause a lot of contention (I profiled it and the numbers look bad indeed):
public void DumpCustomers()
{
    Thread thread = new Thread(AccessCustomers);
    thread.Start();

    // GetCustomers returns an enumerator with yield return; big # of customers
    foreach (Customer customer in GetCustomers())
    {
        Console.WriteLine(customer.CustomerName);
        System.Threading.Thread.Sleep(200);
    }

    thread.Abort();
}

public void AccessCustomers()
{
    while (true)
    {
        Console.WriteLine(GetCustomer("Zoidberg").CustomerName);
        Thread.Sleep(100);
    }
}
P.S. I will also need to optimize this in MySQL.
1) Do you need the '*'? Can't you just specify the columns you need?
2) Use multi-part names (dbo.tablename.fieldname) - this speeds it up.
3) Try a locking hint: WITH (NOLOCK) or WITH (READPAST).
4) What's the IO profile? Does SQL have to pull the data from disk every time it runs?
5) Do you find one of the cores on your server maxing out while the others are idle?
6) Cache it! Until you know there has been a change - then reload it.
I've run out of ideas...

Are transactions possible with HTML5 Storage in Safari

Instead of doing an each loop over a JSON file containing a list of SQL statements and passing them one at a time, is it possible with Safari client-side storage to simply wrap the data in "BEGIN TRANSACTION" / "COMMIT TRANSACTION" and pass that to the database system in a single call? Looping over 1,000+ statements takes too much time.
Currently iterating one transaction at a time:
$j.getJSON("update1.json",
    function(data){
        $j.each(data, function(i, item){
            testDB.transaction(
                function (transaction) {
                    transaction.executeSql(data[i], [], nullDataHandler, errorHandler);
                }
            );
        });
    });
Trying to figure out how to make just one call:
$j.getJSON("update1.json",
    function(data){
        testDB.transaction(
            function (transaction) {
                transaction.executeSql(data, [], nullDataHandler, errorHandler);
            }
        );
    });
Has anybody tried this yet and succeeded?
Every example I could find in the documentation seems to show only one SQL statement per executeSql call. I would just suggest showing an "ajax spinner" loading graphic and executing your SQL in a loop. You can keep it all within one transaction, but the loop would still need to be there:
$j.getJSON("update1.json",
    function(data){
        testDB.transaction(
            function (transaction) {
                for (var i = 0; i < data.length; i++) {
                    transaction.executeSql(data[i], [], nullDataHandler, errorHandler);
                }
            }
        );
    }
);
Moving the loop inside the transaction and using a plain for loop should help you get a little more speed out of it. $.each is fine for fewer than about 1,000 iterations; after that, the native for (var i = ...) will probably be faster.
Note: using my code, if any of your SQL statements throws an error, the entire transaction will fail. If that is not your intention, you will need to keep the loop outside the transaction.
I haven't ever messed with HTML5 database storage (I have with local/sessionStorage though), but I would assume that it's possible to run one huge string of statements. Use data.join(separator here) to get the string representation of the data array.
Yes, it is possible to process a whole group of statements within a single transaction with Web SQL. You actually don't even need to use BEGIN or COMMIT; this is taken care of for you automatically, as long as you make all your executeSql calls from the same transaction. As long as you do this, every statement gets included within the transaction.
This makes the process much faster and also means that when one of your statements has an error, the entire transaction is rolled back.

Paging Lucene's search results

I am using Lucene to show search results in a web application. I am also using custom paging to display them.
Search results could vary from 5,000 to 10,000 or more.
Can someone please tell me the best strategy for paging and caching the search results?
I would recommend you don't cache the results, at least not at the application level. Running Lucene on a box with lots of memory that the operating system can use for its file cache will help, though.
Just repeat the search with a different offset for each page. Caching introduces statefulness that, in the end, undermines performance. We have hundreds of concurrent users searching an index of over 40 million documents. Searches complete in much less than one second without using explicit caching.
Using the Hits object returned from search, you can access the documents for a page like this:
Hits hits = searcher.search(query);
int offset = page * recordsPerPage;
int count = Math.min(hits.length() - offset, recordsPerPage);
for (int i = 0; i < count; ++i) {
    Document doc = hits.doc(offset + i);
    ...
}
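A side note: the Hits class was removed in later Lucene versions. On a newer Lucene the same offset-based paging can be sketched with TopDocs (page, recordsPerPage, searcher and query are assumed to exist as above; this is only a sketch, not code from the answer):

// Inside a method; TopDocs, ScoreDoc and IndexSearcher come from org.apache.lucene.search,
// Document from org.apache.lucene.document.
int offset = page * recordsPerPage;

// Ask for enough results to cover the requested page, then skip the offset.
TopDocs topDocs = searcher.search(query, offset + recordsPerPage);
ScoreDoc[] scoreDocs = topDocs.scoreDocs;

int end = Math.min(scoreDocs.length, offset + recordsPerPage);
for (int i = offset; i < end; ++i) {
    Document doc = searcher.doc(scoreDocs[i].doc);
    // ... render this document as one row of the results page
}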