Optimization of SQL Select for enumerators

How can this query be optimized for enumerators:
SELECT * FROM Customers
Table Customers
customerId int - has index on it
customerName, etc
A SqlDataReader returning a set of customers will be read on demand, in an enumerator fashion. Since it can return huge datasets that are read/consumed slowly in a foreach loop, every other query on the same table will encounter a lot of contention. How can this be optimized/avoided? Cursors, or selecting into temp tables?
Here is a code example that causes a lot of contention (I profiled it and the numbers look bad indeed):
public void DumpCustomers()
{
    Thread thread = new Thread(AccessCustomers);
    thread.Start();
    // GetCustomers returns enumerator with yield return; big # of customers
    foreach (Customer customer in GetCustomers())
    {
        Console.WriteLine(customer.CustomerName);
        System.Threading.Thread.Sleep(200);
    }
    thread.Abort();
}

public void AccessCustomers()
{
    while (true)
    {
        Console.WriteLine(GetCustomer("Zoidberg").CustomerName);
        Thread.Sleep(100);
    }
}
public void AccessCustomers()
{
while (true)
{
Console.WriteLine(GetCustomer("Zoidberg").CustomerName);
Thread.Sleep(100);
}
}
P.S. I will also need to optimize this in MySQL.

1) Do you need the '*'? Can't you specify just the columns you need?
2) Use multi-part names (dbo.tablename.fieldname); this speeds things up.
3) Try a locking hint such as WITH (NOLOCK) or WITH (READPAST) (see the sketch after this list).
4) What's the I/O profile? Does SQL Server have to pull the data from disk every time it runs?
5) Do you find one of the cores on your server maxed out while the others are idle?
6) Cache it! Until you know there has been a change, then reload it.
I've run out of ideas...
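To make points 1) and 3) concrete, here is a minimal sketch of the kind of query they describe. It is written in Java/JDBC purely for illustration (the question's code is C#); the connection string is a placeholder, and the column names come from the table description above. Whether dirty reads (NOLOCK) or skipping locked rows (READPAST) is acceptable depends on the application.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CustomerDumpSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; adjust for your server.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:sqlserver://localhost;databaseName=Shop;integratedSecurity=true");
             Statement st = con.createStatement()) {
            // Hint the driver to fetch rows in chunks instead of buffering everything.
            st.setFetchSize(500);
            // Explicit column list instead of SELECT *, multi-part table name,
            // and a locking hint so a slow reader does not block writers.
            String sql = "SELECT customerId, customerName "
                       + "FROM dbo.Customers WITH (NOLOCK)";
            try (ResultSet rs = st.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.println(rs.getString("customerName"));
                }
            }
        }
    }
}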

Read very slow from a database

I use Spring Boot with Spring Data JPA, Hibernate and Oracle.
In my table I have around 10 million records. I need to do some operations, write info to a file and afterwards delete the record.
It's a basic SQL query:
select * from zzz where status = 2;
I did a test without doing the operations and without deleting the records:
long start = System.nanoTime();
int page = 0;
Pageable pageable = PageRequest.of(page, LIMIT);
Page<Billing> pageBilling = billingRepository.findAllByStatus(pageable);
while (true) {
    for (Billing billing : pageBilling.getContent()) {
        // process
        // write to file
        // delete element
    }
    if (!pageBilling.hasNext()) {
        break;
    }
    pageable = pageBilling.nextPageable();
    pageBilling = billingRepository.findAllByStatus(pageable);
}
long end = System.nanoTime();
long microseconds = (end - start) / 1000;
System.out.println(microseconds + " to write");
The result is bad: with a limit of 10,000 it took 157 minutes, with 100,000 it took 28 minutes, and with millions, 19 minutes.
Is there a better solution to increase performance?
The following are likely to improve the performance significantly:
You should not iterate past the first page. Instead, delete the processed data and select the first page again. Actually you don't need a Page for that; you can encode the limit in the method name. Selecting late pages is rather inefficient.
The process of loading, processing and deleting one batch of items should run in a separate transaction. Otherwise the EntityManager will hold on to every entity ever loaded, which will make things really slow. (A sketch combining these two points follows below.)
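As a rough sketch of how those two suggestions could fit together: re-select the first batch each time, with the limit encoded in the repository method name, and run each batch in its own transaction. The service class, the derived query findFirst1000ByStatus and the status value 2 are assumptions for illustration; the question's repository only shows findAllByStatus.

import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class BillingBatchProcessor {

    private final BillingRepository billingRepository;

    public BillingBatchProcessor(BillingRepository billingRepository) {
        this.billingRepository = billingRepository;
    }

    // One batch per transaction: the persistence context is flushed and
    // discarded after every call, so it never accumulates millions of entities.
    @Transactional
    public int processOneBatch() {
        // Derived query with the limit in the method name; it always reads the
        // "first page" because the processed rows are deleted below.
        List<Billing> batch = billingRepository.findFirst1000ByStatus(2);
        for (Billing billing : batch) {
            // process the row and write it to the file, then remove it
            billingRepository.delete(billing);
        }
        return batch.size();
    }
}

A caller in another bean (so the transactional proxy applies) simply loops until processOneBatch() returns 0.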
If that still isn't sufficient, you may look into the following:
Inspect the SQL that gets executed. Does it look sensible? If not, consider switching to JdbcTemplate or NamedParameterJdbcTemplate with a query method that takes a RowCallbackHandler: you should be able to load and process all rows with a single select statement and, at the end, issue one delete statement to remove all rows (see the sketch after this list). This requires that the status you use for filtering does not change in the meantime.
What do the execution plans look like? If they seem off, inspect your indices.
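To make the JdbcTemplate suggestion concrete, a minimal sketch could look like the following. The SQL comes from the question; the class name is made up and the actual processing/file writing is elided.

import java.sql.ResultSet;
import java.sql.SQLException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class BillingJdbcExporter {

    private final JdbcTemplate jdbcTemplate;

    public BillingJdbcExporter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void exportAndDelete() {
        // Single streaming SELECT: each row is handed to the callback as it
        // arrives; nothing is collected in a persistence context.
        jdbcTemplate.query("select * from zzz where status = 2", new RowCallbackHandler() {
            @Override
            public void processRow(ResultSet rs) throws SQLException {
                // process the row and write it to the file
            }
        });

        // One DELETE at the end removes everything that was just processed.
        // This assumes no other process touches status = 2 rows in the meantime.
        jdbcTemplate.update("delete from zzz where status = 2");
    }
}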

Using deepstream List for tens of thousands of unique values

I wonder if it's a good or bad idea to use deepstream record.getList for storing a lot of unique values, for example emails or any other unique identifiers. The main purpose is to be able to quickly answer whether we already have, say, a user with a given email (email in use), or another record identified by a specific unique field.
I made a few experiments today and ran into two problems:
1) When I tried to populate the list with a few thousand values I got
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory
and my deepstream server went down. I was able to fix it by giving more memory to the server Node process with this flag:
--max-old-space-size=5120
It doesn't look right, but it allowed me to create a list with more than 5,000 items.
2) It wasn't enough for my tests, so I pre-created the list with 50,000 items, put the data directly into the RethinkDB table, and got another issue when getting or modifying the list:
RangeError: Maximum call stack size exceeded
I was able to fix it with another flag:
--stack-size=20000
It helps, but I believe it's only a matter of time before one of those errors appears in production once the list size reaches a certain value. I don't really know whether it's a Node.js, JavaScript, deepstream or RethinkDB issue. All of this made me think that I'm using deepstream List the wrong way. Please let me know. Thank you in advance!
Whilst you can use lists to store arrays of strings, they are actually intended as collections of record names: the actual data would be stored in the records themselves, and the list would only manage the order of the records.
Having said that, there are two open GitHub issues to improve performance for very long lists, by sending more efficient deltas and by introducing a pagination option.
Interesting results in regards to memory though; that's definitely something that needs to be handled more gracefully. In the meantime you could drastically improve performance by combining updates into one:
var myList = ds.record.getList( 'super-long-list' );

// Sends 10,000 messages
for( var i = 0; i < 10000; i++ ) {
    myList.addEntry( 'something-' + i );
}

// Sends 1 message
var entries = [];
for( var i = 0; i < 10000; i++ ) {
    entries.push( 'something-' + i );
}
myList.setEntries( entries );

How to handle caching counters with Redis?

I am using Postgres as the main DB and Redis for caching. I am working on a caching mechanism for one DB query which takes too much time (it's about 5-6 JOINs plus nested SELECTs). For now I am caching the results of this query using SET 'some key' JSON.stringify(query.result). This works fine; however, I have one column that cannot be cached: it is called commentsCount and it always has to be up to date. As a temporary solution, I am querying the DB just for this one particular field, like this:
app.get('/post/getBySlug/:slug', function(req, res, next){
    var cacheKey = req.params.slug + '|' + req.params.language; // "my-post-slug|en-us" for example
    cache.get(cacheKey, function(err, post){
        if (err) throw err;
        if (post) {
            db.getPostCommentsCount({ where: { id: post.id }}).done(function(err, commentsCount){
                if (err) throw err;
                post.commentsCount = commentsCount;
                res.json(post);
                next();
            });
        } else {
            db.getFullPostBySlug(req.params.slug, req.params.language).done(function(err, post){
                if (err) throw err;
                cache.set(cacheKey, post);
                res.json(post);
                next();
            });
        }
    });
});
But it is still not what I want, because the main DB is still queried. Is there any standard/good practice for storing counters in Redis? My comment insert function looks like this:
START TRANSACTION;
INSERT INTO "Comments" VALUES (...); -- insert the comment
UPDATE "Posts" SET "commentsCount" = "commentsCount" + 1 WHERE "Posts"."id" = 123456; -- update the counter on the post
COMMIT TRANSACTION;
I am using a transaction because I don't want a comment to be inserted without incrementing the comments count. As a "side" question: is it better to make the two SQL queries in a transaction, or to write a trigger that handles incrementing the counter?
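On the Redis side of that question, the common pattern is to keep such a counter under its own key and bump it atomically with INCR, then merge it into the cached JSON at read time. A minimal sketch, written in Java with the Jedis client purely for illustration (the question's code is Node.js, and the key name is made up):

import redis.clients.jedis.Jedis;

public class CommentCounterSketch {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Right after the comment INSERT commits:
            // INCR is atomic, so concurrent inserts cannot lose an update.
            long afterInsert = jedis.incr("post:123456:commentsCount");

            // When serving the cached post: read the live counter and overwrite
            // commentsCount in the cached JSON before responding.
            String liveCount = jedis.get("post:123456:commentsCount");
            System.out.println("counter after insert: " + afterInsert + ", live value: " + liveCount);
        }
    }
}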
According to my query (I posted a link to the gist in the comments):
We don't plan more than 2 languages (though it is possible).
I made those counters because I have to keep the counters separate per language, be able to order by those separate counters, and also be able to order by the sum of the counters (the total for all languages). I found it hard to write a query that would order by the sum of columns from separate rows while still returning those rows. (In the beginning the counters were stored in the language translations.)
Generally this query looks for a post where a translation exists with a specific 'slug' and 'language' (slug + language is a unique index on the post translation). Moreover, the post has to be published (isPublished = boolean) and post.status has to be 'published' (status = enum), or post.isComingSoon has to be true (isComingSoon = boolean). Do you have an idea what index/ordering I could add to this query? Or should I just remove the limit?
In every translation table I keep the language as TEXT. It can be, for example, en-us or zh-cn, etc. Do you think I should make it an enum, or maybe I should make another table to store languages and just keep a language_id in the translations?
The author actually can be null :)

Save huge array to database

First the introduction, in case there is a better approach: I have a product table with product_id and stock, where stock can be as big as 5,000 or 10,000. I need to create a list (in another table) with a row for each item; that is, if a product_id has stock 1000 I'll have 1000 rows with this product_id. On top of that, this list needs to be random.
I chose a PHP (Symfony2) solution, as I found how to get a single random product_id based on stock, and even how to randomly order the product list, but I didn't find how to "multiply" these rows by stock.
Now, the main problem:
So in PHP it's not so difficult: get the product_id list, "multiply" by stock and shuffle. The problem comes when I want to save:
If I use $em->flush() every 100 records or more, I get a memory overflow after a while.
If I use $em->flush() on every record, it takes ages to save.
This is my code to save, which maybe you can improve:
foreach ($huge_random_list as $indice => $id_product)
{
    $preasignacion = new ListaPreasignacion();
    $preasignacion->setProductId($id_product);
    $preasignacion->setOrden($indice + 1);
    $em->persist($preasignacion);
    if ($indice % 100 == 0) $em->flush();
}
$em->flush();
Edit with the final solution, based on #Pazi's suggestion:
$conn = $em->getConnection();
foreach ($huge_random_list as $indice => $id_product)
{
    $conn->executeUpdate("insert into product_list(product_id, order) "
        ." values({$id_product}, {$indice})");
}
I would suggest abstaining from the Doctrine ORM and using the DBAL connection and pure SQL queries for this purpose. I always do this in my applications when I have to store a lot of data in a short time. Doctrine adds too much overhead with objects, checks and hydration. You can retrieve the DBAL connection via the DI container, for example in a controller:
$conn = $this->get('database_connection');
Read more about DBAL
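For comparison, the same "plain SQL, one prepared statement, batched" idea expressed with JDBC (Java is used only for illustration here; the connection URL is a placeholder, the column is named orden in this sketch because order is a reserved word, and the batch size of 1000 is arbitrary):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class ProductListInserter {
    // hugeRandomList holds the pre-shuffled product ids, as in the question.
    public static void insert(List<Integer> hugeRandomList) throws Exception {
        // Placeholder connection URL and credentials.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/shop", "user", "secret")) {
            con.setAutoCommit(false);
            String sql = "insert into product_list (product_id, orden) values (?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                for (int i = 0; i < hugeRandomList.size(); i++) {
                    ps.setInt(1, hugeRandomList.get(i));
                    ps.setInt(2, i + 1);
                    ps.addBatch();
                    if (i % 1000 == 0) {
                        ps.executeBatch(); // send rows to the server in chunks
                    }
                }
                ps.executeBatch();         // flush the remainder
                con.commit();
            }
        }
    }
}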

Perl DBI - transfer data between two sql servers - fetchall_arrayref

I have two servers, dbh1 and dbh2. I query dbh1 and pull data via the fetchall_arrayref method. Once I execute the query, I want to insert the output from dbh1 into a temp table on server dbh2.
I am able to establish access to both servers at the same time and am able to pull data from both.
1. I pull data from dbh1:
while ($row = shift(@$rowcache) || shift(@{ $rowcache = $sth1->fetchall_arrayref(undef, $max_rows) })) {
    # call to sub insert2tempData
    &insert2tempData(values @{$row});
}
2. Then on dbh2 I have an insert query:
INSERT INTO ##population (someid, Type, anotherid)
VALUES ('123123', 'blah', '634234');
Question:
How can I insert the bulk result of the fetchall_arrayref from dbh1 into the temp table on server dbh2 (without looping through individual records)?
OK, so I was able to resolve this issue and implemented the following code:
my $max_rows = 38;
my $rowcache = [];
my $sum = 0;
if ($fldnames eq "ALL") { $fldnames = join(',', @{ $sth1->{NAME} }); }
my $ins = $dbh2->prepare("insert into $database2.dbo.$tblname2 ($fldnames) values $fldvalues");
my $fetch_tuple_sub = sub { shift(@$rowcache) || shift(@{ $rowcache = $sth1->fetchall_arrayref(undef, $max_rows) }) };
my @tuple_status;
my $rc;
$rc = $ins->execute_for_fetch($fetch_tuple_sub, \@tuple_status);
my @errors = grep { ref $_ } @tuple_status;
The transfer works, but it is still slower than if I were to transfer the data manually through the SQL Server export/import wizard. The issue I notice is that the data flows row by row into the destination, and I was wondering if it is possible to increase the bulk transfer size. It downloads the data extremely fast, but when I combine download and upload the speed decreases dramatically, and it takes up to 10 minutes to transfer a 5,000-row table between servers.
It would be better if you said what your goal was (speed?) rather than asking a specific question on avoiding looping.
For a Perl/DBI way:
Look at DBI's execute_array and execute_for_fetch; however, as you've not told us which DBD you are using, it is impossible to say more. Not all DBDs support bulk insert, and when they don't, DBI emulates it. DBD::Oracle does and DBD::ODBC does (in recent versions; see odbc_array_operations), but in the latter it is off by default.
You didn't mention which version of SQL Server you are using. First, I would look into the "BULK INSERT" support of that version.
You also didn't mention how many rows are involved. I'll assume that they fit into memory, otherwise a bulk insert won't work.
From there it's up to you to translate the output of fetchall_arrayref into the syntax needed for the "BULK INSERT" operation.
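For reference, the same "read a streaming result set from one server and bulk-load it into the other" idea looks like this in Java with Microsoft's mssql-jdbc driver and its SQLServerBulkCopy class. This is shown only as an illustration of the bulk approach; the connection strings and source table name are placeholders, and it assumes the ##population temp table is already visible to the destination connection.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;

public class CrossServerBulkCopy {
    public static void main(String[] args) throws Exception {
        try (Connection src = DriverManager.getConnection(
                     "jdbc:sqlserver://server1;databaseName=db1;integratedSecurity=true");
             Connection dst = DriverManager.getConnection(
                     "jdbc:sqlserver://server2;databaseName=db2;integratedSecurity=true");
             Statement st = src.createStatement();
             // Stream the source rows instead of loading them all into memory.
             ResultSet rows = st.executeQuery(
                     "SELECT someid, Type, anotherid FROM dbo.SourceTable");
             SQLServerBulkCopy bulk = new SQLServerBulkCopy(dst)) {

            bulk.setDestinationTableName("##population");
            // Rows are pushed to the destination in bulk batches, not one by one.
            bulk.writeToServer(rows);
        }
    }
}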