Magento - fetching products and looping through them gets slower and slower - sql

I'm fetching around 6k articles from the Magento database. Traversing them is very fast at the beginning (effectively 0 seconds, just some milliseconds per iteration) but gets slower and slower. The whole loop takes about 8 hours to run, and towards the end each iteration of the foreach takes 16-20 seconds! It seems like MySQL is getting slower and slower towards the end, but I cannot explain why.
$product = Mage::getModel('catalog/product');
$data = $product->getCollection()
    ->addAttributeToSelect('*')
    ->addAttributeToFilter('type_id', 'simple');
$num_products = $product->getCollection()->count();
echo 'exporting '.$num_products."\n";
print "starting export\n";
$start_time = time();
foreach ($data as $tProduct) {
    // doing some stuff, no sql !
}
Does anyone know why it is so slow? Would it be faster to fetch just the IDs and then load each product one by one?
The script running this code has a constant memory footprint of:
VIRT RES SHR S CPU% MEM%
680M 504M 8832 R 90.0 6.3
Regards, Alex

Oh well, shot-in-the-dark time. If you are running Magento 1.4.x.x prior to 1.4.2.0, you have a memory leak that displays exactly this symptom as it eats up more and more memory, eventually leading to memory exhaustion. Profile exports that took 3-8 minutes under 1.3.x.x will now take 2-5 hours, if they don't throw an error before completion. Another symptom is exports that fail without finishing and without giving any indication of why: the window freezes, or gives some sort of funky completion message with no output.
The Array Of Death(tm) has been noted and here's the official repair in the new version. Maybe Data Will Flow again!
Excerpt from 1.4.2.0rc1 /lib/Varien/Db/Select.php that has been patched for memory leak
public function __construct(Zend_Db_Adapter_Abstract $adapter)
{
    parent::__construct($adapter);
    if (!in_array(self::STRAIGHT_JOIN_ON, self::$_joinTypes)) {
        self::$_joinTypes[] = self::STRAIGHT_JOIN_ON;
        self::$_partsInit = array(self::STRAIGHT_JOIN => false) + self::$_partsInit;
    }
}
Excerpt from 1.4.1.1 /lib/Varien/Db/Select.php with memory leak
public function __construct(Zend_Db_Adapter_Abstract $adapter)
{
    parent::__construct($adapter);
    self::$_joinTypes[] = self::STRAIGHT_JOIN_ON;
    self::$_partsInit = array(self::STRAIGHT_JOIN => false) + self::$_partsInit;
}
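As for the question's second idea (fetching only the IDs and loading each product one by one): that keeps memory flat but costs one full load per product. A common middle ground in Magento 1 is to page through the collection in batches and clear it between pages; below is a minimal sketch, where the batch size of 500 is an arbitrary choice:
$collection = Mage::getModel('catalog/product')->getCollection()
    ->addAttributeToSelect('*')
    ->addAttributeToFilter('type_id', 'simple')
    ->setPageSize(500); // arbitrary batch size

$lastPage = $collection->getLastPageNumber();
for ($page = 1; $page <= $lastPage; $page++) {
    $collection->setCurPage($page)->load();
    foreach ($collection as $tProduct) {
        // doing some stuff, no sql !
    }
    $collection->clear(); // drop the loaded items so memory stays flat
}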

Related

Why don't all the shell processes in my promises (start blocks) run? (Is this a bug?)

I want to run multiple shell processes, but when I try to run more than 63, they hang. When I reduce max_threads in the thread pool to n, it hangs after running the nth shell command.
As you can see in the code below, the problem is not in start blocks per se, but in start blocks that contain the shell command:
#!/bin/env perl6
my $*SCHEDULER = ThreadPoolScheduler.new( max_threads => 2 );
my @processes;

# The Promises generated by this loop work as expected when awaited
for @*ARGS -> $item {
    @processes.append(
        start { say "Planning on processing $item" }
    );
}

# The nth Promise generated by the following loop hangs when awaited (where n = max_threads)
for @*ARGS -> $item {
    @processes.append(
        start { shell "echo 'processing $item'" }
    );
}

await(@processes);
Running ./process_items foo bar baz gives the following output, hanging after processing bar, which is just after the nth (here 2nd) thread has run using shell:
Planning on processing foo
Planning on processing bar
Planning on processing baz
processing foo
processing bar
What am I doing wrong? Or is this a bug?
Perl 6 distributions tested on CentOS 7:
Rakudo Star 2018.06
Rakudo Star 2018.10
Rakudo Star 2019.03-RC2
Rakudo Star 2019.03
With Rakudo Star 2019.03-RC2, use v6.c versus use v6.d did not make any difference.
The shell and run subs use Proc, which is implemented in terms of Proc::Async. This uses the thread pool internally. Filling the pool up with blocking calls to shell exhausts it, so it cannot process events, resulting in the hang.
It would be far better to use Proc::Async directly for this task. The approach of using shell and a load of real threads won't scale well; every OS thread has memory overhead, GC overhead, and so forth. Since spawning a bunch of child processes is not CPU-bound, this is rather wasteful; in reality, just one or two real threads are needed. So, in this case, perhaps the implementation pushing back when you do something inefficient isn't the worst thing.
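For illustration, here is a minimal sketch of the question's second loop rewritten with Proc::Async. The .start call launches the child and returns a Promise without tying up a pool thread for the lifetime of the process; untapped stdout/stderr are inherited from the parent, so the echo output still appears:
my @processes = @*ARGS.map: -> $item {
    # .start returns a Promise that is kept when the child exits;
    # no pool thread blocks while the process runs
    Proc::Async.new('echo', "processing $item").start
};
await @processes;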
I notice that one of the reasons for using shell and the thread pool was to try to limit the number of concurrent processes. But this isn't a very reliable way to do it: just because the current thread pool implementation sets a default maximum of 64 threads does not mean it always will.
Here's an example of a parallel test runner that runs up to 4 processes at once, collects their output, and envelopes it. It's a little more than you perhaps need, but it nicely illustrates the shape of the overall solution:
my $degree = 4;
my @tests = dir('t').grep(/\.t$/);
react {
    sub run-one {
        my $test = @tests.shift // return;
        my $proc = Proc::Async.new('perl6', '-Ilib', $test);
        my @output = "FILE: $test";
        whenever $proc.stdout.lines {
            push @output, "OUT: $_";
        }
        whenever $proc.stderr.lines {
            push @output, "ERR: $_";
        }
        my $finished = $proc.start;
        whenever $finished {
            push @output, "EXIT: {.exitcode}";
            say @output.join("\n");
            run-one();
        }
    }
    run-one for 1..$degree;
}
The key thing here is the call to run-one when a process ends, which means that you always replace an exited process with a new one, maintaining up to 4 processes running at a time so long as there are things to do. The react block naturally ends when all processes have completed, because the number of events subscribed to drops to zero.

NAudio WaveOut.Init takes a very long time (sometimes)

I've been using NAudio recently and for the most part I'm very happy with the library. However, I've been experiencing a very annoying intermittent issue that causes the Init method to take a very long time to execute (over 30 seconds).
Here's the code I'm using:
var waveFormat = WaveFormat.CreateIeeeFloatWaveFormat(44100, 2);
_wavePlayer = new WaveOutEvent();
_mixingSampleProvider = new MixingSampleProvider(waveFormat)
{
    ReadFully = true
};
_wavePlayer.Init(_mixingSampleProvider); // program halts here
_wavePlayer.Play();
I've also tried using WaveOut instead of WaveOutEvent and I get the same problem.
I can reproduce this issue about 1 in every 3 times or so. So it's not completely easy to reproduce, but it happens often enough to be very, very annoying.

Why is the first SaveChanges slower than following calls?

I'm investigating some performance problems in an experimental scheduling application I'm working on. I found that calls to session.SaveChanges() were pretty slow, so I wrote a simple test.
Can you explain why the first iteration of the loop takes 200 ms and subsequent loops 1-2 ms? How can I leverage this in my application? (I don't mind the first call being this slow if all subsequent calls are quick.)
private void StoreDtos()
{
    for (int i = 0; i < 3; i++)
    {
        StoreNewSchedule();
    }
}

private void StoreNewSchedule()
{
    var sw = Stopwatch.StartNew();
    using (var session = DocumentStore.OpenSession())
    {
        session.Store(NewSchedule());
        session.SaveChanges();
    }
    Console.WriteLine("Persisting schedule took {0} ms.",
        sw.ElapsedMilliseconds);
}
Output is:
Persisting schedule took 189 ms. // first time
Persisting schedule took 2 ms. // second time
Persisting schedule took 1 ms. // ... etc
The above is for an in-memory database. Using an HTTP connection to a RavenDB instance (on the same machine), I get similar results; the first call takes noticeably more time:
Persisting schedule took 1116 ms.
Persisting schedule took 37 ms.
Persisting schedule took 14 ms.
On Github: RavenDB 2.0 testcode and RavenDB 2.5 testcode.
The very first time that you call RavenDB, there are several things that have to happen:
We need to prepare the serializers for your entities, which takes time.
We need to create the TCP connection to the server.
On subsequent calls, we can reuse the connection that is already open and the serializers that were already created.
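If the one-time cost matters, one option is to pay it during application startup rather than on the first user action. Here is a minimal sketch, assuming a hypothetical WarmUp() method invoked once after the DocumentStore is created:
// Hypothetical warm-up: issue one cheap, throwaway call so the client
// opens its connection and sets up its plumbing before real traffic.
private void WarmUp()
{
    using (var session = DocumentStore.OpenSession())
    {
        // loading a document that does not exist still forces the
        // connection setup; it simply returns null
        session.Load<object>("warmup/none");
    }
}
Note that per-entity serializer preparation may still happen on first use of each entity type, so this mainly removes the connection cost.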

Pushing a chart to the client using Wt

I am using server push in Wt and I am trying to push a new chart with the following code:
Wt::WApplication::UpdateLock uiLock(app);
if (uiLock) {
    chart_ste = new ScatterPlotExample(this, 10 * asf.get_outputSamplingRate());
    app->triggerUpdate();
}
but it waits for the program to end and only then prints it, whereas the following code in the same program pushes the word "Demokritus" every 0.5 seconds, as it should:
for (int i = 0; i < 10; i++)
{
    boost::this_thread::sleep(boost::posix_time::milliseconds(500));
    Wt::WApplication::UpdateLock uiLock(app);
    if (uiLock) {
        showFileName = new WText(this);
        showFileName->setText(boost::lexical_cast<std::string>("Demokritus"));
        app->triggerUpdate();
    }
}
What might be my mistake?
The documentation for triggerUpdate mentions that "The update is not immediate, and thus changes that happen after this call will equally be pushed to the client." Since updates are not immediate, it could be that the first piece of code keeps trying to push updates as fast as your CPU will allow, so the update never actually goes out: a new update overwrites the last one and it begins waiting again. Try adding boost::this_thread::sleep(boost::posix_time::milliseconds(500)); to the first piece of code to see if that helps.
I once did a project where I needed to update a chart every second with new data, with a setup very similar to yours. I put in the sleep from the start because I did not want my boost thread to use too much CPU.
Also, it is unclear whether the first piece of code is inside a bigger loop; if it is, you probably shouldn't create a new chart every time, but rather create it beforehand and then update it with data. I hope some of this helps.
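Concretely, the suggestion amounts to throttling the pushing thread. A minimal sketch of the first snippet with the sleep added (ScatterPlotExample and asf come from the question's code):
// give the session time to deliver each update before pushing the next one
boost::this_thread::sleep(boost::posix_time::milliseconds(500));

Wt::WApplication::UpdateLock uiLock(app);
if (uiLock) {
    chart_ste = new ScatterPlotExample(this, 10 * asf.get_outputSamplingRate());
    app->triggerUpdate();
}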

Extensions for Computationally-Intensive Cypher queries

As a follow-up to a previous question of mine, I want to find up to 30 of the paths that exist between two given nodes within a depth of 4. Something to the effect of this:
start startnode = node(1), endnode = node(1000)
match startnode-[r:rel_Type*1..4]->endnode
return r
limit 30;
My database contains ~50k nodes and 2M relationships.
Expectedly, the computation time for this query is very, very large; I even ended up with the following GC message in the message.log file: GC Monitor: Application threads blocked for an additional 14813ms [total block time: 182.589s]. This error keeps occurring and blocks all threads for an indefinite period of time. Therefore, I am looking for a way to lower the computational strain of this query on the server by optimizing the query.
Is there any extension I could use to help optimize this query?
Give this one a try:
https://github.com/wfreeman/findpaths
You can query the extension like so:
.../findpathslen/1/1000/4/30
And it will give you a json response with the paths found. Hopefully that helps you.
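For example, assuming the extension is mounted on a local Neo4j server (host, port, and mount point are hypothetical and depend on your server configuration), a call might look like:
curl http://localhost:7474/examples/unmanaged/findpathslen/1/1000/4/30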
The meat of it is here, using the built-in graph algorithm to find paths of a certain length:
@GET
@Path("/findpathslen/{id1}/{id2}/{len}/{count}")
@Produces(Array("application/json"))
def fof(@PathParam("id1") id1: Long, @PathParam("id2") id2: Long, @PathParam("len") len: Int, @PathParam("count") count: Int, @Context db: GraphDatabaseService) = {
  val node1 = db.getNodeById(id1)
  val node2 = db.getNodeById(id2)
  val pathFinder = GraphAlgoFactory.pathsWithLength(Traversal.pathExpanderForAllTypes(Direction.OUTGOING), len)
  val pathIterator = pathFinder.findAllPaths(node1, node2).asScala
  val jsonMap = pathIterator.take(count).map(p => obj(p))
  Response.ok(compact(render(decompose(jsonMap))), MediaType.APPLICATION_JSON).build()
}