OptaPlanner: What is the allowable size limit?

My problem has a size of 80,000, but I get stuck when I exceed this limit.
Is there a limit on the problem size used in OptaPlanner?
What is this limit?
I get a Java heap space exception when I exceed this limit (80,000).

Some ideas to look into:
1) Give the JVM more memory: -Xmx2G
2) Use a more efficient data structure. 80k instances will easily fit into a small amount of memory. My bet is you have some sort of cross matrix between 2 collections. For example, a distance matrix for 20k VRP locations needs (20k)² = 400m integers (each of which is at least 4 bytes), so it requires almost 2GB of RAM to keep in memory in its most efficient form (an array). Use a profiler such as JProfiler or VisualVM to find out which data structures are taking so much memory.
3) Read the chapter about "planning clone". Sometimes splitting a Job up into a Job and a JobAssignment can save memory, because only the JobAssignment needs to be cloned, while in the other case everything that references Job needs to be planning cloned too (see the sketch after this list).
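A minimal sketch of the Job/JobAssignment split from idea 3. All class and field names here are hypothetical, and the value range provider "employeeRange" is assumed to be declared on the solution class; the point is only that the immutable Job stays a problem fact while the small JobAssignment is the planning entity that gets cloned:

    import org.optaplanner.core.api.domain.entity.PlanningEntity;
    import org.optaplanner.core.api.domain.variable.PlanningVariable;

    // Job is a problem fact: the solver never planning-clones it.
    class Job {
        private final String code;
        private final int durationMinutes;

        Job(String code, int durationMinutes) {
            this.code = code;
            this.durationMinutes = durationMinutes;
        }
        String getCode() { return code; }
        int getDurationMinutes() { return durationMinutes; }
    }

    // Stub value class so the sketch is self-contained.
    class Employee {
        private final String name;
        Employee(String name) { this.name = name; }
    }

    // JobAssignment is the planning entity: only this small object is cloned
    // on every planning clone, not the Job it references.
    @PlanningEntity
    class JobAssignment {
        private Job job;             // shared reference to the un-cloned problem fact

        @PlanningVariable(valueRangeProviderRefs = "employeeRange")
        private Employee employee;   // the only state that changes during solving

        Job getJob() { return job; }
        void setJob(Job job) { this.job = job; }
        Employee getEmployee() { return employee; }
        void setEmployee(Employee employee) { this.employee = employee; }
    }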

Related

Unable to upload data even after partitioning in VoltDB

We are trying to upload 80 GB of data onto 2 host servers, each with 48 GB of RAM (96 GB in total). We have partitioned the table too. But even after partitioning, we are able to upload only up to 10 GB of data. In the VMC interface, we checked the size worksheet. The number of rows in the table is 400,000,000, the table maximum size is 1,053,200,000K, and the minimum size is 98,000,000K. So what is the issue with uploading 80 GB even after partitioning, and what is this table size?
The size worksheet provides minimum and maximum size in memory that the number of rows would take, based on the schema of the table. If you have VARCHAR or VARBINARY columns, then the difference between min and max can be quite substantial, and your actual memory use is usually somewhere in between, but can be difficult to predict because it depends on the actual size of the strings that you load.
But I think the issue is that the minimum size is 98GB according to the worksheet, i.e. the size if every nullable string were null and every non-nullable string were an empty string. Even without taking into account the heap size and any overhead, this is higher than your 96GB capacity.
What is your kfactor setting? If it is 0, there will be only one copy of each record. If it is 1, there will be two copies of each record, so you would really need 196GB minimum in that configuration.
The size per record in RAM depends on the datatypes chosen and if there are any indexes. Also, VARCHAR values longer than 15 characters or 63 bytes are stored in pooled memory which carries more overhead than fixed-width storage, although it can reduce the wasted space if the values are smaller than the maximum size.
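As a rough back-of-the-envelope check (illustrative arithmetic only, not VoltDB's exact memory accounting; the per-row figure is simply the worksheet minimum divided by the row count):

    // Rough capacity estimate; illustrative only, not VoltDB's exact accounting.
    public class VoltCapacityEstimate {
        public static void main(String[] args) {
            long rows = 400_000_000L;      // row count from the size worksheet
            long minBytesPerRow = 245;     // ~98,000,000K / 400M rows (worksheet minimum)
            int kFactor = 1;               // k=1 keeps 2 copies of every record
            long copies = kFactor + 1;

            long minTotalBytes = rows * minBytesPerRow * copies;
            System.out.printf("Minimum in-memory size: %.0f GB%n", minTotalBytes / 1e9);
            // ~196 GB with k=1, already above the 96 GB available across the 2 hosts,
            // before counting heap overhead, indexes, or pooled string storage.
        }
    }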
If you want some advice on how to minimize the per-record size in memory, please share the definition of your table and any indexes, and I might be able to suggest adjustments that could reduce the size.
You can add more nodes to the cluster, or use servers with more RAM to add capacity.
Disclaimer: I work for VoltDB.

Postgres performance improvement and checklist

I'm studying a series of issues related to the performance of my application written in Java, which gets about 100,000 hits per day, with each visit performing on average 5 to 10 reads/writes on the 2 principal database tables (divided equally), each of which has a cardinality of between 1 and 3 million records (I access the DB via Hibernate).
My two main tables store user information (about 60 columns of type varchar, integer, and timestamptz) and the data to be displayed (about 30 columns, mainly varchar, integer, and timestamptz).
The main problem I've encountered, which may have caused a drop in my site's performance (load times over 5 seconds, which obviously does not depend only on database performance), is the fillfactor, which is currently at the default value of 100 (normally used when data does not change).
Obviously the fillfactor is the same on the indexes (there are 10 btree indexes on each of the 2 tables).
Currently on my main tables I perform:
40% select operations
30% update operations
20% insert operations
10% delete operations.
My database is also made up of 40 other tables of minor importance (only 3 others have the same cardinality as the user table).
My questions are:
How do you find the right fillfactor value to set?
What would be a checklist of things to check to improve the performance
of a database of this kind?
The database is on a dedicated server (16GB RAM, 8 cores) and storage is on an SSD (data is backed up every day and moved to other storage).
You have likely hit the "knee" of your memory usage where the entire index of the heavily used tables no longer fits in shared memory, so disk I/O is slowing it down. Confirm by checking if disk I/O is higher than normal. If so, try increasing shared memory (shared_buffers), or if that's already maxed, adjust the system shared memory size or add more system memory so you can bump it higher. You'll also probably have to start adjusting temp buffers, work memory and maintenance memory, and WAL parameters like checkpoint_segments, etc.
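One quick way to confirm whether the working set still fits in shared_buffers is to look at the buffer cache hit ratio; here is a sketch over plain JDBC (connection details are placeholders), using the standard pg_stat_database view:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch: report the shared-buffer cache hit ratio for the current database.
    // A ratio well below ~0.99 on a read-heavy workload suggests the working set
    // no longer fits in shared_buffers and reads are spilling to disk.
    public class CacheHitRatio {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/mydb", "user", "password");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT blks_hit, blks_read, " +
                     "       blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS hit_ratio " +
                     "FROM pg_stat_database WHERE datname = current_database()")) {
                while (rs.next()) {
                    System.out.printf("hit=%d read=%d ratio=%.4f%n",
                            rs.getLong("blks_hit"), rs.getLong("blks_read"),
                            rs.getDouble("hit_ratio"));
                }
            }
        }
    }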
There are some perf tuning hints on PostgreSQL.org, and Google is your friend.
Edit (to address the first comment): The first symptom of not enough memory is a big drop in performance, everything else being the same. Changing the table fillfactor is not going to make a difference if you hit a knee in memory usage; if anything it will make load times (which I assume means "db reads") worse, because row information will be spread across more pages on disk with blank space in each page, so more disk I/O is needed for table scans. A fillfactor of less than 100% can help with UPDATE operations, but I've found that adjusting WAL parameters can compensate most of the time when using indexes (unless you've already optimized those). Bottom line: you need to profile all the heavy queries using EXPLAIN to see what will help. But at first glance, I'm pretty certain this is a memory issue even with the database on an SSD. We're talking about a lot of random reads and random writes, and many SSDs actually get worse than HDDs after a lot of small random writes.
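For the profiling step, here is a sketch of what that might look like from the Java side (the table, column, and index names are made up): run EXPLAIN (ANALYZE, BUFFERS) on a heavy query, and if you then decide to experiment with fillfactor, lower it on the table and index. Note that existing pages keep their old layout until they are rewritten.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Sketch with made-up table/index names: profile a heavy query, then
    // optionally lower fillfactor to leave free space per page for updates.
    public class ProfileAndFillFactor {
        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/mydb", "user", "password");
                 Statement st = con.createStatement()) {

                // 1. See where the time and the buffer reads actually go.
                try (ResultSet rs = st.executeQuery(
                        "EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM app_user " +
                        "WHERE last_login > now() - interval '1 day'")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));
                    }
                }

                // 2. Optional experiment: leave 10% free space per page for updates.
                st.execute("ALTER TABLE app_user SET (fillfactor = 90)");
                st.execute("ALTER INDEX app_user_last_login_idx SET (fillfactor = 90)");
                // Only newly written pages honor the new setting; a VACUUM FULL or
                // REINDEX (or normal churn) is needed to rewrite existing pages.
            }
        }
    }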

“Programming Pearls”: Searching

We can avoid many calls to the storage allocator by keeping a collection
of available nodes in its own structure.
This idea can be applied to the binary search tree data structure.
The author says: "Allocating the nodes all at once can greatly reduce the
tree's space requirements, which reduces the run time by about a third."
I'm curious how this trick can reduce space requirements. I mean, if we
want to build a binary search tree with four nodes, we need to allocate
memory for those four nodes whether we allocate them one by one or all at
once.
Memory allocators are notoriously bad at allocating very small objects. The situation has somewhat improved in the last decade, but the trick from the book is still relevant.
Most allocators keep additional information with the block that they allocate to you, so that they could free the memory properly. For example, the malloc/free pair of C or new[]/delete[] pair of C++ needs to save the information about the length of the actual memory chunk somewhere; usually, this data ends up in the four bytes just prior to the address returned to you.
This means that at least four additional bytes will be wasted for each allocation. If your tree node takes twelve bytes (four bytes for each of the two pointers plus four bytes for the number), sixteen bytes would be allocated for each node - a 33.3% increase.
The memory allocator needs to perform additional bookkeeping as well: every time a chunk is taken from the heap, the allocator must account for it.
Finally, the more memory your tree uses, the less is the chance that the adjacent node would be fetched in the cache when the current node is processed, because of the distance in memory to the next node.
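Here is a rough Java analogue of the book's trick (the book's example is in C, where the overhead is the allocator's per-block header; in Java the analogous cost is the per-object header). Allocate all the nodes up front as parallel arrays, i.e. an index-based pool, so there is one array header instead of one header per node, and the nodes sit contiguously in memory, which also helps the cache point above.

    // Index-based node pool: "allocating a node" is just bumping a counter.
    public class PooledBst {
        private final int[] key;
        private final int[] left;   // index of left child, -1 if none
        private final int[] right;  // index of right child, -1 if none
        private int size = 0;
        private int root = -1;

        public PooledBst(int capacity) {
            key = new int[capacity];
            left = new int[capacity];
            right = new int[capacity];
        }

        public void insert(int k) {
            int node = size++;      // take the next node from the pre-allocated pool
            key[node] = k;
            left[node] = -1;
            right[node] = -1;
            if (root == -1) { root = node; return; }
            int cur = root;
            while (true) {
                if (k < key[cur]) {
                    if (left[cur] == -1) { left[cur] = node; return; }
                    cur = left[cur];
                } else {
                    if (right[cur] == -1) { right[cur] = node; return; }
                    cur = right[cur];
                }
            }
        }

        public boolean contains(int k) {
            int cur = root;
            while (cur != -1) {
                if (k == key[cur]) return true;
                cur = k < key[cur] ? left[cur] : right[cur];
            }
            return false;
        }
    }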
This sort of relates to how Strings are handled in Java. When you concatenate to a string, you are actually using 3 string objects: the old string, the new segment, and the new result. Eventually the garbage collector tidies up, but in this situation (my string example and your procedural binary search) you are growing memory in a wasteful manner. At least that's how I understand it.

SSIS crash after a few records

I have an SSIS package which is supposed to take 100,000 records, loop over them, and for each one save the details to a few tables.
It works fine until it reaches somewhere near 3,000 records, then Visual Studio crashes. At that point devenv.exe was using about 500MB and only 3,000 rows had been processed.
I'm sure the problem is not with a specific record, because it always happens on a different 3K of records.
I have a good computer with 2 GB of RAM available.
I'm using SSIS 2008.
Any idea what might be the issue?
Thanks.
Try increasing the default buffer size on your data flow tasks.
Example given here: http://www.mssqltips.com/sqlservertip/1867/sql-server-integration-services-ssis-performance-best-practices/
Best Practice #7 - DefaultBufferMaxSize and DefaultBufferMaxRows
As I said in the "Best Practices #6", the execution tree creates
buffers for storing incoming rows and performing transformations. So
how many buffers does it create? How many rows fit into a single
buffer? How does it impact performance?
The number of buffers created depends on how many rows fit into a
buffer, and how many rows fit into a buffer depends on a few other
factors. The first consideration is the estimated row size, which is
the sum of the maximum sizes of all the columns from the incoming
records. The second consideration is the DefaultBufferMaxSize property
of the data flow task. This property specifies the default maximum
size of a buffer. The default value is 10 MB and its upper and lower
boundaries are constrained by two internal properties of SSIS which
are MaxBufferSize (100MB) and MinBufferSize (64 KB). It means the size
of a buffer can be as small as 64 KB and as large as 100 MB. The third
factor is DefaultBufferMaxRows, which is again a property of the data
flow task that specifies the default number of rows in a buffer. Its
default value is 10000.
Although SSIS does a good job of tuning these properties in order
to create an optimum number of buffers, if the size exceeds
DefaultBufferMaxSize then it reduces the rows in the buffer. For
better buffer performance you can do two things. First you can remove
unwanted columns from the source and set data type in each column
appropriately, especially if your source is flat file. This will
enable you to accommodate as many rows as possible in the buffer.
Second, if your system has sufficient memory available, you can tune
these properties to have a small number of large buffers, which could
improve performance. Beware: if you change the values of these
properties to a point where page spooling (see Best Practices #8)
begins, it adversely impacts performance. So before you set a value
for these properties, first test thoroughly in your environment and
set the values appropriately.
You can enable logging of the BufferSizeTuning event to learn how many
rows a buffer contains, and you can monitor the "Buffers spooled"
performance counter to see if SSIS has begun page spooling. I
will talk more about event logging and performance counters in the
next tips in this series.
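To make the excerpt's arithmetic concrete, here is a small sketch (not SSIS's exact algorithm, just the rule of thumb described above): rows per buffer are capped by DefaultBufferMaxRows, and reduced further when the estimated row size times that many rows would exceed DefaultBufferMaxSize.

    // Back-of-the-envelope buffer sizing, following the rule of thumb above.
    public class BufferEstimate {
        static long rowsPerBuffer(long estimatedRowSizeBytes,
                                  long defaultBufferMaxRows,
                                  long defaultBufferMaxSizeBytes) {
            long bySize = defaultBufferMaxSizeBytes / estimatedRowSizeBytes;
            return Math.min(defaultBufferMaxRows, bySize);
        }

        public static void main(String[] args) {
            long maxRows = 10_000;               // DefaultBufferMaxRows default
            long maxSize = 10L * 1024 * 1024;    // DefaultBufferMaxSize default: 10 MB

            // Narrow rows (500 bytes): the 10,000-row cap wins.
            System.out.println(rowsPerBuffer(500, maxRows, maxSize));    // 10000
            // Wide rows (2,000 bytes): the 10 MB cap wins, so fewer rows per buffer.
            System.out.println(rowsPerBuffer(2_000, maxRows, maxSize));  // 5242
        }
    }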

Is there a practical limit to the number of elements in a sorted set in redis?

I'm currently migrating some data to Redis and I'm considering using a sorted set to store approximately 1.4e6 items (with associated scores/counts). Is this number of items in a set likely to exceed a practical limit, making it too painful to use the set? I plan on running 64 bit redis, so available memory for the data should not be a problem. Does anyone have experience with a sorted set this size? If so, how are your insertion and query times for the set?
It depends what you want to do with the set. The simple operations are mostly O(log n) which means that they take only twice as long for a million item set as they do for a thousand item set. Unless you have something seriously broken in your config like a memory limit smaller than the set, performance shouldn't be a problem.
Where you need to be careful is with operations on multiple sets, particularly union - that will take a thousand times longer for the million item set. In practical terms this isn't necessarily a problem though - either it will be fast enough for your purposes anyway (redis has commands documented as too slow for production use that are still best measured in milliseconds) or you can adjust the order of operations to avoid running union on really large sets.
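For a feel of the access pattern, here is a sketch using the Jedis client (the client choice and key name are assumptions, not from the question). ZADD and ZINCRBY are the O(log n) per-member operations discussed above, while something like ZUNIONSTORE over sets this size is the kind of O(n) call to be careful with.

    import redis.clients.jedis.Jedis;

    // Sketch using the Jedis client (an assumption; any Redis client works).
    public class SortedSetSketch {
        public static void main(String[] args) {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                String key = "item:counts";              // hypothetical key name

                jedis.zadd(key, 42.0, "item-1");         // O(log n) insert/update
                jedis.zincrby(key, 1.0, "item-1");       // O(log n) score increment

                // O(log n + m), where m is the number of members returned.
                System.out.println(jedis.zrevrangeWithScores(key, 0, 9)); // top 10 by score
                System.out.println(jedis.zcard(key));    // O(1) cardinality
            }
        }
    }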
Our site has a sorted set with about 2 million items (email addresses) with integer scores, and it takes up about 320MB of memory.