I'm using Yii2 and my database server is MySQL. I need to scan every row of a whole db table searching for occurrences of some text.
This is what I want to do, but because of the large number of records I'm not sure whether running it like this will make the server run out of memory or trigger a "MySQL server has gone away" error:
$rows = Posts::find()->select('content')->all();
foreach ($rows as $post) {
    // do some regex on $post['content']; no need to save it back to the table.
}
It's a live server with a large database. I must do it on the fly; I can't take the server down for a backup and restore!
Would this work? Is there any better way to do this?
The following two subsections of the Yii2 Guide's Accessing Data page address your issue:
Retrieving Data in Arrays
use yii\helpers\ArrayHelper;

$contents = ArrayHelper::getColumn(
    Post::find()->asArray()->all(),
    'content'
);

foreach ($contents as $content) {
    // run your regex against $content
}
Retrieving Data in Batches
// fetch 10 rows at a time
foreach (Posts::find()->select('content')->each(10) as $post) {
    // ...
}
Both approaches reduce memory usage compared to hydrating full Active Record objects, and the batched query additionally avoids holding the whole result set in PHP at once.
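For a table this size, a minimal sketch of the batched approach applied to your model (the batch size of 100 and the regex are placeholders, not recommendations):

// Process the content column in batches of 100 rows; nothing is written back.
foreach (Posts::find()->select(['content'])->asArray()->batch(100) as $posts) {
    foreach ($posts as $post) {
        if (preg_match('/some text/i', $post['content'])) { // placeholder regex
            // record the match wherever you need it
        }
    }
}

Note that with MySQL the PDO driver buffers the whole result set on the client by default, so if memory still climbs you may need to disable buffered queries (PDO::MYSQL_ATTR_USE_BUFFERED_QUERY) on the connection that runs the scan.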
Related
I am fetching paginated data from BigQuery; since the data is huge, it takes a lot of time to process.
while (results.hasNextPage()) {
    results = results.getNextPage();
    count += results.getValues().spliterator().getExactSizeIfKnown();
    results.getValues().forEach(row -> {
        // Some operations.
    });
    logger.info("Grouping completed in iteration {}. Progress: {} / {}", i, count, results.getTotalRows());
    i++;
}
I examined my program with VisualVM and realized that the majority of the time is spent on the results.getNextPage() line, i.e. on fetching the next page of data. Is there any way to make it parallel, meaning fetching each batch of data (20K rows in my case) in a different thread? I am using the Java client com.google.cloud.bigquery.
Each query writes to a destination table. If no destination table is provided, the BigQuery API automatically populates the destination table property with a reference to a temporary anonymous table.
Having that table, you can use the tabledata.list API call to get the data from it. Among the optional parameters you will see startIndex, which you can set to whatever offset you want and use in your pagination script.
You can run parallel API calls using different offsets, which will speed up the retrieval.
You can refer to the documentation on how to page through results using the API.
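A minimal sketch of that approach with the same Java client (the dataset/table names, total row count, and thread count are placeholders; in practice you would take them from your job's destination table and results.getTotalRows()):

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQuery.TableDataListOption;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableResult;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelTableDataFetch {

    public static void main(String[] args) throws Exception {
        long pageSize = 20_000L;          // your batch size
        long totalRows = 1_000_000L;      // placeholder, e.g. results.getTotalRows()
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        // Placeholder reference to the query's (temporary) destination table.
        TableId destination = TableId.of("my_dataset", "my_destination_table");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> futures = new ArrayList<>();

        for (long offset = 0; offset < totalRows; offset += pageSize) {
            final long startIndex = offset;
            Callable<Long> task = () -> {
                // Each task reads its own slice of the destination table.
                TableResult page = bigquery.listTableData(
                        destination,
                        TableDataListOption.startIndex(startIndex),
                        TableDataListOption.pageSize(pageSize));
                long processed = 0;
                for (FieldValueList row : page.getValues()) {
                    // Some operations.
                    processed++;
                }
                return processed;
            };
            futures.add(pool.submit(task));
        }

        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        pool.shutdown();
        System.out.println("Processed rows: " + total);
    }
}

Each task reads an independent slice of the destination table via tabledata.list, so the pages can be fetched concurrently instead of walking getNextPage() serially.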
I am querying an MSSQL database table with over 15 million rows and processing all of the returned data with Node. With a basic SQL query, that is immediately a problem, since the amount of data to hold in memory will crash Node.js.
Alternatively, I gave Node more memory with --max-old-space-size=8000, but that does not seem to help either. I have also tried Node.js streams, although, unless I am wrong, streams are only useful if I process the data chunk by chunk rather than all at once.
Considering I am sorting, grouping, and mapping over the entire data set, what would be the best way to both make querying 15 million rows faster and use memory efficiently while processing the data?
Currently, all the examples I have seen suggest streaming, but then all streaming examples process data by chunk. It would be very helpful if you could couple a simple example with your suggestions.
Possible Ideas
If we say streaming is the way to go, is it OK to first accumulate the data in memory while streaming it bit by bit, and then process it all in one go?
Thanks
The npm mssql documentation has a really nice explanation of streaming without crashing: you should intermittently pause your request.
When streaming large sets of data you want to back-off or chunk the amount of data you're processing to prevent memory exhaustion issues; you can use the Request.pause() function to do this
Here is an example of managing rows in batches of 15:
let rowsToProcess = [];

request.on('row', row => {
    rowsToProcess.push(row);
    if (rowsToProcess.length >= 15) {
        request.pause();
        processRows();
    }
});

request.on('done', () => {
    processRows();
});

function processRows() {
    // process rows
    rowsToProcess = [];
    request.resume();
}
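For completeness, a minimal setup sketch (connection settings and the query are placeholders) showing how the request above could be created with streaming enabled:

const sql = require('mssql');

sql.connect({ /* server, database, user, password */ })
    .then(pool => {
        const request = new sql.Request(pool);
        request.stream = true; // enable row-by-row streaming

        // attach the 'row' and 'done' handlers shown above here

        request.on('error', err => console.error(err));
        request.query('SELECT * FROM BigTable'); // placeholder query
    });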
I'm having a hard time solving an issue with RavenDB.
At my work we have a process that tries to identify potential duplicates in a specified collection in our database (let's call it the users collection).
That means I'm iterating through the collection, and for each document a query tries to find similar entities. As you can imagine, it's quite a long task to run.
My problem is that when the task starts running, RavenDB's memory consumption goes higher and higher; it literally just keeps growing, and it seems to continue until it reaches the system's maximum memory.
It doesn't really make sense, since I'm only querying: I use one single index and the default page size (128).
Has anybody met a similar problem? I really have no idea what is going on in RavenDB, but it looks like a memory leak.
RavenDB version: 3.0.179
When I need to do massive operations on large collections, I follow these steps to prevent memory usage problems:
I use query streaming to extract all the IDs of the documents I want to process (with a dedicated session).
I open a new session for each ID, load the document, and then do what I need (see the sketch below).
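A minimal sketch of those two steps with the RavenDB 3.x client (the User class and the index name are placeholders):

var ids = new List<string>();

// Step 1: stream only the document keys, using a dedicated session.
using (var session = store.OpenSession())
{
    var query = session.Query<User>("Users/ByEmail"); // placeholder index name
    using (var enumerator = session.Advanced.Stream(query))
    {
        while (enumerator.MoveNext())
        {
            ids.Add(enumerator.Current.Key); // collect ids only, no processing here
        }
    }
}

// Step 2: load and process each document in its own short-lived session.
foreach (var id in ids)
{
    using (var session = store.OpenSession())
    {
        var user = session.Load<User>(id);
        // run the duplicate-detection logic for this document
    }
}

Streaming results are not tracked by the session, and opening a fresh session per document keeps each unit of work small.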
First, a recommendation: if you don't want duplicates, store them with a well-known ID. For example, suppose you don't want duplicate User objects. You'd store them with an ID that makes them unique:
var user = new User() { Email = "foo@bar.com" };
var id = "Users/" + user.Email; // A well-known ID
dbSession.Store(user, id);
Then, when you want to check for duplicates, just check against the well-known ID:
public string RegisterNewUser(string email)
{
    // Unlike .Query, the .Load call is ACID and never stale.
    var existingUser = dbSession.Load<User>("Users/" + email);
    if (existingUser != null)
    {
        return "Sorry, that email is already taken.";
    }

    // Otherwise store the new user under its well-known ID, as shown above.
    dbSession.Store(new User { Email = email }, "Users/" + email);
    return "Registered.";
}
If you follow this pattern, you won't have to run complex queries or worry about stale indexes.
If this scenario can't work for you for some reason, then we can help diagnose your memory issues. But to diagnose that, we'll need to see your code.
I need to get a large amount of data from a remote database. The idea is to do a sort of pagination, like this:
1. Select a first block of data
SELECT * FROM TABLE LIMIT 1,10000
2. Process that block
while ($row = mysql_fetch_array($result)) {
    // do something with $row
}
3. Get the next block
and so on.
Assuming 10000 is an acceptable size for my system, let us suppose I have 30000 records to get: I perform 3 calls to the remote system.
But my question is: when a SELECT is executed, is the result set transmitted and stored somewhere locally, so that each fetch is local, or is the result set kept on the remote system with records coming over one by one on every fetch? Because if the second scenario is the real one, I don't perform 3 calls but 30000, and that is not what I want.
I hope I explained myself clearly. Thanks for the help, bye.
First, it's highly recommended to use MySQLi or PDO instead of the deprecated mysql_* functions:
http://php.net/manual/en/mysqlinfo.api.choosing.php
By default with the mysql and mysqli extensions, the entire result set is loaded into PHP's memory when executing the query, but this can be changed to load results on demand as rows are retrieved if needed or desired.
mysql
mysql_query() buffers the entire result set in PHP's memory
mysql_unbuffered_query() only retrieves data from the database as rows are requested
mysqli
mysqli::query()
The $resultmode parameter determines behaviour.
The default value of MYSQLI_STORE_RESULT causes the entire result set to be transferred to PHP's memory, but using MYSQLI_USE_RESULT will cause the rows to be retrieved as they are requested.
PDO
PDO's MySQL driver also buffers the entire result set in PHP's memory by default when a query is executed with PDO::query() or PDO::prepare().
To retrieve all of the data from the result set into a PHP array, you can use PDOStatement::fetchAll().
To load rows on demand instead, set the PDO::MYSQL_ATTR_USE_BUFFERED_QUERY attribute to false and retrieve results one at a time with PDOStatement::fetch().
It's probably best to stick with the default behaviour and benchmark any changes to determine if they actually have any positive results; the overhead of transferring results individually may be minor, and other factors may be more important in determining the optimal method.
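For illustration, a minimal sketch of fetching rows on demand with both mysqli and PDO (connection settings and the query are placeholders):

// mysqli: MYSQLI_USE_RESULT streams rows instead of buffering the whole set.
$mysqli = new mysqli('localhost', 'user', 'password', 'mydb');
$result = $mysqli->query('SELECT id, content FROM posts', MYSQLI_USE_RESULT);
while ($row = $result->fetch_assoc()) {
    // process $row
}
$result->free();

// PDO: disable buffered queries, then fetch one row at a time.
$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'password');
$pdo->setAttribute(PDO::MYSQL_ATTR_USE_BUFFERED_QUERY, false);
$stmt = $pdo->query('SELECT id, content FROM posts');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // process $row
}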
You would be performing 3 calls, not 30,000. That's for sure.
Each batch of 10,000 results is produced on the server (one per query, so 3 queries in total). Your while loop iterates over a set of data that has already been returned by MySQL; that's why you don't end up with 30,000 queries.
That is assuming you would have something like this:
$res = mysql_query(...);
while ($row = mysql_fetch_array($res)) {
    // do something with $row
}
Anything you do inside the while loop by making use of $row has to do with already-fetched data from your initial query.
Hope this answers your question.
According to the documentation here, all the data is fetched into PHP first, and then you go through it.
from the page:
Returns an array of strings that corresponds to the fetched row, or FALSE if there are no more rows.
In addition, it seems this function is deprecated, so you might want to use one of the alternatives suggested there.
I'm doing some tests with NHibernate and I'm modifying batch_size to get bulk inserts.
I'm using MSSQL 2005 and the Northwind DB.
I create 1000 objects and insert them into the database. I've changed the value of batch_size from 5 to 100 but found no change in performance; I'm getting around 300 ms either way. Using SQL Profiler, I see 1000 SQL INSERT statements on the server side. Please help.
app.config
<property name="adonet.batch_size">10</property>
Code
public bool MyTestAddition(IList<Supplier> SupplierList)
{
    var SupplierList_ = SupplierList;
    var stopwatch = new Stopwatch();
    stopwatch.Start();

    using (ISession session = dataManager.OpenSession())
    {
        int counter = 0;
        using (ITransaction transaction = session.BeginTransaction())
        {
            foreach (var supplier in SupplierList_)
            {
                session.Save(supplier);
            }
            transaction.Commit();
        }
    }

    stopwatch.Stop();
    Console.WriteLine(string.Format("{0} milliseconds. {1} items added",
        stopwatch.ElapsedMilliseconds,
        SupplierList_.Count));
    return true;
}
The following is a great post on batch processing in Hibernate, which is what NHibernate is based upon and closely follows:
http://relation.to/Bloggers/BatchProcessingInHibernate
As you can see, the suggested actions are to set a reasonable batch size in the config, which you have done, but also to call session.Flush() and session.Clear() every 20 or so records.
We have employed this method ourselves and can now create and save 1000+ objects in seconds.
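A minimal sketch of that pattern applied to the loop from the question (the batch size of 20 is just the commonly suggested starting point; align it with adonet.batch_size):

using (ISession session = dataManager.OpenSession())
using (ITransaction transaction = session.BeginTransaction())
{
    int counter = 0;
    foreach (var supplier in SupplierList_)
    {
        session.Save(supplier);
        if (++counter % 20 == 0)
        {
            // Send the pending inserts to the database and evict the saved
            // entities so the first-level cache does not keep growing.
            session.Flush();
            session.Clear();
        }
    }
    transaction.Commit();
}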
You could load the target type into a List and then use System.Data.SqlClient.SqlBulkCopy to bulk-copy the data into the target table.
This would allow processing of much greater volumes.
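A minimal sketch of that approach, assuming the suppliers are first copied into a DataTable whose columns match the target table (the connection string, table, and column names are placeholders):

using System.Data;
using System.Data.SqlClient;

var table = new DataTable();
table.Columns.Add("CompanyName", typeof(string)); // placeholder column
foreach (var supplier in suppliers)
{
    table.Rows.Add(supplier.CompanyName);
}

using (var bulkCopy = new SqlBulkCopy(connectionString))
{
    bulkCopy.DestinationTableName = "dbo.Suppliers"; // placeholder table name
    bulkCopy.BatchSize = 1000;
    bulkCopy.WriteToServer(table);
}

SqlBulkCopy bypasses NHibernate entirely, so it won't run your mappings or interceptors, but it scales to much larger volumes.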
According to this nhusers post, seeing 1000 inserts on the SQL Server side should not really matter, because the optimization is done at a different level. If you really see no gain in performance, trying the most recent version of NHibernate might help you track down a resolution.
I have tried similar stuff with NHibernate and never really got great performance. I remember settling on a flush every 10 entries and a commit every 50 entries to get a performance boost, since the process got steadily slower with each insertion. It really depends on the size of the objects, so you could play around with those numbers; maybe you can squeeze some performance out of it.
A call to ITransaction.Commit will Flush your Session, effectively writing your changes to the database. You are calling Commit after every Save, so there will be an INSERT for each Supplier.
I'd try to call Commit after every 10 Suppliers or so, or maybe even at the end of your 1000 Suppliers!