I am querying an MSSQL database table with over 15 million rows and processing all of the returned data in Node.js. With a basic SQL query, that is immediately a problem: the amount of data to hold in memory crashes Node.js.
Alternatively, I gave Node more memory with --max-old-space-size=8000, but that did not seem to help either. I have also tried Node.js streams, though (correct me if I am wrong) streams only seem useful when processing the data chunk by chunk, as opposed to the entire dataset at once.
Considering that I am sorting, grouping, and mapping over the entire dataset, what would be the best way to both make querying 15 million rows faster and use memory efficiently while processing the data?
Currently, all the examples I have seen suggest streaming, but all the streaming examples process the data chunk by chunk. It would be very helpful if you could couple a simple example with your suggestions.
Possible Ideas
If we say streaming is the way to go, is it OK to accumulate the data in memory while streaming it bit by bit, and then process the whole dataset once?
Thanks
In the npm mssql documentation
there is a really nice explanation of streaming without crashing: you should intermittently pause your request.
When streaming large sets of data you want to back-off or chunk the amount of data you're processing to prevent memory exhaustion issues; you can use the Request.pause() function to do this
Here is an example of managing rows in batches of 15:
const sql = require('mssql');

const request = new sql.Request(/* connection pool */);
request.stream = true; // emit rows as events instead of buffering the whole result set

let rowsToProcess = [];

request.on('row', row => {
    rowsToProcess.push(row);
    if (rowsToProcess.length >= 15) {
        request.pause(); // stop row events while this batch is processed
        processRows();
    }
});

request.on('done', () => {
    processRows(); // handle the last, possibly smaller, batch
});

function processRows() {
    // process rows
    rowsToProcess = [];
    request.resume(); // ask the driver for more rows
}

request.query('SELECT * FROM bigTable');
Related
I'm having a hard time solving an issue with RavenDB.
At work we have a process that tries to identify potential duplicates in our database in a specified collection (let's call it the users collection).
That means I'm iterating through the collection, and for each document there is a query that tries to find similar entities. So, as you can imagine, it's quite a long task to run.
My problem is that when the task starts running, RavenDB's memory consumption goes higher and higher; it literally just keeps growing, and it seems to continue until it reaches the system's maximum memory.
But that doesn't really make sense, since I'm only querying: I'm using one single index and the default page size when querying (128).
Has anybody met a similar problem? I really have no idea what is going on in RavenDB, but it looks like a memory leak.
RavenDB version: 3.0.179
When I need to do massive operations on large collections, I follow these steps to prevent memory problems:
I use query streaming to extract all the ids of the documents that I want to process (with a dedicated session).
I open a new session for each id, load the document, and then do what I need. Both steps are sketched below.
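A minimal sketch of those two steps, assuming the RavenDB 3.x .NET client, a hypothetical Users/ByName index, and an IDocumentStore named store:

var ids = new List<string>();

// Step 1: stream only the document ids with a dedicated session.
// Streamed results don't fill the session cache, so memory stays flat.
using (var session = store.OpenSession())
{
    var query = session.Query<User>("Users/ByName"); // hypothetical index
    using (var enumerator = session.Advanced.Stream(query))
    {
        while (enumerator.MoveNext())
            ids.Add(enumerator.Current.Key); // keep just the id
    }
}

// Step 2: a short-lived session per document, so nothing accumulates.
foreach (var id in ids)
{
    using (var session = store.OpenSession())
    {
        var doc = session.Load<User>(id);
        // ... run the similarity check and process the document ...
    }
}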
First, a recommendation: if you don't want duplicates, store them with a well-known ID. For example, suppose you don't want duplicate User objects. You'd store them with an ID that makes them unique:
var user = new User() { Email = "foo@bar.com" };
var id = "Users/" + user.Email; // A well-known ID
dbSession.Store(user, id);
Then, when you want to check for duplicates, just check against the well-known ID:
public string RegisterNewUser(string email)
{
    // Unlike .Query, the .Load call is ACID and never stale.
    var existingUser = dbSession.Load<User>("Users/" + email);
    if (existingUser != null)
    {
        return "Sorry, that email is already taken.";
    }

    // No duplicate exists: store the new user under its well-known ID.
    dbSession.Store(new User { Email = email }, "Users/" + email);
    dbSession.SaveChanges();
    return "Thanks for registering!";
}
If you follow this pattern, you won't have to worry about running complex queries or about stale indexes.
If this scenario can't work for you for some reason, then we can help diagnose your memory issues. But to diagnose that, we'll need to see your code.
Hi, I am new to BigQuery. If I need to fetch a very large set of data, say more than 1 GB, how can I break it into smaller pieces for quicker processing? I will need to process the result and dump it into a file or Elasticsearch, and I need to find an efficient way to handle this. I tried the QueryRequest.setPageSize option, but that doesn't seem to work: I set 100, and it doesn't break on every 100 records. I put this line in to see how many records I get back before I turn to a new page:
result = result.getNextPage();
It displays a seemingly random number of records: sometimes 1000, sometimes 400, etc.
thanks
Not sure if this helps you, but in our project we have something that seems similar: we process lots of data in BigQuery and need to use the final result later (roughly 15 GB for us when compressed).
What we did was first save the results to a table with AllowLargeResults set to true, and then export the result, compressed, into Cloud Storage using the Python API.
It automatically breaks the results into several files.
After that we have a Python script that downloads all the files concurrently, reads through the whole thing, and builds some matrices for us.
I don't quite remember how long it takes to download all the files, I think it's around 10 minutes. I'll try to confirm this one.
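For reference, a rough sketch of that save-then-export flow with the google-cloud-bigquery Python client (the project, dataset, table, and bucket names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
dest = bigquery.TableReference.from_string("my-project.my_dataset.query_result")

# Step 1: run the query and save the large result into a table.
job_config = bigquery.QueryJobConfig(
    destination=dest,
    allow_large_results=True,  # matters for legacy SQL; ignored by standard SQL
    write_disposition="WRITE_TRUNCATE",
)
client.query("SELECT ...", job_config=job_config).result()

# Step 2: export the table to Cloud Storage; the * wildcard makes
# BigQuery shard the export into several compressed files.
extract_config = bigquery.ExtractJobConfig(compression="GZIP")
client.extract_table(
    dest,
    "gs://my-bucket/export/result-*.csv.gz",
    job_config=extract_config,
).result()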
Because of memory limitations, I need to split the result of the Camel sql component (a List<Map<column,value>>) into smaller chunks (a few thousand rows each).
I know about
from(sql:...).split(body()).streaming().to(...)
and I also know about
.split().tokenize("\n", 1000).streaming()
but the latter does not work with List<Map<>> and also returns a String.
Is there an out-of-the-box way to create those chunks? Do I need to add a custom aggregator right after the split? Or is there another way?
Edit
Additional info as requested by soilworker:
At the moment the sql endpoint is configured this way:
SqlEndpoint endpoint = context.getEndpoint("sql:select * from " + lookupTableName + "?dataSource=" + LOOK_UP_DS,
SqlEndpoint.class);
// returns complete result in one list instead of one exchange per line.
endpoint.getConsumerProperties().put("useIterator", false);
// poll interval
endpoint.getConsumerProperties().put("delay", LOOKUP_POLL_INTERVAL);
The route using this should poll once a day (we will add a CronScheduledRoutePolicy soon) and fetch a complete table (a view). All the data is converted to CSV with a custom processor and sent via a custom component to proprietary software. The table has 5 columns (small strings) and around 20M entries.
I don't know whether there is a memory issue, but I do know that 3 GB isn't enough on my local machine. Is there a way to approximate the memory footprint, to know whether a certain amount of RAM would be enough?
thanks in advance
maxMessagesPerPoll will help you get the result in batches.
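A minimal sketch of that (the table, data source, and log endpoint are placeholders; this assumes the consumer's default useIterator=true, so each row arrives as its own exchange):

import org.apache.camel.builder.RouteBuilder;

public class LookupRoute extends RouteBuilder {
    @Override
    public void configure() {
        // maxMessagesPerPoll caps how many rows each poll pulls in,
        // so a 20M-row table is worked off in bounded batches.
        from("sql:select * from lookup_table"
                + "?dataSource=#lookupDs"
                + "&maxMessagesPerPoll=1000")
            .to("log:lookup?groupSize=1000"); // placeholder for the CSV processor
    }
}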
I'm working on a prototype, using RavenDB, for my company to evaluate. We will have many threads inserting thousands of rows every few seconds, and many threads reading at the same time. I've done my first simple insert test, and before going much further, I want to make sure I'm using the recommended way of getting the best performance for RavenDB inserts.
I believe there is a bulk insert option. I haven't investigated that yet, as I'm not sure if that's necessary. I'm using the .NET API, and my code looks like this at the moment:
Debug.WriteLine("Number of Marker objects: {0}", markerList.Count);
StopwatchLogger.ExecuteAndLogPerformance(() =>
{
IDocumentSession ravenSession = GetRavenSession();
markerList.ForEach(marker => ravenSession.Store(marker));
ravenSession.SaveChanges();
}, "Save Marker data in RavenDB");
The StopwatchLogger simply invokes the action while putting a stopwatch around it:
internal static void ExecuteAndLogPerformance(Action action, string descriptionOfAction)
{
Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
action();
stopwatch.Stop();
Debug.WriteLine("{0} -- Processing time: {1} ms", descriptionOfAction, stopwatch.ElapsedMilliseconds);
}
Here is the output from a few runs. Note, I'm writing to a local instance of RavenDB (build 701). I know performance will be worse over the network, but I'm testing locally first.
One run:
Number of Marker objects: 671
Save Marker data in RavenDB -- Processing time: 1308 ms
Another run:
Number of Marker objects: 670
Save Marker data in RavenDB -- Processing time: 1266 ms
Another run:
Number of Marker objects: 667
Save Marker data in RavenDB -- Processing time: 625 ms
Another run:
Number of Marker objects: 639
Save Marker data in RavenDB -- Processing time: 639 ms
Ha. 639 objects in 639 ms. What are the odds of that? Anyway, that's one insert per millisecond, which would be 1000 every second.
The Marker object/document doesn't have much to it. Here is an example of one that has already been saved:
{
"ID": 14740009,
"SubID": "120403041588",
"ReadTime": "2012-04-03T13:51:45.0000000",
"CdsLotOpside": "163325",
"CdsLotBackside": "163325",
"CdteLotOpside": "167762",
"CdteLotBackside": "167762",
"EquipmentID": "VA_B"
}
Is this expected performance?
Is there a better way (best practice) to insert to gain speed?
Are there insert benchmarks available somewhere that I can target?
First, I would make sure that the number of items you save in a single batch doesn't get too big. There is no hard limit; however, a transaction that grows too large hurts performance and will eventually crash. Using a value like 1024 items per batch is safe, but it really depends on the size of your documents.
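A minimal sketch of that batching, assuming an IDocumentStore named store (names are placeholders):

const int batchSize = 1024;
for (int i = 0; i < markerList.Count; i += batchSize)
{
    // One session and one SaveChanges (i.e. one transaction) per batch,
    // so no single transaction grows too large.
    using (IDocumentSession session = store.OpenSession())
    {
        foreach (var marker in markerList.Skip(i).Take(batchSize))
            session.Store(marker);
        session.SaveChanges();
    }
}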
1000 documents per second is much lower than the number you can actually reach with a single instance of RavenDB. You should do inserts in parallel, and you can do some tweaking with config options. For instance, you could increase the values of the settings beginning with Raven/Esent/. It is also a good idea (as in SQL Server) to put the logs and indexes on different hard drives. Depending on your concrete scenario, you may also want to temporarily disable indexing while you're doing the inserts.
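For the parallel part, one possible sketch using the range partitioner from System.Collections.Concurrent (sessions are not thread-safe, so each range gets its own):

Parallel.ForEach(
    Partitioner.Create(0, markerList.Count, 1024), // 1024-item ranges
    range =>
    {
        using (IDocumentSession session = store.OpenSession())
        {
            for (int i = range.Item1; i < range.Item2; i++)
                session.Store(markerList[i]);
            session.SaveChanges(); // one transaction per range, in parallel
        }
    });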
However, in most cases you don't need to care about that. If you need really high insert performance, you can use multiple sharded instances and theoretically get an unlimited number of inserts per second (just add more instances).
Good afternoon,
After computing a rather large vector (a bit shorter than 2^20 elements), I have to store the result in a database.
The script takes about 4 hours to execute with simple code such as:
# Do the processing
myVector <- processData(myData)
# Send everything to the database
lapply(myVector, sendToDB)
What do you think is the most efficient way to do this?
I thought about using a single query to insert multiple records (a multi-row insert), but that simply comes back to chunking the data.
Is there a vectorized function to send the data to the database?
Interestingly, the code takes a huge amount of time before it starts to process the first element of the vector. That is, if I place a browser() call inside sendToDB, it takes 20 minutes before it is reached for the first time (and I mean 20 minutes not counting the previous line that processes the data). So I am wondering what R is doing during this time.
Is there another way to do such an operation in R that I might have missed (parallel processing, maybe)?
Thanks!
PS: here is a skeleton of the sendToDB function:
sendToDB <- function(id, data) {
    channel <- odbcChannel(...)
    query <- paste("INSERT INTO history VALUE(", id, ",\"", data, "\")", sep = "")
    sqlQuery(channel, query)
    odbcClose(channel)
}
That's the idea.
UPDATE
At the moment I am trying out the LOAD DATA INFILE command.
I still have no idea why it takes so long to reach the internal function of the lapply for the first time.
SOLUTION
LOAD DATA INFILE is indeed much quicker. Writing to a file line by line with write is affordable, and write.table is even quicker.
The overhead I was experiencing with lapply came from looping over POSIXct objects. It is much quicker to loop with seq(along.with=myVector) and then index into the data from within the loop, as in the one-liner below.
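In code, that change looks roughly like this (sendToDB as defined above):

# Iterate over plain integer indices instead of POSIXct values,
# and pull each element out of myVector inside the loop.
lapply(seq(along.with = myVector), function(i) sendToDB(i, myVector[[i]]))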
What about writing the data to a file and calling LOAD DATA INFILE? That should at least give you a benchmark. By the way, what kind of DBMS do you use?
Instead of your sendToDB function, you could use sqlSave. Internally it uses a prepared INSERT statement, which should be faster than individual inserts.
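A minimal sketch with RODBC (the DSN and table name are placeholders):

library(RODBC)

channel <- odbcConnect("myDSN")  # hypothetical DSN
df <- data.frame(id = seq_along(myVector), data = myVector)
# fast = TRUE (the default) sends the frame with a parameterized INSERT.
sqlSave(channel, df, tablename = "history", append = TRUE, rownames = FALSE)
odbcClose(channel)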
However, on Windows with MS SQL, I use a separate function which first writes my data frame to a CSV file and then calls the bcp bulk loader. In my case this is a lot faster than sqlSave.
There's a HUGE (relatively speaking) overhead in your sendToDB() function. It has to negotiate an ODBC connection, send a single row of data, and then close the connection, for each and every item in your list. If you are using RODBC, it's more efficient to use sqlSave() to copy an entire data frame over as a table. In my experience, some databases (SQL Server, for example) are still pretty slow with sqlSave() over latent networks. In those cases I export from R into a CSV and use a bulk loader to load the files into the DB. I have an external script set up that I call with a system() call to run the bulk loader. That way the load happens outside of R, but my R script runs the show.
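A rough sketch of that export-and-load pattern (the file path and loader script are placeholders for whatever your DBMS provides):

out <- tempfile(fileext = ".csv")
# write.table is much faster than row-by-row writes.
write.table(df, out, sep = ",", row.names = FALSE, col.names = FALSE)
# Run the bulk loader outside of R; run_bulk_load.sh is a hypothetical
# wrapper around e.g. bcp or LOAD DATA INFILE.
system(paste("./run_bulk_load.sh", out))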