Background: I'm trying to load a 3 GB CSV text file (20M rows x 46 columns) into a SQLite table. The import fails with the error: “record 3,493,675 had only 2 fields when table expecting 46 values”. I want to find out whether that record really has only 2 fields (i.e. is corrupted) or whether the problem lies elsewhere (my suspicion).
So I tried to look at the 'offending' record using gawk:
gawk -F, "NF<46 {print FNR,$1,NF}"
but got an error message (on a server running Windows 2008 with 8 processors and 16 GB RAM; I do not have admin privileges):
“grow_top_buffer: iop->buf: can’t allocate 1,073,741,826 bytes of memory (not enough space)”.
I googled this error and saw some posts from 2003 about a bug, but did not find a solution.
So here's the puzzle: I have the same data file on my 4 GB RAM Windows 7 laptop, and the same version of gawk works fine there – it reads the entire 20M records, and all 20M records in the file have the 46 fields required for the table.
I tried several different gawk statements, but they all fail on the server and all work on my PC.
Question: why does gawk hit this buffer memory error on the Windows server?
thanks, peter
If the server machine is running a 32-bit version of Windows and your PC runs a 64-bit version, gawk may be able to allocate more (virtual) memory on your PC than on the server, simply because a 32-bit process cannot address that much memory: the failing 1,073,741,826-byte allocation is already half of the 2 GB of user address space a 32-bit Windows process normally gets.
Regarding your problem, awk should not need much memory to process the file the way you want, since it only holds one record at a time; this looks to me like a gawk bug. Try another version of awk, such as Kernighan's One True Awk.
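For example, the same field-count check written for another awk looks like this (the data.csv name is just a placeholder for your file, and NF != 46 also flags records with too many fields):
awk -F, "NF != 46 {print FNR, NF}" data.csv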
Related
Initially I had 2 sets (tables), each containing 45 GB of data, for a total of 90 GB in 1 namespace (database). I decided to remove 1 set to free up RAM, but after deleting the set it still shows 90 GB; the RAM usage did not change at all. Without restarting the Aerospike server, is there a way to flush the deleted data to free up my RAM?
Thanks in advance!
From Aerospike CE 3.12 onward you should use the truncate command to truncate the data in a namespace, or in a set of a namespace.
The aerospike/delete-set repo is an ancient workaround (it hasn't been updated in 2 years). In the Java client, simply use the AerospikeClient.truncate() method.
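For reference, here is roughly what a truncate looks like when issued from the command line with the asinfo tool; myNamespace and mySet below are placeholders for your own names:
asinfo -v "truncate:namespace=myNamespace;set=mySet"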
I use the sqlcmd utility to import a 7 GB SQL dump file into a remote SQL Server. The command I use is this:
sqlcmd -S IP address -U user -P password -t 0 -d database -i file.sql
After about 20-30 min the server regularly responds with:
Sqlcmd: Error: Scripting error.
Any pointers or advice?
I assume file.sql is just a bunch of INSERT statements. For a large number of rows, I suggest using the BCP command-line utility. It will perform orders of magnitude faster than individual INSERT statements.
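As a rough sketch of a BCP import (this assumes the data can first be exported to a flat file; the table, file, and connection values below are all placeholders):
bcp MyDatabase.dbo.TargetTable in data.csv -S serverIP -U user -P password -c -t, -b 10000
Here -c imports character data, -t sets the field terminator, and -b commits every 10,000 rows in a separate transaction.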
You could also bulk insert data using the T-SQL BULK INSERT command. In that case, the file path needs to be accessible by the database server (i.e. a UNC path, or the file copied to a drive on the server), and the server needs the appropriate permissions on it. See http://msdn.microsoft.com/en-us/library/ms188365.aspx.
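A hedged example of what that might look like (the table name, UNC path, and options are only illustrative):
BULK INSERT dbo.TargetTable FROM '\\fileserver\share\data.csv' WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', BATCHSIZE = 10000, TABLOCK);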
Why not use SSIS? While I am certified as a DBA, I always try to use the right tool for the job.
Here are some reasons to use SSIS.
1 - You can still use fast-load, bulk copy. Make sure you set the batch size.
2 - Error handling is much better.
However, if you are using fast-load, either the whole batch commits or it gets tossed.
If you are using single-record inserts, you can direct each error row to a separate destination.
3 - You can perform transformations on the source data before loading it into the destination.
In short, Extract, Transform, Load (ETL).
4 - SSIS loves memory and buffers. If you want to get really in depth, read some articles from Matt Mason or Brian Knight.
Last but not least, the LAN/WAN is always a factor if the job is not running on the target server with the input file on a local disk.
If you are on the same backbone with a good pipe, things go fast.
In summary, yeah, you can use BCP. It is great for quick little jobs. Anything complicated that needs robust error handling should be done with SSIS.
Good luck,
I am currently trying to import a text file with 180+ million records and about 300+ columns into my SQL Server database. Needless to say, the file is roughly 70 GB. I have been at it for days, and every time I get close, something happens and it craps out on me. I need the quickest and most efficient way to do this import. I have tried the wizard, which should have been the easiest, then I tried just saving it as an SSIS package. I haven't been able to figure out how to do a bulk import with the settings I think would work well. The error I keep getting is 'not enough virtual memory'. I changed my virtual memory to 36 GB; my system has 24 GB of physical memory. Please help me.
If you are using BCP (and you should be for files this large), use a batch size. Otherwise, BCP will attempt to load all records in one transaction.
By command line: bcp -b 1000
By C#:
using (System.Data.SqlClient.SqlBulkCopy bulkCopy =
    new System.Data.SqlClient.SqlBulkCopy(sqlConnection))
{
    bulkCopy.DestinationTableName = destinationTableName;
    bulkCopy.BatchSize = 1000; // 1000 rows
    bulkCopy.WriteToServer(dataTable); // May also pass in DataRow[]
}
Here are the highlights from this MSDN article:
Importing a large data file as a single batch can be problematic, so bcp and BULK INSERT let you import data in a series of batches, each of which is smaller than the data file. Each batch is imported and logged in a separate transaction...
Try reducing the maximum server memory for SQL Server to as low as you can get away with (right-click the SQL instance in Management Studio -> Properties -> Memory).
This may free up enough memory for the OS & SSIS to process such a big text file.
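If you prefer T-SQL over the GUI, the equivalent setting change looks roughly like this (the 4096 MB value is only an example; pick whatever your server can spare):
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4096;
RECONFIGURE;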
I'm assuming the whole process is happening locally on the server.
I had a similar problem with SQL 2012 when trying to import (as a test) around 7 million records into a database. I too ran out of memory and had to cut the bulk import into smaller pieces. The one thing to note is that the import process (no matter which method you use) eats up a ton of memory and won't release that system memory until the server is rebooted. I'm not sure if this is intended behavior by SQL Server, but it's something to note for your project.
Because I was using the SEQUENCE command with this process, I had to put the T-SQL into saved .sql scripts and then run them with SQLCMD in small pieces to lessen the memory overhead.
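For example, each piece can be run with something like this (the server, database, and chunk file names here are placeholders, not what I actually used):
sqlcmd -S myServer -d myDatabase -i import_chunk_01.sql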
You'll have to play around with what works for you, and I highly recommend not running the script all at once.
It's going to be a pain in the ass to break it down into smaller pieces and import them, but in the long run you'll be happier.
I am backing up our servers with rsnapshot on a daily basis.
Everything works fine and I have my data in daily.0, daily.1 ... daily.6
Now I am using rsync to copy the backups from one NAS server to another.
The problem is that with rsync, my 2nd backup server (NAS) ends up with the same data structure, with all of the daily.0 to daily.6 directories.
The big problem is that the NAS treats the data in each daily.N directory as separate physical files (rather than hard links, as on the source), so in the end my data is multiplied by 7.
My question is: is there any way to use rsync and have only single files on my 2nd server, without all of the daily.0 to daily.6 copies, so the Linux system doesn't think I have 6 times more data than I really have?
Rsync should only transfer files that have been modified, so you only have to back up the old data once.
But... (I'm not sure of your OS or environment) you can pass only the most recent file to rsync like this:
latest=`ls -t|head -1` ; rsync $latest backup_location
(my source)
I've got 8 GB of RAM; when I tried it on a 16 GB machine the script ran fine. All it does is create 2 tables and fill them with data: about 12,000 records for each table, with 38 columns in the first table and 18 columns in the second.
The script was generated by SQL Server Management Studio 2012 itself, and it is about 78 MB.
How can I run it on the 8 GB machine without getting an out-of-memory exception?
The script has an INSERT statement for each record.
It seems to be a known issue, and yet Microsoft doesn't have a fix: http://connect.microsoft.com/SQLServer/feedback/details/269566/sql-server-management-studio-cant-handle-large-files
All I did was split the script into smaller pieces, and it worked fine on the 8 GB machine.
If the script runs fine on a 16 GB machine, you can try increasing the virtual memory of your own machine, but that could result in a longer execution time for your script.
Instead, you should consider increasing the RAM of your own machine.
To increase the VM of your machine, go to:
System Properties -> Advanced -> Performance -> Settings -> Advanced -> Virtual Memory -> Change