Connection reset by peer error in MongoDB on bulk insert

I am trying to insert 500 documents by doing a bulk insert in pymongo, and I get this error:
File "/usr/lib64/python2.6/site-packages/pymongo/collection.py", line 306, in insert
continue_on_error, self.__uuid_subtype), safe)
File "/usr/lib64/python2.6/site-packages/pymongo/connection.py", line 748, in _send_message
raise AutoReconnect(str(e))
pymongo.errors.AutoReconnect: [Errno 104] Connection reset by peer
I looked around and found that this happens when the size of the inserted documents exceeds 16 MB, so according to that the size of my 500 documents should be over 16 MB. So I checked the size of the 500 documents (Python dictionaries) like this:
size = 0
for dict in dicts:
    size += dict.__sizeof__()
print size
This gives me 502920, which is about 500 KB, way less than 16 MB. So why do I get this error?
I know I am calculating the size of Python dictionaries, not BSON documents, and MongoDB takes BSON documents, but that can't turn 500 KB into 16+ MB. Moreover, I don't know how to convert a Python dict into a BSON document.
My MongoDB version is 2.0.6 and my pymongo version is 2.2.1.
EDIT
I can do a bulk insert with 150 documents and that works fine, but above 150 documents this error appears.

This bulk insert bug has been resolved, but you may need to update your pymongo version:
pip install --upgrade pymongo

The error occurs because the bulk-inserted documents have an overall size greater than 16 MB.
My method of calculating the size of the dictionaries was wrong.
When I manually inspected each key of the dictionary, I found that one key had a value of about 300 KB. That did make the overall size of the documents in the bulk insert more than 16 MB (500 * 300+ KB > 16 MB). But I still don't know how to calculate the size of a dictionary without manually inspecting it. Can someone please suggest a way?
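For what it's worth, here is a rough sketch of one way to measure the real encoded size, using the bson module that ships with pymongo (it assumes the documents are in a list called dicts, as in the question):

import bson

# Encode each document to BSON (what actually gets sent to the server)
# and sum the encoded lengths.
total_bytes = 0
for doc in dicts:  # 'dicts' is the list of documents from the question
    total_bytes += len(bson.BSON.encode(doc))

print("Total BSON size: %d bytes (%.2f MB)"
      % (total_bytes, total_bytes / (1024.0 * 1024)))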

Just had the same error and got around it by creating my own small batches, like this:
region_list = []
region_counter = 0
write_buffer = 1000

# loop through regions
for region in source_db.region.find({}, region_column):
    region_counter += 1  # up _counter
    region_list.append(region)

    # save bulk if we're at the write buffer
    if region_counter == write_buffer:
        result = user_db.region.insert(region_list)
        region_list = []
        region_counter = 0

# if there is a rest, also save that
if region_counter > 0:
    result = user_db.region.insert(region_list)
Hope this helps
NB: small update: from pymongo 2.6 on, PyMongo auto-splits lists based on the maximum message size: "The insert() method automatically splits large batches of documents into multiple insert messages based on max_message_size"
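If you are on a current driver, a rough sketch of the same idea with the newer pymongo 3.x API looks like this (region_list, user_db and region are carried over from the snippet above; the connection details are assumptions):

from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["user_db"]["region"]

# insert_many() splits the list into appropriately sized batches for you,
# so no manual write_buffer bookkeeping is needed.
result = collection.insert_many(region_list, ordered=False)
print("%d documents inserted" % len(result.inserted_ids))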

Related

Log file size calculated using len(_raw) in Splunk does not come even close to the actual file size on the host?

I am using a Splunk query to calculate the size of logs files sent to Splunk. This is the Splunk query I have used:
index="<my_index>" path="/<my_path>/<my_log_file>"
| eval raw_len=len(_raw)
| eval raw_len_kb = raw_len/1024
| eval raw_len_mb = raw_len/1024/1024
| eval raw_len_gb = raw_len/1024/1024/1024
| stats sum(raw_len) as Bytes sum(raw_len_kb) as KB sum(raw_len_mb) as MB sum(raw_len_gb) as GB by source
| addcoltotals
Splunk reports the size as 17 GB. On the other hand, when I do this on the Unix host:
ls -l /<my_path>/<my_log_file>
the value is just a few MB.
Any idea why there is so much difference?
One should not expect the size of data indexed in Splunk to exactly match the size reported by an OS. This is because Splunk by default removes line ends and because the len function counts characters rather than bytes.
Also, the query shown does not account for multiple hosts sending data to Splunk. There's no time window indicated, so we don't know if the file may have been truncated at some point while Splunk still retains all of the data the file ever held.
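As a small illustration of the character-versus-byte point (a made-up log line; Python used purely for illustration):

# len() on a decoded (unicode) string counts characters, not bytes.
# Multi-byte UTF-8 characters make the on-disk byte count larger than
# the character count, and Splunk also strips the trailing line end.
line = u"r\u00e9sum\u00e9 uploaded"  # "résumé uploaded"
print(len(line))                     # 15 characters
print(len(line.encode("utf-8")))     # 17 bytes on disk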

Redis database gives "Connection refused" error after finishing almost all of a task

I'm trying to parse some data that I have stored in a Redis database (on my local machine, accessed via the default port 6379). Essentially, the task is to iterate over about 10K hash structures in the database, calculate a new field from the fields currently in the hash, then write that new field back to the database so that I don't need to do the calculation again.
My script starts up fine, connects to the database, and makes it through about 9500 of the hashes before crashing with a "redis.exceptions.ConnectionError: Error 111 connecting localhost:6379. Connection refused." error. I've rebooted the EC2 instance I'm running it on several times and every time it crashes in the same place.
Any idea what might be going on? Why would redis work for some of the data set but then crash?
EDIT: Here's the output of the execution. It runs for about 3 and a half minutes before dying.
$ sudo python parser.py
Added 0 out of 10378 to dictionary: 22:48:53
Added 100 out of 10378 to dictionary: 22:48:54
Added 200 out of 10378 to dictionary: 22:48:55
Added 300 out of 10378 to dictionary: 22:48:57
Added 400 out of 10378 to dictionary: 22:48:58
Added 500 out of 10378 to dictionary: 22:49:00
...
Added 9000 out of 10378 to dictionary: 22:51:16
Added 9100 out of 10378 to dictionary: 22:51:30
Added 9200 out of 10378 to dictionary: 22:51:44
Added 9300 out of 10378 to dictionary: 22:52:00
Added 9400 out of 10378 to dictionary: 22:52:15
Added 9500 out of 10378 to dictionary: 22:52:17
Traceback (most recent call last):
File "parser.py", line 180, in <module>
buildDictionary(force=True)
File "parser.py", line 123, in buildDictionary
addPostToDict(postid)
File "parser.py", line 92, in addPostToDict
comments = [contentFromId(commentid) for commentid in commentids]
File "parser.py", line 72, in contentFromId
content = db.hget(contentid, keyword)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 1539, in hget
return self.execute_command('HGET', name, key)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 464, in execute_command
connection.send_command(*args)
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 334, in send_command
self.send_packed_command(self.pack_command(*args))
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 316, in send_packed_command
self.connect()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 253, in connect
raise ConnectionError(self._error_message(e))
redis.exceptions.ConnectionError: Error 111 connecting localhost:6379. Connection refused.
You can set a timeout and a memory limit in Redis to make sure it handles long connections and timeouts.
So the answer turned out to be that my Redis database got too big for my tiny little EC2 instance's memory. As Itamar points out in the comments, Redis stores everything in memory. Once my job added too many items to the database and filled up all available memory, Redis refused all further requests to add to the database.
Things I observed that might help you diagnose this:
Later jobs started taking longer and longer (you can see in the EDIT to the question that it takes 1-2 seconds per 100 jobs at the beginning, but ~15 seconds per 100 jobs at the end)
I lowered the maximum amount of memory that my Redis store could use and it started failing earlier
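If you want to confirm that memory is the problem, a small sketch with redis-py (the maxmemory values are only examples, not recommendations):

import redis

r = redis.StrictRedis(host="localhost", port=6379)

# How much memory is the instance using right now?
print(r.info()["used_memory_human"])

# What limit is configured? 0 means no explicit cap, so Redis keeps
# growing until the instance runs out of memory.
print(r.config_get("maxmemory"))

# Optionally set a cap and an eviction policy (whether evicting old
# keys is acceptable depends on your workload).
r.config_set("maxmemory", "256mb")
r.config_set("maxmemory-policy", "allkeys-lru")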

Sybase update getting slower and slower

I have a big text file, about 4 GB and more than 8 million lines. I'm writing a Perl script to read this file line by line, do some processing, and update the info to Sybase. I do this in batches, 1000 lines per batch per update commit, but here is the problem: at first a batch only takes 10 to 20 seconds, but as processing goes on, updating a batch becomes slower and slower, until a batch takes 3 to 4 minutes. I have no idea why this is happening! Can anybody help me analyse what the cause may be? Thanks in advance.
==> "I'm writing a perl script to read this file line by line, do some processing and update the info to sybase"
Please do the entire processing in one go, i.e. process your source file in one pass: prepare your data structures (hashes, arrays) as required and then start inserting the data into the database.
Please keep the points below in mind while inserting a large amount of data into a database:
1- If each column's data is not too large, then you can also insert the entire data set in one go (you may need a good amount of RAM; I'm not sure about the size, because it depends on the data set you need to process).
2- You should use execute_array of Perl DBI so that you can insert the data in one go.
3- If you do not have sufficient RAM to insert the data in one go, then please divide your data (maybe into 8 parts, 1 million lines each).
4- Also make sure that you prepare the statement only once. On every run you just execute it with a new data set.
5- Set AutoCommit off.
Sample code using execute_array of Perl DBI. I have used this to insert around 10 million rows into MySQL.
Please keep your data in arrays, like this:
@column1_data, @column2_data, @column3_data
print {$logfile_handle} "Total records to insert--" . scalar(@column1_data) . "\n";
print {$logfile_handle} "Inserting data into database\n";

# assumes the handle was created with AutoCommit => 0, hence the explicit commit below
my $sth = $$dbh_ref->prepare("INSERT INTO $tablename (column1,column2,column3) VALUES (?,?,?)")
    or do { print {$logfile_handle} "ERROR- Couldn't prepare statement: " . $$dbh_ref->errstr; exit };

my $tuples = $sth->execute_array(
    { ArrayTupleStatus => \my @tuple_status },
    \@column1_data,
    \@column2_data,
    \@column3_data,
);
$$dbh_ref->do("commit");
print {$logfile_handle} "Data Insertion Completed.\n";

if ($tuples) {
    print {$logfile_handle} "Successfully inserted $tuples records\n";
} else {
    # log each tuple that was not inserted
    for my $tuple (0 .. $#column1_data) {
        my $status = $tuple_status[$tuple];
        $status = [0, "Skipped"] unless defined $status;
        next unless ref $status;
        printf {$logfile_handle} "ERROR- Failed to insert (%s,%s,%s): %s\n",
            $column1_data[$tuple], $column2_data[$tuple], $column3_data[$tuple], $status->[1];
    }
}

[Q]uestion about reading and saving a large txt-file via {RSQLite} line by line into a DB

Since my hardware is very limited (a dual core with 32-bit Win7 and 4 GB of RAM; I need to make the best of it), I am trying to save a large text file (about 1.2 GB) into a DB, which I can then query with SQL-like statements to do some analytics on particular subgroups.
To be honest I'm not familiar with this area, and since I could not find help for my issues via googling, I'll just quickly show what I came up with and how I thought things should look:
First I check how many columns my txt-file has:
k <- length(scan("data.txt", nlines=1, sep="\t", what="character"))
Then I open a connection to the text file so that it does not need to be opened
again for every single line:
filecon<-file("data.txt", open="r")
Then I initialize a connection (dbcon) to an SQLite database
dbcon<- dbConnect(dbDriver("SQLite"), dbname="mydb.dbms")
I find out where the position of the first line is
pos<-seek(filecon, rw="r")
Since the first line contains the column-names I save them for later use
col_names <- unlist(strsplit(readLines(filecon, n=1), "\t"))
Next, I test reading the first 10 lines, line by line, and saving them into a DB table, which (should) contain k columns with column names = col_names.
for(i in 1:10) {
  # prints the iteration number in hundreds
  if(i %% 100 == 0) {
    print(i)
  }
  # read one line into a variable tt
  tt <- readLines(filecon, n=1)
  # parse tt into a variable tt2, since tt is a string
  tt2 <- unlist(strsplit(tt, "\t"))
  # Every line, read and parsed from the text file, is immediately saved
  # in the SQLite database table "results" using the command dbWriteTable()
  dbWriteTable(conn=dbcon, name="results", value=as.data.frame(t(tt2[1:k]), stringsAsFactors=T), col.names=col_names, append=T)
  pos <- c(pos, seek(filecon, rw="r"))
}
If I run this I get the following error
Warning messages:
1: In value[[3L]](cond) :
RS-DBI driver: (error in statement: table results has 738 columns but 13 values were supplied)
Why should I supply 738 columns? If I change k (which is 12) to 738 the code works, but then I need to query the columns by V1, V2, V3, ... and not by the column names I intended to supply:
res <- dbGetQuery(dbcon, "select V1, V2, V3, V4, V5, V6 from results")
Any help or even a small hint is very much appreciated!

Read text file into array (Storing only numerical data)

I'm trying to read specific values from the text file (below):
Current Online Users: 0
Total User Logins: 0
Server Uptime: 0 day, 0 hour, 0 minute
Downloaded Amount: 0.000 KB
Uploaded Amount: 0.000 MB
Downloaded Files: 0
Uploaded Files: 0
Download Bandwidth Utilization: 0.00 KB/s
Upload Bandwidth Utilization: 000.00 KB/s
I can read the file to an array:
Dim path As String = "C:\Stats.txt"
Dim StringArrayOfTextLines() As String = System.IO.File.ReadAllLines(path)
How do I store only the data I require in the array? I've tried Split and Substring but cannot work out a usable method; I need only the text after the colon on each line.
I was thinking: since I only require the numerical data, can it be extracted from each line rather than just splitting into an array?
Thanks.
To capture everything after the colon you just need to split on it and take the second element of each result:
For Each s In StringArrayOfTextLines
    Console.WriteLine(s.Split(":")(1).Trim())
Next
If you want to do that as you read the data you'll need to use a StreamReader like Joel suggested.
ReadAllLines does just what it says it does: you have to iterate over the results. To read the data you want directly, you have to write code that uses a System.IO.StreamReader (and its ReadLine() function) or even just a base System.IO.FileStream.