I have a big text file, about 4 GB and more than 8 million lines. I'm writing a Perl script that reads this file line by line, does some processing, and updates the info in Sybase. I do this in batches, committing after every 1000 lines. Here is the problem: at first a batch takes only 10 to 20 seconds, but as processing goes on, each batch gets slower and slower, until a batch takes 3 to 4 minutes. I have no idea why this is happening. Can anybody help me analyze what the cause may be? Thanks in advance, on my knees...
==>I'm writing a perl script to read this file line by line, do some processing and update the info to sybase
Please do the entire processing in one go: read the whole source file first, build the data structures you need (hashes, arrays), and only then start inserting data into the database.
Please keep the points below in mind while inserting large amounts of data into a database.
1- If the data in each column is not too large, you can also insert everything in one go. (You will need a good amount of RAM; I am not sure how much, because it depends on the dataset you need to process.)
2- Use execute_array from Perl DBI so that you can insert the data in one go.
3- If you do not have enough RAM to insert the data in one go, divide your data into parts (for example 8 parts of 1 million lines each).
4- Make sure that you prepare the statement only once. In every run you just execute it with a new data set.
5- Turn AutoCommit off.
Sample code using execute_array from Perl DBI. I have used this to insert around 10 million records into MySQL.
Keep your data in arrays like the ones below, one array per column.
# @column1_data, @column2_data, @column3_data
print {$logfile_handle} "Total records to insert--" . scalar(@column1_data) . "\n";
print {$logfile_handle} "Inserting data into database\n";
# Prepare the statement once; execute_array binds one array per placeholder.
my $sth = $$dbh_ref->prepare("INSERT INTO $tablename (column1,column2,column3) VALUES (?,?,?)");
unless ($sth) {
    print {$logfile_handle} "ERROR- Couldn't prepare statement: " . $$dbh_ref->errstr . "\n";
    exit 1;
}
my $tuples = $sth->execute_array(
    { ArrayTupleStatus => \my @tuple_status },
    \@column1_data,
    \@column2_data,
    \@column3_data
);
# AutoCommit is off, so commit explicitly.
$$dbh_ref->commit;
print {$logfile_handle} "Data Insertion Completed.\n";
if ($tuples) {
    print {$logfile_handle} "Successfully inserted $tuples records\n";
} else {
    # Log the rows that were not inserted
    for my $tuple (0 .. $#column1_data) {
        my $status = $tuple_status[$tuple];
        $status = [0, "Skipped"] unless defined $status;
        next unless ref $status;
        printf {$logfile_handle} "ERROR- Failed to insert (%s,%s,%s): %s\n",
            $column1_data[$tuple], $column2_data[$tuple], $column3_data[$tuple], $status->[1];
    }
}
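For comparison, here is a rough sketch of the same pattern (one prepared statement, many rows per call, one commit per batch) in Python. It uses the standard sqlite3 module as a stand-in for Sybase/MySQL, and the table name records and the function insert_in_batches are made up for illustration, so treat it as a sketch under those assumptions rather than a drop-in solution.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert (column1, column2, column3) tuples in batches, one commit per batch."""
    # Hypothetical table name; the SQL text stays the same for every batch.
    sql = "INSERT INTO records (column1, column2, column3) VALUES (?, ?, ?)"
    rows = iter(rows)
    total = 0
    while True:
        batch = list(islice(rows, batch_size))  # at most batch_size rows in memory
        if not batch:
            break
        conn.executemany(sql, batch)  # one call binds every row of the batch
        conn.commit()                 # commit once per batch, not once per row
        total += len(batch)
    return total


# In-memory database used only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (column1 TEXT, column2 TEXT, column3 TEXT)")
inserted = insert_in_batches(conn, ((str(i), "a", "b") for i in range(2500)))
print(inserted)
```

Because the rows come from a generator, only one batch is held in memory at a time, which is the same trade-off as point 3 above.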
Can anyone please help me write an AHK script based on the requirement below?
Requirement:
I have a CSV/TXT file in my Windows environment which contains 20,000+ records in the format below.
When I run the script, it should show an InputBox prompting for an instance name.
Example: if I enter Instance4, it should display ServerName4 in a MsgBox.
Sample Format:
ServerName1,ServerIP,Instance1,Type
ServerName2,ServerIP,Instance2,Type
ServerName3,ServerIP,Instance3,Type
ServerName4,ServerIP,Instance4,Type
ServerName5,ServerIP,Instance5,Type
.
.
.
Also, as the CSV/TXT file contains a large number of records, please consider the best way to avoid delays in fetching the results.
Please post your code, or at least show what you've already done.
You can use a parsing loop with CSV as the delimiter, and make a variable for each 'Instance' whose value is that of the current row's 'ServerName'.
The steps are to first FileRead the data from the file, then split it into rows and parse each row with Loop, Parse like so:
Loop, Parse, data, `n, `r
{
    ; Outer loop: one row per iteration (A_LoopField is the whole line)
    row := A_LoopField
    Loop, Parse, row, CSV
    {
        ; Inner loop: one column per iteration
        ; A_LoopField = current value, A_Index = current column number
        ; Write a script that makes a variable named with the current value of column 3, and give it the value of column 1
    }
}
After that, you can make a Goto loop that keeps showing the InputBox, followed by a command that prints the needed variable using MsgBox, like so:
MsgBox % %input%
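The idea behind this answer (read the file once, build an instance-to-server lookup, then answer each query from memory) can be sketched in Python as follows. This is only an illustration of the lookup-table approach, not AHK code; build_lookup is a made-up helper and the sample rows are taken from the question.

```python
import csv
import io


def build_lookup(lines):
    """Map column 3 (instance name) to column 1 (server name)."""
    lookup = {}
    for row in csv.reader(lines):
        if len(row) >= 3:  # skip blank or malformed rows
            lookup[row[2]] = row[0]
    return lookup


# Sample data in the format from the question; a real script would open the file instead.
sample = io.StringIO(
    "ServerName1,ServerIP,Instance1,Type\n"
    "ServerName2,ServerIP,Instance2,Type\n"
    "ServerName3,ServerIP,Instance3,Type\n"
    "ServerName4,ServerIP,Instance4,Type\n"
    "ServerName5,ServerIP,Instance5,Type\n"
)
lookup = build_lookup(sample)
result = lookup.get("Instance4", "not found")
print(result)
```

The file is parsed only once, so with 20,000+ records each later lookup is a single dictionary access instead of a fresh scan of the file.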
I am pretty new to Tcl and have been writing snippets to improve the automation of the process flow at work. I want to compare the value of a variable to its previous value so that the code knows it's a new flow. The problem is: how do I store the old value of a variable? Or, more precisely, how can I store the value that was assigned to a variable during the previous flow? (Is it even possible?)
This is what our workflow looks like:
Start compilation
A) Start phase1 and run flow.tcl script twice
B) Start phase2 and run flow.tcl script twice
...
End compilation
In this example, the variable is assigned a new value every time the script is run in a different phase. But since I am unable to store the value of the variable for comparison, I am stuck; I have tried different options, all in vain. This might be totally impossible, but as far as I know Tcl can handle almost everything.
Any help is greatly appreciated
Thanks in advance
Hemanth
Edit: simple solution found. Have the data written to text files and read in back again. Thanks
You can save variables in an array and load the variables back into Tcl. The command "array get" serializes the data and "array set" puts it back into an array.
flow.tcl
#!/usr/bin/tclsh
proc load_data {data_file array_name} {
    upvar $array_name data
    if {[file exists $data_file]} {
        set fp [open $data_file r]
        array set data [read $fp]
        close $fp
    }
}
proc save_data {data_file array_name} {
    upvar $array_name data
    set fp [open $data_file w]
    puts $fp [array get data]
    close $fp
}
set now [clock seconds]
# Set defaults. If you need new keys in your data file you can add them here.
set data(count) 0
set data(last_timestamp) $now
# Load existing data over default values. If the key doesn't exist the default will be used.
load_data "flow.dat" data
# Use the saved data to find elapsed time.
set elapsed [expr {$now - $data(last_timestamp)}]
set count $data(count)
# Save new data.
set data(last_timestamp) $now
set data(count) [incr count]
save_data "flow.dat" data
puts "It's been $elapsed seconds since last run. You have run this $count times."
output
% ./flow.tcl
It's been 0 seconds since last run. You have run this 1 times.
flow.dat
% cat flow.dat
count 1 last_timestamp 1427142892
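The same load-defaults, overlay-saved-values, save-again cycle can be sketched in Python, with JSON playing the role of "array get" / "array set". The file name flow.json and the helper names are made up for this sketch.

```python
import json
import os
import tempfile
import time


def load_state(path, defaults):
    """Start from the defaults and overlay whatever was saved by the previous run."""
    state = dict(defaults)
    if os.path.exists(path):
        with open(path) as fp:
            state.update(json.load(fp))
    return state


def save_state(path, state):
    with open(path, "w") as fp:
        json.dump(state, fp)


def run_once(path):
    """One 'flow': compare against the previous run, then record this one."""
    now = int(time.time())
    state = load_state(path, {"count": 0, "last_timestamp": now})
    elapsed = now - state["last_timestamp"]
    state["count"] += 1
    state["last_timestamp"] = now
    save_state(path, state)
    return elapsed, state["count"]


# Hypothetical state file in a temporary directory, so the sketch leaves nothing behind.
state_file = os.path.join(tempfile.mkdtemp(), "flow.json")
first = run_once(state_file)
second = run_once(state_file)
print(first, second)
```

Calling run_once twice stands in for running the script twice; the second call sees the count and timestamp written by the first.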
My hardware is very limited (a dual core with 32-bit Win7 and 4 GB of RAM), so I need to make the best of it. I am trying to load a large text file (about 1.2 GB) into a DB, which I can then query with SQL to do some analytics on particular subgroups.
To be honest, I'm not familiar with this area, and since I could not find help for my issue by googling, I'll just quickly show what I came up with and how I thought things would work:
First I check how many columns my txt-file has:
k <- length(scan("data.txt", nlines=1, sep="\t", what="character"))
Then I open a connection to the text file so that it does not need to be opened
again for every single line:
filecon<-file("data.txt", open="r")
Then I initialize a connection (dbcon) to an SQLite database
dbcon<- dbConnect(dbDriver("SQLite"), dbname="mydb.dbms")
I find out where the position of the first line is
pos<-seek(filecon, rw="r")
Since the first line contains the column-names I save them for later use
col_names <- unlist(strsplit(readLines(filecon, n=1), "\t"))
Next, as a test, I read the first 10 lines, line by line,
and save them into a DB table, which should contain k columns with column names = col_names.
for(i in 1:10) {
  # prints the iteration number in hundreds
  if(i %% 100 == 0) {
    print(i)
  }
  # read one line into a variable tt
  tt<-readLines(filecon, n=1)
  # parse tt into a variable tt2, since tt is a string
  tt2<-unlist(strsplit(tt, "\t"))
  # Every line, read and parsed from the text file, is immediately saved
  # in the SQLite database table "results" using the command dbWriteTable()
  dbWriteTable(conn=dbcon, name="results", value=as.data.frame(t(tt2[1:k]),stringsAsFactors=T), col.names=col_names, append=T)
  pos<-c(pos, seek(filecon, rw="r"))
}
If I run this I get the following error
Warning messages:
1: In value[[3L]](cond) :
RS-DBI driver: (error in statement: table results has 738 columns but 13 values were supplied)
Why should I supply 738 columns? If I change k (which is 12) to 738, the code works, but then I have to refer to the columns as V1, V2, V3, ... and not by the column names I intended to supply:
res <- dbGetQuery(dbcon, "select V1, V2, V3, V4, V5, V6 from results")
Any help or even a small hint is very much appreciated!
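Not an R answer, but here is a sketch in Python of the workflow described above: take the column names from the header line, create the table with exactly those columns, and append the rest of the file in chunks. The function load_tsv and the sample data are made up. The sketch drops any existing "results" table first, on the assumption (not confirmed in the question) that a table left over from an earlier run with a different column count may be what triggers the "738 columns" error.

```python
import csv
import io
import sqlite3
from itertools import islice


def load_tsv(conn, lines, table="results", chunk_size=10000):
    """Create `table` from the header line, then insert the rows chunk by chunk."""
    reader = csv.reader(lines, delimiter="\t")
    col_names = next(reader)  # first line holds the column names
    k = len(col_names)
    quoted = ", ".join('"' + name.replace('"', '""') + '"' for name in col_names)
    # Assumption: a stale table from a previous run should be replaced.
    conn.execute(f'DROP TABLE IF EXISTS "{table}"')
    conn.execute(f'CREATE TABLE "{table}" ({quoted})')
    sql = f'INSERT INTO "{table}" VALUES ({", ".join("?" for _ in col_names)})'
    total = 0
    while True:
        # Pad short rows with NULLs and cut long rows so every row has k values.
        chunk = [(row + [None] * k)[:k] for row in islice(reader, chunk_size)]
        if not chunk:
            break
        conn.executemany(sql, chunk)
        conn.commit()
        total += len(chunk)
    return total


# Tiny tab-separated sample; a real run would pass an open file object instead.
sample = io.StringIO("id\tgroup\tvalue\n1\ta\t10\n2\tb\t20\n")
conn = sqlite3.connect(":memory:")
total = load_tsv(conn, sample)
row = conn.execute('SELECT "group", value FROM results WHERE id = ?', ("2",)).fetchone()
print(total, row)
```

Only one chunk is held in memory at a time, which matters on a 4 GB machine, and the columns can be queried by the names from the header rather than V1, V2, V3.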
I'm not sure of the best way to handle this; I'm guessing it's a while loop. I have a .txt file with a set of numbers (these numbers can change based on another script that runs).
ex:
0
36
41
53
60
Each number is on its own line. For each number, I want to read that number and execute a script using it. So in this example I would call a script to stop database 0, after that completes call a script to stop database 36, and so on until all numbers in the list are done.
1) Is a while loop the best way to handle this?
2) I'm having trouble determining what the [[condition]] needs to be to get each number one at a time. Where can I find some additional help on this?
while [[ condition ]] ; do
command1
done
For testing purposes the file that contains all the numbers is test.txt. The script that will execute is a python script - "amgr.py stop (number from test.txt)"
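Here's a simpler way:
xargs -n 1 amgr.py stop < test.txt
This reads each line of your file and, because of -n 1, runs amgr.py once per number with that number as the extra parameter (without -n 1, xargs would pass all the numbers to a single amgr.py call):
amgr.py stop 0
amgr.py stop 36
and so on.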
I ended up using this method, which gave me the results I was looking for.
while read -r line ; do
    amgr.py stop "$line"
done < test.txt
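Since amgr.py is itself a Python script, the same one-at-a-time loop could also be driven from Python. The sketch below is only an illustration: STOP_COMMAND is a stand-in that echoes its argument, because amgr.py is not available here, and stop_all is a made-up helper name.

```python
import subprocess
import sys

# Stand-in for ["amgr.py", "stop"]: a one-liner that just echoes the number it was given.
STOP_COMMAND = [sys.executable, "-c", "import sys; print('stop', sys.argv[1])"]


def stop_all(lines, command=STOP_COMMAND):
    """Run the command once per non-empty line, waiting for each run to finish."""
    outputs = []
    for line in lines:
        number = line.strip()
        if not number:
            continue  # ignore blank lines
        # check=True raises if the command fails, so later numbers are not attempted.
        done = subprocess.run(command + [number], capture_output=True, text=True, check=True)
        outputs.append(done.stdout.strip())
    return outputs


# The numbers from the example; a real run would pass open("test.txt") instead.
outputs = stop_all(["0\n", "36\n", "41\n", "53\n", "60\n"])
print(outputs)
```

subprocess.run blocks until each command returns, which gives the "after that completes, stop the next one" ordering asked for in the question.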
Here is a simple Perl script that fetches data from SQL.
It reads the data, writes it to a file OUTFILE, and prints every 10000th line on screen.
What puzzles me is that printing the data on screen finishes very quickly (in 30 seconds), but fetching the data and writing it to the file finishes very slowly (30 minutes later).
The amount of data is not large. The output file size is less than 100 MB.
while ( my ($a,$b) = $curSqlEid->fetchrow_array() )
{
printf OUTFILE ("%s,%d\n", $a,$b);
$counter ++;
if($counter % 10000 == 0){
printf ("%s,%d\n", $a,$b);
}
}
$curSqlEid->finish();
$dbh->disconnect();
close(OUTFILE);
You are suffering from buffering.
Handles other than STDERR are buffered by default, and most handles use block buffering. That means Perl will wait until there is 8KB* of data to write before sending anything to the system.
STDOUT is special. When it is attached to a terminal (and only then), it uses a different kind of buffering: line buffering. When using line buffering, the data is flushed every time a newline is encountered in the data to write.
You can see this by running
$ perl -e'print "abc"; print "def"; sleep 5; print "\n"; sleep 5;'
[ 5 seconds pass ]
abcdef
[ 5 seconds pass ]
$ perl -e'print "abc"; print "def"; sleep 5; print "\n"; sleep 5;' | cat
[ 10 seconds pass ]
abcdef
The solution is to turn off buffering.
use IO::Handle qw( ); # Not needed on Perl 5.14 or later
OUTFILE->autoflush(1);
* — 8KB is the default. It can be configured when Perl is compiled. It used to be a non-configurable 4KB until 5.14.
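Block buffering is not specific to Perl. As a small illustration, this Python sketch writes one short line to a file and checks the size on disk before and after an explicit flush. The file name out.txt is made up, and the sketch assumes the written data is smaller than the handle's buffer.

```python
import os
import tempfile

# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "out.txt")

fh = open(path, "w")   # regular files are block buffered by default
fh.write("abc,1\n")    # far smaller than the buffer, so it stays in memory

size_before_flush = os.path.getsize(path)  # nothing has reached the file yet
fh.flush()                                 # push the buffered data to the OS
size_after_flush = os.path.getsize(path)
fh.close()

print(size_before_flush, size_after_flush)
```

Until the flush, the data exists only in the process's buffer, which is why a file can look empty while the script that writes it is still running.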
I think you are seeing the output file size as 0 while the script is running and printing to the console. Do not go by that. The file size will show up only once the script has finished. This is due to output buffering.
Anyway, the delay cannot be as large as 30 minutes. Once the script is done, you should see the data in the output file.
I tried various things, and my final conclusion is that Python and Perl handle the data flow from the DB in fundamentally different ways. It looks like in Perl it is possible to process the data line by line while it is being transferred from the DB, whereas in Python it needs to wait until all the data has been downloaded from the server before processing it.