I've got a mapping file of about 13,491 key/value pairs, which I need to use to replace each key with its value in a data set of about 500,000 lines spread over 25 different files.
Example mapping:
value1,value2
Example input: field1,field2,**value1**,field4
Example output: field1,field2,**value2**,field4
Please note that the value could appear in different places on a line, with more than one occurrence.
My current approach is with AWK:
awk -F, 'NR==FNR { a[$1]=$2 ; next } { for (i in a) gsub(i, a[i]); print }' mapping.txt file1.txt > file1_mapped.txt
However, this is taking a very long time.
Is there any other way to make this faster? I could use a variety of tools (Unix utilities, AWK, sed, Perl, Python, etc.).
Note: see the second part for a version that uses the Text::CSV module to parse the files.
Load the mappings into a hash (dictionary), then go through your files and test each field for whether there is such a key in the hash, replacing it with the value if there is. Write each line out to a temporary file, and when done move it into a new file (or overwrite the processed file). Any tool has to do that, more or less.
With Perl, tested with a few small made-up files:
use warnings;
use strict;
use feature 'say';
use File::Copy qw(move);

my $file = shift;
die "Usage: $0 mapping-file data-files\n" if not $file or not @ARGV;

my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
    my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;   # see Notes
    $map{$key} = $val;
}

my $outfile = "tmp.outfile.txt.$$";  # but better use File::Temp

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh,     '<', $file    or die "Can't open $file: $!";
    while (<$fh>) {
        s/^\s+|\s+$//g;                      # remove leading/trailing whitespace
        my @fields = split /\s*,\s*/;
        exists($map{$_}) && ($_ = $map{$_}) for @fields;   # see Notes
        say $fh_out join ',', @fields;
    }
    close $fh_out;

    # Change to the commented-out line once thoroughly tested
    #move($outfile, $file) or die "can't move $outfile to $file: $!";
    move($outfile, 'new_'.$file) or die "can't move $outfile: $!";
}
Notes.
The check of data against mappings is written for efficiency: we must look at each field, there's no escaping that, but then we only check for the field as a key (no regex). For this, all leading/trailing spaces need to be stripped. Thus this code may change whitespace in output data files; in case this is important for some reason, it can of course be modified to preserve the original spaces.
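To sketch how that could look (this snippet is only an illustration, not part of the tested code above): capture the separators in the split so they can be put back verbatim, and re-attach each field's own surrounding spaces around the mapped value. It runs a regex per field, so it trades back some of the speed discussed here.

while (<$fh>) {
    chomp;
    my @parts = split /(\s*,\s*)/;                 # separators are captured, at odd indexes
    for my $i (grep { $_ % 2 == 0 } 0 .. $#parts) {
        if ($parts[$i] =~ /^(\s*)(.*?)(\s*)$/ and exists $map{$2}) {
            $parts[$i] = $1 . $map{$2} . $3;       # keep the field's original spaces
        }
    }
    say $fh_out join '', @parts;
}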
It came up in comments that a field in the data can in fact differ, by having extra quotes. In that case, extract the would-be key first:
for (@fields) {
    $_ = $map{$1} if /"?([^"]*)/ and exists $map{$1};
}
This starts the regex engine on every check, which affects efficiency. It would help to clean up that input CSV data of quotes instead, and run with the code as it is above, with no regex. This can be done by reading the files using a CSV-parsing module; see the comment at the end.
For Perls earlier than 5.14 replace
my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;
with
my ($key, $val) = map { s/^\s+|\s+$//g; $_ } split /\s*,\s*/;
since the "non-destructive" /r modifier was introduced only in v5.14
If you'd rather that your whole operation not die over one bad file, replace or die ... with
or do {
    # print warning for whatever failed (warn "Can't open $file: $!";)
    # take care of filehandles and such if/as needed
    next;
};
and make sure to (perhaps log and) review output.
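For example, a filled-in version of that for the open of a data file inside the loop might look like this (a sketch, not part of the code above):

open my $fh, '<', $file or do {
    warn "Can't open $file: $!";   # log which file was skipped, for later review
    close $fh_out;                 # the temp file's handle was already opened above
    next;
};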
This leaves room for some efficiency improvements, but nothing dramatic.
The data, with commas separating fields, may (or may not) be valid CSV. Since the question doesn't address this at all, and doesn't report problems, it is unlikely that any properties of the CSV data format are used in the data files (delimiters embedded in data, protected quotes).
However, it's still a good idea to read these files using a module that honors full CSV, like Text::CSV. That also makes things easier, by taking care of extra spaces and quotes and handing us cleaned-up fields. So here's that -- the same as above, but using the module to parse the files:
use warnings;
use strict;
use feature 'say';
use File::Copy qw(move);
use Text::CSV;

my $file = shift;
die "Usage: $0 mapping-file data-files\n" if not $file or not @ARGV;

my $csv = Text::CSV->new( { binary => 1, allow_whitespace => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag();

my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = $csv->getline($fh)) {
    $map{ $line->[0] } = $line->[1];
}

my $outfile = "tmp.outfile.txt.$$";  # use File::Temp

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh,     '<', $file    or die "Can't open $file: $!";
    while (my $line = $csv->getline($fh)) {
        exists($map{$_}) && ($_ = $map{$_}) for @$line;
        say $fh_out join ',', @$line;
    }
    close $fh_out;
    move($outfile, 'new_'.$file) or die "Can't move $outfile: $!";
}
Now we don't have to worry about spaces or overall quotes at all, which simplifies things a bit.
While it is difficult to reliably compare these two approaches without realistic data files, I benchmarked them on (made-up) large data files that involve "similar" processing. The code using Text::CSV for parsing runs either at about the same speed or (up to) 50% faster.
The constructor option allow_whitespace makes it remove extra spaces, perhaps contrary to what the name may imply, as I do by hand above. (Also see allow_loose_quotes and related options.) There is far more; see the docs. Text::CSV defaults to Text::CSV_XS, if it is installed.
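As a quick illustration of that option (a made-up snippet, not part of the benchmark above), parsing a line with padded and quoted fields:

use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, allow_whitespace => 1 });
$csv->parse(q{ value1 , "value 2" });
print join('|', $csv->fields()), "\n";   # prints: value1|value 2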
You're doing 13,491 gsub()s on every one of your 500,000 input lines - that's almost 7 billion full-line regexp search/replaces in total. So yes, that would take some time, and it's almost certainly corrupting your data in ways you just haven't noticed, as the result of one gsub() gets changed by the next gsub() and/or you get partial replacements!
I saw in a comment that some of your fields can be surrounded by double quotes. If those fields can't contain commas or newlines and assuming you want full string matches then this is how to write it:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    map[$1] = $2
    map["\""$1"\""] = "\""$2"\""
    next
}
{
    for (i=1; i<=NF; i++) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}
I tested the above on a mapping file with 13,500 entries and an input file of 500,000 lines with multiple matches on most lines, in cygwin on my underpowered laptop, and it completed in about 1 second:
$ wc -l mapping.txt
13500 mapping.txt
$ wc -l file500k
500000 file500k
$ time awk -f tst.awk mapping.txt file500k > /dev/null
real 0m1.138s
user 0m1.109s
sys 0m0.015s
If that doesn't do exactly what you want efficiently then please edit your question to provide a MCVE and clearer requirements, see my comment under your question.
There is some commentary below suggesting that the OP needs to handle real CSV data, whereas the question says:
Please note that the value could be in different places on the line with more than 1 occurrence.
I have taken this to mean that these are lines, not CSV data, and that a regex-based solution is required. The OP also confirmed that interpretation in a comment above.
As noted in other answers, however, it is faster to break the data into fields and simply lookup the replacement in the map.
#!/usr/bin/env perl

use strict;
use warnings;

# Load mappings.txt into a Perl
# hash %m.
#
open my $mh, '<', './mappings.txt'
    or die "open: $!";

my %m = ();
while (<$mh>) {
    chomp;
    my @f = split ',';
    $m{$f[0]} = $f[1];
}

# Load files.txt into a Perl
# array @files.
#
open my $fh, '<', './files.txt'
    or die "open: $!";
chomp(my @files = <$fh>);

# Update each file line by line,
# using a temporary file similar
# to sed -i.
#
foreach my $file (@files) {
    open my $fh, '<', $file
        or die "open: $!";
    open my $th, '>', "$file.bak"
        or die "open: $!";

    while (<$fh>) {
        foreach my $k (keys %m) {
            my $v = $m{$k};
            s/\Q$k/$v/g;
        }
        print $th $_;
    }

    rename "$file.bak", $file
        or die "rename: $!";
}
I assume of course that you have your mappings in mappings.txt and file list in files.txt.
According to your comments, you have proper CSV. The following properly handles quoting and escapes when reading from the map file, when reading from a data file, and when writing to a data file.
It seems you want to match entire fields. The following does this. It even supports fields that contain commas (,) and/or quotes ("). It does the comparisons using a hash lookup, which is much faster than a regex match.
#!/usr/bin/perl

use strict;
use warnings;
use feature qw( say );

use Text::CSV_XS qw( );

my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });

sub process {
    my ($map, $in_fh, $out_fh) = @_;
    while ( my $row = $csv->getline($in_fh) ) {
        $csv->say($out_fh, [ map { $map->{$_} // $_ } @$row ]);
    }
}

die "usage: $0 {map} [{file} [...]]\n"
    if @ARGV < 1;

my $map_qfn = shift;

my %map;
{
    open(my $fh, '<', $map_qfn)
        or die("Can't open \"$map_qfn\": $!\n");
    while ( my $row = $csv->getline($fh) ) {
        $map{$row->[0]} = $row->[1];
    }
}

if (@ARGV) {
    for my $qfn (@ARGV) {
        open(my $in_fh, '<', $qfn)
            or warn("Can't open \"$qfn\": $!\n"), next;
        rename($qfn, $qfn."~")
            or warn("Can't rename \"$qfn\": $!\n"), next;
        open(my $out_fh, '>', $qfn)
            or warn("Can't create \"$qfn\": $!\n"), next;
        eval { process(\%map, $in_fh, $out_fh); 1 }
            or warn("Error processing \"$qfn\": $@"), next;
        close($out_fh)
            or warn("Error writing to \"$qfn\": $!\n"), next;
    }
} else {
    eval { process(\%map, \*STDIN, \*STDOUT); 1 }
        or warn("Error processing: $@");
    close(\*STDOUT)
        or warn("Error writing to STDOUT: $!\n");
}
If you provide no file names beyond the map file, it reads from STDIN and outputs to STDOUT.
If you provide one or more file names beyond the map file, it replaces the files in-place (though it leaves a backup behind).
I have a Perl script to move a file from one folder to another. When I run it manually it works fine, but when I execute it from the browser it fails with an error message.
I know it is related to CGI environment access rights, but how do I add that permission for my Perl script? I already gave 777 permissions to the file and the folder, but it still fails. Please advise.
Thanks in advance.
Here is my code.
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use File::Copy;
print "Content-type: application/json\n\n";
my $query = new CGI;
my $name = $query->param('name');
#$name = "WVID21WAAA110200";
my $sdir = "/disk1/advisories/input/unread";
my $tdir = "/disk1/advisories/input/read";
my $file = $sdir."/".$name;
my $tfile = $tdir."/".$name;
my $f = 0;
print "[";
if (-e $file && -f $file) {
    move($file, $tfile) or $f = 1;
    if ($f == 1) {
        print "{\"status\":\"failed\",\"message\":\"Access Denied\"}";
    } else {
        print "{\"status\":\"success\",\"message\":\"File moved\"}";
    }
} else {
    print "{\"status\":\"failed\",\"message\":\"Invalid file\"}";
}
print "]";
exit 0;
It's because the destination folder doesn't have execute permission. I added it and it works fine now. – Rajesh
I am pretty new to Perl, and I was trying to set up a web server which runs Perl...
I did make it work with another script, but with this one I got this error:
Server error!
The server encountered an internal error and was unable to complete
your request.
Error message: End of script output before headers: index.pl
If you think this is a server error, please contact the webmaster.
Error 500
localhost Apache/2.4.9 (Win32) OpenSSL/1.0.1g PHP/5.5.11
This is my script:
#!"C:\xampp\perl\bin\perl.exe"
use strict;
use warnings;
#Open file and define $pcontent as content of body.txt
open(FILE,"body.txt");
local $/;
my $pcontent = <FILE>;
close(FILE)
#Open file and define $ptitle as content of title.txt
open(FILE,"title.txt");
local $/;
my $ptitle = <FILE>;
close(FILE)
#open html code
print "Content-type: text/html\n\n";
print "<html>";
#set html page title
print "<head>";
print "<title>$ptitle</title>";
print "</head>";
print "<body>";
#set the <body> of the html page
if ($pcontent = ""){
print "
<H1>ERROR OCCURED!</h1>"
} else{
print $pcontent;
};
#close the html code
print "</body>";
print "</html>";
The reason it isn't working is because your Perl code has syntax errors which prevent it from compiling. You can check your code for syntax errors by running
perl -c yourscript.pl
And if we do that we find:
syntax error at yourscript.pl line 11, near ")
If we look at line 11, we see that the line before is missing a semicolon at the end of the statement.
close(FILE) # <--- need semicolon here.
But there are a few other problems with this script:
You should avoid the use of global filehandles (FILE) and instead use lexical filehandles. One advantage is that since they are automatically destroyed at the end of their scope (assuming no references) they will be automatically closed for you.
You should use the three-argument form of open which will help you catch certain bugs
You should check that your open succeeds and report an error if it doesn't
You should only localize $/ within a small block, otherwise it will affect other things in your program that you may not want it to
If this script grows to be anything other than a trivial example, you should use a templating system rather than printing a bunch of HTML.
Your conditional is wrong; you need to use the eq operator for string equality, or == for numerical equality. The = operator is for assignment.
Putting that all together, here is how I would write it:
use strict;
use warnings;
#Open file and define $pcontent as content of body.txt
my $pcontent = do {
open my $fh, '<', 'body.txt' or die "Can not open body.txt: $!";
local $/;
<$fh>;
};
#Open file and define $ptitle as content of title.txt
my $ptitle = do {
open my $fh, '<', 'title.txt' or die "Can not open title.txt: $!";
local $/;
<$fh>;
};
#open html code
print "Content-type: text/html\n\n";
print "<html>";
#set html page title
print "<head>";
print "<title>$ptitle</title>";
print "</head>";
print "<body>";
#set the <body> of the html page
if ($pcontent eq ""){
print "<H1>ERROR OCCURED!</h1>"
} else{
print $pcontent;
};
#close the html code
print "</body>";
print "</html>";
I have an assignment which requires me to take the source IP and destination port from this log file and add them to a database table I created using Perl DBI and SQLite.
I have tried to write a script that does that, but it does not seem to work. I would appreciate any help. The log file is available at
http://fleming0.flemingc.on.ca/~chbaker/COMP234-Perl/sample.log
Here is the code I have so far.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my %ip2port;
my $IPCount = keys %ip2port;
my $portCount = 0;
my $filename = "./sample.log";
open my $LOG, "<", $filename or die "Can't open $filename: $!";
LINE: while (my $line = <$LOG>) {
    my ($src_id)   = $line =~ m!SRC=([.\d]+)!gis;
    my ($dst_port) = $line =~ m!DPT=([.\d]+)!gis;

    my $dbh = DBI->connect(
        "dbi:SQLite:dbname=test.db",
        "",
        "",
        { RaiseError => 1 },
    ) or die $DBI::errstr;

    $dbh->do("INSERT INTO probes VALUES($src_id, $dst_port )");
    $dbh->do("INSERT INTO probes VALUES(2,'$dst_port',57127)");

    my $sth = $dbh->prepare("SELECT SQLITE_VERSION()");
    $sth->execute();
    my $ver = $sth->fetch();
    print @$ver;
    print "\n";
    $sth->finish();

    $dbh->disconnect();
}
1) Change your regular expression:
my ($src_id) = $line =~ m!SRC=([\.\d]+)!g;
my ($dst_port) = $line =~ m!DPT=([\d]+)!g;
2) Change your SQL:
$dbh->do("INSERT INTO probes VALUES('$src_id', $dst_port )");
UPDATE
In any case, it's better to build SQL statements with parameter binding, which also avoids SQL-injection problems:
$dbh->do("INSERT INTO probes VALUES(?,?)", undef, $src_id, $dst_port);
Say that I read in the following information stored in three different text files (can be many more).
File 1
1 2 rt 45
2 3 er 44
File 2
rf r 4 5
3 er 4 t
er t yu 4
File 3
er tyu 3er 3r
der 4r 5e
edr rty tyu 4r
edr 5t yt5 45
When I read in this information I want the information from these files to go into separate arrays, as for now it is all printed out at the same time.
Right now I have this script, which prints out all the information at the same time:
{
    TESTd[NR-1] = $2; g++
}
END {
    for (i = 0; i <= g-1; i++) {
        print " [\"" TESTd[i] "\"]"
    }
    print " _____"
}
But is there a way to read in multiple files and do this for every text file?
Like instead of getting this output when doing awk -f test.awk 1.txt 2.txt 3.txt
["2"]
["3"]
["r"]
["er"]
["t"]
["tyu"]
["4r"]
["rty"]
["5t"]
_____
I get this output
["2"]
["3"]
_____
["r"]
["er"]
["t"]
_____
["tyu"]
["4r"]
["rty"]
["5t"]
_____
And reading in one file at a time is preferably not an option here, since I will have something like 30 text files.
EDIT________________________________________________________________
I want to do this in awk if possible, because I'm going to do something like this:
{
    PRINTONCE[NR-1] = $2; g++
    PRINTONEATTIME[NR-1] = $3
}
END {
    # Do this for all arguments once
    for (i = 0; i <= g-1; i++) {
        print " [\"" PRINTONCE[i] "\"] \n"
    }
    print " _____"

    # Do this for loop for every .txt file that is read in as an argument
    # for (j = 0; j < args.length; j++) {
    for (i = 0; i <= g-1; i++) {
        print " [\"" PRINTONEATTIME[i] "\"] \n"
    }
    print " _____"
}
From what I understand, you have an awk script that works, and you want to run that awk script on many files and have their outputs separated by a new line (or _) in between, so you can distinguish which output is from which file.
Try this bash script:
dir=~/*.txt    # all txt files in ~ (home) directory
for f in $dir
do
    echo "File is $f"
    awk 'BEGIN{print "Hello"}' $f    # your awk code will take the $f file as input
    echo "------------------"; echo;
done
Also, if you do not want to do this to all files you can write the for loop as for f in 1.txt 2.txt 3.txt.
If you don't want to do it in awk directly, you can call it like this in bash or zsh, for example:
for fic in test*.txt; do awk -f test.awk "$fic"; done
It's quite simple to do it directly in awk:
# define a function to print out the array
function dump(array, n) {
    for (i = 0; i <= n-1; i++) {
        print " [\"" array[i] "\"]"
    }
    print " _____"
}

# dump and reset when starting a new file
FNR==1 && NR!=1 {
    dump(TESTd, g)
    delete TESTd
    g = 0
}

# add data to the array
{
    TESTd[FNR-1] = $2; g++
}

# dump at the end
END {
    dump(TESTd, g)
}
N.B. using delete TESTd is a non-standard gawk feature, but the question is tagged as gawk so I assumed it's OK to use it.
Alternatively you could use one or more of ARGIND, ARGV, ARGC or FILENAME to distinguish the different files.
Or, as suggested by https://stackoverflow.com/a/10691259/981959, with gawk 4 you can use an ENDFILE block instead of END in your original:
{
    TESTd[FNR-1] = $2; g++
}
ENDFILE {
    for (i = 0; i <= g-1; i++) {
        print " [\"" TESTd[i] "\"]"
    }
    print " _____"
    delete TESTd
    g = 0
}
Write a bash shell script or a basic shell script. Try putting the code below into test.sh, then call /bin/sh test.sh or /bin/bash test.sh and see which one works.
for f in *.txt
do
    echo "File is $f"
    awk -F '\t' 'blah blah' $f >> output.txt
done
Or write a bash shell script to call your awk script:
for f in *.txt
do
    echo "File is $f"
    /bin/sh yourscript.sh
done