Unix/Perl/Python: substitute list on big data set - awk

I've got a mapping file of about 13,491 key/value pairs, which I need to use to replace each key with its value in a data set of about 500,000 lines spread over 25 files.
Example mapping:
value1,value2
Example input: field1,field2,value1,field4
Example output: field1,field2,value2,field4
Please note that the value could be in different places on the line with more than 1 occurrence.
My current approach is with AWK:
awk -F, 'NR==FNR { a[$1]=$2 ; next } { for (i in a) gsub(i, a[i]); print }' mapping.txt file1.txt > file1_mapped.txt
However, this is taking a very long time.
Is there any other way to make this faster? I could use a variety of tools (Unix, awk, sed, Perl, Python, etc.).

Note: see the second part for a version that uses the Text::CSV module to parse the files.
Load mappings into a hash (dictionary), then go through your files and test each field for whether there is such a key in the hash, replace with value if there is. Write each line out to a temporary file, and when done move it into a new file (or overwrite the processed file). Any tool has to do that, more or less.
With Perl, tested with a few small made-up files:
use warnings;
use strict;
use feature 'say';
use File::Copy qw(move);

my $file = shift;
die "Usage: $0 mapping-file data-files\n" if not $file or not @ARGV;

my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
    my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;  # see Notes
    $map{$key} = $val;
}

my $outfile = "tmp.outfile.txt.$$";  # but better use File::Temp

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh,     '<', $file    or die "Can't open $file: $!";
    while (<$fh>) {
        s/^\s+|\s+$//g;  # remove leading/trailing whitespace
        my @fields = split /\s*,\s*/;
        exists($map{$_}) && ($_ = $map{$_}) for @fields;  # see Notes
        say $fh_out join ',', @fields;
    }
    close $fh_out;

    # Change to the commented-out line once thoroughly tested
    #move($outfile, $file) or die "Can't move $outfile to $file: $!";
    move($outfile, 'new_'.$file) or die "Can't move $outfile: $!";
}
Notes.
The check of data against mappings is written for efficiency: we must look at each field, there's no escaping that, but then we only check for the field as a hash key (no regex). For this, all leading/trailing spaces need to be stripped. Thus this code may change whitespace in the output data files; in case that matters for some reason, it can of course be modified to preserve the original spacing.
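For example, one way to preserve the original spacing is to keep the separators when splitting. A minimal sketch (the DATA line and the single-pair %map are made up):

use warnings;
use strict;
use feature 'say';

my %map = ('value1' => 'value2');   # stands in for the real mapping hash

while (my $line = <DATA>) {
    chomp $line;
    # Capturing parentheses in split keep the separators in the result:
    # fields sit at even indices, separators (with their spaces) at odd ones.
    # Caveat: whitespace at the very start/end of the line stays attached
    # to the first/last field, so keys there would still need trimming.
    my @parts = split /(\s*,\s*)/, $line, -1;
    for my $i (grep { $_ % 2 == 0 } 0 .. $#parts) {
        $parts[$i] = $map{$parts[$i]} if exists $map{$parts[$i]};
    }
    say join '', @parts;   # separators are restored exactly as read
}

__DATA__
field1, field2 ,value1,field4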
It came up in comments that a field in the data can in fact differ, by having extra quotes. Then extract the would-be key first:
for (@fields) {
    $_ = $map{$1} if /"?([^"]*)/ and exists $map{$1};
}
This starts the regex engine on every check, which affects efficiency. It would help to instead clean the quotes out of the input CSV data and run the code as it is above, with no regex. This can be done by reading the files with a CSV-parsing module; see the comment at the end.
For Perls earlier than 5.14 replace
my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;
with
my ($key, $val) = map { s/^\s+|\s+$//g; $_ } split /\s*,\s*/;
since the "non-destructive" /r modifier was introduced only in v5.14.
If you'd rather that your whole operation doesn't die for one bad file, replace or die ... with
or do {
    # print a warning for whatever failed (warn "Can't open $file: $!";)
    # take care of filehandles and such if/as needed
    next;
};
and make sure to (perhaps log and) review output.
This leaves room for some efficiency improvements, but nothing dramatic.
The data, with commas separating fields, may (or may not) be valid CSV. Since the question doesn't address this at all, and doesn't report problems, it is unlikely that any properties of the CSV data format are used in the data files (delimiters embedded in data, protected quotes).
However, it's still a good idea to read these files using a module that honors full CSV, like Text::CSV. That also makes things easier, by taking care of extra spaces and quotes and handing us cleaned-up fields. So here's that -- the same as above, but using the module to parse the files:
use warnings;
use strict;
use feature 'say';
use File::Copy qw(move);
use Text::CSV;

my $file = shift;
die "Usage: $0 mapping-file data-files\n" if not $file or not @ARGV;

my $csv = Text::CSV->new({ binary => 1, allow_whitespace => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = $csv->getline($fh)) {
    $map{ $line->[0] } = $line->[1];
}

my $outfile = "tmp.outfile.txt.$$";  # use File::Temp

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh,     '<', $file    or die "Can't open $file: $!";
    while (my $line = $csv->getline($fh)) {
        exists($map{$_}) && ($_ = $map{$_}) for @$line;
        say $fh_out join ',', @$line;
    }
    close $fh_out;

    move($outfile, 'new_'.$file) or die "Can't move $outfile: $!";
}
Now we don't have to worry about spaces or overall quotes at all, which simplifies things a bit.
While it is difficult to reliably compare these two approaches without realistic data files, I benchmarked them on (made-up) large data files that involve "similar" processing. The code using Text::CSV for parsing runs either about the same, or (up to) 50% faster.
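Such comparisons can be set up with the core Benchmark module. A minimal sketch, timing just the per-line parsing on one made-up line rather than on real files:

use warnings;
use strict;
use Benchmark qw(cmpthese);
use Text::CSV;

my $csv  = Text::CSV->new({ binary => 1, allow_whitespace => 1 });
my $line = 'field1, field2 ,value1,field4';   # made-up sample line

# Run each sub for ~2 CPU seconds and print a comparison table;
# a realistic benchmark would process the actual data files instead.
cmpthese(-2, {
    manual_split => sub { my @f = split /\s*,\s*/, $line },
    text_csv     => sub { $csv->parse($line); my @f = $csv->fields },
});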
The constructor option allow_whitespace makes it remove extra spaces, perhaps contrary to what the name may imply, as I do by hand above. (Also see allow_loose_quotes and related options.) There is far more; see the docs. Text::CSV defaults to Text::CSV_XS, if it is installed.
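To illustrate allow_whitespace on a made-up line:

use warnings;
use strict;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, allow_whitespace => 1 });
$csv->parse(q{a , "b" , c});           # stray spaces around fields and quotes
print join('|', $csv->fields), "\n";   # prints a|b|c -- the spaces are gone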

You're doing 13,491 gsub()s on every one of your 500,000 input lines - that's almost 7 billion full-line regexp search/replaces in total. So yes, that will take some time, and it's almost certainly corrupting your data in ways you just haven't noticed, as the result of one gsub() gets changed by the next gsub() and/or you get partial replacements!
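To make the corruption concrete, here is a small Perl sketch of the same substitute-in-a-loop logic, with a made-up two-entry mapping where one replacement's output is another replacement's key:

use warnings;
use strict;

my %map  = (aa => 'bb', bb => 'cc');   # made-up mapping
my $line = 'aa,xaabb';

# Same idea as the awk gsub loop: apply every mapping, in unspecified hash order
for my $key (keys %map) {
    $line =~ s/\Q$key\E/$map{$key}/g;
}

# If 'aa' is replaced first the line becomes 'bb,xbbbb', and the later
# 'bb' pass rewrites it again to 'cc,xcccc' -- a cascaded replacement.
# 'xaabb' also shows substring matches inside what should be one field.
print "$line\n";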
I saw in a comment that some of your fields can be surrounded by double quotes. If those fields can't contain commas or newlines, and assuming you want full-string matches, then this is how to write it:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    map[$1] = $2
    map["\""$1"\""] = "\""$2"\""
    next
}
{
    for (i=1; i<=NF; i++) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}
I tested the above on a mapping file with 13,500 entries and an input file of 500,000 lines with multiple matches on most lines, in Cygwin on my underpowered laptop, and it completed in about 1 second:
$ wc -l mapping.txt
13500 mapping.txt
$ wc -l file500k
500000 file500k
$ time awk -f tst.awk mapping.txt file500k > /dev/null
real 0m1.138s
user 0m1.109s
sys 0m0.015s
If that doesn't do exactly what you want efficiently, then please edit your question to provide an MCVE and clearer requirements; see my comment under your question.

There is some commentary below suggesting that the OP needs to handle real CSV data, whereas the question says:
Please note that the value could be in different places on the line with more than 1 occurrence.
I have taken this to mean that these are lines, not CSV data, and that a regex-based solution is required. The OP also confirmed that interpretation in a comment above.
As noted in other answers, however, it is faster to break the data into fields and simply look up the replacement in the map.
#!/usr/bin/env perl

use strict;
use warnings;

# Load mappings.txt into a Perl
# hash %m.
#
open my $mh, '<', './mappings.txt'
    or die "open: $!";
my %m = ();
while (<$mh>) {
    chomp;
    my @f = split ',';
    $m{$f[0]} = $f[1];
}

# Load files.txt into a Perl
# array @files.
#
open my $fh, '<', './files.txt'
    or die "open: $!";
chomp(my @files = <$fh>);

# Update each file line by line,
# using a temporary file similar
# to sed -i.
#
foreach my $file (@files) {
    open my $fh, '<', $file
        or die "open: $!";
    open my $th, '>', "$file.bak"
        or die "open: $!";
    while (<$fh>) {
        foreach my $k (keys %m) {
            my $v = $m{$k};
            s/\Q$k\E/$v/g;
        }
        print $th $_;
    }
    rename "$file.bak", $file
        or die "rename: $!";
}
I assume of course that you have your mappings in mappings.txt and file list in files.txt.

According to your comments, you have proper CSV. The following properly handles quoting and escapes when reading from the map file, when reading from a data file, and when writing to a data file.
It seems you want to match entire fields. The following does this. It even supports fields that contain commas (,) and/or quotes ("). It does the comparisons using a hash lookup, which is much faster than a regex match.
#!/usr/bin/perl
use strict;
use warnings;
use feature qw( say );
use Text::CSV_XS qw( );

my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });

sub process {
    my ($map, $in_fh, $out_fh) = @_;
    while ( my $row = $csv->getline($in_fh) ) {
        $csv->say($out_fh, [ map { $map->{$_} // $_ } @$row ]);
    }
}

die "usage: $0 {map} [{file} [...]]\n"
    if @ARGV < 1;

my $map_qfn = shift;

my %map;
{
    open(my $fh, '<', $map_qfn)
        or die("Can't open \"$map_qfn\": $!\n");
    while ( my $row = $csv->getline($fh) ) {
        $map{$row->[0]} = $row->[1];
    }
}

if (@ARGV) {
    for my $qfn (@ARGV) {
        open(my $in_fh, '<', $qfn)
            or warn("Can't open \"$qfn\": $!\n"), next;
        rename($qfn, $qfn."~")
            or warn("Can't rename \"$qfn\": $!\n"), next;
        open(my $out_fh, '>', $qfn)
            or warn("Can't create \"$qfn\": $!\n"), next;
        eval { process(\%map, $in_fh, $out_fh); 1 }
            or warn("Error processing \"$qfn\": $@"), next;
        close($out_fh)
            or warn("Error writing to \"$qfn\": $!\n"), next;
    }
} else {
    eval { process(\%map, \*STDIN, \*STDOUT); 1 }
        or warn("Error processing: $@");
    close(\*STDOUT)
        or warn("Error writing to STDOUT: $!\n");
}
If you provide no file names beyond the map file, it reads from STDIN and outputs to STDOUT.
If you provide one or more file names beyond the map file, it replaces the files in-place (though it leaves a backup behind).

Related

How to read gz file line by line in Perl6

I'm trying to read a huge gz file line by line in Perl6.
I'm trying to do something like this:
my $file = 'huge_file.gz';
for $file.IO.lines -> $line {
    say $line;
}
But this gives an error that I have malformed UTF-8. I can't see how to get this to read gzipped material from the help pages https://docs.perl6.org/language/unicode#UTF8-C8 or https://docs.perl6.org/language/io
I want to accomplish the same thing as was done in Perl5: http://blog-en.openalfa.com/how-to-read-and-write-compressed-files-in-perl
How can I read a gz file line by line in Perl6?
thanks
I would recommend using the module Compress::Zlib for this purpose. You can find the readme and code on github and install it with zef install Compress::Zlib.
This example is taken from the test file number 3 titled "wrap":
use Test;
use Compress::Zlib;
gzspurt("t/compressed.gz", "this\nis\na\ntest");
my $wrap = zwrap(open("t/compressed.gz"), :gzip);
is $wrap.get, "this\n", 'first line roundtrips';
is $wrap.get, "is\n", 'second line roundtrips';
is $wrap.get, "a\n", 'third line roundtrips';
is $wrap.get, "test", 'fourth line roundtrips';
This is probably the easiest way to get what you want.
Use the read-file-content method in the Archive::Libarchive module, though I don't know whether the method reads the whole file into memory at once:
use Archive::Libarchive;
use Archive::Libarchive::Constants;

my $a = Archive::Libarchive.new: operation => LibarchiveRead, file => 'test.tar.gz';
my Archive::Libarchive::Entry $e .= new;

my $log = '';
while $a.next-header($e) {
    $log = get-log($a, $e) if $e.pathname.ends-with('.txt');
}

sub get-log($a, $e) {
    return $a.read-file-content($e).decode('UTF8-C8');
}
If you are after a quick solution you can read the lines from the stdout pipe of a gzip process:
my $proc = run :out, "gzip", "--to-stdout", "--decompress", "MyFile.gz";
for $proc.out.lines -> $line {
    say $line;
}
$proc.out.close;

Split large CSV file based on column value cardinality

I have a large CSV file with following line format:
c1,c2
I would like to split the original file in two files as follows:
One file will contain the lines where the value of c1 appears exactly once in the file.
Another file will contain the lines where the value of c1 appears twice or more in the file.
Any idea how it can be done?
For example, if the original file is:
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar
I would like to produce the following files:
3,foo
4,bar
and
1,foo
2,bar
2,foo
1,bar
This one-liner generates two files, o1.csv and o2.csv:
awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' file file
test:
kent$ cat f
1,foo
2,bar
3,foo
4,bar
2,foo
1,bar
kent$ awk -F, 'NR==FNR{a[$1]++;next}{print >"o"(a[$1]==1?"1":"2")".csv"}' f f
kent$ head o*
==> o1.csv <==
3,foo
4,bar
==> o2.csv <==
1,foo
2,bar
2,foo
1,bar
Note
awk reads your file twice, instead of saving the whole file in memory
the order of the file is retained
Depending on what you mean by big, this might work for you. It has to hold the lines in the associative array done until it sees the 2nd use, or until the end of the file. When a 2nd use is seen, the remembered data is changed to "!" to avoid printing it again on a 3rd or later match.
>file2
awk -F, '
{
    if (done[$1] != "") {
        if (done[$1] != "!") {
            print done[$1]
            done[$1] = "!"
        }
        print
    } else {
        done[$1] = $0
        order[++n] = $1
    }
}
END {
    for (i=1; i<=n; i++) {
        out = done[order[i]]
        if (out != "!") print out >> "file2"
    }
}
' <csvfile >file1
I'd break out Perl for this job
#!/usr/bin/env perl
use strict;
use warnings;

my %count_of;
my @lines;

open ( my $input, '<', 'your_file.csv' ) or die $!;
# read the whole file
while ( <$input> ) {
    my ( $c1, $c2 ) = split /,/;
    $count_of{$c1}++;
    push ( @lines, [ $c1, $c2 ] );
}
close ( $input );

print "File 1:\n";
# filter any single elements
foreach my $pair ( grep { $count_of{$_->[0]} < 2 } @lines ) {
    print join ( ",", @$pair );
}

print "File 2:\n";
# filter any repeats
foreach my $pair ( grep { $count_of{$_->[0]} > 1 } @lines ) {
    print join ( ",", @$pair );
}
This will hold the whole file in memory, but given your data - you don't save much space by double processing it and maintaining a count.
However you could do:
#!/usr/bin/env perl
use strict;
use warnings;

my %count_of;
open( my $input, '<', 'your_file.csv' ) or die $!;
# read the whole file, counting "c1"
while (<$input>) {
    my ( $c1, $c2 ) = split /,/;
    $count_of{$c1}++;
}

open( my $output_single, '>', "output_uniques.csv" ) or die $!;
open( my $output_dupe,   '>', "output_dupes.csv" )   or die $!;

seek( $input, 0, 0 );
while ( my $line = <$input> ) {
    my ($c1) = split( ",", $line );
    if ( $count_of{$c1} > 1 ) {
        print {$output_dupe} $line;
    }
    else {
        print {$output_single} $line;
    }
}
close($input);
close($output_single);
close($output_dupe);
This will minimise memory occupancy by retaining only the count - it reads the file first to count the c1 values, and then processes it a second time, printing lines to the different outputs.

Insert a line at the end of an ini section only if it doesn't exist

I have an smb.conf ini file which is overwritten whenever it is edited with a certain GUI tool, wiping out a custom setting. This means I need a cron job to ensure that one particular section in the file contains a certain option=value pair, and to insert it at the end of the section if it doesn't exist.
Example
Ensure that hosts deny=192.168.23. exists within the [myshare] section:
[global]
printcap name = cups
winbind enum groups = yes
security = user
[myshare]
path=/mnt/myshare
browseable=yes
enable recycle bin=no
writeable=yes
hosts deny=192.168.23.
[Another Share]
invalid users=nobody,nobody
valid users=nobody,nobody
path=/mnt/share2
browseable=no
Long-winded solution using awk
After a long time struggling with sed, I concluded that it might not be the right tool for the job. So I moved over to awk and came up with this:
#!/bin/sh
file="smb.conf"
tmp="smb.conf.tmp"
section="myshare"
opt="hosts deny=192.168.23."

awk '
BEGIN {
    this_section=0;
    opt_found=0;
}
# Match the line where our section begins
/^[ \t]*\['"$section"'\][ \t]*$/ {
    this_section=1;
    print $0;
    next;
}
# Match lines containing our option
this_section == 1 && /^[ \t]*'"$opt"'[ \t]*$/ {
    opt_found=1;
}
# Match the following section heading
this_section == 1 && /^[ \t]*\[.*$/ {
    this_section=0;
    if (opt_found != 1) {
        print "\t'"$opt"'";
    }
}
# Print every line
{ print $0; }
END {
    # In case our section is the very last in the file
    if (this_section == 1 && opt_found != 1) {
        print "\t'"$opt"'";
    }
}
' $file > $tmp

# Overwrite $file only if $tmp is different
diff -q $file $tmp > /dev/null 2>&1
if [ $? -ne 0 ]; then
    mv $tmp $file
    # reload smb.conf here
else
    rm $tmp
fi
I can't help feeling that this is a long script to achieve a simple task. Is there a more efficient/elegant way to insert a property in an ini file using basic shell tools like sed and awk?
Consider using Python 3's configparser:
#!/usr/bin/python3
import sys
from configparser import ConfigParser  # SafeConfigParser is a deprecated alias

cfg = ConfigParser()
cfg.read(sys.argv[1])
cfg['myshare']['hosts deny'] = '192.168.23.'
with open(sys.argv[1], 'w') as f:
    cfg.write(f)
To be called as ./filename.py smb.conf (i.e., the first parameter is the file to change).
Note that comments are not preserved by this. However, since a GUI overwrites the config and doesn't preserve custom options, I suspect that comments are already nuked and that this is not a worry in your case.
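If you'd rather stay with Perl, the same approach works with the Config::Tiny module; a sketch, assuming the module is installed (like the Python version, it does not preserve comments):

#!/usr/bin/perl
use warnings;
use strict;
use Config::Tiny;

# Read the file, set the option in the [myshare] section, write it back
my $file = shift or die "Usage: $0 smb.conf\n";
my $cfg  = Config::Tiny->read($file)
    or die "Can't read $file: " . Config::Tiny->errstr . "\n";
$cfg->{myshare}{'hosts deny'} = '192.168.23.';
$cfg->write($file)
    or die "Can't write $file: " . Config::Tiny->errstr . "\n";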
Untested, but it should work:
awk -vT="hosts deny=192.168.23" 'x&&$0~T{x=0}x&&/^ *\[[^]]+\]/{print "\t\t"T;x=0}
/^ *\[myshare\]/{x++}1' file
This solution is a bit awkward: it uses the INI section header as the record separator. This means that there is an empty record before the first header, so when we match the header we're interested in, we have to read the next record to handle that INI section. Also, there are some printf commands because the records still contain leading and trailing newlines.
awk -v RS='[[][^]]+[]]' -v str="hosts deny=192.168.23." '
    { printf "%s", $0; printf "%s", RT }
    RT == "[myshare]" {
        getline
        printf "%s", $0
        if (index($0, str) == 0) print str
        printf "%s", RT
    }
' smb.conf
RS is the awk variable that contains the regex to split the text into records.
RT is the awk variable that contains the actual text of the current record separator.
With GNU awk for a couple of extensions:
$ cat tst.awk
index($0,str) { found = 1 }
match($0,/^\s*\[([^]]+).*/,a) {
    if ( (name == tgt) && !found ) { print indent str }
    name = a[1]
    found = 0
}
{ print; indent = gensub(/\S.*/,"","") }
$ awk -v tgt="myshare" -v str="hosts deny=192.168.23." -f tst.awk file
[global]
printcap name = cups
winbind enum groups = yes
security = user
[myshare]
path=/mnt/myshare
browseable=yes
enable recycle bin=no
writeable=yes
hosts deny=192.168.23.
[Another Share]
invalid users=nobody,nobody
valid users=nobody,nobody
path=/mnt/share2
browseable=no
$ awk -v tgt="myshare" -v str="fluffy bunny" -f tst.awk file
[global]
printcap name = cups
winbind enum groups = yes
security = user
[myshare]
path=/mnt/myshare
browseable=yes
enable recycle bin=no
writeable=yes
hosts deny=192.168.23.
fluffy bunny
[Another Share]
invalid users=nobody,nobody
valid users=nobody,nobody
path=/mnt/share2
browseable=no

Break down JSON string in simple perl or simple unix?

OK, so I have this:
{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}
and at the moment I'm using this shell command to decode it and get the string I need,
echo $x | grep -Po '"utterance":.*?[^\\]"' | sed -e s/://g -e s/utterance//g -e 's/"//g'
but this only works when you have a grep compiled with Perl regex support, and the script I use to get that JSON string is written in Perl anyway. So is there any way I can do this same decoding in a simple Perl script or a simpler Unix command, or better yet, C or Objective-C?
The script I'm using to get the JSON is here: http://pastebin.com/jBGzJbMk and if you want a file to test with, download http://trevorrudolph.com/a.flac
How about:
perl -MJSON -nE 'say decode_json($_)->{hypotheses}[0]{utterance}'
in script form:
use JSON;

while (<>) {
    print decode_json($_)->{hypotheses}[0]{utterance}, "\n";
}
Well, I'm not sure if I can deduce what you are after correctly, but this is a way to decode that JSON string in perl.
Of course, you'll need to know the data structure in order to get the data you need. The line that prints the "utterance" string is commented out in the code below.
use strict;
use warnings;
use Data::Dumper;
use JSON;

my $json = decode_json
    q#{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}#;

#print $json->{'hypotheses'}[0]{'utterance'};

print Dumper $json;
Output:
$VAR1 = {
    'status' => 0,
    'hypotheses' => [
        {
            'utterance' => 'hello how are you',
            'confidence' => '0.96311796'
        }
    ],
    'id' => '7aceb216d02ecdca7ceffadcadea8950-1'
};
Quick hack:
while (<>) {
    say for /"utterance":"?(.*?)(?<!\\)"/;
}
Or as a one-liner:
perl -lnwe 'print for /"utterance":"(.+?)(?<!\\)"/g' inputfile.txt
The one-liner is troublesome if you happen to be using Windows, since " is interpreted by the shell.
Quick hack #2:
This will hopefully go through any hash structure and find keys.
my $json = decode_json $str;
say find_key($json, 'utterance');

sub find_key {
    my ($ref, $find) = @_;
    if (ref $ref) {
        if (ref $ref eq 'HASH' and defined $ref->{$find}) {
            return $ref->{$find};
        } else {
            # recurse into hash values or array elements
            for (ref $ref eq 'HASH' ? values %$ref : @$ref) {
                my $found = find_key($_, $find);
                if (defined $found) {
                    return $found;
                }
            }
        }
    }
    return;
}
Based on the naming, it's possible to have multiple hypotheses. This prints the utterance of each hypothesis:
echo '{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}' | \
perl -MJSON::XS -n000E'
say $_->{utterance}
for #{ JSON::XS->new->decode($_)->{hypotheses} }'
Or as a script:
use feature qw( say );
use JSON::XS;

my $json = '{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}';

say $_->{utterance}
    for @{ JSON::XS->new->decode($json)->{hypotheses} };
If you don't want to use any modules from CPAN and want to try a regex instead, there are multiple variants you can try:
# JSON is on a single line:
$json = '{"other":"stuff","hypo":[{"utterance":"hi, this is \"bob\"","moo":0}]}';
# RegEx with negative look behind:
# Match everything up to a double quote without a Backslash in front of it
print "$1\n" if ($json =~ m/"utterance":"(.*?)(?<!\\)"/)
This regex works if there is only one utterance. It doesn't matter what else is in the string around it, since it only searches for the double quoted string following the utterance key.
For a more robust version you could add whitespace where necessary/possible and make the . in the RegEx match newlines: m/"utterance"\s*:\s*"(.*?)(?<!\\)"/s
If you have multiple entries for the utterance confidence hash/object, changing case and weird formatting of the JSON string try this:
# weird JSON:
$json = <<'EOJSON';
{
"status":0,
"id":"an ID",
"hypotheses":[
{
"UtTeraNcE":"hello my name is \"Bob\".",
"confidence":0.0
},
{
'utterance' : 'how are you?',
"confidence":0.1
},
{
"utterance"
: "
thought
so!
",
"confidence" : 0.9
}
]
}
EOJSON
# RegEx with alternatives:
print "$1\n" while ( $json =~ m/["']utterance["']\s*:\s*["'](([^\\"']|\\.)*)["']/gis);
The main part of this RegEx is "(([^\\"]|\\.)*)". Description in detail as an extended regex:
/
  ["']          # opening quotes
  (             # start capturing parentheses for $1
    (           # start of grouping alternatives
      [^\\"']   # anything that's not a backslash or a quote
      |         # or
      \\.       # a backslash followed by anything
    )           # end of grouping
    *           # in any quantity
  )             # end capturing parentheses
  ["']          # closing quotes
/xgs
If you have many data sets and speed is a concern, you can add the o modifier to the regex and use character classes instead of the i modifier. You can suppress the capturing of the alternatives to $2 with clustering parentheses (?:pattern). Then you get this final result:
m/["'][uU][tT][tT][eE][rR][aA][nN][cC][eE]["']\s*:\s*["']((?:[^\\"']|\\.)*)["']/gos
Yes, sometimes perl looks like a big explosion in a bracket factory ;-)
Just stumbled upon another nice method of doing this. I finally found out how to access the Mac OS X JavaScript engine from the command line; here's the script:
alias jsc='/System/Library/Frameworks/JavaScriptCore.framework/Versions/A/Resources/jsc'
x='{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}'
jsc -e "print(${x}['hypotheses'][0]['utterance'])"
Ugh, yes, I came up with another answer. I'm studying Python, and it reads arrays in both its own Python format and the JSON format, so I just made this one-liner for when your variable is x:
python -c "print ${x}['hypotheses'][0]['utterance']"
Figured it out for Unix, but would love to see your Perl and C, Objective-C answers...
echo $X | sed -e 's/.*utterance//' -e 's/confidence.*//' -e s/://g -e 's/"//g' -e 's/,//g'
:D
A shorter copy of the same sed:
echo $X | sed -e 's/.*utterance//;s/confidence.*//;s/://g;s/"//g;s/,//g'

Using a variable defined inside AWK

I got this piece of script working. This is what I wanted:
input
3.76023 0.783649 0.307724 8766.26
3.76022 0.764265 0.307646 8777.46
3.7602 0.733251 0.30752 8821.29
3.76021 0.752635 0.307598 8783.33
3.76023 0.79528 0.307771 8729.82
3.76024 0.814664 0.307849 8650.2
3.76026 0.845679 0.307978 8802.97
3.76025 0.826293 0.307897 8690.43
with this script:
#!/bin/bash
awk -F ', ' '
{
    for (i=3; i<=10; i++) {
        if (i==NR) {
            npc1[i]=sprintf("%s", $1);
            npc2[i]=sprintf("%s", $2);
            npc3[i]=sprintf("%s", $3);
            npRs[i]=sprintf("%s", $4);
            print npc1[i], npc2[i], \
                npc3[i], npRs[i];
        }
    }
}' p_walls.raw
echo "${npc1[100]}"
But now I can't use those arrays, npc1[i] etc., outside awk. That last echo prints nothing. Isn't it possible, or am I missing something?
AWK is a separate process; after it finishes, all its internal data is gone. This is true for all external processes/commands. Bash only sees what bash builtins touch.
i is never 100, so why do you want to access npc1[100]?
What are you really trying to do? If you rewrite the question we might be able to help...
(Cherry on the cake is always good!)
Sorry, but all of @yi_H's answer and comments above are correct.
But there's really no problem loading 2 sets of data into 2 separate arrays in awk, i.e.:
awk '{
    if (FILENAME == "file1") arr1[i++]=$0;
    # same for file2: arr2[j++]=$0
}
END {
    f1max=++i; f2max=++j;
    for (i=1; i<f1max; i++) {
        # put what you need here for arr1[i] processing
        #
        # dont forget that you can do things like
        if (arr1[i] in arr2) { print arr1[i] "=arr2[arr1[" i "]]=" arr2[arr1[i]] }
    }
    for (j=1; j<f2max; j++) {
        # and here for arr2[j] processing
    }
}' file1 file2
You'll have to fill in the actual processing for arr1[i] and arr2[j].
Also, get an awk book for the weekend and be up and running by Monday. It's easy. You can probably figure it out from grymoire.com/Unix/awk.html
I hope this helps.