Replace substrings (fields) of lines in file2 by substrings (fields) of corresponding lines of file1 [closed] - awk

I need to modify large files as described below:
Content of file1 is like this:
/make output fileID 1234 name "key string 1" first 1 begin
/make output fileID 567 name "other key string" middle 1 continue
/make output fileID 890 name "last key string" final 1 end
Content of file2 is like this:
dummyline0
somestring1 fileID AAA name "key string 1" first 1 begin
dummyline1
somestring2 fileID BBBBB name "other key string" middle 1 continue
dummyline2
dummyline3
somestring2 fileID CCCCCC name "last key string" final 1 end
For each line of file1 I want to find the corresponding line in file2 that has an identical part after the 'name' keyword, and then I need to make the following manipulations in file2:
duplicate the line found (there will now be two instances of the line)
comment out the first instance with a # character at the beginning, followed by 'commentText: '
modify the second instance of the line: replace the fileID in file2 with the fileID from file1
The result in file2 should look like this:
dummyline0
#commentText: somestring1 fileID AAA name "key string 1" first 1 begin
somestring1 fileID 1234 name "key string 1" first 1 begin
dummyline1
#commentText: somestring2 fileID BBBBB name "other key string" middle 1 continue
somestring2 fileID 567 name "other key string" middle 1 continue
dummyline2
dummyline3
#commentText: somestring2 fileID CCCCCC name "last key string" final 1 end
somestring2 fileID 890 name "last key string" final 1 end
Note: Lines in file1 and lines to be modified in file2 have the same formatting (field positions are the same for every line in a given file). Only one occurrence of the string after the 'name' keyword exists in file1 & file2. If it would be too complicated, the line duplication & added comment in file2 may be omitted.
Preferably with AWK or sed ...
Could anybody help please?
Thanks

A Perl idea.
#!/usr/bin/perl
use strict;
use warnings;

die "Usage: $0 file1 file2\n" unless @ARGV == 2;

my ($re, $id, %h);

# regex builder to avoid repetition
sub mkre { qr/^(.* fileID )(\S+)(.* name ($_[0]).*)$/ }

# process file1
$re = mkre(qr/"[^"]+"/);
while (<>) {
    # look for id/name pairs
    # convert name to RE, quoting metacharacters
    # store RE=>id pairs in hash for later use
    $h{ mkre(qr/\Q$4\E/) } = $2 if m/$re/;
    # terminate loop after processing file1
    last if eof;
}

# process file2
while (<>) {
    while ( ($re, $id) = each %h ) {
        # if substitution succeeded, we're done with this line
        if ( s/$re/#commentText: $&\n$1$id$3/ ) {
            # there can be only one match,
            # so this regex won't be needed again
            delete $h{$re};
            last;
        }
    }
    print;
    # reset "each" iterator
    keys %h;
}
AWK is slightly more long-winded.
awk '
# process file1
NR==FNR {
    # extract id
    for (i = 1; i <= NF; i++)
        if ($i == "fileID") { id = $(++i); break }
    # extract name
    split($0, a, /"/)
    name = "\"" a[2] "\""
    if (name in h) printf "warning: duplicate name: %s\n", name
    # store for later lookup
    h[name] = id
    next
}
# process file2
{
    # attempt substitution
    for (name in h) {
        if ($0 ~ name) {
            # matched - output comment and prep new version
            print "#commentText: " $0
            t = " fileID " h[name]
            sub(/ fileID [A-Z]+/, t)
            break
        }
    }
    # output possibly modified line
    print
}
' file1 file2
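Note that both scripts write the modified text to standard output rather than editing file2 in place. A usage sketch, assuming a POSIX shell (fix.pl is a hypothetical name for the Perl script above):
perl fix.pl file1 file2 > file2.new && mv file2.new file2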

Related

Using grep / sed replace values from list with semicolons, new lines as separators

I am fairly new to grep, sed, and awk. I have used them in the past to extract lines and/or replace things from exact lists.
In this case I am confused about how to go about it. I have two csv files.
My first csv file contains names that are separated by spaces and semicolons.
Name,
Frank ,
Frank; John; Rob; ,
John; Nick; ,
The second csv is with location and names
Location, Name,
France, Frank,
John, New Jersey,
Nick, Germany,
Rob, Japan,
I would like the output to add the location as a column next to the name.
Name, Location,
Frank , France,
Frank; John; Rob; , France; New Jersey; Japan,
John; Nick; , New Jersey; Germany,
How can I search through the 2nd csv file by line and treat each name as unique to extract its respective location? Then output it so it keeps the information per line with semicolons?
What I have done so far is:
cat file1.csv | cut -f1 | tr ';' '\t' > file-test.tsv
Thank you.
Your files are formatted somewhat strangely. Comma delimited overall, and individual fields delimited with semicolons, but sometimes with a trailing semicolon and sometimes not.
Also, at the time this answer is written, your second file still has "Location, Name" for the first data row, and "Name, Location" for all the rest. I'm assuming that the actual file is "Location, Name" on every row.
Here's how I'm approaching it:
Make one pass through the 2nd file and create a mapping from name to location
Make one pass through the 1st file and apply the mapping
Here is my solution, using just awk:
# use delimiter of zero or more spaces on either side of a comma
awk -F ' *, *' '
# First line of first file processed; set flag variable
FNR == 1 && NR == 1 { mapfile = 1 }
# Lines 2+ in the map file: save the mapping
mapfile && FNR > 1 { map[$2] = $1 }
# First line of second file; print header and reset flag
FNR == 1 && NR > 1 { print "Name, Location,"; mapfile = 0 }
# Process lines 2+ in the name file (i.e. not the map file)
!mapfile && FNR > 1 {
    data = $0
    sub(/ *, *$/, "", data)    # remove trailing comma
    sub(/ *; *$/, "", data)    # remove trailing semicolon
    # create "names" array of length "num"
    num = split(data, names, / *; */)
    locs = ""                  # init location string to empty
    for (i = 1; i <= num; i++) {
        locs = locs map[names[i]] "; "
    }
    sub(/; $/, ",", locs)      # change last semicolon to comma
    # print original line from name file, and append locations
    print $0 " " locs
}' file2.csv file1.csv
Some more explanation:
NR = "Number of Row" being processed. This starts at 1 and increments forever, regardless of how many files are processed by awk
FNR = "File Number of Row". This starts over at 1 with every file being processed
So when both are 1, the first line of the map file is being processed.
When FNR is 1 but NR is greater than 1, the 2nd file is being processed.
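A quick way to see the difference, as a sketch assuming two small files a.txt and b.txt (hypothetical names):
awk '{ print FILENAME, NR, FNR }' a.txt b.txt
NR keeps counting across both files while FNR restarts at 1 when b.txt begins, which is why FNR == 1 && NR > 1 identifies the first line of the second file.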
Also,
awk can use regular expressions as delimiters, so I've told it to use a comma with zero or more spaces on either side as the delimiter ( *, *).
$0 = entire line
$1, $2, etc are the individual fields of each line when split using the specified delimiter.
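For instance, a minimal sketch of the regex delimiter at work:
echo 'a , b,c' | awk -F' *, *' '{ print $2 }'
This prints b regardless of how much whitespace surrounds the commas.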
The rest of the logic should be self-evident from the code and comments within the script.
When processing your files in this order
file2.csv = your second file, but with "location, name" order on all rows
file1.csv = your first file
the output is:
Name, Location,
Frank , France,
Frank; John; Rob; , France; New Jersey; Japan,
John; Nick; , New Jersey; Germany,
Assuming the lines of your 2nd file are actually always in "location, name" order instead of sometimes one, sometimes the other as in the example in your question, here's how to output the data you want:
$ cat tst.awk
BEGIN { FS = " *, *"; OFS = " , " }
NR == FNR {
    name2loc[$2] = $1
    next
}
{
    for (i = 1; i <= NF; i++) {
        n = split($i, names, / *; */)
        for (j = 1; j <= n; j++) {
            locs = (j > 1 ? locs "; " : "") name2loc[names[j]]
        }
    }
    print $1, locs
}
$ awk -f tst.awk file2 file1
Name , Location
Frank , France
Frank; John; Rob; , France; New Jersey; Japan;
John; Nick; , New Jersey; Germany;
Massage the output format to suit whatever you really want your output to look like.
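For example, to get closer to the trailing-comma style shown in your question, you could change the print line of tst.awk (a sketch):
print $1, locs ","
which would emit lines like Frank , France, instead.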

awk: first, split a line into separate lines; second, use those new lines as a new input

Let's say I have this line:
foo|bar|foobar
I want to split it at | and then use those 3 new lines as the input for further processing (say, replacing bar with xxx).
Sure, I can pipe two awk instances, like this:
echo "foo|bar|foobar" | awk '{gsub(/\|/, "\n"); print}' | awk '/bar/ {gsub(/bar/, "xxx"); print}'
But how can I achieve this in one script? First, do one operation on some input, and then treat the result as the new input for the second operation?
I tried something like this:
echo "foo|bar|foobar" | awk -v c=0 '{
{
gsub(/\|/, "\n");
sprintf("%s", $0);
}
{
if ($0 ~ /bar/) {
c+=1;
gsub(/bar/, "xxx");
print c;
print
}
}
}'
Which results in this:
1
foo
xxx
fooxxx
And thanks to the counter c, it's absolutely obvious that the subsequent if doesn't treat the multi-line input it receives as several new records but instead just as one multi-line record.
Thus, my question is: how to tell awk to treat this new multi-line record it receives as many single-line records?
The desired output in this very example should be something like this if I'm correct:
1
xxx
2
fooxxx
But this is just an example, the question is more about the mechanics of such a transition.
I would suggest an alternative approach using split(), where you split the line on the delimiter into an array and iterate over its elements, instead of working on a single multi-line string.
echo "foo|bar|foobar" |\
awk '{
count = 0
n = split($0, arr, "|")
for ( i = 1; i <= n; i++ )
{
if ( arr[i] ~ /bar/ )
{
count += sub(/bar/, "xxx", arr[i])
print count
print arr[i]
}
}
}'
Also, you don't need an explicit increment of the count variable; sub() returns the number of substitutions made on the source string, so you can just add that to the existing value of count.
As one more level of optimization, you can get rid of the ~ match in the if condition and directly use the sub() function there
if ( sub(/bar/, "xxx", arr[i]) ) {
    count++
    print count
    print arr[i]
}
If you set the record separator (RS) to the pipe character, you almost get the desired effect, e.g.:
echo 'foo|bar|foobar' | awk -v RS='|' 1
Output:
foo
bar
foobar
(an empty line)
Except that the trailing newline character becomes part of the last record, so there is an extra line at the end of the output. You can work around this either by including a newline in the RS variable, making it less portable, or by avoiding sending newlines to awk.
For example using the less portable way:
echo 'foo|bar|foobar' | awk -v RS='\\||\n' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz
Note that the empty record at the end is ignored.
With GNU awk:
$ awk -v RS='[|\n]' 'gsub(/bar/,"xxx"){print ++c ORS $0}' file
1
xxx
2
fooxxx
With any awk:
$ awk -F'|' '{c=0; for (i=1;i<=NF;i++) if ( gsub(/bar/,"xxx",$i) ) print ++c ORS $i }' file
1
xxx
2
fooxxx

awk to store field length in variable then use in print

In the awk below I am trying to store the length of $5 in a variable il if the condition is met (it is in both lines) and then add that variable to $3 in the print statement. The two sub statements remove the matching part from both $5 and $6. The script as is executes and produces the current output; however, il does not seem to be populated and added in the print. It seems close, but I'm not sure why the variable isn't being stored? Thank you :)
awk
awk 'BEGIN{FS=OFS="\t"}               # define fs and output
FNR==NR{                              # process each field in each line of file
    if (length($5) < length($6)) {    # condition
        il=$(length($5))
        echo $il
        sub($5,"",$6) && sub($6,"",$5)    # removing matching
        print $1,$2,$3+$il,$3+$il,"-",$6  # print desired output
        next
    }
}' in
in tab-delimited
id1 1 116268178 GAAA GAAAA
id2 2 228197304 A AATCC
current output tab-delimited
id1 1 116268178 116268178 - A
id2 2 228197304 228197304 - ATCC
desired output tab-delimited
since the length of `$5` is 4 in line 1, that is added to `$3`
since the length of `$5` is 1 in line 2, that is added to `$3`
id1 1 116268181 116268181 - A
id2 2 228197305 228197305 - ATCC
The following awk may help you here.
awk '{$3+=length($4);$3=$3 OFS $3;sub($4,"",$5);$4="-"} 1' Input_file
Please add BEGIN{FS=OFS="\t"} in case your Input_file is TAB delimited and you require output in TAB delimited form too.
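For example, a sketch of the tab-delimited variant of the same one-liner (untested against your real data):
awk 'BEGIN{FS=OFS="\t"} {$3+=length($4);$3=$3 OFS $3;sub($4,"",$5);$4="-"} 1' Input_file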

awk to Count Sum and Unique improve command

I would like to print, based on the 2nd column: the count of line items, the sum of the 3rd column, and the number of unique values of the first column. I have around 100 InputTest files and they are not sorted ...
I am using the 3 commands below to achieve the desired output and would like to know the simplest way ...
InputTest*.txt
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
ghi,zz,10,sss
Step#1:
cat InputTest*.txt | awk -F, '{key=$2;++a[key];b[key]=b[key]+$3} END {for(i in a) print i","a[i]","b[i]}'
Op#1
xx,4,40
yy,4,60
zz,1,10
Step#2
awk -F ',' '{print $1,$2}' InputTest*.txt | sort | uniq >Op_UniqTest2.txt
Op#2
abc xx
abc yy
def xx
def yy
ghi zz
Step#3
awk '{print $2}' Op_UniqTest2.txt | sort | uniq -c
Op#3
2 xx
2 yy
1 zz
Desired Output:
xx,4,40,2
yy,4,60,2
zz,1,10,1
Looking for suggestions!
BEGIN { FS = OFS = "," }
{ ++lines[$2]; if (!seen[$2,$1]++) ++diff[$2]; count[$2]+=$3 }
END { for(i in lines) print i, lines[i], count[i], diff[i] }
lines tracks the number of occurrences of each value in column 2
seen records unique combinations of the second and first column, incrementing diff[$2] whenever a unique combination is found. The ++ after seen[$2,$1] means that the condition will only be true the first time the combination is found, as the value of seen[$2,$1] will be increased to 1 and !seen[$2,$1] will be false.
count keeps a total of the third column
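As an aside, the !seen[$2,$1]++ construct is a common awk idiom for acting only on the first occurrence of a key; on its own it is the classic duplicate-line filter (a standalone sketch):
awk '!seen[$0]++' file
which prints each distinct line only the first time it appears.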
$ awk -f avn.awk file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Using awk:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Set the input and output field separators to , in the BEGIN block. We use the keys array to identify and count the keys. The sum array keeps the sum for each key. count lets us keep track of the number of unique column-1 values for each column-2 value.

awk: for every record extract specific information

Simplified example of my file looks like this:
# FamilyName_A
Information 1 2 3
Information 4 5 6
# FamilyName_B
Information 7 8 9
# FamilyName_C
Information 10 11 12
Information 13 14 15
Information 16 17 18
The record separator is #. For every record I want to print the record ID (the family name, i.e. the first word after the record separator) and the first two columns of the following lines, for output like this:
FamilyName_A Information 1
FamilyName_A Information 4
FamilyName_B Information 7
FamilyName_C Information 10
FamilyName_C Information 13
FamilyName_C Information 16
I tried doing this by myself:
awk 'BEGIN {RS="#"} {print $1}' -- this prints the record ID for me
But I don't know how to do the rest (how to print the specific fields of every record).
Use the following script
$1 == "#" { current = $2; next }
          { print current, $1, $2 }
Depending on your input data the expression to catch the record header may change slightly. For the data you provided $1 == "#", /^#/ and /^# FamilyName/ are all perfectly suitable, but if your input data differs a bit, you may need to adjust the condition.
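For instance, the same logic written with a regex pattern instead of the field comparison (a sketch under the same input assumptions):
/^#/ { current = $2; next }
     { print current, $1, $2 }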
On one line:
awk 'BEGIN { family = ""} { if ($1 == "#") family = $2; else print family, $1, $2 }' input.txt
Explanation
BEGIN {
    family = "";
}
{
    if ($1 == "#")
        family = $2
    else
        print family, $1, $2
}
Set family to the empty string.
Check each line: if it starts with #, remember the family name.
If it does not, print the last remembered family name and the first two fields.