Inserting text at a specific position in lines which match a pattern - awk

I am editing a large text file called "test.txt" on a Mac. Most lines start with #, but some lines are a tab separated list of fields:
val1 val2 val3 val4 val5 val6 val7 val8 val9
What I would like to do is find the lines where val2 = foo and val3 = bar (or just grep for the string foo \t bar), and then, on these lines only, replace whatever val9 is with the string val9=val9. So if val9 is 'g1.t1', I would replace it with 'g1.t1=g1.t1'.
I was able to come up with the following command:
fgrep -l -w 'foo bar' test.txt | xargs sed -i "" 's/\([^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\t[^\t]*\)\t\([^\t]*\)/\1\t\2=\2/'
to find these lines and make these modifications, but this just prints out the modified lines.
I want to write the entire file back out to a new file called "test_edited.txt", with only these changes made. I feel like the solution I've come up with, by relying on piping the output of fgrep to sed, doesn't allow for this. But maybe I'm missing something?
Any suggestions welcome!
Thanks!

awk is more suitable for this job than a grep + xargs + sed pipeline with a very clumsy-looking regular expression:
awk 'BEGIN{FS=OFS="\t"}
$2 == "foo" && $3 == "bar" {$9 = $9 "=" $9} 1' file
# if you want to save changes back to original file use:
awk 'BEGIN{FS=OFS="\t"}
$2 == "foo" && $3 == "bar" {$9 = $9 "=" $9} 1' file > _tmp &&
mv _tmp file
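For instance, to produce the new file test_edited.txt that the question asks for, redirect the output of the same command:
awk 'BEGIN{FS=OFS="\t"}
$2 == "foo" && $3 == "bar" {$9 = $9 "=" $9} 1' test.txt > test_edited.txt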

Related

Filtering using awk returns empty files

I have a similar problem to this question: How to do filtering of multiple files in a directory using awk?
The solution in the answers of the question above does not work for me.
I have tab-delimited txt files (all in folder Observation_by_pracid). For each file, I want to create a new file that only contains rows with a specific value in column $9 (medcodeid). The specific values are to be found in medicalcode_list.txt.
There is no error; however, it returns only empty files.
Codelist
medcodeid
2576
3199
Format of input files
patid consid ... medcodeid
500470520002 3062539302 ... 2576
951924020002 3062538414 ... 310803013
503478020002 3061587464 ... 257619018
951924020002 3062537807 ... 55627011
503576720002 3062537720 ... 3199
Desired output
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
My code
mkdir HBA1C_observation_bypracid
awk '
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Solution
mkdir HBA1C_observation_bypracid
awk '
BEGIN{ FS=OFS="\t" }
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Adding "BEGIN..." solved my problem.
You can join two files on a column using join.
Files must be sorted on the joined column. To perform a numerical sort on a column, use sort this way, where N is the column number:
sort -kN -n FILE
You also need to get rid of the first line (column names) of each file. You can use the tail command as shown below, where N is the line number from which you want to start outputting content (so the 2nd line):
tail -n +N
... But you still need to display the column names:
head -n 1 FILE
To join two files f1 and f2 on field c1 of f1 and field c2 of f2, and output field y of file x:
join -1 c1 -2 c2 f1 f2 -o "x.y, x.y"
Working sample:
for input_file in *.txt ; do
    head -n 1 "$input_file"
    join -1 1 -2 9 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9" \
        <(tail -n +2 PATH/medicalcode_list.txt | sort -k1 -n) \
        <(tail -n +2 "$input_file" | sort -k9 -n)
done
Result (for the input file you gave):
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
Note: the column names aren't aligned with the values. I don't know whether that is a requirement; you can format the display with the printf command.
Personally, I think it would be simpler to loop over the files in the shell (understanding that this will reread the code list more than once), with a simpler awk script that you should be able to test and debug. Something like:
for file in *.txt; do
awk 'FNR == NR { mlist[$1] } FNR != NR && ($9 in mlist) { print }' \
PATH/medicalcode_list.txt "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done
You should be able to start without the redirection to make sure that, for a single file, you get the results printed to the terminal that you expected. If you don't, there might be some incorrect assumption about the files.
Another option would be to write a separate awk command that generates a second awk script with the code list hard-coded into it. This also gives you a chance to check the contents of the mlist array.
printf 'BEGIN {\n%s\n}\n $9 in mlist { print }' \
"$(awk '{ print "mlist[" $1 "]" }' PATH/medicalcode_list.txt)" > filter.awk
for file in *.txt; do
awk -f filter.awk "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done
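With the sample code list shown earlier, the generated filter.awk would look roughly like this (the header line of medicalcode_list.txt slips in as mlist[medcodeid], which is harmless since medcodeid is an unset variable there):
BEGIN {
mlist[medcodeid]
mlist[2576]
mlist[3199]
}
 $9 in mlist { print }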

While using awk showing fatal : cannot open pipe ( Too many open files) error

I was trying to mask a file with the 'tr' and 'awk' commands, but it fails with the error "fatal: cannot open pipe (Too many open pipes)". FILE has approx 1000000 records, quite a huge number.
Below is the code I am trying:
awk - F "|" - v OFS="|" '{ "echo \""$1"\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\"" | get line $1}1' FILE.CSV > test.CSV
It is showing error :-
awk: (FILENAME=- FNR=1019) fatal: cannot open pipe `echo ""TTP_123"" | tr "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" "QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq"' (Too many open pipes)
Please let me know what I am doing wrong here.
Also, note that any number of columns could be used for masking, at any positions; in this example I have taken column positions 1 and 2, but it could be 3 and 10, or 5, 7, and 25.
Thanks
AJ
First things first, you can't have a space between - and F or v.
I was going to suggest sed, but as you only want to translate the first column, that's not as easy.
Unfortunately, awk doesn't have built-in tr functionality, so you'd have to use the shell like you are and just close the pipe:
awk -F "|" -v OFS="|" '{
command="echo \"\\"$1"\\\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\""
command | getline $1
close(command)
}1' FILE.CSV > test.CSV
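The close(command) call is what avoids the error: each line builds a different command string (because $1 is embedded in it), and awk keeps every pipe it opens until it is explicitly closed, so without close() the open pipes pile up until the per-process file-descriptor limit is reached; that is why the original command died around FNR=1019. Closing the pipe after each getline keeps only one open at a time.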
However, I suggest using perl, which can do field splitting and character translation:
perl -F'\|' -lane '$F[0] =~ tr/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq/; print join("|", @F)' FILE.CSV > test.CSV
Or, for a shorter command line, just put the program into a file, drop the e in -lane and use the file name instead of the '...' command.
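As a sketch, put the body of the one-liner into a file (mask.pl here is just a hypothetical name):
$F[0] =~ tr/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq/;
print join("|", @F);
and run it with the same switches minus the e:
perl -F'\|' -lan mask.pl FILE.CSV > test.CSV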
You can do the mapping in awk instead of making a system call for each line, or perhaps simply:
paste -d'|' <(cut -d'|' -f1 file | tr '0-9' 'a-z') <(cut -d'|' -f2- file)
Replace the tr arguments with yours.
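For instance, with the translation sets from the question, that would be roughly:
paste -d'|' <(cut -d'|' -f1 FILE.CSV | tr ' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' ' QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq') <(cut -d'|' -f2- FILE.CSV) > test.CSV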
This does not answer your question directly, but you can implement tr as an awk function, which saves having to spawn lots of external processes:
$ cat tr.awk
function tr(str, from, to,    s,i,c,idx) {   # s, i, c, idx are local variables
s = ""
for (i=1; i<=length(str); i++) {
c = substr(str, i, 1)
idx = index(from, c)
s = s (idx == 0 ? c : substr(to, idx, 1))
}
return s
}
{
print $1, tr($1,
" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq")
}
Example:
$ printf "%s\n" hello wor-ld | awk -f tr.awk
hello KGCCN
wor-ld 3N8-CF
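To apply this to the masking task from the question, a rough sketch (assuming only the first pipe-delimited column needs masking) would be to replace the main block of tr.awk with something like:
{ $1 = tr($1, " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", " QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq"); print }
and run it as:
awk -F'|' -v OFS='|' -f tr.awk FILE.CSV > test.CSV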

Printing field by column with variable line field

I have a command that returns an output similar to;
*************
* something *
*************
| Header | Title |
Column1|Column2 | Column3 |Column4| Column5 |Column6|Column7| Column8 |
--------------------------------------------------------------------------------
val1 val2 val3 x y i j 1(a) 2 1(a) 2 val4
val5 val6 val7 w x y z i j k 2(b) 2 1(b) 1 val8
..
..
Total lines: xx
I want to print just Column6, for example, but because the output does not have a fixed number of space-separated fields per line, awk '{print $x}' won't work for me. I need a way to print output by a named column (e.g. Column6 or Column8). Maybe by printing the Column6 field from the right, which is field $5 from the right? Is there a method to count fields from the right rather than from the left, as most commands do?
Any help would be appreciated.
Use NF for this
awk '{print $(NF-5)}'
This will print the 6th field from the end, for example.
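For example, on a line with eight fields, $(NF-5) is the third field:
$ echo 'a b c d e f g h' | awk '{print $(NF-5)}'
c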
I have been solving a similar problem. Assuming Column6 is always there, you can use the following command to find its index:
echo "Column1|Column2|Column6|Column8" | sed 's/Column6.*//;s/[^|]\+//g' | wc -c
Then you can simply construct the awk query
X=$(echo ...)
SCRIPT="{ print \$${X}; }"
echo "Column1 |Column2 |Column6 |Column8" | awk "${SCRIPT}"
|Column6
Rewritten in GNU awk:
$ cat program.awk
BEGIN { FS="|" }
$0 ~ c { # process record with header
split($0,a,"|") # split to get header indexes
for(i in a) { # loop all column names
gsub(/^ *| *$/,"",a[i]) # trim space off
if(a[i]==c) ci=i # ci is the one
}
while(i=index($0,FS)) { # build FIELDWIDTHS to separate fields
FIELDWIDTHS = FIELDWIDTHS i " "
$0=substr($0,i+1)
}
}
ci && $ci !~ /^-+$/ && $0=$ci # implicit printing
Run it:
$ awk -v c="Column6" -f program.awk file
1(a) 2
2(b) 2
If you want to edit the output column, the last line in program.awk is the place to do it. For example, if you'd like to lose the parenthesized part of Column6, you could give that line an action part, such as {sub(/\(.*\)/,""); print}.
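A sketch of that modification, keeping the $0=$ci assignment from the original last line, might be:
ci && $ci !~ /^-+$/ { $0 = $ci; sub(/\(.*\)/, ""); print }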

Delete text before comma in a delimited field

I have a pipe delimited file where I want to remove all text before a comma in field 9.
Example line:
www.upstate.edu|upadhyap|Prashant K Upadhyaya, MD||General Surgery|http://www.upstate.edu/hospital/providers/doctors/?docID=upadhyap|Patricia J. Numann Center for Breast, Endocrine & Plastic Surgery|Upstate Specialty Services at Harrison Center|Suite D, 550 Harrison Street||Syracuse|NY|13202|
so the targeted field is: |Suite D, 550 Harrison Street|
and I want it to look like: |550 Harrison Street|
So far what I have tried has either deleted information from other fields (usually the name in field 3) or has had no effect.
The .awk script I have been trying to write looks like this:
mv $1 $1.bak4
cat $1.bak4 | awk -F "|" '{
gsub(/*,/,"", $9);
print $0
}' > $1
The pattern argument to gsub is a regex, not a glob, so your * isn't matching what you expect it to; you want /.*,/ there. You are also going to need to set OFS to | to keep that delimiter.
mv $1 $1.bak4
awk 'BEGIN{ FS = OFS = "|" }{ gsub(/.*,/,"",$9) } 1' $1.bak4 > $1
I also replaced the verbose print line you had with a true pattern (1) that uses the fact that the default action is print.

awk: changing OFS without looping through variables

I'm working on an awk one-liner to substitute commas for tabs in a file (and swap in \\N for missing values, in preparation for MySQL SELECT INTO).
The following link http://www.unix.com/unix-for-dummies-questions-and-answers/211941-awk-output-field-separator.html (at the bottom) suggests the following approach to avoid looping through the fields:
echo a b c d | awk '{gsub(OFS,";")}1'
head -n1 flatfile.tab | awk -F $'\t' '{for(j=1;j<=NF;j++){gsub(" +","\\N",$j)}gsub(OFS,",")}1'
Clearly, the trailing 1 (it can be any nonzero number or a character) triggers the printing of the entire record. Could you please explain why this works?
SO also has Print all Fields with AWK separated by OFS, but in that post it seems unclear why this works.
Thanks.
Awk evaluates 1, or any number other than 0, as a true pattern. Since a true pattern without an action part is equivalent to { print $0 }, it prints the line.
For example:
$ echo "hello" | awk '1'
hello
$ echo "hello" | awk '0'
$
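Applied to the example from the question, the trailing 1 prints each record after the gsub has modified it:
$ echo a b c d | awk '{gsub(OFS,";")}1'
a;b;c;d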