Truncation of strings after running awk script - awk

I have this code
BEGIN { FS=OFS=";" }
{ key = $(NF-1) }
NR == FNR {
    for (i=1; i<(NF-1); i++) {
        if ( !seen[key,$i]++ ) {
            map[key] = (key in map ? map[key] OFS : "") $i
        }
    }
    next
}
{ print $0 map[key] }
I run it like this:
awk -f tst.awk 2.txt 1.txt
I have two text files
1.txt
AA;BB;
2.txt
CC;DD;BB;AA;
I am trying to generate this output (3.txt):
AA;BB;CC;DD;
but with this script that is not possible; it returns only AA;BB;
Logic: the script just uses literal strings in a hash lookup of array indices, so it doesn't care what characters the input contains. Regarding the sample output: if 2.txt contains fields that also appear in 1.txt (for example BB;AA;), then they need to be concatenated into a single row, i.e. AA;BB;CC;DD;. Ordering is not required; it does not matter if the output is BB;AA;DD;CC;. The only condition is to avoid duplicates, and my script already does that.

Could you please try the following. As per the OP's comment, both files contain only one line each, so the paste command is used to combine the two files and its output is then processed by awk.
paste -d';' 1.txt 2.txt |
awk '
BEGIN {
    FS=OFS=";"
}
{
    # keep only the first occurrence of each field value
    for (i=1; i<=NF; i++) {
        if (!seen[$i]++) { val = (val ? val OFS : "") $i }
    }
    print val
    delete seen    # reset for the next input line
    val = ""
}'
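If you prefer to avoid paste, the same deduplication can be done entirely in awk, reading both one-line files directly. A minimal sketch, assuming each file really holds a single line and that empty trailing fields should be dropped:
awk '
BEGIN { FS=OFS=";" }
{
    # collect every non-empty field from both files,
    # keeping only the first occurrence of each value
    for (i=1; i<=NF; i++)
        if ($i != "" && !seen[$i]++)
            out = (out ? out OFS : "") $i
}
END { print out OFS }    # trailing OFS reproduces the final ";" of the sample output
' 1.txt 2.txt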

Related

Modify a Line after Matching Pattern with Another Line, Then Delete the Other Line with Awk

For example, say I had the following lines:
1,r,other,columns,....,
4,w,...,
2,w,etc...
3,r
1,w
2,r
I would want my output written to a file (or overwrite the existing file) as:
1,r/w,other,columns,....,
4,w,...,
2,r/w,etc...
3,r
Where order does not matter in the end.
The first comma-delimited field of each line is the pattern to match; once two lines match, one will have 'r' and the other 'w' as its second field, and I want to combine them into one line like the example above.
Update
I've managed to get it working with the command:
awk -F, '{a[$1]=a[$1]?a[$1] OFS $2:$2} END{for (i in a) print i FS a[i]}' OFS="/" file
However, this erases all other columns that come after the second. How can I preserve those columns?
$ cat tst.awk
BEGIN { FS=OFS="," }
{
    key = $1
    # collect the second field of every line with this key, joined with "/"
    perms[key] = (key in perms ? perms[key] "/" : "") $2
}
$3 != "" {
    # strip the first two fields and remember the rest of the line
    sub(/([^,]*,){2}/,"")
    vals[key] = $0
}
END {
    for (key in perms) {
        print key, perms[key] (key in vals ? OFS vals[key] : "")
    }
}
$ awk -f tst.awk file
1,r/w,other,columns,....,
2,w/r,etc...
3,r
4,w,...,

Print columns from two files

How to print columns from various files?
I tried the approach from Awk: extract different columns from many different files:
paste <(awk '{printf "%.4f %.5f ", $1, $2}' FILE.R ) <(awk '{printf "%.6f %.0f.\n", $3, $4}' FILE_R )
FILE.R == ARGV[1] { one[FNR]=$1 }
FILE.R == ARGV[2] { two[FNR]=$2 }
FILE_R == ARGV[3] { three[FNR]=$3 }
FILE_R == ARGV[4] { four[FNR]=$4 }
END {
for (i=1; i<=length(one); i++) {
print one[i], two[i], three[i], four[i]
}
}
but I don't understand how to use this script.
FILE.R
56604.6017 2.3893 2.2926 2.2033
56605.1562 2.3138 2.2172 2.2033
FILE_R
56604.6017 2.29259 0.006699 42.
56605.1562 2.21716 0.007504 40.
Output desired
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.
Thank you
This is one way:
$ awk -v OFS="\t" 'NR==FNR{a[$1]=$2;next}{print $1,a[$1],$3,$4}' file1 file2
Output:
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.
Explained:
$ awk -v OFS="\t" '          # set the output field separator to a tab
NR==FNR {                    # process the first file
    a[$1]=$2                 # hash the second field, using the first as key
    next
}
{
    print $1,a[$1],$3,$4     # output
}' file1 file2
If the spacing you get with tabs is not enough, use printf with format modifiers as in your sample.
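For example, reusing the format modifiers from the paste attempt in the question (a sketch; the exact widths are an assumption taken from that command):
$ awk 'NR==FNR{a[$1]=$2; next} {printf "%.4f %.5f %.6f %.0f.\n", $1, a[$1], $3, $4}' file1 file2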

awk: extract data from a column by name rather than position

I have a text file that is comma delimited. The first line is a list of field names, and subsequent lines contain data. I'll get new versions of the file, and I want to extract all the values from a particular column by name rather than by column number. (I.e. the column I want may be in different positions in different versions of the file.)
For example, here are two files:
foo,bar,interesting,junk
1,2,gold,ramjet
2,25,diamonds,superfluous
and
foo,bar,baz,interesting,junk,morejunk
5,3,smurf,platinum,garbage,scrap
6,2.5,mushroom,sodium,liverwurst,eew
I'd like a single script that will go through multiple files, extracting the minerals in the "interesting" column. :-)
What I've got so far is something that works on ONE file, but I know that awk is more elegant than this. How do I clean this up and make it work on multiple files at once?
BEGIN {
    FS=",";
}
NR == 1 {
    for(i=1; i<=NF; i++) {
        if($i=="interesting") {
            col=i;
        }
    }
}
NR > 1 {
    print $col;
}
You're pretty darn close already. Just use FNR instead of NR, for "File NR".
#!/usr/bin/awk -f
BEGIN { FS="," }
FNR==1 {
    for (col=1;col<=NF;col++)
        if ($col=="interesting")
            next
}
{ print $col }
Or if you like:
#!/usr/bin/awk -f
BEGIN { FS="," }
FNR==1 { for (col=1;$col!="interesting";col++); next }
{ print $col }
Or if you prefer one-liners:
$ awk -F, -v txt="interesting" 'FNR==1{for(c=1;$c!=txt;c++);next} {print $c}' file1 file2
Of course, be careful that you actually have the specified column, or you may find yourself in an endless loop. You can probably figure out the extra condition that saves you from that risk.
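One possible guard is to bound the loop by NF and bail out if the column is missing. A sketch (the error handling here is an assumption, not part of the original one-liner):
$ awk -F, -v txt="interesting" '
FNR==1 {
    for (c=1; c<=NF && $c!=txt; c++) ;
    if (c>NF) { print "column not found in " FILENAME > "/dev/stderr"; exit 1 }
    next
}
{ print $c }
' file1 file2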
Note that in awk, you only need to terminate commands with semicolons if they are followed by another command. Thus, you would do this:
command1; command2
But you can drop the semicolon if you separate commands with newlines:
command1
command2
Do it this way:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { for (i=1;i<=NF;i++) f[$i]=i; next }
{ print $(f["interesting"]) }
$ awk -f tst.awk file1 file2
gold
diamonds
platinum
sodium
Creating a name->value array is always the best approach when it's applicable. It keeps every part of the code simple and decoupled from the rest of the code, and it sets you up for doing other things like changing the order of the fields when you output the results, e.g.:
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 { for (i=1;i<=NF;i++) f[$i]=i; next }
{ print $(f["junk"]), $(f["interesting"]), $(f["bar"]) }
$ awk -f tst.awk file1 file2
ramjet,gold,2
superfluous,diamonds,25
garbage,platinum,3
liverwurst,sodium,2.5

awk group by multiple columns and print max value with non-primary key

I'm new to this site and trying to learn awk. I'm trying to find the maximum value of field 3, grouped by field 1, and to print all the fields of the row that holds that maximum. Field 2 contains a time, which means for each item1 there are 96 values of field 2, field 3 and field 4.
Input file (comma separated):
item1,00:15,10,30
item2,00:45,20,45
item2,12:15,30,45
item1,00:30,20,56
item3,23:00,40,44
item1,12:45,50,55
item3,11:15,30,45
Desired output:
item1,12:45,50,55
item2,12:15,30,45
item3,11:15,30,45
What I tried so far:
BEGIN {
    FS=OFS=","
}
{
    if (a[$1]<$3) {
        a[$1]=$3
    }
}
END {
    for (i in a) {
        print i,a[i]
    }
}
But this only prints:
item1,50
item2,30
item3,30
I need to print the corresponding field 2 and field 4 along with the max value, as shown in the desired output. Please help.
The problem here is that you are not storing the whole line, so when you go through the final data there is no full data to print.
What you need to do is to use another array, say data[index]=full line:
BEGIN {
    FS=OFS=","
}
{
    if (a[$1]<$3) {
        a[$1]=$3
        data[$1]=$0   # store it here!
    }
}
END {
    for (i in a)
        print data[i]   # print it here
}
Or as a one-liner:
$ awk 'BEGIN{FS=OFS=","} {if (a[$1]<$3) {a[$1]=$3; data[$1]=$0}} END{for (i in a) print data[i]}' file
item1,12:45,50,55
item2,12:15,30,45
item3,23:00,40,44
With a little help from the sort command:
sort -t, -k1,1 -k3,3nr file | awk -F, '!seen[$1]++'
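The idea: sort groups the lines by item and puts the largest field 3 first within each group, and awk then prints only the first line it sees for each item. The same pipeline with comments (a sketch):
sort -t, -k1,1 -k3,3nr file |   # by item, then by field 3 numerically, descending
awk -F, '!seen[$1]++'           # keep only the first (i.e. maximum) line per item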
To do this job robustly you need:
$ cat tst.awk
BEGIN { FS="," }
!($1 in max) {
    max[$1] = $3
    data[$1] = $0
    keys[++numKeys] = $1
}
$3 > max[$1] {
    max[$1] = $3
    data[$1] = $0
}
END {
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        print data[keys[keyNr]]
    }
}
$ awk -f tst.awk file
item1,12:45,50,55
item2,12:15,30,45
item3,23:00,40,44
When doing min/max calculations you should always seed your min/max value with the first value read rather than assuming it'll always be less than or greater than some arbitrary value (e.g. zero-or-null if you skip the !($1 in max) block above).
You need the keys array to preserve input order when printing the output. If you loop with for (key in data) instead, the output order is undefined.
Note that idiomatic awk syntax is simply:
<condition> { <action> }
not C-style:
{ if ( <condition> ) { <action> } }
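For example, the max-tracking step above could be written as a bare condition-action pair (a sketch, shown without the first-value seeding discussed earlier):
$3 > a[$1] { a[$1] = $3; data[$1] = $0 }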

Convert rows into columns using awk

Not all columns (and their data) are present for all records. Hence, whenever fields are missing, they should be replaced with nulls.
My Input format:
.set 1000
EMP_NAME="Rob"
EMP_DES="Developer"
EMP_DEP="Sales"
EMP_DOJ="20-10-2010"
EMR_MGR="Jack"
.set 1001
EMP_NAME="Koster"
EMP_DEP="Promotions"
EMP_DOJ="20-10-2011"
.set 1002
EMP_NAME="Boua"
EMP_DES="TA"
EMR_MGR="James"
My desired output format:
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
I tried the below:
awk 'NR>1{printf "%s"(/^\.set/?RS:"~"),a} {a=substr($0,index($0,"=")+1)} END {print a}' $line
This is printing:
Rob~Developer~Sales~20-10-2010~Jack
Koster~Promotions~20-10-2011~
Boua~TA~James~
This awk script produces the desired output:
BEGIN { FS = "[=\"]+"; OFS = "~" }
/\.set/ { ++records; next }
NR > 1 { f[records,$1] = $2 }
END {
    for (i = 1; i <= records; ++i) {
        print f[i,"EMP_NAME"], f[i,"EMP_DES"], f[i,"EMP_DEP"], f[i,"EMP_DOJ"], f[i,"EMR_MGR"]
    }
}
A two-dimensional array is used to store all of the values that are defined for each record.
After all the file has been processed, the loop goes through each row of the array and prints all of the values. The elements that are undefined will be evaluated as an empty string.
Specifying the elements explicitly allows you to control the order in which they are printed. Using print rather than printf allows you to make correct use of the OFS variable, which has been set to ~, as well as ORS, which is a newline character by default.
Thanks to @Ed for his helpful comments that pointed out some flaws in my original script.
Output:
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
$ cat tst.awk
BEGIN { FS="[=\"]+"; OFS="~" }
/\.set/ { ++numRecs; next }
{ name2val[numRecs,$1] = $2 }
!seen[$1]++ { names[++numNames] = $1 }
END {
    for (recNr=1; recNr<=numRecs; recNr++)
        for (nameNr=1; nameNr<=numNames; nameNr++)
            printf "%s%s", name2val[recNr,names[nameNr]], (nameNr<numNames?OFS:ORS)
}
$ awk -f tst.awk file
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
If you want a pre-defined order of fields in your output rather than building it on the fly from the rows of each record as they're read, just populate the names[] array explicitly in the BEGIN section. If you have that situation AND also don't want to hold the whole file in memory:
$ cat tst.awk
BEGIN {
    FS="[=\"]+"; OFS="~"
    numNames=split("EMP_NAME EMP_DES EMP_DEP EMP_DOJ EMR_MGR",names,/ /)
}
function prtName2val(   nameNr, i) {
    if ( length(name2val) ) {
        for (nameNr=1; nameNr<=numNames; nameNr++)
            printf "%s%s", name2val[names[nameNr]], (nameNr<numNames?OFS:ORS)
        delete name2val
    }
}
/\.set/ { prtName2val(); next }
{ name2val[$1] = $2 }
END { prtName2val() }
$ awk -f tst.awk file
Rob~Developer~Sales~20-10-2010~Jack
Koster~~Promotions~20-10-2011~
Boua~TA~~~James
The above uses GNU awk for length(name2val) and delete name2val; if you don't have that, use for (i in name2val) { do stuff; break } and split("",name2val) instead.
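A portable version of the function might look like this (a sketch of the substitution described above; it drops the GNU-specific calls but is otherwise the same logic):
function prtName2val(   nameNr, i, haveVals) {
    for (i in name2val) { haveVals = 1; break }   # portable test for a non-empty array
    if ( haveVals ) {
        for (nameNr=1; nameNr<=numNames; nameNr++)
            printf "%s%s", name2val[names[nameNr]], (nameNr<numNames?OFS:ORS)
        split("", name2val)                       # portable way to empty the array
    }
}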
This is all I can suggest:
awk '{ t = $0; sub(/^[^"]*"/, "", t); gsub(/"[^"]*"/, "~", t); sub(/".*/, "", t); print t }' file
Or sed:
sed -re 's|^[^"]*"||; s|"[^"]*"|~|g; s|".*||' file
Output:
Rob~Developer~Sales~20-10-2010~Jack~Koster~Promotions~20-10-2011~Boua~TA~James