matching rows and fields from two files - awk

I would like to match the record number in one file with the same field number in another file:
file1:
1
3
5
4
3
1
5
file2:
A B C D E F G
H I J J K L M
N O P Q R S T
I would like to use the record numbers corresponding to 5 in the first file to obtain the corresponding fields in the second file. Desired output:
C G
J M
P T
So far, I've done:
awk '{ if ($1=="5") print NR }' file1 > temp
for i in $(cat temp); do
awk '{ print $"'${i}'" }' file2
done
But I get this output:
C
J
P
G
M
T
I would like to have this in the format of the desired output above, but can't get it to work. Perhaps using printf or an awk for-loop might work, but I have had no success.
Thank you all.

awk 'NR==FNR{if($1==5)a[NR];next}{for(i in a){printf "%s ",$i}print ""}' file1 file2
C G
J M
P T
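Note that the iteration order of for (i in a) is not guaranteed in POSIX awk, so the columns may not always come out in the order the matches appear in file1. A minimal sketch of the same two-pass idea that preserves that order (assuming the same file1/file2 layout as above):

awk 'NR==FNR { if ($1 == 5) cols[++n] = FNR; next }   # pass 1: remember matching record numbers, in order
     { line = ""
       for (i = 1; i <= n; i++) line = line (i > 1 ? OFS : "") $(cols[i])
       print line
     }' file1 file2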

Extract substring from grep/awk results?

I have a grep command that finds rows in a file, passes those to awk and prints out the 1st and 15th columns.
grep String1 /path/to/file.txt | grep string2 | awk -F ' ' '{print $1, $15}'
So far, so good. This results in a list like this:
2023-01-20 [text1]>
2023-01-22 [text2]>
2023-01-23 [text3]>
2023-01-25 [text4]>
Ideally, I'd like to add some regex to the awk command so that I get this:
2023-01-20 text1
2023-01-22 text2
2023-01-23 text3
2023-01-25 text4
My searches have only returned how to use regex with awk to identify fields but not to extract a substring from the results. Is this possible with awk or some other command?
One awk idea that combines the current code with the new requirement:
awk -v s1="String1" -v s2="string2" ' # feed both search strings in as awk variables "s1" and "s2"
$0~s1 && $0~s2 { print $1,substr($15,2,index($15,"]")-2) } # if s1 and s2 are both present in the current line then print 1st field and 15th field (sans the "[" "]" wrappers)
' /path/to/file.txt
A nonsensical demo file:
$ cat file.txt
a b c d e f g h i j k l m n o p q r s t u v w x y z
a string2 c d e f g h i j k l m n [old]> p q r s t u v String1 x y z
a b c d e f g h i j k l m n o p q r s t u v w x y z
a String1 c d e f g h i j k l m n [older]> p q r s t u v string2 x y z
Running the awk script against this file generates:
a old
a older
If you basically just want to delete the characters [, ] and >, you can simply use tr -d for this, something like:
... | tr -d "[]>"
$ echo "2023-01-20 [text1]>" | tr -d "[]>"
2023-01-20 text1
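Plugged into the original pipeline, that could look something like this (keeping the grep/awk stage from the question unchanged):

grep String1 /path/to/file.txt | grep string2 | awk -F ' ' '{print $1, $15}' | tr -d "[]>"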
Another option removing the leading [ and trailing ]> with gsub and an alternation:
awk '/String1/ && /string2/ {
  gsub(/^\[|\]>$/, "", $15)
  print $1, $15
}' file.txt
In GNU awk you could use gensub:
awk '/String1/ && /string2/ {
  print $1, gensub(/^\[|\]>$/, "", "g", $15)
}' file
Or find the occurrence of the string using index:
awk 'index($0, "String1") && index($0, "string2") {
  gsub(/^\[|\]>$/, "", $15)
  print $1, $15
}' file
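For completeness, match() can also pull the text out of the brackets without touching the rest of the record; a rough sketch of that variant, under the same assumption that the bracketed text sits in the 15th field:

awk '/String1/ && /string2/ {
  if (match($15, /\[[^]]*\]/))                    # locate the "[...]" part of the 15th field
    print $1, substr($15, RSTART+1, RLENGTH-2)    # print the 1st field and the text between the brackets
}' file.txt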

How to do a join using awk

Here is my Input file
Identifier Relation
A 1
A 2
A 3
B 2
B 3
C 1
C 2
C 3
I want to join this file to itself based on the "Relation" field.
Sample Output file
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
I used the following awk script:
awk 'NR==FNR {a[NR]=$0; next} { for (k in a) if (a[k]~$2) print a[k],$0}' input input > output
However, I had to do another awk step to delete lines which did a join with themselves, i.e., A 1 A 1; B 2 B 2, etc.
The second issue with this file is it prints both directions of the join, thus
A 1 C 1 is printed along with C 1 A 1 on another line.
Both these lines display the same relation and I would NOT like to include this. I want to see just one or the other, i.e., "A 1 C 1" or "C 1 A 1", not both.
Any suggestions/directions are highly appreciated.
An alternative solution using awk with join and sort support:
$ join -j 2 <(sort -k2 -k1,1 file){,} |
    awk '$2!=$3 && !($3 FS $2 in a){a[$2 FS $3]; print $2,$1,$3,$1}'
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
This creates the cross product, then eliminates the diagonal and one of each symmetrical pair.
Here is an awk-only solution:
awk 'NR>1 { ar[$2] = ar[$2] $1 }
     END {
       for (key in ar) {
         for (i = 1; i < length(ar[key]); i++) {
           for (j = i+1; j <= length(ar[key]); j++) {
             print substr(ar[key], i, 1), key, substr(ar[key], j, 1), key
           }
         }
       }
     }' infile
Each number in the second column of the input serves as a key of an awk array. The value of the corresponding array element is a sequence of first-column letters (e.g., array[1]=ABC).
Then we build all two-letter combinations for each sequence (e.g., "ABC" gives "AB", "AC" and "BC").
Output:
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Note:
If a number occurs only once, no output is generated for this number.
The order of output depends on the order of input (no sorting of letters!). That is, if the second input line were C 1, then array[1]="CAB" and the first output line would be C 1 A 1.
The first line of input (the header) is ignored due to NR>1.
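Since the solution above addresses each letter with substr(..., i, 1), it assumes the identifiers are single characters. If they can be longer, here is a sketch of the same idea that keeps them in a comma-joined list instead (assuming the identifiers themselves contain no commas):

awk 'NR>1 { ids[$2] = (ids[$2] == "" ? $1 : ids[$2] "," $1) }   # collect identifiers per relation
     END {
       for (rel in ids) {
         n = split(ids[rel], a, ",")
         for (i = 1; i < n; i++)
           for (j = i+1; j <= n; j++)
             print a[i], rel, a[j], rel
       }
     }' infile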
There is surely a solution with awk only, but I'm going to propose a solution using awk and sort because I think it's quite simple and does not require storing the entire file content in awk variables. The idea is as follows:
rewrite the input file so that the "relation" field is first (A 1 -> 1 A)
use sort -n to put together all lines with same "relation"
use awk to combine consecutive lines having the same "relation"
That would translate to something like:
awk 'NR>1{print $2 " " $1}' input | sort -n |
awk '{if ($1==lastsel)printf " "; else if(lastsel) printf "\n"; lastsel=$1; printf "%s %s", $2, $1;}END{if(lastsel)printf"\n"}'
A 1 C 1
A 2 B 2 C 2
A 3 B 3 C 3
EDIT: If you want only one i-j relation per line:
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel{rel=$1;item=$2;next;} {printf "%s %s %s %s\n", item, rel, $2, $1;}'
A 1 C 1
A 2 B 2
A 2 C 2
A 3 B 3
A 3 C 3
Note the following limitations with this solution:
In case a given n has only one entry, nothing will be output (no output such as D 1)
All relations always have the lexicographically first item in the first column (e.g. A 1 C 1 but never B 1 C 1)

How to replace 1 or 2 with M and F in command line

I have a file with four columns (four fields). One column is sex, coded as 1 or 2. How could I use an awk command to replace 1 with M and 2 with F?
awk '$3=$3==1?"M":"F"' file
for example:
kent$ echo "a b 1 c
c d 2 x"|awk '$3=$3==1?"M":"F"'
a b M c
c d F x
In this example the 3rd column holds the 1 or 2; just change $3 to the right column index.
It is always good to show an example of your input, along with the expected output.
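Note that the one-liner above turns anything that is not 1 into F. If you only want to map 1 and 2 and leave any other value untouched, something along these lines should work (with $3 again standing in for whichever column holds the sex code):

awk '{ $3 = ($3 == 1 ? "M" : ($3 == 2 ? "F" : $3)) } 1' file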

awk: delete first and last entry of comma-separated field

I have a 4 column data that looks something like the following:
a 1 g 1,2,3,4,5,6,7
b 2 g 3,5,3,2,6,4,3,2
c 3 g 5,2,6,3,4
d 4 g 1,5,3,6,4,7
I am trying to delete the first two numbers and the last two numbers in the entire fourth column, so the output looks like the following:
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
Can someone give me some help? I would appreciate it.
You can use this:
$ awk '{n=split($4, a, ","); for (i=3; i<=n-2; i++) t=t""a[i](i==n-2?"":","); print $1, $2, $3, t; t=""}' file
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
Explanation
n=split($4, a, ",") slices the 4th field into pieces, using the comma as delimiter. As split() returns the number of pieces we got, we store it in n to work with it later on.
for (i=3; i<=n-2; i++) t=t""a[i](i==n-2?"":",") rebuilds the 4th field in t from the remaining slices (the 3rd through the (n-2)th), adding a comma after each one except the last.
print $1, $2, $3, t; t="" prints the new output and blanks the variable t.
This will work for your posted sample input:
$ awk '{gsub(/^([^,]+,){2}|(,[^,]+){2}$/,"",$NF)}1' file
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
If you have cases where there are fewer than 4 commas in your 4th field, then update your question to show how those should be handled.
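If you do need to guard against those shorter lists, one option is to strip only when the last field has at least five comma-separated items and leave anything shorter untouched (just a sketch, since the desired behaviour for such rows isn't specified):

awk '{ if (split($NF, a, ",") >= 5) gsub(/^([^,]+,){2}|(,[^,]+){2}$/, "", $NF) } 1' file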
This uses bash array manipulation. It may be a little ... gnarly:
while read -a fields; do                     # read the fields for each line
    IFS=, read -a values <<< "${fields[3]}"  # split the last field on comma
    new=("${values[@]:2:${#values[@]}-4}")   # drop the first 2 and last 2 values
    fields[3]=$(IFS=,; echo "${new[*]}")     # join the new list on comma
    printf "%s\t" "${fields[@]}"; echo       # print the new line
done <<END
a 1 g 1,2,3,4,5,6,7
b 2 g 3,5,3,2,6,4,3,2
c 3 g 5,2,6,3,4
d 4 g 1,5,3,6,4,7
END
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6

awk histogram in buckets

Consider I have the following file...
1 a
1 b
1 a
1 c
1 a
2 a
2 d
2 a
2 d
I want to have a histogram within a bucket... for example, if the bucket size is 1 then the output will be:
a 3
b 1
c 1
a 2
d 2
For a bucket size of 2... we have:
a 5
b 1
c 1
d 2
I want to do it with awk and I'm literally stuck...
Here is my code:
awk '
{A[$1]} count [$2]++
{for(i in A) {print i,A[i]}
}' test
Any help?
Thanks,
Amir.
Edit: adding a size_of_bucket variable.
awk -v "size_of_bucket=2" '
{
  bucket = int(($1-1)/size_of_bucket);   # map the first column onto a 0-based bucket
  A[bucket","$2]++;                      # count occurrences of the second column per bucket
}
END {
  for (i in A) {
    print i, A[i];
  }
}
' test
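The keys of A come out looking like 0,a and 0,d. If you would rather have output closer to the letter count format in the question, one possibility is to split the key back apart before printing; a sketch, with test as the assumed input file name and no particular output order guaranteed by for (i in A):

awk -v "size_of_bucket=2" '
{
  bucket = int(($1-1)/size_of_bucket);
  A[bucket","$2]++;
}
END {
  for (i in A) {
    split(i, k, ",");    # k[1] = bucket, k[2] = value from the second column
    print k[2], A[i];
  }
}
' test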