Looping and merging 2 files - awk

First of all, please pardon me, I am a noob.
My problem is as follows:
I have 2 text files - file1 and file2.
Following are the file samples and the desired output:
file1:
A B C
D E F
G H I
file2:
a1 a2 a3
b1 b2 b3
c1 c2 c3
Desired output:
A B C a1 a2 a3
A B C b1 b2 b3
A B C c1 c2 c3
D E F a1 a2 a3
D E F b1 b2 b3
D E F c1 c2 c3
and so on.
Can anybody please help me out with this?

awk 'FNR == NR {file2[FNR] = $0; c++; next} {for (i = 1; i <= c; i++) {print $0, file2[i]}}' file2 file1
Read all the lines of file2 into an array. For each line of file1, loop through the array and print the line from file1 and the line from file2.
In Bash:
while read -r line
do
    file2+=("$line")
done < file2
while read -r line
do
    for line2 in "${file2[@]}"
    do
        echo "$line $line2"
    done
done < file1
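As a self-contained check of the awk approach above, here is a minimal sketch that recreates the two sample files in a scratch directory and runs the same logic with the program spread over several lines. The directory, the array name lines and the shell variable out are only illustrative names, not part of the original answer.

```shell
# Recreate the sample inputs in a scratch directory (illustrative only).
tmp=$(mktemp -d)
printf '%s\n' 'A B C' 'D E F' 'G H I' > "$tmp/file1"
printf '%s\n' 'a1 a2 a3' 'b1 b2 b3' 'c1 c2 c3' > "$tmp/file2"

out=$(awk '
    # First file on the command line (file2): store every line, in order.
    FNR == NR { lines[++n] = $0; next }
    # Second file (file1): pair the current line with each stored line.
    { for (i = 1; i <= n; i++) print $0, lines[i] }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With three lines in each file this prints nine lines, starting with A B C a1 a2 a3 and ending with G H I c1 c2 c3.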

Related

Combine two columns into new and print all columns

I want to combine columns 1 and 2 and add them as a new column in my data frame. Then I want to print all the old columns and the newly created column. I can combine the columns using the script below, but I am not sure how to print all the columns rather than only the combined one:
awk ' { print $1 $2 "_" $NF } ' input_file
in
c1 c2 c3
12 1 12
4 4 57
out
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4
If you want to print the _ between field 1 and 2, then the first output would be c1 c2 c3 c1_c2 instead of c1 c2 c3 c4
You can add a column at the end with the value of $1 and $2 and then print the whole line:
awk ' { $(NF+1) = $1"_"$2 }1' input_file
Output
c1 c2 c3 c1_c2
12 1 12 12_1
4 4 57 4_4
Or you can print the whole line followed by field $1 and $2
awk '{print $0, $1"_"$2}' input_file
Output
c1 c2 c3 c1_c2
12 1 12 12_1
4 4 57 4_4
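The two commands above give the same result on the sample, but they can differ when the input spacing is irregular: assigning to a field makes awk rebuild $0 with OFS, while print $0, ... leaves the original line as it is. A small sketch of the difference, using a made-up row with extra spaces (the row and the variable names are assumptions for illustration):

```shell
line='12   1  12'   # sample row with irregular spacing (made up for this demo)

# Assigning to a field makes awk rebuild $0 with OFS (a single space by default).
rebuilt=$(printf '%s\n' "$line" | awk '{ $(NF+1) = $1 "_" $2 } 1')

# Printing $0 untouched keeps the original spacing and appends the new value.
kept=$(printf '%s\n' "$line" | awk '{ print $0, $1 "_" $2 }')

printf '[%s]\n[%s]\n' "$rebuilt" "$kept"
```

So if the original alignment matters, the print $0 form is probably the safer choice, and if normalized separators are wanted, the field assignment does that as a side effect.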
Here is a generic solution in awk. List the field numbers in the awk variable named fields (for example 1,2,3,4,7,8) and the values of those fields are joined with _ and appended as the last column. Written and tested in GNU awk; it should work in any awk.
awk -v fields="1,2" '
BEGIN{
num=split(fields,arr,",")
for(i=1;i<=num;i++){
field[arr[i]]
}
}
FNR==1{
print
next
}
{
val=""
for(i=1;i<=NF;i++){
if(i in field){
val=(val?val "_":"")$i
}
}
print $0,val
}
' Input_file
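To see what the generic version does with a different field list, here is a hedged usage sketch on the sample data with fields="1,3". It follows the same idea as the script above but is written out again so it runs on its own; the array name wanted and the variable out are illustrative names. Note that the header line is passed through unchanged, so no name is added for the new column.

```shell
out=$(printf '%s\n' 'c1 c2 c3' '12 1 12' '4 4 57' | awk -v fields="1,3" '
BEGIN {
    # Turn the comma-separated list into a lookup table of field numbers.
    num = split(fields, arr, ",")
    for (i = 1; i <= num; i++) wanted[arr[i]]
}
FNR == 1 { print; next }          # the header line is passed through unchanged
{
    val = ""
    for (i = 1; i <= NF; i++)
        if (i in wanted)
            val = (val == "" ? "" : val "_") $i   # join selected fields with _
    print $0, val
}')
printf '%s\n' "$out"
```

For the two data rows this appends 12_12 and 4_57 respectively.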
$ awk '{print $0, (NR>1 ? $1"_"$2 : "c4")}' file
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4
or to get tab-separated output if your input is tab-separated:
$ awk 'BEGIN{FS=OFS="\t"} {print $0, (NR>1 ? $1"_"$2 : "c4")}' file
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4
or if it isn't:
$ awk -v OFS='\t' '{$(NF+1)=(NR>1 ? $1"_"$2 : "c4")} 1' file
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4
Another awk which at FNR==1 uses the field name in $NF to create the field name for the next field (c3 -> c4, c -> c1, etc):
$ awk '{
printf "%s%s%s\n",
$0,
OFS,
(FNR>1?$1 "_" $2:(match($NF,/[0-9]+$/)?substr($NF,1,RSTART-1) substr($NF,RSTART)+1:$NF 1))
}' file
Output:
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4
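The header-naming rule described above (c3 -> c4, c -> c1) can be tried in isolation. This is only a sketch: the function name next_name and the test names fed to it are made up for illustration. It relies on match() setting RSTART to the position where the trailing digits begin.

```shell
out=$(printf '%s\n' c3 c sample9 | awk '
# Hypothetical helper: derive the next column name from the last one.
function next_name(s) {
    if (match(s, /[0-9]+$/))                                 # trailing number found
        return substr(s, 1, RSTART - 1) (substr(s, RSTART) + 1)
    return s "1"                                             # no number: append 1
}
{ print next_name($1) }')
printf '%s\n' "$out"
```

The parentheses around substr(s, RSTART) + 1 only make the grouping explicit; addition already binds tighter than concatenation in awk.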
golfed version
$ awk '$++NF=NR>1?$1"_"$2:"c4"' file
c1 c2 c3 c4
12 1 12 12_1
4 4 57 4_4

Count number of occurrences of a number larger than x from every row

I have a file with multiple rows and 26 columns. I want to count the number of occurrences of values that are higher than 0 (I guess counting values different from 0 would also be valid) in each row (excluding the first two columns). The file looks like this:
X Y Sample1 Sample2 Sample3 .... Sample24
a a1 0 7 0 0
b a2 2 8 0 0
c a3 0 3 15 3
d d3 0 0 0 0
I would like to have an output file like this:
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
awk or sed would be good.
I saw a similar question but in that case the columns were summed and the desired output was different.
awk 'NR==1{printf "X\tY\tResult%s",ORS} # Printing the header
NR>1{
count=0; # Initializing count for each row to zero
for(i=3;i<=NF;i++){ #iterating from field 3 to end, NF is #fields
if($i>0){ #$i expands to $3,$4 and so which are the fields
count++; # Incrementing if the condition is true.
}
};
printf "%s\t%s\t%s%s",$1,$2,count,ORS # For each row print o/p
}' file
should do that
another awk
$ awk '{if(NR==1) c="Result";
else for(i=3;i<=NF;i++) c+=($i>0);
print $1,$2,c; c=0}' file | column -t
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
$ awk '{print $1, $2, (NR>1 ? gsub(/ [1-9]/,"") : "Result")}' file
X Y Result
a a1 1
b a2 2
c a3 3
d d3 0
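One thing to keep in mind with the gsub one-liner above is that it counts text matches (a space followed by a digit from 1 to 9), so a value such as 0.5 would not be counted. A minimal sketch with a numeric comparison instead, assuming only four sample columns rather than 24, and with a hypothetical extra row e that is not in the original data:

```shell
out=$(printf '%s\n' \
    'X Y Sample1 Sample2 Sample3 Sample4' \
    'a a1 0 7 0 0' \
    'b a2 2 8 0 0' \
    'c a3 0 3 15 3' \
    'd d3 0 0 0 0' \
    'e e1 0.5 0 0 0' | awk '
NR == 1 { print $1, $2, "Result"; next }   # header: keep the two key columns
{
    count = 0
    for (i = 3; i <= NF; i++)              # skip the first two columns
        if ($i + 0 > 0) count++            # force a numeric comparison
    print $1, $2, count
}')
printf '%s\n' "$out"
```

To count values different from 0 instead, the test could be changed to $i + 0 != 0.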

Compare two files with the third column in each file matching but the fifth columns not matching, using awk

Say file1 is:
a b c d f
aa bb cc dd ef
ab bc dg ef ge
ao ob dy ed co
and file2 is:
a b c d e
aa bb cc dd ee
ab bc de ef ge
ao ob dy ed co
the expected output should be:
a b c d f
aa bb cc dd ef
Here is what I tried:
awk 'NR==FNR{c[$3,$5]++;next};($3 in c[$3]) && !($5 in c[$5]) > 0' file1 file2
something like this?
$ awk 'NR==FNR{a[$3]=$0;next}
$3 in a{split(a[$3],r); if($5!=r[5])print}' file2 file1
a b c d f
aa bb cc dd ef
This checks that the 5th fields do not match.
I guess this can be simplified to:
$ awk 'NR==FNR{a[$3]=$5;next} $3 in a && a[$3]!=$5' file2 file1
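For anyone who wants to verify the simplified command, here is a self-contained sketch that writes the two sample files to a scratch directory and runs the same logic. The directory and the names fifth and out are illustrative only.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a b c d f' 'aa bb cc dd ef' 'ab bc dg ef ge' 'ao ob dy ed co' > "$tmp/file1"
printf '%s\n' 'a b c d e' 'aa bb cc dd ee' 'ab bc de ef ge' 'ao ob dy ed co' > "$tmp/file2"

out=$(awk '
    # file2 is read first: remember its 5th field, keyed by the 3rd field.
    NR == FNR { fifth[$3] = $5; next }
    # file1: print lines whose 3rd field is known but whose 5th field differs.
    ($3 in fifth) && fifth[$3] != $5
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The third line of file1 is skipped because its 3rd field (dg) never appears in file2, and the fourth is skipped because both the 3rd and 5th fields match.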

Concatenate/merge columns

File (~50,000 columns)
A1 2 123 f f j j k k
A2 10 789 f o p f m n
Output
A1 2 123 ff jj kk
A2 10 789 fo pf mn
I basically want to concatenate every two columns into one starting from column4. How can we do it in awk or sed?
It is possible in awk. See below
:~/t> more test.txt
A1 2 123 f f j j k k
:~/t> awk '{for(i=j=4; i < NF; i+=2) {$j = $i$(i+1); j++} NF=j-1}1' test.txt
A1 2 123 ff jj kk
Sorry just noticed you gave two lines as example...
:~/t> more test.txt
A1 2 123 f f j j k k
A2 10 789 f o p f m n
:~/t> awk '{for(i=j=4; i < NF; i+=2) {$j = $i$(i+1); j++} NF=j-1}1' test.txt
A1 2 123 ff jj kk
A2 10 789 fo pf mn
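The one-liner above is compact, so here is the same idea written out with comments as a sketch. Like the original, it truncates the record by assigning to NF, which works in gawk and mawk; treat that as an assumption for other awk implementations. The variable out is just an illustrative name.

```shell
out=$(printf '%s\n' 'A1 2 123 f f j j k k' 'A2 10 789 f o p f m n' | awk '
{
    j = 4                        # next output column to fill
    for (i = 4; i < NF; i += 2)  # walk the input columns two at a time
        $(j++) = $i $(i + 1)     # glue the pair together into column j
    NF = j - 1                   # drop the leftover columns
    print
}')
printf '%s\n' "$out"
```

Because j never runs ahead of i, each pair is read before its columns are overwritten.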

Comparing fields of two files in awk

I want to compare two fields of two files, such as follows:
Compare the 2nd field of file one with the 1st field of file two, print the match (even if the match is repeated) and all the columns of file one and two.
File 1:
G4 b45 3 4
G4 b45 1 3
G3 b23 2 2
G3 b22 2 6
G3 b22 2 4
File 2:
b45 a b c
b64 d e f
b23 g h i
b22 j k l
b20 m n o
Output:
G4 b45 a b c 3 4
G4 b45 a b c 1 3
G3 b23 g h i 2 2
G3 b22 j k l 2 6
G3 b22 j k l 2 4
I have tried this with the following awk command using associative arrays:
awk 'FNR==NR {array1[$2] = $1 ; arrayrest[$2] = substr($0, index($0, $2)); next}($1 in array1) {print array1[$1] "\t" $0 "\t" arrayrest[$1]}' file1 file2
But there are two problems:
It does not print the lines if the match is repeated while I want them to be printed.
It repeats the first field of file two in the output.
How could I make this awk command work nicely? Thanks in advance.
Not quite the exact output formatting you want but the right output contents.
awk 'FNR==NR{seen[$1]=$0; next} ($2 in seen) {$2=seen[$2]}7' file2 file1
Add | column -t to get more consistent column spacing.
This should be simple and clear to you:
awk 'NR==FNR {n[$2]=$0} {if ($1 in n) print n[$1],$2,$3,$4}' file1 file2
small awk
awk '{x[$1]=$0}$2=x[$2]' f2 f1
If $1 and $2 can contain the same value
awk '{x[$1]=$0}FNR!=NR&&$2=x[$2]' f2 f1
output
G4 b45 a b c 3 4
G4 b45 a b c 1 3
G3 b23 g h i 2 2
G3 b22 j k l 2 6
G3 b22 j k l 2 4
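Here is a self-contained sketch of the lookup approach used in the answers above, with the sample data written to a scratch directory so it can be run as is. The names rest and out and the temporary directory are illustrative. Using $2 in rest as the test avoids creating empty array entries for keys that do not exist in file2.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'G4 b45 3 4' 'G4 b45 1 3' 'G3 b23 2 2' 'G3 b22 2 6' 'G3 b22 2 4' > "$tmp/file1"
printf '%s\n' 'b45 a b c' 'b64 d e f' 'b23 g h i' 'b22 j k l' 'b20 m n o' > "$tmp/file2"

out=$(awk '
    # file2 is read first: map its 1st field to the whole line.
    FNR == NR { rest[$1] = $0; next }
    # file1: replace the key in field 2 with the matching file2 line.
    $2 in rest { $2 = rest[$2]; print }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$out"
rm -rf "$tmp"
```

Every file1 line is looked up independently, so repeated keys such as b45 and b22 are printed each time they occur.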