Read columns from several files and print them in individual columns - awk

I have several text files, each of which contains several columns of numbers, e.g.:
5 10 6
6 20 1
7 30 4
8 40 3
9 23 1
4 13 6
I want to collect the second column of every file and put each one in its own column. I used this code; it works, but it prints all the second columns in a single column.
awk '{print $2}' > outfile
How can I print each column in an individual one?

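$ awk '{a[FNR]=(FNR in a)?a[FNR] OFS $2:$2; if(FNR>n) n=FNR}
END {for(i=1;i<=n;i++) print a[i]}' file1 file2 ... > outfile
This assumes all files have the same number of lines; otherwise the alignment will be off. The loop runs to n, the largest FNR seen, rather than to NR, because NR counts the lines of all files combined.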
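If the files can differ in length, one option is to index the values by both line number and file number and fill the gaps with a placeholder. This is only a sketch: the sample files, the temporary directory, and the "-" placeholder are assumptions made for illustration, and it expects every input file to be non-empty.

```shell
# Hypothetical sample files of different lengths.
tmp=$(mktemp -d)
printf '5 10 6\n6 20 1\n7 30 4\n' > "$tmp/file1"
printf '1 11 2\n2 22 3\n'         > "$tmp/file2"

result=$(awk '
    FNR == 1 { nfiles++ }                  # a new input file starts
    {
        col[FNR, nfiles] = $2              # column 2, keyed by (line, file)
        if (FNR > maxline) maxline = FNR   # longest file seen so far
    }
    END {
        for (i = 1; i <= maxline; i++) {
            line = ""
            for (f = 1; f <= nfiles; f++) {
                v = ((i, f) in col) ? col[i, f] : "-"   # "-" marks a missing line
                line = line (f > 1 ? OFS : "") v
            }
            print line
        }
    }' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Here the shorter file gets a "-" on the third line instead of shifting the columns.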

Related

How do I print starting from a certain row of output with awk? [duplicate]

I have millions of records in my file. What I need to do is print columns 1396 to 1400 for a specific number of rows, ideally so that I can open the result in Excel or Notepad.
I tried this command:
awk '{print $1396,$1397,$1398,$1399,$1400}' file_name
But this runs for every row.
You need a condition to specify which rows to apply the action to:
awk '<<condition goes here>> {print $1396,$1397,$1398,$1399,$1400}' file_name
For example, to do this only for rows 50 to 100:
awk 'NR >= 50 && NR <= 100 {print $1396,$1397,$1398,$1399,$1400}' file_name
(Depending on what you want to do, you can also have much more complicated selection patterns than this.)
Here's a simpler example for testing:
awk 'NR >= 3 && NR <= 5 {print $2, $3}'
If I run this on an input file containing
1 2 3 4
2 3 4 5
3 a b 6
4 c d 7
5 e f 8
6 7 8 9
I get the output
a b
c d
e f
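With millions of records it may also be worth stopping as soon as the last wanted row has been printed, so that awk does not read the rest of the file. A small sketch on the same test data; the variable names first and last are my own choice:

```shell
result=$(printf '1 2 3 4\n2 3 4 5\n3 a b 6\n4 c d 7\n5 e f 8\n6 7 8 9\n' |
    awk -v first=3 -v last=5 '
        NR > last   { exit }           # past the range: stop reading input
        NR >= first { print $2, $3 }   # inside the range: print the wanted columns
    ')
printf '%s\n' "$result"
```

Passing the bounds with -v keeps the program text unchanged when the range changes.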

Find the ratio among columns

I have some input files of the following format:
File1.txt    File2.txt    File3.txt
1 2          1 6          1 20
2 3          2 9          2 21
3 7          3 14         3 28
Now I need to output a single new file using AWK with three columns. The first column stays the same; it is identical in all three files (just an ordinal number).
However, for the 2nd and 3rd columns of this new file, I need the values of the 2nd column of the second file divided by the values of the 2nd column of the first file, and likewise the values of the 2nd column of the third file divided by the values of the 2nd column of the first file. In other words, the 2nd columns of the 2nd and 3rd files are each divided by the 2nd column of the first file.
e.g.:
Result.txt
1 3 10
2 3 7
3 2 4
Use a multidimensional array to store the values (ARGIND is a GNU awk extension that holds the index of the file currently being read):
awk 'FNR==NR {a[$1]=$2; next}
{b[$1,ARGIND]=$2/a[$1]}
END {for (i in a)
print i,b[i,2],b[i,3]
}' f1 f2 f3
Test
$ awk 'FNR==NR {a[$1]=$2; next} {b[$1,ARGIND]=$2/a[$1]} END {for (i in a) print i,b[i,2],b[i,3]}' f1 f2 f3
1 3 10
2 3 7
3 2 4
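ARGIND is only available in GNU awk. In other awks, a counter that is bumped whenever FNR is 1 can play the same role, and remembering the keys in input order keeps the output order predictable. A sketch with the sample data, assuming no input file is empty and every key in the later files also appears in the first one with a non-zero value (the variable names are mine):

```shell
tmp=$(mktemp -d)
printf '1 2\n2 3\n3 7\n'    > "$tmp/f1"
printf '1 6\n2 9\n3 14\n'   > "$tmp/f2"
printf '1 20\n2 21\n3 28\n' > "$tmp/f3"

result=$(awk '
    FNR == 1  { fnum++ }                               # count the input files
    fnum == 1 { base[$1] = $2; key[++n] = $1; next }   # first file: denominators
              { ratio[$1, fnum] = $2 / base[$1] }      # later files: store ratios
    END {
        for (i = 1; i <= n; i++)                       # keys in first-file order
            print key[i], ratio[key[i], 2], ratio[key[i], 3]
    }' "$tmp/f1" "$tmp/f2" "$tmp/f3")
printf '%s\n' "$result"
rm -rf "$tmp"
```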

Concatenate files based off unique titles in their first column

I have many files that are of two column format with a label in the first column and a number in the second column. The number is positive (never zero):
AGS 3
KET 45
WEGWET 12
FEW 56
Within each file, the labels are not repeated.
I would like to concatenate these many files into one file with many+1 columns, such that the first column contains the unique set of all labels across all files, and the remaining columns contain the number for each label in each file. If a label does not exist in a certain file (and hence there is no number for it), I would like it to default to zero. For instance, if the second file contains this:
AGS 5
KET 14
KJV 2
FEW 3
then the final output would look like:
AGS 3 5
KET 45 14
WEGWET 12 0
KJV 0 2
FEW 56 3
I am new to Linux, and have been playing around with sed and awk, but realize this probably requires multiple steps...
*Edit note: I had to change it from just 2 files to many files. Even though my example only shows 2 files, I would like to do this in case of >2 files as well. Thank you...
Here is one way using awk:
awk '
NR==FNR {a[$1]=$0;next}
{
print (($1 in a)?a[$1] FS $2: $1 FS "0" FS $2)
delete a[$1]
}
END{
for (x in a) print a[x],"0"
}' file1 file2 | column -t
AGS 3 5
KET 45 14
KJV 0 2
FEW 56 3
WEGWET 12 0
Read file1 into an array indexed by column 1, with the entire line as its value.
For file2, check whether column 1 is present in the array. If it is, print the line from file1 followed by the value from file2. If it is not, print 0 as the value for file1.
Delete each array element as we go along, so that only the labels unique to file1 remain.
In the END block, print what was unique to file1, with 0 as the value for file2.
Pipe the output to column -t for pretty formatting.
Assuming that your data are in files named file1 and file2:
$ awk 'FNR==NR {a[$1]=$2; b[$1]=0; next} {a[$1]+=0; b[$1]=$2} END{for (x in b) {printf "%-15s%3s%3s\n",x,a[x],b[x]}}' file1 file2
KJV 0 2
WEGWET 12 0
KET 45 14
AGS 3 5
FEW 56 3
To understand the above, we have to understand an awk trick.
In awk, NR is the number of records (lines) that have been processed so far and FNR is the number of records that have been processed in the current file. Consequently, the condition FNR==NR is true only while we are processing the first file. In that case, the associative array a gets all the values from the first file and the associative array b gets placeholder values, i.e. zeros. When we process the second file, its values go into array b and we make sure that array a has at least a placeholder value of zero. When we are done with the second file, the data is printed.
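To see the two counters side by side, it can help to print them for a couple of tiny files. This is just an illustration with made-up two-line files:

```shell
tmp=$(mktemp -d)
printf 'AGS 3\nKET 45\n' > "$tmp/file1"
printf 'AGS 5\nKJV 2\n'  > "$tmp/file2"

# NR keeps counting across files; FNR restarts at 1 for each file.
result=$(awk '{ print NR, FNR, (NR == FNR ? "first file" : "later file") }' \
    "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

One caveat with this trick: if the first file is empty, NR==FNR is also true while reading the second file.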
More than two files using GNU Awk
I created a file3:
$ cat file3
AGS 3
KET 45
WEGWET 12
FEW 56
AGS 17
ABC 100
The awk program extended to work with any number of files is:
$ awk 'FNR==1 {n+=1} {a[$1][n]=$2} END{for (x in a) {printf "%-15s",x; for (i=1;i<=n;i++) {printf "%5s",a[x][i]};print ""}}' file1 file2 file3
KJV                     2
ABC                          100
WEGWET            12         12
KET               45   14   45
AGS                3    5   17
FEW               56    3   56
This code creates a file counter. We know that we are in a new file every time FNR is 1, so a counter, n, is incremented. For every line we encounter, we put the data in a 2-D array. The first dimension of a is the label and the second is the number of the file in which we encountered it. In the end, we just loop over all the labels and all the files, from 1 to n, and print the data.
More than 2 files without GNU Awk
Without requiring GNU's awk, we can solve the problem using simulated two-dimensional arrays:
$ awk 'FNR==1 {n+=1} {b[$1]=1; a[$1,":",n]=$2} END{for (x in b) {printf "%-15s",x; for (i=1;i<=n;i++) {q=a[x,":",i]+0; printf "%5s",q};print ""}}' file1 file2 file3
KJV 0 2 0
ABC 0 0 100
WEGWET 12 0 12
KET 45 14 45
AGS 3 5 17
FEW 56 3 56
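The for (x in b) loops above visit the labels in an unspecified order. If the labels should come out in the order they were first seen, one possibility is to record that order in a second array. A sketch using the two sample files from the question; the array names order, seen, and val are my own:

```shell
tmp=$(mktemp -d)
printf 'AGS 3\nKET 45\nWEGWET 12\nFEW 56\n' > "$tmp/file1"
printf 'AGS 5\nKET 14\nKJV 2\nFEW 3\n'      > "$tmp/file2"

result=$(awk '
    FNR == 1      { n++ }                               # file counter
    !($1 in seen) { seen[$1] = 1; order[++m] = $1 }     # remember first appearance
                  { val[$1, n] = $2 }                   # value by (label, file)
    END {
        for (i = 1; i <= m; i++) {
            k = order[i]
            line = k
            for (f = 1; f <= n; f++)                    # 0 when the label is absent
                line = line OFS (((k, f) in val) ? val[k, f] : 0)
            print line
        }
    }' "$tmp/file1" "$tmp/file2")
printf '%s\n' "$result"
rm -rf "$tmp"
```

It works with any number of non-empty files and does not need GNU awk.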

Lookup and Replace with two files in awk

I am trying to correct one file with another with a single line of AWK code. I am trying to take $1 from FILE2, look it up in FILE1, and get the corresponding $2 and $4. After I set them as variables, I want the program to stop evaluating FILE1, change $10 and $11 in FILE2 to the values of the variables, and print the result.
I am having trouble getting awk to switch from FILE1 to FILE2 after I have extracted the variables. I've tried nextfile, but this resets the program and it tries to extract variables from FILE2. I also set NR to the last record, but it did not switch.
I am also using a loop to get each line out of FILE1, but if that can be part of the script, I am sure it would speed things up by not having to reopen awk over and over again.
Here are the parts I have figured out.
for file in `cut -f 1 FILE2`; do
awk -v a=$file '$1=a{s=$2;q=$4; ---GO TO FILE1---}{if ($1==a) {$10=s; $11=q; print 0;exit}' FILE1 FILE2 >> FILEOUT
done
A quick example set. NOTE: Despite how I have written this, the two files are not in the same order and are on the order of 8GB in size, so they are a little unwieldy to sort.
FILE1
A 12345 + AJD$JD
B 12504 + DKFJ#%
C 52042 + DSJTJE
FILE2
A 2 3 4 5 6 7 8 9 345 D$J
B 2 3 4 5 6 7 8 9 250 KFJ
C 2 3 4 5 6 7 8 9 204 SJT
OUTFILE
A 2 3 4 5 6 7 8 9 12345 AJD$JD
B 2 3 4 5 6 7 8 9 12504 DKFJ#%
C 2 3 4 5 6 7 8 9 52042 DSJTJE
This is the code I got to work based on Kent's answer below.
awk 'NR==FNR{a[$1]=$2" "$4;next}$1 in a{$9=$9" "a[$1]}{$10="";$11=""}2' f1 f2
Try this one-liner:
kent$ awk 'NR==FNR{a[$1]=$2" "$4;next}$1 in a{NF-=2;$0=$0" "a[$1]}7' f1 f2
A 2 3 4 5 6 7 8 9 12345 AJD$JD
B 2 3 4 5 6 7 8 9 12504 DKFJ#%
C 2 3 4 5 6 7 8 9 52042 DSJTJE
No need to loop over the files repeatedly - just read one file and store the relevant fields in arrays keyed on $1, then go through the other file and use those arrays to look up the values you want to insert.
awk '(FILENAME=="FILE1"){y[$1]=$2;z[$1]=$4}; (FILENAME=="FILE2" && $1 in y){$10=y[$1];$11=z[$1];print $0}' FILE1 FILE2
That said, it sounds like you might have a use for the join command here rather than messing about with awk (the above script assumes all your $1/$2/$4 values will fit in memory).
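For what it's worth, a join-based version might look roughly like the following. It is only a sketch on the small example set: join needs both inputs sorted on the key, so the files are sorted first (sort usually works through temporary files rather than keeping everything in memory, but for 8GB inputs that still takes time and disk space). The temporary directory and the .sorted file names are assumptions of mine.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'A 12345 + AJD$JD' 'B 12504 + DKFJ#%' 'C 52042 + DSJTJE' > "$tmp/FILE1"
printf '%s\n' 'A 2 3 4 5 6 7 8 9 345 D$J' \
              'B 2 3 4 5 6 7 8 9 250 KFJ' \
              'C 2 3 4 5 6 7 8 9 204 SJT' > "$tmp/FILE2"

# join needs both inputs sorted on the join field, using the same collation.
LC_ALL=C sort -k1,1 "$tmp/FILE1" > "$tmp/FILE1.sorted"
LC_ALL=C sort -k1,1 "$tmp/FILE2" > "$tmp/FILE2.sorted"

# Fields 1-9 come from FILE2 (input 1); fields 2 and 4 come from FILE1 (input 2).
result=$(LC_ALL=C join -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8,1.9,2.2,2.4 \
    "$tmp/FILE2.sorted" "$tmp/FILE1.sorted")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Lines of FILE2 whose key has no match in FILE1 are dropped by default, which mirrors the $1 in y test in the awk version.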

How to Add Column with Percentage

I would like to calculate percentage of value in each line out of all lines and add it as another column.
Input (delimiter is \t):
1 10
2 10
3 20
4 40
Desired output with added third column showing calculated percentage based on values in second column:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
I have tried to do it myself, but when I calculated total for all lines I didn't know how to preserve rest of line unchanged. Thanks a lot for help!
Here you go, a single awk command that reads the file twice -
awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Update: If a tab is required in the output, just set the OFS variable to "\t".
[jaypal:~/Temp] awk -v OFS="\t" 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Breakout of pattern {action} statements:
The first pattern is NR==FNR. FNR is awk's built-in variable that counts the records (by default separated by a newline) read from the current file, so it runs from 1 to 4 on each pass over our file. NR is similar, but it is not reset when a new file starts; it keeps growing, so it reaches 8 in our case.
This pattern is true only for the first 4 records, and that's exactly what we want. While going through those 4 records, we accumulate the total in the variable a. Notice that we did not initialize it; in awk we don't have to. However, this would break if the entire column 2 is 0. You can handle that by putting an if statement in the second action, i.e. do the division only if a > 0, and otherwise report a division by zero or something similar.
next is needed because we don't want the second pattern {action} statement to execute during the first pass. next tells awk to skip the remaining actions and move on to the next record.
Once the first four records are parsed, the second pattern {action} takes over, which is pretty straightforward: it computes the percentage and prints columns 1 and 2 with the percentage next to them.
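The zero-total guard mentioned above could be sketched like this. The "NA" marker and the fixed two-decimal format are arbitrary choices of mine, not part of the original one-liner:

```shell
tmp=$(mktemp -d)
printf '1\t10\n2\t10\n3\t20\n4\t40\n' > "$tmp/file"

result=$(awk '
    NR == FNR { total += $2; next }    # first pass: sum column 2
    {                                  # second pass: print with percentage
        if (total > 0) printf "%s\t%s\t%.2f\n", $1, $2, $2 / total * 100
        else           printf "%s\t%s\tNA\n", $1, $2   # avoid dividing by zero
    }' "$tmp/file" "$tmp/file")
printf '%s\n' "$result"
rm -rf "$tmp"
```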
Note: As @lhf mentioned in the comments, this one-liner only works as long as the data set is in a file. It won't work if you pass the data through a pipe.
In the comments there is a discussion about ways to make this awk one-liner take input from a pipe instead of a file. The only way I could think of was to store the column values in an array and then use a for loop to print each value along with its percentage.
Now, arrays in awk are associative, and a for (i in b) loop visits them in no particular order, i.e. the values will not necessarily come out in the order they went in. If that is OK, then the following one-liner should work.
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}'
2 10 12.5
3 20 25
4 40 50
1 10 12.5
To get them in order, you can pipe the result to sort.
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}' | sort -n
1 10 12.5
2 10 12.5
3 20 25
4 40 50
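Another way to keep the input order when reading from a pipe, without the extra sort, is to index the stored lines by NR and loop from 1 to NR in the END block. A sketch under the assumption that the whole input fits in memory (the array names are mine):

```shell
result=$(printf '1\t10\n2\t10\n3\t20\n4\t40\n' | awk '
    { line[NR] = $0; val[NR] = $2; sum += $2 }   # keep every line and its value
    END {
        for (i = 1; i <= NR; i++)                # replay in input order
            printf "%s\t%.2f\n", line[i], (sum ? val[i] / sum * 100 : 0)
    }')
printf '%s\n' "$result"
```

This also leaves the original line untouched, because $0 is printed as it was read.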
You can do it in a couple of passes:
#!/bin/bash
total=$(awk '{total=total+$2}END{print total}' file)
awk -v total=$total '{ printf ("%s\t%s\t%.2f\n", $1, $2, ($2/total)*100)}' file
To print a literal percent sign with printf, you need to escape it as %%. For instance:
printf("%s\t%s\t%s%%\n", $1, $2, $3)
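A quick way to check the escaping on a single made-up line (the %.2f format is my own addition here):

```shell
# One tab-separated input line whose third field is already a percentage.
result=$(printf '1\t10\t12.5\n' | awk '{ printf "%s\t%s\t%.2f%%\n", $1, $2, $3 }')
printf '%s\n' "$result"
```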
Perhaps there is a better way, but I would pass the file twice.
Content of 'infile':
1 10
2 10
3 20
4 40
Content of 'script.awk':
BEGIN {
## Tab as field separator.
FS = "\t";
}
## First pass of input file. Get total from second field.
ARGIND == 1 {
total += $2;
next;
}
## Second pass of input file. Print each original line and percentage as third field.
{
printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Running the script on my Linux box:
gawk -f script.awk infile infile
And result:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00