Sorting a column based on the value of another - awk

I have a file with the following structure:
Input
1 30923 2 300 G:0.503333 T:0.496667 T
1 51476 2 300 T:0.986667 C:0.0133333 C
1 51479 2 300 T:0.966667 A:0.0333333 T
What I would like to do is swap the fifth and sixth columns, where necessary, so that the allele in the sixth column matches the seventh column. You can see it in the example: the seventh column reads T, C, T, and after the change the sixth column has gone from T, C, A to T, C, T. That is, in the third line the fifth and sixth columns have been swapped to agree with the seventh column.
Output
1 30923 2 300 G:0.503333 T:0.496667 T
1 51476 2 300 T:0.986667 C:0.0133333 C
1 51479 2 300 A:0.0333333 T:0.966667 T
I hope I have explained this clearly. I have not been able to find a solution; could you please give me a hint on how to do this?
Thank you in advance.

Splitting on both spaces and colons, with the output piped through column -t so all columns are justified:
awk -F'[ :]*' '{if($7 == $9 ) print $1,$2,$3,$4,$5,$6,$7,$8,$9; else print $1,$2,$3,$4,$7,$8,$5,$6,$9}' input.txt|column -t
Output:
1 30923 2 300 G 0.503333 T 0.496667 T
1 51476 2 300 T 0.986667 C 0.0133333 C
1 51479 2 300 A 0.0333333 T 0.966667 T

If I understand correctly, maybe this will work for you?
: file a.awk
substr($6,1,1) == $7 { print }
substr($6,1,1) != $7 { print $1, $2, $3, $4, $6, $5, $7 }
: file a.txt
1 30923 2 300 G:0.503333 T:0.496667 T
1 51476 2 300 T:0.986667 C:0.0133333 C
1 51479 2 300 T:0.966667 A:0.0333333 T
bash-3.2$ awk -f a.awk a.txt
1 30923 2 300 G:0.503333 T:0.496667 T
1 51476 2 300 T:0.986667 C:0.0133333 C
1 51479 2 300 A:0.0333333 T:0.966667 T
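As a variant, here is a sketch of the same swap done in place, keeping the colon-joined fields intact: compare the first character of field 6 with field 7 and swap fields 5 and 6 when they differ. The data is the sample from the question; the file name input.txt is just for this demo.

```shell
# Sample data from the question
cat > input.txt <<'EOF'
1 30923 2 300 G:0.503333 T:0.496667 T
1 51476 2 300 T:0.986667 C:0.0133333 C
1 51479 2 300 T:0.966667 A:0.0333333 T
EOF

# Swap $5 and $6 whenever the allele letter in $6 does not match $7;
# the trailing 1 prints every record (modified or not).
awk 'substr($6, 1, 1) != $7 { t = $5; $5 = $6; $6 = t } 1' input.txt
```

Note that assigning to $5/$6 makes awk rebuild the record with OFS (a single space), which here matches the input format exactly.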

Related

Add a new column between the first and second columns with values that are numbers from 1 to 5, repeated as needed to reach total line number

I have a txt file with 2 columns and would like to add a new column between the two, with values ranging from 1 to 5, repeated as many times as needed to cover all rows. I'm trying to use awk, but I'm open to other suggestions.
Example Input
A 100
A 200
A 300
A 400
A 500
B 1000
B 2000
B 3000
B 4000
B 5000
Example output
A 1 100
A 2 200
A 3 300
A 4 400
A 5 500
B 1 1000
B 2 2000
B 3 3000
B 4 4000
B 5 5000
Right now I'm trying
awk 'BEGIN{FS=OFS="\t"}{for (i = 1; i <= 5; ++i) $2 =++i OFS $2}1' $my_data
But clearly it is not working.
A simpler awk:
awk '{print $1, ++cnt[$1], $2}' file
A 1 100
A 2 200
A 3 300
A 4 400
A 5 500
B 1 1000
B 2 2000
B 3 3000
B 4 4000
B 5 5000
With modulo-operator %:
awk '{print $1, (NR-1)%5+1, $2}' file
Output:
A 1 100
A 2 200
A 3 300
A 4 400
A 5 500
B 1 1000
B 2 2000
B 3 3000
B 4 4000
B 5 5000
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
awk '{print $1, (a[$1]+=1), $2}' file
(the parentheses keep the += assignment from being misparsed inside the print list in some awk implementations)
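A runnable sketch of the per-key counter idea: cnt[$1] starts at zero for each new column-1 value, so ++cnt[$1] numbers the rows within each group, and unlike the modulo variant it does not require groups of exactly five rows. The file name groups.txt is just for this demo.

```shell
# Demo data: two groups of different sizes
cat > groups.txt <<'EOF'
A 100
A 200
B 1000
B 2000
B 3000
EOF

# ++cnt[$1] increments a separate counter per column-1 value
awk '{print $1, ++cnt[$1], $2}' groups.txt
```

This prints A 1 100, A 2 200, then B 1 1000, B 2 2000, B 3 3000, restarting the count at each new group.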

How do I print starting from a certain row of output with awk? [duplicate]

I have millions of records in my file. What I need to do is print columns 1396 to 1400 for a specific number of rows, and if possible get the result into Excel or Notepad.
Tried with this command
awk '{print $1396,$1397,$1398,$1399,$1400}' file_name
But this is running for each row.
You need a condition to specify which rows to apply the action to:
awk '<<condition goes here>> {print $1396,$1397,$1398,$1399,$1400}' file_name
For example, to do this only for rows 50 to 100:
awk 'NR >= 50 && NR <= 100 {print $1396,$1397,$1398,$1399,$1400}' file_name
(Depending on what you want to do, you can also have much more complicated selection patterns than this.)
Here's a simpler example for testing:
awk 'NR >= 3 && NR <= 5 {print $2, $3}'
If I run this on an input file containing
1 2 3 4
2 3 4 5
3 a b 6
4 c d 7
5 e f 8
6 7 8 9
I get the output
a b
c d
e f
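If the target fields are consecutive, a loop avoids listing each `$` by hand. Here is a minimal sketch on the small test data above (fields 2-3, rows 3-5); for the real file you would loop i from 1396 to 1400, and setting OFS="," would produce a CSV that Excel can open.

```shell
# Same test data as above
cat > rows.txt <<'EOF'
1 2 3 4
2 3 4 5
3 a b 6
4 c d 7
5 e f 8
6 7 8 9
EOF

# Print fields 2..3 of rows 3..5; OFS between fields, ORS after the last
awk 'NR >= 3 && NR <= 5 { for (i = 2; i <= 3; i++) printf "%s%s", $i, (i < 3 ? OFS : ORS) }' rows.txt
```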

Column manipulating using Bash & Awk

Let's assume we have an example1.txt file consisting of a few rows.
item item item
A B C
100 20 2
100 22 3
100 23 4
101 26 2
102 28 2
103 29 3
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
104 36 3
There are a few commands I would like to perform to filter the txt file and add a few more columns.
First, I want to apply a condition: keep rows where item C is equal to 2. Using an awk command I can do that in the following way:
awk '$3 == 2 { print $1 "\t" $2 "\t" $3} ' example1.txt > example2.txt
The returned text file would be:
item item item
A B C
100 20 2
101 26 2
102 28 2
103 30 2
103 32 2
104 33 2
104 34 2
104 35 2
Now I want to count two things:
I want to count the total number of unique values in column 1.
For example, in the above case example2.txt, it would be:
(100,101,102,103,104) = 5
And I would like to add, as a new column, the number of times each column A value repeats.
I would like to have something like this:
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
In the Item D (4th) column above, the first row is 1 because 100 is not repeated, but in the 4th row it is 2 because 103 appears twice; therefore I have put 2 in the 4th and 5th rows. Similarly, the last three rows have 3 in column D because the column A value 104 appears three times.
You may try this awk:
awk -v OFS='\t' 'NR <= 2 {
print $0, (NR == 1 ? "item" : "D")
}
FNR == NR && $3 == 2 {
++freq[$1]
next
}
$3 == 2 {
print $0, freq[$1]
}' file{,}
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Could you please try the following. In case you want to save the output into the same Input_file, append > temp && mv temp Input_file to the following code.
awk '
FNR==NR{
if($3==2){
a[$1,$3]++
}
next
}
FNR==1{
$(NF+1)="item"
print
next
}
FNR==2{
$(NF+1)="D"
print
next
}
$3!=2{
next
}
FNR>2{
$(NF+1)=a[$1,$3]
}
1
' Input_file Input_file | column -t
Output will be as follows.
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when 1st time Input_file is being read.
if($3==2){ ##Checking condition if 3rd field is 2 then do following.
a[$1,$3]++ ##Creating an array a whose index is $1,$3 and keep adding its index with 1 here.
}
next ##next will skip further statements from here.
}
FNR==1{ ##Checking condition if this is first line.
$(NF+1)="item" ##Adding a new field with string item in it.
print ##Printing 1st line here.
next ##next will skip further statements from here.
}
FNR==2{ ##Checking condition if this is second line.
$(NF+1)="D" ##Adding a new field with string D in it.
print ##Printing 2nd line here.
next ##next will skip further statements from here.
}
$3!=2{ ##Checking condition if 3rd field is NOT equal to 2 then do following.
next ##next will skip further statements from here.
}
FNR>2{ ##Checking condition if line is greater than 2 then do following.
$(NF+1)=a[$1,$3] ##Creating new field with value of array a with index of $1,$3 here.
}
1 ##1 will print edited/non-edited lines here.
' Input_file Input_file ##Mentioning Input_file names 2 times here.
Similar to the others, but using awk in a single pass, storing the records seen and the per-value counts for D in arrays; the arrays ord and Dcnt map the output order back to that information, e.g.
awk '
FNR == 1 { h1=$0"\titem" } # header 1 with extra "\titem"
FNR == 2 { h2=$0"\tD" } # header 2 with extra "\tD"
FNR > 2 && $3 == 2 { # remaining rows with $3 == 2
D[$1]++ # count of times the A value is seen, for the D column
seen[$1,$2] = $0 # save records seen
ord[++n] = $1 SUBSEP $2 # save order all records appear
Dcnt[n] = $1 # save order mapped to $1 for D
}
END {
printf "%s\n%s\n", h1, h2 # output headers
for (i=1; i<=n; i++) # loop outputing info with D column added
print seen[ord[i]]"\t"D[Dcnt[i]]
}
' example.txt
(note: SUBSEP is a built-in variable holding the subscript separator used when a comma joins expressions in an array index, e.g. seen[$1,$2]; it lets you rebuild the same index for comparison outside an array reference. By default it is "\034".)
Example Output
item item item item
A B C D
100 20 2 1
101 26 2 1
102 28 2 1
103 30 2 2
103 32 2 2
104 33 2 3
104 34 2 3
104 35 2 3
Always more than one way to skin-the-cat with awk.
Assuming the file is not a big file:
awk 'NR==FNR && $3 == 2{a[$1]++;next}$3==2{$4=a[$1];print;}' file.txt file.txt
You parse the file twice. In the first pass, you calculate the 4th column and store it in an array. In the second pass, you set the count as the 4th column and print the whole line.
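Here is a self-contained run of that two-pass one-liner, on a shortened copy of the data (the header rows are omitted here; they fail the $3 == 2 test and would be dropped anyway):

```shell
# Shortened sample: column 1 = item A, column 3 = item C
cat > file.txt <<'EOF'
100 20 2
100 22 3
101 26 2
103 29 3
103 30 2
103 32 2
EOF

# Pass 1 (NR==FNR): count rows per item-A value where C == 2.
# Pass 2: append that count as column 4 and print.
awk 'NR==FNR && $3 == 2 {a[$1]++; next} $3==2 {$4=a[$1]; print}' file.txt file.txt
```

100 gets D = 1 and each 103 row gets D = 2, exactly as in the question's expected output.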

Replace nth and (n+1)th values in one file with same values from another file

I have two files:
f1.txt:
header 1
header 2
100
100
100
100
100
100
100
100
100
100
100
100
100
f2.txt:
header 1
header 2
10
1234
5678
10
10
2345
6789
10
10
3456
7890
10
10
desired output
f3.txt:
header 1
header 2
100
1234
5678
100
100
2345
6789
100
100
3456
7890
100
100
The values in f2.txt that occur in lines 4 & 5, then 8 & 9, then 12 & 13 (i.e., spaced every 6th row), I want to put into f1.txt to replace the corresponding rows there. How can I do this?
So far, I have only been able to print these values out of f2.txt, as follows:
exec<f2.txt
var=$(awk 'NR % 6 == 4')
echo "$var"
this produces
1234
2345
3456
Then when I change 4 to 5, it gives me the 2nd set of values. So I am trying to learn how to extract the 2 sets of values and then put them into f1.txt. Any help will be greatly appreciated. Thanks!
Try:
paste f1.txt f2.txt | awk -F'\t' '
NR < 3 || (NR-2)%4 == 1 || (NR-2)%4 == 0 {print $1; next}
{print $2}
'
Your desired output does not indicate groups of 6 lines, but instead groups of 4 lines. Perhaps the 2 header lines are throwing you off.
I'm assuming your input files do not contain tabs.
More concise awk from Ed Morton:
awk -F'\t' '{print (NR-2)%4 < 2 ? $1 : $2}'
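An end-to-end sketch of the paste + awk approach, on shortened copies of the two files (one 4-row group after the two headers). paste joins the files with a tab, so -F'\t' keeps the space inside "header 1" in field 1:

```shell
# Shortened versions of the question's files
cat > f1.txt <<'EOF'
header 1
header 2
100
100
100
100
EOF
cat > f2.txt <<'EOF'
header 1
header 2
10
1234
5678
10
EOF

# Rows where (NR-2)%4 is 0 or 1 (and the headers, where it is negative)
# come from f1; the two middle rows of each group come from f2.
paste f1.txt f2.txt | awk -F'\t' '{print (NR-2)%4 < 2 ? $1 : $2}'
```

This prints the two headers, then 100, 1234, 5678, 100, matching the desired f3.txt pattern.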

How to do a join using awk

Here is my Input file
Identifier Relation
A 1
A 2
A 3
B 2
B 3
C 1
C 2
C 3
I want to join this file to itself based on the "Relation" field.
Sample Output file
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
I used the following awk script:
awk 'NR==FNR {a[NR]=$0; next} { for (k in a) if (a[k]~$2) print a[k],$0}' input input > output
However, I had to add another awk step to delete lines that joined with themselves, i.e., A 1 A 1; B 2 B 2; etc.
The second issue with this file is it prints both directions of the join, thus
A 1 C 1 is printed along with C 1 A 1 on another line.
Both these lines display the same relation, and I would NOT like to include both. I want to see just one or the other, i.e., "A 1 C 1" or "C 1 A 1", not both.
Any suggestions/directions are highly appreciated.
An alternative solution using join and sort together with awk:
$ join -j 2 <(sort -k2 -k1,1 file){,} |
    awk '$2!=$3 && !($3 FS $2 in a){a[$2 FS $3]; print $2,$1,$3,$1}'
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Create the cross product, then eliminate the diagonal and one of each symmetric pair.
Here is an awk-only solution:
awk 'NR>1 { ar[$2] = ar[$2] $1 }
END {
  for (key in ar) {
    for (i=1; i<length(ar[key]); i++) {
      for (j=i+1; j<length(ar[key])+1; j++) {
        print substr(ar[key],i,1), key, substr(ar[key],j,1), key
      }
    }
  }
}' infile
Each number in the second column of the input serves as a key into an awk array. The value of the corresponding array element is the sequence of first-column letters (e.g., ar[1]="ABC").
Then we build all two-letter combinations from each sequence (e.g., "ABC" gives "AB", "AC" and "BC").
Output:
A 1 C 1
A 2 B 2
A 2 C 2
B 2 C 2
A 3 B 3
A 3 C 3
B 3 C 3
Note:
If a number occurs only once, no output is generated for this number.
The order of output depends on the order of input (no sorting of letters!). That is, if the second input line were C 1, then ar[1]="CAB" and the first output line would be C 1 A 1.
The first line of input is ignored due to NR>1.
There is surely a solution using awk alone, but I'm going to propose one using awk and sort, because it is quite simple and does not require storing the entire file content in awk variables. The idea is as follows:
rewrite the input file so that the "relation" field is first (A 1 -> 1 A)
use sort -n to put together all lines with same "relation"
use awk to combine consecutive lines having the same "relation"
That would translate to something like:
awk '{print $2 " " $1}' input | sort -n |
awk '{if ($1==lastsel)printf " "; else if(lastsel) printf "\n"; lastsel=$1; printf "%s %s", $2, $1;}END{if(lastsel)printf"\n"}'
A 1 C 1
A 2 B 2 C 2
A 3 B 3 C 3
EDIT: If you want only one i-j relation per line:
awk '{print $2 " " $1}' input | sort -n |
awk '$1!=rel{rel=$1;item=$2;next;} {printf "%s %s %s %s\n", item, rel, $2, $1;}'
A 1 C 1
A 2 B 2
A 2 C 2
A 3 B 3
A 3 C 3
Note the following limitations with this solution:
In case a given n has only one entry, nothing will be output (no output such as D 1)
All relations always have the lexicographically first item in the first column (e.g. A 1 C 1 but never B 1 C 1)
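A runnable sketch of the sort-then-pair variant above. Here pairs.txt stands in for the question's input file, with the header row omitted for simplicity; note that it shares the limitations just listed (no singleton output, first item always in the first column):

```shell
# Question's data, header row omitted
cat > pairs.txt <<'EOF'
A 1
A 2
A 3
B 2
B 3
C 1
C 2
C 3
EOF

# 1) put the relation first, 2) group equal relations with sort -n,
# 3) pair each later line with the first line of its relation group
awk '{print $2 " " $1}' pairs.txt | sort -n |
awk '$1 != rel { rel = $1; item = $2; next } { printf "%s %s %s %s\n", item, rel, $2, $1 }'
```

This yields one line per (first item, later item) pair: A 1 C 1, A 2 B 2, A 2 C 2, A 3 B 3, A 3 C 3.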