How to number the lines according to a field with awk?

I wonder whether there is a way using awk to number the lines according to a field. For example,
Input
2334 332
2334 546
2334 675
7890 222
7890 134
234 45
.
.
.
Based on the 1st field, I would have the following output
Output
1 2334 332
1 2334 546
1 2334 675
2 7890 222
2 7890 134
3 234 45
.
.
.
I would be grateful for your help.
Cheers,
T

Here's how:
awk '!a[$1]++{c++}{print c, $0}' file
1 2334 332
1 2334 546
1 2334 675
2 7890 222
2 7890 134
3 234 45

awk 'last != $1 { line = line + 1 } { last = $1; print line, $0 }'
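Both one-liners produce the numbering shown above, but they differ on unsorted input: the first numbers groups in order of first appearance of $1 (duplicate keys need not be adjacent), while the second increments its counter whenever $1 changes, so it assumes equal keys are grouped together. A commented long form of the first, as a sketch:

awk '
    !a[$1]++ { c++ }     # first time this $1 is seen: start a new group number
    { print c, $0 }      # prefix every line with the current group number
' file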


Conditional awk statement to create a new field with additive value

Question
How would I use awk to create a new field that has $2 plus a constant value?
I am planning to cycle through a list of values, but I wouldn't mind using a one-liner for each command.
PseudoCode
awk '$1 == Bob {$4 = $2 + 400}' file
Sample Data
Philip 13 2
Bob 152 8
Bob 4561 2
Bob 234 36
Bob 98 12
Rey 147 152
Rey 15 1547
Expected Output
Philip 13 2
Bob 152 8 408
Bob 4561 2 402
Bob 234 36 436
Bob 98 12 412
Rey 147 152
Rey 15 1547
Just quote Bob; unquoted, awk treats Bob as an uninitialized variable, so the comparison is against the empty string. Also, judging by your expected output, you want to add the third field, not the second:
$ awk '$1=="Bob" {$4=$3+400}1' file | column -t
Philip 13 2
Bob 152 8 408
Bob 4561 2 402
Bob 234 36 436
Bob 98 12 412
Rey 147 152
Rey 15 1547
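A quick way to see why the quoting matters, using the same sample file: with Bob unquoted, the comparison is against the empty string, so nothing matches.

$ awk '$1 == Bob' file    # prints nothing: no line has an empty first field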
Here, check whether $1 equals Bob and, if so, rebuild the record ($0) by appending FS ($2 + 400) to it; FS supplies the field separator between the 3rd and 4th fields. The 1 at the end tells awk to take the default action, which is print.
awk '$1=="Bob"{$0=$0 FS $2 + 400}1' file
Philip 13 2
Bob 152 8 552
Bob 4561 2 4961
Bob 234 36 634
Bob 98 12 498
Rey 147 152
Rey 15 1547
Or , if you want to keep name(Bob) as variable
awk -vname="Bob" '$1==name{$0=$0 FS $2 + 400}1' file
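For example, pointing the same command at another name from the sample data (output computed with this answer's $2 + 400 rule):

$ awk -v name="Rey" '$1==name{$0=$0 FS $2 + 400}1' file
Philip 13 2
Bob 152 8
Bob 4561 2
Bob 234 36
Bob 98 12
Rey 147 152 547
Rey 15 1547 415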
1st solution: Could you please try the following too. It uses awk's built-in NF variable: $NF denotes the value of the last column of the current line, and assigning to $(NF+1) creates an additional column whenever the condition (the 1st field being the string "Bob") is TRUE.
awk '{$(NF+1)=$1=="Bob"?400+$NF:""} 1' OFS="\t" Input_file
2nd solution: In case we don't want to create a new field and simply want to print the values per the condition, try the following; it should be faster, I believe. (Here $1=$1 forces awk to rebuild $0, so the whole output line is tab-separated.)
awk 'BEGIN{OFS="\t"}{$1=$1;print $0,$1=="Bob"?400+$NF:""}' Input_file
Output will be as follows.
Philip 13 2
Bob 152 8 408
Bob 4561 2 402
Bob 234 36 436
Bob 98 12 412
Rey 147 152
Rey 15 1547
Explanation: Adding an explanation for the above code now.
awk ' ##Starting the awk program here.
{
$(NF+1)=$1=="Bob"?400+$NF:"" ##Creating a new last field whose value depends on the condition check:
##if the 1st field is the string Bob, it is set to 400 plus the last field's value; otherwise it is left empty.
}
1 ##awk works on condition-action pairs; mentioning 1 makes the condition TRUE, and with no action defined the default action, printing the current line, is taken.
' OFS="\t" Input_file ##Setting OFS (the output field separator) to TAB and naming the Input_file here.
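One subtlety worth noting: assigning to $(NF+1) makes awk rebuild $0, so every separator in the output becomes the OFS tab, not just the one before the new column. A minimal demonstration:

$ echo 'a b c' | awk '{$(NF+1)="d"} 1' OFS="\t"
a	b	c	d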

Calculate exponential value of a record

I need to calculate and print the exponential of field $2 for each record, after multiplying it by a factor of -0.05.
The data looks like this:
101 205 560
101 200 530
107 160 480
110 95 600
I need the output to look like this:
101 205 560 0.000035
101 200 530 0.000045
107 160 480 0.00033
110 95 600 0.00865
This should work:
$ awk '{ print $0, sprintf("%f", exp($2 * -0.05)) }' infile
101 205 560 0.000035
101 200 530 0.000045
107 160 480 0.000335
110 95 600 0.008652
This just prints the whole line $0, followed by the exponential of the second field multiplied by -0.05. The sprintf formatting makes sure that the result is not printed in scientific notation (which would happen otherwise).
If the input data is tab separated and you need tabs in the output as well, you have to set the output field separator first:
$ awk 'BEGIN{OFS="\t"} { print $0, sprintf("%f", exp($2 * -0.05)) }' infile
101 205 560 0.000035
101 200 530 0.000045
107 160 480 0.000335
110 95 600 0.008652
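Equivalently, printf can replace the print/sprintf combination; this is purely a stylistic variant that produces the same numbers (use "%s\t%f\n" for tab-separated output):

$ awk '{ printf "%s %f\n", $0, exp($2 * -0.05) }' infile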

AWK, string match and replace

I have two files; the first looks like this,
SomeFile.CEL SomeOtherFile.CEL
probe1 111 666
probe2 222 777
probe3 333 888
probe4 444 999
probe5 555 100
probe6 101 102
and the second looks like this (note: the duplicate probe4, has two different gene names),
probe1 Gene1
probe2 Gene2
probe3 Gene3
probe4 Gene4A
probe4 Gene4B
probe5 Gene5
probe7 Gene6
What I need is an output file to look like this,
Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100
This ideal output file would contain all of the gene names which had matched probe names between the two files. Additionally, where multiple names exist for a single probe, I want the expression data (444 999) to be duplicated for all possible gene names (this example shows 2 gene names for a single probe, but it could be as many as 5 or 6.)
By the way all files are tab-separated.
I've searched through this and other forums, and while these articles came close,
Replace multiple arguments with gsub
awk print column $3 if $2==a specific value?
awk partly string match (if column partly matches)
Sed pattern to match and replace
Regex match and replace
they don't answer my full question.
So far, I have had the most success with this command,
awk -F"\t" 'FILENAME=="input1.file"{a[$1]=$1} FILENAME=="input2.file {if(a[$1]){$1="";print $0}}' input1.file input2.file
but it doesn't account for the necessary duplication. Lastly, there are some files which look like input1, but contain more than just the two samples I described (someFile.CEL and someOtherFile.CEL). There could be as many as 50 samples/CEL files. I figure I might have to build a script, but I thought I'd ask if there was a simpler way first.
$ awk 'NR==FNR{a[$1]=$2 FS $3; next} $1 in a{print $2, a[$1]}' file1 file2
Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100
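A commented long form of that one-liner, for readability; NR==FNR is the standard two-file idiom, true only while the first file is being read:

awk '
    NR == FNR { a[$1] = $2 FS $3; next }   # file1: cache the two value columns, keyed by probe
    $1 in a   { print $2, a[$1] }          # file2: print the gene name plus the cached values
' file1 file2

For inputs with more than two sample columns, one option (a sketch, assuming whitespace-separated fields) is to cache the whole remainder of the line in the first block instead: key = $1; sub(/^[^ \t]+[ \t]+/, ""); a[key] = $0.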
The join command (POSIX; the GNU implementation is in coreutils) was made for this exact situation, and it can be combined with awk.
This one-liner version works with any number of columns (FIELDS) in the first file:
join SomeFile.CEL SomeOtherFile.CEL | awk '{$1=$NF; $NF=""; print}'
By default the first FIELD of both files is used for the JOIN.
The two files must be sorted on the join fields.
A test with two additional sorts to ensure that the JOIN fields are sorted:
$ join <(sort SomeFile.CEL) <(sort SomeOtherFile.CEL) | awk '{$1=$NF; $NF=""; print}'
Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100
A second test with a first file that has more columns:
$ cat SomeFile_ManyColumns.CEL
probe1 111 666 666 111 777 888 999
probe2 222 777 111 666 999 888 777
probe3 333 888 101 102 999 888 111
probe4 444 999 876 543 321 678 101
probe5 555 100 101 543 321 666 999
probe6 101 102 888 321 543 101 678
$ join <(sort SomeFile_ManyColumns.CEL) <(sort SomeOtherFile.CEL) | awk '{$1=$NF; $NF=""; print}'
Gene1 111 666 666 111 777 888 999
Gene2 222 777 111 666 999 888 777
Gene3 333 888 101 102 999 888 111
Gene4A 444 999 876 543 321 678 101
Gene4B 444 999 876 543 321 678 101
Gene5 555 100 101 543 321 666 999
----
For the record, a solution for a fixed number of columns (FIELDS):
join -o 2.2,1.2,1.3 SomeFile.CEL SomeOtherFile.CEL
-o 2.2,1.2,1.3 specifies the output FORMAT: one or more comma- or blank-separated specifications, each of the form FILENUM.FIELD.
The test:
$ join -o 2.2,1.2,1.3 SomeFile.CEL SomeOtherFile.CEL
Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100
There is a not-so-well-known Unix tool for joining files on a (sorted) common column, called join. You can use it in your case like this:
join <(sort file2.txt) <(sort file1.txt) | cut -d' ' -f2-
the sorts are required for unsorted files
the cut is required to strip away the first column with the probe names
due to the sorting and cutting, the awk solution is probably faster
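A concrete run against the sample files, with file1.txt holding the probe/expression table and file2.txt the probe-to-gene map (assuming both sorts use the same locale); the header line, probe6, and probe7 drop out because join discards unpairable lines by default:

$ join <(sort file2.txt) <(sort file1.txt) | cut -d' ' -f2-
Gene1 111 666
Gene2 222 777
Gene3 333 888
Gene4A 444 999
Gene4B 444 999
Gene5 555 100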

Transpose a column to a line after detecting a pattern

I have this text file format:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 #line 2
37.31 624 #line 3
260 1 #line 4
321 624 #line 5
532 23 #line 6
12 644 #line 7
270 0.0 #line 8
3e-37 1046 #line 9
154 #line 10
I have to detect a line containing 8 columns (line 2), and transpose the second column of each of the following seven lines (lines 3-9) to the end of the 8-column line. And finally, exclude line 10. This pattern repeats throughout a large text file, but it is not frequent (about 30 times in a file of 2000 lines). Is it possible to do this using awk?
The edited text file must look like the following text:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046 #line 2
Thank you very much in advance.
awk 'NF == 12 { t = $0; for (i = 1; i <= 7; ++i) { r = getline; if (r < 1) break; t = t "\t" $2; } print t; next; } NF > 12' temp.txt
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046
Lines having more than 12 fields are printed automatically.
When a line with exactly 12 fields is detected, the second column of each of the following 7 lines is concatenated onto it, and the result is printed.
Any other line is ignored.
Edited to add only the second column of the lines with two columns.
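The same getline logic, expanded with comments (a long-form sketch of the one-liner above):

awk '
    NF == 12 {                      # a 12-field line starts a wrapped record
        t = $0
        for (i = 1; i <= 7; ++i) {  # pull in the next seven lines
            r = getline
            if (r < 1) break        # stop early at end of file
            t = t "\t" $2           # append each line's second column
        }
        print t                     # emit the reassembled record
        next
    }
    NF > 12                         # complete records print unchanged
' temp.txt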
I think this does what you want:
awk 'NF >= 8 { a[++i] = $0 } NF == 2 { a[i] = a[i] " " $2 } END { for (j = 1; j <= i; ++j) print a[j] }' file
For lines with at least 8 columns, a new element is added to the array a. If a line has exactly 2 columns, its second field is appended to the current array element. Once the whole file has been processed, the END block goes through the array and prints all of the assembled lines. Note that, unlike the getline answer above, this one buffers everything in memory until END, which may matter for very large files.
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046

How to append the count of numbers in each line of text using awk?

I have several very large text files and would like to prepend to each line the count of the numbers on it, followed by a space. Could anyone kindly suggest how to do this efficiently using awk?
Input:
10 109 120 200 1148 1210 1500 5201
9 139 1239 1439 6551
199 5693 5695
Desired Output:
8 10 109 120 200 1148 1210 1500 5201
5 9 139 1239 1439 6551
3 199 5693 5695
You can use
awk '{print NF,$0}' input.txt
This prints the number of fields on the current line (NF) and then the current line itself ($0); the comma in print inserts the output field separator between them, which by default is a space.
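With the sample input above, that gives:

$ awk '{print NF,$0}' input.txt
8 10 109 120 200 1148 1210 1500 5201
5 9 139 1239 1439 6551
3 199 5693 5695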
This will work for you too:
awk '{$0=NF FS $0}7' file
It rebuilds the record with the field count and FS prepended; the trailing 7 (any non-zero constant, 1 being the more common choice) is an always-true condition with no action, so awk takes the default action and prints the line.