Output matching columns from multiple inputs in awk - awk

Assume there is some data in these two input files that I want: the "A" records from inputA.txt and the "B" records from inputB.txt.
==> inputA.txt <==
A 10214027 6369158
A 10214028 6369263
A 10214029 6369321
A 10214030 6369713
A 10214031 6370146
A 10214032 6370553
A 10214033 6370917
A 10214034 6371322
A 10214035 6371735
A 10214036 6372136
So I only want the data with A's
==> inputB.txt <==
B 50015214 5116941
B 50015215 5116767
B 50015216 5116577
B 50015217 5116409
B 50015218 5116221
B 50015219 5116044
B 50015220 5115845
B 50015221 5115676
B 50015222 5115512
B 50015223 5115326
Same goes here, only want B's
I've built the script below, but the code is duplicated because of the multiple inputs.
#!/bin/awk -f
BEGIN {
    printf "Column 1\tColumn 2\tColumn 3"
}
/^A/ {
    c = substr($2, 1, 4)
    d = substr($2, 5, 3)
    e = substr($3, 1, 4)
    f = substr($3, 5, 3)
}
{
    printf "%4.1f %4.1f %4.1f %4.1f\n", c, d, e, f > "outputA.txt"
}
/^B/ {
    c = substr($2, 1, 4)
    d = substr($2, 5, 3)
    e = substr($3, 1, 4)
    f = substr($3, 5, 3)
}
{
    printf "%4.1f %4.1f %4.1f %4.1f\n", c, d, e, f > "outputB.txt"
}
Let me know your thoughts on this.
Expected output
==> outputA.txt <==
Column 1 Column 2 Column 3 Column 4
1021 4027 6369 158
1021 4028 6369 263
1021 4029 6369 321
1021 4030 6369 713
1021 4031 6370 146
1021 4032 6370 553
1021 4033 6370 917
1021 4034 6371 322
1021 4035 6371 735
1021 4036 6372 136
==> outputB.txt <==
Column 1 Column 2 Column 3 Column 4
5001 5214 5116 941
5001 5215 5116 767
5001 5216 5116 577
5001 5217 5116 409
5001 5218 5116 221
5001 5219 5116 044
5001 5220 5115 845
5001 5221 5115 676
5001 5222 5115 512
5001 5223 5115 326

With GNU awk and FIELDWIDTHS:
awk 'BEGIN{FIELDWIDTHS="1 1 4 4 1 4 3"}
{out="output" $1 ".txt"}
FNR==1{print "Column 1 Column 2 Column 3 Column 4" >out}
{print $3,$4,$6,$7 >out}' inputA.txt inputB.txt
Use FIELDWIDTHS to split the current row into seven fixed-width columns. out holds the name of the output file, derived from the first field. When the first row of a file is reached, print the header to that file; then, for every row, print the four columns of interest to it.
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
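The fixed-width splitting can be seen in isolation. A minimal sketch (assuming GNU awk, since FIELDWIDTHS is a gawk extension that plain awk/mawk will ignore):

```shell
# FIELDWIDTHS="1 1 4 4 1 4 3" carves each 18-character row into seven
# fields: $1="A", $2=" ", $3="1021", $4="4027", $5=" ", $6="6369", $7="158"
echo 'A 10214027 6369158' |
gawk 'BEGIN { FIELDWIDTHS = "1 1 4 4 1 4 3" } { print $3, $4, $6, $7 }'
# prints: 1021 4027 6369 158
```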

Could you please try the following.
awk '
FNR==1{
sub(/[a-z]+/,"",FILENAME)
file="output"FILENAME".txt"
print "Column 1 Column 2 Column 3 Column 4" > (file)
}
{
print substr($0,3,4),substr($0,7,4),substr($0,12,4),substr($0,16,3) > (file)
}
' inputA inputB
Explanation:
awk ' ##Start the awk program here.
FNR==1{ ##When the first line of each input file is reached, do the following.
sub(/[a-z]+/,"",FILENAME) ##Strip the lowercase letters (i.e. "input") from the file name, leaving "A" or "B".
file="output"FILENAME".txt" ##Build variable file whose value is the string "output", then the stripped FILENAME, then ".txt".
print "Column 1 Column 2 Column 3 Column 4" > (file) ##Print the headers to the output file.
}
{
print substr($0,3,4),substr($0,7,4),substr($0,12,4),substr($0,16,3) > (file) ##Print the substring values the OP needs to the output files.
}
' inputA inputB ##Mention multiple Input_file names here.

You don't need substr here. Empty out the first field, insert a space after every four digits, force awk to re-split the fields, and then print:
awk '$1=="A"{
$1=""
gsub(/[0-9]{4}/,"& ")
$1=$1
print
}' inputA.txt
Its output:
1021 4027 6369 158
1021 4028 6369 263
1021 4029 6369 321
1021 4030 6369 713
1021 4031 6370 146
1021 4032 6370 553
1021 4033 6370 917
1021 4034 6371 322
1021 4035 6371 735
1021 4036 6372 136
Obviously, this works with only one input, but referring to the other answers, I believe you can tweak it to work with multiple files.
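For instance, here is one possible tweak (a sketch, not from the original answer): keep the same gsub-and-resplit trick, but derive an output file name from the first field before blanking it, so both inputs can be handled in one pass. The bracket expression is spelled out instead of using {4} so it also works in awks without interval-expression support:

```shell
# For each line, pick the output file from the leading tag ($1),
# then drop the tag, insert a space after every run of four digits,
# and force awk to re-split the fields before printing.
awk '{
    out = "output" $1 ".txt"               # e.g. outputA.txt for "A" lines
    $1 = ""                                # blank the tag; $0 is rebuilt
    gsub(/[0-9][0-9][0-9][0-9]/, "& ")     # space after every 4 digits
    $1 = $1                                # re-split fields, squeeze spaces
    print > out
}' inputA.txt inputB.txt
```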

just keep it simple:
${...input_data...} |
{m,g,n}awk 'gsub(" ....", "& ")^_'
(gsub() inserts a space after every space-plus-four-character group and returns the number of substitutions made; raising that count to the power of the uninitialized variable _, i.e. to the zeroth power, always yields 1, so the bare pattern is true and every line is printed.)
A 1021 4027 6369 158
A 1021 4028 6369 263
A 1021 4029 6369 321
A 1021 4030 6369 713
A 1021 4031 6370 146
A 1021 4032 6370 553
A 1021 4033 6370 917
A 1021 4034 6371 322
A 1021 4035 6371 735
A 1021 4036 6372 136
B 5001 5214 5116 941
B 5001 5215 5116 767
B 5001 5216 5116 577
B 5001 5217 5116 409
B 5001 5218 5116 221
B 5001 5219 5116 044
B 5001 5220 5115 845
B 5001 5221 5115 676
B 5001 5222 5115 512
B 5001 5223 5115 326

Related

How to remove duplicates based on condition?

Here is my sample table:
idmain  idtime    idperson1  idperson2
141     20220106  510        384
221     20220107  300        184
221     20220107  301        184
465     20220108  300        184
525     20220109  111        123
525     20220109  112        123
525     20220109  113        123
Duplicated records differ only by idperson1, so I need to remove them, preserving only the record with the max value of idperson1. My final table should be:
idmain  idtime    idperson1  idperson2
141     20220106  510        384
221     20220107  301        184
465     20220108  300        184
525     20220109  113        123
First, you can use a subquery to obtain the max value of idperson1 for each group, then use it as a condition, like this:
select a.* from fact1 a
where idperson1=(select max(b.idperson1) from fact1 b where a.idtime=b.idtime and a.idperson2=b.idperson2);
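Since the surrounding questions are awk-flavoured, the same group-wise max can also be sketched in awk (an assumption: the table exported as whitespace-separated text with columns idmain idtime idperson1 idperson2; data.txt is a hypothetical file name):

```shell
# For each (idmain, idtime, idperson2) group, remember the row whose
# idperson1 (field $3) is largest, then print the surviving rows.
awk '{
    key = $1 FS $2 FS $4
    if (!(key in max) || $3 > max[key]) {
        max[key] = $3
        row[key] = $0
    }
}
END { for (k in row) print row[k] }' data.txt
```

Note that for (k in row) yields the groups in unspecified order; pipe through sort if order matters.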

How to remove unwanted values in data when reading csv file

Reading Pina_Indian_Diabities.csv, some of the values are strings, something like this:
+AC0-5.4128147485
734 2
735 4
736 0
737 8
738 +AC0-5.4128147485
739 1
740 NaN
741 3
742 1
743 9
744 13
745 12
746 1
747 1
Like in row 738, there are such values in other rows and columns as well.
How can I drop them?
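No answer to this one is included above; as a sketch of one approach (an assumption about the intent, done with awk rather than pandas), rows containing any value that is not a plain decimal number, such as +AC0-5.4128147485 or NaN, can simply be filtered out:

```shell
# Keep the header, then keep only rows in which every comma-separated
# field is an optionally signed decimal number; rows carrying values
# like "+AC0-5.4128147485" or "NaN" are dropped.
awk -F, 'NR == 1 { print; next }
{
    for (i = 1; i <= NF; i++)
        if ($i !~ /^-?[0-9]+(\.[0-9]+)?$/)
            next
    print
}' Pina_Indian_Diabities.csv
```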

Calculate exponential value of a record

I need to calculate and print the exponential of field $2 after multiplying it by a factor of -0.05.
The data looks like this:
101 205 560
101 200 530
107 160 480
110 95 600
I need the output to look like this:
101 205 560 0.000035
101 200 530 0.000045
107 160 480 0.00033
110 95 600 0.00865
This should work:
$ awk '{ print $0, sprintf("%f", exp($2 * -0.05)) }' infile
101 205 560 0.000035
101 200 530 0.000045
107 160 480 0.000335
110 95 600 0.008652
This just prints the whole line $0, followed by the exponential of the second field multiplied by -0.05. The sprintf formatting makes sure that the result is not printed in scientific notation (which would happen otherwise).
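The notation difference is easy to demonstrate (a minimal sketch, not part of the original answer): awk's default output conversion OFMT is "%.6g", which switches to scientific notation for small magnitudes, while an explicit "%f" stays in fixed-point form:

```shell
awk 'BEGIN {
    x = exp(205 * -0.05)   # exp(-10.25), a very small number
    print x                # default OFMT "%.6g": scientific, e.g. 3.53575e-05
    printf "%f\n", x       # fixed-point: 0.000035
}'
```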
If the input data is tab separated and you need tabs in the output as well, you have to set the output field separator first:
$ awk 'BEGIN{OFS="\t"} { print $0, sprintf("%f", exp($2 * -0.05)) }' infile
101 205 560 0.000035
101 200 530 0.000045
107 160 480 0.000335
110 95 600 0.008652

Transpose a column to a line after detecting a pattern

I have this text file format:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 #line 2
37.31 624 #line 3
260 1 #line 4
321 624 #line 5
532 23 #line 6
12 644 #line 7
270 0.0 #line 8
3e-37 1046 #line 9
154 #line 10
I have to detect a line containing 8 columns (line 2), and transpose the second column of the following seven lines (lines 3 - 9) to the end of the 8-column line. And finally, exclude line 10. This pattern repeats throughout a large text file, but it is not frequent (30 times in a file of 2000 lines). Is it possible to do this using awk?
The edited text file must look like the following text:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046 #line 2
Thank you very much in advance.
awk 'NF == 12 { t = $0; for (i = 1; i <= 7; ++i) { r = getline; if (r < 1) break; t = t "\t" $2; } print t; next; } NF > 12' temp.txt
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046
Lines having more than 12 fields are printed automatically by the bare NF > 12 pattern.
When a line with exactly 12 fields is detected, the second column of each of the following 7 lines is concatenated to it, and the result is printed.
Any other line is ignored.
Edited to add only the second column of the lines with two columns.
I think this does what you want:
awk 'NF >= 8 { a[++i] = $0 } NF == 2 { a[i] = a[i] " " $2 } END { for (j = 1; j <= i; ++j) print a[j] }' file
For lines with 8 or more columns, add a new element to the array a. If the line has 2 columns, append its second field to the current array element. Once the whole file has been processed, go through the array and print all of the lines.
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046

How to number the lines according to a field with awk?

I wonder whether there is a way using awk to number the lines according to a field. For example,
Input
2334 332
2334 546
2334 675
7890 222
7890 134
234 45
.
.
.
Based on the 1st field, I would have the following output
Output
1 2334 332
1 2334 546
1 2334 675
2 7890 222
2 7890 134
3 234 45
.
.
.
I would be grateful for your help.
Cheers,
T
here's how — increment a counter the first time each distinct $1 value is seen, and prefix every line with it:
awk '!a[$1]++{c++}{print c, $0}' file
1 2334 332
1 2334 546
1 2334 675
2 7890 222
2 7890 134
3 234 45
Or, relying on the input being grouped by the first field, bump the counter whenever $1 differs from the previous line's:
awk 'last != $1 { line = line + 1 } { last = $1; print line, $0 }'