Print three columns with awk, then sets of three columns - awk

I want to create a number of files from a much larger file, dividing by columns. For example, the header of my larger file looks like this:
Name Chr Position SNP1A SNP1B SNP1C SNP2A SNP2B SNP2C SNP3A SNP3B SNP3C
and I want to create these files:
Name Chr Position SNP1A SNP1B SNP1C
Name Chr Position SNP2A SNP2B SNP2C
Name Chr Position SNP3A SNP3B SNP3C
I've been trying to use awk, but I'm a bit of a novice with it, so my command currently reads:
for ((i=1; i<=440;i++)); do awk -f printindivs.awk inputfile done
Where printindivs.awk is:
{print $1 $2 $3 $((3*$i)+1) $((3*$i)+2) $((3*$i)+3))}
The output I'm getting suggests that my way of trying to get the sets of three is wrong: how can I do this?
Thanks

You can do this easily with just a simple awk script (j is reset on every line so that each row of a column group goes to the same file):
$ awk '{j=0; for(i=4;i<=NF;i+=3) print $1,$2,$3,$i,$(i+1),$(i+2) > ("out" ++j)}' file
The output files will be in out[1..n]:
$ cat out1
Name Chr Position SNP1A SNP1B SNP1C
$ cat out2
Name Chr Position SNP2A SNP2B SNP2C
$ cat out3
Name Chr Position SNP3A SNP3B SNP3C
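If the input really has 440 column groups, as the loop bound in the question suggests, some awk implementations may hit a limit on simultaneously open files. A minimal sketch of a variant that appends to each file and closes it after every write follows. The sample input and the names workdir, inputfile and n are assumptions made for illustration, and the out* files are assumed not to exist beforehand.

```shell
# Work in a scratch directory so the out* files start out absent (assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample input: a header plus one data row, three column groups.
printf '%s\n' \
  'Name Chr Position SNP1A SNP1B SNP1C SNP2A SNP2B SNP2C SNP3A SNP3B SNP3C' \
  'rs1 1 100 a1 b1 c1 a2 b2 c2 a3 b3 c3' > inputfile

awk '{
    n = 0                                  # restart file numbering on every row
    for (i = 4; i + 2 <= NF; i += 3) {     # only complete groups of three
        out = "out" (++n)
        print $1, $2, $3, $i, $(i+1), $(i+2) >> out
        close(out)                         # keep at most one output file open
    }
}' inputfile

cat out2
```

With this sample, out2 should hold the SNP2 columns for both the header and the data row. Reopening a file for every line is slower than keeping it open, so this trade-off is probably only worth it when the number of groups is large.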

Related

how to ignore action on printing a field value using awk

Hello I have several files whose starting line (or record) follows this format:
cat file_1.txt | grep '>'
> CP022114.1 Kluyvera georgiana strain YDC799 chromosome, complete genome
I want to retrieve the second field of that record, which corresponds to the genus taxonomic category; in this example it is "Kluyvera". So I use this:
awk 'NR==1{print $2}' file.txt
and I got
Kluyvera
The issue is that in some files the second field doesn't correspond to the genus taxonomic category, and the genus is preceded by the string "Candidatus":
cat file_2.txt | grep '>'
> NTKC01000006.1 Candidatus Thioglobus sp. MED-G25 SUP05-clade-MED-G25-C6, whole genome shotgun sequence
In the above record, "Thioglobus" is the genus of the species, so when I try the above awk command it returns "Candidatus".
I want awk to print "this file has candidatus" instead of retrieving the second field for that record.
Let's say you have an input file like this:
cat file
CP022114.1 Kluyvera georgiana strain YDC799 chromosome, complete genome
NTKC01000006.1 Candidatus Thioglobus sp. MED-G25 SUP05-clade-MED-G25-C6, whole genome shotgun sequence
You can use awk like this with a conditional print:
awk '{print ($2 == "Candidatus" ? $3 : $2)}' file
Kluyvera
Thioglobus
Or if you want to print a custom string for the Candidatus record then use:
awk '{print ($2 == "Candidatus" ? "this file has candidatus" : $2)}' file
Kluyvera
this file has candidatus
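Since the question mentions several files and only the first record of each one matters, the same conditional can be restricted with FNR==1 and labeled with FILENAME. This is a sketch under assumptions: the two sample files are made up, and their header is taken to have the accession as the first field (no separate ">" field), as in the example input above.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample files, one header record each (format as in the example above).
echo 'CP022114.1 Kluyvera georgiana strain YDC799 chromosome, complete genome' > file_1.txt
echo 'NTKC01000006.1 Candidatus Thioglobus sp. MED-G25 SUP05-clade-MED-G25-C6, whole genome shotgun sequence' > file_2.txt

# FNR restarts at 1 for every input file, so this tests the first record of each file.
result=$(awk 'FNR == 1 {
    print FILENAME ": " ($2 == "Candidatus" ? "this file has candidatus" : $2)
}' file_1.txt file_2.txt)

echo "$result"
```

For these two files the output should be "file_1.txt: Kluyvera" followed by "file_2.txt: this file has candidatus".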

Replace number in a template file with numbers from a list and output to different files

file1.txt is like:
$view->name = '12483291';
...
$view->human_name = '12483291';
Also I have some numbers in file2:
8789
53416
673425
What I need to do is: for each number in the above column, replace the '12483291' value, and create a new file named after the replacement number.
Desired output:
file: 8789.inc
$view->name = '8789'
...
$view->human_name = '8789'
file: 53416.inc
$view->name = '53416'
...
$view->human_name = '53416'
file: 673425.inc
$view->name = '673425'
...
$view->human_name = '673425'
How would you approach this?
A few of my attempts, but without getting the result I want:
sed "s/12483291/$(cat file2)/" file1 > 8789.inc
The above works if file2 has only one line, and I have to run the command as many times as the values in file2, manually giving the name of the result file.
This might work for you (GNU sed):
sed -n 's/.*/sed "s#12483291#&#g" file1 >&.inc/e' file2
Replace the number 12483291 in the template file1 with each number in file2, and name each resulting file after that number with .inc appended.
If your file has DOS-style line endings, see "Why does my tool output overwrite itself and how do I fix it?" first.
With GNU awk
awk -v q="'" -v s="'12483291'" 'NR==FNR{a[$1]; next}
{for(k in a) {print gensub(s, q k q, 1) > k".inc"}}' f2 f1
-v q="'": just a handy variable holding a single quote character
-v s="'12483291'": the field value to be replaced
NR==FNR{a[$1]; next}: NR is the overall record number and FNR is the record number within the current file, so NR==FNR is true only for the first file. The array a stores the first field of each line as a key.
for(k in a): for the second file, loop over all the keys in array a
gensub(s, q k q, 1): replace the field value with the key (note that this replaces only the first match and assumes s doesn't contain any regex metacharacters)
The output of gensub is then redirected to a file named after the key
Add -v RS='\r\n' to handle DOS-style input
With other awk implementations you may run into a "too many open files" error if f2 has a large number of lines. In that case, change the loop content to {line=$0; sub(s, q k q, line); f=k".inc"; print line >> f; close(f)}. This assumes the .inc files don't already exist; otherwise the new content is appended to them.
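Putting that loop body into the full command might look like the sketch below. It uses only sub and close, so it should not depend on GNU awk. The scratch directory and the sample contents of f1 and f2 are assumptions for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-ins for the template (f1) and the number list (f2).
cat > f1 <<'EOF'
$view->name = '12483291';
$view->human_name = '12483291';
EOF
printf '%s\n' 8789 53416 673425 > f2

awk -v q="'" -v s="'12483291'" '
NR == FNR { a[$1]; next }          # first file: remember each number as a key
{
    for (k in a) {
        line = $0
        sub(s, q k q, line)        # replace the first match on a copy of the line
        f = k ".inc"
        print line >> f            # append, since the file is reopened every time
        close(f)                   # avoid keeping one open file per number
    }
}' f2 f1

cat 8789.inc
```

Each .inc file should end up with the template lines carrying its own number. As with the original, s is treated as a regular expression, so this relies on the value containing no regex metacharacters.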

Problems with awk substr

I am trying to split a file column using the substr awk command. So the input is as follows (it consists of 4 lines, one blank line):
#NS500645:122:HYGVMBGX2:4:21402:2606:16446:ACCTAGAAGG:R1
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I want to split the second line at the pattern "GATC", keeping the pattern at the start of the right-hand substring, like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
I want the last line to be cut into pieces of the same lengths as the split one, and the file regenerated like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTAT
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
GATCC
EEEEE
To split the last column I am using this awk script:
cat prove | paste - - - - | awk 'BEGIN {FS="\t"; OFS="\t"}
{gsub("GATC","/tGATC", $2); {split ($2, a, "\t")};
for (i in a) print substr($4, length(a[i-1])+1, length(a[i-1])+length(a[i]))}'
But the output is as follows:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
The second and third lines are longer than expected.
I checked the calculated values that are passed to substr, and they are correct:
1 30
31 70
41 45
Using these values, the output should be:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEE
But as I showed it is not the case.
Any suggestions?
I guess you're looking for something like this, but your question formatting is really confusing:
$ awk -v OFS='\t' 'NR==1 {next}
NR==2 {n=index($0,"GATC")}
/^[^+]/ {print substr($0,1,n-1),substr($0,n)}' file
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I assumed your file is in this format
dummy header line to be ignored
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
+
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
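A possible explanation for the over-long lines in the question is that substr(s, start, length) takes a length as its third argument, not an end position, so passing 31 and 70 asks for up to 70 characters starting at position 31. The sketch below cuts the sequence before each GATC and cuts the quality line at the same offsets. It assumes four-line records (header, sequence, separator, quality), and the short read in the sample is made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical four-line record: header, sequence, "+", quality.
printf '%s\n' '#read1' 'AAGATCTTGATCC' '+' 'ABCDEFGHIJKLM' > reads.txt

result=$(awk '
NR % 4 == 2 { seq = $0 }                      # remember the sequence line
NR % 4 == 0 {                                 # quality line: emit the pieces
    start = 1
    while (start <= length(seq)) {
        p = index(substr(seq, start + 1), "GATC")   # next GATC after this piece begins
        len = (p ? p : length(seq) - start + 1)     # piece length, not an end position
        print substr(seq, start, len)
        print substr($0, start, len)
        start += len
    }
}' reads.txt)

echo "$result"
```

For the sample read this should print AA, AB, GATCTT, CDEFGH, GATCC and IJKLM on separate lines, that is, each sequence piece followed by the quality characters at the same positions.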

How to print first column of row along with specific pattern?

I am trying to extract a pattern along with printing the starting string of the line.
Input
Saureus1000(37 genes,10 taxa): Saureus08BA02176_00020(Saureus08BA02176) Saureus1269_00069(Saureus1269) Saureus170_00062(Saureus170) Saureus71193_00020(Saureus71193) SaureusED133_00019(SaureusED133) SaureusED98_00019(SaureusED98) SaureusLGA251_00019(SaureusLGA251) SaureusN305_00605(SaureusN305) SaureusRF122_00019(SaureusRF122) SaureusST398_00020(SaureusST398) Saureus08BA02176_01763(Saureus08BA02176) Saureus08BA02176_01805(Saureus08BA02176) Saureus08BA02176_01808(Saureus08BA02176) Saureus1269_01194(Saureus1269) Saureus1269_01237(Saureus1269) Saureus1269_01240(Saureus1269) Saureus71193_01635(Saureus71193) Saureus71193_01678(Saureus71193) Saureus71193_01681(Saureus71193) SaureusED133_01798(SaureusED133) SaureusED133_01840(SaureusED133) SaureusED133_01843(SaureusED133) SaureusED98_01777(SaureusED98) SaureusED98_01821(SaureusED98) SaureusED98_01824(SaureusED98) SaureusLGA251_01748(SaureusLGA251) SaureusLGA251_01790(SaureusLGA251) SaureusLGA251_01793(SaureusLGA251) SaureusN305_00013(SaureusN305) SaureusN305_00016(SaureusN305) SaureusN305_00059(SaureusN305) SaureusRF122_01807(SaureusRF122) SaureusRF122_01848(SaureusRF122) SaureusRF122_01851(SaureusRF122) SaureusST398_01884(SaureusST398) SaureusST398_01927(SaureusST398) SaureusST398_01930(SaureusST398)
Saureus1001(35 genes,12 taxa): Saureus08BA02176_01441(Saureus08BA02176) Saureus1269_02301(Saureus1269) Saureus1269_02527(Saureus1269) Saureus71193_01310(Saureus71193) SaureusED98_01421(SaureusED98) SaureusED98_01424(SaureusED98) SaureusN305_02184(SaureusN305) SaureusN305_02188(SaureusN305) SaureusN305_02190(SaureusN305) SaureusRF122_01383(SaureusRF122) SaureusRF122_01386(SaureusRF122) SaureusST398_01476(SaureusST398) Saureus08BA02176_01442(Saureus08BA02176) Saureus08BA02176_01443(Saureus08BA02176) Saureus08BA02176_01445(Saureus08BA02176) Saureus1269_02302(Saureus1269) Saureus1269_02529(Saureus1269) Saureus1364_00430(Saureus1364) Saureus170_00571(Saureus170) Saureus170_00574(Saureus170) Saureus302_00352(Saureus302) Saureus302_00556(Saureus302) Saureus71193_01311(Saureus71193) Saureus71193_01312(Saureus71193) Saureus71193_01314(Saureus71193) SaureusED98_01423(SaureusED98) SaureusED98_01426(SaureusED98) SaureusLGA251_01423(SaureusLGA251) SaureusN305_02185(SaureusN305) SaureusN305_02187(SaureusN305) SaureusST398_01477(SaureusST398) SaureusST398_01478(SaureusST398) SaureusST398_01548(SaureusST398) SaureusED133_01465(SaureusED133) Saureus302_01433(Saureus302)
Required output
Saureus1000 Saureus08BA02176_00020
I am using this code, but I am not getting the required output on a single line:
awk '{print $1} {for(i=1;i<=NF;i++){if($i~/^Saureus08BA/){print $i}}}' file > test
Output for this command
Saureus1000(37
Saureus08BA02176_00020(Saureus08BA02176)
Saureus08BA02176_01763(Saureus08BA02176)
Saureus08BA02176_01805(Saureus08BA02176)
Saureus08BA02176_01808(Saureus08BA02176)
Saureus1001(35
Saureus08BA02176_01441(Saureus08BA02176)
Saureus08BA02176_01442(Saureus08BA02176)
Saureus08BA02176_01443(Saureus08BA02176)
Saureus08BA02176_01445(Saureus08BA02176)
GNU awk solution:
awk 'match($0,/^([^(]+)\([^(]+(Saureus08BA[0-9]+_[0-9]+)/,a){ print a[1],a[2] }' file
([^(]+) - capturing the needed part from the 1st field
(Saureus08BA[0-9]+_[0-9]+) - the 2nd captured group, containing the "Saureus08BA" item that directly follows the header
The output:
Saureus1000 Saureus08BA02176_00020
Saureus1001 Saureus08BA02176_01441
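The three-argument form of match used above is a GNU awk extension. For other awk implementations, a field-based sketch along the following lines may work. It prints only the first field starting with Saureus08BA on each line, and the shortened sample lines are an assumption for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical, shortened version of the input shown in the question.
printf '%s\n' \
  'Saureus1000(37 genes,10 taxa): Saureus08BA02176_00020(Saureus08BA02176) Saureus1269_00069(Saureus1269) Saureus08BA02176_01763(Saureus08BA02176)' \
  'Saureus1001(35 genes,12 taxa): Saureus08BA02176_01441(Saureus08BA02176) Saureus1269_02301(Saureus1269)' > file

result=$(awk '{
    name = $1
    sub(/\(.*/, "", name)                 # "Saureus1000(37" -> "Saureus1000"
    for (i = 2; i <= NF; i++) {
        if ($i ~ /^Saureus08BA/) {
            hit = $i
            sub(/\(.*/, "", hit)          # drop the "(Saureus08BA02176)" suffix
            print name, hit
            break                         # only the first match per line
        }
    }
}' file)

echo "$result"
```

Unlike the regex version, this does not require the Saureus08BA item to come immediately after the header; it simply takes the first such field on the line.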

awk / split to return lines with a certain value in a certain column - create blocks of 100,000

I have a csv file where the third column is a number. Some of the entries don't have a value in this column.
I want to pull 100k blocks from the file, but only entries with a valid value for that column.
I could use split, but how do I make it check that column for a value?
$ cat test.txt
1,2,3,get me
4,5,,skip me
6,7,8,get me
9,10,11,stop before me
$ awk -F, '$3!="" && ++i<=2' test.txt
1,2,3,get me
6,7,8,get me
If you're trying to verify whether the third field within a record has a value, and output its contents if it does, you could try the following:
awk -F , '{ if($3 != ""){print $3} }'
This could also be written as:
awk -F , '$3 != ""{print $3}'
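To write the valid rows out in fixed-size blocks rather than stopping after the first block, a counter can pick the output file. The following is a sketch only: size is set to 2 so that the sample data shows the rollover (the question would use 100000), the block_NNN.csv names are an assumption, and "valid" is taken to mean simply that the third field is non-empty.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the answer above.
printf '%s\n' '1,2,3,get me' '4,5,,skip me' '6,7,8,get me' '9,10,11,stop before me' > test.txt

awk -F, -v size=2 '
$3 != "" {                                   # keep only rows with a non-empty third field
    if (n % size == 0) {                     # time to start a new block
        if (out != "") close(out)            # finish the previous block file
        out = sprintf("block_%03d.csv", ++b)
    }
    print > out
    n++
}' test.txt

cat block_001.csv
```

With the sample data, block_001.csv should receive the two "get me" rows and block_002.csv the remaining valid row. Closing each block before opening the next keeps only one output file open at a time.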