awk: if pattern is matched append some data

I have a data set created by a tool, with the file name test.deg. The file contents are as follows:
1 I0.XPDIN1 1.581e-01 1.507e-01 3.662e-04 3.891e-02
2 I0.XPXA1 1.577e-01 1.502e-01 3.653e-04 3.859e-02
3 I0.XPXA2 1.538e-01 1.444e-01 3.552e-04 3.471e-02
I have a second file, test.spf, containing the following information:
XPDIN1 XPDIN1#d XPDIN1#g XPDIN1#s VPP
XPXA1 XPXA1#d XPXA1#g XPXA1#s VPP
XPXA2 XPXA2#d XPXA2#g XPXA2#s VPP
I am trying to write an awk script that matches the instance name from test.deg to the instance name in test.spf. When the script sees a match, I would like the 5th column's contents appended to the end of the matched instance name's line. Example output for I0.XPDIN1 in test.deg would be XPDIN1 XPDIN1#d XPDIN1#g XPDIN1#s VPP 3.662e-04
The script needs to match the instance name from test.deg (after the prefix I0.) to the first instance name in test.spf, then append the 5th column's data.
Thanks,
Bad Awk

GNU Awk
$ awk 'FNR==NR{a[$2]=$5; next} ("I0."$1 in a){$6=a["I0."$1]}1' test.deg test.spf
XPDIN1 XPDIN1#d XPDIN1#g XPDIN1#s VPP 3.662e-04
XPXA1 XPXA1#d XPXA1#g XPXA1#s VPP 3.653e-04
XPXA2 XPXA2#d XPXA2#g XPXA2#s VPP 3.552e-04
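The one-liner packs the whole job into a single rule pair; unpacked into a commented script it reads as follows (the same logic, here assumed to be saved under a name such as append.awk and run with awk -f append.awk test.deg test.spf):
FNR == NR {              # true only while reading the first file, test.deg
    a[$2] = $5           # map instance name (e.g. I0.XPDIN1) to its 5th column
    next                 # move on to the next test.deg line
}
("I0." $1) in a {        # now reading test.spf: is I0.<first field> a known instance?
    $6 = a["I0." $1]     # append the stored value as a new 6th field
}
1                        # print every test.spf line, modified or not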

Related

How to replace strings in text with id from second text?

I've got two CSV files. The first file contains organism family names and connection weight information, but I need to change the format of the file to load it into other programs such as Gephi. I have created a second file where each family has an ID value. I haven't found a good example on this site of how to change the family names in the first file to the IDs from the second file. Example of my files:
$ cat edge_file.csv
Source,Target,Weight,Type,From,To
Argasidae,Alcaligenaceae,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
Argasidae,Burkholderiaceae,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
Argasidae,Methylophilaceae,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
Argasidae,Oxalobacteraceae,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
Argasidae,Rhodocyclaceae,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
Argasidae,Sphingomonadaceae,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
Argasidae,Zoogloeaceae,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
Argasidae,Agaricaceae,0.190482976,undirected,A_Argasidae,F_Agaricaceae
Argasidae,Bulleribasidiaceae,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
Argasidae,Camptobasidiaceae,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
Argasidae,Chrysozymaceae,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
Argasidae,Cryptococcaceae,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
$ cat id_file.csv
Id,Family
1,Argasidae
2,Buthidae
3,Alcaligenaceae
4,Burkholderiaceae
5,Methylophilaceae
6,Oxalobacteraceae
7,Rhodocyclaceae
8,Oppiidae
9,Sphingomonadaceae
10,Zoogloeaceae
11,Agaricaceae
12,Bulleribasidiaceae
13,Camptobasidiaceae
14,Chrysozymaceae
15,Cryptococcaceae
I basically want the edge_file.csv output to turn into the output below, where Source and Target have changed from family names to IDs:
Source,Target,Weight,Type,From,To
1,3,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
1,4,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
1,5,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
1,6,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
1,7,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
1,9,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
1,10,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
1,11,0.190482976,undirected,A_Argasidae,F_Agaricaceae
1,12,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
1,13,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
1,14,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
1,15,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
I haven't been able to figure it out with awk since I'm new to it, but I tried some variations from other examples here such as (just testing it out for the "Source" column):
awk 'NR==FNR{a[$1]=$1;next}{$1=a[$1];}1' edge_file.csv id_file.csv
Everything just prints out blank. My understanding is that I should create an array for the Source and Target columns in edge_file.csv and then replace them with the first column from id_file.csv, which is the Id column. I can't get the syntax to work even for just one column.
You're close. This one-liner should help:
awk -F, -v OFS=',' 'NR==FNR{a[$2]=$1;next}{$1=a[$1];$2=a[$2]}1' id_file.csv edge_file.csv
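Note that the header rows also go through the lookup: a["Family"]="Id" is stored from the id_file.csv header, and Source and Target in the edge_file.csv header have no entry, so the header fields would be blanked. If the header should pass through unchanged (as in the desired output above), one option is a small guard on the second file's first line; a sketch based on the same approach:
awk -F, -v OFS=',' '
    NR == FNR { a[$2] = $1; next }   # id_file.csv: remember Family -> Id
    FNR == 1  { print; next }        # keep the edge_file.csv header as-is
    { $1 = a[$1]; $2 = a[$2]; print }
' id_file.csv edge_file.csv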

Using awk to replace and add text

I have the following .txt file:
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
I would like to make two edits to this text. First, in the line:
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
I would like to replace "Number=." with "Number=G"
And immediately after the line:
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
I would like to add a new line of text (plus a line break):
##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score">
I was wondering if this could be done with one or two awk commands.
Thanks for any suggestions!
My solution is similar to @Daweo's. Consider this script, replace.awk:
/^##FORMAT=<ID=PL/ { sub(/Number=\./, "Number=G") }
/##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">/ {
print
print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\"Quality score\">"
next
}
1
Run it:
awk -f replace.awk file.txt
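If you want to write the result back into the file rather than to stdout, GNU awk 4.1+ has an in-place extension (this assumes gawk; with other awks, redirect to a temporary file and move it back):
gawk -i inplace -f replace.awk file.txt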
Notes
The first rule is a straight replacement, limited to the ##FORMAT=<ID=PL line so that other FORMAT lines containing Number=. are left untouched
The next group of lines deals with your second requirement. First, the print statement prints out the current line
The second print statement prints out your new line
The next statement skips to the next input line
Finally, the pattern 1 tells awk to print every line
I would use GNU AWK the following way. Let file.txt content be
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
then
awk '/##FORMAT=<ID=PL/{gsub("Number=\\.","Number=G")}/##INFO=<ID=AF/{print;print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\x22Quality score\x22>";next}{print}' file.txt
output
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
Explanation: if the current line contains ##FORMAT=<ID=PL, change Number=\. to Number=G (note that the backslashes are required to get a literal . rather than . meaning any character). If the current line contains ##INFO=<ID=AF, print it, then print ##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\x22Quality score\x22> (\x22 is the hex escape code for ", which could not be used directly inside a "-delimited string), and go to the next line. The final {print} handles all lines except those containing ##INFO=<ID=AF, as these have their own print.
(tested in gawk 4.2.1)

Awk array, replace with full length matches of keys

I want to replace strings in a target file (target.txt) by strings in a lookup table (lookup.tab), which looks as follows.
Seq_1 Name_one
Seq_2 Name_two
Seq_3 Name_three
...
Seq_10 Name_ten
Seq_11 Name_eleven
Seq_12 Name_twelve
The target.txt file is a large file with a tree structure (Nexus format). It is not arranged in columns.
Therefore I use the following command:
awk 'FNR==NR { array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' "lookup.tab" "target.txt"
Unfortunately, this command does not take the full length of the elements from the first column, so that Seq_1, Seq_10, Seq_11, Seq_12 end up as Name_one, Name_one0, Name_one1, Name_one2 etc...
How can the awk command be made more specific to correctly substitute the strings?
Please try this and see if it meets your need:
awk 'FNR==NR { le=length($1); a[le][$1]=$2; if (maxL<le) maxL=le; next } { for(le=maxL;le>0;le--) if(length(a[le])) for (i in a[le]) gsub(i, a[le][i]) }1' "lookup.tab" "target.txt"
It's based on your own attempt, but instead of replacing with the array keys in arbitrary order, it replaces using the longer keys first.
This way, and based on your examples, I think it's enough to avoid wrong substitutions.
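Note that a[le][$1] is an array of arrays, so this needs GNU awk 4.0 or later. Spelled out as a commented script (the same logic, only reformatted; assumed to be saved as, say, replace_longest.awk and run with awk -f replace_longest.awk lookup.tab target.txt):
FNR == NR {                      # first file: the lookup table
    le = length($1)              # group keys by their length
    a[le][$1] = $2
    if (maxL < le) maxL = le     # remember the longest key length
    next
}
{
    for (le = maxL; le > 0; le--)    # walk lengths from longest to shortest
        if (length(a[le]))           # skip lengths that have no keys
            for (i in a[le])
                gsub(i, a[le][i])    # so Seq_12 is replaced before Seq_1
}
1                                    # print the (possibly modified) line
In GNU awk you could also anchor the keys with word boundaries, e.g. gsub("\\<" i "\\>", a[le][i]), so that Seq_1 can never match inside Seq_10 at all; for the sample data, though, the length-ordered loop above is already sufficient.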

Creating scripts for obtaining data from a text file

I have a text file named stat.txt which contains lines each in the format
<User Name>-<IP>-<File Name>-<Size>. Each line contains a user name, an IP address, a file name and a download file size. I need to create a script userstat.awk which allows the following data to be obtained when the specific command is written:
userstat.awk u - will list all files
userstat.awk total - will list the total size of all files
So far, I have tried to list all the files for a user using default commands, but I can't manage it that way.
Given stat.txt:
user-1.1.1.1-file.jpg-20
root-1.1.1.1-file.jpg-20
user-1.1.1.1-img.jpg-20
root-1.1.1.1-thing.jpg-20
You could use the command (improved by @ClasesWikner):
awk -F- '{print $3; s+=$4}END {print "total: " s}' stat.txt
To output:
file.jpg
file.jpg
img.jpg
thing.jpg
total: 80
As mentioned by @Scheff, this will not work when usernames or file names contain a -.
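To get the exact interface asked for (userstat.awk u and userstat.awk total), one option is an executable awk script that reads its first argument in a BEGIN block. This is only a sketch built on the same -F- splitting (so it shares the limitation above); the argument handling is an assumption, not part of the answer:
#!/usr/bin/awk -f
# userstat.awk - hypothetical sketch; run as ./userstat.awk u stat.txt
#                                       or   ./userstat.awk total stat.txt
BEGIN {
    FS = "-"           # fields: user - IP - file name - size
    mode = ARGV[1]     # "u" or "total"
    ARGV[1] = ""       # blank it out so awk does not treat it as a file name
}
mode == "u" { print $3 }    # list every downloaded file
            { sum += $4 }   # always accumulate the size column
END { if (mode == "total") print "total: " sum }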

awk match pattern from file

I have very large data sets in which I need to find specific patterns located in a specific column index and output the entire line. So far I have successfully managed a single command-line pattern match:
awk -F'|' -v OFS='|' '$1=="100002"{print $1,$22,$11,$12,$13,$28,$25,$27}' searchfile > outfile
100002 - the search pattern; it is an exact match and is located in column 1
searchfile - the data file, with 3.8 million lines and 60 columns, all | delimited
Now I want to modify this search by specifying an input pattern file, because I have a little over 800 patterns that need to be matched and output. I've done my best to search the site and did find the -f flag, but I don't know how to integrate that with my search criteria above. I need to be able to specify: an exact match, a specific column index to search, specific columns to output, and specific input/output delimiters.
sample data set (note this has been modified to protect data owner):
100001|0|60|100001|AAR Corp| | |Industrial|Aerospace/Defense|Aerospace/Defense-Equip|US|US|US|IL|DE|;2;6;1;1;1100 North Wood Dale Road;1; ;1;Wood Dale;1;IL;1;60191;1;United States;|
15460796|0|60|15460796|PayPal Data Services Inc|348546|eBay Inc|Consumer, Non-cyclical|Commercial Services|Inactive/Unknown|US|US|US|CA|DE|;2;6;1;1;2211 North 1st Street;1; ;1;San Jose;1;CA;1;95125;1;United States;|
100003|0|60|100003|Abex Inc|170435|Mafco Consolidated Group Inc|Industrial|Aerospace/Defense|Aerospace/Defense-Equip|US|US|US|NH|DE|;2;6;1;1;Liberty Lane;1; ;1;Hampton;1;NH;1;03842;1;United States;|
100004|0|60|100004|Abitibi-Consolidated Inc|23165941|Resolute Forest Products Inc|Basic Materials|Forest Products&Paper|Paper&Related Products|CA|CA|CA|QC|QC|;2;6;1;1;1155 Metcalfe Street;1;Suite 800;1;Montreal;1;QC;1;M5J 2P5;1;Canada;|
100005|0|60|100005|Acme Electric Corp|100763|Hubbell Inc|Industrial|Electrical Compo&Equip|Power Conv/Supply Equip|US|US|US|NC|NY|;2;6;1;1;400 Quaker Road;1; ;1;East Aurora;1;NY;1;14052;1;United States;|
100006|0|60|100006|ACME-Cleveland Corp|100430|Danaher Corp|Industrial|Hand/Machine Tools|Mach Tools&Rel Products|US|US|US|OH|OH|;2;6;1;1;30100 Chagrin Boulevard;1;Suite 100;1;Pepper Pike;1;OH;1;44124-5705;1;United States;|
100007|0|60|100007|Acuson Corp|196005|Siemens Corp|Consumer, Non-cyclical|Healthcare-Products|Ultra Sound Imaging Sys|US|US|US|CA|DE|;2;6;1;1;1220 Charleston Road;1; ;1;Mountain View;1;CA;1;94039;1;United States;|
100009|0|60|100009|ADT Ltd|101520|Tyco International Plc|Consumer, Non-cyclical|Commercial Services|Protection-Safety|BM|BM|BM| | |;2;6;1;1;Cedar House;1;41 Cedar Avenue;1;Hamilton;1; ;1;HM 12;1;Bermuda;|
100010|0|60|100010|Advanced Micro Devices Inc| | |Technology|Semiconductors|Electronic Compo-Semicon|US|US|US|CA|DE|;2;6;1;1;One AMD Place;1;PO Box 3453;1;Sunnyvale;1;CA;1;94088-3453;1;United States;|
input pattern search:
100006
100052
You can externalize all the variables from the script
$ awk -v sep='|' -v matchindex='1' -v matchvalue='100002' -v columns='1,22,11,12,13,28,25,27' \
  'BEGIN{FS=OFS=sep; n=split(columns,c,",")}
   $matchindex==matchvalue{for(i=1;i<n;i++)
     printf "%s",$c[i] OFS; printf "%s\n", $c[n]}' searchfile
and perhaps write another script to generate the first line from a config file.
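For the pattern-file part of the question, one way (a sketch, not the answer above) is to read the ~800 IDs with the same two-file FNR==NR idiom rather than -f, which loads awk code rather than data. Assuming the IDs sit one per line in a file called patterns.txt (a name chosen here for illustration), an exact match against column 1 could look like:
awk -F'|' -v OFS='|' '
    FNR == NR { want[$1]; next }   # first file: one exact ID per line
    $1 in want {                   # exact, column-1-only match
        print $1, $22, $11, $12, $13, $28, $25, $27
    }
' patterns.txt searchfile > outfile
This keeps the pattern list as plain data, preserves the exact-match and column-selection behaviour of the original one-liner, and still lets -F and OFS control the delimiters.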