How to replace strings in text with id from second text? - awk

I've got two CSV files. The first file contains organism family names and connection weight information, but I need to change the file's format to load it into programs like Gephi. I have created a second file where each family has an ID value. I haven't found a good example on this site of how to change the family names in the first file to the IDs from the second file. Example of my files:
$ cat edge_file.csv
Source,Target,Weight,Type,From,To
Argasidae,Alcaligenaceae,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
Argasidae,Burkholderiaceae,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
Argasidae,Methylophilaceae,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
Argasidae,Oxalobacteraceae,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
Argasidae,Rhodocyclaceae,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
Argasidae,Sphingomonadaceae,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
Argasidae,Zoogloeaceae,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
Argasidae,Agaricaceae,0.190482976,undirected,A_Argasidae,F_Agaricaceae
Argasidae,Bulleribasidiaceae,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
Argasidae,Camptobasidiaceae,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
Argasidae,Chrysozymaceae,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
Argasidae,Cryptococcaceae,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
$ cat id_file.csv
Id,Family
1,Argasidae
2,Buthidae
3,Alcaligenaceae
4,Burkholderiaceae
5,Methylophilaceae
6,Oxalobacteraceae
7,Rhodocyclaceae
8,Oppiidae
9,Sphingomonadaceae
10,Zoogloeaceae
11,Agaricaceae
12,Bulleribasidiaceae
13,Camptobasidiaceae
14,Chrysozymaceae
15,Cryptococcaceae
I basically want edge_file.csv to turn into the output below, where Source and Target have changed from family names to IDs.
Source,Target,Weight,Type,From,To
1,3,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
1,4,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
1,5,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
1,6,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
1,7,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
1,9,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
1,10,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
1,11,0.190482976,undirected,A_Argasidae,F_Agaricaceae
1,12,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
1,13,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
1,14,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
1,15,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
I haven't been able to figure it out with awk since I'm new to it, but I tried some variations from other examples here, such as (just testing it on the "Source" column):
awk 'NR==FNR{a[$1]=$1;next}{$1=a[$1];}1' edge_file.csv id_file.csv
Everything just prints out blank. My understanding is that I should create an array for the Source and Target columns in edge_file.csv, and then replace them with the first column from id_file.csv, which is the Id column. I can't get the syntax to work even for just one column.

You're close. This one-liner should help:
awk -F, -v OFS=',' 'NR==FNR{a[$2]=$1;next}{if($1 in a)$1=a[$1];if($2 in a)$2=a[$2]}1' id_file.csv edge_file.csv
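If the one-liner is hard to follow, here is the same logic written out with comments (a sketch of the same approach; any POSIX awk should do):
awk '
BEGIN { FS = OFS = "," }          # read and write comma-separated fields
NR == FNR {                       # first file: id_file.csv
    id[$2] = $1                   # map Family -> Id
    next
}
{                                 # second file: edge_file.csv
    if ($1 in id) $1 = id[$1]     # replace Source when a mapping exists
    if ($2 in id) $2 = id[$2]     # replace Target when a mapping exists
    print
}' id_file.csv edge_file.csv
The in checks also leave the header line alone, since Source and Target are not keys in the mapping.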

Related

awk: if pattern is matched append some data

I have a data set created by a tool, with the file name test.deg. The file contents are as follows:
1 I0.XPDIN1 1.581e-01 1.507e-01 3.662e-04 3.891e-02
2 I0.XPXA1 1.577e-01 1.502e-01 3.653e-04 3.859e-02
3 I0.XPXA2 1.538e-01 1.444e-01 3.552e-04 3.471e-02
I have a second file, test.spf, containing the following information:
XPDIN1 XPDIN1#d XPDIN1#g XPDIN1#s VPP
XPXA1 XPXA1#d XPXA1#g XPXA1#s VPP
XPXA2 XPXA2#d XPXA2#g XPXA2#s VPP
I am trying to write an awk script that matches the instance name from test.deg to the instance name in test.spf. When the script sees a match, I would like the 5th column's contents appended to the end of that matched instance name's line. Example output for I0.XPDIN1 in test.deg would be: XPDIN1 XPDIN1#d XPDIN1#g XPDIN1#s VPP 3.662e-04
The script needs to match the instance name from test.deg (after the I0. prefix) to the first instance name in test.spf, then add the 5th column's data.
Thanks,
Bad Awk
With GNU Awk:
$ awk 'FNR==NR{a[$2]=$5; next} ("I0."$1 in a){$6=a["I0."$1]}1' test.deg test.spf
XPDIN1 XPDIN1#d XPDIN1#g XPDIN1#s VPP 3.662e-04
XPXA1 XPXA1#d XPXA1#g XPXA1#s VPP 3.653e-04
XPXA2 XPXA2#d XPXA2#g XPXA2#s VPP 3.552e-04
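A commented breakdown of the same command, in case any part is unclear (a sketch of what each piece does):
awk '
FNR == NR {                # first file: test.deg
    deg[$2] = $5           # key: the I0.-prefixed instance name, value: 5th column
    next
}
("I0." $1) in deg {        # second file: test.spf; rebuild the prefixed name
    $6 = deg["I0." $1]     # append the stored value as a 6th field
}
1                          # print every line of test.spf
' test.deg test.spf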

Awk array, replace with full length matches of keys

I want to replace strings in a target file (target.txt) by strings in a lookup table (lookup.tab), which looks as follows.
Seq_1 Name_one
Seq_2 Name_two
Seq_3 Name_three
...
Seq_10 Name_ten
Seq_11 Name_eleven
Seq_12 Name_twelve
The target.txt file is a large file with a tree structure (Nexus format). It is not arranged in columns.
Therefore I use the following command:
awk 'FNR==NR { array[$1]=$2; next } { for (i in array) gsub(i, array[i]) }1' "lookup.tab" "target.txt"
Unfortunately, this command does not take the full length of the elements from the first column, so that Seq_1, Seq_10, Seq_11, Seq_12 end up as Name_one, Name_one0, Name_one1, Name_one2 etc...
How can the awk command be made more specific to correctly substitute the strings?
Please try this and see if it meets your need:
awk 'FNR==NR { le=length($1); a[le][$1]=$2; if (maxL<le) maxL=le; next } { for(le=maxL;le>0;le--) if(length(a[le])) for (i in a[le]) gsub(i, a[le][i]) }1' "lookup.tab" "target.txt"
It's based on your own attempt, but instead of replacing in the arbitrary order the array's keys are hashed in, it replaces using the longer keys first.
Based on your examples, I think that is enough to avoid wrong substitutions.
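Note that this uses arrays of arrays, so it needs GNU awk 4.0 or later. Written out with comments, the same idea looks like this (a sketch):
awk '
FNR == NR {                         # first file: lookup.tab
    le = length($1)
    a[le][$1] = $2                  # bucket each key by its length
    if (maxL < le) maxL = le        # remember the longest key length
    next
}
{                                   # second file: target.txt
    for (le = maxL; le > 0; le--)   # try the longest keys first
        if (le in a)
            for (i in a[le])
                gsub(i, a[le][i])   # note: keys are treated as regexes by gsub
}
1' lookup.tab target.txt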

print from match & process several input files

If you look over my questions from the past weeks, you'll see I have asked questions similar to this one. I had trouble asking in the expected format since I did not really know where my problems came from. E. Morton told me not to use a range expression, but I do not know exactly what that is. I found many questions like mine in this forum, with working answers.
Like: "How to print following line from a match" (e.g.)
But all solutions I found stop working when I process more than one input file. I need to process many.
I use this command:
gawk -f 1.awk print*.csv > new.txt
where 1.awk contains:
BEGIN { OFS=FS=";"
pattern="row4"
}
go {print} $0 ~ pattern {go = 1}
Input file 1, print1.csv, contains:
row1;something;in;this;row;;;;;;;
row2;something;in;this;row;;;;;;;
row3;something;in;this;row;;;;;;;
row4;don't;need;to;match;the;whole;line,;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
Input file 2, print2.csv, contains the same, just for illustration purposes.
The 1.awk script (and several other ways I found in this forum to print from a match) works for one file. Output:
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
BUT not when I process more input files.
Each time I process more than one input file this way, the awk commands 'to print from a match' seem to be ignored.
As said, I was told not to use a range expression. I do not know how, and maybe the problem is linked to the way I pass in several files?
Just reset your match indicator at the beginning of each file:
$ awk 'FNR==1{p=0} p; /row4/{p=1} ' file1 file2
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
UPDATE
From the comments:
"Is it possible to combine your awk with: if $1=="row5", then write "row5" into $6 and delete the value "row5" in $5? In other words, to move the content "row5" in column 1, if found there, to a new column 6? I could do this with another awk, but a combination into one would be nicer."
... $1=="row5"{$6=$5; $5=""} ...
or, if you want to use another field instead of $5, replace $5 with the corresponding field number.
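Combined with the command above, that might look like this (a sketch, assuming the fields are ;-separated as in the original 1.awk):
awk -F';' -v OFS=';' '
FNR == 1          { p = 0 }             # reset the match indicator per file
p && $1 == "row5" { $6 = $5; $5 = "" }  # move the 5th field into the 6th
p                                       # print once the indicator is set
/row4/            { p = 1 }             # set the indicator on the matching line
' print1.csv print2.csv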

Need to join lines in data file. Data lines are inconsistent

I need to join lines in an input data file using keys. Specifically, data line 1 will always start with 'C1', and data line 2 will always start with 'J1'.
There will not be a 'C1' in every case; some of the 'J1' lines exist without a preceding 'C1' line. The best case is that I can retrieve the 'C1 || J1' lines into one output file and the single 'J1' lines into a separate output file. I have searched this site for the answer for most of the day and it is not apparent to me.
Using an odd / even approach is out due to the single 'J1' lines.
I know nothing about Perl and we don't use it at work, so Perl is out.
I am sort of restricted to awk, sed.
OMG, I forgot the sample input and sample outputs. My bad.
Sample Input:
C111416655020090209IP
J1114166550SA0165235Z00000X295053911A
C112411158820060930OP
J1124111588DE2095332B00000X29650
J11241115887145143336C0003X296501145D
J11241115887814653336C0003X296501145D
C104327839320060503OP
J1043278393548223332B00000X295053424A
Sample Output file 1:
C111416655020090209IP J1114166550SA0165235Z00000X295053911A
C112411158820060930OP J1124111588DE2095332B00000X29650
C104327839320060503OP J1043278393548223332B00000X295053424A
Sample Output file 2:
J11241115887145143336C0003X296501145D
J11241115887814653336C0003X296501145D
A state machine is not likely an option here at work.
Thanks in advance.
Robert Hohlt, Jr.
You probably want something like this:
awk '/^J1/{print p, $0; p=""; next} {p=$0}' file
but until you post some sample input and expected output, it's just a rough guess.
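For what it's worth, with the sample input that was added later, a sketch that writes the two separate output files could look like this (out1.txt and out2.txt are assumed names):
awk '
/^C1/ { c = $0; next }                                  # remember the C1 line
/^J1/ {
    if (c != "") { print c, $0 > "out1.txt"; c = "" }   # C1 followed by its J1
    else         { print > "out2.txt" }                 # lone J1 line
}' file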

Updating Values on json file A using reference on file B - The return

OK, I should feel ashamed of this, but I'm unable to understand how awk works...
A few days ago I posted this question, which asks how to replace fields in file A using file B as a reference (both files have matching IDs for reference).
But after accepting the answer as correct (thanks, Ed!), I'm struggling with how to do it using the following pattern:
File A
{"test_ref":32132112321,"test_id":12345,"test_name":"","test_comm":"test", "null_test": "true"}
{"test_ref":32133321321,"test_id":12346,"test_name":"","test_comm":"test", "test_type": "alfa"}
{"test_ref":32132331321,"test_id":12347,"test_name":"","test_comm":"test", "test_val": 1923}
File B
{"test_id": 12345, "test_name": "Test values for null"}
{"test_id": 12346, "test_name": "alfa tests initiated"}
{"test_id": 12347, "test_name": "discard values"}
Expected result:
{"test_ref":32132112321,"test_id":12345,"test_name":"Test values for null","test_comm":"test", "null_test": "true"}
{"test_ref":32133321321,"test_id":12346,"test_name":"alfa tests initiated","test_comm":"test", "test_type": "alfa"}
{"test_ref":32132331321,"test_id":12347,"test_name":"discard values","test_comm":"test", "test_val": 1923}
I tried some alterations to the original solution, but without success. So, based on the question posted before, how could I achieve the same results with this new pattern?
PS: One important note: the lines in file A do not always have the same length.
Big Thanks in advance.
EDIT:
After trying the solution posted by Wintermute, it seems it doesn't work with lines such as:
{"test_ref":32132112321,"test_id":12345,"test_name":"","test_comm":"test", "null_test": "true","modifiers":[{"type":3,"value":31}{"type":4,"value":33}]}
Error received.
error: parse error: Expected separator between values at line xxx, column xxx
Parsing JSON with awk or sed is not a good idea for the same reasons that it's not a good idea to parse XML with them: sed works based on lines, and JSON is not line-based. awk works on vaguely tabular data, and JSON is not vaguely tabular. People don't expect their JSON tools to break when they insert newlines in benign places.
Instead, consider using a tool geared towards JSON processing, such as jq. In this particular case, you could use
jq -c -s 'group_by(.test_id) | map(.[0] + .[1]) | .[]' a.json b.json > c.json
Here jq slurps (-s) the input files into an array of JSON objects, groups these by test_id, merges them and unpacks the array. -c means compact output format, so each JSON object in the result ends up on a single line in the output.