Updating values in JSON file A using references from file B - the return - awk

OK, I should feel ashamed of this, but I'm unable to understand how awk works...
A few days ago I posted this question, which asks how to replace fields in file A using file B as a reference (both files have matching IDs).
But after accepting the answer as correct (thanks, Ed!), I'm struggling with how to do it using the following pattern:
File A
{"test_ref":32132112321,"test_id":12345,"test_name":"","test_comm":"test", "null_test": "true"}
{"test_ref":32133321321,"test_id":12346,"test_name":"","test_comm":"test", "test_type": "alfa"}
{"test_ref":32132331321,"test_id":12347,"test_name":"","test_comm":"test", "test_val": 1923}
File B
{"test_id": 12345, "test_name": "Test values for null"}
{"test_id": 12346, "test_name": "alfa tests initiated"}
{"test_id": 12347, "test_name": "discard values"}
Expected result:
{"test_ref":32132112321,"test_id":12345,"test_name":"Test values for null","test_comm":"test", "null_test": "true"}
{"test_ref":32133321321,"test_id":12346,"test_name":"alfa tests initiated","test_comm":"test", "test_type": "alfa"}
{"test_ref":32132331321,"test_id":12347,"test_name":"discard values","test_comm":"test", "test_val": 1923}
I tried some alterations to the original solution, but without success. So, based on the question posted before, how could I achieve the same results with this new pattern?
PS: One important note: the lines in file A do not always have the same length.
Big thanks in advance.
EDIT:
After trying the solution posted by Wintermute, it seems it doesn't work with lines like:
{"test_ref":32132112321,"test_id":12345,"test_name":"","test_comm":"test", "null_test": "true","modifiers":[{"type":3,"value":31}{"type":4,"value":33}]}
Error received:
error: parse error: Expected separator between values at line xxx, column xxx

Parsing JSON with awk or sed is not a good idea for the same reasons that it's not a good idea to parse XML with them: sed operates on lines, and JSON is not line-based; awk operates on vaguely tabular data, and JSON is not vaguely tabular. People don't expect their JSON tools to break when they insert newlines in benign places.
Instead, consider using a tool geared towards JSON processing, such as jq. In this particular case, you could use
jq -c -s 'group_by(.test_id) | map(.[0] + .[1]) | .[]' a.json b.json > c.json
Here jq slurps (-s) the input files into a single array of JSON objects, groups these by test_id, merges each pair of objects, and unpacks the array. -c means compact output format, so each JSON object in the result ends up on a single line of the output.
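Two details are worth checking against the samples. Object addition in jq is right-biased, so when the object from b.json comes second in a group, its non-empty test_name overwrites the empty one from a.json. A quick way to convince yourself (an illustrative command, not part of the original answer):
jq -n '{"test_id":12345,"test_name":""} + {"test_id":12345,"test_name":"Test values for null"}'
As for the edit: the quoted modifiers line does not appear to be valid JSON in the first place, because there is no comma between the two objects inside the array. That alone would produce jq's "Expected separator between values" parse error; the input (or whatever produces it) has to be repaired before any strict JSON tool will accept it.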

Related

How to format xmllint xpath result output?

I have the problem that I get an XML document from somewhere, am interested only in a subset of the information it contains, and want the output to be readable; but I cannot get xmllint to format the result of an XPath evaluation when more than one element comes out of it. See this minimal example:
<currentMeasurements>
<datetime>2022-08-18 00:00:00</datetime>
<sensor>
<type>temperature</type>
<name>Behind the garage</name>
<value>10.5</value>
</sensor>
<sensor>
<type>noise</type>
<name>In the classroom</name>
<value>POSITIVE_INFINITY</value>
</sensor>
<sensor>
<type>temperature</type>
<name>In the garage</name>
<value>11.0</value>
</sensor>
</currentMeasurements>
Of course, when I fetch that data from some server, it just gives me a single line:
<currentMeasurements><datetime>2022-08-18 00:00:00</datetime><sensor><type>temperature</type><name>Behind the garage</name><value>10.5</value></sensor><sensor><type>noise</type><name>In the classroom</name><value>POSITIVE_INFINITY</value></sensor><sensor><type>temperature</type><name>In the garage</name><value>11.0</value></sensor></currentMeasurements>
Pretty-printing it on the command line is easy (the following commands assume that the long line has been copied to the clipboard and is accessible as /dev/clipboard):
cat /dev/clipboard | xmllint --format -
That gives me a formatted string (like the minimal example above). But I only want a subset of the data, which I can get with an XPath expression. For example, if I am not interested in any noise but only in temperatures, I can do this:
cat /dev/clipboard | xmllint --xpath "//type[text() = 'temperature']/.." -
This works; however, it doesn't format the output, which makes the result unreadable (especially once the data is no longer minimal, of course):
<sensor><type>temperature</type><name>Behind the garage</name><value>10.5</value></sensor>
<sensor><type>temperature</type><name>In the garage</name><value>11.0</value></sensor>
Even when providing --format as well, the output is not formatted. And when I pipe the result into another xmllint --format -, it complains that there is Extra content at the end of the document; of course, the XPath result does not have one single root.
So my question could be phrased as any of the following:
How can I format the result of an XPath evaluation with xmllint?
How can I format XML input with "more than one root" with xmllint?
How can I wrap a root node around an XPath evaluation result?
My only solution so far is to wrap the xmllint call in a subshell and add printf statements around it, but I think it can be done more elegantly:
cat /dev/clipboard | (printf "<myNewRoot>"; xmllint --format --xpath "//type[text() = 'temperature']/.." - ; printf "</myNewRoot>") | xmllint --format -
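A slightly tidier packaging of the same wrap-a-root-around-it trick is a small shell function; this is just a sketch, and the function name xpath_format and the <result> wrapper element are made up:
# Run an XPath query on stdin and pretty-print the (possibly multi-rooted) result.
xpath_format() {
    { printf '<result>'; xmllint --xpath "$1" -; printf '</result>'; } | xmllint --format -
}
cat /dev/clipboard | xpath_format "//type[text() = 'temperature']/.."
The outer xmllint --format then sees a single well-formed document, at the cost of the artificial <result> root appearing in the output.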

How to replace strings in text with IDs from a second text?

I've got two CSV files. The first file contains organism family names and connection-weight information, but I need to change the format of the file to load it into other programs, like Gephi. I have created a second file in which each family has an ID value. I haven't found a good example on this site of how to change the family names in the first file to the IDs from the second file. Example of my files:
$ cat edge_file.csv
Source,Target,Weight,Type,From,To
Argasidae,Alcaligenaceae,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
Argasidae,Burkholderiaceae,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
Argasidae,Methylophilaceae,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
Argasidae,Oxalobacteraceae,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
Argasidae,Rhodocyclaceae,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
Argasidae,Sphingomonadaceae,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
Argasidae,Zoogloeaceae,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
Argasidae,Agaricaceae,0.190482976,undirected,A_Argasidae,F_Agaricaceae
Argasidae,Bulleribasidiaceae,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
Argasidae,Camptobasidiaceae,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
Argasidae,Chrysozymaceae,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
Argasidae,Cryptococcaceae,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
$ cat id_file.csv
Id,Family
1,Argasidae
2,Buthidae
3,Alcaligenaceae
4,Burkholderiaceae
5,Methylophilaceae
6,Oxalobacteraceae
7,Rhodocyclaceae
8,Oppiidae
9,Sphingomonadaceae
10,Zoogloeaceae
11,Agaricaceae
12,Bulleribasidiaceae
13,Camptobasidiaceae
14,Chrysozymaceae
15,Cryptococcaceae
I basically want the edge_file.csv output to turn into the output below, where Source and Target have changed from family names to IDs.
Source,Target,Weight,Type,From,To
1,3,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
1,4,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
1,5,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
1,6,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
1,7,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
1,9,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
1,10,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
1,11,0.190482976,undirected,A_Argasidae,F_Agaricaceae
1,12,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
1,13,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
1,14,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
1,15,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
I haven't been able to figure it out with awk since I'm new to it, but I tried some variations from other examples here such as (just testing it out for the "Source" column):
awk 'NR==FNR{a[$1]=$1;next}{$1=a[$1];}1' edge_file.csv id_file.csv
Everything just prints out blank. My understanding is that I should build an array from id_file.csv and then replace the Source and Target columns in edge_file.csv with the corresponding Id values, but I can't get the syntax to work even for just one column.
You're close. This one-liner should help (the FNR>1 guard leaves the header line untouched):
awk -F, -v OFS=',' 'NR==FNR{a[$2]=$1;next} FNR>1{$1=a[$1];$2=a[$2]}1' id_file.csv edge_file.csv
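Spelled out with comments, the same program reads like this (an illustrative expansion, not a different script):
awk -F, -v OFS=',' '
    NR==FNR { a[$2] = $1; next }        # first file (id_file.csv): map family name -> Id
    FNR>1   { $1 = a[$1]; $2 = a[$2] }  # second file: swap Source and Target names for Ids,
                                        # skipping the header line
    1                                   # condition "1" is always true: print the line
' id_file.csv edge_file.csv
Note the file order on the command line: id_file.csv must come first so the lookup array is fully built before edge_file.csv is processed; among other issues, your attempt had the two files swapped, which is why everything printed blank.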

Plink. Error: No people remaining after --keep

I'm trying to subset individuals from the ACB population from the file named allconcat39.vcf, using Plink 1.9. For that, I created a tab-delimited text file in R called indACB.txt, which looks like this:
head indACB.txt
684_HG01879 684_HG01879
685_HG01880 685_HG01880
686_HG01882 686_HG01882
687_HG01883 687_HG01883
688_HG01885 688_HG01885
689_HG01886 689_HG01886
690_HG01889 690_HG01889
691_HG01890 691_HG01890
694_HG01894 694_HG01894
695_HG01896 695_HG01896
when I run the following code:
./plink --vcf allconcat39.vcf --keep indACB.txt --recode --out allconcat39ACB
the following error occurs:
Error: No people remaining after --keep.
I made sure that the VCF and the indACB.txt file had compatible individual IDs and sample IDs. I don't know where else the problem could be. Any thoughts? Thank you in advance!
It was solved in another forum by Christopher Chang: Add --double-id to your command line; otherwise plink treats '_' as a delimiter between the FID and IID.
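Applied to the command from the question, that becomes (same files, just with the extra flag):
./plink --vcf allconcat39.vcf --double-id --keep indACB.txt --recode --out allconcat39ACB
With --double-id, the whole sample ID (e.g. 684_HG01879) is used as both FID and IID, which then matches the two identical columns in indACB.txt.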

print from match & process several input files

When you scrutinize my questions from the past weeks, you will find that I have asked questions similar to this one. I had problems asking in the demanded format since I did not really know where my problems came from. E. Morton told me not to use range expressions; well, I do not know exactly what those are. I found many questions in this forum like mine, with working answers.
Like "How to print following line from a match", for example.
But all the solutions I found stop working when I process more than one input file, and I need to process many.
I use this command:
gawk -f 1.awk print*.csv > new.txt
where 1.awk contains:
BEGIN { OFS=FS=";"
pattern="row4"
}
go {print} $0 ~ pattern {go = 1}
input file 1 print1.csv contains:
row1;something;in;this;row;;;;;;;
row2;something;in;this;row;;;;;;;
row3;something;in;this;row;;;;;;;
row4;don't;need;to;match;the;whole;line,;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
Input file 2, print2.csv, contains the same content, just for illustration purposes.
The script 1.awk (and several other ways I found in this forum to print from a match) works for one file. Output:
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
BUT not when I process more input files.
Each time I process more than one input file this way, the awk commands 'to print from a match' seem to be ignored.
As said, I was told not to use range expressions. I do not know how to avoid them, and maybe the problem is linked to the way I input several files?
Just reset your match indicator at the beginning of each file:
$ awk 'FNR==1{p=0} p; /row4/{p=1} ' file1 file2
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
UPDATE
From the comments
is it possible to combine your awk with: if $1=="row5", then write the value into $6 and delete the value in $5? In other words, to move the content, if found there, to a new column 6? I could do this with another awk, but a combination into one would be nicer
... $1=="row5"{$6=$5; $5=""} ...
or, if you want to use another field instead of $5, replace $5 with the corresponding field number.
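Put together into a single invocation (a sketch; FS and OFS are set to ';' to match the sample data, and the field move is placed before the print so that matched lines are emitted already modified):
awk 'BEGIN{FS=OFS=";"} FNR==1{p=0} $1=="row5"{$6=$5; $5=""} p; /row4/{p=1}' file1 file2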

Need to join lines in data file. Data lines are inconsistent

I need to join lines in an input data file using keys. Specifically, data line 1 will always start with 'C1', and data line 2 will always start with 'J1'.
There will not be a 'C1' in every case; some of the 'J1' lines exist without a preceding 'C1' line. Best case, I can retrieve the 'C1 || J1' lines into one output file and the standalone 'J1' lines into a separate output file. I have searched this site for the answer for most of the day and it is not apparent to me.
Using an odd / even approach is out due to the single 'J1' lines.
I know nothing about Perl and we don't use it at work, so Perl is out.
I am sort of restricted to awk, sed.
OMG Forgot the Sample input and Sample outputs. My bad.
Sample Input:
C111416655020090209IP
J1114166550SA0165235Z00000X295053911A
C112411158820060930OP
J1124111588DE2095332B00000X29650
J11241115887145143336C0003X296501145D
J11241115887814653336C0003X296501145D
C104327839320060503OP
J1043278393548223332B00000X295053424A
Sample Output file 1:
C111416655020090209IP J1114166550SA0165235Z00000X295053911A
C112411158820060930OP J1124111588DE2095332B00000X29650
C104327839320060503OP J1043278393548223332B00000X295053424A
Sample Output file 2:
J11241115887145143336C0003X296501145D
J11241115887814653336C0003X296501145D
A state-machine approach is not likely an option here at work.
Thanks in advance.
Robert Hohlt, Jr.
You probably want something like this:
awk '/^J1/{print p, $0; p=""; next} {p=$0}' file
but until you post some sample input and expected output, it's just a rough guess.
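Given the samples that were added to the question later, a sketch that also splits the result into the two requested output files (the names pairs.txt and singles.txt are made up):
awk '
    /^C1/ { c = $0; next }                                     # remember the pending C1 line
    /^J1/ { if (c != "") { print c, $0 > "pairs.txt"; c = "" } # first J1 after a C1: join them
            else           print > "singles.txt" }             # J1 with no pending C1
' input.txt
On the sample input this writes the three joined 'C1 J1' lines to pairs.txt and the two standalone 'J1' lines to singles.txt.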