Need to join lines in a data file; data lines are inconsistent - awk

I need to join lines in an input data file using keys. Data line 1 will always start with 'C1' and data line 2 will always start with 'J1'.
There will not be a 'C1' in every case; some 'J1' lines exist without a preceding 'C1' line. Best case, I can write the 'C1 || J1' pairs to one output file and the lone 'J1' lines to a separate output file. I have searched this site for most of the day and the answer is not apparent to me.
Using an odd/even approach is out because of the single 'J1' lines.
I know nothing about Perl and we don't use it at work, so Perl is out.
I am more or less restricted to awk and sed.
Edit: I forgot the sample input and sample outputs. My bad.
Sample Input:
C111416655020090209IP
J1114166550SA0165235Z00000X295053911A
C112411158820060930OP
J1124111588DE2095332B00000X29650
J11241115887145143336C0003X296501145D
J11241115887814653336C0003X296501145D
C104327839320060503OP
J1043278393548223332B00000X295053424A
Sample Output file 1:
C111416655020090209IP J1114166550SA0165235Z00000X295053911A
C112411158820060930OP J1124111588DE2095332B00000X29650
C104327839320060503OP J1043278393548223332B00000X295053424A
Sample Output file 2:
J11241115887145143336C0003X296501145D
J11241115887814653336C0003X296501145D
A state machine is not really an option here at work.
Thanks in advance.
Robert Hohlt, Jr.

You probably want something like this:
awk '/^J1/{print p, $0; p=""; next} {p=$0}' file
but until you post some sample input and expected output, it's just a rough guess.
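Given the sample input and outputs that were added above, a slightly longer sketch can split the paired and lone lines into the two requested files (the input and output file names below are placeholders):

```shell
# Recreate the sample input from the question.
cat > input.txt <<'EOF'
C111416655020090209IP
J1114166550SA0165235Z00000X295053911A
C112411158820060930OP
J1124111588DE2095332B00000X29650
J11241115887145143336C0003X296501145D
J11241115887814653336C0003X296501145D
EOF

awk '
  /^C1/ { c = $0; next }                             # remember the pending C1
  /^J1/ {
    if (c != "") { print c, $0 > "out1.txt"; c = "" }  # C1 + first J1 pair
    else         { print > "out2.txt" }                # lone J1 line
  }
' input.txt
```

Each C1 line pairs with exactly the first J1 that follows it; any further J1 lines before the next C1 fall through to the second file, which matches the sample outputs.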

How to replace strings in a text file with ids from a second file?

I've got two CSV files. The first file contains organism family names and connection-weight information, but I need to change the file's format to load it into programs like Gephi. I have created a second file where each family has an ID value. I haven't found a good example on this site of how to change the family names in the first file to the ids from the second file. Example of my files:
$ cat edge_file.csv
Source,Target,Weight,Type,From,To
Argasidae,Alcaligenaceae,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
Argasidae,Burkholderiaceae,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
Argasidae,Methylophilaceae,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
Argasidae,Oxalobacteraceae,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
Argasidae,Rhodocyclaceae,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
Argasidae,Sphingomonadaceae,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
Argasidae,Zoogloeaceae,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
Argasidae,Agaricaceae,0.190482976,undirected,A_Argasidae,F_Agaricaceae
Argasidae,Bulleribasidiaceae,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
Argasidae,Camptobasidiaceae,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
Argasidae,Chrysozymaceae,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
Argasidae,Cryptococcaceae,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
$ cat id_file.csv
Id,Family
1,Argasidae
2,Buthidae
3,Alcaligenaceae
4,Burkholderiaceae
5,Methylophilaceae
6,Oxalobacteraceae
7,Rhodocyclaceae
8,Oppiidae
9,Sphingomonadaceae
10,Zoogloeaceae
11,Agaricaceae
12,Bulleribasidiaceae
13,Camptobasidiaceae
14,Chrysozymaceae
15,Cryptococcaceae
I basically want the edge_file.csv output to turn into the output below, where Source and Target have changed from family names to ids instead.
Source,Target,Weight,Type,From,To
1,3,0.040968439,undirected,A_Argasidae,B_Alcaligenaceae
1,4,0.796351574,undirected,A_Argasidae,B_Burkholderiaceae
1,5,0.276912259,undirected,A_Argasidae,B_Methylophilaceae
1,6,0.460508445,undirected,A_Argasidae,B_Oxalobacteraceae
1,7,0.764558003,undirected,A_Argasidae,B_Rhodocyclaceae
1,9,0.70198002,undirected,A_Argasidae,B_Sphingomonadaceae
1,10,0.034648156,undirected,A_Argasidae,B_Zoogloeaceae
1,11,0.190482976,undirected,A_Argasidae,F_Agaricaceae
1,12,0.841600859,undirected,A_Argasidae,F_Bulleribasidiaceae
1,13,0.841600859,undirected,A_Argasidae,F_Camptobasidiaceae
1,14,0.190482976,undirected,A_Argasidae,F_Chrysozymaceae
1,15,0.055650172,undirected,A_Argasidae,F_Cryptococcaceae
I haven't been able to figure it out with awk since I'm new to it, but I tried some variations of other examples here, such as (just testing it for the "Source" column):
awk 'NR==FNR{a[$1]=$1;next}{$1=a[$1];}1' edge_file.csv id_file.csv
Everything just prints out blank. My understanding is that I should build an array from id_file.csv mapping each family name to its Id, and then replace the Source and Target values in edge_file.csv with the corresponding ids. I can't get the syntax to work even for just one column.
You're close, but three things matter here: the id file has to be read first, the lookup array should be keyed by family name (a[$2]=$1), and you need -F, so the files are split on commas. This one-liner should help (the FNR>1 guard leaves the header line untouched):
awk -F, -v OFS=',' 'NR==FNR{a[$2]=$1;next} FNR>1{$1=a[$1];$2=a[$2]}1' id_file.csv edge_file.csv
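As a quick end-to-end check, here's the two-pass approach run on a trimmed-down version of the files (the rows below are a subset of the samples above):

```shell
cat > id_file.csv <<'EOF'
Id,Family
1,Argasidae
3,Alcaligenaceae
4,Burkholderiaceae
EOF

cat > edge_file.csv <<'EOF'
Source,Target,Weight
Argasidae,Alcaligenaceae,0.040968439
Argasidae,Burkholderiaceae,0.796351574
EOF

# First pass builds the name -> id map from id_file.csv; second pass
# rewrites columns 1 and 2 of edge_file.csv, leaving the header alone.
awk -F, -v OFS=',' 'NR==FNR{a[$2]=$1;next} FNR>1{$1=a[$1];$2=a[$2]}1' \
    id_file.csv edge_file.csv
```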

Keeping lines in a file that do not match a list of words in another file

I want to keep the lines in a file that do not match a list of words kept in another file (not whole-line matches). For a toy dataset, I have created a list_file.txt which contains:
BGC0001184
BGC0000853
And a large_file.txt that contains:
contig com1_25_species_1.25M_idxstats.txt
BGC0000853 0
BGC0000853 14
BGC0000853 2
BGC0000854 6
BGC0001185 7
BGC0001185 13
BGC0001184 31
BGC0001186 11
BGC0001184 31
BGC0001184 31
And I use grep as follows:
grep -vf list_file.txt large_file.txt
All good. I get the desired output:
contig com1_25_species_1.25M_idxstats.txt
BGC0000854 6
BGC0001185 7
BGC0001185 13
BGC0001186 11
Now, when I try to apply the same to my real dataset (same format, larger files), it's not working.
What am I missing here? Please let me know if you have any awk/sed suggestions.
Thanks.
Links to my large dataset files are below:
List File:
https://drive.google.com/file/d/14wa6iopzgZUz56C8a3eWRvLPyU_PkCMK/view?usp=sharing
Large File:
https://drive.google.com/file/d/1O3LYE15o9wJmMmsdxcb4xzjaIw1E9VYa/view?usp=sharing
For your shown samples, try the following, written and tested with GNU awk. The sub(/\r$/,"") strips Windows-style carriage returns; stray \r characters in one of the files are a common reason an approach that works on a hand-typed toy sample fails on the real data.
awk '{sub(/\r$/,"")} FNR==NR{arr[$0];next} !($1 in arr)' list_file.txt large_file.txt
Second solution: in case your values are not exactly the same (the first fields of the two files) and you want partial matching, try the following.
awk '{sub(/\r$/,"")} FNR==NR{arr[$0];next} {for(i in arr){if(index($0,i)){next}}} 1' list_file.txt large_file.txt
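For reference, here is a small reproduction of the most common culprit: the list file saved with Windows (CRLF) line endings. The file contents below are made up to mirror the toy data:

```shell
# list_file.txt with CRLF endings: each pattern ends in an invisible \r,
# so grep -vf no longer matches anything in the (LF-only) large file.
printf 'BGC0001184\r\nBGC0000853\r\n' > list_file.txt
printf 'contig x\nBGC0000853 0\nBGC0000854 6\nBGC0001184 31\n' > large_file.txt

grep -vf list_file.txt large_file.txt    # filters nothing: all 4 lines out

# Stripping the \r first restores the expected behaviour: 2 lines out.
awk '{sub(/\r$/,"")} FNR==NR{arr[$0];next} !($1 in arr)' list_file.txt large_file.txt
```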

print from match & process several input files

If you scrutinize my questions from the past weeks, you'll find I have asked questions similar to this one. I had trouble asking in the requested format since I did not really know where my problems came from. E. Morton tells me not to use range expressions; well, I do not know exactly what they are. I found in this forum many questions like mine, with working answers.
Like: "How to print following line from a match" (e.g.)
But all the solutions I found stop working when I process more than one input file, and I need to process many.
I use this command:
gawk -f 1.awk print*.csv > new.txt
while 1.awk contains:
BEGIN { OFS=FS=";"
pattern="row4"
}
go {print} $0 ~ pattern {go = 1}
input file 1 print1.csv contains:
row1;something;in;this;row;;;;;;;
row2;something;in;this;row;;;;;;;
row3;something;in;this;row;;;;;;;
row4;don't;need;to;match;the;whole;line,;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
Input file 2, print2.csv, contains the same data, just for illustration purposes.
The 1.awk script (and several other ways I found in this forum to print from a match) works for one file. Output:
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
BUT not when I process more input files.
Each time I process more than one input file this way, the awk commands 'to print from a match' seem to be ignored.
As said, I was told not to use range expressions. I do not know why, and maybe the problem is linked to the way I pass in several files?
Just reset your match indicator at the beginning of each file:
$ awk 'FNR==1{p=0} p; /row4/{p=1} ' file1 file2
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
row5;something;in;this;row;;;;;;;
row6;something;in;this;row;;;;;;;
row7;something;in;this;row;;;;;;;
row8;something;in;this;row;;;;;;;
row9;something;in;this;row;;;;;;;
row10;something;in;this;row;;;;;;;
UPDATE
From the comments:
"Is it possible to combine your awk with: if $1=="row5", then write the value into $6 and delete the value in $5? In other words, to move the content of column 5, if found there, to a new column 6? I could do this with another awk, but a combination into one would be nicer."
... $1=="row5"{$6=$5; $5=""} ...
or, if you want to use a different field instead of $5, replace $5 with the corresponding field number.
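Putting the per-file reset together with the field move from the update, a combined sketch (with two tiny made-up semicolon-separated files) could look like this:

```shell
# Two identical sample files; "MOVE" sits in field 5 of the row5 line.
printf 'row4;x\nrow5;a;b;c;MOVE;\nrow6;x\n' > print1.csv
cp print1.csv print2.csv

awk -F';' -v OFS=';' '
  FNR==1 { p = 0 }                        # reset the flag for each new file
  p && $1 == "row5" { $6 = $5; $5 = "" }  # move field 5 into field 6
  p                                       # print once the flag is set
  /row4/ { p = 1 }                        # start printing after the match
' print1.csv print2.csv
```

The field-move rule has to come before the bare p rule so the record is rewritten before it is printed.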

Updating values in JSON file A using a reference in file B - the return

OK, I should feel ashamed of this, but I'm unable to understand how awk works...
A few days ago I posted this question, which asks how to replace fields in file A using file B as a reference (both files have matching IDs for reference).
But after accepting the answer as correct (thanks, Ed!), I'm struggling with how to do it using the following pattern:
File A
{"test_ref":32132112321,"test_id":12345,"test_name":"","test_comm":"test", "null_test": "true"}
{"test_ref":32133321321,"test_id":12346,"test_name":"","test_comm":"test", "test_type": "alfa"}
{"test_ref":32132331321,"test_id":12347,"test_name":"","test_comm":"test", "test_val": 1923}
File B
{"test_id": 12345, "test_name": "Test values for null"}
{"test_id": 12346, "test_name": "alfa tests initiated"}
{"test_id": 12347, "test_name": "discard values"}
Expected result:
{"test_ref":32132112321,"test_id":12345,"test_name":"Test values for null","test_comm":"test", "null_test": "true"}
{"test_ref":32133321321,"test_id":12346,"test_name":"alfa tests initiated","test_comm":"test", "test_type": "alfa"}
{"test_ref":32132331321,"test_id":12347,"test_name":"discard values","test_comm":"test", "test_val": 1923}
I tried some variations of the original solution, but without success. So, based on the question posted before, how could I achieve the same results with this new pattern?
PS: One important note: the lines in file A do not always have the same length.
Big thanks in advance.
EDIT:
After trying the solution posted by Wintermute, it seems it doesn't work with lines like:
{"test_ref":32132112321,"test_id":12345,"test_name":"","test_comm":"test", "null_test": "true","modifiers":[{"type":3,"value":31}{"type":4,"value":33}]}
Error received.
error: parse error: Expected separator between values at line xxx, column xxx
Parsing JSON with awk or sed is not a good idea for the same reasons that it's not a good idea to parse XML with them: sed works based on lines, and JSON is not line-based. awk works on vaguely tabular data, and JSON is not vaguely tabular. People don't expect their JSON tools to break when they insert newlines in benign places.
Instead, consider using a tool geared towards JSON processing, such as jq. In this particular case, you could use
jq -c -s 'group_by(.test_id) | map(.[0] + .[1]) | .[]' a.json b.json > c.json
Here jq slurps (-s) the input files into an array of JSON objects, groups them by test_id, merges each pair, and unpacks the array. -c means compact output format, so each JSON object in the result ends up on a single line in the output.
As for the parse error in the edit: that sample line is itself malformed JSON. The objects inside the "modifiers" array are missing a separating comma ("value":31}{"type":4 should be "value":31},{"type":4), and that is exactly what "Expected separator between values" is complaining about. Fix the input and the jq command above should work.

How can I remove lines from a file with more than a certain number of entries

I've looked at the similar question about removing lines with more than a certain number of characters, and my problem is similar but a bit trickier. I have a file that is generated after analyzing some data, and each line is supposed to contain 29 numbers. For example:
53.0399 0.203827 7.28285 0.0139936 129.537 0.313907 11.3814 0.0137903 355.008 0.160464 12.2717 0.120802 55.7404 0.0875189 11.3311 0.0841887 536.66 0.256761 19.4495 0.197625 46.4401 2.38957 15.8914 17.1149 240.192 0.270649 19.348 0.230402 23001028 23800855
53.4843 0.198886 7.31329 0.0135975 129.215 0.335697 11.3673 0.014766 355.091 0.155786 11.9938 0.118147 55.567 0.368255 11.449 0.0842612 536.91 0.251735 18.9639 0.184361 47.2451 0.119655 18.6589 0.592563 240.477 0.298805 20.7409 0.254856 23001585
50.7302 0.226066 7.12251 0.0158698 237.335 1.83226 15.4057 0.059467 -164.075 5.14639 146.619 1.37761 55.6474 0.289037 11.4864 0.0857042 536.34 0.252356 19.391 0.198221 46.7011 0.139855 20.1464 0.668163 240.664 0.284125 20.3799 0.24696 23002153
But every once in a while a line appears, like the first one above, that has an extra 8-digit number at the end. It comes from analyzing an empty file (the analysis just returns the file ID number, but not on a new line like it should). So I want to find the lines that have this extra 30th entry and remove just that 30th entry. I figure I could do this with awk, but since I have little experience with it I'm not sure how. Any help is appreciated.
Thanks
Summary: I want to find lines in a text file with an extra entry and remove that last extra entry, so all rows have the same number of entries.
With awk, you can tell it how many fields there are per record; the extras are dropped:
awk '{NF = 29; print}' filename
If you want to save that back to the file, you have to do a little extra work
awk '{NF = 29; print}' filename > filename.new && mv filename.new filename
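As a runnable illustration (scaled down to 5 fields instead of 29, with made-up numbers): assigning to NF rebuilds the record with only that many fields. Note that decreasing NF is technically undefined by POSIX, but GNU awk and most modern implementations handle it as shown.

```shell
# One line with an extra 6th entry, one line that is already correct.
printf '1 2 3 4 5 99999999\n6 7 8 9 10\n' > data.txt

awk '{NF = 5; print}' data.txt   # extra entry dropped; short line untouched
```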