concatenating string with multiple array - awk

I'm trying to rearrange values from a specific string into their respective columns.
Here is the input
String 1: 47/13528
String 2: 55(s)
String 3:
String 4: 114(n)
String 5: 225(s), 26/10533-10541
String 6: 103/13519
String 7: 10(s), 162(n)
String 8: 152/12345,12346
(d=dead, n=null, s=strike)
The alphabet in each value is the flag (d=dead, n=null, s=strike).
The value of a String without a flag, such as "String 1", becomes 47c1 (the value, then c, then the string number), e.g.:
String 1: 47/13528
value without any flag will be sorted into the null column along with null tag (n)
String 1 (the integer will be concatenated with 47/13528)
Sorted :
null
47c1#SP13528;114c4;103c6#SP13519;162c7
Str#2: 55(s)
flagged with (s) will be sorted into strike column
Sorted :
strike
55c2;225c5;26c5#SP10533-10541;162c7
I'm trying to parse it by modifying previous code, but with no luck:
{
for (i=1; i<=NF; i++) {
num = $i+0
abbr = $i
gsub(/[^[:alpha:]]/,"",abbr)
list[abbr] = list[abbr] num " c " val ORS
}
}
END {
n = split("dead null strike",types)
for (i=1; i<=n; i++) {
name = types[i]
abbr = substr(name,1,1)
printf "name,list[abbr]\n"
}
}
Expected Output (sorted into csv) :
dead,null,strike
,47c1#SP13528;114c4; 26c5#SP10533-10541;103c6#SP13519;162c7, 152c8#SP12345;152c8#SP12346,55c2;225c5;162c7;10c7
Breakdown for crosscheck purpose:
dead
none
null
47c1#SP13528;114c4;103c6#SP13519;162c7;152c8#SP12345;152c8#SP12346;26c5#SP10533-10541;;162c7
strike
55c2;225c5;10c7

Here is an awk script for parsing your file.
BEGIN {
types["d"]; types["n"]; types["s"]
deft = "n"; OFS = ","; sep = ";"
}
$1=="String" {
gsub(/[)(]/,""); gsub(",", " ") # general line subs
for (i=3;i<=NF;i++) {
if (!gsub("/","c"$2+0"#SP", $i)) $i = $i"c"$2+0 # make all subs on items
for (t in types) { if (gsub(t, "", $i)) { x=t; break }; x=deft } #find type
items[x] = items[x]? items[x] sep $i: $i # append for type found
}
}
END {
print "dead" OFS "null" OFS "strike"
print items["d"] OFS items["n"] OFS items["s"]
}
Input:
String 1: 47/13528
String 2: 55(s)
String 3:
String 4: 114(n)
String 5: 225(s), 26/10533-10541
String 6: 103/13519
String 7: 10(s), 162(n)
String 8: 152/12345,12346
(d=dead, n=null, s=strike)
Output:
> awk -f tst.awk file
dead,null,strike
,47c1#SP13528;114c4;26c5#SP10533-10541;103c6#SP13519;162c7;152c8#SP12345;12346c8,55c2;225c5;10c7
Your description kept changing on important details, like how the type of an item is decided or how items are separated, and so far your inputs and outputs are not consistent with it, but in general I think you can easily see what this script does. Keep in mind that gsub() returns the number of substitutions made while also performing them, so it is often convenient to use it as a condition.
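To illustrate that last point (a minimal sketch, not part of the original post): gsub() edits its target and returns how many substitutions it made, so the return value can drive a condition directly.

```shell
# gsub() returns the number of substitutions it made, so it can be
# used both to edit a field and to test whether anything matched.
echo '225(s), 26/10533-10541' | awk '{
    if (gsub(/s/, "", $1))      # edits $1 and returns 1 -> true
        print "flag found, item is now " $1
    else
        print "no flag in " $1
}'
```

This is the same trick the script above relies on in its `if (!gsub("/","c"$2+0"#SP", $i))` line.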

My usual approach is:
First preprocess the data so that each piece of information is on its own line.
Then preprocess it so that each piece of information sits in its own column, row by row.
Then it's easy: just accumulate the columns in some awk array and print them.
The following code:
cat <<EOF |
String 1: 47/13528
String 2: 55(s)
String 3:
String 4: 114(n)
String 5: 225(s), 26/10533-10541
String 6: 103/13519
String 7: 10(s), 162(n)
String 8: 152/12345,12346
(d=dead, n=null, s=strike)
EOF
sed '
# filter only lines with String
/^String \([0-9]*\): */!d;
# Remove the String
# Remove the : and spaces
s//\1 /
# remove trailing spaces
s/ *$//
# Remove lines with nothing
/^[0-9]* *$/d
# remove the commas and split lines on comma
# by moving them to separate lines
# repeat that until a comma is found
: a
/\([0-9]*\) \(.*\), *\(.*\)/{
s//\1 \2\n\1 \3/
ba
}
' | sed '
# we should be having two fields here
# separated by a single space
/^[^ ]* [^ ]*$/!{
s/.*/ERROR: "&"/
q1
}
# Move the name in braces to separate column
/(\(.\))$/{
s// \1/
b not
} ; {
# default is n
s/$/ n/
} ; : not
# shuffle first and second field
# to that <num>c<num>(#SP<something>)? format
# if second field has a "/"
\~^\([0-9]*\) \([0-9]*\)/\([^ ]*\)~{
# then add a SP
s//\2c\1#SP\3/
b not2
} ; {
# otherwise just do a "c" between
s/\([0-9]*\) \([0-9]*\)/\2c\1/
} ; : not2
' |
sort -n -k1 |
# now it's trivial
awk '
{
out[$2] = out[$2] (!length(out[$2])?"":";") $1
}
function outputit(name, idx) {
print name
if (length(out[idx]) == 0) {
print "none"
} else {
print out[idx]
}
printf "\n"
}
END{
outputit("dead", "d")
outputit("null", "n")
outputit("strike", "s")
}
'
outputs on repl:
dead
none
null
26c5#SP10533-10541;47c1#SP13528;103c6#SP13519;114c4;152c8#SP12345;162c7;12346c8
strike
10c7;55c2;225c5
I believe the output matches yours up to the ordering within the ;-separated lists: you seem to sort by the first column and then by the second, while I simply used sort.

Related

Swapping / rearranging of columns and its values based on inputs using Unix scripts

Team,
I have a requirement to change/reorder the columns of CSV files based on inputs.
Example:
The data file (source file) will always have standard columns and their values, for example:
PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION
MK3,Biberach,15200100_3,Biologics Downstream
MK3,Biberach,15200100_4,Sciona Upstream
MK3,Biberach,15200100_5,Drag envois
MK3,Biberach,15200100_8,flatsylio
MK3,Biberach,15200100_1,bioCovis
These columns (PRODUCTCODE,SITE,BATCHID,LV1P_DESCRIPTION) are standard for the source files; I am looking for a solution to reformat this and generate a new file with the columns we prefer.
Note: the source/data file will always be comma-delimited.
Example: if I pass PRODUCTCODE,BATCHID as input, then I would like only those columns and their data extracted from the source file into a new file.
Something like script_name <output_column> <Source_File_name> <target_file_name>
target file example :
PRODUCTCODE,BATCHID
MK3,15200100_3
MK3,15200100_4
MK3,15200100_5
MK3,15200100_8
MK3,15200100_1
If I pass output_column as "LV1P_DESCRIPTION,PRODUCTCODE", then the output file should be as below:
LV1P_DESCRIPTION,PRODUCTCODE
Biologics Downstream,MK3
Sciona Upstream,MK3
Drag envios,MK3
flatsylio,MK3
bioCovis,MK3
It would be great if anyone could help with this.
I have tried an awk script (taken from some site), but it was not working as expected, and since I don't have Unix knowledge I am finding it difficult to modify.
awk code:
BEGIN {
FS = ","
}
NR==1 {
split(c, ca, ",")
for (i = 1 ; i <= length(ca) ; i++) {
gsub(/ /, "", ca[i])
cm[ca[i]] = 1
}
for (i = 1 ; i <= NF ; i++) {
if (cm[$i] == 1) {
cc[i] = 1
}
}
if (length(cc) == 0) {
exit 1
}
}
{
ci = ""
for (i = 1 ; i <= NF ; i++) {
if (cc[i] == 1) {
if (ci == "") {
ci = $i
} else {
ci = ci "," $i
}
}
}
print ci
}
The above code is saved as Remove.awk and is called by another script as below:
var1="BATCHID,LV2P_DESCRIPTION"
## these are the input field values used for testing
awk -f Remove.awk -v c="${var1}" RESULT.csv > test.csv
The following GNU awk solution should meet your objectives:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" 'BEGIN { split(flds,map,",") } NR==1 { for (i=1;i<=NF;i++) { map1[$i]=i } } { printf "%s",$map1[map[1]];for(i=2;i<=length(map);i++) { printf ",%s",$map1[map[i]] } printf "\n" }' file
Explanation:
awk -F, -v flds="LV1P_DESCRIPTION,PRODUCTCODE" ' # Pass the fields to print as the variable flds
BEGIN {
split(flds,map,",") # Split flds into an array map using , as the delimiter
}
NR==1 { for (i=1;i<=NF;i++) {
map1[$i]=i # Loop through the header and create an array map1 with the column header as the index and the column number as the value
}
}
{ printf "%s",$map1[map[1]]; # Print the first field specified (index of map)
for(i=2;i<=length(map);i++) {
printf ",%s",$map1[map[i]] # Loop through the other field numbers specified, printing the contents
}
printf "\n"
}' file

Match specific pattern and print just the matched string in the previous line

I update the question with additional information
I have a .fastq file formatted in the following way
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)
For each sequence the format is the same (repetition of 4 lines)
What I am trying to do is search for a specific regex pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}) in a window of n=35 characters of the 2nd line, cut it if found, and report it at the end of the previous line.
So far I've written some code that does almost what I want. I thought of using the match function together with substr over my window of interest, but I didn't achieve my goal. I report the script.awk below:
match(substr($0,0,35),/regexp/,a) {
print p,a[0] #print the previous line respect to the matched one
print #print the current line
for(i=0;i<=1;i++) { # print the 2 lines following
getline
print
}
}#store previous line
{ p = $0 }
Starting from a file like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
I would like to obtain an output like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 TATTCACATATAGACATGAAA #the string that matched the regexp, WITHOUT the initial AA that doesn't match my expression
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC #without initial AA
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF # without "GGGGGGGGDGGGFGGGGGGFGGG" that is the same number of characters removed in the 2nd line
$ cat tst.awk
BEGIN {
tgtStr = "pattern"
tgtLgth = length(tgtStr)
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
rec[1] = rec[1] " " tgtStr
rec[2] = substr(rec[2],idx+tgtLgth)
rec[4] = substr(rec[4],idx+tgtLgth)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
$ awk -f tst.awk file
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
wrt the code you posted:
substr($0,0,35) - strings, fields, line numbers, and arrays in awk start at 1 not 0 so that should be substr($0,1,35). Awk will compensate for your mistake and treat it as if you had written 1 instead of 0 in this case but get used to starting everything at 1 to avoid mistakes when it matters.
for(i=0;i<=1;i++) - should be for(i=1;i<=2;i++) for the same reason.
getline - not an appropriate use of getline, and syntactically fragile as written; buffering the lines as shown above is safer.
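A quick sketch of the 1-based indexing point (toy string, purely illustrative):

```shell
# awk strings, fields and arrays start at 1: substr(s, 1, 3) is the
# first three characters, and index() returns 1-based positions.
echo 'ABCDEFGH' | awk '{
    print substr($0, 1, 3)   # first three characters
    print index($0, "CDE")   # 1-based position of the match
}'
```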
Update - per your comment below that pattern is actually a regexp rather than a string:
$ cat tst.awk
BEGIN {
tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
rec[2] = substr(rec[2],RSTART+RLENGTH)
rec[4] = substr(rec[4],RSTART+RLENGTH)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
I warn you, I wanted to have some fun and it is twisted.
awk -v pattern=pattern -v window=15 '
BEGIN{RS="#";FS=OFS="\n"}
{pos = match($2, pattern); n_del=pos+length(pattern)}
pos && (n_del<=window){$1 = $1 " " pattern; $2=substr($2, n_del); $4=substr($4, n_del)}
NR!=1{printf "%s%s", RS, $0}
' file
Input :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Output :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Second block is not updated because window is 15 and it cannot find the pattern within this window.
I used the variable RS to deal with each entire 4-line block through $0, $1, $2, $3 and $4. Because the input file starts with RS and does not end with it, I preferred not to set ORS and used printf instead of print.
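To make the RS trick concrete, here is a minimal sketch (toy records, not the real FASTQ data) of how RS="#" and FS="\n" turn each 4-line block into a single record whose lines are the fields:

```shell
# With RS="#" and FS="\n", each "#..."-headed block is one record and
# its lines become fields $1..$4; NR==1 is the empty record before
# the first "#", so it is skipped.
printf '#header one\nSEQ1\n+\nQUAL1\n#header two\nSEQ2\n+\nQUAL2\n' |
awk 'BEGIN { RS = "#"; FS = "\n" }
     NR != 1 { print "name=" $1 " seq=" $2 " qual=" $4 }'
```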

gsub for substituting translations not working

I have a dictionary dict with records separated by ":" and data fields by new lines, for example:
:one
1
:two
2
:three
3
:four
4
Now I want awk to substitute all occurrences of each record in the input
file, eg
onetwotwotwoone
two
threetwoone
four
My first awk script looked like this and works just fine:
BEGIN { RS = ":" ; FS = "\n"}
NR == FNR {
rep[$1] = $2
next
}
{
for (key in rep)
gsub(key,rep[key])
print
}
giving me:
12221
2
321
4
Unfortunately another dict file contains characters that are special in regular expressions, so I have to escape them in my script. But after moving key and rep[key] into separate variables (which can then have the escapes applied), the script only substitutes the second record of the dict. Why? And how can I solve this?
Here's the current second part of the script:
{
for (key in rep)
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
print
}
All scripts are run by awk -f translate.awk dict input
Thanks in advance!
Your fundamental problem is using strings in regexp and backreference contexts when you don't want them and then trying to escape the metacharacters in your strings to disable the characters that you're enabling by using them in those contexts. If you want strings, use them in string contexts, that's all.
You don't want this:
gsub(regexp,backreference-enabled-string)
You want something more like this:
index(...,string) substr(string)
I think this is what you're trying to do:
$ cat tst.awk
BEGIN { FS = ":" }
NR == FNR {
if ( NR%2 ) {
key = $2
}
else {
rep[key] = $0
}
next
}
{
for ( key in rep ) {
head = ""
tail = $0
while ( start = index(tail,key) ) {
head = head substr(tail,1,start-1) rep[key]
tail = substr(tail,start+length(key))
}
$0 = head tail
}
print
}
$ awk -f tst.awk dict file
12221
2
321
4
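To see why the string-context point matters, here is a small sketch (toy data, my addition): gsub() treats its first argument as a regexp, so metacharacters in a dict key change its meaning, while index()/substr() treat the key as a literal string.

```shell
# A key like "a.c" is a regexp to gsub() ("." matches any character,
# so "abc" matches too), but index() searches for it literally.
echo 'abc a.c' | awk '{
    s = $0
    gsub(/a.c/, "X", s)           # regexp: matches both "abc" and "a.c"
    print s
    if (i = index($0, "a.c"))     # literal: finds only the real "a.c"
        print substr($0, 1, i-1) "Y" substr($0, i+3)
}'
```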
Never mind for asking....
Just some missing braces...?!
{
for (key in rep)
{
orig=key
trans=rep[key]
gsub(/[\]\[^$.*?+{}\\()|]/, "\\\\&", orig)
gsub(orig,trans)
}
print
}
works like a charm.

awk | Rearrange fields of CSV file on the basis of column value

I need your help writing awk for the problem below. I have a source file and its required output.
Source File
a:5,b:1,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
Output File
session:4,a=5,b=1,c=2
session:3,a=11,b=3,c=5|3
Notes:
Fields are not organised in the source file.
In the output file, fields are organised in a specific order: for example, all a values go in the 2nd column, then b, then c.
The value c appears multiple times in the second line, so in the output its values are merged with a pipe symbol.
Please help.
Will work in any modern awk:
$ cat file
a:5,b:1,c:2,session:4,e:8
a:5,c:2,session:4,e:8
b:3,a:11,c:5,e:9,session:3,c:3
$ cat tst.awk
BEGIN{ FS="[,:]"; split("session,a,b,c",order) }
{
split("",val) # or delete(val) in gawk
for (i=1;i<NF;i+=2) {
val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
}
for (i=1;i in order;i++) {
name = order[i]
printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
}
print ""
}
$ awk -f tst.awk file
session:4,a=5,b=1,c=2
session:4,a=5,b=,c=2
session:3,a=11,b=3,c=5|3
If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output.
Note that when b was missing from the input on line 2 above, it output a null value as you said you wanted.
Try with:
awk '
BEGIN {
FS = "[,:]"
OFS = ","
}
{
for ( i = 1; i <= NF; i+= 2 ) {
if ( $i == "session" ) { printf "%s:%s", $i, $(i+1); continue }
hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
}
asorti( hash, hash_orig )
for ( i = 1; i <= length(hash); i++ ) {
printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
}
printf "\n"
delete hash
delete hash_orig
}
' infile
That splits the line on any comma or colon and traverses the odd-numbered fields, saving each one and its values in a hash that is printed at the end. It yields:
session:4,a:5,b:1,c:2,e:8
session:3,a:11,b:3,c:5|3,e:9
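One caveat worth noting (my addition, not in the answer): asorti() is a GNU awk extension, so the script above requires gawk. A tiny sketch of what asorti() does:

```shell
# asorti(src, dst) copies the *indices* of src into dst in sorted
# order and returns their count; this is gawk-only, so a strictly
# POSIX awk will reject the script above.
gawk 'BEGIN {
    hash["c"] = 3; hash["a"] = 1; hash["b"] = 2
    n = asorti(hash, keys)     # keys[1]="a", keys[2]="b", keys[3]="c"
    for (i = 1; i <= n; i++) printf "%s:%s\n", keys[i], hash[keys[i]]
}'
```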

can awk replace fields based on separate specification file?

I have an input file like this:
SomeSection.Foo
OtherSection.Foo
OtherSection.Goo
...and there is another file describing which object(s) belong to each section:
[SomeSection]
Blah
Foo
[OtherSection]
Foo
Goo
The desired output would be:
SomeSection.2 // that's because Foo appears 2nd in SomeSection
OtherSection.1 // that's because Foo appears 1st in OtherSection
OtherSection.2 // that's because Goo appears 2nd in OtherSection
(The numbers and names of sections and objects are variable)
How would you do such a thing in awk?
Thanks in advance,
Adrian.
One possibility:
Content of script.awk (with comments):
## When 'FNR == NR', the first input file is in process.
## If line begins with '[', get the section string and reset the position
## of its objects.
FNR == NR && $0 ~ /^\[/ {
object = substr( $0, 2, length($0) - 2 )
pos = 0
next
}
## This section processes the objects of each section. It saves them in
## an array. Variable 'pos' increments with each object processed.
FNR == NR {
arr_obj[object, $0] = ++pos
next
}
## This section processes the second file. It splits the line on '.' to find
## the second part in the array and prints everything.
FNR < NR {
ret = split( $0, obj, /\./ )
if ( ret != 2 ) {
next
}
printf "%s.%d\n", obj[1], arr_obj[ obj[1] SUBSEP obj[2] ]
}
Run the script (the order of the input files is important: object.txt has the sections with their objects and input.txt the lookups):
awk -f script.awk object.txt input.txt
Result:
SomeSection.2
OtherSection.1
OtherSection.2
EDIT to a question in comments:
I'm not an expert but I will try to explain how I understand it:
SUBSEP is the character used to separate indexes in an array when you want to use several values as the key. By default it is \034, although you can modify it like RS or FS.
In the instruction arr_obj[object, $0] = ++pos the comma joins the values with the value of SUBSEP, so in this case it would result in:
arr_obj[SomeSection\034Blah] = 1
At the end of the script I access the index using that variable explicitly, arr_obj[ obj[1] SUBSEP obj[2] ], with the same meaning as arr_obj[object, $0] in the previous section.
You can also access each part of this index by splitting it on the SUBSEP variable, like this:
for (key in arr_obj) { ## Assign 'string\034string' to 'key' variable
split( key, key_parts, SUBSEP ) ## Split 'key' with the content of SUBSEP variable.
...
}
with a result of:
key_parts[1] -> SomeSection
key_parts[2] -> Blah
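A tiny sketch of that behaviour (illustrative values only):

```shell
# arr[a, b] stores the key a SUBSEP b, so both spellings address the
# same element, and split() on SUBSEP recovers the original parts.
awk 'BEGIN {
    arr["SomeSection", "Blah"] = 1
    print arr["SomeSection" SUBSEP "Blah"]   # same element: prints 1
    for (key in arr) {
        split(key, parts, SUBSEP)
        print parts[1], parts[2]
    }
}'
```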
this awk line should do the job:
awk 'BEGIN{FS="[\\.\\]\\[]"}
NR==FNR{ if(NF>1){ i=1; idx=$2; }else{ s[idx"."$1]=i; i++; } next; }
{ if($0 in s) print $1"."s[$0] } ' f2 input
see test below:
kent$ head input f2
==> input <==
SomeSection.Foo
OtherSection.Foo
OtherSection.Goo
==> f2 <==
[SomeSection]
Blah
Foo
[OtherSection]
Foo
Goo
kent$ awk 'BEGIN{FS="[\\.\\]\\[]"}
NR==FNR{ if(NF>1){ i=1; idx=$2; }else{ s[idx"."$1]=i; i++; } next; }
{ if($0 in s) print $1"."s[$0] } ' f2 input
SomeSection.2
OtherSection.1
OtherSection.2