print lines from one match to another unless third match is between - awk

Is there a way in bash to print lines from one match to another, unless a third match is between those lines? Let's say the file is:
A
B
C
D
E
A
B
Z
C
D
I want to print all the lines between "A" and "C", but skip any block that contains "Z", so the output should be:
A
B
C
I'm using this piece of code to match lines between "A" and "C":
awk '/C/{p=0} /A/{p=1} p'

With your shown samples, please try the following awk code.
awk '
/A/ { found=1 }
/C/ && found{
    if (!noVal) { print value ORS $0 }
    found=noVal=value=""
}
found && /Z/{ noVal=1 }
found{
    value=(value?value ORS:"")$0
}
' Input_file
Explanation: Adding detailed explanation for above.
awk '                      ##Starting awk program from here.
/A/ { found=1 }            ##Checking condition if line has A then set found to 1.
/C/ && found{              ##Checking if C is found and found is set then do following.
  if (!noVal) {            ##If noVal is NULL (no Z seen in this block) then:
    print value ORS $0     ##printing value ORS and current line here.
  }
  found=noVal=value=""     ##Nullifying found, noVal and value here so the next block starts fresh.
}
found && /Z/{ noVal=1 }    ##Checking if found is SET and Z is found then set noVal here.
found{                     ##Checking if found is set here.
  value=(value?value ORS:"")$0  ##Creating value which has current line in it and keeps concatenating lines to it.
}
' Input_file               ##Mentioning Input_file name here.

This uses full-line string matching instead of the partial-line regexp matching used in your question and the other answers posted so far (see how-do-i-find-the-text-that-matches-a-pattern for the difference), as I expect that's what you should really be using for a robust solution:
$ cat tst.awk
$0 == "A" { inBlock=1 }
inBlock {
    rec = rec $0 ORS
    if ( $0 == "C" ) {
        if ( !index(ORS rec ORS, ORS "Z" ORS) ) {
            printf "%s", rec
        }
        rec = ""
        inBlock = 0
    }
}
$ awk -f tst.awk file
A
B
C
If you REALLY wanted to continue to use partial-line regexp matching that'd be this:
$ cat tst.awk
/A/ { inBlock=1 }
inBlock {
    rec = rec $0 ORS
    if ( /C/ ) {
        if ( rec !~ /Z/ ) {
            printf "%s", rec
        }
        rec = ""
        inBlock = 0
    }
}
but that's fragile if your real data isn't just single letters on their own lines.
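For example (a hypothetical input, not from the question), a line like ZEBRA matches /Z/ even though it isn't a standalone Z line, so the partial-line version suppresses a block it arguably shouldn't, while the full-line version above would print it:
$ printf 'A\nZEBRA\nC\n' | awk '/A/{inBlock=1} inBlock{rec=rec $0 ORS; if (/C/) { if (rec !~ /Z/) printf "%s", rec; rec=""; inBlock=0 }}'
$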

I would use A as record separator and C as field separator. So, the A to C range would be in $1 (except the A itself at the beginning and the C at the end) and the rest up to the next A in $2.
The trick is to print the first field $1 only if the record contains no Z. Skip the first record, which will be empty.
So try:
awk 'BEGIN{RS="A";FS="C"}(NR > 1) && !/Z/{print "A" $1 "C"}' inputfile
Or even better, according to a comment of Ed Morton below:
awk 'BEGIN{RS="A";FS="C"}(NR > 1) && !/Z/{print RS $1 FS}' inputfile
If the Z can occur after C, we will have to correct the code.
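For instance, a minimal sketch of that correction: test only $1 (the A-to-C span) instead of the whole record, so a Z occurring after the C no longer suppresses the block:
awk 'BEGIN{RS="A";FS="C"}(NR > 1) && $1 !~ /Z/{print RS $1 FS}' inputfile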

You can use sed to do what you described:
sed '/^A$/,/^C$/!d; /^Z$/d' example-data
# gives
A
B
C
A
B
C
!d means delete lines which don't match the address.
Your expected result was three lines though. So you could use sed '/^A$/,/^C$/!d; /^Z$/d; /^C$/q'. Or, sed '/^A$/,/^C$/!d; /^Z$/d' | sort -u?


How to sort inside a cell captured by awk

I have a file with rows like the following, where the 3rd column has multiple numeric values which I need to sort:
file: h1.csv
Class S101-T1;3343-1-25310;3344-1-25446 3345-1-25691 3348-1-27681 3347-1-28453
Class S101-T2;3343-2-25310;3344-2-25446 3345-2-25691
Class S101-T1;3343-3-25310;3345-3-25691 3343-3-25314
Class S101-T2;3343-4-25310;3345-4-25691 3343-4-25314 3344-4-25314
Class S102-T1;3343-5-25310;3344-5-25446 3345-5-25691
So, expected output is:
Class S101-T1;3343-1-25310;3344-1-25446 3345-1-25691 3347-1-28453 3348-1-27681
Class S101-T2;3343-2-25310;3344-2-25446 3345-2-25691
Class S101-T1;3343-3-25310;3343-3-25314 3345-3-25691
Class S101-T2;3343-4-25310;3343-4-25314 3344-4-25314 3345-4-25691
Class S102-T1;3343-5-25310;3344-5-25446 3345-5-25691
My idea was to capture the 3rd column with awk, sort it, and finally print the output, but I have only managed to capture the column. I have not succeeded in sorting it, nor in printing the desired output.
Here's the code I've got so far...
cat h1.csv | awk -F';' '{ gsub(" ","\n",$3); print $0 }'
I have tried these (and some others that give errors):
cat h1.csv | awk -F';' '{ gsub(" ","\n",$3); print $3 | "sort -u" }'
cat h1.csv | awk -F';' '{ gsub(" ","\n",$3); sort -u; print $3 }'
So, is it possible to do this, and how? Any help appreciated. Thanks!
One option could be to split the 3rd column on spaces and then sort the values with asort() in gnu-awk.
Then concatenate the first 2 fields and the split and sorted values again.
awk '
BEGIN{ FS=OFS=";" }
{
    n=split($3, a, " ")
    asort(a)
    res = $1 OFS $2 OFS
    for (i = 1; i <= n; i++) {
        res = res (i > 1 ? " " : "") a[i]
    }
    print res
}' file
Output
Class S101-T1;3343-1-25310;3344-1-25446 3345-1-25691 3347-1-28453 3348-1-27681
Class S101-T2;3343-2-25310;3344-2-25446 3345-2-25691
Class S101-T1;3343-3-25310;3343-3-25314 3345-3-25691
Class S101-T2;3343-4-25310;3343-4-25314 3344-4-25314 3345-4-25691
Class S102-T1;3343-5-25310;3344-5-25446 3345-5-25691
In GNU awk, with your shown samples, please try the following awk code.
awk '
BEGIN{
    FS=OFS=";"
    PROCINFO["sorted_in"] = "@val_num_asc"
}
{
    nf=""
    delete value
    num=split($NF,arr," ")
    for(i=1;i<=num;i++){
        split(arr[i],arr2,"-")
        value[arr2[1]]=arr[i]
    }
    for(i in value){
        nf=(nf?nf " ":"")value[i]
    }
    $NF=nf
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk '                        ##Starting awk program from here.
BEGIN{                       ##Starting BEGIN section from here.
  FS=OFS=";"                 ##Setting FS, OFS as ; here.
  PROCINFO["sorted_in"] = "@val_num_asc"  ##Setting PROCINFO sorted_in to make sure array traversal is sorted by value in ascending numeric order.
}
{
  nf=""                      ##Nullifying variable nf here.
  delete value               ##Deleting value array here.
  num=split($NF,arr," ")     ##Splitting last field into arr with separator as space here.
  for(i=1;i<=num;i++){       ##Traversing through all elements of array arr.
    split(arr[i],arr2,"-")   ##Splitting current element of arr into arr2 on - to get only its first number, eg: 3344, 3345 etc.
    value[arr2[1]]=arr[i]    ##Storing the current element in the value array, indexed by that first number.
  }
  for(i in value){           ##Traversing through array value here.
    nf=(nf?nf " ":"")value[i]  ##Concatenating all values to nf here.
  }
  $NF=nf                     ##Assigning last field value to nf here.
}
1                            ##printing edited/non-edited line here.
' Input_file                 ##Mentioning Input_file name here.
Using GNU awk for sorted_in:
$ cat tst.awk
BEGIN {
    FS = OFS = ";"
    PROCINFO["sorted_in"] = "@val_str_asc"
}
{
    split($3,a," ")
    sorted = ""
    for (i in a) {
        sorted = (sorted=="" ? "" : sorted " ") a[i]
    }
    $3 = sorted
    print
}
$ awk -f tst.awk file
Class S101-T1;3343-1-25310;3344-1-25446 3345-1-25691 3347-1-28453 3348-1-27681
Class S101-T2;3343-2-25310;3344-2-25446 3345-2-25691
Class S101-T1;3343-3-25310;3343-3-25314 3345-3-25691
Class S101-T2;3343-4-25310;3343-4-25314 3344-4-25314 3345-4-25691
Class S102-T1;3343-5-25310;3344-5-25446 3345-5-25691
Note that this assumes alphabetic sort so it'd sort 1000-1-1 before 200-1-1. That works as long as the strings you want sorted are always made up of the same length parts, i.e. 4digits-1digit-5digits.
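If the parts could vary in length, one possible workaround (a sketch, assuming GNU awk; the comparison function name cmp_first_num is my own) is a user-defined comparison function that compares the leading number numerically:
function cmp_first_num(i1, v1, i2, v2,    a1, a2, n1, n2) {
    # order array elements by the number before the first "-", numerically
    split(v1, a1, "-"); split(v2, a2, "-")
    n1 = a1[1] + 0; n2 = a2[1] + 0
    if (n1 != n2) return (n1 < n2 ? -1 : 1)
    return (v1 < v2 ? -1 : (v1 > v2 ? 1 : 0))   # tie-break on the full string
}
BEGIN {
    FS = OFS = ";"
    PROCINFO["sorted_in"] = "cmp_first_num"
}
{
    split($3, a, " ")
    sorted = ""
    for (i in a) sorted = (sorted=="" ? "" : sorted " ") a[i]
    $3 = sorted
    print
}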

Multiple options of NF to identify duplicates in different positions with awk?

I hope you find yourself well. I am writing to ask whether it is possible to do something like this in awk.
I need something like multiple CASEs of NF:
for NF == 7 the PK (primary key) is $1,$5, but for NF == 8 it is $1,$6.
INPUT
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|B10|CCC|DDD|000|20200127|JONH3
AAA|BBB|MMM|DDD|444|20200131|JONH4
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
DESIRED OUTPUTS
file .PK_OK_1
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3
file DUPLICATE_PK_1
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4
file PK_OK_2
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
file DUPLICATE_PK_2
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
file INVALID_LENGHT
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
My code is something like this (NOM_ARCH is a variable):
BEGIN { FS="|"; OFS="|" }
NF == 7 {
    if (!seen[$1,$5]) {
        print > NOM_ARCH".PK_OK_1"; seen[$1,$5]=1
    } else {
        print > NOM_ARCH".DUPLICATE_PK_1"
    }
    next
}
NF == 8 {
    if (!seen[$1,$6]) {
        print > NOM_ARCH".PK_OK_2"; seen[$1,$6]=1
    } else {
        print > NOM_ARCH".DUPLICATE_PK_2"
    }
    next
}
{ print > NOM_ARCH".INVALID_LENGHT" }
With your shown samples, please try the following awk code.
awk '
BEGIN{ FS=OFS="|" }
{
    key=""
    if(NF==7){ key=($1 FS $5) }
    if(NF==8){ key=($1 FS $6) }
}
FNR==NR{
    if(key!=""){ arr1[key]++ }
    next
}
NF==7{
    outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1")
}
NF==8{
    outputFile=(arr1[key]==1?"file.PK_OK_2":"file_DUPLICATE_PK_2")
}
NF>8{
    outputFile="file_INVALID_LENGHTH"
}
{
    print > (outputFile)
}
' Input_file Input_file
OR use the following code without ternary operators, as per the OP's request:
awk '
BEGIN{ FS=OFS="|" }
{
    key=""
    if(NF==7){ key=($1 FS $5) }
    if(NF==8){ key=($1 FS $6) }
}
FNR==NR{
    if(key!=""){ arr1[key]++ }
    next
}
NF==7{
    if(arr1[key]==1){ outputFile="file.PK_OK_1" }
    else { outputFile="file_DUPLICATE_PK_1" }
}
NF==8{
    if(arr1[key]==1){ outputFile="file.PK_OK_2" }
    else { outputFile="file_DUPLICATE_PK_2" }
}
NF>8{
    outputFile="file_INVALID_LENGHTH"
}
{
    print > (outputFile)
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
## Starting awk program from here.
awk '
## Starting BEGIN section of this program from here, setting FS and OFS to | here.
BEGIN{ FS=OFS="|" }
## Starting main program from here.
{
    ## Resetting key for every line so a stale key from a previous line is never reused.
    key=""
    ## Checking condition: if NF is 7 then set key to $1 FS $5.
    if(NF==7){ key=($1 FS $5) }
    ## Checking condition: if NF is 8 then set key to $1 FS $6.
    if(NF==8){ key=($1 FS $6) }
}
## Checking condition FNR==NR which will be TRUE when the Input_file is being read the 1st time.
FNR==NR{
    ## Creating array arr1 with index of key and increasing that key's count by 1, for valid keys only.
    if(key!=""){ arr1[key]++ }
    ## next will skip all further statements from here.
    next
}
## Checking condition: if NF==7 then do following.
NF==7{
    ## Setting outputFile (where contents will be written to), either file.PK_OK_1 OR file_DUPLICATE_PK_1 depending upon the value of arr1.
    ## Basically it uses the ternary operators ? and :
    ## The value after ? is used if the condition arr1[key]==1 is TRUE.
    ## The value after : is used if the condition arr1[key]==1 is FALSE.
    outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1")
}
## Checking condition: if NF==8 then do following.
NF==8{
    ## Setting outputFile (where contents will be written to), either file.PK_OK_2 OR file_DUPLICATE_PK_2 depending upon the value of arr1.
    outputFile=(arr1[key]==1?"file.PK_OK_2":"file_DUPLICATE_PK_2")
}
## Checking condition: if NF>8 then do following.
NF>8{
    ## Setting outputFile (where contents will be written to) to file_INVALID_LENGHTH here.
    outputFile="file_INVALID_LENGHTH"
}
{
    ## Printing current line to outputFile (its value was already set above).
    print > (outputFile)
}
## Mentioning Input_file names here.
' Input_file Input_file
Normally I'd recommend a first pass with sort and uniq -c for efficiency, but I started out assuming the wrong requirements and wrote most of this under that assumption, so I've just tweaked it for the real requirements. Here's how to do it all in one awk script:
$ cat tst.awk
BEGIN {
    FS=OFS="|"
    map[7] = 1
    map[8] = 2
}
{ key = $1 FS $(NF-2) FS NF }
NR==FNR {
    cnt[key]++
    next
}
{
    if ( NF in map ) {
        sfx = ( cnt[key]>1 ? "DUPLICATE_PK" : "PK_OK" ) "_" map[NF]
    }
    else {
        sfx = "INVALID_LENGTH"
    }
    print > (nom_arch "." sfx)
}
$ awk -v nom_arch='foo' -f tst.awk file file
$ head foo.*
==> foo.DUPLICATE_PK_1 <==
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4
==> foo.DUPLICATE_PK_2 <==
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
==> foo.INVALID_LENGTH <==
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
==> foo.PK_OK_1 <==
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3
==> foo.PK_OK_2 <==
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
I corrected the spelling of LENGTH above.
Note that NF is included in key = $1 FS $(NF-2) FS NF so we avoid a potential case pointed out by @rowboat where a line with 7 fields has the same $1 and $(NF-2) as a line with 8 fields, in which case we would otherwise count them together when they should be 2 separate counts of 1.
We could have used NF-6 instead of map[NF] when setting the sfx, but the map[] is useful for identifying valid NF values too, and there may be other values of NF in the future for which the sfx can't be determined by just subtracting 6.
This uses GNU awk for multidimensional arrays:
# classify.awk
BEGIN {
    FS = "|"
    ok[7] = ".PK_OK_1"; dup[7] = ".DUPLICATE_PK_1"
    ok[8] = ".PK_OK_2"; dup[8] = ".DUPLICATE_PK_2"
}
NF < 7 || NF > 8 {
    print > (nom_arch".INVALID_LENGTH")
    next
}
{
    pk = $1 SUBSEP (NF == 7 ? $5 : $6)
    count[NF][pk]++
    lines[NF][pk] = lines[NF][pk] $0 ORS
}
END {
    for (nf in count)
        for (pk in count[nf]) {
            outfile = nom_arch (count[nf][pk] == 1 ? ok[nf] : dup[nf])
            sub(ORS"$", "", lines[nf][pk])
            print lines[nf][pk] > outfile
        }
}
Then this will produce the desired output files:
gawk -f classify.awk -v nom_arch="foo" file
The awk SUBSEP variable is used in array keys when you do something like
var[x,y] = 10
awk uses the value of SUBSEP to join the values of x and y.
The default SUBSEP value is octal value 034, an ASCII character unlikely to appear in text data.
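A quick demonstration (a sketch; works in any awk):
$ awk 'BEGIN {
    x = "A"; y = "B"
    var[x,y] = 10                  # the stored key is really "A" SUBSEP "B"
    for (k in var) {
        split(k, parts, SUBSEP)    # recover the two components
        print parts[1], parts[2], var[k]
    }
}'
A B 10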
This version is more portable; it does not require GNU awk:
BEGIN {
    FS = "|"
    ok[7] = ".PK_OK_1"; dup[7] = ".DUPLICATE_PK_1"
    ok[8] = ".PK_OK_2"; dup[8] = ".DUPLICATE_PK_2"
}
NF < 7 || NF > 8 {
    print > (nom_arch".INVALID_LENGTH")
    next
}
{
    pk = NF SUBSEP $1 SUBSEP (NF == 7 ? $5 : $6)
    count[pk]++
    lines[pk] = lines[pk] $0 ORS
}
END {
    for (pk in count) {
        sub(ORS"$", "", lines[pk])
        nf = pk; sub(SUBSEP".*", "", nf)
        outfile = nom_arch (count[pk] == 1 ? ok[nf] : dup[nf])
        print lines[pk] > outfile
    }
}
If it's ok to put the first occurrence of a dup in with the OK's, then one pass is easy.
NOM_ARCH=/tmp/mytest
awk -v nom_arch="$NOM_ARCH" ' BEGIN { FS=OFS="|" }
{ if (NF ~ /^[78]$/) { key=($1 FS $(NF-2)) } else { print > (nom_arch ".INVALID_LENGTH"); next }
  print > ( nom_arch "." ( seen[key]++ ? "DUPLICATE_PK" : "PK_OK" ) "_" (NF-6) ) } ' file
cf. AAA|B10|CCC|DDD|000|20200127|JONH3, which lands in an OK file as the first hit of its key, while subsequent dups like AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY get seen and directed elsewhere. Note that it might still be faster to shift those records between smaller files on a second pass after the fact.
Personally, I'd probably just split the records into key-sorted files by NF first. Then the second pass over each is easy.
NOM_ARCH=/tmp/mytest
# this pre-sort is likely the slow part, though smaller files and in parallel
awk 'BEGIN { FS=OFS="|" } { k2=NF-2; print | "sort -t\\| -k1,1 -k"k2","k2">NF"NF; }' file
shopt -s extglob; cat NF!([78]) > $NOM_ARCH.INVALID_LENGTH &
for f in NF[78]; do
    awk -v nom_arch="$NOM_ARCH" '
    BEGIN { FS=OFS="|"; lastkey=""; lastrec=""; }
    END { if (""!=lastrec) { print lastrec>f } }
    {
        key=($1 FS $(NF-2));
        if ( key==lastkey ) {
            f=(nom_arch".DUPLICATE_PK_"NF-6);
            if (""!=lastrec) { print lastrec>f }
            print $0>f;
            lastrec="";
        } else {
            if (""!=lastrec) { print lastrec>f }
            f=(nom_arch".PK_OK_"NF-6);
            lastkey=($1 FS $(NF-2));
            lastrec=$0;
        }
    }' "$f" &
done
wait
Now your data should be sorted to files. This likely reorders the records in those files (see below), so if that matters you should add sorts to those outputs as well.
mytest.PK_OK_1:
AAA|B10|CCC|DDD|000|20200127|JONH3
AAA|BBB|CCC|DDD|111|20220129|JONH1
mytest.PK_OK_2:
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
mytest.DUPLICATE_PK_1:
AAA|BBB|MMM|DDD|444|20200131|JONH4
AAA|XXX|YYY|DDD|444|20210115|JONH2
mytest.DUPLICATE_PK_2:
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
mytest.INVALID_LENGTH:
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
This uses more disk space but less memory than an internal lookup table, and is likely a lot slower.
YMMV.

How to merge duplicate lines into same row with primary key and more than one column of information

Here is my data:
NAME1,NAME1_001,NULL,LIC100_1,NULL,LIC300-3,LIC300-6
NAME1,NAME1_003,LIC000_1,NULL,NULL,NULL,NULL
NAME2,NAME2_001,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME3,NAME3_001,NULL,LIC400_2,NULL,NULL,LIC500_1
NAME3,NAME3_005,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME3,NAME3_006,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME4,NAME4_002,NULL,LIC100_1,NULL,LIC300-3,LIC300-6
Expected result:
NAME1|NAME1_001|NULL|LIC100_1|NULL|LIC300-3|LIC300-6|NAME1_003|LIC000_1|NULL|NULL|NULL|NULL
NAME2|NAME2_001|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME3|NAME3_001|NULL|LIC400_2|NULL|NULL|LIC500_1|NAME3_005|LIC000_1|NULL|LIC400_2|NULL|NULL|NAME3_006|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME4|NAME4_002|NULL|LIC100_1|NULL|LIC300-3|LIC300-6
I tried the command below, but have no idea how to add the details ($3 to $7):
awk '
BEGIN{FS=","; OFS="|"};
{ arr[$1] = arr[$1] == ""? $2 : arr[$1] "|" $2 }
END {for (i in arr) print i, arr[i] }' file.csv
Any suggestions? Thanks!!
Could you please try the following. Written and tested with shown samples in GNU awk.
awk '
BEGIN{
    FS=","
    OFS="|"
}
FNR==NR{
    first=$1
    $1=""
    sub(/^\|/,"")
    arr[first]=(first in arr?arr[first] OFS:"")$0
    next
}
($1 in arr){
    print $1 OFS arr[$1]
    delete arr[$1]
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk '                     ##Starting awk program from here.
BEGIN{                    ##Starting BEGIN section of this program from here.
  FS=","                  ##Setting FS as comma here.
  OFS="|"                 ##Setting OFS as | here.
}
FNR==NR{                  ##Checking FNR==NR which will be TRUE when the Input_file is being read the first time.
  first=$1                ##Setting first as 1st field here.
  $1=""                   ##Nullifying first field here; this rebuilds the line with OFS.
  sub(/^\|/,"")           ##Substituting the leading | (left over from emptying $1) with NULL in current line.
  arr[first]=(first in arr?arr[first] OFS:"")$0 ##Creating arr with index of first and appending the rest of each matching line to it.
  next                    ##next will skip all further statements from here.
}
($1 in arr){              ##Checking condition if 1st field is present in arr then do following.
  print $1 OFS arr[$1]    ##Printing 1st field, OFS and arr value here.
  delete arr[$1]          ##Deleting arr item here so each key prints only once.
}
' Input_file Input_file   ##Mentioning Input_file names here.
Another awk:
$ awk '
BEGIN {                  # set them field separators
    FS=","
    OFS="|"
}
{
    if($1 in a) {        # if $1 already has an entry in a hash
        t=$1             # store key temporarily
        $1=a[$1]         # set the a hash entry to $1
        a[t]=$0          # and hash the record
    } else {             # if $1 seen for the first time
        $1=$1            # rebuild record to change the separators
        a[$1]=$0         # and hash the record
    }
}
END {                    # afterwards
    for(i in a)          # iterate a
        print a[i]       # and output
}' file
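Note that for (i in a) visits keys in an unspecified order, so the output lines may not come out in input order. With GNU awk you could pin the order down by replacing the END block, e.g. (a sketch, assuming gawk; here the NAME1..NAME4 keys happen to sort into input order):
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"   # iterate keys in ascending string order
    for (i in a)
        print a[i]
}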
Assuming your input is grouped by the key field as shown in your example (if it isn't, then sort it first), you don't need to store the whole file in memory or read it twice, and this will output the lines in the same order they appear in the input:
$ cat tst.awk
BEGIN { FS=","; OFS="|" }
$1 != prev {
    if (NR>1) {
        print rec
    }
    prev = rec = $1
}
{
    $1 = ""
    rec = rec $0
}
END { print rec }
$ awk -f tst.awk file
NAME1|NAME1_001|NULL|LIC100_1|NULL|LIC300-3|LIC300-6|NAME1_003|LIC000_1|NULL|NULL|NULL|NULL
NAME2|NAME2_001|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME3|NAME3_001|NULL|LIC400_2|NULL|NULL|LIC500_1|NAME3_005|LIC000_1|NULL|LIC400_2|NULL|NULL|NAME3_006|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME4|NAME4_002|NULL|LIC100_1|NULL|LIC300-3|LIC300-6
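If the input were not grouped by the key field, a possible pre-sort (a sketch; group rows by the first comma-separated field and let the script read stdin) would be:
$ sort -t, -k1,1 file | awk -f tst.awk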

Extract sequence from list of data into separate lines

sample.txt has tab-separated columns, and the third column contains semicolon-separated values that need to be split; sequences of numbers (ranges written with a dash) need to be expanded into the individual values.
cat sample.txt
2 2627 588;577
2 2629 566
2 2685 568-564
2 2771 573
2 2773 597
2 2779 533
2 2799 558
2 6919 726;740-742;777
2 7295 761;771-772
Please note that some lines may have an inverted sequence, such as 568-564.
Using the previous scripts below, I managed to split the values, but failed to expand the dash-separated sequences.
#!/bin/sh
awk -F"\t" '{print $1}' $1 >> $2 &&
awk -F"\t" '{print $2}' $1 >> $2 &&
awk -F"\t" '{print $3}' $1 >> $2 &&
sed -i "s/^M//;s/;\r//g" $2
#!/bin/awk -f
BEGIN { FS=";"; recNr=1}
!NF { ++recNr; lineNr=0; next }
{ ++lineNr }
lineNr == 1 { next }
recNr == 1 { a[lineNr] = $0 }
recNr == 2 { b[lineNr] = $0 }
recNr == 3 {
for (i=1; i<=NF; i++) {
print a[lineNr] "," b[lineNr] "," $i
}
}
Expected
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Could you please try the following (a detailed explanation is added below).
awk '
BEGIN{
    OFS=","
}
{
    num=split($NF,array,";")
    for(i=1;i<=num;i++){
        if(array[i]~/-/){
            split(array[i],array2,"-")
            to=array2[1]>array2[2]?array2[1]:array2[2]
            from=array2[1]<array2[2]?array2[1]:array2[2]
            while(from<=to){
                print $1,$2,from++
            }
        }
        else{
            print $1,$2,array[i]
        }
        from=to=""
    }
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk '                          ##Starting awk program from here.
BEGIN{                         ##Starting BEGIN section of code here.
  OFS=","                      ##Setting OFS as comma here.
}
{
  num=split($NF,array,";")     ##Splitting last field of line into an array named array with delimiter semi-colon here.
  for(i=1;i<=num;i++){         ##Starting a for loop from 1 till value of num, which is the length of the array created in previous step.
    if(array[i]~/-/){          ##Checking condition: if array value with index i has a dash then do following.
      split(array[i],array2,"-")  ##Splitting value of array with index i into array2 with delimiter -(dash) here.
      to=array2[1]>array2[2]?array2[1]:array2[2]    ##Creating to variable which compares the 2 elements of array2 and holds the maximum of them.
      from=array2[1]<array2[2]?array2[1]:array2[2]  ##Creating from variable which compares the 2 elements of array2 and holds the minimum of them.
      while(from<=to){         ##Running while loop from value of variable from till value of variable to here.
        print $1,$2,from++     ##Printing 1st, 2nd fields with value of from variable and increasing from by 1 each time it comes here.
      }
    }
    else{                      ##Mentioning else part of if condition here.
      print $1,$2,array[i]     ##Printing only 1st, 2nd fields along with value of array with index i here.
    }
    from=to=""                 ##Nullifying variables from and to here.
  }
}
' Input_file                   ##Mentioning Input_file name here.
Adding a link explaining the conditional expressions ? and :, as per James's comments:
https://www.gnu.org/software/gawk/manual/html_node/Conditional-Exp.html
For the shown samples, the output will be as follows.
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
$ awk '
BEGIN {
    FS="( +|;)"               # input field separator is space or ;
    OFS=","                   # output fs is comma
}
{
    for(i=3;i<=NF;i++) {      # from the 3rd field to the end
        n=split($i,t,"-")     # split on - if any; below, loop from smaller to greater
        if(n)                 # in case of empty fields
            for(j=(t[1]<t[n]?t[1]:t[n]); j<=(t[1]<t[n]?t[n]:t[1]); j++)
                print $1,$2,j # output
    }
}' file
Output
2,2627,588
2,2627,577
2,2629,566
2,2685,564 <─┐
2,2685,565 │
2,2685,566 ├─ wrong order, from smaller to greater
2,2685,567 │
2,2685,568 <─┘
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Tested on GNU awk, mawk, Busybox awk and awk version 20121220.
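If the original direction of an inverted range should be preserved instead (the output annotation above flags the ascending order), here is a minimal sketch assuming numeric endpoints:
$ awk '
BEGIN { FS="( +|;)"; OFS="," }
{
    for (i=3; i<=NF; i++) {                          # from the 3rd field to the end
        n = split($i, t, "-")
        if (n == 0) continue                         # skip empty fields
        if (n == 1) { print $1, $2, t[1]; continue } # single value, no range
        step = (t[1]+0 <= t[n]+0) ? 1 : -1           # keep the direction given
        for (j = t[1]+0; ; j += step) {
            print $1, $2, j
            if (j == t[n]+0) break
        }
    }
}' file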

awk totally separate duplicate and non-duplicates

If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (SMILES).
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate
95,CPD-3333333,-1,c1ccccc1N
The following attempt does separate the duplicates without any problem. However, the first occurrence of each duplicate is still included in the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b" }
{
    # Keep count of the fields in fourth column
    count[$4]++;
    # Save the line the first time we encounter a unique field
    if (count[$4] == 1)
        first[$4] = $0;
    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$4] == 2)
        print first[$4] > f1;
    # From the second time onward, always print because the field is
    # duplicated
    if (count[$4] > 1)
        print > f1;
    if (count[$4] == 1) # if (count[$4] - count[$4] == 0) <= changing to this doesn't work
        print first[$4] > f2;
}
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt:
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
NR==FNR {count[$2] = $1; next}
FNR==1 {FS=","; next}
{
    output = (count[$NF] == 1 ? "nondup" : "dup")
    print > output
}
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
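For reference, the pre-counted stream fed in as the first input looks like this (a sketch; line order and count padding may vary with locale and uniq implementation):
$ cut -d, -f4 input | sort | uniq -c
      1 SMILES
      3 c1ccccc1
      1 c1ccccc1N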
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2-pass solution that's virtually identical to my example above:
awk -F, '
NR==FNR {count[$NF]++; next}
FNR==1 {next}
{
    output = (count[$NF] == 1 ? "nondup" : "dup")
    print > output
}
' input input
Yes, the input file is given twice.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
    if (cnt[$4]++) {
        dups[$4] = nonDups[$4] dups[$4] $0 ORS
        delete nonDups[$4]
    }
    else {
        nonDups[$4] = $0 ORS
    }
}
END {
    print "Duplicates:"
    for (key in dups) {
        printf "%s", dups[key]
    }
    print "\nNon Duplicates:"
    for (key in nonDups) {
        printf "%s", nonDups[key]
    }
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
function fout(    f, i) {
    f = (cnt > 1) ? "dups" : "nondups"
    for (i = 1; i <= cnt; ++i)
        print lines[i] > f
}
NR > 1 && $4 != lastkey { fout(); cnt = 0 }
{ lastkey = $4; lines[++cnt] = $0 }
END { fout() }
' file
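If the duplicates were not grouped together, one way to group them (a sketch; keeps the header line first and sorts the body by the 4th column, writing to a hypothetical grouped.csv) would be:
{ head -n 1 file; tail -n +2 file | sort -t, -k4,4; } > grouped.csv
and then run the script above on grouped.csv instead of file.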
A little late; my version in awk:
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"c[1]:e=e"\n"c[1]} print e d}' file
Another one, built similar to glenn jackman's, but all in awk:
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file