Conditional transposition in awk based on column values

I'm trying to make the below transformation using awk.
Input:
status,parent,child,date
first,foo,bar,2019-01-01
NULL,foo,bar,2019-01-02
NULL,foo,bar,2019-01-03
last,foo,bar,2019-01-04
NULL,foo,bar,2019-01-05
blah,foo,bar,2019-01-06
NULL,foo,bar,2019-01-07
first,bif,baz,2019-01-02
NULL,bif,baz,2019-01-03
last,bif,baz,2019-01-04
Expected output:
parent,child,first,last
foo,bar,2019-01-01,2019-01-04
bif,baz,2019-01-02,2019-01-04
I'm pretty stumped by this problem, and haven't got anything to show yet - any pointers would be very helpful.

Could you please try following.
awk '
BEGIN{
FS=OFS=SUBSEP=","
print "parent,child,first,last"
}
$1=="first" || $1=="last"{
a[$1,$2,$3]=$NF
b[$2,$3]
}
END{
for(i in b){
print i,a["first",i],a["last",i]
}
}
' Input_file
Output will be as follows.
parent,child,first,last
bif,baz,2019-01-02,2019-01-04
foo,bar,2019-01-01,2019-01-04
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=SUBSEP="," ##Setting FS, OFS and SUBSEP as comma here.
print "parent,child,first,last" ##Printing header values as per OP request here.
} ##Closing BEGIN BLOCK for this program here.
$1=="first" || $1=="last"{ ##Checking condition if $1 is either string first or last then do following.
a[$1,$2,$3]=$NF ##Creating an array named a whose index is $1,$2,$3 and its value is $NF(last column of current line).
b[$2,$3] ##Creating an array named b whose index is $2,$3 from current line.
} ##Closing main BLOCK for main program here.
END{ ##Starting END BLOCK for this awk program.
for(i in b){ ##Starting a for loop to traverse through array here.
print i,a["first",i],a["last",i] ##Printing variable i (the parent,child key), then the values of array a with index "first",i and index "last",i.
} ##Closing BLOCK for, for loop here.
} ##Closing BLOCK for END block for this awk program here.
' Input_file ##Mentioning Input_file name here.
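The END loop above only works because SUBSEP is set to a comma: the keys of b are then plain strings such as foo,bar, and a["first",i] resolves to the key first,foo,bar. A minimal sketch of that behaviour follows; the function name subsep_demo and the sample values are made up for illustration only.

```shell
# subsep_demo is a hypothetical name used only for this illustration.
subsep_demo() {
  awk 'BEGIN {
    SUBSEP = ","                      # multi-part subscripts are joined with a comma
    a["first", "foo", "bar"] = "2019-01-01"
    key = "foo" SUBSEP "bar"          # the same string that b[$2,$3] would store as its index
    print key
    print a["first", key]             # looks up the key "first,foo,bar"
  }'
}
subsep_demo
```

With the default SUBSEP (a non-printing character) the loop variable would contain that character, so the printed key would not look like parent,child.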

$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $2 OFS $3 }
FNR==1 { print key, "first", "last" }
$1=="first" { first[key] = $4 }
$1=="last" { print key, first[key], $4 }
$ awk -f tst.awk file
parent,child,first,last
foo,bar,2019-01-01,2019-01-04
bif,baz,2019-01-02,2019-01-04
If you can have a first without a last or vice-versa or they can occur out of order then include those cases in the example in your question.
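If those cases do occur, one possible approach is to collect the values first and print at the end. This is only a sketch under assumptions not stated in the question: a missing first or last is printed as an empty field, a later row for the same pair overwrites an earlier one, and pairs are printed in first-seen order. The name first_last is hypothetical.

```shell
# first_last is a hypothetical helper name; it reads the CSV on stdin.
first_last() {
  awk '
  BEGIN { FS = OFS = "," }
  NR == 1 { print $2, $3, "first", "last"; next }   # build the header from the input header
  $1 == "first" || $1 == "last" {
    key = $2 OFS $3
    if (!(key in seen)) {            # remember the first-seen order of each pair
      seen[key]
      order[++n] = key
    }
    val[$1, key] = $4                # assumption: a later row for the same key overwrites
  }
  END {
    for (i = 1; i <= n; i++)
      print order[i], val["first", order[i]], val["last", order[i]]
  }'
}
```

Unlike tst.awk above, this keeps every pair in memory until the end of the input, which is the price of tolerating out-of-order rows.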

Not awk, you already have that, but here's an option in bash alone, just for kicks.
#!/usr/bin/env bash
declare -A first=()
printf 'parent,child,first,last\n'
while IFS=, read -r pos a b date; do
case "$pos" in
first) first[$a,$b]=$date ;;
last) printf "%s,%s,%s,%s\n" "$a" "$b" "${first[$a,$b]}" "$date" ;;
esac
done < input.csv
Requires bash 4+ for the associative array.

Related

How to cut field in awk

I want to write a script that cuts out a field, uses it for a comparison, and then prints the result.
Example input file:
ADC1/asf/sd/df_adc1/125125
AED1/asf/sd/df_aed1/asfk
ASQ2/asf/df_asq2/aks
ABX5/df_abx5/asf/sd/sdgqw
Output file: the last column should hold the path from the start of the line up to and including the field that contains the first field in lowercase. In other words, I want to take the first field, find the field that contains it (ignoring case), and print everything from the beginning up to that field.
ADC1/asf/sd/df_adc1/125125 ADC1 ADC1/asf/sd/df_adc1
AED1/asf/sd/df_aed1/asfk AED1 AED1/asf/sd/df_aed1
ASQ2/asf/df_asq2/aks ASQ2 ASQ2/asf/df_asq2
ABX5/df_abx5/asf/sd/sdgqw ABX5 ABX5/df_abx5
I used awk to split the line:
awk '{split($1,a,"/"); {print $1, a[1]}}' input > output
and the output looks like this:
ADC1/asf/sd/df_adc1/125125 ADC1
AED1/asf/sd/df_aed1/asfk AED1
ASQ2/asf/df_asq2/aks ASQ2
ABX5/df_abx5/asf/sd/sdgqw ABX5
but I don't know how to do the comparison to build the last field.
With your shown samples, please try following awk program.
awk '
BEGIN{ FS=OFS="/" }
{
val=""
for(i=1;i<=NF;i++){
if(i>1 && index(tolower($i),tolower($1))){
print $0" "$1" "val OFS $i
}
val=(val?val OFS:"")$i
}
}
' Input_file
Explanation: Adding detailed explanation for above solution.
awk ' ##Starting awk program from here.
BEGIN{ FS=OFS="/" } ##Setting FS and OFS as / here.
{
val="" ##Nullifying val here.
for(i=1;i<=NF;i++){ ##Traversing through all fields here.
if(i>1 && index(tolower($i),tolower($1))){ ##Checking if field number is greater than 1 and $1 is found in $i(current field) in case-insensitive mode then do following.
print $0" "$1" "val OFS $i ##Printing current line, first field, and val joined to current field with OFS.
}
val=(val?val OFS:"")$i ##Creating val which has val and current field value in it.
}
}
' Input_file ##Mentioning Input_file name here.
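The loop above prints once for every field that contains $1, so a line with two matching fields would be printed twice. If only the first match is wanted (an assumption, since the question does not say), a variant can stop with break. The function name prefix_to_match is hypothetical.

```shell
# prefix_to_match is a hypothetical name; it reads slash-separated paths on stdin.
prefix_to_match() {
  awk '
  BEGIN { FS = "/" }
  {
    key = tolower($1)                 # lowercase form of the first field
    path = $1
    for (i = 2; i <= NF; i++) {
      path = path FS $i               # path so far, up to and including field i
      if (index(tolower($i), key)) {  # field i contains the key, ignoring case
        print $0, $1, path
        break                         # stop at the first matching field
      }
    }
  }'
}
printf '%s\n' 'ADC1/asf/sd/df_adc1/125125' | prefix_to_match
```

Lines with no matching field print nothing, which is the same as the behaviour of the original program.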

How to merge duplicate lines into same row with primary key and more than one column of information

Here is my data:
NAME1,NAME1_001,NULL,LIC100_1,NULL,LIC300-3,LIC300-6
NAME1,NAME1_003,LIC000_1,NULL,NULL,NULL,NULL
NAME2,NAME2_001,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME3,NAME3_001,NULL,LIC400_2,NULL,NULL,LIC500_1
NAME3,NAME3_005,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME3,NAME3_006,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME4,NAME4_002,NULL,LIC100_1,NULL,LIC300-3,LIC300-6
Expected result:
NAME1|NAME1_001|NULL|LIC100_1|NULL|LIC300-3|LIC300-6|NAME1_003|LIC000_1|NULL|NULL|NULL|NULL
NAME2|NAME2_001|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME3|NAME3_001|NULL|LIC400_2|NULL|NULL|LIC500_1|NAME3_005|LIC000_1|NULL|LIC400_2|NULL|NULL|NAME3_006|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME4|NAME4_002|NULL|LIC100_1|NULL|LIC300-3|LIC300-6
I tried below command, but have no idea how to add the details ($3 to $7)
awk '
BEGIN{FS=","; OFS="|"};
{ arr[$1] = arr[$1] == ""? $2 : arr[$1] "|" $2 }
END {for (i in arr) print i, arr[i] }' file.csv
Any suggestion? thanks!!
Could you please try following. Written and tested with shown samples in GNU awk.
awk '
BEGIN{
FS=","
OFS="|"
}
FNR==NR{
first=$1
$1=""
arr[first]=arr[first] $0
next
}
($1 in arr){
print $1 arr[$1]
delete arr[$1]
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="," ##Setting FS as comma here.
OFS="|" ##Setting OFS as | here.
}
FNR==NR{ ##Checking FNR==NR which will be TRUE when first time Input_file is being read.
first=$1 ##Setting first as 1st field here.
$1="" ##Nullifying first field here, which rebuilds the line with OFS so that it starts with |.
arr[first]=arr[first] $0 ##Appending the rebuilt line to arr with index of first; no extra separator is needed because the line already starts with OFS.
next ##next will skip all further statements from here.
}
($1 in arr){ ##Checking condition if 1st field is present in arr then do following.
print $1 arr[$1] ##Printing 1st field with arr value here.
delete arr[$1] ##Deleting arr item here.
}
' Input_file Input_file ##Mentioning Input_file names here.
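The FNR==NR test is what separates the two passes: NR counts records across all files while FNR restarts for each file, so the two are equal only while the first file argument is being read. A small sketch of the idiom on its own, using a hypothetical function two_pass_demo that appends to each line how often its key occurs:

```shell
# two_pass_demo is a hypothetical name; $1 is the path to a comma-separated file,
# which is passed to awk twice so that it is read in two passes.
two_pass_demo() {
  awk -F, '
  FNR == NR { count[$1]++; next }     # first pass: tally each key, print nothing
  { print $0 "," count[$1] }          # second pass: append the tally to every line
  ' "$1" "$1"
}
```

Called as two_pass_demo file.csv, it keeps the input order because all output happens during the second pass. The idiom needs a real file; it cannot read standard input twice.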
Another awk:
$ awk '
BEGIN { # set them field separators
FS=","
OFS="|"
}
{
if($1 in a) { # if $1 already has an entry in a hash
t=$1 # store key temporarily
$1=a[$1] # set $1 to the stored record, rebuilding $0 with it in front
a[t]=$0 # and hash the record
} else { # if $1 seen for the first time
$1=$1 # rebuild record to change the separators
a[$1]=$0 # and hash the record
}
}
END { # afterwards
for(i in a) # iterate a
print a[i] # and output
}' file
Assuming your input is grouped by the key field as shown in your example (if it isn't, sort it first), you don't need to store the whole file in memory or read it twice. This will output the lines in the same order they appear in the input:
$ cat tst.awk
BEGIN { FS=","; OFS="|" }
$1 != prev {
if (NR>1) {
print rec
}
prev = rec = $1
}
{
$1 = ""
rec = rec $0
}
END { print rec }
$ awk -f tst.awk file
NAME1|NAME1_001|NULL|LIC100_1|NULL|LIC300-3|LIC300-6|NAME1_003|LIC000_1|NULL|NULL|NULL|NULL
NAME2|NAME2_001|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME3|NAME3_001|NULL|LIC400_2|NULL|NULL|LIC500_1|NAME3_005|LIC000_1|NULL|LIC400_2|NULL|NULL|NAME3_006|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME4|NAME4_002|NULL|LIC100_1|NULL|LIC300-3|LIC300-6
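If the input is not grouped and sorting is not wanted, a single pass can still keep the keys in first-seen order by recording that order in a second array. This is a sketch, not one of the answers above; it holds all merged rows in memory, and the name merge_rows is hypothetical.

```shell
# merge_rows is a hypothetical name; it reads comma-separated rows on stdin.
merge_rows() {
  awk '
  BEGIN { FS = ","; OFS = "|" }
  {
    key = $1
    $1 = ""                           # rebuilds $0 as "|f2|f3|..." using OFS
    if (!(key in rec)) order[++n] = key   # remember first-seen order of keys
    rec[key] = rec[key] $0            # append; $0 already starts with OFS
  }
  END { for (i = 1; i <= n; i++) print order[i] rec[order[i]] }'
}
printf 'A,a1,x\nB,b1,y\nA,a2,z\n' | merge_rows
```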

How to fetch a particular string using a sed command

I have an input string like below:
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
Like this, there are more than 1000 rows.
I want to fetch the value of the first parameter, i.e., the a field and d field, but for the d field I want only har:919876543210#abc.com.
I tried like this:
cat $filename | grep -v Orig |sed -e 's/['a:','d:']//g' |awk -F'|' -v OFS=',' '{print $1 "," $4}' >> $NGW_DATA_FILE
The output I got is below:
1,<har919876543210#abc.com>; tag=vy6r5BpcvQ
I want it like this,
1,har:919876543210#abc.com
Where did I make the mistake and how do I solve it?
EDIT: As per OP's change of Input_file and OP's comments, adding following now.
awk '
BEGIN{ FS="|"; OFS="," }
{
sub(/[^:]*:/,"",$1)
gsub(/^[^<]*|; .*/,"",$4)
gsub(/^<|>$/,"",$4)
print $1,$4
}' Input_file
With shown samples, could you please try following, written and tested with shown samples in GNU awk.
awk '
BEGIN{
FS="|"
OFS=","
}
{
val=""
for(i=1;i<=NF;i++){
split($i,arr,":")
if(arr[1]=="a" || arr[1]=="d"){
gsub(/^[^:]*:|; .*/,"",$i)
gsub(/^<|>$/,"",$i)
val=(val?val OFS:"")$i
}
}
print val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="|" ##Setting FS as pipe here.
OFS="," ##Setting OFS as comma here.
}
{
val="" ##Nullify val here(to avoid conflicts of its value later).
for(i=1;i<=NF;i++){ ##Traversing through all fields here
split($i,arr,":") ##Splitting current field into arr with delimiter by :
if(arr[1]=="a" || arr[1]=="d"){ ##Checking condition if first element of arr is either a OR d
gsub(/^[^:]*:|; .*/,"",$i) ##Globally substituting from starting till 1st occurrence of colon OR from semi colon to everything with NULL in $i.
gsub(/^<|>$/,"",$i) ##Removing leading < and trailing > from $i.
val=(val?val OFS:"")$i ##Creating variable val which has current field value and keep adding in it.
}
}
print val ##printing val here.
}
' Input_file ##Mentioning Input_file name here.
You may also try this AWK script:
cat file
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
awk -F '[|;]' '{
s=""
for (i=1; i<=NF; ++i)
if ($i ~ /^VAL:/) {
gsub(/^[^:]+:|[<>]*/, "", $i)
s = (s == "" ? "" : s "," ) $i
}
print s
}' file
1,har:919876543210#abc.com
You can do the same thing with sed rather easily using Extended Regex, two capture groups and two back-references, e.g.
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
Explanation
's/find/replace/' standard substitution, where the find is:
^[^:]*: from the beginning skip through the first ':', then
(\w+) capture one or more word characters ([a-zA-Z0-9_]; \w is a GNU extension, not POSIX ERE), then
[^<]*[<] consume zero or more characters not a '<', then the '<', then
([^>]+) capture everything not a '>', and
.*$ discard all remaining chars in line, then the replace is
\1,\2 reinsert the captured groups separated by a comma.
Example Use/Output
$ echo 'a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|' |
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
1,har:919876543210#abc.com
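Where \w is not available, the first capture group can be written with a bracket expression instead. The variant below is a sketch that assumes the first value ends at the first '|' and that the sed in use accepts -E; the function name extract_pair is hypothetical.

```shell
# extract_pair is a hypothetical name; it reads the pipe-delimited lines on stdin.
extract_pair() {
  # ([^|]*) captures the first value up to the first '|', replacing (\w+);
  # the rest of the expression is the same as in the command above.
  sed -E 's/^[^:]*:([^|]*)[^<]*<([^>]+).*$/\1,\2/'
}
printf '%s\n' 'a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|' | extract_pair
```

Because the expression anchors on the first ':' and the first '<', it does not depend on the field names, so it gives the same result for the VAL: form of the input.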

Filter out FASTA files by specified sequence length in bash

There's a FASTA file assembly.fasta containing contig names and corresponding sequences:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
>contig_3
CGTGATCTCGCCATTCGTGCCG
I want to get only contigs longer than 30 letters and get a new FASTA file assembly.filtered.fasta containing only those long sequences with contig names, in this format:
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
Using gnu-awk, you may use this simpler version:
awk -v RS='>[^\n]+\n' 'length() >= 30 {printf "%s", prt $0} {prt = RT}' file
>contig_1
CCAATACGGGCGCGCAGGCTTTCTATCGCGCGGCCGGCTTCGTCGAGGACGGGCGGCGCA
AGGATTACTACCGCAGCGGC
>contig_2
ATATAAACCTTATTCATCGTTTTCAGCCTAATTTTCCATTTAACAGGGATGATTTTCGTC
AAAATGCTGAGGCTTTACCAAGATTTTCTACCTTGCACCTTCAGAAAAAAATCATGGCAT
TTATAGACGAAATTCTCGAGAAA
A very quick way to achieve what you are after is:
awk -v n=30 '/^>/{ if(l>n) print b; b=$0;l=0;next }
{l+=length;b=b ORS $0}END{if(l>n) print b }' file
You might also be interested in BioAwk, an adapted version of awk that is tuned to process FASTA files:
bioawk -c fastx '(length($seq)>30){print ">" $name ORS $seq}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.
Could you please try following, tested and written with shown samples.
awk '
/^>/{
if(sign_val && strLen>=30){
print sign_val ORS line
}
strLen=line=""
sign_val=$0
next
}
{
strLen+=length($0)
line=(line?line ORS:"")$0
}
END{
if(sign_val && strLen>=30){
print sign_val ORS line
}
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/^>/{ ##Checking condition if line starts from > then do following.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and strLen is at least 30 then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
strLen=line="" ##Nullify variables strLen and line here.
sign_val=$0 ##Setting sign_val to current line here.
next ##next will skip all further statements from here.
}
{
strLen+=length($0) ##Adding length of current line to strLen here.
line=(line?line ORS:"")$0 ##Creating line variable and keep appending it to it with new line.
}
END{ ##Starting END block of this program from here.
if(sign_val && strLen>=30){ ##Checking if sign_val is SET and strLen is at least 30 then do following.
print sign_val ORS line ##Printing sign_val ORS and line here.
}
}
' Input_file ##mentioning Input_file name here.
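Before choosing a threshold it can help to see the length that each filter would compare against. The sketch below uses the same accumulate-until-next-header logic as the program above but prints only the contig name and its total sequence length. The function name contig_lengths is hypothetical.

```shell
# contig_lengths is a hypothetical name; it reads FASTA on stdin.
contig_lengths() {
  awk '
  /^>/ {
    if (name != "") print name, len   # flush the previous contig
    name = substr($0, 2)              # header line without the leading ">"
    len = 0
    next
  }
  { len += length($0) }               # sum sequence lines, newlines not counted
  END { if (name != "") print name, len }'
}
printf '>a\nACGT\nAC\n>b\nAC\n' | contig_lengths
```

Note that the answers above differ slightly in their test: some use > 30 and others >= 30, so a contig of exactly 30 letters is kept by one and dropped by the other.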

Extract sequence from list of data into separate line

sample.txt has tab-separated columns. The third column holds semicolon-separated values, some of which are number ranges that need to be expanded into one line per value.
cat sample.txt
2 2627 588;577
2 2629 566
2 2685 568-564
2 2771 573
2 2773 597
2 2779 533
2 2799 558
2 6919 726;740-742;777
2 7295 761;771-772
Please note that some lines may contain an inverted range, such as 568-564.
Using my previous scripts I managed to split on the semicolons, but failed to expand the ranges (the parts separated by a dash).
#!/bin/sh
awk -F"\t" '{print $1}' $1 >> $2 &&
awk -F"\t" '{print $2}' $1 >> $2 &&
awk -F"\t" '{print $3}' $1 >> $2 &&
sed -i "s/^M//;s/;\r//g" $2
#!/bin/awk -f
BEGIN { FS=";"; recNr=1}
!NF { ++recNr; lineNr=0; next }
{ ++lineNr }
lineNr == 1 { next }
recNr == 1 { a[lineNr] = $0 }
recNr == 2 { b[lineNr] = $0 }
recNr == 3 {
for (i=1; i<=NF; i++) {
print a[lineNr] "," b[lineNr] "," $i
}
}
Expected
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Could you please try following.
awk '
BEGIN{
OFS=","
}
{
num=split($NF,array,";")
for(i=1;i<=num;i++){
if(array[i]~/-/){
split(array[i],array2,"-")
to=array2[1]>array2[2]?array2[1]:array2[2]
from=array2[1]<array2[2]?array2[1]:array2[2]
while(from<=to){
print $1,$2,from++
}
}
else{
print $1,$2,array[i]
}
from=to=""
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code here.
OFS="," ##Setting OFS as comma here.
}
{
num=split($NF,array,";") ##Splitting last field of line into an array named array with delimiter semi-colon here.
for(i=1;i<=num;i++){ ##Starting a for loop from 1 to till value of num which is actually length of array created in previous step.
if(array[i]~/-/){ ##Checking condition if array value with index i is having dash then do following.
split(array[i],array2,"-") ##Split value of array with index i to array2 here with delimiter -(dash) here.
to=array2[1]>array2[2]?array2[1]:array2[2] ##Creating to variable which will compare 2 elements of array2 and have maximum value out of them here.
from=array2[1]<array2[2]?array2[1]:array2[2] ##Creating from variable which will compare 2 elements of array2 and will have minimum out of them.
while(from<=to){ ##Running while loop from variable from to till value of variable to here.
print $1,$2,from++ ##Printing 1st, 2nd fields with value of from variable and increasing from value with 1 each time it comes here.
}
}
else{ ##Mention else part of if condition here.
print $1,$2,array[i] ##Printing only 1st, 2nd fields along with value of array with index i here.
}
from=to="" ##Nullifying variables from and to here.
}
}
' Input_file ##Mentioning Input_file name here.
Adding link for conditional statements ? and : explanation as per James sir's comments:
https://www.gnu.org/software/gawk/manual/html_node/Conditional-Exp.html
For shown sample output will be as follows.
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
$ awk '
BEGIN {
FS="( +|;)" # input field separator is space or ;
OFS="," # output fs is comma
}
{
for(i=3;i<=NF;i++) { # from the 3rd field to the end
n=split($i,t,"-") # split on - if any. below loop from smaller to greater
if(n) # in case of empty fields
for(j=(t[1]<t[n]?t[1]:t[n]); j<=(t[1]<t[n]?t[n]:t[1]);j++)
print $1,$2,j # output
}
}' file
Output
2,2627,588
2,2627,577
2,2629,566
2,2685,564 <─┐
2,2685,565 │
2,2685,566 ├─ wrong order, from smaller to greater
2,2685,567 │
2,2685,568 <─┘
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Tested on GNU awk, mawk, Busybox awk and awk version 20121220.
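Both answers print an inverted range such as 568-564 in ascending order, which matches the expected output in the question. If the written direction should be kept instead (an assumption that goes beyond what was asked), the loop can step by +1 or -1. The sketch also assumes real tab separators, as the question describes; the function name expand_ranges is hypothetical.

```shell
# expand_ranges is a hypothetical name; it reads tab-separated lines on stdin.
expand_ranges() {
  awk '
  BEGIN { FS = "\t"; OFS = "," }
  {
    n = split($3, part, ";")          # semicolon-separated items in column 3
    for (i = 1; i <= n; i++) {
      m = split(part[i], r, "-")      # a single value gives m == 1
      if (m == 0) continue            # skip empty items
      from = r[1] + 0
      to = r[m] + 0
      step = (from <= to) ? 1 : -1    # count down for inverted ranges
      for (j = from; j != to + step; j += step)
        print $1, $2, j
    }
  }'
}
printf '2\t2685\t568-564\n' | expand_ranges
```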