Add index when same column value - awk

I have this two-column file:
ctg0F chr_1
ctg1F chr_2
ctg2F chr_3
ctg3F chr_4
ctg4F chr_5
ctg5F chr_6
ctg6F chr_4
ctg7F chr_7
ctg8F chr_8
The values in the first column are all different. I'd like to add an index only to the repeated values in the second column. Here chr_4 appears twice, so the result should be:
ctg0F chr_1
ctg1F chr_2
ctg2F chr_3
ctg3F chr_4_1
ctg4F chr_5
ctg5F chr_6
ctg6F chr_4_2
ctg7F chr_7
ctg8F chr_8
I tried this:
awk '{ if (++count[$2]>1) print $1,$2"_"count[$2]; else print $1,$2"_"count[$2]}'
But this adds the index "_1" even to the unique values.
Any help?

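With your shown samples, please try the following awk code. It reads the file once, stores every line, and does the printing in the END block, once the total count of each second-column value is known.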
awk '
{
lineValue[FNR]=$0
secField[$2]++
lineSecondValue[FNR]=$2
}
END{
for(i=1;i<=FNR;i++){
if(secField[lineSecondValue[i]]>1){
print lineValue[i] "_" (++currentSecVal[lineSecondValue[i]])
}
else{
print lineValue[i]  # unique value: print the line unchanged
}
}
}
' Input_file

$ awk '
NR==FNR { tot[$2]++; next }
{ print $0 (tot[$2]>1 ? "_" (++cnt[$2]) : "") }
' file file
ctg0F chr_1
ctg1F chr_2
ctg2F chr_3
ctg3F chr_4_1
ctg4F chr_5
ctg5F chr_6
ctg6F chr_4_2
ctg7F chr_7
ctg8F chr_8

This two-phase awk approach should work for you: the first pass computes and stores the frequency of each second-column value, and the second pass appends a suffix whenever that frequency is greater than one.
awk 'NR == FNR {++fq[$2]; next}
fq[$2] > 1 {$0 = $0 "_" ++ind[$2]} 1' file file
ctg0F chr_1
ctg1F chr_2
ctg2F chr_3
ctg3F chr_4_1
ctg4F chr_5
ctg5F chr_6
ctg6F chr_4_2
ctg7F chr_7
ctg8F chr_8
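The two-pass idiom above needs a file it can read twice, so it cannot take its input from a pipe. A minimal sketch of one way around that, assuming mktemp is available; the function name add_dup_index is hypothetical and not part of any answer:

```shell
# Hypothetical helper: reads two-column data on stdin, writes the indexed lines to stdout.
add_dup_index() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"                                      # spool stdin so awk can read it twice
    awk '
        NR == FNR { total[$2]++; next }               # pass 1: count each 2nd-column value
        total[$2] > 1 { $0 = $0 "_" (++seen[$2]) }    # pass 2: suffix only repeated values
        { print }
    ' "$tmp" "$tmp"
    rm -f "$tmp"
}

printf '%s\n' 'ctg0F chr_1' 'ctg3F chr_4' 'ctg6F chr_4' | add_dup_index
```

For input that fits in memory, buffering the lines in an array and printing them in the END block avoids the temporary file.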

Use this Perl in-line script:
perl -lane '
$total{ $F[-1] }++;
last LINE if eof;
END {
while ( <> ) {
chomp;
@fields = split;
$_ = join "_", $_, ++$cnt{ $fields[-1] } if $total{ $fields[-1] } > 1;
print;
}
}' input.txt input.txt
$total{ $F[-1] }++; : Count the total number of occurrences of the last column strings (chr...).
last LINE if eof; : Stop processing the file in first command line argument (input.txt) and skip to processing file in the second argument (which is also input.txt).
while ( <> ) { ... } : Read the current input file (input.txt) line by line, storing the current line in $_ variable.
chomp : Strip the terminal newline from $_.
$_ = join "_", $_, ++$cnt{ $fields[-1] } if $total{ $fields[-1] } > 1; : If the total number of occurrences of the last field is greater than 1, then increment the running count for that value by 1, and append it to the current line, joined with an underscore.
print; : Print the resulting current line.
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlvar: Perl predefined variables

{m,g,n}awk '(_ = NF)*(FNR == NR) ? !++__[$_] :
__[$_]<_ || $!_=$!_ "_"++___[$_] ' file file
ctg0F chr_1
ctg1F chr_2
ctg2F chr_3
ctg3F chr_4_1
ctg4F chr_5
ctg5F chr_6
ctg6F chr_4_2
ctg7F chr_7
ctg8F chr_8

Related

Edit only specific lines when I find special character with awk

I have this kind of file :
>AX-89948491-minus
CTAACACATTTAGTAGATT
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
When a line starts with ">" and includes "minus", I need to reverse (rev) and translate (tr) the line that follows it. I should get:
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
I would like to do this with awk. I tried the following, but it does not work:
awk '{if(NR%2==1~/"plus"/){print;getline;print} else if (NR%2==1~/"minus"/){system("echo "$0" | rev | tr ATCGatcg TAGCtagc")} else {print;getline;print}}' file
Any help?
This gnu-awk should work for you (the <<< here-string requires that the shell awk uses to run commands supports it, e.g. bash):
awk '
p {
cmd = "rev <<< \047" $0 "\047 | tr ATCGatcg TAGCtagc"
if ((cmd |& getline var) > 0)
$0 = var
close(cmd)  # release the co-process before the next sequence
}
{
p = /^>/ && /-minus/
} 1' file
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
Awk is a tool to manipulate text, not a tool to sequence calls to other tools. The latter is what a shell is for. There are times when you need to call other tools from awk but not when it's simple text manipulation like reversing and translating characters in a string as you want to do.
Using any awk in any shell on every Unix box without spawning a subshell once per target input line to call other Unix tools (including the non-POSIX-defined rev which won't exist on some Unix boxes):
$ cat tst.awk
BEGIN {
split("ATCGatcg TAGCtagc",tmp)
for (i=1; i<=length(tmp[1]); i++) {
tr[substr(tmp[1],i,1)] = substr(tmp[2],i,1)
}
}
f {
out = ""
for (i=1; i<=length($0); i++) {
char = substr($0,i,1)
out = (char in tr ? tr[char] : char) out
}
$0 = out
f = 0
}
/^>.*minus/ { f=1 }
{ print }
$ awk -f tst.awk file
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT
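The reverse-and-translate step of tst.awk can be tried on its own with a single sequence. This is only a sketch of that one step; the wrapper name revcomp is hypothetical:

```shell
# Hypothetical helper: reverse-complement one sequence string using plain awk.
revcomp() {
    printf '%s\n' "$1" | awk '
        BEGIN {
            from = "ATCGatcg"; to = "TAGCtagc"
            for (i = 1; i <= length(from); i++)
                map[substr(from, i, 1)] = substr(to, i, 1)
        }
        {
            out = ""
            for (i = 1; i <= length($0); i++) {
                c = substr($0, i, 1)
                out = ((c in map) ? map[c] : c) out   # prepending reverses the string
            }
            print out
        }'
}

revcomp CTAACACATTTAGTAGATT   # prints AATCTACTAAATGTGTTAG
```

Characters outside the translation table (for example N) are kept as they are and only change position.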
I'd use perl, as it has builtin reverse and tr functions:
perl -lpe '
if (/^>/) {$rev = /minus/; next}
if ($rev) {$_ = reverse; tr/ATCGatcg/TAGCtagc/}
' file
>AX-89948491-minus
AATCTACTAAATGTGTTAG
>AX-89940152-plus
cgtcattcagggcaggtggggcaaaA
>AX-89922107-plus
TTATAACTTGTGTATGCTCTCAGGCT

How to remove white spaces between word and specific token?

I want to clean up my code, which looks sort of like this.
for(i=0; i < max; i++){
test = 5 + test ;
if (test == 10)
printf("HELLO WORLD\n") ;
}
How do you remove the spaces between "test" and ";" without de-formatting the other lines?
Edit: I also want to remove all other spaces that come before a semicolon.
I tried something with this:
sed 's/;.*$/;/p' $FILE
but it also removes the spaces from the beginning of the line up to the first word.
I would prefer an answer that uses something like awk or sed.
Thanks.
This would remove all blank characters (spaces or tabs) before a ; at the end of the line:
sed 's/[[:blank:]]*;$/;/' file.c
I assume that you also want the spaces before the semicolon on the printf line to be removed.
To remove the spaces between "test" and ";" without de-formatting the other lines using awk:
$ awk '{sub(/test +;/,"test;")}1' file
for(i=0; i < max; i++){
test = 5 + test;
if (test == 10)
printf("HELLO WORLD\n") ;
}
If test +; can also appear somewhere other than the end of the line, you could anchor the regex to the end: /test +; *$/
Edit: To remove all the spaces before (and after) a ; at the end of the line using awk:
$ awk '{sub(/ +; *$/,";")}1' file
for(i=0; i < max; i++){
test = 5 + test;
if (test == 10)
printf("HELLO WORLD\n");
}
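The commands above only touch a semicolon at the end of the line. A minimal sed sketch for the blanks before every semicolon on a line follows; the function name is hypothetical, and it assumes no semicolon inside a string literal needs its leading spaces kept:

```shell
# Hypothetical helper: delete spaces/tabs that directly precede any semicolon.
strip_space_before_semicolon() {
    sed 's/[[:blank:]]*;/;/g'
}

printf '%s\n' '    test = 5 + test ;' '        printf("HELLO WORLD\n") ;' |
    strip_space_before_semicolon
```

Leading indentation is left alone because only blanks immediately followed by a ; are matched.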

AWK: add a sequential number out of 4 digits

Given the following file, string.ext:
>Lipoprotein releasing system transmembrane protein LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>Phosphoserine phosphatase (EC 3.1.3.3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
how do I add the file name (without extension) and a sequential 4-digit number (starting with 0001) to each header line, separated by |, so that the output looks like this:
>string|0001|Lipoprotein_releasing_system_transmembrane_protein_LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine_phosphatase_(EC_3_1_3_3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
The commands I have come up with so far are ($faa refers to the file name string.ext):
faa=$1
var=$(basename "$faa" .ext)
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' $faa >$faa.tmp
sed 's/ /_/g' $faa.tmp >$faa.tmp2
awk -v var="$var" '/>/{sub(">","&"var"|");sub(/\.ext/,x)}1' $faa.tmp2 >$faa.tmp3
awk '/>/{sub(/\|/,++i"|")}1' $faa.tmp3 >$faa.tmp4
tr '\.' '_' <$faa.tmp4 | tr '\:' '_' | sed 's/__/_/g' >$faa.tmp5
Edit: I also want to change each of the following characters to one underscore: / . :
I'd use perl here:
perl -pe '
next unless /^>/; # only transform the "header" lines
s/[\h.]/_/g; # change dots and horizontal whitespace
substr($_,1,0) = sprintf("string|%04d|", ++$n) # insert the counter
' file
$ awk '
FNR==1 {base=FILENAME; sub(/\.[^.]+$/,"",base) }
sub(/^>/,"") { gsub(/[\/ .:]+/,"_"); $0=sprintf(">%s|%04d|%s",base,++c,$0) }
1' string.ext
>string|0001|Lipoprotein_releasing_system_transmembrane_protein_LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine_phosphatase_(EC_3_1_3_3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
I'm assuming from your posted sample and code that you actually want every contiguous sequence of any combination of spaces, periods, forward slashes and/or colons converted to a single underscore.
In awk.
$ awk '/^>/{n=sprintf("%04d",++i);sub(/^>/,">string|" n "|")}1' file
>string|0001|Lipoprotein releasing system transmembrane protein LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine phosphatase (EC 3.1.3.3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
Explained:
$ awk '
/^>/ { # if string starts with >
n=sprintf("%04d",++i) # iterate i from 1 and zeropad
sub(/^>/,">string|" n "|") # replace the > with stuff
}1' file # implicit output
Don't include & in string (see comments).
awk -F'[ \.]' 'BEGIN{a=1;OFS="_"}/^>/{$1=sprintf(">String|%04d",a);++a;print $0; next;}{print $0}' filename

How to remove space and the specific character in string - awk

Below is my input.
!{ID=34, ID2=35}
>
!{ID=99, ID2=23}
>
!{ID=18, ID2=87}
<
I am trying to produce the result below. That is, I want to remove the spaces and the '{' and '}' characters, and check whether the next line is '>' or '<'.
In the real data, the input above is repeated many times. I also need to parse the '>' and '<' characters so that I can put the parsed string (YES or NO) into a database.
ID=34,ID=35#YES#NO
ID=99,ID=23#YES#NO
ID=18,ID=87#NO#YES
I thought I could remove the spaces with the 'sub' function, but the result shows:
1#YES#NO
Can you tell me what is wrong?
If possible, please show me how to remove '{' and '}' as well.
I would appreciate an awk script file rather than a one-liner.
BEGIN {
VALUES = ""
L_EXIST = "NO"
R_EXIST = "NO"
}
/!/ { VALUES = gsub(" ", "", $0);
getline;
if ($1 == ">") L_EXIST = "YES";
else if ($1 == "<") R_EXIST = "YES";
print VALUES"#"L_EXIST"#"R_EXIST
}
END {
}
Given your sample input:
$ cat file
!{ID=34, ID2=35}
>
!{ID=99, ID2=23}
>
!{ID=18, ID2=87}
<
This script produces the desired output:
BEGIN { FS="[}{=, \n]+"; RS="!" }
NR > 1 { printf "ID=%d,ID=%d#%s\n", $3, $5, ($6==">"?"YES#NO":"NO#YES") }
The Field Separator is set to consume the spaces, newlines and other characters between the parts of the record that you're interested in. The newline has to be listed explicitly, because once RS is no longer a newline, awk does not treat newlines as field separators on its own. The Record Separator is set to !, so that each pair of lines is treated as a single record.
The first record is empty (the start of the first line, up to the first !), so we only process the ones after that. The output is constructed using printf, with a ternary to determine the last part (I assume that there are only two options, > or <).
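To see how a record falls apart into fields under these settings, the fields can be printed directly. This is a sketch only: the function name show_fields is hypothetical, and the newline is listed in the separator set on the assumption that the > or < line should land in $6.

```shell
# Hypothetical helper: print the fields awk sees in each "!"-separated record.
show_fields() {
    awk '
    BEGIN { FS = "[}{=, \n]+"; RS = "!" }
    NR > 1 { print NR ": " $2 "=" $3 " " $4 "=" $5 " next=" $6 }
    '
}

printf '%s\n' '!{ID=34, ID2=35}' '>' '!{ID=18, ID2=87}' '<' | show_fields
```

Record 1 is the empty text before the first !, so the numbering of the printed records starts at 2. $1 is empty as well, because each record begins with the separator character {.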
Let's say you have this input:
input.txt
!{ID=34, ID2=35}
!{ID=36, ID2=37}
>
You can use the following awk command
awk -F'[!{}, ]' 'NR>1{yn="NO";if($1==">")yn="YES";print l"#"yn}{l=$3","$5}' input.txt
to produce this output:
ID=34,ID2=35#NO
ID=36,ID2=37#YES

Fields contain field separator as string: How to apply awk correctly in this case?

I have a CSV-file similar to this test.csv file:
Header 1; Header 2; Header 3
A;B;US
C;D;US
E;F;US
G;H;FR
I;J;FR
K;L;FR
M;"String with ; semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";
Now, I want to split this file based on header 3. So I want to end up with four separate CSV files, one for "US", "FR", "UK", and "".
With my (sadly) very limited Linux command line skills, I have been using this line until now:
awk -F\; 'NR>1{ fname="country_yearly_"$3".csv"; print >>(fname); close(fname);}' test.csv
Of course, experienced command line users will spot my problem: the semicolon that I use as a separator also appears inside some fields. Those fields are wrapped in quotation marks (with millions of rows I can't guarantee that, but I'm happy with an answer that assumes it). So, sadly, I get an additional file named country_yearly_ semicolon".csv, which contains this row from my example.
In my venture to solve this issue, I came across this question on SO. In particular, Thor's answer seems to contain the solution of my problem by replacing all semicolons in strings. I adjusted his code accordingly as follows:
awk -F'"' -v OFS='' '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub(";", "|", $i);
$i = FS $i FS; # reinsert the quotes
}
print
}' test.csv > test1.csv
Now, I get the following test1.csv file:
M;"String with | semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";
As you can see, all rows that have quotation marks are shown, and my problem line is fixed as well. However: a) I actually want all rows, not only those with quotation marks, and I can't figure out which part of his code limits the output to rows with quotation marks; and b) I think it would be more efficient to change test.csv in place instead of sending the output to a new file, but I don't know how to do that either.
EDIT in response to Birei's answer:
Unfortunately, my minimal example was too simple. Here is an updated version:
Header 1; Header 2; Header 3; Header 4
A;B;US;
C;D;US;
E;F;US;
G;H;FR;
I;J;FR;
K;L;FR;
M;"String with ; semicolon";UK;"Yet another ; string"
N;"String without semicolon";UK; "No problem here"
O;"String OK";;"Fine"
P;"String OK";;"Not ; fine"
Note that my real data has roughly 100 columns and millions of rows and the country column, ignoring semicolons in strings, is column 13. However, as far as I see it I can't use the fact that it's column 13 if I don't get rid of the semicolons in strings first.
To split the file, you might just do:
awk -v FS=";" '{ CSV_FILE = "country_yearly_" $NF ".csv" ; print > CSV_FILE }'
This always takes the last field to construct the file name.
In your example, only lines with quotation marks are printed due to the NF > 1 pattern. The following script will print all lines:
awk -F'"' -v OFS='' '
NF > 1 {
for(i=2; i<=NF; i+=2) {
gsub(";", "|", $i);
$i = FS $i FS; # reinsert the quotes
}
}
{
# print all lines
print
}' test.csv > test1.csv
To do what you want, you could mask the semicolons inside quotes in a copy of the line, split that copy, and write out the original line:
awk -F'"' -v OFS='' '
# Save the original line
{ ORIGINAL_LINE = LINE = $0 }
# Replace the semicolon inside quotes by a dummy character
# and put the resulting line in the LINE variable
NF > 1 {
LINE = ""
for(i=2; i<=NF; i+=2) {
gsub(";", "|", $i)
LINE = LINE $(i-1) FS $i FS # reinsert the quotes
}
# Add the end of the line after the last quote
# (the loop exits with i one past that trailing piece, so it is $(i-1))
LINE = LINE $(i-1)
}
{
# Put the semicolon-separated fields in a table
# (the semicolon inside quotes have been removed from LINE)
split( LINE, TABLE, /;/ )
# Build the file name -- TABLE[ 3 ] is the 3rd field
CSV_FILE = "country_yearly_" TABLE[ 3 ] ".csv"
# Save the line
print ORIGINAL_LINE > CSV_FILE
}' test.csv
You were close to a solution. I would use the last field to avoid the problem of fields with double quotes. Also, there is no need to close each file after every write: awk closes any files it still has open when the script exits.
awk '
BEGIN {
FS = OFS = ";";
}
FNR > 1 {
fname = "country_yearly_" $NF ".csv";
print >>fname;
}
' infile
Check output:
head country_yearly_*
That yields:
==> country_yearly_.csv <==
O;"String OK";
P;"String OK";
==> country_yearly_FR.csv <==
G;H;FR
I;J;FR
K;L;FR
==> country_yearly_UK.csv <==
M;"String with ; semicolon";UK
N;"String without semicolon";UK
==> country_yearly_US.csv <==
A;B;US
C;D;US
E;F;US