How to print the initials with awk - awk

I have this input text file:
Pedro Paulo da Silva
22 years old
Brazil

Bruce Mackenzie
30 years old
United States of America

Lee Dong In
26 years old
South Korea
The person's name is always on the first line of the file and on the first line after each empty line (\n).
I have to do this output (ignoring everything except the names in the first lines):
PedroPdS
BruceM
LeeDI
I don't know how to do that with awk. I just know that awk '{print $number}' grabs column $number, and that's how I'm supposed to grab their names.
I've searched here and found this: sed -e 's/$/ /' -e 's/\([^ ]\)[^ ]* /\1/g' -e 's/^ *//'
But I have to use awk.

Would you please try the following:
awk -v RS="" -F '\n' ' # records are separated on blank lines by setting RS to null
{
n = split($1, b, " ") # split the name on spaces
init = b[1] # the first name
for (i = 2; i <= n; i++) # loop over the remaining
init = init substr(b[i], 1, 1) # append the initial
print init
}' input.txt
Output:
PedroPdS
BruceM
LeeDI

With your shown samples, please try following once.
awk '
!NF{
count=0
next
}
++count==1{
printf("%s%s",$1,NF==1?ORS:"")
for(i=2;i<=NF;i++){
printf("%s%s",substr($i,1,1),i==NF?ORS:"")
}
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
!NF{ ##Checking if line is empty then do following.
count=0 ##Setting count to 0 here.
next ##next will skip all further statements from here.
}
++count==1{ ##Checking condition if count is 1 then do following.
printf("%s%s",$1,NF==1?ORS:"") ##Using printf to print $1 followed by new line OR nothing.
for(i=2;i<=NF;i++){ ##Starting a for loop here.
printf("%s%s",substr($i,1,1),i==NF?ORS:"") ##Using printf to print only the 1st character of each field from field 2 to the last field, followed by a new line after the last field.
}
}' Input_file ##Mentioning Input_file name here.

I would use GNU AWK for this task in the following way. Let the content of file.txt be
Pedro Paulo da Silva
22 years old
Brazil

Bruce Mackenzie
30 years old
United States of America

Lee Dong In
26 years old
South Korea
then
awk '{if(prevline==""){print gensub(/ ([[:alpha:]])[[:alpha:]]+/, "\\1", "g")};prevline=$0}' file.txt
output
PedroPdS
BruceM
LeeDI
Explanation: there are two things to do: first select which lines to print, then turn their content into initials. For the first, I check whether the previous line (prevline) is an empty string. AWK treats a variable that was never set as an empty string when comparing it with another string, so the condition is met for the first line. After processing each line I set prevline to the line content ($0), so on the next line it holds the previous line. For the conversion into initials I use the gensub function: I instruct AWK to replace each space-letter-letters sequence with just the letter, and print the changed line.
(tested in gawk 4.2.1)
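gensub is specific to GNU AWK. As a rough sketch of the same prevline idea for other awks, the substitution can be replaced by a loop over the fields. The function name initials is hypothetical and only used for illustration, and the sketch assumes the names are separated from the other blocks by blank lines:

```shell
# initials: hypothetical helper, not part of any answer above.
# Prints the first word plus the initials of the remaining words for the
# first line of each blank-line-separated block (reads stdin or files).
initials() {
  awk '
    prevline == "" {                  # first line, or a line right after a blank one
      out = $1
      for (i = 2; i <= NF; i++)
        out = out substr($i, 1, 1)    # append the first character of each later word
      if (NF) print out               # ignore runs of consecutive blank lines
    }
    { prevline = $0 }                 # remember this line for the next record
  ' "$@"
}

printf 'Pedro Paulo da Silva\n22 years old\nBrazil\n\nBruce Mackenzie\n30 years old\n' | initials
```

Unlike the gensub version, this also shortens single-letter words and words that start with a non-letter, since it only looks at the first character of each field.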

$ cat input
Pedro Paulo da Silva
22 years old
Brazil

Bruce Mackenzie
30 years old
United States of America

Lee Dong In
26 years old
South Korea
$ awk '!a{ printf "%s", $1;
for( i = 2; i <= NF; i++ ) printf("%c", $i);
printf "\n"; a=1}
/^$/{a=0}' input
PedroPdS
BruceM
LeeDI

You can try this:
awk -F ' ' 'BEGIN {X = 1} NR == X{print $1 substr($2, 1, 1) substr($3, 1, 1) substr($4, 1, 1); X += 4}' file

Another potential option is:
awk '/[0-9]/{print p} {p=$1 substr($2, 1, 1) substr($3, 1, 1) substr($4, 1, 1) substr($5, 1, 1)}' file

Another variation with a mix from the existing answers.
awk '{
if (!x) { # Variable x is empty at the start or set to empty line
res=$1 # Set res to field 1
for(i=2; i<=NF;i++) { # Loop the rest of the fields starting at field 2
res = res substr($i, 1, 1) # Concat the first char from each field with res
}
print res
}
x=$0 # Set x variable to the value of the current line
}
' file
Output
PedroPdS
BruceM
LeeDI

With GNU awk in paragraph mode and using gensub() function you can get it:
awk 'BEGIN {RS = ""; FS = "\n"} {print gensub(/[[:space:]]([[:alpha:]])[^[:space:]]*/,"\\1","g",$1)}' file
PedroPdS
BruceM
LeeDI

Yet another. It turned out a bit like @WilliamPursell's, though (++):
$ awk '!p{for(i=1;i<=NF;i++)printf (i==1?"%s%s":"%c%s"),$i,(i==NF?ORS:"")}{p=NF}' file
Output:
PedroPdS
BruceM
LeeDI
"Explained":
$ awk '
!p { # if previous record empty
for(i=1;i<=NF;i++) # process record for ...
printf (i==1?"%s%s":"%c%s"),$i,(i==NF?ORS:"") # ... output
}
{ p=NF }' file # store field count

Related

compare and print 2 columns from 2 files in awk or perl

I have 2 files with 2 million lines.
I need to compare 2 columns in 2 different files and I want to print the lines of the 2 files where there are equal items.
this awk code works, but it does not print lines from the 2 files:
awk 'NR == FNR {a[$3]; next}$3 in a' file1.txt file2.txt
file1.txt
0001 00000001 084010800001080
0001 00000010 041140000100004
file2.txt
2451 00000009 401208008004000
2451 00000010 084010800001080
desired output:
file1[$1]-file2[$1] file1[$2]-file2[$2] $3 ( same on both files )
0001-2451 00000001-00000010 084010800001080
how to do this in awk or perl?
Assuming your $3 values are unique within each input file as shown in your sample input/output:
$ cat tst.awk
NR==FNR {
foos[$3] = $1
bars[$3] = $2
next
}
$3 in foos {
print foos[$3] "-" $1, bars[$3] "-" $2, $3
}
$ awk -f tst.awk file1.txt file2.txt
0001-2451 00000001-00000010 084010800001080
I named the arrays foos[] and bars[] as I don't know what the first 2 columns of your input actually represent - choose a more meaningful name.
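If the uniqueness assumption does not hold, one possible variation is to keep every file1 line per key and print one row for each pairing. This is only a sketch: the helper name join_on_col3 and the array names are made up for illustration, and it still holds all of file1 in memory.

```shell
# join_on_col3 FILE1 FILE2: hypothetical helper, not from the answer above.
# Pairs every FILE1 line with every FILE2 line that has the same third column.
join_on_col3() {
  awk '
    NR == FNR {                       # first file: collect all lines per key
      n = ++count[$3]
      first[$3, n]  = $1
      second[$3, n] = $2
      next
    }
    $3 in count {                     # second file: one output row per stored line
      for (i = 1; i <= count[$3]; i++)
        print first[$3, i] "-" $1, second[$3, i] "-" $2, $3
    }
  ' "$1" "$2"
}
```

It would be called as join_on_col3 file1.txt file2.txt. With the sample files from the question it prints the same single line as tst.awk, because each key occurs only once there.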
With your shown samples, please try the following awk code. Fair warning: I haven't tested it with millions of lines.
awk '
FNR == NR{
arr1[$3]=$0
next
}
($3 in arr1){
split(arr1[$3],arr2)
print (arr2[1]"-"$1,arr2[2]"-"$2,$3)
delete arr2
}
' file1.txt file2.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR == NR{ ##checking condition which will be TRUE when first Input_file is being read.
arr1[$3]=$0 ##Creating arr1 array indexed by $3 with the whole current line as its value.
next ##next will skip all further statements from here.
}
($3 in arr1){ ##checking if $3 is present in arr1 then do following.
split(arr1[$3],arr2) ##Splitting value of arr1 into arr2.
print (arr2[1]"-"$1,arr2[2]"-"$2,$3) ##printing values as per requirement of OP.
delete arr2 ##Deleting arr2 array here.
}
' file1.txt file2.txt ##Mentioning Input_file names here.
If you have two massive files, you may want to use sort, join and awk to produce your output without having to have the first file mostly in memory.
Based on your example, this pipe would do that:
join -1 3 -2 3 <(sort -k3,3 file1.txt) <(sort -k3,3 file2.txt) | awk '{printf("%s-%s %s-%s %s\n",$2,$4,$3,$5,$1)}'
Prints:
0001-2451 00000001-00000010 084010800001080
If your files are that big, you might want to avoid storing the data in memory. It's a whole lot of comparisons: 2 million lines times 2 million lines = 4 * 10^12 comparisons.
use strict;
use warnings;
use feature 'say';
my $file1 = shift;
my $file2 = shift;
open my $fh1, "<", $file1 or die "Cannot open '$file1': $!";
while (<$fh1>) {
my @F = split;
open my $fh2, "<", $file2 or die "Cannot open '$file2': $!";
# for each line of file1 file2 is reopened and read again
while (my $cmp = <$fh2>) {
my @C = split ' ', $cmp;
if ($F[2] eq $C[2]) { # check string equality
say "$F[0]-$C[0] $F[1]-$C[1] $F[2]";
}
}
}
With your rather limited test set, I get the following output:
0001-2451 00000001-00000010 084010800001080
Python: tested with 2,000,000 rows in each file
d = {}
with open('1.txt', 'r') as f1, open('2.txt', 'r') as f2:
    for line in f1:
        if not line: break
        c0,c1,c2 = line.split()
        d[(c2)] = (c0,c1)
    for line in f2:
        if not line: break
        c0,c1,c2 = line.split()
        if (c2) in d: print("{}-{} {}-{} {}".format(d[(c2)][0], c0, d[(c2)][1], c1, c2))
$ time python3 comapre.py
1001-2001 10000001-20000001 224010800001084
1042-2013 10000042-20000013 224010800001096
real 0m3.555s
user 0m3.234s
sys 0m0.321s

Move new line character 5 positions downstream in a text (fasta) file

I am trying to transform a text file like this (fasta format):
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
The objective is to displace newline character 5 positions downstream, except for those lines starting with >
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
I would like to use AWK, but I am not sure how to proceed. I am thinking about something similar to this:
awk '{for(i=1;i<=NR;i++){ if($1 ~ /^>/){¿?¿?¿?}}}'
Do you know how can I solve this?
Assumptions:
all data lines are to be expanded to a max of 24 characters
One awk idea:
awk -v width=24 ' # pass width in as awk variable "width"
function print_sequence() {
if (sequence) # if sequence is not blank
while (sequence) { # while sequence is not blank
print substr(sequence,1,width) # print 1st 24 characters
sequence=substr(sequence,width+1) # remove 1st 24 characters
}
}
/^>/ { print_sequence() # flush previous set of data to stdout
print # print current input line
next # process next input line
}
{ sequence=sequence $1 } # append data to our "sequence" variable
END { print_sequence() } # flush last set of data to stdout
' fasta.in > fasta.out
This generates:
$ cat fasta.out
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
I would do it following way, let file.txt content be
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
then
awk 'BEGIN{width=24}/>/&&x{print x;x=""}/>/{print;next}{x = x $0}length(x)>=width{print substr(x,1,width);x=substr(x,width+1)}END{print x}' file.txt
gives output
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
Explanation: I set width to 24, which is the desired number of characters per line. If > is found and something is stored in x, print it and set x to an empty string; if a line with > is encountered, print it and go to the next line. For every other line, append the current line content to x; if the length of x is equal to or greater than width, print the first width characters of x and remove them from x. After processing all lines, print x. Disclaimer: at most one chunk is printed per input line, so this solution assumes that the input lines are no longer than the desired width.
(GNU Awk 5.0.1)
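The one-liner above prints at most one chunk per data line. A minimal sketch that loops instead, so that input lines longer than the target width are also handled, could look like this. The names rewrap, emit_rest and buf are hypothetical, and the target width is taken as an optional first argument:

```shell
# rewrap [WIDTH]: hypothetical helper; reads FASTA text on stdin and
# re-wraps the sequence lines to WIDTH characters (24 if omitted).
rewrap() {
  awk -v width="${1:-24}" '
    function emit_rest() {            # print what is left of the current sequence
      if (buf != "") print buf
      buf = ""
    }
    /^>/ { emit_rest(); print; next } # header: finish previous sequence, echo header
    {
      buf = buf $0                    # append this data line to the buffer
      while (length(buf) >= width) {  # emit as many full lines as are available
        print substr(buf, 1, width)
        buf = substr(buf, width + 1)
      }
    }
    END { emit_rest() }               # flush the tail of the last sequence
  '
}

printf '>s\nAAAAAAAAAA\nCC\n' | rewrap 4
```

Because full lines are printed as soon as they are available, only a partial line is kept in memory at any time.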
Yet another approach you could try, using awk's field and record separators:
awk -v width=24 '
BEGIN {
FS="\n" # Set the Field separator to newline
RS=">" # Set the Record separator to ">"
ORS=OFS="" # Set the Output Record and Field separator to an empty string
}
NR>1 { # Using ">" as a record separator the first record is empty, so skip
header=$1 # Using "\n" as the Field separator, $1 contains the header, save it in a variable
$1=OFS # Assign an empty string to $1 so the record gets recalculated and the body becomes $0
# with all newlines removed, since OFS == ""
gsub(".{" width "}", "&" FS) # Append every "width" characters with a newline (FS)
print RS header FS $0 FS # Print a ">", the header, a newline, the body and a newline
}
' fasta_in > fasta_out
Assuming the line that starts with > is never more than 24 chars long:
$ awk '{printf "%s", (/^>/ ? sep $0 ORS : $0); sep=ORS} END{print ""}' file | fold -w24
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG

Extract first position of a regex match grep

Good morning everyone,
I have a text file containing multiple lines. I want to find a regular pattern inside it and print its position using grep.
For example:
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
I want to find L[any_letter]T in the file and print the position of L and the three-letter code. In this case the result would be:
11 LIT
8 LAT
4 LKT
I wrote a code in grep, but it doesn't return what I need. The code is:
grep -E -boe "L.T" file.txt
It returns:
11:LIT
21:LAT
30:LKT
Any help would be appreciated!!
Awk suits this better:
awk 'match($0, /L[[:alpha:]]T/) {
print RSTART, substr($0, RSTART, RLENGTH)}' file
11 LIT
8 LAT
4 LKT
This is assuming only one such match per line.
If there can be multiple overlapping matches per line then use:
awk '{
n = 0
while (match($0, /L[[:alpha:]]T/)) {
n += RSTART
print n, substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + 1)
}
}' file
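To see the overlapping case in action without touching $0, the same loop can be run on a copy of the line. This is only an illustrative sketch; the names match_positions, rest and off are made up here:

```shell
# match_positions: hypothetical helper; prints "position match" for every
# L<letter>T occurrence, including overlapping ones (reads stdin or files).
match_positions() {
  awk '{
    rest = $0; off = 0                  # work on a copy so $0 stays intact
    while (match(rest, /L[[:alpha:]]T/)) {
      print off + RSTART, substr(rest, RSTART, RLENGTH)
      off += RSTART                     # number of characters dropped so far
      rest = substr(rest, RSTART + 1)   # resume one character after the match start
    }
  }' "$@"
}

printf 'ARTGHFRHOPLIT\nLLTT\n' | match_positions
```

For the second input line, LLT at position 1 and LTT at position 2 share two characters, and both are reported because the search resumes just one character after the start of the previous match.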
With your shown samples, please try the following awk code. It uses only index() and substr(), so it should work in any awk.
awk '
{
  prev=0
  while(ind=index($0,"L")){
    if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){
      print prev+ind,substr($0,ind,3)
    }
    $0=substr($0,ind+1)
    prev+=ind
  }
}' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
{
  prev=0 ##Resetting prev, the number of characters already removed from the line.
  while(ind=index($0,"L")){ ##Run while loop as long as letter L is found (its index is stored in ind variable).
    if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){ ##Checking condition if character 2 positions after L is T AND character next to L is a letter.
      print prev+ind,substr($0,ind,3) ##Printing position of L in the original line along with 3 letters from index of L eg:(LIT).
    }
    $0=substr($0,ind+1) ##Setting current line to the part right after the L which was just checked.
    prev+=ind ##Adding ind (number of removed characters) to prev value.
  }
}' Input_file ##Mentioning Input_file name here.
Peeking at the answer of @anubhava you might also sum the RSTART + RLENGTH and use that as the start for the substr to get multiple matches per line and per word.
The while loop takes the current line, and for every iteration it updates its value by setting it to the part right after the last match till the end of the string.
Note that if you use the . in a regex it can match any character.
awk '{
pos = 0
while (match($0, /L[a-zA-Z]T/)) {
pos += RSTART
print pos, substr($0, RSTART, RLENGTH)
pos += RLENGTH - 1
$0 = substr($0, RSTART + RLENGTH)
}
}' file
If file contains
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
ARTGHFRHOPLITLOT LATTELET
LUT
The output is
11 LIT
8 LAT
4 LKT
11 LIT
14 LOT
18 LAT
23 LET
1 LUT

Prevent awk from adding non-integers?

I have a file that has these columns that I would like to add:
absolute_broad_major_cn
1
1
1
1
1.76
1.76
NA
1
and
absolute_broad_minor_cn
1
1
1
1
0.92
0.92
NA
1
I did awk '{ print $1+$2 }', which worked well, but it put 0 wherever there was an NA. Is it possible to make awk skip these and just put NA again instead (so awk only adds numbers)?
Edit: Desired output is:
<Column header>
2
2
2
2
2.68
2.68
NA
2
paste absolute* | awk '{ if ($1 == "NA" && $2 == "NA") print "NA"; else print $1 + $2; }'
would do the trick; whether you want && (both are "NA" to produce an "NA") or || (either one is "NA" produces an NA) is specific to your need.
Could you please try following, written and tested with shown samples.
awk '
FNR==NR{
a[FNR]=$0
next
}
{
print ($0~/[a-zA-Z]/ && a[FNR]~/[a-zA-Z]/?"NA":a[FNR]+$0)
}
' absolute_broad_major_cn absolute_broad_minor_cn
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when Input_file absolute_broad_major_cn is being read.
a[FNR]=$0 ##Creating array a with index FNR and having value as current line here.
next ##next will skip all further statements from here.
}
{
print ($0~/[a-zA-Z]/ && a[FNR]~/[a-zA-Z]/?"NA":a[FNR]+$0) ##Printing NA in case a letter is found in both the array value AND the current line, else printing addition of current line with array a value.
}
' absolute_broad_major_cn absolute_broad_minor_cn ##Mentioning Input_file names here.
I think what you're really trying to do is sum 2 numeric columns from 1 file:
awk '{print ($1==($1+0) ? $1+$2 : $1)}' file
$1 == $1+0 will only be true if $1 is a number.
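If a row should become NA whenever either column is not a number, both fields can be checked explicitly. The following is a small sketch under that assumption; the names sum_or_na and isnum are hypothetical, and the regex only accepts plain decimal numbers (no exponents):

```shell
# sum_or_na: hypothetical helper; adds columns 1 and 2 when both look like
# plain decimal numbers, otherwise prints NA (reads stdin or files).
sum_or_na() {
  awk '
    function isnum(s) { return s ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)$/ }
    { print (isnum($1) && isnum($2) ? $1 + $2 : "NA") }
  ' "$@"
}

printf '1 1\n1.76 0.92\nNA NA\n1 NA\n' | sum_or_na
```

A header line would also be turned into NA by this sketch, so it may need to be skipped or printed separately.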
Just remove the lines with NA & then add them
awk '$1 != "NA"' FS=' ' file | awk '{ print $1+$2 }'

Awk: check if field value is in one of the given

I need to print all lines in which field $2 is one of the following (23, 17, 21, 1)
awk -F $'\t' 'BEGIN { arr = (23, 17, 21, 1) } {if ($2 in arr) {print $0}}' file.txt
doesn't work
This should do:
awk '$2~/^(23|17|21|1)$/' file
This will test if field #2 is one of 23,17,21 or 1
Just an example of how to do it with an array:
awk 'BEGIN{split("23 17 21 1",tmp); for (i in tmp) arr[tmp[i]]} $2 in arr' file
Put the numbers to use in a string
Split it with split into array tmp
Loop through all values in tmp once in BEGIN to create arr with those values as its indices, then for every line print it if $2 is found in arr.
EDIT: Updated with Ed's suggestions.
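The same array idea can take the list of values from the command line instead of hard-coding it. This is a sketch only: the function name filter_col2 and the variable names are made up, the list is assumed to be comma-separated, and the default whitespace field splitting is used (add -F '\t' for tab-separated input as in the question):

```shell
# filter_col2 LIST [FILE...]: hypothetical helper; prints the lines whose
# second field is one of the comma-separated values in LIST.
filter_col2() {
  allowed_list=$1
  shift
  awk -v list="$allowed_list" '
    BEGIN {
      n = split(list, tmp, ",")       # tmp[1..n] holds the allowed values
      for (i = 1; i <= n; i++)
        allowed[tmp[i]]               # referencing the element creates the key
    }
    $2 in allowed                     # no action given, so matching lines are printed
  ' "$@"
}

printf 'a 23\nb 5\nc 1\nd 11\n' | filter_col2 23,17,21,1
```

The lookup compares whole values, so 11 does not match just because it contains 1.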