AWK or sed way to paste non-adjacent lines

$ cat file
aaa bbb ccc
ddd eee
jjj kkk lll
mmm
nnn ooo ppp
The following AWK command will paste the 'mmm' line at the end of the 'ddd eee' line. Is there a simpler way to do this using AWK or sed?
$ awk 'FNR==NR {if (NR==4) foo=$0; next} FNR==2 {print $0" "foo; next} FNR==4 {next} 1' file file
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
To clarify: I want to paste line 4 at the end of line 2 in this particular file, with a single space between the 'ddd eee' and the 'mmm'. That's the task. Is there an AWK or sed solution that's simpler than the one I came up with?

This can be done in sed using the hold space:
sed '2{N;h;N;s/\n.*\n/ /;p;g;D;}' file
2{...} Run the enclosed commands on line two.
N;h;N Read next two lines into the pattern space, holding the first two.
s/\n.*\n/ / Substitute a space for the middle line.
p;g;D Print the pasted lines, load the hold space, and delete the
first line (leaving the one that was removed by the previous substitute).
or using captures (\(...\)) & back-references (\1, \2, etc.):
sed '2{N;N;s/\(\n.*\)\n\(.*\)/ \2\1/;}' file
2{...} Run the enclosed commands on line two.
N;N Read next two lines into the pattern space.
s/\(\n.*\)\n\(.*\)/ \2\1/ Swap the third and fourth lines, joining the fourth onto the end of the second.
\(\n.*\) Capture the third line, including the leading newline.
\n\(.*\) Capture the fourth line, excluding the leading newline.
/ \2\1/ Replace the matched portion (the third & fourth lines) with a space, followed by the second capture group (the fourth line), then the first (the third line, with its leading newline).
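Either way, it helps to see exactly what the pattern space contains when the substitution runs; the l command prints it unambiguously, with embedded newlines shown as \n (a small diagnostic aside, not part of the solutions above):
$ sed -n '2{N;N;l}' file
ddd eee\njjj kkk lll\nmmm$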

This meets the letter of the amended problem statement — it prints line 1, appends line 4 after the content of line 2 as line 2, then prints line 3, and then prints line 5 and beyond:
awk 'NR == 1 || NR >= 5 { print; next }
NR == 2 { save2 = $0 }
NR == 3 { save3 = $0 }
NR == 4 { print save2, $0; print save3 }' file
It's simpler than the code in the question in that it only scans the file once.
The output:
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp

Solution in TXR:
$ txr -c '@line1
@line2
@line3
@line4
@(data rest)
@(output)
@line1
@line2 @line4
@line3
@ (repeat)
@ rest
@ (end)
@(end)' file
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp

This is simpler:
$ awk 'FNR==NR {if (NR==4) foo=$0; next} FNR==2{$0=$0" "foo} FNR!=4' file file
aaa bbb ccc
ddd eee mmm
jjj kkk lll
nnn ooo ppp
Other solutions might be faster or use less memory but they won't be simpler.
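If the line numbers ever need to change, the same two-pass idea can be parameterized; a minimal sketch, with src and dst as variable names introduced here (not part of the answers above):
awk -v src=4 -v dst=2 '
FNR==NR  { if (FNR==src) saved=$0; next }   # first pass: remember the source line
FNR==src { next }                           # second pass: drop it from its old position
FNR==dst { $0 = $0 " " saved }              # append it to the destination line
1' file file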

Related

Extract text between patterns in new files

I'm trying to analyze a file with the following structure:
AAAAA
123
456
789
AAAAA
555
777
999
777
The idea is to detect the 'AAAAA' pattern and extract the two following lines. After this is done, I would like to append the next 'AAAAA' pattern and the following two lines, so the final file will look something like this:
AAAAA
123
456
AAAAA
555
777
Taking into account that the last one will not end with the 'AAAAA' pattern.
Any idea how this can be done? I've used sed, but I don't know how to select the number of lines to be retained after the pattern...
For example, with AWK:
awk '/'$AAAAA'/,/'$AAAAA'/' INPUTFILE.txt
But this will only extract all the text between the two AAAAA patterns.
Thanks
With sed
sed -n '/AAAAA/{N;N;p}' file.txt
with smart counters
$ awk '/AAAAA/{n=3} n&&n--' file
AAAAA
123
456
AAAAA
555
777
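If the number of lines to keep after each match varies, the count can be taken from a variable; a small sketch, with n as a name introduced for the count:
awk -v n=2 '/AAAAA/{c=n+1} c&&c--' file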
The grep command has a flag that prints lines after each match. For example:
grep -A 2 AAAAA <file>
Unless I misunderstood, this should match your requirements, and is much simpler than awk scripts.
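Note that with GNU grep, a -- group separator is printed between non-adjacent groups of context lines; if your grep supports it, that can be suppressed:
grep -A 2 --no-group-separator AAAAA file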
You may try this awk:
awk '$1 == "AAAAA" {n = NR+2} NR <= n' file
AAAAA
123
456
AAAAA
555
777
just cheat
mawk/mawk2/gawk 'BEGIN { FS = OFS = "AAAAA\n"; RS = "^$";
} END { for(x=2; x<= NF; x++) { print $(x) } }'
No one says fields must be split by spaces, or that records must be single lines. With FS set like this, every field after $1 contains one block of the matches you need, with several "rows" packed into $2, $3 and so on.
In this example, $2 will contain 12 bytes, like this:
1 2 3 \n 4 5 6 \n 7 8 9 \n # spaced out for readability
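To actually reproduce the requested output (the AAAAA separator plus only the first two lines of each block), a hedged gawk sketch building on the same slurp-and-split idea might look like this:
gawk 'BEGIN { RS = "^$"; FS = "AAAAA\n" }               # slurp the whole file; split it on the marker
      { for (i=2; i<=NF; i++) {                         # $1 is whatever precedes the first marker
            n = split($i, lines, "\n")                  # break the block back into lines
            print "AAAAA"                               # re-emit the separator
            for (j=1; j<=2 && j<=n; j++) print lines[j] # keep only the first two lines
        } }' file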

How to print lines before and after a match until a specific match (3 matching patterns)

I have lengthy data, which is built in blocks.
In the example below, the blocks start with (AAA) and end with (FFF); between them they can have many lines of information.
I want to extract specific blocks, only if the pattern (CCC) is inside these blocks.
An example would be:
cat text
AAA1
BBB
FFF1
AAA2
BBB
CCC2
DDD
EEE
FFF2
AAA3
BBB
FFF3
AAA4
BBB
CCC4
DDD
EEE
FFF4
The output should be:
AAA2
BBB
CCC2
DDD
EEE
FFF2
AAA4
BBB
CCC4
DDD
EEE
FFF4
I thought of using sed, but it's not really working:
If I use this, it only gives me from CCC to the next AAA/FFF: sed -n -e '/CCC/,/AAA/ p' text
CCC2
DDD
EEE
FFF2
AAA3
CCC4
DDD
EEE
FFF4
if I use it this way: sed -n -e '/AAA/,/FFF/ p' text, I will also capture blocks between AAA and FFF that do not have CCC in them.
This might work for you (GNU sed):
sed -n '/AAA/{:a;N;/FFF/!ba;/CCC/p}' file
Turn off implicit printing -n because this is a filtering operation.
Match a line containing AAA and append further lines until one containing FFF.
If the collection contains the string CCC, print it.
Repeat.
N.B. This assumes AAA and FFF are paired; if not, use:
sed -n '/AAA/{:a;N;/\n.*AAA/s/.*\n//;/FFF/!ba;/CCC/p}' file
Alternative:
sed -n 'H;/AAA/h;/FFF/{g;/AAA.*CCC/p;z;h}' file
EDIT:
For AAA, CCC and FFF at the beginning of a line, use:
sed -n '/^AAA/{:a;N;/^FFF/M!ba;/^CCC/Mp}' file
or
sed -n '/^AAA/{:a;N;/\nAAA/s/.*\n//;/\nFFF/!ba;/\nCCC/p}' file
or
sed -n 'H;/^AAA/h;/^FFF/{g;/AAA.*\nCCC/p;z;h}' file
Using any awk in any shell on every Unix box:
$ awk '/^AAA/{a=1; buf=""} /^CCC/{c=1} {buf=buf $0 ORS} /^FFF/{if (a && c) printf "%s", buf; a=c=0}' text
AAA2
BBB
CCC2
DDD
EEE
FFF2
AAA4
BBB
CCC4
DDD
EEE
FFF4
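The same program, spread out with comments (no change in behaviour):
awk '
/^AAA/ { a=1; buf="" }                  # block start: set the flag, reset the buffer
/^CCC/ { c=1 }                          # remember that this block contains CCC
       { buf = buf $0 ORS }             # accumulate every line of the current block
/^FFF/ { if (a && c) printf "%s", buf   # block end: print only if both flags are set
         a = c = 0 }                    # reset for the next block
' text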
You can consider your input as data-blocks, with AAA.* as the start-tag and FFF.* as your end-tag. Now collect each block into hold-space and at the end-tag, check if the block contains the desired pattern.
For example, here is a GNU sed version that does this:
parse.sed
# Start-tag -> start a new block in hold-space
/^AAA/ { h; b; }
# Save input
H
# End-tag AND block contains CCC -> print
/^FFF/ { x; /\nCCC/ p; }
Run it like this, e.g.:
sed -nf parse.sed file | sed '/^FFF/G'
Or as a one-liner:
sed -n '/^AAA/{h;b};H;/^FFF/{x;/\nCCC/p}' file | sed '/^FFF/G'
Output:
AAA2
BBB
CCC2
DDD
EEE
FFF2
AAA4
BBB
CCC4
DDD
EEE
FFF4
A more portable sed script looks like this:
# Start-tag -> start a new block in hold-space
/^AAA/ {
h
b
}
# Save input
H
# End-tag AND block contains CCC -> print
/^FFF/ {
x
/\nCCC/p
}
An awk variant
awk '/^AAA/{f=1} f{i=i $0 ORS} /^FFF/{if(i~/\nCCC/){printf "%s", i} i=f=""}' input

Keep lines only if a column is repeated three times within the file [closed]

I have a file of columns, and I want to keep the lines whose second-column value is repeated exactly three times within the file.
Input:
000 BBB PPP DDD
111 BBB SSS 444
777 CCC RRR 555
222 BBB 555 666
321 AAA YYY MMM
123 CCC LLL MMM
OOO AAA BBB VVV
545 UUU 321 R32
PPP AAA HHH TTT
Desired output
000 BBB PPP DDD
111 BBB SSS 444
222 BBB 555 666
321 AAA YYY MMM
OOO AAA BBB VVV
PPP AAA HHH TTT
I have searched on the internet but found nothing similar. Any help is welcome. Thanks.
1st solution: Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
{
cntIndArray[$2]++
valArray[$2]=(valArray[$2]?valArray[$2] ORS:"")$0
}
END{
for(i in cntIndArray){
if(cntIndArray[i]==3){
print valArray[i]
}
}
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
cntIndArray[$2]++ ##Creating array which keeps track of 2nd field occurrences in lines.
valArray[$2]=(valArray[$2]?valArray[$2] ORS:"")$0 ##Creating array which keeps appending lines that have the same 2nd field, concatenated with a newline.
}
END{ ##Starting END block of this code here.
for(i in cntIndArray){ ##Traversing through array which has field count here.
if(cntIndArray[i]==3){ ##Checking if an element value equals 3 then do following.
print valArray[i] ##Printing array value with index i which has exact line value in it.
}
}
}' Input_file ##Mentioning Input_file name here.
2nd solution: In case you need the output in the same sequence in which the 2nd field occurs in Input_file, then try the following.
awk '
!seen[$2]++{
cntIndArray[++count]=$2
}
{
cntArray[$2]++;
valArray[$2]=(valArray[$2]?valArray[$2] ORS:"")$0
}
END{
for(i=1;i<=count;i++){
if(cntArray[cntIndArray[i]]==3){
print valArray[cntIndArray[i]]
}
}
}' Input_file
$ awk 'NR==FNR{a[$2]++;next} a[$2]==3' file{,}
000 BBB PPP DDD
111 BBB SSS 444
222 BBB 555 666
321 AAA YYY MMM
OOO AAA BBB VVV
PPP AAA HHH TTT
Note that this is a double-pass approach, so the content needs to be in a file to work well (for small files it doesn't matter; for very large files this will still work, whereas keeping all the data in memory might not). If the data is piped in, this approach will not work.
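For reference, file{,} is just shell brace expansion for file file, so the command above is the same as:
awk 'NR==FNR{a[$2]++;next} a[$2]==3' file file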

Query the contents of a file using another file in AWK

I am trying to conditionally filter a file based on values in a second file. File1 contains numbers and File2 contains two columns of numbers. The task is to keep those rows of file1 whose value falls within a range given by any row of file2.
I have a series of loops which works, but takes >12hrs to run depending on the lengths of both files. This code is noted below. Alternatively, I have tried to use awk, and looked at other questions posted on Stack Overflow, but I cannot figure out how to change the code appropriately.
Loop method:
while IFS= read READ
do
    position=$(echo $READ | awk '{print $4}')
    while IFS= read BED
    do
        St=$(echo $BED | awk '{print $2}')
        En=$(echo $BED | awk '{print $3}')
        if (($position < "$St"))
        then
            break
        else
            if (($position >= "$St" && $position <= "$En"));
            then
                echo "$READ" | awk '{print $0"\t EXON"}' >> outputfile
            fi
        fi
    done < file2
done < file1
Blogs with similar questions:
awk: filter a file with another file
awk 'NR==FNR{a[$1];next} !($2 in a)' d3_tmp FS="[ \t=]" m2p_tmp
Find content of one file from another file in UNIX
awk -v FS="[ =]" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)' File1 File2
file1: (tab delimited)
AAA BBB 1500
CCC DDD 2500
EEE FFF 2000
file2: (tab delimited)
GGG 1250 1750
HHH 1950 2300
III 2600 2700
The expected output would retain rows 1 and 3 from file1 (in a new file, file3), because these records fall within the ranges given by row 1 columns 2 and 3, and row 2 columns 2 and 3, of file2. In the actual files there is no row correspondence, i.e. I do not want to compare row 1 of file1 only against row 1 of file2, but against all rows in file2 to find a hit.
file3 (output)
AAA BBB 1500
EEE FFF 2000
One way:
awk 'NR==FNR{a[i]=$2;b[i++]=$3;next}{for(j=0;j<i;j++){if ($3>=a[j] && $3<=b[j]){print;}}}' i=0 file2 file1
AAA BBB 1500
EEE FFF 2000
Read the file2 contents and store them in arrays a and b. When file1 is read, check whether the number in column 3 falls within any of the ranges held in a and b, and print the line if it does.
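The same approach written out with comments, stopping at the first matching range (lo, hi and n are names introduced here; unlike the one-liner, the break avoids printing a line twice if ranges overlap):
awk '
NR==FNR { lo[++n]=$2; hi[n]=$3; next }            # file2: remember the bounds of each range
{
    for (j=1; j<=n; j++)                          # file1: test column 3 against every range
        if ($3 >= lo[j] && $3 <= hi[j]) { print; break }
}' file2 file1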
One more option:
$ awk 'NR==FNR{for(i=$2;i<=$3;i++)a[i];next}($3 in a)' file2 file1
AAA BBB 1500
EEE FFF 2000
File2 is read and each range is expanded into its individual numbers, which are stored as keys of the associative array a. When we read file1, we just need to look up the array a.
Another awk. It may or may not make sense depending on the filesizes:
$ awk '
NR==FNR {
a[$3]=$2 # hash file2 records, $3 is key, $2 value
next
}
{
for(i in a) # for each record in file1 go through every element in a
if($3<=i && $3>=a[i]) { # if it falls between
print # output
break # exit loop once match found
}
}' file2 file1
Output:
AAA BBB 1500
EEE FFF 2000

Using awk for a table lookup

I would like to use awk to lookup a value from a text file. The text file has a very simple format:
text \t value
text \t value
text \t value
...
I want to pass the actual text for which the value should be looked up via a shell variable, e.g., $1.
Any ideas how I can do this with awk?
Your help is greatly appreciated.
All the best,
Alberto
You can do this in a pure AWK script without a shell wrapper:
#!/usr/bin/awk -f
BEGIN { key = ARGV[1]; ARGV[1]="" }
$1 == key { print $2 }
Call it like this:
./lookup.awk keyval lookupfile
Example:
$ cat lookupfile
aaa 111
bbb 222
ccc 333
ddd 444
zzz 999
mmm 888
$ ./lookup.awk ddd lookupfile
444
$ ./lookup.awk zzz lookupfile
999
This could even be extended to select the desired field using an argument.
#!/usr/bin/awk -f
BEGIN { key = ARGV[1]; field = ARGV[2]; ARGV[1]=ARGV[2]="" }
$1 == key { print $field }
Example:
$ cat lookupfile2
aaa 111 abc
bbb 222 def
ccc 333 ghi
ddd 444 jkl
zzz 999 mno
mmm 888 pqr
$ ./lookupf.awk mmm 1 lookupfile2
mmm
$ ./lookupf.awk mmm 2 lookupfile2
888
$ ./lookupf.awk mmm 3 lookupfile2
pqr
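If the caller also needs to know whether the key was found at all, a small hedged extension of the same idea (the found flag and the exit status are additions, not part of the answer above):
#!/usr/bin/awk -f
BEGIN { key = ARGV[1]; ARGV[1]="" }
$1 == key { print $2; found = 1 }
END { if (!found) exit 1 }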
Something like this would do the job:
#!/bin/sh
awk -v LOOKUPVAL="$1" '$1 == LOOKUPVAL { print $2 }' < inputFile
Essentially you set the lookup value passed into the shell script in $1 to an awk variable, then you can access that within awk itself. To clarify, the first $1 is the shell script argument passed in on the command line, the second $1 (and subsequent $2) are fields 1 and 2 of the input file.
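Since the file is tab-delimited, the key text itself may contain spaces; in that case a variant that splits on tabs only keeps the whole first column intact (an assumption about the data, not something stated in the question):
#!/bin/sh
awk -F'\t' -v LOOKUPVAL="$1" '$1 == LOOKUPVAL { print $2 }' < inputFile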
TEXT=`grep value file | cut -f1`
I think grep might actually be a better fit:
$ echo "key value
ambiguous correct
wrong ambiguous" | grep '^ambiguous ' | awk ' { print $2 } '
The ^ on the pattern is to match to the start of the line and ensure that you don't match a line where the value, rather than the key, was the desired text.