How to awk pattern over two consecutive lines? - awk

I am trying to do something which I guess could be done very easily, but I can't seem to find the answer. I want to use awk to pick out lines between two patterns, but I also want the pattern to match two consecutive lines. I have tried to find the solution on the Internet, but perhaps I did not search for the right keywords. An example will describe this better.
Suppose I have the following file called test:
aaaa
bbbb
SOME CONTENT 1
ddddd
fffff
aaaa
cccc
SOME CONTENT 2
ccccc
fffff
For example, let's say I would like to find "SOME CONTENT 1".
Then I would use awk like this:
cat test | awk '/aaa*/ { show=1} show; /fff*/ {show=0}'
But that is not what I want. I want somehow to enter the pattern:
aaaa*\nbbbb*
And the same for the end pattern. Any suggestions how to do this?

You can use this:
awk '/aaa*/ {f=1} /bbb*/ && f {show=1} show; /fff*/ {show=f=0}' file
bbbb
SOME CONTENT 1
ddddd
fffff
If the line matches pattern1, aaa*, then set flag f
If the line matches pattern2, bbb*, and flag f is true, then set the show flag
If you also need to print pattern1, the aaa* line:
awk '/aaa*/ {f=$0} /bbb*/ && f {show=1;$0=f RS $0} show; /fff*/ {show=f=0}' file
aaaa
bbbb
SOME CONTENT 1
ddddd
fffff
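For comparison, a minimal sketch (assuming the same sample file test, and that the block starts where a line beginning with aaaa is immediately followed by one beginning with bbbb) that remembers the previous line explicitly:
awk 'prev ~ /^aaaa/ && /^bbbb/ {print prev; show=1} show; /^fffff/ {show=0} {prev=$0}' test
Here prev always holds the previous line, so the block is only opened when the two marker lines are truly consecutive; the output is the same as above.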

If every record ends with fffff, and GNU awk is available, you could do something like this:
$ awk '/aaa*\nbbbb*/' RS='fffff' file
aaaa
bbbb
SOME CONTENT 1
ddddd
Or if you want just SOME CONTENT 1 to be visible, you can do:
$ awk -F $'\n' '/aaa*\nbbbb*/{print $4}' RS='fffff' file
SOME CONTENT 1

I searched for the two patterns and checked that they were consecutive using line numbers; having the line numbers also lets sed insert a line between them, i.e. right after the first line/pattern.
awk '$0 ~ "Encryption" {print NR} $0 ~ "Bit Rates:1" {print NR}' /tmp/mainscan | while read line1; do read line2; echo "$(($line2 - 1)) $line1"; done > /tmp/this
while read line
do
    pato=$(echo $line | cut -f1 -d' ')    # line number of the second match, minus one
    patt=$(echo $line | cut -f2 -d' ')    # line number of the first match
    if [[ "$pato" = "$patt" ]]; then      # equal means the two matches were on consecutive lines
        inspat=$((patt + 1))
        sed -i "${inspat}iESSID:##" /tmp/mainscan   # insert a placeholder line after the first match
        sed -i 's/##/""/g' /tmp/mainscan            # turn the placeholder into empty quotes
    fi
done < /tmp/this

Related

Counting the number of unique values based on two columns in bash

I have a tab-separated file looking like this:
A 1234
A 123245
A 4546
A 1234
B 24234
B 4545
C 1234
C 1234
Output:
A 3
B 2
C 1
Basically I need counts of unique values that belong to the first column, all in one command with pipelines. As you may see, there can be some duplicates like "A 1234". I had some ideas with awk or cut, but neither of them seems to work. They just print out all unique pairs, while I need the count of unique values from the second column per value in the first one.
awk -F " "'{print $1}' file.tsv | uniq -c
cut -d' ' -f1,2 file.tsv | sort | uniq -ci
I'd really appreciate your help! Thank you in advance.
A complete awk solution; could you please try the following.
awk 'BEGIN{FS=OFS="\t"} !found[$0]++{val[$1]++} END{for(i in val){print i,val[i]}}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t" ##Setting field separator and output field separator as tab here.
}
!found[$0]++{ ##Checking condition if the whole line (1st and 2nd columns) is NOT already present in found array, then do following.
val[$1]++ ##Creating val with 1st column index and keep increasing its value here.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
Using GNU awk:
$ gawk -F\\t '{a[$1][$2]}END{for(i in a)print i,length(a[i])}' file
Output:
A 3
B 2
C 1
Explained:
$ gawk -F\\t '{ # using GNU awk and tab as delimiter
a[$1][$2] # hash to 2D array
}
END {
for(i in a) # for all values in first field
print i,length(a[i]) # output value and the size of related array
}' file
$ sort -u file | cut -f1 | uniq -c
3 A
2 B
1 C
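If you want the key first, as in the requested output, you could swap the count and the value afterwards, for example:
$ sort -u file | cut -f1 | uniq -c | awk '{print $2, $1}'
A 3
B 2
C 1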
Another way, using the handy GNU datamash utility:
$ datamash -g1 countunique 2 < input.txt
A 3
B 2
C 1
Requires the input file to be sorted on the first column, like your sample. If the real file isn't, add -s to the options.
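For example, if the real file weren't sorted on the first column, the same call would become:
$ datamash -s -g1 countunique 2 < input.txt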
You could try this:
cat file.tsv | sort | uniq | awk '{print $1}' | uniq -c | awk '{print $2 " " $1}'
It works for your example. (But I'm not sure if it works for other cases. Let me know if it doesn't work!)

Count b or B in even lines

I need to count the number of times the letter 'b' or 'B' appears in the even lines of file.txt, e.g. for a file.txt like:
everyB or gbnBra
uitiakB and kanapB bodddB
Kanbalis astroBominus
I got the first part, but I need to count these b or B letters and I do not know how to count them together.
awk '!(NR%2)' file.txt
$ awk '!(NR%2){print gsub(/[bB]/,"")}' file
4
Could you please try the following, one more approach with awk. (Written on mobile, will test it in a few minutes, but it should work.)
awk -F'[bB]' 'NR%2 == 0{print (NF ? NF - 1 : 0)}' Input_file
Thanks to Ed sir for solving the zero-matches-found line problem in the comments.
In a single awk:
awk '!(NR%2){gsub(/[^Bb]/,"");print length}' file.txt
gsub(/[^Bb]/,"") deletes every character in the line the line except for B and b.
print length prints the length of the resulting string.
awk '!(NR%2)' file.txt | tr -cd 'Bb' | wc -c
Explanation:
awk '!(NR%2)' file.txt : keep only even lines from file.txt
tr -cd 'Bb' : keep only B and b characters
wc -c : count characters
Example:
With the file below, the result is 4.
everyB or gbnBra
uitiakB and kanapB bodddB
Kanbalis astroBominus
Here is another way
$ sed -n '2~2s/[^bB]//gp' file | tr -d '\n' | wc -c
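Note that the 2~2 address (every second line, starting at line 2) is a GNU sed extension. As a rough alternative, grep could do the isolating and counting instead (a sketch against the same sample file):
$ awk '!(NR%2)' file | grep -o '[bB]' | wc -l
4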

Sort a file preserving the header as first position with bash

When sorting a file, I am not preserving the header in its position:
file_1.tsv
Gene Number
a 3
u 7
b 9
sort -k1,1 file_1.tsv
Result:
a 3
b 9
Gene Number
u 7
So I am trying this code:
sed '1d' file_1.tsv | sort -k1,1 > file_1_sorted.tsv
first='head -1 file_1.tsv'
sed '1 "$first"' file_1_sorted.tsv
What I did is remove the header and sort the rest of the file, and then try to add the header back. But I am not able to perform this last part, so I would like to know how I can copy the header of the original file and insert it as the first row of the new file without overwriting its actual first row.
You can do this as well:
{ head -1; sort; } < file_1.tsv
** Update **
For macOS:
{ IFS= read -r header; printf '%s\n' "$header" ; sort; } < file_1.tsv
A simpler awk:
$ awk 'NR==1{print; next} {print | "sort"}' file
$ head -1 file; tail -n +2 file | sort
Output:
Gene Number
a 3
b 9
u 7
Could you please try the following.
awk '
FNR==1{
first=$0
next
}
{
val=(val?val ORS:"")$0
}
END{
print first
print val | "sort"
}
' Input_file
Logical explanation:
Check the condition FNR==1 to see if it is the first line; if so, save its value to a variable and move on to the next line with next.
Then keep appending each line's value to another variable, separated by newlines, until the last line.
Finally, the END block of this code executes once the Input_file has been read completely; there, print the first line's value and pipe the rest of the lines to the sort command.
This will work using any awk, sort, and cut in any shell on every UNIX box and will work whether the input is coming from a pipe (when you can't read it twice) or from a file (when you can) and doesn't involve awk spawning a subshell:
awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
The above uses awk to stick a 0 at the front of the header line and a 1 in front of the rest, so you can sort by that number first, then by whatever other field(s) you want to sort on, and then remove the added field again with cut. Here it is in stages:
$ awk -v OFS='\t' '{print (NR>1), $0}' file
0 Gene Number
1 a 3
1 u 7
1 b 9
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2
0 Gene Number
1 a 3
1 b 9
1 u 7
$ awk -v OFS='\t' '{print (NR>1), $0}' file | sort -k1,1n -k2,2 | cut -f2-
Gene Number
a 3
b 9
u 7

Only output line if value in specific column is unique

Input:
line1 a gh
line2 a dd
line3 c dd
line4 a gg
line5 b ef
Desired output:
line3 c dd
line5 b ef
That is, I want to output a line only in the case that no other line has the same value in column 2. I thought I could do this with a combination of sort (e.g. sort -k2,2 input) and uniq, but it appears that with uniq I can only skip fields from the left (-f avoids comparing the first N fields). Surely there's some straightforward way to do this with awk or something.
You can do this as a two-pass awk script:
awk 'NR==FNR{a[$2]++;next} a[$2]<2' file file
This runs through the file once incrementing a counter in an array whose key is the second field of each line, then runs through a second time printing only those lines whose counter is less than 2.
You'd need multiple reads of the file because at any point during the first read, you can't possibly know whether there will be another instance of the second field of that line later in the file.
Here is a one pass awk solution:
awk '{a1[$2]++;a2[$2]=$0} END{for (a in a1) if (a1[a]==1) print a2[a]}' file
The original order of the file will be lost however.
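If the original order matters, a sketch of a one-pass variant that buffers the lines in memory (assuming the file fits) could look like this:
awk '{cnt[$2]++; line[NR]=$0; key[NR]=$2} END{for(i=1;i<=NR;i++) if(cnt[key[i]]==1) print line[i]}' file
It still reads the input only once, and the END loop replays the lines in their original order, printing only those whose second field was seen exactly once.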
You can combine awk, grep, sort and uniq for a quick one-liner:
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d) " input.txt
Edit, to avoid the regexes, \+ and \backreferences:
grep -v "^[^ ]* $(awk '{print $2}' input.txt | sort | uniq -d | sed 's/[^+0-9]/\\&/g') " input.txt
An alternative to awk, to demonstrate that it can still be done with sort and uniq (there is option -u for this); however, setting up the right format requires some juggling (the decorate/do stuff/undecorate pattern).
$ paste file <(cut -d' ' -f2 file) | sort -k2 | uniq -uf3 | cut -f1
line5 b ef
line3 c dd
As a side effect you lose the original order of the lines, which can be recovered as well if you add line numbers...
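For example, a decorated version that carries the line number along and restores the original order at the end might look like this (an untested sketch along the same lines):
$ cat -n file | paste - <(cut -d' ' -f2 file) | sort -k5 | uniq -uf4 | sort -k1n | cut -f2
line3 c dd
line5 b ef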

Delete multiple strings/characters in a file

I have curl output similar to the example below; I'm working on a sed/awk script to eliminate the unwanted strings.
File
{id":"54bef907-d17e-4633-88be-49fa738b092d","name":"AA","description","name":"AAxxxxxx","enabled":true}
{id":"20000000000000000000000000000000","name":"BB","description","name":"BBxxxxxx","enabled":true}
{id":"542ndf07-d19e-2233-87gf-49fa738b092d","name":"AA","description","name":"CCxxxxxx","enabled":true}
{id":"20000000000000000000000000000000","name":"BB","description","name":"DDxxxxxx","enabled":true}
......
I like to modify this file and retain similar below,
AA AAxxxxxx
BB BBxxxxxx
AA CCxxxxxx
BB DDxxxxxx
AA n.....
BB n.....
Is there a way I could remove the words/commas/semicolons in between so I can retain only these values?
Try this awk
curl your_command | awk -F\" '{print $(NF-9),$(NF-3)}'
Or:
curl your_command | awk -F\" '{print $7,$13}'
A semantic approach using perl:
curl your_command | perl -lane '/"name":"(\w+)".*"name":"(\w+)"/;print $1." ".$2'
For any number of name occurrences:
curl your_command | perl -lane 'printf $_." " for ( $_ =~ /"name":"(\w+)"/g);print ""'
This might work for you (GNU sed):
sed -r 's/.*("name":")([^"]*)".*\1([^"]*)".*/\2 \3/p;d' file
This extracts the fields following the two name keys and prints them if successful.
Alternatively, using simple pattern matching:
sed -r 's/.*:.*:"([^"]*)".*:"([^"]*)".*:.*/\1 \2/p;d' file
In this particular case, you could do
awk -F ":|," '{print $4,$7}' file2 |tr -d '"'
and get
AA AAxxxxxx
BB BBxxxxxx
AA CCxxxxxx
BB DDxxxxxx
Here, the field separator is either : or ,; we print the fourth and seventh fields (because all lines have these entries in those two fields), and finally we use tr to delete the " because you don't want to have it.
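If GNU grep with PCRE support (-P) is available, another option along the same lines would be to pull out just the name values and pair them back up, a sketch assuming exactly two name fields per line:
curl your_command | grep -oP '"name":"\K[^"]+' | paste -d' ' - -
grep -oP prints each captured name on its own line, and paste -d' ' - - then joins them back into pairs separated by a space.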