awk: Compare two sets of numbers (generated by random and strict rules) - awk

I have many files containing some fixed words and numbers:
The FIRST SET of numbers has a fixed length of 7 digits: the first 3 of them are a random prefix (100, 200, 300 in the example, but they can be anything); we do not need that part, we are only interested in the remaining 4 digits.
The SECOND SET is a number generated from the last 4 digits of the FIRST SET (xxx7777 = 7777; xxx0066 = 66). As you can see, the SECOND SET cannot have leading zeros; they are already stripped, and this is a rule.
Input
first second third 1007777 fourth 7777
...
first second third 2008341 fourth 8341
...
first second third 3000005 fourth 5
...
...
first second third 2008341 fourth 8
...
first second third 2008341 fourth 341
I found other examples here showing how to find the interesting lines using grep, but I did not find an awk example doing what I want; maybe the leading-zero rule is what is giving me problems.
My attempt to find the wrong generations:
grep -Pr 'first second third' docs/test/*.txt | awk '{ if($4=$6) print $4 " " $6}'
7777 7777
8341 8341
5 5
8 8
341 341
The correct Output should look like this:
2008341 8
2008341 341
..only the problem (wrongly generated) lines, plus the filename.
Thanks ! :)

$ awk '/first second third/ && (substr($4,4)+0 != $NF) {print FILENAME, $4, $NF}' file
file 2008341 8
file 2008341 341
Call it as:
awk '...' docs/test/*.txt
or:
find docs -name '*.txt' -exec awk '...' {} \;
or similar as you see fit.

Use this GNU way, intended to be human-readable and maintainable:
$ awk '
!/first second third/ { next }          # keep only the relevant lines
{ match($4, /[0-9]{4}$/, a)             #1
  a[0] = gensub(/^0+/, "", "g", a[0]) } #2
$NF != a[0]                             #3
' file
Output:
first second third 2008341 fourth 8
first second third 2008341 fourth 341
Explanation:
#1 get the last 4 digits of column 4 and store them in array a with match()
#2 remove all leading zeros
#3 if the extracted part differs from the last column, print the line (awk's default action for a true condition)
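If you prefer to keep the grep front end from the question, the same check can be applied to its output; a minimal sketch, assuming grep prints a file: prefix (it does when given several files, or with -H), which ends up glued to the first field:
grep -r 'first second third' docs/test/*.txt |
awk 'substr($4, 4) + 0 != $NF + 0 {      # last 4 digits vs generated number, leading zeros dropped numerically
    sub(/:.*/, "", $1)                   # keep just the filename from the file:text prefix
    print $1, $4, $NF
}'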

Related

extract specific row with numbers over N

I have a dataframe like this
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Using awk (or other tools), I want to extract only the rows with SRMAPQ over 60.
This means the output is
2 34 MAPQ=60;CT=3to5;SRMAPQ=67
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
update: "SRMAPQ=60" can be anywhere in the line,
MAPQ=44;CT=3to5;SRMAPQ=61;DT=3to5
You don't have to extract the value out of SRMAPQ separately and do the comparison. If the format is fixed like above, just use = as the field separator and access the last field using $NF
awk -F= '$NF > 60' file
Or if SRMAPQ can occur anywhere in the line (as updated in the comments), use a generic approach
awk 'match($0, /SRMAPQ=([0-9]+)/){ l = length("SRMAPQ="); v = substr($0, RSTART+l, RLENGTH-l) } v+0 > 60' file
I would use GNU AWK following way let file.txt content be
1 3 MAPQ=0;CT=3to5;SRMAPQ=60
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
4 56 MAPQ=67;CT=3to5;SRMAPQ=50
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
then
awk 'BEGIN{FS="SRMAPQ="}$2>60' file.txt
output
2 34 MAPQ=60;CT=3to5;SRMAPQ=67;SOMETHING=2
5 7 MAPQ=44;CT=3to5;SRMAPQ=61
Note: I added SOMETHING to test that it works when SRMAPQ is not the last key. Explanation: I set FS to SRMAPQ=, so whatever comes before it becomes the first field ($1) and whatever comes after becomes the second field ($2). In the 2nd line that is 67;SOMETHING=2, which GNU AWK copes with by converting its longest numeric prefix, in this case 67; the other lines have just numbers there. Disclaimer: this solution assumes that the value after SRMAPQ= is followed by ; (or end of line); if this does not hold true, please test my solution fully before usage.
(tested in gawk 4.2.1)
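If you want something that neither depends on where SRMAPQ sits nor on GNU-specific number conversion, here is a minimal portable sketch that walks the ;-separated key=value pairs (assuming that format from the samples):
awk -F';' '{
    for (i = 1; i <= NF; i++) {                       # walk the ;-separated key=value pairs
        split($i, kv, "=")
        if (kv[1] == "SRMAPQ" && kv[2] + 0 > 60) { print; next }
    }
}' file.txt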

split based on the last dot and create a new column with the last part of the string

I have a file with 2 columns. The first column contains strings (IDs) and the second contains values. The strings contain a variable number of dots. I would like to split these strings on the last dot. I found in the forum how to remove the part after the last dot, but I don't want to remove it; I would like to create a new column with the last part of the string, using a bash command (e.g. awk).
Example of strings:
5_8S_A.3-C_1.A 50
6_FS_B.L.3-O_1.A 20
H.YU-201.D 80
UI-LP.56.2011.A 10
Example of output:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
I tried to solve it by using the following command, but it only works if there is just one dot in the string:
awk -F' ' '{{split($1, arr, "."); print arr[1] "\t" arr[2] "\t" $2}}' file.txt
You may use this sed:
sed -E 's/^([[:blank:]]*[^[:blank:]]+)\.([^[:blank:]]+)/\1 \2/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
Details:
^: Start
([[:blank:]]*[^[:blank:]]+): Capture group #1 to match 0 or more whitespace characters followed by 1+ non-whitespace characters.
\.: Match a dot. Because the preceding quantifier is greedy, this ends up matching the last dot.
([^[:blank:]]+): Capture group #2 to match 1+ non-whitespace characters
\1 \2: Replacement to place a space between capture value #1 and capture value #2
Assumptions:
each line consists of two (white) space delimited fields
first field contains at least one period (.)
Sticking with OP's desire (?) to use awk:
awk '
{ n=split($1,arr,".") # split first field on period (".")
pfx=""
for (i=1;i<n;i++) { # print all but the nth array entry
printf "%s%s",pfx,arr[i]
pfx="."}
print "\t" arr[n] "\t" $2} # print last array entry and last field of line
' file.txt
Removing comments and reducing to a one-liner:
awk '{n=split($1,arr,"."); pfx=""; for (i=1;i<n;i++) {printf "%s%s",pfx,arr[i]; pfx="."}; print "\t" arr[n] "\t" $2}' file.txt
This generates:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
With your shown samples, here is one more variant of rev + awk solution.
rev Input_file | awk '{sub(/\./,OFS)} 1' | rev
Explanation: rev prints each line in reverse order (from last character to first); its output is sent as standard input to the awk program, which substitutes the first dot (which is the last dot of the original line, as per the OP's shown samples) with a space and prints every line. That output is then sent through rev again to restore the original character order (undoing the first rev).
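If the double rev feels indirect, the same last-dot split can be done directly with POSIX awk's match(); a sketch based on the shown samples, printing space-separated output:
awk '{
    if (match($1, /\.[^.]*$/))            # locate the last dot in the ID
        print substr($1, 1, RSTART - 1), substr($1, RSTART + 1), $2
    else
        print                             # no dot: pass the line through unchanged
}' file.txt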
$ sed 's/\.\([^.]*$\)/\t\1/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10

How can I remove a string after a specific character ONLY in a column/field in awk or bash?

I have a file with tab-delimited fields (or columns) like this one below:
cat abc_table.txt
a b c
1 11;qqw 213
2 22 222
3 333;rs2 83838
I would like to remove everything after the ";" on only the second field.
I have tried with
awk 'BEGIN{FS=OFS="\t"} NR>=1 && sub (/;[*]/,"",$2){print $0}' abc_table.txt
but it does not seem to work.
I also tried with sed:
sed 's/;.*//g' abc_table.txt
but it also erases the strings in the third field:
a b c
1 11
2 22 222
3 333
The desired output is:
a b c
1 11 213
2 22 222
3 333 83838
If someone could help me, I would be very grateful!
You simply need to correct your regex.
awk '{sub(/;.*/,"",$2)} 1' Input_file
In case you have Input_file TAB delimited then try:
awk 'BEGIN{FS=OFS="\t"} {sub(/;.*/,"",$2)} 1' Input_file
Problem in OP's regex: the regex ;[*] looks for ; followed by a literal * in the 2nd field, which is why it is NOT able to substitute everything after ; in the 2nd field. We simply need ;.*, which grabs everything from the very first occurrence of ; to the end of the 2nd field, and substitutes it with the empty string.
An alternative solution using gnu sed:
sed -E 's/(^[^\t]*\t+[^;]*);[^\t]*/\1/' file
a b c
1 11 213
2 22 222
3 333 83838
This might work for you (GNU sed):
sed 's/[^\t]*/&\n/2;s/;[^\t]*\n//;s/\n//' file
Append a unique marker e.g. newline, to the end of field 2.
Remove everything from the first ; which is not a tab to a newline.
Remove the newline if any.
N.B. This method can be extended for selective or all fields e.g. same removal but for the first and third fields:
sed 's/[^\t]*/&\n/1;s//&\n/3;s/;[^\t]*\n//g;s/\n//g' file
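The same selective-field idea is easy to express in awk too; a sketch assuming a TAB-delimited file, cleaning fields 1 and 3 to mirror the sed example above:
awk 'BEGIN { FS = OFS = "\t" } { sub(/;.*/, "", $1); sub(/;.*/, "", $3) } 1' abc_table.txt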

From linux command line, how can I remove \n from a particular line to merge two lines together? [closed]

Using the command line, how can I transform something like:
1 first line
2 second line
3 third line
4 fourth line
extra bit
5 fifth line
6 sixth line
into, say:
1 first line
2 second line
3 third line
4 fourth line; extra bit
5 fifth line
6 sixth line
The condition on which I would like to merge is: remove the newline before any line which does not start with a number.
I have seen answers to similar questions using the command-line tools awk, sed, and tr.
awk '/^[0-9]/{ printf "%s%s", (NR == 1 ? "" : "\n"), $0; next}
{printf "; %s", $0} END { printf "\n"}' input
I'm not really sure what you want to do when the first line does not begin with a digit, and I'm making the assumption that starting with a digit is the characteristic you are looking for to combine lines. Modify as needed.
With GNU sed:
sed "4{N;s/\n/; /}" file
With GNU awk:
awk -v line=4 'NR==line{x=$0; getline; $0=x "; " $0}1' file
Output:
1 first line
2 second line
3 third line
4 fourth line; extra bit
5 fifth line
6 sixth line
Could you please try the following (written and tested at https://ideone.com/xqk4si):
awk -v line_num="5" '
FNR==(line_num-1){
val=$0
next
}
val{
$0=val";"$0
val=""
}
1
' Input_file
Explanation: the awk variable line_num holds the line number that the OP wants to merge with its previous line. In the main program, the first condition checks whether the current line is the one just before that line number; if so, it is saved in the variable val and next skips all further statements. On the following line, if val is set, the saved previous line, a semicolon and the current line are joined into $0 and val is cleared. The final 1 is the usual awk way to print the (possibly modified) current line.
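As a quick check, the same script condensed to one line and run against the sample input (the join here uses "; " so the result matches the desired output exactly; the original joins with a bare ";"):
$ awk -v line_num="5" 'FNR==(line_num-1){val=$0; next} val{$0=val"; "$0; val=""} 1' input
1 first line
2 second line
3 third line
4 fourth line; extra bit
5 fifth line
6 sixth line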
On second thought, it might be better to merge all lines that do not start with a number, rather than specifying by number each line to be merged.
Easy to do with ed:
printf "%s\n" '2,$g/^[^0-9]/-1s/$/; /\' '.,+1j' w | ed -s input.txt
Translated from ed's rather cryptic commands: for each line that does not start with a digit (skipping the first line, which has no previous line to merge with), append "; " to the end of the previous line, and then join those two lines. Finally, save the changed file.
Example:
$ cat input.txt
1 first line
2 second line
extra stuff
3 third line
4 fourth line
extra bit
5 fifth line
6 sixth line
$ printf "%s\n" '2,$g/^[^0-9]/-1s/$/; /\' '.,+1j' w | ed -s input.txt
$ cat input.txt
1 first line
2 second line; extra stuff
3 third line
4 fourth line; extra bit
5 fifth line
6 sixth line
With GNU sed, to join any number of lines not starting with a digit:
sed -E ':a;N;s/\n([^0-9])/; \1/;ta;P;D;' file
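The same general rule can also be written in plain awk; a sketch that buffers each numbered line and appends any continuation lines to it:
awk '/^[0-9]/ { if (buf != "") print buf; buf = $0; next }   # a numbered line starts a new buffer
     { buf = buf "; " $0 }                                   # a continuation line is appended to it
     END { if (buf != "") print buf }' file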

awk - skip last line for condition

When I wrote an answer for this question I used the following:
something | sed '$d' | awk '$1>3{print $0}'
e.g.
print only lines where the 1st field is bigger than 3 (awk)
but omit the last line (sed '$d').
This seems to me a bit of duplicated work; surely it is possible to do the above with awk alone, without the sed?
I'm an awkdiot - so, can someone suggest a solution?
Here's one way you could do it:
$ printf "%s\n" {1..10} | awk 'NR>1&&p>3{print p}{p=$1}'
4
5
6
7
8
9
Basically, print the first field of the previous line, rather than the current one.
As Wintermute has rightly pointed out in the comments (thanks), in order to print the whole line, you can modify the code to this:
awk 'p { print p; p="" } $1 > 3 { p = $0 }'
This only assigns the contents of the line to p if the first field is greater than 3.
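A quick check with the same synthetic input (the last qualifying line, 10, is again held back and never printed):
$ printf "%s\n" {1..10} | awk 'p { print p; p="" } $1 > 3 { p = $0 }'
4
5
6
7
8
9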