How to replace content matching a specific pattern using sed without losing info?

I have a text file with a bunch of data. I was able to extract exactly what I want using sed, but
I need to replace only the specific pattern I searched for without losing the other content of the file.
I'm using the following sed command, but I don't know how to do the replacement.
cat file.txt | sed -rn '/([a-z0-9]{2}\s){6}/p' > output.txt
The sed searches for the following pattern: ## ## ## ## ## ##, but I want to replace that pattern like this: ######-######.
Output:
1 | ec eb b8 7b e3 c0 47
9 | 90 20 c2 f6 3d c0 1/1/1
25 | 00 fd 45 3d a7 80 31
Desired Output:
1 | ecebb8-7be3c0 47
9 | 9020c2-f63dc0 1/1/1
25 | 00fd45-3da780 31
Thanks

With your shown samples please try following awk program.
awk '
BEGIN{ FS=OFS="|" }
{
$2=substr($2,1,3) substr($2,5,2) substr($2,8,2)"-" substr($2,11,2) substr($2,14,2) substr($2,17,2) substr($2,19)
}
1
' Input_file
Explanation: set FS and OFS to | so that awk splits each line on the pipe. Then rebuild the 2nd field with substr(string, start, length), which returns length characters of string beginning at position start (or everything from start onward when length is omitted), keeping only the character positions needed for the shown samples. Finally, the bare 1 prints the modified line.
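The fixed character positions above depend on every pair sitting at exactly the same offset. As a hedged alternative sketch (not the answerer's method), the 2nd field can instead be split on whitespace and rebuilt, so only the order of the tokens matters. The function name join_field2 is made up for illustration, and the n >= 6 guard is an assumption about which lines should be rewritten.

```shell
# join_field2 (hypothetical name): reads "id | p1 p2 p3 p4 p5 p6 rest..." lines on stdin.
join_field2() {
  awk '
  BEGIN { FS = OFS = "|" }
  {
    n = split($2, a, " ")            # whitespace split; leading blanks are ignored
    if (n >= 6) {                    # assumption: only touch lines with at least six tokens
      s = " " a[1] a[2] a[3] "-" a[4] a[5] a[6]
      for (i = 7; i <= n; i++)       # keep any trailing tokens as they were
        s = s " " a[i]
      $2 = s
    }
  }
  1'
}

printf '%s\n' '1 | ec eb b8 7b e3 c0 47' '9 | 90 20 c2 f6 3d c0 1/1/1' | join_field2
```

Lines without a | or with fewer than six tokens in the 2nd field pass through unchanged.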

With awk:
awk '{$3=$3$4$5"-"$6$7$8; print $1"\t",$2,$3,$NF}' file
1 | ecebb8-7be3c0 47
9 | 9020c2-f63dc0 1/1/1
25 | 00fd45-3da780 31

This might work for you (GNU sed):
sed -E 's/ (\S\S) (\S\S) (\S\S)/ \1\2\3/;s//-\1\2\3/' file
Match three space-separated pairs twice. The first substitution removes the spaces between the pairs; the second reuses the same regex (the empty pattern //) on the next three pairs and additionally replaces their leading space with -.

If you want to extract specific substrings, you'll need to write a more specific regex with capture groups to pull out exactly those, and use it in an s (substitute) command. Dropping -n keeps the non-matching lines in the output.
sed -r 's/([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})\s([a-z0-9]{2})/\1\2\3-\4\5\6/' file.txt > output.txt
Notice also how easy it is to avoid the useless cat.
\s is generally not portable; the POSIX equivalent is [[:space:]].
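As a rough sketch of the portability point, the same substitution can be written with POSIX character classes and -E, which both GNU and BSD sed accept. The helper name join_pairs and the shell variables p and s are assumptions introduced only to keep the regex readable.

```shell
# join_pairs (hypothetical name): reads lines on stdin and joins the first run of
# six space-separated two-character pairs into ######-######.
join_pairs() {
  p='([[:alnum:]]{2})'   # one captured pair
  s='[[:space:]]'        # POSIX spelling of \s
  sed -E "s/$p$s$p$s$p$s$p$s$p$s$p/\1\2\3-\4\5\6/"
}

printf '%s\n' '1 | ec eb b8 7b e3 c0 47' \
              '9 | 90 20 c2 f6 3d c0 1/1/1' \
              '25 | 00 fd 45 3d a7 80 31' | join_pairs
```

Because the trailing whitespace is not part of the match, whatever follows the sixth pair is left untouched.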

Related

How do I print every nth entry of the mth column, starting from a particular line of a file?

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You might use awk, checking that the row number is greater than 3 and that it is even with NR%2==0.
Note that you don't have to use cat:
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors
generated either.
You do not need cat when using GNU sed, as it can read the file on its own; in this case it would be sed 's/\|/ /' file.txt.
You should consider whether you need that part at all: your sample input does not contain a pipe character, so it would do nothing to it. You might also drop that part if the lines holding the values you want to print do not contain that character.
Output is empty because NR%2==4 never holds: the remainder of division by x is always smaller than x (in the particular case of %2, only two values are possible: 0 and 1).
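To generalise the modulo reasoning above to "every nth entry of the mth column, starting from line s", one possible sketch passes the three numbers in as awk variables. The function name every_nth and the variable names start, n and col are hypothetical; n is assumed to be at least 1.

```shell
# every_nth (hypothetical name): every_nth START N COL < file
# Prints column COL of line START and of every Nth line after it.
every_nth() {
  awk -v start="$1" -v n="$2" -v col="$3" \
    'NR >= start && (NR - start) % n == 0 { print $col }'
}

printf '%s\n' '$' '$' '$' 'FORCE 10 30 40' '* 1 5 4' 'FORCE 11 20 22' \
              '* 2 3 0' 'FORCE 19 25 10' '* 16 12 8' | every_nth 4 2 3
```

Measuring the offset from the start line, (NR - start) % n, avoids having to work out whether the wanted rows fall on even or odd line numbers.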
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option and reduce back slashes in regexps by turning on -E.
From the fourth line and then every second line thereafter, capture the third column and print it.
N.B. Because the enclosing group is repeated by {3}, \2 holds the text captured in its last repetition, i.e. the third column.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file
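The N.B. about \2 keeping only the last repetition can be checked in isolation. This is a small sketch using POSIX classes instead of \S and \s; the name third_word is hypothetical, and requiring a separator (or end of line) after each word is my own variation so that words cannot be split mid-way.

```shell
# third_word (hypothetical name): replace each line that has at least three
# whitespace-separated words by its third word; other lines pass through unchanged.
third_word() {
  sed -E 's/^[[:space:]]*(([^[:space:]]+)([[:space:]]+|$)){3}.*/\2/'
}

printf '%s\n' 'FORCE 10 30 40' 'a b c' 'x y' | third_word
```

The outer group runs three times, and each run overwrites what \2 captured, so after the third run \2 is the third word.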

How can I remove a string after a specific character ONLY in a column/field in awk or bash?

I have a file with tab-delimited fields (or columns) like this one below:
cat abc_table.txt
a b c
1 11;qqw 213
2 22 222
3 333;rs2 83838
I would like to remove everything after the ";" on only the second field.
I have tried with
awk 'BEGIN{FS=OFS="\t"} NR>=1 && sub (/;[*]/,"",$2){print $0}' abc_table.txt
but it does not seem to work.
I also tried with sed:
sed 's/;.*//g' abc_table.txt
but it erases also the strings in the third field:
a b c
1 11
2 22 222
3 333
The desired output is:
a b c
1 11 213
2 22 222
3 333 83838
If someone could help me, I would be very grateful!
You need to simply correct your regex.
awk '{sub(/;.*/,"",$2)} 1' Input_file
In case you have Input_file TAB delimited then try:
awk 'BEGIN{FS=OFS="\t"} {sub(/;.*/,"",$2)} 1' Input_file
Problem in OP's regex: ;[*] looks for a ; followed by a literal * in the 2nd field, which is why it cannot remove everything after the ;. Use ;.* instead, which matches from the first ; to the end of the 2nd field, and replace that match with an empty string.
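The difference between the two patterns is easy to see side by side. The following is only an illustrative sketch: compare_patterns is a made-up name, and the second input row (with a literal * after the ;) is invented test data, not part of the question.

```shell
# compare_patterns (hypothetical name): for tab-separated input, print field 2
# as-is, after sub(/;[*]/, ""), and after sub(/;.*/, ""), separated by tabs.
compare_patterns() {
  awk 'BEGIN { FS = OFS = "\t" }
  {
    literal = $2; sub(/;[*]/, "", literal)   # removes only a ";" followed by a literal "*"
    greedy  = $2; sub(/;.*/, "", greedy)     # removes ";" and everything after it in the field
    print $2, literal, greedy
  }'
}

printf '1\t11;qqw\t213\n2\t22;*z\t222\n' | compare_patterns
```

For 11;qqw the bracketed pattern changes nothing, while ;.* leaves 11; only the invented 22;*z value is affected by ;[*].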
An alternative solution using gnu sed:
sed -E 's/(^[^\t]*\t+[^;]*);[^\t]*/\1/' file
a b c
1 11 213
2 22 222
3 333 83838
This might work for you (GNU sed):
sed 's/[^\t]*/&\n/2;s/;[^\t]*\n//;s/\n//' file
Append a unique marker e.g. newline, to the end of field 2.
Remove everything from the first ; up to that newline marker, provided no tab lies in between.
Remove the newline if any.
N.B. This method can be extended for selective or all fields e.g. same removal but for the first and third fields:
sed 's/[^\t]*/&\n/1;s//&\n/3;s/;[^\t]*\n//g;s/\n//g' file

Comparing two files based on 1st column, printing the unique part of one file

I have two files looking like this:
file1:
RYR2 29 70 0.376583106063 4.77084855376
MUC16 51 94 0.481067457376 3.9233164551
DCAF4L2 0 13 0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087 2.81864194236
ZFHX4 14 37 0.371576262084 2.81030548752
file2:
A26B2
RYR2
MUC16
ACTL9
I need to compare them based on first column and print only those lines of first file that are not in second, so the output should be:
DCAF4L2 0 13 0.0691414496833 3.05307268261
USH2A 32 62 0.481792717087 2.81864194236
ZFHX4 14 37 0.371576262084 2.81030548752
I tried with grep:
grep -vFxf file2 file1
with awk:
awk 'NR==FNR {exclude[$0];next} !($0 in exclude)' file2 file1
comm:
comm -23 <(sort file1) <(sort file2)
nothing works
You can use
grep -vFf file2 file1
grep -vf file2 file1 will work too, but if the strings in file2 contain characters such as * or [ that should be read as literal characters, you might get into trouble, since they would need to be escaped. -F makes grep treat those strings as fixed strings.
NOTES
-v: Invert match.
-f file: Take patterns from a file, one per line.
-F: Interpret the pattern as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.
So, it reads the patterns from file2 and applies them to file1, and whenever a line matches, it is not output because of the inverted search. This is enough because only the first column contains alphanumerics; the rest contain numeric data only.
Why your command did not work
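The -x (short for --line-regexp) option means "Select only those matches that exactly match the whole line". Since each line of file1 contains more than just the name in its first column, no pattern from file2 ever matched a whole line, so -v printed every line of file1.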
Also, see more about grep options in grep documentation.
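The awk attempt from the question can also be made to work by keying on the first field ($1) instead of the whole line ($0). This is a sketch rather than part of the answer above: rows_not_listed is a hypothetical name, the rows are shortened, and the RYR21 row is invented to show that the comparison is an exact match on the first column rather than a substring match.

```shell
# rows_not_listed (hypothetical name): rows_not_listed LISTFILE DATAFILE
# Prints the DATAFILE rows whose first field is not a first field in LISTFILE.
rows_not_listed() {
  awk 'NR == FNR { exclude[$1]; next }   # first file: remember each name
       !($1 in exclude)                  # second file: keep rows with an unlisted name
  ' "$1" "$2"
}

dir=$(mktemp -d)
printf '%s\n' 'RYR2 29 70' 'MUC16 51 94' 'DCAF4L2 0 13' 'RYR21 1 2' > "$dir/file1"
printf '%s\n' 'A26B2' 'RYR2' 'MUC16' 'ACTL9' > "$dir/file2"
rows_not_listed "$dir/file2" "$dir/file1"
rm -r "$dir"
```

One caveat of the NR == FNR idiom: if the list file is empty, the condition also holds for the data file, so nothing is printed.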

Print rows that has numbers in it

This is my data; I have more than 1000 rows. How do I get only the records whose Num column contains only numbers?
Records | Num
123 | 7 Y1 91
7834 | 7PQ34-102
AB12AC|87 BWE 67
5690278| 80505312
7ER| 998
Output has to be
7ER| 998
5690278| 80505312
I'm new to linux programming, any help would be highly useful to me. thanks all
I would use awk:
awk -F'[[:space:]]*[|][[:space:]]*' '$2 ~ /^[[:digit:]]+$/' file
If you want to print the number of lines deleted as you've been asking in comments, you may use this:
awk -F'[[:space:]]*[|][[:space:]]*' '
{
if($2~/^[[:digit:]]+$/){print}else{c++}
}
END{printf "%d lines deleted\n", c}' file
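The field-separator regex above strips blanks around the |, but a blank at the very end of a line would still be part of $2 and make the digit test fail. A possible variation, sketched here under that assumption, splits on a plain | and trims a copy of the field before testing it. The name digits_only_rows is hypothetical, and the '42 | 17 ' row in the demo is invented to show the trailing-blank case.

```shell
# digits_only_rows (hypothetical name): keep rows whose 2nd "|" field is all digits,
# ignoring blanks on either side of that field.
digits_only_rows() {
  awk -F'|' '
  {
    f = $2
    gsub(/^[[:space:]]+|[[:space:]]+$/, "", f)   # trim blanks on both sides of the copy
  }
  f ~ /^[[:digit:]]+$/'
}

printf '%s\n' 'Records | Num' '123 | 7 Y1 91' '7834 | 7PQ34-102' 'AB12AC|87 BWE 67' \
              '5690278| 80505312' '7ER| 998' '42 | 17 ' | digits_only_rows
```

Only a copy of the field is trimmed, so the matching lines are printed exactly as they appeared in the input.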
A short and simple GNU awk (gawk) script to filter lines with numbers in the second column (field), assuming a one-word field (e.g. 1234, or 12AB):
awk -F'|' '$2 ~ /\y[0-9]+\y/' file
We use the GNU extension for regexp operators, i.e. \y for matching the word boundary. Other than that, pretty straightforward: we split fields on | and look for isolated digits in the second field.
Edit: Since the question has been updated, and now explicitly allows for multiple words in the second field (e.g. 12 AB, 12-34, 12 34), to get lines with numbers and separators only in the second field:
awk -F'|' '$2 ~ /^[- 0-9]+$/' file
Alternatively, if we say only letters are forbidden in the second field, we can use:
awk -F'|' '$2 ~ /^[^a-zA-Z]+$/' file

Delete lines ending in "0" or "1", except the first line

the input file looks like
Kick-off team 68 0
Ball safe 69 1
Attack 77 8
Attack 81 4
Throw-in 83 0
Ball possession 86 3
Goal kick 100 10
Ball possession 101 1
Ball safe 114 13
Throw-in 123 9
Ball safe 134 11
Ball safe 135 1
Ball safe 137 2
and at the end it should look like this:
Kick-off team 68 0
Attack 77 8
Attack 81 4
Ball possession 86 3
Goal kick 100 10
Ball safe 114 13
Throw-in 123 9
Ball safe 134 11
Ball safe 137 2
my solution is
awk '{print $NF}' test.txt | sed -re '2,${/(^0$|^1$)/d}'
how can i directly change the file, e.g. sed -i?
sed -i '2,$ {/[^0-9][01]$/d}' test.txt
2,$ lines to act upon, this one says 2nd line to end of file
{/[^0-9][01]$/d} from filtered lines, delete those ending with 0 or 1
'2,$ {/ [01]$/d}' can be also used if character before last column is always a space
With awk which is better suited for column manipulations:
awk 'NR==1 || ($NF!=1 && $NF!=0)' test.txt > tmp && mv tmp test.txt
NR==1 first line
($NF!=1 && $NF!=0) last column shouldn't be 0 or 1
can also use $NF>1 if the last column only has non-negative numbers
> tmp && mv tmp test.txt save output to temporary file and then move it back as original file
With GNU awk, there is inplace option awk -i inplace 'NR==1 || ($NF!=1 && $NF!=0)' test.txt
Here's my take on this.
sed -i.bak -e '1p;/[^0-9][01]$/d' file.txt
The sed script prints the first line, then deletes all subsequent lines that match the pattern you described. This assumes that your first line would be a candidate for deletion; if it contains something other than 0 or 1 in the last field, this script will print it twice. And the -i option is what tells sed to edit in-place (with a backup file).
Standard awk doesn't have an equivalent option for editing files in-place (GNU awk's -i inplace is an extension) -- if you want that kind of functionality portably, you need to implement it in a shell wrapper around your awk script, as @sundeep suggested.
Note that I'm not using GNU sed, but this command should work equally well with it.
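The shell-wrapper idea mentioned above (write to a temporary file, then replace the original) might look like the following sketch. The function name filter_in_place is hypothetical, and copying the result back with cat instead of mv is my own choice so that the file keeps its permissions; it is not atomic.

```shell
# filter_in_place (hypothetical name): filter_in_place FILE
# Keeps line 1 and every later line whose last field is neither 0 nor 1.
filter_in_place() {
  tmp=$(mktemp) || return 1
  status=0
  # Overwrite FILE only if awk succeeded.
  awk 'NR == 1 || ($NF != 1 && $NF != 0)' "$1" > "$tmp" && cat "$tmp" > "$1" || status=1
  rm -f "$tmp"
  return "$status"
}

f=$(mktemp)
printf '%s\n' 'Kick-off team 68 0' 'Ball safe 69 1' 'Attack 77 8' \
              'Throw-in 83 0' 'Goal kick 100 10' > "$f"
filter_in_place "$f"
cat "$f"
rm -f "$f"
```

If awk fails, the original file is left untouched and the function returns a non-zero status.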
awk to the rescue!
$ awk 'NR==1 || $NF && $NF!=1' file
or more cryptic
$ awk 'NR==1 || $NF*($NF-1)' file
This might work for you (GNU sed):
sed -i '1b;/\s[01]$/d' file
Other than the first line, delete any line ending in 0 or 1.