How to remove lines that match an exact phrase on Linux? - awk

My file contains two lines with Unicode (probably) characters:
▒▒▒▒=
▒▒▒=
and I wish to remove both these lines from the file.
I searched and found I can use this command to remove non UTF-8 characters:
iconv -c -f utf-8 -t ascii file
but it leaves those two lines like this:
=
=
I can't find how to remove lines that match (not just contain, but match) a certain phrase, in my case: =.
UPDATE: I found that when I redirect the "=" lines to another file and open it, it contains an unwanted line: ^A=
which I was unable to match with sed to delete.

This might work for you (GNU sed):
sed '/^\(\o342\o226\o222\)\+=/d' file
Use:
sed -n l file
to find the octal representation of the Unicode characters, and then use the \o... metacharacter in the regexp to match them.
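For example, on a file containing the two lines from the question (assuming the blocks are U+2592 MEDIUM SHADE, whose UTF-8 bytes are octal 342 226 222), the listing would look something like:
\342\226\222\342\226\222\342\226\222\342\226\222=$
\342\226\222\342\226\222\342\226\222=$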
EDIT:
To remove the lines only containing = use:
sed '/^\(\o342\o226\o222\)*=\s*$/d' file

Here is the command to clear these lines:
sed -i 's/^=$//g' your_file
As specified in the comments, you can also use grep -v '^whatever$' your_file > cleared_file. Note that this solution requires writing to a different output file (cleared_file), while the sed solution allows you to modify the content "in place".
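Note that the s/^=$// form above empties the matching lines but leaves them in the file as blank lines. If you want to remove the lines entirely, sed's d command should do it (a small sketch, untested on your data):
sed -i '/^=$/d' your_file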

Related

How to specify a file prefix in gawk

I am trying to identify file extensions from a list of filenames extracted from a floppy disk image. The problem is different from this example where files are already extracted from the disk image. I'm new to gawk so maybe it is not the right tool.
ls Sounddsk2.img -a1 > allfilenames
The command above creates the list of filenames shown below.
flute.pt
flute.ss
flute.vc
guitar.pt
guitar.ss
guitar.vc
The gawk command below identifies files ending in .ss
cat allfilenames | gawk '/[fluteguitar].ss/' > ssfilenames
This would be fine when there are just a few known file names. How do I specify a file prefix in a more generic form?
Unless someone can suggest a better one, this seems to be the most generic way to express this. It will work for any prefix filename spelled with uppercase letters, lowercase letters and numbers:
cat allfilenames | gawk '/[a-zA-Z0-9].ss/' > ssfilenames
Edit
αғsнιη's first suggested answer and jetchisel's comment prompted me to try using gawk without using cat.
gawk '/^([a-zA-Z0-9])\.ss$/' allfilenames > ssfilenames
and this also worked
gawk '/[a-zA-Z0-9]\.ss/' allfilenames > ssfilenames
Use the find command to deal with matching file names; with your shown samples, you could try the following. You can run this command on the directory itself, so you need not store the file names in a file and then use awk on it.
find . -regextype egrep -regex '.*/(flute|guitar)\.ss$'
Explanation: put simply, this uses find's ability to take a regextype (egrep style here), supplying a regex that matches file names flute OR guitar and makes sure the name ends with .ss.
You might also use grep with -E for extended regexp and use an alternation to match either flute or guitar.
ls Sounddsk2.img -a1 | grep -E "^(flute|guitar)\.ss$" > ssfilenames
The pattern matches:
^ Start of string
(flute|guitar) Match either flute or guitar
\.ss Match .ss
$ End of string
The file ssfilenames contains:
flute.ss
guitar.ss
With the regex you came up with, /[fluteguitar].ss/, you match lines containing any one of the characters f, l, u, e, g, i, t, a or r (listed within the bracket expression [...]; duplicated characters count only once), followed by any single character (except newline) that the unescaped dot . matches, then ss, anywhere in the line.
You need to restrict the matching by using the start-of-line ^ and end-of-line $ anchors, as well as a grouped alternation.
awk '/^(flute|guitar)\.ss$/' allFilesName > ssFileNames
to filter only the two file names flute.ss and/or guitar.ss. The group (...|...) matches any one of the regexps separated by the pipe, i.e. a logical OR.
If these are just prefixes and you want to match any file name beginning with them and ending with .ss, use:
awk '/^(flute|guitar).*\.ss$/' allFilesName > ssFileNames
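If the prefix can be any run of letters and digits rather than just these two names, a character class repeated with + should cover it; for example (a sketch along the same lines):
awk '/^[[:alnum:]]+\.ss$/' allFilesName > ssFileNames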

How to delete the "0"-row for multiple files in a folder?

Each file's name starts with "input". One example of the files looks like:
0.0005
lii_bk_new
traj_new.xyz
0
73001
146300
I want to delete the lines which only include '0', and the expected output is:
0.0005
lii_bk_new
traj_new.xyz
73001
146300
I have tried with
sed -i 's/^0\n//g' input_*
and
grep -RiIl '^0\n' input_* | xargs sed -i 's/^0\n//g'
but neither works.
Please give some suggestions.
Could you please try changing your attempted code to the following, and run it on a single Input_file first.
sed 's/^0$//' Input_file
OR as per OP's comment to delete null lines:
sed 's/^0$//;/^$/d' Input_file
I have intentionally not put the -i option here; first test this on a single file, and only if the output looks good run it with the -i option on multiple files.
Also, the problem in your attempt was that you put \n in the sed regex, which is the default line separator; we need to put $ there instead, to tell sed to delete those lines which start and end with 0.
In case you want to take a backup of the files (considering that you have enough space available in your file system), you could also use sed's -i.bak option, which will take a backup of each file before editing (this isn't necessary, but it's there if you want to be on the safer side).
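For example, the same command with a backup suffix might look like this (each input_* file would get an input_*.bak copy before being edited):
sed -i.bak 's/^0$//;/^$/d' input_*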
$ sed '/^0$/d' file
0.0005
lii_bk_new
traj_new.xyz
73001
146300
In your regexp you were confusing \n (the literal linefeed character, which will not be present in the string sed is analyzing, since sed reads one \n-separated line at a time) with $ (the end-of-string regexp metacharacter, which represents end-of-line when the string being parsed is a line, as it is with sed by default).
The other mistake in your script was replacing 0 with null in the matching line instead of just deleting the matching line.
Please give some suggestions.
I would use GNU awk with -i inplace for that, in the following way:
awk -i inplace '!/^0$/' input_*
This simply preserves all lines which do not match ^0$, i.e. (start of line)0(end of line). If you want to know more about -i inplace, I suggest reading this tutorial.
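If you want backups similar to sed's -i.bak, recent gawk releases let you set a suffix for the inplace extension; a sketch (check your gawk version's documentation for the exact variable name, older versions use INPLACE_SUFFIX instead):
awk -i inplace -v inplace::suffix=.bak '!/^0$/' input_*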

Extract Regex Pattern For Each Line - Leave Blank Line If No Pattern Exists

I am working with the following input:
"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"
I need to be able to extract both the phone number and email of each line into separate files. However, both values don't always appear in the same field - they will always be prefaced with "phone": or "email":, but they may be in the first, second, third or even twentieth field.
I have tried chopping together solutions in sed and awk to remove everything up until "phone" and then everything after the next , but this does not work as desired. It also means that, if "phone" and/or "email" do not exist, the line is not changed at all.
I need a solution that will give me an output with the phone value of each line in one file, and the email value in another. HOWEVER, if no phone or email value exists, a blank line in the output needs to be in place.
Any ideas?
This might work for you (GNU sed):
sed -Ene 'h;/.*"phone":([^,]*).*/!z;s//\1/;w phoneFile' -e 'g;/.*"email":([^,]*).*/!z;s//\1/;w emailFile' file
Make a copy of line.
If the line does not contain a phone number empty the line, otherwise remove everything but the phone number.
Write the result to the phone number file.
Replace the current pattern space by the copy of the original line.
Repeat as above for an email address.
N.B. My first attempt used s/.*// instead of z to empty the line which worked but should not have. If the line contained no phone/email, the substitution should have reset default regexp and the second substitution should have objected that it did not contain a back reference. However the second substitution worked in either case.
After fixing your file to be valid json and adding an extra line missing the phone attribute so we can test more of your requirements:
$ cat file
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"}
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"city":"Marshall","gender":"female","email":"foo.bar#gmail.com"}
you can do whatever you like with the data:
$ jq -r '.email // ""' file
mortina.curabia#gmail.com
foo.bar#gmail.com
$
$ jq -r '.phone // ""' file
549-287-5287
$
As long as it doesn't contain embedded newlines you can use sed 's/.*/{&}/' file to convert the input in your question to valid json as in my answer:
$ cat file
"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"
"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"city":"Marshall","gender":"female","email":"foo.bar#gmail.com"
$ sed 's/.*/{&}/' file
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"}
{"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"city":"Marshall","gender":"female","email":"foo.bar#gmail.com"}
$ sed 's/.*/{&}/' file | jq -r '.email // ""'
mortina.curabia#gmail.com
foo.bar#gmail.com
but I'm betting you started out with valid json and removed the {} by mistake along the way so you probably just need to not do that.
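To actually produce the two output files the question asks for, each jq call can be redirected separately, reusing the sed wrapper from above if the braces really are missing (a sketch with assumed output names phoneFile and emailFile):
sed 's/.*/{&}/' file | jq -r '.phone // ""' > phoneFile
sed 's/.*/{&}/' file | jq -r '.email // ""' > emailFile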
Using grep
Try:
grep -o '"phone":"[0-9-]*"' < Input > phone.txt
grep -o '"email":"[^"]*"' <Input > email.txt
Demo:
$echo '"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"' | grep -o '"phone":"[0-9-]*"'
"phone":"549-287-5287"
$echo '"visit_date":{"$date":"2017-11-28T04:43:00.000Z"},"phone":"549-287-5287","city":"Marshall","gender":"female","email":"mortina.curabia#gmail.com"' | grep -o '"email":"[^"]*"'
"email":"mortina.curabia#gmail.com"
$
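Note that grep -o prints nothing at all for lines without a match, so these output files will not contain the blank placeholder lines the question asks for. An awk sketch that keeps an empty line when the field is missing (file here stands for your input; this prints just the value, without the field name and quotes):
awk 'match($0, /"phone":"[^"]*"/) {print substr($0, RSTART+9, RLENGTH-10); next} {print ""}' file > phone.txt
awk 'match($0, /"email":"[^"]*"/) {print substr($0, RSTART+9, RLENGTH-10); next} {print ""}' file > email.txt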

Replace character except between pattern using grep -o or sed (or others)

In the following file I want to replace all the ; by , with the exception that, when there is a string (delimited with two "), it should not replace the ; inside it.
Example:
Input
A;B;C;D
5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;
5cc0723b9b69581f14f64285;5cc0723b9b69581f14f64294;2;"5cc0723b9b69581f14f64285;16a4fbe3855";xpto;
5cc072579b69581f14f6428a;5cc072579b69581f14f64299;3;"5cc072579b69581f14f6428a;16a4fbea632";xpto;
output
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
For sed I have: sed 's/;/,/g' input.txt > output.txt but this would replace everything.
The regex for the " delimited string: \".*;.*\" .
(A regex for hexadecimal would be better -- something like: [0-9a-fA-F]+)
My problem is combining it all to make a grep -o / sed that replaces everything except for that pattern.
The file size is on the order of two-digit GB (max 99 GB), so performance is important.
Any ideas are appreciated.
sed is for doing simple s/old/new on individual strings. grep is for doing g/re/p. You're not trying to do either of those tasks so you shouldn't be considering either of those tools. That leaves the other standard UNIX tool for manipulating text - awk.
You have a ;-separated CSV that you want to make ,-separated. That's simply:
$ awk -v FPAT='[^;]*|"[^"]+"' -v OFS=',' '{$1=$1}1' file
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
The above uses GNU awk for FPAT. See What's the most robust way to efficiently parse CSV using awk? for more details on parsing CSVs with awk.
If I understand your requirements correctly, one option would be a three-pass approach.
From your comment about hex, I'll assume nothing like # will appear in the input, so you can do (using GNU sed):
sed -E 's/("[^"]+);([^"]+")/\1#\2/g' original > transformed
sed -i 's/;/,/g' transformed
sed -i 's/#/;/g' transformed
The idea being to replace the ; when within quotes by something else and write it to a new file, then replace all ; by , and then set back the ; in place within the same file (-i flag of sed).
The three pass can be combined in a single command with
sed -E 's/("[^"]+);([^"]+")/\1#\2/g;s/;/,/g;s/#/;/g' original > transformed
That said, there are probably a bunch of CSV parsers which already handle quoted fields that you could use for the final use case, as I bet this is just an intermediate step for something else later in the chain.
From Ed Morton's comment: if you do it in one pass, you can use \n as the replacement separator, as there can't be a newline in text that is considered line by line.
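Concretely, that one-pass variant could look like this (a sketch under the same assumptions, just using a newline instead of #):
sed -E 's/("[^"]+);([^"]+")/\1\n\2/g;s/;/,/g;s/\n/;/g' original > transformed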
This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/;ta;y/;/,/;y/\n/;/' file
Replace ;'s inside double quotes with newlines, transpose ;'s to ,'s and then transpose newlines to ;'s.

awk to replace a line in a text file and save it

I want to open a text file that has a list of 500 IP addresses. I want to make the following changes to one of the lines and save the file. Is it possible to do that with awk or sed?
current line :
100.72.78.46:1900
changes :
100.72.78.46:1800
You can achieve that with the following:
sed -ie 's/100.72.78.46:1900/100.72.78.46:1800/' file.txt
The i option will update the original file, and a backup file will be created. This will edit only the first occurrence of the pattern on each line. If you want to replace all matching occurrences, add a g after the last /.
This solution, however (as pointed out in the comments), fails in many other instances, such as 72100372578146:190032, which would be transformed into 72100.72.78.46:180032.
To circumvent that, you'd have to do an exact match, and also not treat the . as special character (see here):
sed -ie 's/\<100\.72\.78\.46:1900\>/100.72.78.46:1800/g' file.txt
Note the \. and the \<...\> "word boundary" notation for the exact match. This solution worked for me on a Linux machine, but not on a Mac. For that, you would have to use a slightly different syntax (see here):
sed -ie 's/[[:<:]]100\.72\.78\.46:1900[[:>:]]/100.72.78.46:1800/g' file.txt
where the [[:<:]]...[[:>:]] would give you the exact match.
Finally, I also realized that, if you have only one IP address per line, you could also use the special characters ^ and $ for the beginning and end of the line, preventing the erroneous replacement:
sed -ie 's/^100\.72\.78\.46:1900$/100.72.78.46:1800/g' file.txt
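Since the question title mentions awk, a hedged awk sketch of the same anchored replacement (plain awk has no in-place option, so this writes to a temporary file, here called file.tmp, and moves it back over the original):
awk '{sub(/^100\.72\.78\.46:1900$/, "100.72.78.46:1800")} 1' file.txt > file.tmp && mv file.tmp file.txt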