How to specify a file prefix in gawk


I am trying to identify file extensions from a list of filenames extracted from a floppy disk image. The problem differs from this example, where the files have already been extracted from the disk image. I'm new to gawk, so maybe it is not the right tool.
ls Sounddsk2.img -a1 > allfilenames
The command above creates the list of filenames shown below.
flute.pt
flute.ss
flute.vc
guitar.pt
guitar.ss
guitar.vc
The gawk command below identifies files ending in .ss
cat allfilenames | gawk '/[fluteguitar].ss/' > ssfilenames
This would be fine when there are just a few known file names. How do I specify a file prefix in a more generic form?

Unless someone can suggest a better one, this seems to be the most generic way to express it. It works for any filename prefix spelt with uppercase letters, lowercase letters and digits.
cat allfilenames | gawk '/[a-zA-Z0-9].ss/' > ssfilenames
Edit
αғsнιη's first suggested answer and jetchisel's comment prompted me to try gawk without cat.
gawk '/^[a-zA-Z0-9]+\.ss$/' allfilenames > ssfilenames
and this also worked
gawk '/[a-zA-Z0-9]\.ss/' allfilenames > ssfilenames

Use the find command to match file names. With the samples shown, you could try the following. You can run it on the directory itself, so there is no need to store the file names in a file and then process that file with awk.
find . -regextype egrep -regex '.*/(flute|guitar)\.ss$'
Explanation: the -regextype option of GNU find selects the regex flavour (egrep style here), and the regex matches file names that are flute OR guitar and end with .ss.

You might also use grep with -E for extended regular expressions and use an alternation to match either flute or guitar.
ls Sounddsk2.img -a1 | grep -E "^(flute|guitar)\.ss$" > ssfilenames
The pattern matches:
^ Start of string
(flute|guitar) Match either flute or guitar
\.ss Match .ss
$ End of string
The file ssfilenames contains:
flute.ss
guitar.ss

The regex you came up with, /[fluteguitar].ss/, matches lines that contain one of the characters f, l, u, e, g, i, t, a or r (listed in the bracket expression [...]; duplicated characters count only once), followed by any single character (which an unescaped dot . matches), followed by ss, anywhere in the line.
You need to restrict the match with the start ^ and end $ of line anchors, and use a group to list the alternatives.
awk '/^(flute|guitar)\.ss$/' allFilesName > ssFileNames
This keeps only the file names flute.ss and/or guitar.ss. The group (...|...) matches any one of the regexes separated by the pipe, which acts as a logical OR.
If these are just prefixes and you want to match any file name that begins with one of them and ends with .ss, use:
awk '/^(flute|guitar).*\.ss$/' allFilesName > ssFileNames
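To make the difference between the bracket expression and an anchored pattern concrete, here is a minimal sketch. It assumes a fully generic prefix is wanted (any non-empty prefix before a literal .ss), and the entry flutexss is a made-up decoy that is not in the original listing.

```shell
# Sample names from the question, plus a made-up decoy without a dot.
names='flute.pt
flute.ss
guitar.ss
guitar.vc
flutexss'

# Loose pattern: one listed character, then ANY character, then "ss".
loose=$(printf '%s\n' "$names" | awk '/[fluteguitar].ss/')

# Anchored pattern: any non-empty prefix, a literal dot, "ss" at end of line.
strict=$(printf '%s\n' "$names" | awk '/^.+\.ss$/')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose pattern also lets the decoy through, while the anchored one keeps only flute.ss and guitar.ss.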

Related

Replace character except between pattern using grep -o or sed (or others)

In the following file I want to replace all the ; by , with the exception that, when there is a string (delimited with two "), it should not replace the ; inside it.
Example:
Input
A;B;C;D
5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;
5cc0723b9b69581f14f64285;5cc0723b9b69581f14f64294;2;"5cc0723b9b69581f14f64285;16a4fbe3855";xpto;
5cc072579b69581f14f6428a;5cc072579b69581f14f64299;3;"5cc072579b69581f14f6428a;16a4fbea632";xpto;
output
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
For sed I have: sed 's/;/,/g' input.txt > output.txt but this would replace everything.
The regex for the " delimited string: \".*;.*\" .
(A regex for hexadecimal would be better -- something like: [0-9a-fA-F]+)
My problem is combining it all to make a grep -o / sed that replaces everything except for that pattern.
The file size is in the two-digit GB range (at most 99 GB), so performance matters.
Any ideas are appreciated.

sed is for doing simple s/old/new on individual strings. grep is for doing g/re/p. You're not trying to do either of those tasks so you shouldn't be considering either of those tools. That leaves the other standard UNIX tool for manipulating text - awk.
You have a ;-separated CSV that you want to make ,-separated. That's simply:
$ awk -v FPAT='[^;]*|"[^"]+"' -v OFS=',' '{$1=$1}1' file
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,
The above uses GNU awk for FPAT. See What's the most robust way to efficiently parse CSV using awk? for more details on parsing CSVs with awk.
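If GNU awk is not available, a different technique may do: split each line on the double quote, so that odd-numbered fields lie outside quotes and even-numbered fields are the quoted strings. This is only a sketch, and it assumes that quotes are always balanced and that no field contains an escaped quote or a newline.

```shell
# Sample rows copied from the question.
input='A;B;C;D
5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;'

# With the double quote as separator, fields 1, 3, 5, ... are outside quotes;
# only those get their semicolons replaced.
result=$(printf '%s\n' "$input" | awk 'BEGIN { FS = OFS = "\"" }
{
    for (i = 1; i <= NF; i += 2)
        gsub(/;/, ",", $i)
    print
}')
printf '%s\n' "$result"
```

Unlike the FPAT approach, this does not parse the line into CSV fields at all; it only distinguishes text inside quotes from text outside them.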

If I understand your requirements correctly, one option would be a three-pass approach.
From your comment about hex, I'll assume that nothing like # will appear in the input, so you can do (using GNU sed):
sed -E 's/("[^"]+);([^"]+")/\1#\2/g' original > transformed
sed -i 's/;/,/g' transformed
sed -i 's/#/;/g' transformed
The idea is to replace the ; inside quotes with something else and write the result to a new file, then replace all remaining ; with , and finally put the ; back in place within the same file (the -i flag of sed).
The three passes can be combined into a single command with
sed -E 's/("[^"]+);([^"]+")/\1#\2/g;s/;/,/g;s/#/;/g' original > transformed
That said, there are probably plenty of CSV parsers which already handle quoted fields and which you could use in the final use case, as I bet this is just an intermediate step for something later in the chain.
From Ed Morton's comment: if you do it in one pass, you can use \n as the temporary replacement character, because a newline cannot occur in text that is processed line by line.
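The newline variant might look like the sketch below. It assumes GNU sed (other implementations may not treat \n in the replacement as a newline), and like the command above it protects only one ; per quoted string.

```shell
# One sample row copied from the question.
line='5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;'

# 1) park the quoted ; as a newline, 2) turn the remaining ; into commas,
# 3) turn the parked newline back into ;
result=$(printf '%s\n' "$line" |
  sed -E 's/("[^"]+);([^"]+")/\1\n\2/g; s/;/,/g; s/\n/;/g')
printf '%s\n' "$result"
```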

This might work for you (GNU sed):
sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/;ta;y/;/,/;y/\n/;/' file
Replace ;'s inside double quotes with newlines, transpose ;'s to ,'s and then transpose newlines to ;'s.

How to extract the first column from a tsv file?

I have a file containing some data and I want to use only the first column as a stdin for my script, but I'm having trouble extracting it.
I tried using this
awk -F"\t" '{print $1}' inputs.tsv
but it only shows the first letter of the first column. I tried some other things but it either shows the entire file or just the first letter of the first column.
My file looks something like this:
Harry_Potter 1
Lord_of_the_rings 10
Shameless 23
....

You can use cut which is available on all Unix and Linux systems:
cut -f1 inputs.tsv
You don't need to specify the -d option because tab is the default delimiter. From man cut:
-d delim
Use delim as the field delimiter character instead of the tab character.
As Benjamin has rightly stated, your awk command is indeed correct. The shell passes a literal \t as the argument, and awk interprets it as a tab, whereas other commands such as cut may not.
Not sure why you are getting just the first character as the output.
You may want to take a look at this post:
Difference between single and double quotes in Bash
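If the file turns out to contain spaces rather than tabs in some rows (an assumption based on how the sample was pasted), awk's default field splitting on any run of blanks may be the more forgiving choice. The two rows below are illustrative: the first is tab-separated, the second space-separated.

```shell
# Row 1 uses a tab, row 2 uses a space between the columns.
rows=$(printf 'Harry_Potter\t1\nLord_of_the_rings 10\n')

# Default awk splitting treats tabs and spaces alike.
with_awk=$(printf '%s\n' "$rows" | awk '{print $1}')

# cut -f1 only splits on tabs, so the space-separated row comes back whole.
with_cut=$(printf '%s\n' "$rows" | cut -f1)

printf 'awk:\n%s\n\ncut:\n%s\n' "$with_awk" "$with_cut"
```

Comparing the two outputs is also a quick way to check whether a given row really contains a tab.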

Try this (better to rely on a real CSV parser...):
csvcut -t -c 1 file
Check csvkit; its -t option tells csvcut that the input is tab-delimited.
Output :
Harry_Potter
Lord_of_the_rings
Shameless
Note :
As @RomanPerekhrest said, you should fix your broken sample input (we saw spaces where tabs are expected...)

awk to replace a line in a text file and save it

I want to open a text file that has a list of 500 IP addresses. I want to make the following changes to one of the lines and save the file. Is it possible to do that with awk or sed?
current line :
100.72.78.46:1900
changes :
100.72.78.46:1800

You can achieve that with the following:
sed -i.bak -e 's/100.72.78.46:1900/100.72.78.46:1800/' file.txt
The -i option updates the original file in place, and the .bak suffix keeps a backup copy as file.txt.bak. (Written as -ie, sed reads the e as the backup suffix and creates file.txte instead.) This replaces only the first occurrence of the pattern on each line. To replace every occurrence on a line, add a g after the last /.
This solution, however (as pointed out in the comments), fails in other cases: because an unescaped . matches any character, a line such as 72100372578146:190032 would be transformed into 72100.72.78.46:180032.
To avoid that, you have to match exactly and escape the . so that it is not treated as a special character (see here):
sed -i.bak -e 's/\<100\.72\.78\.46:1900\>/100.72.78.46:1800/g' file.txt
Note the \. and the \<...\> "word boundary" notation for the exact match. This solution worked for me on a Linux machine, but not on a Mac. There you would have to use a slightly different syntax (see here):
sed -i.bak -e 's/[[:<:]]100\.72\.78\.46:1900[[:>:]]/100.72.78.46:1800/g' file.txt
where [[:<:]]...[[:>:]] gives you the exact match.
Finally, if you have only one IP address per line, you can also use the anchors ^ and $ for the beginning and end of the line, which prevents the erroneous replacement:
sed -i.bak -e 's/^100\.72\.78\.46:1900$/100.72.78.46:1800/' file.txt
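To try the anchored replacement safely, one might run it on a throwaway file first. The sketch below uses a temporary file with three illustrative lines (only the middle one should change) and the -i.bak spelling, which keeps a backup next to the edited file.

```shell
# Work on a temporary copy so nothing real is modified.
tmp=$(mktemp)
printf '%s\n' '100.72.78.45:1900' '100.72.78.46:1900' '72100372578146:190032' > "$tmp"

# ^ and $ anchor the match to the whole line; \. matches a literal dot.
sed -i.bak -e 's/^100\.72\.78\.46:1900$/100.72.78.46:1800/' "$tmp"

result=$(cat "$tmp")
printf '%s\n' "$result"

# Clean up the temporary file and its backup.
rm -f "$tmp" "$tmp.bak"
```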

Replace chars after column X

Let's say my data looks like this
iqwertyuiop
and I want to replace all the letters i after column 3 with a Z, so my output would look like this
iqwertyuZop
How can I do this with sed or awk?

It's not clear what you mean by "column" but maybe this is what you want using GNU awk for gensub():
$ echo iqwertyuiop | awk '{print substr($0,1,3) gensub(/i/,"Z","g",substr($0,4))}'
iqwertyuZop

Perl is handy for this: you can assign to a substring
$ echo "iiiiii" | perl -pe 'substr($_,3) =~ s/i/Z/g'
iiiZZZ

This would totally be ideal for the tr command, if only you didn't have the requirement that the first 3 characters remain untouched.
However, if you are okay with using some bash tricks plus cut and paste, you can split the file into two parts and paste them back together afterwards:
paste -d'\0' <(cut -c-3 foo) <(cut -c4- foo | tr i Z)
The above uses paste to rejoin together the two parts of the file that get split with cut. The second section is piped through tr to translate i's to Z's.

(1) Here's a short-and-simple way to accomplish the task using GNU sed:
sed -r -e ':a;s/^(...)([^i]*)i/\1\2Z/g;ta'
This entails looping (t), and so would not be as efficient as non-looping approaches. The above can also be written with escaped parentheses, \( and \), in basic regular expression syntax, so there is no real need for the -r option. Other implementations of sed should (in principle) be up to the task as well, but YMMV.
(2) It's easy enough to use "old awk" as well:
awk '{s=substr($0,4);gsub(/i/,"Z",s); print substr($0,1,3) s}'
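The same idea can be parameterised so that the column is not hard-coded. In this sketch, n is a hypothetical variable name for the number of leading characters to leave untouched.

```shell
# n = how many leading characters stay unchanged (3 in the question).
result=$(echo iqwertyuiop | awk -v n=3 '{
    head = substr($0, 1, n)     # first n characters, kept as they are
    tail = substr($0, n + 1)    # everything after column n
    gsub(/i/, "Z", tail)        # replace every i in the tail only
    print head tail
}')
echo "$result"
```

This uses only substr and gsub, so it should not depend on GNU awk.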

The most intuitive way would be to use awk:
awk 'BEGIN{FS="";OFS=FS}{for(i=4;i<=NF;i++){if($i=="i"){$i="Z"}}}1' file
In gawk, FS="" splits the input string into one field per character. We iterate through characters/fields 4 to the end and replace i with Z.
The final 1 evaluates to true and makes awk print the modified input line.
With sed it does not look very intuitive, but it is still possible:
sed -r '
h # Backup the current line in hold buffer
s/.{3}// # Remove the first three characters
s/i/Z/g # Replace all i by Z
G # Append the contents of the hold buffer to the pattern buffer (this adds a newline between them)
s/(.*)\n(.{3}).*/\2\1/ # Remove that newline ^^^ and assemble the result
' file

text processing: sed to work backwards to delete until string

My AWK script generates 1 of the following 2 outputs depending on what text file it is being used on.
49 1146.469387755102 mongodb 192.168.0.8:27017 -p mongodb.database
1 1243.0 jdbc:mysql 192.168.0.8:3306/ycsb -p db.user
I need a way of deleting everything past the IP address, including the port number.
sed 's/:[^:]*//2g'
It works, except that it deletes from left to right, and as one of the outputs contains two colons, it stops and deletes everything after that. Is there a way of reversing sed to work from right to left?
Just to be clear, desired output of each would be:
49 1146.469387755102 mongodb 192.168.0.8
1 1243.0 jdbc:mysql 192.168.0.8

You could use the below sed command.
sed 's/:[0-9]\{4\}.*//' file
OR
sed 's/:[^:]*$//' file
[^:]* is a negated character class that matches any character except :, zero or more times. $ matches the end of the line. So :[^:]*$ matches everything from the last colon up to the end of the line. Replacing the matched characters with an empty string gives you the desired output.
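As a quick check, the second command can be run against both sample lines from the question. The sketch assumes that nothing after the port contains a further colon.

```shell
# Both sample lines from the question, fed through the last-colon pattern.
result=$(printf '%s\n' \
  '49 1146.469387755102 mongodb 192.168.0.8:27017 -p mongodb.database' \
  '1 1243.0 jdbc:mysql 192.168.0.8:3306/ycsb -p db.user' |
  sed 's/:[^:]*$//')
printf '%s\n' "$result"
```

The colon in jdbc:mysql survives because only the last colon on the line can be followed by colon-free text up to the end.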

You can take advantage of the greedy nature of the Kleene *:
sed 's/\(.*\):.*/\1/' file
The .* consumes as much as it can, while still matching the pattern. The captured part of the line is used in the replacement.
Alternatively, using awk (thanks to glenn jackman for setting me straight):
awk -F: -v OFS=: 'NF{NF--}1' file
Set the input and output field separators to a colon and remove the final field by decrementing NF. 1 is true, so the default action {print} is performed. The NF condition prevents empty lines from causing an error, which may not be necessary in your case but does no harm.
Output either way:
49 1146.469387755102 mongodb 192.168.0.8
1 1243.0 jdbc:mysql 192.168.0.8
