Checking for binary characters in GAWK - awk

Win 7-64
Cygwin
GNU Awk 5.1.0
I'm trying to develop a program to handle both DOS- and Unix-formatted files. It looks like the only difference between the two at the application level is that the last character of a line in a DOS file is "\r". I can't figure out how to do the comparison.
My input looks like "w,x,y,z", where z can be "" in Unix or "\r" in DOS. The following does not work:
if (z || z == "\r") # check for Unix ($4) and DOS ($4 == "\r").

gawk may not even see the \r characters, as they can be stripped off by underlying primitives. You need to set BINMODE to make sure your script sees them. See https://www.gnu.org/software/gawk/manual/gawk.html#Built_002din-Variables and https://www.gnu.org/software/gawk/manual/gawk.html#PC-Using where it says:
Under MS-Windows, gawk (and many other text programs) silently
translates end-of-line ‘\r\n’ to ‘\n’ on input and ‘\n’ to ‘\r\n’ on
output.
So you can do:
awk -v BINMODE=3 '
{ print "is " ( /\r$/ ? "" : "not " ) "dos" }
'
but even with that you can't tell on a line-by-line basis whether a file has DOS line-endings, since Windows tools can generate quoted text with linefeeds in the middle. For example, Excel would output
"foo","some\nother\nstuff","bar"\r\n
That is a single row of the spreadsheet whose middle cell contains a couple of linefeeds, and gawk on a UNIX platform would read it as 3 separate lines unless you specifically set RS='\r\n':
"foo","some
other
stuff","bar"\r
So to detect whether your file has DOS line-endings you need to search your whole input file until you find \r\n, and even then you don't really KNOW that that's what it means - the data could just happen to have a \r at the end of one line.
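A whole-file heuristic along those lines could look like this sketch (it stops at the first line ending in \r; as just discussed, a hit still isn't proof of DOS line endings, and file is a placeholder):
awk -v BINMODE=3 '/\r$/{dos=1; exit} END{print (dos ? "found" : "no") " CRLF line endings"}' file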
From your comments below I think you're trying to do something that is simply impossible.
Here's a file created on UNIX and using UNIX line endings where the final field is bar\r:
foo,bar\r\n
Here's a file created on Windows and using DOS line endings where the final field is bar:
foo,bar\r\n
As you can see, given just the file foo,bar\r\n, there's no way to programmatically determine which of the above 2 cases it represents.

$ cat -t carriage
a1^M
a2^M
$ cat -t nocarriage
a1
a2
$ gawk '/\r/' carriage
a1
a2
$ gawk '/\r/' nocarriage
As you can see, with gawk it is straightforward to check whether each line has carriage returns. Writing the octal escape \015 is a possible alternative to \r.
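For example, this matches the same lines as the /\r/ version above:
gawk '/\015/' carriage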

Related

How to extract the first column from a tsv file?

I have a file containing some data and I want to use only the first column as stdin for my script, but I'm having trouble extracting it.
I tried using this
awk -F"\t" '{print $1}' inputs.tsv
but it only shows the first letter of the first column. I tried some other things but it either shows the entire file or just the first letter of the first column.
My file looks something like this:
Harry_Potter 1
Lord_of_the_rings 10
Shameless 23
....
You can use cut, which is available on all Unix and Linux systems:
cut -f1 inputs.tsv
You don't need to specify the -d option because tab is the default delimiter. From man cut:
-d delim
Use delim as the field delimiter character instead of the tab character.
As Benjamin has rightly stated, your awk command is indeed correct: the shell passes the literal two characters \t as the argument, and awk interprets them as a tab, while other commands like cut may not.
I'm not sure why you are getting just the first character as the output.
You may want to take a look at this post:
Difference between single and double quotes in Bash
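If you ever do need to hand cut an explicit delimiter, it has to be a real tab character rather than the two characters \t; bash's ANSI-C quoting can produce one (a sketch):
cut -d$'\t' -f1 inputs.tsv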
Try this (better rely on a real csv parser...):
csvcut -t -c 1 file
Check out csvkit.
Output:
Harry_Potter
Lord_of_the_rings
Shameless
Note:
As @RomanPerekhrest said, you should fix your broken sample input (we see spaces where tabs are expected...)

Renaming files based on internal text match - keep all content of file

Still having trouble figuring out how to preserve the contents of a given file using the following code that is attempting to rename the file based on a specific regex match within said file (i.e. within a given file there will always be one SMILE followed by 12 digits, e.g., SMILE000123456789).
for f in FILENAMEX_*; do awk '/SMILE[0-9]/ {OUT=$f ".txt"}; OUT {print >OUT}' ${f%.*}; done
This code is naming the file correctly but is simply printing out everything after the match instead of the entire contents of the file.
The list of files to be processed don't currently have an extension (and they need one for the next step) because I was using csplit to parse out the content from a larger file.
There are two problems: the first is using a shell variable in your awk program, and the second is the logic of the awk program itself.
To use a shell variable in awk, you can use
awk -v var="$var" '<program>'
and then use just var inside of awk.
For the second problem: if a line doesn't match your pattern and OUT is not set, you don't print the line. After the first line matching the pattern, OUT is set and you print. Since the match might be anywhere in the file, you have to store the lines at least up to the first match.
Here is a version that should work and is pretty close to your approach:
for f in FILENAMEX_*; do
    awk -v f="${f%.*}" '
        !out && /SMILE[0-9]/ {       # First match: open output and print the file so far
            out = f ".txt"
            for (i=1; i<NR; ++i)
                print lines[i] > out
        }
        out  { print > out }         # Match has been seen: print directly
        !out { lines[NR] = $0 }      # No match yet: store
    ' "$f"
done
You could do some trickery and work with FILENAME or similar to do everything in a single invocation of awk, but since the main purpose is to find the presence of a pattern in the file, you're much better off using grep -q, which returns an exit status of 0 if the pattern has been found:
for f in FILENAMEX_*; do grep -q 'SMILE[0-9]' "$f" && cp "$f" "${f%.*}".txt; done
Perhaps take a different approach and just do each step separately. In pseudocode:
for all files with some given text
    extract text
    rename file
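A concrete sketch of that pseudocode, assuming a grep with -o support (GNU/BSD) and at most one SMILE tag per file:
for f in FILENAMEX_*; do
    id=$(grep -o 'SMILE[0-9]\{12\}' "$f" | head -n 1)   # extract the tag
    [ -n "$id" ] && mv -- "$f" "$id.txt"                # rename, contents untouched
done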

using literal string for gawk

I think I'm too close to the problem already to solve it on my own, although I'm sure it's easy to solve.
I'm working on a NAS shell script for my Raspberry Pi which automatically collects data and distributes it over my other devices. I decided to include a delete option, since otherwise it would be a pain to delete a file - the Raspberry would always copy it right back from the other devices. While the script runs it creates a file del_tmp_$ip.txt which lists the directories and files to delete from del_$ip.txt (not from del_tmp_$ip.txt itself).
It looks like this:
test/delete_me.txt
test/hello/hello.txt
pi.txt
I tried to delete the lines via awk, and this is how far I've got by now:
while read r; do
gawk -i inplace '!/^'$r'$/' del_$ip.txt
done <del_tmp_$ip.txt
If the line from del_tmp_$ip.txt tells gawk to delete pi.txt, it works without problems, but if the string includes a slash like test/delete_me.txt it doesn't work:
"unexpected newline or end of string"
and it points to the last slash.
I can't escape the forward slashes with backslashes manually, since I don't know whether and how many slashes there will be; that depends on the line of the file which contains the information to be deleted.
I hope you can help me!
Never allow a shell variable to expand to become part of the awk script text before awk evaluates it (which is what you're doing with '!/^'$r'$/'), and always quote your shell variables (so the correct shell syntax would have been '!/^'"$r"'$/', IF it hadn't been the wrong approach anyway). The correct syntax to write that command would have been
awk -v r="$r" '$0 !~ "^"r"$"' file
but you said you wanted a string comparison, not regexp so then it'd be simply:
awk -v r="$r" '$0 != r' file
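If you did keep the loop, the safe form of it would be (a sketch):
while IFS= read -r r; do
    gawk -i inplace -v r="$r" '$0 != r' "del_$ip.txt"
done < "del_tmp_$ip.txt"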
But of course you don't need a shell loop at all. Instead of:
while read r; do
gawk -i inplace '!/^'$r'$/' del_$ip.txt
done <del_tmp_$ip.txt
you just need 1 awk command:
gawk -i inplace 'NR==FNR{skip[$0];print;next} !($0 in skip)' "del_tmp_$ip.txt" "del_$ip.txt"
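Here the delete list is read first (while NR==FNR); each of its lines is stored in skip[] and printed back out unchanged so the in-place edit doesn't empty it, and then only the lines of del_$ip.txt that are not in the list are kept.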

Run awk on file being decoded continuously and then print patterns from decode file using awk

I have a command which decodes binary logs to ASCII format.
From the ASCII format file, I need to grep some patterns using awk and print them.
How can this be done?
What I have tried is below, in a shell script, and it does not work.
command > file.txt | awk /pattern/ | sed/pattern/
Also, I need the command to continuously decode the file and keep printing patterns as the file is updated.
Thanks in advance
command to continuously decode file and keep printing patterns
The first question is exactly how continuously manifests itself. Most log files grow by being appended to -- for our purpose here, by some unknown external process -- and are periodically rotated. If you're going to continuously decode them, you're going to have to keep track of log rotation.
Can command continuously decode, or do you intend to re-run command periodically, picking up where you left off? If the latter, you might instead try some variation of:
cat log | command | awk
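For the former case, a common way to keep the first stage running is to follow the log as it grows (a sketch: tail -F reopens the file across rotations, and it assumes command can decode from stdin, which may not hold for your decoder):
tail -F log | command | awk '/pattern/ { print }'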
If that can't be done, you'll have to record where each iteration terminates, something like:
touch pos
while [ -f pos ]
do
    command | awk -v status=pos -f script.awk >> output || rm pos
done
where script.awk skips input until NR equals the number stored in the pos file. It then processes lines until EOF and overwrites pos with its final NR. On error it calls exit 1, the pos file is removed, and the loop terminates.
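A sketch of what script.awk could look like under those assumptions (status holds the name of the pos file, as passed with -v status=pos; /pattern/ stands in for your real matching logic):
BEGIN { if ((getline last < status) <= 0) last = 0 }  # resume point, 0 if pos is empty or missing
NR <= last { next }                                   # skip lines already processed in an earlier run
/pattern/ { print }                                   # the real pattern-matching logic goes here
END { print NR > status }                             # record where this run stopped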
I recommend you ignore sed, and put all the pattern matching logic in one awk script. It will be easier to understand and cheaper to execute.

How to do search/replace in a "binary" file from the command line

I have some data files to import into a database with some "unique" delimiters:
Field Separator (FS): SOH (ASCII character 1)
Record Separator (RS): STX (ASCII character 2) + '\n'
I'd like to import the files into Postgres using the COPY command but while I can specify a custom field delimiter, it can't handle the record separator.
I can't just strip out the \002 from the data either, because if there is a newline in one of the fields (and there are) it will incorrectly cause the COPY to think it is a new record when in fact it is not.
One important thing to note: it's not important that newlines in fields are preserved, it's fine if they are just converted into a space.
With this in mind, I was thinking of using something like sed to convert newlines into spaces, then convert \002 into newlines. However, since sed is a line-based tool, it doesn't seem to see the newlines at the end of each line and can't do a search/replace on them.
Are there any other unix command-line tools that could do the job?
EDIT: I guess what I'm really asking for is a unix utility that can process a file (perform search/replace) as "binary" without splitting it up into lines
Based on the suggestion given by Patrick, I have been able to do it using Perl:
cat file | perl -pe 's/\002\n/\002\002/g' | perl -pe 's/\n/ /g' | perl -pe 's/\002\002/\n/g'
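If the file fits in memory, the same three substitutions can be done in one pass by slurping the whole file (a sketch; -0777 makes perl read the input as a single string, and the file names are placeholders):
perl -0777 -pe 's/\002\n/\002\002/g; s/\n/ /g; s/\002\002/\n/g' file > fixed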
Could you do multiple passes through the file? Pass 1 converts all \002\n to \002\002, say.
Pass 2 could convert all the \n to spaces. Pass 3 can convert all the \002\002 back to \n.