Writing special characters (α, β) in bash to an output file - not encoding correctly - awk

I'm attempting to modify an existing file (input.csv) using the awk command in the OSX terminal and then write the output to a file (output.csv).
The command I have works for the text changes I need and displays the output correctly in the terminal (including correct special characters):
awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' input.csv
I then attempt to write this output to a file by piping it to the tee command:
awk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' input.csv | tee output.csv
I find that the special characters in the file are now corrupted, e.g. the "α" in "α-Synuclein" comes out as garbage bytes in output.csv. I believe this is to do with the encoding, but I'm unsure how to specify which one to use or where to change it.
I've noticed that running file input.csv lists the encoding as:
"UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators"
When I run file output.csv on the output file I get:
"Non-ISO extended-ASCII text, with very long lines, with CRLF, LF line terminators"
I think this means I should be encoding the output file as UTF-8... Can anyone suggest a way to fix this so that I can write these characters to my output file without them becoming corrupted?
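One likely culprit (an educated guess, not confirmed in the question above): the BSD awk that ships with macOS is not multibyte-aware, so toupper(substr($i,1,1)) grabs only the first byte of the two-byte UTF-8 character "α", splitting the sequence and producing the garbage bytes seen in output.csv. A sketch of a workaround, assuming GNU awk (gawk, e.g. installed via Homebrew) and a UTF-8 locale:
export LC_ALL=en_US.UTF-8
gawk '{for(i=1;i<=NF;i++){ $i=toupper(substr($i,1,1)) substr($i,2) }}1' input.csv | tee output.csv
In a UTF-8 locale, gawk's substr() counts characters rather than bytes, so multibyte characters pass through intact.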

Related

Checking GAWK binary characters

Win 7-64
Cygwin
GNU Awk 5.1.0
I'm trying to develop a program to handle both DOS- and Unix-formatted files. It looks like the only difference between the two at the application level is that the last character of a line in a DOS file is "\r". I can't figure out how to write the comparison.
My input looks like "w,x,y,z", where z can be "" on Unix or "\r" on DOS. The following does not work:
if (z || z == "\r") # check for Unix ($4) and DOS ($4 == "\r").
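(As an aside, a comparison that distinguishes the two cases explicitly might look like this, assuming the \r actually reaches the script; as the answer below explains, it may not:
if (z == "\r") print "dos"; else if (z == "") print "unix"
In awk an empty string is false, so the bare z in z || z == "\r" can never match the Unix case.)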
gawk may not even see the \r characters, as they can be stripped off by underlying primitives. You need to set BINMODE to make sure your script sees them. See https://www.gnu.org/software/gawk/manual/gawk.html#Built_002din-Variables and https://www.gnu.org/software/gawk/manual/gawk.html#PC-Using where it talks about:
Under MS-Windows, gawk (and many other text programs) silently
translates end-of-line ‘\r\n’ to ‘\n’ on input and ‘\n’ to ‘\r\n’ on
output.
So you can do:
awk -v BINMODE=3 '
{ print "is " ( /\r$/ ? "" : "not " ) "dos" }
'
but even with that you can't tell on a line-by-line basis whether a file has DOS line endings, since Windows tools can generate quoted text with linefeeds in the middle. For example, Excel would output
"foo","some\nother\nstuff","bar"\r\n
where the above is a single spreadsheet row whose middle cell contains a couple of linefeeds. On a UNIX platform gawk would read it as 3 separate lines, like this, unless you specifically set RS='\r\n':
"foo","some
other
stuff","bar"\r
So to detect whether your file has DOS line endings you need to search your whole input file until you find \r\n, and even then you don't really KNOW that's what it means: the data may just happen to have a \r at the end of one line.
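(A sketch of such a whole-file scan, again using BINMODE so the \r characters survive; "file" is a placeholder name:
gawk -v BINMODE=3 '/\r$/{found=1; exit} END{print (found ? "has" : "no") " CRLF line endings"}' file
The exit stops reading at the first match, and the END block reports the result.)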
From your comments below I think you're trying to do something that is simply impossible.
Here's a file created on UNIX and using UNIX line endings where the final field is bar\r:
foo,bar\r\n
Here's a file created on Windows and using DOS line endings where the final field is bar:
foo,bar\r\n
As you can see, there's no way to programmatically determine, given just the file foo,bar\r\n, which of the above 2 cases it represents.
$ cat -t carriage
a1^M
a2^M
$ cat -t nocarriage
a1
a2
$ gawk '/\r/' carriage
a1
a2
$ gawk '/\r/' nocarriage
As you can see, with gawk it is straightforward to check whether each line has carriage returns. Writing the octal \015 is a possible alternative to \r.
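For example, the same check with the octal escape gives the same output on the carriage file above:
gawk '/\015/' carriage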

Renaming .txt files with string contained within the file

I have a folder containing .txt files that each contain 1 line of text. The files are named as such:
file_1.txt
file_2.txt
file_3.txt
file_4.txt
In each file, there is a line of text that has the text transcript of a corresponding audio file plus the name of the audio file in brackets, like this:
Some text here (31ca4d158eaef166c37a_18.87_27.76)
For each file, I need to rename it to the string contained within brackets in the file. I have been able to extract the string in brackets using the following code:
awk -F'[()]' '{print $2}' file_*.txt
This gives me a list of all the strings contained within brackets in each .txt file, but I have not been able to use this information to rename all of the files. Any help would be greatly appreciated.
This has been part of a multi-step process, and solutions that have worked for me have mainly been in awk, but I would be happy to hear solutions using other languages, especially if they are built-in to Ubuntu 18.04 (my OS).
for file in file_*.txt; do mv -- "$file" "$(awk -F'[()]' '{print $2}' "$file")"; done
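A couple of hedged variations on that loop (not part of the original answer): use echo for a dry run first, and append .txt if the extension should be kept; the exit makes awk read only the first line of each file.
for file in file_*.txt; do
  new="$(awk -F'[()]' '{print $2; exit}' "$file")"   # bracketed string from the first line
  echo mv -- "$file" "$new.txt"                      # drop the echo once the output looks right
done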

Why does awk print file content when there is no print command?

I have an awk script that reads each word from a file into an array. There is no print command in it, but after I run it, the whole content of the file is printed.
#!/bin/awk -f
{
    for (i = 1; i <= NF; i++)   # walk every field on the line
        used[$i] = 1            # mark the word as seen
}
After I run this awk file like this:
awk 1.awk 2
the whole content of file 2 is printed on the screen, and I am confused.
I tried the same thing directly from the command line and nothing was printed, so I think there is something wrong with the file or with the way I am running it.
You missed the -f option: awk -f 1.awk 2
Without -f, instead of using the contents of "1.awk" as the awk program, you're providing the literal string 1.awk as the program itself.
You have essentially done this: awk '"1.awk"' 2
And since that string is a "true" value, the default action is to print each record of the data contained in file "2".
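You can see the same default-print behavior in its simplest form (a quick illustration, not from the original answer):
awk 1 2
Here the program is just the constant 1, which is always true, so awk applies its default action, print, to every record of file "2". Any always-true pattern with no action behaves the same way.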

How to extract the first column from a tsv file?

I have a file containing some data and I want to use only the first column as a stdin for my script, but I'm having trouble extracting it.
I tried using this
awk -F"\t" '{print $1}' inputs.tsv
but it only shows the first letter of the first column. I tried some other things but it either shows the entire file or just the first letter of the first column.
My file looks something like this:
Harry_Potter 1
Lord_of_the_rings 10
Shameless 23
....
You can use cut which is available on all Unix and Linux systems:
cut -f1 inputs.tsv
You don't need to specify the -d option because tab is the default delimiter. From man cut:
-d delim
Use delim as the field delimiter character instead of the tab character.
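Since the goal is to feed the first column to your script's stdin, you can pipe the output straight in (./your_script is a placeholder for whatever you are running):
cut -f1 inputs.tsv | ./your_script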
As Benjamin has rightly stated, your awk command is indeed correct: the shell passes the two characters \t through unchanged, and awk itself interprets them as a tab, while other commands like cut may not.
I'm not sure why you are getting just the first character as the output.
You may want to take a look at this post:
Difference between single and double quotes in Bash
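For reference, these all set the field separator to a real tab (the last one uses bash's $'...' quoting, which expands \t in the shell itself):
awk -F"\t" '{print $1}' inputs.tsv
awk -F'\t' '{print $1}' inputs.tsv
awk -F$'\t' '{print $1}' inputs.tsv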
Try this (better to rely on a real CSV parser...):
csvcut -t -c 1 file
Check csvkit; the -t flag tells csvcut the input is tab-delimited.
Output :
Harry_Potter
Lord_of_the_rings
Shameless
Note :
As #RomanPerekhrest said, you should fix your broken sample input (we saw spaces where tabs are expected...)

How to get a few lines from a .gz compressed file without uncompressing it

How do I get the first few lines from a gzipped file?
I tried zcat, but it's throwing an error:
zcat CONN.20111109.0057.gz|head
CONN.20111109.0057.gz.Z: A file or directory in the path name does not exist.
zcat(1) can be supplied by either compress(1) or by gzip(1). On your system, it appears to be compress(1) -- it is looking for a file with a .Z extension.
Switch to gzip -cd in place of zcat and your command should work fine:
gzip -cd CONN.20111109.0057.gz | head
Explanation
-c --stdout --to-stdout
Write output on standard output; keep original files unchanged. If there are several input files, the output consists of a sequence of independently compressed members. To obtain better compression, concatenate all input files before compressing them.
-d --decompress --uncompress
Decompress.
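As a side note (not part of the original answer), on GNU systems these spellings are interchangeable:
gzip -cd file.gz | head
gunzip -c file.gz | head
zcat file.gz | head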
On some systems (e.g., Mac), you need to use gzcat.
On a Mac you need to use < with zcat:
zcat < CONN.20111109.0057.gz|head
If a continuous range of lines is needed, one option might be:
gunzip -c file.gz | sed -n '5,10p;11q' > subFile
where lines 5 through 10 (both inclusive) of file.gz are extracted into a new subFile; the 11q makes sed quit at line 11, so the rest of the file isn't processed. For sed options, refer to the manual.
If every, say, 5th line is required:
gunzip -c file.gz | sed -n '1~5p;21q' > subFile
which picks the 1st line, skips the next 4, picks the 6th, and so on, quitting at line 21; adjust or drop the ;21q to take more. Note that the ~ step operator is a GNU sed extension.
If you want to use zcat, this will show the first 10 lines:
zcat your_filename.gz | head
Let's say you want the first 16 lines:
zcat your_filename.gz | head -n 16
This awk snippet will let you show not only the first few lines but any range you specify. It also adds line numbers, which I needed when debugging an error message that pointed to a certain line far down in a gzipped file.
gunzip -c file.gz | awk -v from=10 -v to=20 'NR>=from { print NR,$0; if (NR>=to) exit 1}'
Here is the awk snippet used in the one-liner above. In awk, NR is a built-in variable (the number of records read so far), which is usually equivalent to a line number. The from and to variables are picked up from the command line via the -v options.
NR >= from {
    print NR, $0    # prefix each line with its record number
    if (NR >= to)
        exit 1      # stop reading once the end of the range is reached
}
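Saved to a file, the same snippet can be reused (range.awk is a hypothetical name):
gunzip -c file.gz | awk -v from=10 -v to=20 -f range.awk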