Simplest scripting language for working with CSVs

I like using Python because of its easy-to-learn syntax; however, I recently learned that it has no support for UTF-8 when working with CSVs. As I often use CSVs, this is a serious problem for me. Is there another scripting language with a simple syntax that I can learn, for when I need to manage really large UTF-8 CSV files?

If you're working on the command line and can install another command-line tool, I'd strongly recommend csvfix.
Once installed, you can robustly query any CSV file, e.g.
csvfix order -f 1,3 file.csv
will extract the 1st and 3rd columns of a CSV file.
There is a full list of commands in the csvfix documentation.

I'd recommend using gawk. E.g.:
awk -F ";" '{print $1 ";" $2}' FILE.csv
would print the first two (;-separated) columns of FILE.csv. To work properly with UTF-8, make sure awk runs under a UTF-8 locale; compare:
LC_ALL=C awk 'BEGIN {print length("árvíztűrőtükörkúrópék")}'
=> 30
LC_ALL=en_US.utf8 awk 'BEGIN {print length("árvíztűrőtükörkúrópék")}'
=> 21
(Or you can set LC_ALL globally if you use UTF-8 all the time and you're on *nix, e.g. in .bashrc: export LC_ALL=en_US.utf8.)
awk is an old but really powerful and fast tool.
HTH


Why AWK program's FS variable can be specified with -F flag of gawk (or other awk) interpreter/command?
Let me explain: AWK is a programming language, and gawk is one of many interpreters for AWK. gawk interprets/executes/runs the AWK program given to it. So why can the FS (field separator) variable be specified with gawk's -F flag? I find it kind of unnatural... and how does it technically do that?
My best guess as to "why" is convenience: FS is probably the most used/manipulated awk variable, so having a short option to set it is helpful.
Consider
awk -F, '...' file.csv
# vs
awk 'BEGIN {FS=","} ...' file.csv
"How does it technically do that" -- see https://git.savannah.gnu.org/cgit/gawk.git/tree/main.c#n1586
Historically -F was implemented in gawk v1.01 so it would have existed in whatever legacy awk that gawk was based on.
Additionally, the POSIX specification mandates -F.
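As a concrete illustration (the file name /tmp/demo.csv is made up), all three of these set FS to a comma and print the second field:

```shell
printf 'a,b,c\n' > /tmp/demo.csv

awk -F, '{print $2}' /tmp/demo.csv             # short option
awk -v FS=, '{print $2}' /tmp/demo.csv         # variable assignment on the command line
awk 'BEGIN {FS=","} {print $2}' /tmp/demo.csv  # assignment inside the program
# each prints: b
```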
So why can the FS (field separator) variable be specified with gawk's -F flag?
The awk man page says:
Command line variable assignment is most useful for dynamically
assigning values to the variables AWK uses to control how input is
broken into fields and records. It is also useful for controlling
state if multiple passes are needed over a single data file.
So -F comes in handy when the field separator is not etched in stone but computed dynamically, since -F lets you use a shell variable easily. Imagine you were tasked with developing part of a bash script that should output the last field of each line of file.txt, using the character stored in the variable sep as the separator. You could do that the following way:
awk -F "${sep}" '{print $NF}' file.txt
find it kind of unnatural
That depends on what you have used before. A cut user who wants to get the 3rd column from a CSV file might do it the following way:
cut -d , -f 3 file.csv
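The man page's point about multiple passes can also be sketched: an assignment like pass=1 placed between file arguments takes effect when awk reaches it in the argument list, so the same file can be read twice with different state (the file name /tmp/nums is made up):

```shell
printf '3\n1\n2\n' > /tmp/nums

# first pass sums the values, second pass prints each value over the total
awk 'pass==1 {sum+=$1}
     pass==2 {printf "%d/%d\n", $1, sum}' pass=1 /tmp/nums pass=2 /tmp/nums
# prints: 3/6, 1/6, 2/6 (one per line)
```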

Understanding A Particular Line of Nawk Command

I am going through shell scripting online lessons, as my work requires me to learn shell scripting.
I came across the "awk" and "nawk" commands, and my learning hasn't yet reached that far.
In a nutshell, I know that awk/nawk search for a particular pattern and perform an action when a match is found.
Even so, I couldn't understand what the following line is meant for:
eval $( cat ${MMORPHYDIR}/${PWFILE} | nawk ' /^#BEGIN '${ENV_NAME}'/,/^#END '${ENV_NAME}'/ { print }' | egrep "${USEFULL_PARAM}" )
Please help me to understand what this line does or is intended to do.
... awk '/start/,/end/'
prints the records between the start and end patterns. {print} can be omitted since it's implied. The ^ in your script anchors the match to the beginning of a line.
Note that cat and eval are unnecessary; the same can be written as
$ awk '...' "${MMORPHYDIR}/${PWFILE}"
also grep can be included in the awk script as well.
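For instance, with a made-up file in the same #BEGIN/#END format and ENV_NAME hardcoded to DEV, both the range selection and the grep-style filter fit in one awk call:

```shell
cat > /tmp/pwfile <<'EOF'
#BEGIN DEV
HOST=dev.example.com
PORT=1234
#END DEV
#BEGIN PROD
HOST=prod.example.com
#END PROD
EOF

# select the DEV block, then keep only the HOST= lines -- no cat or egrep needed
awk '/^#BEGIN DEV/,/^#END DEV/ { if (/^HOST=/) print }' /tmp/pwfile
# prints: HOST=dev.example.com
```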
awk is THE standard, general purpose tool for manipulating text on all UNIX-like systems. There are various flavors of awk, all with the same core functionality plus some differences. nawk is the very unfortunately named new awk, so named about 30 years ago as the successor to 1977's old awk (e.g. /bin/awk on Solaris, aka old, broken awk, which must never be used), and is now actually best avoided as it doesn't even support the minimal awk functionality required by POSIX (e.g. character classes). Important lesson there: never use the word "new" in the name of any software entity!
The best awk to use these days is GNU awk, gawk, as it supports all POSIX functionality plus a ton of useful extensions, is generally available, is extremely well documented, and has a massive user base.
wrt:
eval $( cat ${MMORPHYDIR}/${PWFILE} | nawk ' /^#BEGIN '${ENV_NAME}'/,/^#END '${ENV_NAME}'/ { print }' | egrep "${USEFULL_PARAM}" )
That is a complete mess, doing literally about a dozen things that should never be done in shell or in awk. Trying to explain it would be like trying to explain someone mixing concrete with a screwdriver. Forget you ever saw it and move on.
To learn awk, read the book Effective Awk Programming, 4th Edition, by Arnold Robbins.

Retain backslashes with while read loop in multiple shells

I have the following code:
#!/bin/sh
while read line; do
printf "%s\n" $line
done < input.txt
input.txt has the following lines:
one\two
eight\nine
The output is as follows:
onetwo
eightnine
The "standard" solution to retain the backslashes would be to use read -r.
However, I have the following limitations:
must run under #!/bin/sh for reasons of portability/POSIX compliance
not all systems will support the -r switch to read under sh
The input file format cannot be changed
Therefore, I am looking for another way to retain the backslash after reading in the line. I have come up with one working solution, which is to use sed to replace the \ with some other value (e.g. ||), writing into a temporary file (thus bypassing my last requirement above); then, after reading the lines in, use sed again to transform them back. Like so:
#!/bin/sh
sed -e 's/\\/||/g' input.txt > tempfile.txt
while read line; do
printf "%s\n" "$line" | sed -e 's/||/\\/g'
done < tempfile.txt
I'm thinking there has to be a more "graceful" way of doing this.
Some ideas:
1) Use command substitution to store this into a variable instead of a file. Problem: I'm not sure command substitution will be portable here either, and my attempts at using a variable instead of a file were unsuccessful. Regardless, file or variable, the base solution is really the same (two substitutions).
2) Use IFS somehow? I've investigated a little, but I'm not sure that can help with this issue.
3) ???
What are some better ways to handle this given my constraints?
Thanks
Your constraints seem a little strict. Here's a piece of code I jotted down (I'm not too sure how valuable your while loop is for the other stuff you would like to do, so I removed it, just for ease). I don't guarantee this code to be robust, but the logic should give you hints about the direction you may wish to proceed in. (temp.dat is the input file)
#!/bin/sh
var1="$(cut -d\\ -f1 temp.dat)"
var2="$(cut -d\\ -f2 temp.dat)"
iter=1
set -- $var2
for x in $var1; do
    if [ "$iter" -eq 1 ]; then
        echo $x "\\" $1
    else
        echo $x "\\" $2
    fi
    iter=$((iter+1))
done
As Larry Wall once said, writing a portable shell is easier than writing a portable shell script.
perl -lne 'print $_' input.txt
The simplest possible Perl script is simpler still, but I imagine you'll want to do something with $_ before printing it.

parse a url with the command line only

I have a csv file looking like this:
id,author,url
1,bob,http://mywebsite.com/path/to/content
2,john,https://anotherwebsite.com/path/to/some/other/content
3,alice,http://www.somewebsite.com/path/to/content
And I'd like to turn it into:
id,author,url
1,bob,mywebsite.com
2,john,anotherwebsite.com
3,alice,somewebsite.com
I know this could be done easily with javascript or python but I am trying to understand how awk and sed work. Is there a way to do this easily with command line tools only?
Many thanks
This should do:
awk -F, 'NR>1{split($3,a,"/");$0=$1","$2","a[3]}1' file
id,author,url
1,bob,mywebsite.com
2,john,anotherwebsite.com
3,alice,www.somewebsite.com
Split the line using ,
Then, for all except the first line (NR>1), split field $3 on / and recreate the line.
The trailing 1 prints all lines.
To also remove the leading www.:
awk -F, 'NR>1{split($3,a,"/");sub(/^www\./,"",a[3]);$0=$1","$2","a[3]}1' file
id,author,url
1,bob,mywebsite.com
2,john,anotherwebsite.com
3,alice,somewebsite.com
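Since the question asks about sed too, here is a rough sed equivalent (GNU/BSD sed with -E; it assumes the URL is the third and final field, and it leaves the header alone because that line contains no URL):

```shell
# keep id and author, capture the host up to the first slash, drop a leading www.
sed -E 's#^([^,]*,[^,]*,)https?://(www\.)?([^/]*).*#\1\3#' file
```

The first group keeps the id and author fields, the optional (www\.)? group discards a leading www., and ([^/]*) captures the host name.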

Awk unable to store the value into an array

I am using a script like below
SCRIPT
declare -a GET
i=1
awk -F "=" '$1 {d[$1]++;} {for (c in d) {GET[i]=d[c];i=i+1}}' session
echo ${GET[1]} ${GET[2]}
DESCRIPTION
The problem is the GET value printed outside is not the correct value ...
I understand your question as "how can I use the results of my awk script inside the shell where awk was called". The truth is that it isn't really trivial. You wouldn't expect to be able to use the output from a C program or python script easily inside your shell. Same with awk, which is a scripting language of its own.
There are some workarounds. For a robust solution, write your results from the awk script to a file in a suitably easy format and read them back from the shell. As a hack, you could also try to read the output from awk directly into the shell using $(). Combine that with the set command and you could do:
set -- $(awk <awk script here>)
and then use $1, $2, etc., but you have to be careful with spaces in the output from awk.
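A minimal sketch of that hack, with a made-up session file; the sort is only there to make the output order deterministic, since awk's for (c in d) iteration order is unspecified:

```shell
cat > /tmp/session <<'EOF'
user=alice
user=bob
host=example.com
EOF

# count occurrences of each key, then load "key count" pairs into $1, $2, ...
set -- $(awk -F= 'NF==2 {d[$1]++} END {for (k in d) print k, d[k]}' /tmp/session | sort)
echo "$1=$2 $3=$4"
# prints: host=1 user=2
```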