Parse a URL with the command line only - awk

I have a csv file looking like this:
id,author,url
1,bob,http://mywebsite.com/path/to/content
2,john,https://anotherwebsite.com/path/to/some/other/content
3,alice,http://www.somewebsite.com/path/to/content
And I'd like to turn it into:
id,author,url
1,bob,mywebsite.com
2,john,anotherwebsite.com
3,alice,somewebsite.com
I know this could be done easily with javascript or python but I am trying to understand how awk and sed work. Is there a way to do this easily with command line tools only?
Many thanks

This should do:
awk -F, 'NR>1{split($3,a,"/");$0=$1","$2","a[3]}1' file
id,author,url
1,bob,mywebsite.com
2,john,anotherwebsite.com
3,alice,www.somewebsite.com
Split the line on , (-F,).
Then, for all lines except the first (NR>1), split field $3 on / and rebuild the line from $1, $2 and the host part a[3].
The trailing 1 prints every line.
To also remove a leading www.:
awk -F, 'NR>1{split($3,a,"/");sub(/^www\./,"",a[3]);$0=$1","$2","a[3]}1'
id,author,url
1,bob,mywebsite.com
2,john,anotherwebsite.com
3,alice,somewebsite.com
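Since the question mentions sed as well, a roughly equivalent sed one-liner (a sketch, assuming a sed with -E such as GNU or BSD sed, and that only the URL field contains http/https) would be:
sed -E 's|https?://(www\.)?([^/,]+).*|\2|' file
Everything before the URL is left alone and the URL itself is replaced by just the host, giving the same output as the awk version above.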

Awk - How to remove a pattern with brackets

I made a mistake while editing an INI file (yes, we're in 2022 :D) and added a section with errors:
I added a line [End[edit=true]
How can I remove this entire line using awk (I don't have any other choice 😕)?
I don't understand how to escape the [ in the awk command line.
Could you please help me?
Thanks
I don't understand how to escape the [ in the awk command line.
If the line is always the literal [End[edit=true] then you do not need to escape anything; just select the lines which are not that one, as follows. Let file.ini content be
[someline=true]
[End[edit=true]
[another=true]
then
awk '$0!="[End[edit=true]"' file.ini
gives output
[someline=true]
[another=true]
Explanation: $0 denotes the whole line; if it is not [End[edit=true] then it is printed.
(tested in GNU Awk 5.0.1)
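Another way to avoid escaping entirely is awk's index() function, which does a literal substring search instead of a regex match; a small sketch, in case the line might not match exactly but always contains that text:
awk 'index($0,"[End[edit=true]")==0' file.ini
index() returns 0 when the literal string is not found, so only lines that do not contain it are printed.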
A couple ideas where you escape the leading (left) brackets:
awk '/\[End\[edit=true]/ {next} 1' file
# or
awk '!/\[End\[edit=true]/' file
Once you've confirmed the results, and assuming you're using GNU awk, you can add -i inplace to update the file:
awk -i inplace '!/\[End\[edit=true]/' file
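Since -i inplace overwrites the file, it may be worth keeping a backup. If I read the gawk manual correctly, gawk 5.0+ can do that itself via inplace::suffix (older versions used INPLACE_SUFFIX); otherwise just copy the file first:
awk -i inplace -v inplace::suffix=.bak '!/\[End\[edit=true]/' file
# or, version-independent:
cp file file.bak && awk -i inplace '!/\[End\[edit=true]/' file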

How to print multiple files in awk?

What is wrong with this command, please? I would like to print all lines from file01, file02, file03 ... file11.
awk '{print}' file[01-11].txt > file
Assuming you are running this in bash, the [01-11] part is not in the correct format: in a glob, [01-11] is a character class that matches a single character (0 or 1), not the range 01 to 11. Instead, consider the following:
awk '{print}' file{01..11}.txt > file
Again, this assumes a specific shell. If you are running this awk command in a shell that does not support the {##..##} notation, test how file[01-11].txt expands first -- it is probably not expanding to the files you think.
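For example (assuming file01.txt through file11.txt exist), you can compare the two expansions before running awk:
echo file[01-11].txt    # glob: [01-11] is a single-character class (0 or 1), so it does not expand to file01.txt ... file11.txt
echo file{01..11}.txt   # brace expansion: file01.txt file02.txt ... file11.txt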
How about using cat itself for this (since you are only printing and not doing any other operation):
cat Input_file{01..11}.txt > file
In case you really want to do it only in awk, then try:
awk '1' Input_file{01..11}.txt > file
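If your shell does not support {01..11} brace expansion (e.g. plain sh), one possible workaround is to generate the names with GNU seq (assuming the file names contain no whitespace):
awk '1' $(seq -f 'Input_file%02g.txt' 1 11) > file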

For gawk, how to set FS and RS in the same command as an awk script?

I have an awk command that returns the duplicates in an input stream with
awk '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}'
However, I want to change the field separator characters and record separator characters before I do that. The command I use for that is
FS='\n' RS='\n\n'
Yet I'm having trouble making that happen. Is there a way to effectively combine these two commands into one? Piping one to the other doesn't seem to work either.
The action of a BEGIN rule is executed before any input is read, so you can set FS and RS there:
awk 'BEGIN{FS="\n";RS="\n\n"}{a[$0]++}END{for (i in a)if (a[i]>1)print i;}'
or you can specify them using command line options like:
awk -F '\n' -v RS='\n\n' '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}'
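As a quick sanity check, here is a hypothetical input where records are separated by blank lines and fields by newlines (data.txt is just an example name):
printf 'a\nb\n\na\nb\n\nc\n' > data.txt
awk -F '\n' -v RS='\n\n' '{a[$0]++}END{for (i in a)if (a[i]>1)print i;}' data.txt
This should print the duplicated record (the two lines a and b). Note that a multi-character RS is a gawk extension, which is fine here since the question is about gawk.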

How do I delete all the lines in a file ending with a colon using awk or sed?

I have a file with some tabular data in it, but it also has some text lines (ending with a colon) in between the data. I want to remove those text lines and keep only my data.
Try this:
awk '{gsub(/:$/,""); print}' file.txt
This matches only a colon at the end of the line, not colons that appear in the middle. Note that it removes the trailing colon but keeps the line; to delete the whole line, use the variant below.
or as JID commented:
awk '!/:$/' file
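A quick way to see the difference between the two, with a hypothetical sample.txt:
printf 'some header:\n1 2 3\n4 5 6\n' > sample.txt
awk '{gsub(/:$/,""); print}' sample.txt   # keeps the line, only the trailing colon is removed
awk '!/:$/' sample.txt                    # drops the "some header:" line entirely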
you could use grep - something like
grep -v ":$" file
or sed - like
sed "/:$/d" file

Simplest scripting language for working with CSVs

I like using Python because of its easy-to-learn syntax; however, I recently learned it has no UTF-8 support when it comes to CSVs. As I often use CSVs, this seems like a serious problem for me. Is there another scripting language with a simple syntax that I can learn when I need to manage really large UTF-8 CSV files?
If you're working on the command line and can install another command-line tool, I'd strongly recommend csvfix.
Once installed you can robustly query any csv file e.g.
csvfix order -f 1,3 file.csv
will extract the 1st and 3rd columns of a csv.
There is a full list of commands here
See this related question
I'd recommend using gawk. E.g.:
awk -F ";" '{print $1 ";" $2}' FILE.csv
would print the first two (;-separated) columns of FILE.csv. To make it work properly with UTF-8, you should pay attention to the locale:
LC_ALL=C awk 'BEGIN {print length("árvíztűrőtükörkúrópék")}'
=> 30
LC_ALL=en_US.utf8 awk 'BEGIN {print length("árvíztűrőtükörkúrópék")}'
=> 21
(Or you can set LC_ALL globally if you're using UTF-8 all the time, and you're on *nix, e.g. in .bashrc, export LC_ALL=en_US.utf8.)
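One more gawk-specific tip (assuming it fits your data): if the CSV has quoted fields containing commas, a plain -F ',' will mis-split them; gawk 4.0+ offers FPAT, which describes what a field looks like instead of what separates fields:
gawk -v FPAT='([^,]*)|("[^"]*")' '{print $2}' FILE.csv
The pattern is a simplified version of the one in the gawk manual and does not handle escaped quotes inside fields.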
awk is an old but really powerful and fast tool.
HTH