awk to remove 5th column from N column with fixed delimiter - awk

I have a file with N columns.
I want to remove the 5th column from the end.
The delimiter is "|".
I tested with a simple example, shown below:
bash-3.2$ echo "1|2|3|4|5|6|7|8" | nawk -F\| '{print $(NF-4)}'
4
Expected result:
1|2|3|5|6|7|8
How should I change my command to get the desired output?

If I understand you correctly, you want to use something like this:
sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/'
This matches a pipe character \| followed by any number of non-pipe characters [^|]*, then captures 4 more of the same pattern ((\|[^|]*){4}). The $ at the end matches the end of the line. The first part of the match (i.e. the fifth field from the end) is dropped.
Testing it out:
$ sed -E 's/\|[^|]*((\|[^|]*){4})$/\1/' <<<"1|2|3|4|5|6|7"
1|2|4|5|6|7
You could achieve the same thing using GNU awk with gensub but I think that sed is the right tool for the job in this case.
If your version of sed doesn't support extended regex syntax with -E, you can modify it slightly:
sed 's/|[^|]*\(\(|[^|]*\)\{4\}\)$/\1/'
In basic mode, pipes are interpreted literally, but parentheses for capture groups and curly braces need to be escaped.
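As a rough sketch, the basic-regex command above can be wrapped so that the position from the end becomes a parameter. The function name drop_from_end_sed and its argument are made up for illustration, and the sketch assumes the delimiter stays | and the position is 2 or more.

```shell
# Hypothetical helper: delete the K-th field from the end of each "|"-delimited line.
# Usage: drop_from_end_sed K < input
drop_from_end_sed() {
    # K-1 fields must remain after the field being removed,
    # so the inner group is repeated K-1 times.
    sed "s/|[^|]*\(\(|[^|]*\)\{$(( $1 - 1 ))\}\)\$/\1/"
}

printf '%s\n' '1|2|3|4|5|6|7|8' | drop_from_end_sed 5
```

With the question's sample line this removes the field 4, which is the fifth one counted from the end.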

AWK is your friend:
Sample Input
A|B|C|D|E|F|G|H|I
A|B|C|D|E|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|D|O|R|Q|U|I
A|B|C|D|E|F|G|H|I|E|O|Q
A|B|C|D|E|F|G|H|I|X
A|B|C|D|E|F|G|H|I|J|K|L
Script
awk 'BEGIN{FS="|";OFS="|"}
{$(NF-5)="";sub(/\|\|/,"|");print}' file
Sample Output
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
What we did here
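As you are aware, awk has special variables that store each field of the record, ranging from $1, $2 up to $(NF).
Excluding a column counted from the end is as simple as:
Emptying the column, i.e. $(NF-5)="". Note that $(NF-5) is the 6th column from the end; use $(NF-4) for the 5th.
Removing the consecutive || left behind by the previous step, i.e. sub(/\|\|/,"|"). Since sub replaces only the first || on the line, this assumes that no earlier field is already empty.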
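A variant that sidesteps the || cleanup altogether is to rebuild the record while skipping the unwanted field. This is only a sketch: the function name drop_from_end and its argument k are hypothetical, and k counts from the end starting at 1 (so k=5 corresponds to $(NF-4)).

```shell
# Hypothetical helper: print each "|"-delimited line without its k-th field from the end.
# Usage: drop_from_end K < input
drop_from_end() {
    awk -F'|' -v k="$1" '{
        skip = NF - k + 1            # index of the field to leave out
        out = ""; sep = ""
        for (i = 1; i <= NF; i++) {
            if (i == skip) continue  # drop exactly this one field
            out = out sep $i
            sep = FS                 # join the remaining fields with "|"
        }
        print out
    }'
}

printf '%s\n' '1|2|3|4|5|6|7|8' | drop_from_end 5
```

Because the line is reassembled field by field, fields that were already empty in the input are left alone, and lines with fewer than k fields pass through unchanged.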

Another alternative, using @sjsam's input file:
$ rev file | cut -d'|' --complement -f6 | rev
A|B|C|E|F|G|H|I
A|B|C|D|F|G|H|I|A
A|B|C|D|E|F|G|H|I|F|E|O|R|Q|U|I
A|B|C|D|E|F|H|I|E|O|Q
A|B|C|D|F|G|H|I|X
A|B|C|D|E|F|H|I|J|K|L
Not sure whether you want the 5th from the last or the 6th, but it's easy to adjust.

Thanks for the help and guidance.
Below is what I tested:
bash-3.2$ echo "1|2|3|4|5|6|7|8|9" | nawk 'BEGIN{FS="|";OFS="|"} {$(NF-4)="!";print}' | sed 's/|!//'
Output: 1|2|3|4|6|7|8|9
I also tested it on the file extracted from my system, and it worked fine.

Related

Git URL - Pull out substring via Shell (awk & sed)?

I have got the following URL:
https://xcg5847#git.rz.bankenit.de/scm/smat/sma-mes-test.git
I need to pull out sma-mes-test and smat:
git config --local remote.origin.url|sed -n 's#.*/\([^.]*\)\.git#\1#p'
sma-mes-test
This works, but I also need the project name, which is smat.
I am not really familiar with complex regex and sed; I found the command above in another post here. Does anyone know how I can extract the smat value as well?
With your shown samples, please try the following awk code. Simple explanation: set the field separators to / and .git for all lines, then in the main program print the 3rd-last and 2nd-last fields of the line.
your_git_command | awk -F'/|\\.git' '{print $(NF-2),$(NF-1)}'
Your sed is pretty close. You can just extend it to capture 2 values and print them:
git config --local remote.origin.url |
sed -E 's~.*/([^/]+)/([^.]+)\.git$~\1 \2~'
smat sma-mes-test
If you want to populate shell variables with these 2 values, use this read command in bash:
read v1 v2 < <(git config --local remote.origin.url |
sed -E 's~.*/([^/]+)/([^.]+)\.git$~\1 \2~')
# check variable values
declare -p v1 v2
declare -- v1="smat"
declare -- v2="sma-mes-test"
Using sed
$ sed -E 's#.*/([^/]*)/#\1 #' input_file
smat sma-mes-test.git
I would use GNU AWK for this task in the following way. Let the content of file.txt be
https://xcg5847#git.rz.bankenit.de/scm/smat/sma-mes-test.git
then
awk 'BEGIN{FS="/"}{sub(/\.git$/,"",$NF);print $(NF-1),$NF}' file.txt
gives output
smat sma-mes-test
Explanation: I tell GNU AWK that the field separator is the slash character, then I remove .git (note that . is escaped to mean a literal dot) at the end ($) of the last field ($NF), then I print the 2nd field from the end ($(NF-1)) and the last field ($NF), separated by a space, which is the default output field separator. If you wish to use another character for that purpose, set OFS (output field separator) in BEGIN. If you want to know more about NF, read 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
(tested in gawk 4.2.1)
Why not sed 's!.*/\(.*/.*\)!\1!'?
string=$(git config --local remote.origin.url | tail -c 22)
var1=$(echo "${string}" | cut -d'/' -f1)
var2=$(echo "${string}" | cut -d'/' -f2 | sed 's#\.git##')
If you have multiple URLs with variable lengths, this will not work, but if you only have the one, it will.
var1=smat
var2=sma-mes-test
If I did have something variable, personally I would replace all of the forward slashes with newlines, write them to a file, and then extract the last and second-to-last lines with ed, which would give me the last two segments of the URL.
Regular expressions literally give me a migraine, but as long as I can get everything on its own line, I can quite easily bypass the need for them entirely.
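For a single URL held in a variable, another regex-free route is plain POSIX parameter expansion. The following is a minimal sketch; the function name split_git_url and the variable names are hypothetical, and it assumes the URL ends in /<project>/<repo>.git.

```shell
# Hypothetical helper: print "<project> <repo>" for a URL ending in /<project>/<repo>.git
split_git_url() {
    url=$1
    repo=${url##*/}        # drop everything up to the last slash -> sma-mes-test.git
    repo=${repo%.git}      # drop the .git suffix                 -> sma-mes-test
    parent=${url%/*}       # drop the last path segment
    project=${parent##*/}  # keep the segment before it           -> smat
    printf '%s %s\n' "$project" "$repo"
}

split_git_url 'https://xcg5847#git.rz.bankenit.de/scm/smat/sma-mes-test.git'
```

In practice the argument would be something like "$(git config --local remote.origin.url)", and the output can be fed to read in the same way as the sed variant.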

How to use awk to count the occurrences of a word beginning with something?

I have a file that looks like this:
FID IID
1 RQ50131-0
2 469314
3 469704
4 469712
5 RQ50135-2
6 469720
7 470145
I want to use awk to count the occurrences of IDs beginning with 'RQ' in column 2.
So for this little snapshot, it should be 2. The numbers after the RQ differ, so I want a count of anything that begins with RQ.
I am using this code
awk -F '\t' '{if(match("^RQ$",$2))print}'|wc -l ID.txt > RQ.txt
But I don't get an output.
Tabs are used as field delimiters by default (same as spaces), so you can omit -F '\t'.
You can use
awk '$2 ~ /^RQ/{cnt++} END{print cnt}' ID.txt > RQ.txt
Whenever field 2 starts with RQ, cnt is incremented; once the whole file has been processed, cnt is printed.
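As a small sketch of the same counting idea with the prefix passed in as a parameter: the function name count_prefix is made up, index() is used instead of a regex so the prefix is taken literally, and n + 0 forces a 0 to be printed when nothing matches.

```shell
# Hypothetical helper: count lines whose 2nd field starts with the given prefix.
# Usage: count_prefix PREFIX [FILE...]   (reads stdin when no file is given)
count_prefix() {
    prefix=$1
    shift
    awk -v p="$prefix" 'index($2, p) == 1 { n++ } END { print n + 0 }' "$@"
}

printf '%s\n' 'FID IID' '1 RQ50131-0' '2 469314' '5 RQ50135-2' | count_prefix RQ
```

On the real data this would be called as count_prefix RQ ID.txt > RQ.txt.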
You did
{if(match("^RQ$",$2))print}
but the compulsory arguments to the match function are string, regexp, in that order. Also, do not use $ if you are interested in finding strings that start with something, as $ denotes the end. After fixing these issues, the code would be
{if(match($2,"^RQ"))print}
Disclaimer: this answer solely describes fixing the problems in your current code; it does not contain any ways to improve it.
Also, apart from the reversed parameters for match, the file ID.txt should come right after the closing single quote.
As you want to print the whole line, you can omit the if statement and the print statement, because match returns the index at which the substring begins, or 0 if there is no match, and awk prints the line by default when the pattern is true.
awk 'match($2,"^RQ")' ID.txt | wc -l > RQ.txt

Convert data format using awk?

There is a file which contains data in a 'n*1' format:
1
2
3
4
5
6
Is there any way to convert it to a 'n*3' format like:
1,2,3
4,5,6
via awk rather than using a for loop?
I really have no idea about this. Any help or keyword is appreciated.
Using awk
$ awk '{printf "%s%s",$0,(NR%3==0?ORS:",")}' File
1,2,3
4,5,6
The command printf "%s%s",$0,(NR%3==0?ORS:",") tells awk to print two strings. The first is $0 which is the current line. The second string is NR%3==0?ORS:"," which is either ORS the output record separator (if the line number is a multiple of three) or else , for all other line numbers.
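The same idea can be sketched with the number of columns as a parameter. The function name to_rows is hypothetical; unlike the one-liner above, this version prints the separator before each value, so a final incomplete row still ends with a newline and has no trailing comma.

```shell
# Hypothetical helper: join every N input lines into one comma-separated row.
# Usage: to_rows N < input
to_rows() {
    awk -v n="$1" '
        { printf "%s%s", sep, $0              # separator first, then the value
          sep = (NR % n == 0 ? ORS : ",") }   # newline after every n-th line, comma otherwise
        END { if (NR) printf "%s", ORS }      # terminate the last row
    '
}

printf '%s\n' 1 2 3 4 5 6 | to_rows 3
```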
Using sed
$ sed 'N;N;s/\n/,/g' File
1,2,3
4,5,6
By default, sed reads in each line from the file one by one. N tells sed to read in another line, appending the line to the current one, separated by a newline. N;N tells sed to do that twice so that we have a total of three lines in the pattern space. s/\n/,/g tells sed to replace those two separator newlines with commas. The result is then printed.
The above assumes that we are using GNU sed. With minor modifications, this can be made to work with BSD/OSX sed.
The simplest one is the paste command:
paste -d, - - - <file
The output:
1,2,3
4,5,6
The following may also help:
xargs -n3 < Input_file | sed 's/ /,/g'
Try this:
awk 'NR%3==0{print;next}{printf "%s,",$0}' file
or decomposed:
NR%3==0 # condition: line number modulo 3 == 0
{print;next} # then print the line followed by a newline and skip to the next line
{printf "%s,",$0} # otherwise printf the current line plus a comma, without a newline
$ awk '{ORS=(NR%3?",":RS)}1' file
1,2,3
4,5,6

Need help AWK script

Could you let me know how to print the "user.%" string in the text below with awk?
The value of 'user' is not fixed, and the number of strings in '( )' is not fixed.
start user1.table% NOT (%OLD, %2016%) user.% another strings
UPDATE
It is the basis of SQL processing. $2 means schema.table, but here the user can use '%' and also exclude with the NOT keyword. It ends with ')'. The next one is a second schema.table, and that is the one I want to catch.
I think I should parse the string after ')' with a regular expression, but I failed.
Regular expression:
[)]\s+(\S+)
I guess the above expression can be used to catch that string.
How can I apply it in an awk script (not a one-liner)?
If the structure of the query stays the same, you can use this:
awk -F'[).]' '{print $3".%"}'
I'm using the closing parenthesis or the literal dot as the delimiter. Doing so the value of interest is in field 3.
While this is simple, it leaves some whitespace in front of user. We can enhance the field delimiter regex to fix this:
awk -F')[[:space:]]*|[.]' '{print $3".%"}'
Btw, you may use this sed command alternatively:
sed 's/.*)[[:space:]]*\([^.]*\).*/\1.%/'
or if you have GNU grep, use this:
grep -oP '\)\s*\K[^%]*%'
Try this (GNU awk):
awk '{match($0, /[)] +([^ ]+)/, var);print var[1];}'
You need to call match first; the three-argument form, which fills an array with the capture groups, is a GNU awk extension.
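For awk implementations without the array argument, roughly the same extraction can be sketched with the standard two-argument match and the RSTART/RLENGTH variables it sets. The function name after_paren is hypothetical, and the sketch assumes the wanted token is the first whitespace-delimited word after the first ")" that is followed by whitespace.

```shell
# Hypothetical helper: print the first word that follows a closing parenthesis.
# Usage: after_paren < input
after_paren() {
    awk '{
        # ")" followed by blanks and then a run of non-blank characters
        if (match($0, /\)[ \t]+[^ \t]+/)) {
            tok = substr($0, RSTART + 1, RLENGTH - 1)  # matched text without the ")"
            sub(/^[ \t]+/, "", tok)                    # trim the leading blanks
            print tok
        }
    }'
}

echo 'start user1.table% NOT (%OLD, %2016%) user.% another strings' | after_paren
```

Written over several lines like this, the body can also be saved to a file and run with awk -f, which may be closer to the "script, not a one-liner" that was asked for.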
Given your posted sample input, all you need is:
awk '{print $6}'
e.g.:
$ echo 'start user1.table% NOT (%OLD, %2016%) user.% another strings' |
awk '{print $6}'
user.%
If that doesn't work for you, then your posted sample input isn't representative enough of your real input. In that case, edit your question to include a few lines of truly representative sample input and the expected output for that input.

Replace chars after column X

Let's say my data looks like this
iqwertyuiop
and I want to replace all the letters i after column 3 with a Z, so my output would look like this
iqwertyuZop
How can I do this with sed or awk?
It's not clear what you mean by "column" but maybe this is what you want using GNU awk for gensub():
$ echo iqwertyuiop | awk '{print substr($0,1,3) gensub(/i/,"Z","g",substr($0,4))}'
iqwertyuZop
Perl is handy for this: you can assign to a substring
$ echo "iiiiii" | perl -pe 'substr($_,3) =~ s/i/Z/g'
iiiZZZ
This would totally be ideal for the tr command, if only you didn't have the requirement that the first 3 characters remain untouched.
However, if you are okay using some bash tricks plus cut and paste, you can split the file into two parts and paste them back together afterwards:
paste -d'\0' <(cut -c-3 foo) <(cut -c4- foo | tr i Z)
The above uses paste to rejoin together the two parts of the file that get split with cut. The second section is piped through tr to translate i's to Z's.
(1) Here's a short-and-simple way to accomplish the task using GNU sed:
sed -r -e ':a;s/^(...)([^i]*)i/\1\2Z/g;ta'
This entails looping (t), and so would not be as efficient as non-looping approaches. The above can also be written using escaped parentheses instead of unescaped ones, so there is no real need for the -r option. Other implementations of sed should (in principle) be up to the task as well, but YMMV.
(2) It's easy enough to use "old awk" as well:
awk '{s=substr($0,4);gsub(/i/,"Z",s); print substr($0,1,3) s}'
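The "old awk" command above can be sketched as a reusable function with the column and the characters as parameters. The name replace_after and its three arguments are made up for illustration; note that the second argument is treated by gsub as a regular expression and that & in the third argument has its usual special meaning.

```shell
# Hypothetical helper: after the first N characters of each line, replace FROM with TO.
# Usage: replace_after N FROM TO < input
replace_after() {
    awk -v n="$1" -v from="$2" -v to="$3" '{
        head = substr($0, 1, n)     # first n characters, left untouched
        tail = substr($0, n + 1)    # everything after column n
        gsub(from, to, tail)        # replace only within the tail
        print head tail
    }'
}

echo iqwertyuiop | replace_after 3 i Z
```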
The most intuitive way would be to use awk:
awk 'BEGIN{FS="";OFS=FS}{for(i=4;i<=NF;i++){if($i=="i"){$i="Z"}}}1' file
FS="" splits the input string into one field per character (this works in GNU awk and some other implementations, but is not specified by POSIX). We iterate through the characters/fields from 4 to the end and replace i with Z.
The final 1 evaluates to true and makes awk print the modified input line.
With sed it does not look very intuitive, but it is still possible:
sed -r '
h # Backup the current line in hold buffer
s/.{3}// # Remove the first three characters
s/i/Z/g # Replace all i by Z
G # Append the contents of the hold buffer to the pattern buffer (this adds a newline between them)
s/(.*)\n(.{3}).*/\2\1/ # Remove that newline ^^^ and assemble the result
' file