awk group-by on a sub-string of a column - awk

I have the following log file:
/veratt/po/dashboard.do
/veratt/po/dashboardfilter.do?view=R
/veratt/po/leaseagent.do?view=R
/veratt/po/dashboardfilter.do?&=R&=E&propcode=0&display=0&rateType=0&floorplan=&=Display&format=4&action=getReport
/veratt/po/leaseagent.do
/veratt/po/leaseagent.do?view=V
Desired awk output: a count of each HTTP request (minus the request parameters):
/veratt/po/dashboard.do - 1
/veratt/po/leaseagent.do - 3
/veratt/po/dashboardfilter.do - 2
I know the basic awk counting idiom using an array, but its output is quite different from what I need:
awk '{ a[$2]++ } END { for (item in a) print item, a[item] }'

awk -F\? '{ count[$1]++}
END { for (item in count)
printf("%s - %d\n", item, count[item]) }' logfile
-F\?: separate fields on the ? character, so $1 is the request; if there are URL parameters they land in $2, whose existence we ignore. Note: this could also be done with BEGIN { FS="?" }. Note: if FS is more than one character, it is treated as a regex.
{ count[$1]++ }: for each line, tally up the occurrence count of $1.
END: run this block at the end of processing all the inputs
for (item in count): iterate the item variable over the keys in the count array.
printf("%s - %d\n", item, count[item]): formatted printing of the item and its count, separated by a dash with spaces. Note: %d can be replaced by %s; awk is weakly typed.
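One caveat: for (item in count) visits keys in an unspecified order, so output order can vary between awk implementations. If a stable order matters, pipe through sort. A self-contained sketch, with an abridged version of the sample log piped in:

```shell
# Group by the pre-"?" part of each request, then sort for a stable order
# ("for (item in count)" visits keys in an unspecified order).
printf '%s\n' \
  '/veratt/po/dashboard.do' \
  '/veratt/po/dashboardfilter.do?view=R' \
  '/veratt/po/leaseagent.do?view=R' \
  '/veratt/po/dashboardfilter.do?x=1' \
  '/veratt/po/leaseagent.do' \
  '/veratt/po/leaseagent.do?view=V' |
awk -F'?' '{ count[$1]++ }
           END { for (item in count) printf("%s - %d\n", item, count[item]) }' |
sort
```

This prints the three request paths with counts 1, 2, and 3, in sorted order.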

test.txt
/veratt/po/dashboard.do
/veratt/po/dashboardfilter.do?view=R
/veratt/po/leaseagent.do?view=R
/veratt/po/dashboardfilter.do?&=R&=E&propcode=0&display=0&rateType=0&floorplan=&=Display&format=4&action=getReport
/veratt/po/leaseagent.do
/veratt/po/leaseagent.do?view=V
command:
awk 'BEGIN{FS="?"} {a[$1]++} END{for(i in a) print i, a[i]}' test.txt
output:
/veratt/po/leaseagent.do 3
/veratt/po/dashboard.do 1
/veratt/po/dashboardfilter.do 2
explain:
BEGIN{FS="?"} sets ? as the field separator, so $1 will be the substring before the first ?. This block runs only once, before the contents of test.txt are processed.
{a[$1]++} creates an array indexed by that substring and increments the count for each occurrence.
END{for(i in a) print i, a[i]} iterates over the array, printing each index and its corresponding value; the END block runs once after all lines of test.txt have been processed.

Related

Replace case of 2nd column of dataset within awk?

I'm trying the command
awk 'BEGIN{FS=","}NR>1{tolower(substr($2,2))} {print $0}' emp.txt
on the data below, but it is not working.
M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,AJIT,D003,8-Mar-07,8-Sep-07,70000
M004,SHARVARI,D004,28-Mar-07,28-Mar-08,120000
M005,ADITYA,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999
Expected output
M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,Ajit,D003,8-Mar-07,8-Sep-07,70000
M004,Sharvari,D004,28-Mar-07,28-Mar-08,120000
M005,Aditya,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999
With your shown samples, in GNU awk, please try the following code. It uses GNU awk's match() function with the regex (^[^,]*,.)([^,]*)(.*), which creates 3 capturing groups and stores their values in an array named arr (indexed 1, 2, 3, and so on, depending on the number of capturing groups). When the match succeeds, the array elements are printed, with tolower() applied to the 2nd element of arr to get the expected output.
awk '
FNR==1{
  print
  next
}
match($0,/(^[^,]*,.)([^,]*)(.*)/,arr){
  print arr[1] tolower(arr[2]) arr[3]
}
' Input_file
You need to assign the result of tolower() to something; it doesn't operate in place. And in this case, you need to concatenate it with the first character of the field and assign that back to the field:
$2 = substr($2, 1, 1) tolower(substr($2, 2));
To get comma separators in the output file, you need to set OFS. So you need:
BEGIN {OFS=FS=","}
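Putting those two pieces together, the full command might look like this sketch (sample rows piped in here, rather than read from emp.txt, to keep it self-contained):

```shell
# Pass the header through untouched (NR > 1 guard), rebuild field 2 as its
# first character plus the lowercased remainder, and print every row.
# OFS="," keeps comma separators when $2 is reassigned.
printf '%s\n' \
  'M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary' \
  'M003,AJIT,D003,8-Mar-07,8-Sep-07,70000' \
  'M004,SHARVARI,D004,28-Mar-07,28-Mar-08,120000' |
awk 'BEGIN { OFS = FS = "," }
     NR > 1 { $2 = substr($2, 1, 1) tolower(substr($2, 2)) }
     { print }'
```

This yields M003,Ajit,... and M004,Sharvari,... with the header unchanged.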
mawk, gawk, or nawk :
awk 'BEGIN { _+=_^=FS=OFS="," } NR<_ || $_ = substr( toupper($(+_)=\
tolower($_)), --_,_) substr($++_,_)'
M_ID,M_name,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,Ajit,D003,8-Mar-07,8-Sep-07,70000
M004,Sharvari,D004,28-Mar-07,28-Mar-08,120000
M005,Aditya,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999

awk: first, split a line into separate lines; second, use those new lines as a new input

Let's say I have this line:
foo|bar|foobar
I want to split it at | and then use those 3 new lines as the input for further processing (say, replacing bar with xxx).
Sure, I can pipe two awk instances, like this:
echo "foo|bar|foobar" | awk '{gsub(/\|/, "\n"); print}' | awk '/bar/ {gsub(/bar/, "xxx"); print}'
But how can I achieve this in one script? First, do one operation on some input, and then treat the result as the new input for a second operation?
I tried something like this:
echo "foo|bar|foobar" | awk -v c=0 '{
{
gsub(/\|/, "\n");
sprintf("%s", $0);
}
{
if ($0 ~ /bar/) {
c+=1;
gsub(/bar/, "xxx");
print c;
print
}
}
}'
Which results in this:
1
foo
xxx
fooxxx
And thanks to the counter c, it's clear that the subsequent if doesn't treat the multi-line input it receives as several new records, but rather as one multi-line record.
Thus, my question is: how to tell awk to treat this new multi-line record it receives as many single-line records?
The desired output in this very example should be something like this if I'm correct:
1
xxx
2
fooxxx
But this is just an example, the question is more about the mechanics of such a transition.
I would suggest an alternative approach using split(): split the line on the delimiter into an array and iterate over its elements, instead of working on a single multi-line string.
echo "foo|bar|foobar" |\
awk '{
count = 0
n = split($0, arr, "|")
for ( i = 1; i <= n; i++ )
{
if ( arr[i] ~ /bar/ )
{
count += sub(/bar/, "xxx", arr[i])
print count
print arr[i]
}
}
}'
Also, you don't need an explicit increment of the count variable: sub() returns the number of substitutions made on the target string, so you can simply add its return value to count.
As one more simplification, you can drop the ~ match in the if condition and use the sub() call itself as the condition:
if ( sub(/bar/, "xxx", arr[i]) )
{
count++
print count
print arr[i]
}
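Putting that together, a complete sketch of the condensed version:

```shell
# Split on "|", then use sub()'s return value (1 when a substitution was
# made) directly as the condition; count only increments on a real change.
echo 'foo|bar|foobar' |
awk '{
  count = 0
  n = split($0, arr, "|")
  for (i = 1; i <= n; i++)
    if (sub(/bar/, "xxx", arr[i])) {
      count++
      print count
      print arr[i]
    }
}'
```

This prints 1, xxx, 2, fooxxx, matching the desired output.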
If you set the record separator (RS) to the pipe character, you almost get the desired effect, e.g.:
echo 'foo|bar|foobar' | awk -v RS='|' 1
Output:
foo
bar
foobar
(an empty line)
Except that the trailing newline character becomes part of the last record, so there is an extra empty line at the end of the output. You can work around this either by including the newline in the RS variable, making it less portable, or by avoiding sending a trailing newline to awk.
For example using the less portable way:
echo 'foo|bar|foobar' | awk -v RS='\\||\n' '{ sub(/bar/, "baz") } 1'
Output:
foo
baz
foobaz
Note that the empty record at the end is ignored.
With GNU awk:
$ awk -v RS='[|\n]' 'gsub(/bar/,"xxx"){print ++c ORS $0}' file
1
xxx
2
fooxxx
With any awk:
$ awk -F'|' '{c=0; for (i=1;i<=NF;i++) if ( gsub(/bar/,"xxx",$i) ) print ++c ORS $i }' file
1
xxx
2
fooxxx

How to append last column of every other row with the last column of the subsequent row

I'd like to append every other (odd-numbered rows) row with the last column of the subsequent row (even-numbered rows). I've tried several different commands but none seem to do the task I'm trying to achieve.
Raw data:
user|396012_232|program|30720Mn|
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|
|391983_233.batch|batch|30720Mn|5050424K
I'd like to take the last field of each "batch" line and append it to the line above it.
Desired output:
user|396012_232|program|30720Mn|5108656K
|396012_232.batch|batch|30720Mn|5108656K
user|398498_2|program|102400Mn|36426336K
|398498_2.batch|batch|102400Mn|36426336K
user|391983_233|program|30720Mn|5050424K
|391983_233.batch|batch|30720Mn|5050424K
The "batch" lines would then be discarded from the output, so in those lines there is no preference if the line is cut or copied or changed in any way.
Where I got stumped, my attempts to finish the logic were embarrassingly illogical:
awk 'BEGIN{OFS="|"} {FS="|"} {if ($3=="batch") {a=$5} else {} ' file.data
Thanks!
If you do not need to keep the lines with batch in Field 3, you may use
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; $3=="batch" { print prev $5 }' file.data
or
awk 'BEGIN{OFS=FS="|"} NR%2==1 { prev=$0 }; NR%2==0 { print prev $5 }' file.data
See the online awk demo and another demo.
Details
BEGIN{OFS=FS="|"} - sets both the input and output field separators to the pipe character
NR%2==1 { prev=$0 }; - saves the odd lines in prev variable
$3=="batch" - checks if Field 3 is equal to batch (probably, with this logic you may replace it with NR%2==0 to get the even line)
{ print prev $5 } - prints the previous line and Field 5.
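A quick self-contained check of that logic on the first pair of sample lines. Note that it relies on the odd lines already ending in a trailing |, so concatenating prev and $5 yields the merged line directly:

```shell
printf '%s\n' \
  'user|396012_232|program|30720Mn|' \
  '|396012_232.batch|batch|30720Mn|5108656K' |
awk 'BEGIN { OFS = FS = "|" }
     NR % 2 == 1 { prev = $0 }          # remember the odd ("program") line
     $3 == "batch" { print prev $5 }'   # emit it with the batch memory figure
```

This prints user|396012_232|program|30720Mn|5108656K.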
You may consider also a sed option:
sed 'N;s/\x0A.*|\([^|]*\)$/\1/' file.data > newfile
See this demo
Details
N; - appends the next line of input to the pattern space, separated by a newline; if there is no more input, sed exits without processing any more commands
s/\x0A.*|\([^|]*\)$/\1/ - replaces the matched text with the Group 1 contents:
\x0A - newline
.*| - any 0+ chars up to the last | and
\([^|]*\) - (Capturing group 1): any 0+ chars other than |
$ - end of line
If your data is in file 'd', try GNU awk:
awk 'BEGIN{FS="|"} {if(getline n) {if(n~/batch/){b=split(n,a,"|");print $0 a[b]"\n"n} } }' d

Combine awk with sub to print multiple columns

Input:
MARKER POS EA NEA BETA SE N EAF STRAND IMPUTED
1:244953:TTGAC:T 244953 T TTGAC -0.265799 0.291438 4972 0.00133176 + 1
2:569406:G:A 569406 A G -0.17456 0.296652 4972 0.00128021 + 1
Desired output:
1 1:244953:TTGAC:T 0 244953
2 2:569406:G:A 0 569406
Column 1 in output file is first number from first column in input file
Tried:
awk '{gsub(/:.*/,"",$1);print $1,0,$2}' input
But it does not print $2 correctly
Thank you for any help
Your idea is right, but it didn't work because the gsub() call replaced the value of $1 without backing it up first, so any later reference to $1 returns the modified value. Back it up as below. Also, sub() is sufficient here, since only one replacement is needed.
awk 'NR>1{backup=$1; sub(/:.*/,"",backup);print backup,$1,0,$2}' file
Or use the split() function on the first column. The call returns the number of pieces split on the delimiter : and stores the pieces in the array a. We then print the first element and the subsequent columns as needed.
awk 'NR>1{n=split($1, a, ":"); print a[1],$1,"0", $2}' file
From GNU awk documentation under String functions
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string.
Add | column -t to beautify the result, making it more spaced out and readable:
awk 'NR>1{n=split($1, a, ":"); print a[1],$1,"0", $2}' file | column -t
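As a quick illustration of split()'s return value and array contents described above (a sketch using the sample's first column):

```shell
echo '1:244953:TTGAC:T 244953' |
awk '{
  n = split($1, a, ":")   # n = number of pieces; a[1] = text before first ":"
  print n, a[1]
}'
```

This prints 4 1: four pieces, the first of which is 1.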
Could you please try the following and let me know if this helps you?
awk -v s1=" " -F"[: ]" 'FNR>1{print $1 s1 $1 OFS $2 OFS $3 OFS $4 s1 "0" s1 $5}' OFS=":" Input_file

Awk Field number of matched pattern

I was wondering if there's a built-in command in awk to get the field number of the phrase that you just matched.
Banana is yellow.
awk {
/yellow/{ for (i=1;i<=NF;i++) if($i ~/yellow/) print $i}'
Is there a way to avoid writing the loop?
Your command doesn't work when I test it. Here's my version:
echo "banana is yellow" | awk '{for (i=1;i<=NF;i++) if($i ~/yellow/) print i}'
The output is :
3
As far as I know, there's no such built-in feature. To improve your command: the /yellow/ pattern at the beginning is unnecessary, and $i prints the matching field itself rather than the field number you need.
Alternatively, you can use an array that maps each field to its index number, and then print the field number via arr["yellow"].
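That array idea might be sketched as follows. Note this maps whole fields to their positions, so unlike the ~ comparison it only finds exact-word matches (an assumption that fits this sample):

```shell
echo 'banana is yellow' |
awk '{
  for (i = 1; i <= NF; i++) pos[$i] = i   # map each field to its number
  print pos["yellow"]
}'
```

This prints 3.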
If the input is a one-line string, you can set the record separator to the field separator. That way you can use NR to print the position:
awk 'BEGIN{RS=FS}/yellow/{print NR}' <<< 'banana is yellow'
3