I've run into an issue where I'm trying to understand awk for a class. We are supposed to take the table full of names and some other information and divide each field using "," to make it easier to export to .csv. So far what I have removes all extra characters including the initial "," tied to the first field. I'm down to 2 last issues with my script. The first is adding the "," to divide each field. I know this seems basic, but I'm having a hard time wrapping my head around it. The second is that occasionally $2 is followed by an extra initial standing in for a middle name. I have no idea how to incorporate another field to every other line that does not have an initial.
The table is the following:
+---------------------------------+------------+------+----------+
| Name | NumCourses | Year | Semester |
+---------------------------------+------------+------+----------+
| ABDULHADI, ASHRAF M | 2 | 1990 | 3 |
| ACHANTA, BALA | 2 | 1995 | 3 |
| ACHANTA, BALA | 2 | 1996 | 3 |
+---------------------------------+------------+------+----------+
My Code:
awk 'NR==3, N==6{gsub(","," "); gsub(/\|/, " "); gsub(/\+/," "); gsub(/\-/," "); print $1, $2, $3, $4, $5, $6}' awktest.txt
Output:
ABDULHADI ASHRAF M 2 1990 3
ACHANTA BALA 2 1995 3
ACHANTA BALA 2 1996 3
P.S. It should be noted that we were instructed to rip out the headers.
Expected Output:
ABDULHADI,ASHRAF,M,2,1990,3
ACHANTA,BALA,N/A,,2,1995,3
ACHANTA,BALA,N/A,2,1996,3
Your approach to first remove punctuation characters is good. Keeping it, you could write:
awk -v OFS="," '{gsub(","," ");gsub(/\|/," ")}{$1=$1}NF==5{$2=$2",N/A"}NF>4' awktest.txt
Let's unwind it and understand what is happening:
awk -v OFS="," ' #Output field separator is set to comma
{
gsub(","," ") #Substitute any comma by space
gsub(/\|/," ") #Substitute any pipe by space
}
{$1=$1} #Force line rebuild so that OFS is used to separate fields outputed
NF==5{$2=$2",N/A"} #If there are only 5 fields, middle-name is missing, so append ",N/A" to 2nd field
NF>4 #Print resulting lines that have at least 5 fields (this gets rid of headers)
' awktest.txt
Output:
ABDULHADI,ASHRAF,M,2,1990,3
ACHANTA,BALA,N/A,2,1995,3
ACHANTA,BALA,N/A,2,1996,3
Feel free to request further clarification if you need it.
Related
I'm trying to analyze a file with the following structure:
AAAAA
123
456
789
AAAAA
555
777
999
777
The idea is to detect the 'AAAAA' pattern and extract the two following lines. After this is done, I would like to append the next 'AAAAA' pattern and the following two lines, so th final file will look something like this:
AAAAA
123
456
AAAA
555
777
Taking into account that the last one will not end with the 'AAAAA' pattern.
Any idea about how this can be done ? I've use sed but I don't know how to select the number of lines to be retained after the pattern...
Fo example with AWK:
awk '/'$AAAAA'/,/'$AAAAA'/' INPUTFILE.txt
Bu this will only extract all the text between the two AAAAA
Thanks
With sed
sed -n '/AAAAA/{N;N;p}' file.txt
with smart counters
$ awk '/AAAAA/{n=3} n&&n--' file
AAAAA
123
456
AAAAA
555
777
The grep command has a flag that prints lines after each match. For example:
grep AAAAA --after 2 <file>
Unless I misunderstood, this should match your requirements, and is much simpler than awk scripts.
You may try this awk:
awk '$1 == "AAAAA" {n = NR+2} NR <= n' file
AAAAA
123
456
AAAAA
555
777
just cheat
mawk/mawk2/gawk 'BEGIN { FS = OFS = "AAAAA\n"; RS = "^$";
} END { for(x=2; x<= NF; x++) { print $(x) } }'
no one says the fields must be split by spaces, and rows must be new-lines one-by-one. By design of FS, every field after $1 will contain the matches you need, and fitting multiple "rows" of each within $2 etc.
In this example, in $2 you will find 12 bytes, like this :
1 2 3 \n 4 5 6 \n 7 8 9 \n # spaced out for readability
Input:
dog
fish
elephant
...
Output:
dog |
fish |
elephant|
... |
I want to add a "|" on the 9th character of every row
You should first space pad the lines to the max line width (eg: 8 chars as you say).
Then, you can use
sed 's/./&|/9' <padded.txt >output.txt
Hard-coding the output field width:
$ awk '{printf "%-*s|\n",8,$0}' file
dog |
fish |
elephant|
... |
or specifying the output field width as an argument:
$ awk -v wid=8 '{printf "%-*s|\n",wid,$0}' file
dog |
fish |
elephant|
... |
or dynamically determining the output field width from the input field widths:
$ awk 'NR==FNR{lgth=length($0); wid=(lgth > wid ? lgth : wid); next} {printf "%-*s|\n",wid,$0}' file file
dog |
fish |
elephant|
... |
If you need to further process the records, it might be a good idea to actually make the $0 9 chars wide:
$ awk '{$0=$0 sprintf("%" 9-length() "s","|")}1' file
Output:
dog |
fish |
elephant|
... |
Currently i am using a awk script to compare 2 files having random numbers in non sequential order.
It works perfect , but there is just one future condition i would like to fulfill.
Current awk function
awk '
{
$0=$0+0
}
FNR==NR{
a[$0]
next
}
($0 in a){
b[$0]
next
}
{ print }
END{
for(j in a){
if(!(j in b)){ print j }
}
}
' compare1.txt compare2.txt
What the the function accomplishes currently ?
It outputs list of all the numbers which are present in compare1 but not in compare 2 and vice versa
If any number has zero in its prefix, ignore zeros while comparing ( basically the absolute value of number must be different to be treated as a mismatch ) Example - 3 should be considered matching with 003 and 014 should be considered matching with 14, 008 with 8 etc
As required It also considers a number matched even if they are not necessarily on the same line in both files
Required additional condition
In its current form , this functions works in such a way that if a file has multiple occurances of a number and other file has even one occurance of that same number , it considers the number matched for both repetitions.
I need the awk function to be edited to output any additional occurrence of a number
cat compare1.txt
57
11
13
3
889
014
91
775
cat compare2.txt
003
889
13
14
57
12
90
775
775
Expected output
12
90
11
91
**775**
The number marked here at end is currently not being shown in output in my present awk function ( 2 occurances - 1 occurrence )
As mentioned at https://stackoverflow.com/a/62499047/1745001, this is the job that comm exists to do:
$ comm -3 <(awk '{print $0+0}' compare1.txt | sort) <(awk '{print $0+0}' compare2.txt | sort)
11
12
775
90
91
and to get rid of the white space:
$ comm -3 <(awk '{print $0+0}' compare1.txt | sort) <(awk '{print $0+0}' compare2.txt | sort) |
awk '{print $1}'
11
12
775
90
91
you just need to count the occurrences and account for it in matching...
$ awk '{k=$0+0}
NR==FNR {a[k]++; next}
!(k in a && a[k]-->0);
END {for(k in a) while(a[k]-->0) print k}' file1 file2
12
90
775
11
91
note that as in your original script there is no absolute value comparison, which you can add easily by just changing k in the first line.
I have below data named atp.csv file
Date_Time,M_ID,N_ID,Status,Desc,AMount,Type
2015-01-05 00:00:00 076,1941321748,BD9010423590206,200,Transaction Successful,2000,PRETOP
2015-01-05 00:00:00 077,1941323504,BD9010423590207,351,Transaction Successful,5000,PRETOP
2015-01-05 00:00:00 078,1941321743,BD9010423590205,200,Transaction Successful,1500,PRETOP
2015-01-05 00:00:00 391,1941323498,BD9010500000003,200,Transaction Successful,1000,PRETOP
i want to count status wise using below command.
cat atp.csv|awk -F',' '{print $4}'|sort|uniq -c
The output is like below:
3 200
1 351
But i want to like below output and also want to sum the amount column in status wise.
200,3,4500
351,1,5000
That is status is first and then count value.Please help..
AWK has associative arrays.
% cat atp.csv | awk -F, 'NR>1 {n[$4]+=1;s[$4]+=$6;} END {for (k in n) { print k "," n[k] "," s[k]; }}' | sort
200,3,4500
351,1,5000
In the above:
The first line (record) is skipped with NR>1.
n[k] is the number of occurrences of key k (so we add 1), and s[k] is the running sum values in field 6 (so we add $6).
Finally, after all records are processed (END), you can iterate over associated arrays by key (for (k in n) { ... }) and print the keys and values in arrays n and s associated with the key.
You can try this awk version also
awk -F',' '{print $4,",", a[$4]+=$6}' FileName | sort -r | uniq -cw 6 | sort -r
Output :
3 200 , 4500
1 351 , 5000
Another Way:
awk -F',' '{print $4,",", a[$4]+=$6}' FileName | sort -r | uniq -cw 6 |sort -r | sed 's/\([^ ]\+\).\([^ ]\+\).../\2,\1,/'
All in (g)awk
awk -F, 'NR>1{a[$4]++;b[$4]+=$6}
END{n=asorti(a,c);for(i=1;i<=n;i++)print c[i]","a[c[i]]","b[c[i]]}' file
I would like to calculate percentage of value in each line out of all lines and add it as another column.
Input (delimiter is \t):
1 10
2 10
3 20
4 40
Desired output with added third column showing calculated percentage based on values in second column:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00
I have tried to do it myself, but when I calculated total for all lines I didn't know how to preserve rest of line unchanged. Thanks a lot for help!
Here you go, one pass step awk solution -
awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] awk 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Update: If tab is a required in output then just set the OFS variable to "\t".
[jaypal:~/Temp] awk -v OFS="\t" 'NR==FNR{a = a + $2;next} {c = ($2/a)*100;print $1,$2,c }' file file
1 10 12.5
2 10 12.5
3 20 25
4 40 50
Breakout of pattern {action} statements:
The first pattern is NR==FNR. FNR is awk's in-built variable that keeps track of number of records (by default separated by a new line) in a given file. So FNR in our case would be 4. NR is similar to FNR but it does not get reset to 0. It continues to grow on. So NR in our case would be 8.
This pattern will be true only for the first 4 records and thats exactly what we want. After perusing through the 4 records, we are assign the total to a variable a. Notice that we did not initialize it. In awk we don't have to. However, this would break if entire column 2 is 0. So you can handle it by putting an if statement in the second action statement i.e do the division only if a > 0 else say division by 0 or something.
next is needed cause we don't really want second pattern {action} statement to execute. next tells awk to stop further actions and move to the next record.
Once the four records are parsed, the next pattern{action} begins, which is pretty straight forward. Doing the percentage and print column 1 and 2 along with percentage next to them.
Note: As #lhf mentioned in the comment, this one-liner will only work as long as you have the data set in a file. It won't work if you pass data through a pipe.
In the comments, there is a discussion going on ways to make this awk one-liner take input from a pipe instead of a file. Well the only way I could think of was to store the column values in array and then using for loop to spit each value out along with their percentage.
Now arrays in awk are associative and are never in order, i.e pulling the values out of arrays will not be in the same order as they went in. So if that is ok then the following one-liner should work.
[jaypal:~/Temp] cat file
1 10
2 10
3 20
4 40
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}'
2 10 12.5
3 20 25
4 40 50
1 10 12.5
To get them in order, you can pipe the result to sort.
[jaypal:~/Temp] cat file | awk '{b[$1]=$2;sum=sum+$2} END{for (i in b) print i,b[i],(b[i]/sum)*100}' | sort -n
1 10 12.5
2 10 12.5
3 20 25
4 40 50
You can do it in a couple of passes
#!/bin/bash
total=$(awk '{total=total+$2}END{print total}' file)
awk -v total=$total '{ printf ("%s\t%s\t%.2f\n", $1, $2, ($2/total)*100)}' file
You need to escape it as %%. For instance:
printf("%s\t%s\t%s%%\n", $1, $2, $3)
Perhaps there is better way but I would pass file twice.
Content of 'infile':
1 10
2 10
3 20
4 40
Content of 'script.awk':
BEGIN {
## Tab as field separator.
FS = "\t";
}
## First pass of input file. Get total from second field.
ARGIND == 1 {
total += $2;
next;
}
## Second pass of input file. Print each original line and percentage as third field.
{
printf( "%s\t%2.2f\n", $0, $2 * 100 / total );
}
Run the script in my linux box:
gawk -f script.awk infile infile
And result:
1 10 12.50
2 10 12.50
3 20 25.00
4 40 50.00