Print Distinct Values from a Field with AWK

I'm looking for a way to print the distinct values in a field while in the command-prompt environment using AWK.
ID Title Promotion_ID Flag
12 Purse 7 Y
24 Wallet 7 Y
709 iPhone 1117 Y
74 Satchel 7 Y
283 Xbox 84 N
Ideally I'd like to return the promotion_ids: 7, 1117, 84.
I've researched the question on Google and have found some examples such as:
`cut -f 3 | uniq filename.ext` (returned error)
`awk cut -f 3 | uniq filename.ext` (returned error)
`awk cut -d, -f3 filename.ext | sort | uniq` (returned error)
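Those attempts all fail for the same reason: cut is not an awk subcommand, and the filename ends up attached to uniq rather than to cut. A working version of that idea (a sketch, assuming single-space-separated fields in a file named filename.ext) would be:
tail -n +2 filename.ext | cut -d' ' -f3 | sort -n | uniq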

awk 'NR>1{a[$3]++} END{for(b in a) print b}' file
Output:
7
84
1117
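One caveat with the answer above: for (b in a) visits keys in an unspecified order, so if you want the IDs sorted numerically, pipe the result through sort:
awk 'NR>1{a[$3]++} END{for(b in a) print b}' file | sort -n
7
84
1117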

Solution 1: A simple awk may help. (The following skips the header of Input_file.)
awk 'FNR>1 && !a[$3]++{print $3}' Input_file
Solution 2: In case you need to keep the header of the Input_file, the following may help.
awk 'FNR==1{print;next} !a[$3]++{print $3}' Input_file
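In both solutions, the !a[$3]++ idiom does the real work: the first time a given $3 appears, a[$3] is 0, so the negation is true and the action runs; the post-increment makes every later occurrence false. Running Solution 1 on the sample above would print the IDs in order of first appearance:
awk 'FNR>1 && !a[$3]++{print $3}' Input_file
7
1117
84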

With a pipeline:
$ sed 1d file | # remove header
tr -s ' ' '\t' | # normalize space delimiters to tabs
cut -f3 | # isolate the field
sort -nu # sort numerically and report unique entries
7
84
1117

[root@test ~]# cat test
ID Title Promotion_ID Flag
12 Purse 7 Y
24 Wallet 7 Y
709 iPhone 1117 Y
74 Satchel 7 Y
283 Xbox 84 N
Output:
[root@test ~]# awk -F" " '!s[$3]++' test
ID Title Promotion_ID Flag
12 Purse 7 Y
709 iPhone 1117 Y
283 Xbox 84 N
[root@test ~]#

mawk '!__[$!NF=$--NF]--^(!_<NR)'
or
gawk '!__[$!--NF=$NF]--^(!_<NR)'
or perhaps
gawk '!__[$!--NF=$NF]++^(NF<NR)'
or even
mawk '!__[$!--NF=$NF]++^(NR-!_)' # mawk-only
gawk '!__[$!--NF=$--NF]--^(NR-NF)' # gawk-equiv of similar idea
7
1117
84
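These are heavily golfed variants of the same de-duplication idea. A de-golfed equivalent (a sketch, assuming whitespace-separated input with a header row) is simply:
awk 'NR > 1 && !seen[$3]++ { print $3 }' file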

Related

Pad AWK columns

Trying to parse BSD top output to show only PID - COMMAND - MEM for a specific process:
$ top -l 1 | grep -E '%CPU\ |coreaudio'
PID COMMAND %CPU TIME #TH #WQ #PORTS MEM PURG
354 com.apple.audio. 0.0 00:00.00 2 1 12 820K 0B
296 com.apple.audio. 0.0 00:00.03 2 1 38 2024K 0B
282 coreaudiod 0.0 03:25.05 94 2 736 21M 0B
Using awk to show only columns $1 and $2:
$ top -l 1 | grep -E '%CPU\ |coreaudio' | awk {'print $1" -- "$2'}
PID -- COMMAND
354 -- com.apple.audio.
296 -- com.apple.audio.
282 -- coreaudiod
Adding a third column messes with the 'columns' since the second column is not being padded:
$ top -l 1 | grep -E '%CPU\ |coreaudio' | awk {'print $1" -- "$2" -- "$8'}
PID -- COMMAND -- MEM
354 -- com.apple.audio. -- 820K
296 -- com.apple.audio. -- 2024K
282 -- coreaudiod -- 21M
How would I 'pad' the 'column' to keep the 'layout' intact? Or should I use a different tool like sed?
Note: I'm using top -l 1 since I'm on a Mac.
You can pad the strings with some "constant" amount of spaces:
<<<$input awk '{printf "%-20s %-10s %-4s\n", $1, $2, $8}'
# ^^ ^^ ^ field width
# ^ ^ ^ left justify
You can use column:
<<<$input awk '{print $1, $2, $8}' | column -t
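For the sample rows above, column -t measures the widest entry in each column and pads accordingly, so the layout survives variable-length commands. The expected output (a sketch, assuming the header line is part of $input):
PID  COMMAND           MEM
354  com.apple.audio.  820K
296  com.apple.audio.  2024K
282  coreaudiod        21M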

Count, group by with sed or awk

I want to perform two different sort-and-count operations on a file, based on each line's content.
I need to take the first column of a .tsv file.
For each line that starts with three digits, I would like to group by (and keep only) those first three digits; for everything else, just sort and count occurrences of the whole value in the first column.
Sample data:
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe fkdjald
890897 34213
6878853 834
32fasd 53891
abcdee 8794371
abd 873
result:
687 2
890 3
01a 1
1b 1
32fasd 1
abd 1
dfeqfe 1
abcdee 2
I would also appreciate a solution that takes into account a sample input like
687/878 9
890987 4
01a 55
1b 8743917
890a 34
abcdee 987
dfeqfe 545
890897 34213
6878853 834
(632)fasd 53891
(88)abcdee 8794371
abd 873
So the first column may have values containing (, ), #, ', all kinds of characters.
The output will have two columns: the first with the extracted values, and the second with the new count for the values extracted from the source file.
Again, the preferred output format is TSV.
So I need to extract all values that start with ^\d\d\d, and for those, sort and count unique values by their first three digits; then, in a second pass, do the same for each line that does not start with three digits, but this time keep the whole column value and sort/count by it.
What I have tried:
| sort | uniq -c | sort -nr for the lines that do start with ^\d\d\d, and the same for those that do not match the regex. But is there a more elegant way using either sed or awk?
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ cnt[/^[0-9]{3}/ ? substr($1,1,3) : $1]++ }
END {
    for (key in cnt) {
        print (key !~ /^[0-9]{3}/), cnt[key], key, cnt[key]
    }
}
$ awk -f tst.awk file | sort -k1,2n | cut -f3-
687 1
890 2
abcdee 1
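The two extra leading fields in tst.awk exist only to drive the sort: field 1 puts the three-digit groups (0) before everything else (1), field 2 orders by count, and cut -f3- strips the decoration afterwards. Running the awk step alone shows the intermediate, decorated rows (a sketch; the for (key in cnt) order may vary):
$ awk -f tst.awk file
0	1	687	1
0	2	890	2
1	1	abcdee	1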
You can try Perl
$ cat nefijaka.txt
687 878 9
890987 4
890a 34
abcdee 987
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt
687 1
890 2
abcdee 1
$
You can pipe it to sort and get the values sorted:
$ perl -lne ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt | sort -k2 -nr
890 2
abcdee 1
687 1
EDIT1:
$ cat nefijaka.txt2
687 878 9
890987 4
890a 34
abcdee 987
a word and then 23
$ perl -lne ' /^(\d{3})|(.+?\t)/; $x=$1?$1:$2; $x=~s/\t//g; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt2
687 1
890 2
a word and then 1
abcdee 1
$

Changing the field separator of awk to newline

The -F option lets you specify the field separator for awk, but using '\n' as the field separator doesn't work; that is, it doesn't make $1 the first line of the input, $2 the second line, and so on.
I suspect that this is because awk looks for the field separator within each line. Is there a way to get around this with awk, or some other Linux command? Basically, I want to separate my input by newline characters and put them into an Excel file.
I'm still warming up to Linux and shell scripts, which is the reason for my lack of creativity with this problem.
Thank you!
You may need to override the input record separator (RS), whose default is a newline.
See my example below:
$ cat test.txt
a
b
c
d
$ awk 'BEGIN{ RS = "" ; FS = "\n" }{print $1,$2,$3,$4}' test.txt
a b c d
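One caveat: RS="" is awk's paragraph mode, so a blank line in test.txt would split the input into multiple records. With GNU awk or mawk, which accept a multi-character RS, you can slurp the whole file instead by using a record separator that can never match (a sketch):
$ gawk 'BEGIN{ RS = "^$" ; FS = "\n" }{print $1,$2,$3,$4}' test.txt
a b c d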
Note that you can change both the input and output record separator so you can do something like this to achieve a similar result to the accepted answer.
$ cat test.txt
a
b
c
d
$ awk -v ORS=" " '{print $1}' test.txt
a b c d
One can simplify it to just the following, with a minor caveat: an extra trailing space and no trailing newline:
% echo "a\nb\nc\nd"
a
b
c
d
% echo "a\nb\nc\nd" | mawk 8 ORS=' '
a b c d %
To rectify that, plus handle the edge case of no trailing newline in the input, one can modify it to:
% echo -n "a\nb\nc\nd" | mawk 'NF-=_==$NF' FS='\n' RS='^$' | odview
0000000 543301729 174334051
a b c d \n
141 040 142 040 143 040 144 012
a sp b sp c sp d nl
97 32 98 32 99 32 100 10
61 20 62 20 63 20 64 0a
0000010
% echo "a\nb\nc\nd" | mawk 'NF -= (_==$NF)' FS='\n' RS='^$' | odview
0000000 543301729 174334051
a b c d \n
141 040 142 040 143 040 144 012
a sp b sp c sp d nl
97 32 98 32 99 32 100 10
61 20 62 20 63 20 64 0a
0000010

Rearrange columns using awk or cut

I have a large file with 1000 columns. I want to rearrange it so that the last column becomes the 3rd column. For this I have used:
cut -f1-2,1000,3- file > out.txt
But this does not change the order.
Could anyone help using cut or awk?
Also, I want to rearrange columns 10 and 11 as shown below:
Example:
1 10 11 2 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Try this awk one-liner:
awk '{$3=$NF OFS $3;$NF=""}7' file
This moves the last column to the 3rd column; if you have 1000 columns, it moves the 1000th column.
EDIT
if the file is tab-delimited, you could try:
awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7' file
EDIT2
add an example:
kent$ seq 20|paste -s -d'\t'
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$NF OFS $3;$NF=""}7'
1 2 20 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
EDIT3
You didn't give any input example, so I assume you don't have empty columns in the original file (no consecutive tabs):
kent$ seq 20|paste -s -d'\t'|awk -F'\t' -v OFS="\t" '{$3=$10 FS $11 FS $3;$10=$11="";gsub(/\t+/,"\t")}7'
1 2 10 11 3 4 5 6 7 8 9 12 13 14 15 16 17 18 19 20
Alternatively, we could print the fields in a loop, as sketched below.
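A sketch of that loop idea, assuming tab-delimited input and that the goal is the 1 2 10 11 3 ... ordering shown above:
awk 'BEGIN{FS=OFS="\t"} {
    printf "%s", $1 OFS $2 OFS $10 OFS $11        # the requested fields first
    for (i = 3; i <= NF; i++)                     # then the rest, in original order
        if (i != 10 && i != 11) printf "%s", OFS $i
    print ""                                      # end the record with a newline
}' file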
I THINK what you want is:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; sub(OFS "[^" OFS "]*$","")}1' file
This might also work for you depending on your awk version:
awk 'BEGIN{FS=OFS="\t"} {$3=$NF OFS $3; NF--}1' file
Without the part after the semi-colon you'll have trailing tabs in your output.
Since many people are searching for this, and even the best awk solution is not really pretty or easy to use, I wanted to post my solution (mycut), written in Python:
#!/usr/bin/env python3
import sys
from signal import signal, SIGPIPE, SIG_DFL

# Exit quietly if the reader of our output goes away (e.g. when piped to head)
signal(SIGPIPE, SIG_DFL)

# Example usage: cat file | mycut 3 2 1
# Note: column indices are 0-based here, unlike cut's 1-based fields.
columns = [int(x) for x in sys.argv[1:]]
delimiter = "\t"

for line in sys.stdin:
    parts = line.rstrip("\n").split(delimiter)
    print("\t".join([parts[col] for col in columns]))
I'm thinking about adding the other features of cut, like changing the delimiter, and a feature to use a * to print the remaining columns. But then it would need its own page.
A shell wrapper function for awk that uses simpler syntax:
# Usage: rearrange int_n [int_o int_p ... ] < file
rearrange ()
{
    unset n
    n="{ print "
    while [ "$1" ]; do
        n="$n\$$1\" \" "
        shift
    done
    n="$n }"
    awk "$n" | grep '\w'
}
Examples...
echo foo bar baz | rearrange 2 3 1
bar baz foo
Using bash brace expansion, rearrange first and last 5 items in descending order:
echo {1..1000}a | tr '\n' ' ' | rearrange {1000..995} {5..1}
1000a 999a 998a 997a 996a 995a 5a 4a 3a 2a 1a
Sorted 3-letter shells in /bin:
ls -lLSr /bin/?sh | rearrange 5 9
150792 /bin/csh
154072 /bin/ash
771552 /bin/zsh
1554072 /bin/ksh

How do I print a range of data in awk?

I am reviewing my access_logs with a statement like:
cat access_log | grep 16/Sep/2012:17 | awk '{print $12 $13 $14 $15 $16}' | sort | uniq -c | sort -n | tail -40
The purpose is to see the user agent of anyone that has been hitting my server for the last hour, sorted by number of hits. My server has unusual activity, so I want to stop any unwanted spiders, etc.
But I would much prefer the part awk '{print $12 $13 $14 $15 $16}' to be something like awk '{print $12-through-end-of-line}', so that I could see the whole user agent, which is a different length for each one.
Is there a way to do this with awk?
Not extremely elegant, but this works:
grep 16/Sep/2012:17 access_log | awk '{for (i=12;i<=NF;++i) printf "%s ",$i;print ""}'
It has the side effect of condensing multiple spaces between fields down to one, and putting an extra one at the end of the line, though, which probably isn't critical.
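If you want to preserve the original spacing within the user-agent string, an alternative (a sketch, assuming single-space-separated fields and an awk with POSIX interval expressions, such as gawk) is to delete the first eleven fields with sub() instead of rebuilding the line:
grep 16/Sep/2012:17 access_log | awk '{sub(/^([^ ]+ ){11}/, ""); print}'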
I've never found one; in situations like this, I use cut (assuming I don't need awk's flexible handling of field separation):
# Assuming tab-separated fields, cut's default
grep 16/Sep/2012:17 access_log | cut -f12- | sort | uniq -c | sort -n | tail -40
# For space-separated fields (single spaces, not arbitrary amounts of whitespace)
grep 16/Sep/2012:17 access_log | cut -d' ' -f12- | sort | uniq -c | sort -n | tail -40
(Clarification: I've never found a good way. I've used @twalberg's for-loop when necessary, but prefer using cut if possible.)
$ echo somefields:; cat somefields ; echo from-to.awk: ; \
cat from-to.awk ; echo ;awk -f from-to.awk somefields
somefields:
a b c d e f g h i j k l m n o p q r s t u v w x y z
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
from-to.awk:
{ for (i=12; i<=NF; i++) { printf "%s ", $i }; print "" }
l m n o p q r s t u v w x y z
12 13 14 15 16 17 18 19 20 21
from man awk:
NF The number of fields in the current input record.
So you basically loop through fields (separated by spaces) from 12 to the last one.
Why not
#!/bin/bash
awk "/$1/"'{for (i=12;i<=NF;i++) printf("%s ", $i) ;printf "\n"}' log | sort | uniq -c | sort -n | tail -40
in a script file.
Then you can call it like
myMonitor.sh 16/Sep/2012:17
I don't have a way to test this right now. Apologies for any formatting/syntax errors.
Hopefully you get the idea.
IHTH
awk '/16\/Sep\/2012:17/{for(i=1;i<12;i++){$i="";}print}' access_log | sort | uniq -c | sort -n | tail -40