I need the initials of people who have three proper names. I have a column; in this case it's:
awk -F ';' '{print $1, $2}' users.txt
output:
xxx JoaoPedroVilar
xxa JoaoMiguel
RMF RitaPereira
....
My question is: I think I need a count function, to count, in column $2 only, which names are made up of more than two names, because I only want the output to contain the acronyms and the names with more than two uppercase letters, like this:
xxx JoaoPedroVilar
RAT RicardoAntonioPereira
Sample data:
awk -F ';' '{print $1, $2}' users.txt
Output:
xxx NunoAndréFerreira
xxx HugoFernandes
xxx HugoGomes
xxx InêsSilva
xxx JoãoTeixeira
xxx JoaquimGonçalves
JAR JoaquimRibeiro
xxx JoséPedroRafael
xxx JoséSoares
xxx LuisFernandes
xxx MiguelMadeira
xxx NunoAndréFerreira
xxx PedroLucasFarinha
The answer is:
awk -F';' -b '$2~/[A-Z]{1}.*[A-Z]{1}.*[A-Z]{1}.*/{print $1, $2}' users.txt
The -b option (--characters-as-bytes) just makes gawk treat all input data as single-byte characters.
This regex can probably be refactored, but it does the trick (I think)
awk -F';' '$2~/[A-Z]{1}.*[A-Z]{1}.*[A-Z]{1}.*/{print $1, $2}' users.txt
That's just matching 3 single uppercase characters in your second column. Note that this is going to have false positives if you have names like ScottMcMasters or BobO'Neal, but trying to separate names that aren't already separated is never 100%.
Example:
cat users.txt
xxx;JoaoPedroVilar
xxx;PedroAndrePereira
RAT;RicardoAntonioPereira
xxx;BobBob
xxx;SomeName
awk -F";" '$2~/[A-Z]{1}.*[A-Z].*[A-Z].*/' users.txt
xxx;JoaoPedroVilar
xxx;PedroAndrePereira
RAT;RicardoAntonioPereira
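And, as noted above, a name with internal capitals (hypothetical example) slips through:

```shell
# "ScottMcMasters" has 3 uppercase letters (S, M, M) even though it is
# only two names, so the 3-uppercase check still matches it
printf 'xxx;ScottMcMasters\n' | awk -F';' '$2~/[A-Z].*[A-Z].*[A-Z]/'
```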
Without testable input/output it's a guess but it sounds like you need:
awk -F';' '$2 ~ /([[:upper:]][^[:upper:]]+){2}[[:upper:]]/{print $1, $2}' file
e.g. using the output you posted elsewhere to create sample input:
$ cat file
xxx;NunoAndréFerreira
xxx;HugoFernandes
xxx;HugoGomes
xxx;InêsSilva
xxx;JoãoTeixeira
xxx;JoaquimGonçalves
JAR;JoaquimRibeiro
xxx;JoséPedroRafael
xxx;JoséSoares
xxx;LuisFernandes
xxx;MiguelMadeira
xxx;NunoAndréFerreira
xxx;PedroLucasFarinha
$ awk -F';' '$2 ~ /([[:upper:]][^[:upper:]]+){2}[[:upper:]]/{print $1, $2}' file
xxx NunoAndréFerreira
xxx JoséPedroRafael
xxx NunoAndréFerreira
xxx PedroLucasFarinha
If that doesn't work for you, then try setting your locale to C first (and next try setting it to whatever locale understands the accented characters in your files as letters), as you're probably just having a locale problem:
LC_ALL=C awk -F';' '$2 ~ /([[:upper:]][^[:upper:]]+){2}[[:upper:]]/{print $1, $2}' file
Another approach is counting the uppercase characters directly; gsub() returns the number of substitutions it made:
$ awk -F';' 'gsub(/[A-Z]/,"&",$2)>2 {print $1,$2}' users.txt
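For example, on a small sample in the question's format (file name assumed), only names with at least 3 uppercase letters pass the test:

```shell
# Sample input, semicolon-separated as in the question
cat > users.txt <<'EOF'
xxx;HugoFernandes
xxx;JoaoPedroVilar
RAT;RicardoAntonioPereira
EOF

# gsub() replaces each uppercase letter with itself and returns the
# number of replacements, i.e. the count of uppercase letters in $2
awk -F';' 'gsub(/[A-Z]/,"&",$2)>2 {print $1,$2}' users.txt
```

This prints JoaoPedroVilar and RicardoAntonioPereira (3 uppercase letters each) and skips HugoFernandes (2).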
with:
awk -F";" '$2~/[A-Z]{1}.*[A-Z].*[A-Z].*/' users.txt
Output:
PLF PedroLucasFarinha
but when I print all the names I have, for example:
awk -F ';' '{print $1, $2}' users.txt
Output:
xxx NunoAndréFerreira
xxx HugoFernandes
xxx HugoGomes
xxx InêsSilva
xxx JoãoTeixeira
xxx JoaquimGonçalves
JAR JoaquimRibeiro
xxx JoséPedroRafael
xxx JoséSoares
xxx LuisFernandes
xxx MiguelMadeira
xxx NunoAndréFerreira
xxx PedroLucasFarinha
but with:
awk -F';' -b '$2~/[A-Z]{1}.*[A-Z]{1}.*[A-Z]{1}.*/{print $1, $2}' users.txt
the output is:
xxx JoséPedroRafael
xxx NunoAndréFerreira
xxx PedroLucasFarinha
Related
I'm trying to extract the 5th element in $1, after the -, up to the space or \\. If a / were used instead, the script awk -F'[-/]' 'NR==0{print; next} {print $0"\t""\t"$5}' file would work as expected. Thank you :).
file --tab-delimited--
00-0000-L-F-Male \\path\to xxx xxx
00-0001-L-F-Female \\path\to xxx xxx
desired output (the last field is preceded by two tabs):
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
awk
awk -F'-[[:space:]][[:space:]]+' 'NR==0{print; next} {print $0"\t""\t"$5}' file
00-0000-L-F-Male \\path\to xxx xxx
00-0001-L-F-Female \\path\to xxx xxx
awk 2
awk -F'[-\\]' 'NR==0{print; next} {print $0"\t""\t"$5}' file
awk: fatal: Unmatched [ or [^: /[-\]/
Using any awk:
$ awk -F'[-\t]' -v OFS='\t\t' '{print $0, $5}' file
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
Regarding your scripts:
awk
awk -F'-[[:space:]][[:space:]]+' 'NR==0{print; next} {print $0"\t""\t"$5}' file
-F'-[[:space:]][[:space:]]+' says that your fields are separated by a - followed by 2 or more spaces, which they aren't.
NR==0{foo} says "do foo for line number 0" but there is no line number 0 in any input.
awk 2
awk -F'[-\\]' 'NR==0{print; next} {print $0"\t""\t"$5}' file
-F'[-\\]' appears to be trying to set FS to a minus sign or a backslash, but you already told us your fields are tab-separated, not backslash-separated.
When setting FS this way, the value goes through a few different phases of interpretation (converting a shell string to an awk string, converting the awk string to a regexp, and using the regexp as a field separator), so you need several layers of escaping, not just one, to make a backslash literal. If unsure, keep adding backslashes until the warnings and errors go away.
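A minimal sketch of those escaping layers: four backslashes in the shell become two in the awk string, which become one literal backslash in the regexp, so this splits on an actual backslash:

```shell
# Split "a\b" on a literal backslash: $2 is "b"
printf 'a\\b\n' | awk -F'\\\\' '{print $2}'
```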
You may use this awk:
awk -F'\t' '{n=split($1, a, /-/); print $0 FS FS a[(n > 4 ? 5 : n)]}' file
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
The a[(n > 4 ? 5 : n)] expression gets the 5th element from the array if there are 5 or more elements; otherwise it gets the last element.
Presuming your file is '\t'-separated, with one tab per field, and you want an empty field before the Male/Female output, you can use:
awk -F"\t" '{ split($1,arr,"-"); print $0 "\t\t" arr[5] }' filetabs.txt
Example Use/Output
Where filetabs.txt contains your sample data with tab field-separators you would get:
$ awk -F"\t" '{ split($1,arr,"-"); print $0 "\t\t" arr[5] }' filetabs.txt
00-0000-L-F-Male \\path\to xxx xxx Male
00-0001-L-F-Female \\path\to xxx xxx Female
With a perl one-liner, which supports lazy matching, we can try the following code. Written and tested on the shown samples only.
perl -pe 's/^((?:.*?-)+)([^[:space:]]+)([[:space:]]+.*)$/\1\2\3\t\t\2/' Input_file
Or the above could also be written as:
perl -pe 's/^((?:.*?-)+)(\S+)(\s+.*)$/\1\2\3\t\t\2/' Input_file
Explanation: a detailed breakdown of the regex used above.
^( ##From starting of the value creating one capturing group here.
(?: ##Opening non-capturing group here.
.*?- ##Using lazy match till - here.
)+ ##Closing non-capturing group here with matching 1 OR more occurrences of this.
) ##Closing 1st capturing group here.
([^[:space:]]+) ##Creating 2nd capturing group and matching all non-spaces in it.
([[:space:]]+.*)$ ##Creating 3rd capturing group which matches spaces till end of the value.
I am trying to split a variable as follows. Is there an efficient way to do this, preferably using awk?
echo '262146*10,69636*32' | awk -F, 'split($1, DCAP,"\\*") {print DCAP[1]}; split($2, DCAP,"\\*"){print DCAP[1]}'
echo '262146*10,69636*32' | awk -F '[,*]' '{print $1; print $3}'
or
echo '262146*10,69636*32' | awk -F '[,*]' '{printf("%d\n%d\n",$1,$3)}'
Output:
262146
69636
If you have a longer sequence you could try:
echo '262146*10,69636*32,10*3' | awk 'BEGIN {FS="*"; RS=","} {print $1}'
I was surprised to find that when you do this:
echo "hello" | awk -F'|' '{print $1;}'
you get:
hello
How can I return nothing when the field separator '|' is absent from the line? I do this to extract dates at the beginning of log lines, but some lines don't start with a date, which then gives me this problem. Thanks, I am quite new to awk.
You can do this
echo "hello" | awk -F'|' 'NF>1 {print $1}'
echo "hello|1" | awk -F'|' 'NF>1 {print $1}'
hello
Only when there is more than one field does it print the first field.
On a file
cat testing
record1|val1
record2|val2
record3
record4|val4
awk -F'|' 'NF>1 {print $1}' testing
record1
record2
record4
Alternatively, you could use
awk -F'|' '$1!=$0{print $1}'
If no separator is present, field one contains the whole line, so the negated test prints the first field only when a separator was found.
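A quick check of that field-versus-whole-line comparison, one line of each kind:

```shell
# "record1|val1" splits, so $1 differs from $0 and is printed;
# "record3" has no "|", so $1 equals $0 and the line is skipped
printf 'record1|val1\nrecord3\n' | awk -F'|' '$1!=$0{print $1}'
```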
$ cat file1 #It contains ID:Name
5:John
4:Michel
$ cat file2 #It contains ID
5
4
3
I want to replace the IDs in file2 with the names from file1; required output:
John
Michel
NO MATCH FOUND
I need to expand the code below to return the NO MATCH FOUND text.
awk -F":" 'NR==FNR {a[$1]=$2;next} {print a[$1]}' file1 file2
My current result:
John
Michel
<< empty line
Thanks,
You can use a ternary operator for this: print ($1 in a)?a[$1]:"NO MATCH FOUND". That is, if $1 is in the array, print the stored name; otherwise, print the text "NO MATCH FOUND".
All together:
$ awk -F":" 'NR==FNR {a[$1]=$2;next} {print ($1 in a)?a[$1]:"NO MATCH FOUND"}' f1 f2
John
Michel
NO MATCH FOUND
You can test whether the index occurs in the array:
$ awk -F":" 'NR==FNR {a[$1]=$2;next} $1 in a {print a[$1]; next} {print "NOT FOUND"}' file1 file2
John
Michel
NOT FOUND
If file2 contains only the ID digits (no trailing whitespace):
awk -F ':' '$1 in A {print A[$1];next}{if($2~/^$/) print "NOT FOUND";else A[$1]=$2}' file1 file2
If not:
awk -F '[:[:blank:]]' '$1 in A {print A[$1];next}{if($2~/^$/) print "NOT FOUND";else A[$1]=$2}' file1 file2
How do I select the first column from the TAB separated string?
# echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -F'\t' '{print $1}'
The above will return the entire line and not just "LOAD_SETTLED" as expected.
Update:
I need to change the third column in the tab separated values.
The following does not work.
echo $line | awk 'BEGIN { -v var="$mycol_new" FS = "[ \t]+" } ; { print $1 $2 var $4 $5 $6 $7 $8 $9 }' >> /pdump/temp.txt
This, however, works as expected if the separator is a comma instead of a tab.
echo $line | awk -v var="$mycol_new" -F'\t' '{print $1 "," $2 "," var "," $4 "," $5 "," $6 "," $7 "," $8 "," $9 }' >> /pdump/temp.txt
You need to set the OFS variable (output field separator) to be a tab:
echo "$line" |
awk -v var="$mycol_new" -F'\t' 'BEGIN {OFS = FS} {$3 = var; print}'
(make sure you quote the $line variable in the echo statement)
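For example, with the tab-separated sample line from the question and a hypothetical replacement value for the third column:

```shell
# Build a tab-separated line like the one in the question
line="$(printf 'LOAD_SETTLED\tLOAD_INIT\t2011-01-13\t03:50:01')"
mycol_new='2012-02-14'   # hypothetical new value for column 3

# OFS = FS keeps the tabs when awk rebuilds the record after $3 = var
echo "$line" |
awk -v var="$mycol_new" -F'\t' 'BEGIN {OFS = FS} {$3 = var; print}'
```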
Make sure they're really tabs! In bash, you can insert a literal tab using C-v TAB:
$ echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -F$'\t' '{print $1}'
LOAD_SETTLED
Use:
awk -v FS='\t' -v OFS='\t' ...
Example from one of my scripts.
I use the FS and OFS variables to manipulate BIND zone files, which are tab delimited:
awk -v FS='\t' -v OFS='\t' \
-v record_type=$record_type \
-v hostname=$hostname \
-v ip_address=$ip_address '
$1==hostname && $3==record_type {$4=ip_address}
{print}
' $zone_file > $temp
This is a clean and easy to read way to do this.
You can set the Field Separator:
... | awk 'BEGIN {FS="\t"}; {print $1}'
Excellent read:
https://docs.freebsd.org/info/gawk/gawk.info.Field_Separators.html
echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk -v var="test" 'BEGIN { FS = "[ \t]+" } ; { print $1 "\t" var "\t" $3 }'
If your fields are separated by tabs - this works for me in Linux.
awk -F'\t' '{print $1}' < tab_delimited_file.txt
I use this to process data generated by mysql, which generates tab-separated output in batch mode.
From awk man page:
-F fs
--field-separator fs
Use fs for the input field separator (the value of the FS predefined variable).
1st column only
— awk NF=1 FS='\t'
LOAD_SETTLED
First 3 columns
— awk NF=3 FS='\t' OFS='\t'
LOAD_SETTLED LOAD_INIT 2011-01-13
Except first 2 columns
— {g,n}awk NF=NF OFS= FS='^([^\t]+\t){2}'
— {m}awk NF=NF OFS= FS='^[^\t]+\t[^\t]+\t'
2011-01-13 03:50:01
Last column only
— awk '($!NF=$NF)^_' FS='\t', or
— awk NF=NF OFS= FS='^.*\t'
03:50:01
Should this not work?
echo "LOAD_SETTLED LOAD_INIT 2011-01-13 03:50:01" | awk '{print $1}'