adding common names in the column - awk

Is it possible to print each unique name in the 1st column once, joining the corresponding values from the 2nd column, like below? Thanks in advance!
input
tony singapore
johnny germany
johnny singapore
output
tony singapore
johnny germany;singapore

Try this one-liner:
awk '{a[$1]=$1 in a?a[$1]";"$2:$2}END{for(x in a)print x, a[x]}' file

$ awk '{name2vals[$1] = name2vals[$1] sep[$1] $2; sep[$1] = ";"} END { for (name in name2vals) print name, name2vals[name]}' file
johnny germany;singapore
tony singapore
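
Note that for (name in name2vals) makes no guarantee about output order, which is why johnny prints before tony above. If you need the names in their original order of first appearance, a small variation (a sketch, not one of the posted answers) keeps a separate index array:

awk '
!($1 in vals) { names[++n] = $1; vals[$1] = $2; next }   # first time this name is seen
{ vals[$1] = vals[$1] ";" $2 }                            # append further values with ;
END { for (i = 1; i <= n; i++) print names[i], vals[names[i]] }
' file

With the sample input this should print tony singapore first and johnny germany;singapore second.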

Here is a cryptic sed variant:
Content of script.sed
$ cat script.sed
:a # Create a label named 'a' to loop back to
$!N # If not on the last line, append the next line to the pattern space
s/^(([^ ]+ ).*)\n\2/\1;/ # If the next line has the same first column, join its second column separated by ;
ta # If the last substitution succeeded, loop back to 'a'
P # Print up to the first \n of the pattern space
D # Delete up to and including the first \n, then restart the cycle
Execution:
$ cat file
tony singapore
johnny germany
johnny singapore
$ sed -rf script.sed file
tony singapore
johnny germany;singapore

Add file name as a new column with awk

First of all, existing questions didn't solve my problem, which is why I am asking again.
I have two txt files temp.txt
adam 12
george 15
thomas 20
and demo.txt
mark 8
richard 11
james 18
I want to combine them and add a 3rd column containing each file's name without the extension, like this:
adam 12 temp
george 15 temp
thomas 20 temp
mark 8 demo
richard 11 demo
james 18 demo
I used this script:
for i in $(ls); do name=$(basename -s .txt $i)| awk '{OFS="\t";print $0, $name} ' $i; done
But it yields the following table:
mark 8 mark 8
richard 11 richard 11
james 18 james 18
adam 12 adam 12
george 15 george 15
thomas 20 thomas 20
I don't understand why the name variable ends up being the whole line.
Thanks in advance.
Awk has no access to Bash's variables, or vice versa. Inside the Awk script, name is undefined, so $name gets interpreted as $0.
Also, don't use ls in scripts, and quote your shell variables.
Finally, the assignment of name does not print anything, so piping its output to Awk makes no sense.
for i in ./*; do
name=$(basename -s .txt "$i")
awk -v name="$name" '{OFS="\t"; print $0, name}' "$i"
done
In fact, the basename calculation could easily be performed natively in Awk, but I leave that as an exercise. (Hint: sub(regex, "", FILENAME).)
awk has a FILENAME variable whose value is the path of the file being processed, and a FNR variable whose value is the current line number in the file;
so, at FNR == 1 you can process FILENAME and store the result in a variable that you'll use afterwards:
awk -v OFS='\t' '
FNR == 1 {
basename = FILENAME
sub(".*/", "", basename) # strip from the start up to the last "/"
sub(/\.[^.]*$/, "", basename) # strip from the last "." up to the end
}
{ print $0, basename }
' ./path/temp.txt ./path/demo.txt
adam 12 temp
george 15 temp
thomas 20 temp
mark 8 demo
richard 11 demo
james 18 demo
Using BASH:
for i in temp.txt demo.txt ; do while read -r a b ; do printf "%s\t%s\t%s\n" "$a" "$b" "${i%%.*}" ; done <"$i" ; done
Output:
adam 12 temp
george 15 temp
thomas 20 temp
mark 8 demo
richard 11 demo
james 18 demo
For each source file, read each line and use printf to output tab-delimited columns, including the current source file name without its extension via bash parameter expansion.
First, you need to unmask $name, which is inside the single quotes and therefore does not get replaced by the file name by the shell. After you do that, you need to add double quotes around $name so that awk sees it as a string:
for i in $(ls); do name=$(basename -s .txt $i); awk '{OFS="\t";print $0, "'$name'"} ' $i; done

Most efficient way to gsub strings in awk where strings come from a separate file

I have a tab-separated file called cities that looks like this:
Washington Washington N 3322 +Geo+Cap+US
Munich München N 3842 +Geo+DE
Paris Paris N 4948 +Geo+Cap+FR
I have a text file called countries.txt which looks like this:
US
DE
IT
I'm reading this file into a Bash variable and sending it to an awk program like this:
#!/usr/bin/env bash
countrylist=$(<countries.txt)
awk -v countrylist="$countrylist" -f countries.awk cities
And I have an awk file which should split the countrylist variable into an array, then process the cities file in such a way that we replace "+"VALUE with "" in $5 only if VALUE is in the countries array.
{
FS = "\t"; OFS = "\t";
split(countrylist, countries, /\n/)
# now gsub efficiently every country in $5
# but only if it's in the array
# i.e. replace "+US" with "" but not
# "+FR"
}
I am stuck on this last bit because I don't know how to check whether $5 contains a value from the countries array and remove it only in that case.
Many thanks in advance!
[Edit]
The output should be tab-delimited:
Washington Washington N 3322 +Geo+Cap
Munich München N 3842 +Geo
Paris Paris N 4948 +Geo+Cap+FR
Could you please try the following, if I understood your requirement correctly.
awk 'FNR==NR{a[$0]=$0;next} {for(i in a){if(index($5,a[i])){gsub(a[i],"",$5)}}} 1' countries.txt cities
A non-one-liner form of the code is as follows (you could set FS and OFS to \t in case your Input_file is tab-delimited):
awk '
FNR==NR{
a[$0]=$0
next
}
{
for(i in a){
if(index($5,a[i])){
gsub(a[i],"",$5)
}
}
}
1
' countries.txt cities
Output will be as follows.
Washington Washington N 3322 +Geo+Cap+
Munich München N 3842 +Geo+
Paris Paris N 4948 +Geo+Cap+FR
This is the awk way of doing it:
$ awk '
BEGIN {
FS=OFS="\t" # delimiters
}
NR==FNR { # process countries file
countries[$0] # hash the countries to an array
next # skip the city-processing block while reading countries
}
{
n=split($5,city,"+") # split the 5th column by +
if(city[n] in countries) # check whether the last part is in countries
sub(city[n] "$","",$5) # if so, remove it from the end of the 5th field
}1' countries cities # output and mind the order of files
Output (with actual tabs in data):
Washington Washington N 3322 +Geo+Cap+
Munich München N 3842 +Geo+
Paris Paris N 4948 +Geo+Cap+FR
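
Both answers above remove only the country code itself, which is why their output keeps a trailing + (e.g. +Geo+Cap+) rather than the +Geo+Cap requested in the [Edit]. If you also want to keep the -v countrylist approach from the question, one way (a sketch along those lines, assuming $5 always starts with +) is to rebuild $5 from its +-separated parts and drop the parts that are listed countries:

awk -v countrylist="$countrylist" '
BEGIN {
    FS = OFS = "\t"
    n = split(countrylist, clist, "\n")
    for (i = 1; i <= n; i++) countries[clist[i]]   # hash the country codes
}
{
    m = split($5, part, "+")                       # part[1] is empty because $5 starts with +
    out = ""
    for (i = 2; i <= m; i++)
        if (!(part[i] in countries))
            out = out "+" part[i]                  # keep anything that is not a listed country
    $5 = out
    print
}' cities

With the sample cities file and countries.txt this should produce the tab-delimited output shown in the [Edit].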

how to perform a search for words in one file against another file and display the first matching word in a line

I have an annoying problem. I have two files.
$ cat file1
Sam
Tom
$ cat file2
I am Sam. Sam I am.
Tom
I am Tom. Tom I am.
File 1 is a word list, whereas file2 is a file containing a varying number of columns. I want to search file2 for the words in file1 and display every matching word that appears in each line of file2 (each word once per line). Thus the result needs to be the following:
Sam (line 1 match)
Tom (line 2 match)
Tom (line 3 match)
If file2 is the following,
I am Sam. Sam I am.
Tom
I am Tom. Tom I am.
I am Tom. Sam I am.
I am Sam. Tom I am.
I am Sammy.
It needs to display the following:
Sam (1st line match)
Tom (2nd line match)
Tom (3rd line match)
Tom (4th line match)
Sam (4th line match)
Sam (5th line match)
Tom (5th line match)
Sam (6th line match)
I think I need an awk solution since the command "grep -f file1 file2" won't work.
Seems like you want the first match from each line:
$ cat f1
Sam
Tom
$ cat f2
I am Sam. Sam I am.
Tom
I am Tom. Tom I am.
I am Tom. Sam I am.
I am Sam. Tom I am.
$ grep -Fnof f1 f2 | sort -t: -u -k1,1n
1:Sam
2:Tom
3:Tom
4:Tom
5:Sam
-n option to display line number which is later used to remove duplicates
-F option to match search terms literally and not as regex
-o to display only matching terms
Pipe the output to cut -d: --complement -f1 to remove the first column of line numbers; the full pipeline is shown below.
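A sketch of that combined pipeline (not part of the original answer):
$ grep -Fnof f1 f2 | sort -t: -u -k1,1n | cut -d: --complement -f1
Sam
Tom
Tom
Tom
Sam
(cut -d: -f2- would work as well here, since none of the matched words contain a colon.)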
With GNU awk for sorted_in:
$ cat tst.awk
BEGIN { PROCINFO["sorted_in"] = "#val_num_asc" }
NR==FNR { res[$0]; next }
{
delete found
for ( re in res ) {
if ( !(re in found) ) {
if ( match($0,re) ) {
found[re] = RSTART
}
}
}
for ( re in found ) {
printf "%s (line #%d match)\n", re, FNR
}
}
$ awk -f tst.awk file1 file2
Sam (line #1 match)
Tom (line #2 match)
Tom (line #3 match)
Tom (line #4 match)
Sam (line #4 match)
Sam (line #5 match)
Tom (line #5 match)
Sam (line #6 match)
Could you please try the following and let me know if this helps you.
awk -F"[. ]" 'FNR==NR{a[$0];next} {for(i=1;i<=NF;i++){if($i in a){print $i;next}}}' Input_file1 Input_file2
Seems grep could be made to work:
grep -nof f1 f2 | sort -u
1:Sam
2:Tom
3:Tom
4:Sam
4:Tom
5:Sam
5:Tom
6:Sam

awk: find out how many times columns two and three equal a specific word

Let's say I have a names.txt file with the following:
Bob Billy Billy
Bob Billy Joe
Bob Billy Billy
Joe Billy Billy
and using awk I want to find out how many times $2 = Billy while $3 = Billy. In this case my desired output would be 3 times.
Also, I'm testing this on a mac if that matters.
You first need to test $2==$3 then test that one of those equals "Billy". Increment a counter and then print the result at the end:
$ awk '$2==$3 && $2=="Billy"{cnt++} END{print cnt+0}' names.txt
3
Or, you could almost write just what you said:
$ awk '$2=="Billy" && $3=="Billy" {cnt++} END{print cnt+0}' names.txt
3
And if you want to use a variable so you don't need to type it several times:
$ awk -v name='Billy' '$2==name && $3==name {cnt++}
END{printf "Found \"%s\" %d times\n", name, cnt+0}' names.txt
Found "Billy" 3 times
Or, you could collect them all up and report what was found:
$ awk '{cnts[$2 "," $3]++}
END{for (e in cnts) print e ": " cnts[e]}' names.txt
Billy,Billy: 3
Billy,Joe: 1
You may also consider using grep to do that:
$ grep -c "\sBilly\sBilly" names.txt
3
-c: print a count of matching lines
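Note that \sBilly\sBilly matches anywhere in the line, not specifically in fields 2 and 3, and \s is a GNU extension that may not be available in every grep. For the three-field sample this makes no difference, but an anchored version with POSIX character classes (a sketch, assuming every line has exactly three fields) pins the match to columns 2 and 3:
$ grep -cE "^[^[:space:]]+[[:space:]]+Billy[[:space:]]+Billy$" names.txt
3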

delete line if string is matched and the next line contains another string

I've got an annoying text manipulation problem: I need to delete a line in a file if it contains a string, but only if the next line also contains another string. For example, I have these lines:
john paul
george
john paul
12
john paul
I want to delete any line containing 'john paul' if it is immediately followed by a line that contains 'george', so it would return:
george
john paul
12
john paul
Not sure how to grep or sed this. If anyone could lend a hand, that'd be great!
This might work for you (GNU sed):
sed '/john paul/{$!N;/\n.*george/!P;D}' file
If the line contains john paul, read the next line, and if it contains george, don't print the first line.
N.B. If the line containing george contains john paul it will be checked also.
awk 'NR > 1 && !(/george/ && p ~ /john paul/) { print p } { p = $0 } END { print }' file
Output:
george
john paul
12
john paul
This awk should do it:
cat file
john paul
george
john paul
12
john paul
hans
george
awk 'f~/john paul/ && /george/ {f=$0;next} NR>1 {print f} {f=$0} END {print}' file
george
john paul
12
john paul
hans
george
This will only delete the name above george if it is john paul.
Here is a more general version:
If the line matches a pattern and the previous line was exactly "john paul", do nothing; otherwise, print the previous line. (Change the ^[a-zA-Z]+$ part to george if you only want george to be detected; see the example after the output below.)
awk 'NR>1 && !(/^[a-zA-Z]+$/ && previous ~ /^john paul$/){print previous}{previous=$0}END{print}'
In your example:
$> echo 'john paul
george
john paul
12
john paul' | awk 'NR>1 && !(/^[a-zA-Z]+$/ && previous ~ /^john paul$/){print previous}{previous=$0}END{print}'
george
john paul
12
john paul
If there are numbers in the line, the previous line is printed; otherwise it isn't:
$> echo 'john paul
george 234
john paul
auie
john paul' | awk 'NR>1 && !(/^[a-zA-Z]+$/ && previous ~ /^john paul$/){print previous}{previous=$0}END{print}'
john paul
george 234
auie
john paul
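For example, restricting the check to lines that are exactly george, as suggested above, would be (a sketch based on the same idea):
awk 'NR>1 && !(/^george$/ && previous ~ /^john paul$/){print previous}{previous=$0}END{print}' file
With the question's file this should again print george, john paul, 12, john paul.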
The sed solution is short: two commands and lots of comments ;)
/john paul/ {
# read the next line and append to pattern space
N
# and then if we find "george" in that next line,
# only retain the last line in the pattern space
s/.*\n\(.*george\)/\1/
# and finally print the pattern space,
# as we don't use the -n option
}
Put the above in a file called sedscript and then run:
sed -f sedscript your_input_file
Just to throw some Perl into the mix:
perl -ne 'print $p unless /george/ && $p =~ /john paul/; $p = $_ }{ print $p' file
Print the previous line, unless the current line matches /george/ and the previous line $p matched /john paul/. Set $p to the value of the previous line. }{ effectively creates an END block, so the last line is also printed after the file has been read.
You might have to change the \r\n to \n or to \r; other than that, this should work:
<?php
$string = "john paul
george
john paul
12
john paul";
$string = preg_replace("#john paul\r\n(george)#i",'$1',$string);
echo $string;
?>
You could also read the file into a variable and then overwrite the file afterwards.
With GNU awk for multi-char RS:
$ gawk -vRS='^$' '{gsub(/john paul\ngeorge/,"george")}1' file
george
john paul
12
john paul
Or, if there's more on each line than your sample input shows, just change the RE to suit and use gensub():
$ gawk -vRS='^$' '{$0 = gensub(/[^\n]*john paul[^\n]*\n([^\n]*george[^\n]*)/,"\\1","")}1' file
george
john paul
12
john paul