How to join two files using awk?

I have two files: 1.txt and 2.txt.
1.txt has items and their order, in this form:
item-code|order-value|label
2.txt has items and their properties, in this form:
item-code|property-A|property-B| ... |property-Z
For example, 1.txt looks like this:
ITEM-CODE|_o_o_|prefLabel-EN-ANSI
6|8719|disparlure
7|3300|acids,-bases,-and-salts
8|3299|chemical-compounds
2.txt looks like this:
ITEM-CODE|TERM|AV-FTC|DB-PEDIA-IRI|LCSH-1|LCSH-2|LCSH-3|LCSH-4|LCSH-5|LCSH-6|LCSH-7|GACS-IRI
2|positive-sense,-single-stranded-RNA-viruses|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|http://id.agrisemantics.org/gacs/C4028
4|negative-sense,-single-stranded-RNA-viruses|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|http://id.agrisemantics.org/gacs/C3806
6|disparlure|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_
7|acids,-bases,-and-salts|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_
8|chemical-compounds|c_49870|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|http://id.agrisemantics.org/gacs/C29686
A sample 3.txt (the result; see below) looks like this:
ITEM-CODE|TERM|AV-FTC|DB-PEDIA-IRI|LCSH-1|LCSH-2|LCSH-3|LCSH-4|LCSH-5|LCSH-6|LCSH-7|GACS-IRI|_o_o_
2|positive-sense,-single-stranded-RNA-viruses|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|http://id.agrisemantics.org/gacs/C4028|NULL
4|negative-sense,-single-stranded-RNA-viruses|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|http://id.agrisemantics.org/gacs/C3806|NULL
6|disparlure|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|_0_|8719
This awk script:
BEGIN { FS=OFS="|" }
NR==FNR{
a[$1]=$2
next
}
{
if ($1 in a)
$(NF+1)=a[$1]
else
$(NF+1)="NULL"
print
}
generates:
item-code|label|property-A|property-B| ... |property-Z|order-value
If no item-code from 1.txt matches an item-code in 2.txt, NULL is substituted for the missing order-value.
How can I modify the awk script to keep 1.txt on the left (the "constant") and 2.txt on the right (the "variables"), and generate a result like this:
item-code|order-value|label|property-A|property-B| ... |property-Z
or, if no property-value is available for the item-code, like this:
item-code|order-value|label|NULL
The command looks like this:
C:\gnu\GnuWin32\bin\awk.exe -f a.awk 1.txt 2.txt > 3.txt
where a.awk is the awk script above.
I'm running awk on Win10 and using double quotes.

Could you please try the following.
awk '
BEGIN{
FS=OFS="|"
}
FNR==1 && ++count==1{
val=$2
next
}
FNR==1 && count==2{
print $0,val
next
}
FNR==NR{
a[$1]=$2
next
}
{
print $0,a[$1]?a[$1]:"NULL"
}
' 1.txt 2.txt
Explanation: adding an explanation for the above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section for awk program here.
FS=OFS="|" ##Setting field separator and output field separator as pipe here.
} ##Closing BEGIN section here.
FNR==1 && ++count==1{ ##True when FNR==1 and count becomes 1, i.e. the first Input_file's header is being read.
val=$2 ##Creating variable val and setting its value as $2 here.
next ##Next will skip all further statements from here onwards.
} ##Closing this condition block.
FNR==1 && count==2{ ##True when FNR==1 and count is already 2 (the pattern above incremented it when the second Input_file's header arrived), i.e. the second Input_file's header is being read.
print $0,val ##Printing current line with variable val here.
next ##Next will skip all further statements from here.
}
FNR==NR{ ##Checking condition FNR==NR which will be TRUE when 1.txt is being read.
a[$1]=$2 ##Creating an array named a whose index is $1 and value is $2.
next ##next will skip all further statements from here.
}
{
print $0,a[$1]?a[$1]:"NULL" ##Printing the current line plus a[$1]; if a[$1] has no value, NULL is printed instead.
}
' 1.txt 2.txt ##Mentioning Input_file names here.
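Note that this keeps 2.txt's columns on the left and appends the order-value on the right. For the layout requested in the question, with 1.txt's columns first, a minimal sketch along the same lines (assuming the field layouts shown above; b.awk is a hypothetical file name, and 2.txt must now be passed before 1.txt):
BEGIN { FS=OFS="|" }
NR==FNR{                     # first file on the command line: 2.txt
  code=$1
  sub(/^[^|]*\|/,"")         # drop the item-code field, keep the properties
  props[code]=$0
  next
}
{                            # second file: 1.txt, whose columns stay on the left
  print $0,($1 in props ? props[$1] : "NULL")
}
invoked as:
C:\gnu\GnuWin32\bin\awk.exe -f b.awk 2.txt 1.txt > 3.txt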

You can do that with join.
1.txt
1|48000|first
2|67500|second
3|81990|third
4|55000|fourth
2.txt
1|fred|sara|anthony
3|steve|jane|mike
4|tim
Then run:
join -a 1 -e "NULL" -t '|' -o 1.1,1.2,1.3,2.2,2.3,2.4 1.txt 2.txt
Sample Result
1|48000|first|fred|sara|anthony
2|67500|second|NULL|NULL|NULL
3|81990|third|steve|jane|mike
4|55000|fourth|tim|NULL|NULL
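Note that join expects both inputs to be sorted on the join field; if yours are not, sort them first (and note that -e only takes effect for the fields listed with -o). For example:
sort -t '|' -k1,1 1.txt > 1.sorted
sort -t '|' -k1,1 2.txt > 2.sorted
join -a 1 -e "NULL" -t '|' -o 1.1,1.2,1.3,2.2,2.3,2.4 1.sorted 2.sorted
With GNU join you can also try -o auto instead of listing each output field.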

Related

How to merge duplicate lines into same row with primary key and more than one column of information

Here is my data:
NAME1,NAME1_001,NULL,LIC100_1,NULL,LIC300-3,LIC300-6
NAME1,NAME1_003,LIC000_1,NULL,NULL,NULL,NULL
NAME2,NAME2_001,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME3,NAME3_001,NULL,LIC400_2,NULL,NULL,LIC500_1
NAME3,NAME3_005,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME3,NAME3_006,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME4,NAME4_002,NULL,LIC100_1,NULL,LIC300-3,LIC300-6
Expected result:
NAME1|NAME1_001|NULL|LIC100_1|NULL|LIC300-3|LIC300-6|NAME1_003|LIC000_1|NULL|NULL|NULL|NULL
NAME2|NAME2_001|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME3|NAME3_001|NULL|LIC400_2|NULL|NULL|LIC500_1|NAME3_005|LIC000_1|NULL|LIC400_2|NULL|NULL|NAME3_006|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME4|NAME4_002|NULL|LIC100_1|NULL|LIC300-3|LIC300-6
I tried the command below, but have no idea how to add the details ($3 to $7):
awk '
BEGIN{FS=","; OFS="|"};
{ arr[$1] = arr[$1] == ""? $2 : arr[$1] "|" $2 }
END {for (i in arr) print i, arr[i] }' file.csv
Any suggestions? Thanks!
Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
BEGIN{
FS=","
OFS="|"
}
FNR==NR{
first=$1
$1=""
sub(/^,/,"")
arr[first]=(first in arr?arr[first] OFS:"")$0
next
}
($1 in arr){
print $1,arr[$1]
delete arr[$1]
}
' Input_file Input_file
Explanation: adding a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="," ##Setting FS as comma here.
OFS="|" ##Setting OFS as | here.
}
FNR==NR{ ##Checking FNR==NR which will be TRUE when first time Input_file is being read.
first=$1 ##Setting first as 1st field here.
$1="" ##Nullifying first field here.
sub(/^,/,"") ##Substituting starting comma with NULL in current line.
arr[first]=(first in arr?arr[first] OFS:"")$0 ##Creating arr with index of first and keep adding same index value to it.
next ##next will skip all further statements from here.
}
($1 in arr){ ##Checking condition if 1st field is present in arr then do following.
print $1,arr[$1] ##Printing the 1st field and its merged arr value, separated by OFS.
delete arr[$1] ##Deleting arr item here.
}
' Input_file Input_file ##Mentioning Input_file names here.
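Note that Input_file is deliberately passed twice: the first pass builds arr, and the second pass prints each merged row the first time its key is seen, with delete preventing a reprint on later duplicates, so the output keeps the order in which keys first appear in the file.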
Another awk:
$ awk '
BEGIN { # set the field separators
FS=","
OFS="|"
}
{
if($1 in a) { # if $1 already has an entry in a hash
t=$1 # store key temporarily
$1=a[$1] # set the a hash entry to $1
a[t]=$0 # and hash the record
} else { # if $1 seen for the first time
$1=$1 # rebuild record to change the separators
a[$1]=$0 # and hash the record
}
}
END { # afterwards
for(i in a) # iterate a
print a[i] # and output
}' file
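(Note that for (i in a) visits indexes in an unspecified order, so this variant may not preserve the input's line order; the solution below does.)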
Assuming your input is grouped by the key field as shown in your example (if it isn't, sort it first), you don't need to store the whole file in memory or read it twice, and this will output the lines in the same order they appear in the input:
$ cat tst.awk
BEGIN { FS=","; OFS="|" }
$1 != prev {
if (NR>1) {
print rec
}
prev = rec = $1
}
{
$1 = ""
rec = rec $0
}
END { print rec }
$ awk -f tst.awk file
NAME1|NAME1_001|NULL|LIC100_1|NULL|LIC300-3|LIC300-6|NAME1_003|LIC000_1|NULL|NULL|NULL|NULL
NAME2|NAME2_001|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME3|NAME3_001|NULL|LIC400_2|NULL|NULL|LIC500_1|NAME3_005|LIC000_1|NULL|LIC400_2|NULL|NULL|NAME3_006|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME4|NAME4_002|NULL|LIC100_1|NULL|LIC300-3|LIC300-6
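If the input weren't grouped by the key field, sorting it on that field first would be enough, for example:
sort -t, -k1,1 file | awk -f tst.awk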

Search one file's lines for a partial match in another file

I have 2 files, the first one:
values.txt
test#
test1#
test3#
test4#
test6#
test7#
test8#
test9#
test10#
data.csv
"username","email"
"user","test#gmail.com"
"user1","test1#gmail.com"
"user2","test3#gmail.com"
"user4","test4#gmail.com"
"user456","loka#gmail.com"
"user789","lopa#gmail.com"
"user5","test7#gmail.com"
"user","xpos#gmail.com"
"user5","test9#gmail.com"
"user","xpx#gmail.com"
I want the output to be like this:
"user","test#gmail.com"
"user1","test1#gmail.com"
"user2","test3#gmail.com"
"user4","test4#gmail.com"
"user5","test7#gmail.com"
"user5","test9#gmail.com"
What I was able to do:
$ awk -F, -v q='"' 'NR==FNR{a[q $0 q]; next}
$2 in a' values.txt data.csv > test1.csv
This works only when I have the full email (e.g. test9#gmail.com) and not just test9#. It produces a new file test1.csv containing:
"user5","test9#gmail.com"
....
....
I couldn't figure out how to do it with a partial substring in awk.
You may use this awk:
awk -F, 'NR==FNR {a[$1]; next} {ea = $2; gsub(/^"|#.*$/, "", ea)} ea "#" in a' values.txt data.csv
"user","test#gmail.com"
"user1","test1#gmail.com"
"user2","test3#gmail.com"
"user4","test4#gmail.com"
"user5","test7#gmail.com"
"user5","test9#gmail.com"
A more readable version:
awk -F, 'NR == FNR {
a[$1] # from values.txt store each value in array a
next
}
{
ea = $2 # copy $2 into ea (email address)
gsub(/^"|#.*$/, "", ea) # strip starting " and text after #
}
ea "#" in a # check if ea + "#" exists in array a
' values.txt data.csv
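Note that the final ea "#" in a is a pattern with no action, so awk applies its default action, printing the current line, whenever the condition is true.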
Could you please try the following, written and tested with the shown samples in GNU awk. It looks like a few of your lines have trailing spaces; since you want to remove them before matching the two files' contents, I have added gsub(/ +$/,"") to the solution.
awk '
{ gsub(/ +$/,"") }
FNR==NR{
arr[$0]
next
}
{
for(key in arr){
if(index($2,key)){
print
next
}
}
}' values.txt FS="," data.csv
Explanation: adding a detailed explanation of the above.
awk ' ##Starting awk program from here.
{ gsub(/ +$/,"") } ##Using gsub to remove spaces at last of lines.
FNR==NR{ ##Checking condition which will be TRUE when values.txt is being read.
arr[$0] ##Creating arr here with index of current line value.
next ##next will skip all further statements from here.
}
{
for(key in arr){ ##Going through arr elements from here.
if(index($2,key)){ ##Checking if key occurs as a substring of the 2nd field.
print ##Printing the current line.
next ##next will skip all further statements from here.
}
}
}' values.txt FS="," data.csv ##Mentioning Input_file names here.
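If every value in values.txt ends with # (as in the shown samples), you could also avoid the per-line loop over arr and use a single hash lookup on the part of $2 up to and including the first #. A sketch under that assumption:
awk '
{ gsub(/ +$/,"") }
FNR==NR{
  arr[$0]
  next
}
{
  key=$2
  sub(/^"/,"",key)     ##Strip the leading double quote.
  sub(/#.*/,"#",key)   ##Keep everything through the first #.
  if(key in arr){ print }
}' values.txt FS="," data.csv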

How to fetch a particular string using a sed command

I have an input string like below:
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
There are more than 1000 rows like this.
I want to fetch the value of the first parameter (the a field) and the d field, but for the d field I want only har:919876543210#abc.com.
I tried like this:
cat $filename | grep -v Orig |sed -e 's/['a:','d:']//g' |awk -F'|' -v OFS=',' '{print $1 "," $4}' >> $NGW_DATA_FILE
The output I got is below:
1,<har919876543210#abc.com>; tag=vy6r5BpcvQ
I want it like this:
1,har:919876543210#abc.com
Where did I make the mistake and how do I solve it?
EDIT: As per OP's change of Input_file and OP's comments, adding the following now.
awk '
BEGIN{ FS="|"; OFS="," }
{
sub(/[^:]*:/,"",$1)
gsub(/^[^<]*|; .*/,"",$4)
gsub(/^<|>$/,"",$4)
print $1,$4
}' Input_file
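With the shown Input_file this prints:
1,har:919876543210#abc.com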
With the shown samples, could you please try the following, written and tested in GNU awk.
awk '
BEGIN{
FS="|"
OFS=","
}
{
val=""
for(i=1;i<=NF;i++){
split($i,arr,":")
if(arr[1]=="a" || arr[1]=="d"){
gsub(/^[^:]*:|; .*/,"",$i)
gsub(/^<|>$/,"",$i)
val=(val?val OFS:"")$i
}
}
print val
}
' Input_file
Explanation: adding a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="|" ##Setting FS as pipe here.
OFS="," ##Setting OFS as comma here.
}
{
val="" ##Nullify val here(to avoid conflicts of its value later).
for(i=1;i<=NF;i++){ ##Traversing through all fields here
split($i,arr,":") ##Splitting current field into arr with delimiter by :
if(arr[1]=="a" || arr[1]=="d"){ ##Checking condition if first element of arr is either a OR d
gsub(/^[^:]*:|; .*/,"",$i) ##Globally substituting from starting till 1st occurrence of colon OR from semi colon to everything with NULL in $i.
val=(val?val OFS:"")$i ##Creating variable val which has current field value and keep adding in it.
}
}
print val ##Printing val here.
}
' Input_file ##Mentioning Input_file name here.
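Note this solution matches the original a:/d: labelled samples; with a line like a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ it prints 1,har:919876543210#abc.com, while with the VAL-labelled line shown above it would print an empty line, since no field label is a or d.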
You may also try this AWK script:
cat file
VAL:1|b:2|c:3|VAL:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|VAl:1234|name:mnp|VAL:91987654321
awk -F '[|;]' '{
s=""
for (i=1; i<=NF; ++i)
if (i == 1 || $i ~ /</) {        # take the first field and the field holding <...>
gsub(/^[^:]+:|[<>]/, "", $i)     # drop the label and the angle brackets
s = (s == "" ? "" : s "," ) $i
}
print s
}' file
1,har:919876543210#abc.com
You can do the same thing with sed rather easily using Extended Regex, two capture groups and two back-references, e.g.
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
Explanation
's/find/replace/' standard substitution, where the find is;
^[^:]*: from the beginning skip through the first ':', then
(\w+) capture one or more word characters ([a-zA-Z0-9_]), then
[^<]*[<] consume zero or more characters not a '<', then the '<', then
([^>]+) capture everything not a '>', and
.*$ discard all remaining chars in line, then the replace is
\1,\2 reinsert the captured groups separated by a comma.
Example Use/Output
$ echo 'a:1|b:2|c:3|d:<har:919876543210#abc.com>; tag=vy6r5BpcvQ|' |
sed -E 's/^[^:]*:(\w+)[^<]*[<]([^>]+).*$/\1,\2/'
1,har:919876543210#abc.com

Conditional transposition in awk based on column values

I'm trying to make the below transformation using awk.
Input:
status,parent,child,date
first,foo,bar,2019-01-01
NULL,foo,bar,2019-01-02
NULL,foo,bar,2019-01-03
last,foo,bar,2019-01-04
NULL,foo,bar,2019-01-05
blah,foo,bar,2019-01-06
NULL,foo,bar,2019-01-07
first,bif,baz,2019-01-02
NULL,bif,baz,2019-01-03
last,bif,baz,2019-01-04
Expected output:
parent,child,first,last
foo,bar,2019-01-01,2019-01-04
bif,baz,2019-01-02,2019-01-04
I'm pretty stumped by this problem, and haven't got anything to show yet - any pointers would be very helpful.
Could you please try the following.
awk '
BEGIN{
FS=OFS=SUBSEP=","
print "parent,child,first,last"
}
$1=="first" || $1=="last"{
a[$1,$2,$3]=$NF
b[$2,$3]
}
END{
for(i in b){
print i,a["first",i],a["last",i]
}
}
' Input_file
Output will be as follows.
parent,child,first,last
bif,baz,2019-01-02,2019-01-04
foo,bar,2019-01-01,2019-01-04
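(The for (i in b) loop visits indexes in an unspecified order, which is why bif happens to come out before foo here; sort the data lines afterwards, or see the next solution, if input order matters.)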
Explanation: adding a detailed explanation of the above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=SUBSEP="," ##Setting FS, OFS and SUBSEP as comma here.
print "parent,child,first,last" ##Printing header values as per OP's request here.
} ##Closing the BEGIN block of this program here.
$1=="first" || $1=="last"{ ##Checking condition if $1 is either string first or last then do following.
a[$1,$2,$3]=$NF ##Creating an array named a whose index is $1,$2,$3 and its value is $NF(last column of current line).
b[$2,$3] ##Creating an array named b whose index is $2,$3 from current line.
} ##Closing main BLOCK for main program here.
END{ ##Starting END BLOCK for this awk program.
for(i in b){ ##Starting a for loop to traverse through array here.
print i,a["first",i],a["last",i] ##Printing variable i, the value of array a indexed by "first",i and the value of array a indexed by "last",i.
} ##Closing BLOCK for, for loop here.
} ##Closing BLOCK for END block for this awk program here.
' Input_file ##Mentioning Input_file name here.
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $2 OFS $3 }
FNR==1 { print key, "first", "last" }
$1=="first" { first[key] = $4 }
$1=="last" { print key, first[key], $4 }
$ awk -f tst.awk file
parent,child,first,last
foo,bar,2019-01-01,2019-01-04
bif,baz,2019-01-02,2019-01-04
If you can have a first without a last or vice-versa or they can occur out of order then include those cases in the example in your question.
Not awk, you already have that, but here's an option in bash alone, just for kicks.
#!/usr/bin/env bash
declare -A first=()
printf 'parent,child,first,last\n'
while IFS=, read -r pos a b date; do
case "$pos" in
first) first[$a,$b]=$date ;;
last) printf "%s,%s,%s,%s\n" "$a" "$b" "${first[$a,$b]}" "$date" ;;
esac
done < input.csv
Requires bash 4+ for the associative array.
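Assuming the script is saved as merge.sh (a hypothetical name) alongside input.csv, running it prints:
$ bash merge.sh
parent,child,first,last
foo,bar,2019-01-01,2019-01-04
bif,baz,2019-01-02,2019-01-04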

awk search pattern in a specific field and replace its content

I need to find password fields that are empty or contain a space or tab, and replace them with x (in the /etc/passwd file).
I found this awk syntax, which shows users where the second field (using : as delimiter) is either empty or contains a space or tab:
awk -F":" '($2 == "" || $2 == " " || $2 == "\t") {print $0}' $file
and the result is the following:
user1::53556:100::/home/user1:/bin/bash
user2: :53557:100::/home/user2:/bin/bash
user3: :53558:100::/home/user3:/bin/bash
How can I tell awk to replace this 2nd field (empty, or containing a space or tab) with another character (for example x)?
Could you please try the following.
awk 'BEGIN{FS=OFS=":"} {$2=$2=="" || $2~/^[[:space:]]+$/?"X":$2} 1' Input_file
Explanation: adding an explanation of the above code.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section here which will be executed before Input_file is being read.
FS=OFS=":" ##Setting FS and OFS as colon here for all lines of Input_file.
} ##Closing BEGIN section block here.
{
$2=$2=="" || $2~/^[[:space:]]+$/?"X":$2 ##Checking if $2 (2nd field) of the current line is either NULL or all whitespace; if so, set its value to X, otherwise keep $2 as it is.
}
1 ##Mentioning 1 will print the edited/non-edited current line.
' Input_file ##Mentioning Input_file name here.
EDIT: As per OP, the last line of the Input_file must NOT be touched, so adding the following solution now.
tac Input_file | awk 'BEGIN{FS=OFS=":"} FNR==1{print;next} {$2=$2=="" || $2~/^[[:space:]]+$/?"X":$2} 1' | tac
EDIT2: In case you want to do it in a single awk, try the following.
awk '
BEGIN{
FS=OFS=":"
}
prev{
num=split(prev,array,":")
array[2]=array[2]=="" || array[2]~/^[[:space:]]+$/?"X":array[2]
for(i=1;i<=num;i++){
val=(val?val OFS array[i]:array[i])
}
print val
val=""
}
{
prev=$0
}
END{
if(prev){
print prev
}
}' Input_file
In case you want to change the Input_file itself, append > temp_file && mv temp_file Input_file to the above command.
$ awk 'BEGIN{FS=OFS=":"} (NF>1) && ($2~/^[[:space:]]*$/){$2="x"} 1' file
user1:x:53556:100::/home/user1:/bin/bash
user2:x:53557:100::/home/user2:/bin/bash
user3:x:53558:100::/home/user3:/bin/bash
To change the original file using GNU awk:
awk -i inplace 'BEGIN{FS=OFS=":"} (NF>1) && ($2~/^[[:space:]]*$/){$2="x"} 1' file
or with any awk:
awk 'BEGIN{FS=OFS=":"} (NF>1) && ($2~/^[[:space:]]*$/){$2="x"} 1' file > tmp && mv tmp file
The test for NF>1 ensures we only operate on lines that already have at least 2 fields and so we don't create a line like :x in the output when there's an empty line in the input file. The rest is hopefully obvious.
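For example, without the guard an empty input line would be rewritten:
$ printf '\n' | awk 'BEGIN{FS=OFS=":"} ($2~/^[[:space:]]*$/){$2="x"} 1'
:x
With the NF>1 test in place, the empty line passes through unchanged.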