combine 2 files with AWK based last colums [closed] - awk

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
i have two files
file1
-------------------------------
1 a t p b
2 b c f a
3 d y u b
2 b c f a
2 u g t c
2 b j h c
file2
--------------------------------
1 a b
2 p c
3 n a
4 4 a
i want combine these 2 files based last columns (column 5 of file1 and column 3 of file2) using awk
result
----------------------------------------------
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c

at the very beginning, I didn't see the duplicated "a" in file2, I thought it would be solved with normal array matching. ... now it works.
an awk onliner:
awk 'NR==FNR{a[$3"_"NR]=$0;next;}{for(x in a){if(x~"^"$5) print $1,$2,$3,$4,a[x];}}' f2.txt f1.txt
test
kent$ head *.txt
==> f1.txt <==
1 a t p b
2 b c f a
3 d y u b
2 b c f a
2 u g t c
2 b j h c
==> f2.txt <==
1 a b
2 p c
3 n a
4 4 a
kent$ awk 'NR==FNR{a[$3"_"NR]=$0;next;}{for(x in a){if(x~"^"$5) print $1,$2,$3,$4,a[x];}}' f2.txt f1.txt
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c
note, the output format was not sexy, but it would be acceptable if pipe it to column -t

Other way assuming files have no headers:
awk '
FNR == NR {
f2[ $NF ] = f2[ $NF ] ? f2[ $NF ] SUBSEP $0 : $0;
next;
}
FNR < NR {
if ( $NF in f2 ) {
split( f2[ $NF ], a, SUBSEP );
len = length( a );
for ( i = 1; i <= len; i++ ) {
$NF = a[ i ];
}
}
printf "%s\n", $0;
}
' file2 file1 | column -t
It yields:
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c

A bit easier in a language that supports arbitrary data structures (list of lists). Here's ruby
# read "file2" and group by the last field
file2 = File .foreach('file2') .map(&:split) .group_by {|fields| fields[-1]}
# process file1
File .foreach('file1') .map(&:split) .each do |fields|
file2[fields[-1]] .each do |fields2|
puts (fields[0..-2] + fields2).join(" ")
end
end
outputs
1 a t p 1 a b
2 b c f 3 n a
2 b c f 4 4 a
3 d y u 1 a b
2 b c f 3 n a
2 b c f 4 4 a
2 u g t 2 p c
2 b j h 2 p c

Related

See if 2 columns are 1 to 1 match in sql

I have a table in sql.
I need to check if columns A and B are 1 to 1 match (Meaning for each row in A, there is only one value in B)
but if you see below, it is not.
Only Cat A and Cat C have one values in Col B.
So is there a way to identify between 2 columns
A B C
Cat A asd 34
Cat A asd 56
Cat B dfg 67
Cat B ghj 7
Cat C ggh 78
Cat D ertrty 9
Cat D tyutyu 6
Cat D tuuiy 45
SELECT A, COUNT(DISTINCT B)
FROM your_table
GROUP BY A
HAVING COUNT(DISTINCT B) = 1;

pasting files/multiple columns with different number of rows

Hi I was trying to paste multiple files (each with a single column but different number of rows) together. But it did't provide what I was expecting. How to solve that?
paste file1.txt file2.txt paste3.txt ... paste100 > out.txt
input file 1:
A
B
C
input file 2:
D
E
input file 3:
F
G
H
I
J
.......
......
Desired output:
A D F
B E G
C H
I
J
Would this be same if the files have multiple columns with different number of rows?
for example:
file1
A 1
B 2
C 3
file2
D 4
E 5
file3
F 6 %
G 7 &
H 8 #
I 9 #
J 10 ?
output:
A 1 D 4 F 6 %
B 2 E 5 G 7 &
C 3 H 8 #
I 9 #
J 10 ?
Isn't the default behaviour of paste exactly what you ask?
% paste <(echo "a
b
c
d") <(echo "1
2
3") <(echo "10
> 20
> 30
> 40
> 50
> 60")
a 1 10
b 2 20
c 3 30
d 40
50
60
%

awk: delete first and last entry of comma-separated field

I have a 4 column data that looks something like the following:
a 1 g 1,2,3,4,5,6,7
b 2 g 3,5,3,2,6,4,3,2
c 3 g 5,2,6,3,4
d 4 g 1,5,3,6,4,7
I am trying to delete first two numbers and the last two numbers on entire fourth column so the output looks like the following
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
Can someone give me a help? I would appreciate it.
You can use this:
$ awk '{n=split($4, a, ","); for (i=3; i<=n-2; i++) t=t""a[i](i==n-2?"":","); print $1, $2, $3, t; t=""}' file
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
Explanation
n=split($4, a, ",") slices the 4th field in pieces, based on comma as delimiter. As split() returns the number of pieces we got, we store it in n to work with it later on.
for (i=3; i<=n-2; i++) t=t""a[i](i==n-2?"":",") stores in t the last field, looping through all the slices.
print $1, $2, $3, t; t="" prints the new output and blanks the variable t.
This will work for your posted sample input:
$ awk '{gsub(/^([^,]+,){2}|(,[^,]+){2}$/,"",$NF)}1' file
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6
If you have cases where there's less than 4 commas in your 4th field then update your question to show how those should be handled.
This uses bash array manipulation. It may be a little ... gnarly:
while read -a fields; do # read the fields for each line
IFS=, read -a values <<< "${fields[3]}" # split the last field on comma
new=("${values[#]:2:${#values[#]}-4}") # drop the first 2 and last fields
fields[3]=$(IFS=,; echo "${new[*]}") # join the new list on comma
printf "%s\t" "${fields[#]}"; echo # print the new line
done <<END
a 1 g 1,2,3,4,5,6,7
b 2 g 3,5,3,2,6,4,3,2
c 3 g 5,2,6,3,4
d 4 g 1,5,3,6,4,7
END
a 1 g 3,4,5
b 2 g 3,2,6,4
c 3 g 6
d 4 g 3,6

Using an array in AWK when working with two files

I have two files I merged them based key using below code
file1
-------------------------------
1 a t p bbb
2 b c f aaa
3 d y u bbb
2 b c f aaa
2 u g t ccc
2 b j h ccc
file2
--------------------------------
1 11 bbb
2 22 ccc
3 33 aaa
4 44 aaa
I merged these two file based key using below code
awk 'NR==FNR{a[$3]=$0;next;}{for(x in a){if(x==$5) print $1,$2,$3,$4,a[x]};
My question is how I can save $2 of file2 in variable or array and print after a[x] again.
My desired result is :
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
As you see the first 7 columns is the result of my merge code. I need add the last column (field 2 of a[x]) to my result.
Important:
My next question is if I have .awk file, how I can use some bash script code like (| column -t) or send result to file (awk... > result.txt)? I always use these codes in command prompt. Can I use them inside my code in .awk file?
Simply add all of file2 to an array, and use split to hold the bits you want:
awk 'FNR==NR { two[$0]++; next } { for (i in two) { split(i, one); if (one[3] == $NF) print $1,$2,$3,$4, i, one[2] } }' file2 file1
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Regarding your last question; you can also add 'pipes' and 'writes' inside of your awk. Here's an example of a pipe to column -t:
Contents of script.awk:
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
Run like: awk -f script.awk file2 file1
EDIT:
Add the following to your shell script:
results=$(awk '
FNR==NR {
two[$0]++
next
}
{
for (i in two) {
split(i, one)
if (one[3] == $NF) {
print $1,$2,$3,$4, i, one[2] | "column -t"
}
}
}
' $1 $2)
echo "$results"
Run like:
./script.sh file2.txt file1.txt
Results:
1 a t p 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 3 33 aaa 33
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Your current script is:
awk 'NR==FNR { a[$3]=$0; next }
{ for (x in a) { if (x==$5) print $1,$2,$3,$4,a[x] } }'
(Actually, the original is missing the second close brace for the second pattern/action pair.)
It seems that you process file2 before you process file1.
You shouldn't need the loop in the second code. And you can make life easier for yourself by using the splitting in the first phase to keep the values you need:
awk 'NR==FNR { c1[$3] = $1; c2[$3] = $2; next }
{ print $1, $2, $3, $4, c1[$5], c2[$5], $5, c2[$5] }'
You can upgrade that to check whether c1[$5] and c2[$5] are defined, presumably skipping the row if they are not.
Given your input files, the output is:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22
Give or take column spacing, that's what was requested. Column spacing can be fixed by using printf instead of print, or setting OFS to tab, or ...
The c1 and c2 notations for column 1 and 2 is OK for two columns. If you need more, then you should probably use the 2D array notation:
awk 'NR==FNR { for (i = 1; i <= NF; i++) col[i,$3] = $i; next }
{ print $1, $2, $3, $4, col[1,$5], col[2,$5], $5, col[2,$5] }'
This produces the same output as before.
To achieve what you ask, save the second field after the whole line in the processing of your first file, with a[$3]=$0 OFS $2. For your second question, awk has a variable to separate fields in output, it's OFS, assign a tabulator to it and play with it. Your script would be like:
awk '
BEGIN { OFS = "\t"; }
NR==FNR{
a[$3]=$0 OFS $2;
next;
}
{
for(x in a){
if(x==$5) print $1,$2,$3,$4,a[x]
}
}
' file2 file1
That yields:
1 a t p 1 11 bbb 11
2 b c f 4 44 aaa 44
3 d y u 1 11 bbb 11
2 b c f 4 44 aaa 44
2 u g t 2 22 ccc 22
2 b j h 2 22 ccc 22

awk histogram in buckets

Consider I have a following file..
1 a
1 b
1 a
1 c
1 a
2 a
2 d
2 a
2 d
I want to have a histogram within a bucket... for example if bucket is 1 then the output will be
a 3
b 1
c 1
a 2
d 2
for bucket 2... we have
a 5
b 1
c 1
d 2
I want to do it with awk and I literally stuck...
here is my code:
awk '
{A[$1]} count [$2]++
{for(i in A) {print i,A[i]}
}' test
Any help?
Thanks,
Amir.
Edit Adding a size_of_bucket variable.
awk -v "size_of_bucket=2" '
{
bucket = int(($1-1)/size_of_bucket);
A[bucket","$2]++;
}
END {
for (i in A) {
print i, A[i];
}
}
'