Merge files with different columns and rows linux - awk

I have some files that I need join, I have looked for some solutions but they do not fit what I need, I have the following files
a.csv
date |A|B|C|D
15-03-2017|1|3|9|4
and
b.csv
date |A|C|D|E
16-03-2017|2|9|3|4
And I would like to get the next output:
date |A|B|C|D|E
15-03-2017|1|3|9|4|0
16-03-2017|2|0|9|3|4
Any insights or suggestions are appreciated!
EDIT:
Thanks for all
These file examples are not always the same
sometimes it can have between 10 and 50 columns and between 1 and 30 rows (dates)

something like this...
awk 'BEGIN {FS=OFS="|"}
FNR==1 {split($0,h); next}
{c++;
for(i=1;i<=NF;i++)
{a[h[i],c]=$i;
hall[h[i]]}}
END {for(k in hall) printf "%s", k OFS;
print "";
for(i=1;i<=c;i++)
{for(k in hall) printf "%s", ((k,i) in a?a[k,i]:0) OFS;
print ""}}' file1 file2
A|B|C|D|E|date |
1|3|9|4|0|15-03-2017|
2|0|9|3|4|16-03-2017|
you can reorder the columns with some additional code, but perhaps a better solution will show up...

A slightly simpler solution using sed:
sed 's/$/|0/;1s/0/E/' a.csv; sed '1d;s/|/|0|/2' b.csv

I am expanding my answer into a script for readability. You can use the script with the awk -f option. Starting with a BEGIN statement: specify | as your field separator, map an index to each header label using associative array and print the full header. For each file, map labels to indices on line 1. Then map labels to data on line 2 and replace empty data fields with "0". Print the filled out line and clear the arrays for the next file.
BEGIN{
# field separator
FS="|"
# index:label mapping
map[1]="date "; map[2]="A"; map[3]="B"
map[4]="C"; map[5]="D"; map[6]="E"
# print full header
print "date |A|B|C|D|E"
}
# first line of each file, create index:label mapping
FNR==1{
for (i=1;i<=NF;i++)
label[i]=$i
}
# next line of the file, create label:data mapping
FNR==2{
for (i=1;i<=NF;i++)
data[label[i]]=$i
# cycle through index:label mapping and print data
# for each label or "0" if there is no data
printf("%s", data["date "])
for (i=2;i<=6;i++) {
(data[map[i]]) ? s=data[map[i]] : s=0
printf("|%s", s)
}
print "" # print empty string for newline
# delete arrays to start from scratch on the following file
delete label
delete data
}
Result on the two example files:
$ awk -f joiner.awk a.csv b.csv
date |A|B|C|D|E
15-03-2017|1|3|9|4|0
16-03-2017|2|0|9|3|4

I solved this problem
I changed the way to get the data
Something like this:
today=$(date +%Y%m%d)
echo "DataBase "$(date +%d/%m/%Y)>/jdb"$today".txt
du -s $(ls -l|grep ^d|awk '{print $9})|awk '{print $2" "$1" "}'>>/jdb"$today".txt
the output is like:
jdb_20170507.txt:
database 07/05/2017
jdb_A 4345
jdb_CFX 7654
jdb_ZZXD 97865
jdb_20170508.txt:
database 08/05/2017
jdb_A 9876
jdb_CFX 7545
jdb_ZXCFG 2344
for this examples in jdb_20170508.txt was deleted the jdb_ZZXD database and created the jdb_ZXCFG database
with this structure I can use JOIN command:
x=0
touch jdbaux$x.txt
for jdbfile in $(ls -1t|grep jdb2)
do
y=$(($x+1))
join -a1 -a2 -e0 -o auto --nocheck-order jdbaux$x.txt $jdbfile >jdbaux$y.txt
rm jdbaux$x.txt
x=$(($x+1))
done
This is my recursive JOIN rustic option for all the archives of the month
-a1= file one
-a2=file two
-e0=replace missing input fields with 0
-o auto= output automatic format
--nocheck-order =do not check that the input is correctly sorted
the output is like:
jdb_sizes201705.txt:
database 07/05/2017 08/05/2017
jdb_A 4345 9876
jdb_CFX 7654 7545
jdb_ZZXD 97865 0
jdb_ZXCFG 0 2344
and the last step is a pivote
cat jdb_sizes201705.txt |awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j];
}
print str
}
}'
Obtaining the expected output
database jdb_A jdb_CFX jdb_ZZXD jdb_ZXCFG
07/05/2017 4345 7654 97865 0
08/05/2017 9876 7545 0 2344
I know it's not the best solution but it works!
Thanks!

Related

Filtering using awk returns empty files

I have a similar problem to this question: How to do filtering of multiple files in a directory using awk?
The solution in the answers of the question above does not work for me.
I have tab-delimited txt files (all in folder Observation_by_pracid). For each file, I want to create a new file that only contains rows with a specific value in column $9 (medcodeid). The specific values are to be found in medicalcode_list.txt.
There is no error, however it returns only empty files.
Codelist
medcodeid
2576
3199
Format of input files
patid consid ... medcodeid
500470520002 3062539302 ... 2576
951924020002 3062538414 ... 310803013
503478020002 3061587464 ... 257619018
951924020002 3062537807 ... 55627011
503576720002 3062537720 ... 3199
Desired output
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
My code
mkdir HBA1C_observation_bypracid
awk '
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Solution
mkdir HBA1C_observation_bypracid
awk '
BEGIN{ FS=OFS="\t" }
NR==FNR {mlist[$1]; next }
FNR==1 {close(out); out="HBA1C_observation_bypracid/HBA1C_" FILENAME }
($9 in mlist) { print > out }
' PATH/medicalcode_list.txt *.txt
Adding "BEGIN..." solved my problem.
You can join two files on a column using join.
Files must be sorted on the joined column. To perform a numerical sort on a column, use sort this way, where N is the column number:
sort -kN -n FILE
You also need to get ride of the first line (column names) of each files. You can use tail command the way below, where N is the number of line from which you want to output the content (so 2nd line):
tail -n +N
... But still need to display the column values:
head -n 1 FILE
To join two files f1 and f2, on the fields c1 of f1 and c2 of f2, and output fields y of files x:
join -1 c1 -2 c2 f1 f2 -o "x.y, x.y"
Working sample:
head -n 1 input_file
for input_file in *.txt ; do
join -1 1 -2 9 -o "2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9" \
<(tail -n +2 PATH/medicalcode_list.txt | sort -k1 -n) \
<(tail -n +2 "$input_file" | sort -k3 -n)
done
Result (for the input file you gave):
patid consid ... medcodeid
500470520002 3062539302 ... 2576
503576720002 3062537720 ... 3199
Note: the column names arent aligned with the values. Don't know if it's a prerequisite. You can format the display with printf command.
Personally I think it would be simpler to loop over in the shell (understanding that this will reread the code list more than once), with a simpler awk function that you should be able to test and debug. Something like:
for file in *.txt; do
awk 'FNR == NR { mlist[$1] } FNR != NR && ($9 in mlist) { print }' \
PATH/medicalcode_list.txt "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done
You should be able to start without the redirection to make sure that for a single file, you get the results printed to the terminal that you were expected. If you don't there might be some incorrect assumption about the files.
Another option would be to write a separate awk script that writes the code to hard-code the list in another awk script. Also gives the advantage to check the contents of the variable mlist.
printf 'BEGIN {\n%s\n}\n $9 in mlist { print }' \
"$(awk '{ print "mlist[" $1 "]" }' PATH/medicalcode_list.txt)" > filter.awk
for file in *.txt; do
awk -f filter.awk "$file" > HBA1C_observation_bypracid/HBA1C_"$file"
done

linux csv file concatenate columns into one column

I've been looking to do this with sed, awk, or cut. I am willing to use any other command-line program that I can pipe data through.
I have a large set of data that is comma delimited. The rows have between 14 and 20 columns. I need to recursively concatenate column 10 with column 11 per row such that every row has exactly 14 columns. In other words, this:
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
will become:
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
I can get the first 10 columns. I can get the last N columns. I can concatenate columns. I cannot think of how to do it in one line so I can pass a stream of endless data through it and end up with exactly 14 columns per row.
Examples (by request):
How many columns are in the row?
sed 's/[^,]//g' | wc -c
Get the first 10 columns:
cut -d, -f1-10
Get the last 4 columns:
rev | cut -d, -f1-4 | rev
Concatenate columns 10 and 11, showing columns 1-10 after that:
awk -F',' ' NF { print $1","$2","$3","$4","$5","$6","$7","$8","$9","$10$11}'
Awk solution:
awk 'BEGIN{ FS=OFS="," }
{
diff = NF - 14;
for (i=1; i <= NF; i++)
printf "%s%s", $i, (diff > 1 && i >= 10 && i < (10+diff)?
"": (i == NF? ORS : ","))
}' file
The output:
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
With GNU awk for the 3rd arg to match() and gensub():
$ cat tst.awk
BEGIN{ FS="," }
match($0,"(([^,]+,){9})(([^,]+,){"NF-14"})(.*)",a) {
$0 = a[1] gensub(/,/,"","g",a[3]) a[5]
}
{ print }
$ awk -f tst.awk file
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
If perl is okay - can be used just like awk for stream processing
$ cat ip.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4
$ awk -F, '{print NF}' ip.txt
16
18
22
$ perl -F, -lane '$n = $#F - 4;
print join ",", (#F[0..8], join("", #F[9..$n]), #F[$n+1..$#F])
' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4
-F, -lane split on , results saved in #F array
$n = $#F - 4 magic number, to ensure output ends with 14 columns. $#F gives the index of last element of array (won't work if input line has less than 14 columns)
join helps to stitch array elements together with specified string
#F[0..8] array slice with first 9 elements
#F[9..$n] and #F[$n+1..$#F] the other slices as needed
Borrowing from Ed Morton's regex based solution
$ perl -F, -lape '$n=$#F-13; s/^([^,]*,){9}\K([^,]*,){$n}/$&=~tr|,||dr/e' ip.txt
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
1,2,3,4,5,6,3,4,2,43432,5,2,3,4
1,2,3,4,5,6,3,4,2,4asfe3432,5,2,3,4
$n=$#F-13 magic number
^([^,]*,){9}\K first 9 fields
([^,]*,){$n} fields to change
$&=~tr|,||dr use tr to delete the commas
e this modifier allows use of Perl code in replacement section
this solution also has the added advantage of working even if input field is less than 14
You can try this gnu sed
sed -E '
s/,/\n/9g
:A
s/([^\n]*\n)(.*)(\n)(([^\n]*\n){4})/\1\2\4/
tA
s/\n/,/g
' infile
First variant - with awk
awk -F, '
{
for(i = 1; i <= NF; i++) {
OFS = (i > 9 && i < NF - 4) ? "" : ","
if(i == NF) OFS = "\n"
printf "%s%s", $i, OFS
}
}' input.txt
Second variant - with sed
sed -r 's/,/#/10g; :l; s/#(.*)((#[^#]){4})/\1\2/; tl; s/#/,/g' input.txt
or, more straightforwardly (without loop) and probably faster.
sed -r 's/,(.),(.),(.),(.)$/#\1#\2#\3#\4/; s/,//10g; s/#/,/g' input.txt
Testing
Input
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u
Output
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p
a,b,c,d,e,f,g,h,i,jklmn,o,p,q,r
a,b,c,d,e,f,g,h,i,jklmnopq,r,s,t,u
Solved a similar problem using csvtool. Source file, copied from one of the other answers:
$ cat input.txt
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p
1,2,3,4,5,6,3,4,2,4,3,4,3,2,5,2,3,4
1,2,3,4,5,6,3,4,2,4,a,s,f,e,3,4,3,2,5,2,3,4
Concatenating columns:
$ cat input.txt | csvtool format '%1,%2,%3,%4,%5,%6,%7,%8,%9,%10%11%12,%13,%14,%15,%16,%17,%18,%19,%20,%21,%22\n' -
a,b,c,d,e,f,g,h,i,jkl,m,n,o,p,,,,,,
1,2,3,4,5,6,3,4,2,434,3,2,5,2,3,4,,,,
1,2,3,4,5,6,3,4,2,4as,f,e,3,4,3,2,5,2,3,4
anatoly#anatoly-workstation:cbs$ cat input.txt

How to do calculations over lines of a file in awk

I've got a file that looks like this:
88.3055
45.1482
37.7202
37.4035
53.777
What I have to do is isolate the value from the first line and divide it by the values of the other lines (it's a speedup calculation). I thought of maybe storing the first line in a variable (using NR) and then iterate over the other lines to obtain the values from the divisions. Desired output is:
1,9559
2,3410
2,3608
1,6420
UPDATE
Sorry Ed, my mistake, the desired decimal point is .
I made some small changes to Ed's answer so that awk prints the division of 88.3055 by itself and outputs it to a file speedup.dat:
awk 'NR==1{n=$0} {print n/$0}' tavg.dat > speedup.dat
Is it possible to combine the contents of speedup.dat and the results from another awk command without using intermediate files and in one single awk command?
First command:
awk 'BEGIN { FS = \"[ \\t]*=[ \\t]*\" } /Total processes/ { if (! CP) CP = $2 } END {print CP}' cg.B.".n.".log ".(n == 1 ? ">" : ">>")." processes.dat
This first command outputs:
1
2
4
8
16
Paste of the two files:
paste processes.dat speedup.dat > prsp.dat
which gives the now desired output:
1 1
2 1.9559
4 2.34107
8 2.36089
16 1.64207
$ awk 'NR==1{n=$0;next} {print n/$0}' file
1.9559
2.34107
2.36089
1.64207
$ awk 'NR==1{n=$0;next} {printf "%.4f\n", n/$0}' file
1.9559
2.3411
2.3609
1.6421
$ awk 'NR==1{n=$0;next} {printf "%.4f\n", int(n*10000/$0)/10000}' file
1.9559
2.3410
2.3608
1.6420
$ awk 'NR==1{n=$0;next} {x=sprintf("%.4f",int(n*10000/$0)/10000); sub(/\./,",",x); print x}' file
1,9559
2,3410
2,3608
1,6420
Normally you'd just use the correct locale to have . or , as your decimal point but your input uses . while your output uses , so I don't think that's an option.
awk '{if(n=="") n=$1; else print n/$1}' inputFile

How to use awk sort by column 3

I have a file (user.csv)like this
ip,hostname,user,group,encryption,aduser,adattr
want to print all column sort by user,
I tried awk -F ":" '{print|"$3 sort -n"}' user.csv , it doesn't work.
How about just sort.
sort -t, -nk3 user.csv
where
-t, - defines your delimiter as ,.
-n - gives you numerical sort. Added since you added it in your
attempt. If your user field is text only then you dont need it.
-k3 - defines the field (key). user is the third field.
Use awk to put the user ID in front.
Sort
Use sed to remove the duplicate user ID, assuming user IDs do not contain any spaces.
awk -F, '{ print $3, $0 }' user.csv | sort | sed 's/^.* //'
Seeing as that the original question was on how to use awk and every single one of the first 7 answers use sort instead, and that this is the top hit on Google, here is how to use awk.
Sample net.csv file with headers:
ip,hostname,user,group,encryption,aduser,adattr
192.168.0.1,gw,router,router,-,-,-
192.168.0.2,server,admin,admin,-,-,-
192.168.0.3,ws-03,user,user,-,-,-
192.168.0.4,ws-04,user,user,-,-,-
And sort.awk:
#!/usr/bin/awk -f
# usage: ./sort.awk -v f=FIELD FILE
BEGIN {
FS=","
}
# each line
{
a[NR]=$0 ""
s[NR]=$f ""
}
END {
isort(s,a,NR);
for(i=1; i<=NR; i++) print a[i]
}
#insertion sort of A[1..n]
function isort(S, A, n, i, j) {
for( i=2; i<=n; i++) {
hs = S[j=i]
ha = A[j=i]
while (S[j-1] > hs) {
j--;
S[j+1] = S[j]
A[j+1] = A[j]
}
S[j] = hs
A[j] = ha
}
}
To use it:
awk sort.awk f=3 < net.csv # OR
chmod +x sort.awk
./sort.awk f=3 net.csv
You can choose a delimiter, in this case I chose a colon and printed the column number one, sorting by alphabetical order:
awk -F\: '{print $1|"sort -u"}' /etc/passwd
awk -F, '{ print $3, $0 }' user.csv | sort -nk2
and for reverse order
awk -F, '{ print $3, $0 }' user.csv | sort -nrk2
try this -
awk '{print $0|"sort -t',' -nk3 "}' user.csv
OR
sort -t',' -nk3 user.csv
awk -F "," '{print $0}' user.csv | sort -nk3 -t ','
This should work
To exclude the first line (header) from sorting, I split it out into two buffers.
df | awk 'BEGIN{header=""; $body=""} { if(NR==1){header=$0}else{body=body"\n"$0}} END{print header; print body|"sort -nk3"}'
With GNU awk:
awk -F ',' '{ a[$3]=$0 } END{ PROCINFO["sorted_in"]="#ind_str_asc"; for(i in a) print a[i] }' file
See 8.1.6 Using Predefined Array Scanning Orders with gawk for more sorting algorithms.
I'm running Linux (Ubuntu) with mawk:
tmp$ awk -W version
mawk 1.3.4 20200120
Copyright 2008-2019,2020, Thomas E. Dickey
Copyright 1991-1996,2014, Michael D. Brennan
random-funcs: srandom/random
regex-funcs: internal
compiled limits:
sprintf buffer 8192
maximum-integer 2147483647
mawk (and gawk) has an option to redirect the output of print to a command. From man awk chapter 9. Input and output:
The output of print and printf can be redirected to a file or command by appending > file, >> file or | command to the end of the print statement. Redirection opens file or command only once, subsequent redirections append to the already open stream.
Below you'll find a simplied example how | can be used to pass the wanted records to an external program that makes the hard work. This also nicely encapsulates everything in a single awk file and reduces the command line clutter:
tmp$ cat input.csv
alpha,num
D,4
B,2
A,1
E,5
F,10
C,3
tmp$ cat sort.awk
# print header line
/^alpha,num/ {
print
}
# all other lines are data lines that should be sorted
!/^alpha,num/ {
print | "sort --field-separator=, --key=2 --numeric-sort"
}
tmp$ awk -f sort.awk input.csv
alpha,num
A,1
B,2
C,3
D,4
E,5
F,10
See man sort for the details of the sort options:
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
-k, --key=KEYDEF
sort via a key; KEYDEF gives location and type
-n, --numeric-sort
compare according to string numerical value

Is there a way to completely delete fields in awk, so that extra delimiters do not print?

Consider the following command:
$ gawk -F"\t" "BEGIN{OFS=\"\t\"}{$2=$3=\"\"; print $0}" Input.tsv
When I set $2 = $3 = "", the intended effect is to get the same effect as writing:
print $1,$4,$5...$NF
However, what actually happens is that I get two empty fields, with the extra field delimiters still printing.
Is it possible to actually delete $2 and $3?
Note: If this was on Linux in bash, the correct statement above would be the following, but Windows does not handle single quotes well in cmd.exe.
$ gawk -F'\t' 'BEGIN{OFS="\t"}{$2=$3=""; print $0}' Input.tsv
This is an oldie but goodie.
As Jonathan points out, you can't delete fields in the middle, but you can replace their contents with the contents of other fields. And you can make a reusable function to handle the deletion for you.
$ cat test.awk
function rmcol(col, i) {
for (i=col; i<NF; i++) {
$i = $(i+1)
}
NF--
}
{
rmcol(3)
}
1
$ printf 'one two three four\ntest red green blue\n' | awk -f test.awk
one two four
test red blue
You can't delete fields in the middle, but you can delete fields at the end, by decrementing NF.
So you can shift all the later fields down to overwrite $2 and $3 then decrement NF by two, which erases the last two fields:
$ echo 1 2 3 4 5 6 7 | awk '{for(i=2; i<NF-1; ++i) $i=$(i+2); NF-=2; print $0}'
1 4 5 6 7
If you are just looking to remove columns, you can use cut:
$ cut -f 1,4- file.txt
To emulate cut:
$ awk -F "\t" '{ for (i=1; i<=NF; i++) if (i != 2 && i != 3) { if (i == NF) printf $i"\n"; else printf $i"\t" } }' file.txt
Similarly:
$ awk -F "\t" '{ delim =""; for (i=1; i<=NF; i++) if (i != 2 && i != 3) { printf delim $i; delim = "\t"; } printf "\n" }' file.txt
HTH
The only way I can think to do it in Awk without using a loop is to use gsub on $0 to combine adjacent FS:
$ echo {1..10} | awk '{$2=$3=""; gsub(FS"+",FS); print}'
1 4 5 6 7 8 9 10
One way could be to remove fields like you do and remove extra spaces with gsub:
$ awk 'BEGIN { FS = "\t" } { $2 = $3 = ""; gsub( /\s+/, "\t" ); print }' input-file
In the addition of the answer by Suicidal Steve I'd like to suggest one more solution but using sed instead awk.
It seems more complicated than usage of cut as it was suggested by Steve. But it was the better solution because sed -i allows editing in-place.
$ sed -i 's/\(.*,\).*,.*,\(.*\)/\1\2/' FILENAME
To remove fields 2 and 3 from a given input file (assuming a tab field separator), you can remove the fields from $0 using gensub and regenerate it as follows:
awk -F '\t' 'BEGIN{OFS="\t"}\
{$0=gensub(/[^\t]*\t/,"",3);\
$0=gensub(/[^\t]*\t/,"",2);\
print}' Input.tsv
The method presented in the answer of ghoti has some problems:
every assignment of $i = $(i+1) forces awk to rebuild the record $0. This implies that if you have 100 fields and you want to delete field 10, you rebuild the record 90 times.
changing the value of NF manually is not posix compliant and leads to undefined behaviour (as is mentioned in the comments).
A somewhat more cumbersome, but stable robust way to delete a set of columns would be:
a single column:
awk -v del=3 '
BEGIN{FS=fs;OFS=ofs}
{ b=""; for(i=1;i<=NF;++i) if(i!=del) b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
multiple columns:
awk -v del=3,5,7 '
BEGIN{FS=fs;OFS=ofs; del="," del ","}
{ b=""; for(i=1;i<=NF;++i) if (del !~ ","i",") b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
Well, if the goal is to remove the extra delimiters, then you can use tr on Linux. Example:
$ echo "1,2,,,5" | tr -s ','
1,2,5
echo one two three four five six|awk '{
print $0
is3=$3
$3=""
print $0
print is3
}'
one two three four five six
one two four five six
three