Better way to manipulate data using AWK and SED - awk

I am wondering if somebody could help me rewrite this in a smarter, more sensible way?
sed -e '1d; $d' <someinputfile> |
awk -F"\t" '{split($2,a,/-/); print $1","a[1]","a[2]","$3","$4","$5","$6}' |
sed -e "s/,/\",\"/g" |
sed 's/^/"/;s/$/"/' |
sed -e $'1i\\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
It should be possible to produce the correct output with awk alone, and I assume there are much better ways to write this.
Shorter way? More efficient way? More correct way? POSIX compliant? GNU compliant?
If you can help please also try to explain the changes as I really want to understand "how" and "what is what" :)
Thanks!
What it does is:
Deletes first and last lines
Splits the second field based on separator - and prints it (here it should be possible to print the right format right away?)
Changes , to "," in the awk output
Adds " at the start and end of every line
Adds a new header
If somebody wants to play with inputfile here is an example:
START 9 1997-07-27T13:37:01Z
X1 24087-27 Axgma8PYjc1yRJlUr41688 1997-07-27T13:09:00Z 9876 OK
X1 642-68 6nwPtLQTqAAKufH3ejoEeg 1997-07-27T14:31:00Z 9876 OK
X1 642-31 qfKH99UnxZTcp2AN8NNB21 1997-07-27T16:15:00Z 9876 OK
X1 642-24 PouJBByqUJkqhKHBynUesD 1997-07-27T16:15:00Z 9876 OK
X1 642-30 J7t2sJKKtcxWJr18I84A46 1997-07-27T16:15:00Z 9876 OK
X1 642-29 g7hPkNpUywvk6FvGqgpHsx 1997-07-27T16:15:00Z 9876 OK
X1 642-26 W2KM24xvmy0Q8cLV950tXq 1997-07-27T16:15:00Z 9876 OK
X1 642-25 dqu8jB5tUthIKevNAQXgld 1997-07-27T16:15:00Z 9876 OK
X1 753-32 Gh0kZkIJr8j6FSYljbpyyy 1997-07-27T16:15:00Z 9876 OK
X1 753-23 Jvl8LMh6SDHfgvLfJIHi5l 1997-07-27T16:15:00Z 9876 OK
X1 753-28 IZ83996cthjhZGYcAk97iJ 1997-07-27T16:15:00Z 9876 OK
X1 753-22 YJwokU0Dq6xiydkf3EDyxl 1997-07-27T16:15:00Z 9876 OK
X1 753-36 OZHOMirRKjA3LcXTbPJL31 1997-07-27T16:15:00Z 9876 OK
X1 753-34 LvMgT6ed1b1e3uwasGi48G 1997-07-27T16:15:00Z 9877 OK
X1 753-35 VJk4x8sTG1BJTnZYvgu6px 1997-07-27T16:15:00Z 9876 OK
X1 663-27 mkZXgTHKBjmAplrDeoQZXo 1997-07-27T16:15:00Z 9875 ERR
X1 f1K1PzQ9sp2QAv1AX0Zix4 1997-07-27T16:27:00Z 9875 ERR
DONE 69 3QXFXKQAFRSZXJLJ6JZ9NWMXR00B1V1J1FUMBQAA9DQSRCTZF8JXAWWSGHSDIPQ9
Thanks!
PS: Since I'm not sure you will get the same output on your computer, here is how it looks when I run it, which is exactly how I want it:
"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"

One awk idea:
awk '
BEGIN { FS="\t"
        OFS="\",\""          # define output field delimiter as <doublequote> <comma> <doublequote>
        # print header
        print "\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven\""
      }
FNR>1 { if (prev) print prev
        split($2,a,"-")
        # reformat current line and save in variable "prev", to be printed on the next pass; add <doublequote> on the ends
        prev = "\"" $1 OFS a[1] OFS a[2] OFS $3 OFS $4 OFS $5 OFS $6 "\""
      }
' input.dat
This generates:
"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"

Given:
sed -E 's/\t/\\t/g' file
START\t9\t1997-07-27T13:37:01Z
X1\t24087-27\tAxgma8PYjc1yRJlUr41688\t1997-07-27T13:09:00Z\t9876\tOK
X1\t642-68\t6nwPtLQTqAAKufH3ejoEeg\t1997-07-27T14:31:00Z\t9876\tOK
X1\t642-31\tqfKH99UnxZTcp2AN8NNB21\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-24\tPouJBByqUJkqhKHBynUesD\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-30\tJ7t2sJKKtcxWJr18I84A46\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-29\tg7hPkNpUywvk6FvGqgpHsx\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-26\tW2KM24xvmy0Q8cLV950tXq\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-25\tdqu8jB5tUthIKevNAQXgld\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-32\tGh0kZkIJr8j6FSYljbpyyy\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-23\tJvl8LMh6SDHfgvLfJIHi5l\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-28\tIZ83996cthjhZGYcAk97iJ\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-22\tYJwokU0Dq6xiydkf3EDyxl\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-36\tOZHOMirRKjA3LcXTbPJL31\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-34\tLvMgT6ed1b1e3uwasGi48G\t1997-07-27T16:15:00Z\t9877\tOK
X1\t753-35\tVJk4x8sTG1BJTnZYvgu6px\t1997-07-27T16:15:00Z\t9876\tOK
X1\t663-27\tmkZXgTHKBjmAplrDeoQZXo\t1997-07-27T16:15:00Z\t9875\tERR
X1\t\tf1K1PzQ9sp2QAv1AX0Zix4\t1997-07-27T16:27:00Z\t9875\tERR
DONE\t69\t3QXFXKQAFRSZXJLJ6JZ9NWMXR00B1V1J1FUMBQAA9DQSRCTZF8JXAWWSGHSDIPQ9
It is a very good idea to use a proper CSV parser to deal with issues like this.
Ruby is ubiquitous and has a very lightweight but capable CSV parser included in the distribution.
Here is a Ruby solution:
ruby -r csv -e '
data = CSV.parse($<.read, col_sep: "\t")
d2 = CSV::Table.new([], headers: ["field_one","field_two","field_three","field_four","field_five","field_six","field_seven"])
data[1...-1].each do |r|
  r_ = []
  r.each_with_index do |e, i|
    if i == 1
      e && e[/-/] ? r_.concat(e.split(/-/, 2)) : r_.concat(["", ""])
    else
      r_ << e
    end
  end
  d2 << r_
end
puts d2.to_csv(force_quotes: true)
' file
Prints:
"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"

I would rework this part of your code
awk -F"\t" '{split($2,a,/-/); print $1","a[1]","a[2]","$3","$4","$5","$6}' | sed -e "s/,/\",\"/g" | sed 's/^/"/;s/$/"/' | sed -e $'1i\\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
in the following way. Step 1: have awk print "," directly rather than printing , and changing it afterwards, i.e.
awk -F"\t" '{split($2,a,/-/); print $1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6}' | sed 's/^/"/;s/$/"/' | sed -e $'1i\\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
Step 2: add the leading " and trailing " in the print itself, i.e.
awk -F"\t" '{split($2,a,/-/); print "\""$1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6"\""}' | sed -e $'1i\\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
Step 3: use a BEGIN block to print the header, i.e.
awk -F"\t" 'BEGIN{print "\"field_one\",\"field_two\",\"field_three\",\"field_four\",\"field_five\",\"field_six\",\"field_seven\""}{split($2,a,/-/); print "\""$1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6"\""}'
(tested in gawk 4.2.1)

Not as elegant a solution as I hoped for, but it gets the job done.
Instead of hard-coding verbal names for the fields, it computes the required header row on the fly from the actual input, which also accounts for the anticipated split of field 2.
gnice gcat sample.txt \
\
| mawk2 'function trim(_) { return \
\
substr("",gsub("^[,\42]*|[,\42]*$","\42",_))_
} BEGIN { FS = "[ \11]+"
OFS = "\42\54\42"
} NR==2 {
for(_=NF+!_;_;_--) {
___=(_)(OFS)___
}
printf("%*s\n",gsub("[0-9]+[^0-9]+",\
"field_&",___)~"",trim(___))
} !/^(START|DONE)/ {
printf("\42%.0s%s\42\n",$1=$(($0=\
$(sub("[-]"," ",$2)<""))~""),$0) } ' | lgp3 3
"field_1","field_2","field_3","field_4","field_5","field_6","field_7"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"


Issues with OFS and sort working together in Bash

Given this type of input:
1,Name1,Type1,100,50
2,Name2,Type2,85,80
3,Name3,Type3,95,92
4,Name4,Type4,60,55
5,Name5,Type5,87,77
I want to calculate the average of the last 2 values and then sort them in decreasing order, so I wrote this bash code:
awk 'BEGIN{FS=","} {avg=($4+$5)/2;print $1,$3,avg}' | sort -k3 -nr
which gives me this output which is very close to my expected output:
3 Type3 93.5
2 Type2 82.5
5 Type5 82
1 Type1 75
4 Type4 57.5
The final thing I want is to separate the output with | (pipes), so I use the variable OFS like this:
awk 'BEGIN{FS=",";OFS="|"} {avg=($4+$5)/2;print $1,$3,avg}' | sort -k3 -nr
The output from this:
5|Type5|82
4|Type4|57.5
3|Type3|93.5
2|Type2|82.5
1|Type1|75
It seems like OFS is breaking the sort. Is this behaviour expected? Is there any workaround for this?
There are two issues in your code. First, the input file name is not passed to awk (which could be a typo). Second, you need to set the field delimiter for sort with the -t'|' option, like this:
awk 'BEGIN{FS=",";OFS="|"} {avg=($4+$5)/2;print $1,$3,avg}' Input_file | sort -t'|' -k3 -nr
3|Type3|93.5
2|Type2|82.5
5|Type5|82
1|Type1|75
4|Type4|57.5
Or, in non-one-liner form, dropping the avg variable, you can compute the average directly in the print (if you use avg elsewhere in the program, keep the variable):
awk '
BEGIN{
  FS=","
  OFS="|"
}
{
  print $1,$3,($4 + $5)/2
}' Input_file |
sort -t'|' -k3 -nr
From man sort page:
-t, --field-separator=SEP
use SEP instead of non-blank to blank transition
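A side note on sort keys: -k3 defines a key that starts at field 3 and extends to the end of the line, while -k3,3 limits it to field 3 alone. With this output field 3 is last, so both behave the same, but the explicit form avoids surprises on wider data:
sort -t'|' -k3,3 -nr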
One more way: you can actually pipe awk's print output into sort from within awk:
$ cat testfile.csv
1,Name1,Type1,100,50
2,Name2,Type2,85,80
3,Name3,Type3,95,92
4,Name4,Type4,60,55
5,Name5,Type5,87,77
$ awk 'BEGIN{FS=",";OFS="|"}{print $1,$3,($4+$5)/2 | "sort -t \"|\" -nrk3"}' testfile.csv
3|Type3|93.5
2|Type2|82.5
5|Type5|82
1|Type1|75
4|Type4|57.5
Using GNU awk's controlling array traversal feature:
gawk '
BEGIN { FS = ","; SUBSEP = "|" }
{ avg = ($4+$5)/2; result[$1,$3,avg] = avg }
END {
PROCINFO["sorted_in"] = "#val_num_desc"
for (line in result) print line
}
' testfile.csv
3|Type3|93.5
2|Type2|82.5
5|Type5|82
1|Type1|75
4|Type4|57.5
SUBSEP is the variable that holds the join string for comma-separated array keys. Its default value is octal 034, the ASCII "file separator" (FS) control character.
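A minimal demonstration of SUBSEP joining the key (a toy example, not part of the answer above):
$ awk 'BEGIN { SUBSEP = "|"; a["x","y"]; for (k in a) print k }'
x|y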

While using awk showing fatal : cannot open pipe ( Too many open files) error

I was trying to mask a file with the tr and awk commands, but it fails with the error fatal: cannot open pipe (Too many open pipes). The file has approximately 1,000,000 records, quite a huge number.
Below is the code I am trying :-
awk - F "|" - v OFS="|" '{ "echo \""$1"\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\"" | get line $1}1' FILE.CSV > test.CSV
It is showing error :-
awk: (FILENAME=- FNR=1019) fatal: cannot open pipe `echo ""TTP_123"" | tr "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" "QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq"' (Too many open pipes)
Please let me know what I am doing wrong here
Also note: any number of columns could be masked, at any positions. In this example I have used column positions 1 and 2, but it could be columns 3 and 10, or 5, 7, and 25.
Thanks
AJ
First things first, you can't have a space between - and F or v.
I was going to suggest sed, but as you only want to translate the first column, that's not as easy.
Unfortunately, awk doesn't have built-in tr functionality, so you'd have to use the shell like you are and just close the pipe (awk keeps a pipe open for each distinct command string until you close() it, and since $1 differs on every line, the open-pipe limit is quickly exhausted):
awk -F "|" -v OFS="|" '{
command="echo \"\\"$1"\\\" | tr \" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\" \" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq\""
command | getline $1
close(command)
}1' FILE.CSV > test.CSV
However, I suggest using perl, which can do field splitting and character translation:
perl -F'\|' -lane '$F[0] =~ tr/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq/; print join("|", @F)' FILE.CSV > test.CSV
Or, for a shorter command line, just put the program into a file, drop the e in -lane and use the file name instead of the '...' command.
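That would look something like this (mask.pl is a hypothetical file name holding the same body as the one-liner):
# mask.pl
$F[0] =~ tr/0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq/;
print join("|", @F);
Run as:
perl -F'\|' -lan mask.pl FILE.CSV > test.CSV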
You can do the mapping in awk instead of making a system call for each line, or perhaps simply:
paste -d'|' <(cut -d'|' -f1 file | tr '0-9' 'a-z') <(cut -d'|' -f2- file)
replace the tr arguments with yours.
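With the question's actual tr sets filled in, that would look along these lines (an untested sketch of the same idea):
paste -d'|' \
  <(cut -d'|' -f1 FILE.CSV | tr ' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' ' QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq') \
  <(cut -d'|' -f2- FILE.CSV)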
This does not answer your question, but you can implement tr as an awk function that would save having to spawn lots of external processes
$ cat tr.awk
function tr(str, from, to, s,i,c,idx) {
s = ""
for (i=1; i<=length(str); i++) {
c = substr(str, i, 1)
idx = index(from, c)
s = s (idx == 0 ? c : substr(to, idx, 1))
}
return s
}
{
print $1, tr($1,
" 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
" QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq")
}
Example:
$ printf "%s\n" hello wor-ld | awk -f tr.awk
hello KGCCN
wor-ld 3N8-CF
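To apply that to the original masking task, the same function can rewrite column 1 in place (a sketch, assuming a |-delimited file and masking only the first column):
awk -F'|' -v OFS='|' '
function tr(str, from, to,    s, i, c, idx) {   # same function as above
    s = ""
    for (i=1; i<=length(str); i++) {
        c = substr(str, i, 1)
        idx = index(from, c)
        s = s (idx == 0 ? c : substr(to, idx, 1))
    }
    return s
}
{ $1 = tr($1, " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ",
              " QWERTYUIOPASDFGHJKLZXCVBNM9876543210mnbvcxzlkjhgfdsapoiuytrewq")
  print }' FILE.CSV > test.CSV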

Merge files with different columns and rows linux

I have some files that I need to join. I have looked for some solutions, but they do not fit what I need. I have the following files:
a.csv
date |A|B|C|D
15-03-2017|1|3|9|4
and
b.csv
date |A|C|D|E
16-03-2017|2|9|3|4
And I would like to get the next output:
date |A|B|C|D|E
15-03-2017|1|3|9|4|0
16-03-2017|2|0|9|3|4
Any insights or suggestions are appreciated!
EDIT:
Thanks for all the answers.
These example files are not always the same:
a file can have between 10 and 50 columns and between 1 and 30 rows (dates).
something like this...
awk 'BEGIN {FS=OFS="|"}
     FNR==1 {split($0,h); next}
     {c++
      for(i=1;i<=NF;i++) {
          a[h[i],c]=$i
          hall[h[i]]
      }}
     END {for(k in hall) printf "%s", k OFS
          print ""
          for(i=1;i<=c;i++) {
              for(k in hall) printf "%s", ((k,i) in a ? a[k,i] : 0) OFS
              print ""
          }}' file1 file2
A|B|C|D|E|date |
1|3|9|4|0|15-03-2017|
2|0|9|3|4|16-03-2017|
you can reorder the columns with some additional code, but perhaps a better solution will show up...
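For instance, here is a sketch of that additional code (my own variation, which keeps the headers in first-seen order instead of relying on awk's unordered for (k in hall) traversal):
awk '
BEGIN  { FS = OFS = "|" }
FNR==1 { nf = split($0, hdr)                 # header labels of the current file
         for (i=1; i<=nf; i++)
             if (!(hdr[i] in seen)) { seen[hdr[i]]; order[++n] = hdr[i] }
         next }
       { r++                                 # store data by label and row
         for (i=1; i<=NF; i++) a[hdr[i], r] = $i }
END    { for (j=1; j<=n; j++) printf "%s%s", order[j], (j<n ? OFS : ORS)
         for (i=1; i<=r; i++)
             for (j=1; j<=n; j++)
                 printf "%s%s", ((order[j], i) in a ? a[order[j], i] : 0), (j<n ? OFS : ORS)
       }' a.csv b.csv
Since "date " is seen first, this prints date |A|B|C|D|E with the dates leading, matching the requested layout.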
A slightly simpler solution using sed:
sed 's/$/|0/;1s/0/E/' a.csv; sed '1d;s/|/|0|/2' b.csv
(The first sed appends |0 to every line of a.csv and then turns that 0 into E on the header line; the second drops b.csv's header and inserts a 0 field after the second |.)
I am expanding my answer into a script for readability. You can use the script with the awk -f option. Starting with a BEGIN statement: specify | as your field separator, map an index to each header label using associative array and print the full header. For each file, map labels to indices on line 1. Then map labels to data on line 2 and replace empty data fields with "0". Print the filled out line and clear the arrays for the next file.
BEGIN{
# field separator
FS="|"
# index:label mapping
map[1]="date "; map[2]="A"; map[3]="B"
map[4]="C"; map[5]="D"; map[6]="E"
# print full header
print "date |A|B|C|D|E"
}
# first line of each file, create index:label mapping
FNR==1{
for (i=1;i<=NF;i++)
label[i]=$i
}
# next line of the file, create label:data mapping
FNR==2{
for (i=1;i<=NF;i++)
data[label[i]]=$i
# cycle through index:label mapping and print data
# for each label or "0" if there is no data
printf("%s", data["date "])
for (i=2;i<=6;i++) {
(data[map[i]]) ? s=data[map[i]] : s=0
printf("|%s", s)
}
print "" # print empty string for newline
# delete arrays to start from scratch on the following file
delete label
delete data
}
Result on the two example files:
$ awk -f joiner.awk a.csv b.csv
date |A|B|C|D|E
15-03-2017|1|3|9|4|0
16-03-2017|2|0|9|3|4
I solved this problem by changing the way I get the data.
Something like this:
today=$(date +%Y%m%d)
echo "DataBase "$(date +%d/%m/%Y)>/jdb"$today".txt
du -s $(ls -l|grep ^d|awk '{print $9}')|awk '{print $2" "$1" "}'>>/jdb"$today".txt
the output is like:
jdb_20170507.txt:
database 07/05/2017
jdb_A 4345
jdb_CFX 7654
jdb_ZZXD 97865
jdb_20170508.txt:
database 08/05/2017
jdb_A 9876
jdb_CFX 7545
jdb_ZXCFG 2344
In these examples, between the two files, the jdb_ZZXD database was deleted and the jdb_ZXCFG database was created.
with this structure I can use JOIN command:
x=0
touch jdbaux$x.txt
for jdbfile in $(ls -1t|grep jdb2)
do
y=$(($x+1))
join -a1 -a2 -e0 -o auto --nocheck-order jdbaux$x.txt $jdbfile >jdbaux$y.txt
rm jdbaux$x.txt
x=$(($x+1))
done
This is my rustic recursive JOIN option for all the archives of the month:
-a1 = file one
-a2 = file two
-e0 = replace missing input fields with 0
-o auto = output automatic format
--nocheck-order = do not check that the input is correctly sorted
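A toy example of what those flags do, on two tiny sorted files (hypothetical contents):
$ cat f1
jdb_A 4345
jdb_CFX 7654
$ cat f2
jdb_A 9876
jdb_ZXCFG 2344
$ join -a1 -a2 -e0 -o auto f1 f2
jdb_A 4345 9876
jdb_CFX 7654 0
jdb_ZXCFG 0 2344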
In my case the output is like:
jdb_sizes201705.txt:
database 07/05/2017 08/05/2017
jdb_A 4345 9876
jdb_CFX 7654 7545
jdb_ZZXD 97865 0
jdb_ZXCFG 0 2344
and the last step is a pivot (transpose):
cat jdb_sizes201705.txt |awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j];
}
print str
}
}'
Obtaining the expected output
database jdb_A jdb_CFX jdb_ZZXD jdb_ZXCFG
07/05/2017 4345 7654 97865 0
08/05/2017 9876 7545 0 2344
I know it's not the best solution but it works!
Thanks!

Order a column then print a certain row with awk in command line

I have a txt file like this:
ID row1 row2 row3 score
rs16 ... ... ... 0.23
rs52 ... ... ... 1.43
rs87 ... ... ... 0.45
rs89 ... ... ... 2.34
rs67 ... ... ... 1.89
Columns row1 through row3 do not matter.
I have about 8 million rows, and the scores range from 0 to 3. I would like to find the score that marks the top 1%. I was thinking of re-ordering the data by score and then printing the ~80,000th line. What do you think would be the best way to do this?
With GNU coreutils you can do it like this:
sort -k5gr <(tail -n+2 infile) | head -n80KB
You can increase to speed of the above pipeline by removing columns 2 through 4 like this:
tr -s ' ' < infile | cut -d' ' -f1,5 > outfile
Or taken together (note the sort key becomes field 2, since cut leaves only two columns):
sort -k2gr <(tail -n+2 <(tr -s ' ' < infile | cut -d' ' -f1,5)) | head -n80KB
Edit
I noticed that you are only interested in the 80000th line of the result; in that case sed -n '80000{p;q}' in place of head, as you suggested, is the way to go.
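Put together, that would be something like:
sort -k5gr <(tail -n+2 infile) | sed -n '80000{p;q}'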
Explanation
tail:
-n+2 - skip header.
sort:
k5 - sort on 5th column.
gr - flags that make sort choose reverse general-numeric-sort.
head:
n - number of lines to keep. KB is a 1000 multiplier, see info head for others.
With GNU awk you can sort the values by setting the PROCINFO["sorted_in"] to "#val_num_desc". For example like this:
parse.awk
# Set sorting method
BEGIN { PROCINFO["sorted_in"]="#val_num_desc" }
# Print header
NR == 1 { print $1, $5 }
# Save 1st and 5th columns in g and h hashes respectively
NR>1 { g[NR] = $1; h[NR] = $5 }
# Print values from g and h until ratio is reached
END {
for(k in h) {
if(i++ >= int(0.5 + NR*ratio_to_keep))
exit
print g[k], h[k]
}
}
Run it like this:
awk -f parse.awk OFS='\t' ratio_to_keep=.01 infile

How to split a delimited string into an array in awk?

How to split the string when it contains pipe symbols | in it.
I want to split them to be in array.
I tried
echo "12:23:11" | awk '{split($0,a,":"); print a[3] a[2] a[1]}'
Which works fine. If my string is like "12|23|11" then how do I split them into an array?
Have you tried:
echo "12|23|11" | awk '{split($0,a,"|"); print a[3],a[2],a[1]}'
To split a string to an array in awk we use the function split():
awk '{split($0, array, ":")}'
# \/ \___/ \_/
# | | |
# string | delimiter
# |
# array to store the pieces
If no separator is given, it uses the FS, which defaults to the space:
$ awk '{split($0, array); print array[2]}' <<< "a:b c:d e"
c:d
We can give a separator, for example ::
$ awk '{split($0, array, ":"); print array[2]}' <<< "a:b c:d e"
b c
Which is equivalent to setting it through the FS:
$ awk -F: '{split($0, array); print array[2]}' <<< "a:b c:d e"
b c
In GNU Awk you can also provide the separator as a regexp:
$ awk '{split($0, array, ":*"); print array[2]}' <<< "a:::b c::d e
#note multiple :
b c
And even see what the delimiter was on every step by using its fourth parameter:
$ awk '{split($0, array, ":*", sep); print array[2]; print sep[1]}' <<< "a:::b c::d e"
b c
:::
Let's quote the man page of GNU awk:
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If fieldsep is omitted, the value of FS is used. split() returns the number of elements created. seps is a gawk extension, with seps[i] being the separator string between array[i] and array[i+1]. If fieldsep is a single space, then any leading whitespace goes into seps[0] and any trailing whitespace goes into seps[n], where n is the return value of split() (i.e., the number of elements in array).
Please be more specific! What do you mean by "it doesn't work"?
Post the exact output (or error message), your OS and awk version:
% awk -F\| '{
for (i = 0; ++i <= NF;)
print i, $i
}' <<<'12|23|11'
1 12
2 23
3 11
Or, using split:
% awk '{
n = split($0, t, "|")
for (i = 0; ++i <= n;)
print i, t[i]
}' <<<'12|23|11'
1 12
2 23
3 11
Edit: on Solaris you'll need to use the POSIX awk (/usr/xpg4/bin/awk) in order to process 4000 fields correctly.
I do not like the echo "..." | awk ... solution as it makes unnecessary fork and exec system calls.
I prefer Dimitre's solution, with a little twist:
awk -F\| '{print $3 $2 $1}' <<<'12|23|11'
Or a bit shorter version:
awk -F\| '$0=$3 $2 $1' <<<'12|23|11'
In this case the output record is assembled right in the pattern; since the result is a non-empty string, the condition is true and the line gets printed.
In this specific case the stdin redirection can be avoided by setting an awk variable instead:
awk -v T='12|23|11' 'BEGIN{split(T,a,"|");print a[3] a[2] a[1]}'
I used ksh for quite a while, but in bash this can be managed with built-in string manipulation. In the first case the original string is split by the internal terminator. In the second case it is assumed that the string always contains digit pairs separated by a one-character separator.
T='12|23|11';echo -n ${T##*|};T=${T%|*};echo ${T#*|}${T%|*}
T='12|23|11';echo ${T:6}${T:3:2}${T:0:2}
The result in all cases is
112312
Actually awk has a feature called the Input Field Separator Variable. This is how to use it. It's not really an array, but it uses the internal $ variables. For splitting a simple string it is easier.
echo "12|23|11" | awk 'BEGIN {FS="|";} { print $1, $2, $3 }'
I know this is kind of an old question, but I thought someone might like my trick. Especially since this solution is not limited to a specific number of items.
# Convert to an array
_ITEMS=($(echo "12|23|11" | tr '|' '\n'))
# Output array items
for _ITEM in "${_ITEMS[@]}"; do
echo "Item: ${_ITEM}"
done
The output will be:
Item: 12
Item: 23
Item: 11
Joke? :)
How about echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
This is my output:
p2> echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
112312
so I guess it's working after all..
echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
should work.
echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
Code:
awk -F"|" '{split($0,a); print a[1],a[2],a[3]}' <<< '12|23|11'
Output:
12 23 11
The challenge: parse strings that contain spaces into separate values and store them in variables.
Solution: the simplest choice is to convert the string into an array and then access its elements by index. Here's an example of how to convert and access the array.
Example: parse disk space statistics on each line:
sudo df -k | awk 'NR>1' | while read -r line; do
#convert into array:
array=($line)
#variables:
filesystem="${array[0]}"
size="${array[1]}"
capacity="${array[4]}"
mountpoint="${array[5]}"
echo "filesystem:$filesystem|size:$size|capacity:$capacity|mountpoint:$mountpoint"
done
#output:
filesystem:/dev/dsk/c0t0d0s1|size:4000|capacity:40%|mountpoint:/
filesystem:/dev/dsk/c0t0d0s2|size:5000|capacity:50%|mountpoint:/usr
filesystem:/proc|size:0|capacity:0%|mountpoint:/proc
filesystem:mnttab|size:0|capacity:0%|mountpoint:/etc/mnttab
filesystem:fd|size:1000|capacity:10%|mountpoint:/dev/fd
filesystem:swap|size:9000|capacity:9%|mountpoint:/var/run
filesystem:swap|size:1500|capacity:15%|mountpoint:/tmp
filesystem:/dev/dsk/c0t0d0s3|size:8000|capacity:80%|mountpoint:/export
awk -F'[|]' '{print $1"\t"$2"\t"$3}' <<<'12|23|11'