awk: split a file and drop columns, but something wrong with the substr part

I have some awk code that splits a file by columns and prints the output to new file names.
awk -F"|" 'NR==1 {h=substr($0, index($0,$5)); next}
{file= path ""$1""$2"_"$3"_"$4"_03042017.csv"; print (a[file]++?"": "DM9 03042017" ORS h ORS) substr($0, index($0,$5)) > file}
END{for(file in a) print "EOF " a[file] > file}'
Since I use substr($0, index($0,$5)), the new output should only contain the data from the fifth column onward. It works fine except when the input data contains repeated values.
For example,
product | ID | Branch | Office | Type | ....
ABC | 12 | KL | GH | Z | ....
For the above example the code works well, because all the field values are different.
product | ID | Branch | Office | Type | ....
ABC | 12 | KK | KK | Z | ....
But with input like the second example, where the third and fourth columns have the same value, the code doesn't work well. Instead of the output starting at the fifth column, I get output starting at the third column.
So I suspect that because the third and fourth columns hold the same data, substr starts at the third column, since I used index.
Can anyone help me with this? Sorry for the long post, and I'd appreciate any ideas. Thank you.
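The suspicion is essentially right: index($0, s) returns the position of the first occurrence of s anywhere in the line, so when a field's value also appears in an earlier column, substr() starts too early. A minimal demonstration with the second sample, using $4 (the same thing happens to $5 whenever its value occurs earlier in the line):
$ echo 'ABC | 12 | KK | KK | Z' | awk -F'|' '{ print index($0, $4) }'
11
With -F'|' the fourth field is " KK " and starts at position 16, but index() reports position 11, where the third column's identical text begins, so substr($0, index($0,$4)) returns everything from the third column onward.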

If the structure is fixed like your sample (fixed field layout):
awk -F '[[:blank:]]*[|][[:blank:]]*' -v path="./" '
NR==1 {
    # build the trimmed header: blank the first 4 fields, then strip the leading blanks/pipes
    for (i=1; i<5; i++) $i = ""
    h = $0; sub(/^[[:blank:]|]+/, "", h)
    next
}
{
    file = path $1 $2 "_" $3 "_" $4 "_03042017.csv"
    # remove the first 4 fields
    for (i=1; i<5; i++) $i = ""
    # strip the leading blanks/pipes left behind
    Cleaned = $0; sub(/^[[:blank:]|]+/, "", Cleaned)
    print (a[file]++ ? "" : "DM9 03042017" ORS h ORS) Cleaned > file
}
END {
    for (file in a) print "EOF " a[file] > file
}
' YourFile
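For example, run against the first sample above, the command creates ./ABC12_KL_GH_03042017.csv (a sketch; the .... placeholder columns are carried through as-is):
$ cat ABC12_KL_GH_03042017.csv
DM9 03042017
Type ....
Z ....
EOF 1
Note that the remaining | separators are gone: assigning to a field makes awk rebuild $0 with OFS (a single space by default) between the fields.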


Better way to manipulate data using AWK and SED

I am wondering if somebody could help me rewrite this in a more sensible and smarter way.
sed -e '1d; $d' <someinputfile> |
awk -F"\t" '{split($2,a,/-/); print $1","a[1]","a[2]","$3","$4","$5","$6}' |
sed -e "s/,/\",\"/g" |
sed 's/^/"/;s/$/"/' |
sed -e $'1i\\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
It should be possible to produce the correct output with awk alone, and I assume there are much better ways to write this.
Shorter way? More efficient way? More correct way? POSIX compliant? GNU compliant?
If you can help, please also try to explain the changes, as I really want to understand "how" and "what is what" :)
Thanks!
What it does is:
Deletes the first and last lines
Splits the second field on the separator - and prints it (here it should already be possible to print the right format?)
Changes , to "," in the previous awk output
Adds " around all lines
Adds a new header
If somebody wants to play with inputfile here is an example:
START 9 1997-07-27T13:37:01Z
X1 24087-27 Axgma8PYjc1yRJlUr41688 1997-07-27T13:09:00Z 9876 OK
X1 642-68 6nwPtLQTqAAKufH3ejoEeg 1997-07-27T14:31:00Z 9876 OK
X1 642-31 qfKH99UnxZTcp2AN8NNB21 1997-07-27T16:15:00Z 9876 OK
X1 642-24 PouJBByqUJkqhKHBynUesD 1997-07-27T16:15:00Z 9876 OK
X1 642-30 J7t2sJKKtcxWJr18I84A46 1997-07-27T16:15:00Z 9876 OK
X1 642-29 g7hPkNpUywvk6FvGqgpHsx 1997-07-27T16:15:00Z 9876 OK
X1 642-26 W2KM24xvmy0Q8cLV950tXq 1997-07-27T16:15:00Z 9876 OK
X1 642-25 dqu8jB5tUthIKevNAQXgld 1997-07-27T16:15:00Z 9876 OK
X1 753-32 Gh0kZkIJr8j6FSYljbpyyy 1997-07-27T16:15:00Z 9876 OK
X1 753-23 Jvl8LMh6SDHfgvLfJIHi5l 1997-07-27T16:15:00Z 9876 OK
X1 753-28 IZ83996cthjhZGYcAk97iJ 1997-07-27T16:15:00Z 9876 OK
X1 753-22 YJwokU0Dq6xiydkf3EDyxl 1997-07-27T16:15:00Z 9876 OK
X1 753-36 OZHOMirRKjA3LcXTbPJL31 1997-07-27T16:15:00Z 9876 OK
X1 753-34 LvMgT6ed1b1e3uwasGi48G 1997-07-27T16:15:00Z 9877 OK
X1 753-35 VJk4x8sTG1BJTnZYvgu6px 1997-07-27T16:15:00Z 9876 OK
X1 663-27 mkZXgTHKBjmAplrDeoQZXo 1997-07-27T16:15:00Z 9875 ERR
X1 f1K1PzQ9sp2QAv1AX0Zix4 1997-07-27T16:27:00Z 9875 ERR
DONE 69 3QXFXKQAFRSZXJLJ6JZ9NWMXR00B1V1J1FUMBQAA9DQSRCTZF8JXAWWSGHSDIPQ9
PS: Since I'm not sure whether you will get the same output on your computer, here is how it looks for me when I run it, and how I want it:
"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"
One awk idea:
awk '
BEGIN { FS="\t"
OFS="\",\"" # define output field delimiter as <doublequote> <comma> <doublequote>
# print header
print "\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven\""
}
FNR>1 { if (prev) print prev
        split($2,a,"-")
        # reformat current line and save in variable "prev", to be printed on the next pass; add <doublequote> on ends
        prev = "\"" $1 OFS a[1] OFS a[2] OFS $3 OFS $4 OFS $5 OFS $6 "\""
}
# net effect: line 1 (START) is skipped, and the last line (DONE) is read into "prev" but never printed
' input.dat
This generates:
"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"
Given:
sed -E 's/\t/\\t/g' file
START\t9\t1997-07-27T13:37:01Z
X1\t24087-27\tAxgma8PYjc1yRJlUr41688\t1997-07-27T13:09:00Z\t9876\tOK
X1\t642-68\t6nwPtLQTqAAKufH3ejoEeg\t1997-07-27T14:31:00Z\t9876\tOK
X1\t642-31\tqfKH99UnxZTcp2AN8NNB21\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-24\tPouJBByqUJkqhKHBynUesD\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-30\tJ7t2sJKKtcxWJr18I84A46\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-29\tg7hPkNpUywvk6FvGqgpHsx\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-26\tW2KM24xvmy0Q8cLV950tXq\t1997-07-27T16:15:00Z\t9876\tOK
X1\t642-25\tdqu8jB5tUthIKevNAQXgld\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-32\tGh0kZkIJr8j6FSYljbpyyy\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-23\tJvl8LMh6SDHfgvLfJIHi5l\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-28\tIZ83996cthjhZGYcAk97iJ\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-22\tYJwokU0Dq6xiydkf3EDyxl\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-36\tOZHOMirRKjA3LcXTbPJL31\t1997-07-27T16:15:00Z\t9876\tOK
X1\t753-34\tLvMgT6ed1b1e3uwasGi48G\t1997-07-27T16:15:00Z\t9877\tOK
X1\t753-35\tVJk4x8sTG1BJTnZYvgu6px\t1997-07-27T16:15:00Z\t9876\tOK
X1\t663-27\tmkZXgTHKBjmAplrDeoQZXo\t1997-07-27T16:15:00Z\t9875\tERR
X1\t\tf1K1PzQ9sp2QAv1AX0Zix4\t1997-07-27T16:27:00Z\t9875\tERR
DONE\t69\t3QXFXKQAFRSZXJLJ6JZ9NWMXR00B1V1J1FUMBQAA9DQSRCTZF8JXAWWSGHSDIPQ9
It is a very good idea to use a proper CSV parser to deal with issues like this.
Ruby is ubiquitous, and a lightweight but capable CSV parser is included in its standard distribution.
Here is a Ruby version:
ruby -r csv -e '
data = CSV.parse($<.read, col_sep: "\t")
d2 = CSV::Table.new([], headers: ["field_one","field_two","field_three","field_four","field_five","field_six","field_seven"])
data[1...-1].each { |r|
  r_ = []
  r.each_with_index { |e, i|
    if i == 1
      e && e[/-/] ? (r_.concat e.split(/-/, 2)) : (r_.concat ["", ""])
    else
      r_ << e
    end
  }
  d2 << r_
}
puts d2.to_csv(force_quotes: true)
' file
Prints:
"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","","","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"
I would rework this part of your code
awk -F"\t" '{split($2,a,/-/); print $1","a[1]","a[2]","$3","$4","$5","$6}' | sed -e "s/,/\",\"/g" | sed 's/^/"/;s/$/"/' | sed -e $'1i\\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
in the following way. 1st step: print "," directly rather than printing , and changing it afterwards, i.e.
awk -F"\t" '{split($2,a,/-/); print $1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6}' | sed 's/^/"/;s/$/"/' | sed -e $'1i\\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
2nd step: add the leading " and trailing " in the print itself, i.e.
awk -F"\t" '{split($2,a,/-/); print "\""$1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6"\""}' | sed -e $'1i\\\"field_one","field_two","field_three","field_four","field_five","field_six","field_seven"'
3rd step: use a BEGIN block to print the header, i.e.
awk -F"\t" 'BEGIN{print "\"field_one\",\"field_two\",\"field_three\",\"field_four\",\"field_five\",\"field_six\",\"field_seven\""}{split($2,a,/-/); print "\""$1"\",\""a[1]"\",\""a[2]"\",\""$3"\",\""$4"\",\""$5"\",\""$6"\""}'
(tested in gawk 4.2.1)
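A possible 4th step, not part of the original answer: fold the leading sed -e '1d; $d' into the same awk, reusing the delayed-print idea from the first answer above. A sketch - skip the START line with NR>1, and print each formatted line one record late, so the final DONE line is never flushed:
awk -F"\t" '
BEGIN { print "\"field_one\",\"field_two\",\"field_three\",\"field_four\",\"field_five\",\"field_six\",\"field_seven\"" }
NR > 1 {
    if (prev != "") print prev   # print the previous data line
    split($2, a, "-")
    prev = "\"" $1 "\",\"" a[1] "\",\"" a[2] "\",\"" $3 "\",\"" $4 "\",\"" $5 "\",\"" $6 "\""
}
' file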
Not as elegant a solution as I hoped for, but it got the job done.
Instead of hard-coding verbal names for the fields, it computes the required header row on the fly from the actual input, which also accounts for the anticipated split of field 2:
gnice gcat sample.txt \
\
| mawk2 'function trim(_) { return \
\
substr("",gsub("^[,\42]*|[,\42]*$","\42",_))_
} BEGIN { FS = "[ \11]+"
OFS = "\42\54\42"
} NR==2 {
for(_=NF+!_;_;_--) {
___=(_)(OFS)___
}
printf("%*s\n",gsub("[0-9]+[^0-9]+",\
"field_&",___)~"",trim(___))
} !/^(START|DONE)/ {
printf("\42%.0s%s\42\n",$1=$(($0=\
$(sub("[-]"," ",$2)<""))~""),$0) } ' | lgp3 3
"field_1","field_2","field_3","field_4","field_5","field_6","field_7"
"X1","24087","27","Axgma8PYjc1yRJlUr41688","1997-07-27T13:09:00Z","9876","OK"
"X1","642","68","6nwPtLQTqAAKufH3ejoEeg","1997-07-27T14:31:00Z","9876","OK"
"X1","642","31","qfKH99UnxZTcp2AN8NNB21","1997-07-27T16:15:00Z","9876","OK"
"X1","642","24","PouJBByqUJkqhKHBynUesD","1997-07-27T16:15:00Z","9876","OK"
"X1","642","30","J7t2sJKKtcxWJr18I84A46","1997-07-27T16:15:00Z","9876","OK"
"X1","642","29","g7hPkNpUywvk6FvGqgpHsx","1997-07-27T16:15:00Z","9876","OK"
"X1","642","26","W2KM24xvmy0Q8cLV950tXq","1997-07-27T16:15:00Z","9876","OK"
"X1","642","25","dqu8jB5tUthIKevNAQXgld","1997-07-27T16:15:00Z","9876","OK"
"X1","753","32","Gh0kZkIJr8j6FSYljbpyyy","1997-07-27T16:15:00Z","9876","OK"
"X1","753","23","Jvl8LMh6SDHfgvLfJIHi5l","1997-07-27T16:15:00Z","9876","OK"
"X1","753","28","IZ83996cthjhZGYcAk97iJ","1997-07-27T16:15:00Z","9876","OK"
"X1","753","22","YJwokU0Dq6xiydkf3EDyxl","1997-07-27T16:15:00Z","9876","OK"
"X1","753","36","OZHOMirRKjA3LcXTbPJL31","1997-07-27T16:15:00Z","9876","OK"
"X1","753","34","LvMgT6ed1b1e3uwasGi48G","1997-07-27T16:15:00Z","9877","OK"
"X1","753","35","VJk4x8sTG1BJTnZYvgu6px","1997-07-27T16:15:00Z","9876","OK"
"X1","663","27","mkZXgTHKBjmAplrDeoQZXo","1997-07-27T16:15:00Z","9875","ERR"
"X1","f1K1PzQ9sp2QAv1AX0Zix4","1997-07-27T16:27:00Z","9875","ERR"

how to keep newline(s) when selecting a given column with awk

Suppose I have a file like this (disclaimer: this is not fixed; I can have more than 7 rows and more than 4 columns):
R H A 23
S E A 45
T E A 34
U A 35
Y T A 35
O E A 353
J G B 23
I want to select the second column whenever the third column is A, but keep the newline/whitespace characters.
The output should be:
HEE TE
I tried this:
awk '{if ($3=="A") print $2}' file | awk 'BEGIN{ORS = ""}{print $1}'
But this gives:
HEETE%
which has a weird % (most likely the shell indicating a missing trailing newline, since ORS was set to empty) and is missing the space.
You may use this gnu-awk solution using FIELDWIDTHS:
awk 'BEGIN{ FIELDWIDTHS = "1 1 1 1 1 1 *" } $5 == "A" {s = s $3}
END {print s}' file
HEE TE
awk splits each record using the width values provided in the FIELDWIDTHS variable.
1 1 1 1 1 1 * means each of the first 6 columns is a single character wide, and the remaining text goes into the 7th column. Since you have a space after each value, $2, $4, and $6 each hold a single space, while $1, $3, and $5 hold the actual values.
$5 == "A" {s = s $3}: here we check whether $5 is A, and if so we keep appending the value of $3 to the variable s. In the END block we just print s.
Without fixed-width parsing, awk would treat the A in the 4th row as $2.
Alternatively, if we make the trailing space part of each column value, then use:
awk '
BEGIN{ FIELDWIDTHS = "2 2 2 *" }
$3 == "A " {s = s substr($2,1,1)}
END {print s}
' file
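FIELDWIDTHS is a GNU awk extension. Under the same fixed-layout assumption (one-character columns at positions 1, 3 and 5, separated by single spaces), a POSIX-awk sketch using substr() does the same job:
awk '
# the character at position 5 is the 3rd column, position 3 the 2nd column
substr($0, 5, 1) == "A" { s = s substr($0, 3, 1) }
END { print s }
' file
HEE TE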

awk/sed remove duplicates and merge permuted columns

I have the following file:
ABC MNH 1
UHR LOI 2
QWE LOI 3
MNH ABC 4
PUQ LOI 5
MNH ABC 6
QWE LOI 7
LOI UHR 8
I want to remove all duplicates (based on the first two columns; e.g., row 6 is a duplicate of row 4). I also want to merge entries where columns 1 and 2 are permuted (e.g., rows 1 and 4). This means the list should result in:
ABC MNH 1 4
UHR LOI 2 8
QWE LOI 3
PUQ LOI 5
However, this file is huge. About 2-3 TB. Can this be done with awk/sed?
I don't understand how what you posted leads to your expected output, so you may have to massage this, but IMHO it is the right way to approach the problem: only sort has to store the multi-TB input internally (and sort is designed to do that, with paging etc.), while the awk scripts just process one line at a time and keep very little in memory:
$ cat tst.sh
#!/bin/env bash
awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -k1,2 |
awk '
{ curr = $1 OFS $2 }
prev != curr {                  # first line of a new key group
    if ( NR>1 ) {
        print rec
    }
    rec = $0
    # strip the 2 canonical key fields added by the first awk
    sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)
    prev = curr
    next
}
{ rec = rec OFS $NF }           # same key: append this line's value
END { print rec }
'
$ ./tst.sh file
ABC MNH 1 4 6
PUQ LOI 5
QWE LOI 3 7
LOI UHR 8 2
An alternative implementation, after discussing with @kvantour in the comments below, that also drops lines whose first two columns duplicate an earlier line (e.g. QWE LOI 7) instead of collecting their values (requires GNU sort for its -s stable-sort option):
$ cat tst.sh
#!/bin/env bash
awk '{print ($1>$2 ? $1 OFS $2 : $2 OFS $1), $0}' "$1" |
sort -s -k1,2 |
awk '
{ curr = $1 OFS $2 }
prev != curr {
    if ( NR>1 ) {
        print rec
    }
    rec = $0
    sub(/^([^[:space:]]+[[:space:]]+){2}/,"",rec)   # strip the 2 key fields
    sub(/[[:space:]]+[^[:space:]]+$/,"",rec)        # strip the value; re-appended below
    delete seen
    prev = curr
}
!seen[$3,$4]++ { rec = rec OFS $NF }                # append the value only for the first occurrence of each original $1/$2 pair
END { print rec }
'
$ ./tst.sh file
ABC MNH 1 4
PUQ LOI 5
QWE LOI 3
UHR LOI 2 8
The always helpful GNU datamash to the rescue!
$ sort -k1,2 -u input.txt |
awk -v OFS="\t" '$2 < $1 { tmp = $1; $1 = $2; $2 = tmp } { print $1, $2, $3 }' |
sort -k1,2 |
datamash groupby 1,2 collapse 3 |
tr ',' ' '
ABC MNH 1 4
LOI PUQ 5
LOI QWE 3
LOI UHR 2 8
Broken down, this:
Sorts the input file based on the first two columns and removes duplicates.
If the second column is less than the first column, swaps the two (so MNH ABC 6 becomes ABC MNH 6) and outputs tab-separated columns (which is what datamash works with by default).
Sorts that so all the transformed rows are in order (But this time keeping duplicates).
Uses datamash to produce a single line for each distinct pair of first two columns, with a comma-separated list of the third-column values as the third column of the output (like ABC MNH 1,4).
Turns those commas into spaces.
Most memory-efficient solutions will require the data to be sorted, and while the sort program is quite good at doing that, it'll still use a bunch of temporary files so you'll need 2-3 or so terabytes of free disk space.
If you're going to be doing a lot of stuff with the same data, it's probably worth sorting it once and reusing that file instead of sorting it every time as the first step of a pipeline:
$ sort -k1,2 -u input.txt > unique_sorted.txt
$ awk ... unique_sorted.txt | ...
If there are enough duplicates, and enough RAM that it's feasible to hold the results in memory, it can be done in one pass through the input file, removing duplicates as it goes and then iterating through all the remaining pairs of values:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my %keys;
while (<>) {
    chomp;
    my ($col1, $col2, $col3) = split ' ';
    $keys{$col1}{$col2} = $col3 unless exists $keys{$col1}{$col2};
}
$, = " ";
while (my ($col1, $sub) = each %keys) {
    while (my ($col2, $col3) = each %$sub) {
        next unless defined $col3;
        if ($col1 lt $col2 && exists $keys{$col2}{$col1}) {
            # fold in the value stored under the permuted key, then mark it consumed
            $col3 .= " $keys{$col2}{$col1}";
            $keys{$col2}{$col1} = undef;
        } elsif ($col2 lt $col1 && exists $keys{$col2}{$col1}) {
            next;  # skip; the lexically smaller pair prints the merged line
        }
        say $col1, $col2, $col3;
    }
}
This produces output in arbitrary unsorted order for efficiency's sake.
And an approach using sqlite (also requires lots of extra free disk space, and that the columns are separated by tabs, not arbitrary whitespace):
#!/bin/sh
input="$1"
sqlite3 -batch -noheader -list temp.db 2>/dev/null <<EOF
.separator \t
PRAGMA page_size = 8096; -- Make sure the database can grow big enough
CREATE TABLE data(col1, col2, col3, PRIMARY KEY(col1, col2)) WITHOUT ROWID;
.import "$input" data
SELECT col1, col2, group_concat(col3, ' ')
FROM (
SELECT col1, col2, col3 FROM data WHERE col1 < col2
UNION ALL
SELECT col2, col1, col3 FROM data WHERE col2 < col1
)
GROUP BY col1, col2
ORDER BY col1, col2;
EOF
rm -f temp.db
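Usage (a sketch; merge.sh is a stand-in name for the script above, and the input must really be tab-separated):
$ ./merge.sh input.txt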
If your first two columns have at most 3 characters each, there are only 26^6 possible combinations for the first two columns. This is very easy to handle with awk:
{ key1=$1$2; key2=$2$1 }
(key1 in a) { next } # duplicate :> skip
(key2 in a) { print $2,$1,a[key2],$3 } # permutation :> print
{ a[key1]=$3 } # store value
This, however, will only print the permutations and, as requested, a maximum of 2 elements. As a consequence, when a permutation is found the array a holds both key1 and the permuted key key2; otherwise it holds only key1.
This can be cleaned up with a second array that keeps track of whether a permutation has already been printed. Call it b. This way you can eliminate 2 elements from a while keeping track of one element in b:
{ key1=$1$2; key2=$2$1 }
(key1 in b) || (key2 in b) { next } # permutation printed, is duplicate
(key1 in a) { next } # only duplicate, no permutation found
(key2 in a) { # permutation found
print $2,$1,a[key2],$3 # - print
delete a[key1] # - delete keys from a
delete a[key2]
b[key1] # - store key in b
next # - skip the rest
}
{ a[key1]=$3 }
END { for (k in a) { print substr(k,1,3), substr(k,4,3), a[k] } }
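With the sample input above, this prints ABC MNH 1 4 and UHR LOI 2 8 from the main block as the permutations are found, and then QWE LOI 3 and PUQ LOI 5 from the END block (in arbitrary array order).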

Print the 1st and every nth column of a text file using awk

I have a txt file that contains a total of 10177 columns and approximately 450,000 rows. The information is separated by tabs. I am trying to trim the file down using awk so that it only prints columns 1-3, the 5th column, and every 14th column after the fifth.
My file has a format that looks like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ... 10177
A B C D E F G H I J K L M N O P Q R S T ...
X Y X Y X Y X Y X Y X Y X Y X Y X Y X Y ...
I am hoping to generate an output txt file (also separated with tab) that contains:
1 2 3 5 18 ...
A B C E R ...
X Y X X Y ...
The current awk code I have looks like (I am using cygwin to use the code):
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
But the result I am getting shows something like:
123518...ABCER...XYXXY...
When opened in Excel, the results are all mashed into a single cell.
In addition, when I try to include code
for (i=0;i<=3;i++) printf "%s ",$i
in the awk script to get the first 3 columns, it just prints the whole original input together with the mashed result. I am not familiar with awk, so I am not sure what causes this issue.
Awk field numbers, strings, and array indices all start at 1, not 0, so when you do:
for (i=0;i<=3;i++) printf "%s ",$i
the first iteration prints $0 which is the whole record.
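A quick demonstration:
$ echo 'a b c' | awk '{ for (i=0; i<=1; i++) printf "[%s] ", $i; print "" }'
[a b c] [a]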
You're on the right track with:
$ awk -F"\t" '{OFS="\t"} { for(i=5;i<10177;i+=14) printf ($i) }' test2.txt > test3.txt
but never call printf with input data as the only argument, because printf will then treat the data as a format string with no arguments (rather than what you want, a plain string format plus your data), and that will fail cryptically if/when your input data contains formatting characters like %s or %d. So always use printf "%s", $i, never printf $i.
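For example, when the data happens to contain a format directive (gawk shown; other implementations may print garbage instead of aborting):
$ echo '95%s off' | awk '{ printf $0 }'
# gawk aborts here with "fatal: not enough arguments to satisfy format string"
$ echo '95%s off' | awk '{ printf "%s\n", $0 }'
95%s off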
The problem you're having with Excel, I would guess, is that you're double-clicking the file and hoping Excel knows what to do with it (it won't, unlike if this were a CSV). You can import tab-separated files into Excel after it's open, though - google that.
You want something like:
awk '
BEGIN { FS=OFS="\t" }
{
for (i=1; i<=3; i++) {
printf "%s%s", (i>1?OFS:""), $i
}
for (i=5; i<=NF; i+=14) {
printf "%s%s", OFS, $i
}
print ""
}
' file
I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
In awk, using the conditional operator in the for increment:
$ awk 'BEGIN { FS=OFS="\t" }
{
for(i=1; i<=NF; i+=( i<3 ? 1 : ( i==3 ? 2 : 14 )))
printf "%s%s", $i, ( (i+14)>NF ? ORS : OFS)
}' file
1 2 3 5 19
A B C E S
X Y X X X
In the for: if i<3, increment by one; if i==3, increment by two to get to 5; and after that, increment by 14.
I would be tempted to solve the problem along the following lines; I think you'll find you save time by not iterating in awk. Build the list of wanted field references in the shell and splice it into the awk program text:
$ cols="$( { seq 1 3; seq 5 14 10177; } | sed 's/^/$/' | paste -sd, - )"
$ awk -F'\t' -v OFS='\t' "{print $cols}" test.txt

awk to Count Sum and Unique improve command

I would like to print, grouped by the 2nd column: the count of line items, the sum of the 3rd column, and the number of unique values of the first column. I have around 100 InputTest files, and they are not sorted.
I am using the 3 commands below to achieve the desired output, and would like to know the simplest way to do it.
InputTest*.txt
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
abc,xx,5,sss
abc,yy,10,sss
def,xx,15,sss
def,yy,20,sss
ghi,zz,10,sss
Step#1:
cat InputTest*.txt | awk -F, '{key=$2;++a[key];b[key]=b[key]+$3} END {for(i in a) print i","a[i]","b[i]}'
Op#1
xx,4,40
yy,4,60
zz,1,10
Step#2
awk -F ',' '{print $1,$2}' InputTest*.txt | sort | uniq >Op_UniqTest2.txt
Op#2
abc xx
abc yy
def xx
def yy
ghi zz
Step#3
awk '{print $2}' Op_UniqTest2.txt | sort | uniq -c
Op#3
2 xx
2 yy
1 zz
Desired Output:
xx,4,40,2
yy,4,60,2
zz,1,10,1
Looking for suggestions!
BEGIN { FS = OFS = "," }
{ ++lines[$2]; if (!seen[$2,$1]++) ++diff[$2]; count[$2]+=$3 }
END { for(i in lines) print i, lines[i], count[i], diff[i] }
lines tracks the number of occurrences of each value in column 2
seen records unique combinations of the second and first column, incrementing diff[$2] whenever a unique combination is found. The ++ after seen[$2,$1] means that the condition will only be true the first time the combination is found, as the value of seen[$2,$1] will be increased to 1 and !seen[$2,$1] will be false.
count keeps a total of the third column
$ awk -f avn.awk file
xx,4,40,2
yy,4,60,2
zz,1,10,1
Using awk:
$ awk '
BEGIN { FS = OFS = "," }
{ keys[$2]++; sum[$2]+=$3 } !seen[$1,$2]++ { count[$2]++ }
END { for(key in keys) print key, keys[key], sum[key], count[key] }
' file
xx,4,40,2
yy,4,60,2
zz,1,10,1
The input and output field separators are set to , in the BEGIN block. The keys array identifies and counts the column-2 keys; the sum array keeps a running sum of column 3 for each key; count keeps track of the number of unique column-1 values for each column-2 value.
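Since all of the state lives in awk arrays that persist across input files, either script can read the ~100 unsorted files in one go, with no cat, sort, or uniq stages:
$ awk -f avn.awk InputTest*.txt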