trim trailing tabs from a string/column in awk

I have a file containing 4 columns separated by tabs. In the last column, there can sometimes be trailing tabs inside the quotation marks.
This is a similar question to trim leading and trailing spaces from a string in awk. Here is an example:
col1 col2 col3 col4
"12" "d" "5" "this is great"
"13" "d" "6" "this is great<tab>"
"14" "d" "7" "this is great<tab><tab>"
"15" "d" "8" "this is great"
"16" "d" "9" "this is great<tab>"
This is what I come up with so far:
gawk --re-interval -F '"' 'NF = 9 {if ($8 ~ /\t$/) {gsub(/[\t]+$/,"",$8)} ; print}'
The problem is that it destroys my format, meaning I get no quotation marks around each column. The good thing is that the tabs between the columns are still there:
col1 col2 col3 col4
12 d 5 this is great
13 d 6 this is great
14 d 7 this is great
15 d 8 this is great
16 d 9 this is great
What am I doing wrong?

You need to tell awk that the output field separator (OFS) is also a quote; otherwise, whenever a field is modified, awk rebuilds the record with the default OFS, a single space. (Note, too, that the pattern should be the comparison NF == 9, not the assignment NF = 9.) For example:
awk -v OFS='"' -F '"' 'NF == 9 {
if ($8 ~ /\t$/) {
gsub(/[\t]+$/,"",$8)
}
}
1' input.txt
Output:
col1 col2 col3 col4
"12" "d" "5" "this is great"
"13" "d" "6" "this is great"
"14" "d" "7" "this is great"
"15" "d" "8" "this is great"
"16" "d" "9" "this is great"

Related

how to name matrix columns

I have a matrix like below; how can I give column names like "month", "2015", "2016", "2017" to columns 2:5? Thank you.
[,1] [,2] [,3] [,4] [,5]
[1,] "" "1" "75" "75" "94"
[2,] "" "2" "77" "67" "69"
[3,] "" "3" "67" "78" "80"
[4,] "" "4" "71" "99" "84"
[5,] "" "5" "62" "89" "74"
Assuming you're using R, you could do something like this (for a matrix M):
colnames(M) <- c("","month","2015","2016","2017")

Does awk have 2-dimensional arrays or something similar to store values?

Hi all, I'm new to awk. I have an input file like this:
# ABC DEFG
value1 GH
value2 GH
value3 GH
# BCF SQW
value4 GH
value5 GH
# BEC YUW
value6 GH
value7 GH
Desire output:
##### ABC DEFG #####
ABC_DEFG
DEFG_ABC
value1 ABC
value1 DEFG
value2 ABC
value2 DEFG
value3 ABC
value3 DEFG
##### BCF SQW #####
BCF_SQW
SQW_BCF
value4 BCF
value4 SQW
value5 BCF
value5 SQW
##### BEC YUW #####
BEC_YUW
YUW_BEC
value6 BEC
value6 YUW
value7 BEC
value7 YUW
I have separated $2 and $3 from the lines containing the # character into arrays like this:
awk '
/^#/ {
a[na++] = $2
b[nb++] = $3
}
END {
for(i = 0; i < na; i ++){
print ("######" a[i] " " b[i] "#####")
print (a[i] "_" b[i])
print (b[i] "_" a[i])
}
}
' input
But I don't know how to store $1 from all the lines between the "#" lines into the array. Does anyone know how to do it? Thank you so much.
With your shown samples, please try the following awk code. It is written and tested in GNU awk and should work in any awk.
awk '
/^#/{
sub(/^#/,"#####")
print $0 " #####"
val1=$2
val2=$3
print val1"_"val2 ORS val2"_"val1
next
}
{
print $1,val1 ORS $1,val2
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/^#/{ ##Checking condition if line starts from #.
sub(/^#/,"#####") ##Substituting the starting # with #####, keeping the space that follows.
print $0 " #####" ##Printing current line followed by " #####" here.
val1=$2 ##Creating val1 which has 2nd field in it.
val2=$3 ##Creating val2 which has 3rd field in it.
print val1"_"val2 ORS val2"_"val1 ##Printing val1 _ val2 newline val2 _ val1.
next ##next will skip further statements from here.
}
{
print $1,val1 ORS $1,val2 ##Printing 1st field val1 ORS 1st field val2 here.
}
' Input_file ##Mentioning Input_file name here.
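As for the title question: awk has no true two-dimensional arrays, but it simulates them with multiple subscripts joined by the built-in SUBSEP (a[i, j] is stored as a[i SUBSEP j]), and GNU awk additionally offers real arrays of arrays (a[i][j]). If you would rather collect everything into arrays first, as in your original attempt, and print in the END block, here is a minimal sketch along those lines:
awk '
/^#/ { na++; a[na] = $2; b[na] = $3; next }   # block header: remember both tags
     { v[na, ++cnt[na]] = $1 }                # store $1 under the current block number
END {
    for (i = 1; i <= na; i++) {
        print "##### " a[i] " " b[i] " #####"
        print a[i] "_" b[i]
        print b[i] "_" a[i]
        for (j = 1; j <= cnt[i]; j++)
            print v[i, j] " " a[i] ORS v[i, j] " " b[i]
    }
}' input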

How to improve this awk code to reduce processing time

I have 400 tab-delimited text files with 6 million rows in each file. Below is the format of the files:
### input.txt
col1 col2 col3 col4 col5
ID1 str1 234 cond1 0
ID1 str2 567 cond1 0
ID1 str3 789 cond1 1
ID1 str4 123 cond1 1
### file1.txt
col1 col2 col3 col4 col5
ID2 str1 235 cond1 0
ID2 str2 567 cond2 3
ID2 str3 789 cond1 3
ID2 str4 123 cond2 0
### file2.txt
col1 col2 col3 col4 col5
ID3 str1 235 cond1 0
ID3 str2 567 cond2 4
ID3 str3 789 cond1 1
I am trying to append the values in $1 from file1..filen to a new $6 in the input.txt file, using these conditions:
1. Columns $2 and $3 together form the key.
2. If the key is found in file1...filen and that line's $5 >= 2, append the value from $1 to $6 in the input file.
Code:
awk -F "\t" -v OFS="\t" '!c {
c=$0"\tcol6";
next
}
NR==FNR {
a[$2$3]=$0 "\t";
next
}
{
if ($5>=2) {
a[$2$3]=a[$2$3] $1 ","
}
}
END {
print c;
for (i in a) {
print a[i]
}
}' input.txt file1..filen.txt
The output from the above code is as expected:
Output.txt
col1 col2 col3 col4 col5 col6
ID1 str2 567 cond1 0 ID2,ID3,
ID1 str4 123 cond1 1
ID1 str1 234 cond1 0
ID1 str3 789 cond1 1 ID2,
However, the problem is that the code is very slow: it has to check every key from input.txt against 400 files with 6 million rows each, which takes from several hours to a few days. Could someone suggest a better way to reduce the processing time, in awk or using other scripts?
Any help would really save a lot of time.
input.txt
Sam string POS Zyg QUAL
WSS 1 125 hom 4973.77
WSS 1 810 hom 3548.77
WSS 1 389 hom 62.74
WSS 1 689 hom 4.12
file1.txt
Sam string POS Zyg QUAL
AC0 1 478 hom 8.64
AC0 1 583 het 37.77
AC0 1 588 het 37.77
AC0 1 619 hom 92.03
file2.txt
Sam string POS zyg QUAL
AC1 1 619 hom 89.03
AC1 1 746 hom 17.86
AC1 1 810 het 2680.77
AC1 1 849 het 200.77
awk -F "\t" -v OFS="\t" '!c {
c=$0"\tcol6";
next
}
NR==FNR {
a[$2$3]=$0 "\t";
next
}
{
if ( ($5>=2) && (FNR > 1) ) {
if ( $2$3 in a ) {
a[$2$3]=a[$2$3] $1 ",";
} else {
print $0 > "Errors.txt";
}
}
}
END {
print c;
for (i in a) {
print a[i]
}
}' input.txt file*
For the above input files it prints the below output:
AC0,AC1,
WSS 1 389 hom 62.74
AC1,
WSS 1 810 hom 3548.77 AC1,
WSS 1 689 hom 4.12
WSS 1 1250 hom 4973.77
It still prints the $1 values collected from file1 and file2 as standalone lines.
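One hedged suggestion: the script above is already a single pass over all the files, so most of the time goes to raw I/O and per-line string handling. Running under LC_ALL=C (byte-wise string comparison) and giving the array key an explicit separator (the bare $2$3 concatenation can collide, e.g. "str2" with "35" versus "str23" with "5") are cheap improvements; splitting the 400 files across several awk processes (for example with GNU parallel) and merging the per-file ID lists in a final pass usually helps far more. A minimal sketch of the same logic, with the Errors.txt handling left out and placeholder file names:
LC_ALL=C awk -F '\t' -v OFS='\t' '
NR == FNR {                                  # first file: input.txt
    if (FNR == 1) { hdr = $0 OFS "col6"; next }
    a[$2 SUBSEP $3] = $0 OFS                 # separator-joined key avoids collisions
    next
}
FNR > 1 && $5 >= 2 && (($2 SUBSEP $3) in a) {
    a[$2 SUBSEP $3] = a[$2 SUBSEP $3] $1 ","
}
END {
    print hdr
    for (k in a) print a[k]
}' input.txt file*.txt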

sscan doesn't return part of the members

I have a set that contains integer values, and I want to retrieve part of it with sscan.
127.0.0.1:6379[1]> smembers d
1) "1"
2) "2"
3) "3"
4) "4"
5) "5"
6) "6"
7) "7"
8) "8"
...
But sscan returns the full list of members:
127.0.0.1:6379[1]> sscan d 0
1) "0"
2) 1) "1"
2) "2"
3) "3"
4) "4"
5) "5"
6) "6"
7) "7"
8) "8"
9) "9"
....
Is there any way to get the members page by page (e.g. 10 items per scan)?
Use the COUNT option, as explained in SCAN's documentation, to control roughly how many results each call returns.
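For example, with the same set d (keep calling with the cursor each reply returns, until it comes back as 0):
sscan d 0 COUNT 10
sscan d <cursor> COUNT 10
Two caveats from the documentation: COUNT is only a hint, so a call may return somewhat more or fewer elements than requested; and a small set kept in the compact intset encoding, like this set of small integers, is returned in full on the first call with cursor 0 regardless of COUNT.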

awk + match string after "=" separator

I have a problem with the following awk syntax
echo " param1 param2 param3 = param1 AA , AB , AC , AD " | awk -F"=" '$2~/AA|AB|AC|AD/{print "passed"}'
The awk prints "passed", but it shouldn't, because after "=" I have "param1" and not "AA" or "AB", etc.
The goal is to print "passed" only if the string after "=" is AA, AB, AC, or AD; if there is anything else after "=", it should not print "passed".
How do I fix the awk syntax?
lidia
You need anchors:
awk -F= '$2 ~ /^(AA|AB|AC|AD)$/ {print "passed"}'
If you want to allow spaces:
awk -F= '$2 ~ /^ *(AA|AB|AC|AD) *$/ {print "passed"}'
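A quick check with a matching and a non-matching line:
echo "x = AA" | awk -F= '$2 ~ /^ *(AA|AB|AC|AD) *$/ {print "passed"}'
passed
echo "x = param1 AA" | awk -F= '$2 ~ /^ *(AA|AB|AC|AD) *$/ {print "passed"}'
The second command prints nothing, because the anchored pattern requires one of the four keywords (and nothing else) after the "=".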
This should also work, with the same anchoring (with the sample input above it prints nothing, which is the required behavior):
echo " param1 param2 param3 = param1 AA , AB , AC , AD " |
awk -F"=" -v var="passed" '$2 ~ /^ *(AA|AB|AC|AD) *$/ {printf "%s",var}'