Concatenate columns and add digits with awk

I have a csv file:
number1;number2;min_length;max_length
"40";"1801";8;8
"40";"182";8;8
"42";"32";6;8
"42";"4";6;6
"43";"691";9;9
I want the output to be:
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
So the new file will consist of:
column_1 = a concatenation of old_column_1 + old_column_2 + a number of "0" characters equal to (old_column_3 - length of old_column_2)
column_2 = a concatenation of old_column_1 + old_column_2 + a number of "9" characters equal to (old_column_3 - length of old_column_2)
This covers the case min_length = max_length. When min_length is not equal to max_length, I need to take all the possible lengths into account. So for the line "42";"32";6;8 the lengths are 6, 7 and 8.
Also, I need to delete the quotation marks everywhere.
I tried paste and cut like this:
paste -d ";" <(cut -f1,2 -d ";" < file1) > file2
for the concatenation of the first 2 columns, but I think it's easier with awk. However, I can't figure out how to do it. Any help is appreciated. Thanks!
Edit: added column 4 to the input.

You may use this awk:
awk 'function padstr(ch, len, s) {
s = sprintf("%*s", len, "")
gsub(/ /, ch, s)
return s
}
BEGIN {
FS=OFS=";"
}
{
gsub(/"/, "");
for (i=0; i<=($4-$3); i++) {
d = $3 - length($2) + i
print $1 $2 padstr("0", d), $1 $2 padstr("9", d)
}
}' file
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999

With GNU awk (gensub is a gawk extension); this version uses only min_length (column 3):
awk '
BEGIN{FS = OFS = ";"} # set field separator and output field separator to be ";"
{
$0 = gensub("\"", "", "g"); # Drop double quotes
s = $1$2; # The range header number
l = $3-length($2); # Number of zeros or 9s to be appended
l = 10^l; # Get 10 raised to that number
print s*l, (s+1)*l-1; # Adding n zeros is multiplication by 10^n
# Adding n nines is multiplication by 10^n plus (10^n - 1), i.e. (s+1)*10^n - 1
}' input.txt
Explanation inline as comments.
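The arithmetic version above does not look at the max_length column. Here is a minimal sketch of how it could be extended to emit one range per length from min_length to max_length. The function name expand_ranges is hypothetical. It assumes the header line can be skipped by requiring numeric length columns, and that number1 has no leading zeros (they would be lost in the multiplication).

```shell
expand_ranges() {
  awk 'BEGIN { FS = OFS = ";" }
  { gsub(/"/, "") }                          # drop double quotes, fields are re-split
  $3 ~ /^[0-9]+$/ && $4 ~ /^[0-9]+$/ {       # assumption: skips a header or malformed line
    for (len = $3; len <= $4; len++) {
      scale = 10 ^ (len - length($2))        # number of digits to append, as a power of 10
      low = ($1 $2) * scale                  # appending n zeros
      printf "%.0f;%.0f\n", low, low + scale - 1   # %.0f keeps large values in plain digits
    }
  }' "$@"
}

printf '"42";"32";6;8\n' | expand_ranges
```

Using portable gsub instead of gensub means the sketch should not depend on GNU awk.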

Related

Print each column count along with header name

I have a comma-delimited file. I am interested in counting the number of values (column count/length) in each column, reported with the header name.
Example Dataset:
ID, IB, IM, IZ
0.05, 0.02, 0.01, 0.09
0.06, 0.01, , 0.08
0.02, 0.06,
Desired output:
Column ID:3
Column IB:3
Column IM: 1
Column IZ:2
I have tried quite a few options:
I can split these columns into separate files and then count the number of lines in each file using the wc -l File_name command.
This command is very close to what I am interested in, but I am still unable to get the header name. Any help will be highly appreciated.
I would use GNU AWK for this task in the following way. Let file.txt contain
ID, IB, IM, IZ
0.05, 0.02, 0.01, 0.09
0.06, 0.01, , 0.08
0.02, 0.06,
then
awk 'BEGIN{FS=","}NR==1{split($0,names);next}{for(i=1;i<=NF;i+=1){counts[i]+=$i~/[^[:space:]]/}}END{for(i=1;i<=length(names);i+=1){print "Column",names[i]": "counts[i]}}' file.txt
output
Column ID: 3
Column IB: 3
Column IM: 1
Column IZ: 2
Explanation: FS="," tells GNU AWK that , is the field separator. When processing the 1st record, the whole line ($0) is split into the array names, so ID becomes names[1], IB becomes names[2], IM becomes names[3] and so on, and awk then moves on to the next line. For every line but the 1st, a for loop iterates over the columns and increases counts[i] (where i is the column number) by the result of the test "does this column contain a non-whitespace character?", which is 1 for true and 0 for false. In other words, the count increases by 1 if a non-whitespace character is found and by 0 otherwise. After all lines are processed, the END block iterates over names and prints each name with the corresponding value of counts.
(tested in gawk 4.2.1)
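The counting step relies on a match expression evaluating to 1 or 0, so it can be added directly to a counter. A small sketch to check this in isolation (the function name match_flags and the two sample values are only for illustration):

```shell
match_flags() {
  awk 'BEGIN {
    filled = ("0.01" ~ /[^[:space:]]/)   # field with a non-whitespace character
    blank  = (" " ~ /[^[:space:]]/)      # field holding only a space
    print filled, blank
  }'
}

match_flags   # prints: 1 0
```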
With GNU awk:
awk -F'[[:space:]]*,[[:space:]]*' 'NR == 1 {header = $0; next} \
{for(i = 1; i <= NF; i++) n[i] += ($i ~ /\S/)} \
END {$0 = header; for(i = 1; i <= NF; i++) print "Column " $i ": " n[i]}' file
Column ID: 3
Column IB: 3
Column IM: 1
Column IZ: 2
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=" *, *" }
NR == 1 {
numCols = split($0,tags)
next
}
{
for ( i=1; i<=NF; i++ ) {
if ( $i ~ /./ ) {
cnt[i]++
}
}
}
END {
for ( i=1; i<=numCols; i++ ) {
printf "Column %s:%d\n", tags[i], cnt[i]
}
}
$ awk -f tst.awk file
Column ID:3
Column IB:3
Column IM:1
Column IZ:2

Extract first position of a regex match grep

Good morning everyone,
I have a text file containing multiple lines. I want to find a regular pattern inside it and print its position using grep.
For example:
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
I want to find L[any_letter]T in the file and print the position of L and the three-letter code. In this case the result would be:
11 LIT
8 LAT
4 LKT
I wrote a grep command, but it doesn't return what I need. The command is:
grep -E -boe "L.T" file.txt
It returns:
11:LIT
21:LAT
30:LKT
Any help would be appreciated!!
Awk suits this better:
awk 'match($0, /L[[:alpha:]]T/) {
print RSTART, substr($0, RSTART, RLENGTH)}' file
11 LIT
8 LAT
4 LKT
This is assuming only one such match per line.
If there can be multiple overlapping matches per line then use:
awk '{
n = 0
while (match($0, /L[[:alpha:]]T/)) {
n += RSTART
print n, substr($0, RSTART, RLENGTH)
$0 = substr($0, RSTART + 1)
}
}' file
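To see how the running offset n behaves with overlapping matches, here is the same loop wrapped in a shell function and fed one made-up line in which LLT and LTT overlap. The function name all_matches and the sample line are assumptions for illustration only.

```shell
all_matches() {
  awk '{
    n = 0                                   # characters already cut from the front
    while (match($0, /L[[:alpha:]]T/)) {
      n += RSTART                           # position of this match in the original line
      print n, substr($0, RSTART, RLENGTH)
      $0 = substr($0, RSTART + 1)           # cut up to and including the matched L
    }
  }' "$@"
}

printf 'ARTLLTTG\n' | all_matches
# 4 LLT
# 5 LTT
```

Because only the matched L is cut off each time, the next search can still find a match that starts inside the previous one.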
With your shown samples, please try the following awk code. Written in GNU awk, should work in any awk.
awk '
{
  prev=0
  while(ind=index($0,"L")){
    if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){
      print prev+ind,substr($0,ind,3)
    }
    $0=substr($0,ind+1)
    prev+=ind
  }
}' Input_file
Explanation: a detailed explanation of the above code.
awk '                                ##Starting awk program from here.
{
  prev=0                             ##Resetting prev, the number of characters already removed from the front of the line.
  while(ind=index($0,"L")){          ##Run while loop as long as an L is found (its index is stored in the ind variable).
    if(substr($0,ind+2,1)=="T" && substr($0,ind+1,1) ~ /[a-zA-Z]/){ ##Checking that the character 2 positions after L is T AND the character right after L is a letter.
      print prev+ind,substr($0,ind,3) ##Printing the position of L in the original line along with 3 letters from index of L eg:(LIT).
    }
    $0=substr($0,ind+1)              ##Setting the line to whatever follows the L just examined.
    prev+=ind                        ##Adding the number of removed characters to prev.
  }
}' Input_file                        ##Mentioning Input_file name here.
Peeking at the answer of @anubhava, you might also sum RSTART + RLENGTH and use that as the start for the substr to get multiple matches per line and per word.
The while loop takes the current line, and for every iteration it updates its value by setting it to the part right after the last match till the end of the string.
Note that if you use the . in a regex it can match any character.
awk '{
  pos = 0
  while (match($0, /L[a-zA-Z]T/)) {
    print pos + RSTART, substr($0, RSTART, RLENGTH)
    pos += RSTART + RLENGTH - 1
    $0 = substr($0, RSTART + RLENGTH)
  }
}' file
If file contains
ARTGHFRHOPLIT
GFRTLOPLATHLG
TGHLKTGVARTHG
ARTGHFRHOPLITLOT LATTELET
LUT
The output is
11 LIT
8 LAT
4 LKT
11 LIT
14 LOT
18 LAT
23 LET
1 LUT

Awk script extra output: printing raw line (as read) as well as processed line

I have some CSV files where a certain column is actually supposed to be an array, but ALL fields are separated by commas. I need to convert the file to where every value is quoted, and the array column is a quoted, comma-delimited list. I do know the column index for each file.
I wrote the script below to handle this. However, I get each line printed as hoped for, but followed by the raw line.
desired output:
A,B,C,D
"1","","a,b,c","2"
"3","4","","5"
"","5","d,e","6"
"7","8","f","9"
(base) balter#winmac:~/winhome/CancerGraph$ cat testfile
A,B,C,D
1,,a,b,c,2
3,4,,5
,5,d,e,6
7,8,f,9
(base) balter#winmac:~/winhome/CancerGraph$ ./fix_array_cols.awk FS="," array_col=3 testfile
A,B,C,D
"1","","a,b,c","2"
1,,a,b,c,2
"3","4","","5"
3,4,,5
"","5","d,e","6"
,5,d,e,6
"7","8","f","9"
7,8,f,9
(base) balter#winmac:~/winhome/CancerGraph$ cat fix_array_cols.awk
#!/bin/awk -f
BEGIN {
getline;
print $0;
num_cols = NF;
#printf("num_cols: %s, array_col: %s\n\n", num_cols, array_col);
}
NR>1 {
total_fields = NF;
# fields_before_array = (array_col - 1)
# fields_before_array + array_length + fields_after_array = NF
# fields_before_array + fields_after_array + 1 = num_cols
# array_length - 1 = total_fields - num_cols
# array_length = total_fields - num_cols + 1
# fields_after_array = total_fields - array_length - fields_before_array
# = total_fields - (total_fields - num_cols + 1) - (array_col - 1)
# = num_cols - array_col
fields_before_array = (array_col - 1);
array_length = total_fields - num_cols + 1;
fields_after_array = num_cols - array_col;
first_array_position = array_col;
last_array_position = array_col + array_length-1;
#printf("array_col: %s, fields_before_array: %s, array_length: %s, fields_after_array: %s, total_fields: %s, num_cols: %s", array_col, fields_before_array, array_length, fields_after_array, total_fields, num_cols)
### loop through fields before array column
### remove whitespace, and print surround with ""
for (i=1; i<array_col; i++)
{
gsub(/ /,"",$i);
printf("\"%s\",", $i);
}
### Collect array surrounded by ""
array_data = "";
### Loop through array
for (i=array_col ; i<array_col+array_length-1 ; i++)
{
gsub(/ /, "", $i);
array_data = array_data $i ",";
}
### collect last array element with no trailing ,
array_data = array_data $i
### print array surrounded by quotes
printf("\"%s\",", array_data);
### loop through remaining fields, remove whitespace, surround with ""
for (i=last_array_position+1 ; i<total_fields ; i++)
{
gsub(/ /,"",$i);
printf("\"%s\",", $i);
}
### finish line with \n
printf("\"%s\"\n", $total_fields);
} FILENAME
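Remove FILENAME from the end of your script. After the closing brace it acts as a second pattern with no action. FILENAME is a non-empty string while a file is being read, so that pattern is true and awk's default action prints the raw line.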
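The effect is easy to reproduce on a one-line file. This is a minimal sketch; the function name demo_pattern_only and the temporary file are made up for the demonstration.

```shell
demo_pattern_only() {
  tmp=$(mktemp) || return 1
  printf 'x\n' > "$tmp"
  # First rule prints the processed line; the bare FILENAME pattern
  # then triggers the default action and prints the line again.
  awk '{ print "processed:", $0 } FILENAME' "$tmp"
  rm -f "$tmp"
}

demo_pattern_only
# processed: x
# x
```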

Concatenating array elements into one string in a for loop using awk

I am working on a variant calling format (vcf) file, and I tried to show you guys what I am trying to do:
Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG
Output:
1 877803 838425 C -
1 878077 966631 - CACGG
In summary, I am trying to delete the first letters of longer strings.
And here is my code:
awk 'BEGIN { OFS="\t" } /#/ {next}
{
m = split($4, a, //)
n = split($5, b, //)
x = "-"
delete y
if (m>n){
for (i = n+1; i <= m; i++) {
y = sprintf("%s", a[i])
}
print $1, $2, $3, y, x
}
else if (n>m){
for (j = m+1; i <= n; i++) {
y = sprintf("%s", b[j]) ## Problem here
}
print $1, $2, $3, x, y
}
}' input.vcf > output.vcf
But,
I am getting the following error in line 15, not even in line 9
awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context
I don't know how to concatenate array elements into one string using awk.
I will be very happy if you guys help me.
Merry X-Mas!
You may try this awk:
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file
1 877803 838425 C -
1 878077 966631 - CACGG
More readable form:
awk -v OFS="\t" 'function trim(s) {
return (length(s) == 1 ? "-" : substr(s, 2))
}
{
$4 = trim($4)
$5 = trim($5)
} 1' file
You can use awk's substr function to process the 4th and 5th space delimited fields:
awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file
If the string from position 2 onwards in field 4 is equal to "", set field 4 to "-" otherwise, set field 4 to the extract of the field from position 2 to the end of the field. Do the same with field 5. Print lines modified or not with short hand 1.
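On the error in the question: delete y most likely turns y into an array, which is why the later scalar assignment to y fails. Here is a minimal sketch of building a string in a loop instead, resetting it with an empty string and concatenating by juxtaposition. The helper name drop_first is hypothetical, and it works on the first field only.

```shell
drop_first() {
  awk '{
    y = ""                               # reset with an empty string, not delete
    for (i = 2; i <= length($1); i++)
      y = y substr($1, i, 1)             # placing two values side by side concatenates them
    print (y == "" ? "-" : y)            # nothing left after the first letter: print "-"
  }'
}

printf 'CCACGG\nG\n' | drop_first
# CACGG
# -
```

The substr answers above reach the same result without a loop; the loop is shown only because the question asks about concatenation.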

Awk - Substring comparison

Working native bash code :
while read line
do
a=${line:112:7}
b=${line:123:7}
if [[ $a != "0000000" || $b != "0000000" ]]
then
echo "$line" >> FILE_OT_YHAV
else
echo "$line" >> FILE_OT_NHAV
fi
done <$FILE_IN
I have the following file (it's a dummy); the substrings being checked are both in the 4th field, so never mind the exact numbers.
AAAAAAAAAAAAAA XXXXXX BB CCCCCCC 12312312443430000000
BBBBBBB AXXXXXX CC DDDDDDD 10101010000000000000
CCCCCCCCCC C C QWEQWEE DDD AAAAAAA A12312312312312310000
I'm trying to write an awk script that compares two specific substrings: if either one is not 0000000 it outputs the line to File A, and if both of them are 0000000 it outputs the line to File B. This is the code I have so far:
# Before first line.
BEGIN {
print "Awk Started"
FILE_OT_YHAV="FILE_OT_YHAV.test"
FILE_OT_NHAV="FILE_OT_NHAV.test"
FS=""
}
# For each line of input.
{
fline=$0
# print "length = #" length($0) "#"
print "length = #" length(fline) "#"
print "##" substr($0,112,7) "##" substr($0,123,7) "##"
if ( (substr($0,112,7) != "0000000") || (substr($0,123,7) != "0000000") )
print $0 > FILE_OT_YHAV;
else
print $0 > FILE_OT_NHAV;
}
# After last line.
END {
print "Awk Ended"
}
The problem is that when I run it, it:
a) treats every line as having a different length;
b) therefore applies the substrings to different parts of each line (that is why I added the length printing before the if, to check on it).
This is a sample output of the line length awk reads and the different substrings :
Awk Started
length = #130#
## ## ##
length = #136#
##0000000##22016 ##
length = #133#
##0000001##16 ##
length = #129#
##0010220## ##
length = #138#
##0000000##1022016##
length = #136#
##0000000##22016 ##
length = #134#
##0000000##016 ##
length = #137#
##0000000##022016 ##
Is there a reason why awk treats lines of the same length as having a different length? Does it have something to do with the spacing of the input file?
Thanks in advance for any help.
After the comments about cleaning the file up with sed, I got this output (and yes, now the lines have different sizes):
1 0M-DM-EM-G M-A.M-E. #DEH M-SM-TM-OM-IM-WM-EM-IM-A M-DM-V/M-DM-T/M-TM-AM-P 01022016 $
2 110000080103M-CM-EM-QM-OM-MM-TM-A M-A. 6M-AM-HM-GM-MM-A 1055801001102 0000120000012001001142 19500000120 0100M-D000000000000000000000001022016 $
3 110000106302M-TM-AM-QM-EM-KM-KM-A 5M-AM-AM-HM-GM-MM-A 1043801001101 0000100000010001001361 19500000100M-IM-SM-O0100M-D000000000000000000000001022016 $
4 110000178902M-JM-AM-QM-AM-CM-IM-AM-MM-MM-G M-KM-EM-KM-AM-S 71M-AM-HM-GM-MM-A 1136101001101 0000130000013001006061 19500000130 0100M-D000000000000000000000001022016 $
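For comparison with the bash loop above, here is a minimal sketch of the same test in awk. Note that bash offsets in ${line:112:7} are 0-based while awk's substr() positions are 1-based, so the sketch adds 1 to each offset. The function name classify and the YHAV/NHAV tags printed instead of writing to two files are choices made for the demonstration. Running awk under LC_ALL=C is an assumption that the offsets are meant to count bytes, which seems plausible given the M- sequences (bytes with the high bit set) in the dump above.

```shell
classify() {
  # $1, $2: 0-based offsets, as used in bash ${line:offset:7}
  LC_ALL=C awk -v a="$1" -v b="$2" '{
    zeros = "0000000"
    # awk substr() is 1-based, hence the + 1
    differs = (substr($0, a + 1, 7) != zeros || substr($0, b + 1, 7) != zeros)
    print (differs ? "YHAV" : "NHAV") ": " $0
  }'
}

printf 'ab0000000cd0000000\nab0000001cd0000000\n' | classify 2 11
# NHAV: ab0000000cd0000000
# YHAV: ab0000001cd0000000
```

With the real file the call would presumably be classify 112 123, matching the bash version.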