Awk script extra output: printing raw line (as read) as well as processed line - awk

I have some CSV files where a certain column is actually supposed to be an array, but ALL fields are separated by commas. I need to convert the file to where every value is quoted, and the array column is a quoted, comma-delimited list. I do know the column index for each file.
I wrote the script below to handle this. However, I get each line printed as hoped for, but followed by the raw line.
desired output:
A,B,C,D
"1","","a,b,c","2"
"3","4","","5"
"","5","d,e","6"
"7","8","f","9"
(base) balter#winmac:~/winhome/CancerGraph$ cat testfile
A,B,C,D
1,,a,b,c,2
3,4,,5
,5,d,e,6
7,8,f,9
(base) balter#winmac:~/winhome/CancerGraph$ ./fix_array_cols.awk FS="," array_col=3 testfile
A,B,C,D
"1","","a,b,c","2"
1,,a,b,c,2
"3","4","","5"
3,4,,5
"","5","d,e","6"
,5,d,e,6
"7","8","f","9"
7,8,f,9
(base) balter#winmac:~/winhome/CancerGraph$ cat fix_array_cols.awk
#!/bin/awk -f
BEGIN {
    getline;
    print $0;
    num_cols = NF;
    #printf("num_cols: %s, array_col: %s\n\n", num_cols, array_col);
}
NR>1 {
    total_fields = NF;
    # fields_before_array = (array_col - 1)
    # fields_before_array + array_length + fields_after_array = NF
    # fields_before_array + fields_after_array + 1 = num_cols
    # array_length - 1 = total_fields - num_cols
    # array_length = total_fields - num_cols + 1
    # fields_after_array = total_fields - array_length - fields_before_array
    #                    = total_fields - (total_fields - num_cols + 1) - (array_col - 1)
    #                    = num_cols - array_col
    fields_before_array = (array_col - 1);
    array_length = total_fields - num_cols + 1;
    fields_after_array = num_cols - array_col;
    first_array_position = array_col;
    last_array_position = array_col + array_length - 1;
    #printf("array_col: %s, fields_before_array: %s, array_length: %s, fields_after_array: %s, total_fields: %s, num_cols: %s", array_col, fields_before_array, array_length, fields_after_array, total_fields, num_cols)
    ### loop through fields before array column,
    ### remove whitespace, and print surrounded with ""
    for (i = 1; i < array_col; i++) {
        gsub(/ /, "", $i);
        printf("\"%s\",", $i);
    }
    ### collect array surrounded by ""
    array_data = "";
    ### loop through array
    for (i = array_col; i < array_col + array_length - 1; i++) {
        gsub(/ /, "", $i);
        array_data = array_data $i ",";
    }
    ### collect last array element with no trailing ,
    array_data = array_data $i
    ### print array surrounded by quotes
    printf("\"%s\",", array_data);
    ### loop through remaining fields, remove whitespace, surround with ""
    for (i = last_array_position + 1; i < total_fields; i++) {
        gsub(/ /, "", $i);
        printf("\"%s\",", $i);
    }
    ### finish line with \n
    printf("\"%s\"\n", $total_fields);
} FILENAME

Remove FILENAME from your script. Standing alone, it is a pattern with no action, and since FILENAME is a non-empty string it evaluates as true for every record; awk's default action for a true pattern is { print $0 }, which is what re-prints each raw line.
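A minimal reproduction of the effect (using a throwaway temp file rather than your data):

```shell
# In awk, a pattern with no action defaults to { print $0 }. FILENAME is a
# non-empty string when reading a real file, so as a bare pattern it is
# "true" for every record and re-prints it.
tmp=$(mktemp)
printf 'a\nb\n' > "$tmp"
awk 'FILENAME' "$tmp"   # prints both lines, just like the stray FILENAME
rm -f "$tmp"
```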

Related

Concatenating array elements into a one string in for loop using awk

I am working on a variant calling format (vcf) file, and I tried to show you guys what I am trying to do:
Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG
Output:
1 877803 838425 C -
1 878077 966631 - CACGG
In summary, I am trying to delete the first letters of longer strings.
And here is my code:
awk 'BEGIN { OFS="\t" } /#/ {next}
{
    m = split($4, a, //)
    n = split($5, b, //)
    x = "-"
    delete y
    if (m > n) {
        for (i = n+1; i <= m; i++) {
            y = sprintf("%s", a[i])
        }
        print $1, $2, $3, y, x
    }
    else if (n > m) {
        for (j = m+1; i <= n; i++) {
            y = sprintf("%s", b[j]) ## Problem here
        }
        print $1, $2, $3, x, y
    }
}' input.vcf > output.vcf
But,
I am getting the following error in line 15, not even in line 9
awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context
I don't know how to concatenate array elements into a one string using awk.
I will be very happy if you guys help me.
Merry X-Mas!
You may try this awk:
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file
1 877803 838425 C -
1 878077 966631 - CACGG
More readable form:
awk -v OFS="\t" 'function trim(s) {
    return (length(s) == 1 ? "-" : substr(s, 2))
}
{
    $4 = trim($4)
    $5 = trim($5)
} 1' file
You can use awk's substr function to process the 4th and 5th space delimited fields:
awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file
If the string from position 2 onwards in field 4 is empty (""), set field 4 to "-"; otherwise set field 4 to the substring of the field from position 2 to the end. Do the same with field 5. Lines, modified or not, are printed by the shorthand 1.
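The trim logic can be checked in isolation on the question's two rows, with just the REF/ALT columns fed in inline; any POSIX awk handles the ternary:

```shell
# "GC" -> "C", lone "G" -> "-"; lone "C" -> "-", "CCACGG" -> "CACGG"
printf 'GC G\nC CCACGG\n' |
awk 'function trim(s) { return length(s) == 1 ? "-" : substr(s, 2) }
     { print trim($1), trim($2) }'
# prints:
# C -
# - CACGG
```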

Concatenate columns and adds digits awk

I have a csv file:
number1;number2;min_length;max_length
"40";"1801";8;8
"40";"182";8;8
"42";"32";6;8
"42";"4";6;6
"43";"691";9;9
I want the output be:
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
So the new file will consist of:
column_1 = a concatenation of old_column_1 + old_column_2 + a number of "0"s equal to (old_column_3 - length of old_column_2)
column_2 = a concatenation of old_column_1 + old_column_2 + a number of "9"s equal to (old_column_3 - length of old_column_2), when min_length = max_length. When min_length is not equal to max_length, I need to take into account all the possible lengths. So for the line "42";"32";6;8, the lengths are 6, 7 and 8.
Also, I need to delete the quotation marks everywhere.
I tried with paste and cut like that:
paste -d ";" <(cut -f1,2 -d ";" < file1) > file2
for the concatenation of the first 2 columns, but I think with awk it's easier. However, I can't figure out how to do it. Any help is appreciated. Thanks!
Edit: Actually, added column 4 in input.
You may use this awk:
awk 'function padstr(ch, len, s) {
    s = sprintf("%*s", len, "")
    gsub(/ /, ch, s)
    return s
}
BEGIN {
    FS = OFS = ";"
}
{
    gsub(/"/, "");
    for (i = 0; i <= ($4 - $3); i++) {
        d = $3 - length($2) + i
        print $1 $2 padstr("0", d), $1 $2 padstr("9", d)
    }
}' file
4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
With awk:
awk '
BEGIN { FS = OFS = ";" }        # set field separator and output field separator to ";"
{
    $0 = gensub("\"", "", "g"); # drop double quotes
    s = $1 $2;                  # the range header number
    l = $3 - length($2);        # number of zeros or 9s to be appended
    l = 10^l;                   # get 10 raised to that number
    print s*l, (s+1)*l - 1;     # appending n zeros is multiplication by 10^n;
                                # appending n nines gives (s+1)*10^n - 1
}' input.txt
Explanation inline as comments.

Use row 1 column ith as output filename awk

I'm a very recent command line user thus I'm requiring some help to split a text file by columns using awk. The difficulty for me is that I want the ith filename to be the text from the 1st row of the ith column.
This is what I had in mind:
awk '{for(i = 2; i <= NF; i++){name= ??FNR == 1 $i?? ;print $1, $i > name}}' myfile.txt
But I don't know how to set the name variable...
Input: myfile.txt
'ID' 'sample_1' 'sample_2' ...
'id_1' 1 2 ...
'id_2' 2 3 ...
Expected output:
sample_1.txt:
'ID' 'sample_1'
'id_1' 1
'id_2' 2
sample_2.txt:
'ID' 'sample_2'
'id_1' 2
'id_2' 3
Thanks
You should keep column headers in an array.
awk 'NR==1 {
    for (i = 2; i <= NF; ++i) {
        fnames[i] = gensub(/\x27/, "", "g", $i)
        print $1, $i > fnames[i] ".txt"
    }
    next
}
{
    for (i = 2; i <= NF; ++i)
        print $1, "\x27" $i "\x27" > fnames[i] ".txt"
}' myfile.txt
\x27 is single quote in hex-escaped form
gensub(/\x27/, "", "g", $i) removes single quotes from column headers to name output files as you wanted.
You can try this awk:
awk -F'\t' '                                   # tab as field separator
{
    for ( i = 2 ; i <= NF ; i++ ) {            # loop from field 2 to the last field
        if ( NR == 1 ) {                       # if first record
            a[i] = $i                          # keep each header field in array a
            gsub ( /^'\''|'\''$/ , "" , a[i] ) # remove the quote at start and end in array a
        }
        print $1 FS $i > a[i]".txt"            # print the needed fields into the corresponding file
    }
}' myfile.txt

awk: Print line numbers separated by comma

awk '{for (i = 1; i <= NF; i++) {gsub(/[^[:alnum:]]/, " "); print tolower($i)": "NR | "sort -V | uniq";}}' input.txt
With above code, I get output as:
line1: 2
line1: 3
line1: 5
line2: 1
line2: 2
line3: 10
I want it like below:
line1: 2, 3, 5
line2: 1, 2
line3: 10
How to achieve it?
Use gawk's array features. I'll provide actual code once I hack it up.
awk '{
    for (i = 1; i <= NF; i++) {
        gsub(/[^[:alnum:]]/, " ");
        arr[tolower($i)] = arr[tolower($i)] NR ", "
    }
}
END {
    for (x in arr) {
        print x ": " substr(arr[x], 1, length(arr[x]) - 2);
    }
}' input.txt | sort
Note that this includes duplicate line numbers if a word appears multiple times on the same lines.
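If the duplicates are unwanted, a per-word/line guard array filters them out; a sketch with hypothetical inline input:

```shell
# Same accumulation idea, but each word records a line number at most once,
# guarded by seen[word, NR].
printf 'foo foo\nbar\nfoo\n' |
awk '{
    gsub(/[^[:alnum:]]/, " ")
    for (i = 1; i <= NF; i++) {
        w = tolower($i)
        if (!seen[w, NR]++)
            arr[w] = arr[w] NR ", "
    }
}
END { for (x in arr) print x ": " substr(arr[x], 1, length(arr[x]) - 2) }' | sort
# prints:
# bar: 2
# foo: 1, 3
```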
using perl...
#!/usr/bin/perl
while (<>) {
    if ( /(\w+):\s*(\d+)/ ) {    # extract the parts
        $arr{lc($1)}{$2}++       # count them
    };
}
for my $k (sort keys %arr) {     # print sorted alphabetically
    print "$k: ";
    $lines = $arr{$k};
    print join(", ", (sort {$a <=> $b} keys %$lines)), "\n";  # print sorted numerically
}
This solution removes duplicates and sorts the numbers. Is this what you needed?

How to print lines of text with date older than two days

I have the following text file that I am working with and must be able to parse only the object name value when the creationdatetime is older than two days.
objectname ...........................: \Path\to\file\hpvss-LUN-22May12 22.24.11\hpVSS-LUN-29Aug12 22.39.15
creationdatetime .....................: 01-Sep-2012 02:17:43
objectname ...........................: \Path\to\file\hpVSS-LUN-22May12 22.24.11\hpVSS-LUN-28Aug12 22.16.19
creationdatetime .....................: 03-Sep-2012 10:18:09
objectname ...........................: \Path\to\file\hpVSS-LUN-22May-12 22.24.11\hpVSS-LUN-27Aug12 22.01.52
creationdatetime .....................: 03-Sep-2012 10:18:33
An output of the command for the above would be:
\Path\to\file\hpvss-LUN-22May12 22.24.11\hpVSS-LUN-29Aug12 22.39.15
Any help will be greatly appreciated.
Prem
Date parsing in awk is a bit tricky but it can be done using mktime. To convert the month name to numeric, an associative translation array is defined.
The path names have space in them so the best choice for field separator is probably : (colon followed by space). Here's a working awk script:
older_than_two_days.awk
BEGIN {
    months2num["Jan"] = 1; months2num["Feb"] = 2; months2num["Mar"] = 3; months2num["Apr"] = 4;
    months2num["May"] = 5; months2num["Jun"] = 6; months2num["Jul"] = 7; months2num["Aug"] = 8;
    months2num["Sep"] = 9; months2num["Oct"] = 10; months2num["Nov"] = 11; months2num["Dec"] = 12;
    now = systime()
    two_days = 2 * 24 * 3600
    FS = ": "
}
$1 ~ /objectname/ {
    path = $2
}
$1 ~ /creationdatetime/ {
    split($2, ds, " ")
    split(ds[1], d, "-")
    split(ds[2], t, ":")
    date = d[3] " " months2num[d[2]] " " d[1] " " t[1] " " t[2] " " t[3]
    age_in_seconds = mktime(date)
    if (now - age_in_seconds > two_days)
        print path
}
All the splitting in the last block is to pick out the date bits and convert them into a format that mktime accepts.
Run it like this:
awk -f older_than_two_days.awk infile
Output:
\Path\to\file\hpvss-LUN-22May12 22.24.11\hpVSS-LUN-29Aug12 22.39.15
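The date wrangling can be verified on one sample timestamp on its own; mktime and strftime are gawk extensions, and TZ is pinned only to make the round trip reproducible:

```shell
# "01-Sep-2012 02:17:43" rearranged into mktime's "YYYY MM DD HH MM SS" form,
# then formatted back with strftime.
TZ=UTC gawk 'BEGIN {
    t = mktime("2012 9 1 2 17 43")
    print strftime("%Y-%m-%d %H:%M:%S", t)
}'
# prints 2012-09-01 02:17:43
```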
I would do it in 2 phases:
1) reformat your input file
awk '/objectname/{$1=$2="";file=$0;getline;$1=$2="";print $0" |"file}' inputfile > inputfile2
This way you would deal with
01-Sep-2012 02:17:43 | \Path\to\file\hpvss-LUN-22May12 22.24.11\hpVSS-LUN-29Aug12 22.39.15
03-Sep-2012 10:18:09 | \Path\to\file\hpVSS-LUN-22May12 22.24.11\hpVSS-LUN-28Aug12 22.16.19
03-Sep-2012 10:18:33 | \Path\to\file\hpVSS-LUN-22May-12 22.24.11\hpVSS-LUN-27Aug12 22.01.52
2) filter on dates:
COMPARDATE=$(($(date +%s)-2*24*3600)) # 2 days ago
IFS='|'
while read d f
do
    [[ $(date -d "$d" +%s) -lt $COMPARDATE ]] && printf "%s\n" "$f"
done < inputfile2
(Note the -lt: inside [[ ]], < compares strings, so the numeric test on epoch seconds needs -lt.)
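A standalone check of the cutoff arithmetic (GNU date's -d flag assumed):

```shell
# Epoch seconds turn "older than two days" into a plain numeric comparison.
COMPARDATE=$(( $(date +%s) - 2*24*3600 ))    # two days ago
old=$(date -d "01-Sep-2012 02:17:43" +%s)    # sample timestamp from the file
[ "$old" -lt "$COMPARDATE" ] && echo "older than two days"
```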