Awk script extra output: printing raw line (as read) as well as processed line - awk
I have some CSV files where a certain column is actually supposed to be an array, but ALL fields are separated by commas. I need to convert each file so that every value is quoted, and the array column becomes a single quoted, comma-delimited list. I do know the column index for each file.
I wrote the script below to handle this. However, each line is printed as hoped for, but then followed by the raw input line.
desired output:
A,B,C,D
"1","","a,b,c","2"
"3","4","","5"
"","5","d,e","6"
"7","8","f","9"
(base) balter#winmac:~/winhome/CancerGraph$ cat testfile
A,B,C,D
1,,a,b,c,2
3,4,,5
,5,d,e,6
7,8,f,9
(base) balter#winmac:~/winhome/CancerGraph$ ./fix_array_cols.awk FS="," array_col=3 testfile
A,B,C,D
"1","","a,b,c","2"
1,,a,b,c,2
"3","4","","5"
3,4,,5
"","5","d,e","6"
,5,d,e,6
"7","8","f","9"
7,8,f,9
(base) balter#winmac:~/winhome/CancerGraph$ cat fix_array_cols.awk
#!/bin/awk -f
BEGIN {
    getline;
    print $0;
    num_cols = NF;
    #printf("num_cols: %s, array_col: %s\n\n", num_cols, array_col);
}
NR>1 {
    total_fields = NF;
    # fields_before_array = (array_col - 1)
    # fields_before_array + array_length + fields_after_array = NF
    # fields_before_array + fields_after_array + 1 = num_cols
    # array_length - 1 = total_fields - num_cols
    # array_length = total_fields - num_cols + 1
    # fields_after_array = total_fields - array_length - fields_before_array
    #                    = total_fields - (total_fields - num_cols + 1) - (array_col - 1)
    #                    = num_cols - array_col
    fields_before_array = (array_col - 1);
    array_length = total_fields - num_cols + 1;
    fields_after_array = num_cols - array_col;
    first_array_position = array_col;
    last_array_position = array_col + array_length - 1;
    #printf("array_col: %s, fields_before_array: %s, array_length: %s, fields_after_array: %s, total_fields: %s, num_cols: %s", array_col, fields_before_array, array_length, fields_after_array, total_fields, num_cols)
    ### loop through fields before the array column,
    ### remove whitespace, and print surrounded by ""
    for (i=1; i<array_col; i++)
    {
        gsub(/ /,"",$i);
        printf("\"%s\",", $i);
    }
    ### collect the array surrounded by ""
    array_data = "";
    ### loop through the array
    for (i=array_col ; i<array_col+array_length-1 ; i++)
    {
        gsub(/ /, "", $i);
        array_data = array_data $i ",";
    }
    ### collect the last array element with no trailing ,
    array_data = array_data $i
    ### print the array surrounded by quotes
    printf("\"%s\",", array_data);
    ### loop through remaining fields, remove whitespace, surround with ""
    for (i=last_array_position+1 ; i<total_fields ; i++)
    {
        gsub(/ /,"",$i);
        printf("\"%s\",", $i);
    }
    ### finish the line with \n
    printf("\"%s\"\n", $total_fields);
} FILENAME
Remove FILENAME from your script. In awk, a bare expression outside any braces is a pattern; FILENAME is a non-empty string while a named file is being read, so that pattern is always true, and since it has no action attached, awk applies the default action, print $0. That is why every processed line is followed by the raw record.
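The effect is easy to reproduce in isolation. This minimal sketch (hypothetical file /tmp/demo.txt) pairs an explicit action with a stray truthy pattern, just like the script above:

```shell
# A bare truthy pattern with no action triggers awk's default
# action, printing the raw record a second time.
printf 'one\ntwo\n' > /tmp/demo.txt
awk '{print "processed: " $0} FILENAME' /tmp/demo.txt
# processed: one
# one
# processed: two
# two
```

Deleting the trailing FILENAME leaves only the processed lines.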
Concatenating array elements into one string in a for loop using awk
I am working on a variant calling format (vcf) file, and I tried to show you guys what I am trying to do:

Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG

Output:
1 877803 838425 C -
1 878077 966631 - CACGG

In summary, I am trying to delete the first letters of longer strings. And here is my code:

awk 'BEGIN { OFS="\t" }
/#/ {next}
{
    m = split($4, a, //)
    n = split($5, b, //)
    x = "-"
    delete y
    if (m>n) {
        for (i = n+1; i <= m; i++) {
            y = sprintf("%s", a[i])
        }
        print $1, $2, $3, y, x
    } else if (n>m) {
        for (j = m+1; i <= n; i++) {
            y = sprintf("%s", b[j])    ## Problem here
        }
        print $1, $2, $3, x, y
    }
}' input.vcf > output.vcf

But, I am getting the following error in line 15, not even in line 9:

awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context

I don't know how to concatenate array elements into a one string using awk. I will be very happy if you guys help me. Merry X-Mas!
You may try this awk:

awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file

1 877803 838425 C -
1 878077 966631 - CACGG

More readable form:

awk -v OFS="\t" '
function trim(s) {
    return (length(s) == 1 ? "-" : substr(s, 2))
}
{
    $4 = trim($4)
    $5 = trim($5)
} 1' file
You can use awk's substr function to process the 4th and 5th space-delimited fields:

awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file

If the string from position 2 onwards in field 4 is equal to "", set field 4 to "-"; otherwise, set field 4 to the extract of the field from position 2 to the end. Do the same with field 5. Print lines, modified or not, with the shorthand 1.
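Either answer can be sanity-checked on the question's sample rows. A minimal sketch using the trim() form, with awk's default space OFS rather than the tab OFS above:

```shell
# Drop the first character of fields 4 and 5, or emit "-" when
# nothing remains (i.e. the field was a single character).
printf '1 877803 838425 GC G\n1 878077 966631 C CCACGG\n' |
awk 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)) }
     { $4 = trim($4); $5 = trim($5) } 1'
# 1 877803 838425 C -
# 1 878077 966631 - CACGG
```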
Concatenate columns and add digits with awk
I have a csv file:

number1;number2;min_length;max_length
"40";"1801";8;8
"40";"182";8;8
"42";"32";6;8
"42";"4";6;6
"43";"691";9;9

I want the output to be:

4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999

So the new file will consist of:

column_1 = a concatenation of old_column_1 + old_column_2 + a number of "0"s equal to (old_column_3 - length of old_column_2)
column_2 = a concatenation of old_column_1 + old_column_2 + a number of "9"s equal to (old_column_3 - length of old_column_2)

when min_length = max_length. And when min_length is not equal to max_length, I need to take into account all the possible lengths. So for the line "42";"32";6;8, the lengths are 6, 7 and 8. Also, I need to delete the quotation marks everywhere. I tried with paste and cut like this:

paste -d ";" <(cut -f1,2 -d ";" < file1) > file2

for the concatenation of the first 2 columns, but I think with awk it's easier. However, I can't figure out how to do it. Any help is appreciated. Thanks!

Edit: Actually, added column 4 in input.
You may use this awk:

awk '
function padstr(ch, len, s) {
    s = sprintf("%*s", len, "")
    gsub(/ /, ch, s)
    return s
}
BEGIN { FS = OFS = ";" }
{
    gsub(/"/, "")
    for (i=0; i<=($4-$3); i++) {
        d = $3 - length($2) + i
        print $1 $2 padstr("0", d), $1 $2 padstr("9", d)
    }
}' file

4018010000;4018019999
4018200000;4018299999
42320000;42329999
423200000;423299999
4232000000;4232999999
42400000;42499999
43691000000;43691999999
With awk:

awk '
BEGIN { FS = OFS = ";" }            # set field separator and output field separator to ";"
{
    $0 = gensub("\"", "", "g")      # drop double quotes
    s = $1 $2                       # the range header number
    l = $3 - length($2)             # number of zeros or nines to be appended
    l = 10^l                        # get 10 raised to that number
    print s*l, (s+1)*l-1            # adding n zeros is multiplication by 10^n;
                                    # adding n nines is multiplication by 10^n plus (10^n - 1)
}' input.txt

Explanation inline as comments.
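The sprintf("%*s", len, "") trick inside the padstr answer above builds a string of len spaces (dynamic field width), which gsub then converts into the pad character. A quick check of that idiom in isolation:

```shell
awk 'BEGIN {
    len = 4
    s = sprintf("%*s", len, "")   # four spaces via dynamic width
    gsub(/ /, "0", s)             # turn each space into the pad char
    print "[" s "]"
}'
# [0000]
```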
Use row 1, column i as the output filename with awk
I'm a very recent command line user, thus I'm requiring some help to split a text file by columns using awk. The difficulty for me is that I want the ith filename to be the text from the 1st row of the ith column. This is what I had in mind:

awk '{for(i = 2; i <= NF; i++){name= ??FNR == 1 $i?? ; print $1, $i > name}}' myfile.txt

But I don't know how to set the name variable...

Input: myfile.txt
'ID' 'sample_1' 'sample_2' ...
'id_1' 1 2 ...
'id_2' 2 3 ...

Expected output:
sample_1.txt:
'ID' 'sample_1'
'id_1' 1
'id_2' 2
sample_2.txt:
'ID' 'sample_2'
'id_1' 2
'id_2' 3

Thanks
You should keep the column headers in an array.

awk 'NR==1 {
    for (i=2; i<=NF; ++i) {
        fnames[i] = gensub(/\x27/, "", "g", $i)
        print $1, $i > (fnames[i] ".txt")
    }
    next
}
{
    for (i=2; i<=NF; ++i)
        print $1, "\x27" $i "\x27" > (fnames[i] ".txt")
}' myfile.txt

\x27 is the single quote in hex-escaped form. gensub(/\x27/, "", "g", $i) removes the single quotes from the column headers, to name the output files as you wanted.
You can try this awk:

awk -F'\t' '                                   # tab as field separator
{
    for ( i = 2 ; i <= NF ; i++ ) {            # for each record loop from field 2 to the last field
        if ( NR == 1 ) {                       # if first record
            a[i] = $i                          # keep each header field in array a
            gsub ( /^'\''|'\''$/ , "" , a[i] ) # remove the quote at start and end in array a
        }
        print $1 FS $i > (a[i] ".txt")         # print the needed fields to the corresponding file
    }
}' myfile.txt
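Both answers can be checked end to end. A minimal sketch (it writes myfile.txt and the sample_*.txt files into the current directory); it uses the portable octal escape \047 for the single quote instead of gawk's \x27, and parentheses around the redirection target, which some awks require:

```shell
printf "'ID' 'sample_1' 'sample_2'\n'id_1' 1 2\n'id_2' 2 3\n" > myfile.txt
# Row 1 supplies the filenames (quotes stripped); every row is then
# fanned out column by column to the matching file.
awk 'NR==1 { for (i=2; i<=NF; ++i) { fnames[i] = $i; gsub("\047", "", fnames[i]) } }
     { for (i=2; i<=NF; ++i) print $1, $i > (fnames[i] ".txt") }' myfile.txt
cat sample_1.txt
# 'ID' 'sample_1'
# 'id_1' 1
# 'id_2' 2
```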
awk: Print line numbers separated by commas
awk '{for (i = 1; i <= NF; i++) {gsub(/[^[:alnum:]]/, " "); print tolower($i)": "NR | "sort -V | uniq";}}' input.txt

With the above code, I get output as:

line1: 2
line1: 3
line1: 5
line2: 1
line2: 2
line3: 10

I want it like below:

line1: 2, 3, 5
line2: 1, 2
line3: 10

How to achieve it?
Use gawk's array features:

awk '{
    for (i = 1; i <= NF; i++) {
        gsub(/[^[:alnum:]]/, " ")
        arr[tolower($i)] = arr[tolower($i)] NR ", "
    }
}
END {
    for (x in arr) {
        print x ": " substr(arr[x], 1, length(arr[x]) - 2)
    }
}' input.txt | sort

Note that this includes duplicate line numbers if a word appears multiple times on the same line.
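If those duplicates are unwanted, one hedged variant (the seen array and its word/line key are my addition, not part of the answer; punctuation stripping is omitted for brevity) records each word/line pair only once before appending:

```shell
printf 'foo foo bar\nfoo\n' |
awk '{
    for (i = 1; i <= NF; i++) {
        key = tolower($i) SUBSEP NR      # one entry per word per line
        if (!(key in seen)) {
            seen[key] = 1
            arr[tolower($i)] = arr[tolower($i)] NR ", "
        }
    }
}
END {
    for (x in arr)
        print x ": " substr(arr[x], 1, length(arr[x]) - 2)
}' | sort
# bar: 1
# foo: 1, 2
```

Without the seen check, foo would come out as "foo: 1, 1, 2".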
Using perl:

#!/usr/bin/perl
while (<>) {
    if ( /(\w+):\s*(\d+)/ ) {       # extract the parts
        $arr{lc($1)}{$2}++;         # count them
    }
}
for my $k (sort keys %arr) {        # print sorted alphabetically
    print "$k: ";
    $lines = $arr{$k};
    print join(", ", (sort { $a <=> $b } keys %$lines)), "\n";   # print sorted numerically
}

This solution removes and sorts the duplicated numbers. Is this what you needed?
How to print lines of text with date older than two days
I have the following text file that I am working with and must be able to parse only the objectname value when the creationdatetime is older than two days:

objectname ...........................: \Path\to\file\hpvss-LUN-22May12 22.24.11\hpVSS-LUN-29Aug12 22.39.15
creationdatetime .....................: 01-Sep-2012 02:17:43
objectname ...........................: \Path\to\file\hpVSS-LUN-22May12 22.24.11\hpVSS-LUN-28Aug12 22.16.19
creationdatetime .....................: 03-Sep-2012 10:18:09
objectname ...........................: \Path\to\file\hpVSS-LUN-22May-12 22.24.11\hpVSS-LUN-27Aug12 22.01.52
creationdatetime .....................: 03-Sep-2012 10:18:33

The output of the command for the above would be:

\Path\to\file\hpvss-LUN-22May12 22.24.11\hpVSS-LUN-29Aug12 22.39.15

Any help will be greatly appreciated. Prem
Date parsing in awk is a bit tricky, but it can be done using mktime (a gawk extension). To convert the month name to numeric, an associative translation array is defined. The path names have spaces in them, so the best choice for field separator is probably ": " (colon followed by space). Here's a working awk script:

older_than_two_days.awk

BEGIN {
    months2num["Jan"] = 1;  months2num["Feb"] = 2;  months2num["Mar"] = 3
    months2num["Apr"] = 4;  months2num["May"] = 5;  months2num["Jun"] = 6
    months2num["Jul"] = 7;  months2num["Aug"] = 8;  months2num["Sep"] = 9
    months2num["Oct"] = 10; months2num["Nov"] = 11; months2num["Dec"] = 12
    now = systime()
    two_days = 2 * 24 * 3600
    FS = ": "
}
$1 ~ /objectname/ { path = $2 }
$1 ~ /creationdatetime/ {
    split($2, ds, " ")
    split(ds[1], d, "-")
    split(ds[2], t, ":")
    date = d[3] " " months2num[d[2]] " " d[1] " " t[1] " " t[2] " " t[3]
    age_in_seconds = mktime(date)
    if (now - age_in_seconds > two_days)
        print path
}

All the splitting in the last block picks out the date parts and converts them into the "YYYY MM DD HH MM SS" format that mktime accepts. Run it like this:

awk -f older_than_two_days.awk infile

Output:

\Path\to\file\hpvss-LUN-22May12 22.24.11\hpVSS-LUN-29Aug12 22.39.15
I would do it in 2 phases.

1) Reformat your input file:

awk '/objectname/{$1=$2="";file=$0;getline;$1=$2="";print $0" |"file}' inputfile > inputfile2

This way you would deal with:

01-Sep-2012 02:17:43 | \Path\to\file\hpvss-LUN-22May12 22.24.11\hpVSS-LUN-29Aug12 22.39.15
03-Sep-2012 10:18:09 | \Path\to\file\hpVSS-LUN-22May12 22.24.11\hpVSS-LUN-28Aug12 22.16.19
03-Sep-2012 10:18:33 | \Path\to\file\hpVSS-LUN-22May-12 22.24.11\hpVSS-LUN-27Aug12 22.01.52

2) Filter on dates (note the numeric -lt: a bare < inside [[ ]] compares strings, not numbers):

COMPARDATE=$(($(date +%s)-2*24*3600))   # 2 days ago
IFS='|'
while read d f
do
    [[ $(date -d "$d" +%s) -lt $COMPARDATE ]] && printf "%s\n" "$f"
done < inputfile2
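That [[ ... ]] pitfall in step 2 is worth seeing in isolation: epoch-second counts must be compared numerically, because the string comparison orders digit by digit. A quick sketch of the difference:

```shell
# String vs numeric comparison in bash's [[ ]]:
# lexically, "99" sorts after "100" because '9' > '1'.
[[ 99 < 100 ]]   && echo "string: 99 < 100"  || echo "string: 99 >= 100"
[[ 99 -lt 100 ]] && echo "numeric: 99 < 100" || echo "numeric: 99 >= 100"
# string: 99 >= 100
# numeric: 99 < 100
```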