I am in need of reorganizing a large CSV file. The first column, which is currently a 6 digit number needs to be split up, using commas as the field separator.
For example, I need this:
022250,10:50 AM,274,22,50
022255,11:55 AM,275,22,55
turned into this:
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Let me know what you think!
Thanks!
It's a lot shorter in perl:
perl -F, -ane '$,=","; print split("",$F[0]), #F[1..$#F]' <file>
Since you don't know perl, a quick explanation. -F, indicates the input field separator is the comma (like awk). -a activates auto-split (into the array #F), -n implicitly wraps the code in a while (<>) { ... } loop, which reads input line-by-line. -e indicates the next argument is the script to run. $, is the output field separator (it gets set iteration of the loop this way, but oh well). split has obvious purpose, and you can see how the array is indexed/sliced. print, when lists as arguments like this, uses the output field separator and prints all their fields.
In awk:
awk -F, '{n=split($1,a,""); for (i=1;i<=n;i++) {printf("%s,",a[i])}; for (i=2;i<NF;i++) {printf("%s,",$i)}; print $NF}' <file>
I think this might work. The split function (at least in the version I am running) splits the value into individual characters if the third parameter is an empty string.
BEGIN{ FS="," }
{
n = split( $1, a, "" );
for ( i = 1; i <= n; i++ )
printf("%s,", a[i] );
sep = "";
for ( i = 2; i <= NF; i++ )
{
printf( "%s%s", sep, $i );
sep = ",";
}
printf("\n");
}
here's another way in awk
$ awk -F"," '{gsub(".",",&",$1);sub("^,","",$1)}1' OFS="," file
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Here's a variation on a theme. One thing to note is it prints the remaining fields without using a loop. Another is that since you're looping over the characters in the first field anyway, why not just do it without using the null-delimiter feature of split() (which may not be present in some versions of AWK):
awk -F, 'BEGIN{OFS=","} {len=length($1); for (i=1;i<len; i++) {printf "%s,", substr($1,i,1)}; printf "%s", substr($1,len,1);$1=""; print $0}' filename
As a script:
BEGIN {FS = OFS = ","}
{
len = length($1);
for (i=1; i<len; i++)
{printf "%s,", substr($1, i, 1)};
printf "%s", substr($1, len, 1)
$1 = "";
print $0
}
Related
I'm processing a Wireshark config file (dfilter_buttons) for display filters and would like to print out the filter of a given name. The content of file is like:
Sample input
"TRUE","test","sip contains \x22Hello, world\x5cx22\x22",""
And the resulting output should have the escape sequences replaced, so I can use them later in my script:
Desired output
sip contains "Hello, world\x22"
My first pass is like this:
Current parser
filter_name=test
awk -v filter_name="$filter_name" 'BEGIN {FS="\",\""} ($2 == filter_name) {print $3}' "$config_file"
And my output is this:
Current output
sip contains \x22Hello, world\x5cx22\x22
I know I can handle these exact two escape sequences by piping to sed and matching those exact two sequences, but is there a generic way to substitutes all escape sequences? Future filters I build may utilize more escape sequences than just " and , and I would like to handle future scenarios.
Using gnu-awk you can do this using split, gensub and strtonum functions:
awk -F '","' -v filt='test' '$2 == filt {n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps); for (i=1; i<n; ++i) printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2)); print subj[i]}' file
sip contains "Hello, world\x22"
A more readable form:
awk -F '","' -v filt='test' '
$2 == filt {
n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps)
for (i=1; i<n; ++i)
printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2))
print subj[i]
}' file
Explanation:
Using -F '","' we split input using delimiter ","
$2 == filt we filter input for $2 == "test" condition
Using /\\x[0-9a-fA-F]{2}/ as regex (that matches 2 digit hex strings) we split $3 and save split tokens into array subj and matched separators into array seps
Using substr we remove first char i.e \\ and prepend 0
Using strtonum we convert hex string to equivalent ascii number
Using %c in printf we print corresponding ascii character
Last for loop joins $3 back using subj and seps array elements
Using GNU awk for FPAT, gensub(), strtonum(), and the 3rd arg to match():
$ cat tst.awk
BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
$2 == ("\"" filter_name "\"") {
gsub(/^"|"$/,"",$3)
while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
printf "%s%c", substr($3,1,RSTART-1), strtonum(gensub(/./,0,1,a[1]))
$3 = a[2]
}
print $3
}
$ awk -v filter_name='test' -f tst.awk file
sip contains "Hello, world\x22"
The above assumes your escape sequences are always \x followed by exactly 2 hex digits. It isolates every \xHH string in the input, replaces \ with 0 in that string so that strtonum() can then convert the string to a number, then uses %c in the printf formatting string to convert that number to a character.
Note that GNU awk has a debugger (see https://www.gnu.org/software/gawk/manual/gawk.html#Debugger) so if you're ever not sure what any part of a program does you can just run it in the debugger (-D) and trace it, e.g. in the following I plant a breakpoint to tell awk to stop at line 1 of the script (b 1), then start running (r) and the step (s) through the script printing the value of $3 (p $3) at each line so I can see how it changes after the gsub():
$ awk -D -v filter_name='test' -f tst.awk file
gawk> b 1
Breakpoint 1 set at file `tst.awk', line 1
gawk> r
Starting program:
Stopping in BEGIN ...
Breakpoint 1, main() at `tst.awk':1
1 BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
gawk> p $3
$3 = uninitialized field
gawk> s
Stopping in Rule ...
2 $2 == "\"" filter_name "\"" {
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
3 gsub(/^"|"$/,"",$3)
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
4 while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
gawk> p $3
$3 = "sip contains \\x22Hello, world\\x5cx22\\x22"
I'm trying to reproduce an awk command using different syntax. I have a file (test.txt) that looks like this:
>NAME_123_CONSENSUS
GACTATACA
ATACTAGA
>NAME2_48_TEST
ATAGCGA
and I'm hoping to replace all occurences of "A" with "1" using different syntax of awk. I can solve this using the following line:
awk '!/_/{gsub("A", "1"); 1' test.txt
However, I cannot get the same result using a for loop,
awk '{for(j=1; j<=NF; j++) if ($j ~ "_") print; else print gsub("A","1")}' test.txt
nor using the following input
awk '{ if ($0 ~ "_") print $0; else print gsub("A", "1"); }' test.txt
Both of these last commands give the following output. Why are they giving different output and what am I missing to make both of the last two commands give the desired output?
>NAME_123_CONSENSUS
4
4
5
>NAME2_48_TEST
3
You are incorrectly using the gsub() function here. The sub()/gsub() function return the number of substitutions made and not the modified string. You set the string to modify as the last argument and print it back
awk '{ for(j=1; j<=NF; j++) if ($j ~ "_") print; else { gsub("A","1",$0); print } }'
That said your first command is most efficient/terse way of writing this. Notice you were missing a } in the OP. It should been written as
awk '!/_/{ gsub("A", "1") }1'
Or use gensub() available in GNU Awk's that return the modified string that you can use to print. See more about it on String-Functions of GNU Awk
awk '{ for(j=1; j<=NF; j++) if ($j ~ "_") print; else print gensub(/A/, "1", "g") }'
I've a CSV file containing records like below.
id,h1,h2,h3,h4,h5,h6,h7
101,zebra,1,papa,4,dog,3,apple
102,2,yahoo,5,kangaroo,7,ape
I want to sort rows into this file without header and first column. My output should like this.
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
I tried below AWK but don't know how to exclude header and first column.
awk -F"," ' {
s=""
for(i=1; i<=NF; i++) { a[i]=$i; }
for(i=1; i<=NF; i++)
{
for(j = i+1; j<=NF; j++)
{
if (a[i] >= a[j])
{
temp = a[j];
a[j] = a[i];
a[i] = temp;
}
}
}
for(i=1; i<=NF; i++){ s = s","a[i]; }
print s
}
' file
Thanks
If perl is okay:
$ perl -F, -lane 'print join ",", $.==1 ? #F : ($F[0], sort #F[1..$#F])' ip.txt
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
-F, to indicate , as input field separator, results saved in #F array
See https://perldoc.perl.org/perlrun#Command-Switches for details on other options
join "," to use , as output field separator
$.==1 ? #F for first line, print as is
($F[0], sort #F[1..$#F]) for other lines, get first field and sorted output of other fields
.. is range operator, $#F will give index of last field
you can also use (shift #F, sort #F) instead of ($F[0], sort #F[1..$#F])
For given header, sorting first line would work too, so this can simplify logic required
$ # can also use: perl -F, -lane 'print join ",", shift #F, sort #F'
$ perl -F, -lane 'print join ",", $F[0], sort #F[1..$#F]' ip.txt
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
$ # can also use: ruby -F, -lane 'print [$F.shift, $F.sort] * ","'
$ ruby -F, -lane 'print [$F[0], $F.drop(1).sort] * ","' ip.txt
id,h1,h2,h3,h4,h5,h6,h7
101,1,3,4,apple,dog,papa,zebra
102,2,5,7,ape,kangaroo,yahoo
if you have gawk use asort:
awk -v OFS="," 'NR>1{split($0, a, ",");
$1=a[1];
delete a[1];
n = asort(a, b);
for (i = 1; i <= n; i++){ $(i+1)=b[i]}};
1' file.csv
This splits the columns to array a with seperator as , for all raws except the first one.
Then assign the first value in the column in a raw with the first value in a and delete this value from a.
Now the a is sorted to b and assign value starting from 2 column. then print it.
You can just use the asort() function in awk for your requirement and start sorting them from second line on-wards. The solution is GNU awk specific because of length(array) function
awk 'NR==1{ print; next }
NR>1 { finalStr=""
arrayLength=""
delete b
split( $0, a, "," )
for( i = 2; i <= length(a); i++ )
b[arrayLength++] = a[i]
asort( b )
for( i = 1; i <= arrayLength ; i++ )
finalStr = (finalStr)?(finalStr","b[i]):(b[i])
printf( "%s", a[1]","finalStr )
printf( "\n" );
}' file
The idea is first we split the entire line with a , delimiter into the array a from which we get the elements from the 2nd field onwards to a new array b. We sort those elements in this new array and append the first column element when we print it finally.
I have multi columns file and i want to extract some info in column 71.
I want to extract using tags which the value can be anything, for example i want to just extract AC=* ; AF=* , where the value can be anything .
I found similar question and gave it a try but it didn't work
Extract columns with values matching a specific pattern
Column 71 looks like this:
AC=14511;AC_AFR=382;AC_AMR=1177;AC_Adj=14343;AC_EAS=5;AC_FIN=427;AC_Het=11813;AC_Hom=1265;AC_NFE=11027;AC_OTH=97;AC_SAS=1228;AF=0.137;AN=106198;AN_AFR=8190;AN_AMR=10424;AN_Adj=99264;AN_EAS=7068;AN_FIN=6414;AN_NFE=51090;AN_OTH=658;AN_SAS=15420;BaseQRankSum=1.73;ClippingRankSum=-1.460e-01;DB;DP=1268322;FS=0.000;GQ_MEAN=190.24;GQ_STDDEV=319.67;Het_AFR=358;Het_AMR=1049;Het_EAS=5;Het_FIN=399;Het_NFE=8799;Het_OTH=83;Het_SAS=1120;Hom_AFR=12;Hom_AMR=64;Hom_EAS=0;Hom_FIN=14;Hom_NFE=1114;Hom_OTH=7;Hom_SAS=54;InbreedingCoeff=0.0478;MQ=60.00;MQ0=0;MQRankSum=0.037;NCC=270;POSITIVE_TRAIN_SITE;QD=21.41;ReadPosRankSum=0.212;VQSLOD=4.79;culprit=MQ;DP_HIST=30|3209|1539|1494|30007|7938|4130|2038|1310|612|334|185|97|60|31|25|9|11|7|33,0|66|339|1048|2096|2665|2626|1832|1210|584|323|179|89|54|31|22|7|9|4|15;GQ_HIST=84|66|56|82|3299|568|617|403|250|319|436|310|28566|2937|827|834|451|186|217|12591,15|15|13|16|25|11|22|28|18|38|52|31|65|76|39|83|93|65|97|12397;CSQ=T|ENSG00000186868|ENST00000334239|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11502.1|ENSP00000334886|TAU_HUMAN|B4DSE3_HUMAN|UPI0000000C16||||2/8||ENST00000334239.8:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000570299|Transcript|intron_variant&non_coding_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|processed_transcript||||||||||2/6||ENST00000570299.1:n.262-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000340799|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000340438|TAU_HUMAN||UPI000004EEE6||||3/10||ENST00000340799.5:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000262410|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000262410|TAU_HUMAN||UPI0000EE80B7||||4/13||ENST00000262410.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000446361|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11500.1|ENSP00000408975|TAU_HUMAN||UPI000004EEE5||||2/9||ENST00000446361.3:c.134-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000574436|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000460965|TAU_HUMAN||UPI000002D754||||3/10||ENST00000574436.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571987|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11501.1|ENSP00000458742|TAU_HUMAN||UPI0000EE80B7||||3/12||ENST00000571987.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000415613|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45715.1|ENSP00000410838|TAU_HUMAN||UPI0001AE66E9||||3/13||ENST00000415613.2:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000571311|Transcript|intron_variant&NMD_transcript_variant||||||rs754512|1||1|MAPT|HGNC|6893|nonsense_mediated_decay|||ENSP00000460048||I3L2Z2_HUMAN|UPI00025A2E6E||||4/4||ENST00000571311.1:c.*176-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000535772|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000443028|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||4/10||ENST00000535772.1:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000576518|Transcript|stop_gained|5499|7|3|K/*|Aag/Tag|rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000458621||I3L170_HUMAN&B4DSE3_HUMAN|UPI0001639A7C|||1/7|||ENST00000576518.1:c.7A>T|ENSP00000458621.1:p.Lys3Ter|T:0.1171|||||||||15792962|||||POSITION:0.00682261208576998&ANN_ORF:-255.6993&MAX_ORF:-255.6993|PHYLOCSF_WEAK|ANC_ALLELE|LC,T|ENSG00000186868|ENST00000420682|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS45716.1|ENSP00000413056|TAU_HUMAN||UPI000004EEE6||||2/9||ENST00000420682.2:c.221-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000572440|Transcript|non_coding_transcript_exon_variant&non_coding_transcript_variant|2790|||||rs754512|1||1|MAPT|HGNC|6893|retained_intron|||||||||1/1|||ENST00000572440.1:n.2790A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000351559|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS11499.1|ENSP00000303214|TAU_HUMAN||UPI000002D754||||4/11||ENST00000351559.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000344290|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|YES|CCDS45715.1|ENSP00000340820|TAU_HUMAN||UPI0001AE66E9||||4/14||ENST00000344290.5:c.308-94A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000347967|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding|||ENSP00000302706|TAU_HUMAN|B4DSE3_HUMAN|UPI0000173D91||||4/10||ENST00000347967.5:c.32-100A>T||T:0.1171|||||||||15792962||||||||,T|ENSG00000186868|ENST00000431008|Transcript|intron_variant||||||rs754512|1||1|MAPT|HGNC|6893|protein_coding||CCDS56033.1|ENSP00000389250|TAU_HUMAN|B4DSE3_HUMAN|UPI000004EEE4||||3/9||ENST00000431008.3:c.308-94A>T||T:0.1171|||||||||15792962||||||||
The code that i tried:
awk '{
for (i = 1; i <= NF; i++) {
if ($i ~ /AC|AF/) {
printf "%s %s ", $i, $(i + 1)
}
}
print ""
}'
I keep getting syntax error.
output wanted :
AC=14511;AF=0.137
Whenever you have name=value pairs, it's usually simplest to first create an array that maps names to values (n2v[] below) and then you can just access the values by their names.
$ cat file
AC=1;AC_AFR=2;AF=3 AC=4;AC_AFR=5;AF=6
$ cat tst.awk
{
delete n2v
split($2,tmp,/[;=]/)
for (i=1; i in tmp; i+=2) {
n2v[tmp[i]] = tmp[i+1]
}
prt("AC")
prt("AF")
}
function prt(name) { print name, "=", n2v[name] }
$ awk -f tst.awk file
AC = 4
AF = 6
Just change $2 to $71 for your real input.
Something like this should do it (in Gnu awk due to switch):
$ awk '{split($71,a,";");for(i in a )if(a[i]~/^AF/) print a[i]}' foo
AF=0.137
You split the field $71 by ;s, loop thru the array you split to looking for desired match. For multiple matches use switch:
$ awk '{
split($0,a,";");
for(i in a )
switch(a[i]) {
case /^AF=/:
b=b a[i] OFS;
break;
case /^AC=/:
b=b a[i] OFS;
break
}
sub(/.$/,"\n",b);
printf b
}' foo
AC=14511 AF=0.137
EDIT: Now it buffers output to a variable and prints it in the end. You can control the separator with OFS.
I need to use an awk script to extract some information from a file.
I have a title line which has 11 field and I split it to an array called titleList.
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
After finding a proper line I need to print the fields which proceeds by the titles for example if the result is :
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
I must print it in this way:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18
Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
I use a for loop to manage it:
for (i=0 ;i<=NF ;i++)
{
printf "%s %s %s %s",titleList[i],":",$i," "
}
everything look good except the result which has 2 problems:
first there is an extra space between each result and second the last field of the searched line is missing
Student Number : 92839342 Name : Robert Bloomingdale Lab1 : 9 Lab2 : 26
Lab3:18 Lab4 : 22 Lab5 : 9 Lab6 : 12 Exam1 : 25 Exam2 : 39 Final
what should I do?
is there any problem with \n at the end of the search result?
You can correct the amount of extra whitespace between fields by correcting the printf statement:
awk -F ":" 'NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
Contents of file.txt:
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
Results:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18 Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
EDIT:
Also, your missing the last value because the file you're working with probably has windows newline endings. To fix this, run: dos2unix file.txt before running your awk code. Alternatively, you can set awk's record separater so that it understands newline endings:
awk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
EDIT:
The above requires GNU awk, split() splits on the FS by default so no need to use that as an arg, it's common to use "next" rather than specifying opposite conditions, and it's common to use print "" instead of printf "\n" so you use the ORS setting rather than hard-coding it's value in output statements. So, the above should be tweaked to:
gawk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array); next } { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; print "" }' file.txt