Extracting data from a txt-file - file-io

I have some input in a txt file, which I attached below. I want to extract the variables x1 to x6, where the values are in the first column after the colon. (For example, for the first x2: -1.55155599552781E+00)
I tried already:
data = textscan(fileID,'%s %s %f %f %f')
But that did not work. What would be the best way to do this?
729 6
===========================================================================
solution 1 :
t : 1.00000000000000E+00 0.00000000000000E+00
m : 1
the solution for t :
x2 : -1.55155599552781E+00 -2.39714921318749E-46
x4 : -2.01518902001522E+00 1.29714616910194E-46
x1 : 1.33015840530650E+00 2.03921256321194E-46
x6 : -2.10342596985387E+00 1.19910915953576E-46
x3 : 1.27944237849516E+00 1.99067515607667E-46
x5 : 2.44955616711054E+00 -1.48359823527798E-46
== err : 2.178E-13 = rco : 2.565E-05 = res : 1.819E-11 ==
solution 2 :
t : 1.00000000000000E+00 0.00000000000000E+00
m : 1
the solution for t :
x2 : 1.55762648294693E+00 1.44303635803762E-45
x4 : 2.10025771786320E+00 -6.97912321099274E-46
x1 : -1.28451613237821E+00 -1.19859598871142E-45
x6 : 2.01187184051108E+00 -7.54361111776421E-46
x3 : -1.33529118239379E+00 -1.22818883958157E-45
x5 : -2.44570040628148E+00 8.62982269594568E-46
== err : 2.357E-13 = rco : 2.477E-05 = res : 1.637E-11 ==

You don't mention what platform you're on or what tools you have available, but here's one way using awk:
$ awk '/^ x[1-6]/{print $3}' your_input
-1.55155599552781E+00
-2.01518902001522E+00
1.33015840530650E+00
-2.10342596985387E+00
1.27944237849516E+00
2.44955616711054E+00
1.55762648294693E+00
2.10025771786320E+00
-1.28451613237821E+00
2.01187184051108E+00
-1.33529118239379E+00
-2.44570040628148E+00
or, like this:
$ awk '/^ x[1-6]/{print $1, $3}' f1
x2 -1.55155599552781E+00
...
or using grep and cut:
$ grep '^ x[1-6]' your_input | cut -d' ' -f-4,5
x2 : -1.55155599552781E+00
...
Perl:
perl -lane 'print $F[2] if /^ x[1-6]/' your_input
Stupid and simple Python:
#!/usr/bin/env python
with open("f1") as fd:
    for line in fd:
        if line.startswith(' x'):
            print(line.strip().split()[2])
sed:
$ sed -n 's/^ x[1-6] : *\([^ ]*\).*$/\1/p' your_input

In Python this should do it
import re
vars = {}
for x in open("data.txt"):
    m = re.match(r"^\s+(x\d+)\s+:\s+([^ ]*).*", x)
    if m:
        vars[m.group(1)] = float(m.group(2))
    if re.match(r"^== err.*", x):
        # Got a full solution
        print(vars)
        vars = {}
When a full solution is found the variables are available as
vars['x1']
vars['x2']
already converted to floating point numbers
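If you would rather keep every solution instead of printing each one as it is found, the same idea can be sketched as a function that returns a list of dicts. This is only a sketch: the name parse_solutions and the file name data.txt are placeholders of mine, and it assumes each solution block ends with an "== err" line as in the sample.

```python
import re

# Matches lines such as " x2 : -1.55155599552781E+00 -2.39714921318749E-46"
# and captures the variable name and the first value after the colon.
VAR_LINE = re.compile(r"^\s*(x\d+)\s*:\s*(\S+)")


def parse_solutions(lines):
    """Return one dict per solution, mapping 'x1'..'x6' to the first-column value."""
    solutions = []
    current = {}
    for line in lines:
        m = VAR_LINE.match(line)
        if m:
            current[m.group(1)] = float(m.group(2))
        elif line.lstrip().startswith("== err"):
            # The "== err" line closes a solution block.
            solutions.append(current)
            current = {}
    return solutions


# Example use (the file name is a placeholder):
# with open("data.txt") as fh:
#     for sol in parse_solutions(fh):
#         print(sol["x1"], sol["x2"])
```

Lines such as "t : ..." and "m : 1" are ignored because they do not start with x followed by digits.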

Related

Using awk to color the output in bash

I have two files.
The first one is a CSV file and the other one is a plain text file.
I want to print every line of file2 that contains a value from column 1 of the first file, using column 2 as the font color and column 3 as the background color.
for example:
f1 contains
Basic Engineering,BLACK,WHITE
Science,RED,BLUE
f2 contains (each field is 20 characters wide):
foo abc Science AA
bar cde Basic Engineering AP
baz efgh Science AB
expected output:
foo abc Science AA (Red font, Blue background)
bar cde Basic Engineering AP (Black font, White background)
baz efgh Science AB (Red font, Blue background)
I have already defined the colors in a separate file defineColors.sh as:
BLACK_FONT=`tput setaf 0`
RED_FONT=`tput setaf 1`
WHITE_BACKGROUND=`tput setab 7`
BLUE_BACKGROUND=`tput setab 4`
RESET_ALL=`tput sgr0`
My try :
awk -F, '{sed "/$1/p" f2 | sed -e 's/^\(.*\)$/'"$2_FONT"''"$3_BACKGROUND"'\1/' }' f1
$ cat tst.awk
BEGIN {
    split("BLACK RED GREEN YELLOW BLUE MAGENTA CYAN WHITE",tputColors)
    for (i in tputColors) {
        colorName = tputColors[i]
        colorNr = i-1
        cmd = "tput setaf " colorNr
        fgEscSeq[colorName] = ( (cmd | getline escSeq) > 0 ? escSeq : "<" colorName ">" )
        close(cmd)
        cmd = "tput setab " colorNr
        bgEscSeq[colorName] = ( (cmd | getline escSeq) > 0 ? escSeq : "<" colorName ">" )
        close(cmd)
    }
    cmd = "tput sgr0"
    colorOff = ( (cmd | getline escSeq) > 0 ? escSeq : "<sgr0>" )
    close(cmd)
    FS = ","
}
NR == FNR {
    key = $1
    fgColor[key] = fgEscSeq[$2]
    bgColor[key] = bgEscSeq[$3]
    next
}
{
    # change this to substr($0,41,20) for your real 20-char fields data
    key = substr($0,15,20)
    gsub(/^[[:space:]]+|[[:space:]]+$/,"",key)
    print bgColor[key] fgColor[key] $0 colorOff
}
Using the pipe to cat -v so you can see color code escape sequences are being output:
$ awk -f tst.awk f1 f2 | cat -v
^[[44m^[[31mfoo abc Science AA^[(B^[[m
^[[47m^[[30mbar cde Basic Engineering AP^[(B^[[m
^[[44m^[[31mbaz efgh Science AB^[(B^[[m
I see you updated your question to say "I have already defined color in a separate file defineColors.sh" and showed a shell script. Just don't use that; it's not needed.
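For comparison, the same lookup can be sketched in Python. Note the assumptions: instead of calling tput, this hard-codes the standard ANSI SGR codes (30-37 for foreground, 40-47 for background), which is what tput setaf/setab emit on most terminals but is not guaranteed everywhere. The function names load_colors and colorize are mine, and the key is taken from the third 20-character field (0-based offset 40), matching the substr($0,41,20) mentioned in the awk comment.

```python
# ANSI colour numbers; foreground code = 30 + n, background code = 40 + n.
ANSI_COLORS = {"BLACK": 0, "RED": 1, "GREEN": 2, "YELLOW": 3,
               "BLUE": 4, "MAGENTA": 5, "CYAN": 6, "WHITE": 7}
RESET = "\033[0m"


def load_colors(csv_lines):
    """Map each key (column 1) to a (foreground, background) escape-sequence pair."""
    colors = {}
    for line in csv_lines:
        key, fg, bg = line.rstrip("\n").split(",")
        colors[key] = ("\033[%dm" % (30 + ANSI_COLORS[fg]),
                       "\033[%dm" % (40 + ANSI_COLORS[bg]))
    return colors


def colorize(line, colors, start=40, width=20):
    """Wrap the line in the colours of the key found in the fixed-width field."""
    key = line[start:start + width].strip()
    if key not in colors:
        return line  # unknown key: leave the line uncoloured
    fg, bg = colors[key]
    return bg + fg + line + RESET
```

To use it, read f1 with load_colors and print colorize(line.rstrip("\n"), colors) for each line of f2.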

sum up the output of 'pkpgcounter -ccmyk' in groups Cyan, Magenta, Yellow, Black to calculate ink usage

For print accounting I'm using Tea4Cups. In /etc/cups/tea4cups.conf I have a line:
echo `pkpgcounter -ccmyk $TEADATAFILE` |sed 's/C\ \://g'|sed 's/M\ \://g'|sed 's/Y\ \://g'|sed 's/K\ \://g'|sed 's/\%/;/g'|sed 's/\./,/g' >>/var/log/accounting_ink.csv
pkpgcounter -ccmyk $TEADATAFILE gives output like:
C : 4.732829% M : 4.716022% Y : 3.545420% K : 0.000000%
C : 4.753109% M : 4.736302% Y : 3.560630% K : 0.000000%
C : 4.760295% M : 4.743488% Y : 3.566019% K : 0.000000%
The more pages a file has, the more lines of output the command produces.
sed strips the non-numeric characters from the output and turns it into the following:
3,699918; 3,285596; 2,983343; 4,169371; 1,596966; 1,635378; 1,621895; 1,306214;
Now I need to add up the values for C, M, Y and K separately to get an idea of the toner/ink usage of the print jobs.
That means adding values 1 and 5, values 2 and 6, and so on. But maybe a first step is to determine the total number of values?
You never need sed when you're using awk so the intermediate output from your sed command isn't useful for this, all we need is your original output from pkpgcounter.
You don't show your expected output so it's a guess but is this what you're trying to do?
$ cat file
C : 4.732829% M : 4.716022% Y : 3.545420% K : 0.000000%
C : 4.753109% M : 4.736302% Y : 3.560630% K : 0.000000%
C : 4.760295% M : 4.743488% Y : 3.566019% K : 0.000000%
$ cat tst.awk
{
    for (i=1; i<NF; i+=3) {
        val[i] = $i
        sum[i] += $(i+2)
    }
}
END {
    for (i=1; i<NF; i+=3) {
        printf "%s%s: %s", (i>1?OFS:""), val[i], sum[i]
    }
    print ""
}
$ awk -f tst.awk file
C: 14.2462 M: 14.1958 Y: 10.6721 K: 0
You could calculate the individual totals using awk
echo `pkpgcounter -ccmyk $TEADATAFILE` |
awk '{c+=$3;m+=$6;y+=$9;k+=$12}{print}END{printf "C : %.5f%% M:%.5f%% Y:%.5f%% K:%.5f%%",c,m,y,k; print ""}'
C : 4.732829% M : 4.716022% Y : 3.545420% K : 0.000000%
C : 4.753109% M : 4.736302% Y : 3.560630% K : 0.000000%
C : 4.760295% M : 4.743488% Y : 3.566019% K : 0.000000%
C : 14.24623% M:14.19581% Y:10.67207% K:0.00000%
With GNU awk (for multi-character RS), where file holds the output from pkpgcounter:
$ awk 'BEGIN{RS="%"RS"?"}{a[$1]+=$NF}END{for(i in a)print i, a[i]}' file
C 14.2462
K 0
Y 10.6721
M 14.1958
You could pipe the output to the awk instead of using the file as source.
Edit: Single line printing version, as requested:
$ awk 'BEGIN{RS="%"RS"?"}{a[$1]+=$NF}END{for(i in a)printf "%s: %s ", i, a[i];print ""}' file
C: 14.2462 K: 0 Y: 10.6721 M: 14.1958
Edit 2: Single line printing with output order picked from the first record:
$ awk '
BEGIN { RS="%"RS"?" } # set RS
NR<=4 { b[NR]=$1 } # store original order to b
{ a[$1]+=$NF } # sum
END { for(i=1;i<=4;i++) # respect the original order
printf "%s: %s ", b[i], a[b[i]] # output
print "" # finishing touch
}' file
C: 14.2462 M: 14.1958 Y: 10.6721 K: 0
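If awk is not a requirement, the per-channel sum can also be sketched in Python. This assumes the "C : 4.732829% M : ..." layout shown in the question; the function name sum_channels is mine, not part of pkpgcounter or Tea4Cups.

```python
import re

# One "<channel> : <number>%" pair, e.g. "C : 4.732829%".
PAIR = re.compile(r"([CMYK])\s*:\s*([0-9.]+)%")


def sum_channels(lines):
    """Add up the percentage reported for each of C, M, Y, K over all pages."""
    totals = {}
    for line in lines:
        for channel, value in PAIR.findall(line):
            totals[channel] = totals.get(channel, 0.0) + float(value)
    return totals


# Example use, reading the pkpgcounter output from standard input:
# import sys
# totals = sum_channels(sys.stdin)
# print(" ".join("%s: %.5f" % (ch, totals.get(ch, 0.0)) for ch in "CMYK"))
```

Because the pairs are found by pattern rather than by field position, it does not matter whether each page is on its own line or everything arrives on one line (as it does after the echo with backticks).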

Awk - Substring comparison

Working native bash code :
while read line
do
    a=${line:112:7}
    b=${line:123:7}
    if [[ $a != "0000000" || $b != "0000000" ]]
    then
        echo "$line" >> FILE_OT_YHAV
    else
        echo "$line" >> FILE_OT_NHAV
    fi
done <$FILE_IN
I have the following file (it's a dummy); the substrings being checked are both in the 4th field, so never mind the exact numbers.
AAAAAAAAAAAAAA XXXXXX BB CCCCCCC 12312312443430000000
BBBBBBB AXXXXXX CC DDDDDDD 10101010000000000000
CCCCCCCCCC C C QWEQWEE DDD AAAAAAA A12312312312312310000
I'm trying to write an awk script that compares two specific substrings: if either one is not 0000000 it outputs the line into File A, and if both of them are 0000000 it outputs the line into File B. This is the code I have so far:
# Before first line.
BEGIN {
    print "Awk Started"
    FILE_OT_YHAV="FILE_OT_YHAV.test"
    FILE_OT_NHAV="FILE_OT_NHAV.test"
    FS=""
}
# For each line of input.
{
    fline=$0
    # print "length = #" length($0) "#"
    print "length = #" length(fline) "#"
    print "##" substr($0,112,7) "##" substr($0,123,7) "##"
    if ( (substr($0,112,7) != "0000000") || (substr($0,123,7) != "0000000") )
        print $0 > FILE_OT_YHAV;
    else
        print $0 > FILE_OT_NHAV;
}
# After last line.
END {
    print "Awk Ended"
}
The problem is that when I run it, it:
a) treats every line as having a different length, and
b) therefore applies the substrings to different parts of each line (that is why I added the length printout before the if, to check on it).
This is a sample of the line lengths awk reports and the resulting substrings:
Awk Started
length = #130#
## ## ##
length = #136#
##0000000##22016 ##
length = #133#
##0000001##16 ##
length = #129#
##0010220## ##
length = #138#
##0000000##1022016##
length = #136#
##0000000##22016 ##
length = #134#
##0000000##016 ##
length = #137#
##0000000##022016 ##
Is there a reason why awk treats lines of the same length as having a different length? Does it have something to do with the spacing of the input file?
Thanks in advance for any help.
After the comments about cleaning the file up with sed, I got this output (and yes, now the lines have different sizes):
1 0M-DM-EM-G M-A.M-E. #DEH M-SM-TM-OM-IM-WM-EM-IM-A M-DM-V/M-DM-T/M-TM-AM-P 01022016 $
2 110000080103M-CM-EM-QM-OM-MM-TM-A M-A. 6M-AM-HM-GM-MM-A 1055801001102 0000120000012001001142 19500000120 0100M-D000000000000000000000001022016 $
3 110000106302M-TM-AM-QM-EM-KM-KM-A 5M-AM-AM-HM-GM-MM-A 1043801001101 0000100000010001001361 19500000100M-IM-SM-O0100M-D000000000000000000000001022016 $
4 110000178902M-JM-AM-QM-AM-CM-IM-AM-MM-MM-G M-KM-EM-KM-AM-S 71M-AM-HM-GM-MM-A 1136101001101 0000130000013001006061 19500000130 0100M-D000000000000000000000001022016 $
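Two things may be worth checking here. First, bash's ${line:112:7} uses a 0-based offset while awk's substr($0,112,7) is 1-based, so the awk equivalent of the bash code would be substr($0,113,7) and substr($0,124,7). Second, the M-D, M-E, ... sequences in the output above are non-ASCII bytes, and an awk running in a multibyte locale may count characters rather than bytes; that would make length() vary from line to line, but it is only a guess from the output shown, not something confirmed in this thread. As a cross-check, here is a minimal Python sketch that works on raw bytes, so the offsets are byte positions regardless of locale. The function name split_by_flags and the default offsets (copied from the bash code) are assumptions.

```python
def split_by_flags(lines, first=(112, 7), second=(123, 7)):
    """Partition byte-string lines by two fixed-width fields.

    Each field is given as (0-based byte offset, width). A line goes to the
    first list if either field differs from all zeros, else to the second.
    """
    have, have_not = [], []
    for line in lines:
        a = line[first[0]:first[0] + first[1]]
        b = line[second[0]:second[0] + second[1]]
        if a != b"0" * first[1] or b != b"0" * second[1]:
            have.append(line)
        else:
            have_not.append(line)
    return have, have_not


# Example use (file names are placeholders); "rb" keeps the data as bytes:
# with open("FILE_IN", "rb") as fh:
#     have, have_not = split_by_flags(fh.read().splitlines())
```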

Passing a shell variable to awk in gnuplot

I need to pass a shell variable to awk in gnuplot, but I get an error message.
The variable is set in the script and is called FILE. Its value changes according to the date.
My code (in a gnuplot script):
plot FILE using 1:14 with points pointtype 7 pointsize 1 # this works fine
replot '< awk ''{y1 = y2; y2 = $14; if (NR > 1 && y2 - y1 >= 100) printf("\n") ; if (NR > 1 && y2 -y1 <= -100) printf("\n"); print}'' FILE' using 1:14 with linespoints
Error message:
awk: fatal: cannot open file `FILE' for reading (No such file or directory)
When I hard code the FILE path the replot works.
Could anyone clarify the code I need to pass this variable to awk? Am I on the right track with something like :
% environment_variable=FILE
% awk -vawk_variable="${environment_variable}" 'BEGIN { print awk_variable }' ?
Here is my gnuplot script code, mostly cobbled together from other posts:
#FILE selection - we want to plot the most recent data file
FILE = strftime('/data/%Y-%m-%d.txt', time(0)) # this is correct
print "FILE is : " .FILE
#set file path variable for awk : (This is where my problem is)
awk -v var="$FILE" '{print var}'
awk '{print $0}' <<< "$FILE"
Thank you in advance
If FILE is a gnuplot variable that contains the path of the file, you can do this:
FILE = 'input'
plot '<awk ''1'' ' . FILE
This concatenates the value of the gnuplot variable FILE onto the end of the awk command. The resulting awk "script" is therefore awk '1' input (which just prints every line of the file); you can substitute the '1' for whatever it is you want to do with awk.
By the way, your awk script can be simplified a little bit to this:
awk '{ y1 = y2; y2 = $14 } NR > 1 && (y2 - y1 >= 100 || y2 - y1 <= -100) { print "" } { print $1, $14 }'
It's not often that you need to use if in awk, as each block { } is executed conditionally (or if no condition is specified, the block is always executed). Assuming you haven't modified the record separator (the RS variable), print "" is the same as printf("\n"). Rather than specifying using 1:14 in gnuplot, you may as well only print the columns that you are interested in using print $1, $14.
So your replot line in gnuplot would be:
replot '<awk ''{ y1 = y2; y2 = $14 } NR > 1 && (y2 - y1 >= 100 || y2 - y1 <= -100) { print "" } { print $1, $14 }'' ' . FILE with linespoints
Of course, this line is getting a bit long. You might want to split it up a bit:
awk_cmd = '{ y1 = y2; y2 = $14 } NR > 1 && (y2 - y1 >= 100 || y2 - y1 <= -100) { print "" } { print $1, $14 }'
replot sprintf("<awk '%s' %s", awk_cmd, FILE) with linespoints
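To make the logic of that awk command easier to follow, here is the same gap-insertion idea as a Python sketch: emit an empty line before any row whose value differs from the previous row's by 100 or more in either direction, which is what makes gnuplot break the line between points. The function name insert_gaps and its parameters are mine; column 13 is the 0-based index of awk's $14. Unlike the awk version it yields the whole row rather than only columns 1 and 14.

```python
def insert_gaps(rows, column=13, threshold=100.0):
    """Yield rows, preceded by "" whenever the value jumps by >= threshold."""
    previous = None
    for row in rows:
        value = float(row.split()[column])
        # Same test as: NR > 1 && (y2 - y1 >= 100 || y2 - y1 <= -100)
        if previous is not None and abs(value - previous) >= threshold:
            yield ""
        previous = value
        yield row


# Example use (the file name is a placeholder):
# with open("input") as fh:
#     for out in insert_gaps(line.rstrip("\n") for line in fh):
#         print(out)
```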

using awk or sed extract first character of each column and store it in a separate file

I have a file like below
AT AT AG AG
GC GC GG GC
I want to extract the first and last character of every column and store them in two different files:
File1:
A A A A
G G G G
File2:
T T G G
C C G C
My input file is very large. Is there a way to do this in awk or sed?
With GNU awk for gensub():
gawk '{
    print gensub(/.( |$)/,"\\1","g") > "file1"
    print gensub(/(^| )./,"\\1","g") > "file2"
}' file
You can do similar in any awk with gsub() and a couple of variables.
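You can try this.
Write the following in test.awk:
#!/usr/bin/awk -f
BEGIN {
    outfile_head = "file1"
    outfile_tail = "file2"
}
{
    for (i = 1; i <= NF; i++) {
        # a space between columns, a newline after the last one
        sep = (i < NF ? " " : "\n")
        # awk strings are 1-indexed, so the first character is substr($i, 1, 1)
        printf "%s%s", substr($i, 1, 1), sep > outfile_head
        printf "%s%s", substr($i, length($i), 1), sep > outfile_tail
    }
}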
then you run this:
./test.awk file
It's easy to do in two passes:
sed 's/\([^ ]\)[^ ]/\1/g' file > file1
sed 's/[^ ]\([^ ]\)/\1/g' file > file2
Doing it in one pass is a challenge...
Edit 1: Modified for your multiple line edit.
You could write a perl script and pass in the file names if you plan to edit it and share it. This loops through the file only once and does not require storing the file in memory.
File "seq.pl":
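#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl seq.pl input first_chars_out last_chars_out
open(my $in, '<',  $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
open(my $f1, '>>', $ARGV[1]) or die "Cannot open $ARGV[1]: $!";
open(my $f2, '>>', $ARGV[2]) or die "Cannot open $ARGV[2]: $!";
while (my $line = <$in>) {
    $line =~ s/(\r|\n)+//g;
    # split on runs of whitespace, ignoring leading blanks
    my @pairs = split(' ', $line);
    for my $pair (@pairs) {
        my @bases = split(//, $pair);
        print $f1 $bases[0]." ";
        # index -1 is the last element of the array
        print $f2 $bases[-1]." ";
    }
    print $f1 "\n";
    print $f2 "\n";
}
close($f1);
close($f2);
close($in);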
Execute it like so:
perl seq.pl full.seq f1.seq f2.seq
File "full.seq":
AT AT AG AG
GC GC GG GC
AT AT GC GC
File "f1.seq":
A A A A
G G G G
A A G G
File "f2.seq":
T T G G
C C G C
T T C C
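For completeness, the one-pass approach used by the answers above can also be sketched in Python. It reads the input line by line, so the file is never held in memory. The function names and the output file names are placeholders, and it assumes whitespace-separated, non-empty columns as in the sample.

```python
def split_first_last(line):
    """Return (first characters, last characters) of each column, space-joined."""
    fields = line.split()
    firsts = " ".join(field[0] for field in fields)
    lasts = " ".join(field[-1] for field in fields)
    return firsts, lasts


def split_file(src, first_path, last_path):
    """Stream src once, writing first characters and last characters to two files."""
    with open(src) as fin, open(first_path, "w") as f1, open(last_path, "w") as f2:
        for line in fin:
            firsts, lasts = split_first_last(line)
            f1.write(firsts + "\n")
            f2.write(lasts + "\n")


# Example use (file names are placeholders):
# split_file("full.seq", "f1.seq", "f2.seq")
```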