Gnuplot: conditional plotting ($2 == 15 ? $2 : '1/0') with lines

My data looks like this:
10:15:8:6.06000000:
10:15:2:19.03400000:
10:20:8:63.50600000:
10:20:2:24.71800000:
10:25:8:33.26200000:
10:30:8:508.23400000:
20:15:8:60.06300000:
20:15:2:278.63100000:
20:20:8:561.18000000:
20:20:2:215.46600000:
20:25:8:793.36000000:
20:25:2:2347.52900000:
20:30:8:5124.98700000:
20:30:2:447.41000000:
(...)
I'd like to plot a "linespoints" plot with $1 on the x-axis, and 8 different lines representing each combination of ($2,$3), e.g.: (15,8), (15,2), ...
In order to do this sort of conditional plotting, people suggest the following:
plot 'mydata.dat' using 1:($2==15 && $3==8 ? $4 : 1/0) with linespoints title 'v=15, l=8'
However, gnuplot is unable to draw a line through these points: an undefined value (1/0) is inserted in place of every data point for which ($2==15 && $3==8) doesn't hold, and those points break the line.
Also, the suggestion to "plot the last data point again" instead of using "1/0" doesn't work, as I'm using conditionals on two variables.
Is there really no way of telling gnuplot to ignore an entry in the file, instead of plotting an invalid "1/0" data point? Note that replacing it with "NaN" yields the same result.
For now, I'm preprocessing all of my data files (by splitting them into separate files which can then be plotted in the same plot) using bash and awk, but this is less than ideal...
Thanks!

+1 for a great question. I (mistakenly) would have thought that what you had would work, but looking at help datafile using examples shows that I was in fact wrong. The behavior you're seeing is as documented. Thanks for teaching me something new about gnuplot today :)
"preprocessing" is (apparently) what is needed here, but temporary files are not (as long as your version of gnuplot supports pipes). Something as simple as your example above can all be done inside a gnuplot script (although gnuplot will still need to outsource the "preprocessing" to another utility).
Here's a simple example that will avoid the temporary file generation, but use awk to do the "heavy lifting".
set datafile sep ':' #split lines on ':'
plot "<awk -F: '{if($2 == 15 && $3 == 8){print $0}}' mydata.dat" u 1:4 w lp title 'v=15, l=8'
Notice the "< awk ...". Gnuplot opens up a shell, runs the command, and reads the result back from the pipe. No temporary files necessary. Of course, in this example, we could have used {print $1,$4} (instead of {print $0}) and left off the using specification altogether, e.g.:
plot "<awk -F: '{if($2 == 15 && $3 == 8){print $1,$4}}' mydata.dat" w lp title 'v=15, l=8'
will also work. Any command on your system which writes to standard output will work.
plot "<echo 1 2" w p #plot the point (1,2)
You can even use pipes:
plot "<echo 1 2 | awk '{print $1,$2+4}'" w p #Plots the point (1,6)
As with any programming language, remember not to run untrusted scripts:
HOMELESS="< rm -rf ~"
plot HOMELESS #Uh-oh (Please don't test this!!!!!)
Isn't gnuplot fun?

...just stumbled across this old question... Well, it's not "acceptable" that you need an external tool for such a basic task when you want to plot the filtered data connected with lines or with linespoints. There is a gnuplot-native solution. The "trick" of the workaround is to plot several data points on top of each other and to change the coordinates only when a new matching point has been found.
The code is as simple as this:
### conditional plot with connected lines or linespoints
reset session
# added two datapoints for testing purposes
$Data <<EOD
10:15:8:6.06000000:
10:15:2:19.03400000:
10:20:8:63.50600000:
10:20:2:24.71800000:
10:25:8:33.26200000:
10:30:8:508.23400000:
13:20:8:8.88888888:
15:15:8:9.99999999:
20:15:8:60.06300000:
20:15:2:278.63100000:
20:20:8:561.18000000:
20:20:2:215.46600000:
20:25:8:793.36000000:
20:25:2:2347.52900000:
20:30:8:5124.98700000:
20:30:2:447.41000000:
EOD
set datafile separator ":"
x0 = y0 = NaN
plot $Data u ($2==15 && $3==8 ? (y0=$4,x0=$1) : x0):(y0) w lp pt 7
### end of code
Addition:
Just for completeness: set datafile missing "NaN" solves the problem in gnuplot 5.x, but this question dates from gnuplot 4.6 times, and some people still seem to plot with version 4.x.
SO_Filter.dat
# added two datapoints for testing purposes
10:15:8:6.06000000:
10:15:2:19.03400000:
10:20:8:63.50600000:
10:20:2:24.71800000:
10:25:8:33.26200000:
10:30:8:508.23400000:
13:20:8:8.88888888:
15:15:8:9.99999999:
20:15:8:60.06300000:
20:15:2:278.63100000:
20:20:8:561.18000000:
20:20:2:215.46600000:
20:25:8:793.36000000:
20:25:2:2347.52900000:
20:30:8:5124.98700000:
20:30:2:447.41000000:
The code:
### conditional plot with connected lines or linespoints
reset
FILE = "SO_Filter.dat"
set datafile separator ":"
set multiplot layout 2,1 title "generated with gnuplot 4.6"
# this works with gnuplot 4.x and 5.x
x0 = y0 = NaN
plot FILE u ($2==15 && $3==8 ? (y0=$4,x0=$1) : x0):(y0) w lp pt 7 ti "works with gnuplot 4.x and 5.x"
# this works only with gnuplot 5.x
set datafile missing "NaN"
plot FILE u ($2==15 && $3==8 ? $1 : NaN ):4 w lp pt 7 ti "works only with gnuplot 5.x"
unset multiplot
### end of code

Related

gnuplot 'set title' with sprintf: representing angle in terms of fractions of pi

I'd like to run a gnuplot .inp file so all the angles in the script show up automatically in the title as fractions based on the Greek letter pi - instead of a decimal form for the angle. I already know how to use {/Symbol p}, but that is a manual intervention that is impractical in this case.
I have an example sprintf line in a gnuplot input file which can produce nice title information :
angle=( (3*pi) /4 )
set title sprintf ("the angle is %g radians", angle)
plot sin(x)
... the output file (e.g. svg) or terminal (e.g. wxt) shows "2.35619", which is correct; however, it would be nice to see the Greek letter for pi and the fraction itself, as is typically read off of a polar plot, e.g. "3/4 pi". Likewise for more complex or interesting representations of pi, such as "square root of two over two".
I already know I can manually go into the file and type in by hand "3{/Symbol p}/4", but this needs to be done automatically, because the actual title I am working with has numerous instances of pi showing up as a result of a setting of an angle.
I tried searching for examples of gnuplot being used with sprintf to produce the format of the angle I am interested in, and could not find anything. I am not aware of sprintf being capable of this. So if this is in fact impossible with gnuplot and sprintf, it will be helpful to know. Any tips on what to try next appreciated.
UPDATE: not a solution, but very interesting, might help:
use sprintf after the 'plot' to set the title that appears in the key (but not the overall title):
gnuplot setting line titles by variables
so for example here, the idea would be:
foo=20
plot sin(x)+foo t sprintf ("The angle is set to %g", foo)
Here is an attempt to define a function to find fractions of Pi.
Basically, sum (check help sum) is used to find suitable multiples/fractions of pi within a certain tolerance (here: 0.0001). Denominators up to 32 are tested. If no integer match is found, the number itself is returned.
In principle, the function could be extended to find multiples or fractions of roots, sqrt(2) or sqrt(3), etc.
This approach can certainly be improved, maybe there are smarter solutions.
Script:
### format number as multiple of pi
reset session
$Data <<EOD
1.5707963267949
-1.5707963267949
6.28318530717959
2.35619449019234
2.0943951023932
-0.98174770424681
2.24399475256414
1.0
1.04
1.047
1.0471
1.04719
EOD
set xrange[-10:10]
set yrange[:] reverse
set offset 0.25,0.25,0.25,0.25
set key noautotitle
dx = 0.0001
fPi(x) = (_x=x/pi, _p=sprintf("%g",x), _d=NaN, sum [_i=1:32] \
(_d!=_d && (abs(_x*_i - floor(_x*_i+dx)) < dx) ? \
(_n=floor(_x*_i+dx),_d=_i, \
_p=sprintf("%sπ%s",abs(_n)==1?_n<0?'-':'':sprintf("%d",_n),\
abs(_d)==1 ? '' : sprintf("/%d",_d)),0) : 0 ), _p)
plot $Data u (0):0:(fPi($1)) w labels font "Times New Roman, 16"
### end of script
I have [1] a workaround below that might be feasible, and [2] apparently what I was looking for below that (I am writing this in haste). I will mark the question "answered" anyway. To avoid reproducing theozh's script, I offer:
[1]:
add three lines to theozh's script - ideally, immediately before the 'plot' command:
set title sprintf ("Test: %g $\\sqrt{\\pi \\pi \\pi \\pi}$", pi)
set terminal tikz standalone
set output 'gnuplot_test.tex'
one can observe a little testing going on with nonsensical expressions of pi - it is just to see the vinculum extend (this is a hasty thing) - and the double escapes, which appear to have made it to Stack Overflow correctly.
change the 'plot' line to remove the Times Roman part, though this might not be necessary:
plot $Data u (0):0:(fPi($1)) w labels
importantly, edit gnuplot_test.tex so an \end{document} is on the last line.
run 'pdflatex gnuplot_test.tex'.
This should help move things along - it appears the best approach is to go into the LaTeX world for this - thanks. I tried cairolatex pdf and eps, but I was very confused by the LaTeX output. The tikz terminal works almost perfectly.
[2]: What I was looking for: put this below the fPi(x) expression in gnuplot:
set title sprintf ("Testing : \n wxt terminal : \
%g %s %s %s \n tikz output : $\\sqrt{\\pi \\pi \\pi \\pi}$", \
pi, fPi(myAngle01), fPi(myAngle02), fPi(myAngle03) )
# set terminal tikz standalone
# set output 'gnuplot_test.tex'
plot $Data u (0):0:(fPi($1)) w labels t sprintf ("{/Symbol p}= %g, %s, %s, %s, %s", \
pi, fPi(pi), fPi(myAngle01), fPi(myAngle02), fPi(myAngle03) )
... the wxt terminal displays the angles as fractions of pi. I didn't test the output in the LaTeX pipeline - remove it if undesired. I think the gnuplot script has to be written for the terminal or output desired - but at least the values can be computed instead of being typed in "manually".

Using awk to replace and add text

I have the following .txt file:
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
I would like to make two edits to this text. First, in the line:
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
I would like to replace "Number=." with "Number=G"
And immediately after the line:
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
I would like to add a new line of text (plus a line break):
##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score">
I was wondering if this could be done with one or two awk commands.
Thanks for any suggestions!
My solution is similar to @Daweo's. Consider this script, replace.awk:
/^##FORMAT/ { sub(/Number=\./, "Number=G") }
/##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">/ {
    print
    print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\"Quality score\">"
    next
}
1
Run it:
awk -f replace.awk file.txt
Notes
The first line is easy to understand: it is a straight replace.
The next group of lines deals with your second requirement. First, the print statement prints out the current line.
The next print statement prints out your new line of text.
The next command skips to the next line.
Finally, the pattern 1 tells awk to print every line.
I would do it using GNU AWK the following way. Let file.txt content be
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=.,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
then
awk '/##FORMAT=<ID=PL/{gsub("Number=\\.","Number=G")}/##INFO=<ID=AF/{print;print "##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\x22Quality score\x22>";next}{print}' file.txt
output
##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##Tassel=<ID=GenotypeTable,Version=5,Description="Reference allele is not known. The major allele was used as reference allele">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the reference and alternate alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=PL,Number=G,Type=Float,Description="Normalized, Phred-scaled likelihoods for AA,AB,BB genotypes where A=ref and B=alt; not applicable if site is not biallelic">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=QualityScore,Number=.,Type=Float,Description="Quality score">
##bcftools_viewVersion=1.12-57-g0c2765b+htslib-1.12-45-g1830551
##bcftools_viewCommand=view -h 20Perc.SNPs.mergedAll.vcf; Date=Tue Sep 28 09:46:59 2021
Explanation: If the current line contains ##FORMAT=<ID=PL, change Number=\\. to Number=G (note that the backslashes are required to get a literal . rather than . meaning any character). If the current line contains ##INFO=<ID=AF, print it, then print ##INFO=<ID=QualityScore,Number=.,Type=Float,Description=\x22Quality score\x22> (\x22 is the hex escape code for ", as " could not be used inside a "-delimited string), and go to the next line. The final print-ing is for all lines except those containing ##INFO=<ID=AF, as these have their own print-ing.
(tested in gawk 4.2.1)

Scaling the values to plot a graph using gnuplot

I have a text file in the format below. The first column represents a timestamp with a very high resolution. The second number represents the sequence number. I want to plot a graph between these two values, i.e. sequence number over time. For this purpose I want to scale the sequence number and the timestamp. The timestamp can be scaled by subtracting the first timestamp from the remaining timestamps. The sequence number should be scaled the same way; however, when scaled, the sequence number can have negative values. How do I write a bash script using awk to achieve this? This file is named print_1010171.txt. Please note that I have a number of files in the same format, so I want the script to be generic.
5698771509078629376 1133254688
5698771509371165696 1150031904
5698771510035551232 1150031904
5698771510036082688 4170258464
5698771510036715520 2895583264
5698771510037202176 1620908064
5698771510037665280 346232864
5698771510038193664 3366459424
5698771510332259072 2091784224
5698771510332816128 817109024
5698771510333344512 3837335584
5698771510339882240 2562660384
5698771510340411392 1287985184
5698771510340939776 13309984
5698771510348048896 3033536544
5698771510348577280 1758861344
5698771510349228800 484186144
5698771510632804864 3504412704
5698771510633441792 2229737504
5698771510634390272 955062304
5698771510638858496 3975288864
5698771510639347712 2700613664
5698771510642663168 1425938464
5698771510643387136 134486304
5698771510643808768 3154712864
5698771510648858368 1880037664
5698771510649410560 605362464
5698771510655600384 3625589024
5698771510656128768 2350913824
5698771510656657408 1076238624
Very similar to Dennis Williamson's solution -- this should be more efficient (but probably not something you'd ever notice), and it will also silently ignore blank lines (the other solution would give very large negative numbers for blank lines).
#script coolscript.gp
if(!exists("DATAFILE")) DATAFILE='test.txt'
EXT_INDEX=strstrt(DATAFILE,'.txt') #assume data has a .txt extension
set term post enh color
set output DATAFILE[:EXT_INDEX-1] . '.ps' #gnuplot string slicing and concatenation
plot "< awk 'BEGIN{getline; header_col1=$1; header_col2=$2 }{if(NF){print $1-header_col1,$2-header_col2}}' ".DATAFILE using 1:2
You can definitely do this using an all-gnuplot solution (see @andyras's nice solution and my answer that he linked to). This (alternate) solution works by reading the first line in awk and assigning the variables header_col1 and header_col2 the data in columns 1 and 2. It then subtracts those from the corresponding columns of the following rows, as long as the line isn't empty.
Note that this solution can be called from the commandline using:
gnuplot -e "DATAFILE='mydatafile.txt'" coolscript.gp
Unfortunately, the quotes are necessary since gnuplot needs them, meaning that if you're using this in a shell loop, you should definitely use the double quotes on the outside as I show.
for FILE in *.dat; do
gnuplot -e "DATAFILE='${FILE}'" coolscript.gp
done
awk 'NR == 1 {basets = $1; baseseq = $2} {print $1 - basets, $2 - baseseq}' inputfile
or, if you don't want to output the initial pair of zeros:
awk 'NR == 1 {basets = $1; baseseq = $2; next} {print $1 - basets, $2 - baseseq}' inputfile
Here is a bash wrapper script which should do what you want:
#!/bin/bash
gnuplot << EOF
set terminal png truecolor size 800,600
set output 'plot_$1.png'
# latch the first value as the offset, then return every value minus
# that offset (gnuplot's comma operator evaluates left to right)
firstx=0
offsetx=0
funcx(x)=(offsetx=(firstx==0)?x:offsetx,firstx=1,x-offsetx)
firsty=0
offsety=0
funcy(x)=(offsety=(firsty==0)?x:offsety,firsty=1,x-offsety)
plot '$1' u (funcx(\$1)):(funcy(\$2))
EOF
To use the script, give it the name of the file you want to plot as an argument:
$ myscript.sh print_1010171.txt
I modified the answer given here to accommodate two variables. See that answer also if you want to subtract the lowest value from all data rather than the first.
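If you go the subtract-the-minimum route (which also avoids the negative sequence numbers mentioned in the question), a two-pass awk sketch along these lines should work -- untested, with inputfile standing in for your data file, which is read twice:
awk 'NR==FNR { if (NF) { if (mnt=="" || $1<mnt) mnt=$1; if (mns=="" || $2<mns) mns=$2 }; next }
     NF { print $1-mnt, $2-mns }' inputfile inputfile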

High-res images from PDFs

I'm working on a project in which I need to extract a TIFF per page from multi-page PDFs. The PDFs contain images only and there is one image per page (I believe they were made on some kind of photocopier/scanner, but haven't confirmed this). The TIFFs are then used to create several other derivative versions of the document, so the higher the resolution the better.
I've found two recipes, both with helpful aspects, but neither is ideal. Hoping someone can help me tune one of them, or offer a third option.
Recipe 1, pdfimages and ImageMagick:
First do:
$ pdfimages $MY_PDF.pdf foo
which results in several .pbm files (named foo-000.pbm, foo-001.pbm, etc.).
Then for each *.pbm do:
$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif
Pro: The resultant TIFFs are a healthy 3300+ pixels on the long dimension (-resize just serves to normalize everything).
Con: The orientation of the pages is lost, and they come out rotated in different directions (they follow logical patterns, so probably they are in the orientation in which they were fed to the scanner?).
Recipe 2, ImageMagick solo:
convert +adjoin $MY_PDF.pdf pages.tif
This gives me a TIFF per page (pages-0.tif, pages-1.tif, etc.).
Pro: Orientation stays!
Con: The long dimension of the resultant file is < 800 px, which is too small to be useful, and it looks as though there is some compression applied.
How can I ditch the scaling of the image stream in the PDF, but retain the orientation? Is there some more magick in ImageMagick that I'm missing? Something else entirely?
Sorry for the noise on this old topic, but Google took me here as one of the top results and it might take others, so I thought I'd post the solution to the OP's question that I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick
In short: you have to tell ImageMagick at which density it should rasterize the PDF.
so convert -density 600x600 foo.pdf foo.png will tell ImageMagick to treat the PDF as if it had a 600dpi resolution and thus output much larger PNGs. In my case, the resulting foo.png was sized 5000x6600px. You can optionally add -resize 3000x3000 or whatever size you require and it will be scaled down.
Note that as long as you only have vector images or text in your PDF-files, density might be set as high as needed. If the PDF contains rasterized images, it won't look good if you set it higher than those images' dpi, surprise! :)
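If you're unsure what the embedded images' resolution actually is, poppler's pdfimages can report it so you can pick a density to match -- a sketch (the -list flag needs a reasonably recent poppler-utils, and 300 is just a stand-in value):
pdfimages -list foo.pdf   # the x-ppi/y-ppi columns show each image's effective DPI
convert -density 300 foo.pdf +adjoin -resize 3000x3000\> page-%d.tif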
Chris
I wanted to share my solution...it may not work for everyone, but since nothing else has come around maybe it will help someone else. I wound up going with the first option in my question, which was to use pdfimages to get large images that were rotated every which way. I then found a way to use OCR and word counts to guess at the orientation, which got me from (estimated) 25% rotated accurately to above 90%.
The flow is as follows:
Use pdfimages (apt-get install poppler-utils) to get a set of pbm files (not shown below).
For each file:
Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them as "north", "east", "south", and "west" in my code).
OCR each. The two with the lowest word count are likely the right-side up and upside down versions. This was over 99% accurate in my set of images processed to date.
From the two with the lowest word count, run the OCR output through a spell check. The file with the least spelling errors (i.e. most recognizable words) is likely to be correct. For my set this was about 93% (up from 25%) accurate based on a sample of 500.
YMMV. My files are bitonal and highly textual. The source images are an average of 3300 px on the long side. I can't speak to greyscale or color, or files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5×). I chose ocrad for speed and not accuracy, since I only need rough numbers and am throwing away the OCR. Re: performance, my nothing-special Linux desktop machine can run the whole script over about 2-3 files per second.
Here's the implementation in a simple bash script:
#!/bin/bash
# Rotates a pbm file in place.
# Pass a .pbm as the only arg.
file=$1
TMP="/tmp/rotation-calc"
mkdir $TMP
# Dependencies:
# convert: apt-get install imagemagick
# ocrad: sudo apt-get install ocrad
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
DIRNAME="/usr/bin/dirname"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"
# Make copies in all four orientations (the src file is north; copy it to make
# things less confusing)
file_name=$(basename $file)
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"
cp $file $north_file
$CONVERT -rotate 90 $file $east_file
$CONVERT -rotate 180 $file $south_file
$CONVERT -rotate 270 $file $west_file
# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"
$OCRAD -f -F utf8 $north_file -o $north_text
$OCRAD -f -F utf8 $east_file -o $east_text
$OCRAD -f -F utf8 $south_file -o $south_text
$OCRAD -f -F utf8 $west_file -o $west_text
# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
wc_table="$TMP/wc_table"
echo "$($WC -w $north_text) $north_file" > $wc_table
echo "$($WC -w $east_text) $east_file" >> $wc_table
echo "$($WC -w $south_text) $south_file" >> $wc_table
echo "$($WC -w $west_text) $west_file" >> $wc_table
# Take the bottom two; these are likely right side up and upside down, but
# generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n $wc_table | $HEAD -2 > $bottom_two_wc_table
# Spellcheck. The lowest number of misspelled words is most likely the
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read record; do
    txt=$(echo $record | $AWK '{ print $2 }')
    misspelled_word_count=$(cat $txt | $ASPELL -l en list | wc -w)
    echo "$misspelled_word_count $record" >> $misspelled_words_table
done < $bottom_two_wc_table
# Do the sort, overwrite the input file, save out the text
winner=$($SORT -n $misspelled_words_table | $HEAD -1)
rotated_file=$(echo $winner | $AWK '{ print $4 }')
mv $rotated_file $file
# Clean up.
if [ -d $TMP ]; then
    rm -r $TMP
fi

Nano hacks: most useful tiny programs you've coded or come across

It's the first great virtue of programmers. All of us have, at one time or another, automated a task with a bit of throw-away code. Sometimes it takes a couple of seconds tapping out a one-liner; sometimes we spend an exorbitant amount of time automating away a two-second task and then never use it again.
What tiny hack have you found useful enough to reuse? To go so far as to make an alias for?
Note: before answering, please check to make sure it's not already on favourite command-line tricks using BASH or perl/ruby one-liner questions.
I found this on dotfiles.org just today. It's very simple, but clever. I felt stupid for not having thought of it myself.
###
### Handy Extract Program
###
extract () {
    if [ -f $1 ] ; then
        case $1 in
            *.tar.bz2) tar xvjf $1 ;;
            *.tar.gz)  tar xvzf $1 ;;
            *.bz2)     bunzip2 $1 ;;
            *.rar)     unrar x $1 ;;
            *.gz)      gunzip $1 ;;
            *.tar)     tar xvf $1 ;;
            *.tbz2)    tar xvjf $1 ;;
            *.tgz)     tar xvzf $1 ;;
            *.zip)     unzip $1 ;;
            *.Z)       uncompress $1 ;;
            *.7z)      7z x $1 ;;
            *)         echo "'$1' cannot be extracted via >extract<" ;;
        esac
    else
        echo "'$1' is not a valid file"
    fi
}
Here's a filter that puts commas in the middle of any large numbers in standard input.
$ cat ~/bin/comma
#!/usr/bin/perl -p
s/(\d{4,})/commify($1)/ge;
sub commify {
    local $_ = shift;
    1 while s/^([ -+]?\d+)(\d{3})/$1,$2/;
    return $_;
}
I usually wind up using it for long output lists of big numbers, and I tire of counting decimal places. Now instead of seeing
-rw-r--r-- 1 alester alester 2244487404 Oct 6 15:38 listdetail.sql
I can run that as ls -l | comma and see
-rw-r--r-- 1 alester alester 2,244,487,404 Oct 6 15:38 listdetail.sql
This script saved my career!
Quite a few years ago, I was working remotely on a client database. I updated a shipment to change its status. But I forgot the where clause.
I'll never forget the feeling in the pit of my stomach when I saw (6834 rows affected). I basically spent the entire night going through event logs and figuring out the proper status on all those shipments. Crap!
So I wrote a script (originally in awk) that would start a transaction for any updates, and check the rows affected before committing. This prevented any surprises.
So now I never do updates from command line without going through a script like this. Here it is (now in Python):
import sys
import subprocess as sp

pgm = "isql"
if len(sys.argv) == 1:
    print "Usage: \nsql sql-string [rows-affected]"
    sys.exit()
sql_str = sys.argv[1].upper()
max_rows_affected = 3
if len(sys.argv) > 2:
    max_rows_affected = int(sys.argv[2])

if sql_str.startswith("UPDATE"):
    sql_str = "BEGIN TRANSACTION\\n" + sql_str
    p1 = sp.Popen([pgm, sql_str], stdout=sp.PIPE,
                  shell=True)
    (stdout, stderr) = p1.communicate()
    print stdout
    # example -> (33 rows affected)
    affected = stdout.splitlines()[-1]
    affected = affected.split()[0].lstrip('(')
    num_affected = int(affected)
    if num_affected > max_rows_affected:
        print "WARNING! ", num_affected, "rows were affected, rolling back..."
        sql_str = "ROLLBACK TRANSACTION"
        ret_code = sp.call([pgm, sql_str], shell=True)
    else:
        sql_str = "COMMIT TRANSACTION"
        ret_code = sp.call([pgm, sql_str], shell=True)
else:
    ret_code = sp.call([pgm, sql_str], shell=True)
I use this script under assorted Linuxes to check whether a directory copy between machines (or to CD/DVD) worked, or whether copying (e.g. ext3 utf8 filenames -> fuseblk) has mangled special characters in the filenames.
#!/bin/bash
## dsum Do checksums recursively over a directory.
## Typical usage: dsum <directory> > outfile
export LC_ALL=C # Optional - use sort order across different locales
if [ $# != 1 ]; then echo "Usage: ${0/*\//} <directory>" 1>&2; exit; fi
cd $1 1>&2 || exit
#findargs=-follow # Uncomment to follow symbolic links
find . $findargs -type f | sort | xargs -d'\n' cksum
Sorry, I don't have the exact code handy, but I coded a regular expression for searching source code in VS.Net that allowed me to search anything not in comments. It came in very useful in a particular project I was working on, where people insisted that commenting out code was good practice, in case you wanted to go back and see what the code used to do.
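The original regex is long gone, but a crude shell sketch of the same idea -- strip // and single-line /* */ comments, then search -- would be something like this (Foo.cs and myFunction are placeholders, and multi-line block comments aren't handled):
sed -e 's://.*$::' -e 's:/\*[^*]*\*/::g' Foo.cs | grep -n "myFunction"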
I have two Ruby scripts that I modify regularly to download all of various webcomics. Extremely handy! Note: they require wget, so probably Linux. Note 2: read these before you try them; they need a little bit of modification for each site.
Date based downloader:
#!/usr/bin/ruby -w
Day = 60 * 60 * 24
Fromat = "hjlsdahjsd/comics/st%Y%m%d.gif"
t = Time.local(2005, 2, 5)
MWF = [1,3,5]
until t == Time.local(2007, 7, 9)
  if MWF.include? t.wday
    `wget #{t.strftime(Fromat)}`
    sleep 3
  end
  t += Day
end
Or you can use the number based one:
#!/usr/bin/ruby -w
Fromat = "http://fdsafdsa/comics/%08d.gif"
1.upto(986) do |i|
  `wget #{sprintf(Fromat, i)}`
  sleep 1
end
Instead of having to repeatedly open files in SQL Query Analyser and run them, I found the syntax needed to make a batch file, and could then run 100 at once. Oh the sweet sweet joy! I've used this ever since.
isqlw -S servername -d dbname -E -i F:\blah\whatever.sql -o F:\results.txt
This goes back to my COBOL days but I had two generic COBOL programs, one batch and one online (mainframe folks will know what these are). They were shells of a program that could take any set of parameters and/or files and be run, batch or executed in an IMS test region. I had them set up so that depending on the parameters I could access files, databases(DB2 or IMS DB) and or just manipulate working storage or whatever.
It was great because I could test that date function without guessing or test why there was truncation or why there was a database ABEND. The programs grew in size as time went on to include all sorts of tests and become a staple of the development group. Everyone knew where the code resided and included them in their unit testing as well. Those programs got so large (most of the code were commented out tests) and it was all contributed by people through the years. They saved so much time and settled so many disagreements!
I coded a Perl script to map dependencies, without going into an endless loop, for a legacy C program I inherited ... that also had a diamond dependency problem.
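The Perl itself is not shown here, but the endless-loop avoidance is essentially a depth-first walk with a visited set, so a diamond (B and C both depend on D) is expanded once and a cycle can't recurse forever. A minimal awk sketch of the idea, assuming "parent child" pairs in a hypothetical deps.txt and main.c as the root:
awk '
function walk(node,    kids, n, i) {
    if (node in seen) return          # visited set: diamonds and cycles expand once
    seen[node] = 1
    print node
    n = split(dep[node], kids, " ")
    for (i = 1; i <= n; i++) walk(kids[i])
}
{ dep[$1] = dep[$1] " " $2 }          # accumulate children per parent
END { walk("main.c") }' deps.txt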
I wrote a small program that e-mailed me when I received e-mails from friends on a rarely used e-mail account.
I wrote another small program that sent me text messages if my home IP changes.
To name a few.
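For flavor, the IP-change watcher boils down to something like this, run from cron -- a sketch, not the original; ifconfig.me, the mail command, and the address are stand-ins for whatever you actually use:
#!/bin/bash
current=$(curl -s ifconfig.me)   # ask an external service for the public IP
last=$(cat ~/.last_ip 2>/dev/null)
if [ -n "$current" ] && [ "$current" != "$last" ]; then
    echo "Home IP changed: ${last:-none} -> $current" | mail -s "IP change" me@example.com
    echo "$current" > ~/.last_ip
fi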
Years ago I built a suite of applications on a custom web application platform in Perl.
One cool feature was to convert SQL query strings into human readable sentences that described what the results were.
The code was relatively short but the end effect was nice.
I've got a little app that you run and it dumps a GUID into the clipboard. You can run it /noui or not. With UI, it's a single button that drops a new GUID every time you click it. Without, it drops a new one and then exits.
I mostly use it from within VS. I have it as an external app and mapped to a shortcut. I'm writing an app that relies heavily on XAML and GUIDs, so I always find I need to paste a new GUID into XAML...
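On Linux the equivalent nano-hack is a one-liner, assuming uuidgen and xclip are installed:
uuidgen | tr -d '\n' | xclip -selection clipboard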
Any time I write a clever list comprehension or use of map/reduce in python. There was one like this:
if reduce(lambda x, c: x and locks[c], locknames, True):
    print "Sub-threads terminated!"
The reason I remember that is that I came up with it myself, then saw the exact same code on somebody else's website. Nowadays it'd probably be done like:
if all(map(lambda z: locks[z], locknames)):
    print "ya trik"
I've got 20 or 30 of these things lying around because once I coded up the framework for my standard console app in windows I can pretty much drop in any logic I want, so I got a lot of these little things that solve specific problems.
I guess the ones I'm using a lot right now are a console app that takes stdin and colorizes the output based on XML profiles that match regular expressions to colors, which I use for watching my log files from builds, and a command line launcher that keeps me from polluting my PATH env var (which would exceed the limit on some systems anyway, namely win2k).
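The colorizer idea in miniature -- a sketch with sed mapping one regex to one ANSI color, where the real tool reads many such pairs from XML and build.log is a placeholder:
tail -f build.log | sed -e $'s/ERROR/\e[31m&\e[0m/g' -e $'s/WARNING/\e[33m&\e[0m/g'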
I'm constantly connecting to various linux servers from my own desktop throughout my workday, so I created a few aliases that will launch an xterm on those machines and set the title, background color, and other tweaks:
alias x="xterm" # local
alias xd="ssh -Xf me@development_host xterm -bg aliceblue -ls -sb -bc -geometry 100x30 -title Development"
alias xp="ssh -Xf me@production_host xterm -bg thistle1 ..."
I have a bunch of servers I frequently connect to, as well, but they're all on my local network. This Ruby script prints out the command to create aliases for any machine with ssh open:
#!/usr/bin/env ruby
require 'rubygems'
require 'dnssd'
handle = DNSSD.browse('_ssh._tcp') do |reply|
  print "alias #{reply.name}='ssh #{reply.name}.#{reply.domain}';"
end
sleep 1
handle.stop
Use it like this in your .bash_profile:
eval `ruby ~/.alias_shares`