Extracting Data from an Area file - jython

I am trying to extract information at a specific location (lat, lon) from different satellite images. These images were given to me in the AREA format, and I cooked up a simple jython script to extract temperature values.
The script works; here is a small snippet from it that prints out the data value at a point.
from edu.wisc.ssec.mcidas import AreaFile as af
url="adde://localhost/imagedata?&PORT=8113&COMPRESS=gzip&USER=idv&PROJ=0& VERSION=1&DEBUG=false&TRACE=0&GROUP=FL&DESCRIPTOR=8712C574&BAND=2&LATLON=29.7276 -85.0274 E&PLACE=ULEFT&SIZE=1 1&UNIT=TEMP&MAG=1 1&SPAC=4&NAV=X&AUX=YES&DOC=X&DAY=2012002 2012002&TIME=&POS=0&TRACK=0"
a=af(url);
value=a.getData();
print value
array([[I, [array([I, [array('i', [2826, 2833, 2841, 2853])])])
So what does this mean?
Please excuse me if the question seems trivial; while I am comfortable with Python, I am really new to dealing with scientific data.
Note
Here is a link to the entire script.

After asking around, I found out that the Area object returns data in multiples of four. So the very first value is what I am looking for.
Grabbing the value is as simple as:
ar[0][0][0]
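For context, here is the lookup end to end - a minimal sketch assuming the same url as above, and that getData() nests the values as [band][line][element]:
from edu.wisc.ssec.mcidas import AreaFile as af
a = af(url)
data = a.getData()
# With SIZE=1 1 and a single band requested, the point value is the
# first element of the innermost array
value = data[0][0][0]
print value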

Related

Reading Fortran binary file in Python

I'm having trouble reading an unformatted F77 binary file in Python.
I've tried the scipy.io.FortranFile method and the numpy.fromfile method, both to no avail. I have also read the file in IDL, which works, so I have a benchmark for what the data should look like. I'm hoping that someone can point out a silly mistake on my part -- there's nothing better than having an idiot moment and then washing your hands of it...
The data, bcube1, has dimensions 101x101x101x3 and is of r*8 type. There are 3090903 entries in total. They are written using the following statement (not my code, copied from source).
open (unit=21, file=bendnm, status='new'
. ,form='unformatted')
write (21) bcube1
close (unit=21)
I can successfully read it in IDL using the following (also not my code, copied from a colleague):
bcube=dblarr(101,101,101,3)
openr,lun,'bcube.0000000',/get_lun,/f77_unformatted,/swap_if_little_endian
readu,lun,bcube
free_lun,lun
The returned data (bcube) is double precision, with dimensions 101x101x101x3, so the header information for the file is aware of its dimensions (not flattened).
Now I try to get the same effect using Python, but no luck. I've tried the following methods.
In [30]: f = scipy.io.FortranFile('bcube.0000000', header_dtype='uint32')
In [31]: b = f.read_record(dtype='float64')
which returns the error Size obtained (3092150529) is not a multiple of the dtypes given (8). Changing the dtype changes the size obtained, but it remains indivisible by 8.
Alternately, using fromfile results in no errors but returns one more value than is in the array (a footer perhaps?), and the individual array values are wildly wrong (they should all be of order unity).
In [38]: f = np.fromfile('bcube.0000000')
In [39]: f.shape
Out[39]: (3090904,)
In [42]: f
Out[42]: array([ -3.09179121e-030, 4.97284231e-020, -1.06514594e+299, ...,
8.97359707e-029, 6.79921640e-316, -1.79102266e-037])
I've tried using byteswap to see if this makes the floating point values more reasonable but it does not.
It seems to me that the np.fromfile method is very close to working but there must be something wrong with the way it's reading the header information. Can anyone suggest how I can figure out what should be in the header file that allows IDL to know about the array dimensions and datatype? Is there a way to pass header information to fromfile so that it knows how to treat the leading entry?
I played a bit around with it, and I think I have an idea.
How Fortran stores unformatted data is not standardized, so you have to play around with it a bit, but you need three pieces of information:
1. The format of the data. You suggest it is 64-bit reals, or 'f8' in Python.
2. The type of the header. That is an unsigned integer, but you need the length in bytes. If unsure, try 4. The header usually stores the length of the record in bytes and is repeated at the end. Then again, it is not standardized, so no guarantees.
3. The endianness, little or big. Technically this applies to both header and values, but I assume they're the same. Python defaults to little endian, so if that were the correct setting for your data, I think you would have already solved it.
When you open the file with scipy.io.FortranFile, you need to give the data type of the header. So if the data is stored big-endian, and you have a 4-byte unsigned integer header, you need this:
from scipy.io import FortranFile
ff = FortranFile('data.dat', 'r', '>u4')
When you read the data, you need the data type of the values. Again, assuming big-endian, you want type >f8:
vals = ff.read_reals('>f8')
Look here for a description of the syntax of the data type.
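Putting those pieces together for the file in the question - a sketch assuming big-endian data with 4-byte record markers:
import numpy as np
from scipy.io import FortranFile
# '>u4' = big-endian 4-byte unsigned header, '>f8' = big-endian doubles
ff = FortranFile('bcube.0000000', 'r', '>u4')
vals = ff.read_reals('>f8')  # one record of 3090903 doubles
ff.close()
# Fortran writes arrays column-major, so reshape with order='F'
# to recover the 101x101x101x3 layout that IDL reports
bcube = vals.reshape((101, 101, 101, 3), order='F')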
If you have control over the program that writes the data, I strongly suggest you write them as data streams, which can be more easily read by Python.
Fortran has record demarcations which are poorly documented, even in binary files.
So every write to an unformatted file:
integer*4 Test1
real*4 Matrix(3,3)
open(78, form='unformatted')
write(78) Test1
write(78) Matrix
close(78)
Should ultimately be padded on each side by an np.int32 value. (I've seen references that this tells you the record length, but haven't verified it personally.)
The above could be read in Python via numpy as:
import numpy as np
input_file = open(file_location, 'rb')
# P1..P4 are the 4-byte record markers bracketing each write statement
datum = np.dtype([('P1', np.int32), ('Test1', np.int32), ('P2', np.int32),
                  ('P3', np.int32), ('MatrixT', (np.float32, (3, 3))),
                  ('P4', np.int32)])
data = np.fromfile(input_file, datum)
This should fully populate the data array with the individual data sets in the format above. Do note that numpy expects data to be packed in C format (row-major) while Fortran data is column-major. For square matrix shapes like the one above, this means getting the data out of the matrix requires a transpose before use. For non-square matrices, you will need to reshape and transpose:
Matrix = np.transpose(data[0]['MatrixT'])
Transposing your 4-D data structure is going to need to be done carefully. You might look into SciPy for automated ways to do so; the SciPy package seems to have Fortran related utilities which I have not fully explored.
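As a small illustration of that reshape-and-transpose point, using a made-up 2x5 array rather than the asker's data:
import numpy as np
# A real*4 A(2,5) written by Fortran arrives column-major. Either
# reshape with order='F', or reshape to the reversed shape and
# transpose; both recover the original layout
flat = np.arange(10, dtype=np.float32)  # stand-in for the raw record data
a1 = flat.reshape((2, 5), order='F')
a2 = flat.reshape((5, 2)).T
assert (a1 == a2).all()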

GeoTIFF software generation

I have some data about the height of a set of points (for example, from the Google Elevation API). I need to save this data in GeoTIFF format and then use it in osgEarth (GDAL). How can this be done? It does not matter in what language.
A quick search on the Internet only gave me the answer to the reverse question (How do I open geotiff images with gdal in python?)
I would be very grateful for any help.
So I would do this with GDAL from Python (you could also use rasterio, which is a nice wrapper around GDAL for raster file handling).
You should put your data in a numpy array; let us call it some_nparray.
Then create the tif dataset with gtiffDriver.Create(). Here you can provide the name of your file, the dimensions in number of columns and rows of your image, the number of bands (here 1), and the datatype. Here I said float32; however byte, int16 etc. could also work, depending on your data (you can check it with height_data_array.dtype).
Next you should set the geotransform, which is the information about the corner coordinates and pixel resolution, and you should set the projection you are using. This is done with dataset.SetGeoTransform and dataset.SetProjection. How these are created is not in the scope of this question, I believe. If you do not need them, I guess you can even skip that part.
Finally, write your array to the file with WriteArray and close the file.
Your code should look something like this. Here I use the convention that variables prefixed with some_ should be provided by you.
from osgeo import gdal
height_data_array = some_nparray
gtiffDriver = gdal.GetDriverByName('GTiff')
dataset = gtiffDriver.Create('result.tif',
                             height_data_array.shape[1],  # columns
                             height_data_array.shape[0],  # rows
                             1,                           # number of bands
                             gdal.GDT_Float32)
dataset.SetGeoTransform(some_geotrans)
dataset.SetProjection(some_projection)
dataset.GetRasterBand(1).WriteArray(height_data_array)
dataset = None  # closes the file and flushes it to disk
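If you do need the georeferencing, here is a minimal sketch of what some_geotrans and some_projection could look like; the corner coordinates and pixel size are made-up placeholders:
from osgeo import osr
# (top-left x, pixel width, 0, top-left y, 0, negative pixel height)
some_geotrans = (-85.0, 0.001, 0.0, 30.0, 0.0, -0.001)
srs = osr.SpatialReference()
srs.ImportFromEPSG(4326)  # WGS84 lat/lon
some_projection = srs.ExportToWkt()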

Use xls, csv or other type of file to make a float array in Objective-C

Hi all, I'm *very* new at the whole programming thing, but I really like it. Sorry if I do not have enough details.
So pretty much, I have Excel files that have columns with numbers (I can't post a picture because I don't have 10 reputation yet).
I've been searching for a couple of days and I haven't really found an answer to this. I was wondering if there is a way that I could either create multiple arrays from the file - I know we can't make matrices like MATLAB - that have the numbers in each column, i.e.
float numbers[] = {1.3, 1.2, 4.2};
Or create the file with Numbers (the iWork version of Excel), import it into the Xcode project, and from there create the arrays.
The issue I have is that there are around a thousand numbers, so copying them one by one is extremely time-consuming.
Sorry if this is confusing; please let me know if there's anything else I should add as information.
There are two easy ways to do this, even if you are a beginner.
Open MS Excel, input your values, and save the Excel document as a .CSV file.
Then grab an Objective-C CSV converter like this one here on GitHub and you are done.
The second way: you could declare an array of numbers (for example, 'float matrix[5][5];') and use it however you see fit.
I've done both of my suggestions in separate projects and both work very well. I used the first method for a 15-page Excel document that I needed to use in my app; the second method I used in another app where I needed to constantly change the contents of the 2D array.
Once you declare the 'float matrix[5][5]' you have a 5x5 matrix (a.k.a. table) that you can use however you want. You could have the first index be the column and the second index be the row.
You mean generate Objective-C source from a .csv file? One way to do this would be to use a scripting language like Perl:
#!/usr/bin/perl
use warnings;
use strict;

my @numbers = ();
while (<>) {
    chomp;
    push @numbers, $_;
}
my $numbers = join(', ', @numbers);
print qq(float numbers[] = {$numbers};\n);
This assumes that your .csv file (say foo.csv) has numbers in the first column and nothing else. If the file contains:
1.5
2.5
18842984
-4
And you pipe it to the script (foo.pl in this example):
cat foo.csv | perl foo.pl
It will output this:
float numbers[] = {1.5, 2.5, 18842984, -4};
Is that what you're trying to do?

how do you flatten and unflatten an array of doubles in labview?

I have created a simple LabVIEW program, shown below, that attempts to flatten an array [1,0,3] and then unflatten it and print out the contents.
However, I am unsuccessful in doing so. What am I doing wrong?
What am I doing wrong?
You're not going through the tutorials, or you're not reading the context help for the Unflatten function (Ctrl+H), or you're not reading the full help for the function (right click >> Help), or you're not looking at the examples (from the help or Help >> Find Examples). Take your pick (preferably all four).
If you want an actual answer, it is that LV is strictly typed, and therefore you need to tell the Unflatten function which data type you want it to output (a 1D DBL array), and you're not doing that. But the real answer is what's in the previous paragraph - you should use those tools to learn how to find such an answer yourself.
The string returned by Flatten to String only contains the data, not the description of what data type was passed in, so in order to unflatten it again you need to tell Unflatten from String what type it was. You do this by wiring some data of the appropriate type (any data - if it's an array it can be an empty one) to the Type terminal.
I don't think this is immediately obvious from the LabVIEW 2012 help but I think it's fairly clear if you follow the link from the Unflatten from String help page to one of the examples. The Read Flattened Data.vi example has an array wired to the Type input.

How would you get count of a given word in a given PDF?

Interview Question
I have been asked this question in an interview, and the answer doesn't have to be programming-language-, platform-, or tool-specific.
The question was phrased as follows:
How would you get the instance count of a given word in a PDF? The answer doesn't have to be programming-, platform-, or tool-specific. Just let me know how you would do it in a memory- and speed-efficient way.
I am posting this question for following reasons:
To better understand the context - I still fail to understand the context of this question. What might the interviewer be looking for by asking it?
To get diverse opinions - I tend to answer such questions based on my skills in a programming language (C#), but there might be other valid options to get this done.
Thanks for your interest.
If I had to write a program to do it, I'd find a PDF rendering library capable of extracting text from PDF files, such as Xpdf, and then count the words.
If this was a one-off task, or something that needed to be automated for a non-production-quality task, I'd just feed the file into the pdftotext program and then parse the output file with Python, splitting it into words, putting them in a dictionary, and counting the number of occurrences.
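That one-off approach could look something like this in Python ('input.pdf' and the search word are placeholders; pdftotext writes to stdout when '-' is given as the output file):
import subprocess
from collections import Counter
text = subprocess.run(['pdftotext', 'input.pdf', '-'],
                      capture_output=True, text=True).stdout
# Split into words and count every occurrence in one pass
counts = Counter(text.lower().split())
print(counts['example'])  # count of the word "example"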
If I were asking this interview question, I'd be looking for a couple of things:
- understanding the difference between the settings for this task: a one-off script thingy vs. production code
- not attempting to implement a PDF renderer yourself, and trying to find a library instead
Now I wouldn't expect this from any random candidate with no PDF experience, but you can have a very meaningful discussion about what PDF is and what a "word" is. You see, PDF stores text as a bunch of strings with coordinates. Each string is not necessarily a word. Oftentimes the words will be split into a couple of completely separate strings which are absolutely positioned in the document to make a single word. This is why searching for words in a PDF document sometimes gives strange-looking results. So to implement word searching in a document you'd have to glue these strings back together (pdftotext takes care of that for you).
It's not a bad question at all.
You can use a trie; it makes it very easy to get the count of a given word.
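Since that suggestion is terse, here is a minimal sketch of a counting trie in Python, assuming the words have already been extracted from the PDF (e.g. with pdftotext as above):
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.count = 0      # number of words ending at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def lookup(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.count

t = Trie()
for w in ['the', 'cat', 'the']:
    t.insert(w)
print(t.lookup('the'))  # prints 2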
I would suggest an open-source solution using Java. First you would have to parse the PDF file and extract all the text using Tika.
Then I believe the correct question is how to find the TF (term frequency) of a word in a text. I will not trouble you with definitions, because you can achieve this simply by scanning the extracted text and counting the frequency of each word.
Sample code would look like this:
Map<String, Integer> listOfWords = new HashMap<>();
Scanner scan = new Scanner(extractedText); // the text extracted by Tika

while (scan.hasNext())
{
    String word = scan.next();
    if (!listOfWords.containsKey(word))
    {
        listOfWords.put(word, 1); // first occurrence of this word
    }
    else
    {
        // put() overwrites the existing entry, so there is no need to
        // remove the key before updating the count
        listOfWords.put(word, listOfWords.get(word) + 1);
    }
}