I'm starting to study the FITS format and I'm in the process of reading the Definition of FITS document.
I know that a FITS file can have one or more HDUs, the primary being the first one and the extensions being the ones that follow (if there is more than one HDU). I also know that every extension has a mandatory header keyword (XTENSION) that tells us whether the Data Unit is an Image, Binary Table, or ASCII Table. But how can I tell what the data type (Image, Binary Table, or ASCII Table) of the first HDU is?
I don't understand why XTENSION isn't a mandatory keyword in the primary header.
The "type" of the PRIMARY HDU is essentially IMAGE in most cases. From v3.0 of the standard:
3.3.2. Primary data array
The primary data array, if present, shall consist of a single data array with from 1 to 999 dimensions (as specified by the NAXIS keyword defined in Sect. 4.4.1). The random groups convention in the primary data array is a more complicated structure and is discussed separately in Sect. 6. The entire array of data values are represented by a continuous stream of bits starting with the first bit of the first data block. Each data value shall consist of a fixed number of bits that is determined by the value of the BITPIX keyword (Sect. 4.4.1). Arrays of more than one dimension shall consist of a sequence such that the index along axis 1 varies most rapidly, that along axis 2 next most rapidly, and those along subsequent axes progressively less rapidly, with that along axis m, where m is the value of NAXIS, varying least rapidly. There is no space or any other special character between the last value on a row or plane and the first value on the next row or plane of a multi-dimensional array. Except for the location of the first element, the array structure is independent of the FITS block structure. This storage order is shown schematically in Fig. 1 and is the same order as in multi-dimensional arrays in the Fortran programming language (ISO 2004). The index count along each axis shall begin with 1 and increment by 1 up to the value of the NAXISn keyword (Sect. 4.4.1).

If the data array does not fill the final data block, the remainder of the data block shall be filled by setting all bits to zero. The individual data values shall be stored in big-endian byte order such that the byte containing the most significant bits of the value appears first in the FITS file, followed by the remaining bytes, if any, in decreasing order of significance.
Though it isn't until later (in Section 7.1) that the standard makes this connection explicit:
7.1. Image extension
The FITS image extension is nearly identical in structure to the primary HDU and is used to store an array of data. Multiple image extensions can be used to store any number of arrays in a single FITS file. The first keyword in an image extension shall be XTENSION= 'IMAGE '.
It isn't immediately apparent what it means by "nearly identical" here. I guess the only difference is that the PRIMARY HDU may also have the aforementioned "random groups" structure, whereas for IMAGE extension HDUs PCOUNT is always 0 and GCOUNT is always 1.
You'll only rarely see the "random groups" convention. This is sort of a precursor to the BINTABLE format. It was used traditionally in radio interferometry data, but hardly at all outside that.
The reason for all this is backwards compatibility with older versions of FITS that predate even the existence of extension HDUs. Many FITS-based formats don't put any data in the PRIMARY HDU and use the primary header only for metadata keywords that pertain to the entire file (e.g. most HST data).
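If you just want to see how this plays out in practice, here's a quick sketch using astropy to inspect the HDUs of a file (the file name is a placeholder):

    from astropy.io import fits

    # Open a FITS file and list its HDUs; the first is always a PrimaryHDU,
    # extensions show up as ImageHDU, BinTableHDU, or TableHDU per XTENSION.
    with fits.open("example.fits") as hdulist:
        hdulist.info()
        primary = hdulist[0]
        # The primary header has no XTENSION card; it starts with SIMPLE = T,
        # and BITPIX/NAXIS describe its (possibly empty, NAXIS = 0) data array.
        print(primary.header["SIMPLE"], primary.header["BITPIX"], primary.header["NAXIS"])

In other words, a reader simply treats the first HDU as an image-like array (possibly with no data at all) and uses XTENSION only for the HDUs that follow.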
I have a set of points and I am building a CGAL::Delaunay_triangulation_2 with them. However, the order of the points in the resulting triangulation is not the same as in the input. For example, if input point 0 is at (-1,-1), output point 0 in the triangulation is not at the same position. The point at (-1,-1) is some other vertex, not necessarily the 0th.
For me, it is important to keep the order, as I am taking some references (as indices) to the original set of points, so I need the vertex number i in the input set and in the output set to be the same one.
Is there any way to make the output set be ordered the same as the input set? I don't care if I need to reorder the input set, as I can easily do that before taking the references.
As documented here: "Note that this function is not guaranteed to insert the points following the order of PointInputIterator, as spatial_sort() is used to improve efficiency."
If you insert your points one by one, then they will be in insertion order (provided there are no duplicates).
See also this example, which can be used to store the input id as the info() of each vertex (a vector can then be created to give direct access from id -> vertex).
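If restructuring the insertion isn't convenient, one library-agnostic fallback is to map each output vertex back to its input index by its coordinates (a Python sketch of the idea only, with made-up points; it assumes the triangulation stores the input coordinates unchanged, so the lookup is exact; the one-by-one insertion or info() approaches above are preferable in CGAL):

    # Map coordinates back to the original input index (hypothetical points).
    points = [(-1.0, -1.0), (0.0, 0.0), (1.0, 0.5), (0.0, 1.0)]
    index_of = {p: i for i, p in enumerate(points)}

    # After triangulating, look up each output vertex's original index:
    def input_index(vertex_xy):
        return index_of[vertex_xy]

    print(input_index((-1.0, -1.0)))   # 0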
I was able to find a few, but I was wondering: are there more algorithms that are based on data encoding/modification instead of complete encryption of the data? Examples that I found:
Steganography. The method is based on hiding a message within a message;
Tokenization. Data is mapped in the tokenization server to a random token that represents the real data outside of the server;
Data perturbation. As far as I know it works mostly with databases. It adds noise to the sensitive records yet still allows reading general and public fields, like the sum of the records on a specific day.
Are there any other methods like this?
If your purpose is to publish this data, there are other methods similar to data perturbation, collectively called Data Anonymization [source]:
Data masking: hiding data with altered values. You can create a mirror version of a database and apply modification techniques such as character shuffling, encryption, and word or character substitution. For example, you can replace a value character with a symbol such as "*" or "x". Data masking makes reverse engineering or detection impossible.

Pseudonymization: a data management and de-identification method that replaces private identifiers with fake identifiers or pseudonyms, for example replacing the identifier "John Smith" with "Mark Spencer". Pseudonymization preserves statistical accuracy and data integrity, allowing the modified data to be used for training, development, testing, and analytics while protecting data privacy.

Generalization: deliberately removes some of the data to make it less identifiable. Data can be modified into a set of ranges or a broad area with appropriate boundaries. You can remove the house number in an address, but make sure you don't remove the road name. The purpose is to eliminate some of the identifiers while retaining a measure of data accuracy.

Data swapping: also known as shuffling and permutation, a technique used to rearrange the dataset attribute values so they don't correspond with the original records. Swapping attributes (columns) that contain identifier values such as date of birth, for example, may have more impact on anonymization than membership type values.

Data perturbation: modifies the original dataset slightly by applying techniques that round numbers and add random noise. The range of values needs to be in proportion to the perturbation. A small base may lead to weak anonymization while a large base can reduce the utility of the dataset. For example, you can use a base of 5 for rounding values like age or house number because it's proportional to the original value. You can multiply a house number by 15 and the value may retain its credence. However, using higher bases like 15 can make the age values seem fake.

Synthetic data: algorithmically manufactured information that has no connection to real events. Synthetic data is used to create artificial datasets instead of altering the original dataset or using it as is and risking privacy and security. The process involves creating statistical models based on patterns found in the original dataset. You can use standard deviations, medians, linear regression or other statistical techniques to generate the synthetic data.
Is this what you are looking for?
EDIT: added link to the source and quotation.
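To make a couple of these concrete, here is a toy Python sketch (illustrative only; the record fields, pseudonym scheme, and noise parameters are all made up):

    import hashlib
    import random

    record = {"name": "John Smith", "card": "4111111111111111", "age": 37}

    # Data masking: keep only the last four digits of the card number.
    masked_card = "*" * 12 + record["card"][-4:]

    # Pseudonymization: replace the identifier with a stable pseudonym
    # (a truncated hash here; a real system would use a keyed or random mapping).
    pseudonym = "user-" + hashlib.sha256(record["name"].encode()).hexdigest()[:8]

    # Data perturbation: round the age to a base of 5 and add a little noise.
    perturbed_age = 5 * round(record["age"] / 5) + random.choice([-1, 0, 1])

    print(masked_card, pseudonym, perturbed_age)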
Redis has a SCAN command that may be used to iterate over keys matching a pattern, etc.
Redis SCAN doc
You start by giving a cursor value of 0; each call returns a new cursor value, which you pass into the next SCAN call. A returned cursor of 0 indicates the iteration is finished. Supposedly no server or client state is needed (except for the cursor value).
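For example, with the redis-py client the loop looks roughly like this (just a sketch; the host, port, and key pattern are assumptions):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    cursor = 0
    while True:
        # Each call returns the next cursor plus a batch of matching keys.
        cursor, keys = r.scan(cursor=cursor, match="user:*", count=100)
        for key in keys:
            print(key)
        if cursor == 0:   # a returned cursor of 0 means the scan is complete
            break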
I'm wondering how Redis implements the scanning algorithm-wise?
You may find the answer in the Redis dict.c source file; I will quote the relevant part of it below.
Iterating works the following way:
1) Initially you call the function using a cursor (v) value of 0.
2) The function performs one step of the iteration, and returns the new cursor value you must use in the next call.
3) When the returned cursor is 0, the iteration is complete.
The function guarantees all elements present in the dictionary get returned between the start and end of the iteration. However it is possible some elements get returned multiple times. For every element returned, the callback argument 'fn' is called with 'privdata' as first argument and the dictionary entry 'de' as second argument.
How it works
The iteration algorithm was designed by Pieter Noordhuis. The main idea is to increment a cursor starting from the higher order bits. That is, instead of incrementing the cursor normally, the bits of the cursor are reversed, then the cursor is incremented, and finally the bits are reversed again.
This strategy is needed because the hash table may be resized between iteration calls. dict.c hash tables are always a power of two in size, and they use chaining, so the position of an element in a given table is computed as the bitwise AND between Hash(key) and SIZE-1 (SIZE-1 is always the mask, which is equivalent to taking the remainder of dividing the hash of the key by SIZE).
For example if the current hash table size is 16, the mask is (in binary) 1111. The position of a key in the hash table will always be the last four bits of the hash output, and so forth.
What happens if the table changes in size?
If the hash table grows, elements can go anywhere in one multiple of the old bucket: for example let's say we already iterated with a 4 bit cursor 1100 (the mask is 1111 because hash table size = 16).
If the hash table will be resized to 64 elements, then the new mask will be 111111. The new buckets you obtain by substituting in ??1100 with either 0 or 1 can be targeted only by keys we already visited when scanning the bucket 1100 in the smaller hash table.
By iterating the higher bits first, because of the inverted counter, the cursor does not need to restart if the table size gets bigger. It will continue iterating using cursors without '1100' at the end, and also without any other combination of the final 4 bits already explored.
Similarly when the table size shrinks over time, for example going from 16 to 8, if a combination of the lower three bits (the mask for size 8 is 111) were already completely explored, it would not be visited again because we are sure we tried, for example, both 0111 and 1111 (all the variations of the higher bit) so we don't need to test it again.
Wait... You have TWO tables during rehashing!
Yes, this is true, but we always iterate the smaller table first, then we test all the expansions of the current cursor into the larger table. For example if the current cursor is 101 and we also have a larger table of size 16, we also test (0)101 and (1)101 inside the larger table. This reduces the problem back to having only one table, where the larger one, if it exists, is just an expansion of the smaller one.
Limitations
This iterator is completely stateless, and this is a huge advantage, including no additional memory used.
The disadvantages resulting from this design are:
It is possible we return elements more than once. However this is usually easy to deal with in the application level.
The iterator must return multiple elements per call, as it needs to always return all the keys chained in a given bucket, and all the expansions, so we are sure we don't miss keys moving during rehashing.
The reverse cursor is somewhat hard to understand at first, but this comment is supposed to help.
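To see the reverse-binary increment in isolation, here is a small Python sketch of the cursor advance (not the actual C code from dict.c, just the idea for a single table whose size is a power of two):

    def reverse_bits(value, bits):
        """Reverse the low `bits` bits of `value`."""
        result = 0
        for _ in range(bits):
            result = (result << 1) | (value & 1)
            value >>= 1
        return result

    def scan_step(cursor, table_size):
        """Advance the cursor to the next bucket in reversed-bit order."""
        bits = table_size.bit_length() - 1        # number of index bits
        rev = reverse_bits(cursor, bits)          # reverse the cursor bits,
        rev = (rev + 1) & (table_size - 1)        # increment and wrap,
        return reverse_bits(rev, bits)            # then reverse back

    # Order in which the buckets of a size-8 table are visited:
    cursor, order = 0, []
    while True:
        order.append(cursor)
        cursor = scan_step(cursor, 8)
        if cursor == 0:
            break
    print(order)   # [0, 4, 2, 6, 1, 5, 3, 7]

Because the cursor is incremented from the high-order bits down, any cursor already handed out remains a valid place to resume even if the table has grown or shrunk to a different power of two in the meantime.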
I'd like to use the Levenshtein algorithm to compare two files in VB.NET. I know I can use an MD5 hash to determine whether they're different, but I want to know HOW MUCH different the two files are. The files I'm working with are both around 250 megs. I've experimented with different ways of doing this and I've realized I really can't load both files into memory (all kinds of string-related issues), so I figured I'd just stream the bytes I need as I go. Fine. But the implementations of the Levenshtein algorithm that I've found all dimension a matrix that's length1 * length2 in size, which in this case is impossible to work with. I've heard there's a way to do this with just two vectors instead of the whole matrix.
How can I compute Levenshtein distance of two large files without declaring a matrix that's the product of their file sizes?
Note that the values in each row of the Levenshtein matrix depend only on the values in the row above it. This means that you only need two one-dimensional arrays: one contains the values of the current row; the other is populated with the new values that you can compute from the current row. Then, you swap their roles (the "new" row becomes the "current" row and vice versa) and continue.
Note that this approach only lets you compute the Levenshtein distance (which seems to be what you want); it cannot tell you which operations must be done in order to transform one string into the other. There exists a very clever modification of the algorithm that lets you reconstruct the edit operations without using n*m memory, but I've forgotten how it works.
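Here is a sketch of the two-row version (in Python rather than VB.NET, but it translates directly); note that even with only two rows of memory, the quadratic running time will still be enormous for two 250 MB files:

    def levenshtein(a, b):
        """Levenshtein distance using only two rows of the DP matrix."""
        prev = list(range(len(b) + 1))            # distances for an empty prefix of a
        curr = [0] * (len(b) + 1)
        for i in range(1, len(a) + 1):
            curr[0] = i
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                curr[j] = min(prev[j] + 1,        # deletion
                              curr[j - 1] + 1,    # insertion
                              prev[j - 1] + cost) # substitution
            prev, curr = curr, prev               # the new row becomes the current row
        return prev[len(b)]

    print(levenshtein(b"kitten", b"sitting"))     # 3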
I want to know how, in SQL, a fixed-length data type takes up space in memory. What I know is that for varchar, if we specify a length of (20) and the user's input has length 15, it takes 20 by padding with spaces; for varchar2, if we specify a length of (20) and the user's input is 15, it only takes 15 in memory. So how does a fixed-length data type take up space? I searched Google, but I did not find an explanation with an example. Please explain with an example. Thanks in advance.
A fixed length data field always consumes its full size.
In the old days (FORTRAN), it was padded at the end with space characters. Modern databases might do that too, but they either implicitly trim trailing blanks off, or the query might have to do it explicitly.
Variable length fields are a relative newcomer to databases, probably in the 1970s or 1980s they made widespread appearances.
It is considerably easier to manage fixed-length record offsets and sizes than to compute the offset of each data item in a record that has variable-length fields. Furthermore, a fixed-length data record is easily addressed in a data file by computing the byte offset of its beginning: multiply the record size by the record number (and add the length of whatever fixed header data is at the beginning of the file).
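As a rough illustration of that addressing arithmetic (a Python sketch with made-up field widths and header size):

    import struct

    RECORD = struct.Struct("10s 4s 20s")   # three fixed-width fields, 34 bytes per record
    HEADER_LEN = 128                       # hypothetical fixed-length file header

    def record_offset(record_number):
        """Byte offset of a record: header length plus record size times record number."""
        return HEADER_LEN + RECORD.size * record_number

    # A fixed-length (CHAR-style) field always occupies its full width, padded with spaces.
    name = "Alice".ljust(10)               # 'Alice     '
    packed = RECORD.pack(name.encode(), b"1999", "Springfield".ljust(20).encode())
    print(len(packed), record_offset(3))   # 34 230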