What is the best way to compare two Images in the database?
I tried to compare them directly (#Image is of type Image):
Select * from Photos
where [Photo] = #Image
But I receive the error "The data types image and image are incompatible in the equal to operator".
Since the Image data type is binary and takes a huge amount of space to store, IMO the easiest way to compare Image fields is hash comparison.
So you need to store a hash of the Photo column in your table.
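A minimal sketch of that idea in Python with pyodbc (the PhotoHash and PhotoId columns and the connection string are assumptions; you would populate the hash column on every insert or update of the Photo column):

import hashlib
import pyodbc  # any SQL Server client library works the same way

def photo_hash(image_bytes):
    # Hash the raw image bytes; with SHA-256, accidental collisions are negligible
    return hashlib.sha256(image_bytes).hexdigest()

conn = pyodbc.connect("DSN=MyPhotoDb")  # hypothetical connection string
cursor = conn.cursor()

with open("new_photo.jpg", "rb") as f:
    incoming = f.read()

# Look the image up by its hash instead of comparing the Image column itself
cursor.execute("SELECT PhotoId FROM Photos WHERE PhotoHash = ?", photo_hash(incoming))
matches = cursor.fetchall()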
If you need to compare the images themselves, you have to retrieve all the images from the database and do the comparison in the language you use to access the database. This is one of the reasons why it's not considered best practice to store images or other binary files in a relational database. Instead, create a unique file name every time you want to store an image: rename the file with that unique name, store the image on disk, and insert into your database its name on disk and, optionally, the original name of the file or the one provided by the user of your app.
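A rough sketch of that file-on-disk workflow in Python (the directory, table, and column names are assumptions, and SQLite stands in for whatever database you actually use):

import os
import shutil
import sqlite3
import uuid

IMAGE_DIR = "/var/app/images"  # assumed storage location

def store_image(conn, uploaded_path, original_name):
    # Generate a unique name, keeping the original extension
    ext = os.path.splitext(original_name)[1]
    unique_name = uuid.uuid4().hex + ext

    # Copy the uploaded file to disk under the unique name
    shutil.copyfile(uploaded_path, os.path.join(IMAGE_DIR, unique_name))

    # Store only the names in the database, not the image bytes
    conn.execute(
        "INSERT INTO Photos (disk_name, original_name) VALUES (?, ?)",
        (unique_name, original_name),
    )
    conn.commit()
    return unique_name

# Usage (a Photos table with disk_name and original_name columns is assumed to exist):
# conn = sqlite3.connect("app.db")
# store_image(conn, "/tmp/upload123", "holiday.jpg")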
Generally, as has been mentioned already, you need to use dedicated algorithms from the image-processing shelf.
Moreover, it's hard to give a precise answer because the question is too general. Two images may be considered different or not based on a number of properties.
For instance, you can have one image of a flower at 100x100 pixels and an image of the same flower resized to 50x50 pixels. For some purposes and applications these two will be considered similar (regardless of the different dimensions), but for other purposes they will be different images.
You may want to check how image comparison is implemented by some existing tools and learn how they work:
pdiff
ImageMagick compare
If you don't want to compare image content but just check whether two binary streams (image files, other binary files, binary objects with image content) are identical, then you can compare MD5 checksums of the images.
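For that checksum route, a small Python sketch (the file names are placeholders):

import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    # Stream the file in chunks so large images don't have to fit in memory
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# The files are byte-for-byte identical (barring an MD5 collision) iff the checksums match
same_bytes = md5_of_file("a.jpg") == md5_of_file("b.jpg")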
It depends on how accurate you want to be and how many images you need to compare. You can use functions like DATALENGTH and SUBSTRING or READTEXT to do some comparisons. Alternatively, you could write code in the CLR and expose it through a stored procedure to do the comparisons.
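As one example of that approach, here is a sketch (Python with pyodbc) that uses DATALENGTH as a cheap prefilter and finishes the exact comparison client-side; the DSN, table, and column names are assumptions:

import pyodbc

conn = pyodbc.connect("DSN=MyPhotoDb")  # hypothetical DSN
cursor = conn.cursor()

with open("candidate.jpg", "rb") as f:
    candidate = f.read()

# Cheap first pass: only rows whose image has the same length can possibly match
cursor.execute("SELECT PhotoId, Photo FROM Photos WHERE DATALENGTH([Photo]) = ?", len(candidate))

# Exact check on the (hopefully few) remaining rows, done client-side
matches = [row.PhotoId for row in cursor.fetchall() if bytes(row.Photo) == candidate]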
Comparing images falls under a specific category of computer science called image processing. You should look for libraries that provide image-processing capabilities. Using them, you can measure to what degree two given images are the same or identical; two images may match each other by 50%, or more, or less. There are mathematical algorithms that define the comparison formulae and return that ratio.
Hope this gives you a direction to work further on your problem.
For example
Instead of a field named COLUMN_1 that holds an embedded document with fields A, B, and C,
why not 3 separate fields named COLUMN_1_A, COLUMN_1_B, COLUMN_1_C?
In MongoDB you prefer to work with documents the way they exist naturally. Of course, you can create routines that translate every time and assemble/disassemble the document:
COLUMN_1:{A:X,B:Y,C:Z}
to:
COLUMN_1_A:X,COLUMN_1_B:Y,COLUMN_1_C:Z
But this is additional work that you don't want to do every time. Be a lazy, effective developer and just store your JSON document the way it will be used, unless there is a specific use case where it makes sense to do it the other way... :)
Also, please note that embedding is possible up to 100 levels, but the maximum document size is still limited to 16 MB, so in your document model it is not expected that you embed your whole database in a single document...
When you need performance, you add indexes and optimize queries; that way your most searched data stays in memory. There is no big difference between fields that are embedded and fields at the root if they are not indexed...
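A quick pymongo sketch of why the embedded form is no obstacle: dot notation reaches inside embedded documents for both queries and indexes (the database and collection names are made up):

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
coll = client["mydb"]["mycollection"]  # hypothetical database/collection names

# Store the document the way it is used: COLUMN_1 as an embedded document
coll.insert_one({"COLUMN_1": {"A": "X", "B": "Y", "C": "Z"}})

# Dot notation queries and indexes reach inside the embedded document,
# so flattening the fields buys you nothing here
coll.create_index("COLUMN_1.A")
doc = coll.find_one({"COLUMN_1.A": "X"})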
My data frame has 3.8 million rows and 20 or so features, many of which are categorical. After paring down the number of features, I can "dummy up" one critical column with 20 or so categories, and my Colab session with (allegedly) a TPU running won't crash.
But there's another column with about 53,000 unique values. Trying to "dummy up" this feature crashes my session. I can't ditch this column.
I've looked up target encoding, but the data set is very imbalanced and I'm concerned about target leakage. Is there a way around this?
EDIT: My target variable is a simple binary one.
Without knowing more details of the problem/feature, there's no obvious way to do this. This is the part of data science/machine learning that is an art, not a science. A couple of ideas:
One-hot encode everything, then use a dimensionality-reduction algorithm to remove some of the columns (PCA, SVD, etc.).
Only one-hot encode some values (say, limit it to the top 10 or 100 categories rather than all 53,000), then lump the rest into an "other" category (see the sketch after this list).
If it's possible to construct an embedding for this variable (not always possible), you can explore that.
Group/bin the values in the column by some underlying feature, i.e. if the feature is something like days_since_X, bin it by 100 or something. Or if it's names of animals, group it by type instead (mammal, reptile, etc.).
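For the second idea, a minimal pandas sketch (the column name big_cat, the toy data, and the cutoff are assumptions; on your data you might keep the top 100 categories):

import pandas as pd

# Toy stand-in for your real DataFrame; big_cat is the high-cardinality column
df = pd.DataFrame({"big_cat": ["a", "b", "a", "c", "d", "a"], "y": [0, 1, 0, 1, 0, 1]})

TOP_N = 2  # e.g. 100 on the real data

# Keep the most frequent categories, lump everything else into "other"
top_values = df["big_cat"].value_counts().nlargest(TOP_N).index
reduced = df["big_cat"].where(df["big_cat"].isin(top_values), "other")

# One-hot encoding now produces at most TOP_N + 1 columns instead of ~53,000
dummies = pd.get_dummies(reduced, prefix="big_cat")
df = pd.concat([df.drop(columns=["big_cat"]), dummies], axis=1)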
I'd like to use the Levenshtein algorithm to compare two files in VB.NET. I know I can use an MD5 hash to determine if they're different, but I want to know HOW MUCH different the two files are. The files I'm working with are both around 250 megs. I've experimented with different ways of doing this and I've realized I really can't load both files into memory (all kinds of string-related issues). So I figured I'd just stream the bytes I need as I go. Fine. But the implementations of the Levenshtein algorithm that I've found all dimension a matrix that's length1 x length2 in size, which in this case is impossible to work with. I've heard there's a way to do this with just two vectors instead of the whole matrix.
How can I compute Levenshtein distance of two large files without declaring a matrix that's the product of their file sizes?
Note that the values in each row of the Levenshtein matrix depend only on the values in the row above it. This means that you only need two one-dimensional arrays: one contains the values of the current row; the other is populated with the new values that you can compute from the current row. Then, you swap their roles (the "new" row becomes the "current" row and vice versa) and continue.
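Here is a sketch of that two-row approach in Python; the same structure translates directly to VB.NET, with the two inputs read as byte sequences. Note that the running time is still proportional to the product of the two lengths, so 250 MB inputs will be slow even though memory is no longer the problem.

def levenshtein(a, b):
    # a and b can be strings, byte strings, or any indexable sequences
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    curr = [0] * (len(b) + 1)

    for i in range(1, len(a) + 1):
        curr[0] = i  # distance from a[:i] to the empty string
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution or match
        prev, curr = curr, prev  # swap roles: the new row becomes the current row

    return prev[len(b)]  # after the final swap, prev holds the last computed row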
Note that this approach only lets you compute the Levenshtein distance (which seems to be what you want); it cannot tell you which operations must be done in order to transform one string into the other. There exists a very clever modification of the algorithm that lets you reconstruct the edit operations without using n x m memory, but I've forgotten how it works.
Context
I have a bunch of PDF files. Some of them are scanned (i.e. images). They consist of text + pictures + tables.
I want to turn the tables into CSV files.
Current Plan:
1) Run Tesseract OCR to get text of all the documents.
2) ??? Run some type of Table Detection Algorithm ???
3) Extract the rows / columns / cells, and the text in them.
Question:
Is there some standard "Table Extraction Algorithm" to use?
Thanks!
Abbyy FineReader includes table detection and will be the easiest approach. It can scan and import PDFs, TIFFs, etc. You will also be able to manually adjust the tables and columns when the auto-detection fails.
www.abbyy.com - You should be able to download a trial version, and you will also find the OCR results are much more accurate than Tesseract's, which will save you a lot of time.
Trying to write something yourself will be hit and miss, as there are too many different types of tables to cope with: with lines, without lines, shaded, spanning multiple lines, different alignments, headers, footers, etc.
Good luck.
The basic gist of what I'm trying to accomplish is setting up an image-processing server. As the page is generated in ColdFusion, multiple images on the page may need to be resized and thumbnailed, each to a possibly different size and each with a possibly different algorithm.
The basic gist of how it works: using a simple img tag, the src attribute points to the image server, along the lines of the following.
<img src="http://imageserver.com/<clientname>/<primarykey>.jpg">
This allows the image resizing to occur asynchronously, and on a different server, thus not slowing down the current page call.
When the image-processing server receives the call, it first checks whether that file exists. If Apache determines the file exists, it serves it right away; otherwise it invokes ColdFusion, which reads an entry from the database using the primary key passed to it to get the URL of the image to be processed and any associated parameters (in this case width, height, method, url, and client, but possibly more in the future).
Currently I'm doing this using a hash system where the parameters are ordered alphabetically and then hashed. Is that a reasonable system, or will hash collisions eventually occur even though the data being hashed is quite small (between 50 and 200 characters)? Each client could likely store up to 10,000 images (in their own folder, so hash collisions would not be a problem cross-client).
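For reference, that scheme looks roughly like this in Python (the parameter values are made up; SHA-256 is an assumption, and at 10,000 images per client an accidental collision with any modern cryptographic hash is effectively impossible):

import hashlib

def image_key(params):
    # Order the parameters alphabetically so the same inputs always hash the same way
    canonical = "&".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical parameter set for one processed image
key = image_key({
    "client": "acme",
    "url": "http://example.com/original.jpg",
    "width": 200,
    "height": 150,
    "method": "crop",
})
filename = key + ".jpg"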
To reduce DB calls, as the page processes, each time a processed image is needed I add that image's information to an array. At the end of the page I make two calls to the DB: the first checks whether the rows in my array already exist in the DB, and then, if necessary, the second adds any rows that do not exist (storing their various parameters). The dilemma here is that the primary key (or whatever goes in the image tag) must be known before it is actually inserted into the DB; that way I'm not checking on every single image, since some pages could have hundreds of images on them and that would be very inefficient.
Are hash collisions not a concern with this sample size (10k images per client, generated from 50-200 character strings)? What if I did something simple like <width>_<height>_<hash>.jpg, or put the images in folders like /<client>/<width>x<height>/<hash>.jpg, since that would further reduce the possibility of hash collisions (although not remove them)?
Any advice?
How are you hashing? Use SHA-512 for the hashing algorithm and you'll get a string 128 characters long. You may not want a URL that long, but the idea is that you can minimize collisions via more complex algorithms.
http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-7c52.html
Even though I doubt you would have to worry about hash collisions, you may want to just use a UUID.
http://help.adobe.com/en_US/ColdFusion/9.0/CFMLRef/WSc3ff6d0ea77859461172e0811cbec22c24-70de.html
EDIT: Or use a uniqueidentifier as the primary key of the table you are storing the file in. Then, after an insert, you can use the OUTPUT clause of the query to return the key, to be used however you want.
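A sketch of that OUTPUT-clause idea, shown from Python with pyodbc for brevity (the Images table, its columns, and an ImageId uniqueidentifier primary key defaulting to NEWID() are all assumptions; the same SQL can be issued from a cfquery):

import pyodbc

conn = pyodbc.connect("DSN=ImageServer")  # hypothetical DSN
cursor = conn.cursor()

# Let SQL Server generate the uniqueidentifier and hand it straight back
row = cursor.execute(
    """
    INSERT INTO Images (Client, SourceUrl, Width, Height, Method)
    OUTPUT INSERTED.ImageId
    VALUES (?, ?, ?, ?, ?)
    """,
    ("acme", "http://example.com/original.jpg", 200, 150, "crop"),
).fetchone()
conn.commit()

image_id = row[0]  # use this GUID as the <primarykey> part of the img src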
The way I resolved this was by hashing not only the filename but also its parameters, such as width and height. Thus the possibility of a hash collision is basically zero until we hit millions (billions?) of records. So far we have had no hash collisions.