Assuming Parquet files on AWS S3 (queried by AWS Athena).
I need to anonymize a record with a specific numeric field by changing the numeric value (changing one digit is enough).
Can I scan a Parquet file as binary and find a numeric value? Or will the compression make it impossible to find such a string?
Assuming I can do #1 - can I anonymize the record by changing a digit of this number at the binary level, without corrupting the Parquet file?
10X
No, this will not be possible. Parquet has two layers in its format that make this impossible: encoding and compression. Both rearrange the data to fit into less space; the difference between them is CPU usage and generality. Sometimes data can be compressed so that we need less than a byte per value, if all values are the same or very similar. Changing a single value would then lead to more space usage, which in turn makes an in-place edit impossible.
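If you rewrite the file through a Parquet library instead of patching bytes, the change itself is trivial. A minimal sketch with pandas/pyarrow (the bucket, key, column names and match value are made up; reading straight from S3 assumes s3fs is installed):
import pandas as pd

# Read the whole file, change the sensitive value, and write a replacement file.
# File paths, column names and the matching key below are hypothetical.
df = pd.read_parquet("s3://my-bucket/data/part-00000.parquet")   # needs pyarrow + s3fs
df.loc[df["customer_id"] == 12345, "salary"] = 99999             # overwrite the numeric value
df.to_parquet("part-00000-anonymized.parquet", index=False)      # upload this back to S3 in place of the original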
I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, with the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
# set path to file
jsonFilePath = '/mnt/datalake/jsonfiles/filename.json'

# read file to dataframe
# entitySchema is a schema struct previously extracted from a sample file
rawdf = spark.read.option("multiline", "true").schema(entitySchema).format("json").load(jsonFilePath)

# rawdf contains a single row of file_name, timestamp_created, and obj_array
# obj_array is an array field containing the entire data payload (>2GB)
explodeddf = rawdf.selectExpr("file_name", "timestamp_created", "explode(obj_array) as data")

# this column explosion fails because obj_array exceeds 2GB
When you hit limits like this you need to re-frame the problem. Spark is choking on 2 GB in a single column, and that is a pretty reasonable choke point. Why not write your own custom data reader (presentation layer) that emits records in the way that you deem reasonable? That is likely the best solution if you want to leave the files as they are.
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could use SQL tricks to try to expand and fill rows with window functions/lag.
You could also do file-level cleaning/formatting to make the data more manageable for the out-of-the-box tools to work with, as in the sketch below.
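For the file-level route, one option is to stream the big array out to JSON Lines before Spark ever reads it, so no single column has to hold 2 GB. A sketch using the ijson streaming parser (the /dbfs local path for the mount and the output file name are assumptions):
import json
import ijson  # streaming JSON parser, so the 2GB array is never held in memory at once

src = "/dbfs/mnt/datalake/jsonfiles/filename.json"    # local-file view of the mount (assumption)
dst = "/dbfs/mnt/datalake/jsonfiles/filename.jsonl"

with open(src, "rb") as fin, open(dst, "w") as fout:
    # Emit one element of obj_array per line (JSON Lines), which Spark can read without multiline mode.
    for obj in ijson.items(fin, "obj_array.item"):
        fout.write(json.dumps(obj, default=str) + "\n")  # default=str handles ijson's Decimal numbers

# One record per row, no explode() needed:
# df = spark.read.json("/mnt/datalake/jsonfiles/filename.jsonl")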
I have a small Firebird 2.5 database with a blob field called "note" declared as this:
BLOB SUB_TYPE 1 SEGMENT SIZE 80 CHARACTER SET UTF8
The database page size is:
16384 (which I suspect is too high)
I have run this select in order to discover the average size of the blob fields available:
select avg(octet_length(items.note)) from items
and got this information:
2671
As a beginner, I would like to know the best segment size for this blob field and the best database page size, in your opinion (I know that this depends on other information, but I still don't know how to figure it out).
Blobs in Firebird are stored in separate pages of your database. The exact storage format depends on the size of your blob. As described in Blob Internal Storage:
Blobs are created as part of a data row, but because a blob could be
of unlimited length, what is actually stored with the data row is a
blobid, the data for the blob is stored separately on special blob
pages elsewhere in the database.
[..]
A blob page stores data for a blob. For large blobs, the blob page
could actually be a blob pointer page, i.e. be used to store pointers
to other blob pages. For each blob that is created a blob record is
defined, the blob record contains the location of the blob data, and
some information about the blobs contents that will be useful to the
engine when it is trying to retrieve the blob. The blob data could be
stored in three slightly different ways. The storage mechanism is
determined by the size of the blob, and is identified by its level
number (0, 1 or 2). All blobs are initially created as level 0, but
will be transformed to level 1 or 2 as their size increases.
A level 0 blob is a blob that can fit on the same page as the blob
header record, for a data page of 4096 bytes, this would be a blob of
approximately 4052 bytes (Page overhead - slot - blob record header).
In other words, if your average size of blobs is 2671 bytes (and most larger ones are still smaller than +/- 4000 bytes), then likely a page size of 4096 is optimal as it will reduce wasted space from on average 16340 - 2671 = 13669 bytes to 4052 - 2671 = 1381 bytes.
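As a back-of-the-envelope check, the per-blob waste for a few candidate page sizes can be tabulated like this (the ~44-byte page overhead is inferred from the 4096 - 4052 figure quoted above, so treat it as an approximation):
AVG_BLOB = 2671       # average octet_length from the query above
PAGE_OVERHEAD = 44    # approximated as 4096 - 4052

for page_size in (4096, 8192, 16384):
    usable = page_size - PAGE_OVERHEAD
    print(f"page size {page_size}: ~{usable - AVG_BLOB} bytes wasted per level-0 blob page")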
However, for performance itself this is unlikely to matter much, and smaller page sizes have other effects that you will need to take into account. For example, a smaller page size will also reduce the maximum size of a CHAR/VARCHAR index key, indexes might become deeper (more levels), and fewer records fit in a single page (or wider records become split over multiple pages).
Without measuring and testing it is hard to say if using 4096 for the page size is the right size for your database.
As to segment sizes: it is a historic artifact that is best ignored (and left off). Sometimes applications or drivers incorrectly assume that blobs need to be written or read in the specified segment size. In those rare cases specifying a larger segment size might improve performance. If you leave it off, Firebird will default to a value of 80.
From Binary Data Types:
Segment Size: Specifying the BLOB segment is throwback to times past,
when applications for working with BLOB data were written in C
(Embedded SQL) with the help of the gpre pre-compiler. Nowadays, it is
effectively irrelevant. The segment size for BLOB data is determined
by the client side and is usually larger than the data page size, in
any case.
I am wondering if there is a file format that will enable me to store values in the format indicated below. I want the file format to be as efficient as possible (i.e. no extra information apart from what I place inside it). This is for a concept I have of creating a more efficient method of storing images. Here is an example of the data I wish to store:
800 600 0000FF FF0000 00FF00 969696
...
I was originally considering placing them in a .txt file, but I do not think that storing, say, 1 million numbers (for a 1000x1000 image) in a .txt file is very compact.
So, what file format that can be written to in VB.net is the best for storing basic numbers?
EDIT 1: I plan to compress using GZip or some other compression afterwards.
Simply store them in binary format. Look at BinaryWriter.
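To illustrate what a binary layout buys you, here is a sketch in Python's struct module (in VB.net the same writes would go through BinaryWriter.Write); the width/height/RGB layout is an assumption based on the example data above:
import struct

# Assumed layout: width and height as 16-bit unsigned ints, then one 3-byte (R, G, B) triple per pixel.
width, height = 800, 600
pixels = [(0x00, 0x00, 0xFF), (0xFF, 0x00, 0x00), (0x00, 0xFF, 0x00), (0x96, 0x96, 0x96)]

with open("image.bin", "wb") as f:
    f.write(struct.pack("<HH", width, height))
    for r, g, b in pixels:
        f.write(struct.pack("BBB", r, g, b))
# Each pixel now costs exactly 3 bytes, and the file can still be GZip-compressed afterwards.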
What is the best way to compare two Images in the database?
I tried to compare them (@Image is of type Image):
Select * from Photos
where [Photo] = @Image
But I receive the error "The data types image and image are incompatible in the equal to operator".
Since the Image data type is binary and takes a huge amount of space for storing data, IMO the easiest way to compare Image fields is hash comparison.
So you need to store a hash of the Photo column in your table.
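A sketch of that approach from client code, assuming a hypothetical PhotoHash column and a pyodbc connection (the hash could equally be computed server-side with HASHBYTES):
import hashlib
import pyodbc

# Hypothetical schema: Photos(Id, Photo, PhotoHash binary(32)); connection details are placeholders.
conn = pyodbc.connect("DSN=mydb")
cur = conn.cursor()

with open("new_photo.jpg", "rb") as f:
    digest = hashlib.sha256(f.read()).digest()

# Look the image up by its stored hash instead of comparing the blobs themselves.
cur.execute("SELECT Id FROM Photos WHERE PhotoHash = ?", digest)
row = cur.fetchone()
print("already stored" if row else "not found")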
If you need to compare the images, you should retrieve all the images from the database and do the comparison in the language that you use for accessing the database. This is one of the reasons why it's not a best practice to store images or other binary files in a relational database. You should create a unique file name every time you want to store an image: rename the file with this unique name, store the image on disk, and insert into your database its name on disk and, optionally, the original name of the file or the one provided by the user of your app.
Generally, as has been mentioned already, you need to use dedicated algorithms from the image processing shelf.
Moreover, it's hard to give a precise answer because the question is too general. Two images may be considered different or not based on a number of properties.
For instance, you can have one image of a flower 100x100 pixels and image of the same flower but resized to 50x50 pixels. For some purposes and applications these two will be considered as similar (regardless of different dimensions), but will be different images for other purposes.
You may want to check how image comparison is realised by some tools and learn how it works:
pdiff
ImageMagick compare
If you don't want to compare image content but just check if two binary streams (image files, other binary files, binary object with image content) are equivalent, then you can compare MD5 checksums of images.
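For that byte-for-byte check, a short sketch with hashlib (file names are placeholders; note that visually identical but re-encoded images will still differ):
import hashlib

def md5_of(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):  # hash in chunks to avoid loading large files whole
            h.update(chunk)
    return h.hexdigest()

print(md5_of("a.jpg") == md5_of("b.jpg"))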
It depends on how accurate you want to be and how many images you need to compare. You can use various functions like DATALENGTH and SUBSTRING or READTEXT to do some comparisons. Alternatively, you could write code in the CLR and implement it through a stored procedure to do comparisons.
Comparing images falls under a specific category of computer science called image processing. You should look for libraries that provide image processing capabilities. Using those, you can measure to what degree two given images are the same or similar; two images might match each other by 50%, more, or less. There are mathematical algorithms that define the comparison formulae and return that ratio.
Hope this gives you a direction to work further on your problem.
I want to know, in SQL, how a fixed-length data type takes up space in memory. What I know is that for VARCHAR, if we specify a length of (20) and the user input is 15 characters long, it takes 20 by padding with spaces; for VARCHAR2, if we specify a length of (20) and the user input is 15 characters long, it only takes 15 in memory. So how does a fixed-length data type take up space? I searched Google, but I did not find an explanation with an example. Please explain it to me with an example. Thanks in advance.
A fixed length data field always consumes its full size.
In the old days (FORTRAN), it was padded at the end with space characters. Modern databases might do that too, but they either implicitly trim the trailing blanks off, or the query might have to do it explicitly.
Variable length fields are a relative newcomer to databases, probably in the 1970s or 1980s they made widespread appearances.
It is considerably easier to manage fixed length record offsets and sizes rather than compute the offset of each data item in a record which has variable length fields. Furthermore, a fixed length data record is easily addressed in a data file by computing the byte offset of its beginning by multiplying the record size times the record number (and adding the length of whatever fixed header data is at the beginning of file).
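To illustrate the offset arithmetic, here is a small sketch with fixed 32-byte records (the field widths, header length and file name are invented for the example):
import struct

# Hypothetical fixed-length record: 20-byte name, 4-byte int, 8-byte code = 32 bytes per record.
RECORD = struct.Struct("<20si8s")
HEADER = b"DEMOFILE".ljust(16)  # pretend 16-byte fixed file header

with open("records.dat", "wb") as f:
    f.write(HEADER)
    for i, name in enumerate(["alpha", "beta", "gamma"]):
        # Fixed-length fields are padded with blanks up to their full size.
        f.write(RECORD.pack(name.ljust(20).encode(), i * 10, f"C{i:03d}".ljust(8).encode()))

# Random access is pure arithmetic: offset = header length + record size * record number.
with open("records.dat", "rb") as f:
    f.seek(len(HEADER) + RECORD.size * 2)  # jump straight to record #2
    name, value, code = RECORD.unpack(f.read(RECORD.size))
    print(name.rstrip(b" "), value, code.rstrip(b" "))  # trailing pad blanks trimmed, as described above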