How does SSTable store its triplets in the file system? - bigtable

How exactly does SSTable store its string-keyed [row, column, timestamp] triplets in the file system? Are all the triplets in the same directory, in different directories, in a single file, or arranged in some other way?
The question is specifically about the file/directory structure of SSTable, which is part of BigTable and builds on GFS, and more specifically about the actual names of the files that store those triplets and their directory layout.
It seems that a concrete example of storing and retrieving such key-value triplets would demystify the concept.
Ideally, a diagram or two would make it much clearer.

"SSTable and Log Structured Storage: LevelDB" by Ilya Grigorik, an engineer at Google, describes the SSTable structure and includes several diagrams.
Also, LevelDB is an open source project by Google which includes an implementation of SSTables in table.h and table.cc.
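To make the layout concrete: according to the BigTable paper, an SSTable is a single immutable file (stored in GFS), and a tablet is served from a set of such files plus a commit log, so the triplets are not spread across directories but packed, sorted by key, into one file with an index at the end. Below is a deliberately simplified Python sketch of that idea. It is illustrative only and does not reproduce LevelDB's or BigTable's actual on-disk format (which adds data blocks, compression, checksums, and a fixed-size footer); the file name and key encoding here are made up.

import json
import struct

def write_sstable(path, entries):
    # Toy SSTable: sorted records, then a key -> offset index, then a trailer.
    # `entries` maps a key string such as "row|column|timestamp" to a value string.
    index = {}
    with open(path, "wb") as f:
        for key in sorted(entries):                  # SSTables are sorted by key
            index[key] = f.tell()                    # remember where this record starts
            record = json.dumps([key, entries[key]]).encode()
            f.write(struct.pack("<I", len(record)))  # length prefix
            f.write(record)
        index_offset = f.tell()
        f.write(json.dumps(index).encode())          # the "index block"
        f.write(struct.pack("<Q", index_offset))     # trailer points at the index

def read_sstable(path, key):
    # Look up one key: read the trailer, load the index, seek straight to the record.
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        f.seek(size - 8)
        (index_offset,) = struct.unpack("<Q", f.read(8))
        f.seek(index_offset)
        index = json.loads(f.read(size - 8 - index_offset).decode())
        if key not in index:
            return None
        f.seek(index[key])
        (length,) = struct.unpack("<I", f.read(4))
        _, value = json.loads(f.read(length).decode())
        return value

write_sstable("tablet-00001.sst", {
    "com.example.www|anchor:a1|1690000000": "link text",
    "com.example.www|contents:html|1690000000": "<html>...</html>",
})
print(read_sstable("tablet-00001.sst", "com.example.www|anchor:a1|1690000000"))

The real formats work the same way in spirit: the reader loads the index from the end of the file and seeks directly to the block that may contain the key, so a single immutable file per SSTable is essentially all the directory structure there is.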

Related

Azure Data Factory - optimal design for an IoT pipeline

I am working on an Azure Data Factory solution to solve the following scenario:
Data files in CSV format are dumped into Data Lake Gen 2 paths. There are two varieties of files, let's call them TypeA and TypeB, and each is dumped into a path reflecting a grouping of sensors and the date.
For example:
/mycontainer/csv/Group1-20210729-1130/TypeA.csv
/mycontainer/csv/Group1-20210729-1130/TypeB.csv
/mycontainer/csv/Group1-20210729-1138/TypeA.csv
/mycontainer/csv/Group1-20210729-1138/TypeB.csv
I need to extract the data from the TypeA files, in Delta format, into a different location on Data Lake Gen 2 storage. I'll need to do similar processing for the TypeB files, but they'll have a different format.
I have successfully put together a "Data Flow" which, given a specific blob path, accomplishes this extraction. But I am struggling to put together a pipeline which applies it to each file as it comes in.
My first thought was to use a storage event trigger, so that each time a CSV file appeared the pipeline would run to process that one file. I was almost able to accomplish this using a combination of fileName and folderPath parameters and wildcards, and I even had a pipeline which worked when triggered manually (meaning I entered a specific fileName and folderPath value by hand). However, I had two problems which made me question whether this was the correct approach:
a) I wasn't able to get it to work when triggered by real storage events. I suspect my combination of parameters and wildcards ended up including the container name twice in the path it generated, but it's hard to check because the error message doesn't tell you what the various values actually resolve to (!).
b) The cluster needed to extract the CSV into Delta (Parquet) format and write the results into Data Lake takes several minutes to spin up - not great when working at the level of individual files. (I realize I can mitigate this somewhat, at a cost, by setting a TTL on the cluster.)
So I abandoned this approach and tried to set up a pipeline which is triggered periodically, picks up all the CSV files matching a particular pattern (e.g. /mycontainer/csv/*/TypeA.csv), processes them as a batch, and then deletes them. At this point I was very surprised to find that the "Delimited Text" dataset doesn't seem to support wildcards, which is what I was relying on to achieve this in a simple way.
So my questions are:
Am I broadly on the right track with my 'batch of files' approach? Is there a way to define a delimited text data source which reads its data from multiple blobs?
Or do I need a more 'iterative' approach, maybe using a 'Foreach' step? I'm really hoping this isn't the case, as it seems an odd pattern to be adopting in 2021.
A much wider question: is ADF a suitable tool for this kind of scenario? I was excited about using it at first, but increasingly it feels like one of those 'exciting to demo but hard to actually use' things which so often pop up in the low/no-code space. Are there popular alternatives which will work nicely with Azure storage?
Any pointers very much appreciated.
I believe you're very much on the right track.
Last week I was able to get wildcard CSVs imported when the wildcard is in the CSV file name. Maybe create an intermediate step that puts all the TypeA files in the same folder?
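For example, a rough sketch of such a consolidation step, assuming the azure-storage-blob Python SDK and the container and paths from your question (the connection string and the csv-staging prefix are placeholders):

from azure.storage.blob import BlobServiceClient

# Placeholder connection string; if the source container is private, the server-side
# copy below may additionally need a SAS token appended to the source URL.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("mycontainer")

# Find every TypeA.csv under csv/<group>-<timestamp>/ and copy it into csv-staging/,
# folding the folder name into the file name so nothing collides.
for blob in container.list_blobs(name_starts_with="csv/"):
    if not blob.name.endswith("/TypeA.csv"):
        continue
    folder = blob.name.split("/")[1]          # e.g. "Group1-20210729-1130"
    source = container.get_blob_client(blob.name)
    target = container.get_blob_client(f"csv-staging/{folder}-TypeA.csv")
    target.start_copy_from_url(source.url)    # asynchronous server-side copy

The same consolidation could of course be done inside ADF itself with a Copy activity; the point is only that once all TypeA files sit under one prefix, a plain (non-wildcard) dataset can read them as a batch.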
Concerning ADF - it's a cool technology with a steep learning curve (and a lot of updates, including breaking changes sometimes) if you're looking to get data ingested without too much coding. Some drawbacks:
Monitoring - if you want to keep costs down, there's a lot of hacking involved (e.g. mailing via Logic Apps)
Debugging - as you've noticed, debug messages are often cryptic or insufficient
Multiple updates every month make it feel like a beta, and straightforward tasks are often surprisingly difficult to achieve.
Good luck ;)

How to see all the possible options for schema metadata in TensorFlow?

I am using TensorFlow Data Validation and I am trying to build schemas around my datasets. I've built the initial schemas and I can see/edit them in Notepad, but I'm having a hard time finding a resource that shows exactly what kinds of parameters I can set in the file for a given data type (e.g. min or max values, or data shapes).
Does anyone know of a good resource or even a comprehensive schema I can use to further edit my schema file?
Schemas are just protocol buffer messages, defined in TensorFlow Metadata. You can find the protocol buffer definition in tensorflow_metadata/proto/v0/schema.proto, which describes and documents all the possible properties and options.
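For example, instead of editing the text file by hand, you can load the schema with TFDV and set those properties through the message fields from schema.proto. A small sketch (the feature name, bounds, and file path are made up):

import tensorflow_data_validation as tfdv
from tensorflow_metadata.proto.v0 import schema_pb2

# Load a schema previously written out as a text-format proto.
schema = tfdv.load_schema_text("schema.pbtxt")

# Constrain a numeric feature to the range [0, 120] via an IntDomain.
tfdv.set_domain(schema, "age", schema_pb2.IntDomain(name="age", min=0, max=120))

# Require the feature in at least 90% of examples, with exactly one value per example.
feature = tfdv.get_feature(schema, "age")
feature.presence.min_fraction = 0.9
feature.value_count.min = 1
feature.value_count.max = 1

tfdv.write_schema_text(schema, "schema.pbtxt")

Every field you can set this way (shapes, domains, presence, drift/skew comparators, and so on) is documented directly in schema.proto, so that file doubles as the reference you are looking for.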

Storing spreadsheets in a database

I am attempting to create a relational database to hold data from experiments, each of which produces a CSV file full of data. This would allow me to look up an experiment based on date, author, experimental values, etc.
However, I am not sure how to fit the experiments, which each generate separate CSV files, into the relational database.
Would it be possible to store a CSV file in a column of the database, or would it be better to hold just the name of the file?
This is a bit long for a comment.
In general, databases have the ability to store large objects (usually, "BLOB"s -- binary large objects).
Whether this meets your needs depends on several factors. I would say that the first is accessibility of the data. Storing the files in the database has some advantages:
Anyone with access to the database has access to the data.
To repeat: users do not need separate access to a file system.
The same API can be used for the metadata and for the underlying data.
You have more control over the contents -- the underlying file cannot be deleted without deleting the row in the database, for instance.
The data is automatically included in backups and restores.
Of course, there are downsides as well, some of which are related to the above:
With a separate file, it is simpler to update the file, if that is necessary.
Storing the data in a database imposes overheads (although you might be able to get around this by compressing the data).
If the application is already file-based and you are adding a database component, then changing the application to support the database could be cumbersome.
I'm sure these lists are not complete. The point is that there is no "right" answer. It depends on your needs.
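As a concrete illustration of the BLOB option, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names are made up): the metadata you want to search on lives in ordinary columns, and the raw CSV bytes ride along in the same row.

import sqlite3
from pathlib import Path

conn = sqlite3.connect("experiments.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS experiment (
        id        INTEGER PRIMARY KEY,
        author    TEXT NOT NULL,
        run_date  TEXT NOT NULL,    -- ISO 8601 date string, searchable
        notes     TEXT,
        csv_data  BLOB NOT NULL     -- the raw CSV file contents
    )
""")

def add_experiment(author, run_date, notes, csv_path):
    # Insert one experiment row with its CSV file stored as a BLOB.
    data = Path(csv_path).read_bytes()
    conn.execute(
        "INSERT INTO experiment (author, run_date, notes, csv_data) VALUES (?, ?, ?, ?)",
        (author, run_date, notes, data),
    )
    conn.commit()

def get_csv(experiment_id):
    # Return the stored CSV bytes for a given experiment id.
    row = conn.execute(
        "SELECT csv_data FROM experiment WHERE id = ?", (experiment_id,)
    ).fetchone()
    return row[0] if row else None

The file-reference alternative is the same table with csv_data replaced by a csv_path TEXT column, which keeps the database small but reintroduces the access and backup concerns listed above.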

"Data Repository" software solution

I am trying to find a software solution that will allow our group to easily upload datasets (scriptable and/or through some UI), tag those datasets, retrieve them, control access to them, search the tags, and search file names/attributes/metadata (e.g. file creation date). The datasets can be anything: CSV files, image (binary) datasets, text, server logs, folders within folders of images, zip files of CSV data. We will need to store GBs to potentially PBs of data, and a single file can range from a few KB to hundreds of GB. We also need a usable API to retrieve these datasets programmatically.
We just want a centralized place for finding information, and we want to be able to answer a question such as "Hey, do you know if we have any lightning strike datasets?" If there is a file/folder/zip file tagged with "lightning", a search should pull back that dataset.
A possible solution would be something like Dataverse, DSpace, Fedora Commons, or CKAN. However, those seem to be geared mostly towards academia, publications, or small datasets. On top of that, they remove any complex folder structure that might exist (e.g. Folder1-->subFolder1-->subFolder2). I also question the scalability of storing 10 million 100 KB files in one of these systems.
A filesystem share would let us simply store whatever we want, but I don't know of a reasonable way to enable tagging of the data.
It is almost like I am looking for a combination of the two. Does someone know of a tool, preferably open source, that would be able to do something like this?
From what you have described so far, DSpace does seem to be a good fit.
With the following examples I want to address the concerns you raised:
Scalability
Here's an example of a multi-terabyte item:
https://ore.exeter.ac.uk/repository/handle/10871/14881
Complex structure
Dryad is based on DSpace and uses a more complex data model, with data files, data packages and the original publication each being represented as separate objects:
http://datadryad.org/resource/doi:10.5061/dryad.322vn
If that's what you want, you can also base your project on the Dryad codebase, since it is open source as well:
https://github.com/datadryad/dryad-repo
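And if you do end up on a plain file share after all, tagging can be layered on top with a small external index. A rough sketch using Python's sqlite3 (the paths, table names, and tags are made up); files stay on the share, and only paths and tags live in the index:

import sqlite3

conn = sqlite3.connect("dataset-index.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS dataset (
        id    INTEGER PRIMARY KEY,
        path  TEXT UNIQUE NOT NULL      -- location on the file share
    );
    CREATE TABLE IF NOT EXISTS tag (
        dataset_id  INTEGER REFERENCES dataset(id),
        name        TEXT NOT NULL
    );
""")

def tag_dataset(path, *tags):
    # Register a path on the share and attach tags to it.
    conn.execute("INSERT OR IGNORE INTO dataset (path) VALUES (?)", (path,))
    (dataset_id,) = conn.execute("SELECT id FROM dataset WHERE path = ?", (path,)).fetchone()
    conn.executemany("INSERT INTO tag (dataset_id, name) VALUES (?, ?)",
                     [(dataset_id, t) for t in tags])
    conn.commit()

def find_by_tag(name):
    # "Do we have any lightning strike datasets?" becomes a tag lookup.
    rows = conn.execute(
        "SELECT d.path FROM dataset d JOIN tag t ON t.dataset_id = d.id WHERE t.name = ?",
        (name,),
    ).fetchall()
    return [r[0] for r in rows]

tag_dataset("//share/weather/2020/strikes.zip", "lightning", "weather")
print(find_by_tag("lightning"))

This obviously covers only the tagging and search requirements, not the access control, versioning, or API surface that DSpace and friends provide out of the box.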

Understanding lucene segments

I have these 3 files in a folder and they are all related to an index created by Lucene:
_0.cfs
segments_2
segments.gen
What are they all used for, and is it possible to convert any of them to a human-readable format to discern a bit more about how Lucene works with its indexes?
The two segments files store information about the index's segments, and the .cfs file is a compound file that bundles the other per-segment index files (such as term index, stored fields, and deletion files).
For documentation of the different types of files used in a Lucene index, see this summary of file extensions.
Generally, no, Lucene files are not human readable. They are designed more for efficiency and speed than human readability. The way to get a human readable format is to access them through the Lucene API (via Luke, or Solr, or something like that).
If you want a thorough understanding of the file formats in use, the codecs package would be the place to look.