BLOB's in SQL that stores a Video file - sql

I am hoping someone can explain how to use BLOBs. I see that BLOBs can be used to store video files. My question is why would a person store a video file in a BLOB in a SQL database? What are the advantages and disadvantages compared to storing pointers to the location of the video file?

A few different reasons.
If you store a pointer to a file on disk (presumably using the BFILE data type), you have to ensure that your database is updated whenever files are moved, renamed, or deleted on disk. It's relatively common when you store data like this that over time your database gets out of sync with the file system and you end up with broken links and orphaned content.
If you store a pointer to a file on disk, you cannot use transactional semantics when you're dealing with multimedia. Since you can't do something like issue a rollback against a file system, you either have to deal with the fact that you're going to have situations where the data on the file system doesn't match the data in the database (i.e. someone uploaded a video to the file system but the transaction that created the author and title in the database failed or vice versa) or you have to add additional steps to the file upload to simulate transactional semantics (i.e. upload a second <>_done.txt file that just contains the number of bytes in the actual file that was uploaded. That's cumbersome and error-prone and may create usability issues.
For many applications, having the database serve up data is the easiest way to provide it to a user. If you want to avoid giving a user a direct FTP URL to your files because they could use that to bypass some application-level security, the easiest option is to have a database-backed application where to retrieve the data, the database reads it from the file system and then returns it to the middle tier which then sends the data to the client. If you're going to have to read the data into the database every time the data is retrieved, it often makes more sense to just store the data directly in the database and to let the database read it from its data files when the user asks for it.
Finally, databases like Oracle provide additional utilities for working with multimedia data in the database. Oracle interMedia, for example, provides a rich set of objects to interact with video data stored in the database-- you can easily tag where scenes begin or end, tag where various subjects are discussed, when the video was recorded, who recorded it, etc. And you can integrate that search functionality with searches against all your relational data. Of course, you could write an application on top of the database that did all those things as well but then you're either writing a lot of code or using another framework in your app. It's often much easier to leverage the database functionality.

Take a read of this : http://www.oracle.com/us/products/database/options/spatial/039950.pdf
(obviously a biased view, but does have a few cons (that have now been fixed by the advent of 11g)

Related

How to save millions of files in S3 so that arbitrary future searches on key/path values are fast

My company has millions of files in an S3 bucket, and every so often I have to search for files whose keys/paths contain some text. This is an extremely slow process because I have to iterate through all files.
I can't use prefix because the text of interest is not always at the beginning. I see other posts (here and here) that say this is a known limitation in S3's API. These posts are from over 3 years ago, so my first question is: does this limitation still exist?
Assuming the answer is yes, my next question is, given that I anticipate arbitrary regex-like searches over millions of S3 files, are there established best practices for workarounds? I've seen some people say that you can store the key names in a relational database, Elasticsearch, or a flat file. Are any of these approaches more common place than others?
Also, out of curiosity, why hasn't S3 supported such a basic use case in a service (S3) that is such an established core product of the overall AWS platform? I've noticed that GCS on Google Cloud has a similar limitation. Is it just really hard to do searches on key name strings well at scale?
S3 is an object store, conceptually similar to a file system. I'd never try to make a database-like environment based on file names in a file system nor would I in S3.
Nevertheless, if this is what you have then I would start by running code to get all of the current file names into a database of some sort. DynamoDB cannot query by regular expression but any of PostgreSQL, MySQL, Aurora, and ElasticSearch can. So start with listing every file and put the file name and S3 location into a database-like structure. Then, create a Lambda that is notified of any changes (see this link for more info) that will do the appropriate thing with your backing store when a file is added or deleted.
Depending on your needs ElasticSearch is super flexible with queries and possibly better suited for these types of queries. But traditional relational database can be made to work too.
Lastly, you'll need an interface to the backing store to query. That will likely require some sort of server. That could be a simple as API gateway to a Lambda or something far more complex.
You might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file containing a list of all objects in the bucket.
You could then load this file into a database, or even write a script to parse it. Or possibly even just play with it in Excel.

Splitting Sensenet content repository into multiple databases

Is there a way of splitting a Content repository into multiple databases? There is a great chance I'll have TBs of data, maybe even tens of TBs of data. Maintaining database bigger than 1 TB becomes an issue, so I can't imagine dealing with a bigger database. I've considered using Filestream, but having multiple databases would be much more viable solution.
If not, is there at least a way of having several repositories contained in a single web site?
Currently (as of version 7.2) sensenet requires a central database to connect to, you cannot split that into multiple parts.
There is the blob storage feature however that lets you store binaries outside of the main metadata database. You choose a blob storage implementation (e.g. the MongoDb blob provider), install it and you can start uploading files to sensenet. Binaries above a certain (configured) size will go to the external provider.
You'll have to take care of the backup of the blob storage though, because that is different for every db provider. At least the size of the metadata db will be significantly lower.

Event Hub, Stream Analytics and Data Lake pipe questions

After reading this article I decided to take a shot on building a pipe of data ingestion. Everything works well. I was able to send data to Event Hub, that is ingested by Stream Analytics and sent to Data Lake. But, I have a few questions regarding some things that seem odd to me. I would appreciate if someone more experienced than me is able to answer.
Here is the SQL inside my Stream Analytics
SELECT
*
INTO
[my-data-lake]
FROM
[my-event-hub]
Now, for the questions:
Should I store 100% of my data in a single file, try to split it in multiple files, or try to achieve one-file-per-object? Stream Analytics is storing all the data inside a single file, as a huge JSON array. I tried setting {date} and {time} as variables, but it is still a huge single file everyday.
Is there a way to enforce Stream Analytics to write every entry from Event Hub on its own file? Or maybe limit the size of the file?
Is there a way to set the name of the file from Stream Analytics? If so, is there a way to override a file if a name already exists?
I also noticed the file is available as soon as it is created, and it is written in real time, in a way I can see data truncation inside it when I download/display the file. Also, before it finishes, it is not a valid JSON. What happens if I query a Data Lake file (through U-SQL) while it is being written? Is it smart enough to ignore the last entry, or understand it as an array of objects that is incomplete?
Is it better to store the JSON data as an array or each object in a new line?
Maybe I am taking a bad approach on my issue, but I have a huge dataset in Google Datastore (NoSQL solution from Google). I only have access to the Datastore, with an account with limited permissions. I need to store this data on a Data Lake. So I made an application that streams the data from Datastore to Event Hub, that is ingested by Stream Analytics, who writes down the files inside the Data Lake. It is my first time using the three technologies, but seems to be the best solution. It is my go-to alternative to ETL chaos.
I am sorry for making so much questions. I hope someone helps me out.
Thanks in advance.
I am only going to answer the file aspect:
It is normally better to produce larger files for later processing than many very small files. Given you are using JSON, I would suggest to limit the files to a size that your JSON extractor will be able to manage without running out of memory (if you decide to use a DOM based parser).
I will leave that to an ASA expert.
ditto.
The answer depends here on how ASA writes the JSON. Clients can append to files and U-SQL should only see the data in a file that has been added in sealed extents. So if ASA makes sure that extents align with the end of a JSON document, you should be only seeing a valid JSON document. If it does not, then you may fail.
That depends on how you plan on processing the data. Note that if you write it as part of an array, you will have to wait until the array is "closed", or your JSON parser will most likely fail. For parallelization and be more "flexible", I would probably get one JSON document per line.

Storage Use Case "Logging + Images + Metadata"

I have the following use case for which I'm trying to find an optimal use of either filsystem, database (rdbms or a flavour of noSql solution). Any advice is welcome, as I want to see what is optimal.
Client application: will generate logs intervals of 1-3 seconds. By logs I mean structured log data (either about connections, applications used, processes used, screenshots, etc..). Some log data will be structured, some will be unstructured (where the schema can change thus).
Storage solution: will need to persist all this data very fast. Will sit on 1-* server(s). It doesn't matter if it's a hybrid solution between filesystem/rdbms/(any suitable flavour of) noSql.
Post processing: the data needs to be queryable ofcourse. E.g. just a key-value store would not suffice, that's a given (maybe for the screenshots only yes).
As a reference, here's a more concrete example:
User runs the client for 2-3 hours (during a "monitoring period"). It sends log data over the wire to the server (storage). Writing speed and data accuracy is vital here.
Management system accumulates the data and makes a report on certain characteristics. All log data should be able to be fetched if needed - but there will be a specific query for a set of users in a given monitoring period. Reading speed is less necessary here, but data accuracy and finding all log parts back eventually is necessary.
If I need to give more information, please let me know.
If you prefer to roll your own rather than use logging packages, I would stick with append only text files. You can certainly encode screenshots in Base64 and keep it in the same file, but I would rather store that separately in the file system with a generated filename stored in the log.
As for reporting, you can obviously read it through a text editor, but if you need a more sophisticated and regular management reporting, you can create an ETL of only the info you report on into a RDBMS. You can always go back and rerun ETL if you decide that you want more info later on.

Database of images and text

background:
I'm in the design phase of building an app.
I want the app to display text and images, the problem is that I will have A LOT of them. hundreds to thousands.
This is my largest app so far, and I am unsure on how to handle all the data.
The question???????:
What would be the best way to store and access these images and text?
Would I use a formal database approach like SQL?
Or would it be better to navigate files/folders e.g. dropping all the files in res/drawable?
potentially useful facts:
The database will be stored and accessed natively so it can be accessed off-line.
The user will not be adding to the database in anyway, only accessing the data.
the database will be updated every 6 months.
The application 'page' will display 1-5 images along with several blocks of text.
Concept:
the app will be like a recipe app...the user will pick some parameters e.g. ingredients, type, diet.. then select a recipe. And then several images and blocks of text will be displayed showing and detailing the process of some recipe.
I apologize if this is repeated but I didn't see a specific answer for my purposes.
The "Best" approach will depend on the functionality of the database server in question.
Generally, you should store the images "In" the database until that becomes a performance issue. Once you start storing images "Outside" of the database you will have to handle all the issue that are normally taken care of by the database. Disk space management, orphan records, file name conflicts, folder file limits, to name just a few. Depending on your situation these may be big issues or thay may be nothing to worry about.
I've seen several application where images (or attachements) were kept "Outside" the database, and in each case it was done poorly. There are just so many issues to handle, and most developers don't even think of half of them. In many cases the performance of storing the images "In" the databse was acceptable, but the developers decided against it because they just knew it would not perform well.
If your using SQL server 2008 the Filestream data type is ideal for your case. It stores the binary files outside of the database but behaves as a normal field. Also you are able to read/write the files using a stream instead of getting/setting the whole file as a byte array (like when using varbin(max))
If you don't have this functionality in your database, I would recommend storing the images outside of the DB
Its probably a better idea to use a file based approach for deployed static resources.
At the very least because taking a dependency on file system is typically easier to manage then taking a dependency on a DB.
Also this line indicates some sort of non-web client
The database will be stored and accessed natively so it can be accessed off-line."
This means if you go with the DB approach you'll have a couple of other interesting problems
Deployment
Depending on the platform deploying a DB can be a real bear depending on your target platform. What happens if they if already have the engine but its a different version.
Resources
Is your DB going to be client/server based (like MySQL/SQL Server etc)? If so then your app has to now manage the current state of its process. If not then you'll be using a file-based db SQL Lite/MS Access, at which point I would question why using a static DB is worth doing at all.
One final note. There's nothing stopping your Content Production environment from using a DB. Its quite common for Content producers to maintain a database for their content that will you will later use to produce the files for publishing/deployment.