"Data Repository" software solution - repository

I am trying to find a software solution that will allow our group to easily upload datasets (scriptable and/or through some UI), tag those datasets, retrieve them, control access to them, search the tags, and search file names/attributes/metadata (e.g. file creation date). The datasets can be anything from CSV files, image (binary) datasets, text, and server logs to folders within folders of images and zip files of CSV data. We will need to store GBs to potentially PBs of data, and a single file can range from a few KB to hundreds of GB. We also need a usable API to retrieve these datasets programmatically.
We just want a centralized place for finding information, and we want to be able to answer a question such as "Hey, do you know if we have any lightning strike datasets?" If there is a file/folder/zip file tagged with "lightning", a search for that term should pull back that dataset.
A possible solution would be something like Dataverse, DSpace, Fedora Commons, or CKAN. However, those seem to be really geared towards academia, publications, or small datasets. On top of that, they remove any complex folder structure that might exist (e.g. Folder1-->subFolder1-->subFolder2). I also question the scalability of putting 10 million 100 KB files into one of these systems.
A filesystem share would allow us to simply store whatever we want but I don't know of a reasonable way of enabling tagging of data.
It is almost like I am looking for a combination of the two. Does anyone know of a tool, preferably open source, that would be able to do something like this?

From what you have described so far, DSpace does seem to be a good fit.
With the following examples I want to address the concerns you raised:
Scalability
Here's an example of a multi-terabyte item:
https://ore.exeter.ac.uk/repository/handle/10871/14881
Complex structure
Dryad is based on DSpace and uses a more complex data model, with data files, data packages and the original publication each being represented as separate objects:
http://datadryad.org/resource/doi:10.5061/dryad.322vn
If that's what you want, you can also base your project on the Dryad codebase, since it is open source as well:
https://github.com/datadryad/dryad-repo
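On the "usable API" requirement: whichever platform you pick, tag search is usually a single HTTP call. As a concrete illustration, here is a minimal sketch against the action API of CKAN (one of the candidates you listed), assuming a hypothetical CKAN instance at data.example.org:

    # Minimal sketch: search a CKAN instance for datasets tagged "lightning"
    # and list their downloadable resources. The site URL is hypothetical.
    import requests

    CKAN_URL = "https://data.example.org"
    API = f"{CKAN_URL}/api/3/action/package_search"

    resp = requests.get(API, params={"fq": "tags:lightning", "rows": 50}, timeout=30)
    resp.raise_for_status()
    result = resp.json()["result"]

    print(f"Found {result['count']} dataset(s) tagged 'lightning'")
    for dataset in result["results"]:
        print(dataset["title"])
        for res in dataset.get("resources", []):
            # Each resource carries a direct download URL a script can fetch.
            print("  ", res.get("format"), res.get("url"))

DSpace and Dataverse expose comparable REST search endpoints, so the same pattern applies there as well.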

Azure Data Factory - optimal design for an IoT pipeline

I am working on an Azure Data Factory solution to solve the following scenario:
Data files in CSV format are dumped into Data Lake Gen 2 paths. There are two varieties of files, let's call them TypeA and TypeB, and each is dumped into a path reflecting a grouping of sensors and the date.
For example:
/mycontainer/csv/Group1-20210729-1130/TypeA.csv
/mycontainer/csv/Group1-20210729-1130/TypeB.csv
/mycontainer/csv/Group1-20210729-1138/TypeA.csv
/mycontainer/csv/Group1-20210729-1138/TypeB.csv
I need to extract the data from TypeA files into Delta format in a different location on Data Lake Gen 2 storage. I'll need to do similar processing for TypeB files, but they'll have a different format.
I have successfully put together a "Data Flow" which, given a specific blob path, accomplishes this extraction. But I am struggling to put together a pipeline which applies it for each file that comes in.
My first thought was to do this based on a storage event trigger, whereby each time a CSV file appeared the pipeline would be run to process that one file. I was almost able to accomplish this using a combination of fileName and folderPath parameters and wildcards. I even had a pipeline which will work when triggered manually (meaning I entered a specific fileName and folderPath value manually). However I had two problems which made me question whether this was the correct approach:
a) I wasn't able to get it to work when triggered by real storage events, I suspect because my combination of parameters and wildcards was ending up including the container name twice in the path it was generating. It's hard to check this because the error message you get doesn't tell you what the various values actually resolve to (!).
b) The cluster needed to convert the CSV into Delta/Parquet and put the results into the Data Lake takes several minutes to spin up - not great when working at the file level. (I realize I can mitigate this somewhat - at a cost - by setting a TTL on the cluster.)
So then I abandoned this approach and tried to set up a pipeline which will be triggered periodically, and will pick up all the CSV files matching a particular pattern (e.g. /mycontainer/csv/*/TypeA.csv), process them as a batch, then delete them. At this point I was very surprised to find out that the "Delimited Text" dataset doesn't seem to support wildcards, which is what I was kind of relying on to achieve this in a simple way.
So my questions are:
Am I broadly on the right track with my 'batch of files' approach? Is there a way to define a delimited text data source which reads its data from multiple blobs?
Or do I need a more 'iterative' approach using maybe a 'Foreach' step? I'm really really hoping this isn't the case as it seems an odd pattern to be adopting in 2021.
A much wider question: is ADF a suitable tool for this kind of scenario? I was excited about using it at first, but increasingly it feels like one of those 'exciting to demo but hard to actually use' things which so often pop up in the low/no-code space. Are there popular alternatives which will work nicely with Azure storage?
Any pointers very much appreciated.
I believe you're very much on the right track.
Last week I was able to get wildcard CSVs imported, as long as the wildcard is in the CSV file name. Maybe create an intermediate step to put all the TypeA files in the same folder?
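For that intermediate step, one option is a small script outside ADF that consolidates the files before the pipeline runs. A rough sketch using the azure-storage-blob Python SDK, with the container/path layout from the question and a placeholder connection string:

    # Copy every .../TypeA.csv blob into a single staging folder so one
    # wildcard-free dataset path can pick them all up. Paths follow the
    # question's layout; the connection string is a placeholder.
    from azure.storage.blob import BlobServiceClient

    conn_str = "<storage-account-connection-string>"
    service = BlobServiceClient.from_connection_string(conn_str)
    container = service.get_container_client("mycontainer")

    for blob in container.list_blobs(name_starts_with="csv/"):
        if not blob.name.endswith("/TypeA.csv"):
            continue
        # e.g. csv/Group1-20210729-1130/TypeA.csv -> staging/TypeA/Group1-20210729-1130.csv
        group = blob.name.split("/")[1]
        source = container.get_blob_client(blob.name)
        target = container.get_blob_client(f"staging/TypeA/{group}.csv")
        # Server-side copy; within the same storage account this works as-is,
        # otherwise append a SAS token to the source URL.
        target.start_copy_from_url(source.url)

The same consolidation could be done with a Copy activity inside the pipeline instead; the point is just to land all TypeA files under one folder before the Data Flow runs.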
Concerning ADF - it's a cool technology, with a steep learning curve (and a lot of updates, sometimes including breaking changes), if you're looking to get data ingested without too much coding. Some drawbacks:
Monitoring - if you want to keep it cheap, there's a lot of hacking involved (e.g. sending alert mails via Logic Apps)
Debugging - as you've noticed, debug messages are often cryptic or insufficient
Multiple monthly updates make it feel like a beta, and indeed there are often straightforward tasks that are quite difficult to achieve.
Good luck ;)

Read Excel Files from External Tables

I am tasked with creating a template that will be filled in by business users with employee information; our program will then load it into the database using external tables.
However, our business users constantly change the template by adding, removing or reordering fields.
I am convinced we should use XLSX instead of CSV so that I can lock the column headers, preventing users from removing, adding or reordering columns.
However, when I query the external table it shows non-ASCII characters when reading the XLSX file, because XLSX is a binary format.
How can I do either of the following?
Effectively read Excel files from external tables
Lock the headers of CSV files?
What you have here is a political problem, but you are looking for a technical fix. Not a good fit.
The problem comes in two halves:
Somebody decided it was a good idea to collect user input in a spreadsheet, which it is generally not.
Users are fiddling with the input format, which they should not.
Fixes are:
Strictly enforce the data structure. Reject any CSV which doesn't match and make the users edit it. They will quickly tire of tweaking the spreadsheets when they realise they're just creating more work for themselves. But they will also get resentful, so consider ...
Building a data input screen. It's pretty simple to knock up a spreadsheet-like grid UI, and you don't need anything complicated in Java: Oracle APEX is intended for exactly this sort of thing.
However, if you are stuck with Excel as a UI, I suggest you have a look at Anton Scheffer's excellent PL/SQL as_read_xlsx package on the AMIS site. You'll probably need to replace your external table with a view over a (perhaps pipelined) table function.
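And if you do end up enforcing a CSV template as suggested above, the check is easy to script outside the database before the file ever reaches the external table. A minimal sketch (the expected column names here are made up):

    # Reject any incoming CSV whose header row doesn't match the agreed
    # template exactly (names, order and count). Column names are hypothetical.
    import csv
    import sys

    EXPECTED_HEADER = ["EMPLOYEE_ID", "FIRST_NAME", "LAST_NAME", "HIRE_DATE", "DEPARTMENT"]

    def validate_csv(path):
        with open(path, newline="", encoding="utf-8-sig") as f:   # utf-8-sig eats Excel's BOM
            header = next(csv.reader(f), [])
        if header != EXPECTED_HEADER:
            raise ValueError(f"{path}: header {header} does not match {EXPECTED_HEADER}")

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            try:
                validate_csv(path)
                print(f"{path}: OK - safe to load via the external table")
            except ValueError as err:
                print(f"REJECTED: {err}")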

Using a SQL database file as project files

I am wondering if it makes sense to use multiple SQL database files, such as SQLite (which I believe is single-file based?), as project files in my software. The project files contain basic information as well as multiple records (spectra) with lists of parameters (floating point values) and lists of measurement data (also floating point).
I currently use my own binary format, which is a pain to maintain. I tried to use XML which works very well, but the file sizes explode (500 kB before, 7.5 MB as XML).
Now I wonder if I can structure SQL databases to contain this kind of information and effectively load and save this data in my .NET software.
I am not very experienced in SQL, so:
Can SQL tables contain sub-tables (like subnodes in XML) or be linked to other tables?
E.g. can I make a table for a record, and have that table reference sub-tables for the lists of measurement data and parameters?
Will this be more efficient than XML in terms of storage space?
I went with an SQLite database. It can easily be used from .NET via the System.Data.SQLite project, which even works with AnyCPU builds.
It is working very nicely, in terms of both performance and storage space.
You still need to take a lot of care with different versions of your databases. If you try to save data using a new schema into a database file that was created with an older schema, some columns or tables might not exist. You need to implement a method that migrates older files to the new schema.
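For reference, a minimal sketch of one way to handle that versioning, shown in Python for brevity (System.Data.SQLite can run the same PRAGMA statements); the migration DDL here is purely illustrative:

    # Store the schema version in the file via PRAGMA user_version and
    # apply any missing migrations, in order, when a project file is opened.
    import sqlite3

    MIGRATIONS = {
        # version being upgraded *to* -> DDL to run (illustrative only)
        1: "CREATE TABLE record (id INTEGER PRIMARY KEY, name TEXT)",
        2: "ALTER TABLE record ADD COLUMN created_utc TEXT",
    }

    def open_project(path):
        conn = sqlite3.connect(path)
        current = conn.execute("PRAGMA user_version").fetchone()[0]
        for version in sorted(v for v in MIGRATIONS if v > current):
            conn.execute(MIGRATIONS[version])
            conn.execute(f"PRAGMA user_version = {version}")
            conn.commit()
        return conn

    conn = open_project("project.spectra")   # hypothetical project-file name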
The real advantage is that it is an open format. I stand behind the premise that the data a user saves is theirs, and it does not need to be hidden away in an obscure file structure unless that structure brings significant advantages to the table.
If the user can no longer use your software, he or she can still access all the data using other tools, such as DB Browser for SQLite, if need be.
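To answer the original "sub-tables" question concretely: SQL has no nested tables, but child tables with foreign keys model the same XML-like nesting. A minimal sketch with hypothetical table and column names:

    # One parent table per record (spectrum), plus child tables for its
    # parameter list and measurement list, linked by foreign keys.
    import sqlite3

    conn = sqlite3.connect("project.spectra")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS record (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE IF NOT EXISTS parameter (          -- "sub-table" of record
        record_id INTEGER NOT NULL REFERENCES record(id) ON DELETE CASCADE,
        name      TEXT NOT NULL,
        value     REAL NOT NULL
    );
    CREATE TABLE IF NOT EXISTS measurement (        -- "sub-table" of record
        record_id INTEGER NOT NULL REFERENCES record(id) ON DELETE CASCADE,
        idx       INTEGER NOT NULL,                 -- position in the list
        value     REAL NOT NULL
    );
    """)

    # Store one spectrum with its parameters and data points.
    rid = conn.execute("INSERT INTO record(name) VALUES (?)", ("spectrum-001",)).lastrowid
    conn.executemany("INSERT INTO parameter(record_id, name, value) VALUES (?,?,?)",
                     [(rid, "gain", 1.5), (rid, "exposure_s", 0.02)])
    conn.executemany("INSERT INTO measurement(record_id, idx, value) VALUES (?,?,?)",
                     [(rid, i, v) for i, v in enumerate([0.1, 0.4, 0.9])])
    conn.commit()

    # Reading it back: a join reassembles the "nested" structure.
    rows = conn.execute("""SELECT m.idx, m.value FROM measurement m
                           JOIN record r ON r.id = m.record_id
                           WHERE r.name = ? ORDER BY m.idx""", ("spectrum-001",)).fetchall()
    print(rows)

Numeric REAL columns (or BLOBs for whole arrays) also tend to keep the file much smaller than the equivalent XML, which matches the storage-space result reported above.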

Migrating RMS to RDB

We're approaching the migration of legacy OpenVMS RMS files into a relational database (both MS SQL 2012 and Oracle 10g are available).
I wonder if there are:
Tools to retrieve schema of indexed files
Tools to parse indexed files
Tools to deal with custom RMS data formats (zoned decimals etc)
as a bundle/API/Library
Perhaps I should change the approach?
There are several tools available, notably through ODBC vendors (I work for one: Attunity).
1 >> Tools to retrieve schema of indexed files
Please clarify: are you looking for just the record/column layout and indexes within the files, or also for relationships between files?
1a) How are the files currently being used? Cobol, Basic, Fortran programs? Datatrieve?
They will be using some data definition method, so you want a tool which can exploit that.
Connx and Attunity Connect can 'import' CDD definitions, BASIC MAP files, and COBOL copybooks. Variants are typically covered as well. I have written many a Perl/awk script to convert a special definition to XML.
1b) Analyze/RMS, or a program calling the RMS XABs, can get the available index information. Attunity Connect will know how to map those onto the fields from 1a).
1c) There is no formal, stored relationship between (indexed) files on OpenVMS. That's all in the program logic. However, a modestly smart Perl/awk/DCL script can often generate a table of likely foreign/primary keys by looking for field name and datatype matches.
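As a rough illustration of the 1c) heuristic, with purely hypothetical layouts: flag any field whose name and datatype match another file's key field.

    # Hypothetical field layouts, as they might come out of step 1a).
    layouts = {
        "CUSTOMER":  {"key": "CUST_ID",  "fields": {"CUST_ID": "INT", "NAME": "CHAR(40)"}},
        "ORDER":     {"key": "ORDER_ID", "fields": {"ORDER_ID": "INT", "CUST_ID": "INT",
                                                    "ORDER_DATE": "DATE"}},
        "ORDERLINE": {"key": "LINE_ID",  "fields": {"LINE_ID": "INT", "ORDER_ID": "INT",
                                                    "QTY": "INT"}},
    }

    for fname, layout in layouts.items():
        for field, ftype in layout["fields"].items():
            for other, olayout in layouts.items():
                if other == fname:
                    continue
                # Same name and datatype as another file's key => likely foreign key.
                if field == olayout["key"] and ftype == olayout["fields"][field]:
                    print(f"{fname}.{field} looks like a foreign key to {other}.{field}")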
How many files / layouts / gigabytes are we talking about?
2 >> Tools to parse indexed files
Please clarify? Once the structure is known (question 1), the parsing is done by reading using that structure right? You never ever want to understand the indexed file internals. Just tell RMS to fetch records.
3 >> Tools to deal with custom RMS data formats (zoned decimals etc) as a bundle/API/Library
Again, please clarify. Once the structure is known just use the 'right' tool to read using that structure and surely it will honor the detailed data definitions.
OP: I know it is quite simple to write one yourself; I just thought there would be something in the industry.
Famous last words... 'quite simple'. Entire companies have been built, and thrive, doing just that for the general case. I admit that for specific cases it can be relatively straightforward, but 'the devil is in the details'.
In the Attunity Connect case we have a UDT (User Defined data Type) to handle the 'odd' cases, often involving DATES. Dates in integers, in strings, as units since xxx are all available out of the box, but for example some have -1 meaning 'some high date' which needs some help to be stored in a DB.
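To show why even the 'simple' cases need care, here is a minimal sketch of a decoder for one common RMS/COBOL field type - EBCDIC zoned decimal with a trailing sign nibble. Real data has many more variants (ASCII overpunch, unsigned fields, different scales), which is exactly where the details bite:

    # Zoned decimal, EBCDIC variant: each byte's low nibble is a digit and the
    # last byte's high nibble carries the sign (0xC/0xF positive, 0xD negative).
    from decimal import Decimal

    def decode_zoned(raw: bytes, scale: int = 0) -> Decimal:
        digits = "".join(str(b & 0x0F) for b in raw)    # low nibble = digit
        sign = -1 if (raw[-1] >> 4) == 0x0D else 1      # sign from last zone nibble
        return Decimal(sign * int(digits)) / (10 ** scale)

    # PIC S9(5)V99: -12345.67 stored as F1 F2 F3 F4 F5 F6 D7
    print(decode_zoned(bytes.fromhex("F1F2F3F4F5F6D7"), scale=2))   # -12345.67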
All the databases have some bulk load tool (BCP, SQL*Loader).
As long as you can deliver data conforming to what those expect (tabular, comma-separated, quoted-or-not, escaped-or-not) you should be in good shape.
The EGH tool Vselect may be a handy, high-performance way to bulk read indexed files, filter and format them a little, and spit out sequential files for the DB loaders. It can read RMS indexed files faster than RMS can! (It has its own metadata language though!)
Attunity offers full access and replication services.
They include CDC (change data capture) to not only load the data, but also keep it up to date in near real time. That's useful for 'evolution' versus 'revolution'.
Check out Attunity 'Replicate'. Once you have a data dictionary, just point to the tables desired (include/exclude filters), point to a target DB, and click to replicate. Of course there are options for (global or per-table) transformations (like combining AREA-CODE+EXCHANGE+NUMBER into a single phone number, or adding a modified-date column).
Will this be a single big switch conversion, or is there desire to migrate the data and keep the old systems alive for days, months, years perhaps, all along keeping the data in close sync?
Hope this helps some,
Hein van den Heuvel.
OP: Perhaps I should change the approach? Probably.
You might consider finding data migration vendors, some of which likely have off-the-shelf solutions - if not as a COTS tool, then more likely packaged as a service (I don't think this is a big market).
What this won't help you with is what I think of as the much bigger problem: the application code. Who is going to change all the code that makes RMS calls into the corresponding code that makes relational DB calls? How will that entity ("Joe Programmer", or some tool) know where the data migrated to, so that the correct calls can be written? And what are you going to do about the fact that the data representation is likely to change?
Ideally you'd like an automated migration tool that moves the data itself (and therefore knows the data layouts and representation changes) and makes the corresponding code changes. You can look for those kinds of vendors, too.

JSON vs classic schema design [duplicate]

The Project
I've been asked to work on an interesting project -- what amounts to a basic Web CMS -- that uses HTML/CSS/jQuery with PHP. However, one requirement is that there won't be a database to house the data (they want flat files for the documents/pages -- preferably in JSON format).
In a very basic sense, it'll be used to generate HTML pages via a very "non-techie" interface. Each installation would only have around 20 pages, but a few may get up to 100. It has to be fairly easy to drop onto a PHP capable server and run, with very little setup needed.
What's Out There
There are tons of CMS options and quite a few flat-file versions. But an OSS or other existing CMS is not an option. They need a simple proprietary system.
Initial Thoughts
So flat files it is... but I'd really like to get some feedback on the drawbacks, and if it is worth the effort to try and convince them to use something like MySQL (SQLite or CouchDB are out since none of the servers can be configured to run them at the present time).
Of course the document files are pretty straightforward, but we're also talking about login info for 1 or 2 admins per installation, a few lists, as well as configs/settings (which also can easily be stored in a file with protection).
The Dilemma
If there are benefits to using MySQL rather than JSON-formatted files and some arrays in a simple project like this -- beyond my own pre-conceived notions :) -- I'll be sure to argue them.
But honestly I can't see any that outweigh their need to not have a database system.
I'd appreciate your insight and opinions.
If you can't cite a specific need for relational table design, then you're good with flat files. Build as specified. The moment you can cite a specific need, let them know; upgrading isn't that hard if your perception is timely (that is, if you aren't in the position of having to normalize data that should have been integrated earlier).
It's a shame you can't use CouchDB, this seems like the perfect application for it. Keep in mind that using flat-files severely constrains your architecture and, especially, scalability.
What's the best case scenario for your CMS app? It's successful and people want to use it more? If you're using flat files it'll be harder to service and improve your system (e.g. make it more robust, and add new features for future versions), and performance will not scale well. So "success" in this case is at best short-lived, as success translates into more and more work for less and less gain in feature set and performance.
Then again, if the CMS is designed right, then switching from flat files to an RDBMS should be as simple as swapping in a different data access layer.
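To make that last point concrete, here is a language-agnostic sketch of such a data access layer, shown in Python for brevity (the project itself is PHP, but the shape is identical); all class and method names are hypothetical, and the SQLite store just stands in for whatever database backend you might switch to later:

    # Hide storage behind one small interface so the flat-file backend can
    # later be swapped for a database without touching the rest of the CMS.
    import json
    import os
    import sqlite3

    class JsonPageStore:
        """One JSON file per page under a content directory."""
        def __init__(self, root):
            self.root = root
            os.makedirs(self.root, exist_ok=True)
        def load(self, slug):
            with open(os.path.join(self.root, f"{slug}.json"), encoding="utf-8") as f:
                return json.load(f)
        def save(self, slug, page):
            with open(os.path.join(self.root, f"{slug}.json"), "w", encoding="utf-8") as f:
                json.dump(page, f, indent=2)

    class SqlitePageStore:
        """Drop-in replacement: same interface, different storage."""
        def __init__(self, path):
            self.conn = sqlite3.connect(path)
            self.conn.execute("CREATE TABLE IF NOT EXISTS page (slug TEXT PRIMARY KEY, body TEXT)")
        def load(self, slug):
            row = self.conn.execute("SELECT body FROM page WHERE slug = ?", (slug,)).fetchone()
            if row is None:
                raise KeyError(slug)
            return json.loads(row[0])
        def save(self, slug, page):
            self.conn.execute("INSERT OR REPLACE INTO page (slug, body) VALUES (?, ?)",
                              (slug, json.dumps(page)))
            self.conn.commit()

    # The rest of the CMS only ever calls store.load() / store.save():
    store = JsonPageStore("content")          # later: SqlitePageStore("cms.db")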
Will this be installed on any shared hosting sites? For this to work somewhat safely, a mechanism like suEXEC needs to be set up properly, as the web server will need write permissions to various directories.
What would be cool with a simple site fed via JSON and jQuery is that the site wouldn't need to reload on each click - just the relevant data would change. You could then use hashes in the location bar to keep track of where you were (e.g. http://localhost/#about).
The problem is that if they are editing the raw JSON files they can mess them up pretty quickly. I think your admin tools would have to generate the JSON files based on the input so that you can ensure nothing breaks. The admin tools would be more involved than the site itself (though isn't that always the case with dynamic sites?).
What are the predicted data sizes for the CMS?
A large reason for using an RDBMS is quick, specific access to large amounts of data. The individual records might not be large, but if there is a lot of data, then an RDBMS might be better in the long run.
While an RDBMS may be necessary for a very large CMS, a small one could run off flat files very well. A lot of CMS products out there fall down in that regard, I think, by throwing an RDBMS into the mix when there's no real need.
However, if you are using flat files, there are security issues which others have highlighted. Another issue I've come across is hosting providers using the disable_functions directive in php.ini to disable file I/O functions like fopen() and friends. If you're hosting your CMS on a box you control, you won't have this problem but if you're using a third-party provider, check first.
As the original poster, I wasn't signed in, so I'm following up to the answers so far in an answer (sorry if this is bad form).
There may be instances where this is on a shared host.
Though the JSON files can technically be edited, this won't be the case. The admin interface will be robust enough to do all of the creating/editing of pages.
The size for each install will be relatively small -- 1-2 admins, 10-100 pages. A few lists of common items may run longer (snippets of copy, for example).
Security will be a big issue -- any other options/suggestions on this specifically?
Well, isn't the real problem that they are distrustful of any database system? Isn't the problem more in their thinking than in the technology? Maybe they are afraid of a database because it sounds complex to them. In that case, if you just present them with a very simple CMS (like CMS Made Simple, which I've heard really is simple and quick to learn), and they see that everything is easy, then maybe they just won't care what's behind it, whether it's a database or whatever!
They might listen to arguments like easier maintenance, lower maintenance costs, and a much better handover to another webmaster than with a proprietary solution (they are not dependent on you), etc.