MongoDB or SQL for text file? - sql

I have a 25GB's text file with that structure(headers):
Sample Name Allele1 Allele2 Code metaInfo...
So it just one table with a few millions of records. I need to put it to database coz sometimes I need to search that file looking, for example, specific sample. Then I need to get all row and equals to file. This would be a basic application. What is important? File is constant. It no needed put function coz all samples are finished.
My question is:
Which DB will be better in this case and why? Should I put a file in SQL base or maybe MongoDB would be a better idea. I need to learn one of them and I want to pick the best way. Could someone give advice, coz I didn't find in the internet anything particular.

Your question is a bit broad, but assuming your 25GB text file in fact has a regular structure, with each line having the same number (and data type) of columns, then you might want to host this data in a SQL relational database. The reason for choosing SQL over a NoSQL solution is that the former tool is well suited for working with data having a well defined structure. In addition, if you ever need to relate your 25GB table to other tables, SQL has a bunch of tools at its disposal to make that fast, such as indices.

Both MySQL and MongoDB are equally good for your use-case, as you only want read-only operations on a single collection/table.
For comparison refer to MySQL vs MongoDB 1000 reads
But I will suggest going for MongoDB because of its aggeration pipeline. Though your current use case is very much straight forward, in future you may need to go for complex operations. In that case, MongoDB's aggregation pipeline will come very handy.

Related

Databricks: Best practice for creating queries for data transformation?

My apologies in advance for sounding like a newbie. This is really just a curiosity question I have as an outsider observing my team clash with our client. Please ask any questions you have, and I will try my best to answer it.
Currently, we are storing our transformation queries in a DynamoDB table. When needed, we pull into Databricks and execute the query. Simple as that. Our client has called this out as “hard coding” (more on that soon)
Our client has come up with an alternative that involves creating JSON config files containing the transformation rules (all tables/attributes required, target table names, Alias names, join keys, etc. etc.). From here, the SQL query is dynamically created. This approach is still “hard coding” since these config files would need to be manually edited anytime there is a change in the rules.
The way I see this: I think storing the transform rules in JSON is more business user friendly, but that’s about where I see the pros end. It brings in much more complexity to the code and likely will need to be continuously developed to support new queries. Also, I don’t see anyway to prevent “hard coding”. The client business leads seem to think there is some magical tool to convert plain English text to complex SQL queries
I just wanted to get some experts thoughts on this. Which solution is better, or is there another approach that should be taken?

What is the easiest/non-technical friendly way to upload and sort through a large dataset (400GB+)?

Would like to sort the (largely text) data by date tags first (separating the data by quarters, lets say, even if into different files). Then would like to perform standard functions like conditional sums on the data. Without a substantial programming/database background but with a willingness to spend a few days to learn, what's my best bet for a solution?
It sounds like you want to be able to do some queries on your data. I would look into a SQL database solution. The most difficult part would be getting your data into the database.
All of AWS's relational databases can import from text files:
Microsoft's SQL Server: http://msdn.microsoft.com/en-us/library/ms178129.aspx
Oracle: http://docs.oracle.com/cd/B28359_01/text.111/b28304/aload.htm
MySQL: http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html
I would base the decision purely on whichever one is easier to load your file. If you are budget constrained, you can download MySQL and not do this whole cloud thing. Just keep it on your local computer, assuming you have enough disk space to host the database.
After that's done, they all support SQL, which makes it very easy for you to query your data. If you don't want to write your own SQL, there are tools for you to create queries by drag and drop. But being a programmer, I highly recommend writing your own queries, or coming back here for some query help.

Database model refactoring for combining multiple tables of heterogeneous data in SQL Server?

I took over the task of re-developing a database of scientific data which is used by a web interface, where the original author had taken a 'table-per-dataset' approach which didn't scale well and is now fairly difficult to manage with more than 200 tables that have been created. I've spent quite a bit of time trying to figure out how to wrangle the thing, but the datasets contain heterogeneous values, so it is not reasonably possible to combine them into one table with a set schema for column definitions.
I've explored the possibility of EAV, XML columns, and ended up attempting to go with a table with many sparse columns since the database is running on SQL Server 2008. The DBAs are having some issues with my recently created sparse columns causing some havoc with their backup scripts, so I'm left wondering again if there isn't a better way to do this. I know that EAV does not lead to decent performance, and my experiments with XML data types also demonstrated poor performance, probably thanks to the large number of records in some of the tables.
Here's the summary:
Around 200 tables, most of which have a few columns containing floats and small strings
Some tables have as many as 15,000 records
Table schemas are not consistent, as the columns depended on the number of samples in the original experimental data.
SQL Server 2008
I'll be treating most of this data as legacy in the new version I'm developing, but I still need to be able to display it and query it- and I'd rather not have to do so by dynamically specifying the table name in my stored procedures as it would be with the current multi-table approach. Any suggestions?
I would suggest that the first step is looking to rationalise the data through views; attempt to consolidate similar data sets into logical pools through views.
You could then look at refactoring the code to look at the views, and see if the web platform operates effectively. From there you could decided whether or not the view structure is beneficial and if so, look to physically rationalising the data into a new table.
The benefit of using views in this manner is you should be able to squeak a little performance out of indexes on the views, and it should also give you a better handle on the data (that said, since you are dev'ing the new version, it would suggest you are perfectly capable of understanding the problem domain).
With 200 tables as simple raw data sets, and considering you believe your version will be taking over, I would probably go through the prototype exercise of seeing if you can't write the views to be identically named to what your final table names will be in V2, that way you can also backtest if your new database structure is in fact going to work.
Finally, a word to the wise, when someone has built a database in the way you've described, without looking at the data, and really knowing the problem set; they did it for a reason. Either it was bad design, or there was a cause for what now appears on the surface to be bad design; you raise consistency as an issue - look to wrap the data and see how consistent you can make it.
Good luck!

Which would be better? Storing/access data in a local text file, or in a database?

Basically, I'm still working on a puzzle-related website (micro-site really), and I'm making a tool that lets you input a word pattern (e.g. "r??n") and get all the matching words (in this case: rain, rein, ruin, etc.). Should I store the words in local text files (such as words5.txt, which would have a return-delimited list of 5-letter words), or in a database (such as the table Words5, which would again store 5-letter words)?
I'm looking at the problem in terms of data retrieval speeds and CPU server load. I could definitely try it both ways and record the times taken for several runs with both methods, but I'd rather hear it from people who might have had experience with this.
Which method is generally better overall?
The database will give you the best performance with the least amount of work. The built in index support and query analyzers will give you good performance for free while a textfile might give you excellent performance for a ton of work.
In the short term, I'd recommend creating a generic interface which would hide the difference between a database and a flat-file. Later on, you can benchmark which one will provide the best performance but I think the database will give you the best bang per hour of development.
For fast retrieval you certainly want some kind of index. If you don't want to write index code yourself, it's certainly easiest to use a database.
If you are using Java or .NET for your app, consider looking into db4o. It just stores any object as is with a single line of code and there are no setup costs for creating tables.
Storing data in a local text file (when you add new records to end of the file) always faster then storing in database. So, if you create high load application, you can save the data in a text file and copy data to a database later. However in most application you should use a database instead of text file, because database approach has many benefits.

Migrating from MySQL to arbitrary standards-compliant SQL2003 server

Is there an incantation of mysqldump or a similar tool that will produce a piece of SQL2003 code to create and fill the same databases in an arbitrary SQL2003 compliant RDBMS?
(The one I'm trying right now is MonetDB)
DDL statements are inherently database-vendor specific. Although they have the same basic structure, each vendor has their own take on how to define types, indexes, constraints, etc.
DML statements on the other hand are fairly portable. Therefore I suggest:
Dump the database without any data (mysqldump --no-data) to get the schema
Make necessary changes to get the schema loaded on the other DB - these need to be done by hand (but some search/replace may be possible)
Dump the data with extended inserts off and no create table (--extended-insert=0 --no-create-info)
Run the resulting script against the other DB.
This should do what you want.
However, when porting an application to a different database vendor, many other things will be required; moving the schema and data is the easy bit. Checking for bugs introduced, different behaviour and performance testing is the hard bit.
At the very least test every single query in your application for validity on the new database. Ideally do a lot more.
This one is kind of tough. Unless you've got a very simple DB structure with vanilla types (varchar, integer, etc), you're probably going to get the best results writing a migration tool. In a language like Perl (via the DBI), this is pretty straight-forward. The program is basically an echo loop that reads from one database and inserts into the other. There are examples of this sort of code that Google knows about.
Aside from the obvious problem of moving the data is the more subtle problem of how some datatypes are represented. For instance, MS SQL's datetime field is not in the same format as MySQL's. Other datatypes like BLOBs may have a different capacity in one RDBMs than in another. You should make sure that you understand the datatype definitions of the target DB system very well before porting.
The last problem, of course, is getting application-level SQL statements to work against the new system. In my work, that's by far the hardest part. Date math seems especially DB-specific, while annoying things like quoting rules are a constant source of irritation.
Good luck with your project.
From SQL Server 2000 or 2005 you can have it generate scripts for your objects, but I am not sure how well they will transfer to other RDBMS.
The generate script option is probably the easiest way to go. You'll undoubtedly have to do some search/replace on a few data types though.