What is the easiest/non-technical friendly way to upload and sort through a large dataset (400GB+)? - sql

Would like to sort the (largely text) data by date tags first (separating the data by quarters, lets say, even if into different files). Then would like to perform standard functions like conditional sums on the data. Without a substantial programming/database background but with a willingness to spend a few days to learn, what's my best bet for a solution?

It sounds like you want to be able to do some queries on your data. I would look into a SQL database solution. The most difficult part would be getting your data into the database.
All of AWS's relational databases can import from text files:
Microsoft's SQL Server: http://msdn.microsoft.com/en-us/library/ms178129.aspx
Oracle: http://docs.oracle.com/cd/B28359_01/text.111/b28304/aload.htm
MySQL: http://dev.mysql.com/doc/refman/5.0/en/mysqlimport.html
I would base the decision purely on whichever one is easier to load your file. If you are budget constrained, you can download MySQL and not do this whole cloud thing. Just keep it on your local computer, assuming you have enough disk space to host the database.
After that's done, they all support SQL, which makes it very easy for you to query your data. If you don't want to write your own SQL, there are tools for you to create queries by drag and drop. But being a programmer, I highly recommend writing your own queries, or coming back here for some query help.

Related

MongoDB or SQL for text file?

I have a 25GB's text file with that structure(headers):
Sample Name Allele1 Allele2 Code metaInfo...
So it just one table with a few millions of records. I need to put it to database coz sometimes I need to search that file looking, for example, specific sample. Then I need to get all row and equals to file. This would be a basic application. What is important? File is constant. It no needed put function coz all samples are finished.
My question is:
Which DB will be better in this case and why? Should I put a file in SQL base or maybe MongoDB would be a better idea. I need to learn one of them and I want to pick the best way. Could someone give advice, coz I didn't find in the internet anything particular.
Your question is a bit broad, but assuming your 25GB text file in fact has a regular structure, with each line having the same number (and data type) of columns, then you might want to host this data in a SQL relational database. The reason for choosing SQL over a NoSQL solution is that the former tool is well suited for working with data having a well defined structure. In addition, if you ever need to relate your 25GB table to other tables, SQL has a bunch of tools at its disposal to make that fast, such as indices.
Both MySQL and MongoDB are equally good for your use-case, as you only want read-only operations on a single collection/table.
For comparison refer to MySQL vs MongoDB 1000 reads
But I will suggest going for MongoDB because of its aggeration pipeline. Though your current use case is very much straight forward, in future you may need to go for complex operations. In that case, MongoDB's aggregation pipeline will come very handy.

Transferring data into a reporting database

We have a relational database with data. Currently the queries take to long for reporting, so we would like to use that data in a reporting database.
I want to analyze the queries that we use, run the queries at night and put the un normalized into tables that are read only. I do not believe that there are enough queries to necessitate a data mart.
My boss wants to write everything into xml packages and then whatever database we design, we will read in the xml files.
Like every person on earth, I think that my way is right. But, I have been wrong before. Can I get some advice and the pluses and minuses of each method?
This question is mostly opnion based, but reading from XML is pretty slow and very memory intensive, so i believe that creating a reporting database would be better, of course there are donwsides like that the unormalized database will ocuppy more space, but i think that with a good design it may work fairly well.

Which would be better? Storing/access data in a local text file, or in a database?

Basically, I'm still working on a puzzle-related website (micro-site really), and I'm making a tool that lets you input a word pattern (e.g. "r??n") and get all the matching words (in this case: rain, rein, ruin, etc.). Should I store the words in local text files (such as words5.txt, which would have a return-delimited list of 5-letter words), or in a database (such as the table Words5, which would again store 5-letter words)?
I'm looking at the problem in terms of data retrieval speeds and CPU server load. I could definitely try it both ways and record the times taken for several runs with both methods, but I'd rather hear it from people who might have had experience with this.
Which method is generally better overall?
The database will give you the best performance with the least amount of work. The built in index support and query analyzers will give you good performance for free while a textfile might give you excellent performance for a ton of work.
In the short term, I'd recommend creating a generic interface which would hide the difference between a database and a flat-file. Later on, you can benchmark which one will provide the best performance but I think the database will give you the best bang per hour of development.
For fast retrieval you certainly want some kind of index. If you don't want to write index code yourself, it's certainly easiest to use a database.
If you are using Java or .NET for your app, consider looking into db4o. It just stores any object as is with a single line of code and there are no setup costs for creating tables.
Storing data in a local text file (when you add new records to end of the file) always faster then storing in database. So, if you create high load application, you can save the data in a text file and copy data to a database later. However in most application you should use a database instead of text file, because database approach has many benefits.

querying larg text file containing JSON objects

I have few Gigabytes text file in format:
{"user_ip":"x.x.x.x", "action_type":"xxx", "action_data":{"some_key":"some_value"...},...}
each entry is one line.
First I would like to easily find entries for given ip. This part is easy because I can use grep for example. However even for this I would like to find better solution because I would like to get response as fast as possible.
Next part is more complicated because I would like to find entries from selected ip and of selected type and with particular value of some_key in action_data.
Probably I would have to convert this file to SQL db (probably SQLite, because it will be desktop APP), but I would ask if there are exists better solutions?
You could take a look at MongoDB, a document based database. With it you essentially store JSON objects that you can then index and easily query in an efficient way. You can find about how to query in the docs: Querying.
Yes, put it into a database, any database. Then querying it will be straightforward.
Just wanted to mention that Oracle Berkeley DB 11gR2 (released on April 1st, 2010) introduces support for a SQL API. In fact, the SQL API is the sqlite3() API. So, as Jason mentioned, if you'd like the ease-of-use of SQLite, combined with the scalability and concurrency of Berkeley DB, you can now get both things in a single library.
Regards,
Dave
If you need the relational guarantees of an SQL-based DB, definitely go ahead with SQLite. It will allow for fast queries, joins, aggregations, sorts, and overall any sort of search you could possibly dream up. It sounds like this is just a big list of Actions performed by users at some IP, so you'll probably want to use some sort of sequence as your primary key since none of the other attributes look like good candidates.
On the other hand, if you just need to do very simple queries, e.g. look up entries by IP, look up entries by action type, etc., you might want to look into Oracle Berkeley DB. As long as you don't need any searches that are too fancy, Berkeley DB will let you store Terabytes of data and access them at record speed.
So look over both and see what's best for your use case. They're good for different things, which might be why both are available as storage systems on Android, for instance. I think SQLite will probably win out, but when thinking about embedded local DB systems you should always at least consider both of these technologies.

How to efficiently archive older parts of a big (multi-GB) SQL Server database?

Right now I am working on a solution to archive older data from a big working database to a separate archive database with the same schema. I move the data using SQL scripts and SQL Server Management Objects (SMO) from a .Net executable written in C#.
The archived data should still be accessible and even (occassionally) changeable, we just want it out of the way to keep the working database lean and fast.
Hurling large portions of data around and managing the relations between tables has proven to be quite a challenge.
I wonder if there is a better way to archive data with SQL Server.
Any ideas?
I think if you still want/need the data to be accessible, then partitioning some of your biggest or most-used tables could be an option.
Yep, use table and index partitioning with filegroups.
You don't even have to change the select statements, only if you want to get the last bit of speed out of the result.
Another option can be the workload balancing with two servers and two way replication between them.
We are in a similar situation. For regulatory reasons we cannot delete data for a set period of time, but many of our tables grow very large and unwieldy and realistically much of the data that is older than a month can be removed with few day-to-day problems.
We currently programatically prune the tables, using a custom .NET/shell combination app using BCP to backup files which can be zipped and left on an out of the way network share. This isn't particularly accessible, but it is more space efficient. (It is complicated by our needing to keep certain historic dates, rather than being able to truncate at a certain size or with key fields in certain ranges.)
We are looking into alternatives but, I think surprisingly, there's not much in the way of best-practice on this discussion!