Database Choice for a CSV file

I have a CSV file.
It has 5 columns, 4000 rows.
The database will start with a single table, and each year I will add a new table to the database.
The tables themselves will never be updated; they will only be created once.
I expect many concurrent reads and queries at the same time.
There won't be any complex queries. Queries will basically be filtering on only one column.
The users will sort on one column.
Based on this, my gut feeling tells me that I should use a SQL solution, like MySQL or PostgreSQL. I would like to hear your thoughts: should I use SQL, NoSQL, or something else (Redis, maybe)?

In my opinion, MySQL, provided you have enough DB storage.
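For a dataset of this size either engine will do. As a minimal sketch (the table and column names below are invented, since the question does not give the schema), the whole setup amounts to one table per year, an index on the filter column, and an index on the sort column:

-- Hypothetical layout for the 5-column, 4000-row CSV; adjust names and types to your data.
CREATE TABLE records_2016 (
    id          INT PRIMARY KEY,
    category    VARCHAR(50),   -- the column users filter on (assumed)
    score       INT,           -- the column users sort on (assumed)
    name        VARCHAR(100),
    created_at  DATE
);

-- One index per access pattern keeps the concurrent read queries cheap.
CREATE INDEX idx_records_2016_category ON records_2016 (category);
CREATE INDEX idx_records_2016_score    ON records_2016 (score);

-- Typical query: filter on one column, sort on another.
SELECT *
FROM records_2016
WHERE category = 'some value'
ORDER BY score;

With indexes on the two columns that are actually queried, concurrent read-only queries against 4,000 rows will not trouble either MySQL or PostgreSQL.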

Related

SQL - multiple tables vs one big table

I want to move multiple SQLite files to PostgreSQL.
Data contained in these files are monthly time-series (one month in a single *.sqlite file). Each has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it to one huge table with new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
I would definitely go for one table - just make sure you use sensible indexes.
If you have the space and the resources, go for one table. As other users have pointed out, databases can handle millions of rows with no problem - although that does depend on the data in them. Row size can make a big difference, for example when you store VARCHAR(MAX) or VARBINARY(MAX) columns, and several of them per row.
There is no doubt that writing queries and ETL (extract, transform, load) is significantly easier on a single table, and maintenance is easier too from an archival perspective.
But if you never access the old data and you need the performance in the primary table, some sort of archive might make sense.
There are some BI related reasons to maintain multiple tables but it doesn't sound like that is your issue here.
There is no perfect answer; it will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b), but...
with new column describing the time period (e.g. 04.2016, 05.2016, ...)
Please don't. Querying the different periods will become a pain, an unnecessary one. Just put the date in one column, put an index on that column, and you will probably be able to execute fast queries on it.
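As a rough sketch of that layout (table and column names are assumptions, not taken from the question), the merged table with a real date column keeps a per-month query trivial and index-friendly:

-- Merged table: all monthly SQLite files in one place, with a proper DATE column
-- instead of a '04.2016'-style text period. Names are placeholders.
CREATE TABLE measurements (
    measured_on date    NOT NULL,
    series_id   integer NOT NULL,
    value       numeric
);

CREATE INDEX idx_measurements_measured_on ON measurements (measured_on);

-- Pulling out one month is a simple range scan on the index:
SELECT *
FROM measurements
WHERE measured_on >= DATE '2016-04-01'
  AND measured_on <  DATE '2016-05-01';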
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Complicated for you to write or for the database to execute? An example would help us get a picture of your actual requirements.

How to best query across both Oracle and SQL Server databases for large tables?

I have a stored procedure in SQL Server that also queries tables in the same database and in a different Oracle database. This is for a data warehouse project that joins several large tables across databases and queries them.
Is it better to copy the table (with ~3 million records) to the same database and then query it, or is the slowdown from the table being in a different database not significant? The query is complicated and can take hours.
I'm not necessarily looking for a specific answer, informed opinion and/or specific further reading are also very appreciated. Thanks!
I always prefer a staging layer, or what some call an integration layer.
In your case (judging blind), the best solution is perhaps to:
1. Copy the table once
2. Create a sync step (Insert/Update) based on the primary key(s)
3. Schedule step 2
4. Run your query
If there is some logical data-integrity rule, you can implement step 2 with simple SQL based on timestamps.
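A hedged sketch of what step 2 could look like in T-SQL, assuming the Oracle data lands in a local staging table first and id is the primary key (all object and column names here are placeholders):

-- Incremental sync of the local reporting copy from the staging table.
-- dbo.stage_orders is assumed to be refreshed from Oracle; dbo.orders is the local copy.
MERGE dbo.orders AS target
USING dbo.stage_orders AS source
    ON target.id = source.id
WHEN MATCHED THEN
    UPDATE SET target.amount     = source.amount,
               target.updated_at = source.updated_at
WHEN NOT MATCHED BY TARGET THEN
    INSERT (id, amount, updated_at)
    VALUES (source.id, source.amount, source.updated_at);

Step 3 is then just a SQL Server Agent job that runs this on whatever schedule matches how fresh the warehouse needs to be.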

Which database structure is efficient? One table with 10,000 records or 1000 tables with 10 records?

We at college are making an application to generate PDF documents from Excel sheet records using Java SE. I have thought about two approaches to designing the database. In one approach, there will be one table that will contain a lot of records (50K every year). In the other approach, a lot of tables (1,000 every year) will be created at runtime and each table will contain at most 50 records.
Which approach is more efficient in terms of overall time performance?
Multiple tables of identical structure almost never make sense.
Databases are designed to have many records in few tables.
50K records is not "a lot" of records. You don't specify what database you will be using, but most commercial-grade databases can handle many, many millions of records in a table.
This is assuming you have proper indexes, etc. If you have to keep creating tables for your application, then there is something wrong with your design, and you need to rethink it.
When building a relational database the basic rule would be to avoid redundancy.
Look over your data and try to separate things that tend to repeat. If you notice a column or a group of columns that repeat across multiple entries create a new table for them. This way you will achieve the best performance when querying.
Otherwise, if the values are unique across the entries just keep the minimum number of tables.
You should just look up some design rules for relational databases; you will find some examples as well.
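As a small illustration of that rule (the table and column names are invented for the example), a value that repeats across many entries gets its own table and is referenced by a key:

-- Before normalization the department name would repeat on every record;
-- here it is stored once and referenced by a foreign key.
CREATE TABLE department (
    department_id   INT PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL
);

CREATE TABLE record (
    record_id     INT PRIMARY KEY,
    department_id INT NOT NULL,
    amount        DECIMAL(10, 2),
    FOREIGN KEY (department_id) REFERENCES department (department_id)
);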
50k records is not much for a database. If it's all the same type of data (same structure), it belongs in the same table. Only if size and speed becomes an issue you should consider splitting up the data over multiple tables (or more likely: different servers).

Store Many Rows In Sql Server Issue?

I'm working on a program that works with SQL Server.
To store data in a database table, which of the approaches below is correct?
Store many rows in just one table (10 million records)
Store fewer rows in several tables (500,000 records each), e.g. create one table for each year
It depends on how often you access the data. If you are not using the old records, then you can archive those records. Splitting up tables is not desirable, as it may confuse you while fetching data.
I would say store all the data in a single table, but implement table partitioning on the older data. Partitioning the data will increase query performance.
Here are some references:
http://www.mssqltips.com/sqlservertip/1914/sql-server-database-partitioning-myths-and-truths/
http://msdn.microsoft.com/en-us/library/ms188730.aspx
http://blog.sqlauthority.com/2008/01/25/sql-server-2005-database-table-partitioning-tutorial-how-to-horizontal-partition-database-table/
Please note that this table partitioning functionality is only available in Enterprise Edition.
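As a rough sketch of what yearly partitioning looks like in T-SQL (the names, column types and boundary dates below are made up; see the linked references for the details):

-- Partition function: splits rows into yearly ranges by a date column.
CREATE PARTITION FUNCTION pf_by_year (datetime)
AS RANGE RIGHT FOR VALUES ('2012-01-01', '2013-01-01', '2014-01-01');

-- Partition scheme: maps each partition to a filegroup (all to PRIMARY here for simplicity).
CREATE PARTITION SCHEME ps_by_year
AS PARTITION pf_by_year ALL TO ([PRIMARY]);

-- The table is created on the scheme, partitioned by its date column.
CREATE TABLE dbo.Records (
    RecordId   INT IDENTITY(1,1) NOT NULL,
    RecordDate DATETIME NOT NULL,
    Amount     MONEY NULL
) ON ps_by_year (RecordDate);

Queries that filter on RecordDate can then touch only the relevant partition instead of the whole 10 million rows.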
Well, it depends!
What are you going to do with the data? If you are querying this data a lot of times, it could be a better solution to split the data into (for example) year tables. That way you would get better performance, since you have to query smaller tables.
But on the other side, with a bigger table and with good queries you might not even see a performance issue. If you only need to store this data, it would be better to just use one table.
BTW, for loading this data into the database you could use BCP (bulk copy), which is a fast way of inserting a lot of rows.
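For example, the T-SQL counterpart of the bcp utility is BULK INSERT; a sketch for loading one yearly file might look like this (path, table name and delimiters are assumptions):

-- Bulk-load a CSV extract into the target table; adjust path, terminators and table name.
BULK INSERT dbo.Records
FROM 'C:\data\records_2014.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    FIRSTROW        = 2,   -- skip the header row
    TABLOCK                -- allows minimal logging under the right recovery model
);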

Querying large dataset for statistics in SQL Server?

Say I have a sample for which 5 million data objects are stored as rows in SQL Server. If I need to run some stats on the data, would it be better to have a table for each sample, or one giant table, where I would select by sample id and then run the stats?
There may eventually be hundreds or even thousands of samples- which seems like one massive table.
But I'm not a SQL Server expert so I can't say whether one would be faster than the other...
Or maybe there is a better way to deal with such a large data set? I was hoping to use SQL CLR with C# to do my heavy lifting...
If you need to deal with such a large dataset, my gut feeling tells me T-SQL and working in sets will be significantly faster than anything you can do in SQL CLR with a RBAR (row-by-agonizing-row) approach... dealing with large sets of data, summing up and selecting, is what T-SQL has always been made for and what it's good at.
5 million rows isn't really an awful lot of data - it's a nicely sized dataset. And if you have the proper indexes in place, e.g. on the columns you use in your JOIN conditions, your WHERE clause and your ORDER BY clause, you should be just fine.
If you need more detailed advice, try posting your table structure and explaining how you will query that table (what criteria you use for WHERE and ORDER BY), and we should be able to provide some more feedback.
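To make the set-based idea concrete: with one big table keyed by sample, the statistics usually reduce to an indexed aggregate. A hedged sketch with invented table and column names:

-- An index on the sample key (covering the measured value) keeps per-sample scans cheap.
CREATE INDEX IX_DataPoints_SampleId ON dbo.DataPoints (SampleId) INCLUDE (Value);

-- Set-based statistics for one sample - no row-by-row CLR loop needed.
SELECT COUNT(*)     AS N,
       AVG(Value)   AS MeanValue,
       STDEV(Value) AS StdDevValue,
       MIN(Value)   AS MinValue,
       MAX(Value)   AS MaxValue
FROM dbo.DataPoints
WHERE SampleId = 42;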