I work on a project at a financial institution. In this company, databases are distributed across separate branches. We want to set up a data center and consolidate all of these databases into a single database. In that situation we would have a database table with more than 100 million records. I think SQL operations (e.g. insert, update, select) on this table will be too slow and costly. Which scenarios can help me? We use the code-first approach of Entity Framework in our project.
a) 100 million rows is not too much for SQL Server. With appropriate indexes, disk topology, memory and CPU allocations, plus a good DBA to oversee things for a while, it should be fine (see the index sketch after this list).
b) The initial migration is NOT an EF topic. I would not recommend EF for that task.
EF can create the DB, but use tools to load the data.
Sample SO post
c) Test and/or do some research on expected insert/select times on SQL Server with 100 million rows.
d) The trick to getting good performance with EF is holding as FEW records in a context as possible.
Good EF code first code is the key to it working well.
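To illustrate point a), here is a minimal sketch of the kind of covering index meant by "appropriate indexes"; the table and column names (dbo.Transactions, AccountId, TranDate) are hypothetical, not from the question:

-- Hypothetical 100M-row table: support lookups by account and date range
CREATE NONCLUSTERED INDEX IX_Transactions_AccountId_TranDate
    ON dbo.Transactions (AccountId, TranDate)
    INCLUDE (Amount, Status)                     -- covers common report columns
    WITH (DATA_COMPRESSION = PAGE, ONLINE = ON); -- ONLINE requires Enterprise edition

Design the clustered index and a handful of covering nonclustered indexes around the actual query patterns; avoid indexing every column, since each extra index slows the inserts and updates on a 100-million-row table.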
Take a look at the BULK INSERT and bcp commands. They are used to copy large amounts of data.
http://technet.microsoft.com/en-us/library/ms130809(v=sql.110).aspx
http://technet.microsoft.com/en-us/library/ms190923(v=sql.105).aspx
If you don't use MS SQL Server, look for the corresponding feature in your server.
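As a rough sketch of the BULK INSERT route (the file path, delimiters, and staging table are assumptions for illustration):

-- Load a branch extract into a staging table, then merge into the target table
BULK INSERT dbo.Staging_Accounts
FROM 'D:\extracts\branch01_accounts.dat'
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    BATCHSIZE       = 100000,  -- commit in batches instead of one huge transaction
    TABLOCK                    -- enables minimal logging under bulk-logged/simple recovery
);
-- Roughly equivalent bcp call from the command line:
--   bcp MyDb.dbo.Staging_Accounts in D:\extracts\branch01_accounts.dat -S MYSERVER -T -c -t "|"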
Note that 100 million records may not really be a large amount of data. I recommend you run some performance tests to find out whether it will actually be an issue.
Related
I have been given a task to improve the performance of an Oracle SQL database, or of the web service that uses it. The web service requires billions of rows from the Oracle database and has to load them at every startup. The data is mostly read-only and very rarely needs to be updated or written.
It is a very old codebase, which is why the solution loads all the data into memory to improve performance, and that is what is slowing down development: the first launch takes 30+ minutes, and if for some reason the in-memory cached data becomes corrupted, I have to reload it from the database, which means another 30+ minutes of waiting.
My task is to improve this process. I have the flexibility to change the SQL database to something else if that would help speed things up. Do you have any suggestions? Thanks in advance!
You could try MySQL. As far as I know, MySQL has no limit on the size of a database. I've attached a comparison between MySQL and Oracle you can look at. Comparison
I am working on a solution architecture and am having a hard time choosing between Azure SQL DB and SQL DW.
The current scope involves developing a real-time BI reporting solution based on multiple sources, but in the long run the solution may be extended into a full-fledged EDW and data marts.
I initially thought of using SQL DW so that its MPP capabilities could be used for the future scope. But when I spoke to a colleague who recently used SQL DW, he explained that development in SQL DW is not the same as in SQL DB.
I have previously worked on real-time reporting with no scope for an EDW, and we successfully used SQL DB. With SQL DB we can also create facts, dimensions, and marts.
Is there a strong case where I should be choosing SQL DW over SQL DB?
I think the two most important data points you can have here are the volume of data you're processing and the number of concurrent queries you need to support. When talking about processing large volumes of data, and by large I mean more than 3 TB (which is not even really large, but large enough), Azure SQL Data Warehouse becomes a juggernaut. The parallel processing is simply amazing (it's amazing at smaller volumes too, but you're paying a lot of money for overkill). However, the one issue can be the simultaneous query limit. It currently has a limit of 128 concurrent queries, with a limit of 1,000 queued queries (read more here). If you're using the Data Warehouse as a data warehouse to process large amounts of data and then feed them into data marts where the majority of the querying takes place, this isn't a big deal. If you're planning to open it up to large-volume querying, it quickly becomes problematic.
Answer those two questions, query volume and data volume, and you can more easily decide between the two.
Additional factors include the T-SQL surface currently supported, which is smaller than that of traditional SQL Server. Again, for most data warehousing purposes this is not an issue; for a full-blown reporting server, it might be.
Most people successfully implementing Azure SQL Data Warehouse are using a combination of the warehouse for processing and storage and Azure SQL Database for data marts. There are exceptions when dealing with very large data volumes that need the parallel processing, but don't require lots of queries.
The 4 TB size limit of Azure SQL Database may be an important factor to consider when choosing between the two options. Queries can be faster with Azure SQL Data Warehouse since it is an MPP solution. You can pause Azure SQL DW to save costs; with Azure SQL Database you can scale down to the Basic tier (when possible).
Azure SQL DB can support up to 6,400 concurrent queries and 32K active connections, whereas Azure SQL DW can only support up to 32 concurrent queries and 1,024 active connections. So SQL DB is a much better solution if you are serving something like a dashboard with thousands of users.
As for development, Azure SQL Database supports Entity Framework, but Azure SQL DW does not.
I also want to give you a quick glimpse of how the two compare in terms of performance: 1 DWU is approximately 7.5 DTUs (Database Throughput Units, used to express the horsepower of an OLTP Azure SQL Database), although they are not exactly comparable. More information about this comparison here.
Thanks for your responses, Grant and Alberto. They have cleared things up a lot and made the choice easier.
Since the data will be subject to dashboarding and querying, I am leaning towards SQL Database instead of SQL DW.
Thanks again.
I occasionally see people or companies showcase querying a DB/cube/etc. from Tableau or Power BI with less than 5 s of response time, sometimes even less than 1 s. How do they do this? Is the data optimized to the gills? Are they using a massive DB?
On a related note, I've been experimenting with analysing a much smaller dataset of 100M rows with Tableau against SQL DW, and it still takes nearly a minute to calculate. Should I try some other technology? Perhaps Analysis Services or a big data technology?
These are usually one-off data analysis assignments so I do not have to worry about data growth.
Live connections in Tableau will only be as fast as the underlying data source. If you look at your log (C:\Users\username\Documents\My Tableau Repository\Logs\log.txt), you will see the SQL Tableau issued to the database. Run that query on the server itself; it should take about the same amount of time. Side note: Tableau has a new data engine, called 'Hyper', coming with the next release. It should allow you to create an extract from 2 billion rows with very good performance. You can download the beta now; more info here.
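If you want to confirm where the time is going, here is a minimal sketch of timing the captured query on the server itself (plain SQL Server/Azure SQL DB syntax; the query below is only a placeholder for whatever Tableau logged, and SQL DW has its own request DMVs instead):

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- paste the query from Tableau's log here, e.g.:
SELECT Region, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY Region;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

If the server-side time is already close to a minute, the fix lies in the data model or the platform, not in Tableau.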
I want to understand the best approach for SQL Server architecture on production environment.
Here is my problem:
I have a database with, on average, around 20,000 records per second being inserted across various tables.
We also have reports implemented on the same database, and whenever a user runs a report, the performance of the other applications drops steeply.
We have implemented
Table Partitioning
Indexing
And all other required things.
My question is: can anyone suggest an architecture that uses separate SQL Server databases for reports and for the application, with the reporting database syncing itself online whenever new data is entered in the master SQL Server?
Something like a master and slave architecture. I understand the master and slave concept, but I need more detail on how to apply it here.
Our main tables have around 40 million rows (table partitioning is done).
In SQL Server 2008 R2 you have database mirroring and replication available, which will keep two databases in sync.
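As a heavily simplified sketch of the transactional replication option (server, database, and table names are hypothetical, and a real setup also needs a distributor and replication agents configured):

-- On the publisher: enable the database for publication and publish one table
EXEC sp_replicationdboption @dbname = N'AppDB', @optname = N'publish', @value = N'true';
EXEC sp_addpublication @publication = N'AppDB_Reporting', @status = N'active';
EXEC sp_addarticle @publication = N'AppDB_Reporting',
     @article = N'Orders', @source_object = N'Orders';

-- Push the data to the reporting server/database
EXEC sp_addsubscription @publication = N'AppDB_Reporting',
     @subscriber = N'REPORTSRV', @destination_db = N'ReportsDB',
     @subscription_type = N'Push';

Reports then run against ReportsDB on the subscriber, so heavy report queries no longer compete with the 20,000 inserts per second on the primary.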
A schema that is efficient for OLTP is unlikely to be efficient for large-volume reporting. The 'live' and 'reporting' databases should have different schemas, with an ETL process moving data from one to the other. I would also negotiate with the business over just how synchronised the reporting database really needs to be. If the reports process large amounts of data they will take some time to run, so a lag in data replication will not be noticed, I would suggest. In extremis you could construct a solution using Service Broker to move the data, with processing on the reporting server to distribute it amongst the reporting tables.
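A very rough sketch of the kind of incremental ETL step meant here, using a watermark to move only new rows into the reporting schema (all database, table, and column names are assumptions, and both databases are assumed to sit on the same instance for simplicity):

-- Copy rows added since the last load from the live DB into the reporting DB
DECLARE @LastLoaded  datetime2 = (SELECT LoadedThrough FROM ReportsDB.dbo.EtlWatermark);
DECLARE @NewWatermark datetime2 = (SELECT MAX(OrderDate) FROM AppDB.dbo.Orders);

INSERT INTO ReportsDB.dbo.FactOrders (OrderId, CustomerId, OrderDate, Amount)
SELECT OrderId, CustomerId, OrderDate, Amount
FROM AppDB.dbo.Orders
WHERE OrderDate > @LastLoaded AND OrderDate <= @NewWatermark;

UPDATE ReportsDB.dbo.EtlWatermark SET LoadedThrough = @NewWatermark;

Run on whatever schedule the business agrees to, the lag between the live and reporting databases is simply the schedule interval.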
The numbers you quote (20,000 inserts per second, 40 million rows in the largest table) suggest a record doesn't reside in the DB for long, so you must also have a significant load performing DELETEs. Moving these out of peak hours could be sufficient to solve your problems.
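And a minimal sketch of pushing those DELETEs out of peak hours in small batches, so the purge never takes long-held locks (the 30-day retention and table name are assumptions):

-- Run from a nightly agent job: purge expired rows in small chunks
DECLARE @BatchSize int = 5000;
WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.Transactions
    WHERE TranDate < DATEADD(DAY, -30, SYSDATETIME());

    IF @@ROWCOUNT < @BatchSize BREAK;  -- nothing (or little) left to purge
END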
We have an AS400 mainframe running our DB2 transactional database. We also have a SQL Server setup that gets loaded nightly with data from the AS400. The SQL Server setup is for reporting.
I can link the two database servers, BUT, there's concern about how big a performance hit DB2 might suffer from queries coming from SQL Server.
Basically, the fear is that if we start hitting DB2 with queries from SQL Server we'll bog down the transactional system and screw up orders and shipping.
Thanks in advance for any knowledge that can be shared.
Anyone who has a pat answer for a performance question is wrong :-) The appropriate answer is always 'it depends.' Performance tuning is best done by measuring, changing one variable, and repeating.
DB2 for i shouldn't even notice if someone executes a 1,000-row SELECT statement. Take Benny's suggestion and run one while the IBM i side watches. If they want a hint, use WRKACTJOB and sort on the Int column, which represents the interactive response time. I'd guess that the query will be complete before they have time to notice it was active.
If that seems unacceptable to the management, then perhaps offer to test it before or after hours, where it can't possibly impact interactive performance.
As an aside, the RPG guys can create Excel spreadsheets on the fly too. Scott Klement published some RPG wrappers over the Java POI/HSSF classes. Also, Giovanni Perrotti at Easy400.net has some examples of providing an Excel spreadsheet from a web page.
I'd mostly agree with Buck: a 1,000-row result set is no big deal...
Unless, of course, the system is looking through billions of rows across hundreds of tables to get the 1,000 rows you are interested in.
Assuming a useful index exists, 1,000 rows shouldn't be a big deal. If you have IBM i Access for Windows installed, there's a component of System i Navigator called "Run SQL Scripts" that includes "Visual Explain", which gives a visual representation of the query execution plan. Using that, you can make sure an index is being used.
One key thing: make sure the work is being done on the i. When using a standard linked table, MS SQL Server will attempt to pull back all the rows and then apply its own WHERE:
select * from MYLINK.MYIBMI.MYLIB.MYTABLE where MYKEYFLD = '00335';
Whereas this format sends the statement to the remote server for processing and just gets back the results:
select * from openquery(MYLINK, 'select * from mylib.mytable where MYKEYFLD = ''00335''');
Alternatively, you could ask the i guys to build you a stored procedure that you can call to get back the results you are looking for. Personally, that's my preferred method.
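A hedged sketch of that stored-procedure approach, pushing the call to the i through the linked server (the procedure name is hypothetical, and the linked server must have the 'RPC Out' option enabled):

-- Execute an IBM i stored procedure remotely; filtering happens on the i,
-- only the result set comes back to SQL Server
EXEC ('CALL MYLIB.GET_ORDERS_FOR_KEY(?)', '00335') AT MYLINK;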
Charles