Power Pivot in C# & Columnar Database - sql

I want to use Power Pivot in one of my presentation engine applications for transactional data.
Following are the questions for which I am looking for answers.
What is PowerPivot?
Can I use Power Pivot if I have 100M rows in one of my SQL Server tables?
For handling 100M rows, can I store them in a simple SQL Server database table, or do I need a columnar database?
How exactly does Power Pivot work?

PowerPivot is simply a BI tool. There are many good BI tools, especially if you want to get into the open-source area. Look at Pentaho, Jaspersoft, and BIRT/Actuate. These tools can also connect to many different sources/databases.
For question 3, it's all about how you're using the data. If you always query based upon the same filtering criteria, then using indexes may work for you. Assuming 100 million rows is about 50 gigs of raw data, you're starting to see the "shift" in query response/scale between a row-oriented approach and a column-oriented approach. If the queries are ad-hoc or your database size will continue to grow, then you should consider a columnar database like Infobright.
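For the fixed-filter case, a minimal sketch of what that might look like on a row-oriented table (table and column names are hypothetical):

    -- Hypothetical row-oriented table with a fixed filtering pattern.
    -- A covering nonclustered index on the filter columns can keep queries fast
    -- without moving to a columnar store.
    CREATE NONCLUSTERED INDEX IX_Transactions_Date_Store
        ON dbo.Transactions (TransactionDate, StoreId)
        INCLUDE (Amount);

    -- The kind of query this index covers directly:
    SELECT StoreId, SUM(Amount) AS TotalAmount
    FROM dbo.Transactions
    WHERE TransactionDate >= '2012-01-01'
      AND TransactionDate <  '2012-02-01'
    GROUP BY StoreId;

If the filtering criteria vary from query to query, no single index like this will cover them all, which is where the columnar approach starts to pay off.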

Related

Data profiling of columns for big table (SQL Server)

I have a table with over 40 million records. I need to do data profiling, including null counts, distinct values, zeros and blanks, % numeric, % date, needs-to-be-trimmed, and so on.
The examples I was able to find always implement the task using cursors. For a big table, such a solution is a performance killer.
I would be glad to receive suggestions and examples of better-performing alternatives. Is it possible to create multiple stored procedures and combine the results in a table? I have not used stored procedures so far, so I am basing my question only on the understanding I got from the documentation.
As Gordon mentioned, you should include your table's schema and some sample data to get the best answers, but a couple of things you can look into are as follows:
Columnstore Indexes - These can be helpful for analytical querying against a table, e.g. SUM(), COUNT(), COUNT(DISTINCT), etc., because of the compression efficiencies that can be achieved up and down a column for analytics. This is useful if you need a "real time" answer every time you query the data.
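As a rough, hypothetical sketch (table and column names are placeholders), a nonclustered columnstore index plus a single set-based pass can replace a cursor for this kind of profiling; note that on SQL Server 2012/2014 a nonclustered columnstore index makes the table read-only:

    -- Placeholder names: dbo.BigTable with a numeric Col1 and a character Col2.
    CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_BigTable_Profile
        ON dbo.BigTable (Col1, Col2);

    -- One set-based pass over the table instead of a cursor:
    SELECT
        COUNT(*)                                                  AS TotalRows,
        SUM(CASE WHEN Col1 IS NULL THEN 1 ELSE 0 END)             AS Col1_Nulls,
        COUNT(DISTINCT Col1)                                      AS Col1_DistinctValues,
        SUM(CASE WHEN Col1 = 0 THEN 1 ELSE 0 END)                 AS Col1_Zeros,
        SUM(CASE WHEN LTRIM(RTRIM(Col2)) = '' THEN 1 ELSE 0 END)  AS Col2_Blanks,
        SUM(CASE WHEN TRY_CONVERT(float, Col2) IS NOT NULL
                 THEN 1 ELSE 0 END)                               AS Col2_NumericRows
    FROM dbo.BigTable;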
You can periodically stage and update the results in a data warehouse type table. You basically store the results of those aggregations in their own table and periodically update it, either with a SQL Agent job (not necessarily a real-time solution) or with triggers that automatically update your data warehouse table (closer to a real-time solution, but performance heavy if not implemented in a lean manner).
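A minimal sketch of that staged approach, with hypothetical names and a snapshot-style insert that a SQL Agent job could run on a schedule:

    -- Hypothetical summary table refreshed periodically by a scheduled job.
    IF OBJECT_ID('dbo.BigTable_ProfileSnapshot', 'U') IS NULL
        CREATE TABLE dbo.BigTable_ProfileSnapshot (
            RefreshedAt    datetime2 NOT NULL,
            TotalRows      bigint    NOT NULL,
            Col1_Nulls     bigint    NOT NULL,
            Col1_Distinct  bigint    NOT NULL
        );

    INSERT INTO dbo.BigTable_ProfileSnapshot (RefreshedAt, TotalRows, Col1_Nulls, Col1_Distinct)
    SELECT
        SYSUTCDATETIME(),
        COUNT(*),
        SUM(CASE WHEN Col1 IS NULL THEN 1 ELSE 0 END),
        COUNT(DISTINCT Col1)
    FROM dbo.BigTable;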
OLAP Cubes - This is a more automated version of the above solution with better maintainability, but it is also a more advanced solution. It is the methodology for building out an actual OLAP-based data warehouse.
In terms of difficulty of implementation, and given the size of your data (which isn't anything too huge), my recommendation would be to start with columnstore indexes and see how much they help your queries. I've had much success using them for analytical querying. The remaining recommendations are otherwise listed in order of difficulty as well.

What are Apache Kylin Use Cases?

I recently came across Apache Kylin and was curious what its use cases are. From what I can tell, it seems to be a tool designed to solve very specific problems related to upwards of 10+ billion rows: aggregating, caching, and querying data from other sources (HBase, Hadoop, Hive). Am I correct in this assumption?
Apache Kylin's use case is interactive big data analysis on Hadoop. It lets you query big Hive tables at sub-second latency in 3 simple steps.
Identify a set of Hive tables in star schema.
Build a cube from the Hive tables in an offline batch process.
Query the Hive tables using SQL and get results in sub-second time, via REST API, ODBC, or JDBC.
The use case is pretty general: Kylin can quickly query any Hive tables as long as you can define a star schema and model cubes from the tables. Check out the Kylin terminology if you are not sure what a star schema or a cube is.
Kylin provides an ANSI SQL interface, so you can query the Hive tables pretty much the same way you are used to. One limitation, however, is that Kylin returns only aggregated results; in other words, the SQL should contain a GROUP BY clause to yield correct results. This is usually fine, because big data analysis focuses more on aggregated results than on individual records.
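For illustration only (fact_sales and dim_date are hypothetical star-schema tables), this is the kind of GROUP BY query Kylin serves from a pre-built cube rather than by scanning Hive:

    -- Aggregated query answered from the cube; measures and dimensions
    -- must have been modeled into the cube beforehand.
    SELECT d.cal_year, d.cal_month,
           SUM(f.sales_amount) AS total_sales,
           COUNT(*)            AS order_count
    FROM fact_sales f
    JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY d.cal_year, d.cal_month;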

COLUMN STORE INDEX vs CLUSTERED INDEX: which one to use?

I'm trying to evaluate which type of indexes to use on our tables in the SQL Server 2014 data mart that we are using to power our OLAP cube in SSAS. I have read the documentation on MSDN and am still a bit unclear which strategy is right for our use case, with the ultimate goal of speeding up the SQL Server queries that the cube issues when people browse it.
I have the tables related to each other as shown in the following snowflake dimensional model. The majority of the calculations we are going to do in the cube are COUNT DISTINCT of users (UserInfoKey) based on different combinations of dimensions (both filters and pivots). Keeping that in mind, what would the SQL experts suggest I do in terms of creating indexes on the tables? I have the option of creating COLUMNSTORE INDEXES on all my tables (partitioned by the hash of the primary keys) or creating regular primary keys (clustered indexes) on all my tables. Which one is better for my scenario? From my understanding, the cube will be doing a lot of joins and GROUP BYs under the covers, based on the dimensions selected by the user.
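For reference, the two options being weighed look roughly like this (simplified DDL; ActivityDataKey is a placeholder for the fact table's key, and only one of the two can be the table's clustered structure):

    -- Option A: clustered columnstore index on the fact table (SQL Server 2014).
    CREATE CLUSTERED COLUMNSTORE INDEX CCI_ActivityData
        ON dbo.ActivityData;

    -- Option B: conventional clustered primary key (rowstore) instead.
    ALTER TABLE dbo.ActivityData
        ADD CONSTRAINT PK_ActivityData PRIMARY KEY CLUSTERED (ActivityDataKey);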
I tried both versions with some sample data and the performance isn't that different in either case. Now, before I do the same experiment with real data (it's going to take a lot of time to produce the real data and load it into our data mart), I wanted to check with the experts for their suggestions.
We are also evaluating whether we should use PDW (Parallel Data Warehouse) as our data mart instead of vanilla SQL Server 2014.
Just to give an idea of the scale of data we are dealing with, the two largest tables are:
ActivityData fact table: 784+ million rows
DimUserInfo dimension table: 30+ million rows
Any help or pointers are appreciated.

How to best query across both Oracle and SQL Server databases for large tables?

I have a stored procedure in SQL Server that queries tables both in the same database and in a different Oracle database. This is for a data warehouse project that joins several large tables across databases and queries them.
Is it better to copy the table (with ~3 million records) into the same database and then query it, or is the slowdown from the table being in a different database not significant? The query is complicated and can take hours.
I'm not necessarily looking for a specific answer, informed opinion and/or specific further reading are also very appreciated. Thanks!
I always prefer a stage layer, which some call an integration layer.
In your case (going in blind), it's probably best to:
Copy the table once
Create a sync step (insert/update) based on the primary key(s)
Schedule step 2
Run your query
If there is some logical data-integrity rule, you can build the second step with simple SQL based on timestamps.
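A rough sketch of that approach, assuming a linked server to Oracle named ORACLE_LINK and placeholder table/column names:

    -- Step 1 (one-time): copy the Oracle table into a local staging table.
    SELECT *
    INTO dbo.Stage_BigTable
    FROM OPENQUERY(ORACLE_LINK,
         'SELECT id, col1, last_updated FROM remote_schema.big_table');

    -- Step 2 (scheduled): insert/update sync keyed on the primary key.
    -- In practice you would filter the remote query on last_updated
    -- so only changed rows are pulled across each run.
    MERGE dbo.Stage_BigTable AS tgt
    USING OPENQUERY(ORACLE_LINK,
          'SELECT id, col1, last_updated FROM remote_schema.big_table') AS src
       ON tgt.id = src.id
    WHEN MATCHED AND src.last_updated > tgt.last_updated THEN
        UPDATE SET tgt.col1 = src.col1, tgt.last_updated = src.last_updated
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (id, col1, last_updated)
        VALUES (src.id, src.col1, src.last_updated);

With the staging table local, the long-running warehouse query then joins against dbo.Stage_BigTable instead of reaching across to Oracle.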

SQL Server 2012 Query Performance

I will be starting a project soon using SQL Server 2012 in which I will be required to provide real-time querying of database tables, with in excess of 4 billion records in one of the tables alone. I am fairly familiar with SQL Server (I have indexes on the relevant columns), but have never had to deal with databases this large before.
I have been looking into partitioning and am fairly confident using it; however, it is only available in the Enterprise edition(?), for which the licenses are WAY too expensive. Columnstore indexes also look promising, but besides only being available in the Enterprise edition, they also render your table read-only(??). Another option is to archive data as soon as it is no longer being used live, so that I keep as little data as possible in the live tables.
The main queries on the largest table will be on an NVARCHAR(50) column which contains an ID. Initial testing with 4 billion records, using a query to pull a single record based on the ID, is taking in excess of 5 minutes even with indexing. So my question is (and sorry if it sounds naive!): can somebody please suggest a way to speed up the queries on this table that I haven't mentioned (and therefore don't know about)? Many thanks in advance.
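For context, a simplified sketch of the setup being described (table, index, and ID values are made up):

    -- The kind of index that already exists on the ID column:
    CREATE NONCLUSTERED INDEX IX_BigTable_ExternalId
        ON dbo.BigTable (ExternalId);

    -- The single-record lookup that is currently taking over 5 minutes:
    SELECT *
    FROM dbo.BigTable
    WHERE ExternalId = N'ABC-123456';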