Method to create calculated column from all columns in PowerPivot model

I want to compare data in two PowerPivot tables.
Is there a method in PowerPivot to compare two tables of data?
Or alternatively ...
I've created a "key" calculated column (a concatenation of 6 columns using '&') and I now want to create a calculated column from all the remaining data - about 100 columns.
Is there a method / function that will allow me to create that calculated column?
Edit: the reason is to perform data comparison checks on data before and after a data migration. Additionally, PowerPivot was dictated as the technology of choice for this solution; using one of the RedGate compare tools would have been much easier.

The best answer I could find was what I was originally doing.
Create a string concatenation of the 6 key columns as a CompoundKey Column
Create a string concatenation of the 100 (approx) data columns as a CombinedData Column
After first checking that both tables contained the same number of observations, I ordered the data in each table by the CompoundKey and compared table1.CompoundKey to table2.CompoundKey and table1.CombinedData to table2.CombinedData.
This enabled me to find the keys that differed between the two datasets, and then any rows of data that differed for matching key rows.
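For reference, the equivalent comparison expressed in T-SQL (illustration only, since PowerPivot was the mandated tool; the staging table names below are made up):

    -- Sketch: compare pre- and post-migration tables on the concatenated
    -- CompoundKey, then on the concatenated CombinedData (names assumed).
    SELECT
        COALESCE(b.CompoundKey, a.CompoundKey) AS CompoundKey,
        CASE
            WHEN a.CompoundKey IS NULL THEN 'Missing after migration'
            WHEN b.CompoundKey IS NULL THEN 'New after migration'
            WHEN b.CombinedData <> a.CombinedData THEN 'Data changed'
        END AS Difference
    FROM BeforeMigration b
    FULL OUTER JOIN AfterMigration a
        ON a.CompoundKey = b.CompoundKey
    WHERE a.CompoundKey IS NULL
       OR b.CompoundKey IS NULL
       OR b.CombinedData <> a.CombinedData;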

Related

Multiple column selection in columnar database

I am just getting to understand the difference between row-based and column-based databases. I know their benefits, but I have a few questions.
Let's say I have a table with 3 columns - col1, col2 and col3. I want to fetch all col2, col3 pairs where col3 matches particular value. Let's say column values are stored in disk like below.
Block1 = col1
Block2,Block3 = col2
Block4 = col3
My understanding is that the column value, along with row id information, will be stored in a block. E.g. (Block4 -> apple:row_2, banana:row_1). Am I correct?
Are the values in a block sorted by column value? E.g. (Block4 -> apple:row_2, banana:row_1 instead of Block4 -> banana:row_1, apple:row_2). If not, how does filtering or searching work without compromising performance?
Assuming values in the block are sorted by column value, how will the corresponding col2 values be filtered based on the row ids fetched from Block4? Does that require a linear search?
The purpose of a columnar database is to improve performance for read queries by limiting the IO only to those columns used in the query. It does this by separating the columns into separate storage spaces.
A naive form of a columnar database would store one or a set of columns with a primary key and then use JOIN to bring together all the columns for a table. Columns that are not referenced would not be included.
However, databases with native columnar storage have much more sophisticated functionality than this naive example. Each columnar database stores data in its own way, so the answer depends on the particular database, which you haven't specified.
They might store "blocks" of values for a column and these blocks represent (in some way) a range of rows. So, if you are choosing 1 row from a billion row table, only the blocks with those rows need to be read.
Storing columns separately allows for enhanced functionality at the column level:
Compression. Values with the same data type can be much more easily compressed than rows which contain different values.
Block statistics. Blocks can be summarized statistically -- such as min and max values -- which can facilitate filtering.
Secondary data structures. Indexes for instance can be used within blocks (and these might be akin to "sorting" the values, actually).
The cost of all this is that inserts are no longer simple, so ACID properties are trickier with a column orientation. Because such databases are often used for decision support queries, this may not be an important limitation.
The "rows" are determined -- essentially -- by row ids. However, the row ids may actually consist of multiple parts, such as a block id and a row-within-a-block. This allows the store to use, say, 4 bytes for each component but not be limited to 4 billion rows.
Reconstructing rows between different columns is obviously a critical piece of functionality for any such database. In the "naive" example, this is handled via JOIN algorithms. However, specialized data structures would clearly have more specific approaches. Storing the data essentially in "row order" would be a typical solution.
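A rough sketch of that naive column-per-table decomposition, using the col1/col2/col3 example from the question (the table names are invented for illustration):

    -- Naive columnar layout: each column kept in its own table, keyed by row id.
    CREATE TABLE t_col1 (row_id int NOT NULL PRIMARY KEY, col1 int);
    CREATE TABLE t_col2 (row_id int NOT NULL PRIMARY KEY, col2 varchar(50));
    CREATE TABLE t_col3 (row_id int NOT NULL PRIMARY KEY, col3 varchar(50));

    -- A query touching only col2 and col3 never reads col1's storage.
    SELECT c2.col2, c3.col3
    FROM t_col3 c3
    JOIN t_col2 c2 ON c2.row_id = c3.row_id
    WHERE c3.col3 = 'apple';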

SQL split CSV values to rows in denormalized table

So I have some quite large denormalized tables with multiple columns containing comma-separated values.
The CSV values vary in length from column to column. One table has 30 different columns that can contain CSVs! For reporting purposes I need to count the CSV values in each column (essentially different types).
Having never done this before what is my best approach?
Create a new table, populated using a CSV split method, with a type field and a type table for the different types?
Use the XML approach with XPath and the .nodes() and .value() methods to split each column on the fly and count as I go (sketched below)?
Or should I create some views that would show me what I want?
Please advise
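For reference, the XML split I have in mind looks something like this - just a sketch, with made-up table and column names:

    -- Sketch of the XML split-and-count approach for one CSV column.
    -- (BigTable and CsvColumn1 are placeholder names; values containing
    -- XML-special characters such as & or < would need escaping first.)
    SELECT s.Value, COUNT(*) AS ValueCount
    FROM (
        SELECT n.x.value('.', 'varchar(100)') AS Value
        FROM dbo.BigTable t
        CROSS APPLY (
            SELECT CAST('<v>' + REPLACE(t.CsvColumn1, ',', '</v><v>') + '</v>' AS xml) AS doc
        ) c
        CROSS APPLY c.doc.nodes('/v') AS n(x)
    ) s
    GROUP BY s.Value
    ORDER BY ValueCount DESC;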

Do duplicate values in an index take duplicate space?

I want to optimize storage of a big table by moving the values of varchar columns out to an external lookup table (there are many duplicated values).
The process of doing this is quite technical in nature (creating a lookup table and referencing it instead of the actual value), and it sounds like something that should be part of the infrastructure (SQL Server in this case, or any RDBMS).
Then I thought it could be an option on an index: do not store duplicate values, only a reference to the duplicated value.
Can an index be optimized in such a manner - not holding duplicated values, just references?
That should make the table and index much smaller when there are many duplicated values.
SQL Server cannot do deduplication of column values. An index stores one row for each row of the base table. They are just sorted differently.
If you want to deduplicate you can keep a separate table that holds all possible (or actually occurring) values with a much shorter ID. You can then refer to the values by only storing their ID.
You can maintain that deduplication table in the application code or using triggers.
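A minimal sketch of that manual approach (the table and column names are invented for illustration):

    -- Lookup table holding each distinct varchar value exactly once.
    CREATE TABLE dbo.CityLookup (
        CityID   int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
        CityName varchar(200)       NOT NULL UNIQUE
    );

    -- The big table stores only the short ID instead of the repeated string.
    CREATE TABLE dbo.BigTable (
        RowID  int NOT NULL PRIMARY KEY,
        CityID int NOT NULL REFERENCES dbo.CityLookup (CityID)
        -- ... other columns
    );

    -- Queries rejoin to recover the original value.
    SELECT b.RowID, c.CityName
    FROM dbo.BigTable b
    JOIN dbo.CityLookup c ON c.CityID = b.CityID;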

Dynamic Dataset update

I am creating a project in VB.NET in which one of the reports requires that employee names be displayed as column names, with whatever work they have done for a stated period appearing in rows below each employee's column.
As is clear, the columns will have to be added at runtime. I am using an ODBC data source to populate the grid. Also, since a loop will have to be done to find the work done by each employee individually, the number of rows under one column might be fewer or more than the rows in the next column.
Is there a way to create an empty data table and then update its contents column by column, rather than by adding new rows?
Regards
A table consists of rows and columns: the rows hold the data, the columns define the data.
So you have no choice but to add at least as many rows as your longest column needs. You can just fill in empty values in the other columns. That should give you the view you need.
Wouldn't it be better to simply switch the table orientation?
If most of your columns are names (or perhaps groupings, I don't know),
then you'd have one column for each piece of data you want to display,
and you'd add a row for each name with its stats dynamically, which is the more common approach.
I'm only suggesting that because I don't know your full table structure.
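To illustrate the row-oriented shape suggested above (the schema and column names here are assumed), the query behind the grid would then look something like this:

    -- Hypothetical row-oriented shape: one row per employee per piece of work.
    CREATE TABLE dbo.EmployeeWork (
        EmployeeName varchar(100) NOT NULL,
        WorkItem     varchar(255) NOT NULL,
        WorkDate     datetime     NOT NULL
    );

    -- Report query for a stated period: the grid adds a row per result,
    -- so uneven amounts of work per employee are no longer a problem.
    SELECT EmployeeName, WorkItem
    FROM dbo.EmployeeWork
    WHERE WorkDate >= '20230101' AND WorkDate < '20230201'
    ORDER BY EmployeeName, WorkDate;

If the display really must show one column per employee, that pivot can still be done at presentation time rather than in the DataTable itself.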

Sql Design Question

I have a table with 25 columns where 20 columns can have null values for some (30-40%) rows.
Now what is the cost of having rows with 20 null columns? Is this OK?
Or
is it a good design to have another table to store those 20 columns and add a ref to the first table?
This way I will only write to the second table when there are values.
I am using SQL server 2005. Will migrate to 2008 in future.
Only 20 columns are varchar; the rest are smallint or smalldatetime.
What I am storing:
These columns store different attributes of the row they belong to. These attributes can sometimes be null.
The table will hold ~1 billion rows.
Please comment.
You should describe the type of data you are storing. It sounds like some of those columns should be moved to another table.
For example, if you have several columns that really represent multiple values of the same type of data, then I would say move them to another table. On the other hand, if you need this many columns to describe different types of data, then you may need to keep it as it is.
So it kind of depends on what you are modelling.
Are there some circumstances where some of those columns are required? If so, then perhaps you should use some form of inheritance. For instance, if this were information about patients in a hospital, and there was some data that only made sense for female patients, then you could create a FemalePatients table with those columns. Those columns that must always be collected for female patients could then be declared NOT NULL in that separate table.
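A minimal sketch of that kind of subtype table, using the hypothetical hospital example (all names invented):

    -- Base table for attributes common to all patients.
    CREATE TABLE dbo.Patients (
        PatientID int          NOT NULL PRIMARY KEY,
        Name      varchar(100) NOT NULL,
        Sex       char(1)      NOT NULL
    );

    -- Subtype table: a row exists only for female patients, so columns that
    -- must always be collected for them can be declared NOT NULL here.
    CREATE TABLE dbo.FemalePatients (
        PatientID     int      NOT NULL PRIMARY KEY
            REFERENCES dbo.Patients (PatientID),
        LastCheckDate datetime NOT NULL
    );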
It depends on the data types (40 nullable ints is going to basically take the same space as 40 non-nullable ints, regardless of the values). In SQL Server, the space is fairly efficient with ordinary techniques. In 2008, you do have the SPARSE feature.
If you do split the table vertically with an optional 1:1 relationship, there is a possibility of wrapping the two tables with a view and adding triggers on the view to make it updatable and hide the underlying implementation.
So there are plenty of options, many of which can be implemented after you see the data load and behavior.
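A sketch of that vertical split with a view over it (the names and columns are illustrative only; the INSTEAD OF triggers that would make the view updatable are omitted):

    -- Required columns stay in the main table.
    CREATE TABLE dbo.MainRows (
        RowID   int           NOT NULL PRIMARY KEY,
        Created smalldatetime NOT NULL
    );

    -- The mostly-NULL optional columns move to a 1:1 child table;
    -- a child row is written only when there are values.
    CREATE TABLE dbo.OptionalAttributes (
        RowID int NOT NULL PRIMARY KEY
            REFERENCES dbo.MainRows (RowID),
        Attr1 varchar(50) NULL,
        Attr2 varchar(50) NULL
        -- ... remaining optional columns
    );
    GO

    -- A view hides the split from readers.
    CREATE VIEW dbo.MainRowsWide
    AS
    SELECT m.RowID, m.Created, o.Attr1, o.Attr2
    FROM dbo.MainRows m
    LEFT JOIN dbo.OptionalAttributes o ON o.RowID = m.RowID;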
Create tables based on the distinct sets of attributes you have. So if you have some data where some of your columns do not apply then it would make sense to have that data in a table which doesn't have those columns. As far as possible, avoid repeating the same attribute in multiple tables. Make sure your data is in at least Boyce-Codd / 5th Normal Form and you won't go far wrong.