SQL - Merging planned vs actual tables with a date criteria in the planned table. for Power BI - sql

I have a general question, with regards to planned and actual tables and their merging.
I have a planned table for products and their amounts, and the same for what really exists(actual). There are gaps on both sides, as in, not every planned has an actual and not every actual has a planned. I used a select union to combine the two tables, and I created another column called "type" where I then labeled the figures actual or planned. However I feel that is not a good way of doing things. I also should add that I have a data criteria with the planned table.
Is there a good way of structuring an SQL for the merging of the two tables to view in Power BI??

What you have done is look good, merging 2 tables with a additional column to identify planned/actual, what final output you are planning that matter the most,
if you have achieve your result by doing this, then it is good method.

thanks #msta42a. The secret was simply creating a common key.
I also found that having the two columns in seprate columns rather, than in the same column and seperated by a "type" was easier to deal with in Power BI, especially with regards to creating a trand column or comparison of the two.

Related

Merging multiple tables from multiple databases with all rows and columns

I have 30 databases from a survey application that all have a table of results with approximately 100 columns in each. Most of the columns are identical but each survey seems to have a unique column or two added in with no real pattern (these are the added questions and results of the survey). As I am working on the statement to join all of the tables into one large master table the code is getting quite complex. Is there a more efficient way to merge these tables from multiple databases and just select all rows and columns so it will merge if the column exists and create if it encounters a new column?
No, there isn't an automatic way to merge a bunch of similar, but not quite the same, tables into one. At least, not in any database system that I know of.
You could possibly automate something like that with a fairly simple script that relies on your database's information schema (or equivalent).
However, with only 30 tables and only a column or two different in each, I'm not sure it's worth it. A manual approach, with copying and pasting and making minor changes, would probably be faster.
Also, consider whether the "extra" columns that are unique to individual tables need to go into the combined table. The point of making a big single table is to process/analyze all the data together. If something only applies to a single source, this isn't possible.

Best Practice - Should I make one table or two for two similar sets of data?

I need a table to store types of tests. I've been provided with two excel spreadsheets, one for microbial tests, one for pathogens. Microbial has 5 columns and Pathogens has 10. The 5 columns are in both tables. So one has 5 extra columns.
Just to give you an idea, the table columns would be something like this:
**Microbial**
Test Method IncubationStage1
**Pathogens**
Test Method IncubationStage1 IncubationStage2 Enrichment
So Is it better to have one table for Microbial and one for Pathogens, or better to have one table for Tests and have both within it? Is it bad to have a Microbial in a table where I know for certain only half the columns will be utilized? Or is it better to keep related items in the same table, and separate them by a column "Type"?
Obviously both will work fine but I'm wondering which is better.
The answer to these sorts of questions is always "it depends."
For my opinion, if you think you'll ever want to aggregate the data by test or by method across pathogenic or microbial types, then certainly you should put the data in the same table with an additional column that differentiates them.
You also could potentially better "normalize" your tables like this:
Table1: ExperimentID_PK ExperimentTypeID_FK Test Method
Table2: MeasurementRecordID_PK ExperimentID_FK Timestamp Other metadata about the record
Table3 MeasurementID_PK MeasurementTypeID_FK MeasurementValue MeasurementRecordID_FK
Table4: MeasurmentTypeId_PK Metadata About Measurement Types
Table5: ExperimentTypeId_PK Metadata About Experiment Types
... where all the leaf data elements point back to their parent data elements through foreign keys, and then you'd join data together in SQL statements, with indexes applied for optimal performance based on the types of queries you wanted to make. Obviously one of your rows in the question would end up appearing as multiple rows across multiple tables in this schema, and only at query time could they conceivably be reunited into individual rows (e.g. bound by MeasurementRecordID).
But there are other patterns too, in No-SQL land normalization can be the enemy. Slicing and dicing data sets turns out to be easier in some domains if it is stored in a more bloated format to make query structures more obvious. So it kind of comes down to thinking through your use cases.

SQL - multiple tables vs one big table

I want to move multiple SQLite files to PostgreSQL.
Data contained in these files are monthly time-series (one month in a single *.sqlite file). Each has about 300,000 rows. There are more than 20 of these files.
My dilemma is how to organize the data in the new database:
a) Keep it in multiple tables
or
b) Merge it to one huge table with new column describing the time period (e.g. 04.2016, 05.2016, ...)
The database will be used only to pull data out of it (with the exception of adding data for new month).
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Which structure should I go for - one huge table or multiple smaller tables?
Think I would definitely go for one table - just make sure you use sensible indexes.
If you have the space and the resource 1 table, as other users have appropriately pointed out databases can handle millions of rows no problem.....Well depends on the data that is in them. The row size can make a big difference... Such as storing VARCHAR(MAX), VARBINARY(MAX) and several per row......
there is no doubt writing queries, ETL (extract transform load) is significantly easier on a single table! And maintenance of that is easier too from a archival perspective.
But if you never access the data and you need the performance in the primary table some sort of archive might make since.
There are some BI related reasons to maintain multiple tables but it doesn't sound like that is your issue here.
There is no perfect answer and will depend on your situation.
PostgreSQL is easily able to handle millions of rows in a table.
Go for option b) but..
with new column describing the time period (e.g. 04.2016, 05/2016, ...)
Please don't. Querying the different periods will become a pain, an unnecessary one. Just put the date in one column, put a index on the column and you can, probably, execute fast queries on it.
My concern is that selecting data from multiple tables (join) would not perform very well and the queries can get quite complicated.
Complicated for you to write or for the database to execute? An Example would be nice for us to get an image of your actual requirements.

100 Join SQL query

I'm looking to essentially have a centralized table with a number of lookup tables that surround it. The central table is going to be used to store 'Users' and the lookup tables will be user attributes, like 'Religion'. The central table will store an Id, like ReligionId, and the lookup table would contain a list of religions.
Now, I've done a lot of digging into this and I've seen many people comment saying that a UserAttribute table might be the best way to go, essentially using an EAV pattern. I'm not looking to do this. I realize that my strategy will be join-heavy and that's why I ask this question here. I'm looking for a way to optimize those joins.
If the table has 100 lookup tables, how could it be optimized to be faster than just doing a massive 100 table inner join? Some ideas come to mind like using many smaller joins, sub-selects and views. I'm open to anything, including a combination of these strategies. Again, just to note, I'm not looking to do anything that's EAV-related. I need the lookup tables for other reasons and I like normalized data.
All suggestions considered!
Here's a visual look:
Edit: Is this insane?
Optimization techniques will likely depend on the size of the center table and intended query patterns. This is very similar to what you get in data warehousing star schemas, so approaches from that paradigm may help.
For one, ensuring the size of each row is absolutely as small as possible. Disk space may be cheap, but disk throughput, memory, and CPU resources are potential bottle necks. You want small rows so that it can read them quickly and cache as much as possible in memory.
A materialized/indexed view with the joins already performed allows the joins to essentially be precomputed. This may not work well if you are dealing with a center table that is being written to alot or is very large.
Anything you can do to optimize a single join should be done for all 100. Appropriate indexes based on the selectivity of the column, etc.
Depending on what kind of queries you are performing, then other techniques from data warehousing or OLAP may apply. If you are doing lots of group by's then this is likely an area to look in to. Data warehousing techniques can be applied within SQL Server with no additional tooling.
Ask yourself why so many attributes are being queried and how they are being presented? For most analysis it is not necessary to join with lookup tables until the final step where you materialize a report, at which time you may only have grouped by on a subset of columns and thus only need some of the lookup tables.
Group By's generally should be able to group on the lookup Id's without needing the text/description from the lookup table so a join is not necessary. If your lookups have other information relevant to the query at hand then consider denormalizing it into the central table to eliminate the join and/or make that discreet value its own lookup, essentially splitting the existing lookup ID into another ID.
You could implement a master code table that combines the code tables into a single table with a CodeType column. This is not the same as EAV because you'd still have a column in the center table for each code type and a join for each, where as EAV is usually used to normalize out an arbitrary number of attributes. (Note: I personally hate master code tables.)
Lastly, consider normalization the center table if you are not doing data warehousing.
Are there lots of null values in certain lookupId columns? Is the table sparse? This is an indication that you can pull some columns out into a 1 to 1/0 relationships to reduce the size of the center table. For example, a Person table that includes address information can have a PersonAddress table pulled out of it.
Partitioning the table may improve performance if there's a large number of rows and you can determine that certain rows, perhaps with a certain old datetime from couple years in the past, would rarely be queried.
Update: See "Ask yourself why so many attributes are being queried and how they are being presented?" above. Consider a user wants to know number of sales grouped by year, department, and product. You should have id's for each of these so you can just group by those IDs on the center table and in an outer query join lookups for only what columns remain. This ensures the aggregation doesn't need to pull in unnecessary information from lookups that aren't needed anyway.
If you aren't doing aggregations, then you probably aren't querying large numbers of records at a time, so join performance is less of a concern and should be taken care of with appropriate indexes.
If you're querying large numbers of records at a time pulling in all information, I'd look hard at the business case for this. No one sits down at their desk and opens a report with a million rows and 100 columns in it and does anything meaningful with all of that data, that couldn't be accomplished in a better way.
The only case for such a query be a dump of all data intended for export to another system, in which case performance shouldn't be as much as a concern as it can be scheduled overnight.
Since you are set on your way. you can consider duplicating data in order to join less times in a similar way to what is done in olap database.
http://en.wikipedia.org/wiki/OLAP_cube
With that said I don't think this is the best way to do it if you have 100 properties.
Have you tried to export it to Microsoft Excel Power Pivot with Power Query? you can make fast data analysis with pretty awsome ways to show it with Power view video sample

database design bigger table vs split table have the same col

i have a database program for store as you know there is too type of invoice in it one for the thing i bought and the other for me when i sold them the two table is almost identical like
invoice table
Id
customerName
date
invoiceType
and invoiceDetails which have
id
invoiceId
item
price
amount
my question is simple its what best to keep the design like that or split every table for two sperate tabels
couple of my friend suggest splitting the tables as one for saleInvoice and the other for buyInvoice to speed the time for querying
so whats the pro and con of every abrouch i feel that if i split them its like i dont follow DRY rule
i am using Nhibernate BTW so its kindda weird to have to identical class with different names
Both approached would work. If you use the single table approach, then the invoiceType column would be your discriminator field. In your nHibernate mapping, this discriminator field would be used by nHibernate to decide which type (i.e. a purchase or a sale) to instantiate for a given row in the table (see section 5.1.6 of the nHibernate mapping guide. For ad hoc SQL queries or reporting queries, you could create two views, one to return only rows with invoiceType = purchase and one to return only rows where invoiceType=sales.
Alternatively, you could create two separate tables, one for purchase and one for sales. As you point out, these two tables would have nearly identical schemas and nhibernate mapping files.
If you are anticipating very high transaction volumes, you would want to put purchases and sales on two different physical discs. With two different tables, this can be accomplished by putting them into different file groups. With a single table, you still could accomplish this by creating a SQL Server Partitioned Table. Before you go to this trouble, you might want to evaluate if this really is necessary and that disc access to the table is really going to be the performance bottleneck. You don't want to spend a lot of time doing premature optimization if it is not necessary.
My preference would be to have a single table with a discriminator column, to better follow DRY principles. Unless I had solid numbers that indicated that indicated it was necessary, I would hold off implementing a partitioned table until if and when it became necessary.
I'd ask myself, how do I intend to use this information? Will I need sales and buy invoices in the same queries? Am I likely to need specialized information eventually (highly likely in my experience) for each type? And if I do will I need to have child tables for only 1 type? How would that affect referntial integrity? Would a change to one automatically mean I needed a change to the other? How large is the table likely to be (It would have to be in the multi-millions before I would consider that it might need to be split out only due to size). How likely is it that I would get the information mixed up by accident if they are in the same table and include both when I didn't want to? The answers would determine whether I needed to split it out for me. I would tend to see these as two separate functions and it would take alot to convince me to put them in one table.