How to Combine Multiple Nested SQL Tables into One?

First of all, I should preface this by letting you know that I'm a SQL novice - I've never really used SQL Server before, and what I'd like to do must be quite rare or challenging, because I've been unable to find any relevant answers on Stack Overflow or Google.
I'd really, really appreciate your help on this. In the meantime, I'm trying to improve my SQL knowledge myself and find a way to tackle this - but let's get straight to the point.
I'm currently in possession of a SQL Server database (which I browse through SQL Server Management Studio) with 4 tables. Everything's in Greek, so there's no point in writing the real names. The point is that each row in Table 1 is associated with multiple rows in Table 2, which in turn are associated with multiple rows in Table 3, which in turn are associated with multiple rows in Table 4.
My task is to apply AI/machine learning to this multi-instance, multi-label problem, but to do that, I have to produce a single table containing all the information from all four tables.
SQL Server database structure:
4 Tables
3.75 GB
Table 1:
Holds information about tasks
100 columns
400,000 rows
ID is connected to table 2's Research_ID
Table 2:
Each task has multiple sub-tasks (which is what this table holds)
11 columns
2,500,000 rows
ID is connected to table 3's Task_Group_ID
Table 3:
Each sub-task requires things to be bought or changed or thrown away (held in this table)
8 columns
17,000,000 rows
Material_ID connected to table 4's ID
Table 4:
Each material has a certain cost and stuff (held in this table)
12 columns
3,700 rows
The way I see it, this may need to happen in stages, from the bottom up.
For each row in Table 3, there are many associated rows in Table 4; hence, each row in Table 3 gets inserted into the new table as many times as there are rows associated with it in Table 4.
This means a lot of the information will be duplicated and the 3.75 GB will grow much bigger, but that's expected and is what the problem requires.
After this happens for Tables 3 and 4, the same thing needs to happen for Table 2, and then for Table 1. Note that a couple of columns from each table must not be included in the final table. As I understand it, the only thing this changes is listing each column's name in the SELECT instead of using the asterisk (*). Lastly, remember that I need to actually create a new table, because this only needs to happen once, and the result must stay available for months to be read by machine learning programs (WEKA, R, etc.) and programming libraries (Accord.NET, etc.).
The thing is: how can I combine all these tables into one table that persists?
If I've neglected to share any needed information, please inform me and I shall do so as soon as I see the message.

You use joins to get the information. Technically, you can do something like:
SELECT * FROM Table1
JOIN Table2 ON Table2.Table1Id = Table1.ID
JOIN Table3 ON Table3.Table2Id = Table2.ID
Etc. But you end up with repeated columns that can mess things up, so you are better off selecting only the columns you require. The joins here are inner joins, which exclude rows that have no match on the other side, so you might need other types of joins (a LEFT JOIN, for example). The most information comes from a cross join, but that produces the Cartesian product of all the tables involved, so you have the potential of getting back far more than you require.
Here is a link that explains joins in T-SQL: http://www.techonthenet.com/sql_server/joins.php
It is a good place to get started and may answer your question with a little bit of experimentation on your part.
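To make the combined result persist, you can pair the joins with SELECT ... INTO, which creates and populates the new table in one statement. A minimal sketch, assuming the key columns named in the question (Research_ID, Task_Group_ID, Material_ID); the non-key column names are placeholders for whichever columns you decide to keep:
-- Sketch only: ColA, ColB, ColC, ColD, Cost are placeholders - list the real
-- columns you want to keep instead of using *.
SELECT t1.ColA, t1.ColB,
       t2.ColC,
       t3.ColD,
       t4.Cost
INTO FlatTable
FROM Table1 t1
JOIN Table2 t2 ON t2.Research_ID = t1.ID
JOIN Table3 t3 ON t3.Task_Group_ID = t2.ID
JOIN Table4 t4 ON t4.ID = t3.Material_ID;
Since the flattened table will be considerably larger than the original 3.75 GB, it is worth checking free disk space before running it.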

Related

SQLite data comparison on two tables with 2 million rows of data

I've been trying to find a way of comparing a huge amount of data in two different tables, but I am not sure if this is the correct approach.
That's why I am asking here: to understand the problem better and get some clarity in order to solve it.
As the title says, I have two tables with close to 2 million rows of data each, and I need to do a data comparison on them. Basically, I only need to check whether the data in one table matches the data in the other. Each table comes from a separate database, and I've managed to create views so that both have the same column names.
Here is my approach which gives me differences from two tables.
SELECT db1.ADDRESS_ID, db1.address
FROM UAT_CUSTOMERS_DB1 db1
EXCEPT
SELECT db2.ADDRESS_ID, db2.address
FROM UAT_CUSTOMERS_DB2 db2;
This works, but I have two questions:
It seems pretty straightforward, but can someone explain in a bit more depth how this query runs with such speed? Yes, I know - read the docs - but I would really appreciate an alternative answer.
How can I include all the columns from the tables without specifying each column name manually?
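For the second question, a sketch that should work: both sides of a compound SELECT in SQLite only need to return the same number of columns, and the comparison is positional, so SELECT * works when the two views expose the same columns in the same order:
-- Rows present in DB1's view but missing from DB2's view.
SELECT * FROM UAT_CUSTOMERS_DB1
EXCEPT
SELECT * FROM UAT_CUSTOMERS_DB2;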

Understanding a table's structure/schema in SQL

I wanted to reach out to ask if there is a practical way of finding out a given table's structure/schema - e.g., the column names and some example rows inserted into the table (like the head function in Python) - if you only have the table name. I have access to several tables in my current role; however, the person who developed the tables has left my team. I was interested in examining the tables more closely via SQL Assistant in Teradata (these tables often contain hundreds of thousands of rows, hence the issues with hitting CPU exception criteria errors).
I have tried the following select statement, but there is an issue of hitting internal CPU exception criteria limits.
SELECT TOP 10 * FROM dbc.table1;
Thank you in advance for any tips/advice!
You can use one of these commands to get a table's structure details in Teradata:
SHOW TABLE Database_Name.Table_Name;
or
HELP TABLE Database_Name.Table_Name;
Both display the table's structure details.
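For the example-rows part of the question, two cheap options (note that TOP needs a space after it):
SELECT TOP 10 * FROM Database_Name.Table_Name;

-- Alternatively, Teradata's SAMPLE clause pulls an arbitrary handful of rows:
SELECT * FROM Database_Name.Table_Name SAMPLE 10;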

Simple Inner join suggesting an Include index

I have this simple inner join query, and in its execution plan the master table has around 34K records and the detail table has around 51K records. Yet this simple query suggests adding an index with INCLUDE (containing all the master columns I used in the SELECT). I wasn't expecting this - what could be the reason, and what is the remedy?
DECLARE
@StartDrInvDate Date = '2017-06-01',
@EndDrInvDate Date = '2017-08-31'
SELECT
Mastertbl.DrInvoiceID,
Mastertbl.DrInvoiceNo,
Mastertbl.DistributorInvNo,
PreparedBy,
detailtbl.BatchNo, detailtbl.Discount,
detailtbl.TradePrice, detailtbl.IssuedUnits,
detailtbl.FreeUnits
FROM
scmDrInvoices Mastertbl
INNER JOIN
scmDrInvoiceDetails detailtbl ON Mastertbl.DrInvoiceID = detailtbl.DrInvoiceID
WHERE
(Mastertbl.DrInvDate BETWEEN @StartDrInvDate AND @EndDrInvDate)
My real curiosity is why it is suggesting this index - I normally do not see this behavior with larger tables.
For this query:
SELECT m.DrInvoiceID, m.DrInvoiceNo, m.DistributorInvNo,
PreparedBy,
d.BatchNo, d.Discount, d.TradePrice, d.IssuedUnits, d.FreeUnits
FROM scmDrInvoices m INNER JOIN
scmDrInvoiceDetails d
ON m.DrInvoiceID = d.DrInvoiceID
WHERE m.DrInvDate BETWEEN @StartDrInvDate AND @EndDrInvDate;
I would expect the basic indexes to be: scmDrInvoices(DrInvDate, DrInvoiceID) and scmDrInvoiceDetails(DrInvoiceID). This index would allow the query engine to quickly identify the rows that match the WHERE in the master table and then look up the corresponding values in scmDrInvoiceDetails.
The rest of the columns could then be included in either index so the indexes would cover the query. "Cover" means that all the columns are in the index, so the query plan does not need to refer to the original data pages.
The above strategy is what SQL Server is suggesting.
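A sketch of those two indexes with covering INCLUDE lists that mirror the columns this particular query selects (assuming PreparedBy comes from the master table, as the query implies):
CREATE INDEX IX_scmDrInvoices_DrInvDate
ON scmDrInvoices (DrInvDate, DrInvoiceID)
INCLUDE (DrInvoiceNo, DistributorInvNo, PreparedBy);

CREATE INDEX IX_scmDrInvoiceDetails_DrInvoiceID
ON scmDrInvoiceDetails (DrInvoiceID)
INCLUDE (BatchNo, Discount, TradePrice, IssuedUnits, FreeUnits);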
You can perhaps see the logic of why it's suggesting to index the invoice date; it's done some calculation on the number of rows you want out of the number of rows it thinks there are currently, and it appears that the selectivity of an index on that column makes it worth indexing. If you want 3 rows out of 55,000, and you want it every 5 minutes forever, it makes sense to index. Especially if the growth rate of that table means that next year it'll be 3 rows out of 5.5 million.
The INCLUDE recommendation is perhaps more naive: it suggests associating enough additional data with the indexed values that everything this query demands from the master table can be answered from the index alone, without hitting the table. Indexes are essentially pointers to rows in a table; once the query engine has used the index to locate all the rows it needs, it still has to visit the table to fetch the actual data you want. By including data in an index, you remove the need to go to the table. That's sensible sometimes, but not always - creating many indexes that essentially replicate most or all of a table's data for seldom-run queries is a waste of disk space.
Consider, too, that the frequency with which you're running this query now, in a debugging tool, affects SQL Server's opinion of how often the query is used. I routinely find my SQL Azure portal making index recommendations because the devs have run a query over and over while debugging it, when I know that in prod the query will be used once a month - so I discard the recommendation to make an index that includes most of the table, when a straight "index only the columns searched" will do fine, no INCLUDE necessary.
These recommendations thus shouldn't be blindly heeded, as SQL Server cannot know what you intend to use this query (or similar queries) for in real-world applications. Index creation and maintenance should be done carefully and thoughtfully. For example, this query may be asking for this index, while another query would want an index on a different column - but it might make sense to create one index keyed on both columns (in a particular order), and then, in whichever query searches on the column that is indexed second, include a predicate that hits the first indexed column regardless of whether the query needs it.
Example: in your invoices table you have a column indicating whether an invoice is paid or not, and somewhere else in your app another query counts the number of unpaid invoices. You can either have two indexes - one on invoice date (for this query) and one on status (for that query) - or one index on both columns (status, date), and in this query use predicates of WHERE status = 'unpaid' AND date BETWEEN ..., even though the status predicate is redundant. Why might it be redundant? Suppose you know you'll only ever be choosing invoices from last week that have not been sent out yet, so they can only ever be unpaid. This is what I mean by "be thoughtful about indexing" - you know lots about your app that SQL Server can never figure out. By including the logically redundant status column in the "get invoices from last week" query, you allow the query engine to use an index ordered first by status, then by date. This means you can get away with maintaining only one index, and it can be used by two queries.
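A hypothetical sketch of that trade-off (table and column names invented for illustration):
-- One composite index serving two queries.
CREATE INDEX IX_Invoices_Status_Date ON Invoices (Status, InvDate);

-- Query A: seeks on the leading Status column.
SELECT COUNT(*) FROM Invoices WHERE Status = 'unpaid';

-- Query B: the Status predicate is logically redundant, but it lets the
-- engine seek the (Status, InvDate) index instead of scanning.
SELECT *
FROM Invoices
WHERE Status = 'unpaid'
AND InvDate BETWEEN DATEADD(DAY, -7, GETDATE()) AND GETDATE();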
Index creation and maintenance can be a full-time job... ;)

Performance Improvement on a Large SQL Table

I have a 260-column table in SQL Server. When we run SELECT COUNT(*) FROM table, it takes almost 5-6 minutes to get the count. The table contains close to 90-100 million records, and more than 50% of the columns contain NULL. Apart from that, users can also build dynamic SQL queries against the table from the UI, so searching 90-100 million records takes time to return results. Is there a way to improve the find functionality on a SQL table where the filter criteria can be anything? Can anyone suggest the fastest way to get aggregate data on 25 GB of data? The UI should not hang or time out.
Investigate horizontal partitioning. This will really only help query performance if you can force users to put the partitioning key into the predicates.
Try vertical partitioning, where you split one 260-column table into several tables with fewer columns. Put all the values which are commonly required together into one table. The queries will only reference the table(s) which contain columns required. This will give you more rows per page i.e. fewer pages per query.
You have a high fraction of NULLs. Sparse columns may help, but calculate your percentages as they can hurt if inappropriate. There's an SO question on this.
Filtered indexes and filtered statistics may be useful if the DB often runs similar queries.
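For instance, a hypothetical filtered index (names invented) that covers only the populated minority of rows, keeping the index small:
CREATE INDEX IX_BigTable_ColA
ON dbo.BigTable (ColA)
WHERE ColA IS NOT NULL;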
As the guys state in the comments, you need to analyse a few of the queries and see which indexes would help you the most. If your queries do a lot of text searching, you could use the full-text search feature of SQL Server; its documentation is a good reference with examples.
Things that come to mind:
[SQL Server 2012+] If you are using SQL Server 2012 or later, you can use the new Columnstore Indexes (see the sketch after this list).
[SQL Server 2005+] If you are filtering a text column, you can use Full-Text Search.
If there is some function that you apply frequently to some column (like SOUNDEX of a column, for example), you could create a PERSISTED computed column so the value doesn't have to be computed every time.
Use temp tables (indexed ones will be much better) to reduce the number of rows to work on.
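A sketch of the columnstore idea (table and column names are placeholders):
-- A nonclustered columnstore index speeds up scans and aggregations over
-- wide tables. Caveat: on SQL Server 2012 it makes the table read-only
-- until it is dropped or disabled.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_BigTable_CS
ON dbo.BigTable (Col1, Col2, Col3);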
@Twelfth's comment is very good:
"I think you need to create an ETL process and start changing this into a fact table with dimensions."
Changing my comment into an answer...
You are moving from a transactional world, where these 90-100 million records are recorded, into a data-warehousing scenario, where you are now trying to slice, dice, and analyze the information you have. There is no easy solution, but odds are you're hitting the limits of what your current system can scale to.
In a past job, I had several (6) data fields belonging to each record that were pretty much free text and randomly populated depending on where the data was generated (they were search queries, and people were entering what they would basically enter in Google). With 6 fields like this... I created a dim_text table that took each entry in any of these 6 fields and replaced it with an integer. This left me a table with two columns, text_ID and text. Any time a user searched for a specific entry in any of these 6 columns, I would search my dim_text table, which was optimized (indexed) for this sort of query, to return an integer matching the query I wanted... I would then take that integer and search for all occurrences of it across the 6 fields instead. Searching one table highly optimized for this type of free-text search, and then querying the main table for instances of the integer, is far quicker than searching 6 free-text fields directly.
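A hypothetical sketch of that text-dimension trick (all names invented):
-- One row per distinct text value; the UNIQUE constraint creates the index
-- that makes the value-to-integer lookup fast.
CREATE TABLE dim_text (
    text_ID INT IDENTITY(1,1) PRIMARY KEY,
    [text] NVARCHAR(400) NOT NULL UNIQUE
);

-- Resolve the search term to its integer once...
DECLARE @id INT = (SELECT text_ID FROM dim_text WHERE [text] = N'some search term');

-- ...then search six small integer columns instead of six free-text columns.
SELECT *
FROM main_table
WHERE @id IN (field1_ID, field2_ID, field3_ID, field4_ID, field5_ID, field6_ID);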
I'd also create aggregate tables ("reporting tables", if you prefer the term) for your common aggregates. There are quite a few options here that your business setup will determine... For example, if each row is an item on a sales invoice and you need to show sales by date, it may be better to aggregate total sales by invoice and save that to a table; then, when a user wants totals by day, an aggregate is run on the aggregate of the invoices to determine the totals by day (so you've "partially" aggregated the data in advance).
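A hypothetical sketch of that partial pre-aggregation (names invented):
-- Stage 1, run in advance: roll line items up to invoice level.
SELECT InvoiceID, InvoiceDate, SUM(LineAmount) AS InvoiceTotal
INTO agg_invoice_totals
FROM sales_lines
GROUP BY InvoiceID, InvoiceDate;

-- Stage 2, at query time: totals by day come from the much smaller table.
SELECT InvoiceDate, SUM(InvoiceTotal) AS DayTotal
FROM agg_invoice_totals
GROUP BY InvoiceDate;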
Hope that makes sense...I'm sure I'll need several edits here for clarity in my answer.

Increase Search Performance on a Master Table by Creating a Second One? (SQL Server)

Had trouble choosing a good title for this, so it's better that I just explain it:
I have one table (T1 from now on). T1 is a master table with a bunch of columns but only one real index (namely, the ID column). This table is really important, and I must not touch or alter any of its schema/constraints/indexes/etc., since it's modified and maintained by third-party software. To prevent any kind of unpredictable behavior, I try to avoid modifying or adding anything directly in the table (so this one is mostly used for queries).
Now, T1 has 3 columns: Group1, Group2 and Group3. I noticed lately that I tend to run a good number of queries searching and grouping by these 3 values as keys, but since they are not indexed, they hurt query performance by a large margin, and I wanted to improve performance when using these columns.
Here is my question:
Is it a good idea to make a second table (T2), created and maintained by me, where I include only the ID (primary key) of T1 along with these 3 Group columns (indexed), then join T1 with T2 and perform any filter operations on T2 (ignoring the Groups on T1 and taking only the ID)?
Edit: Reading about indexed views, I found this bit:
"You can’t modify the underlying tables and columns. The view is created with the WITH SCHEMABINDING option."
Does that mean I can't modify the Group columns in T1?
Edit 2: I just went with the duplicate table (space is cheap), and I did see an amazing increase in performance the second time I did a search using one of the groups as a filter parameter: from 40-50 secs down to 7-10 secs. I have to see if I can reduce that a little more (love how SQL can be just like a sprint - every extra second you shave counts :D).
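For reference, a sketch of what that side table might look like (column types and the @SomeGroup parameter are assumptions):
-- T2 mirrors only T1's key and the three Group columns.
CREATE TABLE T2 (
    ID INT NOT NULL PRIMARY KEY,  -- same value as T1.ID
    Group1 INT NOT NULL,
    Group2 INT NOT NULL,
    Group3 INT NOT NULL
);
CREATE INDEX IX_T2_Groups ON T2 (Group1, Group2, Group3);

-- Filter on the indexed copy, then join back to T1 for the other columns.
SELECT t1.*
FROM T1 t1
JOIN T2 t2 ON t2.ID = t1.ID
WHERE t2.Group1 = @SomeGroup;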