Developing SCV using SQL

I am trying to identify all related records using IDs from two different systems.
I have seen solutions that match SystemA to SystemB and back to SystemA, but obviously this will not pick up everything.
The table below shows that 1-A is seemingly unrelated to 4-C; however, when we chain the pairs together we can see that all of the records below are related, and the latest ID combination is 4-C.
| SystemA_ID | SystemB_ID | Date     | PrimaryA | PrimaryB |
|------------|------------|----------|----------|----------|
| 1          | A          | 1/1/2016 | 4        | C        |
| 2          | A          | 2/1/2016 | 4        | C        |
| 2          | B          | 3/1/2016 | 4        | C        |
| 3          | B          | 4/1/2016 | 4        | C        |
| 3          | C          | 5/1/2016 | 4        | C        |
| 4          | C          | 6/1/2016 | 4        | C        |
What I need is to populate the PrimaryA and PrimaryB columns with 4 and 'C' respectively.
I was thinking of doing a double loop similar to the solution described here.
However, I could not get it working, and there may well be a better solution.
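One set-based way to approach this is a recursive CTE that walks from every row to every later row sharing either ID, then takes the most recent pair reached. A minimal T-SQL sketch, assuming the rows live in a table named Matches with a MatchDate column (both names assumed; the other columns are as shown above), and assuming every link in a chain carries a later date than the link before it, as in the sample data:

WITH Chain AS (
    -- anchor: every row reaches itself
    SELECT SystemA_ID AS StartA, SystemB_ID AS StartB,
           SystemA_ID, SystemB_ID, MatchDate
    FROM Matches
    UNION ALL
    -- step: follow any later row that shares either ID
    SELECT c.StartA, c.StartB,
           m.SystemA_ID, m.SystemB_ID, m.MatchDate
    FROM Chain c
    JOIN Matches m
      ON (m.SystemA_ID = c.SystemA_ID OR m.SystemB_ID = c.SystemB_ID)
     AND m.MatchDate > c.MatchDate  -- only move forward in time, so the recursion terminates
),
Latest AS (
    -- for each starting pair, keep the most recent pair it can reach
    SELECT StartA, StartB, SystemA_ID, SystemB_ID,
           ROW_NUMBER() OVER (PARTITION BY StartA, StartB
                              ORDER BY MatchDate DESC) AS rn
    FROM Chain
)
UPDATE m
SET m.PrimaryA = l.SystemA_ID,
    m.PrimaryB = l.SystemB_ID
FROM Matches m
JOIN Latest l
  ON l.StartA = m.SystemA_ID
 AND l.StartB = m.SystemB_ID
 AND l.rn = 1;

If a chain can double back in time, the date guard no longer guarantees full connectivity, and you would need real cycle detection (e.g. carrying a visited-path string in the CTE) or an iterative label-propagation update instead.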

Related

Is it a good idea to have SQL table entries refer to other ids in the same table?

I'm designing a table for product categories for a kinda-e-commerce site. The table currently looks a bit like this:
+----+-------------+-------+-------------+-----------+
| id | name        | level | value       | parent_id |
+----+-------------+-------+-------------+-----------+
| 1  | Food        | 0     | food        | NULL      |
| 2  | Phone       | 0     | phone       | NULL      |
| 3  | Thing       | 0     | thing       | NULL      |
| 4  | Pasta       | 1     | pasta       | 1         |
| 5  | Apple       | 1     | apple       | 2         |
| 6  | SubThing    | 1     | subthing    | 3         |
| 7  | Tagliatelle | 2     | tagliatelle | 4         |
| 8  | iPhone 11   | 2     | iphone_11   | 5         |
| 9  | SubSubThing | 2     | subsubthing | 6         |
+----+-------------+-------+-------------+-----------+
Basically I don't want to create a whole new table and map the relationships every time people want to add a new sub-level to the category structure; I rely on the level and parent_id columns to let my code know what to do with a category and what its parent is. I'm completely new to data model design, and this is the best I could come up with. Is there any downside to this self-referencing structure that I'm just too much of a noob to realize?
If you are certain each child will only ever be referenced by that single parent row, then the design should suffice. You may run into issues if a child element needs to roll up into more than one parent entity.
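Worth noting in favour of the design: the whole tree, including the level value itself, can be derived on demand with a recursive CTE, so new sub-levels cost nothing structurally. A sketch, assuming MySQL 8+ / PostgreSQL syntax and that the table is named category:

WITH RECURSIVE tree AS (
    SELECT id, name, parent_id, 0 AS depth
    FROM category
    WHERE parent_id IS NULL            -- start from the top-level categories
    UNION ALL
    SELECT c.id, c.name, c.parent_id, t.depth + 1
    FROM category c
    JOIN tree t ON c.parent_id = t.id  -- attach each child under its parent
)
SELECT * FROM tree
ORDER BY depth, id;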

Looking up parent item based on a bill of materials

I'm trying to figure out how to put together a SQL statement that will let me find an end-item in our database based on its bill of materials. I guess you could say this is like a reverse BOM lookup question.
My table structure is pretty simple.
- End-item table
- Component table
- Linking table to tie multiple components to an end-item record
The data I have is just the component list, and I want to find the end item. Since every bill of materials is unique, it has to match the bill of materials perfectly, i.e. the exact number of components and exact matches on the component SKU numbers. In some cases two end-items might use all the same components, but one of them uses an extra part or two that makes the end-item SKU number different, so it has to account for that. That is, again, it has to match the BOM perfectly.
If not an outright answer, could someone at least steer me on the correct path to finding one?
------ UPDATE ----------
Table structure would be something like this.
ManufacturedPart
,---------------------,
| ID  | PART_NUM      |
|---------------------|
| 1   | V3175-01      |
| 2   | V3367-01      |
| 3   | V3988-01      |
| 4   | V3175-CV      |
`---------------------`
Component
,---------------------,
| ID  | COMP_NUM      |
|---------------------|
| 1   | V3175         |
| 2   | V3367         |
| 3   | V3369         |
| 4   | V3114         |
| 5   | V3370         |
| 6   | V4060         |
| 7   | V3550         |
| 8   | V3988         |
`---------------------`
ManufacturedComponent
,-------------------------------------------,
| ID  | MANUFACTURED_PART_ID | COMPONENT_ID |
|-------------------------------------------|
| 1   | 1                    | 1            |
| 2   | 1                    | 4            |
| 3   | 1                    | 6            |
| 4   | 2                    | 2            |
| 5   | 2                    | 3            |
| 6   | 2                    | 5            |
| 7   | 2                    | 7            |
| 8   | 3                    | 1            |
| 9   | 3                    | 8            |
| 10  | 4                    | 1            |
| 11  | 4                    | 4            |
`-------------------------------------------`
Assuming I have only the COMP_NUMs (component numbers) to search with, I want to match back to the ManufacturedPart that contains exactly that list of components.
Some examples: if I have components V3175, V3114, and V4060, it should match back to the V3175-01 manufactured part. But if I only have components V3175 and V3114, it should match back to the V3175-CV manufactured part. If I have components V3367, V3369, V3370, and V3550, it should match back to manufactured part V3367-01.
I have no SQL written at all yet, as I'm unsure of how to break the problem down.
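This shape of problem is known as exact relational division: find the parts whose component set equals the search set, nothing more and nothing less. A sketch, assuming the search list has been loaded into a temporary table #search(COMP_NUM) (SQL Server-style temp table; adapt for your RDBMS):

SELECT mp.ID, mp.PART_NUM
FROM ManufacturedPart mp
JOIN ManufacturedComponent mc ON mc.MANUFACTURED_PART_ID = mp.ID
JOIN Component c              ON c.ID = mc.COMPONENT_ID
LEFT JOIN #search s           ON s.COMP_NUM = c.COMP_NUM
GROUP BY mp.ID, mp.PART_NUM
HAVING COUNT(*) = (SELECT COUNT(*) FROM #search) -- the part has exactly as many components as were searched for
   AND COUNT(s.COMP_NUM) = COUNT(*);             -- and every one of them matched a search term

With #search holding V3175 and V3114, only V3175-CV survives both HAVING conditions; V3175-01 fails the first one because it has a third component (V4060).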

Keep newest duplicate row depending on multiple columns

I seem to have a workflow problem doing sophisticated duplicate-row cleaning with OpenRefine (Google Refine 2.5 [r2407]). All I have found so far is how to delete duplicate rows based on a single column.
My aim is to delete duplicate rows based on multiple columns, at best, in a specific hierarchy.
Example
Given the following dummy data in Refine
+----+---------+---------+--------+------------+------+-----------------------------------+
| id | timeAgo | title   | author | date       | val1 | [After Refine, keep Record]       |
+----+---------+---------+--------+------------+------+-----------------------------------+
| 1  | 10      | Faust   | Mr. A  | 2014-01-15 | 10   | ->B, older entry                  |
| 2  | 11      | Faust   | Mr. A  | 2014-01-21 | 10   | A (because of Date)               |
| 3  | 8       | Faust   | Mr. A  | 2014-01-15 | 10   | B                                 |
| 4  | 8       | RedHead | Mr. B  | 2014-01-21 | 34   | ->D, older entry                  |
| 5  | 7       | RedHead | Mr. B  | 2014-01-21 | 34   | ->D, same timeAgo, but lower ID   |
| 6  | 7       | RedHead | Mr. A  | 2014-01-01 | 13   | C (because of author, date, val1) |
| 7  | 7       | RedHead | Mr. B  | 2014-01-21 | 34   | D                                 |
+----+---------+---------+--------+------------+------+-----------------------------------+
I want to kill the duplicate rows based on the following logic. If
title && author && date && val1 are the same, then
keep the newest (least timeAgo) row; if there are multiple, then
keep the one with the highest id.
The result would be:
+---------+----+---------+---------+--------+------------+------+
| Refined | id | timeAgo | title   | author | date       | val1 |
+---------+----+---------+---------+--------+------------+------+
| A       | 2  | 11      | Faust   | Mr. A  | 2014-01-21 | 10   |
| B       | 3  | 8       | Faust   | Mr. A  | 2014-01-15 | 10   |
| C       | 6  | 7       | RedHead | Mr. A  | 2014-01-01 | 13   |
| D       | 7  | 7       | RedHead | Mr. B  | 2014-01-21 | 34   |
+---------+----+---------+---------+--------+------------+------+
Easy Approach?
If there is no other solution, I'll thankfully take a scripting/GREL one.
But could it be done with Refine's famous workflow "recording", so that the above logic could be extracted and applied to other datasets of the same format?
My motivation behind this is to enable employees to work more thoughtfully with data (beyond Excel), but without confronting them right away with a full-blown scripting language.
That sounds like a straightforward sorting problem:
1. Sort the records by title, author, timeAgo, and ID
2. Re-order rows permanently (IMPORTANT: it won't work if you forget this step)
3. Blank down on Title & Author
4. Move those two columns to the two left-most positions
5. Join multivalued cells on the remaining columns
6. Transform all the columns from step 5 using value.split(',')[0] to extract the first value (which should be the value for the record you want, if you sorted them in the right order)
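As an aside, if this data ever lands in a database rather than Refine, the same keep-newest rule is a single window-function query. A sketch, assuming the rows sit in a table named entries with the columns from the example (date may need quoting as an identifier in your dialect):

with ranked as (
    select id, timeAgo, title, author, date, val1,
           row_number() over (
               partition by title, author, date, val1  -- the duplicate key
               order by timeAgo asc, id desc           -- newest first, then highest id
           ) as rn
    from entries
)
select id, timeAgo, title, author, date, val1
from ranked
where rn = 1;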

Possible drawbacks of my pagination technique, and how can I improve it?

I want to perform pagination for my web page. The method that I am using (and the one I found most often on the internet) is explained below with an example.
Suppose I have the following table user
+----+------+----------+
| id | name | category |
+----+------+----------+
| 1  | a    | 1        |
| 2  | b    | 2        |
| 3  | c    | 2        |
| 4  | d    | 3        |
| 5  | e    | 1        |
| 6  | f    | 3        |
| 7  | g    | 1        |
| 8  | h    | 3        |
| 9  | i    | 2        |
| 10 | j    | 2        |
| 11 | k    | 1        |
| 12 | l    | 3        |
| 13 | m    | 3        |
| 14 | n    | 3        |
| 15 | o    | 1        |
| 16 | p    | 1        |
| 17 | q    | 2        |
| 18 | r    | 1        |
| 19 | s    | 3        |
| 20 | t    | 3        |
| 21 | u    | 3        |
| 22 | v    | 3        |
| 23 | w    | 1        |
| 24 | x    | 1        |
| 25 | y    | 2        |
| 26 | z    | 2        |
+----+------+----------+
And I want to show information about category 3 users, with 2 users per page. I am using the following query for this:
select * from user where category=3 limit 0,2;
+----+------+----------+
| id | name | category |
+----+------+----------+
| 4  | d    | 3        |
| 6  | f    | 3        |
+----+------+----------+
and for the next two:
select * from user where category=3 limit 2,2;
+----+------+----------+
| id | name | category |
+----+------+----------+
| 8  | h    | 3        |
| 12 | l    | 3        |
+----+------+----------+
and so on.
Now, in practice I have around 7,000 tuples in a single table. So is there any better way to do this in terms of speed, or any drawback this method may have?
Thanks.
You don't want to fetch more values than your current page can handle, so yes, you will essentially be making one query per page. Some other solutions (such as Rails will_paginate) will execute essentially the same queries.
Now, you could build some logic into your client side to do the pagination there - prefetch multiple (or all) pages at once and store them on the client side. This way pagination is handled completely on the client side without need for further queries. It is a bit wasteful if a user is likely to only look at a small percentage of pages overall though.
If your actual production table has more columns in it, you could select only the relevant columns instead of *, or potentially add some sort of order by (for sorting).
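One speed-oriented alternative worth naming is keyset ("seek") pagination: instead of a growing offset, remember the last id of the previous page and filter past it, so MySQL can seek straight to the start of the page through the primary key rather than scanning and discarding the offset rows. A sketch against the sample table:

-- first page
select id, name, category
from user
where category = 3
order by id
limit 2;

-- next page: pass in the last id seen on the previous page (6 here)
select id, name, category
from user
where category = 3 and id > 6
order by id
limit 2;

The trade-off is that you can only step page by page (or need the boundary id for an arbitrary page), whereas limit offset, count can jump anywhere.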
I hope this helps. Put your page number in place of your_page_number, and the records per page in place of records_per_page, which in your sample is 2:
select A.* from
(select @row := @row + 1 as row_num, user.* from user
 join (select @row := 0) row_temp_view  -- user variables need @; # starts a comment in MySQL
 where category = 3
) A
where A.row_num
between (your_page_number * records_per_page) - records_per_page + 1
and your_page_number * records_per_page;
Notice that this will fetch you the right records, where your sample will not. Your sample always fetches two records, which is not always what you want: say you have 3 users to show across two pages; your sample would show the first and the second on the first page and the second and the third on the second page, which is not right. My code will show the first and the second on the first page, and only the third on the second page.
You can use DataTables. It's meant for exactly the thing you are looking for. I successfully use it for paginating more than a million rows; it's very fast and easy to implement.

Quickly calculating running totals in SQL Server using set-based operations

I have some data that looks like this:
+---+--------+-------------+---------------+--------------+
|   | A      | B           | C             | D            |
+---+--------+-------------+---------------+--------------+
| 1 | row_id | disposal_id | excess_weight | total_weight |
| 2 | 1      | 1           | 0             | 30           |
| 3 | 2      | 1           | 10            | 30           |
| 4 | 3      | 1           | 0             | 30           |
| 5 | 4      | 2           | 5             | 50           |
| 6 | 5      | 2           | 0             | 50           |
| 7 | 6      | 2           | 15            | 50           |
| 8 | 7      | 2           | 5             | 50           |
| 9 | 8      | 2           | 5             | 50           |
+---+--------+-------------+---------------+--------------+
And I am transforming it to look like this:
+---+--------+-------------+---------------+--------------+
|   | A      | B           | C             | D            |
+---+--------+-------------+---------------+--------------+
| 1 | row_id | disposal_id | excess_weight | total_weight |
| 2 | 1      | 1           | 0             | 30           |
| 3 | 2      | 1           | 10            | 30           |
| 4 | 3      | 1           | 0             | 20           |
| 5 | 4      | 2           | 5             | 50           |
| 6 | 5      | 2           | 0             | 45           |
| 7 | 6      | 2           | 15            | 45           |
| 8 | 7      | 2           | 5             | 30           |
| 9 | 8      | 2           | 5             | 25           |
+---+--------+-------------+---------------+--------------+
Basically, I need to update the total_weight column by subtracting the sum of the excess_weights from previous rows in the table which belong to the same disposal_id.
I'm currently using a cursor because it's faster than the other solutions I've tried (CTE, triangular join, CROSS APPLY). My cursor solution keeps a running total that is reset to zero for each new disposal_id, increments it by the excess weight, and performs updates when needed; it runs in about 40 seconds. The other solutions I've tried took anywhere from 3-5 minutes, and I'm wondering if there is a relatively performant way to do this using set-based operations?
I've spent a lot of time optimizing such queries, ended up with two performant options: either store precalculated running totals, as described in Denormalizing to enforce business rules: Running Totals, or calculate them on the client, which is also fast and easy.
The other solution you probably already tried is to do something like the answers found here
Unless you are using Oracle, which has decent aggregates for cumulative sums, you're better off using a cursor. At best, you're going to have to rejoin the table to itself or use other methods for what should be an O(n) operation. In general, the set-based solutions for problems like these are messy or really messy.
"Previous rows" implies an ordering, so no: no set-based operations there.
Oracle's LEAD and LAG are built for this, but SQL Server forces you into triangular joins... which I suppose you have investigated.
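For what it's worth, the claim that set-based solutions must be messy predates SQL Server 2012; from that version on, a windowed SUM makes this a clean set-based operation. A sketch, assuming the data is in a table named disposals with the columns shown above:

-- subtract the excess_weight of all strictly earlier rows in the same disposal
select row_id, disposal_id, excess_weight,
       total_weight - coalesce(
           sum(excess_weight) over (
               partition by disposal_id
               order by row_id
               rows between unbounded preceding and 1 preceding
           ), 0) as adjusted_total_weight
from disposals;

The coalesce covers the first row of each disposal, whose window frame is empty; to update in place, wrap the select in a CTE and run the UPDATE through it.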