sequential numbering of rows - sql

I have to customize a SQL view to be able to perform a system integration test. One table keeps track of the transactions and the LINE ITEMS per transaction (for example, transaction 2 has three items, so there are three consecutive rows corresponding to the same transaction number). What I need is a column that numbers the items within THAT transaction, always starting from 1 for each transaction.
For example, the first column is TransactionNumber and the second is DesiredOutput:
1 1
1 2
2 1
2 2
2 3
3 1
4 1
4 2
ETC...
I know how to number consecutive rows on the WHOLE table, but I cannot find any reference to this numbering that depends on a value on another row.

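For SQL Server 2005+ and Oracle, you can use ROW_NUMBER() with PARTITION BY (Something stands for whichever column defines the order of the line items within a transaction):
SELECT TransactionNumber,
       ROW_NUMBER() OVER (PARTITION BY TransactionNumber ORDER BY Something) AS DesiredOutput
FROM YourTable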
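As a quick check of the answer above, here is a minimal sketch that runs the same kind of query against the sample data from the question. It uses Python's built-in sqlite3 module purely as a convenient harness (an assumption, since the question is about SQL Server or Oracle) and needs SQLite 3.25 or later for window functions. The LineId column is hypothetical and stands in for "Something".

```python
import sqlite3

# In-memory database; LineId is a hypothetical column that fixes the item order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable (LineId INTEGER PRIMARY KEY, TransactionNumber INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable (TransactionNumber) VALUES (?)",
    [(1,), (1,), (2,), (2,), (2,), (3,), (4,), (4,)],
)

# Number the line items inside each transaction, restarting at 1 per transaction.
numbered = conn.execute(
    """
    SELECT TransactionNumber,
           ROW_NUMBER() OVER (PARTITION BY TransactionNumber ORDER BY LineId) AS DesiredOutput
    FROM YourTable
    ORDER BY TransactionNumber, DesiredOutput
    """
).fetchall()

for row in numbered:
    print(row)
```

The printed pairs should match the TransactionNumber / DesiredOutput listing in the question.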

Windowing function in Hive

I am exploring windowing functions in Hive and I understand what each of the functions does. However, I do not understand the partition by and order by clauses that are used with them. The following query is very similar in structure to the one I am planning to build.
SELECT a, RANK() OVER(partition by b order by c) as d from xyz;
Just trying to understand the background process involved for both keywords.
Appreciate the help :)
The RANK() analytic function assigns a rank to each row within each partition of the dataset.
The PARTITION BY clause determines how rows are distributed into partitions (and, in Hive, between reducers).
ORDER BY determines how rows are sorted within each partition.
The first phase is distribution: all rows in the dataset are distributed into partitions. In map-reduce, each mapper groups rows according to the partition by and produces files for each partition. The mapper also does an initial sort of its partition parts according to the order by.
In the second phase, all rows are sorted inside each partition.
In map-reduce, each reducer gets the partition files (parts of partitions) produced by the mappers and sorts the rows of the whole partition (merging the partially sorted results) according to the order by.
In the third phase, the rank function assigns a rank to each row in a partition. The rank function is initialized anew for each partition.
The first row in a partition gets rank 1. Each following row with a new order by value gets a rank equal to its position in the partition, while rows with equal order by values share the same rank. As a result, the rank that follows a group of tied rows is not consecutive.
Different partitions can be processed in parallel on different reducers. Small partitions can be processed on the same reducer. The rank function re-initializes when it crosses a partition boundary and starts with rank=1 for each partition.
Example (rows are already partitioned and sorted inside partitions):
SELECT a, RANK() OVER(partition by b order by c) as d from xyz;
a, b, c, d(rank)
----------------
1 1 1 1 --starts with 1
2 1 1 1 --the same c value, the same rank=1
3 1 2 3 --rank 2 is skipped because second row shares the same rank as first
4 2 3 1 --New partition starts with 1
5 2 4 2
6 2 5 3
If you need consecutive ranks, use the dense_rank function. dense_rank produces rank=2 for the third row in the above dataset.
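To compare the two functions side by side, the following sketch loads the six example rows and computes both RANK() and DENSE_RANK() over the same window. It runs on SQLite through Python's sqlite3 module rather than on Hive, which is an assumption made only so the example is easy to execute; SQLite 3.25 or later is required for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO xyz VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 2, 3), (5, 2, 4), (6, 2, 5)],
)

# Same partitioning and ordering for both functions; only the tie handling differs.
ranks = conn.execute(
    """
    SELECT a,
           RANK()       OVER (PARTITION BY b ORDER BY c) AS rnk,
           DENSE_RANK() OVER (PARTITION BY b ORDER BY c) AS dense_rnk
    FROM xyz
    ORDER BY a
    """
).fetchall()

for a, rnk, dense_rnk in ranks:
    print(a, rnk, dense_rnk)
```

For the row with a=3, rnk is 3 (rank 2 is skipped after the tie) while dense_rnk is 2.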
The row_number function assigns a position number to each row in the partition, starting with 1. Rows with equal values receive different consecutive numbers.
SELECT a, ROW_NUMBER() OVER(partition by b order by c) as d from xyz;
a, b, c, d(row_number)
----------------
1 1 1 1 --starts with 1
2 1 1 2 --the same c value, row number=2
3 1 2 3 --row position=3
4 2 3 1 --New partition starts with 1
5 2 4 2
6 2 5 3
Important note: for rows with equal order by values, row_number and similar analytic functions may behave non-deterministically and produce different numbers from run to run. In the dataset above, the first row may receive number 2 and the second row number 1, or vice versa, because their relative order is undefined unless you add one more column (a) to the order by clause. With that tie-breaker, every row gets the same row_number on every run, because the order by values are all distinct.
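The tie-breaker described in the note can be sketched as follows. Adding column a to the ORDER BY makes every sort key unique, so the numbering no longer depends on how the engine happens to order the tied rows. As before, SQLite (3.25+) via Python's sqlite3 stands in for Hive here, which is an assumption for the sake of a runnable example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO xyz VALUES (?, ?, ?)",
    [(1, 1, 1), (2, 1, 1), (3, 1, 2), (4, 2, 3), (5, 2, 4), (6, 2, 5)],
)

# ORDER BY c, a: column a breaks the tie between the two rows with c = 1.
row_numbers = conn.execute(
    """
    SELECT a,
           ROW_NUMBER() OVER (PARTITION BY b ORDER BY c, a) AS rn
    FROM xyz
    ORDER BY a
    """
).fetchall()

for a, rn in row_numbers:
    print(a, rn)
```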

SQL to return records that do not have a complete set according to a second table

I have two tables. I want to find the erroneous records in the first table, based on the fact that they aren't a complete set as determined by the second table. For example:
custID service transID
1 20 1
1 20 2
1 50 2
2 49 1
2 138 1
3 80 1
3 140 1
comboID combinations
1 Y00020Y00050
2 Y00049Y00138
3 Y00020Y00049
4 Y00020Y00080Y00140
So in this example I would want a query to return the first row of the first table because it does not have a matching 49 or 50 or (80 and 140), and the last two rows as well (because there is no 20). The second transaction is fine, and the second customer is fine.
I couldn't figure this out with a query, so I wound up writing a program that loads the services per customer and transid into an array, iterates over them, and ensures that there is at least one matching combination record where all the services in the combination are present in the initially loaded array. Even that came off as hamfisted, but it was less of a nightmare than the awkward outer joining of multiple joins I was trying to accomplish with SQL.
Taking a step back, I think I need to restructure the combinations table into something more accommodating, but I still can't think of what the approach would be.
I do not have DB2, so I tested this on Oracle; the listagg function should be available in DB2 as well. The table service is the first table and comb is the second one. I assume the services inside each combinations value are sorted in ascending order. Each service is left-padded to five digits so that, for example, 138 becomes Y00138 as in the combinations column.
select service.*
from service
join
(
    select S.custid, S.transid
    from
    (
        select custid, transid,
               listagg('Y' || lpad(service, 5, '0')) within group (order by service) as agg
        from service
        group by custid, transid
    ) S
    where not exists
    (
        select *
        from comb
        where S.agg = comb.combinations
    )
) NOT_F on NOT_F.custid = service.custid and NOT_F.transid = service.transid
I dare to say that your database design does not conform to the first normal form since the combinations column is not atomic. Think about it.
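One possible way to restructure the combinations table, sketched here under stated assumptions, is to store one row per (comboID, service) pair and compare sets by counting matches. The table names service_line and combo_service are hypothetical, the sketch assumes a service never repeats within a transaction or a combination, and it keeps the same exact-match rule as the listagg query above. It runs on SQLite through Python's sqlite3 module only to make it easy to try out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE service_line (custID INTEGER, service INTEGER, transID INTEGER)")
conn.execute("CREATE TABLE combo_service (comboID INTEGER, service INTEGER)")
conn.executemany(
    "INSERT INTO service_line VALUES (?, ?, ?)",
    [(1, 20, 1), (1, 20, 2), (1, 50, 2), (2, 49, 1), (2, 138, 1), (3, 80, 1), (3, 140, 1)],
)
# Y00020Y00050 becomes (1, 20), (1, 50), and so on for each combination.
conn.executemany(
    "INSERT INTO combo_service VALUES (?, ?)",
    [(1, 20), (1, 50), (2, 49), (2, 138), (3, 20), (3, 49), (4, 20), (4, 80), (4, 140)],
)

# A transaction is complete when some combination has the same number of
# services as the transaction and every one of them matches.
incomplete = conn.execute(
    """
    WITH trans AS (
        SELECT custID, transID, COUNT(*) AS n
        FROM service_line GROUP BY custID, transID
    ),
    combo AS (
        SELECT comboID, COUNT(*) AS n
        FROM combo_service GROUP BY comboID
    ),
    matched AS (
        SELECT s.custID, s.transID, c.comboID, COUNT(*) AS n
        FROM service_line s
        JOIN combo_service c ON c.service = s.service
        GROUP BY s.custID, s.transID, c.comboID
    )
    SELECT s.custID, s.service, s.transID
    FROM service_line s
    WHERE NOT EXISTS (
        SELECT 1
        FROM matched m
        JOIN trans t ON t.custID = m.custID AND t.transID = m.transID
        JOIN combo c ON c.comboID = m.comboID
        WHERE m.custID = s.custID AND m.transID = s.transID
          AND m.n = t.n AND m.n = c.n
    )
    ORDER BY s.custID, s.transID, s.service
    """
).fetchall()

for row in incomplete:
    print(row)
```

With the sample data this returns the first row and the last two rows of the first table, as the question expects.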

Selecting Rows and removing duplicates based on some fields based on two fields and limit to Top Ten?

Having this table:
Row Athlete Event Mark Meet
1 1 3 10 A
2 2 2 5 A
3 3 3 3 A
4 4 4 7 A
5 2 2 4 A
6 3 2 5 B
7 1 1 10 C
How can I select all rows but remove duplicate rows that have the same athlete in the same event (fields Athlete and Event), keeping the lowest (or highest) Mark for that athlete? I would also like to limit each event to the top 10 athletes (not shown in the results).
Expected Output (choosing highest mark), (row 5 is removed)
Row Athlete Event Mark Meet
1 1 3 10 A
2 2 2 5 A
3 3 3 3 A
4 4 4 7 A
6 3 2 5 B
7 1 1 10 C
Thanks for the help. The query that did what I wanted (minus the top ten) is:
SELECT [tblPerformanceData-FieldBoys].Eventnum, [tblPerformanceData-FieldBoys].Mark, [tblPerformanceData-FieldBoys].Meet, [tblPerformanceData-FieldBoys].CY, [tblPerformanceData-FieldBoys].AthleteID, [tblPerformanceData-FieldBoys].MeetID
FROM [tblPerformanceData-FieldBoys] INNER JOIN MaxAthleteByEventBoysField ON ([tblPerformanceData-FieldBoys].AthleteID = MaxAthleteByEventBoysField.AthleteID) AND ([tblPerformanceData-FieldBoys].Mark = MaxAthleteByEventBoysField.MaxOfMark) AND ([tblPerformanceData-FieldBoys].Eventnum = MaxAthleteByEventBoysField.Eventnum)
GROUP BY [tblPerformanceData-FieldBoys].Eventnum, [tblPerformanceData-FieldBoys].Mark, [tblPerformanceData-FieldBoys].Meet, [tblPerformanceData-FieldBoys].CY, [tblPerformanceData-FieldBoys].AthleteID, [tblPerformanceData-FieldBoys].MeetID
ORDER BY [tblPerformanceData-FieldBoys].Mark DESC;
You can do it using cascading queries. Try running a group-by query on the main table that only includes the athlete, event, and mark. The max or min function would be applied to the mark, depending on the outcome you're looking for. Use this query as the source for a second query where you link back to the initial table using direct links on the athlete, event, and mark fields.
That solves the first part. I'm not sure how to get the top ten for each event using queries.
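One common way to get the top N per group without window functions is a correlated subquery that counts how many better marks exist in the same event. The sketch below is an assumption about how that could look, not something confirmed to work in Access. The table Performance, the view BestMark, and the column RowNo are hypothetical names, and the SQL is run on SQLite through Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Performance "
    "(RowNo INTEGER, Athlete INTEGER, Event INTEGER, Mark INTEGER, Meet TEXT)"
)
conn.executemany(
    "INSERT INTO Performance VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 3, 10, "A"), (2, 2, 2, 5, "A"), (3, 3, 3, 3, "A"), (4, 4, 4, 7, "A"),
     (5, 2, 2, 4, "A"), (6, 3, 2, 5, "B"), (7, 1, 1, 10, "C")],
)
# First query of the cascade: best mark per athlete and event.
conn.execute(
    "CREATE VIEW BestMark AS "
    "SELECT Athlete, Event, MAX(Mark) AS Mark FROM Performance GROUP BY Athlete, Event"
)

def top_per_event(n):
    """Return athletes with fewer than n strictly better marks in their event."""
    return conn.execute(
        """
        SELECT p.Athlete, p.Event, p.Mark
        FROM BestMark AS p
        WHERE (SELECT COUNT(*) FROM BestMark AS q
               WHERE q.Event = p.Event AND q.Mark > p.Mark) < ?
        ORDER BY p.Event, p.Mark DESC, p.Athlete
        """,
        (n,),
    ).fetchall()

print(top_per_event(10))  # every athlete, since no event has more than ten
print(top_per_event(1))   # only the leaders of each event
```

Note that tied marks are all kept, so an event can return more than n athletes when there is a tie at the cutoff.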
I don't own or have access to MS Access, but I can give you SQL, and hope Access will support some basic syntax.
Option 1: it's easier if Row is your primary key but you do not need to return it in the result; in this case you can even get both MIN and MAX of the Mark for the same athlete in the same row using a simple query:
SELECT
Athlete, Event, Meet, MAX(Mark) AS HighestMark, MIN(Mark) AS LowestMark
FROM
MyTable
GROUP BY
Athlete, Event, Meet
Note: I assumed you also want to group by Meet, but if that's not the case, you could remove it from GROUP BY, but then its value loses meaning in the result.
Option 2: Row is primary key, but you do need to return it - obviously in this case min and max cannot be returned in the same row and the query looks quite different:
SELECT
    Row, Athlete, Event, Mark, Meet
FROM
    MyTable m0
WHERE m0.Row IN
    (SELECT MAX(Row)
     FROM MyTable m1
     WHERE
         Athlete = m0.Athlete AND
         Event = m0.Event AND
         Meet = m0.Meet AND
         Mark = (SELECT MAX(Mark)
                 FROM MyTable
                 WHERE
                     Athlete = m1.Athlete AND
                     Event = m1.Event AND
                     Meet = m1.Meet)
     GROUP BY
         Athlete, Event, Meet, Mark)
A few notes:
the above query returns MAX(Mark); change it to MIN(Mark) to return the lowest values
this query could be rewritten with JOINs as well; I'm not sure which method Access likes better (i.e. runs faster)
it has 2 sub-queries; the outer sub-query with MAX(Row) is there to make sure only 1 row is selected if the same athlete in the same meet and event gets the same Mark more than once; in this case, the greater Row is returned
it is possible to return both MIN and MAX with one query (as separate rows) at the expense of additional sub-queries, but you didn't ask for that
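The JOIN-based rewrite mentioned in the notes could look roughly like the sketch below. It joins the table to a grouped derived table holding the best mark per athlete, event, and meet, then takes MAX of the row number to collapse any ties. The column is called RowNo here instead of Row to avoid keyword clashes (an assumption), and the query is run on SQLite through Python's sqlite3 module, so Access may need syntax adjustments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable "
    "(RowNo INTEGER PRIMARY KEY, Athlete INTEGER, Event INTEGER, Mark INTEGER, Meet TEXT)"
)
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?, ?, ?)",
    [(1, 1, 3, 10, "A"), (2, 2, 2, 5, "A"), (3, 3, 3, 3, "A"), (4, 4, 4, 7, "A"),
     (5, 2, 2, 4, "A"), (6, 3, 2, 5, "B"), (7, 1, 1, 10, "C")],
)

# b holds the best mark per (Athlete, Event, Meet); the join keeps only rows
# with that mark, and MAX(RowNo) picks one row if the best mark was hit twice.
best_rows = conn.execute(
    """
    SELECT MAX(m.RowNo) AS KeptRow, m.Athlete, m.Event, m.Mark, m.Meet
    FROM MyTable AS m
    JOIN (SELECT Athlete, Event, Meet, MAX(Mark) AS BestMark
          FROM MyTable
          GROUP BY Athlete, Event, Meet) AS b
      ON b.Athlete = m.Athlete
     AND b.Event = m.Event
     AND b.Meet = m.Meet
     AND b.BestMark = m.Mark
    GROUP BY m.Athlete, m.Event, m.Mark, m.Meet
    ORDER BY KeptRow
    """
).fetchall()

for row in best_rows:
    print(row)
```

On the sample data, row 5 drops out and rows 1, 2, 3, 4, 6, and 7 remain, matching the expected output in the question.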

Multicriteria Insert/Update

I'm trying to create a query that will insert new records to a table or update already existing records, but I'm getting stuck on the filtering and grouping for the criteria I want.
I have two tables: tbl_PartInfo, and dbo_CUST_BOOK_LINE.
I want to select from dbo_CUST_BOOK_LINE based upon the combination of CUST_ORDER_ID, CUST_ORDER_LINE_NO, and REVISION_ID. Each customer order can have multiple lines, and each line can have multiple revisions. I'm trying to select the unique combinations of each order and its connected lines, but take the connected information from the row with the highest value in the revision column.
I want to insert/update from dbo_CUST_BOOK_LINE the following columns:
CUST_ORDER_ID
PART_ID
USER_ORDER_QTY
UNIT_PRICE
I want to insert/update them into tbl_PartInfo as the following columns respectively:
JobID
DrawingNumber
Quantity
UnitPrice
So if I have the following rows in dbo_CUST_BOOK_LINE (PART_ID omitted for example)
CUST_ORDER_ID CUST_ORDER_LINE_NO REVISION_ID USER_ORDER_QTY UNIT_PRICE
SCabc 1 1 0 100
SCabc 1 2 4 150
SCabc 1 3 4 125
SCabc 2 3 2 200
SCxyz 1 1 0 0
SCxyz 1 2 3 50
It would return
CUST_ORDER_ID CUST_ORDER_LINE_NO (REVISION_ID) USER_ORDER_QTY UNIT_PRICE
SCabc 1 3 4 125
SCabc 2 3 2 200
SCxyz 1 2 3 50
but with PART_ID included and without REVISION_ID
So far, my code is just for the insert portion, as I was trying to get the correct records selected, but I keep getting duplicates of CUST_ORDER_ID and CUST_ORDER_LINE_NO.
INSERT INTO tbl_PartInfo ( JobID, DrawingNumber, Quantity, UnitPrice, ProductFamily, ProductCategory )
SELECT dbo_CUST_BOOK_LINE.CUST_ORDER_ID, dbo_CUST_BOOK_LINE.PART_ID, dbo_CUST_BOOK_LINE.USER_ORDER_QTY, dbo_CUST_BOOK_LINE.UNIT_PRICE, dbo_CUST_BOOK_LINE.CUST_ORDER_LINE_NO, Max(dbo_CUST_BOOK_LINE.REVISION_ID) AS MaxOfREVISION_ID
FROM dbo_CUST_BOOK_LINE, tbl_PartInfo
GROUP BY dbo_CUST_BOOK_LINE.CUST_ORDER_ID, dbo_CUST_BOOK_LINE.PART_ID, dbo_CUST_BOOK_LINE.USER_ORDER_QTY, dbo_CUST_BOOK_LINE.UNIT_PRICE, dbo_CUST_BOOK_LINE.CUST_ORDER_LINE_NO;
This has been far more complicated than anything I've done so far, so any help would be greatly appreciated. Sorry about the long column names, I didn't get to choose them.
I did some research and think I found a way to make it work, but I'm still testing it. Right now I'm using three queries, but it should be easily simplified into two when complete.
The first is an append query that selects the two columns I want distinct combinations of, using "group by," while also selecting the max of the revision column. It appends them to another table that I'm using called tbl_TempDrop. This table is only being used right now to reduce the number of results before the next part.
The second is an update query that updates tbl_TempDrop to include all the other columns I wanted by setting the criteria equal to the three selected columns from the first query. This took an EXTREMELY long time to complete when I had 700,000 records to work with, hence the use of the tbl_TempDrop.
The third query is a basic append query that appends the rows of tbl_TempDrop to the end destination, tbl_PartInfo.
All that's left is to run all three in a row.
I didn't want to include the full details of any tables or queries yet until I ensure that it works as desired, and because some of the names are vague since I will be using this method for multiple query searches.
This website helped me a little to make sure I had the basic idea down. http://www.techonthenet.com/access/queries/max_query2_2007.php
Let me know if you see any flaws with the approach!
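For comparison, the selection step (highest revision per order line) can also be written as a single query that joins the table to its own grouped maximum, without a temporary table. The sketch below uses the sample rows from the question, leaves out PART_ID just as the example does, and runs on SQLite through Python's sqlite3 module. Whether Access handles the same query efficiently on 700,000 rows is not something this sketch can confirm.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dbo_CUST_BOOK_LINE ("
    "CUST_ORDER_ID TEXT, CUST_ORDER_LINE_NO INTEGER, REVISION_ID INTEGER, "
    "USER_ORDER_QTY INTEGER, UNIT_PRICE INTEGER)"
)
conn.executemany(
    "INSERT INTO dbo_CUST_BOOK_LINE VALUES (?, ?, ?, ?, ?)",
    [("SCabc", 1, 1, 0, 100), ("SCabc", 1, 2, 4, 150), ("SCabc", 1, 3, 4, 125),
     ("SCabc", 2, 3, 2, 200), ("SCxyz", 1, 1, 0, 0), ("SCxyz", 1, 2, 3, 50)],
)

# "latest" finds the highest revision per order line; joining back on all three
# columns picks up the quantity and price that belong to that revision.
latest_lines = conn.execute(
    """
    SELECT l.CUST_ORDER_ID, l.CUST_ORDER_LINE_NO, l.USER_ORDER_QTY, l.UNIT_PRICE
    FROM dbo_CUST_BOOK_LINE AS l
    JOIN (SELECT CUST_ORDER_ID, CUST_ORDER_LINE_NO, MAX(REVISION_ID) AS MaxRev
          FROM dbo_CUST_BOOK_LINE
          GROUP BY CUST_ORDER_ID, CUST_ORDER_LINE_NO) AS latest
      ON latest.CUST_ORDER_ID = l.CUST_ORDER_ID
     AND latest.CUST_ORDER_LINE_NO = l.CUST_ORDER_LINE_NO
     AND latest.MaxRev = l.REVISION_ID
    ORDER BY l.CUST_ORDER_ID, l.CUST_ORDER_LINE_NO
    """
).fetchall()

for row in latest_lines:
    print(row)
```

The same SELECT could then serve as the source of an append query into tbl_PartInfo.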

SQL-Have 2 number columns. Trying to replace a context number with a sequence

I have a data set right now with 3 columns.
Column 1 is Order number and it is sequential in its own right and a foreign key
Column 2 is Batch number and it is a sequence all of its own.
Column 3 is a time stamp
The problem I have is as follows
Order Batch TimeStamp
1 1
2 2
1 3
3 4
2 5
1 6
I am trying to work out the time differences between batches on a per order basis.
Usually I get a sequence number PER order ID, but that isn't the case here. I am trying to create a view that will do that, but my first obstacle is translating those batch sequences into a sequence number PER order.
My ideal Output
Order Batch SequenceNumber TimeStamp
1 1 1
2 2 1
1 3 2
3 4 1
2 5 2
1 6 3
All help is appreciated!!
This is what row_number() does:
select t.*, row_number() over (partition by "order" order by batch) as seqnum
from t;
Note: you have to escape the column name order because it is a SQL reserved word (double quotes are the ANSI way; SQL Server also accepts square brackets). It is better not to use reserved words for column names at all.
row_number() is ANSI standard functionality available in most databases (your question doesn't have a database tag). There are other ways to do this, but row_number() is the simplest.
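Here is a small sketch of that query applied to the Order / Batch sample from the question. The database is not stated in the question, so SQLite (3.25 or later) through Python's sqlite3 module is used as an assumption, and the TimeStamp column is left out because the question shows no values for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "order" is quoted because it is a reserved word.
conn.execute('CREATE TABLE t ("order" INTEGER, batch INTEGER)')
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, 1), (2, 2), (1, 3), (3, 4), (2, 5), (1, 6)],
)

# Sequence number per order, following the batch sequence.
sequenced = conn.execute(
    """
    SELECT "order", batch,
           ROW_NUMBER() OVER (PARTITION BY "order" ORDER BY batch) AS seqnum
    FROM t
    ORDER BY batch
    """
).fetchall()

for row in sequenced:
    print(row)
```

The seqnum values line up with the SequenceNumber column in the ideal output.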