SQL Based Query Using Rank and Window Functions - sql

The following is my dataset.
P A Rank B Rank C Rank
P1 1 1 2
P2 2 6 4
P3 3 4 1
P4 4 5 6
P5 5 2 5
P6 6 3 3
I have ranks of certain products basis three different parameters. There are n no of rows and this is just a sample. I want to create a column in this dataset which gives final rank as follows:
P Code Final Rank Ranking Logic
P1 1 Top A
P6 2 Top B (out of remaining 5 properties)
P4 3 Top C (out of remaining 4 properties)
P3 4 Top A (out of remaining 3 properties)
P2 5 Top B (out of remaining 2 properties)
P5 6 Top C (out of remaining 1 properties)
I need to understand how this can be implemented using SQL, or SQL Procedures.

Related

Distinct count on a rolling time window

I want to count the number of distinct catalog numbers that have appeared within the last X minutes. This is usually called a rolling time window.
For instance, if I have:
row startime orderNumber catalogNumb
1 2007-09-24-15.50 o1 21
2 2007-09-24-15.51 o2 21
3 2007-09-24-15.52 o2 21
4 2007-09-24-15.53 o3 21
5 2007-09-24-15.54 o4 22
6 2007-09-24-15.55 o4 23
7 2007-09-24-15.56 o4 21
8 2007-09-24-15.57 o4 21
For instance, if I want to get this for the last 5 minutes (5 is just one of the possible values), the output should be:
row startime orderNumber catalogNumb countCatalog
1 2007-09-24-15.50 o1 21 1
2 2007-09-24-15.51 o2 22 2
3 2007-09-24-15.52 o2 23 3
4 2007-09-24-15.53 o3 24 4
5 2007-09-24-15.54 o4 21 4
6 2007-09-24-15.55 o4 21 4
7 2007-09-24-15.56 o4 21 4
8 2007-09-24-15.57 o4 21 3
I am using Big SQL for infosphere BigInsights v3.0.
Resulting query can use any db2 Olap windows functions except for
count (distinct catalogNumb) OVER()... which is not supported by my db2 version.
In addition to count, I may also need to use other aggregate functions (avg, sum...) over the catalogNumb and other attributes.
Any feedback would be appreciated.
True Db2 does not support count distinct as OLAP function but there is an easy workaround:
You can use
dense_rank
instead - the highest number (max) from dense rank is your count distinct!
You can try something like this:
select ...
from mytable
where starttime between current_time - 5 minutes and current_time
That will get all the rows for the last 5 minutes. 5 can be a variable. then count() or sum() or average() the rows.

Pig Latin Count difference between two tables

I have one table loaded twice to perform a self join called current and previous. Both contain columns "key" (not unique) and "value". I have grouped by key, and counted the number of values in each group of keys.
I would like to find how many more values were added to the current table compared to the previous table, but I get the error "Invalid scalar projection: cur_count : A column needs to be projected from a relation for it to be used as a scalar". I am relatively new to pig latin, so I'm unsure of what the syntax should be for performing this difference.
Please disregard syntax for the cur_count and prev_count.
cur_count = FOREACH cur_grouped GENERATE COUNT(current);
prev_count = FOREACH prev_grouped GENERATE COUNT(previous);
left_join = join current by key LEFT OUTER, previous by key-1;
difference = FOREACH left_join GENERATE key, cur_count-prev_count; //error here
dump difference;
Below are some sample data
key value
1 12
1 34
1 11
1 45
2 4
3 34
3 34
3 23
4 15
4 19
What my script does so far: it counts the number of values in each group of keys
key count
1 4
2 1
3 3
4 2
I would like to find the difference in number of values between a key and the previous key
key difference
2 -3
3 2
4 -1
cur_count and prev_count are relations and cannot be used the way you are using.You can achieve the desired output using the script below.After joining the relations with (key-1),use the columns from the relation to get the difference.
A = LOAD 'data.txt' USING PigStorage(',') AS (f1:int,f2:int);
B = GROUP A BY f1;
C = FOREACH B GENERATE group,COUNT(A);
D = FOREACH B GENERATE group,COUNT(A);
E = JOIN C BY $0,D BY ($0-1);
F = FOREACH E GENERATE $2,$3-$1;
DUMP F;
Presume you have two groups grp1 and grp2 with the content you described earlier
key count
1 4
2 1
3 3
4 2
Note: I have not executed below Pig statements.
-- Generate the Ranks for two relations
grp1 = rank grp1;
grp2 = rank grp2;
-- Increment rank by 1 for each record in grp2
grp2 = foreach grp2 generate ($0+1) as rank,key,count
After these the two relations would look like below. Arranged them side by side for comparison.
Group 1 Group 2
Rank key count Rank key count
1 1 4 2 1 4
2 2 1 3 2 1
3 3 3 4 3 3
4 4 2 5 4 2
Join the two groups by RANK which would yield below output
Rank key count Rank key count
2 2 1 2 1 4
3 3 3 3 2 1
4 4 2 4 3 3
5 4 2
Now you can run another "foreach" statement that finds the difference in two count columns above.
result = FOREACH <<joined relation>> GENERATE $1 as key,($2-$5) as difference

How to write sql query to generate a group no for each grouped record

following is scenario:
I have data in following format:
entryid , ac_no, db/cr, amt
-----------------------------------------------
1 10 D 5
1 11 C 5
2 01 D 8
2 11 C 8
3 12 D 10
3 13 C 10
4 14 D 5
4 16 C 5
5 14 D 2
5 17 C 2
6 14 D 3
6 18 C 3
I want data in following format:
So far i have acheived the first 3 columns by query
select wm_concat(entryid),ac_no,db_cr,Sum(amt) from t1 group by ac_no,db_cr
wm_Concat(entryid),ac_no, db/cr, Sum(amt), set_id
------------------------------------------------
1 10 D 5 S1
2 01 D 8 S1
1,2 11 C 13 S1
3 12 D 10 S2
3 13 C 10 S2
4,5,6 14 D 10 S3
4 16 C 5 S3
5 17 C 2 S3
6 18 C 3 S3
I want an additional column `set_id` that either shows this S1, S2.. or any number 1,2.. so that the debit & credit entries sets can be identified.
I am making sets of debit and credit entries based on their Ac_no values.
Any little help will be highly appreciated. Thanks
Create a new column say set and give a unique identifier to the particular set. So for example the first three records will have set id S1, next two will have S2 and so on.
To distinguish a transaction from a set you can use column db/cr along with newly added set column. You can identify that the 3rd row is a set since it's transaction type is 'C' whereas the transactions are of type 'D'.
Here I have assumed that your transactions are debit only, if not please provide more details in the question. Hope this helps.

Excel VBA function to solve impossible round-robin tournament roster with venue constraint

I am really having difficulty generating a round-robin tournament roster with the following conditions:
10 Teams (Teams 1 - 10)
5 Fields (Field A - E)
9 Rounds (Round 1 - 9)
Each team must play every other team exactly once.
Only two teams can play on a field at any one time. (i.e. all 5 fields always in use)
No team is allowed to play on any particular field more than twice. <- This is the problem!
I have been trying on and off for many years to solve this problem on paper without success. So once and for all, I would like to generate a function in Excel VBA to test every combination to prove it is impossible.
I started creating a very messy piece of code that generates an array using nested if/while loops, but I can already see it's just not going to work.
Is there anyone out there with a juicy piece of code that can solve?
Edit: Thanks to Brian Camire's method below, I've been able to include further desirable constraints and still get a solution:
No team plays the same field twice in a row
A team should play on all the fields once before repeating
The solution is below. I should have asked years ago! Thanks again Brian - you are a genius!
Round 1 2 3 4 5 6 7 8 9
Field A 5v10 1v9 2v4 6v8 3v7 4v10 3v9 7v8 1v2
Field B 1v7 8v10 3v6 2v9 4v5 6v7 1v8 9v10 3v5
Field C 2v6 3v4 1v10 5v7 8v9 1v3 2v5 4v6 7v10
Field D 4v9 2v7 5v8 3v10 1v6 2v8 4v7 1v5 6v9
Field E 3v8 5v6 7v9 1v4 2v10 5v9 6v10 2v3 4v8
I think I've found at least one solution to the problem:
Round Field Team 1 Team 2
1 A 3 10
1 B 7 8
1 C 1 9
1 D 2 4
1 E 5 6
2 A 8 10
2 B 1 5
2 C 2 6
2 D 3 7
2 E 4 9
3 A 1 4
3 B 2 3
3 C 8 9
3 D 5 7
3 E 6 10
4 A 6 7
4 B 4 10
4 C 2 8
4 D 5 9
4 E 1 3
5 A 2 9
5 B 3 8
5 C 4 7
5 D 1 6
5 E 5 10
6 A 3 9
6 B 4 5
6 C 7 10
6 D 6 8
6 E 1 2
7 A 5 8
7 B 6 9
7 C 1 10
7 D 3 4
7 E 2 7
8 A 4 6
8 B 2 10
8 C 3 5
8 D 1 8
8 E 7 9
9 A 2 5
9 B 1 7
9 C 3 6
9 D 9 10
9 E 4 8
I found it using the OpenSolver add-in for Excel (as the problem was too large for the built-in Solver feature). The steps were something like this:
Set up a table with 2025 rows representing the possible matches -- that is, possible combinations of round, field, and pair of teams (with columns like the table above), plus one extra column that will be a binary (0 or 1) decision variable indicating if the match is to be selected.
Set up formulas to use the decision variables to calculate: a) the number matches at each field in each round, b) the number of matches between each pair of teams, c) the number of matches played by each team in each round, and, d) the number of matches played by each team at each field.
Set up a formula to use the decision variables to calculate the total number of matches.
Use OpenSolver to solve a model whose objective is to maximize the result of the formula from Step 3 by changing the decision variables from Step 1, subject to the constraints that the decision variables must be binary, the results of the formulas from Steps 2.a) through c) must equal 1, and the results of the formulas from Step 2.d) must be less than or equal to 2.
The details are as follows...
For Step 1, I set up my table so that columns A, B, C, and D represented the Round, Field, Team 1, and Team 2, respectively, and column E represented the decision variable. Row 1 contained the column headings, and rows 2 through 2026 each represented one possible match.
For Step 2.a), I set up a vertical list of rounds 1 through 9 in cells I2 through I10, a horizontal list of fields A through E in cells J1 through N1, and a series of formulas to calculate the number of matches in each field in each round in cells J2 through N10 by starting with =SUMIFS($E$2:$E$2026,$A$2:$A$2026,$I2,$B$2:$B$2026,J$1) in cell J2 and then copying and pasting.
For Step 2.b), I set up a vertical list of teams 1 through 9 in cells I13 through I21, a horizontal list of opposing teams 2 through 10 in cells J12 through R12, and a series of formulas to calculate the number of matches between each pair of teams in the "upper right triangular half" of cells J13 through R21 (including the diagonal) by starting with =SUMIFS($E$2:$E$2026,$C$2:$C$2026,$I13,$D$2:$D$2026,J$12) in cell J13 and then copying and pasting.
For Step 2.c), I set up a vertical list of teams 1 through 10 in cells I24 through I33, a horizontal list of rounds 1 through 9 in cells J23 through R23, and a series of formulas to calculate the number of matches played by each team in each round in cells J24 through R33 by starting with =SUMIFS($E$2:$E$2026,$C$2:$C$2026,$I24,$A$2:$A$2026,J$23)+SUMIFS($E$2:$E$2026,$D$2:$D$2026,$I24,$A$2:$A$2026,J$23) in cell J24 and then copying and pasting.
For Step 2.d), I set up a vertical list of teams 1 through 10 in cells I36 through I45, a horizontal list of fields A through B in cells J35 through N45, and series of formulas to calculate the number of matches played by each team at each field in cells J36 through N45 by starting with =SUMIFS($E$2:$E$2026,$C$2:$C$2026,$I36,$B$2:$B$2026,J$35)+SUMIFS($E$2:$E$2026,$D$2:$D$2026,$I36,$B$2:$B$2026,J$35) in cell J36 and then copying and pasting.
For Step 3, I set up a formula to calculate the total number of matches in cell G2 as =SUM($E$2:$E$2026).
For Step 4, in the OpenSolver Model dialog (available from Data, OpenSolver, Model) I set the Objective Cell to $G$2, the Variable Cells to $E$2:$E$2026, and added constraints as described above and detailed below (sorry that the constraints are not listed in the order that I described them):
Note that, for the constraints described in Step 2.b), I needed to add the constraints separately for each row, since OpenSolver raised an error message if the constraints included the blank cells in the "lower left triangular half".
After setting up the model, OpenSolver highlighted the objective, variable, and constraint cells as shown below:
I then solved the problem using OpenSolver (via Data, OpenSolver, Solve). The selected matches are the ones with a 1 in column E. You might get a different solution than I did, as there might be many feasible ones.
come on ... that's an easy one for manual solution ;-)
T1 T2 VE
1 2 A
1 3 A
1 4 B
1 5 B
1 6 C
1 7 C
1 8 D
1 9 D
1 10 E
2 3 A
2 4 B
2 5 B
2 6 C
2 7 C
2 8 D
2 9 D
2 10 E
3 4 C
3 5 C
3 6 D
3 7 D
3 8 E
3 9 E
3 10 B
4 5 C
4 6 D
4 7 D
4 8 E
4 9 E
4 10 A
5 6 E
5 7 E
5 8 A
5 9 A
5 10 D
6 7 E
6 8 A
6 9 A
6 10 B
7 8 B
7 9 B
7 10 A
8 9 B
8 10 C
9 10 C
As far as I have checked no team more then twice on the same venue. Please double check.
To divide it into rounds should be a easy one.
Edit: this time with only 5 venues :-)
Edit 2: now also with allocated rounds :-)
Edit 3: deleted the round allocation again because it was wrong.

How to match already-calculated means to the original data set?

I am now learning R. I feel that there is a very easy succinct answer to my problem, but I am having trouble solving it myself.
I have a large data set. One column contains various 'categories'. I aggregated these categories to get the mean for each one. So, right now, my aggregated table looks like this:
Category __ Average
A ________ a
B ________ b
C ________ c
etc...
I want now to take these average and combine it as another column onto my original data.
So, I want it to look something like this:
Categories _____ Averages
B _____________ b
A______________a
B______________b
C______________c
B______________b
C______________c
In other words, I want to match each category with its corresponding mean. I have tried variations of merge(), match(), and different apply functions. The fact that my aggregated table is so much smaller than my original data is causing some problems.
Is there a specific function I can use for this simple problem? Thanks in advance.
In base R:
data <- data.frame(Category=c(rep("A",3), rep("B",4), rep("C",2)), Value=1:9)
> data
Category Value
1 A 1
2 A 2
3 A 3
4 B 4
5 B 5
6 B 6
7 B 7
8 C 8
9 C 9
> avg <- lapply(split(data$Value, data$Category), mean)
$A
[1] 2
$B
[1] 5.5
$C
[1] 8.5
> data$Averages <- avg[data$Category]
> data
Category Value Averages
1 A 1 2
2 A 2 2
3 A 3 2
4 B 4 5.5
5 B 5 5.5
6 B 6 5.5
7 B 7 5.5
8 C 8 8.5
9 C 9 8.5
You can use plyr, data.table, etc. more efficiently for larger datasets.