Compare last row to previous row by group and populate new column - sql

I need to compare the last row of each group to the row above it, check whether the values in a few columns changed, and populate a new column with 1 where a change occurred. The data below explains it better.
I also need to account for groups that have only one row.
What we have:
Group Name Sport DogName Eligibility
1 Tom BBALL Toto Yes
1 Tom BBall Toto Yes
1 Tom golf spot Yes
2 Nancy vllyball Jimmy yes
2 Nancy vllyball rover no
What we want:
Group Name Sport DogName Eligibility N_change S_change D_Change E_change
1 Tom BBALL Toto Yes 0 0 0 0
1 Tom BBall Toto Yes 0 0 0 0
1 Tom golf spot Yes 0 1 1 0
2 Nancy vllyball Jimmy yes 0 0 0 0
2 Nancy vllyball rover no 0 0 1 1
I only care about changes from row to row within a group. Thank you in advance for any help.
The rows are already ordered, so we only need the last two rows of each group. If it is easier to compare sequential rows in a group, that is just as good for my purposes.
I knew this would involve arrays, and I struggle with those because I never use them in my typical SAS modeling. I wanted to keep things short and sweet.

Use a DATA step with the LAG function. Make sure your data is sorted by group first, and that the rows within each group are in the correct order. Using arrays will make your code much shorter.
The logic below will compare each row with the previous row. A flag of 1 will be set only if:
It's not the first row of the group
The current value differs from the previous value.
The syntax var = (test logic); is a shortcut to automatically generate dummy flags.
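data want;
    set have;
    by group;

    /* The four compared variables are assumed to be character. */
    array var[*] name sport dogname eligibility;
    /* $ 32 is an assumed length; match it to the widest source variable. */
    array lagvar[*] $ 32 lag_name lag_sport lag_dogname lag_eligibility;
    array changeflag[*] N_change S_change D_change E_change;

    do i = 1 to dim(var);
        /* LAG keeps one queue per call site, and this call runs four times
           per row, so LAG4 reaches back to the same variable on the
           previous row. Adjust the suffix if the array size changes. */
        lagvar[i] = lag4(var[i]);
        /* 1 if the value changed and this is not the first row of the group */
        changeflag[i] = (var[i] NE lagvar[i] AND NOT first.group);
    end;

    drop lag: i;
run;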
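For readers who do not use SAS, the same "compare each row with the previous row of its group" logic can be sketched in plain Python. This is only an illustration under stated assumptions: the function name change_flags and the ignore_case switch are hypothetical, and case-insensitive comparison is inferred from the desired output in the question, where BBALL followed by BBall is shown as no change.

```python
from itertools import groupby

# Sample rows from the question: (group, name, sport, dogname, eligibility)
rows = [
    (1, "Tom", "BBALL", "Toto", "Yes"),
    (1, "Tom", "BBall", "Toto", "Yes"),
    (1, "Tom", "golf", "spot", "Yes"),
    (2, "Nancy", "vllyball", "Jimmy", "yes"),
    (2, "Nancy", "vllyball", "rover", "no"),
]


def change_flags(rows, ignore_case=True):
    """Return one list of 0/1 flags per row.

    A flag is 1 when the column differs from the previous row of the same
    group. The first row of each group always gets all zeros. Rows must
    already be sorted by group, as in the question.
    """
    norm = (lambda s: s.lower()) if ignore_case else (lambda s: s)
    out = []
    for _, grp in groupby(rows, key=lambda r: r[0]):
        prev = None
        for row in grp:
            cur = [norm(v) for v in row[1:]]
            if prev is None:
                out.append([0] * len(cur))
            else:
                out.append([int(a != b) for a, b in zip(cur, prev)])
            prev = cur
    return out


flags = change_flags(rows)
for row, f in zip(rows, flags):
    print(*row, *f)
```

Passing ignore_case=False gives a strict comparison, in which case the second row's sport flag becomes 1.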

It is not uncommon for procedural programmers to run into this kind of dilemma in SQL, which is predominantly a set-based language where rows have no position. If you write a procedure that reads the selected data (sorted in the desired order), it can use variables to control the creation of the additional columns in the output, similar to the lag function above.
Or you can put it into a spreadsheet, which is happier detecting changes with a formula-filled column such as =if(a2<>a1,1,0). Just make sure nobody re-sorts the spreadsheet data into a new order!

Related

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random IDs from the user_id column have been selected; IDs can be repeated across rows, as "fvg" is for tenant_id=2. I would like a threshold of no more than ten IDs per row. This data is just a sample and has only 10 IDs to start with, so in general the number should be much less than the total number of user_ids; in this case, say, at most one less than the number of user_ids that belong to the tenant.
I first tried figuring out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kind of lost after this. Assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column; it's a new cell for a single row.
If you want to assign the result to a new column, you need to create the column and apply the cell computation to every row.
df['new_column'] = df['user_id'].apply(lambda _: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the lambda's argument is not used in the computation. The .tolist() call keeps each sample as a plain list in a single cell. Note that this samples from the whole user_id column, not per tenant_id.
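The question also asks for the sample to be drawn within each tenant_id. A minimal sketch of that grouping step, using only the standard library rather than pandas, might look like the following. The function name sample_per_tenant and its max_ids and seed parameters are hypothetical, and the cap of "one less than the tenant's ID count, but at least one" is one reading of the question's description.

```python
import random
from collections import defaultdict

# (tenant_id, user_id) pairs from the question's sample DataFrame
records = [
    (1, "ab1"), (1, "avc1"), (1, "bc2"),
    (2, "iuyt"), (2, "fvg"), (2, "fbh"),
    (3, "bcv"), (3, "bcb"),
    (7, "yth"), (7, "ytn"),
]


def sample_per_tenant(records, max_ids=10, seed=None):
    """For each row, pick a random-length sample of user_ids from its tenant."""
    rng = random.Random(seed)

    # Collect the pool of user_ids available for each tenant.
    pool_by_tenant = defaultdict(list)
    for tenant, user in records:
        pool_by_tenant[tenant].append(user)

    samples = []
    for tenant, _ in records:
        pool = pool_by_tenant[tenant]
        # Assumed cap: at most max_ids, at most one less than the pool size,
        # and never fewer than one.
        upper = max(1, min(max_ids, len(pool) - 1))
        k = rng.randint(1, upper)
        samples.append(rng.sample(pool, k))
    return samples


new_column = sample_per_tenant(records, seed=0)
for (tenant, user), picked in zip(records, new_column):
    print(tenant, user, picked)
```

The resulting list has one entry per row, so with pandas it could be assigned directly as a new column of the same length.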

Business Objects CountIf by cell reference

So I have a column with this data
1
1
1
2
3
4
5
5
5
How can I do a count-if where the value at any given location in the above table is equal to a cell I select? For example, Count([NUMBER]) Where([NUMBER] = Coordinates(0,0)) would return 3, because there are 3 rows whose value equals the 1 in position 0.
It's basically like Excel, where COUNTIF(A:A, 1) gives you the total number of rows where the value in A:A is 1. Is this possible in Business Objects Web Intelligence?
Functions in WebI operate on rows, so you have to think about it a little differently.
If your intent is to create a cell outside of the report block and display the count of specific values, you can use Count() with Where():
=Count([NUMBER];All) Where ([NUMBER] = "1")
In a freestanding cell, the above will produce a value of "3" for your sample data.
If you want to put the result in the same block and have it count up the occurrences of values on that row, for example:
NUMBER NUMBER Total
1 3
1 3
1 3
2 1
3 1
4 1
5 3
5 3
5 3
it gets a little more complicated. You have to have at least one other dimension in the query to reference. It can be anything, but you have to be counting something in conjunction with the NUMBER dimension. So, the following would work, assuming there's another dimension in the query named [Duh]:
=Count([NUMBER];All) ForAll([Duh])
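To make the two intended results concrete outside WebI, here is a small Python sketch of the same counting logic. It is not WebI syntax, and the variable names are illustrative only: counts[1] corresponds to the freestanding-cell count, and per_row_total corresponds to the "NUMBER Total" column in the block above.

```python
from collections import Counter

# The NUMBER column from the question
numbers = [1, 1, 1, 2, 3, 4, 5, 5, 5]

# How many times each distinct value occurs in the whole column
counts = Counter(numbers)

# Freestanding-cell case: count of rows where NUMBER equals 1
print(counts[1])

# In-block case: for every row, the total occurrences of that row's value
per_row_total = [counts[n] for n in numbers]
for n, total in zip(numbers, per_row_total):
    print(n, total)
```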

How to find conditional cumulative sums in an excel table using VBA macro

Let's say I have two columns.
3.5463 11
4.5592 12
1.6993 111
0.92521 112
1.7331 121
2.1407 122
1.4082 1111
2.0698 1112
2.3973 1121
2.4518 1122
1.1719 1211
1.153 1212
0.67139 1221
0.64744 1222
1.3705 11111
0.9557 11112
0.64868 11121
0.7325 11211
0.58874 11212
0.86673 11221
0.17075 11222
0.64026 12111
0.80229 12112
0.43422 12122
1.0405 12211
0.63376 12212
0.56491 12221
0.34626 12222
0.81631 111111
0.91837 111112
0.70013 111121
0.87384 111122
1.1474 111211
0.47411 111221
0.12249 111222
0.56728 112111
0.88169 112112
0.14509 112121
0.68655 112211
0.36274 112212
1.1652 121111
0.99314 121112
0.42024 121121
0.23937 121122
1.0346 122111
0.64642 122112
0.15632 122121
0.41725 122122
0.40793 122211
In the first column, there is a number. With every one of those numbers, in the second column, is an associated ID. Now, there are some blank rows that do not contain any numbers in them.
Define one of these numbers to be a "daughter" of another number if the ID of the first number is the same as the ID of the second, with an extra digit on the end. For example, both IDs 11211 and 11212 are daughters of 1121, because the ID of 1121 has an extra digit, either a 1 or a 2, added onto the end to form the ID of its daughters. Thus, 1121 is the parent of both 11211 and 11212.
Here is what I want the macro to do. It must output a third column which contains, for every row, a cumulative sum of the number of the first column in that row, plus the parent number of that number, and the parent number of the parent number, etc. all the way up until it reaches either 11 or 12. It will begin by simply outputting the numbers in column 1 for 11 and 12 in the third column. Then, in a loop beginning with 111, it will add up the cumulative sum of every row (the number in that row plus the third column output of the parent), only if that row has a number and an id, and only if the parent exists and has an output in column 3. So for example, the number in the 3rd column of the row with ID 11222 should be the number in column 1 of that row, plus that of 1122, plus that of 112, plus that of 11. So, 0.17075+2.4518+0.92521+3.5463, or 7.09406. However, if you try to do this for ID 111221, you will notice that the row where the parent 11122 should be is empty. Thus, the parent does not exist, and no value will be outputted in column 3 for 111221.
I would greatly appreciate it if someone has some time on their hands to code up this VBA macro for me in exchange for an accepted solution.
Thanks
I don't think a macro is needed, just some formulas. First, I put a header on my columns of data, such as "value," and "id." If you then highlight the column labels (i.e., A and B) and sort by B ("id") then A ("value"), you'll group your blank rows. You can then delete those rows. Now you have the data almost ready.
When I did this, I converted the id column to text, as opposed to a number value, so if I sort the table by id, the pattern will be, "11, 111, 1111," and so on, instead of, "11, 12, 111, 112, 121." Then, I added columns to separate the individual characters or levels of the ids. This is to help with parents and children. You can use text-to-columns, or a MID formula, but what I did was have 6 more columns to the right. For each id row, each column would either have a "1," a "2," or a blank (null) value.
Then I added another column, calling it "level." I used a formula like COUNTA across all my id splitting columns. So, for 11, my level value was 2. 111 would be 3, 11221 would be 5, and so on. This gives me the id level (parent, child, grandchild, etc).
Then I added my final column to the right to compute my cumulative sum of the values. In concept I have one big nested IF statement, but in practice, I needed two. My formula says, if the row above me has a lower level number (i.e., it is some kind of parent), add the value of the current row to the value of the above row. Otherwise, keep going up a row till I do get a parent, and add the current row value to that number.
My final formula for all but the first 5 rows of data was (in the 6th row of data):
=if(K6<K7,L6+C7,if(K5<K7,L5+C7,if(K4<K7,L4+C7,if(K3<K7,L3+C7,if(K2<K7,L2+C7,C7)))))
The values were in column C, the original id in column D, the id split columns were E through J, the level column was K, and my formula was in L. This formula can be copied down the table. For the first 4 rows, you just need 1 less IF statement each row you go up. The fifth row of data might take the above formula; it depends how it will deal with the column headers in row one. The formula on the 4th row of data might be:
=if(K4<K5,L4+C5,if(K3<K5,L3+C5,if(K2<K5,L2+C5,if(K1<K5,L1+C5,C5))))
I'm still learning how to format these comments, so I'll try to provide a sample of the layout I have...
     C         D       E  F  G  H  I  J  K    L
1    value     id      1  2  3  4  5  6  lvl  cumul_sum
2    3.546300  11      1  1              2    3.546300
3    1.699300  111     1  1  1           3    5.245600
4    1.408200  1111    1  1  1  1        4    6.653800
5    1.370500  11111   1  1  1  1  1     5    8.024300
6    0.816310  111111  1  1  1  1  1  1  6    8.840610
7    0.918370  111112  1  1  1  1  1  2  6    8.942670
8    0.955700  11112   1  1  1  1  2     5    7.609500
So for example, the number in the 3rd column of the row with ID 11222 should be the number in column 1 of that row, plus that of 1122, plus that of 112, plus that of 11. So, 0.17075+2.4518+0.92521+3.5463, or 7.09406. However, if you try to do this for ID 111221, you will notice that the row where the parent 11122 should be is empty. Thus, the parent does not exist, and no value will be outputted in column 3 for 111221.
As a native worksheet array formula¹ in D1,
=IF(LEN(B1), SUM(SUMIFS(A$1:INDEX(A:A, MATCH(1E+99, A:A)),
B$1:INDEX(B:B, MATCH(1E+99, A:A)), LEFT(B1, ROW(INDIRECT("2:"&LEN(B1)))))), TEXT(,))
The above does not compensate for missing parents (null string). It totals everything it can find and uses zero for missing parents.
As a VBA UDF² in E1,
Function conditionalCumulativeSum(nums As Range, _
                                  ids As Range, sib As Range, _
                                  Optional nullOnBlank As Boolean = True)
    Dim i As Integer
    'truncate any full column reference to the UsedRange
    Set nums = Intersect(nums, nums.Parent.UsedRange)
    'match the nums and ids ranges
    Set ids = ids.Resize(nums.Rows.Count, nums.Columns.Count)
    For i = Len(sib.Value2) To 2 Step -1
        If nullOnBlank And IsError(Application.Match(--Left(sib, i), ids, 0)) Then
            conditionalCumulativeSum = vbNullString
            Exit For
        End If
        conditionalCumulativeSum = conditionalCumulativeSum + _
            Application.SumIfs(nums, ids, Left(sib, i))
    Next i
    If i = 0 Then conditionalCumulativeSum = vbNullString
End Function
The above defaults to return a null string when it encounters any missing parent through the hereditary chain. This can be turned off by adding FALSE as the optional fourth parameter and then the UDF will behave identically to the native formula.
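The "sum up the chain of parents, or return nothing if a parent is missing" rule can also be sketched language-neutrally in Python. This is an illustration, not a replacement for the formula or UDF above: the function name cumulative_sums is hypothetical, IDs are treated as strings, and only a handful of rows from the question's data are included.

```python
def cumulative_sums(values, roots=("11", "12")):
    """Map each id to its value plus all ancestors' values.

    `values` maps an id string to its number. An id's parent is the id with
    the last digit removed. The result is None for any id whose chain of
    parents is broken before reaching one of the roots.
    """
    result = {}
    # Shorter ids are processed first, so a parent is always done before
    # its daughters.
    for id_ in sorted(values, key=len):
        if id_ in roots:
            result[id_] = values[id_]
        else:
            parent_total = result.get(id_[:-1])
            result[id_] = (
                None if parent_total is None else values[id_] + parent_total
            )
    return result


# A few rows taken from the question's data (11122 is absent there too)
sample = {
    "11": 3.5463,
    "112": 0.92521,
    "1122": 2.4518,
    "11222": 0.17075,
    "111": 1.6993,
    "1112": 2.0698,
    "111221": 0.47411,
}

totals = cumulative_sums(sample)
print(totals["11222"])   # roughly 7.09406, as in the question
print(totals["111221"])  # None, because parent 11122 does not exist
```

Because each total is built from the parent's already computed total, every row needs only one lookup rather than a walk up the whole chain.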
¹ Array formulas need to be finalized with Ctrl+Shift+Enter↵. If entered correctly, Excel will wrap the formula in braces (e.g. { and }). You do not type the braces in yourself. Once entered into the first cell correctly, they can be filled or copied down or right just like any other formula. Try to reduce your full-column references to ranges more closely representing the extents of your actual data. Array formulas get much more expensive to calculate as the referenced ranges grow, so it is good practice to narrow those ranges to a minimum. See Guidelines and examples of array formulas for more information.
² A User Defined Function (aka UDF) is placed into a standard module code sheet. Tap Alt+F11 and when the VBE opens, immediately use the pull-down menus to Insert ► Module (Alt+I,M). Paste the function code into the new module code sheet titled something like Book1 - Module1 (Code). Tap Alt+Q to return to your worksheet(s).

Karnaugh map group sizes

Full disclosure: this is for an assignment. I don't think I'm looking for spoon-feeding, more just asking a general question. Am I allowed to break this into a group of 8 and 2 groups of 4, or do all group sizes have to be equal, i.e. 4 groups of 4?
1 0 1 1
0 0 0 0
1 1 1 1
1 1 1 1
Sorry if this is obvious, but my searches haven't been explicit and my teacher was quite vague. Thanks!
TL;DR: Groups don't have to be equal in size.
Let's see what happens if, in your case, you take 11 groups of one. You will then have an equation of eleven terms (i.e. case_1 or case_2 or ... case_11).
By making big groups, in your case 1 group of 8 and 2 groups of 4, you get a very short, simplified equation like: case_group_8 or case_group_4_1 or case_group_4_2.
Both groupings are correct (we covered all the ones in the map), but the second is the most optimized (i.e. you cannot simplify it further).
Making 4 groups of 4 gives you an equation that can still be simplified further.
The best way forward is to try both groupings (all 4s vs. 8/4/4) and compare the resulting equations.
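The comparison between groupings can be checked mechanically. The sketch below assumes the map is a standard 4-variable Karnaugh map whose rows and columns are in Gray-code order, so groups may wrap around the edges; the question does not show the variable labels, so that is an assumption. Groups are written as hypothetical (top_row, left_col, height, width) tuples, and each group of 2^k cells contributes a term of 4 - k literals.

```python
from math import log2

# The map from the question (rows and columns assumed to wrap around)
GRID = [
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
N_VARS = 4


def cells(group):
    """Cells covered by a (top_row, left_col, height, width) rectangle."""
    r0, c0, h, w = group
    return {((r0 + dr) % 4, (c0 + dc) % 4) for dr in range(h) for dc in range(w)}


def is_valid_cover(groups):
    """True if every group is a power-of-two block of ones and all ones are covered."""
    ones = {(r, c) for r in range(4) for c in range(4) if GRID[r][c]}
    covered = set()
    for g in groups:
        if g[2] not in (1, 2, 4) or g[3] not in (1, 2, 4):
            return False
        block = cells(g)
        if not block <= ones:
            return False
        covered |= block
    return covered == ones


def literal_count(groups):
    """Total literals in the sum-of-products form implied by the groups."""
    return sum(N_VARS - int(log2(g[2] * g[3])) for g in groups)


mixed = [(2, 0, 2, 4), (3, 2, 2, 2), (3, 3, 2, 2)]                 # 8 + 4 + 4
all_fours = [(2, 0, 2, 2), (2, 2, 2, 2), (3, 2, 2, 2), (3, 3, 2, 2)]
singles = [(r, c, 1, 1) for r in range(4) for c in range(4) if GRID[r][c]]

for name, groups in [("8/4/4", mixed), ("4x4", all_fours), ("singles", singles)]:
    print(name, is_valid_cover(groups), len(groups), literal_count(groups))
```

All three groupings cover the map, but the mixed 8/4/4 grouping yields the fewest terms and the fewest literals.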

SQL: How to sort overlapping groups efficiently

I'm trying to make groups in a database with 10,000+ rows.
I need to be fast and efficient, so I'm using binary variables for each cluster.
One, Two, Four, Five and Six are in Group1.
But 'Two' might also be in Group2, because of errors I cannot avoid: my dataset comes from a web scrape. I try to sort everything uniquely, but it's basically impossible not to make errors if I want to be efficient and fast.
ID Title Group1 Group2 Group3 Ungrouped
1 One 1 0 0 0
2 Two 1 1 0 0
3 Three 0 1 1 0
4 Four 1 0 1 0
5 Five 1 0 0 0
6 Six 1 1 1 0
7 Seven 0 0 0 1
My idea for a solution:
Assign groups (ones) until everything is grouped one or more times.
Make a query for everything that has more than one group assigned (2, 3, 4, 6).
Manually decide which 1s to remove, until each row has only one group assigned.
(It's actually a good idea to do the 3rd part manually, because it requires content analysis of the documents.)
My question:
How do I specify that I need to see everything with more than one group? Does it have something to do with constraints and unique values, or is there a simpler, more obvious way that I'm not seeing?
If your clusters are stored as integers, you can just do:
select c.*
from clusters c
where (cluster1 + cluster2 + cluster3) > 1;
I don't know what a "binary variable" is in SQLite. Some databases do support binary flags, and you would need to convert the values to integers for the where clause.
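A quick way to try the query is an in-memory SQLite database driven from Python's built-in sqlite3 module. In this sketch the table name clusters comes from the answer, while the column names Group1, Group2 and Group3 are taken from the question's table; storing the flags as INTEGER 0/1 is an assumption.

```python
import sqlite3

# Sample rows from the question: ID, Title, Group1, Group2, Group3, Ungrouped
rows = [
    (1, "One", 1, 0, 0, 0),
    (2, "Two", 1, 1, 0, 0),
    (3, "Three", 0, 1, 1, 0),
    (4, "Four", 1, 0, 1, 0),
    (5, "Five", 1, 0, 0, 0),
    (6, "Six", 1, 1, 1, 0),
    (7, "Seven", 0, 0, 0, 1),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clusters ("
    "ID INTEGER PRIMARY KEY, Title TEXT, "
    "Group1 INTEGER, Group2 INTEGER, Group3 INTEGER, Ungrouped INTEGER)"
)
conn.executemany("INSERT INTO clusters VALUES (?, ?, ?, ?, ?, ?)", rows)

# Rows assigned to more than one group: the 0/1 flags sum to more than 1
multi = conn.execute(
    "SELECT ID, Title FROM clusters "
    "WHERE (Group1 + Group2 + Group3) > 1 "
    "ORDER BY ID"
).fetchall()

print(multi)
conn.close()
```

On the sample data this returns IDs 2, 3, 4 and 6, matching the rows the question expects to review manually.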