Database transactions theory - SQL

No book seems to be able to answer this.
Suppose I have two transactions:
T1: Lock A, Lock B, Unlock A
T2: Lock B, Unlock B, Lock A, Unlock A
Q1. How many ways are there to plan these transactions? (Is it just a simple graph and the result is 3! * 4! ?)
Q2. How many of these ways are serializable?
I would really like to know the thinking process: how do you get to the answer?

The answer to Q1 is 7.
Proof: First, we have to merge the sequence 'Lock A', 'Lock B', 'Unlock A' (I denote the items A1, A2, A3) into the sequence 'Lock B', ..., 'Unlock A' (I denote them B1..B4). That means placing 3 items into the 5 gaps around the B's (before, between, and after them), with repetition allowed, which is the binomial coefficient "choose 3 from (5-1+3)". It is equal to 7!/(3!*4!) = 35.
Next, we have to drop the 'bad' interleavings (the ones prevented by the locking conditions). These are the ones where A1 stands between B3 and B4 (3 interleavings) and where A2 stands between B1 and B2 (2*4 = 8). We also have to exclude the interleavings with B3 between A1 and A3: there are 3*3 = 9 with B3 between A1 and A2, and 6*2 = 12 with B3 between A2 and A3. Thus we have 35 - 3 - 8 - 9 - 12 = 3. But by the inclusion-exclusion principle we must add back the interleavings that violate two rules simultaneously. These can only look like B1 A2 B2 B3 B4, with A1 in either of the two positions left of A2 and A3 in either of the two positions right of B3: 4 in total. So the final answer is 35 - 3 - 8 - 9 - 12 + 4 = 7.
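As a sanity check on this counting, here is a small brute-force sketch (Python, using the labels A1..A3 and B1..B4 from the proof, and encoding the three 'bad' patterns exactly as stated above) that enumerates all 35 interleavings and counts the survivors:

from itertools import combinations

n, k = 7, 3  # 7 schedule slots, 3 of them taken by T1's steps

good = 0
for t1_slots in combinations(range(n), k):  # all C(7,3) = 35 interleavings
    A1, A2, A3 = t1_slots                   # positions of T1's steps
    B1, B2, B3, B4 = (i for i in range(n) if i not in t1_slots)  # positions of T2's steps
    bad = (B3 < A1 < B4) or (B1 < A2 < B2) or (A1 < B3 < A3)
    good += not bad

print(good)  # prints 7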

Related

Turning one-hot encoded table into 2D table of counts

I think I can solve this problem without too much difficulty, but I suspect that any solution I come up with will be sub-optimal, so I am interested in how the real pandas experts would do it; I'm sure I could learn something from that.
I have a table of data that is one-hot encoded, something like:
Index  A1     A2     A3     B1    B2    C1     C2    C3     C4
0      True   False  True   True  True  False  True  False  False
...
So every entry is a Boolean and my columns consist of several groups of categories (the A's, B's and C's).
What I want to create is new DataFrames where I pick any two categories and get a table of counts of how many people are in the pair of categories corresponding to that row/column. So, if I was looking at categories A and B, I would generate a table:
Index  A1   A2   A3   None  Total
B1     x11  x12  x13  x1N   x1T
B2     x21  x22  x23  x2N   x2T
None   xN1  xN2  xN3  xNN   xNT
Total  xT1  xT2  xT3  xTN   xTT
where x11 is the count of rows in the original table that have both A1 and B1 True, x12 is the count of those rows that have A1 and B2 True, and so on.
I'm also interested in the counts of those entries where all the A values were False and/or all the B values were false, which are accounted for in the None columns.
Finally, I would also like the totals of rows where any of the columns in the corresponding category were True. So x1T would be the number of rows where B1 was True and any of A1, A2 or A3 were True, and so on. (Note that this is not just the sum of x11, x12 and x13, as the categories are not always mutually exclusive; a row could have both A1 True and A2 True, for example.) xNN is the number of rows that have all-False values for A1, A2, A3, B1, B2, and xTT is the number of rows that have at least one True value among A1, A2, A3, B1 and B2, so xNN + xTT would equal the total number of rows in the original table.
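To pin the definitions down, here is a minimal sketch (toy data; the values are made up for illustration) of how a few of these cells would be computed by hand in pandas:

import pandas as pd

# hypothetical toy frame with the one-hot layout described above
df = pd.DataFrame({
    "A1": [True, True, False], "A2": [False, True, False], "A3": [False, False, False],
    "B1": [True, False, False], "B2": [False, False, True],
})
A = df[["A1", "A2", "A3"]]
B = df[["B1", "B2"]]

x11 = (df["A1"] & df["B1"]).sum()              # both A1 and B1 True
x1N = (df["B1"] & ~A.any(axis=1)).sum()        # B1 True, every A column False
x1T = (df["B1"] & A.any(axis=1)).sum()         # B1 True, at least one A column True
xNN = (~A.any(axis=1) & ~B.any(axis=1)).sum()  # every A and B column False
print(x11, x1N, x1T, xNN)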
Thanks
Graham
This is my approach:
def get_table(data, prefix):
    '''
    Get the columns in the respective category
    and assign `none` and `Total` columns.
    '''
    return (data.filter(like=prefix)
                .assign(none=lambda x: (1 - x).prod(axis=1),  # 1 iff every column in the category is 0
                        Total=lambda x: x.any(axis=1)))       # note: evaluated after `none` is assigned
import numpy as np
import pandas as pd

# sample data
np.random.seed(1)
df = pd.DataFrame(np.random.choice([True, False], size=(5,9), p=(0.6, 0.4)),
                  columns=[f'{x}{y}' for x in 'ABC' for y in '123'])
# the output
df = df.astype(int)
get_table(df, 'B').T @ get_table(df, 'A')
Output:
A1 A2 A3 none Total
B1 3 2 1 0 3
B2 3 2 1 0 3
B3 2 1 1 0 2
none 0 0 1 0 1
Total 4 3 2 0 5
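(The matrix product is what does the counting here: entry (i, j) of get_table(df, 'B').T @ get_table(df, 'A') sums, over all rows, the product of the i-th B indicator and the j-th A indicator, which is 1 exactly when both are set.)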
Here I don't understand why (none, Total) must be zero, since none corresponds to all False in B and Total corresponds to some True in A.

Plotting CDF for ranking distribution

I have a pandas DataFrame that looks like this; it is generated with a groupby and then sorted by the number of users, giving the user count for the top X feature combinations.
count_28day  Feature1  Feature2  Feature3
5000         a1        b1        c1
1000         a2        b2        c2
50           a3        b3        c3
I'm trying to plot the CDF of the user distribution. I don't need to know the features; I just want to show the top X feature combinations that cover 90% of the total users.
I'm doing this in a very hacky way.
topx = table.count_28day.sort_values(ascending=False).cumsum()/sum(table.count_28day)
ser_cdf = pd.Series(topx.tolist()[1:100], index=pd.Series(range(1,100)))
ser_cdf.plot(drawstyle='steps')
Is there a more elegant way to do this using histogram or ecdf or something?
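One sketch of a less hacky version (assuming the frame is called table, as above): normalize the cumulative sum into a cumulative user share, then truncate at the 90% mark:

import matplotlib.pyplot as plt
import pandas as pd

# cumulative share of users covered by the top k combinations
share = (table['count_28day']
         .sort_values(ascending=False)
         .cumsum()
         .div(table['count_28day'].sum())
         .reset_index(drop=True))
share.index = range(1, len(share) + 1)  # x-axis: k = 1, 2, ...

top_x = (share < 0.9).sum() + 1  # smallest k whose combinations reach 90% of users
share.iloc[:top_x].plot(drawstyle='steps-post')
plt.xlabel('top k feature combinations')
plt.ylabel('cumulative user share')
plt.show()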

Using 'loop' or 'for' with table data to pull each row and use it for two parameters in GAMS

I am new to GAMS and I have a table with 3 rows and 6 columns. I want to pull each row and use its data for two parameters (each row has 6 elements; the first three should go to one parameter and the other three to the second) using a loop or for statement. I tried both: with loop I got zero values for my parameter, which is incorrect, and with the for statement I got errors.
This is my code for the first row, in which both 'loop' and 'for' appear (I used them separately each time, but to show what my code was I wrote them together).
Please help me.
Thanks
scalars j;
sets
  o /red,green,blue/
  p /b1,b2,b3,p1,p2,p3/
  k /1*3/;

Table sup(*,*)
        b1  b2  b3   p1  p2  p3
red     12  15  20  200  50  50
green   16  17   0  150  50   0
blue    13  18   0  100  50   0 ;

parameters Bid_Red(k), Pmax_Red(k), t;

*for statement***************
for(j = 1 to 3,
  t = card(o) + j;
  Bid_Red(k)$(ord(k) = j) = sup('red',j);
  Pmax_Red(k)$(ord(k) = j) = sup('red',t);
);

*loop statement***************
t = card(o);
loop(k,
  Bid_Red(k) = sup('red',k);
  Pmax_Red(k) = sup('red',k+t);
);

display Bid_red, Pmax_Red;
One of the core features of GAMS is how it deals with set structures and indexing. I'd recommend looking at the excellent documentation, for example on set definition https://www.gams.com/latest/docs/UG_SetDefinition.html, to really get a feel for how to get the best out of it.
In your case, you can proceed as follows. p is a set; create some subsets of it, p_ and b_, declared with the syntax subset_name(set_name).
sets p_(p) / p1, p2, p3 /,
     b_(p) / b1, b2, b3 /;
Create parameters over appropriate dimensions (i.e. the full set), and define them over the subset you are interested in:
parameters bid_red(o,p),pmax_red(o,p);
bid_red(o,b_) = sup(o,b_);
pmax_red(o,p_) = sup(o,p_);
Then display bid_red, pmax_red; gives:
---- 21 PARAMETER bid_red
b1 b2 b3
red 12.000 15.000 20.000
green 16.000 17.000
blue 13.000 18.000
---- 21 PARAMETER pmax_red
p1 p2 p3
red 200.000 50.000 50.000
green 150.000 50.000
blue 100.000 50.000
If you do want to select individual rows, you can use e.g. pmax_red('red',p_) in your code. This is essentially just a special case of subsetting in which the subset is of size 1.
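The broader design point: rather than looping over numeric indices, you define assignments over sets and subsets and let GAMS do the iteration; dollar conditions and subsets replace most uses of loop and for.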

Redis and linked hashes

Hi everyone,
I would like to ask the community for help in finding a way to cache our huge plain table by splitting it into multiple hashes, or otherwise.
A sample of the table, to illustrate the structure:
A1 B1 C1 D1 E1 X1
A1 B1 C1 D1 E1 X2
A7 B5 C2 D1 E2 X3
A8 B1 C1 D1 E2 X4
A1 B6 C3 D2 E2 X5
A1 B1 C1 D2 E1 X6
This is our denormalized data; we have no ability to normalize it.
So currently we must perform a 'group by' to get the required items. For instance, to get all D* we perform data.GroupBy(A1).GroupBy(B1).GroupBy(C1), and it takes a lot of time.
As a temporary workaround, we create composite string keys:
A1 -> 'list of lines beginning with A1'
A1:B1 -> 'list of lines beginning with A1:B1'
A1:B1:C1 -> 'list of lines beginning with A1:B1:C1'
...
as a cache of results of grouping operations.
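A minimal sketch of this composite-key layout (Python with redis-py; the key prefix idx: and the sample rows are made up for illustration):

import redis

r = redis.Redis()  # assumes a local Redis instance

rows = [
    ('A1', 'B1', 'C1', 'D1', 'E1', 'X1'),
    ('A1', 'B1', 'C1', 'D1', 'E1', 'X2'),
    ('A7', 'B5', 'C2', 'D1', 'E2', 'X3'),
]

pipe = r.pipeline()
for row in rows:
    line = ':'.join(row)
    # index every prefix of the row: A1, A1:B1, A1:B1:C1, ...
    for depth in range(1, len(row)):
        pipe.rpush('idx:' + ':'.join(row[:depth]), line)
pipe.execute()

# all lines beginning A1:B1:C1, in time proportional to the result size:
lines = r.lrange('idx:A1:B1:C1', 0, -1)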
The question is: how can this be stored efficiently?
The estimated number of lines in the denormalized data is around 10M records, and since there are 6 columns as in my example, that would be about 60M entries in the hash. So I'm looking for an approach to look up values in O(N) if possible.
Thanks.

Fortran read file into array - transposed dimensions

I'm trying to read a file into memory in a Fortran program. The file has N rows with two values in each row. This is what I currently do (it compiles and runs, but gives me incorrect output):
program readfromfile
  implicit none
  integer :: N, i, lines_in_file
  real*8, allocatable :: cs(:,:)

  N = lines_in_file('datafile.txt') ! a function I wrote, which works correctly
  allocate(cs(N,2))
  open(15, file='datafile.txt', status='old')
  read(15,*) cs
  do i = 1, N
    print *, cs(i,1), cs(i,2)
  enddo
end
What I hoped to get was the data loaded into the variable cs, with lines as the first index and columns as the second. But when the above code runs, it first prints a line with two "left column" values, then a line with two "right column" values, then a line with the next two "left column" values, and so on.
Here's a more visual description of the situation:
In my data file: Desired output: Actual output:
A1 B1 A1 B1 A1 A2
A2 B2 A2 B2 B1 B2
A3 B3 A3 B3 A3 A4
A4 B4 A4 B4 B3 B4
I've tried switching the indices when allocating cs, but with the same results (or a segfault, depending on whether I also switch the indices in the print statement). I've also tried reading the values row by row, but because of the irregular format of the data file (comma-delimited, not column-aligned) I couldn't get that working at all.
What is the best way to read the data into memory to achieve the result I want?
I do not see any comma in your data file, but it should not make any difference with list-directed input anyway. Just try to read it the same way you print it:
do i = 1, N
  read(15,*) cs(i,1), cs(i,2)  ! read from unit 15, the file opened above
enddo
Otherwise, if you read the whole array in one statement, it is read in column-major order, i.e., cs(1,1), cs(2,1), ..., cs(N,1), cs(1,2), cs(2,2), ...; this is the order in which the array is stored in memory.
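(So if you do want a single read statement, one option following from the column-major order above is to allocate the array transposed, allocate(cs(2,N)); then read(15,*) cs fills it in exactly the file's row-by-row order, and you print cs(1,i), cs(2,i).)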