Use of bitwise OR operator in calculating a return value - vba

I am using some code that I found at AllenBrowne.com, which works fine, but I have a question about what it's doing.
The code is designed to return information about any index found on a specific column of a table in MS Access. Index types are identified with a constant, and there are four possible index types (including None):
Private Const intcIndexNone As Integer = 0
Private Const intcIndexGeneral As Integer = 1
Private Const intcIndexUnique As Integer = 3
Private Const intcIndexPrimary As Integer = 7
The relevant piece of code is as follows:
Private Function IndexOnField(tdf As DAO.TableDef, fld As DAO.Field) As Integer
'Purpose: Indicate if there is a single-field index on this field in this table.
'Return:  The constant indicating the strongest type.
    Dim ind As DAO.Index
    Dim intReturn As Integer

    intReturn = intcIndexNone
    For Each ind In tdf.Indexes
        If ind.Fields.Count = 1 Then
            If ind.Fields(0).Name = fld.Name Then
                If ind.Primary Then
                    intReturn = (intReturn Or intcIndexPrimary)
                ElseIf ind.Unique Then
                    intReturn = (intReturn Or intcIndexUnique)
                Else
                    intReturn = (intReturn Or intcIndexGeneral)
                End If
            End If
        End If
    Next

    'Clean up
    Set ind = Nothing
    IndexOnField = intReturn
End Function
To be truthful, I didn't really understand the concept of a bitwise OR operator, so I spent the last couple of hours researching it, and now I think I do. Along the way, I noticed that the four possible index values form a clear binary pattern:
None: 0
General: 1
Unique: 11
Primary: 111
All of which is good. But I don't understand the function's use of the OR operator in these lines:
If ind.Primary Then
    intReturn = (intReturn Or intcIndexPrimary)
ElseIf ind.Unique Then
    intReturn = (intReturn Or intcIndexUnique)
Else
    intReturn = (intReturn Or intcIndexGeneral)
End If
Given that the structure of this code means that only one branch can ever execute for a given index, why not just assign the required constant directly, without the use of OR? I know that Allen Browne's code is always well crafted, so I assume he won't have done this without good reason, but I can't see what it is.
Can someone help, so that I can better understand - and write better code myself in future?
Thanks

What basodre pointed out about the bitwise OR is correct, but the values here are not on the 2, 4, 8 basis.
When dealing with an index, ALL of the possibilities can apply at once, hence the 1, 3, 7 (right-most 3 bits):
0000 = No index
0001 = regular index
0011 = unique index
0111 = PRIMARY index
So, the IF block is testing against the HIGHEST QUALIFIER of the index type.
Any index can be regular, no problem.
Some indexes can be unique, and they could be on some sort of concatenated fields that you want to be unique but that have NOTHING to do with the primary key of the table.
Last is the primary key of the table - which is ALSO unique.
So, if the index you are testing against IS the primary key, it would also show as true if you asked whether it was an index at all, or whether it was a unique index.
So, what it is doing is starting with
intReturn = intcIndexNone
which in essence sets the return value to a default of 0. Then it cycles through all indexes in the table that include the given field. A table could have 20 indexes on it, and 5 of them could use the field in question. That one field could be part of a regular, unique or primary-key index.
So the loop starts with NONE (0), then, each time the field is found in an index, ORs in whatever type that current index is.
So let's say that, as it goes through, the indexes show the given field as Unique first, then Regular, then Primary, just for grins, to see the result of the OR each cycle:
intReturn (default)   0000
OR Unique             0011
                      ====
                      0011   NEW value moving forward

intReturn             0011
OR Regular            0001
                      ====
                      0011   Since Unique was the higher classification, no change

intReturn             0011
OR Primary            0111
                      ====
                      0111   Just upgraded to the highest classification index
So now, it's returning the OR'd result of whatever the previous value was. In this case, the highest index association is the "Primary" index.
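If you want to experiment with the upgrade behaviour outside Access, here is a small Python sketch (mine, not Allen Browne's) that reproduces the trace above:

    NONE, GENERAL, UNIQUE, PRIMARY = 0b000, 0b001, 0b011, 0b111

    result = NONE
    for index_type in (UNIQUE, GENERAL, PRIMARY):   # the order used in the trace
        result |= index_type                        # OR in the current index's type
        print(format(result, '04b'))                # 0011, 0011, 0111

Because each stronger constant contains all the bits of the weaker ones, the accumulated value can only ever stay the same or upgrade, never downgrade.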
Does that clarify it for you?

The bitwise OR is useful in cases where combinations of values can exist and you want to return an additive value. In this specific code block, the code loops through each of the indices and sets the flag based on the specific index. If there are two indexes, one general and the other primary, you can encode this information in the resultant bit pattern.
I'm confused by the choice of bit masks, though. By choosing values with all of the lower bits set to true, you lose information about the individual items (maybe that's a design element).
Generally, bit masks might look something like:
Option A = 2 --> 0010
Option B = 4 --> 0100
Option C = 8 --> 1000
If you want both Option A and Option B to be true, the BIT OR would return 6, which is 0110.
Now, if you need to test whether Option A is set, you use the BIT AND operation. If you test (6 BIT AND 2), it returns a value greater than 0. However, if you test (6 BIT AND 8), checking for Option C, it returns 0.
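For illustration, here is that conventional flag scheme in Python (the option names are invented for the example):

    OPTION_A = 2   # 0010
    OPTION_B = 4   # 0100
    OPTION_C = 8   # 1000

    flags = OPTION_A | OPTION_B   # bitwise OR -> 6, i.e. 0110
    print(flags & OPTION_A)       # 2: non-zero, so Option A is set
    print(flags & OPTION_C)       # 0: Option C is not set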
Hopefully that adds some clarity. I don't have much information about how Access specifically works with indexes, so I'm just speaking to the general case.
EDIT: So I re-read the function definition, and it seems like the choice of integers is intentional. The function intentionally returns the strongest type of index: if there is a primary index, it will only report primary. Considering this, I'm not sure that the bitwise OR is the most self-descriptive option here. Maybe there is another consideration at play.

Related

How do I reverse each value in a column bit wise for a hex number?

I have a dataframe with a column called hexa which holds hex values like these; they are of dtype object.
hexa
0 00802259AA8D6204
1 00802259AA7F4504
2 00802259AA8D5A04
I would like to remove the first and last bytes and reverse the remaining bytes, as follows:
hexa-rev
0 628DAA592280
1 457FAA592280
2 5A8DAA592280
Please help
I'll show you the complete solution up here and then explain its parts below:
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
    list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
    reversed_bits = [list_of_bits[-i] for i in range(1, len(list_of_bits)+1)]
    return ''.join(reversed_bits)

df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
There are possibly a couple of ways of doing it, but this one should solve your problem. The general strategy is to define a function and then use the apply() method to apply it to all values in the column. It looks like this:
df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
Now we need to define the function we're going to apply. Breaking it down into its parts, we first strip the leading and trailing byte (two hex characters each) by slicing. Because of how negative indexes work, this eliminates the first and last two characters regardless of the string's length. The result is a string of characters that we will pair up and join back together after processing.
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
The second line walks the trimmed string two characters at a time, pairing up the first and second character of each byte and concatenating them into a single two-character string representing that byte.
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
    list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
The second-to-last line builds the same list in reverse order. Lastly, the function joins the reversed list back into a single string.
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
    list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
    reversed_bits = [list_of_bits[-i] for i in range(1, len(list_of_bits)+1)]
    return ''.join(reversed_bits)
I explained it in reverse order, but the idea is to define the function you want applied to your column, and then use apply() to make it happen.
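As a quick check, here is the function from above applied to the sample values in the question (the dataframe is reconstructed from the question's data):

    import pandas as pd

    df = pd.DataFrame({'hexa': ['00802259AA8D6204',
                                '00802259AA7F4504',
                                '00802259AA8D5A04']})
    df['hexa-rev'] = df['hexa'].apply(reverse_bits)
    print(df['hexa-rev'].tolist())
    # ['628DAA592280', '457FAA592280', '5A8DAA592280']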

Find index of an ordered set of N elements

Problem description:
A set of lists of N integers i1, i2, ..., iN with 0 <= i1 <= i2 <= ... <= iN <= M is created by starting with one integer 0 <= i1 <= M and repeatedly appending one integer that is greater than or equal to the last integer added.
When the last integer is added to obtain the final set of lists, the index runs from 0 to C(M+N, N) - 1.
For example, for M=3, i1=0,1,2,3
so the lists are
{0},{1},...,{3}.
Adding another integer i2>=i1 will result in
{0,0},{0,1},{0,2},{0,3},
{1,1},{1,2},{1,3},
{2,2},{2,3}
{3,3}
with indices
0,1,2,3,
4,5,6,
7,8,
9.
This index can be expressed in terms of i1, i2, ..., iN and M. If the >= conditions were not present, it would simply be i1*(M+1)^(N-1) + i2*(M+1)^(N-2) + ... + iN*(M+1)^0. But with the restrictions above, there is a negative shift in the index. For example, for N=2 the shift is -i1*(i1+1)/2, and the index is i = i1*(M+1)^1 + i2*(M+1)^0 - i1*(i1+1)/2.
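As a quick sanity check of the N=2 shift formula (my own sketch, not part of the question), Python's itertools can enumerate the lists in exactly this order:

    from itertools import combinations_with_replacement

    M = 3
    # all non-decreasing lists (i1, i2) with 0 <= i1 <= i2 <= M, in the order above
    lists = list(combinations_with_replacement(range(M + 1), 2))
    for index, (i1, i2) in enumerate(lists):
        assert index == i1 * (M + 1) + i2 - i1 * (i1 + 1) // 2
    print(len(lists))  # 10 == C(M+N, N) = C(5, 2)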
Question:
Does anyone, especially someone from a mathematics background, know how to write the index for the general N-element case, or just the final expression? Any help would be appreciated!
Thanks!

Finding the count of a set of substrings in pandas dataframe

I am given a set of substrings. I need to find the count of occurrences of each of those substrings in a particular column of a dataframe. The relevant dataframe looks like this:
training['concat']
0 svAxu$paxArWAn
1 xvAxaSa$varRANi
2 AxAna$xurbale
3 go$BakwAH
4 viXi$Bexena
5 nIwi$kuSalaM
6 lafkA$upamam
7 yaSas$lipsoH
8 kaSa$AGAwam
9 hewumaw$uwwaram
10 varRa$pUgAn
My set of substrings is a dictionary, where the keys are the substrings and the values are the probabilities with which they occur:
reg_ = {'anuBavAn':0.35, 'a$piwra':0.2 ...... 'piwra':0.7, 'pa':0.03, 'a':0.0005}
# The length of the dictionary is 2000
In particular, I need to find those substrings which occur more than twice.
I have written the following code that performs the task. Is there a more elegant, pythonic, or pandas-specific way to achieve the same? The current implementation takes quite some time to execute.
elites = dict()
for reg_pat in reg_:
    eliter = len(training[training['concat'].str.contains(reg_pat)]['concat'])
    if eliter >= 3:
        elites[reg_pat] = reg_[reg_pat]
You can use apply instead of str.contains; it is faster:
reg_ = {'anuBavAn':0.35, 'a$piwra':0.2, 'piwra':0.7, 'pa':0.03, 'a':0.0005}
elites = dict()
for reg_pat in reg_:
    if training['concat'].apply(lambda x: reg_pat in x).sum() >= 3:
        elites[reg_pat] = reg_[reg_pat]

print (elites)
{'a': 0.0005}
Hopefully I have interpreted your question correctly. I'm inclined to stay away from regex here (in fact, I've never used it in conjunction with pandas), but it's not wrong, strictly speaking. In any case, I find it hard to believe that any regex operations are faster than a simple in check, but I could be wrong on that.
# assuming training here is the Series of strings (training['concat'] in the question)
for substr in reg:
    totalStringAppearances = training.apply(lambda string: substr in string)
    totalStringAppearances = totalStringAppearances.sum()
    if totalStringAppearances > 2:
        reg[substr] = totalStringAppearances / len(training)
    else:
        pass  # do what you want to with the very rare substrings
Some gotchas:
If you wanted something like a substring 'a' in 'abcdefa' to return 2, then this will not work. It merely checks for existence of the substring in each string.
Inside the apply(), I am using a potentially unreliable exploitation of booleans. See this question for more details.
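For what it's worth, the "exploitation" is simply that pandas counts True as 1 when summing a boolean Series, e.g. with a few rows taken from the question's data:

    import pandas as pd

    s = pd.Series(['xvAxaSa$varRANi', 'AxAna$xurbale', 'go$BakwAH'])
    mask = s.apply(lambda x: 'a$' in x)   # boolean Series: [True, True, False]
    print(mask.sum())                     # 2 -- each True counts as 1 when summed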
Post-edit: Jezrael's answer is more complete, as it uses the same variable names. But in a simple case, regarding regex vs. apply with in, I validated his claim, and my presumption.

What is Lazy Binary Search?

I don't know whether the term "Lazy" Binary Search is valid, but I was going through some old materials and I just wanted to know if anyone can explain the algorithm of a Lazy Binary Search and compare it to a non-lazy Binary Search.
Let's say, we have this array of numbers:
2, 11, 13, 21, 44, 50, 69, 88
How to look for the number 11 using a Lazy Binary Search?
Justin was wrong.
The common binarySearch algorithm first checks whether the target is equal to the current middle entry before proceeding to the left or right half if required. The lazy binarySearch algorithm postpones the equality check until the very end.
algorithm lazyBinarySearch(array, n, target)
    left <- 0
    right <- n - 1
    while (left < right) do
        mid <- (left + right) / 2
        if (target > array[mid]) then
            left <- mid + 1
        else
            right <- mid
        endif
    endwhile
    if (target == array[left]) then
        display "found at position", left
    else
        display "not found"
    endif
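For reference, here is a direct Python transcription of that pseudocode (my own rendering, not from the cited book):

    def lazy_binary_search(array, target):
        left, right = 0, len(array) - 1
        while left < right:
            mid = (left + right) // 2
            if target > array[mid]:
                left = mid + 1
            else:
                right = mid                   # note: mid, not mid - 1
        if array and array[left] == target:   # the postponed equality check
            return left
        return -1

    print(lazy_binary_search([2, 11, 13, 21, 44, 50, 69, 88], 11))  # 1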
In your case, in an array,
2 11 13 21 44 50 69 88
and you want to search for 11
First we do a trace of common binary search,
index:   0    1    2    3    4    5    6    7
value:   2   11   13   21   44   50   69   88
        left           mid                right
First while loop:
left <= right, so we enter the first while loop. We calculate the mid index as (0+7)/2 = 3.5 = 3 by integer division, so mid = 3. Straight away we check whether the target 11 is equal to the mid index entry: 11 != 21. Then we decide whether to go left or right: we find 11 < 21, so we should go left. The left index remains unchanged, and the right index becomes the mid index - 1, right = 3 - 1 = 2. Done with this step.
Second while loop:
left <= right, 0 <= 2, so we enter the second while loop. The mid index is recalculated: (0+2)/2 = 1, so mid = 1. At once we do the equality check: the target 11 is the same as the index 1 entry, 11 == 11. We have found the entry, so we break out of the while loop (the remaining left/right/mid indexes no longer matter) and return index 1.
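That trace matches what a standard implementation does; as a point of comparison, a common binary search in Python might look like this (again my own sketch):

    def common_binary_search(array, target):
        left, right = 0, len(array) - 1
        while left <= right:
            mid = (left + right) // 2
            if array[mid] == target:      # equality checked on every pass
                return mid
            elif target < array[mid]:
                right = mid - 1
            else:
                left = mid + 1
        return -1

    print(common_binary_search([2, 11, 13, 21, 44, 50, 69, 88], 11))  # 1, found on the second pass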
Now we trace the same search with the lazy binarySearch algorithm, with the initial array and left/right indexes set up the same as before.
First while loop:
left < right, so we enter the first while loop. The mid index is calculated as before: mid = 3. Instead of the equality check of the common binarySearch, this time we only compare against the mid index entry: the comparison only asks whether our target 11 is greater than the mid entry, leaving the question of equality to the very end, outside the while loop. We find 11 < 21, so the right index is reset to the mid index, right = 3. Done with this step.
Second while loop:
left < right, 0 < 3, so we enter the second while loop. The mid index is recalculated as mid = (0+3)/2 = 1 by integer division. Again we compare against the mid index entry 11 and find that the target is not greater than it. We fall into the else part of the loop and reset the right index to the mid index, right = 1. Done with this step.
Third while loop:
left < right, 0 < 1, so we have to re-enter the while loop, since the condition is still satisfied. The mid index becomes (0+1)/2 = 0, so mid = 0. Comparing the target 11 with the mid index entry 2, we find that it is greater (11 > 2), so the left index is reset to mid + 1, left = 0 + 1 = 1. Done with this step.
With left index = 1 and right index = 1, left = right, so the while condition is no longer satisfied and there is no need to re-enter. We fall into the if statement below: the target 11 equals the entry at the left index, so we have found the target and return left index 1.
As you can see, the lazy binarySearch does one more while-loop iteration before it finally establishes that the index is 1. This is what "postpones the equality check" means in the definition mentioned at the very beginning. Clearly the lazy binarySearch algorithm does more work than the common binarySearch before terminating on this input; the "lazy" refers to when the target is checked for equality against the mid index entry.
However, lazy binarySearch is preferable under some other circumstances, though not in the context of this case.
(The reasoning through the algorithm is my own; if you wish to copy it, please give credit.)
source: "Data Structures Outside In With Java" by Sesh Venugopal, Prentice Hall
As far as I am aware "Lazy binary search" is just another name for "Binary search".

Power-law distribution in T-SQL

I basically need the answer to this SO question that provides a power-law distribution, translated to T-SQL for me.
I want to pull a last name, one at a time, from a census provided table of names. I want to get roughly the same distribution as occurs in the population. The table has 88,799 names ranked by frequency. "Smith" is rank 1 with 1.006% frequency, "Alderink" is rank 88,799 with frequency of 1.7 x 10^-6. "Sanders" is rank 75 with a frequency of 0.100%.
The curve doesn't have to fit precisely at all. Just give me about 1% "Smith" and about 1 in a million "Alderink".
Here's what I have so far.
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank] = ROUND(88799 * RAND(), 0)
But this of course yields a uniform distribution.
I promise I'll still be trying to figure this out myself by the time a smarter person responds.
Why settle for a power-law distribution when you can draw from the actual distribution?
I suggest you alter the LastNames table to include a numeric column holding, for each name, a value representing the cumulative number of individuals with a more common name. You'll probably want a number on a smaller but proportional scale, say, 10,000 for each percent of representation.
The list would then look something like:
(other than the 3 names mentioned in the question, I'm guessing about White, Johnson et al)
Smith 0
White 10,060
Johnson 19,123
Williams 28,456
...
Sanders 200,987
..
Alderink 999,997
And the name selection would be
SELECT TOP 1 [LastName]
FROM [LastNames] as LN
WHERE LN.[number_described_above] < ROUND(1000000 * RAND(), 0)
ORDER BY [number_described_above] DESC
That's picking the first name whose number does not exceed the [uniformly distributed] random number. Note how the query uses less-than and orders in descending order; this guarantees that the very first entry (Smith) can get picked. The alternative would be to start the series with Smith at 10,060 rather than zero and to discard random draws smaller than this value.
Aside from the matter of boundary management (starting at zero rather than 10,060) mentioned above, this solution, along with the two other responses so far, is the same as the one suggested in dmckee's answer to the question referenced in this question. Essentially, the idea is to use the CDF (cumulative distribution function).
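To make the CDF idea concrete outside SQL, here is a minimal Python sketch of the same draw; the cumulative values are the (mostly guessed) ones from the list above:

    import random
    import bisect

    # cumulative counts on the 0..1,000,000 scale described above (mostly guessed)
    names      = ['Smith', 'White', 'Johnson', 'Williams', 'Alderink']
    cum_counts = [0, 10060, 19123, 28456, 999997]

    def draw_name():
        r = random.randint(1, 1000000)
        # last name whose cumulative count is strictly below r
        return names[bisect.bisect_left(cum_counts, r) - 1]

    print(draw_name())

With the full table you would have all 88,799 cumulative counts, and the same bisection picks each name with roughly its census frequency.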
Edit:
If you insist on using a mathematical function rather than the actual distribution, the following should provide a power-law function which somehow conveys the "long tail" shape of the real distribution. You may want to tweak the @PwrCoef value (which, BTW, needn't be an integer); essentially, the bigger the coefficient, the more skewed toward the beginning of the list the draws are.
DECLARE @PwrCoef INT
SET @PwrCoef = 2
SELECT 88799 - ROUND(POWER(POWER(88799.0, @PwrCoef) * RAND(), 1.0/@PwrCoef), 0)
Notes:
- the extra ".0" in the formula above is important: it forces SQL to perform floating-point rather than integer arithmetic.
- the reason we subtract the power calculation from 88799 is that the calculation's distribution makes numbers near the end of the scale more likely to be drawn. Since the list of family names is sorted in the opposite order (most likely names first), we need this subtraction.
Assuming a power of, say, 3, the query would then look something like:
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 88799 - ROUND(POWER(POWER(88799.0, 3) * RAND(), 1.0/3), 0)
Which is the query from the question except for the last line.
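To see the skew this formula produces, a quick Python simulation of the same draw (a sketch, not part of the original answer) can estimate how often the top ranks come up:

    import random

    M = 88799   # number of ranked names
    p = 3       # the power coefficient

    def draw_rank():
        # mirrors 88799 - ROUND(POWER(POWER(88799.0, p) * RAND(), 1.0/p), 0)
        return M - round((M ** p * random.random()) ** (1.0 / p))

    ranks = [draw_rank() for _ in range(100000)]
    print(sum(r <= 75 for r in ranks) / len(ranks))  # share of draws at rank 75 or better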
Re-Edit:
In looking at the actual distribution apparent in the Census data, the curve is extremely steep and would require a very big power coefficient, which in turn would cause overflows and/or extreme rounding errors in the naive formula shown above.
A more sensible approach may be to operate in several tiers, i.e. to perform an equal number of draws in each of, say, the three thirds (or four quarters, or ...) of the cumulative distribution; within each of these sublists we would draw using a power-law function, possibly with the same coefficient but with a different range.
For example
Assuming thirds, the list divides as follows:
First third = 425 names, from Smith to Alvarado
Second third = 6,277 names, from to Gainer
Last third = 82,097 names, from Frisby to the end
If we were to need, say, 1,000 names, we'd draw 334 from the top third of the list, 333 from the second third and 333 from the last third.
For each of the thirds we'd use a similar formula, maybe with a bigger power coefficient for the first third (where we are really interested in favoring the earlier names in the list, and also where the relative frequencies are more statistically relevant). The three selection queries could look like the following:
-- Random Drawing of a single Name in top third
-- Power Coef = 12
SELECT [LastName]
FROM [LastNames] as LN
WHERE LN.[Rank]
= 425 - ROUND(POWER(POWER(425.0, 12) * RAND(), 1.0/12), 0)
-- Second third; Power Coef = 7
...
WHERE LN.[Rank]
= (425 + 6277) - ROUND(POWER(POWER(6277.0, 7) * RAND(), 1.0/7), 0)
-- Bottom third; Power Coef = 4
...
WHERE LN.[Rank]
= (425 + 6277 + 82097) - ROUND(POWER(POWER(82097.0, 4) * RAND(), 1.0/4), 0)
Instead of storing the pdf as rank, store the CDF (the sum of all frequencies up to that name, starting from Alderink).
Then modify your select to retrieve the first LN with rank greater than your formula result.
I read the question as "I need to get a stream of names which will mirror the frequency of last names from the 1990 US Census"
I might have read the question a bit differently than the other suggestions, and although an answer has been accepted, and a very thorough answer it is, I will contribute my experience with the Census last names.
I had downloaded the same data from the 1990 census. My goal was to produce a large number of names to be submitted for search testing during performance testing of a medical record app. I inserted the last names and the percentage frequencies into a table. I added a column and filled it with an integer which was the product of "total names required * frequency". The frequency data from the census did not add up to exactly 100%, so my total number of names was also a bit short of the requirement. I was able to correct the number by selecting random names from the list and increasing their count until I had exactly the required number; the randomly added counts never amounted to more than 0.05% of the total of 10 million.
I generated 10 million random numbers in the range of 1 to 88,799. With each random number I would pick that name from the list and decrement the counter for that name. My approach was to simulate dealing a deck of cards, except my deck had many more distinct cards and a varying number of each card.
Do you store the actual frequencies with the ranks?
Converting the algebra from that accepted answer to T-SQL is no bother, if you know what values to use for n. y would be what you currently have, ROUND(88799 * RAND(), 0), and x0, x1 = 1, 88799, I think, though I might misunderstand it. The only non-standard maths operator involved from a T-SQL perspective is ^, which is just POWER(x,y) == x^y.