Most common sequence of characters in a given string

Most common sequence of characters in a given string - sequence

Suppose I am given a string of characters. How to find the most common sequence of characters with a minimum length of l?
Programming language doesn't matter but it should work with a string of 1000+ at an usual Computer.

You have to find all possible sequences and count them. That is,
for (each position in string) {
length = 0;
do {
sequence = (string from position to position + length);
count sequence locations in string;
if (count is higher than max count) {
remember sequence;
update max count;
}
length++;
if (position + length > string.length or length > sequence limit) break;
}
}
It is possible that same sequences will be met in different string places so they will be counted abundantly. This is harmless but takes some extra cycles. A way to avoid that is to store found sequences and don't check those already checked. But memory requirements for long strings and long sequences may become huge.

Related

Time and Space Complexity(for specific algorithm)

Despite the last 30 minutes i spent on trying to understand time and space complexity better, i still can't confidently determine those for the algorithm below:
bool checkSubstr(std::string sub)
{
//6 OR(||) connected if statement(checks whether the parameter
//is among the items in the list)
}
void checkWords(int start,int end)
{
int wordList[2] ={0};
int j = 0;
if (start < 0)
{
start = 0;
}
if (end>cAmount)
{
end = cAmount -1;
}
if (end-start < 2)
{
return;
}
for (int i = start; i <= end-2; i++)
{
if (crystals[i] == 'I' || crystals[i] == 'A')
{
continue;
}
if (checkSubstr(crystals.substr(i,3)))
{
wordList[j] = i;
j++;
}
}
if (j==1)
{
crystals.erase(wordList[0],3);
cAmount -= 3;
checkWords(wordList[0]-2,wordList[0]+1);
}
else if (j==2)
{
crystals.erase(wordList[0],(wordList[1]-wordList[0]+3));
cAmount -= wordList[1]-wordList[0]+3;
checkWords(wordList[0]-2,wordList[0]+1);
}
}
The function basically checks a sub-string of the whole string for predetermined (3 letter, e.g. "SAN") combinations of letters. Sub-string length can be 4-6 no real way to determine, depends on the input(pretty sure it's not relevant, although not 100%).
My reasoning:
If there are n letters in the string, worst case scenario, we have to check each of them. Again depending on the input, this can be done 3 ways.
All 6 length sub-strings: If this is the case the function runs n/6 times, each running 8(or 10?) processes, which(i think) means that its time complexity is O(n).
All 4 length sub-strings: Pretty much the same reason above, O(n).
4 and 6 length sub-strings mixed: Can't see why this would be different than previous 2. O(n)
As for the space complexity, i am completely lost. However, i have an idea:
If the function recurs for maximum amount of time,it will require:
n/4 x The Amount Used In One Run
which made me think it should be O(n). Although, i'm not convinced this is correct. I thought maybe seeing someone else's thought process on this example would help me understand how to calculate time and space complexity better.
Thank you for your time.
EDIT: Let me provide clearer information. We read a combination of 6 different letters into a string, this can be (almost)any combination in any length. 'crystals' is the string, and we are looking for 6 different 3 letter combinations in that list of letters. Sort of like a jewel matching game. Now the starting list contains no matches(none of the 6 predetermined combinations exist in the first place). Therefore the only way matches can occur from then on is by swaps or matches disappearing. Once a swap is processed by top level code, the function is called to check for matches, and if a match is found the function recurs after deleting the "match" part of the string.
Now let's look at how the code is looking for a match. To demonstrate a swap of 2 letters:
ABA B-R ZIB(no spaces or '-' in the actual string, used for better demonstration),
B and R is being swapped. This swap only effects the 6 letters starting from 2nd letter and ending on 7th letter. In other words, the letters the first A and last B can form a match with are same, before and after the swap, thus no point checking for matches including those words. So a sub-string of 6 letters sent to the checking algorithm. Similarly, if a formed match disappears(gets deleted from the string) the range of effected letters is 4. So when i thought of a worst case scenario, i imagined either 1 swap creating a whole chain reaction and matching all the way till there are not enough letters to form a match, or each match happens with a swap. Again, i am not saying this is how we should think when calculating time and space complexity but this is how the code works. Hope this is clear enough if not let me know and i can provide more details. It's also important to note that swap amount and places are a part of the input we read.
EDIT: Here is how the function is called on top level for the first time:
checkWords(swaps[i]-2,swaps[i]+3);

Sub-string length can be 4-6 no real way to determine, depends on the
input (pretty sure it's not relevant, although not 100%).
That's not what the code shows; the line if (checkSubstr(crystals.substr(i,3))) conveys that substrings always have exactly 3 characters. If the substring length varies, it is relevant, since your naive substring match will degrade to O(N*M) in the general case, where N is start-end+1 (the size of the input string) and M is the size of the substring being searched. This happens because in the worst case you'll compare M characters for each of the N characters of the source string.
The rest of this answer assumes that substrings are of size 3, since that's what the code shows.
If substrings are always 3 characters long, it's different: you can essentially assume checkSubstr() is O(1) because you will always compare at most 3 characters. The bulk of the work happens inside the for loop, which is O(N), where N is end-1-start.
After the loop, in the worst case (when one of the ifs is entered), you erase a bunch of characters from crystal. Assuming this is a string backed by an array in memory, this is an O(cAmount) operation, because all elements after wordList[0] must be shifted. The recursive call always passes in a range of size 4; it does not grow nor shrink with the size of the input, so you can also say there are O(1) recursive calls.
Thus, time complexity is O(N+cAmount) (where N is end-1-start), and space complexity is O(1).

How to read 8-byte integers in GMS 2.x?

I need to read 8-byte integers from a stream. I could not find any documentation how to read 8-byte integers in DM. It would be something similar to a long long integer.
Is there a trick how to stream 8-byte integers from file in GMS 2.x ?

We can use the "Stream" object to read/import data of various kinds. Please refer to the DM Help > Scripting > File Input and Output:
Other examples can also be found at DM-Script-Database :
Read-Ser (http://donation.tugraz.at/dm/source_codes/127)
JEMS_.ems file reader (http://donation.tugraz.at/dm/source_codes/108)
Hope this helps.

I used the following (stupid) method to do so:
number readint32(object s){
number stream_byte_order=2
number result=0
TagGroup tg = NewTagGroup();
tg.TagGroupSetTagAsLong( "SInt32_0", 0 )
TagGroupReadTagDataFromStream( tg, "SInt32_0", s, stream_byte_order );
tg.TagGroupGetTagAsLong( "SInt32_0", result)
return result
}
number readint64(object s){
//new for reading 8-byte integer in TIA ver >3.7
//DM automatic convert result to float when the second 4-byte >1
number result = readint32(s)+ (readint32(s)*4294967296)
// 4294967296 equals to 0xFFFFFFFF in hex form
return result
}
It works with reading ser <2GB, but does not for larger file. I still did not figure it out...
#09-04-2016
Now i got a solution to the data offset problem in ser:
Here is the solution:
Void b_readint64(object s, number &lo, number &hi){
//new for reading 8-byte (64bit) integer in TIA ver >3.7
//read the low and high section individually and later work
//together with StreamSetPos32singed, StreamSetPos64 funcsions
lo = b_readint32(s)
hi = b_readint32(s)
}
Void StreamSetPos32Signed(object s, number base, number lo){
if (lo>0) StreamSetPos(s, base, lo)
else StreamSetPos(s, base, 4294967296+lo)
}
Void StreamSetPos64(object s, number base, number lo, number hi){
if (hi!=0){
StreamSetPos(s, base, 0)
for (number i=0; i<hi; i++) StreamSetPos(s, 1, 4294967296)
StreamSetPos32Signed(s, 1, lo)
} else StreamSetPos32signed(s, base, lo)
}
BTW, I just uploaded this upgraded script to
http://portal.tugraz.at/portal/page/portal/felmi/DM-Script/DM-Script-Database

There is nothing like an 8-byte integer in DigitalMicrograph. You can use the streaming to read in two successive 4-byte sections as integers (See answer above) and then display them as binary using binary() or hexadecimal using hex(), but you will have to do the maths yourself for the "meaning" of the 8-byte integer (storing it as real-number). You can use the binary operators & | ^ for bitwise numeric, when needed.

Dealing with Int64 value with Booksleeve

I have a question about Marc Gravell's Booksleeve library.
I tried to understand how booksleeve deal the Int64 value (i have billion long value in Redis actually)
I used reflection to undestand the Set long value overrides.
// BookSleeve.RedisMessage
protected static void WriteUnified(Stream stream, long value)
{
if (value >= 0L && value <= 99L)
{
int i = (int)value;
if (i <= 9)
{
stream.Write(RedisMessage.oneByteIntegerPrefix, 0, RedisMessage.oneByteIntegerPrefix.Length);
stream.WriteByte((byte)(48 + i));
}
else
{
stream.Write(RedisMessage.twoByteIntegerPrefix, 0, RedisMessage.twoByteIntegerPrefix.Length);
stream.WriteByte((byte)(48 + i / 10));
stream.WriteByte((byte)(48 + i % 10));
}
}
else
{
byte[] bytes = Encoding.ASCII.GetBytes(value.ToString());
stream.WriteByte(36);
RedisMessage.WriteRaw(stream, (long)bytes.Length);
stream.Write(bytes, 0, bytes.Length);
}
stream.Write(RedisMessage.Crlf, 0, 2);
}
I don't understand why, with more than two digits int64, the long is encoding in ascii?
Why don't use byte[] ? I know than i can use byte[] overrides to do this, but i just want to understand this implementation to optimize mine. There may be a relationship with the Redis storage.
By advance thank you Marc :)
P.S : i'm still very enthusiastic about your next major version, than i can use long value key instead of string.

It writes it in ASCII because that is what the redis protocol demands.
If you look carefully, it is always encoded as ASCII - but for the most common cases (0-9, 10-99) I've special-cased it, as these are very simple results:
x => $1\r\nX\r\n
xy => $2\r\nXY\r\n
where x and y are the first two digits of a number in the range 0-99, and X and Y are those digits (as numbers) offset by 48 ('0') - so decimal 17 becomes the byte sequence (in hex):
24-32-0D-0A-31-37-0D-0A
Of course, that can also be achieved simply via the writing each digit sequentially and offsetting the digit value by 48 ('0'), and handling the negative sign - I guess the answer there is simply "because I coded it the simple but obviously correct way". Consider the value -123 - which is encoded as $4\r\n-123\r\n (hey, don't look at me - I didn't design the protocol). It is slightly awkward because it needs to calculate the buffer length first, then write that buffer length, then write the value - remembering to write in the order 100s, 10s, 1s (which is much harder than writing the other way around).
Perfectly willing to revisit it - simply: it works.
Of course, it becomes trivial if you have a scratch buffer available - you just write it in the simple order, then reverse the portion of the scratch buffer. I'll check to see if one is available (and if not, it wouldn't be unreasonable to add one).
I should also clarify: there is also the integer type, which would encode -123 as :-123\r\n - however, from memory there are a lot of places this simply does not work.

Objective-C - Get number of decimals of a double variable

Is there a nice way to get the number of decimals of a double variable in Objective-C?
I am struggling for a while to find a way but with no success.
For example 231.44232000 should return 5.
Thanks in advance.

You could, in a loop, multiply by 10 until the fractional part (returned by modf()) is really close to zero. The number of iterations'll be the answer you're after. Something like:
int countDigits(double num) {
int rv = 0;
const double insignificantDigit = 8;
double intpart, fracpart;
fracpart = modf(num, &intpart);
while ((fabs(fracpart) > 0.000000001f) && (rv < insignificantDigit)) {
num *= 10;
rv++;
fracpart = modf(num, &intpart);
}
return rv;
}

Is there a nice way to get the number of decimals of a double variable in Objective-C?
No. For starters, a double stores a number in binary, so there may not even be an exact binary representation that corresponds to your decimal number. There's also no consideration for the number of significant decimal digits -- if that's important, you'll need to track it separately.
You might want to look into using NSDecimalNumber if you need to store an exact representation of a decimal number. You could create your own subclass and add the ability to store and track significant digits.

What are the practical uses of modulus (%) in programming? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Recognizing when to use the mod operator
What are the practical uses of modulus? I know what modulo division is. The first scenario which comes to my mind is to use it to find odd and even numbers, and clock arithmetic. But where else I could use it?

The most common use I've found is for "wrapping round" your array indices.
For example, if you just want to cycle through an array repeatedly, you could use:
int a[10];
for (int i = 0; true; i = (i + 1) % 10)
{
// ... use a[i] ...
}
The modulo ensures that i stays in the [0, 10) range.

I usually use them in tight loops, when I have to do something every X loops as opposed to on every iteration..
Example:
int i;
for (i = 1; i <= 1000000; i++)
{
do_something(i);
if (i % 1000 == 0)
printf("%d processed\n", i);
}

One use for the modulus operation is when making a hash table. It's used to convert the value out of the hash function into an index into the array. (If the hash table size is a power of two, the modulus could be done with a bit-mask, but it's still a modulus operation.)

To print a number as string, you need the modulus to find the value of a digit.
string number_to_string(uint number) {
string result = "";
while (number != 0) {
result = cast(char)((number % 10) + '0') ~ result;
// ^^^^^^^^^^^
number /= 10;
}
return result;
}

For the control number of international bank account numbers, the mod97 technique.
Also in large batches to do something after n iterations. Here is an example for NHibernate:
ISession session = sessionFactory.openSession();
ITransaction tx = session.BeginTransaction();
for ( int i=0; i<100000; i++ ) {
Customer customer = new Customer(.....);
session.Save(customer);
if ( i % 20 == 0 ) { //20, same as the ADO batch size
//Flush a batch of inserts and release memory:
session.Flush();
session.Clear();
}
}
tx.Commit();
session.Close();

The usual implementation of buffered communications uses circular buffers, and you manage them with modulus arithmetic.

For languages that don't have bitwise operators, modulus can be used to get the lowest n bits of a number. For example, to get the lowest 8 bits of x:
x % 256
which is equivalent to:
x & 255

Cryptography. That alone would account for an obscene percentage of modulus (I exaggerate, but you get the point).
Try the Wikipedia page too:
Modular arithmetic is referenced in number theory, group theory, ring theory, knot theory, abstract algebra, cryptography, computer science, chemistry and the visual and musical arts.
In my experience, any sufficiently advanced algorithm is probably going to touch on one more of the above topics.

Well, there are many perspectives you can look at it. If you are looking at it as a mathematical operation then it's just a modulo division. Even we don't need this as whatever % do, we can achieve using subtraction as well, but every programming language implement it in very optimized way.
And modulu division is not limited to finding odd and even numbers or clock arithmetic. There are hundreds of algorithms which need this module operation, for example, cryptography algorithms, etc. So it's a general mathematical operation like other +, -, *, /, etc.
Except the mathematical perspective, different languages use this symbol for defining built-in data structures, like in Perl %hash is used to show that the programmer declared a hash. So it all varies based on the programing language design.
So still there are a lot of other perspectives which one can do add to the list of use of %.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Most common sequence of characters in a given string - sequence

Suppose I am given a string of characters. How to find the most common sequence of characters with a minimum length of l? Programming language doesn't matter but it should work with a string of 1000+ at an usual Computer.

Related

Time and Space Complexity(for specific algorithm)

How to read 8-byte integers in GMS 2.x?

Dealing with Int64 value with Booksleeve

Objective-C - Get number of decimals of a double variable

What are the practical uses of modulus (%) in programming? [duplicate]

Categories

Resources