I have a very long string, I'd like to find all the NaN values and replace it 'Null'. This long string was converted from a 120 x 150000 cell. The reason for converting it into a long string was I was going to convert it into a one giant SQL query as fastinsert and datainsert can can be very slow and sometimes I'm running out of heap space. The idea is to do the following
exec(sqlConnection, long_string)
I tried using the regexpreop to replace NaN with null but it seems very slow. Is there an alternative way.
long_string = regexprep(long_string,'NaN','null');
As Floris mentioned regexp is a very strong command and as a result is slower than other find commands.
In addition to Floris suggestion you can try using strrep which works in your case, since you are not using any of the special powers of regexp.
Here is an example:
str = char('A' + rand(1,120 * 15000)*('z'-'A'));
tic
str2 = strrep(str, 'g', 'null');
disp('strrep: '), toc
tic
str3 = regexprep(str, 'g','null');
disp('regexprep: '), toc
On my computer it will return:
strrep:
Elapsed time is 0.004640 seconds.
regexprep:
Elapsed time is 4.004671 seconds.
regex is very powerful, but can be slow because of its flexibility. If you still have the original cell array - and assuming it contains only strings - the following line of code should work, and very fast:
cellArray{find(ismember(cellArray,'NaN'))} = 'null';
ismember finds all the elements in cellArray that are NaN, returning a boolean array with the same shape as cellArray; the find operation turns these into indices of the elements that are NaN, and then you just assign the value null to all of them.
It must be said that 120 x 150,000 is a VERY large cell array - it will occupy over 2 GB even with just a single character in each cell (I know this because I just created a 120x15000 cell array, and it was 205,500,000 bytes). It might be worth processing this in smaller chunks rather than all at once. Especially if you know that the NaN would occur only in some of the columns, for example.
Processing a GB sized string, especially when you can't operate in-place (you are changing the size of the string with every replacement, and it's getting longer, not shorter) is going to be dead slow. It's possible you could write a short mex function to do this if you really have no other option - that could be pretty fast.
Related
i have a program that is finding for any string SHA 256 hash that is starting with n zeros.
It works like in bitcoin, so i get string for example "FOO", where im finding "nonce" where SHA(FOO+nonce) starts with for example 8 zeros. That nonce is 1231004720 (you can check it BTW)
The program of course works like a charm. I know, that things takes time, but i cant find, if i could calculate estimated time. If i can calculate n=10 in one hour, how much it can take to get 11
leading zeros hash. Somewhere i readed thats its "factor of 16", but IDKif its true and if yes, why ?
Thanks
backstory
I put together a simple multi-threaded brute-force hash hacking program for a job application test requirement.
Here are some of the particulars
It functions properly, but the performance is quite a bit different between my initial version and this altered portion.
factors
The reason for the alteration was due to increased number of possible combinations between the sample data processing and the test/challenge data processing.
The application test sample was 16^7 total combinations.. which is of course less that uint32 (or 16^8).
the challenge is a 9 length hashed string that produces a hashed long value (that I was given); thus it is 16^9. The size difference was something I accounted for, which is why I took the easy route of putting the initial program together targeting the 7 length Hashed string - getting it to function properly on a smaller scale.
overall
The issue isn't just the increased combinations, it is dramatically slower due to the loop operating using long/int64 or uint64..
when I crunched the numbers using int32 (not even uint32) data types.. I could hear my comp kick it up a notch.. The entire check was done in under 4 minutes. that's 16777216 (16^6) combination checks per thread..
noteworthy - multithreading
I broke everything into worker threads.. 16 of them, 1 for each of the beginning characters.. thus I'm only looking for 16^8 combination on each thread now... which is 1 freaking unit higher than uint32 value (includes 0)...
I'll give a final thought after I put up this code segment..
The function is as followed:
Function Propogate() As Boolean
Propogate = False
Dim combination As System.Text.StringBuilder = New System.Text.StringBuilder(Me.ScopedLetters)
For I As Integer = 1 To (Me.ResultLength - Me.ScopedLetters.Length) Step 1
combination.Append(Me.CombinationLetters.Chars(0))
Next
'Benchmarking feedback - This simply adds values to a list to be checked against to denote progress
Dim ProgressPoint As New List(Of Long)
'###############################
'#[THIS IS THE POINT OF INTEREST]
'# Me.CombinationLetters = 16 #
'# Me.ResultLength = 7 Or 9 # The 7 was the sample size provided.. 9 was the challenge/test
'# Me.ScopedLetters.Length = 1 # In my current implementation
'###############################
Dim TotalCombinations As Long = CType(Me.CombinationLetters.Length ^ (Me.ResultLength - Me.ScopedLetters.Length), Long)
ProgressPoint.Add(1)
ProgressPoint.Add(CType(TotalCombinations / 5, Long))
ProgressPoint.Add(CType(TotalCombinations * 2 / 5, Long))
ProgressPoint.Add(CType(TotalCombinations * 3 / 5, Long))
ProgressPoint.Add(CType(TotalCombinations * 4 / 5, Long))
ProgressPoint.Add(CType(TotalCombinations, Long))
For I As Long = 1 To TotalCombinations Step 1
Me.AddKeyHash(combination.ToString) 'The hashing arithmetic and Hash value check is done at this call.
Utility.UpdatePosition(Me.CombinationLetters, combination) 'does all the incremental character swapping and string manipulation..
If ProgressPoint.Contains(I) Then
RaiseEvent OnProgress(CType((I / TotalCombinations) * 100, UInteger).ToString & " - " & Me.Name)
End If
Next
Propogate = True
End Function
I already have an idea of what I could try, drop it down the int32 again and put another loop around this loop (16 iterations)
But there might be better alternative, so I would like to hear from the community on this one.
Would a For Loop using double point precision cycle better?
by the way, how coupled is long types and arithmetic to cpu architecture.. specifically cacheing?
My development comp is old.. Pentium D running XP Professional x64 .. my excuse is that if it runs in my environment, it will likely run on Win Server 2003..
In the end, this could have likely been a hardware issue.. my old workstation did not survive much longer after doing this project.
So, this should be an easy question for anyone who has used FORTH before, but I am a newbie trying to learn how to code this language (and this is a lot different than C++).
Anyways, I'm just trying to create a variable in FORTH called "Height" and I want a user to be able to input a value for "Height" whenever a certain word "setHeight" is called. However, everything I try seems to be failing because I don't know how to set up the variable nor how to grab user input and put it in the variable.
VARIABLE Height 5 ALLOT
: setHeight 5 ACCEPT ATOI CR ;
I hope this is an easy problem to fix, and any help would be greatly appreciated.
Thank you in advance.
Take a look at Rosettacode input/output examples for string or number input in FORTH:
String Input
: INPUT$ ( n -- addr n )
PAD SWAP ACCEPT
PAD SWAP ;
Number Input
: INPUT# ( -- u true | false )
0. 16 INPUT$ DUP >R
>NUMBER NIP NIP
R> <> DUP 0= IF NIP THEN ;
A big point to remember for your self-edification -- C++ is heavily typecasted, Forth is the complete opposite. Do you want Height to be a string, an integer, or a float, and is it signed or unsigned? Each has its own use cases. Whatever you choose, you must interact with the Height variable with the type you choose kept in mind. Think about what your bits mean every time.
Judging by your ATOI call, I assume you want the value of Height as an integer. A 5 byte integer is unusual, though, so I'm still not certain. But here goes with that assumption:
VARIABLE Height 1 CELLS ALLOT
VARIABLE StrBuffer 7 ALLOT
: setHeight ( -- )
StrBuffer 8 ACCEPT
DECIMAL ATOI Height ! ;
The CELLS call makes sure you're creating a variable with the number of bits your CPU prefers. The DECIMAL call makes sure you didn't change to HEX somewhere along the way before your ATOI.
Creating the StrBuffer variable is one of numerous ways to get a scratch space for strings. Assuming your CELL is 16-bit, you will need a maximum of 7 characters for a zero-terminated 16-bit signed integer -- for example, "-32767\0". Some implementations have PAD, which could be used instead of creating your own buffer. Another common word is SCRATCH, but I don't think it works the way we want.
If you stick with creating your own string buffer space, which I personally like because you know exactly how much space you got, then consider creating one large buffer for all your words' string handling needs. For example:
VARIABLE StrBuffer 201 ALLOT
This also keeps you from having to make the 16-bit CELL assumption, as 200 characters easily accommodates a 64-bit signed integer, in case that's your implementation's CELL size now or some day down the road.
I have a system which deals with keys that have been turned into unsigned long integers (by packing short sequences into byte strings). I want to try storing these in Redis, and I want to do it in the best way possible. My concern is mainly memory efficiency.
From playing with the online REPL I notice that the two following are identical
zadd myset 1.0 "123"
zadd myset 1.0 123
This means that even if I know I want to store an integer, it has to be set as a string. I notice from the documentation that keys are just stored as char*s and that commands like SETBIT indicate that Redis is not averse to treating strings as bytestrings in the client. This hints at a slightly more efficient way of storing unsigned longs than as their string representation.
What is the best way to store unsigned longs in sorted sets?
Thanks to Andre for his answer. Here are my findings.
Storing ints directly
Redis keys must be strings. If you want to pass an integer, it has to be some kind of string. For small, well-defined sets of values, Redis will parse the string into an integer, if it is one. My guess is that it will use this int to tailor its hash function (or even statically dimension a hash table based on the value). This works for small values (examples being the default values of 64 entries of a value of up to 512). I will test for larger values during my investigation.
http://redis.io/topics/memory-optimization
Storing as strings
The alternative is squashing the integer so it looks like a string.
It looks like it is possible to use any byte string as a key.
For my application's case it actually didn't make that much difference storing the strings or the integers. I imagine that the structure in Redis undergoes some kind of alignment anyway, so there may be some pre-wasted bytes anyway. The value is hashed in any case.
Using Python for my testing, so I was able to create the values using the struct.pack. long longs weigh in at 8 bytes, which is quite large. Given the distribution of integer values, I discovered that it could actually be advantageous to store the strings, especially when coded in hex.
As redis strings are "Pascal-style":
struct sdshdr {
long len;
long free;
char buf[];
};
and given that we can store anything in there, I did a bit of extra Python to code the type into the shortest possible type:
def do_pack(prefix, number):
"""
Pack the number into the best possible string. With a prefix char.
"""
# char
if number < (1 << 8*1):
return pack("!cB", prefix, number)
# ushort
elif number < (1 << 8*2):
return pack("!cH", prefix, number)
# uint
elif number < (1 << 8*4):
return pack("!cI", prefix, number)
# ulonglong
elif number < (1 << 8*8):
return pack("!cQ", prefix, number)
This appears to make an insignificant saving (or none at all). Probably due to struct padding in Redis. This also drives Python CPU through the roof, making it somewhat unattractive.
The data I was working with was 200000 zsets of consecutive integer => (weight, random integer) × 100, plus some inverted index (based on random data). dbsize yields 1,200,001 keys.
Final memory use of server: 1.28 GB RAM, 1.32 Virtual. Various tweaks made a difference of no more than 10 megabytes either way.
So my conclusion:
Don't bother encoding into fixed-size data types. Just store the integer as a string, in hex if you want. It won't make all that much difference.
References:
http://docs.python.org/library/struct.html
http://redis.io/topics/internals-sds
I'm not sure of this answer, it's more of a suggestion than anything else. I'd have to give it a try and see if it works.
As far as I can tell, Redis only supports UTF-8 strings.
I would suggest grabbing a bit representation of your long integer and pad it accordingly to fill up the nearest byte. Encode each set of 8 bytes to a UTF-8 string (ending up with 8x*utf8_char* string) and store that in Redis. The fact that they're unsigned means that you don't care about that first bit but if you did, you could add a flag to the string.
Upon retrieving the data, you have to remember to pad each character to 8 bytes again as UTF-8 will use less bytes for the representation if the character can be stored with less bytes.
End result is that you store a maximum of 8 x 8 byte characters instead of (possibly) a maximum of 64 x 8 byte characters.
From the help for the Overflow Error in VBA, there's the following examples:
Dim x As Long
x = 2000 * 365 ' gives an error
Dim x As Long
x = CLng(2000) * 365 ' fine
I would have thought that, since the Long data type is supposed to be able to hold 32-bit numbers, that the first example would work fine.
I ask this because I have some code like this:
Dim Price as Long
Price = CLng(AnnualCost * Months / 12)
and this throws an Overflow Error when AnnualCost is 5000 and Months is 12.
What am I missing?
2000 and 365 are Integer values. In VBA, Integers are 16-bit signed types, when you perform arithmetic on 2 integers the arithmetic is carried out in 16-bits. Since the result of multiplying these two numbers exceeds the value that can be represented with 16 bits you get an exception. The second example works because the first number is first converted to a 32-bit type and the arithmetic is then carried out using 32-bit numbers. In your example, the arithmetic is being performed with 16-bit integers and the result is then being converted to long but at that point it is too late, the overflow has already occurred. The solution is to convert one of the operands in the multiplication to long first:
Dim Price as Long
Price = CLng(AnnualCost) * Months / 12
The problem is that the multiplication is happening inside the brackets, before the type conversion. That's why you need to convert at least one of the variables to Long first, before multiplying them.
Presumably you defined the variables as Integer. You might consider using Long instead of Integer, partly because you will have fewer overflow problems, but also because Longs calculate (a little) faster than Integers on 32 bit machines. Longs do take more memory, but in most cases this is not a problem.
In VBA, literals are integer by default (as mentioned). If you need to force a larger datatype on them you can recast them as in the example above or just append a type declaration character. (The list is here: http://support.microsoft.com/kb/191713) The type for Long is "&" so you could just do:
Price = CLng(AnnualCost * Months / 12&)
And the 12 would be recast as a long. However it is generally good practice to avoid literals and use constants. In which case you can type the constant in it's declaration.
Const lngMonths12_c as Long = 12
Price = CLng(AnnualCost * Months / lngMonths12_c)