Is there a way to process a loop in all cores of the processor? - python-multiprocessing

I'm trying to understand whether it's possible to split a loop across multiple processes so that it occupies all 8 cores of my processor. For example, I wrote this code to find the divisors of the so-called anti-primes. It takes ages to finish when I feed it 10-digit numbers, as it only uses around 15% of the CPU.
n = int(input('Numbers to check divisors for: '))
i = n
c = 1
while i > 1:
    if n % i == 0:
        print(f'{c}: {i} is a divisor of {n}')
        c += 1
    i -= 1
print(f'{n} has {c-1} divisors excluding 1 and {n} itself')
I tried almost all the tips from other posts, but they don't seem to work for me. Multithreading ran, but it didn't bring the execution time down. I'm super new to Python, so maybe I'm looking for a solution that doesn't exist? TIA for all inputs.
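One way to use all the cores is to split the range of candidate divisors into chunks and hand each chunk to a worker process via `multiprocessing.Pool`. A minimal sketch (the function names and the chunking scheme are my own illustration, not from any particular post):

```python
# Sketch only: parallel trial division with multiprocessing.Pool.
# Counts divisors of n in 2..n, matching the original loop's range.
from multiprocessing import Pool
import os

def count_divisors_in_range(args):
    """Count divisors of n in [lo, hi) -- runs in a worker process."""
    n, lo, hi = args
    return sum(1 for i in range(lo, hi) if n % i == 0)

def count_divisors(n, workers=None):
    """Split 2..n into one chunk per worker and sum the partial counts."""
    workers = workers or os.cpu_count()
    step = (n - 1) // workers + 1  # chunk size so the chunks cover 2..n
    chunks = [(n, lo, min(lo + step, n + 1)) for lo in range(2, n + 1, step)]
    with Pool(workers) as pool:
        return sum(pool.map(count_divisors_in_range, chunks))

if __name__ == '__main__':
    print(count_divisors(36))  # -> 8 (divisors of 36 between 2 and 36)
```

Note that trial division over the whole range is still O(n) work in total, just spread over the cores; checking candidates only up to sqrt(n) and deriving the complementary divisors would likely speed this up far more than parallelism alone.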

Related

Randomly increasing sequence- Wolfram Mathematica

Good afternoon, I have a problem making a recurrence table with a randomly increasing sequence. I want it to return an increasing sequence with a random difference between consecutive elements. Right now I've got:
RecurrenceTable[{a[k+1]==a[k] + RandomInteger[{0,4}], a[1]==-12},a,{k,1,5}]
But it returns an arithmetic progression with the same chosen d for every k (e.g. {-12,-8,-4,0,4,8,12,16,20,24}).
Also, I would be really grateful for an explanation of why, if I replace every k in my code with n, I get:
RecurrenceTable[{4+a[n] == a[n],a[1] == -12},a,{n,1,10}]
Thank you very much for your time!
I don't believe that RecurrenceTable is what you are looking for: the equations are evaluated before the iteration starts, so RandomInteger[{0,4}] produces a single value that is then reused as the common difference for every step.
Try this instead
FoldList[Plus,-12,RandomInteger[{0,4},5]]
which returns a different result each time it is evaluated, e.g.
{-12,-8,-7,-3,1,2}
on one run and
{-12,-9,-5,-3,0,1}
on another.

What is a*<a[j] in pseudocode?

I was trying the time complexity MCQ questions given on CodeChef under practice for Data Structures and Algorithms. One of the questions had a line a* < a[i]. What does that line mean?
I know that if there weren't an and condition, the complexity would have been O(n^2). But the a* < is completely alien to me. I searched the internet, but all I got was about the A* algorithm and asterisks! I tried running the program in Python with a print statement, but it says that * is invalid. Does it mean something like a pointer to the first element of the array?
Find the time complexity of the following function:
n = len(a)
j = 0
for i = 0 to n-1:
    while (j < n and a* < a[j]):
        j += 1
The answer is given as O(n), but there are nested loops, so it should be O(n^2). Help required! Thanks
It doesn't actually matter what a* means. The question is to determine the time complexity of the algorithm. Notice that although there are two nested loops, the inner while loop isn't a full independent loop. Its index is j, which starts at 0, is only ever incremented, and is bounded above by n. So the inner loop body can run at most n times in total, across all iterations of the outer loop. This means that the overall complexity is only O(n).
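The amortized argument can be checked empirically by counting inner-loop iterations. A small sketch (it assumes a* was meant to be a[i], which is only a guess, but any comparison gives the same bound):

```python
# Sketch: count how often the inner while body runs. Since j never
# resets and is capped at n, the total is at most n for any input.
def count_inner_steps(a):
    n = len(a)
    j = 0
    steps = 0
    for i in range(n):
        while j < n and a[i] < a[j]:  # assuming a* meant a[i] (a guess)
            j += 1
            steps += 1
    return steps

# For every input, steps <= len(a), regardless of the outer loop:
print(count_inner_steps([5, 4, 3, 2, 1]))
```

This is the same amortized pattern behind two-pointer algorithms: the work is bounded by the total distance a pointer can travel, not by the product of the loop bounds.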

number of instruments GMM estimator in R

I have a question, maybe a very simple one. In Stata, a dynamic panel data model (GMM estimator) reports a "number of instruments". In R, you get the AR tests and the Sargan test, but the number of instruments is not displayed. How can I obtain the number of instruments in R?
Thank you for helping
If you used all the 99 lags available for the instrumental variable, the number of instruments (for each instrumental variable) will be:
(0.5 × (t-1) × (t-2)) + the number of time dummies you used
(t is the time span of your data).
If you used fewer than all the available lags, I don't know how to calculate the number of instruments. If someone knows, please tell me!
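The formula above is simple enough to tabulate directly. A tiny sketch (the function name is mine; it only encodes the formula from this answer, which applies when all lags are used):

```python
# Sketch: instrument count per instrumental variable when all lags
# are used, per the formula (0.5 * (t-1) * (t-2)) + time dummies.
def n_instruments(t, n_time_dummies=0):
    """t: time span of the panel; n_time_dummies: time dummies included."""
    return 0.5 * (t - 1) * (t - 2) + n_time_dummies

print(n_instruments(10))     # -> 36.0 (t = 10, no time dummies)
print(n_instruments(5, 3))   # -> 9.0  (t = 5, three time dummies)
```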

bartMachine (Bayesian Additive Regression Tree) in >1 million cases in R

I want to use BART via the bartMachine package for a dataframe of just over 1 million cases. With a lot of tuning of the Java memory settings, I can get R on my MacBook to complete the BART model for about 5,000 cases. Anything above that aborts as the system runs out of memory.
Is there any chance I can use bartMachine() with an input matrix of about 1 million rows (roughly 15 predictors)?
Otherwise, are there any alternative packages that would allow me to at least run predictor selection using BART?
Thanks for your help!
Have you tried increasing the Java heap size?
options(java.parameters = "-Xmx12g")  # must be set before the package is loaded
I'm the maintainer of this package. Have you tried "mem_cache_for_speed = FALSE" as an option?

Advice for bit level manipulation

I'm currently working on a project that involves a lot of bit-level manipulation of data, such as comparison, masking, and shifting. Essentially I need to search chunks of bitstreams between 8 KB and 32 KB long for bit patterns between 20 and 40 bytes long.
Does anyone know of general resources for optimizing for such operations in CUDA?
There have been at least a couple of questions on SO about how to do text searches with CUDA, i.e. finding instances of short byte-strings in long byte-strings. That is similar to what you want to do: a byte-string search is just a bit-string search where the number of bits in the pattern is a multiple of 8 and the algorithm only checks for matches every 8 bits. Search SO for CUDA string searching or matching and see if you can find them.
I don't know of any general resources for this, but I would try something like the following.
Start by preparing 8 versions of each search bit-string, each shifted by a different number of bits. Also prepare start and end masks:
start
01111111
00111111
...
00000001
end
10000000
11000000
...
11111110
Then, essentially, perform byte-string searches with the different bit-strings and masks.
If you're using a device with compute capability >= 2.0, store the shifted bit-strings in global memory. The start and end masks can probably just be constants in your program.
Then, for each byte position, launch 8 threads, each of which checks a different one of the 8 shifted bit-strings against the long bit-string (which you now treat as a byte-string). In each block, launch enough threads to check, for instance, 32 byte positions, so that the total number of threads per block becomes 32 * 8 = 256. The L1 cache should be able to hold the shifted bit-strings for each block, giving good performance.
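The shifted-pattern preparation is easy to get wrong at the boundaries, so here is a sequential sketch of the idea in plain Python (not CUDA; the function names are mine). It builds the 8 shifted copies plus start/end masks and then scans byte-aligned windows, which is exactly the per-thread check described above:

```python
# Sketch (plain Python, not CUDA) of the shifted-pattern search:
# 8 bit-shifted copies of the pattern, masked at both ends.
def shifted_patterns(pattern_bits, nbits):
    """pattern_bits: int holding the pattern; nbits: its length in bits.
    Returns (pattern_bytes, start_mask, end_mask, nbytes) for shifts 0..7."""
    versions = []
    for s in range(8):
        total = s + nbits                       # bits spanned incl. leading pad
        nbytes = (total + 7) // 8
        shifted = pattern_bits << (nbytes * 8 - total)  # left-align in nbytes
        start_mask = 0xFF >> s                  # ignore the s leading pad bits
        end_mask = (0xFF << (nbytes * 8 - total)) & 0xFF  # ignore trailing pad
        versions.append((shifted.to_bytes(nbytes, "big"),
                         start_mask, end_mask, nbytes))
    return versions

def find_bit_pattern(haystack, pattern_bits, nbits):
    """Return the bit offset of the first match in haystack, or -1."""
    versions = shifted_patterns(pattern_bits, nbits)
    for pos in range(len(haystack)):
        for s, (pat, m0, m1, nb) in enumerate(versions):
            if pos + nb > len(haystack):
                continue
            win = haystack[pos:pos + nb]
            if nb == 1:                          # start and end in same byte
                if ((win[0] ^ pat[0]) & m0 & m1) == 0:
                    return pos * 8 + s
                continue
            if (win[0] ^ pat[0]) & m0:           # masked first byte
                continue
            if (win[-1] ^ pat[-1]) & m1:         # masked last byte
                continue
            if win[1:-1] == pat[1:-1]:           # full-byte middle compare
                return pos * 8 + s
    return -1

# Pattern 111111 starts at bit offset 4 of the first byte:
print(find_bit_pattern(bytes([0b00001111, 0b11110000]), 0b111111, 6))  # -> 4
```

In the CUDA version, the inner loop over the 8 shifts is what gets unrolled across the 8 threads per byte position, with the middle bytes compared word-at-a-time rather than via slicing.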