What is the most efficient way to create a categorical variable based on another continuous variable in Python pandas?

I have a continuous variable A (say, earnings) in my dataframe. I want to make a categorical variable B off that. Specifically, I'd like to define the second variable as going up in increments of 500 until a certain limit. For instance,
B = 1  if A < 500
    2  if A >= 500 & A < 1000
    3  if A >= 1000 & A < 1500
    ...
    11 if A > 5000
What is the most efficient way to do this in pandas? In Stata, where I mostly program, I would either use replace with if conditions (tedious) or loop if I have many categories. I want to break out of Stata thinking when using pandas, but sometimes my imagination is limited.
Thanks in advance

If the intervals are regular and the values are positive as they seem to be in the example, you can get the integer part of the values divided by the length of the interval. Something like
df['category'] = (df.A / step_size).astype(int)
Note that if there are negative values you can run into problems, e.g. anything between -500 and 500 comes out as 0, because astype(int) truncates towards zero. But you can get around this by adding some base value before dividing. You are effectively defining your categories as multiples of the step size from some base value, which happens to be zero above.
Something like
df['category'] = ((df.A + base) / step_size).astype(int)
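For the specific coding in the question (categories 1 through 11 in steps of 500, capped at the top), a rough sketch could look like the following; the column name and example data here are made up:
import numpy as np
import pandas as pd

# hypothetical example data: earnings between 0 and 6000
df = pd.DataFrame({'A': np.random.rand(10) * 6000})

step_size = 500
# floor-divide by the step, shift to start at 1, and cap at 11,
# mirroring B = 1 for A < 500 up to B = 11 for the top bracket
df['B'] = (df['A'] // step_size).astype(int).add(1).clip(upper=11)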
Here's another approach for intervals which aren't regularly spaced:
import numpy as np

lims = np.arange(500, 5500, 500)
df['category'] = 0
for lim in lims:
    # each comparison adds 1 for every limit the value exceeds
    df.category += df.A > lim
This method is good when you have a relatively small number of limits but slows down for many, obviously.
Here's some benchmarking for the various methods:
a = np.random.rand(100000) * 6000

%timeit pd.cut(a, 11)
100 loops, best of 3: 6.47 ms per loop

%timeit (a / 500).astype(int)
1000 loops, best of 3: 1.12 ms per loop
%%timeit
x = 0
for lim in lims:
    x += a > lim

100 loops, best of 3: 3.84 ms per loop
I put pd.cut in there as well, as per John E's suggestion. As he pointed out, it yields categorical variables rather than integers, which have different uses. There are pros and cons to both approaches, and the best method will depend on the scenario.
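For reference, here is a sketch of how pd.cut could reproduce the 1-to-11 coding from the question with explicit bin edges; the edges and labels below are assumptions based on the example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.rand(10) * 6000})  # hypothetical data again

# 12 edges give 11 left-closed bins: [-inf, 500), [500, 1000), ..., [5000, inf)
bins = [-np.inf] + list(range(500, 5001, 500)) + [np.inf]
df['B'] = pd.cut(df['A'], bins=bins, labels=list(range(1, 12)), right=False)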

Related

Ranking Big O Functions By Complexity

I am trying to rank these functions: 2^n, n^100, (n + 1)^2, n·lg(n), 100n, n!, lg(n), and n^99 + n^98, so that each function is the big-O of the next function, but I do not know a method of determining if one function is the big-O of another. I'd really appreciate it if someone could explain how I would go about doing this.
Assuming you have some programming background, say you have the code below:
void SomeMethod(int x)
{
    for (int i = 0; i < x; i++)
    {
        // Do Some Work
    }
}
Notice that the loop runs for x iterations. Generalizing, we say that you will get the solution after N iterations (where N is the value of x, e.g. the number of items in the array/input).
This type of implementation/algorithm is said to have a time complexity of order N, written as O(n).
Similarly, a nested for (2 loops) is O(n-squared), i.e. O(n^2).
If you make binary decisions, reducing the possibilities by half and picking only one half for the solution, then the complexity is O(log n).
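To make the loop cases concrete, here is a minimal Python sketch (the function names are made up for illustration):
def linear(items):       # O(n): one pass over the input
    total = 0
    for x in items:
        total += x
    return total

def quadratic(items):    # O(n^2): a loop nested inside a loop
    pairs = 0
    for x in items:
        for y in items:
            pairs += 1
    return pairs

def halving(n):          # O(log n): the problem size halves each step
    steps = 0
    while n > 1:
        n //= 2
        steps += 1
    return steps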
Found this link to be interesting.
For Himanshu:
While the link explains very well how the log-base-2 N complexity comes into the picture, let me put the same in my own words.
Suppose you have a pre-sorted list like:
1,2,3,4,5,6,7,8,9,10
Now, you have been asked to find whether 10 exists in the list. The first solution that comes to mind is to loop through the list and find it, which means O(n). Can it be made better?
Approach 1:
We know that the list is already sorted in ascending order, so:
Break the list at the center (say at 5).
Compare the value at the center (5) with the search value (10).
If center value == search value => item found.
If center value < search value => repeat the above steps for the right half of the list.
If center value > search value => repeat the above steps for the left half of the list.
For this simple example we will find 10 after 3 or 4 breaks (at 5, then 8, then 9), depending on how you implement it.
That means for N = 10 items the search took 3 (or 4) lookups. Putting some mathematics on this:
2^3 + 2 = 10; for simplicity's sake let's say
2^3 = 10 (nearly equal, just to keep the base-2 logarithm simple)
This can be rewritten as:
log2(10) = 3 (again, nearly)
We know 10 was the number of items and 3 was the number of breaks/lookups we had to do to find the item. It becomes:
log N = K
That is the complexity of the algorithm above: O(log N).
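A minimal Python sketch of the binary search described above (assuming a pre-sorted list):
def binary_search(sorted_items, target):
    # each iteration halves the remaining range, so the number of
    # iterations is about log2(len(sorted_items)), i.e. O(log N)
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid          # item found
        elif sorted_items[mid] < target:
            lo = mid + 1        # search the right half
        else:
            hi = mid - 1        # search the left half
    return -1                   # not found

binary_search([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 10)   # -> 9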
Generally, when a loop is nested, we multiply the values: O(outer loop max value * inner loop max value), and so on. E.g. for (i to n) { for (j to k) { } }: for i = 1, j runs from 1 to k (that is, 1 * k); for i = 2, j again runs from 1 to k; so O(max(i) * max(j)) gives O(n * k). Further, if you want to find the order, recall how the basic operations compare, e.g. O(n + n) (addition) < O(n * n) (multiplication); taking a logarithm shrinks the value inside, so O(log n) < O(n) < O(n + n) (addition) < O(n * n) (multiplication), and so on. In this way you can rank the other functions as well.
A better approach is to first generalise the expression used to calculate the time complexity. For example, n! = n * (n-1) * (n-2) * ... * (n - (n-1)), so something like O(n^k) can serve as a generalised form of the worst-case complexity; this way you can compare, e.g. if k = 2 then O(n^k) = O(n * n).

numpy: determine mean diff between elements

Is there any ready-made function to determine the average of the differences between the elements of a sorted list?
For example, here is my manual try:
import numpy as np
rand_A = np.random.random_integers(0, 99, 10)
np.sort(rand_A)
array([ 3, 8, 26, 34, 35, 37, 65, 82, 89, 94])
def mean_period(data):
    diffe = 0
    for ind in range(data.shape[0] - 1):
        diffe += data[ind + 1] - data[ind]
    return diffe / (data.shape[0] - 1)
mean_period(np.sort(rand_A))
10
Basically I need this function to determine the frequency of a sine-like signal, which will be used as the initial-guess parameter for SciPy's leastsq function to fit it.
I need the fastest routine; I'm afraid my attempt will be a big load.
Let's see. If I understood your question correctly, we are talking about zero-crossings in a frequency detector. You have the timestamps of the zero-crossings in a list (which is sorted by necessity) and want to calculate the average difference between consecutive items in the list.
While unutbu's answer is correct and very NumPy-ish, I would like to suggest a brief look at the maths. The average of the differences of consecutive elements is:
{ (s_1 - s_0) + (s_2 - s_1) + (s_3 - s_2) + ... + (s_n - s_(n-1)) } / n
Quite a lot of the terms cancel out. What is left is:
(s_n - s_0) / n
So, the function above becomes:
def mean_period(data):
    return 1. * (data[-1] - data[0]) / (len(data) - 1)
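A quick sanity check (with made-up data) that this matches the average of consecutive differences:
import numpy as np

s = np.sort(np.random.randint(0, 100, 10))
print(np.diff(s).mean())                  # average of consecutive differences
print((s[-1] - s[0]) / (len(s) - 1.0))    # telescoped form, same value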
If we do some benchmarks with sorted data, then:
rand_A = np.random.randint(0,99999999,10000000)
sort_A = np.sort(rand_A)
% timeit np.diff(sort_A).mean() # 37.7 ms
% timeit mean_period(sort_A) # 0.98 ms
(The latter one is essentially O(1); the remaining cost is mostly the slight function call overhead.)
If the data is not sorted, then we will have to find the largest and smallest values:
def mean_period_unsorted(data):
    smallest = np.min(data)
    largest = np.max(data)
    return 1. * (largest - smallest) / (len(data) - 1)
So maybe this time a bit of maths helps :)
And now the benchmarks
% timeit np.diff(np.sort(rand_A)).mean() # 733 ms
% timeit mean_period_unsorted(rand_A) # 17.9 ms
np.diff(np.sort(rand_A)).mean()
is almost equivalent to mean_period(np.sort(rand_A)), but should be faster since it uses NumPy method calls instead of a Python loop.
I say "almost equivalent" because there is one difference: mean_period always returns an int, since diffe is a numpy.int32 and the return value is the result of dividing this int32 by an int, (data.shape[0]-1).
In contrast, np.diff(np.sort(rand_A)).mean() returns a Numpy float64.
Edit: For small arrays (such as the one you posted in your question), the Python loop is faster:
In [84]: %timeit mean_period(np.sort(rand_A))
100000 loops, best of 3: 8.29 µs per loop
In [85]: %timeit np.diff(np.sort(rand_A)).mean()
10000 loops, best of 3: 21.5 µs per loop
but for large arrays, such as a million-element array,
rand_A = np.random.random_integers(0, 99, 10**6)
using NumPy's mean and diff methods is much faster:
In [87]: %timeit mean_period(np.sort(rand_A))
1 loops, best of 3: 442 ms per loop
In [88]: %timeit np.diff(np.sort(rand_A)).mean()
10 loops, best of 3: 48.8 ms per loop
See also:
numpy.diff
numpy.mean

Fast way of multiplying two 1-D arrays

I have the following data:
A = [a0 a1 a2 a3 a4 a5 .... a24]
B = [b0 b1 b2 b3 b4 b5 .... b24]
which I then want to multiply as follows:
C = A * B' = [a0b0 a1b1 a2b2 ... a24b24]
This clearly involves 25 multiplies.
However, in my scenario, only 5 new values are shifted into A per "loop iteration" (and 5 old values are shifted out of A). Is there any fast way to exploit the fact that data is shifting through A rather than being completely new? Ideally I want to minimize the number of multiplication operations (at a cost of perhaps more additions/subtractions/accumulations). I initially thought a systolic array might help, but it doesn't (I think!?)
Update 1: Note B is fixed for long periods, but can be reprogrammed.
Update 2: the shifting of A is like the following: a[24] <= a[19], a[23] <= a[18]... a[1] <= new01, a[0] <= new00. And so on so forth each clock cycle
Many thanks!
Is there any fast way to exploit the fact that data is shifting through A rather than being completely new?
Even though all you're doing is shifting A and adding new elements, the products in C will, in general, all be different, because each shifted value is now paired with a different element of B. If you have additional information about the way the elements of A or B are structured, you could potentially use that structure to reduce the number of multiplications. Barring any such structural considerations, you will have to compute all 25 products each loop.
Ideally I want to minimize the number of multiplication operations (at a cost of perhaps more additions/subtractions/accumulations).
In theory, you can reduce the number of multiplications to 0 by shifting and adding the array elements to simulate multiplication. In practice, this will be slower than a hardware multiplication so you're better off just using any available hardware-based multiplication unless there's some additional, relevant constraint you haven't mentioned.
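For illustration, here is a small NumPy sketch of the straightforward per-iteration update (the names and the shift-register model are assumptions based on the description above); it shows why all 25 products are recomputed when B has no special structure:
import numpy as np

rng = np.random.default_rng(0)
B = rng.random(25)       # fixed coefficient vector, reprogrammed rarely
A = np.zeros(25)         # shift register holding the last 25 samples

def step(A, new5):
    # shift A by 5 and insert the 5 new samples at the front,
    # mirroring a[24] <= a[19], ..., a[1] <= new01, a[0] <= new00
    A = np.concatenate([new5, A[:-5]])
    # each shifted sample is now paired with a different b[k],
    # so all 25 products a[k] * b[k] are recomputed
    C = A * B
    return A, C

A, C = step(A, rng.random(5))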
On the very first 5 data sets you could save up to 50 multiplications, but after that it's a flat road of multiplications, since for every set after the first 5 you need to multiply with a new set of data.
I'll assume all the arrays are initialized to zero.
I don't think those 50 saved multiplications are of much use considering the amount of multiplication on the whole,
but I will still give you a hint on how to save those 50; maybe you can find an extension to it.
1st data set arrives: multiply the first data set in a with the corresponding values in b, save all in a, copy only a[0] to a[4] to c. 25 multiplications here.
2nd data set arrives: multiply only a[0] to a[4] (holding the new data) with b[0] to b[4] respectively, save in a[0] to a[4], copy a[0] to a[9] to c. 5 multiplications here.
3rd data set arrives: multiply a[0] to a[9] with b[0] to b[9] this time and copy the corresponding a[0] to a[14] to c. 10 multiplications here.
4th data set: multiply a[0] to a[14] with the corresponding b and copy the corresponding a[0] to a[19] to c. 15 multiplications here.
5th data set: multiply a[0] to a[19] with the corresponding b and copy the corresponding a[0] to a[24] to c. 20 multiplications here.
Total saved multiplications: 50.
6th data set: the usual multiplications, 25 each, because for each position in a there is now new data available, so the multiplication is unavoidable.
Could you add another array D to flag the changed/unchanged values in A? Each time, you check this array to decide whether to do new multiplications or not.

Programming ranging data

I have 2 inputs from 0-180 for x and y. I need to add them together and stay in the range of 0 to 180. I am having some trouble: since 90 is the midpoint, I can't seem to keep my data in that range. I'm doing this in VB.NET, but I mainly need help with the logic.
result = (x + y) / 2
Perhaps? At least that will stay in the 0-180 range. Are there any other constraints you're not telling us about? Right now this seems pretty obvious.
If you want to map the two values to the limited range in a linear fashion, just add them together and divide by two:
out = (in1 + in2) / 2
If you just want to limit the top end, add them together then use the minimum of that and 180:
out = min (180, in1 + in2)
Are you wanting to find the average of the two, or to add them? If you're adding them, and you're dealing with angles which wrap around (which is what it sounds like), then why not just add them and take the modulo? Like this:
(in1 + in2) mod 180
Hopefully you're familiar with the modulo operator.
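The question is about VB.NET, but as a language-neutral illustration of the three options above (averaging, clamping, and wrapping), here is a short Python sketch with made-up inputs:
def average(x, y):
    return (x + y) / 2          # always stays within 0-180

def clamp_add(x, y):
    return min(180, x + y)      # caps the sum at 180

def wrap_add(x, y):
    return (x + y) % 180        # wraps around, treating 180 as 0

print(average(120, 150), clamp_add(120, 150), wrap_add(120, 150))   # 135.0 180 90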

Avoid the use of for loops

I'm working with R and I have a code like this:
for (i in 1:10)
  for (j in 1:100)
    if (data[i] == paths[j, 1])
      cluster[i, 4] <- paths[j, 2]
where:
data is a vector with 100 rows and 1 column
paths is a matrix with 100 rows and 5 columns
cluster is a matrix with 100 rows and 5 columns
My question is: how could I avoid the use of "for" loops to iterate through the matrix? I don't know whether apply functions (lapply, tapply...) are useful in this case.
This is a problem when j=10000 for example because the execution time is very long.
Thank you
The inner loop could be vectorized:
cluster[i,4] <- paths[max(which(data[i]==paths[,1])),2]
but check Musa's comment; I think you intended something else.
The second (outer) loop could be vectorized as well, by replicating vectors, but:
if i is only 100, your speed-up won't be large
it will need more RAM
[edit]
As I understand your comment, can you just use logical indexing?
indx <- data==paths[, 1]
cluster[indx, 4] <- paths[indx, 2]
I think that both loops can be vectorized using the following:
cluster[na.omit(match(paths[1:100, 1], data[1:10])), 4] = paths[!is.na(match(paths[1:100, 1], data[1:10])), 2]