I currently have two timelines (timeline1 and timeline2), with matching data (data1 and data2). The timelines almost, but not quite, match (about 90% of the values are common to both).
I'm trying to find the values from data1 and data2 that correspond to identical timestamps (ignoring all other values).
My first, trivial implementation is as follows (and is obviously terribly slow, given that my timelines contain thousands of values). Any ideas on how to improve it? I'm sure there is a smart way of doing this that avoids the for loop or the find operation...
% We expect the common timeline to contain
% 0 1 4 5 9
timeline1 = [0 1 4 5 7 8 9 10];
timeline2 = [0 1 2 4 5 6 9];
% Some bogus data
data1 = timeline1*10;
data2 = timeline2*20;
reconstructedData1 = data1;
reconstructedData2 = zeros(size(data1));
currentSearchPosition = 1;
for t = 1:length(timeline1)
    % We only look beyond the previous matching location, to speed up find
    matchingIndex = find(timeline2(currentSearchPosition:end) == timeline1(t), 1);
    if isempty(matchingIndex)
        reconstructedData1(t) = nan;
        reconstructedData2(t) = nan;
    else
        reconstructedData2(t) = data2(matchingIndex+currentSearchPosition-1);
        currentSearchPosition = currentSearchPosition+matchingIndex;
    end
end
% Remove values from data1 for which no match was found in data2
reconstructedData1(isnan(reconstructedData1)) = [];
reconstructedData2(isnan(reconstructedData2)) = [];
You can use MATLAB's intersect function:
c = intersect(A, B)
Couldn't you just call INTERSECT?
commonTimeline = intersect(timeline1,timeline2);
commonTimeline =
0 1 4 5 9
You need to use the indices returned by intersect:
[~, ia, ib] = intersect(timeline1, timeline2);
recondata1 = data1(ia);
recondata2 = data2(ib);
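With the sample data from the question (common timeline 0 1 4 5 9), this gives:
recondata1 =
     0    10    40    50    90
recondata2 =
     0    20    80   100   180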
I have a function with two loops. The function goes through every column and tests something. This works fine.
My issue is that I don't know how to use the output of the function properly. The output contains multiple lines (j = column, i = test result).
I want to use the output in some other rules in other functions (e.g. if the test result i is greater than 5, then do something).
def test():
    scope = range(10)
    scope2 = range(len(df1.columns))
    for j in scope2:
        for i in scope:
            if df1.iloc[:,[j]].shift(i).loc[selected_week].item() > df1.iloc[:,[j]].shift(i+1).loc[selected_week].item():
                i + 1
            else:
                print(j, i)
                break
Output:
test()
1 0
2 3
3 3
4 1
5 0
6 6
7 0
8 1
9 0
10 1
11 1
12 0
13 0
14 0
15 0
I tried to convert it to a list, a DataFrame, etc., but I'm missing something here.
What is the best way to do that?
Thank you!
A fixed version of your code would be:
import pandas as pd

def test():
    out = []
    scope = range(10)
    scope2 = range(len(df1.columns))
    for j in scope2:
        for i in scope:
            if df1.iloc[:,[j]].shift(i).loc[selected_week].item() <= df1.iloc[:,[j]].shift(i+1).loc[selected_week].item():
                out.append([i, j])
    return pd.DataFrame(out)

out = test()
But you probably don't want to use loops, as they're slow. Please clarify what your input is (with a minimal reproducible example) and what you are trying to achieve (expected output and logic); we can probably make this a vectorized solution.
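For instance, here is a rough sketch of one possible vectorization (my assumptions, not from the question: df1 has a unique ordered index, selected_week is one of its labels, and at least 10 rows precede it). Since df1.shift(i).loc[selected_week] is simply the value i rows above selected_week, the loop is counting consecutive row-over-row increases walking upward from selected_week:
import numpy as np
import pandas as pd

pos = df1.index.get_loc(selected_week)
# The row at selected_week plus the 10 rows above it, newest first
vals = df1.iloc[pos - 10:pos + 1].to_numpy()[::-1]
inc = vals[:-1] > vals[1:]  # inc[i] is True when the comparison at lag i holds
broke = ~inc                # where the run of increases breaks
# First lag at which the run breaks, per column (10 if it never breaks)
first_break = np.where(broke.any(axis=0), broke.argmax(axis=0), 10)
result = pd.DataFrame({'column': range(df1.shape[1]), 'i': first_break})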
My code uses a column called bookingstatus that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from, depending on the booking status). There are many more noes than yeses, so I would like to take a sample with all the yeses and the same number of noes.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to draw the sample this way?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function samples without replacement, so you'll receive an error if you specify a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter means that rows with a larger value in c1 are more likely to be sampled; in particular, rows whose weight is zero will never be picked. We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to True:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
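Put differently, with n equal to the number of non-zero weights, Method 2 returns the same rows (just in a random order) as a plain boolean filter:
print(df[df['c1'] != 0])
   c1  c2
0   1   1
7   1   8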
I was able to do this in the end; here is how I did it:
bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')
# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()
# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]
# Undersample class 0 down to the size of class 1
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
Based on this: https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone
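For reference, newer pandas versions (1.1+) can draw a balanced sample in one step with GroupBy.sample; a minimal sketch, assuming bookingstatus is strictly 0/1:
n_yes = (df['bookingstatus'] == 1).sum()
# Sample the same number of rows from each class (class 1 is then taken whole)
balanced = df.groupby('bookingstatus').sample(n=n_yes, random_state=1)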
I am using SQL Server and I have an int that is 4 to 5 digits long.
I have a report that treats the first 3 digits as the location and the last 1 to 2 digits as the cause.
So this is how they look:
5142  = location 514 (paint line), cause 2 (paint too thin)
50528 = location 505 (machining), cause 28 (oblong hole)
SELECT [Suspect]
,left(Suspect,3) as SuspectOP
,Right(Suspect,2) as SuspectID
This query will return:
5142  → SuspectOP = 514, SuspectID = 42
50528 → SuspectOP = 505, SuspectID = 28
So what I want is to read everything after the first 3 digits of the int.
Some of the things I have tried are as follows:
Select Cast(Suspect as Varchar(5)),
Substring(Suspect,3,2)
And
Select Suspect % 514 as SuspectID
This does work as long as the first 3 digits are always 514, which in my case they aren't.
You could use a conditional expression based on the length, like this:
SELECT
    [Suspect]
    , SuspectOP = LEFT(Suspect,3)
    , SuspectID = CASE
                      WHEN LEN(Suspect) = 5 THEN RIGHT(Suspect,2)
                      ELSE RIGHT(Suspect, 1)
                  END
Mind you, it's not ideal; you should really keep the values separate if your use case is like the one mentioned.
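If you'd rather avoid string functions entirely, here is a purely arithmetic sketch of the same idea (assuming, as stated, that Suspect always has 4 or 5 digits):
SELECT
    [Suspect]
    , SuspectOP = Suspect / POWER(10, LEN(Suspect) - 3)  -- integer division keeps the first 3 digits
    , SuspectID = Suspect % POWER(10, LEN(Suspect) - 3)  -- the remainder is everything after them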
I know that, with the DataFrames package, it is possible to build a table column by column, simply by doing:
julia> df = DataFrame();
julia> for i in 1:3
           df[i] = [i, i+1, i*2]
       end
julia> df
3x3 DataFrame
|-------|----|----|----|
| Row # | x1 | x2 | x3 |
| 1     | 1  | 2  | 3  |
| 2     | 2  | 3  | 4  |
| 3     | 2  | 4  | 6  |
... but is there any way to do the same with an empty Array{Int64,2}?
If you know how many rows you have in your final Array, you can do it using hcat:
# The number of rows of your final array
numrows = 3
# Create an empty array of the element type you want, with 3 rows and 0 columns:
a = Array(Int, numrows, 0)
# Concatenate 3x1 arrays onto your empty array:
for i in 1:numrows
    b = [i, i+1, i*2]  # Create the column you want to concatenate with a
    a = hcat(a, b)
end
Notice that here you know that the arrays b have elements of type Int, so we can create the array a with elements of the same type.
Loop over the columns of the matrix:
A = zeros(3,3)
for i = 1:3
    A[:,i] = [i, i+1, 2i]
end
If at all possible, it is best to create your Array with the desired number of columns from the start. That way, you can just fill in those column values. Solutions using procedures like hcat() will suffer from inefficiency, since they require re-creating the Array each time.
If you do need to add columns to an already existing Array, you will be better off if you can add them all at once, rather than in a loop with hcat(). E.g. if you start with:
n = 10; m = 5;
A = rand(n,m);
then
A = [A rand(n, 3)]
will be faster and more memory efficient than:
for idx = 1:3
    A = hcat(A, rand(n))
end
E.g. compare the difference in speed and memory allocations between these two:
n = 10^5; m = 10;
A = rand(n,m);
n_newcol = 10;
function t1(A::Array, n_newcol)
    n = size(A,1)
    for idx = 1:n_newcol
        A = hcat(A, zeros(n))
    end
    return A
end

function t2(A::Array, n_newcol)
    n = size(A,1)
    [A zeros(n, n_newcol)]
end
# Stats after running each function once to compile
@time r1 = t1(A, n_newcol);  ## 0.145138 seconds (124 allocations: 125.888 MB, 70.58% gc time)
@time r2 = t2(A, n_newcol);  ## 0.011566 seconds (9 allocations: 22.889 MB, 39.08% gc time)
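If the number of new columns isn't known in advance, a middle-ground sketch is to collect the columns in a Vector first and concatenate only once at the end, so the hcat() cost is paid a single time:
cols = Vector{Vector{Int}}()
for i in 1:3
    push!(cols, [i, i+1, i*2])  # cheap: only grows the list of columns
end
A = hcat(cols...)  # one concatenation builds the matrix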
I am new to R and am trying to find a more efficient way to accomplish this fairly simple task.
I have a data.frame M with 100,000 lines (and many columns, out of which 2 are relevant to this problem; I'll call them M1 and M2). I have another data.frame whose column V1, with about 10,000 elements, is essential to this task. My task is this:
For each element of V1, find where it occurs in M2 and pull out the corresponding M1. I am able to do this using a for loop, and it is terribly slow! I am used to Matlab and Perl, and this is taking forever in R! Surely there's a better way. I would appreciate any valuable suggestions for accomplishing this task...
for (x in 1:length(V$V1)) {
    start[x] = M$M1[M$M2 == V$V1[x]]
}
Only one element will match, so I can use the logical comparison to directly get the element into the start vector. How can I vectorize this?
Thank you!
Here is another solution using the same example by #aix.
M[match(V$V1, M$M2),]
To benchmark performance, we can use the R package rbenchmark.
library(rbenchmark)
f_ramnath = function() M[match(V$V1, M$M2),]
f_aix = function() merge(V, M, by.x='V1', by.y='M2', sort=F)
f_chase = function() M[M$M2 %in% V$V1,] # modified to return full data frame
benchmark(f_ramnath(), f_aix(), f_chase(), replications = 10000)
test replications elapsed relative
2 f_aix() 10000 12.907 7.068456
3 f_chase() 10000 2.010 1.100767
1 f_ramnath() 10000 1.826 1.000000
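One caveat worth noting (my addition, not part of the original answer): match() keeps the order of V$V1 and yields NA rows for values of V1 that have no match in M2, so drop those if they are unwanted:
res <- M[match(V$V1, M$M2), ]
res <- res[!is.na(res$M2), ]  # remove rows for unmatched V1 values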
Another option is to use the %in% operator:
> set.seed(1)
> M <- data.frame(M1 = sample(1:20, 15, FALSE), M2 = sample(1:20, 15, FALSE))
> V <- data.frame(V1 = sample(1:20, 10, FALSE))
> M$M1[M$M2 %in% V$V1]
[1] 6 8 11 9 19 1 3 5
Sounds like you're looking for merge:
> M <- data.frame(M1=c(1,2,3,4,10,3,15), M2=c(15,6,7,8,-1,12,5))
> V <- data.frame(V1=c(-1,12,5,7))
> merge(V, M, by.x='V1', by.y='M2', sort=F)
V1 M1
1 -1 10
2 12 3
3 5 15
4 7 3
If V$V1 might contain values not present in M$M2, you may want to specify all.x=T. This will fill in the missing values with NAs instead of omitting them from the result.
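For illustration (the value 99 is hypothetical, chosen because it is absent from M$M2):
> V2 <- data.frame(V1=c(-1,12,99))
> merge(V2, M, by.x='V1', by.y='M2', all.x=T)
  V1 M1
1 -1 10
2 12  3
3 99 NA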