how to convert function output into list, dict or as data frame? - pandas

My issue is, i don't know how to use the output of a function properly. The output contains multiple lines (j = column , i = testresult)
I want to use the output for some other rules in other functions. (eg. if (i) testresult > 5 then something)
I have a function with two loops. The function goes threw every column and test something. This works fine.
def test():
scope = range(10)
scope2 = range(len(df1.columns))
for (j) in scope2:
for (i) in scope:
if df1.iloc[:,[j]].shift(i).loc[selected_week].item() > df1.iloc[:,[j]].shift(i+1).loc[selected_week].item():
i + 1
else:
print(j,i)
break
Output:
test()
1 0
2 3
3 3
4 1
5 0
6 6
7 0
8 1
9 0
10 1
11 1
12 0
13 0
14 0
15 0
I tried to convert it to list, dataframe etc. However, i miss something here.
What is the best way for that?
Thank you!

A fix of your code would be:
def test():
out = []
scope = range(10)
scope2 = range(len(df1.columns))
for j in scope2:
for i in scope:
if df1.iloc[:,[j]].shift(i).loc[selected_week].item() <= df1.iloc[:,[j]].shift(i+1).loc[selected_week].item():
out.append([i, j])
return pd.DataFrame(out)
out = test()
But you probably don't want to use loops as it's slow, please clarify what is your input with a minimal reproducible example and what you are trying to achieve (expected output and logic), we can probably make it a vectorized solution.

Related

Compare Values of 2 dataframes conditionally

I have the following problem. I have a dataframe which look like this.
Dataframe1
start end
0 0 2
1 3 7
2 8 9
and another dataframe which looks like this.
Dataframe2
data
1 ...
4 ...
8 ...
11 ...
What I am trying to achieve is following:
For each row in Dataframe1 I want to check if there is any index value in Dataframe2 which is in range(start, end) of Dataframe1.
If the condition is True, I want to create a new column["condition"] where the outcome is stored.
Since there is the possiblity to deal with large amounts of data I tried using numpy.select.
Like this:
range_start = df1.start
range_end = df1.end
condition = [
df2.index.to_series().between(range_start, range_end)
]
choice = ["True"]
df1["condition"] = np.select(condition, choice, default=0)
This gives me an error:
ValueError: Can only compare identically-labeled Series objects
I also tried a list comprehension. That didn't work either. All the things I tried are failing because I am dealing with a series (--> range_start, range_end). There has to be a way to make this work I think..
I already searched stackoverflow for this paricular problem. But I wasn't able to find a solution to this problem. It could be, that I'm just to inexperienced for this type of problem, to search for the right solution.
So maybe you can help me out here.
Thank you!
expected output:
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
Use DataFrame.drop_duplicates for remove duplicates by both columns and index, create all combinations by DataFrame.merge with cross join and last test at least one match by GroupBy.any:
df3 = (df1.drop_duplicates(['start','end'])
.merge(df2.index.drop_duplicates().to_frame(), how='cross'))
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df1.join(df3.groupby(['start','end'])['condition'].any(), on=['start','end'])
print (df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True
If all pairs in df1 are unique is possible use:
df3 = (df1.merge(df2.index.to_frame(), how='cross'))
df3['condition'] = df3[0].between(df3.start, df3.end)
df3 = df3.groupby(['start','end'], as_index=False)['condition'].any()
print (df3)
start end condition
0 0 2 True
1 3 7 True
2 8 9 True

Can I use pandas to create a biased sample?

My code uses a column called booking status that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from dependant on the booking status) - there are lots more no than yes so I would like to take a sample with all the yes and the same amount of no.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to do this sample this way?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function will sample without replacement. Meaning, you'll receive an error by specifying a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, the sample function will not pick values from this column that are zero. We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
I was able to this in the end, here is how I did it:
bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')
# Class count
count_class_0, count_class_1 = df.bookingstatus.value_counts()
# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([f_class_0_under, df_class_1], axis=0)
df_class_1 = df[df['bookingstatus'] == 1]
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone

Adding numbers in a sequence

I have a basic problem.
If we are to input:
6
1 2 3 4 10 11
The desired outcome should be:
31
Here is the coding, you must simply finish the function and it should work:
#!/bin/python3
import sys
def simpleArraySum(n, ar):
# Complete this function
n = int(input().strip())
ar = list(map(int, input().strip().split(' ')))
result = simpleArraySum(n, ar)
print(result)
We want 1+2+3+4+10+11 = 31
Use sum().
def simpleArraySum(n, ar):
return sum(ar[:n])
The [:n] truncates the array to n elements.

Comparing vectors

I am new to R and am trying to find a better solution for accomplishing this fairly simple task efficiently.
I have a data.frame M with 100,000 lines (and many columns, out of which 2 columns are relevant to this problem, I'll call it M1, M2). I have another data.frame where column V1 with about 10,000 elements is essential to this task. My task is this:
For each of the element in V1, find where does it occur in M2 and pull out the corresponding M1. I am able to do this using for-loop and it is terribly slow! I am used to Matlab and Perl and this is taking for EVER in R! Surely there's a better way. I would appreciate any valuable suggestions in accomplishing this task...
for (x in c(1:length(V$V1)) {
start[x] = M$M1[M$M2 == V$V1[x]]
}
There is only 1 element that will match, and so I can use the logical statement to directly get the element in start vector. How can I vectorize this?
Thank you!
Here is another solution using the same example by #aix.
M[match(V$V1, M$M2),]
To benchmark performance, we can use the R package rbenchmark.
library(rbenchmark)
f_ramnath = function() M[match(V$V1, M$M2),]
f_aix = function() merge(V, M, by.x='V1', by.y='M2', sort=F)
f_chase = function() M[M$M2 %in% V$V1,] # modified to return full data frame
benchmark(f_ramnath(), f_aix(), f_chase(), replications = 10000)
test replications elapsed relative
2 f_aix() 10000 12.907 7.068456
3 f_chase() 10000 2.010 1.100767
1 f_ramnath() 10000 1.826 1.000000
Another option is to use the %in% operator:
> set.seed(1)
> M <- data.frame(M1 = sample(1:20, 15, FALSE), M2 = sample(1:20, 15, FALSE))
> V <- data.frame(V1 = sample(1:20, 10, FALSE))
> M$M1[M$M2 %in% V$V1]
[1] 6 8 11 9 19 1 3 5
Sounds like you're looking for merge:
> M <- data.frame(M1=c(1,2,3,4,10,3,15), M2=c(15,6,7,8,-1,12,5))
> V <- data.frame(V1=c(-1,12,5,7))
> merge(V, M, by.x='V1', by.y='M2', sort=F)
V1 M1
1 -1 10
2 12 3
3 5 15
4 7 3
If V$V1 might contain values not present in M$M2, you may want to specify all.x=T. This will fill in the missing values with NAs instead of omitting them from the result.

How can I optimize this timeline-matching code in Matlab?

I currently have two timelines (timeline1 and timeline2), with matching data (data1 and data2). Timelines almost, but not quite match (about 90% of common values).
I'm trying to find values from data1 and data2 that correspond to identical timestamps (ignoring all other values)
My first trivial implementation is as follows (and is obviously terribly slow, given that my timelines contain thousands of values). Any ideas on how to improve this? I'm sure there is a smart way of doing this while avoiding the for loop, or the find operation...
% We expect the common timeline to contain
% 0 1 4 5 9
timeline1 = [0 1 4 5 7 8 9 10];
timeline2 = [0 1 2 4 5 6 9];
% Some bogus data
data1 = timeline1*10;
data2 = timeline2*20;
reconstructedData1 = data1;
reconstructedData2 = zeros(size(data1));
currentSearchPosition = 1;
for t = 1:length(timeline1)
% We only look beyond the previous matching location, to speed up find
matchingIndex = find(timeline2(currentSearchPosition:end) == timeline1(t), 1);
if isempty(matchingIndex)
reconstructedData1(t) = nan;
reconstructedData2(t) = nan;
else
reconstructedData2(t) = data2(matchingIndex+currentSearchPosition-1);
currentSearchPosition = currentSearchPosition+matchingIndex;
end
end
% Remove values from data1 for which no match was found in data2
reconstructedData1(isnan(reconstructedData1)) = [];
reconstructedData2(isnan(reconstructedData2)) = [];
You can use Matlab's intersect function:
c = intersect(A, B)
Couldn't you just call INTERSECT?
commonTimeline = intersect(timeline1,timeline2);
commonTimeline =
0 1 4 5 9
You need to use the indexes returned from intersect.
[~ ia ib] = intersect(timeline1, timeline2);
recondata1 = data1(ia);
recondata2 = data2(ib);