pandas dataframe subsetting on multiple loop conditions - pandas

I have a pandas df with users, their answers to a survey, and a score, e.g.
Userid incomebracket insurance-knowledge..... score
123 3 3 56
346 4 6 65
Assume income bracket has 6 levels (1: 1000-5000 up to 6: 100000+). Similarly, insurance-knowledge has 6 levels (1: very little to 6: expert).
Now I have another df which has user profile features like
userid,age,gender,education....(10 such features)
Now I iterate through the set of users (first df), and for each of them I want to get the entire subset of other users who have the same user profile but a higher answer on a given column of the first df, say income. I am currently doing this as follows, here for 3 profile features (age, gender and education):
df_sameusergroup=df[(df['PPGENDER']==sameuser_gender.values[0])
& (df['EDUC']==sameuser_educ.values[0])
& (df['age']==sameuser_agecat.values[0])
& (df['incomebracket']>user_feature.values[0])]
Although this works, the profile features are hardcoded, which becomes a problem for longer conditions. What I want is this:
get the subset of users who have the same profile on all 10 features but a higher answer. If no such record exists (which is possible), reduce to 9 features, then to 8, 7, ... down to 2 (the most important features, say age and gender). My pseudocode for this should look like this:
for i in 10 down to 2:  // match on all 10 profile features, then 9, then 8, ...
    Similaruserdf = df[subset where the first i features are the same and income is higher]
    if Similaruserdf.length == 0:  // no such users with all i features the same
        continue  // reduce the number of features to match on
    else:
        return Similaruserdf
I am stuck trying to do this and have been looking throughout to find a solution. Any help would be greatly appreciated. Thanks.
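One possible way to express the fallback loop from the pseudocode above is sketched below. It is only a sketch under several assumptions: the answers and the profile features have already been merged on the user id into a single df, profile_cols is ordered from most to least important, and the function name find_similar_users, the sample data and the min_features parameter are all hypothetical.

```python
import pandas as pd


def find_similar_users(df, user_row, profile_cols, answer_col, min_features=2):
    """Match on as many leading profile columns as possible.

    Tries all columns in profile_cols first, then drops the least
    important (last) column one at a time until a non-empty subset
    with a strictly higher answer_col is found.
    Returns (subset, columns_used); the subset is empty if nothing matched.
    """
    higher = df[answer_col] > user_row[answer_col]
    for k in range(len(profile_cols), min_features - 1, -1):
        cols = profile_cols[:k]
        mask = higher.copy()
        for col in cols:
            # Build the condition column by column instead of hardcoding it.
            mask &= df[col] == user_row[col]
        subset = df[mask]
        if not subset.empty:
            return subset, cols
    return df.iloc[0:0], []


# Hypothetical sample data, already merged into one frame.
df = pd.DataFrame({
    "userid": [1, 2, 3, 4],
    "gender": ["F", "F", "F", "M"],
    "educ": [2, 2, 3, 2],
    "age": [30, 30, 30, 30],
    "incomebracket": [3, 5, 4, 6],
})
profile_cols = ["age", "gender", "educ"]  # most important first (assumed order)

# User 1 finds a match on all three features.
full_match, full_cols = find_similar_users(df, df.iloc[0], profile_cols, "incomebracket")

# User 2 only finds a match after falling back to a single feature.
fallback, fallback_cols = find_similar_users(
    df, df.iloc[1], profile_cols, "incomebracket", min_features=1
)
```

Because the columns to match on come from a list, going from 3 to 10 profile features only changes profile_cols, not the filtering code.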

Related

How can I detect similarity of names in the same columns

Guys I have a dataset like this:
df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly', 'merry', 'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
it gives this output
Name
0 John
1 gal britt
2 mona
3 diana
4 molly
5 merry
6 mony
7 molla
8 johnathon
9 dina
To compare every name against every other name and detect similarity, I imagine I would use df.merge(df, how="cross").
The problem is that the real data has 40000 rows, and a cross join would produce a very large dataset that I don't have the memory for.
Any algorithm or idea would really help, and I'll adjust the logic to my purposes.
I tried using vaex instead of pandas to handle this amount of data, but I still run into insufficient memory allocation.
In short: I KNOW that this way of approaching the problem is wrong and inefficient.
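As a rough illustration of avoiding the cross join, the sketch below streams candidate pairs with a generator and the standard library's difflib instead of materialising every combination. The helper name similar_name_pairs and the cutoff value are assumptions, not part of the question. Memory stays proportional to the number of names, but the number of comparisons is still quadratic, so for 40000 rows some form of pre-grouping (for example by first letter) or a dedicated fuzzy-matching library would probably be needed as well.

```python
import difflib

import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "gal britt", "mona", "diana", "molly",
              "merry", "mony", "molla", "johnathon", "dina"]}
)


def similar_name_pairs(names, cutoff=0.85):
    """Yield (name, similar_name) pairs one at a time.

    Each name is compared only with the names after it, so a pair is
    reported once and no cross-joined table is ever built.
    """
    names = [n.lower() for n in names]
    for i, name in enumerate(names):
        later = names[i + 1:]
        # get_close_matches keeps candidates whose similarity ratio >= cutoff.
        for match in difflib.get_close_matches(name, later, n=len(names), cutoff=cutoff):
            yield name, match


pairs = list(similar_name_pairs(df["Name"]))
print(pairs)
```

Lowering cutoff makes the matching more permissive; the right value depends on the data and has to be tuned by hand.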

How to label a whole dataset?

I have a pandas dataframe that contains 5000 columns and 12 rows. Each row represents the signal received from one electrocardiogram lead. I want to assign 3 labels to this dataset. These 3 labels belong to the entire dataset and are not related to a specific row. How can I do this?
I have attached a picture of my dataframe.
and my labels are: Atrial Fibrillation:0,
right bundle branch block:1,
T Wave Change:2
I tried to assign 3 labels to a large dataset
(Not for a specific row or column)
but I didn't find a solution.
As you can see, it has 12 rows and 5000 columns. Each row holds 5000 samples from one specific lead, and the 12 rows correspond to the 12 leads (I, II, III, aVR,.... V6). Experts have assigned 3 labels to this dataframe, which we use to train an ML model to detect different heart diseases. I have 10000 dataframes just like this one, and each has 3 or 4 specific labels. As I said, these labels don't refer to specific rows; each dataframe has 3 or 4 labels for the dataframe as a whole. So my question is: how can I assign 3 labels to a whole dataframe?
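One common way to handle this, sketched here under assumptions, is to keep the labels outside the dataframes: each record is paired with its list of label ids, the signals are stacked into one array, and the labels are encoded as a multi-hot matrix with one row per dataframe. The random signals, the helper make_record and the variable names are hypothetical; only the three label names come from the question.

```python
import numpy as np
import pandas as pd

LABEL_NAMES = ["Atrial Fibrillation", "right bundle branch block", "T Wave Change"]


def make_record(seed):
    """Stand-in for one ECG dataframe: 12 leads (rows) x 5000 samples (columns)."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.normal(size=(12, 5000)))


# Each record pairs a whole dataframe with the label ids that apply to it.
records = [
    (make_record(0), [0, 1, 2]),
    (make_record(1), [0, 2]),
]

# Signals: one 3-D array of shape (n_records, 12, 5000).
X = np.stack([signal.to_numpy() for signal, _ in records])

# Labels: multi-hot matrix of shape (n_records, n_labels).
Y = np.zeros((len(records), len(LABEL_NAMES)), dtype=int)
for i, (_, label_ids) in enumerate(records):
    Y[i, label_ids] = 1

print(X.shape, Y.tolist())
```

With this layout, row i of Y describes the whole of record i, which is the shape most multi-label training code expects.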

Pandas run function only on subset of whole Dataframe

Let's say I have a DataFrame with 200 values, prices for products. I want to run some operation on this DataFrame, like calculating the average of the last 10 prices.
As I understand it, pandas will go through every single row and calculate the average for each row, i.e. the first 9 rows will be NaN, and for rows 10-200 it will calculate the average for each row.
My issue is that I need to do a lot of these calculations and performance is a concern. For that reason, I want to run the average only on, say, the last 10 values (I don't need more), while keeping all values in the DataFrame, i.e. I don't want to get rid of those values or create a new DataFrame.
Essentially, I just want to do the calculation on less data so that it is faster.
Is something like that possible? Hopefully the question is clear.
Building off Chicodelarose's answer, you can achieve this in a more "pandas-like" syntax.
Defining your df as follows gives 200 prices in the range [0, 1000).
import numpy as np
import pandas as pd
df = pd.DataFrame((np.random.rand(200) * 1000.).round(decimals=2), columns=["price"])
The bit you're looking for, though, would be the following:
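def add10(n: float) -> float:
    """An exceptionally simple function to demonstrate you can set
    values, too.
    """
    return n + 10

# Assign through a single .loc call; chained df["price"].iloc[-12:] = ...
# may write to a temporary copy instead of df.
last_rows = df.index[-12:]
df.loc[last_rows, "price"] = df.loc[last_rows, "price"].apply(add10)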
Of course, you can also use these selections to return something else without setting values, too.
>>> df["price"].iloc[-12:].mean().round(decimals=2)
309.63 # this will, of course, be different as we're using random numbers
The primary justification for this approach lies in the use of pandas tooling. Say you want to operate over a subset of your data with multiple columns, you simply need to adjust your .apply(...) to contain an axis parameter, as follows: .apply(fn, axis=1).
This becomes much more readable the longer you spend in pandas. 🙂
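To make the axis=1 remark concrete, here is a small sketch that applies a row-wise function to only the last 10 rows of a two-column frame. The quantity and revenue columns and the revenue function are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": (rng.random(200) * 1000).round(2),
    "quantity": rng.integers(1, 10, size=200),
})


def revenue(row):
    """Row-wise function: receives one row as a Series."""
    return row["price"] * row["quantity"]


# Start with an empty result column, then fill only the last 10 rows.
df["revenue"] = np.nan
last_rows = df.index[-10:]
df.loc[last_rows, "revenue"] = df.loc[last_rows, ["price", "quantity"]].apply(revenue, axis=1)

print(df.tail(12))
```

The other 190 rows stay in the frame untouched, and the function is only called 10 times.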
Given a dataframe like the following:
Price
0 197.45
1 59.30
2 131.63
3 127.22
4 35.22
.. ...
195 73.05
196 47.73
197 107.58
198 162.31
199 195.02
[200 rows x 1 columns]
Call the following to obtain the mean over the last n rows of the dataframe:
def mean_over_n_last_rows(df, n, colname):
    return df.iloc[-n:][colname].mean().round(decimals=2)
print(mean_over_n_last_rows(df, 2, "Price"))
Output:
178.67
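For comparison with the behaviour described in the question, the sketch below computes a rolling mean over every row and the mean of only the last n rows, and checks that the final rolling value agrees with the tail-only result. The data is random and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({"Price": (rng.random(200) * 200).round(2)})
n = 10

# Rolling mean: one value per row, NaN for the first n - 1 rows.
rolling_mean = df["Price"].rolling(n).mean()

# Tail-only mean: touches just the last n rows and leaves df unchanged.
tail_mean = df["Price"].tail(n).mean()

print(rolling_mean.isna().sum(), np.isclose(rolling_mean.iloc[-1], tail_mean))
```

If only the most recent average is needed, the tail-only version does the work for n rows instead of all 200.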

Why can't I read all of the values in the matrix in scilab?

I am trying to read a CSV file, and my code is as follows:
param=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%i",'double',[],[],[1 2 3 4]); //reads number of clusters and features
data=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%f",'double',[],[],[3 1 19 4]); //reads the values
numft=param(1,1);//save number of features
numcl=param(2,1);//save number of clusters
data_pts=0;
data_pts = max(size(data, "r"));//checks how many number of rows
disp(data(numft-3:data_pts,:));//print all data points (I added -3 otherwise it displays only 15 rows)
disp(numft);//print features
disp(data_pts);//print number of rows
disp(param);
endfunction
Below are the values that I am trying to read:
features,4,,
clusters,3,,
5.1,3.5,1.4,0.2
4.9,3,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5,3.6,1.4,0.2
7,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,1.5
5.5,2.3,4,1.3
6.5,2.8,4.6,1.5
5.7,2.8,4.5,1.3
6.3,3.3,6,2.5
5.8,2.7,5.1,1.9
7.1,3,5.9,2.1
6.3,2.9,5.6,1.8
6.5,3,5.8,2.2
7.6,3,6.6,2.1
I do not know why the code only displays 15 rows instead of 17. The only time it displays the correct matrix is when I put -3 in numft, but with that, the number of columns would be 1. I am so confused. Is there a better way to read the values?
In the csvRead call in the first line of your script, the boundaries of the region to read are incorrect. It should be corrected like this:
param=csvRead("C:\Users\USER\Dropbox\VOA-BK code\assets\Iris.csv",",","%i",'double',[],[],[1 2 2 2]);

SPSS Compute Variable

Below is some data:
Test Day1 Day2 Score
A 1 2 100
B 1 3 62
C 3 4 90
D 2 4 20
E 4 5 80
I am trying to take the values from columns 'Day1' and 'Day2' and use them to select the row numbers for the column Score. For example, for Test A I would like to find the sum of 100 and 62, because those are the values in the first and second rows of Score. For Test B I would like to find the sum of 100, 62 and 90.
Is there any way to do this in the Compute Variable window, found in the menu under Transform > Compute Variable?
I tried the following:
Score(MEAN(VALUE(Day1), VALUE(DAY2)))
This is not the proper way to call the cell location of Score and I received an error.
Can anyone help?
Thank you!
You really have two different datasets here. One is a dataset of scores numbered 1 through 5.
The other is a dataset that includes indexes into the score dataset. So the steps would be something like this.
First take the scores dataset and transpose it so that it has one row and 5 columns (Data>Transpose)
Then match that dataset to each case in the main dataset (Data>Merge Files>Add Variables).
Next you have to resort to using syntax directly.
You would declare a vector for the scores (VECTOR)
Finally, you use COMPUTE to index into the scores.
For your real problem, I suppose that you might have batches of scores and maybe there are some gaps. The Restructure Data Wizard can help you generalize this - convert cases into variables, but let's not go there yet.
HTH,
Jon Peck
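To make the intended result concrete, the row-range sum described in the question can be sketched in pandas. This is not SPSS syntax, only an illustration of the indexing logic on the sample data; the RangeSum column name is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Test": ["A", "B", "C", "D", "E"],
    "Day1": [1, 1, 3, 2, 4],
    "Day2": [2, 3, 4, 4, 5],
    "Score": [100, 62, 90, 20, 80],
})

scores = df["Score"].to_numpy()

# Day1 and Day2 are 1-based row numbers, so rows Day1..Day2 inclusive
# correspond to the 0-based slice [Day1 - 1 : Day2].
df["RangeSum"] = [
    scores[start - 1:end].sum() for start, end in zip(df["Day1"], df["Day2"])
]

print(df)
```

For Test A this gives 100 + 62 = 162 and for Test B 100 + 62 + 90 = 252, matching the sums described in the question.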