How to split this kind of column in a dataframe? - pandas

I have a data frame like the one below:
   Name        Rating         Review  Price
1  The park    NaN            NaN     5040
2  The Westin  Good 7.6       NaN     6045
3  Courtyard   NaN            NaN     4850
4  Radisson    Excellent 9.8  NaN     7050
5  Banjara     Average 6.7    NaN     5820
6  Mindspace   NaN            NaN     8000
My required output is like this:
   Name        Review     Rating  Price
1  The park    NaN        NaN     5040
2  The Westin  Good       7.6     6045
3  Courtyard   NaN        NaN     4850
4  Radisson    Excellent  9.8     7050
5  Banjara     Average    6.7     5820
6  Mindspace   NaN        NaN     8000
I used this split function:
df[["review","ratings"]] = df["Rating"].str.split(expand=True)
But I got a 'Columns must be same length as key' error. How can I split this type of data? Can anyone help me?

The problem is that at least one value contains more than one space, so the split produces more than two columns. You can pass n=1 to split only at the first space:
df[["review","ratings"]] = df["Rating"].str.split(expand=True, n=1)
Or use rsplit with n=1 to split at the last space:
df[["review","ratings"]] = df["Rating"].str.rsplit(expand=True, n=1)
Another idea is to use Series.str.extract with a regex that captures everything before the trailing float:
df[["review","ratings"]] = df["Rating"].str.extract(r'(.*)\s+(\d+\.\d+)')

Related

How to perform nearest neighbor imputation in a time-series dataset?

I have a pandas Series of the form
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 0.0
9 1.0
10 NaN
11 NaN
12 NaN
13 0.0
...
The values can either be 0.0 or 1.0. From my knowledge of the data, however, the 0s come in groups: entries 0-8 should be 0, entries 9-12 should all be 1s, and entries 13+ will be 0s. Therefore, I believe the best way to impute the NaNs would be some kind of nearest-neighbor fill. However, it should obviously return a 0 or 1 and not an average value. Please let me know of any way to do this!
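Since the fill must return only observed values, one option is a hand-rolled nearest-neighbor fill built from ffill/bfill distances. This is a minimal sketch in plain pandas/NumPy; the toy series mimics the one in the question, and breaking distance ties toward the earlier value is an assumption:

import numpy as np
import pandas as pd

# Toy series shaped like the one in the question (values are only 0.0 or 1.0)
s = pd.Series([np.nan] * 8 + [0.0, 1.0] + [np.nan] * 3 + [0.0])

pos = pd.Series(np.arange(len(s), dtype=float))
valid = pos.where(s.notna())      # position of each observed value, NaN elsewhere
dist_prev = pos - valid.ffill()   # distance back to the last observed value
dist_next = valid.bfill() - pos   # distance forward to the next observed value

# Take the closer neighbour; a missing neighbour counts as infinitely far away
use_prev = dist_prev.fillna(np.inf) <= dist_next.fillna(np.inf)
filled = s.ffill().where(use_prev, s.bfill())

Series.interpolate(method='nearest') can handle the interior gaps too (it needs SciPy installed), but by default it does not fill the leading run of NaNs, which the explicit version above does.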

How to remove outliers based on number of column values in each row

I am new to data science and trying to solve a course exercise for a movie recommender system, where I want to drop rows based on the total count of values across the columns of each row.
I.e.:
if someone rated too many movies, they should be dropped to filter out the final results.
I found a traditional way of doing it, but I am not satisfied with it; it would be really helpful if someone could show me a more pythonic way of solving the problem.
Here is the table named userRatings
title Zeus and Roxanne (1997) unknown Á köldum klaka (Cold Fever) (1994)
user_id
0 NaN NaN NaN
1 NaN 4.0 NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN 4.0 NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
[10 rows x 1664 columns]
And here is the code I tried to solve the problem:
for index in userRatings.index:
    if userRatings.loc[index].count() > 500:
        userRatings = userRatings.drop(index)
I'm assuming you have a Pandas DataFrame... if so, one alternative would be something like this:
valid_rating_ixs = userRatings.count(axis=1) <= 500  # count(), not sum(): we want how many ratings, not their total
userRatings_cleaned = userRatings[valid_rating_ixs]
Note that my code above, and also yours, may be including columns that are not ratings (e.g. user_id). You may need to check that you are using only the relevant columns in your data frame.
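Here is a self-contained sketch of the same count-based filter on a toy frame (the threshold is lowered to 3 so it bites on the toy data):

import numpy as np
import pandas as pd

# Toy stand-in for userRatings: 3 users x 4 movies, NaN = not rated
userRatings = pd.DataFrame(
    [[5.0, np.nan, 3.0, np.nan],
     [4.0, 4.0, 4.0, 4.0],
     [np.nan, np.nan, np.nan, 1.0]],
    columns=['Movie A', 'Movie B', 'Movie C', 'Movie D'],
)

# count(axis=1) tallies non-NaN cells per row, i.e. movies rated per user
cleaned = userRatings[userRatings.count(axis=1) <= 3]
print(cleaned)  # keeps the users with 2 and 1 ratings, drops the one with 4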

Pandas pivot table value counts

I have a dataframe in this format:
Name Score Bin
John 90 80-100
Marc 30 20-40
John 10 0-20
David 20 0-20
...
I want to create a pivot table that looks like this:
Name 0-20 20-40 40-60 60-80 80-100 Total count Avg score
John 1 2 nan nan 2 5 60.53
Marc nan 2 nan nan nan 2 32.13
David 3 2 nan nan nan 5 21.80
So I want columns showing the count of values in each bucket, as well as the total count of values and the average score.
I have tried
table = pd.pivot_table(df, values=['Score', 'Bin'], index=['Name'],
                       aggfunc={'Score': np.average, 'Bin': 'count'},
                       dropna=True, margins=True)
However, I just get the overall count, not one broken down per bucket.
Break your task into 3 steps:
Generate a pivot_table:
df2 = pd.pivot_table(df, index='Name', columns='Bin', values='Score', aggfunc='count')\
    .reindex(columns=['0-20', '20-40', '40-60', '60-80', '80-100'])\
    .rename_axis(columns='')
The result, for your source data extended to give roughly your expected
result, is:
0-20 20-40 40-60 60-80 80-100
Name
David 3.0 2.0 NaN NaN NaN
John 1.0 2.0 NaN NaN 2.0
Marc NaN 2.0 NaN NaN NaN
Note: since NaN is a special case of float, the other values are also of float type.
Generate Total_count and Avg_score:
df3 = df.groupby('Name')\
    .agg(Total_count=('Score', 'count'), Avg_score=('Score', 'mean'))\
    .rename(columns={'Total_count': 'Total count', 'Avg_score': 'Avg score'})
The result is:
Total count Avg score
Name
David 5 21.8
John 5 61.0
Marc 2 32.0
Join both above tables:
result = df2.join(df3)
The result is:
0-20 20-40 40-60 60-80 80-100 Total count Avg score
Name
David 3.0 2.0 NaN NaN NaN 5 21.8
John 1.0 2.0 NaN NaN 2.0 5 61.0
Marc NaN 2.0 NaN NaN NaN 2 32.0
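As an aside, the per-bucket counts can also come from pd.crosstab; below is a self-contained sketch with assumed sample values. Note that crosstab writes 0 instead of NaN for empty buckets, and absent bins such as 40-60 would still need the reindex step shown above:

import pandas as pd

# Sample rows shaped like the question's data (values assumed)
df = pd.DataFrame({
    'Name': ['John', 'Marc', 'John', 'David', 'John', 'David'],
    'Score': [90, 30, 10, 20, 85, 35],
    'Bin': ['80-100', '20-40', '0-20', '0-20', '80-100', '20-40'],
})

# Count Name x Bin pairs, then join the per-name total and mean
result = (pd.crosstab(df['Name'], df['Bin'])
            .join(df.groupby('Name')['Score']
                    .agg(**{'Total count': 'count', 'Avg score': 'mean'})))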

SurveyMonkey data formatting using Pandas

I have a survey to analyze that was completed by participants on SurveyMonkey. Unfortunately, the way the data are organized is not ideal, in that each categorical response for each question has its own column.
Here, for example, are the first few lines of one of the responses in the dataframe:
How long have you been participating in the Garden Awards Program? \
0 One year
1 NaN
2 NaN
3 NaN
4 NaN
Unnamed: 10 Unnamed: 11 Unnamed: 12 \
0 2-3 years 4-5 years 5 or more years
1 NaN NaN NaN
2 NaN 4-5 years NaN
3 2-3 years NaN NaN
4 NaN NaN 5 or more years
How did you initially learn of the Garden Awards Program? \
0 I nominated my garden to be evaluated
1 NaN
2 I nominated my garden to be evaluated
3 NaN
4 NaN
Unnamed: 14 etc...
0 A friend or family member nominated my garden ...
1 A friend or family member nominated my garden ...
2 NaN
3 NaN
4 NaN
This question, How long have you been participating in the Garden Awards Program?, has valid responses (One year, 2-3 years, etc.) which all appear in the first row, serving as a key to which column holds which value. This is the first problem. (Similarly for How did you initially learn of the Garden Awards Program?, whose valid responses are I nominated my garden to be evaluated, A friend or family member nominated my garden, etc.)
The second problem is that the columns attached to each categorical response are all named Unnamed: N, where N runs over as many columns as there are categories across all questions.
Before I start remapping and flattening/collapsing the columns into a single one per question, I was wondering whether there is any other way of dealing with survey data presented like this using Pandas. All my searches pointed to the SurveyMonkey API, but I don't see how that would be useful here.
I am guessing that I will need to flatten the columns, so if anyone could suggest a method, that would be great. I'm thinking there is a way to keep grabbing the columns belonging to a categorical response by taking adjacent columns until Unnamed no longer appears in the column name, but I am clueless how to do this.
I will use the following DataFrame (which can be downloaded as CSV from here):
Q1 Unnamed: 2 Unnamed: 3 Q2 Unnamed: 5 Unnamed: 6 Q3 Unnamed: 7 Unnamed: 8
0 A1-A A1-B A1-C A2-A A2-B A2-C A3-A A4-B A3-C
1 A1-A NaN NaN NaN A2-B NaN NaN NaN A3-C
2 NaN A1-B NaN A2-A NaN NaN NaN A4-B NaN
3 NaN NaN A1-C NaN A2-B NaN A3-A NaN NaN
4 NaN A1-B NaN NaN NaN A2-C NaN NaN A3-C
5 A1-A NaN NaN NaN A2-B NaN A3-A NaN NaN
Key assumptions:
Every column whose name DOES NOT start with Unnamed is actually the title of a question
The columns between question titles represent options for the question on the left end of the column interval
Solution overview:
Find indices of where each question starts and ends
Flatten each question to a single column (pd.Series)
Merge the question columns back together
Implementation (part 1):
indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
questions = [c for c in df.columns if not c.startswith('Unnamed')]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]
Iterating over the slices as below, you get a single DataFrame corresponding to each question (for the sample frame, the slices are slice(0, 3), slice(3, 6) and slice(6, None)):
for q in slices:
    print(df.iloc[:, q])  # Use `display` if using Jupyter
Implementation (part 2-3):
def parse_response(s):
    # Return the first non-null entry in the row, or NaN if there is none
    try:
        return s[~s.isnull()].iloc[0]
    except IndexError:
        return np.nan

# The [1:] drops the first row, which holds the answer key rather than a response
data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
df = pd.concat(data, axis=1)
df.columns = questions
Output:
Q1 Q2 Q3
1 A1-A A2-B A3-C
2 A1-B A2-A A4-B
3 A1-C A2-B A3-A
4 A1-B A2-C A3-C
5 A1-A A2-B A3-A
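A possible alternative, sketched here rather than claimed as the canonical approach: forward-fill the question names across the Unnamed: N columns, then collapse each group with GroupBy.first, which skips nulls. The toy frame is rebuilt inline so the snippet runs standalone:

import numpy as np
import pandas as pd

# Same toy frame as above
df = pd.DataFrame({
    'Q1':         ['A1-A', 'A1-A', np.nan, np.nan, np.nan, 'A1-A'],
    'Unnamed: 2': ['A1-B', np.nan, 'A1-B', np.nan, 'A1-B', np.nan],
    'Unnamed: 3': ['A1-C', np.nan, np.nan, 'A1-C', np.nan, np.nan],
    'Q2':         ['A2-A', np.nan, 'A2-A', np.nan, np.nan, np.nan],
    'Unnamed: 5': ['A2-B', 'A2-B', np.nan, 'A2-B', np.nan, 'A2-B'],
    'Unnamed: 6': ['A2-C', np.nan, np.nan, np.nan, 'A2-C', np.nan],
    'Q3':         ['A3-A', np.nan, np.nan, 'A3-A', np.nan, 'A3-A'],
    'Unnamed: 7': ['A4-B', np.nan, 'A4-B', np.nan, np.nan, np.nan],
    'Unnamed: 8': ['A3-C', 'A3-C', np.nan, np.nan, 'A3-C', np.nan],
})

# Replace 'Unnamed: N' labels with NaN, then forward-fill the question names
names = pd.Series(df.columns)
names = names.where(~names.str.startswith('Unnamed')).ffill()

# Drop the key row, transpose, and take the first non-null answer per question
flat = df.iloc[1:].T.groupby(names.to_numpy()).first().T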

Create new dataframe columns from old dataframe rows using for loop --> N/A values

I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
And then I created a second dataframe, df2, that I want to fill with the row values of df1:
df2 = pd.DataFrame()
for x in range(12):
    y = 'Experiment %d' % (x + 1)
    df2[y] = df1.iloc[3*x:3*x+3]
df2
I get the column names Experiment 1 - Experiment 12 in df2, and the first column is filled with the right values, but all the following columns are filled with NaN.
   Experiment 1  Experiment 2  Experiment 3  Experiment 4  Experiment 5  Experiment 6  Experiment 7  Experiment 8  Experiment 9  Experiment 10  Experiment 11  Experiment 12
0        2430.0           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN            NaN            NaN            NaN
1        2870.0           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN            NaN            NaN            NaN
2        2890.0           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN            NaN            NaN            NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue: I had to use .values. Without it, pandas aligns the assigned values on their index; df1.iloc[3*x:3*x+3] carries the labels 3*x to 3*x+2, which only match df2's 0-2 row index on the first pass, so every later column is filled with NaN. The final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output.
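A minimal sketch of the fix, using the first six SQ values from the question:

import pandas as pd

df1 = pd.DataFrame({'SQ': [2430.0, 2870.0, 2890.0, 3270.0, 3350.0, 3520.0]})

df2 = pd.DataFrame()
for x in range(2):
    chunk = df1['SQ'].iloc[3 * x:3 * x + 3]
    # chunk.index is 0-2 on the first pass but 3-5 on the second; assigning
    # the raw Series would align on df2's 0-2 index and leave NaN, so strip
    # the index with .to_numpy() (the modern spelling of .values)
    df2['Experiment %d' % (x + 1)] = chunk.to_numpy()
print(df2)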