SurveyMonkey data formatting using Pandas

I have a survey to analyze that was completed by participants on SurveyMonkey. Unfortunately, the way the data are organized is not ideal, in that each categorical response for each question has its own column.
Here, for example, are the first few lines of one of the responses in the dataframe:
How long have you been participating in the Garden Awards Program? \
0 One year
1 NaN
2 NaN
3 NaN
4 NaN
Unnamed: 10 Unnamed: 11 Unnamed: 12 \
0 2-3 years 4-5 years 5 or more years
1 NaN NaN NaN
2 NaN 4-5 years NaN
3 2-3 years NaN NaN
4 NaN NaN 5 or more years
How did you initially learn of the Garden Awards Program? \
0 I nominated my garden to be evaluated
1 NaN
2 I nominated my garden to be evaluated
3 NaN
4 NaN
Unnamed: 14 etc...
0 A friend or family member nominated my garden ...
1 A friend or family member nominated my garden ...
2 NaN
3 NaN
4 NaN
This question, How long have you been participating in the Garden Awards Program?, has the valid responses One year, 2-3 years, etc., which all appear on the first row as a key indicating which column holds which value. This is the first problem. (The same applies to How did you initially learn of the Garden Awards Program?, whose valid responses are I nominated my garden to be evaluated, A friend or family member nominated my garden, etc.)
The second problem is that the extra column for each categorical response is named Unnamed: N, where N runs over as many columns as there are categories across all the questions.
Before I start remapping and flattening/collapsing the columns into a single one per question, I was wondering if there was any other way of dealing with survey data presented like this using Pandas. All my searches pointed to the SurveyMonkey API, but I don't see how that would be useful.
I am guessing that I will need to flatten the columns, so if anyone could suggest a method, that would be great. I'm thinking there is a way to keep grabbing all of the columns belonging to a question by taking adjacent columns until Unnamed is no longer in the column name, but I am clueless as to how to do this.

I will use the following DataFrame (which can be downloaded as CSV from here):
Q1 Unnamed: 2 Unnamed: 3 Q2 Unnamed: 5 Unnamed: 6 Q3 Unnamed: 8 Unnamed: 9
0 A1-A A1-B A1-C A2-A A2-B A2-C A3-A A3-B A3-C
1 A1-A NaN NaN NaN A2-B NaN NaN NaN A3-C
2 NaN A1-B NaN A2-A NaN NaN NaN A3-B NaN
3 NaN NaN A1-C NaN A2-B NaN A3-A NaN NaN
4 NaN A1-B NaN NaN NaN A2-C NaN NaN A3-C
5 A1-A NaN NaN NaN A2-B NaN A3-A NaN NaN
Key assumptions:
Every column whose name DOES NOT start with Unnamed is actually the title of a question
The Unnamed columns between two question titles hold the options for the question at the left end of that interval
Solution overview:
Find indices of where each question starts and ends
Flatten each question to a single column (pd.Series)
Merge the question columns back together
Implementation (part 1):
# Positions and titles of the question columns
indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
questions = [c for c in df.columns if not c.startswith('Unnamed')]
# One slice per question: its title column up to (but not including) the next title
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]
You can see that by iterating over the slices as below, you get a single DataFrame corresponding to each question:
for q in slices:
    print(df.iloc[:, q])  # Use `display` if using Jupyter
Implementation (parts 2-3):
import numpy as np

def parse_response(s):
    # Return the single non-null entry in this row, or NaN if there is none
    try:
        return s[~s.isnull()].iloc[0]
    except IndexError:
        return np.nan

# Skip row 0 (the key row) and collapse each question's columns into one Series
data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
df = pd.concat(data, axis=1)
df.columns = questions
Output:
Q1 Q2 Q3
1 A1-A A2-B A3-C
2 A1-B A2-A A3-B
3 A1-C A2-B A3-A
4 A1-B A2-C A3-C
5 A1-A A2-B A3-A
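
As an aside, a more compact alternative (a sketch, not from the original answer; it starts again from the raw frame with its Unnamed: N labels) is to forward-fill the question titles across their option columns and then collapse the duplicate-named columns:

import numpy as np
import pandas as pd

# Relabel each 'Unnamed: N' column with the question title to its left
cols = pd.Series(df.columns)
df.columns = cols.where(~cols.str.startswith('Unnamed'), np.nan).ffill()

# Drop the key row (row 0), then keep the first non-null value per question
flat = df.iloc[1:].T.groupby(level=0, sort=False).first().T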

Related

How to split this kind of column in a dataframe?

I have a data frame like the one below:
Name Rating Review Price
1 The park NaN NaN 5040
2 The Westin Good 7.6 NaN 6045
3 Courtyard NaN NaN 4850
4 Radisson Excellent 9.8 NaN 7050
5 Banjara Average 6.7 NaN 5820
6 Mindspace NaN NaN 8000
My required output is like this:
Name Review Rating Price
1 The park NaN NaN 5040
2 The Westin Good 7.6 6045
3 Courtyard NaN NaN 4850
4 Radisson Excellent 9.8 7050
5 Banjara Average 6.7 5820
6 Mindspace NaN NaN 8000
I use this split function:
df[["review","ratings"]] = df["rating"].str.split(expand=True)
But I got this error: 'Columns must be same length as key'.
How can I split this type of data? Can anyone help me?
The problem is that at least one of the values contains more than one space. You can add n=1 to split only on the first space:
df[["review","ratings"]] = df["Rating"].str.split(expand=True, n=1)
Or use rsplit with n=1 to split on the last space:
df[["review","ratings"]] = df["Rating"].str.rsplit(expand=True, n=1)
Another idea is to use Series.str.extract with a regex that captures everything before the space preceding the float:
df[["review","ratings"]] = df["Rating"].str.extract(r'(.*)\s+(\d+\.\d+)')

How to perform nearest neighbor imputation in a time-series dataset?

I have a pandas Series of the form
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 0.0
9 1.0
10 NaN
11 NaN
12 NaN
13 0.0
...
The values can either be 0.0 or 1.0. From my knowledge of the data, however, the 0s come in groups. Meaning, entries 0-8 should be 0, then entries 9-12 should all be 1s, and then 13+ will be 0s. Therefore, I believe the best way to impute the NaNs would be some kind of nearest-neighbor fill. However, it should obviously return a 0 or 1 and not an average value. Please let me know of any way to do this!
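
A minimal sketch of one approach (assuming the integer index is the time axis; interpolate(method='nearest') requires SciPy, and ffill/bfill cover the leading and trailing NaNs that nearest interpolation leaves untouched):

import numpy as np
import pandas as pd

s = pd.Series([np.nan] * 8 + [0.0, 1.0] + [np.nan] * 3 + [0.0])

# 'nearest' fills each interior NaN with the closest observed value,
# so the result is always an existing 0.0 or 1.0, never an average
filled = s.interpolate(method='nearest').ffill().bfill()
print(filled.tolist())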

How to remove outliers based on number of column values in each row

I am new to data science, trying to solve a course exercise for a movie recommender system, and I want to drop rows based on the total count of column values in each row.
i.e.
if someone rated too many movies, they should be dropped to filter the final results.
I found a traditional way of doing it, but I am not satisfied; it would be really helpful if someone could suggest a more pythonic way of solving the problem.
Here is the table, named userRatings:
title Zeus and Roxanne (1997) unknown Á köldum klaka (Cold Fever) (1994)
user_id
0 NaN NaN NaN
1 NaN 4.0 NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN 4.0 NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
[10 rows x 1664 columns]
And here is the code I tried to solve the problem:
for index in userRatings.index:
    if userRatings.loc[index].count() > 500:
        userRatings = userRatings.drop(index)
I'm assuming you have a Pandas DataFrame... if so, one alternative would be something like this:
valid_rating_ixs = userRatings.count(axis=1) <= 500
userRatings_cleaned = userRatings[valid_rating_ixs]
Note that my code above, and also yours, may be including columns that are not ratings (e.g. user_id). You may need to check that you are using only the relevant columns of your data frame.
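
For instance, a small sketch (hypothetical; it assumes any non-rating columns, such as a user_id stored as a column rather than as the index, can be excluded by name) of restricting the count to the rating columns only:

# hypothetical: drop identifier columns before counting ratings per user
rating_cols = userRatings.columns.drop('user_id', errors='ignore')
valid_rating_ixs = userRatings[rating_cols].count(axis=1) <= 500
userRatings_cleaned = userRatings[valid_rating_ixs]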

Create new dataframe columns from old dataframe rows using for loop --> N/A values

I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
And then created a second dataframe, df2, that I want to fill with the row values of df1:
df2 = pd.DataFrame()
for x in range(12):
    y = 'Experiment %d' % (x + 1)
    df2[y] = df1.iloc[3*x:3*x+3]
df2
I get the column names Experiment 1 through Experiment 12 in df2, and the first column is filled with the right values, but all following columns are filled with N/A.
   Experiment 1  Experiment 2  Experiment 3  Experiment 4  Experiment 5  Experiment 6  Experiment 7  Experiment 8  Experiment 9  Experiment 10  Experiment 11  Experiment 12
0        2430.0           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN            NaN            NaN            NaN
1        2870.0           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN            NaN            NaN            NaN
2        2890.0           NaN           NaN           NaN           NaN           NaN           NaN           NaN           NaN            NaN            NaN            NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue: I had to use .values. The final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output. The reason is that pandas aligns on the index when assigning: the slices after the first carry indices 3-5, 6-8, and so on, which don't match df2's index of 0-2, so the assignment produces NaNs. Using .values strips the index so the values are assigned positionally.
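
Alternatively, a vectorized sketch (assuming df1 has exactly 36 rows, i.e. 12 experiments of 3 values each) that avoids both the loop and the alignment issue:

import pandas as pd

# reshape the 36 values into 12 experiments of 3 readings, then transpose
df2 = pd.DataFrame(
    df1['SQ'].to_numpy().reshape(12, 3).T,
    columns=['Experiment %d' % (i + 1) for i in range(12)],
)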

Pandas DataFrame + object type + HDF + PyTables 'table'

(Editing to clarify my application, sorry for any confusion)
I run an experiment broken up into trials. Each trial can produce invalid data or valid data. When there is valid data the data take the form of a list of numbers which can be of zero length.
So an invalid trial produces None, and a valid trial can produce [] or [1,2], etc.
Ideally, I'd like to be able to save this data as a frame_table (call it data). I have another table (call it trials) that is easily converted into a frame_table and which I use as a selector to extract rows (trials). I would then like to pull up my data using select_as_multiple.
Right now, I'm saving the data structure as a regular table as I'm using an object array. I realize folks are saying this is inefficient, but I can't think of an efficient way to handle the variable length nature of data.
I understand that I can use NaNs and make a (potentially very wide) table whose max width is the maximum length of my data array, but then I need a different mechanism to flag invalid trials. A row with all NaNs is confusing - does it mean that I had a zero length data trial or did I have an invalid trial?
I think there is no good solution to this using Pandas. The NaN solution leads me to potentially extremely wide tables and an additional column marking valid/invalid trials.
If I used a database I would make the data a binary blob column. With Pandas my current working solution is to save data as an object array in a regular frame and load it all in and then pull out the relevant indexes based on my trials table.
This is slightly inefficient, since I'm reading my whole data table in one go, but it's the most workable/extendable scheme I have come up with.
But I welcome most enthusiastically a more canonical solution.
Thanks so much for all your time!
EDIT: Adding code (Jeff's suggestion)
import pandas as pd, numpy
mydata = [numpy.empty(n) for n in range(1,11)]
df = pd.DataFrame(mydata)
In [4]: df
Out[4]:
0
0 [1.28822975392e-231]
1 [1.28822975392e-231, -2.31584192385e+77]
2 [1.28822975392e-231, -1.49166823584e-154, 2.12...
3 [1.28822975392e-231, 1.2882298313e-231, 2.1259...
4 [1.28822975392e-231, 1.72723381477e-77, 2.1259...
5 [1.28822975392e-231, 1.49166823584e-154, 1.531...
6 [1.28822975392e-231, -2.68156174706e+154, 2.20...
7 [1.28822975392e-231, -2.68156174706e+154, 2.13...
8 [1.28822975392e-231, -1.3365130604e-315, 2.222...
9 [1.28822975392e-231, -1.33651054067e-315, 2.22...
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
0 10 non-null values
dtypes: object(1)
df.to_hdf('test.h5','data')
--> OK
df.to_hdf('test.h5','data1',table=True)
--> ...
TypeError: Cannot serialize the column [0] because
its data contents are [mixed] object dtype
Here's a simple example along the lines of what you have described:
In [16]: from pandas import DataFrame; from numpy.random import randn; import numpy as np
In [17]: df = DataFrame(randn(10,10))
In [18]: df.iloc[5:10,7:9] = np.nan
In [19]: df.iloc[7:10,4:9] = np.nan
In [22]: df.iloc[7:10,-1] = np.nan
In [23]: df
Out[23]:
0 1 2 3 4 5 6 7 8 9
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN
In [24]: df['stop'] = df.apply(lambda x: x.last_valid_index(), 1)  # label of the last non-null entry in each row
In [25]: df
Out[25]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
Note that in 0.12 you should use table=True rather than fmt='t' (this option is in the process of changing):
In [26]: df.to_hdf('test.h5','df',mode='w',fmt='t')
In [27]: pd.read_hdf('test.h5','df')
Out[27]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
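
To close the loop, here is a sketch (not part of the original answer) of reading the padded frame back and reconstructing each ragged row from the stop column. Since stop holds the label of the last valid column and the columns here are 0-9 (labels equal to positions), an integer slice works; a null stop would mean a row with no stored values, which, as the question notes, still conflates zero-length and invalid trials unless an extra validity column is added:

out = pd.read_hdf('test.h5', 'df')
values = out.drop(columns='stop')

# rebuild each trial's list; None marks rows with no stored values
trials = [
    values.iloc[i, : int(s) + 1].tolist() if pd.notnull(s) else None
    for i, s in enumerate(out['stop'])
]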