Pandas rolling sum of prior n elements with NaN values

I have a pandas Series named df which looks like:
NaN
2
3
NaN
NaN
4
6
4
8
I would like to calculate the rolling sum only if there are 5 prior elements. If there are fewer than 5 prior elements, the output should be NaN. When there are five prior elements, any NaN among them should be treated as zero.
I tried
df.rolling(window=5).sum()
but I get only NaN, which is not what I am looking for. I also tried min_periods=1 (suggested in many Stack Overflow posts), but it does not produce what I want.
[Image showing the input and expected output omitted.]
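A minimal sketch of one way to get this behaviour (assuming the window includes the current element; apply shift(1) first if only strictly prior elements should count): fill the NaNs with zero before rolling, and keep the default min_periods, which equals the window size, so the first four outputs stay NaN.
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2, 3, np.nan, np.nan, 4, 6, 4, 8])

# NaN counts as zero inside the window; min_periods defaults to the window
# size, so positions 0-3 have fewer than 5 elements and come out as NaN.
result = s.fillna(0).rolling(window=5).sum()
print(result)
# 0     NaN
# 1     NaN
# 2     NaN
# 3     NaN
# 4     5.0   (0 + 2 + 3 + 0 + 0)
# 5     9.0
# 6    13.0
# 7    14.0
# 8    22.0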

Related

How to perform nearest neighbor imputation in a time-series dataset?

I have a pandas Series of the form
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 0.0
9 1.0
10 NaN
11 NaN
12 NaN
13 0.0
...
The values can either be 0.0 or 1.0. From my knowledge of the data, however, the 0's come in groups: entries 0-8 should be 0, entries 9-12 should all be 1's, and 13+ will be 0's again. Therefore, I believe the best way to impute the NaN's would be some kind of nearest-neighbor fill. However, it should obviously return a 0 or 1 and not an average value. Please let me know of any way to do this!
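A possible sketch (assuming SciPy is installed, since pandas delegates method='nearest' to it): interior gaps take the value of the nearest valid neighbor, then ffill/bfill extend the first and last observed values to the edges, so every imputed value is an observed 0.0 or 1.0, never an average. Exact ties between two neighbors are broken by the interpolator, which may not match your domain knowledge.
import numpy as np
import pandas as pd

s = pd.Series([np.nan] * 8 + [0.0, 1.0, np.nan, np.nan, np.nan, 0.0])

# Fill interior gaps with the nearest observed value (needs SciPy), then
# extend the edge values outward to cover leading/trailing NaNs.
imputed = s.interpolate(method='nearest').ffill().bfill()
print(imputed)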

How to remove outliers based on number of column values in each row

I am new to data science and am working through a course exercise on a movie recommender system. I want to drop rows based on the total count of values across the columns of each row, i.e. if someone rated too many movies, they should be dropped to filter out the final results.
I found a traditional way of doing it, but I am not satisfied with it; it would be really helpful if someone could show me a more Pythonic way of solving the problem.
Here is the table named userRatings
title Zeus and Roxanne (1997) unknown Á köldum klaka (Cold Fever) (1994)
user_id
0 NaN NaN NaN
1 NaN 4.0 NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN 4.0 NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 NaN NaN NaN
[10 rows x 1664 columns]
And here is the code I tried to solve the problem:
for index in userRatings.index:
    if userRatings.loc[index].count() > 500:
        userRatings = userRatings.drop(index)
I'm assuming you have a Pandas DataFrame... if so, one alternative would be something like this:
valid_rating_ixs = userRatings.count(axis=1) <= 500
userRatings_cleaned = userRatings[valid_rating_ixs]
Note that in my code above, and also in your code, you may be including columns that are not ratings (e.g. user_id). You may need to check that you are using only the relevant columns of your data frame.
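As a quick self-contained check of that filter (the 600/400 rating counts below are made up for illustration):
import numpy as np
import pandas as pd

# Toy frame: user 0 has 600 ratings, user 1 has 400; the rest are NaN.
userRatings = pd.DataFrame(np.nan, index=[0, 1], columns=range(1664))
userRatings.iloc[0, :600] = 4.0
userRatings.iloc[1, :400] = 3.0

valid_rating_ixs = userRatings.count(axis=1) <= 500
print(userRatings[valid_rating_ixs].index.tolist())  # [1]: the heavy rater is dropped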

SurveyMonkey data formatting using Pandas

I have a survey to analyze that was completed by participants on SurveyMonkey. Unfortunately, the way the data are organized is not ideal, in that each categorical response for each question has its own column.
Here, for example, are the first few lines of one of the responses in the dataframe:
How long have you been participating in the Garden Awards Program? \
0 One year
1 NaN
2 NaN
3 NaN
4 NaN
Unnamed: 10 Unnamed: 11 Unnamed: 12 \
0 2-3 years 4-5 years 5 or more years
1 NaN NaN NaN
2 NaN 4-5 years NaN
3 2-3 years NaN NaN
4 NaN NaN 5 or more years
How did you initially learn of the Garden Awards Program? \
0 I nominated my garden to be evaluated
1 NaN
2 I nominated my garden to be evaluated
3 NaN
4 NaN
Unnamed: 14 etc...
0 A friend or family member nominated my garden ...
1 A friend or family member nominated my garden ...
2 NaN
3 NaN
4 NaN
This question, How long have you been participating in the Garden Awards Program?, has the valid responses One year, 2-3 years, etc., which are all found in the first row as a key to which column holds which value. This is the first problem. (Similarly for How did you initially learn of the Garden Awards Program?, where the valid responses are I nominated my garden to be evaluated, A friend or family member nominated my garden, etc.)
The second problem is that the columns attached to each categorical response are all named Unnamed: N, where N runs over as many columns as there are categories associated with all the questions.
Before I start remapping and flattening/collapsing the columns into a single one per question, I was wondering if there was any other way of dealing with survey data presented like this using Pandas. All my searches pointed to the SurveyMonkey API, but I don't see how that would be useful.
I am guessing that I will need to flatten the columns, and thus, if anyone could suggest a method, that would be great. I'm thinking that there is a way to keep grabbing all columns belonging to a categorical response by grabbing an adjacent column until Unnamed is no longer in the column name, but I am clueless how to do this.
I will use the following DataFrame (which can be downloaded as CSV from here):
Q1 Unnamed: 2 Unnamed: 3 Q2 Unnamed: 5 Unnamed: 6 Q3 Unnamed: 8 Unnamed: 9
0 A1-A A1-B A1-C A2-A A2-B A2-C A3-A A4-B A3-C
1 A1-A NaN NaN NaN A2-B NaN NaN NaN A3-C
2 NaN A1-B NaN A2-A NaN NaN NaN A4-B NaN
3 NaN NaN A1-C NaN A2-B NaN A3-A NaN NaN
4 NaN A1-B NaN NaN NaN A2-C NaN NaN A3-C
5 A1-A NaN NaN NaN A2-B NaN A3-A NaN NaN
Key assumptions:
Every column whose name DOES NOT start with Unnamed is actually the title of a question
The columns between question titles represent options for the question on the left end of the column interval
Solution overview:
Find indices of where each question starts and ends
Flatten each question to a single column (pd.Series)
Merge the question columns back together
Implementation (part 1):
indices = [i for i, c in enumerate(df.columns) if not c.startswith('Unnamed')]
questions = [c for c in df.columns if not c.startswith('Unnamed')]
slices = [slice(i, j) for i, j in zip(indices, indices[1:] + [None])]
You can see that by iterating over the slices as below, you get a single DataFrame corresponding to each question:
for q in slices:
    print(df.iloc[:, q])  # Use `display` if using Jupyter
Implementation (part 2-3):
import numpy as np
import pandas as pd

def parse_response(s):
    # Return the first non-null value in the row, or NaN if there is none
    try:
        return s[~s.isnull()].iloc[0]
    except IndexError:
        return np.nan

# [1:] drops the first row, which holds the answer key rather than a response
data = [df.iloc[:, q].apply(parse_response, axis=1)[1:] for q in slices]
df = pd.concat(data, axis=1)
df.columns = questions
Output:
Q1 Q2 Q3
1 A1-A A2-B A3-C
2 A1-B A2-A A4-B
3 A1-C A2-B A3-A
4 A1-B A2-C A3-C
5 A1-A A2-B A3-A

Python Pandas: add two different data frames

I am trying to sum different data frames, say dataframe a, dataframe b, and dataframe c.
Dataframe a is defined within the python code like this:
a=pd.DataFrame(index=range(0,8), columns=[0])
a.iloc[:,0]=0
(a.iloc[:,0]=0 is used to enable arithmetic operations, i.e., replacing NaN with zero.)
Dataframe b and Dataframe c are called from an excel sheet like this:
b=pd.read_excel("Test1.xlsx")
c=pd.read_excel("Test2.xlsx")
The excel sheets contain the same number of rows as Dataframe a. The sample is:
10
11
12
13
14
15
16
17
18
19
Now when I try to add them, b+c gives fine output, but a+b or a+c gives this:
0 10
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
8 NaN NaN
Why is this happening, even after assigning numbers to Dataframe a?
Please help.
Pandas will take care of the indexing for you. You should be able to generate and add dataframes as shown here:
import pandas as pd
a = pd.DataFrame(list(range(8)))
b = pd.DataFrame(list(range(9,17)))
c = a + b
Using the code you provided to generate data produces a dataframe with only zeroes. Note that even if you generate two of those and add them, you will again get a dataframe with all zeroes.
a = pd.DataFrame(index=range(0,8), columns=[0])
a.iloc[:,0] = 0
b = pd.DataFrame(index=range(0,8), columns=[0])
b.iloc[:,0] = 0
c = a + b # All zeroes
I am also able to add all combinations such as b+c.
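A likely cause, sketched under the assumption that Test1.xlsx holds the single column of numbers shown above: pd.read_excel consumes the first spreadsheet row as the header, so b ends up with a column labelled 10 while a has a column labelled 0, and pandas aligns additions by label, producing NaN wherever labels differ (which also explains the 0 and 10 column names in the output above). Reading with header=None keeps the default integer labels so they match:
import pandas as pd

b = pd.read_excel("Test1.xlsx", header=None)  # column label is 0, matching a
c = pd.read_excel("Test2.xlsx", header=None)

# Zeros with the same row and column labels as b, so addition is element-wise.
a = pd.DataFrame(0, index=b.index, columns=b.columns)
print(a + b)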

Pandas DataFrame + object type + HDF + PyTables 'table'

(Editing to clarify my application, sorry for any confusion)
I run an experiment broken up into trials. Each trial can produce invalid data or valid data. When there is valid data the data take the form of a list of numbers which can be of zero length.
So an invalid trial produces None, and a valid trial can produce [] or [1,2], etc.
Ideally, I'd like to be able to save this data as a frame_table (call it data). I have another table (call it trials) that is easily converted into a frame_table and which I use as a selector to extract rows (trials). I would then like to pull up my data using select_as_multiple.
Right now, I'm saving the data structure as a regular table as I'm using an object array. I realize folks are saying this is inefficient, but I can't think of an efficient way to handle the variable length nature of data.
I understand that I can use NaNs and make a (potentially very wide) table whose max width is the maximum length of my data array, but then I need a different mechanism to flag invalid trials. A row with all NaNs is confusing - does it mean that I had a zero length data trial or did I have an invalid trial?
I think there is no good solution to this using Pandas. The NaN solution leads me to potentially extremely wide tables and an additional column marking valid/invalid trials.
If I used a database I would make the data a binary blob column. With Pandas my current working solution is to save data as an object array in a regular frame and load it all in and then pull out the relevant indexes based on my trials table.
This is slightly inefficient, since I'm reading my whole data table in one go, but it's the most workable/extendable scheme I have come up with.
But I welcome most enthusiastically a more canonical solution.
Thanks so much for all your time!
EDIT: Adding code (Jeff's suggestion)
import pandas as pd, numpy
mydata = [numpy.empty(n) for n in range(1,11)]
df = pd.DataFrame(mydata)
In [4]: df
Out[4]:
0
0 [1.28822975392e-231]
1 [1.28822975392e-231, -2.31584192385e+77]
2 [1.28822975392e-231, -1.49166823584e-154, 2.12...
3 [1.28822975392e-231, 1.2882298313e-231, 2.1259...
4 [1.28822975392e-231, 1.72723381477e-77, 2.1259...
5 [1.28822975392e-231, 1.49166823584e-154, 1.531...
6 [1.28822975392e-231, -2.68156174706e+154, 2.20...
7 [1.28822975392e-231, -2.68156174706e+154, 2.13...
8 [1.28822975392e-231, -1.3365130604e-315, 2.222...
9 [1.28822975392e-231, -1.33651054067e-315, 2.22...
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 9
Data columns (total 1 columns):
0 10 non-null values
dtypes: object(1)
df.to_hdf('test.h5','data')
--> OK
df.to_hdf('test.h5','data1',table=True)
--> ...
TypeError: Cannot serialize the column [0] because
its data contents are [mixed] object dtype
Here's a simple example along the lines of what you have described (an IPython session where DataFrame, randn and numpy as np are already imported):
In [17]: df = DataFrame(randn(10,10))
In [18]: df.iloc[5:10,7:9] = np.nan
In [19]: df.iloc[7:10,4:9] = np.nan
In [22]: df.iloc[7:10,-1] = np.nan
In [23]: df
Out[23]:
0 1 2 3 4 5 6 7 8 9
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN
In [24]: df['stop'] = df.apply(lambda x: x.last_valid_index(), 1)
In [25]: df
Out[25]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
Note that in 0.12 you should use table=True, rather than fmt (this is in the process of changing)
In [26]: df.to_hdf('test.h5','df',mode='w',fmt='t')
In [27]: pd.read_hdf('test.h5','df')
Out[27]:
0 1 2 3 4 5 6 7 8 9 stop
0 -1.671523 0.277972 -1.217315 -1.390472 0.944464 -0.699266 0.348579 0.635009 -0.330561 -0.121996 9
1 0.239482 -0.050869 0.488322 -0.668864 0.125534 -0.159154 1.092619 -0.638932 -0.091755 0.291824 9
2 0.432216 -1.101879 2.082755 -0.500450 0.750278 -1.960032 -0.688064 -0.674892 3.225115 1.035806 9
3 0.775353 -1.320165 -0.180931 0.342537 2.009530 0.913223 0.581071 -1.111551 1.118720 -0.081520 9
4 -0.255524 0.143255 -0.230755 -0.306252 0.748510 0.367886 -1.032118 0.232410 1.415674 -0.420789 9
5 -0.850601 0.273439 -0.272923 -1.248670 0.041129 0.506832 0.878972 NaN NaN 0.433333 9
6 -0.353375 -2.400167 -1.890439 -0.325065 -1.197721 -0.775417 0.504146 NaN NaN -0.635012 9
7 -0.241512 0.159100 0.223019 -0.750034 NaN NaN NaN NaN NaN NaN 3
8 -1.511968 -0.391903 0.257445 -1.642250 NaN NaN NaN NaN NaN NaN 3
9 -0.376762 0.977394 0.760578 0.964489 NaN NaN NaN NaN NaN NaN 3
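Coming back to the zero-length-versus-invalid ambiguity raised in the question, here is a hedged sketch of how the wide NaN layout can carry explicit marker columns (the names valid and stop are illustrative, not part of any pandas API):
import numpy as np
import pandas as pd

trials = [None, [], [1, 2], [3]]  # None marks an invalid trial

# Pad every valid trial with NaN to the maximum observed length.
width = max(len(t) for t in trials if t is not None)
rows = [[np.nan] * width if t is None
        else list(t) + [np.nan] * (width - len(t))
        for t in trials]

df = pd.DataFrame(rows)
df['valid'] = [t is not None for t in trials]                   # invalid flag
df['stop'] = [np.nan if t is None else len(t) for t in trials]  # data length
print(df)  # an all-NaN data row with valid=True is a zero-length trial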