I conduct a multiple-choice survey at the start and end of the semester, and I would like to analyze whether students' answers to the questions change significantly from the beginning to the end.
There will be students who answer the first survey but not the second, and vice versa, for numerous reasons. I want to drop those from the analysis.
Note that the students don't all answer at the exact same time (or even on the same day). Some may do it the day before the assignment or the day after, so I can't rely on the date/time; I have to rely on matching email addresses.
The questions have the usual options: "strongly agree", "agree", "not sure", "disagree", "strongly disagree".
My data file looks like this:
Email address: text
Time: date/time
Multiple Choice Q1: [agree, disagree, neutral]
Multiple Choice Q2: [agree, disagree, neutral]
I need to filter out the records of students who didn't answer twice (at the beginning and end of the semester).
I need to come up with a way to quantify how much each answer changed.
I've played around with many ideas, but they are all some form of old-fashioned brute-force looping and saving.
I suspect there's a much more elegant way to do it with Pandas.
Here is a model of the input:
import pandas as pd

input = pd.DataFrame({
    'email': ['joe#sample.com', 'jane#sample.com', 'jack#sample.com',
              'joe#sample.com', 'jane#sample.com', 'jack#sample.com',
              'jerk#sample.com'],
    'date': ['jan 1 2019', 'jan 2 2019', 'jan 1 2019',
             'july 2, 2019', 'july 1 2019', 'july 1, 2019', 'july 1, 2019'],
    'are you happy?': ['yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
    'are you smart?': ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes']})
and here's a model of the output:
output = pd.DataFrame({
    'question': ['are you happy?', 'are you smart?'],
    'change score': [+0.6, +1]})
What a great exercise, thanks for suggesting it.
The logic of the change scores is: for "are you happy?", Joe stayed the same while Jack and Jane went from no to yes, so (0 + 1 + 1)/3. For "are you smart?", all three went from no to yes, so (1 + 1 + 1)/3 = 1. jerk#sample.com is not counted because he didn't respond to the beginning survey, just the ending one.
Here are the first two lines of my data file:
Timestamp,Email Address,How I see myself [I am comfortable in a leadership position],How I see myself [I like and am effective working in a team],How I see myself [I have a feel for business],How I see myself [I have a feel for marketing],How I see myself [I hope to start a company in the future],How I see myself [I like the idea of working at a large company with a global impact],"How I see myself [Carreerwise, I think working at a startup is very risky]","How I see myself [I prefer an unstructured, improvisational job]",How I see myself [I like to know exactly what is expected of me so I can excel],How I see myself [I've heard that I can make a lot of money in a startup and that is important to me so I can support myself and my family],How I see myself [I would never work at a significant company (like Google)],How I see myself [I definitely want to work at a significant company (like Facebook)],How I see myself [I have confidence in my intuitions about creating a successful business],How I see myself [The customer is always right],How I see myself [Don't ask users what they want: they don't know what they want],How I see myself [If you create what customers are asking for you will always be behind],"How I see myself [From the very start of designing a business, it is crucial to talk to users and customers]",What is your best guess of your career 3 years after your graduation?,Class,Year of expected graduation (undergrad or grad),"How I see myself [Imagine you've been working on a new product for months, then discover a competitor with a similar idea. The best response to this is to feel encouraged because this means that what you are working on is a real problem.]",How I see myself [Most startups fail],How I see myself [Row 20],"How I see myself [For an entrepreneur, Strategic skills are more important than having a great (people) network]","How I see myself [Strategic vision is crucial to success, so that one can consider what will happen several moves ahead]",How I see myself [It's important to stay focused on your studies rather than be dabbling in side projects or businesses],How I see myself [Row 23],How I see myself [Row 22]
8/30/2017 18:53:21,s#b.edu,I agree,Strongly agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I disagree,I disagree,I disagree,I disagree,I disagree,Strongly disagree,I agree,working with film production company,Sophomore,2020,,,,,,,,
Starting with your initial DataFrame (which I'll call df), first we convert your date column into a proper datetime:
df['date'] = pd.to_datetime(df['date'])
Then we create two conditions: the first ensures there are at least two entries per email (in case you have duplicate entries), and the second keeps only rows whose dates fall in months 1 and 7, the start and end of the semester. .loc allows us to filter the DataFrame with these boolean conditions.
s = df.groupby('email')['email'].transform('count') >= 2
months = [1,7] # start & end of semester.
df2 = df.loc[(df['date'].dt.month.isin(months)) & (s)]
print(df2)
email date are you happy? are you smart?
0 joe#sample.com 2019-01-01 yes no
1 jane#sample.com 2019-01-02 no no
2 jack#sample.com 2019-01-01 no no
3 joe#sample.com 2019-07-02 yes yes
4 jane#sample.com 2019-07-01 yes yes
5 jack#sample.com 2019-07-01 yes yes
Now we need to reshape our data so we can run some logical tests more easily:
df3 = (
df2.set_index(["email", "date"])
.stack()
.reset_index()
.rename(columns={0: "answer", "level_2": "question"})
.sort_values(["email", "date"])
)
email date question answer
0 jack#sample.com 2019-01-01 are you happy? no
1 jack#sample.com 2019-01-01 are you smart? no
2 jack#sample.com 2019-07-01 are you happy? yes
3 jack#sample.com 2019-07-01 are you smart? yes
Now we need to figure out whether each answer changed between the start of the semester and the end, and if so, assign a score. We will leverage map and create a dictionary from the output DataFrame:
score_dict = dict(zip(output["question"], output["change score"]))
s2 = df3.groupby(["email", "question"])["answer"].apply(lambda x: x.ne(x.shift()))
df3.loc[(s2) & (df3["date"].dt.month == 7), "score"] = df3["question"].map(
score_dict
)
print(df3)
email date question answer score
4 jack#sample.com 2019-01-01 are you happy? no NaN
5 jack#sample.com 2019-01-01 are you smart? no NaN
10 jack#sample.com 2019-07-01 are you happy? yes 0.6
11 jack#sample.com 2019-07-01 are you smart? yes 1.0
2 jane#sample.com 2019-01-02 are you happy? no NaN
3 jane#sample.com 2019-01-02 are you smart? no NaN
8 jane#sample.com 2019-07-01 are you happy? yes 0.6
9 jane#sample.com 2019-07-01 are you smart? yes 1.0
0 joe#sample.com 2019-01-01 are you happy? yes NaN
1 joe#sample.com 2019-01-01 are you smart? no NaN
6 joe#sample.com 2019-07-02 are you happy? yes NaN
7 joe#sample.com 2019-07-02 are you smart? yes 1.0
Logically, we only want to apply a score to an answer that changed, and only on the end-of-semester (July) row, which is why we also check the month.
So Joe has a value of NaN for his "are you happy?" question, as he selected yes at the start of the semester and yes again at the end.
You might want to add some more logic to the scoring, e.g. to treat yes-to-no and no-to-yes changes differently, and judging from your first data row you'll need to clean up your DataFrame, but something along these lines should work.
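As a final step, here is a minimal sketch of how you could compute the change scores directly rather than mapping them from the expected output; it continues from df2 above and assumes exactly two yes/no responses per matched email:

questions = ['are you happy?', 'are you smart?']

# map answers to numbers so each student contributes -1, 0, or +1 per question
num = df2.sort_values('date').replace({'yes': 1, 'no': 0})
first = num.groupby('email')[questions].first()  # start-of-semester answers
last = num.groupby('email')[questions].last()    # end-of-semester answers

# mean change per question across the matched students
scores = (last - first).mean().rename('change score')
print(scores)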
I'm trying to generate a pandas time series where all values are 1.
start=str(timeseries.index[0].round('S'))
end=str(timeseries.index[-1].round('S'))
empty_series_index = pd.date_range(start=start, end=end, freq='2m')
empty_series_values = 1
empty_series = pd.Series(data=empty_series_values, index=empty_series_index)
print(start,end)
print(empty_series)
The printout reads
2019-09-20 00:30:51+00:00 2019-10-30 23:57:35+00:00
2019-09-30 00:30:51+00:00 1
Why is there only one value, even though it's a 2-minute frequency and the range is more than 10 days long?
In the line
empty_series_index = pd.date_range(start=start, end=end, freq='2m')
you are using the frequency string '2m', which actually means 2 months.
If you want minutes you should use '2min' or '2T' (see the offset aliases in the documentation).
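For example (timestamps chosen just for illustration):

import pandas as pd

# '2min' steps by 2 minutes; '2m' steps by 2 month-ends, as in your printout
idx = pd.date_range(start='2019-09-20 00:30:51', periods=5, freq='2min')
print(idx)
# 2019-09-20 00:30:51, 00:32:51, 00:34:51, 00:36:51, 00:38:51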
Hope this helps. Let me know if you have any more questions.
I have a dataset that I shaped according to my needs; the dataframe is as follows:
Index        A     B    C    D     ...  Z
Date/Time    1     0    0    0,35  ...  1
Date/Time    0,75  1    1    1     ...  1
The total number of rows is 8878
What I am trying to do is create a time-series dendrogram (for example: the whole A column is compared to the whole B column over the whole time range).
I am expecting an output like this:
[example dendrogram image; source: rsc.org]
I tried to construct the linkage matrix with Z = hierarchy.linkage(X, 'ward').
However, when I plot the dendrogram, it just shows an empty picture.
There is no problem if I compare every time point with every other and plot, but that way the dendrogram becomes far too complicated to read, even in truncated form.
Is there a way to handle the data as a whole time series and compare within columns in SciPy?
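For concreteness, here is a minimal sketch of the kind of comparison I mean, using random stand-in data (my real frame has 8878 rows); transposing makes each column a single observation, so linkage clusters whole series:

from scipy.cluster import hierarchy
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# stand-in for the real data: 8878 time points, columns A..Z
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((8878, 26)),
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

# transpose so each column (one whole time series) is one observation
Z = hierarchy.linkage(df.T.values, 'ward')
hierarchy.dendrogram(Z, labels=df.columns.tolist())
plt.show()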
Let's say I have the following dataframe:
df = pd.DataFrame({'a':[1,1.1,1.03,3,3.1], 'b':[10,11,12,13,14]})
df
a b
0 1.00 10
1 1.10 11
2 1.03 12
3 3.00 13
4 3.10 14
And I want to group nearby points, e.g.
df.groupby(#SOMETHING).mean():
a b
a
0 1.043333 11.0
1 3.050000 13.5
Now, I could use
#SOMETHING = pd.cut(df.a, np.arange(0, 5, 2), labels=False)
But only if I know the boundaries beforehand. How can I accomplish similar behavior if I don't know where to place the cuts? I.e., I want to group nearby points (with nearby defined as within some epsilon).
I know this isn't trivial, because point x might be near point y, and point y might be near point z, but point x might be too far from z; so it's ambiguous what to do. This is kind of a k-means problem, but I'm wondering if pandas has any tools built in to make this easy.
Use case: I have several processes that generate data on regular intervals, but they're not quite synced up, so the timestamps are close, but not identical, and I want to aggregate their data.
Based on this answer:
df.groupby( (df.a.diff() > 1).cumsum() ).mean()
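To make the epsilon explicit, here is the same idea as a short sketch; it assumes df.a is (at least roughly) sorted, since diff() only compares consecutive rows:

import pandas as pd

df = pd.DataFrame({'a': [1, 1.1, 1.03, 3, 3.1], 'b': [10, 11, 12, 13, 14]})

eps = 0.5  # "nearby" threshold
# start a new group wherever the gap to the previous point exceeds eps
groups = (df['a'].diff() > eps).cumsum()
print(df.groupby(groups).mean())
#           a     b
# a
# 0  1.043333  11.0
# 1  3.050000  13.5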
I have some irregularly stamped time series data in pandas, with timestamps and an observation at every timestamp. Irregular basically means that the gaps between successive timestamps are uneven.
For instance the data may look like
Timestamp Property
0 100
1 200
4 300
6 400
6 401
7 500
14 506
24 550
.....
59 700
61 750
64 800
Here the timestamp is, say, seconds elapsed since a chosen origin time. As you can see, we can have data at the same timestamp, 6 secs in this case. The underlying times are strictly different; it's just that second resolution cannot capture the difference.
Now I need to shift the time series data ahead; say I want to shift the entire data by 60 secs, or a minute. So the target output is
Timestamp Property
0 750
1 800
So the 0 point got matched to the 61 point and the 1 point got matched to the 64 point.
Now I can do this by writing something dirty, but I am looking to use built-in pandas features as much as possible. If the time series were regular, or evenly gapped, I could just have used the shift() function. But the fact that the series is uneven makes it a bit tricky. Any ideas from pandas experts would be welcome. I feel that this is a commonly encountered problem. Many thanks!
Edit: added a second, more elegant way to do it. I don't know what will happen if you had a timestamp at 1 and two timestamps of 61; I think it will choose the first 61 timestamp, but I'm not sure.
new_stamps = pd.Series(range(df['Timestamp'].max()+1))
shifted = pd.DataFrame(new_stamps)
shifted.columns = ['Timestamp']
merged = pd.merge(df,shifted,on='Timestamp',how='outer')
merged['Timestamp'] = merged['Timestamp'] - 60
merged = merged.sort_values('Timestamp').bfill()
results = pd.merge(df,merged, on = 'Timestamp')
[Original Post]
I can't think of an inbuilt or elegant way to do this. Posting this in case it's more elegant than your "something dirty", which I guess is unlikely. How about:
lookup_dict = {}
def assigner(row):
    lookup_dict[row['Timestamp']] = row['Property']

df.apply(assigner, axis=1)
sorted_keys = sorted(lookup_dict.keys())

df['Property_Shifted'] = None
def get_shifted_property(row, shift_amt):
    # find the first timestamp at least shift_amt ahead of this row's
    for i in sorted_keys:
        if i >= row['Timestamp'] + shift_amt:
            row['Property_Shifted'] = lookup_dict[i]
            return row
    return row  # nothing far enough ahead; Property_Shifted stays None

df = df.apply(get_shifted_property, shift_amt=60, axis=1)
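On newer pandas versions, pd.merge_asof may be a more built-in fit. A sketch, assuming the frame is sorted by Timestamp (note that allow_exact_matches=False would require a strictly later timestamp, which is what your 1 -> 64 example implies):

import pandas as pd

df = pd.DataFrame({
    'Timestamp': [0, 1, 4, 6, 6, 7, 14, 24, 59, 61, 64],
    'Property':  [100, 200, 300, 400, 401, 500, 506, 550, 700, 750, 800],
})

# for each row, look up the Property at the first timestamp at least
# 60 seconds ahead; both sides of merge_asof must be sorted on the keys
targets = df[['Timestamp']].assign(target=df['Timestamp'] + 60)
shifted = pd.merge_asof(
    targets,
    df.rename(columns={'Timestamp': 'target_ts'}),
    left_on='target', right_on='target_ts',
    direction='forward',          # first match at or after the target
    # allow_exact_matches=False,  # uncomment to require strictly after
)
result = shifted[['Timestamp', 'Property']].dropna(subset=['Property'])
print(result)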