Identify pairs of matching records using Pandas for further analysis - pandas

I conduct a multiple-choice survey at the start and end of the semester, and I would like to analyze whether students' answers to the questions change significantly from beginning to end.
There will be students who answer the first survey but not the second, and vice versa, for numerous reasons. I want to drop those from the analysis.
Note that the students don't all answer at the exact same time (or even on the same day). Some may do it the day before the assignment or the day after, so I can't rely on the date/time; I have to rely on matching email addresses.
The questions have the usual "strongly agree or disagree, agree or disagree, or not sure" options.
My data file looks like this:
Email address: text
Time: date/time
Multiple Choice Q1: [agree, disagree, neutral]
Multiple Choice Q2: [agree, disagree, neutral]
I need to filter out the records for students who didn't answer twice (at the beginning and end of the semester).
I need to come up with a way to quantify how much each answer changed.
I've played around with many ideas, but they are all some form of brute-force, old-fashioned looping and saving.
Using Pandas I suspect there's a much more elegant way to do it.
Here is a model of the input:
import pandas as pd

input = pd.DataFrame({
    'email': ['joe#sample.com', 'jane#sample.com', 'jack#sample.com',
              'joe#sample.com', 'jane#sample.com', 'jack#sample.com', 'jerk#sample.com'],
    'date': ['jan 1 2019', 'jan 2 2019', 'jan 1 2019',
             'july 2, 2019', 'july 1 2019', 'july 1, 2019', 'july 1, 2019'],
    'are you happy?': ['yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
    'are you smart?': ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes']})
and here's a model of the output:
output = pd.DataFrame({'question': ['are you happy?', 'are you smart?'],
                       'change score': [+0.6, +1]})
What a great exercise, thanks for suggesting it.
The logic of the change scores is that for "are you happy?" Joe stayed the same, and Jack and Jane went from no to yes, so (0 + 1 + 1)/3. For "are you smart?" all three went from no to yes, so (1 + 1 + 1)/3 = 1. jerk#sample.com is not counted because he didn't respond to the beginning survey, just the ending one.
Here are the first two lines of my data file:
Timestamp,Email Address,How I see myself [I am comfortable in a leadership position],How I see myself [I like and am effective working in a team],How I see myself [I have a feel for business],How I see myself [I have a feel for marketing],How I see myself [I hope to start a company in the future],How I see myself [I like the idea of working at a large company with a global impact],"How I see myself [Carreerwise, I think working at a startup is very risky]","How I see myself [I prefer an unstructured, improvisational job]",How I see myself [I like to know exactly what is expected of me so I can excel],How I see myself [I've heard that I can make a lot of money in a startup and that is important to me so I can support myself and my family],How I see myself [I would never work at a significant company (like Google)],How I see myself [I definitely want to work at a significant company (like Facebook)],How I see myself [I have confidence in my intuitions about creating a successful business],How I see myself [The customer is always right],How I see myself [Don't ask users what they want: they don't know what they want],How I see myself [If you create what customers are asking for you will always be behind],"How I see myself [From the very start of designing a business, it is crucial to talk to users and customers]",What is your best guess of your career 3 years after your graduation?,Class,Year of expected graduation (undergrad or grad),"How I see myself [Imagine you've been working on a new product for months, then discover a competitor with a similar idea. The best response to this is to feel encouraged because this means that what you are working on is a real problem.]",How I see myself [Most startups fail],How I see myself [Row 20],"How I see myself [For an entrepreneur, Strategic skills are more important than having a great (people) network]","How I see myself [Strategic vision is crucial to success, so that one can consider what will happen several moves ahead]",How I see myself [It's important to stay focused on your studies rather than be dabbling in side projects or businesses],How I see myself [Row 23],How I see myself [Row 22]
8/30/2017 18:53:21,s#b.edu,I agree,Strongly agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I agree,I agree,I'm not sure,I disagree,I disagree,I disagree,I disagree,I disagree,Strongly disagree,I agree,working with film production company,Sophomore,2020,,,,,,,,

Starting with your initial data frame, first we convert the date column into a proper datetime:
df['date'] = pd.to_datetime(df['date'])
Then we create two filters: the first keeps emails that appear at least twice (assuming you may have duplicate entries), and the second keeps rows whose dates fall in months 1 and 7, the start and end of the semester. .loc lets us apply these boolean conditions to the dataframe.
s = df.groupby('email')['email'].transform('count') >= 2
months = [1,7] # start & end of semester.
df2 = df.loc[(df['date'].dt.month.isin(months)) & (s)]
print(df2)
email date are you happy? are you smart?
0 joe#sample.com 2019-01-01 yes no
1 jane#sample.com 2019-01-02 no no
2 jack#sample.com 2019-01-01 no no
3 joe#sample.com 2019-07-02 yes yes
4 jane#sample.com 2019-07-01 yes yes
5 jack#sample.com 2019-07-01 yes yes
Now we need to reshape our data so we can run some logical tests more easily:
df3 = (
df2.set_index(["email", "date"])
.stack()
.reset_index()
.rename(columns={0: "answer", "level_2": "question"})
.sort_values(["email", "date"])
)
email date question answer
0 jack#sample.com 2019-01-01 are you happy? no
1 jack#sample.com 2019-01-01 are you smart? no
2 jack#sample.com 2019-07-01 are you happy? yes
3 jack#sample.com 2019-07-01 are you smart? yes
Now we need to figure out whether each answer changed between the start of the semester and the end and, if so, assign a score. We will leverage map and create a dictionary from the output dataframe.
score_dict = dict(zip(output["question"], output["change score"]))
s2 = df3.groupby(["email", "question"])["answer"].apply(lambda x: x.ne(x.shift()))
df3.loc[(s2) & (df3["date"].dt.month == 7), "score"] = df3["question"].map(
score_dict
)
print(df3)
email date question answer score
4 jack#sample.com 2019-01-01 are you happy? no NaN
5 jack#sample.com 2019-01-01 are you smart? no NaN
10 jack#sample.com 2019-07-01 are you happy? yes 0.6
11 jack#sample.com 2019-07-01 are you smart? yes 1.0
2 jane#sample.com 2019-01-02 are you happy? no NaN
3 jane#sample.com 2019-01-02 are you smart? no NaN
8 jane#sample.com 2019-07-01 are you happy? yes 0.6
9 jane#sample.com 2019-07-01 are you smart? yes 1.0
0 joe#sample.com 2019-01-01 are you happy? yes NaN
1 joe#sample.com 2019-01-01 are you smart? no NaN
6 joe#sample.com 2019-07-02 are you happy? yes NaN
7 joe#sample.com 2019-07-02 are you smart? yes 1.0
Logically, we only want to apply a score to a value that changed, and only on the end-of-semester (July) rows, not the starting ones.
So Joe has a value of NaN for his "are you happy?" question, as he selected yes at the start of the semester and yes at the end.
You might want to add some more logic to the scoring to treat the yes/no direction differently, and judging from your first row you'll need to clean up your dataframe, but something along these lines should work.
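If you only need the aggregate change scores themselves, a more direct route is to melt the frame to long format, subtract the start answer from the end answer per student and question, and average per question. This is just a minimal sketch assuming yes/no answers encoded as 1/0 and the change-count logic described in the question (it yields 2/3 and 1.0 for the toy data):
import pandas as pd

df = pd.DataFrame({
    'email': ['joe#sample.com', 'jane#sample.com', 'jack#sample.com',
              'joe#sample.com', 'jane#sample.com', 'jack#sample.com', 'jerk#sample.com'],
    'date': ['jan 1 2019', 'jan 2 2019', 'jan 1 2019',
             'july 2, 2019', 'july 1 2019', 'july 1, 2019', 'july 1, 2019'],
    'are you happy?': ['yes', 'no', 'no', 'yes', 'yes', 'yes', 'no'],
    'are you smart?': ['no', 'no', 'no', 'yes', 'yes', 'yes', 'yes']})
df['date'] = pd.to_datetime(df['date'])

# keep only students who responded exactly twice (start and end of semester)
both = df[df.groupby('email')['email'].transform('count') == 2]

# long format: one row per (email, date, question, answer), answers encoded as 1/0
long = both.melt(id_vars=['email', 'date'], var_name='question', value_name='answer')
long['value'] = long['answer'].map({'no': 0, 'yes': 1})

# per student and question: end answer minus start answer (+1, 0 or -1)
change = (long.sort_values('date')
              .groupby(['email', 'question'])['value']
              .agg(lambda s: s.iloc[-1] - s.iloc[0]))

# average change per question: are you happy? -> 0.666..., are you smart? -> 1.0
print(change.groupby(level='question').mean())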

Related

How can I detect similarity of names in the same columns

Guys I have a dataset like this:
import pandas as pd

df = pd.DataFrame(data=['John', 'gal britt', 'mona', 'diana', 'molly',
                        'merry', 'mony', 'molla', 'johnathon', 'dina'],
                  columns=['Name'])
df
It gives this output:
Name
0 John
1 gal britt
2 mona
3 diana
4 molly
5 merry
6 mony
7 molla
8 johnathon
9 dina
So I imagine that, to compare all names against each other and detect similarity, I would use df.merge(df, how="cross").
The thing is, the real data has 40,000 rows, and performing this will result in a very big dataset which I don't have the memory for.
Any algorithm or idea would really help, and I'll adjust the logic to my purposes.
I tried working with vaex instead of pandas to handle this amount of data, but I still run into insufficient memory allocation.
In short: I KNOW that this algorithm, or this way of thinking about such a problem, is wrong and inefficient.
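One way to sidestep the full cross join is blocking: group the names by a cheap key and only compare pairs within each block. Here is a minimal sketch using the standard library's difflib; the blocking key (lower-cased first letter), the 0.7 threshold, and the output column names are illustrative assumptions you would tune for your data:
import pandas as pd
from difflib import SequenceMatcher
from itertools import combinations

df = pd.DataFrame({'Name': ['John', 'gal britt', 'mona', 'diana', 'molly',
                            'merry', 'mony', 'molla', 'johnathon', 'dina']})

# block on a cheap key so pairs are only generated inside each block,
# not across the full 40,000 x 40,000 grid
df['key'] = df['Name'].str.lower().str[0]

pairs = []
for _, block in df.groupby('key'):
    for (_, a), (_, b) in combinations(block['Name'].items(), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= 0.7:                      # similarity threshold, tune as needed
            pairs.append((a, b, round(score, 2)))

print(pd.DataFrame(pairs, columns=['name_1', 'name_2', 'similarity']))
A coarser key keeps more true matches but produces bigger blocks; a finer key (e.g. the first two letters) keeps memory and runtime down at the cost of possibly missing some pairs.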

R2 value to find correlation between two dataframes

Is it possible to find an R-squared value between two different datasets in order to find correlation?
For example, I have two dataframes as below
DataFrame 1
Date humidity Average windspeed sunshine avg cloud
0 2016-01-01 93.714 2.855 1.622 5.925
1 2016-01-02 89.423 5.762 0.237 6.879
2 2016-01-03 87.281 6.138 0.978 6.308
DataFrame 2
Date Wind ene wind offshore Photovoltaic
0 2016-01-01 93.714 2.855 1.622
1 2016-01-02 89.423 5.762 0.237
2 2016-01-03 87.281 6.138 0.978
How would I find a correlation between these two dataframes?
Could you please explain your goal? In my understanding, it makes little sense to find a correlation between these two data frames as a whole. I would suggest finding the correlation between humidity and wind speed, or investigating how the offshore and onshore wind speeds relate, if the wind speed in your first data frame is from onshore.
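If the goal is column-against-column correlation, one possible sketch (assuming the two frames can be aligned on Date, and using the column names from the posted samples) is to merge them and use pandas' corr, squaring the Pearson r to get R²:
import pandas as pd

df1 = pd.DataFrame({'Date': ['2016-01-01', '2016-01-02', '2016-01-03'],
                    'humidity': [93.714, 89.423, 87.281],
                    'Average windspeed': [2.855, 5.762, 6.138],
                    'sunshine': [1.622, 0.237, 0.978],
                    'avg cloud': [5.925, 6.879, 6.308]})
df2 = pd.DataFrame({'Date': ['2016-01-01', '2016-01-02', '2016-01-03'],
                    'Wind ene': [93.714, 89.423, 87.281],
                    'wind offshore': [2.855, 5.762, 6.138],
                    'Photovoltaic': [1.622, 0.237, 0.978]})

# align the two frames on Date, then correlate any pair of columns
merged = df1.merge(df2, on='Date')
r = merged['Average windspeed'].corr(merged['wind offshore'])   # Pearson r
print('r =', r, 'R^2 =', r ** 2)

# or an R^2 matrix for every column of df1 against every column of df2
r2 = merged.drop(columns='Date').corr() ** 2
print(r2.loc[df1.columns[1:], df2.columns[1:]])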

Historical Intraday ticks using BLPInterface

I use the https://github.com/691175002/BLPInterface wrapper for the Bloomberg API. Normally I just pull historical-end-of-day stuff. But I need to instead pull some small amount of historical intraday tick data.
In Excel API, I'd do something like:
=BDH($A$1,$A$2,"2021-09-17 14:29:00","2021-09-17 14:30:00","Dir=V","IntrRw=true","Headers=Y","Dts=S","cols=4;rows=1195", "Sort=D")
The critical bit here is the "IntrRw=true" parameter, which says "Intraday raw forces the historical intraday ticks output. The default option is true"
However, I cannot find a way to pass this parameter into the historicalRequest() function in BLPInterface.
print(blp.BLPInterface().historicalRequest(['spx Index'],['bid', 'ask'],
dt.datetime(2021, 9, 17, 16,29,0), dt.datetime(2021, 9, 17, 16,31,0),
IntrRw=True
))
If I pass those time-specific dates, it still just gives me bid and ask End-of-Day, not during the 16:29-16:31 time.
But if I try to pass it an IntrRw=True parameter, it doesn't pass along the extra keyword, failing with NotFoundException: Sub-element 'IntrRw' does not exist. (0x0006000d)
Any ideas how to achieve this? Sadly, BLPInterface seems unmaintained/unresponsive. I merged a pull request into it a couple of years ago but haven't heard any signs of life since.
This is how you could use the xbbg package to retrieve the tick data:
from xbbg import blp
from datetime import datetime
df = blp.bdtick('SPX Index',datetime(2021,9,17),types=['BID','ASK'],time_range=('14:29','14:31'),ref='EquityUS')
print(df)
Which yields:
SPX Index
typ value volume exch
2021-09-17 14:29:00-04:00 BID 4430.77 0 m
2021-09-17 14:29:00-04:00 ASK 4432.30 0 m
2021-09-17 14:29:01-04:00 BID 4430.86 0 m
2021-09-17 14:29:01-04:00 ASK 4432.39 0 m
2021-09-17 14:29:02-04:00 BID 4430.83 0 m
... ... ... ... ...
2021-09-17 14:30:58-04:00 ASK 4430.26 0 m
2021-09-17 14:30:59-04:00 BID 4428.96 0 m
2021-09-17 14:30:59-04:00 ASK 4430.26 0 m
2021-09-17 14:31:00-04:00 BID 4428.86 0 m
2021-09-17 14:31:00-04:00 ASK 4430.13 0 m
[242 rows x 4 columns]
The time interval you supply is based on the ref='EquityUS' parameter. xbbg has a lookup of what it terms 'exchanges', and uses this to impute the timezone. The underlying BLP API only deals in UTC times (i.e. relative to GMT), so the package performs the conversion. Hence in the example this is 14:29 to 14:31 New York time (i.e. UTC-4 currently).

Grouping nearby data in pandas

Let's say I have the following dataframe:
df = pd.DataFrame({'a':[1,1.1,1.03,3,3.1], 'b':[10,11,12,13,14]})
df
a b
0 1.00 10
1 1.10 11
2 1.03 12
3 3.00 13
4 3.10 14
And I want to group nearby points, e.g.:
df.groupby(#SOMETHING).mean():
a b
a
0 1.043333 11.0
1 3.050000 13.5
Now, I could use
#SOMETHING = pd.cut(df.a, np.arange(0, 5, 2), labels=False)
But only if I know the boundaries beforehand. How can I accomplish similar behavior if I don't know where to place the cuts? I.e., I want to group nearby points (with nearby being defined as within some epsilon).
I know this isn't trivial, because point x might be near point y, and point y might be near point z, but point x might be too far from z; so then it's ambiguous what to do. This is kind of a k-means problem, but I'm wondering if pandas has any tools built in to make this easy.
Use case: I have several processes that generate data on regular intervals, but they're not quite synced up, so the timestamps are close, but not identical, and I want to aggregate their data.
Based on this answer, you can flag the rows where the gap to the previous value exceeds the threshold and use the cumulative sum of those flags as group labels:
df.groupby( (df.a.diff() > 1).cumsum() ).mean()
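As a follow-up, here is a minimal sketch of the same idea with an explicit epsilon; it assumes the data is sorted on the grouping column first, since diff() only compares adjacent rows:
import pandas as pd

df = pd.DataFrame({'a': [1, 1.1, 1.03, 3, 3.1], 'b': [10, 11, 12, 13, 14]})

eps = 0.5                    # "nearby" means within this gap of the previous value
df = df.sort_values('a')     # diff-based grouping needs sorted values
# a gap larger than eps starts a new cluster; cumsum turns the flags into group labels
labels = (df['a'].diff() > eps).cumsum()
print(df.groupby(labels).mean())
This only works when clusters are separated by gaps larger than eps; for genuinely ambiguous cases (x near y, y near z, x far from z) you would need a clustering routine from outside pandas.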

Nearest Neighbor Search on large database table - SQL and/or ArcGis

Sorry for posting something that's probably obvious, but I don't have much database experience. Any help would be greatly appreciated - but remember, I'm a beginner :-)
I have a table like this:
Table.fruit
ID type Xcoordinate Ycoordinate Taste Fruitiness
1 Apple 3 3 Good 1,5
2 Orange 5 4 Bad 2,9
3 Apple 7 77 Medium 1,4
4 Banana 4 69 Bad 9,5
5 Pear 9 15 Medium 0,1
6 Apple 3 38 Good -5,8
7 Apple 1 4 Good 3
8 Banana 15 99 Bad 6,8
9 Pear 298 18789 Medium 10,01
… … … … … …
1000 Apple 1344 1388 Bad 5
… … … … … …
1958 Banana 759 1239 Good 1
1959 Banana 3 4 Medium 5,2
I need:
A table that gives me
The n (e.g. n=5) closest points to EACH point in the original table, including the distance
Table.5nearest (please note that the distances are fake). So the resulting table has ID1, ID2 and distance between ID1 and ID2 (can't post images yet, unfortunately).
ID.Fruit1 ID.Fruit2 Distance
1 1959 1
1 7 2
1 2 2
1 5 30
1 14 50
2 1959 1
2 1 2
… … …
1000 1958 400
1000 Xxx Xxx
… … …
How can I do this (ideally with SQL/database management) or in ArcGis or similar? Any ideas?
Unfortunately, my table contains 15,000 records, so the resulting table will have 75,000 records if I choose n=5.
Any suggestions GREATLY appreciated.
EDIT:
Thank you very much for your comments and suggestions so far. Let me expand on it a little:
The first proposed method is sort of a brute-force scan of the whole table, producing huge file sizes or, more likely, crashes, correct?
Now, the fruit is just a dummy; the real table contains a fixed ID, nominal attributes ("fruit types" etc.), X and Y spatial columns (in Gauss-Krueger) and some numeric attributes.
Now, I guess there is a way to code a "bounding box" into this, so the distance calculation is done only between the point in question (say, ID 1) and every other point within a square of a certain edge length. I can (remotely) imagine coding or querying for that, but how do I get the script to do that for EVERY point in my ID column? The way I understand it, this should either create a "subtable" for each record/point in Table.Fruit containing all points within the square around it, with a distance field added, or one big new table ("Table.5nearest"). I hope this makes some kind of sense. Any ideas? Thanks again.
To get all the distances between all fruit is fairly straightforward. In Access SQL (although you may need to add parentheses everywhere to get it to work :P):
select fruit1.id,
fruit2.id,
sqr(((fruit2.xcoordinate - fruit1.xcoordinate)^2) + ((fruit2.ycoordinate - fruit1.ycoordinate)^2)) as distance
from fruit as fruit1
join fruit as fruit2
on fruit2.id <> fruit1.id
order by distance;
I don't know if Access has the necessary sophistication to limit this to the "top n" records for each fruit; so this query, on your recordset, will return 225 million records (or, more likely, crash while trying)!
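If the coordinates can be exported from Access, a spatial index avoids materialising the 225-million-row cross join entirely. This is a minimal sketch using scipy's cKDTree (an assumption outside Access/ArcGIS, not part of the query above); it returns the n nearest neighbours per point in the ID.Fruit1 / ID.Fruit2 / Distance layout from the question:
import pandas as pd
from scipy.spatial import cKDTree

# toy stand-in for Table.fruit; in practice load all 15,000 rows
fruit = pd.DataFrame({'ID': [1, 2, 3, 4, 5],
                      'Xcoordinate': [3, 5, 7, 4, 9],
                      'Ycoordinate': [3, 4, 77, 69, 15]})

n = 3                                          # neighbours per point (n=5 in the question)
coords = fruit[['Xcoordinate', 'Ycoordinate']].to_numpy()
tree = cKDTree(coords)
# k = n + 1 because the closest hit for every point is the point itself
dist, idx = tree.query(coords, k=n + 1)

rows = []
for i in range(len(fruit)):
    for d, j in zip(dist[i][1:], idx[i][1:]):  # skip the self-match in column 0
        rows.append((fruit['ID'].iloc[i], fruit['ID'].iloc[j], d))

print(pd.DataFrame(rows, columns=['ID.Fruit1', 'ID.Fruit2', 'Distance']))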
Thank you for your comments so far; in the meantime, I have gone for a pre-fabricated solution, an add-in for ArcGis called Hawth's Tools. This really works like a breeze to find the n closest neighbors to any point feature with an x and y value. So I hope it can help someone with similar problems and questions.
However, it leaves me with a more database-related issue now. Do you have an idea how I can get any DBMS (preferably Access) to give me a list of all my combinations? That is, if I have a point feature with 15,000 fruits arranged in space, how do I get all "pure banana neighborhoods" (apple, lemon, etc.) and all other combinations?
Cheers and best wishes.