In my experiment, I am presenting images (faces) that differ along two dimensions: face identity and emotion.
There are 5 faces displaying 5 different emotional expressions, making 25 unique stimuli in total. These only need to be presented once (so 25 trials).
After I present one of the faces, the next face has to be different on only the emotion OR the identity, but the same on the other.
Example:
Face 1, emotion 1 -> face 3, emotion 1 -> face 3, emotion 4 -> ... etc.
1: Is PsychoPy up to this task? I have mostly worked with the Builder so far, except for some data-logging code, but I'd be happy to get more experienced with the Coder.
My hunch is that I would need to add two columns to the list of trials, one for identity and one for emotion. Then use the getEarlierTrial call somehow, but I pretty much get lost at this point.
2: Would anyone be willing to point me in the right direction please?
Many thanks in advance.
This is difficult to implement in Builder's normal mode of operation, which is to drive trials from a fixed list of conditions. Although the order of rows can be randomised across subjects, the pairings of values across columns remain constant.
The standard answer to this is what you allude to in your comment above: in code, shuffle the conditions file at the beginning of each experiment, so each subject is in essence having their trials driven by a unique conditions file.
You seem happy to do this in Matlab. That would work fine, as this stuff can be done before PsychoPy even launches. But it could also very easily be implemented in Python code. That way you could do everything in PsychoPy, and in this case there would be no need to abandon Builder. You'd just insert a code component with some code to run at the beginning of the experiment that customises a conditions file.
You'll need to create three lists, not two: i.e. you also need a list of pseudo-random choices controlling whether to preserve the face or the emotion from one trial to the next. If you make this choice fully at random, it will become unbalanced and you'll exhaust one of the attributes before the other.
from numpy.random import shuffle

# make a list of 25 dictionaries of unique face/emotion pairs:
pairsList = []
for face in ['1', '2', '3', '4', '5']:
    for emotion in ['1', '2', '3', '4', '5']:
        pairsList.append({'faceNum': face, 'emotionNum': emotion})

shuffle(pairsList)

# a list of choices of whether to preserve face or emotion from one trial to the next:
attributes = ['faceNum', 'emotionNum'] * 12  # length 24
shuffle(attributes)

# need to create an initial selection before cycling through the
# next 24 randomised but balanced choices:
pair = pairsList.pop()
currentFace = pair['faceNum']
currentEmotion = pair['emotionNum']
images = ['face_' + currentFace + '_emotion_' + currentEmotion + '.jpg']

for attribute in attributes:
    if attribute == 'faceNum':
        selection = currentFace
    else:
        selection = currentEmotion

    # find another pair with the same selected attribute:
    for pair in pairsList:
        if pair[attribute] == selection:
            # store the combination for this trial:
            currentFace = pair['faceNum']
            currentEmotion = pair['emotionNum']
            images.append('face_' + currentFace + '_emotion_' + currentEmotion + '.jpg')
            # remove this combination so it can't be used again,
            # and stop searching so only one pair is consumed per trial:
            pairsList.remove(pair)
            break

images.reverse()
print(images)
Then just write the images list to a single column .csv file to use as a conditions file.
Remember to set the loop in Builder to be in a fixed order, not randomised, as the list itself has the randomisation built in.
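For example, a minimal sketch of writing that single-column conditions file with Python's csv module (the column name imageFile and the file name conditions.csv are just placeholders; match them to whatever your image component and loop expect):
import csv

with open('conditions.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['imageFile'])  # single column header
    for image in images:
        writer.writerow([image])
Each row then gives the image file for one trial, in the pre-randomised order.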
Say I have one column (X) which holds the customer id, and multiple other columns x1, x2, x3, x4, x5, x6 which hold only these 4 distinct values ('High', 'Low', 'Medium', 'Nan') repeatedly. Please see the attachment above.
Update (16/12/2021): I have done one-hot encoding and now have 19 features along with the X column. I now need to know how to go ahead with the clustering part for such an unsupervised data set.
Regarding the question of which encoding to use, I found this article gives a good understanding of when to use label encoding or one-hot encoding:
https://www.analyticsvidhya.com/blog/2020/03/one-hot-encoding-vs-label-encoding-using-scikit-learn/
In your case, as your data does have an ordinal ordering (high > medium > low > nan), I would suggest using the label encoding technique.
Then, regarding the clustering part: you have identified three different clusters. Do you want to identify which samples belong to which cluster, or what is your goal?
You could start by training a model with 3 cluster centroids, as you have identified yourself, but you could also use an elbow function to find an optimal number of clusters for your dataset (https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/); a rough sketch follows below.
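As a sketch of that elbow approach with scikit-learn, assuming your encoded features already sit in a pandas DataFrame called dataframe and the customer id column is named X (both names taken from your description):
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

features = dataframe.drop(columns=['X'])  # cluster on the encoded x1..x6 columns only

inertias = []
k_values = range(1, 11)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
    inertias.append(model.inertia_)  # within-cluster sum of squares

plt.plot(k_values, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()  # the "elbow" where the curve flattens suggests a reasonable k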
For label encoding on a column in your dataframe:
encoding_dict = {}

def label_encode(string_value):
    # Assign the next free integer to an unseen string value, or return the existing one
    num_value = encoding_dict.setdefault(string_value, len(encoding_dict))
    return num_value

for col in dataframe.columns:
    if dataframe[col].dtype == object:  # object dtype indicates strings
        dataframe[col] = dataframe[col].apply(label_encode)
        encoding_dict = {}  # reset so each column is encoded independently (or don't, if all columns share the same values)
Is there an equivalent of TTree::AddFriend() with uproot?
I have 2 parallel trees in 2 different files which I'd need to read with uproot.iterate and using interpretations (setting the 'branches' option of uproot.iterate).
Maybe I can do that by manually obtaining several iterators from iterate() calls on the files, and then calling next() on each iterator... but maybe there's a simpler way akin to AddFriend?
Thanks for any hint!
edit: I'm not sure I've been clear, so here's a bit more detail. My question is not about the usage of arrays, but about how to read them from different files. Here's a mockup of what I'm doing:
# I will fill this array and give it as input to my DNN
# it's very big so I will fill it in place
bigarray = ndarray( (2,numentries),...)
# get a handle on a tree, just to be able to build interpretations:
t0 = .. first tree in input_files
interpretations = dict(
a=t0['a'].interpretation.toarray(bigarray[0]),
b=t0['b'].interpretation.toarray(bigarray[1]),
)
# iterate with :
uproot.iterate( input_files, treename,
branches = interpretations )
So what if a and b belong to 2 trees in 2 different files ?
In array-based programming, friends are implicit: you can JOIN any two columns after the fact—you don't have to declare them as friends ahead of time.
In the simplest case, if your arrays a and b have the same length and the same order, you can just use them together, like a + b. It doesn't matter whether a and b came from the same file or not. Even if one of these is jagged (like jets.phi) and the other is not (like met.phi), you're still fine, because the non-jagged array will be broadcast to match the jagged one.
Note that awkward.Table and awkward.JaggedArray.zip can combine arrays into a single Table or jagged Table for bookkeeping.
If the two arrays are not in the same order, possibly because each writer was individually parallelized, then you'll need some column to act as the key associating rows of one array with different rows of the other. This is a classic database-style JOIN, and although Uproot and Awkward don't provide routines for it, Pandas does. (Look up "merging, joining, and concatenating" in the Pandas documentation; there's a lot!) You can maintain an array's jaggedness in Pandas by preparing the column with the awkward.topandas function. A rough sketch of such a merge is below.
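For illustration only (key1, key2, a and b here are hypothetical flat arrays read from the two files, each carrying a per-event identifier), a key-based merge in Pandas could look like:
import pandas as pd

# hypothetical flat (non-jagged) arrays from two different trees/files,
# each with a 'key' column identifying the row/event
df1 = pd.DataFrame({'key': key1, 'a': a})
df2 = pd.DataFrame({'key': key2, 'b': b})

# classic inner JOIN: rows are matched by 'key', regardless of their ordering
joined = pd.merge(df1, df2, on='key', how='inner')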
The following issue talks about a lot of these things, though the users in the issue below had to join sets of files, rather than just a single tree. (In principle, a process would have to look ahead to all the files to see which contain which keys: a distributed database problem.) Even if that's not your case, you might find more hints there to see how to get started.
https://github.com/scikit-hep/uproot/issues/314
This is how I have "friended" (befriended?) two TTrees in different files with uproot/awkward.
import awkward
import uproot

iterate1 = uproot.iterate(["file_with_a.root"])  # has branch "a"
iterate2 = uproot.iterate(["file_with_b.root"])  # has branch "b"

for array1, array2 in zip(iterate1, iterate2):
    # join arrays: copy every field of array2 into array1
    for field in array2.fields:
        array1 = awkward.with_field(array1, getattr(array2, field), where=field)
    # array1 now has branches "a" and "b"
    print(array1.a)
    print(array1.b)
Alternatively, if it is acceptable to "name" the trees,
import awkward
import uproot

iterate1 = uproot.iterate(["file_with_a.root"])  # has branch "a"
iterate2 = uproot.iterate(["file_with_b.root"])  # has branch "b"

for array1, array2 in zip(iterate1, iterate2):
    # join arrays under named records
    zippedArray = awkward.zip({"tree1": array1, "tree2": array2})
    # zippedArray now has branches "tree1.a" and "tree2.b"
    print(zippedArray.tree1.a)
    print(zippedArray.tree2.b)
Of course you can use array1 and array2 together without merging them like this. But if you have already written code that expects only 1 Array this can be useful.
I have two problems:
First off, the documentation for tf.keras.datasets.imdb.get_word_index says
Retrieves the dictionary mapping word indices back to words.
While in fact it's the contrary,
print(tf.keras.datasets.imdb.get_word_index())
{'fawn': 34701, 'tsukino': 52006, 'nunnery': 52007
I tried to run this in TensorFlow 2.0
(train_data_raw, train_labels), (test_data_raw, test_labels) = keras.datasets.imdb.load_data()
words2idx = tf.keras.datasets.imdb.get_word_index()
idx2words = {idx:word for word, idx in words2idx.items()}
i = 0
train_ex = [idx2words[x] for x in train_data_raw[0]]
train_ex = ' '.join(train_ex)
print(train_ex)
This results in a nonsense string:
the as you with out themselves powerful lets loves their [...]
Shouldn't I get a valid movie review?
I did a bit of digging and found that there are a few "offsets" in the processing which need to be undone in order to recover sensible review text. I modified your line to subtract 3 from each index that appears in the raw sequence (since the default is to start real words at index 3); also, the first character is a dummy start marker (set to 1), so the real text starts at position 2 (index 1 in Python).
train_ex = [idx2words[x-3] for x in train_data_raw[0][1:]]
Using the above modification gives me the following for the review you originally selected:
this film was just brilliant casting location scenery story direction everyone's really suited the part they played ...
It appears that some punctuation and capitalization is removed etc, but this seems to return sensible reviews.
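For completeness, a fuller decoding sketch, assuming the documented load_data() defaults (start_char=1, oov_char=2, index_from=3), might look like this:
import tensorflow as tf

(train_data_raw, train_labels), _ = tf.keras.datasets.imdb.load_data()
word_index = tf.keras.datasets.imdb.get_word_index()

# shift by 3 because indices 0-2 are reserved for the padding,
# start-of-sequence and out-of-vocabulary markers under the defaults
idx2words = {idx + 3: word for word, idx in word_index.items()}
idx2words[0] = '<PAD>'
idx2words[1] = '<START>'
idx2words[2] = '<UNK>'

review = ' '.join(idx2words.get(i, '<UNK>') for i in train_data_raw[0])
print(review)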
I hope this helps.
I am a beginner in text mining. When I run LDA() over a huge dataset with 996165 observations, it displays the following error:
Error in LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, :
Each row of the input matrix needs to contain at least one non-zero entry.
I am pretty sure that there are no missing values in my corpus. The table of is.na() over the "DocumentTermMatrix" (a "simple_triplet_matrix") slots is:
table(is.na(dtm[[1]]))
#FALSE
#57100956
table(is.na(dtm[[2]]))
#FALSE
#57100956
I am a little confused about where "57100956" comes from. As my dataset is pretty large, I don't know how to check why this error occurs. My LDA command is:
ldaOut<-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))
Can anyone provide some insights? Thanks.
In my opinion the problem is not the presence of missing values, but the presence of all 0 rows.
To check it:
row.sum = apply(dtm, 1, FUN=sum)  # sum over each row of the document-term matrix
Then you can delete all rows that sum to 0:
dtm = dtm[row.sum != 0, ]
Now dtm should have only non-zero rows.
I had the same problem. The design matrix (dtm in your case) had rows with all zeroes because some documents did not contain any of the retained terms (i.e. their frequencies were all zero). I suppose this somehow causes a singular matrix problem somewhere along the line. I fixed this by adding a common word to each of the documents so that every row would have at least one non-zero entry. At the very least, the LDA ran successfully and classified each of the documents. Hope this helps!
A client has asked me to add a simple spaced repetition algorithm (SRS) to an online learning site. But before throwing myself into it, I'd like to discuss it with the community.
Basically the site asks the user a bunch of questions (by automatically selecting, say, 10 out of 100 total questions from a database), and the user gives either a correct or incorrect answer. The user's results are then stored in a database, for instance:
userid questionid correctlyanswered dateanswered
1 123 0 (no) 2010-01-01 10:00
1 124 1 (yes) 2010-01-01 11:00
1 125 1 (yes) 2010-01-01 12:00
Now, to maximize a user's ability to learn all the answers, I should be able to apply an SRS algorithm so that next time the user takes the quiz, they receive questions previously answered incorrectly more often than questions answered correctly. Also, questions that were previously answered incorrectly, but have recently been answered correctly on a regular basis, should occur less often.
Has anyone implemented something like this before? Any tips or suggestions?
These are the best links I've found:
http://en.wikipedia.org/wiki/Spaced_repetition
http://www.mnemosyne-proj.org/principles.php
http://www.supermemo.com/english/ol/sm2.htm
What you want to do is to have a number X_i for all questions i. You can normalize these numbers (make their sum 1) and pick one at random with the corresponding probability.
If N is the number of different questions and M is the average number of times each question has been answered, then you can compute X in O(M*N) time like this:
Create the array X[N], initialised to 0.
Run through the data, and every time you see question i answered wrong, increase X[i] by f(t), where t is the answering time and f is an increasing function.
Because f is increasing, a question answered wrong a long time ago has less impact than one answered wrong yesterday. You can experiment with different f to get a nice behaviour.
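A minimal sketch of the selection step, once X has been filled in (the weights here are just made-up numbers):
import numpy as np

# hypothetical accumulated weights X[i] for each question i
X = np.array([0.5, 2.0, 0.1, 1.4])

probabilities = X / X.sum()  # normalise so the weights sum to 1
next_question = np.random.choice(len(X), p=probabilities)  # index of the question to ask next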
The smarter way
A faster way is not to generate X[] every time you choose questions, but save it in a database table.
You won't be able to apply f with this solution. Instead just add 1 every time the question is answered wrongly, and then run through the table regularly - say every midnight - and multiply all X[i] by a constant - say 0.9.
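Roughly, with the weights kept in a database table (the table and column names here are invented for the sketch), those two operations could look like:
import sqlite3

conn = sqlite3.connect('quiz.db')
# hypothetical schema: question_weights(questionid INTEGER PRIMARY KEY, x REAL)

def record_wrong_answer(questionid):
    # add 1 to the question's weight every time it is answered incorrectly
    conn.execute("UPDATE question_weights SET x = x + 1 WHERE questionid = ?", (questionid,))
    conn.commit()

def nightly_decay(factor=0.9):
    # run regularly (say every midnight) so old mistakes gradually lose influence
    conn.execute("UPDATE question_weights SET x = x * ?", (factor,))
    conn.commit()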
Update: Actually you should base your data on correct answers, not wrong ones. Otherwise, questions that have not been answered at all for a long time will have a smaller chance of being chosen, when it should be the opposite.
Anki is an open source program implementing spaced repetition.
Being open source, you can browse the source for libanki, a spaced repetition library for Anki.
As of January 2013, the Anki version 2 sources can be browsed here.
The sources are in Python, the executable pseudocode language.
Reading the source to understand the algorithm may be feasible. The data model is defined using sqlalchemy, the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.
Here is a spaced repetition algorithm that is well documented and easy to understand.
Features:
- Introduces sub-decks for efficiently learning large decks (super useful!)
- Intuitive variable names and algorithm parameters. Fully open-source with human-readable examples.
- Easily configurable parameters to accommodate different users' memorization abilities.
- Computationally cheap to compute the next card. No need to run a computation on every card in the deck.
https://github.com/Jakobovski/SaneMemo.
Disclaimer: I am the author of SaneMemo.
import random
import datetime

# The number of times the user needs to answer the card correctly (EASY) consecutively before removing the card from
# the current sub_deck.
CONSECUTIVE_CORRECT_TO_REMOVE_FROM_SUBDECK_WHEN_KNOWN = 2
CONSECUTIVE_CORRECT_TO_REMOVE_FROM_SUBDECK_WHEN_WILL_FORGET = 3

# The number of cards in the sub-deck
SUBDECK_SIZE = 15
REMINDER_RATE = 1.6


class Deck(object):
    def __init__(self):
        self.cards = []
        # Used to make sure we don't display the same card twice in a row
        self.last_card = None

    def add_card(self, card):
        self.cards.append(card)

    def get_next_card(self):
        self.cards = sorted(self.cards)  # Sorted by next_practice_time
        sub_deck = self.cards[0:min(SUBDECK_SIZE, len(self.cards))]
        card = random.choice(sub_deck)
        # In case card == last_card, select again. We don't want to show the same card two times in a row
        # (the length check avoids an infinite loop when only one card remains).
        while card == self.last_card and len(sub_deck) > 1:
            card = random.choice(sub_deck)
        self.last_card = card
        return card


class Card(object):
    def __init__(self, front, back):
        self.front = front
        self.back = back
        self.next_practice_time = datetime.datetime.utcnow()
        self.consecutive_correct_answer = 0
        self.last_time_easy = datetime.datetime.utcnow()

    def update(self, performance_str):
        """Updates the card after the user has seen it and answered how difficult it was. The user can provide one of
        three options: [KNOW_IT, KNOW_BUT_WILL_FORGET, DONT_KNOW].
        """
        if performance_str == "KNOW_IT":
            self.consecutive_correct_answer += 1
            if self.consecutive_correct_answer >= CONSECUTIVE_CORRECT_TO_REMOVE_FROM_SUBDECK_WHEN_KNOWN:
                days_since_last_easy = (datetime.datetime.utcnow() - self.last_time_easy).days
                days_to_next_review = (days_since_last_easy + 2) * REMINDER_RATE
                self.next_practice_time = datetime.datetime.utcnow() + datetime.timedelta(days=days_to_next_review)
                self.last_time_easy = datetime.datetime.utcnow()
            else:
                self.next_practice_time = datetime.datetime.utcnow()
        elif performance_str == "KNOW_BUT_WILL_FORGET":
            self.consecutive_correct_answer += 1
            if self.consecutive_correct_answer >= CONSECUTIVE_CORRECT_TO_REMOVE_FROM_SUBDECK_WHEN_WILL_FORGET:
                self.next_practice_time = datetime.datetime.utcnow() + datetime.timedelta(days=1)
            else:
                self.next_practice_time = datetime.datetime.utcnow()
        elif performance_str == "DONT_KNOW":
            self.consecutive_correct_answer = 0
            self.next_practice_time = datetime.datetime.utcnow()

    def __lt__(self, other):
        """Comparator for sorting cards by next_practice_time"""
        return self.next_practice_time < other.next_practice_time