A value is trying to be set on a copy of a slice from a DataFrame - don't understand ist - pandas

I know this topic has been discussed a lot and I am so sorry, that I stil dont't find the sulution, even the difference between a view and a copy is easy to understand (in other languages)
def hole_aktienkurse_und_berechne_hist_einstandspreis(index, start_date, end_date):
df_history = pdr.get_data_yahoo(symbols=index, start=start_date, end=end_date)
df_history['HistEK'] = df_history['Adj Close']
df_only_trd_index = df_group_trade.loc[index].copy()
for i_hst, r_hst in df_history.iterrows():
df_bis = df_only_trd_index[(df_only_trd_index['DateClose']<=i_hst) & (df_only_trd_index['OpenPos']==0)].copy()
# here comes the part what causes the trouble:
df_history.loc[i_hst]['HistEK'] = df_history.loc[i_hst]['Adj Close'] - df_bis['Total'].sum()/100.0
return df_history
I think I tried nearly everithing, but I don't get it. python is not easy when it comes to this topic.

When you have to specify bow index and column in .loc you have to put all together otherwise the annoying message relative to views appears.
df_history.loc[i_hst, 'HistEK'] = df_history.loc[i_hst, 'Adj Close'] - df_bis['Total'].sum()/100.0
Look the examples here

Related

RStudio Error: Unused argument ( by = ...) when fitting gam model, and smoothing seperately for a factor

I am still a beginnner in R. For a project I am trying to fit a gam model on a simple dataset with a timeset and year. I am doing it in R and I keep getting an error message that claims an argument is unused, even though I specify it in the code.
It concerns a dataset which includes a categorical variable of "Year", with only two levels. 2020 and 2022. I want to investigate if there is a peak in the hourly rate of visitors ("H1") in a nature reserve. For each observation period the average time was taken, which is the predictor variable used here ("T"). I want to use a Gam model for this, and have the smoothing applied differently for the two years.
The following is the line of code that I tried to use
`gam1 <- gam(H1~Year+s(T,by=Year),data = d)`
When I try to run this code, I get the following error message
`Error in s(T, by = Year) : unused argument (by = Year)`
I also tried simply getting rid of the "by" argument
`gam1 <- gam(H1~Year+s(T,Year),data = d)`
This allows me to run the code, but when trying to summon the output using summary(gam1), I get
Error in [<-(tmp, snames, 2, value = round(nldf, 1)) : subscript out of bounds
Since I feel like both errors are probably related to the same thing that I'm doing wrong, I decided to combine the question.
Did you load the {mgcv} package or the {gam} package? The latter doesn't have factor by smooths and as such the first error message is what I would expect if you did library("gam") and then tried to fit the model you showed.
To fit the model you showed, you should restart R and try in a clean session:
library("mgcv")
# load you data
# fit model
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)
It could well be that you have both {gam} and {mgcv} loaded, in which case whichever you loaded last will be earlier on the function search path. As both packages have functions gam() and s(), R might just be finding the wrong versions (masking), so you might also try
gam1 <- mgcv::gam(H1 ~ Year + mgcv::s(T, by = Year), data = d)
But you would be better off only loading {mgcv} if you wan factor by smooths.
#Gavin Simpson
I did have both loaded, and I tried just using mgcv as you suggested. However, then I get the following error.
Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
I am assuming this is simply because it's not actually trying to use the "gam" function, but rather it attempts to name something gam1. So I would assume I actually need the package of 'gam' before I could do this.
The second line of code also doesn't work. I get the following error
Error in model.frame.default(formula = H1 ~ Year + mgcv::s(T, by = Year), :
invalid type (list) for variable 'mgcv::s(T, by = Year)'
This happens no matter the order I download the two packages in. And if I don't download 'gam', I get the error as described above.

Multiple persists in the Spark Execution plan

I currently have some spark code (pyspark), which loads in data from S3 and applies several transformations on it. The current code is structured in such a way that there are a few persists along the way in the following format
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df = df.transformation5
df = df.transformation6
df = df.transformation7
df.persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
df = df.transformationN-1
df = df.transformationN
df.persist(MEMORY_AND_DISK)
When I do df.explain() at the very end of all transformations, as expected, there are multiple persists in the execution plan. Now when I do the following at the end of all these transformations
print(df.count())
All transformations get triggered, including the persist. Since spark will flow through the execution plan, it will execute all these persists. Is there any way that I can inform Spark to unpersist the N-1th persist, when performing the Nth persist, or is Spark smart enough to do this. My issue stems from the fact that later on in the program, I run out of disk space, ie, spark errors out with the following error:
No space left on device
An easy solution is of course to increase the underlying number of instances. But my hypothesis is that the high number of persists eventually costs the disk to go out of space.
My question is, do these donkey persists cause this issue? If they do, what is the best way/practice to structure the code so that I can unpersist the N-1th persist automatically.
I'm more experienced with Scala Spark but it's definitely possible to unpersist a Dataframe.
In fact, the Pyspark method of a Dataframe is also called unpersist. So in your example, you could do something like (this is quite crude):
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df1 = df.transformation5
df1 = df1.transformation6
df1 = df1.transformation7
df.unpersist()
df1.persist(MEMORY_AND_DISK)
.
.
.
dfM = dfM-1.transformationN-2
dfM = dfM.transformationN-1
dfM = dfM.transformationN
dfM-1.unpersist()
dfM.persist(MEMORY_AND_DISK)
Now, the way this code looks triggers some questions in me. It might be that you've mostly written this as pseudocode to be able to ask this question, but still maybe the following questions might help you further:
I only see transformations in there, and no actions. If this is the case, do you even need to persist?
Also, you only seem to have 1 source of data (the spark.read.csv bit): this also seems to hint at not necessarily needing to persist.
This is more of a point about style (and maybe opinionated, so don't worry if you don't agree). As I said in the beginning, I have no experience with Pyspark but the way that I would write (in Scala Spark) something similar to what you have written would be something like this:
df = spark.read.csv(s3path)
.transformation1
.transformation2
.transformation3
.transformation4
.persist(MEMORY_AND_DISK)
df = df.transformation5
.transformation6
.transformation7
.persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
.transformationN-1
.transformationN
.persist(MEMORY_AND_DISK)
This is less verbose and IMO a little more "true" to what really happens, just a chaining of transformations on the original dataframe.

splitting columns with str.split() not changing the outcome

Will I have to use the str.split() for an exercise. I have a column called title and it looks like this:
and i need to split it into two columns Name and Season, the following code does not through an error but it doesn't seem to be doing anything as well when i'm testing it with df.head()
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
Any help as to why?
The code you have in your question is correct, and should be working. The issue could be coming from the execution order of your code though, if you're using Jupyter Notebook or some method that allows for unclear ordering of code execution.
I recommend starting a fresh kernel/terminal to clear all variables from the namespace, then executing those lines in order, e.g.:
# perform steps to load data in and clean
print(df.columns)
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
print(df.columns)
Alternatively you could add an assertion step in your code to ensure it's working as well:
df[['Name', 'Season']] = df['title'].str.split(':',n=1, expand=True)
assert {'Name', 'Season'}.issubset(set(df.columns)), "Columns were not added"

How to get the latest papers from pubmed

This is a bit of a specific question, but somebody must have done this before. I would like to get the latest papers from pubmed. Not papers about a certain subjects, but all of them. I thought to query depending on modification date (mdat). I use biopython.py and my code looks like this
handle = Entrez.egquery(mindate='2015/01/10',maxdate='2017/02/19',datetype='mdat')
results = Entrez.read(handle)
for row in results["eGQueryResult"]:
if row["DbName"]=="nuccore":
print(row["Count"])
However, this results in zero papers. If I add term='cancer' I get heaps of papers. So the query seems to need the term keyword... but I want all papers, not papers on a certain subjects. Any ideas how to do this?
thanks
carl
term is a required parameter, so you can't omit it in your call to Entrez.egquery.
If you need all the papers within a specified timeframe, you will probably need a local copy of MEDLINE and PubMed Central:
For MEDLINE, this involves getting a license. For PubMed Central, you
can download the Open Access subset without a license by ftp.
EDIT for python3. The idea is that the latest pubmed id is the same thing as the latest paper (which I'm not sure is true). Basically does a binary search for the latest PMID, then gives a list of the n most recent. This does not look at dates, and only returns PMIDs.
There is an issue however where not all PMIDs exist, for example https://pubmed.ncbi.nlm.nih.gov/34078719/ exists, https://pubmed.ncbi.nlm.nih.gov/34078720/ does not (retraction?), and https://pubmed.ncbi.nlm.nih.gov/34078721/ exists. This ruins the binary search since it can't know if it's found a PMID that hasn't been used yet, or if it has found one that has previously existed.
CODE:
import urllib
def pmid_exists(pmid):
url_stem = 'https://www.ncbi.nlm.nih.gov/pubmed/'
query = url_stem+str(pmid)
try:
request = urllib.request.urlopen(query)
return True
except urllib.error.HTTPError:
return False
def get_latest_pmid(guess = 27239557, _min_guess=None, _max_guess=None):
#print(_min_guess,'<=',guess,'<=',_max_guess)
if _min_guess and _max_guess and _max_guess-_min_guess <= 1:
#recursive base case, this guess must be the largest PMID
return guess
elif pmid_exists(guess):
#guess PMID exists, search for larger ids
_min_guess = guess
next_guess = (_min_guess+_max_guess)//2 if _max_guess else guess*2
else:
#guess PMID does not exist, search for smaller ids
_max_guess = guess
next_guess = (_min_guess+_max_guess)//2 if _min_guess else guess//2
return get_latest_pmid(next_guess, _min_guess, _max_guess)
#Start of program
n = 5
latest_pmid = get_latest_pmid()
most_recent_n_pmids = range(latest_pmid-n, latest_pmid)
print(most_recent_n_pmids)
OUTPUT:
[28245638, 28245639, 28245640, 28245641, 28245642]

pseudo randomization in loop PsychoPy

I know other people have asked similar questions in past but I am still stuck on how to solve the problem and was hoping someone could offer some help. Using PsychoPy, I would like to present different images, specifically 16 emotional trials, 16 neutral trials and 16 face trials. I would like to pseudo randomize the loop such that there would not be more than 2 consecutive emotional trials. I created the experiment in Builder but compiled a script after reading through previous posts on pseudo randomization.
I have read the previous posts that suggest creating randomized excel files and using those, but considering how many trials I have, I think that would be too many and was hoping for some help with coding. I have tried to implement and tweak some of the code that has been posted for my experiment, but to no avail.
Does anyone have any advice for my situation?
Thank you,
Rae
Here's an approach that will always converge very quickly, given that you have 16 of each type and only reject runs of more than two emotion trials. #brittUWaterloo's suggestion to generate trials offline is very good--this what I do myself typically. (I like to have a small number of random orders, do them forward for some subjects and backwards for others, and prescreen them to make sure there are no weird or unintended juxtapositions.) But the algorithm below is certainly safe enough to do within an experiment if you prefer.
This first example assumes that you can represent a given trial using a string, such as 'e' for an emotion trial, 'n' neutral, 'f' face. This would work with 'emo', 'neut', 'face' as well, not just single letters, just change eee to emoemoemo in the code:
import random
trials = ['e'] * 16 + ['n'] * 16 + ['f'] * 16
while 'eee' in ''.join(trials):
random.shuffle(trials)
print trials
Here's a more general way of doing it, where the trial codes are not restricted to be strings (although they are strings here for illustration):
import random
def run_of_3(trials, obj):
# detect if there's a run of at least 3 objects 'obj'
for i in range(2, len(trials)):
if trials[i-2: i+1] == [obj] * 3:
return True
return False
tr = ['e'] * 16 + ['n'] * 16 + ['f'] * 16
while run_of_3(tr, 'e'):
random.shuffle(tr)
print tr
Edit: To create a PsychoPy-style conditions file from the trial list, just write the values into a file like this:
with open('emo_neu_face.csv', 'wb') as f:
f.write('stim\n') # this is a 'header' row
f.write('\n'.join(tr)) # these are the values
Then you can use that as a conditions file in a Builder loop in the regular way. You could also open this in Excel, and so on.
This is not quite right, but hopefully will give you some ideas. I think you could occassionally get caught in an infinite cycle in the elif statement if the last three items ended up the same, but you could add some sort of a counter there. In any case this shows a strategy you could adapt. Rather than put this in the experimental code, I would generate the trial sequence separately at the command line, and then save a successful output as a list in the experimental code to show to all participants, and know things wouldn't crash during an actual run.
import random as r
#making some dummy data
abc = ['f']*10 + ['e']*10 + ['d']*10
def f (l1,l2):
#just looking at the output to see how it works; can delete
print "l1 = " + str(l1)
print l2
if not l2:
#checks if second list is empty, if so, we are done
out = list(l1)
elif (l1[-1] == l1[-2] and l1[-1] == l2[0]):
#shuffling changes list in place, have to copy it to use it
r.shuffle(l2)
t = list(l2)
f (l1,t)
else:
print "i am here"
l1.append(l2.pop(0))
f(l1,l2)
return l1
You would then run it with something like newlist = f(abc[0:2],abc[2:-1])