splitting columns with str.split() not changing the outcome - pandas

Well, I have to use str.split() for an exercise. I have a column called title and it looks like this:
I need to split it into two columns, Name and Season. The following code does not throw an error, but it doesn't seem to be doing anything either when I test it with df.head():
df[['Name', 'Season']] = df['title'].str.split(':', n=1, expand=True)
Any idea why?

The code you have in your question is correct and should be working. The issue could be the execution order of your code, though, if you're using a Jupyter Notebook or some other environment that allows code to be executed out of order.
I recommend starting a fresh kernel/terminal to clear all variables from the namespace, then executing the lines in order, e.g.:
# perform steps to load data in and clean
print(df.columns)
df[['Name', 'Season']] = df['title'].str.split(':', n=1, expand=True)
print(df.columns)
Alternatively, you could add an assertion step to your code to confirm it worked:
df[['Name', 'Season']] = df['title'].str.split(':', n=1, expand=True)
assert {'Name', 'Season'}.issubset(set(df.columns)), "Columns were not added"
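As a further sanity check, here is a minimal, self-contained sketch of what the split should produce. The example title strings are invented, since the actual format of your column isn't shown, so adjust the separator handling to your data:
import pandas as pd

# toy stand-in for the real 'title' column (the format here is an assumption)
df = pd.DataFrame({'title': ['Stranger Things: Season 1', 'Dark: Season 2']})

# split on the first ':' only; expand=True spreads the pieces over two columns
df[['Name', 'Season']] = df['title'].str.split(':', n=1, expand=True)

# strip the leading space the split leaves in front of the season part
df['Season'] = df['Season'].str.strip()

print(df.head())  # df now shows 'Name' and 'Season' alongside 'title'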

Related

RStudio Error: Unused argument (by = ...) when fitting gam model, and smoothing separately for a factor

I am still a beginner in R. For a project I am trying to fit a GAM on a simple dataset with a time variable and a year, and I keep getting an error message that claims an argument is unused, even though I specify it in the code.
The dataset includes a categorical variable "Year" with only two levels, 2020 and 2022. I want to investigate whether there is a peak in the hourly rate of visitors ("H1") in a nature reserve. For each observation period the average time was taken, which is the predictor variable used here ("T"). I want to use a GAM for this, with the smoothing applied separately for the two years.
This is the line of code that I tried to use:
`gam1 <- gam(H1~Year+s(T,by=Year),data = d)`
When I try to run this code, I get the following error message:
`Error in s(T, by = Year) : unused argument (by = Year)`
I also tried simply getting rid of the "by" argument:
`gam1 <- gam(H1~Year+s(T,Year),data = d)`
This allows me to run the code, but when I try to view the output using summary(gam1), I get:
Error in `[<-`(`*tmp*`, snames, 2, value = round(nldf, 1)) : subscript out of bounds
Since I feel like both errors are probably related to the same thing I'm doing wrong, I decided to combine them into one question.
Did you load the {mgcv} package or the {gam} package? The latter doesn't have factor by smooths, so the first error message is what I would expect if you did library("gam") and then tried to fit the model you showed.
To fit the model you showed, you should restart R and try in a clean session:
library("mgcv")
# load your data
# fit model
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)
It could well be that you have both {gam} and {mgcv} loaded, in which case whichever you loaded last will be earlier on the function search path. As both packages have functions gam() and s(), R might just be finding the wrong versions (masking), so you might also try:
gam1 <- mgcv::gam(H1 ~ Year + mgcv::s(T, by = Year), data = d)
But you would be better off only loading {mgcv} if you want factor by smooths.
@Gavin Simpson
I did have both loaded, and I tried using just mgcv as you suggested. However, I then get the following error:
Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
I am assuming this is simply because it's not actually using the gam() function, but rather attempting to name something gam1. So I assume I actually need the 'gam' package before I can do this.
The second line of code also doesn't work. I get the following error:
Error in model.frame.default(formula = H1 ~ Year + mgcv::s(T, by = Year), :
invalid type (list) for variable 'mgcv::s(T, by = Year)'
This happens no matter the order I load the two packages in. And if I don't load 'gam', I get the error described above.

Multiple persists in the Spark Execution plan

I currently have some Spark code (PySpark) that loads data from S3 and applies several transformations to it. The code is structured so that there are a few persists along the way, in the following format:
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df = df.transformation5
df = df.transformation6
df = df.transformation7
df.persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
df = df.transformationN-1
df = df.transformationN
df.persist(MEMORY_AND_DISK)
When I do df.explain() at the very end of all transformations, as expected, there are multiple persists in the execution plan. Now, when I do the following at the end of all these transformations:
print(df.count())
All transformations get triggered, including the persists. Since Spark flows through the execution plan, it will execute all of these persists. Is there any way I can tell Spark to unpersist the (N-1)th persist when performing the Nth persist, or is Spark smart enough to do this itself? My issue stems from the fact that later on in the program I run out of disk space, i.e. Spark errors out with the following error:
No space left on device
An easy solution is of course to increase the underlying number of instances. But my hypothesis is that the high number of persists eventually causes the disk to run out of space.
My question is: do these accumulated persists cause this issue? If they do, what is the best way/practice to structure the code so that the (N-1)th persist is unpersisted automatically?
I'm more experienced with Scala Spark, but it's definitely possible to unpersist a DataFrame.
In fact, the PySpark DataFrame method is also called unpersist(). So in your example, you could do something like this (it's quite crude):
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df1 = df.transformation5
df1 = df1.transformation6
df1 = df1.transformation7
df.unpersist()
df1.persist(MEMORY_AND_DISK)
.
.
.
dfM = dfM-1.transformationN-2
dfM = dfM.transformationN-1
dfM = dfM.transformationN
dfM-1.unpersist()
dfM.persist(MEMORY_AND_DISK)
Now, the way this code looks raises some questions for me. It might be that you've mostly written this as pseudocode to be able to ask the question, but the following points might still help you further:
I only see transformations in there, and no actions. If this is the case, do you even need to persist?
Also, you only seem to have one source of data (the spark.read.csv bit): this also hints at not necessarily needing to persist.
This is more of a point about style (and maybe opinionated, so don't worry if you don't agree). As I said at the beginning, I have no experience with PySpark, but the way I would write something similar to what you have written (in Scala Spark) would be something like this:
df = (spark.read.csv(s3path)
      .transformation1
      .transformation2
      .transformation3
      .transformation4
      .persist(MEMORY_AND_DISK))
df = (df.transformation5
      .transformation6
      .transformation7
      .persist(MEMORY_AND_DISK))
.
.
.
df = (df.transformationN-2
      .transformationN-1
      .transformationN
      .persist(MEMORY_AND_DISK))
This is less verbose and, IMO, a little more "true" to what really happens: just a chaining of transformations on the original dataframe.
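For completeness, here is a minimal runnable sketch of the persist/unpersist pattern with concrete PySpark calls. The S3 path, column names and transformations are made up for illustration; only the persist/count/unpersist ordering is the point:
from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("persist-example").getOrCreate()

# hypothetical input, standing in for the real pipeline
df = spark.read.csv("s3://my-bucket/data.csv", header=True, inferSchema=True)

# first stage of transformations, cached as an intermediate result
stage1 = (df
          .filter(F.col("amount") > 0)
          .withColumn("year", F.year("date")))
stage1.persist(StorageLevel.MEMORY_AND_DISK)

# second stage built on top of the first
stage2 = (stage1
          .groupBy("year")
          .agg(F.sum("amount").alias("total")))
stage2.persist(StorageLevel.MEMORY_AND_DISK)

# materialise stage2 first, then release stage1's cached blocks
stage2.count()
stage1.unpersist()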

Show all unique values from a column

I am trying to show all the values from a dataframe column with this code: combinaison_df['coalesce'].unique()
but I get this result:
array([1.12440191, 0.54054054, 0.67510549, ..., 1.011378 , 1.39026812,
1.99637024])
I need to see all the values. Do you have an idea how to do that?
Thanks
If I'm in this situation but don't want to change the setting for every print() statement, I find this solution with a context manager quite handy:
with pd.option_context('display.max_rows', None):
    print(combinaison_df['coalesce'].unique())
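One thing to keep in mind: Series.unique() returns a NumPy array, so if the output still shows the ... ellipsis, it is NumPy's print threshold doing the truncating rather than pandas. A minimal sketch of the equivalent NumPy context manager (assuming the same combinaison_df):
import sys
import numpy as np

# temporarily lift NumPy's array-summarisation threshold
with np.printoptions(threshold=sys.maxsize):
    print(combinaison_df['coalesce'].unique())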

A value is trying to be set on a copy of a slice from a DataFrame - I don't understand it

I know this topic has been discussed a lot, and I am so sorry that I still can't find the solution, even though the difference between a view and a copy is easy to understand (in other languages).
def hole_aktienkurse_und_berechne_hist_einstandspreis(index, start_date, end_date):
    df_history = pdr.get_data_yahoo(symbols=index, start=start_date, end=end_date)
    df_history['HistEK'] = df_history['Adj Close']
    df_only_trd_index = df_group_trade.loc[index].copy()
    for i_hst, r_hst in df_history.iterrows():
        df_bis = df_only_trd_index[(df_only_trd_index['DateClose'] <= i_hst) & (df_only_trd_index['OpenPos'] == 0)].copy()
        # here comes the part that causes the trouble:
        df_history.loc[i_hst]['HistEK'] = df_history.loc[i_hst]['Adj Close'] - df_bis['Total'].sum()/100.0
    return df_history
I think I tried nearly everything, but I don't get it. Python is not easy when it comes to this topic.
When you have to specify both index and column in .loc, you have to put them together in a single call; otherwise the annoying message about views appears.
df_history.loc[i_hst, 'HistEK'] = df_history.loc[i_hst, 'Adj Close'] - df_bis['Total'].sum()/100.0
Look at the examples here.
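To illustrate the difference, here is a minimal sketch with a toy DataFrame (not the questioner's data):
import pandas as pd

df = pd.DataFrame({'Adj Close': [10.0, 11.0], 'HistEK': [0.0, 0.0]},
                  index=['2020-01-01', '2020-01-02'])

# Chained indexing: df.loc[row] returns an intermediate object, and assigning
# into it may modify a copy, so the change can silently fail to reach df
# (this is what typically triggers the SettingWithCopyWarning).
df.loc['2020-01-01']['HistEK'] = 5.0

# Single .loc call with both row and column labels: writes into df directly.
df.loc['2020-01-01', 'HistEK'] = 5.0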

How to stop Jupyter outputting truncated results when using pd.Series.value_counts()?

I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have tried unsuccessfully a couple of methods:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals  # print(vals) doesn't work either
# All of these output something like this:
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how to stop outputting truncated results?
If you want to print all rows:
pd.options.display.max_rows = 1000
print(vals)
If you want to print all rows only once:
with pd.option_context("display.max_rows", 1000):
    print(vals)
Relevant documentation here.
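As another option (a small sketch, assuming the column is called col1 as in the question): value_counts() returns a Series, and Series.to_string() renders every row regardless of the display options:
# print the full value_counts output without touching display options
print(df.col1.value_counts().to_string())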
I think you need option_context with the limit set to some large number, e.g. 999. The advantage of this solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
# temporarily display 999 rows
with pd.option_context('display.max_rows', 999):
    print(df.col1.value_counts())