Pandas method chaining pd.to_datetime() through .assign() - pandas

I'm trying, with no luck, to method chain pd.to_datetime() through .assign().
This works:
tcap2 = tcap.\
    assign(person = tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
           date_time = tcap['text'].str.extract(r'\(([^()]+)\)'),
           text = tcap['text'].str.split(': ').str[1])
tcap2['date_time'] = pd.to_datetime(tcap2['date_time'])
but I was hoping to have the whole chunk in the same chain like this:
tcap2 = tcap.\
    assign(person = tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
           date_time = tcap['text'].str.extract(r'\(([^()]+)\)'),
           text = tcap['text'].str.split(': ').str[1]).\
    assign(date_time = lambda df: pd.to_datetime(tcap['date_time']))
I would be grateful for any advice

Thank you Nipy, you are awesome; it was just a little change in my lambda function (facepalm).
This worked an absolute treat and makes the code so much more compact and readable:
tcap = tcap.\
    assign(person = tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
           date_time = tcap['text'].str.extract(r'\(([^()]+)\)'),
           text = tcap['text'].str.split(': ').str[1]).\
    assign(date_time = lambda tcap: pd.to_datetime(tcap['date_time']))

On a separate note, to avoid the use of '\' and to make chaining like this more readable you can surround the expression with parentheses:
tcap = (tcap
        .assign(person=tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
                date_time=tcap['text'].str.extract(r'\(([^()]+)\)'),
                text=tcap['text'].str.split(': ').str[1])
        .assign(date_time=lambda tcap: pd.to_datetime(tcap['date_time']))
       )
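For a self-contained illustration, here is roughly the same chain run against a tiny made-up chat-log frame (the sample rows, and the expand=False on str.extract, are my additions, not part of the original question):

import pandas as pd

# Hypothetical data; the asker's real 'tcap' frame is not shown in the question.
tcap = pd.DataFrame({'text': ['Alice (2021-03-01 09:15) Alice: Good morning',
                              'Bob (2021-03-01 09:17) Bob: Hi there']})

tcap = (tcap
        .assign(person=tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
                # expand=False keeps the extracted group as a Series rather than a DataFrame
                date_time=tcap['text'].str.extract(r'\(([^()]+)\)', expand=False),
                text=tcap['text'].str.split(': ').str[1])
        .assign(date_time=lambda tcap: pd.to_datetime(tcap['date_time']))
       )

print(tcap.dtypes)  # date_time should now be datetime64[ns]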

Related

Replacing append with concat?

Whenever I run this code I get:
The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
What should I do to make the code run with concat?
final_dataframe = pd.DataFrame(columns = my_columns)
for symbol in stocks['Ticker']:
    api_url = f'https://sandbox.iexapis.com/stable/stock/{symbol}/quote?token={IEX_CLOUD_API_TOKEN}'
    data = requests.get(api_url).json()
    final_dataframe = final_dataframe.append(
        pd.Series([symbol,
                   data['latestPrice'],
                   data['marketCap'],
                   'N/A'],
                  index = my_columns),
        ignore_index = True)
See this release note, or from another post:
"Append is the specific case (axis=0, join='outer') of concat" link
The change in your code should be as follows (I extracted the pd.Series into a variable just for presentation):
s = pd.Series([symbol, data['latestPrice'], data['marketCap'], 'N/A'], index = my_columns)
# Wrap the Series in a one-row DataFrame first, otherwise concat stacks it as a column rather than a row.
final_dataframe = pd.concat([final_dataframe, s.to_frame().T], ignore_index = True)
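As a side note, growing the frame inside the loop copies it on every iteration. A common alternative (a sketch reusing the my_columns, stocks and IEX_CLOUD_API_TOKEN names from the question) is to collect the rows first and build the frame once at the end:

import pandas as pd
import requests

rows = []
for symbol in stocks['Ticker']:
    api_url = f'https://sandbox.iexapis.com/stable/stock/{symbol}/quote?token={IEX_CLOUD_API_TOKEN}'
    data = requests.get(api_url).json()
    rows.append(pd.Series([symbol, data['latestPrice'], data['marketCap'], 'N/A'],
                          index = my_columns))

# Build the DataFrame once instead of concatenating inside the loop.
final_dataframe = pd.DataFrame(rows, columns = my_columns)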

Writing "" value in column for missing geopy Nominatim dictionary key (raise KeyError) using Lambda with Pandas

I am looking for a way to handle a missing dictionary key while parsing through geopy geocode.reverse results. I'm writing the result to a column df['Street'], using a lambda to parse through the lat/longs from another column.
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from tqdm import tqdm

geo = Nominatim(user_agent = "Standard_Road", timeout = 10)
geocode = RateLimiter(geo.geocode, min_delay_seconds = .75)
tqdm.pandas()

df['geom'] = df['Latitude'].map(str) + ',' + df['Longitude'].map(str)
df['geom'][0]
df['Street'] = df['geom'].progress_apply(lambda x: geo.reverse(x, language = 'en').raw['address']['road'])
This returns the road value from the dictionary until the key 'road' does not exist. So I'm trying to handle that case with a simple if/else that writes None or "" to the column; however, what I have tried below raises the same KeyError.
df['Street'] = df['geom'].progress_apply(lambda x: geo.reverse(x, language = 'en').raw['address']['road']
                                         if df['geom'].get(geo.reverse(x, language = 'en').raw['address']['road'])
                                         else geo.reverse(x, language = 'en').raw['address']['road'] == None)
Any help with this would be greatly appreciated!
Fixed, but maybe someone has a better solution.
I created a variable that holds just the raw address dictionary for each row. At that point I was able to call .get to create the new columns in the dataframe. .get returns None for missing keys, which populated the street data I needed and left blank values where no key existed in the dictionary.
### Geo search for consistent street & zip. Creating new columns 'Street' & 'Zip Code'
geo = Nominatim(user_agent = "Standard_Road", timeout = 10)
geocode = RateLimiter(geo.reverse, min_delay_seconds = 2)
tqdm.pandas()

g = df['Latitude'].map(str) + ',' + df['Longitude'].map(str)
g[0]

# Series holding the raw address dictionary for each row
d = g.progress_apply(lambda x: geo.reverse(x).raw['address'])

# .get returns None when a key is missing instead of raising KeyError
df['Street'] = [addr.get('road') for addr in d]
df['Zip Code'] = [addr.get('postcode') for addr in d]
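A slightly more direct variant (just a sketch, not tested against the asker's data) is to do the .get lookup inside the apply itself, with a default for the missing key:

# Fall back to "" when 'road' (or even 'address') is missing instead of raising KeyError.
df['Street'] = df['geom'].progress_apply(
    lambda x: geo.reverse(x, language = 'en').raw.get('address', {}).get('road', ''))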

How to method chain .agg() and .assign() functions in Pandas

I am looking to replicate this dplyr query in Pandas but am having trouble chaining the .agg() and .assign() functions together, and would be so grateful for any advice.
Dplyr code:
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))
Attempt at the same in Pandas:
Within the .assign() part I am referring back to the original dataframe, but nothing else I try works.
counties.\
    groupby('state').\
    agg(total_area = ('land_area', 'sum'),
        total_population = ('population', 'sum')).\
    reset_index().\
    assign(density = counties['total_population'] / counties['total_area']).\
    arrange('density', ascending = False).\
    head()
The problem is that you need a lambda to work on the chained data, i.e. the data already processed by the previous chained methods, so change:
assign(density = counties['total_population'] / counties['total_area'])
to:
assign(density = lambda x: x['total_population'] / x['total_area'])
Another problem is the sorting: instead of
arrange('density', ascending = False)
use the method DataFrame.sort_values:
sort_values('density', ascending = False)
All together, with . used to start each method:
df = (counties.groupby('state')
              .agg(total_area = ('land_area', 'sum'),
                   total_population = ('population', 'sum'))
              .reset_index()
              .assign(density = lambda x: x['total_population'] / x['total_area'])
              .sort_values('density', ascending = False)
              .head())
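As a quick sanity check, the chain above runs on a tiny made-up counties frame (the numbers below are placeholders, not real data):

import pandas as pd

counties = pd.DataFrame({'state': ['A', 'A', 'B'],
                         'land_area': [10.0, 20.0, 5.0],
                         'population': [100, 300, 400]})

df = (counties.groupby('state')
              .agg(total_area = ('land_area', 'sum'),
                   total_population = ('population', 'sum'))
              .reset_index()
              .assign(density = lambda x: x['total_population'] / x['total_area'])
              .sort_values('density', ascending = False)
              .head())
print(df)   # B: 400 / 5 = 80.0, then A: 400 / 30 ≈ 13.3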
With datar, it is easy to port your dplyr code to python code, without learning pandas APIs:
from datar.all import f, group_by, summarize, sum, mutate, arrange, desc

counties_selected >> \
    group_by(f.state) >> \
    summarize(total_area = sum(f.land_area),
              total_population = sum(f.population)) >> \
    mutate(density = f.total_population / f.total_area) >> \
    arrange(desc(f.density))
I am the author of the package. Feel free to submit issues if you have any questions.

Is there a wrapper library for solving optimisation problems by declaring known and unknown variables?

cvxpy has a very neat way to write out the optimisation form without worrying too much about converting it into a "standard" matrix form as this is done internally somehow. Best to explain with an example:
import numpy as np
import cvxpy as cp
from scipy.optimize import minimize, LinearConstraint


def cvxpy_implementation():
    var1 = cp.Variable()
    var2 = cp.Variable()
    constraints = [
        var1 <= 3,
        var2 >= 2,
    ]
    obj_fun = cp.Minimize(var1**2 + var2**2)
    problem = cp.Problem(obj_fun, constraints)
    problem.solve()
    return var1.value, var2.value


def scipy_implementation1():
    A = np.diag(np.ones(2))
    lb = np.array([-np.inf, 2])
    ub = np.array([3, np.inf])
    con = LinearConstraint(A, lb, ub)

    def obj_fun(x):
        return (x**2).sum()

    result = minimize(obj_fun, [0, 0], constraints=con)
    return result.x


def scipy_implementation2():
    con = [
        {'type': 'ineq', 'fun': lambda x: 3 - x[0]},
        {'type': 'ineq', 'fun': lambda x: x[1] - 2},
    ]

    def obj_fun(x):
        return (x**2).sum()

    result = minimize(obj_fun, [0, 0], constraints=con)
    return result.x
All of the above give the correct result, but the cvxpy implementation is much "easier" to write out. Specifically, I don't have to worry about arranging the inequalities and I can give variables useful names when writing them out. Compare that to the scipy1 and scipy2 implementations: in the first case I have to write out these extra infs, and in the second case I have to remember which variable is which. You can imagine a case where I have 100 variables; while concatenating them will ultimately need to be done, I'd like to be able to write it out like in cvxpy.
Question:
Has anyone implemented this for scipy? Or is there an alternative library that could make this work?
Thank you.
I wrote something up that does this and seems to cover the main issues I had in mind.
The general idea is that you define variables, create a simple expression as you would normally write it out, and then the solver class optimises over the defined variables:
https://github.com/evan54/optimisation/blob/master/var.py
The example below illustrates a simple use case:
import numpy as np
from var import Variable, Problem  # assuming the linked var.py above is importable

# fake data
a = 2
m = 3
x = np.linspace(0, 10)
y = a * x + m + np.random.randn(len(x))

# fit y = a_ * x + m_ by minimising the squared error
a_ = Variable()
m_ = Variable()
y_ = a_ * x + m_
error = y_ - y
prob = Problem((error**2).sum(), None)
prob.minimize()

print(f'a = {a}, a_ = {a_}')
print(f'm = {m}, m_ = {m_}')

bnlearn error in structural.em

I got an error when trying to use structural.em from the "bnlearn" package.
This is the code:
cut.learn <- structural.em(cut.df, maximize = "hc",
                           maximize.args = "restart",
                           fit = "mle", fit.args = list(),
                           impute = "parents", impute.args = list(), return.all = FALSE,
                           max.iter = 5, debug = FALSE)

Error in check.data(x, allow.levels = TRUE, allow.missing = TRUE,
  warn.if.no.missing = TRUE, : at least one variable has no observed values.
Has anyone had the same problem? Please tell me how to fix it.
Thank you.
I got structural.em working. I am currently working on a python interface to bnlearn that I call pybnl. I also ran into the problem you describe above.
Here is a jupyter notebook that shows how to use StructuralEM from Python on the marks dataset.
The gist of it is described in slides-bnshort.pdf on page 135, "The MARKS Example, Revisited".
You have to create an initial fit with an initial imputed dataframe by hand and then provide the arguments to structural.em like so (ldmarks is the latent-discrete-marks dataframe where the LAT column only contains missing/NA values):
library(bnlearn)
data('marks')
dmarks = discretize(marks, breaks = 2, method = "interval")
ldmarks = data.frame(dmarks, LAT = factor(rep(NA, nrow(dmarks)), levels = c("A", "B")))
imputed = ldmarks
# Randomly set values of the unobserved variable in the imputed data.frame
imputed$LAT = sample(factor(c("A", "B")), nrow(dmarks), replace = TRUE)
# Fit the parameters over an empty graph
dag = empty.graph(nodes = names(ldmarks))
fitted = bn.fit(dag, imputed)
# Although we've set imputed values randomly, nonetheless override them with a uniform distribution
fitted$LAT = array(c(0.5, 0.5), dim = 2, dimnames = list(c("A", "B")))
# Use whitelist to enforce arcs from the latent node to all others
r = structural.em(ldmarks, fit = "bayes", impute = "bayes-lw", start = fitted,
                  maximize.args = list(whitelist = data.frame(from = "LAT", to = names(dmarks))),
                  return.all = TRUE)
You have to use bnlearn 4.4-20180620 or later, because it fixes a bug in the underlying impute function.