How to method chain .agg() and .assign() functions in Pandas

I am looking to replicate this dplyr query in Pandas, but I am having trouble chaining the .agg() and .assign() functions together, and would be grateful for any advice.
dplyr code:
counties_selected %>%
  group_by(state) %>%
  summarize(total_area = sum(land_area),
            total_population = sum(population)) %>%
  mutate(density = total_population / total_area) %>%
  arrange(desc(density))
Attempt at the same in Pandas:
Within the .assign() part I am pointing back at the original dataframe, but nothing I have tried works:
counties.\
    groupby('state').\
    agg(total_area = ('land_area', 'sum'),
        total_population = ('population', 'sum')).\
    reset_index().\
    assign(density = counties['total_population'] / counties['total_area']).\
    arrange('density', ascending = False).\
    head()

The first problem is that .assign() needs a lambda to work on the chained data, i.e. the intermediate result already produced by the previous methods in the chain; change:
assign(density = counties['total_population'] / counties['total_area'])
to:
assign(density = lambda x: x['total_population'] / x['total_area'])
The other problem is the sorting step: arrange is not a pandas method, so replace:
arrange('density', ascending = False)
with DataFrame.sort_values:
sort_values('density', ascending = False)
All together, wrapping the chain in parentheses and starting each method on its own line with a leading .:
df = (counties.groupby('state')
        .agg(total_area = ('land_area', 'sum'),
             total_population = ('population', 'sum'))
        .reset_index()
        .assign(density = lambda x: x['total_population'] / x['total_area'])
        .sort_values('density', ascending = False)
        .head())
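For a quick sanity check, here is the same chain run end to end on a small made-up counties frame (the column names follow the question, the values are invented; named aggregation in .agg() needs pandas 0.25 or later):
import pandas as pd

# toy stand-in for counties_selected; the values are made up
counties = pd.DataFrame({
    'state': ['A', 'A', 'B', 'B'],
    'land_area': [10.0, 20.0, 5.0, 5.0],
    'population': [100, 300, 50, 150],
})

df = (counties.groupby('state')
        .agg(total_area = ('land_area', 'sum'),
             total_population = ('population', 'sum'))
        .reset_index()
        .assign(density = lambda x: x['total_population'] / x['total_area'])
        .sort_values('density', ascending = False)
        .head())
print(df)
# state B (density 20.0) sorts above state A (density ~13.3)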

With datar, it is easy to port your dplyr code to Python without learning the pandas APIs:
from datar.all import f, group_by, summarize, sum, mutate, arrange, desc
counties_selected >> \
group_by(f.state) >> \
summarize(total_area = sum(f.land_area),
total_population = sum(f.population)) >> \
mutate(density = f.total_population / f.total_area) >> \
arrange(desc(f.density))
I am the author of the package. Feel free to submit issues if you have any questions.

Related

Pandas method chaining pd.to_datetime() through .assign()

I'm trying, with no luck, to method chain pd.to_datetime() through .assign().
This works:
tcap2 = tcap.\
    assign(person = tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
           date_time = tcap['text'].str.extract(r'\(([^()]+)\)'),
           text = tcap['text'].str.split(': ').str[1])
tcap2['date_time'] = pd.to_datetime(tcap2['date_time'])
but I was hoping to have the whole chunk in the same chain like this:
tcap2 = tcap.\
    assign(person = tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
           date_time = tcap['text'].str.extract(r'\(([^()]+)\)'),
           text = tcap['text'].str.split(': ').str[1]).\
    assign(date_time = lambda df: pd.to_datetime(tcap['date_time']))
I would be grateful for any advice
Thank you Nipy, you are awesome; it was just a little change in my lambda function (facepalm).
This worked an absolute treat and makes the code so much more compact and readable:
tcap = tcap.\
    assign(person = tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
           date_time = tcap['text'].str.extract(r'\(([^()]+)\)'),
           text = tcap['text'].str.split(': ').str[1]).\
    assign(date_time = lambda tcap: pd.to_datetime(tcap['date_time']))
On a separate note, to avoid the use of '\' and to make chaining like this more readable you can surround the expression with parentheses:
tcap = (tcap
    .assign(person=tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
            date_time=tcap['text'].str.extract(r'\(([^()]+)\)'),
            text=tcap['text'].str.split(': ').str[1])
    .assign(date_time = lambda tcap: pd.to_datetime(tcap['date_time']))
)
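For completeness, here is a self-contained sketch of that final chain on a made-up tcap frame (the 'text' format is assumed to look like "Name (date): message", and expand=False is added so str.extract returns a Series):
import pandas as pd

# hypothetical toy data; the real message format is assumed
tcap = pd.DataFrame({'text': ['Alice (2020-01-01 10:00): hello',
                              'Bob (2020-01-01 10:05): hi there']})

tcap = (tcap
    .assign(person=tcap['text'].apply(lambda x: x.split(" ", 1)[0]),
            # expand=False so a single capture group comes back as a Series
            date_time=tcap['text'].str.extract(r'\(([^()]+)\)', expand=False),
            text=tcap['text'].str.split(': ').str[1])
    .assign(date_time=lambda tcap: pd.to_datetime(tcap['date_time']))
)
print(tcap.dtypes)   # date_time should now be datetime64[ns]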

Best way to find modes of an array along the column

Suppose that I have an array
a = np.array([[1,2.5,3,4],[1, 2.5, 3,3]])
I want to find the mode of each column without using stats.mode().
The only way I can think of is the following:
result = np.zeros(a.shape[1])
for i in range(len(result)):
    curr_col = a[:,i]
    result[i] = curr_col[np.argmax(np.unique(curr_col, return_counts = True))]
Update: there is an error in the code above; the correct version is:
values, counts = np.unique(a[:,i], return_counts = True)
result[i] = values[np.argmax(counts)]
I have to use the loop because np.unique does not return results of a consistent shape for each column, and np.bincount cannot be used because the dtype is not integer.
If you look at the numpy.unique documentation, this function returns the values and the associated counts (because you specified return_counts=True). A slight modification of your code is needed to give the correct result. What you are trying to do is find the value associated with the highest count:
import numpy as np
a = np.array([[1,5,3,4],[1,5,3,3],[1,5,3,3]])
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:,i], return_counts = True)
    result[i] = values[np.argmax(counts)]
print(result)
Output:
% python3 script.py
[1. 5. 3. 4.]
Here is some code that compares your solution with the scipy.stats.mode function:
import numpy as np
import scipy.stats as sps
import time
a = np.random.randint(1,100,(100,100))
t_start = time.time()
result = np.zeros(a.shape[1])
for i in range(len(result)):
    values, counts = np.unique(a[:,i], return_counts = True)
    result[i] = values[np.argmax(counts)]
print('Timer 1: ', (time.time()-t_start), 's')
t_start = time.time()
result_2 = sps.mode(a, axis=0).mode
print('Timer 2: ', (time.time()-t_start), 's')
print('Matrices are equal!' if np.allclose(result, result_2) else 'Matrices differ!')
Output:
% python3 script.py
Timer 1: 0.002721071243286133 s
Timer 2: 0.003339052200317383 s
Matrices are equal!
I tried several parameter values and your code is actually faster than the scipy.stats.mode function, so it is probably close to optimal.
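If the goal is just to hide the explicit loop rather than speed it up, the same unique/argmax logic can be wrapped with np.apply_along_axis (it still loops over the columns internally, so don't expect it to be faster):
import numpy as np

def col_mode(col):
    # mode of a 1-D array: the value with the highest count
    values, counts = np.unique(col, return_counts=True)
    return values[np.argmax(counts)]

a = np.array([[1, 2.5, 3, 4],
              [1, 2.5, 3, 3]])
result = np.apply_along_axis(col_mode, 0, a)
print(result)   # [1.  2.5 3.  3. ]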

Should I use classes for pandas.DataFrame?

I have more of a general question. I've written a couple of functions that transform data successively:
def func1(df):
    pass
...
def main():
    df = pd.read_csv()
    df1 = func1(df)
    df2 = func2(df1)
    df3 = func3(df2)
    df4 = func4(df3)
    df4.to_csv()
if __name__ == "__main__":
    main()
Is there a better way of organizing the logic of my script?
Should I use classes for cases like this when everything is tied to one dataset?
It depends on your use case. From what I understand, I would use a dictionary of the functions that process a df.
For instance:
function_returning_a_df = {"f1": func1, "f2": func2, "f3": func3}
df = pd.read_csv(csv)
# if this df needs 3 functions to be applied
df_processing = ["f1", "f2", "f3"]  # functions will be applied in this order
# If you need to keep df at every step you can make a list
dfs_processed = []
for func in df_processing:
    dfs_processed.append(df)  # if you want to save all steps
    df = function_returning_a_df[func](df)
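Another way to organize this kind of step-by-step transformation, without classes or a registry dict, is DataFrame.pipe, which chains plain functions that each take and return a DataFrame. A minimal sketch (the function bodies and file names are placeholders):
import pandas as pd

def func1(df):
    # placeholder transformation, e.g. drop incomplete rows
    return df.dropna()

def func2(df):
    # placeholder transformation, e.g. normalise column names
    return df.rename(columns=str.lower)

def main():
    out = (pd.read_csv("input.csv")        # placeholder path
             .pipe(func1)
             .pipe(func2))
    out.to_csv("output.csv", index=False)  # placeholder path

if __name__ == "__main__":
    main()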

dask how to define a custom (time fold) function that operates in parallel and returns a dataframe with a different shape

I am trying to implement a time fold function to be mapped over the partitions of a dask dataframe, which in turn changes the shape of the dataframe in question (or alternatively produces a new dataframe with the altered shape). This is how far I have gotten. The result 'res' returned on compute is a list of 3 delayed objects. When I try to compute each of them in a loop (last two lines of code), I get "TypeError: 'DataFrame' object is not callable". After going through the examples for map_partitions, I also tried altering the input DF in place inside the function with no return value, which causes a similar TypeError with NoneType. What am I missing?
Also, looking at the visualization (attached), I feel like the individually computed (folded) partitions need to be reduced into a single DF. How do I do this?
#! /usr/bin/env python
# Start dask scheduler and workers
# dask-scheduler &
# dask-worker --nthreads 1 --nprocs 6 --memory-limit 3GB localhost:8786 --local-directory /dev/shm &
from dask.distributed import Client
from dask.delayed import delayed
import pandas as pd
import numpy as np
import dask.dataframe as dd
import math
foldbucketsecs=30
periodicitysecs=15
secsinday=24 * 60 * 60
chunksizesecs=60 # 1 minute
numts = 5
start = 1525132800 # 01/05
end = 1525132800 + (3 * 60) # 3 minute
c = Client('127.0.0.1:8786')
def fold(df, start, bucket):
    return df
def reduce_folds(df):
    return df
def load(epoch):
    idx = []
    for ts in range(0, chunksizesecs, periodicitysecs):
        idx.append(epoch + ts)
    d = np.random.rand(chunksizesecs/periodicitysecs, numts)
    ts = []
    for i in range(0, numts):
        tsname = "ts_%s" % (i)
        ts.append(tsname)
        gts.append(tsname)
    res = pd.DataFrame(index=idx, data=d, columns=ts, dtype=np.float64)
    res.index = pd.to_datetime(arg=res.index, unit='s')
    return res
gts = []
load(start)
cols = len(gts)
idx1 = pd.DatetimeIndex(start=start, freq=('%sS' % periodicitysecs), end=start+periodicitysecs, dtype='datetime64[s]')
meta = pd.DataFrame(index=idx1[:0], data=[], columns=gts, dtype=np.float64)
dfs = [delayed(load)(fn) for fn in range(start, end, chunksizesecs)]
from_delayed = dd.from_delayed(dfs, meta, 'sorted')
nfolds = int(math.ceil((end - start)/foldbucketsecs))
cprime = nfolds * cols
gtsnew = []
for i in range(0, cprime):
    gtsnew.append("ts_%s,fold=%s" % (i%cols, i/cols))
idx2 = pd.DatetimeIndex(start=start, freq=('%sS' % periodicitysecs), end=start+foldbucketsecs, dtype='datetime64[s]')
meta = pd.DataFrame(index=idx2[:0], data=[], columns=gtsnew, dtype=np.float64)
folded_df = from_delayed.map_partitions(delayed(fold)(from_delayed, start, foldbucketsecs), meta=meta)
result = c.submit(reduce_folds, folded_df)
c.gather(result).visualize(filename='/usr/share/nginx/html/svg/df4.svg')
res = c.gather(result).compute()
for f in res:
    f.compute()
Never mind! It was my fault: instead of wrapping my function in delayed, I simply passed it to the map_partitions call like so, and it worked.
folded_df = from_delayed.map_partitions(fold, start, foldbucketsecs, nfolds, meta=meta)
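For reference, here is a minimal, self-contained sketch of that working pattern on toy data (a per-partition resample stands in for the real fold logic; extra arguments go straight into map_partitions and meta describes the output schema):
import numpy as np
import pandas as pd
import dask.dataframe as dd

# toy data: one numeric column on a 15-second datetime index
pdf = pd.DataFrame({'x': np.arange(8.0)},
                   index=pd.date_range('2018-05-01', periods=8, freq='15s'))
ddf = dd.from_pandas(pdf, npartitions=2)

def fold(df, bucket):
    # shape-changing transformation applied to each partition
    return df.resample(bucket).mean()

meta = pdf.iloc[:0]                      # empty frame with the output schema
folded = ddf.map_partitions(fold, '30s', meta=meta)
print(folded.compute())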

Numpy: regrid by averaging?

I'm trying to regrid a numpy array onto a new grid. In this specific case, I'm trying to regrid a power spectrum onto a logarithmic grid so that the data are evenly spaced logarithmically for plotting purposes.
Doing this with straight interpolation using np.interp results in some of the original data being ignored entirely. Using digitize gets the result I want, but I have to use some ugly loops to get it to work:
xfreq = np.fft.fftfreq(100)[1:50] # only positive, nonzero freqs
psw = np.arange(xfreq.size) # dummy array for MWE
# new logarithmic grid
logfreq = np.logspace(np.log10(np.min(xfreq)), np.log10(np.max(xfreq)), 100)
inds = np.digitize(xfreq,logfreq)
# interpolation: ignores data *but* populates all points
logpsw = np.interp(logfreq, xfreq, psw)
# so average down where available...
logpsw[np.unique(inds)] = [psw[inds==i].mean() for i in np.unique(inds)]
# the new plot
loglog(logfreq, logpsw, linewidth=0.5, color='k')
Is there a nicer way to accomplish this in numpy? I'd be satisfied with just a replacement of the inline loop step.
You can use bincount() twice to calculate the average value of each bin: np.bincount(inds, psw) gives the per-bin sum of psw (psw acts as weights) and np.bincount(inds) gives the per-bin count, so their ratio is the per-bin mean:
logpsw2 = np.interp(logfreq, xfreq, psw)
counts = np.bincount(inds)
mask = counts != 0
logpsw2[mask] = np.bincount(inds, psw)[mask] / counts[mask]
or use unique(inds, return_inverse=True) and bincount() twice:
logpsw4 = np.interp(logfreq, xfreq, psw)
uinds, inv_index = np.unique(inds, return_inverse=True)
logpsw4[uinds] = np.bincount(inv_index, psw) / np.bincount(inv_index)
Or if you use Pandas:
import pandas as pd
logpsw4 = np.interp(logfreq, xfreq, psw)
s = pd.Series(psw).groupby(inds).mean()
logpsw4[s.index] = s.values
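To make the first approach concrete, here is a tiny standalone demo of the bincount trick on made-up integer bin indices:
import numpy as np

inds = np.array([0, 0, 1, 3, 3, 3])       # bin index of each sample
y = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 6.0])

sums = np.bincount(inds, weights=y)       # per-bin sum of y
counts = np.bincount(inds)                # per-bin sample count
mask = counts != 0                        # bins that actually received samples
means = np.zeros_like(sums)
means[mask] = sums[mask] / counts[mask]
print(means)   # [2. 5. 0. 4.]  (bin 2 is empty, so it stays 0)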