Fill forward with CuPy / cuDF

Is it possible to do a fill forward with CuPy/cuDF? The idea is to implement a Schmitt trigger function, something like:
# pandas version
df = some_random_vector
on_off = (df > .3)*1 + (df < .3)*-1
on_off[on_off == 0] = np.nan
on_off = on_off.fillna(method='ffill').fillna(0)
I was trying the following, but CuPy doesn't seem to have the accumulate ufunc method:
def schmitt_trigger(x, th_lo, th_hi, initial=False):
    on_off = ((x >= th_hi)*1 + (x <= th_lo)*-1).astype(cp.int8)
    mask = (on_off == 0)
    idx = cp.where(~mask, cp.arange(start=0, stop=mask.shape[0], step=1), 0)
    cp.maximum.accumulate(idx, axis=1, out=idx)
    out = on_off[cp.arange(idx.shape[0])[:, None], idx]
    return out
Any ideas?
Thanks!

Sadly, RAPIDS currently doesn't have that feature in cuDF, and may not for 0.16 either. There is a feature request for it on GitHub: https://github.com/rapidsai/cudf/issues/1361
We'd love for you to chime in on the request so that the devs know it's highly desired.
As for the Schmitt trigger, I'll look into it and your code, and edit this post if I make any progress.
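In the meantime, here is a minimal NumPy sketch of the 1-D forward-fill-via-maximum.accumulate trick the attempted code is based on; the function name and thresholds are illustrative, and the same idea should port to the GPU once an accumulate-style primitive is available there:
import numpy as np

def schmitt_trigger_ffill(x, th_lo, th_hi):
    # +1 above the high threshold, -1 below the low one, 0 in the dead band
    on_off = (x >= th_hi).astype(np.int8) - (x <= th_lo).astype(np.int8)
    # index of the most recent non-zero sample at each position (0 if none yet)
    idx = np.where(on_off != 0, np.arange(x.shape[0]), 0)
    np.maximum.accumulate(idx, out=idx)
    # leading dead-band samples map to index 0, which is itself 0, matching fillna(0)
    return on_off[idx]

x = np.random.rand(20)
print(schmitt_trigger_ffill(x, 0.3, 0.7))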

Related

Unsure of Control flow in Pandas

I've been working on a Pandas project in Python and am a bit confused about how to accomplish a condition in Pandas.
The code below shows how I propose to calculate business_minutes and calendar_minutes between a close_date and an open_date. It works great, except when close_date has not yet been recorded, i.e. it is null.
I'm thinking I can use control logic something like the following, except I know the logic is not sound. Is there a way to do what I'd like to do, but correctly?
if close_date:
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], x['Close_Date']), axis=1)
    df_incident['Cal_Mins'] = (df_incident['Close_Date'] - df_incident['Open_Date']).dt.total_seconds()/60
elif:
    now = dt.now(timezone.utc)
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], now), axis=1)
    df_incident['Cal_Mins'] = (now - df_incident['Open_Date']).dt.total_seconds()/60
# get current utc time
now = dt.now(timezone.utc)
# set start and stop times of business day
#Specify Business Working hours (7am - 5pm)
start_time = dt.time(7,00,0)
end_time = dt.time(17,0,0)
us_holidays = pyholidays.US()
unit='min'
# Create a partial function as a shortcut
bduration = partial(bd.businessDuration, starttime=start_time, endtime=end_time, holidaylist=us_holidays, unit=unit)
df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], x['Close_Date']), axis=1)
df_incident['Cal_Mins'] = (df_incident['Close_Date'] - df_incident['Open_Date']).dt.total_seconds()/60
Have I presented my need clearly? Is it possible to do?
Thanks,
Jeff
As written, your code will not work because you call bduration() before defining it. Also, you assign to Bus_Mins and Cal_Mins twice in the body of the else condition. The second assignment will probably not work because close date is null. It is a syntax error to have an elif without a condition, so else: should be used instead. Something like the following might work:
# set start and stop times of business day
# Specify Business Working hours (7am - 5pm)
start_time = dt.time(7, 0, 0)
end_time = dt.time(17, 0, 0)
us_holidays = pyholidays.US()
unit = 'min'
# Create a partial function as a shortcut
bduration = partial(bd.businessDuration, starttime=start_time, endtime=end_time, holidaylist=us_holidays, unit=unit)
if close_date:
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], x['Close_Date']), axis=1)
    df_incident['Cal_Mins'] = (df_incident['Close_Date'] - df_incident['Open_Date']).dt.total_seconds()/60
else:
    # get current utc time
    now = dt.now(timezone.utc)
    df_incident['Bus_Mins'] = df_incident.apply(lambda x: bduration(x['Open_Date'], now), axis=1)
    df_incident['Cal_Mins'] = (now - df_incident['Open_Date']).dt.total_seconds()/60
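As a side note (not part of the original answer): if some rows already have a Close_Date and others do not, a per-row fallback may fit better than a single global if/else. A minimal sketch, assuming Close_Date holds NaT for still-open incidents, that the timestamps share the same timezone handling (drop timezone.utc if the columns are tz-naive), and that the temporary End column name is free; it reuses the bduration partial defined above:
# Fill missing close dates with the current UTC time, row by row,
# then compute both durations from the effective end timestamp.
now = dt.now(timezone.utc)
effective_close = df_incident['Close_Date'].fillna(now)
df_incident['Bus_Mins'] = df_incident.assign(End=effective_close).apply(
    lambda x: bduration(x['Open_Date'], x['End']), axis=1)
df_incident['Cal_Mins'] = (effective_close - df_incident['Open_Date']).dt.total_seconds()/60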

FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version

/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:8: FutureWarning: Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do left, right = left.align(right, axis=1, copy=False) before e.g. left == right
I tried to remove outliers from my DataFrame using the z-score, computed manually:
numerical_cols = df.select_dtypes(['int64', 'float64'])
for col in numerical_cols:
    feature_value_less_than_3sigma = df[col].mean() - 3*(df[col].std())
    feature_value_greater_than_3sigma = df[col].mean() + 3*(df[col].std())
    df = df[~((df[col] < (feature_value_less_than_3sigma)) | (df[col] > (feature_value_greater_than_3sigma)))]
else:
    print('\nAfter: ', df.shape)
I don't know what this warning is telling me and I'd like to understand it. Can anyone explain it with a simple example?
Instead of:
df = df[~((df[col] < (feature_value_less_than_3sigma)) |(df[col] > (feature_value_greater_than_3sigma)))]
Use:
df = df.query('~(%s < @feature_value_less_than_3sigma or %s > @feature_value_greater_than_3sigma)' % (col, col))
This should remove the warning.
Try this:
df = df[~((df[col].lt(feature_value_less_than_3sigma)) | (df[col].gt(feature_value_greater_than_3sigma)))]
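For context on what the warning itself refers to (not part of either answer): pandas 1.x emits it whenever a DataFrame is compared against a Series whose labels don't line up with the DataFrame's columns, so pandas has to reindex automatically before comparing. A hedged minimal reproduction:
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
ser = pd.Series([1, 3], index=['a', 'c'])   # 'c' is not a column of frame

# Comparing a DataFrame with a mis-aligned Series triggers the FutureWarning;
# aligning first, as the warning message suggests, silences it.
frame == ser
left, right = frame.align(ser, axis=1, copy=False)
result = left == right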

How to apply Cuda.jl on DataFrames in julia

I am using the DataFrames.jl package in the code below to perform certain operations.
I would like to know how I may apply CUDA.jl to this code, if possible, while keeping the DataFrame aspect.
Secondly, is it possible to let the code automatically choose between CPU and GPU based on availability?
Code
using DataFrames
df = DataFrame(i = Int64[], a = Float64[], b = Float64[])
for i in 1:10
    push!(df.i, i)
    a = i + sin(i)*cos(i)/sec(i)^100
    push!(df.a, a)
    b = i + tan(i)*csc(i)/sin(i)
    push!(df.b, b)
end
transform!(df, [:a, :b] .=> (x -> [missing; diff(x)]) .=> [:da, :db])
Please suggest a solution to make this code compatible with CUDA.jl.
Thanks in advance!!

Does dask dataframe apply preserve rows order?

I am considering using a closure holding the current state to compute the rolling window (which in my case has width 2), in order to answer my own question, which I recently posed. Something along the lines of:
def test(init_value):
    def my_fcn(x, y):
        nonlocal init_value
        actual_value = (x + y) * init_value
        init_value = actual_value
        return init_value
    return my_fcn
where my_fcn is a dummy function used for testing. The function would then be initialised through actual_fcn = test(0), where we assume the initial value is zero, for example. Finally, one could use the function through ddf.apply (where ddf is the actual dask dataframe).
Now the question: this would only work if the order of the computations is preserved; otherwise everything would be scrambled. I have not tested it, and even if it passed, I could not be 100% sure it would always preserve the order. So, the question is:
Does dask dataframe's apply method preserve row order?
Any other ideas? Any help highly appreciated.
Apparently yes. I am using dask 1.0.0.
The following code:
import numpy as np
import pandas as pd
import dask.dataframe as dd

number_of_components = 30
df = pd.DataFrame(np.random.randint(0, number_of_components, size=(number_of_components, 4)), columns=list('ABCD'))
my_data_frame = dd.from_pandas(df, npartitions=1)

def sumPrevious(previousState):
    def getValue(row):
        nonlocal previousState
        something = row['A'] - previousState
        previousState = row['A']
        return something
    return getValue

given_func = sumPrevious(1)
out = my_data_frame.apply(given_func, axis=1, meta=float).compute()
behaves as expected. There is a big caveat: if the previous state is passed by reference (i.e. it is an object of some class), the user should be careful when updating it inside the nested function, because the update will have side effects on the original object.
Rigorously, this example does not prove that order is preserved under all circumstances, so I would still be interested to know whether I can rely on this assumption.
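A hedged way to spot-check this for the single-partition setup above (a sanity check, not a proof of the general case): run the same stateful closure on the underlying pandas frame, where evaluation order is guaranteed, and compare the two results element-wise. This reuses df, my_data_frame and sumPrevious from the snippet above.
# Fresh closures so both runs start from the same initial state of 1.
dask_result = my_data_frame.apply(sumPrevious(1), axis=1, meta=float).compute()
pandas_result = df.apply(sumPrevious(1), axis=1)
# If row order were scrambled, the running state would diverge and the
# element-wise comparison below would fail.
print(np.allclose(dask_result.values, pandas_result.values))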

pandas gets stuck when trying to read from bigquery

I have a fairly large table in BigQuery (approx. 9M rows) and I would like to read it via pandas.
I've tried reading it with the pd.read_gbq() function, which works fine on small tables.
On the large table it gets stuck after 50 seconds or so (the logs show elapsed .. 50s), without giving an error or anything.
My question is: how can I read that table using pandas (chunks?)? Any conventions on scaling up these BigQuery reads would be helpful.
EDIT / resolution
Adding to Khan's answer, I ended up implementing chunks, writing 500,000 rows at a time to a file, then reading those files into a dataframe, like so:
def download_gbq_table(self):
    if not os.path.exists(self.tmp_dir):
        os.makedirs(self.tmp_dir)
    increment = 100000
    intervals = list(range(0, self.table_size, increment))
    intervals.append(self.table_size - intervals[len(intervals) - 1])
    df = pd.DataFrame()
    for offset in intervals:
        query = f"select * from `<table_name>` limit {increment} offset {offset};"
        logger.info(f"running query: {query}")
        start_time = time.time()
        tmp_df = pd.read_gbq(query,
                             project_id=self.connection_parameters['project_id'],
                             private_key=self.connection_parameters['service_account'],
                             dialect='standard')
        df = pd.concat([df, tmp_df])
        logger.info(f'time took: {str(round(time.time() - start_time, 2))}')
        if len(df) % 500000 == 0:
            df.to_csv(os.path.join(self.tmp_dir, f'df_{str(offset + increment)}.csv'))
            df = pd.DataFrame()

def read_df_from_multi_csv(self):
    all_files = glob.glob(os.path.join(self.tmp_dir, "df_*"))
    df_list = []
    for f in all_files:
        start_time = time.time()
        df_list.append(pd.read_csv(f))
        logger.info(f'time took for reading {f}: {str(round(time.time() - start_time, 2))}')
    return pd.concat(df_list)
Pandas' read_gbq function currently does not provide a chunksize parameter (even though its counterpart to_gbq does).
Anyway, you can solve your problem by adding LIMIT and OFFSET to your SQL query and reading from BigQuery iteratively. Something along the lines of:
project_id = "xxxxxxxx"
increment = 100000
table_size = 9000000

query_str = "select * from `mydataset.mytable` limit {limit} offset {offset};"
for offset in range(0, table_size, increment):
    query = query_str.format(limit=increment, offset=offset)
    df = pd.read_gbq(query, project_id)
    # -- do stuff with your df here..
Not sure if this existed back when the question was originally asked, but now you can use python-bigquery-sqlalchemy to read data from BigQuery, which lets you use the built-in chunking ability of pandas.read_sql(). You just create a SQLAlchemy connection engine using "bigquery://{project-name}" and pass that to con in pandas.read_sql().
For example:
from sqlalchemy.engine import create_engine
import pandas as pd

read_project = "my-cool-project"
query = f"""
select * from `{read_project}.cool-dataset.cooltable`
"""
bq_engine = create_engine(f"bigquery://{read_project}")

for df in pd.read_sql(query, con=bq_engine, chunksize=100_000):
    ...  # do stuff with df...
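If the chunks ultimately need to be combined into a single frame, the iterator returned by pd.read_sql can be fed straight into pd.concat; a small sketch reusing query and bq_engine from above:
# pd.concat drains the chunk iterator and stitches the pieces into one DataFrame.
full_df = pd.concat(
    pd.read_sql(query, con=bq_engine, chunksize=100_000),
    ignore_index=True,
)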