Polars DataFrame does not currently provide a method to update the value of a single cell. Instead, we have to use the method DataFrame.apply or DataFrame.apply_at_idx, which updates a whole column / Series. This can be very expensive in situations where an algorithm repeatedly updates a few elements of some columns. Why is DataFrame designed this way? Looking into the code, it seems to me that Series does provide inner mutability via the method Series._get_inner_mut?
As of polars >= 0.15.9, mutating any data backed by numbers (numeric data, dates and durations) has constant complexity O(1), provided the data is not shared.
If the data is shared, we must first copy it so that we become the sole owner.
import polars as pl
import matplotlib.pyplot as plt
from time import time
ts = []
ts_shared = []
clone_times = []
ns = []
for n in [1e3, 1e5, 1e6, 1e7, 1e8]:
    s = pl.zeros(int(n))
    ns.append(n)

    t0 = time()
    # we are the only owner
    # so mutation is inplace
    s[10] = 10
    # time
    t = time() - t0
    # store datapoints
    ts.append(t)

    # clone is free
    t0 = time()
    s2 = s.clone()
    t = time() - t0
    clone_times.append(t)

    # now there are two owners of the memory
    # we write to it so we must copy all the data first
    t0 = time()
    s2[11] = 11
    t = time() - t0
    ts_shared.append(t)

plt.plot(ns, ts_shared, label="writing to shared memory")
plt.plot(ns, ts, label="writing to owned memory")
plt.plot(ns, clone_times, label="clone time")
plt.legend()
plt.show()
In Rust this dispatches to set_at_idx2, which has not been released yet. Note that with the lazy engine this is all done implicitly for you.
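The ownership rule described in this answer can be illustrated with a toy model. This is only a minimal sketch in plain Python and not how polars is implemented: the class name CowArray and its owner counter are made up for illustration. Cloning shares the buffer in O(1), and a write copies the buffer only when another owner still holds it.

```python
class CowArray:
    """Toy copy-on-write array (hypothetical, for illustration only)."""

    def __init__(self, data, owners=None):
        self._data = data
        # One-element list so that all clones see the same owner count.
        self._owners = owners if owners is not None else [1]

    def clone(self):
        # O(1): share the buffer and register one more owner.
        self._owners[0] += 1
        return CowArray(self._data, self._owners)

    def __setitem__(self, index, value):
        if self._owners[0] > 1:
            # Shared buffer: copy it first (O(n)) and become the sole owner.
            self._owners[0] -= 1
            self._data = list(self._data)
            self._owners = [1]
        # Sole owner: mutate in place (O(1)).
        self._data[index] = value

    def __getitem__(self, index):
        return self._data[index]


s = CowArray([0] * 5)
s[1] = 10                                   # sole owner, in-place write
s2 = s.clone()                              # cheap, buffer is shared
shared_before_write = s2._data is s._data
s2[2] = 11                                  # shared, so the buffer is copied first
shared_after_write = s2._data is s._data
```

After the write to s2, the original s is untouched and is again the sole owner of its buffer, so later writes to s are in place.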
I am trying to find the fastest way to compute the results only for the last row in a dataframe. For some reason, when I do so it is slower than computing the entire dataframe. What am I doing wrong here? What would be the correct way to access only the last two rows and compute their values?
Currently these are my results:
Processing time of add_complete(): 1.333 seconds
Processing time of add_last_row_only(): 1.502 seconds
import numpy as np
import pandas as pd
def add_complete(df):
    df['change_a'] = df['a'].diff()
    df['change_b'] = df['b'].diff()
    df['factor'] = df['change_a'] * df['change_b']
def add_last_row_only(df):
    df.at[df.index[-1], 'change_a_last_row'] = df['a'].iloc[-1] - df['a'].iloc[-2]
    df.at[df.index[-1], 'change_b_last_row'] = df['b'].iloc[-1] - df['b'].iloc[-2]
    df.at[df.index[-1], 'factor_last_row'] = df['change_a_last_row'].iloc[-1] * df['change_b_last_row'].iloc[-1]
def main():
    a = np.arange(200_000_000).reshape(100_000_000, 2)
    df = pd.DataFrame(a, columns=['a', 'b'])
Unless I am missing something, for this kind of operation I would use numpy on the two last lines:
changes = np.diff(df.values[-2:,:],axis=0)
factor = np.prod(changes)
This operation alone takes 21 µs (yes, microseconds).
If I add the insertion it increases to 511 ms, even when filling the whole column with the same value.
I suspect the problem comes from handling a dataframe of around 1.5 GB, which actually doubles in size when the two extra columns are inserted.
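Putting this together, here is a minimal sketch that computes the last-row values without adding any columns to the dataframe. The small example frame and the variable names are assumptions for illustration; slicing with iloc before converting limits the NumPy conversion to two rows.

```python
import numpy as np
import pandas as pd

# Small stand-in for the large frame in the question.
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["a", "b"])

# Take only the last two rows, then convert to a NumPy array.
last_two = df[["a", "b"]].iloc[-2:].to_numpy()

# Row-wise difference: one row holding [change_a, change_b].
changes = np.diff(last_two, axis=0)[0]
change_a, change_b = changes
factor = change_a * change_b
```

The results are plain scalars, so nothing is written back into the dataframe, which is the step that dominated the timings above.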
I am doing the following in Dask because the df dataframe has 7 million rows and 50 columns, so pandas is extremely slow. However, I might not be using Dask correctly, or Dask might not be appropriate for my goal. I need to do some preprocessing on the df dataframe, which mainly means creating some new columns, and then eventually save the df (I am saving to csv but I have also tried parquet). However, before I save, I believe I have to call compute(), and compute() is taking very long: I left it running for 3 hours and it still wasn't done. I tried to persist() throughout the calculations, but persist() also took a long time. Is this expected with Dask given the size of my data? Could this be because of the number of partitions (I have 20 logical processors and Dask is using 24 partitions; I have 128 GB of memory, if that helps)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match
from dask_ml.preprocessing import LabelEncoder
df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
df['actual_adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
    df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']
df['stopped_tom'] = 1 * (df['stopped'] > 0)
def func(df):
    df = df.sort_values('region')
    df['first_established'] = 1 * (df['region_d']==df['region_d'].min())
    df['last_established'] = 1 * (df['region_d']==df['region_d'].max())
    df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
    df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
    df['time_1'] = df['time_book'].shift(1, fill_value=0)
    df['time_2'] = df['time_book'].shift(-1, fill_value=0)
    df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
    return df
df = df.groupby('country').apply(func).reset_index(drop=True)
df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])
df['Investment'] = df['lor_index'].values * 1000
df = df.compute().to_csv("path")
Saving to csv or parquet will by default trigger computation, so the last line should be:
df = df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
If one of the dfs is small enough to fit in memory, then it would be good to keep it as a pandas dataframe, see further tips in the documentation.
To check if a timezone is not defined for the first row of a "timestamp" column in a pandas Series I can query .tz for a single element with:
import pandas as pd
dates = pd.Series(pd.date_range('2/2/2002', periods=10, freq='M'))
assert dates.iloc[0].tz is None
Do I have a way to check if there are elements where the timezone is defined, or even better, a way to list all the timezones in the whole series, without looping through its elements, such as:
dates.iloc[5] = dates.iloc[5].tz_localize('Africa/Abidjan')
dates.iloc[7] = dates.iloc[7].tz_localize('Africa/Banjul')
zones = []
for k in range(dates.shape[0]):
You can get the time zone setting of a datetime Series using the dt accessor, i.e. S.dt.tz. This will raise ValueError if you have multiple time zones since the datetime objects will then be stored in an object array, as opposed to a datetime64 array if you have only one time zone or None. You can make use of this to get a solution that is a bit more efficient than looping every time:
import pandas as pd
# tzinfo is None:
dates0 = pd.Series(pd.date_range('2/2/2002', periods=10, freq='M'))
# one timezone:
dates1 = dates0.dt.tz_localize('Africa/Abidjan')
# mixed timezones:
dates2 = dates0.copy()
dates2.iloc[5] = dates2.iloc[5].tz_localize('Africa/Abidjan')
dates2.iloc[7] = dates2.iloc[7].tz_localize('Africa/Banjul')
for ds in [dates0, dates1, dates2]:
    try:
        zones = ds.dt.tz
    except ValueError:
        zones = set(t.tz for t in ds.values)
    print(zones)
# for the mixed-timezone series (dates2) this prints
{None, <DstTzInfo 'Africa/Banjul' GMT0:00:00 STD>, <DstTzInfo 'Africa/Abidjan' GMT0:00:00 STD>}
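As a variation on the approach above, the dtype can be checked up front instead of relying on the exception. This is a minimal sketch, not part of the original answer: the helper name timezones_in is made up for illustration, and the mixed series is built directly with dtype=object so the example does not depend on how a given pandas version handles assigning a tz-aware value into a tz-naive column.

```python
import pandas as pd


def timezones_in(series):
    """Return the set of timezones in a Series (None stands for tz-naive)."""
    if pd.api.types.is_datetime64_any_dtype(series.dtype):
        # A datetime64 column has exactly one timezone (or none) for all rows.
        return {series.dt.tz}
    # Object column: fall back to inspecting each element.
    return {getattr(value, "tz", None) for value in series}


naive = pd.Series(pd.date_range("2002-02-02", periods=3, freq="D"))
aware = naive.dt.tz_localize("Africa/Abidjan")
mixed = pd.Series(
    [
        pd.Timestamp("2002-02-02"),
        pd.Timestamp("2002-02-03", tz="Africa/Abidjan"),
        pd.Timestamp("2002-02-04", tz="Africa/Banjul"),
    ],
    dtype=object,
)

# Convert tz objects to their names for easy comparison.
zone_names = {None if tz is None else str(tz) for tz in timezones_in(mixed)}
```

The element-wise loop still runs for mixed data, but only in that case; uniform columns are answered from the dtype alone.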
I am considering using a closure with the current state, to compute the rolling window (which in my case is of width 2), to answer my own question, which I have recently posed. Something on the lines of:
def test(init_value):
    def my_fcn(x,y):
        nonlocal init_value
        actual_value = (x + y) * init_value
        init_value = actual_value
        return init_value
    return my_fcn
where my_fcn is a dummy function used for testing. The function can then be initialised through actual_fcn = test(0), assuming, for example, an initial value of zero. Finally, one could use the function through ddf.apply (where ddf is the actual dask dataframe).
Finally the question: this would work, if the order of the computations is preserved, otherwise everything would be scrambled. I have not tested it, since -even if it passes- I cannot be 100% sure it will always preserve the order. So, question is:
Does dask dataframe's apply method preserve rows order?
Any other ideas? Any help highly appreciated.
Apparently yes. I am using dask 1.0.0.
The following code:
import numpy as np
import pandas as pd
import dask.dataframe as dd
number_of_components = 30
df = pd.DataFrame(np.random.randint(0,number_of_components,size=(number_of_components, 4)), columns=list('ABCD'))
my_data_frame = dd.from_pandas(df, npartitions = 1 )
def sumPrevious( previousState ) :
    def getValue(row):
        nonlocal previousState
        something = row['A'] - previousState
        previousState = row['A']
        return something
    return getValue
given_func = sumPrevious(1)
out = my_data_frame.apply(given_func, axis = 1 , meta = float).compute()
behaves as expected. There is a big caveat: if the previous state is passed by reference (i.e. it is an object of some class), the user should be careful when updating the previous state with = inside the nested function, since this will have side effects on the object that was passed in.
Rigorously, this example does not prove that order is preserved under any circumstances; so I would still be interested whether I can rely on this assumption.
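For this particular width-2 window, one way to avoid depending on evaluation order at all is to drop the hidden state and use shift. The sketch below is plain pandas only and swaps in a different technique from the closure above; the tiny frame and the snake_case names are assumptions for illustration, and it does not show how this behaves across dask partitions.

```python
import pandas as pd


def sum_previous(previous_state):
    # Same closure as above, restated so the example is self-contained.
    def get_value(row):
        nonlocal previous_state
        something = row["A"] - previous_state
        previous_state = row["A"]
        return something
    return get_value


df = pd.DataFrame({"A": [3, 5, 4]})

# Stateful version: the result depends on rows being visited in order.
stateful = df.apply(sum_previous(1), axis=1)

# Stateless version: each row is paired with its predecessor explicitly,
# and fill_value plays the role of the initial state.
stateless = df["A"] - df["A"].shift(1, fill_value=1)
```

Both give the same values here, but the second does not rely on the order in which the function is called.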
I have a fairly large table in BigQuery (approx. 9M rows) and I would like to read it via pandas.
I've tried reading it with the pd.read_gbq() function, which works fine on small tables.
On the large table it gets stuck after 50 secs or so (logs show elapsed .. 50s) - without giving an error or anything.
My question is how can I read that table using pd (chunks?). Any conventions on scaling up these bigquery reads will be helpful.
EDIT / resolution
adding to Khan's answer, I ended up implementing chunks, writing 500,000 each time to a file, then reading these files to dataframe like so:
def download_gbq_table(self):
    if not os.path.exists(self.tmp_dir):
        os.makedirs(self.tmp_dir)  # assumed: the body of this check was missing from the post
    increment = 100000
    # range() already yields the offset of the final, partial chunk
    intervals = list(range(0, self.table_size, increment))
    df = pd.DataFrame()
    for offset in intervals:
        query = f"select * from `<table_name>` limit {increment} offset {offset};"
        logger.info(f"running query: {query}")
        start_time = time.time()
        tmp_df = pd.read_gbq(query)  # any further arguments were cut off in the post
        df = pd.concat([df, tmp_df])
        logger.info(f'time took: {str(round(time.time() - start_time, 2))}')
        # flush every 500,000 rows, and once more for the final partial batch
        if len(df) % 500000 == 0 or offset == intervals[-1]:
            df.to_csv(os.path.join(self.tmp_dir, f'df_{str(offset + increment)}.csv'))
            df = pd.DataFrame()
def read_df_from_multi_csv(self):
    all_files = glob.glob(os.path.join(self.tmp_dir, "df_*"))
    df_list = []
    for f in all_files:
        start_time = time.time()
        df_list.append(pd.read_csv(f))
        logger.info(f'time took for reading {f}: {str(round(time.time() - start_time, 2))}')
    return pd.concat(df_list)
Pandas' read_gbq function currently does not provide a chunksize parameter (even though its opposite to_gbq function does provide a chunksize parameter).
Anyway, you can solve your problem by adding LIMIT and OFFSET to your SQL query and reading from BigQuery iteratively. Something along the lines of:
import pandas as pd
project_id = "xxxxxxxx"
chunk_size = 100000
# LIMIT is the number of rows per chunk, OFFSET is where the chunk starts
query_str = "select * from `mydataset.mytable` limit {limit} offset {offset};"
for offset in range(0, 9000000, chunk_size):
    query = query_str.format(limit=chunk_size, offset=offset)
    df = pd.read_gbq(query, project_id)
    #-- do stuff with your df here..
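The paging arithmetic is easy to get wrong, so here is a minimal sketch that only builds the query strings and can be checked without touching BigQuery. The helper name chunked_queries and the ordering column id are hypothetical. The ORDER BY is an addition on my part: without a stable ordering, SQL does not guarantee that successive LIMIT/OFFSET pages are disjoint.

```python
def chunked_queries(table, total_rows, chunk_size, order_by):
    """Yield one LIMIT/OFFSET query per chunk, covering total_rows rows."""
    for offset in range(0, total_rows, chunk_size):
        # The last chunk may be smaller than chunk_size.
        limit = min(chunk_size, total_rows - offset)
        yield (
            f"select * from `{table}` "
            f"order by {order_by} limit {limit} offset {offset}"
        )


# Tiny numbers so the result is easy to inspect.
queries = list(chunked_queries("mydataset.mytable", 250, 100, "id"))
```

Each generated string can then be passed to pd.read_gbq in a loop like the one above.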
Not sure if this existed back when the question was originally asked, but now you can use python-bigquery-sqlalchemy to read data from BigQuery, which allows you to use the built-in chunking ability of pandas.read_sql(). You just create a SQLAlchemy connection engine using "bigquery://{project-name}" and pass that to con in pandas.read_sql().
For example:
from sqlalchemy.engine import create_engine
import pandas as pd
read_project = "my-cool-project"
query = f"""
select * from `{read_project}.cool-dataset.cooltable`
"""
bq_engine = create_engine(f"bigquery://{read_project}")
for df in pd.read_sql(query, con=bq_engine, chunksize=100_000):
    # do stuff with df...
    pass