I have a fairly large table in BigQuery (approx. 9M rows) and I would like to read it via pandas.
I've tried the pd.read_gbq() function, which works fine on small tables.
On the large table it gets stuck after 50 seconds or so (the logs show elapsed .. 50s) without raising an error or anything else.
My question is: how can I read that table using pandas (in chunks?). Any conventions on scaling up these BigQuery reads would be helpful.
EDIT / resolution
Adding to Khan's answer, I ended up implementing chunks, writing 500,000 rows at a time to a file, then reading those files into a dataframe like so:
import glob
import logging
import os
import time

import pandas as pd

logger = logging.getLogger(__name__)

def download_gbq_table(self):
    if not os.path.exists(self.tmp_dir):
        os.makedirs(self.tmp_dir)
    increment = 100000
    # offsets 0, increment, 2*increment, ... already cover the final partial chunk
    intervals = list(range(0, self.table_size, increment))
    df = pd.DataFrame()
    for offset in intervals:
        query = f"select * from `<table_name>` limit {increment} offset {offset};"
        logger.info(f"running query: {query}")
        start_time = time.time()
        tmp_df = pd.read_gbq(query,
                             project_id=self.connection_parameters['project_id'],
                             private_key=self.connection_parameters['service_account'],
                             dialect='standard')
        df = pd.concat([df, tmp_df])
        logger.info(f'time took: {str(round(time.time() - start_time, 2))}')
        # dump every 500,000 accumulated rows to a csv file and start a fresh dataframe
        if len(df) % 500000 == 0:
            df.to_csv(os.path.join(self.tmp_dir, f'df_{str(offset + increment)}.csv'))
            df = pd.DataFrame()
    # write out any remaining rows from the final, partial batch
    if len(df):
        df.to_csv(os.path.join(self.tmp_dir, 'df_final.csv'))
def read_df_from_multi_csv(self):
    all_files = glob.glob(os.path.join(self.tmp_dir, "df_*"))
    df_list = []
    for f in all_files:
        start_time = time.time()
        df_list.append(pd.read_csv(f))
        logger.info(f'time took for reading {f}: {str(round(time.time() - start_time, 2))}')
    # concatenate the already-read dataframes instead of reading every file a second time
    return pd.concat(df_list)
Pandas' read_gbq function currently does not provide a chunksize parameter (even though its counterpart, to_gbq, does).
In any case, you can solve your problem by adding LIMIT and OFFSET to your SQL query and reading from BigQuery iteratively. Something along the lines of:
import pandas as pd

project_id = "xxxxxxxx"
increment = 100000
total_rows = 9000000

query_str = "select * from `mydataset.mytable` limit {limit} offset {offset};"

# read the table in chunks of `increment` rows
for offset in range(0, total_rows, increment):
    query = query_str.format(limit=increment, offset=offset)
    df = pd.read_gbq(query, project_id=project_id, dialect='standard')
    # -- do stuff with your df here..
Not sure if this existed back when the question was originally asked, but now you can use the python-bigquery-sqlalchemy dialect to read data from BigQuery, which lets you use the built-in chunking ability of pandas.read_sql(). You just create a SQLAlchemy engine using "bigquery://{project-name}" and pass it as con to pandas.read_sql().
For example:
from sqlalchemy.engine import create_engine
import pandas as pd

read_project = "my-cool-project"

query = f"""
select * from `{read_project}.cool-dataset.cooltable`
"""

bq_engine = create_engine(f"bigquery://{read_project}")

# with chunksize set, read_sql returns an iterator of DataFrames
for df in pd.read_sql(query, con=bq_engine, chunksize=100_000):
    # do stuff with df...
    ...
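Note that the BigQuery dialect comes from a separate package (sqlalchemy-bigquery, formerly pybigquery), so it needs to be installed alongside SQLAlchemy for the "bigquery://" URL to be recognized. With chunksize set, pandas.read_sql() yields one DataFrame per chunk rather than loading everything at once.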
Related
At some point I use pd.melt to reshape my dataframe. After profiling, this command takes around 7 minutes to run, which is too long for my use case (I am using it in an interactive dashboard).
I am asking whether there are any ways to improve the running time of the melt function in pandas.
If not, is it possible, and good practice, to use a big data package just for this line of code?
pd.melt(change_t, id_vars=['id', 'date'], value_vars=factors, value_name='value')
where factors is a list of 20 column names.
I've timed melting a test table with 2 id_vars, 20 factors, and 1M rows, and it took 22 seconds on my laptop. Is your table similarly sized, or much larger? If it is a huge table, would it be OK to return only part of the melted output to your interactive dashboard? I've put code for that approach below; it took 1.3 seconds to return the first 1000 rows of the melted table.
Timing melting a large test table
import pandas as pd
import numpy as np
import time
id_cols = ['id','date']
n_ids = 1000
n_dates = 100
n_cols = 20
n_rows = 1000000
#Create the test table
df = pd.DataFrame({
    'id':np.random.randint(1,n_ids+1,n_rows),
    'date':np.random.randint(1,n_dates+1,n_rows),
})
factors = []
for c in range(n_cols):
    c_name = 'C{}'.format(c)
    factors.append(c_name)
    df[c_name] = np.random.random(n_rows)
#Melt and time how long it takes
start = time.time()
pd.melt(df, id_vars=['id', 'date'], value_vars=factors, value_name='value')
print('Melting took',time.time()-start,'seconds for',n_rows,'rows')
#Melting took 21.744 seconds for 1000000 rows
Here's a way you can get just the first 1000 melted rows
ret_rows = 1000
start = time.time()
partial_melt_df = pd.DataFrame()
for ks,g in df.groupby(['id','date']):
    g_melt = pd.melt(g, id_vars=['id', 'date'], value_vars=factors, value_name='value')
    partial_melt_df = pd.concat((partial_melt_df,g_melt), ignore_index=True)
    if len(partial_melt_df) >= ret_rows:
        partial_melt_df = partial_melt_df.head(ret_rows)
        break
print('Partial melting took',time.time()-start,'seconds to give back',ret_rows,'rows')
#Partial melting took 1.298 seconds to give back 1000 rows
I am trying to find the fastest way to compute the results only for the last row in a dataframe. For some reason, when I do so it is slower than computing the entire dataframe. What am I doing wrong here? What would be the correct way to access only the last two rows and compute their values?
Currently these are my results:
Processing time of add_complete(): 1.333 seconds
Processing time of add_last_row_only(): 1.502 seconds
import numpy as np
import pandas as pd


def add_complete(df):
    df['change_a'] = df['a'].diff()
    df['change_b'] = df['b'].diff()
    df['factor'] = df['change_a'] * df['change_b']


def add_last_row_only(df):
    df.at[df.index[-1], 'change_a_last_row'] = df['a'].iloc[-1] - df['a'].iloc[-2]
    df.at[df.index[-1], 'change_b_last_row'] = df['b'].iloc[-1] - df['b'].iloc[-2]
    df.at[df.index[-1], 'factor_last_row'] = df['change_a_last_row'].iloc[-1] * df['change_b_last_row'].iloc[-1]


def main():
    a = np.arange(200_000_000).reshape(100_000_000, 2)
    df = pd.DataFrame(a, columns=['a', 'b'])

    add_complete(df)
    add_last_row_only(df)
    print(df.tail())


if __name__ == '__main__':
    main()
Unless I am missing something, for this kind of operation I would use numpy on the last two rows:
%%timeit
changes = np.diff(df.values[-2:,:],axis=0)
factor = np.prod(changes)
Just this operation takes 21µs, yes, microseconds.
If I add the insertion it increases to 511ms, even when filling whole columns with the same value.
I suspect the problem comes from handling a roughly 1.5 GB dataframe, which roughly doubles in size when the extra columns are inserted.
%%timeit
changes = np.diff(df.values[-2:,:],axis=0)
factor = np.prod(changes)
df['factor'] = factor
df['changes_a'] = changes[0][0]
df['changes_b'] = changes[0][1]
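If the goal from the question is to keep recomputing only the last row (for example in a streaming setting), one possible variation, not from the original answer, is to allocate the result columns once up front and then write only the final cell on each update, so the repeated work stays close to the numpy-only timing above:
import numpy as np

# Sketch only: assumes `df` already has numeric columns 'a' and 'b'.
# Create the result columns once; this is the expensive, full-column step.
for col in ('change_a_last_row', 'change_b_last_row', 'factor_last_row'):
    if col not in df.columns:
        df[col] = np.nan

# On each subsequent update, only the last two rows are touched.
changes = np.diff(df[['a', 'b']].iloc[-2:].to_numpy(), axis=0)
last = df.index[-1]
df.at[last, 'change_a_last_row'] = changes[0, 0]
df.at[last, 'change_b_last_row'] = changes[0, 1]
df.at[last, 'factor_last_row'] = changes[0, 0] * changes[0, 1]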
I am doing the following in Dask, as the df dataframe has 7 million rows and 50 columns, so pandas is extremely slow. However, I might not be using Dask correctly, or Dask might not be appropriate for my goal. I need to do some preprocessing on the df dataframe, which is mainly creating some new columns, and then eventually save the df (I am saving to csv, but I have also tried parquet). However, before I save, I believe I have to call compute(), and compute() is taking very long: I left it running for 3 hours and it still wasn't done. I tried to persist() throughout the calculations, but persist() also took a long time.
Is this expected with Dask given the size of my data? Could this be because of the number of partitions (I have 20 logical processors and Dask is using 24 partitions; I have 128 GB of memory if this helps too)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match
from dask_ml.preprocessing import LabelEncoder

df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")

df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])

df['actual_adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
    df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']
df['stopped_tom'] = 1 * (df['stopped'] > 0)

def func(df):
    df = df.sort_values('region')
    df['first_established'] = 1 * (df['region_d']==df['region_d'].min())
    df['last_established'] = 1 * (df['region_d']==df['region_d'].max())
    df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
    df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
    df['time_1'] = df['time_book'].shift(1, fill_value=0)
    df['time_2'] = df['time_book'].shift(-1, fill_value=0)
    df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
    return df

df = df.groupby('country').apply(func).reset_index(drop=True)

df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])

df['Investment'] = df['lor_index'].values * 1000

df = df.compute().to_csv("path")
Saving to csv or parquet will by default trigger computation, so the last line should be:
df = df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
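If a single output file is preferred, a minimal sketch of that variant (same Dask dataframe df as above):
# everything is concatenated into one csv instead of one file per partition
df.to_csv("path.csv", single_file=True)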
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
right_on=['country', 'region'])
If one of the dfs is small enough to fit in memory, then it would be good to keep it as a pandas dataframe, see further tips in the documentation.
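For instance, a rough sketch of that suggestion, assuming data2.csv is the smaller of the two tables (file and column names taken from the question):
import dask.dataframe as dd
import pandas as pd

# keep the small table in memory as plain pandas
df1 = dd.read_csv("data1.csv")
df2_small = pd.read_csv("data2.csv")

# merging a Dask dataframe with a pandas dataframe avoids a full shuffle:
# the small table is joined against each partition of df1 independently
df = df1.merge(df2_small, how='inner', on=['country', 'region'])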
Is it possible to use Pandas' DataFrame.to_parquet functionality to split writing into multiple files of some approximate desired size?
I have a very large DataFrame (100M x 100), and am using df.to_parquet('data.snappy', engine='pyarrow', compression='snappy') to write to a file, but this results in a file that's about 4GB. I'd instead like this split into many ~100MB files.
I ended up using Dask:
import dask.dataframe as da
ddf = da.from_pandas(df, chunksize=5000000)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
This saves to multiple parquet files inside save_dir, where the number of rows of each sub-DataFrame is the chunksize. Depending on your dtypes and number of columns, you can adjust this to get files to the desired size.
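One way to pick that chunksize is to estimate it from the in-memory footprint; a rough sketch, assuming df is the pandas DataFrame to be written (the ~100 MB target is in-memory size, so the compressed parquet files on disk will usually come out smaller):
import dask.dataframe as da

# estimate rows per ~100 MB chunk from the in-memory size of df
bytes_per_row = df.memory_usage(deep=True).sum() / len(df)
rows_per_chunk = max(1, int(100e6 / bytes_per_row))

ddf = da.from_pandas(df, chunksize=rows_per_chunk)
ddf.to_parquet('/path/to/save/')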
One other option is to use the partition_cols option in pyarrow.parquet.write_to_dataset():
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# df is your dataframe
n_partition = 100
df["partition_idx"] = np.random.choice(range(n_partition), size=df.shape[0])
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, root_path="{path to dir}/", partition_cols=["partition_idx"])
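Reading the result back is then a one-liner, assuming the same root directory placeholder; pyarrow treats the directory as a single dataset and restores partition_idx as an ordinary column:
import pandas as pd

df_back = pd.read_parquet("{path to dir}/", engine="pyarrow")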
Slice the dataframe and save each chunk to a folder, using just the pandas API (no dask needed).
You can pass extra params to the parquet engine if you wish.
import os

def df_to_parquet(df, target_dir, chunk_size=1000000, **parquet_kwargs):
    """Writes pandas DataFrame to parquet format with pyarrow.

    Args:
        df: DataFrame
        target_dir: local directory where parquet files are written to
        chunk_size: number of rows stored in one chunk of parquet file. Defaults to 1000000.
    """
    for i in range(0, len(df), chunk_size):
        slc = df.iloc[i : i + chunk_size]
        chunk = int(i / chunk_size)
        fname = os.path.join(target_dir, f"part_{chunk:04d}.parquet")
        slc.to_parquet(fname, engine="pyarrow", **parquet_kwargs)
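Example usage, assuming an existing DataFrame df and an output directory that already exists; the extra keyword is passed straight through to DataFrame.to_parquet():
df_to_parquet(df, target_dir="parquet_out", chunk_size=1_000_000, compression="snappy")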
Keep each parquet size small, around 128MB. To do this:
import dask.dataframe as dd
# Get number of partitions required for nominal 128MB partition size
# "+ 1" for non full partition
size128MB = int(df.memory_usage().sum()/1e6/128) + 1
# Convert the pandas dataframe into a Dask dataframe with that many partitions
ddf = dd.from_pandas(df, npartitions=size128MB)
save_dir = '/path/to/save/'
ddf.to_parquet(save_dir)
chunk_size = 200000
i = 0
n = 0
while i < len(all_df):
    j = i + chunk_size
    print((i, j))
    tmpdf = all_df.iloc[i:j]
    tmpdf.to_parquet(path=f"./append_data/part.{n}.parquet", engine='pyarrow', compression='snappy')
    i = j
    n = n + 1
I am very well aware that there are many similar questions, and most answers suggest things like bulk inserts of CSVs (which would be more headache than exporting from R).
However, I am not aware of any comparison with R which addresses if pandas.DataFrame.to_sql() can ever reach the same speed as R's DBI::dbWriteTable.
I run Anaconda on a Windows 10 machine. As a test, I am exporting 15 tables, each with 10,000 rows, 30 columns of random floats and 3 string columns.
The times are:
pyodbc without fast_executemany : 19 seconds
pyodbc with fast_executemany : 7 seconds
turbodbc: 13 seconds
R and DBI::dbWriteTable : 4 seconds
I tried pymssql but I couldn't get it to work (see the comments in the code).
fast_executemany is a great improvement, but it still takes almost twice as long as R. The difference vs R is striking, and it also suggests there is nothing inherently wrong with the connection to the SQL Server.
If the answer is: "no, R is much faster, period", could someone elaborate why? It would be very interesting to understand what causes the difference.
For gigantic tables I might think of exporting them to a format like parquet or feather, getting R to read them, and then exporting to SQL from R. However, for mid-sized tables (anything between 30 and 200 MB of data) I would much rather find a better solution in Python.
PS It's a SQL Server Express running on the very same PC from which I am running Python and R.
Edit
In answer to a comment, I have compared execution times with different sizes, and the ratio of R's time to pyodbc's seems fairly constant. Take it with a grain of salt - I am not a database expert, and there may well be database configurations or other factors I am missing which are playing a role. The bottom line remains that, in my cases, R is almost twice as fast.
My code:
import numpy as np
import pandas as pd
import timeit
from sqlalchemy import create_engine #, MetaData, Table, select
import pymssql
ServerName = "myserver"
Database = "mydb"
params = '?driver=SQL+Server+Native+Client+11.0'
# we define the pyodbc connection
engine_pyo = create_engine('mssql+pyodbc://' + ServerName + '/' + Database + params,
                           encoding='latin1', fast_executemany=True)
conn_pyo = engine_pyo.connect()
# now the turbodbc
engine_turbo = create_engine('mssql+turbodbc://' + ServerName + '/' + Database + params, encoding='latin1')
conn_turbo = engine_turbo.connect()
# pymssql is installed but doesn't work.
# I get:
# connect() got an unexpected keyword argument 'driver'
#engine_pyms = create_engine('mssql+pymssql://' + ServerName + '/' + Database + params, encoding='latin1')
#conn_pyms = engine_pyms.connect()
sheets = 15
rows= int(10e3)
def create_data(sheets, rows):
    df = {}  # dictionary of dataframes
    for i in range(sheets):
        df[i] = pd.DataFrame(data=np.random.rand(rows, 30))
        df[i]['a'] = 'some long random text'
        df[i]['b'] = 'some more random text'
        df[i]['c'] = 'yet more text'
    return df

def data_to_sql(df, conn):
    for d in df:
        df[d].to_sql('Test ' + str(d), conn, if_exists='replace')

# NB: df is a dictionary containing dataframes - it is NOT a dataframe
# df[key] is a dataframe
df = create_data(sheets, rows)
rep = 1
n = 1
t_pyodbc = timeit.Timer( "data_to_sql(df, conn_pyo)" , globals=globals() ).repeat(repeat = rep, number = n)
t_turbo = timeit.Timer( "data_to_sql(df, conn_turbo)" , globals=globals() ).repeat(repeat = rep, number = n)
My R code:
library(tictoc)
library(readxl)
library(tidyverse)
library(dplyr)
library(dbplyr)
library(DBI)
tic("data creation")
rows = 10000
cols = 30
tables = 15
dataframes <- list()
for (i in 1:tables){
  newdf <- as_tibble( matrix(rnorm(rows * cols), rows, cols) )
  newdf <- mutate(newdf, a = 'some long random text')
  newdf <- mutate(newdf, b = 'some more random text')
  newdf <- mutate(newdf, c = 'yet more text')
  dataframes[[i]] <- newdf
}
toc()
tic("set up odbc")
con <- DBI::dbConnect(odbc::odbc(),
                      driver = "SQL Server",
                      server = "LTS-LNWS010\\SQLEXPRESS",
                      database = "CDL",
                      trusted_connection = TRUE)
toc()
tic("SQL export")
for(i in seq_along(dataframes)){
  DBI::dbWriteTable(con, paste("Test_", i), dataframes[[i]], overwrite = TRUE)
}
toc()