write dataframe to .xlsx too slow - pandas

I have a 40 MB dataframe 'dfScore' that I am writing to .xlsx.
The code is as follows:
import pandas

writer = pandas.ExcelWriter('test.xlsx', engine='xlsxwriter')
dfScore.to_excel(writer, sheet_name='Sheet1')
writer.save()
The dfScore.to_excel call takes almost an hour, and writer.save() takes another hour. Is this normal? Is there a good way to make it take less than 10 minutes?
I have already searched Stack Overflow, but some of the suggestions don't seem to work for my problem.

Why don't you save it as .csv?
I have worked with heavier DataFrames on my personal laptop and I had the same problem with writing to xlsx.
your_dataframe.to_csv('my_file.csv', encoding='utf-8', columns=list_of_dataframe_columns)
Then you can simply convert it to .xlsx with MS Excel or an online converter.

The dfScore.to_excel call takes almost an hour, and writer.save() takes another hour. Is this normal?
That sounds a bit too high. I ran an XlsxWriter test writing 1,000,000 rows x 5 columns and it took ~100 s. The time will vary based on the CPU and memory of the test machine, but 1 hour is 36 times slower, which doesn't seem right.
Note that Excel, and thus XlsxWriter, only supports 1,048,576 rows per worksheet, so you are effectively throwing away 3/4 of your data and wasting time doing it.
Is there a good way to make it take less than 10 minutes?
For pure XlsxWriter programs, PyPy gives a good speed-up. For example, rerunning my 1,000,000 rows x 5 columns test case with PyPy, the time went from 99.15 s to 16.49 s. I don't know whether Pandas works with PyPy, though.
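For reference, a rough sketch of that kind of pure XlsxWriter benchmark (the exact test code and the values written are assumptions on my part, not the original test script):
import time
import xlsxwriter

start = time.perf_counter()

# Write 1,000,000 rows x 5 columns of dummy numeric data.
workbook = xlsxwriter.Workbook('benchmark.xlsx')
worksheet = workbook.add_worksheet()
for row in range(1_000_000):
    for col in range(5):
        worksheet.write_number(row, col, row + col)
workbook.close()

print('Elapsed: %.1fs' % (time.perf_counter() - start))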

Related

Opening and concatenating a list of files in a single pandas DataFrame avoiding loops

I've been reading different posts here on this topic, but I didn't really manage to find an answer.
I have a folder with 50 files (total size around 70 GB). I would like to open them and create a single DataFrame on which to perform computations.
Unfortunately, I run out of memory if I try to concatenate the files immediately after opening them.
I know I can operate on chunks and work on smaller subsets, but these 50 files are already a small portion of the entire dataset.
Thus, I found an alternative that works (using for loops and deleting items from the list at each iteration), but it obviously takes too much time. I also tried pd.append, but in that case I run out of memory.
import pandas as pd

df_list = []
for file in lst:
    df_list.append(open_file(file))
    # print(file)

result = df_list[len(df_list) - 1]
del df_list[len(df_list) - 1]
for i in range(len(df_list) - 1, -1, -1):
    print(i)
    result = pd.concat([result, df_list[i]], ignore_index=True)
    del df_list[i]
Although it works, I feel like I'm doing things twice. Furthermore, putting pd.concat in a loop is a very bad idea, since the time increases exponentially (Why does concatenation of DataFrames get exponentially slower?).
Does anyone have any suggestions?
Right now, opening and concatenating takes 75 minutes + 105 minutes. I hope to reduce this time.
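For comparison, the pattern recommended in the linked question is to collect the frames in a list and call pd.concat only once at the end; a minimal sketch, reusing open_file and lst from the snippet above (and assuming the list itself fits in memory):
import pandas as pd

# Open every file, then concatenate once instead of inside the loop.
df_list = [open_file(file) for file in lst]
result = pd.concat(df_list, ignore_index=True)
del df_list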

Filtering operation causing Python to hang - how to debug?

I am preprocessing large datasets to get them ready for clustering operations. I have a script that reads the data from CSV and performs various checks for missing data, erroneous values, etc. Until now, everything has worked as expected, but when I ran the script yesterday, it started to hang on a simple filtering operation. The source data has not changed, yet somehow processing can't get past this line. I have isolated the problem by moving the following lines of code to another file, and the same issue is observed:
import pandas as pd
df = pd.read_csv('data1.csv',index_col=0)
# Get list of columns of interest for first check
columns = [col for col in df.columns if 'temp' in col]
# Find indices where any value of a column of interest has a value of 1
indices = list(df[df[columns]==1].dropna(how='all').index)
This previously ran fine, correctly identifying the indices with this '1' flag in 'columns'. Now, with no changes to the code or source data, it hangs on the indices line. I broke it down further to identify the specific problem: df[columns]==1 runs fine, but selecting the df filtered on this condition (df[df[columns]==1]) is the line that hangs.
How can I troubleshoot what the problem is? Since I had not made any changes when it last worked, I am perplexed. What could possibly be the cause? Thanks in advance for any tips.
EDIT: The below approach seems to be drastically faster and solved the problem:
indices = df[(df[columns] == 1).any(axis=1)].index
When tested on a subset of the whole df, it accomplished the task in 0.015 seconds, while the prior method took 15.0 seconds.
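As a sanity check, both approaches select the same rows; a small toy example (the data here is made up):
import pandas as pd

df = pd.DataFrame({'temp_a': [0, 1, 0],
                   'temp_b': [0, 0, 1],
                   'other':  [5, 5, 5]})
columns = [col for col in df.columns if 'temp' in col]

# Original approach: mask the whole frame, then drop the all-NaN rows.
slow = list(df[df[columns] == 1].dropna(how='all').index)

# Faster approach from the edit: keep rows where any flag column equals 1.
fast = list(df[(df[columns] == 1).any(axis=1)].index)

assert slow == fast == [1, 2]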

Why is renaming columns in pandas so slow?

Given a large data frame (in my case 250M rows and 30 cols), why is it so slow to just change the name of a column?
I am using df.rename(columns={'oldName':'newName'}, inplace=True), so this should not make any copies of the data, yet it is taking over 30 seconds, while I would have expected this to be on the order of milliseconds (as it's just replacing one string with another).
I know that's a huge table, more than most people have RAM in their machine (hence I'm not going to add example code either), but this still shouldn't take any significant amount of time, as it's not actually touching any of the data. Why does this take so long, i.e. why does renaming a column do work proportional to the number of rows of my dataframe?
I don't think inplace=True avoids copying your data. There are some discussions on SO saying it actually does copy and then assigns back. Also see this GitHub issue.
You can just override the columns with:
df.columns = df.columns.to_series().replace({'a':'b'})
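A quick usage sketch (the frame here is just a placeholder; the rename itself only touches the column index, not the rows):
import pandas as pd

df = pd.DataFrame({'oldName': range(3), 'other': range(3)})

# Replace the label in the columns Index and assign it back.
df.columns = df.columns.to_series().replace({'oldName': 'newName'})
print(df.columns.tolist())   # ['newName', 'other']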

BigQuery: is there any way to break a large result into smaller chunks for processing?

Hi, I am new to BigQuery. If I need to fetch a very large set of data, say more than 1 GB, how can I break it into smaller pieces for quicker processing? I will need to process the result and dump it into a file or Elasticsearch, so I need an efficient way to handle it. I tried the QueryRequest.setPageSize option, but that doesn't seem to work: I set it to 100 and it doesn't break after every 100 records. I put this line in to see how many records I get back before I turn to a new page:
result = result.getNextPage();
It returns a seemingly random number of records, sometimes 1,000, sometimes 400, etc.
Thanks.
Not sure if this helps you, but in our project we have something that seems to be similar: we process lots of data in BigQuery and need to use the final result for later usage (it is roughly 15 GB for us when compressed).
What we did was first save the results to a table with AllowLargeResults set to True, and then export the result into Cloud Storage as compressed files using the Python API.
It automatically breaks the results into several files.
After that we have a Python script that downloads all the files concurrently, reads through the whole thing, and builds some matrices for us.
I don't quite remember how long it takes to download all the files; I think it's around 10 minutes. I'll try to confirm this.
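Roughly, the flow looks like this with the google-cloud-bigquery Python client (the project, dataset, table, and bucket names are placeholders, and the exact calls are a sketch rather than our production code):
from google.cloud import bigquery

client = bigquery.Client()
destination = bigquery.TableReference.from_string('my-project.my_dataset.results')

# 1. Run the query into a destination table so large results are allowed.
job_config = bigquery.QueryJobConfig(
    destination=destination,
    allow_large_results=True,   # needs a destination table; a legacy SQL option
    use_legacy_sql=True,
)
client.query('SELECT * FROM [my-project:my_dataset.big_table]', job_config=job_config).result()

# 2. Export the table to Cloud Storage as compressed CSV; the wildcard URI
#    makes BigQuery shard the output into multiple files automatically.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.CSV,
    compression=bigquery.Compression.GZIP,
)
client.extract_table(
    destination,
    'gs://my-bucket/results-*.csv.gz',
    job_config=extract_config,
).result()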

R: openxlsx and sqldf

I have a question about using R to read in a file from Excel. I am reading in a few tabs from an Excel workbook, performing some basic SQL commands on them, and merging them using sqldf. My problem is that my RAM gets bogged down a lot after reading in the Excel data. I can run the program, but I had to install 8 GB of RAM to avoid using about 80% of my available RAM.
I know that if I have a text file, I can read it in directly with read.csv.sql() and perform the SQL in the read command, so my RAM doesn't get bogged down. I also know you can save the table as a tempfile() so it doesn't take up RAM. The data summarized with sqldf does not have very many rows, so it does not bog down the memory.
The only solution I've been able to come up with is to set up an R program that just reads in the data and creates the text files, close R down, and then run a second program that reads the data back in from the text files with sqldf, performs the SQL commands, and merges the data. I don't like this solution as much because it still involves using a lot of RAM in the initial read-in program, and it uses two programs where I would like to use just one.
I could also manually create the text files from the Excel tabs, but there are some updates being made on a regular basis at the moment, so I'd rather not have to do that. I'd also like something more automated to create the text files.
For reference, the tables are of the following sizes:
3k rows x 9 columns
200k rows x 20 columns
4k rows x 16 columns
80k rows x 13 columns
100k rows x 12 columns
My read-ins look like this:
table <- read.xlsx(filename, sheet = "Sheet")
summary <- sqldf("SQL code")
rm(table)
gc()
I have tried running the rm(table) and gc() commands after each read-in and SQL manipulation (after which I no longer need the entire table), but these commands do not seem to free up much RAM. Only by closing the R session do I get the 1-2 GB back.
So, is there any way to read an Excel file into R without taking up so much RAM in the process? I also want to note that this is on a work computer for which I do not have admin rights, so anything I want to install that requires such rights I would have to request from IT, which is a barrier I'd like to avoid.