Compressing StringIO data to read with pandas? - pandas

I have been using pandas pd.read_sql_query to read a decent chunk of data into memory each day in order to process it (add columns, calculations, etc. to about 1GB of data). This has caused my computer to freeze a few times, though, so today I tried using psql to create a .csv file. I then compressed that file (.xz) and read it with pandas.
Overall, it was a lot smoother, and it made me think about automating the process. Is it possible to skip saving a .csv.xz file and instead copy the data directly into memory, ideally while still keeping it compressed?
from io import StringIO

buf = StringIO()
from_curs = from_conn.cursor()
sql = "COPY (SELECT * FROM table WHERE row_date = '2016-10-17') TO STDOUT WITH CSV HEADER"
from_curs.copy_expert(sql, buf)  # is it possible to compress this?
buf.seek(0)
# read the buf with pandas, e.g. pd.read_csv(buf), to process it
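One possibility, as a sketch only (the table name, date filter and from_conn connection are placeholders carried over from the snippet above), is to wrap an in-memory buffer in an lzma stream so the CSV is xz-compressed as COPY writes it, then let lzma decompress it transparently when pandas reads it back. psycopg2's copy_expert should write raw bytes when handed a binary file-like object, so the compressed buffer can stand in for the .csv.xz file:

import lzma
from io import BytesIO
import pandas as pd

raw = BytesIO()
with lzma.open(raw, "wb") as xz_buf:  # compresses the CSV stream as COPY produces it
    from_curs = from_conn.cursor()
    sql = "COPY (SELECT * FROM table WHERE row_date = '2016-10-17') TO STDOUT WITH CSV HEADER"
    from_curs.copy_expert(sql, xz_buf)

raw.seek(0)
with lzma.open(raw, "rt") as f:  # decompresses on the fly while pandas parses
    df = pd.read_csv(f)

Note that the final DataFrame still takes the same amount of memory; this only avoids holding the full uncompressed CSV text in memory alongside it.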

Related

Merge large pickle files and save as dta

I want to merge several large pickle files (around 500 MB each, gzip-compressed) and save the result as a .dta file. I am using the following code:
import pandas as pd

df10 = pd.read_pickle("10.pickle.gzip", compression="gzip")
df11 = pd.read_pickle("11.pickle.gzip", compression="gzip")
new_df = pd.concat([df10, df11], ignore_index=True)
for col in new_df.columns:
    new_df[col] = new_df[col].astype(str)
new_df.to_stata("myfile.dta", version=119)
The script is killed by zsh when it tries to save the file.
I want to merge several such files, save the result as a .dta file, and then read it in Stata. Any advice on how to do this would be great.
Thanks in advance.
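If the process is being killed while saving, it is most likely plain memory exhaustion. A sketch of one way to reduce the peak, assuming the frames still fit in memory once concatenated and that converting every single column to str is not strictly required:

import gc
import pandas as pd

frames = [pd.read_pickle(name, compression="gzip")
          for name in ["10.pickle.gzip", "11.pickle.gzip"]]  # extend with the other files
new_df = pd.concat(frames, ignore_index=True)
del frames
gc.collect()  # release the per-file frames before the string conversion raises memory again

# Converting only the object columns keeps numeric columns compact in the .dta file;
# drop the select_dtypes filter if you really do need everything as strings.
for col in new_df.select_dtypes(include="object").columns:
    new_df[col] = new_df[col].astype(str)

new_df.to_stata("myfile.dta", version=119)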

How to use spark toLocalIterator to write a single file in local file system from cluster

I have a PySpark job which writes my resulting dataframe to the local filesystem. Currently it is running in local mode, so I am doing coalesce(1) to get a single file as below:
file_format = 'avro'  # will be dynamic, e.g. avro, json, csv, etc.
df.coalesce(1).write.format(file_format).save('file:///pyspark_data/output')
But I see a lot of memory issues (OOM) and it takes a long time as well. So I want to run this job with master as yarn and deploy mode as client. To write the resulting df into a single file on the local filesystem, I need to use toLocalIterator, which yields Rows. How can I stream these Rows into a file of the required format (json/avro/csv/parquet and so on)?
file_format = 'avro'
for row in df.toLocalIterator():
    # write the data into a single file
    pass
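For reference, a minimal sketch of what streaming the rows yourself could look like for the CSV case (the output path is hypothetical, and as the answer below points out, you would need a separate hand-written writer like this for every format):

import csv

with open("/pyspark_data/output/result.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(df.columns)  # header row
    for row in df.toLocalIterator():  # rows arrive on the driver one partition at a time
        writer.writerow(row)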
You get the OOM error because you try to retrieve all the data into a single partition with coalesce(1).
I don't recommend using toLocalIterator, because you would have to rewrite a custom writer for every format and you wouldn't get parallel writing.
Your first solution is a good one:
df.write.format(file_format).save('file:///pyspark_data/output')
If you use Hadoop, you can merge all the output parts into one file on the local filesystem this way (it works for CSV; you can try it for other formats):
hadoop fs -getmerge <HDFS src> <FS destination>

Appending csv files in directory into a pandas dataframe

I have written a scraper which downloads daily flight prices, stores them as pandas data frames and saves them as csv files in a given folder. I am now trying to combine these csv files into a single pandas dataframe for data analysis using append, but the end result is an empty data frame.
Specifically, the individual csv files are loaded correctly into pandas, but the append seems to fail (and several methods found in stackoverflow posts don't seem to work). Code is below; any pointers? Thanks!
import os
import pandas as pd

cons_flight_df = pd.DataFrame()  # empty frame to collect the consolidated flight prices
directory = os.path.join("C:\\Testfolder\\")
for root, dirs, files in os.walk(directory):
    for file in files:
        daily_flight_df = pd.read_csv(directory + file, sep=";")  # loads csv into dataframe - works correctly
        cons_flight_df.append(daily_flight_df)  # appends daily flight prices to the consolidated dataframe - does not seem to work
print(cons_flight_df)  # currently prints out an empty data frame
cons_flight_df.to_csv('C:\\Testfolder\\test.csv')  # currently returns an empty csv file
In pandas, the append method doesn't operate in place; it returns a new DataFrame, so you need to assign the result:
cons_flight_df = cons_flight_df.append(daily_flight_df)
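As a side note, appending inside a loop copies the whole frame on every iteration, and newer pandas versions deprecate and eventually remove DataFrame.append. The usual pattern is to collect the frames in a list and concatenate once; a sketch assuming the same folder layout as the question:

import os
import pandas as pd

directory = "C:\\Testfolder\\"
frames = []
for root, dirs, files in os.walk(directory):
    for file in files:
        frames.append(pd.read_csv(os.path.join(root, file), sep=";"))

cons_flight_df = pd.concat(frames, ignore_index=True)
cons_flight_df.to_csv(os.path.join(directory, "test.csv"), index=False)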

How can I read many large .7z files containing many CSV files?

I have many .7z files, each containing many large CSV files (more than 1 GB). How can I read these in Python (especially into pandas and Dask data frames)? Should I change the compression format to something else?
I believe you should be able to open the file using
import lzma
import pandas as pd

with lzma.open("myfile.7z", "r") as f:
    df = pd.read_csv(f, ...)
This is, strictly speaking, meant for the xz file format, but may work for 7z also. If not, you will need to use libarchive.
For use with Dask, you can do the above for each file with dask.delayed.
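A sketch of that dask.delayed route, assuming each archive decompresses to a single CSV stream as in the snippet above (the file names are placeholders):

import lzma
import dask
import dask.dataframe as dd
import pandas as pd

@dask.delayed
def load_one(path):
    with lzma.open(path, "rt") as f:  # same lzma trick as above, one file per task
        return pd.read_csv(f)

files = ["myfiles.0.7z", "myfiles.1.7z"]  # placeholder file names
df = dd.from_delayed([load_one(p) for p in files])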
dd.read_csv directly also allows you to specify storage_options={'compression': 'xz'}; however, random access within a file is likely to be inefficient at best, so you should add blocksize=None to force one partition per file:
df = dd.read_csv('myfiles.*.7z', storage_options={'compression': 'xz'},
                 blocksize=None)

How to best and most efficiently split files in VB.NET

Say I have a 5 GB file. I want to split it in the following way.
The first 100 MB stays in the file
The rest goes to some reserve file
I do not want to use a ReadAllLines kind of function because it's too slow for large files.
I do not want to read the whole file into memory. I want the program to handle only a medium-sized chunk of data at a time.
You may use the BinaryReader class and its ReadBytes method to read the file in chunks:
Dim chunk() As Byte
Using br As New BinaryReader(File.OpenRead("input.dat")) ' requires Imports System.IO; the path is illustrative
    chunk = br.ReadBytes(1024) ' reads up to 1 KB per call
End Using