I'm reading a large amount of data from a database via pd.read_sql(...chunksize=10000), which returns a generator of DataFrames rather than a single DataFrame.
While I can still work with each chunk, e.g. merging it with pd.merge(df, df2, ...), some functions are no longer available, such as df.to_csv(...).
What is the best way to handle that? How can I write such a dataframe to a CSV? Do I need to iterate over it manually?
You can either process each chunk individually, or combine them using e.g. pd.concat to operate on all chunks as a whole.
Individually, you would indeed iterate over the chunks like so:
for chunk in pd.read_sql(...chunksize=10000):
    # process chunk
To combine, you can use a list comprehension:
df = pd.concat([chunk for chunk in pd.read_sql(...chunksize=10000)])
# process df
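If the goal is simply to get everything into a CSV without ever holding the full result in memory, you can also write each chunk as it arrives. A minimal sketch, where query and conn are placeholders for your SQL statement and database connection:

import pandas as pd

# `query` and `conn` are placeholders -- substitute your own statement and connection.
for i, chunk in enumerate(pd.read_sql(query, conn, chunksize=10000)):
    # write the header only for the first chunk, then append the rest
    chunk.to_csv("output.csv",
                 mode="w" if i == 0 else "a",
                 header=(i == 0),
                 index=False)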
I'm posting this for the pandas, numpy and spark tags because I'm not really sure of the best approach to solve this problem within those three systems.
I have a large parquet file that a downstream process is having trouble opening because it exceeds the system's memory (~63 GB in memory if opened at once). I was writing the file as such:
FULL_MAIN.write.mode("overwrite").parquet(PATH+"/FULL_MAIN.parquet")
but the file was too big, so I tried to do this to break the file into smaller chunks:
split_factor = [.1,.1,.1,.1,.1,.1,.1,.1,.1,.1]
FULL_MAIN_RDD1,FULL_MAIN_RDD2,FULL_MAIN_RDD3,FULL_MAIN_RDD4,FULL_MAIN_RDD5, FULL_MAIN_RDD6,FULL_MAIN_RDD7,FULL_MAIN_RDD8,FULL_MAIN_RDD9,FULL_MAIN_RDD10 = FULL_MAIN.randomSplit(split_factor)
FULL_MAIN_RDD1.write.mode("overwrite").parquet(PATH+"/FULL_MAIN_RDD1.parquet")
FULL_MAIN_RDD2.write.mode("overwrite").parquet(PATH+"/FULL_MAIN_RDD2.parquet")
...
The problem with this approach is that there are other dataframes that I need to keep row-aligned with this one, and doing this random split breaks that alignment.
So my two questions are:
Is there a way to split multiple dataframes into relatively equal chunks when I don't have any row numbers or numeric counter for each row in my dataset?
Is there a way to read parquet files in batches in pandas or numpy? This would basically solve my problem on the downstream system. I can't figure out how to open the parquet file in batches (I've tried to open it in pandas, split the rows, and save each file, but loading the whole dataframe crashes my system). I am not sure if it's possible without exceeding memory.
The Parquet file format supports row groups. Install pyarrow and use row_group_size when creating the parquet file:
df.to_parquet("filename.parquet", row_group_size=10000, engine="pyarrow")
Then you can read it group by group (or even only a specific group):
import pyarrow.parquet as pq

pq_file = pq.ParquetFile("filename.parquet")
n_groups = pq_file.num_row_groups
for grp_idx in range(n_groups):
    df = pq_file.read_row_group(grp_idx, use_pandas_metadata=True).to_pandas()
    process(df)
If you don't have control over the creation of the parquet file, you are still able to read only part of the file:
pq_file = pq.ParquetFile("filename.parquet")
batch_size = 10000  # records
batches = pq_file.iter_batches(batch_size, use_pandas_metadata=True)  # batches is a generator
for batch in batches:
    df = batch.to_pandas()
    process(df)
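If the end goal is to split the one large file into several smaller parquet files while preserving row order (so your other dataframes stay aligned), something along these lines should work; this is a sketch, and the batch size and output file names are illustrative only:

import pyarrow as pa
import pyarrow.parquet as pq

pq_file = pq.ParquetFile("filename.parquet")

# Write each batch to its own smaller file; iteration preserves row order,
# so alignment with the other datasets is not disturbed.
for i, batch in enumerate(pq_file.iter_batches(batch_size=1_000_000)):
    table = pa.Table.from_batches([batch])  # wrap the RecordBatch in a Table
    pq.write_table(table, f"FULL_MAIN_part_{i:04d}.parquet")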
I am not sure if you are using Spark. If you want to provide the downstream process with smaller chunks of the file, you can repartition to a desired number of chunks and rewrite the parquet file.
You can change the repartition number as per your need.
df = spark.read.parquet('filename.parquet')
df.repartition(200).write.mode('overwrite').parquet('targetPath')
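Note that 'targetPath' will then be a directory containing roughly 200 part files, which a downstream pandas process can read one at a time. A rough sketch, assuming Spark's usual part-*.parquet naming:

import glob
import pandas as pd

# Each part file holds only a fraction of the data and fits in memory on its own.
for part in sorted(glob.glob("targetPath/part-*.parquet")):
    df_part = pd.read_parquet(part)
    # process df_part here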
I'm doing sentiment analysis of Twitter data. For this work, I've made several datasets in CSV format, with each month in a different dataset. After preprocessing every dataset individually, I want to save all the datasets in one single CSV file, but when I write the code below using a pandas dataframe:
df.to_csv('dataset.csv', index=False)
it removes the previous data (rows) from that file. Is there any way that I can keep the previous data too in that file, so that I can merge all the data together? Thank you.
It's not entirely clear what you want from your question, so this is just a guess, but something like this might be what you're looking for. If you keep assigning dataframes to df, then new data will overwrite the old data. Try assigning them to differently named dataframes like df1 and df2. Then you can merge them:
# vertically merge the multiple dataframes and reassign to new variable
df = pd.concat([df1, df2])
# save the dataframe
df.to_csv('my_dataset.csv', index=False)
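If you end up with many monthly files, collecting them with glob before concatenating may be more convenient than naming each dataframe by hand. A sketch, assuming a hypothetical preprocessed/ directory holding one CSV per month:

import glob
import pandas as pd

# Hypothetical layout: one preprocessed CSV per month in preprocessed/
monthly_files = sorted(glob.glob("preprocessed/*.csv"))
df = pd.concat((pd.read_csv(f) for f in monthly_files), ignore_index=True)
df.to_csv("my_dataset.csv", index=False)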
In Python you can use the open("file") function with the parameter 'a':
open("file", 'a')
The 'a' means "append" so you will add lines at the end of your file.
You can use the same parameter for the pandas.DataFrame.to_csv() method.
e.g.:
import pandas as pd

# code where you get data and return df
df.to_csv("file", mode='a')
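One caveat with mode='a': every call also appends the header row by default, so the combined file ends up with repeated header lines. A small sketch that writes the header only when the file does not exist yet:

import os
import pandas as pd

# `df` is the dataframe for the current month, as in the question.
file_exists = os.path.exists("dataset.csv")
df.to_csv("dataset.csv", mode="a", header=not file_exists, index=False)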
@thehand0: Your code works, but it's inefficient, so it will take longer for your script to run.
After saving a Pandas DataFrame with df.to_pickle(file_name), it can be loaded with df = pd.read_pickle(file_name). But sometimes, you may only want to load the data for one Series at a particular time, and loading the entire DataFrame is inefficient. Is there a way to load just a single Series from a pickled DataFrame?
This is not possible, because a pickle file is a single serialized object, and reading a single column of a serialized file is not supported. You can read a single column of other file types (e.g. h5, csv, etc.) but not a serialized file; see the sketch below for the columnar alternatives.
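For comparison, columnar formats let you pull out a single column without loading the rest. A minimal sketch with hypothetical file and column names:

import pandas as pd

# Parquet: load only the column you need.
s = pd.read_parquet("data.parquet", columns=["my_column"])["my_column"]

# HDF5 saved in 'table' format also supports column selection:
# df.to_hdf("data.h5", key="df", format="table", data_columns=True)
s = pd.read_hdf("data.h5", key="df", columns=["my_column"])["my_column"]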
I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5.
My problem is I have to aggregate the data into one format, and then dump into HDF5. This is ~1 TB-sized data, so I naturally cannot fit this into RAM. Dask might be the best way to accomplish this task.
If I were parsing my data into one pandas dataframe, I would do this:
import pandas as pd
import csv
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)

total_df = pd.DataFrame()  # create empty pandas DataFrame
for i, line in enumerate(readcsvfile):
    # parse line into a dictionary of key:value pairs by table field:value, "dictionary_line"
    # save dictionary as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])  # one line of tabular data
    total_df = pd.concat([total_df, df])  # creates one big dataframe
Using dask to do the same task, it appears users should try something like this:
import pandas as pd
import csv
import dask.dataframe as dd
import dask.array as da
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"] # define columns
readcsvfile = csv.reader(csvfile)  # read in file, if csv

# somehow define empty dask dataframe total_df = dd.Dataframe()?
for i, line in enumerate(readcsvfile):
    # parse line into a dictionary of key:value pairs by table field:value, "dictionary_line"
    # save dictionary as pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i])  # one line of tabular data
    total_df = da.concatenate([total_df, df])  # creates one big dataframe
After creating a ~TB dataframe, I will save into hdf5.
My problem is that total_df does not fit into RAM, and must be saved to disk. Can dask dataframe accomplish this task?
Should I be trying something else? Would it be easier to create an HDF5 from multiple dask arrays, i.e. each column/field a dask array? Maybe partition the dataframes among several nodes and reduce at the end?
EDIT: For clarity, I am actually not reading directly from a csv file. I am aggregating, parsing, and formatting tabular data. So, readcsvfile = csv.reader(csvfile) is used above for clarity/brevity, but it's far more complicated than reading in a csv file.
Dask.dataframe handles larger-than-memory datasets through laziness. Appending concrete data to a dask.dataframe will not be productive.
If your data can be handled by pd.read_csv
The pandas.read_csv function is very flexible. You say above that your parsing process is very complex, but it might still be worth looking into the options for pd.read_csv to see if it will still work. The dask.dataframe.read_csv function supports these same arguments.
In particular if the concern is that your data is separated by tabs rather than commas this isn't an issue at all. Pandas supports a sep='\t' keyword, along with a few dozen other options.
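If pd.read_csv can parse your files, the dask equivalent is a one-liner. A minimal sketch, using the same myfile.tsv name as the bag example below:

import dask.dataframe as dd

# sep='\t' handles the tab-delimited input; blocksize controls partition size.
ddf = dd.read_csv("myfile.tsv", sep="\t", blocksize=25_000_000)
ddf.head()  # quick sanity check; only the first partition is read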
Consider dask.bag
If you want to operate on textfiles line-by-line then consider using dask.bag to parse your data, starting as a bunch of text.
import dask.bag as db
b = db.read_text('myfile.tsv', blocksize=10000000) # break into 10MB chunks
records = b.str.split('\t').map(parse)
df = records.to_dataframe(columns=...)
Write to HDF5 file
Once you have dask.dataframe try the .to_hdf method:
df.to_hdf('myfile.hdf5', '/df')
I have a Spark DataFrame stored as a parquet file that can be read by Spark as follows
df = sqlContext.read.parquet('path_to/example.parquet')
df.registerTempTable('temp_table')
I want to slice my dataframe, df, by row (i.e. the equivalent of df.iloc[0:4000], df.iloc[4000:8000] etc. in a pandas dataframe), since I want to convert each small chunk to a pandas dataframe to work on later. I only know how to do it by sampling a random fraction, i.e.
df_sample = df.sample(False, fraction=0.1) # sample 10 % of my data
df_pandas = df_sample.toPandas()
It would be great if there is a method to slice my dataframe df by row. Thanks in advance.
You can use monotonically_increasing_id() to add an ID column to your dataframe and use that to get a working set of any size.
import pyspark.sql.functions as f
# add an index column
df = df.withColumn('id', f.monotonically_increasing_id())
# Sort by index and get first 4000 rows
working_set = df.sort('id').limit(4000)
Then, you can remove the working set from your dataframe using subtract().
# Remove the working set, and use this `df` to get the next working set
df = df.subtract(working_set)
Rinse and repeat until you're done processing all rows (see the sketch below). It's not the ideal way to do things, but it works. Also consider filtering your Spark dataframe down before converting it to pandas.
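A sketch of that rinse-and-repeat loop, with the batch size and the per-chunk pandas work left as placeholders:

import pyspark.sql.functions as f

batch_size = 4000  # placeholder chunk size
df = df.withColumn('id', f.monotonically_increasing_id())

remaining = df
while remaining.count() > 0:                     # count() triggers a Spark job each pass
    working_set = remaining.sort('id').limit(batch_size)
    pdf = working_set.toPandas()                 # small enough to work on in pandas
    # ... do the per-chunk pandas work on pdf here ...
    remaining = remaining.subtract(working_set)  # drop the processed rows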