I have a huge pandas dataframe: 42 columns, 19 million rows, and mixed dtypes. I load it from a csv file into JupyterLab, do some operations on it (adding more columns), and write it back to a csv file. Many of the columns are int64, and in some of these columns many rows are empty.
Do you know a technique or a specific dtype that I can apply to the int64 columns to reduce the memory footprint of the dataframe and the size of the csv file it is written to?
Could you provide some example code?
[For columns containing strings only I changed the dtype to 'category'.]
thank you
If I understand your question correctly, the issue is the size of the csv file when you write it back to disk.
A csv file is just a text file, and as such the columns aren't stored with dtypes. It doesn't matter what you change your dtype to in pandas, it will be written back as characters. This makes csv very inefficient for storing large amounts of numerical data.
If you don't need it as a csv for some other reason, try a different file type such as parquet. (I have found this to reduce my file size by 10x, but it depends on your exact data.)
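As a rough way to check this on your own data, the sketch below writes the same synthetic frame as plain csv, as gzip-compressed csv, and (only if a parquet engine such as pyarrow is installed) as parquet, then collects the file sizes. The frame, column names, and file names are made up for illustration, gzip compression is an extra option not mentioned above, and the ratios will differ for real data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "id": np.arange(100_000, dtype="int64"),
    "value": rng.integers(0, 1_000_000, size=100_000),
})

sizes = {}
with tempfile.TemporaryDirectory() as tmp:
    # Plain text csv: every number is stored as characters.
    csv_path = os.path.join(tmp, "data.csv")
    df.to_csv(csv_path, index=False)
    sizes["csv"] = os.path.getsize(csv_path)

    # Same text, gzip-compressed on the fly by pandas.
    gz_path = os.path.join(tmp, "data.csv.gz")
    df.to_csv(gz_path, index=False, compression="gzip")
    sizes["csv.gz"] = os.path.getsize(gz_path)

    # Parquet needs an engine (pyarrow or fastparquet); skip if none is installed.
    try:
        pq_path = os.path.join(tmp, "data.parquet")
        df.to_parquet(pq_path, index=False)
        sizes["parquet"] = os.path.getsize(pq_path)
    except ImportError:
        pass

print(sizes)
```

A compressed csv can still be read back with pd.read_csv, whereas parquet also preserves the dtypes.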
If you're specifically looking to convert dtypes, see this question, but as mentioned, this won't help your csv file size: Change column type in pandas
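For the in-memory side of the question, here is a minimal sketch of one possible approach, assuming the "empty" cells caused pandas to load the integer columns as float64 with NaN. It converts such columns to the smallest pandas nullable integer dtype (Int8 to Int64) that fits the observed values. The helper name shrink_int_column and the sample frame are hypothetical.

```python
import numpy as np
import pandas as pd


def shrink_int_column(s: pd.Series) -> pd.Series:
    """Return s as the smallest nullable integer dtype that holds its values.

    Columns that are not numeric, or that contain fractional values,
    are returned unchanged.
    """
    if not pd.api.types.is_numeric_dtype(s) or pd.api.types.is_bool_dtype(s):
        return s
    non_null = s.dropna()
    if non_null.empty or not (non_null % 1 == 0).all():
        return s
    lo, hi = non_null.min(), non_null.max()
    for dtype in ("Int8", "Int16", "Int32", "Int64"):
        info = np.iinfo(dtype.lower())
        if info.min <= lo and hi <= info.max:
            return s.astype(dtype)  # NaN becomes <NA>
    return s


# Hypothetical sample: whole numbers with missing values, loaded as float64.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [100000.0, 2.0, np.nan]})
small = pd.DataFrame({col: shrink_int_column(df[col]) for col in df.columns})

print(small.dtypes)
print(df.memory_usage(deep=True).sum(), small.memory_usage(deep=True).sum())
```

This only shrinks the dataframe in memory. In the csv the values are still text; the one visible difference is that whole numbers are written as 1 rather than 1.0.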
Related
I was wondering if there is a way to store new columns from a dataframe in an already existing csv file without reading the entire file first?
I am working with a very large dataset, where I read 2-5 columns of the dataset, use them to calculate a new variable (column), and then want to store this variable alongside the entire dataset. My memory cannot hold the entire dataset at once, so I am looking for a way to add the new column to the full dataset without loading all of it.
I have tried using chunking with:
df = pd.read_csv(Path, chunksize = 10000000)
But then I am faced with the error "TypeError: 'TextFileReader' object is not subscriptable" when trying to process the data.
The data is also grouped by two variables and therefore chunking is not preferred when doing these calculations.
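The TypeError arises because read_csv with chunksize returns an iterator of dataframes rather than a single dataframe, so it has to be looped over instead of indexed. A minimal sketch of that pattern follows, assuming the new column can be computed row by row (it does not cover the grouped calculation mentioned above). The column names a, b, c and the helper name are hypothetical.

```python
import os
import tempfile

import pandas as pd


def add_column_in_chunks(src, dst, chunksize):
    """Stream src in chunks, add column c = a + b, and write the result to dst."""
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize):
        chunk["c"] = chunk["a"] + chunk["b"]
        # Write the header once, then append the remaining chunks.
        chunk.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.csv")
    dst = os.path.join(tmp, "output.csv")
    pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]}).to_csv(src, index=False)

    add_column_in_chunks(src, dst, chunksize=2)
    out = pd.read_csv(dst)

print(out)
```

Only one chunk is held in memory at a time, so the chunk size can be tuned to the available RAM.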
I want to concatenate two dataframes into one and save the result as a single csv. The first dataframe is stored in a huge csv file, so I don't want to load it into memory. I tried df.to_csv in append mode, but it doesn't behave like pd.concat when the columns differ (it does not compare and combine columns). Does anyone know how to concatenate a csv and a df? The csv and the df can have different columns, so the output csv should have a single header containing all columns, with each row's values under the right columns.
You can use Dask DataFrame to do this operation lazily. It'll load your data into memory, but do so in small chunks. Make sure to keep the partition size (blocksize) reasonable -- based on your overall memory capacity.
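import dask.dataframe as dd
# blocksize is given as a byte-size string; recent Dask versions reject a float such as 25e6
ddf1 = dd.read_csv("data1.csv", blocksize="25MB")
ddf2 = dd.read_csv("data2.csv", blocksize="25MB")
new_ddf = dd.concat([ddf1, ddf2])
# single_file=True writes one csv; without it Dask writes one file per partition
new_ddf.to_csv("combined_data.csv", single_file=True, index=False)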
API docs: read_csv, concat, to_csv
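If Dask is not an option, a plain-pandas alternative (not the approach in the answer above) is to read only the header of the large csv, build the union of columns, and then stream the file in chunks into a new csv before appending the in-memory dataframe. The sketch below assumes that rewriting the large file once is acceptable; the function name, file names, and columns are hypothetical.

```python
import os
import tempfile

import pandas as pd


def concat_csv_and_frame(big_csv, df, out_csv, chunksize=100_000):
    """Write the rows of big_csv followed by the rows of df to out_csv with a shared header."""
    # nrows=0 reads only the header line of the large file.
    header = pd.read_csv(big_csv, nrows=0).columns.tolist()
    columns = header + [c for c in df.columns if c not in header]

    first = True
    for chunk in pd.read_csv(big_csv, chunksize=chunksize):
        # reindex adds the missing columns as empty cells.
        chunk.reindex(columns=columns).to_csv(
            out_csv, mode="w" if first else "a", header=first, index=False
        )
        first = False
    df.reindex(columns=columns).to_csv(
        out_csv, mode="w" if first else "a", header=first, index=False
    )


with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "big.csv")
    combined = os.path.join(tmp, "combined.csv")
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv(big, index=False)
    extra = pd.DataFrame({"b": [5, 6], "c": ["x", "y"]})

    concat_csv_and_frame(big, extra, combined, chunksize=1)
    out = pd.read_csv(combined)

print(out)
```

The output has one header with the columns a, b, c, and cells that a source did not provide are left empty.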
I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, and the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
#set path to file
jsonFilePath='/mnt/datalake/jsonfiles/filename.json'
#read file to dataframe
#entitySchema is a schema struct previously extracted from a sample file
rawdf=spark.read.option("multiline","true").schema(entitySchema).format("json").load(jsonFilePath)
#rawdf contains a single row of file_name, timestamp_created, and obj_array
#obj_array is an array field containing the entire data payload (>2GB)
explodeddf=rawdf.selectExpr("file_name","timestamp_created","explode(obj_array) as data")
#this column explosion fails due to obj_array exceeding 2GB
When you hit limits like this you need to re-frame the problem. Spark is choking on 2 GB in a single column, and that's a pretty reasonable choke point. Why not write your own custom data reader (Presentation) that emits records in whatever way you deem reasonable? (This is likely the best solution if you want to leave the files as they are.)
You could probably read all the records in with a simple text read and then "paint" in columns after. You could use SQL tricks to try to expand and fill rows with windows/lag.
You could do file level cleaning/formatting to make the data more manageable for the out of the box tools to work with.
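One way to do such file-level reformatting is sketched below, using only the Python standard library. It assumes the file is a single top-level JSON array whose elements are objects (if the array is nested under a key, the reader would need adapting), and it rewrites the array as JSON Lines, one object per line, without holding the whole file in memory. The function name and file names are hypothetical.

```python
import json
import os
import tempfile


def json_array_to_jsonl(src, dst, chunk_size=1 << 20):
    """Stream a top-level JSON array of objects from src into dst as JSON Lines.

    Returns the number of elements written.
    """
    decoder = json.JSONDecoder()
    count = 0
    started = False
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        buf, pos = fin.read(chunk_size), 0
        while True:
            # Skip whitespace and the commas between elements.
            while pos < len(buf) and buf[pos] in " \t\r\n,":
                pos += 1
            if pos == len(buf):
                buf, pos = fin.read(chunk_size), 0
                if not buf:
                    raise ValueError("unexpected end of file")
                continue
            if not started:
                if buf[pos] != "[":
                    raise ValueError("expected a top-level JSON array")
                started, pos = True, pos + 1
                continue
            if buf[pos] == "]":
                return count
            try:
                obj, end = decoder.raw_decode(buf, pos)
            except json.JSONDecodeError:
                # The element is probably cut off at the buffer edge: read more and retry.
                more = fin.read(chunk_size)
                if not more:
                    raise
                buf, pos = buf[pos:] + more, 0
                continue
            fout.write(json.dumps(obj) + "\n")
            count += 1
            pos = end


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "payload.json")
    dst = os.path.join(tmp, "payload.jsonl")
    with open(src, "w", encoding="utf-8") as f:
        f.write('[{"a": 1}, {"a": 2, "b": [1, 2]},\n {"a": 3}]')

    n = json_array_to_jsonl(src, dst, chunk_size=5)  # tiny chunks to exercise refills
    with open(dst, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]

print(n, records)
```

A line-delimited file like this can then be read by spark.read.json without the multiline option, which yields one row per object and should avoid the single oversized column.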
I am loading a file into a df.
df=spark.read.csv("path")
If I do it this way, it loads the whole CSV file, which has 20 columns, but I want to read just 5 of them. Is there a way?
You can't perform your selection before reading.
df = (spark.read.csv("path")
      .select(my_cols))
For better read (and write) performance, you should convert your CSV to a Parquet file, which is a columnar storage format.
I have the following problem:
I have a set of several hdf5 files with similar data frames which I want to sort globally based on multiple columns.
My input is the file names and an ordered list of columns I want to use for sorting.
The output should be a single hdf5 file containing all the sorted data.
Each file can contain millions of rows. I can afford loading a single file in memory but not the entire dataset.
Naively I would like first to copy all the data in a single hdf5 file (which is not difficult) and then find out a way to do in memory sorting of this huge file.
Is there a quick way to sort in memory a pandas datastructure stored in an hdf5 file based on multiple columns?
I have already seen ptrepack but it seems to allow you sorting only on a single column.
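One possible direction, sketched here under assumptions rather than as a tested solution for hdf5, is an external merge sort: sort each file on its own in memory (each one fits), then merge the sorted runs with a k-way merge keyed on the tuple of sort columns. In the sketch, small in-memory dataframes stand in for the files; in practice each run would be written to disk and streamed back in chunks. The function name and columns are hypothetical, and the key columns are assumed to contain no missing values.

```python
import heapq

import pandas as pd


def merge_sorted_frames(frames, sort_cols):
    """Yield rows as tuples in global order from frames already sorted by sort_cols."""
    columns = list(frames[0].columns)
    key_idx = [columns.index(c) for c in sort_cols]
    # One lazy row iterator per sorted run.
    runs = [f.itertuples(index=False, name=None) for f in frames]
    yield from heapq.merge(*runs, key=lambda row: tuple(row[i] for i in key_idx))


sort_cols = ["a", "b"]
# Stand-ins for two files; each is sorted on its own first.
f1 = pd.DataFrame({"a": [2, 1, 1], "b": [0, 3, 1], "v": ["z", "y", "x"]}).sort_values(sort_cols)
f2 = pd.DataFrame({"a": [2, 1, 2], "b": [5, 2, 0], "v": ["r", "p", "q"]}).sort_values(sort_cols)

rows = list(merge_sorted_frames([f1, f2], sort_cols))
print(rows)
```

Because heapq.merge only keeps one pending row per run, memory use during the merge depends on the number of files rather than on the total number of rows.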