How to save several Polars dataframes into one file so that partial data requests are possible

I want to save several Polars dataframes into one file, and later request data from that file filtered on a timestamp (datetime) column. I don't want to load the whole file into memory, only the filtered part.
I see that the Polars API supports Feather/IPC and Parquet files, which should be able to do this in theory, but I don't know how to read these files in Polars with a filter on the date.
With Pandas I previously used the HDF5 format and it was very straightforward, but I have no experience with these formats, which are new to me. Maybe you can help me do this most effectively.

Related

Save columns from Pandas dataframe to csv, without reading the csv file

I was wondering if there is a method to store some columns from a dataframe to an already existing csv file without reading the entire file first.
I am working with a very large dataset, where I read 2-5 columns of the dataset, use them to calculate a new variable (column), and I want to store this variable back to the entire dataset. My memory cannot hold the entire dataset at once, so I am looking for a way to store the new columns without loading all of it.
I have tried using chunking with:
df = pd.read_csv(Path, chunksize = 10000000)
But then I am faced with the error "TypeError: 'TextFileReader' object is not subscriptable" when trying to process the data.
The data is also grouped by two variables and therefore chunking is not preferred when doing these calculations.
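The error comes from treating the object returned by read_csv with chunksize as a dataframe: it is actually an iterator of dataframes. A minimal sketch of processing chunk by chunk and appending results to a new csv (file names, column names, and chunk size are hypothetical):

```python
import pandas as pd

# Build a small example file so the sketch is self-contained.
path = "big.csv"
pd.DataFrame({"a": range(10), "b": range(10)}).to_csv(path, index=False)

# read_csv with chunksize returns a TextFileReader (an iterator of
# DataFrames), not a DataFrame -- indexing it raises the TypeError
# from the question. Iterate over it instead.
out = "big_with_new_col.csv"
with pd.read_csv(path, usecols=["a", "b"], chunksize=4) as reader:
    for i, chunk in enumerate(reader):
        chunk["c"] = chunk["a"] + chunk["b"]  # the new variable
        # write the header only for the first chunk, then append
        chunk.to_csv(out, mode="w" if i == 0 else "a",
                     header=(i == 0), index=False)
```

Note this only works for row-wise computations; as the question says, groupwise calculations don't map cleanly onto independent chunks unless the file happens to be sorted by the grouping keys.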

How to append data to a Parquet file when saving a dataframe from Polars

I have a Polars df that I want to save into a Parquet file, then the next df, and the next after that.
df.write_parquet("path.parquet") only overwrites the file. How can I do this in Polars?
Polars does not support appending to Parquet files, and most tools do not, see for example this SO post.
Your best bet would be to cast the dataframe to an Arrow table using .to_arrow(), and use pyarrow.dataset.write_dataset. In particular, see the comment on the parameter existing_data_behavior. Still, that requires organizing your data in partitions, which effectively means you have a separate parquet file per partition, stored in the same directory. So each df you have becomes its own parquet file, and you abstract away from that on the read. Polars does not support writing partitions as far as I'm aware. There is support for reading though, see the source argument in pl.read_parquet.

Save memory with big pandas dataframe

I have got a huge pandas dataframe: 42 columns, 19 million rows and various dtypes. I load this dataframe from a csv file into JupyterLab. Afterwards I do some operations on it (adding more columns) and write it back to a csv file. A lot of the columns are int64, and in some of these columns many rows are empty.
Do you know a technique or specific dtype I can apply to the int64 columns to reduce the size of the dataframe, write it to a csv file more efficiently, save memory, and reduce the size of the csv file?
Would you provide me with some example of code?
[For columns containing strings only I changed the dtype to 'category'.]
Thank you.
If I understand your question correctly, the issue is the size of the csv file when you write it back to disk.
A csv file is just a text file, and as such the columns aren't stored with dtypes. It doesn't matter what you change your dtype to in pandas, it will be written back as characters. This makes csv very inefficient for storing large amounts of numerical data.
If you don't need it as a csv for some other reason, try a different file type such as parquet. (I have found this to reduce my file size by 10x, but it depends on your exact data.)
If you're specifically looking to convert dtypes, see this question, but as mentioned, this won't help your csv file size: Change column type in pandas

How to overcome the 2GB limit for a single column value in Spark

I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, and the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
#set path to file
jsonFilePath='/mnt/datalake/jsonfiles/filename.json'
#read file to dataframe
#entitySchema is a schema struct previously extracted from a sample file
rawdf=spark.read.option("multiline","true").schema(entitySchema).format("json").load(jsonFilePath)
#rawdf contains a single row of file_name, timestamp_created, and obj_array
#obj_array is an array field containing the entire data payload (>2GB)
explodeddf=rawdf.selectExpr("file_name","timestamp_created","explode(obj_array) as data")
#this column explosion fails due to obj_array exceeding 2GB
When you hit limits like this you need to re-frame the problem. Spark is choking on 2 GB in a single column, and that's a pretty reasonable choke point. Why not write your own custom data reader (presentation layer) that emits records in whatever way you deem reasonable? (Likely the best solution, since it leaves the files as is.)
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could use SQL tricks to try to expand and fill rows with window functions/lag.
You could do file level cleaning/formatting to make the data more manageable for the out of the box tools to work with.
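The file-level reformatting idea can be sketched with the standard library: rewrite the single giant object as newline-delimited JSON, one record per line, so a reader (e.g. spark.read.json without the multiline option) sees one row per record and no single column ever approaches the 2 GB limit. The file and field names below are hypothetical, and json.load is a stand-in; for a real >2 GB file a streaming parser such as ijson would replace that step:

```python
import json

# Hypothetical input: one object whose obj_array holds the payload.
with open("filename.json", "w") as f:
    json.dump({"file_name": "f.json",
               "timestamp_created": "2023-01-01",
               "obj_array": [{"id": i} for i in range(5)]}, f)

# Rewrite as newline-delimited JSON: one object per line, carrying
# the top-level metadata down onto each record.
with open("filename.json") as f:
    doc = json.load(f)  # replace with a streaming parse for huge files
with open("filename.ndjson", "w") as f:
    for obj in doc["obj_array"]:
        obj["file_name"] = doc["file_name"]
        obj["timestamp_created"] = doc["timestamp_created"]
        f.write(json.dumps(obj) + "\n")
```

After this step there is nothing left to explode: the reader produces the rows directly.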

How to load only few columns into a dataframe?

I am loading a file into a df.
df=spark.read.csv("path")
If I try the above way, it will load the whole CSV file, which has 20 columns, but I want to read just 5 of them. Is there a way?
You can't perform your selection before reading: Spark has to parse every row of the CSV in full. But you can select right after:
df = spark.read.csv("path").select(my_cols)
For better read (and write) performance, you should convert your CSV to Parquet, which is a columnar storage format.