Does pandas.read_parquet read only the corresponding columns, given a list of column names?

The pandas doc says -
columns : list, default=None
If not None, only these columns will be read from the file.
I want to be sure I understand correctly. Does that mean pandas will read only the listed columns, without loading the whole file into memory and then dropping the unnecessary columns, just as Spark does?
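For reference, the call in question looks like this (file name and column names are illustrative):

import pandas as pd

# per the docs quoted above, only these columns are read from the file
df = pd.read_parquet("data.parquet", columns=["user_id", "amount"])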

Related

Pandas dataframe sharing between functions isn't working

I have a script that modifies a pandas dataframe with several concurrent functions (asyncio coroutines). Each function adds rows to the dataframe and it's important that the functions all share the same list. However, when I add a row with pd.concat a new copy of the dataframe is created. I can tell because each dataframe now has a different memory location as given by id().
As a result the functions no longer share the same object. How can I keep all functions pointed at a common dataframe object?
Note that this issue doesn't arise when I use the append method, but that is being deprecated.
pandas dataframes are efficient because they use contiguous memory blocks, frequently of fundamental types like int and float. You can't just add a row, because the dataframe doesn't own the next bit of memory it would have to expand into. Concatenation usually requires that new memory is allocated and data is copied. Once that happens, referrers to the original dataframe are still pointing at the old object, not the new one.
If you know the final size you want, you can preallocate and fill. Otherwise, you are better off keeping a list of new dataframes and concatenating them all at once. Since these are parallel procedures that don't depend on each other's output, this may be a feasible option.
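A minimal sketch of the collect-then-concatenate pattern (the worker names and columns are made up):

import asyncio
import pandas as pd

async def worker(name, results):
    # each coroutine appends its own small dataframe to a shared list;
    # the list is mutated in place, so every coroutine sees the same object
    results.append(pd.DataFrame({"worker": [name], "value": [len(name)]}))

async def main():
    results = []  # shared container, never reassigned
    await asyncio.gather(*(worker(f"w{i}", results) for i in range(3)))
    return pd.concat(results, ignore_index=True)  # a single concat at the end

df = asyncio.run(main())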

Save columns from Pandas dataframe to csv, without reading the csv file

I was wondering if there is a method to store columns from a dataframe to an already existing csv file without reading the entire file first?
I am working with a very large dataset, where I read 2-5 columns of the dataset, use them to calculate a new variable (column), and I want to store this variable back into the full dataset. My memory cannot load the entire dataset at once, and therefore I am looking for a way to store the new column in the full dataset without loading all of it.
I have tried using chunking with:
df = pd.read_csv(Path, chunksize = 10000000)
But then I am faced with the error "TypeError: 'TextFileReader' object is not subscriptable" when trying to process the data.
The data is also grouped by two variables and therefore chunking is not preferred when doing these calculations.
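For what it's worth, the TypeError comes from the fact that read_csv with chunksize returns a TextFileReader, an iterator of dataframes, rather than a single dataframe; a minimal sketch of how it is meant to be consumed (path, columns and chunk size are illustrative):

import pandas as pd

reader = pd.read_csv("big_file.csv", usecols=["a", "b"], chunksize=10_000_000)

total_rows = 0
for chunk in reader:
    # each chunk is an ordinary dataframe; process or aggregate it here
    total_rows += len(chunk)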

Save memory with big pandas dataframe

I have got a huge pandas dataframe: 42 columns, 19 million rows and different dtypes. I load this dataframe from a csv file into JupyterLab. Afterwards I do some operations on it (adding more columns) and write it back to a csv file. A lot of the columns are int64, and in some of these columns many rows are empty.
Do you know a technique or a specific dtype that I can apply to the int64 columns in order to reduce the size of the dataframe, write it to a csv file more efficiently, and reduce the size of the csv file?
Would you provide me with some example of code?
[For columns containing strings only I changed the dtype to 'category'.]
thank you
If I understand your question correctly, the issue is the size of the csv file when you write it back to disk.
A csv file is just a text file, and as such the columns aren't stored with dtypes. It doesn't matter what you change your dtype to in pandas, it will be written back as characters. This makes csv very inefficient for storing large amounts of numerical data.
If you don't need it as a csv for some other reason, try a different file type such as parquet. (I have found this to reduce my file size by 10x, but it depends on your exact data.)
If you're specifically looking to convert dtypes, see this question, but as mentioned, this won't help your csv file size: Change column type in pandas
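A hedged sketch of the parquet suggestion; the integer downcast via pd.to_numeric is a common companion trick rather than something the answer above spells out, and the file and column handling are illustrative:

import pandas as pd

df = pd.read_csv("big.csv")

# downcast wide integer columns to the smallest integer type that fits their values
for col in df.select_dtypes("integer").columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")

# the on-disk size is governed by the format, not the dtype: parquet stores
# columns in a typed, compressed binary layout instead of plain characters
df.to_parquet("big.parquet")  # needs pyarrow or fastparquet installed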

How to overcome the 2GB limit for a single column value in Spark

I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, and the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
#set path to file
jsonFilePath='/mnt/datalake/jsonfiles/filename.json'
#read file to dataframe
#entitySchema is a schema struct previously extracted from a sample file
rawdf=spark.read.option("multiline","true").schema(entitySchema).format("json").load(jsonFilePath)
#rawdf contains a single row of file_name, timestamp_created, and obj_array
#obj_array is an array field containing the entire data payload (>2GB)
explodeddf=rawdf.selectExpr("file_name","timestamp_created","explode(obj_array) as data")
#this column explosion fails due to obj_array exceeding 2GB
When you hit limits like this you need to reframe the problem. Spark is choking on 2 GB in a single column, and that is a pretty reasonable choke point. Why not write your own custom data reader that emits records in the way you deem reasonable? (Likely the best solution, since it leaves the files as they are.)
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could use SQL tricks, such as window functions and lag, to try to expand and fill the rows.
You could do file level cleaning/formatting to make the data more manageable for the out of the box tools to work with.
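One hedged way to do that file-level cleaning outside Spark is to stream the big array with a pull parser such as ijson and rewrite it as newline-delimited JSON. The paths are illustrative and the sketch assumes the payload sits under a top-level obj_array key, as in the question:

import json
import ijson  # streaming JSON parser: pip install ijson

src = "/dbfs/mnt/datalake/jsonfiles/filename.json"    # illustrative paths
dst = "/dbfs/mnt/datalake/jsonfiles/filename.ndjson"

with open(src, "rb") as fin, open(dst, "w") as fout:
    # ijson walks the file incrementally and yields one array element at a time
    for obj in ijson.items(fin, "obj_array.item"):
        # default=float turns the Decimal numbers ijson yields back into plain floats
        fout.write(json.dumps(obj, default=float) + "\n")

Reading the resulting file with spark.read.json(dst), without the multiline option, then yields one row per object, so no single column ever has to hold the whole payload and the explode step is no longer needed.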

Pandas: in memory sorting hdf5 files

I have the following problem:
I have a set of several hdf5 files with similar data frames, which I want to sort globally based on multiple columns.
My input is the file names and an ordered list of columns I want to use for sorting.
The output should be a single hdf5 file containing all the sorted data.
Each file can contain millions of rows. I can afford loading a single file in memory but not the entire dataset.
Naively, I would first copy all the data into a single hdf5 file (which is not difficult) and then find a way to do in-memory sorting of this huge file.
Is there a quick way to sort a pandas data structure stored in an hdf5 file, in memory, based on multiple columns?
I have already seen ptrepack, but it seems to allow sorting only on a single column.
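One hedged way to attack this is an external merge sort: sort each file on its own (each fits in memory), then stream-merge the pre-sorted files into one output store. The HDF5 key, sort columns and chunk sizes below are made up, and dtype/string-itemsize handling for the appended output table is left out:

import heapq
import pandas as pd

SORT_COLS = ["col_a", "col_b"]   # illustrative sort key
KEY = "df"                       # illustrative HDF5 key

def presort(path):
    # phase 1: each file fits in memory on its own, so sort it and rewrite it
    df = pd.read_hdf(path, KEY).sort_values(SORT_COLS)
    sorted_path = path + ".sorted"
    df.to_hdf(sorted_path, key=KEY, format="table", index=False)
    return sorted_path

def iter_records(path, chunksize=100_000):
    # stream a pre-sorted file back as row dicts without loading it whole
    for chunk in pd.read_hdf(path, KEY, chunksize=chunksize):
        yield from chunk.to_dict("records")

def merge_sorted(paths, out_path, batch=100_000):
    # phase 2: k-way merge of the pre-sorted files; only ~batch rows are buffered
    merged = heapq.merge(
        *(iter_records(p) for p in paths),
        key=lambda rec: tuple(rec[c] for c in SORT_COLS),
    )
    buf = []
    for rec in merged:
        buf.append(rec)
        if len(buf) >= batch:
            pd.DataFrame(buf).to_hdf(out_path, key=KEY, format="table", append=True, index=False)
            buf = []
    if buf:
        pd.DataFrame(buf).to_hdf(out_path, key=KEY, format="table", append=True, index=False)

# merge_sorted([presort(p) for p in file_names], "all_sorted.h5")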