Efficiency in using pandas and parquet
People talk a lot about using parquet with pandas, and I am trying hard to understand if we can utilize the entire feature set of parquet files when they are used with pandas. For instance, say I have a big parquet file (partitioned on year) with 30 columns (including year, state, gender, last_name) and many rows. I want to load the parquet file and perform a computation like the following:
import pandas as pd
df = pd.read_parquet("file.parquet")
df_2002 = df[df.year == 2002]
df_2002.groupby(["state", "gender"])["last_name"].count()
In this query only 4 columns (out of 30) and only the year 2002 partition are used. That means we only want to bring in the columns and rows needed for this computation, and something like this is possible with parquet through predicate and projection pushdown (which is why we use parquet in the first place).
But I am trying to understand how this query behaves in pandas. Does it bring everything into memory the moment we call df = pd.read_parquet("file.parquet")? Or is there any laziness applied here that brings in projection & predicate pushdown? If not, then what is the point of using pandas with parquet? Is any of this possible with the arrow package out there?
Even though I haven't used dask, I am wondering whether this kind of situation is handled there, since dask performs things lazily.
I am sure this kind of situation is handled well in the spark world, but I am wondering how it is handled in local scenarios with packages like pandas, arrow, dask, ibis etc.
And I am trying hard to understand if we can utilize the entire feature set of parquet files when they are used with pandas.
TL;DR: Yes, but you may have to work harder than if you used something like Dask.
For instance say I have a big parquet file (partitioned on year)
This is pedantic but a single parquet file is not partitioned on anything. Parquet "datasets" (collections of files) are partitioned. For example:
my_dataset/year=2002/data.parquet
my_dataset/year=2003/data.parquet
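For illustration, that layout can be produced from pandas itself; this is a minimal sketch with made-up data (the partition_cols keyword is forwarded to pyarrow's dataset writer):

import pandas as pd

# Hypothetical miniature of the 30-column dataset described in the question.
df = pd.DataFrame({
    "year": [2002, 2002, 2003],
    "state": ["HI", "CO", "HI"],
    "gender": ["M", "F", "F"],
    "last_name": ["Smi", "Will", "Stev"],
})

# Creates my_dataset/year=2002/... and my_dataset/year=2003/...
df.to_parquet("my_dataset", partition_cols=["year"])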
Does it bring everything into memory the moment we call df = pd.read_parquet("file.parquet")?
Yes. But...you can do better:
df = pd.read_parquet('/tmp/new_dataset', filters=[[('year','=', 2002)]], columns=['year', 'state', 'gender', 'last_name'])
The filters keyword will pass the filter down to pyarrow which will apply the filter in a pushdown fashion both to the partition (e.g. to know which directories need to be read) and to the row group statistics.
The columns keyword will pass the column selection down to pyarrow which will apply the selection to only read the specified columns from disk.
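If you want to see the statistics that the row-group part of that pushdown is checked against, you can inspect the parquet metadata directly. A small sketch, using the hypothetical partition path from the layout above:

import pyarrow.parquet as pq

# Each column chunk in each row group carries min/max statistics; pyarrow
# compares the filter against these to decide which row groups to read.
pf = pq.ParquetFile('my_dataset/year=2002/data.parquet')
for i in range(pf.metadata.num_row_groups):
    row_group = pf.metadata.row_group(i)
    for j in range(row_group.num_columns):
        column = row_group.column(j)
        print(column.path_in_schema, column.statistics)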
Is any of this possible with the arrow package out there?
Everything in pandas' read_parquet call is being handled behind the scenes by pyarrow (unless you switch to some other engine). Traditionally the group_by would then be handled directly by pandas (well, maybe numpy), but pyarrow has some experimental compute APIs as well if you wanted to try doing everything in pyarrow.
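As a rough sketch, assuming the hive-partitioned dataset built in the complete example further down, the whole query can stay in pyarrow by combining a dataset scan with Table.group_by (available since pyarrow 7.0):

import pyarrow.dataset as ds
import pyarrow.compute as pc

dataset = ds.dataset('/tmp/new_dataset', format='parquet', partitioning='hive')

# Projection and predicate pushdown happen here: only the listed columns and
# the year=2002 partition/row groups are read. Depending on how the partition
# type is inferred you may need '2002' (string) instead of 2002.
tab = dataset.to_table(
    columns=['state', 'gender', 'last_name'],
    filter=pc.field('year') == 2002,
)

result = tab.group_by(['state', 'gender']).aggregate([('last_name', 'count')])
print(result.to_pandas())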
Even though I haven't used dask, I am wondering whether this kind of situation is handled there, since dask performs things lazily.
In my understanding (I don't have a ton of experience with dask), when you say...
df_2002 = df[df.year == 2002]
df_2002.groupby(["state", "gender"])["last_name"].count()
...in a dask dataframe, then dask will figure out that it can apply pushdown filters and column projections, and it will do so when loading the data. So dask takes care of figuring out which filters to apply and which columns to load. This saves you from having to figure it out yourself ahead of time.
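A minimal sketch of what that looks like, assuming the same /tmp/new_dataset as below (depending on the dask version and optimizer, you may still want to pass filters= and columns= to read_parquet explicitly):

import dask.dataframe as dd

# Lazy: only metadata is touched at this point.
ddf = dd.read_parquet('/tmp/new_dataset')

ddf_2002 = ddf[ddf.year == 2002]
result = ddf_2002.groupby(["state", "gender"])["last_name"].count()

# Work happens at compute(); the optimizer can push the year filter and the
# column selection down into the parquet read.
print(result.compute())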
Complete example (you can use strace to verify that it is only loading one of the two parquet files and only part of that file):
import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
import shutil
shutil.rmtree('/tmp/new_dataset', ignore_errors=True)  # ok if the directory does not exist yet
tab = pa.Table.from_pydict({
"year": ["2002", "2002", "2002", "2002", "2002", "2002", "2003", "2003", "2003", "2003", "2003", "2003"],
"state": [ "HI", "HI", "HI", "HI", "CO", "CO", "HI", "HI", "CO", "CO", "CO", "CO"],
"gender": [ "M", "F", None, "F", "M", "F", None, "F", "M", "F", "M", "F"],
"last_name": ["Smi", "Will", "Stev", "Stan", "Smi", "Will", "Stev", "Stan", "Smi", "Will", "Stev", "Stan"],
"bonus": [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
})
# Write a hive-partitioned dataset: /tmp/new_dataset/year=2002/... and year=2003/...
ds.write_dataset(tab, '/tmp/new_dataset', format='parquet', partitioning=['year'], partitioning_flavor='hive')
# Both the predicate (year == 2002) and the projection (4 of the 5 columns) are pushed down to pyarrow
df = pd.read_parquet('/tmp/new_dataset', filters=[[('year','=', 2002)]], columns=['year', 'state', 'gender', 'last_name'])
df_2002 = df[df.year == 2002]  # effectively a no-op: the filter was already applied at read time
print(df_2002.groupby(["state", "gender"])["last_name"].count())
Disclaimer: You are asking about a number of technologies here. I work pretty closely with the Apache Arrow project and thus my answer may be biased in that direction.
Related
Python write function saves dataframe.__repr__ output but truncated?
I have a dataframe output as a result of running some code, like so:

df = pd.DataFrame({
    "i": self.direct_hit_i,
    "domain name": self.domain_list,
    "j": self.direct_hit_j,
    "domain name 2": self.domain_list2,
    "domain name cleaned": self.clean_domain_list,
    "domain name cleaned 2": self.clean_domain_list2
})

All I was really looking for was a way to save these data to a file, e.g. txt or csv, but in a way where the columns of data align with the header. I was using df.to_csv() with a \t delimiter, but because the data have different lengths of strings and numbers, the elements within each row never quite line up as a column with the corresponding header. So I resorted to using:

with open('./filename.txt', 'w') as fo:
    fo.write(df.__repr__())

But bear in mind the data in the dataframe are lists with really long lengths. For small lengths it writes the nicely aligned table, which is exactly what I want. However, when I have very big lists the output is truncated. I would like it to not be truncated, since I'll need to manually scroll down and verify things.
Try the syntax:

with open('./filename.txt', 'w') as fo:
    fo.write(f'{df!r}')

Another way of doing this export to CSV would be to use a tool like Mito, which, full disclosure, I'm the author of. It should allow you to export to CSV more easily than the process here!
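As an aside, the repr is still subject to pandas' display truncation options; if you want the full frame in the file, DataFrame.to_string() is a possible alternative, sketched here:

# to_string() is not bound by the display.max_rows / display.max_columns
# defaults, so the whole frame ends up in the file.
with open('./filename.txt', 'w') as fo:
    fo.write(df.to_string())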
dask read parquet and specify schema
Is there a dask equivalent of spark's ability to specify a schema when reading in a parquet file? Possibly using kwargs passed to pyarrow? I have a bunch of parquet files in a bucket, but some of the fields have slightly inconsistent names. I could create a custom delayed function to handle these cases after reading them, but I'm hoping I could specify the schema when opening them via globbing. Maybe not, though, as I guess opening them via globbing is going to try to concatenate them, and this currently fails because of the inconsistent field names.

Create a parquet file:

import dask.dataframe as dd

df = dd.demo.make_timeseries(
    start="2000-01-01",
    end="2000-01-03",
    dtypes={"id": int, "z": int},
    freq="1h",
    partition_freq="24h",
)
df.to_parquet("df.parquet", engine="pyarrow", overwrite=True)

Read it in via dask and specify the schema after reading:

df = dd.read_parquet("df.parquet", engine="pyarrow")
df["z"] = df["z"].astype("float")
df = df.rename(columns={"z": "a"})

Read it in via spark and specify the schema:

from pyspark.sql import SparkSession
import pyspark.sql.types as T

spark = SparkSession.builder.appName('App').getOrCreate()

schema = T.StructType(
    [
        T.StructField("id", T.IntegerType()),
        T.StructField("a", T.FloatType()),
        T.StructField("timestamp", T.TimestampType()),
    ]
)

df = spark.read.format("parquet").schema(schema).load("df.parquet")
Some of the options are:

Specify dtypes after loading (requires consistent column names):

custom_dtypes = {"a": float, "id": int, "timestamp": "datetime64[ns]"}
df = dd.read_parquet("df.parquet", engine="pyarrow").astype(custom_dtypes)

This currently fails because of the inconsistent field names.

If the column names are not the same across files, you might want to use a custom delayed before loading:

import glob

import pandas as pd
from dask import delayed

@delayed
def custom_load(path):
    df = pd.read_parquet(path)
    # some logic to ensure consistent columns
    # for example:
    if "z" in df.columns:
        df = df.rename(columns={"z": "a"}).astype(custom_dtypes)
    return df

dask_df = dd.from_delayed([custom_load(path) for path in glob.glob("some_path/*parquet")])
Altering or updating of nested data in DataFrame Spark
I have a very weird requirement in spark wherein I have to transform the data present in a dataframe. I read data from an s3 bucket and transform it into a dataframe. That part works fine; the next step is where the challenge lies. Once the data is read, the JSON data needs to be transformed so that all records are consistent. Sample data which I have:

{"name": "John", "age": 24, "object_data": {"tax_details": ""}}
{"name": "nash", "age": 26, "object_data": {"tax_details": {"Tax": "None"}}}

The issue is that the tax_details field is a string in the first document, while the second document has an object. I want to make sure it is always stored as an object. If that can be done by a dataframe operation that would be great; otherwise any pointer on how to do it would be appreciated.
How to get the correct centroid of a bigquery polygon with st_centroid
I'm having some trouble with the ST_CENTROID function in BigQuery. There is a difference between getting the centroid of a GEOGRAPHY column and the centroid of the same column converted via WKT. The table is generated using a bq load with a geography column and a newline_delimited_json file containing the polygon as WKT text. Example:

select st_centroid(polygon) loc, st_centroid(ST_GEOGFROMTEXT(st_astext(polygon))) loc2, polygon
from table_with_polygon

Result:

POINT(-174.333247842246 -51.6549479435566)  POINT(5.66675215775447 51.6549479435566)  POLYGON((5.666771 51.654721, 5.666679 51.655027, 5.666597 51.655017, 5.666556 51.655154, 5.666702 51.655171, 5.666742 51.655037, 5.666824 51.655046, 5.666917 51.654737, 5.666771 51.654721))
POINT(-174.367214581541 -51.645030856473)  POINT(5.63278541845948 51.645030856473)  POLYGON((5.632691 51.644997, 5.63269 51.644999, 5.63273 51.645003, 5.632718 51.645049, 5.632843 51.645061, 5.632855 51.645014, 5.632691 51.644997))
POINT(-174.37100400049 -51.6434992715399)  POINT(5.62899599950984 51.6434992715399)  POLYGON((5.629063 51.643523, 5.629084 51.643465, 5.629088 51.643454, 5.628957 51.643436, 5.628915 51.643558, 5.629003 51.64357, 5.629021 51.643518, 5.629063 51.643523))
POINT(-174.293340001044 -51.6424190026157)  POINT(5.70665999895557 51.6424190026157)  POLYGON((5.706608 51.642414, 5.706624 51.642443, 5.706712 51.642424, 5.706696 51.642395, 5.706608 51.642414))
POINT(-174.306209997018 -51.6603530009923)  POINT(5.69379000298176 51.6603530009923)  POLYGON((5.693801 51.660361, 5.693802 51.660346, 5.693779 51.660345, 5.693778 51.66036, 5.693801 51.660361))
POINT(-174.291766437718 -51.6499633041183)  POINT(5.70823356228228 51.6499633041183)  POLYGON((5.708187 51.649858, 5.708091 51.650027, 5.70828 51.650069, 5.708376 51.649899, 5.708187 51.649858))
POINT(-174.369405698681 -51.653769846544)  POINT(5.63059430131924 51.653769846544)  POLYGON((5.630653 51.653531, 5.630462 51.653605, 5.630579 51.653722, 5.630574 51.65373, 5.630566 51.653729, 5.630551 51.653759, 5.630559 51.65376, 5.630555 51.653769, 5.630273 51.653846, 5.630364 51.653974, 5.630787 51.653858, 5.630852 51.653728, 5.630653 51.653531))
...etc

Is this a bug or am I doing something wrong?

Update

Did some further digging using Michael Entin's answer as a hint. It turns out that bq load with WKT does NOT use the smallest polygon by default, and there is no option with bq load to change this behaviour. The imported json is very large (openstreetmap data), so there is no easy option to change this to geoJson. To dig deeper into the actual value stored in the column, I did a

select st_asgeojson(polygon) from ...

which resulted in

{
  "type": "Polygon",
  "coordinates": [
    [
      [5.598659, 51.65927],
      [5.598651, 51.659295],
      [5.598638, 51.659293],
      [5.598626, 51.65933],
      [5.598788, 51.659353],
      [5.598799, 51.659319],
      [5.598855, 51.659139],
      [5.598692, 51.65912],
      [5.598643, 51.659268],
      [5.598659, 51.65927]
    ],
    [
      [180, 90],
      [180, 0],
      [180, -90],
      [-180, -90],
      [-180, 0],
      [-180, 90],
      [180, 90]
    ]
  ]
}

So here the wrong orientation can be seen.
Looks like some or all of these polygons might have gotten inverted, and this produces antipodal centroids: POINT(-174.333247842246 -51.6549479435566) is antipodal to POINT(5.66675215775447 51.6549479435566), etc.

See the BigQuery doc for details of what this means: https://cloud.google.com/bigquery/docs/gis-data#polygon_orientation

There are two possible reasons and ways to resolve this (my bet is case 1):

1. The polygons should be small, but were loaded with incorrect orientation and thus became inverted - they are now complementary to the intended shape and larger than a hemisphere. Since you don't pass the oriented parameter to ST_GEOGFROMTEXT, the function fixes them by ignoring the orientation. The correct solution is usually to load them as GeoJson (this also avoids another issue with loading WKT strings - geodesic vs planar edges). Or, if all the edges are small and geodesic vs planar does not matter, replace the table geography with ST_GEOGFROMTEXT(st_astext(polygon)).

2. The polygons should really be large and were loaded with correct orientation. Then, when you don't pass the oriented parameter to ST_GEOGFROMTEXT, the function breaks them by ignoring the orientation. If this is the case, you should pass TRUE as the second parameter to ST_GEOGFROMTEXT.
Append new columns to HDFStore with pandas
I'm using Pandas and making a HDFStore object. I calculate 500 columns of data and write them to a table format HDFStore object. Then I close the file, delete the data from memory, do the next 500 columns (labelled by an increasing integer), open up the store, and try to append the new columns. However, it doesn't like this. It gives me the error:

invalid combinate of [non_index_axes] on appending data [[(1, [500, 501, 502, ...])]] vs current table [[(1, [0, 1, 2, ...])]]

I'm assuming it only allows appending of more rows, not columns. So how do I add more columns?
HDF5 files have a fixed structure, and you cannot easily add a column, but a workaround is to concatenate different DFs and then re-write them into the HDF5 file:

import pandas as pd

hdf5_files = ['data1.h5', 'data2.h5', 'data3.h5']
df_list = []

for file in hdf5_files:
    df = pd.read_hdf(file)
    df_list.append(df)

result = pd.concat(df_list)

# You can now use the result DataFrame to access all of the data from the HDF5 files

Does this solve your problem? Remember, HDF5 is not designed for efficient append operations; you should consider a database system if you need to frequently add new columns to your data, imho.
You have kept your column titles in the code as [1, 2, 3, ...] and are trying to append a DataFrame with different columns [500, 501, 502, ...].
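A possible workaround, sketched here as an assumption rather than taken from either answer: store each 500-column batch under its own key in the store and recombine along the column axis when reading.

import numpy as np
import pandas as pd

# Hypothetical example: two batches of 500 columns sharing the same row index.
batch_0 = pd.DataFrame(np.random.rand(10, 500), columns=range(0, 500))
batch_1 = pd.DataFrame(np.random.rand(10, 500), columns=range(500, 1000))

with pd.HDFStore('data.h5') as store:
    store.put('batch_0', batch_0)  # each batch lives under its own key
    store.put('batch_1', batch_1)

with pd.HDFStore('data.h5') as store:
    # Column-wise concatenation gives the full 1000-column frame back.
    full = pd.concat([store['batch_0'], store['batch_1']], axis=1)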