PySpark - Same filter as the predicate applied after the scan even though the predicate is pushed down

Question: When joining two datasets, why is the isnotnull filter applied twice on the joining key column? In the physical plan it appears once as a PushedFilter and then again as an explicit Filter right after the scan. Why is that so?
code:
import os
import pandas as pd, numpy as np
import pyspark
spark=pyspark.sql.SparkSession.builder.getOrCreate()
save_loc = "gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/"
df1 = spark.createDataFrame(pd.DataFrame({'a': np.random.choice([1, 2, None], size=1000, p=[0.47, 0.48, 0.05]),
                                           'b': np.random.random(1000)}))
df2 = spark.createDataFrame(pd.DataFrame({'a': np.random.choice([1, 2, None], size=1000, p=[0.47, 0.48, 0.05]),
                                           'b': np.random.random(1000)}))
df1.write.parquet(os.path.join(save_loc,"dfl_key_int"))
df2.write.parquet(os.path.join(save_loc,"dfr_key_int"))
dfl_int = spark.read.parquet(os.path.join(save_loc,"dfl_key_int"))
dfr_int = spark.read.parquet(os.path.join(save_loc,"dfr_key_int"))
dfl_int.join(dfr_int,on='a',how='inner').explain()
output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [a#23L, b#24, b#28]
   +- BroadcastHashJoin [a#23L], [a#27L], Inner, BuildRight, false
      :- Filter isnotnull(a#23L)
      :  +- FileScan parquet [a#23L,b#24] Batched: true, DataFilters: [isnotnull(a#23L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfl_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
         +- Filter isnotnull(a#27L)
            +- FileScan parquet [a#27L,b#28] Batched: true, DataFilters: [isnotnull(a#27L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfr_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>

The reason is that a PushedFilter does not guarantee that all the data is filtered as you want before Spark reads it into memory. For more context on what a PushedFilter is, check out this SO answer.
Parquet files
Let's have a look at Parquet files like in your example. Parquet files are stored in a columnar format, and they are also organized in Row Groups (or chunks). The following picture comes from the Apache Parquet docs:
You see that the data is stored in a columnar fashion, and they are chopped up into chunks (row groups). Now, for each column/row chunk combination, Parquet stores some metadata. In that picture, you see that it contains a bunch of metadata and then also extra key/value pairs. These also contain statistics about your data (depending on what type your column is).
Some examples of these statistics are:
what the min/max value is of the chunk (in case it makes sense for the data type of the column)
whether the chunk has non-null values
...
Back to your example
You are joining on the a column. To be able to do that we need to be sure that a has no null values. Let's imagine that your a column (disregarding the other columns) is stored like this:
a column:
chunk 1: 0, 1, None, 1, 1, None
chunk 2: 0, 0, 0, 0, 0, 0
chunk 3: None, None, None, None, None, None
Now, using the PushedFilter we can immediately (just by looking at the metadata of the chunks) disregard chunk 3; we don't even have to read it in!
But as you see, chunk 1 still contains null values. That is something we can't filter out by looking only at the chunk's metadata, so we'll have to read in that whole chunk and then filter out the remaining null values within Spark, using that second Filter node in your physical plan.
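If you want to see these chunk-level statistics for yourself, you can read them back with PyArrow. The following is a minimal sketch, not part of the original answer; the part-file name is hypothetical, since Spark generates it:
import pyarrow.parquet as pq

# Point this at any single part file Spark wrote under dfl_key_int (the name here is hypothetical)
pf = pq.ParquetFile("part-00000-xxxx.snappy.parquet")
meta = pf.metadata

for rg in range(meta.num_row_groups):
    col_a = meta.row_group(rg).column(0)  # column 'a' is the first column in the schema
    stats = col_a.statistics
    if stats is not None:                 # statistics may be absent if the writer skipped them
        # A row group whose values are all null can be skipped entirely thanks to the
        # pushed-down IsNotNull(a); one that merely contains some nulls must still be
        # read and then cleaned up by the explicit Filter node.
        print(rg, "values:", col_a.num_values, "nulls:", stats.null_count,
              "min:", stats.min, "max:", stats.max)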


Loading "pivoted" data with pyarrow (or, "stack" or "melt" for pyarrow.Table)

I have large-ish CSV files in "pivoted" format: rows and columns are categorical, and values are a homogeneous data type.
What's the best (memory and compute efficient) way to load such a file into a pyarrow.Table with an "unpivoted" schema? In other words, given a CSV file with n rows and m columns, how do I get a pyarrow.Table with n*m rows and one column?
In terms of pandas, I think I want the pyarrow equivalent of pandas.DataFrame.melt() or .stack().
For example...
given this CSV file
item,A,B
item_0,0,0
item_1,370,1
item_2,43,0
I want this pyarrow.Table
item group value
item_0 A 0
item_0 B 0
item_1 A 370
item_1 B 1
item_2 A 43
item_2 B 0
PyArrow has only limited compute capabilities and doesn't support melt at the moment. You can see what's available here: https://arrow.apache.org/docs/python/api/compute.html#
One alternative is to create the melted table yourself:
import pyarrow as pa
import pyarrow.csv

table = pyarrow.csv.read_csv("data.csv")

tables = []
for column_name in table.schema.names[1:]:
    tables.append(pa.Table.from_arrays(
        [
            table[0],                                               # the 'item' column
            pa.array([column_name] * table.num_rows, pa.string()),  # the repeated column name
            table[column_name],                                     # the values for that column
        ],
        names=[
            table.schema.names[0],
            "key",
            "value"
        ]
    ))
result = pa.concat_tables(tables)
Another option is to use polars, which is similar to pandas but uses Arrow as its backend. Unlike PyArrow it has many more compute functions, including melt:
https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.melt.html
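For completeness, here is a minimal sketch of the polars route (assuming polars is installed; the column names follow the example CSV above):
import polars as pl

df = pl.read_csv("data.csv")
melted = df.melt(id_vars="item", variable_name="group", value_name="value")
# Convert back to a pyarrow.Table if needed
table = melted.to_arrow()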

What is the most efficient method for calculating per-row historical values in a large pandas dataframe?

Say I have two pandas dataframes (df_a & df_b), where each row represents a toy and features about that toy. Some pretend features:
Was_Sold (Y/N)
Color
Size_Group
Shape
Date_Made
Say df_a is relatively small (10s of thousands of rows) and df_b is relatively large (>1 million rows).
Then for every row in df_a, I want to:
Find all the toys from df_b with the same type as the one from df_a (e.g. the same color group)
The df_b toys must also be made before the given df_a toy
Then find the ratio of those sold (So count sold / count all matched)
What is the most efficient means to make those per-row calculations above?
The best I've come up with so far is something like the below.
(Note: the code might have an error or two as I'm rough-typing it from a different use case.)
cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')
    # Empty list to build up the calculation in
    ratio_list = []
    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        # df to keep the overall prior toy matches
        prior_toys = df_b[(df_b.Date_Made < created_date) & (df_b[col] == relevant_val)]
        prior_count = len(prior_toys)
        # Now find the ones that were sold
        prior_sold_count = len(prior_toys[prior_toys.Was_Sold == "Y"])
        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
Using .itertuples() is useful, but this is still pretty slow. Is there a more efficient method or something I'm missing?
EDIT
Added the below script, which will emulate data for the above scenario:
import numpy as np
import pandas as pd

colors = ['red', 'green', 'yellow', 'blue']
sizes = ['small', 'medium', 'large']
shapes = ['round', 'square', 'triangle', 'rectangle']
sold = ['Y', 'N']

size_df_a = 200
size_df_b = 2000

date_start = pd.to_datetime('2015-01-01')
date_end = pd.to_datetime('2021-01-01')

def random_dates(start, end, n=10):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

df_a = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_a),
        'Size_Group': np.random.choice(sizes, size_df_a),
        'Shape': np.random.choice(shapes, size_df_a),
        'Was_Sold': np.random.choice(sold, size_df_a),
        'Date_Made': random_dates(date_start, date_end, n=size_df_a)
    }
)

df_b = pd.DataFrame(
    {
        'Color': np.random.choice(colors, size_df_b),
        'Size_Group': np.random.choice(sizes, size_df_b),
        'Shape': np.random.choice(shapes, size_df_b),
        'Was_Sold': np.random.choice(sold, size_df_b),
        'Date_Made': random_dates(date_start, date_end, n=size_df_b)
    }
)
First of all, I think your computation would be much more efficient using a relational database and an SQL query. Indeed, the filters can be done by indexing columns, performing a database join, some advanced filtering and counting the result. An optimized relational database can derive an efficient algorithm from a simple SQL query (hash-based row grouping, binary search, fast intersection of sets, etc.). Pandas is sadly not very good at performing advanced requests like this efficiently. It is also very slow to iterate over a pandas dataframe, although I am not sure this can be alleviated in this case using only pandas. Fortunately, you can use some NumPy and Python tricks to (partially) implement what fast relational database engines would do.
Additionally, pure-Python object types are slow, especially (unicode) strings. Thus, converting column types to efficient ones in the first place can save a lot of time (and memory). For example, there is no need for the Was_Sold column to contain "Y"/"N" string objects: a boolean can be used instead. So let us convert it:
df_b.Was_Sold = df_b.Was_Sold == "Y"
Finally, the current algorithm has a bad complexity: O(Na * Nb), where Na is the number of rows in df_a and Nb is the number of rows in df_b. This is not easy to improve, though, due to the non-trivial conditions. A first solution is to group df_b by the col column ahead of time so as to avoid an expensive complete iteration of df_b (previously done with df_b[col] == relevant_val). Then, the dates of each precomputed group can be sorted so that a fast binary search can be performed later. You can then use NumPy to count the boolean values efficiently (using np.sum).
Note that doing prior_toys['Was_Sold'] is a bit faster than prior_toys.Was_Sold.
Here is the resulting code:
cols = ['Color', 'Size_Group', 'Shape']

# Run this calculation for multiple features
for col in cols:
    print(col + ' - Started')
    # Empty list to build up the calculation in
    ratio_list = []
    # Split df_b by col and sort each (indexed) group by date
    colGroups = {grId: grDf.sort_values('Date_Made') for grId, grDf in df_b.groupby(col)}
    # Start the iteration
    for row in df_a.itertuples(index=False):
        # Relevant values from df_a
        relevant_val = getattr(row, col)
        created_date = row.Date_Made
        # df to keep the overall prior toy matches
        curColGroup = colGroups[relevant_val]
        prior_count = np.searchsorted(curColGroup['Date_Made'], created_date)
        prior_toys = curColGroup[:prior_count]
        # Now find the ones that were sold
        prior_sold_count = prior_toys['Was_Sold'].values.sum()
        # Now make the calculation and append to the list
        if prior_count == 0:
            ratio = 0
        else:
            ratio = prior_sold_count / prior_count
        ratio_list.append(ratio)
    # Store the calculation in the original df_a
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list
    print(col + ' - Finished')
This is 5.5 times faster on my machine.
The iteration over the pandas dataframe is a major source of slowdown. Indeed, prior_toys['Was_Sold'] takes half of the computation time because of the huge overhead of pandas' internal function calls repeated Na times... Using Numba may help to reduce the cost of the slow iteration. Note that the complexity can be improved by splitting colGroups into subgroups ahead of time (giving O(Na log Nb)). This should help to completely remove the overhead of prior_sold_count. The resulting program should be about 10 times faster than the original one.
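As a hedged sketch of that last idea (not part of the original answer, and not necessarily what it had in mind; the speedup is untested), you can precompute, per group, a sorted NumPy array of dates plus a cumulative count of sold toys, so the per-row work becomes one binary search and one array lookup:
# Assumes df_a, df_b, np and the boolean Was_Sold conversion from above
cols = ['Color', 'Size_Group', 'Shape']

for col in cols:
    # Precompute, per group, the sorted dates and a running count of sold toys
    groups = {}
    for grId, grDf in df_b.groupby(col):
        grDf = grDf.sort_values('Date_Made')
        dates = grDf['Date_Made'].values                 # datetime64[ns] array
        sold_cumsum = grDf['Was_Sold'].values.cumsum()   # Was_Sold is already boolean
        groups[grId] = (dates, sold_cumsum)

    ratio_list = []
    for row in df_a.itertuples(index=False):
        dates, sold_cumsum = groups[getattr(row, col)]
        prior_count = np.searchsorted(dates, row.Date_Made.to_datetime64())
        if prior_count == 0:
            ratio_list.append(0)
        else:
            ratio_list.append(sold_cumsum[prior_count - 1] / prior_count)
    df_a[col + '_Prior_Sold_Ratio'] = ratio_list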

pandas : Indexing for thousands of rows in dataframe

I initially had 100k rows in my dataset. I read the CSV using pandas into a dataframe called data. I tried to do a subset selection of 51 rows using .loc. My index labels are numeric values 0, 1, 2, 3, etc. I tried using this command:
data = data.loc['0':'50']
But the results were weird: it took all the rows from 0 to 49999; it looks like it is taking rows until the index value starts with 50.
Similarly, I tried new_data = data.loc['0':'19'], and the result was all the rows from 0 up to 18999.
Could this be a bug in pandas?
You want to use .iloc in place of .loc, since you are selecting data from the dataframe via numeric indices.
For example:
data.iloc[:51, :]
Keep in mind that your index labels are numeric, not strings, so querying with strings (as you have done in your OP) results in string-wise comparisons rather than the numeric selection you intended.
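To make the distinction concrete, here is a small hedged illustration on a toy frame with a default integer index (hypothetical data, not from the question):
import pandas as pd

data = pd.DataFrame({'x': range(100000)})   # default RangeIndex 0, 1, 2, ...

first_51 = data.iloc[:51]   # positional: rows 0..50 (the stop position is excluded)
same_51 = data.loc[0:50]    # label-based with integer labels: rows 0..50 (the stop label is included)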

Replacing duplicates (pandas to csv or hdf5 or sql)?

I am using pandas to process information. The current workflow is to read the last 30 days of data from an HDF5 file, then add the latest data to it and perform some analysis.
I then need to append this data back to the original hdf5 file (with a column indicating whether the same Customer_ID showed up multiple times). The only problem is that there are duplicates. The only solution I have is to read the whole file into memory, drop the duplicates, and then write the file all over again (replacing it completely). Is there a way to avoid appending duplicate data? Like an 'insert and replace' command that I can use in pandas?
import datetime as dt
import pandas as pd
from pandas import Timedelta, HDFStore

querydate = dt.date.today() - Timedelta(30, unit='d')
df = pd.read_hdf(loc + hdfname, 'Raw', where=[('Report_Date > querydate')])
df2 = pd.read_csv(loc + yesterdayfile)  # yesterdayfile holds the name of yesterday's .csv file
combine = [df, df2]
df3 = pd.concat(combine)
I need to see if the latest data (from yesterday) existed previously (within a 30-day rolling window). Below you can see that I append the latest data to the original file, then read that file into memory, drop duplicates, and write it again (overwriting the existing file).
hdf = HDFStore(loc + hdfname)
hdf.put('Raw', df3, format='table', complib='blosc', complevel=5, data_columns=True, append=True)

df = pd.read_hdf(loc + hdfname, 'Raw')
df.drop_duplicates(subset=['Emp_ID', 'Interaction_Time', 'Customer_ID'], take_last=True, inplace=True)

hdf = HDFStore(loc + hdfname)
hdf.put('Raw', df, format='table', complib='blosc', complevel=5, data_columns=True, append=False)

Storing .csv in HDF5 pandas

I was experimenting with HDF and it seems pretty great because my data is not normalized and it contains a lot of text. I love being able to query when I read data into pandas.
loc2 = 'C:\\Users\\Documents\\'
(my dataframe with data is called 'export')
hdf = HDFStore(loc2 + 'consolidated.h5')
hdf.put('raw', export, format='table', complib='blosc', complevel=9, data_columns=True, append=True)
21 columns and about 12 million rows so far and I will add about 1 million rows per month.
1 Date column [I convert this to datetime64]
2 Datetime columns (one of them for each row and the other one is null about 70% of the time) [I convert this to datetime64]
9 text columns [I convert these to categorical which saves a ton of room]
1 float column
8 integer columns, 3 of these can reach a max of maybe a couple of hundred and the other 5 can only be 1 or 0 values
I made a nice small h5 table and it was perfect until I tried to append more data to it (literally just one day of data, since I am receiving daily raw .csv files). I received errors showing that the dtypes were not matching up for each column, although I used the exact same IPython notebook.
Is my hdf.put code correct? If I have append = True does that mean it will create the file if it does not exist, but append the data if it does exist? I will be appending to this file everyday basically.
For columns which only contain 1 or 0, should I specify a dtype like int8 or int16 - will this save space or should I keep it at int64? It looks like some of my columns are randomly float64 (although no decimals) and int64. I guess I need to specify the dtype for each column individually. Any tips?
I have no idea what blosc compression is. Is it the most efficient one to use? Any recommendations here? This file is mainly used to quickly read data into a dataframe to join to other .csv files which Tableau is connected to.