Joining multiple Spark dataframes on non-primary key - dataframe

I am trying to join 1 parent DataFrame with 2 child DataFrames.
Here is what my parent DF looks like:
PersonId    FirstName    LastName
1           ABC          XYZ
Child DF 1
FirstName    FirstNameMatchedPersonIds
ABC          [1, 10, 20]
Child DF 2
LastName    LastNameMatchedPersonIds
XYZ          [1, 40, 70]
I want to join the parent DF with child DF 1 on the FirstName column, and then join that result with child DF 2 on the LastName column. Here is what my expected join result should look like:
PersonId    FirstName    LastName    FirstNamePersonIds    LastNamePersonIds
1           ABC          XYZ         [1,10,20]             [1,40,70]
What would be a good join strategy to achieve this result? A simple inner join results in a lot of shuffle writes, and the job often fails after running for a long time.
Some estimations:
Number of records in parent DF: 35 million
Number of records in child DF 1: 1.5 million
Number of records in child DF 2: 1.5 million
I will have more such child DFs (at least 15), which I will join on different columns of the parent DF.
Additional info: the source for my parent DF is an HBase table, which I scan to create an RDD that I write to sequence files. I then read the sequence files and create the DataFrames (so the HBase scan is done only once).

For me those two smaller datasets are good candidates for a broadcast join; you may try to use it and check the results: Broadcast join
Please remember that broadcast should be used carefully and only on relatively small datasets. The broadcasted dataset is collected on the driver (so it needs to fit in its memory) and then propagated to the other executors, so size really matters here.
If broadcast is not an option, check whether you have data skew, and if you are using Spark 3.x you can turn AQE on and see if it helps.
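A minimal PySpark sketch of what that could look like (the DataFrame names follow the question; the join columns and the AQE settings are assumptions to illustrate the idea):

from pyspark.sql import functions as F

# Hint Spark to broadcast the two small child DataFrames instead of shuffling the 35M-row parent
joined = (parent_df
          .join(F.broadcast(child_df1), on="FirstName", how="inner")
          .join(F.broadcast(child_df2), on="LastName", how="inner"))

# On Spark 3.x, Adaptive Query Execution can also help with skewed joins
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")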

Often we cannot tweak joins beyond a certain point; we have to join to achieve the desired results.
What we can do instead is calibrate the spark.sql.shuffle.partitions config based on the size/partitioning of the data and the total number of executor cores available to your Spark application. The default value is 200, which in most cases proves to be too low. Ideally, I would recommend setting it to a multiple of the total executor cores. For example, if you have 100 executors with 4 cores each, start with 800 and calibrate from there based on findings in the Spark UI.
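As a sketch, using the example numbers above (100 executors x 4 cores = 400 cores, so start with 2x that and tune from the Spark UI):

spark.conf.set("spark.sql.shuffle.partitions", 800)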
Hope this helps!

Related

Selecting Rows Based On Specific Condition In Python Pandas Dataframe

So I am new to using Python Pandas dataframes.
I have a dataframe with one column holding customer ids and the other holding flavors and satisfaction scores; it looks something like this.
Although each customer should have 6 rows dedicated to them, Customer 1 only has 5. How do I create a new dataframe that only contains customers who have 6 rows?
I tried df['Customer No'].value_counts() == 6 but it is not working.
Here is one way to do it.
If you post the data as code (preferably) or text, I would be able to share the result.
# create a temporary column 'c' by grouping on Customer No
# and assigning the group count to it using transform
# finally, use loc to select the rows whose count equals 6
df.loc[df.assign(
    c=df.groupby(['Customer No'])['Customer No']
        .transform('count'))['c'].eq(6)]
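Alternatively, a small sketch of how the value_counts attempt from the question could be completed (assuming the customer column is named 'Customer No'):

# keep only customers whose id appears exactly 6 times
counts = df['Customer No'].value_counts()
result = df[df['Customer No'].isin(counts[counts == 6].index)]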

How to duplicate a row in pandas based on a column condition?

I have a pandas data frame and I would like to duplicate those rows which meet some column condition (i.e. having multiple elements in the CourseID column).
I tried iterating over the data frame to identify the rows which should be duplicated, but I don't know how to duplicate them.
Using Pandas version 0.25 or later it is quite easy.
The first step is to split df.CourseID (converting each element to a list)
and then to explode it (breaking each list into multiple rows,
repeating the other columns in each row):
course = df.CourseID.str.split(',').explode()
The result is:
0 456
1 456
1 799
2 789
Name: CourseID, dtype: object
Then all that remains is to join df with course, but in order to avoid
repeating column names, you have to drop the original CourseID column first.
Fortunately, it can be expressed in a single instruction:
df.drop(columns=['CourseID']).join(course)
If you have an older version of Pandas, this is a good reason to
upgrade it.
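For reference, a self-contained sketch with made-up data (the Name column is invented; the CourseID values match the output shown above):

import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'CourseID': ['456', '456,799', '789'],
})

# split the comma-separated ids and explode into one row per id
course = df.CourseID.str.split(',').explode()

# drop the original column and join the exploded rows back in
result = df.drop(columns=['CourseID']).join(course)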

How can I combine same-named columns into one in a pandas dataframe so all the columns are unique?

I have a dataframe that looks like this:
In [268]: dft.head()
Out[268]:
ticker BYND UBER UBER UBER ... ZM ZM BYND ZM
0 analyst worlds uber revenue ... company owning pet things
1 moskow apac note uber ... try things humanization users
2 growth anheuserbusch growth target ... postipo unicorn products revenue
3 stock kong analysts raised ... software revenue things million
4 target uberbeating stock rising ... earnings million pets direct
[5 rows x 500 columns]
In [269]: dft.columns.unique()
Out[269]: Index(['BYND', 'UBER', 'LYFT', 'SPY', 'WORK', 'CRWD', 'ZM'], dtype='object', name='ticker')
How do I combine the columns so that there is only a single unique column name for each ticker?
Maybe you should try making a copy of the column you wish to join and then extending the first column with that copy.
Code:
First convert all the column names to one case, either lower or upper, so that there is no mismatch in header case.
import pandas as pd

def merge_(df):
    '''Return a data-frame in which same-named (lowercase) columns are merged'''
    # Get the set of unique column names in lowercase
    columns = set(map(str.lower, df.columns))
    # Start from empty strings so that '+' concatenates cleanly
    df1 = pd.DataFrame(data='', index=df.index, columns=sorted(columns))
    # Merge the matching columns, selecting by position so duplicate labels are handled one at a time
    for i, col in enumerate(df.columns):
        df1[col.lower()] += df.iloc[:, i]  # words are in str format, so '+' will concatenate
    return df1
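A hypothetical call on the frame from the question would then be:

merged = merge_(dft)  # the duplicate ticker columns (e.g. the several UBER columns) end up concatenated row-wise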

Fillna (forward fill) on a large dataframe efficiently with groupby?

What is the most efficient way to forward fill information in a large dataframe?
I combined about 6 million rows x 50 columns of dimensional data from daily files. I dropped the duplicates and now I have about 200,000 rows of unique data which would track any change that happens to one of the dimensions.
Unfortunately, some of the raw data is messed up and has null values. How do I efficiently fill in the null data with the previous values?
id start_date end_date is_current location dimensions...
xyz987 2016-03-11 2016-04-02 Expired CA lots_of_stuff
xyz987 2016-04-03 2016-04-21 Expired NaN lots_of_stuff
xyz987 2016-04-22 NaN Current CA lots_of_stuff
That's the basic shape of the data. The issue is that some dimensions are blank when they shouldn't be (this is an error in the raw data). For example, the location is filled in for one row but blank in the next; I know the location has not changed, but it is captured as a unique row because it is blank.
I assume that I need to do a groupby using the ID field. Is this the correct syntax? Do I need to list all of the columns in the dataframe?
cols = [list of all of the columns in the dataframe]
wfm.groupby(['id'])[cols].fillna(method='ffill', inplace=True)
There are about 75,000 unique IDs within the 200,000 row dataframe. I tried doing a
df.fillna(method='ffill', inplace=True)
but I need to do it based on the IDs and I want to make sure that I am being as efficient as possible (it took my computer a long time to read and consolidate all of these files into memory).
It is likely efficient to execute the fillna directly on the groupby object:
df = df.groupby(['id']).fillna(method='ffill')
The method is referenced here in the documentation.
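For reference, a minimal sketch on data shaped like the sample in the question (a subset of its columns; groupby.ffill is assumed here as the per-group equivalent of fillna with method='ffill'):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':         ['xyz987', 'xyz987', 'xyz987'],
    'start_date': ['2016-03-11', '2016-04-03', '2016-04-22'],
    'is_current': ['Expired', 'Expired', 'Current'],
    'location':   ['CA', np.nan, 'CA'],
})

# forward fill the dimension columns within each id group
df[['location']] = df.groupby('id')[['location']].ffill()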
How about forward filling each group?
df = df.groupby(['id'], as_index=False).apply(lambda group: group.ffill())
github/jreback: this is a dupe of #7895. .ffill is not implemented in cython on a groupby operation (though it certainly could be), and instead calls python space on each group.
here's an easy way to do this.
url:https://github.com/pandas-dev/pandas/issues/11296
According to jreback's answer, when you do a groupby, ffill() is not optimized but cumsum() is. Try this:
df = df.sort_values('id')
df.ffill() * (1 - df.isnull().astype(int)).groupby('id').cumsum().applymap(lambda x: None if x == 0 else 1)

Conditional join result count in a large dataframe

I have a data set of about 100m rows, 4gb, containing two lists like these:
Seed
a
r
apple
hair
brush
tree
Phrase
apple tree
hair brush
I want to get the count of unique matched 'Phrase's for each unique 'Seed'. So for example, the seed 'a' is contained in both 'apple tree' and 'hair brush', so its 'Phrases_matched_count' should be '2'. Matches are just partial matches (i.e. a 'string contains' match; it does not need to be a regex or anything complex).
Seed Phrases_matched_count
a 2
r 2
apple 1
hair 1
brush 1
tree 1
I have been trying to find a way to do this using Apache Pig (on a small Amazon EMR cluster) and Python Pandas (the data set just about fits in memory), but I can't find a way to do it without either looping through every row for each unique 'seed', which would take very long, or a cross product of the tables, which would use too much memory.
Any ideas?
This can be done using the built-in str.contains, but I'm not sure how well it scales to a large amount of data.
import pandas as pd

# Test data
seed = pd.Series(['a', 'r', 'apple', 'hair', 'brush', 'tree'])
phrase = pd.Series(['apple tree', 'hair brush'])
# Creating a DataFrame with seeds as index and phrases as columns
df = pd.DataFrame(index=seed, columns=phrase)
# Checking whether each seed (the row label) is contained in each phrase (the column labels)
df = df.apply(lambda x: pd.Series(x.index.str.contains(x.name, regex=False), index=x.index), axis=1)
# Getting the result: count of matching phrases per seed
df.sum(axis=1)
# The result
a 2
r 2
apple 1
hair 1
brush 1
tree 1
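An equivalent sketch without the intermediate empty frame, in case that is easier to reason about (the same scalability caveats apply):

counts = pd.Series({s: phrase.str.contains(s, regex=False).sum() for s in seed})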