Spark dataframe: Merged data with python results in a very large number of rows - dataframe

Pyspark: A merged data (using Left join) rusults in a very large number of rows. Why are there too many resulting rows after merger? Is there anything seriously wrong with my code? Both dataframes have one common key 'Region'.
1st dataframe (df1): 47,972 rows
2nd dataframe (df2): 852,747 rows
Merged_df: 10,836,925,792 rows
merged_df = df1.join(df2, on=['Region'] , how = 'left')
merged_df = df1.join(df2, on=['Region'] , how = 'left')
I am expecting more rows but in billions.

Let assume two dataframes:
The left join result is:
In other words, a LEFT JOIN indicates that all records from the LEFT (first) dataframe will be returned, regardless of whether they are present in the RIGHT dataframe. If the right dataframe does not include any matches, the result is null.
For every region in first dataframe it will return all matching regions in second dataframe.
AS kasyap said the probability of getting maximum rows is 47,972 x 852,747 = 40,907,979,084 if Region column is same in both dataframe.

Related

Compile a count of similar rows in a Pandas Dataframe based on multiple column values

I have two Dataframes, one containing my data read in from a CSV file and another that has the data grouped by all of the columns but the last and reindexed to contain a column for the count of the size of the groups.
df_k1 = pd.read_csv(filename, sep=';')
columns_for_groups = list(df_k1.columns)[:-1]
k1_grouped = df_k1.groupby(columns_for_groups).size().reset_index(name="Count")
I need to create a series such that every row(i) in the series corresponds to row(i) in my original Dataframe but the contents of the series need to be the size of the group that the row belongs to in the grouped Dataframe. I currently have this, and it works for my purposes, but I was wondering if anyone knew of a faster or more elegant solution.
size_by_row = []
for row in df_k1.itertuples():
for group in k1_grouped.itertuples():
if row[1:-1] == group[1:-1]:
size_by_row.append(group[-1])
break
group_size = pd.Series(size_by_row)

How do I outer join data frames coming from a for loop and having different columns size?

I have a few data frames coming from a location. I want to combine them (outer join them). The problem is, few of them have different column sizes. But, the column to join them is the same. How do I achieve this in pyspark?
The sample code I have is:
def unionAll(*dfs):
return reduce(DataFrame.union, dfs)
attributes_df=[]
files = bucket.objects.filter(Prefix="data-process/attributes_extraction/Attributes_/")
for obj in files:
file_path=obj.key
if file_path[-5:]==".xlsx":
attributes_file_list.append(file_path)
fileobj = s3.Object(S3_BUCKET, file_path)
data = fileobj.get()['Body'].read()
df = pd.read_excel(io.BytesIO(data))
df= spark.createDataFrame(df.astype(str))
attributes_df.append(df)
attributes_ = unionAll(*attributes_df)
display(attributes_)
The error I have is:
Union can only be performed on tables with the same number of columns, but the first table has 4 columns and the second table has 3 columns;
So, I believe one of the data frames has 4 columns and the others 3. But Can't not outer join them on a common column?

How to concat 3 dataframes with each into sequential columns

I'm trying to understand how to concat three individual dataframes (i.e df1, df2, df3) into a new dataframe say df4 whereby each individual dataframe has its own column left to right order.
I've tried using concat with axis = 1 to do this, but it appears not possible to automate this with a single action.
Table1_updated = pd.DataFrame(columns=['3P','2PG-3Io','3Io'])
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io])
Note that with the exception of get_table1_2P_max_3Io, which has two columns, all other dataframes have one column
For example,
get_table1_3P =
get_table1_2P_max_3Io =
get_table1_3Io =
Ultimately, i would like to see the following:
I believe you need first concat and tthen change order by list of columns names:
Table1_updated=pd.concat([get_table1_3P,get_table1_2P_max_3Io,get_table1_3Io], axis=1)
Table1_updated = Table1_updated[['3P','2PG-3Io','3Io']]

Replace a subset of pandas data frame with another data frame

I have a data frame(DF1) with 100 columns.( one of the column is ID)
I have one more data frame(DF2) with 30 columns.( one column is ID)
I have to update the first 30 columns of the data frame(DF1) with the values in second data frame (DF2) keeping the rest of the values in the remaining columns of first data frame (DF1) intact.
update the first 30 column value in DF1 out of the 100 columns when the ID in second data frame (DF2) is present in first data frame (DF1).
I tested this on Python 3.7 but I see no reason for it not to work on 2.7:
joined = df1.reset_index() \
[['index', 'ID']] \
.merge(df2, on='ID')
df1.loc[joined['index'], df1.columns[:30]] = joined.drop(columns=['index', 'ID'])
This assumes that df2 doesn't have a column called index or the merge will fail saying duplicate key with suffix.
Here a slow-motion of its inner workings:
df1.reset_index() returns a dataframe same as df1 but with an additional column: index
[['index', 'ID']] extracts a dataframe containing just these 2 columns from the dataframe in #1
.merge(...) merges with df2 , matching on ID . The result (joined) is a dataframe with 32 columns: index, ID and the original 30 columns of df2.
df1.loc[<row_indexes>, <column_names>] = <another_dataframe> mean you want to replace at those particular cells with data from another_dataframe. Since joined has 32 columns, we need to drop the extra 2 (index and ID)

Remove rows from multiple dataframe that contain bad data

Say I have n dataframes, df1, df2...dfn.
Finding rows that contain "bad" values in a row in a given dataframe is done by e.g.,
index1 = df1[df1.isin([np.nan, np.inf, -np.inf])]
index2 = df2[df2.isin([np.nan, np.inf, -np.inf])]
Now, droping these bad rows in the bad dataframe is done with:
df1 = df1.replace([np.inf, -np.inf], np.nan).dropna()
df2 = df2.replace([np.inf, -np.inf], np.nan).dropna()
The problem is that any function that expects the two (n) dataframes columns to be of the same length may give an error if there is bad data in one df but not the other.
How do I drop not just the bad row from the offending dataframe, but the same row from a list of dataframes?
So in the two dataframe case, if in df1 date index 2009-10-09 contains a "bad" value, that same row in df2 will be dropped.
[Possible "ugly"? solution?]
I suspect that one way to do it is to merge the two (n) dataframes on date, then apply the cleanup function to drop "bad" values are automatic since the entire row gets dropped? But what happens if a date is missing from one dataframe and not the other? [and they still happen to be the same length?]
Doing your replace
df1 = df1.replace([np.inf, -np.inf], np.nan)
df2 = df2.replace([np.inf, -np.inf], np.nan)
Then, Here we using inner .
newdf=pd.concat([df1,df2],axis=1,keys=[1,2], join='inner').dropna()
And split it back to two dfs , here we using combine_first with dropna of original df
df1,df2=[s[1].loc[:,s[0]].combine_first(x.dropna()) for x,s in zip([df1,df2],newdf.groupby(level=0,axis=1))]