Join two PySpark DataFrames and get some of the columns from one DataFrame when column names are similar

I want to join two PySpark DataFrames: all columns from the first DataFrame, but only some columns from the second. The catch is that both DataFrames have a column with the same name.
Sample DataFrames:
# Prepare Data
data_1 = [
    (1, "Italy", "Europe"),
    (2, "Italy", "Europe"),
    (3, "Germany", None),
    (4, "Iran", "Asia"),
    (5, "China", "Asia"),
    (6, "China", None),
    (7, "Japan", "Asia"),
    (8, "France", None),
]
# Create DataFrame
columns = ["Code", "Country", "Continent"]
df_1 = spark.createDataFrame(data=data_1, schema=columns)
df_1.show(truncate=False)
# Prepare Data
data_2 = [
    (1, "Italy", "EUR", 11),
    (2, "Germany", "EUR", 12),
    (3, "China", "CNY", 13),
    (4, "Japan", "JPY", 14),
    (5, "France", "EUR", 15),
    (6, "Taiwan", "TWD", 16),
    (7, "USA", "USD", 17),
    (8, "India", "INR", 18),
]
# Create DataFrame
columns = ["Code", "Country", "Currency", "Sales"]
df_2 = spark.createDataFrame(data=data_2, schema=columns)
df_2.show(truncate=False)
I want all columns of the first DataFrame and only the column "Currency" from the second DataFrame.
When I use a left join:
output = df_1.join(df_2, ["Country"], "left")
output.show()
Now there are two columns named "Code" after the join operation.
Dropping the unwanted columns:
output = df_1.join(df_2, ["Country"], "left").drop('Code', 'Sales')
output.show()
Both columns named "Code" are dropped, but I want to keep the "Code" column from the first DataFrame.
Any idea how to solve this issue?
Another question: how can I make "Code" the left-most column in the resulting DataFrame after the join?

If you don't need the other columns from df_2, you can select just the ones you need before the join:
output = df_1.join(
    df_2.select('Country', 'Currency'),
    ['Country'], 'left'
)
Note that you can also disambiguate two columns with the same name by specifying the DataFrame they come from, e.g. df_1['Code']. So in your case, after the join, instead of using drop, you could use that to keep only the columns from df_1 plus the Currency column:
output = df_1\
    .join(df_2, ['Country'], 'left')\
    .select([df_1[c] for c in df_1.columns] + ['Currency'])
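For the follow-up question about making "Code" the left-most column, a plain select with the desired column order is enough. A minimal sketch, assuming the output DataFrame from the snippet above:
# Put "Code" first, followed by the remaining columns in their current order.
output = output.select(['Code'] + [c for c in output.columns if c != 'Code'])
output.show()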

Related

How to join two data frames using regexp_replace

I want to join two DataFrames, removing the matching records based on the column cust_id. Some of the cust_id values have leading zeros, so I need to strip the zeros in the 'ON' clause. I tried the query below, but it gives an error in a Databricks notebook.
PS: I don't want to create another DF1 with zeros removed.
Query:
df1 = df1.join(df2,[regexp_replace("cust_id", r'^[0]*','')], "leftanti")
py4j.Py4JException: Method and([class java.lang.String]) does not exist
The following works. Note, however, that the output you provided cannot be reached with a "leftanti" join: 'S15' matches 'S15' in the other table, so that row is removed too. With the example data you provided, the "leftanti" join returns no rows.
from pyspark.sql import functions as F

df1 = spark.createDataFrame(
    [(1, 'S15', 'AAA'),
     (2, '00767', 'BBB'),
     (3, '03246', 'CCC')],
    ['ID', 'cust_id', 'Name'])
df2 = spark.createDataFrame(
    [(1, 'S15', 'AAA'),
     (2, '767', 'BBB'),
     (3, '3246', 'CCC')],
    ['ID', 'cust_id', 'Name'])
df = df1.join(df2, df2.cust_id == F.regexp_replace(df1.cust_id, r'^0*', ''), "leftanti")
df.show()
# +---+-------+----+
# | ID|cust_id|Name|
# +---+-------+----+
# +---+-------+----+
No need for the square brackets [ ]; the join condition can be a plain Column expression:
df1.join(df2, F.regexp_replace(df1["cust_id"], r'^0*', '') == df2["cust_id"], "leftanti")
See the documentation for regexp_replace.
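If leading zeros can appear on either side, the condition can strip them from both columns. A minimal sketch, assuming the df1 and df2 defined above (this variant is not from the original answers):
from pyspark.sql import functions as F

# Strip leading zeros from both sides, so '00767' and '767' match
# no matter which table carries the zeros.
cond = (F.regexp_replace(df1.cust_id, r'^0*', '')
        == F.regexp_replace(df2.cust_id, r'^0*', ''))
df = df1.join(df2, cond, "leftanti")
df.show()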

Multiple calculations over spark dataframe in one pass

I need to make several computations over a data frame: min and max over column A, and the distinct values over column B. What is the most efficient way to do that? Is it possible to do it in one pass?
val df = sc.parallelize(Seq(
  (1, "John"),
  (2, "John"),
  (3, "Dave")
)).toDF("A", "B")
If by one pass you mean inside a single statement, you can do it as below:
from pyspark.sql import functions as F

sparkDF = spark.createDataFrame(
    [(1, "John"), (2, "John"), (3, "Dave")],
    ["A", "B"])
sparkDF.select([F.max(F.col('A')), F.min(F.col('A')), F.countDistinct(F.col('B'))]).show()
Additionally, you could provide sample data and the expected output you are looking for, to better gauge the question.
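Equivalently, agg expresses the same single-pass aggregation. A minimal sketch, assuming the sparkDF and import above (the alias names are arbitrary):
# agg on an ungrouped DataFrame computes all three aggregates in one job.
sparkDF.agg(
    F.max('A').alias('max_a'),
    F.min('A').alias('min_a'),
    F.countDistinct('B').alias('distinct_b')
).show()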

Why does Spark use an ordered schema for DataFrames?

I wonder why Spark uses an ordered schema for DataFrames rather than a name-based schema, where two schemas would be considered the same if, for each column name, they have the same type.
My first question: what is the advantage of ordering the columns in the schema? Does it make some operations on the DataFrame faster?
My second question: can I tell Spark that the order of columns does not matter to me, and that two schemas are the same if the unordered sets of column names and types are the same?
Spark DataFrames are not relational database tables. An ordered schema saves time for certain types of processing; union, for example, resolves columns by position rather than by name. So it's an implementation detail.
You therefore cannot assume that ordering does not matter to Spark. See the union of the two DataFrames below, which have the same column names in different orders:
val df2 = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "animal", "talk")
val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "talk", "animal")
val df3 = df.union(df2)
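That said, for the second question Spark does offer a name-based union: unionByName (available since Spark 2.3). A minimal sketch, shown here in PySpark:
# unionByName resolves columns by name, so differing column order
# between the two inputs no longer matters.
df_a = spark.createDataFrame([(1, "bat", "done")], ["id", "animal", "talk"])
df_b = spark.createDataFrame([(2, "mone", "mouse")], ["id", "talk", "animal"])
df_a.unionByName(df_b).show()  # columns aligned by name, not position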
Note that with JSON schema inference the columns come out alphabetical. That, to me, is very handy.

Plotting a multi-index dataframe with Altair

I have a dataframe which looks like:
import pandas as pd

data = {'ColA': {('A', 'A-1'): 0, ('A', 'A-2'): 1, ('A', 'A-3'): 1,
                 ('B', 'B-1'): 2, ('B', 'B-2'): 2, ('B', 'B-3'): 0,
                 ('C', 'C-1'): 1, ('C', 'C-2'): 2, ('C', 'C-3'): 2,
                 ('C', 'C-4'): 3},
        'ColB': {('A', 'A-1'): 3, ('A', 'A-2'): 1, ('A', 'A-3'): 1,
                 ('B', 'B-1'): 0, ('B', 'B-2'): 2, ('B', 'B-3'): 2,
                 ('C', 'C-1'): 2, ('C', 'C-2'): 0, ('C', 'C-3'): 3,
                 ('C', 'C-4'): 1}}
df = pd.DataFrame(data)
The values for every column are either 0, 1, 2, or 3. These values could just as easily be 'U', 'Q', 'R', or 'Z' ... i.e. there is nothing inherently numeric about them.
I would like to use Altair for this.
First set of charts:
I would like to get one bar chart per column.
The labels for the X-axis should be based on the unique values in the columns. The Y-axis should be the count of the unique values in the column.
Second set of charts:
Similar to the first set, I would like to get one bar chart per row.
The labels for the X-axis should be based on the unique values in the row. The Y-axis should be the count of the unique values in the row.
This should be easy, but I am not sure how to do it.
All of Altair's APIs are column-based, and ignore indices unless you explicitly include them (see Including Index Data in Altair's documentation).
For the first set of charts (one bar chart per column) you can do this:
import altair as alt

alt.Chart(df.reset_index()).mark_bar().encode(
    alt.X(alt.repeat(), type='nominal'),
    y='count()'
).repeat(['ColA', 'ColB'])
For the second set of charts (one bar chart per row) you can do something like this:
df_transposed = df.reset_index(0, drop=True).T
alt.Chart(df_transposed).mark_bar().encode(
    alt.X(alt.repeat(), type='nominal'),
    y='count()'
).repeat(list(df_transposed.columns), columns=5)
Though this is a bit of a strange visualization, so I suspect I'm misunderstanding what you're after: your data has ten rows, so one chart per row means ten charts.
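As an aside, the first set of charts can also be produced without repeat by reshaping the data to long form. A sketch, assuming the df defined in the question (the names 'column' and 'value' are arbitrary):
import altair as alt

# Melt to long form, one row per (column, value) pair, then facet by column.
df_long = df.reset_index(drop=True).melt(var_name='column', value_name='value')
alt.Chart(df_long).mark_bar().encode(
    alt.X('value:N'),
    y='count()'
).facet(column='column')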

Error while trying to perform isin while iterating

I have two DataFrames. I am trying to take the first value in the 'name' column of one DataFrame and do an isin check with it against the 'name' column of the other DataFrame. I am doing it this way because, if the isin check is true, I want to get the corresponding Age and match on that too; if that also matches, I want to get the corresponding City and match on it as well.
But I am getting the error below.
"TypeError: only list-like objects are allowed to be passed to isin(), you passed a [str]"
If I just print row['name'], I get the first name's value, so why is the isin check not working? What am I missing here?
import pandas as pd

Df1 = pd.DataFrame({'name': ['Marc', 'Jake', 'Sam', 'Brad'],
                    'Age': ['24', '25', '26', '27'],
                    'City': ['Agra', 'Bangalore', 'Calcutta', 'Delhi']})
Df2 = pd.DataFrame({'name': ['Jake', 'John', 'Marc', 'Tony', 'Bob', 'Marc'],
                    'Age': ['25', '25', '24', '28', '29', '39'],
                    'City': ['Bangalore', 'Chennai', 'Agra', 'Delhi', 'Pune', 'zoo']})
for index, row in Df1.iterrows():
    if Df2.name.isin(row['name']) == True:
        print('present')
The problem is that isin needs a list-like, so one possible solution is to create a one-element list, combined with Series.any to test whether at least one value matches (at least one True):
for index, row in Df1.iterrows():
    if Df2.name.isin([row['name']]).any():
        print('present')
Or compare by Series.eq:
for index, row in Df1.iterrows():
    if Df2.name.eq(row['name']).any():
        print('present')
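If the goal is only to flag which Df1 names appear in Df2, the loop can be avoided entirely. A minimal sketch, not from the original answer:
# Vectorized membership test: one boolean per Df1 row.
present = Df1['name'].isin(Df2['name'])
print(Df1[present])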