I am working on datasets (around 20k distinct records) and need to join two data frames on an identifier column, id_text:
df1.join(df2, df1.id_text == df2.id_text, "inner").select(df1['*'], df2['Name'].alias('DName'))
df1 has the following sample values in the identifier column id_text:
X North
Y South
Z West
Whereas df2 has the following sample values in its identifier column id_text:
North X
South Y
West Z
Logically, the differing values of id_text refer to the same records, just with the two tokens swapped. Hardcoding those values for thousands of records is not a feasible solution. Is there any way id_text in df2 can be modified to match df1?
You can use a column expression directly inside the join (it will not create an additional column). In this example, regexp_replace is used to swap the two tokens.
from pyspark.sql import functions as F
df1 = spark.createDataFrame([('X North', 1), ('Y South', 1), ('Z West', 1)], ['id_text', 'val1'])
df2 = spark.createDataFrame([('North X', 2), ('South Y', 2), ('West Z', 2)], ['id_text', 'Name'])
# df1 df2
# +-------+----+ +-------+----+
# |id_text|val1| |id_text|Name|
# +-------+----+ +-------+----+
# |X North| 1| |North X| 2|
# |Y South| 1| |South Y| 2|
# | Z West| 1| | West Z| 2|
# +-------+----+ +-------+----+
df = (df1
.join(df2, df1.id_text == F.regexp_replace(df2.id_text, r'(.+) (.+)', '$2 $1'), 'inner')
.select(df1['*'], df2.Name))
df.show()
# +-------+----+----+
# |id_text|val1|Name|
# +-------+----+----+
# |X North| 1| 2|
# |Y South| 1| 2|
# | Z West| 1| 2|
# +-------+----+----+
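If you prefer to normalize df2 up front instead of inside the join condition, a minimal sketch (assuming each id_text in df2 is just df1's space-separated tokens in reverse order) could use split, reverse and concat_ws:
from pyspark.sql import functions as F

# reverse the space-separated tokens of df2.id_text so it matches df1's ordering
df2_norm = df2.withColumn(
    'id_text', F.concat_ws(' ', F.reverse(F.split('id_text', ' ')))
)

df = (df1
    .join(df2_norm, df1.id_text == df2_norm.id_text, 'inner')
    .select(df1['*'], df2_norm.Name))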
I want to perform an operation in PySpark similar to what is possible in pandas.
My dataframe is:
A sample row (column: value):
Year: 2021
win_loss_date: 2021-03-08 00:00:00
Deal: 1-2JZONGU
L2 GFCID Name: TEST GFCID CREATION
L2 GFCID: P-1-P1DO
GFCID: P-1-P5O
GFCID Name: TEST GFCID CREATION
Client Priority: None
Location: UNITED STATES
Deal Location: UNITED STATES
Revenue: 4567.0000000
Deal Conclusion: Won
New/Rebid: New
In pandas, the code to pivot is:
df = pd.pivot_table(deal_df_pandas,
                    index=['GFCID', 'GFCID Name', 'Client Priority'],
                    columns=['New/Rebid', 'Year', 'Deal Conclusion'],
                    aggfunc={'Deal': 'count',
                             'Revenue': 'sum',
                             'Location': lambda x: set(x),
                             'Deal Location': lambda x: set(x)}).reset_index()
columns=['New/Rebid', 'Year', 'Deal Conclusion'] are the columns being pivoted.
The output I get, which is also what I expect:
GFCID GFCID Name Client Priority Deal Revenue
New/Rebid New Rebid New Rebid
Year 2020 2021 2020 2021 2020 2021 2020 2021
Deal Conclusion Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won Lost Won
0 0000000752 ARAMARK SERVICES INC Bronze NaN 1.0 1.0 2.0 NaN NaN NaN NaN NaN 1600000.0000000 20.0000000 20000.0000000 NaN NaN NaN NaN
What I want is to convert the above code to PySpark.
What I am trying is not working:
from pyspark.sql import functions as F
df_pivot2 = (df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid')
    .agg(F.first('Year'), F.first('Deal Conclusion'), F.count('Deal'), F.sum('Revenue')))
as this operation is not possible in PySpark:
(df_d1
    .groupby('GFCID', 'GFCID Name', 'Client Priority')
    .pivot('New/Rebid', 'Year', 'Deal Conclusion'))  # error: pivot cannot take multiple pivot columns
You can concatenate the multiple columns into a single column, which can then be used within pivot.
Consider the following example:
data_sdf.show()
# +---+-----+--------+--------+
# | id|state| time|expected|
# +---+-----+--------+--------+
# | 1| A|20220722| 1|
# | 1| A|20220723| 1|
# | 1| B|20220724| 2|
# | 2| B|20220722| 1|
# | 2| C|20220723| 2|
# | 2| B|20220724| 3|
# +---+-----+--------+--------+
from pyspark.sql import functions as func

data_sdf. \
    withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
    groupBy('id'). \
    pivot('pivot_col'). \
    agg(func.sum('expected')). \
    fillna(0). \
    show()
# +---+----------+----------+----------+----------+----------+
# | id|A_20220722|A_20220723|B_20220722|B_20220724|C_20220723|
# +---+----------+----------+----------+----------+----------+
# | 1| 1| 1| 0| 2| 0|
# | 2| 0| 0| 1| 3| 2|
# +---+----------+----------+----------+----------+----------+
The input dataframe had two fields to be pivoted, state and time. They were concatenated with a '_' delimiter and the resulting column was used within pivot. After that, you can use multiple aggregations within agg, per your requirements, as sketched below.
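For example, a sketch of the same pivot with two aggregations; the aliases sum_expected and cnt_expected are just illustrative, and Spark typically suffixes the pivoted column names with them (e.g. A_20220722_sum_expected):
data_sdf. \
    withColumn('pivot_col', func.concat_ws('_', 'state', 'time')). \
    groupBy('id'). \
    pivot('pivot_col'). \
    agg(func.sum('expected').alias('sum_expected'),
        func.count('expected').alias('cnt_expected')). \
    fillna(0). \
    show()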
I have a PySpark dataframe:
user_id  item_id  last_watch_dt  total_dur  watched_pct
1        1        2021-05-11     4250       72
1        2        2021-05-11     80         99
2        3        2021-05-11     1000       80
2        4        2021-05-11     5000       40
I used this code:
df_new = df.pivot(index='user_id', columns='item_id', values='watched_pct')
To get this:
user_id   1    2    3    4
1         72   99   0    0
2         0    0    80   40
But I got an error:
AttributeError: 'DataFrame' object has no attribute 'pivot'
What did I do wrong?
You can only call .pivot on objects that have a pivot attribute (method or property). You tried df.pivot, so it would only work if df had such an attribute. If you inspect the attributes of df (an object of the pyspark.sql.DataFrame class) in the API documentation, you will see many attributes, but none of them is called pivot. That's why you get an AttributeError.
pivot is a method of the pyspark.sql.GroupedData object. This means that, in order to use it, you must first create a pyspark.sql.GroupedData object from your pyspark.sql.DataFrame. In your case, that is done with .groupBy():
df.groupBy("user_id").pivot("item_id")
This creates yet another pyspark.sql.GroupedData object. In order to turn it into a dataframe, you need one of the methods of the GroupedData class; agg is the one you want. Inside it, you provide a Spark aggregation function that will be applied to all the grouped elements (e.g. sum, first, etc.).
df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1, 1, '2021-05-11', 4250, 72),
(1, 2, '2021-05-11', 80, 99),
(2, 3, '2021-05-11', 1000, 80),
(2, 4, '2021-05-11', 5000, 40)],
['user_id', 'item_id', 'last_watch_dt', 'total_dur', 'watched_pct'])
df = df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
df.show()
# +-------+----+----+----+----+
# |user_id| 1| 2| 3| 4|
# +-------+----+----+----+----+
# | 1| 72| 99|null|null|
# | 2|null|null| 80| 40|
# +-------+----+----+----+----+
If you want to replace nulls with 0, use fillna of pyspark.sql.DataFrame class.
df = df.fillna(0)
df.show()
# +-------+---+---+---+---+
# |user_id| 1| 2| 3| 4|
# +-------+---+---+---+---+
# | 1| 72| 99| 0| 0|
# | 2| 0| 0| 80| 40|
# +-------+---+---+---+---+
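As a side note, if the distinct item_id values are known up front, they can be passed to pivot explicitly, which saves Spark a pass over the data to discover them. A sketch, assuming df is the original input dataframe (before it was overwritten by the pivot above):
df_pivoted = (df
    .groupBy("user_id")
    .pivot("item_id", [1, 2, 3, 4])   # explicit pivot values, assumed known in advance
    .agg(F.sum("watched_pct"))
    .fillna(0))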
I want to create a single column by concatenating a number of columns, but in a dictionary (map) format, in PySpark.
I have concatenated the data into a single column, but I am unable to store it in a dictionary format.
Let me know if you need more information.
In your current situation, you can use str_to_map:
from pyspark.sql import functions as F
df = spark.createDataFrame([("datatype:0,length:1",)], ['region_validation_check_status'])
df = df.withColumn(
'region_validation_check_status',
F.expr("str_to_map(region_validation_check_status, ',')")
)
df.show(truncate=0)
# +------------------------------+
# |region_validation_check_status|
# +------------------------------+
# |{datatype -> 0, length -> 1} |
# +------------------------------+
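If the key-value separator were something other than ':' (say '='), str_to_map also accepts a third argument for it. A small sketch with hypothetical data:
from pyspark.sql import functions as F

df = spark.createDataFrame([("datatype=0;length=1",)], ['region_validation_check_status'])
df = df.withColumn(
    'region_validation_check_status',
    F.expr("str_to_map(region_validation_check_status, ';', '=')")  # pair delimiter ';', key-value delimiter '='
)
# yields {datatype -> 0, length -> 1} as before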
If you don't have such a string yet, you can build the map from column values with to_json and from_json:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])
df.show()
# +---+---+
# | a| b|
# +---+---+
# | 1| 2|
# | 3| 4|
# +---+---+
df = df.select(
F.from_json(F.to_json(F.struct('a', 'b')), 'map<string, int>')
)
df.show()
# +----------------+
# | entries|
# +----------------+
# |{a -> 1, b -> 2}|
# |{a -> 3, b -> 4}|
# +----------------+
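Another option, if you just want a map column built from specific columns without going through JSON, is create_map. A sketch for the same two columns:
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])
df = df.select(
    F.create_map(F.lit('a'), F.col('a'), F.lit('b'), F.col('b')).alias('entries')
)
# rows look like {a -> 1, b -> 2}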
Suppose I have a PySpark data frame as follows:
col1 col2 col3
1 2 -3
2 null 5
4 4 8
1 0 9
I want to add a column called check that counts, for each row, the number of values greater than 0.
The final output will be:
col1 col2 col3 check
1 2 -3 2
2 null 5 2
4 4 8 3
1 0 9 2
I tried this, but it didn't help and errors out as below:
df= df.withColumn("check", sum((df[col] > 0) for col in df.columns))
Invalid argument, not a string or column: <generator object
at 0x7f0a866ae580> of type <class 'generator'>. For column literals,
use 'lit', 'array', 'struct' or 'create_map' function.
I don't know if there is a simpler SQL-based solution, but it's pretty straightforward with a UDF.
from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType
count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()
I'm not sure it handles nulls; if needed, add a null check (if a and a > 0) in the UDF (see the null-safe sketch below).
Idea: https://stackoverflow.com/a/42540401/496289
Your code shows you doing a sum of the non-zero columns, not a count. If you need a sum, then:
count_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())
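For completeness, a minimal null-safe sketch of the counting UDF, with the imports spelled out:
from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType

# skip nulls explicitly before comparing to 0
count_udf = udf(lambda arr: sum(1 for a in arr if a is not None and a > 0), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()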
Create a new array column, filter it, and finally count the elements in the filtered column.
Example:
df.show(10,False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1 |2 |-3 |
#|2 |null|5 |
#+----+----+----+
df.withColumn("check",expr("size(filter(array(col1,col2), x -> x > 0))")).show(10,False)
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|1 |2 |-3 |2 |
#|2 |null|5 |1 |
#+----+----+----+-----+
You can use functools.reduce to add up, over df.columns, an indicator that is 1 when the value is > 0, like this:
from pyspark.sql import functions as F
from operator import add
from functools import reduce
df = spark.createDataFrame([
(1, 2, -3), (2, None, 5), (4, 4, 8), (1, 0, 9)
], ["col1", "col2", "col3"])
df = df.withColumn(
"check",
reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns])
)
df.show()
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#| 1| 2| -3| 2|
#| 2|null| 5| 2|
#| 4| 4| 8| 3|
#| 1| 0| 9| 2|
#+----+----+----+-----+
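Because when(...).otherwise(0) maps both nulls and non-positive values to 0, nulls never break the addition. If you only want to test a subset of columns (or re-run this after check already exists), pass an explicit list instead of df.columns; a sketch with a hypothetical cols_to_check list:
from pyspark.sql import functions as F
from operator import add
from functools import reduce

cols_to_check = ["col1", "col2", "col3"]  # hypothetical: only the columns to test, so 'check' itself is excluded on re-runs
df = df.withColumn(
    "check",
    reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in cols_to_check])
)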
I have a dataframe with dates in the format YYYYMM.
These start from 201801.
I now want to add a column where 201801 = 1, 201802 = 2, and so on, up to the most recent month, which is updated every month.
Kind regards,
wokter
months_between can be used:
from pyspark.sql import functions as F
from pyspark.sql import types as T
#some testdata
data = [
[201801],
[201802],
[201804],
[201812],
[202001],
[202010]
]
df = spark.createDataFrame(data, schema=["yyyymm"])
df.withColumn("months", F.months_between(
F.to_date(F.col("yyyymm").cast(T.StringType()), "yyyyMM"), F.lit("2017-12-01")
).cast(T.IntegerType())).show()
Output:
+------+------+
|yyyymm|months|
+------+------+
|201801| 1|
|201802| 2|
|201804| 4|
|201812| 12|
|202001| 25|
|202010| 34|
+------+------+
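An alternative that avoids date parsing altogether (a sketch assuming yyyymm stays numeric and months are counted from 201801 = 1) is plain integer arithmetic:
from pyspark.sql import functions as F

df.withColumn(
    "months",
    # (year - 2018) * 12 + month, where year = yyyymm // 100 and month = yyyymm % 100
    ((F.floor(F.col("yyyymm") / 100) - 2018) * 12 + F.col("yyyymm") % 100).cast("int")
).show()
# gives the same 1, 2, 4, 12, 25, 34 as above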