Insert data into a single column, in dictionary format, after concatenating a few columns of a dataframe

I want to create a single column in PySpark by concatenating a number of columns, but the result should be stored in dictionary format.
I have concatenated the data into a single column, but I am unable to store it in dictionary format.
Please see the attached screenshot for more details.
Let me know if you need more information.

In your current situation, you can use str_to_map:
from pyspark.sql import functions as F
df = spark.createDataFrame([("datatype:0,length:1",)], ['region_validation_check_status'])
df = df.withColumn(
    'region_validation_check_status',
    F.expr("str_to_map(region_validation_check_status, ',')")
)
df.show(truncate=0)
# +------------------------------+
# |region_validation_check_status|
# +------------------------------+
# |{datatype -> 0, length -> 1}  |
# +------------------------------+
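str_to_map also takes explicit delimiters (a pair delimiter and a key-value delimiter), which is useful if your string is not in the key:value,key:value shape. A minimal sketch under that assumption, with a hypothetical input that uses ';' and '=':
from pyspark.sql import functions as F

# hypothetical input: pairs separated by ';', keys and values separated by '='
df2 = spark.createDataFrame([("datatype=0;length=1",)], ['status'])
df2 = df2.withColumn('status', F.expr("str_to_map(status, ';', '=')"))
df2.show(truncate=0)
# expected: {datatype -> 0, length -> 1}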
If you didn't have a string yet, you could build it from column values with to_json and from_json:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])
df.show()
# +---+---+
# |  a|  b|
# +---+---+
# |  1|  2|
# |  3|  4|
# +---+---+
df = df.select(
    F.from_json(F.to_json(F.struct('a', 'b')), 'map<string, int>')
)
df.show()
# +----------------+
# |         entries|
# +----------------+
# |{a -> 1, b -> 2}|
# |{a -> 3, b -> 4}|
# +----------------+
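If you already know the column names, another route (not from the original answer, just a sketch) is to build the map directly with create_map and skip the JSON round trip:
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])

# create_map takes alternating key and value columns
df = df.select(
    F.create_map(F.lit('a'), F.col('a'), F.lit('b'), F.col('b')).alias('entries')
)
df.show()
# expected rows: {a -> 1, b -> 2} and {a -> 3, b -> 4}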

Related

AttributeError: 'DataFrame' object has no attribute 'pivot'

I have a PySpark dataframe:
user_id  item_id  last_watch_dt  total_dur  watched_pct
1        1        2021-05-11     4250       72
1        2        2021-05-11     80         99
2        3        2021-05-11     1000       80
2        4        2021-05-11     5000       40
I used this code:
df_new = df.pivot(index='user_id', columns='item_id', values='watched_pct')
To get this:
user_id  1   2   3   4
1        72  99  0   0
2        0   0   80  40
But I got an error:
AttributeError: 'DataFrame' object has no attribute 'pivot'
What did I do wrong?
You can only do .pivot on objects having a pivot attribute (method or property). You tried to do df.pivot, so it would only work if df had such an attribute. You can inspect all the attributes of df (it's an object of the pyspark.sql.DataFrame class) in the API documentation. You see many attributes there, but none of them is called pivot. That's why you get an attribute error.
pivot is a method of the pyspark.sql.GroupedData object. That means, in order to use it, you must first create a pyspark.sql.GroupedData object from your pyspark.sql.DataFrame object. In your case, that's done with .groupBy():
df.groupBy("user_id").pivot("item_id")
This creates yet another pyspark.sql.GroupedData object. In order to make a dataframe out of it, you want to use one of the methods of the GroupedData class. agg is the method that you need. Inside it, you have to provide the Spark aggregation function that will be applied to all the grouped elements (e.g. sum, first, etc.).
df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 1, '2021-05-11', 4250, 72),
     (1, 2, '2021-05-11', 80, 99),
     (2, 3, '2021-05-11', 1000, 80),
     (2, 4, '2021-05-11', 5000, 40)],
    ['user_id', 'item_id', 'last_watch_dt', 'total_dur', 'watched_pct'])
df = df.groupBy("user_id").pivot("item_id").agg(F.sum("watched_pct"))
df.show()
# +-------+----+----+----+----+
# |user_id|   1|   2|   3|   4|
# +-------+----+----+----+----+
# |      1|  72|  99|null|null|
# |      2|null|null|  80|  40|
# +-------+----+----+----+----+
If you want to replace nulls with 0, use fillna of pyspark.sql.DataFrame class.
df = df.fillna(0)
df.show()
# +-------+---+---+---+---+
# |user_id|  1|  2|  3|  4|
# +-------+---+---+---+---+
# |      1| 72| 99|  0|  0|
# |      2|  0|  0| 80| 40|
# +-------+---+---+---+---+
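As a side note (not part of the original answer): pivot also accepts an explicit list of pivot values, which saves Spark an extra pass over the data to discover the distinct item_id values. A minimal sketch, where df_original stands for the un-pivoted dataframe from the question:
from pyspark.sql import functions as F

# passing the pivot values explicitly avoids a job that computes distinct item_ids
df_new = (df_original
          .groupBy("user_id")
          .pivot("item_id", [1, 2, 3, 4])
          .agg(F.sum("watched_pct"))
          .fillna(0))
df_new.show()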

Modify key column to match the join condition

I am working on datasets (having 20k distinct records) and joining two data frames based on an identifier column id_text:
df1.join(df2, df1.id_text == df2.id_text, "inner").select(df1['*'], df2['Name'].alias('DName'))
df1 has the following sample values in the identifier column id_text:
X North
Y South
Z West
Whereas df2 has the following sample values in the identifier column id_text:
North X
South Y
West Z
Logically, the different values for id_text are correct. Hardcoding those values for 10k records is not a feasible solution. Is there any way id_text can be modified for df2 to be the same as in df1?
You can use a column expression directly inside join (it will not create an additional column). In this example, I used regexp_replace to switch places of both elements.
from pyspark.sql import functions as F
df1 = spark.createDataFrame([('X North', 1), ('Y South', 1), ('Z West', 1)], ['id_text', 'val1'])
df2 = spark.createDataFrame([('North X', 2), ('South Y', 2), ('West Z', 2)], ['id_text', 'Name'])
# df1                df2
# +-------+----+     +-------+----+
# |id_text|val1|     |id_text|Name|
# +-------+----+     +-------+----+
# |X North|   1|     |North X|   2|
# |Y South|   1|     |South Y|   2|
# | Z West|   1|     | West Z|   2|
# +-------+----+     +-------+----+
df = (df1
      .join(df2, df1.id_text == F.regexp_replace(df2.id_text, r'(.+) (.+)', '$2 $1'), 'inner')
      .select(df1['*'], df2.Name))
df.show()
# +-------+----+----+
# |id_text|val1|Name|
# +-------+----+----+
# |X North|   1|   2|
# |Y South|   1|   2|
# | Z West|   1|   2|
# +-------+----+----+
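An alternative (a sketch, not from the original answer) that avoids regular expressions: split df2.id_text on the space, reverse the resulting array, and glue it back together. This assumes every value is exactly two space-separated tokens:
from pyspark.sql import functions as F

# 'North X' -> ['North', 'X'] -> ['X', 'North'] -> 'X North'
df2_fixed = df2.withColumn(
    'id_text',
    F.concat_ws(' ', F.reverse(F.split(F.col('id_text'), ' ')))
)
df = (df1
      .join(df2_fixed, df1.id_text == df2_fixed.id_text, 'inner')
      .select(df1['*'], df2_fixed.Name))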

How do I create a new column that has the count of all the row values that are greater than 0 in PySpark?

Suppose I have a PySpark data frame like this:
col1  col2  col3
1     2     -3
2     null  5
4     4     8
1     0     9
I want to add a column called check that counts the number of values that are greater than 0.
The final output will be:
col1  col2  col3  check
1     2     -3    2
2     null  5     2
4     4     8     3
1     0     9     2
I was trying this, but it didn't help and errors out as shown below:
df= df.withColumn("check", sum((df[col] > 0) for col in df.columns))
Invalid argument, not a string or column: <generator object
at 0x7f0a866ae580> of type <class 'generator'>. For column literals,
use 'lit', 'array', 'struct' or 'create_map' function.
Don't know if there is a simpler SQL-based solution, but it's pretty straightforward with a udf.
from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType

count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()
Not sure if it'll handle nulls. Add a null check (if a and a > 0) in the udf if needed.
Idea: https://stackoverflow.com/a/42540401/496289
Your code shows you doing a sum, not a count. If you actually need the sum of the positive values, then:
count_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())
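For completeness, a null-safe variant of the counting udf along the lines suggested above (just a sketch, not part of the original answer):
from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType

# skip nulls explicitly before comparing, so rows containing null don't break the comparison
count_udf = udf(lambda arr: sum(1 for a in arr if a is not None and a > 0), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()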
Create a new array column, filter it, and finally count the elements of the filtered array.
Example:
from pyspark.sql.functions import expr

df.show(10, False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1   |2   |-3  |
#|2   |null|5   |
#+----+----+----+

df.withColumn("check", expr("size(filter(array(col1,col2,col3), x -> x > 0))")).show(10, False)
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|1   |2   |-3  |2    |
#|2   |null|5   |2    |
#+----+----+----+-----+
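The same idea can also be written with the aggregate higher-order function instead of size(filter(...)) (just a sketch, not from the original answer); if(x > 0, 1, 0) counts nulls as 0:
from pyspark.sql.functions import expr

# fold over the array, adding 1 for every element greater than 0
df.withColumn(
    "check",
    expr("aggregate(array(col1, col2, col3), 0, (acc, x) -> acc + if(x > 0, 1, 0))")
).show()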
You can use functools.reduce to add up, over every column in df.columns, an indicator of whether the value is > 0, like this:
from pyspark.sql import functions as F
from operator import add
from functools import reduce
df = spark.createDataFrame([
    (1, 2, -3), (2, None, 5), (4, 4, 8), (1, 0, 9)
], ["col1", "col2", "col3"])
df = df.withColumn(
    "check",
    reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns])
)
df.show()
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|   1|   2|  -3|    2|
#|   2|null|   5|    2|
#|   4|   4|   8|    3|
#|   1|   0|   9|    2|
#+----+----+----+-----+
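Equivalently (a sketch, not from the original answer), Python's built-in sum can replace reduce(add, ...) if you give it a Column as the start value:
from pyspark.sql import functions as F

# the start value F.lit(0) makes the built-in sum accumulate Column expressions
df = df.withColumn(
    "check",
    sum([F.when(F.col(c) > 0, 1).otherwise(0) for c in ["col1", "col2", "col3"]], F.lit(0))
)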

How to check if a column only contains certain letters

I have a dataframe and I want to check whether one column contains only the letter A, for example.
The column contains a lot of letters. It looks like:
AAAAAAAAAAAAAAAA
AAABBBBBDBBSBSBB
I want to check if this column contains only the letter A, or only the letters A and B, but nothing else.
Do you know which function I should use?
Try this: I have considered four sample strings of letters. We can use the rlike function in Spark with the regex [^AB]. It returns true for column values containing letters other than A or B, and false for values made up only of A, only of B, or of both. We can then filter for false, and that will be your answer.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()
li = [[("AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB")], [("AAAAAAAAA")],[("BBBBBBBB")], [("AAAAAABBBBBBBB")]]
df = spark.createDataFrame(li, ["letter"])
df.show(truncate=False)
# +--------------------------------+
# |letter                          |
# +--------------------------------+
# |AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB|
# |AAAAAAAAA                       |
# |BBBBBBBB                        |
# |AAAAAABBBBBBBB                  |
# +--------------------------------+
df1 = df.withColumn("contains_A_or_B", F.col('letter').rlike("[^AB]"))
df1.show(truncate=False)
# +--------------------------------+---------------+
# |letter                          |contains_A_or_B|
# +--------------------------------+---------------+
# |AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB|true           |
# |AAAAAAAAA                       |false          |
# |BBBBBBBB                        |false          |
# |AAAAAABBBBBBBB                  |false          |
# +--------------------------------+---------------+
df1.filter(F.col('contains_A_or_B') == False).select("letter").show()
# +--------------+
# |        letter|
# +--------------+
# |     AAAAAAAAA|
# |      BBBBBBBB|
# |AAAAAABBBBBBBB|
# +--------------+
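A more direct variant (a sketch, not from the original answer) is to anchor the regex and match positively, so the filter condition itself reads "contains only A or B":
from pyspark.sql import functions as F

# ^[AB]+$ matches non-empty strings made up exclusively of the characters A and B
df.filter(F.col('letter').rlike('^[AB]+$')).show()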
Use rlike.
Example from the official documentation:
df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name='Alice')]
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regex#pyspark.sql.Column.rlike

fill na with random numbers in Pyspark

I'm using a PySpark DataFrame.
I'd like to update the NA values in the Age column with a random value in the range 14 to 46.
How can I do it?
Mara's answer is correct if you would like to replace the null values with the same random number, but if you'd like a different random value for each row, you should use coalesce and F.rand() as illustrated below:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from random import randint
df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
df = (df
      .withColumn("x4", F.lit(None).cast(IntegerType()))
      .withColumn("x5", F.lit(None).cast(IntegerType()))
)
df.na.fill({'x4': randint(0, 100)}).show()
df.withColumn('x5', F.coalesce(F.col('x5'), (F.round(F.rand()*100)))).show()
+---+---+-----+---+----+
| x1| x2|   x3| x4|  x5|
+---+---+-----+---+----+
|  1|  a| 23.0|  9|null|
|  3|  B|-23.0|  9|null|
+---+---+-----+---+----+
+---+---+-----+----+----+
| x1| x2|   x3|  x4|  x5|
+---+---+-----+----+----+
|  1|  a| 23.0|null|44.0|
|  3|  B|-23.0|null| 2.0|
+---+---+-----+----+----+
The randint function is what you need: it generates a random integer between two numbers. Apply it in the Spark fillna function for the 'age' column.
from random import randint
df.fillna(randint(14, 46), 'age').show()
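Note that randint is evaluated once on the driver, so every null gets the same value. If you want a different random value per row within the 14 to 46 range, a sketch combining the two answers above (assuming the column is called Age):
from pyspark.sql import functions as F

# F.rand() returns a per-row value in [0, 1); scale it into [14, 46] and cast to an integer
df = df.withColumn(
    'Age',
    F.coalesce(F.col('Age'), F.floor(F.rand() * (46 - 14 + 1) + 14).cast('int'))
)
df.show()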