I'm using Pyspark DataFrame.
I'd like to update NA values in Age column with a random value in the range 14 to 46.
How can I do it?
Mara's answer is correct if you would like to replace the null values with the same random number, but if you'd like a different random value for each row, you should use coalesce together with F.rand(), as illustrated below:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from random import randint
df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))
df = (df
    .withColumn("x4", F.lit(None).cast(IntegerType()))
    .withColumn("x5", F.lit(None).cast(IntegerType()))
)
df.na.fill({'x4':randint(0,100)}).show()
df.withColumn('x5', F.coalesce(F.col('x5'), (F.round(F.rand()*100)))).show()
+---+---+-----+---+----+
| x1| x2|   x3| x4|  x5|
+---+---+-----+---+----+
|  1|  a| 23.0|  9|null|
|  3|  B|-23.0|  9|null|
+---+---+-----+---+----+
+---+---+-----+----+----+
| x1| x2|   x3|  x4|  x5|
+---+---+-----+----+----+
|  1|  a| 23.0|null|44.0|
|  3|  B|-23.0|null| 2.0|
+---+---+-----+----+----+
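Applied to the original question, here is a minimal sketch of the same coalesce/rand pattern for an 'age' column (this assumes your DataFrame actually has an 'age' column; the 14-46 bounds are taken from the question):
from pyspark.sql import functions as F

# Replace nulls in 'age' with a per-row random integer in [14, 46].
# F.rand() is uniform on [0, 1); scaling by 33 and flooring maps it
# onto the integers 14..46.
df = df.withColumn(
    "age",
    F.coalesce(F.col("age"), F.floor(F.rand() * (46 - 14 + 1) + 14).cast("int"))
)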
The randint function is what you need: it generates a random integer between two numbers. Pass it to Spark's fillna function for the 'age' column. Note that randint is evaluated once on the driver, so every null in the column gets the same value.
from random import randint
df.fillna(randint(14, 46), 'age').show()
Suppose I have a pyspark data frame as:
col1 col2 col3
1 2 -3
2 null 5
4 4 8
1 0 9
I want to add a column called check where it counts the number of values that are greater than 0.
The final output will be:
col1 col2 col3 check
1 2 -3 2
2 null 5 2
4 4 8 3
1 0 9 2
I was trying this, but it didn't help and errors out as shown below:
df = df.withColumn("check", sum((df[col] > 0) for col in df.columns))
Invalid argument, not a string or column: <generator object
at 0x7f0a866ae580> of type <class 'generator'>. For column literals,
use 'lit', 'array', 'struct' or 'create_map' function.
I don't know if there is a simpler SQL-based solution, but it's pretty straightforward with a udf.
from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType
count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()
As written it won't handle nulls (comparing None > 0 raises a TypeError in Python 3), so add a null check (if a is not None and a > 0) in the udf if needed.
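For reference, a minimal null-safe variant of that udf (just a sketch, using the same column names as above):
from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType

# Count only values that are present (not None) and greater than 0.
count_udf = udf(lambda arr: sum(1 for a in arr if a is not None and a > 0), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()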
Idea: https://stackoverflow.com/a/42540401/496289
If what you actually need is the sum of the positive values rather than their count, use:
count_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())
Create an array column, filter the newly created column, and finally count the elements in the filtered array.
Example:
df.show(10,False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1   |2   |-3  |
#|2   |null|5   |
#+----+----+----+
df.withColumn("check",expr("size(filter(array(col1,col2), x -> x > 0))")).show(10,False)
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|1 |2 |-3 |2 |
#|2 |null|5 |1 |
#+----+----+----+-----+
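If the column list is long or changes, the same size(filter(...)) expression can be built from df.columns instead of being typed out by hand; a small sketch, assuming every column of df should be counted (as in the question):
from pyspark.sql import functions as F

# Build "size(filter(array(col1, col2, ...), x -> x > 0))" over all columns.
cols = df.columns  # e.g. ['col1', 'col2', 'col3']
check_expr = "size(filter(array({}), x -> x > 0))".format(", ".join(cols))
df.withColumn("check", F.expr(check_expr)).show()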
You can use functools.reduce to add up, across the columns in df.columns, a 1 for every value greater than 0, like this:
from pyspark.sql import functions as F
from operator import add
from functools import reduce
df = spark.createDataFrame([
    (1, 2, -3), (2, None, 5), (4, 4, 8), (1, 0, 9)
], ["col1", "col2", "col3"])
df = df.withColumn(
    "check",
    reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns])
)
df.show()
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|   1|   2|  -3|    2|
#|   2|null|   5|    2|
#|   4|   4|   8|    3|
#|   1|   0|   9|    2|
#+----+----+----+-----+
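As a side note, the same fold can also be pushed entirely to the SQL side with the aggregate higher-order function (Spark 2.4+); a sketch assuming the same three columns:
from pyspark.sql import functions as F

# aggregate(array, start, merge) folds over the array; if(x > 0, 1, 0)
# yields 0 for nulls because `null > 0` evaluates to null, not true.
df.withColumn(
    "check",
    F.expr("aggregate(array(col1, col2, col3), 0, (acc, x) -> acc + if(x > 0, 1, 0))")
).show()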
I have a dataframe with dates in the format YYYYMM.
These start from 201801.
I now want to add a column where 201801 = 1, 201802 = 2, and so on, up to the most recent month, which is updated every month.
months_between can be used:
from pyspark.sql import functions as F
from pyspark.sql import types as T
# some test data
data = [
    [201801],
    [201802],
    [201804],
    [201812],
    [202001],
    [202010]
]
df = spark.createDataFrame(data, schema=["yyyymm"])
df.withColumn("months", F.months_between(
    F.to_date(F.col("yyyymm").cast(T.StringType()), "yyyyMM"), F.lit("2017-12-01")
).cast(T.IntegerType())).show()
Output:
+------+------+
|yyyymm|months|
+------+------+
|201801|     1|
|201802|     2|
|201804|     4|
|201812|    12|
|202001|    25|
|202010|    34|
+------+------+
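Because of the fixed YYYYMM encoding, the index can also be computed with plain integer arithmetic instead of date functions; a sketch assuming yyyymm is numeric and the count should start at 1 for 201801:
from pyspark.sql import functions as F

# (year - 2018) * 12 + month, e.g. 201801 -> 1, 201812 -> 12, 202001 -> 25
df.withColumn(
    "months",
    ((F.col("yyyymm") / 100).cast("int") - 2018) * 12 + (F.col("yyyymm") % 100)
).show()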
I want to delete the last two characters from values in a column.
The values of the PySpark dataframe look like this:
1000.0
1250.0
3000.0
...
and they should look like this:
1000
1250
3000
...
You can use substring to get the string until the index length - 2:
import pyspark.sql.functions as F
df2 = df.withColumn(
'col',
F.expr("substring(col, 1, length(col) - 2)")
)
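A quick end-to-end check of that expression on sample data matching the question (the df below is constructed here only for illustration):
import pyspark.sql.functions as F

df = spark.createDataFrame([('1000.0',), ('1250.0',), ('3000.0',)], ['col'])
df.withColumn('col', F.expr("substring(col, 1, length(col) - 2)")).show()
#+----+
#| col|
#+----+
#|1000|
#|1250|
#|3000|
#+----+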
You can use regexp_replace:
from pyspark.sql import functions as F
df1 = df.withColumn("value", F.regexp_replace("value", "(.*).{2}", "$1"))
df1.show()
#+-----+
#|value|
#+-----+
#| 1000|
#| 1250|
#| 3000|
#+-----+
Or regexp_extract:
df1 = df.withColumn("value", F.regexp_extract("value", "(.*).{2}", 1))
You can use the function substring_index to extract the part before the period:
from pyspark.sql import functions as F

df = spark.createDataFrame([['1000.0'], ['2000.0']], ['col'])
df.withColumn('new_col', F.substring_index(F.col('col'), '.', 1)).show()
Result:
+------+-------+
|   col|new_col|
+------+-------+
|1000.0|   1000|
|2000.0|   2000|
+------+-------+
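If the values are really always numeric strings ending in .0 (an assumption about the data, not something stated in the question), a cast-based approach would work as well:
from pyspark.sql import functions as F

# Cast to double, truncate to int, and render back as a string.
df.withColumn('new_col', F.col('col').cast('double').cast('int').cast('string')).show()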
I have a data frame that looks like:
Region, 2000Q1, 2000Q2, 2000Q3, ...
A, 1,2,3,...
I want to transpose this wide table to a long table by 'Region'. So the final product will look like:
Region, Time, Value
A, 2000Q1,1
A, 2000Q2, 2
A, 2000Q3, 3
A, 2000Q4, 4
....
The original table has a very wide array of columns, but the aggregation level is always Region, and the remaining columns are to be transposed.
Do you know an easy way or function to do this?
Try the arrays_zip function, then explode the resulting array.
Example:
from pyspark.sql.functions import *
from pyspark.sql.types import *

df=spark.createDataFrame([('A',1,2,3)],['Region','2000q1','2000q2','2000q3'])

# columns to transpose and a "|"-delimited string of their names
cols=[c for c in df.columns if c != 'Region']
col_name="|".join(cols)

df.withColumn("cc",explode(arrays_zip(array(*cols),split(lit(col_name),"\\|")))).\
select("Region","cc.*").\
toDF(*['Region','Value','Time']).\
show()
#+------+-----+------+
#|Region|Value|  Time|
#+------+-----+------+
#|     A|    1|2000q1|
#|     A|    2|2000q2|
#|     A|    3|2000q3|
#+------+-----+------+
Similar but improved for the column calculation.
cols = df.columns
cols.remove('Region')
import pyspark.sql.functions as f
df.withColumn('array', f.explode(f.arrays_zip(f.array(*[f.lit(x) for x in cols]), f.array(*cols)))) \
.select('Region', 'array.*') \
.toDF('Region', 'Time', 'Value') \
.show(30, False)
+------+------+-----+
|Region|Time  |Value|
+------+------+-----+
|A     |2000Q1|1    |
|A     |2000Q2|2    |
|A     |2000Q3|3    |
|A     |2000Q4|4    |
|A     |2000Q5|5    |
+------+------+-----+
p.s. Don't accept this as an answer :)
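For completeness, the same unpivot can also be expressed with the SQL stack function; a sketch that assumes Region is the only key column (the backticks are needed because the column names start with digits):
import pyspark.sql.functions as f

# Build "stack(n, '2000q1', `2000q1`, '2000q2', `2000q2`, ...) as (Time, Value)"
cols = [c for c in df.columns if c != 'Region']
stack_expr = "stack({}, {}) as (Time, Value)".format(
    len(cols), ", ".join("'{0}', `{0}`".format(c) for c in cols)
)
df.select('Region', f.expr(stack_expr)).show()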
Currently, I have a table consisting of encounter_id and date field like so:
+---------------------------+--------------------------+
|encounter_id               |date                      |
+---------------------------+--------------------------+
|random_id34234             |2018-09-17 21:53:08.999999|
|this_can_be_anything2432432|2018-09-18 18:37:57.000000|
|423432                     |2018-09-11 21:00:36.000000|
+---------------------------+--------------------------+
encounter_id is a random string.
I'm aiming to create a column which consists of the total number of encounters in the past 30 days.
+---------------------------+--------------------------+---------------------------+
|encounter_id               |date                      |encounters_in_past_30_days |
+---------------------------+--------------------------+---------------------------+
|random_id34234             |2018-09-17 21:53:08.999999|2                          |
|this_can_be_anything2432432|2018-09-18 18:37:57.000000|3                          |
|423432                     |2018-09-11 21:00:36.000000|1                          |
+---------------------------+--------------------------+---------------------------+
Currently, I'm thinking of somehow using window functions and specifying an aggregate function.
Here is one possible solution; I added some sample data. It indeed uses a window function, as you suggested yourself. Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame(
    [
        ('A','2018-10-01 00:15:00'),
        ('B','2018-10-11 00:30:00'),
        ('C','2018-10-21 00:45:00'),
        ('D','2018-11-10 00:00:00'),
        ('E','2018-12-20 00:15:00'),
        ('F','2018-12-30 00:30:00')
    ],
    ("encounter_id","date")
)
df = df.withColumn('timestamp',F.col('date').astype('Timestamp').cast("long"))
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*30,0)
df = df.withColumn('encounters_past_30_days',F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+-----------------------+
|encounter_id|               date| timestamp|encounters_past_30_days|
+------------+-------------------+----------+-----------------------+
|           A|2018-10-01 00:15:00|1538345700|                      1|
|           B|2018-10-11 00:30:00|1539210600|                      2|
|           C|2018-10-21 00:45:00|1540075500|                      3|
|           D|2018-11-10 00:00:00|1541804400|                      2|
|           E|2018-12-20 00:15:00|1545261300|                      1|
|           F|2018-12-30 00:30:00|1546126200|                      2|
+------------+-------------------+----------+-----------------------+
EDIT: If you want to have days as the granularity, you could first convert your date column to the Date type. Example below, assuming that a window of five days means today and the four days before. If it should be today and the past five days, just remove the -1.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
n_days = 5
df = sqlContext.createDataFrame(
    [
        ('A','2018-10-01 23:15:00'),
        ('B','2018-10-02 00:30:00'),
        ('C','2018-10-05 05:45:00'),
        ('D','2018-10-06 00:15:00'),
        ('E','2018-10-07 00:15:00'),
        ('F','2018-10-10 21:30:00')
    ],
    ("encounter_id","date")
)
df = df.withColumn('timestamp',F.to_date(F.col('date')).astype('Timestamp').cast("long"))
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*(n_days-1),0)
df = df.withColumn('encounters_past_n_days',F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+----------------------+
|encounter_id|               date| timestamp|encounters_past_n_days|
+------------+-------------------+----------+----------------------+
|           A|2018-10-01 23:15:00|1538344800|                     1|
|           B|2018-10-02 00:30:00|1538431200|                     2|
|           C|2018-10-05 05:45:00|1538690400|                     3|
|           D|2018-10-06 00:15:00|1538776800|                     3|
|           E|2018-10-07 00:15:00|1538863200|                     3|
|           F|2018-10-10 21:30:00|1539122400|                     3|
+------------+-------------------+----------+----------------------+
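One practical caveat about both snippets above: Window.orderBy without partitionBy moves all rows into a single partition (Spark logs a warning to that effect), which can hurt at scale. If the counts should be kept per some key, say a hypothetical patient_id column that is not in the sample data, the same window can simply be partitioned:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Hypothetical: count encounters per patient over a rolling 30-day range.
# 'patient_id' is an assumed column, not part of the original sample data.
w = Window.partitionBy('patient_id').orderBy('timestamp').rangeBetween(-60*60*24*30, 0)
df = df.withColumn('encounters_past_30_days', F.count('encounter_id').over(w))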