I want to delete the last two characters from values in a column.
The values of the PySpark dataframe look like this:
1000.0
1250.0
3000.0
...
and they should look like this:
1000
1250
3000
...
You can use substring to keep everything up to index length - 2 (i.e. drop the last two characters):
import pyspark.sql.functions as F

df2 = df.withColumn(
    'col',
    F.expr("substring(col, 1, length(col) - 2)")
)
You can use regexp_replace:
from pyspark.sql import functions as F
df1 = df.withColumn("value", F.regexp_replace("value", "(.*).{2}", "$1"))
df1.show()
#+-----+
#|value|
#+-----+
#| 1000|
#| 1250|
#| 3000|
#+-----+
Or regexp_extract:
df1 = df.withColumn("value", F.regexp_extract("value", "(.*).{2}", 1))
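If the suffix you want to drop is always a literal ".0", a slightly more targeted sketch is to anchor the pattern at the end of the string:
import pyspark.sql.functions as F

# Only a trailing ".0" is removed; other values pass through unchanged
df1 = df.withColumn("value", F.regexp_replace("value", "\\.0$", ""))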
You can use the function substring_index to extract the part before the period:
import pyspark.sql.functions as F

df = spark.createDataFrame([['1000.0'], ['2000.0']], ['col'])
df.withColumn('new_col', F.substring_index(F.col('col'), '.', 1)).show()
Result:
+------+-------+
| col|new_col|
+------+-------+
|1000.0| 1000|
|2000.0| 2000|
+------+-------+
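Since the values are just numbers with a trailing .0, another option is to go through a numeric cast instead of string manipulation. This is only a sketch and assumes every value is a whole number stored as text in a column named col:
import pyspark.sql.functions as F

# "1000.0" -> 1000.0 (double) -> 1000 (int) -> "1000" (string)
df2 = df.withColumn('col', F.col('col').cast('double').cast('int').cast('string'))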
Suppose I have a pyspark data frame as:
col1 col2 col3
1 2 -3
2 null 5
4 4 8
1 0 9
I want to add a column called check where it counts the number of values that are greater than 0.
The final output will be:
col1 col2 col3 check
1 2 -3 2
2 null 5 2
4 4 8 3
1 0 9 2
I was trying this, but it didn't help and errors out as below:
df= df.withColumn("check", sum((df[col] > 0) for col in df.columns))
Invalid argument, not a string or column: <generator object
at 0x7f0a866ae580> of type <class 'generator'>. For column literals,
use 'lit', 'array', 'struct' or 'create_map' function.
Don't know if there is a simpler SQL-based solution or not, but it's pretty straightforward with a UDF.
from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType

count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()
Not sure if it'll handle nulls. Add a null check (if a and a > 0) in the UDF if needed.
Idea: https://stackoverflow.com/a/42540401/496289
Your code shows you doing a sum of non-zero columns, not a count. If you need the sum, then:
count_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())
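As noted above, the counting UDF will fail on nulls, because None > 0 raises a TypeError in Python 3. A null-safe variant (just a sketch, assuming the same three columns) could look like:
from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType

# Count only values that are non-null and strictly greater than zero
count_udf = udf(lambda arr: sum(1 for a in arr if a is not None and a > 0), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()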
Create an array column, filter it to keep only the positive values, and finally count the elements left in the array.
Example:
df.show(10,False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1 |2 |-3 |
#|2 |null|5 |
#+----+----+----+
df.withColumn("check",expr("size(filter(array(col1,col2), x -> x > 0))")).show(10,False)
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|1 |2 |-3 |2 |
#|2 |null|5 |1 |
#+----+----+----+-----+
You can use functools.reduce to add up a 1/0 flag for each column in df.columns that is greater than 0, like this:
from pyspark.sql import functions as F
from operator import add
from functools import reduce

df = spark.createDataFrame([
    (1, 2, -3), (2, None, 5), (4, 4, 8), (1, 0, 9)
], ["col1", "col2", "col3"])

df = df.withColumn(
    "check",
    reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns])
)
df.show()
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#| 1| 2| -3| 2|
#| 2|null| 5| 2|
#| 4| 4| 8| 3|
#| 1| 0| 9| 2|
#+----+----+----+-----+
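As an aside, the error in the question (Invalid argument, not a string or column: <generator object>) usually means the built-in sum has been shadowed by pyspark.sql.functions.sum, e.g. via a star import. If that is the case, the original idea works once you reach the Python built-in explicitly; a sketch:
import builtins
import pyspark.sql.functions as F

# builtins.sum adds the Column expressions together (0 + Column is handled by Column.__radd__)
df.withColumn(
    "check",
    builtins.sum(F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns)
).show()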
I have a dataframe with dates in the format YYYYMM.
These start from 201801.
I now want to add a column where 201801 = 1, 201802 = 2 and so on, up to the most recent month, which is updated every month.
Kind regards,
wokter
months_between can be used:
from pyspark.sql import functions as F
from pyspark.sql import types as T
# some test data
data = [
    [201801],
    [201802],
    [201804],
    [201812],
    [202001],
    [202010],
]
df = spark.createDataFrame(data, schema=["yyyymm"])

df.withColumn("months", F.months_between(
    F.to_date(F.col("yyyymm").cast(T.StringType()), "yyyyMM"), F.lit("2017-12-01")
).cast(T.IntegerType())).show()
Output:
+------+------+
|yyyymm|months|
+------+------+
|201801| 1|
|201802| 2|
|201804| 4|
|201812| 12|
|202001| 25|
|202010| 34|
+------+------+
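If you prefer plain integer arithmetic over date functions, an equivalent sketch (assuming yyyymm is numeric and the count starts at 201801 = 1) is:
from pyspark.sql import functions as F

# (year - 2018) * 12 + month: 201801 -> 1, 201812 -> 12, 202001 -> 25
df.withColumn(
    "months",
    ((F.floor(F.col("yyyymm") / 100) - 2018) * 12 + F.col("yyyymm") % 100).cast("int")
).show()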
I have a data frame that looks like:
Region, 2000Q1, 2000Q2, 2000Q3, ...
A, 1,2,3,...
I want to transpose this wide table to a long table by 'Region'. So the final product will look like:
Region, Time, Value
A, 2000Q1,1
A, 2000Q2, 2
A, 2000Q3, 3
A, 2000Q4, 4
....
The original table has a very wide array of columns, but the aggregation level is always Region and the remaining columns are to be transposed.
Do you know an easy way or function to do this?
Try the arrays_zip function, then explode the resulting array.
Example:
df = spark.createDataFrame([('A',1,2,3)], ['Region','2000q1','2000q2','2000q3'])
from pyspark.sql.functions import *

# value columns to unpivot, and their names joined into one string
cols = [c for c in df.columns if c != 'Region']
col_name = "|".join(cols)   # "2000q1|2000q2|2000q3"

df.withColumn("cc", explode(arrays_zip(array(*cols), split(lit(col_name), "\\|")))).\
    select("Region", "cc.*").\
    toDF(*['Region', 'Value', 'Time']).\
    show()
#+------+-----+------+
#|Region|Value| Time|
#+------+-----+------+
#| A| 1|2000q1|
#| A| 2|2000q2|
#| A| 3|2000q3|
#+------+-----+------+
Similar but improved for the column calculation.
cols = df.columns
cols.remove('Region')
import pyspark.sql.functions as f
df.withColumn('array', f.explode(f.arrays_zip(f.array(*map(lambda x: f.lit(x), cols)), f.array(*cols)))) \
    .select('Region', 'array.*') \
    .toDF('Region', 'Time', 'Value') \
    .show(30, False)
+------+------+-----+
|Region|Time |Value|
+------+------+-----+
|A |2000Q1|1 |
|A |2000Q2|2 |
|A |2000Q3|3 |
|A |2000Q4|4 |
|A |2000Q5|5 |
+------+------+-----+
p.s. Don't accept this as an answer :)
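Another option worth knowing, sketched here under the same example schema, is Spark SQL's stack function, which unpivots a fixed list of columns in a single selectExpr:
df = spark.createDataFrame([('A', 1, 2, 3)], ['Region', '2000q1', '2000q2', '2000q3'])

value_cols = [c for c in df.columns if c != 'Region']
# Builds: stack(3, '2000q1', `2000q1`, '2000q2', `2000q2`, '2000q3', `2000q3`) as (Time, Value)
stack_expr = "stack({}, {}) as (Time, Value)".format(
    len(value_cols),
    ", ".join("'{0}', `{0}`".format(c) for c in value_cols)
)
df.selectExpr('Region', stack_expr).show()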
I have a pyspark dataframe which looks like below
df
num11 num21
10 10
20 30
5 25
I am filtering the above dataframe on all columns present, selecting rows where every value is greater than or equal to 10 [the number of columns can be more than two].
from pyspark.sql.functions import col
col_list = df.schema.names
df_fltered = df.where(col(c) >= 10 for c in col_list)
desired output is :
num11 num21
10 10
20 30
How can we achieve filtering on multiple columns using iteration over the column list as above? [all efforts are appreciated]
[the error I receive is: condition should be string or column]
As an alternative, if you are not averse to some SQL-like snippets of code, the following should work:
df.where(" AND ".join(["(%s >= 10)" % c for c in col_list]))
You can use functools.reduce to combine the column conditions and simulate an all condition, for instance with reduce(lambda x, y: x & y, ...):
import pyspark.sql.functions as F
from functools import reduce
df.where(reduce(lambda x, y: x & y, (F.col(x) >= 10 for x in df.columns))).show()
+-----+-----+
|num11|num21|
+-----+-----+
| 10| 10|
| 20| 30|
+-----+-----+
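Another sketch, assuming all the columns are numeric and non-null: F.least can express the same "every column >= 10" condition, because a row qualifies exactly when its smallest value does:
import pyspark.sql.functions as F

# Keep rows whose minimum column value is at least 10
df.where(F.least(*[F.col(c) for c in df.columns]) >= 10).show()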
I'm using Pyspark DataFrame.
I'd like to update NA values in Age column with a random value in the range 14 to 46.
How can I do it?
Mara's answer is correct if you would like to replace the null values with the same random number, but if you'd like a random value for each row, you should do something like coalesce with F.rand(), as illustrated below:
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from random import randint
df = sqlContext.createDataFrame(
    [(1, "a", 23.0), (3, "B", -23.0)], ("x1", "x2", "x3"))

df = (df
    .withColumn("x4", F.lit(None).cast(IntegerType()))
    .withColumn("x5", F.lit(None).cast(IntegerType()))
)
df.na.fill({'x4':randint(0,100)}).show()
df.withColumn('x5', F.coalesce(F.col('x5'), (F.round(F.rand()*100)))).show()
+---+---+-----+---+----+
| x1| x2| x3| x4| x5|
+---+---+-----+---+----+
| 1| a| 23.0| 9|null|
| 3| B|-23.0| 9|null|
+---+---+-----+---+----+
+---+---+-----+----+----+
| x1| x2| x3| x4| x5|
+---+---+-----+----+----+
| 1| a| 23.0|null|44.0|
| 3| B|-23.0|null| 2.0|
+---+---+-----+----+----+
The randint function is what you need: it generates a random integer between two numbers. Apply it with Spark's fillna function on the 'age' column.
from random import randint
df.fillna(randint(14, 46), 'age').show()
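Note that randint is evaluated once on the driver, so every null gets the same value. For a different random integer per row in the 14 to 46 range, a sketch building on the coalesce/F.rand() idea above (assuming an integer column named Age):
from pyspark.sql import functions as F

# F.rand() is uniform in [0, 1); scaling by 33 and flooring yields integers 14..46
df = df.withColumn('Age', F.coalesce(F.col('Age'), F.floor(F.rand() * 33 + 14).cast('int')))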