Suppose there is a name column:
name
prashant
ram
Then the column values should become like this:
##############################
# Name     | Replaced_value  #
##############################
# prashant | XXXXXXXX        #
# ram      | XXX             #
##############################
Each value has to be replaced by the same number of Xs as it has characters.
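You can combine lpad/rpad with length:
LPAD('X', LENGTH(InputString), 'X')
This will also work: left-pad the name with as many Xs as it has characters, then keep only the first length(name) characters, which are all Xs.
select name, substr(lpad(name, length(name) + length(name), 'X'), 1, length(name)) as replaced_name from table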
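For a quick check outside the database, the same idea can be sketched in plain Python. This is only an illustration of the masking logic, not Spark code; the helper names `mask` and `mask_like_sql` are made up for this example. `mask_like_sql` mirrors the lpad/substr query above step by step.

```python
def mask(value: str) -> str:
    """Replace every character with 'X', keeping the original length."""
    return "X" * len(value)


def mask_like_sql(value: str) -> str:
    """Mirror substr(lpad(name, 2 * length(name), 'X'), 1, length(name))."""
    n = len(value)
    padded = value.rjust(2 * n, "X")  # lpad: pad on the left up to width 2n
    return padded[:n]                 # substr(..., 1, n): the first n characters


for name in ["prashant", "ram"]:
    print(name, mask(name), mask_like_sql(name))
```

In PySpark, `regexp_replace(name, '.', 'X')` should give the same result for values without line breaks, since every matched character is replaced by a single X.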
Basically, I have a dataframe that looks exactly like this:
id | values
----------------------------
01 | [{"final_price":10.0,"currency":"USD"},{"final_price":18.0,"currency":"CAD"}]
02 | [{"final_price":44.15,"currency":"USD"},{"final_price":60.0,"currency":"CAD"}]
03 | [{"final_price":99.99,"currency":"USD"},{"final_price":115.0,"currency":"CAD"}]
04 | [{"final_price":25.0,"currency":"USD"},{"final_price":32.0,"currency":"CAD"}]
The same product id has the price in US dollars and in Canadian dollars. However, I need to check how many dicts this column has, because some products only have the price in USD and others only in CAD. How can I check how many currencies there are and create a new column for each one of them?
Thanks!
Convert the JSON strings into an array of structs using from_json. The number of dicts (currencies) corresponds to the size of the resulting array, which size() returns. To select them as new columns, you can pivot like this:
from pyspark.sql import functions as F
df = spark.createDataFrame([
    ("01", "[{'final_price':10.0,'currency':'USD'},{'final_price':18.0,'currency':'CAD'}]"),
    ("02", "[{'final_price':44.15,'currency':'USD'},{'final_price':60.0,'currency':'CAD'}]"),
    ("03", "[{'final_price':99.99,'currency':'USD'},{'final_price':115.0,'currency':'CAD'}]"),
    ("04", "[{'final_price':25.0,'currency':'USD'},{'final_price':32.0,'currency':'CAD'}]")
], ["id", "values"])
df.selectExpr(
    "id",
    "inline(from_json(values, 'array<struct<final_price:float,currency:string>>'))"
).groupby("id").pivot("currency").agg(
    F.first("final_price")
).show()
# +---+-----+-----+
# | id| CAD| USD|
# +---+-----+-----+
# | 01| 18.0| 10.0|
# | 03|115.0|99.99|
# | 02| 60.0|44.15|
# | 04| 32.0| 25.0|
# +---+-----+-----+
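To see the counting step in isolation, here is a plain-Python sketch that parses each `values` string, counts the dicts, and spreads the prices into one column per currency. It only illustrates the logic and is not Spark code; `pivot_currencies` and the sample rows (including the hypothetical product 05 with a single currency) are assumptions made for this example.

```python
import json

rows = [
    ("01", '[{"final_price":10.0,"currency":"USD"},{"final_price":18.0,"currency":"CAD"}]'),
    ("05", '[{"final_price":7.5,"currency":"CAD"}]'),  # hypothetical product: CAD only
]


def pivot_currencies(rows):
    """Return the sorted currency list and one dict per product."""
    parsed = [(pid, json.loads(values)) for pid, values in rows]
    # Every currency that appears anywhere becomes a column.
    currencies = sorted({d["currency"] for _, dicts in parsed for d in dicts})
    table = []
    for pid, dicts in parsed:
        prices = {d["currency"]: d["final_price"] for d in dicts}
        row = {"id": pid, "n_currencies": len(dicts)}
        # Missing currencies are filled with None, like nulls after a pivot.
        row.update({c: prices.get(c) for c in currencies})
        table.append(row)
    return currencies, table


currencies, table = pivot_currencies(rows)
print(currencies)
for row in table:
    print(row)
```

In Spark itself, the count would come from applying `size()` to the array returned by `from_json`, and the pivot leaves a null where a product lacks a currency.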
I am using Apache Spark to find the longest common prefix per session.
Given the following example:
session | prefix
_____________________
1 | keys
1 | key chain
1 | keysmith
2 | tim
2 | timmy
2 | tim hortons
I would like to format this into the following output:
session | prefix
_____________________
1 | key
2 | tim
I saw an example which checks a column in one row against all others but I have trouble wrapping my head around how to do this for aggregate rows.
Any help is appreciated!
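Try the query below. Note that it only returns the length of the shortest prefix in each session, which is an upper bound on the length of the common prefix; it does not return the common prefix itself.
select session, min(length(prefix)) from table_name
group by session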
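Since the goal is the prefix itself rather than a length, the per-session logic can be sketched in plain Python with `os.path.commonprefix`, which compares strings character by character. This is an illustration only: `longest_common_prefix_per_session` is a made-up helper, and in Spark the same function could plausibly be applied through a UDF to the result of `collect_list(prefix)` grouped by session.

```python
from collections import defaultdict
from os.path import commonprefix

rows = [
    (1, "keys"), (1, "key chain"), (1, "keysmith"),
    (2, "tim"), (2, "timmy"), (2, "tim hortons"),
]


def longest_common_prefix_per_session(rows):
    """Group the strings by session, then reduce each group to its common prefix."""
    by_session = defaultdict(list)
    for session, prefix in rows:
        by_session[session].append(prefix)
    return {session: commonprefix(values) for session, values in by_session.items()}


result = longest_common_prefix_per_session(rows)
print(result)
```

For the sample data this yields `key` for session 1 and `tim` for session 2, matching the desired output in the question.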
I have a PySpark dataframe that looks like this:
data = [("James", "Joyce"),
        ("Michael", "Doglus"),
        ("Robert", "Connings"),
        ("Maria", "XYZ"),
        ("Jen", "PQR")
        ]
df2 = spark.createDataFrame(data, ["Name", "Lots_of_names"])
df2
Name Lots_of_names
0 James Joyce
1 Michael Doglus
2 Robert Connings
3 Maria XYZ
4 Jen PQR
I want to merge the two columns into one long column (probably in a new dataframe), that will have 10 rows. Is there any way to get there? Thanks in advance.
You are probably looking for something like this:
import pyspark.sql.functions as F
df_out = df2.select(F.explode(F.array("Name", "Lots_of_names")).alias("one_col"))
This produces df_out as follows:
# one_col
#------
# James
# Joyce
# Michael
# Doglus
# Robert
# Connings
# Maria
# XYZ
# Jen
# PQR
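What `explode(array(...))` does can be pictured in plain Python: each row becomes a small list of its column values, and those lists are flattened in row order. A minimal sketch (not Spark code) using the data from the question:

```python
from itertools import chain

data = [
    ("James", "Joyce"),
    ("Michael", "Doglus"),
    ("Robert", "Connings"),
    ("Maria", "XYZ"),
    ("Jen", "PQR"),
]

# array("Name", "Lots_of_names") corresponds to each tuple;
# explode corresponds to flattening the tuples into one long column.
one_col = list(chain.from_iterable(data))
print(len(one_col))
print(one_col)
```

Five rows with two columns each give the ten rows the question asks for, with the two values of each original row kept next to each other.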
I have a dataframe and I want to check whether one column contains only the letter A, for example.
The column contains a lot of letters. It looks like:
AAAAAAAAAAAAAAAA
AAABBBBBDBBSBSBB
I want to check whether a value contains only the letter A, or only the letters A and B, but nothing else.
Do you know which function I should use?
Try this. I used four sample values. Spark's rlike function matches a column against a regex; with the pattern [^AB] it returns true for values that contain any character other than A or B, and false for values made up only of A, B, or both. Keeping the rows where the result is false gives you the answer. (The column name contains_A_or_B below is therefore a little misleading: true means the value contains something else.)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()
li = [[("AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB")], [("AAAAAAAAA")],[("BBBBBBBB")], [("AAAAAABBBBBBBB")]]
df = spark.createDataFrame(li, ["letter"])
df.show(truncate=False)
#
# +--------------------------------+
# |letter |
# +--------------------------------+
# |AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB|
# |AAAAAAAAA |
# |BBBBBBBB |
# |AAAAAABBBBBBBB |
# +--------------------------------+
df1 = df.withColumn("contains_A_or_B", F.col('letter').rlike("[^AB]"))
df1.show(truncate=False)
# +--------------------------------+---------------+
# |letter |contains_A_or_B|
# +--------------------------------+---------------+
# |AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB|true |
# |AAAAAAAAA |false |
# |BBBBBBBB |false |
# |AAAAAABBBBBBBB |false |
# +--------------------------------+---------------+
df1.filter(F.col('contains_A_or_B')==False).select("letter").show()
# +--------------+
# | letter|
# +--------------+
# | AAAAAAAAA|
# | BBBBBBBB|
# |AAAAAABBBBBBBB|
# +--------------+
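The regex logic can be tried outside Spark with Python's `re` module. The sketch below is an illustration, not part of the answer: `only_a_or_b` is a made-up helper, and it uses the positive pattern `[AB]+` with a full match instead of searching for `[^AB]`. In Spark, the corresponding positive check would presumably be `rlike('^[AB]+$')`.

```python
import re

# One or more characters, each of which must be A or B.
ONLY_AB = re.compile(r"[AB]+")


def only_a_or_b(value: str) -> bool:
    """True if the whole string consists of A and/or B and nothing else."""
    return ONLY_AB.fullmatch(value) is not None


samples = [
    "AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB",
    "AAAAAAAAA",
    "BBBBBBBB",
    "AAAAAABBBBBBBB",
]

kept = [s for s in samples if only_a_or_b(s)]
print(kept)
```

One difference worth knowing: an empty string passes the `[^AB]` test (it contains no other character) but fails the `[AB]+` test, so pick the pattern that matches how empty values should be treated.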
Use rlike.
Example from the official documentation:
df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name='Alice')]
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regex#pyspark.sql.Column.rlike
Let us say my Spark DataFrame (DF) looks like
id | age | earnings| health
----------------------------
1 | 34 | 65 | 8
2 | 65 | 12 | 4
2 | 20 | 7 | 10
1 | 40 | 75 | 7
. | .. | .. | ..
and I would like to group the DF by id, apply a function to each group (say, a linear
regression that depends on multiple columns, two in this case),
and get output like
id | intercept| slope
----------------------
1 | ? | ?
2 | ? | ?
from sklearn.linear_model import LinearRegression
lr_object = LinearRegression()
def linear_regression(ith_DF):
    # Note: for me it is necessary that ith_DF should contain all
    # data within this function scope, so that I can apply any
    # function that needs all data in ith_DF
    # sklearn expects X as a 2D list: one inner list per sample
    X = [[i.earnings] for i in ith_DF.select("earnings").rdd.collect()]
    y = [i.health for i in ith_DF.select("health").rdd.collect()]
    lr_object.fit(X, y)
    return lr_object.intercept_, lr_object.coef_[0]
coefficient_collector = []
# following iteration is not possible in spark as 'GroupedData'
# object is not iterable, please consider it as pseudo code
for ith_df in df.groupby("id"):
    c, m = linear_regression(ith_df)
    coefficient_collector.append((float(c), float(m)))
model_df = spark.createDataFrame(coefficient_collector, ["intercept", "slope"])
model_df.show()
I think this can be done since Spark 2.3 using pandas_udf. In fact, there is an example of fitting grouped regressions in the announcement of pandas UDFs here:
Introducing Pandas UDF for Python
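A minimal sketch of the per-group fit, under stated assumptions: `fit_line` is a hand-written least-squares helper (not sklearn), the four rows are the ones shown in the question, and `health` is regressed on `earnings`. The Spark wrapper `fit_per_id_with_spark` is hypothetical and untested here; it uses `applyInPandas` (available from Spark 3.0) rather than the Spark 2.3 `pandas_udf` grouped-map form this answer refers to, and it needs pandas and pyarrow installed.

```python
from collections import defaultdict


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# (id, age, earnings, health), taken from the question
rows = [(1, 34, 65, 8), (2, 65, 12, 4), (2, 20, 7, 10), (1, 40, 75, 7)]

groups = defaultdict(list)
for id_, _age, earnings, health in rows:
    groups[id_].append((earnings, health))

# One (intercept, slope) pair per id; each fit sees all rows of its group.
models = {id_: fit_line(*zip(*pairs)) for id_, pairs in sorted(groups.items())}
print(models)


def fit_per_id_with_spark(df):
    """Hypothetical Spark wrapper: one output row per id (assumes Spark >= 3.0)."""
    import pandas as pd

    def fit_group(pdf):
        intercept, slope = fit_line(list(pdf["earnings"]), list(pdf["health"]))
        return pd.DataFrame(
            {"id": [pdf["id"].iloc[0]], "intercept": [intercept], "slope": [slope]}
        )

    return df.groupBy("id").applyInPandas(
        fit_group, schema="id long, intercept double, slope double"
    )
```

The function passed to `applyInPandas` receives each group as a complete pandas DataFrame, which is the property the question asks for: all data of a group is available inside the function.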
What I'd do is filter the main DataFrame to create smaller DataFrames, one per group, and do the processing, say a linear regression, on each.
You can then execute the linear regressions in parallel (on separate threads using the same SparkSession, which is thread-safe) with the main DataFrame cached.
That should give you the full power of Spark.
P.S. My limited understanding of that part of Spark makes me think that a very similar approach is used for grid search-based model selection in Spark MLlib and also in TensorFrames, which is an "Experimental TensorFlow binding for Scala and Apache Spark".
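The filter-then-thread pattern can be sketched with the standard library alone. Everything here is a stand-in: the in-memory `rows` list plays the role of the cached DataFrame, the list comprehension plays the role of `df.filter(df.id == group_id)`, and `process_group` just counts rows where a real job would fit a model.

```python
from concurrent.futures import ThreadPoolExecutor

# (id, age, earnings, health), taken from the question
rows = [(1, 34, 65, 8), (2, 65, 12, 4), (2, 20, 7, 10), (1, 40, 75, 7)]


def process_group(group_id):
    # Stand-in for filtering the cached DataFrame down to one id.
    subset = [r for r in rows if r[0] == group_id]
    # Placeholder for the real per-group processing (e.g. a regression).
    return group_id, len(subset)


ids = sorted({r[0] for r in rows})

# Each id is handled on its own worker thread; results come back in id order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(process_group, ids))

print(results)
```

With Spark, each thread would submit its own job against the shared session, so the number of worker threads is a tuning choice that depends on the cluster and the number of groups.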