plsql code to replace values of a column with characters - sql

If there is a name column

name
--------
prashant
ram

then the column values should become like this:

Name     | Replaced_value
---------|---------------
prashant | XXXXXXXX
ram      | XXX

Each value has to be replaced by the same number of Xs as its length.

You can combine LPAD/RPAD and LENGTH:
LPAD('X', LENGTH(InputString), 'X')

This will work!
select name,
       substr(lpad(name, length(name) + length(name), 'X'), 1, length(name)) as replaced_name
from table_name

Related

Checking how many dictionaries a pyspark dataframe has and collecting values from them

Basically, I have a dataframe that looks exactly like this:

id | values
---|-------------------------------------------------------------------------------
01 | [{"final_price":10.0,"currency":"USD"},{"final_price":18.0,"currency":"CAD"}]
02 | [{"final_price":44.15,"currency":"USD"},{"final_price":60.0,"currency":"CAD"}]
03 | [{"final_price":99.99,"currency":"USD"},{"final_price":115.0,"currency":"CAD"}]
04 | [{"final_price":25.0,"currency":"USD"},{"final_price":32.0,"currency":"CAD"}]

The same product id has the price in US dollars and in Canadian dollars. However, I need to check how many dicts this column has, because some products only have the price in USD and others only in CAD. How can I check how many currencies there are and create new columns for each one of them?
Thanks!
Convert the JSON strings into an array of structs using from_json. The number of dicts (currencies) corresponds to the size of the resulting array, and to select them as new columns you can pivot like this:
from pyspark.sql import functions as F

df = spark.createDataFrame([
    ("01", "[{'final_price':10.0,'currency':'USD'},{'final_price':18.0,'currency':'CAD'}]"),
    ("02", "[{'final_price':44.15,'currency':'USD'},{'final_price':60.0,'currency':'CAD'}]"),
    ("03", "[{'final_price':99.99,'currency':'USD'},{'final_price':115.0,'currency':'CAD'}]"),
    ("04", "[{'final_price':25.0,'currency':'USD'},{'final_price':32.0,'currency':'CAD'}]")
], ["id", "values"])

df.selectExpr(
    "id",
    "inline(from_json(values, 'array<struct<final_price:float,currency:string>>'))"
).groupby("id").pivot("currency").agg(
    F.first("final_price")
).show()
# +---+-----+-----+
# | id|  CAD|  USD|
# +---+-----+-----+
# | 01| 18.0| 10.0|
# | 03|115.0|99.99|
# | 02| 60.0|44.15|
# | 04| 32.0| 25.0|
# +---+-----+-----+
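If you only need to check how many dicts (currencies) each row has, rather than pivot them out, a minimal sketch reusing the same df and schema string as above:

# Parse the JSON and take the size of the resulting array = number of currencies.
df.withColumn(
    "n_currencies",
    F.size(F.from_json("values", "array<struct<final_price:float,currency:string>>"))
).select("id", "n_currencies").show()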

Longest Common Prefix Per Aggregate Using Apache Spark SQL

I am using Apache Spark to find the longest common prefix per session.
Given the following example:

session | prefix
--------|------------
1       | keys
1       | key chain
1       | keysmith
2       | tim
2       | timmy
2       | tim hortons

I would like to format this into the following output:

session | prefix
--------|-------
1       | key
2       | tim
I saw an example which checks a column in one row against all others but I have trouble wrapping my head around how to do this for aggregate rows.
Any help is appreciated!
Try something like below:
select session, min(length(prefix)) from table_name
group by session
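If what you need is the shared prefix string itself rather than a length, one possible PySpark sketch (common_prefix is a helper defined here for illustration) relies on the fact that a group's longest common prefix equals the common prefix of its lexicographically smallest and largest values:

from pyspark.sql import functions as F, types as T

@F.udf(returnType=T.StringType())
def common_prefix(a, b):
    # Longest common prefix of two strings.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

df = spark.createDataFrame(
    [(1, "keys"), (1, "key chain"), (1, "keysmith"),
     (2, "tim"), (2, "timmy"), (2, "tim hortons")],
    ["session", "prefix"])

# The group's lexicographic min and max bound every value in the group,
# so their common prefix is the group's longest common prefix.
(df.groupBy("session")
   .agg(F.min("prefix").alias("lo"), F.max("prefix").alias("hi"))
   .select("session", common_prefix("lo", "hi").alias("prefix"))
   .show())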

How to merge columns into one on top of each other in pyspark?

I have a pyspark dataframe that looks like this:

data = [("James", "Joyce"),
        ("Michael", "Doglus"),
        ("Robert", "Connings"),
        ("Maria", "XYZ"),
        ("Jen", "PQR")]
df2 = spark.createDataFrame(data, ["Name", "Lots_of_names"])
df2

    Name     Lots_of_names
0   James    Joyce
1   Michael  Doglus
2   Robert   Connings
3   Maria    XYZ
4   Jen      PQR
I want to merge the two columns into one long column (probably in a new dataframe), that will have 10 rows. Is there any way to get there? Thanks in advance.
You are probably looking to do something like this:
import pyspark.sql.functions as F
df_out = df2.select(F.explode(F.array("Name", "Lots_of_names")).alias("one_col"))
which produces df_out as follows
# one_col
#------
# James
# Joyce
# Michael
# Doglus
# Robert
# Connings
# Maria
# XYZ
# Jen
# PQR
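An alternative sketch (not from the answer above, just another way to get the same 10-row result) is to union two single-column selections; either way, the row order is not guaranteed unless you sort explicitly:

import pyspark.sql.functions as F

# Stack the two columns with a union instead of explode.
stacked = df2.select(F.col("Name").alias("one_col")) \
    .union(df2.select(F.col("Lots_of_names").alias("one_col")))
stacked.show()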

How to check if a column only contains certain letters

I have a dataframe and I want to check whether one column contains only the letter A, for example.
The column contains a lot of letters. It looks like:
AAAAAAAAAAAAAAAA
AAABBBBBDBBSBSBB
I want to check if this column only contains the letter A, or only the letters A and B, but nothing else.
Do you know which function I should use?
Try this. I have considered four sample values. We can use the rlike function in Spark with the regex [^AB]: it returns true for values that contain any letter other than A or B, and false for values made up of only A, only B, or both A and B. We can then filter for the rows where the flag is false, and those are your answer.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName('SO') \
    .getOrCreate()

li = [["AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB"], ["AAAAAAAAA"], ["BBBBBBBB"], ["AAAAAABBBBBBBB"]]
df = spark.createDataFrame(li, ["letter"])
df.show(truncate=False)
# +--------------------------------+
# |letter                          |
# +--------------------------------+
# |AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB|
# |AAAAAAAAA                       |
# |BBBBBBBB                        |
# |AAAAAABBBBBBBB                  |
# +--------------------------------+

df1 = df.withColumn("contains_A_or_B", F.col('letter').rlike("[^AB]"))
df1.show(truncate=False)
# +--------------------------------+---------------+
# |letter                          |contains_A_or_B|
# +--------------------------------+---------------+
# |AAAAAAAAAAAAAAAAAAABBBBBDBBSBSBB|true           |
# |AAAAAAAAA                       |false          |
# |BBBBBBBB                        |false          |
# |AAAAAABBBBBBBB                  |false          |
# +--------------------------------+---------------+

df1.filter(F.col('contains_A_or_B') == False).select("letter").show()
# +--------------+
# |        letter|
# +--------------+
# |     AAAAAAAAA|
# |      BBBBBBBB|
# |AAAAAABBBBBBBB|
# +--------------+
Use rlike.
Example from the official documentation:
df.filter(df.name.rlike('ice$')).collect()
[Row(age=2, name='Alice')]
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regex#pyspark.sql.Column.rlike
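If you want a single yes/no for the whole column rather than a per-row flag, one small sketch (reusing df and the [^AB] pattern from the first answer) is to count the rows that break the rule:

# True only if no row contains a character other than A or B.
only_a_or_b = df.filter(F.col("letter").rlike("[^AB]")).count() == 0
print(only_a_or_b)  # False for the sample data above, since the first row has D and S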

How to use LinearRegression across groups in DataFrame?

Let us say my spark DataFrame (DF) looks like

id | age | earnings | health
---|-----|----------|-------
1  | 34  | 65       | 8
2  | 65  | 12       | 4
2  | 20  | 7        | 10
1  | 40  | 75       | 7
.  | ..  | ..       | ..

and I would like to group the DF, apply a function (say linear regression, which depends on multiple columns - two columns in this case - of the aggregated DF) on each aggregated DF and get output like

id | intercept | slope
---|-----------|------
1  | ?         | ?
2  | ?         | ?

from sklearn.linear_model import LinearRegression

lr_object = LinearRegression()

def linear_regression(ith_DF):
    # Note: for me it is necessary that ith_DF should contain all
    # data within this function scope, so that I can apply any
    # function that needs all data in ith_DF
    X = [i.earnings for i in ith_DF.select("earnings").rdd.collect()]
    y = [i.health for i in ith_DF.select("health").rdd.collect()]
    lr_object.fit(X, y)
    return lr_object.intercept_, lr_object.coef_[0]

coefficient_collector = []

# following iteration is not possible in spark as 'GroupedData'
# object is not iterable, please consider it as pseudo code
for ith_df in df.groupby("id"):
    c, m = linear_regression(ith_df)
    coefficient_collector.append((float(c), float(m)))

model_df = spark.createDataFrame(coefficient_collector, ["intercept", "slope"])
model_df.show()
I think this can be done since Spark 2.3 using pandas_UDF. In fact, there is an example of fitting grouped regressions on the announcement of pandas_UDFs here:
Introducing Pandas UDF for Python
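For reference, a rough sketch of that grouped-pandas-UDF idea, written against the Spark 3.x applyInPandas API; the closed-form simple-regression formulas stand in for sklearn so the example stays self-contained, and fit_group is a name made up here:

import pandas as pd

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Ordinary least squares with a single feature: slope and intercept in closed form.
    x, y = pdf["earnings"], pdf["health"]
    slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    intercept = y.mean() - slope * x.mean()
    return pd.DataFrame({"id": [pdf["id"].iloc[0]],
                         "intercept": [float(intercept)],
                         "slope": [float(slope)]})

df = spark.createDataFrame(
    [(1, 34, 65, 8), (2, 65, 12, 4), (2, 20, 7, 10), (1, 40, 75, 7)],
    ["id", "age", "earnings", "health"])

# Each group's rows arrive as one pandas DataFrame; the returned frames are
# stitched back together into a Spark DataFrame with the declared schema.
df.groupBy("id").applyInPandas(
    fit_group, schema="id long, intercept double, slope double").show()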
What I'd do is to filter the main DataFrame to create smaller DataFrames and do the processing, say a linear regression.
You can then execute the linear regressions in parallel (on separate threads using the same SparkSession, which is thread-safe) with the main DataFrame cached.
That should give you the full power of Spark.
P.S. My limited understanding of that part of Spark makes me think that a very similar approach is used for grid search-based model selection in Spark MLlib, and also in TensorFrames, which is an "Experimental TensorFlow binding for Scala and Apache Spark".