rename columns in dataframe pyspark adding a string

I have written code in Python using Pandas that adds "VEN_" to the beginning of the column names:
Tablon.columns = "VEN_" + Tablon.columns
And it works fine, but now I'm working with PySpark and it doesn't work.
I've tried:
Vaa_total.columns = ['Vaa_' + col for col in Vaa_total.columns]
or
for elemento in Vaa_total.columns:
    elemento = "Vaa_" + elemento
Neither of these, nor other similar attempts, works.
I don't want to replace the column names; I just want to keep them and add a string to the beginning.

Try something like this:
for elemento in Vaa_total.columns:
    Vaa_total = Vaa_total.withColumnRenamed(elemento, "Vaa_" + elemento)

I linked a similar topic in a comment.
Here's an example adapted from that topic to your task:
from pyspark.sql.functions import col
dataframe.select([col(col_name).alias('VAA_' + col_name) for col_name in dataframe.columns])

A standard way of writing it:
from functools import reduce
renamed_df = reduce(lambda d, col_name: d.withColumnRenamed(col_name, "insert_text" + col_name), df.columns, df)
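
As a rough illustration of what the answers above compute, here is a minimal sketch in plain Python. The original attempts most likely fail because a PySpark DataFrame is immutable: its columns attribute cannot be assigned to, and rebinding the loop variable changes nothing. The helper name prefixed_names and the sample column names are assumptions made for this example, not part of the question.

```python
def prefixed_names(columns, prefix):
    """Return (old_name, new_name) pairs with `prefix` added to each column name."""
    return [(name, prefix + name) for name in columns]


# Hypothetical column names standing in for Vaa_total.columns.
columns = ["id", "amount", "date"]

pairs = prefixed_names(columns, "Vaa_")
new_names = [new for _, new in pairs]
print(new_names)

# With a real PySpark DataFrame, the pairs would be applied one by one,
# keeping the DataFrame returned by each call:
# for old, new in pairs:
#     Vaa_total = Vaa_total.withColumnRenamed(old, new)
```

The important part is that every rename produces a new DataFrame, so the result has to be assigned back on each step.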

Related

Replace part of string in columns names

I have an issue with changing column names. There is a part that I have to remove. The table looks like this:
Column.name_1 Column.name_2 Column.*.name_3 Column_name_4 Column_*_name_5
I wrote a line of code that changes the dot and asterisk into an underscore:
df_check.columns = df_check.columns.str.replace('.*.', '_')
But I get:
Column_name_1 Column_name_2 Column___name_3 Column_name_4 Column___name_5
And I need the result below, with only one underscore:
Column_name_1 Column_name_2 Column_name_3 Column_name_4 Column_name_5
Can you help me with this?
Regards

You could use:
df_check.columns = df_check.columns.str.replace(r'[.*_]+', '_', regex=True)
Output names:
['Column_name_1', 'Column_name_2', 'Column_name_3', 'Column_name_4', 'Column_name_5']
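
To see why the character class works, here is a small sketch using Python's re module on the column names from the question. The pattern [.*_]+ matches any run of dots, asterisks and underscores, so each run collapses into a single underscore. The variable names are chosen for this example only.

```python
import re

# Column names taken from the question.
names = [
    "Column.name_1",
    "Column.name_2",
    "Column.*.name_3",
    "Column_name_4",
    "Column_*_name_5",
]

# Inside a character class, '.' and '*' are literal characters.
# The '+' makes each run of them collapse into one underscore.
cleaned = [re.sub(r"[.*_]+", "_", name) for name in names]
print(cleaned)
```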

Regexp_Replace in pyspark not working properly

I am reading a csv file which is something like:
"ZEN","123"
"TEN","567"
Now, if I replace the character E with regexp_replace, it does not give correct results:
from pyspark.sql.functions import row_number, col, desc, date_format, to_date, to_timestamp, regexp_replace
from pyspark.sql.types import StructType, StringType
inputDirPath = "/FileStore/tables/test.csv"
schema = StructType()
for field in fields:
    colType = StringType()
    schema.add(field.strip(), colType, True)
incr_df = spark.read.format("csv").option("header", "false").schema(schema).option("delimiter", "\u002c").option("nullValue", "").option("emptyValue", "").option("multiline", True).csv(inputDirPath)
for column in incr_df.columns:
    inc_new = incr_df.withColumn(column, regexp_replace(column, "E", ""))
inc_new.show()
This does not give correct results; it appears to do nothing.
Note: I have 100+ columns, so I need to use a for loop.
Can someone help me spot my error?

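A list comprehension will be neater and easier. Your loop starts again from incr_df on every iteration, so only the replacement on the last column survives. Let's try:
inc_new = incr_df.select(*[regexp_replace(x, 'E', '').alias(x) for x in incr_df.columns])
inc_new.show()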
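
The difference between the two loop patterns can be sketched without Spark. The example below uses plain lists of dictionaries as a stand-in for a DataFrame; with_column is a hypothetical helper that, like withColumn, returns a new object and leaves its input untouched. The column names c0 and c1 are assumptions.

```python
import re


def with_column(rows, column, func):
    """Return new rows with `column` transformed; the input rows are not modified."""
    return [{**row, column: func(row[column])} for row in rows]


def strip_e(value):
    """Remove every 'E' from a string."""
    return re.sub("E", "", value)


original = [{"c0": "ZEN", "c1": "123"}, {"c0": "TEN", "c1": "567"}]
columns = ["c0", "c1"]

# Pattern from the question: every pass starts again from `original`,
# so only the change to the last column is kept.
for column in columns:
    buggy = with_column(original, column, strip_e)

# Accumulating pattern: every pass builds on the previous result.
fixed = original
for column in columns:
    fixed = with_column(fixed, column, strip_e)

print(buggy)
print(fixed)
```

Because the last column here contains no E, the first pattern looks as if nothing happened at all, which matches what the question describes.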

How to pass a defined text into PySpark SQL context

I am pretty new to PySpark and I wonder if there is something like below:
My PySpark SQL context is something like:
mysql = """
create table x as
select *
from a
"""
Since I need to change x a lot, and don't want to change it in sql itself every time, I'd like to define something in advance. Like
x = 'x'
mysql = """
create table x as
select *
from a
"""
Is there anything similar?
Thanks

You could use string substitution with an f-string:
x = 'x'
mysql = f"""
create table {x} as
select *
from a
"""
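
If the table name changes often, the substitution can be wrapped in a small function. This is only a sketch: the function name create_table_sql and the identifier check are assumptions added for illustration, the check being a cautious guard because the name is pasted directly into the SQL text.

```python
import re


def create_table_sql(table_name, source="a"):
    """Build the CREATE TABLE statement from the question for a given table name."""
    # Accept only simple identifiers, since the value is inserted into SQL as-is.
    for name in (table_name, source):
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
            raise ValueError(f"not a simple identifier: {name!r}")
    return f"create table {table_name} as\nselect *\nfrom {source}\n"


mysql = create_table_sql("x")
print(mysql)

# In a PySpark session the string would then be run with:
# spark.sql(mysql)
```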

Select first and last three strings from column for conditionSQL

My goal is to select all the columns that start and end with the same 3 strings as the first row.
In this case it was simple, since the CONCAT was equal to 'SCLMIA'
AND CONCAT(origin, destination) = 'SCLMIA'
AND ((flight_path LIKE '%SCL%' AND flight_path LIKE '%MIA%')
but now the difficulty is for multiple strings.
AND CONCAT(origin, destination) IN ('SCLMIA', 'SCLIQQ','SCLMAD', 'LIMCUZ', 'BOGMDE', 'FORGRU', 'SDUCGH', 'SCLGRU', 'BOGLIM', 'GYEUIO')
AND (**here I need to replicate the same as above.**)
I read that this can be done with the functions SUBSTRING, LEFT and RIGHT, selecting the first and last three characters, but I don't know how to do it.
I tried this, but it failed:
AND (flight_path LIKE '%' + SUBSTR(flight_path,3, LENGTH(flight_path) - 4) + '%')
Note that this is part of a chain of conditions, which is why it starts with AND.
Edit:
Image: Sample of data single path 'SCLMIA'
The data is in BigQuery.

I think this is what you're trying to do:
SELECT *
FROM
    flight_paths
WHERE
    CONCAT(origin, destination) IN ('SCLMIA', 'SCLIQQ', 'SCLMAD', 'LIMCUZ', 'BOGMDE', 'FORGRU', 'SDUCGH', 'SCLGRU', 'BOGLIM', 'GYEUIO')
    AND LEFT(flight_path, 3) = origin
    AND RIGHT(flight_path, 3) = destination
Here's a db-fiddle that demonstrates the answer:
https://www.db-fiddle.com/f/vUZ4HL4NC9xaBBZpwTYNcR/0
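
The same condition can be sketched in Python to make the logic explicit. It assumes that flight_path begins with the three-letter origin code and ends with the three-letter destination code; the sample rows and the shortened list of allowed routes are made up for this example.

```python
def path_matches(flight_path, origin, destination):
    """True if the path starts with the origin code and ends with the destination code."""
    return flight_path[:3] == origin and flight_path[-3:] == destination


# Hypothetical rows; the real data lives in BigQuery.
rows = [
    {"origin": "SCL", "destination": "MIA", "flight_path": "SCL-LIM-MIA"},
    {"origin": "SCL", "destination": "MIA", "flight_path": "LIM-SCL-MIA"},
    {"origin": "BOG", "destination": "MDE", "flight_path": "BOG-MDE"},
    {"origin": "LIM", "destination": "SCL", "flight_path": "LIM-SCL"},
]
allowed_routes = {"SCLMIA", "BOGMDE"}

selected = [
    row["flight_path"]
    for row in rows
    if row["origin"] + row["destination"] in allowed_routes
    and path_matches(row["flight_path"], row["origin"], row["destination"])
]
print(selected)
```

Comparing against the row's own origin and destination avoids writing one pair of LIKE conditions per route.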

I need help in filtering the content of this SQL column

I need help filtering the content of this SQL column. Unfortunately, I have been unsuccessful so far. I will be happy for any assistance.
My goal is for all the UNC paths to have the same format.
All should look like: \\ps9\wa033242, meaning each should begin with "\\" instead of "///".
I tried truncating it, but because the strings have different lengths, I ran into problems.
I tried truncating and updating:
SELECT
cw_platz.nummer,
cw_platz.nwaddress,
cw_platz.bezeichnung,
os_cw.cw_ldzuplatz.ldruckernr,
os_cw.cw_ldzuplatz.papierschacht,
os_cw.cw_ldzuplatz.treibername,
cw_logischerdrucker.bezeichnung
FROM
cw_platz,
os_cw.cw_ldzuplatz,
cw_logischerdrucker
WHERE
cw_platz.nummer = os_cw.cw_ldzuplatz.platznr and
cw_logischerdrucker.nummer = os_cw.cw_ldzuplatz.ldruckernr and
cw_platz.bezeichnung in cw_platz.bezeichnung
This is my result:

My first thought is to simply use something like REPLACE(yourstringhere, '/','\').
Is it something you already tried?
Reference: https://learn.microsoft.com/en-us/sql/t-sql/functions/replace-transact-sql?view=sql-server-2017

You could use REPLACE:
select replace('client/ps9///wa033242//', '/', '\');
And for an update:
update your_table
set your_column = replace(your_column, '/' ,'\')
try avoid .. the where like
UPDATE os_cw.cw_ldzuplatz
SET os_cw.cw_ldzuplatz.treibername = REPLACE(os_cw.cw_ldzuplatz.treibername, '/' ,'\')
FROM
cw_platz,
os_cw.cw_ldzuplatz,
cw_logischerdrucker
WHERE
cw_platz.nummer = os_cw.cw_ldzuplatz.platznr and
cw_logischerdrucker.nummer = os_cw.cw_ldzuplatz.ldruckernr and
cw_platz.bezeichnung = cw_platz.bezeichnung

Try this; it should help:
UPDATE os_cw.cw_ldzuplatz
SET os_cw.cw_ldzuplatz.treibername = REPLACE(os_cw.cw_ldzuplatz.treibername, '///', '\\')
FROM
cw_platz,
os_cw.cw_ldzuplatz,
cw_logischerdrucker
WHERE
cw_platz.nummer = os_cw.cw_ldzuplatz.platznr and
cw_logischerdrucker.nummer = os_cw.cw_ldzuplatz.ldruckernr and
cw_platz.bezeichnung = cw_platz.bezeichnung
and TREIBERNAME like '///%'
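
The intended transformation can be tried out on a few strings before running an UPDATE. The sketch below is in Python and rests on assumptions: the function name to_unc is made up, and the sample value ///ps9/wa033242 is a guess at what the stored data looks like, since the result from the question is not shown.

```python
def to_unc(path):
    """Turn a leading '///' into two backslashes and every other '/' into a backslash."""
    if path.startswith("///"):
        # "\\\\" is a Python literal for two backslash characters.
        path = "\\\\" + path[3:]
    return path.replace("/", "\\")


# Hypothetical stored value, and one that already has the target format.
print(to_unc("///ps9/wa033242"))
print(to_unc("\\\\ps9\\wa033242"))
```

Values that already start with two backslashes pass through unchanged, so the function can be applied to every row.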
