Regexp_Replace in pyspark not working properly - sql

I am reading a CSV file which looks something like this:
"ZEN","123"
"TEN","567"
Now when I try to replace the character E with regexp_replace, it is not giving correct results:
from pyspark.sql.functions import row_number, col, desc, date_format, to_date, to_timestamp, regexp_replace
from pyspark.sql.types import StructType, StringType

inputDirPath = "/FileStore/tables/test.csv"

schema = StructType()
for field in fields:
    colType = StringType()
    schema.add(field.strip(), colType, True)

incr_df = spark.read.format("csv").option("header", "false").schema(schema) \
    .option("delimiter", "\u002c").option("nullValue", "") \
    .option("emptyValue", "").option("multiline", True).csv(inputDirPath)

for column in incr_df.columns:
    inc_new = incr_df.withColumn(column, regexp_replace(column, "E", ""))
inc_new.show()
This is not giving correct results; it appears to do nothing.
Note: I have 100+ columns, so I need to use a for loop.
Can someone help in spotting my error?

The problem with your loop is that each iteration starts again from incr_df, so inc_new only ever holds the replacement for the last column. A list comprehension over a single select will be neater and easier. Let's try:
inc_new = incr_df.select(*[regexp_replace(x, 'E', '').alias(x) for x in incr_df.columns])
inc_new.show()
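If you prefer to keep the loop, the fix is to reassign the same DataFrame on every iteration so each column's replacement is kept; a minimal sketch of that fix, assuming incr_df has been read as above:

# Reassign incr_df itself on each pass so the replacements accumulate
for column in incr_df.columns:
    incr_df = incr_df.withColumn(column, regexp_replace(column, "E", ""))

incr_df.show()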

Related

replace .withColumn with a df.select

I am doing a basic transformation on my PySpark dataframe, but here I am using multiple .withColumn statements.
from pyspark.sql import functions as F

def trim_and_lower_col(col_name):
    return F.when(F.trim(col_name) == "", F.lit("unspecified")).otherwise(F.lower(F.trim(col_name)))

df = (
    source_df.withColumn("browser", trim_and_lower_col("browser"))
    .withColumn("browser_type", trim_and_lower_col("browser_type"))
    .withColumn("domains", trim_and_lower_col("domains"))
)
I read that creating multiple withColumn statements isn't very efficient and that I should use df.select() instead.
I tried this:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]

df = (
    source_df.select([trim_and_lower_col(col).alias(col) for col in cols_to_transform] + source_df.columns)
)
but it gives me a duplicate column error
What else can I try?
The duplicate column error comes because you pass each transformed column twice in that list: once as your newly transformed column (through .alias) and once as the original column (by name in source_df.columns). This solution lets you use a single select statement, preserve the column order, and avoid the duplication issue:
df = (
    source_df.select([trim_and_lower_col(col).alias(col) if col in cols_to_transform else col for col in source_df.columns])
)
Chaining many .withColumn calls does pose a problem, as the unresolved query plan can get pretty large and cause a StackOverflowError on the Spark driver during query plan optimisation. One good explanation of this problem is shared here: https://medium.com/@manuzhang/the-hidden-cost-of-spark-withcolumn-8ffea517c015
You are naming your new columns with .alias(col). That means they have the same name as the columns you used to create them. During creation (using .withColumn) this does not pose a problem, but as soon as you try to select, Spark does not know which column to pick.
You could fix it, for example, by giving the new columns a suffix:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]

df = (
    source_df.select([trim_and_lower_col(col).alias(f"{col}_new") for col in cols_to_transform] + source_df.columns)
)
Another solution, which does pollute the DAG though, would be:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]

for col in cols_to_transform:
    source_df = source_df.withColumn(col, trim_and_lower_col(col))
If you only have these few withColumn calls, keep using them. It's still way more readable, and thus more maintainable and self-explanatory. If you look into it, you'll see that Spark only warns you to be careful with withColumn when you have on the order of 200 of them. Using select also makes your code more error prone, since it's more complex to read.
Now, if you have many columns, I would define:
the list of columns to transform,
the list of columns to keep,
and then do the select:
cols_to_transform = [
    "browser",
    "browser_type",
    "domains"
]

cols_to_keep = [c for c in source_df.columns if c not in cols_to_transform]
cols_transformed = [trim_and_lower_col(c).alias(c) for c in cols_to_transform]

source_df.select(*cols_to_keep, *cols_transformed)
This keeps everything in a single select, though the transformed columns will end up after the kept ones rather than in their original positions.
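For reference, here is a minimal, self-contained sketch of that approach; the sample column names and rows below are invented purely for illustration:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def trim_and_lower_col(col_name):
    return F.when(F.trim(col_name) == "", F.lit("unspecified")).otherwise(F.lower(F.trim(col_name)))

# Invented sample data, just to make the sketch runnable
source_df = spark.createDataFrame(
    [("  Chrome ", "Desktop", "Example.COM", 1), ("", "Mobile", "example.org", 2)],
    ["browser", "browser_type", "domains", "visit_id"],
)

cols_to_transform = ["browser", "browser_type", "domains"]
cols_to_keep = [c for c in source_df.columns if c not in cols_to_transform]
cols_transformed = [trim_and_lower_col(c).alias(c) for c in cols_to_transform]

source_df.select(*cols_to_keep, *cols_transformed).show()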

Pattern match using regexp_extract_all

I am trying to build an array from this string and need help with the pattern for REGEXP_EXTRACT_ALL.
Here is my input string, containing keyword-value pairs:
BEGIN
DECLARE p_JSON STRING DEFAULT """
{
    "instances": [{
        "LT_20MN_SalesContrctCnt": 388.0,
        "Pyramid_Index": '',
        "MARKET": "'Growth Markets','Europe'",
        "SERVICE_DIM": "'S&C','F&M'",
        "SG_MD": "'All Service Group'"
    }]}
""";

SELECT split(x,":")[OFFSET(0)] as keyword, split(x,":")[OFFSET(1)] keyword_value
FROM unnest(split(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'), r'([\'\"\[\]{}])', ''))) as x
END;
The above SQL is failing at SPLIT because of the commas within the data.
All I am trying to do here is build two columns, keyword and keyword_value.
The idea is that if I can extract each row using REGEXP_EXTRACT_ALL without the trailing ",", then I should be able to split it into keyword and keyword_value columns. By the way, the names and the number of keywords/values are not fixed.
Intended output from REGEXP_EXTRACT_ALL:
"LT_20MN_SalesContrctCnt": 388.0
"Pyramid_Index": ''
"MARKET": "'Growth Markets','Europe'"
"SERVICE_DIM": "'S&C','F&M'"
"SG_MD": "'All Service Group'"
Appreciate if you can suggest a better way to handle this.
Thanks in advance.
Using your sample data, I just added an extra REGEXP_REPLACE to replace ," with #" so we can avoid splitting on the commas inside the values. See the approach below:
SELECT
    SPLIT(arr, ":")[OFFSET(0)] AS keyword,
    SPLIT(arr, ":")[OFFSET(1)] AS keyword_value,
FROM sample_data,
UNNEST(SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'), r'[\[\]{}]', ''), r',"', '#"'), '#')) arr
Output: one row per pair, with keyword and keyword_value columns matching the intended output listed in the question.
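To sanity-check the replace-then-split idea outside of BigQuery, here is a small Python sketch that applies the same two substitutions to (roughly) what JSON_EXTRACT(p_JSON, '$.instances') returns; it is only an illustration of the parsing logic, not part of the BigQuery solution:

import re

# Roughly the extracted '$.instances' value from the question
instances = '''[{"LT_20MN_SalesContrctCnt": 388.0,"Pyramid_Index": '',"MARKET": "'Growth Markets','Europe'","SERVICE_DIM": "'S&C','F&M'","SG_MD": "'All Service Group'"}]'''

# Drop brackets/braces, then turn the pair separator ," into #" so we can split on #
cleaned = re.sub(r'[\[\]{}]', '', instances)
cleaned = re.sub(r',"', '#"', cleaned)

for pair in cleaned.split('#'):
    keyword, _, keyword_value = pair.partition(':')
    print(keyword.strip(), '->', keyword_value.strip())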

Right way to implement pandas.read_sql with ClickHouse

I am trying to use the pandas.read_sql function. I created a ClickHouse table and filled it:
create table regions
(
    date DateTime Default now(),
    region String
)
engine = MergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY tuple()
SETTINGS index_granularity = 8192;

insert into regions (region) values ('Asia'), ('Europe')
Then the Python code:
import pandas as pd
from sqlalchemy import create_engine
uri = 'clickhouse://default:@localhost/default'
engine = create_engine(uri)
query = 'select * from regions'
pd.read_sql(query, engine)
As the result I expected to get a dataframe with the columns date and region, but all I get is an empty dataframe:
Empty DataFrame
Columns: [2021-01-08 09:24:33, Asia]
Index: []
UPD: It turned out that using the clickhouse+native dialect solves the problem.
Can it be solved without +native?
There is an ancient issue: https://github.com/xzkostyan/clickhouse-sqlalchemy/issues/10. It also contains a hint which suggests adding FORMAT TabSeparatedWithNamesAndTypes at the end of the query, so the initial query will look like this:
select *
from regions
FORMAT TabSeparatedWithNamesAndTypes
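A minimal sketch putting both workarounds side by side (the connection strings reuse the question's default user and database; adjust credentials and host as needed):

import pandas as pd
from sqlalchemy import create_engine

# Option 1: the native-protocol dialect mentioned in the question's update
engine_native = create_engine('clickhouse+native://default:@localhost/default')
df = pd.read_sql('select * from regions', engine_native)

# Option 2: stay on the HTTP dialect and append the FORMAT hint from the linked issue
engine_http = create_engine('clickhouse://default:@localhost/default')
df_http = pd.read_sql('select * from regions FORMAT TabSeparatedWithNamesAndTypes', engine_http)

print(df.head())
print(df_http.head())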

rename columns in dataframe pyspark adding a string

I have written code in Python using Pandas that adds "VEN_" to the beginning of the column names:
Tablon.columns = "VEN_" + Tablon.columns
And it works fine, but now I'm working with PySpark and it doesn't work.
I've tried:
Vaa_total.columns = ['Vaa_' + col for col in Vaa_total.columns]
or
for elemento in Vaa_total.columns:
    elemento = "Vaa_" + elemento
And other things like that but it doesn't work.
I don't want to replace the column names, I just want to keep them but add a string at the beginning.
Try something like this:
for elemento in Vaa_total.columns:
    Vaa_total = Vaa_total.withColumnRenamed(elemento, "Vaa_" + elemento)
I linked a similar topic in a comment. Here's an example adapted from that topic to your task:
dataframe.select([col(col_name).alias('VAA_' + col_name) for col_name in dataframe.columns])
A standard way of writing it:
renamed_df = df
for col_name in df.columns:
    renamed_df = renamed_df.withColumnRenamed(col_name, "insert_text" + col_name)
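Another option, not from the answers above but a common PySpark idiom, is DataFrame.toDF, which renames all columns in one call; a minimal sketch using the question's Vaa_total frame:

# Prefix every column name at once; Vaa_total is assumed to be an existing DataFrame
Vaa_total = Vaa_total.toDF(*["Vaa_" + c for c in Vaa_total.columns])
Vaa_total.printSchema()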

query using string in PyTables 3

I have a table:
from tables import *

h5file = open_file("ex.h5", "w")

class ex(IsDescription):
    A = StringCol(5, pos=0)
    B = StringCol(5, pos=1)
    C = StringCol(5, pos=2)

table = h5file.create_table('/', 'table', ex, "Passing string as column name")
table = h5file.root.table

rows = [
    ('abc', 'bcd', 'dse'),
    ('der', 'fre', 'swr'),
    ('xsd', 'weq', 'rty')
]
table.append(rows)
table.flush()
I am trying to query as per below:
find = 'swr'
creteria = 'B'

if creteria == 'B':
    condition = 'B'
else:
    condition = 'C'

value = [x['A'] for x in table.where("""condition==find""")]
print(value)
It returns:
ValueError: there are no columns taking part in condition condition==find
Is there a way to use condition as a column name in the above query?
Thanks in advance.
Yes, you can use PyTables .where() to search based on a condition. The problem is how you constructed your query for table.where(condition). See the note about strings under Table.where() in the PyTables Users Guide:
A special care should be taken when the query condition includes string literals. ... Python 3 strings are unicode objects.
In Python 3, "condition" should be defined like this:
condition = 'col1 == b"AAAA"'
The reason is that in Python 3 “condition” implies a comparison between a string of bytes (“col1” contents) and an unicode literal (“AAAA”).
The simplest form of your query is shown below. It returns a subset of rows that match the condition. Note the use of single and double quotes for string and unicode:
query_table = table.where('C=="swr"') # search in column C
I rewrote your example as best I could. See below. It shows several ways to enter the condition. I'm not smart enough to figure out how to combine your creteria and find variables into a single condition variable with string and unicode characters.
from tables import *

class ex(IsDescription):
    A = StringCol(5, pos=0)
    B = StringCol(5, pos=1)
    C = StringCol(5, pos=2)

h5file = open_file("ex.h5", "w")
table = h5file.create_table('/', 'table', ex, "Passing string as column name")
## table = h5file.root.table

rows = [
    ('abc', 'bcd', 'dse'),
    ('der', 'fre', 'swr'),
    ('xsd', 'weq', 'rty')
]
table.append(rows)
table.flush()

find = 'swr'

query_table = table.where('C==find')
for row in query_table:
    print(row)
    print(row['A'], row['B'], row['C'])

value = [x['A'] for x in table.where('C == "swr"')]
print(value)

value = [x['A'] for x in table.where('C == find')]
print(value)

h5file.close()
Output shown below:
/table.row (Row), pointing to row #1
b'der' b'fre' b'swr'
[b'der']
[b'der']
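As a follow-up to the point about combining creteria and find into one condition: a sketch of one way to do it, assuming the same table and variables as above, is to pick the column name first and interpolate only that into the condition string; the find variable is then resolved from the local namespace by table.where(), just as in the examples above:

find = 'swr'
creteria = 'B'

# Map creteria to the column to search, as in the question
condition_col = 'B' if creteria == 'B' else 'C'

# Build the condition string; only the column name is interpolated,
# while 'find' is looked up as a condition variable by table.where()
condition = f'{condition_col} == find'

value = [x['A'] for x in table.where(condition)]
print(value)  # with creteria = 'B' this searches column B; change creteria to search column C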