Replace part of string in columns names - pandas

I have an issue with changing column names. There is a part that I have to remove. The column names look like this:
Column.name_1 Column.name_2 Column.*.name_3 Column_name_4 Column_*_name_5
I wrote a line of code that changes the dot and asterisk into an underscore:
df_check.columns = df_check.columns.str.replace('.*.', '_')
But I get
Column_name_1 Column_name_2 Column___name_3 Column_name_4 Column___name_5
And I need the result below, with only one underscore:
Column_name_1 Column_name_2 Column_name_3 Column_name_4 Column_name_5
Can you help me with this?
Regards

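You could use a character class that collapses any run of dots, asterisks, and underscores into a single underscore: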
df_check.columns = df_check.columns.str.replace(r'[.*_]+', '_', regex=True)
Output names:
['Column_name_1', 'Column_name_2', 'Column_name_3', 'Column_name_4', 'Column_name_5']
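To see what the pattern does in isolation, here is a minimal sketch that applies the same regular expression with the standard library's re module to a plain list of names. The list and variable names are illustrative only; with regex=True, pandas' str.replace follows Python's re syntax, so the pattern should behave the same way there.

```python
import re

# Column names copied from the question.
names = [
    "Column.name_1",
    "Column.name_2",
    "Column.*.name_3",
    "Column_name_4",
    "Column_*_name_5",
]

# [.*_]+ matches one or more consecutive dots, asterisks, or underscores.
# Inside a character class, '.' and '*' are literal characters.
separator_run = re.compile(r"[.*_]+")

# Replace each run with exactly one underscore.
cleaned = [separator_run.sub("_", name) for name in names]
print(cleaned)
```

Because the whole run is matched at once, ".*." and "_*_" each become a single underscore rather than three.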

Related

Regexp_Replace in pyspark not working properly

I am reading a csv file which is something like:
"ZEN","123"
"TEN","567"
Now, if I replace the character E with regexp_replace, it does not give the correct result:
from pyspark.sql.functions import row_number, col, desc, date_format, to_date, to_timestamp, regexp_replace
from pyspark.sql.types import StructType, StringType

inputDirPath = "/FileStore/tables/test.csv"
schema = StructType()
# `fields` (the list of column names) is not shown in the question
for field in fields:
    colType = StringType()
    schema.add(field.strip(), colType, True)
incr_df = (spark.read.format("csv")
           .option("header", "false")
           .schema(schema)
           .option("delimiter", "\u002c")
           .option("nullValue", "")
           .option("emptyValue", "")
           .option("multiline", True)
           .csv(inputDirPath))
for column in incr_df.columns:
    inc_new = incr_df.withColumn(column, regexp_replace(column, "E", ""))
inc_new.show()
It is not giving the correct result; it appears to do nothing.
Note: I have 100+ columns, so I need to use a for loop.
Can someone help me spot my error?
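Each iteration of your loop starts again from incr_df, so inc_new only ever holds the replacement for the last column. In the sample data that column contains no E, so it looks as if nothing happened. A list comprehension is neater and easier. Let's try:
inc_new = incr_df.select(*[regexp_replace(x, 'E', '').alias(x) for x in incr_df.columns])
inc_new.show()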
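The loop pitfall can be reproduced without Spark. The sketch below is only an analogy: a dict of lists stands in for a DataFrame, and with_column is a hypothetical helper that, like DataFrame.withColumn, returns a new object instead of modifying its input. The column names c0 and c1 are made up.

```python
import re

def with_column(frame, name, func):
    # Return a NEW frame with one column transformed; the input is untouched,
    # mimicking how DataFrame.withColumn returns a new DataFrame.
    new_frame = dict(frame)
    new_frame[name] = [func(value) for value in frame[name]]
    return new_frame

def strip_e(value):
    # Remove every 'E' from the value.
    return re.sub("E", "", value)

# Stand-in for incr_df, using the two sample rows from the question.
incr = {"c0": ["ZEN", "TEN"], "c1": ["123", "567"]}

# Pattern from the question: every pass restarts from `incr`,
# so only the change to the last column survives.
for column in incr:
    buggy = with_column(incr, column, strip_e)

# Chained version: every pass builds on the previous result.
fixed = incr
for column in incr:
    fixed = with_column(fixed, column, strip_e)

print(buggy)
print(fixed)
```

In PySpark the equivalent loop fix would be to assign incr_df to inc_new before the loop and call inc_new.withColumn(...) inside it, although the single select above avoids building a long chain of withColumn calls for 100+ columns.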

Pattern match using regexp_extract_all

I am trying to build an array from this string and need help with the pattern for REGEXP_EXTRACT_ALL.
Here is my input string, which contains keyword/value pairs:
BEGIN
DECLARE p_JSON STRING DEFAULT """
{
"instances": [{
"LT_20MN_SalesContrctCnt": 388.0,
"Pyramid_Index": '',
"MARKET": "'Growth Markets','Europe'",
"SERVICE_DIM": "'S&C','F&M'",
"SG_MD": "'All Service Group'"
}]}
""";
SELECT split(x,":")[OFFSET(0)] as keyword, split(x,":")[OFFSET(1)] keyword_value
FROM unnest(split(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'),r'([\'\"\[\]{}])', ''))) as x
END;
The SQL above fails at SPLIT because of the commas within the data.
All I am trying to do here is build two columns: keyword and value.
The idea is that if I can extract each row using REGEXP_EXTRACT_ALL without the trailing ",", I should be able to split it into keyword and keyword_value columns. By the way, the names and the number of keywords/values are not fixed.
Intended output from REGEXP_EXTRACT_ALL:
"LT_20MN_SalesContrctCnt": 388.0
"Pyramid_Index": ''
"MARKET": "'Growth Markets','Europe'"
"SERVICE_DIM": "'S&C','F&M'"
"SG_MD": "'All Service Group'"
I would appreciate it if you could suggest a better way to handle this.
Thanks in advance.
Using your sample data, I just added an extra REGEXP_REPLACE that replaces each ," with #", so we can split on # instead of on the comma. See the approach below:
SELECT
SPLIT(arr,":")[OFFSET(0)] as keyword,
SPLIT(arr,":")[OFFSET(1)] as keyword_value,
FROM sample_data,
UNNEST(SPLIT(REGEXP_REPLACE(REGEXP_REPLACE(JSON_EXTRACT(p_JSON, '$.instances'),r'[\[\]{}]',''),r',"','#"'),'#')) arr
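The same sequence of steps can be traced outside BigQuery. The sketch below uses Python's re module on a hand-written string that approximates what JSON_EXTRACT(p_JSON, '$.instances') might return; the exact serialization is an assumption, and the empty Pyramid_Index value is written as "" here so that the sample is valid JSON.

```python
import re

# Assumed compact rendering of the '$.instances' array from the question.
instances = (
    '[{"LT_20MN_SalesContrctCnt":388.0,'
    '"Pyramid_Index":"",'
    '"MARKET":"\'Growth Markets\',\'Europe\'",'
    '"SERVICE_DIM":"\'S&C\',\'F&M\'",'
    '"SG_MD":"\'All Service Group\'"}]'
)

# Step 1: drop the brackets and braces, as the inner REGEXP_REPLACE does.
stripped = re.sub(r"[\[\]{}]", "", instances)

# Step 2: a comma that starts a new key is always followed by a double quote,
# while commas inside the values are followed by a single quote.
marked = stripped.replace(',"', '#"')

# Step 3: split on the marker, then split each piece at the first colon.
pairs = [tuple(piece.split(":", 1)) for piece in marked.split("#")]

for keyword, keyword_value in pairs:
    print(keyword, "->", keyword_value)
```

The approach relies on # never appearing in the data and on no value containing the sequence ," itself; if either can happen, a different marker or a JSON-aware function would be the safer choice.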

Select first and last three strings from column for condition - SQL

My goal is to select all the rows whose flight_path starts and ends with the same three characters as the first row.
In this case it was simple, since the CONCAT was equal to 'SCLMIA':
AND CONCAT(origin, destination) = 'SCLMIA'
AND (flight_path LIKE '%SCL%' AND flight_path LIKE '%MIA%')
But now the difficulty is doing this for multiple strings:
AND CONCAT(origin, destination) IN ('SCLMIA', 'SCLIQQ','SCLMAD', 'LIMCUZ', 'BOGMDE', 'FORGRU', 'SDUCGH', 'SCLGRU', 'BOGLIM', 'GYEUIO')
AND (**here I need to replicate the same as above.**)
I read that it can be done with the functions SUBSTRING, LEFT and RIGHT, selecting the first and last three characters, but I don't know how to do it.
I tried this, but it failed:
AND (flight_path LIKE '%' + SUBSTR(flight_path,3, LENGTH(flight_path) - 4) + '%')
Note that this is part of a chain of conditions, which is why it starts with AND.
Edit:
Image: Sample of data single path 'SCLMIA'
It's from BigQuery.
I think this is what you're trying to do:
SELECT *
FROM
flight_paths
WHERE
CONCAT(origin, destination) IN ('SCLMIA', 'SCLIQQ', 'SCLMAD', 'LIMCUZ', 'BOGMDE', 'FORGRU', 'SDUCGH', 'SCLGRU', 'BOGLIM', 'GYEUIO')
AND LEFT(flight_path, 3) = origin
AND RIGHT(flight_path, 3) = destination
Here's a db-fiddle that demonstrates the answer:
https://www.db-fiddle.com/f/vUZ4HL4NC9xaBBZpwTYNcR/0
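As a rough illustration of the filter logic, here is a Python sketch over a few made-up rows. It assumes that flight_path begins with the three-letter origin code and ends with the three-letter destination code; the sample rows, the hyphen-separated path format, and the helper name matches_route are hypothetical and not taken from the question's data.

```python
# Origin+destination pairs from the question's IN list.
ROUTES = {
    "SCLMIA", "SCLIQQ", "SCLMAD", "LIMCUZ", "BOGMDE",
    "FORGRU", "SDUCGH", "SCLGRU", "BOGLIM", "GYEUIO",
}

def matches_route(origin, destination, flight_path):
    # Keep a row only if the route is in the list and the path
    # starts with the origin and ends with the destination.
    return (
        origin + destination in ROUTES
        and flight_path[:3] == origin        # like LEFT(flight_path, 3)
        and flight_path[-3:] == destination  # like RIGHT(flight_path, 3)
    )

# Hypothetical sample rows: (origin, destination, flight_path).
rows = [
    ("SCL", "MIA", "SCL-LIM-MIA"),  # kept
    ("SCL", "MIA", "LIM-SCL-MIA"),  # dropped: path does not start with SCL
    ("LIM", "CUZ", "LIM-CUZ"),      # kept
    ("SCL", "EZE", "SCL-EZE"),      # dropped: route not in the list
]

kept = [row for row in rows if matches_route(*row)]
print(kept)
```

Comparing against the row's own origin and destination is what removes the need to repeat a pair of LIKE conditions for every route in the list.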

Search for any of a list of strings inside another string

I need to identify records with valid addresses by comparing the address fields against a list of street-like words.
So the code would look something like:
set street_list = 'STREET', 'ROAD', 'AVENUE', 'DRIVE', 'WAY', 'PLACE' (etc.)
;
create table [new table] as
select *
from [source table]
where [address line 1] (contains any word from STREET_LIST) or
[address line 2] (contains any word from STREET_LIST) or
[address line 3] (contains any word from STREET_LIST)
;
Is this possible?
Using LostReality's regexp suggestion, I got as far as:
select *
from [source table]
where upper([address line 1]) regexp '.* STREET.*|.* ST.*|.* ROAD.*|.* RD.*|.* CLOSE.*|.* LANE.*|.* LA.*|.* AVENUE.*|.* AVE.*|.* DRIVE.*|.* DR.*|.* HOUSE.*|.* WAY.*|.* PLACE.*|.* SQUARE.*|.* WALK.*|.* GROVE.*|.* GREEN.*|.* PARK.*|.* PK.*|.* CRESCENT.*|.* TERRACE.*|.* PARADE.*|.* GARDEN.*|.* GARDENS.*|.* COURT.*|.* COTTAGES.*|.* COTTAGE.*|.* MEWS.*|.* ESTATE.*|.* RISE.*|.* FARM.*'
;
and it seems to work.
But I have two small problems with it:
1) how do I write the regexp on more than one line so it's easier to read?
2) is there any way of putting that regexp into a macro variable because I want to check 5 address lines and I don't want 5 copies of the same expression.
Thanks
Here is a solution for Hive. You can put the regexp pattern in a variable and also wrap the check in a macro. I adapted your template:
set hivevar:street_list ='STREET|ST|ROAD|RD|CLOSE|LANE|LA|AVENUE|AVE|DRIVE|DR|HOUSE|WAY|PLACE|SQUARE|WALK|GROVE|GREEN|PARK|PK|CRESCENT|TERRACE|PARADE|GARDEN|GARDENS|COURT|COTTAGES|COTTAGE|MEWS|ESTATE|RISE|FARM';
--boolean macro for using in the WHERE
create temporary macro contains_word(s string) (upper(s) rlike ${hivevar:street_list} ) ;
with some_table as ( --use your table instead of this synthetic example
select stack(2,'some string containing STREET and WALK',
'some string containing something else') as str
) --use your table instead of this synthetic example
--use macro in your query
select str from some_table
where contains_word(str);
Result:
OK
some string containing STREET and WALK
Time taken: 0.229 seconds, Fetched: 1 row(s)
Use OR like in your question:
where contains_word(address_line_1) OR contains_word(address_line_2) ...
I hope you get the idea.
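For comparison, the same two ideas (keeping the word list readable across several lines and reusing one check for every address line) can be sketched in Python. This is not Hive code. The names STREET_WORDS, contains_street_word, and has_valid_address are illustrative, the list is shortened, and the \b word boundaries are an addition that the Hive pattern above does not have; they stop short tokens such as ST from matching inside words like WEST.

```python
import re

# The list can span as many lines as needed; it is joined into one pattern.
STREET_WORDS = [
    "STREET", "ST", "ROAD", "RD",
    "AVENUE", "AVE", "DRIVE", "DR",
    "WAY", "PLACE",
]

# One compiled alternation, anchored on word boundaries (an assumption
# about what counts as a match, not part of the original answer).
STREET_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in STREET_WORDS) + r")\b"
)

def contains_street_word(text):
    # Treat a missing address line as an empty string.
    return bool(STREET_RE.search((text or "").upper()))

def has_valid_address(*address_lines):
    # True if any of the address lines contains a street-like word.
    return any(contains_street_word(line) for line in address_lines)

print(has_valid_address("Flat 2", None, "3 Mill Rd"))
print(has_valid_address("West Bromwich", "Unit 7"))
```

Defining the pattern once and calling a single helper for each line plays the same role as the hivevar variable and the temporary macro.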

How should I add text to my column's select statement

This is my SQL code:
SELECT sessionname, left(comment,4)
FROM moma_reporting.comments where name like '%_2016_02_%'
and comment = '1200'
My output is:
"WE247JP_2016_02_07__14_48_18";"1200"
"FORD49_2016_02_03__12_42_24";"1200"
"1-GRB-804_2016_02_06__08_20_15";"1200"
What I want to do is add text to the comment column so it will look like this:
"WE247JP_2016_02_07__14_48_18";"1200-QC"
"FORD49_2016_02_03__12_42_24";"1200-QC"
"1-GRB-804_2016_02_06__08_20_15";"1200-QC"
How can I do this?
Just concatenate it using the || operator:
SELECT sessionname, left(comment,4) || '-QC'
FROM moma_reporting.comments
where name like '%_2016_02_%'
and comment = '1200'
Unrelated, but: left(comment,4) is useless. The condition comment = '1200' will never return comments that are longer than 4 characters.
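A quick way to try the concatenation without access to the original database is Python's built-in sqlite3 module, which also supports the || operator. The sketch below is only an approximation: the table is an in-memory stand-in, the third row is invented, the filter is applied to sessionname, and left() is omitted because it is redundant here, as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (sessionname TEXT, comment TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        ("WE247JP_2016_02_07__14_48_18", "1200"),
        ("FORD49_2016_02_03__12_42_24", "1200"),
        ("OTHER_2016_03_01__10_00_00", "1300"),  # hypothetical non-matching row
    ],
)

# || appends the literal '-QC' to the selected comment value.
rows = conn.execute(
    "SELECT sessionname, comment || '-QC' "
    "FROM comments "
    "WHERE sessionname LIKE '%_2016_02_%' AND comment = '1200' "
    "ORDER BY sessionname"
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Note that _ is itself a single-character wildcard in LIKE, so the pattern matches slightly more than a literal underscore would; that holds for the original query as well.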