Filtering on a column: PySpark dataframe

I want to filter a column of a dataframe to keep only the values that are numbers (digit codes). Sample data:
main_column
HKA1774348
null
774970331205
160-27601033
SGSIN/62/898805
null
LOCAL
217-29062806
null
176-07027893
724-22100374
297-00371663
217-11580074
I want to obtain this column:
main_column
774970331205
160-27601033
217-29062806
176-07027893
724-22100374
297-00371663
217-11580074

You can use rlike with a regexp that only allows digits and hyphens:
df.where(df['main_column'].rlike(r'^[0-9-]+$')).show()
Output:
+------------+
| main_column|
+------------+
|774970331205|
|160-27601033|
|217-29062806|
|176-07027893|
|724-22100374|
|297-00371663|
|217-11580074|
+------------+
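For context, a minimal runnable sketch that builds a few of the question's sample values and applies the same filter (null rows never match rlike, so they are dropped too):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# a few representative values from the question, including nulls and non-numeric codes
df = spark.createDataFrame(
    [("HKA1774348",), (None,), ("774970331205",), ("160-27601033",),
     ("SGSIN/62/898805",), ("LOCAL",), ("217-29062806",)],
    ["main_column"],
)

# keep only values made up of digits and hyphens
df.where(df["main_column"].rlike(r"^[0-9-]+$")).show()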

Related

pyspark hive sql convert array(map(varchar, varchar)) to string by rows

I would like to transform a column of array(map(varchar, varchar)) into strings, as rows of a table, on a Presto DB, via PySpark Hive SQL programmatically from a Jupyter notebook (Python 3).
Example:
user_id | sport_ids
'aca'   | [ {'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'} ]
Expected results:
user_id | sport_ids
'aca'   | '5818'
'aca'   | '6712'
'aca'   | '1065'
I have tried
sql_q= """
select distinct, user_id, transform(sport_ids, x -> element_at(x, 'sport_id')
from tab """
spark.sql(sql_q)
but got error:
'->' cannot be resolved
I have also tried
sql_q= """
select distinct, user_id, sport_ids
from tab"""
spark.sql(sql_q)
but got error:
org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column request_features[0] is map<string,string>;;
Did I miss something?
I have also tried these, but they were not helpful:
hive convert array<map<string, string>> to string
Extract map(varchar, array(varchar)) - Hive SQL
thanks
Let's use higher-order functions to extract the map values and explode them into individual rows:
from pyspark.sql.functions import explode, expr

df.withColumn('sport_ids', explode(expr("transform(sport_ids, x -> map_values(x)[0])"))).show()
+-------+---------+
|user_id|sport_ids|
+-------+---------+
| aca| 5818|
| aca| 6712|
| aca| 1065|
+-------+---------+
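To reproduce this locally, a minimal sketch that builds the question's single sample row as array<map<string,string>> (the real df is presumably read from Hive, so the construction here is only for illustration); running the withColumn line above on it yields the three rows shown:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Python dicts become map<string,string>, so sport_ids matches array(map(varchar, varchar))
df = spark.createDataFrame(
    [("aca", [{"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"}])],
    ["user_id", "sport_ids"],
)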
Alternatively, you can process the JSON data on the Presto side (json_parse, cast to array(json), then json_extract_scalar; see the Presto JSON functions documentation for more) and flatten it with unnest:
-- sample data
WITH dataset(user_id, sport_ids) AS (
VALUES
('aca', '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]')
)
-- query
select user_id,
json_extract_scalar(record, '$.sport_id') sport_id
from dataset,
unnest(cast(json_parse(sport_ids) as array(json))) as t(record)
Output:
user_id | sport_id
--------+---------
aca     | 5818
aca     | 6712
aca     | 1065

How to add Date validation in regexp of Spark-Sql

spark.sql("select case when length(regexp_replace(date,'[^0-9]', ''))==8 then regexp_replace(date,'[^0-9]', '') else regexp_replace(date,'[^0-9]','') end as date from input").show(false)
In the above, I need to add the following requirements:
1. The output should be validated against the format 'yyyyMMdd' using unix_timestamp.
2. If it is not valid, the extracted digits string should be transformed by moving its first four characters to the end (MMddyyyy to yyyyMMdd) and then validated against 'yyyyMMdd' again; if that check passes, print that date.
I'm not sure how to include unix_timestamp in my query.
Sample input and output 1:
input: 2021dgsth02hdg02
output: 20210202
Sample input and output 2:
input: 0101def20dr21 (note: MMDDYYYY TO YYYYMMDD)
output: 20210101
Using unix_timestamp in place of to_date
spark.sql("select (case when length(regexp_replace(date,'[^0-9]', ''))==8 then CASE WHEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') ,'yyyyMMdd') IS NULL THEN from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z]+',''),'MMddyyyy') ,'MMddyyyy') ELSE from_unixtime(unix_timestamp(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') ,'yyyyMMdd') END else regexp_replace(date,'[^0-9]','') end ) AS dt from input").show(false)
Try the code below.
scala> val df = Seq("2021dgsth02hdg02","0101def20dr21").toDF("dt")
df: org.apache.spark.sql.DataFrame = [dt: string]
scala> df.show(false)
+----------------+
|dt |
+----------------+
|2021dgsth02hdg02|
|0101def20dr21 |
+----------------+
scala> df
.withColumn("dt",regexp_replace($"dt","[a-zA-Z]+",""))
.withColumn("dt",
when(
to_date($"dt","yyyyMMdd").isNull,
to_date($"dt","MMddyyyy")
)
.otherwise(to_date($"dt","yyyyMMdd"))
).show(false)
+----------+
|dt |
+----------+
|2021-02-02|
|2021-01-01|
+----------+
// Entering paste mode (ctrl-D to finish)
spark.sql("""
select (
CASE WHEN to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd') IS NULL
THEN to_date(regexp_replace(date,'[a-zA-Z]+',''),'MMddyyyy')
ELSE to_date(regexp_replace(date,'[a-zA-Z]+',''),'yyyyMMdd')
END
) AS dt from input
""")
.show(false)
// Exiting paste mode, now interpreting.
+----------+
|dt |
+----------+
|2021-02-02|
|2021-01-01|
+----------+
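The two snippets above return a date column (shown as 2021-02-02). If the literal yyyyMMdd string from the question (20210202) is required, here is a hedged PySpark sketch of the same logic, with coalesce standing in for the when/isNull/otherwise chain and date_format producing the final string:

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, col, date_format, regexp_replace, to_date

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021dgsth02hdg02",), ("0101def20dr21",)], ["dt"])

digits = regexp_replace(col("dt"), "[a-zA-Z]+", "")            # strip the letters
parsed = coalesce(to_date(digits, "yyyyMMdd"),                 # try yyyyMMdd first
                  to_date(digits, "MMddyyyy"))                 # fall back to MMddyyyy

# date_format renders the validated date back as a yyyyMMdd string: 20210202, 20210101
df.withColumn("dt", date_format(parsed, "yyyyMMdd")).show(truncate=False)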

Get value from json dimensional array in oracle

I have the JSON below, from which I need to fetch the value of issuedIdentValue where issuedIdentType = 'PANCARD':
{
  "issuedIdent": [
    {"issuedIdentType": "DriversLicense", "issuedIdentValue": "9797979797979797"},
    {"issuedIdentType": "SclSctyNb", "issuedIdentValue": "078-01-8877"},
    {"issuedIdentType": "PANCARD", "issuedIdentValue": "078-01-8877"}
  ]
}
I cannot hard-code the index value [2] in my query below, because the order of these records can change, so I want to get rid of any hard-coded index.
select json_value(
'{"issuedIdent": [{"issuedIdentType":"DriversLicense","issuedIdentValue":"9797979797979797"},{"issuedIdentType":"SclSctyNb","issuedIdentValue":"078-01-8877"}, {"issuedIdentType":"PANCARDSctyNb","issuedIdentValue":"078-01-8877"}]}',
'$.issuedIdent[2].issuedIdentValue'
) as output
from d1entzendev.ExternalEventLog
where
eventname = 'CustomerDetailsInqSVC'
and applname = 'digitalBANKING'
and requid = '4fe1fa1b-abd4-47cf-834b-858332c31618';
What changes do I need to apply to the json_value function to achieve the expected result?
In Oracle 12c or higher, you can use JSON_TABLE() for this:
select value
from json_table(
'{"issuedIdent": [{"issuedIdentType":"DriversLicense","issuedIdentValue":"9797979797979797"},{"issuedIdentType":"SclSctyNb","issuedIdentValue":"078-01-8877"}, {"issuedIdentType":"PANCARD","issuedIdentValue":"078-01-8877"}]}',
'$.issuedIdent[*]' columns
type varchar(50) path '$.issuedIdentType',
value varchar(50) path '$.issuedIdentValue'
) t
where type = 'PANCARD'
This returns:
| VALUE |
| :---------- |
| 078-01-8877 |

Pass date values from dataframe to query in Spark/Scala

I have a dataframe with a date column whose values are shown below.
df.show()
+----+----------+
|name| dob|
+----+----------+
| Jon|2001-04-15|
| Ben|2002-03-01|
+----+----------+
Now I need to query a Hive table for the "dob" values from the above dataframe (both 2001-04-15 and 2002-03-01), so I need to pass the values in the dob column as a parameter to my Hive query.
I tried to collect the values into a variable as below, which gives me an array of strings.
val dobRead = df.select("updt_d").distinct().as[String].collect()
dobRead: Array[String] = Array(2001-04-15, 2002-03-01)
However, when I try to pass it to the query, I can see it is not substituted properly and I get an error.
val tableRead = hive.executeQuery(s"select emp_name,emp_no,martial_status from <<table_name>> where dateOfBirth in ($dobRead)")
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:480 cannot recognize input near '(' '[' 'Ljava' in expression specification
Can you please help me with how to pass date values to a query in Spark?
You can collect the dates as follows (Row.getAs):
val rows: Array[Row] = df.select("updt_d").distinct().collect()
val dates: Array[String] = rows.map(_.getAs[String](0))
And then build the query:
val hql: String = s"select ... where dateOfBirth in (${
dates.map(d => s"'${d}'").mkString(", ")
})"
Option 2
If the number of dates in the first DataFrame is too big, you should use a join operation instead of collecting them into the driver.
First, load each table as a DataFrame (I'll call them dfEmp and dfDates). Then you can join on the date fields to filter, either using a standard inner join and filtering out null fields, or using a left_semi join directly:
val dfEmp = hiveContext.table("EmpTable")
val dfEmpFiltered = dfEmp.join(dfDates,
col("dateOfBirth") === col("updt_d"), "left_semi")

How to combine two columns with a colon ":" in Postgres

I have a table as follows
name1 | name2
-------+--------
ishi | python
ishi | scala
ishi | java
sangee | java
sangee | c#
I need an output as follows:
name
----------------
ishi : python
ishi : scala
ishi : java
sangee : java
sangee : c#
How can I join the two columns into one, concatenated with a colon ':'?
You can use concat_ws() for that:
select concat_ws(' : ', name1, name2) as name
from the_table;
concat_ws() properly deals with NULL values and empty strings (unlike, e.g., name1 || ' : ' || name2).
Following up on your previous question's answer, using the CONCAT() function will give the expected result:
select CONCAT(st.name1, ' : ', dm.name2) AS name
from mainpk ms
join student st on st.id1 = ms.id1
join domain dm on dm.id2 = ms.id2
or using the string concatenation operator ||
select st.name1 || ' : ' || dm.name2 AS name
....