How to replace column values using the na.replace() method with more than one dictionary? - dataframe

While replacing values of a column in a DataFrame using the replace method, how can we make use of a dictionary to do so? There are 5 dictionaries that have to be applied to the DataFrame. I am having problems with the syntax for doing this in one step. This dataset will be growing, so I am trying to find the most efficient way to chain the replace method or create a list of the column names.
ab_normaldict = {'0': 'Normal', '1': 'Abnormal', '999': 'Not Done'}
ethnicitydict = {'1': 'Hispanic or Latino', '2': 'Not Hispanic or Latino', '3': 'Unknown'}
racedict = {'1': 'American Indian or Alaska Native', '2': 'Asian', '3': 'Black or African American', '4': 'Native Hawaiian or Other Pacific Islander', '5': 'White', '7': 'More than one race', '6': 'Unknown/Other'}
sexdict = {'1': 'Female', '2': 'Male', '888': 'Other', '999': 'Unknown'}
df1 = spark.createDataFrame([
    ("person1", "0", "1", "2", "1"),
    ("person2", "1", "2", "1", "2"),
    ("person3", "999", "2", "3", "1"),
    ("person4", "Null", "1", "6", "1")])\
    .toDF("id", "abnormal", "ethnicity", "racedict", "sex")
I saw that the syntax is:
df1.na.replace(to_replace=ab_normaldict, 'abnormal')
df2 = df1
df2.na.replace(to_replace=sexdict, 'sex')
but I need something like below so I don't have to keep creating a new DataFrame:
df1.na.replace(to_replace=ab_normaldict, 'abnormal').na.replace(to_replace=sexdict, 'sex')

Your code works just fine for me (except for the positional argument: a positional argument can't follow a keyword argument, so drop the to_replace= keyword when passing the column name positionally)
(df1
.na.replace(ab_normaldict, 'abnormal')
.na.replace(sexdict, 'sex')
.show()
)
+---+--------+---------+--------+--------+
| id|abnormal|ethnicity|racedict|     sex|
+---+--------+---------+--------+--------+
|  1|  Normal| Abnormal|    Male|Abnormal|
|  2|Abnormal|     Male|Abnormal|    Male|
|  3|Not Done|     Male|       3|Abnormal|
|  4|    Null| Abnormal|       6|Abnormal|
+---+--------+---------+--------+--------+
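If more dictionaries will be added over time, one option (not from the original answer, just a sketch) is to keep the (dictionary, column) pairs in a list and fold the chained na.replace calls with functools.reduce, using the subset= keyword so each mapping only touches its own column:
from functools import reduce

# One entry per mapping; adding a new dictionary later is just one more pair.
replacements = [
    (ab_normaldict, "abnormal"),
    (ethnicitydict, "ethnicity"),
    (racedict, "racedict"),
    (sexdict, "sex"),
]

# Each na.replace returns a new DataFrame, so df1 itself is never modified.
df_mapped = reduce(
    lambda df, pair: df.na.replace(pair[0], subset=[pair[1]]),
    replacements,
    df1,
)
df_mapped.show()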

Related

How to split array of set to multiple columns in SparkSQL

I have array of set data in a pyspark dataframe like below.
-+-----------------------------------------------------------------------------------+-
| targeting_values |
-+-----------------------------------------------------------------------------------+-
| [('123', '123', '123'), ('abc', 'def', 'ghi'), ('jkl', 'mno', 'pqr'), (0, 1, 2)] |
-+-----------------------------------------------------------------------------------+-
I want 4 different columns, with one set in each column, like below.
-+----------------------+----------------------+-----------------------+--------------------+-
| value1 | value2 | value3 | value4 |
-+----------------------+----------------------+-----------------------+--------------------+-
| ('123', '123', '123')|('abc', 'def', 'ghi') | ('jkl', 'mno', 'pqr') | (0, 1, 2) |
-+----------------------+----------------------+-----------------------+--------------------+-
I was trying to achieve this by using split() but no luck.
I did not find another way to solve this issue.
So is there a good way to do this?
You can do it by exploding the array, then pivoting it:
// first create the data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val arrayStructData = Seq(
  Row(List(Row("123", "123", "123"), Row("abc", "def", "ghi"), Row("jkl", "mno", "pqr"), Row("0", "1", "2"))),
  Row(List(Row("456", "456", "456"), Row("qsd", "fgh", "hjk"), Row("aze", "rty", "uio"), Row("4", "5", "6")))
)
val arrayStructSchema = new StructType()
  .add("targeting_values", ArrayType(new StructType()
    .add("_1", StringType)
    .add("_2", StringType)
    .add("_3", StringType)))
val df = spark.createDataFrame(spark.sparkContext
  .parallelize(arrayStructData), arrayStructSchema)
df.show(false)
+--------------------------------------------------------------+
|targeting_values |
+--------------------------------------------------------------+
|[{123, 123, 123}, {abc, def, ghi}, {jkl, mno, pqr}, {0, 1, 2}]|
|[{456, 456, 456}, {qsd, fgh, hjk}, {aze, rty, uio}, {4, 5, 6}]|
+--------------------------------------------------------------+
// Then a combination of explode, creating an id, then pivoting it like this:
df.withColumn("id2", monotonically_increasing_id())
  .select(col("id2"), posexplode(col("targeting_values")))
  .withColumn("id", concat(lit("value"), col("pos") + 1))
  .groupBy("id2").pivot("id").agg(first("col")).drop("id2")
  .show(false)
+---------------+---------------+---------------+---------+
|value1 |value2 |value3 |value4 |
+---------------+---------------+---------------+---------+
|{123, 123, 123}|{abc, def, ghi}|{jkl, mno, pqr}|{0, 1, 2}|
|{456, 456, 456}|{qsd, fgh, hjk}|{aze, rty, uio}|{4, 5, 6}|
+---------------+---------------+---------------+---------+
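The same explode-and-pivot idea can be written with the PySpark DataFrame API; a rough equivalent sketch, assuming a DataFrame df with an array-of-struct column named targeting_values as above:
from pyspark.sql import functions as F

# Add a surrogate row id, explode the array together with each element's
# position, then pivot the position back out into value1..value4 columns.
result = (
    df.withColumn("id2", F.monotonically_increasing_id())
      .select("id2", F.posexplode("targeting_values"))
      .withColumn("name", F.concat(F.lit("value"), (F.col("pos") + 1).cast("string")))
      .groupBy("id2")
      .pivot("name")
      .agg(F.first("col"))
      .drop("id2")
)
result.show(truncate=False)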
You can try this:
df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(4)])

How to aggregate on datetime column in Spark/SQL

How can I aggregate the mean values for every 10 minutes using the datetime column B?
input
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo",
           "foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
     "B": ["2013-01-01 01:01:00", "2013-01-01 01:03:00", "2013-01-01 01:06:00",
           "2013-01-01 01:07:00", "2013-01-01 01:10:00", "2013-01-01 01:13:00",
           "2013-01-01 01:16:00", "2013-01-01 01:19:00",
           "2013-01-02 02:01:00", "2013-01-02 02:03:00", "2013-01-02 02:06:00",
           "2013-01-02 02:07:00", "2013-01-02 02:10:00", "2013-01-02 02:13:00",
           "2013-01-02 02:16:00", "2013-01-02 02:19:00"],
     "C": np.random.randn(16),
    })
Code
SELECT A,B,AVG(C) as C_mean
FROM df1
GROUP BY (DATEPART(MINUTE, [B])/10)
Expected output
2013-01-01 01:10:00 20
2013-01-01 01:20:00 30
2013-01-02 02:10:00 10
2013-01-02 02:20:00 20
One of the ways you can achieve this is to create the group-by column you require, which in this case is a 10-minute interval, as below.
Once you have the required column, it's pure aggregation afterwards.
Data Preparation & Time Interval Creation
from pyspark.sql import functions as F

df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo",
           "foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
     "B": ["2013-01-01 01:01:00", "2013-01-01 01:03:00", "2013-01-01 01:06:00",
           "2013-01-01 01:07:00", "2013-01-01 01:10:00", "2013-01-01 01:13:00",
           "2013-01-01 01:16:00", "2013-01-01 01:19:00",
           "2013-01-02 02:01:00", "2013-01-02 02:03:00", "2013-01-02 02:06:00",
           "2013-01-02 02:07:00", "2013-01-02 02:10:00", "2013-01-02 02:13:00",
           "2013-01-02 02:16:00", "2013-01-02 02:19:00"],
     "C": np.random.randn(16),
    })

sparkDF = sql.createDataFrame(df)\
    .withColumn("B", F.to_timestamp(F.col("B"), 'yyyy-MM-dd hh:mm:ss'))\
    .withColumn("unix_ts", F.unix_timestamp(F.col("B")))\
    .withColumn("time_interval", F.col("unix_ts") - (F.col('unix_ts') % 600))\
    .withColumn("time_interval_dt", F.from_unixtime("time_interval"))\
    .orderBy(*['A', 'time_interval', 'B'])
sparkDF.createOrReplaceTempView("sparkDF")
Time Interval Creation - Spark SQL
imm_res = sql.sql("""
WITH IMM_RES AS (
SELECT
A,
B,
C,
unix_ts,
unix_ts - (unix_ts % 600) as time_interval,
FROM_UNIXTIME(unix_ts - (unix_ts % 600)) as time_interval_dt
FROM(
SELECT
A,
TO_TIMESTAMP(B,'yyyy-MM-dd hh:mm:ss') as B,
C,
UNIX_TIMESTAMP(TO_TIMESTAMP(B,'yyyy-MM-dd hh:mm:ss')) as unix_ts
FROM SPARKDF
)
)
SELECT * FROM IMM_RES
""")
imm_res.show()
+---+-------------------+--------------------+----------+-------------+-------------------+
| A| B| C| unix_ts|time_interval| time_interval_dt|
+---+-------------------+--------------------+----------+-------------+-------------------+
|bar|2013-01-01 01:03:00| -1.1074768756488491|1356982380| 1356982200|2013-01-01 01:00:00|
|bar|2013-01-01 01:07:00| -1.3904690748604658|1356982620| 1356982200|2013-01-01 01:00:00|
|bar|2013-01-01 01:13:00| 0.10823010338926187|1356982980| 1356982800|2013-01-01 01:10:00|
|bar|2013-01-02 02:03:00|-0.42164831031239086|1357072380| 1357072200|2013-01-02 02:00:00|
|bar|2013-01-02 02:07:00|-0.10930060840368964|1357072620| 1357072200|2013-01-02 02:00:00|
|bar|2013-01-02 02:13:00| -1.7879231287345696|1357072980| 1357072800|2013-01-02 02:10:00|
|foo|2013-01-01 01:01:00| -1.0260782342032664|1356982260| 1356982200|2013-01-01 01:00:00|
|foo|2013-01-01 01:06:00|1.415566010814215...|1356982560| 1356982200|2013-01-01 01:00:00|
|foo|2013-01-01 01:10:00| -1.0542860688409124|1356982800| 1356982800|2013-01-01 01:10:00|
|foo|2013-01-01 01:16:00| 0.14101008568224452|1356983160| 1356982800|2013-01-01 01:10:00|
|foo|2013-01-01 01:19:00| -1.0361973194717629|1356983340| 1356982800|2013-01-01 01:10:00|
|foo|2013-01-02 02:01:00| 0.9224421915087914|1357072260| 1357072200|2013-01-02 02:00:00|
|foo|2013-01-02 02:06:00| -0.5560569181896606|1357072560| 1357072200|2013-01-02 02:00:00|
|foo|2013-01-02 02:10:00| -0.9397516560457578|1357072800| 1357072800|2013-01-02 02:10:00|
|foo|2013-01-02 02:16:00| -0.5107170960172278|1357073160| 1357072800|2013-01-02 02:10:00|
|foo|2013-01-02 02:19:00| 1.975558541342583|1357073340| 1357072800|2013-01-02 02:10:00|
+---+-------------------+--------------------+----------+-------------+-------------------+
Aggregation
imm_res.createOrReplaceTempView("IMM_RES")
sql.sql("""
SELECT
time_interval_dt,
AVG(C) as C_mean
FROM IMM_RES
GROUP BY 1
ORDER BY 1
""").show()
+-------------------+--------------------+
| time_interval_dt| C_mean|
+-------------------+--------------------+
|2013-01-01 01:00:00| -0.8809706570278749|
|2013-01-01 01:10:00| -0.4603107998102922|
|2013-01-02 02:00:00|-0.04114091134923745|
|2013-01-02 02:10:00| -0.315708334863743|
+-------------------+--------------------+
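If plain SQL is not a hard requirement, the same 10-minute bucketing can be expressed more directly with the built-in window function; a short sketch, assuming the sparkDF created in the preparation step above:
from pyspark.sql import functions as F

# F.window cuts timestamp column B into fixed 10-minute buckets;
# window.start is the opening timestamp of each bucket.
(
    sparkDF
    .groupBy(F.window("B", "10 minutes").alias("w"))
    .agg(F.avg("C").alias("C_mean"))
    .select(F.col("w.start").alias("time_interval_dt"), "C_mean")
    .orderBy("time_interval_dt")
    .show()
)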

Delete row number in dictionary format from pandas dataframe column

Please help me with the following conversion. I have a pandas dataframe in the following format:
id            location
{ "0": "5",   "0": "Charlotte, North Carolina",
"1": "5",     "1": "N/A",
"2": "5",     "2": "Portland, Oregon",
"3": "5",     "3": "Jonesborough, Tennessee",
"4": "5",     "4": "Rockville, Indiana",
"5": "5",}    "5": "Dallas, Texas",
and would like to convert this into the following format:
A header   Another header
"5"        "Charlotte, North Carolina"
"5"        "N/A"
"5"        "Portland, Oregon"
"5"        "Jonesborough, Tennessee"
"5"        "Rockville, Indiana"
"5"        "Dallas, Texas"
Please help
You can try this.
import pandas as pd
import re
df = pd.DataFrame([['{ "0": "5",', '"0": "Charlotte, North Carolina",'], ['"1": "5",','"1": "N/A",']], columns=['id', 'location'])
#using regex to extract int values and selecting second int
df['id'] = df['id'].apply(lambda x: re.findall(r'\d+', x)[1])
#Split the string with : and select second value. And remove comma
df['location'] = df['location'].apply(lambda x: x.split(':')[1][:-1])
print(df)
Output:
id location
0 5 "Charlotte, North Carolina"
1 5 "N/A"
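A vectorized alternative (my variation, not part of the original answer) is to let the pandas .str accessor pull out the second quoted value on each line instead of using apply; note that, unlike the answer above, this also drops the surrounding quotes:
import pandas as pd

df = pd.DataFrame(
    [['{ "0": "5",', '"0": "Charlotte, North Carolina",'],
     ['"1": "5",', '"1": "N/A",']],
    columns=['id', 'location'])

# Capture whatever sits inside the quotes after the colon.
df['id'] = df['id'].str.extract(r':\s*"([^"]*)"', expand=False)
df['location'] = df['location'].str.extract(r':\s*"([^"]*)"', expand=False)
print(df)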

Python program: convert middle digits into their corresponding alphabetical form between start position and end position

Take a string and two integers from the user and convert the digits between the start and end positions into their corresponding alphabetical form.
def convert_digits(input_string, start_position, end_position):
    new_string = input_string[:start_position]
    digit_mapping = {
        '0': 'ZERO',
        '1': 'ONE',
        '2': 'TWO',
        '3': 'THREE',
        '4': 'FOUR',
        '5': 'FIVE',
        '6': 'SIX',
        '7': 'SEVEN',
        '8': 'EIGHT',
        '9': 'NINE'
    }
    for index in range(start_position, end_position):
        if input_string[index].isdigit():
            mapped = digit_mapping[input_string[index]]
            new_string = mapped
    new_string += input_string[end_position + 1:]
    return new_string
def convert_digits(input_string, start_position, end_position):
    len_str = len(input_string)
    # clamp the positions to the valid range
    if end_position > len_str:
        end_position = len_str
    if start_position < 0:
        start_position = 0
    elif start_position >= end_position:
        start_position = end_position - 1
    input_string = input_string[start_position:end_position]
    digit_mapping = {
        '0': 'ZERO',
        '1': 'ONE',
        '2': 'TWO',
        '3': 'THREE',
        '4': 'FOUR',
        '5': 'FIVE',
        '6': 'SIX',
        '7': 'SEVEN',
        '8': 'EIGHT',
        '9': 'NINE'
    }
    # replace every digit in the slice with its word form
    input_string = list(input_string)
    for index, char in enumerate(input_string):
        if char.isdigit():
            input_string[index] = digit_mapping[char]
    return "".join(input_string)

print(convert_digits("ajoegh12ioh12oih", 0, 14))
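Since both functions share the name convert_digits, the print call above resolves to the second, bounds-checked definition. Tracing it by hand (not output from the original post): with start 0 and end 14 it keeps only "ajoegh12ioh12o" and spells out each digit, returning "ajoeghONETWOiohONETWOo". Note that this version returns just the converted slice, so anything after end_position is dropped.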

How to flatten JSON values into frequency counts in SQL

I have a column with JSON values like so:
{'A': 'true', 'B': 'false', 'C': 'true'}
{'A': 'true', 'C': 'false'}
{'D': 'true'}
{'C': 'true', 'A': 'false'}
I would like to create an SQL query which counts the number of entries with each key-value combination in the json.
Note that the keys and values are unknown in advance.
So the output of the above would be:
2 A=true
1 A=false
1 B=false
2 C=true
1 C=false
1 D=true
How can I do that?
SELECT a1 || ':' || a2, count(*)
FROM (
  SELECT map_entries(cast(json_parse(x) as MAP<VARCHAR, VARCHAR>)) row
  FROM (VALUES
          ('{"A": "true", "B": "false", "C": "true"}'),
          ('{"A": "true", "C": "false"}'),
          ('{"D": "true"}'),
          ('{"C": "true", "A": "false"}')) as t(x)
) as nested_data
CROSS JOIN UNNEST(row) as nested_data(a1, a2)
GROUP BY 1;
_col0 | _col1
---------+-------
D:true | 1
B:false | 1
C:false | 1
C:true | 2
A:false | 1
A:true | 2
https://prestosql.io/docs/current/functions/map.html
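For completeness, if the data lives in Spark rather than Presto, a roughly equivalent sketch (assuming the JSON strings are double-quoted, as in the VALUES above, and sit in a column x of a DataFrame df) parses each value into a map, explodes it, and counts the key=value pairs:
from pyspark.sql import functions as F

counts = (
    df.select(F.explode(F.from_json("x", "map<string,string>")).alias("k", "v"))
      .groupBy(F.concat_ws("=", "k", "v").alias("pair"))
      .count()
)
counts.show()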