While replacing values of a column in a DataFrame using the replace method, how can we make use of a dictionary to do so? There are 5 dictionaries that have to be applied to the dataframe, and I am having problems with the syntax for doing this in one step. This dataset will be growing, so I am trying to find the most efficient way to chain the replace method or to build a list of the column names.
ab_normaldict= { '0': 'Normal' , '1': 'Abnormal' , '999': 'Not Done'}
ethnicitydict = { '1': 'Hispanic or Latino' , '2': 'Not Hispanic or Latino' , '3':' Unknown'}
racedict = { '1': 'American Indian or Alaska Native' , '2':'Asian' , '3': 'Black or African American' , '4': 'Native Hawaiian or Other Pacific Islander' , '5': 'White' , '7': 'More than one race' , '6': 'Unknown/Other'}
sexdict = { '1': 'Female' , '2': 'Male' , '888': 'Other' , '999': 'Unknown'}
df1 = spark.createDataFrame([
    ("person1", "0", "1", "2", "1"),
    ("person2", "1", "2", "1", "2"),
    ("person3", "999", "2", "3", "1"),
    ("person4", "Null", "1", "6", "1")])\
    .toDF("id", "abnormal", "ethnicity", "racedict", "sex")
I saw that the syntax is:
df1.na.replace(to_replace=ab_normaldict, 'abnormal')
df2 = df1
df2.na.replace(to_replace=sexdict, 'sex')
but I need something like the following so I don't have to keep creating a new dataframe:
df1.na.replace(to_replace=ab_normaldict, 'abnormal').na.replace(to_replace=sexdict, 'sex')
Your code works just fine for me (except the positional argument):
(df1
.na.replace(ab_normaldict, 'abnormal')
.na.replace(sexdict, 'sex')
.show()
)
+---+--------+---------+--------+--------+
| id|abnormal|ethnicity|racedict| sex|
+---+--------+---------+--------+--------+
| 1| Normal| Abnormal| Male|Abnormal|
| 2|Abnormal| Male|Abnormal| Male|
| 3|Not Done| Male| 3|Abnormal|
| 4| Null| Abnormal| 6|Abnormal|
+---+--------+---------+--------+--------+
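Note that when the column name is passed positionally it is treated as the value argument, which PySpark ignores when to_replace is a dictionary, so each mapping ends up applied to every column; that matches the output above, where ethnicity, racedict and sex also got replaced. If each dictionary should only touch its own column, a minimal sketch, assuming the dictionaries and column names from the question, is to pass subset explicitly and loop over a column-to-mapping dict so it keeps working as the dataset grows:
# Sketch: map each column to its replacement dictionary, then chain na.replace
# with subset= so every mapping only touches its own column
# (column names assumed from the question's toDF call).
column_maps = {
    "abnormal": ab_normaldict,
    "ethnicity": ethnicitydict,
    "racedict": racedict,
    "sex": sexdict,
}
df_out = df1
for col_name, mapping in column_maps.items():
    df_out = df_out.na.replace(mapping, subset=[col_name])
df_out.show()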
I have an array of tuples in a PySpark dataframe column, like below.
-+-----------------------------------------------------------------------------------+-
| targeting_values |
-+-----------------------------------------------------------------------------------+-
| [('123', '123', '123'), ('abc', 'def', 'ghi'), ('jkl', 'mno', 'pqr'), (0, 1, 2)] |
-+-----------------------------------------------------------------------------------+-
I want 4 different columns, with one tuple in each column, like below.
-+----------------------+----------------------+-----------------------+--------------------+-
| value1 | value2 | value3 | value4 |
-+----------------------+----------------------+-----------------------+--------------------+-
| ('123', '123', '123')|('abc', 'def', 'ghi') | ('jkl', 'mno', 'pqr') | (0, 1, 2) |
-+----------------------+----------------------+-----------------------+--------------------+-
I tried to achieve this using split() but had no luck, and I did not find another way to solve it. Is there a good way to do this?
You can do it by exploding the array and then pivoting it:
// first create the data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import org.apache.spark.sql.functions.{col, concat, first, lit, monotonically_increasing_id, posexplode}

val arrayStructData = Seq(
Row(List(Row("123", "123", "123"), Row("abc", "def", "ghi"), Row("jkl", "mno", "pqr"), Row("0", "1", "2"))),
Row(List(Row("456", "456", "456"), Row("qsd", "fgh", "hjk"), Row("aze", "rty", "uio"), Row("4", "5", "6")))
)
val arrayStructSchema = new StructType()
.add("targeting_values", ArrayType(new StructType()
.add("_1", StringType)
.add("_2", StringType)
.add("_3", StringType)))
val df = spark.createDataFrame(spark.sparkContext
.parallelize(arrayStructData), arrayStructSchema)
df.show(false)
+--------------------------------------------------------------+
|targeting_values |
+--------------------------------------------------------------+
|[{123, 123, 123}, {abc, def, ghi}, {jkl, mno, pqr}, {0, 1, 2}]|
|[{456, 456, 456}, {qsd, fgh, hjk}, {aze, rty, uio}, {4, 5, 6}]|
+--------------------------------------------------------------+
// Then a combination of explode, creating an id, then pivoting it like this:
df.withColumn("id2", monotonically_increasing_id())
  .select(col("id2"), posexplode(col("targeting_values")))
  .withColumn("id", concat(lit("value"), col("pos") + 1))
  .groupBy("id2").pivot("id").agg(first("col"))
  .drop("id2")
  .show(false)
+---------------+---------------+---------------+---------+
|value1 |value2 |value3 |value4 |
+---------------+---------------+---------------+---------+
|{123, 123, 123}|{abc, def, ghi}|{jkl, mno, pqr}|{0, 1, 2}|
|{456, 456, 456}|{qsd, fgh, hjk}|{aze, rty, uio}|{4, 5, 6}|
+---------------+---------------+---------------+---------+
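For reference, a rough PySpark sketch of the same explode-and-pivot idea (assuming the same targeting_values column of structs) would be:
from pyspark.sql import functions as F
# Sketch: posexplode yields (pos, col); label each position "value<n>",
# then pivot so every array element becomes its own column.
result = (
    df.withColumn("id2", F.monotonically_increasing_id())
      .select("id2", F.posexplode("targeting_values"))
      .withColumn("id", F.concat(F.lit("value"), (F.col("pos") + 1).cast("string")))
      .groupBy("id2").pivot("id").agg(F.first("col"))
      .drop("id2")
)
result.show(truncate=False)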
You can try this:
df.selectExpr([f"targeting_values[{i}] as value{i+1}" for i in range(4)])
How can I aggregate the mean of column C for every 10 minutes, using the datetime column B?
input
df1 = pd.DataFrame(
{"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo",
"foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["2013-01-01 01:01:00", "2013-01-01 01:03:00", "2013-01-01 01:06:00",
"2013-01-01 01:07:00", "2013-01-01 01:10:00", "2013-01-01 01:13:00",
"2013-01-01 01:16:00", "2013-01-01 01:19:00",
"2013-01-02 02:01:00", "2013-01-02 02:03:00", "2013-01-02 02:06:00",
"2013-01-02 02:07:00", "2013-01-02 02:10:00", "2013-01-02 02:13:00",
"2013-01-02 02:16:00", "2013-01-02 02:19:00"],
"C": np.random.randn(16),
})
Code
SELECT A,B,AVG(C) as C_mean
FROM df1
GROUP BY (DATEPART(MINUTE, [B])/10)
Expected output
2013-01-01 01:10:00 20
2013-01-01 01:20:00 30
2013-01-02 02:10:00 10
2013-01-02 02:20:00 20
One of the ways you can achieve this is to explicitly create the groupby column you require, which in this case is a 10-minute interval, as below.
Once you have the required column, it's pure aggregation afterwards.
Data Preparation & Time Interval Creation
import numpy as np
import pandas as pd
from pyspark.sql import functions as F

df = pd.DataFrame(
{"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo",
"foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["2013-01-01 01:01:00", "2013-01-01 01:03:00", "2013-01-01 01:06:00",
"2013-01-01 01:07:00", "2013-01-01 01:10:00", "2013-01-01 01:13:00",
"2013-01-01 01:16:00", "2013-01-01 01:19:00",
"2013-01-02 02:01:00", "2013-01-02 02:03:00", "2013-01-02 02:06:00",
"2013-01-02 02:07:00", "2013-01-02 02:10:00", "2013-01-02 02:13:00",
"2013-01-02 02:16:00", "2013-01-02 02:19:00"],
"C": np.random.randn(16),
})
# "sql" here is the SparkSession/SQLContext handle
sparkDF = sql.createDataFrame(df)\
.withColumn("B",F.to_timestamp(F.col("B"),'yyyy-MM-dd hh:mm:ss'))\
.withColumn("unix_ts",F.unix_timestamp(F.col("B")))\
.withColumn("time_interval",F.col("unix_ts") - (F.col('unix_ts') % 600))\
.withColumn("time_interval_dt",F.from_unixtime("time_interval"))\
.orderBy(*['A','time_interval','B'])
sparkDF.createOrReplaceTempView("sparkDF")
Time Interval Creation - Spark SQL
imm_res = sql.sql("""
WITH IMM_RES AS (
SELECT
A,
B,
C,
unix_ts,
unix_ts - (unix_ts % 600) as time_interval,
FROM_UNIXTIME(unix_ts - (unix_ts % 600)) as time_interval_dt
FROM(
SELECT
A,
TO_TIMESTAMP(B,'yyyy-MM-dd hh:mm:ss') as B,
C,
UNIX_TIMESTAMP(TO_TIMESTAMP(B,'yyyy-MM-dd hh:mm:ss')) as unix_ts
FROM SPARKDF
)
)
SELECT * FROM IMM_RES
""")
imm_res.show()
+---+-------------------+--------------------+----------+-------------+-------------------+
| A| B| C| unix_ts|time_interval| time_interval_dt|
+---+-------------------+--------------------+----------+-------------+-------------------+
|bar|2013-01-01 01:03:00| -1.1074768756488491|1356982380| 1356982200|2013-01-01 01:00:00|
|bar|2013-01-01 01:07:00| -1.3904690748604658|1356982620| 1356982200|2013-01-01 01:00:00|
|bar|2013-01-01 01:13:00| 0.10823010338926187|1356982980| 1356982800|2013-01-01 01:10:00|
|bar|2013-01-02 02:03:00|-0.42164831031239086|1357072380| 1357072200|2013-01-02 02:00:00|
|bar|2013-01-02 02:07:00|-0.10930060840368964|1357072620| 1357072200|2013-01-02 02:00:00|
|bar|2013-01-02 02:13:00| -1.7879231287345696|1357072980| 1357072800|2013-01-02 02:10:00|
|foo|2013-01-01 01:01:00| -1.0260782342032664|1356982260| 1356982200|2013-01-01 01:00:00|
|foo|2013-01-01 01:06:00|1.415566010814215...|1356982560| 1356982200|2013-01-01 01:00:00|
|foo|2013-01-01 01:10:00| -1.0542860688409124|1356982800| 1356982800|2013-01-01 01:10:00|
|foo|2013-01-01 01:16:00| 0.14101008568224452|1356983160| 1356982800|2013-01-01 01:10:00|
|foo|2013-01-01 01:19:00| -1.0361973194717629|1356983340| 1356982800|2013-01-01 01:10:00|
|foo|2013-01-02 02:01:00| 0.9224421915087914|1357072260| 1357072200|2013-01-02 02:00:00|
|foo|2013-01-02 02:06:00| -0.5560569181896606|1357072560| 1357072200|2013-01-02 02:00:00|
|foo|2013-01-02 02:10:00| -0.9397516560457578|1357072800| 1357072800|2013-01-02 02:10:00|
|foo|2013-01-02 02:16:00| -0.5107170960172278|1357073160| 1357072800|2013-01-02 02:10:00|
|foo|2013-01-02 02:19:00| 1.975558541342583|1357073340| 1357072800|2013-01-02 02:10:00|
+---+-------------------+--------------------+----------+-------------+-------------------+
Aggregation
imm_res.createOrReplaceTempView("IMM_RES")
sql.sql("""
SELECT
time_interval_dt,
AVG(C) as C_mean
FROM IMM_RES
GROUP BY 1
ORDER BY 1
""").show()
+-------------------+--------------------+
| time_interval_dt| C_mean|
+-------------------+--------------------+
|2013-01-01 01:00:00| -0.8809706570278749|
|2013-01-01 01:10:00| -0.4603107998102922|
|2013-01-02 02:00:00|-0.04114091134923745|
|2013-01-02 02:10:00| -0.315708334863743|
+-------------------+--------------------+
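If the data can stay in pandas, the same 10-minute bucketing works without Spark; a sketch, assuming column B is parsed to datetime and using pd.Grouper:
import pandas as pd
# Sketch: bucket rows into 10-minute bins on B and average C per bin
df1["B"] = pd.to_datetime(df1["B"])
c_mean = (
    df1.groupby(pd.Grouper(key="B", freq="10min"))["C"]
       .mean()
       .dropna()  # drop the empty 10-minute bins between the two days
       .reset_index(name="C_mean")
)
print(c_mean)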
Please help me with the following conversion. I have a pandas dataframe in the following format:
id            location
{ "0": "5",   "0": "Charlotte, North Carolina",
"1": "5",     "1": "N/A",
"2": "5",     "2": "Portland, Oregon",
"3": "5",     "3": "Jonesborough, Tennessee",
"4": "5",     "4": "Rockville, Indiana",
"5": "5",}    "5": "Dallas, Texas",
and I would like to convert it into the following format:
A header    Another header
"5"         "Charlotte, North Carolina"
"5"         "N/A"
"5"         "Portland, Oregon"
"5"         "Jonesborough, Tennessee"
"5"         "Rockville, Indiana"
"5"         "Dallas, Texas"
You can try this.
import pandas as pd
import re
df = pd.DataFrame([['{ "0": "5",', '"0": "Charlotte, North Carolina",'], ['"1": "5",','"1": "N/A",']], columns=['id', 'location'])
# use a regex to extract the integer values in each cell and keep the second one
df['id'] = df['id'].apply(lambda x: re.findall(r'\d+', x)[1])
# split the string on ':', take the second part, and drop the trailing comma
df['location'] = df['location'].apply(lambda x: x.split(':')[1][:-1])
print(df)
Output:
id location
0 5 "Charlotte, North Carolina"
1 5 "N/A"
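A vectorized variant of the same idea, assuming the same two string columns, could use Series.str.extract to pull the quoted value out of each cell:
# Sketch: capture the quoted value that follows the ':' in each cell
df['id'] = df['id'].str.extract(r':\s*"([^"]*)"', expand=False)
df['location'] = df['location'].str.extract(r':\s*(".*")', expand=False)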
Take a string and two integers from the user, and convert the digits between the start and end positions into their corresponding alphabetical form.
def convert_digits( input_string, start_position, end_position ) :
    new_string = input_string[:start_position]
    digit_mapping = {
        '0': 'ZERO',
        '1': 'ONE',
        '2': 'TWO',
        '3': 'THREE',
        '4': 'FOUR',
        '5': 'FIVE',
        '6': 'SIX',
        '7': 'SEVEN',
        '8': 'EIGHT',
        '9': 'NINE'
    }
    for index in range(start_position, end_position):
        if input_string[index].isdigit():
            mapped = digit_mapping[input_string[index]]
            new_string = mapped
    new_string += input_string[end_position + 1:]
    return new_string
def convert_digits(input_string, start_position, end_position):
    len_str = len(input_string)
    if end_position > len_str:
        end_position = len_str
    if start_position < 0:
        start_position = 0
    elif start_position >= end_position:
        start_position = end_position - 1
    digit_mapping = {
        '0': 'ZERO',
        '1': 'ONE',
        '2': 'TWO',
        '3': 'THREE',
        '4': 'FOUR',
        '5': 'FIVE',
        '6': 'SIX',
        '7': 'SEVEN',
        '8': 'EIGHT',
        '9': 'NINE'
    }
    # convert only the slice between the start and end positions,
    # keeping the untouched prefix and suffix of the string
    middle = list(input_string[start_position:end_position])
    for index, char in enumerate(middle):
        if char.isdigit():
            middle[index] = digit_mapping[char]
    return input_string[:start_position] + "".join(middle) + input_string[end_position:]

print(convert_digits("ajoegh12ioh12oih", 0, 14))
# -> ajoeghONETWOiohONETWOoih (the "ih" after position 14 is preserved)
I have a column with JSON values like so:
{'A': 'true', 'B': 'false', 'C': 'true'}
{'A': 'true', 'C': 'false'}
{'D': 'true'}
{'C': 'true', 'A': 'false'}
I would like to create an SQL query which counts the number of entries with each key-value combination in the json.
Note that the keys and values are unknown in advance.
So the output of the above would be:
2 A=true
1 A=false
1 B=false
2 C=true
1 C=false
1 D=true
How can I do that?
SELECT a1 || ':' || a2, count(*)
FROM (
    SELECT map_entries(cast(json_parse(x) as MAP<VARCHAR, VARCHAR>)) row
    FROM (VALUES
        ('{"A": "true", "B": "false", "C": "true"}'),
        ('{"A": "true", "C": "false"}'),
        ('{"D": "true"}'),
        ('{"C": "true", "A": "false"}')
    ) as t(x)
) as nested_data
CROSS JOIN UNNEST(row) as nested_data(a1, a2)
GROUP BY 1;
_col0 | _col1
---------+-------
D:true | 1
B:false | 1
C:false | 1
C:true | 2
A:false | 1
A:true | 2
https://prestosql.io/docs/current/functions/map.html