I have:
Dataframe A:

| ID | Name |
|----|------|
| 1  | Null |
| 2  | Null |
| 3  | Null |
| 4  | Null |

Dataframe B:

| Name | weight |
|------|--------|
| Max  | 0.5    |
| Mike | 0.25   |
| John | 0.25   |
I need to fill (in PySpark) the column "Name" of Dataframe A according to the weights in Dataframe B (Max 50%; Mike 25%; John 25%):

| ID | Name |
|----|------|
| 1  | Max  |
| 2  | Max  |
| 3  | Mike |
| 4  | John |
This may not be the best approach, but it works for weights >= 0.01. I wanted an approach where the join uses equality rather than range conditions (>, <, >=, <=).
Input:
from pyspark.sql import functions as F, Window as W
dfA = spark.range(1, 5).toDF("ID")
dfB = spark.createDataFrame(
    [('Max', 0.5),
     ('Mike', 0.25),
     ('John', 0.25)],
    ['Name', 'weight'])
Script:
# running (cumulative) sum of weights, rounded to avoid floating-point drift
cs = F.round(F.sum('weight').over(W.rowsBetween(W.unboundedPreceding, 0)), 2)
# give every name a bucket of integer "percent slots": Max -> [1..50], Mike -> [51..75], John -> [76..100]
dfB = dfB.withColumn('w', F.sequence(((cs - F.col('weight'))*100+1).cast('int'), (cs*100).cast('int')))
dfB = dfB.withColumn('w', F.explode('w'))
# map every row of dfA to a percent slot based on its running position (25, 50, 75, 100 here)
cntA = dfA.count()
dfA = dfA.withColumn('w', F.ceil(F.count('ID').over(W.rowsBetween(W.unboundedPreceding, 0))/cntA*100))
# equality join on the slot number, then drop the helper columns
dfA = dfA.join(dfB, 'w', 'left').drop('w', 'weight')
dfA.show()
# +---+----+
# | ID|Name|
# +---+----+
# | 1| Max|
# | 2| Max|
# | 3|Mike|
# | 4|John|
# +---+----+
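To sanity-check the distribution on a larger input (a sketch; using spark.range(1, 101) as dfA is an assumption, not part of the question), you can count how often each name gets assigned after the join:
# sketch: with 100 rows in dfA you would expect roughly Max=50, Mike=25, John=25
dfA.groupBy('Name').count().orderBy('Name').show()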
I want to concatenate a number of columns into a single column, but in dictionary (map) format, in PySpark.
I have concatenated the data into a single column, but I am unable to store it in a dictionary format.
Please find the attached screenshot below for more details.
Let me know if you need more information.
In your current situation, you can use str_to_map:
from pyspark.sql import functions as F
df = spark.createDataFrame([("datatype:0,length:1",)], ['region_validation_check_status'])
df = df.withColumn(
    'region_validation_check_status',
    F.expr("str_to_map(region_validation_check_status, ',')")
)
df.show(truncate=0)
# +------------------------------+
# |region_validation_check_status|
# +------------------------------+
# |{datatype -> 0, length -> 1} |
# +------------------------------+
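The key-value delimiter of str_to_map defaults to ':', which already matches the sample string. If your data used a different separator, say '=' (an assumption for illustration), you could pass it as a third argument:
# sketch: explicit pair delimiter ',' and key-value delimiter '='
df = df.withColumn(
    'region_validation_check_status',
    F.expr("str_to_map(region_validation_check_status, ',', '=')")
)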
If you didn't have such a string yet, you could build the map from column values with to_json and from_json:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])
df.show()
# +---+---+
# | a| b|
# +---+---+
# | 1| 2|
# | 3| 4|
# +---+---+
df = df.select(
    F.from_json(F.to_json(F.struct('a', 'b')), 'map<string, int>')
)
df.show()
# +----------------+
# | entries|
# +----------------+
# |{a -> 1, b -> 2}|
# |{a -> 3, b -> 4}|
# +----------------+
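The resulting map column gets a default name (entries above); you can rename it with alias (the name dict_col below is just an example):
df = df.select(
    F.from_json(F.to_json(F.struct('a', 'b')), 'map<string, int>').alias('dict_col')
)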
I'm trying to convert the following SAS logic to PySpark, but I'm not getting the expected output.
if first.loan then seq_id = 0;
seq_id+1;
The current dataset:
| loan | module |
|------|--------|
| 743  | 455    |
| 4490 | 795    |
| 1101 | 235    |
| 1101 | 335    |
| 1101 | 435    |
| 3471 | 898    |
The expected dataset:
| loan | module | seq_id |
|------|--------|--------|
| 743  | 455    | 1      |
| 4490 | 795    | 1      |
| 1101 | 235    | 1      |
| 1101 | 335    | 2      |
| 1101 | 435    | 3      |
| 3471 | 898    | 1      |
For the first value in the group, you assign seq_id=0 and then immediately increment it with seq_id+1. Subsequent values in the group are incremented one by one with seq_id+1, so effectively you create row numbers within every group.
In Spark, this can be done using row_number window function.
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(743, 455),
     (4490, 795),
     (1101, 235),
     (1101, 335),
     (1101, 435),
     (3471, 898)],
    ['loan', 'module'])
w = W.partitionBy('loan').orderBy('module')
df = df.withColumn('seq_id', F.row_number().over(w))
df.show()
# +----+------+------+
# |loan|module|seq_id|
# +----+------+------+
# | 743| 455| 1|
# |1101| 235| 1|
# |1101| 335| 2|
# |1101| 435| 3|
# |3471| 898| 1|
# |4490| 795| 1|
# +----+------+------+
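Note that first.loan in SAS depends on the physical order of the rows, while row_number above orders by module within each loan, which happens to reproduce the expected output. If you instead needed to keep the original row order, a sketch is to tag rows with an increasing id first (monotonically_increasing_id yields increasing but not consecutive values):
# sketch: preserve the incoming row order instead of ordering by module
df = df.withColumn('_ord', F.monotonically_increasing_id())
w = W.partitionBy('loan').orderBy('_ord')
df = df.withColumn('seq_id', F.row_number().over(w)).drop('_ord')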
I want to combine the column values of two dataframes, after performing some operations, to create a new dataframe in PySpark. The columns of each dataframe are vectors with integer values. The operations are: take the element-wise average of the vectors of the two dataframes, then find the index of the maximum element of each new vector.
Dataframe1:

| id | value1  |
|----|---------|
| 0  | [0,1,2] |
| 1  | [3,4,5] |

Dataframe2:

| id | value2  |
|----|---------|
| 0  | [1,2,3] |
| 1  | [4,5,6] |

Dataframe3:

| value3        |
|---------------|
| [0.5,1.5,2.5] |
| [3.5,4.5,5.5] |

Dataframe4:

| value4 |
|--------|
| 2      |
| 2      |
Dataframe3 is obtained by taking the element-wise average of the vectors of Dataframe1 and Dataframe2, i.e. the first vector of Dataframe3, [0.5,1.5,2.5], is obtained as [(0+1)/2, (1+2)/2, (2+3)/2]. Dataframe4 is obtained by taking the index of the maximum value of each vector, i.e. in the first vector of Dataframe3, [0.5,1.5,2.5], the maximum value is 2.5 and it occurs at index 2, so the first element of Dataframe4 is 2. How can we implement this in PySpark?
V1:
+--------------------------------------+---+
|p1 |id |
+--------------------------------------+---+
|[0.01426862, 0.010903089, 0.9748283] |0 |
|[0.068229124, 0.89613986, 0.035630997]|1 |
+--------------------------------------+---+
V2:
+-------------------------+---+
|p2 |id |
+-------------------------+---+
|[0.0, 0.0, 1.0] |0 |
|[2.8160464E-27, 1.0, 0.0]|1 |
+-------------------------+---+
When df3 = v1.join(v2, on="id") is used, this is what I get for df3:
+-------------------------------------+---------------+
|p1 |p2 |
+-------------------------------------+---------------+
|[0.02203844, 0.010056663, 0.9679049] |[0.0, 0.0, 1.0]|
|[0.039553806, 0.015186918, 0.9452593]|[0.0, 0.0, 1.0]|
+-------------------------------------+---------------+
And when I run

df3 = df3.withColumn("p3", F.expr("transform(arrays_zip(p1, p2), x -> (x.p1 + x.p2) / 2)"))
df4 = df3.withColumn("p4", F.expr("array_position(p3, array_max(p3))"))

where p3 is the average value, I get all values of df4 as zero.
First, I recreate your test data:
a = [
    [0, [0, 1, 2]],
    [1, [3, 4, 5]],
]
b = ["id", "value1"]
df1 = spark.createDataFrame(a, b)
c = [
    [0, [1, 2, 3]],
    [1, [4, 5, 6]],
]
d = ["id", "value2"]
df2 = spark.createDataFrame(c, d)
Then, I process the data:
join
df3 = df1.join(df2, on="id")
df3.show()
+---+---------+---------+
| id| value1| value2|
+---+---------+---------+
| 0|[0, 1, 2]|[1, 2, 3]|
| 1|[3, 4, 5]|[4, 5, 6]|
+---+---------+---------+
create the average array
from pyspark.sql import functions as F, types as T

@F.udf(T.ArrayType(T.FloatType()))
def avg_array(array1, array2):
    return list(map(lambda x: (x[0] + x[1]) / 2, zip(array1, array2)))

df3 = df3.withColumn("value3", avg_array(F.col("value1"), F.col("value2")))

# OR without UDF
df3 = df3.withColumn(
    "value3",
    F.expr("transform(arrays_zip(value1, value2), x -> (x.value1 + x.value2) / 2)"),
)
df3.show()
+---+---------+---------+---------------+
| id| value1| value2| value3|
+---+---------+---------+---------------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]|
+---+---------+---------+---------------+
get the index (array_position starts at 1; subtract 1 if you need a 0-based index)
df4 = df3.withColumn("value4",F.expr("array_position(value3, array_max(value3))"))
df4.show()
+---+---------+---------+---------------+------+
| id| value1| value2| value3|value4|
+---+---------+---------+---------------+------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]| 3|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]| 3|
+---+---------+---------+---------------+------+
I loaded a csv into a DataFrame with pandas.
The format is the following:
Timestamp | 1014.temperature | 1014.humidity | 1015.temperature | 1015.humidity ....
-------------------------------------------------------------------------------------
2017-... | 23.12 | 12.2 | 25.10 | 10.34 .....
The problem is that the '1014' and '1015' prefixes are actually IDs that should go into their own column.
I would like to end up with the following format for my DF:
TimeStamp | ID | Temperature | Humidity
-----------------------------------------------
. | | |
.
.
.
The CSV is tab separated.
Thanks in advance guys!
import pandas as pd
from io import StringIO
# create sample data frame
s = """Timestamp|1014.temperature|1014.humidity|1015.temperature|1015.humidity
2017|23.12|12.2|25.10|10.34"""
df = pd.read_csv(StringIO(s), sep='|')
df = df.set_index('Timestamp')
# split columns on '.' with list comprehension
l = [col.split('.') for col in df.columns]
# create multi index columns
df.columns = pd.MultiIndex.from_tuples(l)
# stack column level 0, reset the index and rename level_1
final = df.stack(0).reset_index().rename(columns={'level_1': 'ID'})
Timestamp ID humidity temperature
0 2017 1014 12.20 23.12
1 2017 1015 10.34 25.10
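For your actual file, which is tab separated, the same steps apply; only the read changes (the file name below is a placeholder):
# placeholder path; adjust to your actual file
df = pd.read_csv('your_file.csv', sep='\t')
# then apply the same set_index / MultiIndex / stack steps as above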