I have:
Dataframe A:

| ID | Name |
|----|------|
| 1  | Null |
| 2  | Null |
| 3  | Null |
| 4  | Null |

Dataframe B:

| Name | weight |
|------|--------|
| Max  | 0.5    |
| Mike | 0.25   |
| John | 0.25   |
I need to fill (in PySpark) the column "Name" of Dataframe A according to the weights in Dataframe B (Max 50%; Mike 25%; John 25%):

| ID | Name |
|----|------|
| 1  | Max  |
| 2  | Max  |
| 3  | Mike |
| 4  | John |
This may not be the best approach, but it works for weights >= 0.01. I wanted an approach where the join uses equality rather than range conditions (>, <, >=, <=).
Input:
from pyspark.sql import functions as F, Window as W
dfA = spark.range(1, 5).toDF("ID")
dfB = spark.createDataFrame(
    [('Max', 0.5),
     ('Mike', 0.25),
     ('John', 0.25)],
    ['Name', 'weight'])
Script:
# running (cumulative) sum of weights, rounded to avoid floating-point drift
cs = F.round(F.sum('weight').over(W.rowsBetween(W.unboundedPreceding, 0)), 2)
# give every name a bucket of integer "percent slots": Max -> [1..50], Mike -> [51..75], John -> [76..100]
dfB = dfB.withColumn('w', F.sequence(((cs - F.col('weight'))*100+1).cast('int'), (cs*100).cast('int')))
dfB = dfB.withColumn('w', F.explode('w'))
# map every row of dfA to a percent slot based on its running position (25, 50, 75, 100 here)
cntA = dfA.count()
dfA = dfA.withColumn('w', F.ceil(F.count('ID').over(W.rowsBetween(W.unboundedPreceding, 0))/cntA*100))
# equality join on the slot number, then drop the helper columns
dfA = dfA.join(dfB, 'w', 'left').drop('w', 'weight')
dfA.show()
# +---+----+
# | ID|Name|
# +---+----+
# | 1| Max|
# | 2| Max|
# | 3|Mike|
# | 4|John|
# +---+----+
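To sanity-check the distribution on a larger input (a sketch; using spark.range(1, 101) as dfA is an assumption, not part of the question), you can count how often each name gets assigned after the join:
# sketch: with 100 rows in dfA you would expect roughly Max=50, Mike=25, John=25
dfA.groupBy('Name').count().orderBy('Name').show()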
I want to concatenate a number of columns into a single column, but in dictionary (map) format, in PySpark.
I have concatenated the data into a single column, but I am unable to store it in a dictionary format.
Please find the attached screenshot below for more details.
Let me know if you need more information.
In your current situation, you can use str_to_map:
from pyspark.sql import functions as F
df = spark.createDataFrame([("datatype:0,length:1",)], ['region_validation_check_status'])
df = df.withColumn(
    'region_validation_check_status',
    F.expr("str_to_map(region_validation_check_status, ',')")
)
df.show(truncate=0)
# +------------------------------+
# |region_validation_check_status|
# +------------------------------+
# |{datatype -> 0, length -> 1} |
# +------------------------------+
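The key-value delimiter of str_to_map defaults to ':', which already matches the sample string. If your data used a different separator, say '=' (an assumption for illustration), you could pass it as a third argument:
# sketch: explicit pair delimiter ',' and key-value delimiter '='
df = df.withColumn(
    'region_validation_check_status',
    F.expr("str_to_map(region_validation_check_status, ',', '=')")
)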
If you didn't have such a string yet, you could build the map from column values with to_json and from_json:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1, 2), (3, 4)], ['a', 'b'])
df.show()
# +---+---+
# | a| b|
# +---+---+
# | 1| 2|
# | 3| 4|
# +---+---+
df = df.select(
    F.from_json(F.to_json(F.struct('a', 'b')), 'map<string, int>')
)
df.show()
# +----------------+
# | entries|
# +----------------+
# |{a -> 1, b -> 2}|
# |{a -> 3, b -> 4}|
# +----------------+
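The resulting map column gets a default name (entries above); you can rename it with alias (the name dict_col below is just an example):
df = df.select(
    F.from_json(F.to_json(F.struct('a', 'b')), 'map<string, int>').alias('dict_col')
)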
I'm trying to convert the following SAS logic to PySpark, but I'm not getting the expected output.
if first.loan then seq_id = 0;
seq_id+1;
The current dataset:
| loan | module |
|------|--------|
| 743  | 455    |
| 4490 | 795    |
| 1101 | 235    |
| 1101 | 335    |
| 1101 | 435    |
| 3471 | 898    |
The expected dataset:
| loan | module | seq_id |
|------|--------|--------|
| 743  | 455    | 1      |
| 4490 | 795    | 1      |
| 1101 | 235    | 1      |
| 1101 | 335    | 2      |
| 1101 | 435    | 3      |
| 3471 | 898    | 1      |
For the first value in the group, you assign seq_id=0 and then immediately increment it with seq_id+1. Subsequent values in the group are incremented one by one with seq_id+1, so effectively you create row numbers within every group.
In Spark, this can be done using row_number window function.
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(743, 455),
     (4490, 795),
     (1101, 235),
     (1101, 335),
     (1101, 435),
     (3471, 898)],
    ['loan', 'module'])
w = W.partitionBy('loan').orderBy('module')
df = df.withColumn('seq_id', F.row_number().over(w))
df.show()
# +----+------+------+
# |loan|module|seq_id|
# +----+------+------+
# | 743| 455| 1|
# |1101| 235| 1|
# |1101| 335| 2|
# |1101| 435| 3|
# |3471| 898| 1|
# |4490| 795| 1|
# +----+------+------+
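Note that first.loan in SAS depends on the physical order of the rows, while row_number above orders by module within each loan, which happens to reproduce the expected output. If you instead needed to keep the original row order, a sketch is to tag rows with an increasing id first (monotonically_increasing_id yields increasing but not consecutive values):
# sketch: preserve the incoming row order instead of ordering by module
df = df.withColumn('_ord', F.monotonically_increasing_id())
w = W.partitionBy('loan').orderBy('_ord')
df = df.withColumn('seq_id', F.row_number().over(w)).drop('_ord')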
I want to combine the column values of two dataframes, after performing some operations, to create a new dataframe in PySpark. The columns of each dataframe are vectors with integer values. The operations are: take the element-wise average of the vectors of the two dataframes, then find the index of the maximum element of each new vector.
Dataframe1:

| id | value1  |
|----|---------|
| 0  | [0,1,2] |
| 1  | [3,4,5] |

Dataframe2:

| id | value2  |
|----|---------|
| 0  | [1,2,3] |
| 1  | [4,5,6] |

Dataframe3:

| value3        |
|---------------|
| [0.5,1.5,2.5] |
| [3.5,4.5,5.5] |

Dataframe4:

| value4 |
|--------|
| 2      |
| 2      |
Dataframe3 is obtained by taking the element-wise average of the vectors of Dataframe1 and Dataframe2, i.e. the first vector of Dataframe3, [0.5,1.5,2.5], is obtained as [(0+1)/2, (1+2)/2, (2+3)/2]. Dataframe4 is obtained by taking the index of the maximum value of each vector, i.e. in the first vector of Dataframe3, [0.5,1.5,2.5], the maximum value is 2.5 and it occurs at index 2, so the first element of Dataframe4 is 2. How can we implement this in PySpark?
V1:
+--------------------------------------+---+
|p1 |id |
+--------------------------------------+---+
|[0.01426862, 0.010903089, 0.9748283] |0 |
|[0.068229124, 0.89613986, 0.035630997]|1 |
+--------------------------------------+---+
V2:
+-------------------------+---+
|p2 |id |
+-------------------------+---+
|[0.0, 0.0, 1.0] |0 |
|[2.8160464E-27, 1.0, 0.0]|1 |
+-------------------------+---+
When df3 = v1.join(v2, on="id") is used, this is what I get for df3:
+-------------------------------------+---------------+
|p1 |p2 |
+-------------------------------------+---------------+
|[0.02203844, 0.010056663, 0.9679049] |[0.0, 0.0, 1.0]|
|[0.039553806, 0.015186918, 0.9452593]|[0.0, 0.0, 1.0]|
+-------------------------------------+---------------+
And when I run

df3 = df3.withColumn("p3", F.expr("transform(arrays_zip(p1, p2), x -> (x.p1 + x.p2) / 2)"))
df4 = df3.withColumn("p4", F.expr("array_position(p3, array_max(p3))"))

where p3 is the average value, I get all values of df4 as zero.
First, I recreate your test data:
a = [
    [0, [0, 1, 2]],
    [1, [3, 4, 5]],
]
b = ["id", "value1"]
df1 = spark.createDataFrame(a, b)
c = [
    [0, [1, 2, 3]],
    [1, [4, 5, 6]],
]
d = ["id", "value2"]
df2 = spark.createDataFrame(c, d)
Then, I process the data:
join
df3 = df1.join(df2, on="id")
df3.show()
+---+---------+---------+
| id| value1| value2|
+---+---------+---------+
| 0|[0, 1, 2]|[1, 2, 3]|
| 1|[3, 4, 5]|[4, 5, 6]|
+---+---------+---------+
create the average array
from pyspark.sql import functions as F, types as T

@F.udf(T.ArrayType(T.FloatType()))
def avg_array(array1, array2):
    return list(map(lambda x: (x[0] + x[1]) / 2, zip(array1, array2)))

df3 = df3.withColumn("value3", avg_array(F.col("value1"), F.col("value2")))

# OR without UDF
df3 = df3.withColumn(
    "value3",
    F.expr("transform(arrays_zip(value1, value2), x -> (x.value1 + x.value2) / 2)"),
)
df3.show()
+---+---------+---------+---------------+
| id| value1| value2| value3|
+---+---------+---------+---------------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]|
+---+---------+---------+---------------+
get the index (array_position starts at 1; subtract 1 if you need a 0-based index)
df4 = df3.withColumn("value4",F.expr("array_position(value3, array_max(value3))"))
df4.show()
+---+---------+---------+---------------+------+
| id| value1| value2| value3|value4|
+---+---------+---------+---------------+------+
| 0|[0, 1, 2]|[1, 2, 3]|[0.5, 1.5, 2.5]| 3|
| 1|[3, 4, 5]|[4, 5, 6]|[3.5, 4.5, 5.5]| 3|
+---+---------+---------+---------------+------+
I loaded a csv into a DataFrame with pandas.
The format is the following:
Timestamp | 1014.temperature | 1014.humidity | 1015.temperature | 1015.humidity ....
-------------------------------------------------------------------------------------
2017-... | 23.12 | 12.2 | 25.10 | 10.34 .....
The problem is that the '1014' and '1015' prefixes are actually IDs that should go into their own column.
I would like to end up with the following format for my DF:
TimeStamp | ID | Temperature | Humidity
-----------------------------------------------
. | | |
.
.
.
The CSV is tab separated.
Thanks in advance guys!
import pandas as pd
from io import StringIO
# create sample data frame
s = """Timestamp|1014.temperature|1014.humidity|1015.temperature|1015.humidity
2017|23.12|12.2|25.10|10.34"""
df = pd.read_csv(StringIO(s), sep='|')
df = df.set_index('Timestamp')
# split columns on '.' with list comprehension
l = [col.split('.') for col in df.columns]
# create multi index columns
df.columns = pd.MultiIndex.from_tuples(l)
# stack column level 0, reset the index and rename level_1
final = df.stack(0).reset_index().rename(columns={'level_1': 'ID'})
Timestamp ID humidity temperature
0 2017 1014 12.20 23.12
1 2017 1015 10.34 25.10
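For your actual file, which is tab separated, the same steps apply; only the read changes (the file name below is a placeholder):
# placeholder path; adjust to your actual file
df = pd.read_csv('your_file.csv', sep='\t')
# then apply the same set_index / MultiIndex / stack steps as above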