How to flatten a PySpark dataframe? (Spark 1.6)

I'm working with Spark 1.6.
Here is my data:
from pyspark.sql import Row

eDF = sqlsc.createDataFrame([Row(v=1, eng_1=10, eng_2=20),
                             Row(v=2, eng_1=15, eng_2=30),
                             Row(v=3, eng_1=8,  eng_2=12)])
eDF.select('v', 'eng_1', 'eng_2').show()
+---+-----+-----+
| v|eng_1|eng_2|
+---+-----+-----+
| 1| 10| 20|
| 2| 15| 30|
| 3| 8| 12|
+---+-----+-----+
I would like to 'flatten' this table.
That is to say:
+---+-----+---+
| v| key|val|
+---+-----+---+
| 1|eng_1| 10|
| 1|eng_2| 20|
| 2|eng_1| 15|
| 2|eng_2| 30|
| 3|eng_1| 8|
| 3|eng_2| 12|
+---+-----+---+
Note that since I'm working with Spark 1.6, I can't use pyspark.sql.functions.create_map or pyspark.sql.functions.posexplode.

Use rdd.flatMap to flatten it:
# Spark 1.6 has no SparkSession, so reuse the SQLContext (sqlsc) from the question
df = sqlsc.createDataFrame(
    eDF.rdd.flatMap(
        lambda r: [Row(v=r.v, key=col, val=r[col]) for col in ['eng_1', 'eng_2']]
    )
)
df.show()
+-----+---+---+
| key| v|val|
+-----+---+---+
|eng_1| 1| 10|
|eng_2| 1| 20|
|eng_1| 2| 15|
|eng_2| 2| 30|
|eng_1| 3| 8|
|eng_2| 3| 12|
+-----+---+---+
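Note that the columns come out as key, v, val because Row fields built from keyword arguments are sorted by name here; to get back the column order from the question, finish with a select:
df.select('v', 'key', 'val').show()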

Related

Strategy to convert a correlated Hive query to a PySpark dataframe transformation?

I need to convert the SQL query below into a PySpark dataframe transformation. There is a correlated subquery defined inside the select clause. Is there any way to convert this to PySpark dataframe transformations? I'd appreciate it if you could share articles about this.
Note: the acc_cap table is also created from test_db.test_table after adding a prev_time column using the lag window function on the time column.
Query:
SELECT
    A.id,
    "psmark" fid,
    (
        SELECT DISTINCT psmark
        FROM test_db.test_table
        WHERE id = A.id AND time = A.prev_time AND rnk = 1
    ) AS fromvalue
FROM acc_cap A;
I have taken two sample dataframes, acc_cap_df and test_table_df.
Below is the PySpark equivalent of your query:
>>> acc_cap_df=spark.read.csv("/path to/sample2.csv",header=True)
>>> acc_cap_df.show()
+---+-------+
| id| time|
+---+-------+
|003| 174256|
|003| 174267|
|003| 17429|
|003|1747567|
|001| 10|
|001| 10|
|004| 12719|
|002| 11|
|002| 117|
|002| 11878|
+---+-------+
>>> from pyspark.sql import Window
>>> from pyspark.sql.functions import rank, lag
>>> windowSpec = Window.partitionBy("id").orderBy("time")  # not defined in the original answer; inferred from the output below
>>> acc_cap_df=acc_cap_df.withColumn("rank",rank().over(windowSpec))
>>> acc_cap_df.show()
+---+-------+----+
| id| time|rank|
+---+-------+----+
|003| 174256| 1|
|003| 174267| 2|
|003| 17429| 3|
|003|1747567| 4|
|001| 10| 1|
|001| 10| 1|
|004| 12719| 1|
|002| 11| 1|
|002| 117| 2|
|002| 11878| 3|
+---+-------+----+
>>> acc_cap_df=acc_cap_df.withColumn("prev_time",lag("time",2).over(windowSpec))
>>> acc_cap_df.show()
+---+-------+----+---------+
| id| time|rank|prev_time|
+---+-------+----+---------+
|003| 174256| 1| null|
|003| 174267| 2| 174256|
|003| 17429| 3| 174267|
|003|1747567| 4| 17429|
|001| 10| 1| null|
|001| 10| 1| 10|
|004| 12719| 1| null|
|002| 11| 1| null|
|002| 117| 2| 11|
|002| 11878| 3| 117|
+---+-------+----+---------+
>>> test_table_df=spark.read.csv("/path to/sample1.csv",header=True)
>>> test_table_df.show()
+--------+---+----+
| psmark| id|time|
+--------+---+----+
| csvvsw|001| 10|
|csvvswfw|002| 11|
| csvvsgg|003| 12|
| csvvser|004| 13|
+--------+---+----+
>>> acc_cap_df=acc_cap_df.withColumnRenamed("id","acc_cap_id")
>>> acc_cap_df.show()
+----------+-------+----+
|acc_cap_id| time|rank|
+----------+-------+----+
| 003| 174256| 1|
| 003| 174267| 2|
| 003| 17429| 3|
| 003|1747567| 4|
| 001| 10| 1|
| 001| 10| 1|
| 004| 12719| 1|
| 002| 11| 1|
| 002| 117| 2|
| 002| 11878| 3|
+----------+-------+----+
>>> from pyspark.sql import functions as F
>>> df_join=test_table_df.join(acc_cap_df, (test_table_df.id == acc_cap_df.acc_cap_id) & (test_table_df.time == acc_cap_df.time)).filter(F.col("rank")==1)
>>> df_join.show()
+--------+---+----+----------+----+----+
| psmark| id|time|acc_cap_id|time|rank|
+--------+---+----+----------+----+----+
| csvvsw|001| 10| 001| 10| 1|
| csvvsw|001| 10| 001| 10| 1|
|csvvswfw|002| 11| 002| 11| 1|
+--------+---+----+----------+----+----+
>>> df_output=df_join.select('psmark','id').distinct()
>>> df_output=df_output.withColumnRenamed("psmark","fid")
>>> df_output.show()
+--------+---+
| fid| id|
+--------+---+
| csvvsw|001|
|csvvswfw|002|
+--------+---+
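The original query also returns a literal fid and the looked-up psmark as fromvalue, joined on prev_time rather than time. A possible way to finish the translation is sketched below; it assumes acc_cap_df still carries the prev_time column computed earlier and that test_table_df has the rnk column referenced in the query (the sample CSV here does not, so drop that filter when trying it on the sample data):
# Sketch: replace the correlated scalar subquery with a left join on (id, prev_time)
lookup = test_table_df.filter(F.col("rnk") == 1).select("id", "time", "psmark").distinct()

result = (acc_cap_df.alias("A")
          .join(lookup.alias("T"),
                (F.col("T.id") == F.col("A.acc_cap_id")) &
                (F.col("T.time") == F.col("A.prev_time")),
                "left")
          .select(F.col("A.acc_cap_id").alias("id"),
                  F.lit("psmark").alias("fid"),
                  F.col("T.psmark").alias("fromvalue")))
result.show()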

How can I count the occurrence frequency of records in a Spark dataframe and add it as a new column without affecting the index column?

I'm trying to add a new column named Freq to the given Spark dataframe, without affecting the index column or the order of the records, so that the statistical frequency (the counts) is assigned back to the right row/record in the dataframe.
This is my data frame:
+---+-------------+------+------------+-------------+-----------------+
| id| Type|Length|Token_number|Encoding_type|Character_feature|
+---+-------------+------+------------+-------------+-----------------+
| 0| Sentence| 4014| 198| false| 136|
| 1| contextid| 90| 2| false| 15|
| 2| Sentence| 172| 11| false| 118|
| 3| String| 12| 0| true| 11|
| 4|version-style| 16| 0| false| 13|
| 5| Sentence| 339| 42| false| 110|
| 6|version-style| 16| 0| false| 13|
| 7| url_variable| 10| 2| false| 9|
| 8| url_variable| 10| 2| false| 9|
| 9| Sentence| 172| 11| false| 117|
| 10| contextid| 90| 2| false| 15|
| 11| Sentence| 170| 11| false| 114|
| 12|version-style| 16| 0| false| 13|
| 13| Sentence| 68| 10| false| 59|
| 14| String| 12| 0| true| 11|
| 15| Sentence| 173| 11| false| 118|
| 16| String| 12| 0| true| 11|
| 17| Sentence| 132| 8| false| 96|
| 18| String| 12| 0| true| 11|
| 19| contextid| 88| 2| false| 15|
+---+-------------+------+------------+-------------+-----------------+
I tried the following script unsuccessfully, due to the presence of the index column id:
from pyspark.sql import functions as F
from pyspark.sql import Window
bo = features_sdf.select('id', 'Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature')
sdf2 = (
    bo.na.fill(0).withColumn(
        'Freq',
        F.count("*").over(Window.partitionBy(bo.columns))
    ).withColumn(
        'MaxFreq',
        F.max('Freq').over(Window.partitionBy())
    ).withColumn(
        'MinFreq',
        F.min('Freq').over(Window.partitionBy())
    )
)
sdf2.show()
# bad result due to the id column, which makes every record unique and causes Freq=1
+---+-------------+------+------------+-------------+-----------------+----+-------+-------+
| id| Type|Length|Token_number|Encoding_type|Character_feature|Freq|MaxFreq|MinFreq|
+---+-------------+------+------------+-------------+-----------------+----+-------+-------+
| 0| Sentence| 4014| 198| false| 136| 1| 1| 1|
| 1| contextid| 90| 2| false| 15| 1| 1| 1|
| 2| Sentence| 172| 11| false| 118| 1| 1| 1|
| 3| String| 12| 0| true| 11| 1| 1| 1|
| 4|version-style| 16| 0| false| 13| 1| 1| 1|
| 5| Sentence| 339| 42| false| 110| 1| 1| 1|
| 6|version-style| 16| 0| false| 13| 1| 1| 1|
| 7| url_variable| 10| 2| false| 9| 1| 1| 1|
| 8| url_variable| 10| 2| false| 9| 1| 1| 1|
| 9| Sentence| 172| 11| false| 117| 1| 1| 1|
| 10| contextid| 90| 2| false| 15| 1| 1| 1|
| 11| Sentence| 170| 11| false| 114| 1| 1| 1|
| 12|version-style| 16| 0| false| 13| 1| 1| 1|
| 13| Sentence| 68| 10| false| 59| 1| 1| 1|
| 14| String| 12| 0| true| 11| 1| 1| 1|
| 15| Sentence| 173| 11| false| 118| 1| 1| 1|
| 16| String| 12| 0| true| 11| 1| 1| 1|
| 17| Sentence| 132| 8| false| 96| 1| 1| 1|
| 18| String| 12| 0| true| 11| 1| 1| 1|
| 19| contextid| 88| 2| false| 15| 1| 1| 1|
+---+-------------+------+------------+-------------+-----------------+----+-------+-------+
If I exclude the index column id, the code works, but it messes up the order (due to unwanted sorting/ordering), and the results are not assigned to the right record/row, as follows:
+--------+------+------------+-------------+-----------------+----+-------+-------+
| Type|Length|Token_number|Encoding_type|Character_feature|Freq|MaxFreq|MinFreq|
+--------+------+------------+-------------+-----------------+----+-------+-------+
|Sentence| 7| 1| false| 6| 2| 1665| 1|
|Sentence| 7| 1| false| 6| 2| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 17| 4| false| 14| 6| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
|Sentence| 19| 4| false| 17| 33| 1665| 1|
+--------+------+------------+-------------+-----------------+----+-------+-------+
In the end, I want to normalize this frequency between 0 and 1 using a simple mathematical formula and use it as a new feature. During normalization I also ran into problems and got null values. I already implemented the pandas version, which is easy, but I'm stuck in Spark:
# Statistical Preprocessing
def add_freq_to_features(df):
    frequencies_df = df.groupby(list(df.columns)).size().to_frame().rename(columns={0: "Freq"})
    frequencies_df["Freq"] = frequencies_df["Freq"] / frequencies_df["Freq"].sum()  # Normalizing between 0 & 1
    new_df = pd.merge(df, frequencies_df, how='left', on=list(df.columns))
    return new_df
# Apply frequency allocation and merge with extracted features df
features_df = add_freq_to_features(oba)
features_df.head(20)
and it returns the right results, as I expected.
I also tried to literally translate the pandas script using df.groupBy(df.columns).count(), but I couldn't get it to work:
# this is to build "raw" Freq based on #pltc answer
sdf2 = (sdf
        .groupBy(sdf.columns)
        .agg(F.count('*').alias('Freq'))
        .withColumn('Encoding_type', F.col('Encoding_type').cast('string'))
)
sdf2.cache().count()
sdf2.show()
Here is the full PySpark code of what we have tried on a simplified example (available in this Colab notebook), based on the answer of #ggordon:
def add_freq_to_features_(df):
    from pyspark.sql import functions as F
    from pyspark.sql import Window

    sdf_pltc = df.select('id', 'Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature')
    print("Before Any Modification")  # only included for debugging purposes
    sdf_pltc.show(5, truncate=0)

    # fill missing values with 0 using `na.fill(0)` before applying count as window function
    sdf2 = (
        sdf_pltc.na.fill(0).withColumn(
            'Freq',
            F.count("*").over(Window.partitionBy(sdf_pltc.columns))
        ).withColumn(
            'MaxFreq',
            F.max('Freq').over(Window.partitionBy())
        ).withColumn(
            'MinFreq',
            F.min('Freq').over(Window.partitionBy())
        )
        .withColumn('id', F.col('id'))
    )
    print("After replacing null with 0 and counting by partitions")  # only included for debugging purposes
    # use orderBy as your last operation, only included here for debugging purposes
    #sdf2 = sdf2.orderBy(F.col('Type').desc(), F.col('Length').desc())
    sdf2.show(5, truncate=False)  # only included for debugging purposes

    sdf2 = (
        sdf2.withColumn('Freq', F.when(
            F.col('MaxFreq') == 0.000000000, 0
        ).otherwise(
            (F.col('Freq') - F.col('MinFreq')) / (F.col('MaxFreq') - F.col('MinFreq'))
        ))  # Normalizing between 0 & 1
    )
    sdf2 = sdf2.drop('MinFreq').drop('MaxFreq')
    sdf2 = sdf2.withColumn('Encoding_type', F.col('Encoding_type').cast('string'))
    #sdf2 = sdf2.orderBy(F.col('Type').desc(), F.col('Length').desc())
    print("After normalization, encoding transformation and order by")  # only included for debugging purposes
    sdf2.show(50, truncate=False)

    return sdf2
Sadly, since I'm dealing with big data, I can't hack it with df.toPandas(): it is expensive and causes OOM errors.
Any help will be forever appreciated.
The pandas behavior is different because the ID field is the DataFrame index, so it does not count in the "group by all" you do. You can get the same behavior in Spark with one change.
partitionBy takes an ordinary list of strings. Try removing the id column from your partition key list like this:
bo = features_sdf.select('id', 'Type', 'Length', 'Token_number', 'Encoding_type', 'Character_feature')
# note: list.remove() mutates in place and returns None, so build the list explicitly
partition_columns = [c for c in bo.columns if c != 'id']
sdf2 = (
    bo.na.fill(0).withColumn(
        'Freq',
        F.count("*").over(Window.partitionBy(partition_columns))
    ).withColumn(
        'MaxFreq',
        F.max('Freq').over(Window.partitionBy())
    ).withColumn(
        'MinFreq',
        F.min('Freq').over(Window.partitionBy())
    )
)
That will give you the results you said worked but keep the ID field. You'll need to figure out how to do the division for your frequencies but that should get you started.
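For that division, a minimal sketch that mirrors the pandas version (each row's Freq divided by the total row count, which equals the sum of the group counts) could look like this; Freq_norm is just an illustrative name:
from pyspark.sql import functions as F

# Sketch: normalize Freq the same way the pandas version does,
# i.e. divide each row's Freq by the total number of rows
total = sdf2.count()
sdf2 = sdf2.withColumn('Freq_norm', F.col('Freq') / F.lit(total))
sdf2.show()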

Pyspark dataframes group by

I have a dataframe like below:
+---+---+---+
|123|124|125|
+---+---+---+
|  1|  2|  3|
|  9|  9|  4|
|  4| 12|  1|
|  2|  4|  8|
|  7|  6|  3|
| 19| 11|  2|
| 21| 10| 10|
+---+---+---+
I need the data to be in:
1: [123, 125]
2: [123, 124, 125]
3: [125]
The order does not need to be sorted. I am new to dataframes in PySpark; any help would be appreciated.
There is no melt (unpivot) API in PySpark that will accomplish this directly. Instead, flatMap over the RDD into a new dataframe and aggregate:
df.show()
+---+---+---+
|123|124|125|
+---+---+---+
| 1| 2| 3|
| 9| 9| 4|
| 4| 12| 1|
| 2| 4| 8|
| 7| 6| 3|
| 19| 11| 2|
| 21| 10| 10|
+---+---+---+
For each column of each row in the RDD, output a row with two columns: the value and the column name:
cols = df.columns
(df.rdd
    .flatMap(lambda row: [(row[c], c) for c in cols])
    .toDF(["value", "column_name"])
    .show())
+-----+-----------+
|value|column_name|
+-----+-----------+
| 1| 123|
| 2| 124|
| 3| 125|
| 9| 123|
| 9| 124|
| 4| 125|
| 4| 123|
| 12| 124|
| 1| 125|
| 2| 123|
| 4| 124|
| 8| 125|
| 7| 123|
| 6| 124|
| 3| 125|
| 19| 123|
| 11| 124|
| 2| 125|
| 21| 123|
| 10| 124|
+-----+-----------+
Then, group by the value and aggregate the column names into a list:
from pyspark.sql import functions as f
(df.rdd
    .flatMap(lambda row: [(row[c], c) for c in cols])
    .toDF(["value", "column_name"])
    .groupby("value").agg(f.collect_list("column_name"))
    .show())
+-----+-------------------------+
|value|collect_list(column_name)|
+-----+-------------------------+
| 19| [123]|
| 7| [123]|
| 6| [124]|
| 9| [123, 124]|
| 1| [123, 125]|
| 10| [124, 125]|
| 3| [125, 125]|
| 12| [124]|
| 8| [125]|
| 11| [124]|
| 2| [124, 123, 125]|
| 4| [125, 123, 124]|
| 21| [123]|
+-----+-------------------------+
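Optionally (just a cosmetic sketch), alias the aggregated column and sort by value for a cleaner, deterministic display; "columns" is an arbitrary name:
(df.rdd
    .flatMap(lambda row: [(row[c], c) for c in cols])
    .toDF(["value", "column_name"])
    .groupby("value")
    .agg(f.collect_list("column_name").alias("columns"))
    .orderBy("value")
    .show(truncate=False))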

Need to perform a multi-column join on a dataframe with a look-up dataframe

I have two dataframes like so
+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
| 0| 1| 2| 3| 4|
| 5| 6| 7| 8| 9|
+---+---+---+---+---+
+---+---+
|key|val|
+---+---+
| 0| A|
| 1| B|
| 2| C|
| 3| D|
| 4| E|
| 5| F|
| 6| G|
| 7| H|
| 8| I|
| 9| J|
+---+---+
I want to look up each column of df1 against the key in df2 and return the corresponding val from df2 for each.
Here is the code to produce the two input dataframes
df1 = sc.parallelize([('0','1','2','3','4',), ('5','6','7','8','9',)]).toDF(['c1','c2','c3','c4','c5'])
df1.show()
df2 = sc.parallelize([('0','A',), ('1','B', ),('2','C', ),('3','D', ),('4','E',),\
('5','F',), ('6','G', ),('7','H', ),('8','I', ),('9','J',)]).toDF(['key','val'])
df2.show()
I want to join the above to produce the following:
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
|  0|  1|  2|  3|  4|  A|  B|  C|  D|  E|
|  5|  6|  7|  8|  9|  F|  G|  H|  I|  J|
+---+---+---+---+---+---+---+---+---+---+
I can get it to work for a single column like so, but I'm not sure how to extend it to all columns:
df1.join(df2, df1.c1 == df2.key).select('c1','val').show()
+---+---+
| c1|val|
+---+---+
| 0| A|
| 5| F|
+---+---+
You can just chain the joins:
(df1
    .join(df2, on=df1.c1 == df2.key, how='left')
    .withColumnRenamed('val', 'lu1')
    .join(df2, on=df1.c2 == df2.key, how='left')
    .withColumnRenamed('val', 'lu2')
    # ... and so on for c3, c4, c5
)
You can even do it in a loop, but don't do it with too many columns:
from pyspark.sql import functions as f

df = df1
for i in range(1, 6):
    df = df \
        .join(df2.alias(str(i)), on=f.col('c{}'.format(i)) == f.col("{}.key".format(i)), how='left') \
        .withColumnRenamed('val', 'lu{}'.format(i))

df \
    .select('c1', 'c2', 'c3', 'c4', 'c5', 'lu1', 'lu2', 'lu3', 'lu4', 'lu5') \
    .show()
output
+---+---+---+---+---+---+---+---+---+---+
| c1| c2| c3| c4| c5|lu1|lu2|lu3|lu4|lu5|
+---+---+---+---+---+---+---+---+---+---+
| 5| 6| 7| 8| 9| F| G| H| I| J|
| 0| 1| 2| 3| 4| A| B| C| D| E|
+---+---+---+---+---+---+---+---+---+---+
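If df2 is small enough to collect to the driver and you are on Spark 2.0+, a possible alternative (a sketch, not part of the answer above) is to build a literal map once and look each column up without any joins; lu1..lu5 are just the names from the desired output:
from itertools import chain
from pyspark.sql import functions as F

# Sketch: collect the small lookup table and turn it into a literal map expression (key -> val)
mapping = F.create_map(*chain.from_iterable(
    (F.lit(r['key']), F.lit(r['val'])) for r in df2.collect()))

# Look each c* column up in the map
df1.select(
    '*',
    *[mapping[F.col(c)].alias('lu{}'.format(i + 1)) for i, c in enumerate(df1.columns)]
).show()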

Pyspark: Add a new column containing the value from one column that corresponds to a value in another column meeting a specified condition

I want to add a new column that contains the value from one column corresponding to a value in another column that meets a specified condition.
For instance, the original DF is as follows:
+-----+-----+-----+
|col1 |col2 |col3 |
+-----+-----+-----+
| A| 17| 1|
| A| 16| 2|
| A| 18| 2|
| A| 30| 3|
| B| 35| 1|
| B| 34| 2|
| B| 36| 2|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
+-----+-----+-----+
For each of col1's groups, I need to repeat the value in col2 that corresponds to 1 in col3. If there is more than one value with col3 = 1 in a group, repeat the minimum such value.
The desired DF is as follows:
+----+----+----+----------+
|col1|col2|col3|new_column|
+----+----+----+----------+
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
+----+----+----+----------+
df3=df.filter(df.col3==1)
+----+----+----+
|col1|col2|col3|
+----+----+----+
| B| 35| 1|
| C| 20| 1|
| C| 30| 1|
| C| 43| 1|
| A| 17| 1|
+----+----+----+
df3.createOrReplaceTempView("mytable")
To obtain the minimum value of col2, I followed the accepted answer in this link: How to find exact median for grouped data in Spark
df6=spark.sql("select col1, min(col2) as minimum from mytable group by col1 order by col1")
df6.show()
+----+-------+
|col1|minimum|
+----+-------+
| A| 17|
| B| 35|
| C| 20|
+----+-------+
df_a=df.join(df6,['col1'],'leftouter')
+----+----+----+-------+
|col1|col2|col3|minimum|
+----+----+----+-------+
| B| 35| 1| 35|
| B| 34| 2| 35|
| B| 36| 2| 35|
| C| 20| 1| 20|
| C| 30| 1| 20|
| C| 43| 1| 20|
| A| 17| 1| 17|
| A| 16| 2| 17|
| A| 18| 2| 17|
| A| 30| 3| 17|
+----+----+----+-------+
Is there a better way than this solution?
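A possible shorter alternative (a sketch, not from the post above) is a single window aggregate: per col1 group, take the minimum col2 among only the rows where col3 == 1, which avoids the separate filter, groupBy and join:
from pyspark.sql import functions as F
from pyspark.sql import Window

# Sketch: min(col2) over rows where col3 == 1, computed per col1 group;
# F.when(...) leaves the other rows null, and min() ignores nulls.
w = Window.partitionBy('col1')
df.withColumn('new_column', F.min(F.when(F.col('col3') == 1, F.col('col2'))).over(w)).show()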