I want to create a string from a list of substrings and a corresponding frequency list. E.g. my df_in looks like the following:
+-------------------------+-----------+
| substr | frequency |
+-------------------------+-----------+
| ['ham', 'spam', 'eggs'] | [1, 2, 3] |
| ['foo', 'bar'] | [2, 1] |
+-------------------------+-----------+
And I want my df_out to look like this:
+--------------------------------+
| output |
+--------------------------------+
| 'ham spam spam eggs eggs eggs' |
| 'foo foo bar' |
+--------------------------------+
Since the dataset is very large (~22 million rows), I would like to avoid for-loops wherever possible.
Is there any elegant way to achieve this?
Thanks a lot!
Edit:
My current approach:
import pyspark.sql.functions as F
import pyspark.sql.types as T
def create_text(l_sub, l_freq):
    l_str = [(a + ' ') * b if isinstance(b, int) else (a + ' ') for a, b in zip(l_sub, l_freq)]
    return ''.join(l_str)

create_str = F.udf(lambda x, y: create_text(x, y), T.StringType())
df_out = df_in.withColumn('output', create_str(df_in.substr, df_in.frequency))
Problem:
I read that in order to speed up the computation, UDFs should be rewritten using native PySpark functions. I do not know how that can be done, though.
Also, I found out the dtype of df_in.frequency is array<decimal(4,0)>. So I am trying to either convert these values to int first or cast them to int at runtime.
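One way to do the up-front conversion is a higher-order function (a minimal sketch, assuming Spark 2.4+ so that transform is available in SQL expressions):

import pyspark.sql.functions as F

# cast every decimal element of the frequency array to int in one pass
df_in = df_in.withColumn('frequency', F.expr("transform(frequency, x -> cast(x as int))"))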
Check if the following works for you:
from pyspark.sql.functions import expr
df.withColumn('output', expr('''
array_join(flatten(zip_with(`substr`, `frequency`, (x,y) -> array_repeat(x,int(y)))), ' ')
''')).show(truncate=False)
+-----------------+---------+----------------------------+
|substr |frequency|output |
+-----------------+---------+----------------------------+
|[ham, spam, eggs]|[1, 2, 3]|ham spam spam eggs eggs eggs|
|[foo, bar] |[2, 1] |foo foo bar |
+-----------------+---------+----------------------------+
Below is how it works:
Use zip_with to iterate over the two arrays substr (as x) and frequency (as y) side by side and run array_repeat(x, int(y)) on each pair to create an array of y repeats of x.
Flatten the resulting array of arrays.
Join the flattened 1-D array of StringType with a space.
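For reference, the same logic can be written with the DataFrame API instead of an SQL expression (a sketch, assuming Spark 3.1+ where zip_with is exposed as a Python function; column names as above):

from pyspark.sql import functions as F

df_out = df_in.withColumn(
    'output',
    F.array_join(
        F.flatten(
            # repeat each substring by its (int-cast) frequency
            F.zip_with('substr', 'frequency', lambda x, y: F.array_repeat(x, y.cast('int')))
        ),
        ' '
    )
)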
Related
I have the below dataframe which I have read from a JSON file.
+-------------------------------+---------------------------+----------------------------+------------------------------+
| 1                             | 2                         | 3                          | 4                            |
+-------------------------------+---------------------------+----------------------------+------------------------------+
| {"todo":["wakeup", "shower"]} | {"todo":["brush", "eat"]} | {"todo":["read", "write"]} | {"todo":["sleep", "snooze"]} |
+-------------------------------+---------------------------+----------------------------+------------------------------+
I need my output to be key and value (ID and todo) as below. How do I do this? Do I need to create a schema?
+----+----------------+
| ID | todo           |
+----+----------------+
| 1  | wakeup, shower |
| 2  | brush, eat     |
| 3  | read, write    |
| 4  | sleep, snooze  |
+----+----------------+
The key-value which you refer to is a struct. "keys" are struct field names, while "values" are field values.
What you want to do is called unpivoting. One of the ways to do it in PySpark is using stack. The following is a dynamic approach, where you don't need to hard-code the existing column names.
Input dataframe:
df = spark.createDataFrame(
[((['wakeup', 'shower'],),(['brush', 'eat'],),(['read', 'write'],),(['sleep', 'snooze'],))],
'`1` struct<todo:array<string>>, `2` struct<todo:array<string>>, `3` struct<todo:array<string>>, `4` struct<todo:array<string>>')
Script:
to_melt = [f"\'{c}\', `{c}`.todo" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (ID, todo)")
df.show()
# +---+----------------+
# | ID| todo|
# +---+----------------+
# | 1|[wakeup, shower]|
# | 2| [brush, eat]|
# | 3| [read, write]|
# | 4| [sleep, snooze]|
# +---+----------------+
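If you want todo as a comma-separated string rather than an array (as in the desired output), array_join can be applied on top (a small addition, not part of the script above):

from pyspark.sql import functions as F

df = df.withColumn("todo", F.array_join("todo", ", "))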
Use from_json to convert each string to a map, then explode to cascade each element to its own row.
data
df = spark.createDataFrame(
[(('{"todo":"[wakeup, shower]"}'),('{"todo":"[brush, eat]"}'),('{"todo":"[read, write]"}'),('{"todo":"[sleep, snooze]"}'))],
('value1','values2','value3','value4'))
code
from pyspark.sql.functions import array, explode, flatten, from_json, map_values, translate

new = (df.withColumn('todo', explode(flatten(array(*[map_values(from_json(x, "MAP<STRING,STRING>")) for x in df.columns]))))  # from string to array to individual rows
         .withColumn('todo', translate('todo', "[]", ''))  # remove square brackets
      )
new.show(truncate=False)
outcome
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|value1 |values2 |value3 |value4 |todo |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|wakeup, shower|
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|brush, eat |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|read, write |
|{"todo":"[wakeup, shower]"}|{"todo":"[brush, eat]"}|{"todo":"[read, write]"}|{"todo":"[sleep, snooze]"}|sleep, snooze |
+---------------------------+-----------------------+------------------------+--------------------------+--------------+
Currently I'm gathering the top 5 most frequent values with a UDF.
The goal is to achieve the same result without using UDF and have the most efficient solution (avoid groupBy in loops).
Here's the code I'm using to get the result:
from pyspark.sql import functions as F
df = df.select('A', 'B', ...)
@F.udf
def get_top_5_udf(x):
    from collections import Counter
    return [elem[0] for elem in Counter(x).most_common(5)]
agg_expr = [get_top_5_udf(F.collect_list(col)).alias(col) for col in df.columns]
df_top5 = df.agg(*agg_expr)
The result looks like the following:
# result
#+-----------------+--------------+---------------+
#| A | B | ... |
#+-----------------+--------------+---------------+
#| [1, 2, 3, 4, 5] | [...] | ... |
#+-----------------+--------------+---------------+
You can try using count over a window partitioned by each column before aggregating:
from pyspark.sql import functions as F, Window
result = df.select(*[
F.struct(
F.count(c).over(Window.partitionBy(c)).alias("cnt"),
F.col(c).alias("val")
).alias(c) for c in df.columns
]).agg(*[
F.slice(
F.expr(f"transform(sort_array(collect_set({c}), false), x -> x.val)"),
1, 5
).alias(c) for c in df.columns
])
result.show()
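For reference, the transform step can also be written with the Python higher-order-function API instead of the SQL string (a sketch, assuming Spark 3.1+ where F.transform accepts a lambda). Sorting the structs descending works because struct comparison looks at cnt first, so the most frequent values come first.

from pyspark.sql import functions as F, Window

result = df.select(*[
    F.struct(
        F.count(c).over(Window.partitionBy(c)).alias("cnt"),
        F.col(c).alias("val")
    ).alias(c) for c in df.columns
]).agg(*[
    F.slice(
        # sort the distinct (cnt, val) structs by count descending, keep only the values
        F.transform(F.sort_array(F.collect_set(c), asc=False), lambda x: x["val"]),
        1, 5
    ).alias(c) for c in df.columns
])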
I have one dataframe - df_similar_strings, which looks like this:
|---------------------|
| string_values |
|---------------------|
| ['catish', 'cat'] |
|---------------------|
| ['doggo', 'dogy'] |
|---------------------|
and the other one - df_source:
|-----------------------------|------------------|
| values | key_value |
|-----------------------------|------------------|
| ['catish', 'cat', 'cat-'] | cat |
|-----------------------------|------------------|
| ['doggo', 'dogy', 'dog'] | dog |
|-----------------------------|------------------|
I would like to join those data frames based on the columns string_values and values so that there is at least one matching value.
I have no idea how to do this since the columns are nested as arrays.
You just need to cast your lists to tuples and then try merging. Since lists are unhashable, the merge operation can't be applied to them directly. Try this:
df_source["values"] = df_source["values"].apply(tuple)
Do the same with the other df and then try merging using pd.merge.
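A minimal sketch of that idea (column names taken from the question; note that this only joins rows whose lists are exactly equal, not rows that merely share one element):

import pandas as pd

# example frames with the column names from the question
df_similar_strings = pd.DataFrame({'string_values': [['catish', 'cat'], ['doggo', 'dogy']]})
df_source = pd.DataFrame({'values': [['catish', 'cat', 'cat-'], ['doggo', 'dogy', 'dog']],
                          'key_value': ['cat', 'dog']})

# lists are unhashable, so cast them to tuples before merging
df_similar_strings['string_values'] = df_similar_strings['string_values'].apply(tuple)
df_source['values'] = df_source['values'].apply(tuple)

merged = df_similar_strings.merge(df_source, left_on='string_values',
                                  right_on='values', how='inner')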
You can solve it by first doing a cartesian product between your two dataframes and then dropping from that dataframe all rows which don't have any shared value.
For simplicity, I assume the columns in both datasets have the same name ("values"). Also, I assume the lists don't have repeated values (each value appears once).
from collections import Counter

def find_duplicates(arr):
    return [item for item, count in Counter(arr).items() if count == 2]
df1['key']=1
df2['key']=1
cartes_prod_df = df1.merge(df2,on=['key'],how='outer').drop(columns=['key'])
duplicate_values = (cartes_prod_df.values_x + cartes_prod_df.values_y).apply(find_duplicates)
merged_df = cartes_prod_df[duplicate_values.apply(lambda x: len(x)>0)]
I've used a little trick in order to do the cartesian product (adding the key column), and then the duplicate_values found in the concatenated array (built with the + operator) are the values which appear twice, i.e. once in each original list.
UPDATE
In order to supply a full example, here's an example of df1 and df2:
d1 = {'values': [['A','B'],['B','C'],['D']],'otherkey':[1,2,3]}
d2 = {'values': [['A'],['B'],['A','C'],['D']],'otherkey':[4,5,3,6]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
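As a quick illustration of how the duplicate trick behaves with these frames:

# values from the first row of df1 against two rows of df2
find_duplicates(['A', 'B'] + ['A'])   # -> ['A']  shared value, so this row pair is kept
find_duplicates(['A', 'B'] + ['D'])   # -> []     no shared value, so this row pair is dropped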
Now, merged_df contains only the row pairs that share at least one value.
I have a dataframe which looks like the one given below. All the values for a given id are the same except for the mappingcol field.
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
|ref |banana |Map("lname"->"Nikki"| 2 |
|ddd |apple |Map("lname"->"tenka"| 1 |
+--------------------+----------------+--------------------+-------+
I want to merge the rows with the same id in such a way that I get exactly one row per id, with the values of mappingcol merged. The output should look like:
+--------------------+----------------+--------------------+-------+
|misc |fruit |mappingcol |id |
+--------------------+----------------+--------------------+-------+
|ddd |apple |Map("name"->"Sameer"| 1 |
|ref |banana |Map("name"->"Riyazi"| 2 |
+--------------------+----------------+--------------------+-------+
The value of mappingcol for id = 1 would be:
Map(
"name" -> "Sameer",
"lname" -> "tenka"
)
I know that maps can be merged using the ++ operator, so that's not what I'm worried about. I just can't understand how to merge the rows, because if I use a groupBy, I have nothing to aggregate the rows with.
You can use groupBy and then massage the map a little:
df.groupBy("id", "fruit", "misc").agg(collect_list("mappingcol"))
.as[(Int, String, String, Seq[Map[String, String]])]
.map { case (id, fruit, misc, list) => (id, fruit, misc, list.reduce(_ ++ _)) }
.toDF("id", "fruit", "misc", "mappingColumn")
With the first line, you group by your desired columns and aggregate the map pairs into the same element (an array).
With the second line (as), you convert your structure to a Dataset of a Tuple4 with the last element being a sequence of maps
With the third line (map), you merge all the elements to a single map
With the last line (toDF), you give the columns their original names.
OUTPUT
+---+------+----+--------------------------------+
|id |fruit |misc|mappingColumn |
+---+------+----+--------------------------------+
|1 |apple |ddd |[name -> Sameer, lname -> tenka]|
|2 |banana|ref |[name -> Riyazi, lname -> Nikki]|
+---+------+----+--------------------------------+
You can definitely do the above with a Window function!
This is in PySpark not Scala but there's almost no difference when only using native Spark functions.
The code below only works on a map column that has one key-value pair per row, as your example data does, but it can be made to work with map columns that have multiple entries.
from pyspark.sql import functions as F, Window

map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']

# or, a lazier way if you have a lot of columns to group on
group_cols_2 = [c for c in df.columns if c != map_col]  # drop what you're not grouping by

w = Window.partitionBy(group_cols)

# unpack map key and value into a pair struct column
df1 = df.withColumn(map_col, F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))

# collect all key-value structs into an array; here each row
# contains the map entries for all rows in the group/window
df1 = df1.withColumn(map_col, F.collect_list(map_col).over(w))

# drop duplicate rows, as you only want one row per group
df1 = df1.dropDuplicates(group_cols)

# turn the array of structs back into a map
df1 = df1.withColumn(map_col, F.map_from_entries(map_col))
You can save the output of each step to a new column to see how each step works, as I have done below.
from pyspark.sql import functions as F, Window

map_col = 'mappingColumn'
group_cols = ['id', 'fruit', 'misc']
w = Window.partitionBy(group_cols)
df1 = df.withColumn('test', F.struct(F.map_keys(map_col)[0], F.map_values(map_col)[0]))
df1 = df1.withColumn('test1', F.collect_list('test').over(w))
df1 = df1.withColumn('test2', F.map_from_entries('test1'))
df1.show(truncate=False)
df1.printSchema()
df1 = df1.dropDuplicates(group_cols)
Description
Given a dataframe df
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index, grouped by the same id and sorted by date within that group, thus:
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with a window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
Questions
How can I achieve the above computation on current Spark 1.4.1 without using window functions?
When will window functions for regular dataframes be supported in Spark?
Thanks!
You can use HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is the default SQLContext available in the spark-shell and pyspark shells (for now, sparkR seems to use a plain SQLContext), and its parser is recommended by the Spark SQL and DataFrame Guide.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
object HiveContextTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Hive Context")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(
      ("foo", 1) :: ("foo", 2) :: ("bar", 1) :: ("bar", 2) :: Nil
    ).toDF("k", "v")

    val w = Window.partitionBy($"k").orderBy($"v")
    df.select($"k", $"v", rowNumber.over(w).alias("rn")).show
  }
}
You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using flatMap, but if you need to perform anything on the 'group', that won't be a great idea.
The downside to using RDD like this is that it's tedious to convert from DataFrame to RDD and back again.
I totally agree that Window functions for DataFrames are the way to go if you have Spark version >= 1.5. But if you are really stuck with an older version (e.g. 1.4.1), here is a hacky way to solve this:
val df = sc.parallelize((1, "2015-09-01") :: (2, "2015-09-01") :: (1, "2015-09-03") :: (1, "2015-09-04") :: (2, "2015-09-04") :: Nil)
  .toDF("id", "date")

val dfDuplicate = df.selectExpr("id as idDup", "date as dateDup")

val dfWithCounter = df.join(dfDuplicate, $"id" === $"idDup")
  .where($"date" >= $"dateDup")
  .groupBy($"id", $"date")
  .agg(count($"idDup").as("counter"))
  .select($"id", $"date", $"counter")
Now if you do dfWithCounter.show
You will get:
+---+----------+-------+
| id| date|counter|
+---+----------+-------+
| 1|2015-09-01| 1|
| 1|2015-09-04| 3|
| 1|2015-09-03| 2|
| 2|2015-09-01| 1|
| 2|2015-09-04| 2|
+---+----------+-------+
Note that date is not sorted, but the counter is correct. Also, you can reverse the ordering of the counter by changing the >= to <= in the where clause.
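For reference, roughly the same self-join trick in PySpark might look like this (a sketch, column names as above):

from pyspark.sql import functions as F

df_dup = df.selectExpr("id as idDup", "date as dateDup")

df_with_counter = (df.join(df_dup, F.col("id") == F.col("idDup"))
                     .where(F.col("date") >= F.col("dateDup"))  # count earlier-or-equal dates per id
                     .groupBy("id", "date")
                     .agg(F.count("idDup").alias("counter")))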