This might be a duplicate somewhere, but I have a simple df:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType

df1_schema = StructType([StructField("Date", StringType(), True)])
df_data = [('1-Jun-20',)]
rdd = sc.parallelize(df_data)
df1 = sqlContext.createDataFrame(df_data, df1_schema)
#df1 = df1.withColumn("Date", to_date("Date", 'yyyy-MM-dd'))
df1.show()
+--------+
| Date|
+--------+
|1-Jun-20|
+--------+
I tried to change it to a Date format, but it just gives me a null value.
This is what I tried:
df1= df1.withColumn("Date2", F.to_date(col('Date'), "dd-MM-yyyy"))
+--------+-----+
|    Date|Date2|
+--------+-----+
|1-Jun-20| null|
+--------+-----+
Any solution to this? Thank you.
The correct format for your Date is "d-MMM-yy":
df1.withColumn("Date2", F.to_date(col('Date'), "d-MMM-yy")).show()
+--------+----------+
| Date| Date2|
+--------+----------+
|1-Jun-20|2020-06-01|
+--------+----------+
This also works for 01-Jun-20 or 10-Jun-20.
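For completeness, here is a self-contained sketch of the fix (assuming an active SparkSession named spark; the two-row sample dataframe here is hypothetical):
from pyspark.sql import functions as F

df1 = spark.createDataFrame([("1-Jun-20",), ("10-Jun-20",)], ["Date"])
df1.withColumn("Date2", F.to_date(F.col("Date"), "d-MMM-yy")).show()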
I am trying to join dataframes using a LIKE expression in which the condition (the content of the LIKE) is stored in a column. Is this possible in PySpark 2.3?
Source dataframe:
+---------+----------+
|firstname|middlename|
+---------+----------+
| James| |
| Michael| Rose|
| Robert| Williams|
| Maria| Anne|
+---------+----------+
Second dataframe:
+---------+----+
|condition|dest|
+---------+----+
| %a%|Box1|
| %b%|Box2|
+---------+----+
Expected result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
| James| | %a%|Box1|
| Michael| Rose| %a%|Box1|
| Robert| Williams| %b%|Box2|
| Maria| Anne| %a%|Box1|
+---------+----------+---------+----+
Let me reproduce the issue on the sample below.
Let's create a sample dataframe:
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data = [("James",""),
("Michael","Rose"),
("Robert","Williams"),
("Maria","Anne")
]
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("middlename",StringType(),True)
])
df = spark.createDataFrame(data=data,schema=schema)
df.show()
and the second one:
mapping = [("%a%","Box1"),("%b%","Box2")]
schema = StructType([
    StructField("condition", StringType(), True),
    StructField("dest", StringType(), True)
])
map = spark.createDataFrame(data=mapping,schema=schema)
map.show()
If I am right, it is not possible to use LIKE directly when joining dataframes, so I created a crossJoin and tried to use a filter with like. But is it possible to take the content from a column, not a fixed string? This is invalid syntax of course, but I am looking for another solution:
df.crossJoin(map).filter(df.firstname.like(map.condition)).show()
Any expression can be used as a join condition. It is true that with the DataFrame API the like function's parameter can only be a str, not a Column, so you can't write col("firstname").like(col("condition")). However, the SQL version does not have this limitation, so you can leverage expr (from pyspark.sql.functions):
df.join(map, expr("firstname like condition")).show()
Or just plain SQL:
df.createOrReplaceTempView("df")
map.createOrReplaceTempView("map")
spark.sql("SELECT * FROM df JOIN map ON firstname like condition").show()
Both return the same result:
+---------+----------+---------+----+
|firstname|middlename|condition|dest|
+---------+----------+---------+----+
| James| | %a%|Box1|
| Michael| Rose| %a%|Box1|
| Robert| Williams| %b%|Box2|
| Maria| Anne| %a%|Box1|
+---------+----------+---------+----+
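If you prefer to keep your original crossJoin + filter shape, the same expr trick should work there as well; a sketch, reusing the df and map dataframes from above:
from pyspark.sql.functions import expr

df.crossJoin(map).filter(expr("firstname like condition")).show()
Both variants should produce the same result as the queries above.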
I have the below two data frames, with a hash added as an additional column to identify differences for the same id across both data frames:
df1=
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales     |NY   |101|c123
Maria|Finance   |CA   |102|d234
Jen  |Marketing |NY   |103|df34
df2=
name |department|state|id |hash
-----+----------+-----+---+----
James|Sales1    |null |101|4df2
Maria|Finance   |     |102|5rfg
Jen  |          |NY2  |103|2f34
#identify unmatched row for same id from both data frame
df1_un_match_indf2=df1.join(df2,df1.hash==df2.hash,"leftanti")
df2_un_match_indf1=df2.join(df1,df2.hash==df1.hash,"leftanti")
#The above lists rows from both data frames, since all hashes for the same id are different
Now I am trying to find the differences in row values for the same id from the df1_un_match_indf2 and df2_un_match_indf1 data frames, so that the differences show up row by row.
df3=df1_un_match_indf2
df4=df2_un_match_indf1
common_diff=df3.join(df4,df3.id==df4.id,"inner")
common_diff.show()
but the result shows the differences like this:
+-----+----------+-----+---+----+-----+----------+-----+---+----+
| name|department|state| id|hash| name|department|state| id|hash|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
|James|     Sales|   NY|101|c123|James|    Sales1| null|101|4df2|
|Maria|   Finance|   CA|102|d234|Maria|   Finance|     |102|5rfg|
|  Jen| Marketing|   NY|103|df34|  Jen|          |  NY2|103|2f34|
+-----+----------+-----+---+----+-----+----------+-----+---+----+
What I am expecting is:
+-----------------+---------------------+------------+-------------+---------------+
|name             |department           |state       |id           |hash           |
+-----------------+---------------------+------------+-------------+---------------+
|['James','James']|['Sales','Sales1']   |['NY',null] |['101','101']|['c123','4df2']|
|['Maria','Maria']|['Finance','Finance']|['CA','']   |['102','102']|['d234','5rfg']|
|['Jen','Jen']    |['Marketing','']     |['NY','NY2']|['103','103']|['df34','2f34']|
+-----------------+---------------------+------------+-------------+---------------+
I tried different ways, but didn't find the right solution to produce this expected format.
Can anyone give a solution or an idea for this?
Thanks
What you want to use is likely collect_list, or maybe collect_set.
Here is how they behave on a small example:
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("a", None, None),
("a", "code1", None),
("a", "code2", "name2"),
], ["id", "code", "name"])
df.show()
+---+-----+-----+
| id| code| name|
+---+-----+-----+
| a| null| null|
| a|code1| null|
| a|code2|name2|
+---+-----+-----+
(df
.groupby("id")
.agg(F.collect_set("code"),
F.collect_list("name"))
.show())
+---+-----------------+------------------+
| id|collect_set(code)|collect_list(name)|
+---+-----------------+------------------+
| a| [code1, code2]| [name2]|
+---+-----------------+------------------+
In your case you need to slightly change your join into a union to enable you to group the data.
df3=df1_un_match_indf2
df4=df2_un_match_indf1
common_diff = df3.union(df4)
(common_diff
.groupby("id")
.agg(F.collect_set("name"),
F.collect_list("department"))
.show())
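If you want every column paired up per id (as in your expected output), the same groupby can be built over all non-id columns. A sketch, assuming df3 and df4 share the schema shown in the question:
from pyspark.sql import functions as F

common_diff = df3.union(df4)
agg_exprs = [F.collect_list(c).alias(c) for c in common_diff.columns if c != "id"]
common_diff.groupBy("id").agg(*agg_exprs).show(truncate=False)
Note that collect_list silently drops nulls, so a null state would not appear in the collected pair; collect_set additionally removes duplicates.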
If you can't do a union and stick with the join instead, just use array on the renamed columns:
from pyspark.sql.functions import array

# assumes the duplicate columns from the join were renamed to thisState/thatState and thisDept/thatDept
common_diff.select(
    common_diff.id,
    array(
        common_diff.thisState,
        common_diff.thatState
    ).alias("State"),
    array(
        common_diff.thisDept,
        common_diff.thatDept
    ).alias("Department")
)
It's a lot more typing and a little more fragile. I suggest that renaming the columns and using the groupby approach is likely cleaner and clearer.
My DataFrame has multiple columns. I need to add a new column for the whole row size, which means I need to add all the columns' sizes together. Is there a simple way to do it efficiently? Thanks
Here is the sample:
val DataFrame = Seq(("Alice", "He is girl"), ("Bob", "She is girl"), ("Ben", null)).toDF("name","string")
display(DataFrame)
I want to add a column to the df that sums the length of each column. In this sample there are only two columns, but in reality I have hundreds of columns in the df.
val df = Seq(("Alice", "He is girl"),
("Bob", "She is girl"), ("Ben", null)).toDF("name","string")
scala> df.show
+-----+-----------+
| name| string|
+-----+-----------+
|Alice| He is girl|
| Bob|She is girl|
| Ben| null|
+-----+-----------+
Get rid of null values:
val dfNoNull = df.na.fill("")
scala> dfNoNull.show
+-----+-----------+
| name| string|
+-----+-----------+
|Alice| He is girl|
| Bob|She is girl|
| Ben| |
+-----+-----------+
Create a list of column expressions with the length function applied to each of them:
import org.apache.spark.sql.functions.{col, length}

val cols = dfNoNull.columns.map(x => length(col(x)))
Select data based on these columns/expressions:
val dfColCounts = dfNoNull.select(cols:_*)
scala> dfColCounts.show
+------------+--------------+
|length(name)|length(string)|
+------------+--------------+
| 5| 10|
| 3| 11|
| 3| 0|
+------------+--------------+
Get these new column names:
val countCols = dfColCounts.columns.map(x => col(x))
Apply reduce to sum up all the column values, which are ints by now:
val dfPerRowCounts = dfColCounts
.withColumn("countPerRow", countCols.reduce(_ + _))
.select("countPerRow")
Result:
scala> dfPerRowCounts.show
+-----------+
|countPerRow|
+-----------+
| 15|
| 14|
| 3|
+-----------+
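The same idea can also be collapsed into a single withColumn on the original df, with coalesce standing in for the na.fill step; a sketch under the same assumptions (all columns are string-typed):
import org.apache.spark.sql.functions.{coalesce, col, length, lit}

// sum of the lengths of all columns, treating nulls as empty strings
val perRowCount = df.columns
  .map(c => length(coalesce(col(c), lit(""))))
  .reduce(_ + _)

df.withColumn("countPerRow", perRowCount).show()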
I have recently started working with PySpark, so I don't know many details regarding this.
I am trying to create a BinaryType column in a data frame, but I am struggling to do it...
for example, let's take a simple df
df.show(2)
+----+----+
|col1|col2|
+----+----+
| "1"|null|
| "2"|"20"|
+----+----+
Now I want to have a third column "col3" with BinaryType, like this:
+----+----+--------+
|col1|col2|    col3|
+----+----+--------+
| "1"|null|[1 null]|
| "2"|"20"|  [2 20]|
+----+----+--------+
How should I do it?
Try this:
from pyspark.sql import functions as F

a = [('1', None), ('2', '20')]
df = spark.createDataFrame(a, ['col1', 'col2'])
df.show()
+----+----+
|col1|col2|
+----+----+
| 1|null|
| 2| 20|
+----+----+
df = df.withColumn('col3', F.array(['col1', 'col2']))
df.show()
+----+----+-------+
|col1|col2| col3|
+----+----+-------+
| 1|null| [1,]|
| 2| 20|[2, 20]|
+----+----+-------+
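A side note: F.array requires the combined columns to share one type, so if your real columns have mixed types, a struct may be the safer container. A sketch reusing the df above (the column name col3_struct is just for illustration):
df = df.withColumn('col3_struct', F.struct('col1', 'col2'))
df.show()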
I have a dataframe track_log whose columns are:
item  track_info  Date
----------------------
1     ordered     01/01/19
1     Shipped     02/01/19
1     delivered   03/01/19
I want to get the data as:
item  ordered   Shipped   Delivered
-----------------------------------
1     01/01/19  02/01/19  03/01/19
I need to resolve this using PySpark.
I can think of a solution like this:
>>> df.show()
+----+----------+--------+
|item|track_info| date|
+----+----------+--------+
| 1| ordered|01/01/19|
| 1| Shipped|02/01/19|
| 1| delivered|03/01/19|
+----+----------+--------+
>>> from pyspark.sql.functions import collect_list
>>> df_grouped=df.groupBy(df.item).agg(collect_list(df.track_info).alias('grouped_data'))
>>> df_grouped_date=df.groupBy(df.item).agg(collect_list(df.date).alias('grouped_dates'))
>>> df_cols=df_grouped.select(df_grouped.grouped_data).first()['grouped_data']
>>> df_cols.insert(0,'item')
>>> df_grouped_date.select(df_grouped_date.item,df_grouped_date.grouped_dates[0],df_grouped_date.grouped_dates[1],df_grouped_date.grouped_dates[2]).toDF(*df_cols).show()
+----+--------+--------+---------+
|item| ordered| Shipped|delivered|
+----+--------+--------+---------+
| 1|01/01/19|02/01/19| 03/01/19|
+----+--------+--------+---------+
You can use the Spark pivot function to do that as a one-liner, as below:
>>> df.show()
+----+----------+--------+
|item|track_info| date|
+----+----------+--------+
| 1| ordered|01/01/19|
| 1| Shipped|02/01/19|
| 1| delivered|03/01/19|
+----+----------+--------+
>>> from pyspark.sql.functions import collect_list
>>> pivot_df = df.groupBy('item').pivot('track_info').agg(collect_list('date'))
>>> pivot_df.show()
+----+----------+----------+----------+
|item|   ordered|   Shipped| delivered|
+----+----------+----------+----------+
|   1|[01/01/19]|[02/01/19]|[03/01/19]|
+----+----------+----------+----------+
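If you want plain dates instead of single-element arrays (to match the expected output exactly), you could aggregate with first instead of collect_list; a sketch under the same setup:
>>> from pyspark.sql.functions import first
>>> df.groupBy('item').pivot('track_info').agg(first('date')).show()
This should give the same layout as the first answer above, with one date per cell.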