I have a dataframe in Spark with a name column, a space-delimited string. The tricky part is that some names have a middle name and others don't. How can I split the column into firstname, middlename and lastname? I am using F.split, but I don't know how to tell the middle name apart from the last name. I understand I cannot use a negative index in Spark. Here is my sample df:
from pyspark.sql import functions as F
cols = ['id', 'name']
vals = [('l03', 'Bob K Barry'), ('S20', 'Cindy Winston'), ('l10', 'Jerry Kyle Moore'), ('j31', 'Dora Larson')]
df = spark.createDataFrame(vals, cols)
df.show()
+---+----------------+
| id| name|
+---+----------------+
|l03| Bob K Barry|
|S20| Cindy Winston|
|l10|Jerry Kyle Moore|
|j31| Dora Larson|
+---+----------------+
split_col = F.split(df['name'], ' ')
df = df.withColumn('firstname', split_col.getItem(0))
df.show()
+---+----------------+---------+
| id| name|firstname|
+---+----------------+---------+
|l03| Bob K Barry| Bob|
|S20| Cindy Winston| Cindy|
|l10|Jerry Kyle Moore| Jerry|
|j31| Dora Larson| Dora|
+---+----------------+---------+
How do I continue the split? Any help is appreciated.
Always take the first element of the split array as the first name and the last element (located with size) as the last name. If there can never be more than one middle name, you can do:
from pyspark.sql import functions as F

df.withColumn("split_list", F.split(F.col("name"), " ")) \
  .withColumn("fn", F.col("split_list")[0]) \
  .withColumn("ln", F.col("split_list")[F.size("split_list") - 1]) \
  .withColumn("mn", F.when(F.size("split_list") == 2, None)
                     .otherwise(F.col("split_list")[1])) \
  .drop("split_list").show()
+---+----------------+-----+-------+----+
| id| name| fn| ln| mn|
+---+----------------+-----+-------+----+
|l03| Bob K Barry| Bob| Barry| K|
|S20| Cindy Winston|Cindy|Winston|null|
|l10|Jerry Kyle Moore|Jerry| Moore|Kyle|
|j31| Dora Larson| Dora| Larson|null|
+---+----------------+-----+-------+----+
If there can be more than one middle name, you can instead take a substring of name for the middle-name column (the output below includes an extra row, 'Fn A B C Ln', to illustrate this case):
df.withColumn("split_list", F.split(F.col("name"), " ")).withColumn("fn", col("split_list")[0])\
.withColumn("ln", col("split_list")[F.size("split_list") - 1])\
.withColumn("mn", when(F.size("split_list")==2, None)\
.otherwise(col('name').substr(F.length("fn")+2, \
F.length("name")-F.length("fn")-F.length("ln")-2))).drop("split_list").show()
+---+----------------+-----+-------+-----+
| id| name| fn| ln| mn|
+---+----------------+-----+-------+-----+
|l03| Bob K Barry| Bob| Barry| K|
|S20| Cindy Winston|Cindy|Winston| null|
|l10|Jerry Kyle Moore|Jerry| Moore| Kyle|
|j31| Dora Larson| Dora| Larson| null|
|A12| Fn A B C Ln| Fn| Ln|A B C|
+---+----------------+-----+-------+-----+
I'm assuming that the first name is the first element, the last name is the last element, and anything in between is the middle name. This is not always true, since people can have multi-word first or last names.
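For completeness, here's an alternative sketch (my addition, not part of the original answer) that builds the middle name from the split array itself using slice and array_join, which avoids the substring arithmetic. It assumes Spark 2.4+, where those functions exist:

from pyspark.sql import functions as F

df.withColumn("split_list", F.split(F.col("name"), " ")) \
  .withColumn("fn", F.col("split_list")[0]) \
  .withColumn("ln", F.col("split_list")[F.size("split_list") - 1]) \
  .withColumn("mn", F.when(F.size("split_list") > 2,
                           # everything between the first and last element, re-joined with spaces
                           F.array_join(F.expr("slice(split_list, 2, size(split_list) - 2)"), " "))) \
  .drop("split_list").show()

Rows with only two tokens get null for mn, matching the behavior above.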
Related
I'm trying to calculate a fuzzy score (preferably the partial_ratio score) across two columns in the same dataframe.
| column1     | column2     |
| ----------- | ----------- |
| emmett holt | holt        |
| greenwald   | christopher |
It would need to look something like this:
| column1     | column2     | partial_ratio |
| ----------- | ----------- | ------------- |
| emmett holt | holt        | 100           |
| greenwald   | christopher | 22            |
| schaefer    | schaefer    | 100           |
With the help of another question on this website, I worked towards the following code:
compare = pd.MultiIndex.from_product([dataframe['column1'], dataframe['column2']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.partial_ratio(*tup)], ['partial_ratio'])

df['partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['original_title'], x['title']), axis=1)
But the problem already starts with the first line of the code that returns the following error notification:
Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
You could say I'm kind of stuck here, so any advice on this is appreciated!
You need a UDF to use fuzzywuzzy:
from fuzzywuzzy import fuzz
import pyspark.sql.functions as F
@F.udf
def fuzzyudf(original_title, title):
    return fuzz.partial_ratio(original_title, title)
df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))
df2.show()
+-----------+-----------+-------------+
| column1| column2|partial_ratio|
+-----------+-----------+-------------+
|emmett holt| holt| 100|
| greenwald|christopher| 22|
+-----------+-----------+-------------+
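One small caveat (my addition, not part of the original answer): used as a bare decorator, F.udf defaults to a string return type, so partial_ratio's integer scores come back as strings. If you want a proper integer column, you can declare the return type explicitly:

from fuzzywuzzy import fuzz
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F

# Same UDF, but with an explicit IntegerType so the scores stay numeric
@F.udf(returnType=IntegerType())
def fuzzyudf(original_title, title):
    return fuzz.partial_ratio(original_title, title)

df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))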
I need to create a new column, CITY_STATE, from three existing columns (city, state, country): fill null values of the city column with 'NONE', replace null values of the state column with the country value, and concatenate the two with a space.
OUTPUT -
+----+-----+------------+--------------+
|city|state|     country|    CITY_STATE|
+----+-----+------------+--------------+
|   A|   MH|       INDIA|          A MH|
|NULL| NULL|      POLAND|   NONE POLAND|
|NULL|   AZ|      RUSSIA|       NONE AZ|
|   E| NULL|SOUTH AFRICA|E SOUTH AFRICA|
+----+-----+------------+--------------+
raw_data = raw_data.withColumn('city_state',F.concat(col('city').fillna('None'),lit(' '),col('state').fillna(col('country'))))
This throws a TypeError: 'Column' object is not callable error.
How can I achieve my desired output? Do I need to break it into a couple of steps, or can this be done in a single command?
One way to do it in a single command is like below:
from pyspark.sql import Row
from pyspark.sql.functions import when, concat_ws

row = Row('city', 'state', 'country')
row_df = spark.createDataFrame(
    [row('A', 'MH', 'INDIA'), row(None, None, 'POLAND'),
     row(None, 'AZ', 'RUSSIA'), row('E', None, 'SOUTH AFRICA')])
row_df.show()

row_df = row_df.select(
    'city', 'state', 'country',
    concat_ws(' ',
              when(row_df.city.isNull(), 'NONE').otherwise(row_df.city),
              when(row_df.state.isNull(), row_df.country).otherwise(row_df.state)
              ).alias('CITY_STATE'))
row_df.show()
The output will be like below:
+----+-----+------------+
|city|state| country|
+----+-----+------------+
| A| MH| INDIA|
|null| null| POLAND|
|null| AZ| RUSSIA|
| E| null|SOUTH AFRICA|
+----+-----+------------+
+----+-----+------------+--------------+
|city|state| country| CITY_STATE|
+----+-----+------------+--------------+
| A| MH| INDIA| A MH|
|null| null| POLAND| NONE POLAND|
|null| AZ| RUSSIA| NONE AZ|
| E| null|SOUTH AFRICA|E SOUTH AFRICA|
+----+-----+------------+--------------+
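A slightly more compact variant (my addition, not part of the original answer) uses coalesce instead of when/otherwise, since coalesce returns the first non-null value:

from pyspark.sql import functions as F

row_df = row_df.withColumn(
    'CITY_STATE',
    F.concat_ws(' ',
                F.coalesce(F.col('city'), F.lit('NONE')),     # null city -> the literal 'NONE'
                F.coalesce(F.col('state'), F.col('country'))  # null state -> the country value
                ))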
I have a dataframe like the one below. I want to use its first row as the column names of the dataframe.
How could I do this? Is there any way to convert it directly (without using df.first)?
usdata.show()
+----------+---------+--------------------+--------------------+------------+
|        _1|       _2|                  _3|                  _4|          _5|
+----------+---------+--------------------+--------------------+------------+
|first_name|last_name|        company_name|             address|        city|
|     James|     Butt| "Benton, John B Jr"|  6649 N Blue Gum St| New Orleans|
| Josephine|  Darakjy|"Chanay, Jeffrey ...| 4 B Blue Ridge Blvd|    Brighton|
|       Art|   Venere|"Chemel, James L ...|8 W Cerritos Ave #54|  Bridgeport|
|     Lenna| Paprocki|Feltz Printing Se...|         639 Main St|   Anchorage|
+----------+---------+--------------------+--------------------+------------+
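No answer is attached to this question in the thread. If usdata was loaded from a CSV (which the _1 ... _5 column names suggest), the simplest fix is to re-read it with .option("header", "true"). Otherwise, here is only a hedged sketch of one way to do it (my own, and it does pull one row to the driver, which the question hoped to avoid): take the first row, rename the columns from its values, and filter that row out.

from pyspark.sql import functions as F

header = usdata.limit(1).collect()[0]                        # the row holding the column names
usdata = (usdata
          .toDF(*[str(c) for c in header])                   # rename _1.._5 using that row's values
          .filter(F.col("first_name") != "first_name"))      # drop the header row itself
usdata.show()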
So I have a table (sample)
I'm using the pyspark dataframe API to filter out the NOCs that have never won a gold medal, and here's the code I wrote.
First part of my code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It generates the following output:
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This displays the output:
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are longer than 3 characters? Those are supposed to be values of the 'Team' column, but I'm not sure why they are showing up in the 'NOC' column. It's hard to figure out why these illegal values end up there.
When I write the final code
final = allgdf.subtract(countgdf).show()
the same thing happens: illegal values appear in the final dataframe column.
Any help would be appreciated. Thanks.
You should specify the delimiter for your CSV file; by default Spark uses a comma (,) as the separator.
This can be done, for example, with:
.option("delimiter",";")
Currently, I have a table consisting of an encounter_id and a date field, like so:
+---------------------------+--------------------------+
|encounter_id |date |
+---------------------------+--------------------------+
|random_id34234 |2018-09-17 21:53:08.999999|
|this_can_be_anything2432432|2018-09-18 18:37:57.000000|
|423432 |2018-09-11 21:00:36.000000|
+---------------------------+--------------------------+
encounter_id is a random string.
I'm aiming to create a column which consists of the total number of encounters in the past 30 days.
+---------------------------+--------------------------+---------------------------+
|encounter_id |date | encounters_in_past_30_days|
+---------------------------+--------------------------+---------------------------+
|random_id34234 |2018-09-17 21:53:08.999999| 2 |
|this_can_be_anything2432432|2018-09-18 18:37:57.000000| 3 |
|423432 |2018-09-11 21:00:36.000000| 1 |
+---------------------------+--------------------------+---------------------------+
Currently, I'm thinking of somehow using window functions and specifying an aggregate function.
Thanks for the time.
Here is one possible solution; I added some sample data. It indeed uses a window function, as you suggested yourself. Hope this helps!
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df = sqlContext.createDataFrame(
[
('A','2018-10-01 00:15:00'),
('B','2018-10-11 00:30:00'),
('C','2018-10-21 00:45:00'),
('D','2018-11-10 00:00:00'),
('E','2018-12-20 00:15:00'),
('F','2018-12-30 00:30:00')
],
("encounter_id","date")
)
# Convert the date string to a unix timestamp (seconds) so rangeBetween can work on it
df = df.withColumn('timestamp', F.col('date').astype('Timestamp').cast("long"))
# Window covering the preceding 30 days (expressed in seconds) up to the current row
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*30, 0)
df = df.withColumn('encounters_past_30_days', F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+-----------------------+
|encounter_id| date| timestamp|encounters_past_30_days|
+------------+-------------------+----------+-----------------------+
| A|2018-10-01 00:15:00|1538345700| 1|
| B|2018-10-11 00:30:00|1539210600| 2|
| C|2018-10-21 00:45:00|1540075500| 3|
| D|2018-11-10 00:00:00|1541804400| 2|
| E|2018-12-20 00:15:00|1545261300| 1|
| F|2018-12-30 00:30:00|1546126200| 2|
+------------+-------------------+----------+-----------------------+
EDIT: If you want days as the granularity, you could first convert your date column to the Date type. The example below assumes that a window of five days means today and the four days before; if it should be today plus the past five days, just remove the -1.
import pyspark.sql.functions as F
from pyspark.sql.window import Window
n_days = 5
df = sqlContext.createDataFrame(
[
('A','2018-10-01 23:15:00'),
('B','2018-10-02 00:30:00'),
('C','2018-10-05 05:45:00'),
('D','2018-10-06 00:15:00'),
('E','2018-10-07 00:15:00'),
('F','2018-10-10 21:30:00')
],
("encounter_id","date")
)
# Truncate to the day first, then convert to a unix timestamp in seconds
df = df.withColumn('timestamp', F.to_date(F.col('date')).astype('Timestamp').cast("long"))
# Window covering the current day and the (n_days - 1) days before it
w = Window.orderBy('timestamp').rangeBetween(-60*60*24*(n_days-1), 0)
df = df.withColumn('encounters_past_n_days', F.count('encounter_id').over(w))
df.show()
Output:
+------------+-------------------+----------+----------------------+
|encounter_id| date| timestamp|encounters_past_n_days|
+------------+-------------------+----------+----------------------+
| A|2018-10-01 23:15:00|1538344800| 1|
| B|2018-10-02 00:30:00|1538431200| 2|
| C|2018-10-05 05:45:00|1538690400| 3|
| D|2018-10-06 00:15:00|1538776800| 3|
| E|2018-10-07 00:15:00|1538863200| 3|
| F|2018-10-10 21:30:00|1539122400| 3|
+------------+-------------------+----------+----------------------+
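One design note, added here as a hedge rather than as part of the original answer: a window with orderBy but no partitionBy moves all rows into a single partition, which Spark warns about and which can become a bottleneck on large data. If the counts are really per some key (say, per patient), adding a partition column keeps the work distributed; patient_id below is a hypothetical column:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# 'patient_id' is a hypothetical grouping key; the question's table only has encounter_id and date
w = (Window.partitionBy('patient_id')
           .orderBy('timestamp')
           .rangeBetween(-60*60*24*30, 0))
df = df.withColumn('encounters_past_30_days', F.count('encounter_id').over(w))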