Pyspark dataframe: creating column based on other column values

I have a pyspark dataframe:
Now, I want to add a new column called "countryAndState", where, for example, the value for the first row would be "USA_CA". I have tried several approaches; the last one was the following:
df_2 = df.withColumn("countryAndState", '{}_{}'.format(df.country, df.state))
I have tried with "country" and "state" instead, or with simply country and state, and also using col(), but nothing seems to work. Can anyone help me solve this?

You can't use Python format strings to build Spark columns; format() runs on the driver and produces a plain string, not a column expression. Use concat instead:
import pyspark.sql.functions as F
df_2 = df.withColumn("countryAndState", F.concat(F.col('country'), F.lit('_'), F.col('state')))
or concat_ws, if you need to chain many columns together with a given separator:
import pyspark.sql.functions as F
df_2 = df.withColumn("countryAndState", F.concat_ws('_', F.col('country'), F.col('state')))
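For reference, a quick sketch of what concat_ws produces, using made-up toy data and assuming an active SparkSession named spark:
import pyspark.sql.functions as F
# Toy data just to illustrate the result; the real df comes from the question.
df = spark.createDataFrame([("USA", "CA"), ("USA", "NY")], ["country", "state"])
df.withColumn("countryAndState", F.concat_ws("_", "country", "state")).show()
# +-------+-----+---------------+
# |country|state|countryAndState|
# +-------+-----+---------------+
# |    USA|   CA|         USA_CA|
# |    USA|   NY|         USA_NY|
# +-------+-----+---------------+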

Related

Extracting portions of the entries of Pandas dataframe

I have a Pandas dataframe with several columns wherein the entries of each column are a combination of numbers, upper- and lower-case letters and some special characters, i.e. "=A-Za-z0-9_|". Each entry of the column is of the form:
'x=ABCDefgh_5|123|'
I want to retain only the digits 0-9 appearing between the | | and strip out all other characters. Here is my code for one column of the dataframe:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry 'x=ABCDefgh_5|123|' without stripping out anything. Is there an error in my code?
Instead of working with these unreadable regex expressions, you might want to consider a simple split. For example:
import pandas as pd
d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)
output = df["col"].str.split("|").str[1]
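If the regex route is needed after all, str.extract with a capture group is usually clearer than lstrip (which strips a set of characters, not a pattern). A sketch, assuming the digits always sit between the two pipes:
# Capture the digits enclosed by the pipes; expand=False returns a Series.
output = df["col"].str.extract(r'\|(\d+)\|', expand=False)
# 0    123
# 1    123
# Name: col, dtype: object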

filter dataframe 1 column with other dataframe column pyspark

I have dataframe1, which contains contracts, and dataframe2, which contains workers. Now I want to filter dataframe1 with a column from dataframe2. I tried at first to filter dataframe1 with a single string and it works; this is the code:
contract_con=dataframe1.filter(dataframe1.name_of_column.contains('Entretien des espaces naturels'))
And this is the code I tried in order to filter the same dataframe1 with a column of another dataframe, dataframe2, that contains 10 rows:
contract_con=dataframe1.filter(dataframe1.name_of_column.contains(dataframe2.name_of_column))
contract_con.show()
Any help, please?
The solution was to make a list from dataframe1 and use it to filter dataframe2.
This is the code to make the list:
job_list=dataframe1.select("name_of_column").rdd.flatMap(lambda x: x).collect()
print(job_list)
And this is the code to filter with it:
from pyspark.sql.functions import col
contract_workers=dataframe2.filter(col("name_of_column_to_filter").isin(job_list))
contract_workers.show()
Since it is a different dataframe, you cannot pass the column directly. You could use isin() after collecting dataframe2.name_of_column into a list, but the easiest way is just to do a join like this:
contract_con = dataframe1.join(dataframe2, "name_of_column", "inner")
contract_con.show()
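If you only want the rows of dataframe1 and none of dataframe2's columns, a left semi join is another option. A sketch, assuming both frames share the column name as above:
contract_con = dataframe1.join(dataframe2, "name_of_column", "left_semi")
contract_con.show()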

pandas: split pandas columns of unequal length list into multiple columns

I have a dataframe with one column of unequal-length lists which I want to split into multiple columns (the item values will be the column names). An example is given below.
I have done this through iterrows, iterating through the rows and examining the list from each row. It seems workable as my dataframe has few rows. However, I wonder if there are any cleaner methods.
I have also tried additional_df = pd.DataFrame(venue_df.location.values.tolist())
However, the list breaks down as shown below.
Thanks for your help.
Can you try this code? It is built assuming venue_df.location contains the lists you have shown in the cells.
venue_df['school'] = venue_df.location.apply(lambda x: ('school' in x)+0)
venue_df['office'] = venue_df.location.apply(lambda x: ('office' in x)+0)
venue_df['home'] = venue_df.location.apply(lambda x: ('home' in x)+0)
venue_df['public_area'] = venue_df.location.apply(lambda x: ('public_area' in x)+0)
Hope this helps!
First let's explode your Location column, so we can get your wanted end result.
s=df['Location'].explode()
Then let's use crosstab on that series to get the end result:
import pandas as pd
pd.crosstab(s.index, s)
I didn't test it out because I don't know your base df.
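An alternative that avoids writing one apply per value is get_dummies on the exploded column. A sketch with made-up data, since the original venue_df isn't shown:
import pandas as pd
venue_df = pd.DataFrame({'location': [['school', 'home'], ['office'], ['home', 'public_area']]})
# One indicator column per unique list item, aggregated back to one row per original index.
dummies = pd.get_dummies(venue_df['location'].explode()).groupby(level=0).max()
venue_df = venue_df.join(dummies)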

Pandas splitting a column with new line separator

I am extracting tables from pdf using Camelot. Two of the columns are getting merged together with a newline separator. Is there a way to separate them into two columns?
Suppose the column looks like this.
A\nB
1\n2
2\n3
3\n4
Desired output:
|A|B|
|-|-|
|1|2|
|2|3|
|3|4|
I have tried df['A\nB'].str.split('\n', 2, expand=True) and that splits it into two columns; however, I want the new column names to be A and B, not 0 and 1. Also, I need to pass a generalized column label instead of the actual column name, since I need to implement this for several docs which may have different column names. I can determine such a column name in my dataframe using
colNew = df.columns[df.columns.str.contains(pat = '\n')]
However, when I pass colNew into the split function, it throws an attribute error:
df[colNew].str.split('\n', 2, expand=True)
AttributeError: DataFrame object has no attribute 'str'
You can take advantage of the Pandas split function.
import pandas as pd
# recreate your pandas series above.
df = pd.DataFrame({'A\nB':['1\n2','2\n3','3\n4']})
# first: turn the col into str.
# second: split the col on the separator \n.
# third: set expand=True so the split produces two new columns.
test = df['A\nB'].astype('str').str.split('\n',expand=True)
# some rename
test.columns = ['A','B']
I hope this is helpful.
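To put the two new columns back in place of the merged one, something like this should work on the same toy data:
# Drop the merged column and join the split columns back on the index.
df = df.drop(columns=['A\nB']).join(test)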
I reproduced the error on my side... The issue is that df[colNew] is still a DataFrame, because colNew is an Index of column labels rather than a single label.
But .str.split() only works on a Series, so, taking your code as an example, I would convert the DataFrame to a Series using iloc[:,0].
Then another line to split the column headers:
df2=df[colNew].iloc[:,0].str.split('\n', 2, expand=True)
df2.columns = 'A\nB'.split('\n')
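Building on that, a sketch that generalizes the idea to every column whose header contains a newline, deriving the new column names from the header itself (an assumption on my part, untested against your actual docs):
for merged_col in df.columns[df.columns.str.contains('\n')]:
    new_names = merged_col.split('\n')              # e.g. ['A', 'B']
    # Split the values on the same separator and assign to the new columns.
    df[new_names] = df[merged_col].str.split('\n', expand=True)
    df = df.drop(columns=[merged_col])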

How to replace column value in dataframes spark?

So I created a students dataframe from this list:
example_scores=[('Ann', 92),('Bob',55) ]
scores_df = spark.createDataFrame(example_scores,schema=['Name','Score'])
scores_df.show()
I want to replace each student's score with a number.
For example, if their score is between 51 and 60, I want the dataframe to show
--Bob, 6-- etc.
I want to use an if statement, but I don't know how to do that kind of filtering within a dataframe.
I tried regexp_replace and translate, but it's not working.
You can write a when expression to create a new column:
from pyspark.sql import functions as F
example_scores=[('Ann', 92),('Bob',55) ]
scores_df = spark.createDataFrame(example_scores,schema=['Name','Score'])
result_df = scores_df.withColumn("Grade", F.when((F.col("Score")>=51) & (F.col("Score")<=60),"6").otherwise("1")).select("Name","Grade")
result_df.show()
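If you need more score bands than the single 51-60 range, the when calls can be chained. The extra band boundaries below are assumptions, not from the question:
from pyspark.sql import functions as F
result_df = scores_df.withColumn(
    "Grade",
    F.when(F.col("Score").between(51, 60), "6")     # 51-60 -> 6
     .when(F.col("Score").between(61, 70), "7")     # assumed band
     .when(F.col("Score").between(91, 100), "10")   # assumed band
     .otherwise("1")
).select("Name", "Grade")
result_df.show()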