Transfer a df to a new one and change the content of a column - dataframe

I have a dataframe df_test and I want to copy all of its columns into a new df.
I also want to modify the values of one column with if/else logic.
I tried this:
import pyspark
import pandas as pd
from pyspark.sql import SparkSession
df_cast= df_test.withColumn('account_id', when(col("account_id") == 8, "teo").when(col("account_id") == 9, "liza").otherwise(' '))
But it gives me this error:
NameError: name 'when' is not defined
Thanks in advance

At the start of your code, you should import the pyspark sql functions. The following, for example, would work:
import pyspark.sql.functions as F
import pyspark
import pandas as pd
from pyspark.sql import SparkSession
df_cast = df_test.withColumn('account_id', F.when(F.col("account_id") == 8, "teo").when(F.col("account_id") == 9, "liza").otherwise(' '))
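For instance, here is a minimal self-contained sketch (the df_test below is a made-up stand-in, just so the example runs):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

# Hypothetical stand-in for df_test, only to make the example runnable
df_test = spark.createDataFrame([(8, 'x'), (9, 'y'), (10, 'z')], ['account_id', 'other_col'])

# Map account ids to names; anything else becomes a blank string
df_cast = df_test.withColumn(
    'account_id',
    F.when(F.col('account_id') == 8, 'teo')
     .when(F.col('account_id') == 9, 'liza')
     .otherwise(' ')
)
df_cast.show()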

Related

Converting Python code to pyspark environment

How can I have the same functions as shift() and cumsum() from pandas in pyspark?
import pandas as pd
temp = pd.DataFrame(data=[['a',0],['a',0],['a',0],['b',0],['b',1],['b',1],['c',1],['c',0],['c',0]], columns=['ID','X'])
temp['transformed'] = temp.groupby('ID').apply(lambda x: (x["X"].shift() != x["X"]).cumsum()).reset_index()['X']
print(temp)
My question is how to achieve this in pyspark.
PySpark handles these types of queries with Window functions.
You can read their documentation here.
Your PySpark code would be something like this:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# 'time' is a placeholder: the window needs some column that defines the row order within each ID
window = W.partitionBy('ID').orderBy('time')
new_df = (
    df
    .withColumn('shifted', F.lag('X').over(window))
    # flag rows that differ from their predecessor (the first row of a group counts as a change, as in pandas)
    .withColumn('isChange', F.coalesce((F.col('shifted') != F.col('X')).cast('int'), F.lit(1)))
    .withColumn('transformed', F.sum('isChange').over(window))
)
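A minimal end-to-end sketch (the row_order column is something I add here for illustration, since Spark rows, unlike the pandas index, carry no inherent order):
import pandas as pd
from pyspark.sql import SparkSession, Window as W
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('example').getOrCreate()

temp = pd.DataFrame(data=[['a',0],['a',0],['a',0],['b',0],['b',1],['b',1],['c',1],['c',0],['c',0]],
                    columns=['ID','X'])
temp['row_order'] = range(len(temp))  # preserve the original row order explicitly

df = spark.createDataFrame(temp)
window = W.partitionBy('ID').orderBy('row_order')
result = (
    df
    .withColumn('shifted', F.lag('X').over(window))
    .withColumn('isChange', F.coalesce((F.col('shifted') != F.col('X')).cast('int'), F.lit(1)))
    .withColumn('transformed', F.sum('isChange').over(window))
)
result.show()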

Lambda functions on multiple columns

I am trying to extract only the numbers from multiple columns in my pandas DataFrame.
I am able to do so one column at a time, but I would like to perform this operation on multiple columns simultaneously.
My reproducible example:
import pandas as pd
import re
import numpy as np
import seaborn as sns
df = sns.load_dataset('diamonds')
# Create columns one again
df['clarity2'] = df['clarity']
df.head()
df[['clarity', 'clarity2']].apply(lambda x: x.str.extract(r'(\d+)'))
If you want a tuple:
cols = ['clarity', 'clarity2']
tuple(df[col].str.extract(r'(\d+)') for col in cols)
If you want a list:
cols = ['clarity', 'clarity2']
[df[col].str.extract(r'(\d+)') for col in cols]
Adding them to the original data (expand=False makes str.extract return a Series instead of a one-column DataFrame, so the unpacking assigns cleanly):
df['digit1'], df['digit2'] = [df[col].str.extract(r'(\d+)', expand=False) for col in cols]
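If you really want a single pass over a column subset, something along these lines should also work (a sketch; the _digit suffix and the astype(str) guard are my own choices, not part of the original question):
import pandas as pd
import seaborn as sns

df = sns.load_dataset('diamonds')
df['clarity2'] = df['clarity']

cols = ['clarity', 'clarity2']
# astype(str) guards against the categorical dtype of the diamonds columns;
# expand=False makes str.extract return a Series for each column
extracted = df[cols].apply(lambda s: s.astype(str).str.extract(r'(\d+)', expand=False))
# join the extracted digits back onto the original frame under new names
df = df.join(extracted.add_suffix('_digit'))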

Parse list and create DataFrame

I have been given a list called data which has the following content
data=[b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
I want a pandas dataframe that looks like the linked image (Expected Dataframe).
How can I achieve this?
You can try this:
import pandas as pd

data=[b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']
processed_data = [x.split(',') for x in data[0].decode().replace('\r', '').strip().split('\n')]
df = pd.DataFrame(columns=processed_data[0], data=processed_data[1:])
Hope it helps.
I would recommend converting this list to a single string, since it only has one element (note that the element is a bytes object, so it has to be decoded first):
str1 = b''.join(data).decode()
Then use the solution provided here:
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd
TESTDATA = StringIO(str1)
df = pd.read_csv(TESTDATA, sep=",")
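Alternatively, pandas can read the raw bytes directly through an in-memory buffer, which skips the manual decode step; a short sketch:
import io
import pandas as pd

data = [b'Name,Age,Occupation,Salary\r\nRam,37,Plumber,1769\r\nMohan,49,Elecrician,3974\r\nRahim,39,Teacher,4559\r\n']

# read_csv accepts any file-like object, so wrap the raw bytes in BytesIO
df = pd.read_csv(io.BytesIO(data[0]))
print(df)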

TypeError: field Customer: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>

SL No: Customer Month Amount
1 A1 12-Jan-04 495414.75
2 A1 3-Jan-04 245899.02
3 A1 15-Jan-04 259490.06
My Df is above
Code
import findspark
findspark.init('/home/mak/spark-3.0.0-preview2-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mak').getOrCreate()
import numpy as np
import pandas as pd
# Enable Arrow-based columnar data transfers
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf3 = pd.read_csv('Repayment.csv')
df_repay = spark.createDataFrame(pdf3)
Only loading df_repay has this issue; the other data frames are loaded successfully. When I shifted the above code to the code below, it worked successfully:
df4 = (spark.read.format("csv").options(header="true")
.load("Repayment.csv"))
Why is df_repay not loaded with spark.createDataFrame(pdf3), while similar data frames are loaded successfully?
pdf3 is a pandas dataframe and you are trying to convert it to a spark dataframe. If you want to stick with your code, define an explicit schema when converting the pandas dataframe to a spark dataframe:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
pdf3 = pd.read_csv('Repayment.csv')
# create a schema for your dataframe
# (Amount contains decimals in the sample data, so DoubleType fits better than IntegerType;
#  Month comes out of pandas as a string, so keep it StringType and parse it later if needed)
schema = StructType([
    StructField("Customer", StringType(), True),
    StructField("Month", StringType(), True),
    StructField("Amount", DoubleType(), True),
])
# create the spark dataframe using the schema
df_repay = spark.createDataFrame(pdf3, schema=schema)
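Another option is to fix the pandas dtypes before the conversion: the "Can not merge type" error usually means a single pandas column holds mixed Python types (for example strings and floats, often because of missing values). A sketch, assuming Customer is the offending column as the error message suggests:
import pandas as pd

pdf3 = pd.read_csv('Repayment.csv')

# Cast the mixed-type column to one consistent type so Spark's schema
# inference does not see conflicting types in the same field
pdf3['Customer'] = pdf3['Customer'].astype(str)

df_repay = spark.createDataFrame(pdf3)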

How to remove diacritics in pyspark dataframes?

I am wondering how to remove diacritics in a PySpark DataFrame with Python 2. I would need something like:
from pyspark.sql.session import SparkSession
from pyspark import SparkContext
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType
df = sc.parallelize([(u'pádlo', 1), (u'dřez', 4)]).toDF(['text', 'num'])
def remove_diacritics(s):
return unidecode.unidecode(s)
rem_udf = sf.udf(remove_diacritics, StringType())
df.select(rem_udf('text'))
Unfortunately, the unidecode module is not available on our cluster.
Is there some natural solution that I am missing, other than manual replacement of all possible characters? Note that the expected result is [padlo, drez].
You can use the analog of SQL's translate to replace characters based on two "dictionaries":
from pyspark.sql.session import SparkSession
from pyspark import SparkContext
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType
charsFrom = 'řá'  # fill this string with all diacritics
charsTo = 'ra'    # and the corresponding latin characters
df = sc.parallelize([(u'pádlo', 1), (u'dřez', 4)]).toDF(['text', 'num'])
df = df.select(sf.translate('text', charsFrom, charsTo).alias('text'), 'num')
It will replace every occurrence of each character from the first string with the corresponding character from the second string.
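If a UDF is acceptable, the standard library's unicodedata module (which ships with Python, unlike unidecode) can strip combining marks after NFKD normalization. A sketch applied to the df built above; note that letters with no decomposition (e.g. 'ł') are simply dropped rather than transliterated:
import unicodedata
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def remove_diacritics(s):
    # decompose accented characters, then drop the combining marks
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

rem_udf = sf.udf(remove_diacritics, StringType())
df = df.withColumn('text', rem_udf('text'))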