I want to compare two dataframe columns after trimming them and converting them to lower case in PySpark.
Is the code below wrong?
if f.trim(Loc_Country_df.LOC_NAME.lower) == f.trim(sdf.location_name.lower):
print('y')
else:
print('N')
No, you can't do it like that: DataFrame columns are not plain variables, they are expressions over a whole collection of values, so comparing them inside an if does not evaluate row by row. The best approach is to perform a join.
from pyspark.sql import functions as f

# trim and lower-case both key columns before joining
Loc_Country_df = Loc_Country_df.withColumn("LOC_NAME", f.lower(f.trim(f.col("LOC_NAME"))))
sdf = sdf.withColumn("location_name", f.lower(f.trim(f.col("location_name"))))

join_df = Loc_Country_df.join(sdf, Loc_Country_df.LOC_NAME == sdf.location_name, "left")

# rows with no match from the left join have a null location_name
join_df.withColumn('Result', f.when(f.col('location_name').isNull(), "N").otherwise("Y")).show()
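For reference, here is a minimal self-contained sketch of the same join-and-flag approach with made-up sample data (the DataFrame and column names simply mirror the ones above):
from pyspark.sql import SparkSession, functions as f
spark = SparkSession.builder.getOrCreate()
# hypothetical sample data standing in for Loc_Country_df and sdf
Loc_Country_df = spark.createDataFrame([(" India ",), ("USA",)], ["LOC_NAME"])
sdf = spark.createDataFrame([("india",), ("germany",)], ["location_name"])
Loc_Country_df = Loc_Country_df.withColumn("LOC_NAME", f.lower(f.trim("LOC_NAME")))
sdf = sdf.withColumn("location_name", f.lower(f.trim("location_name")))
join_df = Loc_Country_df.join(sdf, Loc_Country_df.LOC_NAME == sdf.location_name, "left")
join_df.withColumn("Result", f.when(f.col("location_name").isNull(), "N").otherwise("Y")).show()
# 'india' matches and gets Y, 'usa' has no counterpart and gets N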
I'm trying to remove certain values with the code below, but pandas won't do it; instead it raises
ValueError: Unable to coerce to Series, length must be 10: given 2
Here is my code:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv")
print(df.shape)
columns_df = ['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
'num-of-cylinders', 'horsepower', 'average-mileage', 'price']
prohibited_symbols = ['?','Nan''n.a']
df = df[df[columns_df] != prohibited_symbols]
print(df)
Try:
import re
pattern = '|'.join(map(re.escape, prohibited_symbols))  # escape regex metacharacters such as '?'
df = df[~df[columns_df].apply(lambda c: c.astype(str).str.contains(pattern)).any(axis=1)]
The regex alternation operator '|' lets you drop every record that contains any of your prohibited symbols (.str only exists on a Series, so the check is applied column by column).
That's because the line you wrote doesn't do what you imagine it should.
df = df[df[columns_df] != prohibited_symbols]
You can't compare against a list of prohibited symbols like that. != performs a simple element-wise inequality check, so pandas tries to align your 2-element list against the 10 selected columns, which is exactly what raises the ValueError. Even if it were evaluated, no cell would ever equal the whole list, and that syntax would not delete values from your cells anyway.
You'll have to use a for loop and clean every column, for example like this:
import re
# .str only works on string (object) columns, so skip the purely numeric ones
for column in df[columns_df].select_dtypes(include='object').columns:
    df[column] = df[column].str.replace('|'.join(map(re.escape, prohibited_symbols)), '', regex=True)
You can also specify the values you consider null with the na_values argument when reading the data, and then use dropna from pandas.
Example:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv", na_values=['?', 'Nan', 'n.a'])
df = df.dropna()
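To see the effect in isolation, here is a small sketch using an inline CSV as a hypothetical stand-in for Automobile_data.csv (the data is made up):
import io
import pandas as pd
csv_text = "company,price\nbmw,?\naudi,30000\ntoyota,n.a\n"
df = pd.read_csv(io.StringIO(csv_text), na_values=['?', 'Nan', 'n.a'])
print(df.dropna())   # only the audi row survives, since '?' and 'n.a' were read as NaN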
I have a PySpark dataframe like the one below:
I want to keep only one record when the two columns uniq_id and date_time have the same values.
Expected output:
I want to achieve this using PySpark.
Thank you
You can group by uniq_id and date_time and use first()
from pyspark.sql import functions as F
df.groupBy("uniq_id", "date_time").agg(F.first("col_1"), F.first("col_2"), F.first("col_3")).show()
I don't see how you can compare an int column with a timestamp one (though it can be done by casting the timestamp to int), but such filtering can be done via
from pyspark.sql import functions as F
# assume you already have your DataFrame
df = df.filter(F.col('first_column_name') == F.col('second_column_name'))
or just
df = df.filter('first_column_name = second_column_name')
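As a quick illustration with made-up data (the column names are just the placeholders used above):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 1), (2, 3)], ["first_column_name", "second_column_name"])
# only rows where the two columns are equal are kept
df.filter(F.col("first_column_name") == F.col("second_column_name")).show()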
I want to create a binary pandas dataframe from an existing pandas dataframe. A value in the binary dataframe should be 1 if the corresponding value equals its column's mode, otherwise 0.
P.S. I have more than 100 columns, so I can't do it manually for each column.
You can do:
df.eq(df.mode().iloc[0]).astype(int)
Using vectorization:
import numpy as np
mode_condition = <your condition>   # a boolean mask, e.g. df['col'].eq(df['col'].mode()[0]) for a single column
df['newcol'] = np.where(mode_condition, 1, 0)
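A small self-contained illustration of the mode-based encoding (the data here is invented):
import pandas as pd
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "y", "y"]})
print(df.eq(df.mode().iloc[0]).astype(int))
# column a's mode is 1 and column b's mode is 'y', so the output is
#    a  b
# 0  1  0
# 1  1  1
# 2  0  1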
I would like to append a word to each value of a column in a PySpark dataframe (for example a word taken from a list of words). I thought of just converting it to pandas because that's easier, but I need to do it in PySpark. Any ideas? Thank you :)
You can do it easily with the concat function:
from pyspark.sql import functions as F
for col in df.columns:
    # withColumn returns a new DataFrame, so reassign df to keep the change
    df = df.withColumn(col, F.concat(F.col(col), F.lit("new_word")))
In a pandas DataFrame/Series there's an .isnull() method. Is there something similar in the syntax of the where= filter of HDFStore's select method?
WORKAROUND SOLUTION:
The /meta node of a data column inside the HDF5 file can be used as a hack:
import pandas as pd
store = pd.HDFStore('store.h5')
print(store.groups())   # list the nodes, including the /df/meta/... groups
# the meta node holds the values recorded for that data column, used here as the set of non-null values
non_null = list(store.select("/df/meta/my_data_column/meta"))
df = store.select('df', where='my_data_column == non_null')
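A possible alternative, if the column was stored as an indexed data column, is to compute the non-null row coordinates yourself with select_column and pass them back to select; a rough sketch, assuming a store key 'df' and a data column my_data_column:
import pandas as pd
store = pd.HDFStore('store.h5')
col = store.select_column('df', 'my_data_column')   # Series of that column, indexed by row coordinate
coords = col.dropna().index                         # coordinates of the non-null rows
df = store.select('df', where=coords)
store.close()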