How to create variable PySpark Dataframes by Dropping Null columns - apache-spark-sql

I have 2 JSON files in a relative folder named 'source_data'
"source_data/data1.json"
{
"name": "John Doe",
"age": 32,
"address": "ZYZ - Heaven"
}
"source_data/data2.json"
{
"userName": "jdoe",
"password": "password",
"salary": "123456789"
}
Using the following PySpark code I have created a DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
df = spark.read.json("source_data")
print(df.head())
Output:
df.head(10)
[Row(name='John Doe', age=32, address='ZYZ - Heaven', userName=None, password=None, salary=None),
Row(name=None, age=None, address=None, userName='jdoe', password='password', salary='123456789')]
Now I want to create a variable number of DataFrames by dropping the columns whose values are None, like this:
df1.head()
[Row(name='John Doe', age=32, address='ZYZ - Heaven')]
and,
df2.head()
[Row(userName='jdoe', password='password', salary='123456789')]
I am only finding solutions for dropping entire rows based on all or any column(s).
Is there any way to achieve what I am looking for?
TIA

You can just select the columns that you require into a different DataFrame and filter it based on the condition.
//source data
val df = spark.read.json("path")
//select and filter
val df1 = df.select("address","age","name")
.filter($"address".isNotNull || $"age".isNotNull || $"name".isNotNull)
val df2 = df.select("password","salary","userName")
.filter($"password".isNotNull || $"salary".isNotNull || $"userName".isNotNull)
//see the output as dataframe or using head as you want
println(df1.head)
df2.head
Output of head for both DataFrames:
df1 returns only the row with name, age and address; df2 returns only the row with userName, password and salary.
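Since the question is in PySpark, here is a minimal PySpark sketch of the same approach (assuming the same 'source_data' folder and spark session from the question):
from pyspark.sql.functions import col

df = spark.read.json("source_data")

# keep only the rows where at least one of the selected columns is not null
df1 = df.select("name", "age", "address") \
    .filter(col("name").isNotNull() | col("age").isNotNull() | col("address").isNotNull())
df2 = df.select("userName", "password", "salary") \
    .filter(col("userName").isNotNull() | col("password").isNotNull() | col("salary").isNotNull())

print(df1.head())
print(df2.head())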

Related

How to escape single quote in sparkSQL

I am new to pySpark and SQL. I am working on the below query:
sqlContext.sql("Select Crime_type, substring(Location,11,100) as Location_where_crime_happened, count(*) as Count\
From street_SQLTB\
where LSOA_name = 'City of London 001F' and \
group by Location_where_crime_happened, Crime_type\
having Location_where_crime_happened = 'Alderman'S Walk'")
I am struggling with the single quote. I need to apply a filter on Alderman'S Walk. It could be an easy one, but I am unable to figure it out.
Your help is much appreciated.
Try this
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
simpleData = [("James","Sales","NY",90000,34,10000), \
("Michael","Sales","NY",86000,56,20000), \
("Robert","Sales","CA",81000,30,23000), \
("Maria","Alderman'S Walk","CA",90000,24,23000) \
]
columns= ["employee_name","department","state","salary","age","bonus"]
df1 = spark.createDataFrame(data = simpleData, schema = columns)
df1.createOrReplaceTempView('temp')
df = sqlContext.sql("""select * from temp where department = "Alderman'S Walk" """)
display(df)
or
df = sqlContext.sql("select * from temp where department = 'Alderman\\'S Walk' ")
display(df)
Filtered output:
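If you would rather sidestep quote escaping in the SQL string entirely, a hedged alternative (reusing the df1 DataFrame built above) is to filter with the DataFrame API, where the value is just an ordinary Python string:
from pyspark.sql.functions import col

# no SQL string involved, so the single quote needs no escaping
filtered = df1.filter(col("department") == "Alderman'S Walk")
filtered.show()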

Vectorized pandas udf in pyspark with dict lookup

I'm trying to learn to use pandas_udf in pyspark (Databricks).
One of the assignments is to write a pandas_udf to sort by day of the week. I know how to do this using spark udf:
from pyspark.sql.functions import *
data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)
print('Original')
df.show()
@udf()
def udf(day: str) -> str:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day] + '-' + day
print('with spark udf')
final_df = df.select(col('avg_users'), udf(col('day')).alias('day')).sort('day')
final_df.show()
Prints:
Original
+---+-----------+
|day| avg_users|
+---+-----------+
|Sun| 282905.5|
|Mon| 238195.5|
|Thu| 264620.0|
|Sat| 278482.0|
|Wed| 227214.0|
+---+-----------+
with spark udf
+-----------+-----+
| avg_users| day|
+-----------+-----+
| 238195.5|1-Mon|
| 227214.0|3-Wed|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 282905.5|7-Sun|
+-----------+-----+
Trying to do the same with pandas_udf
import pandas as pd
@pandas_udf('string')
def p_udf(day: pd.Series) -> pd.Series:
    dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
           "Fri": "5", "Sat": "6", "Sun": "7"}
    return dow[day.str] + '-' + day.str
p_final_df = df.select(df.avg_users, p_udf(df.day))
print('with pandas udf')
p_final_df.show()
I get KeyError: <pandas.core.strings.accessor.StringMethods object at 0x7f31197cd9a0>. I think it's coming from dow[day.str], which kinda makes sense.
I also tried:
return dow[day.str.__str__()] + '-' + day.str # KeyError: .... StringMethods
return dow[str(day.str)] + '-' + day.str # KeyError: .... StringMethods
return dow[day.str.upper()] + '-' + day.str # TypeError: unhashable type: 'Series'
return f"{dow[day.str]}-{day.str}" # KeyError: .... StringMethods (but I think this is logically
# wrong, returning a string instead of a Series)
I've read:
API reference
PySpark equivalent for lambda function in Pandas UDF
How to convert Scalar Pyspark UDF to Pandas UDF?
Pandas UDF in pyspark
Using the .str accessor alone, without any actual vectorized transformation, was giving you the error. Also, you cannot use the whole Series as a key for your dow dict. Use the map method of pandas.Series instead:
from pyspark.sql.functions import *
import pandas as pd
data = [('Sun', 282905.5), ('Mon', 238195.5), ('Thu', 264620.0), ('Sat', 278482.0), ('Wed', 227214.0)]
schema = 'day string, avg_users double'
df = spark.createDataFrame(data, schema)
#pandas_udf("string")
def p_udf(day: pd.Series) -> pd.Series:
dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
"Fri": "5", "Sat": "6", "Sun": "7"}
return day.map(dow) + '-' + day
df.select(df.avg_users, p_udf(df.day).alias("day")).show()
+---------+-----+
|avg_users| day|
+---------+-----+
| 282905.5|7-Sun|
| 238195.5|1-Mon|
| 264620.0|4-Thu|
| 278482.0|6-Sat|
| 227214.0|3-Wed|
+---------+-----+
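If you also want the rows ordered by day of the week, as in the Spark UDF version from the question, a small follow-up to the snippet above is to sort on the aliased column:
df.select(df.avg_users, p_udf(df.day).alias("day")).sort("day").show()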
What about returning a DataFrame using grouped data and orderBy after you apply the UDF? Pandas sort_values is quite problematic within UDFs.
Basically, in the UDF I generate the numbers using Python and then concatenate them back onto the day column.
from pyspark.sql.functions import pandas_udf
import pandas as pd
from pyspark.sql.types import *
import calendar
def sortdf(pdf):
    day = pdf.day
    pdf = pdf.assign(day=(day.map(dict(zip(calendar.day_abbr, range(7)))) + 1).astype(str) + '-' + day)
    return pdf
df.groupby('avg_users').applyInPandas(sortdf, schema=df.schema).show()
+-----+---------+
| day|avg_users|
+-----+---------+
|3-Wed| 227214.0|
|1-Mon| 238195.5|
|4-Thu| 264620.0|
|6-Sat| 278482.0|
|7-Sun| 282905.5|
+-----+---------+
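To make the ordering explicit rather than relying on the grouping, you can add the orderBy mentioned above after applyInPandas (a hedged follow-up to the same snippet):
df.groupby('avg_users').applyInPandas(sortdf, schema=df.schema).orderBy('day').show()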

GroupBy Function Not Applying

I am trying to group the following specializations data, but I am not getting the expected result (or any, for that matter). The data stays ungrouped even after this step. Any idea what's wrong with my code?
cols_specials = ['Enterprise ID','Specialization','Specialization Branches','Specialization Type']
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, left_on='Enterprise ID', right_on='Enterprise ID', how='left')
specials = specials[cols_specials]
specials = specials.groupby(['Enterprise ID'])['Specialization'].transform(lambda x: '; '.join(str(x)))
specials.to_csv(end_report_specials, index=False, encoding='utf-8-sig')
Please try using agg:
import pandas as pd
df = pd.DataFrame(
    [
        ['john', 'eng', 'build'],
        ['john', 'math', 'build'],
        ['kevin', 'math', 'asp'],
        ['nick', 'sci', 'spi']
    ],
    columns=['id', 'spec', 'type']
)
df.groupby(['id'])[['spec']].agg(lambda x: ';'.join(x))
results in:
If you need to preserve the original number of rows, use transform. transform returns one column:
df['spec_grouped'] = df.groupby(['id'])[['spec']].transform(lambda x: ';'.join(x))
df
results in:
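Applied to your own pipeline, a hedged sketch (assuming the specials/roster frames and column names from your question) would swap the transform line for an agg. Note that '; '.join(str(x)) in your original code joins the characters of the Series' string representation, so the join here is done over the values instead:
specials = pd.read_csv(agg_specials, engine='python')
specials = specials.merge(roster, on='Enterprise ID', how='left')
specials = specials[cols_specials]

# one row per Enterprise ID, with all Specialization values joined into one string
grouped = specials.groupby('Enterprise ID')['Specialization'] \
    .agg(lambda x: '; '.join(x.astype(str))) \
    .reset_index()
grouped.to_csv(end_report_specials, index=False, encoding='utf-8-sig')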

pandas multi-index from JSON document

How do I convert this multi-column-index JSON document to a pandas DataFrame?
mylist = [
    {
        "col1": "val1",
        "col2": [
            {
                "col21": "val21"
            },
            {
                "col21": "val22"
            }
        ]
    }
]
What I have tried:
# pip install json-excel-converter[extra]
from json_excel_converter import Converter
from json_excel_converter.xlsx import Writer
import pandas as pd
conv = Converter()
conv.convert(mylist, Writer(file="test.xlsx"))
df = pd.read_excel("test.xlsx", header=list(range(2)))
This gives the expected result:
col1 col2
Unnamed: 0_level_1 col21 col21.1
0 val1 val21 val22
But I have 2 questions:
Is there any way to avoid creating the Excel file and write directly to a DataFrame?
Is there any other way to achieve the same result?
Take a look at the structure of the data when it comes back from the json_excel_converter round trip:
df.to_dict(orient="records")
[{('col1', 'Unnamed: 0_level_1'): 'val1',
('col2', 'col21'): 'val21',
('col2', 'col21.1'): 'val22'}]
So you need to produce that format from your input JSON. I've done this in two ways: old-school looping and a list comprehension.
Plus, you need to tell pandas to create a MultiIndex from the tuples in the columns:
newlist = []
for d in mylist:
    myd = {}
    for k in list(d.keys()):
        if not isinstance(d[k], list):
            myd = {**myd, (k, ""): d[k]}
        else:
            for x, i in enumerate(d[k]):
                myd = {**myd, (k, f"{list(i)[0]}.{x}"): list(i.values())[0]}
    newlist.append(myd)
newlist = [
    {
        **{(k, ""): d[k]
           for k in list(d.keys()) if not isinstance(d[k], list)
          },
        **{(k, f"{kk}.{i}"): l[kk]
           for k in list(d.keys()) if isinstance(d[k], list)
           for i, l in enumerate(d[k])
           for kk in list(l.keys())
          },
    }
    for d in mylist
]
df = pd.DataFrame(newlist)
df.columns= pd.MultiIndex.from_tuples(df.columns)
df
Output:
col1 col2
col21.0 col21.1
val1 val21 val22

pyspark pass multiple options in dataframe

I am new to Python and PySpark. I would like to know how I can write the below Spark DataFrame code in PySpark:
val df = spark.read.format("jdbc").options(
Map(
"url" -> "jdbc:someDB",
"user" -> "root",
"password" -> "password",
"dbtable" -> "tableName",
"driver" -> "someDriver")).load()
I tried to write it as below in PySpark, but I am getting a syntax error:
df = spark.read.format("jdbc").options(
map(lambda : ("url","jdbc:someDB"), ("user","root"), ("password","password"), ("dbtable","tableName"), ("driver","someDriver"))).load()
Thanks in Advance
In PySpark, pass the options as keyword arguments:
df = spark.read \
    .format("jdbc") \
    .options(
        url="jdbc:someDB",
        user="root",
        password="password",
        dbtable="tableName",
        driver="someDriver",
    ) \
    .load()
Sometimes it's handy to keep them in a dict and unpack them later using the splat operator:
options = {
    "url": "jdbc:someDB",
    "user": "root",
    "password": "password",
    "dbtable": "tableName",
    "driver": "someDriver",
}
df = spark.read \
    .format("jdbc") \
    .options(**options) \
    .load()
Regarding the code snippets from your question: you happened to mix up two different concepts of "map":
Map in Scala is a data structure also known as "associative array" or "dictionary", equivalent to Python's dict
map in Python is a higher-order function you can use for applying a function to an iterable, e.g.:
In [1]: def square(x: int) -> int:
   ...:     return x**2
   ...:
In [2]: list(map(square, [1, 2, 3, 4, 5]))
Out[2]: [1, 4, 9, 16, 25]
In [3]: # or just use a lambda
In [4]: list(map(lambda x: x**2, [1, 2, 3, 4, 5]))
Out[4]: [1, 4, 9, 16, 25]
Try to use option() instead:
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:someDB") \
    .option("user", "root") \
    .option("password", "password") \
    .option("dbtable", "tableName") \
    .option("driver", "someDriver") \
    .load()
To load a CSV file with multiple parameters, pass the arguments to load():
df = spark.read.load("examples/src/main/resources/people.csv",
format="csv", sep=":", inferSchema="true", header="true")
Here's the documentation for that.