I am new to Python and Spark.
We are using Azure Databricks, and the PySpark code I am working with is shown below.
data = spark.sql("SELECT 'Name' as name, 'Number' as number FROM Student")
print(data)
This solution will work for you.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import lit, to_json, struct

data2 = [("Finance", 10),
         ("Marketing", 20),
         ("Sales", 30),
         ("IT", 40)]

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("number", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
df1 = df.withColumn("Student", lit("Student")) \
        .select("Student", to_json(struct("Name", "number")).alias("Data"))
display(df1)
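For reference, this is roughly what display(df1) should show, derived from the sample data above (formatting is approximate):
#+-------+--------------------------------+
#|Student|Data                            |
#+-------+--------------------------------+
#|Student|{"Name":"Finance","number":10}  |
#|Student|{"Name":"Marketing","number":20}|
#|Student|{"Name":"Sales","number":30}    |
#|Student|{"Name":"IT","number":40}       |
#+-------+--------------------------------+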
I have the data below and need to create a line chart with x = Date and y = Count.
The dataframe below was created from another dataframe with this code:
from pyspark.sql.functions import concat, col, lit

df7 = df7.select("*",
    concat(col("Month"), lit("/"), col("Year")).alias("Date"))
df7.show()
I've imported matplotlib.pyplot as plt and am still getting errors.
I have tried the plotting code below in a few variations:
df.plot(x = 'Date', y = 'Count')
df.plot(kind = 'line')
I keep getting this error though:
AttributeError: 'DataFrame' object has no attribute 'plt'/'plot'
Please note that df_pd = df.toPandas() can be expensive: if you are dealing with a large number of records (on the order of millions), you might hit an out-of-memory (OOM) error on a medium-sized Databricks cluster, or your session could crash because the driver runs out of RAM. Long story short, once you call toPandas() you are no longer using Spark-based, distributed computation at all. The plain toPandas() fix is sketched just below for completeness, and the Spark-friendly alternative follows after it.
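A minimal sketch of the toPandas() route, viable only when the data fits on the driver (this assumes the df7 from the question, with Date and Count columns):
import matplotlib.pyplot as plt

# pandas DataFrames do have .plot, unlike Spark DataFrames
pdf = df7.toPandas()
pdf.plot(x='Date', y='Count', kind='line')
plt.show()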
So let's start with a simple example:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
dict2 = [("2021-08-11 04:05:06", 10),
("2021-08-12 04:15:06", 17),
("2021-08-13 09:15:26", 25),
("2021-08-14 11:04:06", 68),
("2021-08-15 14:55:16", 50),
("2021-08-16 04:12:11", 2),
]
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("count", IntegerType(), True),
])

# create a Spark dataframe
sdf = spark.createDataFrame(data=dict2, schema=schema)
sdf.printSchema()
sdf.show(truncate=False)
# generate proper timestamp and date columns
new_df = sdf.withColumn('timestamp', F.to_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss')) \
    .withColumn('date', F.to_date('timestamp')) \
    .select('timestamp', 'date', 'count')
new_df.show(truncate = False)
#root
# |-- timestamp: string (nullable = true)
# |-- count: integer (nullable = true)
#+-------------------+-----+
#|timestamp |count|
#+-------------------+-----+
#|2021-08-11 04:05:06|10 |
#|2021-08-12 04:15:06|17 |
#|2021-08-13 09:15:26|25 |
#|2021-08-14 11:04:06|68 |
#|2021-08-15 14:55:16|50 |
#|2021-08-16 04:12:11|2 |
#+-------------------+-----+
#+-------------------+----------+-----+
#|timestamp |date |count|
#+-------------------+----------+-----+
#|2021-08-11 04:05:06|2021-08-11|10 |
#|2021-08-12 04:15:06|2021-08-12|17 |
#|2021-08-13 09:15:26|2021-08-13|25 |
#|2021-08-14 11:04:06|2021-08-14|68 |
#|2021-08-15 14:55:16|2021-08-15|50 |
#|2021-08-16 04:12:11|2021-08-16|2 |
#+-------------------+----------+-----+
Now you need to collect() the values of the columns you want to plot, since you are not using Pandas; of course, this is expensive and takes a (long) time for big datasets, but it works. You can use either of the following approaches:
# for a large number of records
xlabels = new_df.select("timestamp").rdd.flatMap(list).collect()
ylabels = new_df.select("count").rdd.flatMap(list).collect()

# for a limited number of records
xlabels = [row.timestamp for row in new_df.select('timestamp').collect()]
# use row['count'] rather than row.count: Row inherits tuple.count, so attribute access returns a method
ylabels = [row['count'] for row in new_df.select('count').collect()]
To plot:
import matplotlib.pyplot as plt
import matplotlib.dates as md
fig, ax = plt.subplots(figsize=(10,6))
plt.plot(xlabels, ylabels, color='blue', label="event's count") #, marker="o"
plt.scatter(xlabels, ylabels, color='cyan', marker='d', s=70)
plt.xticks(rotation=45)
plt.ylabel('Event counts \n# of records', fontsize=15)
plt.xlabel('timestamp', fontsize=15)
plt.title('Events over time', fontsize=15, color='darkred', weight='bold')
plt.legend(['# of records'], loc='upper right')
plt.show()
Based on the comments, I assume that with a large number of records the timestamps printed under the x-axis overlap and become unreadable.
To resolve this, you can use the following approach to arrange the x-axis ticks properly so that they do not plot on top of each other or end up squeezed side by side:
import pandas as pd
import matplotlib.pyplot as plt

x = xlabels
y = ylabels

# Note 1: if you use a pandas DataFrame after .toPandas():
# x = df['timestamp']
# y = df['count']

# Note 2: if you use a pandas DataFrame after .toPandas(), convert the datetime
# column to a datetime type and assign it back to the column:
# df.timestamp = pd.to_datetime(df.timestamp)
# verify that the timestamp column type is datetime64[ns] using print(df.info())
fig, ax = plt.subplots( figsize=(12,8))
plt.plot(x, y)
ax.legend(['# of records'])
ax.set_xlabel('Timestamp')
ax.set_ylabel('Event counts \n# of records')
# beautify the x-labels
import matplotlib.dates as md
plt.gcf().autofmt_xdate()
myFmt = md.DateFormatter('%Y-%m-%d %H:%M:%S.%f')
plt.gca().xaxis.set_major_formatter(myFmt)
plt.show()
plt.close()
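If the labels are still crowded because there are very many records, one further option (not part of the original answer, and assuming matplotlib 3.1+) is to let matplotlib choose a limited set of date ticks; this would be applied before plt.show() in the block above:
import matplotlib.dates as md

locator = md.AutoDateLocator()                      # picks a sensible number of date ticks
ax.xaxis.set_major_locator(locator)
ax.xaxis.set_major_formatter(md.ConciseDateFormatter(locator))
fig.autofmt_xdate()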
I have local Spark installed and am running in VS Code in a Jupyter notebook.
I am using the test code below to create a small dataframe and show it in the console with .show(), but my output is not aligned:
# %%
from pyspark.sql import SparkSession
spark = (
SparkSession.builder.master("local").appName("my-application-name").getOrCreate()
)
sc = spark.sparkContext
spark.conf.set("spark.sql.shuffle.partitions", "5")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
spark.conf.set("spark.sql.repl.eagerEval.enabled",True)
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkByExamples.com").getOrCreate()
columns = ["language", "users_count"]
data = [
("Java", "20000"),
("Python", "100000"),
("Scala", "3000"),
]
df = spark.createDataFrame(data).toDF(*columns)
df.cache()
df.show(truncate=False)
Converting to pandas and printing shows a similar misalignment:
df_pd = df.toPandas()
print(df_pd)
Can you help me? Where should I look to try to fix this?
Thanks
I'm trying to upload a sample PySpark dataframe to Azure Blob Storage after converting it to Excel format, but I am getting the error below. A snippet of my sample code is also below.
If there is another way to do the same thing, please let me know.
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
import pandas as ps
#%pip install xlwt
#%pip install openpyxl
#%pip install fsspec
my_data = [
("A","1","M",3000),
("B","2","F",4000),
("C","3","M",4000)
]
schema = StructType([ \
StructField("firstname",StringType(),True), \
StructField("id", StringType(), True), \
StructField("gender", StringType(), True), \
StructField("salary", IntegerType(), True) \
])
df = spark.createDataFrame(data=my_data,schema=schema)
pandasDF = df.toPandas()
pandasDF.to_excel("wasbs://blob.paybledvclient1@civblbstr.blob.core.windows.net/output_file.xlsx")
ValueError: Protocol not known: wasbs
You are using the Python pandas library directly to write the data, and it doesn't work this way. You need to first mount the Azure Blob Storage container and then write the data.
To mount it, use the following command:
dbutils.fs.mount(
    source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
    mount_point = "/mnt/<mount-name>",
    extra_configs = {"<conf-key>": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name>")})
To write, use the commands below:
(df.write
    .mode("overwrite")
    .option("header", "true")
    .csv("dbfs:/mnt/azurestorage/filename.csv"))
I have a sample Spark dataframe that I create from a pandas dataframe:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
from pyspark.sql.types import *
import pandas as pd
spark = SparkSession \
.builder \
.appName("Python Spark SQL basic example") \
.config("spark.some.config.option", "some-value") \
.getOrCreate()
# create a pandas dataframe first, then create a spark dataframe from it
pdf = pd.DataFrame([[1, "hello world. lets shine and spread happiness"],
                    [2, "not so sure"],
                    [2, "cool i like it"],
                    [2, "cool i like it"],
                    [2, "cool i like it"]],
                   columns=['input1', 'input2'])
df = spark.createDataFrame(pdf)  # this is the spark df
Now I have the data types:
df.printSchema()
root
|-- input1: long (nullable = true)
|-- input2: string (nullable = true)
If I convert this Spark dataframe back to pandas using
pandas_df = df.toPandas()
and then print the data types, I get object for the second column instead of a string type.
pandas_df.dtypes
input1 int64
input2 object
dtype: object
How do I correctly convert the Spark string type to a string type in pandas?
To convert to a string dtype, you can use StringDtype:
pandas_df["input2"] = pandas_df["input2"].astype(pd.StringDtype())
Trying to append to a pandas DataFrame:
df=[]
df = pd.DataFrame(columns = FEATURES)
df = df.append({
'_DayOfMonth':_DayOfMonth,
'_MonthOfYear':_MonthOfYear,
...
}, ignore_index = True)
Somehow, something goes wrong here. Syntax?
I am running pandas 0.18.1