How to export a Spark DataFrame with columns holding value lists aggregated with collect_list() to a 3-dimensional Pandas object in PySpark? - pandas

I have a DataFrame like this one (How to get the occurrence rate of the specific values with Apache Spark):
+-----------+--------------------+------------+-------+
|device | windowtime | values| counts|
+-----------+--------------------+------------+-------+
| device_A|2022-01-01 18:00:00 |[99,100,102]|[1,3,1]|
| device_A|2022-01-01 18:00:10 |[98,100,101]|[1,2,2]|
windowtime is considered to be the X-axis value, values the Y-axis value, and counts the Z-axis value (to be plotted later, say, on a heatmap).
How do I export that from a PySpark DataFrame to a 3D Pandas object?
With "2 dimensions", I have
pdf = df.toPandas()
and then I can use that for Bokeh's figure like that:
fig1ADB = figure(title="My 2 graph", tooltips=TOOLTIPS, x_axis_type='datetime')
fig1ADB.line(x='windowtime', y='values', source=source, color="orange")
But I'd like to use something like this:
hm = HeatMap(data, x='windowtime', y='values', values='counts', title='My heatmap (3d) graph', stat=None)
show(hm)
What kind of transformation should I do for that?

I have realized that the approach itself is wrong: there should be no aggregation into lists before exporting to Pandas!
According to discussion below
https://discourse.bokeh.org/t/cant-render-heatmap-data-for-apache-zeppelins-pyspark-dataframe/8844/8
Instead of the list-aggregated values/counts columns, we keep a raw table with one line per unique id ('values') and its count ('index'), where each line has its own 'window_time':
+-------------------+------+-----+
|window_time |values|index|
+-------------------+------+-----+
|2022-01-24 18:00:00|999 |2 |
|2022-01-24 19:00:00|999 |1 |
|2022-01-24 20:00:00|999 |3 |
|2022-01-24 21:00:00|999 |4 |
|2022-01-24 22:00:00|999 |5 |
|2022-01-24 18:00:00|998 |4 |
|2022-01-24 19:00:00|998 |5 |
|2022-01-24 20:00:00|998 |3 |
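Before the Bokeh code below, the list-aggregated frame from the top of the question has to be un-nested back into this one-row-per-value shape and then converted with toPandas(). A minimal sketch of that step, assuming Spark 2.4+ (for arrays_zip) and the column names windowtime/values/counts from the first table:
from pyspark.sql.functions import arrays_zip, explode, col

# pair values[i] with counts[i], then emit one row per pair
exploded = (df
    .withColumn("pair", explode(arrays_zip("values", "counts")))
    .select(col("windowtime").alias("window_time"),
            col("pair.values").alias("values"),
            col("pair.counts").alias("index")))

pdf = exploded.toPandas()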
import pandas as pd
from bokeh.models import ColumnDataSource, LogColorMapper, ColorBar
from bokeh.plotting import figure, show
from bokeh.layouts import gridplot

rowIDs = pdf['values']
colIDs = pdf['window_time']

# pivot to a 2D grid: rows = values, columns = window_time, cells = index (the count)
A = pdf.pivot_table('index', 'values', 'window_time', fill_value=0)

source = ColumnDataSource(data={
    'x': [pd.to_datetime('Jan 24 2022')],                          # left-most point
    'y': [0],                                                      # bottom-most point
    'dw': [pdf['window_time'].max() - pdf['window_time'].min()],   # TOTAL width of image
    #'dh': [df['delayWindowEnd'].max()],                           # TOTAL height of image
    'dh': [1000],                                                  # TOTAL height of image
    'im': [A.to_numpy()],                                          # 2D array from the pivoted df
})

color_mapper = LogColorMapper(palette="Viridis256", low=1, high=20)
plot = figure(toolbar_location=None, x_axis_type='datetime')
plot.image(x='x', y='y', source=source, image='im', dw='dw', dh='dh', color_mapper=color_mapper)

color_bar = ColorBar(color_mapper=color_mapper, label_standoff=12)
plot.add_layout(color_bar, 'right')

#show(plot)
show(gridplot([plot], ncols=1, plot_width=1000, plot_height=400))
And the result:

Related

How to select a subset of pandas dataframe containing an even distribution of one column's values?

I have a huge dataset spanning different years. As a subsample for local tests, I need to extract a small dataframe that contains only a few samples distributed over the years. Does anyone have an idea how to do that?
After grouping by the 'year' column, the count of instances in each year is something like:
| year |    A |
| ---- | ---- |
| 1838 | 1000 |
| 1839 | 2600 |
| 1840 | 8900 |
| 1841 | 9900 |
I want to select a subset which after groupby looks like:
| year| A |
| ----| --|
| 1838| 10|
| 1839| 10|
| 1840| 10|
| 1841| 10|
Try groupby().sample().
Here's example usage with dummy data.
import numpy as np
import pandas as pd
# create a long array of 'years' drawn from 1800 to 1804 (randint's upper bound is exclusive)
years = np.random.randint(low=1800, high=1805, size=200)
values = np.random.randint(low=1, high=200,size=200)
df = pd.DataFrame({'Years':years,"Values":values})
number_per_year = 10
sample_df = df.groupby("Years").sample(n=number_per_year, random_state=1)
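Note that GroupBy.sample() requires pandas 1.1 or newer. As an optional sanity check on the draw, you can count the rows per group afterwards:
# each year should now contribute exactly number_per_year rows
print(sample_df.groupby("Years").size())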

How to keep data unique in a certain range in a pyspark dataframe?

Companies can select a section of a Road. Sections are denoted by a start & end.
pyspark dataframe below:
+--------------------+----------+--------+
|Road company |start(km) |end(km) |
+--------------------+----------+--------+
|classA |1 |3 |
|classA |4 |7 |
|classA |10 |15 |
|classA |16 |20 |
|classB |1 |3 |
|classB |4 |7 |
|classB |10 |15 |
+--------------------+----------+--------+
The classB company picks its sections of the road first. For classA entries, there should be no overlap with classB; that is, classA companies cannot select a section of the road that has already been chosen by classB. The result should be as below:
+--------------------+----------+--------+
|Road company |start(km) |end(km) |
+--------------------+----------+--------+
|classA |16 |20 |
|classB |1 |3 |
|classB |4 |7 |
|classB |10 |15 |
+--------------------+----------+--------+
The distinct() function does not support separating the frame into several parts to apply the distinct operation. What should I do to implement that?
If sections of road can be partially allocated, here's a different (very similar) strategy:
start="start(km)"
end="end(km)"
def emptyDFr():
schema = StructType([
StructField(start,IntegerType(),True),
StructField(end,IntegerType(),True),
StructField("Road company",StringType(),True),
StructField("ranged",IntegerType(),True)
])
return spark.createDataFrame(sc.emptyRDD(), schema)
def dummyData():
return sc.parallelize([["classA",1,3],["classA",4,7],["classA",8,15],["classA",16,20],["classB",1,3],["classB",4,7],["classB",8,17]]).toDF(['Road company','start(km)','end(km)'])
df = dummyData()
df.cache()
df_ordered = df.orderBy(when(col("Road company") == "classB", 1)
.when(col("Road company") == "classA", 2)
.when(col("Road company") == "classC", 3)
).select("Road company").distinct()
# create the sequence of kilometers that cover the 'start' to 'end'
ranged = df.withColumn("range", explode(sequence( col(start), col(end) )) )
whatsLeft = ranged.select( col("range") ).distinct()
result = emptyDFr()
#Only use collect() on small countable sets of data.
for company in df_ordered.collect():
taken = ranged.where(col("Road company") == lit(company[0]))\
.join(whatsLeft, ["range"])
whatsLeft = whatsLeft.subtract( taken.select( col("range") ) )
result = result.union( taken.select( col("range") ,col(start), col(end),col("Road company") ) )
#convert our result back to the 'original style' of records with starts and ends.
result.groupBy( start, end, "Road company").agg(count("ranged").alias("count") )\
#figure out math to see if you got everything you asked for.
.withColumn("Partial", ((col(end)+lit(1)) - col(start)) != col("count"))\
.withColumn("Maths", ((col(end)+lit(1)) - col(start))).show() #helps show why this works not requried.
If you can rely on the fact that sections will never partially overlap, you can solve this with the logic below. You could likely optimize it to rely only on "start(km)", but anything more involved than that gets more complicated.
from pyspark.sql.functions import col, when, lit
from pyspark.sql.types import *

def emptyDF():
    schema = StructType([
        StructField("start(km)", IntegerType(), True),
        StructField("end(km)", IntegerType(), True),
        StructField("Road company", StringType(), True)
    ])
    return spark.createDataFrame(sc.emptyRDD(), schema)

def dummyData():
    return sc.parallelize([["classA",1,3],["classA",4,7],["classA",8,15],["classA",16,20],
                           ["classB",1,3],["classB",4,7],["classB",8,15]])\
             .toDF(['Road company','start(km)','end(km)'])

df = dummyData()
df.cache()

df_ordered = df.orderBy(when(col("Road company") == "classB", 1)
                        .when(col("Road company") == "classA", 2)
                        .when(col("Road company") == "classC", 3)
                       ).select("Road company").distinct()

whatsLeft = df.select(col("start(km)"), col("end(km)")).distinct()
result = emptyDF()

# Only use collect() on small, countable sets of data.
for company in df_ordered.collect():
    taken = df.where(col("Road company") == lit(company[0]))\
              .join(whatsLeft, ["start(km)", "end(km)"])
    whatsLeft = whatsLeft.subtract(taken.drop(col("Road company")))
    result = result.union(taken)

result.show()
+---------+-------+------------+
|start(km)|end(km)|Road company|
+---------+-------+------------+
| 1| 3| classB|
| 4| 7| classB|
| 8| 15| classB|
| 16| 20| classA|
+---------+-------+------------+

How can I update PySpark DataFrame column values under two column conditions using a bitwise AND (bit and) function?

I need to update a column (Flag, which holds many flags; each flag is a 2^n integer and they add up) in a PySpark dataframe under two conditions, i.e. column Age value >= 65 and column Flag does not already contain the new flag value, which is checked with a bitwise AND: (Flag & newFlag) == 0.
I have demonstrated my work using a sample dataframe and Python script (please see below) but encountered an error message.
the error message is: AnalysisException: cannot resolve '(Flag AND 2)' due to data type mismatch: '(Flag AND 2)' requires boolean type, not int;
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import *

# create a data frame with two columns: Age and Flag, and three rows
data = [
    (61, 0),
    (65, 1),
    (66, 10)   # previously inserted Flag 2 and 8, adding up to 10; Flag is 2^n
]
schema = StructType([
    StructField("Age", IntegerType(), True),
    StructField("Flag", IntegerType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
#df.printSchema()
df.show(truncate=False)

N_FLAG_AGE65 = 2
new_column = when(
    (col("Age") >= 65) & ((col("Flag") & lit(N_FLAG_AGE65) == 0)),
    col("Flag") + N_FLAG_AGE65
).otherwise(col("Flag"))
df = df.withColumn("Flag", new_column)
df.show(truncate=False)
After the input source df is constructed, the first df.show(truncate=False) should display
+---+----+
|Age|Flag|
+---+----+
|61 |0 |
|65 |1 |
|66 |10 |
+---+----+
My update algorithm checks both columns (Age and Flag): if Age >= 65 and the Flag bits do not already contain N_FLAG_AGE65, we update the Flag field with Flag = Flag + N_FLAG_AGE65. Thus, the expected result should be
+---+----+
|Age|Flag|
+---+----+
|61 |0 |
|65 |3 |
|66 |10 |
+---+----+
I think the original syntax of the "new_column" conditional expression won't work with df = df.withColumn("Flag", new_column).
I changed the syntax and it works now: I added a new constant column Flag65_exp holding lit(N_FLAG_AGE65) and referenced it from a SQL case expression with the bitwise & operator in df.withColumn("Flag", expr("case when ... end")).
%python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import *

# create a data frame with two columns: Age and Flag, and three rows
data = [
    (61, 0),
    (65, 1),
    (66, 10)   # previously inserted Flag 2 and 8, adding up to 10; Flag is 2^n
]
schema = StructType([
    StructField("Age", IntegerType(), True),
    StructField("Flag", IntegerType(), True)
])
df = spark.createDataFrame(data=data, schema=schema)
#df.printSchema()
df.show(truncate=False)

N_FLAG_AGE65 = 2
df = df.withColumn('Flag65_exp', lit(N_FLAG_AGE65))
df = df.withColumn("Flag", expr("case when Age >= 65 and (Flag & Flag65_exp) = 0 then Flag + Flag65_exp else Flag end"))
df.show(truncate=False)
#source df
+---+----+
|Age|Flag|
+---+----+
|61 |0 |
|65 |1 |
|66 |10 |
+---+----+
#updated df
+---+----+----------+
|Age|Flag|Flag65_exp|
+---+----+----------+
|61 |0 |2 |
|65 |3 |2 |
|66 |10 |2 |
+---+----+----------+
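As a side note (not part of the original answer), the non-SQL version can also be made to work by replacing & between the flag column and the constant, which PySpark interprets as a logical AND, with Column.bitwiseAND. A sketch reusing the names above:
from pyspark.sql.functions import col, when

new_column = when(
    (col("Age") >= 65) & (col("Flag").bitwiseAND(N_FLAG_AGE65) == 0),
    col("Flag") + N_FLAG_AGE65
).otherwise(col("Flag"))
df = df.withColumn("Flag", new_column)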

Create new column with fuzzy-score across two string columns in the same dataframe

I'm trying to calculate a fuzzy score (preferable partial_ratio score) across two columns in the same dataframe.
| column1     | column2     |
| ----------- | ----------- |
| emmett holt | holt        |
| greenwald   | christopher |
It would need to look something like this:
| column1     | column2     | partial_ratio |
| ----------- | ----------- | ------------- |
| emmett holt | holt        | 100           |
| greenwald   | christopher | 22            |
| schaefer    | schaefer    | 100           |
With the help of another question on this website, I worked towards the following code:
compare = pd.MultiIndex.from_product([dataframe['column1'], dataframe['column2']]).to_series()
def metrics(tup):
    return pd.Series([fuzz.partial_ratio(*tup)], ['partial_ratio'])
df['partial_ratio'] = df.apply(lambda x: fuzz.partial_ratio(x['original_title'], x['title']), axis=1)
But the problem already starts with the first line of the code that returns the following error notification:
Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
You can say I'm kind of stuck here so any advice on this is appreciated!
You need a UDF to use fuzzywuzzy:
from fuzzywuzzy import fuzz
import pyspark.sql.functions as F

@F.udf
def fuzzyudf(original_title, title):
    return fuzz.partial_ratio(original_title, title)

df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))
df2.show()
+-----------+-----------+-------------+
| column1| column2|partial_ratio|
+-----------+-----------+-------------+
|emmett holt| holt| 100|
| greenwald|christopher| 22|
+-----------+-----------+-------------+
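One small caveat, not mentioned in the answer above: without a declared return type the UDF returns the score as a string (StringType is PySpark's default). If you need it as an integer, e.g. for filtering or sorting, you can declare the type explicitly; a sketch assuming the same DataFrame:
from fuzzywuzzy import fuzz
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F

@F.udf(returnType=IntegerType())
def fuzzyudf(original_title, title):
    return fuzz.partial_ratio(original_title, title)

df2 = df.withColumn('partial_ratio', fuzzyudf('column1', 'column2'))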

Pyspark dataframe - Illegal values appearing in the column?

So I have a table (sample)
I'm using the PySpark DataFrame API to filter out the 'NOC's that have never won a gold medal, and here's the code I wrote.
First part of my code
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It will generate the output
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This displays the output
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are more than 3 characters? Those are supposed to be values of the 'Team' column, but I'm not sure why they are showing up in the 'NOC' column. It's hard to figure out why this is happening, i.e. why illegal values appear in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
The same thing happens: illegal values appear in the final dataframe column.
Any help would be appreciated. Thanks.
You should specify a delimiter for your CSV file. By default Spark uses a comma (,) as the separator.
This can be done, for example, with:
.option("delimiter",";")