How to calculate conditional probability of values in pyspark dataframe?

I want to calculate conditional probabilities of ratings ('A', 'B', 'C') in the rating column by the value of the type column in PySpark, without collecting.
company model rating type
0 ford mustang A coupe
1 chevy camaro B coupe
2 ford fiesta C sedan
3 ford focus A sedan
4 ford taurus B sedan
5 toyota camry B sedan
rating type conditional_probability
0 A coupe 0.50
1 B coupe 0.33
2 C sedan 1.00
3 A sedan 0.50
4 B sedan 0.66

You can use groupby to count the rows for each rating and for each combination of rating and type, then divide the combination count by the rating count to get the conditional probability.
from pyspark.sql import functions as F
ratings_cols = ["company", "model", "rating", "type"]
ratings_values = [
("ford", "mustang", "A", "coupe"),
("chevy", "camaro", "B", "coupe"),
("ford", "fiesta", "C", "sedan"),
("ford", "focus", "A", "sedan"),
("ford", "taurus", "B", "sedan"),
("toyota", "camry", "B", "sedan"),
]
ratings_df = spark.createDataFrame(data=ratings_values, schema=ratings_cols)
# +-------+-------+------+-----+
# |company| model|rating| type|
# +-------+-------+------+-----+
# | ford|mustang| A|coupe|
# | chevy| camaro| B|coupe|
# | ford| fiesta| C|sedan|
# | ford| focus| A|sedan|
# | ford| taurus| B|sedan|
# | toyota| camry| B|sedan|
# +-------+-------+------+-----+
# count rows per (rating, type), join the per-rating counts, then divide
probability_df = (ratings_df.groupby(["rating", "type"])
.agg(F.count(F.lit(1)).alias("rating_type_count"))
.join(ratings_df.groupby("rating").agg(F.count(F.lit(1)).alias("rating_count")), on="rating")
.withColumn("conditional_probability", F.round(F.col("rating_type_count")/F.col("rating_count"), 2))
.select(["rating", "type", "conditional_probability"])
.sort(["type", "rating"]))
probability_df.show()
# +------+-----+-----------------------+
# |rating| type|conditional_probability|
# +------+-----+-----------------------+
# | A|coupe| 0.5|
# | B|coupe| 0.33|
# | A|sedan| 0.5|
# | B|sedan| 0.67|
# | C|sedan| 1.0|
# +------+-----+-----------------------+
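As a sanity check on the numbers above, the same arithmetic can be sketched in plain Python without a Spark session. This is only an illustration of the counting logic on the six sample rows; the names `pair_counts`, `rating_counts` and `conditional` are made up for this sketch and are not part of the answer's code.

```python
from collections import Counter

rows = [
    ("ford", "mustang", "A", "coupe"),
    ("chevy", "camaro", "B", "coupe"),
    ("ford", "fiesta", "C", "sedan"),
    ("ford", "focus", "A", "sedan"),
    ("ford", "taurus", "B", "sedan"),
    ("toyota", "camry", "B", "sedan"),
]

# Count rows per (rating, type) pair and per rating.
pair_counts = Counter((rating, car_type) for _, _, rating, car_type in rows)
rating_counts = Counter(rating for _, _, rating, _ in rows)

# count(rating, type) / count(rating), rounded to two decimals.
conditional = {
    pair: round(n / rating_counts[pair[0]], 2)
    for pair, n in pair_counts.items()
}

# Print sorted by type, then rating, like the Spark output.
for (rating, car_type), p in sorted(conditional.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(rating, car_type, p)
```

Because the denominator is the per-rating count, each value is the share of a rating's rows that fall in a given type.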


Fill a column according to a weight list in another Spark DataFrame

I have:
Dataframe A
Dataframe B
I need to fill (in PySpark) the column "Name" (Dataset A) according to the weighted Name of Dataset B:
Max 50%; Mike 25%; John 25%
This may not be the best approach, but it would work for weights >= 0.01. I wanted to create an approach where equality would be used in the join as opposed to > < >= and <=.
from pyspark.sql import functions as F, Window as W
dfA = spark.range(1, 5).toDF("ID")
dfB = spark.createDataFrame(
[('Max', 0.5),
('Mike', 0.25),
('John', 0.25)],
['Name', 'weight'])
cs = F.round(F.sum('weight').over(W.rowsBetween(W.unboundedPreceding, 0)), 2)
dfB = dfB.withColumn('w', F.sequence(((cs - F.col('weight'))*100+1).cast('int'), (cs*100).cast('int')))
dfB = dfB.withColumn('w', F.explode('w'))
cntA = dfA.count()
dfA = dfA.withColumn('w', F.ceil(F.count('ID').over(W.rowsBetween(W.unboundedPreceding, 0))/cntA*100))
dfA = dfA.join(dfB, 'w', 'left').drop('w', 'weight')
# +---+----+
# | ID|Name|
# +---+----+
# | 1| Max|
# | 2| Max|
# | 3|Mike|
# | 4|John|
# +---+----+
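The bucket idea behind the join can be sketched in plain Python: each name owns a range of integer percent buckets (1..50, 51..75, 76..100 here), and each row of A is mapped to the bucket given by its running rank. This is a minimal sketch of that logic only, not the Spark code; the helper `assign` and the variable `bounds` are hypothetical names.

```python
from bisect import bisect_left
from itertools import accumulate

ids = [1, 2, 3, 4]
weights = [("Max", 0.5), ("Mike", 0.25), ("John", 0.25)]

# Upper percent bound of each name's bucket range: [50, 75, 100].
bounds = [round(c * 100) for c in accumulate(w for _, w in weights)]

def assign(ids, weights, bounds):
    """Map each id to a name according to its percent bucket."""
    total = len(ids)
    out = []
    for rank, id_ in enumerate(ids, start=1):
        # Integer ceiling of rank / total * 100, mirroring F.ceil in the answer.
        bucket = -(-rank * 100 // total)
        # First name whose upper bound is >= bucket.
        name = weights[bisect_left(bounds, bucket)][0]
        out.append((id_, name))
    return out

result = assign(ids, weights, bounds)
print(result)
```

With whole-percent buckets, weights finer than 0.01 cannot be represented, which is the limitation the answer mentions.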

Dealing with multiple values in Pandas Dataframe Cell

Columns are the description of the data and the rows keep the values. However, some columns hold multiple values (they come from a table on a website). The rows of that table get merged into one cell and are separated by hashtags. Since they are only part of the table, they correspond to values in other columns, whose cells are also separated by hashtags.
Column Name: solution_id | type labour | labour_unit | est_labour_quantity | est_labour_costs | est_labour_total_costs
10 | WorkA#WorkB | Person#Person | 2.0#2.0 | 300.0#300.0. | 600.0#600.0
11 | WorkC#WorkD | Person#Person | 3.0#2.0 | 300.0#300.0. | 900.0#600.0
My questions are twofold:
What would be a good way to transform the data to work on it more efficiently, e.g. create as many new columns as there are entries in one cell? So, for example, separate it like this:
Column Name: solution_id | type labour_1 | labour_unit_1 | est_labour_quantity_1 | est_labour_costs_1 | est_labour_total_costs_1 | type labour_2 | labour_unit_2 | est_labour_quantity_2 | est_labour_costs_2 | est_labour_total_costs_2
10 | WorkA | Person. | 2.0. | 300.0. | 600.0. | WorkB | Person | 2.0 | 300.0 | 600.0
11 | WorkC | Person. | 3.0. | 300.0. | 900.0. | WorkD | Person | 2.0 | 300.0 | 600.0
This makes it more readable but it doubles the amount of columns and I have some cells with up to 5 entries, so it would be x5 more columns. What I also don't like so much about the idea is that the new column names are not really meaningful and it will be hard to interpret them.
How can I make this separation in pandas, so that I have WorkA and then the associated values, and then Work B etc...
If there is another better way to work with this tabular form, maybe bring it all in one cell? Please let me know!
#unpivot by melt
df = df.melt('solution_id')
#create lists by split #
df['value'] = df['value'].str.split('#')
#repeat rows by value column
df = df.explode('value')
#counter for new columns names
df['g'] = df.groupby(['solution_id','variable']).cumcount().add(1)
#pivoting and sorting MultiIndex
df = (df.pivot(index='solution_id', columns=['variable', 'g'], values='value')
.sort_index(level=1, axis=1, sort_remaining=False))
#flatten MultiIndex
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
type_labour_1 labour_unit_1 est_labour_quantity_1 \
10 WorkA Person 2.0
11 WorkC Person 3.0
est_labour_costs_1 est_labour_total_costs_1 type_labour_2 \
10 300.0 600.0 WorkB
11 300.0 900.0 WorkD
labour_unit_2 est_labour_quantity_2 est_labour_costs_2 \
10 Person 2.0 300.0.
11 Person 2.0 300.0.
est_labour_total_costs_2
10 600.0
11 600.0
You can split your strings, explode and reshape:
# assumes solution_id is the index, so every column holds '#'-separated strings
df2 = (df
.apply(lambda c: c.str.split('#'))
.explode(list(df))
.assign(idx=lambda d: d.groupby(level=0).cumcount().add(1))
.set_index('idx', append=True)
.unstack('idx')
.sort_index(axis=1, level='idx', sort_remaining=False)
)
df2.columns = [f'{a}_{b}' for a,b in df2.columns]
type labour_1 labour_unit_1 est_labour_quantity_1 est_labour_costs_1 est_labour_total_costs_1 type labour_2 labour_unit_2 est_labour_quantity_2 est_labour_costs_2 est_labour_total_costs_2
10 WorkA Person 2.0 300.0 600.0 WorkB Person 2.0 300.0. 600.0
11 WorkC Person 3.0 300.0 900.0 WorkD Person 2.0 300.0. 600.0
Or, shorter code using the same initial split followed by slicing and concatenation:
df2 = df.apply(lambda c: c.str.split('#'))
pd.concat([df2.apply(lambda c: c.str[i]).add_suffix(f'_{i+1}')
for i in range(len(df2.iat[0,0]))], axis=1)
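On the question of whether there is another way to work with this data: one option is to keep it in long form, with one row per labour entry, instead of widening it. The sketch below assumes pandas 1.3 or later (for multi-column `explode`) and uses a reduced version of the sample data with the stray trailing dots removed; the frame name `long_df` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "solution_id": [10, 11],
    "type_labour": ["WorkA#WorkB", "WorkC#WorkD"],
    "est_labour_quantity": ["2.0#2.0", "3.0#2.0"],
    "est_labour_costs": ["300.0#300.0", "300.0#300.0"],
})

value_cols = [c for c in df.columns if c != "solution_id"]

# Split every '#'-separated cell into a list.
long_df = df.copy()
for c in value_cols:
    long_df[c] = long_df[c].str.split("#")

# Explode all list columns together: one row per labour entry.
long_df = long_df.explode(value_cols, ignore_index=True)

# Convert the numeric columns so they can be used in arithmetic.
num_cols = ["est_labour_quantity", "est_labour_costs"]
long_df[num_cols] = long_df[num_cols].astype(float)
long_df["est_labour_total_costs"] = (
    long_df["est_labour_quantity"] * long_df["est_labour_costs"]
)

print(long_df)
```

In this shape the column names stay meaningful no matter how many entries a cell has, and per-solution totals are a `groupby("solution_id")` away.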

python pandas dataframe indexing

I have a data frame
df = pd.DataFrame(carData)
#df.ffill() This is where i need to fill the amount of columns i want to add with previous value
The data looks something like
1 honda
2 ford
3 chevy
I want to add an index, keep it numerical up to a certain number, and forward fill the models column with the last value. For example, the dataset above has 3 entries; if I want a total of 5 entries, it should look something like
1 honda
2 ford
3 chevy
4 chevy
5 chevy
Using df.reindex() and df.ffill()
N= 5
0 honda
1 ford
2 chevy
3 chevy
4 chevy
Use reindex with method='ffill' or add ffill:
import numpy as np
N = 5
df = df.reindex(np.arange(1, N + 1), method='ffill')
#df = df.reindex(np.arange(1, N + 1)).ffill()
print (df)
1 honda
2 ford
3 chevy
4 chevy
5 chevy
If default RangeIndex:
df = df.reset_index(drop=True)
N = 5
df = df.reindex(np.arange(N), method='ffill')
#df = df.reindex(np.arange(N)).ffill()
print (df)
0 honda
1 ford
2 chevy
3 chevy
4 chevy
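A self-contained version of the first variant, for trying it out. The column name `model` and the sample frame are assumptions, since the question does not show how `carData` is built.

```python
import pandas as pd

# Hypothetical stand-in for the question's data, indexed 1..3.
df = pd.DataFrame({"model": ["honda", "ford", "chevy"]}, index=[1, 2, 3])

N = 5
# Extend the index to 1..N; new labels take the last preceding value.
extended = df.reindex(range(1, N + 1), method="ffill")

print(extended)
```

Note that `method="ffill"` requires the existing index to be monotonic; the commented-out `.reindex(...).ffill()` variant does not have that restriction.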

How to do Multi Index Pivot when index and values are in the same column?

I have this frame:
regions = pd.read_html('')
messy_regions = regions[8]
Which yields something like this:
|0 | 1
--- |---| ---
0| Region 1 (The Northeast)| nan
1| Division 1 (New England)| Division 2 (Middle Atlantic)
2| Maine | New York
3| New Hampshire | Pennsylvania
4| Vermont | New Jersey
5| Massachusetts |nan
6| Rhode Island |nan
7| Connecticut | nan
8| Region 2 (The Midwest) | nan
9| Division 3 (East North Central)| Division 4 (West North Central)
10| Wisconsin | North Dakota
11| Michigan | South Dakota
12| Illinois | Nebraska
The goal is to make this a tidy dataframe and I think I need to pivot in order to get the regions and Divisions as columns with the states as values under the correct regions/divisions. Once it's in that shape then I can just melt to the desired shape. I can't figure out though how to extract what would be the column headers out of this though. Any help is appreciated and at the very least a good point in the right direction.
You can use:
url = ''
#input dataframe with columns a, b
df = pd.read_html(url)[8]
df.columns = ['a','b']
#extract Region data to new column
df['Region'] = df['a'].where(df['a'].str.contains('Region', na=False)).ffill()
#reshaping, remove rows with NaNs, remove column variable
df = (pd.melt(df, id_vars='Region', value_name='Names')
.sort_values(['Region', 'variable'])
.dropna()
.drop('variable', axis=1))
#extract Division data to new column
df['Division'] = df['Names'].where(df['Names'].str.contains('Division', na=False)).ffill()
#remove duplicates from column Names, change order of columns
df = (df[(df.Division != df.Names) & (df.Region != df.Names)]
.reindex(columns=['Region','Division','Names']))
#temporarily display all columns
with pd.option_context('display.expand_frame_repr', False):
    print (df)
Region Division Names
0 Region 1 (The Northeast) Division 1 (New England) Maine
1 Region 1 (The Northeast) Division 1 (New England) New Hampshire
2 Region 1 (The Northeast) Division 1 (New England) Vermont
3 Region 1 (The Northeast) Division 1 (New England) Massachusetts
4 Region 1 (The Northeast) Division 1 (New England) Rhode Island
5 Region 1 (The Northeast) Division 1 (New England) Connecticut
6 Region 1 (The Northeast) Division 2 (Middle Atlantic) New York
7 Region 1 (The Northeast) Division 2 (Middle Atlantic) Pennsylvania
8 Region 1 (The Northeast) Division 2 (Middle Atlantic) New Jersey
9 Region 2 (The Midwest) Division 3 (East North Central) Wisconsin
10 Region 2 (The Midwest) Division 3 (East North Central) Michigan
11 Region 2 (The Midwest) Division 3 (East North Central) Illinois
12 Region 2 (The Midwest) Division 3 (East North Central) Indiana
13 Region 2 (The Midwest) Division 3 (East North Central) Ohio
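Since the URL is missing above, here is a small offline sketch of the same `where` + `ffill` technique on a hand-built frame. The four sample rows are a shortened copy of the table in the question, and the names `raw` and `tidy` are made up for this example.

```python
import pandas as pd

raw = pd.DataFrame({
    "a": ["Region 1 (The Northeast)", "Division 1 (New England)", "Maine", "Vermont"],
    "b": [None, "Division 2 (Middle Atlantic)", "New York", None],
})

# Keep only the Region labels, then carry them down to the rows below.
raw["Region"] = raw["a"].where(raw["a"].str.contains("Region", na=False)).ffill()

# Stack columns a and b into one Names column, column a first.
tidy = (
    raw.melt(id_vars="Region", value_name="Names")
    .dropna(subset=["Names"])
    .sort_values(["Region", "variable"])
    .drop(columns="variable")
)

# Same trick for Division labels.
tidy["Division"] = tidy["Names"].where(
    tidy["Names"].str.contains("Division", na=False)
).ffill()

# Drop the header rows themselves and order the columns.
tidy = tidy[(tidy["Names"] != tidy["Region"]) & (tidy["Names"] != tidy["Division"])]
tidy = tidy[["Region", "Division", "Names"]].reset_index(drop=True)

print(tidy)
```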

Resampling a DataFrame to hourly 15min and 5min periods in Julia

I'm quite new to Julia but I'm giving it a try since the benchmarks claim it to be much faster than Python.
I'm trying to use some stock tick data in the format ["unixtime", "price", "amount"]
I managed to load the data and convert the unixtime to a date in Julia, but now I need to resample the data to a specific period (hourly, 15 min, 5 min, etc.), using OHLC (open, high, low, close) for the price and sum for the amount:
julia> head(btc_raw_data)
6x3 DataFrame:
date price amount
[1,] 2011-09-13T13:53:36 UTC 5.8 1.0
[2,] 2011-09-13T13:53:44 UTC 5.83 3.0
[3,] 2011-09-13T13:53:49 UTC 5.9 1.0
[4,] 2011-09-13T13:53:54 UTC 6.0 20.0
[5,] 2011-09-13T14:32:53 UTC 5.95 12.4521
[6,] 2011-09-13T14:35:04 UTC 5.88 7.458
I see there is a package called Resampling, but it doesn't seem to accept a time period only the number of row I want the output data to have.
Any other alternatives?
You can convert DataFrame (from DataFrames.jl) to TimeArray (from TimeSeries.jl) using
using TimeSeriesIO: TimeArray
ta = TimeArray(df, colnames=[:price], timestamp=:date)
You can resample a time series (a TimeArray from TimeSeries.jl) using TimeSeriesResampler and TimeFrames:
using TimeSeriesResampler: resample, mean, ohlc, sum, TimeFrame
# Define a sample timeseries (prices for example)
idx = DateTime(2010,1,1):Dates.Minute(1):DateTime(2011,1,1)
idx = idx[1:end-1]
N = length(idx)
y = rand(-1.0:0.01:1.0, N)
y = 1000 + cumsum(y)
#df = DataFrame(Date=idx, y=y)
ta = TimeArray(collect(idx), y, ["y"])
# Define how datetime should be grouped (timeframe)
tf = TimeFrame(dt -> floor(dt, Dates.Minute(15)))
# resample using OHLC values
ta_ohlc = ohlc(resample(ta, tf))
# resample using mean values
ta_mean = mean(resample(ta, tf))
# Define an other sample timeseries (volume for example)
vol = rand(0:0.01:1.0, N)
ta_vol = TimeArray(collect(idx), vol, ["vol"])
# resample using sum values
ta_vol_sum = sum(resample(ta_vol, tf))
You should get:
julia> ta
525600x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:59:00
2010-01-01T00:00:00 | 1000.16
2010-01-01T00:01:00 | 1000.1
2010-01-01T00:02:00 | 1000.98
2010-01-01T00:03:00 | 1001.38
2010-12-31T23:56:00 | 972.3
2010-12-31T23:57:00 | 972.85
2010-12-31T23:58:00 | 973.74
2010-12-31T23:59:00 | 972.8
julia> ta_ohlc
35040x4 TimeSeries.TimeArray{Float64,2,DateTime,Array{Float64,2}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
Open High Low Close
2010-01-01T00:00:00 | 1000.16 1002.5 1000.1 1001.54
2010-01-01T00:15:00 | 1001.57 1002.64 999.38 999.38
2010-01-01T00:30:00 | 999.13 1000.91 998.91 1000.91
2010-01-01T00:45:00 | 1001.0 1006.42 1001.0 1006.42
2010-12-31T23:00:00 | 980.84 981.56 976.53 976.53
2010-12-31T23:15:00 | 975.74 977.46 974.71 975.31
2010-12-31T23:30:00 | 974.72 974.9 971.73 972.07
2010-12-31T23:45:00 | 972.33 973.74 971.49 972.8
julia> ta_mean
35040x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
2010-01-01T00:00:00 | 1001.1047
2010-01-01T00:15:00 | 1001.686
2010-01-01T00:30:00 | 999.628
2010-01-01T00:45:00 | 1003.5267
2010-12-31T23:00:00 | 979.1773
2010-12-31T23:15:00 | 975.746
2010-12-31T23:30:00 | 973.482
2010-12-31T23:45:00 | 972.3427
julia> ta_vol
525600x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:59:00
2010-01-01T00:00:00 | 0.37
2010-01-01T00:01:00 | 0.67
2010-01-01T00:02:00 | 0.29
2010-01-01T00:03:00 | 0.28
2010-12-31T23:56:00 | 0.74
2010-12-31T23:57:00 | 0.66
2010-12-31T23:58:00 | 0.22
2010-12-31T23:59:00 | 0.47
julia> ta_vol_sum
35040x1 TimeSeries.TimeArray{Float64,1,DateTime,Array{Float64,1}} 2010-01-01T00:00:00 to 2010-12-31T23:45:00
2010-01-01T00:00:00 | 7.13
2010-01-01T00:15:00 | 6.99
2010-01-01T00:30:00 | 8.73
2010-01-01T00:45:00 | 8.27
2010-12-31T23:00:00 | 6.11
2010-12-31T23:15:00 | 7.49
2010-12-31T23:30:00 | 5.75
2010-12-31T23:45:00 | 8.36
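To make the grouping step concrete, the logic of flooring each timestamp to its period and then taking OHLC for the price and a sum for the amount can be sketched in plain Python on the six ticks from the question. This illustrates the idea only and is not the TimeSeriesResampler API; `floor_to_minutes` and `resample_ohlc` are hypothetical helpers, and the floor assumes a period that divides 60 minutes.

```python
from datetime import datetime, timedelta
from itertools import groupby

# (timestamp, price, amount) rows copied from the question.
ticks = [
    (datetime(2011, 9, 13, 13, 53, 36), 5.8, 1.0),
    (datetime(2011, 9, 13, 13, 53, 44), 5.83, 3.0),
    (datetime(2011, 9, 13, 13, 53, 49), 5.9, 1.0),
    (datetime(2011, 9, 13, 13, 53, 54), 6.0, 20.0),
    (datetime(2011, 9, 13, 14, 32, 53), 5.95, 12.4521),
    (datetime(2011, 9, 13, 14, 35, 4), 5.88, 7.458),
]

def floor_to_minutes(dt, minutes):
    """Floor dt to the start of its period (minutes must divide 60)."""
    return dt - timedelta(
        minutes=dt.minute % minutes,
        seconds=dt.second,
        microseconds=dt.microsecond,
    )

def resample_ohlc(ticks, minutes):
    """Return (period_start, open, high, low, close, amount_sum) per period."""
    bars = []
    ordered = sorted(ticks)  # sort by timestamp so groupby sees contiguous periods
    for bucket, group in groupby(ordered, key=lambda t: floor_to_minutes(t[0], minutes)):
        rows = list(group)
        prices = [price for _, price, _ in rows]
        amount = sum(a for _, _, a in rows)
        bars.append((bucket, prices[0], max(prices), min(prices), prices[-1], amount))
    return bars

bars = resample_ohlc(ticks, 15)
for bar in bars:
    print(bar)
```

Periods with no ticks simply produce no bar in this sketch.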