How to use LinearRegression across groups in DataFrame?

Let us say my Spark DataFrame (DF) looks like
id | age | earnings| health
----------------------------
1 | 34 | 65 | 8
2 | 65 | 12 | 4
2 | 20 | 7 | 10
1 | 40 | 75 | 7
. | .. | .. | ..
and I would like to group the DF by id, apply a function to each group
(say a linear regression that depends on multiple columns of the group,
two columns in this case), and get output like
id | intercept| slope
----------------------
1 | ? | ?
2 | ? | ?
from sklearn.linear_model import LinearRegression
lr_object = LinearRegression()
def linear_regression(ith_DF):
    # Note: for me it is necessary that ith_DF should contain all
    # data within this function scope, so that I can apply any
    # function that needs all data in ith_DF
    # collect both columns together so X and y stay aligned row by row
    rows = ith_DF.select("earnings", "health").collect()
    # scikit-learn expects X to be 2-D: one row per sample, one column per feature
    X = [[row.earnings] for row in rows]
    y = [row.health for row in rows]
    lr_object.fit(X, y)
    return lr_object.intercept_, lr_object.coef_[0]
coefficient_collector = []
# following iteration is not possible in spark as 'GroupedData'
# object is not iterable, please consider it as pseudo code
for ith_df in df.groupby("id"):
    c, m = linear_regression(ith_df)
    coefficient_collector.append((float(c), float(m)))
model_df = spark.createDataFrame(coefficient_collector, ["intercept", "slope"])
model_df.show()

I think this can be done since Spark 2.3 using a pandas UDF (pandas_udf). In fact, the post announcing pandas UDFs includes an example of fitting grouped regressions:
Introducing Pandas UDF for Python
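A minimal sketch of the per-group function such a grouped pandas UDF could wrap. The function name fit_group, the use of np.polyfit in place of scikit-learn, and the output schema are assumptions made for illustration, not taken from the linked post. The function is exercised locally with plain pandas; the Spark call is shown only as a comment because it needs a running session (applyInPandas is available from Spark 3.0, while Spark 2.3 would use pandas_udf with PandasUDFType.GROUPED_MAP).

```python
import numpy as np
import pandas as pd


def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit health ~ earnings on all rows of one group; return a one-row frame."""
    x = pdf["earnings"].to_numpy(dtype=float)
    y = pdf["health"].to_numpy(dtype=float)
    # np.polyfit with deg=1 returns the coefficients as [slope, intercept]
    slope, intercept = np.polyfit(x, y, deg=1)
    return pd.DataFrame(
        {
            "id": [int(pdf["id"].iloc[0])],
            "intercept": [float(intercept)],
            "slope": [float(slope)],
        }
    )


# Local check with plain pandas, using the four rows from the question.
sample = pd.DataFrame(
    {
        "id": [1, 2, 2, 1],
        "age": [34, 65, 20, 40],
        "earnings": [65, 12, 7, 75],
        "health": [8, 4, 10, 7],
    }
)
local_result = pd.concat(
    [fit_group(group) for _, group in sample.groupby("id")],
    ignore_index=True,
)
print(local_result)

# With a Spark DataFrame `df` of the same shape (assumption: Spark 3.0+):
# model_df = df.groupBy("id").applyInPandas(
#     fit_group, schema="id long, intercept double, slope double"
# )
# model_df.show()
```

Each group arrives in the function as an ordinary pandas DataFrame, so any function that needs all of a group's rows at once can be used inside it, provided each group fits in the memory of a single executor.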

What I'd do is to filter the main DataFrame to create smaller DataFrames and do the processing, say a linear regression.
You can then run the linear regressions in parallel (on separate threads sharing the same SparkSession, which is thread-safe), with the main DataFrame cached.
That should give you the full power of Spark.
p.s. My limited understanding of that part of Spark makes me think that a very similar approach is used for grid search-based model selection in Spark MLlib and also TensorFrames which is "Experimental TensorFlow binding for Scala and Apache Spark".

Related

How to select a subset of pandas dataframe containing an even distribution of one column's values?

I have a huge dataset over different years. As a subsample for local tests, I need to separate a small dataframe which contains only a few samples distributed over years. Does anyone have any idea how to do that?
After groupby by 'year' column, the count of instances in each year is something like:
| year| A   |
| ----| ----|
| 1838| 1000|
| 1839| 2600|
| 1840| 8900|
| 1841| 9900|
I want to select a subset which after groupby looks like:
| year| A |
| ----| --|
| 1838| 10|
| 1839| 10|
| 1840| 10|
| 1841| 10|
Try groupby().sample(), which is available in pandas 1.1 and later.
Here's example usage with dummy data.
import numpy as np
import pandas as pd
# create a long array of 'years' from 1800 to 1804 (the `high` bound is exclusive)
years = np.random.randint(low=1800,high=1805,size=200)
values = np.random.randint(low=1, high=200,size=200)
df = pd.DataFrame({'Years':years,"Values":values})
number_per_year = 10
sample_df = df.groupby("Years").sample(n=number_per_year, random_state=1)
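One caveat: groupby().sample(n=...) raises a ValueError when a group has fewer than n rows and replace is left at False. A minimal sketch of a workaround that caps the sample size at each group's size follows; the helper name sample_up_to and the dummy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Dummy data with one deliberately small group (1838 has only 3 rows).
rng = np.random.default_rng(0)
years = [1838] * 3 + [1839] * 50 + [1840] * 200
df = pd.DataFrame({"Years": years, "Values": rng.integers(1, 200, size=len(years))})


def sample_up_to(frame: pd.DataFrame, by: str, n: int, seed: int = 1) -> pd.DataFrame:
    """Take at most n rows per group, or the whole group if it is smaller than n."""
    parts = [
        group.sample(n=min(len(group), n), random_state=seed)
        for _, group in frame.groupby(by)
    ]
    return pd.concat(parts)


sample_df = sample_up_to(df, "Years", 10)
counts = sample_df.groupby("Years").size()
print(counts)
```

Small groups are kept whole here; whether that is acceptable, or whether they should be dropped or sampled with replacement instead, depends on what the local tests need.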

Scala convert Array to DataFrame Column

I am trying to add an Array of values as a new column to the DataFrame.
Ex:
Let's assume there is an Array(4,5,10) and a dataframe
+----------+-----+
| name | age |
+----------+-----+
| John | 32 |
| Elizabeth| 28 |
| Eric | 41 |
+----------+-----+
My requirement is to add the above array as a new column to the dataframe. My expected output is as follows:
+----------+-----+------+
| name | age | rank |
+----------+-----+------+
| John | 32 | 4 |
| Elizabeth| 28 | 5 |
| Eric | 41 | 10 |
+----------+-----+------+
I am trying to see whether I can achieve this using rdd and zipWithIndex.
df.rdd.zipWithIndex.map(_.swap).join(array_rdd.zipWithIndex.map(_.swap))
This is resulting in something of this sort.
(0,([John, 32],4))
I want to convert the above RDD back to required dataframe. Let me know how to achieve this.
Are there any alternatives available for achieving the desired result other than using rdd and zipWithIndex? What is the best way to do it?
PS:
Context for better understanding:
I am using the Xpress optimization suite to solve a mathematical problem. Xpress takes its inputs in terms of Arrays and also outputs the result as an Array. I get the input as a DataFrame, extract columns as Arrays (using collect), and pass them to Xpress. Xpress outputs an Array[Double] as the solution. I want to add this solution back to the dataframe as a column, where every value in the solution array corresponds to the row of the dataframe at its index, i.e. the value at index 'n' of the output Array corresponds to the 'n'th row of the dataframe.
After the join just map the results to what you are looking for.
You can convert this back to a dataframe after joining the RDDs.
val originalDF = Seq(("John", 32), ("Elizabeth", 28), ("Eric", 41)).toDF("name", "age")
val rank = Array(4, 5, 10)
// convert to Seq first
val rankDF = rank.toSeq.toDF("rank")
val joined = originalDF.rdd.zipWithIndex.map(_.swap).join(rankDF.rdd.zipWithIndex.map(_.swap))
val finalRDD = joined.map{ case (k,v) => (k, v._1.getString(0), v._1.getInt(1), v._2.getInt(0)) }
val finalDF = finalRDD.toDF("id", "name", "age", "rank")
finalDF.orderBy("id").show() // the join does not preserve row order, so sort by the index
/*
+---+---------+---+----+
| id| name|age|rank|
+---+---------+---+----+
| 0| John| 32| 4|
| 1|Elizabeth| 28| 5|
| 2| Eric| 41| 10|
+---+---------+---+----+
*/
The only alternative that I can think of is to use the org.apache.spark.sql.functions.row_number() window function. This essentially achieves the same thing by adding an increasing, consecutive row number to the dataframe.
The drawback is that a large amount of data is shuffled into one partition, since the row numbers must be unique across all rows in the dataframe. If your data is very large, this can lead to an out-of-memory issue. (Note: this may not apply in your case, since you mentioned that you already collect the data and did not mention any memory issues.)
The approach of converting to an rdd and using zipWithIndex is an acceptable solution, but generally converting from dataframe to rdd is not recommended due to the performance difference of using an RDD instead of a dataframe.

In Dask, how would I remove data that is not repeated across all values of another column?

I'm trying to find a set of data that exists across multiple instances of a column's value.
As an example, let's say I have a DataFrame with the following values:
+-------------+------------+----------+
| hardware_id | model_name | data_v |
+-------------+------------+----------+
| a | 1 | 0.595150 |
+-------------+------------+----------+
| b | 1 | 0.285757 |
+-------------+------------+----------+
| c | 1 | 0.278061 |
+-------------+------------+----------+
| d | 1 | 0.578061 |
+-------------+------------+----------+
| a | 2 | 0.246565 |
+-------------+------------+----------+
| b | 2 | 0.942299 |
+-------------+------------+----------+
| c | 2 | 0.658126 |
+-------------+------------+----------+
| a | 3 | 0.160283 |
+-------------+------------+----------+
| b | 3 | 0.180021 |
+-------------+------------+----------+
| c | 3 | 0.093628 |
+-------------+------------+----------+
| d | 3 | 0.033813 |
+-------------+------------+----------+
What I'm trying to get would be a DataFrame with all elements except the rows that contain a hardware_id of d, since they do not occur at least once per model_name.
I'm using Dask as my original data size is on the order of 7 GB, but if I need to drop down to Pandas that is also feasible. I'm very happy to hear any suggestions.
I have tried splitting the dataframe into individual dataframes based on the model_name attribute, then running a loop:
import numpy as np
import pandas as pd
import dask.dataframe as dd
models = ['1','1','1','2','2','2','3','3','3','3']
# build the Dask frame from a small pandas frame
frame_1 = dd.from_pandas(
    pd.DataFrame({'hardware_id':['a','b','c','a','b','c','a','b','c','d'], 'model_name':models, 'data_v':np.random.rand(len(models))}),
    npartitions=2,
)
model_splits = []
for i in range(1,4):
    model_splits.append(frame_1[frame_1['model_name'].eq(str(i))])
aggregate_list = []
for data in model_splits:
    # keep only the hardware_ids that also appear in every other model's split
    for other_models in model_splits:
        data = data[data.hardware_id.isin(other_models.hardware_id.compute().tolist())]
    aggregate_list.append(data)
final_data = dd.concat(aggregate_list)
However, this is beyond inefficient, and I'm not entirely sure that my logic is sound.
Any suggestions on how to achieve this?
Thanks!
One way to accomplish this is to treat it as a groupby-aggregation problem.
Pandas
First, we set up the data:
import pandas as pd
import numpy as np
np.random.seed(12)
models = ['1','1','1','2','2','2','3','3','3','3']
df = pd.DataFrame(
    {'hardware_id': ['a','b','c','a','b','c','a','b','c','d'],
     'model_name': models,
     'data_v': np.random.rand(len(models))
    }
)
Then, collect the unique values of your model_name column.
unique_model_names = df.model_name.unique()
unique_model_names
array(['1', '2', '3'], dtype=object)
Next, we'll do several related steps at once. Our goal is to figure out which hardware_ids co-occur with the entire unique set of model_names. First, a groupby aggregation collects the unique model_names per hardware_id. This returns an array per group, which we convert to a tuple so that it is hashable and works with isin in the next step. At this point, every hardware ID is associated with a tuple of its unique models. Next, we check whether that tuple exactly matches our unique model names, using isin. If it doesn't, the mask is False, which is exactly what we get for d. Note that the tuple comparison is order-sensitive, so this relies on the models appearing in the same order for every hardware ID.
agged = df.groupby("hardware_id", as_index=False).agg({"model_name": "unique"})
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])
agged
hardware_id model_name all_present_mask
0 a (1, 2, 3) True
1 b (1, 2, 3) True
2 c (1, 2, 3) True
3 d (3,) False
Finally, we can use this to get our list of "valid" hardware IDs, and then filter our initial dataframe.
relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id
result = df.loc[
    df.hardware_id.isin(relevant_ids)
]
result
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Dask
We can do essentially the same thing, but we need to be a little clever with our calls to compute.
import dask.dataframe as dd
ddf = dd.from_pandas(df, npartitions=2)
unique_model_names = ddf.model_name.unique()
agged = ddf.groupby("hardware_id").model_name.unique().reset_index()
agged["model_name"] = agged["model_name"].map(tuple)
agged["all_present_mask"] = agged["model_name"].isin([tuple(unique_model_names)])
relevant_ids = agged.loc[
    agged.all_present_mask
].hardware_id
result = ddf.loc[
    ddf.hardware_id.isin(relevant_ids.compute())  # can't pass a dask Series to `ddf.isin`
]
result.compute()
hardware_id model_name data_v
0 a 1 0.154163
1 b 1 0.740050
2 c 1 0.263315
3 a 2 0.533739
4 b 2 0.014575
5 c 2 0.918747
6 a 3 0.900715
7 b 3 0.033421
8 c 3 0.956949
Note that you would probably want to persist agged and relevant_ids if you have the memory available, to avoid some redundant calculation.
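For the pandas case, the same filter can also be sketched more compactly by counting distinct model_names per hardware_id instead of comparing tuples. This is an alternative formulation, not the method used above, and the data_v values below are placeholders. Because it compares counts, it does not depend on the order in which models appear for each hardware ID.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "hardware_id": ["a", "b", "c", "a", "b", "c", "a", "b", "c", "d"],
        "model_name": ["1", "1", "1", "2", "2", "2", "3", "3", "3", "3"],
        "data_v": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    }
)

# Total number of distinct models in the whole frame.
n_models = df["model_name"].nunique()

# For every row, the number of distinct models its hardware_id appears with.
models_per_id = df.groupby("hardware_id")["model_name"].transform("nunique")

# Keep only rows whose hardware_id is present for every model.
result = df[models_per_id == n_models]
print(result)
```

Here d appears with only one of the three models, so its row is dropped while all rows for a, b and c are kept.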

Get distinct rows by creation date

I am working with a dataframe like this:
DeviceNumber | CreationDate | Name
1001 | 1.1.2018 | Testdevice
1001 | 30.06.2019 | Device
1002 | 1.1.2019 | Lamp
I am using databricks and pyspark to do the ETL process. How can I reduce the dataframe in a way that I will only have a single row per "DeviceNumber" and that this will be the row with the highest "CreationDate"? In this example I want the result to look like this:
DeviceNumber | CreationDate | Name
1001 | 30.06.2019 | Device
1002 | 1.1.2019 | Lamp
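You can create an additional DataFrame holding each DeviceNumber and its latest (max) CreationDate. This assumes CreationDate is already a date or timestamp column; on raw strings such as 30.06.2019, max compares lexicographically rather than chronologically.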
import pyspark.sql.functions as psf
max_df = df\
    .groupBy('DeviceNumber')\
    .agg(psf.max('CreationDate').alias('max_CreationDate'))
and then join max_df with original dataframe.
joining_condition = [ df.DeviceNumber == max_df.DeviceNumber, df.CreationDate == max_df.max_CreationDate ]
df.join(max_df,joining_condition,'left_semi').show()
A left_semi join is useful when you want to use the second DataFrame only as a lookup and do not need any of its columns.
You can use PySpark windowing functionality:
from pyspark.sql.window import Window
from pyspark.sql import functions as f
# make sure that creation is a date data-type
df = df.withColumn('CreationDate', f.to_date('CreationDate', 'd.M.yyyy'))
# partition on device and get a row number by (descending) date
win = Window.partitionBy('DeviceNumber').orderBy(f.col('CreationDate').desc())
df = df.withColumn('rownum', f.row_number().over(win))
# finally take the first row in each group
df.filter(df['rownum']==1).select('DeviceNumber', 'CreationDate', 'Name').show()
+------------+------------+------+
|DeviceNumber|CreationDate| Name|
+------------+------------+------+
| 1002| 2019-01-01| Lamp|
| 1001| 2019-06-30|Device|
+------------+------------+------+

Using Pandas groupby to calculate many slopes

Some illustrative data in a DataFrame (MultiIndex) format:
|entity| year |value|
+------+------+-----+
| a | 1999 | 2 |
| | 2004 | 5 |
| b | 2003 | 3 |
| | 2007 | 2 |
| | 2014 | 7 |
I would like to calculate the slope using scipy.stats.linregress for each entity a and b in the above example. I tried using groupby on the first column, following the split-apply-combine advice, but it seems problematic since it's expecting one Series of values (a and b), whereas I need to operate on the two columns on the right.
This is easily done in R via plyr, not sure how to approach it in pandas.
A function can be applied to a groupby with the apply function. The passed function in this case is linregress. Please see below:
In [4]: x = pd.DataFrame({'entity':['a','a','b','b','b'],
   ...:                   'year':[1999,2004,2003,2007,2014],
   ...:                   'value':[2,5,3,2,7]})
In [5]: x
Out[5]:
entity value year
0 a 2 1999
1 a 5 2004
2 b 3 2003
3 b 2 2007
4 b 7 2014
In [6]: from scipy.stats import linregress
In [7]: x.groupby('entity').apply(lambda v: linregress(v.year, v.value)[0])
Out[7]:
entity
a 0.600000
b 0.403226
You can do this via the iterator ability of the group by object. It seems easier to do it by dropping the current index and then specifying the group by 'entity'.
A list comprehension is then an easy way to quickly work through all the groups in the iterator. Or use a dict comprehension to get the labels in the same place (you can then stick the dict into a pd.DataFrame easily).
import pandas as pd
import scipy.stats
#This is your data
test = pd.DataFrame({'entity':['a','a','b','b','b'],'year':[1999,2004,2003,2007,2014],'value':[2,5,3,2,7]}).set_index(['entity','year'])
#This creates the groups
groupby = test.reset_index().groupby(['entity'])
#Process groups by list comprehension
slopes = [scipy.stats.linregress(group.year, group.value)[0] for name, group in groupby]
#Process groups by dict comprehension
slopes = {name:[scipy.stats.linregress(group.year, group.value)[0]] for name, group in groupby}
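Building on the dict approach, a minimal sketch that keeps both the slope and the intercept for each entity and collects them into a DataFrame. The variable names rows and fits are assumptions made for illustration; the data is the example from the question.

```python
import pandas as pd
from scipy.stats import linregress

df = pd.DataFrame(
    {
        "entity": ["a", "a", "b", "b", "b"],
        "year": [1999, 2004, 2003, 2007, 2014],
        "value": [2, 5, 3, 2, 7],
    }
)

# One regression per entity; each result is stored as a small dict.
rows = {}
for entity, group in df.groupby("entity"):
    fit = linregress(group["year"], group["value"])
    rows[entity] = {"slope": fit.slope, "intercept": fit.intercept}

# One row per entity, one column per fitted quantity.
fits = pd.DataFrame.from_dict(rows, orient="index")
fits.index.name = "entity"
print(fits)
```

Iterating over the groups explicitly keeps both columns available inside the loop, which is what the single-Series form of groupby made awkward in the question.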