pyspark - elif statement and assign position to extract a value - dataframe

I have a dataframe like this example:
+---------+---------------------------+------------------------+-------------------------+
|productNo|productMT                  |productPR               |productList              |
+---------+---------------------------+------------------------+-------------------------+
|2389     |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|
|6745     |[xy-4, yz-34, zb-8, iu-9]  |[pr-6, pr-1, pr-3, pr-7]|[1111, 0987, 8910, 0348] |
+---------+---------------------------+------------------------+-------------------------+
I would like to use an elif-style statement with multiple conditions: look at productMT and, if it passes the condition, look at productPR and take the position at which it satisfies the condition.
For example: if productMT contains 'xy-5' and productPR contains 'pr-1', take that position and add a new column with the value from productList at that position.
Expected result:
+---------+---------------------------+------------------------+-------------------------+------+
|productNo|productMT                  |productPR               |productList              |output|
+---------+---------------------------+------------------------+-------------------------+------+
|2389     |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|67230 |
+---------+---------------------------+------------------------+-------------------------+------+
I tried using a filter, but it only does the work for one filter, and I need to run multiple filters so it loops through all rows and conditions.
filtered = F.filter(
    F.arrays_zip('productList', 'productMT', 'productPR'),
    lambda x: (x.productMT == 'xy-5') & (x.productPR != 'pr-1')
)
df_array_pos = df_array.withColumn('output', filtered[0].productList).withColumn('flag', filtered[0].productMT)

You just need to chain multiple when functions, one for each elif condition you want.
Your sample data:
df = spark.createDataFrame([
    (2389, ['xy-5', 'yz-12', 'zb-56', 'iu-30'], ['pr-1', 'pr-2', 'pr-3', 'pr-4'], ['67230', '7839', '1339', '9793']),
    (6745, ['xy-4', 'yz-34', 'zb-8', 'iu-9'], ['pr-6', 'pr-1', 'pr-3', 'pr-7'], ['1111', '0987', '8910', '0348']),
], ['productNo', 'productMT', 'productPR', 'productList'])
+---------+---------------------------+------------------------+-------------------------+
|productNo|productMT |productPR |productList |
+---------+---------------------------+------------------------+-------------------------+
|2389 |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|
|6745 |[xy-4, yz-34, zb-8, iu-9] |[pr-6, pr-1, pr-3, pr-7]|[1111, 0987, 8910, 0348] |
+---------+---------------------------+------------------------+-------------------------+
You can add as many when clauses as you like:
from pyspark.sql import functions as F

(df
    .withColumn('output', F
        .when(
            F.array_contains('productMT', 'xy-5') & F.array_contains('productPR', 'pr-1'),
            F.col('productList')[F.array_position('productMT', 'xy-5') - 1]
        )
    )
    .show(10, False)
)
+---------+---------------------------+------------------------+-------------------------+------+
|productNo|productMT |productPR |productList |output|
+---------+---------------------------+------------------------+-------------------------+------+
|2389 |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|67230 |
|6745 |[xy-4, yz-34, zb-8, iu-9] |[pr-6, pr-1, pr-3, pr-7]|[1111, 0987, 8910, 0348] |null |
+---------+---------------------------+------------------------+-------------------------+------+
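For instance, a minimal sketch of chaining a second condition (the extra values 'xy-4'/'pr-3' are just illustrative; this variant indexes by productPR, per the question's wording, whereas the code above indexes by productMT; both give position 1 on the sample data):
from pyspark.sql import functions as F

df_out = df.withColumn(
    'output',
    F.when(
        F.array_contains('productMT', 'xy-5') & F.array_contains('productPR', 'pr-1'),
        F.col('productList')[F.array_position('productPR', 'pr-1') - 1]
    ).when(
        # second "elif" branch; add as many of these as you need
        F.array_contains('productMT', 'xy-4') & F.array_contains('productPR', 'pr-3'),
        F.col('productList')[F.array_position('productPR', 'pr-3') - 1]
    ).otherwise(F.lit(None))  # the final "else"
)
df_out.show(10, False)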

Related

Trying to filter data frame to rows that have a certain value

First post in the community (congratulations or apologies are in order :-)). I provided some code below for survey data I am trying to analyze. I am trying to capture the rows that have the value "1" in any column. The value was stored as a float; I converted it to an integer and it did not work. I used quotes and it did not work either. Any advice?
# Dependencies and Setup
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import json
from pprint import pprint
import requests
import time
from scipy import stats
import seaborn as sn
%matplotlib inline
# Read csv
us_path = "us_Data.csv"
us_responses = pd.read_csv(us_path)
# Created filtered data frame.
preexisting_us = us_responses
# Filter data.
preexisting_us = us_responses[us_responses["diabetes"] == "1" | us_responses(us_responses["cardiovascular_disorders"] == "1") | us_responses(us_responses["obesity"] == "1") | us_responses(us_responses["respiratory_infections"] == "1") | us_responses(us_responses["respiratory_disorders_exam"] == "1") | us_responses(us_responses["gastrointestinal_disorders"] == "1") | us_responses(us_responses["chronic_kidney_disease"] == "1") | us_responses(us_responses["autoimmune_disease"] == "1") | us_responses(us_responses["chronic_fatigue_syndrome_a"] == "1")]
First of all, you should probably define your new DataFrame as a copy of the original one, e.g. df = us_responses.copy(). That way you are sure the original DataFrame will not be modified (I suggest you have a look at the documentation).
Now, to filter the DataFrame you can use simpler ways than the one in your code. For example:
cols_to_check = ['diabetes', 'cardiovascular_disorders', ... ]
df_filtered = df.loc[df[cols_to_check].sum(axis=1) > 0, :]
In this way, by calculating the sum of the selected columns, if at least one of them has the value 1, the corresponding row is kept in the filtered DataFrame.
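A minimal, self-contained sketch of this idea on toy data (it assumes the condition columns hold numeric 0/1 flags rather than the string "1"):
import pandas as pd

df = pd.DataFrame({
    "diabetes":                 [0, 1, 0],
    "cardiovascular_disorders": [0, 0, 0],
    "obesity":                  [1, 0, 0],
})
cols_to_check = ["diabetes", "cardiovascular_disorders", "obesity"]

# keep rows where at least one flag column is 1
df_filtered = df.loc[df[cols_to_check].sum(axis=1) > 0, :]
print(df_filtered)  # rows 0 and 1 are kept; the all-zero row 2 is dropped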
However, if you really want to keep your code the way it is (which I would not suggest), you need to make some corrections:
preexisting_us = preexisting_us[
    (preexisting_us["diabetes"] == 1)
    | (preexisting_us["cardiovascular_disorders"] == 1)
    | (preexisting_us["obesity"] == 1)
    | (preexisting_us["respiratory_infections"] == 1)
    | (preexisting_us["respiratory_disorders_exam"] == 1)
    | (preexisting_us["gastrointestinal_disorders"] == 1)
    | (preexisting_us["chronic_kidney_disease"] == 1)
    | (preexisting_us["autoimmune_disease"] == 1)
    | (preexisting_us["chronic_fatigue_syndrome_a"] == 1)
]
If you are interested in more info about filtering with .loc, here you can find the documentation.
Please follow @mozway's suggestions for posting clearer questions next time.

Pyspark -- Filter dataframe based on row values of another dataframe

I have a master dataframe and a secondary dataframe which I want to go through row by row, filter the master dataframe based on the values in each row, run a function on the filtered master dataframe, and save the output.
The output could either be saved in a separate dataframe, or in a new column of the secondary dataframe.
# Master DF
df = pd.DataFrame({"Name": ["Mike", "Bob", "Steve", "Jim", "Dan"], "Age": [22, 44, 66, 22, 66], "Job": ["Doc", "Cashier", "Fireman", "Doc", "Fireman"]})
#Secondary DF
df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"]})
df = spark.createDataFrame(df)
+-----+---+-------+
| Name|Age| Job|
+-----+---+-------+
| Mike| 22| Doc|
| Bob| 44|Cashier|
|Steve| 66|Fireman|
| Jim| 22| Doc|
| Dan| 66|Fireman|
+-----+---+-------+
df1 = spark.createDataFrame(df1)
+---+-------+
|Age|    Job|
+---+-------+
| 22|    Doc|
| 66|Fireman|
+---+-------+

# Filter by values in first row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 22) &
    (F.col('Job') == 'Doc')
)
# Run the filtered DF through my function
def my_func(df_filt):
    my_list = df_filt.select('Name').rdd.flatMap(lambda x: x).collect()
    return '-'.join(my_list)
# Output of function
my_func(df_filt)
'Mike-Jim'
# Filter by values in second row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 66) &
    (F.col('Job') == 'Fireman')
)
# Output of function
my_func(df_filt)
'Steve-Dan'
# Desired output at the end of the iterations
new_df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"], "Returned_value": ['Mike-Jim', 'Steve-Dan']})
Basically, I want to take my Master DF, filter it in certain ways, run an algorithm on the filtered dataset, get the output for that filtering, and then move on to the next set of filters and do the same.
What is the best way to go about this?
Try this with join, groupBy, concat_ws/array_join and collect_list.
from pyspark.sql import functions as F
df.join(df1, ['Age', 'Job'])\
  .groupBy("Age", "Job")\
  .agg(F.concat_ws('-', F.collect_list("Name")).alias("Returned_value"))\
  .show()
#+---+-------+--------------+
#|Age| Job|Returned_value|
#+---+-------+--------------+
#| 22| Doc| Mike-Jim|
#| 66|Fireman| Steve-Dan|
#+---+-------+--------------+
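A sketch of the array_join variant mentioned above, plus converting the result back to pandas to get the desired new_df1 (array_join requires Spark 2.4+):
from pyspark.sql import functions as F

result = (df.join(df1, ['Age', 'Job'])
            .groupBy('Age', 'Job')
            .agg(F.array_join(F.collect_list('Name'), '-').alias('Returned_value')))
result.show()
# same rows as above; convert back to a pandas frame if you want new_df1
new_df1 = result.toPandas()
Note that collect_list does not guarantee element order, so the concatenated names may come back in a different order than shown.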

Update pyspark dataframe from a column having the target column values

I have a dataframe which has a column ('target_column' in this case) and I need to update the target columns with the 'val' column values.
I have tried using UDFs and .withColumn, but they expect a fixed column name, and in my case it can vary from row to row. Using RDD map transformations didn't work either, as RDDs are immutable.
from pyspark.sql import SparkSession

def test():
    data = [("jose_1", 'mase', "firstname", "jane"), ("li_1", "ken", 'lastname', 'keno'), ("liz_1", 'durn', 'firstname', 'liz')]
    source_df = spark.createDataFrame(data, ["firstname", "lastname", "target_column", "val"])
    source_df.show()

if __name__ == "__main__":
    spark = SparkSession.builder.appName('Name Group').getOrCreate()
    test()
    spark.stop()
Input:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
| jose_1| mase| firstname|jane|
| li_1| ken| lastname|keno|
| liz_1| durn| firstname| liz|
+---------+--------+-------------+----+
Expected output:
+---------+--------+-------------+----+
|firstname|lastname|target_column| val|
+---------+--------+-------------+----+
| jane| mase| firstname|jane|
| li_1| keno| lastname|keno|
| liz| durn| firstname| liz|
+---------+--------+-------------+----+
For example, in the first row of the input, target_column is 'firstname' and val is 'jane', so I need to update firstname to 'jane' in that row.
Thanks
You can do a loop over all your columns:
from pyspark.sql import functions as F

for col in df.columns:
    df = df.withColumn(
        col,
        F.when(
            F.col("target_column") == F.lit(col),
            F.col("val")
        ).otherwise(F.col(col))
    )
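Applied to the question's source_df, the loop should reproduce the expected output; a quick way to check could look like this:
df = source_df
for col in df.columns:
    df = df.withColumn(
        col,
        F.when(F.col("target_column") == F.lit(col), F.col("val")).otherwise(F.col(col))
    )
df.show()
# +---------+--------+-------------+----+
# |firstname|lastname|target_column| val|
# +---------+--------+-------------+----+
# |     jane|    mase|    firstname|jane|
# |     li_1|    keno|     lastname|keno|
# |      liz|    durn|    firstname| liz|
# +---------+--------+-------------+----+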

How to create a column with all the values in a range given by another column in PySpark

I have a problem with the following scenario using PySpark version 2.0. I have a DataFrame with a column that contains an array with a start and end value, e.g.
[1000, 1010]
I would like to know how to create and compute another column which contains an array holding all the values in that range. The resulting generated-range column would be:
+--------------+-------------+-----------------------------+
| Description| Accounts| Range|
+--------------+-------------+-----------------------------+
| Range 1| [101, 105]| [101, 102, 103, 104, 105]|
| Range 2| [200, 203]| [200, 201, 202, 203]|
+--------------+-------------+-----------------------------+
Try this. Define the UDF:
def range_value(a):
    start = a[0]
    end = a[1] + 1
    return list(range(start, end))
from pyspark.sql import functions as F
from pyspark.sql import types as pt
df = spark.createDataFrame([("Range 1", list([101,105])), ("Range 2", list([200, 203]))],("Description", "Accounts"))
range_value = F.udf(range_value, pt.ArrayType(pt.IntegerType()))
df = df.withColumn('Range', range_value(F.col('Accounts')))
Output: the new Range column holds the full list of values for each row, matching the expected table shown in the question.
You should use a UDF (UDF sample).
Assuming your PySpark DataFrame is named df, it could look like this:
df = spark.createDataFrame(
    [("Range 1", list([101, 105])),
     ("Range 2", list([200, 203]))],
    ("Description", "Accounts"))
And your solution is like this:
import pyspark.sql.functions as F
from pyspark.sql import types as T
import numpy as np

def make_range_number(arr):
    # build the inclusive range [arr[0], arr[1]] as a plain Python list
    number_range = np.arange(arr[0], arr[1] + 1, 1).tolist()
    return number_range

# declare the return type explicitly; without it the UDF defaults to StringType
range_udf = F.udf(make_range_number, T.ArrayType(T.IntegerType()))

df = df.withColumn("Range", range_udf(F.col("Accounts")))
Have a fun time!:)
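As a side note, on Spark 2.4+ (newer than the 2.0 targeted in the question) the built-in sequence function could generate the range without a UDF; a minimal sketch, assuming the same df with the Accounts array column:
from pyspark.sql import functions as F

# sequence(start, stop) builds the inclusive integer range as an array column
df = df.withColumn('Range', F.sequence(F.col('Accounts')[0], F.col('Accounts')[1]))
df.show(truncate=False)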

Iterate over one dataframe and add values to another dataframe without "append" or "concat"?

I have a dataframe "df_edges" where I want to iterate over.
Inside the iteration is an if/else and a string split. I need to add the values from the if/else into a new dataframe (each iteration = one new row in the other dataframe).
Example data of "df_edges":
+-----------------------------------------+
| channelId ... featuredChannelsUrlsCount |
+-----------------------------------------+
| 0 UC-ry8ngUIJHTMBWeoARZGmA ... 1 |
| 1 UC-zK3cJdazy01AKTu8g_amg ... 6 |
| 2 UC05_iIGvXue0sR01JNpRHzw ... 10 |
| 3 UC141nSav5cjmTXN7B70ts0g ... 0 |
| 4 UC1cQzKmbx9x0KipvoCt4NJg ... 0 |
+-----------------------------------------+
# new empty dataframe where I want to add the values
df_edges_to_db = pd.DataFrame(columns=["Source", "Target"])

# iteration over the dataframe
for row in df_edges.itertuples():
    if row.featuredChannelsUrlsCount != 0:
        featured_channels = row[2].split(',')
        for fc in featured_channels:
            writer.writerow([row[1], fc])
            df_edges_to_db = df_edges_to_db.append({"Source": row[1], "Target": fc}, ignore_index=True)
    else:
        writer.writerow([row[1], row[1]])
        df_edges_to_db = df_edges_to_db.append({"Source": row[1], "Target": row[1]}, ignore_index=True)
This seems to work. But the documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html) describes this as one of the "not recommended methods for generating DataFrames".
So, is there a more "best practice" way (besides append/concat) to add the rows with these values?
It is possible to build a list of dictionaries here with Python's list append (not DataFrame.append as in your solution), and then call the DataFrame constructor only once:
L = []
# iteration over the dataframe
for row in df_edges.itertuples():
    if row.featuredChannelsUrlsCount != 0:
        featured_channels = row[2].split(',')
        for fc in featured_channels:
            writer.writerow([row[1], fc])
            L.append({"Source": row[1], "Target": fc})
    else:
        writer.writerow([row[1], row[1]])
        L.append({"Source": row[1], "Target": row[1]})

df_edges_to_db = pd.DataFrame(L)
Actually, I am not clear on what your df_edges DataFrame looks like. Looking at your code, I would suggest replacing the body of your outer for-loop with something like this:
new_list = [someOperationOn(x) if x == 0 else otherOperationOn(x) for x in mylist]
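For instance, a sketch of that comprehension applied to the Source/Target case above (featuredChannelsUrls is a hypothetical name for whichever column row[2] refers to):
import pandas as pd

pairs = [
    (row.channelId, fc)
    for row in df_edges.itertuples()
    for fc in (row.featuredChannelsUrls.split(',')  # hypothetical column name
               if row.featuredChannelsUrlsCount != 0
               else [row.channelId])
]
df_edges_to_db = pd.DataFrame(pairs, columns=["Source", "Target"])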