Iterate over one dataframe and add values to another dataframe without "append" or "concat"? - pandas

I have a dataframe df_edges that I want to iterate over.
Inside the iteration there is an if/else and a string split. I need to add the values from the if/else to a new dataframe (each iteration = one new row in the other dataframe).
Example data of "df_edges":
                  channelId  ...  featuredChannelsUrlsCount
0  UC-ry8ngUIJHTMBWeoARZGmA  ...                          1
1  UC-zK3cJdazy01AKTu8g_amg  ...                          6
2  UC05_iIGvXue0sR01JNpRHzw  ...                         10
3  UC141nSav5cjmTXN7B70ts0g  ...                          0
4  UC1cQzKmbx9x0KipvoCt4NJg  ...                          0
# new empty dataframe where I want to add the values
df_edges_to_db = pd.DataFrame(columns=["Source", "Target"])

# iteration over the dataframe
for row in df_edges.itertuples():
    if row.featuredChannelsUrlsCount != 0:
        featured_channels = row[2].split(',')
        for fc in featured_channels:
            writer.writerow([row[1], fc])
            df_edges_to_db = df_edges_to_db.append({"Source": row[1], "Target": fc}, ignore_index=True)
    else:
        writer.writerow([row[1], row[1]])
        df_edges_to_db = df_edges_to_db.append({"Source": row[1], "Target": row[1]}, ignore_index=True)
This seems to work. But the documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html) says:
"The following, while not recommended methods for generating DataFrames, …"
So, is there a more "best practice" way (besides append/concat) to add the rows with the values?

It is possible to create a list of dictionaries with Python's list.append (not DataFrame.append as in your solution) and call the DataFrame constructor only once at the end; this avoids copying the growing DataFrame on every iteration:
L = []
# iteration over the dataframe
for row in df_edges.itertuples():
    if row.featuredChannelsUrlsCount != 0:
        featured_channels = row[2].split(',')
        for fc in featured_channels:
            writer.writerow([row[1], fc])
            L.append({"Source": row[1], "Target": fc})
    else:
        writer.writerow([row[1], row[1]])
        L.append({"Source": row[1], "Target": row[1]})

df_edges_to_db = pd.DataFrame(L)

Actually I am not clear how your df_edges DataFrame looks, but judging from your code I would suggest replacing the body of the outer for-loop with something like this:
new_list = [someOperationOn(x) if x == 0 else otherOperationOn(x) for x in mylist]
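Applied to the question above, a minimal sketch of that comprehension idea could look as follows; featuredChannels is a hypothetical name for the column the original loop reads as row[2] (the comma-separated string), and the writer.writerow side effect is left out:

import pandas as pd

# Sketch only: "featuredChannels" is an assumed column name for the
# comma-separated string accessed as row[2] in the original loop.
pairs = [
    (row.channelId, target)
    for row in df_edges.itertuples()
    for target in (
        row.featuredChannels.split(',')
        if row.featuredChannelsUrlsCount != 0
        else [row.channelId]
    )
]
df_edges_to_db = pd.DataFrame(pairs, columns=["Source", "Target"])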

Related

pyspark- elif statement and assign position to extract a value

I have a dataframe like the example
| productNo | prodcuctMT                        | productPR                        | productList                    |
|-----------|-----------------------------------|----------------------------------|--------------------------------|
| 2389      | ['xy-5', 'yz-12','zb-56','iu-30'] | ['pr-1', 'pr-2', 'pr-3', 'pr-4'] | ['67230','7839','1339','9793'] |
| 6745      | ['xy-4', 'yz-34','zb-8','iu-9']   | ['pr-6', 'pr-1', 'pr-3', 'pr-7'] | ['1111','0987','8910','0348']  |
I would like to use elif-style logic for multiple conditions: look at productMT, and if it passes the condition, look at productPR and take the position at which it satisfies the condition.
If productMT contains xy-5 and productPR contains pr-1, take its position and add a new column with the value from productList at that position.
| productNo | prodcuctMT                        | productPR                        | productList                    | productList |
|-----------|-----------------------------------|----------------------------------|--------------------------------|-------------|
| 2389      | ['xy-5', 'yz-12','zb-56','iu-30'] | ['pr-1', 'pr-2', 'pr-3', 'pr-4'] | ['67230','7839','1339','9793'] | 67230       |
I tried using a filter, but it only handles a single condition, and I need to run multiple filters so that it loops through all rows and conditions.
filtered = F.filter(
    F.arrays_zip('productList', 'prodcuctMT', 'productPR'),
    lambda x: (x.prodcuctMT == 'xy-5') & (x.productPR != 'pr-1')
)
df_array_pos = df_array.withColumn('output', filtered[0].productList).withColumn('flag', filtered[0].prodcuctMT)
You just need to chain a when call for each elif condition you want.
Your sample data
df = spark.createDataFrame([
    (2389, ['xy-5', 'yz-12','zb-56','iu-30'], ['pr-1', 'pr-2', 'pr-3', 'pr-4'], ['67230','7839','1339','9793']),
    (6745, ['xy-4', 'yz-34','zb-8','iu-9'], ['pr-6', 'pr-1', 'pr-3', 'pr-7'], ['1111','0987','8910','0348']),
], ['productNo', 'productMT', 'productPR', 'productList'])
+---------+---------------------------+------------------------+-------------------------+
|productNo|productMT |productPR |productList |
+---------+---------------------------+------------------------+-------------------------+
|2389 |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|
|6745 |[xy-4, yz-34, zb-8, iu-9] |[pr-6, pr-1, pr-3, pr-7]|[1111, 0987, 8910, 0348] |
+---------+---------------------------+------------------------+-------------------------+
You can chain as many when calls as you like:
from pyspark.sql import functions as F
(df
    .withColumn('output', F
        .when(F.array_contains('productMT', 'xy-5') & F.array_contains('productPR', 'pr-1'), F.col('productList')[F.array_position('productMT', 'xy-5') - 1])
    )
    .show(10, False)
)
+---------+---------------------------+------------------------+-------------------------+------+
|productNo|productMT |productPR |productList |output|
+---------+---------------------------+------------------------+-------------------------+------+
|2389 |[xy-5, yz-12, zb-56, iu-30]|[pr-1, pr-2, pr-3, pr-4]|[67230, 7839, 1339, 9793]|67230 |
|6745 |[xy-4, yz-34, zb-8, iu-9] |[pr-6, pr-1, pr-3, pr-7]|[1111, 0987, 8910, 0348] |null |
+---------+---------------------------+------------------------+-------------------------+------+
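To cover further elif-style branches, additional when calls can simply be chained on the first one, with an optional otherwise for the default. A sketch of that pattern follows; the xy-4/pr-6 pair is only an illustrative extra condition, not taken from the question:

from pyspark.sql import functions as F

# Sketch: one .when() per "elif" branch; the xy-4 / pr-6 condition is
# only an illustrative example.
df_out = df.withColumn(
    'output',
    F.when(
        F.array_contains('productMT', 'xy-5') & F.array_contains('productPR', 'pr-1'),
        F.col('productList')[F.array_position('productMT', 'xy-5') - 1]
    ).when(
        F.array_contains('productMT', 'xy-4') & F.array_contains('productPR', 'pr-6'),
        F.col('productList')[F.array_position('productMT', 'xy-4') - 1]
    ).otherwise(F.lit(None))
)
df_out.show(10, False)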

Recursively update the dataframe

I have a dataframe called datafe in which I want to combine the hyphenated words.
For example, the input dataframe looks like this:
,author_ex
0,Marios
1,Christodoulou
2,Intro-
3,duction
4,Simone
5,Speziale
6,Exper-
7,iment
And the output dataframe should be like:
,author_ex
0,Marios
1,Christodoulou
2,Introduction
3,Simone
4,Speziale
5,Experiment
I have written some sample code to achieve this, but I am not able to get out of the recursion safely.
def rm_actual(datafe, index):
    stem1 = datafe.iloc[index]['author_ex']
    stem2 = datafe.iloc[index + 1]['author_ex']
    fixed_token = stem1[:-1] + stem2
    datafe.drop(index=index + 1, inplace=True, axis=0)
    newdf = datafe.reset_index(drop=True)
    newdf.iloc[index]['author_ex'] = fixed_token
    return newdf

def remove_hyphens(datafe):
    for index, row in datafe.iterrows():
        flag = False
        token = row['author_ex']
        if token[-1:] == '-':
            datafe = rm_actual(datafe, index)
            flag = True
            break
    if flag == True:
        datafe = remove_hyphens(datafe)
    if flag == False:
        return datafe

datafe = remove_hyphens(datafe)
print(datafe)
Is there any possibility I can get out of this recursion with the expected output?
Another option:
Given/Input:
author_ex
0 Marios
1 Christodoulou
2 Intro-
3 duction
4 Simone
5 Speziale
6 Exper-
7 iment
Code:
import pandas as pd
# read/open file or create dataframe
df = pd.DataFrame({'author_ex': ['Marios', 'Christodoulou', 'Intro-',
                                 'duction', 'Simone', 'Speziale', 'Exper-', 'iment']})
# check input format
print(df)
# create new column 'Ending' for True/False if column 'author_ex' ends with '-'
df['Ending'] = df['author_ex'].shift(1).str.contains('-$', na=False, regex=True)
# remove the trailing '-' from the 'author_ex' column
df['author_ex'] = df['author_ex'].str.replace('-$', '', regex=True)
# create new column with values of 'author_ex' and shifted 'author_ex' concatenated together
df['author_ex_combined'] = df['author_ex'] + df.shift(-1)['author_ex']
# create a series true/false but shifted up
index = (df['Ending'] == True).shift(-1)
# set the last row to 'False' after it was shifted
index.iloc[-1] = False
# replace 'author_ex' with 'author_ex_combined' based on true/false of index series
df.loc[index,'author_ex'] = df['author_ex_combined']
# remove rows that have the 2nd part of the 'author_ex' string and are no longer required
df = df[~df.Ending]
# remove the extra columns
df.drop(['Ending', 'author_ex_combined'], axis = 1, inplace=True)
# output final dataframe
print('\n\n')
print(df)
# notice index 3 and 6 are missing
Outputs:
author_ex
0 Marios
1 Christodoulou
2 Introduction
4 Simone
5 Speziale
6 Experiment

Pyspark -- Filter dataframe based on row values of another dataframe

I have a master dataframe and a secondary dataframe which I want to go through row by row, filter the master dataframe based on the values in each row, run a function on the filtered master dataframe, and save the output.
The output could either be saved in a separate dataframe, or in a new column of the secondary dataframe.
# Master DF
df = pd.DataFrame({"Name": ["Mike", "Bob", "Steve", "Jim", "Dan"], "Age": [22, 44, 66, 22, 66], "Job": ["Doc", "Cashier", "Fireman", "Doc", "Fireman"]})
#Secondary DF
df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"]})
df = spark.createDataFrame(df)
+-----+---+-------+
| Name|Age| Job|
+-----+---+-------+
| Mike| 22| Doc|
| Bob| 44|Cashier|
|Steve| 66|Fireman|
| Jim| 22| Doc|
| Dan| 66|Fireman|
+-----+---+-------+
df1 = spark.createDataFrame(df1)
+---+-------+
|Age| Job|
+---+-------+
| 22| Doc|
| 66|Fireman|
+---+-------+
# Filter by values in first row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 22) &
    (F.col('Job') == 'Doc')
)
# Run the filtered DF through my function
def my_func(df_filt):
    my_list = df_filt.select('Name').rdd.flatMap(lambda x: x).collect()
    return '-'.join(my_list)
# Output of function
my_func(df_filt)
'Mike-Jim'
# Filter by values in second row of secondary DF
df_filt = df.filter(
    (F.col("Age") == 66) &
    (F.col('Job') == 'Fireman')
)
# Output of function
my_func(df_filt)
'Steve-Dan'
# Desired output at the end of the iterations
new_df1 = pd.DataFrame({"Age": [22, 66], "Job": ["Doc", "Fireman"], "Returned_value": ['Mike-Jim', 'Steve-Dan']})
Basically, I want to take my Master DF and filter it in certain ways, and run an algorithm on the filtered dataset and get the output for that filtering, then go on to the next set of filtering and do the same.
What is the best way to go about this?
Try this with join, groupBy, concat_ws/array_join and collect_list.
from pyspark.sql import functions as F
df.join(df1, ['Age', 'Job'])\
  .groupBy("Age", "Job")\
  .agg(F.concat_ws('-', F.collect_list("Name")).alias("Returned_value")).show()
#+---+-------+--------------+
#|Age| Job|Returned_value|
#+---+-------+--------------+
#| 22| Doc| Mike-Jim|
#| 66|Fireman| Steve-Dan|
#+---+-------+--------------+
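array_join, mentioned above, works the same way on the collected array; a sketch of the equivalent query:

from pyspark.sql import functions as F

# Equivalent variant using array_join on the collected list (sketch).
df.join(df1, ['Age', 'Job'])\
  .groupBy("Age", "Job")\
  .agg(F.array_join(F.collect_list("Name"), '-').alias("Returned_value"))\
  .show()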

Recommendation - Creating a new dataframe with conditions

I've been studying Spark for a while, but today I got stuck. I'm working on a recommendation model using the Audioscrobbler dataset.
My model is based on ALS, and I use the following definition to make the recommendations:
def makeRecommendations(model: ALSModel, userID: Int, howMany: Int): DataFrame = {
  val toRecommend = model.itemFactors.select($"id".as("artist")).withColumn("user", lit(userID))
  model.transform(toRecommend).
    select("artist", "prediction", "user").
    orderBy($"prediction".desc).
    limit(howMany)
}
It's generating the expected output, but now I would like to create a new list of DataFrames using the Predictions DF and the User Data DF.
[DataFrame Example]
The new list of DataFrames should consist of the predicted value from the Predictions DF plus a "Listened" column that is 0 if the user didn't listen to the artist and 1 if the user did, something like this:
[Expected DF]
I tried the following solution:
val recommendationsSeq = someUsers.map { userID =>
  // Gets the artists from user in testData
  val artistsOfUser = testData.where($"user".===(userID)).select("artist").rdd.map(r => r(0)).collect.toList
  // Recommendations for each user
  val recoms = makeRecommendations(model, userID, numRecom)
  // Insert a column "listened" with 1 if the artist is in the test set for the user and 0 otherwise
  val recomOutput = recoms.withColumn("listened", when($"artist".isin(artistsOfUser: _*), 1.0).otherwise(0.0)).drop("artist")
  (recomOutput)
}.toSeq
But it's very time-consuming when the recommendation covers more than 30 users. I believe there's a better way to do it.
Could someone give me some ideas?
Thanks,
You can try joining the dataframes, then groupBy and count:
scala> val df1 = Seq((1205,0.9873411,1000019)).toDF("artist","prediction","user")
scala> df1.show()
+------+----------+-------+
|artist|prediction| user|
+------+----------+-------+
| 1205| 0.9873411|1000019|
+------+----------+-------+
scala> val df2 = Seq((1000019,1205,40)).toDF("user","artist","playcount")
scala> df2.show()
+-------+------+---------+
| user|artist|playcount|
+-------+------+---------+
|1000019| 1205| 40|
+-------+------+---------+
scala> df1.join(df2,Seq("artist","user")).groupBy('prediction).count().show()
+----------+-----+
|prediction|count|
+----------+-----+
| 0.9873411| 1|
+----------+-----+

Adding multiple dictionaries into a single Dataframe pandas

I have a set of python dictionaries that I have obtained by means of a for loop. I am trying to have these added to Pandas Dataframe.
Output for a variable called output
{'name':'Kevin','age':21}
{'name':'Steve','age':31}
{'name':'Mark','age':11}
I am trying to append each of these dictionaries to a single DataFrame. I tried the below, but it only added the first row.
df = pd.DataFrame(output)
Could anyone advise as to where I am going wrong, so that all the dictionaries get added to the DataFrame?
Update on the loop statement
The code below reads an XML file and converts it to a dataframe. Right now I am able to loop through multiple XML files and create a dictionary for each XML file. I am trying to see how I could add each of these dictionaries to a single DataFrame:
def f(elem, result):
    result[elem.tag] = elem.text
    cs = elem.getchildren()
    for c in cs:
        result = f(c, result)
    return result

result = {}
for file in allFiles:
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, result)
    print(result)
You can append each dictionary to a list and call the DataFrame constructor only once at the end:
out = []
for file in allFiles:
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, {})  # start from an empty dict so files don't share keys
    out.append(result)

df = pd.DataFrame(out)
We can add these dicts to a list:
ds = []
for ...:       # your loop
    ds += [d]  # where d is one of the dicts
When we have the list of dicts, we can simply use pd.DataFrame on that list:
ds = [
    {'name':'Kevin','age':21},
    {'name':'Steve','age':31},
    {'name':'Mark','age':11}
]
pd.DataFrame(ds)
Output:
    name  age
0  Kevin   21
1  Steve   31
2   Mark   11
Update:
And it's not a problem if different dicts have different keys, e.g.:
ds = [
    {'name':'Kevin','age':21},
    {'name':'Steve','age':31,'location': 'NY'},
    {'name':'Mark','age':11,'favorite_food': 'pizza'}
]
pd.DataFrame(ds)
Output:
   age favorite_food location   name
0   21           NaN      NaN  Kevin
1   31           NaN       NY  Steve
2   11         pizza      NaN   Mark
Update 2:
Building up on our previous discussion in Python - Converting xml to csv using Python pandas we can do:
results = []
for file in glob.glob('*.xml'):
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, {})
    result['filename'] = file  # added filename to our results
    results += [result]

pd.DataFrame(results)