HighRadius Invoice Prediction Challenges - pandas

Generate a new column "avg_delay" from the existing columns.
Note - You are expected to make a new column "avg_delay" by grouping on the "name_customer" column and taking the mean of the "Delay" column.
This new "avg_delay" column is meant to store the per-customer ("name_customer"-wise) delay:
groupby('name_customer')['Delay'].mean(numeric_only=False)
Display the new "avg_delay" column.
Can anyone guide me?

Let df be your main DataFrame:
avgdelay = df.groupby('name_customer')['Delay'].mean(numeric_only=False)
You then need to add the "avg_delay" column to the main data, mapped via the "name_customer" column.
Note - You need to use the map function to map avgdelay onto the "name_customer" column:
df['avg_delay'] = df['name_customer'].map(avgdelay)
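Putting the two steps together, here is a minimal self-contained sketch; the customer names and delay values below are invented for illustration:

```python
import pandas as pd

# Illustrative data; the real dataset has many more columns.
df = pd.DataFrame({
    'name_customer': ['ACME', 'ACME', 'GLOBEX', 'GLOBEX', 'GLOBEX'],
    'Delay': [10, 20, 3, 6, 9],
})

# Mean delay per customer (a Series indexed by customer name).
avgdelay = df.groupby('name_customer')['Delay'].mean()

# Broadcast the per-customer mean back onto every row.
df['avg_delay'] = df['name_customer'].map(avgdelay)

print(df['avg_delay'].tolist())
# [15.0, 15.0, 6.0, 6.0, 6.0]
```

map works here because the groupby result is a Series whose index is exactly the key you are mapping on.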

Related

Convert a spark dataframe to a column

I have an org.apache.spark.sql.DataFrame and I would like to convert it into a column: org.apache.spark.sql.Column.
So basically, this is my dataframe:
val filled_column2 = x.select(first(("col1"), ignoreNulls = true).over(window))
and that is what I want to convert into a Spark SQL column. Could anyone help with that?
Thank you.
@Jaime Caffarel: this is exactly what I am trying to do; this should give you more visibility. You may also check the error message in the second screenshot.
From the documentation of the class org.apache.spark.sql.Column:
A column that will be computed based on the data in a DataFrame. A new column is constructed based on the input columns present in a dataframe:

df("columnName")          // On a specific DataFrame.
col("columnName")         // A generic column not yet associated with a DataFrame.
col("columnName.field")   // Extracting a struct field.
col("a.column.with.dots") // Escape `.` in column names.
$"columnName"             // Scala short hand for a named column.
expr("a + 1")             // A column that is constructed from a parsed SQL Expression.
lit("abc")                // A column that produces a literal (constant) value.
If filled_column2 is a DataFrame, you could do:
filled_column2("col1")
******** EDITED AFTER CLARIFICATION ************
Ok, it seems to me that what you are trying to do is a JOIN operation. Assuming that product_id is a unique key for each row, I would do something like this:
val filled_column = df.select(df("product_id"), last(("last_prev_week_nopromo"), ignoreNulls = true) over window)
This way, you are also selecting the product_id that you will use as a key. Then, you can do the following:
val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
  .join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("driver_id"), "inner")
  // orderBy("product_id", "week")... (the rest of the operations)
Is this what you are trying to achieve?

Find a column that contains a specific string from another column

I have 2 data frames: one called cuartos (rooms in English) and another called paredes (walls in English). They hold room temperatures and wall temperatures. I want to create a new data frame with the temperature difference between each wall and its room. For example:
Room name = 2_APTO_1
Walls of the room = 2_APTO_1.FACE2, 2_APTO_1.FACE3 and 2_APTO_1.FACE4
The new data frame should be something like:
2_APTO_1.FACE2 = 2_APTO_1.FACE2 - 2_APTO_1
2_APTO_1.FACE3 = 2_APTO_1.FACE3 - 2_APTO_1
2_APTO_1.FACE4 = 2_APTO_1.FACE4 - 2_APTO_1 ...
I tried this:
# get a list of paredes and cuartos columns
col_names_paredes = paredes.columns.tolist()
col_names_cuartos = cuartos.columns.tolist()
# Check if col_names_paredes has col_names_cuartos names in it
for i in col_names_cuartos:
    for k in col_names_paredes:
        if col_names_paredes[k] in col_names_cuartos[i]:
            print(k)
I got this error
TypeError: list indices must be integers or slices, not str
any help would be appreciated.
When you do for i in col_names_cuartos, i takes the column name values themselves, not the index values you would get with for i in range(len(col_names_cuartos)); that is why col_names_paredes[k] raises a TypeError, since k is a string, not an integer.
So you can use the following code instead (note that the room name is a substring of the wall name, so the membership test goes this way round):
for col_cuartos in col_names_cuartos:
    for col_paredes in col_names_paredes:
        if col_cuartos in col_paredes:
            print(col_paredes)
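Going one step further, the matched names can drive the subtraction itself. A minimal sketch of that idea; the two frames and their temperature values below are made up for illustration:

```python
import pandas as pd

# Hypothetical frames: one temperature column per room / per wall.
cuartos = pd.DataFrame({'2_APTO_1': [20.0, 21.0]})
paredes = pd.DataFrame({'2_APTO_1.FACE2': [22.0, 23.0],
                        '2_APTO_1.FACE3': [19.0, 20.5]})

# For each wall column, find the room whose name appears in the wall
# name and subtract the room temperature from the wall temperature.
diffs = {}
for wall in paredes.columns:
    for room in cuartos.columns:
        if room in wall:
            diffs[wall] = paredes[wall] - cuartos[room]

diff_df = pd.DataFrame(diffs)
print(diff_df)
```

Each column of diff_df then holds wall minus room, matching the 2_APTO_1.FACE2 - 2_APTO_1 layout asked for above.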

Building a new dataset

I want to take data from one set and enter it into another empty set.
So, for example, I want to do something like:
if (data[i, x] > 9) {
  new_data$House[y, x] <- data[i, 2]
}
but I want to do it over and over, creating new rows in new_data.
How do I keep adding data to new_data and overriding/saving the new row?
Essentially, I just want to know how to "grow" an empty data set.
Please ignore any errors in the code, it is just an example and I am still working on other details.
Thanks
If you are using the R language, I presume you are looking for rbind:
new_data = NULL                  # define your new dataset
for(i in 1:nrow(data))           # loop over the rows of data
{
  if(data[i, x] > 9)             # if statement implementing the condition
  {
    new_data = rbind(new_data, data[i, 2:6])  # add the values of row i, columns 2 to 6
  }
}
At the end, new_data will contain as many rows as satisfy the if statement, and each row will contain the values extracted from columns 2 to 6.
If this is what you are looking for, there are various ways to do it without needing a for loop, for example:
new_data = data[data[, x] > 9, 2:6]
If this answer is not satisfying for you, please provide more details in your question and include a reproducible example of your data and the expected output.

What is the cleanest way to create a new column based on a conditional of an existing column?

In pandas I currently have a data frame containing a column of strings: {Urban, Suburban, Rural}. The column I would like to create is conditional on the first column, i.e. Urban, Suburban, and Rural map to the corresponding colors {Coral, Skyblue, Gold}.
I tried copying the first column and then using .replace, but my new column now seems to contain NaN values instead of the colors.
new_column = merge_table["type"]
merge_table["color"] = new_column
color_df = merge_table["color"].replace({'Urban': 'Coral', 'Suburban': 'Skyblue', 'Rural': 'Gold'})
data = pd.DataFrame({'City Type': type,
                     'Bubble Color': color_df})
data.head()
You can do:
merge_table['New col'] = merge_table["color"].replace({'Urban': 'Coral', 'Suburban': 'Skyblue', 'Rural': 'Gold'})
Okay. In the future, it's worth formatting your code using 'Code Samples' so that we can read it more easily.
There are several areas where your code can improve. First, you can do the entire thing in one line:
merge_table["color"] = merge_table["type"].map(mapping_dictionary)
Series.map() is around 4 times faster than Series.replace(), for your information.
Also, other tips:
Never use type as a variable name; use something more specific like city_type. type is already a built-in function:
data = pd.DataFrame({'City Type': city_type, 'Bubble Color': color_df})
If you make a copy of a column, use:
a_series = df['column_name'].copy()
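As a quick self-contained check of the map approach (the DataFrame below is invented for illustration, and mapping_dictionary is the dict from the answer above):

```python
import pandas as pd

merge_table = pd.DataFrame({'type': ['Urban', 'Suburban', 'Rural', 'Urban']})

mapping_dictionary = {'Urban': 'Coral', 'Suburban': 'Skyblue', 'Rural': 'Gold'}
merge_table['color'] = merge_table['type'].map(mapping_dictionary)

print(merge_table['color'].tolist())
# ['Coral', 'Skyblue', 'Gold', 'Coral']
```

One caveat: unlike .replace, .map sends any key missing from the dict to NaN, so make sure every city type appears in the dictionary.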

Performing calculations on multiple columns in dataframe and create new columns

I'm trying to perform calculations based on the entries in a pandas dataframe. The dataframe looks something like this:
and it contains 1466 rows. I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do is calculate something like mag = (U-V)/(R-I) (but ignoring any values that are -999), put that in a new column, and then z_pred = 10**((mag - c)/m) in another new column (mag, c and m are just hard-coded variables). I have other columns I need to add too, but I figure that will just be an extension of the same method.
I started out by trying:
for i in range(1):
    current = qso[:]
    mag = (U-V)/(R-I)
    name = current['NED']
    z_pred = 10**((mag - c)/m)
    z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally adding calculated columns row-wise is usually done with numpy's np.where:
df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1), (df.U - df.V) / (df.R - df.I), -999)
Note: assuming here that when any of the columns contains -999, the value is not calculated and -999 is returned instead.
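A runnable sketch of that pattern; the magnitude values are invented, and c and m are hypothetical calibration constants standing in for the hard-coded variables mentioned in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'U': [18.0, -999.0, 17.5],
                   'V': [17.0, 16.0, -999.0],
                   'R': [16.0, 15.0, 15.5],
                   'I': [15.0, 14.0, 14.5]})

c, m = 0.5, 2.0  # hypothetical constants; replace with your own

# Rows where any input column holds the -999 sentinel.
bad = df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1)

# mag = (U - V) / (R - I), or -999 where any input is the sentinel.
df['mag'] = np.where(~bad, (df.U - df.V) / (df.R - df.I), -999)

# Propagate the sentinel into z_pred the same way.
df['z_pred'] = np.where(~bad, 10 ** ((df['mag'] - c) / m), -999)

print(df[['mag', 'z_pred']])
```

Only the first row has clean inputs, so it gets mag = 1.0 and z_pred = 10**0.25; the other two rows keep the -999 sentinel in both new columns.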