Convert a spark dataframe to a column - dataframe

I have a org.apache.spark.sql.DataFrame and I would like to convert it into a column: org.apache.spark.sql.Column.
So basically, this is my dataframe:
val filled_column2 = x.select(first(("col1"),ignoreNulls = true).over(window)) that I want to convert, it into an sql spark column. Anyone could help on that ?
Thank you,
#Jaime Caffarel: this is exactly what i am trying to do, this will give you more visibility. You may also check the error msg in the 2d screenshot

From the documentation of the class org.apache.spark.sql.Column
A column that will be computed based on the data in a DataFrame. A new
column is constructed based on the input columns present in a
dataframe:
df("columnName") // On a specific DataFrame.
col("columnName") // A generic column no yet associcated
with a DataFrame. col("columnName.field") // Extracting a
struct field col("a.column.with.dots") // Escape . in column
names. $"columnName" // Scala short hand for a named
column. expr("a + 1") // A column that is constructed
from a parsed SQL Expression. lit("abc") // A
column that produces a literal (constant) value.
If filled_column2 is a DataFrame, you could do:
filled_column2("col1")
******** EDITED AFTER CLARIFICATION ************
Ok, it seems to me that what you are trying to do is a JOIN operation. Assuming that the product_id is a unique key per each row, I would do something like this:
val filled_column = df.select(df("product_id"), last(("last_prev_week_nopromo"), ignoreNulls = true) over window)
This way, you are also selecting the product_id that you will use as key. Then, you can do the following
val promo_txn_cnt_seas_df2 = promo_txn_cnt_seas_df1
.join(filled_column, promo_txn_cnt_seas_df1("product_id") === filled_column("driver_id"), "inner")
// orderBy("product_id", "week")... (the rest of the operations)
Is this what you are trying to achieve?

Related

HighRadius Ivoice prediction Challenges

Generate a new column "avgdelay" from the existing columns
Note - You are expected to make a new column "avgdelay" by grouping "name_customer" column with reapect to mean of the "Delay" column.
This new column "avg_delay" is meant to store "customer_name" wise delay
groupby('name_customer')['Delay'].mean(numeric_only=False)
Display the new "avg_delay" column
Can Anyone guide me
Let df be your main data frame
avgdelay = df.groupby('name_customer')['Delay'].mean(numeric_only=False)
You need to add the "avg_delay" column with the maindata, mapped with "name_customer" column
Note - You need to use map function to map the avgdelay with respect to "name_customer" column
df['avg_delay'] = df['name_customer'].map(avgdelay)

Pyspark Schema update/alter Dataframe

I need to read a csv file from S3 ,it has string,double data but i will read as string which will provide a dynamic frame of only string. I want to do below for each row
concatenate few columns and create new columns
Add new columns
Convert value in 3rd column from string to date
Convert values of column 4,5,6 individually from string to decimal
Storename,code,created_date,performancedata,accumulateddata,maxmontlydata
GHJ 0,GHJ0000001,2020-03-31,0015.5126-,0024.0446-,0017.1811-
MULT,C000000001,2020-03-31,0015.6743-,0024.4533-,0018.0719-
Below is the code that I have written so far
def ConvertToDec(myString):
pattern = re.compile("[0-9]{0,4}[\\.]?[0-9]{0,4}[-]?")
myString=myString.strip()
doubleVal="";
if myString and not pattern.match(myString):
doubleVal=-9999.9999;
else:
doubleVal=-Decimal(myString);
return doubleVal
def rowwise_function(row):
row_dict = row.asDict()
data='d';
if not row_dict['code']:
data=row_dict['code']
else:
data='CD'
if not row_dict['performancedata']:
data= data +row_dict['performancedata']
else:
data=data + 'HJ'
// new columns
row_dict['LC_CODE']=data
row_dict['CD_CD']=123
row_dict['GBL']=123.345
if rec["created_date"]:
rec["created_date"]= convStr =datetime.datetime.strptime(rec["created_date"], '%Y-%m-%d')
if rec["performancedata"]
rec["performancedata"] = ConvertToDec(rec["performancedata"])
newrow = Row(**row_dict)
return newrow
store_df = spark.read.option("header","true").csv("C:\\STOREDATA.TXT", sep="|")
ratings_rdd = store_df.rdd
ratings_rdd_new = ratings_rdd.map(lambda row: rowwise_function(row))
updatedDF=spark.createDataFrame(ratings_rdd_new)
Basically, I am creating almost new DataFrame. My questions are below -
is this right approach ?
Since i am my changing schema mostly is there any other approach
Use Spark dataframes/sql, why use rdd? You don't need to perform any low level data operations, all are column level so dataframes are easier/efficient to use.
To create new columns - .withColumn(<col_name>, <expression/value>) (refer)
All the if's can be made .filter (refer)
The whole ConvertToDec can be written better using strip and ast module or float.

Building a new dataset

I want to take data from one set and enter it into another empty set.
So, for example, I want to do something like:
if ([i,x] > 9){
new_data$House[y,x] <- data[i,2]
}
but I want to do it over and over, creating new rows in new_data.
How do I keep adding data to new_data and overriding/saving the new row?
Essentially, I just want to know how to "grow" an empty data set.
Please ignore any errors in the code, it is just an example and I am still working on other details.
Thanks
If you are using r language, I presume you are looking for rbind:
new_data = NULL # define your new dataset
for(i in 1:nrow(data)) # loop over row of data
{
if(data[i,x] > 9) # if statement for implementing a condition
{
new_data = rbind(new_data,data[i,2:6]) # adding values of the row i and column 2 to 6
}
}
At the end, new_data will contain as many rows that satisfy the if statement and each row will contain values extracted from column 2 to 6.
If it is what you are looking for, there is various ways to do that without the need of a for loop, as an example:
new_data = data[data[i,x]>9,2:6]
If this answer is not satisfying for you, please provide more details in your question, include a reproducible example of your data and the expected output

Performing calculations on multiple columns in dataframe and create new columns

I'm trying to perform calculations based on the entries in a pandas dataframe. The dataframe looks something like this:
and it contains 1466 rows. I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do, is calculate something like mag='(U-V)/('R-I)' (but ignoring any values that are -999), put that in a new column, and then z_pred=10**((mag-c)m) in a new column (mag, c and m are just hard-coded variables). I have other columns I need to add too, but I figure that'll just be an extension of the same method.
I started out by trying
for i in range(1):
current = qso[:]
mag = (U-V)/(R-I)
name = current['NED']
z_pred = 10**((mag - c)/m)
z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally adding calculated columns row wise are usually performed with numpy's np.where;
df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(1), (df.U - df.V) / (df.R - df.I), -999)
Note; assuming here that when any of the columns contain '-999' it will not be calculated and a '-999' is returned.

Need a TRUE and FALSE column in Spark-SQL

I'm trying to write a multi-value filter for a Spark SQL DataFrame.
I have:
val df: DataFrame // my data
val field: String // The field of interest
val values: Array[Any] // The allowed possible values
and I'm trying to come up with the filter specification.
At the moment, I have:
val filter = values.map(value => df(field) === value)).reduce(_ || _)
But this isn't robust in the case where I get passed an empty list of values. To cover that case, I would like:
val filter = values.map(value => df(field) === value)).fold(falseColumn)(_ || _)
but I don't know how to specify falseColumn.
Anyone know how to do so?
And is there a better way of writing this filter? (If so, I still need the answer for how to get a falseColumn - I need a trueColumn for a separate piece).
A column that is always true:
val trueColumn = lit(true)
A column that is always false:
val falseColumn = lit(false)
Using lit(...) means these will always be valid columns, regardless of what columns the DataFrame contains.