I am new to PySpark DataFrames and am following a sample from this link. In that link they use a pandas DataFrame, whereas I want to achieve the same thing with a Spark DataFrame. I am stuck on transposing the table and couldn't find a good way to do it. Since there are so many columns, I find pivot difficult to implement and understand. Is there a better way to do this? Can I use pandas in PySpark in a cluster environment?
In the PySpark API, pyspark.mllib.linalg.distributed.BlockMatrix has a transpose function.
If you have a df with columns id and features:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import IndexedRowMatrix

bm_transpose = IndexedRowMatrix(df.rdd.map(lambda x: (x[0],
    Vectors.dense(x[1])))).toBlockMatrix(2, 2).transpose()
I am relatively new to Pandas, and was hoping for guidance on the most efficient and clean way to apply multiple rules/masks to the same dataframe column.
I have two unique and independent conditions working:
Condition 1
df["price"]= df["price"].mask(df["price"].eq("£ 0.00"), df["product_price_old"])
df.drop(axis=1, inplace=True, columns='product_price_old')
Condition 2
df["price"] = df["price"].mask(df["product_price_old"].gt(df["price"]), df["product_price_old"])
df.drop(axis=1, inplace=True, columns='product_price_old')
What is the best syntax in Pandas to merge these conditions together and remove the duplication?
Would a separate Python function called via .agg work? I came across .pipe in the docs earlier; would this be a suitable use case?
Any help would be appreciated.
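One way to merge the two rules into a single mask is to OR the conditions together before a single replace-and-drop (a sketch with made-up data; the string comparison mirrors the question's own code):

```python
import pandas as pd

# hypothetical sample data shaped like the question's columns
df = pd.DataFrame({
    "price": ["£ 0.00", "5.00", "3.00"],
    "product_price_old": ["4.00", "2.00", "9.00"],
})

# combine both rules in one boolean mask:
# replace price when it is "£ 0.00" OR when the old price compares greater
# (note: these are lexicographic string comparisons, as in the question's code)
cond = df["price"].eq("£ 0.00") | df["product_price_old"].gt(df["price"])
df["price"] = df["price"].mask(cond, df["product_price_old"])
df = df.drop(columns="product_price_old")
```

This keeps a single mask/drop pair instead of duplicating them, without needing .agg or .pipe.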
Hello guys. Could you help me with .corrwith? I can't find a way to 'translate' the pandas code to Spark.
EDIT: I'm using two dataframes, so I need to establish a correlation between the two dataframes.
Code:
pd.DataFrame({col:x.corrwith(y[col]) for col in y.columns})
The image below shows the desired output, but I need it written in Spark.
You can use the corr function from pyspark.sql.functions.
Example:
from pyspark.sql.functions import corr

df.select(corr('x', 'y')).show()
For multiple column pairs, add more corr expressions to the same select.
I'd like to merge/concatenate multiple dataframes together; basically, it adds up many feature columns based on the same first column 'Name'.
F1.merge(F2, on='Name', how='outer').merge(F3, on='Name', how='outer').merge(F4,on='Name', how='outer')...
I tried the code above and it works. But I've got, say, 100 features to add, so I'm wondering: is there a better way?
Without sample data it's hard to be sure, but this can work (axis=1 joins the feature columns side by side on the 'Name' index, rather than stacking rows):
df = pd.concat([x.set_index('Name') for x in [df1, df2, df3]], axis=1).reset_index()
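Alternatively, the chained merge from the question can be rolled up over a list of any length with functools.reduce (a sketch with made-up frames):

```python
from functools import reduce
import pandas as pd

# hypothetical feature frames, each keyed by 'Name'
f1 = pd.DataFrame({"Name": ["a", "b"], "x": [1, 2]})
f2 = pd.DataFrame({"Name": ["a", "c"], "y": [3, 4]})
f3 = pd.DataFrame({"Name": ["b", "c"], "z": [5, 6]})

frames = [f1, f2, f3]  # could be 100 frames
merged = reduce(lambda left, right: left.merge(right, on="Name", how="outer"), frames)
```

This is the same outer merge as the question's chain, just expressed once over the whole list.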
I am trying to transpose/unpivot a dataframe in SparkR. I can't find any direct method in the SparkR package to unpivot a dataframe. Nor am I able to use an R package on a SparkR DataFrame, even after using the includePackage method. It would be helpful if someone could let me know if there are direct ways to unpivot using SparkR, or other alternatives such as Hive.
Nor am I able to use an R package on a SparkR DataFrame
Native R commands don't run on Spark DataFrames. Only Spark commands run on Spark DataFrames. If you want to run an R command on a Spark DataFrame you can collect() to convert it to an R data.frame, but you lose the benefits of distributed processing.
The Spark DataFrame is a similar construct to a table in a relational database. By working with Spark commands on a Spark DataFrame you will retain the benefits of distributed processing across the cluster.
It's difficult to answer such a general question - normally on this forum people expect specific examples with data and code. In general, if I wanted to un-pivot a relational table then the most basic way would be to create a set of queries, each query containing the row key plus one column, filtered for non-null in the column. I would then union together the multiple results into a new DataFrame.
If your preference is for R language syntax, that union can be done using the unionAll(x,y) command in SparkR, which will be processed across the cluster (unlike an R command on an R data.frame).
In Spark version 1.2.0 one could use subtract with two SchemaRDDs to end up with only the content from the first one that differs:
val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD)
onlyNewData contains the rows in todaySchemaRDD that do not exist in yesterdaySchemaRDD.
How can this be achieved with DataFrames in Spark version 1.3.0?
According to the Scala API docs, doing:
dataFrame1.except(dataFrame2)
will return a new DataFrame containing rows in dataFrame1 but not in dataframe2.
In PySpark it would be subtract
df1.subtract(df2)
or exceptAll if duplicates need to be preserved
df1.exceptAll(df2)
In later Spark versions (2.0+), you can use join with the 'left_anti' option:
df1.join(df2, on='key_column', how='left_anti')
These are PySpark APIs, but I guess there is a corresponding function in Scala too.
I tried subtract, but the result was not consistent.
If I run df1.subtract(df2), not all rows of df1 are shown in the result dataframe, probably due to the implicit distinct mentioned in the docs.
exceptAll solved my problem:
df1.exceptAll(df2)
For me, df1.subtract(df2) was inconsistent: it worked correctly on one dataframe but not on another. That was because of duplicates. df1.exceptAll(df2) returns a new dataframe with the records from df1 that do not exist in df2, including any duplicates.
From Spark 2.4.0 - exceptAll
data_cl = reg_data.exceptAll(data_fr)