Converting CoordinateMatrix to PySpark DataFrame

How can I convert a CoordinateMatrix to a PySpark DataFrame?
I have tried converting my matrix to a RowMatrix and then to a DataFrame using this:
df.toRowMatrix().rows.map(lambda x: (x, )).toDF()
but the result looks really weird:
+--------------------+
|                  _1|
+--------------------+
|(100,[20,21,22,23...|
|(100,[40,41,42,43...|
|(100,[35,36,37,38...|
|(100,[5,6,7,8,9,1...|
+--------------------+
Would appreciate any help, thanks!
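The truncated values in that output are SparseVectors produced by toRowMatrix(), so the conversion did run; it just doesn't display as a tidy table. A minimal sketch of an alternative, assuming one DataFrame row per non-zero matrix entry is what's wanted: a CoordinateMatrix exposes an entries RDD of MatrixEntry(i, j, value) that can be mapped to tuples and converted with toDF (the matrix below is a made-up stand-in).
from pyspark.sql import SparkSession
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# hypothetical matrix standing in for the one in the question
mat = CoordinateMatrix(sc.parallelize(
    [MatrixEntry(0, 1, 1.5), MatrixEntry(2, 0, 3.0)]))

# one DataFrame row per (row index, column index, value) entry
entries_df = mat.entries.map(lambda e: (e.i, e.j, e.value)) \
                        .toDF(["row", "col", "value"])
entries_df.show()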

Related

PySpark: Transform values of given column in the DataFrame

I am new to PySpark and Spark in general.
I would like to apply a transformation to a given column in the DataFrame, essentially calling a function for each value in that specific column.
I have my DataFrame df that looks like this:
df.show()
+-------+--------------------+
|version|                body|
+-------+--------------------+
|      1|9gIAAAASAQAEAAAAA...|
|      2|2gIAAAASAQAEAAAAA...|
|      3|3gIAAAASAQAEAAAAA...|
|      1|7gIAKAASAQAEAAAAA...|
+-------+--------------------+
I need to read the value of the body column for each row where the version is 1, then decrypt it (I have my own logic/function which takes a string and returns a decrypted string), and finally write the decrypted values in CSV format to an S3 bucket.
def decrypt(encrypted_string: str):
    # code that returns the decrypted string
So, when I do the following, I get the filtered values to which I need to apply my decrypt function:
df.where(col('version') =='1')\
.select(col('body')).show()
+--------------------+
| body|
+--------------------+
|9gIAAAASAQAEAAAAA...|
|7gIAKAASAQAEAAAAA...|
+--------------------+
However, I am not clear on how to do that. I tried to use collect(), but that defeats the purpose of using Spark.
I also tried using .rdd.map as follows, but that did not work:
df.where(col('version') =='1')\
.select(col('body'))\
.rdd.map(lambda x: decrypt).toDF().show()
OR
.rdd.map(decrypt).toDF().show()
Could someone please help with this?
Please try:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

decrypt_udf = udf(decrypt, StringType())
df.where(col('version') == '1').withColumn('body', decrypt_udf('body'))
Got some clue from this post: Pyspark DataFrame UDF on Text Column.
Looks like I can simply get it with the following. I was doing it without a udf earlier, which is why it wasn't working.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

dummy_function_udf = udf(decrypt, StringType())
df.where(col('version') == '1') \
    .select(col('body')) \
    .withColumn('decryptedBody', dummy_function_udf('body')) \
    .show()
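The question also asks to write the decrypted values as CSV to S3. A minimal sketch of that last step, assuming the cluster has s3a access configured; the bucket path below is hypothetical:
result = df.where(col('version') == '1') \
    .withColumn('decryptedBody', decrypt_udf('body')) \
    .select('decryptedBody')

# hypothetical output location; adjust bucket and prefix to your environment
result.write.mode('overwrite').csv('s3a://my-bucket/decrypted/', header=True)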

Python: Convert entire column to dictionary

I am just getting started with pandas.
I have a DataFrame that looks like this:
import pandas as pd
locations = pd.read_csv('locations.csv')
     lat    lon
0  30.29 -87.44
1  30.21 -87.44
2  31.25 -87.41
I want to convert it to something like this:
{'lat': [37.974508, 38.050247, 37.985352],
'lon': [-87.582584, -87.540012, -87.50776]}
Check to_dict:
df.to_dict('list')
Out[951]: {'lon': [-87.44, -87.44, -87.41], 'lat': [30.29, 30.21, 31.25]}
Keys are column names, values are lists of column data
locations.to_dict('list')
Try this:
lat_lon = {'lat': list(locations['lat']), 'lon': list(locations['lon'])}
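For reference, a self-contained sketch of the to_dict('list') route, rebuilt from the sample values in the question:
import pandas as pd

locations = pd.DataFrame({'lat': [30.29, 30.21, 31.25],
                          'lon': [-87.44, -87.44, -87.41]})

# keys are the column names, values are plain Python lists
print(locations.to_dict('list'))
# {'lat': [30.29, 30.21, 31.25], 'lon': [-87.44, -87.44, -87.41]}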

PySpark DataFrame: append a word to each value of a column

I would like to append a word to each value of a column in a PySpark DataFrame (for example, from a list of words). I thought about just converting it to a pandas DataFrame because that is easier, but I need to do it in PySpark. Any ideas? Thank you :)
You can do it easily with the concat function:
from pyspark.sql import functions as F

for col in df.columns:
    df = df.withColumn(col, F.concat(F.col(col), F.lit("new_word")))
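For example, on a made-up two-column DataFrame the loop behaves like this (note the reassignment: withColumn returns a new DataFrame rather than mutating df in place):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", "x"), ("b", "y")], ["c1", "c2"])

# append the literal to every column's values
for col in df.columns:
    df = df.withColumn(col, F.concat(F.col(col), F.lit("new_word")))

df.show()
# +---------+---------+
# |       c1|       c2|
# +---------+---------+
# |anew_word|xnew_word|
# |bnew_word|ynew_word|
# +---------+---------+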

How to convert datatype of all the columns of a pandas dataframe to string

I have tried multiple ways to achieve this, for example:
inputpd = pd.DataFrame(inputpd.columns, dtype=str)
But it does not work. Sorry for asking this question, as I am a beginner to Spark.
If it's a Pandas DataFrame:
df = df.astype(str)
The easiest way, I think, is:
df = df.applymap(str)
df is your dataframe.
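A quick sketch of the astype approach on a made-up frame, checking the resulting dtypes:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3.5, 4.5]})
df = df.astype(str)

print(df.dtypes)             # both columns now report object
print(type(df.loc[0, 'a']))  # <class 'str'>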

Convert Pandas DataFrame into Series with MultiIndex

Let us consider a pandas DataFrame (df) with a Count column.
How do I convert it to a pandas Series?
Just select the single column of your frame:
df['Count']
result = pd.Series(df['Count'])
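The original DataFrame is not reproduced here, but as a sketch with made-up data: if the frame has label columns alongside Count, setting those columns as the index before selecting Count yields a Series with a MultiIndex.
import pandas as pd

# hypothetical data standing in for the DataFrame from the question
df = pd.DataFrame({'year': [2019, 2019, 2020],
                   'month': [1, 2, 1],
                   'Count': [10, 20, 30]})

# set the label columns as the index, then select the single column;
# the result is a Series whose index is a MultiIndex (year, month)
series = df.set_index(['year', 'month'])['Count']
print(series)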