Extracting a NumPy array from a PySpark DataFrame

I have a dataframe gi_man_df where group can take n values:
+------------------+-----------------+--------+--------------+
|             group|           number|rand_int|   rand_double|
+------------------+-----------------+--------+--------------+
|          'GI_MAN'|                7|       3|         124.2|
|          'GI_MAN'|                7|      10|        121.15|
|          'GI_MAN'|                7|      11|         129.0|
|          'GI_MAN'|                7|      12|         125.0|
|          'GI_MAN'|                7|      13|         125.0|
|          'GI_MAN'|                7|      21|         127.0|
|          'GI_MAN'|                7|      22|         126.0|
+------------------+-----------------+--------+--------------+
and I am expecting a NumPy ndarray, i.e., gi_man_array:
[[[124.2],[121.15],[129.0],[125.0],[125.0],[127.0],[126.0]]]
containing the rand_double values after applying a pivot.
I tried the following 2 approaches:
FIRST: I pivot the gi_man_df as follows:
gi_man_pivot = gi_man_df.groupBy("number").pivot('rand_int').sum("rand_double")
and the output I got is:
Row(number=7, group=u'GI_MAN', 3=124.2, 10=121.15, 11=129.0, 12=125.0, 13=125.0, 21=127.0, 23=126.0)
but the problem here is that to get the desired output, I can't convert it to a matrix and then convert it again to a NumPy array.
SECOND:
I created the vector in the dataframe itself using:
assembler = VectorAssembler(inputCols=["rand_double"], outputCol="rand_dbl_Vect")
gi_man_vector = assembler.transform(gi_man_df)
gi_man_vector.show(7)
and I got the following output:
+----------------+-----------------+--------+--------------+--------------+
|           group|           number|rand_int|   rand_double| rand_dbl_Vect|
+----------------+-----------------+--------+--------------+--------------+
|          GI_MAN|                7|       3|         124.2|       [124.2]|
|          GI_MAN|                7|      10|        121.15|      [121.15]|
|          GI_MAN|                7|      11|         129.0|       [129.0]|
|          GI_MAN|                7|      12|         125.0|       [125.0]|
|          GI_MAN|                7|      13|         125.0|       [125.0]|
|          GI_MAN|                7|      21|         127.0|       [127.0]|
|          GI_MAN|                7|      22|         126.0|       [126.0]|
+----------------+-----------------+--------+--------------+--------------+
but the problem here is that I can't pivot it on rand_dbl_Vect.
So my question is:
1. Is either of the 2 approaches the correct way to achieve the desired output? If so, how can I proceed further to get the desired result?
2. What other way can I proceed with so that the code is optimal and the performance is good?

This
import numpy as np
np.array(gi_man_df.select('rand_double').collect())
produces
array([[ 124.2 ],
       [ 121.15],
       .........])
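If you also need the extra outer dimension from the expected gi_man_array, a reshape on the collected result should do it (a minimal sketch; the (1, n, 1) target shape is inferred from the sample output in the question):
import numpy as np

# Collect the single column and reshape to (1, n, 1) to match
# [[[124.2], [121.15], ...]]; the target shape is an assumption.
gi_man_array = np.array(gi_man_df.select('rand_double').collect()).reshape(1, -1, 1)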

To convert the Spark DataFrame to a NumPy array, first convert it to pandas and then apply the to_numpy() function.
spark_df.select(<list of columns needed>).toPandas().to_numpy()
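Applied to the question's dataframe, that would look roughly like this (a sketch assuming pandas is installed on the driver):
# Select only the needed column before converting.
gi_man_array = gi_man_df.select("rand_double").toPandas().to_numpy()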

Related

PySpark not able to handle multiline string in CSV file while selecting columns

I am trying to load a CSV file which looks like the following, using PySpark code.
A^B^C^D^E^F
"Yash"^"12"^""^"this is first record"^"nice"^"12"
"jay"^"13"^""^"
In second record, I am new line at the beingnning"^"nice"^"12"
"Nova"^"14"^""^"this is third record"^"nice"^"12"
When I read this file and select a few columns, the entire dataframe gets messed up.
import pyspark.sql.functions as F

df = (
    spark.read
    .option("delimiter", "^")
    .option("header", True)
    .option("multiline", "true")
    .option("multiLine", True)
    .option("escape", "\"")
    .csv(
        "test3.csv",
        header=True,
    )
)
df.show()
df = df.withColumn("isdeleted", F.lit(True))
select_cols = ['isdeleted','B','D','E','F']
df = df.select(*select_cols)
df.show()
(truncated some import statements for readability of code)
This is what I see when the above code runs
Before column selection (entire DF)
+----+---+----+--------------------+----+---+
| A| B| C| D| E| F|
+----+---+----+--------------------+----+---+
|Yash| 12|null|this is first record|nice| 12|
| jay| 13|null|\nIn second recor...|nice| 12|
|Nova| 14|null|this is third record|nice| 12|
+----+---+----+--------------------+----+---+
After df.select(*select_cols)
+---------+----+--------------------+----+----+
|isdeleted| B| D| E| F|
+---------+----+--------------------+----+----+
| true| 12|this is first record|nice| 12|
| true| 13| null|null|null|
| true|nice| null|null|null|
| true| 14|this is third record|nice| 12|
+---------+----+--------------------+----+----+
Here, the second row with the newline character is being broken into 2 rows; the output file is also messed up, just like the dataframe preview I showed above.
I am using the AWS Glue image amazon/aws-glue-libs:glue_libs_4.0.0_image_01, which uses Spark 3.3.0. I also tried with Spark 3.1.1 and see the same issue in both versions.
I am not sure whether this is a bug in the Spark package or I am missing something here. Any help will be appreciated.
You are giving the wrong escape character. It is usually \ and you are setting it to the quote character. Once you change the option,
df = spark.read.csv('test.csv', sep='^', header=True, multiLine=True)
df.show()
df.select('B').show()
+----+---+----+--------------------+----+---+
| A| B| C| D| E| F|
+----+---+----+--------------------+----+---+
|Yash| 12|null|this is first record|nice| 12|
| jay| 13|null|\nIn second recor...|nice| 12|
|Nova| 14|null|this is third record|nice| 12|
+----+---+----+--------------------+----+---+
+---+
| B|
+---+
| 12|
| 13|
| 14|
+---+
You will get the desired result.
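If you prefer to keep the option-chained style from the question, the same fix is simply to drop the escape option and keep the multiLine and delimiter settings (a sketch assuming the same test3.csv file):
df = (
    spark.read
    .option("delimiter", "^")
    .option("header", True)
    .option("multiLine", True)
    .csv("test3.csv")
)
df.show()
df.select("B").show()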

Stratified Sampling with PySpark on Multiple Columns

I have a dataframe with millions of records, like this:
CLI_ID OCCUPA_ID DIG_LABEL
125 2705 1
328 2708 7
400 2712 1
401 2705 2
525 2708 1
I want to take a random sample of 100k rows that contains 70% of 2705, 20% of 2708, 10% of 2712 from OCCUPA_ID and 50% of 1, 20% of 2 and 30% of 7 from DIG_LABEL.
How can I get this in Spark, using pyspark?
Use sampleBy instead of the sample function in PySpark, because sample only samples rows without reference to any column, so we will use sampleBy here.
sampleBy takes a column, the fractions, and an optional seed:
df_sample = df.sampleBy(col, fractions, seed)
where
col is the column you want to stratify the sampling on,
fractions defines the sampling ratio per value, e.g. 10% is written as 0.1, and so on,
seed fixes which rows are drawn, because without it you will get different data every time.
So the answer your question requires is:
dfsample = df.sampleBy("OCCUPA_ID", {"2705": 0.7, "2708": 0.2, "2712": 0.1}, 42).sampleBy("DIG_LABEL", {"1": 0.5, "2": 0.2, "7": 0.3}, 42)
Just sample twice: first on OCCUPA_ID and then on DIG_LABEL.
42 is the seed both times.
You can use the sampleBy method on PySpark DataFrames to perform stratified sampling, passing the column name and a dictionary of fractions for the values within that column. For example:
spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).show()
+------+---------+---------+
|CLI_ID|OCCUPA_ID|DIG_LABEL|
+------+---------+---------+
| 1| 2705| 7|
| 4| 2705| 1|
| 5| 2705| 7|
| 7| 2708| 2|
| 12| 2705| 1|
| 16| 2708| 2|
| 18| 2708| 2|
| 20| 2705| 7|
| 25| 2705| 2|
| 26| 2705| 2|
| 38| 2705| 7|
| 40| 2705| 1|
| 44| 2705| 2|
| 48| 2708| 7|
| 50| 2708| 2|
| 53| 2705| 1|
| 57| 2705| 1|
| 58| 2712| 1|
| 61| 2705| 2|
| 63| 2708| 7|
+------+---------+---------+
only showing top 20 rows
Since you want one PySpark DataFrame with two samplings performed on two different columns, you can chain the sampleBy methods together:
spark_stratified_sample_df = spark_df.sampleBy("OCCUPA_ID", fractions={"2705": 0.7, "2708": 0.2, "2712": 0.1}, seed=42).sampleBy("DIG_LABEL", fractions={"1": 0.5, "2": 0.2, "7": 0.3}, seed=42)
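Note that sampleBy draws each row independently, so the resulting percentages will only be approximately the requested fractions. A quick way to sanity-check the composition of the sample (a sketch, assuming spark_stratified_sample_df from above):
import pyspark.sql.functions as F

total = spark_stratified_sample_df.count()
# Share of each OCCUPA_ID and DIG_LABEL value in the sample.
spark_stratified_sample_df.groupBy("OCCUPA_ID").count().withColumn("pct", F.round(F.col("count") / total * 100, 1)).show()
spark_stratified_sample_df.groupBy("DIG_LABEL").count().withColumn("pct", F.round(F.col("count") / total * 100, 1)).show()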

PySpark: How to concatenate two distinct dataframes?

I have multiple dataframes that I need to concatenate together, row-wise. In pandas, we would typically write: pd.concat([df1, df2]).
This thread: How to concatenate/append multiple Spark dataframes column wise in Pyspark? appears close, but its respective answer:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

df1_schema = StructType([StructField("id", IntegerType()), StructField("name", StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"), (2, "jill"), (3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)

df2_schema = StructType([StructField("secNo", IntegerType()), StructField("city", StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"), (102, "CA"), (103, "DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)

schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0] + x[1])
spark.createDataFrame(df1df2, schema).show()
Yields the following error when done on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join 2 or more data frames that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+
You can create unique IDs with
from pyspark.sql.functions import expr

df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
then you can left join them on that ID:
df1.join(df2, ["unique_id"], "left").drop("unique_id")
Final output looks like
+---+-----+-----+----+
| id| name|secNo|city|
+---+-----+-----+----+
|  1|sammy|  101|  LA|
|  2| jill|  102|  CA|
|  3| john|  103|  DC|
+---+-----+-----+----+
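For completeness, here is a hedged PySpark sketch of the same idea that avoids the raw SQL expression by ordering a window on monotonically_increasing_id(); it assumes both frames have the same number of rows, and row order is only preserved to the extent the generated IDs reflect the original partition order:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows of each frame, then join on that number.
w = Window.orderBy(F.monotonically_increasing_id())
df1_indexed = df1.withColumn("unique_id", F.row_number().over(w))
df2_indexed = df2.withColumn("unique_id", F.row_number().over(w))

df1df2 = df1_indexed.join(df2_indexed, "unique_id", "inner").drop("unique_id")
df1df2.show()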

Spark - compare 2 dataframes without using hardcoded column names

I need to create a way to compare 2 data frames without using any hardcoded column names, so that I can upload 2 files at any time and compare them without changing anything.
df1
+--------------------+----+----+-------+
|                  ID|colA|colB|   colC|
+--------------------+----+----+-------+
|(122C8984ABF9F6EF...|   0|  10|  APPLE|
|(122C8984ABF9F6EF...|   0|  20|  APPLE|
|(122C8984ABF9F6EF...|   0|  10| GOOGLE|
|(122C8984ABF9F6EF...|   0|  10|  APPLE|
|(122C8984ABF9F6EF...|   0|  15|SAMSUNG|
|(122C8984ABF9F6EF...|   0|  10|  APPLE|
+--------------------+----+----+-------+
df2
+--------------------+----+----+-------+
|                  ID|colA|colB|   colC|
+--------------------+----+----+-------+
|(122C8984ABF9F6EF...|   0|  10|  APPLE|
|(122C8984ABF9F6EF...|   0|  20|  APPLE|
|(122C8984ABF9F6EF...|   0|  10|  APPLE|
|(122C8984ABF9F6EF...|   0|  30|  APPLE|
|(122C8984ABF9F6EF...|   0|  15|SAMSUNG|
|(122C8984ABF9F6EF...|   0|  15| GOOGLE|
+--------------------+----+----+-------+
I need to compare these 2 data frames and count the differences from each column.
My output should look like this:
+--------------+-------------+-----------------+------------+------+
|Attribute Name|Total Records|Number Miss Match|% Miss Match|Status|
+--------------+-------------+-----------------+------------+------+
|          colA|            6|                0|       0.0 %|  Pass|
|          colB|            6|                3|        50 %|  Fail|
|          colC|            6|                2|      33.3 %|  Fail|
+--------------+-------------+-----------------+------------+------+
I know how to compare the columns when using hardcoded column names, but my requirement is to compare them dynamically.
What I did so far was to select a column from each data frame, but this doesn't seem like the right way to do it.
val columnsAll = df1.columns.map(m=>col(m))
val df1_col1 = df1.select(df1.columns.slice(1,2).map(m=>col(m)):_*).as("Col1")
val df2_col1 = df2.select(df2.columns.slice(1,2).map(m=>col(m)):_*).as("Col2")
Two ways here:
The spark-fast-tests library has two methods for making DataFrame comparisons:
The assertSmallDataFrameEquality method collects DataFrames on the driver node and makes the comparison
def assertSmallDataFrameEquality(actualDF: DataFrame, expectedDF: DataFrame): Unit = {
  if (!actualDF.schema.equals(expectedDF.schema)) {
    throw new DataFrameSchemaMismatch(schemaMismatchMessage(actualDF, expectedDF))
  }
  if (!actualDF.collect().sameElements(expectedDF.collect())) {
    throw new DataFrameContentMismatch(contentMismatchMessage(actualDF, expectedDF))
  }
}
Using df2.except(df1)
For details of these 2 ways, you can refer to DataFrame equality in Apache Spark and Compare two Spark dataframes.
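If the goal is specifically the per-column mismatch summary shown in the question, you can loop over df1.columns so nothing is hardcoded. A hedged PySpark sketch (the question's snippets are Scala; this assumes the two files have the same row count and order, so rows are matched by position, and that the key column is named ID):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Match rows by position, then count mismatches per column.
w = Window.orderBy(F.monotonically_increasing_id())
a = df1.withColumn("_rn", F.row_number().over(w)).alias("a")
b = df2.withColumn("_rn", F.row_number().over(w)).alias("b")
joined = a.join(b, "_rn")

total = joined.count()
report = []
for c in df1.columns:
    if c == "ID":  # skip the key column
        continue
    miss = joined.filter(~F.col("a." + c).eqNullSafe(F.col("b." + c))).count()
    report.append((c, total, miss, f"{100.0 * miss / total:.1f} %", "Pass" if miss == 0 else "Fail"))

spark.createDataFrame(
    report,
    ["Attribute Name", "Total Records", "Number Miss Match", "% Miss Match", "Status"],
).show()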

Filtering rows in pyspark dataframe and creating a new column that contains the result

So I am trying to identify crimes that happened within the SF downtown boundary on Sundays. My idea was to first write a UDF to label whether each crime is in the area I identify as the downtown area: if it happened within the area, it gets a label of "1", and "0" if not. After that, I am trying to create a new column to store those results. I tried my best to write everything I can, but it just doesn't work for some reason. Here is the code I wrote:
from pyspark.sql.types import BooleanType, StructType, StructField
from pyspark.sql.functions import udf

def filter_dt(x, y):
    if (((x < -122.4213) & (x > -122.4313)) & ((y > 37.7540) & (y < 37.7740))):
        return '1'
    else:
        return '0'

schema = StructType([StructField("isDT", BooleanType(), False)])
filter_dt_boolean = udf(lambda row: filter_dt(row[0], row[1]), schema)

# First, pick out the crime cases that happen on Sunday
q3_sunday = spark.sql("SELECT * FROM sf_crime WHERE DayOfWeek='Sunday'")
# Then, we add a new column for us to filter out (identify) if the crime is in DT
q3_final = q3_result.withColumn("isDT", filter_dt(q3_sunday.select('X'), q3_sunday.select('Y')))
The error I am getting is: [picture of the error message]
My guess is that the udf I have right now doesn't support whole columns as input to be compared, but I don't know how to fix it to make it work. Please help! Thank you!
Sample data would have helped. For now, I assume that your data looks like this:
+----+---+---+
|val1| x| y|
+----+---+---+
| 10| 7| 14|
| 5| 1| 4|
| 9| 8| 10|
| 2| 6| 90|
| 7| 2| 30|
| 3| 5| 11|
+----+---+---+
Then you don't need a udf, as you can do the evaluation using the when() function:
import pyspark.sql.functions as F
tst = sqlContext.createDataFrame([(10, 7, 14), (5, 1, 4), (9, 8, 10), (2, 6, 90), (7, 2, 30), (3, 5, 11)], schema=['val1', 'x', 'y'])
tst_res = tst.withColumn("isdt", F.when((tst.x.between(4, 10)) & (tst.y.between(11, 20)), 1).otherwise(0))
This will give the result:
tst_res.show()
+----+---+---+----+
|val1| x| y|isdt|
+----+---+---+----+
| 10| 7| 14| 1|
| 5| 1| 4| 0|
| 9| 8| 10| 0|
| 2| 6| 90| 0|
| 7| 2| 30| 0|
| 3| 5| 11| 1|
+----+---+---+----+
If I have got the data wrong and you still need to pass multiple values to the udf, you have to pass them as an array or a struct. I prefer a struct:
from pyspark.sql.functions import udf
from pyspark.sql.types import *

@udf(IntegerType())
def check_data(row):
    if ((row.x in range(4, 5)) & (row.y in range(1, 20))):
        return 1
    else:
        return 0

tst_res1 = tst.withColumn("isdt", check_data(F.struct('x', 'y')))
The result will be the same. But it is always better to avoid UDFs and use Spark's built-in functions, since the Spark Catalyst optimizer cannot see the logic inside a udf and cannot optimize it.
Try changing the last line as below:
from pyspark.sql.functions import col
q3_final = q3_result.withColumn("isDT", filter_dt(col('X'),col('Y')))
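Putting it together for the original question, the udf can be dropped entirely in favor of when(); a hedged sketch using the boundary values from the question (assuming q3_sunday has numeric X and Y columns):
from pyspark.sql import functions as F

# Flag crimes that fall inside the downtown bounding box from the question.
q3_final = q3_sunday.withColumn(
    "isDT",
    F.when(
        (F.col("X") > -122.4313) & (F.col("X") < -122.4213) &
        (F.col("Y") > 37.7540) & (F.col("Y") < 37.7740),
        1,
    ).otherwise(0),
)
q3_final.show()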