How to convert print output to pyspark dataframe (no pandas allowed)

The usual code:
print((sparkdf.count(), len(sparkdf.columns)))
Since I am working on a system that runs fully on HDFS and pandas is not allowed, the output I need is:
| rows | columns |
| ---- | ------- |
| 1500 | 22      |

Just use spark.createDataFrame and pass the values as a list containing a single tuple:
spark.createDataFrame([(sparkdf.count(), len(sparkdf.columns))], schema=['rows', 'columns'])
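For completeness, a minimal end-to-end sketch; the 1500/22 figures in the commented output just mirror the numbers from the question, and spark/sparkdf are assumed to be your active SparkSession and dataframe:
# assumes `spark` is an active SparkSession and `sparkdf` is your dataframe
shape_df = spark.createDataFrame(
    [(sparkdf.count(), len(sparkdf.columns))],
    schema=['rows', 'columns']
)
shape_df.show()
# +----+-------+
# |rows|columns|
# +----+-------+
# |1500|     22|
# +----+-------+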


How to select a subset of pandas dataframe containing an even distribution of one column's values?

I have a huge dataset spanning different years. As a subsample for local tests, I need to extract a small dataframe that contains only a few samples distributed over the years. Does anyone have an idea how to do that?
After grouping by the 'year' column, the count of instances in each year is something like:
year     A
1838  1000
1839  2600
1840  8900
1841  9900
I want to select a subset which after groupby looks like:
| year| A |
| ----| --|
| 1838| 10|
| 1839| 10|
| 1840| 10|
| 1841| 10|
Try groupby().sample().
Here's an example with dummy data:
import numpy as np
import pandas as pd
# draw 200 random 'years' between 1800 and 1804 (high is exclusive in randint)
years = np.random.randint(low=1800, high=1805, size=200)
values = np.random.randint(low=1, high=200, size=200)
df = pd.DataFrame({"Years": years, "Values": values})
# take a fixed-size sample from each year
number_per_year = 10
sample_df = df.groupby("Years").sample(n=number_per_year, random_state=1)
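To confirm the sample is evenly distributed, a quick check on the result (sample_df comes from the snippet above):
# each year should now contribute exactly number_per_year rows
print(sample_df.groupby("Years").size())
# and the overall shape is typically (50, 2): 5 distinct years x 10 rows each
print(sample_df.shape)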

Spark scala create dataframe from text with columns split by delimiter | [duplicate]

This question already has answers here:
Java String split not returning the right values
(4 answers)
I am trying to create a Spark dataframe from a text file in which the data is delimited by the | symbol. I have to use Spark with Scala.
The text file has data as below:
John|1234|$2500|giggle
Ross|1344|$5500|Micsoft
Jennifer|5432|$2100|healthcare
import org.apache.spark.sql.types.{StructField, StructType, StringType}
val schemaString = "name,employeeid,salary,company"
val fields = schemaString.split(",").map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
val rddView = sc.textFile("/dev/path/*").map(_.split("|")).map(x => org.apache.spark.sql.Row(x: _*))
val rddViewDf = sqlContext.createDataFrame(rddView, schema)
rddViewDf.show()
I expect the values to be mapped to the corresponding columns, but the output is not as expected. Can someone provide the correct solution in Spark using the Scala language?
Output I am getting:
+----+----------+------+-------+
|name|employeeid|salary|company|
+----+----------+------+-------+
| J| o| h| n|
| R| o| s| s|
| J| e| n| n|
+----+----------+------+-------+
Expected Output
+--------+----------+------+----------+
|    name|employeeid|salary|   company|
+--------+----------+------+----------+
|    John|      1234| $2500|    giggle|
|    Ross|      1344| $5500|   Micsoft|
|Jennifer|      5432| $2100|healthcare|
+--------+----------+------+----------+
As pointed out in the comments, your split delimiter is incorrect: String.split takes a regular expression, so the pipe has to be escaped as "\\|".
However, you should not be using RDDs for this anyway:
scala> spark.read.option("delimiter", "|").csv("data.txt").show()
+--------+----+-----+----------+
| _c0| _c1| _c2| _c3|
+--------+----+-----+----------+
| John|1234|$2500| giggle|
| Ross|1344|$5500| Micsoft|
|Jennifer|5432|$2100|healthcare|
+--------+----+-----+----------+
https://spark.apache.org/docs/latest/sql-data-sources-csv.html
To rename the columns, see How to read csv without header and name them with names while reading in pyspark? and translate it to Scala.
Note: ideally, your employeeId column would be defined as a LongType, not a StringType.
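To combine both points, you could also pass an explicit schema while reading. A minimal sketch in PySpark (the Scala DataFrameReader exposes the same option/schema/csv methods); the column names and types are assumptions based on the sample rows above:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("employeeid", LongType(), True),
    StructField("salary", StringType(), True),   # kept as a string because of the "$" prefix
    StructField("company", StringType(), True),
])

df = spark.read.option("delimiter", "|").schema(schema).csv("data.txt")
df.printSchema()
df.show()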

Scala convert Array to DataFrame Column

I am trying to add an Array of values as a new column to the DataFrame.
Ex:
Let's assume there is an Array(4, 5, 10) and a dataframe:
+----------+-----+
| name | age |
+----------+-----+
| John | 32 |
| Elizabeth| 28 |
| Eric | 41 |
+----------+-----+
My requirement is to add the above array as a new column to the dataframe. My expected output is as follows:
+----------+-----+------+
| name | age | rank |
+----------+-----+------+
| John | 32 | 4 |
| Elizabeth| 28 | 5 |
| Eric | 41 | 10 |
+----------+-----+------+
I am trying to see if I can achieve this using rdd and zipWithIndex.
df.rdd.zipWithIndex.map(_.swap).join(array_rdd.zipWithIndex.map(_.swap))
This results in something of this sort:
(0,([John, 32],4))
I want to convert the above RDD back to the required dataframe. Let me know how to achieve this.
Are there any alternatives for achieving the desired result other than using rdd and zipWithIndex? What is the best way to do it?
PS:
Context for better understanding:
I am using the Xpress optimization suite to solve a mathematical problem. Xpress takes its inputs as Arrays and also outputs the result as an Array. I get the input as a DataFrame and extract columns as Arrays (using collect) before passing them to Xpress. Xpress outputs an Array[Double] as the solution. I want to add this solution back to the dataframe as a column, where every value in the solution array corresponds to the row of the dataframe at its index, i.e. the value at index 'n' of the output Array corresponds to the 'n'th row of the dataframe.
After the join, just map the results to what you are looking for.
You can convert this back to a dataframe after joining the RDDs:
import spark.implicits._   // needed for toDF outside the spark-shell
val originalDF = Seq(("John", 32), ("Elizabeth", 28), ("Eric", 41)).toDF("name", "age")
val rank = Array(4, 5, 10)
// convert to Seq first
val rankDF = rank.toSeq.toDF("rank")
val joined = originalDF.rdd.zipWithIndex.map(_.swap).join(rankDF.rdd.zipWithIndex.map(_.swap))
// (index, (originalRow, rankRow)) -> (index, name, age, rank)
val finalRDD = joined.map { case (k, v) => (k, v._1.getString(0), v._1.getInt(1), v._2.getInt(0)) }
val finalDF = finalRDD.toDF("id", "name", "age", "rank")
finalDF.show()
/*
+---+---------+---+----+
| id| name|age|rank|
+---+---------+---+----+
| 0| John| 32| 4|
| 1|Elizabeth| 28| 5|
| 2| Eric| 41| 10|
+---+---------+---+----+
*/
The only alternative way that I can think of is to use the org.apache.spark.sql.functions.row_number() window function. This essentially achieves the same thing by adding an increasing, consecutive row number to the dataframe.
The drawback is the large amount of data shuffled into one partition, since we need unique, consecutive row numbers for all rows in the dataframe. If your data is very large this can lead to an out-of-memory issue. (Note: this may not be applicable in your case, since you mentioned you are doing a collect on the data and have not mentioned any memory issues with that.)
The approach of converting to an rdd and using zipWithIndex is an acceptable solution, but in general converting from a dataframe to an rdd is not recommended due to the performance difference between RDDs and dataframes.
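For reference, a rough sketch of that row_number() alternative, written here in PySpark (the Scala functions have the same names); the with_row_number helper and the _order column are illustrative, not part of any Spark API:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def with_row_number(df, col_name="rn"):
    # a window with no partitionBy pulls every row into one partition,
    # which is exactly the shuffle/memory drawback noted above
    tmp = df.withColumn("_order", F.monotonically_increasing_id())
    w = Window.orderBy("_order")
    return tmp.withColumn(col_name, F.row_number().over(w)).drop("_order")

originalDF = spark.createDataFrame(
    [("John", 32), ("Elizabeth", 28), ("Eric", 41)], ["name", "age"])
ranks = spark.createDataFrame([(4,), (5,), (10,)], ["rank"])

with_row_number(originalDF).join(with_row_number(ranks), "rn").drop("rn").show()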

Pyspark dataframe - Illegal values appearing in the column?

So I have a table (sample)
I'm using the pyspark dataframe APIs to filter out the 'NOC's that have never won a gold medal, and here's the code I wrote.
First part of my code:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It will generate the output
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This display the output
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are more than 3 characters? Those are supposed to be values of the 'Team' column, but I'm not sure why they are showing up in the 'NOC' column. It's hard to figure out why this is happening, i.e. why there are illegal values in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
the same happens: illegal values appear in the final dataframe column.
Any help would be appreciated. Thanks.
You should specify a delimiter for your CSV file. By default Spark uses a comma (,) as the separator.
This can be done, for example, with:
.option("delimiter", ";")

How to use LinearRegression across groups in DataFrame?

Let us say my spark DataFrame (DF) looks like
id | age | earnings| health
----------------------------
1 | 34 | 65 | 8
2 | 65 | 12 | 4
2 | 20 | 7 | 10
1 | 40 | 75 | 7
. | .. | .. | ..
and I would like to group the DF, apply a function (say a linear regression that depends on multiple columns, two columns in this case, of the aggregated DF) to each aggregated DF, and get output like
id | intercept| slope
----------------------
1 | ? | ?
2 | ? | ?
from sklearn.linear_model import LinearRegression

lr_object = LinearRegression()

def linear_regression(ith_DF):
    # Note: for me it is necessary that ith_DF should contain all
    # data within this function scope, so that I can apply any
    # function that needs all data in ith_DF
    X = [i.earnings for i in ith_DF.select("earnings").rdd.collect()]
    y = [i.health for i in ith_DF.select("health").rdd.collect()]
    lr_object.fit(X, y)
    return lr_object.intercept_, lr_object.coef_[0]

coefficient_collector = []

# following iteration is not possible in spark as 'GroupedData'
# object is not iterable, please consider it as pseudo code
for ith_df in df.groupby("id"):
    c, m = linear_regression(ith_df)
    coefficient_collector.append((float(c), float(m)))

model_df = spark.createDataFrame(coefficient_collector, ["intercept", "slope"])
model_df.show()
I think this can be done since Spark 2.3 using pandas_udf. In fact, there is an example of fitting grouped regressions in the announcement of pandas UDFs here:
Introducing Pandas UDF for Python
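For example, a minimal sketch of a grouped-map pandas UDF in the Spark 2.3 style (newer versions would use applyInPandas); it assumes df is the DataFrame from the question and that earnings and health are numeric:
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from sklearn.linear_model import LinearRegression

result_schema = "id long, intercept double, slope double"

@pandas_udf(result_schema, PandasUDFType.GROUPED_MAP)
def fit_group(pdf):
    # pdf is a pandas DataFrame holding all rows of one id
    lr = LinearRegression().fit(pdf[["earnings"]], pdf["health"])
    return pd.DataFrame([[pdf["id"].iloc[0], lr.intercept_, lr.coef_[0]]],
                        columns=["id", "intercept", "slope"])

model_df = df.groupby("id").apply(fit_group)
model_df.show()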
What I'd do is filter the main DataFrame into smaller DataFrames and do the processing, say a linear regression, on each.
You can then execute the linear regressions in parallel (on separate threads using the same SparkSession, which is thread-safe), with the main DataFrame cached.
That should give you the full power of Spark.
p.s. My limited understanding of that part of Spark makes me think that a very similar approach is used for grid search-based model selection in Spark MLlib and also in TensorFrames, which is an "Experimental TensorFlow binding for Scala and Apache Spark".
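A rough sketch of that filter-and-parallelize idea; the fit_one_id helper and the thread count are illustrative assumptions, and df is the DataFrame from the question with numeric earnings and health:
from concurrent.futures import ThreadPoolExecutor
from sklearn.linear_model import LinearRegression

df.cache()  # each group filter re-scans df, so caching pays off

def fit_one_id(group_id):
    # collect one group's columns to the driver and fit locally
    rows = df.filter(df.id == group_id).select("earnings", "health").collect()
    X = [[r.earnings] for r in rows]   # sklearn expects a 2-D feature matrix
    y = [r.health for r in rows]
    lr = LinearRegression().fit(X, y)
    return group_id, float(lr.intercept_), float(lr.coef_[0])

ids = [r.id for r in df.select("id").distinct().collect()]
with ThreadPoolExecutor(max_workers=4) as pool:   # thread count is arbitrary
    results = list(pool.map(fit_one_id, ids))

model_df = spark.createDataFrame(results, ["id", "intercept", "slope"])
model_df.show()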