Why are all the column types string when loading a CSV into a PySpark dataframe?

I have a CSV file that contains only numbers (no strings).
It has int and float columns. But when I read it in PySpark like this:
df = spark.read.csv("s3://s3-cdp-prod-hive/novaya/instacart/data.csv",header=False)
all the columns of the dataframe come back as string type.
How can I read it so that int and float columns are detected automatically?
Some columns contain NaN values; in the file they are represented as nan:
0.18277,-0.188931,0.0893389,0.119931,0.318853,-0.132933,-0.0288816,0.136137,0.12939,-0.245342,0.0608182,0.0802028,-0.00625962,0.271222,0.187855,0.132606,-0.0451533,0.140501,0.0704631,0.0229986,-0.0533376,-0.319643,-0.029321,-0.160937,0.608359,0.0513554,-0.246744,0.0817331,-0.410682,0.210652,0.375154,0.021617,0.119288,0.0674939,0.190642,0.161885,0.0385196,-0.341168,0.138659,-0.236908,0.230963,0.23714,-0.277465,0.242136,0.0165013,0.0462388,0.259744,-0.397228,-0.0143719,0.0891644,0.222225,0.0987765,0.24049,0.357596,-0.106266,-0.216665,0.191123,-0.0164234,0.370766,0.279462,0.46796,-0.0835098,0.112693,0.231951,-0.0942302,-0.178815,0.259096,-0.129323,1165491,175882,16.5708805975,6,0,2.80890261184,4.42114773551,0,23,0,13.4645462866,18.0359037455,11,30.0,0.0,11.4435397208,84.7504967125,30.0,5370,136.0,1.0,9.61508192633,62.2006926209,1,0,0,22340,9676,322.71241867,17.7282900627,1,100,4.24701125287,2.72260519248,0,6,17.9743048247,13.3241271262,0,23,82.4988407009,11.4021333588,0.0,30.0,45.1319021862,7.76284691137,1.0,66.0,9.40127026245,2.30880529144,1,73,0.113021725659,0.264843289305,0.0,0.986301369863,1,30450,0

As you can see in the documentation:
inferSchema – infers the input schema automatically from data. It requires one extra pass over the data. If None is set, it uses the default value, false.
For NaN values, refer to the same docs above:
nanValue – sets the string representation of a non-number value. If None is set, it uses the default value, NaN
By setting inferSchema to True, you will obtain a dataframe with the types inferred.
Here is an example:
CSV file:
12,5,8,9
1.0,3,46,NaN
By default, inferSchema is False and all values are String:
from pyspark.sql.types import *
>>> df = spark.read.csv("prova.csv",header=False)
>>> df.dtypes
[('_c0', 'string'), ('_c1', 'string'), ('_c2', 'string'), ('_c3', 'string')]
>>> df.show()
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
| 12| 5| 8| 9|
|1.0| 3| 46|NaN|
+---+---+---+---+
If you set inferSchema as True:
>>> df = spark.read.csv("prova.csv",inferSchema =True,header=False)
>>> df.dtypes
[('_c0', 'double'), ('_c1', 'int'), ('_c2', 'int'), ('_c3', 'double')]
>>> df.show()
+----+---+---+---+
| _c0|_c1|_c2|_c3|
+----+---+---+---+
|12.0| 5| 8|9.0|
| 1.0| 3| 46|NaN|
+----+---+---+---+
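For the original file, a minimal sketch combining both options (assuming the lowercase nan in the file should be treated as the non-number marker; the S3 path is the one from the question):
df = spark.read.csv(
    "s3://s3-cdp-prod-hive/novaya/instacart/data.csv",
    header=False,
    inferSchema=True,   # extra pass over the data to detect int/double columns
    nanValue="nan"      # the file represents missing values as lowercase "nan"
)
df.printSchema()        # columns should now come back as int/double instead of string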

Related

How to filter and select columns and merge streaming dataframes in spark?

I have a streaming dataframe and I am not sure what the best way is to solve this issue. The data looks like:
ID | latitude | longitude
A  | 28       | 30
B  | 40       | 52
Transform to:
A       | B       | Distance
(28,30) | (40,52) | calculate distance
I need to transform it to this and add a distance column computed from the coordinates.
I am thinking about producing 2 data streams that are filtered with all the A coordinates and B coordinates. I would then A.join(B).withColumn(distance) and stream the output. Is this the way to go about solving this problem?
Is there a way I could pivot the readStream data into the needed format without aggregation, which could be faster than making two filtered streaming dataframes and merging them?
Can I add an array column of coordinates in a streaming dataset?
I am not sure how performant this will be, but you can use pivot to force rows of the ID column to become new columns and sum the individual latitude and longitude as a way to obtain the value itself (since there is no F.identity). This will get you the following result:
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
)
+----------+-----------+----------+-----------+
|A_latitude|A_longitude|B_latitude|B_longitude|
+----------+-----------+----------+-----------+
| 28| 30| 40| 52|
+----------+-----------+----------+-----------+
Then you can use F.struct to create columns A and B using the latitude and longitude columns:
streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
)
+----------+-----------+----------+-----------+--------+--------+
|A_latitude|A_longitude|B_latitude|B_longitude| A| B|
+----------+-----------+----------+-----------+--------+--------+
| 28| 30| 40| 52|{28, 30}|{40, 52}|
+----------+-----------+----------+-----------+--------+--------+
The last step is to use a udf to calculate geographic distance, which has been answered here. Putting this all together:
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType
from geopy.distance import geodesic

@F.udf(returnType=FloatType())
def geodesic_udf(a, b):
    return geodesic(a, b).m

streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
).withColumn(
    'A', F.struct(F.col('A_latitude'), F.col('A_longitude'))
).withColumn(
    'B', F.struct(F.col('B_latitude'), F.col('B_longitude'))
).withColumn(
    'distance', geodesic_udf(F.array('B.B_longitude', 'B.B_latitude'), F.array('A.A_longitude', 'A.A_latitude'))
).select(
    'A', 'B', 'distance'
)
+--------+--------+---------+
| A| B| distance|
+--------+--------+---------+
|{28, 30}|{40, 52}|2635478.5|
+--------+--------+---------+
EDIT: When I answered your question, I let pyspark infer the datatype of each column, but I also tried to more closely reproduce the schema for your streaming dataframe by specifying the column types:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

streaming_df = spark.createDataFrame(
    [
        ("A", 28., 30.),
        ("B", 40., 52.),
    ],
    StructType([
        StructField("ID", StringType(), True),
        StructField("latitude", DoubleType(), True),
        StructField("longitude", DoubleType(), True),
    ])
)
streaming_df.printSchema()
root
 |-- ID: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
The end result is still the same:
+------------+------------+---------+
| A| B| distance|
+------------+------------+---------+
|{28.0, 30.0}|{40.0, 52.0}|2635478.5|
+------------+------------+---------+
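If you would rather avoid a Python UDF, a haversine approximation (not the exact geodesic result above) can be built from Spark's built-in trig functions. This is just a sketch on top of the pivoted A_latitude/A_longitude/B_latitude/B_longitude columns from the earlier step:
import pyspark.sql.functions as F

R = 6371000.0  # mean Earth radius in metres (haversine approximation)
pivoted = streaming_df.groupby().pivot('ID').agg(
    F.sum('latitude').alias('latitude'),
    F.sum('longitude').alias('longitude')
)
haversine_df = pivoted.withColumn(
    'dlat', F.radians(F.col('B_latitude') - F.col('A_latitude'))
).withColumn(
    'dlon', F.radians(F.col('B_longitude') - F.col('A_longitude'))
).withColumn(
    'a',
    F.sin(F.col('dlat') / 2) ** 2
    + F.cos(F.radians('A_latitude')) * F.cos(F.radians('B_latitude')) * F.sin(F.col('dlon') / 2) ** 2
).withColumn(
    'distance', F.lit(2 * R) * F.asin(F.sqrt('a'))   # great-circle distance in metres
)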

How to remove double quotes from column name while saving dataframe in csv in spark?

I am saving a Spark dataframe into a CSV file. All the records are saved with double quotes, which is fine, but the column names also come out in double quotes. Could you please help me remove them?
Example:
"Source_System"|"Date"|"Market_Volume"|"Volume_Units"|"Market_Value"|"Value_Currency"|"Sales_Channel"|"Competitor_Name"
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"
Desirable Output:
Source_System|Date|Market_Volume|Volume_Units|Market_Value|Value_Currency|Sales_Channel|Competitor_Name
"IMS"|"20080628"|"183.0"|"16470.0"|"165653.256349"|"AUD"|"AUSTRALIA HOSPITAL"|"PFIZER"
I am using the code below:
df4.repartition(1).write.csv(Output_Path_ASPAC, quote='"', header=True, quoteAll=True, sep='|', mode='overwrite')
I think the only workaround would be to concatenate quotes to the column values in the dataframe before writing to CSV.
Example:
df.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#| 1| a| 100|
#+---+----+------+
from pyspark.sql.functions import col, concat, lit
cols = [concat(lit('"'), col(i), lit('"')).alias(i) for i in df.columns]
df1 = df.select(*cols)
df1.show()
#+---+----+------+
#| id|name|salary|
#+---+----+------+
#|"1"| "a"| "100"|
#+---+----+------+
df1.write.csv("<path>", header=True, sep='|', escape='', quote='', mode='overwrite')
#output
#cat tmp4/part*
#id|name|salary
#"1"|"a"|"100"

Pyspark number of unique values in dataframe is different compared with Pandas result

I have a large dataframe with 4 million rows. One of the columns is a variable called "name".
When I check the number of unique values in Pandas with df['name'].nunique() I get a different answer than from PySpark's df.select("name").distinct().show() (around 1800 in Pandas versus 350 in PySpark). How can this be? Is this a data partitioning thing?
EDIT:
The record "name" in the dataframe looks like: name-{number}, for example: name-1, name-2, etc.
In Pandas:
df['name'] = df['name'].str.lstrip('name-').astype(int)
df['name'].nunique() # 1800
In Pyspark:
import pyspark.sql.functions as f
df = df.withColumn("name", f.split(df['name'], '\-')[1].cast("int"))
df.select(f.countDistinct("name")).show()
IIUC, it's most likely from non-numeric characters (e.g. spaces) in the name column. Pandas will force the type conversion, while with Spark you get NULL; see the example below:
df = spark.createDataFrame([(e,) for e in ['name-1', 'name-22 ', 'name- 3']],['name'])
for PySpark:
import pyspark.sql.functions as f
df.withColumn("name1", f.split(df['name'], '\-')[1].cast("int")).show()
#+--------+-----+
#| name|name1|
#+--------+-----+
#| name-1| 1|
#|name-22 | null|
#| name- 3| null|
#+--------+-----+
for Pandas:
df.toPandas()['name'].str.lstrip('name-').astype(int)
#Out[xxx]:
#0 1
#1 22
#2 3
#Name: name, dtype: int64
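If the goal is to make the Spark result match, one option (a sketch, assuming the stray characters are only leading/trailing spaces) is to trim the split result before casting:
import pyspark.sql.functions as f

df.withColumn(
    "name1",
    f.trim(f.split(df['name'], '-')[1]).cast("int")   # trim strips the stray spaces before the cast
).show()
# all three sample rows now yield 1, 22 and 3 instead of null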

pyspark.sql SparkSession load() with schema : Non-StringType fields in schema make all values null

Hi,
I am having trouble using non-StringType as a part of the schema that I use in loading a csv file to create a dataframe.
I was expecting the given schema to be used to convert each field of each record to corresponding data type on the fly while loading.
Instead, all I get is null values.
Here is a simplified way to reproduce my problem. In this example, there is a small csv file with four columns that I want to treat, correspondingly, as str, date, int, and bool:
python
Python 3.6.5 (default, Jun 17 2018, 12:13:06)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> from pyspark import SparkContext
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.types import *
>>>
>>> data_flnm = 'four_cols.csv'
>>> lines = [ln.rstrip() for ln in open(data_flnm).readlines()[:3]]
>>> lines
['zzzc7c09:66d7:47d6:9415:87e5010fe282|2019-04-08|0|f', 'zzz304fa:6fc0:4337:91d0:05ef4657a6db|2019-07-08|1|f', 'yy251cf0:aa11:44e9:88f4:f6f9c1899cee|2019-05-13|0|t']
>>> parts = [ln.split("|") for ln in lines]
>>> parts
[['zzzc7c09:66d7:47d6:9415:87e5010fe282', '2019-04-08', '0', 'f'], ['zzz304fa:6fc0:4337:91d0:05ef4657a6db', '2019-07-08', '1', 'f'], ['yy251cf0:aa11:44e9:88f4:f6f9c1899cee', '2019-05-13', '0', 't']]
>>> cols1 = [StructField('u_id', StringType(), True), StructField('week', StringType(), True), StructField('flag_0_1', StringType(), True), StructField('flag_t_f', StringType(), True)]
>>> cols2 = [StructField('u_id', StringType(), True), StructField('week', DateType(), True), StructField('flag_0_1', IntegerType(), True), StructField('flag_t_f', BooleanType(), True)]
>>> sch1 = StructType(cols1)
>>> sch2 = StructType(cols2)
>>> sch1
StructType(List(StructField(u_id,StringType,true),StructField(week,StringType,true),StructField(flag_0_1,StringType,true),StructField(flag_t_f,StringType,true)))
>>> sch2
StructType(List(StructField(u_id,StringType,true),StructField(week,DateType,true),StructField(flag_0_1,IntegerType,true),StructField(flag_t_f,BooleanType,true)))
>>> spark_sess = SparkSession.builder.appName("xyz").getOrCreate()
19/09/10 19:32:16 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> df1 = spark_sess.read.format('csv').option("nullValue", "null").load([data_flnm], sep='|', schema = sch1)
>>> df2 = spark_sess.read.format('csv').option("nullValue", "null").load([data_flnm], sep='|', schema = sch2)
>>> df1.show(5)
+--------------------+----------+--------+--------+
| u_id| week|flag_0_1|flag_t_f|
+--------------------+----------+--------+--------+
|zzzc7c09:66d7:47d...|2019-04-08| 0| f|
|zzz304fa:6fc0:433...|2019-07-08| 1| f|
|yy251cf0:aa11:44e...|2019-05-13| 0| t|
|yy1d2f8e:d8f0:4db...|2019-07-08| 1| f|
|zzz5ccad:2cf6:44e...|2019-05-20| 1| f|
+--------------------+----------+--------+--------+
only showing top 5 rows
>>> df2.show(5)
+----+----+--------+--------+
|u_id|week|flag_0_1|flag_t_f|
+----+----+--------+--------+
|null|null| null| null|
|null|null| null| null|
|null|null| null| null|
|null|null| null| null|
|null|null| null| null|
+----+----+--------+--------+
only showing top 5 rows
>>>
I tried a few different versions of the .read(...)...load(...) code.
None produced the expected result.
Please advise. Thank you!
PS: I could not add the tags "structfield" and "structtype": not enough reputation.
While parsing, you need to read the flag_t_f column as a string. The following schema will work:
StructType(List(StructField(u_id,StringType,true),StructField(week,DateType,true),StructField(flag_0_1,IntegerType,true),StructField(flag_t_f,StringType,true)))
After that you can add a boolean column to the dataframe if required:
import pyspark.sql.functions as f
df = df.withColumn("flag_t_f",
f.when(f.col("flag_t_f") == 'f', 'False')
.when(f.col("flag_t_f") == 't', 'True')
)
If you have more than one boolean column with 'f' and 't' values, you can convert all of them by iterating over the columns:
cols = df.columns
for col in cols:
    df = df.withColumn(
        col,
        f.when(f.col(col) == 'f', 'False')
         .when(f.col(col) == 't', 'True')
         .otherwise(f.col(col))
    )
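If an actual BooleanType column is needed rather than the 'True'/'False' strings above, a comparison expression can be used instead (a sketch, assuming 't' and 'f' are the only values that occur):
import pyspark.sql.functions as f

df = df.withColumn(
    "flag_t_f",
    f.col("flag_t_f") == 't'   # yields a proper boolean column: True for 't', False for 'f'
)
df.printSchema()               # flag_t_f: boolean (nullable = true)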

Statistics of columns computed in parallel

The post "Best way to get the max value in a Spark dataframe column" shows how to run an aggregation (distinct, min, max) on a table, something like:
for colName in df.columns:
    dt = cd[[colName]].distinct().count()
    mx = cd.agg({colName: "max"}).collect()[0][0]
    mn = cd.agg({colName: "min"}).collect()[0][0]
    print(colName, dt, mx, mn)
This can be easily done with compute statistics. The stats from Hive and Spark are different:
Hive gives: distinct, max, min, nulls, length, version
Spark gives: count, mean, stddev, min, max
It looks like quite a few statistics are calculated. How do I get all of them for all columns using one command?
However, I have thousands of columns and doing this serially is very slow. Suppose I want to compute some other function, say standard deviation, on each of the columns - how can that be done in parallel?
You can use pyspark.sql.DataFrame.describe() to get aggregate statistics like count, mean, min, max, and standard deviation for all columns where such statistics are applicable. (If you don't pass in any arguments, stats for all columns are returned by default)
df = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "a"), (4, None), (None, "c")],
    ["id", "name"]
)
df.describe().show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| mean| 2.5|null|
#| stddev|1.2909944487358056|null|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
As you can see, these statistics ignore any null values.
If you're using Spark version 2.3, there is also pyspark.sql.DataFrame.summary(), which supports the following aggregates:
count - mean - stddev - min - max - arbitrary approximate percentiles specified as a percentage (e.g., 75%)
df.summary("count", "min", "max").show()
#+-------+------------------+----+
#|summary| id|name|
#+-------+------------------+----+
#| count| 4| 4|
#| min| 1| a|
#| max| 4| c|
#+-------+------------------+----+
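As a small aside, summary() also accepts approximate percentile strings directly, for example:
df.summary("25%", "75%").show()   # approximate percentiles; the string column should show null here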
If you wanted some other aggregate statistic for all columns, you could also use a list comprehension with pyspark.sql.DataFrame.agg(). For example, if you wanted to replicate what you say Hive gives (distinct, max, min and nulls - I'm not sure what length and version mean):
import pyspark.sql.functions as f
from itertools import chain
agg_distinct = [f.countDistinct(c).alias("distinct_"+c) for c in df.columns]
agg_max = [f.max(c).alias("max_"+c) for c in df.columns]
agg_min = [f.min(c).alias("min_"+c) for c in df.columns]
agg_nulls = [f.count(f.when(f.isnull(c), c)).alias("nulls_"+c) for c in df.columns]
df.agg(
    *(chain.from_iterable([agg_distinct, agg_max, agg_min, agg_nulls]))
).show()
#+-----------+-------------+------+--------+------+--------+--------+----------+
#|distinct_id|distinct_name|max_id|max_name|min_id|min_name|nulls_id|nulls_name|
#+-----------+-------------+------+--------+------+--------+--------+----------+
#| 4| 3| 4| c| 1| a| 1| 1|
#+-----------+-------------+------+--------+------+--------+--------+----------+
Though this method will return one row, rather than one row per statistic as describe() and summary() do.
You can put as many expressions into an agg as you want; when you collect, they all get computed at once. The result is a single row with all the values. Here's an example:
from pyspark.sql.functions import min, max, countDistinct
r = df.agg(
    min(df.col1).alias("minCol1"),
    max(df.col1).alias("maxCol1"),
    (max(df.col1) - min(df.col1)).alias("diffMinMax"),
    countDistinct(df.col2).alias("distinctItemsInCol2"))
r.printSchema()
# root
# |-- minCol1: long (nullable = true)
# |-- maxCol1: long (nullable = true)
# |-- diffMinMax: long (nullable = true)
# |-- distinctItemsInCol2: long (nullable = false)
row = r.collect()[0]
print(row.distinctItemsInCol2, row.diffMinMax)
# (10, 9)
You can also use the dictionary syntax here, but it's harder to manage for more complex things.
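For completeness, a sketch of that dictionary syntax (one aggregate per column name, which is why it gets awkward when you need several statistics for the same column):
r2 = df.agg({"col1": "max", "col2": "min"})   # maps column name -> aggregate function
r2.show()   # result columns are named max(col1) and min(col2)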