When I count the rows of my dataframe with and without cache(), I get different results.
Here is an example:
videosDF = videosRawDF.withColumn("trending_date", F.to_date("trending_date", "yy.dd.MM"))\
.withColumn("publish_time", F.from_unixtime(F.unix_timestamp('publish_time', "yyyy-MM-dd'T'HH:mm:ss")).cast("timestamp"))\
.dropna(subset = ["trending_date", "publish_time"])
videosDF.count()
120746
On the other hand, if I just add .cache(), the result changes:
videosDF = videosRawDF.withColumn("trending_date", F.to_date("trending_date", "yy.dd.MM"))\
.withColumn("publish_time", F.from_unixtime(F.unix_timestamp('publish_time', "yyyy-MM-dd'T'HH:mm:ss")).cast("timestamp"))\
.dropna(subset = ["trending_date", "publish_time"])
videosDF.cache()
videosDF.count()
0
The correct result after the drop is 120746 rows, but I need to keep the cache. What can I do?
Thank you for reading.
EDIT:
The dataset can be found in: https://www.kaggle.com/datasets/datasnaek/youtube-new
and my code so far is the following:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, BooleanType
from pyspark.sql import functions as F
ruta = "gs://ucmbucketrafael/data/"
youtubeSchema = StructType([
StructField('video_id', StringType(), nullable = True),
StructField('trending_date', StringType(), nullable = True),
StructField('title', StringType(), nullable = True),
StructField('channel_title', StringType(), nullable = True),
StructField('category_id', StringType(), nullable = True),
StructField('publish_time', StringType(), nullable = True),
StructField('tags', StringType(), nullable = True),
StructField('views', IntegerType(), nullable = True),
StructField('likes', IntegerType(), nullable = True),
StructField('dislikes', IntegerType(), nullable = True),
StructField('comment_count', IntegerType(), nullable = True),
StructField('comments_disabled', BooleanType(), nullable = True),
StructField('ratings_disabled', BooleanType(), nullable = True),
StructField('video_error_or_removed', BooleanType(), nullable = True),
StructField('description', StringType(), nullable = True)
])
USvideosDF = spark.read.schema(youtubeSchema)\
.option("header", "true")\
.option("quote", "\"").option("escape", "\"")\
.option("mode","DROPMALFORMED")\
.csv(ruta+"youtube_USvideos.csv")\
.withColumn("pais",F.lit("EEUU"))\
.drop("description")
CAvideosDF = spark.read.schema(youtubeSchema)\
.option("header", "true")\
.option("quote", "\"").option("escape", "\"")\
.option("mode","DROPMALFORMED")\
.csv(ruta+"youtube_CAvideos.csv")\
.withColumn("pais",F.lit("CA"))\
.drop("description")
GBvideosDF = spark.read.schema(youtubeSchema)\
.option("header", "true")\
.option("quote", "\"").option("escape", "\"")\
.option("mode","DROPMALFORMED")\
.csv(ruta+"youtube_GBvideos.csv")\
.withColumn("pais",F.lit("GB"))\
.drop("description")
videosRawDF = USvideosDF.union(CAvideosDF).union(GBvideosDF)
videosDF = videosRawDF.withColumn("trending_date", F.to_date("trending_date", "yy.dd.MM"))\
.withColumn("publish_time", F.from_unixtime(F.unix_timestamp('publish_time', "yyyy-MM-dd'T'HH:mm:ss")).cast("timestamp"))\
.dropna(subset = ["trending_date", "publish_time"])
videosDF.cache()
videosDF.count()
I have already found the problem. It is the interaction between DROPMALFORMED and the schema: if the explicit schema is removed when reading the CSV, the count is correct.
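For reference, a minimal sketch of that workaround (one country shown; the same change applies to the other two reads): drop .schema(youtubeSchema) and let Spark infer the column types, so DROPMALFORMED no longer drops rows that fail to fit the declared types.
# reading without the explicit schema (inferSchema used instead)
USvideosDF = spark.read\
.option("header", "true")\
.option("inferSchema", "true")\
.option("quote", "\"").option("escape", "\"")\
.option("mode","DROPMALFORMED")\
.csv(ruta+"youtube_USvideos.csv")\
.withColumn("pais",F.lit("EEUU"))\
.drop("description")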
Related
I'm working on migrating a job. As a part of this, I need to pass column datatypes when constructing dataframes.
I was able to construct a dictionary that holds the table name as the key and the schema definition for the table as the value. When I try to pass the values as the dataframe schema, it complains that it expects a StructType, not a string.
Dictionary I'm creating:
{'table1': StructType([
StructField("column1",varchar(), True),
StructField("column2",numeric(), True),
StructField("column3",numeric(), True),
StructField("column4",timestamp(), True),
StructField("column5",timestamp(), True),
StructField("column6",timestamp(), True)
])}
I'm aware the datatypes above may be wrong, but this is just an example.
Error: expecting a Struct not a string literal for schema definition.
I'm not sure how you use your dictionary, but the following way of passing a schema as a dict value works well:
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

my_dict = {'table1': StructType([
StructField("column1", StringType(), True),
StructField("column2", LongType(), True),
StructField("column3", LongType(), True),
StructField("column4", TimestampType(), True),
StructField("column5", TimestampType(), True),
StructField("column6", TimestampType(), True)
])}
df = spark.createDataFrame([], my_dict["table1"])
df.printSchema()
# root
# |-- column1: string (nullable = true)
# |-- column2: long (nullable = true)
# |-- column3: long (nullable = true)
# |-- column4: timestamp (nullable = true)
# |-- column5: timestamp (nullable = true)
# |-- column6: timestamp (nullable = true)
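If you then want to use a schema from the dict when reading data, here is a small sketch; the file path and format are placeholders, not from the original question:
# hypothetical usage: pull the StructType out of the dict and pass it to a reader
table_name = "table1"
df = spark.read\
.schema(my_dict[table_name])\
.option("header", "true")\
.csv("/path/to/" + table_name + ".csv")
df.printSchema()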
I have a 1000-line csv file containing tab-separated values, and want to read them into a data frame.
Here is my schema:
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("TEST1", StringType(), True),
StructField("TEST2", StringType(), True),
StructField("TEST3", StringType(), True),
StructField("TEST4", StringType(), True),
StructField("TEST5", StringType(), True)])
Here is my readStream statement:
df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("delimiter", "\t")
.option("path", "/mnt/data/data.tsv")
.schema(schema).load())
Running the readStream cell results in:
df: pyspark.sql.dataframe.DataFrame
TEST1: string
TEST2: string
TEST3: string
TEST4: string
TEST5: string
In the next cell I executed
display(df)
and ended up with
Query returned no results.
I think there is something wrong with my schema or readStream statements. Is the tab delimiter set correctly?
I found out what the problem was. The path should not have been a specific file; rather, it should have been a path containing wildcards. After I changed the command to this:
df = (spark.readStream.format("cloudFiles")
.option("cloudFiles.format", "csv")
.option("delimiter", "\t")
.option("path", "/mnt/data/*.tsv")
.schema(schema).load())
it worked just fine.
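To double-check that rows actually arrive, one option (a sketch, not part of the original fix) is to write the stream once into an in-memory table and query it:
# run the stream once into a memory sink, then inspect the result
# (a checkpointLocation option may also be needed depending on the environment)
query = df.writeStream\
.format("memory")\
.queryName("tsv_check")\
.trigger(once=True)\
.start()
query.awaitTermination()
spark.table("tsv_check").show(5)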
I have a given dataframe that looks like this:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import to_date

TEST_schema = StructType([StructField("Date", StringType(), True),
StructField("START", StringType(), True),
StructField("quantity", IntegerType(), True),
StructField("col1", StringType(), True),
StructField("col2", StringType(), True)])
TEST_data = [('2020-08-15','2020-08-19',1,'2020-08-05','2020-08-09'),('2020-08-16','2020-08-19',2,'2020-08-05','2020-08-09')\
,('2020-08-17','2020-08-19',3,'2020-08-06','2020-08-09'),\
('2020-08-18','2020-08-19',4,'2020-08-10','2020-08-11'),('2020-08-19','2020-08-19',5,'2020-08-16','2020-08-19'),\
('2020-08-20','2020-08-19',6,'2020-08-20','2020-08-25'),('2020-08-21','2020-08-19',7,'2020-08-20','2020-08-21'),\
('2020-08-22','2020-08-19',8,'2020-08-19','2020-08-24'),('2020-08-23','2020-08-19',9,'2020-08-05','2020-08-09')]
rdd3 = sc.parallelize(TEST_data)
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
TEST_df = TEST_df.withColumn("Date", to_date("Date"))\
.withColumn("START", to_date("START"))\
.withColumn("col1", to_date("col1"))\
.withColumn("col2", to_date("col2"))
TEST_df.show()
+----------+----------+--------+----------+----------+
| Date| START|quantity| col1| col2|
+----------+----------+--------+----------+----------+
|2020-08-15|2020-08-19| 1|2020-08-05|2020-08-09|
|2020-08-16|2020-08-19| 2|2020-08-05|2020-08-09|
|2020-08-17|2020-08-19| 3|2020-08-06|2020-08-09|
|2020-08-18|2020-08-19| 4|2020-08-10|2020-08-11|
|2020-08-19|2020-08-19| 5|2020-08-16|2020-08-19|
|2020-08-20|2020-08-19| 6|2020-08-20|2020-08-25|
|2020-08-21|2020-08-19| 7|2020-08-20|2020-08-21|
|2020-08-22|2020-08-19| 8|2020-08-19|2020-08-24|
|2020-08-23|2020-08-19| 9|2020-08-05|2020-08-09|
+----------+----------+--------+----------+----------+
where col1 and col2 may not be unique, Date is just an incremental date, and START is unique.
My logic is: if START == col2, then lag(quantity, offset = datediff(col2, col1), 0), otherwise 0.
In this case, datediff(col2, col1) is 3 days.
Attempt 1.
from pyspark.sql.functions import when, col, datediff, expr
TEST_df = TEST_df.withColumn('datedifff', datediff(col('col2'), col('col1')))\
.withColumn('want', expr("IF(START == col2, lag(quantity, datedifff, 0), 0)"))
which fails, because the offset of lag must be a literal, not a column expression...
so my result df will look like:
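One possible way to express that logic (a sketch only, not from the original post; it assumes one row per Date, as in the sample data, so an offset of N rows equals N days) is to replace the variable-offset lag with a self-join on a shifted Date:
from pyspark.sql import functions as F

# quantity keyed by its own Date, to be looked up at Date - offset
lagged = TEST_df.select(F.col("Date").alias("lag_date"),
F.col("quantity").alias("lag_quantity"))

want_df = TEST_df\
.withColumn("offset", F.datediff("col2", "col1"))\
.withColumn("lag_date", F.expr("date_sub(Date, offset)"))\
.join(lagged, "lag_date", "left")\
.withColumn("want", F.when(F.col("START") == F.col("col2"),
F.coalesce(F.col("lag_quantity"), F.lit(0))).otherwise(F.lit(0)))\
.drop("lag_date", "lag_quantity", "offset")
want_df.show()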
How do I select only the True columns from the output below?
for i in cat:
    if i in df:
        uniques = (i, df[i].nunique() < 6)
        print(uniques)
which prints:
('Exterior1st', False)
('Exterior2nd', False)
('MasVnrType', True)
('ExterQual', True)
('ExterCond', True)
('Foundation', False)
('BsmtQual', True)
('BsmtCond', True)
('BsmtExposure', True)
('BsmtFinType1', False)
('BsmtFinType2', False)
('Heating', False)
('HeatingQC', True)
('CentralAir', True)
('Electrical', True)
('KitchenQual', True)
('Functional', False)
('GarageType', False)
('GarageFinish', True)
('GarageQual', True)
('GarageCond', True)
('PavedDrive', True)
('SaleType', False)
('SaleCondition', False)
You can do this:
tuples = [('Exterior1st', False), ('Exterior2nd', False), ('MasVnrType', True), ('ExterQual', True), ('ExterCond', True), ('Foundation', False), ('BsmtQual', True), ('BsmtCond', True), ('BsmtExposure', True), ('BsmtFinType1', False), ('BsmtFinType2', False), ('Heating', False), ('HeatingQC', True), ('CentralAir', True), ('Electrical', True), ('KitchenQual', True), ('Functional', False), ('GarageType', False), ('GarageFinish', True), ('GarageQual', True), ('GarageCond', True), ('PavedDrive', True), ('SaleType', False), ('SaleCondition', False)]
res = [tup for tup in tuples if any(i in tup for i in [True])]
print(res)
[('MasVnrType', True), ('ExterQual', True), ('ExterCond', True), ('BsmtQual', True), ('BsmtCond', True), ('BsmtExposure', True), ('HeatingQC', True), ('CentralAir', True), ('Electrical', True), ('KitchenQual', True), ('GarageFinish', True), ('GarageQual', True), ('GarageCond', True), ('PavedDrive', True)]
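A slightly simpler equivalent, assuming the second element of each tuple is always the boolean flag:
# keep only the column names whose flag is True
true_columns = [name for name, flag in tuples if flag]
print(true_columns)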
Spark: 1.4.0
I have a flat file from Amazon S3 which I loaded into HDFS (on the master node of my EC2 Spark cluster). The flat file is Hive output. Note: I can't change the context, as it is already defined. The following code is run in the pyspark shell:
Each 'row' corresponds to 1 row of data:
row = sc.textFile("/data/file")
row.first()
u'E8B98\x01John\x01Smith\x01Male\x01Gold\x0125'
Then I split each row using flatMap(), since for some reason map() doesn't seem to split it (using '\x01' as the delimiter):
elements = row.flatMap(lambda x: x.split('\x01'))
elements.take(8)
[u'E8B98', u'John', u'Smith', u'Male', u'Gold', u'25', u'E8B99', u'Alice']
Since I know the data has 6 columns per row, how do I get the data into a dataframe? I intend to sort by attribute, sum, etc.
I tried the following but it didn't work:
id = row.flatMap(lambda x: x.split('\x01')[0])
id.first()
E
There are many ways to transform an RDD into a DataFrame in Python.
Consider the following RDD:
rdd = sc.parallelize(list(["E8B98\x01John\x01Smith\x01Male\x01Gold\x0125","E8B2\x01Joe\x01Smith\x01Female\x01Gold\x0125"]))
rdd.first()
Output:
'E8B98\x01John\x01Smith\x01Male\x01Gold\x0125'
Let's now create an RDD of the split fields:
rdd2 = rdd.map(lambda x : x.split("\x01"))
rdd2.first()
Output:
['E8B98', 'John', 'Smith', 'Male', 'Gold', '25']
We can now create a DataFrame in one of the following ways.
Create it directly from the split RDD:
sqlContext.createDataFrame(rdd2).collect()
Output:
[Row(_1=u'E8B98', _2=u'John', _3=u'Smith', _4=u'Male', _5=u'Gold', _6=u'25'), Row(_1=u'E8B2', _2=u'Joe', _3=u'Smith', _4=u'Female', _5=u'Gold', _6=u'25')]
or create it from the same RDD, specifying the column names:
df = sqlContext.createDataFrame(rdd2, ['id', 'name', 'surname', 'gender', 'description', 'age'])
df.collect()
Output:
[Row(id=u'E8B98', name=u'John', surname=u'Smith', gender=u'Male', description=u'Gold', age=u'25'), Row(id=u'E8B2', name=u'Joe', surname=u'Smith', gender=u'Female', description=u'Gold', age=u'25')]
or create it by specifying an explicit schema:
from pyspark.sql.types import *
schema = StructType([
StructField("id", StringType(), True),
StructField("name", StringType(), True),
StructField("surname", StringType(), True),
StructField("gender", StringType(), True),
StructField("description", StringType(), True),
StructField("age", StringType(), True)])
df2 = sqlContext.createDataFrame(rdd2, schema)
df2.collect()
Output:
[Row(id=u'E8B98', name=u'John', surname=u'Smith', gender=u'Male', description=u'Gold', age=u'25'),Row(id=u'E8B2', name=u'Joe', surname=u'Smith', gender=u'Female', description=u'Gold', age=u'25')]
or by specifying a Row class as the schema, as follows:
from pyspark.sql import Row
Person = Row('id', 'name', 'surname', 'gender', 'description', 'age')
person = rdd2.map(lambda r: Person(*r))
df3 = sqlContext.createDataFrame(person)
df3.collect()
Output:
[Row(id=u'E8B98', name=u'John', surname=u'Smith', gender=u'Male', description=u'Gold', age=u'25'), Row(id=u'E8B2', name=u'Joe', surname=u'Smith', gender=u'Female', description=u'Gold', age=u'25')]
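Another shortcut worth a mention (a sketch; it assumes a SQLContext/SparkSession is already active, which attaches toDF to RDDs):
# convert the split RDD directly, passing the column names
df4 = rdd2.toDF(['id', 'name', 'surname', 'gender', 'description', 'age'])
df4.show()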
I hope this helps!
NB: Spark version >= 1.3.0