How can I select only the columns marked True from the output below?
'''
for i in cat:
    if i in df:
        uniques = (i, df[i].nunique() < 6)
        print(uniques)
'''
('Exterior1st', False)
('Exterior2nd', False)
('MasVnrType', True)
('ExterQual', True)
('ExterCond', True)
('Foundation', False)
('BsmtQual', True)
('BsmtCond', True)
('BsmtExposure', True)
('BsmtFinType1', False)
('BsmtFinType2', False)
('Heating', False)
('HeatingQC', True)
('CentralAir', True)
('Electrical', True)
('KitchenQual', True)
('Functional', False)
('GarageType', False)
('GarageFinish', True)
('GarageQual', True)
('GarageCond', True)
('PavedDrive', True)
('SaleType', False)
('SaleCondition', False)
You can do this:
tuples = [('Exterior1st', False), ('Exterior2nd', False), ('MasVnrType', True), ('ExterQual', True), ('ExterCond', True), ('Foundation', False), ('BsmtQual', True), ('BsmtCond', True), ('BsmtExposure', True), ('BsmtFinType1', False), ('BsmtFinType2', False), ('Heating', False), ('HeatingQC', True), ('CentralAir', True), ('Electrical', True), ('KitchenQual', True), ('Functional', False), ('GarageType', False), ('GarageFinish', True), ('GarageQual', True), ('GarageCond', True), ('PavedDrive', True), ('SaleType', False), ('SaleCondition', False)]
res = [tup for tup in tuples if tup[1]]
print(res)
[('MasVnrType', True), ('ExterQual', True), ('ExterCond', True), ('BsmtQual', True), ('BsmtCond', True), ('BsmtExposure', True), ('HeatingQC', True), ('CentralAir', True), ('Electrical', True), ('KitchenQual', True), ('GarageFinish', True), ('GarageQual', True), ('GarageCond', True), ('PavedDrive', True)]
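If you are still inside the original loop, you can also skip building tuples entirely and collect the qualifying column names directly (a sketch reusing the cat and df names from the question):
# keep only the categorical columns that exist in df and have fewer than 6 unique values
true_cols = [c for c in cat if c in df and df[c].nunique() < 6]
print(true_cols)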
How can I check whether two RDDs generated this way contain the same data, including the same number of rows?
I'm using ScalaTest to run the tests, with Spark 3 and Scala 2.12.12.
Below is my code for creating the schemas of my two RDDs (plus the expected one) and for creating all three RDDs with data.
// Creating the schemas for the amounts RDD, the warehouse RDD, and the expected final DataFrame
val amountsSchema: StructType = StructType(
  Seq(
    StructField("positionId", LongType, nullable = true),
    StructField("amount", DecimalType(10, 2), nullable = true),
    StructField("eventTime", LongType, nullable = true)
  )
)
val warehouseSchema: StructType = StructType(
  Seq(
    StructField("positionId", LongType, nullable = true),
    StructField("warehouse", StringType, nullable = true),
    StructField("product", StringType, nullable = true),
    StructField("eventTime", LongType, nullable = true)
  )
)
val expectedDfSchema: StructType = StructType(
  Seq(
    StructField("positionId", LongType, nullable = true),
    StructField("warehouse", StringType, nullable = true),
    StructField("product", StringType, nullable = true),
    StructField("amount", DecimalType(10, 2), nullable = true)
  )
)
// Creating the data for the amounts RDD, the warehouse RDD, and the expected final RDD
val amounts_data = Seq(
  Row("1", "5.00", "1528463387"),
  Row("1", "7.20", "1528463005"),
  Row("2", "5.00", "1528463097"),
  Row("2", "7.20", "1528463007"),
  Row("3", "6.00", "1528463078"),
  Row("4", "24.20", "1528463008"),
  Row("4", "15.00", "1528463100")
)
val wh_data = Seq(
  Row("1", "W-1", "P-1", "1528463098"),
  Row("2", "W-2", "P-2", "1528463097"),
  Row("3", "W-2", "P-3", "1528463078"),
  Row("4", "W-1", "P-6", "1528463100")
)
val expected_data = Seq(
  Row("1", "W-1", "P-1", "5.00"),
  Row("2", "W-2", "P-2", "5.00"),
  Row("3", "W-2", "P-3", "6.00"),
  Row("4", "W-1", "P-6", "15.00")
)
// Creating the DataFrames df_amounts, df_wh, and df_expected from the schemas and data above
val df_amounts: DataFrame = spark.createDataFrame(
  spark.sparkContext.parallelize(amounts_data),
  amountsSchema
)
val df_wh: DataFrame = spark.createDataFrame(
  spark.sparkContext.parallelize(wh_data),
  warehouseSchema
)
val df_expected: DataFrame = spark.createDataFrame(
  spark.sparkContext.parallelize(expected_data),
  expectedDfSchema
)
// Using the get_amounts function to generate the result DataFrame
val resDf: DataFrame = get_amounts(df_amounts, df_wh)
// Testing whether the resDf schema matches the expected schema - it does, the test passes
test("DataFrame Schema Test") {
assert(assertSchema(resDf.schema, df_expected.schema))
}
// Testing whether the resDf data matches the expected data - it doesn't, the test fails
test("DataFrame Data Test") {
assert(assertData(resDf, df_expected))
}
}
The assertData function below is what I use to compare the expected DataFrame with the one produced by get_amounts, but the test fails.
def assertData(df1: DataFrame, df2: DataFrame): Boolean = {
  df1.exceptAll(df2).rdd.isEmpty()
}
Thank You
The way you create the datasets is valid, and the test structure looks good as well.
I would suggest improving your assert method so you can see why the test case fails. Here are some thoughts on your testing approach:
exceptAll is not perfect for testing: if df2 contains an additional row, it will still report that the data matches. Consider the code below:
val df1 = Seq(
  (1, "x"),
  (2, "y")
).toDF("x", "y")
val df2 = Seq(
  (1, "x"),
  (2, "y"),
  (3, "z")
).toDF("x", "y")
assert(df1.exceptAll(df2).rdd.isEmpty())
"this function resolves columns by position (not by name)" (from Spark code scala docs), due to this sometimes you can get confused about your test result.
your assert method says nothing about what exactly mismatched
For testing purposes is not bad to collect (small amount of) data and match sequences.
You can consider using a method like this one:
def matchDF(resultDF: DataFrame, expectedDF: DataFrame): Unit = {
  resultDF.printSchema()
  expectedDF.printSchema()
  assert(resultDF.schema == expectedDF.schema,
    s"Schema does not match: ${resultDF.schema} != ${expectedDF.schema}")
  val expected = expectedDF.collect().toSeq
  val result = resultDF.collect().toSeq
  assert(expected == result, s"Data does not match: $result != $expected")
}
It's not a perfect approach (it still depends on the position of values within a row), but at least you will be able to see what is going on and why your test fails.
For wrong data you'll see this:
assertion failed: Data does not match: WrappedArray([1,x], [2,y]) != WrappedArray([1,x], [3,y])
For wrong schema you'll get:
root
|-- x: integer (nullable = false)
|-- y: string (nullable = true)
root
|-- x: string (nullable = true)
|-- y: string (nullable = true)
Exception in thread "main" java.lang.AssertionError: assertion failed: Schema does not match
I hope this will help you understand what is going wrong.
I'm working on migrating a job. As part of this, I need to pass column datatypes when constructing DataFrames.
I was able to construct a dictionary that holds the table name as the key and the schema definition for the table as the value. When I try to pass one of these values as the DataFrame schema, it complains that the schema should be a struct type, not a string.
Dictionary I'm creating:
{'table1': StructType([
    StructField("column1", varchar(), True),
    StructField("column2", numeric(), True),
    StructField("column3", numeric(), True),
    StructField("column4", timestamp(), True),
    StructField("column5", timestamp(), True),
    StructField("column6", timestamp(), True)
])}
I'm aware the datatypes above may be wrong; this is just an example.
Error: expecting a Struct not a string literal for schema definition.
I'm not sure how you use your dictionary, but the following way of passing the schema as a dict value works well:
from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

my_dict = {'table1': StructType([
    StructField("column1", StringType(), True),
    StructField("column2", LongType(), True),
    StructField("column3", LongType(), True),
    StructField("column4", TimestampType(), True),
    StructField("column5", TimestampType(), True),
    StructField("column6", TimestampType(), True)
])}
df = spark.createDataFrame([], my_dict["table1"])
df.printSchema()
# root
# |-- column1: string (nullable = true)
# |-- column2: long (nullable = true)
# |-- column3: long (nullable = true)
# |-- column4: timestamp (nullable = true)
# |-- column5: timestamp (nullable = true)
# |-- column6: timestamp (nullable = true)
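Since varchar, numeric and timestamp are not Spark types, one possible sketch for a migration like this is to map the source type names onto Spark types before building each StructType. The TYPE_MAP, the tables metadata and every name below are illustrative assumptions, not your actual metadata:
from pyspark.sql.types import StructType, StructField, StringType, DecimalType, TimestampType

# illustrative mapping from source-system type names to Spark types; adjust to your source system
TYPE_MAP = {
    "varchar": StringType(),
    "numeric": DecimalType(18, 2),
    "timestamp": TimestampType(),
}

# hypothetical table metadata gathered during the migration: {table: [(column, source_type), ...]}
tables = {
    "table1": [("column1", "varchar"), ("column2", "numeric"), ("column3", "numeric"),
               ("column4", "timestamp"), ("column5", "timestamp"), ("column6", "timestamp")]
}

# build one StructType per table from the metadata
schemas = {
    table: StructType([StructField(name, TYPE_MAP[src], True) for name, src in cols])
    for table, cols in tables.items()
}

df = spark.createDataFrame([], schemas["table1"])
df.printSchema()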
When I perform a row count on my DataFrame with and without cache(), I get different results.
The example is the following:
videosDF = videosRawDF.withColumn("trending_date", F.to_date("trending_date", "yy.dd.MM"))\
    .withColumn("publish_time", F.from_unixtime(F.unix_timestamp('publish_time', "yyyy-MM-dd'T'HH:mm:ss")).cast("timestamp"))\
    .dropna(subset = ["trending_date", "publish_time"])
videosDF.count()
120746
On the other hand, if I only add .cache(), the result changes:
videosDF = videosRawDF.withColumn("trending_date", F.to_date("trending_date", "yy.dd.MM"))\
    .withColumn("publish_time", F.from_unixtime(F.unix_timestamp('publish_time', "yyyy-MM-dd'T'HH:mm:ss")).cast("timestamp"))\
    .dropna(subset = ["trending_date", "publish_time"])
videosDF.cache()
videosDF.count()
0
The correct count after the drop is 120746 rows, but I need to use cache(). What can I do?
Thank you for reading.
EDIT:
The dataset can be found in: https://www.kaggle.com/datasets/datasnaek/youtube-new
My code so far is the following:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType,BooleanType
from pyspark.sql import functions as F
ruta = "gs://ucmbucketrafael/data/"
youtubeSchema = StructType([
    StructField('video_id', StringType(), nullable = True),
    StructField('trending_date', StringType(), nullable = True),
    StructField('title', StringType(), nullable = True),
    StructField('channel_title', StringType(), nullable = True),
    StructField('category_id', StringType(), nullable = True),
    StructField('publish_time', StringType(), nullable = True),
    StructField('tags', StringType(), nullable = True),
    StructField('views', IntegerType(), nullable = True),
    StructField('likes', IntegerType(), nullable = True),
    StructField('dislikes', IntegerType(), nullable = True),
    StructField('comment_count', IntegerType(), nullable = True),
    StructField('comments_disabled', BooleanType(), nullable = True),
    StructField('ratings_disabled', BooleanType(), nullable = True),
    StructField('video_error_or_removed', BooleanType(), nullable = True),
    StructField('description', StringType(), nullable = True)
])
USvideosDF = spark.read.schema(youtubeSchema)\
    .option("header", "true")\
    .option("quote", "\"").option("escape", "\"")\
    .option("mode","DROPMALFORMED")\
    .csv(ruta+"youtube_USvideos.csv")\
    .withColumn("pais",F.lit("EEUU"))\
    .drop("description")
CAvideosDF = spark.read.schema(youtubeSchema)\
    .option("header", "true")\
    .option("quote", "\"").option("escape", "\"")\
    .option("mode","DROPMALFORMED")\
    .csv(ruta+"youtube_CAvideos.csv")\
    .withColumn("pais",F.lit("CA"))\
    .drop("description")
GBvideosDF = spark.read.schema(youtubeSchema)\
    .option("header", "true")\
    .option("quote", "\"").option("escape", "\"")\
    .option("mode","DROPMALFORMED")\
    .csv(ruta+"youtube_GBvideos.csv")\
    .withColumn("pais",F.lit("GB"))\
    .drop("description")
videosRawDF = USvideosDF.union(CAvideosDF).union(GBvideosDF)
videosDF = videosRawDF.withColumn("trending_date", F.to_date("trending_date", "yy.dd.MM"))\
    .withColumn("publish_time", F.from_unixtime(F.unix_timestamp('publish_time', "yyyy-MM-dd'T'HH:mm:ss")).cast("timestamp"))\
    .dropna(subset = ["trending_date", "publish_time"])
videosDF.cache()
videosDF.count()
I have already discovered the problem: it is the interaction between DROPMALFORMED and the schema. If I remove the schema when reading the CSV, the count is correct.
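If you want to see which rows DROPMALFORMED was silently discarding once every column gets parsed, a sketch using the standard PERMISSIVE mode with a corrupt-record column may help; the _corrupt_record column name, the checkDF name and the extra cache step are my own additions, not part of the original code:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# re-read one file in PERMISSIVE mode, keeping the raw text of any row that fails to parse
badSchema = StructType(youtubeSchema.fields + [StructField("_corrupt_record", StringType(), True)])
checkDF = (spark.read.schema(badSchema)
    .option("header", "true")
    .option("quote", "\"").option("escape", "\"")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .csv(ruta + "youtube_USvideos.csv"))

# cache first: Spark restricts queries that reference only the corrupt-record column of a CSV scan
checkDF.cache()
checkDF.filter(F.col("_corrupt_record").isNotNull()).count()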
I have a 1000-line CSV file containing tab-separated values and want to read it into a DataFrame.
Here is my schema:
schema = StructType([StructField("TEST1", StringType(), True),
                     StructField("TEST2", StringType(), True),
                     StructField("TEST3", StringType(), True),
                     StructField("TEST4", StringType(), True),
                     StructField("TEST5", StringType(), True)])
Here is my readStream statement:
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("delimiter", "\t")
      .option("path", "/mnt/data/data.tsv")
      .schema(schema).load())
Running the readStream cell results in:
df: pyspark.sql.dataframe.DataFrame
TEST1: string
TEST2: string
TEST3: string
TEST4: string
TEST5: string
In the next cell I executed
display(df)
and ended up with
Query returned no results.
I think there is something wrong with my schema or readStream statement. Is the tab delimiter set correctly?
I found out what the problem was. The path should not have been a specific file; rather, it should have been a path containing wildcards. After I changed the command to this:
df = (spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .option("delimiter", "\t")
      .option("path", "/mnt/data/*.tsv")
      .schema(schema).load())
it worked just fine.
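As a quick sanity check of the delimiter and schema before wiring up the stream, a plain batch read of the same files can be used; this is just a sketch that reuses the path and schema from the question:
# batch read of the same TSV files to confirm that the tab delimiter and schema parse rows as expected
check_df = (spark.read.format("csv")
            .option("delimiter", "\t")
            .schema(schema)
            .load("/mnt/data/*.tsv"))
check_df.show(5)
print(check_df.count())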
I have a given dataframe that looks like this:
TEST_schema = StructType([StructField("Date", StringType(), True),
                          StructField("START", StringType(), True),
                          StructField("quantity", IntegerType(), True),
                          StructField("col1", StringType(), True),
                          StructField("col2", StringType(), True)])
TEST_data = [('2020-08-15','2020-08-19',1,'2020-08-05','2020-08-09'),
             ('2020-08-16','2020-08-19',2,'2020-08-05','2020-08-09'),
             ('2020-08-17','2020-08-19',3,'2020-08-06','2020-08-09'),
             ('2020-08-18','2020-08-19',4,'2020-08-10','2020-08-11'),
             ('2020-08-19','2020-08-19',5,'2020-08-16','2020-08-19'),
             ('2020-08-20','2020-08-19',6,'2020-08-20','2020-08-25'),
             ('2020-08-21','2020-08-19',7,'2020-08-20','2020-08-21'),
             ('2020-08-22','2020-08-19',8,'2020-08-19','2020-08-24'),
             ('2020-08-23','2020-08-19',9,'2020-08-05','2020-08-09')]
rdd3 = sc.parallelize(TEST_data)
TEST_df = sqlContext.createDataFrame(TEST_data, TEST_schema)
from pyspark.sql.functions import to_date

TEST_df = TEST_df.withColumn("Date", to_date("Date"))\
    .withColumn("START", to_date("START"))\
    .withColumn("col1", to_date("col1"))\
    .withColumn("col2", to_date("col2"))
TEST_df.show()
+----------+----------+--------+----------+----------+
| Date| START|quantity| col1| col2|
+----------+----------+--------+----------+----------+
|2020-08-15|2020-08-19| 1|2020-08-05|2020-08-09|
|2020-08-16|2020-08-19| 2|2020-08-05|2020-08-09|
|2020-08-17|2020-08-19| 3|2020-08-06|2020-08-09|
|2020-08-18|2020-08-19| 4|2020-08-10|2020-08-11|
|2020-08-19|2020-08-19| 5|2020-08-16|2020-08-19|
|2020-08-20|2020-08-19| 6|2020-08-20|2020-08-25|
|2020-08-21|2020-08-19| 7|2020-08-20|2020-08-21|
|2020-08-22|2020-08-19| 8|2020-08-19|2020-08-24|
|2020-08-23|2020-08-19| 9|2020-08-05|2020-08-09|
+----------+----------+--------+----------+----------+
where col1 and col2 may not be unique, Date is just an incremental date, and START is unique.
My logic is: if START == col2, then lag(quantity, offset=datediff(col2, col1), 0); otherwise 0.
In this case, datediff(col2, col1) is 3 days.
Attempt 1.
from pyspark.sql.functions import when, col,datediff,expr
TEST_df = TEST_df.withColumn('datedifff', datediff(col('col2'), col('col1')))\
    .withColumn('want', expr("IF(START == col2, lag(quantity, datedifff, 0), 0)"))
which fails with an error because the lag offset has to be a literal...
so my result df will look like:
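Since lag requires a constant offset, and Date advances by exactly one day per row in the sample, one possible workaround sketch (the lookup name and the join approach are my own, not from the original post) is to fetch the lagged quantity through a self-join on the shifted date instead of a window lag:
from pyspark.sql import functions as F

# quantity keyed by date, to be joined back onto the shifted date
lookup = TEST_df.select(F.col("Date").alias("lag_date"),
                        F.col("quantity").alias("lag_quantity"))

result = (TEST_df
    .withColumn("offset", F.datediff("col2", "col1"))
    .withColumn("lag_date", F.expr("date_sub(Date, offset)"))
    .join(lookup, on="lag_date", how="left")
    .withColumn("want", F.when(F.col("START") == F.col("col2"),
                               F.coalesce(F.col("lag_quantity"), F.lit(0)))
                         .otherwise(F.lit(0)))
    .drop("offset", "lag_date", "lag_quantity"))

result.orderBy("Date").show()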