The pyspark groupby operation does not produce unique group keys for large data sets
I see repeated keys in the final output.
new_df = df.select('key', 'value') \
    .where(...) \
    .groupBy('key') \
    .count()
E.g., the above query returns multiple rows for the same groupBy column (key). The datatype of the groupBy column ('key') is string.
I'm storing the output as CSV by doing:
new_df.write.format("csv") \
.option("header", "true") \
.mode("Overwrite") \
.save(CSV_LOCAL_PATH)
E.g., the output CSV has duplicate rows:
key1, 10
key2, 20
key1, 05
Tested with Spark 2.4.3 and 2.3.
There are duplicates, yet there is no visible difference between the keys, and this happens for multiple keys.
Counting the rows for a particular key gives 1:
from pyspark.sql.functions import col

new_df.select('key', 'count') \
    .where(col("key") == "key1") \
    .count()
I'm not sure whether the pyarrow setting makes any difference. I had it enabled before; I tried with pyarrow both enabled and disabled, but got the same result.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
I found that the issue was in saving to CSV, which by default ignores leading and trailing whitespace, so keys that differ only by surrounding whitespace collapse into what look like duplicate rows in the output.
Adding the options below resolves it:
.option("ignoreLeadingWhiteSpace", "false")\
.option("ignoreTrailingWhiteSpace", "false")
Related
I have a line of pyspark that I am running in databricks:
df = df.toDF(*[format_column(c) for c in df.columns])
where format_column is a python function that upper cases, strips and removes the characters full stop . and backtick ` from the column names.
If I count the rows immediately before and after this line of code, the count drops, i.e. the dataframe randomly loses a bunch of rows.
I did some more digging with this and found the same behaviour if I tried the following:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns])
although the following is ok without the aliasing:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name) for column_name in df.columns])
and it is also ok if I don't rename all columns such as:
import pyspark.sql.functions as F
df = df.toDF(*[F.col(column_name).alias(column_name) for column_name in df.columns[:-1]])
And finally, there were some pipe (|) characters in the column names, which when removed manually beforehand then resulted in no issue.
As far as I know, pipe is not actually a special character in spark sql column names (unlike full stop and backtick).
Has anyone seen this kind of behaviour before and know of a solution aside from removing the pipe character manually beforehand?
Running on Databricks Runtime 10.4LTS.
Edit
format_column is defined as follows:
import re

def format_column(column: str) -> str:
    column = column.strip().upper()        # case and leading / trailing white spaces
    column = re.sub(r"\s+", " ", column)   # collapse multiple white spaces
    column = re.sub(r"\.|`", "_", column)  # replace full stops and backticks
    return column
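For reference, a variant of format_column that also replaces the pipe character (the manual workaround mentioned above) might look like the sketch below; the pipe replacement is my addition, not part of the original function.

import re

def format_column_no_pipe(column: str) -> str:
    # same cleanup as format_column, plus replacing '|' with '_'
    column = column.strip().upper()
    column = re.sub(r"\s+", " ", column)
    column = re.sub(r"[.`|]", "_", column)  # full stop, backtick and pipe
    return column

df = df.toDF(*[format_column_no_pipe(c) for c in df.columns])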
I reproduced this in my environment and there is no loss of rows in my dataframe.
format_column function and my dataframe:
Using the same format_column function, the count of the dataframe is the same before and after the renaming.
Please check whether something other than this function is changing your dataframe.
If you still get the same result, try the following and check whether it loses any rows:
print("before replacing : "+str(df.count()))
df1=df.toDF(*[re.sub('[^\w]', '_', c) for c in df.columns])
df1.printSchema()
print("before replacing : "+str(df1.count()))
If this also loses rows, then the issue is with something else in your dataframe or code; please recheck that.
import pandas as pd

df = pd.read_csv('../input/tipping/tips.csv')
df_1 = df.groupby(['day', 'time'])
df_1.head()
What am I missing here? It returns the previous dataframe to me as if the groupby had no effect.
We can print it using the following:
df_1 = df.groupby(['day','time']).apply(print)
groupby doesn't work the way you are assuming, by the sounds of it. Calling head on the grouped dataframe takes the first 5 rows of the underlying dataframe, even if they span several groups, because that is how the groupby object is built. You can use #tlentali's approach to print out each group, but df_1 will not be assigned the grouped dataframe that way; what gets assigned is the result of applying print to each group, and print returns None.
The approach below gives a lot of control over how to show/display the groups and their keys.
It might also help you understand how the grouped dataframe structure in pandas works.
df_1 = df.groupby(['day', 'time'])

# for each (day, time) key and its grouped data
for key, group in df_1:
    # show the (day, time) key
    print(key)
    # display the head of the grouped data
    print(group.head())
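If the goal is a regular dataframe rather than a GroupBy object, an aggregation is needed; here is a minimal sketch, assuming the usual tips column total_bill (swap in whichever column you actually want to aggregate):

# mean total_bill per (day, time); reset_index turns the group keys back into columns
summary = df.groupby(['day', 'time'])['total_bill'].mean().reset_index()
print(summary)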
I have an output Spark dataframe which needs to be written to CSV. A column in the dataframe is of 'struct' type, which is not supported by CSV. I am trying to convert it to a string, or to convert the dataframe to pandas, but nothing works.
from pyspark.sql.functions import explode

userRecs1 = userRecs.withColumn("recommendations", explode(userRecs.recommendations))
# userRecs1.write.csv('/user-home/libraries/Sampled_data/datasets/rec_per_user.csv')
Expected result: the recommendations column as string type, so that it can be split into two separate columns and written to CSV.
Actual results:
(recommendations column is struct type and cannot be written to csv)
+-------+-----------------+
| ID_CTE|  recommendations|
+-------+-----------------+
|3974081| [2229,0.8915096]|
|3974081| [2224,0.8593609]|
|3974081| [2295,0.8577902]|
|3974081|[2248,0.29922757]|
|3974081|[2299,0.28952467]|
+-------+-----------------+
Another option is to convert the struct column to JSON and then save:
from pyspark.sql import functions as f

userRecs1 \
    .select(f.col('ID_CTE'), f.to_json(f.col('recommendations'))) \
    .write.csv('/user-home/libraries/Sampled_data/datasets/rec_per_user.csv')
The following command will flatten your StructType into separate named columns:
userRecs1 \
    .select('ID_CTE', 'recommendations.*') \
    .write.csv('/user-home/libraries/Sampled_data/datasets/rec_per_user.csv')
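If you prefer explicit column names over the .* expansion, a hedged sketch along these lines works too; the struct field names item and rating are assumptions here, so check userRecs1.printSchema() for the real ones:

from pyspark.sql.functions import col

# 'item' and 'rating' are hypothetical field names; adjust to your schema
userRecs1 \
    .select(col('ID_CTE'),
            col('recommendations.item').alias('item'),
            col('recommendations.rating').cast('string').alias('rating')) \
    .write.csv('/user-home/libraries/Sampled_data/datasets/rec_per_user.csv')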
I have a list of many dataframes, each with a subset of a master schema. In order to union these dataframes, I need to construct a common schema across all of them. My thought is that I need to create empty columns for all the missing columns in each dataframe. On average there are about 80 missing features and hundreds of dataframes.
This is somewhat of a duplicate or inspired by Concatenate two PySpark dataframes
I am currently implementing things this way:
from pyspark.sql.functions import lit

for df in dfs:  # list of dataframes
    for feature in missing_features:  # list of strings
        df = df.withColumn(feature, lit(None).cast("string"))
This seems to be taking a significant amount of time. Is there a faster way to concat these dataframes with null in place of missing features?
You might be able to cut time a little by replacing your code with:
cols = ["*"] + [lit(None).cast("string").alias(f) for f in missing_features]
dfs_new = [df.select(cols) for df in dfs]
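As a follow-up, a sketch of the union itself, assuming the dataframes in dfs_new now share the same set of columns (unionByName is used so column order does not matter; it requires Spark 2.3+):

from functools import reduce
from pyspark.sql import DataFrame

combined = reduce(DataFrame.unionByName, dfs_new)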
I have a Spark dataframe with the following data (I use spark-csv to load the data in):
key,value
1,10
2,12
3,0
1,20
Is there anything similar to the Spark RDD reduceByKey which can return a Spark DataFrame as follows (basically, summing up the values for the same key)?
key,value
1,30
2,12
3,0
(I can transform the data to an RDD and do a reduceByKey operation, but is there a more Spark-DataFrame-API way to do this?)
If you don't care about column names you can use groupBy followed by sum:
df.groupBy($"key").sum("value")
otherwise it is better to replace sum with agg:
df.groupBy($"key").agg(sum($"value").alias("value"))
Finally you can use raw SQL:
df.registerTempTable("df")
sqlContext.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key")
See also DataFrame / Dataset groupBy behaviour/optimization
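Since the rest of this thread is PySpark, here is a hedged PySpark equivalent of the snippets above (the Scala $"key" syntax becomes a plain column name, and a SparkSession named spark is assumed for the SQL variant):

from pyspark.sql.functions import sum as sum_

df.groupBy('key').agg(sum_('value').alias('value')).show()

# or with Spark SQL
df.createOrReplaceTempView('df')
spark.sql("SELECT key, SUM(value) AS value FROM df GROUP BY key").show()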
I think user goks missed part of the code; it is not tested code.
A .map should have been used to convert the RDD to a pair RDD, e.g. .map(lambda x: (x, 1)).reduceByKey(...).
reduceByKey is not available on a single-value or regular RDD, only on a pair RDD.
Thanks
How about this? I agree it still converts to an RDD and then back to a dataframe.
df.select('key', 'value').rdd.map(lambda x: (x[0], x[1])).reduceByKey(lambda a, b: a + b).toDF(['key', 'value'])