PySpark Dataframe: Unify Certain Rows

PySpark Dataframe: Unify Certain Rows - dataframe

I'm having some trouble figuring this one out
Here's a simple example:
+---+----+-----+
| Id|Rank|State+
+---+----+-----+
| a| 5| NJ +
| a| 7| GA +
| b| 8| CA +
| b| 1| CA +
+---+----+-----+
I'd like to format this dataframe in a way where if the same Id is in multiple states, have it only store one state. In this example, any row with Id "a" should have state "NJ" instead of "NJ" and "GA".
The result should be something like:
+---+----+-----+
| Id|Rank|State+
+---+----+-----+
| a| 5| NJ +
| a| 7| NJ +
| b| 8| CA +
| b| 1| CA +
+---+----+-----+`
How can this be accompished? Thanks!!

Try first windowing function like:
w = Window().partitionBy("Id").orderBy("Rank")
df.select(col("Id"), col("Rank"), first("State", True).over(w).alias("NewState"))
This will put into "NewState" column the first state according to the rank within id group.
The same thing can easily be expressed in pure SQL, if you want to use it.
BTW, welcome to StackOverflow community!

Related

PySpark: How to concatenate two distinct dataframes?

I have multiple dataframes that I need to concatenate together, row-wise. In pandas, we would typically write: pd.concat([df1, df2]).
This thread: How to concatenate/append multiple Spark dataframes column wise in Pyspark? appears close, but its respective answer:
df1_schema = StructType([StructField("id",IntegerType()),StructField("name",StringType())])
df1 = spark.sparkContext.parallelize([(1, "sammy"),(2, "jill"),(3, "john")])
df1 = spark.createDataFrame(df1, schema=df1_schema)
df2_schema = StructType([StructField("secNo",IntegerType()),StructField("city",StringType())])
df2 = spark.sparkContext.parallelize([(101, "LA"),(102, "CA"),(103,"DC")])
df2 = spark.createDataFrame(df2, schema=df2_schema)
schema = StructType(df1.schema.fields + df2.schema.fields)
df1df2 = df1.rdd.zip(df2.rdd).map(lambda x: x[0]+x[1])
spark.createDataFrame(df1df2, schema).show()
Yields the following error when done on my data at scale: Can only zip RDDs with same number of elements in each partition
How can I join 2 or more data frames that are identical in row length but are otherwise independent of content (they share a similar repeating structure/order but contain no shared data)?
Example expected data looks like:
+---+-----+ +-----+----+ +---+-----+-----+----+
| id| name| |secNo|city| | id| name|secNo|city|
+---+-----+ +-----+----+ +---+-----+-----+----+
| 1|sammy| + | 101| LA| => | 1|sammy| 101| LA|
| 2| jill| | 102| CA| | 2| jill| 102| CA|
| 3| john| | 103| DC| | 3| john| 103| DC|
+---+-----+ +-----+----+ +---+-----+-----+----+

You can create unique IDs with
df1 = df1.withColumn("unique_id", expr("row_number() over (order by (select null))"))
df2 = df2.withColumn("unique_id", expr("row_number() over (order by (select null))"))
then, you can left join them
df1.join(df2, Seq("unique_id"), "left").drop("unique_id")
Final output looks like
+---+----+---+-------+
| id|name|age|address|
+---+----+---+-------+
| 1| a| 7| x|
| 2| b| 8| y|
| 3| c| 9| z|
+---+----+---+-------+

Compare columns from two different dataframes based on id

I have two dataframes to compare, the order of records are different, the name of columns might be different. Have to compare columns (more than one) based on the unique key (id)
Example: consider cataframes df1 and df2
df1:
+---+-------+-----+
| id|student|marks|
+---+-------+-----+
| 1| Vijay| 23|
| 4| Vithal| 24|
| 2| Ram| 21|
| 3| Rahul| 25|
+---+-------+-----+
df2:
+-----+--------+------+
|newId|student1|marks1|
+-----+--------+------+
| 3| Rahul| 25|
| 2| Ram| 23|
| 1| Vijay| 23|
| 4| Vithal| 24|
+-----+--------+------+
Here based on id and newId, I need to compare values studentName and Marks, and need to check that whether the student with same id has same name and marks
In this example student with id 2 has 21 marks but in df2 23 marks

df1.exceptAll(df2).show()
// +---+-------+-----+
// | id|student|marks|
// +---+-------+-----+
// | 2| Ram| 21|
// +---+-------+-----+

I think diff will give the result you are looking for.
scala> df1.diff(df2)
res0: Seq[org.apache.spark.sql.Row] = List([2,Ram,21])

Count types for every time difference from the time of one specific type within a time range with a granularity of one second in pyspark

I have the following time-series data in a DataFrame in pyspark:
(id, timestamp, type)
the id column can be any integer value and many rows of the same id
can exist in the table
the timestamp column is a timestamp represented by an integer (for simplification)
the type column is a string type variable where each distinct
string on the column represents one category. One special category
out of all is 'A'
My question is the following:
Is there any way to compute (with SQL or pyspark DataFrame operations):
the counts of every type
for all the time differences from the timestamp corresponding to all the rows
of type='A' within a time range (e.g. [-5,+5]), with granularity of 1 second
For example, for the following DataFrame:
ts_df = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(1,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
ts_df.show()
+---+----+-----+
| id|type| ts|
+---+----+-----+
| 1| A| 100|
| 2| A| 1000|
| 3| A|10000|
| 1| b| 99|
| 1| b| 99|
| 1| b| 99|
| 2| b| 999|
| 2| b| 999|
| 2| c| 999|
| 2| c| 999|
| 1| d| 999|
| 3| c| 9999|
| 3| c| 9999|
| 3| d| 9999|
| 1| b| 98|
| 1| b| 98|
| 2| b| 998|
| 2| c| 998|
| 3| c| 9998|
+---+----+-----+
for a time difference of -1 second the result should be:
# result for time difference = -1 sec
# b: 5
# c: 4
# d: 2
while for a time difference of -2 seconds the result should be:
# result for time difference = -2 sec
# b: 3
# c: 2
# d: 0
and so on so forth for any time difference within a time range for a granularity of 1 second.
I tried many different ways by using mostly groupBy but nothing seems to work.
I am mostly having difficulties on how to express the time difference from each row of type=A even if I have to do it for one specific time difference.
Any suggestions would be greatly appreciated!
EDIT:
If I only have to do it for one specific time difference time_difference then I could do it with the following way:
time_difference = -1
df_type_A = ts_df.where(F.col("type")=='A').selectExpr("ts as fts")
res = df_type_A.join(ts_df, on=df_type_A.fts+time_difference==ts_df.ts)\
.drop("ts","fts").groupBy(F.col("type")).count()
The the returned res DataFrame will give me exactly what I want for one specific time difference. I create a loop and solve the problem by repeating the same query over and over again.
However, is there any more efficient way than that?
EDIT2 (solution)
So that's how I did it at the end:
df1 = sc.parallelize([
(1,'b',99),(1,'b',99),(1,'b',99),
(2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(2,'d',999),
(3,'c',9999),(3,'c',9999),(3,'d',9999),
(1,'b',98),(1,'b',98),
(2,'b',998),(2,'c',998),
(3,'c',9998)
]).toDF(["id","type","ts"])
df1.show()
df2 = sc.parallelize([
(1,'A',100),(2,'A',1000),(3,'A',10000),
]).toDF(["id","type","ts"]).selectExpr("id as fid","ts as fts","type as ftype")
df2.show()
df3 = df2.join(df1, on=df1.id==df2.fid).withColumn("td", F.col("ts")-F.col("fts"))
df3.show()
df4 = df3.groupBy([F.col("type"),F.col("td")]).count()
df4.show()
Will update performance details as soon as I'll have any.
Thanks!

Another way to solve this problem would be:
Divide existing data-frames in two data-frames - with A and without A
Add a new column in without A df, which is sum of "ts" and time_difference
Join both data frame, group By and count.
Here is a code:
from pyspark.sql.functions import lit
time_difference = 1
ts_df_A = (
ts_df
.filter(ts_df["type"] == "A")
.drop("id")
.drop("type")
)
ts_df_td = (
ts_df
.withColumn("ts_plus_td", lit(ts_df['ts'] + time_difference))
.filter(ts_df["type"] != "A")
.drop("ts")
)
joined_df = ts_df_A.join(ts_df_td, ts_df_A["ts"] == ts_df_td["ts_plus_td"])
agg_df = joined_df.groupBy("type").count()
>>> agg_df.show()
+----+-----+
|type|count|
+----+-----+
| d| 2|
| c| 4|
| b| 5|
+----+-----+
>>>
Let me know if this is what you are looking for?
Thanks,
Hussain Bohra

Spark SQL: Is there a way to distinguish columns with same name?

I have a csv with a header with columns with same name.
I want to process them with spark using only SQL and be able to refer these columns unambiguously.
Ex.:
id name age height name
1 Alex 23 1.70
2 Joseph 24 1.89
I want to get only first name column using only Spark SQL

As mentioned in the comments, I think that the less error prone method would be to have the schema of the input data changed.
Yet, in case you are looking for a quick workaround, you can simply index the duplicated names of the columns.
For instance, let's create a dataframe with three id columns.
val df = spark.range(3)
.select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id| x| id| y| id|
+---+---+---+---+---+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+---+---+---+---+---+
Then I can use toDF to set new column names. Let's consider that I know that only id is duplicated. If we don't, adding the extra logic to figure out which columns are duplicated would not be very difficult.
var i = -1
val names = df.columns.map( n =>
if(n == "id") {
i+=1
s"id_$i"
} else n )
val new_df = df.toDF(names : _*)
new_df.show
+----+---+----+---+----+
|id_0| x|id_1| y|id_2|
+----+---+----+---+----+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+----+---+----+---+----+

concatenate arbitrary long list of matches in SQL subquery

imagine 2 tables (rather stupid example, but for the sake of simplicity, here you go)
words
word_id
letters
letter
word_id
how can i select all words while selecting all letters that belong to a word and concatenating them to said word? it is important that the letters are returned in the order they appear in the table, as the letter may be mixed into other words, but the order is correct.
|word_id| |word_id|letter|
+-------+ +-------+------+
| 1| | 1| H|
| 2| | 2| B|
| 2| Y|
| 1| I|
| 2| E|
should return
|word_id|word|
+-------+----+
| 1| HI|
| 2| BYE|
any way to accomplish this in pure SQL?

Try this:
SELECT word_id, group_concat (letter,'') FROM letters GROUP BY word_id;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

PySpark Dataframe: Unify Certain Rows - dataframe

Related

PySpark: How to concatenate two distinct dataframes?

Compare columns from two different dataframes based on id

Count types for every time difference from the time of one specific type within a time range with a granularity of one second in pyspark

Spark SQL: Is there a way to distinguish columns with same name?

concatenate arbitrary long list of matches in SQL subquery

Categories

Resources