So I have two data frames which I want to join. The catch is that the second table stores comma-separated values, one of which matches the column in Table A. How do I do this in PySpark? Below is an example.
Table A has
+-------+--------------------+
|deal_id| deal_name|
+-------+--------------------+
| 613760|ABCDEFGHI |
| 613740|TEST123 |
| 598946|OMG |
Table B has
+-----------------------------------+---------+
|deal_id                            |deal_type|
+-----------------------------------+---------+
|613760,613761,613762,613763        |Direct De|
|613740,613750,613770,613780,613790 |Direct   |
|598946                             |In       |
Expected result: join Table A and Table B when Table A's deal_id matches one of the comma-separated values in Table B. For instance, TableA.deal_id 613760 is in Table B's 1st row, so I want that row returned.
+-------+--------------------+---------------+
|deal_id| deal_name| deal_type|
+-------+--------------------+---------------+
| 613760|ABCDEFGHI |Direct De |
| 613740|TEST123 |Direct |
| 598946|OMG |In |
Any assistance is appreciated. I need it in PySpark.
Thanks.
Sample data
from pyspark.sql.types import StringType, StructField, StructType

tuples_a = [('613760', 'ABCDEFGHI'),
            ('613740', 'TEST123'),
            ('598946', 'OMG'),
            ]
schema_a = StructType([
    StructField('deal_id', StringType(), nullable=False),
    StructField('deal_name', StringType(), nullable=False)
])

tuples_b = [('613760,613761,613762,613763 ', 'Direct De'),
            ('613740,613750,613770,613780,613790', 'Direct'),
            ('598946', 'In'),
            ]
schema_b = StructType([
    StructField('deal_id', StringType(), nullable=False),
    StructField('deal_type', StringType(), nullable=False)
])

df_a = spark_session.createDataFrame(data=tuples_a, schema=schema_a)
df_b = spark_session.createDataFrame(data=tuples_b, schema=schema_b)
You need to split the column and explode it in order to join.
from pyspark.sql.functions import split, col, explode

# Split the comma-separated list into an array, explode it to one row per id,
# then rename the exploded value back to deal_id so the join key matches.
df_b = df_b.withColumn('split', split(col('deal_id'), ','))\
    .withColumn('exploded', explode(col('split')))\
    .drop('deal_id', 'split')\
    .withColumnRenamed('exploded', 'deal_id')

df_a.join(df_b, on='deal_id', how='left_outer')\
    .show(10, False)
and the expected result
+-------+---------+---------+
|deal_id|deal_name|deal_type|
+-------+---------+---------+
|613760 |ABCDEFGHI|Direct De|
|613740 |TEST123 |Direct |
|598946 |OMG |In |
+-------+---------+---------+
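If you would rather not reshape df_b, a containment join is another option. A minimal sketch, reusing df_a and the original df_b from the sample data (before the explode step); note it matches exact strings, so stray spaces around the IDs would need a trim:

from pyspark.sql import functions as F

# Rename df_b's column to avoid an ambiguous 'deal_id' after the join, then join
# on membership of Table A's deal_id in Table B's comma-separated list.
joined = df_a.join(
    df_b.withColumnRenamed('deal_id', 'deal_ids'),
    on=F.expr("array_contains(split(deal_ids, ','), deal_id)"),
    how='left_outer'
).select('deal_id', 'deal_name', 'deal_type')

joined.show(10, False)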
I have a Spark dataframe with the below sample data.
+--------------+--------------+
| item_cd | item_nbr |
+--------------+--------------+
|20-10767-58V| 98003351|
|20-10087-58V| 87003872|
|20-10087-58V| 97098411|
|20-10i72-YTW| 99003351|
|27-1o121-YTW| 89659352|
|27-10991-YTW| 98678411|
| At81kk00| 98903458|
| Avp12225| 85903458|
| Akb12226| 99003458|
| Ahh12829| 98073458|
| Aff12230| 88803458|
| Ar412231| 92003458|
| Aju12244| 98773458|
+--------------+--------------+
I want to write a condition like this: for each item_cd that has a hyphen (-), do nothing, and for each item_cd that does not have a hyphen, append 4 trailing 0's. Then put the rows that are duplicated on both columns (item_cd, item_nbr) into one dataframe and the unique rows into another dataframe, in PySpark.
Could anyone please help me with this in PySpark?
Here is how it could be done:
import pyspark.sql.functions as F
from pyspark.sql import Window
data = [("20-10767-58V", "98003351"), ("20-10087-58V", "87003872"), ("At81kk00", "98903458"), ("Ahh12829", "98073458"), ("20-10767-58V", "98003351")]
cols = ["item_cd", "item_nbr"]
df = spark.createDataFrame(data, cols)
df.show()
df = df.withColumn("item_cd", when(~df.item_cd.contains("-"), F.concat(df.item_cd, F.lit("0000"))).otherwise(df.item_cd))
df.show()
unique_df = df.select("*").distinct()
unique_df.show()
w = Window.partitionBy(df.columns)
duplicate_df = df.select("*", F.count("*").over(w).alias("cnt"))\
    .where("cnt > 1")\
    .drop("cnt")
duplicate_df.show()
Input df (added duplicate):
+------------+--------+
| item_cd|item_nbr|
+------------+--------+
|20-10767-58V|98003351|
|20-10087-58V|87003872|
| At81kk00|98903458|
| Ahh12829|98073458|
|20-10767-58V|98003351|
+------------+--------+
Unique df:
+------------+--------+
| item_cd|item_nbr|
+------------+--------+
|Ahh128290000|98073458|
|20-10767-58V|98003351|
|20-10087-58V|87003872|
|At81kk000000|98903458|
+------------+--------+
Duplicates df:
+------------+--------+
| item_cd|item_nbr|
+------------+--------+
|20-10767-58V|98003351|
|20-10767-58V|98003351|
+------------+--------+
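Depending on what "unique" should mean, you may want the rows that occur exactly once rather than a de-duplicated frame. A sketch of that variant, reusing the df and the window w from above, computing the count once and splitting the frame in one pass:

counted = df.withColumn("cnt", F.count("*").over(w))

# Rows whose (item_cd, item_nbr) combination occurs exactly once
unique_only_df = counted.where("cnt = 1").drop("cnt")

# All rows whose combination occurs more than once (same as duplicate_df above)
duplicate_df = counted.where("cnt > 1").drop("cnt")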
I have a dataframe which returns the output as
I would like to transpose this into
Can someone help me understand how to prepare the PySpark code to achieve this result dynamically? I have tried unpivot in SQL but had no luck.
df = spark.createDataFrame([
    (78, 20, 19, 90),
], ('Machines', 'Books', 'Vehicles', 'Plants'))
Create a new array-of-structs column that combines the column names and their values, then use the inline function to explode the struct fields. Code below:
import pyspark.sql.functions as F

df.withColumn('tab', F.array(*[F.struct(F.lit(x).alias('Fields'), F.col(x).alias('Count')) for x in df.columns])).selectExpr('inline(tab)').show()
+--------+-----+
| Fields|Count|
+--------+-----+
|Machines| 78|
| Books| 20|
|Vehicles| 19|
| Plants| 90|
+--------+-----+
As mentioned in the unpivot-dataframe tutorial, use:
df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
Or to generalise:
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
Full example:
df = spark.createDataFrame(data=[[78,20,19,90]], schema=['Machines','Books','Vehicles','Plants'])
# Hard coded
# df = df.selectExpr("""stack(4, "Machines", Machines, "Books", Books, "Vehicles", Vehicles, "Plants", Plants) as (Fields, Count)""")
# Generalised
cols = [f'"{c}", {c}' for c in df.columns]
exprs = f"stack({len(cols)}, {', '.join(str(c) for c in cols)}) as (Fields, Count)"
df = df.selectExpr(exprs)
[Out]:
+--------+-----+
|Fields |Count|
+--------+-----+
|Machines|78 |
|Books |20 |
|Vehicles|19 |
|Plants |90 |
+--------+-----+
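If you are on Spark 3.4 or later, DataFrame.unpivot (alias melt) does the same thing without building the stack expression by hand. A sketch against the original wide df from the start of the full example, before the stack is applied:

# ids is empty because every column becomes a (Fields, Count) pair
df_unpivoted = df.unpivot(
    ids=[],
    values=['Machines', 'Books', 'Vehicles', 'Plants'],
    variableColumnName='Fields',
    valueColumnName='Count',
)
df_unpivoted.show()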
I loaded a csv into a DataFrame with pandas.
The format is the following:
Timestamp | 1014.temperature | 1014.humidity | 1015.temperature | 1015.humidity ....
-------------------------------------------------------------------------------------
2017-... | 23.12 | 12.2 | 25.10 | 10.34 .....
The problem is that the '1014' and '1015' numbers are IDs that should go in a separate column.
I would like to end up with the following format for my DF:
TimeStamp | ID | Temperature | Humidity
-----------------------------------------------
. | | |
.
.
.
The CSV is tab separated.
Thanks in advance guys!
import pandas as pd
from io import StringIO
# create sample data frame
s = """Timestamp|1014.temperature|1014.humidity|1015.temperature|1015.humidity
2017|23.12|12.2|25.10|10.34"""
df = pd.read_csv(StringIO(s), sep='|')
df = df.set_index('Timestamp')
# split columns on '.' with list comprehension
l = [col.split('.') for col in df.columns]
# create multi index columns
df.columns = pd.MultiIndex.from_tuples(l)
# stack column level 0, reset the index and rename level_1
final = df.stack(0).reset_index().rename(columns={'level_1': 'ID'})
Timestamp ID humidity temperature
0 2017 1014 12.20 23.12
1 2017 1015 10.34 25.10
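For comparison, a melt-based sketch that gives the same shape (assuming df as freshly read with pd.read_csv above, before set_index, and pandas >= 1.1 for pivot with a list index):

# melt to long format, split '1014.temperature' into ID and measure, pivot back
melted = df.melt(id_vars='Timestamp', var_name='col', value_name='value')
melted[['ID', 'measure']] = melted['col'].str.split('.', n=1, expand=True)
final = (melted.pivot(index=['Timestamp', 'ID'], columns='measure', values='value')
               .reset_index())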
So I have a table (sample)
I'm using PySpark dataframe APIs to filter out the 'NOC's that have never won a gold medal, and here's the code I wrote.
First part of my code
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
spark = SQLContext(sc)
df1 = spark.read.format("csv").options(header = 'true').load("D:\\datasets\\athlete_events.csv")
df = df1.na.replace('NA', '-')
countgdf = gdf.groupBy('NOC').agg(count('Medal').alias('No of Gold medals')).select('NOC').show()
It will generate the output
+---+
|NOC|
+---+
|POL|
|JAM|
|BRA|
|ARM|
|MOZ|
|JOR|
|CUB|
|FRA|
|ALG|
|BRN|
+---+
only showing top 10 rows
The next part of the code is something like
allgdf = df.select('NOC').distinct()
This displays the output:
+-----------+
| NOC|
+-----------+
| DeRuyter|
| POL|
| Russia|
| JAM|
| BUR|
| BRA|
| ARM|
| MOZ|
| CUB|
| JOR|
| Sweden|
| FRA|
| ALG|
| SOM|
| IVB|
|Philippines|
| BRN|
| MAL|
| COD|
| FSM|
+-----------+
Notice the values that are more than 3 characters? Those are supposed to be values of the 'Team' column, but I'm not sure why they are showing up in the 'NOC' column. It's hard to figure out why this is happening, i.e. why there are illegal values in the column.
When I write the final code
final = allgdf.subtract(countgdf).show()
The same thing happens: illegal values appear in the final dataframe's column.
Any help would be appreciated. Thanks.
You should specify the delimiter for your CSV file. By default, Spark uses a comma (,) as the separator.
This can be done, for example, with:
.option("delimiter",";")
The following is an example Dataframe snippet:
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_lid |trace |message |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1103960793391132675|47c10fda9b40407c998c154dc71a9e8c|[app.py:208] Prediction label: {"id": 617, "name": "CENSORED"}, score=0.3874854505062103 |
|1103960793391132676|47c10fda9b40407c998c154dc71a9e8c|[app.py:224] Similarity values: [0.6530804801919593, 0.6359653379418201] |
|1103960793391132677|47c10fda9b40407c998c154dc71a9e8c|[app.py:317] Predict=s3://CENSORED/scan_4745/scan4745_t1_r0_c9_2019-07-15-10-32-43.jpg trait_id=112 result=InferenceResult(predictions=[Prediction(label_id='230', label_name='H3', probability=0.0), Prediction(label_id='231', label_name='Other', probability=1.0)], selected=Prediction(label_id='231', label_name='Other', probability=1.0)). Took 1.3637824058532715 seconds |
+-------------------+--------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I have millions of these log-like structures, which can all be grouped by trace, which is unique to a session.
I'm looking to transform these sets of rows into single rows, essentially mapping over them. For this example I would extract the "id": 617 from the first row, the values 0.6530804801919593, 0.6359653379418201 from the second row, and the Prediction(label_id='231', label_name='Other', probability=1.0) value from the third row.
Then I would compose a new table having the columns:
| trace | id | similarity | selected |
with the values:
| 47c10fda9b40407c998c154dc71a9e8c | 617 | 0.6530804801919593, 0.6359653379418201 | 231 |
How should I implement this group-map transform over several rows in PySpark?
I've written the below example in Scala for my own convenience, but it should translate readily to Pyspark.
1) Create the new columns in your dataframe via regexp_extract on the "message" field. This will produce the desired values if the regex matches, or empty strings if not:
scala> val dss = ds.select(
| 'trace,
| regexp_extract('message, "\"id\": (\\d+),", 1) as "id",
| regexp_extract('message, "Similarity values: \\[(\\-?[0-9\\.]+, \\-?[0-9\\.]+)\\]", 1) as "similarity",
| regexp_extract('message, "selected=Prediction\\(label_id='(\\d+)'", 1) as "selected"
| )
dss: org.apache.spark.sql.DataFrame = [trace: string, id: string ... 2 more fields]
scala> dss.show(false)
+--------------------------------+---+--------------------------------------+--------+
|trace |id |similarity |selected|
+--------------------------------+---+--------------------------------------+--------+
|47c10fda9b40407c998c154dc71a9e8c|617| | |
|47c10fda9b40407c998c154dc71a9e8c| |0.6530804801919593, 0.6359653379418201| |
|47c10fda9b40407c998c154dc71a9e8c| | |231 |
+--------------------------------+---+--------------------------------------+--------+
2) Group by "trace" and eliminate the cases where the regex didn't match. The quick and dirty way (shown below) is to select the max of each column, but you might need to do something more sophisticated if you expect to encounter more than one match per trace:
scala> val ds_final = dss.groupBy('trace).agg(max('id) as "id", max('similarity) as "similarity", max('selected) as "selected")
ds_final: org.apache.spark.sql.DataFrame = [trace: string, id: string ... 2 more fields]
scala> ds_final.show(false)
+--------------------------------+---+--------------------------------------+--------+
|trace |id |similarity |selected|
+--------------------------------+---+--------------------------------------+--------+
|47c10fda9b40407c998c154dc71a9e8c|617|0.6530804801919593, 0.6359653379418201|231 |
+--------------------------------+---+--------------------------------------+--------+
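For reference, a PySpark sketch of the same two steps (assuming the log DataFrame is called df and has the 'trace' and 'message' columns shown above):

from pyspark.sql import functions as F

# 1) Extract the candidate fields; rows where the regex doesn't match yield empty strings.
dss = df.select(
    'trace',
    F.regexp_extract('message', r'"id": (\d+),', 1).alias('id'),
    F.regexp_extract('message', r'Similarity values: \[(-?[0-9.]+, -?[0-9.]+)\]', 1).alias('similarity'),
    F.regexp_extract('message', r"selected=Prediction\(label_id='(\d+)'", 1).alias('selected'),
)

# 2) Collapse to one row per trace; max() discards the empty strings.
ds_final = dss.groupBy('trace').agg(
    F.max('id').alias('id'),
    F.max('similarity').alias('similarity'),
    F.max('selected').alias('selected'),
)
ds_final.show(truncate=False)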
I ended up using something along the lines of:
import re

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

expected_schema = StructType([
    StructField("event_timestamp", TimestampType(), False),
    StructField("trace", StringType(), False),
    ...
])
@F.pandas_udf(expected_schema, F.PandasUDFType.GROUPED_MAP)
def transform(pdf):
    # Input/output are both a pandas.DataFrame
    output = {}
    for l in pdf.to_dict(orient='records'):
        x = re.findall(r'^(\[.*:\d+\]) (.*)', l['message'])[0][1]
        ...
    return pd.DataFrame(data=[output])

df.groupby('trace').apply(transform)
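As a side note, on Spark 3.0+ the same grouped-map idea is usually expressed with applyInPandas instead of the deprecated PandasUDFType.GROUPED_MAP decorator; a sketch, with transform defined as a plain (undecorated) function:

# transform takes and returns a pandas.DataFrame, one call per 'trace' group
result = df.groupby('trace').applyInPandas(transform, schema=expected_schema)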