SQL on Spark: How do I get all values of DISTINCT?

So, assume I have the following table:
Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue
I would like to get a table of the distinct colors for each name - how many and their values. Meaning, something like this:
Name | Distinct | Values
--------------------------------------
John | 2 | Blue, Yellow
Greg | 2 | Red, Blue
Any ideas how to do so?

collect_list will give you a list without removing duplicates.
collect_set will automatically remove duplicates
so just
select
    Name,
    count(distinct Color) as Distinct, -- not a very good column name
    collect_set(Color) as Values
from TblName
group by Name
This feature has been available since Spark 1.6.0; check it out:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala
/**
 * Aggregate function: returns a set of objects with duplicate elements eliminated.
 *
 * For now this is an alias for the collect_set Hive UDAF.
 *
 * @group agg_funcs
 * @since 1.6.0
 */
def collect_set(columnName: String): Column = collect_set(Column(columnName))
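If you prefer the DataFrame API over SQL, an equivalent would be the following. This is a minimal sketch I am adding for illustration, assuming Spark 2.x+ (SparkSession) and the sample data from the question:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("John", "Blue"), ("Greg", "Red"), ("John", "Yellow"), ("Greg", "Red"), ("Greg", "Blue")],
    ["Name", "Color"])

result = df.groupBy("Name").agg(
    F.countDistinct("Color").alias("Distinct"),   # how many distinct colors per name
    F.collect_set("Color").alias("Values"))       # the deduplicated colors themselves
result.show()
# John -> 2, [Blue, Yellow]; Greg -> 2, [Red, Blue] (set order may vary)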

For PySpark: I come from an R/pandas background, so I actually find Spark DataFrames a little easier to work with.
To do this:
1. Set up a Spark SQL context
2. Read your file into a dataframe
3. Register your dataframe as a temp table
4. Query it directly using SQL syntax
5. Save the results as objects, output to files... do your thing
Here's a class I created to do this:
import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

class SQLspark():
    def __init__(self, local_dir='./', hdfs_dir='/users/', master='local', appname='spark_app', spark_mem=2):
        self.local_dir = local_dir
        self.hdfs_dir = hdfs_dir
        self.master = master
        self.appname = appname
        self.spark_mem = int(spark_mem)
        self.conf = (SparkConf()
                     .setMaster(self.master)
                     .setAppName(self.appname)
                     # note: spark.executor.memory normally expects a size string like "2g"
                     .set("spark.executor.memory", self.spark_mem))
        self.sc = SparkContext(conf=self.conf)
        self.sqlContext = SQLContext(self.sc)

    def file_to_df(self, input_file):
        # import the file as a dataframe; with inferSchema the column types are guessed
        df = (self.sqlContext.read
              .format("com.databricks.spark.csv")
              .option("header", "true")
              .option("delimiter", "\t")
              .option("inferSchema", "true")
              .load(input_file))
        # cache the df object to avoid rebuilding it each time
        df.cache()
        # register as a temp table for querying, use 'spark_df' as the table name
        df.registerTempTable("spark_df")
        return df

    # you can also cast a Spark dataframe to a pandas df
    def sparkDf_to_pandasDf(self, input_df):
        pandas_df = input_df.toPandas()
        return pandas_df

    def find_distinct(self, col_name):
        my_query = self.sqlContext.sql("""SELECT DISTINCT {} FROM spark_df""".format(col_name))
        # now do your thing with the results etc
        my_query.show()
        my_query.count()
        my_query.collect()

###############
if __name__ == '__main__':
    # instantiate the class; see __init__ for the variables to pass in
    spark = SQLspark(os.getcwd(), 'hdfs_loc', "local", "etl_test", 10)
    # specify the input file to process
    tsv_infile = 'path/to/file'
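    # --- hedged usage sketch (my addition, not part of the original answer) ---
    # load the TSV into a dataframe; file_to_df also registers the temp table 'spark_df'
    df = spark.file_to_df(tsv_infile)
    # run the distinct query from the class on an assumed column name, e.g. 'Color'
    spark.find_distinct('Color')
    # or reuse the collect_set approach from the first answer directly
    # (on very old 1.x versions this may require a HiveContext):
    spark.sqlContext.sql("""
        SELECT Name,
               count(DISTINCT Color) AS distinct_colors,
               collect_set(Color) AS color_values
        FROM spark_df
        GROUP BY Name
    """).show()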

Related

How to convert String to JSON in Spark SQL?

I was working with the "Delta Logs" of Delta Table and the data of Delta table was stored in the Azure Blob Storage.
I used the below query to fetch the JSON data of Delta Log:
SELECT * FROM json.`/mnt/blob/deltaTables/employees/_delta_log/00000000000000000000.json`
I was able to fetch the JSON data as a table in the Databricks notebook (screenshots of the result table and of add.stats omitted).
Now I traverse through the JSON data in the query itself:
SELECT add.stats FROM json.`/mnt/blob/deltaTables/employees/_delta_log/00000000000000000000.json`
But beyond stats the JSON data is stored as a string, so I am not able to traverse any further.
I am attaching the JSON file here for reference:
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"d21b496a-7282-49c9-a71c-3013d780fbeb","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"Id\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"Name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"city\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"age\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"department\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"degree\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"workingLocation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"maxEducation\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"experience\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"jobRole\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1664788829119}}
{"add":{"path":"part-00000-91e882d3-f9bc-481f-ba50-a7d061040401-c000.snappy.parquet","partitionValues":{},"size":6337133,"modificationTime":1664788872000,"dataChange":true,"stats":"{\"numRecords\":136251,\"minValues\":{\"Id\":1,\"Name\":\"Robin\",\"city\":\"Jhunjhunu\",\"age\":22,\"department\":\"Data Integration\",\"degree\":\"bsc\",\"workingLocation\":\"Jaipur\",\"maxEducation\":\"Graduation\",\"experience\":2,\"jobRole\":\"Data Engineer\"},\"maxValues\":{\"Id\":136251,\"Name\":\"Robin99999\",\"city\":\"Jhunjhunu99999\",\"age\":136272,\"department\":\"Data Integration99999\",\"degree\":\"bsc99999\",\"workingLocation\":\"Jaipur99999\",\"maxEducation\":\"Graduation99999\",\"experience\":136252,\"jobRole\":\"Data Engineer99999\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000000","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00001-25af3704-bc8d-4201-bbe8-6e6b07864e40-c000.snappy.parquet","partitionValues":{},"size":5969233,"modificationTime":1664788870000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":136252,\"Name\":\"Robin136251\",\"city\":\"Jhunjhunu136251\",\"age\":136273,\"department\":\"Data Integration136251\",\"degree\":\"bsc136251\",\"workingLocation\":\"Jaipur136251\",\"maxEducation\":\"Graduation136251\",\"experience\":136253,\"jobRole\":\"Data Engineer136251\"},\"maxValues\":{\"Id\":264149,\"Name\":\"Robin264148\",\"city\":\"Jhunjhunu264148\",\"age\":264170,\"department\":\"Data Integration264148\",\"degree\":\"bsc264148\",\"workingLocation\":\"Jaipur264148\",\"maxEducation\":\"Graduation264148\",\"experience\":264150,\"jobRole\":\"Data Engineer264148\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000001","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00002-0fbaae5e-371e-45ea-b2ce-b959248ba88a-c000.snappy.parquet","partitionValues":{},"size":5961349,"modificationTime":1664788871000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":264150,\"Name\":\"Robin264149\",\"city\":\"Jhunjhunu264149\",\"age\":264171,\"department\":\"Data Integration264149\",\"degree\":\"bsc264149\",\"workingLocation\":\"Jaipur264149\",\"maxEducation\":\"Graduation264149\",\"experience\":264151,\"jobRole\":\"Data Engineer264149\"},\"maxValues\":{\"Id\":392047,\"Name\":\"Robin392046\",\"city\":\"Jhunjhunu392046\",\"age\":392068,\"department\":\"Data Integration392046\",\"degree\":\"bsc392046\",\"workingLocation\":\"Jaipur392046\",\"maxEducation\":\"Graduation392046\",\"experience\":392048,\"jobRole\":\"Data Engineer392046\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000002","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00003-1c8ac662-7680-42b8-8ffb-eef96ad37085-c000.snappy.parquet","partitionValues":{},"size":5961263,"modificationTime":1664788862000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":392048,\"Name\":\"Robin392047\",\"city\":\"Jhunjhunu392047\",\"age\":392069,\"department\":\"Data Integration392047\",\"degree\":\"bsc392047\",\"workingLocation\":\"Jaipur392047\",\"maxEducation\":\"Graduation392047\",\"experience\":392049,\"jobRole\":\"Data Engineer392047\"},\"maxValues\":{\"Id\":519945,\"Name\":\"Robin519944\",\"city\":\"Jhunjhunu519944\",\"age\":519966,\"department\":\"Data Integration519944\",\"degree\":\"bsc519944\",\"workingLocation\":\"Jaipur519944\",\"maxEducation\":\"Graduation519944\",\"experience\":519946,\"jobRole\":\"Data Engineer519944\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000003","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00004-5c355575-1751-464f-93a5-672ae1f60c29-c000.snappy.parquet","partitionValues":{},"size":5990168,"modificationTime":1664788867000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":519946,\"Name\":\"Robin519945\",\"city\":\"Jhunjhunu519945\",\"age\":519967,\"department\":\"Data Integration519945\",\"degree\":\"bsc519945\",\"workingLocation\":\"Jaipur519945\",\"maxEducation\":\"Graduation519945\",\"experience\":519947,\"jobRole\":\"Data Engineer519945\"},\"maxValues\":{\"Id\":647843,\"Name\":\"Robin647842\",\"city\":\"Jhunjhunu647842\",\"age\":647864,\"department\":\"Data Integration647842\",\"degree\":\"bsc647842\",\"workingLocation\":\"Jaipur647842\",\"maxEducation\":\"Graduation647842\",\"experience\":647844,\"jobRole\":\"Data Engineer647842\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000004","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00005-fac91d9b-9528-4381-8983-419261e7b6ba-c000.snappy.parquet","partitionValues":{},"size":5962518,"modificationTime":1664788865000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":647844,\"Name\":\"Robin647843\",\"city\":\"Jhunjhunu647843\",\"age\":647865,\"department\":\"Data Integration647843\",\"degree\":\"bsc647843\",\"workingLocation\":\"Jaipur647843\",\"maxEducation\":\"Graduation647843\",\"experience\":647845,\"jobRole\":\"Data Engineer647843\"},\"maxValues\":{\"Id\":775741,\"Name\":\"Robin775740\",\"city\":\"Jhunjhunu775740\",\"age\":775762,\"department\":\"Data Integration775740\",\"degree\":\"bsc775740\",\"workingLocation\":\"Jaipur775740\",\"maxEducation\":\"Graduation775740\",\"experience\":775742,\"jobRole\":\"Data Engineer775740\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000005","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00006-41366707-d8cc-1 QQQQQQQAQQQQQQQQQQ46d2-b781-4d9bcc5b5210-c000.snappy.parquet","partitionValues":{},"size":5967367,"modificationTime":1664788866000,"dataChange":true,"stats":"{\"numRecords\":127898,\"minValues\":{\"Id\":775742,\"Name\":\"Robin775741\",\"city\":\"Jhunjhunu775741\",\"age\":775763,\"department\":\"Data Integration775741\",\"degree\":\"bsc775741\",\"workingLocation\":\"Jaipur775741\",\"maxEducation\":\"Graduation775741\",\"experience\":775743,\"jobRole\":\"Data Engineer775741\"},\"maxValues\":{\"Id\":903639,\"Name\":\"Robin903638\",\"city\":\"Jhunjhunu903638\",\"age\":903660,\"department\":\"Data Integration903638\",\"degree\":\"bsc903638\",\"workingLocation\":\"Jaipur903638\",\"maxEducation\":\"Graduation903638\",\"experience\":903640,\"jobRole\":\"Data Engineer903638\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000006","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"add":{"path":"part-00007-8c0a3670-cb24-4cdc-ac37-a9f173c7fe71-c000.snappy.parquet","partitionValues":{},"size":4498552,"modificationTime":1664788864000,"dataChange":true,"stats":"{\"numRecords\":96361,\"minValues\":{\"Id\":903640,\"Name\":\"Robin903639\",\"city\":\"Jhunjhunu903639\",\"age\":903661,\"department\":\"Data Integration903639\",\"degree\":\"bsc903639\",\"workingLocation\":\"Jaipur903639\",\"maxEducation\":\"Graduation903639\",\"experience\":903641,\"jobRole\":\"Data Engineer903639\"},\"maxValues\":{\"Id\":1000000,\"Name\":\"Robin999999\",\"city\":\"Jhunjhunu999999\",\"age\":1000021,\"department\":\"Data Integration999999\",\"degree\":\"bsc999999\",\"workingLocation\":\"Jaipur999999\",\"maxEducation\":\"Graduation999999\",\"experience\":1000001,\"jobRole\":\"Data Engineer999999\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}","tags":{"INSERTION_TIME":"1664788862000007","OPTIMIZE_TARGET_SIZE":"268435456"}}}
{"commitInfo":{"timestamp":1664788877620,"userId":"2102279527814428","userName":"robin30121999#gmail.com","operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"notebook":{"notebookId":"2204946593481188"},"clusterId":"1003-090126-rztrsnm4","isolationLevel":"WriteSerializable","isBlindAppend":false,"operationMetrics":{"numFiles":"8","numOutputRows":"1000000","numOutputBytes":"46647583"},"engineInfo":"Databricks-Runtime/10.4.x-scala2.12","txnId":"17e89e4f-380f-4888-9aaf-2ada42e13b8c"}}
Ultimately, I want to fetch the maximum value of experience using SQL
Starting with the example "stats" column of type string which you have...
from pyspark.sql import functions as F
df = spark.createDataFrame(
[("{\"numRecords\":136251,\"minValues\":{\"Id\":1,\"Name\":\"Robin\",\"city\":\"Jhunjhunu\",\"age\":22,\"department\":\"Data Integration\",\"degree\":\"bsc\",\"workingLocation\":\"Jaipur\",\"maxEducation\":\"Graduation\",\"experience\":2,\"jobRole\":\"Data Engineer\"},\"maxValues\":{\"Id\":136251,\"Name\":\"Robin99999\",\"city\":\"Jhunjhunu99999\",\"age\":136272,\"department\":\"Data Integration99999\",\"degree\":\"bsc99999\",\"workingLocation\":\"Jaipur99999\",\"maxEducation\":\"Graduation99999\",\"experience\":136252,\"jobRole\":\"Data Engineer99999\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}",),
("{\"numRecords\":127898,\"minValues\":{\"Id\":136252,\"Name\":\"Robin136251\",\"city\":\"Jhunjhunu136251\",\"age\":136273,\"department\":\"Data Integration136251\",\"degree\":\"bsc136251\",\"workingLocation\":\"Jaipur136251\",\"maxEducation\":\"Graduation136251\",\"experience\":136253,\"jobRole\":\"Data Engineer136251\"},\"maxValues\":{\"Id\":264149,\"Name\":\"Robin264148\",\"city\":\"Jhunjhunu264148\",\"age\":264170,\"department\":\"Data Integration264148\",\"degree\":\"bsc264148\",\"workingLocation\":\"Jaipur264148\",\"maxEducation\":\"Graduation264148\",\"experience\":264150,\"jobRole\":\"Data Engineer264148\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}",),
("{\"numRecords\":127898,\"minValues\":{\"Id\":264150,\"Name\":\"Robin264149\",\"city\":\"Jhunjhunu264149\",\"age\":264171,\"department\":\"Data Integration264149\",\"degree\":\"bsc264149\",\"workingLocation\":\"Jaipur264149\",\"maxEducation\":\"Graduation264149\",\"experience\":264151,\"jobRole\":\"Data Engineer264149\"},\"maxValues\":{\"Id\":392047,\"Name\":\"Robin392046\",\"city\":\"Jhunjhunu392046\",\"age\":392068,\"department\":\"Data Integration392046\",\"degree\":\"bsc392046\",\"workingLocation\":\"Jaipur392046\",\"maxEducation\":\"Graduation392046\",\"experience\":392048,\"jobRole\":\"Data Engineer392046\"},\"nullCount\":{\"Id\":0,\"Name\":0,\"city\":0,\"age\":0,\"department\":0,\"degree\":0,\"workingLocation\":0,\"maxEducation\":0,\"experience\":0,\"jobRole\":0}}",)],
["stats"])
You can use from_json, providing a schema that describes the path to the object you need ("experience"), to extract that object together with the structure leading to it. Then you can drop the surrounding structure by selecting the struct path to it (F.col("exp.maxValues.experience")).
df = df.withColumn("exp", F.from_json("stats", "maxValues struct<experience:long>"))
df = df.withColumn("exp", F.col("exp.maxValues.experience"))
df.show()
# +--------------------+------+
# | stats| exp|
# +--------------------+------+
# |{"numRecords":136...|136252|
# |{"numRecords":127...|264150|
# |{"numRecords":127...|392048|
# +--------------------+------+
In SQL you could do it like this (after registering the DataFrame as a temp view):
df.createOrReplaceTempView("df")
spark.sql("""
    SELECT from_json(stats, 'maxValues struct<experience:long>').maxValues.experience AS exp
    FROM df
""").show()
# +------+
# | exp|
# +------+
# |136252|
# |264150|
# |392048|
# +------+
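And if the goal is the single maximum experience across all the files in the log, you can aggregate directly over the Delta log path from the question. A small sketch; the WHERE clause just skips the protocol/metaData/commitInfo lines that have no add field:
spark.sql("""
    SELECT max(from_json(add.stats, 'maxValues struct<experience:long>').maxValues.experience) AS max_experience
    FROM json.`/mnt/blob/deltaTables/employees/_delta_log/00000000000000000000.json`
    WHERE add IS NOT NULL
""").show()
# based on the log shown above, this should return 1000001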

Grouping alternative items with PySpark

The sample of the dataset I am working on:
# Creating the DataFrame
test = sqlContext.createDataFrame([(1,2),(2,1),
                                   (1,3),(2,3),
                                   (3,2),(3,1),
                                   (4,5),(5,4)],
                                  ['cod_item','alter_cod'])
And it looks like this after grouping the equivalent items in lists:
test.createOrReplaceTempView("teste")
teste = spark.sql("""select cod_item,
collect_list(alter_cod) as alternative_item
from teste
group by cod_item""")
In the first column, I have certain items and in the second column, I have items that are equivalent. I would like, for each list, to have only one item that represents it.
I would like the final dataframe to keep only one representative row per group of equivalent items (desired-output images omitted), where the item in cod_item represents its respective equivalent items.
After collect_list, keep only the rows where every alter_cod is greater than cod_item (i.e. cod_item is the smallest item in its group). This method would work on strings too.
from pyspark.sql import functions as F

test = (test
    .groupBy('cod_item')
    .agg(F.collect_list('alter_cod').alias('alter_cod'))
    .filter(F.forall('alter_cod', lambda x: x > F.col('cod_item')))
)
test.show()
# +--------+---------+
# |cod_item|alter_cod|
# +--------+---------+
# | 1| [2, 3]|
# | 4| [5]|
# +--------+---------+
Or add one line to your SQL:
select cod_item,
collect_list(alter_cod) as alternative_item
from teste
group by cod_item
having forall(alternative_item, x -> x > cod_item)
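For reference, the full SQL version run against the temp view registered in the question would look like this. A small sketch; if your Spark version does not resolve the alias inside HAVING, repeat the collect_list(alter_cod) expression there:
teste = spark.sql("""select cod_item,
       collect_list(alter_cod) as alternative_item
from teste
group by cod_item
having forall(alternative_item, x -> x > cod_item)""")
teste.show()
# expected to match the DataFrame result above: 1 -> [2, 3], 4 -> [5]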

How do I replace column after encrypting it by using Spark (PySpark)?

I have a question about replacing personal information with encrypted data using Spark.
Let's say for example, if I have a table like:
std_name | phone_number
------------------------------
John | 585-1243-2156
Susan | 585-4567-2156
I want to change phone_number to encrypted form like:
std_name | phone_number
------------------------------
John | avawehna'vqqa
Susan | vabdsvwegq'qb
I have tried using withColumn with udf, but it does not work well.
Can someone help me out?
You haven't provided your encryption function, but I will assume the problem was something simple. If you create a UDF, it is run separately for every row, so you can use plain Python inside it.
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('John', '585-1243-2156'),
     ('Susan', '585-4567-2156')],
    ['std_name', 'phone_number']
)

@F.udf
def encrypting(data):
    # Encrypting logic (placeholder):
    encrypted_data = 'xyz' + data[::-1].replace('-', 'w')
    return encrypted_data
df = df.withColumn('phone_number', encrypting('phone_number'))
df.show()
# +--------+----------------+
# |std_name| phone_number|
# +--------+----------------+
# | John|xyz6512w3421w585|
# | Susan|xyz6512w7654w585|
# +--------+----------------+
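If you need real encryption rather than placeholder logic, note that Spark 3.3+ also ships a built-in aes_encrypt SQL function, so you can avoid a Python UDF entirely. A minimal sketch, assuming Spark 3.3+; the key literal here is just an illustration and would normally come from a secret store:
from pyspark.sql import functions as F

aes_key = "0123456789abcdef"  # must be 16, 24, or 32 bytes; do not hard-code in real jobs

df = df.withColumn(
    "phone_number",
    F.expr(f"base64(aes_encrypt(phone_number, '{aes_key}'))"))  # base64 keeps the ciphertext printable
df.show(truncate=False)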

(pyspark) how to make dataframes which have no same user_id mutually

I am trying to build two user_id dataframes that have no user_ids in common in PySpark.
So I wrote the code below:
import pyspark.sql.functions as f
query = "select * from tb_original"
df_original = spark.sql(query)
df_original = df_original.select("user_id").distinct()
df_a = df_original.sort(f.rand()).limit(10000)
df_a.count()
# df_a: 10000
df_b = df_original.join(df_a,on="user_id",how="left_anti").sort(f.rand()).limit(10000)
df_b.count()
# df_b: 10000
df_a.join(df_b,on="user_id",how="left_anti").count()
# df_a - df_b = 9998
# What?????
As a result, df_a and df_b end up sharing 2 user_ids... sometimes 1, or 0.
The code looks fine, so maybe this happens because of Spark's lazy evaluation combined with the non-deterministic sort(f.rand())...
I need to solve this so that I can collect two user_id dataframes with no user_id in common.
Since you want to generate two different sets of users from a given pool of users with no overlap, you may use this simple trick:
from pyspark.sql.functions import monotonically_increasing_id
import pyspark.sql.functions as f
#"Creation of Original DF"
query = "select * from tb_original"
df_original = spark.sql(query)
df_original = df_original.select("user_id").distinct()
df_original =df.withColumn("UNIQUE_ID", monotonically_increasing_id())
number_groups_needed=2 ## you can adjust the number of group you need for your use case
dfa=df_original.filter(df_original.UNIQUE_ID % number_groups_needed ==0)
dfb=df_original.filter(df_original.UNIQUE_ID % number_groups_needed ==1)
##dfa and dfb will not have any overlap for user_id
Ps- if your user_id is itself a integer you don't need to create a new UNIQUE_ID column you can use it directly .
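To double-check that the two groups really are disjoint, you can reuse the anti-join count from the question (a small sketch):
# each anti-join should return the full group, i.e. no shared user_id
assert dfa.join(dfb, on="user_id", how="left_anti").count() == dfa.count()
assert dfb.join(dfa, on="user_id", how="left_anti").count() == dfb.count()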
I chose the randomSplit function that PySpark supports.
df_a,df_b = df_original.randomSplit([0.6,0.4])
df_a = df_a.limit(10000)
df_a.count()
# 10000
df_b = df_b.limit(10000)
df_b.count()
# 10000
df_a.join(df_b,on="user_id",how="left_anti").count()
# 10000
There is never any conflict between df_a and df_b anymore!

join two dataframes on common column

I want to join two data sources, orders and customers:
orders is an SQL Server table:
orderid | customerid | orderdate | ordercost
--------|------------|-----------|----------
12000 | 1500 | 2008-08-09 | 38610
and customers is a csv file:
customerid,first_name,last_name,starting_date,ending_date,country
1500,Sian,Read,2008-01-07,2010-01-07,Greenland
I want to join these two tables in my Python application, so I wrote the following code:
import pypyodbc
import pandas as pd

# Connect to SQL Server with the pypyodbc library
connection = pypyodbc.connect("connection string here")
cursor = connection.cursor()
cursor.execute("SELECT * FROM orders")
result = cursor.fetchall()
# convert the result to a pandas DataFrame
df1 = pd.DataFrame(result, columns=['orderid', 'customerid', 'orderdate', 'ordercost'])
# Read the CSV file
df2 = pd.read_csv(customer_csv)
# Merge the two dataframes
merged = pd.merge(df1, df2, on='customerid', how='inner')
print(merged[['first_name', 'country']])
I expect
first_name | country
-----------|--------
Sian | Greenland
But I get an empty result.
When I run this code with two data frames that both come from CSV files, it works fine. Any help?
Thanks.
I think the problem is that the customerid column has a different dtype in each DataFrame, so nothing matches.
So you need to convert both columns to int, or both to str.
df1['customerid'] = df1['customerid'].astype(int)
df2['customerid'] = df2['customerid'].astype(int)
Or:
df1['customerid'] = df1['customerid'].astype(str)
df2['customerid'] = df2['customerid'].astype(str)
It is also possible to omit how='inner', because it is the default value for merge:
merged= pd.merge( df1, df2, on= 'customerid')
An empty dataframe result from pd.merge means you don't have any matching values across the two frames. Have you checked the type of the data? Use
df1['customerid'].dtype
to check.
As well as converting after importing (as suggested in the other answer), you can also tell pandas what dtype you want when you read the csv:
df2 = pd.read_csv(customer_csv, dtype={'customerid': str})
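Putting both suggestions together, the check-then-merge flow looks roughly like this (a small sketch; column names follow the question):
import pandas as pd

# confirm the dtype mismatch first
print(df1['customerid'].dtype, df2['customerid'].dtype)

# align the dtypes, then merge (how='inner' is the default)
df1['customerid'] = df1['customerid'].astype(int)
df2['customerid'] = df2['customerid'].astype(int)
merged = pd.merge(df1, df2, on='customerid')
print(merged[['first_name', 'country']])   # expect: Sian | Greenland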