Join and combine dataframe by array intersection - sql

I want to join three different tables using an array of aliases as the join condition:
Table 1:
table_1 = spark.createDataFrame([("T1", ['a','b','c']), ("T2", ['d','e','f'])], ["id", "aliases"])
Table 2:
table_2 = spark.createDataFrame([("P1", ['a','h','e']), ("P2", ['j','k','l'])], ["id", "aliases"])
Table 3:
table_3 = spark.createDataFrame([("G1", ['a','n','o']), ("G2", ['p','q','l']), ("G3", ['c','z'])], ["id", "aliases"])
And I want to get a table like this:
aliases                           table_1_ids    table_2_ids    table_3_ids
[n, b, h, o, a, e, d, c, f, z]    [T1, T2]       [P1]           [G1, G3]
[k, q, j, p, l]                   []             [P2]           [G2]
Here all the related aliases end up in the same row, and no ID from the three initial tables is repeated. In other words, I am trying to group by common aliases and to collect all the different IDs in which these aliases can be found.
I have used Spark SQL for the code examples, but feel free to use PySpark or Pandas.
Thanks in advance.

Well, I have thought about it and I think that what I described is a graph problem. More precisely, I was trying to find all connected components: the 'aliases' and the 'ids' are the graph vertices, and once all the components have been found (all the subgraphs that are not connected to any other subgraph), the IDs are extracted from the results. It is very important to be able to differentiate the IDs from the values (aliases).
To implement a solution, I have used Graphframes (Thanks to this comment):
from graphframes import GraphFrame
import pyspark.sql.functions as f

df = table_1.unionAll(table_2).unionAll(table_3)
# Edge columns must be called 'src' and 'dst'.
edgesDF = df.select(f.explode("aliases").alias("src"), f.col("id").alias("dst"))
verticesDF = edgesDF.select('src').union(edgesDF.select('dst'))
verticesDF = verticesDF.withColumnRenamed('src', 'id')

graph = GraphFrame(verticesDF, edgesDF)
components_df = graph.connectedComponents(algorithm="graphx")
components_grouped_df = components_df.groupBy("component").agg(f.collect_set("id").alias("aliases"))
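Note: the algorithm="graphx" variant runs without extra setup; if you use the default connected-components algorithm instead, GraphFrames requires a Spark checkpoint directory to be set before calling connectedComponents, for example (the path is just an example):

spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")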
So now we will have a Dataframe like this:
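With the sample tables above it should contain two components, roughly like this (the component IDs are arbitrary long values assigned by the algorithm, and the order inside collect_set is not deterministic):

component       aliases
<component A>   [T1, T2, P1, G1, G3, a, b, c, d, e, f, h, n, o, z]
<component B>   [P2, G2, j, k, l, p, q]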
And as we want each table's IDs in a separate column, we have to extract them from the 'aliases' column and create three new columns. To do so, we will use a regex and a UDF:
import re
from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

df_schema = StructType([
    StructField("table_1_ids", ArrayType(StringType()), True),
    StructField("table_2_ids", ArrayType(StringType()), True),
    StructField("table_3_ids", ArrayType(StringType()), True),
    StructField("aliases", ArrayType(StringType()), True)
])
@udf(returnType=df_schema)
def get_correct_results(aliases):
    regex_table_1 = r"(T\d)"
    regex_table_2 = r"(P\d)"
    regex_table_3 = r"(G\d)"
    table_1_ids = []
    table_2_ids = []
    table_3_ids = []
    elems_to_remove = []
    for elem in aliases:
        result_table_1 = re.search(regex_table_1, elem)
        result_table_2 = re.search(regex_table_2, elem)
        result_table_3 = re.search(regex_table_3, elem)
        if result_table_1:
            elems_to_remove.append(elem)
            table_1_ids.append(result_table_1.group(1))
        elif result_table_2:
            elems_to_remove.append(elem)
            table_2_ids.append(result_table_2.group(1))
        elif result_table_3:
            elems_to_remove.append(elem)
            table_3_ids.append(result_table_3.group(1))
    return {'table_1_ids': list(set(table_1_ids)),
            'table_2_ids': list(set(table_2_ids)),
            'table_3_ids': list(set(table_3_ids)),
            'aliases': list(set(aliases) - set(elems_to_remove))}
So, finally, we apply the UDF to 'components_grouped_df' to build the final DataFrame:
master_df = components_grouped_df.withColumn("return", get_correct_results(f.col("aliases")))\
    .selectExpr("component as row_id", "return.aliases as aliases", "return.table_1_ids as table_1_ids",
                "return.table_2_ids as table_2_ids", "return.table_3_ids as table_3_ids")
And the final DF will look like this:
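Which is the table described at the top of the question, with row_id holding the arbitrary component ID:

row_id          aliases                           table_1_ids   table_2_ids   table_3_ids
<component A>   [n, b, h, o, a, e, d, c, f, z]    [T1, T2]      [P1]          [G1, G3]
<component B>   [k, q, j, p, l]                   []            [P2]          [G2]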

Related

Filter dataframe based on condition before groupby

Suppose I have a dataframe like this
Create sample dataframe:
import pandas as pd
import numpy as np

data = {
    'gender': np.random.choice(['m', 'f'], size=100),
    'vaccinated': np.random.choice([0, 1], size=100),
    'got sick': np.random.choice([0, 1], size=100)
}
df = pd.DataFrame(data)
and I want to see, by gender, what proportion of vaccinated people got sick.
I've tried something like this:
df.groupby('gender').agg(lambda group: sum(group['vaccinated']==1 & group['sick']==1)
                         /sum(group['sick']==1))
but this doesn't work because agg works on the series level. Same applies for transform. apply doesn't work either, but I'm not as clear why or how apply functions on groupby objects.
Any ideas how to accomplish this with a single line of code?
You could first filter for the vaccinated people and then group by gender and calculate the proportion of people that got sick:
df[df.vaccinated == 1].groupby("gender").agg({"got sick":"mean"})
Output:
        got sick
gender
f       0.548387
m       0.535714
In this case the proportion is calculated from the random sample data created above, so your exact numbers will differ.
The docs for GroupBy.apply state that the function is applied "group-wise". This means that the function is called on each group separately as a data frame.
That is, df.groupby(c).apply(f) is conceptually equivalent to:
results = {}
for val in df[c].unique():
    group = df.loc[df[c] == val]
    result = f(group)
    results[val] = result
pd.concat(results)
We can use this understanding to apply your custom aggregation function, using a top-level def just to make the code easier to read:
def calc_vax_sick_frac(group):
    vaccinated = group['vaccinated'] == 1
    sick = group['got sick'] == 1  # the sample data names this column 'got sick'
    return (vaccinated & sick).sum() / sick.sum()

(
    df
    .groupby('gender')
    .apply(calc_vax_sick_frac)
)

Select cells in a pandas DataFrame by a Series of its column labels

Say we have a DataFrame and a Series of its column labels, both (almost) sharing a common index:
df = pd.DataFrame(...)
s = df.idxmax(axis=1).shift(1)
How can I obtain the cells given a series of column labels, getting the value from every row using the corresponding column label from the joined series? I'd imagine it would be:
values = df[s] # either
values = df.loc[s] # or
In my example I'd like to get the values that sit directly under each row's biggest value (I'm doing a poor man's ML :) )
However, I cannot find any interface for selecting cells by a series of column labels. Any ideas, folks?
Meanwhile I use this monstrous snippet:
def get_by_idxs(df: pd.DataFrame, idxs: pd.Series) -> pd.Series:
    ts_v_pairs = [
        (ts, row[row['idx']])
        for ts, row in df.join(idxs.rename('idx'), how='inner').iterrows()
        if isinstance(row['idx'], str)
    ]
    return pd.Series([v for ts, v in ts_v_pairs], index=[ts for ts, v in ts_v_pairs])
I think you need a DataFrame lookup:
v = s.dropna()
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index), df.columns.get_indexer_for(v)]
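As a quick sanity check, here is the same trick on a tiny made-up frame (the numbers are purely illustrative):

import pandas as pd

df = pd.DataFrame({'a': [1, 9, 2, 3],
                   'b': [8, 2, 7, 1]},
                  index=[10, 20, 30, 40])
s = df.idxmax(axis=1).shift(1)   # column label of the previous row's maximum

v = s.dropna()
v[:] = df.to_numpy()[df.index.get_indexer_for(v.index),
                     df.columns.get_indexer_for(v)]
print(v)
# index 20 -> 2 (df.loc[20, 'b']), 30 -> 2 (df.loc[30, 'a']), 40 -> 1 (df.loc[40, 'b'])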

Create a Python script which compares several Excel files (snapshots) and creates a new dataframe with the rows which are different

I am new to Python and would appreciate your help.
I would like to create a Python script which performs data validation by using my first Excel file excel_file[0] as df1 and comparing it against several other Excel files excel_file[0:100], looping through them, comparing each one with df1 and appending the rows which are different to a new dataframe df3. Even though I have several columns, I would like to base my comparison on two columns, one of which is a primary key column, so that if the keys in the two dataframes match, then df1 and df2 (the current file in the loop) are compared.
Here's what I have tried:
## import python module: pandasql, which allows SQL syntax for Pandas;
## it needs installation first though: pip install -U pandasql
import os
import glob
import datetime as dt
import pandas as pd
from pandasql import sqldf

pysqldf = lambda q: sqldf(q, locals(), globals())

dateTimeObj = dt.datetime.now()
print('start file merge: ', dateTimeObj)

#path = os.getcwd()
##files = os.listdir(path1)
files = os.path.abspath('mydrive')

dff1 = pd.DataFrame()
##df2 = pd.DataFrame()

# method 1
excel_files = glob.glob(files + "\*.xlsx")
##excel_files = [f for f in files if f.endswith('.xlsx') or f.endswith('.xls')]
df1 = pd.read_excel(excel_files[14])
for f in excel_files[0:100]:
    df2 = pd.read_excel(f)
    ## Let's drop any unnamed column
    ##df1 = df1.drop(df1.iloc[:, [0]], axis=1)
    ### Gets all rows which are different after comparing the two dataframes; the clause
    ### "_key HAVING COUNT(*) = 1" keeps rows that appear in only one of the two dataframes.
    ### Use "_key HAVING COUNT(*) = 2" instead to output the rows that are the same.
    data = pysqldf("SELECT * FROM (SELECT * FROM df1 UNION ALL SELECT * FROM df2) df1 "
                   "GROUP BY _key HAVING COUNT(*) = 1;")
    ## dff1 = dff1.append(data).reset_index(drop=True)
print(dt.datetime.now().strftime("%x %X") + ': files appended to make a Master file')
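For reference, a minimal pandas-only sketch of the same "HAVING COUNT(*) = 1" idea (the folder name and the two comparison column names 'key' and 'value' are placeholders, not taken from the files above):

import glob
import pandas as pd

excel_files = sorted(glob.glob("snapshots/*.xlsx"))   # placeholder folder
df1 = pd.read_excel(excel_files[0])                   # baseline snapshot

diffs = []
for f in excel_files[1:]:
    df2 = pd.read_excel(f)
    # stack the two comparison columns from both snapshots on top of each other
    both = pd.concat([df1[['key', 'value']], df2[['key', 'value']]])
    # a (key, value) pair that appears exactly once exists in only one snapshot,
    # which is roughly the pandas equivalent of GROUP BY _key HAVING COUNT(*) = 1
    changed = both.drop_duplicates(keep=False).assign(source_file=f)
    diffs.append(changed)

df3 = pd.concat(diffs, ignore_index=True)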

Restructuring DataFrames (stored in a dictionary)

I have football data stored in a dictionary by league for different seasons. So, for example, I have the results of one league for the seasons 2017-2020 in one dataframe stored in the dictionary. Now I need to create new dataframes by season, so that I have all the results from 2019 in one dataframe. What is the best way to do this?
Thank you!
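For the regrouping itself, a minimal sketch, assuming each per-league dataframe already has (or can be given) a 'season' column (the dict name 'leagues' and the column name 'season' are assumptions, adjust them to your data):

import pandas as pd

# leagues: your existing dict of league name -> dataframe covering several seasons
combined = pd.concat(leagues, names=["league"]).reset_index(level="league")

# build one dataframe per season, e.g. by_season["2019"] holds all results from 2019
by_season = {season: grp.reset_index(drop=True)
             for season, grp in combined.groupby("season")}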
Assuming you are using openfootball as your source:
- use the GitHub API to get all the files in the repo
- a function to normalize the JSON
- it is then simple to generate either a concatenated DF or a dict of all the results
import requests
import pandas as pd

# normalize football scores data into a dataframe
def structuredf(res):
    js = res.json()
    if "rounds" not in js.keys():
        return (pd.json_normalize(js["matches"])
                .pipe(lambda d: d.loc[:, :].join(d["score.ft"].apply(pd.Series).rename(columns={0: "home", 1: "away"})))
                .drop(columns="score.ft")
                .rename(columns={"round": "name"})
                .assign(seasonname=js["name"], url=res.url)
                )
    df = (pd.json_normalize(pd.json_normalize(js["rounds"])
                            .explode("matches").to_dict("records"))
          .assign(seasonname=js["name"], url=res.url)
          .pipe(lambda d: d.loc[:, :].join(d["matches.score.ft"].apply(pd.Series).rename(columns={0: "home", 1: "away"})))
          .drop(columns="matches.score.ft")
          .pipe(lambda d: d.rename(columns={c: c.split(".")[-1] for c in d.columns}))
          )
    return df

# get a listing of all the data files that we're interested in
res = requests.get("https://api.github.com/repos/openfootball/football.json/git/trees/master?recursive=1")
dfm = pd.DataFrame(res.json()["tree"])

# concat into one dataframe
df = pd.concat([structuredf(res)
                for p in dfm.loc[dfm["path"].str.contains(r".en.[0-9]+.json"), "path"].items()
                for res in [requests.get(f"https://raw.githubusercontent.com/openfootball/football.json/master/{p[1]}")]])

# dictionary of dataframes, keyed by the season name from the JSON
d = {res.json()["name"]: structuredf(res)
     for p in dfm.loc[dfm["path"].str.contains(r".en.[0-9]+.json"), "path"].items()
     for res in [requests.get(f"https://raw.githubusercontent.com/openfootball/football.json/master/{p[1]}")]}
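Since the season name ends up both as the dictionary key and in the seasonname column, pulling a single season back out (e.g. everything from 2019, as asked) is just a filter; a sketch, assuming the season string contains "2019":

# from the dictionary of dataframes
df_2019 = pd.concat([v for k, v in d.items() if "2019" in k], ignore_index=True)
# or from the single concatenated dataframe
df_2019 = df[df["seasonname"].str.contains("2019")]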

(pyspark) how to make dataframes which have no user_id in common

I was trying to build two user_id dataframes in pyspark that have no user_id in common.
So, I wrote the code you can see below:
import pyspark.sql.functions as f
query = "select * from tb_original"
df_original = spark.sql(query)
df_original = df_original.select("user_id").distinct()
df_a = df_original.sort(f.rand()).limit(10000)
df_a.count()
# df_a: 10000
df_b = df_original.join(df_a,on="user_id",how="left_anti").sort(f.rand()).limit(10000)
df_b.count()
# df_b: 10000
df_a.join(df_b,on="user_id",how="left_anti").count()
# df_a - df_b = 9998
# What?????
As a result, df_a and df_b share 2 user_ids... sometimes 1, or 0.
The code itself looks fine, so maybe this happens because of Spark's lazy evaluation: since nothing is cached, df_a (a random sort plus limit) may be recomputed with a different random order each time it is referenced...
I need a way to build two user_id dataframes that have no user_id in common.
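(If the lazy-evaluation guess is right, one workaround I can think of is to materialize df_a once so the same random sample is reused when building df_b, for example:

df_a = df_original.sort(f.rand()).limit(10000).cache()
df_a.count()   # force the cached sample to be computed once
df_b = df_original.join(df_a, on="user_id", how="left_anti").sort(f.rand()).limit(10000)

cache() is only a hint, though, so writing df_a out or checkpointing it would be a stronger guarantee.)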
Since you want to generate two different sets of users from a given pool of users with no overlap, you may use this simple trick:
from pyspark.sql.functions import monotonically_increasing_id
import pyspark.sql.functions as f

# "Creation of Original DF"
query = "select * from tb_original"
df_original = spark.sql(query)
df_original = df_original.select("user_id").distinct()
df_original = df_original.withColumn("UNIQUE_ID", monotonically_increasing_id())

number_groups_needed = 2  ## you can adjust the number of groups you need for your use case
dfa = df_original.filter(df_original.UNIQUE_ID % number_groups_needed == 0)
dfb = df_original.filter(df_original.UNIQUE_ID % number_groups_needed == 1)
## dfa and dfb will not have any overlap for user_id
PS: if your user_id is itself an integer, you don't need to create a new UNIQUE_ID column; you can use it directly.
I chose the randomSplit function that PySpark supports:
df_a,df_b = df_original.randomSplit([0.6,0.4])
df_a = df_a.limit(10000)
df_a.count()
# 10000
df_b = df_b.limit(10000)
df_b.count()
# 10000
df_a.join(df_b,on="user_id",how="left_anti").count()
# 10000
No conflicts between df_a and df_b anymore!