Execute the same function on different columns to make rows to append to another table - dataframe

How could I perform the same operation for 15 columns on a DataFrame?
How could I parallelize the operation?
I have input data with which I need to update a reference table. There are more columns, but I think these 3 are enough to understand what I am trying to do.
Table: input
rowid  col1        col2        col3
id1    col1_data1  col2_data1  col3_data1
id2    col1_data2  col2_data2  col3_data2
The reference table contains the value of each corresponding cell of the column, then its md5, and finally the column name:
Table: references
col_data    md5             ref_name
col1_data1  md5_col1_data1  col1_name
col1_data2  md5_col1_data2  col1_name
col1_data3  md5_col1_data3  col1_name
col2_data1  md5_col2_data1  col2_name
col2_data2  md5_col2_data2  col2_name
col2_data3  md5_col2_data3  col2_name
col3_data1  md5_col3_data1  col3_name
col3_data2  md5_col3_data2  col3_name
col3_data3  md5_col3_data3  col3_name
I created a function similar to the one below. It checks the input table against the reference table; when new data is found, the reference is created and a dataframe is returned, so that at the end the references table can be updated with it.
def repeatedly_excuted_funcion(input_data, references, col_name):
    """
    input_data is the full dataframe
    references is the table to check whether the value already exists and, if not, create it
    col_name is the name of the column that will be considered on this execution
    """
    # ... some code ...
    return partial_df
df_col1 = repeatedly_excuted_funcion(input_data, references, "col1")
df_col2 = repeatedly_excuted_funcion(input_data, references, "col2")
data_to_append = df_col1.union(df_col2)
df_col3 = repeatedly_excuted_funcion(input_data, references, "col3")
data_to_append = data_to_append.union(df_col3)
I only show a 3-column example, but there are 15 columns to check.
At the end, the idea is to update the references table with the newly calculated md5 values.
(
    data_to_append.write.format("delta")
    .mode("append")
    .saveAsTable(database_table)
)

No function, no unions. 1 shuffle (anti join).
Create all 3 final columns (data, md5, col_name) inside an array column in the Input table
Unpivot - from every 1 row of 15 columns make 1 column of 15 rows
Split the 1 array column into 3 data columns
Filter out rows which already exist in References
Append result
from pyspark.sql import functions as F

cols = ['col1', 'col2',..., 'col15']

# Change Input columns to arrays
df_input = df_input.select(
    *[F.array(F.col(c), F.md5(c), F.lit(c)).alias(c) for c in cols]
)

# Unpivot Input table
stack_string = ", ".join([f"`{c}`" for c in cols])
df_input2 = df_input.select(
    F.expr(f"stack({len(cols)}, {stack_string}) as col_data"))

# Make 3 columns from 1 array column
df_input3 = df_input2.select(
    F.element_at('col_data', 1).alias('col_data'),
    F.element_at('col_data', 2).alias('md5'),
    F.element_at('col_data', 3).alias('ref_name'),
)

# Keep only rows which don't exist in References table
data_to_append = df_input3.join(df_references, 'col_data', 'anti')

(
    data_to_append.write.format("delta")
    .mode("append")
    .saveAsTable(database_table)
)
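One design note (my assumption, not part of the original answer): if the same col_data value can legitimately appear under different columns, the anti join may need to match on both keys, so that a value already referenced for one column is not silently dropped for another:

data_to_append = df_input3.join(df_references, ['col_data', 'ref_name'], 'anti')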

Create an empty DF with the correct schema.
Get all the columns.
Union the result for each column onto it.
I'm not sure that for 15 items it's worth parallelizing, or that you wouldn't run into issues with the Spark context (as it's not available inside an executor), meaning you would have to have pure Python code inside repeatedly_excuted_funcion. You might be able to do all rows at once with a UDF, but I'm not sure that would perform as well (UDFs are known for poor performance due to the lack of vectorization).
from pyspark.sql.types import StructType, StructField, StringType

unionSchema = StructType([
    StructField('column', StringType(), True)])

my_union = spark.createDataFrame(data=[], schema=unionSchema)

for i in myDF.columns:
    my_union = my_union.union(repeatedly_excuted_funcion(input_data, references, i))
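As a follow-up sketch (my addition, mirroring the append from the question), the accumulated my_union could then be written out the same way:

(
    my_union.write.format("delta")
    .mode("append")
    .saveAsTable(database_table)
)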

What about pivoting the data and performing one join?
The code below creates a map. Building its input is a little annoying, as I create in Python a list of [lit(column_name1), col(column_name1), lit(column_name2), ...]. The main purpose of this map is to explode it: the first table then has a similar format to the reference df, and one normal join can be performed.
from itertools import chain
from pyspark.sql.functions import create_map, array, lit, col, explode

column_names = ["col1", "col2", "col3"]

df \
    .withColumn("features_map", create_map(
        list(chain(*[(lit(c), col(c)) for c in column_names]))
    )) \
    .select("rowid", explode("features_map").alias("ref_name", "col_data")) \
    .join(ref_df, on=["ref_name", "col_data"], how="left") ....
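To round this out (my sketch, not part of the original answer): with the left join in place, you could keep only the rows with no match in ref_df (md5 is null), compute their md5, and append them, assuming ref_df carries the md5 column as in the question:

from itertools import chain
from pyspark.sql.functions import create_map, lit, col, explode, md5

# left join keeps every input cell; the md5 coming from ref_df is null when the cell is new
joined = df \
    .withColumn("features_map", create_map(
        list(chain(*[(lit(c), col(c)) for c in column_names]))
    )) \
    .select("rowid", explode("features_map").alias("ref_name", "col_data")) \
    .join(ref_df, on=["ref_name", "col_data"], how="left")

# keep only the new cells and compute their md5 before appending to the references table
data_to_append = joined \
    .filter(col("md5").isNull()) \
    .select("col_data", md5("col_data").alias("md5"), "ref_name")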

Related

Create a Python script which compares several Excel files (snapshots) and creates a new dataframe with rows which are different

I am new to Python and will appreciate your help.
I would like to create a Python script which performs data validation by using my first file excel_file[0] as df1 and comparing it against several other files excel_file[0:100], looping through them, comparing each with df1, and appending the rows which are different to a new dataframe df3. Even though I have several columns, I would like to base my comparison on two columns, one of which is a primary key column, such that if the keys in the two dataframes match, then df1 and df2 (the file in the loop) are compared.
Here's what I have tried:
## import python module: pandasql which allows SQL syntax for Pandas;
## it needs installation first though: pip install -U pandasql
import os
import glob
import datetime as dt
import pandas as pd
from pandasql import sqldf

pysqldf = lambda q: sqldf(q, globals())

dateTimeObj = dt.datetime.now()
print('start file merge: ', dateTimeObj)

#path = os.getcwd()
##files = os.listdir(path1)
files = os.path.abspath('mydrive')
files

dff1 = pd.DataFrame()
##df2 = pd.DataFrame()

# method 1
excel_files = glob.glob(files + "\*.xlsx")
##excel_files = [f for f in files if f[-4:] == '\*.xlsx' or f[-3:] == '*.xls']
df1 = pd.read_excel(excel_files[14])

for f in excel_files[0:100]:
    df2 = pd.read_excel(f)
    ## Let's drop any unnamed column
    ##df1 = df1.drop(df1.iloc[:, [0]], axis=1)
    ### Gets all rows and columns which are different after comparing the two dataframes; the
    ### clause "_key HAVING COUNT(*) = 1" resolves to True if the two dataframes are different.
    ### Else we use the clause "_key HAVING COUNT(*) = 2" to output similar rows and columns.
    data = pysqldf("SELECT * FROM (SELECT * FROM df1 UNION ALL SELECT * FROM df2) df1 "
                   "GROUP BY _key HAVING COUNT(*) = 1;")
    ## df = dff1.append(data).reset_index(drop=True)
print(dt.datetime.now().strftime("%x %X") + ': files appended to make a Master file')
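For what it's worth, a minimal pandas-only sketch of the comparison described above, under assumed column names (the key column is called _key here and the compared column value; both are hypothetical and would need to match the real files):

import glob
import pandas as pd

KEY, VAL = '_key', 'value'                            # hypothetical column names
excel_files = sorted(glob.glob(r'mydrive\*.xlsx'))    # hypothetical path

df1 = pd.read_excel(excel_files[0])

diff_frames = []
for f in excel_files[1:100]:
    df2 = pd.read_excel(f)
    # match rows on the key, then keep the ones whose compared value differs from df1
    merged = df1[[KEY, VAL]].merge(df2[[KEY, VAL]], on=KEY, how='inner',
                                   suffixes=('_df1', '_df2'))
    diff_frames.append(merged[merged[VAL + '_df1'] != merged[VAL + '_df2']])

df3 = pd.concat(diff_frames, ignore_index=True)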

How to split a column into multiple columns and then count the null values in the new column in SQL or Pandas?

I have a relatively large table with thousands of rows and a few tens of columns. Some columns are metadata and others are numerical values. The problem I have is that some metadata columns are incomplete or partial, that is, they are missing the string after the ":". I want to get a count of how many entries have this missing part after the colon mark.
If you look at the miniature example below, what I should get is a small table telling me that in group A, MetaData is complete for 2 entries and incomplete (missing after ":") for the other 2 entries. Ideally I also want to get some statistics on SomeValue (count, max, min, etc.).
How do I do it in an SQL query or in Python Pandas?
It might turn out to be simple with some built-in function; however, I am not getting it right.
Data:
Group MetaData SomeValue
A AB:xxx 20
A AB: 5
A PQ:yyy 30
A PQ: 2
Expected Output result:
Group MetaDataComplete Count
A Yes 2
A No 2
No reason to use split functions (unless the value can contain a colon character). I'm just going to assume that the "null" values (not technically the right word) end with :.
select
    "Group",
    case when MetaData like '%:' then 'Yes' else 'No' end as MetaDataComplete,
    count(*) as "Count"
from T
group by "Group", case when MetaData like '%:' then 'Yes' else 'No' end
You could also use right(MetaData, 1) = ':'.
Or supposing that values can contain their own colons, try charindex(':', MetaData) = len(MetaData) if you just want to ask whether the first colon is in the last position.
Here is an example:
## 1- Create Dataframe
In [1]:
import pandas as pd
import numpy as np
cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20],
        ['A', 'AB:', 5],
        ['A', 'PQ:yyy', 30],
        ['A', 'PQ:', 2]
        ]
df = pd.DataFrame(columns=cols, data=data)
# 2- New data frame with split value columns
new = df["MetaData"].str.split(":", n = 1, expand = True)
df["MetaData_1"]= new[0]
df["MetaData_2"]= new[1]
# 3- Dropping old MetaData columns
df.drop(columns =["MetaData"], inplace = True)
## 4- Replacing empty string by nan and count them
df.replace('',np.NaN, inplace=True)
df.isnull().sum()
Out [1]:
Group         0
SomeValue     0
MetaData_1    0
MetaData_2    2
dtype: int64
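To get the expected per-group table from the question, here is a short groupby sketch building on the dataframe above (my addition, not part of the original answer):

import numpy as np

# after the split/replace above, MetaData_2 is NaN exactly when the part after ':' is missing
df['MetaDataComplete'] = np.where(df['MetaData_2'].isna(), 'No', 'Yes')

result = (df.groupby(['Group', 'MetaDataComplete'])
            .agg(Count=('SomeValue', 'size'),
                 Max=('SomeValue', 'max'),
                 Min=('SomeValue', 'min'))
            .reset_index())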
From a SQL perspective, performing a split is painful, not to mention that using the split results means having to perform the query first and then query the results:
SELECT
    Results.[Group],
    Results.MetaData,
    Results.MetaValue,
    COUNT(Results.MetaValue)
FROM (SELECT
          [Group],
          MetaData,
          SUBSTRING(MetaData, CHARINDEX(':', MetaData) + 1, LEN(MetaData)) AS MetaValue
      FROM VeryLargeTable) AS Results
GROUP BY Results.[Group],
         Results.MetaData,
         Results.MetaValue
If you're just after a count, you could also try the algorithmic approach. Just loop over the data and use regular expressions with a negative lookahead.
import re

pattern = '.*:(?!.)'  # matches strings where nothing follows the final ':'
missing = 0
not_missing = 0
# assuming the dataframe from the answer above is named df
for i in df['MetaData'].tolist():
    match = re.findall(pattern, i)
    if match:
        missing += 1
    else:
        not_missing += 1
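A vectorized pandas alternative (my sketch, not from the original answer) avoids the explicit loop:

# count entries whose MetaData ends with ':' (i.e. the part after the colon is missing)
missing = int(df['MetaData'].str.endswith(':').sum())
not_missing = len(df) - missing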

Postgres 9.5 upsert command in pandas or psycopg2?

Most of the examples I see are people inserting a single row into a database with the ON CONFLICT DO UPDATE syntax.
Does anyone have any examples using SQLAlchemy or pandas.to_sql?
99% of my inserts use the psycopg2 COPY command (so I save a csv or StringIO and then bulk insert), and the other 1% are pd.to_sql. All of my logic to check for new rows or dimensions is done in Python.
def find_new_rows(existing, current, id_col):
    current[id_col] = current[id_col].astype(int)
    x = existing[['datetime', id_col, 'key1']]
    y = current[['datetime', id_col, 'key2']]
    final = pd.merge(y, x, how='left', on=['datetime', id_col])
    final = final[~(final['key2'] == final['key1'])]
    final = final.drop(['key1'], axis=1)
    current = pd.merge(current, final, how='left', on=['datetime', id_col])
    current = current.loc[current['key2_y'] == 1]
    current.drop(['key2_x', 'key2_y'], axis=1, inplace=True)
    return current
Can someone show me an example of using the new PostgreSQL syntax for upsert with psycopg2? A common use case is to check for dimension changes (between 50k and 100k rows daily, which I compare to existing values), which is ON CONFLICT DO NOTHING to only add new rows.
Another use case is that I have fact data which changes over time. I only take the most recent value (I currently use a view to select distinct), but it would be better to UPSERT, if possible.
Here is my code for a bulk insert & insert-on-conflict-update query for PostgreSQL from a pandas dataframe:
Let's say id is the unique key for both the PostgreSQL table and the pandas df, and you want to insert and update based on this id.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://username:pass@host:port/dbname")

query = text(f"""
    INSERT INTO schema.table(name, title, id)
    VALUES {','.join([str(i) for i in list(df.to_records(index=False))])}
    ON CONFLICT (id)
    DO UPDATE SET name = excluded.name,
                  title = excluded.title
""")

engine.execute(query)
Make sure that your df columns are in the same order as your table columns.
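As a side note (my sketch, not from the original answer): building the VALUES list with string formatting does not escape quotes or guard against SQL injection. A parameterized variant using psycopg2.extras.execute_values could look like this:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=dbname user=username password=pass host=host")  # hypothetical DSN
with conn, conn.cursor() as cur:
    execute_values(
        cur,
        """
        INSERT INTO schema.table (name, title, id)
        VALUES %s
        ON CONFLICT (id)
        DO UPDATE SET name = EXCLUDED.name,
                      title = EXCLUDED.title
        """,
        list(df[['name', 'title', 'id']].itertuples(index=False, name=None)),
    )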
FYI, this is the solution I am using currently.
It seems to work fine for my purposes. I had to add a line to replace null (NaT) timestamps with None though, because I was getting an error when I was loading each row into the database.
import logging
import pandas as pd

def create_update_query(table):
    """This function creates an upsert query which replaces existing data based on primary key conflicts"""
    columns = ', '.join([f'{col}' for col in DATABASE_COLUMNS])
    constraint = ', '.join([f'{col}' for col in PRIMARY_KEY])
    placeholder = ', '.join([f'%({col})s' for col in DATABASE_COLUMNS])
    updates = ', '.join([f'{col} = EXCLUDED.{col}' for col in DATABASE_COLUMNS])
    query = f"""INSERT INTO {table} ({columns})
                VALUES ({placeholder})
                ON CONFLICT ({constraint})
                DO UPDATE SET {updates};"""
    query = ' '.join(query.split())  # collapse whitespace into a single-line statement
    return query

def load_updates(df, table, connection):
    conn = connection.get_conn()
    cursor = conn.cursor()
    df1 = df.where((pd.notnull(df)), None)  # replace NaN/NaT with None so they load as NULL
    insert_values = df1.to_dict(orient='records')
    for row in insert_values:
        cursor.execute(create_update_query(table=table), row)
    conn.commit()
    row_count = len(insert_values)
    logging.info(f'Inserted {row_count} rows.')
    cursor.close()
    del cursor
    conn.close()
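For context, a minimal usage sketch (DATABASE_COLUMNS and PRIMARY_KEY are module-level globals here; the values below are hypothetical):

DATABASE_COLUMNS = ['id', 'name', 'title']  # hypothetical column list
PRIMARY_KEY = ['id']                        # hypothetical primary key

print(create_update_query('schema.table'))
# -> INSERT INTO schema.table (id, name, title) VALUES (%(id)s, %(name)s, %(title)s)
#    ON CONFLICT (id) DO UPDATE SET id = EXCLUDED.id, name = EXCLUDED.name, title = EXCLUDED.title;
#    (shown wrapped here; the function collapses it to one line)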
For my case, I wrote to a temporary table first, then merged the temp table into the actual table I wanted to upsert to. Performing the upsert this way avoids any conflicts where the strings may have single quotes in them.
def upsert_dataframe_to_table(self, table_name: str, df: pd.DataFrame, schema: str, id_col: str):
    """
    Takes the given dataframe and inserts it into the table given. The data is inserted unless the key for that
    data already exists in the dataframe. If the key already exists, the data for that key is overwritten.

    :param table_name: The name of the table to send the data
    :param df: The dataframe with the data to send to the table
    :param schema: the name of the schema where the table exists
    :param id_col: The name of the primary key column
    :return: None
    """
    engine = create_engine(
        f'postgresql://{postgres_configs["username"]}:{postgres_configs["password"]}@{postgres_configs["host"]}'
        f':{postgres_configs["port"]}/{postgres_configs["db"]}'
    )
    df.to_sql('temp_table', engine, if_exists='replace')
    updates = ', '.join([f'{col} = EXCLUDED.{col}' for col in df.columns if col != id_col])
    columns = ', '.join([f'{col}' for col in df.columns])
    query = f'INSERT INTO "{schema}".{table_name} ({columns}) ' \
            f'SELECT {columns} FROM temp_table ' \
            f'ON CONFLICT ({id_col}) DO ' \
            f'UPDATE SET {updates} '
    self.cursor.execute(query)
    self.cursor.execute('DROP TABLE temp_table')
    self.conn.commit()
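A hypothetical call, assuming the method lives on a loader class that also holds the self.conn and self.cursor used above:

# hypothetical example: upsert a dataframe keyed by 'id' into the analytics.customers table
loader.upsert_dataframe_to_table(table_name='customers', df=df, schema='analytics', id_col='id')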

Duplicate row in PySpark Dataframe based off value in another column

I have a dataframe that looks like the following:
ID NumRecords
123 2
456 1
789 3
I want to create a new data frame that concatenates the two columns and duplicates the rows based on the value in NumRecords
So the output should be
ID_New 123-1
ID_New 123-2
ID_New 456-1
ID_New 789-1
ID_New 789-2
ID_New 789-3
I was looking into the "explode" function but it seemed to take only a constant based on the example I saw.
I had a similar issue; this code will duplicate the rows based on the value in the NumRecords column:
from pyspark.sql import Row

def duplicate_function(row):
    data = []  # list of rows to return
    to_duplicate = float(row["NumRecords"])

    i = 0
    while i < to_duplicate:
        row_dict = row.asDict()  # convert a Spark Row object to a Python dictionary
        row_dict["SERIAL_NO"] = str(i)
        new_row = Row(**row_dict)  # create a Spark Row object based on a Python dictionary
        data.append(new_row)  # adds this Row to the list
        i += 1

    return data  # returns the final list

# create final dataset based on value in NumRecords column
df_flatmap = df_input.rdd.flatMap(duplicate_function).toDF(df_input.schema)
You can use a udf:
from pyspark.sql.functions import udf, explode, concat_ws
from pyspark.sql.types import *

range_ = udf(lambda x: [str(y) for y in range(1, x + 1)], ArrayType(StringType()))

df.withColumn("records", range_("NumRecords")) \
  .withColumn("record", explode("records")) \
  .withColumn("ID_New", concat_ws("-", "id", "record"))

Joining files in pandas

I come from an Excel background but I love pandas and it has truly made me more efficient. Unfortunately, I probably carry over some bad habits from Excel. I have three large files (between 2 million and 13 million rows each) which contain data on interactions which could be tied together, unfortunately, there is no unique key connecting the files. I am literally concatenating (Excel formula) 3 fields into one new column on all three files.
Three columns exist on each file which I combine together (the other fields would be, for example, the reason for interaction on one file, the score on another file, and some other data on the third file which I would like to tie back to a certain AgentID):
Date | CustomerID | AgentID
I edit my date format to be uniform on each file:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df['Date'] = df['Date'].apply(lambda x: x.date().strftime('%Y-%m-%d'))
Then I create a unique column (well, as unique as I can get it.. sometimes the same customer interacts with the same agent on the same date but this should be quite rare):
df['Unique'] = df['Date'].astype(str) + df['CustomerID'].astype(str) + df['AgentID'].astype(str)
I do the same steps for df2 and then:
combined = pd.merge(df, df2, how = 'left', on = 'Unique')
I typically send that to a new csv in case something crashes, gzip it, then read it again and do the same process again with the third file.
final = pd.merge(combined, df2, how = 'left', on = 'Unique')
As you can see, this takes time. I have to format the dates on each and then turn them into text, create an object column which adds to the filesize, and (due to the raw data issues themselves) drop duplicates so I don't accidentally inflate numbers. Is there a more efficient workflow for me to follow?
Instead of using on = 'Unique':
combined = pd.merge(df, df2, how = 'left', on = 'Unique')
you can pass a list of columns to the on keyword parameter:
combined = pd.merge(df, df2, how='left', on=['Date', 'CustomerID', 'AgentID'])
Pandas will correctly merge rows based on the triplet of values from the 'Date', 'CustomerID', 'AgentID' columns. This is safer (see below) and easier than building the Unique column.
For example,
import pandas as pd
import numpy as np

np.random.seed(2015)
df = pd.DataFrame({'Date': pd.to_datetime(['2000-1-1', '2000-1-1', '2000-1-2']),
                   'CustomerID': [1, 1, 2],
                   'AgentID': [10, 10, 11]})
df2 = df.copy()
df3 = df.copy()

L = len(df)
df['ABC'] = np.random.choice(list('ABC'), L)
df2['DEF'] = np.random.choice(list('DEF'), L)
df3['GHI'] = np.random.choice(list('GHI'), L)
df2 = df2.iloc[[0, 2]]

combined = df
for x in [df2, df3]:
    combined = pd.merge(combined, x, how='left', on=['Date', 'CustomerID', 'AgentID'])
yields
In [200]: combined
Out[200]:
   AgentID  CustomerID      Date ABC DEF GHI
0       10           1  2000-1-1   C   F   H
1       10           1  2000-1-1   C   F   G
2       10           1  2000-1-1   A   F   H
3       10           1  2000-1-1   A   F   G
4       11           2  2000-1-2   A   F   I
A cautionary note:
Concatenating the CustomerID and the AgentID to create a Unique ID could be problematic -- particularly if neither has a fixed-width format.
For example, if CustomerID = '12' and AgentID = '34', then (ignoring the date, which causes no problem since it does have a fixed width) Unique would be '1234'. But if CustomerID = '1' and AgentID = '234', then Unique would again equal '1234'. So the Unique IDs may be mixing entirely different customer/agent pairs.
PS. It is a good idea to parse the date strings into date-like objects:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
Note that if you use
combined = pd.merge(combined, x, how='left', on=['Date','CustomerID', 'AgentID'])
it is not necessary to convert any of the columns back to strings.