How to include an if condition after merging two dataframes? - pandas

Currently my code merges two dataframes from my desktop, drops some duplicates and some columns, and converts the final output to an image that is sent via Telegram.
import pandas as pd
import dataframe_image as di
import telepot
df = pd.read_csv('a.csv', delimiter=';')
df1 = pd.read_csv('b.csv', delimiter=';')
# inner-join on the shared "Conc" column
total = pd.merge(df, df1, on="Conc", how="inner")
total = total.drop_duplicates(subset=["A"], keep="first")
# drop(..., 1) with a positional axis is deprecated; use the columns keyword
total = total.drop(columns=['A', 'B', 'C', 'D', 'E', 'Conc'])
di.export(total, 'total.png')
bot = telepot.Bot('token')
bot.sendPhoto(chatid, photo=open('total.png', 'rb'))
This is the happy path, where the merge produces a new dataframe with rows in it. How can I handle the case where the merge outputs an empty dataframe, so that I can send "NA" via Telegram instead?
Many thanks
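One way to branch on the merge result (a minimal sketch, not from the original post, reusing total, bot, and chatid from the code above) is to test DataFrame.empty before exporting:

if total.empty:
    # nothing survived the merge/dedup: send a plain "NA" message instead of a picture
    bot.sendMessage(chatid, "NA")
else:
    di.export(total, 'total.png')
    bot.sendPhoto(chatid, photo=open('total.png', 'rb'))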

Related

Having trouble with an Excel spreadsheet in Google Colab: a column is missing

Yes, this is homework, and no, I don't want an answer to the question, but for some reason the column I would like to move using pandas is missing, yet I can still see it in my end result. Why is this happening? This is what I have done:
import numpy as np
import scipy as sp
import pandas as pd
import matplotlib as mpl
import seaborn as sns
#read xlsx file
df = pd.read_excel("https://docs.google.com/spreadsheets/d/e/2PACX-1vTd9TqybCunAe9HPPdb5mOW5uFn5m5fXO-mecfsn0TEk10_l8Bz1Kc7k13AFWoyvC1t3A7A27zozfTd/pub?output=xlsx")
df
#removes last 2 rows
df.iloc[0:, 0:21]
#columns grouped by type float
df.iloc[0:, [0,2,4,9,10,11,12,13,14,15,16,17,18,19,20]]
#columns grouped by type object
df.iloc[0:, [1,3,5,6,7]]
#gets dummies and stores them in variables
type_float = df.iloc[0:, [0,2,4,9,10,11,12,13,14,15,16,17,18,19,20]]
type_object = df.iloc[0:, [1,3,5,6,7]]
#concatenates the dummies to the original dataframe
df = pd.concat([type_float, type_object], axis='columns')
df
#rename
df.rename(columns = {'Attrition_Flag':'Target'}, inplace = True)
df
#replacing Target with 0/1
df['Target'].replace(['Existing Customer', 'Attrited Customer'], [0, 1], inplace=True)
df
'''
This is where I'm having trouble.
When I try to move the column "Target" I can't. I've tried to pop it and then move it to the back,
and when I try using "df.iloc[0:, [15]]", which is its column, it just goes to the next column. Why is this column non-existent anymore?
'''
Not sure if I understand correctly what you need to do, but if you want to change the order of the columns (make 'Target' the last column), you can use:
all_columns_in_new_order = list(df.columns.drop('Target')) + ['Target']
and then:
df = df.reindex(all_columns_in_new_order, axis=1)
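Equivalently (a one-line sketch, not from the original answer), DataFrame.pop removes the column and returns it as a Series, and assigning it straight back appends it as the last column:

# pop drops 'Target' from its current position; re-assignment appends it at the end
df['Target'] = df.pop('Target')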

How to subset a dataframe, groupby and export the dataframes as multiple sheets of one Excel file in Python

Python newbie here
In the dataset below:
import pandas as pd
import numpy as np
data = {'Gender':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
        'Location':['NE','NE','NE','NE','SW','SW','SW','SW','SE','SE','SE','SE','NC','NC','NC','NC'],
        'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
        'PDP':['<10','<10','<10','<10',10,10,10,10,20,20,20,20,'>20','>20','>20','>20'],
        'PDP_code':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
        'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
        'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
        'Revenue_YR1':[1150.78,1162.34,1188.53,1197.69,2108.07,2117.76,2129.48,1319.51,1416.87,1812.54,1819.57,1991.97,2219.28,2414.73,2169.91,2149.19],
        'Revenue_YR2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
        'Revenue_YR3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
        'Revenue_YR4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
        'Revenue_YR5':[251.78,221.34,282.53,272.69,310.07,317.7,329.81,333.15,334.87,332.54,336.59,339.97,329.28,334.73,336.91,334.12],
        'Revenue_YR6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
        'Revenue_YR7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
        'Revenue_YR8':[279.84,289.14,299.53,309.69,318.73,327.47,336.63,398.59,398.75,324.18,324.78,321.55,333.85,339.39,315.52,319.23],
        }
df = pd.DataFrame(data, columns=['Gender','Location','Type','PDP','PDP_code','diff','series',
                                 'Revenue_YR1','Revenue_YR2','Revenue_YR3','Revenue_YR4','Revenue_YR5','Revenue_YR6',
                                 'Revenue_YR7','Revenue_YR8'])
df.head(5)
I want a pythonic way of doing the following:
subset df into 4 dataframes based on the unique Location values, resulting in NE, SW, SE & NC dataframes, and
aggregate all the Revenue_YR columns with a groupby on the series and PDP_code columns, then export the four aggregated dataframes (NE, SW, SE & NC) as separate sheets of one xlsx file.
My attempt
### this code returns one df instead of 4; I need help aggregating each of the 4 dataframes and exporting them to 4 sheets of 12312021_output.xlsx
for i, part_df in df.groupby('Location'):
    (part_df.groupby(['series','PDP_code'])[['Revenue_YR1', 'Revenue_YR2', 'Revenue_YR3',
                                             'Revenue_YR4', 'Revenue_YR5', 'Revenue_YR6',
                                             'Revenue_YR7']]
     .mean().unstack()
     .style.background_gradient(cmap='Blues')
     .to_excel('12312021_output.xlsx'))  # overwrites the same file on every iteration
Please share your code.
You can use pandas.ExcelWriter together with your loop (which I improved slightly for readability):
import pandas as pd

with pd.ExcelWriter("output.xlsx") as writer:
    cols = df.filter(like='Revenue_YR').columns
    for g, d in df.groupby('Location'):
        (d.groupby(['series','PDP_code'])[cols].mean().unstack()
         .style.background_gradient(cmap='Blues')
        ).to_excel(writer, sheet_name=g)
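Each Location group lands on its own sheet, named after the group key, and the workbook is saved when the with block exits. As a quick sanity check (a sketch, assuming output.xlsx was written to the working directory), you can list the sheets that were created:

import pandas as pd

# the four Location groups should show up as sheet names, e.g. ['NC', 'NE', 'SE', 'SW']
print(pd.ExcelFile("output.xlsx").sheet_names)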

Generate diff of two CSV files based on a single column using Pandas

I am working with CSV files that are each hundreds of megabytes (800k+ rows), use pipe delimiters, and have 90 columns. What I need to do is compare two files at a time and generate a CSV file of any differences (i.e. rows in File2 that do not exist in File1, as well as rows in File1 that do not exist in File2), performing the comparison using only a single column.
For instance, a highly simplified version would be:
File1
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|John|Doe|5555551212|ABC123
20200505|active|Jane|Doe|5555551212|ABC321
File2
claim_date|status|first_name|last_name|phone|claim_number
20200501|active|Someone|New|5555551212|ABC123
20200510|active|Another|Person|5555551212|ABC000
In this example, the output file should look like this:
claim_date|status|first_name|last_name|phone|claim_number
20200505|active|Jane|Doe|5555551212|ABC321
20200510|active|Another|Person|5555551212|ABC000
As this example shows, both input files contain the row with claim_number ABC123; although the first_name and last_name fields changed between the files, I do not care, because the claim_number is the same in both. The other rows contain unique claim_number values, so both are included in the output file.
I have been told that Pandas is the way to do this, so I have set up a Jupyter environment and am able to load the files correctly, but am banging my head against the wall at this point. Any suggestions are highly appreciated!
My code so far:
import os
import pandas as pd
df1 = pd.read_table("/Users/X/Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("/Users/X/Claims_20210618.txt", sep='|', low_memory=False)
Everything else I've written is basically irrelevant at this point as it's just copypasta from the web that doesn't execute for one reason or another.
EDIT: Solution!
import os
import pandas as pd
df1 = pd.read_table("Claims_20210607.txt", sep='|', low_memory=False)
df2 = pd.read_table("Claims_20210618.txt", sep='|', low_memory=False)
# astype returns a new dataframe, so assign the result back
df1 = df1.astype({'claim_number': 'str'})
df2 = df2.astype({'claim_number': 'str'})
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=['claim_number'],
        keep=False,
        ignore_index=True)
    .to_csv('diff.csv')
)
I still need to figure out how to drop the first (added) column before writing the file, but this is fantastic! Thanks!
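For what it's worth, that extra first column is the dataframe's index, which to_csv writes by default; passing index=False omits it. A one-line sketch using the same df as above:

# index=False stops to_csv from writing the row index as an extra first column
df.drop_duplicates(subset=['claim_number'], keep=False, ignore_index=True).to_csv('diff.csv', index=False)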
IIUC, you can try:
If you want to drop duplicates based on all columns except ['first_name', 'last_name']:
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=df.columns.difference(['first_name', 'last_name']),
        keep=False)
    .to_csv('file3.csv')
)
If you want to drop duplicates based on the claim_number column only:
df = pd.concat([df1, df2])
(
    df.drop_duplicates(
        subset=['claim_number'],
        keep=False)
    .to_csv('file3.csv')
)
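The key is keep=False: it discards every occurrence of a duplicated claim_number, so only rows whose claim_number appears in exactly one file survive. A toy sketch with made-up data to illustrate:

import pandas as pd

a = pd.DataFrame({'claim_number': ['ABC123', 'ABC321'], 'name': ['John', 'Jane']})
b = pd.DataFrame({'claim_number': ['ABC123', 'ABC000'], 'name': ['Someone', 'Another']})
# ABC123 appears in both frames, so keep=False removes both copies
print(pd.concat([a, b]).drop_duplicates(subset=['claim_number'], keep=False))
#   claim_number     name
# 1       ABC321     Jane
# 1       ABC000  Another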

How do I swap two (or more) columns between two different data tables in pandas?

New here, and I am new to programming.
So, as the title says, I am trying to swap two full columns between two different files (the columns have the same name but different data). I started with this:
import numpy as np
import pandas as pd
from pandas import DataFrame
# note: pd.read_csv has no col_name parameter, so read the files as-is
df = pd.read_csv('table1.csv')
df1 = pd.read_csv('table2.csv')
df1.COL1 = df.COL1
But now I am stuck. How do I select a whole column, and how can I print the new combined table to a new file (i.e. table3)?
You could perform the swap by copying one column into a temporary one and deleting it afterwards, like so:
df1['temp'] = df1['COL1']
df1['COL1'] = df['COL1']
df['COL1'] = df1['temp']
del df1['temp']
and then write the result to a third CSV via to_csv:
df1.to_csv('table3.csv')
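A temporary column isn't strictly required: the right-hand side of a tuple assignment is evaluated before either assignment happens, so the two Series can be swapped in one statement (a sketch, assuming both files have a COL1 column and compatible row indexes; table4.csv is just a made-up name for the second output file):

# both right-hand Series are captured before either column is overwritten
df['COL1'], df1['COL1'] = df1['COL1'], df['COL1']

# index=False avoids writing an extra index column
df.to_csv('table3.csv', index=False)
df1.to_csv('table4.csv', index=False)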

When reading HTML (pandas.read_html), how to select a dataframe and set_index in one line

I'm reading an HTML page, which brings back a list of dataframes. I want to be able to choose a dataframe from the list and set my index (index_col) in the fewest lines possible.
Here is what I have right now:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header=0)
df2 = df[4]  # here I'm assigning df2 to dataframe #4 from the list of dataframes I read
df2.set_index('Date', inplace=True)
Is it possible to do all this in one line? Do I need to create another dataframe (df2) to hold one dataframe from the list, or can I select the dataframe as soon as I read the list of dataframes (df)?
Thanks.
Yes, you can chain the selection and set_index directly:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header=0)[4].set_index('Date')
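If the one-liner gets unwieldy, the same chain can be wrapped in parentheses and split across lines without adding any intermediate variables (a purely stylistic sketch of the same call):

import pandas as pd

# identical to the one-liner above, just split for readability
df = (
    pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue',
                 header=0)[4]
    .set_index('Date')
)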