Pandas Merge function only giving column headers - Update

What I want to achieve
I have two data frames, DF1 and DF2, each read from a different Excel file.
DF1 has 9 columns and 3000 rows; one of its columns is named "Code Group".
DF2 has 2 columns and 20 rows; one of its columns is also named "Code Group". The other column, "Code Management Method", gives the explanation of the code group. For example, H001 is treated as recyclable and H002 is treated as landfill.
What happens
When I use the command data = pd.merge(DF1, DF2, on='Code Group'), I only get the 10 column names but no rows underneath.
What I expect
I want DF1 and DF2 merged so that wherever the Code Group values match, the corresponding Code Management Method is added as the explanation.
Additional information
Following are the datatypes for DF1
Entity object
Address object
State object
Site object
Disposal Facility object
Pounds float64
Waste Description object
Shipment Date datetime64[ns]
Code Group object
Following are the datatypes for DF2
Code Management Method object
Code Group object
What I tried
I tried to follow the suggestion from similar posts on SO that the datatypes on both sides should be the same; Code Group is an object in both frames, so I don't know what the issue is. I also tried the concat function.
Code
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile

# Raw strings avoid surprises with backslashes in Windows paths.
CH = r"C:\Python\Waste\Shipment.xls"
Code = r"C:\Python\Waste\Code.xlsx"
Data = pd.read_excel(Code)
data1 = pd.read_excel(CH)
# Rename the raw export columns to friendlier names.
data1.rename(columns={'generator_name': 'Entity', 'generator_address': 'Address',
                      'Site_City': 'Site', 'final_disposal_facility_name': 'Disposal Facility',
                      'wst_dscrpn': 'Waste Description', 'drum_wgt': 'Pounds',
                      'genrtr_sgntr_dt': 'Shipment Date', 'generator_state': 'State',
                      'expected_disposal_management_methodcode': 'Code Group'},
             inplace=True)
data2 = data1[['Entity', 'Address', 'State', 'Site', 'Disposal Facility',
               'Pounds', 'Waste Description', 'Shipment Date', 'Code Group']]
data2
merged = data2.merge(Data, on='Code Group')
Getting a Warning
C:\Anaconda\lib\site-packages\pandas\core\generic.py:5890: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
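For what it's worth, the warning seems to point at the chained slice that created data2 rather than at the merge itself. A minimal sketch of the usual workaround, assuming data2 is meant to be an independent frame, is to take an explicit copy of the column subset:

# Copying the column subset makes data2 its own DataFrame, so later
# assignments to it no longer warn about modifying a slice of data1.
data2 = data1[['Entity', 'Address', 'State', 'Site', 'Disposal Facility',
               'Pounds', 'Waste Description', 'Shipment Date', 'Code Group']].copy()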

import pandas as pd

df1 = pd.DataFrame({'Region': [1, 2, 3],
                    'zipcode': [12345, 23456, 34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000, 20000, 30000],
                    'ZipCodeUpperBound': [19999, 29999, 39999],
                    'Region': [1, 2, 3]})
df1.merge(df2, on='Region')
This is how the example is set up; the two frames look like this:
Region zipcode
0 1 12345
1 2 23456
2 3 34567
Region ZipCodeLowerBound ZipCodeUpperBound
0 1 10000 19999
1 2 20000 29999
2 3 30000 39999
and the merge results in
Region zipcode ZipCodeLowerBound ZipCodeUpperBound
0 1 12345 10000 19999
1 2 23456 20000 29999
2 3 34567 30000 39999
I hope this is what you want to do
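As a side note, the merge above is an inner join, so rows whose key has no match in the other frame are dropped. If you would rather keep every row of the left frame and get NaN where there is no match, a left merge does that; a small sketch reusing the Region example:

# how='left' keeps every row of df1; any Region missing from df2 would get
# NaN in the df2 columns instead of being dropped from the result.
df1.merge(df2, on='Region', how='left')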

After multiple tries I found that the column contained some garbage characters, so I used the code below and it worked perfectly. The funny thing is that I never encountered this problem on two other datasets imported from Excel files.
data2['Code'] = data2['Code'].str.strip()
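To see why hidden whitespace makes an inner merge come back empty-handed, here is a small illustrative sketch with made-up values (the frames and values are only for demonstration, not my real data):

import pandas as pd

# 'H001 ' has a trailing space, so it does not match 'H001' on the other side.
left = pd.DataFrame({'Code Group': ['H001 ', 'H002'], 'Pounds': [10.0, 20.0]})
right = pd.DataFrame({'Code Group': ['H001', 'H002'],
                      'Code Management Method': ['Recyclable', 'Landfill']})
print(left.merge(right, on='Code Group'))   # only the H002 row matches

# Stripping the key column restores the expected matches.
left['Code Group'] = left['Code Group'].str.strip()
print(left.merge(right, on='Code Group'))   # both rows now match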

Related

List Comprehension is transposing Dataframe

I am trying to import a .csv of EMG data as a DataFrame and filter each column of data using a list comprehension. Below is a dummy dataframe.
import numpy as np
import pandas as pd
from scipy.signal import butter, filtfilt

test_array = pd.DataFrame(np.random.normal(0, 2, size=(1000, 6)),
                          columns=['time', 'RF', 'VM', 'TA', 'GM', 'BF'])
b, a = butter(4, [0.05, 0.9], 'bandpass', analog=False)
columns = ['RF', 'VM', 'TA', 'GM', 'BF']
filtered_df = pd.DataFrame([filtfilt(b, a, test_array.loc[:, i]) for i in test_array[columns]])
The code above gives a version of the expected output, but instead of returning filtered_df as a (1000,5) dataframe, it is returning a (5,1000) dataframe.
I've tried using df.transpose() on the back end to fix the orientation, but it seems like there should be a more straightforward solution to preventing the transposing in the first place. Is there a way to get the desired output?
This issue is related to how you're building the new dataframe. Just passing in a list from:
[filtfilt(b,a,test_array.loc[:,i]) for i in test_array[columns]]
pandas will read that in as a dataframe with five rows, one per filtered column, with column names taken from the indices of the numpy arrays. If you build your dataframe using a dictionary mapped to each column name like:
results = [filtfilt(b,a,test_array.loc[:,i]) for i in test_array[columns]]
filtered_df = pd.DataFrame(data = dict(zip(columns, results)))
you get your desired result
RF VM TA GM BF
0 -0.072520 0.025846 0.111571 0.043277 0.024290
1 -2.674829 3.139997 0.285869 -0.162487 3.759851
2 -0.521439 3.481993 0.427854 -1.411966 5.422871
3 -2.719175 5.162347 2.195120 -0.535819 -1.721818
4 0.451544 1.730292 0.930652 -2.017700 -0.926594
.. ... ... ... ... ...
995 -5.240183 -0.625118 2.176452 2.065998 1.561615
996 -3.084039 -0.017626 -0.377022 -1.996366 2.041706
997 -5.122489 1.476979 -3.219335 1.609466 -3.707151
998 -2.072177 -0.870773 0.546386 0.031297 0.247766
999 0.141538 -0.048204 -0.601213 0.499631 0.246530
[1000 rows x 5 columns]
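An alternative that avoids building the dictionary by hand, assuming the same filter should be applied to every selected column, is to let apply handle the orientation:

# apply runs filtfilt column by column; each call returns an array of the same
# length as the column, so the result should keep the original (1000, 5) shape.
filtered_df = test_array[columns].apply(lambda col: filtfilt(b, a, col))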

Removing values of a certain object type from a dataframe column in Pandas

I have a pandas dataframe where some values are integers and other values are arrays. I simply want to drop all of the rows that contain an array (object datatype, I believe) in my "ORIGIN_AIRPORT_ID" column, but I have not been able to figure out how to do so after trying many methods.
Here is what the first 20 rows of my dataframe look like. The values that show up as a list are the ones I want to remove. The dataset is a couple million rows, so I just need code that removes all of the array-like values in that specific dataframe column, if that makes sense.
df = df[df.origin_airport_ID.str.contains(',') == False]
You should consider next time giving us a data sample in text, instead of a figure. It's easier for us to test your example.
Original data:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
3 20194149 [10397, 10398, 10399, 10400]
4 20194150 10397
In your case, you can use the pandas to_numeric function:
df['ORIGIN_AIRPORT_ID'] = pd.to_numeric(df['ORIGIN_AIRPORT_ID'], errors='coerce')
It replaces every cell that cannot be converted into a number with NaN (Not a Number), so we get:
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397.0
1 20194147 10397.0
2 20194148 10397.0
3 20194149 NaN
4 20194150 10397.0
To remove these rows now just use .dropna
df = df.dropna().astype('int')
Which results in your desired DataFrame
ITIN_ID ORIGIN_AIRPORT_ID
0 20194146 10397
1 20194147 10397
2 20194148 10397
4 20194150 10397
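If the offending cells really hold Python lists rather than strings, another option is to filter on the cell type directly instead of coercing to numbers; a sketch, assuming the stray values are lists:

# Keep only rows whose ORIGIN_AIRPORT_ID is not a list; nothing is converted,
# the unwanted rows are simply filtered out.
mask = df['ORIGIN_AIRPORT_ID'].apply(lambda v: not isinstance(v, list))
df = df[mask]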

Pandas rounding

I have the following sample dataframe:
Market Value
0 282024800.37
1 317460884.85
2 1260854026.24
3 320556927.27
4 42305412.79
I am trying to round the values in this dataframe to the nearest whole number. Desired output:
Market Value
282024800
317460885
1260854026
320556927
42305413
I tried:
df.values.round()
and the result was
Market Value
282025000.00
317461000.00
1260850000.00
320557000.00
42305400.00
What am I doing wrong?
Thanks
This might be more appropriately posted as a comment, but it is put here for proper formatting.
I can't reproduce your result. With numpy 1.18.1 and pandas 1.1.0,
df.round().astype('int')
gives me:
Market Value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413
The only thing I can think of is that you may be on a 32-bit system, where
df.astype('float32').round().astype('int')
gives me
Market Value
0 282024800
1 317460896
2 1260854016
3 320556928
4 42305412
The following will keep your data intact as floats but will have it display/print to the nearest int.
Big caveat: this applies to ALL dataframes at once (it is a pandas-wide option) rather than to a single dataframe.
pd.set_option("display.precision", 0)
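Since this changes the display globally, you may want to restore the default once the rounded view is no longer needed; pandas provides reset_option for that:

# Restore the default display precision for all subsequent output.
pd.reset_option("display.precision")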
If you like @noah's solution but don't want to have to change the option back after you output something, you can use the following helper function:
import pandas as pd
from contextlib import contextmanager

@contextmanager
def temp_pandas_options(options):
    """Temporarily apply pandas options, restoring the old values on exit."""
    seen_options = set()
    old_values = {}
    if isinstance(options, dict):
        options_pairs = list(options.items())
    else:
        options_pairs = options
    for option, value in options_pairs:
        assert option not in seen_options, f"Already saw option {option}"
        seen_options.add(option)
        old_values[option] = pd.get_option(option)
        pd.set_option(option, value)
    yield
    for option, old_value in old_values.items():
        pd.set_option(option, old_value)
Then you can run
with temp_pandas_options({'display.float_format': '{:.0f}'.format}):
    print(market_value_df)
and get
Market value
0 282024800
1 317460885
2 1260854026
3 320556927
4 42305413

Delete all rows with an empty cell anywhere in the table at once in pandas

I have googled this and found lots of questions on Stack Overflow. So suppose I have a dataframe like this:
A B
-----
1
2
4 4
The first 3 rows should be deleted. And suppose I have not 2 but 200 columns. How can I do that?
As per your request, first replace the empty cells with NaN, then drop them:
import numpy as np

df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()
If you only want to drop rows based on specific columns, pass those column names to the subset argument of dropna.
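For example, continuing from the snippet above, a sketch that only drops rows where column A (a column name taken from the sample data) is empty:

df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna(subset=['A'])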

How to find unique value range between two values and fill in

Let me start off with the caveat that I'm new to the site and even newer to coding, so hopefully the formatting of this all displays correctly.
I'm looking to locate a few given unique records in a data set (in the example data, "blah4" and "blah44") and update the values below their location, up to the location of the next flagged unique record. I've tried searching on this topic but haven't had much luck yet. Any help/direction that you could provide would be much appreciated. Cheers!
import pandas as pd

info = [("blah1", "blah2", "blah3"),
        ("blah4", "blah5", "blah6"),
        ("blah11", "blah22", "blah33"),
        ("blah44", "blah55", "blah66"),
        ("blah7", "blah8", "blah9"),
        ("blah77", "blah88", "blah99")]
df = pd.DataFrame(info, columns=["Name", "Type", "Class"])
print(df)

# Highlighting the data values that I'm looking to locate and use to replace
# the values beneath, until reaching the next flagged value.
print(df[df[:].isin(["blah4", "blah44"])])
#Desired outcome
info2 = [("blah1", "blah2", "blah3"),
         ("blah4", "blah5", "blah6"),
         ("blah4", "blah22", "blah33"),
         ("blah44", "blah55", "blah66"),
         ("blah44", "blah8", "blah9"),
         ("blah44", "blah88", "blah99")]
df2 = pd.DataFrame(info2, columns=["Name", "Type", "Class"])
print(df2)
You can group by the cumulative sum of the flag column and transform with "first": each flagged row starts a new group, so every row below it inherits its name until the next flag.
df["Name"] = df.groupby(df["Name"].isin(["blah4","blah44"]).cumsum())["Name"].transform("first")
print (df)
Name Type Class
0 blah1 blah2 blah3
1 blah4 blah5 blah6
2 blah4 blah22 blah33
3 blah44 blah55 blah66
4 blah44 blah8 blah9
5 blah44 blah88 blah99
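A slightly different way to express the same idea, starting again from the original df, is to keep only the flagged names, forward-fill them, and fall back to the original value for rows before the first flag:

# Rows that are not flagged become NaN, the flags are carried downwards, and
# row 0 (before the first flag) keeps its original name.
flagged = df["Name"].where(df["Name"].isin(["blah4", "blah44"]))
df["Name"] = flagged.ffill().fillna(df["Name"])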