How to find unique value range between two values and fill in - pandas

Let me start off with the caveat that I'm new to the site and newer to coding, so hopefully the formatting of this all displays correctly.
I'm looking to locate a few given unique records in a data set ("blah4" and "blah44" in the example data) and update the values below their location, up to the location of the next flagged unique record. I've tried searching on this topic but haven't had much luck yet. Any help/direction you could provide would be much appreciated. Cheers!
import pandas as pd
info = [("blah1","blah2","blah3"),
("blah4","blah5","blah6"),
("blah11","blah22","blah33"),
("blah44","blah55","blah66"),
("blah7","blah8","blah9"),
("blah77","blah88","blah99")]
df = pd.DataFrame(info, columns =["Name","Type","Class"])
print(df)
# Highlighting the values I'm looking to locate and use to replace the values beneath, until reaching the next flagged value
print(df[df.isin(["blah4", "blah44"])])
#Desired outcome
info2 = [("blah1","blah2","blah3"),
("blah4","blah5","blah6"),
("blah4","blah22","blah33"),
("blah44","blah55","blah66"),
("blah44","blah8","blah9"),
("blah44","blah88","blah99")]
df2 = pd.DataFrame(info2, columns =["Name","Type","Class"])
print(df2)

You can group by the cumulative sum of the boolean mask and transform with "first":
df["Name"] = df.groupby(df["Name"].isin(["blah4","blah44"]).cumsum())["Name"].transform("first")
print(df)
     Name    Type   Class
0   blah1   blah2   blah3
1   blah4   blah5   blah6
2   blah4  blah22  blah33
3  blah44  blah55  blah66
4  blah44   blah8   blah9
5  blah44  blah88  blah99
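To see why this works, it helps to inspect the intermediate grouping key. The isin mask is True on each flagged row, and its cumulative sum labels every row with the block opened by the most recent flag; transform("first") then broadcasts the first Name of each block back onto all of its rows. A minimal sketch (run before the assignment above; the variable names are just illustrative):
mask = df["Name"].isin(["blah4", "blah44"])  # False, True, False, True, False, False
key = mask.cumsum()                          # 0, 1, 1, 2, 2, 2 -> one label per block
print(df.assign(mask=mask, key=key))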

Related

How to get frequency count of unique values in a column, grouped by another column? (Pandas)

In the linked screenshot, I have data in the format shown on the left, and I need to return a frequency count table like the one on the right. I know I can use groupby("Time"), and I can get the unique values of the "Type" column with df["Type"].unique(), but I get stuck on the next steps. Any suggestion?
[Screenshot of the data and expected output]
import pandas as pd
df = pd.DataFrame({'Time': ['AM', 'PM', 'AM', 'AM', 'PM', 'PM', 'AM', 'PM'],
                   'Type': ['Egg', 'Milk', 'Milk', 'Ham', 'Milk', 'Ham', 'Milk', 'Egg']})
pd.crosstab() is what you need:
pd.crosstab(df['Time'], df['Type'])
Or you could also use .pivot_table() with aggfunc='size':
df.pivot_table(index='Time', columns='Type', aggfunc='size')
Or groupby + unstack:
df.groupby(['Time', 'Type']).size().unstack()
Output:
Type  Egg  Ham  Milk
Time
AM      1    1     2
PM      1    1     2
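One difference worth knowing: pd.crosstab() fills Time/Type combinations that never occur with 0, while .unstack() leaves them as NaN (which also upcasts the counts to float). If your real data can have missing combinations, fill_value keeps integer zeros; a small sketch of both variants:
df.groupby(['Time', 'Type']).size().unstack(fill_value=0)
df.pivot_table(index='Time', columns='Type', aggfunc='size', fill_value=0)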

Convert transactions with several products from columns to row [duplicate]

I'm having a very tough time trying to figure out how to do this with python. I have the following table:
NAMES   VALUE
john_1      1
john_2      2
john_3      3
bro_1       4
bro_2       5
bro_3       6
guy_1       7
guy_2       8
guy_3       9
And I would like to go to:
NAMES  VALUE1  VALUE2  VALUE3
john        1       2       3
bro         4       5       6
guy         7       8       9
I have tried with pandas: I first split the index (NAMES) and can create the new columns, but I have trouble getting the values into the right columns.
Can someone at least give me a direction toward the solution? I don't expect full code (I know that's not appreciated), but any help is welcome.
After splitting the NAMES column, use .pivot to reshape your DataFrame.
# Split Names and Pivot.
df['NAME_NBR'] = df['NAMES'].str.split('_').str.get(1)
df['NAMES'] = df['NAMES'].str.split('_').str.get(0)
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
# Rename columns and reset the index.
df.columns = ['VALUE{}'.format(c) for c in df.columns]
df.reset_index(inplace=True)
If you want to be slick, you can do the split in a single line:
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
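Putting the whole pipeline together on the sample data (a sketch; the DataFrame construction is assumed from the table above):
import pandas as pd

df = pd.DataFrame({'NAMES': ['john_1', 'john_2', 'john_3', 'bro_1', 'bro_2',
                             'bro_3', 'guy_1', 'guy_2', 'guy_3'],
                   'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
df['NAMES'], df['NAME_NBR'] = zip(*[s.split('_') for s in df['NAMES']])
df = df.pivot(index='NAMES', columns='NAME_NBR', values='VALUE')
df.columns = ['VALUE{}'.format(c) for c in df.columns]
print(df.reset_index())  # rows come back sorted by name: bro, guy, john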

Delete all rows with an empty cell anywhere in the table at once in pandas

I have googled it and found lots of questions on Stack Overflow. So suppose I have a dataframe like this:
A   B
-----
1
    2

4   4
The first 3 rows (the ones with an empty cell anywhere) should be deleted. And suppose I have not 2 but 200 columns. How can I do that?
As per your request, first replace the empty strings with NaN (note the numpy import), then drop:
import numpy as np

df = df.replace(r'^\s*$', np.nan, regex=True)
df = df.dropna()
If you want to drop based on a specific column only, pass the column name to the subset argument of dropna.
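For example, to drop only the rows where column A is empty (assuming the empty strings have already been replaced with NaN as above):
df = df.dropna(subset=['A'])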

Pandas Merge function only giving column headers - Update

What I want to achieve.
I have two data frames, DF1 and DF2, each read from a different Excel file.
DF1 has 9 columns and 3000 rows, one of which is named "Code Group".
DF2 has 2 columns and 20 rows, one of which is also named "Code Group". Its other column, "Code Management Method", gives the explanation of the code group. For example, H001 is treated as recyclable and H002 is treated as landfill.
What happens
When I use the command data = pd.merge(DF1, DF2, on='Code Group'), I get all 10 column names but no rows underneath.
What I expect
I want DF1 and DF2 merged so that, wherever the Code Group numbers match, the Code Management Method is filled in as the explanation.
Additional information
Following are the datatypes for DF1:
Entity object
Address object
State object
Site object
Disposal Facility object
Pounds float64
Waste Description object
Shipment Date datetime64[ns]
Code Group object
Following are the datatypes for DF2:
Code Management Method object
Code Group object
What I tried
I followed suggestions from similar posts on SO that the datatypes on both sides should match, and Code Group is an object in both frames, so I don't know what the issue is. I also tried the concat function.
Code
import pandas as pd

# Raw strings keep the backslashes in the Windows paths from being read as escapes
CH = r"C:\Python\Waste\Shipment.xls"
Code = r"C:\Python\Waste\Code.xlsx"
Data = pd.read_excel(Code)
data1 = pd.read_excel(CH)
data1.rename(columns={'generator_name': 'Entity',
                      'generator_address': 'Address',
                      'generator_state': 'State',
                      'Site_City': 'Site',
                      'final_disposal_facility_name': 'Disposal Facility',
                      'drum_wgt': 'Pounds',
                      'wst_dscrpn': 'Waste Description',
                      'genrtr_sgntr_dt': 'Shipment Date',
                      'expected_disposal_management_methodcode': 'Code Group'},
             inplace=True)
data2 = data1[['Entity', 'Address', 'State', 'Site', 'Disposal Facility',
               'Pounds', 'Waste Description', 'Shipment Date', 'Code Group']]
data2
merged = data2.merge(Data, on='Code Group')
Getting a Warning
C:\Anaconda\lib\site-packages\pandas\core\generic.py:5890: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
import pandas as pd

df1 = pd.DataFrame({'Region': [1, 2, 3],
                    'zipcode': [12345, 23456, 34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000, 20000, 30000],
                    'ZipCodeUpperBound': [19999, 29999, 39999],
                    'Region': [1, 2, 3]})
df1.merge(df2, on='Region')
This is how merging on a shared key is meant to work. The two input frames are:
   Region  zipcode
0       1    12345
1       2    23456
2       3    34567

   Region  ZipCodeLowerBound  ZipCodeUpperBound
0       1              10000              19999
1       2              20000              29999
2       3              30000              39999
and the merge results in:
   Region  zipcode  ZipCodeLowerBound  ZipCodeUpperBound
0       1    12345              10000              19999
1       2    23456              20000              29999
2       3    34567              30000              39999
I hope this is what you want to do.
After multiple tries I found that the column had some garbage in it, so I used the code below and it worked perfectly. Funny thing is, I never encountered the problem on two other data sets that I imported from Excel files.
data2['Code Group'] = data2['Code Group'].str.strip()
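As an aside, the SettingWithCopyWarning quoted above shows up because data2 is a slice of data1, and modifying a slice in place is ambiguous. Taking an explicit copy when the slice is created silences it (a standard pandas idiom, not specific to this data):
data2 = data1[['Entity', 'Address', 'State', 'Site', 'Disposal Facility',
               'Pounds', 'Waste Description', 'Shipment Date', 'Code Group']].copy()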

Remove all duplicate data only show unique

I have a data set:
import pandas as pd
data = pd.read_csv('email_list.csv')
new_data = data[['Email Address','First Name','Last Name']]
       Email Address First Name Last Name
0      zoe#gmail.com        Zoé         Z
1   yvonne#yahoo.com     Yvonne         T
2  Whitney#gmail.com    Whitney         W
3      zoe#gmail.com        Zoe         Z
4   yvonne#yahoo.com     Yvonne         T
I want the output to only show me unique emails and names. So from the short list above the output should be:
       Email Address First Name Last Name
1  Whitney#gmail.com    Whitney         W
How can I do this? The simplest way will be best.
This is what you are searching for:
df.drop_duplicates(keep=False)
drop_duplicates removes duplicate rows from your dataframe. The powerful keep argument lets you tune what to keep and what to drop: with keep=False, every row that has a duplicate is dropped entirely.
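One caveat: keep=False compares entire rows, so the two zoe rows above (which differ only by an accent in First Name) would both survive. If duplicates should be judged by email alone, restrict the comparison with subset (column name taken from the question):
new_data.drop_duplicates(subset=['Email Address'], keep=False)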