Sum column's values from duplicate rows python3 - sum

I have a old.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
and I need a new.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,30000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
As you can see the difference from these two is:
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
merge to
Berry,KS,Toyota,Camry,30000
Here is my code:
import pandas as pd
df=pd.read_csv('old.csv')
df1=df.sort_values('Name').groupby('Name','State','Brand','Model')
.agg({'Name':'first','Price':'sum'})
print(df1[['Name','State','Brand','Model','Price']])
and It didn't work,and I got these error:
File "------\venv\lib\site-packages\pandas\core\frame.py", line 4421, in sort_values stacklevel=stacklevel)
File "------- \venv\lib\site-packages\pandas\core\generic.py", line 1382, in _get_label_or_level_values raise KeyError(key)
KeyError: 'Name'
I am a totally new of python,and I found a solutions in stackoverflow:
Sum values from Duplicated rows
The site above has similar question as mine,But It's a sql code,
Not Python
Any help will be great appreciation....

import pandas as pd
df = pd.read_csv('old.csv')
Group by 4 fields('Name', 'State', 'Brand', 'Model') and select Price column and apply aggregate sum to it,
df1 = df.groupby(['Name', 'State', 'Brand', 'Model'])['Price'].agg(['sum'])
print(df1)
This will give you a required output,
sum
Name State Brand Model
Adam MO Toyota RV4 26500
Berry KS Toyota Camry 30000
Kavin CA Ford F150 23000
Yuke OR Nissan Murano 31000
Note: There is only column sum in this df1. All 4 other columns are indexes so to convert it into csv, we first need to convert these 4 index columns to dataframe columns.
list(df1['sum'].index.get_level_values('Name')) will give you an output like this,
['Adam', 'Berry', 'Kavin', 'Yuke']
Now, for all indexes, do this,
df2 = pd.DataFrame()
cols = ['Name', 'State', 'Brand', 'Model']
for col in cols:
df2[col] = list(df1['sum'].index.get_level_values(col))
df2['Price'] = df1['sum'].values
Now, just write df2 to excel file like this,
df2.to_csv('new.csv', index = False)

Related

what is the smartest way to read csv with multiple type data mixed in?

1,100
2,200
3,300
...
many datas
...
9934,321
9935,111
2021-01-01, jane doe, 321
2021-01-10, john doe, 211
2021-01-30, jack doe, 911
...
many datas
...
2021-11-30, jick doe, 921
If I meet csv file like above,
How can I separate it as 2 dataframes? without loop or something calculate
I see this like that:
import pandas as pd
data = 'file.csv'
df = pd.read_csv(data ,names=['a', 'b', 'c']) # I have to name columns
df_1 = df[~df['c'].isnull()] #This is with 3rd column
df_2 = df[df['c'].isnull()] #This is where are only two columns
Second idea was to first find the index of the row where data will switch from 2 to 3 column.
import pandas as pd
import numpy as np
data = 'stack.csv'
df = pd.read_csv(data ,names=['a', 'b', 'c'])
rows = df['c'].index[df['c'].apply(np.isnan)]
df_1 = pd.read_csv(data ,names=['a', 'b','c'],skiprows=rows[-1]+1)
df_2 = pd.read_csv(data ,names=['a', 'b'],nrows = rows[-1]+1)
I think you can easily modify the code when the files will change.
Here is the reason why I named columns link

Pandas Merge function only giving column headers - Update

What I want to achieve.
I have two data frames. DF1 and DF2. Both are being read from different excel file.
DF1 has 9 columns and 3000 rows, of which one of the column name is "Code Group".
DF2 has 2 columns and 20 rows, of which one of the column name is also "Code Group". In this same dataframe another column "Code Management Method" gives the explanation of code group. For eg. H001 is Treated at recyclable, H002 is Treated as landfill.
What happens
When I use the command data = pd.merge(DF1,DF2, on='Code Group') I only get 10 column names but no rows underneath.
What I expect
I would want DF1 and DF2 to be merged and wherever Code Group number is same Code Management Method to be pasted for explanation.
Additional information
Following are datatype for DF1
Entity object
Address object
State object
Site object
Disposal Facility object
Pounds float64
Waste Description object
Shipment Date datetime64[ns]
Code Group object
FollOwing are datatype for DF2
Code Management Method object
Code Group object
What I tried
I tried to follow the suggestions from similar post on SO that the datatypes on both sides should be same and Code Group here both are objects so don't know what's the issue. I also tried Concat function.
Code
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
CH = "C:\Python\Waste\Shipment.xls"
Code = "C:\Python\Waste\Code.xlsx"
Data = pd.read_excel(Code)
data1 = pd.read_excel(CH)
data1.rename(columns={'generator_name':'Entity','generator_address':'Address', 'Site_City':'Site','final_disposal_facility_name':'Disposal Facility', 'wst_dscrpn':'Waste Description', 'drum_wgt':'Pounds', 'wst_dscrpn' : 'Waste Description', 'genrtr_sgntr_dt':'Shipment Date','generator_state': 'State','expected_disposal_management_methodcode':'Code Group'},
inplace=True)
data2 = data1[['Entity','Address','State','Site','Disposal Facility','Pounds','Waste Description','Shipment Date','Code Group']]
data2
merged = data2.merge(Data, on='Code Group')
Getting a Warning
C:\Anaconda\lib\site-packages\pandas\core\generic.py:5890: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
import pandas as pd
df1 = pd.DataFrame({'Region': [1,2,3],
'zipcode':[12345,23456,34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000,20000,30000],
'ZipCodeUpperBound': [19999,29999,39999],
'Region': [1,2,3]})
df1.merge(df2, on='Region')
this is how the example is given, and the result for this is:
Region zipcode
0 1 12345
1 2 23456
2 3 34567
Region ZipCodeLowerBound ZipCodeUpperBound
0 1 10000 19999
1 2 20000 29999
2 3 30000 39999
and that thing will result in
Region zipcode ZipCodeLowerBound ZipCodeUpperBound
0 1 12345 10000 19999
1 2 23456 20000 29999
2 3 34567 30000 39999
I hope this is what you want to do
After multiple tries I found that the column had some garbage so used the code below and it worked perfectly. Funny thing is I never encountered the problem on two other data-sets that I imported from excel file.
data2['Code'] = data2['Code'].str.strip()

pandas dataframe export row column value

I import data from Excel into python pandas with read_clipboard.
import pandas as pd
df = pd.read_clipboard()
The column index are the month (januar, februar, ...,december). The row index are products name (orange, banana, etc). And the value in cells are the monthly sales.
How can I export a csv of the following format
month;product;sales
To make it more visual, I show the input in the first image and how the output shoud be in the second image.
You can also use xlrd package.
Sample Book1.xlsx:
january february march
Orange 4 2 4
banana 2 6 3
apple 5 1 7
sample code:
import xlrd
book = xlrd.open_workbook("Book1.xlsx")
print(book.sheet_names())
first_sheet = book.sheet_by_index(0)
row1 = first_sheet.row_values(0)
print(first_sheet.nrows)
for i in range(len(row1)):
if i !=0:
next_row = first_sheet.row_values(i)
for j in range(len(next_row)-1):
print("{};{};{}".format(row1[i],next_row[0],next_row[j+1]))
Result:
january;Orange;4.0
january;Orange;2.0
january;Orange;4.0
february;banana;2.0
february;banana;6.0
february;banana;3.0
march;apple;5.0
march;apple;1.0
march;apple;7.0
If that is only the case, it might solve that problem:
month = df1.columns.to_list()*3
product = []
sales=[]
for x in range(0,2):
product += [df1.index[x]]*12
sales += df1.iloc[x].values.tolist()
df2 = pd.DataFrame({'month': month, 'product': product, 'sales': sales})
But you need to look for smarter way if you have a larger Dataframe, like what #Jon Clements suggested in the comment.
I finally solved it thanks to your advice : using unstack
df2 = df.transpose()
df3 = df2 =.unstack()
df3.to_csv('my/path/name.csv', sep=';')

how to extract the unique values and its count of a column and store in data frame with index key

I am new to pandas.I have a simple question:
how to extract the unique values and its count of a column and store in data frame with index key
I have tried to:
df = df1['Genre'].value_counts()
and I am getting a series but I don't know how to convert it to data frame object.
Pandas series has a .to_frame() function. Try it:
df = df1['Genre'].value_counts().to_frame()
And if you wanna "switch" the rows to columns:
df = df1['Genre'].value_counts().to_frame().T
Update: Full example if you want them as columns:
import pandas as pd
import numpy as np
np.random.seed(400) # To reproduce random variables
df1 = pd.DataFrame({
'Genre': np.random.choice(['Comedy','Drama','Thriller'], size=10)
})
df = df1['Genre'].value_counts().to_frame().T
print(df)
Returns:
Thriller Comedy Drama
Genre 5 3 2
try
df = pd.DataFrame(df1['Genre'].value_counts())

frequency table for all columns in pandas

I want to run frequency table on each of my variable in my df.
def frequency_table(x):
return pd.crosstab(index=x, columns="count")
for column in df:
return frequency_table(column)
I got an error of 'ValueError: If using all scalar values, you must pass an index'
How can i fix this?
Thank you!
You aren't passing any data. You are just passing a column name.
for column in df:
print(column) # will print column names as strings
try
ctabs = {}
for column in df:
ctabs[column]=frequency_table(df[column])
then you can look at each crosstab by using the column name as keys in the ctabs dictionary
for column in df:
print(data[column].value_counts())
For example:
import pandas as pd
my_series = pd.DataFrame(pd.Series([1,2,2,3,3,3, "fred", 1.8, 1.8]))
my_series[0].value_counts()
will generate output like in below:
3 3
1.8 2
2 2
fred 1
1 1
Name: 0, dtype: int64