What is the smartest way to read a CSV with multiple data types mixed in? - pandas

1,100
2,200
3,300
...
many rows
...
9934,321
9935,111
2021-01-01, jane doe, 321
2021-01-10, john doe, 211
2021-01-30, jack doe, 911
...
many rows
...
2021-11-30, jick doe, 921
If I encounter a CSV file like the one above,
how can I separate it into 2 dataframes without a loop or other row-by-row computation?

I would approach it like this:
import pandas as pd
data = 'file.csv'
df = pd.read_csv(data, names=['a', 'b', 'c'])  # columns must be named, see the note below
df_1 = df[~df['c'].isnull()]  # rows that have a 3rd column
df_2 = df[df['c'].isnull()]   # rows that have only two columns
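With this first approach df_2 still carries the all-NaN column c; a small cleanup (a sketch) drops it:
df_2 = df_2.drop(columns='c')  # the 2-column rows never had a third field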
A second idea is to first find the index of the row where the data switches from 2 to 3 columns:
import pandas as pd
data = 'stack.csv'
df = pd.read_csv(data, names=['a', 'b', 'c'])
rows = df.index[df['c'].isna()]  # indices of the 2-column rows
df_1 = pd.read_csv(data, names=['a', 'b', 'c'], skiprows=rows[-1] + 1)
df_2 = pd.read_csv(data, names=['a', 'b'], nrows=rows[-1] + 1)
I think you can easily adapt the code if the file layout changes.
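If you would rather not read the file twice, here is a sketch that splits the already-loaded frame in memory, assuming all 2-column rows come before the 3-column rows and the frame keeps its default RangeIndex:
split = df['c'].notna().idxmax()               # index of the first 3-column row
df_2 = df.iloc[:split, :2]                     # 2-column block, without the empty 'c'
df_1 = df.iloc[split:].reset_index(drop=True)  # 3-column block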
Here is why I named the columns: without explicit names, read_csv infers the column count from the first rows, and the later rows with an extra field raise a tokenizing error (see link).

Related

Transform Dataframe rows to columns and additional steps

I have a dataframe like this:
import pandas as pd
import numpy as np
data = {'UNIT': ['UNIT_1', 'UNIT_2', 'UNIT_3', 'UNIT_4'],
        'Name_1': ['werner', 'otto', 'karl', 'fritz'],
        'Name_2': ['ottilie', 'anna', 'jasmin', ''],
        'Name_3': ['bello', 'kitti', '', '']}
df=pd.DataFrame(data)
df.replace('', np.nan, inplace=True)
display(df)
which looks like this:
     UNIT  Name_1   Name_2 Name_3
0  UNIT_1  werner  ottilie  bello
1  UNIT_2    otto     anna  kitti
2  UNIT_3    karl   jasmin    NaN
3  UNIT_4   fritz      NaN    NaN
The result that I want is a long table with one row per name, tagged with its UNIT (as in the output shown in the answer below).
The code I have so far looks like this:
for index, row in df.iterrows():
    row_transposed = row.T
    row_transposed.dropna(inplace=True)
    df_row_transposed = pd.DataFrame(row_transposed)
    df_row_transposed_head = df_row_transposed.head(1)
    #display(df_row_transposed)
    #display(row_transposed_head)
    hr_unit = df_row_transposed_head.iloc[0]
    add_unit = (hr_unit[index])
    for index, row in df_row_transposed.iterrows():
        df_row_transposed["UNIT"] = add_unit
        #row_transposed = row_transposed.iloc[index: , :]
    display(df_row_transposed)
which already creates a small transposed frame per UNIT,
but now I am stuck...
Any help is very much appreciated.
Try df.melt. It will unpivot the name columns into rows.
ddf = df.melt(id_vars='UNIT').sort_values(by='UNIT')
new_df = ddf[["value", "UNIT"]]
new_df = new_df.dropna().reset_index(drop=True)  # dropna() returns a copy, so reassign rather than using inplace
new_df
Out[163]:
     value    UNIT
0   werner  UNIT_1
1  ottilie  UNIT_1
2    bello  UNIT_1
3     otto  UNIT_2
4     anna  UNIT_2
5    kitti  UNIT_2
6     karl  UNIT_3
7   jasmin  UNIT_3
8    fritz  UNIT_4
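An equivalent route is stack, which drops the NaNs on its own; a sketch against the same df:
new_df = (df.set_index('UNIT')
            .stack()                          # long Series indexed by (UNIT, Name_i); NaNs are dropped
            .reset_index(level=1, drop=True)  # discard the Name_i level
            .rename('value')
            .reset_index()[['value', 'UNIT']])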

Sum column values from duplicate rows in python3

I have an old.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
and I need a new.csv like this:
Name,State,Brand,Model,Price
Adam,MO,Toyota,RV4,26500
Berry,KS,Toyota,Camry,30000
Kavin,CA,Ford,F150,23000
Yuke,OR,Nissan,Murano,31000
As you can see, the difference between the two is:
Berry,KS,Toyota,Camry,18000
Berry,KS,Toyota,Camry,12000
merged into
Berry,KS,Toyota,Camry,30000
Here is my code:
import pandas as pd
df=pd.read_csv('old.csv')
df1 = df.sort_values('Name').groupby('Name','State','Brand','Model') \
        .agg({'Name':'first','Price':'sum'})
print(df1[['Name','State','Brand','Model','Price']])
and it didn't work; I got this error:
File "------\venv\lib\site-packages\pandas\core\frame.py", line 4421, in sort_values stacklevel=stacklevel)
File "------- \venv\lib\site-packages\pandas\core\generic.py", line 1382, in _get_label_or_level_values raise KeyError(key)
KeyError: 'Name'
I am totally new to Python, and I found a solution on Stack Overflow:
Sum values from Duplicated rows
The post above has a similar question to mine, but it is SQL code, not Python.
Any help will be greatly appreciated....
import pandas as pd
df = pd.read_csv('old.csv')
Group by the 4 fields ('Name', 'State', 'Brand', 'Model'), select the Price column, and apply the sum aggregate to it:
df1 = df.groupby(['Name', 'State', 'Brand', 'Model'])['Price'].agg(['sum'])
print(df1)
This will give you the required output:
                           sum
Name  State Brand  Model
Adam  MO    Toyota RV4    26500
Berry KS    Toyota Camry  30000
Kavin CA    Ford   F150   23000
Yuke  OR    Nissan Murano 31000
Note: df1 has only the single column sum. The other 4 columns are index levels, so to write this to csv we first need to turn those index levels back into dataframe columns.
list(df1['sum'].index.get_level_values('Name')) will give you output like this:
['Adam', 'Berry', 'Kavin', 'Yuke']
Now, do this for all index levels:
df2 = pd.DataFrame()
cols = ['Name', 'State', 'Brand', 'Model']
for col in cols:
    df2[col] = list(df1['sum'].index.get_level_values(col))
df2['Price'] = df1['sum'].values
Now, just write df2 to a csv file like this:
df2.to_csv('new.csv', index = False)
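A shorter route (a sketch) is as_index=False, which keeps the group keys as regular columns and skips the manual rebuild:
df1 = df.groupby(['Name', 'State', 'Brand', 'Model'], as_index=False)['Price'].sum()
df1.to_csv('new.csv', index=False)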

Pandas Merge function only giving column headers - Update

What I want to achieve.
I have two data frames, DF1 and DF2, each read from a different Excel file.
DF1 has 9 columns and 3000 rows; one of the columns is named "Code Group".
DF2 has 2 columns and 20 rows; one of its columns is also named "Code Group". In the same dataframe, another column, "Code Management Method", gives the explanation of the code group, e.g. H001 is treated as recyclable, H002 is treated as landfill.
What happens
When I use the command data = pd.merge(DF1, DF2, on='Code Group'), I get only the 10 column names and no rows underneath.
What I expect
I want DF1 and DF2 merged so that wherever the Code Group value matches, the Code Management Method is attached as the explanation.
Additional information
Following are the datatypes for DF1:
Entity object
Address object
State object
Site object
Disposal Facility object
Pounds float64
Waste Description object
Shipment Date datetime64[ns]
Code Group object
Following are the datatypes for DF2:
Code Management Method object
Code Group object
What I tried
I tried to follow the suggestion from similar posts on SO that the datatypes on both sides should be the same; Code Group is an object in both, so I don't know what the issue is. I also tried the concat function.
Code
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
CH = r"C:\Python\Waste\Shipment.xls"   # raw strings so the backslashes are not treated as escapes
Code = r"C:\Python\Waste\Code.xlsx"
Data = pd.read_excel(Code)
data1 = pd.read_excel(CH)
data1.rename(columns={'generator_name': 'Entity',
                      'generator_address': 'Address',
                      'generator_state': 'State',
                      'Site_City': 'Site',
                      'final_disposal_facility_name': 'Disposal Facility',
                      'drum_wgt': 'Pounds',
                      'wst_dscrpn': 'Waste Description',
                      'genrtr_sgntr_dt': 'Shipment Date',
                      'expected_disposal_management_methodcode': 'Code Group'},
             inplace=True)
data2 = data1[['Entity','Address','State','Site','Disposal Facility','Pounds','Waste Description','Shipment Date','Code Group']]
data2
merged = data2.merge(Data, on='Code Group')
Getting a Warning
C:\Anaconda\lib\site-packages\pandas\core\generic.py:5890: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
import pandas as pd
df1 = pd.DataFrame({'Region': [1, 2, 3],
                    'zipcode': [12345, 23456, 34567]})
df2 = pd.DataFrame({'ZipCodeLowerBound': [10000, 20000, 30000],
                    'ZipCodeUpperBound': [19999, 29999, 39999],
                    'Region': [1, 2, 3]})
df1.merge(df2, on='Region')
This is how the example goes; the two input frames are:
Region zipcode
0 1 12345
1 2 23456
2 3 34567
Region ZipCodeLowerBound ZipCodeUpperBound
0 1 10000 19999
1 2 20000 29999
2 3 30000 39999
and the merge results in:
Region zipcode ZipCodeLowerBound ZipCodeUpperBound
0 1 12345 10000 19999
1 2 23456 20000 29999
2 3 34567 30000 39999
I hope this is what you want to do.
After multiple tries I found that the key column contained some stray whitespace, so I used the code below and it worked perfectly. The funny thing is I never ran into the problem with two other datasets imported from Excel files.
data2['Code Group'] = data2['Code Group'].str.strip()
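Putting both issues together, a defensive sketch: take an explicit copy of the column subset (which also silences the SettingWithCopyWarning above) and normalize the join key on both sides before merging:
data2 = data1[['Entity', 'Address', 'State', 'Site', 'Disposal Facility',
               'Pounds', 'Waste Description', 'Shipment Date', 'Code Group']].copy()
for frame in (data2, Data):
    frame['Code Group'] = frame['Code Group'].astype(str).str.strip()
merged = data2.merge(Data, on='Code Group')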

How to extract the unique values of a column and their counts, and store them in a dataframe with an index key

I am new to pandas. I have a simple question:
how do I extract the unique values of a column and their counts, and store them in a dataframe with an index key?
I have tried:
df = df1['Genre'].value_counts()
and I am getting a Series, but I don't know how to convert it to a DataFrame object.
A pandas Series has a .to_frame() method. Try it:
df = df1['Genre'].value_counts().to_frame()
And if you want to "switch" the rows to columns:
df = df1['Genre'].value_counts().to_frame().T
Update: a full example if you want them as columns:
import pandas as pd
import numpy as np
np.random.seed(400)  # to reproduce the random values
df1 = pd.DataFrame({
    'Genre': np.random.choice(['Comedy', 'Drama', 'Thriller'], size=10)
})
df = df1['Genre'].value_counts().to_frame().T
print(df)
Returns:
       Thriller  Comedy  Drama
Genre         5       3      2
Alternatively, try:
df = pd.DataFrame(df1['Genre'].value_counts())
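Another common pattern is reset_index, which turns the counts into a two-column frame (a sketch; the automatically generated column names differ across pandas versions, so they are set explicitly here):
df = df1['Genre'].value_counts().reset_index()
df.columns = ['Genre', 'count']  # normalize the names regardless of pandas version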

Iterating over a dictionary of empty pandas dataframes to fill them with data from an existing dataframe based on lists of column names

I'm a biologist and very new to Python (I use v3.5) and pandas. I have a pandas dataframe (df) from which I need to make several dataframes (df1 ... dfn) kept in a dictionary (dictA), which currently holds the correct number (n) of empty dataframes. I also have a dictionary (dictB) of n individual lists of column names that were extracted from df. The keys of the two dictionaries match. I'm trying to fill the empty dataframes in dictA with the parts of df selected by the column lists in dictB.
import pandas as pd
listA=['A', 'B', 'C',...]
dictA={i:pd.DataFrame() for i in listA}
Let's say I have something like this:
dictA={'A': df1, 'B': df2}
dictB = {'A': ['A1', 'A2', 'A3'],
         'B': ['B1', 'B2']}
df = pd.DataFrame({'A1': [0, 2, 4, 5],
                   'A2': [2, 5, 6, 7],
                   'A3': [5, 6, 7, 8],
                   'B1': [2, 5, 6, 7],
                   'B2': [1, 3, 5, 6]})
listA=['A', 'B']
What I'm trying to get is for df1 and df2 to be filled with the matching portions of df, so that df1 comes out like this:
   A1  A2  A3
0   0   2   5
1   2   5   6
2   4   6   7
3   5   7   8
df2 would have columns B1 and B2.
I tried the following loop and some variations of it, but it doesn't yield populated dataframes:
for key, values in dictA.items():
    values.append(df[dictB[key]])
Thanks, and sorry if this was already addressed elsewhere, but I couldn't find it.
You could create the dataframes you want like this instead:
# df is your original dataframe containing all the columns
df_A = df[[col for col in df.columns if col.startswith('A')]]
df_B = df[[col for col in df.columns if col.startswith('B')]]
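As for why the original loop stays empty: DataFrame.append returns a new frame instead of modifying the one stored in the dictionary (and append was removed entirely in pandas 2.0). A sketch that builds the whole dictionary directly from dictB instead:
dictA = {key: df[cols].copy() for key, cols in dictB.items()}
After this, dictA['A'] is the 3-column frame shown above and dictA['B'] holds B1 and B2.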