Why would an extra column (unnamed: 0) appear after saving the df and then reading it through pd.read_csv? - pandas

My code to save the df is:
fdi_out_vdem.to_csv("fdi_out_vdem.csv")
To read the df into python is :
fdi_out_vdem = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv")
The df:
Unnamed: 0
country_name
value
1
Spain
190
2
Spain
311

Your df has two columns, but also an index with "0" and "1". When writing it to csv it looks like this:
,country_name,value
0,Spain,190
1,Spain,311
When importing it with pandas you it is considered as df with 3 columns (and the first has no name)
You have two possibilities here:
Save it without index column:
df.to_csv("fdi_out_vdem.csv", index=False)
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv")
or save it with index column and define an index col when reading it with pd.read_csv
df.to_csv("fdi_out_vdem.csv")
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col=[0])
UPDATE
As recommended by #ouroboros1 in the comments you could also name your index before saving it to csv, so you can define the index column by using that name
df.index.name = "index"
df.to_csv("fdi_out_vdem.csv")
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col="index")

You can either pass the parameter index_col=[0] to pandas.read_csv :
fdi_out_vdem = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col=[0])
Or even better, get rid of the index at the beginning when calling pandas.DataFrame.to_csv:
fdi_out_vdem.to_csv("fdi_out_vdem.csv", index=False)

Related

allowing python to impoert csv with duplicate column names in python

i have a data frame that looks like this:
there are in total 109 columns.
when i import the data using the read_csv it adds ".1",".2" to duplicate names .
is there any way to go around it ?
i have tried this :
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding = "ISO-8859-1",
sep='|', header=None)
df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)
but it changed the data frame and wasnt helpful.
this is what it did to my data
python:
excel:
Remove header=None, because it is used for avoid convert first row of file to df.columns and then remove . with digits from columns names:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding="ISO-8859-1", sep=',')
df.columns = df.columns.str.replace('\.\d+$','')

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply value_counts method to a Dataframe based on the columns selected dynamically in the Streamlit app
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
all_columns = df.columns.tolist()
selected_columns = st.multiselect("Select", all_columns)
new_df = df[selected_columns]
st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how could I apply value_counts/groupby method on this output in Streamlit app
If I try to do the below
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame (which doesn't have value_counts method defined on it).
Can you try st.table(new_df[col_name].value_counts())
I think the error is because value_counts() is applicable on a Series and not dataframe.
You can try Converting ".value_counts" output to dataframe
If you want to apply on one single column
def value_counts_df(df, col):
"""
Returns pd.value_counts() as a DataFrame
Parameters
----------
df : Pandas Dataframe
Dataframe on which to run value_counts(), must have column `col`.
col : str
Name of column in `df` for which to generate counts
Returns
-------
Pandas Dataframe
Returned dataframe will have a single column named "count" which contains the count_values()
for each unique value of df[col]. The index name of this dataframe is `col`.
Example
-------
>>> value_counts_df(pd.DataFrame({'a':[1, 1, 2, 2, 2]}), 'a')
count
a
2 3
1 2
"""
df = pd.DataFrame(df[col].value_counts())
df.index.name = col
df.columns = ['count']
return df
val_count_single = value_counts_df(new_df, selected_col)
If you want to apply for all object columns in the dataframe
def valueCountDF(df, object_cols):
c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
dropna=False)).T.stack() * 100).round(2)
cp = pd.concat([c,p], axis=1, keys=["Count", "Percentage %"])
return cp
val_count_df_cols = valueCountDF(df, selected_columns)
And Finally, you can use st.table or st.dataframe to show the dataframe in your streamlit app

Concatenate a pandas dataframe to CSV file without reading the entire file

I have a quite large CSV file. I have a pandas dataframe that has exactly the columns with the CSV file.
I checked on stackoverflow and I see several answers suggested to read_csv then concatenate the read dataframe with the current one then write back to a CSV file.
But for a large file I think it is not the best way.
Can I concatenate a pandas dataframe to an existed CSV file without reading the whole file?
Update: Example
import pandas as pd
df1 = pd.DataFramce ({'a':1,'b':2}, index = [0])
df1.to_csv('my.csv')
df2 = pd.DataFrame ({'a':3, 'b':4}, index = [1])
# what to do here? I would like to concatenate df2 to my.csv
The expected my.csv
a b
0 1 2
1 3 4
Look at using mode='a' in to_csv:
MCVE:
df1 = pd.DataFrame ({'a':1,'b':2}, index = [0])
df1.to_csv('my.csv')
df2 = pd.DataFrame ({'a':3, 'b':4}, index = [1])
df2.to_csv('my.csv', mode='a', header=False)
!type my.csv #Windows machine use 'type' command or on unix use 'cat'
Output:
,a,b
0,1,2
1,3,4

correct accessing of slices with duplicate index-values present

I have a dataframe with an index that sometimes contains rows with the same index-value. Now I want to slice that dataframe and set values based on row-indices.
Consider the following example:
import pandas as pd
df = pd.DataFrame({'index':[1,2,2,3], 'values':[10,20,30,40]})
df.set_index(['index'], inplace=True)
df1 = df.copy()
df2 = df.copy()
#copy warning
df1.iloc[0:2]['values'] = 99
print(df1)
df2.loc[df.index[0:2], 'values'] = 99
print(df2)
df1 is the expected result, but gives me a SettingWithCopyWarning.
df2 seems to be the suggested way of accessing by the doc, but gives me the wrong result (because of the duplicate index)
Is there a "proper" way to set those values correctly with the duplicate index-values present?
.loc is not recommended when you have duplicate index. So you have to go for position based selection iloc. Since we need to pass the positions, we have to use get_loc for getting position of column:
print (df2.columns.get_loc('values'))
0
df1.iloc[0:2, df2.columns.get_loc('values')] = 99
print(df1)
values
index
1 99
2 99
2 30
3 40

Python 3: Creating DataFrame with parsed data

The following data has been parsed from a stock API. The dataframe has the headers of each column in the Dataset respectively. Is there anyway I can link the data to the dataframe effectively creating a labeled data array/table?
DataFrame
df = pd.DataFrame(columns=['Date','Close','High','Low','Open','Volume'])
DataSet
20140502,36.8700,37.1200,36.2100,36.5900,22454100
20140505,36.9100,37.0500,36.3000,36.6800,13129100
20140506,36.4900,37.1700,36.4800,36.9400,19156000
20140507,34.0700,35.9900,33.6700,35.9900,66062700
20140508,33.9200,34.5700,33.6100,33.8800,30407700
20140509,33.7600,34.1000,33.4100,34.0100,20303400
20140512,34.4500,34.6000,33.8700,33.9900,22520600
20140513,34.4000,34.6900,34.1700,34.4300,12477100
20140514,34.1700,34.6500,33.9800,34.4800,17039000
20140515,33.8000,34.1900,33.4000,34.1800,18879800
20140516,33.4100,33.6600,33.1000,33.6600,18847100
20140519,33.8900,33.9900,33.2800,33.4100,14845700
20140520,33.8700,34.4700,33.6700,33.9900,18596700
20140521,34.3600,34.3900,33.8900,34.0000,13804500
20140522,34.7000,34.8600,34.2600,34.6000,17522800
20140523,35.0200,35.0800,34.5100,34.8500,16294400
20140527,35.1200,35.1300,34.7300,35.0000,13057000
20140528,34.7800,35.1700,34.4200,35.1500,16960500
20140529,34.9000,35.1000,34.6700,34.9000,9780800
20140530,34.6500,34.9300,34.1300,34.9200,13153000
20140602,34.8700,34.9500,34.2800,34.6900,9178900
20140603,34.6500,34.9700,34.5800,34.8000,6557500
20140604,34.7300,34.8300,34.2600,34.4800,9434100
I'm assuming that you are receiving the data as a list of lists. So something like -
vals = [[20140502,36.8700,37.1200,36.2100,36.5900,22454100], [20140505,36.9100,37.0500,36.3000,36.6800,13129100], ...]
In that case, you can populate your dataframe with loc -
for index, val in enumerate(vals):
df.loc[index] = val
Which will give you -
In [6]: df
Out[6]:
Date Close High Low Open Volume
0 20140502 36.87 37.12 36.21 36.59 22454100
1 20140505 36.91 37.05 36.3 36.68 13129100
...
Here, enumerate gives us the index of the row, so we can use that to populate the dataframe index.
If somehow the data was saved as csv, then you can simply use read_csv -
df = pd.read_csv('data.csv', names=['Date','Close','High','Low','Open','Volume'])