Copying column values into index ranges doesn't work - pandas

I am trying to convert two-dimensional values in Excel to one dimension by stacking the columns one under the other. However, this script stops copying values after a certain row range.
I am using pandas to do that. The Excel file is here:
https://drive.google.com/file/d/1dfsfJhLFoGiO8_FG4kmZ87JxT2XFBpvX/view?usp=sharing
import pandas as pd

inpExcelFile = 'C:/sample.xlsx'
gridCells = pd.read_excel(inpExcelFile, sheetname='Sheet1')
Filter = pd.DataFrame()

for i in range(1938, 1940, 1):
    gridCells_filter = gridCells[gridCells['Year'] == i]
    gridCells_filter = gridCells_filter.reset_index(drop=True)
    gridCells_filter.replace(to_replace=",", value=".")
    # BELOW IS COPYING COLUMN
    Filter.at[0:30, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'JAN']
    # AFTER THIS, IT DOESNT COPY COLUMN VALUES
    Filter.at[31:61, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'FEB']
    Filter.at[62:92, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'MAR']
    Filter.at[93:123, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'APR']
    Filter.at[124:154, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'MAY']
    Filter.loc[155:185, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'JUN']
    Filter.at[186:216, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'JUL']
    Filter.at[217:247, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'AUG']
    Filter.at[248:278, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'SEP']
    Filter.at[279:309, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'OCT']
    Filter.at[310:340, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'NOV']
    Filter.at[341:371, 'Filtered '+str(i)] = gridCells_filter.loc[0:30, 'DEC']
Filter[Filter['Filtered '+str(i)] != '-----']
The expected result is that all of the column values end up in one column, in the desired order.

You can use a general solution for all years: reshape with DataFrame.melt and use to_datetime with DataFrame.pop to extract the date columns, then sort by DataFrame.sort_values and remove invalid datetimes like 30.2.1938 with DataFrame.dropna:
df = pd.read_excel('sample.xlsx', decimal=',')

# reshape to long format: one row per (DAY, Year, month) combination
df = df.melt(['DAY','Year'], value_name='val')
# build a datetime string from the DAY, month name, and Year parts
s = df.pop('DAY').astype(str) + df.pop('variable') + df.pop('Year').astype(str)
df['datetime'] = pd.to_datetime(s, format='%d%b%Y', errors='coerce')
# sort chronologically and drop impossible dates such as 30.2.1938
df = df.sort_values('datetime').dropna(subset=['datetime'])
     val   datetime
279  ---  1938-01-01
280  ---  1938-01-02
281  ---  1938-01-03
282  ---  1938-01-04
283  ---  1938-01-05
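If you still want one column per year, like the 'Filtered 1938' layout in the question, a minimal follow-up sketch building on the melted result (the per-year column layout is my assumption, not part of the original answer):

# Pivot the long result back to one column per year; 'day' is the row's
# position within its year so the pivot has a unique index.
out = df.assign(year=df['datetime'].dt.year,
                day=df.groupby(df['datetime'].dt.year).cumcount())
out = out.pivot(index='day', columns='year', values='val')
out.columns = ['Filtered ' + str(y) for y in out.columns]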

Related

Why would an extra column (Unnamed: 0) appear after saving the df and then reading it through pd.read_csv?

My code to save the df is:
fdi_out_vdem.to_csv("fdi_out_vdem.csv")
The code to read the df back into Python is:
fdi_out_vdem = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv")
The df:
Unnamed: 0  country_name  value
         1         Spain    190
         2         Spain    311
Your df has two columns, but also an index with "0" and "1". When writing it to csv it looks like this:
,country_name,value
0,Spain,190
1,Spain,311
When importing it with pandas, it is considered a df with 3 columns (and the first has no name).
You have two possibilities here:
Save it without index column:
df.to_csv("fdi_out_vdem.csv", index=False)
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv")
Or save it with the index column and pass index_col when reading it with pd.read_csv:
df.to_csv("fdi_out_vdem.csv")
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col=[0])
UPDATE
As recommended by @ouroboros1 in the comments, you could also name your index before saving it to csv, so you can define the index column by using that name:
df.index.name = "index"
df.to_csv("fdi_out_vdem.csv")
df = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col="index")
You can either pass the parameter index_col=[0] to pandas.read_csv:
fdi_out_vdem = pd.read_csv("C:/Users/asus/Desktop/classen/fdi_out_vdem.csv", index_col=[0])
Or even better, get rid of the index at the beginning when calling pandas.DataFrame.to_csv:
fdi_out_vdem.to_csv("fdi_out_vdem.csv", index=False)
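A quick round trip with toy data (not the original file) shows the first fix; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'country_name': ['Spain', 'Spain'], 'value': [190, 311]})

# Without index=False, the row labels become an unnamed first CSV column.
df.to_csv('fdi_out_vdem.csv', index=False)
back = pd.read_csv('fdi_out_vdem.csv')
print(back.columns.tolist())  # ['country_name', 'value'] -- no 'Unnamed: 0'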

Python: compare 2 dataframes and get rows not in the 2nd dataframe

I want to compare 2 dataframes (df): if a row is in exat_merge (df1) but not in df_ss_cpd2 (df2), I want it in df_missing (df3) as the result.
code:
df_missing = exat_merge.loc[exat_merge[df_ss_cpd2.columns.to_list()].isnull().all(axis = 1), df_ss_cpd2.columns.to_list()]
There is an index column plus 8 columns, and all column names are identical in both dataframes (df). Nothing works. What do you think I am doing incorrectly in this code? Thanks.
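No fix was posted here, but one common pattern for "rows of df1 that are not in df2" is an anti-join via merge(..., indicator=True); a minimal sketch with hypothetical stand-in data, assuming the shared column names identify a row:

import pandas as pd

# Hypothetical stand-ins for the question's frames.
exat_merge = pd.DataFrame({'id': [1, 2, 3], 'x': ['a', 'b', 'c']})
df_ss_cpd2 = pd.DataFrame({'id': [1, 3], 'x': ['a', 'c']})

# Left anti-join: keep rows that appear only in exat_merge.
merged = exat_merge.merge(df_ss_cpd2, how='left', indicator=True)
df_missing = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
print(df_missing)  # the row with id 2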

Streamlit - Applying value_counts / groupby to column selected on run time

I am trying to apply the value_counts method to a DataFrame based on the columns selected dynamically in the Streamlit app.
This is what I am trying to do:
if st.checkbox("Select Columns To Show"):
    all_columns = df.columns.tolist()
    selected_columns = st.multiselect("Select", all_columns)
    new_df = df[selected_columns]
    st.dataframe(new_df)
The above lets me select columns and displays data for the selected columns. I am trying to see how I could apply the value_counts/groupby methods to this output in the Streamlit app.
If I try to do the below
st.table(new_df.value_counts())
I get the below error
AttributeError: 'DataFrame' object has no attribute 'value_counts'
I believe the issue lies in passing a list of columns to a dataframe. When you pass a single column in [] to a dataframe, you get back a pandas.Series object (which has the value_counts method). But when you pass a list of columns, you get back a pandas.DataFrame (which doesn't have a value_counts method defined on it).
Can you try st.table(new_df[col_name].value_counts())
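A quick way to see the Series/DataFrame difference, as a minimal standalone sketch:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2]})
print(type(df['a']))    # pandas.core.series.Series  -> has .value_counts()
print(type(df[['a']]))  # pandas.core.frame.DataFrame -> no .value_counts() in older pandas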
I think the error is because value_counts() is applicable to a Series and not a dataframe.
You can try converting the .value_counts() output to a dataframe.
If you want to apply on one single column
def value_counts_df(df, col):
    """
    Returns pd.value_counts() as a DataFrame.

    Parameters
    ----------
    df : pandas DataFrame
        Dataframe on which to run value_counts(); must have column `col`.
    col : str
        Name of column in `df` for which to generate counts.

    Returns
    -------
    pandas DataFrame
        The returned dataframe will have a single column named "count" which
        contains the value_counts() for each unique value of df[col]. The
        index name of this dataframe is `col`.

    Example
    -------
    >>> value_counts_df(pd.DataFrame({'a': [1, 1, 2, 2, 2]}), 'a')
       count
    a
    2      3
    1      2
    """
    df = pd.DataFrame(df[col].value_counts())
    df.index.name = col
    df.columns = ['count']
    return df

val_count_single = value_counts_df(new_df, selected_col)
If you want to apply it to all object columns in the dataframe:
def valueCountDF(df, object_cols):
    c = df[object_cols].apply(lambda x: x.value_counts(dropna=False)).T.stack().astype(int)
    p = (df[object_cols].apply(lambda x: x.value_counts(normalize=True,
                                                        dropna=False)).T.stack() * 100).round(2)
    cp = pd.concat([c, p], axis=1, keys=["Count", "Percentage %"])
    return cp

val_count_df_cols = valueCountDF(df, selected_columns)
Finally, you can use st.table or st.dataframe to show the dataframe in your Streamlit app.
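For example, a minimal sketch wiring the helper into the app, assuming the single column is picked with st.selectbox (that widget is my addition, not part of the original snippet):

# Hypothetical wiring: let the user pick one column, then show its counts.
selected_col = st.selectbox("Column for value counts", all_columns)
st.table(value_counts_df(new_df, selected_col))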

Correct accessing of slices with duplicate index-values present

I have a dataframe with an index that sometimes contains rows with the same index-value. Now I want to slice that dataframe and set values based on row-indices.
Consider the following example:
import pandas as pd
df = pd.DataFrame({'index':[1,2,2,3], 'values':[10,20,30,40]})
df.set_index(['index'], inplace=True)
df1 = df.copy()
df2 = df.copy()
#copy warning
df1.iloc[0:2]['values'] = 99
print(df1)
df2.loc[df.index[0:2], 'values'] = 99
print(df2)
df1 is the expected result, but gives me a SettingWithCopyWarning.
df2 seems to be the way of accessing suggested by the docs, but gives me the wrong result (because of the duplicate index).
Is there a "proper" way to set those values correctly with the duplicate index-values present?
.loc is not recommended when you have a duplicate index, so you have to go for position-based selection with iloc. Since we need to pass positions, we use get_loc to get the position of the column:
print(df1.columns.get_loc('values'))
0

df1.iloc[0:2, df1.columns.get_loc('values')] = 99
print(df1)
       values
index
1          99
2          99
2          30
3          40
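To see why the label-based df2 version touched three rows: df.index[0:2] yields the labels [1, 2], and with duplicates present .loc matches every row labelled 2. A minimal sketch using the df from the question:

print(df.index[0:2].tolist())  # [1, 2] -- labels, not positions
print(df.loc[[1, 2]])          # three rows: the label 2 occurs twice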

Python 3: Creating DataFrame with parsed data

The following data has been parsed from a stock API. The dataframe has the headers of each column in the Dataset respectively. Is there any way I can link the data to the dataframe, effectively creating a labeled data array/table?
DataFrame
df = pd.DataFrame(columns=['Date','Close','High','Low','Open','Volume'])
DataSet
20140502,36.8700,37.1200,36.2100,36.5900,22454100
20140505,36.9100,37.0500,36.3000,36.6800,13129100
20140506,36.4900,37.1700,36.4800,36.9400,19156000
20140507,34.0700,35.9900,33.6700,35.9900,66062700
20140508,33.9200,34.5700,33.6100,33.8800,30407700
20140509,33.7600,34.1000,33.4100,34.0100,20303400
20140512,34.4500,34.6000,33.8700,33.9900,22520600
20140513,34.4000,34.6900,34.1700,34.4300,12477100
20140514,34.1700,34.6500,33.9800,34.4800,17039000
20140515,33.8000,34.1900,33.4000,34.1800,18879800
20140516,33.4100,33.6600,33.1000,33.6600,18847100
20140519,33.8900,33.9900,33.2800,33.4100,14845700
20140520,33.8700,34.4700,33.6700,33.9900,18596700
20140521,34.3600,34.3900,33.8900,34.0000,13804500
20140522,34.7000,34.8600,34.2600,34.6000,17522800
20140523,35.0200,35.0800,34.5100,34.8500,16294400
20140527,35.1200,35.1300,34.7300,35.0000,13057000
20140528,34.7800,35.1700,34.4200,35.1500,16960500
20140529,34.9000,35.1000,34.6700,34.9000,9780800
20140530,34.6500,34.9300,34.1300,34.9200,13153000
20140602,34.8700,34.9500,34.2800,34.6900,9178900
20140603,34.6500,34.9700,34.5800,34.8000,6557500
20140604,34.7300,34.8300,34.2600,34.4800,9434100
I'm assuming that you are receiving the data as a list of lists. So something like -
vals = [[20140502,36.8700,37.1200,36.2100,36.5900,22454100], [20140505,36.9100,37.0500,36.3000,36.6800,13129100], ...]
In that case, you can populate your dataframe with loc -
for index, val in enumerate(vals):
    df.loc[index] = val
Which will give you -
In [6]: df
Out[6]:
       Date  Close   High    Low   Open    Volume
0  20140502  36.87  37.12  36.21  36.59  22454100
1  20140505  36.91  37.05   36.3  36.68  13129100
...
Here, enumerate gives us the index of the row, so we can use that to populate the dataframe index.
If somehow the data was saved as csv, then you can simply use read_csv -
df = pd.read_csv('data.csv', names=['Date','Close','High','Low','Open','Volume'])
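If the rows are already parsed into a list of lists, a single constructor call is simpler and much faster than row-by-row loc assignment; a minimal sketch reusing the vals list from above:

df = pd.DataFrame(vals, columns=['Date', 'Close', 'High', 'Low', 'Open', 'Volume'])
# Optionally parse the dates (assumes the YYYYMMDD format shown above).
df['Date'] = pd.to_datetime(df['Date'], format='%Y%m%d')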