Pandas reset_index() creating level0 column - sql

I'm reading a CSV file, eliminating duplicates, and exporting the result to a database.
The problem is that it creates a column called level_0 instead of simply resetting the index.
Here is my code:
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('SampleData.csv', sep=';', encoding='latin1', low_memory=False)
df_projects = df['External'].drop_duplicates()
df_projects = df_projects.to_frame()
df_projects.rename(columns={'External': 'name'}, inplace=True)
df_projects = df_projects.reset_index()
con = create_engine('sqlite:///db.sqlite3')
df_projects.to_sql("inventory_projects", con, index=True, if_exists='replace')

You need to add the parameter drop=True to reset_index:
...
df_projects = df_projects.rename('name').to_frame()
df_projects = df_projects.reset_index(drop=True)
...
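Putting it together, a minimal sketch of the corrected pipeline (assuming the same 'External' column and SQLite target as in the question; it also switches to index=False so the fresh RangeIndex is not written to the table, which may or may not be what you want):
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv('SampleData.csv', sep=';', encoding='latin1', low_memory=False)

# Keep the unique values, name the column, and drop the old index
# instead of letting reset_index() turn it into an extra column.
df_projects = df['External'].drop_duplicates().rename('name').to_frame()
df_projects = df_projects.reset_index(drop=True)

con = create_engine('sqlite:///db.sqlite3')
# index=False keeps the RangeIndex out of the exported table; use index=True if you want it written as before.
df_projects.to_sql("inventory_projects", con, index=False, if_exists='replace')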

Use pandas df.concat to replace .append with custom index

I'm currently trying to replace .append in my code, since it won't be supported in the future, and I'm having some trouble with the custom index I'm using.
I read the names of all the .shp files in a directory and extract a date from each.
To link them with an Excel file I have, I use the name I extract from the file's title.
import glob
import pandas as pd

df = pd.DataFrame(columns=['date', 'fichier'])
for i in glob.glob("*.shp"):
    nom_parcelle = i.split("_")[2]
    if not nom_parcelle in df.index:
        # print(df.last_valid_index())
        date_recolte = i.split("_")[-1]
        new_row = pd.Series(data={'date': date_recolte.split(".")[0], 'fichier': i}, name=nom_parcelle)
        df = df.append(new_row, ignore_index=False)
This works exactly as I want it to.
Sadly, I can't find a way to replace it with .concat.
I looked for ways to keep the index with concat but didn't find anything that worked as I intended.
Did I miss anything?
Try the approach below with pandas.concat, based on your code:
import glob
import pandas as pd

df = pd.DataFrame(columns=['date', 'fichier'])
dico_dfs = {}

for i in glob.glob("*.shp"):
    nom_parcelle = i.split("_")[2]
    if not nom_parcelle in df.index:
        # print(df.last_valid_index())
        date_recolte = i.split("_")[-1]
        new_row = pd.Series(data={'date': date_recolte.split(".")[0], 'fichier': i}, name=nom_parcelle)
        dico_dfs[i] = new_row.to_frame()

df = pd.concat(dico_dfs, ignore_index=False, axis=1).T.droplevel(0)
# Output :
print(df)
          date                 fichier
nom1  20220101  a_xx_nom1_20220101.shp
nom2  20220102  b_yy_nom2_20220102.shp
nom3  20220103  c_zz_nom3_20220103.shp
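For comparison, a minimal sketch of an alternative (same assumptions about the file-name layout) that collects plain dicts keyed by nom_parcelle and builds the frame once at the end, which also preserves the custom index:
import glob
import pandas as pd

rows = {}
for i in glob.glob("*.shp"):
    nom_parcelle = i.split("_")[2]
    if nom_parcelle not in rows:
        date_recolte = i.split("_")[-1]
        rows[nom_parcelle] = {'date': date_recolte.split(".")[0], 'fichier': i}

# The dict keys become the index, so the custom index from the original code is kept.
df = pd.DataFrame.from_dict(rows, orient='index')
print(df)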

Replacing append with concat?

Whenever I run this code I get:
The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
What should I do to make the code run with concat?
final_dataframe = pd.DataFrame(columns = my_columns)
for symbol in stocks['Ticker']:
    api_url = f'https://sandbox.iexapis.com/stable/stock/{symbol}/quote?token={IEX_CLOUD_API_TOKEN}'
    data = requests.get(api_url).json()
    final_dataframe = final_dataframe.append(
        pd.Series([symbol,
                   data['latestPrice'],
                   data['marketCap'],
                   'N/A'],
                  index = my_columns),
        ignore_index = True)
See this release note
or from another post:
"Append is the specific case(axis=0, join='outer') of concat" link
The changes in your code should be (the pd.Series is assigned to a variable just for presentation, and it is wrapped as a one-row DataFrame so that concat appends it as a row rather than as a new column):
s = pd.Series([symbol, data['latestPrice'], data['marketCap'], 'N/A'], index = my_columns)
final_dataframe = pd.concat([final_dataframe, s.to_frame().T], ignore_index = True)
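As a self-contained sketch (placeholder quotes instead of the live IEX API; the column names and numbers below are made up for illustration), collecting the rows in a list and calling concat once after the loop is another common replacement for append:
import pandas as pd

my_columns = ['Ticker', 'Price', 'Market Capitalization', 'Number Of Shares to Buy']
# Placeholder data standing in for the API responses.
quotes = {'AAPL': {'latestPrice': 150.0, 'marketCap': 2.5e12},
          'MSFT': {'latestPrice': 300.0, 'marketCap': 2.3e12}}

rows = []
for symbol, data in quotes.items():
    s = pd.Series([symbol, data['latestPrice'], data['marketCap'], 'N/A'], index=my_columns)
    rows.append(s.to_frame().T)   # one-row DataFrame per quote

# A single concat at the end instead of growing the DataFrame inside the loop.
final_dataframe = pd.concat(rows, ignore_index=True)
print(final_dataframe)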

Copy/assign a Pandas dataframe based on their name in a for loop

I am relatively new to Python, and I am struggling with the following:
I have a set of data frames with sequential names (df_i), which I want to access in a for loop based on their name (as a string). How can I do that? e.g.
df_1 = pd.read_csv('...')
df_2 = pd.read_csv('...')
df_3 = pd.read_csv('...')
....
n_df = 3
for i in range(n_df):
    df_namestr = 'df_' + str(i+1)
    # ---------------------
    df_temp = df_namestr
    # ---------------------
    # Operate with df_temp. For i+1 = 1, df_temp should be df_1
Kind regards,
DF
You can try something like this:
for n in range(1, n_df+1):
    df_namestr = f"df_{n}"
    df_tmp = locals().get(df_namestr)
    if not isinstance(df_tmp, pd.DataFrame):
        continue
    print(df_namestr)
    print(df_tmp)
Refer to the documentation of locals() to know more.
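A self-contained sketch of the same idea (the three small frames below are just made-up stand-ins for the pd.read_csv calls):
import pandas as pd

# Stand-ins for df_1 = pd.read_csv('...'), etc.
df_1 = pd.DataFrame({'a': [1]})
df_2 = pd.DataFrame({'a': [2]})
df_3 = pd.DataFrame({'a': [3]})

n_df = 3
for n in range(1, n_df + 1):
    df_namestr = f"df_{n}"
    df_tmp = locals().get(df_namestr)   # look the frame up by its variable name
    if not isinstance(df_tmp, pd.DataFrame):
        continue
    print(df_namestr)
    print(df_tmp)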
Would it be better to read the multiple dataframes into a list instead?
You could put all the csv files required in a subfolder and read them all in. Then they are in a list and you can access each one as an item in that list.
Example:
import pandas as pd
import glob
path = r'/Users/myUsername/Documents/subFolder'
csv_files = glob.glob(path + "/*.csv")
dfs = []
for filename in csv_files:
    df = pd.read_csv(filename)
    dfs.append(df)
print(len(dfs))
print(dfs[1].head())
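If you still want to reach each frame by a name rather than by position, a small variation on the same approach (a sketch, assuming the same subfolder layout) is to key a dict by the file name:
import glob
import os
import pandas as pd

path = r'/Users/myUsername/Documents/subFolder'
dfs_by_name = {}
for filename in glob.glob(path + "/*.csv"):
    # e.g. '/Users/.../df_1.csv' -> 'df_1'
    name = os.path.splitext(os.path.basename(filename))[0]
    dfs_by_name[name] = pd.read_csv(filename)

# Access a frame by name instead of by list position.
df_temp = dfs_by_name.get('df_1')
print(df_temp)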

df.groupby('columns').apply(''.join()), join all the cells to a string

df.groupby('columns').apply(''.join()), join all the cells to a string.
This is for a junior data processor. In the past, I've tried many ways.
import pandas as pd
data = {'key': ['a','b','c','a','b','c','a'],
        'profit': [12,3,4,5,6,7,9],
        'income': ['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df = df.set_index('key')
# df2 is the expected result
data2 = {'a': ['12j5g9d'], 'b': ['3d6d'], 'c': ['4d7t']}
df2 = pd.DataFrame(data2)
df2 = df2.set_index('key')
Here's a simple solution: first convert the integers to strings and concatenate profit and income, then concatenate all the strings under the same key:
data = {'key': ['a','b','c','a','b','c','a'],
        'profit': [12,3,4,5,6,7,9],
        'income': ['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df['profit_income'] = df['profit'].apply(str) + df['income']
res = df.groupby('key')['profit_income'].agg(''.join)
print(res)
output:
key
a 12j5g9d
b 3d6d
c 4d7t
Name: profit_income, dtype: object
This question can be solved a couple of different ways:
First, add an extra column by concatenating the profit and income columns.
import pandas as pd
data = {'key': ['a','b','c','a','b','c','a'],
        'profit': [12,3,4,5,6,7,9],
        'income': ['j','d','d','g','d','t','d']}
df = pd.DataFrame(data)
df = df.set_index('key')
df['profinc']=df['profit'].astype(str)+df['income']
1) Using sum
df2=df.groupby('key').profinc.sum()
2) Using apply and join
df2=df.groupby('key').profinc.apply(''.join)
Results from both of the above would be the same:
key
a 12j5g9d
b 3d6d
c 4d7t
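If the goal is the exact layout of the asker's data2 (a single row with the keys as columns), a small follow-up sketch building on the df with the profinc column from above:
# Reshape the grouped result into one row with the keys as columns,
# matching the shape of data2 in the question.
df2_wide = df.groupby('key').profinc.apply(''.join).to_frame().T.reset_index(drop=True)
print(df2_wide)
#          a     b     c
# 0  12j5g9d  3d6d  4d7t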

Find column whose name contains a specific string

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I'm searching for 'spike' in column names like 'spike-2', 'hey spike', 'spiked-in' (the 'spike' part is always continuous).
I want the column name to be returned as a string or a variable, so I access the column later with df['name'] or df[name] as normal. I've tried to find ways to do this, to no avail. Any tips?
Just iterate over DataFrame.columns; here is an example in which you will end up with a list of column names that match:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
spike_cols = [col for col in df.columns if 'spike' in col]
print(list(df.columns))
print(spike_cols)
Output:
['hey spke', 'no', 'spike-2', 'spiked-in']
['spike-2', 'spiked-in']
Explanation:
df.columns returns the column labels (an Index you can iterate over like a list)
[col for col in df.columns if 'spike' in col] iterates over df.columns with the variable col and adds col to the resulting list if it contains 'spike'. This syntax is a list comprehension.
If you only want the resulting data set with the columns that match you can do this:
df2 = df.filter(regex='spike')
print(df2)
Output:
   spike-2  spiked-in
0        1          7
1        2          8
2        3          9
This answer uses the DataFrame.filter method to do this without list comprehension:
import pandas as pd
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6]}
df = pd.DataFrame(data)
print(df.filter(like='spike').columns)
Will output just 'spike-2'. You can also use a regex, as others have suggested:
print(df.filter(regex='spike|spke').columns)
Will output both columns: ['spike-2', 'hey spke']
You can also use df.columns[df.columns.str.contains(pat = 'spike')]
data = {'spike-2': [1,2,3], 'hey spke': [4,5,6], 'spiked-in': [7,8,9], 'no': [10,11,12]}
df = pd.DataFrame(data)
colNames = df.columns[df.columns.str.contains(pat = 'spike')]
print(colNames)
This will output the column names: 'spike-2', 'spiked-in'
More about pandas.Series.str.contains.
# select columns containing 'spike'
df.filter(like='spike', axis=1)
You can also select by name, regular expression. Refer to: pandas.DataFrame.filter
df.loc[:,df.columns.str.contains("spike")]
Another solution that returns a subset of the df with the desired columns:
df[df.columns[df.columns.str.contains("spike|spke")]]
You can also use this code:
spike_cols =[x for x in df.columns[df.columns.str.contains('spike')]]
Getting name and subsetting based on Start, Contains, and Ends:
# from: https://stackoverflow.com/questions/21285380/find-column-whose-name-contains-a-specific-string
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html
# from: https://cmdlinetips.com/2019/04/how-to-select-columns-using-prefix-suffix-of-column-names-in-pandas/
# from: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html
import pandas as pd
data = {'spike_starts': [1,2,3], 'ends_spike_starts': [4,5,6], 'ends_spike': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
print("\n")
print("----------------------------------------")
colNames_contains = df.columns[df.columns.str.contains(pat = 'spike')].tolist()
print("Contains")
print(colNames_contains)
print("\n")
print("----------------------------------------")
colNames_starts = df.columns[df.columns.str.contains(pat = '^spike')].tolist()
print("Starts")
print(colNames_starts)
print("\n")
print("----------------------------------------")
colNames_ends = df.columns[df.columns.str.contains(pat = 'spike$')].tolist()
print("Ends")
print(colNames_ends)
print("\n")
print("----------------------------------------")
df_subset_start = df.filter(regex='^spike',axis=1)
print("Starts")
print(df_subset_start)
print("\n")
print("----------------------------------------")
df_subset_contains = df.filter(regex='spike',axis=1)
print("Contains")
print(df_subset_contains)
print("\n")
print("----------------------------------------")
df_subset_ends = df.filter(regex='spike$',axis=1)
print("Ends")
print(df_subset_ends)