pandas remove spaces from Series - pandas

The question is, how to gain access to the strings inside of the first column so that string manipulations can be performed with each value. For example remove spaces in front of each string.
import pandas as pd
data = pd.read_csv("adult.csv", sep='\t', index_col=0)
series = data['workclass'].value_counts()
print(series)
Here is the file:
Zipped csv file

It is index, so use str.strip with series.index:
series.index = series.index.str.strip()
But if need convert series here to 2 columns DataFrame use:
df = series.rename_axis('a').reset_index(name='b')

Related

Extracting portions of the entries of Pandas dataframe

I have a Pandas dataframe with several columns wherein the entries of each column are a combination of​ numbers, upper and lower case letters and some special characters:, i.e, "=A-Za-z0-9_|"​. Each entry of the column is of the form:
​'x=ABCDefgh_5|123|' ​
I want to retain only the numbers 0-9 appearing only between | | and strip out all other characters​. Here is my code for one column of the dataframe:
list(map(lambda x: x.lstrip(r'\[=A-Za-z_|,]+'), df[1]))
However, the code returns the full entry ​'x=ABCDefgh_5|123|' ​ without stripping out anything. Is there an error in my code?
Instead of working with these unreadable regex expressions, you might want to consider a simple split. For example:
import pandas as pd
d = {'col': ["x=ABCDefgh_5|123|", "x=ABCDefgh_5|123|"]}
df = pd.DataFrame(data=d)
output = df["col"].str.split("|").str[1]

python - if-else in a for loop processing one column

I am interested to loop through column to convert into processed series.
Below is an example of two row, four columns data frame:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',1,'ac nephritis nephrotic syndrome'], [ 'sternocleidomastoid contracture','sternocleidomastoid contracture',0,"NA"]]
# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns = ['diagnosis_name', 'diagnosis_name_edited','is_spell_corrected','spell_corrected_value'])
I want to use spell_corrected_value column if is_spell_corrected column is more than 1. Else, use diagnosis_name_edited
At the moment, I have following code to directly use diagnosis_name_edited column. How do I make into if-else/lambda check for is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str)) # characters (generator)
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series) #
Thank you.
If I get you right, try out this fast solution using numpy.where:
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])

Read json files in pandas dataframe

I have large pandas dataframe (17 000 rows) with a filepath in each row associated with a specific json file. For each row I want to read the json file content and extract the content into a new dataframe.
The dataframe looks something like this:
0 /home/user/processed/config1.json
1 /home/user/processed/config2.json
2 /home/user/processed/config3.json
3 /home/user/processed/config4.json
4 /home/user/processed/config5.json
... ...
16995 /home/user/processed/config16995.json
16996 /home/user/processed/config16996.json
16997 /home/user/processed/config16997.json
16998 /home/user/processed/config16998.json
16999 /home/user/processed/config16999.json
What is the most efficient way to do this?
I believe a simple for-loop might be best suited here?
import json
json_content = []
for row in df:
with open(row) as file:
json_content.append(json.load(file))
result = pd.DataFrame(json_content)
Generally, I'd try with iterrows() function (as a first hit to improve efficiency).
Implementation could possibly look like that:
import json
import pandas as pd
json_content = []
for row in df.iterrows():
with open(row) as file:
json_content.append(json.load(file))
result = pd.Series(json_content)
Possible solution is the following:
# pip install pandas
import pandas as pd
#convert column with paths to list, where: : - all rows, 0 - first column
paths = df.iloc[:, 0].tolist()
all_dfs = []
for path in paths:
df = pd.read_json(path, encoding='utf-8')
all_dfs.append(df)
Each df in all_dfs can be accessed individually or in loop by index like all_dfs[0], all_dfs[1] and etc.
If you wish you can merge all_dfs into the single dataframe.
dfs = df.concat(all_dfs, axis=1)

allowing python to impoert csv with duplicate column names in python

i have a data frame that looks like this:
there are in total 109 columns.
when i import the data using the read_csv it adds ".1",".2" to duplicate names .
is there any way to go around it ?
i have tried this :
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding = "ISO-8859-1",
sep='|', header=None)
df = df.rename(columns=df.iloc[0], copy=False).iloc[1:].reset_index(drop=True)
but it changed the data frame and wasnt helpful.
this is what it did to my data
python:
excel:
Remove header=None, because it is used for avoid convert first row of file to df.columns and then remove . with digits from columns names:
df = pd.read_csv(r'C:\Users\agns1\Downloads\treatment1.csv',encoding="ISO-8859-1", sep=',')
df.columns = df.columns.str.replace('\.\d+$','')

when reading an html (pandas.read_html), how to select dataframe and set_ index in one line

I'm reading an html which brings back a list of dataframes. I want to be able to choose the dataframe from the list and set my index (index_col) in the least amount of lines.
Here is what I have right now:
import pandas as pd
df =pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header = 0)
df2 =df[4] #here I'm assigning df2 to dataframe#4 from the list of dataframes I read
df2.set_index('Date', inplace =True)
Is it possible to do all this in one line? Do I need to create another dataframe (df2) to assign one dataframe from a list, or is it possible I can assign the dataframe as soon as I read the list of dataframes (df).
Thanks.
Anyway:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header = 0)[4].set_index('Date')