Read json files in pandas dataframe - pandas

I have a large pandas dataframe (17 000 rows) with a filepath in each row, each pointing to a specific json file. For each row I want to read the json file and extract its content into a new dataframe.
The dataframe looks something like this:
0 /home/user/processed/config1.json
1 /home/user/processed/config2.json
2 /home/user/processed/config3.json
3 /home/user/processed/config4.json
4 /home/user/processed/config5.json
... ...
16995 /home/user/processed/config16995.json
16996 /home/user/processed/config16996.json
16997 /home/user/processed/config16997.json
16998 /home/user/processed/config16998.json
16999 /home/user/processed/config16999.json
What is the most efficient way to do this?
I believe a simple for-loop might be best suited here?
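import json
import pandas as pd
json_content = []
# iterate over the values of the path column (iterating over df itself yields column labels)
for path in df.iloc[:, 0]:
    with open(path) as file:
        json_content.append(json.load(file))
result = pd.DataFrame(json_content)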

Generally, I'd try the iterrows() function first (as a first attempt at improving efficiency).
An implementation could look like this:
import json
import pandas as pd
json_content = []
# iterrows() yields (index, row) pairs; the path is the first value in each row
for _, row in df.iterrows():
    with open(row.iloc[0]) as file:
        json_content.append(json.load(file))
result = pd.Series(json_content)

A possible solution is the following:
# pip install pandas
import pandas as pd
# convert the column of paths to a list; in iloc[:, 0], ':' selects all rows and 0 the first column
paths = df.iloc[:, 0].tolist()
all_dfs = []
for path in paths:
    # use a separate name so the original df is not overwritten
    json_df = pd.read_json(path, encoding='utf-8')
    all_dfs.append(json_df)
Each dataframe in all_dfs can be accessed individually or in a loop by index, like all_dfs[0], all_dfs[1], etc.
If you wish, you can merge all_dfs into a single dataframe.
# concat is a pandas function, not a DataFrame method
dfs = pd.concat(all_dfs, axis=1)
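As a variation on the answers above, here is a minimal sketch that parses each file with json.load and builds one dataframe with pd.json_normalize, giving one row per file. The column name filepath, the temporary files, and the JSON keys are all made up for illustration; nested keys are flattened into dotted column names.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical setup: write three small JSON files so the example is runnable.
tmp_dir = Path(tempfile.mkdtemp())
paths = []
for i in range(3):
    path = tmp_dir / f"config{i}.json"
    path.write_text(json.dumps({"id": i, "params": {"lr": 0.1 * i}}), encoding="utf-8")
    paths.append(str(path))

# A dataframe with one file path per row ("filepath" is an assumed column name).
df = pd.DataFrame({"filepath": paths})


def load_json(path):
    """Read one JSON file and return its parsed content."""
    with open(path, encoding="utf-8") as file:
        return json.load(file)


# Parse every file once, then build the result frame in a single call.
records = [load_json(path) for path in df["filepath"]]
result = pd.json_normalize(records)  # nested keys become columns such as "params.lr"
print(result)
```

Building the list first and constructing the dataframe once avoids growing a dataframe inside the loop. For 17 000 small files the time is likely dominated by file I/O rather than by the loop itself.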

Related

assigning csv file to a variable name

I have a .csv file, and I use pandas to read it.
import pandas as pd
from pandas import read_csv
data=read_csv('input.csv')
print(data)
0 1 2 3 4 5
0 -3.288733e-08 2.905263e-08 2.297046e-08 2.052534e-08 3.767194e-08 4.822049e-08
1 2.345769e-07 9.462636e-08 4.331173e-08 3.137627e-08 4.680112e-08 6.067109e-08
2 -1.386798e-07 1.637338e-08 4.077676e-08 3.339685e-08 5.020153e-08 5.871679e-08
3 -4.234607e-08 3.555008e-08 2.563824e-08 2.320405e-08 4.008257e-08 3.901410e-08
4 3.899913e-08 5.368551e-08 3.713510e-08 2.367323e-08 3.172775e-08 4.799337e-08
My aim is to assign the file to a column name so that I can access the data at a later time. For example, by doing something like
new_data= df['filename']
filename
0 -3.288733e-08,2.905263e-08,2.297046e-08,2.052534e-08,3.767194e-08,4.822049e-08
1 2.345769e-07,9.462636e-08,4.331173e-08,3.137627e-08,4.680112e-08, 6.067109e-08
2 -1.386798e-07,1.637338e-08,4.077676e-08,3.339685e-08,5.020153e-08,5.871679e-08
3 -4.234607e-08,3.555008e-08,2.563824e-08,2.320405e-08,4.008257e-08,3.901410e-08
4 3.899913e-08,5.368551e-08,3.713510e-08,2.367323e-08,3.172775e-08,4.799337e-08
I don't really like it (and I still don't completely get the point), but you could just read in your data as 1 column (by using a 'wrong' separator) and rename the column.
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename, sep=';')
df.columns = [filename]
If you then wish, you could add other files by doing the same thing (with a different name for df at first) and then concatenate that with df.
A more useful approach IMHO would be to add the dataframe to a dictionary (a list would also be possible).
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename)
data_dict = {filename: df}
# ... Add multiple files to data_dict by repeating steps above in a loop
You can then access your data later on by calling data_dict[filename] or data_dict['input.csv'].
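The "repeat in a loop" step could be sketched as follows. The temporary directory, the second file name other.csv, and the file contents are placeholders invented so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical setup: create two small CSV files to read back.
tmp_dir = Path(tempfile.mkdtemp())
for name in ("input.csv", "other.csv"):
    pd.DataFrame({"0": [1.0, 2.0], "1": [3.0, 4.0]}).to_csv(tmp_dir / name, index=False)

# Map each file name to the dataframe read from that file.
data_dict = {path.name: pd.read_csv(path) for path in sorted(tmp_dir.glob("*.csv"))}

print(data_dict["input.csv"])
```

Keying the dictionary by file name keeps every file's columns intact, so nothing has to be squeezed into a single string column.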

python - if-else in a for loop processing one column

I would like to loop through a column and convert it into a processed series.
Below is an example of a data frame with two rows and four columns:
import pandas as pd
from rapidfuzz import process as process_rapid
from rapidfuzz import utils as rapid_utils
data = [['r/o ac. nephritis. /. nephrotic syndrome', ' ac. nephritis. /. nephrotic syndrome',1,'ac nephritis nephrotic syndrome'], [ 'sternocleidomastoid contracture','sternocleidomastoid contracture',0,"NA"]]
# Create the pandas DataFrame
df_diagnosis = pd.DataFrame(data, columns = ['diagnosis_name', 'diagnosis_name_edited','is_spell_corrected','spell_corrected_value'])
I want to use the spell_corrected_value column if the is_spell_corrected column is more than 1. Otherwise, use diagnosis_name_edited.
At the moment, I have the following code, which uses the diagnosis_name_edited column directly. How do I turn it into an if-else/lambda check on the is_spell_corrected column?
unmapped_diag_series = (rapid_utils.default_process(d) for d in df_diagnosis['diagnosis_name_edited'].astype(str))  # generator of processed strings
unmapped_processed_diagnosis = pd.Series(unmapped_diag_series)
Thank you.
If I get you right, try out this fast solution using numpy.where:
import numpy as np
df_diagnosis['new_column'] = np.where(df_diagnosis['is_spell_corrected'] > 1, df_diagnosis['spell_corrected_value'], df_diagnosis['diagnosis_name_edited'])
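A small runnable sketch of the same np.where selection on the sample rows from the question. One assumption differs from the text: the sample data only contains flags 0 and 1, so the condition below tests == 1 rather than > 1. Adjust the comparison if the real flag has other values.

```python
import numpy as np
import pandas as pd

df_diagnosis = pd.DataFrame(
    {
        "diagnosis_name_edited": [
            " ac. nephritis. /. nephrotic syndrome",
            "sternocleidomastoid contracture",
        ],
        "is_spell_corrected": [1, 0],
        "spell_corrected_value": ["ac nephritis nephrotic syndrome", "NA"],
    }
)

# Assumption: a flag of 1 marks a spell-corrected row.
use_corrected = df_diagnosis["is_spell_corrected"] == 1

# Take the corrected value where flagged, otherwise the edited name.
chosen = pd.Series(
    np.where(
        use_corrected,
        df_diagnosis["spell_corrected_value"],
        df_diagnosis["diagnosis_name_edited"],
    ),
    index=df_diagnosis.index,
)
print(chosen.tolist())
# ['ac nephritis nephrotic syndrome', 'sternocleidomastoid contracture']
```

The resulting series can then be used in place of df_diagnosis['diagnosis_name_edited'] in the generator expression from the question.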

pandas remove spaces from Series

The question is how to gain access to the strings in the first column so that string manipulations can be performed on each value. For example, removing the spaces in front of each string.
import pandas as pd
data = pd.read_csv("adult.csv", sep='\t', index_col=0)
series = data['workclass'].value_counts()
print(series)
Here is the file:
Zipped csv file
The strings are in the index, so use str.strip on series.index:
series.index = series.index.str.strip()
But if you need to convert the series to a 2-column DataFrame, use:
df = series.rename_axis('a').reset_index(name='b')
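To see both steps without the original adult.csv, here is a sketch on a made-up stand-in for data['workclass'] whose values carry a leading space. The values and the column names workclass and count are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for data['workclass']: each value has a leading space.
workclass = pd.Series(
    [" Private", " Private", " State-gov", " Private", " State-gov", " Local-gov"]
)

# value_counts() puts the distinct strings into the index of the result.
series = workclass.value_counts()

# Strip the whitespace from the index labels.
series.index = series.index.str.strip()

# Turn the counts into a two-column DataFrame.
df = series.rename_axis("workclass").reset_index(name="count")
print(df)
```

An alternative is to strip before counting, with workclass.str.strip().value_counts(), which also merges labels that differ only by surrounding whitespace.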

Concatenate a pandas dataframe to CSV file without reading the entire file

I have quite a large CSV file. I have a pandas dataframe that has exactly the same columns as the CSV file.
I checked on Stack Overflow and saw several answers suggesting to use read_csv, concatenate the resulting dataframe with the current one, and then write the result back to a CSV file.
But for a large file I don't think that is the best way.
Can I concatenate a pandas dataframe to an existing CSV file without reading the whole file?
Update: Example
import pandas as pd
df1 = pd.DataFrame ({'a':1,'b':2}, index = [0])
df1.to_csv('my.csv')
df2 = pd.DataFrame ({'a':3, 'b':4}, index = [1])
# what to do here? I would like to concatenate df2 to my.csv
The expected my.csv
a b
0 1 2
1 3 4
Look at using mode='a' in to_csv:
MCVE:
df1 = pd.DataFrame ({'a':1,'b':2}, index = [0])
df1.to_csv('my.csv')
df2 = pd.DataFrame ({'a':3, 'b':4}, index = [1])
df2.to_csv('my.csv', mode='a', header=False)
!type my.csv  # 'type' on Windows; use 'cat' on Unix
Output:
,a,b
0,1,2
1,3,4
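When the same code may run before the file exists, the header decision can be made at write time. The helper append_to_csv below is a hypothetical name, not part of pandas; it is a minimal sketch that writes the header only when the target file is missing or empty.

```python
import os
import tempfile

import pandas as pd


def append_to_csv(df, path):
    """Append df to path, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    df.to_csv(path, mode="a", header=write_header)


# Hypothetical location so the example does not touch real files.
csv_path = os.path.join(tempfile.mkdtemp(), "my.csv")

append_to_csv(pd.DataFrame({"a": 1, "b": 2}, index=[0]), csv_path)
append_to_csv(pd.DataFrame({"a": 3, "b": 4}, index=[1]), csv_path)

# Read back only to check the result; appending itself never reads the file.
combined = pd.read_csv(csv_path, index_col=0)
print(combined)
```

Appending does not check that the columns match those already in the file, so the dataframe's column order needs to agree with the existing header.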

when reading an html (pandas.read_html), how to select dataframe and set_index in one line

I'm reading an HTML page, which returns a list of dataframes. I want to choose a dataframe from the list and set my index (index_col) in the fewest lines possible.
Here is what I have right now:
import pandas as pd
df =pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header = 0)
df2 =df[4] #here I'm assigning df2 to dataframe#4 from the list of dataframes I read
df2.set_index('Date', inplace =True)
Is it possible to do all this in one line? Do I need to create another dataframe (df2) to hold one dataframe from the list, or can I select the dataframe as soon as I read the list of dataframes (df)?
Thanks.
Anyway, you can index the returned list and chain set_index directly:
import pandas as pd
df = pd.read_html('http://finviz.com/insidertrading.ashx?or=-10&tv=100000&tc=1&o=-transactionvalue', header = 0)[4].set_index('Date')
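The chaining works because read_html returns an ordinary Python list. The sketch below shows the same pattern offline, using a hand-built list named tables as a stand-in for the read_html result; the table contents are invented.

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html (no network access needed).
tables = [
    pd.DataFrame({"note": ["header table"]}),
    pd.DataFrame({"Date": ["Jan 02", "Jan 03"], "Value": [100, 250]}),
]

# Pick the table by position and set the index in one expression.
df = tables[1].set_index("Date")
print(df)
```

read_html also accepts an index_col argument, so a positional column index can be passed at read time, though it then applies to every table on the page.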