Parse JSON to Excel - Pandas + xlwt - pandas

I'm about halfway through implementing this, but I need some help formatting the data in the sheet that contains the output.
My current code...
import json
import pandas as pd
# json.loads() expects a string, so the JSON response is quoted here.
response = '{"sic2":[{"confidence":1.0,"label":"73"}],"sic4":[{"confidence":0.5,"label":"7310"}],"sic8":[{"confidence":0.5,"label":"73101000"},{"confidence":0.25,"label":"73102000"},{"confidence":0.25,"label":"73109999"}]}'
# Create a Pandas dataframe from the data.
df = pd.DataFrame.from_dict(json.loads(response), orient='index')
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.close()
The output is as follows...
What I want is something like this...
I suppose that first I would need to extract and organise the headers.
This would also include manually assigning a header for a column that cannot have a header by default as in case of SIC column.
After that, I can feed data to the columns with their respective headers.

You can loop over the keys of your json object and create a dataframe from each, then use pd.concat to combine them all:
import json
import pandas as pd
response = '{"sic2":[{"confidence":1.0,"label":"73"}],"sic4":[{"confidence":0.5,"label":"7310"}],"sic8":[{"confidence":0.5,"label":"73101000"},{"confidence":0.25,"label":"73102000"},{"confidence":0.25,"label":"73109999"}]}'
json_data = json.loads(response)
all_frames = []
for k, v in json_data.items():
    df = pd.DataFrame(v)
    df['SIC Category'] = k
    all_frames.append(df)
final_data = pd.concat(all_frames).set_index('SIC Category')
print(final_data)
This prints:
              confidence     label
SIC Category
sic2                1.00        73
sic4                0.50      7310
sic8                0.50  73101000
sic8                0.25  73102000
sic8                0.25  73109999
Which you can export to Excel as before, through final_data.to_excel(writer, sheet_name='Sheet1')
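As a variation on the loop above, pd.concat also accepts a dict of dataframes and uses the dict keys as an outer index level. The following is a minimal sketch of that idea, not a replacement for the answer; the names frames and final_data are just illustrative.

```python
import json
import pandas as pd

response = '{"sic2":[{"confidence":1.0,"label":"73"}],"sic4":[{"confidence":0.5,"label":"7310"}],"sic8":[{"confidence":0.5,"label":"73101000"},{"confidence":0.25,"label":"73102000"},{"confidence":0.25,"label":"73109999"}]}'
json_data = json.loads(response)

# One small dataframe per top-level key ("sic2", "sic4", "sic8").
frames = {key: pd.DataFrame(records) for key, records in json_data.items()}

# The dict keys become the outer index level, named "SIC Category".
# The inner level is the per-frame row number, which is dropped here.
final_data = pd.concat(frames, names=["SIC Category"]).droplevel(1)
print(final_data)
```

The result has the same shape as the loop version: one row per label, indexed by SIC category.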

Related

Read json files in pandas dataframe

I have a large pandas dataframe (17,000 rows) with a file path in each row that points to a specific JSON file. For each row, I want to read the JSON file and extract its content into a new dataframe.
The dataframe looks something like this:
0 /home/user/processed/config1.json
1 /home/user/processed/config2.json
2 /home/user/processed/config3.json
3 /home/user/processed/config4.json
4 /home/user/processed/config5.json
... ...
16995 /home/user/processed/config16995.json
16996 /home/user/processed/config16996.json
16997 /home/user/processed/config16997.json
16998 /home/user/processed/config16998.json
16999 /home/user/processed/config16999.json
What is the most efficient way to do this?
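I believe a simple for loop might be best suited here:
import json
import pandas as pd
json_content = []
# Iterate over the paths in the first column; iterating over df itself would yield column names.
for path in df.iloc[:, 0]:
    with open(path) as file:
        json_content.append(json.load(file))
result = pd.DataFrame(json_content)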
Generally, I'd try the iterrows() function as a first attempt.
An implementation could look like this:
import json
import pandas as pd
json_content = []
# iterrows() yields (index, row) pairs; the path is the first value in each row.
for _, row in df.iterrows():
    with open(row.iloc[0]) as file:
        json_content.append(json.load(file))
result = pd.Series(json_content)
A possible solution is the following:
# pip install pandas
import pandas as pd
# Convert the column with paths to a list (":" selects all rows, 0 the first column).
paths = df.iloc[:, 0].tolist()
all_dfs = []
for path in paths:
    # Use a separate name so the dataframe of paths (df) is not overwritten.
    json_df = pd.read_json(path, encoding='utf-8')
    all_dfs.append(json_df)
Each dataframe in all_dfs can be accessed individually or in a loop by index, e.g. all_dfs[0], all_dfs[1], etc.
If you wish, you can merge all_dfs into a single dataframe:
dfs = pd.concat(all_dfs, axis=1)
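To make the loop-based approaches concrete, here is a hedged, self-contained sketch. It assumes each file holds a single JSON object (the question does not show the file contents), and the helper name load_json_files, the column name "path", and the demo files are all made up for illustration. It uses pd.json_normalize to flatten nested keys into columns.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_json_files(paths):
    """Read each JSON file and return one row per file (hypothetical helper)."""
    records = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            records.append(json.load(fh))
    # Nested objects become dotted column names, e.g. "settings.mode".
    return pd.json_normalize(records)


# Demo with temporary files standing in for the real config files.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = []
    for i in range(3):
        p = Path(tmp) / f"config{i}.json"
        p.write_text(json.dumps({"id": i, "settings": {"mode": "test"}}), encoding="utf-8")
        demo_paths.append(str(p))
    paths_df = pd.DataFrame({"path": demo_paths})
    result = load_json_files(paths_df["path"])

print(result)
```

Building the dataframe once from a list of records avoids growing a dataframe inside the loop.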

How can I replace each item's value with its rank after sorting in a merged final csv?

I have 30 different csv files. Each row begins with a date, followed by the same set of features, measured daily for each of 30 items. The value of each feature is not important, but the rank it gets after sorting within each day is. How can I build one merged csv from the 30 separate csv files that holds the rank of each feature?
If your files are all the same format, you can do a loop and consolidate in a single data frame. From there you can manipulate as needed:
import pandas as pd
import glob
path = r'C:\path_to_files\files' # use your path
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Other examples of the same thing: Import multiple csv files into pandas and concatenate into one DataFrame.
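Once the files are consolidated, the ranking step could look roughly like the sketch below. The column names date, item, feature1 and feature2 are assumptions, since the question does not show the file layout, and the small frame stands in for the concatenated data. It ranks each feature within each day using groupby(...).rank().

```python
import pandas as pd

# Stand-in for the concatenated frame; column names are hypothetical.
frame = pd.DataFrame({
    "date": ["2021-01-01"] * 3 + ["2021-01-02"] * 3,
    "item": ["a", "b", "c"] * 2,
    "feature1": [10, 30, 20, 5, 1, 3],
    "feature2": [0.1, 0.3, 0.2, 0.9, 0.8, 0.7],
})

feature_cols = ["feature1", "feature2"]

# Replace each value by its rank among the items of the same day
# (1 = largest value; ties share the smallest rank).
ranked = frame.copy()
ranked[feature_cols] = frame.groupby("date")[feature_cols].rank(
    ascending=False, method="min"
)
print(ranked)
```

The ranked frame can then be written out with ranked.to_csv(...). Set ascending=True if the smallest value should get rank 1.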

How to subset a dataframe, groupby and export the dataframes as multiple sheets of a one excel file in Python

Python newbie here
In the dataset below:
import pandas as pd
import numpy as np
data = {'Gender':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
'Location':['NE','NE','NE','NE','SW','SW','SW','SW','SE','SE','SE','SE','NC','NC','NC','NC'],
'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
'PDP':['<10','<10','<10','<10',10,10,10,10,20,20,20,20,'>20','>20','>20','>20'],
'PDP_code':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
'Revenue_YR1':[1150.78,1162.34,1188.53,1197.69,2108.07,2117.76,2129.48,1319.51,1416.87,1812.54,1819.57,1991.97,2219.28,2414.73,2169.91,2149.19],
'Revenue_YR2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
'Revenue_YR3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Revenue_YR4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
'Revenue_YR5':[251.78,221.34,282.53,272.69,310.07,317.7,329.81,333.15,334.87,332.54,336.59,339.97,329.28,334.73,336.91,334.12],
'Revenue_YR6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Revenue_YR7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
'Revenue_YR8':[279.84,289.14,299.53,309.69,318.73,327.47,336.63,398.59,398.75,324.18,324.78,321.55,333.85,339.39,315.52,319.23],
}
df = pd.DataFrame(data,columns = ['Gender','Location','Type','PDP','PDP_code','diff','series',
'Revenue_YR1','Revenue_YR2','Revenue_YR3','Revenue_YR4','Revenue_YR5','Revenue_YR6',
'Revenue_YR7','Revenue_YR8'])
df.head(5)
I want a pythonic way of doing the following:
Subset df into 4 dataframes / lists based on unique Location, resulting in NE, SW, SE and NC dataframes.
Aggregate all the Revenue_YR columns, grouping by the series and PDP_code columns, and export all the aggregated dataframes (NE, SW, SE and NC) as multiple sheets of one xlsx file.
My attempt
### this code returns output of 1 df instead of 4 dfs, I need help aggregating each of the 4 dataframes and export them to 4 sheets of 12312021_output.xlsx
for i, part_df in df.groupby('Location'):
    part_df.groupby(['series','PDP_code'])[['Revenue_YR1', 'Revenue_YR2','Revenue_YR3',
        'Revenue_YR4', 'Revenue_YR5', 'Revenue_YR6', 'Revenue_YR7']].mean().unstack().style.background_gradient(cmap='Blues').to_excel('12312021_output.xlsx')
Please share your code.
You can use pandas.ExcelWriter, and your loop (which I improved slightly for readability):
import pandas as pd
with pd.ExcelWriter("output.xlsx") as writer:
    cols = df.filter(like='Revenue_YR').columns
    for g, d in df.groupby('Location'):
        (d.groupby(['series','PDP_code'])[cols].mean().unstack()
          .style.background_gradient(cmap='Blues')
        ).to_excel(writer, sheet_name=g)
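To inspect what ends up on each sheet before writing anything, the aggregation can be separated from the export. The sketch below is only an illustration: aggregate_by_location is a hypothetical helper, and the tiny sample frame stands in for the question's data.

```python
import pandas as pd


def aggregate_by_location(df, group_col="Location"):
    """Return {location: aggregated dataframe} (hypothetical helper)."""
    cols = df.filter(like="Revenue_YR").columns
    return {
        loc: part.groupby(["series", "PDP_code"])[cols].mean().unstack()
        for loc, part in df.groupby(group_col)
    }


# Small stand-in for the question's dataframe.
sample = pd.DataFrame({
    "Location": ["NE", "NE", "SW", "SW"],
    "series": [1, 2, 1, 2],
    "PDP_code": [1, 1, 2, 2],
    "Revenue_YR1": [10.0, 20.0, 30.0, 40.0],
    "Revenue_YR2": [1.0, 2.0, 3.0, 4.0],
})

frames = aggregate_by_location(sample)
for loc, agg in frames.items():
    print(loc)
    print(agg)
```

Each value in frames can then be passed to to_excel(writer, sheet_name=loc) inside a pd.ExcelWriter block, as in the answer above.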

Extracting column from Array in python

I am a beginner in Python and I am stuck with data that is an array of 32763 numbers, separated by commas. Please find the data here.
I want to convert this into two columns: the first from (0:16382) and the second from (2:32763). In the end, I want to plot column 1 on the x axis and column 2 on the y axis. I tried the following code, but I am not able to extract the columns:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = np.genfromtxt('oscilloscope.txt',delimiter=',')
df = pd.DataFrame(data.flatten())
print(df)
Then I want to write the data to a file, say data1, in the format shown in the attached picture.
It is hard to answer without seeing the format of your data, but you can try:
data = np.genfromtxt('oscilloscope.txt',delimiter=',').flatten()
print(data.shape) # here we check we got something useful
# split data into two equal halves; with an odd count (e.g. 32763) the last value is dropped
half = data.size // 2
x = data[:half]
y = data[half:2 * half]
# now you can create a dataframe and print to file
df = pd.DataFrame({'x':x, 'y':y})
df.to_csv('data1.csv', index=False)
Try this:
# Input is a dataframe df and a chunk_size of your choice; the output is a list of chunks.
def split_dataframe(df, chunk_size=16382):
    chunks = []
    # Ceiling division, so there is no empty trailing chunk when len(df) is a multiple of chunk_size.
    num_chunks = -(-len(df) // chunk_size)
    for i in range(num_chunks):
        chunks.append(df[i*chunk_size:(i+1)*chunk_size])
    return chunks
Or use np.array_split.
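For the np.array_split route, a minimal sketch might look like this. The short array is a stand-in for the oscilloscope samples, and trimming both halves to a common length is an assumption about how to handle an odd number of values.

```python
import numpy as np
import pandas as pd

# Stand-in for the flattened oscilloscope data (odd length on purpose).
data = np.arange(7, dtype=float)

# array_split tolerates lengths that do not divide evenly:
# here the first part gets 4 values and the second gets 3.
x, y = np.array_split(data, 2)

# Trim to a common length so both columns fit in one dataframe.
n = min(len(x), len(y))
df = pd.DataFrame({"x": x[:n], "y": y[:n]})
print(df)
```

From there, df.to_csv('data1.csv', index=False) writes the two columns to a file and df.plot(x='x', y='y') plots them (plotting requires matplotlib).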

Append values to pandas dataframe incrementally inside for loop

I am trying to add rows to a pandas dataframe incrementally inside a for loop.
My for loop looks like this:
def print_values(cc):
    data = []
    for x in values[cc]:
        data.append(labels[x])
    # cc is a constant and data is a list. I need these values to be appended to a row in pandas dataframe.
    # Pandas dataframe structure is like follows: df=pd.DataFrame(columns = ['Index','Names'])
    print(cc)
    print(data)
    # This does not work - Not sure about the problem !!
    #df_clustercontents.loc['Cluster_Index'] = cc
    #df_clustercontents.loc['DatabaseNames'] = data

for x in range(0,10):
    print_values(x)
I need the values "cc" and "data" to be appended to the dataframe incrementally.
Any help would be really appreciated !!
You can append a row at the next integer position with .loc:
...
print(cc)
print(data)
df_clustercontents.loc[len(df_clustercontents)]=[cc,data]
...
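A different technique, which avoids growing the dataframe one row at a time, is to collect the rows in a plain list and build the dataframe once at the end. The sketch below illustrates that alternative; values and labels are hypothetical stand-ins for the question's data, and collect_row is a made-up name.

```python
import pandas as pd

# Hypothetical stand-ins for the question's `values` and `labels`.
values = {0: [0, 1], 1: [2]}
labels = ["db_a", "db_b", "db_c"]

rows = []


def collect_row(cc):
    """Build one row for cluster index cc and store it in `rows`."""
    data = [labels[x] for x in values[cc]]
    rows.append({"Index": cc, "Names": data})


for cc in values:
    collect_row(cc)

# Build the dataframe in one step from the collected rows.
df_clustercontents = pd.DataFrame(rows, columns=["Index", "Names"])
print(df_clustercontents)
```

Each cell in the Names column holds the whole list for that cluster, as with the .loc approach above.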