Read_excel in Pandas

Using pandas 0.19.2.
My goal is to read an Excel file and keep everything as strings, with no conversions.
My Excel file contains the following:
Row1 Row 2
52.60 52.80
68.7k 67.5k
0.80% 0.80%
I tried reading the Excel file using the following commands:
df = pd.read_excel(r'C:\Dash\static\Calendar-01-01-2017.xls')
df = pd.read_excel(r'C:\Dash\static\Calendar-01-01-2017.xls', converters={'Row1': str, 'Row2': str})
df = pd.read_excel(r'C:\Dash\static\Calendar-01-01-2017.xls', converters={0: str, 1: str})
Unfortunately I end up with this:
Row1 Row 2
52.6 52.8
68.7k 67.5k
0.008 0.008
In the end, I would like to pass it to a list:
df = df.values.tolist()
but I end up with long values such as
0.0080000000000001
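For what it's worth, a hedged sketch of how to suppress pandas' own conversion, assuming an upgrade to pandas 0.20 or newer (where read_excel accepts a dtype argument). Note that Excel itself stores 0.80% as the number 0.008 plus a percent display format, so the displayed text cannot be recovered through dtype or converters alone:
import pandas as pd

path = r'C:\Dash\static\Calendar-01-01-2017.xls'

# pandas >= 0.20: hand every cell back as a string instead of a float
df = pd.read_excel(path, dtype=str)

# The nested list then contains strings such as '0.008' rather than
# floats such as 0.0080000000000001
rows = df.values.tolist()
print(rows)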

Related

assigning csv file to a variable name

I have a .csv file, and I use pandas to read it.
import pandas as pd
from pandas import read_csv
data=read_csv('input.csv')
print(data)
0 1 2 3 4 5
0 -3.288733e-08 2.905263e-08 2.297046e-08 2.052534e-08 3.767194e-08 4.822049e-08
1 2.345769e-07 9.462636e-08 4.331173e-08 3.137627e-08 4.680112e-08 6.067109e-08
2 -1.386798e-07 1.637338e-08 4.077676e-08 3.339685e-08 5.020153e-08 5.871679e-08
3 -4.234607e-08 3.555008e-08 2.563824e-08 2.320405e-08 4.008257e-08 3.901410e-08
4 3.899913e-08 5.368551e-08 3.713510e-08 2.367323e-08 3.172775e-08 4.799337e-08
My aim is to assign the file to a column name so that I can access the data at a later time, for example by doing something like:
new_data= df['filename']
filename
0 -3.288733e-08,2.905263e-08,2.297046e-08,2.052534e-08,3.767194e-08,4.822049e-08
1 2.345769e-07,9.462636e-08,4.331173e-08,3.137627e-08,4.680112e-08, 6.067109e-08
2 -1.386798e-07,1.637338e-08,4.077676e-08,3.339685e-08,5.020153e-08,5.871679e-08
3 -4.234607e-08,3.555008e-08,2.563824e-08,2.320405e-08,4.008257e-08,3.901410e-08
4 3.899913e-08,5.368551e-08,3.713510e-08,2.367323e-08,3.172775e-08,4.799337e-08
I don't really like it (and I still don't completely get the point), but you could just read in your data as one column (by using a 'wrong' separator) and rename the column.
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename, sep=';')
df.columns = [filename]
If you then wish, you could add other files by doing the same thing (with a different name for the dataframe at first) and then concatenate them with df.
A more useful approach, IMHO, would be to add the dataframe to a dictionary (a list would also work).
import pandas as pd
filename = 'input.csv'
df = pd.read_csv(filename)
data_dict = {filename: df}
# ... Add multiple files to data_dict by repeating steps above in a loop
You can then access your data later on by calling data_dict[filename] or data_dict['input.csv'].
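A minimal sketch of that loop, using hypothetical file names (substitute your own):
import pandas as pd

# Hypothetical file names, for illustration only
filenames = ['input.csv', 'input2.csv', 'input3.csv']

data_dict = {}
for filename in filenames:
    # One entry per file, keyed by the file name
    data_dict[filename] = pd.read_csv(filename)

# Later on, look any of them up again by name
print(data_dict['input.csv'].head())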

Compare two files and find the difference

I have two CSV files. I have to find the differences between them and generate an output file where sheet1 holds the difference data for txt1.csv and sheet2 holds the difference data for txt2.csv. Kindly advise me.
Sample Input :
txt1.csv
txt2.csv
Code
with open('txt1.csv', 'r') as t1, open('txt2.csv', 'r') as t2:
    fileone = t1.readlines()
    filetwo = t2.readlines()

with open('update.csv', 'w') as outFile:
    for line in filetwo:
        if line not in fileone:
            outFile.write(line)

with open('update1.csv', 'w') as outFile:
    for line in fileone:
        if line not in filetwo:
            outFile.write(line)
Expected output:
In sheet1
In sheet2
Note: when the input files are large, the above code runs very slowly.
You could try the following.
Dataset:
df1 = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
df2 = pd.DataFrame({"A": [1, 2, 8, 9], "B": [5, 6, 2, 9]})
Output1:
df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))].reset_index(drop=True)
   A  B
0  3  7
1  4  8
Output2:
df2[~df2.apply(tuple,1).isin(df1.apply(tuple,1))].reset_index(drop=True)
   A  B
0  8  2
1  9  9
In your case something like:
df1 = pd.read_csv("txt1.csv")
df2 = pd.read_csv("txt2.csv")
delta1 = df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))].reset_index(drop=True)
delta2 = df2[~df2.apply(tuple,1).isin(df1.apply(tuple,1))].reset_index(drop=True)
delta1.to_csv("txt1_delta.csv", index=False)
delta2.to_csv("txt2_delta.csv", index=False)
Edit: or, if you want to have it in Excel with multiple sheets:
# pip install xlsxwriter  (run in a shell first, if the package is not already installed)
import xlsxwriter
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter("your_output_excel.xlsx", engine="xlsxwriter")
# Write each dataframe to a different worksheet.
delta1.to_excel(writer, sheet_name="Delta1")
delta2.to_excel(writer, sheet_name="Delta2")
# Close the Pandas Excel writer and output the Excel file.
writer.save()
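If the apply/isin approach is still too slow on large files, a possible alternative (a standard pandas pattern, not part of the answer above) is an outer merge with an indicator column, which finds both sets of differences in one pass:
import pandas as pd

df1 = pd.read_csv("txt1.csv")
df2 = pd.read_csv("txt2.csv")

# indicator=True adds a '_merge' column marking each row as 'left_only',
# 'right_only' or 'both'; merging on all shared columns compares whole rows
merged = df1.merge(df2, how="outer", indicator=True)

delta1 = merged[merged["_merge"] == "left_only"].drop("_merge", axis=1)   # rows only in txt1.csv
delta2 = merged[merged["_merge"] == "right_only"].drop("_merge", axis=1)  # rows only in txt2.csv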

Parse JSON to Excel - Pandas + xlwt

I'm about halfway through this functionality. However, I need some help with formatting the data in the sheet that contains the output.
My current code...
import json
import pandas as pd

response = '{"sic2":[{"confidence":1.0,"label":"73"}],"sic4":[{"confidence":0.5,"label":"7310"}],"sic8":[{"confidence":0.5,"label":"73101000"},{"confidence":0.25,"label":"73102000"},{"confidence":0.25,"label":"73109999"}]}'
# Create a Pandas dataframe from the data.
df = pd.DataFrame.from_dict(json.loads(response), orient='index')
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
The output is as follows...
What I want is something like this...
I suppose that first I would need to extract and organise the headers.
This would also include manually assigning a header to a column that cannot have one by default, as in the case of the SIC column.
After that, I can feed the data into the columns under their respective headers.
You can loop over the keys of your json object and create a dataframe from each, then use pd.concat to combine them all:
import json
import pandas as pd
response = '{"sic2":[{"confidence":1.0,"label":"73"}],"sic4":[{"confidence":0.5,"label":"7310"}],"sic8":[{"confidence":0.5,"label":"73101000"},{"confidence":0.25,"label":"73102000"},{"confidence":0.25,"label":"73109999"}]}'
json_data = json.loads(response)
all_frames = []
for k, v in json_data.items():
    df = pd.DataFrame(v)
    df['SIC Category'] = k
    all_frames.append(df)
final_data = pd.concat(all_frames).set_index('SIC Category')
print(final_data)
This prints:
confidence label
SIC Category
sic2 1.00 73
sic4 0.50 7310
sic8 0.50 73101000
sic8 0.25 73102000
sic8 0.25 73109999
You can then export this to Excel as before, via final_data.to_excel(writer, sheet_name='Sheet1').
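Putting the pieces together, a short sketch of that export step, continuing from the code above and reusing the same ExcelWriter pattern as in the question:
# 'SIC Category' is the index, so it becomes the first column in the sheet
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
final_data.to_excel(writer, sheet_name='Sheet1')
writer.save()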

xlwings excel selection to dataframe - error with the date column

I am trying to write a UDF in python/Excel using xlwings. I have time-series data in three columns in a spreadsheet of the form:
Date Hour Value
01/11/2017 1 43.1
01/11/2017 2 41.8
01/11/2017 3 38.6
01/11/2017 4 38.6
01/11/2017 5 38.6
And I want to be able to select this range, manipulate it in several ways (monthly averages etc.) using pandas groupby functions, then output the results back into a new spreadsheet. This code works for me:
@xw.sub
def get_df_from_range():
    """gets df"""
    # make the current selection a dataframe
    wb = xw.Book.caller()
    df = wb.selection.options(pd.DataFrame, index=False).value
    # simple check: add a sheet and print the dataframe
    sht = wb.sheets.add()
    sht.range('A1').options(index=False).value = df
However, as soon as I try to do any manipulation of the dataframe before printing it, I get error messages. For example:
@xw.sub
def get_df_from_range():
    """gets df"""
    # make the current selection a dataframe
    wb = xw.Book.caller()
    df = wb.selection.options(pd.DataFrame, index=False).value
    # simple manipulation task
    df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
    # add a sheet and print the dataframe
    sht = wb.sheets.add()
    sht.range('A1').options(index=False).value = df
This gives an error message:
Run-time error '2147467259 (80004005)':
AttributeError: 'tuple' object has no attribute 'lower'
if value.lower() in _unit_map:
File
"C:\User\AppData\Local\Programs\Python\Python36-32
line 441, in f
unit = {k: f(k) for k in arg.keys()}
I thought I could debug this better by creating the same code, but not as a UDF; just by writing code in Spyder and connecting to the spreadsheet - so I would have the df variable in my variable explorer. But when I wrote almost exactly the same code, it did not give me an error message:
wb = xw.Book("my_spreadsheet.xlsm")
df = wb.selection.options(pd.DataFrame, index = False).value
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%Y')
I am really stuck as to why this would be. Can someone please help?
I should note, I am aware that xlwings automatically reads Excel dates as datetime64[ns] formats. That isn't the point I am trying to make. I want to do other things with the dataframe (e.g. left join it to another dataframe) and all those other tasks also fail when I try the UDF method, but work O.K. when I just connect to the spreadsheet from Spyder. I am hoping that if I can get that one "simple manipulation task" to work, then all the other tasks may also work.
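No answer is recorded here, but as a hedged diagnostic sketch: the quoted traceback is the code path pandas takes when pd.to_datetime receives a mapping of columns rather than a single Series, which suggests the selection comes back with tuple (or duplicated) column labels in UDF mode. Logging the labels from inside the sub, and flattening them before converting, may narrow it down; the flattening step is an assumption, not confirmed xlwings behaviour (the same imports as the question's code are assumed):
@xw.sub
def get_df_from_range_debug():
    """Like get_df_from_range, but logs what xlwings actually handed back."""
    wb = xw.Book.caller()
    df = wb.selection.options(pd.DataFrame, index=False).value

    # Write the raw column labels and dtypes to a fresh sheet, since a sub has
    # no console; tuple labels here would explain the 'tuple' error
    sht = wb.sheets.add()
    sht.range('A1').value = [repr(c) for c in df.columns]
    sht.range('A2').value = [str(t) for t in df.dtypes]

    # Assumption: flatten any tuple labels to plain strings before manipulating
    df.columns = [c[0] if isinstance(c, tuple) else c for c in df.columns]
    df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
    sht.range('A4').options(index=False).value = df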

How to rename pandas dataframe column with another dataframe?

I really don't understand what I'm doing. I have two data frames. One has a list of column labels and another has a bunch of data. I want to just label the columns in my data with my column labels.
My Code:
import pandas as pd

airportLabels = pd.read_csv('airportsLabels.csv', header=None)
airportData = pd.read_table('airports.dat', sep=",", header=None)
df = pd.DataFrame(airportData, columns=airportLabels)
When I do this, all the data turns into "NaN" and there is only one column anymore. I am really confused.
I think you need to add the parameter nrows to read_csv if you only want to read the column names; remove header=None, because the first row of the CSV holds the column names; and then use the parameter names in read_table with the columns from the DataFrame airportLabels:
import pandas as pd
import io
temp=u"""col1,col2,col3
1,5,4
7,8,5"""
# after testing, replace io.StringIO(temp) with the filename
airportLabels = pd.read_csv(io.StringIO(temp), nrows=0)
print(airportLabels)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
temp=u"""
a,d,f
e,r,t"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_table(io.StringIO(temp), sep=",", header=None, names=airportLabels.columns)
print(df)
col1 col2 col3
0 a d f
1 e r t
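Applied back to the original file names, the answer's approach would look roughly like this (a sketch assuming airportsLabels.csv carries the labels in its first row):
import pandas as pd

# Read just the header row of the labels file, then reuse those names for the
# data file, which has no header row of its own
airportLabels = pd.read_csv('airportsLabels.csv', nrows=0)
airportData = pd.read_table('airports.dat', sep=",", header=None,
                            names=airportLabels.columns)
print(airportData.head())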