Importing Numbers using DataFrame - pandas

I'm trying to import numbers from an xlsx file using a pandas DataFrame, but the numbers come back in a slightly different format.
Let's say the number is: 9582*****4
The number I get using this code is 9582*****4.0
import pandas as pd
df = pd.read_excel("Contacts.xlsx")
for i in range(len(df)):
    print(df.iloc[i, 0])
It was working just fine till last night.

I guess you need to change the data type from float to int:
import pandas as pd
df = pd.read_excel("Contacts.xlsx")
df = df.astype(int)  # for all columns
# or, for a specific column only:
# df = df.astype({"Column_name": int})
for i in range(len(df)):
    print(df.iloc[i, 0])
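One possible cause (an assumption, since the spreadsheet itself is not shown) is that a blank cell ended up in the column: pandas then reads the whole column as float64, and a plain astype(int) raises an error on the missing value. A minimal sketch using the nullable "Int64" dtype instead, with a made-up column name and made-up numbers:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the first column of Contacts.xlsx
df = pd.DataFrame({"phone": [9582000004.0, np.nan, 9582000005.0]})

# The nullable integer dtype keeps missing entries as <NA> instead of raising
df["phone"] = df["phone"].astype("Int64")

# Skip the missing entries and print the rest without the trailing ".0"
phones = [str(p) for p in df["phone"].dropna()]
print(phones)
```

If the column has no missing values, the astype(int) call above does the same job.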

Related

Convert type object column to float

I have a table with a column named "price". This column is of type object. So, it contains numbers as strings and also NaN or ? characters. I want to find the mean of this column but first I have to remove the NaN and ? values and also convert it to float
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But, this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?
Edit: in pandas version 1.3 and earlier, subset must be wrapped in a list/array (subset=[col]). In version 1.4 and later you can pass a single column as a string.
You've got a few problems:
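The first positional argument of df.dropna() is the axis, not the column, so 'price' is being read as an axis name, which is exactly what the error says. The axis is rows/columns, and subset says which labels to check for missing values. So you want this to be (I think) df.dropna(axis='rows', subset='price')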
Using inplace=True makes the whole thing return None, and so you have set df = None. You don't want to do that. If you are using inplace=True, then you don't assign something to that, the whole line would just be df.dropna(...,inplace=True).
Don't use inplace=True, just do the assignment. That is, you should use df=df.dropna(axis='rows',subset='price')
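The question also mentions "?" entries and a conversion to float, which dropna alone does not handle. A minimal sketch of one way to do the whole cleanup, using a small made-up frame in place of Automobile_data.csv:

```python
import pandas as pd

# Hypothetical sample standing in for the "price" column of Automobile_data.csv
df = pd.DataFrame({"price": ["13495", "?", None, "16500"]})

# Turn anything non-numeric (such as "?") into NaN, giving a float column
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Drop the rows where price is missing; subset is passed as a list
df = df.dropna(subset=["price"])

mean_price = df["price"].mean()
print(mean_price)
```

Passing subset as a list works on both older and newer pandas versions.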

one-line dataframe df.to_csv fails, flipping all the data

I'm using a dataframe to calculate a bunch of stuff, with results winding up in the SECOND-TO-LAST LINE of the df.
I need to append JUST THAT ONE LINE to a CSV file.
Instead of storing the labels across and data beneath, the thing continually puts labels in the first column, with the data in the second column.
Subsequent writes keep appending data DOWN - under the first column.
I'm using code like this:
if not os.path.isfile(csvFilePath):
    df.iloc[-2].to_csv(csvFilePath, mode='w', index=True, sep=';', header=True)
else:
    df.iloc[-2].to_csv(csvFilePath, mode='a', index=False, sep=';', header=False)
The "csv" file it produces looks like this (two iterations):
;2021-04-29 07:00:00
open;54408.26
high;54529.67
low;54300.0
close;54500.0
volume;180.44990968
ATR;648.08
RSI;41.2556049907123
ticker;54228.51
BidTarget_1;53012.42
Bdistance_1;1216.0
BidTarget_2;54031.94
BCOGdistance_2;197.0
AskTarget_1;54934.18
ACOGdistance_1;705.67
AskTarget_2;55494.92
ACOGdistance_2;1266.41
TotBid;207.34781091999974
TotAsk;199.80037382000046
AskBidRatio;0.96
54408.26
54529.67
54300.0
54500.0
180.44990968
648.08
41.2556049907123
54071.49
53011.46
1060.0
53665.5
406.0
54620.97
549.48
54398.77
327.28
208.08094453999973
186.65960602000038
0.9
I'm at a complete loss ...
I start with a .csv that contains
hello, from, this, file
another, amazing, line, csv
save, this, line, data
last, line, of, file
where the second-to-last line is the desired output.
I think you can get what you want by using
import pandas
df = pandas.read_csv("myfile.csv", header=None)
df.iloc[-2].to_frame().transpose()
The trick is that df.iloc[-2] returns a Pandas Series. You can determine the datatype using
type(df.iloc[-2])
which returns pandas.core.series.Series. I'm not sure why the Pandas Series are oriented the way they are.
The Pandas Series can be converted back to a dataframe using df.iloc[-2].to_frame(), but the orientation is flipped 90 degrees (matching the Series orientation). To get back to the desired orientation, the transformation called transpose (flip about the diagonal) is needed.
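To make the difference visible, here is a small sketch that writes the same row both ways. The two-row frame and its column names are made up for illustration; when to_csv is called without a path it returns the CSV text, which makes the two layouts easy to compare:

```python
import pandas as pd

# Hypothetical two-row frame; its first row plays the role of df.iloc[-2]
df = pd.DataFrame({"open": [54408.26, 54071.49], "close": [54500.0, 54398.77]})

row = df.iloc[-2]            # a Series: the column labels become its index
one_row = row.to_frame().T   # a DataFrame with a single row

# Written as a Series: one "label;value" pair per line
as_series = row.to_csv(sep=";", header=False)

# Written as a one-row frame: labels across, data beneath
as_frame = one_row.to_csv(sep=";", index=False)

print(as_series)
print(as_frame)
```

The same one_row frame can be passed to to_csv with a file path and mode='a' to append it as a single line.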

Save output in CSV without losing previous data on that CSV in pandas dataframe

I'm doing sentiment analysis of Twitter data. For this work, I've made several datasets in CSV format, one per month. After preprocessing each dataset individually, I want to save all of them in a single CSV file. But when I write the code below using a pandas DataFrame:
df.to_csv('dataset.csv', index=False)
It removes the previous data (rows) from that file. Is there any way to keep the previous data in the file too, so that I can merge all the data together? Thank you.
It's not entirely clear what you want from your question, so this is just a guess, but something like this might be what you're looking for. If you keep assigning dataframes to df, then new data will overwrite the old data. Try assigning them to differently named dataframes like df1 and df2. Then you can merge them.
# vertically merge the multiple dataframes and reassign to new variable
df = pd.concat([df1, df2])
# save the dataframe
df.to_csv('my_dataset.csv', index=False)
In Python you can use the open("file") function with the mode 'a':
open("file", 'a').
The 'a' means "append" so you will add lines at the end of your file.
You can use the same parameter for the pandas.DataFrame.to_csv() method.
e.g:
import pandas as pd
# code where you get data and return df
df.to_csv("file", mode='a')
@thehand0: Your code works, but it's inefficient, so it will take longer for your script to run.
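One thing to watch with mode='a' is that to_csv writes the header row on every call by default, so an appended file ends up with column names repeated in the middle. A minimal sketch that writes the header only when the file does not exist yet; the helper name append_to_csv and the sample data are made up for illustration:

```python
import os
import tempfile

import pandas as pd


def append_to_csv(df, path):
    """Append df to path, writing the header only when the file is new."""
    is_new = not os.path.isfile(path)
    df.to_csv(path, mode="a", index=False, header=is_new)


# Hypothetical monthly datasets
jan = pd.DataFrame({"text": ["good", "bad"], "label": [1, 0]})
feb = pd.DataFrame({"text": ["fine"], "label": [1]})

# Write into a fresh temporary directory so the example leaves no stray files
path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
append_to_csv(jan, path)
append_to_csv(feb, path)

combined = pd.read_csv(path)
print(combined)
```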

Adding a column with a calculation to multiple CSVs

I'm SUPER green to Python and am having some issues trying to automate some calculations.
I know that this works to add a new column called "Returns" to a CSV, holding the percent change of "value" from the previous row to the current one:
import pandas as pd
import numpy as np
import csv
a = pd.read_csv("/Data/a_data.csv", index_col = "time")
a ["Returns"] = (a["value"]/a["value"].shift(1) -1)*100
However, I have a lot of these CSVs. I need this calculation to happen prior to merging them all together. So I was hoping to write something that just looped through all of the CSVs and did the calculation and added the column but clearly this was incorrect as I get Syntax error:
import pandas as pd
import numpy as np
import csv
a = pd.read_csv("/Data/a_data.csv", index_col = "time")
b = pd.read_csv("/Data/b_data.csv", index_col = "time")
c = pd.read_csv("/Data/c_data.csv", index_col = "time")
my_lists = ['a','b','c']
for my_list in my_lists:
    {my_list}["Returns"] = ({my_list}["close"]/{my_list}["close"].shift(1) -1)*100
    print(f"Calculating: {my_list.upper()}")
I'm sure there is an easy way to do this that I just haven't reached in my Python education yet, so any guidance would be greatly appreciated!
Assuming "close" and "time" are fields defined in each of your csv files, you could define a function that reads each file, does the shift calculation and returns a dataframe:
def your_func(my_file):  # this function takes a file name as an argument
    my_df = pd.read_csv(my_file, index_col="time")  # read the file into a data frame
    my_df["Returns"] = (my_df["close"] / my_df["close"].shift(1) - 1) * 100  # make the calculation
    return my_df  # and return the data frame as output
Then as a main code, you collect all csv files from a folder with glob package. Using the above function, you build a data frame for each file with the calculation done.
import glob
path = r'/Data/'  # path to the directory where you have the csv files
filenames = glob.glob(path + "*.csv")  # list every csv file in that directory
for filename in filenames:  # loop over the csv file names
    df = your_func(filename)  # read the file, make the calculation, get the data frame back
    print(df)
The print above shows the resulting data frame. I am not sure what you intend to do with upper (I don't think it is a function on a data frame).
Finally, this returns independent data frames with calculations done prior to other or final transformation.
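As for the syntax error in the question: {my_list} builds a set containing the string 'a', it does not look up the variable named a. If the frames are already loaded, one common pattern is to keep them in a dict keyed by name and loop over that. A minimal sketch, with small made-up frames standing in for a_data.csv and b_data.csv:

```python
import pandas as pd

# Hypothetical stand-ins for the frames read from a_data.csv and b_data.csv
frames = {
    "a": pd.DataFrame({"close": [100.0, 110.0, 99.0]}),
    "b": pd.DataFrame({"close": [50.0, 55.0, 66.0]}),
}

for name, frame in frames.items():
    print(f"Calculating: {name.upper()}")
    # Percent change from the previous row; the first row has no predecessor
    frame["Returns"] = (frame["close"] / frame["close"].shift(1) - 1) * 100

print(frames["a"])
```

The first row of each "Returns" column is NaN because there is no previous value to divide by. Series.pct_change() multiplied by 100 should give the same numbers.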
1. Do the a, b, c data frames have the same dimensions?
2. You don't need to import the csv module, because pandas reads CSV files itself.
3. If you want to union the data frames, you can put them in a list like this:
my_lists = [a,b,c]
and concatenate them this way:
result=pd.concat(my_lists)
Lastly, your calculation should be:
result["Returns"]=(result.loc[:, "close"].div(result.loc[:, "close"].shift()).fillna(0).replace([np.inf, -np.inf], 0))
Here loc selects the "close" column by label. Division can produce NaN (Not a Number) or infinite values, so fillna and replace are used to turn those into 0.

Slice Spark’s DataFrame SQL by row (pyspark)

I have a Spark's Dataframe parquet file that can be read by spark as follows
df = sqlContext.read.parquet('path_to/example.parquet')
df.registerTempTable('temp_table')
I want to slice my dataframe, df, by row (i.e. the equivalent of df.iloc[0:4000], df.iloc[4000:8000] etc. in a pandas dataframe), since I want to convert each small chunk to a pandas dataframe and work on it later. I only know how to do it by sampling a random fraction, i.e.
df_sample = df.sample(False, fraction=0.1) # sample 10 % of my data
df_pandas = df_sample.toPandas()
It would be great if there were a method to slice my dataframe df by row. Thanks in advance.
You can use monotonically_increasing_id() to add an ID column to your dataframe and use that to get a working set of any size.
import pyspark.sql.functions as f
# add an index column
df = df.withColumn('id', f.monotonically_increasing_id())
# Sort by index and get first 4000 rows
working_set = df.sort('id').limit(4000)
Then, you can remove the working set from your dataframe using subtract().
# Remove the working set, and use this `df` to get the next working set
df = df.subtract(working_set)
Rinse and repeat until you're done processing all rows. Not the ideal way to do things, but it works. Consider filtering your Spark data frame down before converting it to pandas.
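A different approach, not taken from the answer above, is to stream rows to the driver and batch them into pandas frames of a fixed size. The sketch below is plain Python so it runs without Spark. The helper name chunks_to_pandas is made up, and the idea of passing df.toLocalIterator() as rows and df.columns as columns is an assumption about how it would be wired to a Spark DataFrame, not something tested here:

```python
from itertools import islice

import pandas as pd


def chunks_to_pandas(rows, columns, size):
    """Yield pandas DataFrames holding at most `size` rows each."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield pd.DataFrame(batch, columns=columns)


# Plain tuples stand in for Spark Row objects in this sketch
rows = [(i, i * i) for i in range(10)]
parts = list(chunks_to_pandas(rows, ["n", "sq"], 4))
print([len(p) for p in parts])
```

Each chunk keeps the original row order, and only one chunk needs to be held in memory at a time if the generator is consumed lazily.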