one-line dataframe df.to_csv fails, flipping all the data - pandas

I'm using a dataframe to calculate a bunch of stuff, with results winding up in the SECOND-TO-LAST LINE of the df.
I need to append JUST THAT ONE LINE to a CSV file.
Instead of writing the labels across the first row with the data beneath them, to_csv keeps putting the labels in the first column and the data in the second column.
Subsequent writes keep appending data DOWN, under the first column.
I'm using code like this:
if not os.path.isfile(csvFilePath):
    df.iloc[-2].to_csv(csvFilePath, mode='w', index=True, sep=';', header=True)
else:
    df.iloc[-2].to_csv(csvFilePath, mode='a', index=False, sep=';', header=False)
The "csv" file it produces looks like this (two iterations):
;2021-04-29 07:00:00
open;54408.26
high;54529.67
low;54300.0
close;54500.0
volume;180.44990968
ATR;648.08
RSI;41.2556049907123
ticker;54228.51
BidTarget_1;53012.42
Bdistance_1;1216.0
BidTarget_2;54031.94
BCOGdistance_2;197.0
AskTarget_1;54934.18
ACOGdistance_1;705.67
AskTarget_2;55494.92
ACOGdistance_2;1266.41
TotBid;207.34781091999974
TotAsk;199.80037382000046
AskBidRatio;0.96
54408.26
54529.67
54300.0
54500.0
180.44990968
648.08
41.2556049907123
54071.49
53011.46
1060.0
53665.5
406.0
54620.97
549.48
54398.77
327.28
208.08094453999973
186.65960602000038
0.9
I'm at a complete loss ...

I start with a .csv that contains
hello, from, this, file
another, amazing, line, csv
save, this, line, data
last, line, of, file
where the second-to-last line is the desired output.
I think you can get what you want by using
import pandas
df = pandas.read_csv("myfile.csv", header=None)
df.iloc[-2].to_frame().transpose()
The trick is that df.iloc[-2] returns a Pandas Series. You can determine the datatype using
type(df.iloc[-2])
which returns pandas.core.series.Series. I'm not sure why the Pandas Series are oriented the way they are.
The Pandas Series can be converted back to a dataframe using df.iloc[-2].to_frame(), but the orientation is flipped 90 degrees (matching the Series orientation). To get back to the desired orientation, the transformation called transpose (flip about the diagonal) is needed.
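Applied back to the original append loop, a minimal sketch (the function name append_row is mine; csvFilePath and the semicolon separator come from the question) would be:
import os
import pandas as pd

def append_row(df, csvFilePath):
    # Convert the Series for the second-to-last row back into a one-row
    # DataFrame, so the labels become column headers instead of an index column.
    row = df.iloc[-2].to_frame().transpose()
    # Write the column headers only when the file is first created, and keep
    # the row's name (the timestamp) as the first column on every write.
    first_write = not os.path.isfile(csvFilePath)
    row.to_csv(csvFilePath, mode='a', sep=';', index=True, header=first_write)
As an aside, df.iloc[[-2]] (double brackets) returns the same one-row DataFrame directly, without the intermediate Series.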

Related

Save output in CSV without losing previous data on that CSV in pandas dataframe

I'm doing sentiment analysis of Twitter data. For this work, I've made several datasets in CSV format, one for each month. After preprocessing each dataset individually, I want to save them all into one single CSV file, but when I write the code below using a pandas dataframe:
df.to_csv('dataset.csv', index=False)
It removes the previous data (rows) from that file. Is there any way that I can keep the previous data in the file too, so that I can merge all the data together? Thank you.
It's not entirely clear what you want from your question, so this is just a guess, but something like this might be what you're looking for. If you keep assigning dataframes to df, then new data will overwrite the old data. Try assigning them to differently named dataframes like df1 and df2. Then you can merge them.
# vertically merge the multiple dataframes and reassign to new variable
df = pd.concat([df1, df2])
# save the dataframe
df.to_csv('my_dataset.csv', index=False)
In Python you can use the built-in open("file") function with the parameter 'a':
open("file", 'a').
The 'a' means "append" so you will add lines at the end of your file.
You can use the same parameter with the pandas.DataFrame.to_csv() method.
e.g:
import pandas as pd
# code where you get data and return df
df.to_csv("file", mode='a')
#thehand0: Your code works, but it's inefficient, so it will take longer for your script to run.
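As a sketch of how that might look for the monthly datasets in the question (the file names below are made up), appending each preprocessed dataframe to dataset.csv and writing the header only once:
import os
import pandas as pd

monthly_files = ['january.csv', 'february.csv', 'march.csv']  # made-up names
for name in monthly_files:
    df = pd.read_csv(name)
    # ... preprocessing for this month's dataset goes here ...
    # Append to the combined file; write the column labels only on the first pass.
    df.to_csv('dataset.csv', mode='a', index=False,
              header=not os.path.isfile('dataset.csv'))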

Importing Numbers using DataFrame

I'm trying to import numbers from an xlsx file using a pandas DataFrame, but I'm getting the numbers in a slightly different format.
Let's say the number is: 9582*****4
The number I get using this code is 9582*****4.0:
df = pd.read_excel("Contacts.xlsx")
for i in range(len(df)):
    print(df.iloc[i, 0])
It was working just fine till last night.
I guess you need to change the data type from float to int:
df = pd.read_excel("Contacts.xlsx")
df = df.astype(int)  # for all columns
# or: df = df.astype({"Column_name": int})  # for a specific column
for i in range(len(df)):
    print(df.iloc[i, 0])

added labels to a pandas df and then concatenated that df to another df - now the labels are a list - what gives?

I have two csv files that I need to concatenate. I read in the two csv files as pandas dfs. One has col labels and the other doesn't. I add labels to the df that needs them, then concatenate the two dfs. The concatenation works fine, but the labels I added look like individual lists or something. I can't figure out what Python is doing, especially since when you print the labels and the df it all looks good. Call this approach one.
I was able to fix the problem by adding col labels to the csv when I read it in. Then it works fine. Call this approach two. What is going on with approach one?
Code and results below.
Approach One
#read in the vectors as a pandas df vec
vecs=pd.read_csv(os.path.join(path,filename), header=None)
#label the feature vectors v1-vn and attach to the df
endrange=features+1
string='v'
vecnames=[string + str(i) for i in range(1,endrange)]
vecs.columns = [vecnames]
print('\nvecnames')
display(vecnames) #they look ok here
display(vecs.head()) #they look ok here
#read in the IDs and phrases as a pandas df
recipes=pd.read_csv(os.path.join(path,'2a_2d_id_all_recipe_forms.csv'))
print('\nrecipes file - ids and recipe phrases')
display(recipes.head())
test=pd.concat([recipes, vecs], axis=1)
print('\ncol labels for vectors look like lists!')
display(test.head())
Results of Approach One:
['v1',
'v2',
'v3',
'v4',
'v5',
'v6',
'v7',
'v8',
'v9',
'v10',
'v11',
'v12',
'v13',
'v14',
'v15',
'v16',
'v17',
'v18',
'v19',
'v20',
'v21',
'v22',
'v23',
'v24',
'v25']
Approach Two
By adding the col labels to the csv when I read in the unlabeled file, it works fine. Why?
#label the feature vectors v1-vn and attach to the df
endrange=features+1
string='v'
vecnames=[string + str(i) for i in range(1,endrange)]
#read in the vectors as a pandas df and label the cols
vecs=pd.read_csv(os.path.join(path,filename), names=vecnames, header=None)
#read in the IDs and phrases as a pandas df
recipes=pd.read_csv(os.path.join(path,'2a_2d_id_all_recipe_forms.csv'))
test=pd.concat([recipes, vecs], axis=1)
print('\ncol labels for vectors as expected')
display(test.head())
Results of Approach Two
The odd behaviour comes from this line:
vecs.columns = [vecnames]
vecnames is already a list, but the above line wraps it in another list. The column names display properly when you print the DataFrame, but concatenating vecs with another DataFrame causes pandas to unwrap the column names of vecs into single-element tuples.
Fix: change the above line to:
vecs.columns = vecnames
And run everything else as is.
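A small reproduction (the toy data here is made up for illustration) shows the difference: the wrapped assignment produces a MultiIndex whose labels surface as single-element tuples after the concat, while the plain assignment gives an ordinary Index:
import pandas as pd

vecnames = ['v1', 'v2', 'v3']
vecs = pd.DataFrame([[0.1, 0.2, 0.3]])
recipes = pd.DataFrame({'id': [1], 'phrase': ['toast']})

# Wrapping the list in another list creates a MultiIndex of 1-tuples...
vecs.columns = [vecnames]
print(pd.concat([recipes, vecs], axis=1).columns.tolist())
# e.g. ['id', 'phrase', ('v1',), ('v2',), ('v3',)]

# ...while assigning the list directly gives a flat Index.
vecs.columns = vecnames
print(pd.concat([recipes, vecs], axis=1).columns.tolist())
# e.g. ['id', 'phrase', 'v1', 'v2', 'v3']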

How to write multiple dataframes to the same sheet without duplicating the column labels

I have two questions regarding writing dataframe data to a file:
My program produces summary statistics on many grouped rows of a dataframe and saves them to a StringIO buffer, which is written to my output.csv file at completion. I have a feeling pd.concat would be better suited, but I couldn't get that to work. I can try adding a snippet of code when I get a chance, and hopefully someone can explain how to concat properly; I assume that will solve my issue.
That being said, my program works, and that's more than I can ask for. What is bugging me, though, is that the CSV file ends up repeating the same column labels for every summary-statistics dataframe that was written to the buffer, and consequently into my CSV file as well. Is there a way to write the column labels only once and avoid multiple duplicate label rows?
My second question is about writing to Excel to skip an unnecessary copy and paste. Like my previous issue, this is only a minor hindrance, but it still bugs me as I would like to do things the right way. The issue is that I want all the frames written to the same sheet. In order to avoid overwriting the same data, it is necessary to use a buffer to store the data until the end. None of the docs seemed helpful in my particular situation. I devised a workaround: xlwt to buffer -> output.write(buffer.getvalue()) -> pd.to_csv(output), then reimport that same file via pd.read_csv, and finally add another writer that writes the dataframe to Excel. After all that work I ended up just sticking with the simplicity of CSV, since the Excel writer actually magnified the ugliness of the duplicated rows. Any suggestions on how my buffer issue could be handled better? I would prefer the streamlining and control of the Excel writer over CSV output.
Sorry for not having any code for context. I tried my best to explain without it. If necessary I can add the code when I get a free chance.
I'd agree that concatenating the dataframes is probably a better solution. You should probably ask a question specifically for that with some sample code/dataframes.
For your second question you can position a dataframe in an Excel worksheet using the startrow and startcol parameters. You can skip the repeated header using the header boolean parameter, and you can skip the index using the index boolean parameter.
For example:
import pandas as pd
# Create some Pandas dataframes from some data.
df1 = pd.DataFrame({'Data': [11, 12, 13, 14]})
df2 = pd.DataFrame({'Data': [21, 22, 23, 24]})
df3 = pd.DataFrame({'Data': [31, 32, 33, 34]})
df4 = pd.DataFrame({'Data': [41, 42, 43, 44]})
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_test.xlsx', engine='xlsxwriter')
# Add the first dataframe to the worksheet.
df1.to_excel(writer, sheet_name='Sheet1', index=False)
offset = len(df1) + 1 # Add extra row for column header.
# Add the other dataframes.
for df in (df2, df3, df4):
    # Write the dataframe without a column header or index.
    df.to_excel(writer, sheet_name='Sheet1', startrow=offset,
                header=False, index=False)
    offset += len(df)
# Close the Pandas Excel writer and output the Excel file.
writer.save()

Slice Spark’s DataFrame SQL by row (pyspark)

I have a Spark DataFrame parquet file that can be read by Spark as follows:
df = sqlContext.read.parquet('path_to/example.parquet')
df.registerTempTable('temp_table')
I want to slice my dataframe, df, by row (i.e. the equivalent of df.iloc[0:4000], df.iloc[4000:8000], etc. in a Pandas dataframe), since I want to convert each small chunk to a pandas dataframe to work on later. I only know how to do it by sampling a random fraction, i.e.
df_sample = df.sample(False, fraction=0.1) # sample 10 % of my data
df_pandas = df_sample.toPandas()
It would be great if there were a method to slice my dataframe df by row. Thanks in advance.
You can use monotonically_increasing_id() to add an ID column to your dataframe and use that to get a working set of any size.
import pyspark.sql.functions as f
# add an index column
df = df.withColumn('id', f.monotonically_increasing_id())
# Sort by index and get first 4000 rows
working_set = df.sort('id').limit(4000)
Then, you can remove the working set from your dataframe using subtract().
# Remove the working set, and use this `df` to get the next working set
df = df.subtract(working_set)
Rinse and repeat until you're done processing all rows. Not the ideal way to do things, but it works. Consider filtering your Spark dataframe down before converting it to Pandas.
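Putting it together, a rough sketch of that loop (CHUNK_SIZE and the processing step are placeholders, not part of the original answer) could look like this:
import pyspark.sql.functions as f

CHUNK_SIZE = 4000  # rows per pandas chunk (placeholder value)
# Add the index column once, up front.
df = df.withColumn('id', f.monotonically_increasing_id())
remaining = df
while remaining.count() > 0:
    # Take the next chunk in id order and convert it to pandas.
    working_set = remaining.sort('id').limit(CHUNK_SIZE)
    df_pandas = working_set.toPandas()
    # ... process df_pandas here ...
    # Remove the processed rows and continue with the rest.
    remaining = remaining.subtract(working_set)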