Repeat the task of exporting multiple Pandas dataframes into multiple csv-files - pandas

I'm somewhat new to Pandas/Python (more into SAS), but my task is the following: I have four Pandas dataframes, and I would like to export each of them into a separate csv-file. The name of the csv should be the same as the original dataframe (forsyning.csv, inntak.csv, etc.).
So far I've made a list with the names of the dataframes, and then tried to put the list through a for-loop in order to generate one csv after another. But I've only made it half-way through. My code so far:
df_list = ['forsyning', 'inntak', 'behandling', 'transport']
for i in df_list:
    i.to_csv('{}.csv'.format(i), index=False, decimal=',', sep=';')
What I believe is missing is a proper reference where it says "i.to_csv" in my code above, as it now only gives me the error "'str' object has no attribute 'to_csv'". I just don't know how to twist this code the right way - appreciate any advice in this matter. Thanks.

If you need to write a list of DataFrames to files, you need two lists - one for the DataFrame objects and a second for the new file names as strings:
df_list = [forsyning, inntak, behandling, transport]
names = ['forsyning', 'inntak', 'behandling', 'transport']
Then zip both lists together and write each DataFrame:
for i, df in zip(names, df_list):
    df.to_csv('{}.csv'.format(i), index=False, decimal=',', sep=';')
Or use a dictionary of DataFrames and loop over its pairs with dict.items():
df_dict = {'forsyning': forsyning, 'inntak': inntak,
           'behandling': behandling, 'transport': transport}
for i, df in df_dict.items():
    df.to_csv('{}.csv'.format(i), index=False, decimal=',', sep=';')

Your df_list should be a list of DataFrame objects, but instead you have the dataframe names as str elements.
I believe your df_list should be:
df_list = [forsyning, inntak, behandling, transport]
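If you really do only have the names as strings, a hedged workaround (just a sketch, assuming the four DataFrames are module-level variables) is to look each object up by name with globals(), though the explicit dictionary shown above is the cleaner choice:
# Sketch: assumes forsyning, inntak, etc. exist as module-level variables
for name in ['forsyning', 'inntak', 'behandling', 'transport']:
    globals()[name].to_csv('{}.csv'.format(name), index=False, decimal=',', sep=';')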

How to work around the "frame.append method is deprecated, use pandas.concat instead" pandas error

from typing import List

import pandas as pd
from polygon import WebSocketClient
from polygon.websocket.models import WebSocketMessage

df = pd.DataFrame()
c = WebSocketClient(api_key='APIKEYHERE', feed='socket.polygon.io', market='crypto', subscriptions=["XT.BTC-USD"])
def handle_msg(msgs: List[WebSocketMessage]):
    global df
    df = df.append(msgs, ignore_index=True)
    print(df)
c.run(handle_msg)
I have a WebSocket client open through polygon.io, when I run this I get exactly what I want but then I get a warning that the frame.append is being deprecated and that I should use pandas.concat instead. Unfortunately, my little fragile brain has no idea how to do this.
I tried doing df = pd.concat(msgs, ignore_index=True) but get TypeError: cannot concatenate object of type '<class 'polygon.websocket.models.models.CryptoTrade'>';
Thanks for any help
To use pandas.concat instead of DataFrame.append, you need to convert the WebSocketMessage objects in the msgs list to a DataFrame and then concatenate them. Here's an example:
def handle_msg(msgs: List[WebSocketMessage]):
    global df
    msgs_df = pd.DataFrame([msg.to_dict() for msg in msgs])
    df = pd.concat([df, msgs_df], ignore_index=True)
    print(df)
This code converts each WebSocketMessage object in the msgs list to a dictionary using msg.to_dict() and then creates a DataFrame from the list of dictionaries. Finally, it concatenates this DataFrame with the existing df using pd.concat.
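If the message objects turn out not to have a to_dict() method (I'm not certain of the polygon client's exact API), a fallback sketch is to read each object's attribute dict instead:
def handle_msg(msgs: List[WebSocketMessage]):
    global df
    # vars(msg) reads the attribute dict; assumes the messages are plain attribute containers
    msgs_df = pd.DataFrame([vars(msg) for msg in msgs])
    df = pd.concat([df, msgs_df], ignore_index=True)
    print(df)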

Save output in CSV without losing previous data on that CSV in pandas dataframe

I'm doing sentiment analysis of Twitter data. For this work, I've made several datasets in CSV format, with a different month in each dataset. After preprocessing each dataset individually, I want to save all the datasets in one single CSV file. But when I write the below code using a pandas dataframe:
df.to_csv('dataset.csv', index=False)
It removes the previous data (rows) from that dataset. Is there any way that I can keep the previous data too in that file, so that I can merge all the data together? Thank you.
It's not entirely clear what you want from your question, so this is just a guess, but something like this might be what you're looking for. If you keep assigning dataframes to df, then new data will overwrite the old data. Try assigning them to differently named dataframes like df1 and df2. Then you can merge them.
# vertically merge the multiple dataframes and reassign to new variable
df = pd.concat([df1, df2])
# save the dataframe
df.to_csv('my_dataset.csv', index=False)
In Python you can use the built-in open("file") with the parameter 'a':
open("file", 'a')
The 'a' means "append", so you will add lines at the end of the file.
You can pass the same parameter to the pandas.DataFrame.to_csv() method.
e.g:
import pandas as pd
# code where you get data and return df
df.to_csv("file", mode='a')
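One caveat with mode='a': to_csv writes the header row on every call by default, so repeated appends leave duplicate header rows in the middle of the data. A common guard (a sketch, with a file name of my own choosing) is to write the header only when the file doesn't exist yet:
import os
# Append rows; emit the header only on the very first write
df.to_csv("dataset.csv", mode='a', index=False,
          header=not os.path.exists("dataset.csv"))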
@thehand0: Your code works, but it's inefficient, so it will take longer for your script to run.

Is there a way to export pandas dataframe info -- df.info() into an excel file?

I have a .csv file locally. I am reading the file with pandas. I want to move the df.info() result into an excel. Looks like df.info().to_excel does not work as it is not supported. Is there any way to do this?
I tried df.info().to_excel
import pandas as pd
from openpyxl.workbook import Workbook
df = pd.read_csv("file.csv", sep='|', error_bad_lines=False)
writer = pd.ExcelWriter('output.xlsx')
df.info()
df.info().to_excel(writer,sheet_name='info')
I want to show the dataframe info output in a single tab of the excel.
The easiest way for me is to get the same information in dataframes, but separately:
df_datatypes = pd.DataFrame(df.dtypes)
df_null_count = df.count()
Then write to excel as usual.
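For example, a sketch that puts the two pieces on separate sheets of one workbook (the sheet names are my own choice):
with pd.ExcelWriter('output.xlsx') as writer:
    df_datatypes.to_excel(writer, sheet_name='dtypes')
    df_null_count.to_excel(writer, sheet_name='non_null_counts')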
to_excel is a method of DataFrame (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html), and DataFrame.info() doesn't return a DataFrame; it prints its report and returns None.
You can write the info to a text file like so:
import io
buffer = io.StringIO()
df.info(buf=buffer)
s = buffer.getvalue()
with open("df_info.txt", "w", encoding="utf-8") as f:
    f.write(s)
You can modify this code by removing the last two lines, parsing the s variable, and creating a DataFrame out of it (shaped the way you would like it to appear in the excel file), and then using the to_excel() method.
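A minimal sketch of that idea, which skips real parsing and just puts each line of the info output into its own spreadsheet row:
# Reuses the buffer filled by df.info(buf=buffer) above
lines = buffer.getvalue().splitlines()
pd.DataFrame({'info': lines}).to_excel('df_info.xlsx', sheet_name='info', index=False)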
I agree with @yl_low, but you could have a more elegant solution, as shown:
def get_dataframe_info(df):
    """
    input
       df -> DataFrame
    output
       df_null_count -> DataFrame info (sorted)
    """
    df_types = pd.DataFrame(df.dtypes)
    df_nulls = df.count()
    df_null_count = pd.concat([df_types, df_nulls], axis=1)
    df_null_count = df_null_count.reset_index()
    # Reassign column names
    col_names = ["features", "types", "non_null_counts"]
    df_null_count.columns = col_names
    # Sort by the non-null counts (matching the column name assigned above)
    df_null_count = df_null_count.sort_values(by=["non_null_counts"], ascending=False)
    return df_null_count
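Usage is then a one-liner (the file and sheet names here are my own choice):
get_dataframe_info(df).to_excel('df_info.xlsx', sheet_name='info', index=False)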
You can do this in Python 3.
pd.DataFrame({"name": train.columns,
              "non-nulls": len(train) - train.isnull().sum().values,
              "nulls": train.isnull().sum().values,
              "type": train.dtypes.values}).to_excel("op.xlsx")
Just one line of code (without the non-null counts column):
df.dtypes.reset_index(name='Dtype').rename(columns={'index': 'Column'}).to_excel('Name.xlsx', sheet_name='info')

added labels to a pandas df and then concatenated that df to another df - now the labels are a list - what gives?

I have two csv files that I need to concatenate. I read in the two csv files as pandas dfs. One has col labels and the other doesn't. I added labels to the df that needed them, then concatenated the two dfs. Concatenation works fine, but the labels I added look like individual lists or something. I can't figure out what python is doing, especially since when you print the labels and the df, it all looks good. Call this approach one.
I was able to fix the problem by adding col labels to the csv when I read it in. Then it works fine. Call this approach two. What is going on with approach one?
Code and results below.
Approach One
#read in the vectors as a pandas df vec
vecs=pd.read_csv(os.path.join(path,filename), header=None)
#label the feature vectors v1-vn and attach to the df
endrange=features+1
string='v'
vecnames=[string + str(i) for i in range(1,endrange)]
vecs.columns = [vecnames]
print('\nvecnames')
display(vecnames) #they look ok here
display(vecs.head()) #they look ok here
#read in the IDs and phrases as a pandas df
recipes=pd.read_csv(os.path.join(path,'2a_2d_id_all_recipe_forms.csv'))
print('\nrecipes file - ids and recipe phrases')
display(recipes.head())
test=pd.concat([recipes, vecs], axis=1)
print('\ncol labels for vectors look like lists!')
display(test.head())
Results of Approach One:
['v1',
'v2',
'v3',
'v4',
'v5',
'v6',
'v7',
'v8',
'v9',
'v10',
'v11',
'v12',
'v13',
'v14',
'v15',
'v16',
'v17',
'v18',
'v19',
'v20',
'v21',
'v22',
'v23',
'v24',
'v25']
Approach Two
By adding the col labels to the csv when I read in the unlabeled file, it works fine. Why?
#label the feature vectors v1-vn and attach to the df
endrange=features+1
string='v'
vecnames=[string + str(i) for i in range(1,endrange)]
#read in the vectors as a pandas df and label the cols
vecs=pd.read_csv(os.path.join(path,filename), names=vecnames, header=None)
#read in the IDs and phrases as a pandas df
recipes=pd.read_csv(os.path.join(path,'2a_2d_id_all_recipe_forms.csv'))
test=pd.concat([recipes, vecs], axis=1)
print('\ncol labels for vectors as expected')
display(test.head())
Results of Approach Two: the col labels come out as expected (output omitted).
The odd behaviour comes from this line:
vecs.columns = [vecnames]
vecnames is already a list, but the above line wraps it in another list. The column names display properly when you print the DataFrame, but concatenating vecs with another DataFrame causes pandas to unwrap the column names of vecs into single-element tuples.
Fix: change the above line to:
vecs.columns = vecnames
And run everything else as is.
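You can reproduce the wrapping in a toy example (a sketch; the extra brackets turn the columns into a one-level MultiIndex, which the later concat renders as single-element tuples):
import pandas as pd
vecs = pd.DataFrame([[1, 2]])
vecs.columns = [['v1', 'v2']]   # nested list -> one-level MultiIndex
print(vecs.columns)             # MultiIndex([('v1',), ('v2',)], ...)
recipes = pd.DataFrame({'id': [10]})
print(pd.concat([recipes, vecs], axis=1).columns)
# roughly: Index(['id', ('v1',), ('v2',)], dtype='object')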

How do I convert multiple Pandas DFs into a single Spark DF?

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()
for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)
def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])
processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).
This can be done by converting the pandas DataFrames to Arrow RecordBatches, which Spark > 2.3 can process into a DF in a very efficient manner.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches Spark to include a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts an RDD of pandas DFs (it assumes they all have the same columns) and returns a single Spark DF.
I solved this by writing a function like this:
def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]
    rows = list()
    for index, series in pd_df.iterrows():
        # Take a row of the df, export it as a dict, then pass the unpacked dict into the Row constructor
        row_dict = {str(k): v for k, v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now holds a collection of Spark Row objects, so you can now say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series -> dict -> Row operation, but this got me through.
Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]
sdf = None
for df in dfs:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]
sdf = None
for name in names:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))
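Equivalently, the loop over pandas DataFrames can be collapsed with functools.reduce (a sketch, assuming an active SparkSession named spark and the dfs list from above):
from functools import reduce
# Build one Spark DF per pandas DF, then union them all pairwise
sdf = reduce(lambda a, b: a.union(b),
             (spark.createDataFrame(df) for df in dfs))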