How to work around the "frame.append method is deprecated, use pandas.concat instead" pandas error - polygon.io

import pandas as pd
from typing import List
from polygon import WebSocketClient
from polygon.websocket.models import WebSocketMessage

df = pd.DataFrame()
c = WebSocketClient(api_key='APIKEYHERE', feed='socket.polygon.io', market='crypto', subscriptions=["XT.BTC-USD"])

def handle_msg(msgs: List[WebSocketMessage]):
    global df
    df = df.append(msgs, ignore_index=True)
    print(df)

c.run(handle_msg)
I have a WebSocket client open through polygon.io. When I run this I get exactly what I want, but I also get a warning that frame.append is being deprecated and that I should use pandas.concat instead. Unfortunately, my little fragile brain has no idea how to do this.
I tried df = pd.concat(msgs, ignore_index=True) but get TypeError: cannot concatenate object of type '<class 'polygon.websocket.models.models.CryptoTrade'>';
Thanks for any help

To use pandas.concat instead of DataFrame.append, you need to convert the WebSocketMessage objects in the msgs list to a DataFrame and then concatenate them. Here's an example:
def handle_msg(msgs: List[WebSocketMessage]):
    global df
    msgs_df = pd.DataFrame([msg.to_dict() for msg in msgs])
    df = pd.concat([df, msgs_df], ignore_index=True)
    print(df)
This code converts each WebSocketMessage object in the msgs list to a dictionary using msg.to_dict() and then creates a DataFrame from the list of dictionaries. Finally, it concatenates this DataFrame with the existing df using pd.concat.
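Note that growing a DataFrame with pd.concat on every callback copies the whole frame each time, which can get slow on a busy stream. A common alternative (just a sketch, reusing the to_dict() conversion from above) is to buffer the incoming rows in a plain list and rebuild the DataFrame only when you need it:
rows = []  # plain-list buffer; extending a list is much cheaper than growing a DataFrame

def handle_msg(msgs: List[WebSocketMessage]):
    rows.extend(msg.to_dict() for msg in msgs)
    print(pd.DataFrame(rows))  # build the frame only when you actually need it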

Related

Convert type object column to float

I have a table with a column named "price". This column is of type object, so it contains numbers as strings and also NaN or ? characters. I want to find the mean of this column, but first I have to remove the NaN and ? values and convert the column to float.
I am using the following code:
import pandas as pd
import numpy as np
df = pd.read_csv('Automobile_data.csv', sep = ',')
df = df.dropna('price', inplace=True)
df['price'] = df['price'].astype('int')
df['price'].mean()
But, this doesn't work. The error says:
ValueError: No axis named price for object type DataFrame
How can I solve this problem?
edit: in pandas version 1.3 and earlier, you need subset=[col] wrapped in a list/array. In version 1.4 and greater you can pass a single column as a string.
You've got a few problems:
df.dropna()'s arguments are the axis and then the subset. The axis is rows/columns, and the subset is which of those to look at. So you want this to be (I think) df.dropna(axis='rows', subset=['price']).
Using inplace=True makes the whole call return None, and so you have set df = None. You don't want to do that. If you are using inplace=True, you don't assign the result to anything; the whole line would just be df.dropna(..., inplace=True).
Better yet, don't use inplace=True, just do the assignment. That is, df = df.dropna(axis='rows', subset=['price']), as in the sketch below.
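Putting the fixes together, and also handling the '?' placeholders from the question (a sketch: pd.to_numeric with errors='coerce' turns anything non-numeric, including '?', into NaN, so a single dropna can then remove all the bad rows):
import pandas as pd

df = pd.read_csv('Automobile_data.csv', sep=',')

# Coerce the object column to numbers; '?' and other junk become NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Drop the rows where price could not be parsed, then take the mean
df = df.dropna(subset=['price'])
print(df['price'].mean())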

TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid. It was supposed to work with df.append

df = pd.DataFrame(columns=['locale', 'description'])

for text in texts:
    df = pd.concat(
        dict(
            locale=text.locale,
            description=text.description
        ),
        ignore_index=True
    )
Is there any workaround for this? It was supposed to work with df.append, but that now says FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
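One workaround (a sketch of the usual replacement pattern): pd.concat only accepts Series and DataFrame objects, so collect the rows as plain dicts first, build a DataFrame from them in one shot, and concatenate that:
df = pd.DataFrame(columns=['locale', 'description'])

rows = [{'locale': text.locale, 'description': text.description} for text in texts]
df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)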

Repeating the task of exporting multiple Pandas dataframes into multiple csv-files

I'm somewhat new to Pandas/Python (more into SAS), but my task is the following: I have four Pandas dataframes, and I would like to export each of them into a separate csv-file. The name of the csv should be the same as the original dataframe (forsyning.csv, inntak.csv etc).
So far I've made a list with the names of the dataframes, and then tried to put the list through a for-loop in order to generate one csv after another. But I've only made it half-way through. My code so far:
df_list = ['forsyning', 'inntak', 'behandling', 'transport']

for i in df_list:
    i.to_csv('{}.csv'.format(i), index=False, decimal=',', sep=';')
What I believe is missing is a proper reference where it says "i.to_csv" in my code above, as it now only gives me the error "'str' object has no attribute 'to_csv'". I just don't know how to twist this code the right way - appreciate any advice in this matter. Thanks.
If you need to write a list of DataFrames to files, you need two lists - one for the DataFrame objects and a second for the new file names as strings:
df_list = [forsyning, inntak, behandling, transport]
names = ['forsyning', 'inntak', 'behandling', 'transport']
Then zip both lists together and write each DataFrame:
for i, df in zip(names, df_list):
    df.to_csv('{}.csv'.format(i), index=False, decimal=',', sep=';')
Or use a dictionary of DataFrames and loop over it with dict.items():
df_dict = {'forsyning': forsyning, 'inntak': inntak,
           'behandling': behandling, 'transport': transport}
for i, df in df_dict.items():
    df.to_csv('{}.csv'.format(i), index=False, decimal=',', sep=';')
Your df_list should be a list of dataframe objects, but you seem to have the dataframe names as str elements instead.
I believe your df_list should be:
df_list = [forsyning, inntak, behandling, transport]

Is there a way to export pandas dataframe info -- df.info() into an excel file?

I have a .csv file locally. I am reading the file with pandas. I want to move the df.info() result into an excel. Looks like df.info().to_excel does not work as it is not supported. Is there any way to do this?
I tried df.info().to_excel
import pandas as pd
from openpyxl.workbook import Workbook

df = pd.read_csv("file.csv", sep='|', error_bad_lines=False)
writer = pd.ExcelWriter('output.xlsx')
df.info()
df.info().to_excel(writer, sheet_name='info')
I want to show the dataframe info output in a single tab of the excel.
The easiest way for me is to get the same information in dataframes, but separately:
df_datatypes = pd.DataFrame(df.dtypes)
df_null_count = df.count()
Then write to excel as usual.
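For example, writing both frames into one workbook could look like this (a sketch; the file and sheet names are just placeholders):
with pd.ExcelWriter('df_info.xlsx') as writer:
    df_datatypes.to_excel(writer, sheet_name='dtypes')
    df_null_count.to_excel(writer, sheet_name='non_null_counts')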
to_excel is a method of DataFrame (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html), and DataFrame.info() doesn't return a DataFrame - it prints to stdout and returns None.
You can write the info to a text file like so:
import io
buffer = io.StringIO()
df.info(buf=buffer)
s = buffer.getvalue()
with open("df_info.txt", "w", encoding="utf-8") as f:
f.write(s)
You can modify this code by removing the last two lines, parsing the s variable, and creating a DataFrame out of it (shaped the way you would like it to appear in the excel file), and then use the to_excel() method.
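A deliberately simple version of that idea - one line of the info text per spreadsheet row, which keeps the parsing trivial (the file and sheet names here are placeholders):
info_df = pd.DataFrame({'info': s.splitlines()})
info_df.to_excel('output.xlsx', sheet_name='info', index=False)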
I agree with #yl_low but you could have a more elegant solution as shown:
def get_dataframe_info(df):
    """
    input
       df -> DataFrame
    output
       df_null_count -> DataFrame info (sorted)
    """
    df_types = pd.DataFrame(df.dtypes)
    df_nulls = df.count()

    df_null_count = pd.concat([df_types, df_nulls], axis=1)
    df_null_count = df_null_count.reset_index()

    # Reassign column names
    col_names = ["features", "types", "non_null_counts"]
    df_null_count.columns = col_names

    # Add this to sort by the number of non-null values
    df_null_count = df_null_count.sort_values(by=["non_null_counts"], ascending=False)

    return df_null_count
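Usage is then a one-liner (the output file name is just a placeholder):
get_dataframe_info(df).to_excel('df_info.xlsx', index=False)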
You can do this in Python 3.
pd.DataFrame({"name": train.columns, "non-nulls": len(train)-train.isnull().sum().values, "nulls": train.isnull().sum().values, "type": train.dtypes.values}).to_excel("op.xlsx")
Just one line of code (without the non-null column); passing a path instead of an ExcelWriter lets to_excel save the file itself:
df.dtypes.reset_index(name='Dtype').rename(columns={'index': 'Column'}).to_excel('Name.xlsx', sheet_name='info')

How do I convert multiple Pandas DFs into a single Spark DF?

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()

for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)

def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])

processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).
This can be done by converting the pandas DataFrames to Arrow RecordBatches, which Spark > 2.3 can process into a DF very efficiently.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches Spark to include a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts an RDD object of pandas DFs (it assumes they all have the same columns) and returns a single Spark DF.
I solved this by writing a function like this:
def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]

    rows = list()
    for index, series in pd_df.iterrows():
        # Take a row of the df, export it as a dict, and pass the unpacked dict into the Row constructor
        row_dict = {str(k): v for k, v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now holds a collection of Spark Row objects, so you can say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series -> dict -> Row operation, but this got me through.
Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]

sdf = None
for df in dfs:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]

sdf = None
for name in names:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))
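For what it's worth, the first loop can also be collapsed with functools.reduce (a sketch, equivalent to the dataframe variant above):
from functools import reduce

dfs = [df1, df2, df3, df4]
# union pairwise across the whole list, converting each pandas DF as we go
sdf = reduce(lambda a, b: a.union(b), (spark.createDataFrame(df) for df in dfs))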