Pandas DataFrame

import numpy as np
import pandas as pd
from pandas_datareader import data as wb
from yahoofinancials import YahooFinancials
sympol = [input()]
abc = YahooFinancials(sympol)
l=abc.get_financial_stmts('annual', 'cash')
df=pd.concat([pd.DataFrame(key) for key in l['cashflowStatementHistory']['FB']],axis=1,sort=True).reset_index().rename(columns={'index':'Time'})
Thank you, it works but only with a specific symbol. How can I make it customised so that I can enter any symbol? What should I write instead of 'FB'? I tried [[sympol]] but it gives me an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
7 l=abc.get_financial_stmts('annual', 'cash')
----> 8 df=pd.concat([pd.DataFrame(key) for key in l['cashflowStatementHistory'][[sympol]]],axis=1,sort=True).reset_index().rename(columns={'index':'Time'})
TypeError: unhashable type: 'list'
What do you think is the problem? How can I fix it?
Thank you for your help.

I think the simplest approach is to use a list comprehension, transposing each DataFrame with DataFrame.T, and then concat the results.
If you need to work with time series later, it's best to also create a DatetimeIndex:
df = pd.concat([pd.DataFrame(x).T for x in l['cashflowStatementHistory']['FB']], sort=True)
df.index = pd.to_datetime(df.index)
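To avoid hard-coding 'FB', index with the ticker string you entered rather than with a list: sympol is already a list (because of [input()]), so [[sympol]] is a nested list and cannot be a dictionary key, which is what the unhashable type: 'list' error means. A minimal sketch, assuming the imports from the question and that the returned dictionary is keyed by that ticker symbol:
sympol = [input()]                 # e.g. type FB at the prompt
abc = YahooFinancials(sympol)
l = abc.get_financial_stmts('annual', 'cash')
# use the string sympol[0] as the key, not the list
df = pd.concat([pd.DataFrame(x).T for x in l['cashflowStatementHistory'][sympol[0]]], sort=True)
df.index = pd.to_datetime(df.index)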

Pyspark pandas TypeError when try to concatenate two dataframes

I got the error below while trying to concatenate two pandas dataframes:
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
At first I thought it emerged because one of the dataframes includes a list in some column. So I tried to concatenate two dataframes that do not include lists in their columns, but I got the same error. I printed the types of the dataframes to be sure: both of them are pandas.core.frame.DataFrame. Why do I get this error even though they are not lists?
import pyspark.pandas as ps
split_col = split_col.toPandas()
split_col2 = split_col2.toPandas()
dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_1455538/463168233.py in <module>
2 split_col = split_col.toPandas()
3 split_col2 = split_col2.toPandas()
----> 4 dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
/home/anaconda3/envs/virtenv/lib/python3.10/site-packages/pyspark/pandas/namespace.py in concat(objs, axis, join, ignore_index, sort)
2464 for obj in objs:
2465 if not isinstance(obj, (Series, DataFrame)):
-> 2466 raise TypeError(
2467 "cannot concatenate object of type "
2468 "'{name}"
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
type(split_col)
pandas.core.frame.DataFrame
type(split_col2)
pandas.core.frame.DataFrame
I want to concatenate the 2 dataframes but I am stuck. Do you have any suggestions?
You're getting this error because you're trying to concatenate two plain pandas DataFrames using the pandas API on Spark (pyspark.pandas).
Instead of converting your pyspark dataframes to pandas dataframes with the toPandas() method, try the following:
split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()
More documentation on this method: https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.DataFrame.to_pandas_on_spark.html
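Alternatively, if you really want plain pandas DataFrames (for example because the data comfortably fits in memory), keep the toPandas() calls and concatenate with pandas itself rather than with pyspark.pandas. A minimal sketch, assuming split_col and split_col2 are your original Spark DataFrames:
import pandas as pd
# convert the Spark DataFrames to plain pandas DataFrames
pdf1 = split_col.toPandas()
pdf2 = split_col2.toPandas()
# concatenate with pandas' own concat, not ps.concat
dfNew = pd.concat([pdf1, pdf2], axis=1, ignore_index=True)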

Convert pandas to dask code and it errors out

I have pandas code which works perfectly.
import pandas as pd
courses_df = pd.DataFrame(
[
["Jay", "MS"],
["Jay", "Music"],
["Dorsey", "Music"],
["Dorsey", "Piano"],
["Mark", "MS"],
],
columns=["Name", "Course"],
)
pandas_df_json = (
courses_df.groupby(["Name"])
.apply(lambda x: x.drop(columns="Name").to_json(orient="records"))
.reset_index(name="courses_json")
)
But when I convert the dataframe to Dask and try the same operation.
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
name="courses_json"
).compute()
And the error I get is:
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [37], in <module>
1 from dask import dataframe as dd
3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
5 name="courses_json"
6 ).compute()
TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'
My expected output from Dask should be the same as from pandas, that is:
Name courses_json
0 Dorsey [{"Course":"Music"},{"Course":"Piano"}]
1 Jay [{"Course":"MS"},{"Course":"Music"}]
2 Mark [{"Course":"MS"}]
How do I achieve this in Dask?
My try so far:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(
lambda x: x.drop(columns="Name").to_json(orient="records")
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(
Out[57]:
Name
Dorsey [{"Course":"Piano"},{"Course":"Music"}]
Jay [{"Course":"MS"},{"Course":"Music"}]
Mark [{"Course":"MS"}]
dtype: object
I want to pass in a meta argument and also want the second column to have a meaningful name like courses_json.
For the meta warning, Dask is expecting you to specify the column datatypes for the result. It's optional, but if you do not specify this it's entirely possible that Dask may infer faulty datatypes. One partition could for example be inferred as an int type and another as a float. This is particularly the case for sparse datasets. See the docs page for more details:
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
This should solve the warning:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
new_df = df.groupby(["Name"]).apply(
lambda x: x.drop(columns="Name").to_json(orient="records"),
meta=("Name", "O")
).to_frame()
# rename columns
new_df.columns = ["courses_json"]
# use numeric int index instead of name as in the given example
new_df = new_df.reset_index()
new_df.compute()
The result of your computation is a Dask Series, not a DataFrame; that is why the meta here uses a NumPy dtype code (https://www.w3schools.com/python/numpy/numpy_data_types.asp). A Series consists only of an index and values, so you cannot name the second column directly without converting it back to a dataframe with the .to_frame() method.
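As a hedged alternative (assuming, as in recent Dask versions, that the name in the meta tuple is carried through to the resulting Series), you can put the desired column name directly into meta so that .to_frame() picks it up without a separate rename:
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
new_df = (
    df.groupby(["Name"])
      .apply(lambda x: x.drop(columns="Name").to_json(orient="records"),
             meta=("courses_json", "O"))   # name the resulting Series up front
      .to_frame()                          # column is now "courses_json"
      .reset_index()                       # bring "Name" back as a column
)
new_df.compute()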

The code I give below gives me an error. I used a Jupyter notebook and wrote code to drop table columns and rows, but the code gives me an error.

its my code:
import pandas as pd area=pd.Series({"Peshawar": 123456,"Karak":4784832,"Kohat":843932,"Mansehra":748392})
pop=pd.Series({"Peshawar": 123456,"Karak":4784832,"Kohat":849392,"Mansehra":743392})
data=pd.DataFrame({"area":area,"pop":pop}) data
data.drop[:"peshawar",:"pop"]
code error:
NameError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_9516/3219075350.py in <module>
----> 1 data.drop[:"peshawar",:"pop"]
NameError: name 'data' is not defined
Formatting is important in Python: respect newlines!
import pandas as pd
area=pd.Series({"Peshawar": 123456,"Karak":4784832,"Kohat":843932,"Mansehra":748392})
pop=pd.Series({"Peshawar": 123456,"Karak":4784832,"Kohat":849392,"Mansehra":743392})
data=pd.DataFrame({"area":area,"pop":pop})
#data.drop[:"peshawar",:"pop"] # this is not valid in pandas
Proper code to create your data:
area=pd.Series({"Peshawar":123456,"Karak":4784832,"Kohat":843932,"Mansehra":748392})
pop=pd.Series({"Peshawar":123456,"Karak":4784832,"Kohat":849392,"Mansehra":743392})
data=pd.DataFrame({"area":area,"pop":pop})
Remove a row:
>>> data.drop('Peshawar')
area pop
Karak 4784832 4784832
Kohat 843932 849392
Mansehra 748392 743392
Remove a column:
>>> data.drop('pop', axis=1)
area
Peshawar 123456
Karak 4784832
Kohat 843932
Mansehra 748392
Remove a column and a row:
>>> data.drop('pop', axis=1).drop('Peshawar', axis=0) # axis=0 is the default
area
Karak 4784832
Kohat 843932
Mansehra 748392
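For what it's worth, newer pandas versions also accept index and columns keywords, so the same result can be had in a single call (a small sketch on the same data):
>>> data.drop(index='Peshawar', columns='pop')
area
Karak 4784832
Kohat 843932
Mansehra 748392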
Removing a single cell is not possible, but you can set it to NaN:
>>> data.loc['Peshawar', 'pop'] = float('nan')
>>> data
area pop
Peshawar 123456 NaN
Karak 4784832 4784832.0
Kohat 843932 849392.0
Mansehra 748392 743392.0

Pandas read csv using column names included in a list

I'm quite new to Pandas.
I'm trying to create a dataframe reading thousands of csv files.
The files are not all structured in the same way, but I want to extract only the columns I'm interested in, so I created a list which includes all the column names I want. However, I then get an error because not all of them are included in each dataset.
import pandas as pd
import numpy as np
import os
import glob
# select the csv folder
csv_folder= r'myPath'
# select all csv files within the folder
all_files = glob.glob(csv_folder + "/*.csv")
# Set the column names to include in the dataframe
columns_to_use = ['Name1', 'Name2', 'Name3', 'Name4', 'Name5', 'Name6']
# read all the csv files one by one
for filename in all_files:
    df = pd.read_csv(filename,
                     header=0,
                     usecols=columns_to_use)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-0d9670495660> in <module>
1 for filename in all_files:
----> 2 df = pd.read_csv(filename,
3 header=0,
4 usecols = columns_to_use)
5
ValueError: Usecols do not match columns, columns expected but not found: ['Name1', 'Name2', 'Name4']
How could I handle this issue, including a column only if it is present in the list?
Use a callable for usecols, i.e. df = pd.read_csv(filename, header=0, usecols=lambda c: c in columns_to_use). From the docs of the usecols parameter:
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True.
Working example that will only read col1 and not throw an error on missing col3:
import pandas as pd
import io
s = """col1,col2
1,2"""
df = pd.read_csv(io.StringIO(s), usecols=lambda c: c in ['col1', 'col3'])
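To apply this to the original loop over many files, one sketch (assuming all_files and columns_to_use from the question) collects each frame and concatenates them at the end instead of overwriting df on every iteration:
frames = []
for filename in all_files:
    # keep only the columns that are both in this file and in columns_to_use
    frames.append(pd.read_csv(filename,
                              header=0,
                              usecols=lambda c: c in columns_to_use))
df = pd.concat(frames, ignore_index=True, sort=False)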

Pandas timestamp on array

Pandas does not convert my array into an array of Timestamps:
a = np.array([1457392827660434006, 1457392828660434012, 1457392829660434023,1457474706167386148])
pd.Timestamp(a)
gives an error:
TypeError Traceback (most recent call last)
<ipython-input-42-cdf0e494942d> in <module>()
1 a = np.array([1457392827660434006, 1457392828660434012, 1457392829660434023,1457474706167386148])
----> 2 pd.Timestamp(a)
pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:8967)()
pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:23508)()
TypeError: Cannot convert input to Timestamp
Whereas looping over the array elements works just fine:
for i in range(4):
t = pd.Timestamp(a[i])
print t
gives:
2016-03-07 23:20:27.660434006
2016-03-07 23:20:28.660434012
2016-03-07 23:20:29.660434023
2016-03-08 22:05:06.167386148
As expected.
Moreover, when that array is the first column in a CSV file, it does not get parsed to a Timestamp automatically, even if I specify parse_dates correctly.
Any help please?
I think you can use to_datetime, and then .values if you need the array values:
import pandas as pd
import numpy as np
a = np.array([1457392827660434006, 1457392828660434012,
              1457392829660434023, 1457474706167386148])
print pd.to_datetime(a).values
['2016-03-08T00:20:27.660434006+0100' '2016-03-08T00:20:28.660434012+0100'
'2016-03-08T00:20:29.660434023+0100' '2016-03-08T23:05:06.167386148+0100']
print pd.to_datetime(a, unit='ns').values
['2016-03-08T00:20:27.660434006+0100' '2016-03-08T00:20:28.660434012+0100'
'2016-03-08T00:20:29.660434023+0100' '2016-03-08T23:05:06.167386148+0100']
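For the CSV part of the question: parse_dates will not convert raw nanosecond integers, so one hedged approach (the file name and column name below are hypothetical) is to read the column as plain integers and convert it afterwards with to_datetime and unit='ns':
df = pd.read_csv('data.csv')                       # hypothetical file name
df['ts'] = pd.to_datetime(df['ts'], unit='ns')     # hypothetical column name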