Pandas timestamp on array - pandas

Pandas does not convert my array into an array of Timestamps:
a = np.array([1457392827660434006, 1457392828660434012, 1457392829660434023,1457474706167386148])
pd.Timestamp(a)
gives an error:
TypeError Traceback (most recent call last)
<ipython-input-42-cdf0e494942d> in <module>()
1 a = np.array([1457392827660434006, 1457392828660434012, 1457392829660434023,1457474706167386148])
----> 2 pd.Timestamp(a)
pandas/tslib.pyx in pandas.tslib.Timestamp.__new__ (pandas/tslib.c:8967)()
pandas/tslib.pyx in pandas.tslib.convert_to_tsobject (pandas/tslib.c:23508)()
TypeError: Cannot convert input to Timestamp
Whereas looping over the array elements works just fine:
for i in range(4):
    t = pd.Timestamp(a[i])
    print(t)
gives:
2016-03-07 23:20:27.660434006
2016-03-07 23:20:28.660434012
2016-03-07 23:20:29.660434023
2016-03-08 22:05:06.167386148
As expected.
Moreover, when that array is the first column in a CSV file, it does not get parsed to a Timestamp automatically, even if I specify parse_dates correctly.
Any help please?

I think you can use to_datetime, and then .values if you need the array values:
import pandas as pd
import numpy as np

a = np.array([1457392827660434006, 1457392828660434012,
              1457392829660434023, 1457474706167386148])

print(pd.to_datetime(a).values)
['2016-03-08T00:20:27.660434006+0100' '2016-03-08T00:20:28.660434012+0100'
 '2016-03-08T00:20:29.660434023+0100' '2016-03-08T23:05:06.167386148+0100']

print(pd.to_datetime(a, unit='ns').values)
['2016-03-08T00:20:27.660434006+0100' '2016-03-08T00:20:28.660434012+0100'
 '2016-03-08T00:20:29.660434023+0100' '2016-03-08T23:05:06.167386148+0100']
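The reason pd.Timestamp fails is that it only accepts a scalar; pd.to_datetime is the vectorized entry point. For the CSV part of the question: parse_dates does not interpret raw epoch integers, so a minimal sketch of a workaround (the file name 'data.csv' and column name 'ts' are hypothetical) is to read the column as int64 and convert it afterwards:
import pandas as pd

df = pd.read_csv('data.csv')
# convert epoch nanoseconds to Timestamps explicitly
df['ts'] = pd.to_datetime(df['ts'], unit='ns')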

Related

Pyspark pandas TypeError when try to concatenate two dataframes

I got the below error while trying to concatenate two pandas dataframes:
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
At first I thought it happened because one of the dataframes contains lists in some columns. So I tried to concatenate two dataframes whose columns do not contain lists, but I got the same error. I printed the types of the dataframes to be sure: both of them are pandas.core.frame.DataFrame. Why do I get this error even though neither of them is a list?
import pyspark.pandas as ps
split_col = split_col.toPandas()
split_col2 = split_col2.toPandas()
dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_1455538/463168233.py in <module>
2 split_col = split_col.toPandas()
3 split_col2 = split_col2.toPandas()
----> 4 dfNew = ps.concat([split_col,split_col2],axis=1,ignore_index=True)
/home/anaconda3/envs/virtenv/lib/python3.10/site-packages/pyspark/pandas/namespace.py in concat(objs, axis, join, ignore_index, sort)
2464 for obj in objs:
2465 if not isinstance(obj, (Series, DataFrame)):
-> 2466 raise TypeError(
2467 "cannot concatenate object of type "
2468 "'{name}"
TypeError: cannot concatenate object of type 'list; only ps.Series and ps.DataFrame are valid
type(split_col)
pandas.core.frame.DataFrame
type(split_col2)
pandas.core.frame.DataFrame
I want to concatenate 2 dataframes but I am stuck. Do you have any suggestions?
You're getting this error because you're trying to concatenate two plain pandas DataFrames using the pandas API on Spark (pyspark.pandas).
Instead of converting your pyspark dataframes to pandas dataframes using the toPandas() method, try the following:
split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()
More documentation on this method.
https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.DataFrame.to_pandas_on_spark.html
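A minimal sketch of the whole flow, assuming split_col and split_col2 start out as pyspark.sql.DataFrames:
import pyspark.pandas as ps

# keep the frames in the pandas-on-Spark world instead of plain pandas
split_col = split_col.to_pandas_on_spark()
split_col2 = split_col2.to_pandas_on_spark()

# ps.concat accepts ps.Series and ps.DataFrame, so this no longer raises
dfNew = ps.concat([split_col, split_col2], axis=1, ignore_index=True)
Alternatively, if you already hold plain pandas DataFrames, ps.from_pandas converts them into pandas-on-Spark DataFrames that ps.concat accepts.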

Convert pandas to dask code and it errors out

I have pandas code which works perfectly.
import pandas as pd
courses_df = pd.DataFrame(
[
["Jay", "MS"],
["Jay", "Music"],
["Dorsey", "Music"],
["Dorsey", "Piano"],
["Mark", "MS"],
],
columns=["Name", "Course"],
)
pandas_df_json = (
courses_df.groupby(["Name"])
.apply(lambda x: x.drop(columns="Name").to_json(orient="records"))
.reset_index(name="courses_json")
)
But when I convert the dataframe to Dask and try the same operation:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
name="courses_json"
).compute()
And the error I get is:
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(lambda x: x.to_json(orient="records")).reset_index(
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [37], in <module>
1 from dask import dataframe as dd
3 df = dd.from_pandas(courses_df, npartitions=2)
----> 4 df.groupby(["Name"]).apply(lambda x: x.drop(columns="Name").to_json(orient="records")).reset_index(
5 name="courses_json"
6 ).compute()
TypeError: _Frame.reset_index() got an unexpected keyword argument 'name'
My expected output from Dask should be the same as from pandas, that is:
Name courses_json
0 Dorsey [{"Course":"Music"},{"Course":"Piano"}]
1 Jay [{"Course":"MS"},{"Course":"Music"}]
2 Mark [{"Course":"MS"}]
How do I achieve this in Dask?
My try so far:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
df.groupby(["Name"]).apply(
lambda x: x.drop(columns="Name").to_json(orient="records")
).compute()
UserWarning: `meta` is not specified, inferred from partial data. Please provide `meta` if the result is unexpected.
Before: .apply(func)
After: .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
or: .apply(func, meta=('x', 'f8')) for series result
df.groupby(["Name"]).apply(
Out[57]:
Name
Dorsey [{"Course":"Piano"},{"Course":"Music"}]
Jay [{"Course":"MS"},{"Course":"Music"}]
Mark [{"Course":"MS"}]
dtype: object
I want to pass in a meta argument and also want the second column to have a meaningful name like courses_json.
For the meta warning, Dask is expecting you to specify the column datatypes for the result. It's optional, but if you do not specify this it's entirely possible that Dask may infer faulty datatypes. One partition could for example be inferred as an int type and another as a float. This is particularly the case for sparse datasets. See the docs page for more details:
https://docs.dask.org/en/stable/generated/dask.dataframe.DataFrame.apply.html
This should solve the warning:
from dask import dataframe as dd
df = dd.from_pandas(courses_df, npartitions=2)
new_df = df.groupby(["Name"]).apply(
lambda x: x.drop(columns="Name").to_json(orient="records"),
meta=("Name", "O")
).to_frame()
# rename columns
new_df.columns = ["courses_json"]
# use numeric int index instead of name as in the given example
new_df = new_df.reset_index()
new_df.compute()
The result of your computation is a Dask Series, not a DataFrame, which is why meta takes a (name, dtype) tuple with a NumPy-style dtype here (https://www.w3schools.com/python/numpy/numpy_data_types.asp). A Series consists of an index and values, so you cannot directly name the second column without converting it back to a dataframe using the .to_frame() method.
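A slightly shorter variant is a sketch that puts the desired column name directly into the meta tuple (this assumes the Series name from meta is preserved through reset_index):
from dask import dataframe as dd

df = dd.from_pandas(courses_df, npartitions=2)
result = df.groupby(["Name"]).apply(
    lambda x: x.drop(columns="Name").to_json(orient="records"),
    meta=("courses_json", "O"),  # names the resulting Series directly
).reset_index()  # the "Name" index becomes a regular column
result.compute()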

'pandas' has no attribute 'to_float'

I'm testing my data using the SVM classifier. My dataset is in the form of text, and I'm trying to transform it into floats.
I have data that may look like this:
[dataset screenshot]
Transform as float
df.columns = df('columns').str.rstrip('%').astype('float') / 100.0
TypeError Traceback (most recent call last)
<ipython-input-66-74921537411d> in <module>
1 # Transform as float
----> 2 df.columns = df('columns').str.rstrip('%').astype('float') / 100.0
3
TypeError: 'DataFrame' object is not callable
Basically, arbitrary text cannot be converted to float. In your dataset it seems that all the columns have text values, and it is not clear whether the values become numbers after rstrip('%') (the values are too long, so they are truncated in the image).
If the values of a column do become numbers after rstrip('%'), then you can convert them. In addition, you are using (), not [], to index the dataframe; because you wrote df(...), it looks like a function call. If the values of the column are numbers, you can do what you want as follows:
df['columns'] = df['columns'].str.rstrip('%').astype('float') / 100.0
Here is a full code sample:
import pandas as pd
df = pd.DataFrame({
'column_name': ['111%', '222%'],
})
# df looks like:
#   column_name
# 0        111%
# 1        222%
df['column_name'] = df['column_name'].str.rstrip('%').astype('float') / 100.0
print(df)
#    column_name
# 0         1.11
# 1         2.22
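If some cells are not clean numbers even after stripping the %, a hedged variant is pd.to_numeric with errors='coerce', which turns unparseable values into NaN instead of raising:
df['column_name'] = pd.to_numeric(df['column_name'].str.rstrip('%'), errors='coerce') / 100.0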

Pandas Dataframe

import numpy as np
import pandas as pd
from pandas_datareader import data as wb
from yahoofinancials import YahooFinancials
sympol = [input()]
abc = YahooFinancials(sympol)
l=abc.get_financial_stmts('annual', 'cash')
df=pd.concat([pd.DataFrame(key) for key in l['cashflowStatementHistory']['FB']],axis=1,sort=True).reset_index().rename(columns={'index':'Time'})
Thank you, it works, but only with a specific symbol. How do I make it customisable so that I can enter any symbol instead? What should I write instead of 'FB'? I tried [[sympol]] but it gives me an error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
7 l=abc.get_financial_stmts('annual', 'cash')
----> 8 df=pd.concat([pd.DataFrame(key) for key in l['cashflowStatementHistory'][[sympol]]],axis=1,sort=True).reset_index().rename(columns={'index':'Time'})
TypeError: unhashable type: 'list'
What do you think the problem is, and how can I fix it?
Thank you for your help.
I think the simplest approach is a list comprehension that transposes each DataFrame with DataFrame.T, followed by concat.
If you need to work with time series later, it is also best to create a DatetimeIndex:
df = pd.concat([pd.DataFrame(x).T for x in l['cashflowStatementHistory']['FB']], sort=True)
df.index = pd.to_datetime(df.index)
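As for what to write instead of 'FB': sympol = [input()] makes sympol a list, and a list is not a hashable dict key, which is exactly what the unhashable type: 'list' error means. A sketch of the fix, assuming get_financial_stmts keys its result by the ticker string:
ticker = sympol[0]  # the plain string the user typed, e.g. 'FB'
df = pd.concat([pd.DataFrame(x).T for x in l['cashflowStatementHistory'][ticker]], sort=True)
df.index = pd.to_datetime(df.index)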

convert data that is in the form of object in a csv to a pivot

I have a file that is not beautiful and searchable, so I downloaded it in CSV format. It contains 4 columns and 116424 rows.
I'm not able to plot three of its columns, namely Year, Age and Ratio, onto a heat map.
The link for the csv file is: https://gist.github.com/JustGlowing/1f3d7ff0bba7f79651b00f754dc85bf1
import numpy as np
import pandas as pd
from pandas import DataFrame
from numpy.random import randn
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('new_file.csv')
print(df.info())
print(df.shape)
couple_columns = df[['Year','Age','Ratio']]
print(couple_columns.head())
Output:
C:\Users\Pranav\AppData\Local\Programs\Python\Python36-32\python.exe C:/Users/Pranav/PycharmProjects/takenmind/Data_Visualization/a1.py
RangeIndex: 116424 entries, 0 to 116423
Data columns (total 4 columns):
AREA     116424 non-null object
YEAR     116424 non-null int64
AGE      116424 non-null object
RATIO    116424 non-null object
dtypes: int64(1), object(3)
memory usage: 2.2+ MB
None
(116424, 4)
Error:
Traceback (most recent call last):
  File "C:/Users/Pranav/PycharmProjects/takenmind/Data_Visualization/a1.py", line 12, in <module>
    couple_columns = df[['Year','Age','Ratio']]
  File "C:\Users\Pranav\AppData\Roaming\Python\Python36\site-packages\pandas\core\frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "C:\Users\Pranav\AppData\Roaming\Python\Python36\site-packages\pandas\core\frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "C:\Users\Pranav\AppData\Roaming\Python\Python36\site-packages\pandas\core\indexing.py", line 1327, in _convert_to_indexer
    .format(mask=objarr[mask]))
KeyError: "['Year' 'Age' 'Ratio'] not in index"
From the info output (YEAR 116424 non-null int64) it seems that your column names are uppercase. You should be able to get e.g. the year column with df[['YEAR']].
If you would rather use lowercase, you can use
df = pd.read_csv('new_file.csv').rename(columns=str.lower)
The csv has some text in the top 8 lines before your actual data begins. You can skip those by using the skiprows argument
df = pd.read_csv('f2m_ratios.csv', skiprows=8)
Let's say you want to plot the heatmap for only one Area:
df = df[df['Area'] == 'Afghanistan']
Before you plot a heatmap, you need the data in a certain format (a pivot table):
df = df.pivot(index='Year', columns='Age', values='Ratio')
Now your dataframe is ready for a heatmap
sns.heatmap(df)
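Putting the steps together, a hedged end-to-end sketch (the column names 'Area', 'Year', 'Age' and 'Ratio' after skiprows=8 are assumptions based on the answer above):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('f2m_ratios.csv', skiprows=8)
df = df[df['Area'] == 'Afghanistan']  # one area at a time
pivoted = df.pivot(index='Year', columns='Age', values='Ratio')
sns.heatmap(pivoted)
plt.show()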