pandas fillna datetime column with timezone now

I have a pandas datetime column with None values which I would like to fill with datetime.now() in a specific timezone.
This is my MWE dataframe:
df = pd.DataFrame([
    {'end': "2017-07-01 12:00:00"},
    {'end': "2017-07-02 18:13:00"},
    {'end': None},
    {'end': "2017-07-04 10:45:00"}
])
If I fill with fillna:
pd.to_datetime(df['end']).fillna(datetime.now())
The result is a series with expected dtype: datetime64[ns]. But when I specify the timezone, for example:
pd.to_datetime(df['end']).fillna(
    datetime.now(pytz.timezone('US/Pacific')))
This returns a series with dtype: object

It seems you need to convert the date with to_datetime inside fillna:
df['end'] = pd.to_datetime(df['end'])
df['end'] = df['end'].fillna(pd.to_datetime(datetime.now(pytz.timezone('US/Pacific'))))
print (df)
end
0 2017-07-01 12:00:00
1 2017-07-02 18:13:00
2 2017-07-04 03:35:08.499418-07:00
3 2017-07-04 10:45:00
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
But the dtype is still not datetime64:
print (df['end'].dtype)
object
I think the solution is to pass the parameter utc to to_datetime:
utc : boolean, default None
Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).
df['end'] = df['end'].fillna(datetime.now(pytz.timezone('US/Pacific')))
df['end'] = pd.to_datetime(df['end'], utc=True)
#print (df)
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
print (df['end'].dtypes)
datetime64[ns]
And the final solution, from the OP's comment:
df['end'] = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
df['end'] = df['end'].fillna(datetime.now(pytz.timezone('US/Pacific')))
print (df.end.dtype)
datetime64[ns, US/Pacific]
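Putting it together, here is a minimal, self-contained sketch of that final approach. It uses pd.Timestamp.now(tz=...) in place of the removed pd.datetime alias and assumes the MWE frame from the question:
import pandas as pd

df = pd.DataFrame([
    {'end': "2017-07-01 12:00:00"},
    {'end': "2017-07-02 18:13:00"},
    {'end': None},
    {'end': "2017-07-04 10:45:00"}
])

# Localize the naive timestamps first, then fill the gap with a tz-aware "now".
end = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
df['end'] = end.fillna(pd.Timestamp.now(tz='US/Pacific'))

print(df['end'].dtype)
# datetime64[ns, US/Pacific]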

Related

Pandas DataFrame read_json for list values

I have a file with record json strings like:
{"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
{"foo": [0.0621534586, 0.0509529933, 0.122285351]}
{"foo": [0.0169468746, 0.00475309044, 0.0085169]}
When I call read_json on this file I get a dataframe where the column foo is an object. Calling .to_numpy() on this dataframe gives me a numpy array in the form of:
array([list([-0.050888903400000005, -0.00733460533, -0.0595958121]),
list([0.10726073400000001, -0.0247702841, -0.0298063811]), ...,
list([-0.10156482500000001, -0.0402663834, -0.0609775148])],
dtype=object)
I want to parse the values of foo as numpy arrays instead of lists. Does anyone have any ideas?
The easiest way is to create your DataFrame using .from_dict().
See a minimal example with one of your dicts.
d = {"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
df = pd.DataFrame.from_dict(d)
>>> df
foo
0 -0.048201
1 0.041648
2 -0.049558
>>> df.dtypes
foo float64
dtype: object
How about doing:
df['foo'] = df['foo'].apply(np.array)
df
foo
0 [-0.0482006893, 0.0416476727, -0.0495583452]
1 [0.0621534586, 0.0509529933, 0.12228535100000001]
2 [0.0169468746, 0.00475309044, 0.00851689999999...
This shows that these have been converted to numpy.ndarray instances:
df['foo'].apply(type)
0 <class 'numpy.ndarray'>
1 <class 'numpy.ndarray'>
2 <class 'numpy.ndarray'>
Name: foo, dtype: object
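If the goal is one 2-D array rather than a column of per-row lists, a small sketch (assuming the file is newline-delimited JSON as shown, so it can be read with lines=True) could be:
import io
import numpy as np
import pandas as pd

# Stand-in for the actual file, using two of the records from the question.
raw = io.StringIO(
    '{"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}\n'
    '{"foo": [0.0621534586, 0.0509529933, 0.122285351]}\n'
)
df = pd.read_json(raw, lines=True)

# Convert each list to an ndarray, then stack them into a single (n, 3) matrix.
df['foo'] = df['foo'].apply(np.array)
matrix = np.vstack(df['foo'].to_numpy())
print(matrix.shape)   # (2, 3)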

How to convert pandas float64 type to NUMERIC Bigquery type?

I have a pandas DataFrame df:
<bound method NDFrame.head of DAT_RUN DAT_FORECAST LIB_SOURCE MES_LONGITUDE MES_LATITUDE MES_TEMPERATURE MES_HUMIDITE MES_PLUIE MES_VITESSE_VENT MES_U_WIND MES_V_WIND
0 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 3.75 11.994824 72.0 0.0 2.653137 -2.402910 -1.124792
1 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 4.00 13.094824 74.3 0.0 2.976434 -2.972910 -0.144792
2 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 4.25 12.594824 75.3 0.0 3.128418 -2.702910 1.575208
3 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 4.50 12.094824 75.5 0.0 3.183418 -2.342910 2.155208
I convert the DAT_RUN and DAT_FORECAST columns to datetime format:
df["DAT_RUN"] = pd.to_datetime(df['DAT_RUN'], format="%Y-%m-%dT%H:%M:%SZ") # previously "%Y-%m-%d %H:%M:%S"
df["DAT_FORECAST"] = pd.to_datetime(df['DAT_FORECAST'], format="%Y-%m-%dT%H:%M:%SZ")
df.dtypes:
DAT_RUN datetime64[ns]
DAT_FORECAST datetime64[ns]
LIB_SOURCE object
MES_LONGITUDE float64
MES_LATITUDE float64
MES_TEMPERATURE float64
MES_HUMIDITE float64
MES_PLUIE float64
MES_VITESSE_VENT float64
MES_U_WIND float64
MES_V_WIND float64
I use the bigquery.Client().load_table_from_dataframe() function to insert data into a BigQuery table whose numeric columns have the NUMERIC BigQuery type.
It returns this error:
pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16)
I tried to fix it with:
df["MES_LONGITUDE"] = df["MES_LONGITUDE"].astype(str).map(decimal.Decimal)
But that did not help.
Thanks.
I managed to work around this issue with a decimal.Context, hope it helps:
import decimal
import numpy as np
import pandas as pd
from google.cloud import bigquery
df = pd.DataFrame(
    data={
        "MES_HUMIDITE": np.array([2.653137, 2.976434, 3.128418, 3.183418]),
        "MES_PLUIE": np.array([-2.402910, -2.972910, -2.702910, -2.342910]),
    },
    dtype="float",
)
We check data type declaration:
df.dtypes
# MES_HUMIDITE float64
# MES_PLUIE float64
# dtype: object
Initialize a Context with 7 digits, because that is the precision in those columns; you can create multiple Context objects if you need different precision values for each column:
context = decimal.Context(prec=7)
df["MES_HUMIDITE"] = df["MES_HUMIDITE"].apply(context.create_decimal_from_float)
df["MES_PLUIE"] = df["MES_PLUIE"].apply(context.create_decimal_from_float)
Now, each item is a Decimal object:
df["MES_HUMIDITE"][0]
# Decimal('2.653137')
The types have changed, and pandas stores the Decimal values as objects, since Decimal is not a native pandas data type:
df.dtypes
# MES_HUMIDITE object
# MES_PLUIE object
# dtype: object
table_id = "test_dataset.test"
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("MES_HUMIDITE", "NUMERIC"),
        bigquery.SchemaField("MES_PLUIE", "NUMERIC"),
    ],
    write_disposition="WRITE_TRUNCATE",
)
client = bigquery.Client.from_service_account_json("/path_to_key.json")
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()
However, decimal types are generally recommended for financial calculations and, although I do not know your exact case and usage, you are probably safe using FLOAT64, at least for latitude and longitude.
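If many columns need the same treatment, a small helper along these lines can avoid repeating the apply call per column (floats_to_decimal is a hypothetical name, not part of pandas or google-cloud-bigquery):
import decimal
import pandas as pd

def floats_to_decimal(frame: pd.DataFrame, prec: int = 7) -> pd.DataFrame:
    # Hypothetical helper: convert every float64 column to Decimal values
    # with the given precision, leaving the other columns untouched.
    ctx = decimal.Context(prec=prec)
    out = frame.copy()
    for col in out.select_dtypes(include="float64").columns:
        out[col] = out[col].apply(ctx.create_decimal_from_float)
    return out

df = floats_to_decimal(df, prec=7)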

Combine two series as new one in dataframe?

type(x)
<class 'pandas.core.frame.DataFrame'>
x.shape
(18, 12)
To reference the first row and columns 3:5 with this expression:
type(x.iloc[0,3:5])
<class 'pandas.core.series.Series'>
x.iloc[0,3:5]
total_operating_revenue NaN
net_profit 3.43019e+07
Name: 2001-12-31, dtype: object
To reference the first row and columns 8:10 with this expression:
type(x.iloc[0,8:10])
<class 'pandas.core.series.Series'>
x.iloc[0,8:10]
total_operating_revenue_parent 5.05e+8
net_profit_parent 4.4e+07
Name: 2001-12-31, dtype: object
I want to get the combined new series (call it y) as follows:
type(y)
<class 'pandas.core.series.Series'>
y.shape
(4,)
y contains:
total_operating_revenue NaN
net_profit 3.43019e+07
total_operating_revenue_parent 5.05e+8
net_profit_parent 4.4e+07
Name: 2001-12-31, dtype: object
My failed attempts:
x.iloc[0,[3:5,8:10]]
x.iloc[0,3:5].combine(x.iloc[0,8:10])
pd.concat([x.iloc[0,3:5], x.iloc[0,8:10]], axis=1) is not what I expect; it is totally different from y.
z = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]],axis=1)
type(z)
<class 'pandas.core.frame.DataFrame'>
z.shape
(4, 2)
My mistake: I previously suggested you concat along the columns.
Instead you should concat along the rows:
y = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]])
Example:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randint(0, 100, size=(18, 12)),
                 columns=list('ABCDEFGHIJKL'))
And then:
In [392]: y = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]])
In [393]: y.shape
Out[393]: (4,)
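As an alternative sketch, assuming the same x, np.r_ can build the combined positional index so a single iloc call returns the 4-element Series directly:
import numpy as np

y = x.iloc[0, np.r_[3:5, 8:10]]   # positions [3, 4, 8, 9]
print(y.shape)   # (4,)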

Return count for specific value in pandas .value_counts()?

Assume running pandas' dataframe['prod_code'].value_counts() and storing result as 'df'. The operation outputs:
125011 90300
762 72816
None 55512
7156 14892
75162 8825
How would I extract the count for None? I'd expect the result to be 55512.
I've tried
>>> df.loc[df.index.isin(['None'])]
>>> Series([], Name: prod_code, dtype: int64)
and also
>>> df.loc['None']
>>> KeyError: 'the label [None] is not in the [index]'
It seems you need None, not string 'None':
df.loc[df.index.isin([None])]
df.loc[None]
EDIT:
If you need to check where NaN is in the index:
print (s1.loc[np.nan])
#or
print (df[pd.isnull(df.index)])
Sample:
s = pd.Series(['90300', '90300', '8825', '8825', '8825', None, np.nan])
s1 = s.value_counts(dropna=False)
print (s1)
8825 3
90300 2
NaN 2
dtype: int64
print (s1[pd.isnull(s1.index)])
NaN 2
dtype: int64
print (s1.loc[np.nan])
2
print (s1.loc[None])
2
EDIT1:
For stripping whitespace:
s = pd.Series(['90300', '90300', '8825', '8825', '8825', 'None ', np.nan])
print (s)
0 90300
1 90300
2 8825
3 8825
4 8825
5 None
6 NaN
dtype: object
s1 = s.value_counts()
print (s1)
8825 3
90300 2
None 1
dtype: int64
s1.index = s1.index.str.strip()
print (s1.loc['None'])
1
Couple of things
pd.Series([None] * 2 + [1] * 3).value_counts() automatically drops the None.
pd.Series([None] * 2 + [1] * 3).value_counts(dropna=False) converts the None to np.NaN
That tells me that your None is a string. But since df.loc['None'] didn't work, I suspect your string has white space around it.
Try:
df.filter(regex='None', axis=0)
Or:
df.index = df.index.to_series().str.strip().combine_first(df.index.to_series())
df.loc['None']
All that said, I was curious how to reference np.NaN in the index
s = pd.Series([1, 2], [0, np.nan])
s.iloc[s.index.get_loc(np.nan)]
2
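As a small, self-contained sketch of the NaN lookup (reusing the sample series from above), filtering the index with isna avoids caring whether the missing marker is None or np.nan:
import numpy as np
import pandas as pd

s = pd.Series(['90300', '90300', '8825', '8825', '8825', None, np.nan])
counts = s.value_counts(dropna=False)

# With dropna=False all missing values are grouped under a single NaN entry;
# select it by testing the index for nulls rather than by label.
nan_count = counts[counts.index.isna()].iloc[0]
print(nan_count)   # 2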

pandas read_csv convert object to float

I'm trying to read a CSV file. In one column (hpi), which should be float32, there are two records populated with a '.' to indicate missing values. pandas interprets the '.' as a character.
How do I force this column to be numeric?
data = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
                   header=0,
                   names=["state", "year", "qtr", "hpi"])
#                  , converters={'hpi': float})
#print(data.head())
#print(data.dtypes)
print(data[data.hpi == '.'])
Use the na_values parameter in read_csv:
df = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
                 header=0,
                 names=["state", "year", "qtr", "hpi"],
                 na_values='.')
df.dtypes
Out:
state object
year int64
qtr int64
hpi float64
dtype: object
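If float32 specifically is wanted, as the question mentions, a sketch combining na_values with an explicit dtype (same URL and column names as above) could be:
df = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
                 header=0,
                 names=["state", "year", "qtr", "hpi"],
                 na_values='.',
                 dtype={"hpi": "float32"})

print(df["hpi"].dtype)   # float32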
Apply to_numeric over the desired column (with apply):
data.loc[data.hpi == '.', 'hpi'] = -1.0
data[['hpi']] = data[['hpi']].apply(pd.to_numeric)
For example:
In[69]: data = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
                           header=0,
                           names=["state", "year", "qtr", "hpi"])
In[70]: data[['hpi']].dtypes
Out[70]:
hpi object
dtype: object
In[74]: data.loc[data.hpi == '.'] = -1.0
In[75]: data[['hpi']] = data[['hpi']].apply(pd.to_numeric)
In[77]: data[['hpi']].dtypes
Out[77]:
hpi float64
dtype: object
EDIT:
For some reason it changes all the columns to float64. This is a small workaround that changes them back to int.
Before:
In[89]: data.dtypes
Out[89]:
state object
year float64
qtr float64
hpi float64
After:
In[90]: data[['year','qtr']] = data[['year','qtr']].astype(int)
In[91]: data.dtypes
Out[91]:
state object
year int64
qtr int64
hpi float64
dtype: object
If anyone could shed light on why that happens, that'd be great. (Most likely it is because data.loc[data.hpi == '.'] = -1.0, without the 'hpi' column label, assigns -1.0 to every column of the matching rows, which coerces the integer year and qtr columns to float64.)
You could just cast this after you read it in (with numpy imported as np), e.g.
data.loc[data.hpi == '.', 'hpi'] = np.nan
data.hpi = data.hpi.astype(np.float64)
Alternatively you can use the na_values parameter for read_csv
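A further option, sketched here on a tiny stand-in frame, is pd.to_numeric with errors='coerce', which turns the '.' records into NaN without listing them in na_values:
import pandas as pd

data = pd.DataFrame({"hpi": ["100.5", ".", "101.2"]})
data["hpi"] = pd.to_numeric(data["hpi"], errors="coerce")

print(data["hpi"].dtype)          # float64
print(data["hpi"].isna().sum())   # 1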