I'm trying to read a CSV file. One column (hpi), which should be float32, has two records populated with a "." to indicate missing values, so pandas interprets the column as character data.
How do I force this column to be numeric?
data = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
header=0,
names = ["state", "year", "qtr", "hpi"])
#,converters={'hpi': float})
#print data.head()
#print(data.dtypes)
print(data[data.hpi == '.'])
Use the na_values parameter in read_csv:
df = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
header=0,
names = ["state", "year", "qtr", "hpi"],
na_values='.')
df.dtypes
Out:
state object
year int64
qtr int64
hpi float64
dtype: object
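With na_values='.' the two affected records are read in as NaN. As a quick sanity check (a small sketch using the same df), you could list them:
df[df.hpi.isna()]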
Apply to_numeric over the desired column (with apply):
data.loc[data.hpi == '.', 'hpi'] = -1.0
data[['hpi']] = data[['hpi']].apply(pd.to_numeric)
For example:
In[69]: data = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
header=0,
names = ["state", "year", "qtr", "hpi"])
In[70]: data[['hpi']].dtypes
Out[70]:
hpi object
dtype: object
In[74]: data.loc[data.hpi == '.'] = -1.0
In[75]: data[['hpi']] = data[['hpi']].apply(pd.to_numeric)
In[77]: data[['hpi']].dtypes
Out[77]:
hpi float64
dtype: object
EDIT:
For some reason it changes all the columns to float64. This is a small workaround that changes them back to int.
Before:
In[89]: data.dtypes
Out[89]:
state object
year float64
qtr float64
hpi float64
After:
In[90]: data[['year','qtr']] = data[['year','qtr']].astype(int)
In[91]: data.dtypes
Out[91]:
state object
year int64
qtr int64
hpi float64
dtype: object
If anyone could shed light on why that happens, that'd be great.
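A likely explanation (my reading, not verified against your exact pandas version): data.loc[data.hpi == '.'] = -1.0 has no column selector, so it writes -1.0 into every column of the matched rows, and assigning a float into the int columns upcasts them to float64. Restricting the assignment to the hpi column, as in the first snippet, avoids the workaround entirely:
# assign only to the 'hpi' column so year/qtr keep their int64 dtype
data.loc[data.hpi == '.', 'hpi'] = -1.0
data['hpi'] = pd.to_numeric(data['hpi'])
data.dtypes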
You could just cast it after you read it in, e.g.:
import numpy as np

data.loc[data.hpi == '.', 'hpi'] = np.nan
data.hpi = data.hpi.astype(np.float64)
Alternatively, you can use the na_values parameter for read_csv.
I have the following df:
I want to group this df on the first column (ID) and the second column (key), and from there build a cumulative sum for each day. The cumsum should be over the last column (Speed).
I tried this with the following code:
df = pd.read_csv('df.csv')
df['Time'] = pd.to_datetime(df['Time'], format='%Y-%m-%d %H:%M:%S')
df = df.sort_values(['ID','key'])
grouped = df.groupby(['ID','key'])
test = pd.DataFrame()
test2 = pd.DataFrame()
for name, group in grouped:
test = group.groupby(pd.Grouper(key='Time', freq='1d'))['Speed'].cumsum()
test = test.reset_index()
test['ID'] = ''
test['ID'] = name[0]
test['key'] = ''
test['key'] = name[1]
test2 = test2.append(test)
But the result seems quite off: there are more than 5 rows, while I expect one row per day with the cumsum for each ID and key.
Does anyone see the reason for my problem?
Thanks in advance.
Friendly reminder: it's useful to include a runnable example.
import pandas as pd
data = [{"cid":33613,"key":14855,"ts":1550577600000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550579340000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550584800000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550682000000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550685900000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550773380000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550858400000,"value":50.0},
{"cid":33613,"key":14855,"ts":1550941200000,"value":25.0},
{"cid":33613,"key":14855,"ts":1550978400000,"value":50.0}]
df = pd.DataFrame(data)
df['ts'] = pd.to_datetime(df['ts'], unit='ms')
I believe what you need can be accomplished as follows:
df.set_index('ts').groupby(['cid', 'key'])['value'].resample('D').sum().cumsum()
Result:
cid key ts
33613 14855 2019-02-19 150.0
2019-02-20 250.0
2019-02-21 300.0
2019-02-22 350.0
2019-02-23 375.0
2019-02-24 425.0
Name: value, dtype: float64
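One caveat (an assumption on my part, since your sample has a single cid/key pair): the trailing .cumsum() above runs over the whole resampled series, so with several (cid, key) groups it would accumulate across group boundaries. A hedged variant that keeps the cumulative sum per group:
daily = df.set_index('ts').groupby(['cid', 'key'])['value'].resample('D').sum()
result = daily.groupby(level=['cid', 'key']).cumsum()
print(result)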
I have a dataframe with numerical values between 0 and 1. I am trying to create simple summary statistics (manually). When I use a boolean comparison I can get the index, but when I try to use math.isclose the function does not work and gives an error.
For example:
import pandas as pd
df1 = pd.DataFrame({'col1': [0, 0.05, 0.74, 0.76, 1],
                    'col2': [0, 0.05, 0.5, 0.75, 1],
                    'x1': [1, 2, 3, 4, 5],
                    'x2': [5, 6, 7, 8, 9]})
result75 = df1.index[round(df1['col2'],2) == 0.75].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This gives the correct result, but occasionally the result is NaN, so I tried:
result75 = df1.index[math.isclose(round(df1['col2'],2), 0.75, abs_tol = 0.011)].tolist()
value75 = df1['x2'][result75]
print(value75.mean())
This results in the following error message:
TypeError: cannot convert the series to <class 'float'>
Both are of type "bool", so I'm not sure what is going wrong here...
This works:
rows_meeting_condition = df1[(df1['col2'] > 0.74) & (df1['col2'] < 0.76)]
print(rows_meeting_condition['x2'].mean())
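The underlying issue is that math.isclose expects scalar floats, so passing a whole Series raises that TypeError. If you want the tolerance-based version, numpy's vectorized np.isclose works elementwise (a sketch, assuming the same df1):
import numpy as np

# np.isclose returns a boolean array, one entry per row of the Series
result75 = df1.index[np.isclose(df1['col2'], 0.75, atol=0.011)].tolist()
value75 = df1['x2'][result75]
print(value75.mean())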
I have a file with JSON record strings like:
{"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
{"foo": [0.0621534586, 0.0509529933, 0.122285351]}
{"foo": [0.0169468746, 0.00475309044, 0.0085169]}
When I call read_json on this file I get a dataframe where the column foo is an object. Calling .to_numpy() on this dataframe gives me a numpy array in the form of:
array([list([-0.050888903400000005, -0.00733460533, -0.0595958121]),
list([0.10726073400000001, -0.0247702841, -0.0298063811]), ...,
list([-0.10156482500000001, -0.0402663834, -0.0609775148])],
dtype=object)
I want to parse the values of foo as numpy arrays instead of lists. Anyone have any ideas?
The easiest way is to create your DataFrame using .from_dict().
See a minimal example with one of your dicts.
d = {"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
df = pd.DataFrame.from_dict(d)
>>> df
foo
0 -0.048201
1 0.041648
2 -0.049558
>>> df.dtypes
foo float64
dtype: object
How about doing:
import numpy as np

df['foo'] = df['foo'].apply(np.array)
df
foo
0 [-0.0482006893, 0.0416476727, -0.0495583452]
1 [0.0621534586, 0.0509529933, 0.12228535100000001]
2 [0.0169468746, 0.00475309044, 0.00851689999999...
This shows that these have been converted to numpy.ndarray instances:
df['foo'].apply(type)
0 <class 'numpy.ndarray'>
1 <class 'numpy.ndarray'>
2 <class 'numpy.ndarray'>
Name: foo, dtype: object
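If the end goal is a single 2-D array rather than a Series of per-row arrays (an assumption about what you're after), you could also stack the lists directly:
import numpy as np

# each foo entry has 3 values in the sample, so this yields shape (n_rows, 3)
arr = np.stack(df['foo'].to_list())
print(arr.shape, arr.dtype)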
I have a pandas dataframe df (output of df.head):
DAT_RUN DAT_FORECAST LIB_SOURCE MES_LONGITUDE MES_LATITUDE MES_TEMPERATURE MES_HUMIDITE MES_PLUIE MES_VITESSE_VENT MES_U_WIND MES_V_WIND
0 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 3.75 11.994824 72.0 0.0 2.653137 -2.402910 -1.124792
1 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 4.00 13.094824 74.3 0.0 2.976434 -2.972910 -0.144792
2 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 4.25 12.594824 75.3 0.0 3.128418 -2.702910 1.575208
3 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 4.50 12.094824 75.5 0.0 3.183418 -2.342910 2.155208
I convert the DAT_RUN and DAT_FORECAST columns to datetime format:
df["DAT_RUN"] = pd.to_datetime(df['DAT_RUN'], format="%Y-%m-%dT%H:%M:%SZ") # previously "%Y-%m-%d %H:%M:%S"
df["DAT_FORECAST"] = pd.to_datetime(df['DAT_FORECAST'], format="%Y-%m-%dT%H:%M:%SZ")
df.dtypes:
DAT_RUN datetime64[ns]
DAT_FORECAST datetime64[ns]
LIB_SOURCE object
MES_LONGITUDE float64
MES_LATITUDE float64
MES_TEMPERATURE float64
MES_HUMIDITE float64
MES_PLUIE float64
MES_VITESSE_VENT float64
MES_U_WIND float64
MES_V_WIND float64
I use the bigquery.Client().load_table_from_dataframe() function to insert the data into a BigQuery table whose numeric columns are declared as NUMERIC.
It returns this error:
pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16)
I tried to fix it with:
df["MES_LONGITUDE"] = df["MES_LONGITUDE"].astype(str).map(decimal.Decimal)
But that did not help.
Thanks.
I managed to work around this issue with a decimal.Context, hope it helps:
import decimal
import numpy as np
import pandas as pd
from google.cloud import bigquery
df = pd.DataFrame(
data={
"MES_HUMIDITE": np.array([2.653137, 2.976434, 3.128418, 3.183418]),
"MES_PLUIE": np.array([-2.402910, -2.972910, -2.702910, -2.342910]),
},
dtype="float",
)
We check the data types:
df.dtypes
# MES_HUMIDITE float64
# MES_PLUIE float64
# dtype: object
Initialize a Context with 7 digits of precision, since that is the precision of those columns; you can create multiple Contexts if you need different precision values for each column:
context = decimal.Context(prec=7)
df["MES_HUMIDITE"] = df["MES_HUMIDITE"].apply(context.create_decimal_from_float)
df["MES_PLUIE"] = df["MES_PLUIE"].apply(context.create_decimal_from_float)
Now, each item is a Decimal object:
df["MES_HUMIDITE"][0]
# Decimal('2.653137')
Types have changed, and pandas stores Decimals as objects, as I guess it is not a native data type:
df.dtypes
# MES_HUMIDITE object
# MES_PLUIE object
# dtype: object
table_id = "test_dataset.test"
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("MES_HUMIDITE", "NUMERIC"),
bigquery.SchemaField("MES_PLUIE", "NUMERIC"),
],
write_disposition="WRITE_TRUNCATE",
)
client = bigquery.Client.from_service_account_json("/path_to_key.json")
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()
However, decimal types are generally recommended for financial calculations and, although I do not know your exact case and usage, you are probably safe using FLOAT64, at least for latitude and longitude.
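If you do go the FLOAT64 route suggested above, a minimal sketch would be to skip the Decimal conversion entirely and declare the schema accordingly (field names here mirror the example, adjust to your table):
# keep the float64 columns and declare FLOAT64 in the BigQuery schema
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("MES_HUMIDITE", "FLOAT64"),
        bigquery.SchemaField("MES_PLUIE", "FLOAT64"),
    ],
    write_disposition="WRITE_TRUNCATE",
)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()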
The print of the average of the spreads comes out grouped and calculated correctly. Why do I get the following returned as the result for the std_deviation column, instead of the standard deviation of the spread grouped by ticker?
pandas.core.groupby.SeriesGroupBy object at 0x000000000484A588
df = pd.read_csv('C:\\Users\\William\\Desktop\\tickdata.csv',
dtype={'ticker': str, 'bidPrice': np.float64, 'askPrice': np.float64, 'afterHours': str},
usecols=['ticker', 'bidPrice', 'askPrice', 'afterHours'],
nrows=3000000
)
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
df['std_deviation'] = df['spread'].std(ddof=0)
df = df.groupby(['ticker'])
print(df['std_deviation'])
print(df['spread'].mean())
UPDATE: I'm no longer getting an object back, but I'm now trying to figure out how to display the standard deviation by ticker:
df['spread'] = (df.askPrice - df.bidPrice)
df2 = df.groupby(['ticker'])
print(df2['spread'].mean())
df = df.set_index('ticker')
print(df['spread'].std(ddof=0))
UPDATE 2: I got the result I needed using:
df = df[df.afterHours == "False"]
df = df[df.bidPrice != 0]
df = df[df.askPrice != 0]
df['spread'] = (df.askPrice - df.bidPrice)
print(df.groupby(['ticker'])['spread'].mean())
print(df.groupby(['ticker'])['spread'].std(ddof=0))
This line:
df = df.groupby(['ticker'])
assigns df to a DataFrameGroupBy object, and
df['std_deviation']
is a SeriesGroupBy object (of the column).
It's a good idea not to "shadow" / re-assign one variable to a completely different datatype. Try to use a different variable name for the groupby!
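For instance, a small sketch building on your UPDATE2 that keeps the grouped object in its own variable and computes both statistics at once:
# group once, then aggregate mean and population std per ticker
spread_by_ticker = df.groupby('ticker')['spread']
summary = spread_by_ticker.agg(mean='mean', std_deviation=lambda s: s.std(ddof=0))
print(summary)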