Multiple columns to int - pandas

I have the following data that I am working with:
import pandas as pd
url="https://raw.githubusercontent.com/dothemathonthatone/maps/master/population.csv"
bevdf2=pd.read_csv(url)
I would like to change multiple columns from object to integer. I have recently discovered .loc and would like to put it to use:
aus = bevdf2.iloc[:, 39:75]
bevdf2[aus] = bevdf2[aus].astype(int)
but I get this output:
Boolean array expected for the condition, not object
Is there a simple way to continue with the .loc tool to convert the multiple columns to int?

The problem is that some columns contain invalid values like - and /, so first convert them to missing values with to_numeric, and if you need integers rather than floats, use the nullable Int64 dtype (pandas 0.24+):
bevdf2.iloc[:, 39:75] = (bevdf2.iloc[:, 39:75]
                         .apply(pd.to_numeric, errors='coerce')
                         .astype('Int64'))
print (bevdf2.iloc[:, 39:75].dtypes)
deu50 Int64
aus15 Int64
aus16 Int64
aus17 Int64
aus18 Int64
aus19 Int64
aus20 Int64
aus21 Int64
aus22 Int64
aus23 Int64
aus24 Int64
aus25 Int64
aus26 Int64
aus27 Int64
aus28 Int64
aus29 Int64
aus30 Int64
aus31 Int64
aus32 Int64
aus33 Int64
aus34 Int64
aus35 Int64
aus36 Int64
aus37 Int64
aus38 Int64
aus39 Int64
aus40 Int64
aus41 Int64
aus42 Int64
aus43 Int64
aus44 Int64
aus45 Int64
aus46 Int64
aus47 Int64
aus48 Int64
aus49 Int64
dtype: object
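Since the question asks about .loc: a minimal sketch of the same conversion written with column labels instead of positions (assuming the target columns are still the positional slice 39:75):
cols = bevdf2.columns[39:75]   # labels of the columns to convert
bevdf2.loc[:, cols] = (bevdf2.loc[:, cols]
                       .apply(pd.to_numeric, errors='coerce')
                       .astype('Int64'))
Selecting labels first also sidesteps the original error: bevdf2[aus] passes a whole DataFrame as the indexer, which pandas tries to interpret as a boolean mask, hence "Boolean array expected for the condition, not object".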

Related

Pandas DataFrame read_json for list values

I have a file with JSON record strings like:
{"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
{"foo": [0.0621534586, 0.0509529933, 0.122285351]}
{"foo": [0.0169468746, 0.00475309044, 0.0085169]}
When I call read_json on this file I get a dataframe where the column foo is an object. Calling .to_numpy() on this dataframe gives me a numpy array in the form of:
array([list([-0.050888903400000005, -0.00733460533, -0.0595958121]),
list([0.10726073400000001, -0.0247702841, -0.0298063811]), ...,
list([-0.10156482500000001, -0.0402663834, -0.0609775148])],
dtype=object)
I want to parse the values of foo as numpy arrays instead of lists. Anyone have any ideas?
The easiest way is to create your DataFrame using .from_dict().
See a minimal example with one of your dicts.
d = {"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
df = pd.DataFrame.from_dict(d)
>>> df
foo
0 -0.048201
1 0.041648
2 -0.049558
>>> df.dtypes
foo float64
dtype: object
How about doing:
import numpy as np
df['foo'] = df['foo'].apply(np.array)
df
foo
0 [-0.0482006893, 0.0416476727, -0.0495583452]
1 [0.0621534586, 0.0509529933, 0.12228535100000001]
2 [0.0169468746, 0.00475309044, 0.00851689999999...
This shows that these have been converted to numpy.ndarray instances:
df['foo'].apply(type)
0 <class 'numpy.ndarray'>
1 <class 'numpy.ndarray'>
2 <class 'numpy.ndarray'>
Name: foo, dtype: object
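If the goal is one 2-D numpy array rather than a column of per-row arrays, a minimal sketch (assuming every foo list has the same length):
import numpy as np

arr = np.stack(df['foo'].to_numpy())   # shape (n_rows, 3) for the sample data
print(arr.dtype)                       # float64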

How to convert pandas float64 type to NUMERIC Bigquery type?

I have a pandas dataframe df:
<bound method NDFrame.head of DAT_RUN DAT_FORECAST LIB_SOURCE MES_LONGITUDE MES_LATITUDE MES_TEMPERATURE MES_HUMIDITE MES_PLUIE MES_VITESSE_VENT MES_U_WIND MES_V_WIND
0 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 3.75 11.994824 72.0 0.0 2.653137 -2.402910 -1.124792
1 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 4.00 13.094824 74.3 0.0 2.976434 -2.972910 -0.144792
2 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 4.25 12.594824 75.3 0.0 3.128418 -2.702910 1.575208
3 2022-03-29T00:00:00Z 2022-03-29T01:00:00Z gfs_025 43.50 4.50 12.094824 75.5 0.0 3.183418 -2.342910 2.155208
I convert the DAT_RUN and DAT_FORECAST columns to datetime format:
df["DAT_RUN"] = pd.to_datetime(df['DAT_RUN'], format="%Y-%m-%dT%H:%M:%SZ") # previously "%Y-%m-%d %H:%M:%S"
df["DAT_FORECAST"] = pd.to_datetime(df['DAT_FORECAST'], format="%Y-%m-%dT%H:%M:%SZ")
df.dtypes:
DAT_RUN datetime64[ns]
DAT_FORECAST datetime64[ns]
LIB_SOURCE object
MES_LONGITUDE float64
MES_LATITUDE float64
MES_TEMPERATURE float64
MES_HUMIDITE float64
MES_PLUIE float64
MES_VITESSE_VENT float64
MES_U_WIND float64
MES_V_WIND float64
I use the bigquery.Client().load_table_from_dataframe() function to insert the data into a BigQuery table whose numeric columns have the NUMERIC BigQuery type.
It returns this error:
pyarrow.lib.ArrowInvalid: Got bytestring of length 8 (expected 16)
I tried to fix it with:
df["MES_LONGITUDE"] = df["MES_LONGITUDE"].astype(str).map(decimal.Decimal)
But that did not help.
Thanks.
I managed to work around this issue with a decimal.Context, hope it helps:
import decimal
import numpy as np
import pandas as pd
from google.cloud import bigquery
df = pd.DataFrame(
    data={
        "MES_HUMIDITE": np.array([2.653137, 2.976434, 3.128418, 3.183418]),
        "MES_PLUIE": np.array([-2.402910, -2.972910, -2.702910, -2.342910]),
    },
    dtype="float",
)
We check the declared data types:
df.dtypes
# MES_HUMIDITE float64
# MES_PLUIE float64
# dtype: object
Initialize a Context with 7 digits, because that is the precision of those columns; you can create multiple Context objects if you need a different precision for each column:
context = decimal.Context(prec=7)
df["MES_HUMIDITE"] = df["MES_HUMIDITE"].apply(context.create_decimal_from_float)
df["MES_PLUIE"] = df["MES_PLUIE"].apply(context.create_decimal_from_float)
Now, each item is a Decimal object:
df["MES_HUMIDITE"][0]
# Decimal('2.653137')
The types have changed, and pandas stores Decimals as objects, since Decimal is not a native pandas data type:
df.dtypes
# MES_HUMIDITE object
# MES_PLUIE object
# dtype: object
table_id = "test_dataset.test"
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("MES_HUMIDITE", "NUMERIC"),
bigquery.SchemaField("MES_PLUIE", "NUMERIC"),
],
write_disposition="WRITE_TRUNCATE",
)
client = bigquery.Client.from_service_account_json("/path_to_key.json")
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()
However, decimal types are generally recommended for financial calculations and, although I do not know your exact case and usage, you are probably safe using FLOAT64, at least for latitude and longitude.
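If FLOAT64 is acceptable for your case, a minimal sketch of that alternative (reusing client, df and table_id from above, and skipping the Decimal conversion entirely; older client versions may want the legacy type name "FLOAT" instead of "FLOAT64"):
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("MES_HUMIDITE", "FLOAT64"),
        bigquery.SchemaField("MES_PLUIE", "FLOAT64"),
    ],
    write_disposition="WRITE_TRUNCATE",
)
# float64 pandas columns load into FLOAT64 columns without any conversion
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()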

Pandas equals gives False result even though it should be True

Let me generate a dataframe
df=pd.DataFrame([[1,2],[2,1]])
then I compare
df[0].equals(df[1].sort_values())
this gives False.
However, both df[0] and df[1].sort_values() give the same output:
0 1
1 2
Name: 0, dtype: int64
Why does equals give False? What is wrong?
The order of the index values is different, so if you create matching indexes, e.g. here with Series.reset_index and drop=True, it works as you expect:
a = df[0].equals(df[1].sort_values().reset_index(drop=True))
print (a)
True
Details:
print (df[0])
0 1
1 2
Name: 0, dtype: int64
print (df[1].sort_values())
1 1
0 2
Name: 1, dtype: int64
print (df[1].sort_values().reset_index(drop=True))
0 1
1 2
Name: 1, dtype: int64
You can also directly access Series values:
np.equal(df[0].values, df[1].sort_values().values)
array([ True, True])
np.equal(df[0].values, df[1].sort_values().values).all()
True
np.array_equal(df[0], df[1].sort_values())
True
As far as timing is concerned, the second and third approaches are equivalent, while df[0].equals(df[1].sort_values().reset_index(drop=True)) is about 1.5x slower.
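A minimal sketch of how that timing could be reproduced (exact numbers will vary by machine and pandas version):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 1]])
t_equals = timeit.timeit(
    lambda: df[0].equals(df[1].sort_values().reset_index(drop=True)), number=10000)
t_numpy = timeit.timeit(
    lambda: np.array_equal(df[0].values, df[1].sort_values().values), number=10000)
print(t_equals, t_numpy)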

pandas fillna datetime column with timezone now

I have a pandas datetime column with None values which I would like to fill with datetime.now() in a specific timezone.
This is my MWE dataframe:
df = pd.DataFrame([
    {'end': "2017-07-01 12:00:00"},
    {'end': "2017-07-02 18:13:00"},
    {'end': None},
    {'end': "2017-07-04 10:45:00"}
])
If I fill with fillna:
pd.to_datetime(df['end']).fillna(datetime.now())
The result is a series with the expected dtype: datetime64[ns]. But when I specify the timezone, for example:
pd.to_datetime(df['end']).fillna(
    datetime.now(pytz.timezone('US/Pacific')))
This returns a series with dtype: object
It seems you need to convert the date with to_datetime inside fillna:
df['end'] = pd.to_datetime(df['end'])
df['end'] = df['end'].fillna(pd.to_datetime(pd.datetime.now(pytz.timezone('US/Pacific'))))
print (df)
end
0 2017-07-01 12:00:00
1 2017-07-02 18:13:00
2 2017-07-04 03:35:08.499418-07:00
3 2017-07-04 10:45:00
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
But the dtype is still not datetime64:
print (df['end'].dtype)
object
I think the solution is to pass the utc parameter to to_datetime:
utc : boolean, default None
Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).
df['end'] = df['end'].fillna(pd.datetime.now(pytz.timezone('US/Pacific')))
df['end'] = pd.to_datetime(df['end'], utc=True)
#print (df)
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
print (df['end'].dtypes)
datetime64[ns]
And the final solution, from the OP's comment:
df['end'] = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
df['end'] = df['end'].fillna(pd.datetime.now(pytz.timezone('US/Pacific')))
print (df.end.dtype)
datetime64[ns, US/Pacific]
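For newer pandas versions, where pd.datetime is no longer available, a minimal sketch of the same idea using pd.Timestamp.now instead:
import pandas as pd

df = pd.DataFrame({'end': ["2017-07-01 12:00:00", "2017-07-02 18:13:00",
                           None, "2017-07-04 10:45:00"]})
df['end'] = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
df['end'] = df['end'].fillna(pd.Timestamp.now(tz='US/Pacific'))
print(df['end'].dtype)   # datetime64[ns, US/Pacific]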

pandas read_csv convert object to float

I'm trying to read a CSV file. In one column (hpi), which should be float32, there are two records populated with a . to indicate missing values. pandas interprets the . as a character.
How do I force numeric on this column?
data = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
                   header=0,
                   names = ["state", "year", "qtr", "hpi"])
#,converters={'hpi': float})
#print data.head()
#print(data.dtypes)
print(data[data.hpi == '.'])
Use the na_values parameter in read_csv:
df = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
                 header=0,
                 names = ["state", "year", "qtr", "hpi"],
                 na_values='.')
df.dtypes
Out:
state object
year int64
qtr int64
hpi float64
dtype: object
Apply to_numeric over the desired column (with apply):
data.loc[data.hpi == '.', 'hpi'] = -1.0
data[['hpi']] = data[['hpi']].apply(pd.to_numeric)
For example:
In[69]: data = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
                           header=0,
                           names = ["state", "year", "qtr", "hpi"])
In[70]: data[['hpi']].dtypes
Out[70]:
hpi object
dtype: object
In[74]: data.loc[data.hpi == '.'] = -1.0
In[75]: data[['hpi']] = data[['hpi']].apply(pd.to_numeric)
In[77]: data[['hpi']].dtypes
Out[77]:
hpi float64
dtype: object
EDIT:
For some reason it changes all the columns to float64. This is a small workaround that changes them back to int.
Before:
In[89]: data.dtypes
Out[89]:
state object
year float64
qtr float64
hpi float64
After:
In[90]: data[['year','qtr']] = data[['year','qtr']].astype(int)
In[91]: data.dtypes
Out[91]:
state object
year int64
qtr int64
hpi float64
dtype: object
If anyone could shed light on why that happens, that'd be great.
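My guess at the cause: In[74] assigns data.loc[data.hpi == '.'] = -1.0 without a column selector, so -1.0 is written into every column of the matched rows, which upcasts year and qtr to float64. Restricting the assignment to the hpi column keeps the other dtypes intact; a minimal sketch:
import pandas as pd

data = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
                   header=0, names=["state", "year", "qtr", "hpi"])
data.loc[data.hpi == '.', 'hpi'] = -1.0   # touch only the hpi column
data['hpi'] = pd.to_numeric(data['hpi'])
print(data.dtypes)                        # year and qtr stay int64, hpi is float64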
You could just cast this after you read it in. e.g.
data.loc[data.hpi == '.', 'hpi'] = pd.np.nan
data.hpi = data.hpi.astype(pd.np.float64)
Alternatively you can use the na_values parameter for read_csv
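Note that pd.np has been removed in recent pandas releases, so on current versions the same cast can be written with numpy directly:
import numpy as np

data.loc[data.hpi == '.', 'hpi'] = np.nan
data.hpi = data.hpi.astype(np.float64)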