type(x)
<class 'pandas.core.frame.DataFrame'>
x.shape
(18, 12)
To reference the first row and columns 3:5:
type(x.iloc[0,3:5])
<class 'pandas.core.series.Series'>
x.iloc[0,3:5]
total_operating_revenue NaN
net_profit 3.43019e+07
Name: 2001-12-31, dtype: object
To reference the first row and columns 8:10:
type(x.iloc[0,8:10])
<class 'pandas.core.series.Series'>
x.iloc[0,8:10]
total_operating_revenue_parent 5.05e+8
net_profit_parent 4.4e+07
Name: 2001-12-31, dtype: object
I want to get a combined new Series (call it y) as follows:
type(y)
<class 'pandas.core.series.Series'>
y.shape
(4,)
y contains:
total_operating_revenue NaN
net_profit 3.43019e+07
total_operating_revenue_parent 5.05e+8
net_profit_parent 4.4e+07
Name: 2001-12-31, dtype: object
My failed attempts:
x.iloc[0,[3:5,8:10]]  # SyntaxError: slices are not allowed inside a list literal
x.iloc[0,3:5].combine(x.iloc[0,8:10])  # TypeError: combine() also requires a func argument
pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]],axis=1) is not what I expect; it is totally different from y:
z = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]],axis=1)
type(z)
<class 'pandas.core.frame.DataFrame'>
z.shape
(4, 2)
My mistake: I previously suggested you concat along the columns. Instead, you should concat along the rows (the default, axis=0):
y = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]])
Example:
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randint(0, 100, size=(18, 12)),
                 columns=list('ABCDEFGHIJKL'))
And then:
In [392]: y = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]])
In [393]: y.shape
Out[393]: (4,)
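As an aside (not part of the original answer), the failed x.iloc[0,[3:5,8:10]] idea can be made to work by building one integer index with np.r_, which concatenates slice ranges into a single array; a minimal sketch:
y = x.iloc[0, np.r_[3:5, 8:10]]  # np.r_[3:5, 8:10] -> array([3, 4, 8, 9])
y.shape  # (4,)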
I have a file with record json strings like:
{"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
{"foo": [0.0621534586, 0.0509529933, 0.122285351]}
{"foo": [0.0169468746, 0.00475309044, 0.0085169]}
When I call read_json on this file, I get a dataframe where the column foo is an object. Calling .to_numpy() on this dataframe gives me a numpy array of the form:
array([list([-0.050888903400000005, -0.00733460533, -0.0595958121]),
list([0.10726073400000001, -0.0247702841, -0.0298063811]), ...,
list([-0.10156482500000001, -0.0402663834, -0.0609775148])],
dtype=object)
I want to parse the values of foo as numpy arrays instead of lists. Anyone have any ideas?
The easiest way is to create your DataFrame using .from_dict().
See a minimal example with one of your dicts.
d = {"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
df = pd.DataFrame.from_dict(d)
>>> df
foo
0 -0.048201
1 0.041648
2 -0.049558
>>> df.dtypes
foo float64
dtype: object
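As a follow-up (not in the original answer): since foo is now float64, .to_numpy() on the column gives a plain numeric array rather than an object array of lists:
>>> df['foo'].to_numpy().dtype
dtype('float64')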
How about doing:
df['foo'] = df['foo'].apply(np.array)
df
foo
0 [-0.0482006893, 0.0416476727, -0.0495583452]
1 [0.0621534586, 0.0509529933, 0.12228535100000001]
2 [0.0169468746, 0.00475309044, 0.00851689999999...
This shows that these have been converted to numpy.ndarray instances:
df['foo'].apply(type)
0 <class 'numpy.ndarray'>
1 <class 'numpy.ndarray'>
2 <class 'numpy.ndarray'>
Name: foo, dtype: object
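If you ultimately need one 2-D numeric matrix rather than a column of per-row arrays, a minimal sketch (assuming every row's list has the same length, as in the examples above):
mat = np.stack(df['foo'].to_numpy())
mat.shape   # (3, 3) here: one row per record, one column per list element
mat.dtype   # dtype('float64')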
I have a pandas dataframe like the one below:
Col1 Col2
0 a apple
1 a anar
2 b ball
3 b banana
I am looking to output JSON like:
{ 'a' : ['apple', 'anar'], 'b' : ['ball', 'banana'] }
Use groupby with apply, and last convert the Series to JSON with Series.to_json:
j = df.groupby('Col1')['Col2'].apply(list).to_json()
print (j)
{"a":["apple","anar"],"b":["ball","banana"]}
If you want to write the JSON to a file:
s = df.groupby('Col1')['Col2'].apply(list)
s.to_json('file.json')
Check the difference:
j = df.groupby('Col1')['Col2'].apply(list).to_json()
d = df.groupby('Col1')['Col2'].apply(list).to_dict()
print (j)
{"a":["apple","anar"],"b":["ball","banana"]}
print (d)
{'a': ['apple', 'anar'], 'b': ['ball', 'banana']}
print (type(j))
<class 'str'>
print (type(d))
<class 'dict'>
You can groupby() 'Col1', apply() list to 'Col2', and convert with to_dict(). Use:
df.groupby('Col1')['Col2'].apply(list).to_dict()
Output:
{'a': ['apple', 'anar'], 'b': ['ball', 'banana']}
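If you need that dict on disk as valid JSON (double quotes rather than the Python repr above), a minimal sketch with the standard json module:
import json

d = df.groupby('Col1')['Col2'].apply(list).to_dict()
with open('file.json', 'w') as f:
    json.dump(d, f)   # writes {"a": ["apple", "anar"], "b": ["ball", "banana"]}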
I have a pandas datetime column with None values which I would like to fill with datetime.now() in a specific timezone.
This is my MWE dataframe:
df = pd.DataFrame([
{'end': "2017-07-01 12:00:00"},
{'end': "2017-07-02 18:13:00"},
{'end': None},
{'end': "2017-07-04 10:45:00"}
])
If I fill with fillna:
pd.to_datetime(df['end']).fillna(datetime.now())
The result is a series with the expected dtype: datetime64[ns]. But when I specify the timezone, for example:
pd.to_datetime(df['end']).fillna(
datetime.now(pytz.timezone('US/Pacific')))
This returns a series with dtype: object
It seems you need to convert the date with to_datetime inside fillna:
df['end'] = pd.to_datetime(df['end'])
df['end'] = df['end'].fillna(pd.to_datetime(datetime.now(pytz.timezone('US/Pacific'))))
print (df)
end
0 2017-07-01 12:00:00
1 2017-07-02 18:13:00
2 2017-07-04 03:35:08.499418-07:00
3 2017-07-04 10:45:00
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
But still dtype is not datetime64:
print (df['end'].dtype)
object
I think the solution is to pass the parameter utc to to_datetime:
utc : boolean, default None
Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).
df['end'] = df['end'].fillna(datetime.now(pytz.timezone('US/Pacific')))
df['end'] = pd.to_datetime(df['end'], utc=True)
#print (df)
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
print (df['end'].dtypes)
datetime64[ns]
And the final solution, from the OP's comment:
df['end'] = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
df['end'] = df['end'].fillna(datetime.now(pytz.timezone('US/Pacific')))
print (df.end.dtype)
datetime64[ns, US/Pacific]
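For reference, a self-contained sketch of that final approach against the current pandas API, where pd.Timestamp.now(tz=...) stands in for the removed pd.datetime alias (data taken from the MWE above):
import pandas as pd

df = pd.DataFrame({'end': ["2017-07-01 12:00:00", "2017-07-02 18:13:00",
                           None, "2017-07-04 10:45:00"]})
df['end'] = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
df['end'] = df['end'].fillna(pd.Timestamp.now(tz='US/Pacific'))
print(df['end'].dtype)   # datetime64[ns, US/Pacific]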
I'm trying to read a CSV file. In one column (hpi), which should be float32, two records are populated with a "." to indicate missing values. pandas interprets the "." as a character.
How do I force this column to be numeric?
data = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
header=0,
names = ["state", "year", "qtr", "hpi"])
#,converters={'hpi': float})
#print data.head()
#print(data.dtypes)
print(data[data.hpi == '.'])
Use the na_values parameter in read_csv:
df = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
header=0,
names = ["state", "year", "qtr", "hpi"],
na_values='.')
df.dtypes
Out:
state object
year int64
qtr int64
hpi float64
dtype: object
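Since the question mentions float32, note (as an aside) that na_values can be combined with an explicit dtype in the same read_csv call; a sketch:
df = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
                 header=0,
                 names=["state", "year", "qtr", "hpi"],
                 na_values='.',
                 dtype={'hpi': 'float32'})
df['hpi'].dtype   # dtype('float32')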
Apply to_numeric over the desired column (with apply):
data.loc[data.hpi == '.', 'hpi'] = -1.0
data[['hpi']] = data[['hpi']].apply(pd.to_numeric)
For example:
In[69]: data = pd.read_csv('http://www.fhfa.gov/DataTools/Downloads/Documents/HPI/HPI_AT_state.csv',
header=0,
names = ["state", "year", "qtr", "hpi"])
In[70]: data[['hpi']].dtypes
Out[70]:
hpi object
dtype: object
In[74]: data.loc[data.hpi == '.'] = -1.0
In[75]: data[['hpi']] = data[['hpi']].apply(pd.to_numeric)
In[77]: data[['hpi']].dtypes
Out[77]:
hpi float64
dtype: object
EDIT:
For some reason it changes all the columns to float64. This is a small workaround that changes them back to int.
Before:
In[89]: data.dtypes
Out[89]:
state object
year float64
qtr float64
hpi float64
After:
In[90]: data[['year','qtr']] = data[['year','qtr']].astype(int)
In[91]: data.dtypes
Out[91]:
state object
year int64
qtr int64
hpi float64
dtype: object
If anyone could shed light on why it happens, that'd be great. (It happens because data.loc[data.hpi == '.'] = -1.0 assigns -1.0 to every column of the matched rows, not just hpi, which upcasts year and qtr to float64; selecting the column too, as in data.loc[data.hpi == '.', 'hpi'] = -1.0, avoids the problem.)
You could just cast this after you read it in, e.g.:
import numpy as np

data.loc[data.hpi == '.', 'hpi'] = np.nan
data.hpi = data.hpi.astype(np.float64)
Alternatively you can use the na_values parameter for read_csv
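Yet another route, not shown in the answers above: pd.to_numeric with errors='coerce' turns any unparseable value (including '.') into NaN in a single step, with no hard-coded sentinel:
data['hpi'] = pd.to_numeric(data['hpi'], errors='coerce')
data['hpi'].dtype   # dtype('float64')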
I am looking up a value in a dataframe using a MultiIndex: df[value1, value2]. This works, but throws a KeyError if the value is not in the index. I can handle the exception, but is there an equivalent of Python's dict.get()? That is, I would prefer the lookup to return None if the value is not found.
Just call DataFrame.get() (mkdf below is a pandas testing helper, and rand is numpy.random.rand):
In [50]: from pandas.util.testing import makeCustomDataframe as mkdf
In [51]: df = mkdf(5, 2, c_idx_nlevels=2, data_gen_f=lambda *args: rand())
In [52]: df
Out[52]:
C0 C_l0_g0 C_l0_g1
C1 C_l1_g0 C_l1_g1
R0
R_l0_g0 0.155 0.989
R_l0_g1 0.427 0.330
R_l0_g2 0.951 0.720
R_l0_g3 0.745 0.485
R_l0_g4 0.674 0.841
In [53]: level = df.columns[0]
In [54]: level
Out[54]: ('C_l0_g0', 'C_l1_g0')
In [55]: df.get(level)
Out[55]:
R0
R_l0_g0 0.155
R_l0_g1 0.427
R_l0_g2 0.951
R_l0_g3 0.745
R_l0_g4 0.674
Name: (C_l0_g0, C_l1_g0), dtype: float64
In [56]: df.get('how are you?')
In [57]: df.get('how are you?', 'Fine')
Out[57]: 'Fine'
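Since pandas.util.testing has since been deprecated, here is a self-contained sketch of the same idea with an explicitly built MultiIndex (the column labels are made up for illustration):
import pandas as pd

df = pd.DataFrame([[0.1, 0.9], [0.4, 0.3]],
                  columns=pd.MultiIndex.from_tuples([('C_l0_g0', 'C_l1_g0'),
                                                     ('C_l0_g1', 'C_l1_g1')]))
df.get(('C_l0_g0', 'C_l1_g0'))   # the first column, as a Series
df.get('missing')                # None
df.get('missing', 'Fine')        # 'Fine'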
You can also just define a function:
def get_from_index(df, key, default=None):
try:
return df.loc[key]
except KeyError:
return default
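A hypothetical usage sketch, with a small MultiIndexed Series made up for illustration:
import pandas as pd

s = pd.Series([1, 2], index=pd.MultiIndex.from_tuples([('a', 'x'), ('b', 'y')]))
get_from_index(s, ('a', 'x'))   # 1
get_from_index(s, ('a', 'z'))   # None (the KeyError is caught)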
If your df has a MultiIndex with levels 'key1' and 'key2' and you want to look up value xxx on key1 and yyy on key2, try this (the parentheses are required because & binds more tightly than ==, and .loc replaces the long-removed .ix):
df.loc[(df.index.get_level_values('key1') == xxx) &
       (df.index.get_level_values('key2') == yyy)]
Unlike .get(), this returns an empty DataFrame rather than None when nothing matches.