Pandas DataFrame read_json for list values - pandas

I have a file with record json strings like:
{"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
{"foo": [0.0621534586, 0.0509529933, 0.122285351]}
{"foo": [0.0169468746, 0.00475309044, 0.0085169]}
When I call read_json on this file I get a DataFrame where the column foo has dtype object. Calling .to_numpy() on this DataFrame gives me a NumPy array of the form:
array([list([-0.050888903400000005, -0.00733460533, -0.0595958121]),
list([0.10726073400000001, -0.0247702841, -0.0298063811]), ...,
list([-0.10156482500000001, -0.0402663834, -0.0609775148])],
dtype=object)
I want to parse the values of foo as NumPy arrays instead of lists. Does anyone have any ideas?

The easiest way is to create your DataFrame with .from_dict().
Here is a minimal example with one of your dicts:
d = {"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
df = pd.DataFrame.from_dict(d)
>>> df
foo
0 -0.048201
1 0.041648
2 -0.049558
>>> df.dtypes
foo float64
dtype: object

How about doing:
df['foo'] = df['foo'].apply(np.array)
df
foo
0 [-0.0482006893, 0.0416476727, -0.0495583452]
1 [0.0621534586, 0.0509529933, 0.12228535100000001]
2 [0.0169468746, 0.00475309044, 0.00851689999999...
This shows that these have been converted to numpy.ndarray instances:
df['foo'].apply(type)
0 <class 'numpy.ndarray'>
1 <class 'numpy.ndarray'>
2 <class 'numpy.ndarray'>
Name: foo, dtype: object
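Putting the pieces together, here is a minimal sketch (assuming the file really is line-delimited JSON, so lines=True applies) that reads the records and stacks the column into a single 2-D float array:

```python
import io

import numpy as np
import pandas as pd

# stand-in for the file: one JSON record per line, as in the question
data = io.StringIO(
    '{"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}\n'
    '{"foo": [0.0621534586, 0.0509529933, 0.122285351]}\n'
)

df = pd.read_json(data, lines=True)

# np.stack turns the column of per-row lists into one 2-D float array
arr = np.stack(df['foo'].to_numpy())
print(arr.shape)   # (2, 3)
print(arr.dtype)   # float64
```

This only works when every list has the same length; for ragged lists you would have to keep the object column.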


How to add new columns to vaex dataframe?
When I tried to assign a list object to the dataframe, as is done in pandas, I received the following error:
ValueError: [1, 1, 1, 1, 1, 1, 1] is not of string or Expression type, but <class 'list'>
Simple: convert the list object to a NumPy array; I guess that is what vaex accepts as Expression type:
import numpy as np
a = [1]*7
a = np.array(a)
sub["new"] = a
sub
Let us first create a dataframe using the vaex package:
import vaex
import numpy as np
x = np.arange(4)
y = x*2
df = vaex.from_arrays(x=x, y=y)
df
output:
# x y
0 0 0
1 1 2
2 2 4
3 3 6
Now, if you would like to add a new column called greeting:
df['greeting'] = ['hi', 'أهلا', 'hola', 'bonjour']
you will get this error:
ValueError: ['hi', 'أهلا', 'hola', 'bonjour'] is not of string or Expression type, but <class 'list'>
To handle this problem, convert the list to a NumPy array first:
df['greeting'] = np.asanyarray(['hi', 'أهلا', 'hola', 'bonjour'])
df
output:
# x y greeting
0 0 0 hi
1 1 2 أهلا
2 2 4 hola
3 3 6 bonjour
Enjoy!
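Independent of vaex, the crux of the fix above is that np.asanyarray turns the Python list into an ndarray. A quick sanity check of that conversion alone:

```python
import numpy as np

greetings = ['hi', 'أهلا', 'hola', 'bonjour']
arr = np.asanyarray(greetings)   # list -> ndarray of unicode strings

print(type(arr))        # <class 'numpy.ndarray'>
print(arr.dtype.kind)   # 'U'
print(arr.shape)        # (4,)
```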

Combine two series as new one in dataframe?

type(x)
<class 'pandas.core.frame.DataFrame'>
x.shape
(18, 12)
To reference the first row and 3:5 columns with expression:
type(x.iloc[0,3:5])
<class 'pandas.core.series.Series'>
x.iloc[0,3:5]
total_operating_revenue NaN
net_profit 3.43019e+07
Name: 2001-12-31, dtype: object
To reference the first row and 8:10 columns with expression:
type(x.iloc[0,8:10])
<class 'pandas.core.series.Series'>
x.iloc[0,8:10]
total_operating_revenue_parent 5.05e+8
net_profit_parent 4.4e+07
Name: 2001-12-31, dtype: object
I want to get the combined new series (call it y) as follows:
type(y)
<class 'pandas.core.series.Series'>
y.shape
(4,)
y contains:
total_operating_revenue NaN
net_profit 3.43019e+07
total_operating_revenue_parent 5.05e+8
net_profit_parent 4.4e+07
Name: 2001-12-31, dtype: object
My failed attempts:
x.iloc[0,[3:5,8:10]]
x.iloc[0,3:5].combine(x.iloc[0,8:10])
pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]],axis=1) is not what I expect; it differs completely from y:
z = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]],axis=1)
type(z)
<class 'pandas.core.frame.DataFrame'>
z.shape
(4, 2)
My mistake previously to suggest concatenating along the columns.
Instead, you should concatenate along the rows:
y = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]])
Example:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.random.randint(0,100,size=(18, 12)),
                 columns=list('ABCDEFGHIJKL'))
And then:
In [392]: y = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]])
In [393]: y.shape
Out[393]: (4,)
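As an alternative to concat, the two positional slices can be combined into a single indexer with np.r_; a sketch using the same random example:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randint(0, 100, size=(18, 12)),
                 columns=list('ABCDEFGHIJKL'))

# np.r_[3:5, 8:10] builds the position array [3, 4, 8, 9],
# which iloc accepts directly, yielding one 4-element Series
y = x.iloc[0, np.r_[3:5, 8:10]]
print(y.shape)          # (4,)
print(list(y.index))    # ['D', 'E', 'I', 'J']
```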

Pandas merge two columns into Json

I have a pandas dataframe like below
Col1 Col2
0 a apple
1 a anar
2 b ball
3 b banana
I am looking to output json which outputs like
{ 'a' : ['apple', 'anar'], 'b' : ['ball', 'banana'] }
Use groupby with apply, then convert the Series to JSON with Series.to_json:
j = df.groupby('Col1')['Col2'].apply(list).to_json()
print (j)
{"a":["apple","anar"],"b":["ball","banana"]}
If you want to write the JSON to a file:
s = df.groupby('Col1')['Col2'].apply(list)
s.to_json('file.json')
Check the difference:
j = df.groupby('Col1')['Col2'].apply(list).to_json()
d = df.groupby('Col1')['Col2'].apply(list).to_dict()
print (j)
{"a":["apple","anar"],"b":["ball","banana"]}
print (d)
{'a': ['apple', 'anar'], 'b': ['ball', 'banana']}
print (type(j))
<class 'str'>
print (type(d))
<class 'dict'>
You can groupby() 'Col1', apply() list to 'Col2', and convert with to_dict(). Use:
df.groupby('Col1')['Col2'].apply(list).to_dict()
Output:
{'a': ['apple', 'anar'], 'b': ['ball', 'banana']}
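If you go through to_dict(), the standard json module can then serialize the result with whatever formatting you like, for example pretty-printed:

```python
import json

import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'a', 'b', 'b'],
                   'Col2': ['apple', 'anar', 'ball', 'banana']})

# group values of Col2 into lists keyed by Col1, then serialize
d = df.groupby('Col1')['Col2'].apply(list).to_dict()
j = json.dumps(d, indent=2, ensure_ascii=False)
print(j)
```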

pandas fillna datetime column with timezone now

I have a pandas datetime column with None values which I would like to fill with datetime.now() in a specific timezone.
This is my MWE dataframe:
import pytz
from datetime import datetime
import pandas as pd

df = pd.DataFrame([
{'end': "2017-07-01 12:00:00"},
{'end': "2017-07-02 18:13:00"},
{'end': None},
{'end': "2017-07-04 10:45:00"}
])
If I fill with fillna:
pd.to_datetime(df['end']).fillna(datetime.now())
The result is a series with expected dtype: datetime64[ns]. But when I specify the timezone, for example:
pd.to_datetime(df['end']).fillna(
datetime.now(pytz.timezone('US/Pacific')))
This returns a series with dtype: object
It seems you need to convert the date with to_datetime inside fillna:
df['end'] = pd.to_datetime(df['end'])
df['end'] = df['end'].fillna(pd.to_datetime(datetime.now(pytz.timezone('US/Pacific'))))
print (df)
end
0 2017-07-01 12:00:00
1 2017-07-02 18:13:00
2 2017-07-04 03:35:08.499418-07:00
3 2017-07-04 10:45:00
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
But the dtype is still not datetime64:
print (df['end'].dtype)
object
I think the solution is to pass the parameter utc to to_datetime:
utc : boolean, default None
Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).
df['end'] = df['end'].fillna(datetime.now(pytz.timezone('US/Pacific')))
df['end'] = pd.to_datetime(df['end'], utc=True)
#print (df)
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
print (df['end'].dtypes)
datetime64[ns]
And the final solution, from the OP's comment:
df['end'] = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
df['end'] = df['end'].fillna(datetime.now(pytz.timezone('US/Pacific')))
print (df.end.dtype)
datetime64[ns, US/Pacific]
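In recent pandas versions the same idea can be written without pytz or pd.datetime (the latter has since been removed), using pd.Timestamp.now with a tz argument; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'end': ["2017-07-01 12:00:00", None]})

# localize first, then fill with a tz-aware "now" in the same zone
s = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
s = s.fillna(pd.Timestamp.now(tz='US/Pacific'))
print(s.dtype)   # datetime64[ns, US/Pacific]
```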

Defining a function to plot a graph from CSV data - Python pandas

I am trying to play around with data analysis, taking in data from a simple CSV file I have created with random values in it.
I have defined a function that should let the user type in a value and then, from the dataFrame, plot a bar graph. Here it is:
def analysis_currency_pair():
    x = raw_input("what currency pair would you like to analyse? :")
    print type(x)
    global dataFrame
    df1 = dataFrame
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')
When I call the function, the code prints my question and echoes the currency pair I type. However, x (the value input by the user) doesn't seem to reach the second half of the function, so no graph is produced.
Am I doing something wrong here?
The same code works when I hard-code the value instead of wrapping it in a function.
I am confused!
I think you need to rewrite your function with two parameters, x and df, which are passed to analysis_currency_pair:
import pandas as pd
df = pd.DataFrame({"currencyPair": pd.Series({1: 'EURUSD', 2: 'EURGBP', 3: 'CADUSD'}),
                   "amount": pd.Series({1: 2, 2: 2, 3: 3.5}),
                   "a": pd.Series({1: 7, 2: 8, 3: 9})})
print df
#   a  amount currencyPair
#1  7     2.0       EURUSD
#2  8     2.0       EURGBP
#3  9     3.5       CADUSD
def analysis_currency_pair(x, df1):
    print type(x)
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')
#raw input is EURUSD or EURGBP or CADUSD
pair = raw_input("what currency pair would you like to analyse? :")
analysis_currency_pair(pair, df)
Or you can pass a string to analysis_currency_pair:
import pandas as pd
df = pd.DataFrame({"currencyPair": ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [1, 2, 3, 4, 5],
                   "amount1": [5, 4, 3, 2, 1]})
print df
#   amount  amount1 currencyPair
#0       1        5       EURUSD
#1       2        4       EURGBP
#2       3        3       CADUSD
#3       4        2       EURUSD
#4       5        1       EURGBP
def analysis_currency_pair(x, df1):
    print type(x)
    #<type 'str'>
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    print df2
    #              amount
    #currencyPair
    #CADUSD            3
    #EURGBP            7
    #EURUSD            5
    df2 = df2.loc[x].plot(kind='bar')
analysis_currency_pair('CADUSD', df)
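For reference, a Python 3 sketch of the same two-parameter approach (raw_input becomes input and print becomes a function; the bar-plot call is left to the caller, so the grouping is shown with a fixed pair instead of user input):

```python
import pandas as pd

def analysis_currency_pair(x, df1):
    # sum the amounts per currency pair, then select the requested pair
    df2 = df1[['currencyPair', 'amount']].groupby('currencyPair').sum()
    return df2.loc[x]   # call .plot(kind='bar') on this for the chart

df = pd.DataFrame({'currencyPair': ['EURUSD', 'EURGBP', 'CADUSD'],
                   'amount': [2.0, 2.0, 3.5]})

print(analysis_currency_pair('CADUSD', df))
```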