Pandas DataFrame read_json for list values - pandas

I have a file with record json strings like:
{"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
{"foo": [0.0621534586, 0.0509529933, 0.122285351]}
{"foo": [0.0169468746, 0.00475309044, 0.0085169]}
When I call read_json on this file I get a DataFrame where the column foo has dtype object. Calling .to_numpy() on this DataFrame gives me a NumPy array of the form:
array([list([-0.050888903400000005, -0.00733460533, -0.0595958121]),
list([0.10726073400000001, -0.0247702841, -0.0298063811]), ...,
list([-0.10156482500000001, -0.0402663834, -0.0609775148])],
dtype=object)
I want to parse the values of foo as NumPy arrays instead of lists. Does anyone have any ideas?

The easiest way is to create your DataFrame with .from_dict().
Here is a minimal example with one of your dicts:
d = {"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}
df = pd.DataFrame.from_dict(d)
>>> df
foo
0 -0.048201
1 0.041648
2 -0.049558
>>> df.dtypes
foo float64
dtype: object

How about doing:
df['foo'] = df['foo'].apply(np.array)
df
foo
0 [-0.0482006893, 0.0416476727, -0.0495583452]
1 [0.0621534586, 0.0509529933, 0.12228535100000001]
2 [0.0169468746, 0.00475309044, 0.00851689999999...
This shows that these have been converted to numpy.ndarray instances:
df['foo'].apply(type)
0 <class 'numpy.ndarray'>
1 <class 'numpy.ndarray'>
2 <class 'numpy.ndarray'>
Name: foo, dtype: object
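Putting the pieces together, here is a minimal sketch (assuming the file really is line-delimited JSON, so lines=True applies) that reads the records and stacks the column into a single 2-D float array:

```python
import io

import numpy as np
import pandas as pd

# stand-in for the file: one JSON record per line, as in the question
data = io.StringIO(
    '{"foo": [-0.0482006893, 0.0416476727, -0.0495583452]}\n'
    '{"foo": [0.0621534586, 0.0509529933, 0.122285351]}\n'
)

df = pd.read_json(data, lines=True)

# np.stack turns the column of per-row lists into one 2-D float array
arr = np.stack(df['foo'].to_numpy())
print(arr.shape)   # (2, 3)
print(arr.dtype)   # float64
```

This only works when every list has the same length; for ragged lists you would have to keep the object column.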


How to add new columns to vaex dataframe?
When I tried to assign a list object to the dataframe, as is done in pandas, I received the following error:
ValueError: [1, 1, 1, 1, 1, 1, 1] is not of string or Expression type, but <class 'list'>
Simple: convert the list object to a NumPy array; I guess that is what vaex accepts as Expression type:
import numpy as np
a = [1]*7
a = np.array(a)
sub["new"] = a
sub
Let us first create a dataframe using the vaex package:
import vaex
import numpy as np
x = np.arange(4)
y = x*2
df = vaex.from_arrays(x=x, y=y)
df
output:
# x y
0 0 0
1 1 2
2 2 4
3 3 6
Now, if you would like to add a new column called greeting:
df['greeting'] = ['hi', 'أهلا', 'hola', 'bonjour']
you will get this error:
ValueError: ['hi', 'أهلا', 'hola', 'bonjour'] is not of string or Expression type, but <class 'list'>
To handle this problem, convert the list to a NumPy array first:
df['greeting'] = np.asanyarray(['hi', 'أهلا', 'hola', 'bonjour'])
df
output:
# x y greeting
0 0 0 hi
1 1 2 أهلا
2 2 4 hola
3 3 6 bonjour
Enjoy!
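Independent of vaex, the crux of the fix above is that np.asanyarray turns the Python list into an ndarray. A quick sanity check of that conversion alone:

```python
import numpy as np

greetings = ['hi', 'أهلا', 'hola', 'bonjour']
arr = np.asanyarray(greetings)   # list -> ndarray of unicode strings

print(type(arr))        # <class 'numpy.ndarray'>
print(arr.dtype.kind)   # 'U'
print(arr.shape)        # (4,)
```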

Combine two series as new one in dataframe?

type(x)
<class 'pandas.core.frame.DataFrame'>
x.shape
(18, 12)
To reference the first row and 3:5 columns with expression:
type(x.iloc[0,3:5])
<class 'pandas.core.series.Series'>
x.iloc[0,3:5]
total_operating_revenue NaN
net_profit 3.43019e+07
Name: 2001-12-31, dtype: object
To reference the first row and 8:10 columns with expression:
type(x.iloc[0,8:10])
<class 'pandas.core.series.Series'>
x.iloc[0,8:10]
total_operating_revenue_parent 5.05e+8
net_profit_parent 4.4e+07
Name: 2001-12-31, dtype: object
I want to get the combined new series (call it y) as follows:
type(y)
<class 'pandas.core.series.Series'>
y.shape
(4,)
y contains:
total_operating_revenue NaN
net_profit 3.43019e+07
total_operating_revenue_parent 5.05e+8
net_profit_parent 4.4e+07
Name: 2001-12-31, dtype: object
My failed attempts:
x.iloc[0,[3:5,8:10]]
x.iloc[0,3:5].combine(x.iloc[0,8:10])
pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]],axis=1) is not what I expect; it differs completely from y:
z = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]],axis=1)
type(z)
<class 'pandas.core.frame.DataFrame'>
z.shape
(4, 2)
My mistake previously to suggest concatenating along the columns.
Instead, you should concatenate along the rows:
y = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]])
Example:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.random.randint(0,100,size=(18, 12)),
                 columns=list('ABCDEFGHIJKL'))
And then:
In [392]: y = pd.concat([x.iloc[0,3:5],x.iloc[0,8:10]])
In [393]: y.shape
Out[393]: (4,)
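As an alternative to concat, the two positional slices can be combined into a single indexer with np.r_; a sketch using the same random example:

```python
import numpy as np
import pandas as pd

x = pd.DataFrame(np.random.randint(0, 100, size=(18, 12)),
                 columns=list('ABCDEFGHIJKL'))

# np.r_[3:5, 8:10] builds the position array [3, 4, 8, 9],
# which iloc accepts directly, yielding one 4-element Series
y = x.iloc[0, np.r_[3:5, 8:10]]
print(y.shape)          # (4,)
print(list(y.index))    # ['D', 'E', 'I', 'J']
```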

Pandas merge two columns into Json

I have a pandas dataframe like below
Col1 Col2
0 a apple
1 a anar
2 b ball
3 b banana
I am looking to output json which outputs like
{ 'a' : ['apple', 'anar'], 'b' : ['ball', 'banana'] }
Use groupby with apply, then convert the Series to JSON with Series.to_json:
j = df.groupby('Col1')['Col2'].apply(list).to_json()
print (j)
{"a":["apple","anar"],"b":["ball","banana"]}
If you want to write the JSON to a file:
s = df.groupby('Col1')['Col2'].apply(list)
s.to_json('file.json')
Check the difference:
j = df.groupby('Col1')['Col2'].apply(list).to_json()
d = df.groupby('Col1')['Col2'].apply(list).to_dict()
print (j)
{"a":["apple","anar"],"b":["ball","banana"]}
print (d)
{'a': ['apple', 'anar'], 'b': ['ball', 'banana']}
print (type(j))
<class 'str'>
print (type(d))
<class 'dict'>
You can groupby() 'Col1', apply() list to 'Col2', and convert with to_dict(). Use:
df.groupby('Col1')['Col2'].apply(list).to_dict()
Output:
{'a': ['apple', 'anar'], 'b': ['ball', 'banana']}
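If you go through to_dict(), the standard json module can then serialize the result with whatever formatting you like, for example pretty-printed:

```python
import json

import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'a', 'b', 'b'],
                   'Col2': ['apple', 'anar', 'ball', 'banana']})

# group values of Col2 into lists keyed by Col1, then serialize
d = df.groupby('Col1')['Col2'].apply(list).to_dict()
j = json.dumps(d, indent=2, ensure_ascii=False)
print(j)
```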

pandas fillna datetime column with timezone now

I have a pandas datetime column with None values which I would like to fill with datetime.now() in a specific timezone.
This is my MWE dataframe:
import pytz
from datetime import datetime
import pandas as pd

df = pd.DataFrame([
{'end': "2017-07-01 12:00:00"},
{'end': "2017-07-02 18:13:00"},
{'end': None},
{'end': "2017-07-04 10:45:00"}
])
If I fill with fillna:
pd.to_datetime(df['end']).fillna(datetime.now())
The result is a series with expected dtype: datetime64[ns]. But when I specify the timezone, for example:
pd.to_datetime(df['end']).fillna(
datetime.now(pytz.timezone('US/Pacific')))
This returns a series with dtype: object
It seems you need to convert the date with to_datetime inside fillna:
df['end'] = pd.to_datetime(df['end'])
df['end'] = df['end'].fillna(pd.to_datetime(datetime.now(pytz.timezone('US/Pacific'))))
print (df)
end
0 2017-07-01 12:00:00
1 2017-07-02 18:13:00
2 2017-07-04 03:35:08.499418-07:00
3 2017-07-04 10:45:00
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
But the dtype is still not datetime64:
print (df['end'].dtype)
object
I think the solution is to pass the parameter utc to to_datetime:
utc : boolean, default None
Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).
df['end'] = df['end'].fillna(datetime.now(pytz.timezone('US/Pacific')))
df['end'] = pd.to_datetime(df['end'], utc=True)
#print (df)
print (df['end'].apply(type))
0 <class 'pandas._libs.tslib.Timestamp'>
1 <class 'pandas._libs.tslib.Timestamp'>
2 <class 'pandas._libs.tslib.Timestamp'>
3 <class 'pandas._libs.tslib.Timestamp'>
Name: end, dtype: object
print (df['end'].dtypes)
datetime64[ns]
And the final solution, from the OP's comment:
df['end'] = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
df['end'] = df['end'].fillna(datetime.now(pytz.timezone('US/Pacific')))
print (df.end.dtype)
datetime64[ns, US/Pacific]
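In recent pandas versions the same idea can be written without pytz or pd.datetime (the latter has since been removed), using pd.Timestamp.now with a tz argument; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'end': ["2017-07-01 12:00:00", None]})

# localize first, then fill with a tz-aware "now" in the same zone
s = pd.to_datetime(df['end']).dt.tz_localize('US/Pacific')
s = s.fillna(pd.Timestamp.now(tz='US/Pacific'))
print(s.dtype)   # datetime64[ns, US/Pacific]
```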

Defining a function to plot a graph from CSV data - Python pandas

I am trying to play around with data analysis, taking in data from a simple CSV file I have created with random values in it.
I have defined a function that should let the user type in a value and then, from the dataFrame, plot a bar graph. Here it is:
def analysis_currency_pair():
    x = raw_input("what currency pair would you like to analyse? :")
    print type(x)
    global dataFrame
    df1 = dataFrame
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')
When I call the function, the code prints my question and echoes the currency pair I type. However, x (the value input by the user) doesn't seem to reach the second half of the function, so no graph is produced.
Am I doing something wrong here?
The same code works when I hard-code the value instead of wrapping it in a function.
I am confused!
I think you need to rewrite your function with two parameters, x and df, which are passed to analysis_currency_pair:
import pandas as pd
df = pd.DataFrame({"currencyPair": pd.Series({1: 'EURUSD', 2: 'EURGBP', 3: 'CADUSD'}),
                   "amount": pd.Series({1: 2, 2: 2, 3: 3.5}),
                   "a": pd.Series({1: 7, 2: 8, 3: 9})})
print df
#   a  amount currencyPair
#1  7     2.0       EURUSD
#2  8     2.0       EURGBP
#3  9     3.5       CADUSD
def analysis_currency_pair(x, df1):
    print type(x)
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    df2 = df2.loc[x].plot(kind='bar')
#raw input is EURUSD or EURGBP or CADUSD
pair = raw_input("what currency pair would you like to analyse? :")
analysis_currency_pair(pair, df)
Or you can pass a string to analysis_currency_pair:
import pandas as pd
df = pd.DataFrame({"currencyPair": ['EURUSD', 'EURGBP', 'CADUSD', 'EURUSD', 'EURGBP'],
                   "amount": [1, 2, 3, 4, 5],
                   "amount1": [5, 4, 3, 2, 1]})
print df
#   amount  amount1 currencyPair
#0       1        5       EURUSD
#1       2        4       EURGBP
#2       3        3       CADUSD
#3       4        2       EURUSD
#4       5        1       EURGBP
def analysis_currency_pair(x, df1):
    print type(x)
    #<type 'str'>
    df2 = df1[['currencyPair', 'amount']]
    df2 = df2.groupby(['currencyPair']).sum()
    print df2
    #              amount
    #currencyPair
    #CADUSD            3
    #EURGBP            7
    #EURUSD            5
    df2 = df2.loc[x].plot(kind='bar')
analysis_currency_pair('CADUSD', df)
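For reference, a Python 3 sketch of the same two-parameter approach (raw_input becomes input and print becomes a function; the bar-plot call is left to the caller, so the grouping is shown with a fixed pair instead of user input):

```python
import pandas as pd

def analysis_currency_pair(x, df1):
    # sum the amounts per currency pair, then select the requested pair
    df2 = df1[['currencyPair', 'amount']].groupby('currencyPair').sum()
    return df2.loc[x]   # call .plot(kind='bar') on this for the chart

df = pd.DataFrame({'currencyPair': ['EURUSD', 'EURGBP', 'CADUSD'],
                   'amount': [2.0, 2.0, 3.5]})

print(analysis_currency_pair('CADUSD', df))
```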