I am trying to append a streaming data series to a pandas dataframe.
The columns are constant. I have used the following code:
import pandas as pd
import random
import time
while True:
    ltp = random.randint(0, 100)
    trade = {'token': 12345, 'name': 'abc', 'ltp': ltp}
    time.sleep(2)
    df = pd.DataFrame(trade, index=[1])
    df = df.append(trade, ignore_index=True)
    print(df)
In the above, only the ltp values keep changing.
The output I get is only two rows with the same LTP, not an expanding dataframe with each new data point appended.
The output is:
token name ltp
0 12345 abc 9
1 12345 abc 9
token name ltp
0 12345 abc 93
1 12345 abc 93
token name ltp
0 12345 abc 92
1 12345 abc 92
token name ltp
0 12345 abc 10
1 12345 abc 10
Further, I am not sure why the same LTP appears twice, at index 0 and 1.
Your problem is that you create a new DataFrame object on each iteration, with this line:
while True:
    ...
    df = pd.DataFrame(trade, index=[1])
    ...
You need to create the DataFrame once, before starting the while loop, like this:
import pandas as pd
import random
import time
# init new DataFrame with headers as columns
headers = ['token' ,'name' ,'ltp']
df = pd.DataFrame(columns=headers)
while True:
    ltp = random.randint(0, 100)
    trade = {'token': 12345, 'name': 'abc', 'ltp': ltp}
    time.sleep(2)
    df = df.append(trade, ignore_index=True)
    print(df)
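Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, a minimal sketch of the same idea (keeping the question's random trade and two-second sleep) collects the rows in a list and rebuilds the frame with the DataFrame constructor:
import pandas as pd
import random
import time

rows = []
while True:
    ltp = random.randint(0, 100)
    trade = {'token': 12345, 'name': 'abc', 'ltp': ltp}
    time.sleep(2)
    rows.append(trade)
    # Rebuild the frame from the accumulated rows; appending to a plain list
    # is much cheaper than growing a DataFrame row by row.
    df = pd.DataFrame(rows)
    print(df)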
I have created a dataframe called df with this code:
import numpy as np
import pandas as pd
# initialize data of lists.
data = {'Feature1': [1, 2, -9999999, 4, 5],
        'Age': [20, 21, 19, 18, 34]}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
The dataframe looks like this:
Feature1 Age
0 1 20
1 2 21
2 -9999999 19
3 4 18
4 5 34
Every time there is a value of -9999999 in column Feature1, I need to replace it with the corresponding value from column Age, so the output dataframe would look like this:
Feature1 Age
0 1 20
1 2 21
2 19 19
3 4 18
4 5 34
Bear in mind that the actual dataframe that I am using has 200K records (the one I have shown above is just an example).
How do I do that in pandas?
You can use np.where or Series.mask:
df['Feature1'] = df['Feature1'].mask(df['Feature1'].eq(-9999999), df['Age'])
# or
df['Feature1'] = np.where(df['Feature1'].eq(-9999999), df['Age'], df['Feature1'])
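As a quick check, a minimal sketch on the example data from the question (either line above gives the same result):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Feature1': [1, 2, -9999999, 4, 5],
                   'Age': [20, 21, 19, 18, 34]})

# Replace the sentinel value with the Age value from the same row.
df['Feature1'] = df['Feature1'].mask(df['Feature1'].eq(-9999999), df['Age'])
print(df)
# Row 2 now shows Feature1 == 19, matching the expected output above.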
I have a dataframe of 25M x 3 cols of format:
import pandas as pd
import numpy as np
d={'ID':['A1','A1','A2','A2','A2'], 'date':['Jan 1','Jan7','Jan4','Jan5','Jan12'],'value':[10,12,3,5,2]}
df=pd.DataFrame(data=d)
df
ID date value
0 A1 Jan 1 10
1 A1 Jan7 12
2 A2 Jan4 3
3 A2 Jan5 5
4 A2 Jan12 2
...
I want to pivot it using:
df['date'] = pd.to_datetime(df['date'], format='%b%d')
(df.pivot(index='date', columns='ID',values='value')
.asfreq('D')
.interpolate()
.bfill()
.reset_index()
)
df.index = df.index.strftime('%b%d')
This works for 500K rows:
df3=(df.iloc[:500000,:].pivot(index='date', columns='ID',values='value')
.resample('M').mean()
.interpolate()
.bfill()
.reset_index()
)
but when I use my full data set, or anything over about 1M rows, it fails with:
ValueError: Unstacked DataFrame is too big, causing int32 overflow
Are there any suggestions on how I can get this to run to completion?
A further computation is performed on the wide table:
N = 19 / df2.iloc[0]
df2.mul(N.tolist(), axis=1).sum(1)
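One possible workaround, offered only as a sketch, is to pivot a subset of IDs at a time and then align the wide pieces on the date index, so no single reshape has to allocate the full dates x IDs block at once. It assumes each (date, ID) pair is unique, as pivot already requires, and that df['date'] has been converted with pd.to_datetime as above:
import numpy as np
import pandas as pd

# Pivot the IDs in chunks, then join the wide pieces column-wise.
ids = df['ID'].unique()
pieces = []
for chunk in np.array_split(ids, 20):  # the number of chunks is a tuning knob
    piece = (df[df['ID'].isin(chunk)]
             .pivot(index='date', columns='ID', values='value'))
    pieces.append(piece)

# df2 is the wide table used in the further computation above.
df2 = (pd.concat(pieces, axis=1)
       .asfreq('D')
       .interpolate()
       .bfill())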
I have a df like this:
import pandas as pd
df = pd.DataFrame(columns=['Concat','SearchTerm'])
df = df.append({'Concat':'abc','SearchTerm':'aa'}, ignore_index=True)
df = df.append({'Concat':'abc','SearchTerm':'aab'}, ignore_index=True)
df = df.append({'Concat':'abc','SearchTerm':'aac'}, ignore_index=True)
df = df.append({'Concat':'abc','SearchTerm':'ddd'}, ignore_index=True)
df = df.append({'Concat':'def','SearchTerm':'cef'}, ignore_index=True)
df = df.append({'Concat':'def','SearchTerm':'plo'}, ignore_index=True)
df = df.append({'Concat':'def','SearchTerm':'cefa'}, ignore_index=True)
print(df)
Concat SearchTerm
0 abc aa
1 abc aab
2 abc aac
3 abc ddd
4 def cef
5 def plo
6 def cefa
I want to group up the df by Concat, and count how many times each SearchTerm appears within the strings of that subset. So the final result should look like this:
Concat SearchTerm Count
0 abc aa 3
1 abc aab 1
2 abc aac 1
3 abc ddd 1
4 def cef 2
5 def plo 1
6 def cefa 1
For Concat abc, aa is found 3 times among the 4 SearchTerms. I can get the solution using a loop, but for my larger dataset, it is too slow.
I have tried two solutions from this thread and this thread.
df['Count'] = df['SearchTerm'].str.contains(df['SearchTerm']).groupby(df['Concat']).sum()
df['Count'] = df.groupby(['Concat'])['SearchTerm'].transform(lambda x: x[x.str.contains(x)].count())
In either case, there is a TypeError:
'Series' objects are mutable, thus they cannot be hashed
Any help would be appreciated.
Use transform to join each group's SearchTerms into one string, then a list comprehension to count how many times each term occurs as a substring of its group's joined string:
s = df.groupby('Concat').SearchTerm.transform('|'.join)
df['Count'] = [s[i].count(term) for i, term in enumerate(df.SearchTerm)]
Out[77]:
Concat SearchTerm Count
0 abc aa 3
1 abc aab 1
2 abc aac 1
3 abc ddd 1
4 def cef 2
5 def plo 1
6 def cefa 1
I have a dict 'd' set up which holds a collection of dataframes. E.g.:
d["DataFrame1"]
will return that dataframe with all its columns:
ID Name
0 123 John
1 548 Eric
2 184 Sam
3 175 Andy
Each dataframe has a column in it called 'Names'. I want to extract this column from each dataframe in the dict and to create a new dataframe consisting of these columns.
df_All_Names = pd.DataFrame()
for df in d:
    df_All_Names[df] = df['Names']
Returns the error:
TypeError: string indices must be integers
Unsure where I'm going wrong here.
For example, say you have df as follows:
df=pd.DataFrame({'Name':['X', 'Y']})
df1=pd.DataFrame({'Name':['X1', 'Y1']})
Then we create a dict:
d=dict()
d['df']=df
d['df1']=df1
Then preset an empty DataFrame:
yourdf=pd.DataFrame()
Using items() with a for loop:
for key, val in d.items():
    yourdf[key] = val['Name']
yields:
yourdf
Out[98]:
df df1
0 X X1
1 Y Y1
You can use reduce to concatenate all of the columns named 'Name' in your dictionary of dataframes.
Sample Data
from functools import reduce
d = {'df1':pd.DataFrame({'ID':[0,1,2],'Name':['John','Sam','Andy']}),'df2':pd.DataFrame({'ID':[3,4,5],'Name':['Jen','Cara','Jess']})}
You can stack the data side by side using axis=1:
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=1),d.values())
Name Name
0 John Jen
1 Sam Cara
2 Andy Jess
Or on top of one another using axis=0:
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=0),d.values())
0 John
1 Sam
2 Andy
0 Jen
1 Cara
2 Jess
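With more than two frames the reduce lambda gets awkward, since after the first step x is no longer one of the original dataframes. A simpler sketch that labels each column with its dict key (assuming every frame has a 'Name' column):
import pandas as pd

wide = pd.concat({key: frame['Name'] for key, frame in d.items()}, axis=1)
print(wide)
# Columns come out labelled with the dict keys, e.g. df1, df2.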
I have a string array where each element is a row of a CSV file (comma separated). I want to convert this into a pandas DataFrame. However, doing it row by row is very slow. Can a faster alternative be proposed, apart from writelines() followed by pandas.read_csv()?
CSV Import
In pandas you can read an entire CSV at once without looping over the lines.
Use read_csv with the filename as the argument:
import pandas as pd
from io import StringIO
# Set up fake csv data as test for example only
fake_csv = '''
Col_0,Col_1,Col_2,Col_3
0,0.5,A,123
1,0.2,J,234
2,1.4,F,345
3,0.7,E,456
4,0.4,G,576
5,0.8,T,678
6,1.6,A,789
'''
# Read in whole csv to DataFrame at once
# StringIO is for example only
# Normally you would load your file with
# df = pd.read_csv('/path/to/your/file.csv')
df = pd.read_csv(StringIO(fake_csv))
print('DataFrame from CSV:')
print(df)
DataFrame from CSV:
Col_0 Col_1 Col_2 Col_3
0 0 0.5 A 123
1 1 0.2 J 234
2 2 1.4 F 345
3 3 0.7 E 456
4 4 0.4 G 576
5 5 0.8 T 678
6 6 1.6 A 789
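To apply this directly to the in-memory string array from the question, join the rows with newlines and read through StringIO, avoiding a temporary file. A sketch, where rows stands in for the question's array and its first element is assumed to hold the header line:
import pandas as pd
from io import StringIO

rows = ['Col_0,Col_1,Col_2', '0,0.5,A', '1,0.2,J']  # placeholder for the real array
df = pd.read_csv(StringIO('\n'.join(rows)))
print(df)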