Fast conversion from String array to Pandas Dataframe - pandas

I have a string array where each element is one row of a CSV file (comma separated). I want to convert this into a pandas DataFrame. However, when I tried building it row by row it was very slow. Can a faster alternative be proposed, apart from writelines() followed by pandas.read_csv()?

CSV Import
In pandas you can read an entire CSV at once without looping over the lines.
Use read_csv with the filename (or any file-like object) as argument:
import pandas as pd
from io import StringIO  # on Python 2 this was cStringIO.StringIO
# Set up fake csv data as test for example only
fake_csv = '''
Col_0,Col_1,Col_2,Col_3
0,0.5,A,123
1,0.2,J,234
2,1.4,F,345
3,0.7,E,456
4,0.4,G,576
5,0.8,T,678
6,1.6,A,789
'''
# Read in whole csv to DataFrame at once
# StringIO is for example only
# Normally you would load your file with
# df = pd.read_csv('/path/to/your/file.csv')
df = pd.read_csv(StringIO(fake_csv))
print('DataFrame from CSV:')
print(df)
DataFrame from CSV:
Col_0 Col_1 Col_2 Col_3
0 0 0.5 A 123
1 1 0.2 J 234
2 2 1.4 F 345
3 3 0.7 E 456
4 4 0.4 G 576
5 5 0.8 T 678
6 6 1.6 A 789
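Since the question starts from a string array rather than a file on disk, you can also join the rows into a single string and hand it to read_csv through StringIO; that is still one vectorised parse instead of a row-by-row build. A minimal sketch, assuming a hypothetical rows list whose first element holds the header:
import pandas as pd
from io import StringIO
# hypothetical input: each element is one comma-separated CSV row
rows = ['Col_0,Col_1,Col_2,Col_3',
        '0,0.5,A,123',
        '1,0.2,J,234']
# join once, parse once - avoids building the DataFrame row by row
df = pd.read_csv(StringIO('\n'.join(rows)))
print(df)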

Related

pandas: read multiple dataframes from one csv

I have a csv file that looks like this:
col A, col B
1, 5
2,7
78,65
###########
5,8
15,23
###########
17, 15
25,62
12,15
95,56
How to transform it into set of dataframes, one for each area between ######### lines (I can change the marker if needed)?
The result should be something like this:
df1 = {col A :{1,2,78}, col B: {5,7,65}}
df2 = {col A: {5,15}, col B: {8,23}}
df3 = {col A: {17,25,12,95}, col B: {15,62,15,56}}
I know there is a workaround using file.readlines(), but it is "not very elegant" - I wonder if there is a pandas way to do it directly.
Highly inspired by piRSquared's answer here, you can approach your goal like this:
import pandas as pd
import numpy as np
df = pd.read_csv("/input_file.csv")
# is the row a horizontal delimiter?
m = df["col A"].str.contains("#", na=False)
l_df = list(filter(lambda d: not d.empty, np.split(df, np.flatnonzero(m) + 1)))
_ = [exec(f"globals()['df{idx}'] = df.loc[~m]") for idx, df in enumerate(l_df, start=1)]
# if you need a dictionary (instead of a dataframe), you can use df.loc[~m].to_dict("list")
NB: we used globals() to create the variables/sub-dataframes dynamically.
# Output:
print(df1, type(df1)), print(df2, type(df2)), print(df3, type(df3))
col A col B
0 1 5.0
1 2 7.0
2 78 65.0 <class 'pandas.core.frame.DataFrame'>
col A col B
4 5 8.0
5 15 23.0 <class 'pandas.core.frame.DataFrame'>
col A col B
7 17 15.0
8 25 62.0
9 12 15.0
10 95 56.0 <class 'pandas.core.frame.DataFrame'>
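If you would rather avoid creating variables dynamically with exec/globals, the same split can be collected into a dictionary of sub-DataFrames instead. This is only a hedged variation on the answer above, reusing the same placeholder path and the same mask logic:
import pandas as pd
import numpy as np
df = pd.read_csv("/input_file.csv")
# delimiter rows, as in the answer above
m = df["col A"].str.contains("#", na=False)
# split on the delimiter rows, then drop them from each piece
parts = [p.loc[~m] for p in np.split(df, np.flatnonzero(m) + 1)]
# keep the non-empty pieces in a dict instead of df1/df2/df3 globals
dfs = {f"df{i}": p for i, p in enumerate([p for p in parts if not p.empty], start=1)}
print(dfs["df2"])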

pandas appending a streaming data series

I am trying to append a streaming data series to a pandas dataframe.
The columns are constant. I have used the following
import pandas as pd
import random
import time
while True:
    ltp = random.randint(0, 100)
    trade = {'token': 12345, 'name': 'abc', 'ltp': ltp}
    time.sleep(2)
    df = pd.DataFrame(trade, index=[1])
    df = df.append(trade, ignore_index=True)
    print(df)
In the above, only the ltp values keep changing.
The output I get is only two rows with the same LTP, not an expanding dataframe with the new data appended.
The output is:
token name ltp
0 12345 abc 9
1 12345 abc 9
token name ltp
0 12345 abc 93
1 12345 abc 93
token name ltp
0 12345 abc 92
1 12345 abc 92
token name ltp
0 12345 abc 10
1 12345 abc 10
Further, I am not sure why the same LTP appears twice, for index 0 and 1.
Your problem is that you create a new DataFrame object on each iteration, with this line:
while True:
    ...
    df = pd.DataFrame(trade, index=[1])
    ...
You need to create the DataFrame once, before starting the while loop, like this:
import pandas as pd
import random
import time
# init new DataFrame with headers as columns
headers = ['token', 'name', 'ltp']
df = pd.DataFrame(columns=headers)
while True:
    ltp = random.randint(0, 100)
    trade = {'token': 12345, 'name': 'abc', 'ltp': ltp}
    time.sleep(2)
    df = df.append(trade, ignore_index=True)
    print(df)
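Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current pandas the loop above needs a small adjustment. A hedged sketch of the same idea, accumulating the incoming trades in a plain list and rebuilding the frame from it (appending to a list is also much cheaper than growing a DataFrame row by row):
import pandas as pd
import random
import time
rows = []  # accumulate incoming trades here
while True:
    ltp = random.randint(0, 100)
    trade = {'token': 12345, 'name': 'abc', 'ltp': ltp}
    time.sleep(2)
    rows.append(trade)
    df = pd.DataFrame(rows)  # rebuild the frame from the list of dicts
    print(df)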

alternatives to pivot very large table pandas

I have a dataframe of 25M rows x 3 columns in this format:
import pandas as pd
import numpy as np
d={'ID':['A1','A1','A2','A2','A2'], 'date':['Jan 1','Jan7','Jan4','Jan5','Jan12'],'value':[10,12,3,5,2]}
df=pd.DataFrame(data=d)
df
ID date value
0 A1 Jan 1 10
1 A1 Jan7 12
2 A2 Jan4 3
3 A2 Jan5 5
4 A2 Jan12 2
...
and I want to pivot it using:
df['date'] = pd.to_datetime(df['date'], format='%b%d')
(df.pivot(index='date', columns='ID',values='value')
.asfreq('D')
.interpolate()
.bfill()
.reset_index()
)
df.index = df.index.strftime('%b%d')
This works for 500k rows:
df3=(df.iloc[:500000,:].pivot(index='date', columns='ID',values='value')
.resample('M').mean()
.interpolate()
.bfill()
.reset_index()
)
However, with my full data set (>1M rows), it fails with:
ValueError: Unstacked DataFrame is too big, causing int32 overflow
Are there any suggestions on how I can get this to run to completion?
A further computation is performed on the wide table:
N=19/df2.iloc[0]
df2.mul(N.tolist(),axis=1).sum(1)

TypeError when using chunksize argument to pandas method pd.read_csv()

I have a csv file like this:
1 1.1 0 0.1 13.1494 32.7957 2.27266 0.2 3 5.4 ... \
0 2 2 0 8.17680 4.76726 25.6957 1.13633 0 3 4.8 ...
1 3 0 0 8.22718 2.35340 15.2934 1.13633 0 3 4.8 ...
I read the file using pandas.read_csv:
data_raw = pd.read_csv(filename, chunksize=chunksize)
Now, I want to make a dataframe:
df = pd.DataFrame(data_raw, columns=['id', 'colNam1', 'colNam2', 'colNam3',...])
But I met a problem:
File "test.py", line 143, in <module>
data = load_frame(csvfile)
File "test.py", line 53, in load_frame
'id', 'colNam1', 'colNam2', 'colNam3',...])
File "/usr/local/lib/python2.7/dist-packages/pandas/core/frame.py", line 325, in __init__
raise TypeError("data argument can't be an iterator")
TypeError: data argument can't be an iterator
I don't know why.
This is because what is returned when you pass chunksize as a param to read_csv is an iterator (a TextFileReader) rather than a DataFrame as such.
To demonstrate:
In [67]:
import io
import pandas as pd
t="""a b
0 -0.278303 -1.625377
1 -1.954218 0.843397
2 1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), chunksize=1)
df
Out[67]:
<pandas.io.parsers.TextFileReader at 0x7e9e8d0>
You can see that the df here is in this case not a DataFrame but a TextFileReader object
It's unclear to me what you're really trying to achieve, but if you want to read a specific number of rows you can pass nrows instead:
In [69]:
t="""a b
0 -0.278303 -1.625377
1 -1.954218 0.843397
2 1.213572 -0.098594"""
df = pd.read_csv(io.StringIO(t), nrows=1)
df
Out[69]:
a b
0 0 -0.278303 -1.625377
The idea with your original problem is that you need to iterate over the TextFileReader in order to get the chunks:
In [73]:
for r in df:
    print(r)
a b
0 0 -0.278303 -1.625377
a b
1 1 -1.954218 0.843397
a b
2 2 1.213572 -0.098594
If you want to generate a df from the chunks you need to append to a list and then call concat:
In [77]:
df_list = []
for r in df:
    df_list.append(r)
pd.concat(df_list)
Out[77]:
a b
0 0 -0.278303 -1.625377
1 1 -1.954218 0.843397
2 2 1.213572 -0.098594
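Coming back to the original problem: if the point of building the DataFrame was just to attach column names while still reading in chunks, one hedged option is to pass names= to read_csv and concatenate the chunks with pd.concat. The filename, chunksize and column names below are placeholders echoing the question, and this assumes the file itself has no header row:
import pandas as pd
# placeholders - reuse whatever filename/chunksize you already have
filename = 'data.csv'
chunksize = 1000
# list every column name in the file here (only the question's placeholders shown)
cols = ['id', 'colNam1', 'colNam2', 'colNam3']
# each chunk comes out with the desired column names
reader = pd.read_csv(filename, chunksize=chunksize, names=cols, header=None)
df = pd.concat(reader, ignore_index=True)
print(df.head())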

Python 2.7, np.asarray, TypeError: cannot perform reduce with flexible type

I have a .txt file with 390 rows and some 8000 columns. The data consist of only 1s and 0s separated by a white space. I want to count the number of times the number 1 appears in each column (total sum per column) for all columns. I am using numpy arrays for this. The problem is I keep getting the following error message in the script line "b = a.sum(axis=0)":
"TypeError: cannot perform reduce with flexible type"
Any suggestion would be welcome!
This is the simple code I am using:
import csv
import numpy as np
from numpy import genfromtxt
my_data = genfromtxt('test1.txt', dtype='S', delimiter=',')
a = np.asarray(my_data)
import sys
sys.stdout = open("test1.csv", "w")
b = a.sum(axis=0)
print b
An input example, test1.txt:
1 0 0 0 1 1 0 1
0 1 1 0 1 1 1 1
1 0 0 0 0 1 0 0
0 1 1 0 1 0 1 1
Expected output:
2 2 2 0 3 3 2 3
You get that error because you are importing the data with dtype='S', i.e. as strings, and NumPy cannot sum strings. You have to import the data with the proper dtype, such as int.
You don't need to import csv and you don't need to use np.asarray. Just open the file with np.genfromtxt, using delimiter=' ' and dtype=int.
Try:
import numpy as np
my_data = np.genfromtxt('test1.txt', dtype=int, delimiter=' ')
b = my_data.sum(axis=0)
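The original script redirected sys.stdout to produce test1.csv; if the goal is to write the per-column sums to that file, a hedged alternative is np.savetxt. The filenames follow the question, and reshaping to a 2-D row keeps the sums on one line, matching the expected output:
import numpy as np
my_data = np.genfromtxt('test1.txt', dtype=int, delimiter=' ')
b = my_data.sum(axis=0)
# write the sums as a single space-separated row instead of redirecting stdout
np.savetxt('test1.csv', b.reshape(1, -1), fmt='%d', delimiter=' ')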