Convert a tab- and newline-delimited string to a pandas DataFrame

I have a string of the following format:
aString = '123\t456\t789\n321\t654\t987 ...'
And I would like to convert it to a pandas DataFrame like this:
123 456 789
321 654 987
...
I have tried to convert it to a Python list:
stringList = aString.split('\n')
which results in:
stringList = ['123\t456\t789',
'321\t654\t987',
...
]
I have no idea what to do next.

One option is a list comprehension with str.split:
pd.DataFrame([x.split('\t') for x in stringList], columns=list('ABC'))
A B C
0 123 456 789
1 321 654 987
You can use StringIO with read_csv:
from io import StringIO
pd.read_csv(StringIO(aString), sep='\t', header=None)
0 1 2
0 123 456 789
1 321 654 987
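Note that read_csv infers numeric dtypes automatically, while the list-comprehension approach leaves every column as strings. A minimal sketch of the numeric conversion (the column names A, B, C are just placeholders):
import pandas as pd

stringList = ['123\t456\t789', '321\t654\t987']
frame = pd.DataFrame([x.split('\t') for x in stringList], columns=list('ABC'))
frame = frame.apply(pd.to_numeric)  # cast the string columns to numbers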

Related

Check whether a pandas DataFrame is sorted by year and quarter

I have a df that looks like below
date col1 col2
0 2000 Q1 123 456
1 2000 Q2 234 567
2 2000 Q3 345 678
3 2000 Q4 456 789
4 2001 Q1 567 890
The df has over 200 rows. I need to -
check if the data is sorted by date
if not, then sort it by date
Can someone please help me with this?
Many thanks
Use DataFrame.sort_values with the key parameter, converting the values to datetimes:
df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace(r'\s+', '', regex=True)))
print(df)
date col1 col2
0 2000 Q1 123 456
1 2000 Q2 234 567
2 2000 Q3 345 678
3 2000 Q4 456 789
4 2001 Q1 567 890
EDIT: You can use Series.is_monotonic_increasing to test whether the values are already monotonically increasing (the older Series.is_monotonic alias was removed in pandas 2.0):
if not df['date'].is_monotonic_increasing:
    df = df.sort_values('date', key=lambda x: pd.to_datetime(x.str.replace(r'\s+', '', regex=True)))
You can wrap your date column in a pd.Index (or make it the index of your DataFrame):
if not pd.Index(df['date']).is_monotonic_increasing:
    df = df.sort_values('date')
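Another option worth considering (a sketch, assuming the strings always follow the 'YYYY QN' pattern) is to parse the column as quarterly periods, which sorts in calendar order and keeps the quarter information:
# parse 'YYYY QN' strings into quarterly Periods
periods = pd.PeriodIndex(df['date'].str.replace(' ', ''), freq='Q')
if not periods.is_monotonic_increasing:
    df = df.assign(date=periods).sort_values('date')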

Getting a variable number of pandas rows w.r.t. a dictionary lookup

In this sample dataframe df:
import pandas as pd
import numpy as np
import random, string
max_rows = {'A': 3, 'B': 2, 'D': 4} # max number of rows to be extracted
data_size = 1000
df = pd.DataFrame({'symbol': pd.Series(random.choice(string.ascii_uppercase) for _ in range(data_size)),
                   'qty': np.random.randn(data_size)}).sort_values('symbol')
How can I get a DataFrame that keeps, for each symbol in the dictionary, at most the given number of rows?
I tried [df.groupby('symbol').head(i) for i in df.symbol.map(max_rows)], but it gives a RuntimeWarning and looks incorrect.
You can use concat with a list comprehension:
print(pd.concat([df.loc[df["symbol"].eq(k)].head(v) for k, v in max_rows.items()]))
symbol qty
640 A -0.725947
22 A -1.361063
190 A -0.596261
451 B -0.992223
489 B -2.014979
593 D 1.581863
600 D -2.162044
793 D -1.162758
738 D 0.345683
Adding another method using groupby + cumcount and df.query:
df.assign(v=df.groupby("symbol").cumcount()+1,k=df['symbol'].map(max_rows)).query("v<=k")
Or the same logic without assigning extra columns (thanks @jezrael):
df[df.groupby("symbol").cumcount()+1 <= df['symbol'].map(max_rows)]
symbol qty
882 A -0.249236
27 A 0.625584
122 A -1.154539
229 B -1.269212
55 B 1.403455
457 D -2.592831
449 D -0.433731
634 D 0.099493
734 D -1.551012
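Note that with both approaches, symbols that do not appear in max_rows are dropped entirely: Series.map returns NaN for missing keys and the comparison with NaN is False. If you want to keep those symbols with some default cap instead, a sketch (the default of one row is just an example):
caps = df['symbol'].map(max_rows).fillna(1)  # hypothetical default cap for unlisted symbols
out = df[df.groupby('symbol').cumcount() + 1 <= caps]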

pandas appending a streaming data series

I am trying to append a streaming data series to a pandas DataFrame.
The columns are constant. I have used the following:
import pandas as pd
import random
import time
while True:
    ltp = random.randint(0, 100)
    trade = {'token': 12345, 'name': 'abc', 'ltp': ltp}
    time.sleep(2)
    df = pd.DataFrame(trade, index=[1])
    df = df.append(trade, ignore_index=True)
    print(df)
In the above, only the ltp value changes between iterations.
The output I get is only two rows with the same LTP, not a growing DataFrame with the new data appended.
The output is:
token name ltp
0 12345 abc 9
1 12345 abc 9
token name ltp
0 12345 abc 93
1 12345 abc 93
token name ltp
0 12345 abc 92
1 12345 abc 92
token name ltp
0 12345 abc 10
1 12345 abc 10
Further, I am not sure why the same LTP appears twice, at index 0 and index 1.
Your problem is that you create a new DataFrame object on each iteration, on this line:
while True:
    ...
    df = pd.DataFrame(trade, index=[1])
    ...
You need to create the DataFrame before starting the while loop, like this:
import pandas as pd
import random
import time

# init new DataFrame with headers as columns
headers = ['token', 'name', 'ltp']
df = pd.DataFrame(columns=headers)

while True:
    ltp = random.randint(0, 100)
    trade = {'token': 12345, 'name': 'abc', 'ltp': ltp}
    time.sleep(2)
    df = df.append(trade, ignore_index=True)
    print(df)
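Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. A sketch of the same idea for current pandas versions, accumulating the incoming trades in a plain list and rebuilding the frame (rebuilding on every tick is fine for small streams; for heavier streams you would batch the updates):
import pandas as pd
import random
import time

rows = []  # accumulate incoming trades as dicts
while True:
    ltp = random.randint(0, 100)
    rows.append({'token': 12345, 'name': 'abc', 'ltp': ltp})
    time.sleep(2)
    df = pd.DataFrame(rows)  # rebuild the frame from all rows seen so far
    print(df)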

Pandas - groupby and count string occurrences over a column

I have a df like this:
import pandas as pd

df = pd.DataFrame({
    'Concat': ['abc', 'abc', 'abc', 'abc', 'def', 'def', 'def'],
    'SearchTerm': ['aa', 'aab', 'aac', 'ddd', 'cef', 'plo', 'cefa'],
})
print(df)
Concat SearchTerm
0 abc aa
1 abc aab
2 abc aac
3 abc ddd
4 def cef
5 def plo
6 def cefa
I want to group the df by Concat and count how many times each SearchTerm appears as a substring within the SearchTerms of that group. So the final result should look like this:
Concat SearchTerm Count
0 abc aa 3
1 abc aab 1
2 abc aac 1
3 abc ddd 1
4 def cef 2
5 def plo 1
6 def cefa 1
For Concat abc, aa is found 3 times among the 4 SearchTerms. I can get the solution using a loop, but for my larger dataset, it is too slow.
I have tried two solutions from this thread and this thread.
df['Count'] = df['SearchTerm'].str.contains(df['SearchTerm']).groupby(df['Concat']).sum()
df['Count'] = df.groupby(['Concat'])['SearchTerm'].transform(lambda x: x[x.str.contains(x)].count())
In either case, there is a TypeError:
'Series' objects are mutable, thus they cannot be hashed
Any help would be appreciated.
Use transform and a list comprehension:
s = df.groupby('Concat').SearchTerm.transform('|'.join)
df['Count'] = [s[i].count(term) for i, term in enumerate(df.SearchTerm)]
Concat SearchTerm Count
0 abc aa 3
1 abc aab 1
2 abc aac 1
3 abc ddd 1
4 def cef 2
5 def plo 1
6 def cefa 1
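On a large DataFrame it can be cheaper to build one joined string per group rather than one per row via transform. A sketch of the same logic (assuming the SearchTerm values are matched as literal substrings):
joined = df.groupby('Concat')['SearchTerm'].agg('|'.join)  # one joined string per group
df['Count'] = [joined[c].count(t) for c, t in zip(df['Concat'], df['SearchTerm'])]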

Fast conversion from a string array to a pandas DataFrame

I have a string array where each element of the array is a row of a csv file (comma separated). I want to convert this into a pandas DataFrame. However, when I tried building it row by row, it was very slow. Can a faster alternative be proposed, apart from writelines() followed by pandas.read_csv()?
CSV Import
In pandas you can read an entire csv at once without looping over the lines.
Use read_csv, passing a filename (or any file-like object) as the argument:
import pandas as pd
from io import StringIO
# Set up fake csv data as test for example only
fake_csv = '''
Col_0,Col_1,Col_2,Col_3
0,0.5,A,123
1,0.2,J,234
2,1.4,F,345
3,0.7,E,456
4,0.4,G,576
5,0.8,T,678
6,1.6,A,789
'''
# Read in whole csv to DataFrame at once
# StringIO is for example only
# Normally you would load your file with
# df = pd.read_csv('/path/to/your/file.csv')
df = pd.read_csv(StringIO(fake_csv))
print('DataFrame from CSV:')
print(df)
DataFrame from CSV:
Col_0 Col_1 Col_2 Col_3
0 0 0.5 A 123
1 1 0.2 J 234
2 2 1.4 F 345
3 3 0.7 E 456
4 4 0.4 G 576
5 5 0.8 T 678
6 6 1.6 A 789
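To apply this directly to the string array from the question, join the rows with newlines and feed the result to read_csv through StringIO, which avoids writing anything to disk (a sketch; header=None assumes the rows carry no header line):
from io import StringIO
import pandas as pd

rows = ['0,0.5,A,123', '1,0.2,J,234', '2,1.4,F,345']  # example row strings
df = pd.read_csv(StringIO('\n'.join(rows)), header=None)
print(df)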