Perform csv sanitisation in linear time - pandas

I'm using the HTC anemometer and it gives me data in the following order, where in two of the columns are merged into one and there's some useless data I want to exclude.
The data looks like below
"NO.","T&RH","DATA","UNIT","TIME"
1," 27�C 70.5%",0,"m/s","30-11-2020\15:33:34"
2," 27�C 70.5%",0,"m/s","30-11-2020\15:33:35"
3," 27�C 70.5%",0,"m/s","30-11-2020\15:33:36"
4," 27�C 70.5%",0,"m/s","30-11-2020\15:33:37"
...
...
When I try to load it into a pandas data-frame, there's all kind of weird errors.
I've come up with the following code to clean the data and export it as a df.
import pandas as pd
def _formathtc(text_data:list) ->pd.DataFrame:
data = []
for l in rawdata:
d = []
l = l.split(",")
try:
_,t,h = l[1].strip('"').split(" ")
d.append(t.replace("°C",""))
d.append(h.replace("%",""))
d.append(l[2])
d.append(l[-1].strip('\n'))
data.append(d)
except Exception as e:
pass
df = pd.DataFrame(data=data)
df.columns=['temp','relhum','data','time']
return df
def gethtc(filename:str)->pd.DataFrame:
text_data = open(filename, "r", encoding="iso-8859-1").readlines()
return _formathtc(text_data)
df = gethtc(somefilename)
My problem is that the above shown operations operate in linear time, i.e., as the file grows in size more is the time take to extract out the info and get that data-frame.
How can I make it more efficient?

You can use pd.read_csv in place of the DataFrame constructor here. There are a ton of options (including encoding, and engine quotechar which may be helpful). At least pandas does all the parsing for you, and probably has better performance (esp. setting engine="c"). If this doesn't help with performance, I'm not sure there is a better native pandas option:
df = pd.read_csv("htc.csv", engine="c")
df["TIME"] = pd.to_datetime(df.TIME.str.replace("\\", " "))
df["T&RH"] = df['T&RH'].str.replace("�", "")
output:
NO. T&RH DATA UNIT TIME
0 1 27C 70.5% 0 m/s 2020-11-30 15:33:34
1 2 27C 70.5% 0 m/s 2020-11-30 15:33:35
2 3 27C 70.5% 0 m/s 2020-11-30 15:33:36
3 4 27C 70.5% 0 m/s 2020-11-30 15:33:37
The post-processing is optional of course, but I don't think should slow things down much.

Related

Dask appropriate for my goal? ```Compute()``` taking very long

I am doing the following in Dask as the df dataframe has 7 million rows and 50 columns so pandas is extremely slow. However, I might not be using Dask correctly or Dask might not be appropriate for my goal. I need to do some preprocessing on the df dataframe, which is mainly creating some new columns. And then eventually saving the df (I am saving to csv but I have also tried parquet). However, before I save, I believe I have to do compute(). And compute() is taking very long -- I left it running for 3 hours and it still wasn't done. I tried to persist() throughout the calculations but persist() also took a long time. Is this expected with Dask given the size of my data? Could this be because of the number of partitions (I have 20 logical processor and dask is using 24 partitions -- I have 128 GB of memory if this helps too)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match

from dask_ml.preprocessing import LabelEncoder



df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
right_on=['country', 'region'])
df['actual__adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']

df['stopped_tom'] = 1 * (df['stopped'] > 0)

def func(df):
df = df.sort_values('region')
df['first_established'] = 1 * (df['region_d']==df['region_d'].min())
df['last_established'] = 1 * (df['region_d']==df['region_d'].max())
df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
df['time_1'] = df['time_book'].shift(1, fill_value=0)
df['time_2'] = df['time_book'].shift(-1, fill_value=0)
df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
return df

df = df.groupby('country').apply(func).reset_index(drop=True)
df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])

df['Investment'] = df['lor_index'].values * 1000
df = df.compute().to_csv("path")
Saving to csv or parquet will by default trigger computation, so the last line should be:
df = df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
right_on=['country', 'region'])
If one of the dfs is small enough to fit in memory, then it would be good to keep it as a pandas dataframe, see further tips in the documentation.

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a':[['1234','abc','444'],
['5678'],
['2468','def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index,row in t.iterrows():
if (len(row['a']) > 1):
_.append(row['a'][1])
else:
_.append(np.nan)
t['element_two'] = _
And I gave an attempt using np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use str accesor :
n = 2
t['second'] = t['a'].str[n-1]
print(t)
a second
0 [1234, abc, 444] abc
1 [5678] NaN
2 [2468, def] def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])

pandas resample when cumulative function returns data frame

I would like to use resampling function from pandas but applying my own custom function. The problem I'm facing is that the custom function returns a pandas Data Frame instead of a single array.
The following example illustrate my problem:
>>> import pandas as pd
>>> import numpy as np
>>> def f(data):
... return ((1+data).cumprod(axis=0)-1)
...
>>> data = np.random.randn(1000,3)
>>> index = pd.date_range("20170101", periods = 1000, freq="B")
>>> df = pd.DataFrame(data= data, index =index)
Now suppose I want to resample the business days to business end month frequency:
>>> resampler = df.resample("BM")
If I apply now the my function f I don't get the desired result. I would like to get the last row of my output from f.
>>> resampler.apply(f)
this is becaumes the cumprod in my function f returns a pandas data frame. I could write my f such that it returns just the last row. However, I would like to use this function in other places as well to return the whole Data Frame. This could be solved via introducing a flag like "last_row" in the function f which steers to return the complete or just the last row. But this solutions seem rather nasty.
Just define your function f with a last_row parameter. You can default it to False so that it returns the entire dataframe. When True it returns the last row
def f(data, last_row=False):
df = ((1+data).cumprod(axis=0)-1)
if last_row:
return df.iloc[-1]
return df
Get the last row
df.resample('BM').apply(f, last_row=True)
0 1 2
2017-01-31 0.185662 -0.580058 -1.004879
2017-02-28 -1.004035 -0.999878 17.059846
2017-03-31 -0.995280 -1.000001 -1.000507
2017-04-28 -1.000656 -240.369487 -1.002645
2017-05-31 47.646827 -72.042190 -1.000016
....
Return all the rows as you already did.
df.resample('BM').apply(f)
I think you could refactor in the following way, which will be much faster for larger dataframes:
(1+df).resample('BM').prod() - 1
0 1 2
2017-01-31 -0.999436 -1.259078 -1.000215
2017-02-28 -1.221404 0.342863 9.841939
2017-03-31 -0.820196 -1.002598 -0.450662
2017-04-28 -1.000299 2.739184 -1.035557
2017-05-31 -0.999986 -0.920445 -2.103289
That gives the same answer as #TedPetrou although you can't tell because we used different random seeds, but you can easily test this yourself. Though actually, I'm still sorting out why this gives the same answer via prod() rather than cumprod(). Anyway, as you can see this is a mix of intuition and reverse engineering I'm using here and will update as I double check things...
For this relatively small dataframe with 1,000 rows, this way is only around twice as fast, but if you increase the rows you'll find this way scales much better (about 250x faster at 10,000 rows).
Alternative approaches: These give different answers from the above (and from each other) but I wonder if they might be closer to what you are looking for?
(1+df).resample('BM').mean().expanding().apply( lambda x: x.prod() - 1)
(1+df).expanding().apply( lambda x: x.prod() - 1).resample('BM').mean()

There are three problems(Load database, loop, and append series)

Unlike when I started, I found this problem to be a more difficult problem than I thought.
I want to refer to a particular column content from the SQLite database, make it into a Series, and then combine it into a single data frame.
I have tried like this but faild:
import pandas as pd
from pandas import Series, DataFrame
import sqlite3
con = sqlite3.connect("C:/Users/Kun/Documents/Dashin/data.db") #my sqldb
tmplist = ['A003060','A003070'] #db contains that table,I decided to call
#only two for practice.
for i in tmplist:
tmpSeries =pd.Series([])
listSeries = pd.read_sql("SELECT * FROM %s " %(i), con , index_col =
None)['Close'].head(5)
tmpSeries2 = tmpSeries.append(listSeries)
print(tmpSeries2)
that code result show only dummy thing like this:
0 7150.0
1 6770.0
2 7450.0
3 7240.0
4 6710.0
dtype: float64
0 14950.0
1 15500.0
2 15000.0
3 14800.0
4 14500.0
What I want to do is like this:
A003060 A003070
0 7150.0 14950.0
1 6770.0 15500.0
2 7450.0 15000.0
3 7240.0 14800.0
4 6710.0 14500.0
I had a similar question ahead and got a answer. But The last question is
using predefined variables. But I must use loop because I have to deal with a series of large databases. I have already tried another effort using dataframe.append, transpose(). But I failed.
I would appreciate some small hints. Thank you.
To append pandas series using for loop
I think you can create list, then append data and last use concat:
dfs = []
for i in tmplist:
tmpSeries =pd.Series([])
listSeries = pd.read_sql("SELECT * FROM %s " %(i) con,index_col = None)['Close'].head(5)
dfs.append(listSeries)
df = pd.concat(dfs, axis=1, keys=tmplist)
print(df)

Reading variable column and row structure to Pandas by column amount

I need to create a Pandas DataFrame from a large file with space delimited values and row structure that is depended on the number of columns.
Raw data looks like this:
2008231.0 4891866.0 383842.0 2036693.0 4924388.0 375170.0
On one line or several, line breaks are ignored.
End result looks like this, if number of columns is three:
[(u'2008231.0', u'4891866.0', u'383842.0'),
(u'2036693.0', u'4924388.0', u'375170.0')]
Splitting the file into rows is depended on the number of columns which is stated in the meta part of the file.
Currently I split the file into one big list and split it into rows:
def grouper(n, iterable, fillvalue=None):
"Collect data into fixed-length chunks or blocks"
# grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
args = [iter(iterable)] * n
return izip_longest(fillvalue=fillvalue, *args)
(code is from itertools examples)
Problem is, I end up with multiple copies of the data in memory. With 500MB+ files this eats up the memory fast and Pandas has some trouble reading lists this big with large MultiIndexes.
How can I use Pandas file reading functionality (read_csv, read_table, read_fwf) with this kind of data?
Or is there an other way of reading data into Pandas without auxiliary data structures?
Although it is possible to create a custom file-like object, this will be very slow compared to the normal usage of pd.read_table:
import pandas as pd
import re
filename = 'raw_data.csv'
class FileLike(file):
""" Modeled after FileWrapper
http://stackoverflow.com/a/14279543/190597 (Thorsten Kranz)
"""
def __init__(self, *args):
super(FileLike, self).__init__(*args)
self.buffer = []
def next(self):
if not self.buffer:
line = super(FileLike, self).next()
self.buffer = re.findall(r'(\S+\s+\S+\s+\S+)', line)
if self.buffer:
line = self.buffer.pop()
return line
with FileLike(filename, 'r') as f:
df = pd.read_table(f, header=None, delimiter='\s+')
print(len(df))
When I try using FileLike on a 5.8M file (consisting of 200000 lines), the above code takes 3.9 seconds to run.
If I instead preprocess the data (splitting each line into 2 lines and writing the result to disk):
import fileinput
import sys
import re
filename = 'raw_data.csv'
for line in fileinput.input([filename], inplace = True, backup='.bak'):
for part in re.findall(r'(\S+\s+\S+\s+\S+)', line):
print(part)
then you can of course load the data normally into Pandas using pd.read_table:
with open(filename, 'r') as f:
df = pd.read_table(f, header=None, delimiter='\s+')
print(len(df))
The time required to rewrite the file was ~0.6 seconds, and now loading the DataFrame took ~0.7 seconds.
So, it appears you will be better off rewriting your data to disk first.
I don't think there is a way to seperate rows with the same delimiter as columns.
One way around this is to reshape (this will most likely be a copy rather than a view, to keep the data contiguous) after creating a Series using read_csv:
s = pd.read_csv(file_name, lineterminator=' ', header=None)
df = pd.DataFrame(s.values.reshape(len(s)/n, n))
In your example:
In [1]: s = pd.read_csv('raw_data.csv', lineterminator=' ', header=None, squeeze=True)
In [2]: s
Out[2]:
0 2008231
1 4891866
2 383842
3 2036693
4 4924388
5 375170
Name: 0, dtype: float64
In [3]: pd.DataFrame(s.values.reshape(len(s)/3, 3))
Out[3]:
0 1 2
0 2008231 4891866 383842
1 2036693 4924388 375170