There are three problems (load database, loop, and append Series) - pandas

This turned out to be more difficult than I expected when I started.
I want to read a particular column from each table in an SQLite database, turn each into a Series, and then combine them into a single DataFrame.
I have tried the following, but it failed:
import pandas as pd
from pandas import Series, DataFrame
import sqlite3

con = sqlite3.connect("C:/Users/Kun/Documents/Dashin/data.db")  # my SQLite db
tmplist = ['A003060', 'A003070']  # the db contains these tables; I picked only two for practice

for i in tmplist:
    tmpSeries = pd.Series([])
    listSeries = pd.read_sql("SELECT * FROM %s" % (i), con, index_col=None)['Close'].head(5)
    tmpSeries2 = tmpSeries.append(listSeries)
    print(tmpSeries2)
That code only prints each Series separately, like this:
0 7150.0
1 6770.0
2 7450.0
3 7240.0
4 6710.0
dtype: float64
0 14950.0
1 15500.0
2 15000.0
3 14800.0
4 14500.0
What I want to do is like this:
A003060 A003070
0 7150.0 14950.0
1 6770.0 15500.0
2 7450.0 15000.0
3 7240.0 14800.0
4 6710.0 14500.0
I asked a similar question earlier and got an answer, but that answer relied on predefined variables. Here I must use a loop because I have to deal with a series of large databases. I have also tried dataframe.append and transpose(), without success.
I would appreciate some small hints. Thank you.
To append pandas Series using a for loop

I think you can create a list, append the data to it inside the loop, and finally use concat:
dfs = []
for i in tmplist:
    listSeries = pd.read_sql("SELECT * FROM %s" % (i), con, index_col=None)['Close'].head(5)
    dfs.append(listSeries)

df = pd.concat(dfs, axis=1, keys=tmplist)
print(df)
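As a minor variation on the same concat idea (just a sketch, not a different method), the loop can also be written as a dict comprehension. Note that sqlite3 cannot take table names as query parameters, which is why the name is interpolated into the SQL string; only do this with trusted table names.
series_by_table = {
    name: pd.read_sql("SELECT * FROM %s" % name, con, index_col=None)['Close'].head(5)
    for name in tmplist
}
df = pd.concat(series_by_table, axis=1)  # dict keys become the column names
print(df)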

Related

Is Dask appropriate for my goal? compute() taking very long

I am doing the following in Dask, as the df dataframe has 7 million rows and 50 columns, so pandas is extremely slow. However, I might not be using Dask correctly, or Dask might not be appropriate for my goal. I need to do some preprocessing on the df dataframe, which is mainly creating some new columns, and then eventually save df (I am saving to csv but I have also tried parquet). However, before I save, I believe I have to call compute(), and compute() is taking very long -- I left it running for 3 hours and it still wasn't done. I tried to persist() throughout the calculations, but persist() also took a long time. Is this expected with Dask given the size of my data? Could this be because of the number of partitions (I have 20 logical processors and Dask is using 24 partitions -- I have 128 GB of memory if this helps too)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match

from dask_ml.preprocessing import LabelEncoder


df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")

df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])

df['actual_adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
    df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']

df['stopped_tom'] = 1 * (df['stopped'] > 0)

def func(df):
    df = df.sort_values('region')
    df['first_established'] = 1 * (df['region_d'] == df['region_d'].min())
    df['last_established'] = 1 * (df['region_d'] == df['region_d'].max())
    df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
    df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
    df['time_1'] = df['time_book'].shift(1, fill_value=0)
    df['time_2'] = df['time_book'].shift(-1, fill_value=0)
    df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
    return df

df = df.groupby('country').apply(func).reset_index(drop=True)
df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])

df['Investment'] = df['lor_index'].values * 1000
df = df.compute().to_csv("path")
Saving to csv or parquet will by default trigger computation, so the last line should be:
df = df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
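For reference, any of these writes the result without an explicit compute() (a sketch; the paths are placeholders):
df.to_csv("path_*.csv")                  # one csv per partition: path_0.csv, path_1.csv, ...
df.to_csv("path.csv", single_file=True)  # or collect everything into a single file
df.to_parquet("path_parquet")            # parquet: one part file per partition in a directory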
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
If one of the dfs is small enough to fit in memory, then it would be good to keep it as a pandas DataFrame; see further tips in the documentation.
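A minimal sketch of that approach, assuming data2.csv is the small one (the file names come from the question; which file actually fits in memory is an assumption):
import dask.dataframe as dd
import pandas as pd

df1 = dd.read_csv("data1.csv")   # large: stays a lazy, partitioned Dask DataFrame
df2 = pd.read_csv("data2.csv")   # assumed small: loaded eagerly with pandas

# Dask can merge a Dask DataFrame with an in-memory pandas DataFrame; the small
# frame is effectively broadcast to each partition, avoiding an expensive shuffle.
df = df1.merge(df2, how='inner', on=['country', 'region'])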

Perform csv sanitisation in linear time

I'm using the HTC anemometer and it gives me data in the following format, where two of the columns are merged into one and there's some useless data I want to exclude.
The data looks like below
"NO.","T&RH","DATA","UNIT","TIME"
1," 27�C 70.5%",0,"m/s","30-11-2020\15:33:34"
2," 27�C 70.5%",0,"m/s","30-11-2020\15:33:35"
3," 27�C 70.5%",0,"m/s","30-11-2020\15:33:36"
4," 27�C 70.5%",0,"m/s","30-11-2020\15:33:37"
...
...
When I try to load it into a pandas DataFrame, there are all kinds of weird errors.
I've come up with the following code to clean the data and export it as a df.
import pandas as pd

def _formathtc(text_data: list) -> pd.DataFrame:
    data = []
    for l in text_data:
        d = []
        l = l.split(",")
        try:
            _, t, h = l[1].strip('"').split(" ")
            d.append(t.replace("°C", ""))
            d.append(h.replace("%", ""))
            d.append(l[2])
            d.append(l[-1].strip('\n'))
            data.append(d)
        except Exception as e:
            pass
    df = pd.DataFrame(data=data)
    df.columns = ['temp', 'relhum', 'data', 'time']
    return df

def gethtc(filename: str) -> pd.DataFrame:
    text_data = open(filename, "r", encoding="iso-8859-1").readlines()
    return _formathtc(text_data)

df = gethtc(somefilename)
My problem is that the operations above run in linear time, i.e., as the file grows in size, so does the time taken to extract the info and build the DataFrame.
How can I make it more efficient?
You can use pd.read_csv in place of the DataFrame constructor here. There are a ton of options (including encoding, engine, and quotechar, which may be helpful). At least pandas does all the parsing for you, and it probably has better performance (esp. setting engine="c"). If this doesn't help with performance, I'm not sure there is a better native pandas option:
df = pd.read_csv("htc.csv", engine="c")
df["TIME"] = pd.to_datetime(df.TIME.str.replace("\\", " "))
df["T&RH"] = df['T&RH'].str.replace("�", "")
output:
NO. T&RH DATA UNIT TIME
0 1 27C 70.5% 0 m/s 2020-11-30 15:33:34
1 2 27C 70.5% 0 m/s 2020-11-30 15:33:35
2 3 27C 70.5% 0 m/s 2020-11-30 15:33:36
3 4 27C 70.5% 0 m/s 2020-11-30 15:33:37
The post-processing is optional of course, but I don't think it should slow things down much.
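If you also want the temperature and humidity as separate numeric columns, as in the original _formathtc function, a small follow-up along these lines should do it (a sketch; the regex and the column names temp/relhum are my choice):
# split the cleaned "27C 70.5%" strings into two numeric columns
extracted = df["T&RH"].str.extract(r"(?P<temp>[\d.]+)C\s+(?P<relhum>[\d.]+)%")
df[["temp", "relhum"]] = extracted.astype(float)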

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a': [['1234', 'abc', '444'],
                        ['5678'],
                        ['2468', 'def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index, row in t.iterrows():
    if len(row['a']) > 1:
        _.append(row['a'][1])
    else:
        _.append(np.nan)
t['element_two'] = _
And I gave an attempt using np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use the str accessor:
n = 2
t['second'] = t['a'].str[n-1]
print(t)
a second
0 [1234, abc, 444] abc
1 [5678] NaN
2 [2468, def] def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])
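For what it's worth, the np.where attempt from the question also works once both branches are array-like instead of a lambda, although the str accessor alone already yields NaN for the short lists, so np.where is largely redundant here (a sketch):
t['element_two'] = np.where(t['a'].str.len() > 1, t['a'].str[1], np.nan)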

Looking up multiple values from a pandas DataFrame

I have been struggling to find an elegant way of looking up multiple values from a pandas DataFrame. Assume we have a dataframe df that holds the “result” R, that depends on multiple index keys, and we have another dataframe keys where each row is a list of values to look up from df. The problem is to loop over the keys and look up the corresponding value from df. If the value does not exist in df, I expect to get a np.nan.
So far I have come up with three different methods, but I feel that all of them lack elegance. So my question is there another prettier method for multiple lookups? Note that the three methods below all give the same result.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': range(5),
                   'B': range(10, 15),
                   'C': range(100, 105),
                   'R': np.random.rand(5)}).set_index(['A', 'B', 'C'])
print('df')
print(df)

keys = pd.DataFrame({'A': [0, 0, 5], 'B': [10, 10, 10], 'C': [100, 100, 100]})
print('--')
print('keys')
print(keys)

# By merge
print('--')
print(pd.merge(df.reset_index(), keys, on=['A', 'B', 'C'], how='right').reset_index().R)

# By reindex
print('--')
print(df.reindex(keys.set_index(['A', 'B', 'C']).index).reset_index().R)

# By apply
print('--')
print(keys.apply(lambda s: df.R.get((s.A, s.B, s.C)), axis=1).to_frame('R').R)
I think update is pretty.
result = keys.set_index(['A', 'B', 'C'])  # looks like R
result['R'] = np.nan                      # add NaN column
Then use update:
result.update(df)
                 R
A B  C
0 10 100  0.068085
     100  0.068085
5 10 100       NaN
I found an even simpler solution:
keys = (pd.DataFrame({'A': [0, 0, 5], 'B': [10, 10, 10], 'C': [100, 100, 100]})
        .set_index(['A', 'B', 'C']))
keys['R'] = df
or similarly (and more chaining compatible):
keys.assign(R = df)
That's all that is needed. The automatic alignment of the index does the rest of the work! :-)

How to avoid temporary variables when creating new column via groupby.apply

I would like to create a new column newcol in a dataframe df as the result of
df.groupby('keycol').apply(somefunc)
The obvious:
df['newcol'] = df.groupby('keycol').apply(somefunc)
does not work: either df['newcol'] ends up containing all NaNs (which is certainly not what the RHS evaluates to), or some exception is raised (the details of the exception vary wildly depending on what somefunc returns).
I have tried many variations of the above, including stuff like
import pandas as pd
df['newcol'] = pd.Series(df.groupby('keycol').apply(somefunc), index=df.index)
They all fail.
The only thing that has worked requires defining an intermediate variable:
import pandas as pd
tmp = df.groupby('keycol').apply(lambda x: pd.Series(somefunc(x)))
tmp.index = df.index
df['rank'] = tmp
Is there a way to achieve this without having to create an intermediate variable?
(The documentation for GroupBy.apply is almost content-free.)
Let's build up an example and I think I can illustrate why your first attempts are failing:
Example data:
import numpy as np
import pandas as pd

n = 25
df = pd.DataFrame({'expenditure': np.random.choice(['foo', 'bar'], n),
                   'groupid': np.random.choice(['one', 'two'], n),
                   'coef': np.random.randn(n)})
print(df.head(10))
results in:
coef expenditure groupid
0 0.874076 bar one
1 -0.972586 foo two
2 -0.003457 bar one
3 -0.893106 bar one
4 -0.387922 bar two
5 -0.109405 bar two
6 1.275657 foo two
7 -0.318801 foo two
8 -1.134889 bar two
9 1.812964 foo two
So if we apply a simple function, mean, to the grouped data we get the following:
df2 = df.groupby('groupid').apply(np.mean)
print(df2)
Which is:
coef
groupid
one -0.215539
two 0.149459
So the dataframe above is indexed by groupid and has one column, coef.
What you tried to do first was, effectively, the following:
df['newcol'] = df2
That gives all NaNs for newcol. Honestly, I have no idea why that doesn't throw an error, or why it produces anything at all. I think what you really want to do is merge df2 back into df.
To merge df and df2 we need to remove the index from df2, rename the new column, then merge:
df2 = df.groupby('groupid').apply(np.mean)
df2.reset_index(inplace=True)
df2.columns = ['groupid', 'newcol']
df.merge(df2)
which I think is what you were after.
This is such a common idiom that Pandas includes the transform method which wraps all this up into a much simpler syntax:
df['newcol'] = df.groupby('groupid').transform(np.mean)
print(df.head())
results:
coef expenditure groupid newcol
0 1.705825 foo one -0.025112
1 -0.608750 bar one -0.025112
2 -1.215015 bar one -0.025112
3 -0.831478 foo two -0.073560
4 2.174040 bar one -0.025112
Better documentation is in the pandas groupby docs.
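As a side note, in current pandas this idiom is usually written against the specific column, which avoids transforming columns you don't care about (a sketch using the example frame above):
# compute the per-group mean of 'coef' and broadcast it back to the original rows
df['newcol'] = df.groupby('groupid')['coef'].transform('mean')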