Correct way of iterating over pandas dataframe by date - pandas

I want to iterate over a dataframe's major axis date by date.
Example:
tdf = df.ix[date]
The issue I am having is that the type returned by df.ix changes, leaving me with 3 possible situations
If the date does not exist in df's index, an error is thrown: KeyError: 1394755200000000000
If there is only one item in tdf: print type(tdf) returns
<class 'pandas.core.series.Series'>
If there is more than one item in tdf: print type(tdf) returns
<class 'pandas.core.frame.DataFrame'>
To avoid the first case I can simply wrap this in a try/except block, or, thanks to jxstanford, I can avoid the try/except altogether by using if date in df.index:
Afterwards I still run into the issue of an inconsistent API between a pandas Series and a pandas DataFrame. I could solve this by checking types, but it seems I shouldn't have to do that. I would ideally like to keep the types the same. Is there a better way of doing this?
I'm running pandas 0.13.1 and I am currently loading my data from a CSV using DataFrame.from_csv. Here's a full example demonstrating the problem.
from pandas import DataFrame
import datetime

path_to_csv = '/home/n/Documents/port/test.csv'
df = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')
start_dt = df.index.min()
end_dt = df.index.max()
dt_step = datetime.timedelta(days=1)
df.sort_index(inplace=True)

cur_dt = start_dt
while cur_dt != end_dt:
    if cur_dt in df.index:
        print type(df.ix[cur_dt])
        # run some other steps using cur_dt
    cur_dt += dt_step
An example CSV that demonstrates the problem is as follows:
value1,value2,value3,Date,type
1,2,4,03/13/14,a
2,3,3,03/21/14,b
3,4,2,03/21/14,a
4,5,1,03/27/14,b
The above code prints out
<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>
Is it possible to get the value of value1 from tdf in a consistent manner, or am I stuck writing an if statement and handling each case separately?
if type(df.ix[cur_dt]) == DataFrame:
    ....
if type(df.ix[cur_dt]) == Series:
    ....
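One workaround worth noting (a hedged sketch, not from the original post): indexing with a one-element list of labels returns a DataFrame even when only a single row matches, so value1 can always be read the same way:

cur_dt = start_dt
while cur_dt != end_dt:
    if cur_dt in df.index:
        tdf = df.ix[[cur_dt]]      # list indexing -> always a DataFrame
        values = tdf['value1']     # a Series, even when only one row matches
    cur_dt += dt_step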

Not sure what you're trying to do with the dataframe, but this might be better than a try/except:
df = DataFrame.from_csv(path_to_csv, index_col=3, header=0, parse_dates=True, sep=',')
while cur_dt != end_dt:
    if cur_dt in df.index:
        pass  # do your thing
    cur_dt += dt_step

This toy code will return DataFrames consistently.
def framer(rows):
    if rows.ndim == 1:            # a Series came back for a single-row date
        return rows.to_frame().T
    else:
        return rows

for cur_date in df.index:
    print type(framer(df.ix[cur_date]))
And this will give you the missing days:
df.resample(rule='D')
Have a look at the resample method docstring. It has its own options to fill in the missing data. And if you decide to collapse multiple rows for the same date into a single one, the method you're looking for is groupby (if you want to combine values across rows) or drop_duplicates (if you want to ignore them). There is no need to reinvent the wheel.
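As a minimal sketch of those two options, building on the example CSV above (written with present-day keyword names; very old pandas releases spell some arguments differently):

# combine values across rows that share a date, e.g. by summing the numeric columns
combined = df.groupby(level=0)[['value1', 'value2', 'value3']].sum()

# or ignore the duplicates, keeping only the first row seen for each date
deduped = df.reset_index().drop_duplicates(subset='Date').set_index('Date')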

You can use the DataFrame's apply method with axis=1 to work on each row and build a Series with the same index.
e.g.
def calculate_value(row):
    if row.date == pd.datetime(2014,3,21):
        return 0
    elif row.type == 'a':
        return row.value1 + row.value2 + row.value3
    else:
        return row.value1 * row.value2 * row.value3

df['date'] = df.index
df['NewValue'] = df.apply(calculate_value, axis=1)
modifies your example input as follows
value1 value2 value3 type NewValue date
Date
2014-03-13 1 2 4 a 7 2014-03-13
2014-03-21 2 3 3 b 0 2014-03-21
2014-03-21 3 4 2 a 0 2014-03-21
2014-03-27 4 5 1 b 20 2014-03-27
[4 rows x 6 columns]

Related

Dataframe Column is not Read as List in Lambda Function

I have a dataframe whose column contains list values; let us call it df1:
Text
-------
["good", "job", "we", "are", "so", "proud"]
["it", "was", "his", "honor", "as", "well", "as", "guilty"]
And also another dataframe, df2:
Word Value
-------------
good 7.47
proud 8.03
honor 7.66
guilty 2.63
I want to use apply plus a lambda function to create df1['score'], where each value is derived by aggregating, per list in df1, the values of the words that are found in df2's Word column. Currently, this is my code:
def score(list_word):
    sum = count = mean = sd = 0
    for word in list_word:
        if word in df2['Word']:
            sum = sum + df2.loc[df2['Word'] == word, 'Value'].iloc[0]
            count = count + 1
    if count != 0:
        return sum/count
    else:
        return 0

df['score'] = df.apply(lambda x: score(x['words']), axis=1)
This is what I envision:
Score
-------
7.75 #average of good (7.47) and proud (8.03)
5.145 #average of honor (7.66) and guilty (2.63)
However, it seems x['words'] is not passed as a list object, and I do not know how to modify the score function to handle the object type it receives. I tried converting it with the tolist() method, but to no avail. Any help is appreciated.
Given your df1 and df2, use explode and map. Notice that explode is only available from pandas 0.25 onwards:
# If the lists are stored as strings, we need to bring them back to real lists with ast:
#import ast
#df1.Text = df1.Text.apply(ast.literal_eval)
s = df1.Text.explode().map(dict(zip(df2.Word, df2.Value))).mean(level=0)
0 7.750
1 5.145
Name: Text, dtype: float64
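If you are on a pandas release older than 0.25 (no explode), a rough equivalent of the same idea (a sketch, assuming df1.Text already holds real lists rather than strings) could be:

import numpy as np

word_values = dict(zip(df2.Word, df2.Value))
# rows whose words have no match in df2 come out as NaN
df1['score'] = df1.Text.apply(
    lambda words: np.mean([word_values[w] for w in words if w in word_values]))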
Update
df1.Text.explode().to_frame('Word').reset_index().merge(df2,how='left').groupby('index').mean()
Value
index
0 7.750
1 5.145

Indexing lists in a Pandas dataframe column based on variable length

I've got a column in a Pandas dataframe comprised of variable-length lists and I'm trying to find an efficient way of extracting elements conditional on list length. Consider this minimal reproducible example:
t = pd.DataFrame({'a': [['1234','abc','444'],
                        ['5678'],
                        ['2468','def']]})
Say I want to extract the 2nd element (where relevant) into a new column, and use NaN otherwise. I was able to get it in a very inefficient way:
_ = []
for index, row in t.iterrows():
    if len(row['a']) > 1:
        _.append(row['a'][1])
    else:
        _.append(np.nan)
t['element_two'] = _
I also made an attempt using np.where(), but I'm not specifying the 'if' argument correctly:
np.where(t['a'].str.len() > 1, lambda x: x['a'][1], np.nan)
Corrections and tips to other solutions would be greatly appreciated! I'm coming from R where I take vectorization for granted.
I'm on pandas 0.25.3 and numpy 1.18.1.
Use the str accessor:
n = 2
t['second'] = t['a'].str[n-1]
print(t)
a second
0 [1234, abc, 444] abc
1 [5678] NaN
2 [2468, def] def
While not incredibly efficient, apply is at least clean:
t['a'].apply(lambda _: np.nan if len(_)<2 else _[1])

Changing dataframe value dtype with mixed types efficiently

I'm working with a scientific data set where measurable values are represented numerically and non-measurable values are represented by a default string, "Present < RDL". The first roadblock I've met while working with this data is the difficulty that comes from having two different data types, string and float, in a single column. pd.read_csv appears to cast all values in certain columns as strings (not sure why as of now). So I would like all numerical values to have an appropriate type, like float, and all "Present < RDL" values to remain as strings.
I have figured out a way around the mixed dtypes, and I can apply the logic to individual columns, but for some reason when I apply the same logic in a loop, it doesn't work:
# Dummy data:
lst = ['1.01', '2.05', 'Present < RDL', '3.50', '1.23', 'Present < RDL', '1.72']
lst2 = ['1.2', 'Present < RDL', '0.75', '1.53', '2.34', 'Present < RDL', '0.96']
data = {'test1': lst, 'test2': lst2}
data = pd.DataFrame(data)

# Works to convert numeric values in series from string to float.
lst = []
for i in data.test1:
    try:
        lst.append(float(i))
    except:
        lst.append(i)
test = pd.Series(lst)

# Verify that numbers have been converted to numeric type.
map(type, test)

# Now, the same logic looping through the dataframe columns:
for col in data.columns:
    lst = []
    for i in col:
        try:
            lst.append(float(i))
        except:
            lst.append(i)
    col = pd.DataFrame(lst)

# Shows no change in dtypes.
map(type, data.test1)
I've observed a similar issue with the pandas functions, and have had even more trouble getting them to work consistently:
data.test1 = pd.to_numeric(data.test1, errors='ignore')
I realize my first solution is probably not going to be as elegant as the pandas functions, so I'm open to any and all suggestions for how to achieve the goal. Thanks for reading.
Update:
After integrating the answer below, I was able to fix the looping issue:
for col in data.columns:
    data[col] = pd.to_numeric(data[col], errors='coerce').fillna(data[col])
Use pd.to_numeric with the argument errors='coerce' to convert the strings to NaN, then fillna these with the strings from your original column:
data['test1'] = pd.to_numeric(data['test1'], errors='coerce').fillna(data['test1'])
If we then check the types of each row:
print(data['test1'].apply(type))
0 <class 'float'>
1 <class 'float'>
2 <class 'str'>
3 <class 'float'>
4 <class 'float'>
5 <class 'str'>
6 <class 'float'>
Name: test1, dtype: object
We see the mixed-type column, as you wanted.
Now we can actually do calculations on our column. Obviously it will give weird results for the strings, but that's the downside of mixed-type columns:
data['test1'] * 2
0 2.02
1 4.1
2 Present < RDLPresent < RDL
3 7
4 2.46
5 Present < RDLPresent < RDL
6 3.44
Name: test1, dtype: object
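If the leftover strings get in the way of real arithmetic, one possible workaround (not part of the original answer; the doubled column name is just for illustration) is to coerce again only for the calculation:

# do the math on the numeric entries, then put the original strings back
numeric = pd.to_numeric(data['test1'], errors='coerce')
data['doubled'] = (numeric * 2).fillna(data['test1'])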

Find rows in dataframe column containing questions

I have a TSV file that I loaded into a pandas dataframe to do some preprocessing, and I want to find out which rows contain a question, outputting 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
QUERY FREQ
0 hindi movies for adults 595
1 are panda dogs real 383
2 asuedraw winning numbers 478
3 sentry replacement keys 608
4 rebuilding nicad battery packs 541
After dropping empty rows, duplicates, and the FREQ column (not needed for this), I wrote a simple function to check whether the QUERY column contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis=1)

def questions(row):
    questions_list = ["what","when","where","which","who","whom","whose","why","why don't",
                      "how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0

df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But once I check the new dataframe, even though it creates the new column, all the values are 0. I'm not sure if the logic in my function is wrong. I've used something similar with dataframe columns that contain just one word: if it matches, it outputs a 1 or 0. However, that same logic doesn't seem to work when the column contains a phrase/sentence, as in this use case. Any input is really appreciated!
If you wish to check whether a string from the dataframe contains any of the substrings in questions_list, you should use the str.contains method:
import re

questions_list = ["what","when","where","which","who","whom","whose","why",
                  "why don't", "how","how far","how long","how many",
                  "how much","how old","how come","?"]
# escape regex metacharacters such as "?" before joining the list into a pattern
pattern = "|".join(map(re.escape, questions_list))  # generate regex from your list
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
Simplified example:
df = pd.DataFrame({
    'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
    'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0 True
1 True
2 False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick python re reference. For the symbol '|', the explanation is:
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way
IIUC, you need to check whether the first word of the string is in the question list: if yes, return 1, else 0. In your function, rather than checking if the entire string is in the question list, split the string and check if its first element is in the question list.
def questions(row):
    questions_list = ["are","what","when","where","which","who","whom","whose","why","why don't","how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0

df['QUESTIONS'] = df.apply(questions, axis=1)
You get
QUERY FREQ QUESTIONS
0 hindi movies for adults 595 0
1 are panda dogs real 383 1
2 asuedraw winning numbers 478 0
3 sentry replacement keys 608 0
4 rebuilding nicad battery packs 541 0

There are three problems (load database, loop, and append series)

This turned out to be a more difficult problem than I thought when I started.
I want to take the contents of a particular column from tables in an SQLite database, make each into a Series, and then combine them into a single dataframe.
I have tried this, but failed:
import pandas as pd
from pandas import Series, DataFrame
import sqlite3
con = sqlite3.connect("C:/Users/Kun/Documents/Dashin/data.db")  # my sqldb
tmplist = ['A003060','A003070']  # the db contains those tables; I only call two for practice

for i in tmplist:
    tmpSeries = pd.Series([])
    listSeries = pd.read_sql("SELECT * FROM %s " % (i), con, index_col=None)['Close'].head(5)
    tmpSeries2 = tmpSeries.append(listSeries)
    print(tmpSeries2)
That code only shows output like this:
0 7150.0
1 6770.0
2 7450.0
3 7240.0
4 6710.0
dtype: float64
0 14950.0
1 15500.0
2 15000.0
3 14800.0
4 14500.0
What I want to do is like this:
A003060 A003070
0 7150.0 14950.0
1 6770.0 15500.0
2 7450.0 15000.0
3 7240.0 14800.0
4 6710.0 14500.0
I asked a similar question before and got an answer, but that answer used predefined variables. I must use a loop because I have to deal with a series of large databases. I have already tried another approach using dataframe.append and transpose(), but I failed.
I would appreciate some small hints. Thank you.
To append pandas series using for loop
I think you can create a list, then append the data to it, and finally use concat:
dfs = []
for i in tmplist:
    tmpSeries = pd.Series([])
    listSeries = pd.read_sql("SELECT * FROM %s " % (i), con, index_col=None)['Close'].head(5)
    dfs.append(listSeries)

df = pd.concat(dfs, axis=1, keys=tmplist)
print(df)