Add 'document_id' column to pandas dataframe of word-id's and wordcounts - pandas

I have following dataset:
import pandas as pd
jsonDF = pd.DataFrame({'DOCUMENT_ID':[263403828328665088,264142543883739136], 'MESSAGE':['#Zuora wants to help #Network4Good with Hurric...','#ztrip please help spread the good word on hel...']})
DOCUMENT_ID MESSAGE
0 263403828328665088 #Zuora wants to help #Network4Good with Hurric...
1 264142543883739136 #ztrip please help spread the good word on hel...
I am trying to reshape my data in the form of
docID wordID count
0 1 118 1
1 1 285 1
2 1 1229 1
3 1 1688 1
4 1 2068 1
I used following
r=[]
for i in jsonDF['MESSAGE']:
for j in sortedValues(wordsplit(i)):
r.append(j)
IDCount_Re=pd.DataFrame(r)
IDCount_Re[:5]
gives me following result
0 17
1 help 2
2 wants 1
3 hurricane 1
4 relief 1
5 text 1
6 sandy 1
7 donate 1
8 6
9 please 1
I can get word counts
I have no idea to to append Document_ID to the in the above dataframe.
Following functions were used to split words
from nltk.corpus import stopwords
import re
def wordsplit(wordlist):
j=wordlist
j=re.sub(r'\d+', '', j)
j=re.sub('RT', '',j)
j=re.sub('http', '', j)
j = re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", j)
j=j.lower()
j=j.strip()
if not j in stopwords.words('english'):
yield j
def wordSplitCount(wordlist):
'''merges a list into string, splits it, removes stop words and
then counts the occurrences returning an ordered dictitonary'''
#stopwords=set(stopwords.words('english'))
string1=''.join(list(itertools.chain(filter(None, wordlist))))
cnt=Counter()
j = []
for i in string1.split(" "):
i=re.sub(r'&', ' ', i.lower())
if i not in stopwords.words('english'):
cnt[i]+=1
return OrderedDict(cnt)
def sortedValues(wordlist):
'''creates a dictionary list of occurenced w/ values descending'''
d=wordSplitCount(wordlist)
return sorted(d.items(), key=lambda t: t[1], reverse=True)
UPDATE: SOLUTION HERE:
string split and and assign unique ids to Pandas DataFrame

'DOCUMENT_ID' is one of the two fields in each row of jsonDF. Your current code doesn't access it because it directly works on jsonDF['MESSAGE'].
Here is some non-working pseudocode - something like:
for _, row in jsonDF.iterrows():
doc_id, msg = row
words = [word for word in wordsplit(msg)][0].split() # hack
wordcounts = Counter(words).most_common() # sort by decr frequency
Then do a pd.concat(pd.DataFrame({'DOCUMENT_ID': doc_id, ...
and get the 'wordId' and 'count' fields from wordcounts.

Related

counting unique values in column using sub-id

I have a df containing sub-trajectories (segments) of users, with mode of travel indicated by 0,1,2... which looks like this:
df = pd.read_csv('sample.csv')
df
id lat lon mode
0 5138001 41.144540 -8.562926 0
1 5138001 41.144538 -8.562917 0
2 5138001 41.143689 -8.563012 0
3 5138003 43.131562 -8.601273 1
4 5138003 43.132107 -8.598124 1
5 5145001 37.092095 -8.205070 0
6 5145001 37.092180 -8.204872 0
7 5145015 39.289341 -8.023454 2
8 5145015 39.197432 -8.532761 2
9 5145015 39.198361 -8.375641 2
In the above sample, id is for the segments but a full trajectory maybe covered by different modes (i.e. contains multiple segments).
So the first 4-digits of id is the unique trajectories, and the last 3-digits, unique segment with that trajectory.
I know that I can count the number of unique segments in the dfusing:
df.groupby('id').['mode'].nunique()
How do I then count the number of unique trajectories 5138, 5145, ...?
Use indexing for get first 4 values with str, if necessary first convert values to strings by Series.astype:
df = df.groupby(df['id'].astype(str).str[:4])['mode'].nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
If need processing values after first 4 ids:
s = df['id'].astype(str)
df = s.str[4:].groupby(s.str[:4]).nunique().reset_index(name='count')
print (df)
id count
0 5138 2
1 5145 2
Another idea is use lambda function:
df.groupby(df['id'].apply(lambda x: str(x)[:4]))['mode'].nunique()

Count in string terms and stored mapped to other value

I have a pandas dataframe which includes columns (amongst others) like this, with RATING being integers 0 to 5 and COMMENT is string:
RATING COMMENT
1 some text
2 more text
3 other text
... ...
I would now like to mine (for lack of better word ) the key words for a list of strings:
list = ['like', trust', 'etc etc etc']
and would like to iterate through the COMMENT and count the number of key words by rating to get a df out like so
KEYWORD RATING COUNT
like 1 202
like 2 325
like 3 0
like 4 967
like 5 534
...
trust 1 126
....
how can I achieve this?
I am beginner so would really appreciate your help (and the simpler and more understandable the better)
thank you
hi at the moment I have been iterating through manually,
ie
#DATA_df is the original data
word_list = ['word', 'words', 'words', 'more']
values = [0] * len(word_list)
tot_val=[values]*5
rating_table = pd.DataFrame(tot_val, columns=word_list)
for i in len(word_list):
for g in len (DATA_df[COMMENT]):
if i in DATA_df[COMMENT][g]:
rating_table[i][DATA_df[RATING]-1] +=1
this give a DF like so
word words words more
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
that I am then trying to add to.... it appears really clunky
I managed to solve it, key points learnt are use group by to pre-select data based on the rating, this slices the data and it is possible to alternate through the groups. also use of str.lower() in combination with str.count() worked well.
I am thankful if more experienced programmers could show me a better solution, but at least this works.
rating = [1,2,3,4,5]
rategroup = tp_clean.groupby('Rating')
#print (rategroup.groups)
results_list =[]
for w in word_list:
current = [w]
for r in rating:
stargroup = rategroup.get_group(str(r))
found = stargroup['Content'].str.lower().str.count(w)
c = found.sum()
current.append(c)
results_list.append(current)
results_df = pd.DataFrame (results_list, columns=['Keyword','1 Star','2 Star','3 Star','4 Star','5 Star'])
The one thing I am still struggling with is how to use regex to make it look for full words. I believe \b is the right one but how do I put it into str.count function?

Groupby and A)Concate matching strings(and or substring) B)Sum the values

I have df:
row_numbers ID code amount
1 med a 1
2 med a, b 1
3 med b, c 1
4 med c 1
5 med d 10
6 cad a, b 1
7 cad a, b, d 0
8 cad e 2
Pasted the above df:
I wanted to do groupby on column-ID and A)Combine the strings if substring/string matches(on column-code) B)sum the values of column-amount.
Expected results:
Explanation:
column-row_numbers has no role here in df. I just took here to explain the output.
A)grouping on column-ID and looking at column-code, row1 string i.e., a is matching with row2's sub string. row2's substring i.e., b is matching with row3's substring. row3's substring i.e., c is matching with string of row4 and Hence combining row1, row2, row3 and row4. row5 string is not matching with any of string/substring so it is separate group. B) Based on this adding row1, row2, row3 and row4 values. and row5 as separate group.
Thanks in advance for your time and thoughts:).
EDIT - 1
Pasting the real time.
Expected output:
Explanation:
have to do on grouping column-id and concatenating the values of column-code and summing the values of column-units and vol. It is color coded the matching(to be contacted) values of column-code. row1 has link with row5 and row9. row9 has inturn link with row3. Hence combining row1, row5, row9, row3. Simliarly row2 and row7 and so on. row8 has no link with any of the values with-in group-med(column-id) and hence will be as separate row.
Thanks!.
Update: From your latest sample data, this is not a simple data munging. There is no vectorized solution. It relates to graph theory. You need to find connected components within each group of ID and do the calculation on each connected components.
Consider each string as a node of graph. If 2 strings are overlapped, they are connected nodes. For every node, you need to traverse all paths connected to it. Do calculation on all connected nodes through these paths. This traversal can be done by using Depth-first search logic.
However, before processing depth-first search, you need to preprocess strings to set to check overlapping.
Method 1: Recursive
Do the following:
Define a function dfs to recursively run depth-first search
Define a function gfunc to use with groupby apply. This function will traverse elements of each group of ID and return the desired dataframe.
Get rid of any blank spaces in each string and split and convert them
to sets using replace, split and map and assign it to a new column new_code to df
Call groupby on ID and apply using function gfunc. Call droplevel and reset_index to get the desired output
Codes as follows:
import numpy as np
def dfs(node, index, glist, g_checked_rows):
ret_arr = df.loc[index, ['code', 'amount', 'volume']].values
g_checked_rows.add(index)
for j, s in glist:
if j not in g_checked_rows and not node.isdisjoint(s):
t_arr = dfs(s, j, glist, g_checked_rows)
ret_arr[0] += ', ' + t_arr[0]
ret_arr[1:] += t_arr[1:]
return ret_arr
def gfunc(x):
checked_rows = set()
final = []
code_list = list(x.new_code.items())
for i, row in code_list:
if i not in checked_rows:
final.append(dfs(row, i, code_list, checked_rows))
return pd.DataFrame(final, columns=['code','units','vol'])
df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc).droplevel(1).reset_index()
Out[16]:
ID code units vol
0 med CO-96, CO-B15, CO-B15, CO-96, OA-18, OA-18 4 4
1 med CO-16, CO-B20, CO-16 3 3
2 med CO-252, CO-252, CO-45 3 3
3 med OA-258 1 1
4 cad PR-96, PR-96, CO-243 4 4
5 cad PR-87, OA-258, PR-87 3 3
Note: I assume your pandas version is 0.24+. If it is < 0.24, the last step you need to use reset_index and drop instead of droplevel and reset_index as follows
df_final = df.groupby('ID', sort=False).apply(gfunc).reset_index().drop('level_1', 1)
Method 2: Iterative
To make this complete, I implement a version of gfunc using iterative process instead of recursive. Iterative process requires only one function.
However, the function is more complicated. The logic of iterative process as follows
push the first node to deque. Check if deque not empty, pop the top node out.
if a node is not marked checked, process it and mark it as checked
find all its neighbors in the reverse order of list of nodes that
haven't been marked, push them to the deque
Check if deque not empty, pop out a node from the top deque and
process from step 2
Code as follows:
def gfunc_iter(x):
checked_rows = set()
final = []
q = deque()
code_list = list(x.new_code.items())
code_list_rev = code_list[::-1]
for i, row in code_list:
if i not in checked_rows:
q.append((i, row))
ret_arr = np.array(['', 0, 0], dtype='O')
while (q):
n, node = q.pop()
if n in checked_rows:
continue
ret_arr_child = df.loc[n, ['code', 'amount', 'volume']].values
if not ret_arr[0]:
ret_arr = ret_arr_child.copy()
else:
ret_arr[0] += ', ' + ret_arr_child[0]
ret_arr[1:] += ret_arr_child[1:]
checked_rows.add(n)
#push to `q` all neighbors in the reversed list of nodes
for j, s in code_list_rev:
if j not in checked_rows and not node.isdisjoint(s):
q.append((j, s))
final.append(ret_arr)
return pd.DataFrame(final, columns=['code','units','vol'])
df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc_iter).droplevel(1).reset_index()
I believe the three main ideas for executing what you want are:
create an accumulator datastructure ( a DataFrame in this case)
iterate over a pair of rows, in each iteration you have (currentRow, nextRow)
pattern matching of current row in next row and pattern matching in the accumulated rows
It's not totally clear the exactly pattern match you're looking for, so I assumed that if any letter of currentRow code is on the next one, then concatenate them.
using a data.csv (with espace separators) as example:
row_numbers ID code amount
1 med a 1
2 med a,b 1
3 med b,c 1
4 med c 1
5 med d 10
6 cad a,b 1
7 cad a,b,d 0
8 cad e 2
import pandas as pd
from itertools import zip_longest
def generate_pairs(group):
''' generate pairs (currentRow, nextRow) '''
group_curriterrows = group.iterrows()
group_nextiterrows = group.iterrows()
group_nextiterrows.__next__()
zip_list = zip_longest(group_curriterrows, group_nextiterrows)
return zip_list
def generate_lists_to_check(currRow, nextRow, accumulated_rows):
''' generate list if any next letters are in current ones and
another list if any next letters are in the accumulated codes '''
currLetters = str(currRow["code"]).split(",")
nextLetters = str(nextRow["code"]).split(",")
letter_inNext = [letter in nextLetters for letter in currLetters]
unique_acc_codes = [str(v) for v in accumulated_rows["code"].unique()]
letter_inHistory = [any(letter in unq for letter in nextLetters)
for unq in unique_acc_codes]
return letter_inNext, letter_inHistory
def create_newRow(accumulated_rows, nextRow):
nextRow["row_numbers"] = str(nextRow["row_numbers"])
accumulated_rows = accumulated_rows.append(nextRow,ignore_index=True)
return accumulated_rows
def update_existingRow(accumulated_rows, match_idx, Row):
accumulated_rows.loc[match_idx]["code"] += ","+Row["code"]
accumulated_rows.loc[match_idx]["amount"] += Row["amount"]
accumulated_rows.loc[match_idx]["volume"] += Row["volume"]
accumulated_rows.loc[match_idx]["row_numbers"] += ','+str(Row["row_numbers"])
return accumulated_rows
if __name__ == "__main__":
df = pd.read_csv("extended.tsv",sep=" ")
groups = pd.DataFrame(columns=df.columns)
for ID, group in df.groupby(["ID"], sort=False):
accumulated_rows = pd.DataFrame(columns=df.columns)
group_firstRow = group.iloc[0]
accumulated_rows.loc[len(accumulated_rows)] = group_firstRow.values
row_numbers = str(group_firstRow.values[0])
accumulated_rows.set_value(0,'row_numbers',row_numbers)
zip_list = generate_pairs(group)
for (currRow_idx, currRow), Next in zip_list:
if not (Next is None):
(nextRow_idx, nextRow) = Next
letter_inNext, letter_inHistory = \
generate_lists_to_check(currRow, nextRow, accumulated_rows)
if any(letter_inNext) :
accumulated_rows = update_existingRow(accumulated_rows, (len(accumulated_rows)-1), nextRow)
elif any(letter_inHistory):
matches = [ idx for (idx, bool_val) in enumerate(letter_inHistory) if bool_val == True ]
first_match_idx = matches[0]
accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, nextRow)
for match_idx in matches[1:]:
accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, accumulated_rows.loc[match_idx])
accumulated_rows = accumulated_rows.drop(match_idx)
elif not any(letter_inNext):
accumulated_rows = create_newRow(accumulated_rows, nextRow)
groups = groups.append(accumulated_rows)
groups.reset_index(inplace=True,drop=True)
print(groups)
OUTPUT normal rows order REMOVING lines using column volume from current code because first exampe has no column volume:
row_numbers ID code amount
0 1 med a,a,b,b,c,c 4
1 5 med d 10
2 6 cad a,b,a,b,d 1
3 8 cad e 2
OUTPUT new example:
row_numbers ID code amount volume
0 1,5,9,3 med CO-96,CO-B15,CO-B15,CO-96,OA-18,OA-18 4 4
1 2,7 med CO-16,CO-B20,CO-16 3 3
2 4,6 med CO-252,CO-252,CO-45 3 3
3 8 med OA-258 1 1
4 10,13 cad PR-96,PR-96,CO-243 4 4
5 11,12 cad PR-87,OA-258,PR-87 3 3

Fill Empty Panda Dataframe Using Loop Method

I am currently working with some telematics data where the trip id is missing. Trip id is unique. 1 trip id contains multiple of rows of data consisting i.e gps coordinate, temp, voltage, rpm, timestamp, engine status (on or off). The data pattern indicate time of engine status on and off, can be cluster as a unique trip id. Though, I have difficulty to translate the above logic in order to generate these tripId.
Tried to use few pandas loop methods but keep failing.
import pandas as pd
inp = [{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON','tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'ON', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''},
{'Ignition_Status':'ON', 'tripID':''},{'Ignition_Status':'OFF', 'tripID':''}]
test = pd.DataFrame(inp)
print (test)
Approach Taken
n=1
for index, row in test.iterrows():
test['tripID']=np.where(test['Ignition_Status']=='ON',n,n)
n=n+1
Expected Result
Use series.eq() to check for OFF and series.shift() with series.cumsum():
test=test.assign(tripID=test.Ignition_Status.eq('OFF')
.shift(fill_value=False).cumsum().add(1))
Ignition_Status tripID
0 ON 1
1 ON 1
2 ON 1
3 OFF 1
4 ON 2
5 ON 2
6 ON 2
7 ON 2
8 ON 2
9 OFF 2
10 ON 3
11 OFF 3

collapse pandas dataframe rows based on index column

I have a dataframe that contains information that is linked by an ID column. The rows are sequential with the odd rows containing a "start-point" and the even rows containing an "end" point. My goal is to collapse the data from these into a single row with columns for "start" and "end" following each other. The rows do have a "packet ID" that would link them if the sequential nature of the dataframe is not consistent.
example:
df:
0 1 2 3 4 5
0 hs6 106956570 106956648 ID_A1 60 -
1 hs1 153649721 153649769 ID_A1 60 -
2 hs1 865130744 865130819 ID_A2 0 -
3 hs7 21882206 21882237 ID_A2 0 -
4 hs1 74230744 74230819 ID_A3 0 +
5 hs8 92041314 92041508 ID_A3 0 +
The resulting dataframe that I am trying to achieve is:
new_df
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508
with each row containing the information on both the start and the end-point.
I have tried to pass the IDs in to an array and use a for loop to pull the information out of the original dataframe into a new dataframe but this has not worked. I was looking at the melt documentation which would suggest that pd.melt(df, id_vars=[3], value_vars=[0,1,2]) may work but I cannot see how to get the corresponding row in to positions new_df[3,4,5].
I think that it may be something really simple that I am missing but any suggestions would be appreciated.
You can try this:
df_out = df.set_index([df.index%2, df.index//2])[df.columns[:3]]\
.unstack(0).sort_index(level=1, axis=1)
df_out.columns = np.arange(len(df_out.columns))
df_out
Output:
0 1 2 3 4 5
0 hs6 106956570 106956648 hs1 153649721 153649769
1 hs1 865130744 865130819 hs7 21882206 21882237
2 hs1 74230744 74230819 hs8 92041314 92041508