Seperation of path in csv file - pandas

I am facing a problem regarding separation of path of files in a csv file.
I have such a structure in a csv file:
w:b/c/v/n/1/y/i
w:b/c/v/n/2/h/l
w:b/c/v/n/3/n/r
w:b/c/v/n/4/f/e
this is one column of my csv file as the path name. Now what I need to do is to preserve my first column and create 3 more columns for:
1/ y /i
2 h l
3 n r
4 f e
I know that .str.split('/', expand=True) works in such a cases but my problem is how to show that it should leave out "b/c/v/n" part. Could you please help me with it?

If you have DataFrame like this:
path
0 w:b/c/v/n/1/y/i
1 w:b/c/v/n/2/h/l
2 w:b/c/v/n/3/n/r
3 w:b/c/v/n/4/f/e
Then:
df[["col1", "col2", "col3"]] = (
df["path"].str.split("/", expand=True).iloc[:, -3:]
)
print(df)
creates 3 new columns:
path col1 col2 col3
0 w:b/c/v/n/1/y/i 1 y i
1 w:b/c/v/n/2/h/l 2 h l
2 w:b/c/v/n/3/n/r 3 n r
3 w:b/c/v/n/4/f/e 4 f e

You could first get a slice of the string and then use split like the following:
.str[10:].str.split('/', expand=True)

Another try is:
s = 'w:b/c/v/n/1/y/i'
temp = s.split("/")
s = "/".join(temp[0:4]) + "," + ",".join(temp[4:7])
Result is:
w:b/c/v/n,1,y,i

concise solution
df[['col', 'col1', 'col2', 'col3']] = df['path'].str.rsplit('/',3, expand=True)
df
path col col1 col2 col3
0 w:b/c/v/n/1/y/i w:b/c/v/n 1 y i
1 w:b/c/v/n/2/h/l w:b/c/v/n 2 h l
2 w:b/c/v/n/3/n/r w:b/c/v/n 3 n r
3 w:b/c/v/n/4/f/e w:b/c/v/n 4 f e

Related

Join cells from Pandas data frame and safe as one string or .txt file

I am trying to extract data from a data frame in Pandas and merge the results into one string (or .txt file)
Data Frame:
NUM
LETTER
0
4
Z
1
5
U
2
6
A
3
7
P
4
1
B
5
4
P
6
5
L
7
6
T
8
7
V
9
1
E
Script so far:
data = pd.read_csv("TEST.csv")
fdata = data[data["LETTER"].str.contains("A|E|L|P")]
ffdata = fdata.RESULT.to_string()
print(ffdata)
Running the script on TEST.csv gives me this result:
LETTER
2
A
3
P
5
P
6
L
9
E
Next, I want to join the data from the filtered cells and join them into one string:
--> "APPLE", optional with saving them as .txt to use them later.
How do I proceed from here? I was thinking about iterating over the data frame and use join, but I have no idea how to implement this. Any clues?
Based on #Frodnar's answer, this is the code that works:
data = pd.read_csv("TEST.csv")
fdata = data[data["LETTER"].str.contains("A|E|L|P")]
ffdata = ''.join(fdata.LETTER.to_list())
print(ffdata)
which gives the output 'APPLE'
Thank you for your help!!

Reorder rows of pandas DataFrame according to a known list of values

I can think of 2 ways of doing this:
Apply df.query to match each row, then collect the index of each result
Set the column domain to be the index, and then reorder based on the index (but this would lose the index which I want, so may be trickier)
However I'm not sure these are good solutions (I may be missing something obvious)
Here's an example set up:
domain_vals = list("ABCDEF")
df_domain_vals = list("DECAFB")
df_num_vals = [0,5,10,15,20,25]
df = pd.DataFrame.from_dict({"domain": df_domain_vals, "num": df_num_vals})
This gives df:
domain num
0 D 0
1 E 5
2 C 10
3 A 15
4 F 20
5 B 25
1: Use df.query on each row
So I want to reorder the rows according using the values in order of domain_vals for the column domain.
A possible way to do this is to repeatedly use df.query but this seems like an un-Pythonic (un-panda-ese?) solution:
>>> pd.concat([df.query(f"domain == '{d}'") for d in domain_vals])
domain num
3 A 15
5 B 25
2 C 10
0 D 0
1 E 5
4 F 20
2: Setting the column domain as the index
reorder = df.domain.apply(lambda x: domain_vals.index(x))
df_reorder = df.set_index(reorder)
df_reorder.sort_index(inplace=True)
df_reorder.index.name = None
Again this gives
>>> df_reorder
domain num
0 A 15
1 B 25
2 C 10
3 D 0
4 E 5
5 F 20
Can anyone suggest something better (in the sense of "less of a hack"). I understand that my solution works, I just don't think that calling pandas.concat along with a list comprehension is the right approach here.
Having said that, it's shorter than the 2nd option, so I presume there must be some equally simple way I can do this with pandas methods I've overlooked?
Another way is merge:
(pd.DataFrame({'domain':df_domain_vals})
.merge(df, on='domain', how='left')
)

Find the row contain two column values in pandas

I have csv below
ID,Name1,Name2
1,A,A
2,B,B
3,C,D
4,0,Z
5,0,Z
Find all row which is having Name1 ==0 and Name2==Z
Expected out is
4,0,Z
5,0,Z
My Code
df[df['Name1'].str.contains("0") & df['Name2'].str.contains('Z')]
any alternate way?
Vectorized operations provide the fastest run:
df[(df.Name1 == "0") & (df.Name2 == "Z")]
ID Name1 Name2
3 4 0 Z
4 5 0 Z
Another take, using query method:
df.query('Name1.str.contains("0") and Name2.str.contains("Z")')
# or
df.query('Name1 == "0" and Name2 == "Z"')
# ID Name1 Name2
# 3 4 0 Z
# 4 5 0 Z
As in the comments suggestion contains is a bit slower than using == as the first will try to match substrings.

Groupby and A)Concate matching strings(and or substring) B)Sum the values

I have df:
row_numbers ID code amount
1 med a 1
2 med a, b 1
3 med b, c 1
4 med c 1
5 med d 10
6 cad a, b 1
7 cad a, b, d 0
8 cad e 2
Pasted the above df:
I wanted to do groupby on column-ID and A)Combine the strings if substring/string matches(on column-code) B)sum the values of column-amount.
Expected results:
Explanation:
column-row_numbers has no role here in df. I just took here to explain the output.
A)grouping on column-ID and looking at column-code, row1 string i.e., a is matching with row2's sub string. row2's substring i.e., b is matching with row3's substring. row3's substring i.e., c is matching with string of row4 and Hence combining row1, row2, row3 and row4. row5 string is not matching with any of string/substring so it is separate group. B) Based on this adding row1, row2, row3 and row4 values. and row5 as separate group.
Thanks in advance for your time and thoughts:).
EDIT - 1
Pasting the real time.
Expected output:
Explanation:
have to do on grouping column-id and concatenating the values of column-code and summing the values of column-units and vol. It is color coded the matching(to be contacted) values of column-code. row1 has link with row5 and row9. row9 has inturn link with row3. Hence combining row1, row5, row9, row3. Simliarly row2 and row7 and so on. row8 has no link with any of the values with-in group-med(column-id) and hence will be as separate row.
Thanks!.
Update: From your latest sample data, this is not a simple data munging. There is no vectorized solution. It relates to graph theory. You need to find connected components within each group of ID and do the calculation on each connected components.
Consider each string as a node of graph. If 2 strings are overlapped, they are connected nodes. For every node, you need to traverse all paths connected to it. Do calculation on all connected nodes through these paths. This traversal can be done by using Depth-first search logic.
However, before processing depth-first search, you need to preprocess strings to set to check overlapping.
Method 1: Recursive
Do the following:
Define a function dfs to recursively run depth-first search
Define a function gfunc to use with groupby apply. This function will traverse elements of each group of ID and return the desired dataframe.
Get rid of any blank spaces in each string and split and convert them
to sets using replace, split and map and assign it to a new column new_code to df
Call groupby on ID and apply using function gfunc. Call droplevel and reset_index to get the desired output
Codes as follows:
import numpy as np
def dfs(node, index, glist, g_checked_rows):
ret_arr = df.loc[index, ['code', 'amount', 'volume']].values
g_checked_rows.add(index)
for j, s in glist:
if j not in g_checked_rows and not node.isdisjoint(s):
t_arr = dfs(s, j, glist, g_checked_rows)
ret_arr[0] += ', ' + t_arr[0]
ret_arr[1:] += t_arr[1:]
return ret_arr
def gfunc(x):
checked_rows = set()
final = []
code_list = list(x.new_code.items())
for i, row in code_list:
if i not in checked_rows:
final.append(dfs(row, i, code_list, checked_rows))
return pd.DataFrame(final, columns=['code','units','vol'])
df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc).droplevel(1).reset_index()
Out[16]:
ID code units vol
0 med CO-96, CO-B15, CO-B15, CO-96, OA-18, OA-18 4 4
1 med CO-16, CO-B20, CO-16 3 3
2 med CO-252, CO-252, CO-45 3 3
3 med OA-258 1 1
4 cad PR-96, PR-96, CO-243 4 4
5 cad PR-87, OA-258, PR-87 3 3
Note: I assume your pandas version is 0.24+. If it is < 0.24, the last step you need to use reset_index and drop instead of droplevel and reset_index as follows
df_final = df.groupby('ID', sort=False).apply(gfunc).reset_index().drop('level_1', 1)
Method 2: Iterative
To make this complete, I implement a version of gfunc using iterative process instead of recursive. Iterative process requires only one function.
However, the function is more complicated. The logic of iterative process as follows
push the first node to deque. Check if deque not empty, pop the top node out.
if a node is not marked checked, process it and mark it as checked
find all its neighbors in the reverse order of list of nodes that
haven't been marked, push them to the deque
Check if deque not empty, pop out a node from the top deque and
process from step 2
Code as follows:
def gfunc_iter(x):
checked_rows = set()
final = []
q = deque()
code_list = list(x.new_code.items())
code_list_rev = code_list[::-1]
for i, row in code_list:
if i not in checked_rows:
q.append((i, row))
ret_arr = np.array(['', 0, 0], dtype='O')
while (q):
n, node = q.pop()
if n in checked_rows:
continue
ret_arr_child = df.loc[n, ['code', 'amount', 'volume']].values
if not ret_arr[0]:
ret_arr = ret_arr_child.copy()
else:
ret_arr[0] += ', ' + ret_arr_child[0]
ret_arr[1:] += ret_arr_child[1:]
checked_rows.add(n)
#push to `q` all neighbors in the reversed list of nodes
for j, s in code_list_rev:
if j not in checked_rows and not node.isdisjoint(s):
q.append((j, s))
final.append(ret_arr)
return pd.DataFrame(final, columns=['code','units','vol'])
df['new_code'] = df.code.str.replace(' ','').str.split(',').map(set)
df_final = df.groupby('ID', sort=False).apply(gfunc_iter).droplevel(1).reset_index()
I believe the three main ideas for executing what you want are:
create an accumulator datastructure ( a DataFrame in this case)
iterate over a pair of rows, in each iteration you have (currentRow, nextRow)
pattern matching of current row in next row and pattern matching in the accumulated rows
It's not totally clear the exactly pattern match you're looking for, so I assumed that if any letter of currentRow code is on the next one, then concatenate them.
using a data.csv (with espace separators) as example:
row_numbers ID code amount
1 med a 1
2 med a,b 1
3 med b,c 1
4 med c 1
5 med d 10
6 cad a,b 1
7 cad a,b,d 0
8 cad e 2
import pandas as pd
from itertools import zip_longest
def generate_pairs(group):
''' generate pairs (currentRow, nextRow) '''
group_curriterrows = group.iterrows()
group_nextiterrows = group.iterrows()
group_nextiterrows.__next__()
zip_list = zip_longest(group_curriterrows, group_nextiterrows)
return zip_list
def generate_lists_to_check(currRow, nextRow, accumulated_rows):
''' generate list if any next letters are in current ones and
another list if any next letters are in the accumulated codes '''
currLetters = str(currRow["code"]).split(",")
nextLetters = str(nextRow["code"]).split(",")
letter_inNext = [letter in nextLetters for letter in currLetters]
unique_acc_codes = [str(v) for v in accumulated_rows["code"].unique()]
letter_inHistory = [any(letter in unq for letter in nextLetters)
for unq in unique_acc_codes]
return letter_inNext, letter_inHistory
def create_newRow(accumulated_rows, nextRow):
nextRow["row_numbers"] = str(nextRow["row_numbers"])
accumulated_rows = accumulated_rows.append(nextRow,ignore_index=True)
return accumulated_rows
def update_existingRow(accumulated_rows, match_idx, Row):
accumulated_rows.loc[match_idx]["code"] += ","+Row["code"]
accumulated_rows.loc[match_idx]["amount"] += Row["amount"]
accumulated_rows.loc[match_idx]["volume"] += Row["volume"]
accumulated_rows.loc[match_idx]["row_numbers"] += ','+str(Row["row_numbers"])
return accumulated_rows
if __name__ == "__main__":
df = pd.read_csv("extended.tsv",sep=" ")
groups = pd.DataFrame(columns=df.columns)
for ID, group in df.groupby(["ID"], sort=False):
accumulated_rows = pd.DataFrame(columns=df.columns)
group_firstRow = group.iloc[0]
accumulated_rows.loc[len(accumulated_rows)] = group_firstRow.values
row_numbers = str(group_firstRow.values[0])
accumulated_rows.set_value(0,'row_numbers',row_numbers)
zip_list = generate_pairs(group)
for (currRow_idx, currRow), Next in zip_list:
if not (Next is None):
(nextRow_idx, nextRow) = Next
letter_inNext, letter_inHistory = \
generate_lists_to_check(currRow, nextRow, accumulated_rows)
if any(letter_inNext) :
accumulated_rows = update_existingRow(accumulated_rows, (len(accumulated_rows)-1), nextRow)
elif any(letter_inHistory):
matches = [ idx for (idx, bool_val) in enumerate(letter_inHistory) if bool_val == True ]
first_match_idx = matches[0]
accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, nextRow)
for match_idx in matches[1:]:
accumulated_rows = update_existingRow(accumulated_rows, first_match_idx, accumulated_rows.loc[match_idx])
accumulated_rows = accumulated_rows.drop(match_idx)
elif not any(letter_inNext):
accumulated_rows = create_newRow(accumulated_rows, nextRow)
groups = groups.append(accumulated_rows)
groups.reset_index(inplace=True,drop=True)
print(groups)
OUTPUT normal rows order REMOVING lines using column volume from current code because first exampe has no column volume:
row_numbers ID code amount
0 1 med a,a,b,b,c,c 4
1 5 med d 10
2 6 cad a,b,a,b,d 1
3 8 cad e 2
OUTPUT new example:
row_numbers ID code amount volume
0 1,5,9,3 med CO-96,CO-B15,CO-B15,CO-96,OA-18,OA-18 4 4
1 2,7 med CO-16,CO-B20,CO-16 3 3
2 4,6 med CO-252,CO-252,CO-45 3 3
3 8 med OA-258 1 1
4 10,13 cad PR-96,PR-96,CO-243 4 4
5 11,12 cad PR-87,OA-258,PR-87 3 3

Add 'document_id' column to pandas dataframe of word-id's and wordcounts

I have following dataset:
import pandas as pd
jsonDF = pd.DataFrame({'DOCUMENT_ID':[263403828328665088,264142543883739136], 'MESSAGE':['#Zuora wants to help #Network4Good with Hurric...','#ztrip please help spread the good word on hel...']})
DOCUMENT_ID MESSAGE
0 263403828328665088 #Zuora wants to help #Network4Good with Hurric...
1 264142543883739136 #ztrip please help spread the good word on hel...
I am trying to reshape my data in the form of
docID wordID count
0 1 118 1
1 1 285 1
2 1 1229 1
3 1 1688 1
4 1 2068 1
I used following
r=[]
for i in jsonDF['MESSAGE']:
for j in sortedValues(wordsplit(i)):
r.append(j)
IDCount_Re=pd.DataFrame(r)
IDCount_Re[:5]
gives me following result
0 17
1 help 2
2 wants 1
3 hurricane 1
4 relief 1
5 text 1
6 sandy 1
7 donate 1
8 6
9 please 1
I can get word counts
I have no idea to to append Document_ID to the in the above dataframe.
Following functions were used to split words
from nltk.corpus import stopwords
import re
def wordsplit(wordlist):
j=wordlist
j=re.sub(r'\d+', '', j)
j=re.sub('RT', '',j)
j=re.sub('http', '', j)
j = re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", j)
j=j.lower()
j=j.strip()
if not j in stopwords.words('english'):
yield j
def wordSplitCount(wordlist):
'''merges a list into string, splits it, removes stop words and
then counts the occurrences returning an ordered dictitonary'''
#stopwords=set(stopwords.words('english'))
string1=''.join(list(itertools.chain(filter(None, wordlist))))
cnt=Counter()
j = []
for i in string1.split(" "):
i=re.sub(r'&', ' ', i.lower())
if i not in stopwords.words('english'):
cnt[i]+=1
return OrderedDict(cnt)
def sortedValues(wordlist):
'''creates a dictionary list of occurenced w/ values descending'''
d=wordSplitCount(wordlist)
return sorted(d.items(), key=lambda t: t[1], reverse=True)
UPDATE: SOLUTION HERE:
string split and and assign unique ids to Pandas DataFrame
'DOCUMENT_ID' is one of the two fields in each row of jsonDF. Your current code doesn't access it because it directly works on jsonDF['MESSAGE'].
Here is some non-working pseudocode - something like:
for _, row in jsonDF.iterrows():
doc_id, msg = row
words = [word for word in wordsplit(msg)][0].split() # hack
wordcounts = Counter(words).most_common() # sort by decr frequency
Then do a pd.concat(pd.DataFrame({'DOCUMENT_ID': doc_id, ...
and get the 'wordId' and 'count' fields from wordcounts.