Pandas - Setting column value, based on a function that runs on another column - pandas

I have been all over the place trying to get this to work (I'm new to data science). It's obviously because I don't fully understand how pandas data structures work.
I have this code:
def getSearchedValue(identifier):
    full_str = anedf["Diskret data"].astype(str)
    value=""
    if full_str.str.find(identifier) <= -1:
        start_index = full_str.str.find(identifier)+len(identifier)+1
        end_index = full_str[start_index:].find("|")+start_index
        value = full_str[start_index:end_index].astype(str)
    return value
for col in anedf.columns:
    if col.count("#") > 0:
        anedf[col] = getSearchedValue(col)
What I'm trying to do is iterate over my columns; I have around 260 in my dataframe. If a column name contains the character #, the column should be filled with values based on what's in my "Diskret data" column.
Data in the "Diskret data" column is completely messed up, but it has the form CCC#111~VALUE|DDD#222~VALUE| and so on, until there are no more identifiers and values. Not all identifiers are present in each row, and they come in no specific order.
The function works if I run it with hard-coded strings in a regular Python script. But with the dataframe I get various errors like:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Input In [119], in <cell line: 12>()
12 for col in anedf.columns:
13 if col.count("#") > 0:
---> 14 anedf[col] = getSearchedValue(col)
Input In [119], in getSearchedValue(identifier)
4 full_str = anedf["Diskret data"].astype(str)
5 value=""
----> 6 if full_str.str.find(identifier) <= -1:
7 start_index = full_str.str.find(identifier)+len(identifier)+1
8 end_index = full_str[start_index:].find("|")+start_index
I guess this is because the condition is evaluated against all rows at once (a Series), which yields a mix of True and False values rather than a single boolean. But how can I make the evaluation and assignment work row by row, like this:
Diskret data               CCC#111                           JJSDJ#1234
CCC#111~1|BBB#2323~2234    1 (copied from "Diskret data")    0
JJSDJ#1234~Heart attack    0 (or skipped since the row       Heart attack
                           does not contain a value for
                           the identifier)
The plan is to drop the "Diskret data" when the assignment is done, so I have the data in a more structured way.
--- Update---
By request:
I have included a picture of how I visualize the problem, And what I seemingly can't make it do.
Problem visualisation

With regex you could do something like:
import pandas as pd

def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

series = pd.Series(
    ['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack']
)
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series.apply(map_)
Breaking this down:
Create a new series by running a map on each row that turns your long string into a list of tuples.
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series
# output:
# 0 [(CCC#111, 1), (BBB#2323, 2234)]
# 1 [(JJSDJ#1234, Heart attack)]
Then we create a map_ function. This function takes each row of reg_series and unzips it into two tuples: the first with only the "keys" and the other with only the "values". We then build a Series from them, with the keys as the index and the values as the data.
Edit: We added an if/else statement that checks whether the list is non-empty. If it is empty, we return an empty Series of dtype object.
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)
...
print(idx, values) # first row
# output:
# ('CCC#111', 'BBB#2323') ('1', '2234')
Finally we run apply on the series to create a dataframe that takes the outputs from map_ for each row and zips them together in columnar format.
reg_series.apply(map_)
# output:
# CCC#111 BBB#2323 JJSDJ#1234
# 0 1 2234 NaN
# 1 NaN NaN Heart attack
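As a minimal sketch of how this could be applied to the original dataframe: the code below assumes the frame is called anedf as in the question, and the three sample rows are made up for illustration. It builds one dict per row from the extracted pairs, turns those dicts into columns, and joins them back after dropping "Diskret data". Rows without any identifier simply end up with NaN.

```python
import pandas as pd

# Hypothetical sample data; the real anedf comes from the question.
anedf = pd.DataFrame({
    "Diskret data": ["CCC#111~1|BBB#2323~2234", "JJSDJ#1234~Heart attack", ""]
})

# One list of (identifier, value) tuples per row.
pairs = anedf["Diskret data"].astype(str).str.findall(r"([^~|]+)~([^~|]+)")

# dict() turns each list of tuples into {identifier: value};
# an empty list becomes {} and therefore a row of NaN.
wide = pd.DataFrame([dict(p) for p in pairs], index=anedf.index)

# Drop the raw column and attach the structured columns.
result = anedf.drop(columns="Diskret data").join(wide)
print(result)
```

Building the frame from a list of dicts avoids calling apply with a function that returns a Series, which tends to be slow on larger frames.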

Related

Using CuPy/cuDF, remove elements that are not distant enough to their previous elements from a sorted list

The purpose of the code is similar to this post
I have code that runs on CPUs:
import pandas as pd
def remove(s: pd.Series, thres:int):
    pivot = -float("inf")
    new_s = []
    for e in s:
        if (e-pivot)>thres:
            new_s.append(e)
            pivot=e
    return pd.Series(new_s)
# s is an ascending sequence
s = pd.Series([0,1,2,4,6,9])
remove(s, thres=3)
# Out:
# 0    0
# 1    4
# 2    9
# dtype: int64
The input is an ascending sequence with integer values.
This function simply removes every point s[i] whose distance to the most recently kept point (the pivot) is not greater than thres.
My problem is that CuPy/cuDF do not support loops, so I can't use GPUs to accelerate the code. I only have options like cumsum, diff, and mod that don't fit my needs.
Is there a function like scan in tensorflow?
The remove function can be reformulated in a form that is similar to prefix sum (scan):
For a sequence [a1, a2, a3], the output should be [a1, a1⨁a2, (a1⨁a2)⨁a3], where ⨁ is defined as
⨁=lambda x,y: y if (y-x)>thres else x
Then set(output) is what I want.
Note that (a1⨁a2)⨁a3 != a1⨁(a2⨁a3) in general; without associativity, parallel computation might not be feasible.
Update
I found that there is already a function called Inclusive Scan, all I need is a python wrapper.
Or is there any other way?
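The scan formulation can be checked on the CPU with itertools.accumulate, which is Python's built-in inclusive scan. This is only a sketch for validating the operator: it assumes the operator keeps the current pivot x unless the new element y exceeds it by more than thres, and it runs sequentially, so it does not solve the GPU question. The name remove_scan is made up for this example.

```python
from itertools import accumulate

def remove_scan(values, thres):
    """Sequential inclusive scan that tracks the running pivot."""
    def op(pivot, y):
        # Move the pivot to y only when y is far enough from it.
        return y if (y - pivot) > thres else pivot

    # The first output is values[0]; each later output is the pivot so far.
    pivots = list(accumulate(values, op))
    # The distinct pivots are exactly the elements that are kept.
    return sorted(set(pivots))

result = remove_scan([0, 1, 2, 4, 6, 9], thres=3)
print(result)  # [0, 4, 9]
```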

Getting same value from list in dataframe column using Python

I have a dataframe with 3 columns. I want to add one more column filled with unique values generated with the random module.
I created a list and, in a for loop, appended random strings to it.
After that, I wrote another loop that takes each value from the list and assigns it to the new column.
But the same value ends up in every row.
df = pd.read_csv("test.csv")
lst = []
for i in range(20):
    randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
                         for i in range(20))
    lst.append(randColumn)
for j in lst:
    df['randColumn'] = j
print(df)
#Output.......
A B C randColumn
0 1 2 3 WHI11NJBNI8BOTMA9RKA
1 4 5 6 WHI11NJBNI8BOTMA9RKA
Could you please help me understand why every row gets the same value from the list, and how to fix it?
Updated to work correctly with any type of column in df.
If I understood your question correctly, you can use the zip method of the RDD to achieve your goal.
import random
import string
from pyspark.sql import SparkSession, Row
import pyspark.sql.types as t
# Assumes sparkSession is an existing SparkSession and df is a Spark DataFrame
lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits) for i in range(20))
    # Adding random strings as Row to list
    lst.append(Row(random=rand_column))
# Making rdd from random strings array
random_rdd = sparkSession.sparkContext.parallelize(lst)
res = df.rdd.zip(random_rdd).map(lambda rows: Row(**(rows[0].asDict()), **(rows[1].asDict()))).toDF()
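For a plain pandas dataframe, as in the question, a minimal sketch of the fix is to assign a whole list (one string per row) instead of assigning a scalar inside a loop. Assigning a scalar to df['randColumn'] broadcasts it to every row, so the loop leaves only the last value. The small frame below stands in for test.csv and is an assumption, and random strings are very unlikely to collide but are not guaranteed to be unique.

```python
import random
import string

import pandas as pd

# Stand-in for pd.read_csv("test.csv"); the values are illustrative.
df = pd.DataFrame({"A": [1, 4], "B": [2, 5], "C": [3, 6]})

alphabet = string.ascii_uppercase + string.digits

# Build exactly one 20-character string per row, then assign the list at once.
df["randColumn"] = [
    "".join(random.choices(alphabet, k=20)) for _ in range(len(df))
]
print(df)
```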

Pandas series replace value ignoring case but only if exact match

As the title says, I'm looking for a way to replace exact strings in a Series, ignoring case.
ls = {'CAT':'abc','DOG' : 'def','POT':'ety'}
d = pd.DataFrame({'Data': ['cat','dog','pot','Truncate','HotDog','ShuPot'],'Result':['abc','def','ety','Truncate','HotDog','ShuPot']})
d
In the above code, ls holds the key-value pairs, where each key is an existing value in a dataframe column and each value is its replacement.
The issue is that the service that passes the dictionary always provides the keys in upper case, whereas the dataframe might have the values in lower case.
The expected output is stored in the 'Result' column.
I tried including re.ignore = True, which changes the last 2 values.
I also tried the following code, but it does not work as expected; it also converts the values from previous iterations to upper case.
for k,v in ls.items():
    print (k,v)
    d['Data'] = d['Data'].astype(str).str.upper().replace({k:v})
    print (d)
I'd appreciate any help.
Create a mapping Series from the given dictionary and convert its index to lower case. Then use Series.map to map the lower-cased values of the Data column through the mapping, and use Series.fillna to fill the unmatched entries with the original values:
mappings = pd.Series(ls)
mappings.index = mappings.index.str.lower()
d['Result'] = d['Data'].str.lower().map(mappings).fillna(d['Data'])
# print(d)
Data Result
0 cat abc
1 dog def
2 pot ety
3 Truncate Truncate
4 HotDog HotDog
5 ShuPot ShuPot

How do I create multiple new columns, and populate columns depending on values in 2 other columns using pandas/python?

I want to populate 10 columns with the numbers 1-16 depending on the values in 2 other columns. I can start by providing the column header or create new columns (does not matter to me).
I tried to create a function that iterates over the numbers 1-10 and then assigns a value to the z variable depending on the values of b and y.
Then I want to apply this function to each row in my dataframe.
import pandas as pd
import numpy as np
data = pd.read_csv('Nuc.csv')
def write_Pcolumns(df):
    """populates a column in the given dataframe, df, based on the values in two other columns in the same dataframe"""
    #create string of numbers for each nucleotide position
    positions = ('1','2','3','4','5','6','7','8','9','10')
    a = "Po "
    x = "O.Po "
    #for each position create a variable for the nucleotide in the sequence (Po) and opposite to the sequence(o. Po)
    for each in positions:
        b = a + each
        y = x + each
        z = 'P' + each
        #assign a value to z based on the nucleotide identities in the sequence and opposite position
        if df[b] == 'A' and df[y]=='A':
            df[z]==1
        elif df[b] == 'A' and df[y]=='C':
            df[z]==2
        elif df[b] == 'A' and df[y]=='G':
            df[z]==3
        elif df[b] == 'A' and df[y]=='T':
            df[z]==4
        ...
        elif df[b] == 'T' and df[y]=='G':
            df[z]==15
        else:
            df[z]==16
    return(df)
data.apply(write_Pcolumns(data), axis=1)
I get the following error message:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This happens because df[column] == 'value' returns a Series of booleans (one per row), not a single boolean, so it cannot be used directly in an if statement.
Check out Pandas error when using if-else to create new column: The truth value of a Series is ambiguous
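One way to avoid the if/elif chain altogether is a lookup table applied to whole columns. The sketch below rests on an assumption: the question elides most of the branches, so the codes are taken to run from 1 to 16 over the pairs in A, C, G, T order, which agrees with the branches that are shown (A/A=1, A/C=2, A/G=3, A/T=4, T/G=15). Any pair not in the table falls back to 16, mirroring the else branch. The sample frame and its single position are made up.

```python
import itertools

import pandas as pd

bases = "ACGT"
# ASSUMPTION: codes follow A, C, G, T ordering: AA=1, AC=2, ..., TG=15, TT=16.
pair_code = {
    a + b: code
    for code, (a, b) in enumerate(itertools.product(bases, repeat=2), start=1)
}

# Hypothetical data with one position; the real file has positions 1 to 10.
data = pd.DataFrame({"Po 1": ["A", "A", "T", "G"],
                     "O.Po 1": ["A", "T", "G", "N"]})

for pos in ["1"]:
    # Concatenate the two letters per row and look the pair up in one step.
    pairs = data["Po " + pos] + data["O.Po " + pos]
    data["P" + pos] = pairs.map(pair_code).fillna(16).astype(int)

print(data)
```

Because the lookup works on entire columns, no row-wise apply is needed and the ambiguous truth value error cannot occur.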

Find rows in dataframe column containing questions

I have a TSV file that I loaded into a pandas dataframe to do some preprocessing and I want to find out which rows have a question in it, and output 1 or 0 in a new column. Since it is a TSV, this is how I'm loading it:
import pandas as pd
df = pd.read_csv('queries-10k-txt-backup', sep='\t')
Here's a sample of what it looks like:
QUERY FREQ
0 hindi movies for adults 595
1 are panda dogs real 383
2 asuedraw winning numbers 478
3 sentry replacement keys 608
4 rebuilding nicad battery packs 541
After dropping empty rows, duplicates, and the FREQ column(not needed for this), I wrote a simple function to check the QUERY column to see if it contains any words that make the string a question:
df_test = df.drop_duplicates()
df_test = df_test.dropna()
df_test = df_test.drop(['FREQ'], axis = 1)
def questions(row):
    questions_list = ["what","when","where","which","who","whom","whose","why","why don't",
                      "how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'] in questions_list:
        return 1
    else:
        return 0
df_test['QUESTIONS'] = df_test.apply(questions, axis=1)
But once I check the new dataframe, even though it creates the new column, all the values are 0. I'm not sure if my logic is wrong in the function, I've used something similar with dataframe columns which just have one word and if it matches, it'll output a 1 or 0. However, that same logic doesn't seem to be working when the column contains a phrase/sentence like this use case. Any input is really appreciated!
If you want to check whether any string from questions_list occurs as a substring of a string in the dataframe, you should use the str.contains method:
import re
questions_list = ["what","when","where","which","who","whom","whose","why",
                  "why don't", "how","how far","how long","how many",
                  "how much","how old","how come","?"]
# generate a regex from your list; re.escape makes "?" match a literal question mark
pattern = "|".join(map(re.escape, questions_list))
df_test['QUESTIONS'] = df_test['QUERY'].str.contains(pattern)
Simplified example:
df = pd.DataFrame({
'QUERY': ['how do you like it', 'what\'s going on?', 'quick brown fox'],
'ID': [0, 1, 2]})
Create a pattern:
pattern = '|'.join(['what', 'how'])
pattern
Out: 'what|how'
Use it:
df['QUERY'].str.contains(pattern)
Out[12]:
0 True
1 True
2 False
Name: QUERY, dtype: bool
If you're not familiar with regexes, there's a quick Python re reference. For the symbol '|', the explanation is:
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way
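Since "?" is a regex metacharacter, a list that contains it should be escaped before joining. A small self-contained sketch, with a shortened word list and made-up queries, that also produces the 1/0 column the question asks for:

```python
import re

import pandas as pd

questions_list = ["what", "how", "?"]

# Escape every entry so "?" is matched literally, then join with "|".
pattern = "|".join(re.escape(q) for q in questions_list)

df = pd.DataFrame({
    "QUERY": ["how do you like it", "is it real?", "quick brown fox"]
})

# str.contains returns booleans; astype(int) converts them to 1/0.
df["QUESTIONS"] = df["QUERY"].str.contains(pattern, case=False).astype(int)
print(df)
```

Note that this is substring matching, so a query containing "show" would also match "how"; wrapping the words in \b word boundaries is one way to tighten it.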
IIUC, you need to check whether the first word of the string is in the question list: if yes, return 1, else 0. In your function, rather than checking whether the entire string is in the question list, split the string and check whether its first element is in the list.
def questions(row):
    questions_list = ["are","what","when","where","which","who","whom","whose","why","why don't","how","how far","how long","how many","how much","how old","how come","?"]
    if row['QUERY'].split()[0] in questions_list:
        return 1
    else:
        return 0
df['QUESTIONS'] = df.apply(questions, axis=1)
You get
QUERY FREQ QUESTIONS
0 hindi movies for adults 595 0
1 are panda dogs real 383 1
2 asuedraw winning numbers 478 0
3 sentry replacement keys 608 0
4 rebuilding nicad battery packs 541 0