Length of value issue with unique ids - pandas

I am trying to write some simple code and haven't found a simple answer for this. I am trying to assign a unique ID to each person based on when the file was amended and their employee ID, and then add the column of unique IDs to the file.
excel1 = "Book1.xlsx"
df1 = pd.read_excel(excel1, header=0)
time = time.strftime('%m%d%Y%H%m', time.gmtime(os.path.getmtime("Book1.xlsx")))
unique_id = [df1["ID"] + time]
df1["CID"] = unique_id
When I try to run it, I keep getting this error:
ValueError: Length of values does not match length of index
Does anyone have an answer for this?
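A likely cause, sketched here as a hedged guess: unique_id = [df1["ID"] + time] wraps the whole Series in a one-element list, so its length (1) does not match the length of the dataframe's index. A minimal sketch of a fix, assuming the employee IDs live in df1["ID"] (the timestamp is stored under a new name so it no longer shadows the time module):
import os
import time

import pandas as pd

df1 = pd.read_excel("Book1.xlsx", header=0)

# Modification time of the file; %M gives minutes (the trailing %m in the question repeats the month).
stamp = time.strftime('%m%d%Y%H%M', time.gmtime(os.path.getmtime("Book1.xlsx")))

# Build one string per row, with no list wrapper, so the length matches the index.
df1["CID"] = df1["ID"].astype(str) + stamp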

Related

pandas drop row if column contains string

I have a CSV file as follows:
message,name,userID,period,#timestamp,timediff
messagebody,Request URL,system,period_8,2021-05-10 09:21:31,1
messagebody,Request URL,system,period_9,2021-05-10 09:58:19,1
"Failed Logon for user ""user""",Logon Attempt,user,period_1,2021-05-10 08:00:22,1
"Failed Logon for user ""user""",Logon Attempt,user,period_1,2021-05-09 05:59:34,1
I am trying to check the userID column and remove all the rows that contain "system".
I tried with:
f['userID'] = f[~f["userID"].str.contains("system", na=False)]
But it doesn't seem to drop the rows.
Just a little explanation about the userID column: it is the result of merging two other columns.
f['userID'] = f[['destinationUserName','sourceUserName']].astype(str).agg(''.join,1).replace('nan','',regex=True)
f['userID'] = f[~f["userID"].str.contains("system", na=False)]
If I run my script I get this error:
ValueError: Length mismatch: Expected axis has 239 elements, new values have 252 elements
Can anyone help me to understand how to overcome this issue?
How can I target that column and remove the specific rows that contain a specific string?
Thank you so much for any help.
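A hedged sketch of one way to get there: the right-hand side of that assignment is a whole filtered dataframe, so writing it into the single userID column is what produces the length mismatch; dropping the rows means assigning the filtered frame back to f itself. The file name below is hypothetical.
import pandas as pd

f = pd.read_csv("events.csv")  # hypothetical file name

# Merge the two user columns into userID, as in the question.
f['userID'] = (
    f[['destinationUserName', 'sourceUserName']]
    .astype(str)
    .agg(''.join, axis=1)
    .replace('nan', '', regex=True)
)

# Keep only the rows whose userID does not contain "system",
# assigning the result to f rather than to f['userID'].
f = f[~f['userID'].str.contains('system', na=False)]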

Pandas dataframe: grouping by unique identifier, checking conditions, and applying 1/0 to new column if condition is met/not met

I have a large dataset pertaining to customer churn, where every customer has a unique identifier (encoded key). The dataset is a timeseries, where every customer has one row for every month they have been a customer, so both the date and customer-identifier columns naturally contain duplicates. What I am trying to do is add a new column (called 'churn') and set it to 0 or 1 depending on whether it is that specific customer's last month as a customer.
I have tried numerous methods to do this, but each and every one fails, either due to tracebacks or because they just don't work as intended. It should be noted that I am very new to both Python and pandas, so please explain things like I'm five (lol).
I have tried using pandas groupby to group rows by the unique customer keys, and then checking conditions:
df2 = df2.groupby('customerid').assign(churn = [1 if date==max(date) else 0 for date in df2['date']])
which gives a traceback because a DataFrameGroupBy object has no attribute 'assign'.
I have also tried the following:
df2.sort_values(['date']).groupby('customerid').loc[df['date'] == max('date'), 'churn'] = 1
df2.sort_values(['date']).groupby('customerid').loc[df['date'] != max('date'), 'churn'] = 0
which gives a similar traceback, but this time for the attribute 'loc'.
I have also tried using numpy methods, like the following:
df2['churn'] = df2.groupby(['customerid']).np.where(df2['date'] == max('date'), 1, 0)
which again gives a traceback because of the DataFrameGroupBy object,
and:
df2['churn'] = np.where((df2['date']==df2['date'].max()), 1, df2['churn'])
which does not give a traceback but does not work as intended: it sets churn to 1 for the rows matching the overall maximum date, instead of the maximum date for each specific customerid, which in retrospect is completely understandable since customerid is not specified anywhere.
Any help/tips would be appreciated!
If I understand correctly: use GroupBy.transform with 'max' to get the maximal date per group, compare it with the date column, and finally set the 1/0 values from the mask:
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = np.where(mask, 1, 0)
Or, equivalently:
df2['churn'] = mask.astype(int)
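A small self-contained check of that approach, with made-up data (the column names follow the question; the values are only illustrative):
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'customerid': ['a', 'a', 'a', 'b', 'b'],
    'date': pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01',
                            '2021-01-01', '2021-02-01']),
})

# churn is 1 only on each customer's last month, 0 everywhere else.
mask = df2['date'].eq(df2.groupby('customerid')['date'].transform('max'))
df2['churn'] = np.where(mask, 1, 0)
print(df2)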

Using to_datetime on several column names

I am working with several CSVs whose first N columns are general information and whose next M columns (M is large) contain information for specific dates.
I need to convert only the column names from position N+1 to N+M-1 to date format.
I tried this (here N+1 = 5; regardless of M, I assume I can use -1 so the last column name is not affected):
ContDiarios.columns[5:-1] = pd.to_datetime(ContDiarios.columns[5:-1])
but I get the following error:
TypeError: Index does not support mutable operations
The way you are doing it is not feasible, because a pandas Index is immutable. Please try this instead:
def convert(col):
    try:
        return pd.to_datetime(col)
    except (ValueError, TypeError):
        # leave non-date column names unchanged
        return col

ContDiarios.columns = [convert(col) for col in ContDiarios.columns]
Alternatively, you can use the df.rename method to convert them.
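For the exact slice in the question (position 5 up to, but not including, the last column), a hedged sketch: since the Index cannot be modified in place, build an ordinary list of names, convert the slice, and assign the whole list back.
import pandas as pd

# ContDiarios is the dataframe loaded from the CSV, as in the question.
cols = list(ContDiarios.columns)
cols[5:-1] = pd.to_datetime(cols[5:-1])  # convert only those positions to datetimes
ContDiarios.columns = cols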

pandas - numpy using np.where to calculate and construct new columns

I am trying to create a new column based on selection criteria in another column. This is at the end of a while loop, so the data frame does not have the column until this part of the first iteration. All subsequent iterations will be based on this column's total from the previous iteration plus the current totals:
if 'cBeds' in sPhase.columns:
    sPhase['cBeds'] = np.where(sPhase['COUNTYFP'] == '1', (sPhase['cBeds'] + (sPhase[infCount] * .08)), sPhase['cBeds'])
else:
    sPhase['cBeds'] = np.where(sPhase['COUNTYFP'] == '1', (sPhase[infCount] * .08), sPhase['cBeds'])
However, when I run the code I get KeyError: 'cBeds'.
How can I handle updating a column in a conditional when the column doesn't exist on the first iteration?
In the else clause, you reference sPhase['cBeds'] as the third parameter to np.where even though you've already established that the column does not exist.
If you want to avoid this problem, just add the column at the beginning of the loop and give it a default value that you can conditionally change later.
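A minimal sketch of that suggestion, reusing sPhase and infCount from the question: give cBeds a neutral default before the while loop, and then a single np.where works on every iteration.
import numpy as np

# Before the while loop: create the column once, with a default value.
if 'cBeds' not in sPhase.columns:
    sPhase['cBeds'] = 0.0

# Inside the loop: the if/else collapses into one conditional update.
sPhase['cBeds'] = np.where(
    sPhase['COUNTYFP'] == '1',
    sPhase['cBeds'] + sPhase[infCount] * 0.08,
    sPhase['cBeds'],
)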

Building a new dataset

I want to take data from one set and enter it into another empty set.
So, for example, I want to do something like:
if ([i,x] > 9){
new_data$House[y,x] <- data[i,2]
}
but I want to do it over and over, creating new rows in new_data.
How do I keep adding data to new_data and overriding/saving the new row?
Essentially, I just want to know how to "grow" an empty data set.
Please ignore any errors in the code, it is just an example and I am still working on other details.
Thanks
If you are using the R language, I presume you are looking for rbind:
new_data = NULL                        # define your new dataset
for(i in 1:nrow(data))                 # loop over the rows of data
{
  if(data[i,x] > 9)                    # if statement implementing the condition
  {
    new_data = rbind(new_data, data[i,2:6])  # append columns 2 to 6 of row i
  }
}
At the end, new_data will contain only the rows that satisfy the if statement, and each row will contain the values extracted from columns 2 to 6.
If that is what you are looking for, there are various ways to do it without a for loop, for example:
new_data = data[data[,x] > 9, 2:6]
If this answer is not satisfactory, please provide more details in your question, including a reproducible example of your data and the expected output.