Concatenating strings and integers in a pandas DataFrame (based on conditions) - pandas

In my dataframe I have 2 columns:
Country index (for example, SK)
id_number (usually 8 digits, for example: 98341852)
I want to concatenate them, which is easy:
sk_df['id'] = sk_df['country index'].str.cat(sk_df['id_number'].values.astype(str))
But some rows in the id_number column have fewer than 8 digits. In that case I want to add zeros as a separator between Country index and id_number. For example, if id_number has 6 digits, I want to add 8 - 6 = 2 zeros between the two values: SK00813841. If id_number has 7 digits, add 1 zero, and so on.
I tried this:
def indexing(row):
    if row['id_number'].astype(str).str.len() == 8:
        return row['country index'].str.cat(row['id_number'].values.astype(str))
    else:
        sep_mult = 8 - row['id_number'].astype(str).str.len()
        return row['country index'].str.cat(row['id_number'].values.astype(str),sep = '0'*sep_mult)
sk_df['id'] = sk_df.apply(lambda row: indexing(row),axis = 1)
But it doesn't work.
How can I do it?

Use .zfill():
sk_df['id'] = sk_df['country index'] + sk_df['id_number'].astype(str).str.zfill(8)
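The row-wise attempt in the question most likely fails because, inside apply(..., axis=1), row['id_number'] is a single value rather than a Series, so it has no .str accessor. A minimal sketch comparing the vectorized .zfill() approach with a row-wise equivalent follows; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: ids with 8, 6 and 7 digits.
sk_df = pd.DataFrame({
    'country index': ['SK', 'SK', 'SK'],
    'id_number': [98341852, 813841, 1234567],
})

# Vectorized: left-pad each id with zeros to 8 characters,
# then prepend the country code.
sk_df['id'] = sk_df['country index'] + sk_df['id_number'].astype(str).str.zfill(8)


# Row-wise equivalent: each value in the row is a scalar,
# so plain str methods are used instead of the .str accessor.
def indexing(row):
    return row['country index'] + str(row['id_number']).zfill(8)


sk_df['id_rowwise'] = sk_df.apply(indexing, axis=1)
print(sk_df)
```

Both columns should hold the same values; the vectorized form is usually the faster of the two on large frames.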

Related

Filtering dataframe based on two column string match count

I want to filter rows based on the string in data['Age']: a row should be kept only if the pattern built from it ("o;", "i;", "twenty;", "a;") occurs at least twice in data['Name'].
data = pd.DataFrame({'Name':['To;om', 'ni;cki;', 'krish', 'jack'],'Age':['o', 'i', 'twenty', 'a']})
data
      Name     Age
0    To;om       o
1  ni;cki;       i
2    krish  twenty
3     jack       a
Output should look like this:
      Name Age  count
0  ni;cki;   i      2
Use df.apply:
In [427]: data[data.apply(lambda x: x['Name'].count(f"{x['Age']};") >= 2, 1)]
Out[427]:
      Name Age
1  ni;cki;   i
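Alternatively, use a list comprehension to compute the counts and then filter on them. Note that the pattern to count is the Age value followed by ";", not the Age value alone (otherwise "To;om" would also match, since it contains "o" twice):
data["count"] = [name.count(f"{age};") for name, age in zip(data.Name, data.Age)]
out = data[data["count"] >= 2].copy()
to get
>>> out
      Name Age  count
1  ni;cki;   i      2
The copy at the end is to avoid the SettingWithCopyWarning in possible future manipulations.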

Dataframe change column value on if statement and keeps the new value to next row

I wish good health to you and your family.
In my dataframe I have a column 'condition' whose values are cast with .astype(float).
Based on the information I put into this dataframe, a calculation runs for every row, and if the result is over a specific amount, the value of 'condition' is increased by 1. All of that works as it should.
I made another column named 'order', which should change its value when 'condition' has the value 3. This code shows what I mean:
import pandas as pd
import numpy as np
def graph():
    df = (pd.DataFrame(np.random.randint(-3,4,size=(100, 1)), columns=[('condition')]))
    df['order'] = 0
    df.loc[(df['condition'] == 3) & (df['order'] == 0) , 'order'] = df['order'] + 1
    df.loc[(df['condition'] == -3) & (df['order'] == 1) , 'order'] = df['order'] + -1
    df.to_csv('copy_bars.csv')
graph()
As you can see, it changes the value in 'order' to 1 when a row meets the first condition. But the second statement never changes it back from 1 to 0; the other rows are 0 only because I initialize the column to 0 at the beginning.
How can I modify the code so that once 'order' is changed to 1, it keeps this new value in the following rows until the second condition is met?
Row  Condition  Order
0    -1         0
1     3         1
2    -1         0
3     2         0
4    -2         0
5    -3         0
6     0         0
Instead of this, I would like the Order column to be 1 for rows 1 to 4 so that my second condition can trigger.
If I understood correctly, something like this should do what you want. Because it works row by row and depends on two values, it is not easy to vectorize, though someone else can probably do it. Hope it works for you.
order = []
have_found_plus_3 = False
for i, row in df.iterrows():
    if row['condition'] == 3:
        have_found_plus_3 = True
    elif row['condition'] == -3:
        have_found_plus_3 = False
    if have_found_plus_3:
        order.append(1)
    else:
        order.append(0)
df['order'] = order
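One possible vectorized variant of the loop above, as a sketch rather than a definitive solution: keep only the values that switch the state (3 and -3), carry the most recent one forward with ffill(), and mark rows whose latest switch was 3. The sample frame reuses the values from the table in the question.

```python
import pandas as pd

# Conditions from the example table in the question.
df = pd.DataFrame({'condition': [-1, 3, -1, 2, -2, -3, 0]})

# Keep only the switching values (3 and -3); every other row becomes NaN.
signal = df['condition'].where(df['condition'].isin([3, -3]))

# Carry the last switch forward. Rows before the first switch stay NaN,
# and NaN == 3 is False, so they end up as 0.
df['order'] = signal.ffill().eq(3).astype(int)
print(df)
```

For the sample data this gives 1 for rows 1 to 4 and 0 elsewhere, which matches what the loop produces.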

Select column with the most unique values from csv, python

I'm trying to come up with a way to select from a csv file the one numeric column that shows the most unique values. If there are multiple with the same amount of unique values it should be the left-most one. The output should be either the name of the column or the index.
Position,Experience in Years,Salary,Starting Date,Floor,Room
Middle Management,5,5584.10,2019-02-03,12,100
Lower Management,2,3925.52,2016-04-18,12,100
Upper Management,1,7174.46,2019-01-02,10,200
Middle Management,5,5461.25,2018-02-02,14,300
Middle Management,7,7471.43,2017-09-09,17,400
Upper Management,10,12021.31,2020-01-01,11,500
Lower Management,2,2921.92,2019-08-17,11,500
Middle Management,5,5932.94,2017-11-21,15,600
Upper Management,7,10192.14,2018-08-18,18,700
So here I would want 'Floor' or 4 as my output, given that Floor and Room have the same number of unique values but Floor is the left-most one. (I need it in pure Python; I can't use pandas.)
I have this nested in a whole bunch of other code for what I need to do as a whole. I will spare you the details, but these are the elements used in the code:
new_types_list = [str, int, str, datetime.datetime, int, int] #all the datatypes of the columns
l1_listed = ['Position', 'Experience in Years', 'Salary', 'Starting Date', 'Floor', 'Room'] #the header for each column
difference = [3, 5, 9, 9, 7, 7] #is basically the amount of unique values each column has
And here I try to do exactly what I mentioned before:
another_list = [] #now i create another list
for i in new_types_list: # this is where the error occurs, it only fills the list with the index of the first integer 3 times instead of with the individual indices
    if i== int:
        another_list.append(new_types_list.index(i))
integer_listi = [difference[i] for i in another_list] #and this list is the corresponding unique values from the integers
for i in difference: #now we want to find out the one that is the highest
    if i== max(integer_listi):
        chosen_one_i = difference.index(i) #the index of the column with the most unique values is the chosen one -
MUV_LMNC = l1_listed[chosen_one_i]
If pandas is an option, you can use .nunique() to get the number of unique values in each column:
import pandas as pd
df = pd.read_csv("your_file.csv")
print(df.nunique())
Prints:
Position               3
Experience in Years    5
Salary                 9
Starting Date          9
Floor                  7
Room                   7
dtype: int64
Then to find max, use .idxmax():
print(df.nunique().idxmax())
Prints:
Salary
EDIT: To select only integer columns (idxmax() returns the first, i.e. left-most, column when there is a tie):
print(df.select_dtypes(include="integer").nunique().idxmax())
Prints:
Floor
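Since the question asks for pure Python, here is a minimal sketch using only the standard csv module. It assumes a column counts as an integer column when every value in it parses with int(); the helper names is_int and most_unique_int_column are made up for this example. The bug in the question's loop appears to come from list.index(), which always returns the position of the first match; iterating with enumerate() avoids that.

```python
import csv
import io

# Sample data copied from the question; in practice, open the real file instead.
CSV_TEXT = """\
Position,Experience in Years,Salary,Starting Date,Floor,Room
Middle Management,5,5584.10,2019-02-03,12,100
Lower Management,2,3925.52,2016-04-18,12,100
Upper Management,1,7174.46,2019-01-02,10,200
Middle Management,5,5461.25,2018-02-02,14,300
Middle Management,7,7471.43,2017-09-09,17,400
Upper Management,10,12021.31,2020-01-01,11,500
Lower Management,2,2921.92,2019-08-17,11,500
Middle Management,5,5932.94,2017-11-21,15,600
Upper Management,7,10192.14,2018-08-18,18,700
"""


def is_int(text):
    """Return True if the text parses as an integer."""
    try:
        int(text)
    except ValueError:
        return False
    return True


def most_unique_int_column(rows):
    """Return (name, unique_count) of the left-most integer column
    with the most unique values."""
    header, *body = rows
    best_name, best_count = None, -1
    for idx, name in enumerate(header):
        values = [row[idx] for row in body]
        if not all(is_int(v) for v in values):
            continue  # skip text, float and date columns
        count = len({int(v) for v in values})
        if count > best_count:  # strict ">" keeps the left-most column on ties
            best_name, best_count = name, count
    return best_name, best_count


rows = list(csv.reader(io.StringIO(CSV_TEXT)))
result = most_unique_int_column(rows)
print(result)
```

To return the column index instead of the name, the same loop can track idx alongside the count.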

Merge certain rows in a DataFrame based on startswith

I have a DataFrame, in which I want to merge certain rows to a single one. It has the following structure (values repeat)
Index  Value
1      date:xxxx
2      user:xxxx
3      time:xxxx
4      description:xxx1
5      xxx2
6      xxx3
7      billed:xxxx
...
The problem is that rows 5 & 6 still belong to the description and were split off incorrectly (the whole string was separated by ","). I want to merge the "description" row (4) with the values that follow it (5, 6). In my DF there can be 1-5 additional entries that have to be merged with the description row, but the structure allows me to work with startswith: no matter how many rows have to be merged, the end point is always the row that starts with "billed". Since I am very new to Python, I haven't written any code for this problem yet.
My thought is the following (if it is even possible):
Look for a row that starts with "description" → merge all the rows after it until reaching the row that starts with "billed", then stop (obviously we keep the "billed" row) → do the same for each row starting with "description"
New DF should look like:
Index  Value
1      date:xxxx
2      user:xxxx
3      time:xxxx
4      description:xxx1, xxx2, xxx3
5      billed:xxxx
...
import pandas as pd
df = pd.DataFrame.from_dict({'Value': ('date:xxxx', 'user:xxxx', 'time:xxxx', 'description:xxx', 'xxx2', 'xxx3', 'billed:xxxx')})
records = []
description = description_val = None
for rec in df.to_dict('records'):  # type: dict
    # if previous description and record startswith previous description value
    if description and rec['Value'].startswith(description_val):
        description['Value'] += ', ' + rec['Value']  # add record Value into previous description
        continue
    # record with new description...
    if rec['Value'].startswith('description:'):
        description = rec
        _, description_val = rec['Value'].split(':')
    elif rec['Value'].startswith('billed:'):
        # billed record - remove description value
        description = description_val = None
    records.append(rec)
print(pd.DataFrame(records))
#                        Value
# 0                  date:xxxx
# 1                  user:xxxx
# 2                  time:xxxx
# 3  description:xxx, xxx2, xxx3
# 4                billed:xxxx
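The code above merges a row only when it starts with the description's own value ('xxx'), so it would not merge 'xxx2' into 'description:xxx1' as written in the question. A sketch of a variant that follows the rule stated in the question instead (everything after a "description" row is merged until the "billed" row) could look like this. It assumes, as the question says, that every description block is closed by a row starting with "billed:".

```python
import pandas as pd

df = pd.DataFrame({'Value': ['date:xxxx', 'user:xxxx', 'time:xxxx',
                             'description:xxx1', 'xxx2', 'xxx3', 'billed:xxxx']})

merged = []
in_description = False  # True while we are between "description:" and "billed:"
for value in df['Value']:
    if value.startswith('description:'):
        in_description = True
        merged.append(value)
    elif value.startswith('billed:'):
        in_description = False
        merged.append(value)
    elif in_description:
        # continuation row: append it to the most recent description entry
        merged[-1] += ', ' + value
    else:
        merged.append(value)

result = pd.DataFrame({'Value': merged})
print(result)
```

Because the loop only looks at the "description:" and "billed:" prefixes, the content of the continuation rows does not matter.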

TypeError: 'DataFrame' object is not callable in concatenating different dataframes of certain types

I keep getting the following error.
I read a file that contains time series data of 3 columns: [meter ID] [daycode(explain later)] [meter reading in kWh]
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
test = consum.loc[[1048]]
I will observe meter readings for all the length of data that I have in this file, but first filter by meter ID.
test['day'] = test['daycode'].astype(str).str[:3]
test['hm'] = test['daycode'].astype(str).str[-2:]
For readability, I convert daycode based on its rule. The first 3 digits are in the range 1 to 365 x 2 = 730, and the last 2 digits are in the range 1 to 48. These are 30-minute interval readings over 2 years (but not every meter has the full range).
So I created two files, one containing the dates and one containing the times. I will use the index to convert the digits of daycode into the corresponding date and time that these files contain.
#dcodebook index starts from 0. So minus 1 from the daycode before match
dcodebook = pd.read_csv("data/dcode.txt", encoding = "utf-8", sep = '\r', names =['match'])
#hcodebook starts from 1
hcodebook = pd.read_csv("data/hcode.txt", encoding = "utf-8", sep ='\t', lineterminator='\r', names =['code', 'print'])
hcodebook = hcodebook.drop(['code'], axis= 1)
For some weird reason, dcodebook was indexed using .iloc function as I understood, but hcodebook needed .loc.
#iloc: by int-position
#loc: by label value
#ix: by both
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
#to avoid duplicate index Valueerror, create separate dataframes..
hm_df = hcodebook.loc[test['hm'].astype(int) - 1]
#.to_frame error / do I need .reset_index(drop=True)?
The following line is where the code crashes.
datcode_df = day_df(['match']) + ' ' + hm_df(['print'])
print datcode_df
print test
What I don't understand:
I tested earlier that columns of different dataframes can be merged using the simple addition as seen
I initially assigned this to the existing column ['daycode'] in test dataframe, so that previous values will be replaced. And the same error msg was returned.
Please advise.
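day_df(['match']) uses parentheses, which tries to call the DataFrame as a function; that is what raises the TypeError. Select columns with square brackets only.
Both DataFrames also need to be the same size, so each day and hm lookup must return exactly one row per reading.
Then use reset_index with drop=True on both so that they share the same indices, and finally remove the () in the join: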
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
hm_df = hcodebook.loc[test['hm'].astype(int) - 1].reset_index(drop=True)
datcode_df = day_df['match'] + ' ' + hm_df['print']
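To illustrate why both fixes matter, here is a small sketch with made-up lookup results (the dates, times and index labels are hypothetical, not taken from the asker's files). Calling a DataFrame raises the TypeError, and adding two string columns with different indices aligns on labels and yields only NaN until both indices are reset.

```python
import pandas as pd

# Hypothetical lookup results; the dates and times are made up for illustration.
day_df = pd.DataFrame({'match': ['2009-07-14', '2009-07-15']}, index=[194, 195])
hm_df = pd.DataFrame({'print': ['00:30', '01:00']}, index=[0, 1])

# Parentheses try to call the DataFrame, which raises the TypeError from the question.
try:
    day_df(['match'])
except TypeError as exc:
    error_message = str(exc)

# Square brackets select the column, but the indices differ, so no labels line up.
misaligned = day_df['match'] + ' ' + hm_df['print']

# After resetting both indices, rows are paired by position.
aligned = (day_df['match'].reset_index(drop=True) + ' '
           + hm_df['print'].reset_index(drop=True))

print(error_message)
print(misaligned)
print(aligned)
```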