How to combine certain column values together and make the values in another column the means of the combined values? - pandas

I have a pandas DataFrame where one of the columns is a sequence of numbers ('sequence'), many of them repeating, and the other column's values ('binary variable') are either 1 or 0.
I have grouped by the values in the 'sequence' column that are the same and made the corresponding values in 'binary variable' the percentage of entries that are non-zero in that group.
I now want to combine entries in the 'sequence' column with the same values and make the values in 'binary variable' the mean of the values of the rows that were combined.
So my data frame looks like this:
df = pd.DataFrame({'sequence': [1, 1, 4, 4, 4, 6], 'binary variable': [1, 0, 0, 1, 0, 1]})
I have then used this code to group together the same values in 'sequence':
df.groupby("sequence")['binary variable'].apply(lambda s: (s != 0).sum() / s.count() * 100)
I am left with the 'sequence' column containing non-repeating values and the 'binary variable' column now being the percentage of non-zeros.
But now I want to group some of those values together (for this toy example, the 1 and 4 values) and have the 'binary variable' column hold the mean of the percentages for the grouped values.
This isn't terribly well worded, as I'm finding it awkward to describe, but I've tried looking online and had many failed attempts with code of my own; it just is not working. Any help would be greatly appreciated.

It seems like you want to group the table twice and take the mean each time. For the second grouping, you need to create a new column to indicate the group.
Try this code:
import pandas as pd

# sequence groups for the final average
grps = {(1, 4): [1, 4],
        (5, 6): [5, 6]}

# initial data
df = pd.DataFrame({'sequence': [1, 1, 4, 4, 4, 5, 5, 6],
                   'binvar': [1, 0, 0, 1, 0, 1, 0, 1]})

# first grouping: mean of binvar for each sequence value
gb = df.groupby("sequence")['binvar'].mean().reset_index()

def getgrp(x):  # search the groups for the one containing x
    for k in grps:
        if x in grps[k]:
            return k

print(df.to_string(index=False))
# second grouping: tag each row with its group, then average within groups
gb['group'] = gb['sequence'].apply(getgrp)
print(gb.to_string(index=False))
gb = gb.groupby("group")['binvar'].mean().reset_index()
print(gb.to_string(index=False))
Output
sequence binvar
1 1
1 0
4 0
4 1
4 0
5 1
5 0
6 1
sequence binvar group
1 0.500000 (1, 4)
4 0.333333 (1, 4)
5 0.500000 (5, 6)
6 1.000000 (5, 6)
group binvar
(1, 4) 0.416667
(5, 6) 0.750000
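As a usage note, the two groupby steps can also be chained by inverting grps into a value-to-group mapping; a minimal sketch, assuming the df and grps defined in the code above:

# assumes df and grps from the answer above
seq_to_group = {v: k for k, vals in grps.items() for v in vals}  # {1: (1, 4), 4: (1, 4), 5: (5, 6), 6: (5, 6)}
result = (df.groupby("sequence")['binvar'].mean()  # first average, per sequence value
            .groupby(seq_to_group).mean()          # second average, per group of values
            .rename_axis('group').reset_index())
print(result)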

Related

Select column with the most unique values from csv, python

I'm trying to come up with a way to select from a csv file the one numeric column that shows the most unique values. If there are multiple with the same amount of unique values it should be the left-most one. The output should be either the name of the column or the index.
Position,Experience in Years,Salary,Starting Date,Floor,Room
Middle Management,5,5584.10,2019-02-03,12,100
Lower Management,2,3925.52,2016-04-18,12,100
Upper Management,1,7174.46,2019-01-02,10,200
Middle Management,5,5461.25,2018-02-02,14,300
Middle Management,7,7471.43,2017-09-09,17,400
Upper Management,10,12021.31,2020-01-01,11,500
Lower Management,2,2921.92,2019-08-17,11,500
Middle Management,5,5932.94,2017-11-21,15,600
Upper Management,7,10192.14,2018-08-18,18,700
So here I would want 'Floor' or 4 as my output, given that Floor and Room have the same number of unique values but Floor is the left-most one. (I need it in pure Python; I can't use pandas.)
I have this nested in a whole bunch of other code for what I need to do as a whole. I will spare you the details, but these are the elements used in the code:
new_types_list = [str, int, str, datetime.datetime, int, int] #all the datatypes of the columns
l1_listed = ['Position', 'Experience in Years', 'Salary', 'Starting Date', 'Floor', 'Room'] #the header for each column
difference = [3, 5, 9, 9, 7, 7] # basically the number of unique values each column has
And here I try to do exactly what I mentioned before:
another_list = []  # now I create another list
for i in new_types_list:  # this is where the error occurs: it only fills the list with the index of the first integer 3 times instead of with the individual indices
    if i == int:
        another_list.append(new_types_list.index(i))
integer_listi = [difference[i] for i in another_list]  # and this list is the corresponding unique values from the integers
for i in difference:  # now we want to find the one that is the highest
    if i == max(integer_listi):
        chosen_one_i = difference.index(i)  # the index of the column with the most unique values is the chosen one
MUV_LMNC = l1_listed[chosen_one_i]
You can use .nunique() to get the number of unique values in each column:
import pandas as pd

df = pd.read_csv("your_file.csv")
print(df.nunique())
Prints:
Position 3
Experience in Years 5
Salary 9
Starting Date 9
Floor 7
Room 7
dtype: int64
Then to find max, use .idxmax():
print(df.nunique().idxmax())
Prints:
Salary
EDIT: To select only integer columns:
import numpy as np

print(df.select_dtypes(include=np.integer).nunique().idxmax())
Prints:
Floor
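Since the question asks for pure Python without pandas, here is a minimal standard-library sketch of the same idea; the function name is made up, and the int-only filter mirrors the asker's type list:

import csv

def most_unique_int_column(path):
    # Read the header and the data rows.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    best_index, best_count = None, -1  # assumes at least one all-integer column
    for i in range(len(header)):
        # Keep only columns whose values all parse as integers.
        try:
            values = [int(row[i]) for row in rows]
        except ValueError:
            continue
        unique_count = len(set(values))
        if unique_count > best_count:  # strict '>' keeps the left-most column on ties
            best_index, best_count = i, unique_count
    return header[best_index], best_index

# e.g. most_unique_int_column("your_file.csv") -> ('Floor', 4)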

How to create new columns using groupby based on logical expressions

I have this CSV file
http://www.sharecsv.com/s/2503dd7fb735a773b8edfc968c6ae906/whatt2.csv
I want to create three columns, 'MT_Value', 'M_Value', and 'T_Data'. The first should hold the mean of the data grouped by year and month, which I accomplished by doing this:
data.groupby(['Year','Month']).mean()
But for M_Value I need the mean of only the values different from zero, and for T_Data I need the count of the values that are zero divided by the total number of values. I guess that for the last one I need to divide the number of zero values by the amount of data in each group, but honestly I am a bit lost. I looked on Google and found something about transform, but I didn't understand it very well.
Thank you.
You could do something like this:
(data.assign(M_Value=data.Valor.where(data.Valor != 0),
             T_Data=data.Valor.eq(0))
     .groupby(['Year', 'Month'])
     [['Valor', 'M_Value', 'T_Data']]
     .mean()
)
Explanation: assign creates new columns with the respective names. Now:
data.Valor.where(data.Valor != 0) replaces 0 values with NaN, which are ignored when we call mean().
data.Valor.eq(0) yields True (1) for zeros and False (0) otherwise, so taking mean() computes count(Valor == 0) / total_count().
Output:
Valor M_Value T_Data
Year Month
1970 1 2.306452 6.500000 0.645161
2 1.507143 4.688889 0.678571
3 2.064516 7.111111 0.709677
4 11.816667 13.634615 0.133333
5 7.974194 11.236364 0.290323
... ... ... ...
1997 10 3.745161 7.740000 0.516129
11 11.626667 21.800000 0.466667
12 0.564516 4.375000 0.870968
1998 1 2.000000 15.500000 0.870968
2 1.545455 5.666667 0.727273
[331 rows x 3 columns]
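To see the where/eq mechanics in isolation, here is a tiny self-contained example with made-up numbers (the real data lives in the linked CSV):

import pandas as pd

# Made-up miniature: one (Year, Month) group containing two zeros.
data = pd.DataFrame({'Year': [1970] * 4, 'Month': [1] * 4,
                     'Valor': [0.0, 2.0, 4.0, 0.0]})

out = (data.assign(M_Value=data.Valor.where(data.Valor != 0),  # zeros -> NaN
                   T_Data=data.Valor.eq(0))                    # zeros -> True
           .groupby(['Year', 'Month'])
           [['Valor', 'M_Value', 'T_Data']]
           .mean())
print(out)
# Valor   = (0 + 2 + 4 + 0) / 4 = 1.5
# M_Value = (2 + 4) / 2        = 3.0   (NaNs are ignored by mean)
# T_Data  = 2 / 4              = 0.5   (share of zeros)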

Access query to split comment field into multiple records

In Access 2013 I'm trying to split fields in a column into new rows but having no success with several routes I've tried. For example, Row 1 has ID # 1 with 3 codes pertaining to it separated by commas (12, 13, 14). I have been trying to run a query to separate row 1 into three rows, each having one of the codes but the same ID #.
My goal is to have the comment column split into separate records/rows with the correct corresponding ID #, etc.
I've tried several variations of the following SQL query in access with no success:
Select [1 Tracking].[ID #], [1 Tracking].Codes
FROM [1 Tracking]
CROSS APPLY STRING_SPLIT([1 Tracking].Codes, ",");
Original data in the table
ID # / Code:
1 (12, 13, 14)
2 (13, 15)
3 (17)
(Only 3 rows)
Results after running the query:
ID # / Code:
1 (12)
1 (13)
1 (14)
2 (13)
2 (15)
3 (17)
(Now 6 rows instead of 3)
Any help would be greatly appreciated. I've only found VBA answers and have no idea how to use it. Thank you.
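For illustration outside Access: if the table can be exported, the same split-to-rows transformation can be done in pandas; a minimal sketch with the sample data above (column names 'ID #' and 'Codes' assumed from the question):

import pandas as pd

# Split the comma-separated Codes and emit one row per code.
df = pd.DataFrame({'ID #': [1, 2, 3], 'Codes': ['12, 13, 14', '13, 15', '17']})
out = (df.assign(Codes=df['Codes'].str.split(','))  # 'Codes' becomes lists
         .explode('Codes'))                         # one row per list element
out['Codes'] = out['Codes'].str.strip()
print(out.to_string(index=False))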

How to find the row and column number of a specific cell in sql?

I have a table in SQL database and I want to find the location of a cell like a coordinate and vice versa. Here is an example:
0 1 2 3
1 a b c
2 g h i
3 n o j
When I ask for i, I want to get row=2 and column=3. When I ask for a cell of row=2 and column=3, I want to get i.
You need to store your matrix in a table specifying the rows and columns, like this (row and column are reserved words in Oracle, hence the _num suffix):
create table matrix (
    row_num int,
    col_num int,
    value varchar2(20)
);
Then you insert your data like this:
insert into matrix values (1, 1, 'a');
insert into matrix values (1, 2, 'b');
-- and so on
And then you can simply find what you need using two queries:
select row_num, col_num from matrix where value = 'i';
select value from matrix where row_num = 2 and col_num = 3;
In Oracle, you would do:
select "3"
from t
where "0" = 2;
Naming columns as numbers is not recommended. Your whole data model is strange for SQL. A better representation would be:
row col val
1 1 a
1 2 b
1 3 c
2 1 g
. . .
Then you could do:
select val
from grid
where row = 2 and col = 3;
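If you want to experiment with this model quickly, a small Python sqlite3 sketch might look like the following; the column names are renamed slightly, since row can clash with SQL keywords:

import sqlite3

# A quick in-memory experiment with the (row, col, val) model described above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE grid (row_num INTEGER, col_num INTEGER, val TEXT)")
con.executemany("INSERT INTO grid VALUES (?, ?, ?)",
                [(1, 1, 'a'), (1, 2, 'b'), (1, 3, 'c'),
                 (2, 1, 'g'), (2, 2, 'h'), (2, 3, 'i'),
                 (3, 1, 'n'), (3, 2, 'o'), (3, 3, 'j')])

# Coordinates of a value:
print(con.execute("SELECT row_num, col_num FROM grid WHERE val = 'i'").fetchone())  # (2, 3)
# Value at given coordinates:
print(con.execute("SELECT val FROM grid WHERE row_num = 2 AND col_num = 3").fetchone())  # ('i',)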
Create a primary key column such as id; then, with the related column named col, for example:
select col from db where id = 2;
This returns a specific cell (x, 2).

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; these ids can be repeated, as "fvg" is repeated for tenant_id=2. I would like a threshold of not more than ten ids. This data is just a sample and has only 10 ids to start with, so in general the number selected is much less than the total number of user_ids; in this case, say, one less than the total user_ids that belong to a tenant.
I tried first figuring out how to select a random subset of varying length with df.sample:
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kinda lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
It doesn't really matter which column you use for the apply, since the variable is not used in the computation. (The .tolist() keeps each sample as a list in a single cell; without it, apply would try to expand the returned Series.)
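As a usage note, the snippet above samples from the entire user_id column; a hedged sketch of the per-tenant sampling described in the question (assuming every tenant has at least two users, as in the sample data, and capping the sample at ten) could look like this:

import numpy as np
import pandas as pd

def pick_peers(s):
    # For each user_id in this tenant, sample a random number (capped at 10)
    # of the *other* user_ids in the same tenant, without replacement.
    ids = s.tolist()
    new_col = []
    for uid in ids:
        peers = [u for u in ids if u != uid]
        n = np.random.randint(1, min(len(peers), 10) + 1)
        new_col.append(list(np.random.choice(peers, size=n, replace=False)))
    return pd.Series(new_col, index=s.index)

df['new_column'] = df.groupby('tenant_id', group_keys=False)['user_id'].apply(pick_peers)
print(df)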