Create a function to calculate an equation from a dataframe in pandas

I have a dataframe as shown below
Inspector_ID Sector Waste Fire Traffic
1 A 7 2 1
1 B 0 0 0
1 C 18 2 0
2 A 1 6 3
2 B 1 4 0
2 C 4 14 2
3 A 0 0 0
3 B 2 6 12
3 C 0 1 4
From the above dataframe I would like to calculate each inspector's expertise score in raising issues in a domain (Waste, Fire and Traffic).
For example, the score of inspector-1 for waste is ((7/8)*2 + (18/22)*3)/2
I1W = Inspector-1 similarity in waste.
Ai = No. of waste issues raised by inspector-1 in sector i
Ti = Total no. of waste issues in sector i
Ni = No. of inspectors who raised issues in sector i (an inspector counts as not having raised issues only if all of their values in that sector are zero)
TS1 = Total no. of sectors inspector-1 visited.
I1W = Sum((Ai/Ti)*Ni)/TS1
The expected output is below dataframe
Inspector_ID Waste Fire Traffic
1 I1W I1F I1T
2 I2W I2F I2T
3 I3W I3F I3T
TBF = To be filled

You could look into something along the lines of:
import pandas as pd

new_data = []
inspector_ids = df['Inspector_ID'].unique().tolist()
for inspector_id in inspector_ids:
    current_data = df.loc[df['Inspector_ID'] == inspector_id]
    # With the data of the current inspector you get the desired values
    # (the strings below are placeholders for the computed scores)
    waste_val = 'I1W'
    fire_val = 'I1F'
    traffic_val = 'I1T'
    new_data.append([inspector_id, waste_val, fire_val, traffic_val])
new_df = pd.DataFrame(new_data, columns=['Inspector_ID', 'Waste', 'Fire', 'Traffic'])
Some ideas for getting the values you need:
# TS1 = Sectors visited by inspector 1.
# Compute this after the first loc that filters the inspector.
# Note: this also counts sectors where every value is zero.
sectors_visited = current_data['Sector'].nunique()
# Ai = No. of waste issues raised by inspector-1 in sector i
waste_issues_A = current_data.loc[current_data['Sector'] == 'A', 'Waste'].sum()
# Ti = Total no. of waste issues in sector i
# You can get the total number of waste issues by sector with
df.groupby('Sector')['Waste'].sum()
# Ni = No. of inspectors who raised issues in sector i (not raised only if all values are zero)
# I don't know if I understand this one correctly; I guess it's the number
# of inspectors that raised issues in a sector
inspectors_sector_A = df.loc[df['Sector'] == 'A', 'Inspector_ID'].nunique()
The previous was written from memory, so take the code with a grain of salt (especially the Ni one).
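A vectorised sketch of the whole calculation, under one assumption that the question only implies: an inspector has "visited" a sector, and "raised issues" in it, when at least one of the three values in that row is non-zero. The dataframe is rebuilt from the table in the question; the variable names are illustrative.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "Inspector_ID": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Sector": list("ABCABCABC"),
    "Waste": [7, 0, 18, 1, 1, 4, 0, 2, 0],
    "Fire": [2, 0, 2, 6, 4, 14, 0, 6, 1],
    "Traffic": [1, 0, 0, 3, 0, 2, 0, 12, 4],
})
domains = ["Waste", "Fire", "Traffic"]

# Assumption: a row counts as "raised/visited" if any domain value is non-zero
raised = df[domains].sum(axis=1) > 0

# Ni: number of inspectors who raised issues in each row's sector
n_inspectors = raised.groupby(df["Sector"]).transform("sum")

# Ti: total issues per sector and domain, broadcast back to every row
totals = df.groupby("Sector")[domains].transform("sum")

# (Ai / Ti) * Ni for every row; 0/0 is treated as 0
terms = df[domains].div(totals).mul(n_inspectors, axis=0).fillna(0)

# TS: number of sectors each inspector visited
sectors_visited = raised.groupby(df["Inspector_ID"]).sum()

# Sum((Ai / Ti) * Ni) / TS per inspector
scores = terms.groupby(df["Inspector_ID"]).sum().div(sectors_visited, axis=0)
print(scores)
```

For inspector 1 and Waste this reproduces ((7/8)*2 + (18/22)*3)/2 from the question.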

Related

Changing column name and its values at the same time

Pandas help!
I have a specific column like this,
Mpg
0 18
1 17
2 19
3 21
4 16
5 15
Mpg is miles per gallon.
Now I need to rename that 'Mpg' column to 'litre per 100 km' and convert its values to litres per 100 km at the same time. Any help? Thanks in advance.
-Tom
I managed to change the name of the column, but I could not do both at the same time.
Use pop to return and delete the column at the same time and rdiv to perform the conversion (L/100 km = 235.215 / mpg for US gallons; the code below uses 235.15, which is marginally low):
df['litre per 100 km'] = df.pop('Mpg').rdiv(235.15)
If you want to insert the column in the same position:
df.insert(df.columns.get_loc('Mpg'), 'litre per 100 km',
df.pop('Mpg').rdiv(235.15))
Output:
litre per 100 km
0 13.063889
1 13.832353
2 12.376316
3 11.197619
4 14.696875
5 15.676667
An alternative to pop would be to store the result in another dataframe. This way you can perform the two steps at the same time. In my code below, I first reproduce your dataframe, then store the constant for conversion and perform it on all entries using the apply method.
import pandas as pd
df = pd.DataFrame({'Mpg':[18,17,19,21,16,15]})
cc = 235.214583 # constant for conversion from mpg to L/100km
df2 = pd.DataFrame()
df2['litre per 100 km'] = df['Mpg'].apply(lambda x: cc/x)
print(df2)
The output of this code is:
litre per 100 km
0 13.067477
1 13.836152
2 12.379715
3 11.200694
4 14.700911
5 15.680972
as expected.
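The two answers above use slightly different constants (235.15 and 235.214583). As a quick check, assuming US gallons and statute miles, the factor can be derived from the unit definitions; the names below are illustrative only.

```python
# Exact unit definitions (US liquid gallon, international mile)
LITRES_PER_US_GALLON = 3.785411784
KM_PER_MILE = 1.609344

# mpg -> L/100 km: litres burned over 100 km when one gallon covers `mpg` miles
MPG_TO_L_PER_100KM = 100 * LITRES_PER_US_GALLON / KM_PER_MILE


def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to litres per 100 km."""
    return MPG_TO_L_PER_100KM / mpg


print(round(MPG_TO_L_PER_100KM, 6))
print(round(mpg_to_l_per_100km(18), 6))
```

The derived factor agrees with the 235.214583 used in the second answer; for imperial gallons the litre constant would differ.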

counting unique values in column using sub-id

I have a df containing sub-trajectories (segments) of users, with mode of travel indicated by 0,1,2... which looks like this:
df = pd.read_csv('sample.csv')
df
id lat lon mode
0 5138001 41.144540 -8.562926 0
1 5138001 41.144538 -8.562917 0
2 5138001 41.143689 -8.563012 0
3 5138003 43.131562 -8.601273 1
4 5138003 43.132107 -8.598124 1
5 5145001 37.092095 -8.205070 0
6 5145001 37.092180 -8.204872 0
7 5145015 39.289341 -8.023454 2
8 5145015 39.197432 -8.532761 2
9 5145015 39.198361 -8.375641 2
In the above sample, id identifies the segments, but a full trajectory may be covered by different modes (i.e. contain multiple segments).
So the first 4 digits of id identify the unique trajectory, and the last 3 digits identify the unique segment within that trajectory.
I know that I can count the number of unique segments in the df using:
df.groupby('id')['mode'].nunique()
How do I then count the number of unique trajectories 5138, 5145, ...?
Use str indexing to get the first 4 characters; if necessary, first convert the values to strings with Series.astype:
out = df.groupby(df['id'].astype(str).str[:4])['mode'].nunique().reset_index(name='count')
print(out)
id count
0 5138 2
1 5145 2
If you need to process the values after the first 4 characters (the segment part of the id):
s = df['id'].astype(str)
out = s.str[4:].groupby(s.str[:4]).nunique().reset_index(name='count')
print(out)
id count
0 5138 2
1 5145 2
Another idea is to use a lambda function:
df.groupby(df['id'].apply(lambda x: str(x)[:4]))['mode'].nunique()
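The snippets above return a count per trajectory. If what is needed is the number of distinct trajectories itself, a minimal sketch (using only the id and mode columns of the sample, with illustrative variable names) could look like this:

```python
import pandas as pd

# id and mode columns copied from the sample in the question
df = pd.DataFrame({
    "id": [5138001] * 3 + [5138003] * 2 + [5145001] * 2 + [5145015] * 3,
    "mode": [0, 0, 0, 1, 1, 0, 0, 2, 2, 2],
})

# First 4 characters of the id identify the trajectory
trajectory = df["id"].astype(str).str[:4]

# Number of distinct trajectories
n_trajectories = trajectory.nunique()

# Number of distinct segments (full ids) within each trajectory
segments_per_trajectory = df["id"].groupby(trajectory).nunique()

print(n_trajectories)
print(segments_per_trajectory.to_dict())
```

This assumes every id has exactly 7 digits; with variable-length ids, integer division (for example `df["id"] // 1000`) may be the safer way to strip the 3-digit segment part.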

SQL: handling every bit without running the query repeatedly

I have a column that uses bits to record the status of every mission. The index of a bit represents the mission number, while 1/0 indicates whether that mission was successful; the bits are logically independent although they are stored together.
For instance, 1010 stored as a decimal number means a user finished the 2nd and 4th missions successfully, and the table looks like:
uid status
a 1100
b 1111
c 1001
d 0100
e 0011
Now I need to calculate, for every mission, how many users passed it. E.g. for mission 1 it's 0+1+1+0+1 = 3, while for mission 2 it's 0+1+0+0+1 = 2.
I can use the formula FLOOR(status%POWER(10,n)/POWER(10,n-1)) to get the bit of every mission for every user, but this means I need to run my query n times, and the status is now 64 bits long...
Is there any elegant way to do this in one query? Any help is appreciated....
The obvious approach is to normalise your data:
uid mission status
a 1 0
a 2 0
a 3 1
a 4 1
b 1 1
b 2 1
b 3 1
b 4 1
c 1 1
c 2 0
c 3 0
c 4 1
d 1 0
d 2 0
d 3 1
d 4 0
e 1 1
e 2 1
e 3 0
e 4 0
Alternatively, you can store a bitwise integer (or just do what you're currently doing) and process the data in your application code (e.g. a bit of PHP)...
uid status
a 12
b 15
c 9
d 4
e 3
<?php
$input = 15; // value comes from a query
$missions = array(1,2,3,4); // not really necessary in this particular instance
for( $i=0; $i<4; $i++ ) {
    $intbit = pow(2,$i);
    if( $input & $intbit ) {
        echo $missions[$i] . ' ';
    }
}
?>
Outputs '1 2 3 4'
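The PHP snippet lists the missions passed by a single user. For the per-mission totals the question asks about, the same bit test can be applied across all users in application code. A rough Python sketch, assuming the statuses have already been fetched as integers (the dictionary and function name are illustrative):

```python
# Integer statuses from the table above (a=12, b=15, c=9, d=4, e=3)
statuses = {"a": 0b1100, "b": 0b1111, "c": 0b1001, "d": 0b0100, "e": 0b0011}
N_MISSIONS = 4  # would be 64 for the real column


def passes_per_mission(values, n_missions):
    """Return a list where item k is the number of users with bit k set."""
    return [sum((v >> bit) & 1 for v in values) for bit in range(n_missions)]


counts = passes_per_mission(list(statuses.values()), N_MISSIONS)
for mission, count in enumerate(counts, start=1):
    print(f"mission {mission}: {count}")
```

Mission 1 is taken to be the least significant bit, which matches the normalised table above.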
Just convert the value to a string, remove the '0's, and calculate the length. Assuming that the value really is a decimal:
select length(replace(cast(status as char), '0', '')) as num_missions
from t;
Here is a db<>fiddle using MySQL. Note that the conversion to a string might look a little different in Hive, but the idea is the same.
If it is stored as an integer, you can use the bin() function to convert an integer to a string. This is supported in both Hive and MySQL (the original tags on the question).
Bit fiddling in databases is usually a bad idea and suggests a poor data model. Your data should have one row per user and mission. Attempts at optimizing by stuffing things into bits may work sometimes in some programming languages, but rarely in SQL.

Can I use pandas to create a biased sample?

My code uses a column called booking status that is 1 for yes and 0 for no (there are multiple other columns that information will be pulled from, dependent on the booking status). There are far more no rows than yes rows, so I would like to take a sample with all the yes rows and the same number of no rows.
When I use
samp = rslt_df.sample(n=298, random_state=1, weights='bookingstatus')
I get the error:
ValueError: Fewer non-zero entries in p than size
Is there a way to do this sample this way?
If our entire dataset looks like this:
print(df)
c1 c2
0 1 1
1 0 2
2 0 3
3 0 4
4 0 5
5 0 6
6 0 7
7 1 8
8 0 9
9 0 10
We may decide to sample from it using the DataFrame.sample function. By default, this function will sample without replacement. Meaning, you'll receive an error by specifying a number of observations larger than the number of observations in your initial dataset:
df.sample(20)
ValueError: Cannot take a larger sample than population when 'replace=False'
In your situation, the ValueError comes from the weights parameter:
df.sample(3,weights='c1')
ValueError: Fewer non-zero entries in p than size
To paraphrase the DataFrame.sample docs, using the c1 column as our weights parameter implies that rows with a larger value in the c1 column are more likely to be sampled. Specifically, the sample function will not pick values from this column that are zero. We can fix this error using either one of the following methods.
Method 1: Set the replace parameter to be true:
m1 = df.sample(3,weights='c1', replace=True)
print(m1)
c1 c2
0 1 1
7 1 8
0 1 1
Method 2: Make sure the n parameter is equal to or less than the number of 1s in the c1 column:
m2 = df.sample(2,weights='c1')
print(m2)
c1 c2
7 1 8
0 1 1
If you decide to use this method, you won't really be sampling. You're really just filtering out any rows where the value of c1 is 0.
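For the goal stated in the question (all the yes rows plus an equally sized random draw of no rows), weights are not needed at all. A minimal sketch, using a made-up frame with the question's bookingstatus column:

```python
import pandas as pd

# Hypothetical data: 2 "yes" rows and 8 "no" rows
df = pd.DataFrame({
    "bookingstatus": [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
    "c2": range(1, 11),
})

# Keep every "yes" row
yes = df[df["bookingstatus"] == 1]

# Draw the same number of "no" rows at random, without replacement
no = df[df["bookingstatus"] == 0].sample(n=len(yes), random_state=1)

# Combine and restore the original row order
balanced = pd.concat([yes, no]).sort_index()
print(balanced["bookingstatus"].value_counts())
```

The random_state value is only there to make the draw repeatable.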
I was able to do this in the end; here is how I did it:
bookingstatus_count = df.bookingstatus.value_counts()
print('Class 0:', bookingstatus_count[0])
print('Class 1:', bookingstatus_count[1])
print('Proportion:', round(bookingstatus_count[0] / bookingstatus_count[1], 2), ': 1')
# Class count (read by label rather than relying on the order of value_counts)
count_class_0, count_class_1 = bookingstatus_count[0], bookingstatus_count[1]
# Divide by class
df_class_0 = df[df['bookingstatus'] == 0]
df_class_1 = df[df['bookingstatus'] == 1]
# Undersample the majority class to the size of the minority class
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
based on this https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
Thanks everyone

How to get the KY-022 Infrared Receiver module to work on NodeMCU in Lua?

I have a KY-022 IR module that I can't get to work on my NodeMCU. I've been searching for some code samples in Lua on the internet with no luck. Can anyone point me in the right direction? Any code samples would be greatly appreciated.
At the moment I have the following code:
local pin = 4
gpio.mode(pin, gpio.OPENDRAIN, gpio.PULLUP)
gpio.trig(pin, "down", function (level, micro)
    print(gpio.read(pin), level, micro)
end)
When I press a button on the remote, I get something like this:
0 0 571940709
0 0 571954086
0 0 571955257
1 0 571958694
1 0 571963275
1 0 571969917
0 0 571974347
0 0 571980989
1 0 571983203
1 0 571987709
0 0 571993359
1 0 572000078
0 0 572004508
0 0 572047513
0 0 572058674
So, how do I get from that to figuring out which key was pressed on the remote?
After a month or so I've reopened this project and played around with it some more. As piglet suggested, I started listening for both high and low signals. The data is still very inconsistent and I can't get a stable reading.
(And by the way, thanks for the vote-down piglet, that was greatly appreciated. I wish you could have seen my search history before you decided that I'm ignorant)
I'm going to post my current code; maybe somebody can point out what I'm doing wrong here.
local pin = 4
local prevstate = false
local prevmicro = 0
local prevtime = 0
local count = 0
gpio.mode(pin, gpio.INT)
gpio.trig(pin, "both", function (level, micro)
    --local state = gpio.read(pin)
    local state = level
    if (micro - prevmicro) > 90000 then
        prevmicro = 0
        prevstate = false
        count = 0
        print("\n#", "st", "lv", "microtime", "timing")
    end
    if prevstate ~= state then
        time = math.floor((micro - prevmicro)/100)
        prevstate = state
        prevmicro = micro
        if time > 3 and time < 1000 then
            if prevtime > 80 and prevtime < 100 then
                if time > 17 and time < 25 then
                    print('Repeat')
                elseif time > 40 and time < 50 then
                    print('Start')
                end
            else
                print(count, gpio.read(pin), level, micro, time)
                count = count + 1
            end
            prevtime = time
        end
    end
end)
and here are some sample readouts from pushing the same button:
# st lv microtime timing
1 1 1 1504559531 16
2 1 0 1504566995 74
3 0 1 1504567523 5
4 1 0 1504573619 60
5 0 1 1504587422 138
6 1 0 1504588011 5
7 1 1 1504604250 162
8 1 0 1504605908 16
9 1 1 1504659929 540
10 1 0 1504662154 22
# st lv microtime timing
1 1 1 1505483535 16
2 1 0 1505491003 74
3 0 1 1505491558 5
4 1 0 1505497627 60
5 0 1 1505511409 137
6 1 0 1505512023 6
7 1 1 1505518186 61
8 1 0 1505527733 95
9 1 0 1505586167 22
10 1 1 1505586720 5
# st lv microtime timing
1 1 1 1507990937 16
2 1 0 1507998405 74
3 0 1 1507998934 5
4 1 0 1508005029 60
5 0 1 1508018811 137
6 1 0 1508019424 6
7 1 1 1508035641 162
8 1 0 1508037322 16
9 1 1 1508091345 540
10 1 0 1508093570 22
As it turns out, the Lua code required for this is actually quite simple.
Where the code above falls over is the print statements. These are extremely expensive and kill your sampling resolution until it is useless.
You are, in essence, writing an interrupt service routine. You have a limited time budget before the next edge change has to be read, and if it happens before you are done processing, tough luck! So you need to make the ISR as efficient as you can.
In the example below, we listen for the "both" edge event; when one occurs, we simply record which edge it was and how long the previous level lasted.
Periodically (using a timer) we print out the contents of the waveform.
This perfectly matches the waveform on my logic analyzer, but you still have the challenge of decoding the signal. There are lots of great protocol docs that explain how to take accurate waveform data and use it to determine the signal being sent. I found that a lot of cheap "brand x" remotes appear to use the NEC protocol, so this might be a good place to start depending on your project.
IR transmission is by its nature not completely error-free, so you may get a spurious edge from time to time, but the code below is pretty stable and runs quite well in isolation. I have yet to test it when the microcontroller is under more load than just listening for IR.
It may turn out that Lua is not the best choice for this purpose because it is an interpreted language (each command is parsed and then executed at runtime, which is not at all efficient). But I will see how far I can get before I decide to write a C module.
local irpin = 2
local lastTimestamp = 0
local waveform = {}
local i = 1
gpio.mode(irpin,gpio.INT)
gpio.trig(irpin, "both", function(level, ts)
    onEdge(level, ts)
end)
function onEdge(level, ts)
    waveform[i] = level
    waveform[i+1] = ts - lastTimestamp
    lastTimestamp = ts
    i = i+2
end
-- Print out the waveform
function showWaveform ()
    if table.getn(waveform) > 65 then
        for k,v in pairs(waveform) do
            print(k,v)
        end
        i = 1;
        waveform = {}
    end
end
tmr.alarm(0, 1000, 1, showWaveform)
print("Ready")
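The decoding step mentioned above is not part of the Lua sample. As a rough offline illustration in Python (not NodeMCU code), here is how a list of captured durations could be turned into an address and command, assuming the remote follows the nominal NEC timings (9 ms leader mark, 4.5 ms leader space, 562.5 µs bit mark, short space for 0, long space for 1, bytes sent least significant bit first). The function names, the tolerance and the input layout are assumptions made for this sketch.

```python
# Nominal NEC timings in microseconds
LEADER_MARK, LEADER_SPACE = 9000, 4500
BIT_MARK, ZERO_SPACE, ONE_SPACE = 562, 562, 1687


def near(value, target, tolerance=0.25):
    """True if value is within +/- tolerance (as a fraction) of target."""
    return abs(value - target) <= target * tolerance


def decode_nec(durations):
    """Decode alternating mark/space durations (µs), leader mark first.

    Returns (address, command), or None if the frame does not look like NEC.
    """
    if len(durations) < 2 + 64:
        return None
    if not (near(durations[0], LEADER_MARK) and near(durations[1], LEADER_SPACE)):
        return None
    bits = []
    for k in range(32):
        mark, space = durations[2 + 2 * k], durations[3 + 2 * k]
        if not near(mark, BIT_MARK):
            return None
        if near(space, ONE_SPACE):
            bits.append(1)
        elif near(space, ZERO_SPACE):
            bits.append(0)
        else:
            return None
    # Four bytes, each transmitted least significant bit first
    data = [sum(bit << n for n, bit in enumerate(bits[8 * b:8 * b + 8]))
            for b in range(4)]
    address, _address_inv, command, command_inv = data
    # Only the command checksum is verified; extended NEC reuses the
    # second byte as part of a 16-bit address.
    if command ^ command_inv != 0xFF:
        return None
    return address, command


def encode_nec(address, command):
    """Build an ideal NEC frame, used here only to exercise the decoder."""
    durations = [LEADER_MARK, LEADER_SPACE]
    for byte in (address, address ^ 0xFF, command, command ^ 0xFF):
        for n in range(8):
            durations += [BIT_MARK, ONE_SPACE if (byte >> n) & 1 else ZERO_SPACE]
    durations.append(BIT_MARK)  # trailing mark that ends the last space
    return durations


print(decode_nec(encode_nec(0x00, 0x45)))
```

Real captures would first need the idle gap before the leader stripped, and the durations separated from the level values that the Lua code interleaves in the same table.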
The following code works for my 17 key remote which came with my cheap KY-022 module. I just finished it and haven't had time to clean it up nor to optimize it, so bear with me.
local IR = 2
local lts, i, wave = 0, 0, {}
local keys = {}
keys['10100010000000100000100010101000'] = '1'
keys['10001010000000100010000010101000'] = '2'
keys['10101010000000100000000010101000'] = '3'
keys['10000010000000100010100010101000'] = '4'
keys['10000000000000100010101010101000'] = '5'
keys['10101000000000100000001010101000'] = '6'
keys['10101010000000000000000010101010'] = '7'
keys['10100010001000000000100010001010'] = '8'
keys['10100000100000000000101000101010'] = '9'
keys['10100000101000000000101000001010'] = '0'
keys['10001010001000000010000010001010'] = '*'
keys['10100010100000000000100000101010'] = '#'
keys['10000000101000000010101000001010'] = 'U'
keys['10000000100000000010101000101010'] = 'L'
keys['10001000101000100010001000001000'] = 'R'
keys['10001000001000100010001010001000'] = 'D'
keys['10000010101000000010100000001010'] = 'OK'
local function getKey()
    local data = ''
    local len = table.getn(wave)
    if len >= 70 then
        local pkey = 0
        local started = false
        for k, v in pairs(wave) do
            v = math.floor(v/100)
            if (pkey == 87 or pkey == 88 or pkey == 89) and (v > 40 and v < 50) then
                started = true
            end
            pkey = v
            if started then
                if v > 300 then
                    started = false
                end
                --this is just to fix some random skipped edges
                if (v > 20 and v < 25) or v == 11 then
                    if v > 20 and v < 25 then
                        d = 17
                    else
                        d = 6
                    end
                    v1 = v - d
                    data = data .. '' .. math.floor(v1/10)
                    v2 = v - (v - d)
                    data = data .. '' .. math.floor(v2/10)
                else
                    if v < 40 then
                        data = data .. '' .. math.floor(v/10)
                    end
                end
            end
        end
        control = data:sub(0, 32)
        if control == '00000000000000000101010101010101' then
            data = data:sub(32, 63)
            print(len, data, keys[data] or '?')
        end
    end
    lts, i, wave = 0, 0, {}
end
local function onEdge(level, ts)
    local time = ts - lts
    wave[i] = time
    i = i + 1
    if time > 75000 then
        tmr.alarm(0, 350, 0, getKey)
    end
    lts = ts
end
gpio.mode(IR,gpio.INT)
gpio.trig(IR, "both", onEdge)
I'm putting this aside to start working on some other parts of my project for the moment, but if anyone has any suggestions on how I could improve it, or make it faster or even smaller, leave a comment.
PS: for those who are going to complain about it not working for them, you need to adjust the if statement values for the started variable based on your remote timings. In my case it's always 88 or 89 followed by 44.
You have to get the sequence sent by the remote for each button.
Record the IR emitter's on-off sequence by logging the time stamps for high-low and low-high transitions.
Note the various patterns for each button you want to use or emulate.
Here's an in-depth tutorial: http://www.instructables.com/id/How-To-Useemulate-remotes-with-Arduino-and-Raspber/
You can find this and similar resources using www.google.com