Here is a copy of what one of many txt files looks like.
Class 1:
Subject A:
posX posY posZ x(%) y(%)
0 2 0 81 72
0 2 180 63 38
-1 -2 0 79 84
-1 -2 180 85 95
. . . . .
Subject B:
posX posY posZ x(%) y(%)
0 2 0 71 73
-1 -2 0 69 88
. . . . .
Subject C:
posX posY posZ x(%) y(%)
0 2 0 86 71
-1 -2 0 81 55
. . . . .
Class 2:
Subject A:
posX posY posZ x(%) y(%)
0 2 0 81 72
-1 -2 0 79 84
. . . . .
The number of classes, subjects, and row entries all varies.
Class 1 / Subject A always has posZ entries that alternate between 0 and 180.
Calculate the average of x(%) and y(%) by class and by subject.
Calculate the standard deviation of x(%) and y(%) by class and by subject.
Also, ignore rows with a posZ of 180 when calculating the averages and standard deviations.
I have developed an unwieldy solution in Excel (using macros and VBA), but I would rather go for a more optimal solution in Python.
numpy is very helpful, but the .mean() and .std() functions only work with arrays. I am still researching this, as well as pandas' groupby function.
I would like the final output to look as follows (1. By Class, 2. By Subject)
1. By Class
X Y
Average
std_dev
2. By Subject
X Y
Average
std_dev
I think working with dictionaries (and a list of dictionaries) is a good way to get familiar with working with data in Python. To format your data like this, you'll want to read in your text files and define variables line by line.
To start:
for line in infile:
    line = line.strip()
    if line.startswith("Class"):
        temp, class_var = line.split(' ')
        class_var = class_var.replace(':', '')
    elif line.startswith("Subject"):
        temp, subject = line.split(' ')
        subject = subject.replace(':', '')
This will create variables that correspond to the current class and current subject. Then you want to read in the numeric rows. A good way to pick out just those rows is a try statement, which attempts to convert the fields to integers and skips any line where that fails.
    else:
        fields = line.split(' ')
        try:
            keys = ['posX', 'posY', 'posZ', 'x_perc', 'y_perc']
            values = [int(item) for item in fields]
            entry = dict(zip(keys, values))
            entry['class'] = class_var
            entry['subject'] = subject
            outputList.append(entry)
        except ValueError:
            pass
This will put them into dictionary form, including the earlier defined class and subject variables, and append them to an outputList. You'll end up with this:
[{'posX': 0, 'x_perc': 81, 'posZ': 0, 'y_perc': 72, 'posY': 2, 'class': '1', 'subject': 'A'},
{'posX': 0, 'x_perc': 63, 'posZ': 180, 'y_perc': 38, 'posY': 2, 'class': '1', 'subject': 'A'}, ...]
etc.
You can then average or take the SD by subsetting the list of dictionaries (applying rules like excluding posZ = 180). Here's an example of averaging by class:
classes = ['1', '2']
print("By Class:")
print("Class", "Avg X", "Avg Y", "X SD", "Y SD")
for class_var in classes:
    rows = [item for item in outputList
            if item['class'] == class_var and item['posZ'] != 180]
    x_m = np.mean([item['x_perc'] for item in rows])
    y_m = np.mean([item['y_perc'] for item in rows])
    x_sd = np.std([item['x_perc'] for item in rows])
    y_sd = np.std([item['y_perc'] for item in rows])
    print(class_var, x_m, y_m, x_sd, y_sd)
You'll have to play around printed output to get exactly what you want, but this should get you started.
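Since you mentioned you were researching pandas' groupby: once the rows are in the flat dictionary form above, the same statistics fall out of a short groupby chain. A minimal sketch, assuming `output` holds the parsed row dicts (a few rows from the sample file are hard-coded here so the snippet runs standalone); note that pandas' `std` uses the sample formula (ddof=1), while `np.std` defaults to the population formula (ddof=0), so the SDs will differ slightly from the loop above.

```python
import pandas as pd

# assumed: this is the list of row dicts built by the parsing loop above
output = [
    {'posX': 0, 'posY': 2, 'posZ': 0, 'x_perc': 81, 'y_perc': 72, 'class': '1', 'subject': 'A'},
    {'posX': 0, 'posY': 2, 'posZ': 180, 'x_perc': 63, 'y_perc': 38, 'class': '1', 'subject': 'A'},
    {'posX': -1, 'posY': -2, 'posZ': 0, 'x_perc': 79, 'y_perc': 84, 'class': '1', 'subject': 'A'},
]

df = pd.DataFrame(output)
df = df[df['posZ'] != 180]   # drop the 180-degree rows before aggregating

by_class = df.groupby('class')[['x_perc', 'y_perc']].agg(['mean', 'std'])
by_subject = df.groupby(['class', 'subject'])[['x_perc', 'y_perc']].agg(['mean', 'std'])
print(by_class)
print(by_subject)
```

This also generalizes for free when the number of classes and subjects varies, since groupby discovers the groups itself.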
I am trying to measure the length of the list under Original Query and subsequently find the mean and std dev but I cannot seem to measure the length. How do I do it?
This is what I tried:
filepath = "yandex_users_paired_queries.csv"  # path to the csv with the query dataset
queries = pd.read_csv(filepath)
totalNum = queries.groupby('Original Query').size().reset_index(name='counts')
sessions = queries.groupby(['UserID','Original Query'])
print(sessions.size())
print("----------------------------------------------------------------")
print("~~~Mean & Average~~~")
sessionsDF = sessions.size().to_frame('counts')
sessionsDFbyBool = sessionsDF.groupby(['Original Query'])
print(sessionsDFbyBool["counts"].agg([np.mean,np.std]))
And this is my output:
UserID Original Query
154 [1228124, 388107, 1244921, 3507784] 1
[1237207, 1974238, 1493311, 1222688, 733390, 868851, 428547, 110871, 868851, 235307] 1
[1237207, 1974238, 1493311, 1222688, 733390, 868851, 428547] 1
[1237207, 1974238, 1493311, 1222688, 733390] 1
[1237207] 1
..
343 [919873, 551537, 1841361, 1377305, 610887, 1196372, 3724298] 1
[919873, 551537, 1841361, 1377305, 610887, 1196372] 1
345 [3078369, 3613096, 4249887, 2383044, 2366003, 4043437] 1
[3531370, 3078369, 284354, 4300636] 1
347 [1617419] 1
Length: 612, dtype: int64
You want to apply the len function on the 'Original Query' column.
queries['oq_len'] = queries['Original Query'].apply(len)
sessionsDF = queries.groupby('UserID').oq_len.agg([np.mean,np.std])
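For illustration, here is the same idea on a hypothetical toy frame (string aggregation names are used, which is equivalent to passing `np.mean`/`np.std`). One caveat: if 'Original Query' comes straight from `read_csv`, each cell is a string such as `"[1228124, 388107]"`, so you would first convert it, e.g. with `ast.literal_eval`, before `len` gives you the list length.

```python
import pandas as pd

# hypothetical stand-in for the real CSV: each 'Original Query' cell holds a list of IDs
queries = pd.DataFrame({
    'UserID': [154, 154, 347],
    'Original Query': [[1228124, 388107], [1237207], [1617419]],
})

queries['oq_len'] = queries['Original Query'].apply(len)   # length of each list
sessionsDF = queries.groupby('UserID').oq_len.agg(['mean', 'std'])
print(sessionsDF)
```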
So I created a pandas data frame showing the coordinates for an event and the number of times those coordinates appear, and the coordinates are shown as a string, like this.
Coordinates Occurrences x
0 (76.0, -8.0) 1 0
1 (-41.0, -24.0) 1 1
2 (69.0, -1.0) 1 2
3 (37.0, 30.0) 1 3
4 (-60.0, 1.0) 1 4
.. ... ... ..
63 (-45.0, -11.0) 1 63
64 (80.0, -1.0) 1 64
65 (84.0, 24.0) 1 65
66 (76.0, 7.0) 1 66
67 (-81.0, -5.0) 1 67
I want to create a new data frame that shows the x and y coordinates individually and shows their occurrences as well, like this:
x Occurrences y Occurrences
76 ... -8 ...
-41 ... -24 ...
69 ... -1 ...
37 ... 30 ...
-60 ... 1 ...
I have tried to split the string, but I don't think I am doing it correctly, and I don't know how to add the result to the table regardless. I think I'd have to do something like a for loop later in my code. I scraped the data from an API; here is the code that sets up the data frame shown.
shots = {}
goals = {}
for key in contents['liveData']['plays']['allPlays']:
    if key['result']['event'] == "Shot":
        scoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if scoordinates not in shots:
            shots[scoordinates] = 1
        else:
            shots[scoordinates] += 1
    if key['result']['event'] == "Goal":
        gcoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if gcoordinates not in goals:
            goals[gcoordinates] = 1
        else:
            goals[gcoordinates] += 1

# create data frames using pandas
gdf = pd.DataFrame(list(goals.items()), columns=['Coordinates', 'Occurrences'])
print(gdf)
sdf = pd.DataFrame(list(shots.items()), columns=['Coordinates', 'Occurrences'])
print(sdf)
Try this:
import re
df[['x', 'y']] = df.Coordinates.apply(
    lambda c: pd.Series(dict(zip(['x', 'y'],
                                 re.findall(r'-?[0-9]+\.[0-9]+', c.strip())))))
Using the built-in string methods to achieve this should be performant:
df[["x", "y"]] = df["Coordinates"].str.strip("()").str.split(",", expand=True).astype(float)
(Note that str.strip takes a set of characters, so "()" is all you need; np.float is deprecated, plain float works. This also converts x and y to float values, which wasn't requested but is probably desired.)
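As a self-contained check of the string-method route, here it is on two toy rows copied from the frame above. This assumes the Coordinates column really holds strings; if the cells are actual tuples (as they would be coming straight from the dict keys), `pd.DataFrame(df['Coordinates'].tolist(), columns=['x', 'y'])` is simpler.

```python
import pandas as pd

# toy frame matching the layout above; Coordinates are strings like "(76.0, -8.0)"
df = pd.DataFrame({'Coordinates': ['(76.0, -8.0)', '(-41.0, -24.0)'],
                   'Occurrences': [1, 1]})

df[['x', 'y']] = (df['Coordinates']
                  .str.strip('()')            # drop the surrounding parentheses
                  .str.split(',', expand=True)
                  .astype(float))             # "76.0" -> 76.0 etc.
print(df)
```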
Let's say I have a .CSV which has three columns: tidytext, location, vader_senti
I was already able to get the number of *positive, neutral, and negative texts (rather than words)* per country using the following code:
data_vis = pd.read_csv(r"csviamcrpreprocessed.csv", usecols=fields)
def print_sentiment_scores(text):
    vadersenti = analyser.polarity_scores(str(text))
    return pd.Series([vadersenti['pos'], vadersenti['neg'], vadersenti['neu'], vadersenti['compound']])
data_vis[['vadersenti_pos', 'vadersenti_neg', 'vadersenti_neu', 'vadersenti_compound']] = data_vis['tidytext'].apply(print_sentiment_scores)
data_vis['vader_senti'] = 'neutral'
data_vis.loc[data_vis['vadersenti_compound'] > 0.3 , 'vader_senti'] = 'positive'
data_vis.loc[data_vis['vadersenti_compound'] < 0.23 , 'vader_senti'] = 'negative'
data_vis['vader_possentiment'] = 0
data_vis.loc[data_vis['vadersenti_compound'] > 0.3 , 'vader_possentiment'] = 1
data_vis['vader_negsentiment'] = 0
data_vis.loc[data_vis['vadersenti_compound'] <0.23 , 'vader_negsentiment'] = 1
data_vis['vader_neusentiment'] = 0
data_vis.loc[(data_vis['vadersenti_compound'] <=0.3) & (data_vis['vadersenti_compound'] >=0.23) , 'vader_neusentiment'] = 1
sentimentbylocation = data_vis.groupby(["Location"])['vader_senti'].value_counts()
sentimentbylocation
sentimentbylocation gives me the following results:
Location vader_senti
Afghanistan negative 151
positive 25
neutral 2
Albania negative 6
positive 1
Algeria negative 116
positive 13
neutral 4
To get the most common positive words, I used this code:
def process_text(text):
    tokens = []
    for line in text:
        toks = tokenizer.tokenize(line)
        toks = [t.lower() for t in toks if t.lower() not in stopwords_list]
        tokens.extend(toks)
    return tokens
tokenizer=TweetTokenizer()
punct = list(string.punctuation)
stopwords_list = stopwords.words('english') + punct + ['rt','via','...','…','’','—','—:',"‚","â"]
pos_lines = list(data_vis[data_vis.vader_senti == 'positive'].tidytext)
pos_tokens = process_text(pos_lines)
pos_freq = nltk.FreqDist(pos_tokens)
pos_freq.most_common()
Running this will give me the most common words and the number of times they appeared, such as
[('good', 1212),
 ('amazing', 123),
 ...]
However, what I want to see is how many of these positive words appeared in a country.
For example:
I have a sample CSV here: https://drive.google.com/file/d/112k-6VLB3UyljFFUbeo7KhulcrMedR-l/view?usp=sharing
Create a column for each word in most_common, then group by Location and use agg to sum each count:
words = [i[0] for i in pos_freq.most_common()]
# lowering all cases in tidytext
data_vis.tidytext = data_vis.tidytext.str.lower()
for i in words:
    data_vis[i] = data_vis.tidytext.str.count(i)
funs = {i: 'sum' for i in words}
grouped = data_vis.groupby('Location').agg(funs)
Based on the example from the CSV and using most_common as ['good', 'amazing'] the result would be:
grouped
# good amazing
# Location
# Australia 0 1
# Belgium 6 4
# Japan 2 1
# Thailand 2 0
# United States 1 0
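One caveat with `str.count(i)`: it counts substring hits, so 'good' also matches inside 'goodness'. If whole-word counts are wanted, a regex with word boundaries is a safer sketch (the frame below is hypothetical toy data standing in for the CSV):

```python
import re
import pandas as pd

# hypothetical toy data in place of the real CSV
data_vis = pd.DataFrame({'Location': ['Japan', 'Japan', 'Belgium'],
                         'tidytext': ['good food so good', 'goodness', 'amazing']})
words = ['good', 'amazing']

for w in words:
    # \b anchors keep 'good' from matching inside 'goodness'
    data_vis[w] = data_vis['tidytext'].str.count(rf'\b{re.escape(w)}\b')

grouped = data_vis.groupby('Location')[words].sum()
print(grouped)
```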
I'm trying to code a function which checks whether the tic-tac-toe game has ended and returns the winner if it has. Can you please help me code this part?
input: s, a 3x3 integer matrix depicting the current state of the game.
An element equal to 0 means the place is empty, an element equal to 1 means the place is taken by the 'X' player, and an element equal to -1 means the place is taken by the 'O' player.
output:
e: the result.
if e = None, the game has not ended yet.
if e = 0, the game ended in a draw.
if e = 1, the X player won the game.
if e = -1, the O player won the game.
I tried some user-defined functions to work this out but couldn't do it properly.
def check_game(s):
    e = None
    def sum_all_straights(S):
        r3 = np.arange(3)
        return np.concatenate([s.sum(0), s.sum(1), s[[r3, 2-r3], r3].sum(1)])
    def winner(S):
        ss = sum_all_straights(S)
        ssa = np.absolute(ss)
        win, = np.where(ssa == 3)
        if win.size:
            e = ss[win[0]] // 3
        sas = sum_all_straights(np.absolute(S))
        if (sas > ssa).all():
            e = 0
        return e
Here is one approach. We can check for a winner by summing the fields of each straight line: if at least one of the sums is 3 or -3, then player X or O has won.
If there is no winner, we check for a draw. The game is drawn if no player can win anymore, i.e. if each straight line has at least one X and one O. We can detect that by comparing two quantities. First take the absolute value of the board and then sum each line; call that SAS. It counts the chips on each straight line regardless of who they belong to. Then sum each line first and take the absolute value afterwards; call that SSA. If both players have chips on a line, there is some cancellation in the signed sum, so SSA will be smaller than SAS. If this is true for every straight line, the game is drawn.
Implementation:
import numpy as np
def sum_all_straights(S):
    r3 = np.arange(3)
    # Concatenates the three column sums, the three row sums, and the two
    # diagonal sums (the diagonals are picked out with advanced indexing).
    return np.concatenate([S.sum(0), S.sum(1), S[[r3, 2-r3], r3].sum(1)])
def winner(S):
    ss = sum_all_straights(S)
    ssa = np.absolute(ss)
    win, = np.where(ssa == 3)
    if win.size:
        return ss[win[0]] // 3    # 3//3 -> 1 (X wins), -3//3 -> -1 (O wins)
    sas = sum_all_straights(np.absolute(S))
    if (sas > ssa).all():
        return 0                  # draw: every line is blocked
examples = b"""
XO. OOO ... XXO
OXO XX. .X. OXX
XXO OXX .O. XOO
"""
examples = np.array(examples.strip().split()).reshape(3,4,1).view("S1").transpose(1,0,2)
examples = 0 + (examples==b'X') - (examples==b'O')
for example in examples:
    print(example, winner(example), "\n")
Demo:
[[ 1 -1 0]
[-1 1 -1]
[ 1 1 -1]] None
[[-1 -1 -1]
[ 1 1 0]
[-1 1 1]] -1
[[ 0 0 0]
[ 0 1 0]
[ 0 -1 0]] None
[[ 1 1 -1]
[-1 1 1]
[ 1 -1 -1]] 0
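The function can also be called directly on a hand-built board; the answer's two helpers are restated here so the snippet runs standalone. A top row of X's should report 1:

```python
import numpy as np

def sum_all_straights(S):
    r3 = np.arange(3)
    # three column sums, three row sums, and the two diagonal sums
    return np.concatenate([S.sum(0), S.sum(1), S[[r3, 2 - r3], r3].sum(1)])

def winner(S):
    ss = sum_all_straights(S)
    ssa = np.absolute(ss)
    win, = np.where(ssa == 3)
    if win.size:
        return ss[win[0]] // 3    # 1 for X, -1 for O
    sas = sum_all_straights(np.absolute(S))
    if (sas > ssa).all():
        return 0                  # draw

board = np.array([[ 1,  1,  1],
                  [-1, -1,  0],
                  [ 0,  0,  0]])
print(winner(board))   # X completed the top row
```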
I need to select the equipment_id for which the Reading is ascending over consecutive timestamps in the Hive table 'Whether_report' below.
station_id equipment_id time_stamp Reading
1 100 00:00:01 60
2 100 00:00:02 61
3 100 00:00:03 62
4 100 00:00:04 60
5 100 00:00:05 61
. . . .
. . . .
16 114 00:00:11 66
17 114 00:00:12 65
. . . .
. . . .
. . . .
. . . .
29 112 00:00:23 71
30 113 00:00:24 69
For example: I need to select the equipment_id whose Reading is ascending for five consecutive timestamps (e.g. 60->61->62->63->64->65), and I should not select an equipment_id whose readings drop back down over consecutive timestamps (e.g. 60->61->62->60->61). I am struggling to get the correct query; any suggestion is much appreciated.
I tried a loop for your requirement:
List<Integer> lis = new ArrayList<Integer>();
int j = 0, flag = 1, width = 0;
lis.add(0, 60);
lis.add(1, 61);
lis.add(2, 61);
lis.add(3, 60);
lis.add(4, 61);
lis.add(5, 62);
lis.add(6, 64);
lis.add(7, 66);
lis.add(8, 68);
Iterable<Integer> itr = lis;
for (int i : itr) {
    if (j != 0) {
        if (width == 4)
            break;
        if (i > j) {
            flag = 1;
            width++;
        } else if (i < j && width != 4) {
            // note: equal consecutive readings neither extend nor reset the run here
            flag = 0;
            width = 0;
        }
    }
    System.out.println(i);
    j = i;
}
System.out.println("flag = " + flag + " width = " + width);
Output:
60
61
61
60
61
62
64
66
flag = 1 width = 4
I think this can be plugged into the reducer class, where the key is an IntWritable equipment_id and the value is an Iterable of IntWritable readings; feed those values to this loop, assuming all timestamp values are unique.
I don't know if this is an optimal solution, considering the volume of data, but I hope it helps!
You probably have to go to Pig or MapReduce. You are trying to find a sorted subsequence of length 5 in a bunch of readings, which probably cannot be achieved in a single query.
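For what it's worth, if the readings can be pulled out into Python, the check reduces nicely: a run of five strictly ascending readings is four consecutive positive differences, which a rolling window can detect. A sketch with made-up data, assuming rows are already sorted by timestamp within each equipment_id (column names follow the table above):

```python
import pandas as pd

# made-up readings: equipment 100 climbs 60..65, equipment 114 dips back down
df = pd.DataFrame({
    'equipment_id': [100] * 6 + [114] * 5,
    'Reading':      [60, 61, 62, 63, 64, 65, 60, 61, 62, 60, 61],
})

def has_ascending_run(readings, run_len=5):
    rising = (readings.diff() > 0).astype(int)   # 1 where the reading increased
    # run_len consecutive ascending readings == run_len - 1 consecutive rises
    return bool(rising.rolling(run_len - 1).sum().eq(run_len - 1).any())

result = df.groupby('equipment_id')['Reading'].apply(has_ascending_run)
print(result[result].index.tolist())   # equipment ids with an ascending run
```

The same rolling-window idea maps onto SQL window functions where they are available; older Hive versions without them are why Pig or MapReduce gets suggested.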