Splitting a coordinate string into X and Y columns with a pandas data frame - pandas

I created a pandas data frame showing the coordinates for an event and the number of times those coordinates appear. The coordinates are stored as strings, like this:
Coordinates Occurrences x
0 (76.0, -8.0) 1 0
1 (-41.0, -24.0) 1 1
2 (69.0, -1.0) 1 2
3 (37.0, 30.0) 1 3
4 (-60.0, 1.0) 1 4
.. ... ... ..
63 (-45.0, -11.0) 1 63
64 (80.0, -1.0) 1 64
65 (84.0, 24.0) 1 65
66 (76.0, 7.0) 1 66
67 (-81.0, -5.0) 1 67
I want to create a new data frame that shows the x and y coordinates individually, along with their occurrences, like this:
x Occurrences y Occurrences
76 ... -8 ...
-41 ... -24 ...
69 ... -1 ...
37 ... 30 ...
-60 ... 1 ...
I have tried to split the string, but I don't think I am doing it correctly, and I don't know how to add the result back to the table (I suspect I'd need a for loop later in my code). I scraped the data from an API; here is the code that sets up the data frame shown above:
import pandas as pd

shots = {}  # maps an (x, y) tuple to the number of shots at those coordinates
goals = {}  # maps an (x, y) tuple to the number of goals at those coordinates

for key in contents['liveData']['plays']['allPlays']:
    if key['result']['event'] == "Shot":
        scoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if scoordinates not in shots:
            shots[scoordinates] = 1
        else:
            shots[scoordinates] += 1
    if key['result']['event'] == "Goal":
        gcoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if gcoordinates not in goals:
            goals[gcoordinates] = 1
        else:
            goals[gcoordinates] += 1

# create data frames using pandas
gdf = pd.DataFrame(list(goals.items()), columns=['Coordinates', 'Occurrences'])
print(gdf)
sdf = pd.DataFrame(list(shots.items()), columns=['Coordinates', 'Occurrences'])
print(sdf)

Try this:
import re

df[['x', 'y']] = df.Coordinates.apply(
    lambda c: pd.Series(dict(zip(['x', 'y'],
                                 re.findall(r'-?[0-9]+\.[0-9]+', c.strip())))))

Using the built-in string methods to achieve this should be performant:
df[["x", "y"]] = df["Coordinates"].str.strip("()").str.split(",", expand=True).astype(float)
(This also converts x and y to float values; that wasn't requested, but is probably desired.)
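If the Coordinates column actually holds Python tuples rather than strings (which is what the dictionary keys built in the question's code would produce), no string parsing is needed at all; a minimal sketch, using df as in the answers above:
import pandas as pd

# Expand each (x, y) tuple into two columns; assumes every entry
# in 'Coordinates' is a 2-tuple like (76.0, -8.0).
df[['x', 'y']] = pd.DataFrame(df['Coordinates'].tolist(), index=df.index)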


average on dataframe segments

I have a DataFrame whose values return to zero after each cycle of operation (the cycles have random lengths). I want to calculate the average (or perform other operations) for each patch between zeros: for example, the average of [0.762, 0.766] alone, then of [0.66, 1.37, 2.11, 2.29] alone, and so forth to the end of the DataFrame.
So I worked with this data:
random_value
0 0
1 0
2 1
3 2
4 3
5 0
6 4
7 4
8 0
9 1
There is probably a far better solution, but here is what I came up with:
def avg_function(df):
    avg_list = []
    value_list = list(df["random_value"])
    temp_list = []
    for i in range(len(value_list)):
        if value_list[i] == 0:
            if temp_list:  # a patch just ended: store its average
                avg_list.append(sum(temp_list) / len(temp_list))
                temp_list = []
        else:
            temp_list.append(value_list[i])
    if temp_list:  # for a patch that runs to the end of the frame
        avg_list.append(sum(temp_list) / len(temp_list))
    return avg_list

test_list = avg_function(df=df)
test_list
[Out] : [2.0, 4.0, 1.0]
Edit: since it was requested in the comments, here is a way to add the means back to the dataframe. I don't know if there is a built-in pandas way to do that (there might be!), but I came up with this:
def add_mean(df, mean_list):
    temp_mean_list = []
    list_index = 0  # will be the index into mean_list
    df["random_value_shifted"] = df["random_value"].shift(1).fillna(0)
    random_value = list(df["random_value"])
    random_value_shifted = list(df["random_value_shifted"])
    for i in range(df.shape[0]):
        if random_value[i] == 0 and random_value_shifted[i] == 0:
            temp_mean_list.append(0)
        elif random_value[i] == 0 and random_value_shifted[i] != 0:
            temp_mean_list.append(0)
            list_index += 1  # a patch just ended: move on to the next mean
        else:
            temp_mean_list.append(mean_list[list_index])
    df = df.drop(["random_value_shifted"], axis=1)
    df["mean"] = temp_mean_list
    return df

df = add_mean(df=df, mean_list=test_list)
This gave me:
df
[Out] :
random_value mean
0 0 0
1 0 0
2 1 2
3 2 2
4 3 2
5 0 0
6 4 4
7 4 4
8 0 0
9 1 1
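For reference, a more idiomatic pandas sketch of the same idea (assuming the same df with a random_value column): start a new group id at every zero, then aggregate within each group.
import pandas as pd

df = pd.DataFrame({"random_value": [0, 0, 1, 2, 3, 0, 4, 4, 0, 1]})

# Each zero starts a new group; the non-zero rows after it share its id.
group_id = (df["random_value"] == 0).cumsum()

# Average only the non-zero rows within each group.
means = df.loc[df["random_value"] != 0, "random_value"].groupby(group_id).mean()
print(list(means))  # [2.0, 4.0, 1.0]

# Broadcast the means back onto the frame, with 0 where random_value is 0.
df["mean"] = group_id.map(means).where(df["random_value"] != 0, 0)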

Extracting table data using BeautifulSoup

I'm having a little trouble using BeautifulSoup to extract data (zip code and population). Any help is appreciated.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

pop_source = requests.get("https://www.zip-codes.com/city/tx-austin.asp").text
soup = BeautifulSoup(pop_source, 'html5lib')
zip_pop_table = soup.find('table', class_='statTable')
austin_pop = pd.DataFrame(columns=['Zip Code', 'Population'])
for row in zip_pop_table.find_all('tr'):
    cols = row.find_all('td')
Now I'm stuck. I don't really know how to pull the data from the columns I want and append it to the columns I made in the empty dataframe.
You just need to loop over your cols, and dump that into your austin_pop dataframe.
So I did that by making a list of the data from the cols using a list comprehension:
row_list = [ data.text for data in cols ]
The list comprehension is equivalent to this for loop; you can use either:
row_list = []
for data in cols:
    row_list.append(data.text)
Created a single-row frame, kept the 2 columns you wanted, and then dumped that into austin_pop:
temp_df = pd.DataFrame([row_list], columns=['Zip Code', 'type', 'county', 'Population', 'area_codes'])
temp_df = temp_df[['Zip Code', 'Population']]
austin_pop = pd.concat([austin_pop, temp_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
Full Code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

url = "https://www.zip-codes.com/city/tx-austin.asp"
pop_source = requests.get(url).text
soup = BeautifulSoup(pop_source, 'html5lib')

zip_pop_table = soup.find('table', class_='statTable')
austin_pop = pd.DataFrame(columns=['Zip Code', 'Population'])

for row in zip_pop_table.find_all('tr'):
    cols = row.find_all('td')
    row_list = [data.text for data in cols]
    temp_df = pd.DataFrame([row_list], columns=['Zip Code', 'type', 'county', 'Population', 'area_codes'])
    temp_df = temp_df[['Zip Code', 'Population']]
    austin_pop = pd.concat([austin_pop, temp_df], ignore_index=True)

austin_pop = austin_pop.iloc[1:, :]  # drop the first row (the table header repeated as data)
austin_pop['Zip Code'] = austin_pop['Zip Code'].apply(lambda x: x.split()[-1])  # keep only the numeric zip
Output:
print (austin_pop)
Zip Code Population
1 73301 0
2 73344 0
3 78681 50,606
4 78701 6,841
5 78702 21,334
6 78703 19,690
7 78704 42,117
8 78705 31,340
9 78708 0
10 78709 0
11 78710 0
12 78711 0
13 78712 860
14 78713 0
15 78714 0
16 78715 0
17 78716 0
18 78717 22,538
19 78718 0
20 78719 1,764
21 78720 0
22 78721 11,425
23 78722 5,901
24 78723 28,330
25 78724 21,696
26 78725 6,083
27 78726 13,122
28 78727 26,689
29 78728 20,299
30 78729 27,108
.. ... ...
45 78746 26,928
46 78747 14,808
47 78748 40,651
48 78749 34,449
49 78750 26,814
50 78751 14,385
51 78752 18,064
52 78753 49,301
53 78754 15,036
54 78755 0
55 78756 7,194
56 78757 21,310
57 78758 44,072
58 78759 38,891
59 78760 0
60 78761 0
61 78762 0
62 78763 0
63 78764 0
64 78765 0
65 78766 0
66 78767 0
67 78768 0
68 78772 0
69 78773 0
70 78774 0
71 78778 0
72 78779 0
73 78783 0
74 78799 0
[74 rows x 2 columns]
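As an aside, for a static table like this one, pandas.read_html can often replace the manual parsing entirely. A sketch, assuming the page still serves a table with class statTable and doesn't block scripted requests (an HTML parser such as lxml or html5lib must be installed):
import pandas as pd

tables = pd.read_html("https://www.zip-codes.com/city/tx-austin.asp",
                      attrs={"class": "statTable"})
df = tables[0]  # the first (and here only) matching table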

Output the result of each loop in different columns

price.txt file has two columns: (name and value)
Mary 134
Lucy 56
Jack 88
range.txt file has three columns: (fruit and min_value and max_value)
apple 57 136
banana 62 258
orange 88 99
blueberry 98 121
My aim is to test whether each value in price.txt is between the min_value and max_value in range.txt. If yes, output 1; if not, output "x".
I tried:
awk 'FNR == NR { name = $1; price[name] = $2; next }
{
    for (name in price) {
        if ($2 <= price[name] && $3 >= price[name]) { print 1 } else { print "x" }
    }
}' price.txt range.txt
But my results all come out in one column, like this:
1
1
x
x
x
x
x
x
1
1
1
x
Actually, I want my result to look like this (one column per name):
1 x 1
1 x 1
x x 1
x x x
Because I need to use paste to join the output file with range.txt. The final result should be like:
apple 57 136 1 x 1
banana 62 258 1 x 1
orange 88 99 x x 1
blueberry 98 121 x x x
So, how can I get the result of each loop into different columns? And is there any way to output the final result without paste, based on my current code? Thank you.
This builds on what you provided:
# first file (price.txt): store prices indexed in read order;
# names counts the entries, avoiding the non-standard length(array)
FNR == NR {
    price[names++] = $2
    next
}
# second file (range.txt): append a 1 or x per person to each line
{
    l = $1 " " $2 " " $3
    for (i = 0; i < names; i++) {
        if ($2 <= price[i] && $3 >= price[i]) {
            l = l " 1"
        } else {
            l = l " x"
        }
    }
    print l
}
and generates this output:
apple 57 136 1 x 1
banana 62 258 1 x 1
orange 88 99 x x 1
blueberry 98 121 x x x
However, the output carries no person name for each score (the results are anonymous); maybe that's intentional.
The change here is to index the array explicitly as it is populated in the first block, which maintains the read order.
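Assuming the program above is saved to a file, say compare.awk (the name is arbitrary), it can be run with awk -f compare.awk price.txt range.txt; price.txt must be listed first so that the FNR == NR block is the one that reads the prices.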

Pandas custom file format

I have a huge pandas DataFrame that I need to write out in a format RankLib can understand. An example with a target, a query ID, and 3 features:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
I have written my own function that iterates over the rows and writes them out like this:
data_file = open(filename, 'w')
for index, row in data.iterrows():
    line = str(row['score'])
    line += ' qid:' + str(row['srch_id'])
    counter = 0
    for feature in feature_columns:
        counter += 1
        line += ' ' + str(counter) + ':' + str(row[feature])
    data_file.write(line + '\n')
data_file.close()
Since I have about 200 features and 5 million rows, this is obviously very slow. Is there a better approach using pandas' own I/O?
You can do it this way:
Data:
In [155]: df
Out[155]:
f1 f2 f3 score srch_id
0 12 0.6 13 5 4
1 8 0.4 11 1 4
2 11 0.7 14 2 10
In [156]: df.dtypes
Out[156]:
f1 int64
f2 float64
f3 int64
score object
srch_id int64
dtype: object
Solution:
feature_columns = ['f1', 'f2', 'f3']
cols2id = {col: str(i+1) for i, col in enumerate(feature_columns)}

def f(x):
    # x is a whole column (Series); prefix its values based on the column name
    if x.name in feature_columns:
        return cols2id[x.name] + ':' + x.astype(str)
    elif x.name == 'srch_id':
        return 'qid:' + x.astype(str)
    else:
        return x

(df.apply(f)[['score', 'srch_id'] + feature_columns]
   .to_csv('d:/temp/out.csv', sep=' ', index=False, header=False)
)
out.csv:
5 qid:4 1:12 2:0.6 3:13
1 qid:4 1:8 2:0.4 3:11
2 qid:10 1:11 2:0.7 3:14
cols2id helper dict:
In [158]: cols2id
Out[158]: {'f1': '1', 'f2': '2', 'f3': '3'}
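A minimal self-contained variant of the same idea, building the prefixed columns directly (the column names f1..f3, score and srch_id are taken from the example above):
import pandas as pd

df = pd.DataFrame({
    "score": [5, 1], "srch_id": [4, 4],
    "f1": [12, 8], "f2": [0.6, 0.4], "f3": [13, 11],
})

feature_columns = ["f1", "f2", "f3"]
out = df[["score"]].astype(str)
out["srch_id"] = "qid:" + df["srch_id"].astype(str)
for i, col in enumerate(feature_columns, start=1):
    out[col] = str(i) + ":" + df[col].astype(str)

# Space-separated, no header or index -> RankLib-style lines
out.to_csv("out.txt", sep=" ", index=False, header=False)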

Efficient way to calculate averages, standard deviations from a txt file

Here is a copy of what one of many txt files looks like.
Class 1:
Subject A:
posX posY posZ x(%) y(%)
0 2 0 81 72
0 2 180 63 38
-1 -2 0 79 84
-1 -2 180 85 95
. . . . .
Subject B:
posX posY posZ x(%) y(%)
0 2 0 71 73
-1 -2 0 69 88
. . . . .
Subject C:
posX posY posZ x(%) y(%)
0 2 0 86 71
-1 -2 0 81 55
. . . . .
Class 2:
Subject A:
posX posY posZ x(%) y(%)
0 2 0 81 72
-1 -2 0 79 84
. . . . .
The number of classes, subjects, and row entries all vary.
Class 1, Subject A always has posZ entries alternating between 0 and 180.
Calculate the average of x(%) and y(%) by class and by subject.
Calculate the standard deviation of x(%) and y(%) by class and by subject.
Also, ignore rows with posZ of 180 when calculating averages and standard deviations.
I have developed an unwieldy solution in Excel (using macros and VBA), but I would rather have a more optimal solution in Python.
numpy is very helpful, but its .mean() and .std() functions only work on arrays; I am still researching it, as well as pandas' groupby function.
I would like the final output to look as follows (1. By Class, 2. By Subject)
1. By Class
X Y
Average
std_dev
2. By Subject
X Y
Average
std_dev
I think working with dictionaries (and a list of dictionaries) is a good way to get familiar with handling data in Python. To get your data into that shape, read the text files and set variables line by line.
To start:
for line in infile:
    line = line.strip()  # drop the trailing newline so values parse cleanly
    if line.startswith("Class"):
        temp, class_var = line.split(' ')
        class_var = class_var.replace(':', '')
    elif line.startswith("Subject"):
        temp, subject = line.split(' ')
        subject = subject.replace(':', '')
This will create variables that correspond to the current class and current subject. Then you want to read in your numeric variables. A good way to read in only those values is a try statement that attempts to convert the fields to integers.
    else:
        fields = line.split()  # split() handles the column-aligned spacing
        try:
            keys = ['posX', 'posY', 'posZ', 'x_perc', 'y_perc']
            values = [int(item) for item in fields]
            entry = dict(zip(keys, values))
            entry['class'] = class_var
            entry['subject'] = subject
            outputList.append(entry)
        except ValueError:
            pass
This will put them into dictionary form, including the earlier defined class and subject variables, and append them to an outputList. You'll end up with this:
[{'posX': 0, 'x_perc': 81, 'posZ': 0, 'y_perc': 72, 'posY': 2, 'class': '1', 'subject': 'A'},
{'posX': 0, 'x_perc': 63, 'posZ': 180, 'y_perc': 38, 'posY': 2, 'class': '1', 'subject': 'A'}, ...]
etc.
You can then average or take the SD by subsetting the list of dictionaries (applying rules like excluding posZ == 180). Here's averaging by class:
import numpy as np

classes = ['1', '2']
print("By Class:")
print("Class", "Avg X", "Avg Y", "X SD", "Y SD")
for class_var in classes:
    subset = [item for item in outputList
              if item['class'] == class_var and item['posZ'] != 180]
    x_m = np.mean([item['x_perc'] for item in subset])
    y_m = np.mean([item['y_perc'] for item in subset])
    x_sd = np.std([item['x_perc'] for item in subset])
    y_sd = np.std([item['y_perc'] for item in subset])
    print(class_var, x_m, y_m, x_sd, y_sd)
You'll have to play around with the printed output to get exactly what you want, but this should get you started.
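Since the question mentions pandas' groupby: once the rows are in the list-of-dicts form above, a short sketch (reusing outputList and the keys defined above) produces both tables at once. Note that pandas' .std() uses ddof=1 by default, unlike np.std:
import pandas as pd

df = pd.DataFrame(outputList)
df = df[df['posZ'] != 180]  # exclude the posZ == 180 rows as required

# 1. By class
print(df.groupby('class')[['x_perc', 'y_perc']].agg(['mean', 'std']))

# 2. By subject within class
print(df.groupby(['class', 'subject'])[['x_perc', 'y_perc']].agg(['mean', 'std']))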