Is the text file's style wrong, and how do I sum the numbers?

I am new to learning Python.
I created a txt file containing a series of numbers:
TXT Style A: 1, 3, 4, 2, 22, 11, ...,32
TXT Style B: 1 3 4 2 22...32
Both txt files can be opened, but I failed to compute the sum of the values.
Question 1: Does the text style with ',' affect the adding function?
Question 2: How can I get each value and calculate the sum?
with open('numbers.txt', 'r') as f:
    numbers_line = f.read()
    print(numbers_line)
    x = numbers_line.split()
    #print(len(x))

    def number():
        i = len(x)
        # Learn different methods to retrieve element
        #s = int(x.__getitem__(25))
        s = int(x[25])
        w = int(x[24])
        # Able to retrieve element, but this is not the way to program
        # needed advice and corrections
        print(i, s, w, s + w)

    number()

f.close()

Does the text style with ',' affect the adding function?
Before we can get to the adding: it does affect the splitting, since str.split() with no delimiter specified splits only at whitespace, so the comma stays attached to the number and int() then fails on a string like '1,'. To split at commas as well as whitespace, you can use re.split:
import re
…
x = re.split(r"[,\s]+", numbers_line)
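For instance, on a hypothetical Style A string (the sample value below is just for illustration):
numbers_line = "1, 3, 4, 2, 22"
x = re.split(r"[,\s]+", numbers_line)
print(x)  # ['1', '3', '4', '2', '22'] - no commas left stuck to the digits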
How can I get each value and calculate the sum?
You already used the int function to convert a string to an integer (although with int(x[25]) you missed that the indices range from 0 to 24); you could apply int to each string of x with the map function, and you can calculate the sum with the sum function:
print(sum(map(int, x)))
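Putting both parts together, a minimal sketch (assuming the file is named numbers.txt, as in your code; the filter drops the empty string that re.split leaves behind if the file ends with a newline):
import re

with open('numbers.txt', 'r') as f:
    numbers_line = f.read()

x = [s for s in re.split(r"[,\s]+", numbers_line) if s]
print(sum(map(int, x)))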

Related

Using openpyxl and scipy to solve non-linear system related to resistor network

I'm trying to do the right thing for my career by learning how to really use the data science tools and automate Excel.
There is a pretty basic resistor circuit called a Pi attenuator.
On a particular component, two of the terminals are shorted together by design.
Because two of the elements are common (Terminal 3 shorts R1 and R2 - see spreadsheet screenshot), it isn't possible to just measure them with a meter and get the real value of the element; what is measured is the combined series/parallel value.
I have measurement data from both before and after exposing them to a high-temp oven.
The left side has the "before oven" measurements (measured and converted).
I was able to use scipy fsolve to get the correct element values (as if R1 and R2 weren't common). This code absolutely works and gives the correct info.
The F[] equations came from solving for the series/parallel value when an ohmmeter is put across pins (1,3), (2,3), and (1,2). I know those are correct. As I understand it, the system is non-linear because of the variables existing in the denominators.
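Concretely, if I'm reading my own F[] assignments right, the meter across pins (1,3) sees R1 in parallel with the series pair R2+R3, and similarly for the other pin pairs. A tiny sketch with made-up element values, just to illustrate the measurement model:
# Hypothetical element values, for illustration only
r1, r2, r3 = 300.0, 300.0, 18.0
m13 = 1/(1/r1 + 1/(r2 + r3))  # meter across pins (1,3)
m23 = 1/(1/r2 + 1/(r1 + r3))  # meter across pins (2,3)
m12 = 1/(1/r3 + 1/(r1 + r2))  # meter across pins (1,2)
print(m13, m23, m12)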
I used this function to manually enter the starting measurements and converted them to their real values.
The problem is that I need to do this hundreds of times for many test groups, and will need to test other attenuator values. I don't want to have to re-run the script after typing in successive measurements if I can help it.
I'm able to get the data out of excel and into numpy arrays, but when I pass the data to the fsolve function I get
printed values that have a bunch of extra decimal places, and
RuntimeWarning: overflow encountered in double_scalars
RuntimeWarning: divide by zero encountered in double_scalars
To me these are scary insurmountable issues. The lines the warnings occurred on might have changed, but they happen during the F[] = equation assignments.
I wanted to be able to provide the spreadsheet but I'm told this link won't work.
Actual excel datafile
The code I'm working with that uses openpyxl has been limited to 1 row of data for my sanity. If I can get the correct values, I'll tackle the next part after that.
I tried to do what I could to lay out the basis of what I'm doing with screenshots and by providing as much detail as possible. I'm hoping to receive feedback about how to do what I've explained by using numpy arrays, fsolve, and openpyxl.
In general (in my personal programming "experience") I have issues with making sure I've cast to the correct type. I wouldn't be surprised if that was part of my problem. I have issues with "scope". I also overthink everything, and the majority of my EE schooling used C or assembly.
It's embarrassing to say how long it took me to even get this far. I'm so ignorant that I don't even know what I don't know. I've put it down and picked it back up enough times that I'm starting to get emotional. I'm trying to fail forward here, but I need another set of eyes.
I tried:
- enforcing float64 (or float32) dtype on the np arrays
- creating other np arrays in the convert_measured function
- casting the parameters passed to convert_measured to separate variables within the function
- rounding the values from the cells, because they seem to expand; I don't need more than 2 decimals of precision for this
import numpy as np
from scipy.optimize import fsolve
import openpyxl

wb = openpyxl.load_workbook("datafile.xlsx")
ws = wb['Group A']

""" Excel Data Cell Locations
Measured R1,R2,R3 Effective Starting Values
Group A,B,C,D worksheet - Cell Range I15 to K34
I = column 9
K = column 11
"""
# Data limited to just one row here to debug
MeasColBegin = int(9)
MeasColEnd = int(11)
MeasRowBegin = int(15)  # Measured values start in row 15
MeasRowEnd = int(15)    # Row 34 is the real end. If this is set to 15 then it's only looking at one row

rows_input = ws.iter_rows(min_row=MeasRowBegin, max_row=MeasRowEnd, min_col=MeasColBegin, max_col=MeasColEnd)

"""
Calculated R1,R2,R3 Actual Finished Values
Group A,B,C,D worksheet - Cell Range I39 to K58
I = column 9
K = column 11
"""
# These aren't being used yet
CalcColBegin = int(9)
CalcColEnd = int(11)
CalcRowBegin = int(39)
CalcRowEnd = int(58)

# row iterator for actual/calculated values
# This isn't being used yet
rows_output = ws.iter_rows(min_row=CalcRowBegin, max_row=CalcRowEnd, min_col=CalcColBegin, max_col=CalcColEnd)

# This is called by fsolve
# This works when I don't use data passed from the excel spreadsheet
def convert_measured(z):  # Maybe I need to cast z to a different type/size...?
    F = np.empty((3))
    F.astype('float64')
    #print("z datatypes ", z[0].dtype, z[1].dtype, z[2].dtype)
    # IDK why this prints so many decimal places, (or why it prints them so many times)
    # I assume it has to do with how the optimizer works
    #print("z[]= ", z[0], z[1], z[2])
    # Same assignments as simpler version of code that provides correct answers
    x = z[0]
    y = z[1]
    w = z[2]
    #print("x=", x, " y=", y, " w=", w)
    # This is certainly wrong
    F[0] = 1/(1/z[0]+1/(z[1]+z[2]))-z[0]
    F[1] = 1/(1/z[1]+1/(z[0]+z[2]))-z[1]
    F[2] = 1/(1/z[2]+1/(z[0]+z[1]))-z[2]
    # I tried thinking that rounding might help. Nope.
    #F[0] = 1/(1/x+1/(y+w))-np.around(z[0],4)
    #F[1] = 1/(1/y+1/(x+w))-np.around(z[1],4)
    #F[2] = 1/(1/w+1/(x+y))-np.around(z[2],4)
    # Original code from example that works when I enter the numbers
    #F[0] = 1/(1/x+1/(y+w))-148.884
    #F[1] = 1/(1/y+1/(x+w))-148.853
    #F[2] = 1/(1/w+1/(x+y))-16.506
    # Bottom line is that one of my problems is that I don't know how to represent F[0,1,2] in terms of z
    return F

def main():
    # numpy array where measured values to be converted will go
    zGuess = np.array([1, 1, 1])
    zGuess = zGuess.astype('float64')
    # numpy array where converted solution/values will go
    zSol = np.array([1, 1, 1])
    zSol = zSol.astype('float64')
    # For loop used to iterate through rows and extract measurements
    # These are passed to the convert_measured function
    for row in rows_input:
        print("row[]= ", row[0].value, row[1].value, row[2].value)  # print out values to check it's the right data
        # store values into np array that will be sent to convert_measured
        zGuess[0] = row[0].value
        zGuess[1] = row[1].value
        zGuess[2] = row[2].value
        print("zGuess[]=", zGuess[0], zGuess[1], zGuess[2])  # print again to make sure because I had problems with dtype
        # Solve for measurements / Convert to actual values
        zSol = fsolve(convert_measured, zGuess)
        # This should print the true values of the elements as if no shunt between R1 and R2 exists
        print("zSol= ", zSol[0], zSol[1], zSol[2])
        # Todo: store correct solutions into array and write to spreadsheet

if __name__ == "__main__":
    main()
I've made some changes to your code and ran it on your manually calculated data. The result looks to be the same apart from a few rounding differences.
Note that 'rows_input' currently points to the range C15:E34 in the code sample below (the other rows_input line is commented out).
The main change was to the line that calls the convert function,
zSol = fsolve(convert_measured, zGuess)
so that it calls the function with the measured values as parameters and rounds the array values to 2 decimal places:
zSol = np.round(fsolve(convert_measured, zSol, zGuess), 2)
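This works because fsolve's third positional parameter is args: SciPy wraps a non-tuple value into a tuple before forwarding it to the callable, so zGuess arrives in convert_measured as the xl parameter. If you prefer to make that explicit, the equivalent call should be:
zSol = np.round(fsolve(convert_measured, zSol, args=(zGuess,)), 2)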
The convert_measured function was also changed to accept the measured inputs for the conversions.
Changed code sample below (I commented out the print statements except for the zSol values):
import numpy as np
from scipy.optimize import fsolve
import openpyxl

wb = openpyxl.load_workbook("datafile.xlsx")
ws = wb['Group A']

""" Excel Data Cell Locations
Measured R1,R2,R3 Effective Starting Values
Group A,B,C,D worksheet - Cell Range I15 to K34
I = column 9
K = column 11
"""
# Data limited to just one row here to debug
MeasColBegin = int(9)
MeasColEnd = int(11)
MeasRowBegin = int(15)  # Measured values start in row 15
MeasRowEnd = int(15)    # Row 34 is the real end. If this is set to 15 then it's only looking at one row

# rows_input = ws.iter_rows(min_row=MeasRowBegin, max_row=MeasRowEnd, min_col=MeasColBegin, max_col=MeasColEnd)
rows_input = ws.iter_rows(min_row=15, max_row=34, min_col=3, max_col=5)

"""
Calculated R1,R2,R3 Actual Finished Values
Group A,B,C,D worksheet - Cell Range I39 to K58
I = column 9
K = column 11
"""
# These aren't being used yet
CalcColBegin = int(9)
CalcColEnd = int(11)
CalcRowBegin = int(39)
CalcRowEnd = int(58)

# row iterator for actual/calculated values
# This isn't being used yet
rows_output = ws.iter_rows(min_row=CalcRowBegin, max_row=CalcRowEnd, min_col=CalcColBegin, max_col=CalcColEnd)

def convert_measured(z, xl):
    F = np.empty((3))
    # F.astype('float64')
    x = z[0]
    y = z[1]
    w = z[2]
    # Same equations as the hard-coded example, but against the measured values passed in as xl
    F[0] = 1/(1/x+1/(y+w))-xl[0]
    F[1] = 1/(1/y+1/(x+w))-xl[1]
    F[2] = 1/(1/w+1/(x+y))-xl[2]
    return F

def main():
    # numpy array where measured values to be converted will go
    zGuess = np.array([1, 1, 1])
    zGuess = zGuess.astype('float64')
    # numpy array where converted solution/values will go
    zSol = np.array([1, 1, 1])
    zSol = zSol.astype('float64')
    # For loop used to iterate through rows and extract measurements
    # These are passed to the convert_measured function
    for row in rows_input:
        # print("row[]= ", row[0].value, row[1].value, row[2].value)  # check it's the right data
        # store values into np array that will be sent to convert_measured
        zGuess[0] = row[0].value
        zGuess[1] = row[1].value
        zGuess[2] = row[2].value
        # print("zGuess[]=", zGuess[0], zGuess[1], zGuess[2])
        # Solve for measurements / Convert to actual values
        # zSol = fsolve(convert_measured(zSol), zGuess)
        zSol = np.round(fsolve(convert_measured, zSol, zGuess), 2)
        # This should print the true values of the elements as if no shunt between R1 and R2 exists
        print("zSol= ", zSol[0], zSol[1], zSol[2])
        # Todo: store correct solutions into array and write to spreadsheet

if __name__ == "__main__":
    main()
Output looks like this
zSol= 290.03 288.94 16.99
zSol= 283.68 294.84 16.97
zSol= 280.83 297.43 17.07
zSol= 292.67 286.63 16.99
zSol= 277.51 301.04 16.93
zSol= 294.98 284.66 16.95
...
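To finish the Todo in main(), the unused rows_output iterator could be used to write each solution back to the calculated-values range, along these lines (a minimal sketch, untested against the actual workbook; it assumes rows_input and rows_output cover the same number of rows, and the output filename is made up):
for row_in, row_out in zip(rows_input, rows_output):
    zGuess[:] = [cell.value for cell in row_in]
    zSol = np.round(fsolve(convert_measured, zSol, zGuess), 2)
    # openpyxl cells are writable via their .value attribute
    for cell, value in zip(row_out, zSol):
        cell.value = float(value)
wb.save("datafile_out.xlsx")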

Pandas - Setting column value, based on a function that runs on another column

I have been all over the place trying to get this to work (new to data science). It's obviously because I don't fully get how the data structures of pandas work.
I have this code:
def getSearchedValue(identifier):
    full_str = anedf["Diskret data"].astype(str)
    value = ""
    if full_str.str.find(identifier) <= -1:
        start_index = full_str.str.find(identifier)+len(identifier)+1
        end_index = full_str[start_index:].find("|")+start_index
        value = full_str[start_index:end_index].astype(str)
    return value

for col in anedf.columns:
    if col.count("#") > 0:
        anedf[col] = getSearchedValue(col)
What I'm trying to do is iterate over my columns; I have around 260 in my dataframe. If a column name contains the character #, it should try to fill values based on what's in my "Diskret data" column.
Data in the "Diskret data" column is completely messed up, but it is in the form CCC#111~VALUE|DDD#222~VALUE| <- until there are no more identifiers + values. Not all identifiers are present in each row, and they come in no specific order.
The function works if I run it with hard-coded strings in a regular Python document. But with the dataframe I get various errors like:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Input In [119], in <cell line: 12>()
12 for col in anedf.columns:
13 if col.count("#") > 0:
---> 14 anedf[col] = getSearchedValue(col)
Input In [119], in getSearchedValue(identifier)
4 full_str = anedf["Diskret data"].astype(str)
5 value=""
----> 6 if full_str.str.find(identifier) <= -1:
7 start_index = full_str.str.find(identifier)+len(identifier)+1
8 end_index = full_str[start_index:].find("|")+start_index
I guess this is because it evaluates against all rows (a Series), which obviously yields a mix of True and False values. But how can I make the evaluation and assignment work so that it evaluates and assigns like this:
Diskret data               CCC#111                                                                      JJSDJ#1234
CCC#111~1|BBB#2323~2234    1 (copied from "Diskret data")                                               0
JJSDJ#1234~Heart attack    0 (or skipped, since the row does not contain a value for the identifier)    Heart attack
The plan is to drop the "Diskret data" when the assignment is done, so I have the data in a more structured way.
--- Update ---
By request:
I have included a picture of how I visualize the problem, and what I seemingly can't make it do.
Problem visualisation
With regex you could do something like:
import pandas as pd

def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

series = pd.Series(
    ['CCC#111~1|BBB#2323~2234', 'JJSDJ#1234~Heart attack']
)
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series.apply(map_)
Breaking this down:
Create a new series by running a map on each row that turns your long string into a list of tuples.
reg_series = series.str.findall(r'([^~|]+)~([^~|]+)')
reg_series
# output:
# 0 [(CCC#111, 1), (BBB#2323, 2234)]
# 1 [(JJSDJ#1234, Heart attack)]
Then we create a map_ function. This function takes each row of reg_series and unzips it into its "keys" and its "values", returning a Series whose index is the keys and whose data is the values.
Edit: We added an if/else statement that checks whether the list is non-empty. If it is empty, we return an empty series of type object.
def map_(list_) -> pd.Series:
    if list_:
        idx, values = zip(*list_)
        return pd.Series(values, idx)
    else:
        return pd.Series(dtype=object)

...
print(idx, values)  # first row
# output:
# ('CCC#111', 'BBB#2323') ('1', '2234')
Finally we run apply on the series to create a dataframe that takes the outputs from map_ for each row and zips them together in columnar format.
reg_series.apply(map_)
# output:
#   CCC#111 BBB#2323    JJSDJ#1234
# 0       1     2234           NaN
# 1     NaN      NaN  Heart attack
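To get from here back to the original goal (fill the # columns, then drop "Diskret data"), a minimal sketch, assuming the dataframe is named anedf as in the question:
expanded = anedf["Diskret data"].astype(str).str.findall(r'([^~|]+)~([^~|]+)').apply(map_)
anedf = pd.concat([anedf.drop(columns=["Diskret data"]), expanded], axis=1)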

Getting same value from list in dataframe column using Python

I have a dataframe with 3 columns. Now I have added one more column, and I am filling it with unique values using the random function.
I created a list variable and, using a for loop, I am adding random strings to that list variable.
After that, I created another loop in which I extract the values from the list and add them to the column.
But the same value is added to every row each time.
import random
import string

import pandas as pd

df = pd.read_csv("test.csv")
lst = []
for i in range(20):
    randColumn = ''.join(random.choice(string.ascii_uppercase + string.digits)
                         for i in range(20))
    lst.append(randColumn)
for j in lst:
    df['randColumn'] = j
print(df)
#Output.......
   A  B  C            randColumn
0  1  2  3  WHI11NJBNI8BOTMA9RKA
1  4  5  6  WHI11NJBNI8BOTMA9RKA
Could you please help me fix this - why does each row get the same value from the list?
Updated to work correctly with any type of column in df.
If I understood your question correctly, you can use the zip method of RDDs to achieve your goal.
import random
import string

from pyspark.sql import SparkSession, Row
import pyspark.sql.types as t

sparkSession = SparkSession.builder.getOrCreate()

lst = []
for i in range(2):
    rand_column = ''.join(random.choice(string.ascii_uppercase + string.digits) for i in range(20))
    # Adding random strings as Row to list
    lst.append(Row(random=rand_column))

# Making rdd from random strings array
random_rdd = sparkSession.sparkContext.parallelize(lst)
# df is assumed to be the existing Spark DataFrame to extend
res = df.rdd.zip(random_rdd).map(lambda rows: Row(**(rows[0].asDict()), **(rows[1].asDict()))).toDF()
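For what it's worth, if you are staying in plain pandas (as in the question), the root cause is that df['randColumn'] = j assigns the scalar j to the whole column on every pass of the loop, so the last string wins. Assigning the list in one go gives each row its own value; a minimal sketch, using the lst of plain strings from the question (its length must be at least len(df)):
df['randColumn'] = lst[:len(df)]
print(df)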

Got TypeError: string indices must be integers with .apply [duplicate]

I have a dataframe; one column is a URL, the other is a name. I'm simply trying to add a third column that takes the URL and creates an HTML link.
The column newsSource has the link name, and url has the URL. For each row in the dataframe, I want to create a column that has:
<a href="url">newsSource name</a>
Trying the below throws the error:
File "C:\Users\AwesomeMan\Documents\Python\MISC\News Alerts\simple_news.py", line 254, in
df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x[0]['newsSource']))
TypeError: string indices must be integers
df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x['source']))
But I've used x[colName] before? The below line works fine, it simply creates a column of the source's name:
df['newsSource'] = df['source'].apply(lambda x: x['name'])
Why suddenly ("suddenly" to me) is it saying I can't access the indices?
pd.Series.apply has access only to a single series, i.e. the series on which you are calling the method. In other words, the function you supply, irrespective of whether it is named or an anonymous lambda, will only have access to df['source'].
To access multiple series by row, you need pd.DataFrame.apply along axis=1:
def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
Note there is an overhead associated with passing an entire series in this way; pd.DataFrame.apply is just a thinly veiled, inefficient loop.
You may find a list comprehension more efficient:
df['sourceURL'] = ['<a href="{0}">{1}</a>'.format(i, j)
                   for i, j in zip(df['url'], df['source'])]
Here's a working demo:
df = pd.DataFrame([['BBC', 'http://www.bbc.o.uk']],
                  columns=['source', 'url'])

def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])

df['sourceURL'] = df.apply(return_link, axis=1)
print(df)

  source                  url                               sourceURL
0    BBC  http://www.bbc.o.uk  <a href="http://www.bbc.o.uk">BBC</a>
With zip and old-school % string formatting:
df['sourceURL'] = ['<a href="%s">%s</a>' % (x, y) for x, y in zip(df['url'], df['source'])]
And the same with an f-string:
df['sourceURL'] = [f'<a href="{x}">{y}</a>' for x, y in zip(df['url'], df['source'])]

how to find missing number between minimum and maximum

I want to make a NumPy array like the below:
Random numbers: 0~9 (0 <= value <= 9); random 1D size: 5~9 (5 <= size <= 9).
And I want to find the missing numbers between the min and max, so I made code like this:
import numpy as np

min_val = 0
max_val = 10
min_val_len = 5
max_val_len = 10

arr1 = [4, 3, 2, 7, 8, 2, 3]
a = list(arr1)
print(a)

diff = np.setdiff1d(range(min_val, max_val), arr1)
arr = np.arange(min_val_len, max_val_len)
if diff in arr:
    print(diff)
else:
    print("no missing")
For my purposes, the output here should be [5, 6].
And if the input is [1, 2, 3, 4, 5], the result should be 'no missing'.
But the code doesn't work as I expect.
I think you expect in to work in a way it does not: you want to check every single element. Try:
b = [d in arr for d in diff]
Now b contains a boolean value for each value d of diff. If you want to find the actual numbers that are missing, you can do it in one expression:
b = np.intersect1d(np.setdiff1d(range(min_val, max_val), arr1), arr)
Now b contains all the numbers of diff that are also in arr. But you can do it even more simply, as you're already using the notion of sets:
print(np.setdiff1d(range(min_val, max_val), arr1))
Also note that Python has built-in set types, so you do not actually need to use numpy.
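A minimal pure-Python equivalent of that last call, using a set difference (same semantics, just without numpy):
missing = sorted(set(range(min_val, max_val)) - set(arr1))
print(missing if missing else "no missing")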