Python Selenium: get table values into a list of lists

I'm just trying to get the data from this table:
https://www.listcorp.com/asx/sectors/materials
and put all the values (the TEXT) into a list of lists.
I've tried many different methods, using XPath, class name, and By.tagName:
------------
rws = driver.find_elements_by_xpath("//table/tbody/tr/td")
---------------
table = driver.find_element_by_class_name("v-datatable v-table theme--light")
--------------
findElements(By.tagName("table"))
--------------
# to identify the table rows
l = driver.find_elements_by_xpath("//*[@class='v-datatable.v-table.theme--light']/tbody/tr")
# get the row count with len()
print (len(l))
# THIS RETURNS 1, which can't be right because there are hundreds of rows
Nothing seems to work to get the values in an easy-to-understand manner.
(EDIT: SOLVED)
Before applying the solution below, first call time.sleep(10). This gives the page time to load so that the table can actually be retrieved. Then append all the cells to a new list. You will need multiple lists (one per row) to fit all the rows.

You can use find_elements_by_tag_name, like this:
row = driver.find_elements_by_tag_name("tr")
data = driver.find_elements_by_tag_name("td")
print('Rows --> {}'.format(len(row)))
print('Data --> {}'.format(len(data)))
for value in row:
    print(value.text)
Make sure you wait for the page to populate the data before reading it.
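A minimal sketch of how the pieces above might fit together: wait for the table cells to appear instead of sleeping for a fixed time, then collect each row's cell text into its own list. The function names rows_to_lists and scrape_table are hypothetical, the CSS selector "table tbody tr" is an assumption about the page's markup, and a local Chrome driver is assumed.

```python
def rows_to_lists(rows):
    """Turn <tr> elements into a list of lists holding each cell's text."""
    table = []
    for row in rows:
        # "tag name" is the locator strategy behind By.TAG_NAME
        cells = row.find_elements("tag name", "td")
        if cells:  # header rows hold <th> cells, so they come back empty
            table.append([cell.text for cell in cells])
    return table


def scrape_table(url, timeout=20):
    """Load url, wait for the table to render, and return its rows."""
    # Imported here so rows_to_lists can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are available
    try:
        driver.get(url)
        # Block until at least one data cell exists (explicit wait).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr td"))
        )
        return rows_to_lists(driver.find_elements(By.CSS_SELECTOR, "table tbody tr"))
    finally:
        driver.quit()
```

Calling scrape_table("https://www.listcorp.com/asx/sectors/materials") would then return one inner list per table row, provided the selector matches the page.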


Building a new dataset

I want to take data from one set and enter it into another empty set.
So, for example, I want to do something like:
if ([i,x] > 9){
new_data$House[y,x] <- data[i,2]
}
but I want to do it over and over, creating new rows in new_data.
How do I keep adding data to new_data and overriding/saving the new row?
Essentially, I just want to know how to "grow" an empty data set.
Please ignore any errors in the code, it is just an example and I am still working on other details.
Thanks
If you are using R, I presume you are looking for rbind:
new_data = NULL # define your new dataset
for (i in 1:nrow(data)) { # loop over the rows of data
  if (data[i, x] > 9) { # condition for keeping the row
    new_data = rbind(new_data, data[i, 2:6]) # append columns 2 to 6 of row i
  }
}
At the end, new_data will contain one row for each row of data that satisfies the condition, and each row will contain the values from columns 2 to 6.
If that is what you are looking for, there are various ways to do it without a for loop, for example:
new_data = data[data[, x] > 9, 2:6]
If this answer does not satisfy you, please provide more details in your question, including a reproducible example of your data and the expected output.

Performing calculations on multiple columns in dataframe and create new columns

I'm trying to perform calculations based on the entries in a pandas dataframe. The dataframe looks something like this:
and it contains 1466 rows. I'll have to run similar calculations on other dfs with more rows later.
What I'm trying to do is calculate something like mag = (U-V)/(R-I) (but ignoring any values that are -999), put that in a new column, and then put z_pred = 10**((mag-c)/m) in another new column (mag, c and m are just hard-coded variables). I have other columns I need to add too, but I figure that'll just be an extension of the same method.
I started out by trying
for i in range(1):
    current = qso[:]
    mag = (U-V)/(R-I)
    name = current['NED']
    z_pred = 10**((mag - c)/m)
    z_meas = current['z']
but I got either a Series for z, which I couldn't operate on, or various type errors when I tried to print the values or write them to a file.
I found this question which gave me a start, but I can't see how to apply it to multiple calculations, as in my situation.
How can I achieve this?
Conditionally adding calculated columns row-wise is usually done with NumPy's np.where:
df['mag'] = np.where(~df[['U', 'V', 'R', 'I']].eq(-999).any(axis=1), (df.U - df.V) / (df.R - df.I), -999)
Note: this assumes that when any of the columns contains -999, the value is not calculated and -999 is returned instead.
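To show how the same idea might extend to the second column, here is a minimal sketch. The sample rows and the values of c and m are made up for illustration. It also uses NaN rather than -999 for rows that cannot be computed, which is a deliberate substitution so that the missing marker carries through into z_pred instead of being treated as a real number.

```python
import numpy as np
import pandas as pd

# Made-up sample data; -999 marks a missing magnitude.
qso = pd.DataFrame({
    "NED": ["a", "b", "c"],
    "U": [20.0, 19.5, -999.0],
    "V": [19.0, 19.0, 18.0],
    "R": [18.5, 18.0, 17.5],
    "I": [18.0, 17.0, 17.0],
})
c, m = 0.5, 2.0  # placeholder constants, not taken from the question

bands = ["U", "V", "R", "I"]
valid = ~qso[bands].eq(-999).any(axis=1)  # True where no band is -999

# Column arithmetic works on whole Series at once, so no loop is needed.
mag = (qso["U"] - qso["V"]) / (qso["R"] - qso["I"])
qso["mag"] = mag.where(valid)  # NaN where any input was -999
qso["z_pred"] = 10 ** ((qso["mag"] - c) / m)  # NaN propagates automatically

print(qso[["NED", "mag", "z_pred"]])
```

Further derived columns can be added the same way, each as one expression over existing columns.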

matching several combinations of columns in a table

I am reading a table in which all values have to be validated before we process it further. The valid values are stored in another table that we match our main table against. The validation criterion is to match several columns, as follows:
Table 1 (the main data we read in)
Name --- Unit --- Age --- Address --- Nationality
The above shows the column names we read from the table; the other table contains the valid values for these columns. When we look for valid values in our main table, we have to consider a combination of columns in the main data table, for example Name --- Unit --- Age. If all the values in a particular row for that column combination match a row in the other table, we keep the row; otherwise we delete it.
How do I address this with NumPy?
Thanks
You can just loop through the rows. A simple way would be:
dummy_df = table_df.copy() ## make a copy of your table; since we are deleting rows, we want to keep the original df
relevant_columns = ['age','name','sex',...] ## define the relevant columns, in case either dataframe has columns you don't want to compare on
for indx in dummy_df.index:
    ## checks whether the row is identical to any row of main_df on the relevant columns; if so, drops it
    if ((np.array(dummy_df.loc[indx][relevant_columns]) == main_df[relevant_columns].values).sum(1) == len(relevant_columns)).sum() > 0:
        dummy_df = dummy_df.drop(indx)
PS: I am assuming the data is in pandas DataFrame format.
Hope it helps :)
PS2: if the headers/columns have different names, it won't work.
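As an alternative to the row loop, the rule stated in the question (keep a row only if its column combination appears in the table of valid values) can be sketched as an inner merge on the key columns. This swaps in pandas merge rather than plain NumPy, and the frames below are made-up examples that borrow the question's column names.

```python
import pandas as pd

# Made-up main table and table of valid combinations.
main = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid"],
    "Unit": ["A", "B", "C"],
    "Age": [30, 41, 25],
    "Address": ["x", "y", "z"],
})
valid = pd.DataFrame({
    "Name": ["Ann", "Cid", "Cid"],
    "Unit": ["A", "C", "C"],
    "Age": [30, 25, 25],
})

keys = ["Name", "Unit", "Age"]  # the column combination to validate

# Deduplicate the valid combinations first so that the inner join
# cannot multiply rows of the main table.
kept = main.merge(valid[keys].drop_duplicates(), on=keys, how="inner")
print(kept)
```

Rows of main whose (Name, Unit, Age) combination is absent from valid are dropped; all other columns of main are carried through unchanged.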

Dynamically creating variables, while doing map/apply on a dataframe in pandas to get key names for the values in Series object returned

I am writing code for a Naive Bayes model (I know there's a standard implementation in sklearn, but I want to code it anyway). For this I have upwards of 30 features, and for each of them I have the corresponding click & impression counts (treat them as True/False flags).
What I need, then, is to calculate
P(Click | F1, F2, ..., F30) = (P(Click) * P(F1 | Click) * P(F2 | Click) * ... * P(F30 | Click)) / P(F1, F2, ..., F30), and
P(NoClick | F1, F2, ..., F30) = (P(NoClick) * P(F1 | NoClick) * P(F2 | NoClick) * ... * P(F30 | NoClick)) / P(F1, F2, ..., F30)
where I will disregard the denominator, as it affects both click and no-click behaviour in the same way.
Example, for two features, day_custom & is_tablet_phone, I have
is_tablet_phone  click   impression
FALSE            375417  28291280
TRUE             17743   4220980
day_custom  click  impression
Fri         77592  7029703
Mon         43576  3773571
Sat         65950  5447976
Sun         66460  5031271
Thu         74329  6971541
Tue         55282  4575114
Wed         51555  4737712
My approach to the problem: assuming I read the individual files into a data frame, one after another, I want the ability to calculate the corresponding probabilities and store them back in a file, which I will then use for real-time prediction of the probability of click vs. no click.
One possible structure of the "processed file" would thus be:
Here's my entire code:
In the full-blown example, I am traversing the entire directory structure (30 txt files, one at a time, from the base path), which is why I need the ability to create "names" at runtime.
for base_path in base_paths:
    for root, dirs, files in os.walk(base_path):
        for file in files:
            file_paths.append(os.path.join(root, file))
For reasons of tractability, follow from here, by taking the 2 txt files as sample input
file_paths=['/home/ekta/Desktop/NB/day_custom.txt','/home/ekta/Desktop/NB/is_tablet_phone.txt']
flag=0
for filehandle in file_paths:
    feature_name=filehandle.split("/")[-1].split(".")[0]
    df= pd.read_csv(filehandle,skiprows=0, encoding='utf-8',sep='\t',index_col=False,dtype={feature_name: object,'click': int,'impression': int})
    df2=df[(df.impression-df.click>0) & (df.click >0)]
    if flag ==0:
        MySumC,MySumNC,Mydict=0,0,collections.defaultdict(dict)
        MySumC=sum(df2['click'])
        MySumNC=sum(df2['impression'])
        P_C=float(MySumC)/float(MySumC+MySumNC)
        P_NC=1-P_C
    for feature_value in df2[feature_name]:
        Mydict[feature_name+'_'+feature_value]={'P_'+feature_name+'_'+feature_value+'_C':(df2[df2[feature_name]==feature_value]['click']*float(P_C))/MySumC, \
        'P_'+feature_name+'_'+feature_value+'_NC':(df2[df2[feature_name]==feature_value]['impression']*float(P_NC))/MySumNC}
    flag=1 # Set the flag to 1 because we don't need to compute MySumC, MySumNC, P_C & P_NC again
Question :
It looks like THIS loop is the killer here. Also, intuitively, looping over a dataframe is bad practice. How can I rewrite this, perhaps using map/apply?
for feature_value in df2[feature_name]:
    Mydict[feature_name+'_'+feature_value]={'P_'+feature_name+'_'+feature_value+'_C':(df2[df2[feature_name]==feature_value]['click']*float(P_C))/MySumC, \
    'P_'+feature_name+'_'+feature_value+'_NC':(df2[df2[feature_name]==feature_value]['impression']*float(P_NC))/MySumNC}
What I need in Mydict, which is a hash that stores each feature name and each feature value:
{'day_custom_Mon':{'P_day_custom_Mon_C':.787,'P_day_custom_Mon_NC': 0.556},
'day_custom_Tue':{'P_day_custom_Tue_C':0.887,'P_day_custom_Tue_NC': 0.156},
'day_custom_Wed':{'P_day_custom_Wed_C':0.087,'P_day_custom_Wed_NC': 0.167},
'day_custom_Thu':{'P_day_custom_Thu_C':0.947,'P_day_custom_Thu_NC': 0.196},
'is_tablet_phone_True':{'P_is_tablet_phone_True_C':.787,'P_is_tablet_phone_True_NC': 0.066},
'is_tablet_phone_False':{'P_is_tablet_phone_False_C':.787,'P_is_tablet_phone_False_NC': 0.077},
.. and so on..
PPS: I just made up those float numbers, but you get the point.
Also, because I will later serialize this file and pass it to Redis directly, for other systems to feed on in a cron-job manner, I need to preserve some sort of dynamic naming.
What I tried -:
Since I am reading feature_name as
feature_name=filehandle.split("/")[-1].split(".")[0] # thereby abstracting & creating variables dynamically
def funct1(row):
    return row[feature_name]
def funct2(row):
    return row['click']
def funct3(row):
    return row['impression']
then
(df2.apply(funct2,axis=1)*float(P_C))/MySumC, (df2.apply(funct3,axis=1)*float(P_NC))/MySumNC
gives me both the values I need for each feature_value (say Mon, Tue, Wed, and so on) of a feature_name (say, day_custom).
I also know that df2.apply(funct1, axis=1) contains part of my custom "names" (i.e. the feature values). How would I then build these names using map/apply?
That is, I will have the values, but how would I create the key 'P_'+feature_name+'_'+feature_value+'_C', since the feature value is returned as a Series object after apply?
Check out the following recipe, which does exactly what you want using only data frame manipulations. I also simplified the actual frequency calculation a bit ;)
#set the feature name values as the index of the data frame
df2.set_index(feature_name, inplace=True)
#This is what df2.set_index() looks like:
# click impression
#day_custom
#Fri 9917 3163
#Mon 2566 3818
#Sat 8725 7753
#Sun 6938 8642
#Thu 6136 2556
#Tue 5234 2356
#Wed 9463 9433
#rename the index of your data frame
df2.rename(index=lambda x:"%s_%s"%('day_custom', x), inplace=True)
#compute the total sum of your data frame entries
totsum = float(df2.values.sum())
#use applymap to divide every data frame element by the total sum
df2 = df2.applymap(lambda x:x/totsum)
#transpose the data frame to have the following shape
#day_custom day_custom_Fri day_custom_Mon ...
#click 0.102019 0.037468 ...
#impression 0.087661 0.045886 ...
#
#
dftranspose = df2.T
# template kw for formatting
templatekw = {'click':"P_%s_C", 'impression':"P_%s_NC"}
# build a list of small data frames with correct index names P_%s_NC etc
dflist = [dftranspose[[col]].rename(lambda x:templatekw[x]%col) for col in dftranspose]
#use the concatenate function to produce a sparse dictionary
MyDict= pd.concat(dflist).to_dict()
Instead of assigning to MyDict at the end, you can use the dict update method during the loop.
To understand the comments below, here is my original answer:
Try using a pivot_table:
def clickfunc(x):
    return np.sum(x) * P_C / MySumC
def impressionfunc(x):
    return np.sum(x) * P_NC / MySumNC
newtable = df2.pivot_table(['click', 'impression'], 'feature_name', \
                           aggfunc=[clickfunc, impressionfunc])
#transpose the table for the dictionary to have the right form
newtable = newtable.T
#to_dict functionality already gives the correct result
MyDict = newtable.to_dict()
#rename by copying
for feature_value, subdict in MyDict.items():
    word = feature_name +"_"+ feature_value
    copydict[word] = {'P_' + word + '_C':subdict['click'],\
                      'P_' + word + '_NC':subdict['impression'] }
This gives you the result you want in copydict.
itertuples() is what worked for me (it ran at lightspeed), though it still does not use the map/apply approach that I so much wanted to see. itertuples on a pandas dataframe returns the whole row, so I no longer have to do df2[df2[feature_name]==feature_value]['click']. Be aware that this matching by value is not only expensive but also undesirable, since it may return a Series with more than one element if there are duplicate rows. itertuples solves that problem very elegantly, though I then need to access the individual columns by integer index, which means less reusable code. I could abstract this, but it won't be like accessing by column names, which is the status quo.
for row in df2.itertuples():
    Mydict[feature_name+'_'+str(row[1])]={'P_'+feature_name+'_'+str(row[1])+'_C':(row[2]*float(P_C))/MySumC, \
    'P_'+feature_name+'_'+str(row[1])+'_NC':(row[3]*float(P_NC))/MySumNC}
Note that I am accessing each column in the row as row[1], row[2] and so on. For example, row is (0, u'Fri', 77592, 7029703).
Post this I get
dict(Mydict)
{'day_custom_Thu': {'P_day_custom_Thu_NC': 0.18345372640838162, 'P_day_custom_Thu_C': 0.0019559423132143377}, 'day_custom_Mon': {'P_day_custom_Mon_C': 0.0011466875948906617, 'P_day_custom_Mon_NC': 0.099300235316209587}, 'day_custom_Sat': {'P_day_custom_Sat_NC': 0.14336163246883712, 'P_day_custom_Sat_C': 0.0017354517827023852}, 'day_custom_Tue': {'P_day_custom_Tue_C': 0.001454726996987919, 'P_day_custom_Tue_NC': 0.1203925662982053}, 'day_custom_Sun': {'P_day_custom_Sun_NC': 0.13239618235343156, 'P_day_custom_Sun_C': 0.0017488722589598259}, 'is_tablet_phone_TRUE': {'P_is_tablet_phone_TRUE_NC': 0.11107365073163174, 'P_is_tablet_phone_TRUE_C': 0.00046690100046229593}, 'day_custom_Wed': {'P_day_custom_Wed_NC': 0.12467127727567069, 'P_day_custom_Wed_C': 0.0013566522616712882}, 'day_custom_Fri': {'P_day_custom_Fri_NC': 0.1849842396242351, 'P_day_custom_Fri_C': 0.0020418070466026303}, 'is_tablet_phone_FALSE': {'P_is_tablet_phone_FALSE_NC': 0.74447539516197614, 'P_is_tablet_phone_FALSE_C': 0.0098789704610580936}}
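For comparison, here is a minimal sketch of building the same dictionary with whole-column arithmetic and named keys, so that neither per-row filtering nor integer positions are needed. The function name feature_probabilities and its parameters are hypothetical; the formulas mirror the ones in the question, including its use of the impression sum for the no-click side.

```python
import pandas as pd


def feature_probabilities(df, feature_name, p_c, p_nc, sum_c, sum_nc):
    """Return {feature_value_key: {P_..._C: x, P_..._NC: y}} for one feature."""
    values = df[feature_name].astype(str)
    # Vectorized: one multiplication and division per column, not per row.
    p_click = df["click"] * p_c / sum_c
    p_noclick = df["impression"] * p_nc / sum_nc

    result = {}
    for value, pc, pnc in zip(values, p_click, p_noclick):
        key = f"{feature_name}_{value}"
        result[key] = {f"P_{key}_C": float(pc), f"P_{key}_NC": float(pnc)}
    return result


# Counts taken from the is_tablet_phone table in the question.
df2 = pd.DataFrame({
    "is_tablet_phone": ["FALSE", "TRUE"],
    "click": [375417, 17743],
    "impression": [28291280, 4220980],
})
sum_c = df2["click"].sum()
sum_nc = df2["impression"].sum()
p_c = sum_c / (sum_c + sum_nc)

probs = feature_probabilities(df2, "is_tablet_phone", p_c, 1 - p_c, sum_c, sum_nc)
print(probs)
```

In the multi-file loop, the result for each feature could be merged into one dictionary with Mydict.update(...). Note that the sums here come from a single table, whereas the question computes them once from the first file it reads.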

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but for an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using Python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and more in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10;
data['group'][mask] = 1;
mask2 = (data['length'] > 9) & (data['phraseLength'] < 15);
data['group'][mask2] = 2;
mask3 = data['length'] > 14;
data['group'][mask3] = 3;
This code is not good, of course. The reason is that you don't know at run time whether data['group'][mask3], for example, will be a view, and thus actually change the dataframe, or a copy, and thus leave the dataframe unchanged. It took me quite some time to explain this to her, since she argued correctly that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we checked whether the assignment took place in two different ways:
1. By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
2. By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2, the group column contained no null values and had all the values we wanted it to have. But by method 1, it didn't!
Can anyone explain this?
See the docs here.
You don't want to set values this way, for exactly the reason you pointed out: since you don't know whether it's a view, you don't know that you are actually changing the data. pandas 0.13 will raise/warn that you are attempting to do this, but it is easiest/best to just access it like this:
data.loc[mask3,'group'] = 3
which guarantees an in-place setitem.
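Applied to all three masks, the .loc form might look like the sketch below. The six sample lengths are made up, and the masks all use the "length" column as described in the question's text.

```python
import numpy as np
import pandas as pd

# Made-up lengths covering each boundary of the three ranges.
data = pd.DataFrame({"length": [0, 9, 10, 14, 15, 20]})
data["group"] = np.nan

# Each assignment is a single setitem on data itself, so there is no
# intermediate object that could turn out to be a copy.
data.loc[data["length"] < 10, "group"] = 1
data.loc[(data["length"] > 9) & (data["length"] < 15), "group"] = 2
data.loc[data["length"] > 14, "group"] = 3

print(data)
print(data["group"].isna().sum())  # number of rows left unassigned
```

Checking isna().sum() alongside value_counts() gives a direct way to confirm whether every row received a group.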