How can I retrieve data from a cell? - openpyxl

I'm using openpyxl and I want to retrieve data. The following line
cell_name_Target = ws.cell(row = 0, column = ColT).value
gives me an error message:
TypeError: argument of type 'NoneType' is not iterable
How can I get the data from a specific row and column into a variable?

openpyxl uses 1-based indexing for rows and columns, so row=0 refers to a row that does not exist; the first valid row is row=1.
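A minimal sketch of the corrected lookup, assuming ColT holds a 1-based column index (the workbook file name here is only illustrative):

from openpyxl import load_workbook

wb = load_workbook('example.xlsx')  # hypothetical file name
ws = wb.active
ColT = 2  # assumed 1-based column index
# rows and columns start at 1, so the first row is row=1
cell_name_Target = ws.cell(row=1, column=ColT).value
print(cell_name_Target)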

Related

Synapse Analytics Pyspark: TypeError: list object is not callable

I need to create a new dataframe in Synapse Analytics using the column names from another dataframe. The new dataframe will have just one column (column header: col_name), and the column names from the other dataframe are the cell values. Here's my code:
df1 = df.columns
colName = []
for e in df1:
    list1 = [e]
    colName.append(list1)
col = ['col_name']
df2 = spark.createDataFrame(colName, col)
display(df2)
The output table is created as expected. With the output dataframe, I can run the following count, display, or withColumn commands:
df2.count()
df2=df2.withColumn('index',lit(1))
But when I run the filter command below, I end up with a 'list' object is not callable error message:
display(df2.filter(col('col_name')=='dob'))
I am just wondering if anyone knows what I am missing and how I can solve this. In the end I'd like to add a conditional column based on the value in the col_name column.
The problem is that you have two objects called col.
You did this:
col=['col_name']
therefore, when you do this:
display(df2.filter(col('col_name')=='dob'))
you are no longer calling pyspark.sql.functions.col but the list ['col_name'], hence TypeError: 'list' object is not callable.
Simply replace it here:
# display(df2.filter(col('col_name')=='dob'))
from pyspark.sql import functions as F
display(df2.filter(F.col('col_name')=='dob'))
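Alternatively, a small sketch that avoids the shadowing altogether by giving the local list a different name (it assumes the same df and Spark session as in the question):

from pyspark.sql import functions as F
from pyspark.sql.functions import lit

# each column name of df becomes one row of the new dataframe
col_names = [[c] for c in df.columns]
df2 = spark.createDataFrame(col_names, ['col_name'])
df2 = df2.withColumn('index', lit(1))
# col is no longer shadowed; the F alias makes the intent explicit
display(df2.filter(F.col('col_name') == 'dob'))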

Using .loc to populate an empty dataframe... error = 'Passing list-likes to .loc or [] with any missing labels is no longer supported'

raw_count_df: the empty DF I want to populate.
htp_raw: the dataframe holding the values I want to enter into the corresponding columns in raw_count_df.
How could I rewrite this code...
raw_count_df is the empty DF with the column headers htp_one, htp_two, htp_three and htp_average (the columns I am populating).
htp_raw is a dataframe containing the values I want to enter into the empty dataframe.
Using .loc, this code identifies a column such as htf_one and then uses the index of the empty dataframe to place each value in the correct position. I only want the values from htp_raw whose labels match the index of the empty dataframe.
This code worked until recently:
raw_count_df['htp_one'] = htp_raw.loc[raw_count_df.index, 'htf_one']
raw_count_df['htp_two'] = htp_raw.loc[raw_count_df.index, 'htf_two']
raw_count_df['htp_three'] = htp_raw.loc[raw_count_df.index, 'htf_three']
raw_count_df['htp_average'] = htp_raw.loc[raw_count_df.index, 'average']
Now I am getting this error:
Passing list-likes to .loc or [] with any missing labels is no longer supported
I am not sure how I would rewrite this code using .reindex etc. to populate the dataframe in the same way.
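A sketch of the .reindex rewrite that the pandas error message points to, assuming the frames from the question; labels present in raw_count_df.index but missing from htp_raw come back as NaN instead of raising:

raw_count_df['htp_one'] = htp_raw['htf_one'].reindex(raw_count_df.index)
raw_count_df['htp_two'] = htp_raw['htf_two'].reindex(raw_count_df.index)
raw_count_df['htp_three'] = htp_raw['htf_three'].reindex(raw_count_df.index)
raw_count_df['htp_average'] = htp_raw['average'].reindex(raw_count_df.index)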

pandas to_datetime leaves unconverted data

I'm trying to convert a column of strings that look like "201905011" (year/month/day) to datetime, ideally displayed as 05-01-2019 (month/day/year). I'm currently trying the following, but it's not working for me:
pd.to_datetime(data.datetime, format = '%Y%m%d%H')
This leaves me with the error: "ValueError: unconverted data remains: 4"
How can I do this correctly?
I created an example based on ALollz's comment: a dataframe in which the first row is correct and the second row has an extra 0 at the end. With errors='coerce', you can return the rows whose data doesn't match the specified format:
import pandas as pd

df = pd.DataFrame({"datefield": ["201901010", "20190101010"]})
# rows that fail to parse become NaT under errors='coerce';
# select the original strings for exactly those rows
df.loc[pd.to_datetime(df.datefield, format='%Y%m%d%H', errors='coerce').isnull(), 'datefield']

1    20190101010
Name: datefield, dtype: object
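Once the bad rows have been coerced to NaT, a short follow-up sketch (the formatted column name is my own) renders the valid dates in the month/day/year form the question asks for:

parsed = pd.to_datetime(df.datefield, format='%Y%m%d%H', errors='coerce')
# strftime leaves the NaT rows as NaN
df['formatted'] = parsed.dt.strftime('%m-%d-%Y')
print(df)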

pandas schema validation with specific columns

I have a pandas dataframe with 56 columns and 120,000 rows.
I would like to implement validation only on some columns and not for all of them.
I followed article at https://tmiguelt.github.io/PandasSchema/
When I run something like the function below, it throws an error:
"Invalid number of columns. The schema specifies 2, but the data frame has 56"
def DoValidation(self, df):
    null_validation = [CustomElementValidation(lambda d: d is not np.nan, 'this field cannot be null')]
    schema = pandas_schema.Schema([Column('ItemId', null_validation),
                                   Column('ItemName', null_validation)])
    errors = schema.validate(df)
    if len(errors) > 0:
        for error in errors:
            print(error)
        return False
    return True
Am I doing something wrong?
What is the correct way to validate specific columns in a dataframe?
Note: I have to implement different types of validations (decimal, length, null checks, etc.) on different columns, not just the null check validation shown in the function above.
As Yuki Ho mentioned in his answer, by default you have to specify as many columns in the schema as your dataframe has.
But you can also use the columns parameter of schema.validate() to specify which columns to check. Combining that with schema.get_column_names(), you can do the following to easily avoid your issue:
schema.validate(df, columns=schema.get_column_names())
The error reads "Invalid number of columns. The schema specifies 2, but the data frame has 56" because the schema describes 2 columns while your dataframe has 56.
You can either validate all 56 columns or create a new df containing only the columns you want to check.
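A sketch of that second approach, reusing the schema and the column names from the question:

# slice the dataframe down to just the columns the schema describes
subset = df[['ItemId', 'ItemName']]
errors = schema.validate(subset)
for error in errors:
    print(error)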

How to access nested tables in hdf5 with pandas

I want to retrieve a table from an HDF5 file using pandas.
Following several references I found, I have tried to open the file using:
df = pd.read_hdf('data/test.h5', g_name)
where g_name is the path to the object I want to retrieve, i.e. the table TAB1, for instance MAIN/Basic/Tables/TAB1.
g_name is retrieved as follows:
def get_all(name):
    if 'TAB1' in name:
        return name

with h5py.File('data/test.h5') as f:
    g_name = f.visit(get_all)
    print(g_name)
    group = f[g_name]
    print(type(group))
I have also tried retrieving the object itself, as seen in the code snippet above, but what comes back is an h5py object rather than anything pandas can read directly. How would I convert this to something I can read as a data frame in pandas?
For the first case, I get the following error:
"cannot create a storer if the object is not existing "
I do not understand why it cannot find the object, if the path is the same as retrieved during the search.
I found the following solution. pd.read_hdf only understands HDF5 files that were written by pandas itself (via PyTables), which is why it cannot build a storer for this object; for an arbitrary HDF5 hierarchy like this one, reading through h5py and converting works:
import h5py
import pandas as pd

hf = h5py.File('data/test.h5', 'r')
data = hf.get('MAIN/Basic/Tables/TAB1')  # the table node inside the file
result = data[()]  # read the whole dataset into a NumPy array
hf.close()
# This last step just converts the table into a pandas df
df = pd.DataFrame(result)