Python append entry KeyError problem because of missing data from the API - pandas

So, I'm trying to collect data from an API to build a DataFrame. The problem is that when I get the response as JSON, some of the values are missing for some rows. That means one row has all 10 out of 10 values and another has only 8 out of 10.
For example, I have this code to collect the data from the API and then form a DataFrame:
response = r.json()
cols = ['A', 'B', 'C', 'D']
l = []
for entry in response:
    l.append([
        entry['realizationreport_id'],
        entry['suppliercontract_code'],
        entry['rid'],
        entry['ppvz_inn'],
    ])
I get this error because in one of the rows the API didn't return a value in the response:
KeyError: 'ppvz_inn'
So I'm trying to fix it so that the cell of the DataFrame is filled with 0 or NaN if the API doesn't have a value for that specific row:
l = []
for entry in response:
    l.append([
        entry['realizationreport_id'],
        entry['suppliercontract_code'],
        entry['rid'],
        entry['ppvz_inn'],
        try:
            entry['ppvz_supplier_name'],
        except KeyError:
            '0',
And now I get this error:
try:
^
SyntaxError: invalid syntax
How do I actually make this work and fill those cells that have no data?

You cannot have a try-except statement in the middle of your append call; try/except is a statement, not an expression, so it can't appear inside a list literal.
You could either work with if statements or first fix your JSON data by filling in the empty values. You could also use dict.setdefault (or dict.get) to supply a default for missing keys, as sketched below.
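For illustration, a minimal sketch of that approach using dict.get (a close relative of setdefault), so that missing keys fall back to a default instead of raising KeyError. The key names come from the question; the 0 fallback is an assumption based on what the asker wants:
l = []
for entry in response:
    l.append([
        entry['realizationreport_id'],
        entry['suppliercontract_code'],
        entry['rid'],
        # .get returns the second argument (0 here) when the key is missing,
        # so incomplete rows no longer raise KeyError
        entry.get('ppvz_inn', 0),
        entry.get('ppvz_supplier_name', 0),
    ])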

Use collections.defaultdict. It's a subclass of dict that does not raise KeyError; instead it creates the missing key on first access.
You can convert your existing dict to a defaultdict using unpacking:
from collections import defaultdict

for entry in response:
    entry_defaultdict = defaultdict(list, **entry)
In this case, every access to a missing key will create an empty list as that key's value.
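Putting it together, a rough sketch of the whole loop (the column names, and the choice of 0 rather than an empty list as the default, are assumptions based on what the question asks for; response is the parsed JSON list from the question):
from collections import defaultdict

import pandas as pd

cols = ['A', 'B', 'C', 'D', 'E']
l = []
for entry in response:
    # defaultdict(int, ...) yields 0 for missing keys instead of raising KeyError
    e = defaultdict(int, **entry)
    l.append([
        e['realizationreport_id'],
        e['suppliercontract_code'],
        e['rid'],
        e['ppvz_inn'],
        e['ppvz_supplier_name'],
    ])

df = pd.DataFrame(l, columns=cols)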

Related

How to update a whole column based on a HTTP call done on each cell?

I am fumbling around with pandas (I want to avoid using Excel; I have very basic knowledge of pandas and a reasonable knowledge of Python), trying to add a column based on another column.
Specifically, I have a column with IDs, and I want to enrich my data by making a HTTP query to an API and using a field in the JSON response:
d['m0'] = pd.read_json(f"http://localhost:3000/{d['id']}")['H']['M0']
What I wanted to say in the above was
take the data from a cell in the column id, run the API query, and put the ['H']['M0'] of the JSON response (a string) into the column m0
What I get is
InvalidURL Traceback (most recent call last)
Cell In[5], line 1
----> 1 d['m0'] = pd.read_json(f"http://localhost:3000/{d['id']}")['H']['M0']
I feel that the way the URI was built is not correct, i.e. the content of the cell in column id was not used, but rather the whole column:
InvalidURL: URL can't contain control characters. '/0 AA13\n1 BB10\n2
AA13, BB10, ... are the IDs in the column.
If I understand it correctly, you have various ID values and you want to automate things by fetching the JSON corresponding to each ID and using values from that JSON to further populate the DataFrame.
I have not seen the structure of your JSON or the return type of the accessed fields, but I think you are looking for the following:
d['m0'] = d['id'].apply(lambda id: pd.read_json(f"http://localhost:3000/{id}")['H']['M0'])
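If the per-ID JSON doesn't load cleanly with pd.read_json, the same idea works with a plain HTTP client. A sketch using requests (a swapped-in alternative, not part of the answer above); the endpoint and the ['H']['M0'] field path are taken from the question, and fetch_m0 is just an illustrative helper name:
import requests

def fetch_m0(one_id):
    # one HTTP call per cell of the 'id' column
    return requests.get(f"http://localhost:3000/{one_id}").json()['H']['M0']

d['m0'] = d['id'].apply(fetch_m0)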

xml_nodeset to tibble, one row per xml_nodeset (item)

I have a complicated XML file with items as first-level child nodes. The items can have different structures, and some attributes are missing in some of them. I need to store one item (nodeset) per tibble row, so that I can keep track of missing attributes and write a function handling all variants.
I found a solution for the first step by Felix Ebert:
https://stackoverflow.com/questions/49253021/how-to-extract-xml-attr-and-xml-text-on-different-levels-with-xml2-and-purrr
I copy part of the code here:
xml <- xml2::read_xml("input/example.xml")
rows <- xml %>% xml_find_all("//xmlsubsubnode")
rows_df <- data_frame(node = rows)
The function data_frame has been deprecated, and I get error messages if I replace it with
tibble()
as_tibble()
data.frame()
With tibble I get the following error:
df_articles <- tibble(item = xml_articles)
Error:
! All columns in a tibble must be vectors.
✖ Column `item` is a `xml_nodeset` object.
Backtrace:
1. tibble::tibble(item = xml_articles)
2. tibble:::tibble_quos(xs, .rows, .name_repair)
3. tibble:::check_valid_col(res, col_names[[j]], j)
4. tibble:::check_valid_cols(set_names(list(x), name))
I would be grateful if anybody could update the original post.

Spark Dataframe sql in java - How to escape single quote

I'm using spark-core, spark-sql, Spark-hive 2.10(1.6.1), scala-reflect 2.11.2. I'm trying to filter a dataframe created through hive context...
df = hiveCtx.createDataFrame(someRDDRow, someDF.schema());
One of the columns that I'm trying to filter on has multiple single quotes in it. My filter query is something similar to
df = df.filter("not (someOtherColumn= 'someOtherValue' and comment= 'That's Dany's Reply'"));
In my Java class where this filter occurs, I tried to replace the String variable, e.g. commentValueToFilterOut, which contains the value "That's Dany's Reply", with
commentValueToFilterOut= commentValueToFilterOut.replaceAll("'","\\\\'");
But when applying the filter to the dataframe, I'm getting the below error...
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier
s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...
We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the dataframe and copy the values from the actual column (which contains special characters that may cause issues, like single quotes) into the new column, stripped of any special characters.
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Trim out the special characters from the condition and apply the filter.
commentToFilter = "That's Dany's Reply"
commentToFilter = commentToFilter.replaceAll("'","");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Now that the filter has been applied, you can drop the new column that you created for the sole purpose of filtering, restoring the original dataframe.
df = df.drop("comment_new");
If you don't want to create a new column in the dataframe, you can also replace the special character with some "never-happen" string literal in the same column, e.g.
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"'","^^^^"));
and do the same with the string literal that you want to filter against:
commentToFilter = "That's Dany's Reply"
commentToFilter = commentToFilter.replaceAll("'","^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment= '"+commentToFilter+"')");
Once filtering is done restore the actual value by reverse-applying the string litteral
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"), "\\^\\^\\^\\^", "'"));
Though it doesn't answer the actual issue, someone facing the same problem can try this out as a workaround.
The actual solution could be to use sqlContext (instead of hiveContext) and/or Dataset (instead of DataFrame) and/or upgrade to spark-hive 2.12; I'll leave that for the experts to debate and answer.
PS: Thanks to KP, my lead

How to stop Jupyter outputting truncated results when using pd.Series.value_counts()?

I have a DataFrame and I want to display the frequencies for certain values in a certain Series using pd.Series.value_counts().
The problem is that I only see truncated results in the output. I'm coding in Jupyter Notebook.
I have tried unsuccessfully a couple of methods:
df = pd.DataFrame(...) # assume df is a DataFrame with many columns and rows
# 1st method
df.col1.value_counts()
# 2nd method
print(df.col1.value_counts())
# 3rd method
vals = df.col1.value_counts()
vals  # print(vals) doesn't work either
# All output something like this
value1 100000
value2 10000
...
value1000 1
Currently this is what I'm using, but it's quite cumbersome:
print(df.col1.value_counts()[:50])
print(df.col1.value_counts()[50:100])
print(df.col1.value_counts()[100:150])
# etc.
Also, I have read this related Stack Overflow question, but haven't found it helpful.
So how to stop outputting truncated results?
If you want to print all rows:
pd.options.display.max_rows = 1000
print(vals)
If you want to print all rows only once:
with pd.option_context("display.max_rows", 1000):
print(vals)
Relevant documentation here.
I think you need option_context with display.max_rows set to some large number, e.g. 999. The advantage of this solution is:
option_context context manager has been exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block.
# temporarily display 999 rows
with pd.option_context('display.max_rows', 999):
    print(df.col1.value_counts())
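As an alternative to changing display options at all, you can render the whole Series to text with the standard Series.to_string method (a minimal sketch):
# prints every row of the value counts, regardless of display.max_rows
print(df.col1.value_counts().to_string())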

Pandas HDF5 Select with Where on non natural-named columns

In my continuing spree of exotic pandas/HDF5 issues, I encountered the following:
I have a series of non-naturally-named columns (NB: for a good reason, with negative numbers being "system" IDs etc.), which normally doesn't cause an issue:
fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'])
However, my select statement does fall over on it:
>>> fact_hdf.select('store_0_0', columns=['o', 'a-6', 'm-13'], where=[('a-6', '=', [0, 25, 28])])
blablabla
File "/srv/www/li/venv/local/lib/python2.7/site-packages/tables/table.py", line 1251, in _required_expr_vars
raise NameError("name ``%s`` is not defined" % var)
NameError: name ``a`` is not defined
Is there any way to work around it? I could rename my negative-value columns from "a-1" to "a_1", but that means reloading all of the data in my system, which is rather much! :)
Suggestions are very welcome!
Here's a test table
In [1]: df = DataFrame({ 'a-6' : [1,2,3,np.nan] })
In [2]: df
Out[2]:
a-6
0 1
1 2
2 3
3 NaN
In [3]: df.to_hdf('test.h5','df',mode='w',table=True)
In [5]: df.to_hdf('test.h5','df',mode='w',table=True,data_columns=True)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_kind'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
/usr/local/lib/python2.7/site-packages/tables/path.py:99: NaturalNameWarning: object name is not a valid Python identifier: 'a-6_dtype'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
NaturalNameWarning)
There is a way to do this, but you would have to build it into the code itself. You can do a variable substitution on the column names as follows. Here is the existing routine (in master):
def select(self):
    """
    generate the selection
    """
    if self.condition is not None:
        return self.table.table.readWhere(self.condition.format(), start=self.start, stop=self.stop)
    elif self.coordinates is not None:
        return self.table.table.readCoordinates(self.coordinates)
    return self.table.table.read(start=self.start, stop=self.stop)
If instead you do this
(Pdb) self.table.table.readWhere("(x>2.0)",
condvars={ 'x' : getattr(self.table.table.cols,'a-6')})
array([(2, 3.0)],
dtype=[('index', '<i8'), ('a-6', '<f8')])
e.g. by substituting x with the column reference, you can get the data.
This could be done on detection of invalid column names, but is pretty tricky.
Unfortunately I would suggest renaming your columns.
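For illustration, a minimal sketch of that renaming route in current pandas (the file name 'fact.h5' and the key 'store_0_0' are assumptions; the column names come from the question). The store is rewritten once so the columns become valid identifiers, after which where clauses parse:
import pandas as pd

# read the existing store, rename the offending columns, and write it back
df = pd.read_hdf('fact.h5', 'store_0_0')
df = df.rename(columns=lambda c: c.replace('-', '_'))  # 'a-6' -> 'a_6', 'm-13' -> 'm_13'
df.to_hdf('fact.h5', 'store_0_0', mode='w', format='table', data_columns=True)

# the where clause now parses, since 'a_6' is a natural name
subset = pd.read_hdf('fact.h5', 'store_0_0',
                     columns=['o', 'a_6', 'm_13'],
                     where='a_6 in [0, 25, 28]')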