Read json string data - hive

I have the column product_details below, from which I want to extract the code where the colour is green.
Column - product_details
"[
{\"colour\":\"red\",\"code\":1,\"id\":111\"},{\"colour\":\"blue\",\"code\":4,\"id\":222\"},{\"colour\":\"green\",\"code\":6,\"id\":333\"}
]"
Expected output: 6
I tried regexp_replace and regexp_extract to get the value, but was not successful.

Related

Remove just strings from the entries in my first column of data frame

I have strings and numbers in my first column of a data frame:
rn
AT457
X5377
X3477
I want to remove just the strings and keep the numbers from each entry in the column called rn.
Any help is appreciated.
Use a regular expression to do this.
For example, with R:
## Sample data :
df=data.frame(rn=c("AT457","X5377","X3477"))
## Replace the letters with *nothing* ('\D' is used to identify non-digit characters)
df$rn_strip=gsub('\\D',"",df$rn)
## Output :
rn rn_strip
1 AT457 457
2 X5377 5377
3 X3477 3477
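
The same idea in pandas, for comparison (a sketch, not part of the original answer; the column name rn is taken from the question):

import pandas as pd

# Sample data, mirroring the R example above:
df = pd.DataFrame({'rn': ['AT457', 'X5377', 'X3477']})
# '\D' matches any non-digit character; replace each one with nothing:
df['rn_strip'] = df['rn'].str.replace(r'\D', '', regex=True)
print(df)
#       rn rn_strip
# 0  AT457      457
# 1  X5377     5377
# 2  X3477     3477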

Data cleaning: regex to replace numbers

I have this dataframe:
p=pd.DataFrame({'text':[2,'string']})
and I am trying to replace the digit 2 with an 'a' using this code:
p['text']=p['text'].str.replace('\d+', 'a')
But instead of the letter 'a' I get NaN.
What am I doing wrong here?
In your dataframe, the first value of the text column is actually a number, not a string, hence the NaN when you try to call the .str accessor on it. Just convert the column to strings first:
p['text'] = p['text'].astype(str).str.replace(r'\d+', 'a')
Output:
>>> p
text
0 a
1 string
(Note that .str.replace is soon going to change the default value of regex from True to False, so you won't be able to use regular expressions without passing regex=True, e.g. .str.replace('\d+', 'a', regex=True))
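
A minimal, self-contained version of the fix, written with regex=True so it behaves the same on newer pandas (a sketch; variable names as in the question):

import pandas as pd

p = pd.DataFrame({'text': [2, 'string']})
# Cast everything to string first, then replace each run of digits with 'a'.
p['text'] = p['text'].astype(str).str.replace(r'\d+', 'a', regex=True)
print(p)
#      text
# 0       a
# 1  string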

Is there an equivalent of an f-string in Google Sheets?

I am making a portfolio tracker in Google Sheets and wanted to know if there is a way to link the "TICKER" column with the code in the "PRICE" column that is used to pull JSON data from Coin Gecko. I was wondering if there is something like an f-string in Python, where you can insert a variable into the string itself, so that every time the TICKER column is updated the coin id is updated within the API request string. Essentially, string interpolation.
For example:
TICKER PRICE
BTC =importJSON("https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids={BTC}","0.current_price")
You could use CONCATENATE for this (an applied example follows the function reference below):
https://support.google.com/docs/answer/3094123?hl=en
CONCATENATE function
Appends strings to one another.
Sample Usage
CONCATENATE("Welcome", " ", "to", " ", "Sheets!")
CONCATENATE(A1,A2,A3)
CONCATENATE(A2:B7)
Syntax
CONCATENATE(string1, [string2, ...])
string1 - The initial string.
string2 ... - [ OPTIONAL ] - Additional strings to append in sequence.
Notes
When a range with both width and height greater than 1 is specified, cell values are appended across rows rather than down columns. That is, CONCATENATE(A2:B7) is equivalent to CONCATENATE(A2,B2,A3,B3, ... , A7,B7).
See Also
SPLIT: Divides text around a specified character or string, and puts each fragment into a separate cell in the row.
JOIN: Concatenates the elements of one or more one-dimensional arrays using a specified delimiter.
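
Applied to the example in the question, something like the following should work (an untested sketch; it assumes the ticker/coin id is in cell A2 and that the importJSON custom function from the question is available):
=importJSON(CONCATENATE("https://api.coingecko.com/api/v3/coins/markets?vs_currency=usd&ids=", A2), "0.current_price")
The & concatenation operator (e.g. "...&ids=" & A2) works the same way.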

Python: Remove exponential in Strings

I have been trying to remove the exponential (scientific) notation from a string for the longest time, to no avail.
The column contains strings with letters in them as well as long numbers of more than 24 digits. I tried converting the column to string with .astype(str), but it still shows the value as "1.234123E+23". An example of the table is:
A
345223423dd234324
1.234123E+23
How do I get the table to show the full string of digits in pandas?
b = "1.234123E+23"
str(int(float(b)))
The output is '123412299999999992791040'.
I have no idea how to do it in pandas with a mixed data type in the column, though.
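
Not from the original answer, but one way to apply the same trick across a mixed column in pandas (a sketch; note that once a value has passed through a float, digits beyond float precision are already lost, as the trailing ...92791040 above shows, so the more robust fix is to read the column as strings up front, e.g. dtype=str in read_csv):

import pandas as pd

df = pd.DataFrame({'A': ['345223423dd234324', '1.234123E+23']})

def expand_exponential(value):
    # Rewrite scientific-notation strings as plain digit strings; leave everything else alone.
    s = str(value)
    if 'e' not in s.lower():
        return s
    try:
        return str(int(float(s)))  # same trick as above; precision is limited by float
    except (ValueError, OverflowError):
        return s

df['A'] = df['A'].apply(expand_exponential)
print(df)
#                           A
# 0         345223423dd234324
# 1  123412299999999992791040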

Make new column based on presence of a word in another

I have
pd.DataFrame({'text':['fewfwePDFerglergl','htrZIPg','gemlHTML']})
text
0 fewfwePDFerglergl
1 htrZIPg
2 gemlHTML
a column 10k rows long. Each entry contains one of ['PDF','ZIP','HTML']. The length of each entry in text is 14 characters max.
how do I get:
pd.DataFrame({'text':['fewfwePDFerglergl','htrZIPg','gemlHTML'],'file_type':['pdf','zip','html']})
text file_type
0 fewfwePDFerglergl pdf
1 htrZIPg zip
2 gemlHTML html
I tried df.text[0].find('ZIP') for a single entry, but I do not know how to stitch it all together to test and return the correct value for each row in the column.
Any suggestions?
We can use str.extract here with the inline regex flag (?i) for case-insensitive matching:
words = ['pdf','zip','html']
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')
Or we can use the flags=re.IGNORECASE argument:
import re
df['file_type'] = df['text'].str.extract(f'({"|".join(words)})', flags=re.IGNORECASE)
Output
text file_type
0 fewfwePDFerglergl PDF
1 htrZIPg ZIP
2 gemlHTML HTML
If you want file_type as lower case, chain str.lower():
df['file_type'] = df['text'].str.extract(f'(?i)({"|".join(words)})')[0].str.lower()
text file_type
0 fewfwePDFerglergl pdf
1 htrZIPg zip
2 gemlHTML html
Details:
The pipe (|) is the OR operator in regular expressions. So
"|".join(words)
evaluates to 'pdf|zip|html', which in pseudocode reads:
extract "pdf" or "zip" or "html" from our string
You could use regex for this:
import re
regex = re.compile(r'(PDF|ZIP|HTML)')
This matches any of the desired substrings. To extract these matches in order and convert them to lowercase, here's a one-liner:
file_type = [re.search(regex, x).group().lower() for x in df['text']]
This returns the following list:
['pdf', 'zip', 'html']
Then to add the column:
df['file_type'] = file_type