MongoDB delete data using regex - pandas

I was able to use the following to delete data using pandas:
import re

repl = {
    r'<[^>]+>': '',
    r'\r\n': ' ',
    r'Share to facebook|Share to twitter|Share to linkedin|Share on Facebook|Share on Twitter|Share on Messenger|Share on Whatsapp': ''
}
articles['content'] = articles['content'].replace(repl, regex=True)
How can I do the same on the actual database that's hosted in Atlas?
My data structure is:
_id:
title:
url:
description:
author:
publishedAt:
content:
source_id:
urlToImage:
summarization:

MongoDB does not (yet) have a built-in operator to perform a regex replace on the server.
You can instead find the matching documents with a regex query and do the replacement in the programming language of your choice, for example:
from pymongo import MongoClient
import re

m_client = MongoClient("<MONGODB-URI-STRING>")
db = m_client["<DB-NAME>"]
collection = db["<COLLECTION-NAME>"]

replace_dictionary = {
    r'<[^>]+>': '',
    r'\r\n': ' ',
    r'Share to facebook|Share to twitter|Share to linkedin|Share on Facebook|Share on Twitter|Share on Messenger|Share on Whatsapp': ''
}

count = 0
for it in collection.find({
    # Match documents whose content contains any of the regex patterns
    "$or": [{"content": re.compile(x, re.IGNORECASE)} for x in replace_dictionary.keys()]
}, {
    # Project only the field to be replaced for faster execution of the script
    "content": 1
}):
    # Iterate over the patterns and apply each replacement using `re.sub`
    for k, v in replace_dictionary.items():
        it["content"] = re.sub(pattern=k, repl=v, string=it["content"])
    # Write the replaced string back to the document
    collection.update_one(
        {"_id": it["_id"]},
        {"$set": {"content": it["content"]}}
    )
    # Counter to keep track of progress
    count += 1
    print("\r", count, end='')
print("\nDONE!!!")

Related

How to select specific fields on FaunaDB Query Language?

I can't find anything about how to do this type of query in FaunaDB. I need to select only specific fields from a document, not all fields. I can select one field using the Select function, like below:
serverClient.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection('products')), {
      size: 12,
    }),
    q.Lambda('X', q.Select(['data', 'title'], q.Get(q.Var('X'))))
  )
)
Forget the selectAll function, it's deprecated.
You can also return an object literal like this:
serverClient.query(
  q.Map(
    q.Paginate(q.Documents(q.Collection('products')), {
      size: 12,
    }),
    q.Lambda(
      'X',
      {
        title: q.Select(['data', 'title'], q.Get(q.Var('X'))),
        otherField: q.Select(['data', 'other'], q.Get(q.Var('X')))
      }
    )
  )
)
Also, note that ['data, title'] in your original question is missing quotation marks; it should be ['data', 'title'].
One way to achieve this would be to create an index that returns the values required. For example, if using the shell:
CreateIndex({
  name: "<name of index>",
  source: Collection("products"),
  values: [
    { field: ["data", "title"] },
    { field: ["data", "<another field name>"] }
  ]
})
Querying that index will then return the fields defined in the index's values.
Map(
  Paginate(
    Match(Index("<name of index>"))
  ),
  Lambda("product", Var("product"))
)
Although these examples are to be used in the shell, they can easily be used in code by adding a q. in front of each built-in function.
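For example, the index query above would look roughly like this with the JavaScript driver (a sketch; the index name is still a placeholder):

serverClient.query(
  q.Map(
    q.Paginate(q.Match(q.Index('<name of index>'))),
    q.Lambda('product', q.Var('product'))
  )
)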

Is there a way to use a dynamic dataset name in BigQuery?

Problem statement:
I am trying to use the BigQueryOperator in Airflow. The aim is to run the same query repeatedly while dynamically changing the dataset name, i.e. the dataset name will be passed as a parameter.
example:
project.dataset1_layer1.tablename1, project.dataset2_layer1.tablename1
Expected:
I want to maintain a single copy of the SQL in which the dataset name can be passed as a parameter and substituted for each particular dataset.
Error Messages:
I tried to pass the dynamic dataset name as part of query_params, but it failed with the error message below.
The query got parsed as
INFO - Executing: [u'SELECT col1, col2 FROM project.#partner_layer1.tablename']
ERROR - BigQuery job failed. Final error was: {u'reason': u'invalidQuery', u'message': u'Query parameters cannot be used in place of table names at [1:37]', u'location': u'query'}. u'CREATE_IF_NEEDED', u'query': u'SELECT col1, col2 FROM project.#partner_layer1.tablename'}, u'jobType': u'QUERY'}}
Things I have tried so far
The query template temp.sql is as below:
SELECT col1, col2 FROM `project.#partner_layer1.tablename`;
The Airflow BigQueryOperator is used as below:
query_template_dict = {
    'partner_list': ['val1', 'val2', 'val3', 'val4'],
    'google_project': 'project_name',
    'queries': {
        'layer3': {
            'template': 'temp.sql',
            'output_dataset': '_layer3',
            'output_tbl': 'table_{}'.format(table_date),
            'output_tbl_schema': 'temp.txt'
        }
    },
    'applicable_tasks': {
        'val1': {'table_layer3': []},
        'val2': {'table_layer3': []},
        'val3': {'table_layer3': []},
        'val4': {'table_layer3': []}
    }
}

for partner in query_template_dict['partner_list']:
    # Loop over applicable report queries for a partner
    applicable_tasks = query_template_dict['applicable_tasks'][partner].keys()
    for task in applicable_tasks:
        destination_tbl = '{}.{}{}.{}'.format(
            query_template_dict['google_project'], partner,
            query_template_dict['queries'][task]['output_dataset'],
            query_template_dict['queries'][task]['output_tbl'])
        # Actual destination table structure
        # destination_tbl = 'project.partner_layer3.table_20200223'
        run_bq_cmd = BigQueryOperator(
            task_id=partner + '-' + task,
            sql=[query_template_dict['queries'][task]['template']],
            destination_dataset_table=destination_tbl,
            use_legacy_sql=False,
            write_disposition='WRITE_APPEND',
            create_disposition='CREATE_IF_NEEDED',
            allow_large_results=True,
            query_params=[
                {
                    "name": "partner",
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": partner}
                },
                {
                    "name": "batch_date",
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": batch_date}
                }
            ],
            dag=dag,
        )
Can anybody help me with this issue?
Is there a limitation in BigQuery on passing dataset names dynamically?
Replace the dataset name in Airflow, not in BigQuery.
BigQuery query parameters can only stand in for values, never for identifiers such as table or dataset names, so do the substitution before the query is sent to BigQuery, e.g. with Python string replacement within Airflow.
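A minimal sketch of one way to do this, assuming temp.sql can be read from the DAG's working directory, the #partner placeholder token from the template is kept, and query_params is dropped (all other arguments are the ones from the question):

# Read the template and substitute the dataset name before handing the SQL to the operator
with open('temp.sql') as f:
    query_text = f.read().replace('#partner', partner)

run_bq_cmd = BigQueryOperator(
    task_id=partner + '-' + task,
    sql=query_text,  # already-rendered SQL string instead of the template file name
    destination_dataset_table=destination_tbl,
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    create_disposition='CREATE_IF_NEEDED',
    allow_large_results=True,
    dag=dag,
)

Alternatively, since the operator's sql field is Jinja-templated, the template could reference {{ params.partner }} and the dataset name could be passed through the operator's params argument.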

Adding new key-value pair into json using karate

My payload looks like this:
{
  "override_source": "DS",
  "property_code": "0078099",
  "stay_date": "2018-11-26T00:00:00.000000",
  "sku_prices": [],
  "persistent_override": false
}
There is an array dblist ["2","3"]; it can contain numbers from 1 to 4. Based on the elements present in the list, I want to add key-value pairs such as {"sku_price":"1500","sku_code":"2"} to my payload. I am using the following code:
* eval if(contains("3",dblist)) karate.set('pushRatesFromDS.sku_prices[]','{ "sku_price": "1500","sku_code":"3" }')
When I execute my feature file I do not get any errors, but the key-values are not added to my payload. However, if I move this code to a new feature file and call it, the key-value pairs do get added. The code in my new feature file looks like: * set pushRatesFromDS.sku_prices[] = { "sku_price": "1500", "sku_code": "2" }
Try this:
* def foo =
"""
{
  "override_source": "DS",
  "property_code": "0078099",
  "stay_date": "2018-11-26T00:00:00.000000",
  "sku_prices": [],
  "persistent_override": false
}
"""
* eval karate.set('foo', '$.sku_prices[]', { foo: 'bar' })
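Applied to the payload variable from the question, the same three-argument form of karate.set can be combined with the condition on dblist (a sketch; indexOf is used because it works with the embedded JavaScript engine):

* eval if (dblist.indexOf('3') !== -1) karate.set('pushRatesFromDS', '$.sku_prices[]', { sku_price: '1500', sku_code: '3' })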

Bluemix SQLDB Query - Can't figure out JSON Parameter Markings

In my Nodered Bluemix application, I'm trying to make a SqlDB query, but I can't find sufficient documentation or examples on how to use the parameter markings in the query. Are there any examples and further insight into what I am doing wrong? Here is the flow I am having trouble with:
[
{
"id":"7924a83a.03355",
"type":"websocket-listener",
"path":"/ws/dbdata",
"wholemsg":"false"
},
{
"id":"b84efad2.9a2a58",
"type":"function",
"name":"Parse JSON",
"func":"msg.payload = JSON.parse(msg.payload);\nvar begin = msg.payload[0].split(\" \");\nbegin[1] = begin[1]+\":00\";\nvar date1 = begin[0].split(\"-\");\nvar processStart = date1[2]+\"-\"+date1[0]+\"-\"+date1[1]+\" \"+begin[1];\n\nvar end = msg.payload[0].split(\" \");\nend[1] = end[1]+\":00\";\nvar date2 = end[0].split(\"-\");\nvar processEnd = date2[2]+\"-\"+date2[0]+\"-\"+date2[1]+\" \"+end[1];\n\nmsg.payload[0] = processStart;\nmsg.payload[1] = processEnd;\nreturn msg;",
"outputs":1,"noerr":0,"x":381.79998779296875,"y":164.8000030517578,"z":"3f9da5d2.b3f0aa",
"wires":[["4f92b16a.cf981"]]
},
{
"id":"3e20f8a4.06451",
"type":"websocket in",
"name":"dbInput",
"server":"7924a83a.03355",
"client":"",
"x":159.8000030517578,"y":164.8000030517578,"z":"3f9da5d2.b3f0aa",
"wires":[["b84efad2.9a2a58"]]
},
{
"id":"68a4a35.5983f5c",
"type":"debug",
"name":"",
"active":true,"console":"false",
"complete":"true",
"x":970.7999877929688,"y":162.8000030517578,"z":"3f9da5d2.b3f0aa",
"wires":[]
},
{
"id":"5a0aed1c.34279c",
"type":"sqldb in",
"service":"LabSensors-sqldb",
"query":"",
"params":"{msg.begin},{msg.end}",
"name":"db Request",
"x":787.7999877929688,"y":163.8000030517578,"z":"3f9da5d2.b3f0aa",
"wires":[["68a4a35.5983f5c"]]
},
{
"id":"e08c4a85.e95e68",
"type":"debug",
"name":"",
"active":true,"console":"false",
"complete":"true",
"x":791.7999877929688,"y":233.8000030517578,"z":"3f9da5d2.b3f0aa",
"wires":[]
},
{
"id":"4f92b16a.cf981",
"type":"function",
"name":"Construct Query",
"func":"msg.begin = msg.payload[0];\nmsg.end = msg.payload[1];\nmsg.payload = \"SELECT * FROM IOT WHERE TIME >= '?' AND TIME < '?'\";\nreturn msg;",
"outputs":1,"noerr":0,"x":583.7999877929688,"y":163.8000030517578,"z":"3f9da5d2.b3f0aa",
"wires":[["5a0aed1c.34279c",
"e08c4a85.e95e68"]]
}
]
In the node-red documentation for the SQLDB query node it says:
"Parameter Markers is a comma delimited set of json paths. These will replace any question marks that you place in your query, in the order that they appear."
Have you tried removing the curly braces, i.e. to set the "params" field in the node to just "msg.begin,msg.end"?
You just need to remove the single quotes around the question marks; this is the correct statement:
msg.payload = "SELECT * FROM IOT WHERE TIME >= ? AND TIME < ?";

RStudio shiny datatables save csv unquoted?

How can I save the output of an RStudio Shiny DataTables table using the Save to CSV extension, but have the content saved unquoted instead of the default, which wraps everything in double quotes?
For example, for a single column with two entries, I get a file.csv like this:
"column_name"
"foo"
"bar"
And instead I would like either:
column_name
foo
bar
Or even better, without the header:
foo
bar
My current code looks like this:
output$mytable <- renderDataTable({
  entries()
}, options = list(
  colnames = NULL,
  bPaginate = FALSE,
  "sDom" = 'RMDT<"cvclear"C><"clear">lfrtip',
  "oTableTools" = list(
    "sSwfPath" = "copy_csv_xls.swf",
    "aButtons" = list(
      "copy",
      "print",
      list(
        "sExtends" = "collection",
        "sButtonText" = "Save",
        "aButtons" = list("csv", "xls")
      )
    )
  )
))
EDIT:
I tried one of the suggested answers, but "ajax" is not allowed; the page complains when I click SaveTXT. If I do the following, it still puts things within double quotes:
list(
  "sExtends" = "collection",
  "sButtonText" = "SaveTXT",
  "sFieldBoundary" = '',
  "aButtons" = list("csv")
)
Any ideas?
It should be possible through the TableTools button options, by changing the sFieldBoundary value:
$(document).ready(function () {
  $('#example').dataTable({
    "sDom": 'T<"clear">lfrtip',
    "oTableTools": {
      "aButtons": [
        {
          "sExtends": "ajax",
          "sFieldBoundary": '"'
        }
      ]
    }
  });
});
But I couldn't get it working in Shiny.