Using pymongo to load data into a Jupyter notebook and get the collection list - pandas

I am trying to connect to MongoDB using pymongo.
How can I get the collection list?
import pymongo
from pymongo import MongoClient
import pandas as pd
import json
uri = "mongodb://XXX:abcd#dev-mongo.XXX.com/YYY?authSource=admin&authMechanism=SCRAM-SHA-1"
client = MongoClient(uri)
db = client['database_name']
collection=db['collection_name']
coll_list = db.collection_names()  # deprecated in PyMongo 3.7+ and removed in 4.0; use list_collection_names()

import pymongo

db_connect = pymongo.MongoClient("localhost", 27017)
database_name = 'MY_DATABASE_NAME'
database = db_connect[database_name]
collection = database.list_collection_names(include_system_collections=False)
for collect in collection:
    print(collect)
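Since the question also asks about getting the data into pandas, here is a minimal sketch that lists collections with the non-deprecated list_collection_names() and reads one collection into a DataFrame; the database and collection names are placeholders taken from the question, and it assumes the documents fit in memory.
from pymongo import MongoClient
import pandas as pd

client = MongoClient(uri)                       # uri as defined in the question
db = client['database_name']

# current API for listing collections (collection_names() was removed in PyMongo 4.0)
print(db.list_collection_names())

# pull the documents of one collection into a pandas DataFrame
docs = list(db['collection_name'].find({}, {'_id': 0}))   # drop _id for a cleaner frame
df = pd.DataFrame(docs)
print(df.head())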

Related

Hybris. Export data. Passing incorrect impexscript to Exporter

I'm unable to export data via an ImpEx script due to this error:
de.hybris.platform.impex.jalo.ImpExException: line 3 at main script: No valid line type found for {0=, 1=user_to_test_export}
This is how I'm exporting my data using ImpEx:
String impexScript = String.format("INSERT Customer;uid[unique=true];\n" +
";%s", customer.getUid());
ImpExMedia impExMedia = ImpExManager.getInstance().createImpExMedia("test_importexport_exportscript");
impExMedia.setData(new ByteArrayInputStream(impexScript.getBytes()), "CustomerData", "text/csv");
ExportConfiguration config = new ExportConfiguration(impExMedia, ImpExManager.getExportOnlyMode());
Exporter exporter = new Exporter(config);
exporter.export();
A User/Customer can easily be exported via Groovy:
import de.hybris.platform.impex.jalo.*
import de.hybris.platform.impex.jalo.exp.ExportConfiguration
import de.hybris.platform.impex.jalo.exp.Exporter
import de.hybris.platform.impex.jalo.exp.Export
String impexScript = String.format("INSERT Customer;uid[unique=true]");
ImpExMedia impExMedia = ImpExManager.getInstance().createImpExMedia("test_importexport_exportscript");
impExMedia.setData(new ByteArrayInputStream(impexScript.getBytes()), "CustomerData", "text/csv");
ExportConfiguration config = new ExportConfiguration(impExMedia, ImpExManager.getExportOnlyMode());
Exporter exporter = new Exporter(config);
Export export = exporter.export();
println(export.getExportedData())
Note: run the Groovy script in commit mode.
Sample Output:
data_export_1676281713548(data_export_1676281713548(8798244438046))
With this PK (8798244438046), the ImpExMedia can easily be found in the Backoffice.

How to scrape the content of tweets' external links for text analytics?

I am using the following code to scrape the news articles associated with tweets' external links.
import snscrape.modules.twitter as sntwitter
import pandas as pd
import urllib
import re
import requests

# Creating list to append tweet data to a dataframe
tweets_list = []
# Using TwitterSearchScraper to scrape data and append tweets to list
for i, tweet in enumerate(sntwitter.TwitterSearchScraper('#SkySportsNews OR #cnnsport since:2022-07-06').get_items()):
    url = re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", tweet.content)
    if len(url) != 0:
        page = requests.get(url)
    if i > 18000:
        break
    tweets_list.append([tweet.date, tweet.id, tweet.content, tweet.user.username, url, page.text])

# Creating a dataframe from the tweets list above
tweets_df = pd.DataFrame(tweets_list, columns=['Datetime', 'Tweet Id', 'Text', 'Username', 'url', 'text'])
But I am getting the error below
> `InvalidSchema: No connection adapters were found for "['http link']"`
and it's coming from the `page = requests.get(url)` line.
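The error happens because re.findall() returns a list of strings, while requests.get() expects a single URL string. A minimal sketch of one way to handle it, assuming each matched link should be fetched separately (the simplified URL pattern, timeout, and exception handling are illustrative additions):
urls = re.findall(r"http[s]?://\S+", tweet.content)    # simplified pattern for illustration
for u in urls:
    try:
        page = requests.get(u, timeout=10)             # pass one URL string, not the whole list
    except requests.RequestException:
        continue                                       # skip links that fail to load
    tweets_list.append([tweet.date, tweet.id, tweet.content,
                        tweet.user.username, u, page.text])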

error: value show is not a member of Unit CaseFileDFTemp.show()

I ran the code below in a Databricks Scala notebook, but I am getting an error.
Library added: azure-cosmosdb-spark_2.4.0_2.11-1.3.4-uber
Code:
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType,LongType,FloatType,DoubleType, TimestampType}
import org.apache.spark.sql.cassandra._
//datastax Spark connector
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import com.datastax.driver.core.{ConsistencyLevel, DataType}
import com.datastax.spark.connector.writer.WriteConf
//Azure Cosmos DB library for multiple retry
import com.microsoft.azure.cosmosdb.cassandra
import sqlContext.implicits._
spark.conf.set("x","x")
spark.conf.set("x","x")
spark.conf.set("x","x")
spark.conf.set("x","x")
val CaseFileDFTemp = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map( "table" -> "case_files", "keyspace" -> "shared"))
  .load().show()
CaseFileDFTemp.show()
ERROR:
error: value show is not a member of Unit
       CaseFileDFTemp.show()
Can you please try creating the SQL context first and then calling the show function?
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
Please let me know if it helps.
If you write
val CaseFileDFTemp = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map( "table" -> "case_files", "keyspace" -> "shared"))
  .load().show()
then CaseFileDFTemp will have type Unit, because show() returns Unit and therefore "consumes" your DataFrame. Remove the .show() from the assignment (you already call it on a separate line afterwards), and it will work.

Bigquery Not Accepting Stackdriver's Sink Writer Identity

I've been following the documentation to export logs from Stackdriver to BigQuery. I've tried the following:
from google.cloud.logging.client import Client as lClient
from google.cloud.bigquery.client import Client as bqClient
from google.cloud.bigquery.dataset import AccessGrant
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path_to_json.json'
lc = lClient()
bqc = bqClient()
ds = bqc.dataset('working_dataset')
acc = ds.access_grants
acc.append(AccessGrant('WRITER', 'groupByEmail', 'cloud-logs@system.gserviceaccount.com'))  # this is the service account shown in our Exports tab in GCP
ds.access_grants = acc
ds.update()
But we get the error message:
NotFound: 404 Not found: Email cloud-logs@system.gserviceaccount.com (PUT https://www.googleapis.com/bigquery/v2/projects/[project-id]/datasets/[working-dataset])
Why won't our dataset be updated? The key being used is the one which already created the dataset itself.
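As a side note, newer releases of google-cloud-bigquery replaced AccessGrant/access_grants with AccessEntry/access_entries. Below is a rough sketch of the same grant under the current API; the dataset name and service-account email are taken from the question, and this alone does not explain the 404:
from google.cloud import bigquery

bqc = bigquery.Client()
dataset = bqc.get_dataset('working_dataset')

entries = list(dataset.access_entries)
entries.append(bigquery.AccessEntry(role='WRITER',
                                    entity_type='groupByEmail',
                                    entity_id='cloud-logs@system.gserviceaccount.com'))
dataset.access_entries = entries

# only the access_entries field is sent to the API
dataset = bqc.update_dataset(dataset, ['access_entries'])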

Google Custom Search via API is too slow

I am using Google Custom Search to index content on my website.
When I use a REST client to make the GET request at
https://www.googleapis.com/customsearch/v1?key=xxx&q=query&cx=xx
I get a response in under a second.
But when I make the same call from my code, it takes up to six seconds. What am I doing wrong?
__author__ = 'xxxx'
import urllib2
import logging
import gzip
from cfc.apikey.googleapi import get_api_key
from cfc.url.processor import set_query_parameter
from StringIO import StringIO

CX = 'xxx:xxx'
URL = "https://www.googleapis.com/customsearch/v1?key=%s&cx=%s&q=sd&fields=kind,items(title)" % (get_api_key(), CX)

def get_results(query):
    url = set_query_parameter(URL, 'q', query)
    request = urllib2.Request(url)
    request.add_header('Accept-encoding', 'gzip')
    request.add_header('User-Agent', 'cfc xxxx (gzip)')
    response = urllib2.urlopen(request)
    if response.info().get('Content-Encoding') == 'gzip':
        buf = StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        data = f.read()
    return data
I have implemented the performance tips mentioned in the Performance Tips documentation. I would appreciate any help. Thanks.
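Not a fix for the API latency itself, but a small sketch (using the requests library instead of urllib2) that reuses one HTTPS session and times each call can help isolate whether the extra seconds come from connection setup on your side or from the API; the API key and cx values are placeholders:
import time
import requests

session = requests.Session()    # reuse one HTTPS connection across queries
session.headers.update({'Accept-Encoding': 'gzip', 'User-Agent': 'cfc xxxx (gzip)'})

def get_results(query):
    params = {'key': 'API_KEY', 'cx': 'xxx:xxx',
              'q': query, 'fields': 'kind,items(title)'}
    start = time.time()
    resp = session.get('https://www.googleapis.com/customsearch/v1', params=params)
    print('elapsed: %.2fs' % (time.time() - start))
    return resp.json()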