PySpark and Elasticsearch connectivity and creating a DataFrame from an index

I'm having some trouble connecting Elasticsearch and PySpark. I want to create DataFrames from an index and query them. I have all the required versions set up, but I haven't found anything specific in the documentation. Can anyone guide me through the end-to-end process?
It would be very helpful to have your support.
Based on my research I have tried passing the configuration while creating the Spark session, but I'm not sure whether it's right or wrong.
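For reference, this is a minimal sketch of the kind of session setup I mean, assuming the elasticsearch-hadoop (elasticsearch-spark) connector; the package version, host, port and index name are placeholders and need to match your own Spark/Scala and Elasticsearch versions:

from pyspark.sql import SparkSession

# Connector package version must match the Spark/Scala and Elasticsearch versions in use
spark = (
    SparkSession.builder
    .appName("es-dataframe-example")
    .config("spark.jars.packages", "org.elasticsearch:elasticsearch-spark-30_2.12:8.11.0")
    .config("spark.es.nodes", "localhost")        # Elasticsearch host (placeholder)
    .config("spark.es.port", "9200")              # Elasticsearch HTTP port
    .config("spark.es.nodes.wan.only", "true")    # often needed for managed/cloud clusters
    .getOrCreate()
)

# Read an index (name is a placeholder) into a DataFrame and query it with Spark SQL
df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.resource", "my-index")
    .load()
)
df.createOrReplaceTempView("my_index")
spark.sql("SELECT * FROM my_index LIMIT 10").show()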

Related

Looking for examples on Airflow GCSToS3Operator. Thanks

I am trying to send files from a GCS bucket to an S3 bucket using Airflow. I came across this article https://medium.com/apache-airflow/generic-airflow-transfers-made-easy-5fe8e5e7d2c2 but I am looking for specific code implementations and examples that also explain the requirements. I am a newbie to Airflow and GCP.
Astronomer is a good place to start; see their documentation for GCSToS3Operator.
It covers the dependencies, explains each parameter, and links to examples.
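As a rough starting point, here is a hedged sketch of a minimal DAG using the operator. The bucket names and connection IDs are placeholders, and parameter names can differ slightly between versions of the amazon provider package, so check the operator docs for the version you have installed:

from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.transfers.gcs_to_s3 import GCSToS3Operator

with DAG(
    dag_id="gcs_to_s3_example",
    start_date=datetime(2024, 1, 1),
    schedule=None,        # Airflow 2.4+; use schedule_interval=None on older versions
    catchup=False,
) as dag:
    copy_files = GCSToS3Operator(
        task_id="copy_gcs_to_s3",
        bucket="my-source-gcs-bucket",                   # GCS bucket to read from (placeholder)
        prefix="exports/",                               # only copy objects under this prefix
        dest_s3_key="s3://my-dest-s3-bucket/exports/",   # destination S3 path (placeholder)
        gcp_conn_id="google_cloud_default",              # Airflow connection for GCP credentials
        dest_aws_conn_id="aws_default",                  # Airflow connection for AWS credentials
        replace=False,                                   # skip objects that already exist in S3
    )

Both connections need to be defined in the Airflow UI (or via environment variables) before the task will run.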

Quickest Way to synchronously refresh TR-Formulas in VBA

Thanks to the help in this forum, I have my SQL connection and inserts working now.
The following TR-formula is used to retrieve the data from Excel Eikon:
#TR($C3,"TR.CLOSEPRICE (adjusted=0);
TR.CompanySharesOutstanding;
TR.Volume;
TR.TURNOVER"," NULL=Null CODE=MULTI Frq=D SDate="&$A3&" EDate="&$A3)
For 100k RICs the formulas usually need between 30s and 120s to refresh. That would still be acceptable.
The problem is getting the same refresh speed in a VBA loop. Application.Run "EikonRefreshWorksheet" is currently used for a synchronous refresh, as recommended in this post: https://community.developers.refinitiv.com/questions/20247/can-you-please-send-me-the-excel-vba-code-which-ex.html
The syntax of the code is correct and works for 100 RICs, but already at 1k the fetching gets very slow, and it freezes completely at around 50k, even with a timeout interval of 5 minutes.
I isolated the refresh part; there is nothing else slowing it down. So is this maybe just not the right method for fetching larger data sets? Does anyone know a better alternative?
I finally got some good advice from the Refinitiv Developer Forum which I wanted to share here:
I think you should be using the APIs directly as opposed to opening a spreadsheet and taking the data from that - but all our APIs have limits in place. There are limits for the worksheet functions as well (which use the same backend services as our APIs) - as I think you have been finding out.
You are able to use our older Eikon COM APIs directly in VBA. In this instance you would want to use the DEX2 API to open a session and download the data. You can find more details and a DEX2 tutorial sample here:
https://developers.refinitiv.com/en/api-catalog/eikon/com-apis-for-use-in-microsoft-office/tutorials#tutorial-6-data-engine-dex-2
However, I would recommend using our Eikon Data API in the Python environment, as it is much more modern and will give you a much better experience than the COM APIs. If you have a list of, say, 50K instruments, you could make 10 API calls of 5K instruments each using some chunking, which is much easier to manage without even resorting to Excel. You can then use any Python SQL tool to ingest the results into any database you wish, all from one Python script.
import refinitiv.dataplatform.eikon as ek

ek.set_app_key('YOUR APPKEY HERE')
# Request several TR fields for a small list of RICs in one call
riclist = ['VOD.L', 'IBM.N', 'TSLA.O']
df, err = ek.get_data(riclist, ['TR.CLOSEPRICE(adjusted=0).date',
                                'TR.CLOSEPRICE(adjusted=0)',
                                'TR.CompanySharesOutstanding',
                                'TR.Volume',
                                'TR.TURNOVER'])
df
# df.to_sql - see note below
# df.to_csv("test1.csv")
This will return a pandas DataFrame that you can write directly into any SQLAlchemy database (see example here), or export to CSV / JSON.
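To make the chunking and database steps concrete, here is a hedged sketch that continues from the snippet above (it reuses ek and riclist, and assumes SQLAlchemy with a local SQLite file as a placeholder; swap in your own RIC list, fields and connection string):

import pandas as pd
from sqlalchemy import create_engine

fields = ['TR.CLOSEPRICE(adjusted=0).date', 'TR.CLOSEPRICE(adjusted=0)',
          'TR.CompanySharesOutstanding', 'TR.Volume', 'TR.TURNOVER']
all_rics = riclist          # in practice a list of tens of thousands of RICs
chunk_size = 5000           # roughly the 5K-instruments-per-call suggestion above

# Fetch the data in chunks and stitch the results back together
frames = []
for i in range(0, len(all_rics), chunk_size):
    data, err = ek.get_data(all_rics[i:i + chunk_size], fields)
    frames.append(data)
result = pd.concat(frames, ignore_index=True)

# Write the combined DataFrame to a database (connection string is a placeholder)
engine = create_engine("sqlite:///eikon_data.db")
result.to_sql("eikon_data", engine, if_exists="replace", index=False)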
Unfortunately, our company policy does not allow Python at the moment. But the VBA solution also worked, even though it took some time to understand the tutorial and it has more limitations.

Importing RedisTimeSeries data into Grafana

I've got a process storing RedisTimeSeries data in a Redis instance on Docker. I can access the data just fine with the RedisInsight CLI.
I can also add Redis as a data source to Grafana, and I've imported the dashboards.
But when I actually try to import the data into a Grafana dashboard, the query just sits there.
TS.RANGE with a value of - +, or with two timestamps, also produces nothing. (I do get results when entering it into the CLI, but not as a CLI query in Grafana.)
What could I be missing?
The command you should be using in the Grafana dashboard for retrieving and visualising time series stored in Redis with RedisTimeSeries is TS.RANGE for a specific key, or TS.MRANGE combined with a filter that selects a set of matching time series. The list of RedisTimeSeries commands is here: https://oss.redislabs.com/redistimeseries/commands/ (you're using TS.INFO, which only retrieves the metadata of a time series key, not the actual samples within).
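For illustration, queries in the Grafana query editor might look like the following (the key name and label are placeholders):

TS.RANGE people_count - +
TS.MRANGE - + FILTER sensor=entrance

The first returns all samples for a single key; the second returns every time series whose sensor label equals entrance.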
So I looked into this a bit more. Moderators deleted my last answer because it didn't 'answer' the question.
There is a GitHub issue for this, and one of the developers has responded: it is broken and has been for a while. Grafana doesn't seem to want to maintain this data source at the moment. IMHO they should remove the RedisTimeSeries support from their plugin library if it isn't fully baked.
Redis data source issue for TS.RANGE: https://github.com/RedisGrafana/grafana-redis-datasource/issues/254
Are you trying to display a graph (e.g., number of people vs time)? If so, perhaps TS.INFO is not the right command and you should use something like TS.MRANGE.
Take a look at https://redislabs.com/blog/how-to-use-the-new-redis-data-source-for-grafana-plug-in/ for some more examples.

How to check whether Cloudera services like Hive and Impala are running through Java code?

I want to run some Hive queries and then collect different metrics like HDFS bytes read/written. For this I have written Java code. But before running the code I want to check whether the Cloudera services like Hive, Impala, and YARN are running. If they are running the code should execute, otherwise it should just exit. Is there any way to check the status of the services from Java code?
Sampson S gave you a correct answer, but it's not trivial to implement. The information is available via the REST API of the Cloudera Manager (CM) tool offered by Cloudera. You would have your Java program make an HTTP GET request to CM, parse the JSON result, and use that to decide. Alternatively, you could look at the code behind their APIs to make a more direct query.
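To illustrate the shape of that request, here is a rough sketch (written in Python with the requests library for brevity; the same GET can be issued from Java with any HTTP client). The host, port, credentials, cluster name and API version are all placeholders, and the exact fields to inspect should be checked against the API docs for your CM release:

import requests

CM_URL = "http://cm-host.example.com:7180"   # Cloudera Manager host (placeholder; 7180 is the default port)
CLUSTER = "MyCluster"                        # cluster name as shown in CM (placeholder)
API = "v19"                                  # CM API version (placeholder; depends on your CM release)

resp = requests.get(
    f"{CM_URL}/api/{API}/clusters/{CLUSTER}/services",
    auth=("admin", "admin"),                 # CM credentials (placeholders)
)
resp.raise_for_status()

# Each service entry reports its name, state and health; decide here whether to proceed or exit
for svc in resp.json().get("items", []):
    print(svc["name"], svc.get("serviceState"), svc.get("healthSummary"))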
But I think you should ask "Why?" What are you trying to accomplish? Are you replicating the functionality already provided by CM? When asking questions here on SO it's always helpful to provide some context. It seems like you may be new to the environment. Perhaps it already does what you want.

Liferay 6.2 Lucene replication in cluster

I'd welcome any help regarding a simple issue: I have a clustered environment and I enabled Lucene replication in the properties (lucene.replicate.write=true). Now all the tutorials are instructing me to reindex Lucene.
Should I run it on one node? On both? Simultaneously or sequentially?
This question has been asked in the Liferay Forum as well: https://www.liferay.com/community/forums/-/message_boards/view_message/69175435.
Thank you!
Basically, what I did at first was the following:
cluster.link.enabled=true
lucene.replicate.write=true
and the result was that replication was NOT working.
What I tried next was to put this issue aside and continue with clustering the rest of the portal, which in the end helped Lucene as well. My steps were:
deploy cluster activation keys
deploy ehcache-cluster-web.war
portal-ext.properties:
cluster.link.enabled=true
cluster.link.autodetect.address=<COMMONLY_ACCESSIBLE_IP_AND_PORT>
lucene.commit.batch.size=1
lucene.commit.time.interval=5000
lucene.replicate.write=true
ehcache.cluster.link.replication.enabled=true
cluster.link.channel.properties.control=<PATH_TO_XML>
cluster.link.channel.properties.transport.0=<PATH_TO_XML>
portal.instance.protocol=http
portal.instance.http.port=8080
setenv.sh
-Djava.net.preferIPv4Stack=true
-Djgroups.bind_addr=<IP_OF_THE_NODE>
edit the clusterlink_control and clusterlink_transport files as per the Liferay tutorials
with the servers shut down, delete the contents of data/lucene, and then run a reindexation from the Control Panel on one node
In the end, Lucene replication IS working. What I think could be significant is the following. At first, the portal.properties explanation of the lucene.commit.* keys is rather hard to comprehend; by trial and error I found out that these two keys are in an AND relation. I also found out about the portal.instance.* keys, which are used for multiple purposes in clustering and can matter if you have load balancers and/or Apache servers between the nodes and autodetection fails.
There are multiple ways to configure search clustering in Liferay. If you use the lucene.replicate.write=true way, you're looking at several reindexing runs: On every restart of a server you must reindex that server's documents, as it might have missed indexing requests when it was down.
So, short answer: don't worry, reindex both. Sooner or later you'll do it anyway, even if only one is needed now.