How to get historical data from bitfinex.com without a limit? - api

I am drawing a chart using data pulled from bitfinex.com via a simple API query. As a result, I need to render a chart showing the historical BTCUSD data for the past two years.
Docs are available right here: https://bitfinex.readme.io/v2/reference#rest-public-candles
Everything works fine except for the limit on the amount of retrieved data.
This is my request:
https://api.bitfinex.com/v2/candles/trade:1h:tBTCUSD/hist?start=1514764800000&sort=1
The result can be seen here, or you can paste the request into the browser: https://docs.google.com/document/d/1sG11Ro0X21_UFgUtdqrlitcCchoSh30NzGCgAe6M0u0/edit?usp=sharing
The problem is that I receive candles for only 5 days no matter what dates or parameters I use. I can get more candles if I add the limit parameter to the query string, but I still cannot get more than 1000-1100 candles. Beyond that I even get a 500 error from the server:
Server error: GET https://api.bitfinex.com/v2/candles/trade:1h:tBTCUSD/hist?limit=1100&start=1512086400000&end=1516233600000&sort=1 resulted in a 500 Internal Server Error response:\n ["error",10020,"limit: invalid"]
What would a valid limit be? There is no such information in the docs.
The author of this topic has the same question, but no solutions are given, and the last answer does not help much: Bitfinex data api
How can I get the desired amount of data for a two-year period? I would rather not break my query down into smaller pieces and go step by step; it would look ugly.

From the looks of it, the limit is set to 1000. If you need more than 1000 historical entries, you can parse the last timestamp of the response and issue another request, repeating until you reach the desired end time.
Keep in mind that you can only make roughly 10-90 requests per minute, so it is smart to sleep for about 6 seconds after every request.
import json
import time

import requests

start = 1512086400000
end = 1516233600000
timestamp = start
last_timestamp = None
url = 'https://api.bitfinex.com/v2/trades/tBTCUSD/hist/'
historical_data = []

while timestamp <= end and timestamp != last_timestamp:
    print("Requesting " + str(timestamp))
    params = {'start': timestamp, 'limit': 1000, 'sort': 1}
    response = requests.get(url, params=params)
    trades = json.loads(response.content)
    if not trades:
        break
    historical_data.extend(trades)
    last_timestamp = timestamp
    # Each trade entry is [ID, MTS, AMOUNT, PRICE]; continue from the
    # timestamp of the last entry in this batch.
    trade_id, timestamp, amount, price = trades[-1]
    # Respect the rate limit of roughly 10-90 requests per minute.
    time.sleep(6)
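The question is about the candles endpoint rather than trades, but the same pagination pattern applies. Below is a minimal sketch for candles, assuming the documented v2 response format of [MTS, OPEN, CLOSE, HIGH, LOW, VOLUME] per entry; the function name and defaults are illustrative:

import time

import requests

def fetch_candles(symbol='tBTCUSD', timeframe='1h',
                  start=1514764800000, end=None):
    """Page through the candles endpoint 1000 entries at a time."""
    url = 'https://api.bitfinex.com/v2/candles/trade:%s:%s/hist' % (timeframe, symbol)
    candles = []
    while True:
        params = {'start': start, 'limit': 1000, 'sort': 1}
        if end is not None:
            params['end'] = end
        batch = requests.get(url, params=params).json()
        if not batch:
            break
        candles.extend(batch)
        # Each candle is [MTS, OPEN, CLOSE, HIGH, LOW, VOLUME]; resume
        # just after the last candle's timestamp to avoid duplicates.
        start = batch[-1][0] + 1
        if end is not None and start > end:
            break
        time.sleep(6)  # stay well under the public rate limit
    return candles

# e.g. two years of hourly candles starting 2018-01-01:
# data = fetch_candles(start=1514764800000)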

Related

Testing assertion with value and response time - Karate

I am trying to compare the response time against a certain amount of time, but I don't know how to do it. I don't even know whether the number I give is interpreted as seconds or milliseconds.
This is my code:
Scenario: Case
    Given url 'https://reqres.in/api/users?page=2'
    When method GET
    Then print responseTime
    * def time = response.data.responseTime
    And assert response.data.responseTime < 10
The response shows the assertion failing (screenshot omitted). I've also tried giving the number in milliseconds, but I get the same result :(
This worked for me; try it. The value is in milliseconds:
* url 'https://httpbin.org/get'
* method get
* assert responseTime < 2000
Refer docs: https://github.com/karatelabs/karate#responsetime
That said, I personally don't recommend this kind of assertion in your tests. That's what performance testing is for: https://github.com/karatelabs/karate/tree/master/karate-gatling

Spark structured streaming groupBy not working in append mode (works in update)

I'm trying to get a streaming aggregation/groupBy working in append output mode, to be able to use the resulting stream in a stream-to-stream join. I'm working on (Py)Spark 2.3.2, and I'm consuming from Kafka topics.
My pseudo-code is something like the below, running in a Zeppelin notebook:
orderStream = spark.readStream().format("kafka").option("startingOffsets", "earliest").....

orderGroupDF = (orderStream
    .withWatermark("LAST_MOD", "20 seconds")
    .groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
    .agg(
        collect_list(struct("attra", "attrb2", ...)).alias("orders"),
        count("ID").alias("number_of_orders"),
        sum("PLACED").alias("number_of_placed_orders"),
        min("LAST_MOD").alias("first_order_tsd")
    )
)

debug = (orderGroupDF.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("debug")
    .start()
)
After that, I would expect data to appear on the debug query so that I can select from it (after the late-arrival window of 20 seconds has expired). But no data ever appears on the debug query (I waited several minutes).
When I change the output mode to update, the query works immediately.
Any hint as to what I'm doing wrong?
EDIT: after some more experimentation, I can add the following (but I still don't understand it).
When the Spark application starts, there is quite a lot of old data (with event timestamps << current time) on the topic from which I consume. After starting, it seems to read all these messages (MicroBatchExecution in the log reports "numRowsTotal = 6224", for example), but nothing is produced on the output, and the eventTime watermark reported by MicroBatchExecution stays at the epoch (1970-01-01).
After producing a fresh message onto the input topic with an eventTimestamp very close to the current time, the query immediately outputs all the "queued" records at once and bumps the eventTime watermark of the query.
What I can also see is that there seems to be an issue with the timezone. My Spark program runs in CET (currently UTC+2). The timestamps in the incoming Kafka messages are in UTC, e.g. "LAST_MOD": "2019-05-14 12:39:39.955595000". I have set spark_sess.conf.set("spark.sql.session.timeZone", "UTC"). Still, the micro-batch report after that "new" message has been produced onto the input topic says
"eventTime" : {
"avg" : "2019-05-14T10:39:39.955Z",
"max" : "2019-05-14T10:39:39.955Z",
"min" : "2019-05-14T10:39:39.955Z",
"watermark" : "2019-05-14T10:35:25.255Z"
},
So the eventTime is somehow linked to the time in the input message, but it is 2 hours off: the UTC difference has been subtracted twice. Additionally, I fail to see how the watermark calculation works. Given that I set it to 20 seconds, I would have expected it to be 20 seconds older than the max event time, but apparently it is 4 minutes 14 seconds older. I fail to see the logic behind this.
I'm very confused...
It seems that this was related to the Spark version 2.3.2 that I used, and maybe more concretely to SPARK-24156. After upgrading to Spark 2.4.3, I get the results of the groupBy immediately (well, of course only after the watermark lateThreshold has expired, but "in the expected timeframe").
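For reference, here is a minimal sketch of the same append-mode aggregation on Spark 2.4+; the broker address, topic name, and JSON schema are assumptions for illustration, and spark is the SparkSession predefined in the notebook:

from pyspark.sql.functions import col, count, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Hypothetical payload schema; field names follow the question.
schema = StructType([
    StructField("ID", StringType()),
    StructField("LAST_MOD", TimestampType()),
])

orders = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "orders")                        # assumed topic
    .option("startingOffsets", "earliest")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("o"))
    .select("o.*"))

orderGroupDF = (orders
    .withWatermark("LAST_MOD", "20 seconds")
    .groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
    .agg(count("ID").alias("number_of_orders")))

debug = (orderGroupDF.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("debug")
    .start())

Note that in append mode the row for a window is only emitted after the watermark passes the window's end, and the watermark only advances as newer events arrive, which also explains why the old data was held back until a fresh message was produced.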

Bigquery Stream: Missing data after new table created

We recently noticed that, within a short period after a new table was created, data that was streamed in without any exceptions or errors simply went missing. Is there any known grace period the streaming should wait for?
I finally figured out what happened by printing out trace info step by step. Multi-threading helped cover up the issue for a long time.
This is the original 'missing data' code to create a table:
insert = sBIGQUERY.tables().insert(mProjectId, mDataset, table);
logger.info("Table " + tid.toString() + " is created at "
        + new Date(insert.execute().getCreationTime()));
where insert.execute().getCreationTime() never returned (I don't know why), and thus the rest of my process (putting the data back on the sending queue to wait for the next stream) didn't execute.
After I change it to:
sBIGQUERY.tables().insert(mProjectId, mDataset, table).execute();
logger.info("Table " + tid.toString() + " is created");
It runs properly and we get all the data into BQ.
@Jordan Tigani, do you know why getCreationTime() never returns (or takes longer than I was able to wait)?
There is a 'warm-up' period of a few seconds after streaming first occurs on a table before it is available for querying. There is a similar warm-up period if you stop streaming to the table for more than 24 hours and then start again.
See the docs here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataavailability
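One way to guard against losing rows during that warm-up window is to retry the streaming insert with a delay after creating the table. A minimal sketch using the Python google-cloud-bigquery client (the attempt count and delay are assumptions, not documented values):

import time

from google.api_core.exceptions import NotFound
from google.cloud import bigquery

client = bigquery.Client()

def stream_with_retry(table_id, rows, attempts=5, delay=5):
    """Retry a streaming insert to ride out the warm-up after table creation."""
    errors = None
    for _ in range(attempts):
        try:
            errors = client.insert_rows_json(table_id, rows)
        except NotFound:
            # The freshly created table may not be visible yet.
            errors = ['table not found']
        if not errors:
            return
        time.sleep(delay)
    raise RuntimeError('streaming insert failed after %d attempts: %s'
                       % (attempts, errors))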

Dynamically change the periodic interval of celery task at runtime

I have a periodic celery task running once per minute, like so:
# tasks.py
@periodic_task(run_every=(crontab(hour="*", minute="*", day_of_week="*")))
def scraping_task():
    result = pollAPI()
where the function pollAPI(), as you might have guessed from the name, polls an API. The catch is that the API has an undisclosed rate limit and sometimes returns an error response if that limit is hit. I'd like to be able to take that response and, if the limit is hit, dynamically decrease the periodic task's interval (or even pause the task for a while). Is this possible?
I read in the docs about overriding the is_due method of schedules, but I am lost on exactly what to do to get the behaviour I'm looking for here. Could anyone help?
You could try using celery.conf.update to update your CELERYBEAT_SCHEDULE.
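A minimal sketch of that approach (the app name and schedule entry are assumptions; beat_schedule is the newer spelling of CELERYBEAT_SCHEDULE). Keep in mind that the default beat scheduler reads this configuration at startup, so a running beat process will not pick up the change by itself:

from celery import Celery
from celery.schedules import crontab

app = Celery('proj')

# Hypothetical entry: slow the scraper from once per minute to once
# every 10 minutes. Only takes effect when beat (re)reads the config.
app.conf.update(
    beat_schedule={
        'scraping-task': {
            'task': 'tasks.scraping_task',
            'schedule': crontab(minute='*/10'),
        }
    }
)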
You can add a model to the database that stores whether the rate limit has been reached. Before doing an API poll, check that information in the database; if the limit has not been reached, just send the API request.
The other approach is to use PeriodicTask from django-celery-beat. You can update the interval dynamically. I created an example project and wrote an article showing how to use dynamic periodic tasks in Celery and Django.
Here is example code that updates the task when the limit is reached:
def scraping_task(special_object_id, larger_interval=1000):
    try:
        result = pollAPI()
    except Exception as e:
        # limit reached: switch the task to a slower schedule
        special_object = ModelWithTask.objects.get(pk=special_object_id)
        task = PeriodicTask.objects.get(pk=special_object.task.id)
        new_schedule, created = IntervalSchedule.objects.get_or_create(
            every=larger_interval,
            period=IntervalSchedule.SECONDS,
        )
        task.interval = new_schedule
        task.save()
You can pass the parameters to scraping_task when creating the PeriodicTask object. You will need an additional model in the database to have access to the task:
import json

from django.db import models
from django_celery_beat.models import IntervalSchedule, PeriodicTask


class ModelWithTask(models.Model):
    task = models.OneToOneField(
        PeriodicTask, null=True, blank=True, on_delete=models.SET_NULL
    )


# create the periodic task and link it to the model
special_object, _ = ModelWithTask.objects.get_or_create()
schedule, created = IntervalSchedule.objects.get_or_create(
    every=10,
    period=IntervalSchedule.SECONDS,
)
task = PeriodicTask.objects.create(
    interval=schedule,
    name="Task 1",
    task="scraping_task",
    kwargs=json.dumps(
        {
            "special_object_id": special_object.id,
        }
    ),
)
special_object.task = task
special_object.save()

How to avoid hitting the 10 requests per second per user limit

We run multiple short queries in parallel and hit the limit of 10 requests per second per user.
According to the docs, throttling might occur if we hit a limit of 10 API requests per second per user per project.
We send a "start query job" and then call getQueryResults() with a timeoutMs of 60,000. However, we get a response after ~1 second; we look for jobComplete in the JSON response, and since it is not there, we have to send getQueryResults() again many times and so hit the threshold, which causes an error rather than a slowdown. Sample code is below.
Our questions are as follows:
1. What is a "user"? Is it an App Engine user, or a user ID that we can put in the connection string or in the query itself?
2. Is it really per BigQuery API project?
3. What is the behavior? We got an error, "Exceeded rate limits: too many user/method api request limit for this user_method", rather than the throttling behavior the docs describe, and all of our processes fail.
4. As seen in the code below, why do we get the response after ~1 second and not according to our timeout? Are we doing something wrong?
Thanks a lot
Here is a sample of the code:
while res is None or 'jobComplete' not in res or not res['jobComplete']:
    try:
        res = self.service.jobs().getQueryResults(projectId=self.project_id,
                                                  jobId=jobId,
                                                  timeoutMs=60000,
                                                  maxResults=maxResults).execute()
    except HTTPException:
        if independent:
            raise
Are you saying that even though you specify timeoutMs=60000, it returns within 1 second but the job is not yet complete? If so, this is a bug.
The quota limits for getQueryResults are actually currently much higher than 10 requests per second. The reason the docs say only 10 is that we want to retain the ability to throttle it down to that amount if someone is hitting us too hard. If you're currently seeing an error on this API, it is likely that you're calling it at a very high rate.
I'll try to reproduce the problem where we don't wait for the timeout... if that is really what is happening, it may be the root of your problems.
def query_results_long(self, jobId, maxResults, res=None):
    start_time = query_time = None
    while res is None or 'jobComplete' not in res or not res['jobComplete']:
        if start_time:
            logging.info('requested for query results ended after %s', query_time)
            time.sleep(2)
        start_time = datetime.now()
        res = self.service.jobs().getQueryResults(projectId=self.project_id,
                                                  jobId=jobId,
                                                  timeoutMs=60000,
                                                  maxResults=maxResults).execute()
        query_time = datetime.now() - start_time
    return res
Then in the App Engine log I had this:
requested for query results ended after 0:00:04.959110
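Following up on the answer above, a common workaround until the early-return issue is resolved is to poll with exponential backoff, so that an incomplete response does not immediately translate into another API call. A minimal sketch of that pattern (the delay values are assumptions):

import time

def poll_query_results(service, project_id, job_id, max_results,
                       initial_delay=1.0, max_delay=30.0):
    """Poll jobs().getQueryResults() with exponential backoff until complete."""
    delay = initial_delay
    while True:
        res = service.jobs().getQueryResults(
            projectId=project_id, jobId=job_id,
            timeoutMs=60000, maxResults=max_results).execute()
        if res.get('jobComplete'):
            return res
        # Back off before polling again so early returns do not
        # become a burst of rate-limited requests.
        time.sleep(delay)
        delay = min(delay * 2, max_delay)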