from xbbg import blp works for equity but does not work for bonds

from xbbg import blp works for equity but does not work for bonds.
I use this pip library: https://pypi.org/project/xbbg/
I do the following imports.
import blpapi
from xbbg import blp
I then run the following test for an equity:
# this works
eqData = blp.bdh(
    tickers='SPX Index', flds=['high', 'low', 'last_price'],
    start_date='2018-10-10', end_date='2018-10-20',
)
print(eqData)
This works and produces the expected dataframe.
I do exactly the same for a corporate bond:
# this returns empty
bondData = blp.bdh(
    tickers='XS1152338072 Corp', flds=['px_bid', 'px_ask'],
    start_date='2019-10-10', end_date='2018-10-20',
)
print(bondData)
This fails (produces an empty DataFrame), even though the data exists.
Here is the result (an empty DataFrame):
getting bond data...
Empty DataFrame
Columns: []
Index: []
Also note that I can get the BDP function to work for bonds.
Why can I not get the BDH function to work?

It appears the start date (year) was after the end date; with an inverted date range, BDH silently returns an empty DataFrame. Change the start year so that it precedes the end date, i.e.
start_date='2019-10-10', end_date='2018-10-20'
corrects to
start_date='2012-10-10', end_date='2018-10-20',
This produces the expected DataFrame.
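For reference, a minimal sketch of the full corrected call (same ticker and fields as in the question, with the earlier start year, and assuming a live Bloomberg session):
from xbbg import blp

# start_date now precedes end_date, so BDH returns data
bondData = blp.bdh(
    tickers='XS1152338072 Corp', flds=['px_bid', 'px_ask'],
    start_date='2012-10-10', end_date='2018-10-20',
)
print(bondData)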

Query returns values that don't exist in PySpark DataFrame

Is there a way to create a subset dataframe from a dataframe and be sure that its values will be used afterward?
I have a huge PySpark DataFrame like this (simplified example):
id  timestamp   value
1   1658919602  5
1   1658919604  9
2   1658919632  2
Now I want to take a sample from it to test something before running on the entire DataFrame. I get a sample by:
# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
df_sample.show() shows some values.
Then I run this command, and sometimes it returns values that are present in df_sample and sometimes it returns values that are not present in df_sample but in df.
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
As if it's not using df_sample but picking 10 rows from df in a non-deterministic way.
Interestingly, if I run df_sample.show() afterwards, it shows the same values as when it was first called.
Why is this happening?
Here's full code:
# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
# shows some values
df_sample.show()
# run query
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
# df_temp sometimes shows values that are present in df_sample, but sometimes shows values that aren't present in df_sample but in df
df_temp.show()
# Shows the exact same values as when it was first called
df_sample.show()
Edit1: I understand that Spark is lazy, but is there any way to force it to not be lazy in this scenario?
We can use the sample function provided by Spark to achieve this. Every time you run sample() it returns a different set of records. To regenerate the same sample on every run, so you can compare results against a previous run, pass the same seed value each time.
df = spark.range(100)
# Execute first time
print(df.sample(0.1, 123).collect())
# Execute a second time with the same seed (123) - same rows come back
print(df.sample(0.1, 123).collect())
# Execute with a different seed (456) - a different set of rows
print(df.sample(0.1, 456).collect())
Refer to the Spark docs; for stratified sampling in Spark, see sampleBy().
What worked was using df_sample = df.limit(10).cache() or df_sample = df.limit(10).persist(). Samkart's comment pointed me in this direction.
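A minimal sketch of that fix (df as in the question; calling count() is one way to force evaluation so later queries reuse the cached 10 rows, though an unpersisted cache can still be evicted):
from pyspark.sql import functions as F

df_sample = df.limit(10).cache()
df_sample.count()  # action materializes and caches the 10 rows

# this aggregation now sees the same rows that df_sample.show() displayed
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(
    F.collect_list('value').alias('newcol')
)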

How to import Pandas data frames in a loop [duplicate]

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index, file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object and read the first item into df1, the 2nd into df2, etc. But upon trying to do this I realized I have no idea how to change the variable name being assigned on each pass of the loop!
That's what prompts my question. I managed to get my original job done another way, but I haven't been able to find a clear answer on how to do variable assignment over an iteration!
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib (added in Python 3.4+) instead of os.
import pandas as pd
from pathlib import Path

csvs = Path.cwd().glob('*.csv')  # creates a generator of matching paths
# swap Path.cwd() for Path(your_path) if the script is in a different location
dfs = {}  # let's hold the CSVs in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=1)  # the question wants only the first row; change nrows to your spec

# or with a dict comprehension
dfs = {file.stem: pd.read_csv(file, nrows=1) for file in Path(r'location\of\your\files').glob('*.csv')}
This will return a dictionary of DataFrames keyed by the CSV file name; .stem gives the file name without the extension.
much like
{
'csv_1' : dataframe,
'csv_2' : dataframe
}
If you want to concat these, then do
df = pd.concat(dfs)
and the outer level of the resulting index will be the CSV file name.

Python KeyError when using pandas

I'm following a tutorial on NLP but have encountered a KeyError when trying to group my raw data into good and bad reviews. Here is the tutorial link: https://towardsdatascience.com/detecting-bad-customer-reviews-with-nlp-d8b36134dc7e
#reviews.csv
I am so angry about the service
Nothing was wrong, all good
The bedroom was dirty
The food was great
#nlp.py
import pandas as pd
#read data
reviews_df = pd.read_csv("reviews.csv")
# append the positive and negative text reviews
reviews_df["review"] = reviews_df["Negative_Review"] +
reviews_df["Positive_Review"]
reviews_df.columns
I'm seeing the following error:
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Negative_Review'
Why is this happening?
You're getting this error because you did not understand how to structure your data.
When you do df['reviews'] = df['Positive_reviews'] + df['Negative_reviews'], you're actually adding the values of the Positive_reviews column to Negative_reviews (which does not currently exist) and storing the result in the 'reviews' column (which also does not exist).
Your CSV is nothing more than a plain-text file with one review per row. Also, since you're working with text, remember to enclose every string in quotation marks ("); otherwise any commas inside the text will create spurious columns.
With your approach, it seems that you'll still tag all your reviews manually (usually, if you're working with machine learning, you'll do this outside the code and load the labelled data into your machine learning script).
In order for your code to work, you want to do the following:
import pandas as pd
df = pd.read_csv('TestFileFolder/57886076.csv', names=['text'])
## Fill with placeholder values
df['Positive_review'] = 0
df['Negative_review'] = 1
df.head()
Result:
text Positive_review Negative_review
0 I am so angry about the service 0 1
1 Nothing was wrong, all good 0 1
2 The bedroom was dirty 0 1
3 The food was great 0 1
However, I would recommend having a single column (is_review_positive) set to true or false. You can easily encode it later on.
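A minimal sketch of that single-column layout (the label values here are hypothetical; in practice you would tag each review yourself):
import pandas as pd

df = pd.read_csv('reviews.csv', names=['text'])
# hypothetical hand-assigned labels, one per review
df['is_review_positive'] = [False, True, False, True]
# encode to 0/1 later, e.g. for a model
df['label'] = df['is_review_positive'].astype(int)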

How to properly iterate over a for loop using Dask?

When I run a loop like this (see below) using dask and pandas, only the last field in the list gets evaluated. Presumably this is because of "lazy evaluation".
import pandas as pd
import dask.dataframe as ddf
df_dask = ddf.from_pandas(df, npartitions=16)
for field in fields:
    df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(
        lambda _: [__ for __ in _ if (__ == field)], meta=list
    )
If I add .compute() to the last line:
df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(lambda _: [__ for __ in _ if (__ == field)], meta=list).compute()
it then works correctly, but is this the most efficient way of doing this operation? Is there a way for Dask to add all the items from the fields list at once, and then run them in one-shot via compute()?
Edit: (a screenshot with a worked example was attached here)
You will want to call .compute() at the end of your computation to trigger work. Warning: .compute() assumes that your result will fit in memory.
Also, watch out, lambdas late-bind in Python, so the field value may end up being the same for all of your columns.
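A minimal sketch of that one-shot approach, assuming the df and fields objects from the question: build all the columns lazily, bind field through a default argument to avoid the late-binding issue, then trigger a single compute():
import dask.dataframe as ddf

df_dask = ddf.from_pandas(df, npartitions=16)
for field in fields:
    # f=field captures the current value now, not when the task runs
    df_dask["column__{field}".format(field=field)] = df_dask["column"].apply(
        lambda _, f=field: [__ for __ in _ if __ == f], meta=list
    )

result = df_dask.compute()  # evaluates all added columns in one shot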
Here's one way to do it, where string_check is just a sample function that returns True/False. The issue was the late binding of lambda functions.
from functools import partial

def string_check(string, search):
    return search in string

search_terms = ['foo', 'bar']
for s in search_terms:
    string_check_partial = partial(string_check, search=s)
    df[s] = df['YOUR_STRING_COL'].apply(string_check_partial)

Pandas, importing JSON-like file using read_csv

I would like to import data from a .txt file into a DataFrame. I cannot import it using a classical pd.read_csv; with different types of sep it throws errors. The data I want to import, Cell_Phones_&_Accessories.txt.gz, is in this format:
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
....
You can use ¥ (yen) as the separator, since it never appears in the file, then split by the first : and pivot:
import pandas as pd

# ¥ never occurs in the file, so each whole line lands in one column
df = pd.read_csv('Cell_Phones_&_Accessories.txt', sep='¥', names=['data'], engine='python')
df1 = df.pop('data').str.split(':', n=1, expand=True)
df1.columns = ['a', 'b']
# a new record starts at each product/productId line
df1 = df1.assign(c=(df1['a'] == 'product/productId').cumsum())
df1 = df1.pivot(index='c', columns='a', values='b')
A pure-Python solution with defaultdict and the DataFrame constructor for improved performance:
import pandas as pd
from collections import defaultdict

data = defaultdict(list)
with open("Cell_Phones_&_Accessories.txt") as f:
    for line in f.readlines():
        if len(line) > 1:  # skip blank separator lines
            key, value = line.strip().split(':', 1)
            data[key].append(value)
df = pd.DataFrame(data)
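Note that the file in the question is actually gzip-compressed (.txt.gz), and both approaches can read it without unpacking first; a sketch, assuming the compressed file is in the working directory:
import gzip
import pandas as pd

# pd.read_csv infers gzip compression from the .gz extension
df = pd.read_csv('Cell_Phones_&_Accessories.txt.gz', sep='¥', names=['data'], engine='python')

# for the line-by-line approach, open the file with gzip in text mode
with gzip.open('Cell_Phones_&_Accessories.txt.gz', 'rt') as f:
    lines = f.readlines()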