I'm using Apache Beam. When writing to TFRecord I need to include the ID of each item along with its text and embedding.
The tutorial works with just a list of texts, but I also have a matching list of IDs, so I was wondering how I could pass the IDs into the following function:
def to_tf_example(entries):
    examples = []
    text_list, embedding_list = entries
    for i in range(len(text_list)):
        text = text_list[i]
        embedding = embedding_list[i]
        features = {
            # need to pass in ID here like so:
            'id': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[ids.encode('utf-8')])),
            'text': tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
            'embedding': tf.train.Feature(
                float_list=tf.train.FloatList(value=embedding.tolist()))
        }
        example = tf.train.Example(
            features=tf.train.Features(
                feature=features)).SerializeToString(deterministic=True)
        examples.append(example)
    return examples
My first thought was to include the IDs in the text column of my database and then extract them via slicing or regex, but I was wondering if there is a better way. I assume it involves converting to a PCollection, but I don't know where to start. Here is the pipeline:
with beam.Pipeline(args.runner, options=options) as pipeline:
    query_data = pipeline | 'Read data from BigQuery' >> beam.io.Read(
        beam.io.BigQuerySource(project='my-project', query=get_data(args.limit),
                               use_standard_sql=True))
    # list of texts
    text = query_data | 'get list of text' >> beam.Map(lambda x: x['text'])
    # list of ids
    ids = query_data | 'get list of ids' >> beam.Map(lambda x: x['id'])
    (text
     | 'Batch elements' >> util.BatchElements(
         min_batch_size=args.batch_size, max_batch_size=args.batch_size)
     | 'Generate embeddings' >> beam.Map(
         generate_embeddings, args.module_url, args.random_projection_matrix)
     | 'Encode to tf example' >> beam.FlatMap(to_tf_example)
     | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
         file_path_prefix='{0}'.format(args.output_dir),
         file_name_suffix='.tfrecords')
    )
    query_data | 'Convert to entity and write to datastore' >> beam.Map(
        lambda input_features: create_entity(
            input_features, args.kind))
I altered generate_embeddings to return List[int], List[str], List[List[float]] and then used the following function to pass the lists of IDs and texts in:
def generate_embeddings_for_batch(batch, module_url, random_projection_matrix):
    embeddings = generate_embeddings(
        [x['id'] for x in batch], [x['text'] for x in batch],
        module_url, random_projection_matrix)
    return embeddings
Here I'll assume generate_embeddings has the signature List[str], ... -> (List[str], List[List[float]]).
What you want to do is avoid splitting your texts and IDs into separate PCollections, so you might want to write something like:
def generate_embeddings_for_batch(
        batch,
        module_url,
        random_projection_matrix) -> Iterator[Tuple[int, str, List[float]]]:
    # Embed only the texts; per the assumed signature this returns (texts, embeddings).
    texts, embeddings = generate_embeddings(
        [x['text'] for x in batch], module_url, random_projection_matrix)
    text_to_embedding = dict(zip(texts, embeddings))
    # Re-attach each row's id to its text and embedding.
    for x in batch:
        yield x['id'], x['text'], text_to_embedding[x['text']]
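For this to work, the batches handed to generate_embeddings_for_batch need to contain the full BigQuery rows (dicts with both 'id' and 'text'), so the pipeline would batch query_data directly rather than the text-only PCollection, and to_tf_example would then handle one (id, text, embedding) tuple at a time (sketched below). A rough, untested sketch of the relevant pipeline section, reusing the names from your snippet:
(
    query_data  # full rows, so each element still carries both 'id' and 'text'
    | 'Batch elements' >> util.BatchElements(
        min_batch_size=args.batch_size, max_batch_size=args.batch_size)
    | 'Generate embeddings' >> beam.FlatMap(
        generate_embeddings_for_batch, args.module_url, args.random_projection_matrix)
    | 'Encode to tf example' >> beam.Map(to_tf_example)
    | 'Write to TFRecords files' >> beam.io.WriteToTFRecord(
        file_path_prefix='{0}'.format(args.output_dir),
        file_name_suffix='.tfrecords')
)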
From there you should be able to write to_tf_example.
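For example, a per-element version could look roughly like this (a sketch under the assumptions above, not tested; it assumes the id is a string or convertible to one):
def to_tf_example(entry):
    # entry is one (id, text, embedding) tuple yielded by generate_embeddings_for_batch
    item_id, text, embedding = entry
    features = {
        'id': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[str(item_id).encode('utf-8')])),
        'text': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[text.encode('utf-8')])),
        'embedding': tf.train.Feature(
            float_list=tf.train.FloatList(value=list(embedding)))
    }
    return tf.train.Example(
        features=tf.train.Features(feature=features)).SerializeToString(deterministic=True)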
It would probably make sense to look at using TFX.
Related
For a 6-class sentence classification task, I have a list of sentences for which I retrieve the raw values (logits) before the softmax is applied. Example list of sentences:
s = ['I like the weather today', 'The movie was very scary', 'Love is in the air']
I get the values the following way:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    print(output.logits.detach().numpy())
# returns [[-0.8390876 2.9480567 -0.5134539 0.70386493 -0.5019671 -2.619496 ]]
#         [[-0.8847909 -0.9642067 -2.2108874 -0.43932158 4.3386173 -0.37383893]]
#         [[-0.48750368 3.2949197 2.1660519 -0.6453249 -1.7101991 -2.817954 ]]
How do I create a data frame with columns sentence, class_1, class_2, class_3, class_4, class_5, class_6, where I append each new sentence and its values, either iteratively or in a more optimal way? What would be the best way?
Expected output:
sentence class_1 class_2 class_3 ....
0 I like the weather today -0.8390876 2.9480567 -0.5134539 ....
1 The movie was very scary -0.8847909 -0.9642067 -2.2108874 ....
2 Love is in the air -0.48750368 3.2949197 2.1660519 ....
...
If I only had one sentence, I could transform it into a data frame like this, but I would still need to append the sentence somehow:
sentence = tokenizer("Love is in the air", return_tensors="pt")
output = model(sentence["input_ids"])
px = pd.DataFrame(output.logits.detach().numpy())
Maybe creating two separate data frames and then appending them would be one plausible way of doing this?
Save the model outputs in a list and then create the dataframe from a dict of columns:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import numpy as np
import pandas as pd
model_name = "Emanuel/bertweet-emotion-base"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
outputs = []
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    outputs.append(output.logits.detach().numpy()[0])
# convert to one numpy array
outputs = np.array(outputs)
# create dataframe
obj = {"sentence": s}
for class_id in range(outputs.shape[1]):
    # get the data column for that class
    obj[f"class_{class_id}"] = outputs[:, class_id].tolist()
df = pd.DataFrame(obj)
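A slightly more compact variant of the same idea lets pandas build the class columns in one go (numbering the columns from 1 here is an assumption to match your expected output):
df = pd.DataFrame(outputs, columns=[f"class_{i + 1}" for i in range(outputs.shape[1])])
df.insert(0, "sentence", s)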
I managed to come up with a solution and am posting it, as someone might find it useful.
The idea is to initialize an empty data frame and append the values for every sentence while iterating:
absolute_vals = pd.DataFrame()
for i in s:
    sentence = tokenizer(i, return_tensors="pt")
    output = model(sentence["input_ids"])
    px = pd.DataFrame(output.logits.detach().numpy())
    # DataFrame.append is deprecated (removed in pandas 2.0); pd.concat does the same job
    absolute_vals = pd.concat([absolute_vals, px], ignore_index=True)
# add the sentences as the first column so the result matches the table below
absolute_vals.insert(0, "sentence", s)
absolute_vals
Returns:
sentence class_1 class_2 class_3 ....
0 I like the weather today -0.8390876 2.9480567 -0.5134539 ....
1 The movie was very scary -0.8847909 -0.9642067 -2.2108874 ....
2 Love is in the air -0.48750368 3.2949197 2.1660519 ....
...
I am working on a project using mental-health-related subreddit posts, with two feature columns (text, title) and a label column (Subreddit).
I want to use an LSTM for classification, so I need to create an embedding matrix for both columns; in short, I need both columns for text classification, but I cannot find a way to embed both of them.
The code I am using for the text sequences is:
text_sequences_train = token.texts_to_sequences(preprocessed_text_train)
title_sequences_train = token.texts_to_sequences(preprocessed_title_train)
#print(sequences_train)
train=np.hstack(text_sequences_train+title_sequences_train)
train.reshape(1,train.shape[0])
train_seq_x=pad_sequences(train, maxlen=300)
text_sequences_test = token.texts_to_sequences(preprocessed_text_test)
title_sequences_test = token.texts_to_sequences(preprocessed_title_test)
#print(sequences_train)
test=np.hstack(text_sequences_test+title_sequences_test)
test.reshape(1,test.shape[0])
test_seq_x=pad_sequences(test, maxlen=300)
text_sequences_val = token.texts_to_sequences(preprocessed_text_val)
title_sequences_val = token.texts_to_sequences(preprocessed_title_val)
#print(sequences_train)
val=np.hstack(text_sequences_val+title_sequences_val)
val.reshape(1,val.shape[0])
val_seq_x=pad_sequences(val, maxlen=300)
The above code gives me an error:
ValueError: `sequences` must be a list of iterables. Found non-iterable: 428.0
The code I am using for the embedding matrix is:
glove_file = "glove.42B.300d.txt"
import tqdm
import numpy as np

EMBEDDING_VECTOR_LENGTH = 300  # <=200

def construct_embedding_matrix(glove_file, word_index):
    embedding_dict = {}
    with open(glove_file, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            # get the word
            word = values[0]
            if word in word_index.keys():
                # get the vector
                vector = np.asarray(values[1:], 'float32')
                embedding_dict[word] = vector
                #print(embedding_dict[word].shape)
    ### oov words (out of vocabulary words) will be mapped to 0 vectors
    num_words = len(word_index) + 1
    # initialize it to 0
    embedding_matrix = np.zeros((num_words, EMBEDDING_VECTOR_LENGTH))
    for word, i in tqdm.tqdm(word_index.items()):
        if i < num_words:
            vect = embedding_dict.get(word, [])
            if len(vect) > 0:
                embedding_matrix[i] = vect[:EMBEDDING_VECTOR_LENGTH]
                #print(embedding_matrix[i].shape)
    print(embedding_matrix)
    return embedding_matrix

embedding_matrix = construct_embedding_matrix(glove_file, word_index)
If I convert the text sequences and then do a train/test split, it gives an error saying that the numbers of samples in X and Y do not match.
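The ValueError most likely comes from flattening the sequences with np.hstack: texts_to_sequences returns a list of integer lists, and hstack collapses it into a flat array of scalars, which pad_sequences cannot handle. A minimal sketch of one way around it, padding each column separately and concatenating the padded arrays per row (the 250/50 lengths are arbitrary assumptions, and token is assumed to be your fitted Tokenizer):
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def encode_columns(texts, titles, text_len=250, title_len=50):
    # pad_sequences expects a list of integer lists, one per post
    text_seq = pad_sequences(token.texts_to_sequences(texts), maxlen=text_len)
    title_seq = pad_sequences(token.texts_to_sequences(titles), maxlen=title_len)
    # one row per post: padded text tokens followed by padded title tokens
    return np.concatenate([text_seq, title_seq], axis=1)

train_seq_x = encode_columns(preprocessed_text_train, preprocessed_title_train)
test_seq_x = encode_columns(preprocessed_text_test, preprocessed_title_test)
val_seq_x = encode_columns(preprocessed_text_val, preprocessed_title_val)
Alternatively, a two-input Keras model (one Embedding/LSTM branch per column, concatenated before the classifier) keeps the columns separate; the padding approach above is just the smaller change to the existing code.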
Suppose I have a TSV dataset file of conversations of arbitrary lengths, with the messages tab-separated and each line representing a full conversation:
Hi\tHow are you?\tIm doing well
This is a conversation?\tYes.\tHuh\tIts also a test
I would like to create a TensorFlow dataset from this containing all in-order prefix/next-message pairs of each conversation, like this (I'll separate the inputs and targets with \t and the individual messages with /b):
Hi\tHow are you?
Hi/bHow are you?\tIm doing well
This is a conversation?\tYes.
This is a conversation?/bYes.\tHuh
This is a conversation?/bYes./bHuh\tIts also a test
I'm essentially looking to implement this but in tensorflow datasets:
def convertline(text, max_length=20):
    text = text.split("\t")  # split the conversation by tabs
    inputs, targets = [], []  # create empty arrays for inputs and targets
    for y in range(1, len(text)):  # iterate through the split conversation
        x = y - max_length if y - max_length >= 0 else 0  # get the starting value; if it's negative, use 0 instead
        inputs.append("/b".join(text[x:y]))  # append to the inputs the current window, joined by /b
        targets.append(text[y])  # append the target
    return [{"inputs": inputs[i], "targets": targets[i]} for i in range(len(inputs))]  # zip them together in a dict of inputs and targets

with open("testfile.txt", "r") as f:  # open a file
    line = f.readline()  # read file line by line
    while line:
        print(convertline(line.strip()))  # run the function and print its results
        line = f.readline()
which returns:
[{'inputs': 'Hi', 'targets': 'How are you?'}, {'inputs': 'Hi/bHow are you?', 'targets': 'Im doing well'}]
[{'inputs': 'This is a conversation?', 'targets': 'Yes.'}, {'inputs': 'This is a conversation?/bYes.', 'targets': 'Huh'}, {'inputs': 'This is a conversation?/bYes./bHuh', 'targets': 'Its also a test'}]
This is what I have so far:
def dataset(split, shuffle_files=False):
    # Load lines from the text file as examples.
    ds = tf.data.TextLineDataset(nq_tsv_path[split])
    # Split each "<question>\t<answer>" example into (question, answer) tuple.
    # This definitely won't work, and is most likely where the code to generate sliding windows should be
    ds = ds.map(functools.partial(tf.io.decode_csv, record_defaults=["", ""],
                                  field_delim="\t", use_quote_delim=False),
                num_parallel_calls=tf.data.experimental.AUTOTUNE)
    # Map the dataset into dicts of questions and answers
    ds = ds.map(lambda *ex: dict(zip(["question", "answer"], ex)))
    return ds
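One possible direction (an untested sketch, mirroring convertline with TF string ops and the /b separator from the question): do the windowing inside a flat_map, so every conversation line expands into several {inputs, targets} examples. tf.strings.split on a scalar line gives a 1-D string tensor, and tf.map_fn can join each prefix window:
import tensorflow as tf

def line_to_examples(line, max_length=20):
    msgs = tf.strings.split(line, sep="\t")     # 1-D tensor of messages
    ys = tf.range(1, tf.shape(msgs)[0])         # index of each target message
    inputs = tf.map_fn(
        lambda y: tf.strings.reduce_join(       # join the preceding window with /b
            msgs[tf.maximum(y - max_length, 0):y], separator="/b"),
        ys, fn_output_signature=tf.string)
    targets = tf.gather(msgs, ys)
    return tf.data.Dataset.from_tensor_slices({"inputs": inputs, "targets": targets})

ds = tf.data.TextLineDataset("testfile.txt").flat_map(line_to_examples)
If something like this is used, the flat_map would replace both ds.map calls in dataset(), since it already splits the line and builds the dicts.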
I'm trying to loop through all files in a directory and add "indicator" data to them. I had the code working where I could select one file and do this, but now I am trying to make it work on all files. The problem is that when I make the loop, it says:
ValueError: Invalid file path or buffer object type: <class 'list'>
The goal is for each iteration of the loop to read another file from the list, make the changes, and save the file back to the folder.
Here is the complete code without imports. I copied one of the file paths from the list and put it in a comment at the bottom.
### open dialog to select file
#file_path = filedialog.askopenfilename()

###create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')

###append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [string + x for x in listdrs]
print(listdrs_path)

###start loop, for each "file" in listdrs run the 2 functions below and overwrite saved csv.
for file in listdrs_path:
    file_path = listdrs_path
    data = pd.read_csv(file_path, index_col=0)

    ########################################
    ####function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        #listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
        print(listdr)
        # Convert date to timestamp and make index
        data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
        data.drop("Date", axis=1, inplace=True)
        return data

    df = data
    ##print(data)

    ######Indicator data#####################
    def get_indicators(data):
        # Get MACD
        data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
        # Get MA10 and MA30
        data["ma10"] = talib.MA(data["Close"], timeperiod=10)
        data["ma30"] = talib.MA(data["Close"], timeperiod=30)
        # Get RSI
        data["rsi"] = talib.RSI(data["Close"])
        return data
    #####end functions#######

    data2 = get_indicators(data)
    print(data2)
    data2.to_csv(file_path)

###################################################
#here is an example of what a path from the list looks like
#'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/A.csv'
The problem is in the first two lines of your loop. Your filename is in the variable file, but you are reading from file_path, which you have assigned the whole file list; that is why you are getting the ValueError. Try this:
### open dialog to select file
#file_path = filedialog.askopenfilename()

###create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')

###append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [string + x for x in listdrs]
print(listdrs_path)

###start loop, for each "file" in listdrs run the 2 functions below and overwrite saved csv.
for file_path in listdrs_path:
    data = pd.read_csv(file_path, index_col=0)

    ########################################
    ####function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        #listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
        print(listdr)
        # Convert date to timestamp and make index
        data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
        data.drop("Date", axis=1, inplace=True)
        return data

    df = data
    ##print(data)

    ######Indicator data#####################
    def get_indicators(data):
        # Get MACD
        data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
        # Get MA10 and MA30
        data["ma10"] = talib.MA(data["Close"], timeperiod=10)
        data["ma30"] = talib.MA(data["Close"], timeperiod=30)
        # Get RSI
        data["rsi"] = talib.RSI(data["Close"])
        return data
    #####end functions#######

    data2 = get_indicators(data)
    print(data2)
    data2.to_csv(file_path)
Let me know if it helps.
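As a side note, building the file list with glob avoids the manual string concatenation and only picks up CSV files; a small sketch using the same directory and the get_indicators function from above:
import glob
import pandas as pd

stock_dir = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'

for file_path in glob.glob(stock_dir + '*.csv'):
    data = pd.read_csv(file_path, index_col=0)
    data2 = get_indicators(data)  # same get_indicators as above
    data2.to_csv(file_path)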
I'm currently trying to run a Dataflow (Apache Beam, Python SDK) task to import a >100 GB tweet file into BigQuery, but I am running into Error: Message: Too many sources provided: 15285. Limit is 10000.
The task takes the tweets (JSON), extracts 5 relevant fields, transforms/sanitizes them a bit with some transforms, and then writes those values into BigQuery, which will be used for further processing.
There is a similar question, Cloud Dataflow to BigQuery - too many sources, but it seems to be caused by having a lot of different input files, whereas I have a single input file, so it doesn't seem relevant. Also, the solutions mentioned there are rather cryptic, and I'm not sure if or how I could apply them to my problem.
My guess is that BigQuery writes temporary files for each row or something before persisting them, and that is what is meant by "too many sources"?
How can I fix this?
[Edit]
Code:
import argparse
import json
import logging
import apache_beam as beam
class JsonCoder(object):
    """A JSON coder interpreting each line as a JSON string."""
    def encode(self, x):
        return json.dumps(x)

    def decode(self, x):
        return json.loads(x)

def filter_by_nonempty_county(record):
    if 'county_fips' in record and record['county_fips'] is not None:
        yield record
def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--input',
                        default='...',
                        help=('Input twitter json file specified as: '
                              'gs://path/to/tweets.json'))
    parser.add_argument(
        '--output',
        required=True,
        help=('Output BigQuery table for results specified as: PROJECT:DATASET.TABLE '
              'or DATASET.TABLE.'))
    known_args, pipeline_args = parser.parse_known_args(argv)
    p = beam.Pipeline(argv=pipeline_args)

    # read text file
    # Read all tweets from given source file
    read_tweets = "Read Tweet File" >> beam.io.ReadFromText(known_args.input, coder=JsonCoder())

    # Extract the relevant fields of the source file
    extract_fields = "Project relevant fields" >> beam.Map(
        lambda row: {'text': row['text'],
                     'user_id': row['user']['id'],
                     'location': row['user']['location'] if 'location' in row['user'] else None,
                     'geo': row['geo'] if 'geo' in row else None,
                     'tweet_id': row['id'],
                     'time': row['created_at']})

    # check what type of geo-location the user has
    has_geo_location_or_not = "partition by has geo or not" >> beam.Partition(
        lambda element, partitions: 0 if element['geo'] is None else 1, 2)

    check_county_not_empty = lambda element, partitions: 1 if 'county_fips' in element and element['county_fips'] is not None else 0

    # tweet has coordinates partition or not
    coordinate_partition = (p
                            | read_tweets
                            | extract_fields
                            | beam.ParDo(TimeConversion())
                            | has_geo_location_or_not)

    # lookup by coordinates
    geo_lookup = (coordinate_partition[1]
                  | "geo coordinates mapping" >> beam.ParDo(BeamGeoLocator())
                  | "filter successful geo coords" >> beam.Partition(check_county_not_empty, 2))

    # lookup by profile
    profile_lookup = ((coordinate_partition[0], geo_lookup[0])
                      | "join streams" >> beam.Flatten()
                      | "Lookup from profile location" >> beam.ParDo(ComputeLocationFromProfile())
                      )

    bigquery_output = "write output to BigQuery" >> beam.io.Write(
        beam.io.BigQuerySink(known_args.output,
                             schema='text:STRING, user_id:INTEGER, county_fips:STRING, tweet_id:INTEGER, time:TIMESTAMP, county_source:STRING',
                             create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                             write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

    #file_output = "write output" >> beam.io.WriteToText(known_args.output, coder=JsonCoder())

    output = ((profile_lookup, geo_lookup[1])
              | "merge streams" >> beam.Flatten()
              | "Filter entries without location" >> beam.FlatMap(filter_by_nonempty_county)
              | "project relevant fields" >> beam.Map(
                  lambda row: {'text': row['text'],
                               'user_id': row['user_id'],
                               'county_fips': row['county_fips'],
                               'tweet_id': row['tweet_id'],
                               'time': row['time'],
                               'county_source': row['county_source']})
              | bigquery_output)

    result = p.run()
    result.wait_until_finish()
if __name__ == '__main__':
    logging.getLogger().setLevel(logging.DEBUG)
    run()
It's a little bit complicated, so it would probably take too much time to do it in BigQuery directly. The code reads the tweet JSON, splits the PCollection by whether each tweet is geotagged or not, tries to look up ungeotagged tweets via the profile location, maps the location to what's relevant for our GIS analysis, and then writes the result to BigQuery.
The number of files corresponds to the number of shards the elements were processed in.
One trick to reduce this is to generate some random keys and group the elements based on them before writing them out.
For example, you could use the following DoFn and PTransform in your pipeline:
import random

class _RoundRobinKeyFn(beam.DoFn):
    def __init__(self, count):
        self.count = count

    def start_bundle(self):
        # Start each bundle at a random offset so keys spread evenly across workers.
        self.counter = random.randint(0, self.count - 1)

    def process(self, element):
        self.counter += 1
        if self.counter >= self.count:
            self.counter -= self.count
        yield self.counter, element

class LimitBundles(beam.PTransform):
    def __init__(self, count):
        self.count = count

    def expand(self, input):
        # Parentheses are needed here so the chained "|" lines parse as one expression.
        return (input
                | beam.ParDo(_RoundRobinKeyFn(self.count))
                | beam.GroupByKey()
                | beam.FlatMap(lambda kv: kv[1]))
You would just use this before the bigquery_output:
output = (  # ...
    | LimitBundles(10000)
    | bigquery_output)
(Note that I just typed this in without testing it, so there are likely some Python typos.)