How to convert SCollection[String] to Seq[String] or List[String]? - spotify-scio

I want to convert SCollection[String] to Seq[String] or List[String].
For example, I have a variable called ids.
val ids: SCollection[String] = ~
ids.saveAsTextFile(pathToGCS)
When I save it to Cloud Storage, the contents of the text file are a table of IDs.
id1
id2
id2
I want to keep the contents of a file as Seq or List.
val seqOdIds: Seq[String] = ~

Not within the same job since Dataflow doesn't have the notion of a driver node like Spark to collect data from worker nodes. See https://spotify.github.io/scio/Scio%2C-Scalding-and-Spark.html#scio-and-spark
You can use the tap API to read file content after job completion. See
https://spotify.github.io/scio/examples/TapOutputExample.scala.html

Related

How i can read files from s3 using pyspark which is created after a particular time

I need to read json files from s3 using pyspark. The S3 location may contain hundreds of thousands of files. and every file have same metdata. But each time i need to read only the files that is created after a particular time. How i can do this?
If you have access to the system that creates these files, the simplest way to approach this would be to add a date partition when you write them:
s3://mybucket/myfolder/date=20210901/myfile1.json
s3://mybucket/myfolder/date=20210901/myfile1.json
s3://mybucket/myfolder/date=2021831/myfileA.json
And then you can read them with a filter; Pyspark will then only load the files that it needs into memory.
start_dt = '20210831'
end_dt = '20210901'
df = (
spark
.read
.json(path)
.filter(F.col("date").between(start_dt, end_dt))
)
Note that I have not explicitly tested this with JSON files, just with Parquet, so this method may need to be adapted.
If you don't have access to change how the files are written, I don't think Pyspark has direct access to the metadata of the files. Instead, you will want to query S3 directly using boto3 to generate a list of files, filter them using boto3 meta data, and then pass the list of files into the read method:
# generate this by querying via boto3
recent_files = ['s3://mybucket/file1.json', 's3://mybucket/file2.json']
df = spark.read.json(*recent_files)
Info about listing files from boto3.
You can provide modifiedAfter and modifiedBefore parameters to DataFrameReader.json function.
modifiedBefore an optional timestamp to only include files with
modification times occurring before the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
modifiedAfter an optional timestamp to only include files with
modification times occurring after the specified time. The provided timestamp must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
Example
from datetime import datetime
# Fill this variable with your last date
lowerbound = datetime(2021, 9, 1, 13, 0, 0)
# Current execution
upperbound = datetime.now()
df = spark.read.json(source_path,
modifiedAfter=lowerbound.strftime('%Y-%m-%dT%H:%M:%S'),
modifiedBefore=upperbound.strftime('%Y-%m-%dT%H:%M:%S'))
As noted in the discussion on Kafels' answer, modifiedBefore and modifiedAfter don't work with S3 as a data source. This is a real shame!
The next best alternative is to use boto3 to list all objects in the partition, and then filter the results on the lastModified element in the results. The results don't contain a creation timestamp so lastModified is the best you can do. You also need to be careful to handle pagination given the large number of objects.
Something like this should work to retrieve the matching keys:
import boto3
def get_matching_s3_keys(bucket, prefix="", after_date=None):
"""
List keys in an S3 bucket that match specified criteria.
:param bucket: Name of the S3 bucket.
:param prefix: Only get objects whose key starts with
this prefix
:param after_date: Only get objects that were last modified
after this date. Note: this needs to be a timezone-aware date
"""
paginator = s3.get_paginator("list_objects_v2")
kwargs = {'Bucket': bucket, 'Prefix': prefix}
for page in paginator.paginate(**kwargs):
try:
contents = page["Contents"]
except KeyError:
break
for obj in contents:
last_modified = obj["LastModified"]
if after_date is None or last_modified > after_date:
yield obj["Key"]

PySpark map function - send n rows instead of one to build a list

I am using Spark 3.x in Python. I have some data (in millions) in CSV files that I have to index in Apache Solr.
I have deployed pysolr module for this purpose
import pysolr
def index_module(row ):
...
solr_client = pysolr.Solr(SOLR_URI)
solr_client.add(row)
...
df = spark.read.format("csv").option("sep", ",").option("quote", "\"").option("escape", "\\").option("header", "true").load("sample.csv")
df.toJSON().map(index_module).count()
index_module module simply get one row of data frame as json and then index in Solr via pysolr module. Pysolr support to index list of documents instead of one. I have to update my logic so that instead of sending one document in each request, I'll send a list of document. Definatelty, it will improve the performance.
How can I achieve this in PySpark ? Is there any alternative or best approach instead of map and toJSON ?
Also, My all activities are completed in transformation functions. I am using count just to start the job. Is there any alternative dummy function (of action type) in spark to do the same?
Finally, I have to create Solr Object each time, is there any alternative for this ?

Can't figure out how to insert keys and values of nested JSON data into SQL rows with NiFi

I'm working on a personal project and very new (learning as I go) to JSON, NiFi, SQL, etc., so forgive any confusing language used here or a potentially really obvious solution. I can clarify as needed.
I need to take the JSON output from a website's API call and insert it into a table in my MariaDB local server that I've set up. The issue is that the JSON data is nested, and two of the key pieces of data that I need to insert are used as variable key objects rather than values, so I don't know how to extract it and put it in the database table. Essentially, I think I need to identify different pieces of the JSON expression and insert them as values, but I'm clueless how to do so.
I've played around with the EvaluateJSON, SplitJSON, and FlattenJSON processors in particular, but I can't make it work. All I can ever do is get the result of the whole expression, rather than each piece of it.
{"5381":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":74.0,"tm_def_snp":63.0,"temperature":58.0,"st_snp":8.0,"punts":4.0,"punt_yds":178.0,"punt_lng":55.0,"punt_in_20":1.0,"punt_avg":44.5,"humidity":47.0,"gp":1.0,"gms_active":1.0},
"1023":{"wind_speed":4.0,"tm_st_snp":26.0,"tm_off_snp":82.0,"tm_def_snp":56.0,"temperature":74.0,"off_snp":82.0,"humidity":66.0,"gs":1.0,"gp":1.0,"gms_active":1.0},
"5300":{"wind_speed":17.0,"tm_st_snp":27.0,"tm_off_snp":80.0,"tm_def_snp":64.0,"temperature":64.0,"st_snp":21.0,"pts_std":9.0,"pts_ppr":9.0,"pts_half_ppr":9.0,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl":4.0,"idp_sack":1.0,"idp_qb_hit":2.0,"humidity":100.0,"gp":1.0,"gms_active":1.0,"def_snp":23.0},
"608":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":53.0,"tm_def_snp":79.0,"temperature":88.0,"st_snp":4.0,"pts_std":5.5,"pts_ppr":5.5,"pts_half_ppr":5.5,"idp_tkl_solo":4.0,"idp_tkl_loss":1.0,"idp_tkl_ast":1.0,"idp_tkl":5.0,"humidity":78.0,"gs":1.0,"gp":1.0,"gms_active":1.0,"def_snp":56.0},
"3396":{"wind_speed":6.0,"tm_st_snp":20.0,"tm_off_snp":60.0,"tm_def_snp":70.0,"temperature":63.0,"st_snp":19.0,"off_snp":13.0,"humidity":100.0,"gp":1.0,"gms_active":1.0}}
This is a snapshot of an output with a couple thousand lines. Each of the numeric keys that you see above (5381, 1023, 5300, etc) are player IDs for the following stats. I have a table set up with three columns: Player ID, Stat ID, and Stat Value. For example, I need that first snippet to be inserted into my table as such:
Player ID Stat ID Stat Value
5381 wind_speed 4.0
5381 tm_st_snp 26.0
5381 tm_off_snp 74.0
And so on, for each piece of data. But I don't know how to have NiFi select the right pieces of data to insert in the right columns.
I believe that it's possible to use jolt to transform your json into a format:
[
{"playerId":"5381", "statId":"wind_speed", "statValue": 0.123},
{"playerId":"5381", "statId":"tm_st_snp", "statValue": 0.456},
...
]
then use PutDatabaseRecord with json reader.
Another approach is to use ExecuteGroovyScript processor.
Add new parameter to it with name SQL.mydb and link it to your DBCP controller service
And use the following script as Script Body parameter:
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder
def ff=session.get()
if(!ff)return
//read flow file content and parse it
def body = ff.read().withReader("UTF-8"){reader->
new JsonSlurper().parse(reader)
}
def results = []
//use defined sql connection to create a batch
SQL.mydb.withTransaction{
def cmd = 'insert into mytable(playerId, statId, statValue) values(?,?,?)'
results = SQL.mydb.withBatch(100, cmd){statement->
//run through all keys/subkeys in flow file body
body.each{pid,keys->
keys.each{k,v->
statement.addBatch(pid,k,v)
}
}
}
}
//write results as a new flow file content
ff.write("UTF-8"){writer->
new JsonBuilder(results).writeTo(writer)
}
//transfer to success
REL_SUCCESS << ff

How to evenly distribute data in apache pig output files?

I've got a pig-latin script that takes in some xml, uses the XPath UDF to pull out some fields and then stores the resulting fields:
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
Note that we're using pig-0.12.0 on our cluster, so I ripped the XPath/XMLLoader classes out of pig-0.14.0 and put them in my own jar so that I could use them in 0.12.
This above script works fine and produces the data that I'm looking for. However, it generates over 1,900 partfiles with only a few mbs in each file. I learned about the default_parallel option, so I set that to 128 to try and get 128 partfiles. I ended up having to add a piece to force a reduce phase to achieve this. My script now looks like:
set default_parallel 128;
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
forced_reduce = FOREACH (GROUP results BY RANDOM()) GENERATE FLATTEN(results);
store forced_reduce into '$output';
Again, this produces the expected data. Also, I now get 128 part-files. My problem now is that the data is not evenly distributed among the part-files. Some have 8 gigs, others have 100 mb. I should have expected this when grouping them by RANDOM() :).
My question is what would be the preferred way to limit the number of part-files yet still have them evenly-sized? I'm new to pig/pig latin and assume I'm going about this in the completely wrong way.
p.s. the reason I care about the number of part-files is because I'd like to process the output with spark and our spark cluster seems to do a lot better with a smaller number of files.
I'm still looking for a way to do this directly from the pig script but for now my "solution" is to repartition the data within the spark process that works on the output of the pig script. I use the RDD.coalesce function to rebalance the data.
From the first code snippet, I am assuming it is map only job since you are not using any aggregates.
Instead of using reducers, set the property pig.maxCombinedSplitSize
REGISTER udf-lib-1.0-SNAPSHOT.jar;
DEFINE XPath com.blah.udfs.XPath();
docs = LOAD '$input' USING com.blah.storage.XMLLoader('root') as (content:chararray);
results = FOREACH docs GENERATE XPath(content, 'root/id'), XPath(content, 'root/otherField'), content;
store results into '$output';
exec;
set pig.maxCombinedSplitSize 1000000000; -- 1 GB(given size in bytes)
x = load '$output' using PigStorage();
store x into '$output2' using PigStorage();
pig.maxCombinedSplitSize - setting this property will make sure each mapper reads around 1 GB data and above code works as identity mapper job, which helps you write data in 1GB part file chunks.

Apache Pig: Merging list of attributes into a single tuple

I receive data in the form
id1|attribute1a,attribute1b|attribute2a|attribute3a,attribute3b,attribute3c....
id2||attribute2b,attribute2c|..
I'm trying to merge it all into a form where I just have a bag of tuples of an id field followed by a tuple containing a list of all my other fields merged together.
(id1,(attribute1a,attribute1b,attribute2a,attribute3a,attribute3b,attribute3c...))
(id2,(attribute2b,attribute2c...))
Currently I fetch it like
my_data = load '$input' USING PigStorage(|) as
(id:chararray, attribute1:chararray, attribute2:chararray)...
then I've tried all combinations of FLATTEN, TOKENIZE, GENERATE, TOTUPLE, BagConcat, etc. to massage it into the form I want, but I'm new to pig and just can't figure it out. Can anyone help? Any open source UDF libraries are fair game.
Load each line as an entire string, and then use the features of the built-in STRPLIT UDF to achieve the desired result. This relies on there being no tabs in your list of attributes, and assumes that | and , are not to be treated any differently in separating out the different attributes. Also, I modified your input a little bit to show more edge cases.
input.txt:
id1|attribute1a,attribute1b|attribute2a|,|attribute3a,attribute3b,attribute3c
id2||attribute2b,attribute2c,|attribute4a|,attribute5a
test.pig:
my_data = LOAD '$input' AS (str:chararray);
split1 = FOREACH my_data GENERATE FLATTEN(STRSPLIT(str, '\\|', 2)) AS (id:chararray, attr:chararray);
split2 = FOREACH split1 GENERATE id, STRSPLIT(attr, '[,|]') AS attributes;
DUMP split2;
Output of pig -x local -p input=input.txt test.pig:
(id1,(attribute1a,attribute1b,attribute2a,,,attribute3a,attribute3b,attribute3c))
(id2,(,attribute2b,attribute2c,,attribute4a,,attribute5a))