Save a dataframe view after groupBy using pyspark - pandas

My homework is giving me a hard time with pyspark. I have this view of my "df2" after a groupBy:
df2.groupBy('years').count().show()
+-----+-----+
|years|count|
+-----+-----+
| 2003|11904|
| 2006| 3476|
| 1997| 3979|
| 2004|13362|
| 1996| 3180|
| 1998| 4969|
| 1995| 1995|
| 2001|11532|
| 2005|11389|
| 2000| 7462|
| 1999| 6593|
| 2002|11799|
+-----+-----+
Every attempt to save this (and then load it with pandas) gives me back the original source data text file I read with pyspark, with its original columns and attributes, only now as a .csv, but that's not the point.
What can I do to overcome this?
For what it's worth, I do not use any SparkContext functions at the beginning of the code, just plain "read" and "groupBy".

df2.groupBy('years').count().write.csv("sample.csv")
or
df3=df2.groupBy('years').count()
df3.write.csv("sample.csv")
Both of these will create sample.csv in your working directory. Note that Spark writes a directory named sample.csv containing part-*.csv files, not a single file.

You can assign the results to a new dataframe results, and then write that to a csv file. Note that there are two ways to output the csv: if you use Spark's writer, use .coalesce(1) to make sure only one part file is written; alternatively, convert with .toPandas() and use the to_csv() function of the pandas DataFrame.
results = df2.groupBy('years').count()
# writes a csv file "part-xxx.csv" inside a folder "results"
results.coalesce(1).write.csv("results", header=True)
# or if you want a csv file, not a csv file inside a folder (default behaviour of spark)
results.toPandas().to_csv("results.csv")
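To load either output back with pandas afterwards, something like this should work (a minimal sketch; the glob pattern assumes Spark's default part-file naming):
import glob
import pandas as pd

# Spark's writer produces a folder "results" containing part-*.csv files
parts = glob.glob("results/part-*.csv")
df_from_spark = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)

# the pandas writer produces a single plain file "results.csv"
df_from_pandas = pd.read_csv("results.csv", index_col=0)

print(df_from_spark.head())
print(df_from_pandas.head())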

Related

Postman Request body data from excel / csv file - forward slash

I get the request body data from an excel file.
I have already converted the excel file to csv format.
I have kind of been able to find a solution, but it is not working 100%: the jsonBody format is not fetching the data correctly and shows forward slashes in the csv data imported via the collection runner.
Request Body
{{jsonBody}}
Set global variable jsonBody
When I run the collection and select the data file (a csv file, as per the screenshot), the request body shows up with forward slashes.
After running the collection I'm getting an incorrect body with forward slashes.
The screenshot below shows the correct version of the csv data; I need to remove the forward slashes from the csv data.
I had a similar issue with Postman and realized my problem was more of a syntax issue.
Let's say your csv file has the following columns:
       userId | mid       | platform | type | ...etc
row1   94J4J  | 209444894 | NORTH    | PT   | ...
row2   324JE  | 934421903 | SOUTH    | MB   | ...
row3   966RT  | 158739394 | EAST     | PT   | ...
This is how you want your json request body to look:
{
    "userId": "{{userId}}",
    "mids": [{
        "mid": "{{mid}}",
        "platform": "{{platform}}"
    }],
    "type": ["{{type}}"],
    ...etc
}
Make sure your column names match the variables {{variableName}}.
The data coming from the CSV is already in a stringified format, so you don't need to do anything in the pre-request script.
Example: let the csv be
| jsonBody        |
| {"name":"user"} |
Now in the postman request just use:
{{jsonBody}}
since {{column_name}} is treated as a data variable, so in your case that is {{jsonBody}}.
Make sure you save this as a csv file, then use {{jsonBody}} in the request as shown above. (The original answer included screenshots of the csv file, the request, and the resulting output; if you want to add the json body as the value of another field, reference {{jsonBody}} inside that field in the same way.)
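If the stray slashes come from how the csv was exported from excel, one way to rule that out is to generate the file programmatically. A minimal Python sketch using the standard csv module (the file name and rows are made up for illustration):
import csv
import json

# example payloads; each one must end up as a single, properly quoted csv cell
payloads = [
    {"name": "user"},
    {"name": "admin"},
]

with open("data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["jsonBody"])  # header must match the {{jsonBody}} variable
    for payload in payloads:
        # csv.writer doubles inner quotes ("" rather than \"), which standard
        # csv readers, including Postman's collection runner, should parse back
        # into plain JSON
        writer.writerow([json.dumps(payload)])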

Add file name and timestamp into each record in BigQuery using Dataflow

I have a few .txt files with data in JSON to be loaded into a Google BigQuery table. Along with the columns in the text files I will need to insert the filename and the current timestamp for each row. This is in GCP Dataflow with Python 3.7.
I accessed the FileMetadata containing the filepath and size using GCSFileSystem.match and metadata_list.
I believe I need to get the pipeline code to run in a loop, pass the filepath to ReadFromText, and call a FileNameReadFunction ParDo.
(p
| "read from file" >> ReadFromText(known_args.input)
| "parse" >> beam.Map(json.loads)
| "Add FileName" >> beam.ParDo(AddFilenamesFn(), GCSFilePath)
| "WriteToBigQuery" >> beam.io.WriteToBigQuery(known_args.output,
write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
)
I followed the steps in Dataflow/apache beam - how to access current filename when passing in pattern? but I can't make it quite work.
Any help is appreciated.
You can use textio.ReadFromTextWithFilename instead of ReadFromText. That will produce a PCollection of (filename,line) tuples.
To include the file and timestamp in your output json record, you could change your "parse" line to the following (note that beam.Map is capitalized, and tuple unpacking in lambda parameters is not valid in Python 3, so the tuple is indexed instead; datetime needs to be imported):
| "parse" >> beam.Map(lambda kv: {
    **json.loads(kv[1]),
    "filename": kv[0],
    "timestamp": datetime.now()})  # you may need to format this, e.g. .isoformat(), depending on your schema

Writing integer/string to a text file in pyspark from a cluster

I am using EMR step functions to analyze data.
I wanted to store the count of the analyzed dataframe to decide whether I can save it as a csv or as parquet. I would prefer csv, but if the size is too big, I won't be able to download it and use it on my laptop.
I used the count() method to store it in an int variable limit.
When I try using the following code:
coalesce(1).write.format("text").option("header", "false").mode("overwrite").save("output.txt")
It says that:
int doesn't have any attribute called write
Is there a way to write an integer or a string to a file so that I can open it in my s3 bucket and inspect it after the EMR step has run?
Update:
I tried the dataframe method as suggested by @Shu, but I am getting the following error.
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task 0 in stage 13.0 failed 4 times, most recent failure:
Lost task 0.3 in stage 13.0 (TID 19396, ip-10-210-13-34.ec2.internal,
executor 11): org.apache.spark.SparkException: Task failed while
writing rows. at
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
at
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
What could be the root cause of this?
You can parallelize the int variable to create an rdd, then write it to HDFS using .saveAsTextFile:
df.show()
#+---+
#| _1|
#+---+
#| a|
#| b|
#+---+
limit=df.count()
spark.sparkContext.parallelize([limit]).coalesce(1).saveAsTextFile("<path>")
#content of file
#cat <path>/part-00000
#2
Another way would be to create a dataframe from the count variable, then write it in csv format with header false.
from pyspark.sql.types import *
spark.createDataFrame(spark.sparkContext.parallelize([limit]),IntegerType()).coalesce(1).write.format("csv").option("header", "false").mode("overwrite").save("<path>")
#or in text format (the text writer needs a single string column, so cast the count to str;
#passing the raw int with StringType can fail inside the task with "Task failed while writing rows")
spark.createDataFrame(spark.sparkContext.parallelize([str(limit)]),StringType()).coalesce(1).write.format("text").mode("overwrite").save("<path>")
#cat part-*
#2
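Since the goal is only to inspect a single number after the EMR step runs, another option (a sketch, assuming boto3 is available on the driver and the cluster role can write to the bucket; the bucket and key names are placeholders) is to skip Spark's writers and upload the value directly to S3:
import boto3

limit = df.count()

# upload the count as a small plain-text object directly to S3
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-bucket",            # placeholder bucket name
    Key="analysis/row_count.txt",  # placeholder key
    Body=str(limit).encode("utf-8"),
)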

Add Custom Header in Pentaho

I have data that looks like the following:
assetnum | assetdesc
123 | sampledesc
432 | sample desc2
I want to insert another row with four fields so it looks like the following:
SYSNAME | OBJSTRUC | AddChange | En
assetnum | assetdesc
123 | sampledesc
432 | sample desc2
However I am unsure how to do this. Does anyone know how?
I have tried generating rows, but I am unsure how to merge them so that it looks like this. I have also thought of adding headers, but I am unsure how to specify the header (without it being created automatically). I am quite new to Pentaho.
Thanks.
Here is a hack. Assume StepA writes the actual data into a file fileA. Before writing anything into fileA, add a Text file output step and, in its Content tab, fill the "Add ending line of file" field with the custom row you need to insert. Since the file is empty at that point, this ending line becomes the first line of the file. Once that is done, write the other data from your original source with the Append flag set. To set the dependency, use "Block until steps finish" to hold back the actual write in StepA.
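Outside Pentaho, the trick is simply "write the custom header line first, then append the real data"; a minimal Python sketch of the same idea (the file name and rows are made up for illustration):
# hypothetical output file; the first write puts the custom header row in place
header = "SYSNAME | OBJSTRUC | AddChange | En"
rows = [
    "assetnum | assetdesc",
    "123 | sampledesc",
    "432 | sample desc2",
]

with open("fileA.txt", "w") as f:
    f.write(header + "\n")

# later writes append, like Pentaho's Append flag on the Text file output step
with open("fileA.txt", "a") as f:
    f.write("\n".join(rows) + "\n")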

Read a file from a position in Robot Framework

How can I read a file from a specific byte position in Robot Framework?
Let's say I have a process running for a long time writing a long log file. I want to get the current file size, then I execute something that affects the behaviour of the process and I wait until some message appears in the log file. I want to read only the portion of the file starting from the previous file size.
I am new to Robot Framework. I think this is a very common scenario, but I haven't found how to do it.
There are no built-in keywords to do this, but writing one in python is pretty simple.
For example, create a file named "readmore.py" with the following:
from robot.libraries.BuiltIn import BuiltIn

class readmore(object):
    ROBOT_LIBRARY_SCOPE = "TEST SUITE"

    def __init__(self):
        self.fp = {}

    def read_more(self, path):
        # if we don't already know about this file,
        # set the file pointer to zero
        if path not in self.fp:
            BuiltIn().log("setting fp to zero", "DEBUG")
            self.fp[path] = 0

        # open the file, move the pointer to the stored
        # position, read the file, and reset the pointer
        with open(path) as f:
            BuiltIn().log("seeking to %s" % self.fp[path], "DEBUG")
            f.seek(self.fp[path])
            data = f.read()
            self.fp[path] = f.tell()
            BuiltIn().log("resetting fp to %s" % self.fp[path], "DEBUG")
            return data
You can then use it like this:
*** Settings ***
| Library | readmore.py
| Library | OperatingSystem
*** test cases ***
| Example of "tail-like" reading of a file
| | # read the current contents of the file
| | ${original}= | read more | /tmp/junk.txt
| | # do something to add more data to the file
| | Append to file | /tmp/junk.txt | this is new content\n
| | # read the new data
| | ${new}= | Read more | /tmp/junk.txt
| | Should be equal | ${new.strip()} | this is new content