Read streaming data from s3 using pyspark - numpy

I would like to leverage Python for its simple text parsing and functional programming capabilities, and to tap into its rich offering of scientific computing libraries such as numpy and scipy, so I would like to use pyspark for a task.
The task I am looking to perform at the outset is to read from a bucket to which text files are being written as part of a stream. Could someone post a code snippet showing how to read streaming data from an S3 path using pyspark? Until recently I thought this could only be done with Scala and Java, but I just found out that streaming is supported in pyspark from Spark 1.2 onwards. However, I am unsure whether S3 streaming is supported.
The way I used to do it in Scala was to read the data in as a HadoopTextFile, I think, and to use configuration parameters to set the AWS key and secret. How would I do something similar in pyspark?
Any help would be much appreciated.
Thanks in advance.

Check the "Basic Sources" section in the documentation: https://spark.apache.org/docs/latest/streaming-programming-guide.html
I believe you want something like
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext('local[2]', 'my_app')     # two local threads: one receiver, one for processing
ssc = StreamingContext(sc, 1)               # 1-second batch interval
stream = ssc.textFileStream('s3n://...')    # picks up new files created under this path
stream.pprint()                             # replace with your own processing

ssc.start()
ssc.awaitTermination()
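If you also need to pass the AWS key and secret from pyspark, one common (if unofficial) approach is to set them on the underlying Hadoop configuration before creating the stream. A rough sketch, with placeholder credential values; the exact config keys depend on which connector you use (s3n vs s3a), so check them against your Hadoop version:

# Hedged sketch: set AWS credentials on the Hadoop configuration used by the
# S3 filesystem connector. Note that sc._jsc is not a public API.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3n.awsAccessKeyId', 'YOUR_ACCESS_KEY')        # placeholder
hadoop_conf.set('fs.s3n.awsSecretAccessKey', 'YOUR_SECRET_KEY')    # placeholder
# For the s3a connector the keys would be 'fs.s3a.access.key' / 'fs.s3a.secret.key'.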

Related

Generate sas7bdat files from a pandas dataframe

I would like to know if there is a Python library that supports this conversion. The options I have found so far are SASpy, CSV, or a SQL database, but I was unsuccessful with them.
This is not really a programming question but hope it won't be an issue.
I've found this post:
Export pandas dataframe to SAS sas7bdat format
But I was hoping to find updates on newer libraries that support creating sas7bdat files, and to learn how licensing works for SASpy.
The sas7bdat format is very hard to write. Reading it is fairly doable (though still pretty hard), but writing it is brutal. SAS costs a LOT of money and cannot be purchased (it is leased). My suggestions:
Use one of the products from companies that have done it. Some examples: CoyRoc (SSIS adaptor) $, StatTransfer $, SPSS $$$, SAS (lots of dollar signs). WPS might be able to do it, but they save to their own format to avoid the mess; they probably also support sas7bdat export.
Do not use the sas7bdat format. Consider something else, like the SAS Transport (XPORT) format. Look at my GitHub repository (savian-net) for C# code that can do it. Translate it to Python or find a Python library that can handle SAS Transport (see the sketch after this answer).
The sas7bdat format is a binary, proprietary format that is 100% not published anywhere. Any docs are guesses based on binary sleuthing. It is based on an old mainframe format, and likely remnants of that appear to be included. My suggestion is to avoid it like the plague and find an alternative.
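If you go the SAS Transport route from Python, one library that can write XPORT files from a pandas DataFrame is pyreadstat. This is my own suggestion rather than something the answer above names, so treat it as a minimal sketch and check the library's docs for your version:

import pandas as pd
import pyreadstat  # assumption: a pyreadstat version with XPORT write support is installed

df = pd.DataFrame({'id': [1, 2, 3], 'score': [90.5, 82.0, 77.5]})  # toy data

# Write a SAS Transport (XPORT) file; SAS can read this format directly.
pyreadstat.write_xport(df, 'example.xpt')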
An alternative to using xport as Stu suggested: as of Viya 2021.2.6, SAS supports reading externally generated parquet files via the new parquet import engine. As such, you could export the file to parquet from Python, then import it directly into SAS and save it as a .sas7bdat file.
https://communities.sas.com/t5/SAS-Communities-Library/Parquet-Support-in-SAS-Compute-Server/ta-p/811733
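For the Python side of that workflow, exporting a DataFrame to parquet is a one-liner with pandas. This sketch assumes pyarrow or fastparquet is installed as the parquet engine, and the file name is just an example:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})  # toy data

# Write a parquet file that SAS Viya 2021.2.6+ can import directly.
df.to_parquet('example.parquet', index=False)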

determine written object paths with Pyspark 3.2.1 + hadoop 3.3.2

When writing dataframes to S3 using the s3a connector, there seems to be no official way of determining the object paths on S3 that were written in the process. What I am trying to achieve is simply to determine which objects have been written when writing to S3 (using pyspark 3.2.1 with Hadoop 3.3.2 and the directory committer).
The reason this might be useful:
partitionBy might add a dynamic number of new paths
Spark creates its own "part-..." parquet files with cryptic names, and their number depends on the partitions at write time
With pyspark 3.1.2 and Hadoop 3.2.0 it used to be possible to use the not officially supported "_SUCCESS" file, which was written at the path before the first partitioning on S3 and contained the paths of all written files. Now, however, the number of paths seems to be limited to 100, so this is not an option anymore.
Is there really no official, reasonable way of achieving this task?
Now, however, the number of paths seems to be limited to 100, so this is not an option anymore.
We had to cut that in HADOOP-16570 - it was one of the scale problems that surfaced during terasorting at 10-100 TB, where the time to write the _SUCCESS file started to slow down job commit times. It was only ever intended for testing. Sorry.
The limit is just a constant in the source tree. If you were to provide a patch to make it configurable, I'll be happy to review and merge it, provided you follow the "say which AWS endpoint you ran all the tests against or we ignore your patch" policy.
I don't know where else this information is collected. The Spark driver is told the number of files and their total size from each task commit, but it isn't given the list of files by the tasks, not AFAIK.
Spark creates its own "part-..." parquet files with cryptic names, and their number depends on the partitions at write time
The part-0001- bit of the filename comes from the task ID; the bit afterwards is a UUID created to ensure every filename is unique - see SPARK-8406 ("Adding UUID to output file name to avoid accidental overwriting"). You can probably turn that off.
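Given that the _SUCCESS manifest is capped, one workaround (not an official API) is to list the objects under the output prefix with boto3 after the write finishes and treat everything there, minus marker files, as the written paths. A rough sketch with hypothetical bucket and prefix names:

import boto3  # assumption: listing the output prefix after the write is acceptable for your job

s3 = boto3.client('s3')
bucket, prefix = 'my-bucket', 'output/table/'   # hypothetical values

written = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if not key.endswith('_SUCCESS'):        # skip the committer's marker file
            written.append(f's3a://{bucket}/{key}')

print(written)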

How to load Stata file (.dta) in SQL?

I am a Stata user and new to SQL, so being able to read Stata files directly would be a big step. Please let me know if this is doable.
You can either use ODBC as Wouter suggested, or go through Python and use the pandas library.
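A minimal sketch of the pandas route, with hypothetical file, table, and connection names (SQLAlchemy provides the database connection):

import pandas as pd
from sqlalchemy import create_engine

# Read the Stata file into a DataFrame.
df = pd.read_stata('survey.dta')                 # hypothetical file name

# Write it to a SQL table (SQLite here; swap the URL for your database).
engine = create_engine('sqlite:///survey.db')    # hypothetical connection string
df.to_sql('survey', engine, if_exists='replace', index=False)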

GUI tools for viewing/editing Apache Parquet

I have some Apache Parquet files. I know I can execute parquet file.parquet in my shell and view it in the terminal, but I would like a GUI tool to view Parquet files in a more user-friendly format. Does such a program exist?
Check out this utility. It works for all Windows versions: https://github.com/mukunku/ParquetViewer
There is the Tad utility, which is cross-platform. It allows you to open Parquet files, pivot them, and export to CSV. It uses DuckDB as its backend; more info is on the DuckDB page.
GitHub here:
https://github.com/antonycourtney/tad
Actually, I found a Windows 10-specific solution. However, I'm working on Linux Mint 18, so I would like a Linux (or ideally cross-platform) GUI tool. Is there some other GUI tool?
https://www.channels.elastacloud.com/channels/parquet-net/how-about-viewing-parquet-files
There is a GUI tool to view Parquet and also other binary-format data like ORC and AVRO. It's a pure Java application, so it can run on Linux, Mac, and Windows. Please check Bigdata File Viewer for details.
It supports complex data types like array, map, struct, etc., and you can save the opened file in CSV format.
GUI option for Windows, Linux, and Mac
You can now use DBeaver to
view parquet data
view metadata and statistics
run SQL queries on one or multiple files (supports glob expressions)
generate new parquet files
DBeaver leverages the DuckDB driver to perform operations on parquet files. Features like projection and predicate pushdown are also supported by DuckDB.
Simply create an in-memory instance of DuckDB using DBeaver and run queries like those mentioned in this document. Right now, Parquet and CSV are supported.
Here is a YouTube video that explains the same: https://youtu.be/j9_YmAKSHoA
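If you prefer to stay in Python rather than a GUI, the same DuckDB engine that DBeaver uses can query parquet files directly. A small sketch with a hypothetical file name:

import duckdb  # assumption: the duckdb Python package is installed

con = duckdb.connect()  # in-memory instance, like the one DBeaver creates

# DuckDB can read parquet files directly in SQL; glob patterns also work.
df = con.execute("SELECT * FROM 'data.parquet' LIMIT 10").fetchdf()
print(df)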
JetBrains (IntelliJ, PyCharm, etc.) has a plugin for this, if you have a Professional version: https://plugins.jetbrains.com/plugin/12494-big-data-tools

Python interaction in Pandas documentation

In this document http://pandas.pydata.org/pandas-docs/stable/pandas.pdf the Python interaction is done very nicely.
Where are the LaTeX sources, so I can see how this is done?
The docs are generated using Sphinx.
You can see how by reading the make.py file in the pandas GitHub repository.