Spark Java - how to convert a non-delimited file into a Dataset in Spark Java - apache-spark-sql

I need to read a non-delimited (fixed-width) file and convert it into a Dataset in Spark Java. The column names come from a CSV file, and each line has to be split based on the size of each attribute and mapped to those column names. Please suggest how to do this in Spark Java.

I cannot write this in Java as I use Scala, but the idea carries over: you can foldLeft over the field widths, applying successive substring or slice operations, or do it without foldLeft at all.
An example in Scala which you can convert - this is the simpler option:
import org.apache.spark.sql.functions._
import spark.implicits._
// Cols for renaming.
val list = List("C1", "C2", "C3")
// Gen some data.
val df = Seq(
  ("C1111sometext999"),
  ("C2222sometext888")
).toDF("data")
// "heavy" lifting.
val df2 = df.selectExpr("substring(data, 0, 5)", "substring(data, 6,8)", "substring(data, 14,3)")
// Rename from list. Can also do "as Cn" in selectExpr.
val df3 = df2.toDF(list:_*)
df3.show
returns:
+-----+--------+---+
| C1| C2| C3|
+-----+--------+---+
|C1111|sometext|999|
|C2222|sometext|888|
+-----+--------+---+
You will then have to cast the columns to the appropriate types.
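For readers coming from PySpark, a minimal sketch of the same selectExpr approach (the positions are assumed to match the sample widths above; note substring positions are 1-based in Spark SQL). In Java you would make the equivalent selectExpr calls on a Dataset<Row>:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample fixed-width lines, mirroring the Scala example above.
df = spark.createDataFrame([("C1111sometext999",), ("C2222sometext888",)], ["data"])

# Slice each line by attribute width and rename in one go.
df2 = df.selectExpr(
    "substring(data, 1, 5) as C1",
    "substring(data, 6, 8) as C2",
    "substring(data, 14, 3) as C3",
)
df2.show()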

Related

How to remove stop words from dataframe, pyspark or sql?

So for instance I have this dataframe.
data = Seq(("Novelist Sparks turns screenwriter with this film, which combines his usual themes (beaches, grieving teens, cancer) as a vehicle for Cyrus to put her childhood career behind her. It's exactly what we expect, but it's also fairly watchable."))
df = data.toDF("sentence")
What I want to do is remove all stop words from this row or column value and count the words after removing stop words.
pyspark or sql code example, both are good.
We made the Fugue project to port native Python or Pandas code to Spark or Dask. This lets you keep the logic very readable by expressing it in native Python, and Fugue then ports it to Spark for you with one function call.
First we start with a test Pandas DataFrame:
import pandas as pd
df = pd.DataFrame({"id": [1,2,3], "sentences": ["this is sentence one.", "this is sentence two.", "this is sentence three"]})
Then we create a pandas-based function to handle the stop words:
def process(df: pd.DataFrame) -> pd.DataFrame:
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    stop = set(stopwords.words('english'))
    df['processed_sentences'] = df['sentences'].apply(
        lambda x: ' '.join([word for word in x.split() if word not in stop]))
    return df
If you want to count the words, just add the count as another column inside the same function.
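For example, a minimal sketch (word_count is a hypothetical column name; it counts the words left after stop-word removal):
def process(df: pd.DataFrame) -> pd.DataFrame:
    from nltk.corpus import stopwords
    stop = set(stopwords.words('english'))
    df['processed_sentences'] = df['sentences'].apply(
        lambda x: ' '.join([word for word in x.split() if word not in stop]))
    # Count the remaining words in each processed sentence.
    df['word_count'] = df['processed_sentences'].apply(lambda s: len(s.split()))
    return df
If you do add the extra column, the schema string passed to transform has to mention it as well (something like "*, processed_sentences:str,word_count:int"). Now we can bring the original process function to Fugue and test it: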
from fugue import transform
transform(df, process, schema="*, processed_sentences:str")
Now that we see it works, we can use it on Spark by specifying the engine:
import fugue_spark
transform(df, process, schema="*, processed_sentences:str", engine="spark").show()
Note .show() is needed because of Spark's lazy evaluation.
The output is:
+---+--------------------+-------------------+
| id| sentences|processed_sentences|
+---+--------------------+-------------------+
| 1|this is sentence ...| sentence one.|
| 2|this is sentence ...| sentence two.|
| 3|this is sentence ...| sentence three|
+---+--------------------+-------------------+
The Fugue transform function can take in a Pandas DataFrame or Spark DataFrame, and it will output a Spark DataFrame if you are using the Spark engine.
I think you have to import the nltk.corpus inside the function so that it is executed on the workers rather than the driver. You need nltk installed on the workers because they need access to stopwords.words.

How to take input from pandas.dataFrame in Apache Beam Pipeline

I am trying to feed a pandas dataframe into an Apache Beam pipeline and write it to GCS. Without Dataflow/Apache Beam, I am able to write the dataframe data to GCS, but now Dataflow is in the picture.
def database_to_gcs(self, type='full'):
    if type == 'full':
        with open(self.tablemetadata, 'r') as fr:
            next(fr)
            self.clear_directory()
            argv = [
                '--project={0}'.format(self.project_name),
                '--job_name=One',
                '--save_main_session',
                '--staging_location=gs://{0}/staging/'.format(self.bucket_name),
                '--temp_location=gs://{0}/staging/'.format(self.bucket_name),
                '--runner=DataflowRunner'
            ]
            p = beam.Pipeline(argv=sys.argv)
            for line in fr:
                table_name, primary_key = line.split(',')
                self.cur.execute("SELECT * FROM " + table_name)
                df = pd.DataFrame(list(self.cur))
                dictionary = df.to_dict('split')
                print(dictionary)
                input_dataframe = df
                output_path = 'gs://{0}/output/{1}/{2}/{3}'.format(self.bucket_name,
                                                                   table_name,
                                                                   str(datetime.now().date()),
                                                                   str(datetime.now()) + "_" + table_name + '.csv')
                (p
                 | 'ReadDataframe' >> beam.io.ReadFromText(input_dataframe)
                 | 'WriteToFile' >> beam.io.Write(output_path)
                 )
            p.run()
Beam provides the ParDo transform, where you can write arbitrary Python code that operates on input elements. So consider writing a DoFn that takes the lines of text read from the input file and generates dataframes. You can either process these dataframes in the same ParDo or feed them to a secondary ParDo where you do the processing. I don't think Beam currently has any utility transforms for handling pandas dataframes, even though this has been discussed several times.
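As a rough sketch of that approach (the GCS paths are placeholders, query_table is a hypothetical helper standing in for the database access, and runner options are omitted):
import apache_beam as beam

class TableLineToCsvRows(beam.DoFn):
    # Each input element is one line of the table-metadata file; query the
    # corresponding table and emit its rows as CSV-formatted strings.
    def process(self, line):
        table_name, primary_key = line.split(',')
        for row in query_table(table_name):  # query_table is an assumed helper
            yield ','.join(str(value) for value in row)

with beam.Pipeline() as p:
    (p
     | 'ReadTableList' >> beam.io.ReadFromText('gs://my-bucket/tablemetadata.csv',
                                               skip_header_lines=1)
     | 'QueryAndFormat' >> beam.ParDo(TableLineToCsvRows())
     | 'WriteToGCS' >> beam.io.WriteToText('gs://my-bucket/output/rows'))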
For anyone reading this old question: Beam no longer supports Python 2.x, but there is now DataFrame support in apache_beam.dataframe.io.

How do I convert multiple Pandas DFs into a single Spark DF?

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()
for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)
def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])
processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).
This can be done by converting the pandas DataFrames to Arrow RecordBatches, which Spark > 2.3 can turn into a DataFrame very efficiently.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches spark to include a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts an RDD of pandas DFs (assuming they all share the same columns) and returns a single Spark DF.
I solved this by writing a function like this:
from pyspark.sql import Row

def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]

    rows = list()
    for index, series in pd_df.iterrows():
        # Take a row of the pandas df, export it as a dict, then pass the unpacked dict into the Row constructor
        row_dict = {str(k): v for k, v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now contains a collection of Spark Row objects. You can now say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series -> dict -> Row operation, but this got me through.
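For instance, a sketch (untested against the asker's data) that skips the per-row Series conversion by using DataFrame.to_dict('records'):
from pyspark.sql import Row

def pd_df_to_rows(rdd_row):
    # rdd_row is a (file path, pandas DataFrame) tuple; emit one Row per record.
    _, pd_df = rdd_row
    return [Row(**{str(k): v for k, v in record.items()})
            for record in pd_df.to_dict('records')]

spark_df = processed_excel_rdd.flatMap(pd_df_to_rows).toDF()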
Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]
sdf = None
for df in dfs:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]
sdf = None
for name in names:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))
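Equivalently, a compact sketch of the same idea using functools.reduce (assuming, as in the question, an active spark session and a file_list of Excel paths):
from functools import reduce
import pandas as pd

# Convert each Excel file to a Spark DataFrame on the driver, then union them all.
sdfs = [spark.createDataFrame(pd.read_excel(path)) for path in file_list]
sdf = reduce(lambda a, b: a.union(b), sdfs)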

How to concat multiple pandas dataframes into one dask dataframe larger than memory?

I am parsing tab-delimited data to create tabular data, which I would like to store in an HDF5.
My problem is that I have to aggregate the data into one format and then dump it into HDF5. This is ~1 TB of data, so I naturally cannot fit it into RAM. Dask might be the best way to accomplish this task.
If I were parsing my data into one pandas dataframe, I would do it like this:
import pandas as pd
import csv
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"]
readcsvfile = csv.reader(csvfile)
total_df = pd.DataFrame() # create empty pandas DataFrame
for i, line in enumerate(readcsvfile):
    # parse and create a dictionary of key:value pairs by table field:value, "dictionary_line"
    # save the dictionary as a pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i]) # one line of tabular data
    total_df = pd.concat([total_df, df]) # creates one big dataframe
Using dask to do the same task, it appears users should try something like this:
import pandas as pd
import csv
import dask.dataframe as dd
import dask.array as da
csv_columns = ["COL1", "COL2", "COL3", "COL4",..., "COL55"] # define columns
readcsvfile = csv.reader(csvfile) # read in file, if csv
# somehow define an empty dask dataframe: total_df = dd.DataFrame()?
for i, line in enumerate(readcsvfile):
    # parse and create a dictionary of key:value pairs by table field:value, "dictionary_line"
    # save the dictionary as a pandas dataframe
    df = pd.DataFrame(dictionary_line, index=[i]) # one line of tabular data
    total_df = da.concatenate([total_df, df]) # creates one big dataframe
After creating a ~TB dataframe, I will save into hdf5.
My problem is that total_df does not fit into RAM, and must be saved to disk. Can dask dataframe accomplish this task?
Should I be trying something else? Would it be easier to create an HDF5 from multiple dask arrays, i.e. each column/field a dask array? Maybe partition the dataframes among several nodes and reduce at the end?
EDIT: For clarity, I am actually not reading directly from a csv file. I am aggregating, parsing, and formatting tabular data. So, readcsvfile = csv.reader(csvfile) is used above for clarity/brevity, but it's far more complicated than reading in a csv file.
Dask.dataframe handles larger-than-memory datasets through laziness. Appending concrete data to a dask.dataframe will not be productive.
If your data can be handled by pd.read_csv
The pandas.read_csv function is very flexible. You say above that your parsing process is very complex, but it might still be worth looking into the options for pd.read_csv to see if it will still work. The dask.dataframe.read_csv function supports these same arguments.
In particular, if the concern is that your data is separated by tabs rather than commas, this isn't an issue at all. Pandas supports a sep='\t' keyword, along with a few dozen other options.
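For example, a quick sketch (assuming a tab-delimited file named myfile.tsv and the 55 columns defined in the question):
import dask.dataframe as dd

# csv_columns is the list of 55 column names from the question.
df = dd.read_csv('myfile.tsv', sep='\t', names=csv_columns)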
Consider dask.bag
If you want to operate on textfiles line-by-line then consider using dask.bag to parse your data, starting as a bunch of text.
import dask.bag as db
b = db.read_text('myfile.tsv', blocksize=10000000) # break into 10MB chunks
records = b.str.split('\t').map(parse)
df = records.to_dataframe(columns=...)
Write to HDF5 file
Once you have a dask.dataframe, try the .to_hdf method:
df.to_hdf('myfile.hdf5', '/df')
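If the parsing really cannot be expressed through read_csv or dask.bag, one more option (a sketch only; parse_chunk and chunk_paths are hypothetical stand-ins for your own chunk-level parser and input pieces) is to wrap the parser in dask.delayed and stitch the lazy pieces together with dd.from_delayed:
import dask
import dask.dataframe as dd
import pandas as pd

csv_columns = ["COL1", "COL2", "COL3"]  # ... through COL55, as in the question

@dask.delayed
def parse_chunk(path):
    # Stand-in for the real parsing: build one pandas DataFrame per input chunk.
    with open(path) as f:
        records = [dict(zip(csv_columns, line.rstrip('\n').split('\t'))) for line in f]
    return pd.DataFrame(records, columns=csv_columns)

chunk_paths = ['chunk_000.tsv', 'chunk_001.tsv']  # hypothetical input pieces
df = dd.from_delayed([parse_chunk(p) for p in chunk_paths])
df.to_hdf('myfile.hdf5', '/df')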

Pandas Dataframe to RDD

Can I convert a Pandas DataFrame to RDD?
if isinstance(data2, pd.DataFrame):
    print 'is Dataframe'
else:
    print 'is NOT Dataframe'
is DataFrame
Here is the output when trying to use .rdd
dataRDD = data2.rdd
print dataRDD
AttributeError Traceback (most recent call last)
<ipython-input-56-7a9188b07317> in <module>()
----> 1 dataRDD = data2.rdd
2 print dataRDD
/usr/lib64/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
2148 return self[name]
2149 raise AttributeError("'%s' object has no attribute '%s'" %
-> 2150 (type(self).__name__, name))
2151
2152 def __setattr__(self, name, value):
AttributeError: 'DataFrame' object has no attribute 'rdd'
I would like to build on the Pandas DataFrame, not on sqlContext, as I'm not sure whether all the functions of a Pandas DF are available in Spark. If this is not possible, can anyone provide an example of using a Spark DF?
Can I convert a Pandas Dataframe to RDD?
Well, yes you can do it. Pandas Data Frames
pdDF = pd.DataFrame([("foo", 1), ("bar", 2)], columns=("k", "v"))
print pdDF
## k v
## 0 foo 1
## 1 bar 2
can be converted to Spark Data Frames
spDF = sqlContext.createDataFrame(pdDF)
spDF.show()
## +---+-+
## | k|v|
## +---+-+
## |foo|1|
## |bar|2|
## +---+-+
and after that you can easily access underlying RDD
spDF.rdd.first()
## Row(k=u'foo', v=1)
Still, I think you have a wrong idea here. Pandas Data Frame is a local data structure. It is stored and processed locally on the driver. There is no data distribution or parallel processing and it doesn't use RDDs (hence no rdd attribute). Unlike Spark DataFrame it provides random access capabilities.
A Spark DataFrame is a distributed data structure that uses RDDs behind the scenes. It can be accessed using either raw SQL (sqlContext.sql) or a SQL-like API (df.where(col("foo") == "bar").groupBy(col("bar")).agg(sum(col("foobar")))). There is no random access and it is immutable (no equivalent of Pandas inplace). Every transformation returns a new DataFrame.
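To illustrate, a small sketch using the spDF frame created above (columns k and v):
from pyspark.sql.functions import col, sum as sum_

# Each call returns a new, immutable DataFrame; nothing is modified in place.
filtered = spDF.where(col("k") == "foo")
aggregated = spDF.groupBy(col("k")).agg(sum_(col("v")).alias("total_v"))
aggregated.show()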
If this is not possible, can anyone provide an example of using a Spark DF?
Not really. It is far too broad a topic for SO. Spark has really good documentation and Databricks provides some additional resources. For starters, check these:
Introducing DataFrames in Spark for Large Scale Data Science
Spark SQL and DataFrame Guide