Spark streaming from multiple folders


I have exactly the same question as the one asked here. I couldn't comment there because I don't have enough reputation on Stack Overflow, so I'm posting a duplicate. I'm not sure if there is a way around that.
The answer given there doesn't work: textFileStream() doesn't accept a comma-separated list of folders.
16/02/24 11:01:40 WARN FileInputDStream: Error finding new files
java.io.FileNotFoundException: File file:/shared/data/2016-01-22-05/,file:/shared/data/2016-01-22-06 does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:376)
This is what I have so far:
val folderList = makeAListOfFoldersToWatch()
val dstreamsList = folderList.map(ssc.textFileStream(_))
val lines = ssc.union(dstreamsList)
lines.foreachRDD( rdd => {
This solution works for a fixed list of folders. My use case has one S3 folder per hour, named in YYYY-MM-DD-HH format, and a new folder is created every hour. Is there a way to keep the folder list up to date in a long-running streaming job? Or is there another way to solve this?
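The hourly naming scheme described above can be generated rather than hard-coded. Below is a minimal Python sketch that only builds the folder names for a time window. The helper name `hourly_folders` and its parameters are hypothetical, and how the refreshed list is fed back into a running streaming job is left open here.

```python
from datetime import datetime, timedelta


def hourly_folders(base, end, hours):
    """Return the paths of the `hours` most recent hourly folders up to `end`.

    Folder names follow the YYYY-MM-DD-HH scheme from the question.
    """
    # Truncate to the start of the hour so each timestamp maps to one folder.
    end = end.replace(minute=0, second=0, microsecond=0)
    # Oldest folder first, the folder for `end` last.
    stamps = [end - timedelta(hours=h) for h in range(hours - 1, -1, -1)]
    prefix = base.rstrip("/")
    return [f"{prefix}/{t:%Y-%m-%d-%H}" for t in stamps]


if __name__ == "__main__":
    for folder in hourly_folders("/shared/data", datetime(2016, 1, 22, 6, 30), 2):
        print(folder)
```

Calling the helper again each hour with the current time gives the window of folders that should be watched at that moment.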

Related

Is there a way to list directories using PySpark in a notebook?

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the filenames, and then pull a file if needed in a different cell. I can access the files just fine using Cyberduck, and it shows the names there.
Example: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works. But I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both functions work, so I know my keys are correct, but loading the files takes too long.

Since the list of files can be retrieved by a single node, you can just list the files in the directory. Look at this response.
wholeTextFiles returns (path, content) tuples, but I don't know whether the content is loaded lazily, so that you could take only the first element of each tuple.
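Listing names on a single node, as suggested above, might look like the following Python sketch. It assumes the storage is reachable as an ordinary filesystem path from the driver, which is an assumption and not something the question confirms. For a remote scheme, the same idea would need that store's own listing client. The function name `list_names` is hypothetical.

```python
from pathlib import Path


def list_names(root, pattern="*"):
    """Return sorted relative paths of files under `root` without opening them."""
    root = Path(root)
    # rglob walks the tree recursively; only directory metadata is read here,
    # so large files cost nothing until one is explicitly opened later.
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(pattern)
        if p.is_file()
    )


if __name__ == "__main__":
    for name in list_names(".", "*.gz"):
        print(name)
```

The returned names can be inspected in one notebook cell, and a single chosen path passed to sc.textFile in another.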

Create a notebook from several md files [duplicate]

This question already has answers here:
Markdown and including multiple files
(20 answers)
Closed 2 years ago.
I'm new to Stack Overflow.
I write a lot. So I created different .md files in different directories.
Now I want to create a notebook (in .pdf format or as another .md, it doesn't matter) from all the .md files, but I have some problems:
It will be messy I guess
I don't know how to do so
I wanted to know if there is a way to do it in a tidy way :)

I see your post is tagged for r-markdown, so I am going to show you how to do it the r-markdown way.
You can create an index.Rmd file (this doesn't have to be named index.rmd) that links to other r markdowns.
In your index file, add a code chunk with the following bit of code:
```{r child = 'children/summary.Rmd'}
```
This will knit what you have in the summary.Rmd into index.Rmd. For this example, I put summary.Rmd into a sub directory called "children".
Let me know if you have any questions!

I believe the fastest and most hassle-free way to get started with the notebook is to use one of the many static site generators available (for example, MkDocs Material) or a note-taking application (for example, Notable, Boost Note, or Joplin).
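If a single combined .md file is enough, a small script is another option alongside the answers above. The Python sketch below walks a root directory and concatenates every .md file into one output file, using each file's relative path as a heading. The function name `merge_markdown`, the sort-by-path ordering, and the heading style are all illustrative assumptions.

```python
from pathlib import Path


def merge_markdown(root, output):
    """Concatenate every .md file under `root` into `output`, sorted by path."""
    root, output = Path(root), Path(output)
    # Skip the output file itself in case it lives inside `root`.
    sources = sorted(
        p for p in root.rglob("*.md") if p.resolve() != output.resolve()
    )
    parts = []
    for src in sources:
        # The relative path becomes a heading so each section stays traceable.
        heading = "# " + src.relative_to(root).as_posix()
        body = src.read_text(encoding="utf-8").strip()
        parts.append(heading + "\n\n" + body + "\n")
    output.write_text("\n".join(parts), encoding="utf-8")
    return sources


if __name__ == "__main__":
    merged = merge_markdown(".", "notebook.md")
    print(f"merged {len(merged)} files")
```

Sorting by path keeps the result tidy as long as the directories and files are named in the order they should appear.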

osquery - How can I retrieve a file origin using osquery?

I'm using osquery on Windows and I need help: I want to retrieve the origin of a specific file. For example, if I download a file from http://example.com, I'm looking for an osquery query that shows me that the file was downloaded from http://example.com (or something like this). I thought I could derive this information by comparing timestamps between the file table and the routes table, but routes has no timestamp column. How can I do that?

I don't see a table for this on Windows, although the information is available on the system through alternate data streams (ADS) (see this answer). I would open an issue for this on the osquery repo; it would be a valuable table to have.
You can use the extended_attributes table. For example:
osquery> select path, key, value, base64 from extended_attributes where path ='/Users/victor/Downloads/osqueryi.zip';
path = /Users/victor/Downloads/osqueryi.zip
key = com.apple.lastuseddate#PS
value = eynzWgAAAAAbZEQgAAAAAA==
base64 = 1
path = /Users/victor/Downloads/osqueryi.zip
key = where_from
value = https://files.slack.com/files-pri/T04QVKUQG-FALAL3WP2/download/osqueryi.zip
base64 = 0
osquery>
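Until such a table exists on Windows, the ADS mentioned above can be read outside osquery. The Python sketch below assumes the download was tagged with a `Zone.Identifier` stream in the usual INI-like layout (a `[ZoneTransfer]` section with `ZoneId` and, for some downloads, `HostUrl` and `ReferrerUrl`). Not every file carries this stream or all of these keys, and the function names are hypothetical.

```python
import configparser


def parse_zone_identifier(text):
    """Parse the text of a Zone.Identifier stream into a dict of its keys."""
    # Disable interpolation so '%' characters in URLs are taken literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case, e.g. "HostUrl"
    parser.read_string(text)
    if not parser.has_section("ZoneTransfer"):
        return {}
    return dict(parser["ZoneTransfer"])


def read_zone_identifier(path):
    """Read the Zone.Identifier stream of `path` (Windows on NTFS only)."""
    with open(path + ":Zone.Identifier", encoding="utf-8", errors="replace") as stream:
        return parse_zone_identifier(stream.read())


if __name__ == "__main__":
    sample = "[ZoneTransfer]\nZoneId=3\nHostUrl=http://example.com/file.zip\n"
    print(parse_zone_identifier(sample))
```

`read_zone_identifier` raises an OSError when the file has no such stream, so callers should be ready to treat that as "origin unknown".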

+1 on what #groob mentioned, this'd be a nice table to have and I think we've wanted it for some time. I thought we already had an issue cut for this, but I went ahead and made a new one as simple searches weren't turning anything up. Thanks for the question :)
https://github.com/facebook/osquery/issues/5250

how to handle multiple uploads?

I'm writing a program that can upload files to multiple FTP servers.
There is a table: the top row lists the sites and the leftmost column lists the files. Through this table I define what should be uploaded where.
The program is already working, but what I want to do now is upload the files to each site in parallel. So when I hit start, each column will go through its rows on its own and upload the files to that site if the content of the specific cell says so. There can be anywhere between 1 and 50 sites, and all uploads should run in parallel (one file at a time for each site).
What I am asking is: what is the best way to handle this? I know I have to set up multiple uploaders, but what confuses me is how to keep track of what each site is doing. The only thing I can come up with is an array of arrays, where each position is for a site, and the array at that position defines which file is being uploaded and all the information it needs for that. Would that be a good solution?
Thanks!

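You can put your data into arrays and then loop over them.
Use this code:
// $server_file and $local_file are assumed to be defined elsewhere.
$web = ['www.firstSite.com', 'www.secondSite.com'];
$user = ['firstUser', 'secondUser'];
$pass = ['firstPass', 'secondPass'];
for ($i = 0; $i < count($web); $i++) {
    // Connect to the next site; skip it if the connection fails.
    $conn_id = ftp_connect($web[$i]);
    if ($conn_id === false) {
        echo "Failed";
        continue;
    }
    // Upload only after a successful login.
    if (ftp_login($conn_id, $user[$i], $pass[$i])
        && ftp_put($conn_id, $server_file, $local_file, FTP_BINARY)) {
        echo "Success";
    } else {
        echo "Failed";
    }
    ftp_close($conn_id);
}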

You can make a custom class and use a List(Of SiteFiles) to hold a collection of its instances. Iterate over the site data, create a new SiteFiles object for each site, and add the names of the files that need to be uploaded to that site to its Files property. Once you're done building this List(Of SiteFiles), you can iterate over each file in SiteFiles.Files for each SiteFiles object, using threading/async methods if needed, and upload the files. This gives you a neat and tidy way to organize what you're doing.
Public Class SiteFiles
Public Property Site As String
Public Property Files As New List(Of String)
End Class
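The "one worker per site, one file at a time" idea can be sketched as follows. This is written in Python rather than VB.NET purely for illustration. The `upload` callable is a hypothetical stand-in for the real FTP transfer, and `plan` plays the role of the List(Of SiteFiles) above as a mapping from site to its file list.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_site(site, files, upload):
    """Upload one site's files sequentially; return (site, [(file, ok), ...])."""
    results = []
    for name in files:
        try:
            upload(site, name)
            results.append((name, True))
        except Exception:
            # Record the failure and carry on with the next file for this site.
            results.append((name, False))
    return site, results


def upload_all(plan, upload, max_sites=50):
    """Run one worker per site in parallel. `plan` maps site -> list of files."""
    if not plan:
        return {}
    workers = min(max_sites, len(plan))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(upload_site, site, files, upload)
            for site, files in plan.items()
        ]
        # Each future yields (site, results); collect them into one status dict.
        return dict(f.result() for f in futures)


if __name__ == "__main__":
    def fake_upload(site, name):
        print(f"uploading {name} to {site}")

    print(upload_all({"siteA": ["a.txt", "b.txt"], "siteB": ["a.txt"]}, fake_upload))
```

Because each site's state lives inside its own worker and comes back as one result entry, there is no shared array of arrays to keep in sync.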

Delete Folders and Containing Files

I have a really quick question. My program actually downloads a zip file then extracts it onto their desktop. But I need an uninstall feature for it, which is basically deleting multiple folders and containing files. How can I do this in vb.net?

If all of your folders are contained in a single folder, it should be pretty straightforward.
Dim path As String = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) & "\YOURPATH"
System.IO.Directory.Delete(path, True)
That will delete your root directory, and all the directories and files below it. You could just call this several times over if your files and directories are not all in a single root directory like "YOURPATH" in the example. This will spare you from having to remove each file individually.

The .NET System.IO namespace has two methods that should do the trick:
System.IO.Directory.GetDirectories("C:\\Program Files\\Your Directory");
System.IO.Directory.GetFiles("C:\\Program Files\\Your Directory");
I would write a method that takes the name of a directory, uses GetFiles to get all of its files, and deletes them with System.IO.File.Delete(path) in a foreach loop. Then, run a foreach loop over the result of GetDirectories(), calling the method recursively.
Update: Steve Danner points out that the System.IO.Directory class has a Delete method, so you don't need to go through the loops I describe here. His answer is the right one and should be voted up. Mine, at this point, is more of a curiosity (although thank you to the person who gave me an upvote ;0).
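For the curious, the recursive approach described above looks roughly like this. The sketch is in Python for illustration only, since the thread itself is about VB.NET, and the function name `delete_tree` is hypothetical. Python's `shutil.rmtree` is the one-call equivalent, much like Directory.Delete(path, True).

```python
import os


def delete_tree(path):
    """Delete the files in `path`, recurse into subdirectories, then remove `path`."""
    # Snapshot the entries first so the directory is not modified mid-iteration.
    with os.scandir(path) as it:
        entries = list(it)
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            delete_tree(entry.path)
        else:
            os.remove(entry.path)
    # The directory is empty at this point, so it can be removed itself.
    os.rmdir(path)
```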

You are looking for DirectoryInfo. Use it like this:
Dim di As New IO.DirectoryInfo(path)
di.Delete(True)

