Google Dataflow Sharing Resources Between Windows - google-bigquery

I am currently building a Google Dataflow pipeline that writes to multiple BigQuery tables at run time. The problem I am facing is that I need to reuse resources such as the BigQuery service instance, table info, etc. (I do not want to re-create those resources every time), but I am not able to cache them in an efficient way.
Currently I am using a simple factory to cache them (a static concurrent hash map). The pipeline does not seem to pick them up from the cache (it actually does a couple of times, but most of the time they are re-created).
I saw a workaround using fixed-size session windows, but I need a simpler solution if one exists.
So, are there any best practices or solutions for the problem I am facing?
Is there any way to share resources between windows?

Actually, I had misplaced the logging information, which inverted the result (my bad). The solution of keeping the static factory separate from the pipeline job does seem to resolve the resource-sharing issue. Hope this helps anyone having a similar issue :)
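For anyone hitting the same thing, here is a minimal sketch of what that "static factory separate from the pipeline job" can look like. The class and method names below are illustrative, not part of the Dataflow SDK; the point is that the cache lives in a plain static field outside the DoFn, so it is never serialized with the job graph and each worker JVM builds the expensive clients only once.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical helper, kept outside the pipeline/DoFn classes so it is never
// serialized with the job graph. Each worker JVM gets its own cache.
public final class ResourceFactory {

    private static final ConcurrentMap<String, Object> CACHE = new ConcurrentHashMap<>();

    private ResourceFactory() {}

    @SuppressWarnings("unchecked")
    public static <T> T getOrCreate(String key, Supplier<T> creator) {
        // computeIfAbsent runs the creator at most once per key per JVM,
        // even when several DoFn instances ask for the resource concurrently.
        return (T) CACHE.computeIfAbsent(key, k -> creator.get());
    }
}
```

Inside a DoFn you would then resolve the shared client lazily, e.g. ResourceFactory.getOrCreate("bigquery-client", MyClients::newBigqueryClient), where newBigqueryClient() stands for whatever code you already use to build the BigQuery service instance. Note that this still creates the resources once per worker JVM rather than once per job, which is usually what you want for clients and table metadata.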

Related

I need some advice about Druid and Metamarkets

I need a solution for storing logs (which more or less follow one of, say, 10 standard formats), preferably in real time, in a database that is fast to query and can easily answer various weird queries, e.g. queries looking for keywords in text bodies, or queries involving multiple tables.
A solution that was recommended to me was Metamarkets, which seems to do real-time logging with a very good query system. However, I'm unsure about the cost and whether such a complex solution is needed.
From what I understand, the "selling point" of Metamarkets is the Druid DB, and that DB is open source and can be deployed outside of their stack. So what I've come here to ask is:
Have any of you had experience deploying a real-time logging system with Druid? How hard was it? How long did it take? What were the challenges? What other technologies besides Druid did you use? Do you have any recommended reading?
Have any of you had experience with Metamarkets? If so, again, how hard was it? How long did it take? What were the challenges? What were the costs once it hit production? Do you have any recommended reading on the subject?
Also, a bonus question: are there actually any benchmarks of Druid done by "unbiased professionals"? The fact that a real-time-in, real-time-out database is written in Java seems a bit... ahm, hard to believe.
This is a quick answer.
It is true that Druid is open source, but the missing link here is a good UI that plays nicely with Druid. There is one UI that used to be called Caravel and is now Superset; I guess it can do an OK job.
As for running a Druid cluster, it should not be that hard if you have enough resources (e.g. engineers) to put the whole pipeline in place, from packaging to deploying Druid on your machines/cloud.
Finally, the last piece is monitoring and updating the cluster, which needs a good amount of work as well.
And yes, it is written in Java, but that's the case for a lot of other real-time software, Kafka for example. In fact, Druid does a lot of things off-heap and uses memory-mapped files for serving data. Reading the white paper will give you a good basic understanding of the system, and from there you can decide whether Druid is a good fit or not.

What are some methods of testing data analytics systems and ETL processes?

I work primarily with so-called "Big Data": the ETL and analytics parts. One of the challenges I constantly face is finding a good way to "test my data", so to speak. For my MapReduce and ETL scripts I write solid unit-test coverage, but if there are unexpected underlying changes in the data itself (coming from multiple application systems), the code won't necessarily throw a noticeable error, which leaves me with bad or altered data that I don't know about.
Are there any best practices out there that help people keep an eye on what, and how, the underlying data may be changing?
Our technology stack is AWS EMR, Hive, Postgres, and Python. We're not really interested in bringing in a big ETL framework like Informatica.
You could create some kind of mapping files (maybe XML or something similar) describing the standards specific to your systems, and validate your incoming data against them before putting it into your cluster, or maybe during the process itself. I was facing a similar issue some time ago and ended up doing this.
I don't know how feasible it is for your data and your use case, but it did the trick for us. I had to create the XML files once (I know it's boring and tedious, but worth a try), and now whenever I get new files I use these XML files to validate the data before putting it into my cluster, checking whether the data is correct or not (as per the standards defined). This saves a lot of the time and effort that would be involved if I had to check everything manually every time new data arrives.
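To make that concrete, here is a rough sketch of the kind of validation step described above. It is not the poster's actual code: the mapping-file format and the names are invented, and in a Python/Hive stack the equivalent would probably be a small Python script, but the shape is the same. Load the expected layout from the mapping file, then flag or reject records that don't match it before they reach the cluster.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Validates delimited records against a hand-written mapping file such as:
//   <columns>
//     <column name="order_id" type="int" required="true"/>
//     <column name="comment"  type="string" required="false"/>
//   </columns>
// (this file format is invented for illustration)
public class IncomingDataValidator {

    private final List<Element> columns = new ArrayList<>();

    public IncomingDataValidator(File mappingFile) throws Exception {
        NodeList nodes = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(mappingFile)
                .getElementsByTagName("column");
        for (int i = 0; i < nodes.getLength(); i++) {
            columns.add((Element) nodes.item(i));
        }
    }

    /** Returns human-readable problems for one record; an empty list means it passed. */
    public List<String> validate(String[] fields) {
        List<String> problems = new ArrayList<>();
        if (fields.length != columns.size()) {
            problems.add("expected " + columns.size() + " fields but got " + fields.length);
            return problems;
        }
        for (int i = 0; i < columns.size(); i++) {
            Element col = columns.get(i);
            String name = col.getAttribute("name");
            String value = fields[i].trim();
            if (Boolean.parseBoolean(col.getAttribute("required")) && value.isEmpty()) {
                problems.add(name + " is required but empty");
            } else if ("int".equals(col.getAttribute("type")) && !value.isEmpty()) {
                try {
                    Long.parseLong(value);
                } catch (NumberFormatException e) {
                    problems.add(name + " should be an integer but was '" + value + "'");
                }
            }
        }
        return problems;
    }
}
```

Records that come back with problems can then be routed to a quarantine location instead of the main tables, which makes unexpected upstream changes visible instead of silent.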

JSON vs classic schema design [duplicate]

The Project
I've been asked to work on an interesting project -- what amounts to a basic Web CMS -- that uses HTML/CSS/jQuery with PHP. However, one requirement is that there won't be a database to house the data (they want flat files for the documents/pages -- preferably in JSON format).
In a very basic sense, it'll be used to generate HTML pages via a very "non-techie" interface. Each installation would only have around 20 pages, but a few may get up to 100. It has to be fairly easy to drop onto a PHP capable server and run, with very little setup needed.
What's Out There
There are tons of CMS options and quite a few flat-file versions. But an OSS or other existing CMS is not an option. They need a simple proprietary system.
Initial Thoughts
So flat files it is... but I'd really like to get some feedback on the drawbacks, and if it is worth the effort to try and convince them to use something like MySQL (SQLite or CouchDB are out since none of the servers can be configured to run them at the present time).
Of course the document files are pretty straightforward, but we're also talking about login info for 1 or 2 admins per installation, a few lists, as well as configs/settings (which also can easily be stored in a file with protection).
The Dilemma
If there are benefits to using MySQL rather than JSON-formatted files and some arrays in a simple project like this -- beyond my own pre-conceived notions :) -- I'll be sure to argue them.
But honestly I can't see any that outweigh their need to not have a database system.
I'd appreciate your insight and opinions.
If you can't cite a specific need for relational table design, then you're good with flat files. Build as specified. The moment you can cite a specific need, let them know; upgrading isn't that hard, if your perception is timely (that is, if you aren't in the position of having to normalize data that should have been integrated earlier).
It's a shame you can't use CouchDB; this seems like the perfect application for it. Keep in mind that using flat files severely constrains your architecture and, especially, scalability.
What's the best-case scenario for your CMS app? That it's successful and people want to use it more? If you're using flat files, it'll be harder to service and improve your system (e.g. make it more robust and add new features for future versions), and performance will not scale well. So "success" in this case is at best short-lived, as success translates into more and more work for smaller and smaller gains in feature set and performance.
Then again, if the CMS is designed right, switching from flat files to an RDBMS should be as simple as using a different data-access file.
Will this be installed on any shared hosting sites? For this to work somewhat safely, a mechanism like suEXEC needs to be set up properly, as the web server will need write permissions to various directories.
What would be cool with a simple site that is fed via JSON and jQuery is that the site wouldn't need to reload on each click; just the relevant data would change. You could then use hashes in the location bar to keep track of where you were (e.g. http://localhost/#about).
The problem is that if they edit the raw JSON file, they can mess it up pretty quickly. I think your admin tools would have to generate the JSON files from the input so that you can ensure nothing breaks. The admin tools would be more involved than the site itself (though isn't that always the case with dynamic sites?).
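To illustrate the "let the admin tool generate the JSON" point, the save action can serialize a whole page object and write the file in one step, so a half-edited file can never appear on disk. The sketch below is only an illustration (the Page fields and PageStore name are made up, and it uses Java with the Jackson library for brevity); in the PHP admin tool the equivalent would be building an array and writing it out with json_encode.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.io.IOException;
import java.util.List;

// A hypothetical page model as the admin tool might represent it.
class Page {
    public String slug;
    public String title;
    public String body;
    public List<String> tags;
}

public class PageStore {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Serialize the whole page in one go; the admin UI is the only writer,
    // so the JSON on disk is always well-formed.
    public static void save(Page page, File contentDir) throws IOException {
        File target = new File(contentDir, page.slug + ".json");
        MAPPER.writerWithDefaultPrettyPrinter().writeValue(target, page);
    }
}
```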
What are the predicted data sizes for the CMS?
A large reason for using an RDBMS is quick, specific access to large amounts of data. The data format might not be large, but if there is a lot of data, then an RDBMS might be better in the long run.
While an RDBMS may be necessary for a very large CMS, a small one could run off flat files very well. A lot of CMS products out there fall down in that regard, I think, by throwing an RDBMS into the mix when there's no real need.
However, if you are using flat files, there are security issues which others have highlighted. Another issue I've come across is hosting providers using the disable_functions directive in php.ini to disable file I/O functions like fopen() and friends. If you're hosting your CMS on a box you control, you won't have this problem but if you're using a third-party provider, check first.
I'm the original poster but wasn't signed in, so I'm following up on the answers so far in an answer (sorry if this is bad form).
There may be instances where this is on a shared host.
Though the JSON files can technically be edited, this won't be the case; the admin interface will be robust enough to do all of the creating/editing of pages.
The size of each install will be relatively small -- 1-2 admins and 10-100 pages. A few lists of common items may run longer (snippets of copy, for example).
Security will be a big issue -- any other suggestions on this specifically?
Well, isn't the problem that they are distrustful of any database system? Isn't the problem more in their thinking than in the technology? Maybe they are afraid of a database because it sounds complex to them. In that case, if you just present them with some very simple CMS (like CMS Made Simple, which I've heard is really simple and quick to learn) and they see that everything is easy, then maybe they just won't care what's behind it, whether it's a database or whatever!
They might be receptive to arguments like easier maintenance, lower maintenance costs, and a much better handover to another webmaster than with a proprietary solution (they are not dependent on you), etc.

iPhone SDK: Data Synchronization

I am looking for an overview of data synchronization techniques available on the iPhone platform. We need the ability to be able to sync a subset of content from a server to a local database residing on the iPhone.
On other projects I have worked on, the data synchronization was handled by the database. Is that available in SQLite? If not, any suggestions on techniques? Rolling our own would not be my first choice.
Thanks in advance.
Unfortunately I don't think there's currently a tool/feature that does this. A similar question was posted, which links to an article about how someone rolled their own.
In essence, you could create a "pending" queue table that keeps track of the record IDs that need to be updated. When the app needs to sync, as long as you have a way to match the local record to the server record, it can sync that way, and of course the same applies in the opposite direction, from the server down (a rough sketch of the idea follows the linked question below).
iPhone SQLite DB and Web-based DB synchronization and interaction recommendations
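Purely as a rough sketch of that pending-queue idea: the table and column names below are invented, and it is written in Java/JDBC only for compactness; on the phone the same statements would go through the SQLite C API or whatever wrapper you already use. Every local edit also records the row in a queue table, and the sync step drains the queue, pushes those records to the server, and clears the entries the server accepted.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Minimal outline of a "pending changes" queue kept next to the real tables.
// pushToServer() is a placeholder for whatever call uploads a record.
public class SyncQueue {

    private final Connection db;

    public SyncQueue(Connection db) throws SQLException {
        this.db = db;
        try (Statement s = db.createStatement()) {
            s.execute("CREATE TABLE IF NOT EXISTS pending_sync ("
                    + " record_id INTEGER NOT NULL,"
                    + " table_name TEXT NOT NULL)");
        }
    }

    /** Call this from every local insert/update so the change is remembered. */
    public void markDirty(String tableName, long recordId) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(
                "INSERT INTO pending_sync (record_id, table_name) VALUES (?, ?)")) {
            ps.setLong(1, recordId);
            ps.setString(2, tableName);
            ps.executeUpdate();
        }
    }

    /** Drain the queue: upload each queued record, then forget it once the server accepts it. */
    public void sync() throws SQLException {
        List<Long> ids = new ArrayList<>();
        List<String> tables = new ArrayList<>();
        try (Statement s = db.createStatement();
             ResultSet rs = s.executeQuery("SELECT record_id, table_name FROM pending_sync")) {
            while (rs.next()) {
                ids.add(rs.getLong("record_id"));
                tables.add(rs.getString("table_name"));
            }
        }
        for (int i = 0; i < ids.size(); i++) {
            if (pushToServer(tables.get(i), ids.get(i))) {
                try (PreparedStatement del = db.prepareStatement(
                        "DELETE FROM pending_sync WHERE record_id = ? AND table_name = ?")) {
                    del.setLong(1, ids.get(i));
                    del.setString(2, tables.get(i));
                    del.executeUpdate();
                }
            }
        }
    }

    private boolean pushToServer(String table, long recordId) {
        // Placeholder: look up the record, send it to the server, and return
        // true only if the server acknowledged it.
        return false;
    }
}
```

The server-to-client direction works the same way in reverse, typically driven by a last-modified timestamp or a change queue on the server side.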
There is one framework that could help you if you are using Core Data.
You should have a look at ZSync. An overview is available here:
http://ideveloper.tv/freevideo/details?index=17089146
If you are using SQLite, I strongly suggest that you consider switching to Core Data. You will certainly gain some performance, plus the integration of undo/redo.
You'll have to remove a whole bunch of custom code anyway... :)

Is there a way of sharing a Core Data store between processes?

What am I trying to do?
A UI process that reads data from a Core Data store on disk. It wouldn't need to edit the data, just read and display the data.
A command line process that writes to the same data store as accessed by the UI.
Why?
So that the command line process can be running all the time but the user can quit the UI process and forget about the app until they need to look at the data it's captured.
What would be the simplest and most reliable way of achieving this?
What Have I Tried?
I've read up on sharing a data store between threads and implemented this once before, but I can't find anything in the docs or on the web indicating how to share a store between processes.
Is it as simple as pointing both processes at the same data store file? I've experimented with this briefly. It appeared to work OK, but I'm worried I might run into problems with locking etc when it's really put under stress.
Finally
I'd really appreciate someone giving me pointers on what direction to go with this. Thanks.
This might be one of those situations in which you'll simply have to Try It And See™.
As far as I can remember, SQLite (which is the data store you'll most likely want to be using) has built-in mechanisms for file locking and so on, so the integrity of the file is likely to be assured. If, on the other hand, you use the Core Data XML store, you might run into problems.
In other words: use the SQLite backing for your store, and you should likely be fine.
You can do exactly what you want. You probably want to use the SQLite store; otherwise, saving and committing every time you want to sync out data will be horrifically slow. You just need some sort of IPC doorbell between the apps so that you can inform one app that it needs to recheck the persistent store on disk and merge in the new data (a generic sketch of that doorbell idea is below).
Apple documents using multiple persistent store coordinators as a valid option in Multi-Threading with Core Data (in "General Guidelines", point 2). That happens to be discussing completely parallel Core Data stacks in the same process, but it is just as valid if they are in completely separate address spaces.
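The doorbell can be almost anything both processes can see. The sketch below is a generic illustration only, not Core Data or Cocoa code (on the Mac a distributed notification would be more natural): the writer touches a sentinel file after each save, and the reader watches that directory and re-fetches when it changes. The directory and file names are invented.

```java
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

// Generic "doorbell": the writer touches a sentinel file after saving,
// and the reader watches the directory and refreshes when it changes.
public class Doorbell {

    private static final Path DIR = Paths.get("/tmp/myapp");
    private static final Path SENTINEL = DIR.resolve("store-updated");

    /** Writer side: call after each successful save to the shared store. */
    public static void ring() throws Exception {
        Files.createDirectories(DIR);
        Files.write(SENTINEL, Long.toString(System.currentTimeMillis()).getBytes());
    }

    /** Reader side: block until the sentinel changes, then let the caller re-fetch. */
    public static void waitForRing(Runnable refresh) throws Exception {
        Files.createDirectories(DIR);
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            DIR.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY);
            while (true) {
                WatchKey key = watcher.take(); // blocks until something in DIR changes
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (SENTINEL.getFileName().equals(event.context())) {
                        refresh.run(); // e.g. re-read the persistent store and merge
                    }
                }
                if (!key.reset()) {
                    return; // directory is no longer watchable
                }
            }
        }
    }
}
```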
Nearly two years on, and I've just found a much better way of doing this.
The answer seems to lie with Sync Services. I didn't even realise it existed! There's an excellent post about this at:
http://www.timisted.net/blog/archive/core-data-and-sync-services/
I've not tried this with my app yet, but it seems like an excellent way of sharing a core data store between two processes or applications.
If I experience any performance issues, I'll update this answer accordingly, but this seems like the Apple recommended way of doing it.
You need to re-think your architecture. If you want a daemon to own the data store, then have your GUI app connect to the daemon. Trying to share the data store is a can of worms you don't want to open.