Dask dataframe read_parquet fails from HTTP

I have been dealing with this problem for a week.
I use the command
from dask import dataframe as ddf
ddf.read_parquet("http://IP:port/webhdfs/v1/user/...")
and I get "invalid parquet magic".
However, ddf.read_parquet is OK with "webhdfs://".
I would like ddf.read_parquet to work over HTTP because I want to use it in a dask-ssh cluster whose workers have no HDFS access.

Although the comments already partly answer this question, I thought I would add some information as an answer.
HTTP(S) is supported by Dask (actually fsspec) as a backend filesystem; but to get partitioning within a file, you need to know the size of that file, and to resolve globs, you need to be able to get a list of links, and any given server does not necessarily provide either.
webHDFS (or indeed httpFS) does not work like plain HTTP downloads: you need to use a specific API to open a file and fetch a final URL on a cluster member for that file, so the two methods are not interchangeable.
webHDFS is normally intended for use outside of the Hadoop cluster; within the cluster, you would probably use plain HDFS ("hdfs://"). However, kerberos-secured webHDFS can be tricky to work with, depending on how the security was set up.
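If the workers can reach the namenode's webHDFS endpoint directly, a minimal sketch of the working "webhdfs://" route looks like the following; the host, port, path and the "user" storage option are made-up placeholders, not values from the question, and a kerberized cluster would need different storage_options.

from dask import dataframe as ddf

# Placeholder host/port/path and user; adjust for your cluster.
# Assumes simple (non-kerberos) authentication.
df = ddf.read_parquet(
    "webhdfs://namenode-host:9870/user/someuser/data.parquet",
    storage_options={"user": "someuser"},
)
print(df.head())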

Related

Importing RedisTimeSeries data into Grafana

I've got a process storing RedisTimeSeries data in a Redis instance on Docker. I can access the data just fine with the RedisInsight CLI:
I can also add Redis as a data source to Grafana:
I've imported the dashboards:
But when I actually try to import the data into a Grafana dashboard, the query just sits there:
TS.RANGE with a range of - +, or two timestamps, also produces nothing. (I do get results when entering it into the RedisInsight CLI, but not as a CLI query in Grafana.)
What could I be missing?
The command you should be using in the Grafana dashboard for retrieving and visualising the time series stored in Redis with RedisTimeSeries is TS.RANGE for a specific key, or TS.MRANGE in combination with a filter that selects a set of time series matching that filter. A list of RedisTimeSeries commands is here: https://oss.redislabs.com/redistimeseries/commands/ (you're using TS.INFO, which only retrieves the metadata of a time-series key, not the actual samples within it).
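As a sanity check outside Grafana, you can run the same commands with redis-py; in this sketch the key name "my:series" and the label filter "sensor=temp" are made-up placeholders:

import redis

r = redis.Redis(host="localhost", port=6379)

# All samples of one series (equivalent to: TS.RANGE my:series - +)
samples = r.ts().range("my:series", "-", "+")

# All series whose labels match a filter (equivalent to: TS.MRANGE - + FILTER sensor=temp)
matching = r.ts().mrange("-", "+", filters=["sensor=temp"])

print(samples[:5], matching)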
So I looked into this a bit more. Moderators deleted my last answer because it didn't 'answer' the question.
There is a GitHub issue for this, and one of the developers has responded: it is broken and has been for a while. Grafana doesn't seem to want to maintain this data source at the moment. IMHO they should remove the RedisTimeSeries support from their plugin library if it isn't fully baked.
Redis datasource issue for TS.RANGE: https://github.com/RedisGrafana/grafana-redis-datasource/issues/254
Are you trying to display a graph (e.g. number of people vs time)? If so, perhaps TS.INFO is not the right command and you should use something like TS.MRANGE.
Take a look at https://redislabs.com/blog/how-to-use-the-new-redis-data-source-for-grafana-plug-in/ for some more examples.

Adding fields to Cloudwatch without using JSON

So I have typical run-of-the-mill logs from Nginx and Tomcat servers, which are just single-line text files with a typical log format. I have changed the Tomcat access logs to output pipe-delimited fields so I can easily process them using some Unix scripts. I'd like to get rid of my Unix scripts and move to using CloudWatch to process my logs in a similar manner; however, I found out that CloudWatch really doesn't understand anything beyond timestamp, message, and logstream by default.
It will add fields using JSON, but JSON is verbose when it comes to log files. I'd like to just let it process a CSV file, which seems like an obvious alternative to JSON. I'm willing to change my log format to meet a requirement like that, but I can't find any information about how I could do that.
Is my only option to translate my logs into JSON in order to add fields to CloudWatch? I am aware of the parse command, but I find it cumbersome to reconstitute my fields every time I want to build a query, especially since these will mostly be access logs with numerous fields. I have the AWS CloudWatch Logs agent set up on my systems and I'm currently sending these logs to CloudWatch.
The closest thing there is to handling space-delimited log files is to use Metric Filters. Or at least that's how the authors of CloudWatch designed it.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html
The best examples of this are here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CountOccurrencesExample.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/ExtractBytesExample.html
Not sure if this is going to work for what I'm trying to do with logs, but it's a start. And it's the closest thing to a proper answer. If you want it done right, you gotta do it yo'self.
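For reference, a metric filter for a space-delimited format can also be created programmatically. This is only a sketch; the log group, filter name, field labels and metric namespace below are made-up placeholders, not anything from the question:

import boto3

logs = boto3.client("logs")

# Each bracketed name is just a label you assign to a column of the log line;
# status_code=5* turns the filter into a count of 5xx responses.
logs.put_metric_filter(
    logGroupName="/nginx/access",      # placeholder log group
    filterName="nginx-5xx-count",      # placeholder filter name
    filterPattern="[ip, identity, user, timestamp, request, status_code=5*, bytes]",
    metricTransformations=[{
        "metricName": "Nginx5xxCount",
        "metricNamespace": "CustomLogs",
        "metricValue": "1",
    }],
)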

Why is matillion not loading data from S3?

I have a simple S3 Load with all the correct information. There are no validation errors and the package executes without a problem. It's just that there is no data in the table. Any tips from someone who is knowledgeable about Matillion?
There are a number of reasons why Matillion might not appear to load any data in an S3 Load.
Firstly, I'd check that the Pattern parameter matches the file names in the S3 location; it is a regular expression match.
I believe that also includes the path, which you may have included in the Location parameter, so it may be worth modifying your pattern to look something like .*\/FilePrefix.* or even just .* and then selecting the actual file in the Location parameter.
Secondly, if the files were last modified more than 64 days ago, or they have already been loaded into the table previously, Snowflake won't load them by default, which you can get around by turning the Force Load parameter on.
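Since Matillion's S3 Load for Snowflake essentially wraps a COPY INTO statement, running something like the sketch below directly can help confirm whether the pattern or the load history is what is skipping your files. The connection details, table name, bucket path, pattern and credentials are all placeholders:

import snowflake.connector

# Placeholder connection details.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="...",
    warehouse="my_wh", database="my_db", schema="my_schema",
)

# Roughly what an S3 Load boils down to: PATTERN is a regex over the full key,
# and FORCE=TRUE reloads files Snowflake has already loaded before.
conn.cursor().execute("""
    COPY INTO my_table
    FROM 's3://my-bucket/some/path/'
    CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...')
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    PATTERN = '.*FilePrefix.*'
    FORCE = TRUE
""")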

Merge two Endeca Servers (Endeca 3.1) into one. Including their current data

Let me explain in more detail:
1st: I'm running Endeca 3.1, so "Endeca Server" here refers to 3.0's Data Domain.
I'm required to use an Endeca Server currently present on Endeca (downloaded as a demo VM). All the info on it, including groups, attributes and data, must be merged into our Endeca Server. (It can also be the other way around; I could merge my Endeca Server into this one.)
So far, I've tried to do the following:
1) Clone the Endeca Server
2) Use the putCollection sconfig operation to create a collection on it with the same name I have on mine.
3) Load configurations using the LoadCollection & LoadAttributes graphs from the OEID POC Template 3.1. I point to the new collection in the Configuration.xls file.
This is where I encounter an issue. The LoadAttributes graph gets a timeout (T/O) message from the server's web service. Then the config WSDL becomes inaccessible for a while. I can't go beyond this point.
I've been able to load data into the collection, but I need to load the attributes first.
Thanks in advance for your replies.
Regards
There are a few techniques.
Have you tried exporting the data domain and then importing it?
You can use the endeca-cmd tools to export to a file, and then import from that file. This would enable you to add two data stores into one server.
If you want to combine two data stores, then that is a different question.
The simplest approach in 3.1, if the data collections are small: extract them as CSV (via a data table), convert to XLS, and add them via self-provisioning into separate collections within a single data store. If you are running in the VM, this is potentially the easiest approach.
This can also be done using Integrator.
You don't need to load the attributes unless you are using multi-value types. You can call against the conversation web service to extract data and then load it using 'bulk-load'. I would not worry too much about creating the attributes unless this becomes essential due to their type or complexity. If you cannot call against the conversation web service, then again extract as CSV and load using Integrator.

Transfer large file from Google BigQuery to Google Cloud Storage

I need to transfer a large table in BigQuery, 2B records, to Cloud Storage in CSV format. I am doing the transfer using the console.
I need to specify a URI including a * to shard the export due to the size of the file. I end up with 400 CSV files in Cloud Storage. Each has a header row.
This makes combining the files time-consuming, since I need to download the CSV files to another machine, strip out the header rows, combine the files, and then re-upload. FYI, the size of the combined CSV file is about 48GB.
Is there a better approach for this?
Using the API, you will be able to tell BigQuery not to print the header row during the table extraction. This is done by setting the configuration.extract.printHeader option to false. See the documentation for more info. The command-line utility should also be able to do that.
Once you've done this, concatenating the files is much easier. On a Linux/Mac computer it would be a single cat command. However, you could also try to concatenate directly in Cloud Storage by using the compose operation. See more details here. Composition can be performed either from the API or the command-line utility.
Since a composition action is limited to 32 components, you will have to compose the files 32 at a time. That should come to around 13 composition operations for 400 files. Note that I have never tried the composition operation, so I'm just guessing on this part.
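Putting both ideas together, here is a minimal sketch using the google-cloud-bigquery and google-cloud-storage Python clients; the project, dataset, table, bucket and object names are placeholders. The extract job disables the header row, and GCS compose then merges the shards 32 at a time:

from google.cloud import bigquery, storage

bq = bigquery.Client()
gcs = storage.Client()

# 1) Export without the header row (configuration.extract.printHeader = false).
job_config = bigquery.ExtractJobConfig(print_header=False)
bq.extract_table(
    "my-project.my_dataset.my_table",      # placeholder table
    "gs://my-bucket/export/part-*.csv",    # placeholder destination URI
    job_config=job_config,
).result()

# 2) Concatenate the shards server-side, at most 32 sources per compose call.
bucket = gcs.bucket("my-bucket")
shards = sorted(bucket.list_blobs(prefix="export/part-"), key=lambda b: b.name)
combined = bucket.blob("export/combined.csv")
combined.compose(shards[:32])
remaining = shards[32:]
while remaining:
    batch, remaining = remaining[:31], remaining[31:]  # the running result counts as one source
    combined.compose([combined] + batch)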
Instead of the console, you can use the bq command-line utility, which can strip the headers at export time:
bq extract --noprint_header my_dataset.my_table 'gs://my-bucket/file-*.csv'