Is there an alternative to GeoIPCountryWhois.csv available? Since MaxMind has upgraded to the GeoIP2 databases, this file is not shipped anymore.
I am trying to run zmap/zgrab on this list of start/end IPs with the country code and do some data analysis on the output from zgrab.
I have a CSV file that has the following structure.
ERP,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
MARKETING,J,JACKSON,8388 SOUTH CALIFORNIA ST.,TUCSON,AZ,85708,267-3352,,ALLENTON,MI,48002,810,710-0470,369-98-6555,462-11-4610,1953-05-00,F,
As you can see there is no header, but for your information the first column represents the sector the data comes from.
What I have to do is, depending on the first column's value (MARKETING or ERP), send those rows to a different output directory.
For example, all rows with ERP go to /output/ERP/,
and all rows with MARKETING go to /output/marketing/.
I have an idea of how to do it, but my problem is with the RouteOnAttribute processor I am using: I don't know how to refer to the first column or how to indicate what the value is (ERP or MARKETING) so that I can later send it to the correct output directory.
Here is my schema.
Thanks.
Use the PartitionRecord processor for this case.
Configure the processor with record reader/writer controller services. Even though you don't have a header, you can use col1, col2, ... etc. as the field names in the Avro schema.
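For example, a minimal Avro schema for the record reader could look like the following; the names csv_record, col1, col2, col3 are placeholders, and you would extend the field list to cover every column in your file:

{
  "type": "record",
  "name": "csv_record",
  "fields": [
    { "name": "col1", "type": "string" },
    { "name": "col2", "type": "string" },
    { "name": "col3", "type": "string" }
  ]
}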
Add a new dynamic property that defines which field the processor should partition the flowfiles on.
The PartitionRecord processor then adds the partition field as an attribute, with its value, on each flowfile; by making use of this attribute value we can dynamically store files into the respective directories.
Flow:
1.GetFile
2.PartitionRecord
3.PutFile //configure directory as /output/${<keep_partition_field_name_here>}
Please refer to this link for details on configuring and using the PartitionRecord processor.
(or)
Old Approach:
Use the RouteText processor instead of the SplitText + RouteOnAttribute processors.
Configure the RouteText processor with one dynamic property per route (e.g. ERP, MARKETING) matching lines that start with those values.
Connect the ERP and MARKETING relationships to the PutFile processor and use the RouteText.Route attribute value to dynamically save the files into directories.
Flow:
1.GetFile
2.RouteText
3.PutFile //configure directory as /output/${RouteText.Route}/
You can also use the Grouping Regular Expression property value to create partitions.
Note
Using the PartitionRecord processor will be more efficient than the RouteText processor.
I'm developing a new webpage (in the .NET Framework, if that helps) for the scenario below. Every single day, we get a cab drivers' report.
Date | Blob
-------------------------------------------------------------
15/07 | {"DriverName1":"100kms", "DriverName2":"10kms", "Hash":"Value"...}
16/07 | {"DriverName1":"50kms", "DriverName3":"100kms", "Hash":"Value"}
Notice that 'Blob' is the actual data received in JSON format; it contains information about the distance covered by each driver on that particular day.
I have written a service which reads the above table, breaks the blob down further, and puts it into a new table like the one below:
Date  | DriverName  | KmsDriven
------|-------------|----------
15/07 | DriverName1 | 100
15/07 | DriverName2 | 10
16/07 | DriverName3 | 100
16/07 | DriverName1 | 50
By populating this, I can easily do the following queries:
How many drivers drove on a particular day.
How 'DriverName1' did for a particular week, etc.
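For example, with the new table in place (the table name and date literals below are placeholders, not the real schema):

-- drivers who drove on a given day
SELECT COUNT(DISTINCT DriverName)
FROM DriverKms
WHERE [Date] = '2019-07-15';

-- how DriverName1 did over a given week
SELECT SUM(KmsDriven)
FROM DriverKms
WHERE DriverName = 'DriverName1'
  AND [Date] BETWEEN '2019-07-15' AND '2019-07-21';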
My questions here are:
Is there anything in the .NET/SQL world that specifically addresses this, or am I reinventing the wheel here?
Is this the right way to use the blob data?
Are there any design patterns to adhere to here?
Is there anything in the .NET/SQL world that specifically addresses this,
or am I reinventing the wheel here?
Well, there are JSON parsers available, for example Newtonsoft's Json.NET. Or you can use SQL Server's own JSON functions. Once you have extracted the individual values from the JSON, you can write them into the corresponding columns of your new table.
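For example, on SQL Server 2016 or later you could flatten the blob with OPENJSON; this is a minimal sketch, and the table and column names (DailyReports, Blob) are assumptions:

-- each JSON property becomes one row: [key] is the driver name, [value] is e.g. '100kms'
SELECT r.[Date],
       j.[key] AS DriverName,
       CAST(REPLACE(j.[value], 'kms', '') AS INT) AS KmsDriven
FROM dbo.DailyReports AS r
CROSS APPLY OPENJSON(r.Blob) AS j
WHERE j.[key] <> 'Hash';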
Is this the right way to use the Blob data?
No. It violates the principle of atomicity, and therefore the first normal form.
Are there any design patterns to adhere here to?
I'm not sure about "patterns", but I don't see why you would need a BLOB in this case.
Assuming the data is uniform (i.e. it always has the same fields), you can just declare the columns you need and write directly to them (as you already proposed).
Otherwise, you may consider using SQL Server's XML data type, which will enable you to extract some of the sections within an XML document, or insert a new section without replacing your whole document.
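A minimal sketch of that XML alternative, assuming you reshape each report as one XML document with a <driver> element per driver (all names here are hypothetical):

CREATE TABLE dbo.DailyReportsXml (
    [Date]  DATE,
    Report  XML
);

-- pull one row per <driver> element out of each report
SELECT r.[Date],
       d.value('@name', 'varchar(100)') AS DriverName,
       d.value('@kms',  'int')          AS KmsDriven
FROM dbo.DailyReportsXml AS r
CROSS APPLY r.Report.nodes('/report/driver') AS t(d);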
I've got a problem I have been trying to work out. I have googled a few things that are similar to what I want to do, but can't work out exactly how to do it.
I have around 250M IP addresses and I want to look them up against the MaxMind GeoLite2 data so that I can tell what country each IP address originates from.
I have imported all the data into my Redshift cluster with Talend.
Table A has an ID and 'ipaddress', e.g. 10.0.0.5.
Table B (MaxMind) has the country name and an IP range expressed as 10.0.0.0/24.
How could I use Redshift SQL to match these two, considering the size of my source data?
Edit: here's the link to the GeoLite2 data: https://dev.maxmind.com/geoip/geoip2/geolite2/
You could try using Amazon Redshift's ability to Import Custom Python Library Modules to load the netaddr library. Then, you could use the library within a User Defined Function written in Python.
See also: IP Range to CIDR conversion in Python?
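A sketch of what that could look like; the function and table names are assumptions, and it presumes netaddr has already been loaded into the cluster via CREATE LIBRARY:

CREATE OR REPLACE FUNCTION f_ip_in_cidr(ip VARCHAR, cidr VARCHAR)
RETURNS BOOLEAN
STABLE
AS $$
    # netaddr is the custom library imported via CREATE LIBRARY
    from netaddr import IPAddress, IPNetwork
    return IPAddress(ip) in IPNetwork(cidr)
$$ LANGUAGE plpythonu;

-- usage sketch: this evaluates the UDF for every (ip, range) pair, so with
-- 250M rows consider pre-filtering, e.g. on the first octet of the address
SELECT a.id, b.country_name
FROM table_a a
JOIN table_b b
  ON f_ip_in_cidr(a.ipaddress, b.ip_range);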
I'm using the Java API to query for all job ids using the code below
// list jobs in the project, including jobs started by other users
Bigquery.Jobs.List list = bigquery.jobs().list(projectId);
list.setAllUsers(true);
but it doesn't list the job IDs that were run via a Client ID for web applications (i.e. Metric Insights). I'm using private-key authentication.
The command line tool 'bq ls -j', in turn, gives me only the Metric Insights job IDs but not the ones run with private-key auth. Is there a way to get all of them?
The reason I'm doing this is trying to get better visibility into what queries are eating up our data usage. We have multiple sources of queries: metric insights, in house automation, some done manually, etc.
As of version 2.0.10, the bq client has support for API authorization using service account credentials. You can specify a particular service account with the following flags:
bq --service_account your_service_account_here@developer.gserviceaccount.com \
   --service_account_credential_store my_credential_file \
   --service_account_private_key_file mykey.p12 <your_commands, etc>
Type bq --help for more information.
My hunch is that listing jobs for all users is broken, and nobody has mentioned it since there is usually a workaround. I'm currently investigating.
Jordan -- It sounds like you're homing in on what we want to do. For all access that we've allowed into our project/dataset, we want to produce an aggregate report of the "totalBytesProcessed" for all queries executed.
The problem we're struggling with is that we have a handful of distinct java programs accessing our data, a 3rd party service (metric insights) and 7-8 individual users who have query access via the web interface. Fortunately the incoming data only has one source so explaining the cost for that is simple. For queries though I am kinda blind at the moment (and it appears queries will be the bulk of the monthly bill).
It would be ideal if I could get the underlying data for this report with just one listing made with a single top-level auth. With that, I think from the timestamps and the actual SQL text I can attribute each query to a source.
One thing that might make this problem far easier is if there were more information in the job record (or some text adornment in the job_id for queries). I don't see that I can assign my own jobIDs on queries (perhaps I missed it?) and perhaps recording some source information in the job record would be possible? Just thinking out loud now...
There are three INFORMATION_SCHEMA views you can query for this:
region-**.INFORMATION_SCHEMA.JOBS_BY_{USER, PROJECT, ORGANIZATION}
where ** should be replaced by your region.
Example query for JOBS_BY_USER in the eu region:
select
  count(*) as num_queries,
  date(creation_time) as date,
  sum(total_bytes_processed) as total_bytes_processed,
  sum(total_slot_ms) as total_slot_ms_cost
from
  -- note: unnesting jobs_by_user.referenced_tables here would produce one row
  -- per referenced table and inflate these aggregates, so it is left out
  `region-eu.INFORMATION_SCHEMA.JOBS_BY_USER`
group by
  2
order by 2 desc, total_bytes_processed desc;
Documentation is available at:
https://cloud.google.com/bigquery/docs/information-schema-jobs
I'm using SQLite to store some data. The primary database is on a NAS (Debian Lenny, 2.6.15, armv4l) since the NAS runs a script which updates the data every day. A typical "select * from tableX" looks like this:
2010-12-28|20|62.09|25170.0
2010-12-28|21|49.28|23305.7
2010-12-28|22|48.51|22051.1
2010-12-28|23|47.17|21809.9
When I copy the DB to my main computer (Mac OS X) and run the same SQL query, the output is:
2010-12-28|20|1.08115035175016e-160|25170.0
2010-12-28|21|2.39343503830763e-259|-9.25596535779558e+61
2010-12-28|22|-1.02951149572792e-86|1.90359837597183e+185
2010-12-28|23|-1.10707273937033e-234|-2.35343828462275e-185
The 3rd and 4th columns have the type REAL. Interesting fact: when the numbers are integers (i.e. they end with ".0"), there is no difference between the two databases. In all other cases, the differences are ... hm ... surprising? I can't seem to find a pattern.
If someone's got a clue - please share!
PS: sqlite3 -version output
Debian: 3.6.21 (lenny-backports)
Mac OS X: 3.6.12 (10.6)
In release 3.4.0 of SQLite, a compile-time flag was added:
Added the SQLITE_MIXED_ENDIAN_64BIT_FLOAT compile-time option to support ARM7 processors with goofy endianness.
I was having this same problem with an Arm920Tid device and my x86-based VM. The ARM device was writing the data, and I was trying to read it on the x86 VM (or on my Mac).
After adding this compile-time flag to the makefile for my ARM build, I was able to get sane values when I queried the DB on either platform.
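For a typical make-based build, that is a one-line change (assuming CFLAGS is passed through to the compiler):

CFLAGS += -DSQLITE_MIXED_ENDIAN_64BIT_FLOAT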
For reference, I am using SQLite 3.7.14.
It should be; the file format says that REAL is stored in big-endian format, which would be architecture-invariant if serialized correctly by both builds.
A value of 7 stored within the database record header indicates that the corresponding database value is an SQL real (floating point number). In this case the blob of data contains an 8-byte IEEE floating point number, stored in big-endian byte order.