I'm getting the wrong location when I query the GeoLite2-City.mmdb database with ip = '104.6.30.56' (from Python). Their demo site returns good data for this IP (https://www.maxmind.com/en/geoip-demo).
In [33]: import geoip2.database
In [34]: reader = geoip2.database.Reader('.../GeoLite2-City.mmdb')
In [35]: reader.city('104.6.30.56').city # should be Santa Rosa, Ca
Out[35]: geoip2.records.City(geoname_id=None, confidence=None, _locales=['en'], names={})
In [36]: reader.city('104.6.30.56').location # should be ~(38, -122)
Out[36]: geoip2.records.Location(postal_confidence=None, average_income=None, accuracy_radius=None, time_zone=None, longitude=-97.0, metro_code=None, population_density=None, postal_code=None, latitude=38.0)
In [37]: reader.city('173.194.116.131').city # works fine for Google
Out[37]: geoip2.records.City(geoname_id=5375480, confidence=None, _locales=['en'], names={u'ru': u'\u041c\u0430\u0443\u043d\u0442\u0438\u043d-\u0412\u044c\u044e', u'fr': u'Mountain View', u'en': u'Mountain View', u'de': u'Mountain View', u'zh-CN': u'\u8292\u5ef7\u7ef4\u5c24', u'ja': u'\u30de\u30a6\u30f3\u30c6\u30f3\u30d3\u30e5\u30fc'})
Versions:
In [39]: reader.metadata()
Out[39]: maxminddb.reader.Metadata(binary_format_major_version=2, description={u'en': u'GeoLite2 City database'}, record_size=28, database_type=u'GeoLite2-City', languages=[u'de', u'en', u'es', u'fr', u'ja', u'pt-BR', u'ru', u'zh-CN'], build_epoch=1438796457, ip_version=6, node_count=3199926, binary_format_minor_version=0)
In [40]: geoip2.__version__
Out[40]: '2.2.0'
Is this because I'm using the Lite version?
GeoIP location is only somewhat accurate.
Providers like MaxMind do their best to understand what IP address is associated with what geo location. However, that is a daunting task. IP addresses can be reassigned by the company that controls them, some companies do not publish the geography associated with an address, the IP you observe might belong to a proxy server far from the actual user, and there can be errors compiling the data.
Since their online system returns the correct geo location, this is probably an example of that final category.
In working extensively with geo location and correlating it to known facts about users, I observe that geo location databases are accurate around 85% - 90% of the time. Some providers do more than others to correctly handle the harder-to-handle IP addresses, but none of them are perfect.
If GeoIP returns the correct result and GeoLite does not, then yes, you're likely seeing the impact of the degraded accuracy of GeoLite. It's really a question of "do you want to pay, and if so, how much?"
Bear in mind that they recently introduced a third-level "Precision" service offering, of which the City database is itself now a degraded version.
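In the meantime, you can at least detect when GeoLite2 only has country-level data for an IP. Here is a minimal sketch using the same geoip2 reader as in your snippet (attribute names as I recall them from the geoip2 model classes; the database path is a placeholder):
import geoip2.database

reader = geoip2.database.Reader('GeoLite2-City.mmdb')  # adjust the path

def lookup(ip):
    r = reader.city(ip)
    # GeoLite2 frequently lacks city-level data: city.name comes back as None
    # and the coordinates fall back to a coarse centroid (e.g. 38.0, -97.0 for the US)
    return {
        'country': r.country.iso_code,
        'city': r.city.name,               # None when only country-level data exists
        'lat': r.location.latitude,
        'lng': r.location.longitude,
        'city_level': r.city.name is not None,
    }

print(lookup('104.6.30.56'))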
Related
Recently I've been occasionally receiving odd results from the Google Reverse Geocode API.
For example when looking up the address for coordinates 43.2379396 -72.44746565 the result I get is:
6HQ3+52 Springfield, VT, USA
In another case looking up 43.703563 -72.209753 results with:
PQ3R+C3 Hanover, NH, USA
Does anyone know what the initial 7 characters of the returned address signify? When I receive this type of result it's always 4 alphanumeric characters followed by a plus sign, then 2 more alphanumeric characters.
After some additional research I found that these are Plus Code addresses, a relatively new feature in Google Maps. They are used for places that don't have a street address, and they seem to have some similarities to what3words addresses.
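If it helps anyone else: a short code like these can be decoded with Google's open-source Open Location Code library. This is only a sketch, and the package and function names below are what I believe the Python reference implementation (openlocationcode on PyPI) uses, so verify against the version you install:
from openlocationcode import openlocationcode as olc

short_code = '6HQ3+52'
# short codes drop the area prefix, so recover the full code relative to
# nearby coordinates before decoding it back to a lat/lng area
full_code = olc.recoverNearest(short_code, 43.2379396, -72.44746565)
area = olc.decode(full_code)
print(full_code, area.latitudeCenter, area.longitudeCenter)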
I am using a premium account (not the sandbox) for data collection.
I want to collect:
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to US and not geolocated at tweet level, excluding all retweets
All tweets in English that contain ‘china’ or ‘chinese’ that are user geolocated to ‘Minnesota’ and not geolocated at tweet level, excluding all retweets
The code is as follows:
from searchtweets import load_credentials, gen_rule_payload, ResultStream

premium_search_args = load_credentials('twitter_API.yaml',
                                       yaml_key='search_tweets_premium_api',
                                       env_overwrite=False)
# keywords for the search
# keyword 1
keywords = '(China OR Chinese) lang:en profile_country:US -place_country:US -is:retweet'
# keyword 2
keywords = '(China OR Chinese) lang:en -place_country:US profile_region:"Minnesota" -is:retweet'
# define search rule
rule = gen_rule_payload(keywords, from_date='2019-12-01',
                        to_date='2019-12-10', results_per_call=500)
# create result stream and print before start
rs = ResultStream(rule_payload=rule, max_results=1250000,
                  **premium_search_args)
print(rs)
# collect the tweets
tweets = list(rs.stream())
My problems are that:
For the first one, a large portion of the results I get don't satisfy the query. First, some don't have the Profile Geo enrichment, i.e. the user.derived.locations attribute is not in the user object. Second, when it is present, many don't have country code US, i.e. they are attributed to other countries.
For the second one, the result I get from this method is a smaller subset of the results I can get from 1). That is, when I take the results from profile_country:US and filter for tweets whose users are geolocated to Minnesota (by user.derived.locations.region), I get a larger sample than when I use profile_region:"Minnesota" directly. A considerable amount of data is missing using this method.
I have tried several times, but it seems that the user geolocation operators don't work exactly as I expect. Does anyone have any idea why this is the case? I would very much appreciate any answers/suggestions/comments.
Thank you!
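For reference, this is the kind of post-hoc check I run on the collected tweets to see whether the Profile Geo enrichment actually matches the operators. The nested field names follow my reading of the enriched payload (and the tweets variable from the code above), so they may need adjusting for your response format:
def profile_geo_country(tweet):
    # returns the derived country code from the Profile Geo enrichment,
    # or None when the enrichment is missing (field names are assumed)
    locations = tweet.get('user', {}).get('derived', {}).get('locations', [])
    return locations[0].get('country_code') if locations else None

us_tweets = [t for t in tweets if profile_geo_country(t) == 'US']
print(len(tweets), len(us_tweets))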
For the millionth time I had a dataset today that listed full state names, but I needed it to list state postal code abbreviations. Here is a code snippet I wrote that mapped the changes for me using data from a generic website.
1) Anyone know of or think of a better solution?
2a) Anyone know of a better web reference? Using USPS sites (such as the ones below) does not seem to work with pd.read_html()
2b) I also had a hard time isolating the correct table from pd.read_html() and the wiki page at: https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations (see the sketch after the code below)
import pandas as pd
# Make Generic Data For Demonstration Purpose
data = {'StName':['Wisconsin','Minnesota','Minnesota',
'Wisconsin','Florida','New York']}
df = pd.DataFrame(data)
# Get State Crosswalk From Generic Website
crosswalk = 'http://app02.clerk.org/menu/ccis/Help/CCIS%20Codes/state_codes.html'
states = pd.read_html(crosswalk)[0]
# Demo Crosswalking State Name to State Abbreviation
df['StAbbr'] = df['StName'].map(dict(zip(states['Description'],
states['Code'])))
# Demo Reverse Crosswalking Back to State Name
df['StNameAgain'] = df['StAbbr'].map(dict(zip(states['Code'],
                                              states['Description'])))
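For what it's worth, regarding 2b: one thing that helps isolate the right table on the Wikipedia page is the match argument of pd.read_html, which keeps only tables whose text matches the given string. This is just a sketch; the wiki table's column labels still need inspecting before zipping them into a dict.
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_U.S._state_abbreviations'
# keep only tables that contain the string 'Wisconsin' somewhere in their text
tables = pd.read_html(url, match='Wisconsin')
print(len(tables))        # hopefully a short list of candidate tables
print(tables[0].head())   # inspect the columns before building the crosswalk dict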
I'd like to analyze the gambling activities in Bitcoin.
Does anyone have a list of addresses for gambling services such as SatoshiDICE and LuckyBit?
For example, I found addresses of SatoshiDICE here.
https://www.satoshidice.com/Bets.php
My suggestion would be to go and look for a list of popular addresses, i.e., addresses that received and/or sent a lot of transactions. Most gambling sites will use vanity addresses that include part of the site's name in the address, so you might also just search in the addresses for similar patterns.
It's rather easy to build such a list using Rusty Russell's bitcoin-iterate if you have a synced full node:
bitcoin-iterate --output "%os" -q > outputscripts.csv
This will get you a list of all output scripts in confirmed transactions in the blockchain. The output scripts include the pubkey hash that is also encoded in the address.
Let's keep only the P2PKH scripts of the form 76a914<pubkey-hash>88ac
grep -E '^76a914.*88ac$' outputscripts.csv > p2pkhoutputs.csv
Just for reference, 90.03% (484715631/538368714) of outputs are P2PKH scripts, so we should be getting pretty accurate results.
So let's count the occurrences of each output script:
sort p2pkhoutputs.csv | uniq -c | sort -g > uniqoutputscripts.csv
And finally let's convert the scripts to the addresses. We'll need to do the base58 encoding, and I chose the python base58 library:
from base58 import b58encode_check

def script2address(s):
    # s is the hex-encoded P2PKH output script; bytes 3..22 are the pubkey hash
    h = bytes.fromhex(s)[3:23]
    # prepend the 0x00 version byte used for mainnet P2PKH addresses
    h = b'\x00' + h
    return b58encode_check(h)
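To tie the pieces together, here is a small glue snippet; it assumes the "count outputscript" layout produced by uniq -c | sort -g above and prints the most frequent scripts as addresses:
# read the "count outputscript" lines from the sorted file
with open('uniqoutputscripts.csv') as f:
    rows = [line.split() for line in f if line.strip()]

# the file is sorted ascending, so the last rows are the most frequent scripts
for count, script in reversed(rows[-10:]):
    # b58encode_check may return bytes depending on the base58 version installed
    print(count, script2address(script))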
For details on how addresses are generated please refer to the Bitcoin wiki. And here we have the top 10 addresses sorted by incoming transactions:
1880739, 1NxaBCFQwejSZbQfWcYNwgqML5wWoE3rK4
1601154, 1dice8EMZmqKvrGE4Qc9bUFf9PX3xaYDp
1194169, 1LuckyR1fFHEsXYyx5QK4UFzv3PEAepPMK
1105378, 1dice97ECuByXAvqXpaYzSaQuPVvrtmz6
595846, 1dice9wcMu5hLF4g81u8nioL5mmSHTApw
437631, 1dice7fUkz5h4z2wPc1wLMPWgB5mDwKDx
405960, 1MPxhNkSzeTNTHSZAibMaS8HS1esmUL1ne
395661, 1dice7W2AicHosf5EL3GFDUVga7TgtPFn
383849, 1LuckyY9fRzcJre7aou7ZhWVXktxjjBb9S
As you can see, SatoshiDice and LuckyBit are very much present in the set. Grepping for the vanity address prefixes unearths a lot of addresses too.
I would suggest using the usual chain analysis approach: send money to these services and note the addresses, then compute transitive, symmetric, etc. closures over those addresses in the blockchain transaction graph to get all the addresses in their wallet.
No technique can determine the addresses in a wallet if the user is intelligent enough to mix properly.
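To illustrate the closure idea, here is a bare-bones sketch of the common input-ownership heuristic: addresses spent together as inputs of one transaction are merged into one cluster, so seeding with a known SatoshiDICE address pulls in the rest of that wallet. The transaction format (a list of input-address lists) is hypothetical and not tied to any particular API:
def cluster_addresses(tx_input_sets):
    # union-find over addresses; addresses co-spent as inputs share an owner
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    for inputs in tx_input_sets:
        if not inputs:
            continue
        root = find(inputs[0])              # registers single-input transactions too
        for addr in inputs[1:]:
            parent[find(addr)] = root

    clusters = {}
    for a in list(parent):
        clusters.setdefault(find(a), set()).add(a)
    return list(clusters.values())

# toy usage: two transactions sharing address 'B' end up in one cluster
print(cluster_addresses([['A', 'B'], ['B', 'C'], ['D']]))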
I am working on a new project and have spent the whole day thinking about and looking for the best way to save users' locations to the database.
I need to save their city and country. Here is a bit of a problem, because for Europeans this means, for example, city=Berlin, country=Germany. But for US users it's more like city=Los Angeles, state=California, country=USA.
So this is the problem I have been facing all day.
My goal in this app is to find all users in a given city according to their location, and also to find the people who are within, say, 15 km/miles of them.
I plan to implement the app in RoR with PostgreSQL; the app will probably run on Heroku.
What is the best way to solve this "problem"? Could you please give me some advice, tips, whatever?
Thank you
You can use the geokit and geokit-rails gems to achieve that. See here for documentation: https://github.com/imajes/geokit-rails
Basically, it works like this: you save address data of your users and that address is looked up and mapped to a point in space (lat/lng) using a geocoding service (e.g. Google Maps or Yahoo Placefinder). These points can then be used to calculate distances etc.
An example:
class User < ActiveRecord::Base
  # has the following db fields:
  # - city
  # - state
  # - country
  # - latitude
  # - longitude

  acts_as_mappable :default_units => :miles,
                   :default_formula => :sphere,
                   :lat_column_name => :latitude,
                   :lng_column_name => :longitude,
                   :auto_geocode => {:field => :full_address}

  def full_address
    [city, state, country].reject(&:blank?).join(', ')
  end
end
Then you can do the following:
# find all users in a city (this has nothing to do with geokit but is just a normal db lookup)
User.where(:city => 'New York')
# find all users that are within X miles of a place
User.find_within(300, 'Denver')
and much more, just see the documentation...
This example shows you how to use the geokit gem. The gem no longer seems to be under active development, so it might be worthwhile to check out geocoder instead: https://github.com/alexreisner/geocoder