I've been asked to generate some demographic reports (crime rates, births/deaths, etc.) by state and city for the USA. I have all the demographic data (provided by our client) but can't seem to find any source for the boundaries (read: lat/long coordinates) of the US states and their cities.
Our data are lat/long points (e.g. a crime, a birth), and we want to produce some mapped reports and also data-mine using SQL Server (we're on MS SQL Server 2008, but that shouldn't affect this question).
So... can anyone direct me to some state and city boundary sources? I know our government makes all this information available for free through the US Census Bureau, but I can't figure out where it's found or how to digest it.
I'm assuming this info will be in the form of lat/long polygons (e.g. a shapefile) which I can then import into the DB and mine away.
Can anyone help, please?
A bit late, but in case anyone stumbles across this like I did:
To get the city boundary layers, navigate to "Download Shapefiles" for the year you're interested in and select the city boundary layer, which is called "Places" (not an intuitive enough name for me...).
http://www.census.gov/cgi-bin/geo/shapefiles2011/main
That's the link for 2011's shapefiles.
I can suggest this post: http://adventuresdotnet.blogspot.com/2009/06/sql-server-2008-importing-tigerline.html
Here's a link to the 2008 TIGER/Line shapefiles, and it sounds like looking at the TIGER/Line Shapefiles FAQ would probably be helpful -- other info at the main page for the 2008 data set.
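If it helps, here's a rough sketch of the import step in Python, assuming geopandas and pyodbc; the shapefile name, table, and column names are made-up examples, and SQL Server 2008's geography::STGeomFromText does the WKT parsing server-side:

# Sketch: load a TIGER/Line "Places" shapefile into SQL Server 2008.
# Assumes: pip install geopandas pyodbc; the file, table, and column
# names below are made-up examples (apart from TIGER's NAME field).
import geopandas as gpd
import pyodbc

gdf = gpd.read_file("tl_2011_28_place.shp")  # hypothetical file name

conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;DATABASE=Demographics;"
    "Trusted_Connection=yes"
)
cur = conn.cursor()

for _, row in gdf.iterrows():
    # .wkt produces well-known text that SQL Server parses server-side.
    # Note: the 2008 geography type is picky about ring orientation;
    # if inserts fail, load into a geometry column instead.
    cur.execute(
        "INSERT INTO CityBoundaries (Name, Boundary) "
        "VALUES (?, geography::STGeomFromText(?, 4326))",
        row["NAME"], row.geometry.wkt,
    )
conn.commit()

Once the polygons are loaded, geography's STIntersects lets you join your crime/birth points against the boundaries.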
I know the Census site is difficult to navigate at times. Try the National Atlas at nationalatlas.gov, and also the National Map at http://nationalmap.gov/viewer.html
I think you are looking for GIS data. For example, I found the US Census Bureau TIGER/Line 2007FE Shapefiles for Mississippi, United States, although I am not sure what you get with the download nor how to exploit the information...
HTH.
I was not able to use the TIGER-something website; pain in the rear. Just download the file at the location below and trim it according to your needs.
http://www.nws.noaa.gov/geodata/catalog/national/html/cities.htm
So I'm currently working on a research paper on media bias (or lack thereof) towards 2020 presidential candidates.
For this, I'm looking for a way to make a huge database of sentences that mention these politicians by name or (if possible) with a pronoun. Right now I'd like to only focus on 5-7 of the biggest American news outlets (WaPo, NYT, FOX, etc.).
I want to collect all of these sentences into an Excel sheet, along with a timestamp of when the article was released and a link to the article itself. I actually don't know whether that's feasible or whether such a program/script exists.
Do you think there's a way to solve this, does it already exist, and if not, can a rookie programmer write a script for this?
Thank you for all your help in advance!
You'd probably just need to create your own web scraper. You could have a set of names that you're looking for, and if a name exists on the page you can apply some heuristics to extract the sentence it's in. You'll probably need some site-specific logic for getting the timestamp from each article. I'd say it wouldn't be too bad since you're targeting only a few news outlets, but probably a bit challenging for a rookie programmer.
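To make that concrete, here's a rough sketch with requests and BeautifulSoup; the candidate names, the placeholder URL, the timestamp selector, and the sentence-splitting heuristic are all assumptions you'd adapt per outlet:

# Sketch: pull sentences mentioning candidates from an article page.
# Assumes: pip install requests beautifulsoup4; selectors vary per outlet.
import csv
import re
import requests
from bs4 import BeautifulSoup

NAMES = {"Biden", "Sanders", "Warren", "Trump"}  # example candidate set

def scrape(url):
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    # Timestamp extraction is site-specific; <time datetime="..."> is common.
    time_tag = soup.find("time")
    timestamp = time_tag.get("datetime", "") if time_tag else ""
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    # Crude sentence splitter; a library like nltk would do better.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [(s, timestamp, url) for s in sentences
            if any(name in s for name in NAMES)]

with open("sentences.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence", "timestamp", "url"])
    for row in scrape("https://example.com/some-article"):  # placeholder URL
        writer.writerow(row)

The CSV output opens directly in Excel, which covers the spreadsheet requirement.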
Also, I recommend checking out something like https://www.webscraper.io/
I want to copy data from a website that sells courses like ITIL, PRINCE2, PMP, and many other IT-sector courses; there are descriptions for 20,000 different courses.
I want to use Selenium to scrape all the data, but the descriptions are still subject to copyright.
Kindly let me know how I can transform all of that description data so it keeps the same meaning but uses different words.
Is there any API that would let me build code that rewrites these descriptions using synonyms, or that changes their grammar into completely new sentences with the same meaning?
Kindly let me know where to start.
Thanks,
The task you are referring to is called paraphrasing.
There is a lot of research in this field. On arXiv you will find research papers on the topic. However, since you are asking for an API, I am assuming you don't want to implement these models yourself. Luckily, some authors have published their models online on GitHub. (Note: some are re-implementations by someone else.)
When you use one of these implementations, note that most offer a pre-trained model. Do read which data set was used for training and try to pick the one most similar to the data you are facing. By doing so, more words in the domain of your descriptions will be available and more synonyms can be used.
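As a concrete starting point, here's a rough sketch using the Hugging Face transformers library; the checkpoint name is just one example of a paraphrasing model from the hub, and the "paraphrase:" prefix is specific to how that model was trained:

# Sketch: paraphrase a sentence with a pre-trained seq2seq model.
# Assumes: pip install transformers torch; the checkpoint below is one
# example of a T5 paraphrasing model on the Hugging Face hub.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "Vamsi/T5_Paraphrase_Paws"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "paraphrase: This course prepares you for the ITIL certification exam."
inputs = tokenizer(text, return_tensors="pt")

# Sampling gives varied rewrites; num_return_sequences controls how many.
outputs = model.generate(
    **inputs, max_length=64, do_sample=True,
    top_k=120, top_p=0.95, num_return_sequences=3,
)
for out in outputs:
    print(tokenizer.decode(out, skip_special_tokens=True))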
I am currently designing a database for a library as a project for my database class. Below is the ER diagram so you can see the basic structure of the database:
I have completed all of the requirements of the assignment, so please don't think that I am asking for someone to do my homework. What I am asking for is ideas. We can add extra features such as a report generator, an entry form, etc. I have added a report generator to show the most active member and most popular book, plus an entry form. But I cannot think of anything else I can add to increase its usefulness; any suggestion would be appreciated.
There are a lot of possibilities:
Popularity can be split by gender, which could be interesting.
You could also work on popularity by address to make a geographical analysis, but if this is a local library you probably won't get much out of that.
Another stat that could be worthwhile is the average borrow length (a sketch follows below).
But the more interesting reports could be those that combine popularity, popularity by gender, and average borrow length with the catalog attributes... so you could have views by author, by publisher, or by publication year.
That is with the current structure. If you were to add genre or media type (book, magazine, CD, DVD) attributes, you would open up a whole lot of new dimensions.
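As promised above, a rough sketch of the average-borrow-length report; it's written against SQLite, and the table/column names are invented to match a typical member/book/loan design rather than your actual ER diagram:

# Sketch: "average borrow length" report. Schema names are invented
# to match a typical member/book/loan design, not the actual ER diagram.
import sqlite3

conn = sqlite3.connect("library.db")
rows = conn.execute(
    """
    SELECT b.title,
           AVG(julianday(l.returned_on) - julianday(l.borrowed_on)) AS avg_days
    FROM loans l
    JOIN books b ON b.book_id = l.book_id
    WHERE l.returned_on IS NOT NULL
    GROUP BY b.title
    ORDER BY avg_days DESC
    """
).fetchall()
for title, avg_days in rows:
    print(f"{title}: {avg_days:.1f} days on average")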
This is a "big" question, that I don't know how to start, so I hope some of you can give me a direction. And if this is not a "good" question, I will close the thread with an apology.
I wish to go through the database of Wikipedia (let's say the English one), and do statistics. For example, I am interested in how many active editors (which should be defined) Wikipedia had at each point of time (let's say in the last 2 years).
I don't know how to build such a database, how to access it, how to know which types of data it has and so on. So my questions are:
What tools do I need for this (besides basic R) ? MySQL on my computer? RODBC database connection?
How do you start planning for such a project?
You'll want to start here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Which will take you to here:
http://download.wikimedia.org/enwiki/20100312/
And the file you probably want is:
# 2010-03-17 04:33:50 done Log events to all pages.
* This contains the log of actions performed on pages.
* pages-logging.xml.gz 1.0 GB
http://download.wikimedia.org/enwiki/20100312/enwiki-20100312-pages-logging.xml.gz
You'll then import the XML into MySQL. Generating a histogram of users per day, week, year, etc. won't require R; you'll be able to do that with a single MySQL query. Something like:
SELECT DAYOFYEAR(wiki_edit_timestamp), COUNT(*)
FROM page_logs
GROUP BY DAYOFYEAR(wiki_edit_timestamp)
ORDER BY DAYOFYEAR(wiki_edit_timestamp);
etc.
(I'm not sure what their actual schema is, but it'll be something like that.)
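If you'd like a feel for the data before committing to MySQL, here's a rough sketch that streams the dump with Python's iterparse and tallies log events per day; the element names are assumptions based on MediaWiki's export format, so check them against the actual file:

# Sketch: count log events per day by streaming the compressed dump.
# Element names (e.g. timestamp) are assumptions from MediaWiki's export
# format; verify against the actual file before relying on this.
import gzip
from collections import Counter
from xml.etree.ElementTree import iterparse

per_day = Counter()
with gzip.open("enwiki-20100312-pages-logging.xml.gz", "rb") as f:
    for event, elem in iterparse(f):
        # Tags are namespaced, so match on the suffix only.
        if elem.tag.endswith("timestamp") and elem.text:
            per_day[elem.text[:10]] += 1  # e.g. "2010-03-12"
        elem.clear()  # free memory as we go

for day, count in sorted(per_day.items()):
    print(day, count)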
You'll run into issues, no doubt, but you'll learn a lot too. Good luck!
You could
work with the Wikipedia database dumps, as already mentioned
work with the live MediaWiki API; see this minimal example at Rosetta Code, my unfinished approach with an S3 class, or this package by Peter Konings (a sketch follows after this list)
work with DBpedia, an effort to extract knowledge from Wikipedia into a knowledge base. They offer an online SPARQL endpoint I don't know much about, and also datasets as N-Triples for download. See this Python script, which might be a starting point for an R script. This approach might be useful for accessing the content stored in Wikipedia (such as the infoboxes), but I am not sure whether information on contributors is available.
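For the API route, here's a rough sketch in Python that counts distinct editors in recent changes; "active editor" here just means "made at least one edit in the batch", a placeholder for whatever definition you settle on:

# Sketch: count distinct editors in recent changes via the MediaWiki API.
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "recentchanges",
    "rcprop": "user|timestamp",
    "rclimit": 500,          # max per request without a bot flag
    "format": "json",
}
editors = set()
resp = requests.get(API, params=params).json()
for change in resp["query"]["recentchanges"]:
    editors.add(change.get("user", ""))

print(f"{len(editors)} distinct editors in the last 500 changes")
# Page through older data with the rccontinue value the API returns.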
Try WikiXRay (Python/R) and Zotero.
Lately we've started getting issues with an outdated countries/regions list being presented to users of our web application.
We currently have a few DB tables to store localized country names along with their regions (states). However, as the world changes, that list is in constant evolution, and it's proving to be a pain to maintain: some regions are deleted, some merged, and existing data needs to be updated all the time.
What are the best practices, if any exist, when it comes to dealing with a multi-locale countries/regions list?
Is there a source or a standard in place? I know of ISO 3166, but their list isn't exactly DB-friendly... plus it's not fully localized.
An ideal solution would simply allow us to "sync" to it, preferably in multiple languages. The solution would preferably be free or subscription-based, with a history of what changed so we could update our data (a.k.a. tblAddress).
Thanks!
GeoNames is pretty accurate in this respect, and they update regularly.
http://www.geonames.org/export/
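For example, their countryInfo.txt dump is a plain tab-separated file you can re-fetch on a schedule to stay in sync; a rough sketch (localized names live in the separate alternateNames dump, which this doesn't touch):

# Sketch: pull GeoNames' country table and parse it for syncing.
# countryInfo.txt is tab-separated; lines starting with '#' are comments.
import requests

url = "http://download.geonames.org/export/dump/countryInfo.txt"
for line in requests.get(url).text.splitlines():
    if line.startswith("#") or not line.strip():
        continue
    fields = line.split("\t")
    iso2, iso3, name, continent = fields[0], fields[1], fields[4], fields[8]
    print(iso2, iso3, name, continent)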
There is no such thing. This is a political issue, which you can only solve in the context of your own application. Deciding to use ISO 3166 could be the easiest to defend. I know of issues with at least:
China/Taiwan
Israel/Palestine
China/Tibet
Greece/Macedonia
The ISO lists here are DB friendly, though they only include short names and codes.
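If your stack includes Python, the pycountry package wraps the same ISO 3166 lists (countries plus 3166-2 subdivisions) so you don't have to parse the files yourself; a rough sketch:

# Sketch: ISO 3166 countries and subdivisions via the pycountry package.
# pip install pycountry
import pycountry

us = pycountry.countries.get(alpha_2="US")
print(us.name, us.alpha_3, us.numeric)  # -> United States USA 840

# ISO 3166-2 subdivisions (states/regions) for one country:
for sub in list(pycountry.subdivisions.get(country_code="US"))[:5]:
    print(sub.code, sub.name)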
This one looks very good: multiple languages, an update option, a database-independent file format for import, countries/regions/cities information, and some other features you may or may not use.
And it's quite affordable if you need it for only one server.
You can try CLDR
http://cldr.unicode.org/
This data set is maintained by the Unicode Consortium. It is updated regularly, and the data is versioned, so it is easy for you to manage the state of your list.
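If you'd rather consume CLDR through a library than parse its XML, ICU bundles the same data; here's a rough sketch with PyICU, where the territory and language codes are just examples:

# Sketch: localized country names from CLDR data, via ICU's bundled copy.
# pip install PyICU; territory and language codes below are examples.
import icu

for code in ("DE", "TW", "GR"):
    territory = icu.Locale("und_" + code)   # locale with only a territory
    for lang in ("en", "fr", "zh"):
        print(code, lang, territory.getDisplayCountry(icu.Locale(lang)))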
Hi! You can find a free dump of all countries with their respective continents at https://gist.github.com/kamermans/1441495. It's easy to use: just download the dump and load it into your database.
Well, wait, do you just want an up-to-date list of countries? Or do you need to know that Country X has split into Country Y and Country Z? Because I'm not aware of any automated way to get the latter. Even updates to the ISO databases are distributed as PDFs (you're on your own for implementing the change).
The EU maintains data about Local Administrative Units (LAUs), which can be downloaded as hierarchical XLS files in several languages.
United Nations Statistics Division, Standard country or area codes for statistical use (M49).
Look for "Search and Download: Full View" on page left. That leads here.
Groups countries by continent, sub-continental region, Least Developed Countries, and so on.
If you cannot import the Excel version, note that the CSV has unquoted fields and a comma in one country name that will break your import ("Bonaire, Sint Eustatius and Saba"). Perhaps open it first in LibreOffice or whatever, fix the broken country name, and shunt its other right-most columns back into place. Then set all cells to type Text, save as CSV with [Edit Filter Settings] checked [x] in the Save As dialog, and make sure the string delimiter is set to ", as it should be by default.
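If you'd rather script that fix than do it by hand, here's a rough sketch that quotes the offending name before handing the file to Python's csv module (the file name is a placeholder for whatever you downloaded):

# Sketch: quote the one unquoted country name with a comma, then parse.
import csv

BROKEN = "Bonaire, Sint Eustatius and Saba"

with open("UNSD-M49.csv", encoding="utf-8") as f:
    text = f.read().replace(BROKEN, f'"{BROKEN}"')

for row in csv.reader(text.splitlines()):
    print(row[:3])  # first few columns of each record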