Storing IP addresses efficiently for faster lookups and insertions (proxy checking) - SQL

I'm writing a small Python 3 program that is supposed to test the validity of a large number of proxies. I want to reorganize the data so I can quickly look up an IP, test it via curl, and write into the database whether it works, along with a timestamp.
With about 50 000 rows, the 'simple way' takes too long, so I need some clever way of searching through IPs.
I'm new to SQL, but if I were to do it in some programming language, I would make something like this:
| IP_BYTE1 | IP_BYTE2 | IP_BYTE3 | IP_BYTE4 | TIMESTAMP | WORKS |
and then search 'left to right'.
Can anyone help me with creation of such a table and algorithm for fast lookup/insertion?

The simple way is to store them in a table using your favorite data type (varchar or int) and then build an index on them.
If you are looking for different classes of IP addresses, then you might want to break them into separate pieces. Are you generally looking at class D addresses? Or do you need to also look at classes A, B, and C?
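For reference, here is a minimal sketch of the simple way, assuming MySQL; the table and column names, and the use of INET_ATON to pack each IPv4 address into an unsigned 32-bit integer, are illustrative assumptions:

-- Sketch (MySQL): one packed IPv4 address per row, indexed via the primary key.
CREATE TABLE proxies (
    ip         INT UNSIGNED NOT NULL,  -- packed IPv4, e.g. INET_ATON('10.0.0.1')
    works      BOOLEAN      NOT NULL,
    checked_at TIMESTAMP    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (ip)                   -- makes a lookup a single index seek
);

-- Record (or refresh) a test result.
INSERT INTO proxies (ip, works)
VALUES (INET_ATON('203.0.113.7'), TRUE)
ON DUPLICATE KEY UPDATE works = VALUES(works), checked_at = CURRENT_TIMESTAMP;

-- Fast lookup of a single address.
SELECT works, checked_at FROM proxies WHERE ip = INET_ATON('203.0.113.7');

With the index in place, 50 000 rows is tiny; there is no need to search 'left to right' byte by byte.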

Related

Splunk query to take a search from one index and add a field's value from another index?

How can I write a Splunk query to take a search from one index and add a field's value from another index? I've been reading explanations that involve joins, subsearches, and coalesce, and none seem to do what I want -- even though the example is extremely simple. I am not sure what I am not understanding yet.
main-index has a src field, which is an IP address, and a field I will restrict my results on. I will look over a short time range, e.g.
index="main-index" sourcetype="main-index-source" main-index-field="wildcard-restriction*" earliest=-1h | stats count by src
other-index has a src_ip field, which is an IP address, and also has the hostname. It's DHCP leases, so I need to check a longer time frame and return only the most recent result for a given IP address. I want to get back the hostname from src_nt_host, e.g.
index="other-index" sourcetype="other-index-sourcetype" earliest=-14d
I would like to end up with the following values:
IP address, other-index.src_nt_host, main-index.count
main-index has the smallest amount of records, if that helps for performance reasons.
If I understand you correctly, you need to look at two different time ranges in two different indices.
In that case, a join will most likely be needed.
Here's one way it can be done:
index=ndx1 sourcetype=srctp1 field1="someval" src="*" earliest=-1h
| stats count by src
| join src
[| search index=ndx2 sourcetype=srctp2 field2="otherval" src_ip=* src_nt_host=* earliest=-14d
| stats count by src_ip src_nt_host
| fields - count
| rename src_ip as src ]
You may need to flip the order of the searches, depending on how many results they each return, and how long they take to run.
You may also be able to achieve what you're looking for in another manner, without the use of a join, but we'd need some sample data to suggest a better approach.

What is the best method to extract recurring blob data and put it in another table? - SQL

I'm developing a new webpage (in .NET Framework, if that helps) for the scenario below. Every single day, we get a cab drivers' report.
Date | Blob
-------------------------------------------------------------
15/07 | {"DriverName1":"100kms", "DriverName2":"10kms", "Hash":"Value"...}
16/07 | {"DriverName1":"50kms", "DriverName3":"100kms", "Hash":"Value"}
Notice that 'Blob' is the actual data received, in JSON format; it contains information about the distance covered by each driver on that particular day.
I have written a service which reads the above table, breaks it down further, and puts it into a new table like the one below:
Date  | DriverName  | KmsDriven
15/07 | DriverName1 | 100
15/07 | DriverName2 | 10
16/07 | DriverName3 | 100
16/07 | DriverName1 | 50
By populating this, I can easily do the following queries:
How many drivers drove on a particular day.
How 'DriverName1' did for a particular week, etc.
My questions here are:
Is there anything in the .NET/SQL world that specifically addresses this, or am I reinventing the wheel here?
Is this the right way to use the Blob data?
Are there any design patterns to adhere to here?
Is there anything in the .NET/SQL world that specifically addresses this,
or am I reinventing the wheel here?
Well, there are JSON parsers available, for example Newtonsoft's Json.NET. Or you can use SQL Server's own JSON functions. Once you have extracted the individual values from the JSON, you can write them into the corresponding columns in your new table.
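As a hedged sketch of the SQL Server route (OPENJSON requires SQL Server 2016 or later; the table and column names DailyReports, ReportDate, and Blob are assumptions):

-- Shred each day's JSON blob into one row per driver.
SELECT r.ReportDate,
       j.[key]                                    AS DriverName,
       CAST(REPLACE(j.[value], 'kms', '') AS INT) AS KmsDriven
FROM DailyReports AS r
CROSS APPLY OPENJSON(r.Blob) AS j   -- yields one (key, value) row per JSON field
WHERE j.[key] <> 'Hash';            -- skip the non-driver fields

The result has exactly the shape of your second table, so you could populate it with a single INSERT ... SELECT.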
Is this the right way to use the Blob data?
No. It violates the principle of atomicity, and therefore the first normal form.
Are there any design patterns to adhere to here?
I'm not sure about "patterns", but I don't see why you would need a BLOB in this case.
Assuming the data is uniform (i.e. it always has the same fields), you can just declare the columns you need and write directly to them (as you already proposed).
Otherwise, you may consider using SQL Server's XML data type, which will enable you to extract some of the sections within an XML document, or insert a new section without replacing your whole document.
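For illustration, assuming the report were stored in an XML column shaped like <report><driver name="DriverName1" kms="100"/></report> (the shape and all names here are assumptions), the shredding would look like:

-- Sketch (SQL Server): extract driver rows from an XML column.
SELECT r.ReportDate,
       d.value('@name', 'varchar(100)') AS DriverName,
       d.value('@kms',  'int')          AS KmsDriven
FROM DailyReports AS r
CROSS APPLY r.ReportXml.nodes('/report/driver') AS t(d);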

Doing multiple queries in PostgreSQL - conditional loop

Let me first start by stating that in the last two weeks I have received ENORMOUS help from just about all of you (ok ok not all... but I think perhaps two dozen people commented, and almost all of these comments were helpful). This is really amazing and I think it shows that the stackoverflow team really did something GREAT altogether. So thanks to all!
Now as some of you know, I am working at a campus right now and I have to use a windows machine. (I am the only one who has to use windows here... :( )
Now I managed to set up (OK, the IT department did that for me) and populate a Postgres database (this I did on my own) with about 400 MB of data. That's perhaps not much for most of you heavy Postgres users, but I was more used to SQLite databases for personal use, which rarely ever exceeded 2 MB.
Anyway, sorry for being so chatty - the queries on that database now work nicely. I use Ruby to do the queries, actually.
The entries in the Postgres database are interconnected, insofar as they act like "pointers": each entry has a field that points to another entry.
Example:
entry 3667 points to entry 35785 which points to entry 15566. So it is quite simple.
The main entry is 1, so the end of all these chains is 1. From any other number, we eventually reach 1 as the final result.
I am using Ruby to make individual queries against the database until the last result returned is 1. This can take up to 10 individual queries. I do this by logging into psql with my password and credentials and then performing each SQL query via -c. This is probably not ideal: these logins and queries take a little time, and ideally I would log in only once, perform ALL the queries inside Postgres, and then exit with the result (all the visited entries).
Now here comes my question:
- Is there a way to make conditional queries all inside of Postgres?
I know how to do it in a shell script and in Ruby, but I do not know if this is available in PostgreSQL at all.
I would need to make the query, in literal english, like so:
"Please give me all the entries that point to the parent entry, until the last found entry is eventually 1, then return all of these entries."
I already "solved" it by using ruby to make several queries until 1 is eventually returned, but this strikes me as fairly inelegant and possibly not effective.
Any information is very much appreciated - thanks!
Edit (argh, I fail at pasting...):
Example dataset, the table would be like this:
 id | parent
----+--------
  1 |      1
  2 | 131567
  6 | 335928
  7 |      6
  9 |      1
 10 | 135621
 11 |      9
I hope that works; I tried to narrow it down to a minimal example.
For instance, id 11 points to id 9, and id 9 points to id 1.
It would be great if one could use SQL to return:
11 -> 9 -> 1
Unless you give some example table definitions, what you're asking for vaguely reminds me of a tree structure, which could be traversed with recursive queries: http://www.postgresql.org/docs/8.4/static/queries-with.html
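For the edited example above, a recursive CTE can walk the whole chain in one round trip. A minimal sketch, assuming the table is called entries (PostgreSQL 8.4+ supports WITH RECURSIVE):

-- Walk the parent chain from id 11 up to the self-pointing root.
WITH RECURSIVE chain AS (
    SELECT id, parent
    FROM entries
    WHERE id = 11              -- starting entry
  UNION ALL
    SELECT e.id, e.parent
    FROM entries e
    JOIN chain c ON e.id = c.parent
    WHERE c.id <> c.parent     -- stop once the root (1 -> 1) is reached
)
SELECT id FROM chain;          -- returns 11, 9, 1

This replaces the up-to-10 round trips from Ruby with a single query.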

SQL Most effective way to store every word in a document separately

Here's my situation (or see TLDR at bottom): I'm trying to make a system that will search for user entered words through several documents and return the documents that contain those words. The user(s) will be searching through thousands of documents, each of which will be 10 - 100+ pages long, and stored on a webserver.
The solution I have right now is to store each unique word in a table with an ID (only maybe 120 000 relevant words in the English language), and then in a separate table store the word id, the document it is in, and the number of times it appears in that document.
E.g: Document foo's text is
abc abc def
and document bar's text is
abc def ghi
Documents table will have
id | name
 1 | 'foo'
 2 | 'bar'
Words table:
id | word
 1 | 'abc'
 2 | 'def'
 3 | 'ghi'
Word Document table:
word id | doc id | occurrences
      1 |      1 |           2
      1 |      2 |           1
      2 |      1 |           1
      2 |      2 |           1
      3 |      2 |           1
As you can see, when you have thousands of documents and each has thousands of unique words, the Word Document table blows up very quickly and takes way too long to search through.
TL;DR My question is this:
How can I store searchable data from large documents in an SQL database, while retaining the ability to use my own search algorithm (I am aware SQL has one built in for .docs and PDFs) based on custom factors (like occurrence, among others), without an outright massive table linking each word to a document and its properties in that document?
Sorry for the long read and thanks for any help!
Rather than building your own search engine using SQL Server, have you considered using a C# .NET implementation of the Lucene search APIs? Have a look at https://github.com/apache/lucene.net
Good question. I would piggyback on the existing solution of SQL Server (full-text indexing). It has an integrated indexing engine which optimises considerably better than your own code probably could (or the developers at Microsoft are lazy, or they just got a dime to build it :-)
Please see SQL Server's full-text indexing background. You could query views such as sys.fulltext_index_fragments or use stored procedures.
Of course, piggybacking on an existing solution has some drawbacks:
You need to have a license for the solution.
When your needs can no longer be served, you will have to program it all yourself.
But if you let SQL Server do the indexing, you could build your own solution on top of it more easily and in less time.
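As a hedged sketch of that route (SQL Server syntax; the table, column, catalog, and key-index names are assumptions):

-- One-time setup: full-text index over the document body.
CREATE FULLTEXT CATALOG DocumentsCatalog;
CREATE FULLTEXT INDEX ON Documents(Body)
    KEY INDEX PK_Documents ON DocumentsCatalog;

-- Query: rank documents by relevance to the search terms.
SELECT d.name, ft.[RANK]
FROM FREETEXTTABLE(Documents, Body, 'abc def') AS ft
JOIN Documents AS d ON d.id = ft.[KEY]
ORDER BY ft.[RANK] DESC;

You could then blend FREETEXTTABLE's RANK with your own custom factors (occurrences and so on) in the ORDER BY.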
Your question strikes me as being naive. In the first place... you are begging the question. You are giving a flawed solution to your own problem... and then explaining why it can't work. Your question would be much better if you simply described what your objective is... and then got out of the way so that people smarter than you could tell you HOW to accomplish that objective.
Just off hand... the database sounds like a really dumb idea to me. People have been grepping text with command line tools in UNIX-like environments for a long time. Either something already exists that will solve your problem or else a decent perl script will "fake" it for you-- depending on your real world constraints, of course.
Depending on what your problem actually is, I suspect that this could get into some really interesting computer science questions-- indexing, Bayesian filtering, and who knows what else. I suspect, however, that you're making a very basic task more complicated than it needs to be.
TL;DR My answer is this:
Why wouldn't you just write a script to go through a directory... and then use regexes to count the occurrences of the word in each file found there?

Should I initialize my AUTO_INCREMENT id column to 2^32+1 instead of 0?

I'm designing a new system to store short text messages [sic].
I'm going to identify each message by a unique identifier in the database, and use an AUTO_INCREMENT column to generate these identifiers.
Conventional wisdom says that it's okay to start with 0 and number my messages from there, but I'm concerned about the longevity of my service. If I make an external API, and make it to 2^31 messages, some people who use the API may have improperly stored my identifier in a signed 32-bit integer. At this point, they would overflow or crash or something horrible would happen. I'd like to avoid this kind of foo-pocalypse if possible.
Should I "UPDATE message SET id=2^32+1;" before I launch my service, forcing everyone to store my identifiers as signed 64-bit numbers from the start?
If you want to achieve your goal and avoid the problems that cletus mentioned, the solution is to set your starting value to 2^32+1. There are still plenty of IDs to go, and the value won't fit in a 32-bit integer, signed or otherwise.
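In MySQL, for instance, that is a one-liner before launch (a sketch; the message table name comes from the question, and the id column must be a BIGINT for the values to fit):

-- Start the AUTO_INCREMENT counter above the 32-bit range.
ALTER TABLE message AUTO_INCREMENT = 4294967297;  -- 2^32 + 1

That way the very first ID handed out already overflows a 32-bit integer, and no row ever needs updating.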
Of course, documenting the value's range and providing guidance to your API or data customers is the only right solution. Someone's always going to try to stick a long into a char and wonder why it doesn't work (always).
What if you provided a set of test suites or a test service that used messages in the "high but still valid" range and persuade your service users to use it to validate their code is proper? Starting at an arbitrary value for defensive reasons is a little weird to me; providing sanity tests rubs me right.
Actually 0 can be problematic with many persistence libraries. That's because they use it as some sort of sentinel value (a substitute for NULL). Rightly or wrongly, I would avoid using 0 as a primary key value. Convention is to start at 1 and go up. With negative numbers you're likely just to confuse people for no good reason.
If everyone alive on the planet sent one message per second, every second, non-stop, your counter wouldn't wrap until around the year 2050 using 64-bit integers: 2^63 is roughly 9.2×10^18, and about 6.7 billion people times 31.5 million seconds per year is roughly 2.1×10^17 messages per year, which gives over 40 years of headroom.
Probably just starting at 1 would be sufficient.
(But if you did start at the lower bound, -2^63, the doubled range would extend into the start of 2092.)
Why use incrementing IDs? These require locking and will kill any plans for distributing your service over multiple machines. I would use UUIDs. API users will likely store these as opaque character strings, which means you can probably change the scheme later if you like.
If you want to ensure that messages have an order, implement the ordering like a linked list:
---
id: 61746144-3A3A-5555-4944-3D5343414C41
msg: "Hello, world"
next: 006F6F66-0000-0000-655F-444E53000000
prev: null
posted_by: jrockway
---
id: 006F6F66-0000-0000-655F-444E53000000
msg: "This is my second message EVER!"
next: 00726162-0000-0000-655F-444E53000000
prev: 61746144-3A3A-5555-4944-3D5343414C41
posted_by: jrockway
---
id: 00726162-0000-0000-655F-444E53000000
msg: "OH HAI"
next: null
prev: 006F6F66-0000-0000-655F-444E53000000
posted_by: jrockway
(As an aside, if you are actually returning the results as YAML, you can use & and * references instead of just using the IDs as data. Then the client will get the linked-list structure "for free".)
One thing I don't understand is why developers don't grasp that they don't need to expose their AUTO_INCREMENT field. For example, richardtallent mentioned using Guids as the primary key. I say do one better. Use a 64bit Int for your table ID/Primary Key, but also use a GUID, or something similar, as your publicly exposed ID.
An example Message table:
Name | Data Type
-------------------------------------
Id | BigInt - Primary Key
Code | Guid
Message | Text
DateCreated | DateTime
Then your data looks like:
Id | Code | Message | DateCreated
-------------------------------------------------------------------------------
1 | 81e3ab7e-dde8-4c43-b9eb-4915966cf2c4 | ....... | 2008-09-25T19:07:32-07:00
2 | c69a5ca7-f984-43dd-8884-c24c7e01720d | ....... | 2007-07-22T18:00:02-07:00
3 | dc17db92-a62a-4571-b5bf-d1619210245a | ....... | 2001-01-09T06:04:22-08:00
4 | 700910f9-a191-4f63-9e80-bdc691b0c67f | ....... | 2004-08-06T15:44:04-07:00
5 | 3b094cf9-f6ab-458e-965d-8bda6afeb54d | ....... | 2005-07-16T18:10:51-07:00
Where Code is what you would expose to the public whether it be a URL, Service, CSV, Xml, etc.
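A sketch of that table in SQL Server syntax (the defaults shown are one possible choice, not the only one):

CREATE TABLE Message (
    Id          BIGINT IDENTITY(1,1) PRIMARY KEY,           -- internal, never exposed
    Code        UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),  -- public identifier
    Message     NVARCHAR(MAX),
    DateCreated DATETIME NOT NULL DEFAULT GETUTCDATE()
);

-- Index the public ID so lookups by Code stay fast.
CREATE UNIQUE INDEX IX_Message_Code ON Message (Code);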
Don't want to be the next Twitter, eh? lol
If you're worried about scalability, consider using a GUID (uniqueidentifier) instead.
They are only 16 bytes (twice that of a bigint), but they can be assigned independently on multiple database or BL servers without worrying about collisions.
Since they are random, use NEWSEQUENTIALID() (in SQL Server) or a COMB technique (in your business logic or pre-MSSQL 2005 database) to ensure that each GUID is "higher" than the last one (speeds inserts into your table).
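For example, a minimal sketch (SQL Server; note that NEWSEQUENTIALID() is only valid as a column default):

-- Sequential GUIDs avoid the page splits that purely random GUID keys cause.
CREATE TABLE Message (
    Id  UNIQUEIDENTIFIER NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
    Msg NVARCHAR(MAX)
);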
If you start with a number that high, some "genius" programmer will either subtract 2^32 to squeeze it in an int, or will just ignore the first digit (which is "always the same" until you pass your first billion or so messages).