Discover what companies are behind every visit using Elasticsearch

In this article we explain how to deanonymize every visit that lands on your website and discover which companies are behind them.

If you have a website and sell products and/or services to other companies, this simple and straightforward tutorial could give you some useful insights for designing your business strategy.

How? What do we need?

Fine. Let’s begin!

The only thing that we need, as you can guess, is the IP address.

Yes, that’s the only thing. Before digging into the solution, we need to understand some concepts:

ASN (Autonomous System Number): a unique number that identifies an autonomous system (AS). The Internet Assigned Numbers Authority (IANA) allocates blocks of ASNs to the regional Internet registries, which in turn assign them to organizations.
AS (Autonomous System): a set of IP address blocks that share a clearly defined policy for accessing external networks and are administered by a single organization, although it may be made up of several operators.
Ingest pipeline: a feature provided by Elasticsearch. It is a series of processors that are executed, in order, on the data before it is indexed; in this case, on the IP address.
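To make the link between an IP address and its organization more concrete, here is a minimal Python sketch of the kind of lookup an ASN database performs. The table is hand-written and hypothetical: the Google entry mirrors the example used later in this article, while the second entry uses a documentation-only address range with a made-up ASN and name. A real database such as GeoLite2-ASN holds far more entries and is stored differently.

```python
import ipaddress

# Hypothetical, hand-written table: network block -> (ASN, organization).
ASN_TABLE = {
    ipaddress.ip_network("8.8.8.0/24"): (15169, "Google LLC"),
    ipaddress.ip_network("192.0.2.0/24"): (64500, "Example Org"),  # made-up entry
}


def lookup_asn(ip):
    """Return (asn, organization, network) for the most specific block, or None."""
    address = ipaddress.ip_address(ip)
    matches = [network for network in ASN_TABLE if address in network]
    if not matches:
        return None
    # Prefer the longest prefix, i.e. the smallest block containing the address.
    best = max(matches, key=lambda network: network.prefixlen)
    asn, organization = ASN_TABLE[best]
    return asn, organization, str(best)


print(lookup_asn("8.8.8.8"))   # (15169, 'Google LLC', '8.8.8.0/24')
print(lookup_asn("10.0.0.1"))  # None: not in our tiny table
```

The geoip processor described below does this job for us, so none of this code is needed in practice.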

With these three concepts in mind, the Elasticsearch feature we are going to use is the ingest pipeline, so we need to define one.

Ingest Pipeline

Let’s explain the pipeline that we are going to create.

description: a short text describing what the pipeline does.
processors: the list of processors to be executed, in order.
pipeline_deanonymize_ip: a unique ID that identifies the pipeline.
geoip: the processor that we need.

PUT _ingest/pipeline/pipeline_deanonymize_ip
{
  "description": "Our pipeline to discover companies behind IP address",
  "processors": [
    {
      "geoip": {
        "field": "ip",
        "database_file": "GeoLite2-ASN.mmdb"
      }
    }
  ]
}

The geoip processor does all the work for us. We only need to define two things:

field: the name of the document field that the geoip processor reads from, that is, the field holding the IP address to look up before the document is indexed.
database_file: the database used for the lookup. Since we need to link the IP address to an ASN, we point it at GeoLite2-ASN.mmdb, the MaxMind database that maps IP ranges to ASNs and organization names.
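If you prefer to generate the pipeline definition from code, the same body can be built as a plain Python dictionary. This is only a sketch: the function name asn_pipeline is made up for this example, and target_field and ignore_missing are optional geoip settings that the pipeline above does not use. They are included here under the assumption that you may want to rename the output field or skip documents that have no ip field.

```python
import json


def asn_pipeline(field="ip", target_field="geoip", ignore_missing=True):
    """Build the request body for PUT _ingest/pipeline/<id> as a dict."""
    return {
        "description": "Our pipeline to discover companies behind IP address",
        "processors": [
            {
                "geoip": {
                    "field": field,
                    "database_file": "GeoLite2-ASN.mmdb",
                    # Optional settings, not part of the article's pipeline:
                    "target_field": target_field,      # where the result is stored
                    "ignore_missing": ignore_missing,  # skip docs without the field
                }
            }
        ],
    }


body = asn_pipeline()
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of the PUT request shown above.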

Simulate endpoint

Great! Now that our pipeline is defined, let’s use it!

Elasticsearch provides us with an endpoint to simulate the behavior of the pipeline.

POST /_ingest/pipeline/pipeline_deanonymize_ip/_simulate
{
  "docs": [
    {
      "_source": {
        "ip": "8.8.8.8"
      }
    }
  ]
}

Result:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "geoip" : {
            "organization_name" : "Google LLC",
            "asn" : 15169,
            "ip" : "8.8.8.8",
            "network" : "8.8.8.0/24"
          },
          "ip" : "8.8.8.8"
        }
      }
    }
  ]
}
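Once you have a response like this, pulling out the organization is just a matter of walking the JSON. The following Python sketch parses the response above, pasted in as a string; the helper name organizations is made up for this example, and the code assumes the default geoip target field used by our pipeline.

```python
import json

# The simulate response from this article, pasted in as a string.
SIMULATE_RESPONSE = """
{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_type": "_doc",
        "_id": "_id",
        "_source": {
          "geoip": {
            "organization_name": "Google LLC",
            "asn": 15169,
            "ip": "8.8.8.8",
            "network": "8.8.8.0/24"
          },
          "ip": "8.8.8.8"
        }
      }
    }
  ]
}
"""


def organizations(response):
    """Yield (ip, asn, organization_name) for every simulated document."""
    for entry in response.get("docs", []):
        source = entry.get("doc", {}).get("_source", {})
        # The geoip field may be missing, e.g. when no match was found.
        geoip = source.get("geoip", {})
        yield source.get("ip"), geoip.get("asn"), geoip.get("organization_name")


rows = list(organizations(json.loads(SIMULATE_RESPONSE)))
print(rows)  # [('8.8.8.8', 15169, 'Google LLC')]
```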

To use the pipeline on your real documents, you have to name it at index time, for example with the pipeline query parameter of the index request or by setting index.default_pipeline on the index. Elasticsearch then runs the pipeline on each document before storing it.
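As an illustration, here is a minimal Python sketch that builds such an index request using only the standard library. The cluster address http://localhost:9200 (with no authentication) and the index name visits are assumptions made for this example, so adjust them to your setup. The request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local cluster without authentication
INDEX = "visits"                  # hypothetical index name


def build_index_request(ip, pipeline="pipeline_deanonymize_ip"):
    """Build a POST <index>/_doc?pipeline=<id> request for one visit."""
    query = urllib.parse.urlencode({"pipeline": pipeline})
    return urllib.request.Request(
        url=f"{ES_URL}/{INDEX}/_doc?{query}",
        data=json.dumps({"ip": ip}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request("8.8.8.8")
print(request.get_method(), request.full_url)
# To send it to a running cluster: urllib.request.urlopen(request)
```

The stored document would then contain the geoip field shown in the simulate result, next to the original ip field.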

Conclusion

As we saw, an ingest pipeline is easy to set up, and Elasticsearch provides many other powerful features and tools to make the most of your data. In just a few simple steps, we can discover which companies are interested in what we are offering.