Still struggling with tokens and filters and such. The end goal is to aggregate over a list of social media posts that we've gathered and produce a word (a.k.a. tag) cloud.
The struggle is ignoring words like "then" and "when" (yes, a stop filter handles that) and collapsing plurals and other inflected forms, such as "pumped" and "pumping", into a single token.
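As a quick sanity check, here's a minimal sketch (not the final index config) that runs just a stop filter and kstem over the problem words. I'm passing an explicit stopwords list because I'm not sure "when" is in the default English stop set:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stop", "stopwords": [ "then", "when" ] },
    "kstem"
  ],
  "text": "then when pumped pumping"
}
```

Both stop words should disappear, and "pumped" and "pumping" should each come back as a single "pump" token.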
I think I've got the right setup for the index now:
```json
GET /_analyze
{
  "tokenizer": "classic",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<EMOJI>", "<NUM>" ],
      "mode": "exclude"
    },
    {
      "type": "length",
      "min": 2
    },
    "kstem",
    "stop",
    "classic",
    "asciifolding"
  ],
  "text": "açaí à la carte can't the foxes trumpet trumping trump trumps trump's milk milks milky jumping jumper quicker quickly 1234 attentive attention 🙏"
}
```
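For the real index, the same chain can be wired up as a custom analyzer in the settings. This is just a sketch: "posts", "body", and the filter names are placeholders I made up, and I've enabled fielddata on the text field because a terms aggregation needs it (at a heap cost). One aside I'm unsure about: I believe "<EMOJI>" is a token type emitted by the standard tokenizer rather than classic, so that keep_types entry may be a no-op here (the 🙏 drops out regardless).

```json
PUT /posts
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_emoji_num": {
          "type": "keep_types",
          "types": [ "<EMOJI>", "<NUM>" ],
          "mode": "exclude"
        },
        "min_length_2": {
          "type": "length",
          "min": 2
        }
      },
      "analyzer": {
        "tag_cloud": {
          "type": "custom",
          "tokenizer": "classic",
          "filter": [
            "drop_emoji_num",
            "min_length_2",
            "kstem",
            "stop",
            "classic",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "tag_cloud",
        "fielddata": true
      }
    }
  }
}
```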
This produces the following tokens, which is just about right...
{ "tokens" : [ { "token" : "acai", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "la", "start_offset" : 7, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "carte", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "can't", "start_offset" : 16, "end_offset" : 21, "type" : "<APOSTROPHE>", "position" : 4 }, { "token" : "fox", "start_offset" : 26, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "trumpet", "start_offset" : 32, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "trump", "start_offset" : 40, "end_offset" : 48, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "trump", "start_offset" : 49, "end_offset" : 54, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "trumps", "start_offset" : 55, "end_offset" : 61, "type" : "<ALPHANUM>", "position" : 10 }, { "token" : "trump", "start_offset" : 62, "end_offset" : 69, "type" : "<APOSTROPHE>", "position" : 11 }, { "token" : "milk", "start_offset" : 70, "end_offset" : 74, "type" : "<ALPHANUM>", "position" : 12 }, { "token" : "milk", "start_offset" : 75, "end_offset" : 80, "type" : "<ALPHANUM>", "position" : 13 }, { "token" : "milky", "start_offset" : 81, "end_offset" : 86, "type" : "<ALPHANUM>", "position" : 14 }, { "token" : "jump", "start_offset" : 87, "end_offset" : 94, "type" : "<ALPHANUM>", "position" : 15 }, { "token" : "jumper", "start_offset" : 95, "end_offset" : 101, "type" : "<ALPHANUM>", "position" : 16 }, { "token" : "quick", "start_offset" : 102, "end_offset" : 109, "type" : "<ALPHANUM>", "position" : 17 }, { "token" : "quick", "start_offset" : 110, "end_offset" : 117, "type" : "<ALPHANUM>", "position" : 18 }, { "token" : "1234", "start_offset" : 118, "end_offset" : 122, "type" : "<ALPHANUM>", "position" : 19 }, { "token" : "attentive", "start_offset" : 123, "end_offset" : 132, "type" : "<ALPHANUM>", "position" : 20 }, { "token" : "attention", "start_offset" : 133, "end_offset" : 142, "type" : "<ALPHANUM>", "position" : 21 } ] }