Still struggling with tokens and filters and such. The end goal is to aggregate over a list of social media posts that we've gathered and produce a word (a.k.a. tag) cloud.
The struggle is ignoring words like "then" and "when" (yes, a stop filter handles that) and collapsing plurals and other inflected forms, such as "pumped" and "pumping", into a single token.
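As a quick sanity check, here's a minimal sketch (not the final index config) that runs just a stop filter and kstem over the problem words. I'm passing an explicit stopwords list because I'm not sure "when" is in the default English stop set:

```json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stop", "stopwords": [ "then", "when" ] },
    "kstem"
  ],
  "text": "then when pumped pumping"
}
```

Both stop words should disappear, and "pumped" and "pumping" should each come back as a single "pump" token.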
I think I've got the right setup for the index now:
```json
GET /_analyze
{
  "tokenizer": "classic",
  "filter": [
    {
      "type": "keep_types",
      "types": [ "<EMOJI>", "<NUM>" ],
      "mode": "exclude"
    },
    {
      "type": "length",
      "min": 2
    },
    "kstem",
    "stop",
    "classic",
    "asciifolding"
  ],
  "text": "açaí à la carte can't the foxes trumpet trumping trump trumps trump's milk milks milky jumping jumper quicker quickly 1234 attentive attention 🙏"
}
```
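For the real index, the same chain can be wired up as a custom analyzer in the settings. This is just a sketch: "posts", "body", and the filter names are placeholders I made up, and I've enabled fielddata on the text field because a terms aggregation needs it (at a heap cost). One aside I'm unsure about: I believe "<EMOJI>" is a token type emitted by the standard tokenizer rather than classic, so that keep_types entry may be a no-op here (the 🙏 drops out regardless).

```json
PUT /posts
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_emoji_num": {
          "type": "keep_types",
          "types": [ "<EMOJI>", "<NUM>" ],
          "mode": "exclude"
        },
        "min_length_2": {
          "type": "length",
          "min": 2
        }
      },
      "analyzer": {
        "tag_cloud": {
          "type": "custom",
          "tokenizer": "classic",
          "filter": [
            "drop_emoji_num",
            "min_length_2",
            "kstem",
            "stop",
            "classic",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "tag_cloud",
        "fielddata": true
      }
    }
  }
}
```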
This produces the following tokens, which is just about right...
{ "tokens" : [ { "token" : "acai", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "la", "start_offset" : 7, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "carte", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "can't", "start_offset" : 16, "end_offset" : 21, "type" : "<APOSTROPHE>", "position" : 4 }, { "token" : "fox", "start_offset" : 26, "end_offset" : 31, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "trumpet", "start_offset" : 32, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "trump", "start_offset" : 40, "end_offset" : 48, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "trump", "start_offset" : 49, "end_offset" : 54, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "trumps", "start_offset" : 55, "end_offset" : 61, "type" : "<ALPHANUM>", "position" : 10 }, { "token" : "trump", "start_offset" : 62, "end_offset" : 69, "type" : "<APOSTROPHE>", "position" : 11 }, { "token" : "milk", "start_offset" : 70, "end_offset" : 74, "type" : "<ALPHANUM>", "position" : 12 }, { "token" : "milk", "start_offset" : 75, "end_offset" : 80, "type" : "<ALPHANUM>", "position" : 13 }, { "token" : "milky", "start_offset" : 81, "end_offset" : 86, "type" : "<ALPHANUM>", "position" : 14 }, { "token" : "jump", "start_offset" : 87, "end_offset" : 94, "type" : "<ALPHANUM>", "position" : 15 }, { "token" : "jumper", "start_offset" : 95, "end_offset" : 101, "type" : "<ALPHANUM>", "position" : 16 }, { "token" : "quick", "start_offset" : 102, "end_offset" : 109, "type" : "<ALPHANUM>", "position" : 17 }, { "token" : "quick", "start_offset" : 110, "end_offset" : 117, "type" : "<ALPHANUM>", "position" : 18 }, { "token" : "1234", "start_offset" : 118, "end_offset" : 122, "type" : "<ALPHANUM>", "position" : 19 }, { "token" : "attentive", "start_offset" : 123, "end_offset" : 132, "type" : "<ALPHANUM>", "position" : 20 }, { "token" : "attention", "start_offset" : 133, "end_offset" : 142, "type" : "<ALPHANUM>", "position" : 21 } ] }