yvi: Kaylee half-smiling, looking very pretty (Default)
yvi ([personal profile] yvi) wrote in [site community profile] dw_suggestions 2010-02-02 04:10 pm

make search respect umlauts

Title:
make search respect umlauts

Area:
journal search

Summary:
Treat ä/ö/ü and other vowel 'variants' as different characters from a/o/u in the journal search.

Description:
Contrary to popular opinion, the special vowel (and sometimes also consonant) thingies in other languages (thingies is the technical term, really!) are not just fancy versions with a few dots or lines added, but different characters. I only know some things about German, but ä/ö/ü can be written as ae/oe/ue, but not as a/o/u.

I am sure people with knowledge about linguistics are getting high blood pressure by now, but I really don't know a lot of these things. But what I know is that when I search for "Münster" in the journal search, having 75% of the results be about The Munsters or the Irish province is not exactly helpful, and it is confusing at first because that's not the word I searched for. If there is any way these can be separated, it would be much appreciated.

Downside being that some people just write ä/ö/ü as a/o/u because they don't know how to do it on their keyboard. I used to type on an American keyboard and just learned the [Alt]+[numbers] combinations, but maybe that's just not widespread.

Poll #2205 make search respect umlauts
Open to: Registered Users, detailed results viewable to: All, participants: 43


This suggestion:

Should be implemented as-is.
20 (46.5%)

Should be implemented with changes. (please comment)
7 (16.3%)

Shouldn't be implemented.
8 (18.6%)

(I have no opinion)
8 (18.6%)

(Other: please comment)
0 (0.0%)

archane: Archane is cute and sassy (Default)

[personal profile] archane 2010-02-02 07:26 pm (UTC)(link)
Based on my understanding of how searches work, there are two potential problems with implementing this. First, as you mention, use of accented characters is limited, and allowing both accented and non-accented results in the report means that I can have something which is meaningful and comprehensive rather than leaving out half of the results.

Second, my (admittedly limited) understanding of search functions is that to implement a search for special characters, the search protocol must know what character set you're working from, and while you're using Microsoft's [Alt] key to create accented characters, other people are using Apple's [⌘] key, and some of us use HTML entity references. Ensuring that all of these options (but not the unaccented character) appear in the results could potentially be a programmer's nightmare.
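
For what it's worth, the input-method worry may be smaller than it looks, assuming the journal text is stored as Unicode: the keystrokes used to type a character are not recorded, only the resulting character. What can differ is a precomposed ü versus u plus a combining mark, or an entity reference left in the text. A minimal Python sketch of bringing those to one form before indexing (the function name `canonical` is made up for illustration, and nothing here is taken from the actual Dreamwidth search code):

```python
import html
import unicodedata

def canonical(text: str) -> str:
    """Decode HTML character references, then normalize to NFC."""
    return unicodedata.normalize("NFC", html.unescape(text))

typed = "M\u00fcnster"        # precomposed ü, however it was typed
combining = "Mu\u0308nster"   # u followed by a combining diaeresis
entity = "M&uuml;nster"       # HTML entity reference

# All three spellings collapse to the same string; plain "Munster" does not.
print(canonical(typed) == canonical(combining) == canonical(entity))
print(canonical("Munster") == canonical(typed))
```

After this step the accented and unaccented forms are still distinct, so whether to merge them stays a separate decision.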
pseudomonas: "pseudomonas" in London Underground roundel (Default)

[personal profile] pseudomonas 2010-02-02 08:22 pm (UTC)(link)
It's not clear what is being proposed here. A search for one variation should return all the likely variations in the text, otherwise you have to know exactly which variant is used - so a search for "naive" won't return anything if the author wrote "naïve", and searching for "café" won't find "cafe". "Ö" is an "oe" equivalent in German, but is not always such in English and in other languages.
pseudomonas: "pseudomonas" in London Underground roundel (Default)

[personal profile] pseudomonas 2010-02-03 01:29 pm (UTC)(link)
There are various categories:
German words, where ö -> oe, as you mention.
Loan-words and names where the diacritics are optional; the standard English spelling of Händel is Handel rather than Haendel, but a search should return all the variant forms.
Cases where the dots are a diaeresis rather than an umlaut: coöperate, indicating that co-o is broken into two syllables rather than one "coo". These are typically optional (and slightly archaic, but I like them)
Decorative uses like the Heavy metal umlaut - these are typically optional.

In all these cases it'd be better if one didn't miss results because the author had used a different spelling.
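
One way to get that behaviour, sketched in Python under some assumptions: generate a small set of spellings for each word (as written, with the marks stripped, and with the German ae/oe/ue forms) and treat two words as matching when their sets overlap. The names `variants` and `matches` and the `GERMAN` table are invented for this example and are not part of any existing search code.

```python
import unicodedata

# Assumed transliteration table, German conventions only.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def strip_marks(text):
    """Drop combining marks: 'händel' becomes 'handel'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def transliterate(text):
    """Apply the German replacements: 'händel' becomes 'haendel'."""
    return "".join(GERMAN.get(ch, ch) for ch in text)

def variants(term):
    """Return the spellings under which a word could be found."""
    term = unicodedata.normalize("NFC", term.lower())
    return {term, strip_marks(term), transliterate(term)}

def matches(query, word):
    """Two words match when they share at least one spelling."""
    return bool(variants(query) & variants(word))

print(sorted(variants("Händel")))
print(matches("naive", "naïve"), matches("cafe", "café"))
```

Note that the German table also turns "coöperate" into "cooeperate", which is harmless for matching but shows why the oe rule cannot be assumed to be correct outside German.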
alixtii: Player from <i>Where on Earth Is Carmen Sandiego?</i> playing the game. (Default)

[personal profile] alixtii 2010-02-09 12:03 am (UTC)(link)
+1
pauamma: Cartooney crab wearing hot pink and acid green facemask holding drink with straw (Default)

[personal profile] pauamma 2010-02-02 09:07 pm (UTC)(link)
Hmm: are ä ö ü different from a o u in all languages and writing systems that have them?

(Also, better than memorizing Unicode character codes, if you have a US keyboard, you could set it to "US International" or its equivalent for your OS.)
jaaaarne: Photo of a seagull in flight, with slight motion blur. (Default)

[personal profile] jaaaarne 2010-02-03 02:55 am (UTC)(link)
Yes. They represent different sounds. Say, in my username the a's are actually ä's and are read more like ya's rather than a's. Hope that makes sense. :)
aedifica: Me with my hair as it is in 2020: long, with blue tips (Default)

[personal profile] aedifica 2010-02-03 02:15 am (UTC)(link)
I voted against because while I do know the difference between o-umlaut (or oe) and o, I have to look up the code every time I want to do it on my English keyboard--it's a non-trivial use of time, by the time it adds up. Also because not everyone uses the same characters to represent special characters, and I hate the thought of trying to do separate searches on o-with-umlaut, plain o, and oe every time I want to look for a word that sometimes contains an umlaut. (As it is I need to search on oe and plain o.)
kyrielle: painterly drawing of a white woman with large dark-blue-framed glasses, hazel eyes, brown hair, and a suspicious lack of blemishes (Default)

[personal profile] kyrielle 2010-02-03 04:45 am (UTC)(link)
I wonder if it could be made to include both exact-match and near-match as it does now, but float the exact-match to the top of the results.
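
A rough Python sketch of that idea, as an illustration only: keep every result the folded search would find, but list the ones containing the exact spelling first. The helper names `fold` and `search` are hypothetical, and the matching here is a plain substring test rather than whatever the real journal search does.

```python
import unicodedata

def fold(text):
    """Lower-case and drop combining marks, so 'Münster' folds to 'munster'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search(query, titles):
    """Return titles containing the query, exact spellings first."""
    q_exact = unicodedata.normalize("NFC", query.lower())
    q_fold = fold(query)
    exact, near = [], []
    for title in titles:
        t = unicodedata.normalize("NFC", title.lower())
        if q_exact in t:
            exact.append(title)      # same characters as the query
        elif q_fold in fold(title):
            near.append(title)       # matches only after folding
    return exact + near

titles = ["The Munsters marathon", "A weekend in Münster",
          "Munster rugby", "Unrelated"]
print(search("Münster", titles))
```

With these sample titles, the entry about Münster comes first and the two Munster entries follow, so nothing is lost for people who type without the umlaut.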
ninetydegrees: Art & Text: heart with aroace colors, "you are loved" (Default)

[personal profile] ninetydegrees 2010-02-03 06:16 am (UTC)(link)
That would be what I want too.
msilverstar: (corset)

[personal profile] msilverstar 2010-02-03 07:34 am (UTC)(link)
+1, definitely!
thedivinegoat: A photo of a yellow handled screwdriver, with text saying "This could be a little more sonic" (Default)

[personal profile] thedivinegoat 2010-02-03 10:26 am (UTC)(link)
+1
pseudomonas: "pseudomonas" in London Underground roundel (Default)

[personal profile] pseudomonas 2010-02-03 01:14 pm (UTC)(link)
I like this option.
cesy: "Cesy" - An old-fashioned quill and ink (Default)

[personal profile] cesy 2010-02-03 05:36 pm (UTC)(link)
+1, with "near-match" defined as first oe and only then o.
triadruid: Apollo and the Raven, c. 480 BC , Pistoxenus Painter  (Default)

[personal profile] triadruid 2010-02-04 04:10 pm (UTC)(link)
This is the only way in which I'll support this.
lanterne_rouee: dreamwidth swirl transparent background (dw transparent swirl)

[personal profile] lanterne_rouee 2010-02-03 06:24 am (UTC)(link)
Isn't it already differentiated? I have both 'carl barat' and 'carl barât' listed in my interests, for example. Clicking on each of them returns different lists of communities and interested users. I've always found this to be true of sites based on LJ's code. I guess journal search works differently?

I'd prefer if it was not differentiated (in any sort of search) though. It'd be simpler if it could catch everything so similar in one search, so you don't miss out on potentially relevant results.
lanterne_rouee: dreamwidth swirl transparent background (dw transparent swirl)

[personal profile] lanterne_rouee 2010-02-03 06:42 am (UTC)(link)
Thanks for confirming that. I'd never noticed. :)
schnurble: (Default)

[personal profile] schnurble 2010-02-03 09:04 am (UTC)(link)
If people don't know how to type ä, ö or ü, they can always type ae, oe and ue, because these are the correct replacements. So the search should match ä, ö, ü with ae, oe, ue and not with a, o, u just because people write it wrong in the first place.

There's also the character ß, which I've already seen replaced with a B, but no sensible search would search for a B just because people are creative.
So when searching for special characters, it should search for what the character means and not what it looks like.
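
As a sketch of what "search for what the character means" could look like for German, here is a small Python comparison key. It assumes German conventions only (ä/ö/ü become ae/oe/ue and ß becomes ss), and the names `GERMAN_FOLD` and `german_key` are invented for the example.

```python
import unicodedata

# Assumed German replacements; other languages would need other tables.
GERMAN_FOLD = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def german_key(text):
    """Build a comparison key that follows the German replacements."""
    text = unicodedata.normalize("NFC", text).lower()
    return "".join(GERMAN_FOLD.get(ch, ch) for ch in text)

# "Münster" and "Muenster" share a key; "Munster" does not.
print(german_key("Münster") == german_key("Muenster"))
print(german_key("Münster") == german_key("Munster"))
# "Straße" matches "Strasse" but not the look-alike "StraBe".
print(german_key("Straße") == german_key("Strasse"))
print(german_key("Straße") == german_key("StraBe"))
```

This is the strict reading of the rule: it never equates ü with u, so posts that dropped the umlaut would not be found by it.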
cesy: "Cesy" - An old-fashioned quill and ink (Default)

[personal profile] cesy 2010-02-03 05:37 pm (UTC)(link)
Yeeeeah. *blinks*
pseudomonas: "pseudomonas" in London Underground roundel (Default)

[personal profile] pseudomonas 2010-02-04 11:33 am (UTC)(link)
Gawsh. I've seen people using ß for β (in non-Greek phrases like ß-haemoglobin). It's all wrong.
triadruid: Apollo and the Raven, c. 480 BC , Pistoxenus Painter  (Default)

[personal profile] triadruid 2010-02-04 04:02 pm (UTC)(link)
Those are the correct replacements in some languages/settings, but not all. And being prescriptive with our search results doesn't address that people are already doing this.
inalasahl: portrait of robert rauschenberg (thefandomschool)

[personal profile] inalasahl 2010-02-05 05:44 pm (UTC)(link)
I like it the way it is. I like being able to search for "zoë" and getting "zoe," also. There are a lot of things the search does other than give you back the same exact characters you put in. I just did a search for [word]ing and got back results with [word]ed and [word], as well.