make search respect umlauts
Title:
make search respect umlauts
Area:
journal search
Summary:
Treat ä/ö/ü and other vowel 'variants' as different characters from a/o/u in the journal search.
Description:
Contrary to popular opinion, the special vowel (and sometimes also consonant) thingies in other languages (thingies is the technical term, really!) are not just fancy versions with a few dots or lines added, but different characters. I only know some things about German, but there ä/ö/ü can be written as ae/oe/ue, not as a/o/u.
I am sure people who know about linguistics are getting high blood pressure by now, but I really don't know a lot about these things. What I do know is that when I search for "Münster" in the journal search, having 75% of the results be about The Munsters or the Irish province is not exactly helpful, and it is confusing at first because that's not the word I searched for. If there is any way these can be separated, it would be much appreciated.
The downside is that some people just write ä/ö/ü as a/o/u because they don't know how to type them on their keyboard. I used to type on an American keyboard and just learned the [Alt]+[numbers] combinations, but maybe that's just not widespread.
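To make the difference concrete, here is a minimal Python sketch of the two behaviours being discussed. It is not how DW's search is actually implemented; the function names and the `respect_umlauts` flag are made up for illustration, and it only assumes that the current behaviour is equivalent to stripping diacritics before comparing.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Strip combining marks, so 'ü' becomes 'u' (the behaviour complained about)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, document: str, respect_umlauts: bool) -> bool:
    """Case-insensitive substring match; hypothetical stand-in for a real search."""
    q = unicodedata.normalize("NFC", query).casefold()
    d = unicodedata.normalize("NFC", document).casefold()
    if not respect_umlauts:
        # Folding both sides makes 'münster' and 'munster' indistinguishable.
        q, d = fold_diacritics(q), fold_diacritics(d)
    return q in d
```

With `respect_umlauts=False`, a search for "Münster" also hits an entry about The Munsters; with `respect_umlauts=True`, it only hits entries that actually contain "Münster".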
This suggestion:
Should be implemented as-is.
20 (46.5%)
Should be implemented with changes. (please comment)
7 (16.3%)
Shouldn't be implemented.
8 (18.6%)
(I have no opinion)
8 (18.6%)
(Other: please comment)
0 (0.0%)
no subject
Second, my (admittedly limited) understanding of search functions is that to implement a search for special characters, the search protocol must know what character set you're working from, and while you're using Microsoft's [Alt] key to create accented characters, other people are using Apple's [⌘] key, and some of us use HTML entity references. Ensuring that all of these options (but not the unaccented character) appear in the results could potentially be a programmer's nightmare.
no subject
Apart from DW's search, I know of no search that treats ä/ö/ü as a/o/u. Google and eBay treat them completely separately from what I can see; Yahoo and Amazon seem to match them with their ae/oe/ue variants. Other search engines do fuzzy matching for misspellings, so you get results for both Munster and Minster mixed in.
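The ae/oe/ue matching described for Yahoo and Amazon could look roughly like the sketch below. This is only a guess at the general idea, not how either site works; the mapping table and function name are illustrative, and the table covers German only.

```python
import unicodedata

# Illustrative German-only mapping; other languages would need their own rules.
GERMAN_TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}


def transliterate_german(text: str) -> str:
    """Lowercase the text and spell out umlauts the German way."""
    text = unicodedata.normalize("NFC", text).lower()
    return "".join(GERMAN_TRANSLIT.get(ch, ch) for ch in text)
```

Under this scheme "Münster" and "Muenster" end up with the same search key, while "Munster" stays separate.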
no subject
In my understanding that's not necessary, as all pages on DW get UTF-8 encoded no matter what you used to put the characters in.
(And as a completely unrelated aside, I have a German keyboard :) my father used to have an American keyboard, like, 10 years back, because it is a bit easier to use for programming, that's why I learned the Alt stuff)
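On the encoding point: once the text is Unicode, the different ways of entering an ü can be brought to the same code point before indexing. The sketch below assumes the search index would decode HTML entity references and apply NFC normalization; whether DW does either is not something this thread confirms, and `canonical` is a made-up helper name.

```python
import html
import unicodedata


def canonical(text: str) -> str:
    """Decode HTML entity references, then compose characters (NFC)."""
    return unicodedata.normalize("NFC", html.unescape(text))


typed_directly = "Münster"          # e.g. via an Alt code or a German keyboard
named_entity = "M&uuml;nster"       # HTML entity reference
numeric_entity = "M&#252;nster"     # numeric character reference
combining_form = "Mu\u0308nster"    # 'u' followed by a combining diaeresis
```

All four spellings come out as the same string, and that string encodes to the same UTF-8 bytes, so the input method stops mattering at that point.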
no subject
When is Ö used in English? And if it is not written as 'oe' there, that would be a reason against just assuming it is an 'o', and for treating it as its own character instead, in my opinion.
no subject
German words, where ö -> oe, as you mention.
Loan-words and names where the diacritics are optional; the standard English spelling of Händel is Handel rather than Haendel, but a search should return all the variant forms.
Cases where the dots are a diaeresis rather than an umlaut: coöperate, indicating that co-o is broken into two syllables rather than one "coo". These are typically optional (and slightly archaic, but I like them)
Decorative uses like the Heavy metal umlaut - these are typically optional.
In all these cases it'd be better if one didn't miss results because the author had used a different spelling.
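One way to get "all the variant forms" is to index several keys per word and match when any key is shared. The sketch below is only an illustration under that assumption; `search_keys` and `is_variant` are hypothetical names, and the transliteration table is German-only.

```python
import unicodedata

# German-style spellings; illustrative only.
TRANSLIT = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def strip_marks(text: str) -> str:
    """Remove combining marks: 'händel' -> 'handel', 'coöperate' -> 'cooperate'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def search_keys(word: str) -> set[str]:
    """Exact, mark-stripped, and transliterated forms of one word."""
    exact = unicodedata.normalize("NFC", word).lower()
    return {exact, strip_marks(exact), exact.translate(TRANSLIT)}


def is_variant(a: str, b: str) -> bool:
    """True if the two words share at least one search key."""
    return not search_keys(a).isdisjoint(search_keys(b))
```

A query for "Händel" then finds "Handel" and "Haendel", and "coöperate" finds "cooperate". The trade-off is exactly the one raised in the suggestion: "Münster" finds "Munster" again, so a search would probably need an exact-match option on top of this.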
no subject
(Also, better than memorizing Unicode character codes, if you have a US keyboard, you could set it to "US International" or its equivalent for your OS.)
no subject
I don't think my father would have liked that 10 years back when he had an American keyboard for making programming a bit easier and I was allowed to use his computer for accessing the net every two days for an hour :D
I still know the codes, though. 132 was ä, 148 was ö, 149 was ü and either 225 or 255 was ß :)
no subject
I'd prefer if it was not differentiated (in any sort of search) though. It'd be simpler if it could catch everything so similar in one search, so you don't miss out on potentially relevant results.
no subject
There's also the character ß, which I've already seen replaced with a B, but no sensible search would search for a B just because people are creative.
So when searching for special characters, it should search for what the character means and not what it looks like.
no subject
At least Yahoo and Amazon seem to do that, and that's what I'd definitely prefer, too, yeah.
There's also the character ß, which I've already seen replaced with a B
...
There are no words for how wrong that is. If anything, it's a 'ss'. *boggles*
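For what it's worth, Unicode case folding agrees with the 'ss' reading. A small Python sketch (the `same_word` helper is made up for illustration):

```python
def same_word(a: str, b: str) -> bool:
    """Compare two words using Unicode case folding, which maps 'ß' to 'ss'."""
    return a.casefold() == b.casefold()
```

Here `same_word("Straße", "STRASSE")` is true, while `same_word("Straße", "StraBe")` is false, since case folding goes by what the character means rather than what it looks like.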