azurelunatic: Vivid pink Alaskan wild rose. (Default)
Azure Jane Lunatic (Azz) 🌺 ([personal profile] azurelunatic) wrote in [site community profile] dw_suggestions 2011-08-09 02:15 am

Better multilingual entry support

Title:
Better multilingual entry support

Area:
entries, search

Summary:
Allow entries to be tagged with the language(s) that they are composed of. This can be used to power more interesting things around the site.

Description:
Entries composed of written or spoken material (text, images of writing, audio, video) usually have one or more languages in which the material is presented. Allowing entries to be voluntarily tagged by their owners to describe the language(s) they are using might allow some interesting features to be developed based on entry tagging.

If a particular spelling appears in more than one language, specifying the language of the entry in site search could help find the thing someone's looking for.

Statistics on actual use of the site by users who speak different languages might be helpful to staff, especially if the technical barriers to offering the site in translation are overcome.

It could help users better connect with people who speak their same language, especially users whose preferred language is in a minority on the site.


What would the user interface be like? A whole long list of possible languages could a) be unwieldy, and b) leave out languages used by actual site users. Sign languages and constructed languages spring to mind as languages that might be left out of even a fairly exhaustive list: entries with embedded video might have sign language, fannish communities are reasonably likely to include Tengwar and Klingon, and goodness knows there are probably more use cases that I know nothing of.

One way to do it might be like the tags interface, where something can be typed in and the field attempts to autofill from a preset list, but accepts new entries gracefully. If designed properly, unique data entered here on public entries could be logged, collated, and presented to an administrator on a regular basis for review; items that are found to be actual common languages not present on the list could then be entered.
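A minimal sketch of that tags-style field, in Python purely for illustration. The preset list, the function names, and the in-memory counter are all assumptions made up for this example, not anything Dreamwidth actually has: typing autofills from the preset list, anything else is accepted as typed, and new values from public entries are tallied for an administrator to review.

```python
from collections import Counter

# Hypothetical preset list; a real one would be much longer.
KNOWN_LANGUAGES = ["English", "French", "German", "Japanese", "Korean", "Klingon"]

# Tally of free-text entries that did not match the preset list,
# kept so an administrator can review them periodically.
unrecognised = Counter()


def suggest(prefix, limit=5):
    """Return preset languages that start with what the user has typed so far."""
    needle = prefix.strip().casefold()
    if not needle:
        return []
    matches = [name for name in KNOWN_LANGUAGES if name.casefold().startswith(needle)]
    return matches[:limit]


def accept(entry, is_public):
    """Accept whatever the user typed; log it for review if it is new and public."""
    cleaned = entry.strip()
    known = {name.casefold(): name for name in KNOWN_LANGUAGES}
    if cleaned.casefold() in known:
        return known[cleaned.casefold()]
    if is_public:
        unrecognised[cleaned] += 1
    return cleaned


def review_report(minimum=1):
    """List unrecognised entries, most frequent first, for an administrator."""
    return [(name, count) for name, count in unrecognised.most_common() if count >= minimum]
```

Values from non-public entries are accepted but never counted, so the review list only reflects what was entered on public entries.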

Any site function that involves searching by language should allow for synonyms -- three different people might use "tlhIngan Hol", "pIqaD", and "Klingon" to mean the same language -- to say nothing of the typos. There should be a way to bundle known synonyms and known typos -- and also a way to override this bundling.
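One possible shape for the bundling, sketched in Python. The bundle contents, the `bundle` flag, and the 0.8 similarity cutoff are illustrative assumptions; the fuzzy fallback uses the standard library's `difflib.get_close_matches` to catch typos nobody has listed yet.

```python
import difflib

# Hypothetical bundles: canonical label -> known synonyms and typos.
BUNDLES = {
    "Klingon": ["tlhIngan Hol", "pIqaD", "Klingon", "Klingonese"],
    "German": ["German", "Deutsch", "Germna"],
}

# Reverse index from case-folded spelling to canonical label.
INDEX = {
    spelling.casefold(): canonical
    for canonical, spellings in BUNDLES.items()
    for spelling in spellings
}


def canonical_language(term, bundle=True):
    """Map a search term to its bundle; bundle=False is the override."""
    cleaned = term.strip()
    if not bundle:
        return cleaned
    key = cleaned.casefold()
    if key in INDEX:
        return INDEX[key]
    # Fall back to fuzzy matching for typos that are not in any bundle yet.
    close = difflib.get_close_matches(key, list(INDEX), n=1, cutoff=0.8)
    return INDEX[close[0]] if close else cleaned
```

Passing `bundle=False` searches for exactly what was typed, which covers the case where someone really does mean "pIqaD" and nothing else.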

Another challenge is that people might not tag all their entries (to say nothing of back entries). How hard/expensive would it be to autodetect languages? Failing autodetection, could a default be set by user, like the last language they used?

Poll #7733 Better multilingual entry support
Open to: Registered Users, detailed results viewable to: All, participants: 66


This suggestion:

Should be implemented as-is.
38 (57.6%)

Should be implemented with changes. (please comment)
4 (6.1%)

Shouldn't be implemented.
2 (3.0%)

(I have no opinion)
20 (30.3%)

(Other: please comment)
2 (3.0%)

ninetydegrees: Art & Text: heart with aroace colors, "you are loved" (Default)

[personal profile] ninetydegrees 2011-08-10 12:08 am (UTC)(link)
I think I prefer this suggestion (specifying one's language(s) in one's profile and making it work like interests) but I like the idea of being able to search by tags in general, if that's not implemented already.
msilverstar: (corset)

[personal profile] msilverstar 2011-08-10 02:27 am (UTC)(link)
That could certainly work for cutting down the number of possible languages. But it doesn't specify the post language: one person on my subscription list writes entries in German and English and both.
jumpuphigh: Pigeon with text "jumpuphigh" (Default)

[personal profile] jumpuphigh 2011-08-10 04:26 am (UTC)(link)
The selection drop-down would have to be able to take into account multiple languages in a single entry.
ninetydegrees: Art & Text: heart with aroace colors, "you are loved" (Default)

[personal profile] ninetydegrees 2011-08-10 08:05 am (UTC)(link)
Yeah, that's where tagging and being able to search by tags would come in handy.
msilverstar: (corset)

[personal profile] msilverstar 2011-08-10 02:28 am (UTC)(link)
80% accurate language auto-detection is pretty easy; there are public algorithms. The last 20% though...
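For a feel of what the easy part can look like, here is a toy stopword-counting detector in Python. This is not one of the public algorithms mentioned above, and it makes no claim about accuracy; the three tiny word lists and the function name are assumptions for illustration only.

```python
import re

# Tiny hypothetical stopword lists; a real detector would use far more data.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "fr": {"le", "la", "et", "est", "de", "les", "un", "une", "que"},
    "de": {"der", "die", "und", "ist", "das", "nicht", "ein", "zu"},
}


def detect_language(text):
    """Return (tag, share of words that were stopwords), or (None, 0.0)."""
    # Letters only: \W, digits and underscores all act as separators.
    words = re.findall(r"[^\W\d_]+", text.casefold())
    if not words:
        return None, 0.0
    scores = {tag: sum(word in stops for word in words) for tag, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, 0.0
    return best, scores[best] / len(words)
```

Short posts, mixed-language posts, and closely related languages are exactly where a counter like this falls over, which is the hard remainder.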
mark: A photo of Mark kneeling on top of the Taal Volcano in the Philippines. It was a long hike. (Default)

[staff profile] mark 2011-08-10 04:16 pm (UTC)(link)
Yeah, I was thinking that I'd rather do this algorithmically and just call it good. I'm very, very wary of adding yet more options to the post interface. It's way too busy already.
msilverstar: (corset)

[personal profile] msilverstar 2011-08-10 04:39 pm (UTC)(link)
If it's for invisible search purposes, that sounds good, but if it ever shows up to humans, the poster needs to be able to override the auto detected language.
pauamma: Cartooney crab wearing hot pink and acid green facemask holding drink with straw (Default)

[personal profile] pauamma 2011-08-11 02:10 pm (UTC)(link)
Any autodiscovery should allow the poster to override it IMO:
- Autodiscovery doesn't deal well with multiple-language entries, and poster may want to correct it
- Even with single language entries, autodiscovery may either be wrong or not give the desired amount of detail (eg, guess it's pt but the poster wants to specify pt_BR)
- Poster may want to leave the language unspecified, eg if the point is for readers to figure it out by themselves
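The three cases above could be resolved with a precedence rule along these lines. This is a sketch in Python; the `UNSPECIFIED` sentinel and the function name are hypothetical, invented for this example.

```python
# Hypothetical sentinel for "deliberately left unspecified";
# distinct from "the poster entered nothing".
UNSPECIFIED = "unspecified"


def resolve_languages(poster_choice, detected):
    """Pick the languages to store for an entry.

    poster_choice: None if the poster entered nothing, UNSPECIFIED if they
    declined to say, or a list of tags such as ["pt_BR"] or ["fr", "en"].
    detected: the autodetector's single best guess, or None.
    """
    if poster_choice == UNSPECIFIED:
        return []  # poster wants readers to work it out themselves
    if poster_choice:
        return list(poster_choice)  # the poster's word always wins
    return [detected] if detected else []
```

With this ordering the detector's guess is only ever used when the poster has said nothing at all.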
pauamma: Cartooney crab wearing hot pink and acid green facemask holding drink with straw (Default)

[personal profile] pauamma 2011-08-11 01:37 pm (UTC)(link)
The last 20% can give some wonderfully bizarre results, as in that OO document that was all alternating short segments of French and English. OO was mostly able to tell which was which, but there was a region with both French and English in it that it claimed was all Portuguese.

[personal profile] voldsom 2011-08-12 06:30 am (UTC)(link)
Regarding the 80/20 split: is that a general 'this is how good detection algorithms are', or is that based on existing Dreamwidth content? Being contrary, I guess my concern is over whether general Dreamwidth posts fall into or outside the standards being measured against. If 95% of posts fall into the 80% category, then okay; but if 50% of the posts fall into the 20% category it becomes less practical. I should also poke around the algorithms to try and learn how they handle multiple languages.

It's possible that you could handle most of the 20% by using the user's list of languages from their profile page (*), either simply (we can't be 100% certain of the auto detect, so let's use the default / not set it) or smarter (auto detect says it's X or Y, the user speaks B, M and X, so X it is).
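The "smarter" variant might look roughly like this in Python. The function and parameter names are made up for the sketch, and it assumes the detector can hand back a short list of candidates rather than a single answer.

```python
def pick_language(candidates, profile_languages, default=None):
    """Narrow an uncertain autodetect result using the user's profile.

    candidates: tags the detector could not choose between, best guess first.
    profile_languages: tags the user lists on their profile.
    """
    if len(candidates) == 1:
        return candidates[0]
    known = set(profile_languages)
    overlap = [tag for tag in candidates if tag in known]
    if len(overlap) == 1:
        return overlap[0]
    # Zero or several matches: still ambiguous, so use the default / leave unset.
    return default
```

So `pick_language(["X", "Y"], ["B", "M", "X"])` settles on `"X"`, while a user who speaks both X and Y falls back to the simple behaviour.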

Personally, while I agree that the post entry page is already busy enough, I would prefer to see it as an enterable field. I think that ultimately, the user should be able to control it. Actually, let me take a step back. I think it depends on how it's to be used.

I think the already linked suggestion for languages on profiles is a good precursor to this, and has all the same questions for how you manage languages and the entry and listing thereof.

If this is something that is supposed to be a user choice and/or will be displayed on individual user entries, then it needs to be something that the user can enter, so it becomes the user's choice of whether they specify a single language or multiple, or default to leaving it blank and none at all. Leaving it blank is fine for user display here, but if it's to be used for other tools, you still have the problem of selecting a default language for the entry, which may mean forcing the user to have a primary language, and then other languages, which I'm not keen on.

If it's to be an invisible field, that's more used in the background for allowing more choices on the Latest Things page (I'd love to be able to browse for Japanese / Korean posts), then auto detection is probably good enough, within the provisos above. It takes the onus away from the user of having to key in the same language every single time, but it also takes away their choice.
kyrielle: painterly drawing of a white woman with large dark-blue-framed glasses, hazel eyes, brown hair, and a suspicious lack of blemishes (Default)

[personal profile] kyrielle 2011-08-10 02:48 am (UTC)(link)
I have no strong opinion about doing this, but if it is done, I think defaulting to the last language the user used makes the most sense. If I have to fill it in every time, I probably simply won't recall/take the time most days, but if I fill it in once as English and it then remembers for me every time - then all of mine would end up tagged.

I'm not sure how big of an inconvenience that would be for people who post in multiple languages, or whether it might lead to mis-tagging, but for those of us posting in only one it would be a nice convenience. I'm not sure how to weigh those two needs against each other, though.
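The remember-the-last-value behaviour is small enough to sketch in Python. The in-memory dictionary and both function names are assumptions for illustration; a real site would store this with the user's other settings.

```python
# Hypothetical in-memory store: username -> language tags on their last post.
last_used = {}


def default_tags(username):
    """Tags to pre-fill in the post form; empty until the user has posted once."""
    return list(last_used.get(username, []))


def record_post(username, tags):
    """Store the tags actually submitted so they become the next default."""
    last_used[username] = list(tags)
    return last_used[username]
```

Because the pre-filled value is only a default, someone who alternates languages can still change it per entry, though a forgotten edit would mis-tag that one post.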
axiom_of_stripe: DC Comics: Kory cries "X'Hal!" (Default)

[personal profile] axiom_of_stripe 2011-08-10 01:17 pm (UTC)(link)
+1, especially if the "language tag" section is hidable in that fabled new posting page that lets us arrange things to our liking. :)
ursamajor: people on the beach watching the ocean (Default)

[personal profile] ursamajor 2011-08-10 03:19 am (UTC)(link)
This would enable Dreamwidth to do things like let you choose which languages you want to read on the latest entries page, if I grok correctly? I would like that, definitely - it would make it easier for me to find people who write journal entries in French, Korean, etc.

(Tangent: I would still want /latest/ to display all entries in all languages; if I wanted to see only entries in French, I could bookmark something like /latest/?lang=fr to see all the French entries. Or maybe even ?lang=fr,ko,en for multiple languages? But I like seeing the linguistic variety on /latest/; it's a good demonstration that DW isn't just for one particular niche.)
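A sketch of how a `?lang=` parameter like that could be read, in Python with only the standard library. The `lang` parameter is the one floated above; the entry shape (a title paired with a set of tags) and the function names are assumptions for the example.

```python
from urllib.parse import parse_qs, urlsplit


def requested_languages(url):
    """Return the set of tags in ?lang=..., or None if no filter was given."""
    query = parse_qs(urlsplit(url).query)
    tags = {
        tag.strip().lower()
        for value in query.get("lang", [])
        for tag in value.split(",")
        if tag.strip()
    }
    return tags or None


def filter_latest(entries, url):
    """Keep entries sharing at least one language with the filter.

    entries: iterable of (title, set_of_language_tags) pairs; hypothetical shape.
    With no lang parameter, everything is shown.
    """
    wanted = requested_languages(url)
    if wanted is None:
        return list(entries)
    return [entry for entry in entries if entry[1] & wanted]
```

Leaving the parameter off returns everything, so the plain /latest/ page keeps showing all languages.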

Back to the original suggestion:

- Language tags should constitute a separate entry area from regular tags.
- An entry can contain multiple languages, so tagging an entry with multiple languages should be part of this feature.
- I'd like to be able to set a default language tag going forward; I will write predominantly in English, and I wouldn't want to have to remember to add the language tag every time. I'm already bad enough at consistently using regular tags >_>
- Going backwards, it would be awesome to retroactively apply a language tag as part of the mass-entry-editor, even if this had to be a paid-only or time-delayed feature.
- Synonym-bundling of language tags, yes, absolutely, especially across languages.
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2011-08-10 03:26 am (UTC)(link)
[staff profile] mark and I were just talking about language-specific Latest Things feeds, yeah :) Which is why I was so happy to see this suggestion!
zeborah: Map of New Zealand with a zebra salient (approval)

[personal profile] zeborah 2011-08-10 06:33 am (UTC)(link)
This would be a thousand times of awesome.
vickyblueeyez: (Default)

[personal profile] vickyblueeyez 2011-08-17 01:13 am (UTC)(link)
Being able to filter the latest page by language so I can read it would be great. Google and Bing Translate log me out and don't translate everything.
matgb: Artwork of 19th century upper class anarchist, text: MatGB (Default)

[personal profile] matgb 2011-08-10 04:24 am (UTC)(link)
I agree with most of this.

It's also an elegant solution to an issue now we're getting a fair few non-English posts. I'd also like to see a way of filtering them off my network page if possible, but that'd probably need a separate suggestion after there's a way to tag entries.
yvi: Kaylee half-smiling, looking very pretty (Default)

[personal profile] yvi 2011-08-10 07:16 am (UTC)(link)
Provided the default view isn't English-only...
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2011-08-10 07:20 am (UTC)(link)
Nobody's suggesting that, don't worry.
yvi: Kaylee half-smiling, looking very pretty (Default)

[personal profile] yvi 2011-08-10 07:23 am (UTC)(link)
Well, I wouldn't be so sure it wouldn't be suggested if this got implemented.

I'm a bit disillusioned today, I guess :/
denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)

[staff profile] denise 2011-08-10 07:24 am (UTC)(link)
Have some faith in me, at least :P
piranha: red origami crane (Default)

[personal profile] piranha 2011-08-11 12:22 am (UTC)(link)
i would like this A LOT.
pauamma: Cartooney crab wearing hot pink and acid green facemask holding drink with straw (Default)

[personal profile] pauamma 2011-08-11 02:16 pm (UTC)(link)
I'd want to have "I'm not saying" both as a default setting and a per-entry setting.
yvi: Dreamsheep in Germany's national colors (Dreamsheep - Germany)

[personal profile] yvi 2011-08-10 07:15 am (UTC)(link)
A whole long list of possible languages could a) be unwieldy


What with the thousands and thousands of languages spoken/written today, yeah, I'd guess so.

I am not sure how I feel about this yet, will have to think about it some more.
pne: A picture of a plush toy, halfway between a duck and a platypus, with a green body and a yellow bill and feet. (Default)

[personal profile] pne 2011-08-10 01:36 pm (UTC)(link)
I think the tag interface (with multiple possibilities) is probably best, as well as remembering what you used last and providing that as the default value for your next post.

I think canonicalisation would be a good idea, and I think that BCP 47 tags (aka IETF language tags) are a good thing to canonicalise to.

Perhaps something like "Please enter the language(s) used in this post, separated by commas. Please use language tags such as 'en' for English if you know them.", and then if someone enters something else, say something along the lines of "Your value of 'tlhIngan Hol' was recognised as 'tlh' (Klingon, tlhIngan-Hol). Use this standard value or keep your own?", with the language names taken from the "Description" field(s) in the IANA language subtag registry.

The names aren't always the prettiest (for example, "el" is "Modern Greek (1453-)"), but should be recognisable.
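The entry-and-confirm flow described above could be sketched like this in Python. The three-entry `REGISTRY` is a hand-copied excerpt standing in for the real IANA registry, and the function names are hypothetical; the prompt wording follows the example message given above.

```python
# Hypothetical excerpt: subtag -> Description field(s) from the registry.
REGISTRY = {
    "en": ["English"],
    "el": ["Modern Greek (1453-)"],
    "tlh": ["Klingon", "tlhIngan-Hol"],
}


def _key(text):
    """Normalise for comparison: ignore case, spaces and hyphens."""
    return text.casefold().replace("-", "").replace(" ", "")


# Both the tags themselves and their descriptions resolve to a tag.
LOOKUP = {_key(tag): tag for tag in REGISTRY}
for tag, names in REGISTRY.items():
    for name in names:
        LOOKUP.setdefault(_key(name), tag)


def check_input(raw):
    """Split a comma-separated field; return (values, prompts to show the poster)."""
    values, prompts = [], []
    for item in (part.strip() for part in raw.split(",")):
        if not item:
            continue
        values.append(item)
        tag = LOOKUP.get(_key(item))
        if tag is not None and tag != item:
            names = ", ".join(REGISTRY[tag])
            prompts.append(
                f"Your value of '{item}' was recognised as '{tag}' ({names}). "
                "Use this standard value or keep your own?"
            )
    return values, prompts
```

Values that match nothing in the registry pass through untouched and produce no prompt, so languages without a tag can still be entered.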
deborah: the Library of Congress cataloging numbers for children's literature, technology, and library science (Default)

[personal profile] deborah 2011-08-10 01:44 pm (UTC)(link)
I like that this suggestion as written allows for tagging conlangs.
pne: A picture of a plush toy, halfway between a duck and a platypus, with a green body and a yellow bill and feet. (Default)

[personal profile] pne 2011-08-10 05:23 pm (UTC)(link)
Doj
owl: Charlie Eppes. Geek. (geeky)

[personal profile] owl 2011-08-10 11:24 pm (UTC)(link)
This is awesome.
kerravonsen: glass button: "Shiny!" (shiny)

[personal profile] kerravonsen 2011-08-11 10:19 pm (UTC)(link)
\o/
green_knight: (Words)

[personal profile] green_knight 2011-08-11 09:17 am (UTC)(link)
I definitely see the need for a master list because any language might be referred to by at least two names (English and its own), and when you get to Japanese/Nihongo/にほんご/(something possibly involving Kanji which I can't even write), life could get complicated quickly.

I'm envisioning something like the icon choice menu, populated by my previous choices, with the option to add a language as I want it - and being asked to identify the new language in the masterlist so the latest things could display it correctly.

I'm not happy about autodetect. I've had to use Google Translate quite a bit recently, and the detection often fails; but I can see it being problematic particularly for pairings like Serbian and Croatian which are probably close enough to be misdetected - and where users tend to care a whole lot about being misidentified.
romikchef: (Default)

[personal profile] romikchef 2011-08-13 03:42 pm (UTC)(link)
I'd vote for the item in the preferences page.
Something like "I write mostly in "
And this setting can be used to set default language tag in the post form, letting user edit it.
Edited 2011-08-13 16:40 (UTC)