Revision of the word list

Post by **Lúthien Meliel** » Thu Sep 15 2011 11:11

Roman wrote: The dôl/dol matter is intentional - there seems to have been a change of conception regarding the etymology of the word. The former comes from ndolo and has the plural duil, the latter from *ndoll- (hence the alternate doll) and probably has the plural *dyl.

What do you propose? To keep those as two separate entries, or rather as one, but with an alternative translation added?

Post by **Roman** » Thu Sep 15 2011 13:45

What do you propose? To keep those as two separate entries, or rather as one, but with an alternative translation added?

Separate entries. The alternate words are different (mostly earlier) forms within the internal history of Sindarin. Different forms within the external history are always separate entries, as are differently derived forms.

Pluralization o > ui seems to work only with nouns formerly ending in -o without a consonant cluster, as thono > thôn, thuin 'pine' (PE17:81) or *tholo > thôl, thuil 'helm' (Q. kastolo) (PE17:188). The altered dol(l) probably cannot have the plural *duil anymore, but rather *dyl(l), as tol(l), tyll 'island'. So merging the entries wouldn't work; or would be very, very misleading.

Post by **Lúthien Meliel** » Thu Sep 15 2011 13:48

ok, will do.
Thanks

Post by **Lúthien Meliel** » Sun Sep 18 2011 13:51

Hm, this is giving me more of a headache than I hoped - especially those collation issues in MySQL are very annoying. Never have that at work, where we just put everything in utf8. Maybe Oracle is a bit smarter with these things, too.

Another thing that came up is this: I wanted to make a list of all the words that do not have an English translation yet, i.e. the PE17 words, etc.
So I thought I would run a comparison between your (Aran's) dataset and Didier's. That works well enough, but just out of curiosity I also did the reverse: find the words that occur somewhere in Didier's set, but not in your set.

I put the results in an XLS file. It's a bit complicated, because I wanted to make sure not to miss anything. So I matched every Sindarin word (regular, alternative, and plural forms) against both the primary entry column from Didier's set and the "derived forms" column. The second sheet has a single-column compilation of those words; the first sheet has many repeated rows because I included all the data that could be interesting.
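A minimal sketch of that reverse comparison, assuming both sets are already loaded as plain rows. The field names (`word`, `alt`, `plural`, `entry`, `derived`) and the two-row sample data are made up for illustration; the real columns in either dataset may look different.

```python
def unmatched_words(didier_rows, aran_rows):
    """Return the forms from Didier's set that match no form in Aran's list."""
    # Every form known from Aran's list: regular, alternative and plural.
    aran_forms = set()
    for row in aran_rows:
        for key in ("word", "alt", "plural"):
            if row.get(key):
                aran_forms.add(row[key])

    # Check both the primary entry and the derived forms of each Didier row.
    missing = set()
    for row in didier_rows:
        for form in [row["entry"], *row.get("derived", [])]:
            if form not in aran_forms:
                missing.add(form)
    return sorted(missing)


# Toy data only (hypothetical rows, not the real datasets).
aran = [
    {"word": "aclod", "alt": "atlaud", "plural": None},
    {"word": "adan", "alt": None, "plural": "edain"},
]
didier = [
    {"entry": "adlann", "derived": []},
    {"entry": "adan", "derived": ["edain"]},
]
result = unmatched_words(didier, aran)
```

With this toy data only adlann is reported, because adan and edain both appear among the forms in the other list.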

Eryniel suggested to me that these words could be the Noldorin entries - that your list contains only Sindarin forms. Is that indeed so?
Or are there also deprecated entries in there?

If you could have a look at it, that'd be great.

I uploaded the excel sheet here: didier_words_not_found_in_aran_list.xls

Post by **Roman** » Sun Sep 18 2011 20:07

Eryniel suggested to me that these words could be the Noldorin entries - that your list contains only Sindarin forms. Is that indeed so?

What? Noo... There are several reasons for mismatches:

- overregularization in the Hiswelóke lists; I have often gone back to Tolkien's own spelling and glosses
- inflected verb forms like imperatives are not listed in the German wordlists
- some verbs are given by Tolkien only via their infinitives rather than roots, so we cannot be sure which class they belong to
- some words seem to be Ilkorin rather than Noldorin/Sindarin
- I've added a new regularization X/NT affecting engui and cannui
- some words are really missing in the German wordlists

I've added some comments to the left column (http://www.sindarin.de/didier_words_not ... n_list.xls). There are too many words to elaborate on each one of them, unless you have specific questions:

adlann aclod
adlanna- atlanna-
adlant atlant
aglonn aglon (aglond)
an- error, see na (Ety/374)
andrann anrand
anno inflected
aphad- aphada-
apharch missing
athan athar
avorn born (SD/129-31)
braig **brêg, but should be breig, braig
caenen interpolated, could be added
caenui reconstructed?
camlann camland
canthui cannui (normalized)
caraes interpolated, could be added
caro inflected
carth interpolated, could be added
círbann cirban
cuinar inflected
dambeth too uncertain
daro inflected
dem Ilkorin
díheno inflected
draf- drava-
drego inflected
edledhia- egledhia-
edledhron egledhron
edlothiad too uncertain, reconstructed *edlothia-
edro inflected
eglerio inflected
egol just not necessary
enchui engui (normalized)
erchammon erchamon
erchammui erchamui
erin inflected
fing missing
gaw- gawa-
gerin inflected
gladh- gladha-
godref way too uncertain
gohena- missing
govad- govan-
ha Noldorin pronoun
haf- hav-
hâl missing
hawn haun
he Noldorin pronoun
hebin inflected
hent missing
ho Noldorin pronoun
hwind *hwinn (chwind)
idhor idher
idhrinn idhrin (dhrind)
iôl missing
lachenn lachend
lacho inflected
lasto inflected
laws laus
lefnar lefnor
linnathon inflected
linnod missing
linnon inflected
luithiad uncertain
medli megli
medlin meglin
menniath missing
minno inflected
mistad mistrad + note
na- way, way too uncertain
naglath missing
nagol missing
nallon inflected
narch missing
nawb naub
neled missing
neleg missing
nerthui missing
nornwaith missing
nothlir missing
nuin inflected
nÿw nyf
othlonn othlon (othlond)
othrad ostrad
othronn othrond
pedo inflected
penneth missing
pihen pichen
revia- renia-
rhosg rosc
sen hen
send too uncertain
suilannad missing
suith sûth
tadeg deleted by Tolkien
talagan talagand
talu dalu (dalw)
tangada- tangad-
ten den
then part of a word
thórod Ilkorin
tiro inflected
tolo inflected
toniel wrongly interpolated
tuilinn tuilin (tuilind)
uin inflected
ulunn ulun (ulund)

Thanks for bringing this up. How should we proceed? Should I check it all (including the second column) and update the lists once more for you to download? I'm sure you already have a script to break them up into normalized lists which you can run again.

Post by **Lúthien Meliel** » Sun Sep 18 2011 20:08

I changed the datamodel a little bit because I made a mistake: the "plural" and "alt" markers were on the TRANSLATION table, so that they would have been properties of a combination of a Sindarin and a German entry -> a 'translation' row.

This wasn't right, because plural and alternative forms don't belong to a translation but to a Sindarin entry.

Now, the ENTRY is recursive: every entry can have a parent entry, and thus plurals and alternative forms belong to the main Sindarin entry. The order of these forms is maintained by an index marker, and every row can have a 'reconstructed' marker set.
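A rough sketch of such a self-referencing table, written against SQLite from Python so that it can be run as-is. The column names ID, PARENT_ID, GLOSS, PLURAL and ALT are the ones used in the example query later in this thread; the RECONSTRUCTED column name, the column types and the sample rows are assumptions, not the actual model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ENTRY (
    ID            INTEGER PRIMARY KEY,
    PARENT_ID     INTEGER REFERENCES ENTRY(ID),  -- NULL for a main entry
    GLOSS         TEXT NOT NULL,                 -- the Sindarin form
    PLURAL        INTEGER,                       -- index marker: 1 = first plural, 2 = second, ...
    ALT           INTEGER,                       -- index marker for alternative forms
    RECONSTRUCTED INTEGER NOT NULL DEFAULT 0     -- assumed name for the 'reconstructed' marker
);
-- Sample rows; the RECONSTRUCTED values are left at the default placeholder.
INSERT INTO ENTRY (ID, PARENT_ID, GLOSS, PLURAL, ALT) VALUES
    (1, NULL, 'adab',   NULL, NULL),
    (2, 1,    'edaib',  1,    NULL),
    (3, 1,    'edeb',   2,    NULL),
    (4, NULL, 'aclod',  NULL, NULL),
    (5, 4,    'atlaud', NULL, 1);
""")

# Plurals hang off the main entry and keep their order via the index marker.
plurals = [
    gloss
    for (gloss,) in conn.execute(
        "SELECT GLOSS FROM ENTRY "
        "WHERE PARENT_ID = 1 AND PLURAL IS NOT NULL ORDER BY PLURAL"
    )
]
```

The point of the design is that plurals and alternative forms are ordinary ENTRY rows, so they can carry the same attributes as a main entry.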

This is the new model:

Post by **Lúthien Meliel** » Sun Sep 18 2011 20:24

hi,

thanks for the comments! That's a great help.

Thanks for bringing this up. How should we proceed? Should I check it all (including the second column) and update the lists once more for you to download? I'm sure you already have a script to break them up into normalized lists which you can run again.

We can always add or edit entries afterwards. Maybe the best thing to do is that I now finish the whole dataset with the data as I have them at the moment as soon as possible, give that back to you to look at, and meanwhile start on rebuilding the application based on the datamodel as it is now.
The functioning of the application does not depend on whether or not the data are complete, after all! As long as I have a working model I can move on.

Maybe it is easier for you to compare things if I would give you both the current dataset (based on your list) and Didier's set, in, for instance, a SQLite datafile or a MySQL dump? Just tell me what format works best.

If you could then cross-check those two sets so that we'd have that done by the time that there is a working application ready to test by others, that would be great

I don't know how long that will take me. It's definitely not a very complex thing, but there are a few thorny issues, collation among them.
It shouldn't take longer than a few weeks at most if things don't get too hectic at work.

EDIT - I'm not sure how familiar you are with SQL? I'm asking because the model with Didier's data is rather convoluted (see the diagram on the previous page, and that's not even all I think). I could also create a large view that essentially merges that whole thing into one large table, not unlike the Hisweloke HTML pages on Didier's website. I made something like that already for that first application, so that's not a big deal to write out.
That'd make it a lot easier to compare the Hisweloke set to your set, maybe.

PS - Eryniel is also looking at this list of non-matching entries.

Post by **Roman** » Mon Sep 19 2011 0:27

I changed the datamodel a little bit because I made a mistake: the "plural" and "alt" markers were on the TRANSLATION table, so that they would have been properties of a combination of a Sindarin and a German entry -> a 'translation' row.
This wasn't right, because plural and alternative forms don't belong to a translation but to a Sindarin entry.

That irritated me, I admit. And isn't just about everything dependent on the Sindarin entry - scmut, pronunciation, url, comment, ref and type?

Now, the ENTRY is recursive: every entry can have a parent entry, and thus plurals and alternative forms belong to the main Sindarin entry. The order of these forms is maintained by an index marker, and every row can have a 'reconstructed' marker set.

That sounds... intriguing. I certainly had my fun fiddling with the php script to sort the alternative words into the main list correctly in all the combinations of attested and unattested words. I hope recursive entries are easy to handle...
Something I should mention here: I limited the amount of alternative forms to one word only for simplicity, but there can actually be up to three of them for a Sindarin entry.

We can always add or edit entries afterwards. Maybe the best thing to do is that I now finish the whole dataset with the data as I have them at the moment as soon as possible, give that back to you to look at, and meanwhile start on rebuilding the application based on the datamodel as it is now.
The functioning of the application does not depend on whether or not the data are complete, after all! As long as I have a working model I can move on.

Division of labour - I like that.

Maybe it is easier for you to compare things if I would give you both the current dataset (based on your list) and Didier's set, in, for instance, a SQLite datafile or a MySQL dump? Just tell me what format works best.
[...]
EDIT - I'm not sure how familiar you are with SQL? I'm asking because the model with Didier's data is rather convoluted (see the diagram on the previous page, and that's not even all I think).

I didn't even know what SQL was before beginning my work on the wordlists around June.
If it's just for comparing and checking the mismatches, then what you gave is enough. But if you have a definitive structure of the tables you'll use in the application, then send them to me.

Post by **Lúthien Meliel** » Mon Sep 19 2011 16:18

To answer your question about self-referencing the ENTRY table, here's an example of how it works. If you execute this query:

-- es is the main entry; ep1..ep4 and ea1..ea2 are its plural and
-- alternative forms, picked out by their index markers.
select es.GLOSS sindarin,
       'pl.', ep1.GLOSS, ep2.GLOSS, ep3.GLOSS, ep4.GLOSS,
       ' alt.:', ea1.GLOSS, ea2.GLOSS
from ENTRY es
left join ENTRY ep1 on es.ID = ep1.PARENT_ID and ep1.PLURAL = 1
left join ENTRY ep2 on es.ID = ep2.PARENT_ID and ep2.PLURAL = 2
left join ENTRY ep3 on es.ID = ep3.PARENT_ID and ep3.PLURAL = 3
left join ENTRY ep4 on es.ID = ep4.PARENT_ID and ep4.PLURAL = 4
left join ENTRY ea1 on es.ID = ea1.PARENT_ID and ea1.ALT = 1
left join ENTRY ea2 on es.ID = ea2.PARENT_ID and ea2.ALT = 2
where es.ID < 9999

you'll get this result:

a
ab-
ablad
abonnen pl. ebennin ebœnnin
ach
achad
achar
achar-
achared
acharn
achas
aclod alt. atlaud
ad-
ada
adab pl. edaib edeb
adan pl. edain
adanadar pl. edenedair
<etc>
Here I edited out the NULL values and the empty 'pl' and 'alt' tags. Of course that will be done by the software eventually. It's also a little crude because the columns are hardcoded here; the program will check how many plurals and/or alt entries there are and fetch them if necessary. But it's just to demonstrate the idea.

full list is here: http://parendili.org/doc/example.txt
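For what the non-hardcoded version might look like: the sketch below fetches all child rows in one query and groups them per parent, so any number of plurals or alternative forms is handled and empty tags are simply never printed. It runs against an in-memory SQLite table with a handful of rows taken from the result above; the function name `render_entries` and the column types are assumptions, and the real application may well do this differently.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ENTRY (ID INTEGER PRIMARY KEY, PARENT_ID INTEGER, "
    "GLOSS TEXT, PLURAL INTEGER, ALT INTEGER)"
)
conn.executemany(
    "INSERT INTO ENTRY (ID, PARENT_ID, GLOSS, PLURAL, ALT) VALUES (?, ?, ?, ?, ?)",
    [
        (1, None, "abonnen", None, None),
        (2, 1, "ebennin", 1, None),
        (3, 1, "ebœnnin", 2, None),
        (4, None, "aclod", None, None),
        (5, 4, "atlaud", None, 1),
        (6, None, "ada", None, None),
    ],
)


def render_entries(conn):
    """Build one display line per main entry, with however many forms it has."""
    # Group every child row under its parent, in index-marker order.
    forms = defaultdict(lambda: {"pl.": [], "alt.": []})
    child_rows = conn.execute(
        "SELECT PARENT_ID, GLOSS, PLURAL FROM ENTRY "
        "WHERE PARENT_ID IS NOT NULL ORDER BY COALESCE(PLURAL, ALT)"
    )
    for parent_id, gloss, plural in child_rows:
        tag = "pl." if plural is not None else "alt."
        forms[parent_id][tag].append(gloss)

    lines = []
    main_rows = conn.execute(
        "SELECT ID, GLOSS FROM ENTRY WHERE PARENT_ID IS NULL ORDER BY GLOSS"
    )
    for entry_id, gloss in main_rows:
        parts = [gloss]
        for tag in ("pl.", "alt."):
            if forms[entry_id][tag]:  # skip empty 'pl.' / 'alt.' tags
                parts += [tag] + forms[entry_id][tag]
        lines.append(" ".join(parts))
    return lines


lines = render_entries(conn)
```

Grouping in the program rather than in the query avoids one join per possible plural or alternative slot.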

Post by **Roman** » Mon Sep 19 2011 19:39

What about sorting the alternative words into the main list? For example, you should be able to search for atlaud or agr, find them in the list and be referred to aclod and agor.

Btw, I have just found out that Sindarin has an ISO 639 code assigned to it, which is sjn ("sin" and "snd" were already taken), see this list. So if it appears in the dictionary, one should use sjn beside eng, ger/deu, fra. I've also changed it on the site.

Post by **Lúthien Meliel** » Mon Sep 19 2011 22:04

Roman wrote: What about sorting the alternative words into the main list? For example, you should be able to search for atlaud or agr, find them in the list and be referred to aclod and agor.

Sure, that's no problem at all. Because the structural information is contained within parent_id -> id relationships, you can represent it in any way that you want. It's all in the sort of query that is used, how it is presented, sorted, etc.
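As a sketch of what such a lookup could look like: the child row is joined back to its parent, so a search for an alternative form returns a referral to the main entry. The table is reduced to the three columns needed here, and the function name `lookup` and the "see" wording are only placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ENTRY (ID INTEGER PRIMARY KEY, PARENT_ID INTEGER, GLOSS TEXT)"
)
conn.executemany(
    "INSERT INTO ENTRY (ID, PARENT_ID, GLOSS) VALUES (?, ?, ?)",
    [
        (1, None, "aclod"),
        (2, 1, "atlaud"),  # alternative form of aclod
        (3, None, "agor"),
        (4, 3, "agr"),     # alternative form of agor
    ],
)


def lookup(conn, word):
    """Find a form; if it is a child row, refer the reader to its main entry."""
    row = conn.execute(
        "SELECT c.GLOSS, p.GLOSS FROM ENTRY c "
        "LEFT JOIN ENTRY p ON c.PARENT_ID = p.ID "
        "WHERE c.GLOSS = ?",
        (word,),
    ).fetchone()
    if row is None:
        return None
    gloss, parent = row
    return gloss if parent is None else f"{gloss} -> see {parent}"


# All forms in one alphabetical list, main entries and child forms alike.
merged = [g for (g,) in conn.execute("SELECT GLOSS FROM ENTRY ORDER BY GLOSS")]
```

Here `lookup(conn, "atlaud")` gives a referral to aclod, and `merged` lists the alternative forms in between the main entries.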

I think that Eryniel put a topic for "feature requests" on Parendili. Could you maybe write it down in there?

Roman wrote: Btw, I have just found out that Sindarin has an ISO 639 code assigned to it, which is sjn ("sin" and "snd" were already taken), see this list. So if it appears in the dictionary, one should use sjn beside eng, ger/deu, fra. I've also changed it on the site.

That's interesting! I remember from the previous Omentielva that someone was trying to get the Tengwar included in the Unicode set. I have no idea what became of that.

Post by **Lúthien Meliel** » Wed Sep 21 2011 7:59

Hi,

I have now run into something that I am uncertain about. It's this: since I have transferred the attributes to the Sindarin entries, I am now assigning the values from your wortliste table to the Sindarin entries.
I got an unequal number of rows, which turns out to be caused by the fact that I had added only *one* instance of every Sindarin word that occurs in your list to the Entry table.
Multiple translations are modeled by the relationship via the Translation entity.

But the available German translations do not account for all occurrences of Sindarin entries in your table, it seems.
If I take, for instance, Sindarin _ad-_ - this is how it occurs in your 'wortliste' table:

Obviously, this is not a case of one Sindarin entry with more than one translation, but these are two separate entries: one reconstructed, one attested.

In the first migration, I have not taken this into account, so I need to change that.
Can you tell me how I can distinguish true separate Sindarin entries from the ones where an entry occurs more than once because there is more than one translation? In this case, it's the 'srek' attribute that I could use, but is this always the case?

Post by **Roman** » Wed Sep 21 2011 13:38

I think that Eryniel put a topic for "feature requests" on Parendili. Could you maybe write it down in there?

Well, I don't see it as an optional 'feature', it's just a basic thing the dictionary has to provide.

That's interesting! I remember from the previous Omentielva that someone was trying to get the Tengwar included in the Unicode set. I have no idea what became of that.

The state of affairs from this year's Omentielva is apparently that he [Michael Everson] is still trying.

Can you tell me how I can distinguish true separate Sindarin entries from the ones where an entry occurs more than once because there is more than one translation? In this case, it's the 'srek' attribute that I could use, but is this always the case?

If it's the same entry, then the alternate form and the reference should match.

I should explain how the entries are constructed. Tolkien's translations often vary, so one has to decide whether they're the same word or not. For example, we find:

paran 'naked, bare' PE/17:86
paran 'bald, bare' PE/17:171
paran 'smooth, shaven' RC/433

I think these three are reasonably close, so they get slammed together into one entry:

paran 'naked, bare, bald, smooth, shaven' PE/17:86,171, RC/433.

But then we also encounter something like this:

ogol 'gloom(y)' PE18:88
ogol 'bad, evil, wrong' PE17:170, VT/48:32
ogol untranslated PE/17:149

'Gloomy' and 'evil' are not quite the same thing, and the untranslated gloss could be either, so there should be two entries:

ogol 'gloom(y)' PE18:88, PE/17:149
ogol 'bad, evil, wrong' PE17:170, VT/48:32, PE/17:149

This is all about external homophones so far. The procedure runs into problems when there are internal homophones appearing on the same page:

pann (*pand) 'courtyard' Ety/380
pann 'wide' Ety/380

These can be distinguished by the alternate form (indicating a different etymology). This may also fail, however:

lorn 'asleep' VT45:29
lorn 'quiet water, anchorage, haven, harbour' VT45:29

Here I had to alter the reference manually by including the etymology:

lorn 'asleep' VT45:29, LOR-
lorn 'quiet water, anchorage, haven, harbour' VT45:29, LUR-
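In terms of the migration, the rule above ("same entry if the alternate form and the reference match") could be sketched like this. It assumes each row of the wordlist table is reduced to (word, alternate form, reference, translation); the tuple layout, the function name `group_entries` and the splitting of the paran glosses into separate rows are illustrative assumptions only.

```python
def group_entries(rows):
    """Group wordlist rows into entries keyed by (word, alternate form, reference)."""
    entries = {}
    for word, alt, ref, translation in rows:
        # Rows agreeing on all three parts are one entry with several translations.
        entries.setdefault((word, alt, ref), []).append(translation)
    return entries


# Toy rows built from the examples above.
rows = [
    ("pann", "*pand", "Ety/380", "courtyard"),
    ("pann", None, "Ety/380", "wide"),
    ("lorn", None, "VT45:29, LOR-", "asleep"),
    ("lorn", None, "VT45:29, LUR-", "quiet water"),
    ("paran", None, "PE/17:86,171, RC/433", "naked"),
    ("paran", None, "PE/17:86,171, RC/433", "bare"),
]
entries = group_entries(rows)
```

With these rows pann and lorn each come out as two separate entries, while the two paran rows collapse into one entry with two translations.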
_____________________________________

Btw, I have a question to all the others regarding listing inflected forms in the dictionary: How would you like to have it? I don't think it's sensible to have pedo 'speak!, say!' or cuinar 'they live' as headwords. If anything, one could make them subordinate to the entries ped- and cuina- in a separate column for attested usage.
It should be very helpful, however, to include the past tenses for all verbs - reconstructed, if needed - as they aren't trivial in Sindarin, and are also indicative of a verb's etymology - which does belong in a helpful dictionary. Luckily, one can conveniently use the noun plural slot for that.

Post by **Eryniel Elmíris** » Wed Sep 21 2011 14:16

Hey, Luthien asked me to look at that list as well. I have run across two things, also with regards to your answer, that I think need clarification.

1) Since according to Lúthien the script can differentiate between all sorts of things, but not capitalization, I think the entries for e.g. ardhon (region) / Ardhon (world) might be a problem (unless there is something else to differentiate them in your wordlist that I cannot see).

2) The entries for delu/delw show 3 entries. *delu (delw) adj gefährlich / delu (delw) adj hasserfüllt. / *delu (delw) adj tödlich (Ety/355); adj dick (PE/17:17). Maybe I am nitpicking, but shouldn't "dick" and "tödlich" have different entries?
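On point 1, a small illustration of why capitalization can get lost: whether ardhon and Ardhon count as the same word depends on the collation used for the comparison. The sketch uses SQLite as a stand-in (its default comparison is binary, and `COLLATE NOCASE` folds ASCII case); the actual MySQL setup is not shown in this thread, but there a case-insensitive collation would likewise merge the two words and a binary one would keep them apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ENTRY (GLOSS TEXT)")
conn.executemany("INSERT INTO ENTRY VALUES (?)", [("ardhon",), ("Ardhon",)])

# Binary comparison (SQLite's default): the two spellings stay distinct.
binary_hits = conn.execute(
    "SELECT COUNT(*) FROM ENTRY WHERE GLOSS = 'ardhon'"
).fetchone()[0]

# Case-insensitive comparison: both spellings match the same search term.
nocase_hits = conn.execute(
    "SELECT COUNT(*) FROM ENTRY WHERE GLOSS = 'ardhon' COLLATE NOCASE"
).fetchone()[0]
```

So a case-sensitive comparison for the Sindarin column would be one way to keep such pairs apart.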

As for inflected forms and past tenses, I am all for it.

Attested usage sounds good, since I would never think of actually searching for an inflected form. And for past tense I can only say: HURRAY!

Post by **Lúthien Meliel** » Wed Sep 21 2011 14:17

Roman wrote:
I think that Eryniel put a topic for "feature requests" on Parendili. Could you maybe write it down in there?
Well, I don't see it as an optional 'feature', it's just a basic thing the dictionary has to provide.

ok, I'll rename that topic to Mandatory & Optional Requirements then

- but seriously: it's good to also mention whatever you might think is obvious. It might not be altogether obvious to me, because I don't know that much about the linguistic technicalities.
And also, I know that I tend to overlook things if I don't keep them where I can't miss them.

Of course I can put this requirement in there myself (and I will) - but just if you think of something else that I shouldn't forget, please mention it in that topic.

Re. your explanation: thanks! I'll look into that asap.