Überarbeitung der Wortliste

Eryniel Elmíris · Post by **Eryniel Elmíris** » Fri Sep 09 2011 15:03

As far as I know, umlauts are not "special characters" for SQL (like %, _, '), but they do get sorted differently than we Germans are used to. But I am sure there is a way around it - and Lúthien would know

I checked the dictionary as it exists now and there it looks fine.

Post by **Roman** » Fri Sep 09 2011 18:01

I think I remember we had to replace umlauts and ß, because SQL would somehow treat them as special characters and sort them at the end of the alphabet.

Googling around it seems that this was an issue with earlier SQL versions, but now choosing the right collation (order of characters for sorting) should take care of that.

Also, for anyone with a keyboard which does not provide umlauts or ß, it is thus possible to type just "a", "o", "u" or "ss". Try it and look up "schräg" and "schrag" via the search function.

Mhm, that's an issue I haven't thought of. But the second column is only used for sorting, not for searching. The search runs over the first column, with ä=a etc. already implicit, but for some reason not ß=s or ß=ss - so you cannot substitute a different character for "ß" in the search. I'll probably have to solve that in the php script.

Maybe you just add a comment about that below the search box?!

Good idea.

Eryniel Elmíris · Post by **Eryniel Elmíris** » Sat Sep 10 2011 2:04

Maybe a small tipp for people having international (QWERTY) keyboards who nevertheless regularly need umlauts - like me:

In Windows you can set your keyboard in the control panel (in "Region and Language", then "change keyboards") to an international setting (it will show then you click on "show more"). This is what I actually use. You can then type the single quotation mark ' with any vowel and you get an accented vowel. If you press the double quotation mark " with a vowel, it will become an umlaut (or in case of "e" it becomes "ë" - quite handy for Quenya texts).
The crux is that you need to press an additional <space> if you don't want to use the marks that way.
Unfortunately I haven't found out how to do the "ß" in this way yet - or if indeed there is a way. So I still have to use <Alt>0223 for that.

Thanks to Lúthien for that little insight.

P.S.:

How are you planning the next steps? Do you want to go through the German list and put back Tolkien's English glosses?

Lúthien did mention to write a script to incorporate the existing english glosses, leaving only the newer words to be filled in again.

Lúthien Meliel · Post by **Lúthien Meliel** » Sat Sep 10 2011 11:35

Roman wrote: I'm not claiming to know anything about XML, but that's the same impression I got. Sometimes I do have involved searches - like all words with a certain combination of sounds in a certain source, or something like that - but nothing which cannot be done by an SQL query.

I'll try to explain that issue a bit more in detail here:

XML 'an sich' is not complex: it's just a way to structure data so that it is machine-readable. It is a bit like HTML in the sense that it works with tags like <greeting>mae govannen</greeting>.
XML is, just as HTML, text-based. In that sense it is different from a relational database.

The problem that I had with Didier's data was not so much that it was XML based, but that this TEI schema (Text Encoding Initiative) he was using - was both very complex and not documented (note: 'XML Schema' is the set of rules that defines what sort of tags are allowed in a certain XML document, and how their hierarchical relationship is)

There is a sort of division among programmers in the sense that some prefer to use XML to structure data, while others prefer relational (SQL) databases. They both have their merits, but there are areas where XML is clearly not a handy choice - and vice versa. For my daytime job, we work on the data stream between the tax department and the social security department: these message-like data are usually formatted using XML.

But in the case of a dictionary dataset, using XML is in my opinion not a very good choice. It turns the data into a large monolithic block which is very inflexible and which hampers performance. Didier also admitted that this was the reason that the original Hesperides application was very slow to start up.

Another disadvantage is that a set of data that is stored in XML is very hard to modify. You will have to convert it into a relational model (a SQL database) in order to manipulate the data, and then you could write it back to that XML format again.

An advantage of XML would be that the presentation of the data requires very little work: there are ways to apply a transformation to the XML file, which renders it into a readable format in one go.

But as you also mention: it is comparatively very easy to write an SQL query on a relational dataset.

Anyway, I decided very early on that I had to rework Didier's data in a relational format. That was the largest part of the work, because the format he had used was not documented. Or rather: he did adhere to the TEI standard, but because he used only a subset of that, there was no way to know which "entities" there existed unless I would sift manually through the whole thing. Unfortunately, there doesn't exist a generic program to convert XML to relational data, so I eventually wrote a perl program that did this, and then I had to convert the actual Sindarin data to relational database scrips.
This was the database schema that came out of that reverse engineering:

I think this is probably a very usable and universal structure if you are going to do serious linguistic work, but as a base for a Sindarin dictionary I think it's royally over the top. I didn't really mind too much, because i was glad that I had it working anyhow. But now things seem to be changing anyhow, I think that it's a good opportunity to simplify things a bit.

If you're ok with it, I could have a look at your data and maybe normalize it a bit: it could be interesting to split off some recurring data in separate tables, for instance the "special case mutation", the "word type", "reconstruction marker" and maybe some others. That would render it easier to maintain and also allow for other primary languages to be added.
All keeping it pretty simple, not something elaborate like TEI.

Roman wrote:That's all the relevant information I can think of right now. Tell me what you need and I'll send it to you. How are you planning the next steps? Do you want to go through the German list and put back Tolkien's English glosses?
In any case, it'd be great to have a common English/German database once again.

I would need a dump of those two tables: I'm not sure what database you use, but SQL is pretty much standard. I think that'd not be a problem anyhow. A set of CREATE TABLE statements to create the tables, and INSERT statements for the contents would be fine.

I don't have a fixed plan as yet, but I have been thinking something like to take my old database, and use that to retrieve the English entries via the Sindarin words that are common in both databases. Incidentally, that could give us also the French entries. The remaining entries would have to be entered by hand, but I think that there are maybe about 700 of them, which is doable. Maybe something like retrieving German-English translations via the Google Translate API could work, and then going over it to correct eventual mistakes.

As for the (German) letters like ü / ä/ ö / ß - I think you best decide whether it's better to adhere to that .. I don't know what the current correct form is in Germany? And should maintaining both forms be necessary? As seen from a database design point of view, it's redundant to store both. If we put the original form in the database, we can always choose to display either ü / ä/ ö / ß or ue / ae/ oe / ss by character substitution in the program itself. We could even make it a setting so that people could choose for themselves.

I'm glad that you are also enthusiastic about it, especially given the recent developments

It also solves the awkward situation around the PE17 words, which I never could make any sense of.

I would be happy to get this working too!

Post by **Roman** » Sat Sep 10 2011 22:39

In Windows you can set your keyboard in the control panel (in "Region and Language", then "change keyboards") to an international setting (it will show then you click on "show more"). This is what I actually use. You can then type the single quotation mark ' with any vowel and you get an accented vowel. If you press the double quotation mark " with a vowel, it will become an umlaut (or in case of "e" it becomes "ë" - quite handy for Quenya texts).
The crux is that you need to press an additional <space> if you don't want to use the marks that way.
Unfortunately I haven't found out how to do the "ß" in this way yet - or if indeed there is a way. So I still have to use <Alt>0223 for that.

For Windows, there is a very nice tool called Autohotkey. With it, you can write a personal script which returns characters on the input of specified keys. The script can be copied into the auto-startup so that it'll run right away and you don't have to switch anything.
For example, you can make use of the otherwise useless Windows key, if you have one, and hotkey ß to Win+s. Also, I constantly need special linguistic characters, like vowels with a macron āēīōū (they're only available with a Latvian keyboard), which can also be hotkeyed, of course. In fact, you can hotkey all of Mach's css-Tengwar to any of the keys you want and type them right off the bat.

If you're ok with it, I could have a look at your data and maybe normalize it a bit: it could be interesting to split off some recurring data in separate tables, for instance the "special case mutation", the "word type", "reconstruction marker" and maybe some others. That would render it easier to maintain and also allow for other primary languages to be added.

Sure, go ahead if you think it will improve things. Note, though, that there might be problems with homophonous words, e.g. gwa- is both a verb and a prefix, *dan is reconstructed in the sense 'but' and attested in the sense 'against', barad is a special case mutation in the sense 'damned' and not a s.c. in the sense of 'tower'.
So you cannot always identify words by their shape, and even not always by their shape and reference, e.g. pann 'courtyard' and pann 'wide' are both attested in Ety/380. The only difference between them is that the former word has *pand as an earlier form. On the other hand, lorn 'asleep' and lorn 'haven' are both attested in VT/45:29 and have no different sideforms. Here I actually had to alter the reference by adding the root - LUR- for the former and LOR- for the latter - to differentiate between them. (I guess this is where the XML style would be handier.) All of this gave me enough headache in the php script for proper display and linebreaks.
Of course, one can always put up additional tables via the word id, but it seems to me that this would make it harder to maintain. With only one table, you sacrifice efficiency, but I don't think that it will ever be a problem with the comparatively small dictionary we have.

I would need a dump of those two tables: I'm not sure what database you use, but SQL is pretty much standard. I think that'd not be a problem anyhow. A set of CREATE TABLE statements to create the tables, and INSERT statements for the contents would be fine.

I've uploaded the dumps here (hope they're not too smelly

):
http://www.sindarin.de/wortliste.dat
http://www.sindarin.de/wortliste_komm.dat

I don't have a fixed plan as yet, but I have been thinking something like to take my old database, and use that to retrieve the English entries via the Sindarin words that are common in both databases. Incidentally, that could give us also the French entries.

Sounds like a plan. But again, some warning: Words from PE17-19 may be of identical shape, but with a different meaning, e.g. dîr 'hard, difficult', previously 'man'. On the other hand, some words which were attested before also appear in PE17-19, so that the old entries are expanded by the new references.

The remaining entries would have to be entered by hand, but I think that there are maybe about 700 of them, which is doable. Maybe something like retrieving German-English translations via the Google Translate API could work, and then going over it to correct eventual mistakes.

This doesn't seem like a good idea to me.. For the English wordlist, you'll need Tolkien's exact glosses, so there is no shortcut from going through the publications and typing them in by hand.

As for the (German) letters like ü / ä/ ö / ß - I think you best decide whether it's better to adhere to that .. I don't know what the current correct form is in Germany? And should maintaining both forms be necessary? As seen from a database design point of view, it's redundant to store both. If we put the original form in the database, we can always choose to display either ü / ä/ ö / ß or ue / ae/ oe / ss by character substitution in the program itself. We could even make it a setting so that people could choose for themselves.

Proper German spelling has to have üöäß, of course. But this works fine with the newer SQL versions. I've just changed the php script to sort by the first column and it all appears to be in order - so feel free to ignore the second one.
Output is not an issue, only input for those without ß on the keyboard (substituting ss doesn't work here for some reason). But don't worry about that right now, we should get the basics going first.

Lúthien Meliel · Post by **Lúthien Meliel** » Sun Sep 11 2011 7:18

Thank you!
I've downloaded them and will look at them the coming days.

Lúthien Meliel · Post by **Lúthien Meliel** » Tue Sep 13 2011 23:08

Suilad Roman,

I had a better look at the data the past two days. There's a couple of questions that came from that, as well as a first draft for a database model that would accomodate other languages besides German.

I don't understand the meaning of the columns srek, saltrek and s2rek in table wortliste.
These columns only take the values 0, 1, 2, 3 or 9 - it seems as if the names suggest they refer to the columns sind, sind_alt and Sind2 - although I can't figure out what -rek refers to ?
It doesn't seem to be an index to separate otherwise identical rows for the above columns, but maybe I should check that again (though it'd be great if you could just tell me

)

Also, I don't know what the second table wordliste_komm is for. By its contents I'd think that these are the somewhat uncertain entries that you talk about (like _lorn_) but if so, how does this table fit into the data of the first one?

I puzzled a bit with a possible database design. As to what you mention about (for instance) homophonous words; I want to keep the internal cohesion exactly as it is. The changes that I'm thinking of would therefore leave all that intact; it would just distribute the components in another way across tables so that it would become possible to add other (modern) languages. It would also eliminate the need to enter redundant data, like when there's two rows in wortliste for two translations of one Sindarin word. Here's an ER diagram:

It's of course entirely possible that I missed out something, so if you can look this over and comment, that'd be great!
I also left out the srek, saltrek and s2rek culumns because I don't understand what they mean; and the second table is not yet considered here.

I'll also post this message on the existing thread on Parendili.

Thanks again!

NB - it just occurred to me that this model does not only allow for other modern languages to be added (like English, French ... ) but it can also accomodate another elvish language like Quenya (in that case it'd be better to rename that one Foreign Key column to ELVISH_ENTRY_ID)

Or, even better: if Quenya words would be added, you'd also get a Sindarin-Quenya dictionary for free B-) .. at least for the words that exist in both languages, of course.

Post by **Roman** » Wed Sep 14 2011 9:25

I don't understand the meaning of the columns srek, saltrek and s2rek in table wortliste.
These columns only take the values 0, 1, 2, 3 or 9 - it seems as if the names suggest they refer to the columns sind, sind_alt and Sind2 - although I can't figure out what -rek refers to ?
It doesn't seem to be an index to separate otherwise identical rows for the above columns, but maybe I should check that again (though it'd be great if you could just tell me Smile )

As I described above, they show whether the corresponding entry was reconstructed (value 1) or not (no value, value 0). "Rek" stands for "Rekonstruktion". In the case of plurals (sind2), you often have more than one, and more than one reconstructed, so s2rek indicates the amout of reconstructions in the list sind2 - the reconstructed ones have to go first in the list.
In the php script, the integers are correspondingly replaced with asterisks.
"9" is a typo.

Also, I don't know what the second table wordliste_komm is for. By its contents I'd think that these are the somewhat uncertain entries that you talk about (like _lorn_) but if so, how does this table fit into the data of the first one?

It's for various comments: 'untranslated word', 'isolated from X', 'etymology uncertain', 'part of speech uncertain', 'verbal root uncertain', 'only attested in lenited form', 'dialectal form', 'deleted by Tolkien' and so on.
It is joined with the main table in the php script and the comment is added whenever sind and rf are identical, which indubitably identifies the entry. I guess, if you use an index for the Sindarin entries, you'll take that then.

Lúthien Meliel · Post by **Lúthien Meliel** » Wed Sep 14 2011 10:08

Roman wrote: As I described above, they show whether the corresponding entry was reconstructed (value 1) or not (no value, value 0). "Rek" stands for "Rekonstruktion". In the case of plurals (sind2), you often have more than one, and more than one reconstructed, so s2rek indicates the amout of reconstructions in the list sind2 - the reconstructed ones have to go first in the list.
In the php script, the integers are correspondingly replaced with asterisks.
"9" is a typo.

Ah, I'm sorry: I overlooked your post on the previous page. I get it now.

Roman wrote:It's for various comments: 'untranslated word', 'isolated from X', 'etymology uncertain', 'part of speech uncertain', 'verbal root uncertain', 'only attested in lenited form', 'dialectal form', 'deleted by Tolkien' and so on.
It is joined with the main table in the php script and the comment is added whenever sind and rf are identical, which indubitably identifies the entry. I guess, if you use an index for the Sindarin entries, you'll take that then.

And the same thing here. I'll add all these to the model and post that when done.

Lúthien Meliel · Post by **Lúthien Meliel** » Wed Sep 14 2011 11:10

with the remaining columns added, plus the other comments table:

I'll next create an instance of this schema on MySQL and migrate the data therein, so further test if it stands up to expectations.

NB - it's entirely possible to merge the Function, Ref, Type and Comment tables with the Entry table, because they have the exact same structure. I think it's better not to though, because although it would be compact and more flexible, it would render it practically impossible to read "manually".

Post by **Roman** » Wed Sep 14 2011 13:18

NB - it's entirely possible to merge the Function, Ref, Type and Comment tables with the Entry table, because they have the exact same structure. I think it's better not to though, because although it would be compact and more flexible, it would render it practically impossible to read "manually".

But aren't the IDs different? I thought it would be something like
1 n[oun]
2 v[erb]
3 adj[ective]
...
and:
1 untranslated word
2 etymology uncertain
...
Since the part of speech is a tiny string anyway, I don't think would improve things to assign an ID to it. And the references are unique for almost all Sindarin entries. So I'd suggest having a table

ENTRY
#ID
s GLOSS
s REF
s TYPE

Is #LANGUAGE_ID really needed there?
Also, how do you exactly handle various translations of the same Sindarin word? The rule is that Tolkien's entries are like this: baug 'tyrannous, cruel, oppressive'; and the amount of glosses will likely vary among the languages of translations.

Lúthien Meliel · Post by **Lúthien Meliel** » Wed Sep 14 2011 14:17

Roman wrote:Is #LANGUAGE_ID really needed there?
Also, how do you exactly handle various translations of the same Sindarin word? The rule is that Tolkien's entries are like this: baug 'tyrannous, cruel, oppressive'; and the amount of glosses will likely vary among the languages of translations.

"really needed"; no. You can model this thing in an arbitrary number of ways. What I was thinking of was this (showing rows here):

Code: Select all

LANGUAGE
  1 - Sindarin
100 - Deutsch
101 - English
102 - Français

ENTRY
 1000 -   1  baug
 1001 -   1  alfirin
...
10000 - 101 - tyrannous
10001 - 101 - cruel
10002 - 101 - oppressive
10010 - 100 - grausam
10011 - 100 - hart
10012 - 100 - unbarmherzig
10013 - 100 - bitter
10020 - 101 - immortal
10021 - 101 - type of white flower
10030 - 100 - unsterblich

TRANSLATION (showing only the first two columns for now)
1000 - 10000
1000 - 10001
1000 - 10002
1000 - 10010
1000 - 10011
1000 - 10012
1000 - 10013
1001 - 10020
1001 - 10021
1001 - 10030

If you'd now execute this query:

SELECT l.NAME, e1.GLOSS, e2.GLOSS
FROM ENTRY e1, ENTRY e2, LANGUAGE l
JOIN TRANSLATION t1
ON e1.ID = t1.SIND_ENTRY_ID
JOIN TRANSLATION t2
ON e2.ID = t2.MOD_ENTRY_ID
JOIN LANGUAGE l
on e2.LANGUAGE_ID = l.ID
WHERE e1.LANGUAGE_ID = 1
ORDER BY l.NAME;

you'd get this result set:

Deutsch | baug | grausam
Deutsch | baug | hart
Deutsch | baug | unbarmherzig
Deutsch | baug | bitter
Deutsch | alfirin | unsterblich
English | baug | tyrannous
English | baug | cruel
English | baug | oppressive
English | alfirin | immortal
English | alfirin | type of white flower
-- etc., which you can of course also constrain by selecting only one language.

The Entry table forms, together with the two foreign keys to the Translations table, a many-to-many relationship which can have any number of entries from either end.
Storing the attributes on the Translations table will then give you the freedom to assign those for every result row individually.

I'll address your other points later when I'm home.

Lúthien Meliel · Post by **Lúthien Meliel** » Wed Sep 14 2011 14:35

Maybe it's easier if I first load in the dataset you provided, and then give you the whole thing to test?
I'm currently looking into SQLite, which is maybe somewhat faster than the Derby (Java) database that I used for the previous version. Since SQLite stores everything in just one file, it's easy to distribute

To demonstrate the whole thing, I could also write a query that reproduces the German - Sindarin and Sindarin - German lists.

Anyhow, I'll try go get that done asap; let's just hope that there are no unforeseen complicating factors. I'll keep you posted!

Lúthien Meliel · Post by **Lúthien Meliel** » Wed Sep 14 2011 23:14

Suilad Aran,
I've got some more questions about the data set. As expected, there are some rows that are duplicate with regards to the primary key *if* that's set to a combination of *unique German + unique Sindarin entry*.
It seems that for some of those there is a second entry because of an alternate spelling: dol vs. dôl , vín vs. vîn and si vs. sí

the following query gives an overview of all those cases:

select * from wortliste
where sind in
(select sind from wortliste
group by sind, deu
having count(*) > 1)
and deu in
(select deu from wortliste
group by sind, deu
having count(*) > 1)

For the rows that are marked in the above image I can either find no difference at all or there is one only in the references (note that I truncated those columns in the screenshot only, otherwise it wouldn't fit in here). Is it OK to merge the information in the Reference columns for those rows?

As for the identical rows (with regards to the Sindarin and German entries) that differ not, or only in Reference and URL columns, can we merge those?
As for the others that have alternative spelling, such as for rows #699, 700, 3451 & 3902 - those can be accommodated by inserting a second and/or third TRANSLATION row, setting the INDEX to 2 or 3 (possibly together with the PLURAL marker).

All this makes me think that by adding those two columns (INDEX and PLURAL) to the primary key, we can accomodate all possible entries from your set.

Lastly, for rows #2065 + 3164 (Unser), there is a URL in one of the two rows. Also, the "info" column in the vîn column mentions "poss. len" - is that valid for that spelling only?
The reference column in those two rows is slightly different as well.

I think that if we resolve these, the model is good to go. I can probably have a trial version for you by tomorrow (just a database though!), if all goes well

Post by **Roman** » Thu Sep 15 2011 2:05

The dôl/dol matter is intentional - there seems to have been a change of conception regarding the etymology of the word. The former comes from ndolo and has the plural duil, the latter from *ndoll- (hence the alternate doll) and probably has the plural *dyl.

The other duplicates are accidental - when going through PE17 I of course missed that some words were already attested earlier. Thanks for the corrections! I have updated the previous files with them:
http://www.sindarin.de/wortliste.dat
http://www.sindarin.de/wortliste_komm.dat