Roman wrote:
I'm not claiming to know anything about XML, but that's the same impression I got. Sometimes I do have involved searches - like all words with a certain combination of sounds in a certain source, or something like that - but nothing that can't be done with an SQL query.
I'll try to explain that issue a bit more in detail here:
XML in itself is not complex: it's just a way to structure data so that it is machine-readable. It is a bit like HTML in the sense that it works with tags, like <greeting>mae govannen</greeting>.
XML, just like HTML, is text-based. In that sense it differs from a relational database.
The problem I had with Didier's data was not so much that it was XML-based, but that the TEI (Text Encoding Initiative) schema he was using was both very complex and undocumented.
(note: an 'XML Schema' is the set of rules that defines what sort of tags are allowed in a certain XML document, and what their hierarchical relationships are)
There is a sort of division among programmers in the sense that some prefer to use XML to structure data, while others prefer relational (SQL) databases. They both have their merits, but there are areas where XML is clearly not a handy choice - and vice versa. For my daytime job, we work on the data stream between the tax department and the social security department: these message-like data are usually formatted using XML.
But in the case of a dictionary dataset, XML is in my opinion not a very good choice. It turns the data into one large monolithic block, which is very inflexible and hampers performance. Didier also admitted that this was the reason the original Hesperides application was so slow to start up.
Another disadvantage is that a set of data stored in XML is very hard to modify. You have to convert it into a relational model (an SQL database) in order to manipulate the data, and then write it back to the XML format afterwards.
An advantage of XML is that presenting the data requires very little work: there are ways to apply a transformation to the XML file that render it into a readable format in one go.
But as you also mention: it is comparatively easy to write an SQL query against a relational dataset.
Anyway, I decided very early on that I had to rework Didier's data into a relational format. That was the largest part of the work, because the format he had used was not documented. Or rather: he did adhere to the TEI standard, but because he used only a subset of it, there was no way to know which "entities" existed without sifting manually through the whole thing. Unfortunately, no generic program exists to convert XML to relational data, so I eventually wrote a Perl program that did this, and then I had to convert the actual Sindarin data into relational database scripts.
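Just to illustrate what "flattening XML into relational data" means in principle (my actual program was in Perl, and the real TEI tags were far more involved - the tag names below are made up for the example):

```python
import sqlite3
import xml.etree.ElementTree as ET

# A tiny TEI-like fragment; the tag names are invented for this sketch
# and do not reflect the actual schema Didier used.
SAMPLE = """
<dictionary>
  <entry><form>mellon</form><gloss>friend</gloss></entry>
  <entry><form>mae govannen</form><gloss>well met</gloss></entry>
</dictionary>
"""

def xml_to_relational(xml_text, conn):
    """Walk the XML tree and flatten each <entry> into a table row."""
    conn.execute("CREATE TABLE IF NOT EXISTS entry (form TEXT, gloss TEXT)")
    root = ET.fromstring(xml_text)
    for entry in root.iter("entry"):
        conn.execute(
            "INSERT INTO entry (form, gloss) VALUES (?, ?)",
            (entry.findtext("form"), entry.findtext("gloss")),
        )

conn = sqlite3.connect(":memory:")
xml_to_relational(SAMPLE, conn)
rows = conn.execute("SELECT form, gloss FROM entry ORDER BY form").fetchall()
print(rows)  # [('mae govannen', 'well met'), ('mellon', 'friend')]
```

The hard part in practice was not this mechanical walk but discovering, entry by entry, which tags could occur at all.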
This was the database schema that came out of that reverse engineering:
I think this is probably a very usable and universal structure if you are going to do serious linguistic work, but as a base for a Sindarin dictionary I think it's royally over the top. I didn't really mind too much, because I was glad I had it working at all. But now that things seem to be changing anyway, I think it's a good opportunity to simplify things a bit.
If you're ok with it, I could have a look at your data and maybe normalize it a bit: it could be interesting to split off some recurring data into separate tables, for instance the "special case mutation", the "word type", the "reconstruction marker" and maybe some others. That would make it easier to maintain and also allow other primary languages to be added.
All keeping it pretty simple, not something elaborate like TEI.
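To make the normalization idea concrete, here's a rough sketch of what I mean (the table and column names are just placeholders, not a proposal set in stone): recurring values move into small lookup tables, and the main word table references them by id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word_type (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE        -- e.g. 'noun', 'verb'
);
CREATE TABLE mutation (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE        -- e.g. special case mutation classes
);
CREATE TABLE word (
    id           INTEGER PRIMARY KEY,
    sindarin     TEXT NOT NULL,
    gloss        TEXT,
    word_type_id INTEGER REFERENCES word_type(id),
    mutation_id  INTEGER REFERENCES mutation(id)
);
INSERT INTO word_type (name) VALUES ('noun');
INSERT INTO word (sindarin, gloss, word_type_id) VALUES ('mellon', 'friend', 1);
""")
row = conn.execute("""
    SELECT w.sindarin, w.gloss, t.name
    FROM word w JOIN word_type t ON t.id = w.word_type_id
""").fetchone()
print(row)  # ('mellon', 'friend', 'noun')
```

Nothing fancy - a handful of tables like this is still a world away from the full TEI structure.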
Roman wrote: That's all the relevant information I can think of right now. Tell me what you need and I'll send it to you. How are you planning the next steps? Do you want to go through the German list and put back Tolkien's English glosses?
In any case, it'd be great to have a common English/German database once again.
I would need a dump of those two tables. I'm not sure what database you use, but SQL is pretty much a standard, so I don't expect that to be a problem. A set of CREATE TABLE statements to create the tables, and INSERT statements for the contents, would be fine.
I don't have a fixed plan yet, but I have been thinking of something like taking my old database and using it to retrieve the English entries via the Sindarin words that are common to both databases. Incidentally, that could also give us the French entries. The remaining entries would have to be entered by hand, but I think there are maybe about 700 of them, which is doable. Maybe something like retrieving German-English translations via the Google Translate API could work, and then going over the result to correct any mistakes.
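The matching step would essentially be a single JOIN on the Sindarin headword. A toy illustration (table names assumed for the example); a LEFT JOIN also keeps the unmatched remainder visible, which is exactly the list that would need manual entry:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE german  (sindarin TEXT, gloss TEXT);  -- your current list
CREATE TABLE english (sindarin TEXT, gloss TEXT);  -- my old database
INSERT INTO german  VALUES ('mellon', 'Freund'), ('aduial', 'Abendd\u00e4mmerung');
INSERT INTO english VALUES ('mellon', 'friend');
""")
# Entries whose Sindarin headword occurs in both databases get the
# English gloss attached; NULL marks what still has to be done by hand.
rows = conn.execute("""
    SELECT g.sindarin, g.gloss, e.gloss
    FROM german g LEFT JOIN english e ON e.sindarin = g.sindarin
    ORDER BY g.sindarin
""").fetchall()
print(rows)
# [('aduial', 'Abenddämmerung', None), ('mellon', 'Freund', 'friend')]
```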
As for the German letters like ü / ä / ö / ß - I think you're best placed to decide whether to adhere to those; I don't know what the currently correct form in Germany is, or whether maintaining both forms is necessary. From a database design point of view, it's redundant to store both. If we put the original form in the database, we can always choose to display either
ü / ä / ö / ß or
ue / ae / oe / ss by character substitution in the program itself. We could even make it a setting so that people could choose for themselves.
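The substitution itself is trivial - a sketch of the idea (the function name and setting are of course just placeholders):

```python
# Store ü / ä / ö / ß once; expand to the ASCII digraphs only on display.
SUBST = {"ü": "ue", "ä": "ae", "ö": "oe", "ß": "ss",
         "Ü": "Ue", "Ä": "Ae", "Ö": "Oe"}

def display(text, ascii_form=False):
    """Render a gloss either verbatim or with umlauts expanded,
    depending on a per-user setting."""
    if not ascii_form:
        return text
    for src, dst in SUBST.items():
        text = text.replace(src, dst)
    return text

print(display("Abenddämmerung"))                   # Abenddämmerung
print(display("Abenddämmerung", ascii_form=True))  # Abenddaemmerung
```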
I'm glad that you are also enthusiastic about it, especially given the recent developments.
It also solves the awkward situation around the PE17 words, which I never could make any sense of.
I would be happy to get this working too!