Training a Latin language model (AI)

quendidil
Textkit Member
Posts: 197
Joined: Wed Oct 18, 2006 11:39 am

Post by quendidil »

Hey everyone,

I'd like to ask Latin enthusiasts for help, if you can spare the time. You don't need to know any programming.

I haven't been active on here for years. I've been studying deep learning recently, focusing in particular on natural language processing (NLP).

I had an idea for a side project: fine-tuning BERT, a large language model trained by Google, on Latin. I've done a basic version of that, which you can view here: https://colab.research.google.com/drive ... 9siq6ngjNX

I can probably fine-tune the model further, but where I need help from others is in creating tasks to evaluate it.

In particular, I'd like to fine-tune BERT to answer questions in Latin, much as Google's BERT has been fine-tuned to answer English questions on SQuAD.

To do that, I need to build a dataset of passages, questions, and answers to train the model on.
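
For anyone curious how one of these entries could be put together, here is a minimal Python sketch. It assumes SQuAD-style field names (answers, answer_start, text, question, id). The helper name make_entry and its occurrence parameter are made up for illustration and are not part of any library. The only point is that the character offset can be found by searching the passage instead of being counted by hand.

```python
import json


def make_entry(context, question, answer_text, qid, occurrence=0):
    """Build one SQuAD-style question record (hypothetical helper).

    answer_start is the 0-based character offset of the chosen occurrence
    of answer_text inside context (occurrence=0 means the first one).
    """
    start = -1
    for _ in range(occurrence + 1):
        # Search again just past the previous hit to reach later occurrences.
        start = context.find(answer_text, start + 1)
        if start == -1:
            raise ValueError(f"{answer_text!r} does not occur in the passage")
    return {
        "answers": [{"answer_start": start, "text": answer_text}],
        "question": question,
        "id": qid,
    }


passage = "Gallia est omnis divisa in partes tres"
entry = make_entry(passage, "In quot partes divisa est Gallia?", "tres", "example-0001")
print(json.dumps(entry, ensure_ascii=False, indent=2))
```

With something like this, a contributor would only need to supply the passage, the question, and the exact words of the answer as they appear in the passage.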

I am working on this in my spare time, but I will need more passages to train the model than what I can come up with on my own. This is a sample passage in JSON format:

Code:

{
  "data": [
    {
      "title": "De_Bello_Gallico_Liber_I",
      "paragraphs": [
        {
          "context": "Gallia est omnis divisa in partes tres, quarum unam incolunt Belgae, aliam Aquitani, tertiam qui ipsorum lingua Celtae, nostra Galli appellantur. Hi omnes lingua, institutis, legibus inter se differunt. Gallos ab Aquitanis Garumna flumen, a Belgis Matrona et Sequana dividit. Horum omnium fortissimi sunt Belgae, propterea quod a cultu atque humanitate provinciae longissime absunt, minimeque ad eos mercatores saepe commeant atque ea quae ad effeminandos animos pertinent important, proximique sunt Germanis, qui trans Rhenum incolunt, quibuscum continenter bellum gerunt.",
          "qas": [
            {
              "answers": [
                {
                  "answer_start": 34,
                  "text": "tres"
                }
              ],
              "question": "In quot partes divisa est Gallia?",
              "id": "5733be284776f41900661182"
            },
            {
              "answers": [
                {
                  "answer_start": 305,
                  "text": "Belgae"
                }
              ],
              "question": "Qui fortissimi sunt in Gallia?",
              "id": "5733be284776f4190066117f"
            },
            {
              "answers": [
                {
                  "answer_start": 313,
                  "text": "propterea quod a cultu atque humanitate provinciae longissime absunt"
                }
              ],
              "question": "Cur fortissimi sunt Belgae?",
              "id": "latin-dbg1-0003"
            },
            {
              "answers": [
                {
                  "answer_start": 500,
                  "text": "Germanis"
                }
              ],
              "question": "Quibus proximi sunt Belgae?",
              "id": "latin-dbg1-0004"
            },
            {
              "answers": [
                {
                  "answer_start": 514,
                  "text": "trans Rhenum"
                }
              ],
              "question": "Ubi incolunt Germani?",
              "id": "latin-dbg1-0005"
            },
            {
              "answers": [
                {
                  "answer_start": 223,
                  "text": "Garumna"
                }
              ],
              "question": "Quid flumen dividit Gallos ab Aquitanis?",
              "id": "latin-dbg1-0006"
            }
          ]
        }
      ]
    }
  ]
}
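
Character offsets and ids are easy to get wrong by hand, so a small check along these lines might be useful before training. This is only a sketch: check_squad is a hypothetical helper, not part of any library, and the tiny sample at the bottom is deliberately broken to show what gets reported.

```python
from collections import Counter


def check_squad(dataset):
    """Return a list of problems found in a SQuAD-style dict (hypothetical helper)."""
    problems = []
    id_counts = Counter()
    for article in dataset["data"]:
        for para in article["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                id_counts[qa["id"]] += 1
                for ans in qa["answers"]:
                    start, text = ans["answer_start"], ans["text"]
                    # The answer must be the exact substring at its offset.
                    if context[start:start + len(text)] != text:
                        problems.append(
                            f"{qa['id']}: {text!r} not found at offset {start}"
                        )
    # Each question needs its own id.
    problems += [
        f"duplicate id {qid} ({n} times)" for qid, n in id_counts.items() if n > 1
    ]
    return problems


# Deliberately broken sample: a wrong answer text and a repeated id.
sample = {"data": [{"title": "sample", "paragraphs": [{
    "context": "Gallia est omnis divisa in partes tres",
    "qas": [
        {"id": "q1", "question": "In quot partes divisa est Gallia?",
         "answers": [{"answer_start": 34, "text": "tres"}]},
        {"id": "q1", "question": "In quot partes divisa est Gallia?",
         "answers": [{"answer_start": 34, "text": "quattuor"}]},
    ],
}]}]}

for problem in check_squad(sample):
    print(problem)
```

A file loaded with json.load could be passed to the same function in place of sample.
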
I don't expect to collect as much data as SQuAD, and this is really just a "toy" project, but if you find it interesting and want some practice reading and forming questions in Latin, please help me out!

If the mods are fine with it, maybe I could cross-post this on the Latin forum too?

Thanks!
