Iahlt

Resources

Start with NLP

Recommended courses:
https://www.coursera.org/specializations/natural-language-processing

Recommended textbook, available online:
https://web.stanford.edu/~jurafsky/slp3/

It also provides great little introductions to many fields of linguistics before you hop into the computational part.

NLP Tutorials Part -I from Basics to Advance

https://www.analyticsvidhya.com/blog/2022/01/nlp-tutorials-part-i-from-basics-to-advance/

Natural Language Processing with Python

https://www.nltk.org/book/

100 ChatGPT terms explained from NLP to Entity Extraction

https://www.geeky-gadgets.com/chatgpt-terms-explained/

Natural Language Processing In Healthcare

https://www.routledge.com/Natural-Language-Processing-In-Healthcare-A-Special-Focus-on-Low-Resource/Dash-Parida-Tello-Acharya-Bojar/p/book/9780367685393

Natural Language Processing Specialization

https://www.coursera.org/specializations/natural-language-processing#courses

Hebrew NLP resources

Hebrew NLP Resources

https://github.com/NNLP-IL/Resources

NNLP-IL Hebrew and Arabic NLP Resources

https://resources.nnlp-il.mafat.ai

Hebrew Handwritten Text Recognizer (OCR)

https://github.com/Lotemn102/HebHTR

Databases and possible collaborations
https://docs.google.com/spreadsheets/d/1fGYKyA5Jf_KPCXPCpRWGfRzjDc6ALp9dgKnbIXqxM_Y/edit#gid=0

Legal

Legal opinion: uses of copyright-protected content for machine learning

https://www.gov.il/he/departments/legalInfo/machine-learning

Israel's Policy on Artificial Intelligence Regulation and Ethics (Ministry of Innovation, Science and Technology)

https://www.gov.il/en/departments/policies/ai_2023

Open source

GitHub

NLP
https://github.com/topics/natural-language-processing

Speech

https://github.com/topics/speech

spaCy · Industrial-strength Natural Language Processing in Python
https://spacy.io/

Stanza – A Python NLP Package for Many Human Languages

Created by the Stanford NLP Group

https://stanfordnlp.github.io/stanza/

Open Source OCR

https://github.com/tesseract-ocr/tesseract

Speech Recognition - Whisper (OpenAI)

https://cdn.openai.com/papers/whisper.pdf

Unsupervised

Large language model (LLM)

Open LLMs List

https://github.com/eugeneyan/open-llms

Introducing Open Platform for Enterprise AI
https://www.youtube.com/watch?v=Q5uTvmv5fZU&t=4s

Open Platform for Enterprise AI

https://opea.dev/

https://github.com/opea-project

Nvidia

https://www.nvidia.com/en-eu/ai-data-science/generative-ai/

AWS Bedrock

https://aws.amazon.com/bedrock/

Google Gemma

https://ai.google.dev/gemma

Google Gemini

https://deepmind.google/technologies/gemini/#introduction

Mistral AI brings strong open generative models to developers, along with efficient ways to deploy and customise them for production.

https://mistral.ai

Meta Llama
https://ai.meta.com/llama/

Alibaba Cloud provides Tongyi Qianwen (Qwen)
https://www.alibabacloud.com/en/solutions/generative-ai/qwen?_p_lc=1

What’s before GPT-4? A deep dive into ChatGPT

https://medium.com/digital-sense-ai/whats-before-gpt-4-a-deep-dive-into-chatgpt-dfce9db49956

GPT-4 training process, as described by OpenAI

Like previous GPT models, the GPT-4 base model was trained to predict the next word in a document, and was trained using publicly available data (such as internet data) as well as data we’ve licensed. The data is a web-scale corpus of data including correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and representing a great variety of ideologies and ideas.

So when prompted with a question, the base model can respond in a wide variety of ways that might be far from a user’s intent. To align it with the user’s intent within guardrails, we fine-tune the model’s behavior using reinforcement learning with human feedback (RLHF).

Note that the model’s capabilities seem to come primarily from the pre-training process—RLHF does not improve exam performance (without active effort, it actually degrades it). But steering of the model comes from the post-training process—the base model requires prompt engineering to even know that it should answer the questions.

https://openai.com/research/gpt-4
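
To make "trained to predict the next word" concrete, here is a minimal sketch in Python. It is a toy bigram counter, not how GPT-4 is built: real models use neural networks over tokens and far larger corpora. The corpus string and the function names train_bigram_model and predict_next are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the training text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus (an assumption, for illustration only).
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram_model(corpus)

print(predict_next(model, "the"))    # "cat" follows "the" most often in this corpus
print(predict_next(model, "zebra"))  # None: the word never appeared in training
```

Like the base model described above, this toy predictor only continues text in the way its training data suggests. It has no notion of what the user wants, which is the gap that post-training steps such as RLHF are meant to close.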

BERT
https://github.com/google-research/bert

AlephBERT

https://github.com/OnlpLab/AlephBERT
https://arxiv.org/pdf/2104.04052.pdf

Multi-language Aspects
How Language-Neutral is Multilingual BERT?
https://arxiv.org/pdf/1911.03310.pdf

AraBERT: Transformer-based Model for Arabic Language Understanding
https://arxiv.org/pdf/2003.00104.pdf

ELMo
https://allennlp.org/elmo

LaBSE - Language-agnostic BERT sentence embedding model supporting 109 languages.

https://tfhub.dev/google/LaBSE/2

PyTorch port of the LaBSE model. It can be used to map 109 languages to a shared vector space.
https://huggingface.co/sentence-transformers/LaBSE
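
A shared vector space means that sentences with the same meaning should land close together regardless of language, so cross-language matching can reduce to a nearest-neighbour search. The sketch below shows only that comparison step in plain Python. The three-dimensional vectors are invented for illustration and were not produced by LaBSE, whose real embeddings have many more dimensions. The helper names cosine_similarity and best_match are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec, candidates):
    """Return the label of the candidate vector most similar to query_vec."""
    return max(
        candidates,
        key=lambda label: cosine_similarity(query_vec, candidates[label]),
    )


# Made-up embeddings (assumption): in practice these would come from a
# multilingual sentence encoder such as LaBSE.
embeddings = {
    "en: good morning": [0.9, 0.1, 0.0],
    "en: the invoice is overdue": [0.0, 0.2, 0.9],
}

# Stands in for the embedding of a Hebrew sentence meaning "good morning".
query = [0.8, 0.2, 0.1]

print(best_match(query, embeddings))  # en: good morning
```

Cosine similarity is used here because it compares direction rather than length, which is a common choice for sentence embeddings.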

Claude is a large language model (LLM) built by Anthropic.
It's trained to be a helpful assistant in a conversational tone.

https://docs.anthropic.com/claude/docs/getting-started-with-claude

Jais - open-source Arabic Large Language Model (LLM)

https://huggingface.co/core42/jais-13b/tree/main

https://inceptioniai.org/jais/

Falcon LLM - Falcon 40B was the world’s top-ranked open-source AI model when launched

https://falconllm.tii.ae/falcon.html

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

https://huggingface.co/TRI-ML/mamba-7b-rw
https://huggingface.co/papers/2312.00752

https://www.together.ai/blog/mamba-3b-slimpj
