---
base_model:
- facebook/w2v-bert-2.0
datasets:
- classla/ParlaSpeech-RS
- classla/ParlaSpeech-HR
- classla/Mici_Princ
language:
- sl
- hr
- sr
library_name: transformers
license: cc-by-sa-4.0
metrics:
- accuracy
---

# Model Card

This model annotates primary stress in words on 20 ms frames.

## Model Details

### Model Description

- **Developed by:** [Peter Rupnik](https://huggingface.co/5roop), [Nikola Ljubešić](https://huggingface.co/nljubesi), Ivan Porupski, Nejc Robida
- **Model type:** Audio frame classifier
- **Language(s) (NLP):** Croatian, Slovenian, Serbian, Chakavian variant of Croatian
- **License:** Creative Commons - Share Alike 4.0

### Model Sources

- **Paper:** Coming soon

### Direct Use

The model is intended for data-driven analyses of primary stress position. So far, it has been shown to work on four datasets in three languages.

## Example use

```python
from itertools import pairwise

import numpy as np
import pandas as pd
import torch
from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model_name = "5roop/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

# Path to the file containing the word to be annotated:
f = "wavs/word.wav"


def frames_to_intervals(frames: list[int]) -> list[tuple[float, float]] | None:
    """Convert a sequence of per-frame 0/1 predictions into
    (start_s, end_s) intervals of predicted primary stress."""
    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Find the frame indices where the prediction flips between 0 and 1:
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    for si, ei in pairwise(indices_of_change):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
            )
    if results == []:
        return None
    # Post-processing: if multiple regions were returned, only the longest is kept:
    if len(results) > 1:
        results = sorted(results, key=lambda t: t[1] - t[0], reverse=True)
    return results[0:1]


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred_raw = np.array(logits.cpu())
    y_pred = y_pred_raw.argmax(axis=-1)
    primary_stress = [frames_to_intervals(i) for i in y_pred]
    return {
        "y_pred": y_pred,
        "y_pred_logits": y_pred_raw,
        "primary_stress": primary_stress,
    }


# Create a dataset with a single instance and map our evaluator function on it:
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=1)  # Adjust batch size according to your hardware specs

print(ds["y_pred"][0])
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...]
print(ds["y_pred_logits"][0])
# Outputs:
# [[ 0.89419061, -0.77746612],
#  [ 0.44213724, -0.34862748],
#  [-0.08605709,  0.13012762],
#  ....
print(ds["primary_stress"][0])
# Outputs: [0.34, 0.4]
```

## Training Details

### Training Data

10443 manually annotated multisyllabic words from [ParlaSpeech-HR](https://huggingface.co/datasets/classla/ParlaSpeech-HR).
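The released preprocessing pipeline is not reproduced here, but conceptually, each word's manually annotated stress interval has to be converted into per-frame binary labels at the model's 20 ms frame rate. The sketch below illustrates one way to do this; `interval_to_frame_labels` and the `(start, end)` annotation format are assumptions for illustration, not the published training code:

```python
# Hypothetical helper (not part of the released pipeline): turn one word's
# annotated stress interval, given as (start, end) offsets in seconds,
# into per-frame binary labels at a 20 ms frame rate.
def interval_to_frame_labels(
    stress_start_s: float, stress_end_s: float, n_frames: int
) -> list[int]:
    labels = []
    for i in range(n_frames):
        t = 0.020 * i  # frame start time, matching frames_to_intervals above
        labels.append(int(stress_start_s <= t < stress_end_s))
    return labels


# A 0.6 s word (30 frames) with primary stress annotated from 0.34 s to 0.40 s:
print(interval_to_frame_labels(0.34, 0.40, 30))
# [0, 0, ..., 1, 1, 1, 0, ...] -- frames 17-19 are labelled as stressed
```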
### Training Procedure

#### Training Hyperparameters

- Learning rate: 1e-5
- Batch size: 32
- Number of epochs: 20
- Weight decay: 0.01
- Gradient accumulation steps: 1

## Evaluation

Coming soon.

## Citation

Coming soon.