---
license: apache-2.0
language:
- en
base_model:
- google/bert_uncased_L-4_H-256_A-4
- WinKawaks/vit-small-patch16-224
pipeline_tag: image-to-text
library_name: transformers
tags:
- vit
- bert
- vision
- caption
- captioning
- image
---
An image captioning model based on bert-mini and vit-small, weighing only 130 MB!

Runs very fast on CPU.

```python
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
import requests, time
from PIL import Image

model_path = "cnmoro/mini-image-captioning"

# load the image captioning model and the corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

# download and preprocess an image
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(image, return_tensors="pt").pixel_values

start = time.time()

# generate caption - suggested settings
# note: temperature, top_p and top_k only take effect when sampling is enabled
# (do_sample=True), either here or in the model's generation config
generated_ids = model.generate(
    pixel_values,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    num_beams=3  # you can use 1 for even faster inference with a small drop in quality
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

end = time.time()

print(generated_text)
# a large group of people walking through a busy city.

print(f"Time taken: {end - start} seconds")
# Time taken: 0.19002342224121094 seconds
# on CPU!
```
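
A single timed call can be skewed by one-off warm-up cost, so a steadier figure may come from averaging several runs. The following is a minimal sketch, assuming a hypothetical helper named `mean_latency` (it is not part of transformers) and the `model` and `pixel_values` objects from the example above:

```python
import time
from statistics import mean


def mean_latency(fn, repeats=5, warmup=1):
    """Return the mean wall-clock seconds of fn() over `repeats` timed runs.

    Hypothetical helper for illustration only; not part of transformers.
    """
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Usage, assuming `model` and `pixel_values` are already loaded as shown above:
# beams_3 = mean_latency(lambda: model.generate(pixel_values, num_beams=3))
# beams_1 = mean_latency(lambda: model.generate(pixel_values, num_beams=1))
# print(f"num_beams=3: {beams_3:.3f}s, num_beams=1: {beams_1:.3f}s")
```

Comparing the two averages gives a rough sense of how much speed `num_beams=1` buys on a given machine; the actual numbers will depend on the hardware and the image.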