Audio Diffusion

Translator: 片刻小哥哥

Project URL: https://huggingface.apachecn.org/docs/diffusers/api/pipelines/audio_diffusion

Original URL: https://huggingface.co/docs/diffusers/api/pipelines/audio_diffusion

Audio Diffusion is by Robert Dargavel Smith, and it leverages the recent advances in image generation from diffusion models by converting audio samples to and from Mel spectrogram images.

The original codebase, training scripts, and example notebooks can be found at teticio/audio-diffusion.

Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.

AudioDiffusionPipeline

class diffusers.AudioDiffusionPipeline(vqvae: AutoencoderKL, unet: UNet2DConditionModel, mel: Mel, scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_ddpm.DDPMScheduler])

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L30)

Parameters

vqvae ( AutoencoderKL ) — Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations.
unet ( UNet2DConditionModel ) — A UNet2DConditionModel to denoise the encoded image latents.
mel ( Mel ) — Transform audio into a spectrogram.
scheduler ( DDIMScheduler or DDPMScheduler ) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DDIMScheduler or DDPMScheduler.

Pipeline for audio diffusion.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).

__call__

__call__(batch_size: int = 1, audio_file: str = None, raw_audio: ndarray = None, slice: int = 0, start_step: int = 0, steps: int = None, generator: Generator = None, mask_start_secs: float = 0, mask_end_secs: float = 0, step_generator: Generator = None, eta: float = 0, noise: Tensor = None, encoding: Tensor = None, return_dict = True) → List[PIL Image]

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L70)

Parameters

batch_size ( int ) — Number of samples to generate.
audio_file ( str ) — An audio file that must be on disk due to a Librosa limitation.
raw_audio ( np.ndarray ) — The raw audio file as a NumPy array.
slice ( int ) — Slice number of audio to convert.
start_step ( int ) — Step to start diffusion from.
steps ( int ) — Number of denoising steps (defaults to 50 for DDIM and 1000 for DDPM).
generator ( torch.Generator ) — A torch.Generator to make generation deterministic.
mask_start_secs ( float ) — Number of seconds of audio to mask (not generate) at start.
mask_end_secs ( float ) — Number of seconds of audio to mask (not generate) at end.
step_generator ( torch.Generator ) — A torch.Generator used to denoise, or None.
eta ( float ) — Corresponds to parameter eta (η) from the DDIM paper. Only applies to the DDIMScheduler, and is ignored in other schedulers.
noise ( torch.Tensor ) — A noise tensor of shape (batch_size, 1, height, width) or None.
encoding ( torch.Tensor ) — A tensor for UNet2DConditionModel of shape (batch_size, seq_length, cross_attention_dim).
return_dict ( bool ) — Whether or not to return an AudioPipelineOutput, ImagePipelineOutput or a plain tuple.

Returns

List[PIL Image] — A list of Mel spectrograms (float, List[np.ndarray]) with the sample rate and raw audio.

The call function to the pipeline for generation.

Examples:

For audio diffusion:

import torch
from IPython.display import Audio
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-256").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))

For latent audio diffusion:

import torch
from IPython.display import Audio
from diffusers import DiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = DiffusionPipeline.from_pretrained("teticio/latent-audio-diffusion-256").to(device)

output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))

For other tasks like variation, inpainting, outpainting, etc:

output = pipe(
    raw_audio=output.audios[0, 0],
    start_step=int(pipe.get_default_steps() / 2),
    mask_start_secs=1,
    mask_end_secs=1,
)
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
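
To make the mask_start_secs and mask_end_secs arguments used above more concrete, the following sketch estimates how many spectrogram columns a given number of seconds covers. It assumes that each spectrogram column corresponds to one hop of hop_length samples, so one second spans about sample_rate / hop_length columns. The helper name seconds_to_columns is hypothetical and not part of diffusers, and the pipeline's own rounding may differ.

```python
def seconds_to_columns(seconds: float, sample_rate: int = 22050, hop_length: int = 512) -> int:
    """Estimate how many spectrogram columns cover `seconds` of audio.

    Hypothetical helper (not part of diffusers). Assumes one column per hop
    of `hop_length` samples.
    """
    columns_per_second = sample_rate / hop_length
    return int(seconds * columns_per_second)


# With the Mel defaults documented on this page (22050 Hz, hop length 512),
# one second of audio spans roughly 43 columns of a 256-column spectrogram.
print(seconds_to_columns(1.0))
```

Under these assumptions, mask_start_secs=1 and mask_end_secs=1 would each keep roughly a sixth of a 256-column spectrogram fixed while the middle is regenerated.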

encode

encode(images: typing.List[PIL.Image.Image], steps: int = 50) → np.ndarray

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L270)

Parameters

images ( List[PIL Image] ) — List of images to encode.
steps ( int ) — Number of encoding steps to perform (defaults to 50).

Returns

np.ndarray — A noise tensor of shape (batch_size, 1, height, width).

Reverse the denoising step process to recover a noisy image from the generated image.

get_default_steps

get_default_steps() → int

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L61)

Returns

int — The number of steps.

Returns default number of steps recommended for inference.

slerp

slerp(x0: Tensor, x1: Tensor, alpha: float) → torch.Tensor

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/pipeline_audio_diffusion.py#L311)

Parameters

x0 ( torch.Tensor ) — The first tensor to interpolate between.
x1 ( torch.Tensor ) — The second tensor to interpolate between.
alpha ( float ) — Interpolation factor between 0 and 1.

Returns

torch.Tensor — The interpolated tensor.

Spherical Linear intERPolation.
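
As an illustration of what spherical linear interpolation does, here is a minimal NumPy sketch. It is not the pipeline's implementation (which operates on torch.Tensor inputs); the fallback to linear interpolation for parallel inputs is an assumption added here to avoid dividing by zero.

```python
import numpy as np


def slerp(x0: np.ndarray, x1: np.ndarray, alpha: float) -> np.ndarray:
    """Spherical linear interpolation between two arrays (illustrative sketch)."""
    # Angle between the two arrays, treated as flat vectors.
    cos_theta = np.dot(x0.ravel(), x1.ravel()) / (np.linalg.norm(x0) * np.linalg.norm(x1))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.isclose(np.sin(theta), 0.0):
        # Assumption: for (anti-)parallel inputs, fall back to linear interpolation.
        return (1.0 - alpha) * x0 + alpha * x1
    return (np.sin((1.0 - alpha) * theta) * x0 + np.sin(alpha * theta) * x1) / np.sin(theta)


a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(a, b, 0.5)
print(mid, np.linalg.norm(mid))
```

For two unit vectors, the midpoint stays on the unit circle, whereas plain linear interpolation would shrink its norm. This is the usual motivation for using slerp when mixing Gaussian noise tensors.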

AudioPipelineOutput

class diffusers.AudioPipelineOutput(audios: ndarray)

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/pipeline_utils.py#L124)

Parameters

audios ( np.ndarray ) — Denoised audio samples as a NumPy array of shape (batch_size, num_channels, sample_rate).

Output class for audio pipelines.

ImagePipelineOutput

class diffusers.ImagePipelineOutput(images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray])

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/pipeline_utils.py#L110)

Parameters

images ( List[PIL.Image.Image] or np.ndarray ) — List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).

Output class for image pipelines.

Mel

class diffusers.Mel(x_res: int = 256, y_res: int = 256, sample_rate: int = 22050, n_fft: int = 2048, hop_length: int = 512, top_db: int = 80, n_iter: int = 32)

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/mel.py#L37)

Parameters

x_res ( int ) — x resolution of spectrogram (time).
y_res ( int ) — y resolution of spectrogram (frequency bins).
sample_rate ( int ) — Sample rate of audio.
n_fft ( int ) — Size of the Fast Fourier Transform window, in samples.
hop_length ( int ) — Hop length (a higher number is recommended if y_res < 256).
top_db ( int ) — Loudest decibel value.
n_iter ( int ) — Number of iterations for Griffin-Lim Mel inversion.
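
To give a feel for how these parameters interact, the sketch below estimates how much audio one spectrogram image covers. It assumes one spectrogram column per hop of hop_length samples, so an image of width x_res spans about x_res * hop_length samples. The helper name slice_duration_seconds is hypothetical and not part of the Mel class.

```python
def slice_duration_seconds(x_res: int = 256, hop_length: int = 512, sample_rate: int = 22050) -> float:
    """Approximate duration of audio covered by one spectrogram of width x_res.

    Hypothetical helper; assumes one spectrogram column per hop of
    `hop_length` samples.
    """
    return x_res * hop_length / sample_rate


# With the defaults listed above, one 256-column image covers about 5.94 seconds.
print(round(slice_duration_seconds(), 2))
```

Halving x_res while keeping hop_length fixed would halve the covered duration, which is one way to read the advice to raise hop_length at lower resolutions.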

audio_slice_to_image

audio_slice_to_image(slice: int) → PIL Image

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/mel.py#L143)

Parameters

slice ( int ) — Slice number of audio to convert (out of get_number_of_slices()).

Returns

PIL Image — A grayscale image of x_res x y_res.

Convert slice of audio to spectrogram.
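
Storing a spectrogram as a grayscale image means squeezing decibel values into 8-bit pixels. The exact scaling used by Mel is not stated on this page, so the following is only one plausible mapping, assuming the log-Mel values are expressed relative to the loudest bin and clipped to the range [-top_db, 0]. The function names db_to_pixels and pixels_to_db are hypothetical.

```python
import numpy as np


def db_to_pixels(log_spec: np.ndarray, top_db: int = 80) -> np.ndarray:
    """Map decibel values in [-top_db, 0] to 8-bit grayscale values in [0, 255]."""
    scaled = (log_spec + top_db) * 255.0 / top_db
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


def pixels_to_db(pixels: np.ndarray, top_db: int = 80) -> np.ndarray:
    """Approximate inverse of db_to_pixels (precision is lost to 8-bit rounding)."""
    return pixels.astype(np.float64) * top_db / 255.0 - top_db


log_spec = np.array([[-80.0, -40.0], [-20.0, 0.0]])
pixels = db_to_pixels(log_spec)
restored = pixels_to_db(pixels)
print(pixels)
print(np.abs(restored - log_spec).max())  # rounding error, at most half a gray level
```

With top_db = 80, half a gray level corresponds to roughly 0.16 dB, so the round trip through an image loses very little amplitude information. Phase is not stored at all, which is why converting back to audio needs an iterative method such as Griffin-Lim.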

get_audio_slice

get_audio_slice(slice: int = 0) → np.ndarray

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/mel.py#L121)

Parameters

slice ( int ) — Slice number of audio (out of get_number_of_slices()).

Returns

np.ndarray — The audio slice as a NumPy array.

Get slice of audio.

get_number_of_slices

get_number_of_slices() → int

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/mel.py#L112)

Returns

int — Number of spectrograms the audio can be sliced into.

Get number of slices in audio.
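
The relationship between get_number_of_slices() and get_audio_slice() can be pictured as cutting the loaded audio into consecutive, non-overlapping chunks. The sketch below shows that idea on a plain Python list. The function names and the explicit slice_size argument are hypothetical, and how Mel itself derives the slice size or handles leftover samples is not covered here.

```python
from typing import List, Sequence


def number_of_slices(audio: Sequence[float], slice_size: int) -> int:
    """Count the complete, non-overlapping slices of slice_size samples in audio."""
    return len(audio) // slice_size


def audio_slice(audio: Sequence[float], slice_size: int, index: int = 0) -> List[float]:
    """Return the samples belonging to slice number `index`."""
    start = slice_size * index
    return list(audio[start : start + slice_size])


samples = [float(i) for i in range(10)]
print(number_of_slices(samples, 4))  # 2 complete slices; the last 2 samples are left over
print(audio_slice(samples, 4, 1))    # [4.0, 5.0, 6.0, 7.0]
```

Valid slice indices in this sketch run from 0 to number_of_slices(...) - 1, matching the "out of get_number_of_slices()" wording in the parameter descriptions.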

get_sample_rate

get_sample_rate() → int

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/mel.py#L134)

Returns

int — Sample rate of audio.

Get sample rate.

image_to_audio

image_to_audio(image: Image) → audio ( np.ndarray )

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/mel.py#L162)

Parameters

image ( PIL Image ) — A grayscale image of x_res x y_res.

Returns

audio ( np.ndarray ) — The audio as a NumPy array.

Converts spectrogram to audio.

load_audio

load_audio(audio_file: str = None, raw_audio: ndarray = None)

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/mel.py#L94)

Parameters

audio_file ( str ) — An audio file that must be on disk due to a Librosa limitation.
raw_audio ( np.ndarray ) — The raw audio file as a NumPy array.

Load audio.

set_resolution

set_resolution(x_res: int, y_res: int)

[source](https://github.com/huggingface/diffusers/blob/v0.23.0/src/diffusers/pipelines/audio_diffusion/mel.py#L80)

Parameters

x_res ( int ) — x resolution of spectrogram (time).
y_res ( int ) — y resolution of spectrogram (frequency bins).

Set resolution.
