A Complete Guide to Stable Diffusion, the Open-Source AI Painting Tool

When I first opened Stable Diffusion's WebUI, I was honestly a little lost. Dozens of parameters, a bunch of incomprehensible drop-down menus, and a command-line window filling the screen with scrolling output. I closed it three times and reopened it three times. But after using it for a month, I realized I had not opened Midjourney once.

It's not that it produces better results. In fact, Midjourney generates images faster and has a lower barrier to writing prompts. But Stable Diffusion gave me something completely different: a sense of control. You can steer the generation process down to the pixel. From models and sampling methods to face restoration, no step is a black box. Once you get used to this feeling, it is hard to go back.

Let's talk about the hardware hurdle first

There are all sorts of rumors about graphics cards. Some people say you must have a 4090; others say integrated graphics can run it. The truth lies somewhere in between.

My own experience: I initially ran SD 1.5 on an old laptop (GTX 1650 Mobile, 4GB of video memory). It took about two to three minutes to generate a 512x512 image, and it occasionally crashed with an out-of-memory error. To be honest, that experience was painful. But when I switched to a desktop with an RTX 4070, the same image took about eight to ten seconds, and I could run SDXL models. The gap is real.

If you only have 4GB of video memory: It's completely usable. You can run SD 1.5 models without problems, you just need to be patient. Close all unnecessary background programs and enable the medvram option (the --medvram launch argument in the WebUI) to relieve memory pressure.

If you have 8GB of video memory or more: Congratulations, SDXL will also run smoothly. That is the baseline requirement for the current generation of open-source models with the best image quality.

If you don't have a graphics card or only have integrated graphics: Google Colab's free GPUs can run it, and there are also cloud deployment platforms in China. Renting someone else's graphics card and paying as you go is quite convenient.
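
The three tiers above can be written down as a small lookup. This is only a sketch of the rule of thumb in this section: the function name and the returned fields are my own invention, 8GB is treated as the SDXL threshold, and `--medvram` is assumed to be the launch flag behind the WebUI's medvram option.

```python
def recommend_setup(vram_gb: float) -> dict:
    """Map available video memory (in GB) to the rough advice given above."""
    if vram_gb >= 8:
        # Enough headroom for SDXL, per the 8GB tier.
        return {"run_on": "local", "model_family": "SDXL", "launch_args": []}
    if vram_gb >= 4:
        # SD 1.5 works; medvram trades some speed for lower memory use.
        return {"run_on": "local", "model_family": "SD 1.5", "launch_args": ["--medvram"]}
    # No usable GPU: rent one (Colab or a pay-as-you-go cloud platform).
    return {"run_on": "cloud", "model_family": "depends on the rented GPU", "launch_args": []}


for vram in (0, 4, 12):
    print(vram, recommend_setup(vram))
```

The thresholds are deliberately coarse. Real memory use also depends on resolution, batch size, and which extensions are loaded.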

One thing to be clear about: don't get stuck on the idea that "you can't get started without a good graphics card." Many people are put off by the hardware threshold, but in practice you will probably spend the first two months learning to select models, write prompts, and adjust parameters, none of which is very demanding on the graphics card. Start first, and upgrade the hardware later.

About installation: use an integration package or set it up yourself

I have always used an integration package. I tried installing the original WebUI manually once, spent an evening resolving Python environment conflicts, and ended up going back to the integration package.

The advantage of an integration package is that someone else has already configured the environment: you download it, unzip it, and click to launch. I have used Autumn Leaf's integration package, and it does a good job. It comes with common models and a manager, and it is friendly to Chinese users. Some people use integration packages from other uploaders on Bilibili, which offer similar features.

Who is manual installation for? It suits users who don't panic at error messages, are comfortable with the command line, and want the newest features as soon as they land. If that describes you, go to GitHub and find a stable WebUI fork (such as Forge). The documentation is clearly written, and you will most likely succeed if you follow the steps.

My personal opinion is that most people should choose an integration package. Tools serve you; they are not your master. Any installation method that gets you creating is a good one.

Model selection: Don't fall into the "model sea" trap

This is the most common trap I have seen novices fall into: downloading models compulsively. Whenever I saw a new model posted on Civitai, I downloaded it, and dozens of models piled up in my collection. As a result, I agonized for ten minutes over which model to use before every drawing session.

My experience is that focusing on two or three models at a time matters far more than owning twenty.

Classic models from the Stable Diffusion 1.5 era (such as realisticVision and deliberate) are still good places to start. They are small (about 2GB), mature and stable in style, and well covered by online tutorials and prompt references.

If I want to draw in a realistic style, realisticVision is what I have used for a long time. It handles Asian faces much better than earlier models.

If I want to draw in an anime style, I find the majicMIX series relatively friendly to Chinese users in practice, and I can get a good anime-style image without writing long prompts.

As for SDXL models, the image quality is indeed a step up, but the models are large (more than 6GB), demand more from the hardware, and generate more slowly. I suggest moving to SDXL after you are familiar with SD 1.5 models, so that you can clearly see where the improvement lies.

One suggestion: every model has something it does best, so find that and explore it thoroughly. Generating five hundred images with one model teaches you far more than generating ten images with each of many models.

ControlNet: What really makes SD usable

If Stable Diffusion is a car, ControlNet is the steering wheel and brakes. Before ControlNet came out, getting the composition you wanted from AI drawing was basically a matter of luck. You wrote "A girl standing on the beach and looking into the distance", and what came out might be a front view, a side view, a figure lying down, or even just half of her face.

The ControlNet feature I rely on most is OpenPose. For example, if you find a reference pose picture online and use OpenPose to extract the skeleton keypoints, SD will generate characters that closely follow that pose. This is not "roughly similar", but precise control at the skeleton level.

Two other preprocessors, Canny and Depth, are also very useful. Canny is suited to coloring line sketches: you draw a few rough lines, and SD can turn them into a complete illustration. Depth preserves the spatial layering of the image and is particularly useful when making local modifications.

One of my most common combinations is to control the overall composition with Depth from a reference picture and lock the character's pose with OpenPose. About 70% of the first batch of images produced this way come close to what I want, and the rest is a matter of repeatedly adjusting the prompt and sampling parameters.
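
Here is a rough sketch of how that Depth plus OpenPose combination might be described as data for an automated workflow. It is modeled loosely on how the ControlNet extension accepts "units" through the WebUI API, but every key name, the preprocessor spellings (`depth_midas`, `openpose`), and the model placeholders are assumptions to check against your installed version. Nothing is sent over the network here.

```python
import base64


def controlnet_unit(module: str, model: str, image_bytes: bytes, weight: float = 1.0) -> dict:
    """Describe one ControlNet unit: a preprocessor, a matching model, and a reference image."""
    return {
        "enabled": True,
        "module": module,   # preprocessor name (assumed spelling)
        "model": model,     # placeholder: use the name shown in your own UI
        "weight": weight,
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }


def build_payload(prompt: str, reference: bytes, pose: bytes) -> dict:
    """Combine a Depth unit (overall composition) with an OpenPose unit (character pose)."""
    units = [
        controlnet_unit("depth_midas", "<your-depth-model>", reference),
        controlnet_unit("openpose", "<your-openpose-model>", pose),
    ]
    return {"prompt": prompt, "alwayson_scripts": {"controlnet": {"args": units}}}


# Dummy bytes stand in for real image files in this sketch.
payload = build_payload(
    "a girl standing on the beach, looking into the distance",
    reference=b"reference image bytes",
    pose=b"pose image bytes",
)
```

Keeping each unit as its own small dictionary makes it easy to add or drop a preprocessor without touching the rest of the request.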

Don't try to learn all of ControlNet's preprocessors at once. Start with OpenPose and Canny, and add Depth at most. These three cover 80% of my everyday needs.

Prompts: A little more complicated than the online tutorials say

Almost every Stable Diffusion tutorial teaches the positive prompt plus negative prompt pattern: write what you want in the positive prompt and what you don't want in the negative prompt. The idea is correct, but the reality is more complicated than the tutorials suggest.

Tips for positive prompts:

Quality-related tags are more effective at the front than at the back. For example, "masterpiece, best quality, highly detailed" works better at the beginning of the prompt than at the end. This is not superstition; it is related to the way the data is processed during model training.

Specific descriptions beat abstract ones. "A 25-year-old Asian woman wearing a white linen shirt, with long brown curly hair, sitting by the window of a cafe" gives you much more control over the image than "a beautiful woman in an elegant environment."

But you can't pile on too many details. If the prompt is too long, the model may ignore parts of it or produce strange blended results. Keeping a prompt within 75 tokens is a good rule of thumb; that is the size of one chunk the text encoder reads at a time (77 tokens including the start and end markers).
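
The tips above can be sketched as a tiny prompt builder: quality tags go first, concrete details follow, and a rough length check warns when the prompt grows too long. The token counter here is a crude stand-in that counts words and punctuation marks; a real tokenizer will give different numbers, so treat the result as an estimate only. All function names are my own.

```python
import re

QUALITY_TAGS = ["masterpiece", "best quality", "highly detailed"]
TOKEN_BUDGET = 75  # the rule-of-thumb limit mentioned above


def rough_token_count(text: str) -> int:
    """Crude estimate: count words and punctuation marks (not a real tokenizer)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def build_prompt(subject_details: list[str]) -> str:
    """Put quality tags first, then the concrete description."""
    return ", ".join(QUALITY_TAGS + subject_details)


prompt = build_prompt([
    "25-year-old Asian woman",
    "white linen shirt",
    "long brown curly hair",
    "sitting by the window of a cafe",
])
over_budget = rough_token_count(prompt) > TOKEN_BUDGET
print(prompt)
print("over budget:", over_budget)
```

If `over_budget` comes out true, the simplest fix is usually to cut the least important details rather than to shorten the quality tags.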

Negative prompts:

Negative prompts vary with the model and the style, and there is no set of "universal negative words." But a few tags are almost mandatory: bad anatomy, bad hands, and malformed limbs. For everything else, I suggest adding the corresponding negative tag only when you actually run into the problem.

In my opinion, writing prompts is less like "programming" and more like "a way of communicating with the model." Phrase the same description differently, and the results may be completely different. Developing a feel for this takes a lot of practice, and no tutorial can replace it.

Let me say a few honest words about copyright

This is the question I have been asked the most since using Stable Diffusion.

The current reality is that Stable Diffusion's original training dataset contained a large number of copyrighted images used without authorization. This is publicly known, and several related lawsuits are in progress. The controversy is real.

At the same time, more and more models are trained entirely on licensed data (such as Adobe's model, which explicitly uses compliant training data), and things are gradually improving in this direction.

As for the images you generate yourself, my understanding is that under the current legal framework of most countries and regions, copyright ownership of images generated entirely by AI is still under discussion. Case law in some jurisdictions tends to hold that purely AI-generated works, made without human creative decisions, are not protected by copyright.

Practical advice: If you use AI to assist your creative work (for example, you draw the sketch, choose the reference images, and iterate on the result many times), your creative contribution is genuinely there, and your copyright claim will be much stronger. If you just type one line of prompt and use the output directly, the legal uncertainty will be much greater.

As for commercial use, my attitude is: you can use it, but not as a core asset. Using AI images as source material, as concept art, or with heavy manual modification is the safest approach at present. Using one as the company logo or in a core product design? Better to hire a professional designer.

A few truly useful practical lessons

Learn to use the X/Y/Z plot script: This is a built-in WebUI feature that lets you batch-test different parameter combinations. For example, if you want to know which sampler suits your model best, you can use the script to produce a comparison grid and see the differences at a glance. Many people don't know this feature exists, but it is very practical.
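
To make the idea concrete, here is a minimal sketch of what such a comparison grid amounts to: every combination of the values on each axis, with the seed held fixed so that only the tested parameters change. The sampler names and value ranges are just examples, and the dictionary keys are my own choice rather than anything the script requires.

```python
from itertools import product

# Example axis values; replace them with whatever you want to compare.
samplers = ["Euler a", "DPM++ 2M Karras", "DDIM"]
cfg_scales = [5.0, 7.0, 9.0]
step_counts = [20, 30]


def build_grid(samplers, cfg_scales, step_counts, seed):
    """Return one settings dict per cell of the comparison grid."""
    return [
        {"sampler_name": s, "cfg_scale": c, "steps": n, "seed": seed}
        for s, c, n in product(samplers, cfg_scales, step_counts)
    ]


grid = build_grid(samplers, cfg_scales, step_counts, seed=1234)
print(len(grid), "combinations")
```

The grid grows multiplicatively, so it is worth testing two axes with a few values each before adding a third.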

Make good use of seeds: When you happen to get an image whose composition and feel are close to what you want, write down its seed value. Fine-tuning the prompt and parameters from that seed is much more efficient than starting over and finding your way by trial and error.
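
A small sketch of that workflow: keep the seed and everything else fixed, and change one setting at a time. The field names follow the WebUI's txt2img API as I understand it; treat them, the seed value, and the prompt as assumptions for illustration. The code only builds the settings and does not call any server.

```python
base = {
    "prompt": "masterpiece, best quality, woman sitting by the window of a cafe",
    "negative_prompt": "bad anatomy, bad hands, malformed limbs",
    "seed": 1234567890,  # hypothetical seed noted from a nearly satisfying image
    "steps": 25,
    "cfg_scale": 7.0,
    "width": 512,
    "height": 512,
}


def variants(base: dict, **changes):
    """Yield one settings dict per changed value, keeping the seed and all other fields fixed."""
    for key, values in changes.items():
        for value in values:
            yield {**base, key: value}


payloads = list(variants(base, cfg_scale=[6.0, 8.0], steps=[20, 35]))
for p in payloads:
    print(p["seed"], p["cfg_scale"], p["steps"])
```

Changing one field per variant makes it obvious which adjustment caused which difference in the output.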

Don't ignore img2img: Many novices focus only on txt2img, but img2img is actually more practical in many scenarios. For example, if you already have a basic composition and want to change the style or adjust the details, modifying it directly in img2img is ten times faster than generating a new image from scratch.
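
As a sketch, an img2img request differs from a txt2img one mainly in two things: it carries the starting image, and it has a denoising strength that decides how far the result may drift from that image (values near 0 keep it almost unchanged, values near 1 mostly ignore it). The field names below follow the WebUI's img2img API as I understand it, and the 0.3 and 0.6 values are illustrative guesses, not recommendations from any documentation.

```python
import base64


def img2img_payload(image_bytes: bytes, prompt: str, denoising_strength: float) -> dict:
    """Build settings for modifying an existing image instead of starting from noise."""
    if not 0.0 <= denoising_strength <= 1.0:
        raise ValueError("denoising_strength must be between 0 and 1")
    return {
        "init_images": [base64.b64encode(image_bytes).decode("ascii")],
        "prompt": prompt,
        "denoising_strength": denoising_strength,
    }


source = b"existing composition bytes"  # dummy stand-in for a real image file
adjust_details = img2img_payload(source, "same scene, sharper fabric texture", 0.3)
change_style = img2img_payload(source, "same scene, watercolor style", 0.6)
```

Starting low and raising the strength step by step is a cheap way to find the point where the style changes but the composition still holds.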

Learn about TI and Hypernetwork: These are fine-tuning methods that appeared earlier than LoRA. Although LoRA is more popular now, TI (Textual Inversion) and Hypernetwork still have their uses in controlling style. Hypernetwork in particular is used by many people in game character design.

Don't be greedy with LoRA: If you want to train your own LoRA, start by practicing with a style LoRA rather than a character LoRA. A character LoRA places high demands on the quantity and diversity of training material, so novices easily end up with unsatisfactory results.

A few last words

Stable Diffusion is not everything. It falls short of commercial tools in some respects. For example, without ControlNet, its output is less consistent than Midjourney's, and in ease of use it is far behind tools like Nano Banana.

But when your needs become complex (you need precise control over composition, consistency for a specific character, the ability to process sensitive private images without uploading them to the cloud, or a way to integrate AI image generation into automated workflows), Stable Diffusion becomes almost the only option.

I suggest that anyone interested in AI painting spend a weekend installing Stable Diffusion, download any one model, and then set a specific task, such as drawing an anime-style version of your own avatar or producing a concept image for a product design. Learning with a specific goal is much more efficient than browsing tutorials aimlessly.

Once you open this door, you will find a world much richer than you imagined.