Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

About

Using vision-language models (VLMs) in web development presents a promising strategy to increase efficiency and unblock no-code solutions: by providing a screenshot or a sketch of a UI, a VLM could generate the code to reproduce it, for instance in a language like HTML. Despite the advancements in VLMs for various tasks, the specific challenge of converting a screenshot into a corresponding HTML has been minimally explored. We posit that this is mainly due to the absence of a suitable, high-quality dataset. This work introduces WebSight, a synthetic dataset consisting of 2 million pairs of HTML codes and their corresponding screenshots. We fine-tune a foundational VLM on our dataset and show proficiency in converting webpage screenshots to functional HTML code. To accelerate the research in this area, we open-source WebSight.

Hugo Laurençon, Léo Tronchon, Victor Sanh • 2024

Related benchmarks

Task	Dataset	Result
Screenshot-to-code	Design2Code	Block-Match 55.9	20
Widget Reconstruction	Widget2Code (test)	Margin Score 32.99	13
document derendering	DocHTML (test)	Block Accuracy 83.8	10
Design-to-code generation	Design2Code	SSIM 75.1	7
UI-to-Code	Design2Code (test)	CLIP Similarity 0.812	6
UI-to-Code generation	WebCode2M Long	CLIP Similarity 0.69	4

