Svarichevsky Mikhail - RSS feed http://3.14.by/ Svarichevsky Mikhail - RSS feed en-us Tue, 10 Jun 2006 04:00:00 GMT Sat, 03 Jun 23 16:07:47 +0000 3@14.by 120 10 <![CDATA[65B LLaMA on CPU]]> http://3.14.by/en/read/LLaMA-Replit-LLM-language-model-ai-cpu-quantization
16 years ago dog ate my AI book. At the time (and way before that) common argument on «Why we still don't have AI working and it is always 10 years away» was that we can't make AI work even at 1% or 0.1% human speed, even on supercomputers of the time – therefore it's not about GFLOPS.

This weekend I ran gpt4-alpaca-lora_mlp-65B language model in home lab on CPU (using llama.cpp, due to model size – there is 0 chance to run it on a consumer GPU). This model is arguably the best open LLM (this week), and 65 billion parameters is no joke: with single precision math it won't fit in 256Gb of RAM. If you let it spill into swap, even on NVMe drive – it will run at ~1 token per minute (limited by swap speed), which is about 0.5% of human speed. Even at this snail pace it can still show superhuman performance in memorization-related tasks. It is clear that it was not possible to get there 20 years ago – training time would have been prohibitive even with unlimited government funding.

And this is where unexpected open approach of Meta proven to be superior to closed, dystopian megacorp approach of OpenAI: In 10 weeks since LLaMA was released into the wild, not only derivative models were trained but 2-4-5 bit quantization enabled larger models on consumer hardware. In my case with 5bit quantization - model fits into 64Gb of RAM and runs at ~2 tokens per second (on 64-cores), which is probably 70-90% of my human speed in best shape.

For comparison, I tried Replit-code-v1.3b 2.7B model optimized for coding. After 65B monster – Replit feels like a breeze and shows very good performance despite its size. This is a good reminder that field-specific, smaller models should always be used where possible.

It feels like "1 Trillion parameters will be enough for everybody", but such models would not be practical probably for another 2 years. Meanwhile key enablers of AI proliferation could be increase of RAM in consumer GPUs beyond 24Gb (which is sadly unlikely to happen due to commercial interests) and smaller field-specific models where I would be looking with much more interest.]]>
Mon, 22 May 23 07:57:14 +0000
<![CDATA[First tiny ASIC sent to manufacturing]]> http://3.14.by/en/read/first-asic-tinytapeout-google-skywater-pdk-openlane 5 years ago making microchip from high-level HDL with your own hands required around 300k$ worth of software licenses, process was slow and learning curve steep.

Yesterday I've submitted my first silicon for manufacturing and it was... different. In the evening wife comes as asks "How much time until deadline?". I reply: "2 hours left, but I still have to learn Verilog." (historically my digital designs were in VHDL or schematic).

All this became possible thanks to Google Skywater PDK and openlane synthesis flow - which allowed anyone to design a microchip with no paperwork to sign and licenses to buy. Then https://tinytapeout.com by Matt Venn lowered the barrier even further (idea to tapeout in ~4 hours, including learning curve).

As expected, this all allows much more people to contribute to open source flow, with my favorite being work of Teodor-Dumitru Ene https://github.com/tdene) on hardware adders which now match and beat commercial tools. I think (and hope) that in 5 years opensource tools will dominate the market on mature nodes (28nm and up), not because they are cheaper, but because they are better and easier to use.

My design fits in 100x100µm and contains 289 standard cells. There are 7 ring oscillators with frequency dividers to compare silicon performance to analog simulation across voltage/temperature. I expect to see chips in ~6-9 months, both working and under microscope :-)]]>
Sun, 04 Sep 22 15:22:58 +0000
<![CDATA[This cake is a lie.]]> http://3.14.by/en/read/stable-diffusion-ai-This-cake-is-a-lie Stable Diffusion model that was publicly released this week is a huge step forward in making AI widely accessible.

Yes, DALL-E 2 and Midjourney are impressive, but they are a blackbox. You can play with it, but can't touch the brain.

Stable Diffusion not only can be run locally on relatively inexpensive hardware (i.e. sized perfectly for wide availability, not just bigger=better), it is also easy to modify (starting from tweaking guidance scale, pipeline and noise schedulers). Access to latent space is what I was dreaming about, and Andrej Karpathy's work on latent space interpolation https://gist.github.com/karpathy/00103b0037c5aaea32fe1da1af553355) is just the glimpse into many abilities some consider to be unnatural.

Model is perfect with food, good with humans/popular animals (which are apparently well represented in the training set), but more rare Llamas/Alpakas often give you anatomically incorrect results which are almost NSFW.

On RTX3080 fp16 model completes 50 inference iterations in 6 seconds, and barely fits into 10Gb of VRAM. Just out of curiosity I run it on CPU (5800X3D) - it took 8 minutes, which is probably too painful for anything practical.

One more reason to buy 4090... for work, I promise!
]]>
Fri, 26 Aug 22 19:49:29 +0000
<![CDATA[Voron V0.1 - Ferrari among 3D printers (V0.1430)]]> http://3.14.by/en/read/Voron-V0.1-coreXY-printer-ferrari
Finished assembly and tuning of my new Voron V0.1. Stationary parts are from aluminum kit, rest I printed in ASA-X. Small size allows to reach very decent speeds and accelerations: fast profile 175/306 mm/s (perimeters / infill) with acceleration of 25'000 mm/s². For high quality - 80/150 mm/s, 15'000 mm/s². Fast acceleration and direct extruder make parameters tuning for high quality comparatively easy as extrusion speed is nearly constant. Also, pressure advance + input shaper allowed to increase acceleration from 5'000 to 25'000 mm/s² with no quality degradation on the corners.

It all works on Fluidd+Klipper, SKR-PRO v1.2 + Raspberry Pi 4. When printing 306mm/s @265°C - 40W heater is no longer enough, so I had to overclock printer a little to 28V (+36% heater power). 28V is a limit for TMC2209.

Initially I was considering to participate in SpeedBenchy contest - but things there went too far in the direction of "too fast / too bad". Printing at these speeds is limited by plastic cooling - this is why achievable speeds for high quality prints for ABS/ASA are several times higher than PLA. I.e. printing above 200mm/s is all about cooling, and is a contest of fans and air-ducts.

Update: Got my serial number V0.1430 :-)]]>
Tue, 08 Feb 22 09:00:51 +0000
<![CDATA[Walking with Alpakas]]> http://3.14.by/en/read/alpaka-walk ]]> Mon, 07 Feb 22 19:09:56 +0000 <![CDATA[Milky Way @ Gurnigel, Switzerland (1593m)]]> http://3.14.by/en/read/Milky-Way-Gurnigel-Switzerland-astrophotography
30 seconds, A7III with Samyang 8mm F2.8 @ F4. Yes, this is an APS-C lens on a full frame camera - to have larger pixels / lower noise, as higher resolution here does not help. Largest challenge was Chroma noise, ether hot pixels or when star is focused into a single pixel and it's impossible to recover real color of the star. To fix that I just reset all unusually high & sharp Chroma values to neutral.

Light pollution is visible on the horizon (left side) - it's from the nearest city, Thun - 13km away. ]]>
Sun, 15 Aug 21 14:07:26 +0000
<![CDATA[C/2020 F3 (NEOWISE)]]> http://3.14.by/en/read/2020-NEOWISE-comet body:after {
content: url(//s.14.by/neowise2.jpg);
background-image: url(//s.14.by/neowise2.jpg);
visibility: hidden;
position: absolute;
left: -999em;
}
Made a photo of С/2020 F3 NEOWISE comet, making all the news now. Sigma 70mm F2.8 (@3.5), 60x2.5s (stacked).
After subtracting background - double tail became visible (dust & gas).

On mouse over - color, on click - annotation. Core is indeed slightly green
]]>
Mon, 20 Jul 20 01:26:20 +0000
<![CDATA[Tesla coil x2]]> http://3.14.by/en/read/double-tesla


]]>
Tue, 25 Feb 20 05:29:16 +0000
<![CDATA[Got radio call sign : R2AYN]]> http://3.14.by/en/read/amateur-radio-call-sign-R2AYN-FT8 1, 2) I am now officially allowed to transmit.
Exam was straightforward, paperwork took more time than theory preparation.
Initially I will focus on digital-mode radio (like FT-8) with DIY transmitters. Hopefully, I will be able to reuse some of my HF hardware that I was preparing for plasma experiments in vacuum.

Links for contacts: eQSL qrzcq
]]>
Tue, 04 Feb 20 21:05:30 +0000
<![CDATA[Edwards EXT255H turbomolecular pump - first run]]> http://3.14.by/en/read/Edwards-EXT255H-EXDC160-turbomolecular-pump
It did not disintegrated itself I was surprised that it runs quieter than rotary vane backing pump.
Of-course it's used, came from disassembly of some science equipment in Israel. The only issue was 1 bend pin on a pump, which was probably not used. Anyways, I was able to straighten it up easily.

Centering ring was grinded from one side to allow use of sheet of glass instead of vacuum chamber to see the rotor during first test run on nominal speed (60'000 RPM):




EXDC160 controller:




Was recording video for the chance of capturing rapid unscheduled disassembly, so not much is happening on the video. At 3:53 - roughing pump is turned off to hear the sound of turbo-molecular pump. With such small "chamber size" venting without exceeding maximum pressure rise rate is quite challenging - hence venting and slowing down the pump took much more time then start-up.


Update: Could not resist taking second one from the same seller as a spare, also works fine:
]]>
Mon, 16 Dec 19 06:58:56 +0000