Wednesday, January 25. 2012
I wasn’t a keynoter. Or even a regular presenter. I was just doing a talk at a miniconf. It was still an unnerving enough experience that I went to see my doctor on the Thursday before I flew to Tullamarine to make sure the chest pains I was having weren’t the onset of a heart attack. It almost would have been a relief if they had been.
As you can see, I did make it, and I didn’t drop dead on stage.
Normally I’m pretty comfortable about speaking in front of people. To the point where, for example, last year I needed to double the time
I’d been told I would be allocated, and spoke extemporaneously from the bullet-points I’d listed on a bit of paper, only looking at them
once. Or, 4 years ago, spoke at a funeral after leaving my speech at home. Give me a run-up and I can usually stand up and talk about most
anything I care about on short notice, and probably for longer than you envisaged when you asked.
So why so nervous? There’s a simple reason: the audience.
Continue reading "Speaking for the first time at linux.conf.au"
Friday, January 20. 2012
Rusty Russell & Matt Evans
Three Cool Projects
- Spark - command-line tool that generates sparks
- Plover - An open-source stenography alternative.
- Homebrew Cray-1A http://chrisfenton/homebrew-cray-1a/
In The Beginning
- “The year was 1976, the hair was long, the shoes were tall.”
- It’s a multi-user machine, it has two teletype machines in front of it!
- Bring on the PDP-11 simulator!
Comparisons
- cat, grep, and ls are the punching bags.
- “Bigger is better. grep is 20 times bigger.”
- “cd only came in in V7.” “I edited the shell so I could have cd.”
- cat was written in assembler in 1976.
- In V6 arguments were in-line.
- The first thing you notice is that they are memory-conscious. Although Rusty points out that they don’t bother with system calls; they also just use assembler because it’s natural.
- Rusty re-implemented cat with bug/behaviour and it was only twice as big in C. Modern cat is big in part because of more features, error messages, and so on.
- We pay a 30% memory penalty if we use -O2 instead of -Os.
- But -Os is slower by about 6% for these simple utilities.
- Automated runtime analysis tells us 99% of the instructions are used at some point, with only one instruction never being used. 1% bloat!
- Even going to V7 in 1979 ls has doubled in size. cat only uses 57% of its instructions.
- ...but if you build a static cat instead of using shared libraries it pulls in another 700KB of glibc dependencies!
- There’s a dependency graph. It looks like scribble.
- It includes TLS (in case you need to fetch from Reddit).
- When we instrument cat on x86 we find that we use... um... 2% of it. Bugger.
- On a whole-system analysis there’s 33 MB of wasted RAM. Not much compared to all the memory.
- But there may be a TLB hit.
- Of course, 16-bit vs 64-bit is unfair. So Rusty guesstimated the change in Text and Data segments. There’s some big growth, around 50%.
- By way of comparison 32-bit to 64-bit Ubuntu is only 9%.
- If you pull the old code to an Ubuntu system you actually cut the text segment (if you’ve got a stripped-down glibc). ELF, on the other hand, adds stuff, mostly to force on-page alignment. Even so, cat is only marginally bigger than 64-bit PDP cat. Not too bad.
- grep embiggened significantly.
- ls was already complex, with 10 flags. We also have to grow buffers for moving from 14 byte filenames to 255 bytes. It doesn’t use malloc, doing funky magic to grow the program when it starts running out. You kind of need it to use malloc() nowadays. So you grow 120%, because of a combination of the changes.
Backporting
- Turns out GNU ls has 60-odd options. A survey of Rusty’s friends says 11 of them were never used.
- So some of this size is the extra options.
- For cat it’s easy: remove all the options and error reporting.
- Cat does some odd malloc() behaviour to have aligned, page-size buffers.
- Backported cat is still bigger than forward-ported cat.
- ls required Vast Surgery.
- It grabs the system time to the nanosecond so it can show entries more than 6 months old differently.
- It’s much faster, although it’s probably down to LOCALE complexity slowing up non-backported.
- There’s a 60% penalty. But that’s for portability, 64-bit and so on.
- 400-odd% bloat? That’s the extra features.
Conclusions
- Most people aren’t prepared to go to the same lengths to keep things small.
- asmutils - reimplementation of *ix utils, but it’s not actually that efficient: it loses all the gains in BSS bloat that they don’t bother measuring. Bummer.
- Features are the reason for growth.
Thursday, January 19. 2012
- Security needs to be a first-class design concern.
- You don’t need to fix all the bugs, you just need to be better than the other guy.
- Code is growing in complexity (11 million+ lines in the kernel) and more people using tools (750,000+ Android devices activate per day).
- Security is an arms race, where good guys and bad guys compete.
Why do Hackers Attack?
- Build botnets. Botnets can do things you can’t do otherwise; bitcoin generation, spam, etc.
- Gain control of useful private information (credit cards).
- Punish/embarrass people, e.g. Sony, Church of Scientology, etc.
- To educate and advocate, e.g. Firesheep.
- Earn a reputation.
- Lulz.
Win the Bear Race!
- The attackers are the bear. Don’t be something the bear wants to eat.
Attacks
- Buffer Overflows: the daddy of attacks. The classic stack-smasher.
- The cure is to refuse to allow execution on the stack; e.g. the NX bit.
- The countermeasure to that is the “heap spray” attack; the buffer overwrite gets initial access, but instead of stack-smashing, you inject huge amounts of data into the heap, which is also valid code. Then you jump to a random address in the heap space, which will give you a decent chance of hitting executable code in the heap, and away you go.
- People have tried having ROM-only execution, refusing to execute out of memory.
- The counter to this is “return oriented programming”. You overrun, inspect the ROM, and then use the functions and function fragments in the ROM, chaining them together to build what you want. e.g. one team were able to implement a Turing-complete VM for themselves.
- Hardware attacks. e.g. Adding 1400 gates as a rogue designer to make a hardware Linux backdoor; bug in Intel processor to make an OS-independent attack.
Defences
- No-execute stacks. Can be thwarted by heap-spray in a naive implementation; that is countered by heap address randomisation. Implemented in recent Linux, Windows, and MacOS. This means you need to jump to the right address for the heap overflow on the first time, greatly reducing the likelihood.
- “Stack Canaries” are embedded in the stack frame, as a random value embedded in each function. If the canary is wrong, the execution simply halts.
- Only fully available in OpenBSD.
- Encrypted pointers with StackGuard. Every pointer is encrypted with a different, simple XOR on every execution.
- All of these techniques have a cost in memory or CPU.
- Sandboxing. Classic technique from virtual machines (in the Smalltalk/Java sense). For example, Chrome, Firefox, and IE 9 all implement this. All tabs talk to a policy manager, rather than to the underlying operating system.
- The “Browser in the Middle” attack - an attack where visiting a car forum would trigger JS that would check for an open tab to an HSBC internet banking session, and would do a $1 funds transfer.
- What can you do about buffer overflows? Safe languages. Eliminates Spatial and Temporal buffer overflows.
- But in May 2011 the two top attack vectors were Flash and Oracle Java, both of which are managed code. Ooops.
- Languages can be safe. But implementations can be unforgivably broken.
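The stack canary idea from the list above can be sketched in a few lines. This is a toy simulation, not how any real compiler lays out a frame: the frame layout (buffer, then canary, then return address), the sizes, and the function name `call_with_canary` are all assumptions made for illustration.

```python
import secrets

BUF_SIZE = 8      # bytes reserved for the local buffer (illustrative)
CANARY_SIZE = 4   # bytes of random canary placed after the buffer (illustrative)

def call_with_canary(user_input, canary=None):
    """Simulate one call whose frame is: buffer | canary | return address."""
    if canary is None:
        canary = secrets.token_bytes(CANARY_SIZE)  # fresh random value per call
    return_addr = b"RETN"
    frame = bytearray(BUF_SIZE) + bytearray(canary) + bytearray(return_addr)

    # Deliberately unchecked copy, standing in for something like C's strcpy().
    frame[0:len(user_input)] = user_input

    # Before "returning", check the canary is still the value we planted.
    if bytes(frame[BUF_SIZE:BUF_SIZE + CANARY_SIZE]) != canary:
        return "halt: canary overwritten"
    start = BUF_SIZE + CANARY_SIZE
    return "return to " + frame[start:start + 4].decode()

print(call_with_canary(b"hello"))      # fits in the buffer
print(call_with_canary(b"A" * 16))     # runs over the canary and return address
```

The overflow still happens; the canary only means the function notices before it jumps to the overwritten return address.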
Developers Need to Adopt Better Development Strategies
- “For every piece of software there is a trail of abused users.”
- So employ tools to find vulnerabilities.
- We fix earlier before we release, and we release cleaner code.
- Constrain inputs. If you can’t put garbage into the program, you’ve greatly reduced the possibility of an attack.
- The tools that can check and produce recommendations on vulnerabilities can be used by the bad guys.
- Dynamic taint checking can be used to uncover paths of external inputs that need to be sanity-checked.
Dataflow Analysis 101
- We build a graph on how the data flows through the program, propagating the taint information.
- This will only work on buffer overflows.
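As a rough illustration of propagating taint over a dataflow graph, here is a minimal sketch. The graph, the variable names, and the function `propagate_taint` are all hypothetical; a real analyser would derive the graph from the program itself rather than from a hand-written dict.

```python
from collections import deque

def propagate_taint(flows, sources):
    """Return every node reachable from the tainted sources.

    flows maps a node to the nodes its value flows into.
    """
    tainted = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in flows.get(node, ()):
            if nxt not in tainted:
                tainted.add(nxt)
                queue.append(nxt)
    return tainted

# Hypothetical dataflow graph: external input (argv) ends up in an open() call,
# while the buffer size comes from a trusted default.
flows = {
    "argv": ["filename"],
    "filename": ["path"],
    "path": ["open_call"],
    "config_default": ["buffer_size"],
    "buffer_size": ["malloc_call"],
}

tainted = propagate_taint(flows, ["argv"])
print(sorted(tainted))
```

Anything in the tainted set that reaches a sensitive operation is a path that needs a sanity check somewhere along it.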
Side-Channel Attacks and Protections
- Depressing because your system qua system is secure, well-thought out, and well-implemented, but someone creates e.g. an environmental condition that violates your assumptions.
- For example, a power fluctuation can indicate whether you are mostly doing 1s or 0s.
- The classic RSA attack was based on how long it required to encrypt information.
- Keyboard noise detection. Tempest devices can read your screen.
- Punishing a system.
- Cash. Blackmail.
- Other human engineering, e.g. the Red-Headed League.
- An old attack on DES relied on the fact that DES would flush the cache when it was processing data. Because the placement of the code and data in the cache is consistent, forcing invalidation of the cache can reveal enough timing information to let you work out the 1s and 0s.
- Differential Power Analysis. You know XOR takes less power than ADDs, while MUL requires more. So based on the operations you can derive the key; for example, AES is almost all XOR, so watching the power fluctuations will tell you if it’s XOR 0,0 XOR 1,1 XOR 0,1.
- Timing-based attacks. Daniel Bernstein demonstrated a timing attack against AES based on cache timings.
- Fault-based attacks: this is Valeria’s attack from the morning.
Tools for More Secure Software
- Valgrind is a platform that runs as a VM on Linux, with a suite of plugins that find memory leaks, dangling pointers, races, and so on.
- Programs run slooooooooooooooower while running under Valgrind.
- Apparently it’s pronounced Val-grin-d not Val-grind.
- OpenBSD ProPolice. Also becoming available in GCC.
- Fuzz testing will find all sorts of crap, albeit shallow crap.
- Google’s browser fuzz tester found hundreds of defects when it was first released.
- Klee is a more thorough fuzz tester, checking the code coverage as it fuzzes, rather than operating purely randomly. It tries to execute every path that exists, allowing it to come up with deep bugs.
- Metasploit will package up attacks and go after your machine(s).
- NMAP.
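For a feel of what purely random fuzzing does, here is a small sketch against a made-up parser. Both `parse_record` and its planted off-by-one bug are hypothetical, invented only to give the fuzzer something shallow to find; this is not how any of the tools named above are implemented.

```python
import random

def parse_record(data):
    """Hypothetical target: first byte is a length, the rest is payload."""
    if not data:
        raise ValueError("empty input")
    length = data[0]
    payload = data[1:]
    if length > len(payload):
        raise ValueError("truncated")      # expected, handled rejection
    checksum = payload[length]             # planted bug: off by one when length == len(payload)
    return payload[:length]

def fuzz(trials=20000, seed=0):
    """Throw short random byte strings at the parser and keep the ones that crash it."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(trials):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(9)))
        try:
            parse_record(data)
        except ValueError:
            pass                           # clean rejection, not a defect
        except IndexError:
            crashes.append(data)           # unexpected exception: a finding
    return crashes

crashes = fuzz()
print(len(crashes), "crashing inputs found")
```

Random input finds this quickly because the bug sits right behind the first length check; a bug guarded by several nested conditions is where a coverage-guided approach earns its keep.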
Q&A
- The embedded space is incredibly immature. Headed a panel with teams who have remote-owned cars, UAVs, and Pacemakers. He described it as the most terrifying panel he’s run.
- There are even PHP static analysers now.
Valeria Bertacco
- Valeria has a talk, and a demo, but of course the hardware isn’t co-operating.
- Cryptography is pervasive. It’s also big business. The direct value of companies like RSA and Verisign is tens of billions. The value of ecommerce companies is hundreds of billions.
- Asymmetric cryptography, RSA keys, rely on two large primes, with which you perform clever maths.
- Cryptanalysis: poking the cryptography with a stick.
- In 2009 we proved you could brute-force 768-bit keys, but it required computation-years to do reliably.
- Side-channel attacks: you measure the time required to encrypt, and guess the key from that. We now pad encryption to avoid this.
- Fault-based: a faulty CPU may leak information in the form of errors.
- Attacks via Transient Faults: when transistors give the wrong values intermittently.
- These are normal events, and normally last <1 clock cycle, but in bad cases will propagate up the stack to the software.
- The probability is very low. But they can be triggered by solar particles (alpha particles, which is dependent on altitude), and are non-predictable.
- As transistors have shrunk, they have become more susceptible to this sort of fault, because they’ve become more fragile.
- As we get smaller we may even get to the point where a single alpha particle can flip many transistors.
Forcing Faults Reliably
- A transient fault that occurs when performing the handshake may leak information through the corrupt response. If you can do enough corrupted handshakes, you may get a lot of information.
- Our testbed is an FPGA board running a SPARC v8@40MHz, running Debian. We munt it with a voltage controller to induce faults.
- When you drop voltage, the multiplier, which is the most sensitive component, will have problems. If we go from 1.5V to 1.0V it always fails. But if you drop it to, say, 1.3V, it will fail intermittently.
- OpenSSL uses a fast algorithm to encrypt; it then verifies the encrypted data, and falls back to a slower, more reliable algorithm if that doesn’t work.
- The attacker collects the faulty signatures over the time. It relies on the fact the slow, reliable algorithm breaks the message into windows. The attacker can collect leaked information window-by-window.
- This ends as a window-by-window brute force.
- But it means you are brute-forcing only, e.g. 4 bits instead of 1024 bits. 100 seconds per check, 2^6 checks in the worst case.
- This makes it quite a lot easier to break the server’s private key.
- In the example, 8,800 signatures were collected in a few hours, and then analyzed. The 1024-bit private key was cracked in 100 HOURS.
- Apparently 60 degrees C is about optimal for creating key-cracking errors with the hardware she’s using. Temperature is harder to control than voltage, though.
- OpenSSL 0.9.8i was the victim in this case; before giving the talk, Valeria supplied a patch to stop using the fall-back algorithm, which helps avoid that specific attack.
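The general defence of checking a signature before letting it out can be sketched with toy numbers. This is not OpenSSL’s code or the actual patch: the tiny textbook key (n = 3233, e = 17, d = 2753), the `fault_mask` parameter that simulates a transient hardware fault, and the function names are all assumptions for illustration.

```python
# Textbook-sized RSA key, far too small for real use: n = 61 * 53.
N, E, D = 3233, 17, 2753

def sign(message, fault_mask=0):
    """Compute message^D mod N; fault_mask flips bits to mimic a transient fault."""
    sig = pow(message, D, N)
    return sig ^ fault_mask

def checked_sign(message, fault_mask=0):
    """Verify the signature with the public key and refuse to release a bad one."""
    sig = sign(message, fault_mask)
    if pow(sig, E, N) != message:
        raise RuntimeError("signature failed verification; not releasing it")
    return sig

good = checked_sign(65)
print("good signature verifies:", pow(good, E, N) == 65)

try:
    checked_sign(65, fault_mask=1)   # simulate a single flipped bit
except RuntimeError as err:
    print("faulty signature withheld:", err)
```

The point is simply that a corrupted result never reaches the attacker, so there is nothing faulty to collect.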
General Advice
- Keep crypto libraries up to date.
- Overclocking IS A SECURITY RISK.
- Overheating IS A SECURITY RISK.
- Unreliable power IS A SECURITY RISK.
Brilliant session.
Wednesday, January 18. 2012
Andrew Bartlett and Amitay Isaacs
- Samba has hardcore portability requirements.
- m4, sh, and other bare-bones tools. autoconf gone mad: 4,000 lines of m4 code.
- Scripting language of the month club: Python, then TCL, then Lua were all put in and pulled out. None of them were loved or portable enough.
- Then Perl went in.
- Became used for all manner of build and testing tasks.
- awk was then used to try and develop an IDL to autogenerate code to spec. It didn’t really work as you’d like, needing tweaking by hand.
- This led to PIDL, the Perl IDL compiler. It worked far better, with IDL code being used as-is, no tweaking required. Use is now pervasive, generating both server and client code.
- This has been hugely productive and important in allowing significant, rapid change.
- Then it caught JavaScript before it was cool. Tridge gave many convincing-sounding reasons as to why it’s a great idea.
- It was very easy to embed, with minimal dependencies to make it work.
- JS could even make RPC calls.
- But something went wrong. The cool kids were using Python. So they went back to Python. There may have been chloroform and lies to subdue tridge.
- These lies may have revolved around embedded python and debugging.
- Tridge is now a python fanboi. There’s a general love of Python permeating the project.
- IDL generated bindings are everywhere, with bindings into every component: ldb, tdb, and so on. If it’s a useful part of Samba, you can probably access it directly from within Python.
- Things that were being done in C are being migrated to Python; e.g. samba-tool has migrated from a pure C tool to a Python tool with C extensions.
- Many small tasks are now fork()ed from the core Samba processes and run as Python tools - which makes it trivial to debug bad cases by running the tool from the command line with the same parameters.
Some Examples
- The Samba3 migration tools were clunky; in 2 weeks they were (re)-written in Python with C bindings. The business logic was re-written in Python.
- Python is now the core of the build system, via WAF.
- Does ABI checking: checks that all the contracts are consistent, and alerts developers when they aren’t. Maps all the dependencies.
- Testing Samba: both unit testing and environment testing. The latter is the more challenging, because it requires a running server. And there are many, many different options for Samba 4 when running as an AD server. So it ends up creating 7, 8, or more environments for the test suites to run.
- These tests are now run as part of the commit process - continuous integration. 9,000 tests in 1,300 test suites.
Kate Stewart
- There are now 7,000+ packages in the Ubuntu main distro and over 19,000 in universe.
- There’s a tremendous rate of change.
- This isn’t a solution - it’s a description of an interesting problem space.
- Many projects feed into Debian, who Ubuntu feed from, and then Ubuntu release multiple images to the world at large.
Stabilising this is hard, especially since Debian is a time-based release.
Understanding whether there should be exceptions to the freezes is really challenging, especially when they’re to packages with many dependencies upon them.
- There’s a planning phase, a feature development phase, and a stabilization & QA phase.
- There’s a continual set of imports from Debian, mostly from Unstable. LTS comes from Debian Testing.
- Ubuntu archive contains ALL THE THINGS.
- 6 architectures, with 73 daily images (core Ubuntu + Flavour images).
Number of packages varies by the edition of Ubuntu, as well as the flavours.
Lots of facts: how do you convert facts into knowledge?
- Image build summary files, the Package control files provide some core information.
- Germinate examines “seeds”, groups of packages, and examines the dependencies between them; e.g. “desktop” is a seed, “live” is a seed, and so on and so forth.
- Shows forward and reverse dependencies.
- There are other tools that help understand these dependencies: madison, apt-rdepends, apt-cache, diff-manifest.
It’s an interesting problem, but there’s no good solution for the problem, as yet. But there are some steps towards one:
- You can graph out seed-based dependencies; you can use rdepends to extract information for driving regression testing.
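The reverse-dependency idea can be sketched as follows: invert a forward dependency map, then walk it to see what might need regression testing when one package changes. The package names and the functions `reverse_dependencies` and `affected_by` are hypothetical; real data would come from the package control files via the tools listed above.

```python
def reverse_dependencies(depends):
    """Invert a {package: [dependencies]} map into {package: {dependents}}."""
    rdeps = {pkg: set() for pkg in depends}
    for pkg, deps in depends.items():
        for dep in deps:
            rdeps.setdefault(dep, set()).add(pkg)
    return rdeps

def affected_by(package, rdeps):
    """Everything that depends on `package`, directly or transitively."""
    seen = set()
    stack = [package]
    while stack:
        current = stack.pop()
        for parent in rdeps.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical forward dependencies for a tiny "desktop" seed.
depends = {
    "desktop": ["gui-toolkit", "browser"],
    "browser": ["gui-toolkit", "tls-lib"],
    "gui-toolkit": ["c-lib"],
    "tls-lib": ["c-lib"],
    "c-lib": [],
}

rdeps = reverse_dependencies(depends)
print(sorted(affected_by("tls-lib", rdeps)))
print(sorted(affected_by("c-lib", rdeps)))
```

A change low in the graph touches nearly everything, which is exactly why freeze exceptions for heavily depended-upon packages are hard to judge.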
Q&A
- Look at the bug trackers alongside the package mapping to see how past bugs/changes have created inter-dependencies. But how do you mine those bug trackers?
- For visualisation, are you familiar with Gapminder (http://www.gapminder.org/)? It shows changes of multidimensional data over time.
- Consider using the Debian popcon database to understand the numbers of people likely to be affected by changes.
Avi Miller
Some key points (having lost many notes due to Firefox being fucking useless).
- There’s a bunch of stuff still working badly or not, and optimisations.
- e.g. metadata is fixed at 4K blocks, and that hurts performance. This is being fixed.
- RAID is block redundancy across disks. So a RAID-1 mirror with 5 different-sized disks will simply make sure that blocks are duped somewhere in the array.
- Scrubbing is great, and will auto-fix on read. There are some important caveats, though; the biggest is that btr prefers to always read from the same device if it can. This means that if you don’t force scrubs occasionally you can have a drive crap itself, pull the drive, and then discover your alternate block was corrupt. And be unable to find a good copy. Oops.
- Chris M recommends scrubbing periodically with the sum tool (say once a week for a busy filesystem).
- You can mount any device in an array and everything mounts.
- No idea what happens if you try mounting multiple devices in the array.
- Disk replacement is working smoothly, and Just Works.
- btr send/receive is working. It sends a “neutral” stream, so it ought to scrub and dump errors.
- btr is friendlier to small machines than ZFS, but not to small disks - it tends to allocate heaps of metadata.
- RAID 0, 1, and 10 are there, but RAID 5, 6 and triple mirroring are still sitting in the merge queue, thanks to Intel.
- You can mix RAID levels in the same disks, because, hey, it’s just block duplication.
- Unfortunately df and the like just Don’t Work. e.g. until you force sync, the filesystem will report the wrong utilisation, and it will always tell you the FS size is the sum of all the disks in an array.
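To make the “RAID-1 is just block duplication” point concrete, here is a toy allocator for a mirror across five different-sized disks. The policy used (put each chunk’s two copies on the two devices with the most free space) is an assumption for illustration, as are the device names, the sizes, and the function `allocate_raid1`; it is not taken from the btrfs source.

```python
def allocate_raid1(sizes):
    """Place two copies of each chunk on two different devices until space runs out.

    sizes maps device name -> capacity in chunk-sized units.
    Returns the list of (device_a, device_b) placements and the leftover free space.
    """
    free = dict(sizes)
    chunks = []
    while True:
        # Devices that still have room, fullest-free first (assumed policy).
        candidates = sorted((d for d in free if free[d] > 0), key=lambda d: -free[d])
        if len(candidates) < 2:
            break                      # a mirror needs two distinct devices
        a, b = candidates[:2]
        free[a] -= 1
        free[b] -= 1
        chunks.append((a, b))
    return chunks, free

# Hypothetical array of five different-sized disks (sizes in chunk units).
disks = {"sda": 5, "sdb": 4, "sdc": 3, "sdd": 2, "sde": 1}
chunks, free = allocate_raid1(disks)
print(len(chunks), "mirrored chunks stored;", sum(free.values()), "unit(s) left unusable")
```

It also shows why df-style numbers are awkward here: the raw capacity is the sum of the disks, but what you can actually store depends on how the copies pair up.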
When Bad Things Happen to Good Data
- There’s a read-only btrfs tool, so you can try and save your data when btr goes bad. It works well.
- Chris Mason will be talking about btrfs on Saturday. You may choose to assume that btrfsck will be announced then. If you want.
- Oracle have publicly stated that they will take it into production with btrfs.
- Even if the filesystem isn’t changing, the metadata rolls its root backup (every 30 seconds). You can switch off.
- Avi has some amusing tools to corrupt files and filesystems.
- And “mount -o recovery” just fixed the checksum corruption he inflicted on his test filesystem. Worst case scenario you’ve lost 30 seconds of data per write.
Beeeellions of files
- ext4, xfs, and btrfs all have problems with lots of files.
- ext4 is journal-bound
- xfs has fixed this in head. It spams files all over the place and gets generally good performance, but generates many seeks.
- btr load-levels across the disk, so isn’t seek-thrashing the disk.
- btr and xfs are both CPU-limited on SSDs.
- “seekwatcher” is one of Chris M’s tools that shows what’s going on.
yum upgrade and snapshots
- Requires btrfs root, and allows you to snapshot on upgrade and rollback in one hit.
- It’s easier to use Fedora than OEL to convert the FS from ext4. Since ext4 is stored as a conversion snapshot, you can rollback to ext4 later.
- Avi no longer uses the 3D accelerators for VirtualBox so he never has to use GNOME 3.
- When you convert ext4 -> btrfs remember to edit /etc/fstab and change the FS type!
- You need the yum snapshot plugin to be installed.
- Then yum install just creates a snapshot.
- New Fujitsu logging has improved the speed of apt-get and yum, both of which generate a lot of fsync() calls.
Questions
- Some people do md-raid and btr-RAID.
- Dedupe? Not on the roadmap right now. Disks are so big; the cost of CPU and RAM to dedupe is huge.
Tuesday, January 17. 2012
Sarah Novotny
Origin of the talk was when a customer rang with a complaint that a site was wrong, but Sarah couldn’t find a problem, and this provoked her into thinking about where data can and should be cached.
Why Cache?
We want to move data as close to the end user as possible, while retaining ACID-style guarantees. The abandonment rate after 7 seconds is huge. We need reliable speed.
Count Them.
A Short Diversion
- DBA/SA background means Sarah cares a lot about ACID semantics around data.
- Will therefore focus on the DB
Which Caches are Redundant
- Some caching is redundant.
- They tackle the same functions, but are either redundant or even harmful. Battery-backed controller caches are good and cache disks. Disk caches cache, but are unlikely to be “safe”.
- You need to ensure durability in those cases.
- For MySQL you also have InnoDB query caches and buffer caches.
Why do we keep doing this? Because we want things to go faster! But there’s a conflict between the DB cache and filesystem cache, too. You’re double-buffering. They aren’t particularly dangerous on modern filesystems, but it’s an inefficient use of memory and CPU to manage both sets of caches.
Which Caches are Risky
- Expiries not set well on memcached will result in data being lost; Sarah is of the opinion you should only use this for temp data.
- Hypervisors often cache disk in memory, without advising the guest what happens here.
- Disks lie! They are reporting writes succeeding when they aren’t on the platter for reals.
- RAID controllers lie, but at least they lie with battery backup (if you spent the money), so you’re probably OK.
- The last two are really toxic, because you can end up losing data on power failure. Sarah recommends controlled power failures to test this.
- TURN YOUR DISK CACHE OFF IF YOU VALUE YOUR DATA.
- MySQL generally does better if it bypasses the FS cache for direct-attached storage. However, for SAN-attached disk you should leave FS caching on.
Benchmarking
- You need to be careful when benchmarking, but in general it’s good and you can never do enough.
- It’s not magic. You just need to do it right.
- Don’t do bad benchmarks that just e.g. exercise your cache.
- You need to touch the slowest part of the system. Force pessimistic scenarios, e.g. when your controller cache goes offline.
- You also want to test the normal production case, with real data sets and a workload that looks like production behaviour.
- You can test in prod, but you should use proper staging hardware that’s similar.
- Benchmarking with real data also exercises your backups if you populate from them.
- You can also use a replica/DR server on a short-term basis. Breaking replication and then restoring is good practice for this.
Monitoring
- Only monitor the stuff you want.
- Test multiple layers in your infrastructure, and make sure you test both what the end customer sees and each touch point along the way.
- Monitoring is an evolving case; treat it like you’d treat unit testing in software.
- There’s no boilerplate. Every system is unique.
- So many tools.
Selena Deckelmann
Slides at: Slideshare.
- “Success Engineering” - that clearly will work, then.
- Plan for the worst. Minimise risk. Fail. Recover, gracefully.
- “You can’t eliminate risk.”
- alt.sysadmin.recovery shoutout.
- Failure is an option. Admit it.
- The open source world has failure and recovery as a core competency, but perhaps not systematically enough.
- Dr. Jerker Denrell publishes fantastic papers on the topic from a business perspective. “Predicting the Next Big Thing: Success as a Signal of Poor Judgement.” Looked at people who had predicted Black Swan events, and found there was a negative correlation with general quality of judgement.
- Try “Everything is Obvious Once You Know The Answer”
Whatever, science, blah, onto the entertaining anecdotes!
Rats like fibre optic. And we can use stories about this to help inform our planning.
- Document, Test, Verify is like Stop, Drop, and Roll.
Documentation
- Documentation tools are mostly pretty terrible, and there’s good work that could be done here.
- Make time to update documentation when you do stuff.
Testing
- Verify your success criteria. What does success look like, what are you trying to achieve.
- Make sure you actually write tests, however simple, and have a buddy sanity check your work.
- Have a plan: make sure you involve other people with it, too.
- There is no shortage of testing tools, which should be repeatable.
- Do stuff in repeatable shell scripts.
- Have staging environments.
Verify
- What does pg_dump -d actually do. Well, it depends.
- Needed a plan for what to do if things go wrong. Staging environment. And test your rollbacks, not just implementation.
- People are really important. Having a buddy.
Failure to Imagine
- Telling externals they need to tell you when you have a problem is not going to work. Trust no-one.
- Share your stories of failure and talk to a diverse group of people, people who are different to you.
- Sharing lets you head failure off at the pass.
- People who are different to you means outside IT - business, musicians, the construction industry.
- Go and physically look at things you might need to do, don’t just sit in a room.
Reflection
- The post-mortem/debrief.
- Keep a notebook of your work, learn from it.
- Plan to have a post-mortem, even if there’s success.
- Document your plan with a timeline, allocate time, and actually test the plan.
- IRC is great, speaking is better. A headset is great.
- Have a timekeeper and alert people to when you’ve hit your drop-dead point.
- Limit improvements to 1-2 things. An endless list will never be worked upon.
Read the DailyWTF.
Jamie Wilkinson
The Problem
- Cluster of slapd/bind/rsync/etc machines.
- How do we monitor these systems?
- Google has their own proprietary monitoring system. It’s basically pervasive to everything they write and any internal libraries etc you use.
- They wanted to maximise reuse.
- Whitebox: the app produces enough data to let you inspect the internal state of the application.
- Most open source apps are good about doing this.
- LDAP gives you lots of data. Too much data.
- BUT they’re all special snowflakes. Bugger.
emtail
- BUT they have stuff in the logs. So we use emtail: “exporting, modular tail”. Reads logs, runs modules/plugins for extracting useful data, and produces a standardised set of metrics.
- A “metric” is stuff-over-time.
- The Google version exports to the Google DB, the open source version exports JSON dumps.
- The data is exported via an HTTP server, using JSON or CSV, discarding the historical data; storage is the problem of your collecting tool: cacti, collectd, etc.
- Current version is written in Python, which is the closed version, and the open source rewrite is written in Go. Not up on code.google yet, but “by the end of the day”. It’s not complete yet.
- An awk-like language to express the matching/aggregation rules.
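The general shape of the idea (match log lines against rules, count, export the current values as JSON) can be sketched like this. It is not emtail’s code or its rule language: the log format, the metric names, the regexes, and the function `extract_metrics` are all made up for illustration.

```python
import json
import re

# Hypothetical rules: metric name -> pattern that bumps the counter when it matches.
RULES = [
    ("connections_total", re.compile(r"conn=\d+ .*ACCEPT")),
    ("errors_total", re.compile(r"RESULT .*err=(?!0\b)\d+")),   # any non-zero err code
]

def extract_metrics(lines):
    """Count how many log lines match each rule."""
    counters = {name: 0 for name, _ in RULES}
    for line in lines:
        for name, pattern in RULES:
            if pattern.search(line):
                counters[name] += 1
    return counters

# Made-up log lines in a slapd-like style.
log_lines = [
    "conn=1001 fd=12 ACCEPT from IP=10.0.0.5:51234",
    "conn=1001 op=0 RESULT tag=97 err=0 text=",
    "conn=1002 fd=13 ACCEPT from IP=10.0.0.6:40000",
    "conn=1002 op=0 RESULT tag=97 err=49 text=",
]

metrics = extract_metrics(log_lines)
print(json.dumps(metrics, sort_keys=True))
```

Only the current counter values are emitted; keeping history is left to whatever collects the JSON, as described above.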
Editorial: This seems kind of dead in the water to me. Kind of NIH-ish.