Summarized data in the CAIDA Dataset on the Witty Worm

********
No portion of the CAIDA Dataset on the Witty Worm may be redistributed.

All users who publish (in any venue, including presentations, web pages, 
and papers) data from this dataset must cite:
        The CAIDA Dataset on the Witty Worm - March 19-24, 2004,
        Colleen Shannon and David Moore,
        http://www.caida.org/passive/witty/.   Support for the Witty
        Worm Dataset and the UCSD Network Telescope are provided
        by Cisco Systems, Limelight Networks, the US Department of
        Homeland Security, the National Science Foundation, and
        CAIDA, DARPA, Digital Envoy, and CAIDA Members.
********

Publicly available:

	These files contain no IP addresses or other sensitive information.

	witty.country.distribution.txt
		The country distribution of the Witty-infected
		hosts, as mapped by Digital Envoy's NetAcuity service
		[4].  The data consists of a tab-separated file
		containing the three-letter ISO country abbreviations,
		the percentage of the Witty-infected population
		estimated to reside in that country, and the total
		count of infected computers in that country.
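
		A minimal Python sketch of reading this layout and
		recomputing the percentage column from the counts.
		It assumes there is no header line and that the
		percentages are on a 0-100 scale; the sample rows
		are made up, not taken from the dataset.

```python
import csv
import io

# Made-up rows in the assumed layout: country<TAB>percent<TAB>count
sample = "USA\t50.0\t2\nGBR\t25.0\t1\nCAN\t25.0\t1\n"

rows = [
    (code, float(percent), int(count))
    for code, percent, count in csv.reader(io.StringIO(sample), delimiter="\t")
]

# Recompute each country's share from the raw counts.
total = sum(count for _, _, count in rows)
recomputed = {code: 100.0 * count / total for code, _, count in rows}
```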

	witty.connection.speeds.txt
		The estimated connectivity distribution of
		Witty-infected hosts -- broadband, dsl, dialup, t1, etc.
		The file consists of a tab-separated list of Internet
		connection types, the percentage of the Witty-infected
		population estimated to use that type of connection,
		and the total count of infected computers with that
		type of link.

	witty.start.cdf.txt
		A cumulative distribution of the times at which we
		first saw Witty probes from various IP addresses.
		A space-separated file containing a Unix timestamp
		for each IP address and the number of IP addresses
		with start times earlier than or equal to that timestamp.
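
		A minimal Python sketch of loading such a file and
		normalizing the counts to fractions.  It assumes
		exactly two whitespace-separated fields per line
		and no header; the name parse_cdf and the sample
		lines are hypothetical.

```python
from typing import Iterable, List, Tuple

def parse_cdf(lines: Iterable[str]) -> List[Tuple[float, int]]:
    """Parse 'timestamp count' lines into (timestamp, cumulative_count) pairs."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:  # skip blank or malformed lines
            continue
        rows.append((float(fields[0]), int(fields[1])))
    return rows

# Made-up sample lines in the assumed layout (not real dataset values).
sample = ["1079671500 1", "1079671502 2", "1079671507 3", "1079671510 4"]
rows = parse_cdf(sample)

# The last line carries the total, so dividing by it yields a 0..1 CDF.
total = rows[-1][1]
fractions = [(timestamp, count / total) for timestamp, count in rows]
```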

	witty.relative.start.cdf.txt
		The amount of time between the first observation
		of worm traffic on the telescope and the first time
		we saw a worm packet from a given IP address.  A
		space-separated file with the time elapsed between
		the first Witty packet and the first packet from a
		given IP address, and the number of hosts with a
		start time earlier than or equal to that time.

	witty.end.cdf.txt
		A cumulative distribution of the last times we
		observed Witty probes from various IP addresses.
		A space-separated file containing the Unix timestamp
		for each IP address and the number of IP addresses
		with end times earlier than or equal to that timestamp.

	witty.relative.end.cdf.txt
		The amount of time between the first observation
		of worm traffic on the telescope and the last time
		we saw a worm packet from a given IP address.  A
		space-separated file containing the cumulative
		distribution of the time, in seconds, between worm
		onset and the last packet seen from an IP address,
		and the count of computers with relative durations
		less than or equal to the first column.

	witty.durations.cdf.txt
		The length of time Witty-infected IP addresses
		were observed to be transmitting the worm.  The
		file contains a tab-separated cumulative distribution
		with one line per IP address and columns for the
		number of seconds active, the count of IP addresses
		at less than or equal to this infection duration,
		and the percentage of IP addresses at less than or
		equal to this infection duration.
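
		A minimal Python sketch of reading a quantile (here
		the median) off this cumulative distribution using
		only the count column.  It assumes no header line
		and rows sorted by increasing duration; the helper
		duration_at_fraction and the sample rows are
		hypothetical.

```python
import math
from typing import List, Tuple

def duration_at_fraction(rows: List[Tuple[float, int]], q: float) -> float:
    """Smallest duration whose cumulative count reaches fraction q of all addresses."""
    total = rows[-1][1]            # last row holds the total address count
    needed = math.ceil(q * total)  # cumulative count required for quantile q
    for seconds, count in rows:
        if count >= needed:
            return seconds
    return rows[-1][0]

# Made-up rows in the assumed layout: seconds<TAB>count<TAB>percent
sample = "10\t1\t25.0\n60\t2\t50.0\n3600\t3\t75.0\n86400\t4\t100.0\n"
rows = []
for line in sample.splitlines():
    seconds, count, _percent = line.split("\t")
    rows.append((float(seconds), int(count)))

median = duration_at_fraction(rows, 0.5)
```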


Restricted Access:

	These files contain the IP addresses of worm victims and
	should not be shared with anyone.

	witty.ips.txt
		A list of the IP addresses observed to be transmitting
		the Witty worm, one address per line.

		Q. Why are there 55k addresses in witty.ips.txt
		when the CAIDA Witty worm paper [1] suggests that
		approximately 12k computers were infected?

		A. 55k IP addresses and ~12k computers are only
		contradictory if you assume that every IP address
		uniquely identifies a computer.  This is not the
		case due to widespread use of dynamic addressing
		(typically DHCP) that can cause IP address migration
		for individual infected computers, and
		less-widespread-but-still-prevalent use of address
		aggregators (NAT and proxy firewall type boxes that
		represent many hosts).  Dynamic addressing serves
		to (often drastically) magnify the number of computers
		that appear to be infected, while NATs and other
		aggregators make many infected computers appear to
		be one infected computer (which usually isn't
		infected at all!).  For Witty in particular, there
		was what appeared to be a firewall device at a
		university in Turkey that randomly utilized an
		address from a /21 for every flow sent from inside
		the university out to the rest of the internet.
		This made a small number (we think <5) of infected
		hosts appear to be more than 2000 computers (inflating
		the count for those machines by three orders of
		magnitude!).  Usually DHCP effects are not quite
		so drastic, but they do cause significant inflation.
		To arrive at our estimate of the number of machines
		infected (which is an estimate; there is no way to
		completely reliably identify DHCP/NAT hosts), we
		looked at the maximum number of machines initially
		active simultaneously, and adjusted that upwards
		looking at the trends as additional machines came
		online on different continents.  We also looked at
		the stability of the apparent infected machines in
		each /24 over time.

		For more info, see:
			- DHCP and Witty: Figure 2 in [1]
			- address inflation in witty traces [2]
			- evidence and methods for finding dynamic
			  addressing problems in worm traffic [3] (section B.6)
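
		As an illustration of the per-/24 view mentioned
		above, a minimal Python sketch that groups observed
		addresses by their enclosing /24 and shows how many
		addresses a /21 spans.  This is not the method used
		to produce the ~12k estimate, only the grouping
		step; the function name and sample addresses
		(documentation and private ranges) are hypothetical.

```python
import ipaddress
from collections import Counter
from typing import Iterable

def addresses_per_slash24(ips: Iterable[str]) -> Counter:
    """Count distinct observed addresses inside each /24 network."""
    counts = Counter()
    for ip in set(ips):  # de-duplicate repeated observations
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        counts[str(net)] += 1
    return counts

# Made-up addresses from documentation ranges, not real worm sources.
sample = ["192.0.2.1", "192.0.2.77", "192.0.2.200", "198.51.100.9", "192.0.2.1"]
per_net = addresses_per_slash24(sample)

# A /21 covers 2048 addresses, so a device cycling through one can make
# a handful of hosts look like a couple of thousand sources.
slash21_size = ipaddress.ip_network("10.0.0.0/21").num_addresses
```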


	witty.start.ips.cdf.txt
		A cumulative distribution of the times at which we
		first saw Witty probes from various IP addresses.
		A tab-separated file containing a Unix timestamp
		for each IP address, the IP address, and the number
		of IP addresses with start times earlier than or
		equal to that timestamp.

	witty.relative.start.ips.cdf.txt
		The amount of time between the first observation
		of worm traffic on the telescope and the first time
		we saw a worm packet from a given IP address.  A
		tab-separated file with the time elapsed between
		the first Witty packet and the first packet from a
		given IP address, the IP address, and the number of
		hosts with a start time earlier than or equal to that time.

	witty.end.ips.cdf.txt
		A cumulative distribution of the last times we
		observed Witty probes from various IP addresses.
		A tab-separated file containing the Unix timestamp
		for each IP address, the IP address itself, and
		the number of IP addresses with end times earlier
		than or equal to that timestamp.

	witty.relative.end.ips.cdf.txt
		The amount of time between the first observation
		of worm traffic on the telescope and the last time
		we saw a worm packet from a given IP address.  A
		tab-separated file containing the cumulative
		distribution of the time, in seconds, between worm
		onset and the last packet seen from an IP address,
		the IP address, and the count of computers with relative
		durations less than or equal to the first column.

	witty.durations.ips.cdf.txt
		The length of time Witty-infected IP addresses
		were observed to be transmitting the worm.  The
		file contains a tab-separated cumulative distribution
		with one line per IP address and columns for the
		number of seconds active, the IP address, the count
		of IP addresses at less than or equal to this
		infection duration, and the percentage of IP addresses
		at less than or equal to this infection duration.

	witty.countries.txt
		A tab-separated list of IP addresses and their
		estimated country of origin, as mapped by Digital
		Envoy's NetAcuity service [4].

	witty.hostnames.txt
		A space-separated list of IP addresses and their
		hostnames as looked up on March 24, 2004 (five days
		after the onset of the worm).  The file contains
		hostnames for 45,971 (82%) of the Witty-infected
		IP addresses; the rest had no resolvable hostnames.

	witty.table.txt
		This file contains six tab-separated columns: source
		IP address, number of packets received from that address,
		number of bytes, number of 1-hour raw pcap traces in
		which this IP address appeared, and the first and last
		times we observed traffic from this IP address.
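
		A minimal Python sketch of parsing one line of this
		layout.  The format of the two time columns is not
		stated here, so the sketch keeps them as strings;
		the field names and the sample line are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TableRow:
    ip: str
    packets: int
    bytes_seen: int
    traces: int
    first_seen: str  # kept as text: the time format is an unknown here
    last_seen: str

def parse_table_line(line: str) -> TableRow:
    """Split one tab-separated line into its six columns."""
    ip, packets, nbytes, traces, first, last = line.rstrip("\n").split("\t")
    return TableRow(ip, int(packets), int(nbytes), int(traces), first, last)

# Made-up line in the assumed layout (documentation address, invented values).
sample = "192.0.2.1\t120\t128400\t3\t1079671500\t1079682300\n"
row = parse_table_line(sample)
mean_packet_size = row.bytes_seen / row.packets
```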

	build_witty_summaries.sh
		The script used to build all of the cumulative
		distributions and the witty.ips file.


[1]	The Spread of the Witty Worm:
	http://www.caida.org/analysis/security/witty/
[2]	Witty Raw Traces README:
	https://data.caida.org/datasets/security/witty/data/witty_pcap_traces/README
[3]	Code-Red: a case study on the spread and victims of an Internet worm:
	http://www.caida.org/outreach/papers/2002/codered/codered.pdf
[4]	Digital Envoy's NetAcuity Service:
	http://www.digitalenvoy.net/solutions/netacuity.shtml



