Description of Data

 

The data provided describes technology companies, organized by year, with each company's founding year, location (city, state, zip), sales, employment, primary industry, and product types. The industry and product type classifications come from the North American Industry Classification System (NAICS). The data is currently a subset of a database being used for research at UMass Lowell. In addition, a comma-delimited data file (zipdata.csv) containing 5-digit zip codes with their latitudes and longitudes is provided.

 

Detailed NAICS information can be found at http://www.census.gov/epcd/www/naics.html

 

Each CompanyDataXX file contains the following information for individual high-tech companies in the United States for one year (XX). The files cover the years 1989 to 2003 (15 years).

 

Column

 

 1    Year

 2    Company ID (number)

 3-5  Address (city, state and zip)

 6    Industry type (chemicals, energy, medical, software, etc.) - code description is in the file IndustryCodes

 7    Year formed (founded)

 8    Primary NAICS (government company classification code)

 9    Sales in Millions

10    Employment Count
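
The column layout above can be read with a short Python sketch. The file format is not specified here, so this sketch assumes a comma-delimited file with no header row and exactly the ten columns listed; the names Company and read_companies are illustrative only. Zip codes and NAICS codes are kept as text so that leading zeros survive, and blank numeric fields become None.

```python
import csv
from typing import NamedTuple, Optional


class Company(NamedTuple):
    year: int
    company_id: str
    city: str
    state: str
    zip_code: str                     # kept as text to preserve leading zeros
    industry: str                     # industry type code, e.g. MED
    year_formed: Optional[int]
    naics: str                        # primary NAICS code, kept as text
    sales_millions: Optional[float]
    employment: Optional[int]


def _num(text, cast):
    """Return cast(text), or None when the field is blank or not numeric."""
    text = text.strip()
    if not text:
        return None
    try:
        return cast(text)
    except ValueError:
        return None


def read_companies(stream):
    """Yield one Company per row of a comma-delimited stream (assumed: no header)."""
    for row in csv.reader(stream):
        if len(row) < 10:
            continue  # skip blank or short rows
        yield Company(
            year=int(row[0]),
            company_id=row[1].strip(),
            city=row[2].strip(),
            state=row[3].strip(),
            zip_code=row[4].strip(),
            industry=row[5].strip(),
            year_formed=_num(row[6], int),
            naics=row[7].strip(),
            sales_millions=_num(row[8], float),
            employment=_num(row[9], int),
        )
```

A file would be passed in as an open text stream, for example open(path, newline=""), where path is whatever name the CompanyDataXX file has on disk.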

 

The ProductDataXX files give the product classifications for each company for each year. These files may contain more records than Excel can open.

 

Column

 

1     Year

2     Company ID (same as above)

3     NAICS product code

4     Product Verbal Description (providing more details on the product than the NAICS code)

 

There are 87,659 companies in the complete data set. For example, about 60,000 companies are included for 2003.

 

There is one company data entry for each company for each year. 

There are multiple product data entries for each company for each year because companies typically produce multiple products.

The company data and product data can be related through the combination of company ID and year.
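
The relationship between the two kinds of file can be sketched in Python as a lookup keyed on (company ID, year). This is only an illustration: it assumes rows have already been read as lists of strings in the column order given above, and the function names group_products and attach_products are hypothetical.

```python
from collections import defaultdict


def group_products(product_rows):
    """Map (company_id, year) -> list of (naics_product_code, description)."""
    products = defaultdict(list)
    for year, company_id, naics_code, description in product_rows:
        products[(company_id, year)].append((naics_code, description))
    return products


def attach_products(company_rows, products):
    """Yield (company_row, product_list); the list is empty when nothing matches."""
    for row in company_rows:
        key = (row[1], row[0])  # company ID is column 2, year is column 1
        yield row, products.get(key, [])
```

Because there is one company entry per company per year but many product entries, each company row ends up paired with a list of zero or more products.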

 

There is missing data.

 

The web site for the full NAICS codes is http://www.census.gov/epcd/naics02/naicod02.htm

 

Companies in each NAICS code can be searched in the data by region and year. Remember that government sources do not supply company-specific information, only totals, and only for geographical areas with enough companies that specific companies cannot be identified.

 

The data contains more. Each NAICS code identifies the companies for which that code is the primary code or primary industry and, notably, the same NAICS code also identifies companies that make a product that fits the code.

 

The zip code data is in zipdata.csv, a comma-delimited file with the following structure:

 

Column

 

1.    Zip code

2.    Latitude

3.    Longitude
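
A minimal Python sketch for turning this file into a zip-to-coordinate lookup follows. Two things are assumptions rather than facts about the file: a header row may or may not be present (rows whose coordinates are not numeric are skipped), and zip codes that lost their leading zeros are padded back to five digits. The name read_zip_coordinates is illustrative.

```python
import csv


def read_zip_coordinates(stream):
    """Map zip code (5-character string) -> (latitude, longitude)."""
    coords = {}
    for row in csv.reader(stream):
        if len(row) < 3:
            continue  # skip blank or short rows
        try:
            lat, lon = float(row[1]), float(row[2])
        except ValueError:
            continue  # header row or malformed coordinates
        # Assumption: pad to 5 digits in case leading zeros were dropped.
        coords[row[0].strip().zfill(5)] = (lat, lon)
    return coords
```

The resulting dictionary can be indexed with the zip code from a company record to place that company on a map.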

 

 

Example usage

 

1. The database is searched for companies in which MED is the ‘Primary Industry’. Spreadsheets are created which include all such companies by region, state, zip code, or city.

 

2. The companies on the spreadsheet are sorted by company NAICS code. This gives a distribution of companies by the whole range of NAICS codes, including both manufacturing and various services. 

 

3. Step 1 is repeated, but the data is searched for companies that have MED products. Again, the companies on the spreadsheet are sorted by NAICS code. This gives a much larger group of companies, not only in the services sectors but in manufacturing sectors as well.

 

4. Regions can be compared in terms of the different arrays of NAICS codes that make up MED, or any other technology group or even any major product code.

 

5. The same region can be compared over time.

 

6. Fast-growing regions by technology group can be found and their compositions compared over time.
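
Steps 1 and 2 above can be approximated without a spreadsheet. The Python sketch below counts companies per primary NAICS code for one industry type, optionally restricted to one state. It assumes company rows are lists of strings in the column order given for CompanyDataXX; the function name is hypothetical. Step 3 is not covered, because the mapping from an industry type such as MED to product codes is not described here.

```python
from collections import Counter


def naics_distribution(company_rows, industry, state=None):
    """Count companies per primary NAICS code for one industry type."""
    counts = Counter()
    for row in company_rows:
        if row[5] != industry:        # column 6: industry type
            continue
        if state is not None and row[3] != state:  # column 4: state
            continue
        counts[row[7]] += 1           # column 8: primary NAICS code
    return counts
```

Calling naics_distribution(rows, "MED").most_common() lists the NAICS codes from most to least frequent. Running it on the rows for two regions, or for two different years, gives the distributions that steps 4 and 5 compare.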

 

 

 

Contest Questions

 

1. Characterize correlations or other patterns among two or more variables in the data.

For example:

What products lead to growth in other products or industries?

What contributes to companies moving, and what characterizes the moves?

 

2. Characterize clusters of products, industries, sales, regions, and/or companies.

For example:

What geographical areas developed in a similar manner or have similar characteristics?

What product combinations tend to be produced by a company, or in a region?

 

3. Characterize unusual products, sales, regions, or companies.

For example:

Are there regions whose product mix changes in an unusual direction?

Are there products whose sales per employee varies geographically?

 

4. Characterize any other trend, pattern, or structure that may be of interest.

 

 

Send any additional questions to:

 

Grinstein@cs.uml.edu