LOCAL NEWS CENSUS DATA COLLECTION README
========================================

TABLE OF CONTENTS
1. Introduction and Overview
2. Variable Definitions and Coding Instructions
3. Data Collection Methodology
4. Quality Control Procedures

========================================
1. INTRODUCTION AND OVERVIEW
========================================

This documentation accompanies the Local News Impact Consortium (LNIC) toolkit for creating a census of local news outlets. The goal is to systematically document news-producing organizations within a defined geographic area to assess the health of local news ecosystems. 

NOTE: You may not need, or be able to find, data for all of these variables. This is meant to provide a comprehensive framework. 

PURPOSE:
- Document existing news production capacity
- Identify geographic and demographic gaps in coverage
- Create baseline data for longitudinal analysis
- Inform funding and policy decisions
- Support community journalism initiatives

TARGET USERS:
- Academic and non-academic researchers
- Local Press Forward chapters
- Journalism support organizations
- Funders and philanthropists
- Policy makers and civic organizations

GEOGRAPHIC SCOPE CONSIDERATIONS:
Before beginning data collection, clearly define:
- Geographic boundaries (state, county, city, or region)
- Whether to include outlets that serve your area from outside its boundaries
- Whether to include hyperlocal outlets, broader-coverage outlets, or both
- How to treat cross-border media markets

========================================
2. VARIABLE DEFINITIONS AND CODING INSTRUCTIONS
========================================

CORE VARIABLES (Recommended for all projects):

1. outlet_name
   - Record the name exactly as it appears on the outlet's website or masthead
   - For outlets with multiple names/brands, use the primary brand name
   - Note any recent name changes in a separate field if relevant

2. digital_location_url
   - Full website URL (include https://)
   - If no website exists, enter "no_website" and record in the notes how the outlet was discovered
   - For social media-only outlets, include primary platform URL
   - Verify URLs are active and current
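
A format check for this field could be sketched as follows. This is a minimal sketch, not part of the LNIC toolkit: the function name is hypothetical, and it only tests that the value is either "no_website" or a full URL with a scheme. It does not test whether the site is active, which still needs a manual or automated visit.

```python
from urllib.parse import urlparse

NO_WEBSITE = "no_website"  # code taken from the instructions for this field


def is_valid_digital_location(value: str) -> bool:
    """Return True if value is "no_website" or a full http(s) URL."""
    value = value.strip()
    if value == NO_WEBSITE:
        return True
    parts = urlparse(value)
    # A full URL needs both a scheme (https://) and a host name.
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(is_valid_digital_location("https://www.example.com/news"))  # True
print(is_valid_digital_location("www.example.com"))               # False
```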

3. physical_address, city, county, state
   - Complete street address when available
   - If no physical address is listed, note "address_not_public"
   - Use mailing address if physical address unavailable
   - This data enables geographic mapping of outlets

4. founding_year
   - Year the outlet began operation
   - If the exact year is unknown, enter a best estimate and flag it in the notes, or enter "unknown"
   - For outlets that changed ownership/name, use original founding year

5. outlet_type
   Standard categories (use underscores):
   - print_only: Produces only print publication
   - digital_only: Online/digital content only
   - print_and_digital: Maintains both print and digital presence
   - broadcast_television: Over-the-air or cable TV station
   - public_radio: Non-commercial radio station
   - commercial_radio: Commercial radio station
   - other: Specify in notes

6. publication_frequency
   Standard categories:
   - daily: Content published daily
   - weekly: Published once per week
   - monthly: Published monthly
   - regularly: More than once per week but not daily
   - irregularly: No fixed schedule, typically 1-2 times per month or less (use monthly for outlets on a fixed monthly schedule)

7. communities_served
   - Record the community/geographic area as defined by the outlet
   - Copy language from outlet's "About" page when possible
   - Examples: "Springfield metro area," "Latino communities statewide"

8. community_of_identity
   Standard categories:
   - african_american: Serves African American community
   - hispanic_latino: Serves Hispanic/Latino community
   - asian_american: Serves Asian American community
   - native_american: Serves Native American community
   - lgbtq: Serves LGBTQ+ community
   - religious: Serves specific religious community
   - none: Not applicable

9. language
   - english: English language publication
   - spanish: Spanish language publication
   - other: Specify language in notes

10. owner
    - Name of individual or company that owns the outlet
    - For corporate ownership, use parent company name

11. owner_location
    - in_state: Owner located within study area
    - out_of_state: Owner located outside study area

12. business_model
    - commercial: For-profit, advertising/subscription revenue
    - nonprofit: 501(c)(3) or similar non-profit status
    - public_media: Publicly funded (PBS, NPR affiliates)

ADDITIONAL CONTENT AND OPERATIONAL VARIABLES:

13. news_originator_curator
    - originator: Produces original reporting and content
    - curator: Primarily republishes content from other sources
    - mixed: Combination of original and curated content

14. coverage_scope
    - hyperlocal: Very specific geographic area
    - city_wide: Covers entire city or municipality
    - county_wide: Covers entire county
    - regional: Multi-county or regional coverage
    - statewide: Covers entire state
    - multi_state: Covers multiple states

AUDIENCE AND REACH VARIABLES:

15. circulation_audience_size
    - For print: verified circulation numbers
    - For digital: monthly unique visitors or subscribers
    - For broadcast: estimated audience size from ratings
    - Use "unknown" if data unavailable

16. digital_traffic_monthly
    - Monthly unique visitors to website
    - Use third-party verification when possible (SimilarWeb, etc.)
    - Note source of data in comments

17. social_media_following
    - Total followers across all social media platforms
    - Include Facebook, Twitter/X, Instagram, TikTok, YouTube
    - Format as number or range (e.g., 1000-5000)

STAFFING AND OPERATIONAL CAPACITY:

18. editorial_staff_count
    - Number of full-time equivalent editorial staff
    - Include reporters, editors, photographers, videographers
    - Use decimal for part-time (e.g., 2.5 for two full-time, one half-time)
    - Use "unknown" if information unavailable
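
The full-time equivalent arithmetic can be written out as a short helper. This is an illustrative sketch; the function name and its parameters are assumptions, not part of the toolkit.

```python
def editorial_fte(full_time: int, part_time_fractions: list[float]) -> float:
    """Count each full-time staffer as 1.0 and each part-timer as their fraction."""
    return full_time + sum(part_time_fractions)


# Two full-time staff plus one half-time staffer, as in the example above.
print(editorial_fte(2, [0.5]))  # 2.5
```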

19. operating_budget
    - Can be formatted as a range, e.g.:
    - under_50k: Annual budget under $50,000
    - 50k_to_250k: Annual budget $50,000-$250,000

OWNERSHIP AND BUSINESS STRUCTURE:

20. ownership_structure
    - single_holding: Independently owned
    - multiple_2_to_5: Small chain (2-5 outlets)
    - multiple_6_to_10: Medium chain (6-10 outlets)
    - multiple_11_to_20: Large chain (11-20 outlets)
    - multiple_20_plus: Major chain/conglomerate (20+ outlets)
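
If the number of outlets an owner holds is known, the code can be derived from it. The sketch below rests on two assumptions that this README does not state: an owner with exactly one outlet is coded single_holding, and an owner with exactly 20 outlets is coded multiple_11_to_20 (the listed ranges overlap at 20). The function name is hypothetical.

```python
def ownership_structure_code(outlets_owned: int) -> str:
    """Map a count of outlets held by the owner to an ownership_structure code."""
    if outlets_owned < 1:
        raise ValueError("an owner must hold at least one outlet")
    if outlets_owned == 1:
        return "single_holding"  # assumption: one outlet means independently owned
    if outlets_owned <= 5:
        return "multiple_2_to_5"
    if outlets_owned <= 10:
        return "multiple_6_to_10"
    if outlets_owned <= 20:
        # Assumption: exactly 20 is placed in the 11-20 band.
        return "multiple_11_to_20"
    return "multiple_20_plus"


print(ownership_structure_code(7))  # multiple_6_to_10
```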

21. investment_firm_owned
    - hedge_fund: Owned by hedge fund
    - private_equity: Owned by private equity firm
    - other_investment: Owned by other investment firm
    - none: Not owned by investment firm
    - unknown: Ownership structure unclear

22. parent_company_holdings
    - Total number of media properties owned by parent company
    - Use actual numbers when known
    - Use ranges for large conglomerates (e.g., 100+)

23. nature_of_ownership
    - privately_held: Privately owned company
    - publicly_traded_private: Publicly traded but privately controlled
    - publicly_traded_shareholder: Publicly traded, shareholder controlled
    - unknown: Ownership structure unclear

REVENUE AND BUSINESS MODEL:

24. revenue_sources
    - List primary revenue streams (advertising, subscriptions, grants, etc.)
    - Separate multiple sources with semicolons
    - Examples: "advertising; subscriptions" or "grants; donations; events"
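
Because multiple sources share one field, analysis code will usually need to split them again. A minimal sketch, with a hypothetical function name:

```python
def parse_revenue_sources(value: str) -> list[str]:
    """Split a semicolon-separated revenue_sources value into clean entries."""
    return [part.strip() for part in value.split(";") if part.strip()]


print(parse_revenue_sources("grants; donations; events"))
# ['grants', 'donations', 'events']
```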

25. primary_revenue_source
    - advertising: Primarily advertising-supported
    - subscriptions: Primarily subscription/paywall revenue
    - grants: Primarily foundation/government grants
    - donations: Primarily individual donations
    - events: Primarily event-based revenue
    - mixed: No single dominant source
    - unknown: Revenue model unclear

26. secondary_revenue_source
    - Same categories as primary_revenue_source
    - Use "none" if outlet has single revenue stream
    - Use "unknown" if secondary source unclear

COUNTY-LEVEL DEMOGRAPHIC VARIABLES:
Use actual numeric values for demographic data from the U.S. Census Bureau:

27. county_population
    - Total population count (e.g., 150000)

28. county_population_density
    - People per square mile (e.g., 75.5)

29. county_median_household_income
    - In dollars (e.g., 65000)

30. county_per_capita_income
    - Individual per capita income in dollars (e.g., 32000)

31. county_poverty_rate
    - Percentage below poverty line (e.g., 15.2)

32. county_high_school_graduate_rate
    - Percentage of adults with high school diploma or equivalent (e.g., 89.5)

33. county_higher_education_rate
    - Percentage of adults with bachelor's degree or higher (e.g., 28.7)

34. county_under_18_percentage
    - Percentage of population under 18 years old (e.g., 22.4)

35. county_over_65_percentage
    - Percentage of population over 65 years old (e.g., 18.9)

36. county_white_percentage
    - Percentage of population identifying as White alone (e.g., 75.2)

37. county_black_percentage
    - Percentage of population identifying as Black or African American (e.g., 12.1)

38. county_hispanic_latino_percentage
    - Percentage of population identifying as Hispanic or Latino (e.g., 16.8)

39. county_asian_percentage
    - Percentage of population identifying as Asian (e.g., 5.4)

40. county_native_american_percentage
    - Percentage of population identifying as Native American (e.g., 1.2)

41. county_urbanization_level
    - rural: Primarily rural county
    - suburban: Suburban/mixed county
    - urban: Urban county

CIVIC HEALTH AND INFRASTRUCTURE VARIABLES:

42. county_broadband_access_rate
    - Percentage of households with broadband internet access (e.g., 82.3)

43. county_literacy_score
    - Average literacy score from standardized assessments
    - Use available state/federal data sources

44. county_unemployment_rate
    - Current unemployment rate percentage (e.g., 4.2)

45. county_voter_turnout_rate
    - Percentage of eligible voters who voted in most recent general election (e.g., 67.8)

46. county_volunteer_rate
    - Percentage of population engaged in volunteer activities (e.g., 23.5)

47. number_of_schools_in_county
    - Total number of public and private K-12 schools

48. number_of_colleges_universities_in_county
    - Total number of higher education institutions

49. philanthropic_media_investment_in_area
    - Annual philanthropic investment in local media (in dollars)
    - Use "unknown" if data unavailable
    - Source from Media Impact Funders database when possible

50. local_journalist_count_county
    - Number of working journalists in county
    - Use Muck Rack/Rebuild Local News data when available
    - Include freelancers and part-time as fractional equivalents


CODING STANDARDS:
- Use lowercase letters with underscores for all variable names
- Use consistent category codes across all variables
- For numeric fields, use actual numbers (not text)
- For percentage fields, record percentage points, not proportions (15.2 not 0.152)
- Leave fields blank if data is not available (don't use "unknown" unless specified)
- Document data sources for all information
- Use YYYY-MM-DD format for all dates
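
Two of these standards, variable naming and date format, are easy to test in code. The sketch below uses hypothetical function names and only the Python standard library. It reads "lowercase letters with underscores" as also allowing digits, since variable names such as county_under_18_percentage contain them.

```python
import re
from datetime import datetime

# Lowercase letters, digits, and underscores, starting with a letter.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]*$")


def is_valid_variable_name(name: str) -> bool:
    """Return True if name follows the lowercase_with_underscores standard."""
    return bool(NAME_PATTERN.match(name))


def is_valid_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    # strptime also accepts "2025-8-1", so compare with the zero-padded form.
    return parsed.strftime("%Y-%m-%d") == value


print(is_valid_variable_name("county_poverty_rate"))  # True
print(is_valid_date("2025-08-01"))                    # True
print(is_valid_date("08/01/2025"))                    # False
```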

========================================
3. DATA COLLECTION METHODOLOGY
========================================

PHASE 1: MASTER LIST CREATION

Primary Sources:
1. State of Local News Project database (Northwestern University)
2. INN (Institute for Nonprofit News) directory
3. Center for Community Media directories
4. State government newspaper listings
5. FCC station databases (for broadcast)
6. Local press association member lists

Search Strategies:
- Search in multiple languages to find non-English outlets
- Search "[City name] news," "[County name] newspaper"
- Check local government websites for legal notice publishers
- Contact local journalism schools and organizations

PHASE 2: DATA VERIFICATION
For each outlet identified:
1. Verify website is active and current
2. Confirm physical location if listed
3. Review "About" page for mission, coverage area, ownership
4. Check recent content to confirm active publication

PHASE 3: DATA ENTRY
Best Practices:
- Use standardized category codes (underscores, no spaces)
- Maintain consistent formatting across all entries
- Document sources for all information
- Back up the data regularly and keep it under version control

PHASE 4: DEMOGRAPHIC DATA COLLECTION
County-level demographic data sources:
- U.S. Census Bureau American Community Survey
- Bureau of Labor Statistics for unemployment data
- Local Catalyst toolkit for consolidated county data
- Civic Information Index for civic health metrics
- Muck Rack/Rebuild Local News for journalist counts
- Media Impact Funders for philanthropic investment data

PHASE 5: AUDIENCE AND BUSINESS DATA
Additional data collection methods:
- Direct outreach to outlets for circulation/audience data
- Third-party verification tools (SimilarWeb, Comscore)
- Media kit analysis for audience claims
- Social media platform analytics
- Industry reports and databases

========================================
4. QUALITY CONTROL PROCEDURES
========================================

BEFORE PUBLICATION:
1. Pre-release review with local journalism experts
2. Cross-check sample of outlets with multiple sources
3. Verify geographic coding accuracy
4. Check for duplicate entries
5. Confirm all required fields are complete
6. Validate demographic data against official sources
7. Cross-reference audience data with multiple sources when possible
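
The duplicate check in step 4 can be partly automated by comparing normalized URLs. The sketch below is one possible approach under stated assumptions: records are dictionaries with outlet_name and digital_location_url keys, and two URLs count as the same outlet when they match after dropping the scheme, a leading "www.", letter case, and a trailing slash. The function names and sample records are hypothetical, and the code needs Python 3.9 or later for str.removeprefix.

```python
from urllib.parse import urlparse


def url_key(url: str) -> str:
    """Reduce a URL to host + path so trivial variants compare equal."""
    parts = urlparse(url.strip().lower())
    host = parts.netloc.removeprefix("www.")
    return host + parts.path.rstrip("/")


def find_duplicates(records: list[dict]) -> list[tuple[str, str]]:
    """Return (first_name, later_name) pairs of outlets sharing a URL key."""
    seen: dict[str, str] = {}
    duplicates = []
    for record in records:
        url = record["digital_location_url"]
        if url == "no_website":
            continue  # outlets without a website cannot be matched by URL
        key = url_key(url)
        if key in seen:
            duplicates.append((seen[key], record["outlet_name"]))
        else:
            seen[key] = record["outlet_name"]
    return duplicates


# Hypothetical sample records.
records = [
    {"outlet_name": "Example Gazette", "digital_location_url": "https://www.example.com/"},
    {"outlet_name": "The Example Gazette", "digital_location_url": "http://example.com"},
    {"outlet_name": "Valley Radio", "digital_location_url": "no_website"},
]
print(find_duplicates(records))  # [('Example Gazette', 'The Example Gazette')]
```

Matches found this way are candidates for manual review, since two distinct outlets can legitimately share a parent site.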

DATA VALIDATION CHECKS:
- URL accessibility test for all websites
- Category standardization across all entries
- Logical consistency checks (e.g., staff count vs. budget range)
- Geographic coordinate validation if included
- Date format consistency
- Numeric range validation for percentages and counts
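
The numeric range check for percentage fields might look like the sketch below. The function name and the three return labels are hypothetical. The "possible_proportion" flag is a heuristic for values entered as 0.152 instead of 15.2; a genuine value below 1 percent will also be flagged, so flagged values should be reviewed rather than changed automatically.

```python
def check_percentage(value: float) -> str:
    """Classify a percentage-field value as ok, suspicious, or out of range."""
    if not 0 <= value <= 100:
        return "out_of_range"
    if 0 < value < 1:
        # Heuristic: may be a proportion (0.152) entered instead of 15.2.
        return "possible_proportion"
    return "ok"


print(check_percentage(15.2))   # ok
print(check_percentage(0.152))  # possible_proportion
print(check_percentage(152))    # out_of_range
```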

ONGOING QUALITY ASSURANCE:
- Annual updates to demographic data
- Quarterly checks of website activity
- Annual verification of ownership information
- Continuous monitoring for new outlets and closures

========================================
QUALITY CONTROL CHECKLIST
========================================

Before submitting your data, check:

TEXT FIELDS:
□ Outlet names match exactly what appears on their websites
□ URLs are complete and include https://
□ All location fields use standard naming (county names include the word "County")

NUMERIC FIELDS:
□ No commas in numbers (use 8500 not 8,500)
□ No dollar signs in financial data
□ Percentages are recorded as percentage points, not proportions (15.2 not 0.152)
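
Values copied from websites or media kits often arrive with commas and dollar signs. A minimal cleaning sketch, with a hypothetical function name, that strips both before converting to a number:

```python
def clean_number(raw: str) -> float:
    """Remove commas and dollar signs, then convert the text to a number."""
    cleaned = raw.strip().replace(",", "").replace("$", "")
    return float(cleaned)  # raises ValueError for text such as "unknown"


print(clean_number("8,500"))    # 8500.0
print(clean_number("$65,000"))  # 65000.0
```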

STANDARDIZED CODES:
□ All codes use lowercase with underscores
□ Codes match exactly the options provided
□ No creative variations or abbreviations

DATES:
□ All dates use YYYY-MM-DD format
□ Years are four digits
□ No text descriptions of dates

CONSISTENCY:
□ Same outlet types coded the same way across all entries
□ Owner locations consistent with your study area definition
□ Revenue sources match standardized categories

COMPLETENESS:
□ Required fields completed for all outlets
□ Missing data handled according to guidelines
□ Notes field used for unusual circumstances

========================================

For questions about this methodology or to contribute improvements, please contact the Local News Impact Consortium (LNIC) at https://www.localnewsimpact.org/contact/.

Last updated: August 2025
Version: 2.0