MaxNoichl committed
Commit b38d551 · 1 Parent(s): e995663

Migrate OpenAlex integration off PyAlex
.gitignore CHANGED
@@ -6,6 +6,8 @@
 !*.py
 !requirements.txt
 !README.md
+!OPENALEX_API_MIGRATION_PROPOSAL.md
+!openalex_config.example.json
 
 # Even if they are in subdirectories
 !*/
@@ -43,3 +45,4 @@ static/
 
 app_save_copy.py
 app_2.py
+openalex_config.local.json
OPENALEX_API_MIGRATION_PROPOSAL.md ADDED
@@ -0,0 +1,657 @@
# Proposal: OpenAlex API Migration for OpenAlex Mapper

Date: 2026-03-12

## Executive summary

The repository does not need a UI redesign. The plotting, embedding, CSV upload, and downstream record processing can stay largely as they are. The brittle part is the OpenAlex transport/query layer.

The current code:

- accepts OpenAlex website URLs and turns them into PyAlex queries in [`openalex_utils.py`](./openalex_utils.py)
- authenticates with `pyalex.config.email` in [`app.py`](./app.py)
- relies on page-based pagination and `query.count()` in [`app.py`](./app.py)
- still understands deprecated filters like `default.search`, `title_and_abstract.search`, and `host_venue.id`

That is no longer a safe contract with the current OpenAlex API.

My recommendation is:

1. Keep the user-facing interface exactly as it is.
2. Introduce a repository-owned OpenAlex compatibility layer.
3. Move the main list-fetching path from "URL -> PyAlex DSL" to "URL -> normalized API params -> direct HTTP client".
4. Keep PyAlex 0.21 only where it adds value during transition, or remove it from the hot path entirely.

This is the lowest-risk way to keep old pasted OpenAlex URLs working while aligning with the current API.

## What the repository is doing now

### Current integration points

- [`app.py`](./app.py#L197) sets `pyalex.config.email`, not an API key.
- [`openalex_utils.py`](./openalex_utils.py#L7) parses an OpenAlex URL into a `Works()` query.
- [`openalex_utils.py`](./openalex_utils.py#L24) splits `filter=` on commas and maps `default.search` to `.search(...)`.
- [`app.py`](./app.py#L480) calls `query.count()` before fetching.
- [`app.py`](./app.py#L531), [`app.py`](./app.py#L558), [`app.py`](./app.py#L577), and [`app.py`](./app.py#L614) paginate with `method='page'`.
- [`openalex_utils.py`](./openalex_utils.py#L212) resolves DOI CSV uploads with `Works().filter(doi=doi_str).get(...)`.
- [`openalex_utils.py`](./openalex_utils.py#L218) generates readable query labels and still special-cases `host_venue.id` and `concepts.id`.

### Important observation

The downstream record model is mostly still compatible with current OpenAlex work responses:

- `primary_location`
- `primary_topic`
- `abstract_inverted_index`
- `referenced_works`
- `title`

That means the migration can be focused on query normalization, authentication, pagination, and transport. The plotting pipeline does not need to change shape.

## Current OpenAlex state that matters for this repo

### 1. Authentication has changed

OpenAlex now documents API-key-based access and credit-based billing. The older "polite pool via email" approach is no longer the right production model.

Impact on this repo:

- [`app.py`](./app.py#L197) is configured for the old model.
- The Hugging Face app should assume `OPENALEX_API_KEY` is required.
- `pyalex.config.email` is no longer enough as the primary auth strategy.

Recommended response:

- add `OPENALEX_API_KEY`
- configure the client with the API key
- keep `email` only as optional metadata, not as the core auth mechanism

### 2. Page-based pagination is only safe for shallow result sets

OpenAlex now requires cursor pagination beyond 10,000 results. I verified this live on 2026-03-12:

- `https://api.openalex.org/works?filter=publication_year:2024&page=101&per_page=100`
- response: `Pagination error. Maximum results size of 10,000 records is exceeded. Cursor pagination is required for records beyond 10,000.`

Impact on this repo:

- the "All" path in [`app.py`](./app.py#L605)
- the "First n samples" path in [`app.py`](./app.py#L605)
- the random-sampling fallback in [`app.py`](./app.py#L572)

All three currently rely on page-based traversal and can fail on large queries.

Recommended response:

- use cursor pagination for all full-download paths
- reserve page pagination only for explicitly shallow requests, or stop using it entirely
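
The cursor loop itself is small. A minimal sketch, assuming `requests` and a hypothetical `iter_works` helper; the `get` callable is injectable so the loop can be exercised without network access:

```python
import requests

OPENALEX_WORKS = "https://api.openalex.org/works"

def iter_works(params, api_key=None, per_page=100, get=requests.get):
    """Yield work records via cursor pagination (no 10,000-record page cap).

    `get` defaults to requests.get but is injectable for testing.
    """
    params = dict(params, per_page=per_page, cursor="*")
    if api_key:
        params["api_key"] = api_key  # key comes from the environment, never from user URLs
    while params["cursor"]:
        resp = get(OPENALEX_WORKS, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        yield from payload["results"]
        # next_cursor is None on the last page, which terminates the loop
        params["cursor"] = payload["meta"].get("next_cursor")
```

The same generator can serve both the "All" and "First n samples" paths; the caller simply stops consuming once it has enough records.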

### 3. Deprecated search/filter names are still in the code

Current OpenAlex docs mark old search/filter surfaces as deprecated, including:

- `default.search`
- `title_and_abstract.search`
- `host_venue`
- `alternate_host_venues`
- `x_concepts`
- Concepts as the old classification system

Impact on this repo:

- [`openalex_utils.py`](./openalex_utils.py#L29) still maps `default.search`
- [`openalex_utils.py`](./openalex_utils.py#L261) still understands `title_and_abstract.search`
- [`openalex_utils.py`](./openalex_utils.py#L340) still special-cases `host_venue.id`
- [`openalex_utils.py`](./openalex_utils.py#L350) still treats `concepts.id` as a first-class query label case

### 4. `host_venue.id` is not just deprecated; it already breaks

I verified this live on 2026-03-12:

- `https://api.openalex.org/works?filter=host_venue.id:S125754415&per_page=1`
- response: `Invalid query parameters error. host_venue and alternate_host_venues are deprecated in favor of locations.`

The replacement works:

- `https://api.openalex.org/works?filter=primary_location.source.id:S125754415&per_page=1`

Impact on this repo:

- old user-pasted URLs containing `host_venue.id` will fail today

Recommended response:

- normalize `host_venue.id -> primary_location.source.id`
- normalize `alternate_host_venues.id -> locations.source.id`

### 5. Concepts are legacy; Topics are the current taxonomy

OpenAlex responses still include `concepts`, and `concepts.id` still works today, but OpenAlex now treats Topics as the current classification system.

Impact on this repo:

- the app itself mostly uses `primary_topic`, which is good
- old user URLs may still use `concepts.id`

Recommended response:

- do not break old `concepts.id` URLs immediately
- treat them as legacy pass-through
- update labels, examples, and new code to prefer Topics
- do not try to auto-convert Concept IDs to Topic IDs; that is not a clean one-to-one migration

### 6. `search=` should be the canonical search input

Current docs prefer the top-level `search=` parameter and current field-specific search filters over `default.search` and `title_and_abstract.search`.

Impact on this repo:

- the parser should accept old URLs
- internal canonical form should use current search syntax

Recommended response:

- normalize `filter=default.search:...` into `search=...`
- keep `title_and_abstract.search` accepted for legacy compatibility
- internally prefer `search`, `title.search`, `abstract.search`, and `fulltext.search`

### 7. XPAC is now an opt-in corpus extension

OpenAlex supports `include_xpac=true` to include the extended paper corpus.

Impact on this repo:

- enabling XPAC changes result sets
- that would alter the app's semantics without the user asking for it

Recommended response:

- explicitly keep `include_xpac=false` or omit it
- do not change corpus scope if the requirement is "keep the interface working as is"

### 8. The current API still returns both `title` and `display_name`

Live work responses currently include both `title` and `display_name`.

Impact on this repo:

- existing downstream code using `title` still works
- we should still normalize defensively in case OpenAlex eventually removes one duplicate

Recommended response:

- set `record["title"] = record.get("title") or record.get("display_name") or " "`

### 9. The docs and runtime are not perfectly aligned

Two examples from today:

- docs emphasize API keys, but unauthenticated requests still returned `200` from this environment
- docs describe `per_page` max 100, but the live API still accepted `per_page=200`

That should not reassure us. It means undocumented compatibility still exists, not that it is safe to rely on.

Recommended response:

- code to the documented contract, not the current accidental tolerance

## Current PyAlex state

### Version and maintenance surface

PyAlex on PyPI is currently at 0.21, uploaded 2026-02-23.

From its current package metadata/README, PyAlex supports:

- API key configuration
- select fields
- sample
- pagination
- OR filters
- search filters
- semantic search

It also still exposes deprecated OpenAlex surfaces such as:

- Concepts
- N-grams
- older search/filter patterns for compatibility

### What that means for this repo

PyAlex is not the main problem. The current repo problem is that we are translating pasted OpenAlex URLs into PyAlex calls with our own fragile parser.

PyAlex is still viable if:

- we pin it
- we stop letting UI code depend directly on PyAlex query construction
- we put a repository-owned adapter in front of it

### My recommendation on PyAlex

I would not use PyAlex as the primary transport layer for list fetching anymore.

Reasons:

- the app's input contract is raw OpenAlex URLs
- OpenAlex's current API surface is evolving
- URL normalization is simpler and more faithful if we keep requests as HTTP params instead of forcing them through a Python query DSL

Recommended compromise:

- use direct HTTP for list/sampling/pagination
- keep PyAlex temporarily for singleton lookups if convenient
- pin `pyalex>=0.21,<0.22` while that transition is happening

If you want the smallest possible dependency surface, PyAlex can be removed entirely later.

## Proposed migration design

### Goal

Keep all current user-facing behavior:

- same Gradio controls
- same textbox contract: paste OpenAlex URLs
- same semicolon-separated multiple-query input
- same sample-size controls
- same uploaded CSV behavior
- same output plot and downloadable CSV shape

Only the backend fetch layer changes.

### Proposed architecture

#### 1. Add a repository-owned compatibility layer

Create a new module, for example `openalex_client.py`, responsible for:

- auth
- retries and backoff
- current API parameter names
- cursor pagination
- deterministic random sampling
- field selection
- DOI batch resolution
- singleton lookups for query labels

This module should be the only place that knows how to talk to OpenAlex.

#### 2. Add URL normalization before any network call

Create a normalization step, either in a new file like `openalex_query.py` or by refactoring [`openalex_utils.py`](./openalex_utils.py).

Responsibilities:

- accept both `openalex.org/...` and `api.openalex.org/...`
- preserve semicolon-separated multi-query input
- parse query params without lossy comma splitting
- produce a canonical internal representation

Canonicalization rules should include:

- `default.search -> search`
- `host_venue.id -> primary_location.source.id`
- `alternate_host_venues.id -> locations.source.id`
- `per-page -> per_page`
- `api_key` stripped from user input and sourced only from environment

Legacy rules:

- `concepts.id` stays accepted as legacy
- `title_and_abstract.search` stays accepted, but is marked legacy in code and tests

#### 3. Fetch lists through direct HTTP, not PyAlex query objects

Use `requests`, which the repo already depends on.

For list fetching:

- build API URLs/params directly
- use `cursor=*` for deep pagination
- use `select=` to minimize payload
- centralize retry and rate-limit handling

This avoids the current failure mode where a user URL must survive:

`OpenAlex URL -> custom parser -> PyAlex DSL -> OpenAlex request`

Instead it becomes:

`OpenAlex URL -> normalize -> OpenAlex request`

That is simpler and less fragile.

#### 4. Keep output records in the existing shape

The client should normalize each work record into the shape expected by the rest of the app:

- `id`
- `title`
- `doi`
- `publication_year`
- `abstract_inverted_index`
- `primary_location`
- `primary_topic`
- `referenced_works`

Normalization defaults:

- `title = title or display_name or " "`
- `abstract_inverted_index = {}` or `None`, handled safely by existing abstract reconstruction
- `referenced_works = []`
- `primary_location = None`
- `primary_topic = None`

This preserves the interface between the fetch layer and the plotting layer.

## Detailed implementation plan

### Phase 1: auth and transport

Changes:

- replace [`app.py`](./app.py#L197) `pyalex.config.email = ...`
- read `OPENALEX_API_KEY` from the environment
- initialize a shared OpenAlex client with retries, timeout, and a descriptive user agent

If no API key is present:

- local development can warn and continue
- deployed environments should fail loudly at startup

This is the first change I would make because it affects every query path.
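
A minimal sketch of that startup path, assuming `requests` with `urllib3`'s `Retry`. The `SPACE_ID` environment check as a deployment signal is an assumption, and `make_openalex_session`/`require_api_key` are hypothetical names:

```python
import os

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_openalex_session():
    """Shared session: retries with backoff on transient errors, descriptive UA."""
    session = requests.Session()
    retry = Retry(
        total=5,
        backoff_factor=1.5,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=("GET",),
    )
    session.mount("https://", HTTPAdapter(max_retries=retry))
    # Email stays as optional metadata in the UA string, not as the auth mechanism.
    session.headers["User-Agent"] = "openalex-mapper"
    return session


def require_api_key():
    """Warn locally, fail loudly in deployment (SPACE_ID check is an assumption)."""
    key = os.environ.get("OPENALEX_API_KEY")
    if not key:
        if os.environ.get("SPACE_ID"):  # set on Hugging Face Spaces
            raise RuntimeError("OPENALEX_API_KEY is required in deployment")
        print("Warning: OPENALEX_API_KEY not set; continuing for local development")
    return key
```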

### Phase 2: URL normalization

Refactor [`openalex_utils.py`](./openalex_utils.py#L7).

Replace `openalex_url_to_pyalex_query(url)` with something like:

- `parse_openalex_input_url(url) -> ParsedQuery`
- `normalize_openalex_query(parsed) -> CanonicalQuery`

Important note:

The current parser does `query_params['filter'][0].split(',')`.

That is unsafe for any filter value that legitimately contains commas after URL decoding. It is also the wrong foundation for long-term compatibility.

Use either:

- a small filter tokenizer that respects quoted values
- or a normalization strategy that manipulates the raw filter string and only tokenizes when necessary
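
To make the rewrite rules concrete, here is a sketch of the normalization step (`normalize_openalex_url` and `FILTER_KEY_REWRITES` are hypothetical names). It deliberately keeps the naive comma split so the shape stays readable; a production version should swap in the quote-aware tokenizer described above:

```python
from urllib.parse import parse_qs, urlparse

# Legacy -> current filter keys (extend as OpenAlex deprecates more surfaces).
FILTER_KEY_REWRITES = {
    "host_venue.id": "primary_location.source.id",
    "alternate_host_venues.id": "locations.source.id",
}

def normalize_openalex_url(url):
    """Turn a pasted OpenAlex URL into canonical API params.

    Caveat: the comma split below has the same weakness as the current
    parser for filter values containing commas.
    """
    qs = parse_qs(urlparse(url).query)
    params = {}
    clauses = []
    for clause in qs.get("filter", [""])[0].split(","):
        if not clause:
            continue
        key, _, value = clause.partition(":")
        if key == "default.search":
            params["search"] = value  # canonical top-level search
        else:
            clauses.append(f"{FILTER_KEY_REWRITES.get(key, key)}:{value}")
    if clauses:
        params["filter"] = ",".join(clauses)
    if "search" in qs:
        params["search"] = qs["search"][0]
    # api_key is never copied from user input; it comes from the environment.
    return params
```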

### Phase 3: list fetching

Replace the fetch logic in [`app.py`](./app.py#L477).

#### For "All"

- use cursor pagination until exhausted
- stop relying on `query.count()` for control flow

#### For "First n samples"

- use cursor pagination
- stop once `n` records have been collected

This preserves visible behavior while avoiding the 10,000-record page limit.

#### For "n random samples"

Use two modes:

- if `n <= 10000`, use OpenAlex sampling with `sample=n&seed=...`
- if `n > 10000`, use cursor pagination plus deterministic reservoir sampling locally

I would not keep the current "repeat `sample()` with different seeds and dedupe" strategy as the long-term design. It is workable, but it is not the cleanest statistical contract.
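
For the over-10,000 case, deterministic reservoir sampling (Algorithm R) over the cursor stream is only a few lines; seeding `random.Random` keeps repeated runs reproducible:

```python
import random

def reservoir_sample(records, n, seed=42):
    """Uniformly sample n items from an iterator of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < n:
            reservoir.append(record)  # fill the reservoir first
        else:
            # Replace an existing element with probability n / (i + 1)
            j = rng.randrange(i + 1)
            if j < n:
                reservoir[j] = record
    return reservoir
```

Because the input is a generator, this composes directly with a cursor-paginated record stream without holding the full result set in memory beyond the reservoir itself.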

### Phase 4: select only required fields

Current list fetches pull full work records.

The app only needs a subset for the main pipeline. Use `select=` with something close to:

- `id`
- `title`
- `display_name`
- `doi`
- `publication_year`
- `abstract_inverted_index`
- `primary_location`
- `primary_topic`
- `referenced_works`

Benefits:

- less bandwidth
- lower latency
- lower credit usage
- less memory pressure in the HF Space
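
As a sketch, the projection can live in one shared constant so every list fetch uses the same `select=` value (`SELECT_FIELDS` and `with_select` are hypothetical names):

```python
# Fields the main pipeline actually consumes; everything else is dead weight.
SELECT_FIELDS = (
    "id", "title", "display_name", "doi", "publication_year",
    "abstract_inverted_index", "primary_location", "primary_topic",
    "referenced_works",
)

def with_select(params):
    """Return a copy of request params with the select= projection applied."""
    return dict(params, select=",".join(SELECT_FIELDS))
```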

### Phase 5: rewrite readable-name helpers

Refactor [`openalex_utils.py`](./openalex_utils.py#L218).

The current readable-name path should stop assuming:

- `host_venue.id`
- Concept-first terminology
- direct PyAlex singleton calls from arbitrary old filter names

New logic:

- build labels from the normalized query
- resolve singleton IDs through the shared client
- cache author/institution/work lookups in memory

That prevents repeated network calls when the same query name is rendered multiple times.
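
A dict-backed cache with an injectable fetcher is enough for that; `resolve_label` is a hypothetical name and the fetcher stands in for the shared client:

```python
_label_cache = {}

def resolve_label(entity_id, fetch):
    """Resolve an OpenAlex ID to a display name once, then serve from cache.

    `fetch` takes an entity ID and returns the entity's JSON dict.
    """
    if entity_id not in _label_cache:
        entity = fetch(entity_id)
        _label_cache[entity_id] = entity.get("display_name", entity_id)
    return _label_cache[entity_id]
```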

### Phase 6: DOI upload path

Refactor [`openalex_utils.py`](./openalex_utils.py#L196).

The DOI upload flow is still conceptually fine, but it should be moved into the shared client.

Recommended behavior:

- batch DOI OR-filters up to OpenAlex's supported set size
- enforce URL-length limits
- reuse the same normalization, retry, and `select=` logic

This keeps uploaded DOI CSVs working exactly as they do now.
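
A sketch of the batching helper; the batch size and URL-length limits here are placeholders to confirm against current OpenAlex documentation, and `doi_filter_batches` is a hypothetical name:

```python
def doi_filter_batches(dois, batch_size=50, max_url_len=4000):
    """Yield `doi:a|b|...` OR-filter strings within batch-size and URL-length limits.

    Both limits are placeholder assumptions, not documented OpenAlex values.
    """
    batch = []
    for doi in dois:
        candidate = "doi:" + "|".join(batch + [doi])
        if batch and (len(batch) >= batch_size or len(candidate) > max_url_len):
            yield "doi:" + "|".join(batch)  # flush the full batch before appending
            batch = []
        batch.append(doi)
    if batch:
        yield "doi:" + "|".join(batch)
```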

### Phase 7: defensive record normalization

Refactor [`process_records_to_df`](./openalex_utils.py#L98) only lightly.

Keep the function, but make it more explicit about current schema tolerances:

- missing `primary_location`
- missing `source`
- missing `primary_topic`
- missing `title` but present `display_name`
- missing `abstract_inverted_index`

This is not a large migration, just defensive cleanup.
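
Those tolerances can be centralized in one small normalizer applied before records reach `process_records_to_df` (a sketch; `normalize_record` is a hypothetical name):

```python
def normalize_record(raw):
    """Coerce a work record into the shape the plotting pipeline expects."""
    return {
        "id": raw.get("id"),
        # title/display_name are currently duplicated; fall back defensively
        "title": raw.get("title") or raw.get("display_name") or " ",
        "doi": raw.get("doi"),
        "publication_year": raw.get("publication_year"),
        "abstract_inverted_index": raw.get("abstract_inverted_index") or {},
        "primary_location": raw.get("primary_location"),
        "primary_topic": raw.get("primary_topic"),
        "referenced_works": raw.get("referenced_works") or [],
    }
```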

### Phase 8: documentation and examples

Update the user-facing examples in [`app.py`](./app.py#L1265) so that new examples use current canonical OpenAlex URLs.

Important:

- still accept old URLs
- show new URLs in examples

That keeps the interface the same while steering users toward the current API surface.

## Proposed file-level changes

### `app.py`

Change:

- remove direct reliance on PyAlex query objects
- call shared client methods instead
- replace API auth setup

Keep:

- UI
- sampling controls
- plotting logic
- CSV export format

### `openalex_utils.py`

Keep:

- `invert_abstract`
- `process_records_to_df`
- `get_pub`
- `get_field`
- filename/readable-name helpers, but rewritten to use normalized queries

Remove or replace:

- `openalex_url_to_pyalex_query`

### `requirements.txt`

Change:

- pin `pyalex>=0.21,<0.22` if it remains
- otherwise remove `pyalex`

Keep:

- `requests`

Optional:

- add `httpx` only if you want async or more structured retry middleware; it is not necessary for this migration

## What should not change

To satisfy "keep the interface working as is", I would explicitly preserve:

- the textbox accepting pasted OpenAlex URLs
- semicolon-separated multi-query input
- the "Reduce Sample Size" flow
- the "First n samples" and "n random samples" options
- the CSV upload flow
- the downloadable CSV schema
- the use of `primary_topic` for field labels and coloring

## Edge cases to support

The migration should explicitly test and support:

- old website URLs using `default.search`
- old website URLs using `host_venue.id`
- old website URLs using `concepts.id`
- current URLs using `search=...`
- semicolon-separated multiple URLs
- DOI CSV uploads
- large queries over 10,000 results
- random samples over 10,000 requested records
- filters with OR values using `|`
- filters with year ranges

## Testing plan

### Unit tests

Add tests for normalization:

- `default.search` becomes canonical `search`
- `host_venue.id` rewrites to `primary_location.source.id`
- `alternate_host_venues.id` rewrites to `locations.source.id`
- `concepts.id` is preserved but flagged as legacy
- `per-page` becomes `per_page`

Add tests for record normalization:

- missing `title`
- missing `primary_location`
- missing `primary_topic`
- missing `abstract_inverted_index`

### Integration tests

Use mocked HTTP responses or recorded fixtures for:

- cursor pagination
- random sampling
- DOI batch resolution
- singleton lookups for readable labels

### Live smoke tests

Run a small set of real queries against OpenAlex:

- a search query
- an institution-year filter
- a citation query
- a legacy `host_venue.id` query that must now succeed via rewrite

## Rollout sequence

Recommended order:

1. Add the new client and auth handling.
2. Add URL normalization and tests.
3. Swap main fetch paths in `app.py`.
4. Swap DOI upload path.
5. Rewrite readable-name generation.
6. Update examples and docs.
7. Pin or remove PyAlex.

This minimizes risk because the UI and plotting code remain untouched until the data layer is stable.

## Risk assessment

### Low risk

- auth migration to API key
- cursor pagination
- field selection
- defensive record normalization

### Medium risk

- URL normalization, because old OpenAlex website URLs are part of the app's public contract

### High-risk area if handled incorrectly

- automatic conversion from Concepts to Topics

I would not promise perfect automatic Concept-to-Topic migration. Legacy Concept URLs should remain supported as long as OpenAlex still accepts them. If OpenAlex removes them later, that should become a clear user-facing compatibility warning, not a silent semantic rewrite.

## Final recommendation

Do not try to "patch" the current `openalex_url_to_pyalex_query()` approach into compliance.

That function is the wrong abstraction now. The app's input is an OpenAlex URL, and the safest way to preserve the current interface is:

- normalize that URL
- call the OpenAlex API directly
- keep the returned records in the same downstream shape

PyAlex 0.21 is still useful, but it should no longer define the repository's transport contract.

## Sources

- OpenAlex API overview: https://developers.openalex.org/api-reference/works/list-works
- OpenAlex works filters/search fields: https://developers.openalex.org/api-reference/works/list-works
- OpenAlex authentication and pricing: https://developers.openalex.org/getting-started/api-overview
- OpenAlex LLM quick reference: https://developers.openalex.org/api-guide-for-llms
- OpenAlex pagination guide: https://developers.openalex.org/how-to-use-the-api/get-lists-of-entities/page-through-results
- OpenAlex select-fields guide: https://developers.openalex.org/how-to-use-the-api/get-lists-of-entities/select-fields
- OpenAlex deprecations: https://developers.openalex.org/guides/deprecations
- PyAlex package metadata: https://pypi.org/project/pyalex/
- Live OpenAlex host_venue failure check: https://api.openalex.org/works?filter=host_venue.id:S125754415&per_page=1
- Live OpenAlex replacement filter check: https://api.openalex.org/works?filter=primary_location.source.id:S125754415&per_page=1
- Live OpenAlex pagination failure check: https://api.openalex.org/works?filter=publication_year:2024&page=101&per_page=100
app.py CHANGED
@@ -5,6 +5,9 @@ print(f"Starting up: {time.strftime('%Y-%m-%d %H:%M:%S')}")
 # Standard library imports
 
 import os
 
 
 
 
 #Enforce local cching:
 
@@ -78,8 +81,6 @@ import colormaps
 import matplotlib.colors as mcolors
 from matplotlib.colors import Normalize
 
-import random
-
 import opinionated # for fonts
 plt.style.use("opinionated_rc")
 
@@ -159,11 +160,10 @@ def _get_token(request: gr.Request):
 #print(f"Spaces version: {spaces.__version__}")
 
 import datamapplot
-import pyalex
 
 # Local imports
 from openalex_utils import (
-    openalex_url_to_pyalex_query,
     get_field,
     process_records_to_df,
     openalex_url_to_filename,
@@ -195,7 +195,7 @@ except ImportError:
 
 
 # Configure OpenAlex
-pyalex.config.email = "maximilian.noichl@uni-bamberg.de"
 
 print(f"Imports completed: {time.strftime('%Y-%m-%d %H:%M:%S')}")
 
@@ -466,209 +466,61 @@ def predict(request: gr.Request, text_input, sample_size_slider, reduce_sample_c
     urls = [url.strip() for url in text_input.split(';')]
     records = []
     query_indices = []  # Track which query each record comes from
-    total_query_length = 0
-    expected_download_count = 0  # Track expected number of records to download for progress
 
     # Use first URL for filename
-    first_query, first_params = openalex_url_to_pyalex_query(urls[0])
     filename = openalex_url_to_filename(urls[0])
     print(f"Filename: (unknown)")
 
     # Process each URL
-    for i, url in enumerate(urls):
-        query, params = openalex_url_to_pyalex_query(url)
-        query_length = query.count()
-        total_query_length += query_length
-
-        # Calculate expected download count for this query
-        if reduce_sample_checkbox and sample_reduction_method == "First n samples":
-            expected_for_this_query = min(sample_size_slider, query_length)
-        elif reduce_sample_checkbox and sample_reduction_method == "n random samples":
-            expected_for_this_query = min(sample_size_slider, query_length)
-        else:  # "All"
-            expected_for_this_query = query_length
-
-        expected_download_count += expected_for_this_query
-        print(f'Requesting {query_length} entries from query {i+1}/{len(urls)} (expecting to download {expected_for_this_query})...')
-
-        # Use PyAlex sampling for random samples - much more efficient!
-        if reduce_sample_checkbox and sample_reduction_method == "n random samples":
-            # Use PyAlex's built-in sample method for efficient server-side sampling
-            target_size = min(sample_size_slider, query_length)
-            try:
-                seed_int = int(seed_value) if seed_value.strip() else 42
-            except ValueError:
-                seed_int = 42
-                print(f"Invalid seed value '{seed_value}', using default: 42")
-
-            print(f'Attempting PyAlex sampling: {target_size} from {query_length} (seed={seed_int})')
-
-            try:
-                # Check if PyAlex sample method exists and works
-                if hasattr(query, 'sample'):
-                    sampled_records = []
-                    seen_ids = set()  # Track IDs to avoid duplicates
-
-                    # If target_size > 10k, do batched sampling
-                    if target_size > 10000:
-                        batch_size = 9998  # Use 9998 to stay safely under 10k limit
-                        remaining = target_size
-                        batch_num = 1
-
-                        print(f'Target size {target_size} > 10k, using batched sampling with batch size {batch_size}')
-
-                        while remaining > 0 and len(sampled_records) < target_size:
-                            current_batch_size = min(batch_size, remaining)
-                            batch_seed = seed_int + batch_num  # Different seed for each batch
-
-                            print(f'Batch {batch_num}: requesting {current_batch_size} samples (seed={batch_seed})')
-
-                            # Sample this batch
-                            batch_query = query.sample(current_batch_size, seed=batch_seed)
-
-                            batch_records = []
-                            batch_count = 0
-                            for page in batch_query.paginate(per_page=200, method='page', n_max=None):
-                                for record in page:
-                                    # Check for duplicates using OpenAlex ID
-                                    record_id = record.get('id', '')
-                                    if record_id not in seen_ids:
-                                        seen_ids.add(record_id)
-                                        batch_records.append(record)
-                                        batch_count += 1
-
-                            sampled_records.extend(batch_records)
-                            remaining -= len(batch_records)
-                            batch_num += 1
-
-                            print(f'Batch {batch_num-1} complete: got {len(batch_records)} unique records ({len(sampled_records)}/{target_size} total)')
-
-                            progress(0.1 + (0.15 * len(sampled_records) / target_size),
-                                     desc=f"Batched sampling from query {i+1}/{len(urls)}... ({len(sampled_records)}/{target_size})")
-
-                            # Safety check to avoid infinite loops
-                            if batch_num > 20:  # Max 20 batches (should handle up to ~200k samples)
-                                print("Warning: Maximum batch limit reached, stopping sampling")
-                                break
-                    else:
-                        # Single batch sampling for <= 10k
-                        sampled_query = query.sample(target_size, seed=seed_int)
-
-                        records_count = 0
-                        for page in sampled_query.paginate(per_page=200, method='page', n_max=None):
-                            for record in page:
-                                sampled_records.append(record)
-                                records_count += 1
-                                progress(0.1 + (0.15 * records_count / target_size),
-                                         desc=f"Getting sampled data from query {i+1}/{len(urls)}... ({records_count}/{target_size})")
-
-                    print(f'PyAlex sampling successful: got {len(sampled_records)} records (requested {target_size})')
-                else:
-                    raise AttributeError("sample method not available")
-
-            except Exception as e:
-                print(f"PyAlex sampling failed ({e}), using fallback method...")
-
-                # Fallback: get all records and sample manually
-                all_records = []
-                records_count = 0
-
-                # Use page pagination for fallback method
-                for page in query.paginate(per_page=200, method='page', n_max=None):
-                    for record in page:
-                        all_records.append(record)
-                        records_count += 1
-                        progress(0.1 + (0.15 * records_count / query_length),
-                                 desc=f"Downloading for sampling from query {i+1}/{len(urls)}...")
-
-                # Now sample manually
-                if len(all_records) > target_size:
-                    import random
-                    random.seed(seed_int)
-                    sampled_records = random.sample(all_records, target_size)
-                else:
-                    sampled_records = all_records
-
-                print(f'Fallback sampling: got {len(sampled_records)} from {len(all_records)} total')
-
-            # Add the sampled records
-            for idx, record in enumerate(sampled_records):
-                records.append(record)
-                query_indices.append(i)
-                # Safe progress calculation
-                if expected_download_count > 0:
-                    progress_val = 0.1 + (0.2 * len(records) / expected_download_count)
-                else:
-                    progress_val = 0.1
-                progress(progress_val, desc=f"Processing sampled data from query {i+1}/{len(urls)}...")
-        else:
-            # Keep existing logic for "First n samples" and "All"
-            target_size = sample_size_slider if reduce_sample_checkbox and sample_reduction_method == "First n samples" else query_length
-            records_per_query = 0
-
-            print(f"Query {i+1}: target_size={target_size}, query_length={query_length}, method={sample_reduction_method}")
-
-            should_break_current_query = False
-            # For "First n samples", limit the maximum records fetched to avoid over-downloading
-            max_records_to_fetch = target_size if reduce_sample_checkbox and sample_reduction_method == "First n samples" else None
-            for page in query.paginate(per_page=200, method='page', n_max=max_records_to_fetch):
-                # Add retry mechanism for processing each page
-                max_retries = 5
-                base_wait_time = 1  # Starting wait time in seconds
-                exponent = 1.5  # Exponential factor
619
-
620
- for retry_attempt in range(max_retries):
621
- try:
622
- for record in page:
623
- # Safety check: don't process if we've already reached target
624
- if reduce_sample_checkbox and sample_reduction_method == "First n samples" and records_per_query >= target_size:
625
- print(f"Reached target size before processing: {records_per_query}/{target_size}, breaking from download")
626
- should_break_current_query = True
627
- break
628
-
629
- records.append(record)
630
- query_indices.append(i) # Track which query this record comes from
631
- records_per_query += 1
632
- # Safe progress calculation
633
- if expected_download_count > 0:
634
- progress_val = 0.1 + (0.2 * len(records) / expected_download_count)
635
- else:
636
- progress_val = 0.1
637
- progress(progress_val, desc=f"Getting data from query {i+1}/{len(urls)}...")
638
-
639
- if reduce_sample_checkbox and sample_reduction_method == "First n samples" and records_per_query >= target_size:
640
- print(f"Reached target size: {records_per_query}/{target_size}, breaking from download")
641
- should_break_current_query = True
642
- break
643
- # If we get here without an exception, break the retry loop
644
- break
645
- except Exception as e:
646
- print(f"Error processing page: {e}")
647
- if retry_attempt < max_retries - 1:
648
- wait_time = base_wait_time * (exponent ** retry_attempt) + random.random()
649
- print(f"Retrying in {wait_time:.2f} seconds (attempt {retry_attempt + 1}/{max_retries})...")
650
- time.sleep(wait_time)
651
- else:
652
- print(f"Maximum retries reached. Continuing with next page.")
653
-
654
- # Break out of retry loop if we've reached target
655
- if should_break_current_query:
656
- break
657
-
658
- if should_break_current_query:
659
- print(f"Successfully downloaded target size for query {i+1}, moving to next query")
660
- # Continue to next query instead of breaking the entire query loop
661
- continue
662
- # Continue to next query - don't break out of the main query loop
663
  print(f"Query completed in {time.time() - start_time:.2f} seconds")
664
  print(f"Total records collected: {len(records)}")
665
- print(f"Expected to download: {expected_download_count}")
666
- print(f"Available from all queries: {total_query_length}")
667
  print(f"Sample method used: {sample_reduction_method}")
668
  print(f"Reduce sample enabled: {reduce_sample_checkbox}")
669
  if sample_reduction_method == "n random samples":
670
  print(f"Seed value: {seed_value}")
671
 
 
 
 
 
  # Process records
  processing_start = time.time()
  records_df = process_records_to_df(records)
@@ -678,7 +530,7 @@ def predict(request: gr.Request, text_input, sample_size_slider, reduce_sample_c
 
 
  if reduce_sample_checkbox and sample_reduction_method != "All" and sample_reduction_method != "n random samples":
- # Note: We skip "n random samples" here because PyAlex sampling is already done above
  sample_size = min(sample_size_slider, len(records_df))
 
  # Check if we have multiple queries for sampling logic
  # Standard library imports
 
  import os
+ from config_loader import load_local_config
+
+ load_local_config()
 
  #Enforce local cching:
 
 
  import matplotlib.colors as mcolors
  from matplotlib.colors import Normalize
 
  import opinionated # for fonts
  plt.style.use("opinionated_rc")
 
  #print(f"Spaces version: {spaces.__version__}")
 
  import datamapplot
 
  # Local imports
+ from openalex_client import get_openalex_client, normalize_openalex_url
  from openalex_utils import (
  get_field,
  process_records_to_df,
  openalex_url_to_filename,
 
 
  # Configure OpenAlex
+ openalex_client = get_openalex_client(require_api_key=is_running_in_hf_space())
 
  print(f"Imports completed: {time.strftime('%Y-%m-%d %H:%M:%S')}")
 
  urls = [url.strip() for url in text_input.split(';')]
  records = []
  query_indices = [] # Track which query each record comes from
 
  # Use first URL for filename
  filename = openalex_url_to_filename(urls[0])
  print(f"Filename: (unknown)")
 
  # Process each URL
+ try:
+ for i, url in enumerate(urls):
+ query = normalize_openalex_url(url)
+ if query.entity != "works":
+ raise ValueError("Only OpenAlex work queries are supported.")
+
+ progress_base = 0.1 + (0.2 * i / max(1, len(urls)))
+
+ if reduce_sample_checkbox and sample_reduction_method == "n random samples":
+ try:
+ seed_int = int(seed_value) if seed_value.strip() else 42
+ except ValueError:
+ seed_int = 42
+ print(f"Invalid seed value '{seed_value}', using default: 42")
+
+ progress(progress_base, desc=f"Sampling query {i+1}/{len(urls)}...")
+ query_records = openalex_client.fetch_sampled_works(
+ query,
+ sample_size=sample_size_slider,
+ seed=seed_int,
+ )
+ print(f"Query {i+1}: sampled {len(query_records)} records (seed={seed_int})")
+ else:
+ target_size = sample_size_slider if reduce_sample_checkbox and sample_reduction_method == "First n samples" else None
+ query_desc = f"Downloading query {i+1}/{len(urls)}..."
+ if target_size is not None:
+ query_desc = f"Downloading first {target_size} records from query {i+1}/{len(urls)}..."
+ progress(progress_base, desc=query_desc)
+ query_records = openalex_client.fetch_works(query, limit=target_size)
+ print(f"Query {i+1}: fetched {len(query_records)} records")
+
+ records.extend(query_records)
+ query_indices.extend([i] * len(query_records))
+ progress(0.1 + (0.2 * (i + 1) / max(1, len(urls))), desc=f"Finished query {i+1}/{len(urls)}")
+ except Exception as e:
+ error_message = f"Error downloading data from OpenAlex: {str(e)}"
+ return create_error_response(error_message)
+
  print(f"Query completed in {time.time() - start_time:.2f} seconds")
  print(f"Total records collected: {len(records)}")
  print(f"Sample method used: {sample_reduction_method}")
  print(f"Reduce sample enabled: {reduce_sample_checkbox}")
  if sample_reduction_method == "n random samples":
  print(f"Seed value: {seed_value}")
 
+ if not records:
+ error_message = "Error: OpenAlex returned no records for the provided query."
+ return create_error_response(error_message)
+
  # Process records
  processing_start = time.time()
  records_df = process_records_to_df(records)
 
 
  if reduce_sample_checkbox and sample_reduction_method != "All" and sample_reduction_method != "n random samples":
+ # Random sampling is already handled in the OpenAlex fetch layer above.
  sample_size = min(sample_size_slider, len(records_df))
 
  # Check if we have multiple queries for sampling logic
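The new download path in `app.py` delegates to `openalex_client`, which walks OpenAlex results with cursor pagination: the first request sends `cursor=*`, and each response's `meta.next_cursor` is fed into the next request until it is `None`. A minimal self-contained sketch of that loop, with a stubbed `fetch_page` standing in for the real HTTP call (the page contents here are made up for illustration):

```python
def fetch_page(cursor):
    """Stand-in for a GET /works request; returns (results, next_cursor)."""
    pages = {
        "*": ([{"id": "W1"}, {"id": "W2"}], "c2"),
        "c2": ([{"id": "W3"}], None),
    }
    return pages[cursor]

def iter_all(limit=None):
    """Follow cursors until exhausted, optionally stopping early at `limit`."""
    cursor, fetched = "*", 0
    while cursor is not None:
        results, cursor = fetch_page(cursor)
        for record in results:
            yield record
            fetched += 1
            if limit is not None and fetched >= limit:
                return

print([r["id"] for r in iter_all()])         # ['W1', 'W2', 'W3']
print([r["id"] for r in iter_all(limit=2)])  # ['W1', 'W2']
```

The `limit` short-circuit mirrors how "First n samples" avoids over-downloading without a separate break flag.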
config_loader.py ADDED
@@ -0,0 +1,34 @@
+ import json
+ import os
+ from functools import lru_cache
+ from pathlib import Path
+
+
+ DEFAULT_CONFIG_FILES = (
+ "openalex_config.local.json",
+ )
+
+
+ @lru_cache(maxsize=1)
+ def load_local_config():
+ """Load local config values without overriding existing environment variables."""
+ config_file = os.environ.get("OPENALEX_CONFIG_FILE")
+ candidate_paths = [Path(config_file)] if config_file else [Path(name) for name in DEFAULT_CONFIG_FILES]
+
+ loaded = {}
+ for path in candidate_paths:
+ if not path.exists():
+ continue
+
+ with path.open("r", encoding="utf-8") as handle:
+ loaded = json.load(handle)
+
+ if not isinstance(loaded, dict):
+ raise ValueError(f"Config file {path} must contain a JSON object at the top level.")
+
+ for key, value in loaded.items():
+ if key not in os.environ and value is not None:
+ os.environ[key] = str(value)
+ break
+
+ return loaded
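The merge rule in `load_local_config` is "environment wins": a file value is only copied into `os.environ` when the variable is not already set. A self-contained sketch of that rule, using a plain dict in place of `os.environ` (the `OPENALEX_MAILTO` key is a made-up example, not part of the config schema):

```python
def merge_config(environ, loaded):
    """Copy config values into `environ` without overriding existing keys."""
    for key, value in loaded.items():
        if key not in environ and value is not None:
            environ[key] = str(value)
    return environ

env = {"OPENALEX_API_KEY": "from-env"}
merge_config(env, {"OPENALEX_API_KEY": "from-file", "OPENALEX_MAILTO": "me@example.org"})
print(env["OPENALEX_API_KEY"])  # "from-env" — the existing variable wins
print(env["OPENALEX_MAILTO"])   # "me@example.org" — missing key is filled in
```

This is why a Hugging Face Space secret takes precedence over `openalex_config.local.json`.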
openalex_client.py ADDED
@@ -0,0 +1,369 @@
+ import os
+ import random
+ import re
+ import time
+ from dataclasses import dataclass, field
+ from functools import lru_cache
+ from typing import Iterable
+ from urllib.parse import parse_qs, urlparse
+
+ import requests
+
+ from config_loader import load_local_config
+
+
+ DEFAULT_BASE_URL = "https://api.openalex.org"
+ DEFAULT_PER_PAGE = 100
+ DEFAULT_TIMEOUT = 30
+ DEFAULT_RETRIES = 5
+ DEFAULT_SELECT_FIELDS = (
+ "id",
+ "title",
+ "display_name",
+ "doi",
+ "publication_year",
+ "abstract_inverted_index",
+ "primary_location",
+ "primary_topic",
+ "referenced_works",
+ )
+ IGNORED_INPUT_PARAMS = {
+ "api_key",
+ "cursor",
+ "page",
+ "per-page",
+ "per_page",
+ "sample",
+ "seed",
+ "select",
+ }
+ OPENALEX_ID_RE = re.compile(r"^[A-Za-z]\d+$")
+
+
+ @dataclass(frozen=True)
+ class FilterToken:
+ key: str
+ value: str
+
+
+ @dataclass
+ class OpenAlexQuery:
+ entity: str = "works"
+ params: dict[str, str] = field(default_factory=dict)
+ filter_tokens: list[FilterToken] = field(default_factory=list)
+ legacy_filters: list[str] = field(default_factory=list)
+
+ def as_params(self, select_fields: Iterable[str] | None = None, extra_params: dict[str, str] | None = None):
+ params = dict(self.params)
+ if self.filter_tokens:
+ params["filter"] = ",".join(f"{token.key}:{token.value}" for token in self.filter_tokens)
+ if select_fields:
+ params["select"] = ",".join(select_fields)
+ if extra_params:
+ params.update({key: str(value) for key, value in extra_params.items()})
+ return params
+
+ def without_params(self, *keys: str):
+ keys_to_remove = set(keys)
+ return OpenAlexQuery(
+ entity=self.entity,
+ params={key: value for key, value in self.params.items() if key not in keys_to_remove},
+ filter_tokens=list(self.filter_tokens),
+ legacy_filters=list(self.legacy_filters),
+ )
+
+
+ def _normalize_url_input(url: str):
+ url = url.strip()
+ if "://" not in url and url.startswith(("openalex.org/", "api.openalex.org/")):
+ return f"https://{url}"
+ return url
+
+
+ def _split_filter_string(filter_value: str):
+ tokens = []
+ current = []
+ quote = None
+
+ for char in filter_value:
+ if char in {"'", '"'}:
+ if quote == char:
+ quote = None
+ elif quote is None:
+ quote = char
+ current.append(char)
+ continue
+
+ if char == "," and quote is None:
+ token = "".join(current).strip()
+ if token:
+ tokens.append(token)
+ current = []
+ continue
+
+ current.append(char)
+
+ token = "".join(current).strip()
+ if token:
+ tokens.append(token)
+ return tokens
+
+
+ def _normalize_filter_token(key: str, value: str):
+ legacy_key = None
+
+ if key == "default.search":
+ return "__search__", value, key
+
+ if key == "host_venue.id":
+ return "primary_location.source.id", value, key
+
+ if key.startswith("host_venue."):
+ return key.replace("host_venue.", "primary_location.source.", 1), value, key
+
+ if key == "alternate_host_venues.id":
+ return "locations.source.id", value, key
+
+ if key.startswith("alternate_host_venues."):
+ return key.replace("alternate_host_venues.", "locations.source.", 1), value, key
+
+ if key.startswith("x_concepts"):
+ return key.replace("x_concepts", "concepts", 1), value, key
+
+ return key, value, legacy_key
+
+
+ def normalize_openalex_url(url: str):
+ url = _normalize_url_input(url)
+ parsed_url = urlparse(url)
+ path_parts = [part for part in parsed_url.path.split("/") if part]
+ entity = path_parts[0] if path_parts else "works"
+
+ query_params = parse_qs(parsed_url.query, keep_blank_values=True)
+
+ params = {}
+ legacy_filters = []
+ filter_tokens = []
+ search_value = query_params.get("search", [None])[0]
+
+ for raw_filter in query_params.get("filter", []):
+ for token in _split_filter_string(raw_filter):
+ key, sep, value = token.partition(":")
+ if not sep:
+ continue
+
+ key = key.strip()
+ value = value.strip()
+ key, value, legacy_key = _normalize_filter_token(key, value)
+ if legacy_key:
+ legacy_filters.append(legacy_key)
+
+ if key == "__search__":
+ if search_value is None:
+ search_value = value
+ else:
+ filter_tokens.append(FilterToken("default.search", value))
+ continue
+
+ filter_tokens.append(FilterToken(key, value))
+
+ if search_value:
+ params["search"] = search_value
+
+ for key, values in query_params.items():
+ if not values or key in {"filter", "search"} or key in IGNORED_INPUT_PARAMS:
+ continue
+ normalized_key = "per_page" if key == "per-page" else key
+ params[normalized_key] = values[0]
+
+ return OpenAlexQuery(
+ entity=entity,
+ params=params,
+ filter_tokens=filter_tokens,
+ legacy_filters=legacy_filters,
+ )
+
+
+ def _normalize_openalex_id(entity_id: str):
+ entity_id = entity_id.strip()
+ if entity_id.startswith("https://openalex.org/"):
+ entity_id = entity_id.rstrip("/").split("/")[-1]
+ if OPENALEX_ID_RE.match(entity_id):
+ return entity_id[0].upper() + entity_id[1:]
+ return entity_id
+
+
+ class OpenAlexClient:
+ def __init__(
+ self,
+ api_key=None,
+ base_url=DEFAULT_BASE_URL,
+ timeout=DEFAULT_TIMEOUT,
+ max_retries=DEFAULT_RETRIES,
+ ):
+ self.api_key = api_key
+ self.base_url = base_url.rstrip("/")
+ self.timeout = timeout
+ self.max_retries = max_retries
+ self.session = requests.Session()
+ self.session.headers.update(
+ {
+ "User-Agent": "OpenAlexMapper/1.0 (+https://huggingface.co/spaces/m7n/openalex_mapper)",
+ "Accept": "application/json",
+ }
+ )
+ self._entity_cache = {}
+
+ @classmethod
+ def from_env(cls, require_api_key=False):
+ load_local_config()
+ api_key = os.environ.get("OPENALEX_API_KEY")
+ if api_key:
+ api_key = api_key.strip()
+ if require_api_key and not api_key:
+ raise RuntimeError(
+ "OPENALEX_API_KEY is required. Set it as a Hugging Face Space secret or in openalex_config.local.json."
+ )
+ return cls(api_key=api_key or None)
+
+ def _request_json(self, path, params=None):
+ params = dict(params or {})
+ if self.api_key:
+ params["api_key"] = self.api_key
+
+ url = f"{self.base_url}/{path.lstrip('/')}"
+ last_error = None
+
+ for attempt in range(self.max_retries):
+ response = None
+ try:
+ response = self.session.get(url, params=params, timeout=self.timeout)
+ if response.status_code in {429, 500, 502, 503, 504}:
+ retry_after = response.headers.get("Retry-After")
+ wait_time = float(retry_after) if retry_after else (2 ** attempt)
+ time.sleep(wait_time)
+ continue
+ response.raise_for_status()
+ return response.json()
+ except requests.RequestException as exc:
+ last_error = exc
+ if attempt == self.max_retries - 1:
+ break
+ time.sleep(2 ** attempt)
+
+ if response is not None:
+ try:
+ payload = response.json()
+ except ValueError:
+ payload = response.text
+ raise RuntimeError(f"OpenAlex request failed for {url}: {payload}") from last_error
+ raise RuntimeError(f"OpenAlex request failed for {url}: {last_error}") from last_error
+
+ def get_entity(self, entity, entity_id, select_fields=None):
+ normalized_id = _normalize_openalex_id(entity_id)
+ cache_key = (entity, normalized_id, tuple(select_fields or ()))
+ if cache_key in self._entity_cache:
+ return self._entity_cache[cache_key]
+
+ payload = self._request_json(
+ f"{entity}/{normalized_id}",
+ params={"select": ",".join(select_fields)} if select_fields else None,
+ )
+ self._entity_cache[cache_key] = payload
+ return payload
+
+ def count(self, query):
+ payload = self._request_json(
+ query.entity,
+ params=query.as_params(select_fields=("id",), extra_params={"per_page": 1}),
+ )
+ return int(payload.get("meta", {}).get("count") or 0)
+
+ def _normalize_work_record(self, record):
+ normalized = dict(record)
+ normalized["title"] = normalized.get("title") or normalized.get("display_name") or " "
+ normalized.setdefault("abstract_inverted_index", None)
+ normalized.setdefault("primary_location", None)
+ normalized.setdefault("primary_topic", None)
+ normalized.setdefault("referenced_works", [])
+ return normalized
+
+ def iter_works(self, query, limit=None, extra_params=None, per_page=DEFAULT_PER_PAGE):
+ params = query.as_params(select_fields=DEFAULT_SELECT_FIELDS, extra_params=extra_params)
+ params["cursor"] = params.get("cursor", "*")
+
+ fetched = 0
+ while True:
+ current_per_page = per_page
+ if limit is not None:
+ remaining = limit - fetched
+ if remaining <= 0:
+ break
+ current_per_page = min(current_per_page, remaining)
+ params["per_page"] = current_per_page
+
+ payload = self._request_json(query.entity, params=params)
+ results = payload.get("results", [])
+ if not results:
+ break
+
+ for record in results:
+ yield self._normalize_work_record(record)
+ fetched += 1
+ if limit is not None and fetched >= limit:
+ return
+
+ next_cursor = payload.get("meta", {}).get("next_cursor")
+ if next_cursor is None:
+ break
+ params["cursor"] = next_cursor
+
+ def fetch_works(self, query, limit=None):
+ return list(self.iter_works(query, limit=limit))
+
+ def fetch_sampled_works(self, query, sample_size, seed):
+ sampling_query = query.without_params("sort")
+ if sample_size <= 10000:
+ return list(
+ self.iter_works(
+ sampling_query,
+ limit=sample_size,
+ extra_params={"sample": sample_size, "seed": seed},
+ )
+ )
+ return self.reservoir_sample_works(sampling_query, sample_size, seed)
+
+ def reservoir_sample_works(self, query, sample_size, seed):
+ rng = random.Random(seed)
+ reservoir = []
+
+ for index, record in enumerate(self.iter_works(query)):
+ if index < sample_size:
+ reservoir.append(record)
+ continue
+
+ sample_index = rng.randint(0, index)
+ if sample_index < sample_size:
+ reservoir[sample_index] = record
+
+ return reservoir
+
+ def fetch_records_from_dois(self, doi_list, block_size=50):
+ all_records = []
+ clean_dois = [doi.strip() for doi in doi_list if isinstance(doi, str) and doi.strip()]
+
+ for start in range(0, len(clean_dois), block_size):
+ sublist = clean_dois[start : start + block_size]
+ doi_filter = "|".join(sublist)
+ query = OpenAlexQuery(
+ entity="works",
+ filter_tokens=[FilterToken("doi", doi_filter)],
+ )
+ all_records.extend(self.fetch_works(query, limit=len(sublist)))
+
+ return all_records
+
+
+ @lru_cache(maxsize=1)
+ def get_openalex_client(require_api_key=False):
+ return OpenAlexClient.from_env(require_api_key=require_api_key)
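For samples larger than the 10,000-record ceiling on the API's `sample` parameter, `fetch_sampled_works` falls back to `reservoir_sample_works`, a single-pass reservoir sample (Algorithm R). The core idea can be sketched on plain integers, independent of the client:

```python
import random

def reservoir_sample(stream, sample_size, seed):
    """Uniform fixed-size sample from a stream of unknown length (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < sample_size:
            # Fill the reservoir with the first `sample_size` items.
            reservoir.append(item)
            continue
        # Replace a random slot with probability sample_size / (index + 1).
        slot = rng.randint(0, index)
        if slot < sample_size:
            reservoir[slot] = item
    return reservoir

sample = reservoir_sample(range(100_000), sample_size=5, seed=42)
print(len(sample))  # 5
```

Because the stream is consumed once, the fallback still downloads every matching record; the fixed seed just makes the selection reproducible, matching the seed semantics of the small-sample path.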
openalex_config.example.json ADDED
@@ -0,0 +1,3 @@
+ {
+ "OPENALEX_API_KEY": "replace-with-your-openalex-api-key"
+ }
openalex_utils.py CHANGED
@@ -1,387 +1,232 @@
  import numpy as np
- from urllib.parse import urlparse, parse_qs
- from pyalex import Works, Authors, Institutions
  import pandas as pd
- import ast, json
-
- def openalex_url_to_pyalex_query(url):
- """
- Convert an OpenAlex search URL to a pyalex query.
-
- Args:
- url (str): The OpenAlex search URL.
-
- Returns:
- tuple: (Works object, dict of parameters)
- """
- parsed_url = urlparse(url)
- query_params = parse_qs(parsed_url.query)
-
- # Initialize the Works object
- query = Works()
-
- # Handle filters
- if 'filter' in query_params:
- filters = query_params['filter'][0].split(',')
- for f in filters:
- if ':' in f:
- key, value = f.split(':', 1)
- if key == 'default.search':
- query = query.search(value)
- else:
- query = query.filter(**{key: value})
-
- # Handle sort - Fixed to properly handle field:direction format
- if 'sort' in query_params:
- sort_params = query_params['sort'][0].split(',')
- for s in sort_params:
- if ':' in s: # Handle field:direction format
- field, direction = s.split(':')
- query = query.sort(**{field: direction})
- elif s.startswith('-'): # Handle -field format
- query = query.sort(**{s[1:]: 'desc'})
- else: # Handle field format
- query = query.sort(**{s: 'asc'})
-
- # Handle other parameters
- params = {}
- for key in ['page', 'per-page', 'sample', 'seed']:
- if key in query_params:
- params[key] = query_params[key][0]
-
- return query, params
 
- def invert_abstract(inv_index):
- """Reconstruct abstract from OpenAlex' inverted-index.
 
- Handles dicts, JSON / repr strings, or missing values gracefully.
- """
- # Try to coerce a string into a Python object first
  if isinstance(inv_index, str):
  try:
- inv_index = json.loads(inv_index) # double-quoted JSON
  except Exception:
  try:
- inv_index = ast.literal_eval(inv_index) # single-quoted repr
  except Exception:
  inv_index = None
 
  if isinstance(inv_index, dict):
- l_inv = [(w, p) for w, pos in inv_index.items() for p in pos]
- return " ".join(w for w, _ in sorted(l_inv, key=lambda x: x[1]))
- else:
- return " "
-
-
  def get_pub(x):
  """Extract publication name from record."""
- try:
- source = x['source']['display_name']
- if source not in ['parsed_publication','Deleted Journal']:
  return source
- else:
- return ' '
- except:
- return ' '
 
  def get_field(x):
  """Extract academic field from record."""
  try:
- field = x['primary_topic']['subfield']['display_name']
  if field is not None:
  return field
- else:
- return np.nan
- except:
  return np.nan
 
  def process_records_to_df(records):
- """
- Convert OpenAlex records to a pandas DataFrame with processed fields.
- Can handle either raw OpenAlex records or an existing DataFrame.
-
- Args:
- records (list or pd.DataFrame): List of OpenAlex record dictionaries or existing DataFrame
-
- Returns:
- pandas.DataFrame: Processed DataFrame with abstracts, publications, and titles
- """
- # If records is already a DataFrame, use it directly
  if isinstance(records, pd.DataFrame):
  records_df = records.copy()
- # Only process abstract_inverted_index and primary_location if they exist
- if 'abstract_inverted_index' in records_df.columns:
- records_df['abstract'] = [invert_abstract(t) for t in records_df['abstract_inverted_index']]
- if 'primary_location' in records_df.columns:
- records_df['parsed_publication'] = [get_pub(x) for x in records_df['primary_location']]
- records_df['parsed_publication'] = records_df['parsed_publication'].fillna(' ') # fill missing values with space, only if we have them.
-
  else:
- # Process raw records as before
  records_df = pd.DataFrame(records)
- records_df['abstract'] = [invert_abstract(t) for t in records_df['abstract_inverted_index']]
- records_df['parsed_publication'] = [get_pub(x) for x in records_df['primary_location']]
- records_df['parsed_publication'] = records_df['parsed_publication'].fillna(' ')
-
- # Fill missing values and deduplicate
-
- records_df['abstract'] = records_df['abstract'].fillna(' ')
- records_df['title'] = records_df['title'].fillna(' ')
- records_df = records_df.drop_duplicates(subset=['id']).reset_index(drop=True)
-
  return records_df
 
  def openalex_url_to_filename(url):
- """
- Convert an OpenAlex URL to a filename-safe string with timestamp.
-
- Args:
- url (str): The OpenAlex search URL
-
- Returns:
- str: A filename-safe string with timestamp (without extension)
- """
- from datetime import datetime
- import re
-
- # First parse the URL into query and params
- parsed_url = urlparse(url)
- query_params = parse_qs(parsed_url.query)
-
- # Create parts of the filename
  parts = []
-
- # Handle filters
- if 'filter' in query_params:
- filters = query_params['filter'][0].split(',')
- for f in filters:
- if ':' in f:
- key, value = f.split(':', 1)
- # Replace dots with underscores and clean the value
- key = key.replace('.', '_')
- # Clean the value to be filename-safe and add spaces around words
- clean_value = re.sub(r'[^\w\s-]', '', value)
- # Replace multiple spaces with single space and strip
- clean_value = ' '.join(clean_value.split())
- # Replace spaces with underscores for filename
- clean_value = clean_value.replace(' ', '_')
-
- if key == 'default_search':
- parts.append(f"search_{clean_value}")
- else:
- parts.append(f"{key}_{clean_value}")
-
- # Handle sort parameters
- if 'sort' in query_params:
- sort_params = query_params['sort'][0].split(',')
- for s in sort_params:
- if s.startswith('-'):
- parts.append(f"sort_{s[1:].replace('.', '_')}_desc")
  else:
- parts.append(f"sort_{s.replace('.', '_')}_asc")
-
- # Add timestamp
- timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
-
- # Combine all parts
- filename = '__'.join(parts) if parts else 'openalex_query'
  filename = f"(unknown)__{timestamp}"
-
- # Ensure filename is not too long (max 255 chars is common filesystem limit)
  if len(filename) > 255:
- filename = filename[:251] # leave room for potential extension
-
  return filename
 
 
  def get_records_from_dois(doi_list, block_size=50):
- """
- Download OpenAlex records for a list of DOIs in blocks.
- Args:
- doi_list (list): List of DOIs (strings)
- block_size (int): Number of DOIs to fetch per request (default 50)
- Returns:
- pd.DataFrame: DataFrame of OpenAlex records
- """
- from pyalex import Works
- from tqdm import tqdm
- all_records = []
- for i in tqdm(range(0, len(doi_list), block_size)):
- sublist = doi_list[i:i+block_size]
- doi_str = "|".join(sublist)
- try:
- record_list = Works().filter(doi=doi_str).get(per_page=block_size)
- all_records.extend(record_list)
- except Exception as e:
- print(f"Error fetching DOIs {sublist}: {e}")
- return pd.DataFrame(all_records)
 
  def openalex_url_to_readable_name(url):
-     """
-     Convert an OpenAlex URL to a short, human-readable query description.
-
-     Args:
-         url (str): The OpenAlex search URL
-
-     Returns:
-         str: A short, human-readable description of the query
-
-     Examples:
-         - "Search: 'Kuramoto Model'"
-         - "Search: 'quantum physics', 2020-2023"
-         - "Cites: Popper (1959)"
-         - "From: University of Pittsburgh, 1999-2020"
-         - "By: Einstein, A., 1905-1955"
-     """
-     import re
-
-     # Parse the URL
-     parsed_url = urlparse(url)
-     query_params = parse_qs(parsed_url.query)
-
-     # Initialize description parts
      parts = []
      year_range = None
-
-     # Handle filters
-     if 'filter' in query_params:
-         filters = query_params['filter'][0].split(',')
-
-         for f in filters:
-             if ':' not in f:
-                 continue
-
-             key, value = f.split(':', 1)
-
-             try:
-                 if key == 'default.search':
-                     # Clean up search term (remove quotes if present)
-                     search_term = value.strip('"\'')
-                     parts.append(f"Search: '{search_term}'")
-
-                 elif key == 'title_and_abstract.search':
-                     # Handle title and abstract search specifically
-                     from urllib.parse import unquote_plus
-                     search_term = unquote_plus(value).strip('"\'')
-                     parts.append(f"T&A: '{search_term}'")
-
-                 elif key == 'publication_year':
-                     # Handle year ranges or single years
-                     if '-' in value:
-                         start_year, end_year = value.split('-')
-                         year_range = f"{start_year}-{end_year}"
-                     else:
-                         year_range = value
-
-                 elif key == 'cites':
-                     # Look up the cited work to get author and year
-                     work_id = value
-                     try:
-                         cited_work = Works()[work_id]
-                         if cited_work:
-                             # Get first author's last name
-                             author_name = "Unknown"
-                             year = "Unknown"
-
-                             if cited_work.get('authorships') and len(cited_work['authorships']) > 0:
-                                 first_author = cited_work['authorships'][0]['author']
-                                 if first_author.get('display_name'):
-                                     # Extract last name (assuming "First Last" format)
-                                     name_parts = first_author['display_name'].split()
-                                     author_name = name_parts[-1] if name_parts else first_author['display_name']
-
-                             if cited_work.get('publication_year'):
-                                 year = str(cited_work['publication_year'])
-
-                             parts.append(f"Cites: {author_name} ({year})")
-                         else:
-                             parts.append(f"Cites: Work {work_id}")
-                     except Exception as e:
-                         print(f"Could not fetch cited work {work_id}: {e}")
-                         parts.append(f"Cites: Work {work_id}")
-
-                 elif key == 'authorships.institutions.lineage':
-                     # Look up institution name
-                     inst_id = value
-                     try:
-                         institution = Institutions()[inst_id]
-                         if institution and institution.get('display_name'):
-                             parts.append(f"From: {institution['display_name']}")
-                         else:
-                             parts.append(f"From: Institution {inst_id}")
-                     except Exception as e:
-                         print(f"Could not fetch institution {inst_id}: {e}")
-                         parts.append(f"From: Institution {inst_id}")
-
-                 elif key == 'authorships.author.id':
-                     # Look up author name
-                     author_id = value
-                     try:
-                         author = Authors()[author_id]
-                         if author and author.get('display_name'):
-                             parts.append(f"By: {author['display_name']}")
-                         else:
-                             parts.append(f"By: Author {author_id}")
-                     except Exception as e:
-                         print(f"Could not fetch author {author_id}: {e}")
-                         parts.append(f"By: Author {author_id}")
-
-                 elif key == 'type':
-                     # Handle work types
-                     type_mapping = {
-                         'article': 'Articles',
-                         'book': 'Books',
-                         'book-chapter': 'Book Chapters',
-                         'dissertation': 'Dissertations',
-                         'preprint': 'Preprints'
-                     }
-                     work_type = type_mapping.get(value, value.replace('-', ' ').title())
-                     parts.append(f"Type: {work_type}")
-
-                 elif key == 'host_venue.id':
-                     # Look up venue name
-                     venue_id = value
-                     try:
-                         # For venues, we can use Works to get source info, but let's try a direct approach
-                         # This might need adjustment based on pyalex API structure
-                         parts.append(f"In: Venue {venue_id}")  # Fallback
-                     except Exception as e:
-                         parts.append(f"In: Venue {venue_id}")
-
-                 elif key.startswith('concepts.id'):
-                     # Handle concept filters - these are topic/concept IDs
-                     concept_id = value
-                     parts.append(f"Topic: {concept_id}")  # Could be enhanced with concept lookup
-
                  else:
-                     # Generic handling for other filters
-                     from urllib.parse import unquote_plus
-                     clean_key = key.replace('_', ' ').replace('.', ' ').title()
-                     # Properly decode URL-encoded values
-                     try:
-                         clean_value = unquote_plus(value).replace('_', ' ')
-                     except:
-                         clean_value = value.replace('_', ' ')
-                     parts.append(f"{clean_key}: {clean_value}")
-
-             except Exception as e:
-                 print(f"Error processing filter {f}: {e}")
-                 continue
-
-     # Combine parts into final description
-     if not parts:
-         description = "OpenAlex Query"
-     else:
-         description = ", ".join(parts)
-
-     # Add year range if present
      if year_range:
-         if parts:
-             description += f", {year_range}"
-         else:
-             description = f"Works from {year_range}"
-
-     # Limit length to keep it readable
      if len(description) > 60:
          description = description[:57] + "..."
-
-     return description
 
+ import ast
+ import json
+ import re
+ from datetime import datetime
+
  import numpy as np
  import pandas as pd
+
+ from openalex_client import get_openalex_client, normalize_openalex_url
+
+
+ def invert_abstract(inv_index):
+     """Reconstruct the abstract from OpenAlex's inverted index."""
      if isinstance(inv_index, str):
          try:
+             inv_index = json.loads(inv_index)
          except Exception:
              try:
+                 inv_index = ast.literal_eval(inv_index)
              except Exception:
                  inv_index = None

      if isinstance(inv_index, dict):
+         inv_list = [(word, pos) for word, positions in inv_index.items() for pos in positions]
+         return " ".join(word for word, _ in sorted(inv_list, key=lambda item: item[1]))
+     return " "
+
+
  def get_pub(x):
      """Extract publication name from record."""
+     try:
+         source = x["source"]["display_name"]
+         if source not in ["parsed_publication", "Deleted Journal"]:
              return source
+         return " "
+     except Exception:
+         return " "
+
  def get_field(x):
      """Extract academic field from record."""
      try:
+         field = x["primary_topic"]["subfield"]["display_name"]
          if field is not None:
              return field
+         return np.nan
+     except Exception:
          return np.nan
+
  def process_records_to_df(records):
+     """Convert OpenAlex records to a pandas DataFrame with the expected mapper fields."""
      if isinstance(records, pd.DataFrame):
          records_df = records.copy()
      else:
          records_df = pd.DataFrame(records)
+
+     if "title" not in records_df.columns and "display_name" in records_df.columns:
+         records_df["title"] = records_df["display_name"]
+
+     if "title" not in records_df.columns:
+         records_df["title"] = " "
+
+     if "abstract" not in records_df.columns:
+         if "abstract_inverted_index" in records_df.columns:
+             records_df["abstract"] = [invert_abstract(value) for value in records_df["abstract_inverted_index"]]
+         else:
+             records_df["abstract"] = " "
+
+     if "parsed_publication" not in records_df.columns:
+         if "primary_location" in records_df.columns:
+             records_df["parsed_publication"] = [get_pub(value) for value in records_df["primary_location"]]
+         else:
+             records_df["parsed_publication"] = " "
+
+     records_df["abstract"] = records_df["abstract"].fillna(" ")
+     records_df["parsed_publication"] = records_df["parsed_publication"].fillna(" ")
+     records_df["title"] = records_df["title"].fillna(" ")
+
+     if "id" in records_df.columns:
+         records_df = records_df.drop_duplicates(subset=["id"]).reset_index(drop=True)
+     else:
+         records_df = records_df.reset_index(drop=True)
+
      return records_df
+
+ def _clean_value(value):
+     clean_value = value.strip().strip("\"'")
+     clean_value = re.sub(r"[^\w\s-]", "", clean_value)
+     clean_value = " ".join(clean_value.split())
+     return clean_value
+
+
+ def _strip_quotes(value):
+     return value.strip().strip("\"'")
+
  def openalex_url_to_filename(url):
+     """Convert an OpenAlex URL to a filename-safe string with a timestamp."""
+     query = normalize_openalex_url(url)
      parts = []
+
+     if query.params.get("search"):
+         search_value = _clean_value(query.params["search"]).replace(" ", "_")
+         if search_value:
+             parts.append(f"search_{search_value}")
+
+     for token in query.filter_tokens:
+         clean_key = token.key.replace(".", "_")
+         clean_value = _clean_value(token.value).replace(" ", "_")
+         if clean_value:
+             parts.append(f"{clean_key}_{clean_value}")
+
+     if query.params.get("sort"):
+         for sort_value in query.params["sort"].split(","):
+             if sort_value.startswith("-"):
+                 parts.append(f"sort_{sort_value[1:].replace('.', '_')}_desc")
              else:
+                 parts.append(f"sort_{sort_value.replace('.', '_')}_asc")
+
+     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+     filename = "__".join(parts) if parts else "openalex_query"
      filename = f"{filename}__{timestamp}"
      if len(filename) > 255:
+         filename = filename[:251]
      return filename
 
  def get_records_from_dois(doi_list, block_size=50):
+     """Download OpenAlex records for a list of DOIs in blocks."""
+     client = get_openalex_client()
+     return pd.DataFrame(client.fetch_records_from_dois(doi_list, block_size=block_size))
+
+
+ def _lookup_display_name(entity, entity_id):
+     client = get_openalex_client()
+     try:
+         record = client.get_entity(entity, entity_id, select_fields=("display_name",))
+     except Exception:
+         return None
+     return record.get("display_name")
+
+
+ def _lookup_cited_work(entity_id):
+     client = get_openalex_client()
+     try:
+         cited_work = client.get_entity("works", entity_id, select_fields=("authorships", "publication_year"))
+     except Exception:
+         return None
+     return cited_work
+
  def openalex_url_to_readable_name(url):
+     """Convert an OpenAlex URL to a short, human-readable query description."""
+     query = normalize_openalex_url(url)
      parts = []
      year_range = None
+
+     if query.params.get("search"):
+         parts.append(f"Search: '{_strip_quotes(query.params['search'])}'")
+
+     for token in query.filter_tokens:
+         key = token.key
+         value = token.value
+
+         try:
+             if key == "title_and_abstract.search":
+                 parts.append(f"T&A: '{_strip_quotes(value)}'")
+
+             elif key == "publication_year":
+                 year_range = value
+
+             elif key == "cites":
+                 cited_work = _lookup_cited_work(value)
+                 if cited_work:
+                     author_name = "Unknown"
+                     authorships = cited_work.get("authorships") or []
+                     if authorships:
+                         first_author = authorships[0].get("author") or {}
+                         display_name = first_author.get("display_name")
+                         if display_name:
+                             author_name = display_name.split()[-1]
+                     year = cited_work.get("publication_year") or "Unknown"
+                     parts.append(f"Cites: {author_name} ({year})")
                  else:
+                     parts.append(f"Cites: Work {value}")
+
+             elif key == "authorships.institutions.lineage" and "|" not in value:
+                 institution_name = _lookup_display_name("institutions", value)
+                 parts.append(f"From: {institution_name or f'Institution {value}'}")
+
+             elif key == "authorships.author.id" and "|" not in value:
+                 author_name = _lookup_display_name("authors", value)
+                 parts.append(f"By: {author_name or f'Author {value}'}")
+
+             elif key == "primary_location.source.id" and "|" not in value:
+                 source_name = _lookup_display_name("sources", value)
+                 parts.append(f"In: {source_name or f'Source {value}'}")
+
+             elif key == "topics.id" and "|" not in value:
+                 topic_name = _lookup_display_name("topics", value)
+                 parts.append(f"Topic: {topic_name or value}")
+
+             elif key == "concepts.id" and "|" not in value:
+                 concept_name = _lookup_display_name("concepts", value)
+                 parts.append(f"Concept: {concept_name or value}")
+
+             elif key == "type":
+                 type_mapping = {
+                     "article": "Articles",
+                     "book": "Books",
+                     "book-chapter": "Book Chapters",
+                     "dissertation": "Dissertations",
+                     "preprint": "Preprints",
+                 }
+                 parts.append(f"Type: {type_mapping.get(value, value.replace('-', ' ').title())}")
+
+             else:
+                 clean_key = key.replace("_", " ").replace(".", " ").title()
+                 clean_value = value.replace("_", " ")
+                 parts.append(f"{clean_key}: {clean_value}")
+
+         except Exception:
+             continue
+
+     description = "OpenAlex Query" if not parts else ", ".join(parts)
      if year_range:
+         description = f"{description}, {year_range}" if parts else f"Works from {year_range}"
      if len(description) > 60:
          description = description[:57] + "..."
+     return description
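OpenAlex delivers abstracts as an inverted index (word → list of positions); the new `invert_abstract` above flattens that mapping into (word, position) pairs and sorts by position. A minimal standalone demonstration of the same reconstruction logic (the name `rebuild_abstract` is illustrative):

```python
import json

def rebuild_abstract(inv_index):
    """Flatten (word, position) pairs and rejoin them in position order."""
    if isinstance(inv_index, str):
        inv_index = json.loads(inv_index)  # the index may arrive JSON-encoded
    if not isinstance(inv_index, dict):
        return " "
    pairs = [(word, pos) for word, positions in inv_index.items() for pos in positions]
    return " ".join(word for word, _ in sorted(pairs, key=lambda item: item[1]))

idx = {"the": [0, 3], "maps": [1], "show": [2], "data": [4]}
print(rebuild_abstract(idx))  # the maps show the data
```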
 
requirements.txt CHANGED
@@ -4,7 +4,6 @@ uvicorn
  fastapi
  numpy
  requests
- pyalex
  compress-pickle
  transformers
  adapters
 
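For reference, both the removed parser and the new `filter_tokens` approach rest on the same tokenization of OpenAlex's `filter` query parameter: comma-separated tokens, each split on the first colon only. A standalone sketch of that step (the name `parse_filter` is illustrative; in the migrated code the tokenization lives behind `openalex_client.normalize_openalex_url`):

```python
from urllib.parse import parse_qs, urlparse

def parse_filter(url):
    """Split the OpenAlex 'filter' query parameter into (key, value) tokens."""
    params = parse_qs(urlparse(url).query)
    if "filter" not in params:
        return []
    # Tokens are comma-separated; split each on the first ':' only,
    # since values may themselves contain colons.
    return [tuple(t.split(":", 1)) for t in params["filter"][0].split(",") if ":" in t]

url = "https://api.openalex.org/works?filter=publication_year:2020-2023,cites:W2741809807"
print(parse_filter(url))
# [('publication_year', '2020-2023'), ('cites', 'W2741809807')]
```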