From Unstructured Text to Structured Q&A: LangExtract, Omniscope and Insight Explorer

28 May From Unstructured Text to Structured Q&A: LangExtract, Omniscope and Insight Explorer

Posted at 18:50h in AI, Data Analytics, Data Processing, Data Visualisation, Uncategorized by Antonio Poggi

A common pattern in AI document workflows is to connect a large language model directly to a PDF, web page, report or long text document, then ask questions against the raw content.

That works in some cases, but it is not always the best analytical architecture.

For many business and technical use cases, the more useful approach is not simply to “chat with a document”. It is to convert the document into structured, queryable data first.

This article describes an end-to-end workflow built in Omniscope using:

a local language model running on a workstation,
Google LangExtract for structured information extraction,
an Omniscope workflow to orchestrate the extraction,
relational output tables for facts, attributes, metrics and entities,
and Insight Explorer for natural-language Q&A over structured data.

The example document was a Ferrari Luce press kit. The original source was a PDF-style document containing technical descriptions, design information, battery and powertrain details, performance figures, collaborators, quotes and specifications.

So, rather than asking a model questions about the document directly, the goal was to transform the document into data.

The resulting pipeline was:

unstructured PDF/text
→ LangExtract
→ structured facts, attributes, metrics and entities
→ Omniscope tables
→ Insight Explorer Q&A over structured data

This is best understood as a form of semantic ETL.

The source is unstructured text.
The extraction step turns meaning into rows.
The analytical layer works with tables.

Why structured extraction instead of direct document retrieval?

The standard AI approach to long documents is usually retrieval-augmented generation:

document
→ text chunks
→ embeddings
→ retrieval
→ LLM answer

This approach is useful, but it has limitations when the objective is analysis.

A retrieved text chunk is still text. It may contain the answer, but it is not yet a structured record. The model still has to interpret the text at question time. If a user asks ten questions, similar interpretation work may happen ten times. If facts are spread across sections, retrieval can miss context. If numeric values need to be compared, grouped or charted, the raw text must still be parsed.

For analytical workflows, a better pattern is often:

extract once
query many times

The document is processed into a structured semantic layer. The Q&A layer then works over that semantic layer rather than over the raw document.

This creates several advantages:

extracted facts can be inspected and audited;
facts can be filtered, joined, pivoted and visualised;
numeric metrics can become actual numbers;
entities can be deduplicated;
repeated questions reuse the same extracted data;
the model answering questions does not need to reread the full document;
the extraction prompt and schema can be improved independently of the Q&A layer.

This is the key architectural distinction.

The LLM is used as part of a data transformation pipeline.

The source document: Ferrari Luce press kit

The example used for this workflow was a Ferrari Luce press kit.

After text extraction from the original document, the content was approximately 90,000 characters. The text included sections such as:

Design
Exterior
Interior
Audio System
Aerodynamics
Thermal Management
Vehicle Dynamics
Powertrain
Battery
Inverter and Charging System
Chassis and Body
Connectivity
Technical Specifications

The document contained many different kinds of information:

vehicle name and positioning;
launch context;
design direction;
collaborators;
interior and exterior features;
battery capacity;
charging information;
powertrain architecture;
performance figures;
dimensions and weight;
audio and connectivity systems;
maintenance and service information;
quotes from named people;
supplier and technology references.

This is exactly the kind of document where direct Q&A can be useful, but structured extraction is more powerful.

For example, the document contains both narrative text and measurable specifications. A user may want to ask:

What battery and charging specifications are stated?
Which performance metrics are mentioned?
Which facts relate to aerodynamics?
Which collaborators are named?
What design features are associated with LoveFrom?
Which extracted values use kW, kWh, km/h or kg?
What facts appear in the Powertrain section?

These questions are easier to answer reliably if the document has already been converted into structured facts and metrics.

Running the extraction with a local model

The extraction model was run locally, using Gemma 4 E2B in Q4_K_M quantisation through Ollama.

The purpose of using a local model was to test a practical local-AI workflow:

local model
→ LangExtract
→ Omniscope Custom Block
→ structured output tables
→ Insight Explorer

Running locally is not automatically better than using a cloud model. Larger hosted models may produce higher-quality extractions and handle longer contexts more easily. However, local models are useful when the objective is to keep data on the machine, reduce external dependencies, or test self-contained AI workflows.

Local serving also introduces practical constraints.

With a document around 90,000 characters, the extraction cannot simply be sent to a small local model as one large prompt. The context window, output token budget, server configuration, GPU memory and chunk size all matter.

In practice, local extraction required conservative settings:

max_workers = 1
extraction_passes = 1
small chunk size
temperature = 0
limited output tokens
grounded extractions only

For Ollama, the model parameters were kept simple:

{
  "temperature": 0,
  "num_ctx": 4096,
  "num_predict": 384
}

The exact settings depend on the local machine, model and document size, but the principle is the same: for a local model, stability and predictable extraction are more important than parallel throughput.
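As a rough illustration of what these constraints mean for a 90,000-character document, the sketch below splits text into small overlapping chunks and builds a request body in the shape Ollama's /api/generate endpoint accepts. The function names, chunk size, overlap and model tag are assumptions made for the example, not the values used in this workflow, and LangExtract performs its own chunking internally.

```python
import json

# Ollama options taken from the settings shown above.
OLLAMA_OPTIONS = {"temperature": 0, "num_ctx": 4096, "num_predict": 384}


def chunk_text(text: str, chunk_chars: int = 1500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks with a small overlap."""
    if chunk_chars <= overlap:
        raise ValueError("chunk_chars must be larger than overlap")
    step = chunk_chars - overlap
    return [text[i:i + chunk_chars] for i in range(0, len(text), step)]


def build_request(model: str, prompt: str) -> str:
    """Build the JSON body for a non-streaming Ollama /api/generate call."""
    body = {"model": model, "prompt": prompt, "stream": False, "options": OLLAMA_OPTIONS}
    return json.dumps(body)


# Stand-in for the extracted press kit text (about 90,000 characters).
document = "x" * 90_000
chunks = chunk_text(document)
print(len(chunks), "chunks, longest is", max(len(c) for c in chunks), "characters")

# "your-model-tag" is a placeholder for whatever tag the local model has in Ollama.
request_body = build_request("your-model-tag", "Extract facts from: " + chunks[0])
```

With one worker, each chunk becomes one small sequential request, which is slower than parallel extraction but keeps memory use and context length predictable.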

What LangExtract adds

LangExtract is used as the structured extraction layer.

The role of LangExtract is to guide a language model to extract specific classes of information from text and return them as labelled extractions.

Instead of asking:

Summarise this Ferrari press kit.

the extraction task is defined more explicitly:

Extract facts about vehicle name, manufacturer, launch context,
product positioning, vehicle layout, design features, collaborators,
technical systems, powertrain specifications, battery and charging,
performance metrics, dimensions, quotes, maintenance and connectivity.

The extraction prompt also defines what should not be extracted:

Do not infer facts.
Do not invent missing values.
Do not output placeholders such as "not stated" or "unknown".
Ignore legal boilerplate, registered office details, VAT numbers,
addresses, copyright text, menus and navigation text.
Use exact source spans.

This matters. Without strict instructions, the model may produce rows that look structured but are analytically useless, such as:

not explicitly stated
unknown
not applicable

or it may classify footer text and company registration details as product facts.
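Prompt instructions reduce these rows but may not remove them entirely, so a defensive filter after extraction is a reasonable safeguard. The following is a minimal sketch; the placeholder list and the function names are assumptions for illustration, not part of LangExtract or of the block described here.

```python
# Values that look like data but carry no information (illustrative list).
PLACEHOLDERS = {
    "not explicitly stated",
    "not stated",
    "unknown",
    "not applicable",
    "n/a",
}


def is_placeholder(text: str) -> bool:
    """Return True for empty values and known placeholder phrases."""
    # Trim whitespace and trailing punctuation, then compare case-insensitively.
    cleaned = " ".join(text.strip().strip(" .:;!-").lower().split())
    return cleaned == "" or cleaned in PLACEHOLDERS


def drop_placeholders(rows: list[dict]) -> list[dict]:
    """Keep only rows whose fact_text carries real content."""
    return [row for row in rows if not is_placeholder(row["fact_text"])]


rows = [
    {"extraction_class": "battery_charging_spec", "fact_text": "122 kWh battery"},
    {"extraction_class": "launch_context", "fact_text": "Not explicitly stated."},
    {"extraction_class": "collaborator", "fact_text": "unknown"},
    {"extraction_class": "performance_metric", "fact_text": "2.5"},
]
clean_rows = drop_placeholders(rows)
print([row["fact_text"] for row in clean_rows])
```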

The extraction schema used for the Ferrari Luce document included classes such as:

vehicle_name
manufacturer
launch_context
product_positioning
vehicle_layout
design_feature
collaborator
technical_system
powertrain_spec
battery_charging_spec
performance_metric
dimension_weight_spec
quote
maintenance_or_service
connectivity_feature

This schema makes the output more predictable. It also makes downstream Omniscope analysis more meaningful, because each extracted fact belongs to a defined category.
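A hedged sketch of how such a schema might be passed to LangExtract with a local Ollama model is shown below. The prompt wording, the one-sentence example text, the helper names and the model tag are illustrative assumptions, and the actual Custom Block may be organised differently. The langextract import is deferred so the configuration can be inspected without the library installed.

```python
PROMPT = (
    "Extract facts about the vehicle using exact source spans. "
    "Do not infer facts, do not invent missing values, and do not output "
    "placeholders such as 'not stated' or 'unknown'."
)


def build_extract_kwargs(text, examples, model_id, model_url="http://localhost:11434"):
    """Collect conservative lx.extract arguments for a small local model."""
    return {
        "text_or_documents": text,
        "prompt_description": PROMPT,
        "examples": examples,
        "model_id": model_id,
        "model_url": model_url,
        "fence_output": False,
        "use_schema_constraints": False,
        "max_workers": 1,         # one request at a time
        "extraction_passes": 1,   # a single pass over the text
        "max_char_buffer": 1000,  # small chunks; tune for the local machine
    }


def run_extraction(text, model_id):
    """Run LangExtract and return (class, text) pairs. Needs langextract and Ollama."""
    import langextract as lx  # deferred import: pip install langextract

    # One few-shot example; the sentence is invented for illustration.
    examples = [
        lx.data.ExampleData(
            text="The car is fitted with a 122 kWh battery.",
            extractions=[
                lx.data.Extraction(
                    extraction_class="battery_charging_spec",
                    extraction_text="122 kWh battery",
                    attributes={"capacity": "122 kWh"},
                )
            ],
        )
    ]
    result = lx.extract(**build_extract_kwargs(text, examples, model_id))
    return [(e.extraction_class, e.extraction_text) for e in result.extractions]


# Inspect the configuration without calling a model.
kwargs = build_extract_kwargs("sample text", examples=[], model_id="your-model-tag")
print(sorted(kwargs))
```

In a real run, the examples would cover several of the classes listed above, since LangExtract uses them to infer the expected labels and attribute names.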

Omniscope as the orchestration layer

The extraction process was implemented as an Omniscope Python Custom Block.

The block takes a normal Omniscope input table and allows the user to select the field containing the document text. The block then:

reads the input table;
collects the selected text field;
builds the document input for LangExtract;
sends the text, prompt and examples to the configured model;
receives LangExtract extractions;
converts the extraction result into relational tables;
returns those tables into the Omniscope workflow.
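The steps above can be sketched in plain Python. The version below works on a list of row dictionaries and takes the extractor as a function argument, so it runs outside Omniscope; the function names and the stand-in extractor are assumptions for illustration. The commented lines at the end show how the input and output calls are believed to look in an Omniscope Python Custom Block and should be checked against the block template.

```python
def collect_text(rows, text_field, separator="\n\n"):
    """Join the non-empty values of the chosen text field into one document."""
    parts = []
    for row in rows:
        value = row.get(text_field)
        if value is not None and str(value).strip():
            parts.append(str(value).strip())
    return separator.join(parts)


def run_block(rows, text_field, extract):
    """Read rows, build the document, call the extractor, return output tables."""
    document = collect_text(rows, text_field)
    extractions = extract(document)  # e.g. a wrapper around LangExtract
    facts = [
        {"fact_id": i + 1, "extraction_class": cls, "fact_text": text}
        for i, (cls, text) in enumerate(extractions)
    ]
    return {"Facts": facts}


def fake_extract(document):
    """Stand-in extractor so the sketch runs without a model."""
    return [("vehicle_name", "Ferrari Luce")] if "Ferrari Luce" in document else []


input_rows = [{"text": "Ferrari Luce press kit"}, {"text": None}, {"text": "  "}]
tables = run_block(input_rows, "text", fake_extract)
print(tables["Facts"])

# Inside Omniscope, the rows would come from the block input (assumed API):
#   from omniscope.api import OmniscopeApi
#   omniscope_api = OmniscopeApi()
#   input_data = omniscope_api.read_input_records(input_number=0)
#   ...
#   omniscope_api.write_output_records(facts_frame, output_number=0)
#   omniscope_api.close()
```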

This is where Omniscope becomes the orchestrator.

The language model is only one component in the workflow. Omniscope handles the broader pipeline:

input data
→ preparation
→ custom Python execution
→ AI extraction
→ table modelling
→ transformations
→ visual analytics
→ Insight Explorer Q&A
→ reporting

That is an important distinction. The workflow is not an AI chat box bolted onto a document. It is a data pipeline in which AI performs a semantic transformation step.

From one wide table to a relational output model

The first version of the block returned one flattened table.

That worked technically, but it was not a good analytical structure.

The output contained many dynamic attribute columns such as:

attr_type
attr_category
attr_powertrain_type
attr_feature_type
attr_positioning
...

This created a wide and sparse table. Most columns were empty for most rows. That kind of output is difficult to explore, query and visualise.

The better model was relational.

The revised custom block returns five output tables:

Facts
Attributes
Metrics
Entities
Run Metadata

This is a much better fit for Omniscope and for analytical Q&A.
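The move from the wide table to the relational model amounts to unpivoting the attr_ columns into key-value rows. A minimal sketch, assuming the wide rows are dictionaries and that empty attributes are None or empty strings; the attribute values shown are made up for the example.

```python
def split_wide_row(row: dict, prefix: str = "attr_"):
    """Split one wide row into a fact record and its non-empty attribute rows."""
    fact = {k: v for k, v in row.items() if not k.startswith(prefix)}
    attributes = [
        {
            "fact_id": row["fact_id"],
            "attribute_name": key[len(prefix):],
            "attribute_value": value,
        }
        for key, value in row.items()
        if key.startswith(prefix) and value not in (None, "")
    ]
    return fact, attributes


# Two sparse wide rows; most attr_ columns are empty.
wide_rows = [
    {"fact_id": 1, "fact_text": "four electric engines",
     "attr_type": None, "attr_powertrain_type": "electric", "attr_positioning": ""},
    {"fact_id": 2, "fact_text": "four doors and five seats",
     "attr_type": None, "attr_feature_type": "layout"},
]

facts, attributes = [], []
for wide_row in wide_rows:
    fact, attrs = split_wide_row(wide_row)
    facts.append(fact)
    attributes.extend(attrs)

print(facts)
print(attributes)
```

The Facts rows keep a stable set of columns, while the Attributes rows grow downwards rather than sideways as new attribute names appear.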

The Facts table

The Facts table is the central output.

One row represents one extracted fact.

Typical fields include:

fact_id
document_id
source_row_index
source_row_id
section
extraction_class
fact_type
fact_text
start_char
end_char
grounded
alignment_status
source_context
attributes_json

Example rows might represent facts such as:

vehicle_name → Ferrari Luce
powertrain_spec → four electric engines
battery_charging_spec → 122 kWh battery
performance_metric → 0-100 km/h in 2.5 seconds
collaborator → LoveFrom
design_feature → four doors and five seats

The Facts table is the main table for filtering, searching and grouping.

Because each fact includes section and source context, it is also auditable. A user can inspect not only the extracted value, but where it came from in the source document.
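To make the grounding fields concrete, the sketch below locates an extracted span in the source text and fills start_char, end_char, grounded and source_context. LangExtract performs its own alignment, so this is only an illustration of what the fields mean; the helper name, the exact-match strategy and the sample sentence are assumptions.

```python
def ground_fact(source: str, fact_text: str, context_chars: int = 20) -> dict:
    """Locate fact_text in source and return offsets plus surrounding context."""
    start = source.find(fact_text)
    if start == -1:
        # The span does not appear verbatim, so the fact cannot be audited.
        return {"fact_text": fact_text, "grounded": False,
                "start_char": None, "end_char": None, "source_context": None}
    end = start + len(fact_text)
    context = source[max(0, start - context_chars):end + context_chars]
    return {"fact_text": fact_text, "grounded": True,
            "start_char": start, "end_char": end, "source_context": context}


# Invented sentence standing in for a passage of the press kit.
source = "The Luce is powered by a 122 kWh battery mounted in the floor."

grounded_row = ground_fact(source, "122 kWh battery")
ungrounded_row = ground_fact(source, "900 kWh battery")
print(grounded_row)
print(ungrounded_row)
```

Filtering on grounded then keeps only the rows whose offsets can be checked against the original text.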

The Attributes table

The Attributes table stores dynamic attributes as key-value rows.

Instead of creating one column per possible attribute, the structure is:

fact_id
attribute_name
attribute_value
attribute_value_number
attribute_unit
extraction_class
fact_text
section

This is more flexible and much easier to query.

For example:

fact_id: 104
attribute_name: capacity
attribute_value: 122 kWh
attribute_value_number: 122
attribute_unit: kWh

or:

fact_id: 118
attribute_name: charging_power
attribute_value: 350 kW
attribute_value_number: 350
attribute_unit: kW

This table is particularly useful for Insight Explorer, because natural-language questions can operate over normalised attribute names and values rather than a large sparse table.

A user can ask questions such as:

Which facts have a battery capacity attribute?
Show charging-related attributes.
Which extracted attributes contain kWh?
What attributes are linked to powertrain specifications?
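The attribute_value_number and attribute_unit fields can be derived from attribute_value with a small parser. A minimal sketch, assuming values follow a simple "number unit" pattern; the regular expression and function name are illustrative, and anything more complex (ranges, thousands separators) is deliberately left as text.

```python
import re

# A number, optionally followed by a unit that starts with a letter, % or °.
VALUE_RE = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*([A-Za-z%°][A-Za-z%°/ ]*?)?\s*$")


def parse_value(attribute_value: str):
    """Return (number, unit) for simple 'number unit' strings, else (None, None)."""
    match = VALUE_RE.match(attribute_value)
    if match is None:
        return None, None
    return float(match.group(1)), match.group(2)


for value in ["122 kWh", "350 kW", "310 km/h", "25%", "LoveFrom", "0-100 km/h"]:
    print(value, "->", parse_value(value))
```

Values that do not match stay usable through attribute_value, so nothing is lost when the numeric fields are empty.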

The Metrics table

The Metrics table holds the numeric values and units extracted from facts and attributes.

Technical documents often contain important numeric information:

122 kWh
350 kW
310 km/h
2260 kg
800 V
2.5 seconds
530 km
25%

The Metrics table makes those values queryable as numbers.

Its structure is:

metric_id
fact_id
metric_name
metric_value
metric_unit
metric_span
metric_source
extraction_class
fact_text
section

This means Omniscope can treat metrics as analytical data, not just text.

For a technical press kit, this is important. Values can be filtered by unit, grouped by section, charted, compared or joined to other datasets.
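One way to populate such a table is to scan each fact for number-plus-unit spans. The sketch below assumes a fixed list of units and a simple regular expression; both are illustrative choices rather than the block's actual logic.

```python
import re

UNITS = ["kWh", "kW", "km/h", "km", "kg", "V", "seconds", "%"]

# Longest units first so that "kWh" wins over "kW" and "km/h" over "km".
_unit_pattern = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
METRIC_RE = re.compile(r"(?<![\d.])(\d+(?:\.\d+)?)\s*(" + _unit_pattern + r")(?![A-Za-z/])")


def extract_metrics(facts: list[dict]) -> list[dict]:
    """Return one metric row per number-and-unit span found in each fact_text."""
    metrics = []
    for fact in facts:
        for match in METRIC_RE.finditer(fact["fact_text"]):
            metrics.append({
                "metric_id": len(metrics) + 1,
                "fact_id": fact["fact_id"],
                "metric_value": float(match.group(1)),
                "metric_unit": match.group(2),
                "metric_span": match.group(0),
            })
    return metrics


facts = [
    {"fact_id": 104, "fact_text": "122 kWh battery"},
    {"fact_id": 7, "fact_text": "0-100 km/h in 2.5 seconds"},
]
metrics = extract_metrics(facts)
for metric in metrics:
    print(metric)
```

Note that "0-100 km/h in 2.5 seconds" yields 100 km/h and 2.5 seconds as separate rows. Ranges and compound statements like this need a naming rule (metric_name) before they are charted, which is one reason to keep fact_text alongside each metric.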

The Entities table

The Entities table deduplicates important names.

For the Ferrari Luce document, this may include entities such as:

Ferrari
Ferrari Luce
LoveFrom
Sir Jony Ive
Marc Newson
Samsung Display
Corning

The table structure is:

entity_id
entity_type
entity_name
first_fact_id
first_section
mention_count
fact_ids

This makes it easier to explore named collaborators, manufacturers, technologies, suppliers and people mentioned in the document.
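Deduplication can be as simple as grouping mentions on a normalised name. The sketch below assumes mentions arrive as (fact_id, section, entity_type, entity_name) tuples and that matching on case and whitespace is enough; the fact IDs, sections and entity types are made up for the example.

```python
def build_entities(mentions):
    """Group mentions by type and normalised name into one row per entity."""
    entities = {}
    for fact_id, section, entity_type, name in mentions:
        clean_name = " ".join(name.split())
        key = (entity_type, clean_name.casefold())
        if key not in entities:
            # First sighting: remember where the entity first appeared.
            entities[key] = {
                "entity_id": len(entities) + 1,
                "entity_type": entity_type,
                "entity_name": clean_name,
                "first_fact_id": fact_id,
                "first_section": section,
                "mention_count": 0,
                "fact_ids": [],
            }
        entity = entities[key]
        entity["mention_count"] += 1
        if fact_id not in entity["fact_ids"]:
            entity["fact_ids"].append(fact_id)
    return list(entities.values())


mentions = [
    (12, "Design", "collaborator", "LoveFrom"),
    (40, "Interior", "collaborator", "lovefrom "),
    (41, "Interior", "supplier", "Corning"),
]
entities = build_entities(mentions)
for entity in entities:
    print(entity)
```

Real documents usually need more than this, for example treating "Sir Jony Ive" and "Jony Ive" as the same person, which would require an alias list or a fuzzy match.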

The Run Metadata table

The Run Metadata table records the extraction context:

model_provider
model_id
input_rows_with_text
combined_text_chars
facts_rows
attribute_rows
metric_rows
entity_rows
grounded_only
prompt_description

This is useful because AI extraction is a data-generating process. The output should be traceable to the prompt, model and configuration used to create it.
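A single-row metadata record can be assembled from values the block already has at the end of a run. A minimal sketch, assuming the output tables are lists of rows and that combined_text_chars counts the characters of the non-empty input texts; the function name and sample values are illustrative.

```python
def build_run_metadata(model_provider, model_id, input_texts, tables,
                       grounded_only, prompt_description):
    """Summarise one extraction run as a single metadata row."""
    texts = [t for t in input_texts if t and t.strip()]
    return {
        "model_provider": model_provider,
        "model_id": model_id,
        "input_rows_with_text": len(texts),
        "combined_text_chars": sum(len(t) for t in texts),  # separators not counted
        "facts_rows": len(tables["Facts"]),
        "attribute_rows": len(tables["Attributes"]),
        "metric_rows": len(tables["Metrics"]),
        "entity_rows": len(tables["Entities"]),
        "grounded_only": grounded_only,
        "prompt_description": prompt_description,
    }


# Tiny stand-in tables; only their row counts matter here.
tables = {"Facts": [{}, {}, {}], "Attributes": [{}, {}], "Metrics": [{}], "Entities": []}
run_metadata = build_run_metadata(
    model_provider="ollama",
    model_id="your-model-tag",  # placeholder for the local model tag
    input_texts=["Design section text", "", None, "Battery section text"],
    tables=tables,
    grounded_only=True,
    prompt_description="Extract facts using exact source spans.",
)
print(run_metadata)
```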

Insight Explorer over structured data

Once LangExtract has converted the document into relational tables, the next step is analysis.

The extracted tables can be connected to Omniscope’s Insight Explorer view. In this workflow, the Attributes table was connected to Insight Explorer and queried using GPT-5.4 nano.

The important architectural point is that the Q&A model is not retrieving from the raw Ferrari press kit.

It is querying structured data produced by the extraction step.

The workflow becomes:

raw document
→ semantic extraction
→ facts and attributes
→ Insight Explorer Q&A

rather than:

raw document
→ retrieval
→ answer from text chunks

This difference is significant.

The Q&A layer can now work over fields such as:

section
extraction_class
fact_text
attribute_name
attribute_value
attribute_value_number
attribute_unit
metric_value
metric_unit
entity_name

That makes the questions more analytical:

What battery and charging specifications were extracted?
Which performance metrics are stated?
Which collaborators are mentioned?
Which facts relate to aerodynamics?
Which attributes contain kW or kWh?
Which design features are associated with LoveFrom?
What extracted facts are in the Powertrain section?

The model answering the question is no longer responsible for parsing the entire original document. It is operating over a curated semantic layer.
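To show why this matters, the sketch below loads a few Attributes rows into an in-memory SQLite table and answers "Which attributes contain kW or kWh?" with an ordinary query. This is not how Insight Explorer works internally; it only illustrates that, once the data is structured, such a question reduces to a filter over typed fields. The first two rows reuse the example attributes shown earlier; the third row, the section values and the table layout are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE attributes (
           fact_id INTEGER,
           attribute_name TEXT,
           attribute_value TEXT,
           attribute_value_number REAL,
           attribute_unit TEXT,
           section TEXT
       )"""
)
conn.executemany(
    "INSERT INTO attributes VALUES (?, ?, ?, ?, ?, ?)",
    [
        (104, "capacity", "122 kWh", 122, "kWh", "Battery"),
        (118, "charging_power", "350 kW", 350, "kW", "Inverter and Charging System"),
        (131, "top_speed", "310 km/h", 310, "km/h", "Technical Specifications"),
    ],
)

# "Which attributes contain kW or kWh?" expressed as a filter on the unit field.
power_rows = conn.execute(
    """SELECT fact_id, attribute_name, attribute_value
       FROM attributes
       WHERE attribute_unit IN ('kW', 'kWh')
       ORDER BY fact_id"""
).fetchall()
print(power_rows)

# Numeric fields support comparisons that raw text would not.
large_values = conn.execute(
    "SELECT attribute_name FROM attributes WHERE attribute_value_number > 300 ORDER BY fact_id"
).fetchall()
print(large_values)
```

The same extracted rows serve every later question, which is the "extract once, query many times" pattern in practice.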

Why this pattern matters

This workflow demonstrates a useful shift in AI-assisted analytics.

The objective is not to generate text from text.

The objective is to turn unstructured information into structured data that can participate in a proper analytical workflow.

That means:

documents become facts
facts become tables
tables become queryable
queryable data becomes explorable

This is where Omniscope provides the surrounding structure.

LangExtract performs semantic extraction.
The local model provides the language understanding.
The Python Custom Block integrates the process into Omniscope.
Omniscope materialises the result as tables.
Insight Explorer provides natural-language access to the structured output.

The LLM is not the whole solution. It is one step in a larger workflow.

That is the important idea.

Lessons from the implementation

Several practical lessons came out of the workflow.

1. Extraction schema matters

A vague prompt produces vague data.

The extraction classes must be designed around the analytical objective. For the Ferrari document, classes such as battery_charging_spec, powertrain_spec, performance_metric and dimension_weight_spec were much more useful than broad generic labels.

2. Grounding matters

Rows that cannot be mapped back to the source text are much less useful. The workflow therefore favours grounded extractions with source offsets and source context.

3. Local models need constrained workloads

For local models, chunk size, output token limit and worker count are important. More parallelism is not always better. Smaller, predictable requests are usually more reliable.

4. Output shape matters as much as extraction quality

The first wide table was not analytically ideal. The relational model was a major improvement.

Facts, Attributes, Metrics and Entities are easier to query and visualise than a single sparse table with hundreds of dynamic columns.

5. Q&A works better over a semantic layer

Insight Explorer becomes more useful when connected to structured extracted facts rather than raw document text. The model can answer over fields, values, categories and metrics instead of trying to rediscover the relevant text every time.

The final workflow

The final workflow can be summarised as:

Ferrari Luce PDF / press kit
        ↓
text extraction
        ↓
Omniscope input table
        ↓
LangExtract Python Custom Block
        ↓
Facts table
Attributes table
Metrics table
Entities table
Run Metadata table
        ↓
Omniscope transformations and report
        ↓
Insight Explorer
        ↓
natural-language Q&A over structured data

The result is a semantic ETL pipeline for unstructured data.

The document is not treated as something to search repeatedly. It is transformed into a structured analytical layer.

From that point, Omniscope can do what it does best: orchestrate data, transform it, visualise it, and make it explorable.

Conclusion

The interesting part is that an LLM can be used inside Omniscope to convert a document into structured, reusable, queryable data.

That creates a stronger architecture for AI-assisted analytics:

unstructured data
→ semantic extraction
→ relational tables
→ visual analytics
→ natural-language Q&A

For business users, analysts and technical teams, this is a more durable pattern than one-off document chat.

The output can be inspected.
The facts can be filtered.
The metrics can be visualised.
The entities can be explored.
The Q&A can operate over structured evidence.

That is the value of combining LangExtract, local models, Omniscope Custom Blocks and Insight Explorer.

It turns unstructured text into a data product.
